ExpressionEngine CMS

This is an archived forum and the content is probably no longer relevant, but is provided here for posterity.

Converting .doc files into clean XHTML files

December 17, 2008 10:44am

  • #1 / Dec 17, 2008 10:44am

    andreagam

    91 posts

    Not CI related, admittedly, but every now and then I come across this issue, which many of you may know:
    When a customer sends you tons of .doc pages to be turned into HTML pages, how do you manage to extract clean semantic markup from the files?

    I tried exporting to XHTML from Word, from Dreamweaver and from every text editor I could open on my desktop, but I always get those pesky Mso tags and useless classes all over the text.

    Does anyone have a one-stop solution to help me?
    I’m on Mac.

    TIA,

    Andrea G

  • #2 / Dec 17, 2008 12:45pm

    elvix

    81 posts

    TextEdit can usually do this pretty easily, if the Word doc isn’t too heavily formatted.

    http://www.macosxhints.com/article.php?story=20060828093624972

    The other benefit is that TextEdit’s engine is accessible from the command line (for batch scripts). I have used a command like the following to convert RTF to HTML in the past. Not sure if it could be adapted to doc, but maybe.

    find . -name '*.rtf' -print0 | xargs -0 textutil -convert html -strip -excludedelements '(head,span,style,font)'

    I have this saved in my .profile as an alias.

    Search Google for “textutil” for more.
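
On the question of adapting the RTF command to .doc: as far as I know, textutil lists doc among the formats it can read, so the same options should carry over. The sketch below assumes macOS (textutil is not available elsewhere) and only handles a single folder rather than recursing like the find command above; `convert_docs` is a made-up helper name, not part of textutil.

```shell
#!/bin/sh
# Sketch: convert every .doc file in one folder to HTML with textutil (macOS only).
# convert_docs is a hypothetical helper name, not part of textutil.
convert_docs() {
    dir="${1:-.}"
    for f in "$dir"/*.doc; do
        # Skip the unexpanded pattern when the folder has no .doc files.
        [ -e "$f" ] || continue
        # Same options as the RTF command above. Quoting "$f" keeps file
        # names with spaces intact.
        textutil -convert html -strip \
            -excludedelements '(head,span,style,font)' "$f"
    done
}

# Usage (the path is only an example):
#   convert_docs ~/Documents/client-pages
```

By default textutil should write each result next to its source file with an .html extension, so it is worth trying this on a copy of the folder first.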

  • #3 / Dec 18, 2008 6:21am

    andreagam

    91 posts

    Thanks a lot, elvix.

    I didn’t realize that TextEdit could do such a good job.
    The resulting XHTML is close to perfect from what I’ve seen so far, except
    for a ‘p’ tag placed on every line (I guess no app can distinguish a good ‘p’ from an unnecessary one)...

    Unfortunately I’m a complete noob at the CLI, so I can’t make use of your shortcut.
    Would it be possible to convert it to an AppleScript?
    Has anybody managed to do that?

    Thanks again,

    Andrea G
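
This is not an AppleScript, but one possible way to avoid typing the command each time is a double-clickable shell script: on macOS, an executable file whose name ends in .command opens in Terminal and runs when double-clicked in Finder. The sketch below assumes the script is saved inside the folder that holds the .doc files; the file name `convert-docs.command` and the helper `convert_folder` are made up for illustration.

```shell
#!/bin/sh
# Sketch of a double-clickable wrapper. Save as e.g. convert-docs.command
# (hypothetical file name) inside the folder holding the .doc files, then
# run `chmod +x convert-docs.command` once to make it executable.

# convert_folder is a hypothetical helper: converts each .doc in one folder
# and reports how many conversions succeeded.
convert_folder() {
    folder="$1"
    count=0
    for f in "$folder"/*.doc; do
        # Skip the unexpanded pattern when there are no .doc files.
        [ -e "$f" ] || continue
        textutil -convert html -strip \
            -excludedelements '(head,span,style,font)' "$f" && count=$((count + 1))
    done
    echo "Converted $count file(s) in $folder"
}

# Run against the folder this script lives in, but only where textutil exists.
if command -v textutil >/dev/null 2>&1; then
    convert_folder "$(dirname "$0")"
else
    echo "textutil not found (it ships with macOS only)" >&2
fi
```

The one-time chmod step still needs Terminal, but after that the conversion can be rerun from Finder whenever new files arrive.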
