Getting from Word to XML
OUCS, July 2001
| Steven J. DeRose, Ph.D. | |
| Brown University Scholarly Technology Group | |
| Steven_DeRose@brown.edu | |
| http://www.stg.brown.edu/~sjd |
| "Save as HTML" | |
| Ok to post to Web for browsing | |
| But take a closer look... |
| No XML declaration, no DOCTYPE declaration | |
| Unquoted attributes | |
| Closing unopen tags; crossing tags | |
| Character sets mislabeled | |
| Font codes everywhere | |
| Countless BR, align, empty, blank containers | |
| No style information survives at all | |
| Better: | ||
| Maps headings to H1/H2/etc | ||
| Seems not to write crossing elements | ||
| Not better: | ||
| Still no XML/DOCTYPE declarations | ||
| Unquoted attributes | ||
| Crammed with metadata, format details, etc. -- in comments | ||
| Massive useless/redundant/empty/blank elements | ||
| All style information still dies | ||
| Rumored Microsoft "Save As XML" | |||
| No doubt will write well-formed documents | |||
| Said to avoid absurd amounts of style information | |||
| Unknown how much it does, or if it does any style creation | |||
| Third-party tools | |||
| Most I've examined don't do much: | |||
| Such as making everything a <table>, <p>, or <span> | |||
| Word doesn't know what it's dealing with | ||
| Doesn't notice consistency | ||
| So it inserts 1000 font tags instead of one enclosing one | ||
| If all table cells are consistent, it styles every one in detail | ||
| So doesn't kill empty paragraphs | ||
| So doesn't factor consistencies out | ||
| Has no sense of separating concerns | ||
| A bold non-breaking space treated as critical… | ||
| No notion that format and content may be separated | ||
| Files become uselessly huge | ||
| Files become unreadably cluttered | ||
| Formatting cannot be centrally changed (styles) | ||
| And formatting for paper and Web should be different... | ||
| Formatting will have quirks | ||
| For example, non-breaking spaces used to align text, which fails on a different screen size | ||
| Try to figure out and express some user intent | ||
| If user provided styles, use them | ||
| If many things look the same, name them alike | ||
| Be smart about typical whitespace conventions | ||
| Make absolutely conformant XML | ||
| Make good XML, not just any XML | ||
| Separate formatting so it can be changed | ||
| Do nothing that requires magic to interpret | ||
| (like Word's <!--<![elseif[… --> stuff) | ||
| Special Word objects | ||
| Footnotes, index entries, links, highlighting, revisions | ||
| Character sets | ||
| Explicit structure from user | ||
| Styles (para and char) | ||
| Implicit structure from user | ||
| Matching formatting | ||
| Use of whitespace | ||
| Find implicit consistency and factor it out | ||
| Make styles for matching paragraphs | ||
| Normalize whitespace by size | ||
| Don't discard explicit consistency | ||
| Keep user styles | ||
| Map Word objects directly | ||
| Avoid any style info outside of stylesheet | ||
| Maximize editability/reusability | ||
| About 4,000 lines of Visual Basic | ||
| Only Word has access to all the information | ||
| Why not RTF? | ||
| Parsing is gross, and frequently changes | ||
| User would have to manually convert each file to RTF unless we used VB anyway | ||
| Code would have to be multi-platform | ||
| Why not Word API? | ||
| Licensing | ||
| Harder to do multi-platform release | ||
| Find "exception" paragraphs | |||
| Paragraphs that don't match their named style | |||
| In the worst documents, everything is "Normal" style plus whatever formatting (exceptions) the author applied | |||
| When you find one: | |||
| Make a style for it | |||
| Apply that style to all other exceptions that match it | |||
| (this way we don't trash real styled paragraphs) | |||
| Same for character styles | |||
| Makes documents cleaner, more maintainable | |||
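The grouping step above can be sketched in a few lines. This is a minimal Python illustration on plain dictionaries, not the tool's Visual Basic. The function name `assign_exception_styles`, the dictionary layout, and the generated names `Auto1`, `Auto2`, ... are all assumptions made for the example.

```python
def assign_exception_styles(paragraphs, styles):
    """Give matching exception paragraphs a shared, newly created style.

    paragraphs: list of dicts with "style" (a style name) and "format"
                (the formatting properties actually applied).
    styles:     dict mapping style name -> formatting properties.
    Both are hypothetical stand-ins for Word's object model.
    """
    created = {}  # format signature -> generated style name
    for para in paragraphs:
        if para["format"] == styles[para["style"]]:
            continue  # matches its named style: leave real styled paragraphs alone
        signature = tuple(sorted(para["format"].items()))
        if signature not in created:
            name = "Auto%d" % (len(created) + 1)  # hypothetical naming scheme
            created[signature] = name
            styles[name] = dict(para["format"])
        para["style"] = created[signature]
    return created


styles = {"Normal": {"bold": False, "size": 12}}
paragraphs = [
    {"style": "Normal", "format": {"bold": True, "size": 18}},
    {"style": "Normal", "format": {"bold": False, "size": 12}},
    {"style": "Normal", "format": {"bold": True, "size": 18}},
]
assign_exception_styles(paragraphs, styles)
print([p["style"] for p in paragraphs])  # ['Auto1', 'Normal', 'Auto1']
```

The sketch does not check whether a generated name collides with a style the user already defined; a real tool would need to.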
| To check if a paragraph has exceptions | |||
| Compare properties of style and paragraph format | |||
| 18 paragraph formatting properties | |||
| 23 font properties | |||
| Ignore kerning, hyphenation, etc., as they are unlikely to be the only thing distinguishing element types | |||
| Ignore border line types as too tedious to bear | |||
| Ignore color; the needed information does not seem to be accessible | |||
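The property-by-property check above might look roughly like this in Python. It is a sketch only: the property names and the `IGNORED` set are illustrative assumptions, not Word's actual property list.

```python
# Properties skipped during comparison (names are hypothetical).
IGNORED = {"kerning", "hyphenation", "border_line_type", "color"}


def find_exceptions(paragraph_format, style_format, ignored=IGNORED):
    """Return {property: (style_value, paragraph_value)} for each difference."""
    diffs = {}
    for prop, style_value in style_format.items():
        if prop in ignored:
            continue
        # A property missing from the paragraph is treated as inherited.
        para_value = paragraph_format.get(prop, style_value)
        if para_value != style_value:
            diffs[prop] = (style_value, para_value)
    return diffs
```

A paragraph counts as an exception when the returned dictionary is non-empty.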
| Scans a directory tree | ||
| Finds all Word files | ||
| Applies something to each one: | ||
| Save As RTF | ||
| Stylify | ||
| XMLify | ||
| Many have suggested integrating this with one of the countless Word macro viruses: Convert the world's Word files to XML en masse | ||
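The batch driver described above amounts to a directory walk plus a callback. Here is a minimal Python sketch, assuming Word files are recognized by a `.doc` extension; `find_word_files` and `apply_to_tree` are hypothetical names, and the real tool does this from Visual Basic inside Word.

```python
import os


def find_word_files(root, extensions=(".doc",)):
    """Yield the path of every file under root whose name ends in a Word extension."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            if name.lower().endswith(extensions):
                yield os.path.join(dirpath, name)


def apply_to_tree(root, action):
    """Call action(path) for each Word file under root; return (path, result) pairs."""
    return [(path, action(path)) for path in find_word_files(root)]
```

Here `action` stands in for whichever operation is selected (Save As RTF, Stylify, or XMLify).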
| Runs Stylify to clean up | |
| Converts document to XML | |
| Creates CSS stylesheet |
| Styled paragraphs without "exceptions" | |||
| Convert style name to well-formed XML name | |||
| Tricky since Word doesn't have "isXMLnameChar(c)" | |||
| Instances of character styles | |||
| Easy via global change | |||
| Character format exceptions | |||
| For every character: | |||
| compare 23 font properties to style | |||
| Wayyyyy too slow | |||
| Sneaky tricks to use Word "find" feature | |||
| Keep track of every distinct char format | |||
| So you can re-use styles after created | |||
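Converting a style name to an XML name, mentioned at the top of this list, can be sketched as follows in Python. The two character tests only approximate the XML 1.0 Name rules, since `str.isalpha` and `str.isalnum` accept a slightly different set of characters. Replacing bad characters with `_` is an assumed policy, not necessarily what the tool does.

```python
def is_xml_name_start_char(c):
    # Approximation of the XML NameStartChar rule; the colon is
    # deliberately excluded to stay clear of namespaces.
    return c == "_" or c.isalpha()


def is_xml_name_char(c):
    # Approximation of the XML NameChar rule.
    return c in "_-." or c.isalnum()


def style_name_to_xml_name(style_name):
    """Turn a Word style name such as "Heading 1" into a usable element name."""
    chars = [c if is_xml_name_char(c) else "_" for c in style_name.strip()]
    name = "".join(chars) or "_"
    if not is_xml_name_start_char(name[0]):
        name = "_" + name  # names may not start with a digit, "-" or "."
    return name


print(style_name_to_xml_name("Heading 1"))  # Heading_1
print(style_name_to_xml_name("1st Level"))  # _1st_Level
```

Two different style names can map to the same XML name under this scheme, so a real converter would also need to detect collisions.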
| Map Word language/country codes to xml:lang | |||
| Map characters to Unicode entity references | |||
| A pain for Symbol, dingbats, etc | |||
| A double pain for PC and Mac specific characters | |||
| ™ § ¶ | |||
| A pain for control-ish characters | |||
| En, em, thin dash and space; soft and hard hyphen; … | |||
| Map Word color names to CSS names | |||
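The character-mapping problems above can be illustrated with a short Python sketch. It assumes the output uses numeric character references, and that PC text is Windows-1252 and Mac text is Mac Roman. The `SYMBOL_FONT` table is a tiny illustrative fragment, and all function names are hypothetical.

```python
# Partial Symbol-font table: the same byte means a Greek letter there.
SYMBOL_FONT = {"a": "\u03b1", "b": "\u03b2", "g": "\u03b3"}


def decode_platform_byte(byte_value, platform):
    """Decode one byte using the PC (cp1252) or Mac (mac_roman) character set."""
    codec = "mac_roman" if platform == "mac" else "cp1252"
    return bytes([byte_value]).decode(codec)


def to_char_refs(text):
    """Replace markup-significant and non-ASCII characters with numeric references."""
    out = []
    for ch in text:
        if ch in "<>&" or ord(ch) > 126:
            out.append("&#x%X;" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)


# The trademark sign sits at a different byte on each platform.
print(decode_platform_byte(0x99, "pc") == decode_platform_byte(0xAA, "mac"))  # True
print(to_char_refs("\u2122"))  # &#x2122;
```

This is why the conversion has to know both the font and the originating platform before it can pick a Unicode code point.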
| Easy to find: Style "Heading 1" &c | ||
| But: | ||
| Best to add DIV containers | ||
| May have to open/close multiple levels (H1 skip to H5) | ||
| Word also has "Sections" | ||
| Used to insert page/col breaks | ||
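Adding DIV containers around headings, including the case where levels are skipped, can be sketched like this in Python. The input is a simplified list of `(level, title)` pairs and titles are not escaped. The function name and the output layout are assumptions for illustration.

```python
def nest_headings(headings):
    """Wrap a flat list of (level, title) headings in properly nested <div>s."""
    lines = []
    depth = 0  # number of currently open <div> containers
    for level, title in headings:
        while depth >= level:  # close containers at the same or a deeper level
            depth -= 1
            lines.append("  " * depth + "</div>")
        while depth < level:   # open one container per level, even skipped ones
            lines.append("  " * depth + "<div>")
            depth += 1
        lines.append("  " * depth + "<h%d>%s</h%d>" % (level, title, level))
    while depth > 0:           # close whatever is still open at the end
        depth -= 1
        lines.append("  " * depth + "</div>")
    return lines


print("\n".join(nest_headings([(1, "A"), (3, "B"), (2, "C")])))
```

Jumping from level 1 to level 3 opens two containers at once, and returning to level 2 closes two before opening one.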
| Must add list containers early, or Word loses track of list boundaries/numbering as you process | |
| Discontiguous numbered lists seem hopeless | |
| Manual vs. auto-numbering | |
| Bullet types (CSS 1 and 2 give little control) |
| Using HTML table markup (can change w/ XSLT) | |||
| Word heading rows ~ HTML heading cells | |||
| Word doesn't know about spans, so | |||
| Make array of distinct vertical boundaries (~ <col>) | |||
| Search array for which boundary we're at | |||
| If it's the next one, no span; else count the span | |||
| Borders are too horrible | |||
| Bottom, Horizontal, Left, Right, Top, Vertical, DiagonalDown, DiagonalUp | |||
| Each has: ArtStyle ArtWidth ColorIndex LineStyle LineWidth Visible | |||
| Enums like: None Single Dot DashSmallGap DashLargeGap DashDot DashDotDot Double Triple ThinThickSmallGap... ThinThickThinLargeGap SingleWavy DoubleWavy DashDotStroked Emboss3D Engrave3D | |||
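The span-counting steps listed earlier on this slide (collect the distinct vertical boundaries, then see how many each cell crosses) can be sketched in Python. Rows are modeled as lists of integer cell widths, which is an assumption; real Word measurements may need a tolerance when comparing edges.

```python
def cell_edges(widths):
    """Right-hand edge position of each cell in one row."""
    edges, x = [], 0
    for w in widths:
        x += w
        edges.append(x)
    return edges


def column_spans(rows):
    """Return (boundaries, spans): the column grid and a colspan for every cell."""
    boundaries = sorted({edge for widths in rows for edge in cell_edges(widths)})
    spans = []
    for widths in rows:
        row_spans, previous = [], -1  # index of the boundary where the last cell ended
        for edge in cell_edges(widths):
            index = boundaries.index(edge)      # which boundary are we at?
            row_spans.append(index - previous)  # 1 means no span
            previous = index
        spans.append(row_spans)
    return boundaries, spans


print(column_spans([[100, 100, 100], [200, 100], [100, 200]]))
# ([100, 200, 300], [[1, 1, 1], [2, 1], [1, 2]])
```

Each entry in `boundaries` corresponds to one `<col>`, and each span value would become a `colspan` attribute when it is greater than 1.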
| Footnotes/endnotes and comments | |||
| Stored in separate Word "stories" | |||
| Normal processing doesn't reach them; they must be moved | |||
| Hyperlinks | |||
| Highlighting, Bookmarks | |||
| Simple; color/name/author go on attributes | |||
| Index Entries | |||
| Parse primary/secondary/see also syntaxes | |||
| Eliminate internal formatting, move info to attributes | |||
| Revision tracking | |||
| Use HTML del/ins markup, add author/date/etc | |||
| Referenced pictures | ||
| Easily turn into link to file | ||
| Embedded pictures | ||
| Don't see how to get enough information | ||
| Authors use whitespace to format | ||
| Empty paragraphs (including whitespace-only) | ||
| Add to following "space-before" value, and drop | ||
| Must calculate total actual height | ||
| Paragraph-initial spaces/tabs | ||
| Find where "real" content starts | ||
| Set first-indent value to that | ||
| (not yet): avoid tagging invisible formatting: | ||
| Like this red space. | ||
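The first two whitespace rules above (fold empty paragraphs into the next paragraph's space-before, and turn leading spaces and tabs into a first-line indent) can be sketched in Python. Paragraphs are plain dictionaries with measurements in points. The field names and the `SPACE_WIDTH` and `TAB_WIDTH` values are assumptions; a real converter would measure them from the font and tab stops.

```python
SPACE_WIDTH = 3   # assumed width of one space, in points
TAB_WIDTH = 36    # assumed tab stop distance, in points


def fold_whitespace(paragraphs):
    """Drop empty paragraphs and leading whitespace, keeping their visual effect."""
    result = []
    pending = 0  # height of dropped empty paragraphs, waiting for the next real one
    for para in paragraphs:
        text = para["text"]
        if text.strip() == "":
            pending += para["space_before"] + para["height"]
            continue
        stripped = text.lstrip(" \t")
        lead = text[:len(text) - len(stripped)]
        indent = sum(TAB_WIDTH if c == "\t" else SPACE_WIDTH for c in lead)
        result.append({
            "text": stripped,
            "height": para["height"],
            "space_before": para["space_before"] + pending,
            "first_indent": para["first_indent"] + indent,
        })
        pending = 0
    return result
```

Empty paragraphs at the very end of the document are simply dropped in this sketch, since there is no following paragraph to carry their height.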
| Author might format all the text of a paragraph, but not the ¶ mark itself | ||
| This suggests their formatting is para-level anyway | ||
| Must split up any char formatting that crosses paragraph boundaries (non-WF) | ||
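Splitting character formatting at paragraph boundaries keeps the output well-formed, because no inline element then straddles two paragraph elements. A minimal Python sketch, assuming runs and paragraphs are given as character offsets with exclusive ends (the function name and the offset model are hypothetical):

```python
def split_run(start, end, paragraph_bounds):
    """Split the character range [start, end) at paragraph boundaries.

    paragraph_bounds: list of (para_start, para_end) offsets, end exclusive.
    Returns one (start, end) piece per paragraph the run touches.
    """
    pieces = []
    for p_start, p_end in paragraph_bounds:
        lo, hi = max(start, p_start), min(end, p_end)
        if lo < hi:
            pieces.append((lo, hi))
    return pieces


print(split_run(5, 25, [(0, 10), (10, 20), (20, 30)]))
# [(5, 10), (10, 20), (20, 25)]
```

A piece that covers a whole paragraph, like `(10, 20)` here, is a candidate for promotion to paragraph-level formatting.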
| Word "based on" model doesn't fit with CSS/XSL | ||
| So, define CSS styles as if "based on" Normal | ||
| Lots more tedious comparisons: | ||
| If myStyle.Alignment = normalStyle.Alignment Then…. | ||
| Lots more tedious mapping: | ||
| If pf.Alignment = wdAlignParagraphLeft Then a = a & " text-align:left;" | ||
| ElseIf pf.Alignment = wdAlignParagraphRight Then a = a & " text-align:right;"… | ||
| Can write stylesheet in header, or external file | ||
| Can write Stylesheet Attachment PI | ||
| <?xml-stylesheet …?> | ||
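Defining each CSS rule as a difference from Normal can be sketched in Python as follows. Only three properties are handled, and the dictionary keys, the `ALIGN_TO_CSS` table, and the function names are illustrative assumptions, not the tool's Visual Basic. `normal` is assumed to define all three properties.

```python
ALIGN_TO_CSS = {"left": "left", "right": "right",
                "center": "center", "justify": "justify"}


def css_rule(element_name, style, normal):
    """Emit a CSS rule containing only the properties that differ from Normal."""
    effective = dict(normal)
    effective.update(style)  # anything the style leaves unset comes from Normal
    decls = []
    if effective["alignment"] != normal["alignment"]:
        decls.append("text-align:%s;" % ALIGN_TO_CSS[effective["alignment"]])
    if effective["bold"] != normal["bold"]:
        decls.append("font-weight:%s;" % ("bold" if effective["bold"] else "normal"))
    if effective["size"] != normal["size"]:
        decls.append("font-size:%gpt;" % effective["size"])
    return "%s { %s }" % (element_name, " ".join(decls))


def stylesheet_pi(href):
    """Processing instruction that attaches an external CSS file to the XML."""
    return '<?xml-stylesheet type="text/css" href="%s"?>' % href


normal = {"alignment": "left", "bold": False, "size": 12}
print(css_rule("Heading_1", {"bold": True, "size": 16}, normal))
# Heading_1 { font-weight:bold; font-size:16pt; }
```

A table-driven mapping like `ALIGN_TO_CSS` replaces the long If/ElseIf chains, at least for properties whose Word values map one-to-one onto CSS keywords.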
| Non-Latin-1 style-name characters | |
| Mapping Symbol, PC, Mac chars to Unicode | |
| Horizontal white-space normalization | |
| Embedded (not referenced) pictures | |
| Borders, border styles, unnamed colors | |
| Tables laid out via lots of fixed-pitch spaces | |
| Tabs for anything but first-line indents | |
| Absolutely-positioned "textframes" |
| Stylify | ||
| Cleans up many sloppy Word files nicely | ||
| Will not catch everything | ||
| XMLify | ||
| Converts almost everything in a Word file | ||
| Does a nice job with styles and CSS | ||
| Is dog slow | ||
| How much does it cost? | ||
| Nothing, and you can help improve the source, too | ||