Getting from Word to XML
OUCS, July 2001
Steven J. DeRose, Ph.D.
Brown University Scholarly Technology Group
Steven_DeRose@brown.edu
http://www.stg.brown.edu/~sjd

"Save as HTML"
OK to post to the Web for browsing
But take a closer look...

No XML declaration, no DOCTYPE declaration
Unquoted attributes
Closing unopened tags; crossing tags
Character sets mislabeled
Font codes everywhere
Countless BR, align, empty, blank containers
No style information survives at all

Better:
Maps headings to H1/H2/etc.
Seems not to write crossed-over elements
Not better:
Still no XML/DOCTYPE declarations
Unquoted attributes
Crammed with metadata, format details, etc. -- in comments
Massive useless/redundant/empty/blank elements
All style information still dies

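Several of the defects listed above (unquoted attributes, crossing tags) are well-formedness violations, so one quick sanity check on any converter's output is to feed it to an XML parser. The following is a minimal Python sketch, not part of the tools described in this talk; the sample strings are made up to imitate the problems named above.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup: str) -> bool:
    """Return True if the string parses as one well-formed XML document."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Hypothetical samples imitating the defects listed above.
unquoted = "<p align=center>Hello</p>"              # unquoted attribute value
crossed = "<p><b>bold <i>both</b> italic</i></p>"   # crossing tags
clean = '<p align="center">Hello</p>'
```

Only `clean` passes; the other two are rejected by the parser before any question of "good" XML even comes up.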
Rumored Microsoft "Save As XML"
No doubt will write well-formed documents
Said to avoid absurd amounts of style information
Unknown how much it does, or if it does any style creation
Third-party tools
Most I've examined don't do much:
Such as making everything a <table>, <p>, or <span>

Word doesn't know what it's dealing with
Doesn't notice consistency
So puts 1000 font tags in instead of one big one
Even if all table cells are consistent, it styles every one in detail
So doesn't kill empty paragraphs
So doesn't factor consistencies out
Has no sense of separating concerns
A bold non-breaking space treated as critical…
No notion that format and content may be separated

Files become uselessly huge
Files become unreadably cluttered
Formatting cannot be centrally changed (styles)
And formatting for paper and Web should be different...
Formatting will have quirks
For example, non-breaking spaces used for alignment go wrong on a different screen size

Try to figure out and express some user intent
If user provided styles, use them
If many things look the same, name them alike
Be smart about typical whitespace conventions
Make absolutely conformant XML
Make good XML, not just any XML
Separate formatting so it can be changed
Do nothing that requires magic to interpret
(like Word's <!--<![elseif[… --> stuff)

Special Word objects
Footnotes, index entries, links, highlighting, revisions
Character sets
Explicit structure from user
Styles (para and char)
Implicit structure from user
Matching formatting
Use of whitespace

Find implicit consistency and factor it out
Make styles for matching paragraphs
Normalize whitespace by size
Don't discard explicit consistency
Keep user styles
Map Word objects directly
Avoid any style info outside of stylesheet
Maximize editability/reusability

About 4,000 lines of Visual Basic
Only Word has access to all the information
Why not RTF?
Parsing is gross, and frequently changes
User would have to manually convert each file to RTF unless we used VB anyway
Code would have to be multi-platform
Why not Word API?
Licensing
Harder to do multi-platform release

Find "exception" paragraphs
Paragraphs that don't match their named style
In the worst documents, everything is "Normal" style with whatever format (exceptions) the author made
When you find one:
Make a style for it
Apply that style to all other exceptions that match it
(this way we don't trash real styled paragraphs)
Same for character styles
Makes documents cleaner, more maintainable

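The exception-finding steps above can be sketched in a few lines. This is an illustrative Python model, not the Visual Basic the tool is written in: the `Paragraph` class, the tuple-of-pairs format representation, and the generated style names such as `Normal-x1` are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Paragraph:
    text: str
    style: str   # named style, e.g. "Normal"
    fmt: tuple   # actual formatting as (property, value) pairs

def stylify(paragraphs, style_formats):
    """Give each group of identically formatted exception paragraphs one new style.

    style_formats maps style names to their defined formatting; new styles are added to it.
    """
    created = {}  # (base style, exceptional format) -> generated style name
    for para in paragraphs:
        if para.fmt == style_formats[para.style]:
            continue  # matches its named style: leave real styled paragraphs alone
        key = (para.style, para.fmt)
        if key not in created:
            created[key] = f"{para.style}-x{len(created) + 1}"
            style_formats[created[key]] = para.fmt
        para.style = created[key]
    return created
```

Two "Normal" paragraphs that were both hand-formatted as centered bold end up sharing one generated style, while an untouched "Normal" paragraph keeps its style.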
To check if a paragraph has exceptions
Compare properties of style and paragraph format
18 paragraph formatting properties
23 font properties
Ignore kerning, hyphenation, etc., as they are unlikely to be the only thing distinguishing element types
Ignore border line types as too tedious to bear
Ignore color, can't seem to get needed info

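The comparison itself amounts to a property-by-property diff with a skip list. A minimal Python sketch follows; the property names are hypothetical stand-ins, not Word's actual property names, and the real tool compares 18 paragraph and 23 font properties.

```python
# Assumption: illustrative property names, chosen to mirror the ignore list above.
IGNORED = {"kerning", "hyphenation", "border_line_type", "color"}

def exceptions(style_props: dict, para_props: dict) -> dict:
    """Return the paragraph properties whose value differs from the style's value."""
    return {
        name: value
        for name, value in para_props.items()
        if name not in IGNORED and style_props.get(name) != value
    }
```

An empty result means the paragraph matches its named style and needs no new style.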
Scans a directory tree
Finds all Word files
Applies something to each one:
Save As RTF
Stylify
XMLify
Many have suggested integrating this with one of the countless Word macro viruses: Convert the world's Word files to XML en masse

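The directory scan described above has a simple shape: walk the tree, filter by file type, apply an action. A Python sketch under stated assumptions: the extension set is a guess at what counts as a Word file, and `action` stands in for Save As RTF, Stylify, or XMLify, none of which are implemented here.

```python
import os

WORD_EXTENSIONS = {".doc", ".dot"}  # assumption: extensions that mark Word files

def find_word_files(root):
    """Yield the path of every Word file under root, directory by directory."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a predictable order
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in WORD_EXTENSIONS:
                yield os.path.join(dirpath, name)

def apply_to_tree(root, action):
    """Call action(path) on each Word file and collect the results."""
    return [action(path) for path in find_word_files(root)]
```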
Runs Stylify to clean up
Converts document to XML
Creates CSS stylesheet

Styled paragraphs without "exceptions"
Convert style name to well-formed XML name
Tricky since Word doesn't have "isXMLnameChar(c)"
Instances of character styles
Easy via global change

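One conservative way to turn a style name into an XML name is to keep only a safe ASCII subset and repair the first character. The Python sketch below is an assumption about how such a mapping could look, not the tool's actual rule; it is stricter than XML requires (XML allows many non-ASCII name characters), and it does not guard against two style names collapsing to the same result.

```python
import re

def to_xml_name(style_name: str) -> str:
    """Turn a Word style name into a conservative, ASCII-only XML name."""
    # Replace each run of characters outside the safe subset with one underscore.
    name = re.sub(r"[^A-Za-z0-9_.-]+", "_", style_name.strip())
    # An XML name must not start with a digit, '.', or '-'.
    if not name or not re.match(r"[A-Za-z_]", name):
        name = "_" + name
    return name
```

For example, `to_xml_name("Heading 1")` gives `Heading_1`, and a name starting with a digit gets a leading underscore.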
Character format exceptions
For every character:
compare 23 font properties to style
Wayyyyy too slow
Sneaky tricks to use Word "find" feature
Keep track of every distinct char format
So you can reuse styles once they are created

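The bookkeeping half of this step (collapse characters into runs, then hand out one style per distinct format) can be sketched independently of Word. This Python example does not reproduce the Find-based speedup mentioned above; the format tuples and the generated names `c1`, `c2` are assumptions for illustration.

```python
from itertools import groupby

def format_runs(chars):
    """Collapse (character, format) pairs into (text, format) runs of identical format."""
    return [
        ("".join(ch for ch, _ in group), fmt)
        for fmt, group in groupby(chars, key=lambda pair: pair[1])
    ]

class StyleRegistry:
    """Hand out one generated style name per distinct character format."""

    def __init__(self):
        self.names = {}

    def name_for(self, fmt):
        if fmt not in self.names:
            self.names[fmt] = f"c{len(self.names) + 1}"
        return self.names[fmt]
```

A format seen a second time gets the name it was given the first time, so the stylesheet grows with the number of distinct formats, not the number of runs.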
Map Word language/country codes to xml:lang
Map characters to Unicode entity references
A pain for Symbol, dingbats, etc.
A double pain for PC and Mac specific characters
™ § ¶
A pain for control-ish characters
En, em, thin dash and space; soft and hard hyphen; …
Map Word color names to CSS names

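The PC/Mac problem is that the same character sits at different byte values on each platform. A small Python sketch using the standard `cp1252` and `mac_roman` codecs shows the idea; treating the input as raw platform bytes is an assumption of the example (Word may hand characters over differently), and Symbol or dingbat fonts would need their own tables, which are not shown. Markup characters such as `&` and `<` are not escaped here.

```python
def to_char_refs(raw: bytes, platform: str) -> str:
    """Decode platform-specific bytes and write non-ASCII characters as numeric references."""
    codec = {"pc": "cp1252", "mac": "mac_roman"}[platform]
    text = raw.decode(codec)
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)
```

The trademark sign is byte 0x99 on the PC and 0xAA on the Mac, yet both come out as the same reference, `&#8482;`.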
Easy to find: Style "Heading 1" &c
But:
Best to add DIV containers
May have to open/close multiple levels (H1 skip to H5)
Word also has "Sections"
Used to insert page/col breaks

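Opening and closing several container levels at once is the fiddly part of adding DIVs. A minimal Python sketch of one way to do it follows; the `class="levelN"` attribute is a made-up convention for the example, and heading text is assumed to be already escaped.

```python
def nest_headings(headings):
    """Turn (level, title) pairs into nested <div> markup, closing and opening levels as needed."""
    out, depth = [], 0
    for level, title in headings:
        while depth >= level:   # close divs at this level or deeper
            out.append("</div>")
            depth -= 1
        while depth < level:    # open every missing level, e.g. H1 straight to H3
            depth += 1
            out.append(f'<div class="level{depth}">')
        out.append(f"<h{level}>{title}</h{level}>")
    out.extend("</div>" for _ in range(depth))
    return out
```

A document whose headings jump from level 1 to level 3 gets an intermediate level-2 container, so a later level-2 heading closes two containers and opens one.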
Must add list containers early, or Word loses track of list boundaries/numbering as you process
Discontiguous numbered lists seem hopeless
Manual vs. auto-numbering
Bullet types (CSS 1 and 2 offer little control)

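For contiguous lists, adding containers is a grouping problem. The Python sketch below assumes each paragraph has already been classified as a bullet item, a numbered item, or neither; that classification and the HTML-style tag names are assumptions of the example. It makes no attempt at the discontiguous case: an interrupted numbered list simply becomes two containers.

```python
from itertools import groupby

def wrap_lists(paragraphs):
    """Wrap each run of consecutive list paragraphs in one <ul> or <ol> container.

    Each paragraph is (kind, text) where kind is None, "bullet", or "number".
    """
    tags = {"bullet": "ul", "number": "ol"}
    out = []
    for kind, group in groupby(paragraphs, key=lambda p: p[0]):
        items = [text for _, text in group]
        if kind is None:
            out.extend(f"<p>{text}</p>" for text in items)
        else:
            body = "".join(f"<li>{text}</li>" for text in items)
            out.append(f"<{tags[kind]}>{body}</{tags[kind]}>")
    return out
```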
Using HTML table markup (can change w/ XSLT)
Word heading rows ≈ HTML heading cells
Word doesn't know about spans, so
Make array of distinct vertical boundaries (≈ <col>)
Search array for which boundary we're at
If it's the next one, no span; else count the span
Borders are too horrible
Bottom, Horizontal, Left, Right, Top, Vertical, DiagonalDown, DiagonalUp
Each has: ArtStyle ArtWidth ColorIndex LineStyle LineWidth Visible
Enums like: None Single Dot DashSmallGap DashLargeGap DashDot DashDotDot Double Triple ThinThickSmallGap... ThinThickThinLargeGap SingleWavy DoubleWavy DashDotStroked Emboss3D Engrave3D

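The span calculation described above (collect distinct vertical boundaries, then see how many each cell crosses) can be sketched as follows. This Python version assumes every cell is given as a pair of left and right edge positions that match exactly across rows; real measurements might need rounding first.

```python
def column_spans(rows):
    """Compute a colspan for every cell from its left and right edge positions.

    rows is a list of rows; each row is a list of (left, right) cell edges.
    """
    # Array of distinct vertical boundaries across the whole table.
    boundaries = sorted({edge for row in rows for cell in row for edge in cell})
    index = {edge: i for i, edge in enumerate(boundaries)}
    # A cell ending at the very next boundary spans 1 column; otherwise count the gap.
    return [[index[right] - index[left] for left, right in row] for row in rows]
```

A row of three equal cells above a row whose first cell covers the first two columns yields spans of 1, 1, 1 and 2, 1.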
Footnotes/endnotes and comments
Stored in separate Word "stories"
Normal processing doesn't reach them; must move
Hyperlinks
Highlighting, Bookmarks
Simple; color/name/author go on attributes
Index Entries
Parse primary/secondary/see also syntaxes
Eliminate internal formatting, move info to attributes
Revision tracking
Use HTML del/ins markup, add author/date/etc.

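As an illustration of the index-entry step, the Python sketch below parses a simplified form of Word's XE field code, on the assumption that the entry text is quoted, a colon separates primary from secondary term, and a `\t` switch carries the cross-reference text. It handles only one sublevel and ignores other switches and any internal formatting.

```python
import re

def parse_index_entry(field_code: str) -> dict:
    """Split an XE field code into primary/secondary terms and an optional cross-reference."""
    entry = re.search(r'XE\s+"([^"]*)"', field_code)
    if entry is None:
        raise ValueError("not an XE field")
    see = re.search(r'\\t\s+"([^"]*)"', field_code)
    primary, _, secondary = entry.group(1).partition(":")
    return {
        "primary": primary.strip(),
        "secondary": secondary.strip() or None,
        "see": see.group(1) if see else None,
    }
```

The three resulting values are what would move onto attributes of the index element in the XML output.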
Referenced pictures
Easily turn into link to file
Embedded pictures
Don't see how to get enough information

Authors use whitespace to format
Empty paragraphs (including whitespace-only)
Add to following "space-before" value, and drop
Must calculate total actual height
Paragraph-initial spaces/tabs
Find where "real" content starts
Set first-indent value to that
(not yet): avoid tagging invisible formatting:
Like this red space.

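The empty-paragraph rule above can be sketched as a single pass that accumulates dropped height and hands it to the next real paragraph. In this Python example the dictionary keys and the use of points as the unit are assumptions; computing the actual height of an empty paragraph in Word is the hard part and is taken as given.

```python
def fold_empty_paragraphs(paragraphs):
    """Drop whitespace-only paragraphs, adding their total height to the next paragraph.

    Each paragraph is a dict with "text", "height", and "space_before" (all in points).
    """
    out, pending = [], 0.0
    for para in paragraphs:
        if not para["text"].strip():
            # An empty paragraph contributes its own gap plus its line height.
            pending += para["space_before"] + para["height"]
            continue
        out.append({**para, "space_before": para["space_before"] + pending})
        pending = 0.0
    return out
```

Empty paragraphs at the very end of the document have nothing to attach to and are simply dropped.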
Author might format all the text of a paragraph, but not the ¶ mark itself
This suggests their formatting is para-level anyway
Must split up any char formatting that crosses paragraph boundaries (otherwise not well-formed)

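Splitting a formatting span at paragraph boundaries is an interval-clipping exercise. A minimal Python sketch, assuming character offsets with exclusive ends (an assumption of the example, not a statement about how Word reports ranges):

```python
def split_run(start, end, paragraph_bounds):
    """Split the character span [start, end) into pieces that each stay inside one paragraph.

    paragraph_bounds is a list of (para_start, para_end) offsets, end exclusive.
    """
    pieces = []
    for para_start, para_end in paragraph_bounds:
        lo, hi = max(start, para_start), min(end, para_end)
        if lo < hi:
            pieces.append((lo, hi))
    return pieces
```

Each piece can then be tagged separately inside its own paragraph element, so no inline element crosses a paragraph tag.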
Word "based on" model doesn't fit with CSS/XSL
So, define CSS styles as if "based on" Normal
Lots more tedious comparisons:
If myStyle.Alignment = normalStyle.Alignment Then …
Lots more tedious mapping:
If pf.Alignment = wdAlignParagraphLeft Then
    a = a & " text-align:left;"
ElseIf pf.Alignment = wdAlignParagraphRight Then
    a = a & " text-align:right;"
…
Can write stylesheet in header, or external file
Can write Stylesheet Attachment PI
<?xml-stylesheet …?>

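The compare-then-map pattern above can be condensed with a lookup table. The Python sketch below is an illustration only: it covers two properties, uses made-up dictionary keys in place of Word's style objects, and the file name passed to `stylesheet_pi` is hypothetical.

```python
# Assumption: alignment values are plain strings rather than Word's wdAlign... constants.
ALIGN_CSS = {"left": "left", "right": "right", "center": "center", "justify": "justify"}

def css_rule(name, style, normal):
    """Emit a CSS rule holding only the properties where style differs from Normal."""
    decls = []
    if style["alignment"] != normal["alignment"]:
        decls.append(f"text-align:{ALIGN_CSS[style['alignment']]};")
    if style["bold"] != normal["bold"]:
        decls.append(f"font-weight:{'bold' if style['bold'] else 'normal'};")
    return f"{name} {{ {' '.join(decls)} }}"

def stylesheet_pi(href):
    """Build the stylesheet attachment processing instruction for an external CSS file."""
    return f'<?xml-stylesheet type="text/css" href="{href}"?>'
```

A table-driven mapping keeps each new property to one comparison and one entry, which is the tedium the If/ElseIf chains spell out by hand.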
Non-Latin-1 style-name characters
Mapping Symbol, PC, Mac chars to Unicode
Horizontal white-space normalization
Embedded (not referenced) pictures
Borders, border styles, unnamed colors
Tables laid out via lots of fixed-pitch spaces
Tabs for anything but first-line indents
Absolutely-positioned "textframes"

Stylify
Cleans up many sloppy Word files nicely
Will not catch everything
XMLify
Converts almost everything in a Word file
Does a nice job with styles and CSS
Is dog slow
How much does it cost?
Nothing, and you can help improve the source, too