Getting from Word to XML
OUCS, July 2001
Steven J. DeRose, Ph.D.
Brown University Scholarly Technology Group
Steven_DeRose@brown.edu
http://www.stg.brown.edu/~sjd

The easy way
"Save as HTML"
Ok to post to Web for browsing
But take a closer look...

Word 98
No XML declaration, no DOCTYPE declaration
Unquoted attributes
Closing unopened tags; crossing tags
Character sets mislabeled
Font codes everywhere
Countless BR, align, empty, blank containers
No style information survives at all

Word 98 sample

Word 2001
Better:
Maps headings to H1/H2/etc
Seems not to write crossing elements
Not better:
Still no XML/DOCTYPE declarations
Unquoted attributes
Crammed with metadata, format details, etc. -- in comments
Massive useless/redundant/empty/blank elements
All style information still dies

Word 2001 sample header

Word 2001 sample body

Other options
Rumored Microsoft "Save As XML"
No doubt will write well-formed documents
Said to avoid absurd amounts of style information
Unknown how much it does, or if it does any style creation
Third-party tools
Most I've examined don't do much
For example, they make everything a <table>, <p>, or <span>

What's the real problem?
Word doesn't know what it's dealing with
Doesn't notice consistency
So it puts 1000 font tags in instead of one big one
Even if all table cells are consistent, it styles every one in detail
So it doesn't kill empty paragraphs
So it doesn't factor consistencies out
Has no sense of separating concerns
A bold non-breaking space is treated as critical...
No notion that format and content may be separated

And this is bad because...
Files become uselessly huge
Files become unreadably cluttered
Formatting cannot be centrally changed (styles)
And formatting for paper and Web should be different...
Formatting will have quirks
For example, non-breaking spaces used for alignment go wrong on a different screen size

What should a solution do?
Try to figure out and express some user intent
If user provided styles, use them
If many things look the same, name them alike
Be smart about typical whitespace conventions
Make absolutely conformant XML
Make good XML, not just any XML
Separate formatting so it can be changed
Do nothing that requires magic to interpret
(like Word's <!--<![elseif[… --> stuff)

Sources of information
Special Word objects
Footnotes, index entries, links, highlighting, revisions
Character sets
Explicit structure from user
Styles (para and char)
Implicit structure from user
Matching formatting
Use of whitespace

General approach
Find implicit consistency and factor it out
Make styles for matching paragraphs
Normalize whitespace by size
Don't discard explicit consistency
Keep user styles
Map Word objects directly
Avoid any style info outside of stylesheet
Maximize editability/reusability

Implementation
About 4,000 lines of Visual Basic
Only Word has access to all the information
Why not RTF?
Parsing is gross, and the format changes frequently
User would have to manually convert each file to RTF unless we used VB anyway
Code would have to be multi-platform
Why not Word API?
Licensing
Harder to do multi-platform release

Part 1: Stylify
Find "exception" paragraphs
Paragraphs that don't match their named style
In the worst documents, everything is "Normal" style plus whatever formatting (exceptions) the author applied
When you find one:
Make a style for it
Apply that style to all other exceptions that match it
(this way we don't trash real styled paragraphs)
Same for character styles
Makes documents cleaner, more maintainable
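
A minimal sketch of the Stylify idea, written in Python rather than the Visual Basic of the real tool. The dictionary layout (`style`, `format`) and the `Auto1`, `Auto2` naming scheme are assumptions for illustration only; the real code works on Word's own paragraph and style objects.

```python
from collections import defaultdict

def stylify(paragraphs, styles):
    """Give every group of matching "exception" paragraphs its own named style.

    paragraphs: list of dicts with 'style' (a name) and 'format' (property dict).
    styles: dict mapping style name -> property dict; new styles are added to it.
    """
    groups = defaultdict(list)
    for para in paragraphs:
        if para["format"] != styles[para["style"]]:
            # The paragraph deviates from its named style: an "exception".
            key = tuple(sorted(para["format"].items()))
            groups[key].append(para)

    for number, (key, members) in enumerate(groups.items(), start=1):
        name = f"Auto{number}"      # hypothetical naming scheme
        styles[name] = dict(key)
        for para in members:
            para["style"] = name    # paragraphs that match their style are untouched
    return styles
```

Only paragraphs that deviate from their style are regrouped, so genuinely styled paragraphs keep the names the author gave them.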

Stylify: Finding exceptions
To check if a paragraph has exceptions
Compare properties of style and paragraph format
18 paragraph formatting properties
23 font properties
Ignore kerning, hyphenation, etc., as they are unlikely to be the only thing distinguishing element types
Ignore border line types as too tedious to bear
Ignore color; can't seem to get the needed info
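
The comparison can be pictured as a property-by-property diff that skips the ignored properties. This is a Python sketch, not the tool's Visual Basic; the property names are invented for the example and do not correspond to Word's actual property names.

```python
# Hypothetical property names standing in for the ones the slide says to ignore.
IGNORED = {"kerning", "hyphenation", "border_line_type", "color"}

def exceptions(style_props, para_props, ignored=IGNORED):
    """Return the names of properties where the paragraph differs from its style."""
    return sorted(
        name
        for name in style_props.keys() | para_props.keys()
        if name not in ignored and style_props.get(name) != para_props.get(name)
    )
```

An empty result means the paragraph matches its named style and needs no new style.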

Part 3: dirSweep
Scans a directory tree
Finds all Word files
Applies something to each one:
Save As RTF
Stylify
XMLify
Many have suggested integrating this with one of the countless Word macro viruses, to convert the world's Word files to XML en masse
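
The sweep itself is a plain directory walk. A rough Python equivalent follows (the real dirSweep is Visual Basic running inside Word); the function name, the `action` callback, and the `.doc` extension filter are assumptions of this sketch.

```python
import os

def dir_sweep(root, action, extensions=(".doc",)):
    """Call action(path) for every file under root whose extension matches."""
    results = []
    for folder, _subdirs, files in os.walk(root):
        for filename in sorted(files):
            if filename.lower().endswith(extensions):
                results.append(action(os.path.join(folder, filename)))
    return results
```

Passing the operation in as a callback is what lets one sweep serve Save As RTF, Stylify, and XMLify alike.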

Part 2: XMLify
Runs Stylify to clean up
Converts document to XML
Creates CSS stylesheet

The easy bits
Styled paragraphs without "exceptions"
Convert style name to well-formed XML name
Tricky since Word doesn't have "isXMLnameChar(c)"
Instances of character styles
Easy via global change
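
One conservative way to derive an element name from a style name, sketched in Python. It assumes an ASCII-only subset of the XML name rules, which is stricter than XML requires, and the underscore replacement is a choice made here, not necessarily the tool's.

```python
import re

def to_xml_name(style_name):
    """Turn a Word style name into a conservative, ASCII-only XML name."""
    # Replace every run of characters outside [A-Za-z0-9_.-] with one underscore.
    name = re.sub(r"[^A-Za-z0-9_.-]+", "_", style_name.strip())
    # An XML name must not be empty and must not start with a digit, '.', or '-'.
    if not name or not re.match(r"[A-Za-z_]", name):
        name = "_" + name
    return name
```

For example, `to_xml_name("Heading 1")` gives `Heading_1`. Two style names that differ only in punctuation would collide under this scheme, so a real converter would also need a uniqueness check.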

A clever bit
Character format exceptions
For every character:
compare 23 font properties to style
Wayyyyy too slow
Sneaky tricks to use Word "find" feature
Keep track of every distinct char format
So styles can be reused once created
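
The bookkeeping half of this can be sketched in Python: group characters into runs of identical format and keep a registry so each distinct format gets one style name. This does not reproduce the Word "find" trick, which exists precisely to avoid visiting every character; the `(character, format)` pairs, the `c1`, `c2` names, and the `<span class>` output are assumptions for illustration, and the text is not XML-escaped here.

```python
from itertools import groupby

def tag_runs(chars, base_format, registry):
    """Wrap runs of characters whose format differs from base_format in <span>s.

    chars: list of (character, format) pairs, where format is a hashable tuple.
    registry: dict mapping format -> style name, shared across calls.
    """
    out = []
    for fmt, run in groupby(chars, key=lambda pair: pair[1]):
        text = "".join(ch for ch, _ in run)
        if fmt == base_format:
            out.append(text)
        else:
            # Reuse the name assigned the first time this format was seen.
            name = registry.setdefault(fmt, f"c{len(registry) + 1}")
            out.append(f'<span class="{name}">{text}</span>')
    return "".join(out)
```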

The tedious bits
Map Word language/country codes to xml:lang
Map characters to Unicode entity references
A pain for Symbol, dingbats, etc
A double pain for PC and Mac specific characters
™ § ¶
A pain for control-ish characters
En, em, thin dash and space; soft and hard hyphen; ...
Map Word color names to CSS names
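
To illustrate why platform-specific characters are a double pain: the same symbol sits at different byte values on PC (Windows-1252) and Mac (Mac Roman). A Python sketch, assuming the raw bytes and their source encoding are known, which Word does not always make easy:

```python
def to_char_refs(raw, platform_encoding):
    """Decode platform-specific bytes, then write non-ASCII as numeric references."""
    text = raw.decode(platform_encoding)
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):04X};" for ch in text)
```

Byte 0x99 decoded as `cp1252` and byte 0xAA decoded as `mac_roman` both come out as `&#x2122;` (the trademark sign). The sketch does not escape `&` or `<`, and does nothing for Symbol or dingbat fonts, where the font rather than the code determines the glyph.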

Headings
Easy to find: Style "Heading 1" &c
But:
Best to add DIV containers
May have to open/close multiple levels (e.g. H1 skipping to H5)
Word also has "Sections"
Used to insert page/col breaks
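
Opening and closing several container levels at once can be sketched like this in Python. The `(level, text)` input, the `<head>` and `<p>` element names, and the `class="levelN"` attribute are hypothetical; the tool's actual output vocabulary is not shown in these slides.

```python
def wrap_headings(items):
    """Emit nested <div> containers for a flat list of (level, text) items.

    level is 1..n for a "Heading n" paragraph and 0 for ordinary body text.
    """
    out, depth = [], 0
    for level, text in items:
        if level == 0:
            out.append(f"<p>{text}</p>")
            continue
        # Close every open division at this level or deeper.
        while depth >= level:
            out.append("</div>")
            depth -= 1
        # Open divisions down to the new level, even if levels were skipped.
        while depth < level:
            depth += 1
            out.append(f'<div class="level{depth}">')
        out.append(f"<head>{text}</head>")
    # Close whatever is still open at the end of the document.
    out.extend("</div>" for _ in range(depth))
    return out
```

A jump from Heading 1 to Heading 3 opens two containers, so the result stays properly nested even when the author skipped a level.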

Lists
Must add list containers early, or Word loses track of list boundaries/numbering as you process
Discontiguous numbered lists seem hopeless
Manual vs. auto-numbering
Bullet types (CSS 1 and 2 offer little control)

Tables
Using HTML table markup (can change w/ XSLT)
Word heading rows ≈ HTML heading cells
Word doesn't know about spans, so
Make array of distinct vertical boundaries (≈ <col>)
Search array for which boundary we're at
If it's the next one, no span; else count the span
Borders are too horrible
Bottom, Horizontal, Left, Right, Top, Vertical, DiagonalDown, DiagonalUp
Each has: ArtStyle ArtWidth ColorIndex LineStyle LineWidth Visible
Enums like: None Single Dot DashSmallGap DashLargeGap DashDot DashDotDot Double Triple ThinThickSmallGap... ThinThickThinLargeGap SingleWavy DoubleWavy DashDotStroked Emboss3D Engrave3D
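
The span calculation from vertical boundaries, as a Python sketch. It assumes each cell is available as a `(left, right)` pair of edge positions and that shared edges compare exactly equal; real measurements might need a tolerance.

```python
def column_spans(rows):
    """Compute a colspan for every cell from its left and right edge positions.

    rows: list of rows; each row is a list of (left, right) edges in any unit.
    """
    # Every distinct vertical boundary in the table, left to right.
    boundaries = sorted({edge for row in rows for cell in row for edge in cell})
    index = {edge: i for i, edge in enumerate(boundaries)}
    # A cell spans as many grid columns as there are boundaries between its edges.
    return [[index[right] - index[left] for left, right in row] for row in rows]
```

A cell whose right edge is the very next boundary gets a span of 1; anything larger becomes its `colspan` value.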

Special Word constructs
Footnotes/endnotes and comments
Stored in separate Word "stories"
Normal processing doesn't reach them; they must be moved into the main text
Hyperlinks
Highlighting, Bookmarks
Simple; color/name/author go on attributes
Index Entries
Parse primary/secondary/see also syntaxes
Eliminate internal formatting, move info to attributes
Revision tracking
Use HTML del/ins markup, add author/date/etc

Pictures
Referenced pictures
Easily turn into link to file
Embedded pictures
Don't see how to get enough information

Whitespace
Authors use whitespace to format
Empty paragraphs (including whitespace-only)
Add their height to the following paragraph's "space-before" value, then drop them
Must calculate total actual height
Paragraph-initial spaces/tabs
Find where "real" content starts
Set first-indent value to that
Not yet done: avoid tagging invisible formatting
For example, a space formatted as red
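
Folding empty paragraphs into the next paragraph's space-before might look like this in Python. The `text`, `height`, and `space_before` fields (in points) are assumptions, and "total actual height" is simplified here to space-before plus line height.

```python
def fold_empty_paragraphs(paragraphs):
    """Drop whitespace-only paragraphs, adding their height to the next one.

    paragraphs: list of dicts with 'text', 'height', and 'space_before' (points).
    """
    kept, pending = [], 0
    for para in paragraphs:
        if not para["text"].strip():
            # Remember the vertical space this empty paragraph occupied.
            pending += para["space_before"] + para["height"]
            continue
        kept.append({**para, "space_before": para["space_before"] + pending})
        pending = 0
    return kept
```

Empty paragraphs at the very end of the document have no following paragraph and are simply dropped in this sketch.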

Other tricks
Author might format all the text of a paragraph,
but not the ¶ mark itself
This suggests their formatting is para-level anyway
Must split up any char formatting that crosses paragraph boundaries (otherwise the XML is not well-formed)
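
Splitting a formatting range at paragraph boundaries, sketched in Python. It assumes formatting is known as `(start, end)` character offsets and that the offsets where paragraphs end are available; both are simplifications of what Word exposes.

```python
def split_at_paragraphs(span, paragraph_ends):
    """Cut a (start, end) formatting range wherever a paragraph ends inside it."""
    start, end = span
    pieces = []
    for boundary in sorted(paragraph_ends):
        if start < boundary < end:
            pieces.append((start, boundary))
            start = boundary
    pieces.append((start, end))
    return pieces
```

Each resulting piece lies inside a single paragraph, so its element can be opened and closed within that paragraph's element.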

Figuring what CSS to write
Word "based on" model doesn't fit with CSS/XSL
So, define CSS styles as if "based on" Normal
Lots more tedious comparisons:
If myStyle.Alignment = normalStyle.Alignment Then….
Lots more tedious mapping:
If pf.Alignment = wdAlignParagraphLeft Then
    a = a & " text-align:left;"
ElseIf pf.Alignment = wdAlignParagraphRight Then
    a = a & " text-align:right;"
...
Can write stylesheet in header, or external file
Can write Stylesheet Attachment PI
<?xml-stylesheet …?>
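
The "as if based on Normal" idea, sketched in Python instead of the tool's Visual Basic: write only the declarations where a style differs from Normal. The property names (`alignment`, `bold`, `size`), the `"both"` value for justified text, and the use of the element name as the selector are assumptions of this sketch; only three properties are covered.

```python
# Hypothetical alignment values mapped to CSS text-align keywords.
ALIGN_TO_CSS = {"left": "left", "right": "right", "center": "center", "both": "justify"}

def css_rule(name, style, normal):
    """Write a CSS rule containing only the properties that differ from Normal."""
    decls = []
    if style["alignment"] != normal["alignment"]:
        decls.append("text-align: " + ALIGN_TO_CSS[style["alignment"]] + ";")
    if style["bold"] != normal["bold"]:
        decls.append("font-weight: " + ("bold" if style["bold"] else "normal") + ";")
    if style["size"] != normal["size"]:
        decls.append("font-size: " + str(style["size"]) + "pt;")
    return name + " { " + " ".join(decls) + " }"
```

A lookup table replaces the chain of If/ElseIf tests, but the per-property comparison against Normal is the same tedious work the slide describes.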

Not handled yet
Non-Latin-1 style-name characters
Mapping Symbol, PC, Mac chars to Unicode
Horizontal white-space normalization
Embedded (not referenced) pictures
Borders, border styles, unnamed colors
Tables laid out via lots of fixed-pitch spaces
Tabs for anything but first-line indents
Absolutely-positioned "textframes"

Summary
Stylify
Cleans up many sloppy Word files nicely
Will not catch everything
XMLify
Converts almost everything in a Word file
Does a nice job with styles and CSS
Is dog slow
How much does it cost?
Nothing, and you can help improve the source, too