On this page

Getting more information from the Word error box when troubleshooting OpenXML / WordML issues
OpenXML: How to refresh a field when the document is opened

Disclaimer
The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.


# Saturday, September 3, 2011
Saturday, September 3, 2011 11:56:55 AM (Central Daylight Time, UTC-05:00) ( OpenXML | Troubleshooting )

So, many apologies for dropping off the face of the blogosphere lately.  Fortunately (or unfortunately, depending on your perspective), I’ve been really busy at work.  I’ve been working on some really cool things that I hope I’ll be able to talk about publicly soon.  For now, though, I wanted to pass on something that I haven’t seen documented in other places that actually helped me quite a bit lately. 

So, for those of you that generate Word documents via the OpenXML SDK (or any of a variety of other methods), you may have come across something like this when you opened up a document you just generated:

[Screenshot: Word error dialog reporting a problem in /word/document.xml at Line: 1, Column: 0]

There are some problems with this message but the big one is that it just says “Line: 1, Column: 0”.  Not exactly a map to the error.  As a result, you may have stared at this message for a long time and wondered – “how the heck do I fix this?  What is the real problem?”.  Well, let me show you a really quick and easy way of getting more information than what is initially provided.

Step 1:  Change the extension from docx to zip

As you may or may not know, all OpenXML documents (or Office documents since Office 2007) are actually zip files at their core. That means you can just crack them open and peer inside.

[Screenshot: the file before and after renaming it from .docx to .zip]

See the difference?  Easy!

Step 2:  Extract the zip file to a folder

Once again, pretty straightforward.  Once you extract the zip file above, you should see a structure like the following:

[Screenshot: folder structure of the extracted package]

Now, from here – you’ll be able to locate the file referenced in that cryptic error message above. 
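
If you'd rather script Steps 1 and 2 than rename and extract by hand, here's a minimal sketch in Python. Python isn't part of the original walkthrough, and `extract_docx` is a name made up for illustration. It relies only on the standard-library `zipfile` module, which goes by the file's contents rather than its extension, so the rename isn't needed.

```python
import zipfile
from pathlib import Path


def extract_docx(docx_path, out_dir):
    """Extract every part of an Open XML package into out_dir and
    return the part names in archive order."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # zipfile inspects the file's contents, not its extension,
    # so the .docx can be opened as-is.
    with zipfile.ZipFile(docx_path) as package:
        package.extractall(out)
        return package.namelist()
```

Calling `extract_docx("report.docx", "report_parts")` (both names hypothetical) should leave you with the same folder structure as the manual steps.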

Step 3:  Find the file that’s causing the problem

In the example above, the message states that the problem lies with the file “/word/document.xml” so just navigate to the “word” folder and find the “document.xml”.

[Screenshot: document.xml inside the word folder]

Step 4:  Open and format the file in Visual Studio

One of the great features of Visual Studio is that it can format an XML file for you.  So, in our case, the document.xml file is natively just one big line:

[Screenshot: document.xml shown as a single long line]

Incidentally, this is why the error message always states “Line: 1, …”.  As far as Word is concerned, the problem IS on the first line.  Fortunately for us, though, Visual Studio can take that single-line file and format it for us.  Just use the Edit > Advanced > Format Document option in Visual Studio:

[Screenshot: the Edit > Advanced > Format Document menu option]

That will then format the XML and make it look closer to:

[Screenshot: document.xml after formatting]
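
No Visual Studio handy? Roughly the same formatting can be sketched in Python with the standard library's `xml.dom.minidom`. Treat this as a stand-in for Format Document rather than an equivalent: `pretty_print_part` is a hypothetical helper, it assumes the part is well-formed XML that was saved as a single line, and its line breaks won't necessarily match Visual Studio's, so any line number Word reports afterwards only makes sense against the file this function wrote.

```python
from xml.dom import minidom


def pretty_print_part(xml_path):
    """Rewrite an XML part in place with one element per line and
    return the number of lines written."""
    document = minidom.parse(str(xml_path))
    # With an encoding given, toprettyxml returns bytes that start
    # with an XML declaration.
    pretty = document.toprettyxml(indent="  ", encoding="utf-8")
    with open(xml_path, "wb") as handle:
        handle.write(pretty)
    return len(pretty.splitlines())
```

If the part isn't well-formed at all, `minidom.parse` will raise a parse error instead, which is a useful hint in its own right.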

Step 5:  Recreate the Word doc and get the additional information

Now that you have the file formatted appropriately, you can just re-create the Word document and re-open it.  For this, just go back to the root of the document, select all the files/folders and then zip it back up:

[Screenshot: selecting the package contents and zipping them]
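
This step can be scripted too. A sketch, again in Python with a made-up function name: `rebuild_docx` zips the contents of the extracted folder, not the folder itself, which mirrors selecting all the files/folders at the root. Write the output somewhere outside `parts_dir`, otherwise the archive would try to include itself.

```python
import os
import zipfile


def rebuild_docx(parts_dir, docx_path):
    """Zip the contents of parts_dir (not the folder itself) into docx_path."""
    with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as package:
        for folder, _subdirs, files in os.walk(parts_dir):
            for name in files:
                full_path = os.path.join(folder, name)
                # Entry names are relative to the package root and use "/".
                relative = os.path.relpath(full_path, parts_dir)
                package.write(full_path, relative.replace(os.sep, "/"))
```

Because the file is written with a .docx name directly, there's no zip-to-docx rename afterwards.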

Once it’s zipped back up again, just change the extension from zip to docx and re-open the file.  When you do so, you’ll see the following:

[Screenshot: Word error dialog now reporting Line: 5667, Column: 0]

Note that it now says “Line: 5667, Column: 0”, which points to the exact line causing the problem.  That means you can go straight back to the “document.xml” file you already have open in Visual Studio and jump to that line.  In our case:

[Screenshot: the offending line in document.xml]
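
If you formatted the file with a script rather than in an editor, a few lines of Python can pull out the reported line with some context. This is only a sketch: `show_line` is a hypothetical helper, and it assumes the line number Word reports is 1-based and refers to the formatted file as it currently sits on disk.

```python
def show_line(xml_path, line_number, context=2):
    """Return the reported line (1-based) plus a few lines either side."""
    with open(xml_path, encoding="utf-8") as handle:
        lines = handle.read().splitlines()
    first = max(line_number - context, 1)
    last = min(line_number + context, len(lines))
    return [f"{n}: {lines[n - 1]}" for n in range(first, last + 1)]
```

For the example above, that would be something like `show_line("word/document.xml", 5667)`.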

Note that this won’t magically fix your problem.  You’ll still need to examine the WordML to figure out the problem – but at least you know where to go.  And knowing is half the battle! 

That’s all for now and I will be back with some more developer stuff soon. 

Until next time!

# Monday, August 9, 2010
Monday, August 9, 2010 1:09:32 AM (Central Daylight Time, UTC-05:00) ( Development | OpenXML )

I was working on an internal project a while ago, and one of the requirements was to generate a fancy Word document.  The idea was that all of the editing of the text, code samples, etc. would be done in the application, and then the user could export it to Word to add any finishing touches and send it off to the customer.  The final report needed to include section headers, page breaks, a table of contents, etc.

There are a number of ways we could have accomplished the task.  There’s Word automation, which relies on a COM-based API; there’s creating an HTML document and loading that into Word; and finally there’s the Open XML API.  Someone had previously hacked up a version of this export functionality using Word automation, but considering we’re often dealing with 1,000+ page documents, it turned out to be a little slow.  Also, there are some restrictions around using the automation libraries in a server context.  Lastly, since my OpenXML kung-fu is strong, I thought I would take the opportunity to implement a better, more flexible and much faster solution.  For those just starting out, Brian and Zeyad’s excellent blog on the topic is invaluable.

One of the requirements for the export operation was to have Word automagically refresh the table of contents (and other fields) the first time the document is opened.  This was something that took a bit of time to research but you really end up with 2 options:

w:updateFields Element

The “w:updateFields” element is a document-level element that is set in the document settings part and tells Word to update all of the fields in the document:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:settings xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:updateFields w:val="true" />

</w:settings>

If you’re wondering what the document settings part is – just rename a Word doc from “blah.docx” to “blah.docx.zip” and extract it to a folder on your computer.  In the new folder is a directory called “word”.  In that directory, you should see a file called “settings.xml”:

[Screenshot: settings.xml inside the word folder]

In that file are all of the document-level settings for your docx.  There’s some really great stuff in here.

If you’d like to use the OpenXML SDK to set that value (and you’d be crazy not to), here’s some sample code:

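using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// "path" is the full path to the .docx file; "true" opens it for editing.
using (WordprocessingDocument document = WordprocessingDocument.Open(path, true))
{
    // Get the settings part, creating it if the document doesn't have one.
    DocumentSettingsPart settingsPart = document.MainDocumentPart.DocumentSettingsPart;
    if (settingsPart == null)
    {
        settingsPart = document.MainDocumentPart.AddNewPart<DocumentSettingsPart>();
        settingsPart.Settings = new Settings();
    }

    // Create object to update fields on open
    UpdateFieldsOnOpen updateFields = new UpdateFieldsOnOpen();
    updateFields.Val = new OnOffValue(true);

    // Insert object into settings part.
    settingsPart.Settings.PrependChild<UpdateFieldsOnOpen>(updateFields);
    settingsPart.Settings.Save();
}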
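
For comparison, the same setting can be flipped without the SDK. The sketch below is Python, standard library only, and every function name in it is made up for illustration. It assumes the settings part lives at `word/settings.xml` and that its root element declares the `w` prefix. Like the C# sample, it puts the element first among the settings. Python's `zipfile` can't edit an archive in place, so it writes a modified copy instead.

```python
import zipfile
from xml.dom import minidom

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def add_update_fields(settings_xml):
    """Return settings.xml bytes with <w:updateFields w:val="true"/> set."""
    document = minidom.parseString(settings_xml)
    root = document.documentElement
    existing = root.getElementsByTagNameNS(W_NS, "updateFields")
    if existing:
        element = existing[0]
    else:
        # Assumption: the root element binds the "w" prefix to W_NS.
        element = document.createElementNS(W_NS, "w:updateFields")
        root.insertBefore(element, root.firstChild)
    element.setAttributeNS(W_NS, "w:val", "true")
    return document.toxml(encoding="utf-8")


def copy_with_update_fields(src_docx, dst_docx):
    """Copy a .docx, rewriting only word/settings.xml."""
    with zipfile.ZipFile(src_docx) as src, \
            zipfile.ZipFile(dst_docx, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "word/settings.xml":
                data = add_update_fields(data)
            dst.writestr(item, data)
```

`minidom` is used here because it writes the root element's namespace declarations back out as it found them. If the document has no settings part at all, this sketch simply copies it unchanged.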

w:dirty Attribute

This attribute is applied to the field you would like to have refreshed when the document is opened in Word.  It tells Word to only refresh this field the next time the document is opened.  For example, if you want to apply it to a field like your table of contents, just find the w:fldChar and add that attribute:

<w:r>
  <w:fldChar w:fldCharType="begin" w:dirty="true"/>
</w:r>

For a simple field, like the document author, you’ll want to add it to the w:fldSimple element, like so:

<w:fldSimple w:instr="AUTHOR \* Upper \* MERGEFORMAT"
             w:dirty="true" >
  <w:r>
    ...
  </w:r>
</w:fldSimple>
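
If you're post-processing the XML yourself, adding the attribute can be sketched in a few lines of Python. Two caveats: `mark_fields_dirty` is a hypothetical helper, and for simplicity it flags every field in the part, whereas in practice you'd probably filter down to the ones you care about (just the table of contents, say).

```python
from xml.dom import minidom

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def mark_fields_dirty(document_xml):
    """Set w:dirty="true" on every field in a document.xml payload.

    Returns the updated XML bytes and the number of fields flagged."""
    document = minidom.parseString(document_xml)
    marked = 0
    for field_char in document.getElementsByTagNameNS(W_NS, "fldChar"):
        # Following the example above, flag the field's "begin" marker.
        if field_char.getAttributeNS(W_NS, "fldCharType") == "begin":
            field_char.setAttributeNS(W_NS, "w:dirty", "true")
            marked += 1
    for simple_field in document.getElementsByTagNameNS(W_NS, "fldSimple"):
        simple_field.setAttributeNS(W_NS, "w:dirty", "true")
        marked += 1
    return document.toxml(encoding="utf-8"), marked
```

The input would be the bytes of `word/document.xml`, and the result has to be written back into the package before Word will see it.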

A caveat or two

Both of these methods will work just fine in Word 2010. 

In Word 2007, though, you need to clear out the contents of the field before the user opens the document.  For example, with a table of contents, Word will normally cache the rendered contents of the TOC between the field’s begin and end w:fldChar elements.  This is good, normally, but here it causes a problem.

For example, in a very simple test document, you would see the following cached data (the Heading 1 and Heading 2 entries):

<w:p w:rsidR="00563999" w:rsidRDefault="00050B09">
  ...
  <w:r>
    <w:fldChar w:fldCharType="begin"/>
  </w:r>
  <w:r w:rsidR="00563999">
    <w:instrText xml:space="preserve"> TOC \* MERGEFORMAT </w:instrText>
  </w:r>
</w:p>
<w:p w:rsidR="00F77370" w:rsidRDefault="00F77370">
  ...
  <w:r>
    ...
    <w:t>Heading 1</w:t>
  </w:r>
  ...
</w:p>
<w:p w:rsidR="00F77370" w:rsidRDefault="00F77370">
  ...
  <w:r>
    ...
    <w:t>Heading 2</w:t>
  </w:r>
  ...
</w:p>
<w:p w:rsidR="00F77370" w:rsidRDefault="00F77370">
  ...
  <w:r>
    <w:rPr>
      <w:noProof/>
    </w:rPr>
    <w:fldChar w:fldCharType="end"/>
  </w:r>
</w:p>

After you clear out the schmutz, you end up with just the begin element, the definition of the TOC and the end element:

<w:p w:rsidR="00563999" w:rsidRDefault="00563999">
  ...
  <w:r>
    <w:fldChar w:fldCharType="begin"/>
  </w:r>
  <w:r>
    <w:instrText xml:space="preserve"> TOC \* MERGEFORMAT </w:instrText>
  </w:r>
</w:p>
<w:p w:rsidR="00B63C3C" w:rsidRDefault="00563999" w:rsidP="00B63C3C">
  <w:r>
    <w:fldChar w:fldCharType="end"/>
  </w:r>
  ...
</w:p>

Once you’ve made the updates, you can safely open up your file in Word 2007 and your fields will update when the document opens.

Big thanks to Zeyad for his tip on trimming out the schmutz.

Just to stress, this is improved in Word 2010 and you no longer need to clear out the cached data in your fields.

Enjoy!