Telerik blogs

For the Q1 2013 release of Telerik’s controls for ASP.NET AJAX we had an important objective – let the end user paste documents from MS Word in our rich-text Editor and produce nice, clean HTML while still preserving the original appearance. Read on to see the bonus we added :) But first, some background:

What MS Word content looks like

Let’s take a look at the HTML below, which is generated by pasting a paragraph from MS Word into a simple editable iframe. The original document has bold and red color styles applied.
MS Word original content

<!--[if gte mso 9]><xml>
 <w:WordDocument>
  <w:View>Normal</w:View>
  <w:Zoom>0</w:Zoom>
  .....
  .....
  </m:mathPr></w:WordDocument>
</xml><![endif]-->

<p class="MsoNormal"><b style="mso-bidi-font-weight:normal"><span style="color:red;mso-ansi-language:EN-US">Some text</span></b></p>
<!--[if gte mso 9]><xml>
 <w:LatentStyles DefLockedState="false" DefUnhideWhenUsed="true"
  DefSemiHidden="true" DefQFormat="false" DefPriority="99"
  LatentStyleCount="267">
  .....
  .....
  <w:LsdException Locked="false" Priority="39" QFormat="true" Name="TOC Heading"/>
 </w:LatentStyles>
</xml><![endif]--><!--[if gte mso 10]>
<style>
 /* Style Definitions */
 table.MsoNormalTable
 {mso-style-name:"Table Normal";
 mso-tstyle-rowband-size:0;
 mso-hansi-font-family:Calibri;
 mso-hansi-theme-font:minor-latin;}
</style>
<![endif]-->


And all of that is a trimmed-down version of the actual content for just two words.

The code behind the magic

We can see that most of the content is XML formatting wrapped in HTML comments. So, we can clear the comments with a regular expression, which will simplify the content:

content = content.replace(/<!--[\s\S]*?-->/gm, "");

Now the content will be only the paragraph:

<p class="MsoNormal"><b style="mso-bidi-font-weight:normal"><span style="color:red;mso-ansi-language:EN-US">Some text</span></b></p>
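To see the comment-stripping step in isolation, here is a minimal sketch that wraps the regular expression above in a function and runs it on a shortened sample. The function name stripHtmlComments and the sample string are made up for illustration; they are not part of RadEditor.

```javascript
// Hypothetical helper (not a RadEditor API): removes every HTML comment,
// including MS Word's conditional comments, using a non-greedy match.
function stripHtmlComments(html) {
  return html.replace(/<!--[\s\S]*?-->/g, "");
}

// A shortened stand-in for what MS Word puts on the clipboard.
const pasted =
  '<!--[if gte mso 9]><xml><w:WordDocument><w:View>Normal</w:View></w:WordDocument></xml><![endif]-->' +
  '<p class="MsoNormal">Some text</p>' +
  '<!--[if gte mso 10]><style>table.MsoNormalTable {mso-style-name:"Table Normal";}</style><![endif]-->';

const cleaned = stripHtmlComments(pasted);
console.log(cleaned); // <p class="MsoNormal">Some text</p>
```

The non-greedy `*?` matters here: a greedy match would run from the first `<!--` to the last `-->` and swallow the paragraph between the two comments.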

This is, obviously, still not clean enough - it contains MS Word formatting which is not valid CSS (or HTML markup in more complex cases).

Now, can we clean it further by using only regular expressions? Unfortunately, the answer is “no”. The MS Word styles can vary greatly and there is no way to make sure that we have accounted for all possible cases. Not to mention that they differ between versions of MS Word.

We need to clean the HTML, so we can take advantage of the fact that we can traverse its DOM tree with JavaScript. So, we create a DIV element and set the HTML to its innerHTML property. Then we can walk through all its child elements and make sure they are valid. We begin with the class attribute. It is rather straightforward - if the CSS class starts with “mso”, we remove it.
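The class cleanup described above could look roughly like the sketch below. It is not the actual RadEditor code: the helper names removeMsoClasses and cleanClasses are invented for this example, and the case-insensitive match is an assumption based on the "MsoNormal" class in the sample.

```javascript
// Hypothetical helper: keeps only the class names that do not start
// with "mso" (matched case-insensitively, which is an assumption).
function removeMsoClasses(classAttr) {
  return classAttr
    .split(/\s+/)
    .filter((name) => name !== "" && !/^mso/i.test(name))
    .join(" ");
}

// Hypothetical helper: loads the pasted HTML into a detached DIV,
// walks every element that has a class attribute and cleans it.
function cleanClasses(html, doc) {
  const container = doc.createElement("div");
  container.innerHTML = html;
  for (const el of container.querySelectorAll("[class]")) {
    const kept = removeMsoClasses(el.getAttribute("class"));
    if (kept) {
      el.setAttribute("class", kept);
    } else {
      el.removeAttribute("class"); // nothing left, so drop the attribute
    }
  }
  return container.innerHTML;
}

// The DOM part only runs in a browser; it is skipped where no document exists.
if (typeof document !== "undefined") {
  console.log(cleanClasses('<p class="MsoNormal">Some text</p>', document));
}
```

Keeping the string logic in its own small function makes it easy to check without a browser, while the DOM walk stays a thin wrapper around it.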

It’s true that regular expressions work faster than traversing the DOM with JavaScript. However, our tests show no noticeable difference in performance in most cases. You may feel a short delay only when pasting Word documents with many, many pages. The most important difference between the two approaches is that one of them works and the other will not produce the results you’d expect.

What else do we do?

Customize the stripping options

With a single line of JavaScript

There is a simple array of all the CSS properties we want to keep, and we compare the style rules on each DOM node against its items. Only the properties present in the array are kept, which effectively removes all the invalid properties that MS Word gave us.

This is far more reliable than relying on regular expressions alone. Its only downside is that it is a bit slower, but the trade-off is well worth it.
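As a rough illustration of the allowlist idea, the sketch below filters an inline style string against an array of property names. It is a simplified stand-in, not RadEditor’s implementation: filterStyle and toCamelCase are hypothetical names, it works on the style text instead of the DOM node’s style object, and it assumes no semicolons appear inside property values.

```javascript
// Converts "background-color" to "backgroundColor" so that CSS names
// can be compared with the camelCase names used in the array.
function toCamelCase(prop) {
  return prop.replace(/-([a-z])/g, (_, ch) => ch.toUpperCase());
}

// Hypothetical helper: keeps only the declarations whose property
// name is listed in propertiesToKeep.
function filterStyle(styleText, propertiesToKeep) {
  return styleText
    .split(";")
    .map((decl) => decl.trim())
    .filter((decl) => {
      const colon = decl.indexOf(":");
      if (colon < 0) return false; // empty or malformed declaration
      const name = decl.slice(0, colon).trim().toLowerCase();
      return propertiesToKeep.includes(toCamelCase(name));
    })
    .join(";");
}

const keep = ["color", "backgroundColor"];
console.log(filterStyle("color:red;mso-ansi-language:EN-US", keep)); // color:red
```

Run against the span from our example, only color:red survives, and the mso-bidi-font-weight declaration on the bold tag is dropped entirely.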

The array is named Telerik.Web.UI.Editor.Utils.cssPropertiesToKeep and you can override it on your page. Now you have detailed control over the formatting that will be kept:

  • If you don’t need to keep any styles, you could just set it to an empty array: Telerik.Web.UI.Editor.Utils.cssPropertiesToKeep = [];
  • Or you can list only the styles that you want to remain: Telerik.Web.UI.Editor.Utils.cssPropertiesToKeep = ['color', 'backgroundColor'];

Place this line of code at the end of the form, just before the closing </form> tag, and you are set for the entire page! Easy as pie!

With a server property

While the entire feature is based on JavaScript, there is a server property that can be used to control its behavior: StripFormattingOptions. Well, it actually changes the cssPropertiesToKeep array, but some of you may still find it more comfortable. Besides, it can be set for just a single RadEditor on the page.

What I would like to show you is the new member of the enum it takes – MSWordNoMargins. Setting

StripFormattingOptions="MSWordNoMargins, ConvertWordLists"

is one of the best combinations and this is why it is the default value. It will give you clean HTML content that will preserve the original formatting from the MS Word document. Here is how our original example looks with it:

How MS Word content looks in RadEditor 

Give it a try and share your feedback

The good news is that this will work for complex content as well. Give it a shot – try it with a couple of bulleted lists, or colored text, or tables in the online demos. There is still work to be done, so let us know what you need the most – add a comment, post in the forums or open a private support ticket.


Marin Bratanov
About the Author

Marin Bratanov was a Principal Technical Support Engineer in the Blazor division.
