
Cleaning documents polluted by copy-paste from MSWord

by Izak Burger, posted on Sep 16, 2011 11:13 AM, last modified Sep 16, 2011 11:13 AM

This problem is much less severe now that newer versions of Plone use TinyMCE, but we still run into it with older documents created in Kupu on older versions of Plone.

Case in point: yesterday I dumped the content of such a document to a file and cleaned it up. This reduced the file size by more than 90%.

-rw-r--r-- 1 izak izak 3.2M 2011-09-15 15:52 /tmp/before.html
-rw-r--r-- 1 izak izak 205K 2011-09-15 16:09 /tmp/after.html
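As a quick sanity check of that figure, the reduction can be worked out from the two sizes in the listing. This is only a sketch: the helper name percent_reduction is my own, and the byte counts are approximations derived from the rounded 3.2M and 205K values that ls prints.

```python
def percent_reduction(before_bytes, after_bytes):
    """Return how much smaller the cleaned file is, as a percentage."""
    return 100.0 * (before_bytes - after_bytes) / before_bytes

# Approximate byte counts, reconstructed from the rounded sizes in the listing.
before = 3.2 * 1024 * 1024   # 3.2M
after = 205 * 1024           # 205K

print("%.1f%% smaller" % percent_reduction(before, after))
```

For real files, os.path.getsize() on the before and after paths would give exact byte counts to feed into the same function.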

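One thing that TinyMCE definitely doesn't handle as well as Kupu is a 3.2M document, so we can no longer ignore the MSWord bloat. I wrote the following bit of code to make the cleanup easier. It uses lxml's ElementTree-compatible etree module together with a small XSLT stylesheet that drops comments, style elements and link elements, and copies everything else unchanged.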

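import sys
from lxml import etree

# Parse the (possibly malformed) HTML with lxml's forgiving HTML parser.
# The file is opened in binary mode so lxml can detect the encoding itself.
parser = etree.HTMLParser()
with open(sys.argv[1], 'rb') as fp:
    tree = etree.parse(fp, parser)

# Identity transform plus empty templates for the nodes we want to drop:
# comments (including Word's conditional comments), <style> and <link>.
# The identity rule comes first because comment() and node() share the same
# default priority, and a processor that recovers from such a tie picks the
# later rule.
xslt = etree.XML("""\
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="@*|node()">
        <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
    </xsl:template>
    <xsl:template match="comment()" />
    <xsl:template match="style" />
    <xsl:template match="link" />
</xsl:stylesheet>""")
transform = etree.XSLT(xslt)

# Apply the stylesheet and write the cleaned document to stdout.
newtree = transform(tree)
print(str(newtree))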
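To see what the stylesheet actually removes, here is the same idea wrapped in a function and run on a tiny inline document. Treat it as a sketch rather than part of the original script: the function name clean_word_html and the sample markup are made up for illustration, and the sample only imitates the kind of markup Word tends to leave behind.

```python
from lxml import etree

# Same stylesheet as in the script: copy everything, except comments,
# <style> and <link>, which match empty templates and are dropped.
_STRIP = etree.XSLT(etree.XML("""\
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    <xsl:template match="@*|node()">
        <xsl:copy><xsl:apply-templates select="@*|node()"/></xsl:copy>
    </xsl:template>
    <xsl:template match="comment()" />
    <xsl:template match="style" />
    <xsl:template match="link" />
</xsl:stylesheet>"""))


def clean_word_html(html):
    """Hypothetical helper: return html with comments, style and link removed."""
    root = etree.fromstring(html, etree.HTMLParser())
    return str(_STRIP(root))


# Made-up sample imitating pasted Word markup.
sample = """<html><head>
<style>p.MsoNormal { margin: 0cm; }</style>
<link rel="File-List" href="filelist.xml">
</head><body>
<!--[if gte mso 9]>Word-only markup<![endif]-->
<p class="MsoNormal">Hello</p>
</body></html>"""

cleaned = clean_word_html(sample)
print(cleaned)
```

Note that class attributes such as MsoNormal survive, since the stylesheet only targets comments, style and link; stripping those as well would need an extra template.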

I hope this is useful to someone.