Based on internal metrics, half of the OpenERP custom code I produce for my customers is Python. The other half is XML (sigh).

If Python is well-equipped to enforce coding styles (thanks to pep8, pyflakes, pylint and the likes), it’s another story for XML. After some investigations and experiments, here is the best way I found to automate the cleaning of huge quantities of XML content.

First, we have to install some command-line utilities:

1 $ aptitude install libxml2-utils xsltproc

Override the default XML indention from 2 spaces to 4, before forcing the cleaning of each XML file found from our current folder:

1 $ export XMLLINT_INDENT="    "
2 $ find . -iname "*.xml" -exec xmllint --format --output "{}" "{}" \;

Now we have a set of normalized XML content.

Create an empty XSLT file named tidy.xslt and copy the following content in it:

<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Produce an exact copy of the original XML content -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Insert blank lines between each child element of data tags -->
  <xsl:template match="data">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:text>
      </xsl:text>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="data/*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
    <xsl:text>
    </xsl:text>
  </xsl:template>

</xsl:stylesheet>

The XSLT file above will separate with a blank line all children of all data tags. If this particular example is designed for OpenERP’s XML, you can update the second and third xsl:template block to produce files fitting your taste and style.

Finally, you can apply our XSLT to all our local XML files:

1 $ find . -iname "*.xml" -exec xsltproc --output "{}" ./tidy.xslt "{}" \;