Open Source Exile: July 2009

We mark up a lot of names, so one of the first things I decided to do was build an XSLT stylesheet that takes a list of names and tags those names wherever they occur in a separate XML file. To keep things simple and clear, I've ignored details such as namespaces and conformant TEI.

First up, the list of names (saved as name-list.xml, the file name the stylesheet below loads); these are multi-word names. Notice the simple structure: it could easily be built from a comma-separated list or similar.

<?xml version="1.0" encoding="UTF-8"?>
<names>
<name>Papaver argemone</name>
<name>Papaver dubium</name>
<name>Papaver Rhceas</name>
<name>Zanthoxylum novæ-zealandiæ</name>
</names>
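
As a rough sketch of how such a list could be generated from a comma-separated file, something like the following XSLT 2.0 stylesheet might do. The file name names.csv, the parameter name csv-uri and the template name main are assumptions made for illustration; they are not part of the original setup.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Hypothetical input: a UTF-8 text file with names separated by commas and/or line breaks -->
  <xsl:param name="csv-uri" select="'names.csv'"/>

  <!-- Named entry point, so no source document is required -->
  <xsl:template name="main">
    <names>
      <!-- Split on runs of commas and line breaks, then trim each piece -->
      <xsl:for-each select="tokenize(unparsed-text($csv-uri, 'UTF-8'), '[,\r\n]+')">
        <xsl:if test="normalize-space(.) != ''">
          <name><xsl:value-of select="normalize-space(.)"/></name>
        </xsl:if>
      </xsl:for-each>
    </names>
  </xsl:template>

</xsl:stylesheet>
```

With Saxon, for example, the named template can be used as the starting point (the -it:main option), and the result saved as name-list.xml.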

Next, some sample text:

<?xml version="1.0" encoding="UTF-8"?>
<doc>
 There are several names Papaver argemone in this document Papaver argemone
 Some of them are the same as others (Papaver Rhceas Papaver rhceas P. rhceas)
 Non ASCII characters shouldn't cause a problem in names like Zanthoxylum novæ-zealandiæ AKA Zanthoxylum novae-zealandiae
</doc>

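Finally, the stylesheet. It consists of three parts: the regexp variable, which builds a regexp from the names in the file; a default template for everything but text(); and a template for text() nodes that applies the regexp. For the list above, the variable evaluates to (Papaver argemone|Papaver dubium|Papaver Rhceas|Zanthoxylum novæ-zealandiæ).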

<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform" >


<xsl:variable name="regexp">
<xsl:value-of select="concat('(',string-join(document('name-list.xml')//name/text(), '|'), ')')"/>
</xsl:variable>


<xsl:template match="@*|*|processing-instruction()|comment()">
<xsl:copy>
<xsl:apply-templates select="@*|*|processing-instruction()|comment()|text()"/>
</xsl:copy>
</xsl:template>


<xsl:template match="text()">
<xsl:analyze-string select="." regex="{$regexp}">
<xsl:matching-substring>
<name type="taxonomic" subtype="matched">
<xsl:value-of select="regex-group(1)"/>
</name>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>

</xsl:stylesheet>
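
One thing to watch: the names are pasted into the regexp as-is, so a name containing a regexp metacharacter (the '.' in an abbreviated name like P. rhceas, say) would be read as a pattern rather than as literal text. A minimal sketch of one way around that, assuming it is acceptable to backslash-escape each name before joining, is the variant below. It is not the stylesheet used for the output in this post, and for the sample list it should behave identically, since none of those names contains a metacharacter.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Escape regexp metacharacters in every name, then join the names with '|' -->
  <xsl:variable name="regexp"
      select="concat('(',
                string-join(
                  for $n in document('name-list.xml')//name
                  return replace($n, '([\.\\\?\*\+\|\^\$\{\}\(\)\[\]])', '\\$1'),
                  '|'),
                ')')"/>

  <!-- Copy everything that is not a text node unchanged -->
  <xsl:template match="@*|*|processing-instruction()|comment()">
    <xsl:copy>
      <xsl:apply-templates select="@*|*|processing-instruction()|comment()|text()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Wrap each match in a name element; pass other text through -->
  <xsl:template match="text()">
    <xsl:analyze-string select="." regex="{$regexp}">
      <xsl:matching-substring>
        <name type="taxonomic" subtype="matched">
          <xsl:value-of select="regex-group(1)"/>
        </name>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```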

The output looks like:

<?xml version="1.0" encoding="UTF-8"?><doc>
 There are several names <name type="taxonomic" subtype="matched">Papaver argemone</name> in this document <name type="taxonomic" subtype="matched">Papaver argemone</name>
 Some of them are the same as others (<name type="taxonomic" subtype="matched">Papaver Rhceas</name> Papaver rhceas P. rhceas)
 Non ASCII characters shouldn't cause a problem in names like <name type="taxonomic" subtype="matched">Zanthoxylum novæ-zealandiæ</name> AKA Zanthoxylum novae-zealandiae
</doc>

As you may notice, I've not yet worked out the best way to handle the 'æ': the ASCII spelling Zanthoxylum novae-zealandiae is left untagged.
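
One possible approach, offered only as a sketch and not something I've settled on, is to widen the regexp so that every 'æ' in a listed name also matches the two-letter spelling 'ae'. The variant below assumes that rewriting each name this way is acceptable; it does not escape metacharacters and does not handle other ligatures.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- &#xE6; is the character 'æ'; each occurrence becomes the alternation (æ|ae) -->
  <xsl:variable name="regexp"
      select="concat('(',
                string-join(
                  for $n in document('name-list.xml')//name
                  return replace($n, '&#xE6;', '(&#xE6;|ae)'),
                  '|'),
                ')')"/>

  <!-- Copy everything that is not a text node unchanged -->
  <xsl:template match="@*|*|processing-instruction()|comment()">
    <xsl:copy>
      <xsl:apply-templates select="@*|*|processing-instruction()|comment()|text()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Inside xsl:matching-substring the context item is the matched text itself -->
  <xsl:template match="text()">
    <xsl:analyze-string select="." regex="{$regexp}">
      <xsl:matching-substring>
        <name type="taxonomic" subtype="matched">
          <xsl:value-of select="."/>
        </name>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

With this variant the regexp contains Zanthoxylum nov(æ|ae)-zealandi(æ|ae), so both spellings in the sample text should be tagged. The case differences (Papaver rhceas) are a separate matter; xsl:analyze-string accepts a flags attribute, and flags="i" requests case-insensitive matching.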


Monday 27 July 2009

Learning XSLT 2.0 Part 1; Finding Names
