Declarative Amsterdam

Plain text processing in structured documents

Nico Verwer
Rakensi
Abstract

Applications that analyze and process natural language are used for tasks such as named entity recognition, anonymization, topic extraction, and sentiment analysis. In most cases, these applications use the plain text of a document, and may add or change markup. This causes problems when the original document already contains markup that must be preserved. The text to be analyzed may run across markup boundaries, and newly generated markup may lead to unbalanced (non-well-formed) structures. This presentation shows how the Separated Markup API for XML (SMAX) can be used to apply natural language processing to XML documents. It preserves the existing document structure and allows for balanced insertion of new markup. A demonstration will be given of the use of SMAX for extracting and marking references in legal documents. This Link eXtractor was built for the Dutch center for governmental publications. SMAX and Simple Pipelines of Event API Transformers (SPEAT) will be available as open source software at the time of Declarative Amsterdam.

Table of contents

Introduction

The LX (van Opijnen, Verwer and Meijer 2015) is an application that detects and resolves references to legislation, case law and other governmental and legal documents. These references occur as pieces of text in the document content, with a loosely defined structure. The LX recognizes these text fragments, marks them (usually with an XML element) and provides a link to the referred document if possible. The LX is used in a large-scale production environment and has high recall and precision.

The link detection algorithm of LX consists of Named Entity Recognition (NER) and parsers using Parsing Expression Grammars (PEG), plus some specific operations for disambiguation, reference construction and link resolution. The result produced by the LX can be a list of references with links, or links that are embedded in the input document.

The input documents of the LX are almost always XML documents. The LX detects textual references in the document and adds linking markup around the reference. For example, consider the following fragment.[1]

<para> […] expressed by the Supreme Court in its judgment of 16 February 2010, published in NS 2010, 98, also in NJ 2010, 232 with annotation of M.J. Borgers, and RvdW 2007, 420 (case R06/090) […] </para>

The words making up references are shown in bold print. The LX detects these parts and combines them. Thus, the phrases “Supreme Court 16 February 2010”, “NS 2010, 98” and “NJ 2010, 232” are all references (although the first one is incomplete). The link resolution process will find that these references point to the same document, which may be referred to by the European Case Law Identifier “ECLI:NL:HR:2010:BK6357”. In the output document, the added markup would look like

<para>[…] expressed by the <link ecli="ECLI:NL:HR:2010:BK6357">Supreme Court in its judgment of 16 February 2010, published in NS 2010, 98, also in NJ 2010, 232</link> with annotation of M.J. Borgers, and <link ecli="ECLI:NL:HR:2007:AZ7628">RvdW 2007/420 (case R06/090</link>) […]</para>

In this example, the LX just analyzes the text content of a paragraph element. However, there might be markup embedded in the input document, like:

<para>[…] expressed by the Supreme Court in its judgment of <date iso-8601-date="2010-02-16">16 February 2010</date>, published in NS 2010, 98, also in NJ 2010, 232 with annotation of M.J. Borgers, and RvdW 2007, 420 (case R06/090) […]</para>

The <date> element must be preserved in the output and will be contained within the added <link>. In order to detect the components of the link, the LX needs to look at the plain text contained in the paragraph and ignore the <date> element. This is because the NER and PEG parsers work on plain text and produce a parse tree. After several more steps, this parse tree must be merged with the original markup of the input document.

The LX must of course produce well-formed XML. This can be challenging when newly inserted XML elements overlap with other elements. These elements may be present in the original document, or they may be added by the LX. For example, after some parsing and other processing steps, the fragment given before will be transformed into:

<para>[…] expressed by the <lx:INSTANTIE start="23" end="32" norm="HR">Supreme Court</lx:INSTANTIE> in its judgment of <date iso-8601-date="2010-02-16">16 February 2010</date>, published in NS 2010, 98, also in NJ 2010, 232 with annotation of M.J. Borgers, and RvdW 2007, 420 (case R06/090) […]</para>

When inserting the resolved link, the LX must decide if it should put the <link ecli="ECLI:NL:HR:2010:BK6357"> element before or after the <lx:INSTANTIE start="23" end="32" norm="HR"> element. In this case, it is obvious that the <link> should come before <lx:INSTANTIE>. But what can the LX do when someone decides to add some emphasis, as in

<para>[…] expressed by <em>the Supreme Court</em> in its judgment of 16 February 2010, published in NS 2010, 98, also in NJ 2010, 232 with annotation of M.J. Borgers, and RvdW 2007, 420 (case R06/090) […]</para>

Should it draw the word “the” before “Supreme Court” into the <link> element? Should it split the <em> element, or perhaps generate an error?

The example given above illustrates some of the challenges in what the LX should do. More problems arise when we consider how the LX should do it.

The LX performs a sequence of steps, each of which produces its own additional markup. Some of these steps undo the results of earlier steps, fixing erroneous or ambiguous references. As we saw earlier, after each step the newly added elements must be inserted into the document in such a way that it remains well-formed XML. All parsing could be done within XSLT, but this very quickly leads to a lot of accidental complexity[2] and unwieldy stylesheets. Therefore, the LX is constructed as a pipeline of approximately 70 small and medium-sized steps. Many steps are XSLT transformations or SAX-processing Java components. The NER and PEG parsing steps work on the plain text content of the document and add markup.

In the original version of the LX the PEG parsers work on a serialized version of the document, and have to take into account the (serialized) markup in the original document as well as XML elements added by earlier steps. For example, consider the phrase

EG is a treaty, but also part of the reference HvJ EG 18 juli 2007, C-231/05

One of the first steps is named entity recognition for legislation, which recognizes “EG” twice as legislation identified by ‘BWBV0001506’:

<lx:regeling name="BWBV0001506">EG</lx:regeling> is a treaty, but also part of the reference HvJ <lx:regeling name="BWBV0001506">EG</lx:regeling> 18 juli 2007, C-231/05

In order to make this suitable for parsing by a PEG parser, the added markup will be serialized into

&lt;lx:regeling name="BWBV0001506"&gt;EG&lt;/lx:regeling&gt; is a treaty, but also part of the reference HvJ &lt;lx:regeling name="BWBV0001506"&gt;EG&lt;/lx:regeling&gt; 18 juli 2007, C-231/05

This means that the PEG grammar[3] must contain constructions like the following, to preserve serialized markup:

regeling <- lx_regeling_start *(![<] .) lx_regeling_end
lx_regeling_start <- '<lx:regeling' *(![>] .) '>'
lx_regeling_end <- '</lx:regeling>'

Thus, the PEG parser for legislative or case law references contains a small part of an XML parser as well. A consequence is that, after the parser has done its work, the serialized XML elements have been turned into XML elements themselves:

<lx:Regeling start="0" end="93"><lx:Lx_regeling_start start="0" end="77"> &lt;lx:regeling xmlns:lx="http://linkeddata.overheid.nl/lx/" name="BWBV0001506"&gt; </lx:Lx_regeling_start>EG<lx:Lx_regeling_end start="79" end="93"> &lt;/lx:regeling&gt; </lx:Lx_regeling_end></lx:Regeling> is a treaty, […]

After each parsing step, a ‘normalization’ stylesheet removes the superfluous markup. In some cases, LX uses Saxon’s parse-xml function to restore markup that had to be serialized before parsing. This mixing of XML-processing and text-processing components is rather inefficient, and is a source of accidental complexity.

In one of the processing steps, the LX parses case law references. The European Court of Justice is a particularly interesting case, since it is, and has been, known by many names and abbreviations. One of these abbreviations (in Dutch) is “HvJ EG”. As we saw above, in an earlier step the NER has put the <lx:regeling> element around “EG”. Again, the LX needs to compensate for this in the PEG grammar:

EUROPEES
 <= :lx_regeling_start EUROPEES :lx_regeling_end
  | 'E'("g"|"u"?('rope'('es'|'se')))

This will throw away the incorrect <lx:regeling> element. But it also shows how accidental complexity enters the PEG grammar. This grammar is supposed to be maintained by domain experts, to whom the :lx_regeling_start and :lx_regeling_end terms do not make sense. It would be much better if we had only ‘functional’ grammar rules.

The SMAX (and SPEAT) software library has been designed to solve the problems arising from the application of text-processing components to XML documents, by separating markup and text content. How this is done is discussed in the remainder of this document.

SMAX: Separated Markup API for XML

The standard XML processing APIs, such as DOM and SAX (and LINQ, if you use .NET), view an XML document as a tree. The same is true for XSLT, XPath and XQuery. The tree representation of an XML document consists of element nodes, with text content distributed across the leaves of the tree. It is possible in XSLT or XPath to operate on the complete text content of the document (or of an element), but this loses the structure of the document.

SMAX (Separated Markup API for XML) was designed in order to be able to process the text content of an XML document in a linear way, without intervening markup. At the same time, SMAX retains and can modify the structure and markup of the document. As the name indicates, SMAX treats the markup (element tree structure) and text of an XML document separately. A typical use case is text analysis inserting markup around recognized text fragments, as described in the introduction. Other applications include anonymization, sentiment analysis and topic extraction.

SMAX is an event API, like SAX (Simple API for XML, https://en.wikipedia.org/wiki/Simple_API_for_XML). Event APIs implement ‘push parsing’: a processing component accepts events generated by another component, which may be a parser. By contrast, in ‘pull parsing’ the application decides when to ask for more data (from another component such as a parser). Unlike SAX, which provides methods for fine-grained events such as startElement or characters, SMAX has just one event, process, which operates on the markup (structure) and the content (unstructured, linear text) of a document as separate properties of a SmaxDocument object. In this way it combines aspects of event-driven APIs like SAX with aspects of the DOM (Document Object Model), which provides access to parts of the XML document.

Figure 1: https://declarative.amsterdam/resources/prd-data/article/da.2020/da.2020.verwer.plain-text-processing/figures/figure01.svg

In SMAX, the XML tree structure and text content are stored separately. Each element records its ‘span’ in the text content as a start position and an end position. This representation can be made from SAX or DOM, and converting back into SAX or DOM is easy. Of course, there are well-formedness constraints on the start position and end position, which must be maintained by SMAX operations.

The SmaxDocument class has an insertMarkup method that inserts a SmaxElement without child elements into the markup tree of a document. The new SmaxElement is inserted at a given start position and end position, relative to the text content of the document. Thus, it will surround a piece of text and markup between the start and end position. Unfortunately, this will not always lead to well-formed XML, as illustrated below.
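As a rough sketch of the idea, a balanced insertion can be modeled as recording a span over the linear text content. This is not the real SMAX code; the class and method names below merely follow the names used in the text, and the serialization handles only a single non-nested span for brevity.

```java
import java.util.ArrayList;
import java.util.List;

class SmaxModel {
    // An element is a name plus a [start, end) span into the text content.
    record Span(String name, int start, int end) {}

    static String text = "..!!!..";
    static List<Span> markup = new ArrayList<>();

    // Balanced insertion: wrap the characters in [start, end) with a new element.
    static void insertMarkup(String name, int start, int end) {
        markup.add(new Span(name, start, end));
    }

    // Serialize the text plus the first recorded span back to XML.
    static String serialize() {
        Span m = markup.get(0);
        return "<p>" + text.substring(0, m.start())
             + "<" + m.name() + ">" + text.substring(m.start(), m.end())
             + "</" + m.name() + ">" + text.substring(m.end()) + "</p>";
    }

    public static void main(String[] args) {
        insertMarkup("M", 2, 5);         // wrap the "!!!" in "..!!!.."
        System.out.println(serialize()); // <p>..<M>!!!</M>..</p>
    }
}
```

Because markup lives in a separate list of spans, a text-processing step can scan `text` without ever seeing a tag, which is the core point of the separated representation.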

Figure 2: https://declarative.amsterdam/resources/prd-data/article/da.2020/da.2020.verwer.plain-text-processing/figures/figure02.svg

SMAX has several strategies to handle unbalanced (non-well-formed) insertions. The user can select a strategy with the balancing parameter of insertMarkup. The available Balancing values are: OUTER, INNER, START, END, BALANCE_TO_START, and BALANCE_TO_END. The effect of each of these strategies is the topic of the next section.

Balancing strategies

When discussing balancing strategies, we will take a rather abstract view, in order to deal with all possible cases. The abstract cases can then be made concrete for specific examples.

Consider the situation where a new XML element <M></M> is inserted in a document which already has the elements <p></p><r></r><q></q>. (The order p, r, q is used because it looks symmetric.) We will refer to these elements and their content as just <M>, <p>, <r>, and <q>. We assume that <M> has no overlap with elements other than <p>, <r>, and <q>. The elements <p>, <r>, and <q> can be generalized to any number of elements.

The easiest case is an insertion of <M> with start and end positions that lead to a balanced tree, so no balancing is needed. This is the case if <M> is fully contained within one of the existing elements, say <p>; if <M> surrounds <p>, <r>, and <q>, but does not overlap with other elements; or if <M> is inserted between two elements, say <p> and <q>, and no other elements <r> exist between <p> and <q>.

Figure 3: https://declarative.amsterdam/resources/prd-data/article/da.2020/da.2020.verwer.plain-text-processing/figures/figure03.svg

In this case, the strategies OUTER and INNER, as well as BALANCE_TO_START and BALANCE_TO_END, all lead to the same result. This is because BALANCE_TO_START and BALANCE_TO_END only behave differently when balancing is needed.

  • If <M> surrounds <p>, <r>, and <q>:

    Replace <p>, <r>, and <q> by <M> in the parent node and insert <p>, <r>, and <q> as the content of <M>.

  • If <M> is inserted between <p> and <q>:

    Insert <M> into the parent node between <p> and <q>. (This can be seen as the previous case for zero surrounded elements.)

  • If <M> is contained within an element <p>:

    Insert <M> recursively into the content of <p> using the same start and end positions.

The START and END strategies will collapse the start and end positions of <M>.

  • If <M> is contained within an element <p>:

    Insert <M> recursively into the content of <p> keeping start and end positions (or collapsing start and end positions according to START or END).

  • If <M> surrounds <p>, <r>, and <q>:

    START: Insert <M> before <p> in the parent node, replace the end position of <M> by the start position;

    END: Insert <M> after <q> in the parent node, replace the start position of <M> by the end position.

  • If <M> is inserted between <p> and <q>:

    Insert <M> into the parent node between <p> and <q>.

    START: Replace the end position of <M> by the start position;

    END: Replace the start position of <M> by the end position.

There are several ways in which the inserted element <M> can partially overlap with existing elements. In these cases we need to adjust the start and/or end positions in order to get a balanced (well-formed) tree.

Figure 4: https://declarative.amsterdam/resources/prd-data/article/da.2020/da.2020.verwer.plain-text-processing/figures/figure04.svg

OUTER:

  • If the start position of <M> is inside <p> and the end position of <M> is inside <q>:

    Replace <p>, <r>, and <q> by <M> in the parent node and insert <p>, <r>, and <q> as the content of <M>. Make the start position of <M> equal to the start position of <p> and make the end position of <M> equal to the end position of <q>.

  • If the start position of <M> is inside <p> and the end position of <M> is not inside any element:

    Replace <p> and <r> by <M> in the parent node and insert <p>, <r> as the content of <M>. Make the start position of <M> equal to the start position of <p>.

  • If the start position of <M> is not inside any element and the end position of <M> is inside <q>:

    Replace <r> and <q> by <M> in the parent node and insert <r>, <q> as the content of <M>. Make the end position of <M> equal to the end position of <q>.

INNER:

  • If the start position of <M> is inside <p> and the end position of <M> is inside <q>:

    Replace <r> by <M> in the parent node and insert <r> as the content of <M>. Make the start position of <M> equal to the end position of <p> and make the end position of <M> equal to the start position of <q>.

  • If the start position of <M> is inside <p> and the end position of <M> is not inside any element:

    Replace <r> by <M> in the parent node and insert <r> as the content of <M>. Make the start position of <M> equal to the end position of <p>.

  • If the start position of <M> is not inside any element and the end position of <M> is inside <q>:

    Replace <r> by <M> in the parent node and insert <r> as the content of <M>. Make the end position of <M> equal to the start position of <q>.

Figure 5: https://declarative.amsterdam/resources/prd-data/article/da.2020/da.2020.verwer.plain-text-processing/figures/figure05.svg

START, BALANCE_TO_START (these are the same, because balancing is needed):

  • If the start position of <M> is inside <p> and the end position of <M> is inside <q>:

    Replace the end position of <M> by the start position and insert <M> into <p>.

  • If the start position of <M> is inside <p> and the end position of <M> is not inside any element:

    Replace the end position of <M> by the start position and insert <M> into <p>.

  • If the start position of <M> is not inside any element and the end position of <M> is inside <q>:

    Replace the end position of <M> by the start position and insert <M> in the parent node before <r>.

END, BALANCE_TO_END (these are the same, because balancing is needed):

  • If the start position of <M> is inside <p> and the end position of <M> is inside <q>:

    Replace the start position of <M> by the end position and insert <M> into <q>.

  • If the start position of <M> is inside <p> and the end position of <M> is not inside any element:

    Replace the start position of <M> by the end position and insert <M> in the parent node after <r>.

  • If the start position of <M> is not inside any element and the end position of <M> is inside <q>:

    Replace the start position of <M> by the end position and insert <M> into <q>.
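The overlap rules above amount to span arithmetic on start and end positions. The following sketch (ad hoc names, not SMAX code) models only the case where the start of <M> falls inside <p> and its end inside <q>:

```java
// Span adjustment for a partially overlapping insertion:
// <p> and <q> are existing sibling elements, m overlaps both.
class BalancingSketch {
    record Span(int start, int end) {}

    static Span balance(String strategy, Span p, Span q, Span m) {
        return switch (strategy) {
            case "OUTER" -> new Span(p.start(), q.end());   // grow M to enclose <p>..<q>
            case "INNER" -> new Span(p.end(), q.start());   // shrink M to between <p> and <q>
            case "START" -> new Span(m.start(), m.start()); // collapse M to its start, inside <p>
            case "END"   -> new Span(m.end(), m.end());     // collapse M to its end, inside <q>
            default -> throw new IllegalArgumentException(strategy);
        };
    }

    public static void main(String[] args) {
        Span p = new Span(0, 5), q = new Span(8, 13), m = new Span(3, 10);
        System.out.println(balance("OUTER", p, q, m)); // Span[start=0, end=13]
        System.out.println(balance("INNER", p, q, m)); // Span[start=5, end=8]
    }
}
```

OUTER and INNER change the extent of <M>; START and END give up the extent entirely and keep only a position, which matches the collapsing behavior described above.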

The rules for insertion can be illustrated by the effect they have on actual XML fragments. Suppose that we want to mark the phrase “!!!” in a document with the <M> element. Other characters will be represented by dots. This would transform the fragment “<p>..!!!..</p>” into “<p>..<M>!!!</M>..</p>” if the OUTER or INNER balancing strategy is used; no balancing is required.

The balanced insertions would look like the examples in the table below.

Figure 6: https://declarative.amsterdam/resources/prd-data/article/da.2020/da.2020.verwer.plain-text-processing/figures/figure06.svg

Non-balanced insertions are illustrated in the next table.

Figure 7: https://declarative.amsterdam/resources/prd-data/article/da.2020/da.2020.verwer.plain-text-processing/figures/figure07.svg

The END and BALANCE_TO_END strategies are not shown because they are very similar to START.

SPEAT: Simple Pipelines of Event API Transformers

XML document processing is often done using pipelines, facilitated by technologies like XProc (https://xproc.org/) and Apache Cocoon (http://cocoon.apache.org/). SPEAT is a Java library for building pipelines from event APIs like SAX and SMAX. Each step in a SPEAT pipeline is a transformer from one event API type to another. Examples are SaxToSmaxAdapter, which transforms from SAX to SMAX, and NamedEntityRecognizer, which transforms from SMAX to SMAX. A transformer is a (one-step) pipeline component, and a pipeline can be used as a transformer in another pipeline.

Event APIs like SAX and SMAX are defined by Java interfaces, where each event is an (abstract) method. For example, in SAX an XML document is represented by a sequence of method calls:

startDocument(), startElement(uri, lname, qname, attrs), …, endElement(uri, lname, qname), endDocument()

This is a very efficient way to represent events. No reification takes place, i.e., events are not instances of some Event interface that are passed between pipeline stages.
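A minimal event interface in this style might look as follows. The interface is invented for illustration (SAX's ContentHandler has many more methods); the point is that each stage implements the interface and forwards plain method calls to the next stage, with no event objects in between.

```java
// Events are plain method calls, as in SAX; no Event objects are created.
interface TextEvents {
    void startDocument();
    void characters(String text);
    void endDocument();
}

// A stage that upper-cases character events and forwards everything else.
class UpperCaseStage implements TextEvents {
    private final TextEvents next;
    UpperCaseStage(TextEvents next) { this.next = next; }
    public void startDocument() { next.startDocument(); }
    public void characters(String text) { next.characters(text.toUpperCase()); }
    public void endDocument() { next.endDocument(); }
}

class EventDemo {
    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        // A sink that collects the events it receives.
        TextEvents sink = new TextEvents() {
            public void startDocument() { out.append("["); }
            public void characters(String t) { out.append(t); }
            public void endDocument() { out.append("]"); }
        };
        TextEvents pipeline = new UpperCaseStage(sink);
        pipeline.startDocument();
        pipeline.characters("smax");
        pipeline.endDocument();
        System.out.println(out); // [SMAX]
    }
}
```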

The basis of SPEAT is

interface Pipeline<S, T> extends EventHandler<S>, EventSupplier<T>

which is a pipeline that transforms S events into T events.[4] The type parameters S and T are event APIs. A Pipeline<S, T> receives S events and generates T events. Usually, there is no one-to-one correspondence between the S events and the T events.

All events are passed by calling the corresponding method.

Figure 8: https://declarative.amsterdam/resources/prd-data/article/da.2020/da.2020.verwer.plain-text-processing/figures/figure08.svg

The Pipeline<S, T> interface has a method

<U> Pipeline<S, U> append(Pipeline<T, U>)

that makes a Pipeline<S, U> by appending a Pipeline<T, U> to a Pipeline<S, T>. It also provides logging to classes that implement the Pipeline interface.
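Append-style composition can be sketched as follows. The Pipe interface here is hypothetical: the real SPEAT Pipeline is push-based and parameterized over event API types, whereas this sketch maps plain values, which keeps the type-level idea of append visible in a few lines.

```java
// A hypothetical value-mapping pipe; SPEAT's Pipeline is push-based instead.
interface Pipe<S, T> {
    T handle(S event);

    // Append a Pipe<T, U> to this Pipe<S, T>, yielding a Pipe<S, U>.
    default <U> Pipe<S, U> append(Pipe<T, U> next) {
        return event -> next.handle(this.handle(event));
    }
}

class PipeDemo {
    public static void main(String[] args) {
        Pipe<String, Integer> length = String::length;     // String -> Integer
        Pipe<Integer, String> label = n -> "length=" + n;  // Integer -> String
        Pipe<String, String> both = length.append(label);  // composed String -> String
        System.out.println(both.handle("SMAX")); // length=4
    }
}
```

The type signature <U> Pipe<S, U> append(Pipe<T, U>) guarantees at compile time that only stages with matching event APIs can be connected.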

At this moment, some basic pipeline components are available in SPEAT. There are components to transform between the SAX and SMAX event APIs and event APIs for handling plain text. There are also components for input and output. The only components that do ‘real’ work are the RegexContentMatcher and the NamedEntityRecognizer.

An adapter class[5] has been made to use SPEAT pipelines as part of pipelines in Apache Cocoon (http://cocoon.apache.org/), and if the need arises, an adapter can be made to use SPEAT from an XProc (https://xproc.org/) implementation.

SPEAT and SMAX are still under active development, but the author considers the code stable enough to be released as open source software. There are more than 100 unit tests, which also serve as examples of how to use SPEAT (https://github.com/nverwer/SPEAT). Some of the technical documentation needs to be revised, and more XML structure-transforming operations, such as element deletion, will be added. In its current state it is possible to develop useful components that process the plain text content of XML documents and add markup while keeping the document well-formed.

The author’s intention is not to make SPEAT into a framework like XProc or Apache Cocoon. It is a library that can be used inside other frameworks or on its own.

Acknowledgments

Much of the work described in this document was done while working on the Link eXtractor (LX), and the author wishes to thank Marc van Opijnen of the Expert Center for Official Governmental Publications (KOOP) in the Netherlands for making this work both necessary and possible. The SMAX and SPEAT software libraries described in this document were developed in the author’s own time, and will be released as open source software.

Notes

1. The LX works on Dutch texts only, although this depends on the grammars that are used. Where possible we have translated examples into English. The LX will not work on the English examples.
2. Accidental complexity is complexity that is added for technical reasons, as opposed to essential complexity, which is inherent to the functionality of a program. See Brooks 1986.
3. LX uses the Waxeye parser generator, which places grammar operators like ‘*’ and ‘!’ in front of the term that they apply to.
4. EventHandler and EventSupplier are very simple Java interfaces, which determine whether their parameter event API class is used for input or output.
5. This adapter is not available as open source, but is part of the Link eXtractor.

References

van Opijnen, Marc, Nico Verwer, and Jan Meijer
2015. Beyond the Experiment: The Extendable Legal Link Extractor. Workshop on Automated Detection, Extraction and Analysis of Semantic Information in Legal Texts, held in conjunction with the 2015 International Conference on Artificial Intelligence and Law (ICAIL), June 08-12, 2015, San Diego, CA, USA. Available at SSRN: https://ssrn.com/abstract=2626521
XProc homepage. Available at https://xproc.org/
Apache Cocoon homepage. Available at http://cocoon.apache.org/
Brooks, Fred P.
1986. No Silver Bullet – Essence and Accident in Software Engineering. In: Proceedings of the IFIP Tenth World Computing Conference, H-J Kugler (ed.): 1069–1076. Amsterdam: Elsevier.

Biographical notes

Nico Verwer works as a freelance software developer, designer, architect and trouble-shooter. His clients are mainly companies in the fields of publishing, media and government services, but also fit20, the world market leader in High Intensity Resistance fitness training. Nico has no preferred programming language, because he values understanding the application domain over knowledge of a particular technology. However, he does prefer techniques and methods that minimize accidental complexity. During his career he has deleted more lines of code than he has written.