Plain text processing in structured documents
Abstract
Applications that analyze and process natural language can be used for tasks like named entity recognition, anonymization, topic extraction, and sentiment analysis. In most cases, these applications use the plain text of a document, and may add or change markup. This causes problems when the original document already contains markup that must be preserved. The text to be analyzed may run across markup boundaries, and newly generated markup may lead to unbalanced (non-well-formed) structures. This presentation shows how the Separated Markup API for XML (SMAX) can be used to apply natural language processing to XML documents. It preserves the existing document structure and allows for balanced insertion of new markup. A demonstration will be given of the use of SMAX for extracting and marking references in legal documents. This Link eXtractor was built for the Dutch center for governmental publications. SMAX and Simple Pipelines of Event API Transformers (SPEAT) will be available as open source software at the time of Declarative Amsterdam.
Introduction
The LX (van Opijnen, Marc, Nico Verwer, and Jan Meijer, 2015. “Beyond the Experiment: The Extendable Legal Link Extractor”. Workshop on Automated Detection, Extraction and Analysis of Semantic Information in Legal Texts, held in conjunction with the 2015 International Conference on Artificial Intelligence and Law (ICAIL), June 8–12, 2015, San Diego, CA, USA. Available at SSRN: https://ssrn.com/abstract=2626521) is an application that detects and resolves references to legislation, case law and other governmental and legal documents. These references occur as pieces of text in the document content, with a loosely defined structure. The LX recognizes these text fragments, marks them (usually with an XML element) and provides a link to the referred document if possible. The LX is used in a large-scale production environment and has high recall and precision.
The link detection algorithm of LX consists of Named Entity Recognition (NER) and parsers using Parsing Expression Grammars (PEG), plus some specific operations for disambiguation, reference construction and link resolution. The result produced by the LX can be a list of references with links, or links that are embedded in the input document.
The input documents of the LX are almost always XML documents. The LX detects textual references in the document and adds linking markup around each reference. For example, consider the following fragment. (The LX works on Dutch texts only, although this depends on the grammars that are used. Where possible, we have translated examples into English; the LX will not work on the English examples.)
<para> […] expressed by the Supreme Court in its judgment of 16 February 2010, published in NS 2010, 98, also in NJ 2010, 232 with annotation of M.J. Borgers, and RvdW 2007, 420 (case R06/090) […] </para>
The LX detects the words making up references and combines them. Thus, the phrases “Supreme Court 16 February 2010”, “NS 2010, 98” and “NJ 2010, 232” are all references (although the first one is incomplete). The link resolution process will find that these references point to the same document, which may be referred to by the European Case Law Identifier “ECLI:NL:HR:2010:BK6357”. In the output document, the added markup would look like
<para>[…] expressed by the <link ecli="ECLI:NL:HR:2010:BK6357">Supreme Court in its judgment of 16 February 2010, published in NS 2010, 98, also in NJ 2010, 232</link> with annotation of M.J. Borgers, and <link ecli="ECLI:NL:HR:2007:AZ7628">RvdW 2007/420 (case R06/090</link>) […]</para>
In this example, the LX just analyzes the text content of a paragraph element. However, there might be markup embedded in the input document, like:
<para>[…] expressed by the Supreme Court in its judgment of <date iso-8601-date="2010-02-16">16 February 2010</date>, published in NS 2010, 98, also in NJ 2010, 232 with annotation of M.J. Borgers, and RvdW 2007, 420 (case R06/090) […]</para>
The <date> element must be preserved in the output, and will be contained within the added <link>.
In order to detect the components of the link, the LX needs to look at the plain text contained in the paragraph and ignore the <date> element. This is because the NER and PEG parsers work on plain text and produce a parse tree. After several more steps, this parse tree must be merged with the original markup of the input document.
The LX must of course produce well-formed XML. This can be challenging when newly inserted XML elements overlap with other elements. These elements may be present in the original document, or they may be added by the LX. For example, after some parsing and other processing steps, the fragment given before will be transformed into:
<para>[…] expressed by the <lx:INSTANTIE start="23" end="32" norm="HR">Supreme Court</lx:INSTANTIE> in its judgment of <date iso-8601-date="2010-02-16">16 February 2010</date>, published in NS 2010, 98, also in NJ 2010, 232 with annotation of M.J. Borgers, and RvdW 2007, 420 (case R06/090) […]</para>
When inserting the resolved link, the LX must decide if it should put the <link ecli="ECLI:NL:HR:2010:BK6357"> element before or after the <lx:INSTANTIE start="23" end="32" norm="HR"> element. In this case, it is obvious that the <link> should come before <lx:INSTANTIE>. But what can the LX do when someone decides to add some emphasis, as in
<para>[…] expressed by <em>the Supreme Court</em> in its judgment of 16 February 2010, published in NS 2010, 98, also in NJ 2010, 232 with annotation of M.J. Borgers, and RvdW 2007, 420 (case R06/090) […]</para>
Should it draw the word “the” before “Supreme Court” into the <link> element? Should it split the <em> element, or perhaps generate an error?
The example given above illustrates some of the challenges in what the LX should do. More problems arise when we consider how the LX should do it.
The LX performs a sequence of steps, each of which produces its own additional markup. Some of these steps will undo the results of earlier steps, fixing erroneous or ambiguous references. As we saw earlier, after each of the steps, the elements that have been added must be inserted into the document in such a way that it remains well-formed XML. All parsing could be done within XSLT, but this very quickly leads to a lot of accidental complexity and unwieldy stylesheets. (Accidental complexity is complexity that is added for technical reasons, as opposed to essential complexity, which is inherent in the functionality of a program; see Brooks, Fred P., 1986. “No Silver Bullet – Essence and Accident in Software Engineering”. In: Proceedings of the IFIP Tenth World Computing Conference, H-J Kugler (ed.): 1069–1076. Amsterdam: Elsevier.) Therefore, the LX is constructed as a pipeline of approximately 70 small and medium-sized steps. Many steps are XSLT transformations or SAX-processing Java components. The NER and PEG parsing are steps that work on the plain text content of the document and add markup.
In the original version of the LX, the PEG parsers work on a serialized version of the document and have to take into account the (serialized) markup in the original document, as well as XML elements added by earlier steps. For example, consider the phrase
EG is a treaty, but also part of the reference HvJ EG 18 juli 2007, C-231/05
One of the first steps is named entity recognition for legislation, which recognizes “EG” twice as the legislation identified by ‘BWBV0001506’.
<lx:regeling name="BWBV0001506">EG</lx:regeling> is a treaty, but also part of the reference HvJ <lx:regeling name="BWBV0001506">EG</lx:regeling> 18 juli 2007, C-231/05
In order to make this suitable for parsing by a PEG parser, the added markup will be serialized into
<lx:regeling name="BWBV0001506">EG</lx:regeling> is a treaty,
but also part of the reference HvJ <lx:regeling name="BWBV0001506">EG
</lx:regeling> 18 juli 2007, C-231/05
This means that the PEG grammar must contain constructions like the following to preserve serialized markup (the LX uses the Waxeye parser generator, which places grammar operators like ‘*’ and ‘!’ in front of the term that they apply to):
regeling <- lx_regeling_start *(![<] .) lx_regeling_end
lx_regeling_start <- '<lx:regeling' *(![>] .) '>'
lx_regeling_end <- '</lx:regeling>'
Thus, the PEG parser for legislative or case law references contains a small part of an XML parser as well. A consequence is that, after the parser has done its work, the serialized XML elements have been turned into XML elements themselves:
<lx:Regeling start="0" end="93"><lx:Lx_regeling_start start="0" end="77">
<lx:regeling xmlns:lx="http://linkeddata.overheid.nl/lx/" name="BWBV0001506">
</lx:Lx_regeling_start>EG<lx:Lx_regeling_end start="79" end="93">
</lx:regeling>
</lx:Lx_regeling_end></lx:Regeling> is a treaty, […]
After each parsing step, there is a ‘normalization’ stylesheet to get rid of the superfluous markup. In some cases, the LX uses Saxon’s parse-xml function to restore markup that had to be serialized before parsing. Mixing XML-processing and text-processing components in this way is rather inefficient, and is a source of accidental complexity.
In one of the processing steps, the LX parses case law references.
The European Court of Justice is a particularly interesting case, since it is, and has been, known by many names and abbreviations.
One of these abbreviations (in Dutch) is “HvJ EG”.
As we saw above, in an earlier step the NER has put the <lx:regeling> element around “EG”. Again, the LX needs to compensate for this in the PEG grammar:
EUROPEES <= :lx_regeling_start EUROPEES :lx_regeling_end
          | 'E' ("g" | "u"? ('rope' ('es'|'se')))
This will throw away the incorrect <lx:regeling> element. But it also shows how accidental complexity enters the PEG grammar. This grammar is supposed to be maintained by domain experts, to whom the :lx_regeling_start and :lx_regeling_end terms do not make sense. It would be much better if we only had ‘functional’ grammar rules, like the sketch below.
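For example (a sketch, not an actual LX grammar rule), if the serialized markup no longer had to be matched, the rule above could shrink to its purely ‘functional’ part:
EUROPEES <= 'E' ("g" | "u"? ('rope' ('es'|'se')))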
The SMAX (and SPEAT) software library has been designed to solve the problems arising from the application of text-processing components to XML documents, by separating markup and text content. How this is done is discussed in the remainder of this document.
SMAX: Separated Markup API for XML
The standard XML processing APIs, such as DOM and SAX (and LINQ, if you use .NET), view an XML document as a tree. The same is true for XSLT, XPath and XQuery. The tree representation of an XML document consists of element nodes, with text content distributed across the leaves of the tree. It is possible in XSLT or XPath to operate on the complete text content of the document (or of an element), but this loses the structure of the document.
SMAX (Separated Markup API for XML) was designed to make it possible to process the text content of an XML document in a linear way, without intervening markup. At the same time, SMAX retains and can modify the structure and markup of the document. As the name indicates, SMAX treats the markup (element tree structure) and the text of an XML document separately. A typical use case is text analysis that inserts markup around recognized text fragments, as described in the introduction. Other applications include anonymization, sentiment analysis and topic extraction.
SMAX is an event API, like SAX (Simple API for XML; see https://en.wikipedia.org/wiki/Simple_API_for_XML). Event APIs implement ‘push parsing’: a processing component accepts events generated by another component, which may be a parser. By contrast, in ‘pull parsing’, the application decides when to ask for more data (from another component, such as a parser). Unlike SAX, which provides methods for fine-grained events such as startElement or characters, SMAX has just one event, called process, which operates on the markup (structure) and the content (unstructured, linear text) of a document as separate properties of a SmaxDocument object. In this way, SMAX combines aspects of event-driven APIs like SAX with aspects of the DOM (Document Object Model), which provides access to parts of the XML document.
In SMAX, the XML tree structure and text content are stored separately. Each element records its ‘span’ in the text content as a start position and an end position. This representation can be made from SAX or DOM, and converting back into SAX or DOM is easy. Of course, there are well-formedness constraints on the start position and end position, which must be maintained by SMAX operations.
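As a small illustration, consider the fragment <p>Hello <em>world</em>!</p>. Its text content is the single string “Hello world!”, while the markup tree records that <p> spans positions 0–12 and <em> spans positions 6–11 of that string (assuming zero-based, end-exclusive positions). A text-processing component can scan “Hello world!” as ordinary linear text, while structural changes such as insertions are expressed in terms of positions in that text.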
The SmaxDocument class has an insertMarkup method that inserts a SmaxElement without child elements into the markup tree of a document. The new SmaxElement is inserted at a given start position and end position, relative to the text content of the document. Thus, it will surround a piece of text and markup between the start and end position. Unfortunately, this will not always lead to well-formed XML, as illustrated below.
SMAX has several strategies to handle unbalanced (non-well-formed) insertions. The user can select a strategy with the balancing parameter of insertMarkup. The available Balancing values are: OUTER, INNER, START, END, BALANCE_TO_START, and BALANCE_TO_END. The effect of each of these strategies is the topic of the next section.
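As a usage illustration, the following sketch marks a case law reference by surrounding part of the text content with a link element. It is hypothetical: the SmaxElement constructor, the attribute setter and the parameter order of insertMarkup are assumptions; only the class and method names and the Balancing values are taken from the description above.

// Hypothetical sketch; imports and package names of the SMAX classes are omitted.
void markReference(SmaxDocument document) {
  SmaxElement link = new SmaxElement("link");           // a new element without child elements (assumed constructor)
  link.setAttribute("ecli", "ECLI:NL:HR:2010:BK6357");  // assumed attribute setter
  // Surround the text (and markup) between character positions 23 and 52 of the
  // document's text content, balancing an unbalanced insertion towards the outside.
  document.insertMarkup(link, Balancing.OUTER, 23, 52);
}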
Balancing strategies
When discussing balancing strategies, we will take a rather abstract view, in order to deal with all possible cases. The abstract cases can then be made concrete for specific examples.
Consider the situation where a new XML element <M>…</M> is inserted in a document which already has the elements <p>…</p>…<r>…</r>…<q>…</q>. (The order p, r, q is used because it looks symmetric.) We will refer to these elements and their content as just <M>, <p>, <r>, and <q>. We assume that <M> has no overlap with elements other than <p>, <r>, and <q>. The elements <p>, <r>, and <q> can be generalized to any number of elements.
The easiest case is an insertion of <M> with start and end positions that lead to a balanced tree, so no balancing is needed. This is the case if <M> is fully contained within one of the existing elements, say <p>; if <M> surrounds <p>, <r>, and <q>, but does not overlap with other elements; or if <M> is inserted between two elements, say <p> and <q>, and no other elements <r> exist between <p> and <q>.
In these balanced cases, the strategies OUTER and INNER, as well as BALANCE_TO_START and BALANCE_TO_END, all lead to the same result. This is because the BALANCE_TO_START and BALANCE_TO_END strategies are only used if balancing is needed.
- If <M> surrounds <p>, <r>, and <q>: Replace <p>, <r>, and <q> by <M> in the parent node and insert <p>, <r>, and <q> as the content of <M>.
- If <M> is inserted between <p> and <q>: Insert <M> into the parent node between <p> and <q>. (This can be seen as the previous case for zero surrounded elements.)
- If <M> is contained within an element <p>: Insert <M> recursively into the content of <p>, using the same start and end positions.
The START and END strategies will collapse the start and end positions of <M>.
- If <M> is contained within an element <p>: Insert <M> recursively into the content of <p>, keeping the start and end positions (or collapsing the start and end positions according to START or END).
- If <M> surrounds <p>, <r>, and <q>: START: Insert <M> before <p> in the parent node and replace the end position of <M> by the start position; END: Insert <M> after <q> in the parent node and replace the start position of <M> by the end position.
- If <M> is inserted between <p> and <q>: Insert <M> into the parent node between <p> and <q>. START: Replace the end position of <M> by the start position; END: Replace the start position of <M> by the end position.
There are several ways in which the inserted element <M> can partially overlap with existing elements. In these cases we need to adjust the start and/or end positions in order to get a balanced (well-formed) tree.
OUTER:
- If the start position of <M> is inside <p> and the end position of <M> is inside <q>: Replace <p>, <r>, and <q> by <M> in the parent node and insert <p>, <r>, and <q> as the content of <M>. Make the start position of <M> equal to the start position of <p>, and make the end position of <M> equal to the end position of <q>.
- If the start position of <M> is inside <p> and the end position of <M> is not inside any element: Replace <p> and <r> by <M> in the parent node and insert <p> and <r> as the content of <M>. Make the start position of <M> equal to the start position of <p>.
- If the start position of <M> is not inside any element and the end position of <M> is inside <q>: Replace <r> and <q> by <M> in the parent node and insert <r> and <q> as the content of <M>. Make the end position of <M> equal to the end position of <q>.
INNER:
- If the start position of <M> is inside <p> and the end position of <M> is inside <q>: Replace <r> by <M> in the parent node and insert <r> as the content of <M>. Make the start position of <M> equal to the end position of <p>, and make the end position of <M> equal to the start position of <q>.
- If the start position of <M> is inside <p> and the end position of <M> is not inside any element: Replace <r> by <M> in the parent node and insert <r> as the content of <M>. Make the start position of <M> equal to the end position of <p>.
- If the start position of <M> is not inside any element and the end position of <M> is inside <q>: Replace <r> by <M> in the parent node and insert <r> as the content of <M>. Make the end position of <M> equal to the start position of <q>.
START, BALANCE_TO_START (these are the same, because balancing is needed):
- If the start position of <M> is inside <p> and the end position of <M> is inside <q>: Replace the end position of <M> by the start position and insert <M> into <p>.
- If the start position of <M> is inside <p> and the end position of <M> is not inside any element: Replace the end position of <M> by the start position and insert <M> into <p>.
- If the start position of <M> is not inside any element and the end position of <M> is inside <q>: Replace the end position of <M> by the start position and insert <M> in the parent node before <r>.
END, BALANCE_TO_END (these are the same, because balancing is needed):
- If the start position of <M> is inside <p> and the end position of <M> is inside <q>: Replace the start position of <M> by the end position and insert <M> into <q>.
- If the start position of <M> is inside <p> and the end position of <M> is not inside any element: Replace the start position of <M> by the end position and insert <M> in the parent node after <r>.
- If the start position of <M> is not inside any element and the end position of <M> is inside <q>: Replace the start position of <M> by the end position and insert <M> into <q>.
The rules for insertion can be illustrated by the effect they have on actual XML fragments.
Suppose that we want to mark the phrase “!!!” in a document with the <M> element. Other characters will be represented by dots.
This would transform the fragment “<p>..!!!..</p>” into “<p>..<M>!!!</M>..</p>” if the OUTER or INNER balancing strategy is used; no balancing is required.
Non-balanced insertions, where the phrase “!!!” crosses an element boundary, require one of the balancing strategies; the examples below illustrate how they differ.
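For example, suppose the phrase “!!!” starts inside <p> and ends inside <q>, as in “<p>..!!</p><q>!..</q>”. Applying the rules of the previous section, inserting <M> around “!!!” should give results along the following lines (a reconstruction from those rules, not output of the actual implementation):
OUTER: <M><p>..!!</p><q>!..</q></M>
INNER: <p>..!!</p><M/><q>!..</q>
START, BALANCE_TO_START: <p>..<M/>!!</p><q>!..</q>
END, BALANCE_TO_END: <p>..!!</p><q>!<M/>..</q>
With OUTER the new element grows to contain the partially overlapped elements; with INNER it shrinks to the region between them (here an empty element, because nothing lies between <p> and <q>); with START and END it collapses to its start or end position and is inserted into <p> or <q>, respectively.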
SPEAT: Simple Pipelines of Event API Transformers
XML document processing is often done using pipelines, facilitated by technologies like XProc (https://xproc.org/) and Apache Cocoon (http://cocoon.apache.org/).
SPEAT is a Java library for building pipelines from event APIs like SAX and SMAX.
Each step in a SPEAT pipeline is a transformer from one event API type to another.
Examples are SaxToSmaxAdapter, which transforms from SAX to SMAX, and NamedEntityRecognizer, which transforms from SMAX to SMAX.
A transformer is a (one-step) pipeline component, and a pipeline can be used as a transformer in another pipeline.
Event APIs like SAX and SMAX are defined by Java interfaces, where each event is an (abstract) method. For example, in SAX an XML document is represented by a sequence of method calls:
startDocument(), startElement(uri, lname, qname, attrs), …, endElement(uri, lname, qname), endDocument()
This is a very efficient way to represent events. No reification takes place, i.e., events are not instances of some Event interface that are passed between pipeline stages.
The basis of SPEAT is
interface Pipeline<S, T> extends EventHandler<S>, EventSupplier<T>
which is a pipeline that transforms S events into T events. (EventHandler and EventSupplier are very simple Java interfaces, which determine whether their parameter event API class is used for input or output.) The parameter classes S and T are event APIs. A Pipeline<S, T> receives S events and generates T events. Usually, there is no one-to-one correspondence between the S events and the T events. All events are passed by calling the corresponding method.
The Pipeline<S, T> interface has a method
<U> Pipeline<S, U> append(Pipeline<T, U>)
that makes a Pipeline<S, U> by appending a Pipeline<T, U> to a Pipeline<S, T>. It also provides logging to classes that implement the Pipeline interface.
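As an illustration, a pipeline that converts SAX input into SMAX events and then applies named entity recognition could be composed along these lines. This is a sketch: the event API type parameters (org.xml.sax.ContentHandler for SAX and a SmaxEventHandler interface for SMAX) and the no-argument constructors are assumptions; only Pipeline, append, SaxToSmaxAdapter and NamedEntityRecognizer are mentioned in this document.

// Hypothetical composition sketch; type parameters and constructors are assumed.
Pipeline<ContentHandler, SmaxEventHandler> toSmax = new SaxToSmaxAdapter();
Pipeline<SmaxEventHandler, SmaxEventHandler> ner = new NamedEntityRecognizer();
// append() yields a pipeline that accepts SAX events and supplies SMAX events.
Pipeline<ContentHandler, SmaxEventHandler> saxToEntities = toSmax.append(ner);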
At this moment, some basic pipeline components are available in SPEAT.
There are components to transform between the SAX and SMAX event APIs and event APIs for handling plain text.
There are also components for input and output.
The only components that do ‘real’ work are the RegexContentMatcher and the NamedEntityRecognizer.
An adapter class (not available as open source, but part of the Link eXtractor) has been made to use SPEAT pipelines as part of pipelines in Apache Cocoon (http://cocoon.apache.org/), and if the need arises, an adapter can be made to use SPEAT from an XProc (https://xproc.org/) implementation.
SPEAT and SMAX are still under active development, but the author has decided that the code should be stable enough to be released as open source software. There are more than 100 unit tests, which also serve as examples of how to use SPEAT (available on GitHub at https://github.com/nverwer/SPEAT). Some of the technical documentation needs to be revised, and more XML structure transforming operations, such as element deletions, will be added. In its current state, it is possible to develop useful components for processing the plain text content of XML documents and to add markup while keeping the document well-formed.
The author’s intention is not to make SPEAT into a framework, like XProc or Apache Cocoon. It is a library that can be used inside or outside other frameworks.
Acknowledgments
Much of the work described in this document was done while working on the Link eXtractor (LX), and the author wishes to thank Marc van Opijnen of the Expert Center for Official Governmental Publications (KOOP) in the Netherlands for making this work both necessary and possible. The SMAX and SPEAT software libraries described in this document were developed in the author’s own time, and will be released as open source software.