Plain text processing in structured documents

Applications that analyze and process natural language are used for tasks such as named entity recognition, anonymization, topic extraction and sentiment analysis. In most cases, these applications operate on the plain text of a document and may add or change markup. This causes problems when the original document already contains markup that must be preserved: the text to be analyzed may run across markup boundaries, and newly generated markup may produce unbalanced structures that are not well-formed. This presentation shows how the Separated Markup API for XML (SMAX) can be used to apply natural language processing to XML documents. SMAX preserves the existing document structure and allows new markup to be inserted in a balanced way. A demonstration will show how SMAX is used to extract and mark up references in legal documents. The application demonstrated, the Link eXtractor, was built for the Dutch center for governmental publications. SMAX and Simple Pipelines of Event API Transformers (SPEAT) will be available as open source software at the time of Declarative Amsterdam.
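
The idea of separating markup from text can be illustrated with a small sketch. It is not the SMAX API: every name in it (separate, mark, serialize, the "ref" element) is hypothetical and chosen for illustration only. The sketch assumes a document without attributes or empty elements. It keeps the plain text in one string, records each element as a pair of character offsets, matches a pattern on the plain text, and splits any match that crosses an element boundary so that the result stays well-formed.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def separate(xml_string):
    """Split XML into plain text plus a list of [start, end, tag] spans.

    Simplification for this sketch: attributes are dropped and empty
    elements are not supported.
    """
    chunks, spans = [], []

    def walk(elem, pos):
        span = [pos, pos, elem.tag]
        spans.append(span)  # spans are recorded in document order
        if elem.text:
            chunks.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            pos = walk(child, pos)
            if child.tail:
                chunks.append(child.tail)
                pos += len(child.tail)
        span[1] = pos
        return pos

    walk(ET.fromstring(xml_string), 0)
    return "".join(chunks), spans


def mark(text, spans, pattern, tag):
    """Add a span for every regex match, split at crossed boundaries."""
    new = []
    for m in re.finditer(pattern, text):
        a, b = m.span()
        cuts = {a, b}
        for s, e, _ in spans:
            if a < s < b < e:      # match starts outside, ends inside
                cuts.add(s)
            elif s < a < e < b:    # match starts inside, ends outside
                cuts.add(e)
        points = sorted(cuts)
        new.extend([x, y, tag] for x, y in zip(points, points[1:]))
    # Outer spans first: earlier start, then later end. The sort is
    # stable, so an existing element wraps a new span of equal extent.
    return sorted(spans + new, key=lambda sp: (sp[0], -sp[1]))


def serialize(text, spans):
    """Rebuild XML from plain text and properly nested, sorted spans."""
    out, stack, i = [], [], 0
    for pos in range(len(text) + 1):
        while stack and stack[-1][1] == pos:       # close finished spans
            out.append("</%s>" % stack.pop()[2])
        while i < len(spans) and spans[i][0] == pos:  # open new spans
            out.append("<%s>" % spans[i][2])
            stack.append(spans[i])
            i += 1
        if pos < len(text):
            out.append(escape(text[pos]))
    return "".join(out)


doc = "<p>See article <b>12 of</b> the Act, and <i>article</i> 3.</p>"
text, spans = separate(doc)
result = serialize(text, mark(text, spans, r"article \d+", "ref"))
print(result)
```

In this example the first match, "article 12", starts before the b element and ends inside it, so it is emitted as two ref pieces, one on each side of the boundary. The second match simply wraps the existing i element. A real implementation would probably link the pieces of a split match, for instance with a shared identifier; that is left out here.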

Presentation, 9 October 2020

Nico Verwer works as a freelance software developer, designer, architect and troubleshooter. His clients are mainly companies in the fields of publishing, media and government services, but also fit20, the world market leader in High Intensity Resistance fitness training. Nico has no preferred programming language, because he values understanding the application domain over knowledge of a particular technology. However, he does prefer techniques and methods that minimize accidental complexity. During his career he has deleted more lines of code than he has written.