Text normalization with Invisible XML round-tripping

Alain Couthures

agenceXML

Abstract

Invisible XML enables producing an XML document from a text value according to a grammar, which specifies rules for generating elements and attributes.

In certain use cases, round-tripping is required: unparsing the resulting XML document back into text.

Three steps are proposed for this unparsing:

Grammar normalization by pruning meaningless rules and symbols
Inversion of the normalized grammar
Parsing of the XML document with the inverted normalized grammar

Grammix is a JavaScript implementation for parsing text values and unparsing XML documents according to an Invisible XML grammar. Grammix internally uses an Earley parser.

Table of contents

Abstract
Invisible XML for Parsing Text Values
Importance of Round-Tripping in Invisible XML
Introducing “Grammix”
Generating a Normalized Grammar
Round-Tripping XML to Normalized Text
Ambiguities in Normalizing Text
Conclusion
Biographical notes

Invisible XML for Parsing Text Values

In many cases, data processing requires analyzing text values to identify and reformat their parts. In this way, semantically identical text values are stored uniquely, facilitating comparison and sorting. Before XML, loading huge volumes of data into SQL databases already involved normalizing column contents. Normalization remains relevant in Machine Learning: models may internally learn and normalize values, but interpretability is not yet mastered, and efficiency improves with normalized data.

Invisible XML is a recent specification supported by an active community and several implementations. It enables extraction of components according to a grammar (rules declared as a list of symbols) and structures these components into an XML document.

Supported grammars feature EBNF-like constructs (“option”, “repeat0”, “repeat1”) and instructions for naming elements and attributes and, potentially, inserting additional text. Invisible XML adopts a declarative approach, avoiding imperative statements within grammars.

Resulting attribute values and text nodes are inherently normalized due to the grammar: irrelevant parts are eliminated, and the remaining ones are potentially transformed or enriched.

Post-processing of the resulting XML document can be performed using standard XML tools, generating a corresponding new text value potentially differing from the original.

Round-tripping for Invisible XML involves generating a new text value from the XML document that remains valid according to the grammar and would regenerate the identical XML upon re-parsing.

Importance of Round-Tripping in Invisible XML

When the resulting XML document is edited, it is useful to verify whether the new internal values still conform to the grammar and to regenerate the corresponding complex text value, possibly using XProc.

Integrating Invisible XML into XForms enables editing complex text values through input controls associated with the XML document and submitting the resulting text value.

Revealing text values via round-tripping can also aid grammar improvement. Writing grammars is often challenging; an author might incorrectly consider a grammar valid based solely on a test suite while it falsely accepts other input values.

Depending on the grammar, infinitely many complex text values containing meaningless characters (e.g., whitespace or newlines used for indentation) could result in the same XML document. Round-tripping aims to produce a single, preferably shortest, normalized value.

Introducing “Grammix”

Grammix is an implementation of Invisible XML, compatible with browsers and JavaScript engines such as Node.js.

Grammix uses an internal Earley parser based on the nearley JavaScript package, renowned for its speed and robustness. Error reporting clearly lists expected values within rules formatted using Invisible XML syntax.

Unlike Invisible XML grammars, Earley grammars contain only basic rules without direct EBNF capabilities:

Multiple rules with the same name represent alternatives for a given symbol
String literals are represented as lists of successive characters
EBNF structures and grouping are absent; additional recursive rules are needed to implement iteration
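The differences listed above can be sketched as a small rewriting step. The following is only an illustration, not Grammix's actual code: the helper names desugar and literalToSymbols, the generated rule names, and the rule object shape are all hypothetical.

```javascript
// Hypothetical helpers: rewrite EBNF-style constructs into plain rules
// (a name plus a flat list of symbols), the only form a basic Earley
// grammar understands. Not taken from Grammix.
let counter = 0;

function desugar(kind, symbol) {
  const name = `_${kind}_${++counter}`; // fresh helper rule name
  switch (kind) {
    case "option": // symbol?  =>  empty | symbol
      return { name, rules: [
        { name, symbols: [] },
        { name, symbols: [symbol] },
      ] };
    case "repeat0": // symbol*  =>  empty | name symbol
      return { name, rules: [
        { name, symbols: [] },
        { name, symbols: [name, symbol] },
      ] };
    case "repeat1": // symbol+  =>  symbol | name symbol
      return { name, rules: [
        { name, symbols: [symbol] },
        { name, symbols: [name, symbol] },
      ] };
    default:
      throw new Error(`unknown construct: ${kind}`);
  }
}

// A string literal becomes a list of successive single-character symbols.
function literalToSymbols(text) {
  return Array.from(text, (ch) => ({ literal: ch }));
}
```

The recursive alternatives are written left-recursively here, which an Earley parser accepts; several rules sharing one name stand for the alternatives of that symbol.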

For Grammix to support Invisible XML effectively, regular expression character matching and Unicode classes are required, and string insertions must be skipped during parsing.

While the nearley package returns an array of arrays by default, it allows JavaScript-based post-processing at rule completion. Grammix’s post-processing specifically manages appending children to constructed nodes. Although inspired by DOM, Grammix handles nodes without constraints such as single parent nodes, because the Earley algorithm evaluates all possible rules concurrently.

Grammix internally contains a compiled version of Invisible XML’s own grammar, enabling it to parse grammars expressed in Invisible XML syntax directly.

Successful grammar parsing produces a JavaScript function embodying the compiled grammar.

Generating a Normalized Grammar

A normalized grammar resembles the original grammar but excludes iterations of meaningless characters, thus accepting only normalized text values.

For instance, basic string symbols in rules may simplify as follows:

-"a"* and -"a"? can be removed
-"a"+ can become -"a"
-["a" ; "b"] can become -"a"
(-"a" ; "b")+ can become (-"a" ; "b"+)

Generating a normalized grammar involves recursively identifying rules composed solely of meaningless symbols. The resulting normalized grammar is typically simpler, containing fewer rules and symbols.
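The recursive identification of meaningless symbols can be sketched over a toy grammar tree. This is a minimal illustration under stated assumptions, not Grammix's implementation: the node shapes (lit, set, option, repeat0, repeat1, seq, alt) and the function names are hypothetical, and only the first three simplifications listed above are covered.

```javascript
// Hypothetical node constructors for a toy grammar tree.
const lit = (value, deleted = false) => ({ type: "lit", value, deleted });
const wrap = (type, child) => ({ type, child }); // option, repeat0, repeat1
const seq = (...children) => ({ type: "seq", children });

// A symbol is "meaningless" when it can only contribute deleted text.
function meaningless(n) {
  switch (n.type) {
    case "lit":
    case "set":
      return n.deleted;
    case "option":
    case "repeat0":
    case "repeat1":
      return meaningless(n.child);
    default: // seq, alt
      return n.children.every(meaningless);
  }
}

// Returns the normalized node, or null when the node can be removed.
function normalize(n) {
  switch (n.type) {
    case "lit":
      return n;
    case "set": // -["a"; "b"] becomes -"a"
      return n.deleted ? lit(n.members[0], true) : n;
    case "option":
    case "repeat0": // -"a"? and -"a"* are removed
      return meaningless(n.child) ? null : wrap(n.type, normalize(n.child));
    case "repeat1": // -"a"+ becomes -"a"
      return meaningless(n.child) ? normalize(n.child) : wrap(n.type, normalize(n.child));
    default: {
      const kids = n.children.map(normalize);
      return n.type === "seq"
        ? seq(...kids.filter((k) => k !== null))
        : { type: "alt", children: kids.map((k) => k ?? seq()) };
    }
  }
}
```

For example, normalize(seq(wrap("option", lit(" ", true)), lit("x"))) drops the optional deleted space and keeps only the visible "x".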

Round-Tripping XML to Normalized Text

Although numerous methods exist to build text from an XML document, Grammix requires no additional coding (declarative or imperative). Its round-tripping relies on automatically inverting the normalized grammar.

An inverted grammar offers another opportunity to utilize an Earley parser, though it must accommodate additional symbols for a lexer delivering start tags, attributes, and end tags.

Elements within an XML document consistently appear in their initial order within the complex text. Attributes, however, appear at the element level in arbitrary order. Consequently, the lexer provides all effective attributes which the parser must verify before accepting an end tag.
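A lexer of this kind might look like the sketch below. It is only an illustration: the token shapes, the plain-object tree format ({ name, attributes, children }), and the names tokens and unconsumed are assumptions, not Grammix's internal structures.

```javascript
// Hypothetical lexer: walks a simple XML-like tree and yields the tokens
// an inverted grammar would consume. Text is delivered one character at
// a time; each start tag carries all attributes of its element.
function* tokens(node) {
  if (typeof node === "string") {
    for (const ch of node) yield { type: "char", value: ch };
    return;
  }
  yield { type: "start", name: node.name, attributes: { ...(node.attributes ?? {}) } };
  for (const child of node.children ?? []) yield* tokens(child);
  yield { type: "end", name: node.name };
}

// Check to run before accepting an end tag: lists the attributes of the
// element that the parser has not matched yet (consumed is a Set of names).
function unconsumed(startToken, consumed) {
  return Object.keys(startToken.attributes).filter((name) => !consumed.has(name));
}
```

Because the attributes travel with the start tag as an unordered map, the parser can match them in whatever order the grammar mentions them and reject the end tag while unconsumed still returns names.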

Attribute values might also be derived from rules that themselves create elements or attributes, which is permissible in Invisible XML; hence, the start/end tags and attributes these rules would otherwise require are ignored when parsing attribute values.

Due to Invisible XML’s flexibility regarding marking elements or attributes at rule or symbol levels, start/end tags are explicitly added as extra symbols around the corresponding symbol. A new start rule is also introduced for the root element for consistency.
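The wrapping of marked symbols can be sketched as follows. This is a rough illustration under several assumptions: the symbol objects, the expect/emit/call vocabulary, and the treatment of deleted ("-") and inserted ("+") terminals are hypothetical and inferred from the roles these marks play when parsing, not taken from Grammix.

```javascript
// Hypothetical inversion of one symbol of a normalized rule.
// "expect" items are matched against the XML token stream,
// "emit" items are written to the output text,
// "call" items refer to another inverted rule.
function invertSymbol(sym) {
  if (sym.kind === "nonterminal") {
    if (sym.mark === "^") { // element: surround with explicit tag symbols
      return [
        { expect: "start", name: sym.name },
        { call: sym.name },
        { expect: "end", name: sym.name },
      ];
    }
    if (sym.mark === "@") { // attribute: matched by name on the current element
      return [{ expect: "attribute", name: sym.name, call: sym.name }];
    }
    return [{ call: sym.name }]; // hidden nonterminal: no tags at all
  }
  if (sym.mark === "-") return [{ emit: sym.value }]; // absent from the XML, restored in the text
  if (sym.mark === "+") return [{ expect: "text", value: sym.value }]; // present in the XML only
  return [{ expect: "text", value: sym.value }, { emit: sym.value }]; // copied through
}

const invertRule = (rule) => ({
  name: rule.name,
  symbols: rule.symbols.flatMap(invertSymbol),
});
```

Handling the mark on the symbol itself, rather than on the rule, keeps the sketch uniform whether the grammar author marked the rule or its use.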

Ambiguities in Normalizing Text

Normalizing a complex value involves parsing it and immediately unparsing the resulting XML document to produce a normalized value.

Even with normalized grammars, round-tripping with an Earley parser may yield multiple normalized values.

For instance, given a simple grammar:

a:(-"0","a",-"0";-"1","a",-"1")+.

An input value with five iterations, such as:

0a01a10a01a10a0

will generate 2⁵, that is 32, distinct normalized values.
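The count can be checked by brute force. The sketch below does not use a parser: it assumes, as the grammar suggests, that each iteration can be unparsed independently as either "0a0" or "1a1", and the function name normalizedValues is hypothetical.

```javascript
// Enumerates every text value that unparses an XML document produced by
// the given number of iterations, choosing "0a0" or "1a1" at each one.
function normalizedValues(iterations) {
  const alternatives = ["0a0", "1a1"];
  let values = [""];
  for (let i = 0; i < iterations; i++) {
    values = values.flatMap((prefix) => alternatives.map((alt) => prefix + alt));
  }
  return values;
}

const values = normalizedValues(5);
console.log(values.length);                       // 32
console.log(values.includes("0a01a10a01a10a0"));  // true
```

The number of candidates doubles with each iteration, which is why keeping all of them soon becomes impractical.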

Any grammar defining strings for a programming language with quotes or apostrophes will encounter this problem. Selecting the first normalization can be feasible, but performance and memory issues quickly arise.

Adding attributes to the intermediate XML can ensure normalized values strictly reflect the original input. Pragmas might indicate preferred alternatives.

Conclusion

An Invisible XML grammar serves as a powerful declarative tool, primarily for parsing complex text into XML.

Consequently, it readily validates complex text values akin to schemas.

It also facilitates text normalization without additional programming.

Using two grammars to transpile complex texts is possible, though currently constrained by Invisible XML’s design, which lacks computational functions and contextual memory during parsing.

Biographical notes

Alain Couthures, owner of agenceXML founded in 2006, expert in XML/XSLT/XQuery/XForms/XProc for advanced data processing in heterogeneous environments, implementer of XSLTForms (XForms), Fleur (XQuery) and Grammix (Invisible XML).