Generalised Invisible Markup

Steven Pemberton

Abstract

Invisible XML makes the implicit structure of textual documents explicit by parsing, and then transforming the resultant parsetree into an abstract document. This abstract document is the essence of Invisible XML: it can be processed in many ways, though principally by serialising to XML.

However, as shown in an earlier paper on roundtripping ixml, it is possible to simultaneously simplify and generalise the ixml serialisation process, thereby opening it to other serialisations that are not hard-wired in the processor. By the same token, this further simplifies ixml proper, by reducing it to a simple transformation of grammars from ixml to an equivalent Invisible Markup grammar.

This paper investigates the changes needed to create a generalised Invisible Markup Language, explores the alternatives, and proposes future steps.

Table of contents

Abstract
Motivation
Design
Readability
Roundtripping
Ambiguity
Other Uses
Other Structural Opportunities
ixml as an application of iml
Syntax of iml in ixml
References

Motivation

Textual documents have an implicit structure that is mostly recognisable for human readers, but for computers the structure has to be made explicit to enable processing of documents. For this reason, markup languages such as SGML [sgml] and XML [xml] were invented in order to add information to textual documents to make the structure explicit.

Invisible XML (ixml) [ixml] works in a different way: rather than adding structuring information directly to the document, the format of the document is described using context-free grammars [cfg]. The document is then parsed with the grammar, producing an abstract representation of the document, which can then be serialised to a document with explicit markup. Extra information in the grammar permits control over the serialisation.

The abstract parsetree is the central element of the Invisible XML process, since it can be used in several ways, though the principal one is serialising to XML. However, there are other possible uses and serialisations. For instance, some implementations of ixml serialise the document-describing grammar, which is itself expressed in ixml, into a format more suitable for the parser being used [ixampl], [jwixml]. Other implementations allow the abstract document to be serialised to other formats, such as JSON [cp].

Furthermore, a paper on roundtripping ixml [rt] pointed out that you can reverse the serialisation process by using ixml to recognise the XML serialisation produced by ixml. This is done by transforming grammars so that they include the necessary extra characters such as "<" and ">" added by serialisation, thus recreating an abstract document that would create the same XML serialisation. Extra facilities are needed to do this in a general way, since some items (namely attributes) get implicitly moved to earlier in the serialisation than their position in the abstract document. The extra facilities allow moving the parsed attributes back to their original position. What this produces is a grammar that recognises the ixml serialisation, and serialises it back to textual form.

As a simple example, for the ixml grammar:

 date: day, -"/", month, -"/", year.
  day: d, d.
month: d, d.
 year: +"20", d, d.
   -d: ["0"-"9"].

where an input such as

07/11/25

would generate a serialisation like:

<date><day>07</day><month>11</month><year>2025</year></date>

transforming the grammar to recognise the serialisation gives:

  date: -"<date>", day, +"/", month, +"/", year, -"</date>".
  -day: -"<day>", d, d, -"</day>".
-month: -"<month>", d, d, -"</month>".
 -year: -"<year>", -"20", d, d, -"</year>".
    -d: ["0"-"9"].

Using this grammar and the regular ixml parser to parse the serialised XML, then gives:

<date>07/11/2025</date>

However, this has a surprising effect: if you transform a grammar twice in this way, the resulting grammar both recognises the original input format, and includes the necessary characters to do the serialisation, without needing to call on the XML serialisation process of ixml.

This would give:

 -date: +"<date>", day, -"/", month, -"/", year, +"</date>".
  -day: +"<day>", d, d, +"</day>".
-month: +"<month>", d, d, +"</month>".
 -year: +"<year>", +"20", d, d, +"</year>".
    -d: ["0"-"9"].

which for the same original input, produces the same original serialisation (modulo details explained in the original paper).
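
As an illustration only, the behaviour of this grammar can be mimicked by a small hand-written recogniser. The Python sketch below assumes a simple reading of the marks: +"..." emits text without consuming input, and -"..." consumes input without emitting it. The function name serialise_date is hypothetical, and the sketch does not show how an ixml processor is actually implemented.

```python
def serialise_date(text: str) -> str:
    """Mimic the twice-transformed date grammar for inputs such as 07/11/25."""
    pos = 0
    out = []

    def insert(s: str) -> None:
        # +"...": emit the string without consuming any input.
        out.append(s)

    def delete(s: str) -> None:
        # -"...": consume the string from the input without emitting it.
        nonlocal pos
        if not text.startswith(s, pos):
            raise ValueError(f"expected {s!r} at position {pos}")
        pos += len(s)

    def d() -> None:
        # -d: ["0"-"9"]. The digit is consumed and copied to the output.
        nonlocal pos
        if pos >= len(text) or text[pos] not in "0123456789":
            raise ValueError(f"expected a digit at position {pos}")
        out.append(text[pos])
        pos += 1

    def day() -> None:
        insert("<day>")
        d()
        d()
        insert("</day>")

    def month() -> None:
        insert("<month>")
        d()
        d()
        insert("</month>")

    def year() -> None:
        insert("<year>")
        insert("20")
        d()
        d()
        insert("</year>")

    # -date: +"<date>", day, -"/", month, -"/", year, +"</date>".
    insert("<date>")
    day()
    delete("/")
    month()
    delete("/")
    year()
    insert("</date>")
    if pos != len(text):
        raise ValueError("unexpected trailing input")
    return "".join(out)


print(serialise_date("07/11/25"))
```

All the markup characters come from the rules themselves; nothing in the sketch knows about XML.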

As the paper remarks:

the ixml processor is now unbound from XML, and could be used to produce other serialisations in a fairly straightforward way.

This paper takes that observation, and explores the design of a generalised version of ixml that allows serialisation not only to XML, but also to other structured data formats, such as JSON [json], and ABC structured data [abc].

Design

The purpose of the generalised invisible markup language is thus to parse a textual input and produce an abstract document, just as with ixml, but then to leave the serialisation solely to details in the grammar specification.

To explore the possibilities, let us take an ixml definition of a simplified URL as an example, but treat it in some different ways.

      url: scheme, -":", authority, path.
   scheme: letter+.
authority: -"//", host.
     host: sub++-".".
      sub: letter+.
     path: (-"/", seg)+.
      seg: fletter*.
  -letter: ["a"-"z"; "A"-"Z"; "0"-"9"].
 -fletter: letter; ".".

If we use this to process the string https://invisiblexml.org/1.0/ with ixml, we get the serialisation

<url>
   <scheme>https</scheme>
   <authority>
      <host>
         <sub>invisiblexml</sub>
         <sub>org</sub>
      </host>
   </authority>
   <path>
      <seg>1.0</seg>
      <seg/>
   </path>
</url>

Let us now assume a version of ixml that doesn't serialise to XML, but instead requires all serialisation characters to come from the grammar. To duplicate what we have above would just require adding the start and end tags to the relevant rules:

   url: +"<url>", scheme, -":", authority, path, +"</url>".
scheme: +"<scheme>", letter+, +"</scheme>".
(etc)

There is one difference, namely this would generate

<seg></seg>

for the final empty segment. If the short version were required, then you could write:

-seg: +"<seg>", fletter+, +"</seg>"; +"<seg/>".

If we wanted to make scheme an attribute, we could write:

   -url: +"<url", scheme, +">", -":", authority, path, +"</url>".
-scheme: +" scheme='",  letter+, +"'".

This would then generate

<url scheme='https'>

as the first line of the serialisation.

But now we have some more freedom than we did in ixml. For instance we could use the scheme as the name of the element:

   -url: +"<", scheme, +">", -":", authority, path, +"</", +scheme, +">".
-scheme: letter+.

which would give:

<https>
  <authority>
     <host>
        <sub>invisiblexml</sub>
        <sub>org</sub>
      </host>
   </authority>
   <path>
      <seg>1.0</seg>
      <seg/>
   </path>
</https>

Note that a facility proposed in the roundtripping paper has been used here: applying + to a nonterminal, meaning "parse nothing, but serialise the nonterminal of this name at this position". While this facility was introduced in that paper in order to deal with attributes that appear in a different position in the serialisation from their position in the parse, it can now be used to introduce structural differences in the serialisation, so that, in the XML case, the names of elements may come from the input.

Readability

In ixml each rule combines both the form of the input and that of the serialisation, but since the two are very closely related, and the reordering of attributes only happens implicitly, it is easy to see what is input, and what is output.

However, since more has to be included in rules in the new version, they quickly become hard to read, and it becomes hard to determine what the form of the input is and how it will be serialised. For this reason, a different form for rules is proposed here, one that separates input and output. For instance:

      url: scheme, ":", authority, path
         => "<", scheme, ">", authority, path, "</", scheme, ">".
   scheme: letter+.

What should immediately be obvious here is that all 'marks' driving the serialisation have been made unnecessary and therefore have been eliminated. It also makes it much clearer what the format of the input is, and the resulting serialisation. If input and output are identical, such as in scheme above, there is no need to specify a separate output part of the rule. Here is the rest:

authority: "//", host  => "<authority>", host, "</authority>".
     host: sub++"."    => "<host>", sub+, "</host>".
      sub: letter+     => "<sub>", letter+, "</sub>".
     path: ("/", seg)+ => "<path>", seg+, "</path>".
      seg: fletter*    => "<seg>", fletter*, "</seg>".
   letter: ["a"-"z"; "A"-"Z"; "0"-"9"].
  fletter: letter; ".".

If a rule should produce no output then an empty output part is specified. For instance:

 input: spaces, url, spaces.
spaces: " "* => .

Note that each alternative needs an output part, not just the whole rule:

 input: entry++" " => "<entries>", entry+, "</entries>".
 entry: number     => "<number>", number, "</number>";
        word       => "<word>", word, "</word>".
number: ["0"-"9"]+.
  word: [L]+.

In some cases, alternatives that share the same output part can be merged. For instance,

date: day, "/", month, "/", year => day, month, year;
      year, "-", month, "-", day => day, month, year.

can be combined to

date: (day, "/", month, "/", year;
       year, "-", month, "-", day) => day, month, year.
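
The effect of the combined rule can be sketched in Python with two regular expressions that feed one shared output. This is only an illustration: it assumes day, month and year are each two digits (their rules are not given in this example), and the names SLASH, DASH and date are hypothetical.

```python
import re

# The two input alternatives of the date rule, with named groups
# standing in for the nonterminals day, month and year.
SLASH = re.compile(r"(?P<day>[0-9]{2})/(?P<month>[0-9]{2})/(?P<year>[0-9]{2})")
DASH = re.compile(r"(?P<year>[0-9]{2})-(?P<month>[0-9]{2})-(?P<day>[0-9]{2})")


def date(text: str) -> str:
    """Accept either input form and serialise with the single shared output part."""
    for pattern in (SLASH, DASH):
        m = pattern.fullmatch(text)
        if m:
            # => day, month, year
            return m["day"] + m["month"] + m["year"]
    raise ValueError(f"not a date: {text!r}")


print(date("07/11/25"))
print(date("25-11-07"))
```

Both calls yield the same serialisation, since the output part depends only on the named parts and not on which alternative matched.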

In some cases, the same nonterminal occurs more than once, such as s in this example:

mapping: name, ":", s, values, ".", s
      => "<", name, ">", s, values, s, "</", name, ">".

The behaviour is that each is taken in turn in the output; if there are more in the output than the input, then it restarts from the first.
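
One possible reading of this behaviour is sketched below in Python. The representation is an assumption made for illustration: the parsed input is a list of (nonterminal, text) pairs in input order, the output part is a list of literal and nonterminal items, and the function name render is hypothetical.

```python
from collections import defaultdict


def render(parsed, template):
    """Assemble an output part from the parsed input parts.

    parsed:   list of (nonterminal name, serialised text) pairs, in input order.
    template: list of ("lit", text) or ("nt", name) items for the output part.
    """
    by_name = defaultdict(list)
    for name, text in parsed:
        by_name[name].append(text)

    used = defaultdict(int)  # how many times each nonterminal has been output
    out = []
    for kind, value in template:
        if kind == "lit":
            out.append(value)
            continue
        matches = by_name.get(value)
        if not matches:
            raise KeyError(f"nonterminal {value!r} does not occur in the input")
        # Take each occurrence in turn, restarting from the first when
        # the output asks for more occurrences than the input supplied.
        out.append(matches[used[value] % len(matches)])
        used[value] += 1
    return "".join(out)


# mapping: name, ":", s, values, ".", s, for the input "colours: red green."
parsed = [("name", "colours"), ("s", " "), ("values", "red green"), ("s", "")]
template = [("lit", "<"), ("nt", "name"), ("lit", ">"), ("nt", "s"),
            ("nt", "values"), ("nt", "s"), ("lit", "</"), ("nt", "name"), ("lit", ">")]
print(render(parsed, template))
```

Here name occurs once in the input but twice in the output, so the second use restarts from the first occurrence, while the two occurrences of s are used in order.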

If it is necessary to use a specific nonterminal, extra rules can be added as required:

person: name1, " ", name2 => name2, ", ", name1.
 name1: name.
 name2: name.
  name: [L]+.

This would turn Steven Pemberton into Pemberton, Steven.

Roundtripping

An interesting aspect of this representation is that the syntax of both input and output are now separately described. What this means is that roundtripping becomes trivial: you just swap the input and output sides, parse with the output part, and serialise with the input part. To take the earlier example:

 date: day, "/", month, "/", year
   => "<date>", year, month, day, "</date>".
  day: d, d => "<day>", d, d, "</day>".
month: d, d => "<month>", d, d, "</month>".
 year: d, d => "<year>", "20", d, d, "</year>".
    d: ["0"-"9"].
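
The swap itself is a purely mechanical grammar transformation. The Python sketch below assumes a hypothetical in-memory form of a grammar, a dictionary from rule names to lists of (input, output) pairs, where an output of None means that no output part was written because it is identical to the input.

```python
def swap(rules):
    """Return a grammar with the input and output side of every alternative exchanged."""
    swapped = {}
    for name, alternatives in rules.items():
        swapped[name] = [
            # An alternative without an output part reads and writes the
            # same form, so it is left unchanged.
            (inp, None) if out is None else (out, inp)
            for inp, out in alternatives
        ]
    return swapped


rules = {
    "date": [('day, "/", month, "/", year',
              '"<date>", year, month, day, "</date>"')],
    "day": [("d, d", '"<day>", d, d, "</day>"')],
    "d": [('["0"-"9"]', None)],
}

for name, alternatives in swap(rules).items():
    for inp, out in alternatives:
        print(f"{name}: {inp}" + (f" => {out}." if out is not None else "."))
```

Applying swap twice gives back the original grammar, which is what makes the roundtrip symmetric in this representation.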

Ambiguity

As with ixml, a parse may reveal ambiguity. While ixml does not define which parse to use, it may be useful to specify that the textually earliest serialisation be used. For instance in the case of the earlier example,

date: (day, "/", month, "/", year;
       year, "-", month, "-", day) => day, month, year.

where roundtripping has a choice of two serialisations, it might be useful to be able to specify that it is the first of these that is used. Further implementation work needs to be done to investigate this.

It is also worth noting that there is now no longer a place to specify that a parse was ambiguous. Implementations may have to resort to using a Unix-style error output channel to report errors and warnings of this sort.

Other Uses

Although the inspiration for iml is invisible markup, there are other uses that the language can be put to. For instance, general editing, such as replacing all separating semicolons with commas, but not those in strings:

input: item++";" => item++",".
item: string; word; number.
string: '"', ~['"']*, '"'.
word: [L]+.
number: [Nd]+.

or reversing a list of words:

words: word, s, words => words, " ", word;
       word.
    s: " "+.
 word: [L]+.
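
The recursion in the words rule can be mirrored directly in Python. The sketch below is an illustration under two assumptions: a lone word is passed through unchanged (a base case the recursion needs in order to terminate), and str.isalpha is taken as a stand-in for the character class [L]. The function name reverse_words is hypothetical.

```python
def reverse_words(text: str) -> str:
    """Mirror of: words: word, s, words => words, " ", word."""
    # word: [L]+
    i = 0
    while i < len(text) and text[i].isalpha():
        i += 1
    if i == 0:
        raise ValueError("expected a word")
    word = text[:i]
    if i == len(text):
        return word  # assumed base case: a single word is output unchanged

    # s: " "+
    j = i
    while j < len(text) and text[j] == " ":
        j += 1
    if j == i:
        raise ValueError("expected a space after the word")

    # Output part: the serialised remainder first, then a single space, then this word.
    return reverse_words(text[j:]) + " " + word


print(reverse_words("one two  three"))
```

Runs of several spaces in the input are normalised to one space, because the output part writes a literal " " rather than s.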

In fact iml could be used in many of the cases where sed and grep are used, with the additional advantage of being able to select on structure as well as text.

Other Structural Opportunities

In the original ixml paper [ixml0] it was remarked that {"a": 1, "b": 2} cannot be transformed [with ixml] into <j><a>1</a><b>2</b></j>. However, iml enables it to be done.

j: "{", member**", ", "}"  => "<j>", member*, "</j>".
member: name, ": ", number => "<", name, ">", number, "</", name, ">".
name: '"', letters, '"'    => letters.
letters: ["a"-"z"]+.
number: ["0"-"9"]+.
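
To make the intended result concrete, the following Python sketch mimics these five rules by hand. It is only an approximation for illustration: it accepts exactly the restricted input form the grammar describes (lower-case names, unsigned integers, ", " and ": " as separators), and the names MEMBER and j_to_xml are hypothetical.

```python
import re

# member: name, ": ", number, with name: '"', letters, '"'.
MEMBER = re.compile(r'"([a-z]+)": ([0-9]+)')


def j_to_xml(text: str) -> str:
    """Mimic the rule j: "{", member**", ", "}" => "<j>", member*, "</j>"."""
    if len(text) < 2 or not (text.startswith("{") and text.endswith("}")):
        raise ValueError("expected an object in braces")
    body = text[1:-1]
    members = body.split(", ") if body else []

    out = ["<j>"]
    for member in members:
        m = MEMBER.fullmatch(member)
        if m is None:
            raise ValueError(f"not a member: {member!r}")
        name, number = m.groups()
        # The element name is taken from the input, which ixml cannot do.
        out.append(f"<{name}>{number}</{name}>")
    out.append("</j>")
    return "".join(out)


print(j_to_xml('{"a": 1, "b": 2}'))
```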

ixml as an application of iml

As promised, iml makes the implementation of ixml much easier, because all that is needed is grammar transformation: an ixml grammar is transformed into an iml grammar, in the same way that ixml round-tripping was done earlier.

The only special case is the need to deal with XML special characters, such as "<", in text. For instance, the ixml

line: c*, #a.
c: ~[#a].

needs to be transformed to

line: c*, #a => "<line>", c*, "</line>".
c: ~[#a; "<"];
   "<" => "&lt;".

Syntax of iml in ixml

Here then is the proposed syntax of iml, expressed in ixml. The syntax is close to that of ixml, though of course there are no marks any more. Differences include names being simpler since they no longer have to match the XML production of name, and 'restrictions' have been added.

          iml: s, rule+.
         rule: name, s, -":", s, alternatives, -".", s.
        @name: [L; Nd]+.
-alternatives: alternative++(-";", s).
  alternative: input, output?.
        input: alt.
       output: (-"=>"; -"⇒"), s, alt.
         -alt: term**(-",", s).
        -term: factor;
               option;
               restriction;
               repeat.
      -factor: terminal;
               nonterminal;
               group.
       option: factor, -"?", s.
  restriction: factor, -"!", s.
      -repeat: repeat0; 
               repeat1.
      repeat0: factor, (-"*", s; -"**", s, sep).
      repeat1: factor, (-"+", s; -"++", s, sep).
          sep: factor.
        group: -"(", s, ^alt++(-";", s), -")", s.
  nonterminal: name, s.
     terminal: string, s;
               encoded, s;
               charset.
      @string: -'"', dchar+, -'"';
               -"'", schar+, -"'".
     @encoded: "#", hex.
          hex: ["0"-"9"; "a"-"f"; "A"-"F"]+.
     -charset: inclusion;
               exclusion.
    inclusion:  -"[", s, member**(-";", s), -"]", s.
    exclusion: -"~[", s, member**(-";", s), -"]", s.
       member: string, s; encoded, s; range; class, s.
       -range: from, s, -"-", s, to, s.
        @from: char.
          @to: char.
        -char: -'"', dchar, -'"';
               -"'", schar, -"'";
               encoded.
       @class: [L],[L]?.
       -dchar: ~['"'; Cc]; '"', -'"'.
       -schar: ~["'"; Cc]; "'", -"'".
           -s: (space; comment)*.
       -space: -[Zs; #a; #9; #d].
     -comment: -"{", c*, -"}".
           -c: -~["{}"]; comment.

It is left as an exercise for the reader to rewrite this in iml format.

References

[abc] Leo Geurts et al., The ABC Programmer's Handbook, Prentice-Hall, 1990, ISBN 0-13-000027-2, http://cwi.nl/~steven/abc/programmers/handbook/.

[cfg] AV Aho, JD Ullman, The Theory of Parsing, Translation, and Compiling, Vol 1: Parsing, Prentice-Hall, 1972, ISBN 0-13-914556-7.

[cp] Norman Tovey-Walsh, Coffeepot, An Invisible XML processor, nineml.org, 2022, https://docs.nineml.org/current/coffeepot/.

[ixampl] Steven Pemberton, A Pilot Implementation of ixml, Proc. XML Prague 2022, 2022, ISBN 978-80-907787-0-2, https://archive.xmlprague.cz/2022/files/xmlprague-2022-proceedings.pdf#page=51.

[ixml] Steven Pemberton, ed., Invisible XML Specification, Invisible XML Organisation, 2022, https://invisiblexml.org/1.0/.

[ixml0] Steven Pemberton, Invisible XML, Proceedings of Balisage: The Markup Conference 2013, Balisage Series on Markup Technologies, vol. 10, 2013.

[json] ECMA International, ECMA-404: The JSON data interchange syntax, 2nd edition, December 2017, https://ecma-international.org/publications-and-standards/standards/ecma-404/.

[jwixml] John Lumley, Invisible XML workbench, Github, 2024, https://johnlumley.github.io/jwiXML.xhtml.

[rt] Steven Pemberton, Round-tripping Invisible XML, Proc. XML Prague 2024, 2024, pp. 153–164, ISBN 978-80-907787-2-6, https://archive.xmlprague.cz/2024/files/xmlprague-2024-proceedings.pdf#page=163.

[sgml] Charles F. Goldfarb, The SGML Handbook, Clarendon Press, 1990, ISBN 9780198537373.

[xml] T Bray et al., Extensible Markup Language (XML) 1.0, W3C, 1998, https://www.w3.org/TR/1998/REC-xml-19980210.html.