Generalised Invisible Markup
Abstract
Invisible XML makes the implicit structure of textual documents explicit by parsing, and then transforming the resultant parsetree into an abstract document. This abstract document is the essence of Invisible XML: it can be processed in many ways, though principally by serialising to XML.
However, as shown in an earlier paper on roundtripping ixml, it is possible to simultaneously simplify and generalise the ixml serialisation process, thereby opening it to other serialisations that are not hard-wired in the processor. By the same token, this further simplifies ixml proper, by reducing it to a simple transformation of grammars from ixml to an equivalent Invisible Markup grammar.
This paper investigates the changes needed to create a generalised Invisible Markup Language, explores the alternatives, and proposes future steps.
Motivation
Textual documents have an implicit structure that is mostly recognisable for human readers, but for computers the structure has to be made explicit to enable processing of documents. For this reason, markup languages, such as SGML and XML [sgml]Charles F. Goldfarb, The SGML Handbook, Clarendon Press, 1990, ISBN 9780198537373. [xml]T Bray et al., Extensible Markup Language (XML) 1.0, W3C, 1998, https://www.w3.org/TR/1998/REC-xml-19980210.html. , were invented in order to add information to textual documents to make the structure explicit.
Invisible XML (ixml) [ixml]Steven Pemberton, Invisible XML Specification, Invisible XML Organisation, 2022, https://invisiblexml.org/1.0/. works in a different way: rather than adding structuring information directly to the document, the format of the document is described using context-free grammars [cfg]AV Aho, JD Ullman, The Theory of Parsing, Translation, and Compiling, Vol 1: Parsing, Prentice-Hall, 1972, ISBN 0-13-914556-7. , the document is then parsed with the grammar producing an abstract representation of the document, which can then be serialised to a document with explicit markup. Extra information in the grammar permits control over the serialisation.
The abstract parsetree is the central element of the Invisible XML process, since it can be used in several ways, though the principal one is serialising to XML. However, there are other possible uses and serialisations. For instance, some implementations of ixml serialise the document-describing grammar, which is itself expressed in ixml, into a format more suitable for the parser being used [ixampl]Steven Pemberton, A Pilot Implementation of ixml, Proc. XML Prague 2022, 2022, ISBN 978-80-907787-0-2, https://archive.xmlprague.cz/2022/files/xmlprague-2022-proceedings.pdf#page=51. , [jwixml]John Lumley, Invisible XML workbench, GitHub, 2024, https://johnlumley.github.io/jwiXML.xhtml. . Other implementations allow the abstract document to be serialised to other formats, such as JSON [cp]Norman Tovey-Walsh, Coffeepot, An Invisible XML processor, nineml.org, 2022, https://docs.nineml.org/current/coffeepot/. .
Furthermore, a paper on roundtripping ixml [rt]Steven Pemberton, Round-tripping Invisible XML, Proc. XML Prague 2024, 2024, pp. 153–164, ISBN 978-80-907787-2-6, https://archive.xmlprague.cz/2024/files/xmlprague-2024-proceedings.pdf#page=163. , pointed out that you can reverse the serialisation process by using ixml to recognise the XML serialisation produced by ixml. This is done by transforming grammars so that they include the necessary extra characters such as "<" and ">" added by serialisation, thus recreating an abstract document that would create the same XML serialisation.
As a simple example, for the ixml grammar:
date: day, -"/", month, -"/", year.
day: d, d.
month: d, d.
year: +"20", d, d.
-d: ["0"-"9"].
where an input such as
07/11/25
would generate a serialisation like:
<date><day>07</day><month>11</month><year>2025</year></date>
transforming the grammar to recognise the serialisation gives:
date: -"<date>", day, +"/", month, +"/", year, -"</date>".
-day: -"<day>", d, d, -"</day>".
-month: -"<month>", d, d, -"</month>".
-year: -"<year>", -"20", d, d, -"</year>".
-d: ["0"-"9"].
Using this grammar and the regular ixml parser to parse the serialised XML then gives:
<date>07/11/25</date>
However, this has a surprising effect: if you transform a grammar twice in this way, the resulting grammar both recognises the original input format, and includes the necessary characters to do the serialisation, without needing to call on the XML serialisation process of ixml.
This would give:
-date: +"<date>", day, -"/", month, -"/", year, +"</date>".
-day: +"<day>", d, d, +"</day>".
-month: +"<month>", d, d, +"</month>".
-year: +"<year>", +"20", d, d, +"</year>".
-d: ["0"-"9"].
which for the same original input, produces the same original serialisation (modulo details explained in the original paper).
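To make the effect of the doubly transformed grammar concrete, here is a minimal Python sketch. It is not an ixml implementation: it walks a hand-flattened copy of the rules above, and the item encoding and the names `DATE_RULE` and `run` are assumptions made purely for illustration.

```python
# Hand-flattened form of the doubly transformed date grammar.
# ("+", text): emit text without consuming input (an ixml insertion).
# ("-", text): consume text from the input but emit nothing (a deletion).
# ("d",):      consume one digit and copy it to the output.
DATE_RULE = [
    ("+", "<date>"),
    ("+", "<day>"), ("d",), ("d",), ("+", "</day>"),
    ("-", "/"),
    ("+", "<month>"), ("d",), ("d",), ("+", "</month>"),
    ("-", "/"),
    ("+", "<year>"), ("+", "20"), ("d",), ("d",), ("+", "</year>"),
    ("+", "</date>"),
]


def run(rule, text):
    """Match text against rule and return the serialisation it produces."""
    pos = 0
    out = []
    for item in rule:
        kind = item[0]
        if kind == "+":
            out.append(item[1])
        elif kind == "-":
            if not text.startswith(item[1], pos):
                raise ValueError(f"expected {item[1]!r} at position {pos}")
            pos += len(item[1])
        elif kind == "d":
            if pos >= len(text) or text[pos] not in "0123456789":
                raise ValueError(f"expected a digit at position {pos}")
            out.append(text[pos])
            pos += 1
        else:
            raise ValueError(f"unknown item kind {kind!r}")
    if pos != len(text):
        raise ValueError(f"unexpected input at position {pos}")
    return "".join(out)


print(run(DATE_RULE, "07/11/25"))
```

All the markup in the result comes from the rule itself; the sketch has no XML-specific serialisation step.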
As the paper remarks:
the ixml processor is now unbound from XML, and could be used to produce other serialisations in a fairly straightforward way.
This paper takes that observation, and explores the design of a generalised version of ixml that allows serialisation not only to XML, but also to other structured data formats, such as JSON [json], and ABC structured data [abc].
Design
The purpose of the generalised invisible markup language is thus to parse a textual input and produce an abstract document, just as with ixml, but then to leave the serialisation solely to details in the grammar specification.
To explore the possibilities, let us take an ixml definition of a simplified URL as an example, but treat it in some different ways.
url: scheme, -":", authority, path.
scheme: letter+.
authority: -"//", host.
host: sub++".".
sub: letter+.
path: (-"/", seg)+.
seg: fletter*.
-letter: ["a"-"z"; "A"-"Z"; "0"-"9"].
-fletter: letter; ".".
If we use this to process the string
https://invisiblexml.org/1.0/
with ixml, we get the serialisation
<url>
<scheme>https</scheme>
<authority>
<host>
<sub>invisiblexml</sub>
<sub>org</sub>
</host>
</authority>
<path>
<seg>1.0</seg>
<seg/>
</path>
</url>
Let us now assume a version of ixml that doesn't serialise to XML, but instead requires all serialisation characters to come from the grammar. To duplicate what we have above would just require adding the start and end tags to the relevant rules:
url: +"<url>", scheme, -":", authority, path, +"</url>".
scheme: +"<scheme>", letter+, +"</scheme>".
(etc)
There is one difference, namely this would generate
<seg></seg>
for the final empty segment. If the short version were required, then you could write:
-seg: +"<seg>", fletter+, +"</seg>"; +"<seg/>".
If we wanted to make scheme an attribute, we could write:
-url: +"<url", scheme, +">", -":", authority, path, +"</url>".
-scheme: +" scheme='", letter+, +"'".
This would then generate
<url scheme='https'>
as the first line of the serialisation.
But now we have some more freedom than we did in ixml. For instance we could use the scheme as the name of the element:
-url: +"<", scheme, +">", -":", authority, path, +"</", +scheme, +">".
-scheme: letter+.
which would give:
<https>
<authority>
<host>
<sub>invisiblexml</sub>
<sub>org</sub>
</host>
</authority>
<path>
<seg>1.0</seg>
<seg/>
</path>
</https>
Note that a facility proposed in the roundtripping paper has been used here: applying + to a nonterminal, meaning "parse nothing, but serialise the nonterminal of this name at this position". While this facility was introduced in that paper to deal with attributes that appear in a different position in the serialisation from their position in the parse, it can now be used to introduce structural differences in the serialisation; in the XML case, for instance, the names of elements may come from the input.
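The idea of parsing a nonterminal once and serialising it in more than one place can be sketched in Python as follows. This is only an approximation under stated assumptions: a regular expression stands in for the simplified URL grammar, and the names `URL` and `serialise` are hypothetical, not part of any ixml processor.

```python
import re

# Assumption: a regular expression approximating the simplified URL grammar,
# with one named group per nonterminal that the serialisation refers to.
URL = re.compile(
    r"(?P<scheme>[A-Za-z0-9]+)://"
    r"(?P<host>[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*)"
    r"(?P<path>(?:/[A-Za-z0-9.]*)+)"
)


def serialise(url):
    """Serialise a simplified URL, using the scheme as the element name."""
    m = URL.fullmatch(url)
    if m is None:
        raise ValueError(f"not a simplified URL: {url!r}")
    scheme = m.group("scheme")
    subs = "".join(f"<sub>{s}</sub>" for s in m.group("host").split("."))
    segs = "".join(
        f"<seg>{s}</seg>" if s else "<seg/>"
        for s in m.group("path").split("/")[1:]
    )
    # scheme is matched once but written twice, like +scheme in the rule.
    return (
        f"<{scheme}>"
        f"<authority><host>{subs}</host></authority>"
        f"<path>{segs}</path>"
        f"</{scheme}>"
    )


print(serialise("https://invisiblexml.org/1.0/"))
```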
Readability
In ixml each rule combines both the form of the input and that of the serialisation, but since the two are very closely related, and the reordering of attributes only happens implicitly, it is easy to see what is input, and what is output.
However, since the new version requires more to be included in each rule, rules quickly become hard to read. For this reason, a different form for rules is proposed here, one that separates input and output:
url: scheme, ":", authority, path
=> "<", scheme, ">", authority, path, "</", scheme, ">".
scheme: letter+.
What should immediately be obvious here is that all 'marks' driving the
serialisation have been made unnecessary and therefore have been eliminated. It
also makes it much clearer what the format of the input is, and the resulting
serialisation. If input and output are identical, such as in
scheme above, there is no need to specify a separate output part
of the rule. Here is the rest:
authority: "//", host => "<authority>", host, "</authority>".
host: sub++"." => "<host>", sub+, "</host>".
sub: letter+ => "<sub>", letter+, "</sub>".
path: ("/", seg)+ => "<path>", seg+, "</path>".
seg: fletter* => "<seg>", fletter*, "</seg>".
letter: ["a"-"z"; "A"-"Z"; "0"-"9"].
fletter: letter; ".".
If a rule should produce no output then an empty output part is specified:
input: spaces, url, spaces.
spaces: " "* => .
Note that each alternative needs an output part, not just the whole rule:
input: entry++" " => "<entries>", entry+, "</entries>".
entry: number => "<number>", number, "</number>";
word => "<word>", word, "</word>".
number: ["0"-"9"]+.
word: [L]+.
In some cases, alternatives that share the same output part can be combined. For instance,
date: day, "/", month, "/", year => day, month, year;
year, "-", month, "-", day => day, month, year.
can be combined to
date: (day, "/", month, "/", year;
year, "-", month, "-", day) => day, month, year.
In some cases, the same nonterminal occurs more than once, such as
s in this example:
mapping: name, ":", s, values, ".", s
=> "<", name, ">", s, values, s, "</", name, ">".
The behaviour is that the occurrences in the input are used in turn by the occurrences of the same name in the output; if there are more occurrences in the output than in the input, then it restarts from the first.
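The "taken in turn, restarting from the first" behaviour can be sketched in Python as below. The representation of the parse result as a list of (name, text) pairs and of the output part as ("lit", text) or ("nt", name) items is an assumption made for illustration, as is the function name `fill_output`.

```python
from itertools import cycle


def fill_output(parsed, output):
    """Build the serialisation for one rule.

    parsed: list of (name, text) pairs, in input order.
    output: list of ("lit", text) or ("nt", name) items.
    """
    by_name = {}
    for name, text in parsed:
        by_name.setdefault(name, []).append(text)
    # One cursor per nonterminal; cycle() restarts from the first occurrence
    # when the output uses a name more often than the input supplied it.
    cursors = {name: cycle(texts) for name, texts in by_name.items()}
    pieces = []
    for kind, value in output:
        if kind == "lit":
            pieces.append(value)
        elif kind == "nt":
            if value not in cursors:
                raise ValueError(f"{value!r} does not occur in the input")
            pieces.append(next(cursors[value]))
        else:
            raise ValueError(f"unknown output item kind {kind!r}")
    return "".join(pieces)


# The mapping rule above, applied to a hypothetical input "colours: red green.\n".
MAPPING_PARSED = [("name", "colours"), ("s", " "), ("values", "red green"), ("s", "\n")]
MAPPING_OUTPUT = [
    ("lit", "<"), ("nt", "name"), ("lit", ">"),
    ("nt", "s"), ("nt", "values"), ("nt", "s"),
    ("lit", "</"), ("nt", "name"), ("lit", ">"),
]
print(repr(fill_output(MAPPING_PARSED, MAPPING_OUTPUT)))
```

Here `name` occurs once in the input but twice in the output, so the second use restarts from the first occurrence, while the two `s` occurrences are used in order.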
If it is necessary to use a specific nonterminal, extra rules can be added as required:
person: name1, " ", name2 => name2, ", ", name1.
name1: name.
name2: name.
name: [L]+.
This would turn Steven Pemberton into Pemberton,
Steven.
Roundtripping
An interesting aspect of this representation is that the syntax of the input and the syntax of the output are now described separately. This means that roundtripping becomes trivial: you just swap the input and output sides, so that you parse with the output part and serialise with the input part.
date: day, "/", month, "/", year
=> "<date>", year, month, day, "</date>".
day: d, d => "<day>", d, d, "</day>".
month: d, d => "<month>", d, d, "</month>".
year: d, d => "<year>", "20", d, d, "</year>".
d: ["0"-"9"].
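As a rough illustration of swapping the two sides, the following Python sketch flattens the date rules above into a single pair of templates and uses a regular expression in place of a real parser. The template encoding, the two-digit `FIELD_PATTERN`, and the names `INPUT_SIDE`, `OUTPUT_SIDE` and `convert` are all assumptions for the purpose of the example.

```python
import re

# Assumption: every nonterminal in this flattened rule is exactly two digits.
FIELD_PATTERN = r"[0-9][0-9]"

# The date rules above, flattened by hand into one input and one output side.
INPUT_SIDE = [("nt", "day"), ("lit", "/"), ("nt", "month"), ("lit", "/"), ("nt", "year")]
OUTPUT_SIDE = [
    ("lit", "<date><year>20"), ("nt", "year"),
    ("lit", "</year><month>"), ("nt", "month"),
    ("lit", "</month><day>"), ("nt", "day"),
    ("lit", "</day></date>"),
]


def convert(text, parse_side, serialise_side):
    """Parse text with one side of the rule and serialise with the other."""
    pattern = "".join(
        f"(?P<{value}>{FIELD_PATTERN})" if kind == "nt" else re.escape(value)
        for kind, value in parse_side
    )
    match = re.fullmatch(pattern, text)
    if match is None:
        raise ValueError(f"input does not match: {text!r}")
    return "".join(
        match.group(value) if kind == "nt" else value
        for kind, value in serialise_side
    )


xml = convert("07/11/25", INPUT_SIDE, OUTPUT_SIDE)   # forward direction
print(xml)
print(convert(xml, OUTPUT_SIDE, INPUT_SIDE))         # sides swapped
```

The same function serves both directions; only the order of the two arguments changes.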
Ambiguity
As with ixml, a parse may reveal ambiguity. While ixml does not define which parse to use, it may be useful to specify that the textually earliest serialisation be used. For instance, in
date: (day, "/", month, "/", year;
year, "-", month, "-", day) => day, month, year.
where roundtripping has a choice of two serialisations, it might be useful to be able to specify that the first of these is used. Further implementation work is needed to investigate this.
It is also worth noting that there is now no longer a place to specify that a parse was ambiguous. Implementations may have to resort to using a Unix-style error output channel to report errors and warnings of this sort.
Other Uses
Although the inspiration for iml is invisible markup, there are other uses that the language can be put to. For instance, general editing, such as replacing all separating semicolons with commas, but not those in strings:
input: item++";" => item++",".
item: string; word; number.
string: '"', ~['"']*, '"'.
word: [L]+.
number: [Nd]+.
or reversing a list of words:
words: word, s, words => words, " ", word;
word.
s: " "+.
word: [L]+.
In fact, iml could be used in many of the situations where sed and grep are used, with the additional advantage of being able to select on structure as well as text.
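For comparison, the semicolon-to-comma edit can be approximated in Python with regular expressions. This is a sketch under assumptions: the character classes only approximate the Unicode classes [L] and [Nd], and the names `ITEM`, `ITEM_LIST` and `semicolons_to_commas` are invented for the example.

```python
import re

# Assumption: approximations of the item rule.
#   "[^"]*"     a string, which may contain semicolons
#   [^\W\d_]+   a word (letters only)
#   \d+         a number (decimal digits)
ITEM = r'"[^"]*"|[^\W\d_]+|\d+'
ITEM_LIST = re.compile(rf"(?:{ITEM})(?:;(?:{ITEM}))*")


def semicolons_to_commas(text):
    """Replace the separating semicolons with commas, leaving strings intact."""
    if ITEM_LIST.fullmatch(text) is None:
        raise ValueError(f"not a semicolon-separated list of items: {text!r}")
    return ",".join(re.findall(ITEM, text))


print(semicolons_to_commas('a;"b;c";12'))
```

The semicolon inside the string survives because it is consumed as part of a string item rather than as a separator, which is the structural selection that a plain textual substitution would not provide.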
Other Structural Opportunities
In the original ixml paper [ixml0]Steven Pemberton, Invisible XML, Proceedings of Balisage: The Markup Conference 2013, Balisage Series on Markup Technologies, vol. 10, 2013. , it was remarked that
{"a": 1, "b": 2} cannot be transformed [with ixml] into <j><a>1</a><b>2</b></j>.
However, iml enables it to be done:
j: "{", member**", ", "}" => "<j>", member*, "</j>".
member: name, ": ", number => "<", name, ">", number, "</", name, ">".
name: '"', letters, '"' => letters.
letters: ["a"-"z"]+.
number: ["0"-"9"]+.
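The intended effect of these rules can be checked against a small Python sketch that hard-codes the same input format with regular expressions. It is only an illustration: `MEMBER`, `OBJECT` and `to_xml` are hypothetical names, and the patterns accept exactly the restricted notation of the grammar (lower-case names, unsigned integers, ", " between members), not JSON in general.

```python
import re

# One member: a quoted lower-case name, ": ", and an unsigned number.
MEMBER = r'"([a-z]+)": ([0-9]+)'
# Zero or more members separated by ", ", enclosed in braces.
OBJECT = re.compile(r"\{(?:" + MEMBER + r"(?:, " + MEMBER + r")*)?\}")


def to_xml(text):
    """Turn {"a": 1, "b": 2} into <j><a>1</a><b>2</b></j>."""
    if OBJECT.fullmatch(text) is None:
        raise ValueError(f"input does not match: {text!r}")
    members = re.findall(MEMBER, text)  # list of (name, number) pairs
    body = "".join(f"<{name}>{number}</{name}>" for name, number in members)
    return f"<j>{body}</j>"


print(to_xml('{"a": 1, "b": 2}'))
```

As in the iml rules, the element names are taken from the input rather than from the names of the rules.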
ixml as an application of iml
As promised, iml makes the implementation of ixml much easier, because all that is needed is grammar transformation: an ixml grammar is transformed into an iml grammar, in the same way that ixml round-tripping was done earlier.
The only special case is the need to deal with XML special characters, such as "<", in text. For instance, the ixml
line: c*, #a.
c: ~[#a].
needs to be transformed to
line: c*, #a => "<line>", c*, "</line>".
c: ~[#a; "<"]; "<" => "&lt;".
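The handling of "<" in text can be mirrored by a few lines of Python. This is a sketch only: `serialise_line` is a hypothetical name, and like the example it escapes just "<"; a complete transformation would need to treat the other XML special characters, such as "&", in the same way.

```python
def serialise_line(line):
    """Wrap one newline-terminated line in <line>...</line>, escaping "<"."""
    if not line.endswith("\n") or "\n" in line[:-1]:
        raise ValueError("expected exactly one line ending in a newline")
    # Each character is copied unchanged, except "<", which becomes "&lt;".
    body = "".join("&lt;" if ch == "<" else ch for ch in line[:-1])
    return f"<line>{body}</line>"


print(serialise_line("if a < b then\n"))
```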
Syntax of iml in ixml
Here then is the proposed syntax of iml, expressed in ixml. The syntax is close to that of ixml, though of course there are no marks any more. Differences include names being simpler since they no longer have to match the XML production of name, and 'restrictions' have been added.
iml: s, rule+.
rule: name, s, -":", s, alternatives, -".", s.
@name: [L; Nd]+.
-alternatives: alternative++(-";", s).
alternative: input, output?.
input: alt.
output: (-"=>"; -"⇒"), s, alt.
-alt: term**(-",", s).
-term: factor;
option;
restriction;
repeat.
-factor: terminal;
nonterminal;
group.
option: factor, -"?", s.
restriction: factor, -"!", s.
-repeat: repeat0;
repeat1.
repeat0: factor, (-"*", s; -"**", s, sep).
repeat1: factor, (-"+", s; -"++", s, sep).
sep: factor.
group: -"(", s, ^alt++(-";", s), -")", s.
nonterminal: name, s.
terminal: string, s;
encoded, s;
charset.
@string: -'"', dchar+, -'"';
-"'", schar+, -"'".
@encoded: "#", hex.
hex: ["0"-"9"; "a"-"f"; "A"-"F"]+.
-charset: inclusion;
exclusion.
inclusion: -"[", s, member**(-";", s), -"]", s.
exclusion: -"~[", s, member**(-";", s), -"]", s.
member: string, s; encoded, s; range; class, s.
-range: from, s, -"-", s, to, s.
@from: char.
@to: char.
-char: -'"', dchar, -'"';
-"'", schar, -"'";
encoded.
@class: [L],[L]?.
-dchar: ~['"'; Cc]; '"', -'"'.
-schar: ~["'"; Cc]; "'", -"'".
-s: (space; comment)*.
-space: -[Zs; #a; #9; #d].
-comment: -"{", c*, -"}".
-c: -~["{}"]; comment.
It is left as an exercise for the reader to rewrite this in iml format.