Lexicography versus XML
In lexicography, dictionary entries are usually encoded in XML. Typically, lexicographic XML contains a high degree of purely structural markup: elements whose only purpose is to group other elements together rather than to mark up human-readable text. This makes XML-encoded dictionary entries verbose and complex. In this talk, I will explain why this is so. Most lexicographic content, such as definitions, example sentences and translation equivalents, is inherently ‘headed’ and would most economically be represented as a triple (name + value + children), whereas in XML every element is only a pair (name + content, where the content is a value and/or children). This means that a headed content item cannot be represented with just one XML element: its value and its children each need an element of their own, plus a wrapper to hold them together, and that wrapper is purely structural markup. I will review strategies that are common in lexicography for dealing with this problem in XML and in other data languages such as JSON and YAML. I will conclude that none of these popular languages serves the needs of lexicography well, because none of them has good support for representing headed triples. The only languages that do are XML’s historical predecessor SGML (thanks to its markup minimization features) and a less well-known language called NVH, which was specifically designed for this purpose.
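To make the contrast concrete, here is a minimal Python sketch of a headed triple and two ways of writing it out. Everything in it is an illustrative assumption rather than something taken from the talk: the `Node` class, the `to_xml` and `to_indented` helpers, the `Container` suffix for wrapper elements, and the sample example sentence. The indented output only imitates the general name-value-hierarchy idea; the exact NVH syntax should be checked against its own documentation.

```python
from __future__ import annotations

import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Node:
    """A headed triple: a name, a value, and a list of child nodes."""
    name: str
    value: str
    children: list[Node] = field(default_factory=list)


def to_xml(node: Node) -> ET.Element:
    """Serialize a node to XML, adding a wrapper element when it has children.

    The '<name>Container' wrapper is a hypothetical naming convention; it
    carries no human-readable text and exists only to group elements.
    """
    if not node.children:
        element = ET.Element(node.name)
        element.text = node.value
        return element
    wrapper = ET.Element(node.name + "Container")  # purely structural
    head = ET.SubElement(wrapper, node.name)
    head.text = node.value
    for child in node.children:
        wrapper.append(to_xml(child))
    return wrapper


def to_indented(node: Node, depth: int = 0) -> str:
    """Serialize a node as indented 'name: value' lines, one line per item."""
    lines = ["  " * depth + f"{node.name}: {node.value}"]
    for child in node.children:
        lines.append(to_indented(child, depth + 1))
    return "\n".join(lines)


# An example sentence (the head) with its translation (the child).
example = Node("example", "I love you", [Node("translation", "Ich liebe dich")])

xml_text = ET.tostring(to_xml(example), encoding="unicode")
indented_text = to_indented(example)

print(xml_text)
print(indented_text)
```

In this sketch, two content items need three XML elements, one of which is a wrapper with no text of its own, while the indented form needs exactly one line per content item.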
Michal is a computational lexicographer. He is the author of the open-source dictionary writing system Lexonomy and the open-source terminology management platform Terminologue. He works on language technology projects for Dublin City University and for Foras na Gaeilge in Dublin, and has previously worked for Microsoft Ireland. He is currently based in Brno, Czech Republic, where he is writing a PhD dissertation at Masaryk University on the digitization of lexicography.