Syntax highlighting for code blocks using ixml

Pieter Lamers

John Benjamins Publishing

For a long time, John Benjamins Publishing Company has published books and journals using a production pipeline based on XML and the XML tool chain. Manuscripts are converted into JATS or BITS XML, which is then converted into ePub, PDF or paper. Earlier this year, we were processing a book that contained code fragments in the R programming language. The manuscript, in Microsoft Word, contained code blocks that used shading, font properties and colored characters for syntax highlighting. Apparently, these code blocks were copied and pasted from an editor that supports R syntax highlighting.

Encoding the syntax highlighting in JATS would require an understanding of the R language. Even then, manually encoding the syntax highlighting would be too laborious, so we ended up with <code> blocks with plain text and no syntax highlighting. Since the author insisted on syntax highlighting, and we did not want to add markup to each of the 509 <code> blocks manually, we looked for existing syntax highlighting software. A popular highlighter supporting R is highlight.js. Being under time pressure, we decided to make an XSLT transformation based on the highlight.js code. Highlight.js uses regular expressions to extract syntactic constructs, which may then be marked and styled. Our XSLT takes the same approach.
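As an illustration of the regular expression approach described above, here is a minimal sketch in Python. It is neither the highlight.js code nor our XSLT. The token classes, the tiny subset of R they cover, and the <span class="..."> wrapper are all assumptions made for this example.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical token classes for a very small subset of R.
# Order matters: at any position, the first alternative that matches wins.
TOKEN_SPEC = [
    ("comment", r"#[^\n]*"),
    ("string", r'"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\''),
    ("number", r"\b\d+(?:\.\d+)?(?:[eE][+-]?\d+)?L?\b"),
    ("keyword", r"\b(?:function|if|else|for|while|repeat|TRUE|FALSE|NULL)\b"),
]
TOKEN_RE = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC)
)


def highlight(code: str) -> str:
    """Wrap every recognized token in a <span> and escape everything else."""
    out = []
    pos = 0
    for m in TOKEN_RE.finditer(code):
        # Text between two tokens is copied through, XML-escaped.
        out.append(escape(code[pos:m.start()]))
        # m.lastgroup is the name of the token class that matched.
        out.append(f'<span class="{m.lastgroup}">{escape(m.group())}</span>')
        pos = m.end()
    out.append(escape(code[pos:]))
    return "".join(out)


print(highlight("x <- 42 # answer"))
```

Each pattern only sees a flat run of characters, so constructs that depend on nesting or context have to be approximated.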

Although this was sufficient for most small code fragments in a single book, it was clear from the beginning that regular expressions fall short when it comes to parsing programming languages. Therefore we want to see if it is possible to apply syntax highlighting using a real parser. This requires a grammar and a parser generator, and iXML is an obvious choice for the latter. In this presentation, we will show how to go from regular expression matching to iXML parsing, and why this is worth the effort.

We will show how the resulting markup can be styled with CSS, in compliance with the WCAG guidelines and the European Union's Web Accessibility Directive. An interesting situation occurs when a <code> element contains other elements. JATS allows this for making text bold or underlined, for embedding links or footnotes, or even for specifying syntax highlighting. Both regular expression matching and parsing have difficulty dealing with this situation, because they expect plain text as their input. We use an extension to invisible XML (transparent invisible XML, or tiXML) to make the embedded markup transparent to the parser and to keep it in the parse result.
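The general idea of hiding embedded markup from a text-based tool and putting it back afterwards can be sketched as follows. This is not how tiXML is implemented. It is a simplified Python illustration in which the function names are hypothetical and tags are found with a naive regular expression that ignores comments, CDATA sections, entity references, and attribute values containing ">".

```python
import re

# Naive tag matcher, good enough for simple inline elements only.
TAG_RE = re.compile(r"<[^>]+>")


def split_markup(mixed: str):
    """Return (plain_text, tags).

    tags is a list of (offset, tag) pairs, where offset is the position
    in plain_text at which the tag occurred in the mixed content.
    """
    text_parts = []
    tags = []
    pos = 0
    length = 0  # number of plain-text characters collected so far
    for m in TAG_RE.finditer(mixed):
        chunk = mixed[pos:m.start()]
        text_parts.append(chunk)
        length += len(chunk)
        tags.append((length, m.group()))
        pos = m.end()
    text_parts.append(mixed[pos:])
    return "".join(text_parts), tags


def restore_markup(text: str, tags) -> str:
    """Re-insert the recorded tags at their original text offsets."""
    out = []
    pos = 0
    for offset, tag in tags:
        out.append(text[pos:offset])
        out.append(tag)
        pos = offset
    out.append(text[pos:])
    return "".join(out)


mixed = "y = <bold>mean</bold>(x)"
plain, tags = split_markup(mixed)
print(plain)
print(restore_markup(plain, tags) == mixed)
```

A tool that expects plain text can work on the first return value. Combining its output with the recorded tags is the hard part, because token boundaries and the embedded elements may overlap, and this sketch does not attempt that.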

Presentation, 8 November 2024

Pieter Lamers has been in scholarly publishing for almost 30 years, focusing mainly on style, not substance. He both supervises the development of ICT solutions for his employer, John Benjamins Publishing Company, and participates in that development, which can be hard at times. He has also been making (mostly classical) music for over 40 years, as an amateur.