Self-generating quality control: A case study

Tomos Hillman (eXpertML Ltd) and Vincent Lizzi (Taylor & Francis Group)

This paper demonstrates how quality control infrastructure can be generated from a single requirements document. Taken from a recent project that is now being used in production at a large journal publisher, it discusses some of the challenges faced and techniques used when generating Schematron, XSpec tests, XML grammar checks, and documentation.

The project set out to implement quality control requirements for journal articles using Schematron. In pursuing this objective, the project also created quality control infrastructure for Schematron itself that streamlines the process for incorporating iterative changes to requirements.

The techniques used in this project and described in this paper may be generally applicable in other projects.

Table of contents

Abstract
Introduction
Requirements Analysis
The Requirements XML Format
Generating Schematron
Generating XSpec
Generating Documentation
QA beyond Schematron, within Schematron
Parsing XML in XSLT
Continuous Integration
Conclusions
Acknowledgements
References
Appendix
Biographical notes

Introduction

Taylor & Francis (T&F) is a specialist publisher that curates, produces and publishes scholarly research and reference-led content in specialist subject areas; among other content, T&F publishes 2,700 high-quality, peer-reviewed journals under imprints including Dove Medical Press, Cogent OA, and Routledge.

In 2019, T&F implemented a project to migrate their journals data from a customized version of JATS 1.0 to the JATS 1.2 standard. This was taken as an opportunity for a review of their QA requirements and a fresh implementation of their Schematron rule sets, and they reached out to eXpertML Ltd for technical assistance. The project required implementing more than 200 validation rules in Schematron. A second key requirement of the project was to create corresponding XSpec scenarios to ensure that the Schematron functions as expected.

Requirements Analysis

The early stages of the project included an analysis step in which a team of production staff drafted the tagging guidelines documentation and validation rules that would be used for JATS 1.2. During this analysis stage of the project, the production team divided up the work into a list of topics, tracked their progress on each topic in an Excel spreadsheet, used Microsoft Word as an authoring tool to capture the documentation and validation rules, and performed a peer review on each document. The team used a Microsoft Word Template that was designed for the goals of creating a documentation website and a Schematron. The template included space to capture:

Name - the name of the topic
Description - text to be used for the eventual documentation
Examples - including correct examples and incorrect counter examples
Validation rules arranged in a table that provided space for each rule to have:
- Message – A description of the rule to be used in the validation message. Ideally, the message would be written in a positive tone to point out a problem, provide direction to a solution, and possibly use a placeholder to include contextual information as part of the message.
- Severity Level – Either "error", "warning", or "information"
- Suggested Schematron phases – One or more of "current content", "scanned content", "converted content", "rendering alerts" (or all)
- Context(s) – The elements or attributes relevant to the validation rule
- Suggested XPath – An XPath test for the rule that evaluates to either true or false

The production team was asked to always provide the message, phase and level. The context and XPath were provided if known, and were usually included when copying an existing validation rule that had been used for JATS 1.0.

The Word documents resulting from the analysis step were the basis for implementing the Schematron rules and XSpec tests in later stages of the project. Although the human readable content of these documents was very thorough, their authors had varying levels of experience with XML/XPath syntax and Schematron. With an unrestricted authoring tool, multiple documents, and multiple authors, it was inevitable that many fields and examples were informal, incomplete, or inconsistent.

This was not entirely a negative thing, because allowing the production team to use pseudo-code or omit certain information at this stage of the project allowed the team to focus on describing what the tagging guidelines and validation rules needed to be, and trust that the details of implementation would be handled in later stages of the project.

The Requirements XML Format

In publishing, content manuscripts written in Microsoft Word are not uncommon: it is a tool which is familiar to many subject matter experts. Of course, it comes at a price: data that would inform the contents of the end deliverable can be obscured, and normally needs to be explicitly marked up.

Happily, producing an XML format from Word manuscripts is not an unfamiliar problem for those working in digital publishing, and an "XML Early" approach was taken!

The information in these specifications was taken and collected into a single XML document, which could then be copy edited, enriched, and used to generate several different deliverables. Since both the authors are longstanding proponents of XML early workflows and single source publishing, this turned out to be a fitting example of "practicing what we preach"!

Initially, the XML collection of requirements was not seen as a single source for publishing: the more immediate need was to collect the data from the specification documents; check them for errors; identify the sets of rule phases, contexts, and severities; and to eliminate the discrepancies that naturally arise when using an unrestricted authoring tool.

To this end, a fiat requirements format was devised with a simple Relax NG grammar (see Listing 1), and some Schematron (see Listing 2). The grammar organises the requirements by topic (and therefore original Word specification), then by rule. Topics also include shared examples and counter-examples. Rules include the user message, phases, level, contexts, and XPath provided in the specifications, as well as a notes field to capture any other human readable information that may be relevant.

Figure 1. Requirements XML

Validation rules could then be used to identify initial work tasks such as:

Missing data (e.g. missing phases for rule definitions)
Missing @xml:id on several elements
Standardising to single quotes in context predicates (this prevents errors when the context XPath is used within a schematron attribute later)
Removal of extraneous namespace prefixes
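Checks of this kind are easy to prototype outside Schematron. The following is a minimal sketch in Python, not the project's implementation (which used Schematron, SQF and XSLT). It assumes the element names that appear in the requirements XML elsewhere in this paper (topic, rule, Phases, contexts); the sample data and the function names find_issues and normalise_quotes are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; element names follow the requirements XML in this paper.
SAMPLE = """
<schematron>
  <topic id="JATS-0001">
    <title>Dates</title>
    <rule id="JATS-0001-001">
      <Message>Complete rule</Message>
      <Phases><phase>current content</phase></Phases>
      <Level>Error</Level>
      <contexts><context>date[@date-type='pub']</context></contexts>
    </rule>
    <rule id="JATS-0001-002">
      <Message>Incomplete rule</Message>
      <Phases/>
      <Level>Warning</Level>
      <contexts><context>date[@date-type="pub"]</context></contexts>
    </rule>
  </topic>
</schematron>
"""


def find_issues(root):
    """Return (rule id, problem) pairs for rules that need tidying."""
    issues = []
    for rule in root.iter("rule"):
        rule_id = rule.get("id", "(no id)")
        # A rule with no <phase> children cannot be assigned to any phase.
        if not rule.findall("./Phases/phase"):
            issues.append((rule_id, "missing phases"))
        # Double quotes would clash with the attribute delimiters later on.
        for context in rule.findall("./contexts/context"):
            if '"' in (context.text or ""):
                issues.append((rule_id, "double quotes in context"))
    return issues


def normalise_quotes(context):
    """Naive fix: assumes the quoted literals contain no apostrophes."""
    return context.replace('"', "'")


if __name__ == "__main__":
    for rule_id, problem in find_issues(ET.fromstring(SAMPLE)):
        print(rule_id, problem)
```

Reporting problems separately from fixing them mirrors the split between validation and quick fixes described in this section.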

There are useful tools in the XML tool chain that can be leveraged as part of this tidying: Schematron Quick Fix (SQF) was used as the data was being captured for some automated corrections; further fixes were implemented using an XSLT script, minimising the manual corrections needed.

As initial data was captured and tidied, additional schematron phases were added to test completion of various added-value components such as the XSpec test definitions; the use of phases meant that work to add features could be tested separately from corrective work, allowing developers to focus their efforts.

Investing this effort early in the project paid dividends at each subsequent stage. The requirements XML format was embraced as a living master that could be adapted and extended based on the needs of the project. It turned out to have a number of uses, and can be transformed to generate outputs including: Schematron, XSpec, issue tracker tickets on a rule-by-rule basis, and documentation for the Schematron rules.

Generating Schematron

The general approach to the generated schematron is to turn the rules in the requirements document into a single set of abstract rules, grouped in a single schematron pattern. A set of patterns is then created for every unique combination of phases in the document. Phases include all of the patterns that apply to that phase: a pattern that refers to multiple phases will appear in each of those phases. There was also a special pattern for wildcarded contexts (like the document root /, * and so on), which were configured to apply to given phases. The contexts relevant to each pattern are then used to generate rules which extend the corresponding abstract rules.
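The grouping step described above can be illustrated with a short sketch. This is Python rather than the XSLT used in the project, and the data structure (a mapping from rule ID to phase names) and the function names are assumptions made for illustration only.

```python
from collections import defaultdict


def group_into_patterns(rules):
    """Group rule ids by their unique combination of phases.

    rules: mapping of rule id -> iterable of phase names.
    Returns a mapping of frozenset(phases) -> list of rule ids.
    """
    patterns = defaultdict(list)
    for rule_id, phases in rules.items():
        patterns[frozenset(phases)].append(rule_id)
    return dict(patterns)


def phases_to_patterns(patterns):
    """For each phase, list every phase combination (pattern) active in it."""
    active = defaultdict(list)
    for combination in patterns:
        for phase in combination:
            active[phase].append(combination)
    return dict(active)


if __name__ == "__main__":
    rules = {
        "r1": ["current content", "scanned content"],
        "r2": ["scanned content", "current content"],
        "r3": ["converted content"],
    }
    patterns = group_into_patterns(rules)
    for phase, combos in phases_to_patterns(patterns).items():
        print(phase, [sorted(c) for c in combos])
```

Using an unordered set as the grouping key means that two rules listing the same phases in a different order end up in the same pattern.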

There was an anticipated challenge in this approach: contexts in schematron rules work because they are converted to XSLT template match statements. As is familiar in XSLT, a given node may be matched by multiple templates, and the more ‘specific’ match statement ‘wins’. Therefore, a specific rule context may mean that rules which are applied to similar, simpler contexts are not inherited, and do not apply where they should.

This problem is solved by ensuring that for a given context, any abstract rules which would apply to a simpler context are also extended. This is achieved by parsing the context statements using an XSLT parser for the XPath language generated by Gunther Rademacher’s excellent ‘REx’ parser generator (REx). This returns the XPath statement as an XML tree, which can be processed to remove location steps and predicates, returning a set of possible combinations. These simpler match statements can then be checked using an XSLT key. The goal here is not to return a fool-proof definitive list of every possible simplified match statement, but every match statement which is likely to be used: some responsibility must be retained when choosing the contexts for rules!
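To make the idea of simplified contexts concrete, here is a rough sketch in Python. It does not use a grammar-based parser as the project does: it assumes relative paths made of child steps only, and it ignores the possibility of brackets or slashes inside string literals. All function names are hypothetical.

```python
from itertools import product


def split_steps(context):
    """Split a simple path on '/' characters that lie outside predicates."""
    steps, depth, current = [], 0, ""
    for ch in context:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == "/" and depth == 0:
            steps.append(current)
            current = ""
        else:
            current += ch
    steps.append(current)
    return steps


def strip_predicates(step):
    """Remove every predicate from a step, e.g. date[@a='b'] -> date."""
    out, depth = "", 0
    for ch in step:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        elif depth == 0:
            out += ch
    return out


def simpler_contexts(context):
    """Drop leading steps and/or predicates, in every combination."""
    steps = split_steps(context)
    results = set()
    for start in range(len(steps)):
        options = [{step, strip_predicates(step)} for step in steps[start:]]
        for combination in product(*options):
            results.add("/".join(combination))
    results.discard(context)
    return results


def inherited_contexts(context, known_contexts):
    """Simpler contexts that other rules actually use (the 'key lookup')."""
    return simpler_contexts(context) & set(known_contexts)


if __name__ == "__main__":
    print(sorted(simpler_contexts("event/date[@date-type='pub']")))
```

The intersection with the known contexts plays the role of the XSLT key: only simplifications that some rule really uses lead to extra abstract rules being extended.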

Generating XSpec

Generating XSpec to test for consistency over time was much more straightforward.

The examples and counter-examples in the requirements XML were exported to separate files using <xsl:result-document>. XSpec scenarios were created for each of the example files, with the file as the context of the tests. Expectation tests were created from data linking rules to examples in the requirements document.
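As an illustration of what the generated XSpec might look like, the sketch below builds one scenario per example file using Python's ElementTree. The project generated its XSpec with XSLT; here the file naming scheme (examples/TOPIC-EG.xml), the assumption that the Schematron assertion ID equals the rule ID, and the function name build_xspec are all hypothetical. The x:scenario, x:context, x:expect-assert and x:expect-not-assert elements are those provided by XSpec for testing Schematron.

```python
import xml.etree.ElementTree as ET

XSPEC_NS = "http://www.jenitennison.com/xslt/xspec"


def build_xspec(schematron_href, topic_id, rules):
    """Build an x:description with one scenario per example file.

    rules: list of (rule_id, [(kind, eg, location_or_None), ...]) where kind
    is 'expect-assert' or 'expect-not-assert', as in the <xspec> element of
    the requirements XML.
    """
    ET.register_namespace("x", XSPEC_NS)
    description = ET.Element(
        f"{{{XSPEC_NS}}}description", {"schematron": schematron_href}
    )
    # Collect the expectations of every rule under the example they refer to.
    by_example = {}
    for rule_id, expectations in rules:
        for kind, eg, location in expectations:
            by_example.setdefault(eg, []).append((kind, rule_id, location))
    for eg, tests in by_example.items():
        scenario = ET.SubElement(
            description, f"{{{XSPEC_NS}}}scenario", {"label": f"{topic_id}-{eg}"}
        )
        # Hypothetical location of the exported example file.
        ET.SubElement(
            scenario, f"{{{XSPEC_NS}}}context",
            {"href": f"examples/{topic_id}-{eg}.xml"},
        )
        for kind, rule_id, location in tests:
            attributes = {"id": rule_id}
            if location:
                attributes["location"] = location
            ET.SubElement(scenario, f"{{{XSPEC_NS}}}{kind}", attributes)
    return description


if __name__ == "__main__":
    rules = [("JATS-0043-002", [("expect-not-assert", "e001", None),
                                ("expect-assert", "ce003", None)])]
    tree = build_xspec("jats.sch", "JATS-0043", rules)
    print(ET.tostring(tree, encoding="unicode"))
```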

Schematron rules validating the requirements document itself ensure that each rule has at least one 'true positive' and one 'true negative' test; a valid XSpec test of the resulting schematron can then be used as a regression test as part of a continuous integration workflow.
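The coverage requirement can be expressed as a simple check. The sketch below is written in Python for illustration (the project performs this check in Schematron); it assumes the xspec, expect-assert and expect-not-assert elements used in the requirements XML in this paper, and the function name rules_lacking_tests is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data in the shape of the requirements XML.
SAMPLE = """
<topic id="JATS-0001">
  <rule id="JATS-0001-001">
    <xspec>
      <expect-not-assert eg="e001"/>
      <expect-assert eg="ce001"/>
    </xspec>
  </rule>
  <rule id="JATS-0001-002">
    <xspec>
      <expect-not-assert eg="e001"/>
    </xspec>
  </rule>
  <rule id="JATS-0001-003"/>
</topic>
"""


def rules_lacking_tests(root):
    """Map rule id -> list of missing expectation kinds.

    'expect-assert' on a counter-example is the true positive test;
    'expect-not-assert' on a valid example is the true negative test.
    """
    lacking = {}
    for rule in root.iter("rule"):
        missing = [
            kind
            for kind in ("expect-assert", "expect-not-assert")
            if rule.find(f"./xspec/{kind}") is None
        ]
        if missing:
            lacking[rule.get("id")] = missing
    return lacking


if __name__ == "__main__":
    print(rules_lacking_tests(ET.fromstring(SAMPLE)))
```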

Generating Documentation

Generating simple HTML documentation for the Schematron rules was a straightforward return on the investment of work earlier in the project.

The documentation indexes rules so that they can be found by context element or by rule ID. The XPath parser could be used not only to show applicable rules from other contexts, but also to order the context index so that e.g. event/date and date contexts appeared together.

The biggest advantage is that no one needs to write this sort of documentation by hand again, and because it is regenerated from the requirements XML it does not fall out of date.
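One way to obtain such an ordering is to sort contexts by their location steps read from right to left, so that contexts ending in the same element sort next to each other. The sketch below shows this in Python; it is only an assumption about how the ordering could be done, it splits naively on "/" (so predicates containing a slash are not handled), and the function name index_order is hypothetical.

```python
def index_order(contexts):
    """Sort contexts by their steps in reverse, so 'date' and 'event/date' meet."""
    return sorted(contexts, key=lambda context: context.split("/")[::-1])


if __name__ == "__main__":
    print(index_order(["event/date", "article", "date", "contrib"]))
```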

QA beyond Schematron, within Schematron

Some of the rules that we wanted to validate using Schematron were beyond the normal use cases for Schematron, such as checking the DOCTYPE and constraining content models with a grammar.

We were motivated to implement these rules in Schematron because the Schematron would be shared to ensure that the exact same validation rules are used at our suppliers and in our internal systems. Also, the overall reason for creating these validation rules, and in general for using Schematron with JATS, arose from experiences of resolving problems in JATS XML files and a desire to save time in the long run by preventing certain problems from happening in the first place.

Parsing XML in XSLT

One challenge that we faced is that Schematron normally only sees an XML document after it has been through an XML parsing process that discards the prologue of the XML document and fills in missing details with information provided by a DTD. However, Schematron is based on XPath and XSLT which are powerful programming languages capable of handling the logic for these validation rules. The starting point is to gain access to the XML that has not yet been parsed.

The unparsed XML can be accessed using the XPath function unparsed-text(), which reads text from a URI, along with the function base-uri(), which provides the physical location of the current XML document (①). Alternatively, the host program can provide the unparsed XML in a string parameter to the Schematron (②).

The string is first tested to see if it begins with a left angle bracket, as required by the XML grammar, and a Schematron assertion fails if it does not (③). Next, the string is parsed into a tree model and held in a variable so that assertions can be tested (④).

② <sch:let name="unparsedXml" value="''"/>
① <sch:let name="unparsedXmlString" value="if (string-length($unparsedXml) gt 0) then $unparsedXml else unparsed-text(base-uri())"/>
③ <sch:let name="unparsedXmlAvailable" value="matches($unparsedXmlString, '^\s*&lt;')"/>
④ <sch:let name="parsed" value="xmlstart:parse-document($unparsedXmlString)"/>
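The same fallback logic, expressed as a small Python sketch for readers less familiar with XPath: prefer a string supplied by the host program, otherwise read the file from disk, then check that the text starts with a left angle bracket. The function names are hypothetical, and reading the file as UTF-8 is an assumption (unparsed-text() applies its own encoding detection).

```python
import re
from pathlib import Path


def unparsed_xml_string(unparsed_xml, document_path):
    """Use the host-supplied string if present, else read the document itself."""
    if len(unparsed_xml) > 0:
        return unparsed_xml
    # Assumption: the file is UTF-8 encoded.
    return Path(document_path).read_text(encoding="utf-8")


def unparsed_xml_available(text):
    """True if the text begins (after optional whitespace) with '<'."""
    return re.match(r"[ \t\r\n]*<", text) is not None


if __name__ == "__main__":
    text = unparsed_xml_string("<?xml version='1.0'?><article/>", "unused.xml")
    print(unparsed_xml_available(text))
```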

We wanted the validation rules to test assertions on the beginning of the XML document, including the character encoding found in the XML declaration, the DTD document type declaration, and the attributes on the root element. These assertions can be initiated by using the document node for the context of a Schematron rule. But then, how do we interrogate the document’s prologue to test these assertions?

One option that we considered was to use regular expressions to parse the XML string. This option would have required a series of regular expressions, and it is not as simple as it might appear at first to create regular expressions that respect the syntax of XML.

Fortunately, the REx Parser Generator created by Gunther Rademacher can generate a grammar-based parser in XSLT. Using a grammar-based parser had the advantages of adhering to the syntax of XML and producing a tree model of the prologue of an XML document that can easily be interrogated using XPath.

We used the EBNF grammar for XML 1.0 and reduced it to only include the beginning parts of an XML document: the XML declaration (①), DOCTYPE declaration (②), comments, processing instructions, and the root element’s start tag and attributes (③). In this reduced grammar, no internal DTD is allowed and everything after the start tag of the root element is ignored. This EBNF grammar is provided in Listing 3.

① <?xml version="1.0" encoding="UTF-8"?>
② <!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD with OASIS Tables with MathML3 v1.2 20190208//EN" "https://jats.nlm.nih.gov/archiving/1.2/JATS-archive-oasis-article1-mathml3.dtd">
③ <article article-type="" xml:lang="en"
     xmlns:mml="http://www.w3.org/1998/Math/MathML"
     xmlns:oasis="http://www.niso.org/standards/z39-96/ns/oasis-exchange/table"
     xmlns:xlink="http://www.w3.org/1999/xlink"
     xmlns:ali="http://www.niso.org/schemas/ali/1.0/"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"/>

Testing the DOCTYPE declaration

Once the prologue of the XML document has been parsed into a tree model and stored in a variable, assertions about the DOCTYPE declaration can be tested with simple XPath expressions.

The test for the document type declaration checks for an expected set of values in the root element, public identifier, and system URI. We are using 2 DTDs: the JATS 1.2 Archiving DTD and the Atypon Issue XML DTD, so the XPath expression tests for both options. The reason for testing the document type declaration is that an incorrect public identifier or system URI can cause the parsing of an XML document to fail, or can cause space characters in text to be mishandled because the DTD is not available to identify element content models that allow a mixture of text and elements.

$parsed//doctypedecl
  Name = "article"
  ExternalID/PubidLiteral = "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD with OASIS Tables with MathML3 v1.2 20190208//EN"
  ExternalID/SystemLiteral = "https://jats.nlm.nih.gov/archiving/1.2/JATS-archive-oasis-article1-mathml3.dtd"
Or
  Name = "issue-xml"
  ExternalID/PubidLiteral = "-//Atypon//DTD Atypon JATS Journal Archiving and Interchange Issue XML DTD v1.1 20160222//EN"
  ExternalID/SystemLiteral = "http://cats.informa.com/tfjats/1.2/dtd/atypon-jats-v1.1-issue.dtd"

In the requirements XML, the actual implementation of this rule is as follows:

<rule id="JATS-0043-002">
  <Message>DOCTYPE should have article or issue-xml with required public identifier and system identifier. Incorrect <sch:value-of select="$doctype"/></Message>
  <Phases>
    <phase>AllElements</phase>
  </Phases>
  <Level>Error</Level>
  <contexts>
    <context>/</context>
  </contexts>
  <sch:let name="unparsedXmlString" value="if (string-length($unparsedXml) gt 0) then $unparsedXml else unparsed-text(base-uri())"/>
  <sch:let name="unparsedXmlAvailable" value="matches($unparsedXmlString, '^\s*&lt;')"/>
  <sch:let name="parsed" value="local:xml-start($unparsedXmlString)"/>
  <sch:let name="doctype" value="$parsed//doctypedecl"/>
  <xpath>not($unparsedXmlAvailable)
    or ($doctype/Name = 'article'
        and $doctype/ExternalID/PubidLiteral = '"-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD with OASIS Tables with MathML3 v1.2 20190208//EN"'
        and $doctype/ExternalID/SystemLiteral = '"https://jats.nlm.nih.gov/archiving/1.2/JATS-archive-oasis-article1-mathml3.dtd"')
    or ($doctype/Name = 'issue-xml'
        and $doctype/ExternalID/PubidLiteral = '"-//Atypon//DTD Atypon JATS Journal Archiving and Interchange Issue XML DTD v1.1 20160222//EN"'
        and $doctype/ExternalID/SystemLiteral = '"http://cats.informa.com/tfjats/1.2/dtd/atypon-jats-v1.1-issue.dtd"')
  </xpath>
  <xspec>
    <expect-not-assert eg="e001"/>
    <expect-not-assert eg="e002"/>
    <expect-assert eg="ce003"/>
  </xspec>
</rule>

<example xml:space="preserve" format="text" id="JATS-0043-e001"><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD with OASIS Tables with MathML3 v1.2 20190208//EN" "https://jats.nlm.nih.gov/archiving/1.2/JATS-archive-oasis-article1-mathml3.dtd">
<article article-type="" xml:lang="en" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:oasis="http://www.niso.org/standards/z39-96/ns/oasis-exchange/table" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:ali="http://www.niso.org/schemas/ali/1.0/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"/>
]]></example>

<example xml:space="preserve" format="text" id="JATS-0043-e002"><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE issue-xml PUBLIC "-//Atypon//DTD Atypon JATS Journal Archiving and Interchange Issue XML DTD v1.1 20160222//EN" "http://cats.informa.com/tfjats/1.2/dtd/atypon-jats-v1.1-issue.dtd">
<issue-xml xml:lang="en" xmlns:xlink="http://www.w3.org/1999/xlink"/>
]]></example>

<counter-example xml:space="preserve" format="text" id="JATS-0043-ce003"><![CDATA[
<!DOCTYPE article>
<article/>
]]></counter-example>

Testing physical presence of defaulted attributes

One of our validation requirements was to verify that attributes on the article root element are physically present in the XML document and not provided by default values in the JATS DTD.

For instance, the xml:lang attribute on the article element identifies the primary language of the document. The JATS DTD provides a default value; however, it is preferable, according to the Web Content Accessibility Guidelines (WCAG 2.1 section 3.1.1), to have the primary language of a document be declared rather than to assume that the document is in English. The XML parser generated by REx only uses the grammar to parse the XML; it cannot access the DTD to find out which attributes have default values. This means the parsed tree model can be tested with a simple XPath to verify whether the xml:lang attribute is physically present.

$parsed/element/Attribute[Name = 'xml:lang']

Another rule is used to check the value of all xml:lang attributes using a list of ISO language codes.

The same technique was also used to verify that namespace definitions are physically present on the article root element. The JATS DTD provides fixed attribute defaults for namespace definitions on the article element.

These defaults are helpful; however, if the namespace definitions are not present in the XML document and the XML parser also does not fill in the fixed attribute defaults from the DTD, then parsing the XML document typically fails. Therefore, we want to ensure that all namespaces that are used in an XML document have namespace definitions physically present on the root element.

The first such test is for the XLink namespace. Nearly every JATS XML document contains XLink attributes, so we can simply test the parsed tree to verify that the XLink namespace attribute is present.

$parsed/element/Attribute[Name = 'xmlns:xlink' and AttValue = '"http://www.w3.org/1999/xlink"']

For other namespaces that are allowed in JATS, we first test the XML document to see if any elements or attributes are present that use the particular namespace and then test the parsed tree to verify that the namespace definition is physically present on the article root element.

(not(.//mml:*) or $parsed/element/Attribute[Name = 'xmlns:mml' and AttValue = '"http://www.w3.org/1998/Math/MathML"'])
(not(.//oasis:*) or $parsed/element/Attribute[Name = 'xmlns:oasis' and AttValue = '"http://www.niso.org/standards/z39-96/ns/oasis-exchange/table"'])
(not(.//ali:*) or $parsed/element/Attribute[Name = 'xmlns:ali' and AttValue = '"http://www.niso.org/schemas/ali/1.0/"'])
(not(.//@xsi:*) or $parsed/element/Attribute[Name = 'xmlns:xsi' and AttValue = '"http://www.w3.org/2001/XMLSchema-instance"'])

Testing character encoding

We defined validation rules for character encoding because problems with character encoding in an XML file can cause failures in the publishing pipeline or, in the worst cases, can make it all the way through to publication and then cause a problem somewhere down the line.

Three separate Schematron rules are used to validate character encoding. First, the parsed XML tree is tested to verify that the prologue of the XML document declares the encoding of the file as Unicode UTF-8 or a subset of Unicode such as US-ASCII.

$parsed/prolog/XMLDecl/EncodingDecl/EncName = ('UTF-8', 'US-ASCII', 'ISO646-US')

The second rule tests the unparsed XML string using a regular expression to ensure that only an allowed set of widely compatible characters is physically present. Any Unicode characters that are used in the document and not in this allowed list must be escaped, preferably as numeric character references.

replace($unparsedXml, concat('([ &#x9;&#xA;*<>?=\-\./:#;+"&,/\\\[\]^_`\|{}@~%!$0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz', codepoints-to-string(39), ']+)+'), "")
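The allow-list technique can be sketched in a few lines of Python: strip every allowed character from the text, and whatever remains is the set of characters that should have been escaped. The allowed set used here (printable ASCII plus space, tab and newline) is an assumption for illustration and is not the project's exact list; the function names are hypothetical.

```python
import string

# Assumption: printable ASCII plus space, tab and newline are allowed.
ALLOWED = set(string.ascii_letters + string.digits + string.punctuation + " \t\n")


def disallowed_characters(text, allowed=ALLOWED):
    """Return the characters of text that are not on the allow list, in order."""
    return "".join(ch for ch in text if ch not in allowed)


def as_character_references(chars):
    """Render characters as decimal numeric character references."""
    return "".join(f"&#{ord(ch)};" for ch in chars)


if __name__ == "__main__":
    leftovers = disallowed_characters("<p>caf\u00e9 \u2013 ok</p>")
    print(as_character_references(leftovers))
```

Reporting the leftovers as numeric character references gives the person fixing the file the exact replacement text to use.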

The third rule tests all elements in the XML document using a regular expression to ensure that none of the characters that are present in the document are on a list of denied characters. The denied list includes Unicode control characters and Unicode private use areas.

context = *
test = string-join(for $t in (@*, node() except *) return analyze-string($t, "'([&#x7F;-&#x9F;]|[&#xE000;-&#xF8FF;]| … )'")//*:group)

<rule id="JATS-0011-001">
  <Message xml:space="preserve">The following <sch:value-of select="string-length($match)"/> characters that are present in the XML "<sch:value-of select="$match"/>" should be captured as numeric character references: <sch:value-of select="for $cp in string-to-codepoints($match) return concat('&amp;#', $cp, ';')"/></Message>
  <Phases>
    <phase>AllElements</phase>
  </Phases>
  <Level>Error</Level>
  <contexts>
    <context>/</context>
  </contexts>
  <sch:let name="text" value="if (string-length($unparsedXml) gt 0) then $unparsedXml else unparsed-text(base-uri())"/>
  <sch:let name="match" value="replace($text, concat('([ &#x9;&#xA;*&lt;>?=\-\./:#;+&quot;&amp;,/\\\[\]^_`\|{}@~%!$0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz', codepoints-to-string(39), ']+)+'), '')"/>
  <xpath>string-length($match) = 0</xpath>
  <xspec>
    <expect-assert eg="ce001"/>
    <expect-assert eg="ce003"/>
    <expect-not-assert eg="e001"/>
  </xspec>
</rule>

<rule id="JATS-0011-002">
  <Message>The XML should not contain any control characters, private use characters, or disallowed characters. Found <sch:value-of select="string-length($found)"/> invalid characters "<sch:value-of select="$found"/>" with codepoints <sch:value-of select="string-join(string-to-codepoints($found), ', ')"/></Message>
  <documentation>
    <p>See related documentation <a href="https://tfjats.gitlab.io/jats1.2/jats-guide/topics/chatacter-encoding-and-whitespace/">Character Encoding and Whitespace</a>. Characters that are in the following Unicode ranges are not allowed:</p>
    <table class="table">
      <tr><th>From</th><th>To</th><th>Description</th></tr>
      <tr><td>U+0000</td><td>U+0008</td><td rowspan="3">Control characters that are not allowed by the W3C XML Recommendation</td></tr>
      <tr><td>U+000B</td><td>U+000C</td></tr>
      <tr><td>U+000E</td><td>U+001F</td></tr>
      <tr><td>U+007F</td><td>U+009F</td><td>Control characters in Unicode that also represent other characters in Windows-1252</td></tr>
      <tr><td>U+E000</td><td>U+F8FF</td><td rowspan="3">Private Use Area</td></tr>
      <tr><td>U+F0000</td><td>U+FFFFD</td></tr>
      <tr><td>U+100000</td><td>U+10FFFD</td></tr>
      <tr><td>U+FDD0</td><td>U+FDEF</td><td rowspan="17">Discouraged by the W3C XML Recommendation</td></tr>
      <tr><td>U+1FFFE</td><td>U+1FFFF</td></tr>
      <tr><td>U+2FFFE</td><td>U+2FFFF</td></tr>
      <tr><td>U+3FFFE</td><td>U+3FFFF</td></tr>
      <tr><td>U+4FFFE</td><td>U+4FFFF</td></tr>
      <tr><td>U+5FFFE</td><td>U+5FFFF</td></tr>
      <tr><td>U+6FFFE</td><td>U+6FFFF</td></tr>
      <tr><td>U+7FFFE</td><td>U+7FFFF</td></tr>
      <tr><td>U+8FFFE</td><td>U+8FFFF</td></tr>
      <tr><td>U+9FFFE</td><td>U+9FFFF</td></tr>
      <tr><td>U+AFFFE</td><td>U+AFFFF</td></tr>
      <tr><td>U+BFFFE</td><td>U+BFFFF</td></tr>
      <tr><td>U+CFFFE</td><td>U+CFFFF</td></tr>
      <tr><td>U+DFFFE</td><td>U+DFFFF</td></tr>
      <tr><td>U+EFFFE</td><td>U+EFFFF</td></tr>
      <tr><td>U+FFFFE</td><td>U+FFFFF</td></tr>
      <tr><td>U+10FFFE</td><td>U+10FFFF</td></tr>
    </table>
  </documentation>
  <Phases>
    <phase>AllElements</phase>
  </Phases>
  <Level>Error</Level>
  <contexts>
    <context>*</context>
  </contexts>
  <sch:let name="regex" value="'([&#x7F;-&#x9F;]|[&#xE000;-&#xF8FF;]|[&#xF0000;-&#xFFFFD;]|[&#x100000;-&#x10FFFD;]|[&#xFDD0;-&#xFDEF;]|[&#x1FFFE;-&#x1FFFF;]|[&#x2FFFE;-&#x2FFFF;]|[&#x3FFFE;-&#x3FFFF;]|[&#x4FFFE;-&#x4FFFF;]|[&#x5FFFE;-&#x5FFFF;]|[&#x6FFFE;-&#x6FFFF;]|[&#x7FFFE;-&#x7FFFF;]|[&#x8FFFE;-&#x8FFFF;]|[&#x9FFFE;-&#x9FFFF;]|[&#xAFFFE;-&#xAFFFF;]|[&#xBFFFE;-&#xBFFFF;]|[&#xCFFFE;-&#xCFFFF;]|[&#xDFFFE;-&#xDFFFF;]|[&#xEFFFE;-&#xEFFFF;]|[&#xFFFFE;-&#xFFFFF;]|[&#x10FFFE;-&#x10FFFF;])'"/>
  <sch:let name="found" value="string-join(for $t in (@*, node() except *) return analyze-string($t, $regex)//*:group)"/>
  <xpath>string-length($found) = 0</xpath>
  <xspec>
    <expect-assert eg="ce002" location="/p"/>
    <expect-assert eg="ce003" location="/sec/p[1]"/>
    <expect-assert eg="ce003" location="/sec/p[2]"/>
    <expect-assert eg="ce003" location="/sec/p[3]"/>
    <expect-assert eg="ce003" location="/sec/p[4]"/>
    <expect-assert eg="ce003" location="/sec/p[5]"/>
    <expect-assert eg="ce003" location="/sec/p[6]"/>
    <expect-assert eg="ce003" location="/sec/p[7]"/>
    <expect-assert eg="ce003" location="/sec/p[8]"/>
    <expect-not-assert eg="e001"/>
    <expect-not-assert eg="e002"/>
  </xspec>
</rule>
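For comparison, a deny-list check over the same Unicode ranges can be sketched in Python. The ranges are taken from the documentation table in rule JATS-0011-002; building the character class from a list of code point pairs, and the function name denied_characters, are choices made for this illustration rather than the project's implementation.

```python
import re

# Ranges from the documentation table of rule JATS-0011-002.
DENIED_RANGES = [
    (0x0000, 0x0008), (0x000B, 0x000C), (0x000E, 0x001F),  # XML-illegal controls
    (0x007F, 0x009F),                                      # C1 controls and DEL
    (0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD),  # private use
    (0xFDD0, 0xFDEF),                                      # discouraged
] + [
    # U+nFFFE and U+nFFFF for supplementary planes 1 to 16.
    (plane * 0x10000 + 0xFFFE, plane * 0x10000 + 0xFFFF)
    for plane in range(1, 17)
]

# Build one character class using \UXXXXXXXX escapes for every range.
DENIED = re.compile(
    "[" + "".join(f"\\U{low:08X}-\\U{high:08X}" for low, high in DENIED_RANGES) + "]"
)


def denied_characters(text):
    """Return every denied character found in text, in document order."""
    return "".join(DENIED.findall(text))


if __name__ == "__main__":
    found = denied_characters("ok\u0085 fine\ue000")
    print([f"U+{ord(ch):04X}" for ch in found])
```

Generating the character class from a table of ranges keeps the regular expression and its documentation in step, which is the same motivation as generating both from the requirements XML.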

Adding grammatical constraints

We had decided to use the JATS 1.2 Archiving DTD, which is permissive to allow tagging anything that might appear in a journal article. This DTD has flexibility to hold any content that might be received. However, we needed to add grammar-like restrictions to create more consistency in certain elements, for example author metadata. Although the JATS DTD can be customized and we had previously customized the JATS 1.0 DTD, we wanted to adhere to the public standard and chose not to customize the DTD. Our best recourse, then, was to use Schematron to apply content model restrictions.

Relax NG has a concise and expressive XML format for describing XML grammars. Instead of inventing something new, we extended the Requirements XML format to include a subset of Relax NG for describing content models. The Relax NG subset includes: mixed content, choice, optional, group, one or more, and zero or more. Our content model restrictions could now be expressed in the Requirements XML as rules using Relax NG (①). When the Requirements XML is transformed into Schematron, the Relax NG content models are transformed into regular expressions (②), and transformed into a more human-readable DTD-like syntax for the validation messages (③). The assertions in the Schematron create a string representation of the content that is found in an element (④), and test whether the string matches the regular expression (⑤).

<rule id="JATS-0045-002">
  <Message>contrib-group</Message>
  <Phases>
    <phase>converted content</phase>
    <phase>current content</phase>
    <phase>scanned content</phase>
  </Phases>
  <Level>Error</Level>
  <contexts><context>contrib-group</context></contexts>
① <model>
    <oneOrMore><element name="contrib"/></oneOrMore>
    <zeroOrMore>
      <choice>
        <element name="aff"/>
        <element name="aff-alternatives"/>
        <element name="bio"/>
        <element name="etal"/>
      </choice>
    </zeroOrMore>
  </model>
  <xspec>
    <expect-not-assert eg="e001"/>
    <expect-assert eg="ce002"/>
  </xspec>
</rule>

② $regex = ^(?:contrib,?)+,?(?:(?:aff|aff-alternatives|bio|etal),?)*$
③ $text = (contrib)+, ((aff | aff-alternatives | bio | etal))*
④ $sequence = string-join(for $n in node() return if ($n instance of element()) then local-name($n) else if ($n instance of text() and normalize-space($n)) then '#PCDATA' else (), ',')
   e.g. contrib,contrib,aff
⑤ matches($sequence, $regex)
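The translation from content model to regular expression can be sketched as follows. This is a Python illustration, not the project's XSLT: the model is written as nested tuples instead of Relax NG elements, the translation rules are inferred from the single example in this paper, element names are assumed to contain no regular expression metacharacters, and all function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# The contrib-group model from rule JATS-0045-002, as nested tuples.
MODEL = [
    ("oneOrMore", ("element", "contrib")),
    ("zeroOrMore", ("choice",
                    ("element", "aff"), ("element", "aff-alternatives"),
                    ("element", "bio"), ("element", "etal"))),
]

REPEAT = {"oneOrMore": "+", "zeroOrMore": "*", "optional": "?"}


def to_regex(node):
    """Translate one model node into a regex over comma-separated names."""
    kind, *children = node
    if kind == "element":
        return children[0]
    if kind == "choice":
        return "(?:" + "|".join(to_regex(c) for c in children) + ")"
    if kind == "group":
        return "(?:" + ",?".join(to_regex(c) for c in children) + ")"
    # Repetition: each occurrence may be followed by a separating comma.
    inner = ",?".join(to_regex(c) for c in children)
    return "(?:" + inner + ",?)" + REPEAT[kind]


def model_regex(model):
    """Join the top-level items with optional commas and anchor the result."""
    return "^" + ",?".join(to_regex(item) for item in model) + "$"


def child_sequence(element):
    """Comma-separated child names, with #PCDATA for non-whitespace text."""
    tokens = []
    if (element.text or "").strip():
        tokens.append("#PCDATA")
    for child in element:
        tokens.append(child.tag)  # assumes elements in no namespace
        if (child.tail or "").strip():
            tokens.append("#PCDATA")
    return ",".join(tokens)


def conforms(element, model):
    return re.search(model_regex(model), child_sequence(element)) is not None


if __name__ == "__main__":
    good = ET.fromstring("<contrib-group><contrib/><contrib/><aff/></contrib-group>")
    print(model_regex(MODEL), child_sequence(good), conforms(good, MODEL))
```

Flattening the children into a string first is what allows an ordinary regular expression engine to stand in for a grammar validator.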

Continuous Integration

After the Requirements XML and transformations were created and working in oXygen, it was fairly easy to configure a continuous integration service using GitLab.

The continuous integration automation ensures that the test and build processes are done consistently for every change that is made to the Schematron. Every time a change is made, the continuous integration service transforms the requirements XML to generate the Schematron, generate and run XSpec tests, generate the documentation website, and build release packages.

The release packages contain the generated Schematron and compiled XSLTs for each phase, and release packages are provided in Zip format and Extensible Archive Format.

This automation also enforces good development practices such as creating a merge request for each change and creating a version tag for each release.

Figure 2. Self-Generating Quality Control

The self-generating quality control process and the continuous integration automation are represented in this diagram. The area in blue indicates process steps that are done in oXygen XML Editor. The area in grey indicates process steps that are done in the GitLab continuous integration service. First, we add new validation rules to a Requirements XML document. Each validation rule has an ID, message, context, XPath test, and examples. The Requirements XML document is then transformed to Schematron and XSpec. The XSpec is run to test the Schematron, and we check the XSpec report to see if all the rules work as expected. After the tests pass, we create a merge request. Then the continuous integration process transforms the Requirements XML to Schematron and XSpec, and runs the XSpec test. If the tests pass, the Requirements XML is transformed to HTML for a documentation website, and the build process generates release packages containing the Schematron and compiled XSLTs using SchXslt. The Schematron and XSLTs are then deployed into multiple systems that perform validation on JATS XML documents.

Conclusions

This project achieved, and exceeded, its goals by using declarative programming methods to create self-generating quality control for Schematron validation rules. By first declaring what we know in a clear format, rather than how it should be used, we discovered a range of benefits: the documentation and tests represent the validation rules exactly; time is saved in developing Schematron rules; and future maintenance of the Schematron will use the infrastructure that was created in this project.

Acknowledgements

Tomos Hillman would like to thank Liam Quin for his transformative work on other aspects of this project, and Gunther Rademacher for his work on the REx Parser Generator.

Vincent Lizzi would like to thank Joanna Czerepowicz, Kirstin Heilmann, and Barney Hall for their work in leading this project, and all of the members of the Taylor & Francis teams and supplier teams who contributed to this project.

References

JATS 1.2

JATS: Journal Article Tag Suite, version 1.2. Available from https://www.niso.org/publications/z3996-2019-jats

JATS 1.2 DTD

Available from https://jats.nlm.nih.gov/archiving/1.2/JATS-archivearticle1-mathml3.dtd

REx

Available from https://www.bottlecaps.de/rex/

Schematron

Second edition, International Standard ISO/IEC 19757-3:2016, Geneva, Switzerland: ISO. Available from http://standards.iso.org/ittf/PubliclyAvailableStandards/c055982_ISO_IEC_19757-3_2016.zip.

SQF

Schematron QuickFix, an extension of the ISO standard Schematron. Available from https://www.schematron-quickfix.com

SchXslt

Available from https://github.com/schxslt/schxslt

WCAG

Andrew Kirkpatrick, Joshue O Connor, Alastair Campbell, and Michael Cooper, eds. Web Content Accessibility Guidelines (WCAG) 2.1. W3C (World Wide Web Consortium). Available at https://www.w3.org/TR/WCAG21/

XSpec

Available from https://github.com/xspec/xspec

XML

Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, and François Yergeau, eds. Extensible Markup Language (XML) 1.0 (Fifth Edition). W3C (World Wide Web Consortium). Available at http://www.w3.org/TR/REC-xml/.

Appendix. Code samples

Listing 1. Requirements Datamodel (RNC)

default namespace = ""
namespace mml = "http://www.w3.org/1998/Math/MathML"
namespace sch = "http://purl.oclc.org/dsdl/schematron"
namespace xlink = "http://www.w3.org/1999/xlink"

start =
  element schematron {
    element topic {
      attribute id { xsd:NCName },
      title,
      element rule {
        attribute id { xsd:NCName }?,
        attribute pending {text}?,
        element Message {
          attribute xml:space { xsd:NCName }?,
          (text
           | element sch:name { 
               attribute path { text }?
           }
           | element sch:value-of {
               attribute select { text }
             })+
        },
        element notes {
          attribute resolved {text}?,
          anything 
        }?,
        element Phases {
           element phase {"AllElements"} |
           (
            element phase { ("converted content") }?,
            element phase { ("current content") }?,
            element phase { ("scanned content") }?,
            element phase { ("rendering_alerts") }?
           )
         },
        element Level { "Error" | "Warning" | "Info" },
        element contexts {
          element context { text }+
        },
        element sch:let {
          attribute name { xsd:NCName },
          attribute value { string }
        }*,
        (
            element xpath {
              attribute type { xsd:NCName }?,
              text
            }
            | element model { ContentModel }
        )?,
        element xspec {
          element expect-not-assert {
            expect-attributes
          }* &
          element expect-assert {
            expect-attributes
          }* &
          element expect-not-report {
            expect-attributes
          }* &
          element expect-report {
            expect-attributes
          }*
        }?
      }*,
      element examples {
        element counter-example {
           attribute id { xsd:NCName }?,
           attribute xml:space { xsd:NCName },
           attribute format { "xml" | "text" }?,
           anything
         }* &
         element example {
           attribute id { xsd:NCName }?,
           attribute xml:space { "preserve" },
           attribute format { "xml" | "text" }?,
           anything
         }*
      }?,
      element history { event }?
    }+
  }
  
title = element title { text }

event =
  element event {
    attribute timestamp { xsd:dateTime }?,
    attribute type { xsd:NCName }?,
    anything
  }

expect-attributes = 
  attribute eg { xsd:NCName },
  attribute location { text }?,
  attribute pending { text }?

ContentModel =
  ContentModelSequence
  | element empty { empty }

ContentModelSequence =
  (element element {
     attribute name { xsd:NCName }
   }
   | element text { empty }
   | element group { ContentModelSequence }
   | element optional { ContentModelSequence }
   | element choice { ContentModelSequence, ContentModelSequence+ }
   | element zeroOrMore { ContentModelSequence }
   | element oneOrMore { ContentModelSequence })+

anyAttribute = attribute * { text }
  
anything = text & element * { anyAttribute*, anything* }*
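
To make the data model above more concrete, the following is a minimal sketch of a requirements document that should conform to it. The topic, the rule ID, the message wording, and the example markup are all hypothetical and are not taken from the production rule set; only the element and attribute names come from Listing 1. The example IDs follow the $topicID-e000 and $topicID-ce000 conventions that Listing 2 enforces, and the eg attributes carry the part of each ID that follows the topic prefix.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical requirements instance: one topic with one rule.
     The rule content is illustrative only. -->
<schematron xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <topic id="abstract">
    <title>Abstracts</title>
    <rule id="abstract-title-required">
      <Message>The <sch:name/> element must contain a title.</Message>
      <Phases>
        <phase>AllElements</phase>
      </Phases>
      <Level>Error</Level>
      <contexts>
        <context>abstract</context>
      </contexts>
      <!-- No type attribute: Listing 2 treats any xpath that is not
           type="report" as an assertion -->
      <xpath>title</xpath>
      <xspec>
        <!-- The assertion should fail for the counter-example ... -->
        <expect-assert eg="ce001"/>
        <!-- ... and should pass for the example -->
        <expect-not-assert eg="e001"/>
      </xspec>
    </rule>
    <examples>
      <counter-example id="abstract-ce001" xml:space="preserve"><abstract><p>Text without a title.</p></abstract></counter-example>
      <example id="abstract-e001" xml:space="preserve"><abstract><title>Abstract</title><p>Text with a title.</p></abstract></example>
    </examples>
  </topic>
</schematron>
```

Because the rule declares one context, the checks in Listing 2 expect at least one expect-assert and one expect-not-assert, each pointing to a counter-example or example within the same topic.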

Listing 2. Requirements Schematron

<?xml version="1.0" encoding="UTF-8"?>
<sch:schema 
    xmlns:sch="http://purl.oclc.org/dsdl/schematron"
    xmlns:sqf="http://www.schematron-quickfix.com/validator/process" 
    xmlns:p="xpath-31.ebnf" 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
    xmlns:local="local" 
    queryBinding="xslt2" 
    defaultPhase="xspec">
    
    <xsl:import href="xpath-31.xslt"/>
    
    <xsl:key name="element-by-id" match="*[@id]" use="@id"/>
    
    <xsl:function name="local:getEg" as="element()*">
        <xsl:param name="expect" as="element()"/>
        <xsl:sequence select="key('element-by-id', concat($expect/ancestor::topic/@id, '-', $expect/@eg), $expect/ancestor::schematron)"/>
    </xsl:function>
    
    <sch:ns prefix="p" uri="xpath-31.ebnf"/>
    <sch:ns prefix="local" uri="local"/>
    
    <sch:phase id="capture">
        <sch:active pattern="capture.pattern"/>
    </sch:phase>
    <sch:phase id="rules">
        <sch:active pattern="capture.pattern"/>
        <sch:active pattern="xpath.pattern"/>
    </sch:phase>
    <sch:phase id="xspec">
        <sch:active pattern="capture.pattern"/>
        <sch:active pattern="xpath.pattern"/>
        <sch:active pattern="xspec.pattern"/>
    </sch:phase>
    
    <sch:pattern id="capture.pattern">
        
        <sch:rule context="rule">
            <sch:report test="notes[not(@resolved=('y', 'yes', 'true'))]" id="notesReport" role="info">Rule <sch:value-of select="@id"/> has a note that may need to be resolved</sch:report>
        </sch:rule>
        
        <sch:rule context="Phases">
            <sch:assert test="phase" id="atLeastOnePhaseRequired" sqf:fix="addAllPhase">There should be at least one phase</sch:assert>
        </sch:rule>
        
        <sch:rule context="context">
            <sch:extends rule="noQuotesAbstract"/>
            <sch:extends rule="noJatsNamespaceAbstract"/>
        </sch:rule>
        
        <sch:rule context="example">
            <sch:assert test="matches(@id, concat('^', ancestor::topic/@id, '-e\d{3}')) or not(@id)" sqf:fix="AddtopicID">Example IDs should be in the format $topicID-e000</sch:assert>
        </sch:rule>
        <sch:rule context="counter-example">
            <sch:assert test="matches(@id, concat('^', ancestor::topic/@id, '-ce\d{3}')) or not(@id)" sqf:fix="AddtopicID">Counter-example IDs should be in the format $topicID-ce000</sch:assert>
        </sch:rule>
        
        <sch:rule context="@id">
            <sch:assert test="count(key('element-by-id', .)) eq 1 or ancestor::*[parent::example or parent::counter-example]">Each ID used in the document should be unique.</sch:assert>
        </sch:rule>
    
    </sch:pattern>

    <sch:pattern id="xpath.pattern">
        
        <sch:rule context="rule">
            <sch:assert test="xpath or model" id="xpathRequired">Rule <sch:value-of select="@id"/> has neither assert nor report xpath nor model.</sch:assert>
        </sch:rule>
        
        <sch:rule context="xpath">
            <sch:extends rule="noQuotesAbstract"/>
            <sch:extends rule="noJatsNamespaceAbstract"/>
        </sch:rule>
        
        <sch:rule context="context">
            <sch:report test="p:parse-XPath(.)//Predicate[not(ancestor::Predicate)]//FunctionCall[not(ancestor::FunctionCall)][FunctionEQName/FunctionName/QName eq 'not']" id="PredicateBetterInRule" role="warning">Consider moving the condition in the predicate of rule <sch:value-of select="ancestor::rule/@id"/> to avoid problems with overly specific context paths</sch:report>
        </sch:rule>
    
    </sch:pattern>
  
    <sch:pattern id="xspec.pattern">
        
        <sch:rule context="rule">
            <sch:assert test="xspec" id="xspecRequired">Rule <sch:value-of select="@id"/> needs a selection of xspec rules</sch:assert>
        </sch:rule>
        
        <sch:rule context="xspec[../xpath/@type = 'report']">
            <sch:let name="contextCount" value="count(../contexts/context)"/>
            <sch:assert test="count(expect-report) ge $contextCount" id="expect-reportRequired">There should be at least one <sch:name/> for each context</sch:assert>
            <sch:assert test="count(expect-not-report) ge $contextCount" id="expect-not-reportRequired">There should be at least one <sch:name/> for each context</sch:assert>
        </sch:rule>
        
        <sch:rule context="xspec[not(../xpath/@type = 'report')]">
            <sch:let name="contextCount" value="count(../contexts/context)"/>
            <sch:assert test="count(expect-assert) ge $contextCount" id="expect-assertRequired">There should be at least one <sch:name/> for each context</sch:assert>
            <sch:assert test="count(expect-not-assert) ge $contextCount" id="expect-not-assertRequired">There should be at least one <sch:name/> for each context</sch:assert>
        </sch:rule>
        
        <sch:rule context="expect-report|expect-assert">
            <sch:assert test="local:getEg(.)[self::counter-example]" id="pointsAtCounter-example"><sch:name/> should point to a counter-example showing the error within the same topic (could not find <sch:value-of select="@eg"/>).</sch:assert>
        </sch:rule>
        
        <sch:rule context="expect-not-report|expect-not-assert">
            <sch:assert test="local:getEg(.)[self::example]" id="pointsAtExample"><sch:name/> should point to an example showing the desired mark-up within the same topic (could not find <sch:value-of select="@eg"/>).</sch:assert>
        </sch:rule>
        
        <sch:rule context="example|counter-example">
            <sch:assert test="@id"><sch:name/> should have an @id</sch:assert>
        </sch:rule>
    
    </sch:pattern>
    
    <sch:pattern id="abstractRules">
        
        <sch:rule abstract="true" id="noQuotesAbstract">
            <sch:report test="contains(., '&quot;')" role="warning" id="noQuotesInXpath"><sch:name/> contains a quote character!  This will likely result in errors unless nested inside single quotes.</sch:report>
        </sch:rule>
        
        <sch:rule abstract="true" id="noJatsNamespaceAbstract">
            <sch:report test="contains(., 'jats:')" role="error" id="NoJatsNamespace">The 'jats:' namespace should not be used!</sch:report>
        </sch:rule>
            
    </sch:pattern>
    
    <sqf:fixes>
        <sqf:fix id="addAllPhase">
            <sqf:description>
                <sqf:title>Add the default 'All' phase</sqf:title>
            </sqf:description>
            <sqf:add>
                <phase>All</phase>
            </sqf:add>
        </sqf:fix>
        <sqf:fix id="AddtopicID">
            <sqf:description>
                <sqf:title>Prefix with the topic ID</sqf:title>
            </sqf:description>
            <sqf:replace match="@id" node-type="attribute" target="id" select="concat(ancestor::topic/@id, '-', .)"/>
        </sqf:fix>
    </sqf:fixes>
    
</sch:schema>

Listing 3. XML Start Grammar (EBNF)

/* 
Grammar for parsing the start of an XML document

This is a subset of the Extensible Markup Language (XML) 1.0
grammar from http://www.w3.org/TR/xml/

The purpose of using this grammar and a parser derived from it
is to collect information about the start of an XML file
that is usually lost during parsing of the XML
(such as XML declaration, DOCTYPE, and attributes on the 
root element masked by attribute defaults coming from a DTD)
so that rules can be enforced using XPath in Schematron.

Parser generated as XSLT code using https://bottlecaps.de/rex/
java REx -xslt -tree -name "xml-start" xml-start.ebnf
then change the namespace prefix from "p" to "xmlstart"

Substantive changes from the original grammar:
  * everything after the end of the start element is ignored
  * internal DTD subset in DOCTYPE is not allowed
*/

document ::= prolog element
prolog   ::= XMLDecl? Misc* ( doctypedecl Misc* )?
XMLDecl  ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
VersionInfo
         ::= S 'version' Eq ( "'" VersionNum "'" | '"' VersionNum '"' )
EncodingDecl
         ::= S 'encoding' Eq ( '"' EncName '"' | "'" EncName "'" )
SDDecl   ::= S 'standalone' Eq ( "'" ( 'yes' | 'no' ) "'" | '"' ( 'yes' | 'no' ) '"' )
Misc     ::= Comment
           | PI
           | S
doctypedecl
         ::= '<!DOCTYPE' S Name ( S ExternalID )? S? ( '[' intSubset ']' S? )? '>'
ExternalID
         ::= 'SYSTEM' S SystemLiteral
           | 'PUBLIC' S PubidLiteral S SystemLiteral
intSubset
         ::= S?
Attribute
         ::= Name Eq AttValue
element  ::= '<' Name ( S Attribute )* S? ( '/>' | '>' )


<?TOKENS?>
Char     ::= #x0009
           | #x000A
           | #x000D
           | [#x0020-#xD7FF]
           | [#xE000-#xFFFD]
           | [#x10000-#x10FFFF]
S        ::= ( #x0020 | #x0009 | #x000D | #x000A )+
NameStartChar
         ::= ':'
           | [A-Z]
           | '_'
           | [a-z]
           | [#x00C0-#x00D6]
           | [#x00D8-#x00F6]
           | [#x00F8-#x02FF]
           | [#x0370-#x037D]
           | [#x037F-#x1FFF]
           | [#x200C-#x200D]
           | [#x2070-#x218F]
           | [#x2C00-#x2FEF]
           | [#x3001-#xD7FF]
           | [#xF900-#xFDCF]
           | [#xFDF0-#xFFFD]
NameChar ::= NameStartChar
           | '-'
           | '.'
           | [0-9]
           | #x00B7
           | [#x0300-#x036F]
           | [#x203F-#x2040]
Name     ::= NameStartChar NameChar*
Eq       ::= S? '=' S?
VersionNum
         ::= '1.' [0-9]+
EncName  ::= [A-Za-z] ( [A-Za-z0-9._] | '-' )*
Comment  ::= '<!--' ( Char - '-' | '-' ( Char - '-' ) )* '-->'
PI       ::= '<?' PITarget ( S ( [^?] | '?'+ [^?>] )* '?'* )? '?>'
PITarget ::= Name - 'xml'

Reference
         ::= EntityRef
           | CharRef
CharRef  ::= '&#' [0-9]+ ';'
           | '&#x' [0-9a-fA-F]+ ';'
EntityRef
         ::= '&' Name ';'
PubidLiteral
         ::= '"' PubidChar* '"'
           | "'" ( PubidChar - "'" )* "'"
PubidChar
         ::= #x0020
           | #x000D
           | #x000A
           | [a-zA-Z0-9]
           | [-'()+,./:=?;!*#@$_%]
SystemLiteral
         ::= '"' [^"]* '"'
           | "'" [^']* "'"
AttValue ::= '"' ( [^<&"] | Reference )* '"'
           | "'" ( [^<&'] | Reference )* "'"
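
As an illustration of the purpose stated in the grammar's header comment, the sketch below shows how a parser generated from this grammar might be called from Schematron to check the XML declaration. Several names are assumptions rather than details confirmed by this paper: the file name xml-start.xslt, the namespace URI "xml-start", and the function name xmlstart:parse-document are inferred from the REx command line above and from the p:parse-XPath call in Listing 2. The rule itself is hypothetical. The sketch also assumes that the parse tree uses the grammar's production names as element names.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sch:schema
    xmlns:sch="http://purl.oclc.org/dsdl/schematron"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    queryBinding="xslt2">

    <!-- Assumed file name for the parser generated by REx -->
    <xsl:import href="xml-start.xslt"/>

    <!-- Assumed namespace URI, after renaming the prefix to "xmlstart" -->
    <sch:ns prefix="xmlstart" uri="xml-start"/>

    <sch:pattern id="xml-start.pattern">
        <sch:rule context="/">
            <!-- Re-read the document as text and parse only its start;
                 the function name is assumed from the grammar's start symbol -->
            <sch:let name="start" value="xmlstart:parse-document(unparsed-text(base-uri(.)))"/>
            <!-- Hypothetical rule: the XML declaration must declare UTF-8 -->
            <sch:assert test="$start//EncodingDecl/EncName = 'UTF-8'" id="encodingIsUtf8">The XML declaration should declare the UTF-8 encoding.</sch:assert>
        </sch:rule>
    </sch:pattern>

</sch:schema>
```

Reading the file again with unparsed-text() is what makes the XML declaration and DOCTYPE visible, since neither survives normal XML parsing.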

Biographical notes

Tom Hillman has over a decade of experience with XML, XSLT, XQuery and related technologies, particularly in the fields of digital publishing, quality analysis, and transformation. He has given training courses to various institutions including publishers, universities and the UN, and is a regular faculty member at the prestigious XML Summer School in Oxford.

https://orcid.org/0000-0001-8980-7625

Vincent Lizzi is Head of Information Standards at Taylor & Francis and a member of the NISO JATS Standing Committee (ANSI/NISO Z39.96), and has contributed to the development of XSpec. Vincent was the technical lead on this project.