The TEI Processing Model: Introduction, limitations and potential extensions
Abstract
The TEI Processing Model (PM) is a TEI (Text Encoding Initiative) facility that can be used to give a declarative description of the expected behaviour associated with TEI XML elements. The paper will introduce the model and its implementation in TEI Publisher. It will go on to discuss a number of real-life edition requirements that the current model cannot handle. It will propose extensions to the model to deal with those requirements. Among these are: (i) being able to associate multiple behaviours with an element, such as in the case of a page break rendered as an image that is also a link (which could be handled by nesting associated behaviours); (ii) being able to generate new content, such as headers (could also be handled by nesting); (iii) sorting of output (which would require a new parameter); and (iv) more modularity in the PM (which could be handled by facilitating switching to another set of PM definitions). The paper will also argue that it is conceptually and practically not a good idea to define the PM as part of the schema specification and propose an alternative. The common aim of these proposals is to diminish the need for procedural coding in the creation of editions.
The Text Encoding Initiative (TEI) is a consortium that maintains a large tagset (ca. 600 elements) for representing text and language more generally in a humanities context (Burnard 2014) [Lou Burnard (2014). What is the Text Encoding Initiative? OpenEdition Press. https://books.openedition.org/oep/426]. The TEI also maintains the infrastructure that facilitates the tagset's maintenance and use. TEI usage is especially widespread in heritage institutions such as libraries and archives and in (web) publication of historical, literary and other sources.[1] While the descriptive power of the TEI tagset is evident, one issue for text encoders has always been: how do I move from a perfect XML-encoded text to a (web) publication (Flanders and Hamlin 2013) [Julia Flanders & Scott Hamlin (2013). TAPAS: building a TEI publishing and repository service. In: Journal of the Text Encoding Initiative 5. https://journals.openedition.org/jtei/788]? Given the great variety in the types of text that TEI is used to encode, as well as the variety in research interests among the researchers that use TEI, a single publication tool or system for TEI-encoded texts is not really possible. It would also be at odds with the spirit of declarative encoding, where we first encode the meaningful elements of the text and only then start wondering what representation(s) would be most suitable for these encodings.
Nevertheless, in practice, most representations of encoded texts consist of a limited number of standard textual constituents or behaviours: paragraphs, headings, text divisions, and the like. That realisation was the starting point for the TEI Processing Model (Turska, Cummings and Rahtz 2016) [Magdalena Turska, James Cummings & Sebastian Rahtz (2016). Challenging the myth of presentation in digital editions. In: Journal of the Text Encoding Initiative 9. https://journals.openedition.org/jtei/1453]. The processing model essentially consists of an XML vocabulary (included in the TEI namespace) that makes it possible to define these common behaviours and their parameters, as well as to associate TEI elements with the behaviours. In this paper, I will briefly introduce the Processing Model and then discuss what I see as a number of limitations of the model, as well as ways of handling these.
The context in which I have been thinking about and experimenting with the TEI Processing Model is a research project in which I investigate how far a fully declarative specification for a digital edition could get us. Such a specification would need to include aspects that the processing model doesn't handle (pages, search facets, interaction), but the model would still be an essential ingredient. For the current state of this project, see https://gitlab.huc.knaw.nl/edition-publication-model/edition-publication-model. As the main test case I am trying to replicate the functionality in the first installment of the edition of the papers of the painter Mondrian. Some of the examples in this paper will come from that edition.
The TEI Processing Model
Usage of the processing model is discussed in the Guidelines that the TEI Consortium maintains (TEI Consortium 2024) [TEI Consortium (2024). TEI P5. Guidelines for Electronic Text Encoding and Interchange. Version 4.8.0. https://tei-c.org/release/doc/tei-p5-doc/en/html/]. The model uses the model element to associate a TEI element with a behaviour (behaviour attribute) in a certain context (a predicate attribute containing an XPath expression). A cssClass attribute is a hook for attaching CSS definitions to the display. The model element can contain param elements that carry name and value attributes (value contains an XPath expression) to pass parameters to the behaviour. One of the most common parameters is content: the content that should be used for further processing. By default this is the current element's content, but the parameter might also point to an attribute or a location elsewhere in the document. The outputRendition child element can be used to associate renderings (in terms of CSS facilities) with the element itself or perhaps before or after it. model elements can be grouped in modelGrps (meant to describe various output methods, such as web or print) or modelSequences if multiple consecutive behaviours are to be associated with an element.
The behaviour(s) associated with an element is (are) expected to be described as part of the definition of the element in the TEI's schema specification language, also part of the TEI vocabulary.[2] In TEI, schemas are maintained in so-called 'ODD'-files (an abbreviation for 'One Document Does it all': i.e. schema definition, documentation, constraints, and now also the definition of expected processing). Schema definitions in TEI can be chained, in order to facilitate extending, subsetting or modifying the TEI schema for specific projects. The expectation is that projects will also use this facility for overriding default behaviours associated with an element.
Here are some examples of what these associations between elements and behaviours may look like, taken from the TEI Guidelines.
An element such as foreign, to be displayed in italics inline, could have its processing model defined as follows:
<model behaviour="inline">
  <outputRendition>font-style: italic;</outputRendition>
</model>
In the following example for the ref element (a link), we see how a model element is embedded in an elementSpec element. The example also demonstrates the use of parameters:
<elementSpec ident="ref" mode="add">
  <model behaviour="link">
    <param name="uri" value="@target"/>
    <param name="content" value="."/>
  </model>
</elementSpec>
The following example shows how various contexts can be distinguished using the predicate attribute. The quote element is displayed inline or as a block depending on whether it occurs within a paragraph:
<model predicate="ancestor::p" behaviour="inline">
  <outputRendition>font-style: italic;</outputRendition>
</model>
<model behaviour="block">
  <outputRendition>left-margin: 2em;</outputRendition>
</model>
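The grouping by output method described earlier can be illustrated in the same way. The following is a minimal sketch and is not taken from the Guidelines: it assumes a note element that should be placed in the margin on the web and at the foot of the page in print. The output values and the place parameter with its values are assumptions for illustration; what an implementation actually supports may differ.

```xml
<elementSpec ident="note" mode="change">
  <!-- one modelGrp per output method (assumed values: web, print) -->
  <modelGrp output="web">
    <model behaviour="note">
      <!-- assumed parameter: where the note is placed -->
      <param name="place" value="'margin'"/>
    </model>
  </modelGrp>
  <modelGrp output="print">
    <model behaviour="note">
      <param name="place" value="'foot'"/>
    </model>
  </modelGrp>
</elementSpec>
```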
The description of the model element in the TEI Guidelines provides an overview of suggested values for the behaviour attribute. Many of these behaviours can be compared to XSLT templates, processed by walking down a tree (from the root to the leaves). That would be true for section, block, paragraph, heading, inline, table and a host of others. For other values, such as document or index, that wouldn't be so clear.
The TEI encourages projects to extend or modify the TEI schemas when the available definitions do not suit the specific needs of a project (Cummings 2019) [James Cummings (2019). A world of difference: Myths and misconceptions about the TEI. In: Digital Scholarship in the Humanities 34 (Supplement_1), pp. i58-i79. https://academic.oup.com/dsh/article/34/Supplement_1/i58/5248221]. The TEI Processing Model can also be extended, at various levels. The lowest and simplest level is a modification of the model elements associating a TEI element with a certain behaviour. Say we don't want a correction (corr) and its accompanying original form (sic) to be rendered inline using behaviour alternate (one form rendered in the text, the other perhaps as a popup) but rather out of line as a note. In that case we would probably want to associate the corr's parent element choice with the note behaviour.
One level of modification up would be to introduce new behaviours. The TEI model explicitly permits this by providing only suggested values for the behaviour attribute. A project that uses the TEI nets module to represent textual stemmata in TEI XML could introduce a graph behaviour in order to be able to represent these graphs. Another project might define a behaviour para-plus-number-plus-uri for paragraphs that need to be provided with a paragraph number and a URI for citation, and that behaviour might then also be associated with lg elements (line groups, i.e. poems or stanzas of poems).
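As a minimal sketch of such a lowest-level modification, the choice element could be handled as follows. This is one possible reading of the scenario, not an example from the Guidelines: it assumes that the corrected form stays in the running text and that the original form is moved into the note, and it assumes that a modelSequence may carry the predicate.

```xml
<elementSpec ident="choice" mode="change">
  <!-- applies only to choice elements that pair a sic with a corr -->
  <modelSequence predicate="sic and corr">
    <!-- assumption: the corrected form stays in the running text -->
    <model behaviour="inline">
      <param name="content" value="corr"/>
    </model>
    <!-- assumption: the original form is moved out of line -->
    <model behaviour="note">
      <param name="content" value="sic"/>
    </model>
  </modelSequence>
</elementSpec>
```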
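To make the second level concrete, a project-defined behaviour might be attached to both p and lg along the following lines. This is only a sketch: the parameter names number and uri and the way their values are computed are hypothetical, and the behaviour itself would still have to be implemented by the processing tool.

```xml
<elementSpec ident="p" mode="change">
  <model behaviour="para-plus-number-plus-uri">
    <!-- hypothetical parameters; names and values are illustrative only -->
    <param name="number" value="@n"/>
    <param name="uri" value="concat('#', @xml:id)"/>
  </model>
</elementSpec>
<elementSpec ident="lg" mode="change">
  <!-- the same project-defined behaviour reused for line groups -->
  <model behaviour="para-plus-number-plus-uri">
    <param name="number" value="@n"/>
    <param name="uri" value="concat('#', @xml:id)"/>
  </model>
</elementSpec>
```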
The third and highest level of modification would be a change in the elements or attributes the TEI uses to define the processing model, or a change in their semantics. An example of such a change would be to permit a subelement associated-event in model, in order to be able to describe the events that an element should respond to and perhaps the actions that should be executed in response to that event.
The implementation of the TEI Processing Model in TEI Publisher
It is important to understand that the TEI Processing Model by itself does nothing to change the situation of a lonely text encoder with encoded XML but without a publication tool. The TEI now provides a vocabulary to describe the desired output, but it does not provide software implementing that technology. This is where TEI Publisher (e-editiones 2024) [e-editiones (2024). TEI Publisher v9.0.0. e-editiones. https://github.com/eeditiones/tei-publisher-app/tree/v9.0.0] comes in.[3] TEI Publisher is Open Source software and provides an application running on the XML database eXist-db.[4] As far as I know, it is currently the only tool that offers an implementation of the TEI Processing Model. Users can upload XML files in TEI Publisher, use or modify some default ODD files containing the processing definitions and inspect the result, or they can create their own ODD files. HTML page templates can be used to describe the context for the output of the processing model as well as to bring in various types of functionality, such as tables of contents, forward and backward buttons, search fields and a login function. When satisfied with the result, users can generate their own 'apps', which can be further modified and customised, e.g. by defining search facets. The combination of the processing model definitions in the ODD file and the facilities of TEI Publisher can result in surprisingly quick application development.[5] Development of TEI Publisher is coordinated by the non-profit e-editiones,[6] an association that brings together projects and people interested in digital editions and more specifically the development of TEI Publisher.
To extend the power of the TEI Processing Model, TEI Publisher has introduced a number of extensions, described in the TEI Publisher documentation.[7] These extensions include:
- use of XQuery rather than XPath when computing the value of parameters. This makes it easier to do all sorts of processing in establishing the value of parameters;
- setting default processing model rules for elements for which there is no explicit specification and for text content;
- making it possible to pass external parameters to the processing rules, including a mode parameter that can be used to determine the type of processing. It also supports setting parameters from within the processing rules;
- allowing, within a model element, a pb:template child element that contains a parametrized HTML fragment (in the case of web output). After processing the parameters, the template itself is processed according to the model's behaviour;
- allowing the definition of new behaviours in the ODD file, also based on pb:template elements. These behaviours are limited in what they can do, as they cannot contain processing;
- allowing the user to create completely new behaviours in XQuery;
- a new pass-through behaviour with the effect that the element itself is not processed but its children will be processed. In effect, that makes it equivalent to the default behaviour, but it is useful in preventing execution of a behaviour defined at a higher level;
- and finally, it is possible to refer in the processing model to one of the web components that TEI Publisher uses extensively. This is done by assigning the value webcomponent to the behaviour attribute and using a parameter name that contains the name of the web component. For instance, activating a facsimile image in TEI Publisher is handled by the pb-facs-link web component. The model element for a TEI pb (page begin) element might refer to this web component.
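The last extension in this list might be used roughly as follows. This is a sketch based on the description above rather than a copy of TEI Publisher's own ODDs: the name parameter carries the component name as described, while passing the facs attribute and using the page number as content are assumptions about how the component is configured.

```xml
<elementSpec ident="pb" mode="change">
  <model behaviour="webcomponent">
    <!-- the web component to instantiate -->
    <param name="name" value="'pb-facs-link'"/>
    <!-- assumption: the facsimile reference is handed over as a parameter -->
    <param name="facs" value="@facs"/>
    <!-- assumption: the page number serves as the visible label -->
    <param name="content" value="string(@n)"/>
  </model>
</elementSpec>
```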
It is of course fine for a toolset such as TEI Publisher to define extensions to the TEI Processing Model. From a TEI perspective, however, if these extensions are useful or even necessary when seriously applying the TEI Processing Model, we should maybe ask ourselves whether these facilities, or similar ones, shouldn't be part of the Processing Model itself.
Limitations and ways to handle these
In this section we'll encounter some limitations of the Processing Model as defined in the TEI Guidelines that might or might not be solved using the extensions implemented in TEI Publisher or others. In some respects, my focus here is somewhat different from that of TEI Publisher. While TEI Publisher as a publication tool aims to provide sensible defaults in order to minimise the amount of customisation necessary to get most TEI sources published, my aim is to be able to describe in terms of the TEI Processing Model any element display functionality that can be useful in an edition.
Multiple behaviours
Suppose that my edition contains notes to the edited text, and these notes are located in a listAnnotation under the standOff element. I want to produce a section of notes, headed by the heading 'Notes'. I don't want to use the TEI PM note behaviour, because that doesn't give me sufficient control. That means that I have to create a container, a heading within that container, followed by the representation of the note elements themselves. Currently, the processing model offers me behaviours section to create the container and heading to create the heading. The only way to associate a listAnnotation with both behaviours is to group them in a modelSequence element, but that would give me the heading before or after the section, not within it.
One obvious solution would be to define a further behaviour: section-with-heading. Besides the default content parameter, that behaviour would also have the parameters needed to create the heading. The model statement for the listAnnotation element might look like:
<model behaviour="section-with-heading">
  <param name="content-heading" value="'Notes'"/>
  <param name="level" value="3"/>
</model>
This would do what we need. However, it will only help an individual project once it is implemented in TEI Publisher or another implementation of the TEI PM.
But if my project uses TEI Publisher, I already have a solution available, based on the pb:template element. My listAnnotation model element could look like this:
<model behaviour="section">
  <pb:template>
    <div><h3>[[content_heading]]</h3>[[content]]</div>
  </pb:template>
  <param name="content_heading" value="'Notes'"/>
</model>
For anyone who knows some HTML, this is certainly an attractive solution. Note, however, that it takes us out of the domain of the processing model and into the domain (HTML) from which the processing model was trying to abstract.
To me a better solution seems to be to have nested models. A model with behaviour section could contain a modelSequence containing another model with behaviour heading and one with behaviour block, or even pass-through. This would keep us at the level of the model, without bothering with the HTML code that should be left to the implementation. So the model for the listAnnotation would look like this:
<model behaviour="section">
  <modelSequence>
    <model behaviour="heading">
      <param name="level" value="3"/>
      <param name="content" value="'Notes'"/>
    </model>
    <model behaviour="block"/>
  </modelSequence>
</model>
In fact, cases where an element needs to be associated with multiple behaviours at the same time are very common. Think of a pb (page begin) element that we want to associate with a link (to the representation of that page), where the link should have the form of a thumbnail whose filename is held in the pb's facs attribute. This could be modelled by a behaviour link containing a behaviour graphic. Another example is that of a name element that is to become a link into an external authority file, but is also to be provided with a popup with some biographical information.[8] The example that we mentioned earlier of a paragraph that needs to be provided with a paragraph number and a permanent URI for citation purposes is another case of a situation that can be handled by nesting models.
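The page break case could be sketched as follows under the nesting proposal. Note that a model nested directly inside a model is part of the proposal and not valid in the current TEI schema; the way the link target is computed and the url parameter of the graphic behaviour are assumptions for illustration.

```xml
<elementSpec ident="pb" mode="change">
  <!-- outer behaviour: a link to the representation of the page -->
  <model behaviour="link">
    <!-- hypothetical target; a real edition would compute its own address -->
    <param name="uri" value="concat('pages/', @n, '.html')"/>
    <!-- inner behaviour: the thumbnail that forms the content of the link -->
    <model behaviour="graphic">
      <param name="url" value="@facs"/>
    </model>
  </model>
</elementSpec>
```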
Ordering output
Let us suppose we have a personography contained in a listPerson element in a TEI document. Suppose we want to display the personography, ordering the persons by last name. With the behaviours described in the current TEI Processing Model that is not possible.[9] What we could do is define a new behaviour, say, sorted-block, with a parameter sort-key. The model element for listPerson would then look like:
<model behaviour="sorted-block">
  <param name="content" value="tei:person"/>
  <param name="sort-key" value="tei:persName/tei:surname/text()"/>
</model>
This would allow us to stick to the existing syntax for defining processing models.
Suppose, however, that we want to have multiple orderings, with the user being able to select the ordering to be used, perhaps using radio buttons or a similar device. In theory, we could have parameters sort-key-1, sort-key-2, etc. The sorted-block component could then use all parameters whose name starts with 'sort-key' as potential sort keys. We could even have a convention that whatever comes after 'sort-key-' in the parameter's name should be used as a label in the radio buttons. But that would become very ugly. It would be much cleaner to be able to describe the orderings explicitly, for instance as follows:
<model behaviour="sorted-block">
  <param name="content" value="tei:person"/>
  <order name="Last name" value="tei:persName/tei:surname/text()"/>
  <order name="Date of birth" value="string(tei:birth/@when)"/>
  <order name="Nationality" value="tei:nationality/text()"/>
</model>
Modularisation in the ODD
As mentioned above, elements' processing models are supposed to be defined under the elementSpec element for those elements in a project's schema specification. The TEI schema definition module (tagdocs) makes it possible for projects to accept an element's definition from the schema, but selectively override certain aspects of that definition, such as its attributes, the constraints that apply to it or its contents. The base definition is taken from the schemaSpec's or elementSpec's source attribute. The way in which definitions are to be merged is indicated by the mode attribute. The value replace indicates that the new definition replaces the old one; the value change means that the existing definition will remain intact, except for the explicitly modified parts. In this way, projects could also override an element's model subelements.
There are, I believe, a number of difficulties with this model. First of all there is a conceptual mismatch between the model subelements and the other content in the elementSpecs. A schema is for validation, models are for processing. But what is more important is that, as the schema definition hierarchically consists of the element definitions (and many other things), all processing has to be defined at the element level.
This is unfortunate, because the processing of an element very much depends on the context in which it is to be processed. Take a person element. It may be processed in at least four different ways: (i) in its rendering in the personography, (ii) in its rendering in a side-panel to another document where the person is mentioned, (iii) in the pop-up that appears when the mouse hovers over the reference, and finally (iv) in the faceted search where the person's name is used as a label. Having to force all of these contexts into a single elementSpec leads to long lists of model elements that use parameters such as $mode (not mentioned in the Processing Model specification) in their predicate attribute. This makes the behaviours associated with an element hard to understand. That is all the more likely because the different behaviours for the person element will also affect the behaviour of its underlying elements in these contexts. To understand what happens in a certain context then requires us to find the relevant models (checking the parameters in the predicates) for various elements.
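A sketch of what such an elementSpec can end up looking like is given below. The mode values, class names and parameter choices are hypothetical, and the exact way a mode parameter is referenced in a predicate depends on the implementation; the point is only that four unrelated contexts share one list, and that the same kind of list has to be repeated for persName, birth, death and every other descendant.

```xml
<elementSpec ident="person" mode="change">
  <!-- (i) full entry in the personography -->
  <model predicate="$mode='personography'" behaviour="block"/>
  <!-- (ii) side-panel next to a document that mentions the person -->
  <model predicate="$mode='sidepanel'" behaviour="block" cssClass="sidepanel"/>
  <!-- (iii) pop-up shown when hovering over a reference -->
  <model predicate="$mode='popup'" behaviour="block" cssClass="popup"/>
  <!-- (iv) label in the faceted search -->
  <model predicate="$mode='facet'" behaviour="inline">
    <param name="content" value="tei:persName[1]"/>
  </model>
</elementSpec>
```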
Alternatively, it could lead to using multiple ODDs (the document that holds the (modified) schema definition) for a single document type, which could then be activated in various contexts. These would only need to contain the definitions applicable in a certain context. Indeed, TEI Publisher facilitates using different ODDs for different views on a page. But now we have achieved the opposite of what we aimed for: we put the processing definitions within the elementSpec to make them part of the single ODD file. But because this isn't really workable we begin to create multiple ODDs.
I believe it is possible to define the processing for a schema's elements in the ODD file in a more modular way, without the disadvantages mentioned here. But before we discuss that, I want to mention another undesirable effect of the current state of affairs. The model element does not have an ident attribute. This makes it unclear how the model children of an elementSpec with mode="change" should be handled. The definition of mode change mentions replacing the children of the original definition 'that agree in type and identifier' with the children of the new definition.[10] The assumption in TEI Publisher seems to be that all model (and modelGrp and modelSequence) children of the original elementSpec should be replaced by those of the new one. The visual ODD editor in TEI Publisher, at least, when we indicate we want to change the behaviour of an element from the default, copies over all model and related elements from the default ODD to our new one. It would be more practical to be able to override only a single model subelement.
In passing, we note that TEI Publisher in practice does away with the fiction that elements' processing model is part of the schema specification. The ODDs that we encounter in TEI Publisher do contain a schemaSpec element, but the only reason it is there seems to be that the schemaSpec's source attribute is used for chaining ODDs and overriding definitions.
Tentatively, I think we might want to define elements' processing in an element schemaProc, sibling to schemaSpec. schemaProc has a source attribute to be able to chain (and override) definitions. schemaProc would contain one or more processing blocks (procBlock), distinguished by a procmode attribute.[11] One processing block could have the procmode attribute equal to default. Other procmodes might be popup, title, toc, sidepanel, depending on the edition's needs. The processing blocks would contain elementProc elements that describe the processing for an element in that processing mode, and these in turn would contain the models, modelSequences and modelGrps that are now contained in elementSpec.
The processing blocks might have other attributes. A start attribute could contain an XPath expression pointing to the location in the document tree where processing should start. An adddefaultdefs boolean attribute could be used to indicate whether all processing definitions to be used should come from the current processing block or whether the default processing block definitions can be used for situations not described in the processing block itself.
This would do away with some of the disadvantages of the current situation: it makes a clear distinction between schema and processing while these can still be described in the same file; various modes of processing can be described in a single ODD without the need for parameters; all processing for a processing mode is described in a single processing block rather than in many different places in an ODD.
As an example of where such a processing block could be useful, think of the need for a document title on each of the pages representing a document. TEI Publisher now handles that situation by setting in the HTML templates a parameter header with the value short, which is then used in the predicate attribute of a model element in the ODD. It would be much cleaner if it were possible to straightforwardly point to a section in the processing specification.
Similarly, when generating a table of contents, I might have different processing expectations than in the representation of the transcription. In a table of contents, I might want to ignore line breaks in headings, to always resolve abbreviations and to ignore notes. We could handle this by referring to the same parameter in the processing model of all these elements, but to have a single processing block for handling the table of contents is obviously simpler.
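A minimal sketch of such a table-of-contents block, using the proposed (and so far hypothetical) schemaProc, procBlock and elementProc elements, might look as follows. The file name base.odd, the choice of the omit behaviour and the assumption that adddefaultdefs="true" means falling back on the default block are all illustrative.

```xml
<schemaProc source="base.odd">
  <procBlock procmode="default">
    <elementProc ident="head">
      <model behaviour="heading"/>
    </elementProc>
  </procBlock>
  <!-- everything specific to the table of contents lives in one place -->
  <procBlock procmode="toc" adddefaultdefs="true">
    <!-- ignore line breaks in headings -->
    <elementProc ident="lb">
      <model behaviour="omit"/>
    </elementProc>
    <!-- ignore notes -->
    <elementProc ident="note">
      <model behaviour="omit"/>
    </elementProc>
    <!-- always resolve abbreviations -->
    <elementProc ident="choice">
      <model predicate="abbr and expan" behaviour="inline">
        <param name="content" value="expan"/>
      </model>
    </elementProc>
  </procBlock>
</schemaProc>
```

Elements not mentioned in the toc block, such as head, would in this reading be handled by the definitions in the default block.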
Modular components and transferring control to another ODD
All editions contain edited text(s) of at least one document or text type. But besides the edited text, editions also contain other components, such as introductions, bibliographies, and possibly many more, such as a variant apparatus, a personography, list of places, list of artworks, etc. Each of these components (we can call them secondary components) has its own schema, and presumably also its own processing expectations.
If we want to create editions based on reusable components, we should make sure that in the places where the components interact, we can somehow call or include functionality from one component into the other. If, in the example mentioned before, in our edition of correspondence we want the references to persons to be provided with a popup that gives some summary information about that person, we should be able to call a function (or to execute a rule block, or something of the kind) defined in the processing specification for the personography. If we can do that, we have a personography component that we can reuse in the context of other editions. If we cannot do it, we need to define some of the personography processing in each of the editions that uses the personography, doing away with hopes for a modular design. This is in fact, I believe, a strong reason for the processing blocks proposed in the preceding section.
To work this out a bit more, suppose in our edited text we have an rs element (referring string) of type person, representing a (reference to a) person. Supposing we have defined a popup behaviour that associates the string with the popup, we might also define a behaviour switch_odd (or perhaps switch_schema) and an associated attribute switch_to, used as follows:
<elementProc ident="rs" mode="change">
  <model predicate="@type='person'" behaviour="popup" cssClass="popupContainer">
    <modelSequence>
      <model behaviour="inline" cssClass="hasPopup entity"/>
      <!-- displays the referring string -->
      <model behaviour="switch_odd" switch_to="bio.odd#proc-bio-popup">
        <!-- creates the popup content -->
        <param name="content" value="@ref"/>
      </model>
    </modelSequence>
  </model>
</elementProc>
For creating the content of the popup, this would switch to the proc-bio-popup processing block in the personography ODD (called bio.odd). This processing block might look like this:
<procBlock xml:id="proc-bio-popup" procmode="mode-bio-popup">
  <elementProc ident="person" mode="change">
    <model behaviour="block" cssClass="isPopup">
      <modelSequence>
        <model behaviour="heading">
          <param name="level" value="4"/>
          <param name="content">Person information</param>
        </model>
        <model behaviour="paragraph">
          <param name="content" value="tei:persName[@full='abb']"/>
        </model>
        <model behaviour="paragraph">
          <modelSequence>
            <model behaviour="text">
              <param name="content">Birth: </param>
            </model>
            <model behaviour="inline">
              <param name="content" value="string(tei:birth/@when)"/>
            </model>
            <model behaviour="text">
              <param name="content"> Death: </param>
            </model>
            <model behaviour="inline">
              <param name="content" value="string(tei:death/@when)"/>
            </model>
          </modelSequence>
        </model>
        <model behaviour="paragraph">
          <param name="content" value="tei:note[@type='shortdesc']"/>
        </model>
      </modelSequence>
    </model>
  </elementProc>
  <elementProc ident="persName">
    <model predicate="@full='abb'" behaviour="inline"/>
  </elementProc>
</procBlock>
This creates a block that contains a heading and several paragraphs of person information.[12]
We already saw some places where this switch_odd behaviour might be useful in creating an edition. Think of a person's or other entity's representation in a side panel to the text, or of the label needed to represent an entity in a faceted search.
Some more limitations
I want to mention a few more potential requirements for extension of the Processing Model, some of them in more detail than others.
- Above we saw that the models contained in an elementSpec in a chained ODD apparently are replaced wholesale when an elementSpec for the same element in the chaining ODD also contains model elements. It would seem much more natural to concatenate the models from the chaining and the chained ODDs, and apply the first model with a predicate and output attributes that match the current situation.
  One reason for this is that some elementSpecs contain long lists of model elements, because the required behaviour depends very much on the contexts in which the element occurs. It is cumbersome and error-prone to have to override all of them when you need to change just a single model element. Another reason is that concatenating is much more powerful than just overriding. For instance, if a chained ODD says that a head within a line group (stanza) should be displayed in a certain way (predicate parent::tei:lg), when we concatenate rather than override, the chaining ODD can say that this should not apply to heads in nested line groups (predicate parent::tei:lg[parent::tei:lg] and behaviour pass-through) without changing the behaviour for heads in top-level line groups.
- Currently, the values for the behaviour attribute are described as 'suggested'. That means that a project that would use the Processing Model to describe the expected processing for its elements can never be sure that an implementation that claims to support the Processing Model actually supports the behaviours that are described. The section in the TEI Guidelines that describes what is expected from implementations[13] does not mention this aspect at all. I would argue that the Guidelines should prescribe which behaviours an implementation should at least support.
- The value attribute of the param element is expected to contain an XPath expression. But above we also met cases where the value could be a literal. In that case, we have to write value="'literal'". It would be cleaner to provide a literal value as text content. The param element should then have either a value attribute or literal content.
- If you read closely the example for switching to another ODD, you will have noticed that what we passed to the personography ODD was not the person node, but a reference to that node. To be able to provide the person node, the name of the personography file would have to be known in the rules of the other components. It makes for a cleaner separation in functionality when processing model rules for a component do not have to be aware of file names of other components. It should be left to the implementation to provide a mechanism that gets from the reference to a person (or other object) to the corresponding node.
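The first point in this list, concatenating rather than replacing models, can be sketched with the head example. The fragments below are illustrative only: the level parameter and the fallback model are assumptions, and the ordering rule (models from the chaining ODD are tried before those from the chained ODD) is what the proposal would require for the override to take effect.

```xml
<!-- chained (base) ODD -->
<elementSpec ident="head" mode="change">
  <model predicate="parent::tei:lg" behaviour="heading">
    <param name="level" value="4"/>
  </model>
  <model behaviour="heading"/>
</elementSpec>

<!-- chaining (project) ODD: adds a single model, repeats nothing -->
<elementSpec ident="head" mode="change">
  <model predicate="parent::tei:lg[parent::tei:lg]" behaviour="pass-through"/>
</elementSpec>

<!-- effective list under concatenation, first matching model wins:
     1. parent::tei:lg[parent::tei:lg] : pass-through (chaining ODD)
     2. parent::tei:lg                 : heading, level 4 (chained ODD)
     3. no predicate                   : heading (chained ODD) -->
```

Under wholesale replacement, by contrast, the project ODD would have to repeat the second and third models in order to keep them.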
Conclusion
In the preceding discussions we saw that the TEI Processing Model is a declarative facility for associating processing rules with elements. We discussed some ways of extending the model that, I believe, make it more powerful, more usable and more suitable for the creation of interoperable components. We also saw that TEI Publisher is using other, but partly related, extensions. If we want to get closer to the idea (or ideal) of a declaratively defined edition, I think there is reason enough for us in the TEI Consortium to take another look at the specification of the Processing Model.