Morphological analyses

Published

June 14, 2024

Natural languages have very different morphological structures. Nevertheless, we can define general patterns for the process of analyzing the morphology of a digital text, and expressing the results in terms of citable scholarly resources.1

Parsing and analyses

Parsing operates on individual tokens. It therefore presupposes and requires an explicit orthographic system that defines tokenization, and classifies the resulting tokens. Morphological parsing operates on tokens classified as lexical tokens (and not on puncutation, numbers, or other types of non-lexical tokens).

Parsing may normalize the token in some way (analyzing strings in case-insensitive form, for example). We refer to these normalized forms that a parser can analyze as morphological tokens. A lexical token can even correspond to more than one morphological token. In English, for example, tokens like “cannot” or contracted forms like “can’t” correspond to two morphological tokens, “can” plus a negative adverb “n’t” or “not”.

Parsing a morphological token produces zero or more morphological analyses.

A morphological analysis identifies a morphological form and a lexeme for the token. An analysis of the English word “books”, for example, might indicate that this form is a plural noun of a lexeme with the dictionary form “book.”

Morphological tokens are often treated as the result of applying a rule to a stem. An English parser, for example, might interpret the token “books” as the result of applying to a stem “book” a rule to add an ending “s” to create a plural form.

We can therefore model a morphological analysis as associating five pieces of information with an orthograpic token: a morphological token (often identical to the orthographic token); a lexeme and a form that together identify the morphology of the token; and a stem and inflectional rule that together explain how the analysis arrived at the resulting lexeme and form.

The morphological token is a string value. The other four pieces of information can be represented by a unique identifier encoded as a URN. Internally, the CitableParserBuilder package works with an abbreviated form of URNs, and defines concrete types for each of these four pieces of information:

  1. the lexeme (LexemeUrn)
  2. the morphological form (FormUrn)
  3. the stem used to arrive at the analysis (StemUrn)
  4. the inflectional rule used to arrive at the analysis (RuleUrn)

The structure of the morphological form identified will vary from one language to another.

Footnotes

  1. The model of citable morphological parsing presented here was developed over many years in the work of the Holy Cross Manuscripts, Inscriptions and Documents Club (MID).↩︎