Working with analyzed forms

Published

July 5, 2024

To replicate all the steps in this tutorial:

  • install Julia if you haven’t already done so
  • download or clone the Tabulae.jl repository
  • start a Julia REPL
  • assign to the variable repo the path to the cloned repository

Start by repeating the steps from the introductory tutorial to compile a parser, and assign it to a variable named parser:

Compile a parser
using Tabulae

shareddir = joinpath(repo, "datasets", "core-infl-shared") 
lat25dir = joinpath(repo, "datasets", "core-infl-lat25") 

parser = dataset([shareddir, lat25dir]) |> tabulaeStringParser

Morphological analyses

When we parse a token, the result is a Vector of analyses. Each analysis associates the token with four identifiers (as you can see in the parser output). If the form is unambiguous, the Vector will have only one element:

using CitableParserBuilder
verbparses = parsetoken("amabatur", parser)
1-element Vector{Analysis}:
 Analysis("amabatur", ls.n2280, forms.3312120000, latcommon.verbn2280, latcommon.are_conj1impft9, "amabatur")

If the form is morphologically ambiguous, the results will include an analysis for each possibility.

nounparses = parsetoken("agricolae", parser)
4-element Vector{Analysis}:
 Analysis("agricolae", ls.n1626, forms.2010001200, latcommon.noun1626, latcommoninfl.a_ae14, "agricolae")
 Analysis("agricolae", ls.n1626, forms.2010001300, latcommon.noun1626, latcommoninfl.a_ae15, "agricolae")
 Analysis("agricolae", ls.n1626, forms.2020001100, latcommon.noun1626, latcommoninfl.a_ae18, "agricolae")
 Analysis("agricolae", ls.n1626, forms.2020001600, latcommon.noun1626, latcommoninfl.a_ae24, "agricolae")
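
To see at a glance which tokens are ambiguous, one option is to count the analyses returned for each token. The following is a minimal sketch in plain Julia, not part of Tabulae: countanalyses is a hypothetical helper, and cannedparse is a stand-in that returns placeholder vectors with the same lengths as the two results above, so the cell runs without a compiled parser.

```julia
# Hypothetical helper (not part of Tabulae): map each token to the number
# of analyses that `parsefn` returns for it.
function countanalyses(parsefn, tokens)
    Dict(tok => length(parsefn(tok)) for tok in tokens)
end

# Stand-in for a real parser: placeholder vectors whose lengths match
# the results shown above (1 analysis for amabatur, 4 for agricolae).
canned = Dict(
    "amabatur" => fill("analysis", 1),
    "agricolae" => fill("analysis", 4),
)
cannedparse(tok) = get(canned, tok, String[])

counts = countanalyses(cannedparse, ["amabatur", "agricolae"])

# Tokens with more than one analysis are morphologically ambiguous.
ambiguous = sort([tok for (tok, n) in counts if n > 1])
```

With the parser compiled above, the first argument would presumably be an anonymous function such as tok -> parsetoken(tok, parser).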

Morphological forms and properties

Use the latinForm function to construct a Latin morphological form from the identifier in a morphological analysis. Morphological forms belong to subtypes of the abstract LatinMorphologicalForm type. The following cells, for example, create LMFNoun and LMFFiniteVerb forms from our previous analyses.

nounexample = latinForm(nounparses[1])
typeof(nounexample)
LMFNoun
verbexample = latinForm(verbparses[1])
typeof(verbexample)
LMFFiniteVerb

These different types of form have different properties, as the default display suggests. Noun forms have properties for gender, case and number, while finite verb forms have properties for tense, mood, voice, person and number.

nounexample
masculine genitive singular
verbexample
imperfect indicative passive third singular

We can get at any property of a Latin form with a function having a name beginning with lower-case lmp followed by the property name. For example, the lmpCase function gets the morphological property of case, and lmpTense gets the tense property.

casevalue =  lmpCase(nounexample)
genitive
tensevalue = lmpTense(verbexample)
imperfect

As with morphological forms, morphological properties are not string values. If you need a string label for a property value, use the same label function you use with morphological forms:

label(casevalue)
"genitive"

The same functions that retrieve a property from a form can also be used to construct a property from a string value. For example, you can use lmpTense to construct a property for tense.

lmpTense("perfect")
perfect

We can take advantage of this in normal Julia operations on collections of analyses. For instance, in the following cell we separate out all the analyses for the token agricolae with plural forms, and extract the form object from them:

pluralvalue = lmpNumber("plural")
pluralnouns = filter(parse -> lmpNumber(latinForm(parse)) == pluralvalue, nounparses)
latinForm.(pluralnouns)
2-element Vector{LMFNoun}:
 masculine nominative plural
 masculine vocative plural
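
The filtering pattern above generalizes to any property. The following sketch illustrates the idea in plain Julia under stated assumptions: filterbyproperty is a hypothetical helper, not part of Tabulae, and the named tuples are stand-ins for the four forms of agricolae, so the cell runs on its own.

```julia
# Hypothetical helper (not part of Tabulae): keep the items whose property,
# as extracted by `getprop`, equals `wanted`.
filterbyproperty(getprop, wanted, items) =
    filter(item -> getprop(item) == wanted, items)

# Stand-in data: named tuples mirroring the four forms of "agricolae".
standinforms = [
    (gender = "masculine", case = "genitive", number = "singular"),
    (gender = "masculine", case = "dative", number = "singular"),
    (gender = "masculine", case = "nominative", number = "plural"),
    (gender = "masculine", case = "vocative", number = "plural"),
]

plurals = filterbyproperty(f -> f.number, "plural", standinforms)
datives = filterbyproperty(f -> f.case, "dative", standinforms)
```

Applied to the real analyses, the accessor would be something like parse -> lmpNumber(latinForm(parse)) and the wanted value lmpNumber("plural"), exactly as in the cell above.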

Morphological forms and “parts of speech”

Note that the various types of LatinMorphologicalForm are not equivalent to a traditional “part of speech.” Rather, they are analytical types defined by their unique set of properties. “Verbs” as a category for part of speech include multiple types of morphological forms: finite verbs like the example above, but also infinitives, participles, and other forms.

Consider the ambiguity of the token amare, for example.

multiparses = parsetoken("amare", parser)
multiforms = latinForm.(multiparses)
3-element Vector{LatinMorphologicalForm}:
 present indicative passive second singular
 present imperative passive second singular
 present active infinitive

One of the forms is an infinitive, with only two morphological properties, for tense and voice. Compare the types of the forms.

typeof.(multiforms)
3-element Vector{DataType}:
 LMFFiniteVerb
 LMFFiniteVerb
 LMFInfinitive
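
Because each kind of form is a distinct Julia type, ordinary type tests can sort a mixed collection. The sketch below is an illustration only: StandinForm, StandinFiniteVerb and StandinInfinitive are hypothetical stand-ins for the Tabulae types, and tallytypes is a hypothetical helper, so the cell runs without a parser.

```julia
# Stand-in types (hypothetical, for illustration only): two kinds of form
# with different sets of properties.
abstract type StandinForm end

struct StandinFiniteVerb <: StandinForm
    tense::String
    mood::String
    voice::String
    person::String
    number::String
end

struct StandinInfinitive <: StandinForm
    tense::String
    voice::String
end

# Count how many items of each concrete type a collection contains.
function tallytypes(items)
    counts = Dict{DataType,Int}()
    for item in items
        t = typeof(item)
        counts[t] = get(counts, t, 0) + 1
    end
    counts
end

standins = StandinForm[
    StandinFiniteVerb("present", "indicative", "passive", "second", "singular"),
    StandinFiniteVerb("present", "imperative", "passive", "second", "singular"),
    StandinInfinitive("present", "active"),
]

tally = tallytypes(standins)

# Keep only the finite verbs, using a plain `isa` type test.
finiteonly = filter(f -> f isa StandinFiniteVerb, standins)
```

The same isa test with LMFFiniteVerb should select the finite verbs from multiforms, which is useful before asking for a property that only finite verbs have.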

We can meaningfully look at tense properties for all of the forms.

lmpTense.(multiforms)
3-element Vector{LMPTense}:
 present
 present
 present

But if we apply the lmpPerson function to an infinitive, we get a warning and a resulting value of nothing.

lmpPerson.(multiforms)
3-element Vector{Union{Nothing, LMPPerson}}:
 second
 second
 nothing
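
A result vector that mixes property values with nothing can be cleaned up with standard Julia functions. The following sketch uses plain strings as stand-ins for the person values above (an assumption made so the cell runs on its own); isnothing and something are part of Base Julia.

```julia
# Stand-in values mirroring the result above: two person values and one
# `nothing` for the infinitive.
personvalues = Union{Nothing,String}["second", "second", nothing]

# Drop the `nothing` entries before further processing.
defined = filter(!isnothing, personvalues)

# Or substitute a placeholder label for display purposes.
personlabels = [something(p, "(no person)") for p in personvalues]
```

Which approach fits depends on whether the positions in the result need to stay aligned with the original forms: filtering shortens the vector, while substituting a placeholder keeps one entry per form.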