Build a parser and parse Latin strings

Published

July 5, 2024

To replicate all the steps in this tutorial:

  • install Julia if you haven’t already done so
  • download or clone the Tabulae.jl repository
  • start a Julia REPL
  • assign to the variable repo the path to the cloned repository
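
As a minimal sketch of the last step, the assignment might look like the following. The location used here (a Tabulae.jl directory in your home directory) is only an assumed example; substitute the path where you actually downloaded or cloned the repository.

```julia
# Hypothetical location of the cloned repository: adjust to your own setup.
repo = joinpath(homedir(), "Tabulae.jl")

# Report whether the path really is a directory before going further.
println("Repository directory found: ", isdir(repo))
```

If the message reports false, correct the path before continuing, since every later step builds paths from repo.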

Building a parser from local files

You can build a parser from one or more sets of delimited-text files organized in directories following Tabulae’s conventions. In this tutorial, we’ll use the files in the core-infl-shared and core-infl-lat25 directories, both found in the datasets directory of the Tabulae GitHub repository.

If you have a variable named repo with the root directory of the Tabulae repository, then the full paths to the two directories will be:

shareddir = joinpath(repo, "datasets", "core-infl-shared") 
lat25dir = joinpath(repo, "datasets", "core-infl-lat25") 
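
Before building a data set, it can help to confirm that both directories exist. The sketch below uses only Julia's standard library; the helper name missingdatasets and the constant DATASET_NAMES are hypothetical names introduced for illustration and are not part of Tabulae.

```julia
# Names of the two dataset directories used in this tutorial.
const DATASET_NAMES = ["core-infl-shared", "core-infl-lat25"]

# Hypothetical helper (not part of Tabulae): return the expected dataset
# paths under `repo` that do not exist as directories.
function missingdatasets(repo::AbstractString; names = DATASET_NAMES)
    paths = [joinpath(repo, "datasets", n) for n in names]
    filter(!isdir, paths)
end
```

Calling missingdatasets(repo) should return an empty vector when the repository is complete. Any paths it does return are the ones to fix before continuing.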

Instantiate a data set

You can create a Tabulae.DataSet from a list of one or more directories.

using Tabulae
ds = dataset([shareddir, lat25dir])

Compile a parser

You can then build a parser from a data set.

p = tabulaeStringParser(ds)

Interactive parsing

Use the parsetoken function (from the CitableParserBuilder package) to parse a string with a parser.

using CitableParserBuilder
s = "agricolae"
parses = parsetoken(s, p)
4-element Vector{Analysis}:
 Analysis("agricolae", ls.n1626, forms.2010001200, latcommon.noun1626, latcommoninfl.a_ae14, "agricolae")
 Analysis("agricolae", ls.n1626, forms.2010001300, latcommon.noun1626, latcommoninfl.a_ae15, "agricolae")
 Analysis("agricolae", ls.n1626, forms.2020001100, latcommon.noun1626, latcommoninfl.a_ae18, "agricolae")
 Analysis("agricolae", ls.n1626, forms.2020001600, latcommon.noun1626, latcommoninfl.a_ae24, "agricolae")

Morphological analyses

The result is a Vector of analyses. Each Analysis includes identifiers for a morphological form object and a lexeme (or vocabulary item). You can use the lexemeurn function (from the CitableParserBuilder package) to extract the lexeme’s identifier from an Analysis.

using CitableParserBuilder
lexemeurn(parses[1])
ls.n1626

Tabulae’s latinForm function extracts the form identifier from an analysis and creates a LatinMorphologicalForm from it.

forms = latinForm.(parses)
4-element Vector{LMFNoun}:
 masculine genitive singular
 masculine dative singular
 masculine nominative plural
 masculine vocative plural

Note that morphological forms are not string values. If you want a string label for a form, use the aptly named label function.

label.(forms)
4-element Vector{String}:
 "masculine genitive singular"
 "masculine dative singular"
 "masculine nominative plural"
 "masculine vocative plural"

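Once you have labels, they are ordinary strings, so Julia's usual string functions apply to them. The sketch below copies the four labels shown above as literal values, so that it runs without Tabulae installed; in a real session you would use the result of label.(forms) instead.

```julia
# The four labels from the output above, copied as literal strings
# so that this example does not depend on Tabulae.
labels = [
    "masculine genitive singular",
    "masculine dative singular",
    "masculine nominative plural",
    "masculine vocative plural",
]

# Keep only the labels describing plural forms.
plurals = filter(l -> occursin("plural", l), labels)

# Combine the token and all of its labels into one display line.
line = string("agricolae", ": ", join(labels, "; "))
println(line)
```

Filtering on label text is convenient for quick inspection, but it depends on the exact wording of the labels.
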
See the following tutorial on working with morphological forms.