Kanones’ analyses

Published

June 27, 2024

Note: this page is due to be reimplemented. The material below needs to be implemented with the new CitableParserBuilder (CPB).

Kanones.jl implements the model of the CitableParserBuilder module. Parsing functions (like parsetoken) return a Vector of Analysis objects. In addition to a lexeme and a form, each Analysis includes a stem and an inflectional rule. Conceptually, the stem and rule provide the rationale for an analysis: the stem explains why a specific lexeme was chosen, and the inflectional rule explains how the token was formed by applying a particular inflectional pattern to that stem. The same pairing works in reverse when generating tokens: a stem and a rule together provide enough information to identify a lexeme and a form and to compose a token. Kanones can therefore produce a full Analysis object when generating tokens as well as when parsing them.
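The relationship between stem, rule, lexeme, form and token can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical: the types SketchAnalysis, SketchStem and SketchRule, the function compose_sketch, and all identifier strings are invented for illustration and are not the CitableParserBuilder or Kanones types. The sketch also uses plain strings where Kanones uses URNs, and it ignores accentuation and every other complication a real parser handles.

```julia
# Hypothetical stand-in for an analysis record; NOT the CitableParserBuilder type.
struct SketchAnalysis
    token::String   # surface string that was parsed or generated
    lexeme::String  # identifier of the vocabulary item
    form::String    # identifier of the morphological form
    stem::String    # identifier of the stem: why this lexeme was chosen
    rule::String    # identifier of the rule: how the token was formed
end

# Hypothetical stem record: ties a base string to a lexeme and an inflectional class.
struct SketchStem
    id::String
    lexeme::String
    base::String
    inflclass::String
end

# Hypothetical rule record: ties an ending to a form and an inflectional class.
struct SketchRule
    id::String
    form::String
    ending::String
    inflclass::String
end

# Pairing a stem with a rule of the same inflectional class is enough to
# identify a lexeme and a form and to compose a token.
# Returns `nothing` when the stem and rule belong to different classes.
function compose_sketch(stem::SketchStem, rule::SketchRule)
    stem.inflclass == rule.inflclass || return nothing
    SketchAnalysis(stem.base * rule.ending, stem.lexeme, rule.form, stem.id, rule.id)
end

# Made-up identifiers and a made-up class name, for illustration only.
stem = SketchStem("stems.example1", "lexemes.example1", "βουλ", "class_a")
rule = SketchRule("rules.example1", "forms.gen_sg", "ῆς", "class_a")
a = compose_sketch(stem, rule)
```

The resulting record carries all five pieces of information at once, which is why the same structure can serve as the output of both parsing and generation.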

Kanones further associates an implementation of an Orthography with each parser. You can therefore use Kanones to build parsers tailored not only to specific features of a language (vocabulary or inflectional patterns specific to a particular corpus or dialect), but also to specific orthographic systems and the phonology they represent. The Kanones GitHub repository, for example, includes stems and rules in two completely different orthographies: the standard orthography of printed literary Greek, and the orthography of Athenian inscriptions prior to 403 BCE.

In Kanones, each of the four components of an Analysis is a Cite2Urn value. The identifiers for lexemes and morphological forms are potentially applicable to any parser you build with Kanones, whereas the stems and rules for the same lexeme and form may differ if you are parsing texts in different orthographies. Because references to lexemes and forms remain meaningful across parsers for different orthographies, you can analyze a token in one orthography and then generate the corresponding token for the same lexeme and form in another orthography.
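The idea of using a shared lexeme and form as the pivot between orthographies can be sketched without Kanones at all. In the following illustration, two hypothetical lookup tables map the same (lexeme, form) pair to a token in each orthography. The functions analyze_sketch, generate_sketch and transcode_sketch, the identifier strings, and the table contents are all assumptions made for this example; the Attic spelling is entered by hand rather than produced by a parser.

```julia
# Tokens for the same lexeme and form in two orthographies (entered by hand).
lit_token = "βουλῆς"
attic_token = "βολε͂ς"

# Hypothetical tables: (lexeme, form) => token, one table per orthography.
# The identifier strings are made up for illustration.
literary = Dict(("lexemes.example1", "forms.gen_sg") => lit_token)
attic = Dict(("lexemes.example1", "forms.gen_sg") => attic_token)

# "Parsing": find every (lexeme, form) pair whose token matches the input.
analyze_sketch(token, table) = [key for (key, value) in table if value == token]

# "Generating": look up the token for a (lexeme, form) pair, or `nothing` if absent.
generate_sketch(key, table) = get(table, key, nothing)

# Transcoding: analyze in one orthography, then generate in the other.
# One result per analysis, so an ambiguous token can yield several results.
function transcode_sketch(token, from, to)
    [generate_sketch(key, to) for key in analyze_sketch(token, from)]
end

result = transcode_sketch(lit_token, literary, attic)
```

The two tables never refer to each other directly. They are connected only through the orthography-independent identifiers, which is the property the paragraph above describes.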

A magical example: transcoding content

The datasets directory of the Kanones.jl repository includes sample datasets in traditional literary Greek orthography, and in the orthography used in Athens before 403 BCE.

# `repo` is assumed to hold the path to a local copy of the Kanones.jl repository.
litgreekfiles = joinpath(repo, "datasets", "literarygreek-rules")
atticfiles = joinpath(repo, "datasets", "attic")

We build a parser for each.

using Kanones, CitableParserBuilder
#litgreek = dataset(litgreekfiles)
#litgreekparser = KanonesStringParser(litgreek)

We construct the second dataset with an optional parameter that explicitly sets the orthography to use.

using AtticGreek
#attic = dataset(atticfiles; ortho = atticGreek())
#atticparser = KanonesStringParser(attic)

We can analyze a token written in standard orthography. (In this example, since we expect only one analysis for βουλῆς, we’re just taking the first analysis from the resulting Vector.)

#analysis = parsetoken("βουλῆς", litgreekparser)[1]
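Indexing a result with [1] throws a BoundsError if the Vector is empty, and silently discards alternatives if it holds more than one element. Assuming a parsing function returns an empty Vector when it finds no analysis, a small guard like the following hypothetical helper (first_or_nothing is not part of Kanones or CitableParserBuilder) makes both cases explicit. It works on any Vector, so it is shown here with plain strings.

```julia
# Hypothetical helper: return the first element of a result Vector,
# `nothing` if the Vector is empty, and warn if results are ambiguous.
function first_or_nothing(results::AbstractVector)
    if isempty(results)
        return nothing
    end
    if length(results) > 1
        @warn "Multiple results found; using the first" count = length(results)
    end
    return first(results)
end

none = first_or_nothing(String[])                    # no results at all
single = first_or_nothing(["only-result"])           # exactly one result
several = first_or_nothing(["result-1", "result-2"]) # ambiguous: logs a warning
```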

In Kanones, all components of the analysis are identified by URNs rather than string values. We can use the lexemeurn and formurn functions to retrieve those elements of an analysis, and then generate the equivalent string from a different dataset with a different orthography.

#vocabitem = lexemeurn(analysis)
#form = formurn(analysis)
#generate(vocabitem, form, attic)