Managing data in multiple orthographies

Published

June 13, 2024

Incomplete

TBA

Overview

Managing the same morphological data in more than one orthography is a strength of Tabulae’s design.

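To do: add an example of correspondences among the 23-, 24- and 25-character orthographies (ortho2[345]), with or without the i/j and u/v distinctions as appropriate to each.
Pair each orthography-specific directory with a common directory, so that shared data does not need to be replicated.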
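
The i/j and u/v correspondence can be sketched in plain Julia as a character mapping. This is only an illustration, not how Tabulae works internally: it assumes that the 23-character orthography writes i where the 25-character orthography writes j, and u where it writes v, and the names `TO_LAT23` and `tolat23` are hypothetical.

```julia
# Assumed correspondence, for illustration only: the 23-character orthography
# writes i for j and u for v. Only lowercase characters are handled here.
const TO_LAT23 = Dict('j' => 'i', 'v' => 'u')

# Hypothetical helper (not part of Tabulae): respell a token character by
# character, leaving any character without a mapping unchanged.
tolat23(token::AbstractString) = map(c -> get(TO_LAT23, c, c), token)

tolat23("amavissem") == "amauissem"   # true
```

The mapping only runs in one direction. Going from 23 to 25 characters would require knowing whether each i or u is a vowel or a consonant, which a character table cannot decide.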

We’ll use three directories: a common directory, and specialized directories for a 23-character and a 25-character orthography.

# `repo` is assumed to hold the path to the root of a local copy of the Tabulae repository.
shareddir = joinpath(repo, "datasets", "core-infl-shared")
lat25dir = joinpath(repo, "datasets", "core-infl-lat25")
lat23dir = joinpath(repo, "datasets", "core-infl-lat23")

We’ll make two datasets by combining each specialized directory with the common one, and build a parser from each dataset:

using Tabulae, CitableParserBuilder
ds23 = dataset([shareddir, lat23dir])
ds25 = dataset([shareddir, lat25dir])

p23 = tabulaeStringParser(ds23)
p25 = tabulaeStringParser(ds25)

TabulaeStringParser(Any["agricola|ls.n1626|forms.2010001100|latcommon.noun1626|latcommoninfl.a_ae13", "agricolae|ls.n1626|forms.2010001200|latcommon.noun1626|latcommoninfl.a_ae14", "agricolae|ls.n1626|forms.2010001300|latcommon.noun1626|latcommoninfl.a_ae15", "agricolam|ls.n1626|forms.2010001400|latcommon.noun1626|latcommoninfl.a_ae16", "agricola|ls.n1626|forms.2010001500|latcommon.noun1626|latcommoninfl.a_ae17", "agricolae|ls.n1626|forms.2020001100|latcommon.noun1626|latcommoninfl.a_ae18", "agricolarum|ls.n1626|forms.2020001200|latcommon.noun1626|latcommoninfl.a_ae19", "agricolis|ls.n1626|forms.2020001300|latcommon.noun1626|latcommoninfl.a_ae20", "agricolas|ls.n1626|forms.2020001400|latcommon.noun1626|latcommoninfl.a_ae21", "agricolis|ls.n1626|forms.2020001500|latcommon.noun1626|latcommoninfl.a_ae22"  …  "fuisset|ls.n46529|forms.3315210000|latcommon.irregverbn46529bm|irreginfl.irregular2", "fuissemus|ls.n46529|forms.3125210000|latcommon.irregverbn46529bn|irreginfl.irregular2", "fuissetis|ls.n46529|forms.3225210000|latcommon.irregverbn46529bo|irreginfl.irregular2", "fuissent|ls.n46529|forms.3325210000|latcommon.irregverbn46529bp|irreginfl.irregular2", "es|ls.n46529|forms.3211310000|latcommon.irregverbn46529bq|irreginfl.irregular2", "este|ls.n46529|forms.3221310000|latcommon.irregverbn46529br|irreginfl.irregular2", "esto|ls.n46529|forms.3213310000|latcommon.irregverbn46529bs|irreginfl.irregular2", "estote|ls.n46529|forms.3223310000|latcommon.irregverbn46529bt|irreginfl.irregular2", "esto|ls.n46529|forms.3313310000|latcommon.irregverbn46529bu|irreginfl.irregular2", "sunto|ls.n46529|forms.3323310000|latcommon.irregverbn46529bv|irreginfl.irregular2"], LatinOrthography.Latin24("abcdefghiklmnopqrstuvxyzABCDEFGHIKLMNOPQRSTUVXYZ.,;:? \n\t+", DataType[Orthography.LexicalToken, Orthography.PunctuationToken, LatinOrthography.EncliticToken], LatinOrthography.tokenizeLatin24), "|")

results25 = parsetoken("amavissem", p25)

1-element Vector{Analysis}:
 Analysis("amavissem", ls.n2280, forms.3115210000, latcommon.verbn2280, lat25.ere_plupft7)

results23 = parsetoken("amauissem", p23)

1-element Vector{Analysis}:
 Analysis("amauissem", ls.n2280, forms.3115210000, latcommon.verbn2280, lat23.ere_plupft7)

We can verify computationally that “amavissem” in the 25-character orthography and “amauissem” in the 23-character orthography are the same word form: their analyses have identical lexeme and form identifiers.

lexemeurn(results23[1]) == lexemeurn(results25[1])

true

formurn(results23[1]) == formurn(results25[1])

true
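
The same comparison can be extended to tokens with more than one analysis. The following is a minimal sketch, not part of Tabulae or CitableParserBuilder: the helper `equivalent` is hypothetical, and it takes the accessor functions as arguments so that it can be tried here with stand-in data (named tuples copying the identifiers from the output above) without loading any packages.

```julia
# Hypothetical helper: two lists of analyses are treated as equivalent when
# every analysis in one list has a counterpart in the other with the same
# lexeme and the same form. `lexeme` and `form` are accessor functions.
function equivalent(v1, v2, lexeme, form)
    same(a, b) = lexeme(a) == lexeme(b) && form(a) == form(b)
    covers(xs, ys) = all(x -> any(y -> same(x, y), ys), xs)
    covers(v1, v2) && covers(v2, v1)
end

# Stand-in data, so the sketch runs on its own.
r23 = [(lexeme = "ls.n2280", form = "forms.3115210000")]
r25 = [(lexeme = "ls.n2280", form = "forms.3115210000")]

equivalent(r23, r25, a -> a.lexeme, a -> a.form)   # true
```

With the parsers built above, the corresponding call would be `equivalent(results23, results25, lexemeurn, formurn)`. Two empty result lists also count as equivalent under this definition, so check separately that a token was actually parsed.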