shareddir = joinpath(repo, "datasets", "core-infl-shared")
lat25dir = joinpath(repo, "datasets", "core-infl-lat25")
lat23dir = joinpath(repo, "datasets", "core-infl-lat23")

Managing data in multiple orthographies
Incomplete
TBA
Overview
Managing the same lexical data in more than one orthography is a strength of Tabulae’s design.
- example of ortho2[345] correspondences with appropriate distinctions of i/j, u/v (or not)
- pair with commondir where no need to replicate
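As a rough illustration of the i/j and u/v correspondences mentioned above, the sketch below folds a string written with 25 characters into its 23-character spelling. The function `to_lat23` is a hypothetical helper, not part of Tabulae, and it assumes that the 23-character orthography writes both vocalic and consonantal i as `i` and both vocalic and consonantal u as `u` in lower case. Upper-case conventions vary and are left untouched here.

```julia
# Hypothetical helper, not part of Tabulae: fold a lower-case string in a
# 25-character orthography (distinct i/j and u/v) into a 23-character
# spelling. Assumption: the 23-character orthography uses only `i` and `u`.
function to_lat23(s::AbstractString)
    map(s) do c
        if c == 'j'
            'i'
        elseif c == 'v'
            'u'
        else
            c
        end
    end
end

folded = to_lat23("amavissem")
```

The mapping only runs in one direction: going from 23 to 25 characters needs lexical knowledge, which is why separate orthography-specific data directories are useful.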
We’ll use three directories: a common directory, and specialized directories for a 23-character and a 25-character orthography.
We’ll make two datasets by combining each specialized directory with the common one, and build a parser from each:
using Tabulae, CitableParserBuilder
ds23 = dataset([shareddir, lat23dir])
ds25 = dataset([shareddir, lat25dir])
p23 = tabulaeStringParser(ds23)
p25 = tabulaeStringParser(ds25)

TabulaeStringParser(Any["agricola|ls.n1626|forms.2010001100|latcommon.noun1626|latcommoninfl.a_ae13", "agricolae|ls.n1626|forms.2010001200|latcommon.noun1626|latcommoninfl.a_ae14", "agricolae|ls.n1626|forms.2010001300|latcommon.noun1626|latcommoninfl.a_ae15", "agricolam|ls.n1626|forms.2010001400|latcommon.noun1626|latcommoninfl.a_ae16", "agricola|ls.n1626|forms.2010001500|latcommon.noun1626|latcommoninfl.a_ae17", "agricolae|ls.n1626|forms.2020001100|latcommon.noun1626|latcommoninfl.a_ae18", "agricolarum|ls.n1626|forms.2020001200|latcommon.noun1626|latcommoninfl.a_ae19", "agricolis|ls.n1626|forms.2020001300|latcommon.noun1626|latcommoninfl.a_ae20", "agricolas|ls.n1626|forms.2020001400|latcommon.noun1626|latcommoninfl.a_ae21", "agricolis|ls.n1626|forms.2020001500|latcommon.noun1626|latcommoninfl.a_ae22" … "fuisset|ls.n46529|forms.3315210000|latcommon.irregverbn46529bm|irreginfl.irregular2", "fuissemus|ls.n46529|forms.3125210000|latcommon.irregverbn46529bn|irreginfl.irregular2", "fuissetis|ls.n46529|forms.3225210000|latcommon.irregverbn46529bo|irreginfl.irregular2", "fuissent|ls.n46529|forms.3325210000|latcommon.irregverbn46529bp|irreginfl.irregular2", "es|ls.n46529|forms.3211310000|latcommon.irregverbn46529bq|irreginfl.irregular2", "este|ls.n46529|forms.3221310000|latcommon.irregverbn46529br|irreginfl.irregular2", "esto|ls.n46529|forms.3213310000|latcommon.irregverbn46529bs|irreginfl.irregular2", "estote|ls.n46529|forms.3223310000|latcommon.irregverbn46529bt|irreginfl.irregular2", "esto|ls.n46529|forms.3313310000|latcommon.irregverbn46529bu|irreginfl.irregular2", "sunto|ls.n46529|forms.3323310000|latcommon.irregverbn46529bv|irreginfl.irregular2"], LatinOrthography.Latin24("abcdefghiklmnopqrstuvxyzABCDEFGHIKLMNOPQRSTUVXYZ.,;:? \n\t+", DataType[Orthography.LexicalToken, Orthography.PunctuationToken, LatinOrthography.EncliticToken], LatinOrthography.tokenizeLatin24), "|")
results25 = parsetoken("amavissem", p25)

1-element Vector{Analysis}:
 Analysis("amavissem", ls.n2280, forms.3115210000, latcommon.verbn2280, lat25.ere_plupft7)

results23 = parsetoken("amauissem", p23)

1-element Vector{Analysis}:
 Analysis("amauissem", ls.n2280, forms.3115210000, latcommon.verbn2280, lat23.ere_plupft7)
We can verify computationally that “amavissem” in the 25-character orthography is the same word form as “amauissem” in the 23-character orthography: the two analyses have identical lexeme and form identifiers.
lexemeurn(results23[1]) == lexemeurn(results25[1])

true

formurn(results23[1]) == formurn(results25[1])

true
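The comparisons above look only at the first analysis in each result. When a token has several analyses, one way to generalize the check is to ask whether the two result sets share any lexeme and form pair. The sketch below is self-contained and uses named tuples as stand-ins for `Analysis` objects, with identifier values copied from the output above. The names `shareanalysis` and `lexform` are hypothetical and not part of Tabulae or CitableParserBuilder.

```julia
# Hypothetical helper: true if some analysis in `as` and some analysis in `bs`
# yield the same value under `key`.
function shareanalysis(as, bs; key)
    !isempty(intersect(Set(map(key, as)), Set(map(key, bs))))
end

# Stand-ins for parser output (named tuples, not real Analysis objects).
r25 = [(token = "amavissem", lexeme = "ls.n2280", form = "forms.3115210000")]
r23 = [(token = "amauissem", lexeme = "ls.n2280", form = "forms.3115210000")]

# The key ignores the token string, so spelling differences do not matter.
lexform(a) = (a.lexeme, a.form)

matched = shareanalysis(r25, r23; key = lexform)
```

With real parser output, the same idea should work by passing a key built from the accessors used above, for example `key = a -> (lexemeurn(a), formurn(a))`.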