Organization of datasets

Published

July 5, 2024

Warning

These notes are highly incomplete

A dataset is a set of delimited-text files in an explicitly defined orthography.

The file layout follows a set of conventions (see the reference section for the full definition).

Example of a noun stem

using Tabulae, CitableBase

stemdelimited = "latcommon.noun1626|ls.n1626|agricol|masculine|a_ae"
nounstem = fromcex(stemdelimited, TabulaeNounStem)
Noun stem agricol- (masculine)
cex(nounstem)
"latcommon.noun1626|ls.n1626|agricol|masculine|a_ae"

A stem can always be round-tripped through its delimited-text form:

fromcex(stemdelimited, TabulaeNounStem) |> cex == stemdelimited
true
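
To make the round trip concrete, here is a minimal sketch in plain Julia that does not use Tabulae at all. The type `SketchNounStem`, its field names, and the functions `parsestem` and `todelimited` are hypothetical, inferred from the five columns of the example line; they are not Tabulae's actual definitions.

```julia
# Hypothetical stand-in for a noun stem record. Field names are guesses
# based on the example line, not Tabulae's real type.
struct SketchNounStem
    stemid::String
    lexemeid::String
    stem::String
    gender::String
    inflclass::String
end

# Parse one pipe-delimited line into the sketch type.
function parsestem(line::AbstractString)
    cols = split(line, "|")
    length(cols) == 5 || throw(ArgumentError("expected 5 columns, got $(length(cols))"))
    SketchNounStem(cols...)
end

# Write the record back out in the same column order.
todelimited(s::SketchNounStem) =
    join([s.stemid, s.lexemeid, s.stem, s.gender, s.inflclass], "|")

line = "latcommon.noun1626|ls.n1626|agricol|masculine|a_ae"
roundtripped = parsestem(line) |> todelimited
roundtripped == line
```

Because every column is kept as a plain string and written back in the same order, parsing followed by serialization reproduces the input line exactly.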

Example of a noun rule

  • A stem and a rule are always joined by inflectional type.
  • For nouns, they must also agree in gender.
ruledelimited = "latcommoninfl.a_ae16|a_ae|am|masculine|accusative|singular"
nounrule = fromcex(ruledelimited, TabulaeNounRule)
Noun inflection rule: ending -am in class a_ae can be masculine accusative singular.
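
The join between a stem and a rule can be sketched in plain Julia without Tabulae. The column positions, the named-tuple field names, and the helpers `canjoin` and `buildform` below are assumptions read off the two example lines above, not Tabulae's API; the sketch also assumes that a surface form is simply the stem string followed by the ending.

```julia
# Split the two example lines into columns (plain strings, no Tabulae types).
stemcols = split("latcommon.noun1626|ls.n1626|agricol|masculine|a_ae", "|")
rulecols = split("latcommoninfl.a_ae16|a_ae|am|masculine|accusative|singular", "|")

# Hypothetical field names, assigned by column position.
stem = (stem = stemcols[3], gender = stemcols[4], inflclass = stemcols[5])
rule = (inflclass = rulecols[2], ending = rulecols[3], gender = rulecols[4],
        grammcase = rulecols[5], number = rulecols[6])

# A noun stem and a noun rule join only when both the inflectional type
# and the gender agree.
canjoin(stem, rule) = stem.inflclass == rule.inflclass && stem.gender == rule.gender

# Concatenate stem and ending when the join succeeds; otherwise return nothing.
buildform(stem, rule) = canjoin(stem, rule) ? stem.stem * rule.ending : nothing

form = buildform(stem, rule)   # "agricolam"
```

Here both records share the inflectional type `a_ae` and the gender `masculine`, so the join succeeds and yields the accusative singular form; changing either value on one side makes `buildform` return `nothing`.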