🤷‍♂️ Creating a data set

Published

July 2, 2024

Caution

Content of this page TBA

A dataset is an abstraction. Future versions of Kanones may have multiple concrete types.
Right now, the only concrete type is the FilesDataset.

Files dataset:

a specified directory layout
delimited-text files, with no restrictions on file names, the number of files, or blank lines included for legibility
association with a specific orthography

Subsequent pages detail working with files in a FilesDataset.

mydata
└── stems-tables
    ├── adjectives
    ├── nouns
    ├── verbs-compound
    └── verbs-simplex

Instantiating a dataset

Instantiating a FilesDataset requires two things:

the root directory of the dataset
an orthography

(See previous pages on the directory layout of stem and rule types in a FilesDataset.)

This is enough for the stemsarray and rulesarray functions to collect all stem and rule data from the file system of a FilesDataset.
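The idea of gathering rows from a directory tree of delimited-text files can be pictured with a small sketch. It is written in Python and is not how stemsarray or rulesarray are implemented. The helper name `collect_rows` is hypothetical, the `|` delimiter is an assumption that this page does not specify, and every regular file under the root is treated as data, with no header handling.

```python
from pathlib import Path


def collect_rows(root, delimiter="|"):
    """Gather delimited rows from every file under `root`, grouped by subdirectory.

    Hypothetical helper for illustration only; it is not part of Kanones.
    The "|" delimiter is an assumption, not something this page specifies.
    """
    rows = {}
    # File names and file counts are unrestricted, so visit every file.
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Group rows by the directory that holds the file, relative to the root.
        group = path.parent.relative_to(root).as_posix()
        for line in path.read_text(encoding="utf-8").splitlines():
            # Blank lines are allowed for legibility, so skip them.
            if not line.strip():
                continue
            rows.setdefault(group, []).append(line.split(delimiter))
    return rows
```

Grouping by subdirectory mirrors the layout shown earlier, where each directory such as stems-tables/nouns holds files for one kind of data.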

Stems and Rules

There are five stem types that are subtypes of KanonesStemType.

There are nine rule types that are subtypes of KanonesRuleType.

Out of date

The description of IO is out of date and wrong.

Each stem type implements the readstemrow function; each rule type implements ``. Each of these functions reads a single delimited-text line and constructs a stem or rule of the corresponding type.

Kanones needs four collections:

Forms collection

A collection of all possible morphological forms is precompiled as part of the GitHub repository. These are the only forms Kanones can work with; you cannot change them.

Collection of lexemes

A large set of Greek lexemes is included in two collections: lsjx is a collection of candidate lexemes generated from LSJ articles; lsj is the subset of those candidates that have been verified to be lexemes.

Inflectional rules

For standard literary Greek, datasets/literarygreek-rules should be all you need for Attic. You can add to these rules if you want to, e.g., expand coverage of literary dialects. Best practice: maintain additions in separate files, and please submit a pull request to add them to Kanones’ GitHub repository!
For the Attic alphabet before 403 BCE, sample rules are in datasets/attic.

Stems

In practice, this is the dataset you’re most likely to edit.

Identifying lexemes:

Check LSJ at folio2.furman.edu; use its ID value if you find your item. Otherwise, register your own namespace and create a new ID in that namespace.
Use separate files to group things easily, e.g., proper names in a particular text or corpus that do not appear in LSJ.
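The lookup-then-fallback rule for identifiers can be sketched as follows. This is an illustration in Python, not part of Kanones. The function name `lexeme_id`, the dictionary arguments, and the `namespace.nN` shape of newly created IDs are all assumptions; follow whatever ID convention your own namespace actually uses.

```python
def lexeme_id(lemma, lsj_ids, namespace, local_ids):
    """Return an identifier for `lemma`, preferring an existing LSJ ID.

    Hypothetical helper for illustration only.
    lsj_ids:   mapping of lemma -> ID value found in LSJ
    namespace: the namespace you registered for your own lexemes
    local_ids: mapping of lemma -> ID already created in that namespace
               (updated in place when a new ID is created)
    """
    # First choice: reuse the LSJ ID if the item appears in LSJ.
    if lemma in lsj_ids:
        return lsj_ids[lemma]
    # Otherwise create a new ID in your own namespace, once per lemma.
    # The "<namespace>.n<number>" format is an assumed convention.
    if lemma not in local_ids:
        local_ids[lemma] = f"{namespace}.n{len(local_ids) + 1}"
    return local_ids[lemma]
```

Keeping locally created IDs in their own mapping parallels the advice to keep additions in separate files: items missing from LSJ stay easy to find and to review later.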