Creating a data set
Content of this page TBA
- dataset is an abstraction. Future versions of Kanones may have multiple concrete types.
- right now, the only concrete type is FilesDataset
Files dataset:
- specified directory layout
- delimited text files: no restrictions on naming, number of files, inclusion of blank lines for legibility
- associated with a specific orthography
Subsequent pages detail working with files in a FilesDataset.
mydata
└── stems-tables
    ├── adjectives
    ├── nouns
    ├── verbs-compound
    └── verbs-simplex
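The file conventions listed above (delimited text, blank lines allowed for legibility) can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not part of Kanones. The pipe delimiter, the function name read_rows, and the sample field values are all assumptions, since this page says only that the files are delimited text. The sketch also assumes every non-blank line is a data row.

```python
from typing import Iterable

DELIMITER = "|"  # assumption: the page says only "delimited text"


def read_rows(lines: Iterable[str], delimiter: str = DELIMITER) -> list[list[str]]:
    """Split each non-blank line into trimmed fields; blank lines are skipped."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # blank lines exist only for legibility and carry no data
        rows.append([field.strip() for field in line.split(delimiter)])
    return rows


# Hypothetical sample content: two data rows separated by a blank line.
sample = """\
stems.n1|lex.n1|abc|masculine

stems.n2|lex.n2|def|feminine
"""
rows = read_rows(sample.splitlines())
print(len(rows))  # 2
```

Because blank lines are dropped before splitting, a file can be spaced out freely without changing the rows that are read.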
Instantiating a dataset
- root directory
- orthography
(See previous pages on the directory layout of stem and rule types in a FilesDataset.)
This is enough for the stemsarray and rulesarray functions to collect all stem and rule data from the file system of a FilesDataset.
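To show why a root directory is enough to locate every stem file, here is a rough Python sketch of the directory walk. It is not the Kanones implementation. DatasetSpec and stem_files are hypothetical names, the orthography is reduced to a placeholder string, and the sketch assumes the stem-type directories (adjectives, nouns, and so on) sit directly under stems-tables.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DatasetSpec:
    """Hypothetical stand-in for the two values a dataset is built from."""
    root: Path
    orthography: str  # placeholder label for the dataset's orthography


def stem_files(spec: DatasetSpec) -> dict[str, list[Path]]:
    """Map each subdirectory of stems-tables to the files it contains."""
    found: dict[str, list[Path]] = {}
    stems_dir = spec.root / "stems-tables"
    if not stems_dir.is_dir():
        return found  # nothing to collect if the layout is missing
    for subdir in sorted(p for p in stems_dir.iterdir() if p.is_dir()):
        # File names and counts are unrestricted, so take every regular file.
        found[subdir.name] = sorted(f for f in subdir.iterdir() if f.is_file())
    return found
```

Keying the result by directory name keeps the link between a file and its stem type, which is the only information the layout itself provides.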
Stems and Rules
There are five stem types that are subtypes of KanonesStemType.
There are nine rules types that are subtypes of KanonesRuleType.
The description of IO is out of date and wrong.
Each stem type implements the readstemrow function; each rule type implements a corresponding function for rules. These functions read a single delimited-text line and construct a stem or rule of their type.
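The general pattern of one row reader per type can be sketched as follows. This is a Python illustration only, and it does not reproduce the actual Kanones functions or file columns. The record types, their fields, the pipe delimiter, and the name read_row are all hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class NounStem:  # hypothetical columns, for illustration only
    stem_id: str
    lexeme_id: str
    stem: str
    gender: str


@dataclass(frozen=True)
class SimplexVerbStem:  # hypothetical columns, for illustration only
    stem_id: str
    lexeme_id: str
    stem: str


def read_row(record_type, line: str, delimiter: str = "|"):
    """Build one record of record_type from a single delimited line."""
    parts = [part.strip() for part in line.split(delimiter)]
    expected = len(fields(record_type))
    if len(parts) != expected:
        raise ValueError(
            f"{record_type.__name__}: expected {expected} fields, got {len(parts)}"
        )
    return record_type(*parts)


noun = read_row(NounStem, "stems.n1|lex.n1|abc|feminine")
print(noun.gender)  # feminine
```

Checking the field count against the record type catches a line filed under the wrong type directory before a malformed record is built.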
For the four collections Kanones needs:
Forms collection
- a collection of all possible morphological forms is precompiled as part of the GH repo. These are the only forms Kanones can work with: you cannot change these.
Collection of lexemes
- a large collection of Greek lexemes is included in two collections: lsjx is a collection of candidate lexemes generated from LSJ articles; lsj is a subset of those that have been verified to be lexemes.
Inflectional rules
- for standard literary Greek, datasets/literarygreek-rules should be all you need for Attic. You can add to these if you want to, e.g., expand coverage of literary dialects. Best practice: maintain additions in separate files, and please submit a pull request to add them to Kanones' GitHub repo!
- for the Attic alphabet before 403 BCE, sample rules are in datasets/attic.
Stems
In practice, this is the dataset you're most likely to edit.
Identifying lexemes:
- check LSJ at folio2.furman.edu; if you find your item, use its ID value. Otherwise, register your own namespace and create a new ID in that namespace
- use separate files to group things easily, e.g., proper names in a particular text or corpus that do not appear in LSJ
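The lookup-then-mint procedure for lexeme IDs can be sketched in Python as below. This is an illustration, not Kanones code. The function name, the dictionary-based lookup, and the "namespace.nNUMBER" ID shape are assumptions made for the example, and the lemma strings and IDs are placeholders rather than real LSJ values.

```python
def lexeme_id(
    lemma: str,
    lsj_ids: dict[str, str],
    namespace: str,
    local_ids: dict[str, str],
) -> str:
    """Return the LSJ ID for lemma if known; otherwise mint one in our namespace."""
    if lemma in lsj_ids:
        return lsj_ids[lemma]
    if lemma not in local_ids:
        # Assumed ID shape: namespace, a dot, then "n" and a running number.
        local_ids[lemma] = f"{namespace}.n{len(local_ids) + 1}"
    return local_ids[lemma]


# Placeholder data: one lemma already identified in LSJ, one that is not.
known = {"known-lemma": "lsj.n1"}
minted: dict[str, str] = {}
print(lexeme_id("known-lemma", known, "mynames", minted))  # lsj.n1
print(lexeme_id("new-name", known, "mynames", minted))     # mynames.n1
```

Keeping minted IDs in their own mapping mirrors the advice to keep additions in separate files: entries outside LSJ stay grouped under the namespace that owns them.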