Build a parser and parse Greek strings
To replicate all the steps in this tutorial:
- install Julia if you haven’t already done so
- download or clone the Kanones.jl repository
- start a Julia REPL
Building a parser from local files
You can build a parser from delimited-text files organized in directories following Kanónes’ conventions. In this tutorial, we’ll use the files in the literarygreek-rules directory in the datasets directory of the Kanónes GitHub repository.
If you have a variable named repo with the root directory of the Kanónes repository, then the literarygreek-rules directory will be:

```julia
srcdir = joinpath(repo, "datasets", "literarygreek-rules")
```
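If you don’t already have a repo variable, one way to define it is sketched below. The local path and clone URL here are assumptions for illustration, not part of the tutorial; adjust them to wherever you put your copy of the repository.

```julia
# Sketch: point `repo` at a local clone of the Kanones.jl repository.
# The path and URL below are assumptions; adjust to your own setup.
repo = joinpath(homedir(), "Kanones.jl")
# Uncomment to clone the repository into that location:
# run(`git clone https://github.com/neelsmith/Kanones.jl $repo`)
```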
Instantiate a data set
You can create a Kanones.FilesDataSet from a list of one or more directories.
```julia
using Kanones
kds = dataset([srcdir])
```
Compile a parser
You can then build a parser from a data set.
```julia
p = kParser(kds)
```
Interactive parsing
Use the parsetoken function to parse a string with a parser.
```julia
s = "ἀνθρώπῳ"
parses = parsetoken(s, p)
```
```
1-element Vector{CitableParserBuilder.Analysis}:
 CitableParserBuilder.Analysis("ἀνθρώπῳ", lsj.n8909, forms.2010001300, nounstems.n8909, nouninfl.os_ou3, "ἀνθρωπῳ", "a")
```
The result is a Vector of analyses. Each Analysis includes a morphological form object and an identifier for a lexeme (or vocabulary item). You can use the greekForm and lexemeurn functions to extract these from an Analysis; for a human-readable string value, apply the label function to the result. For example, label the first form in the result:
```julia
parses[1] |> greekForm |> label
```
```
"noun: masculine dative singular"
```
or use Julia broadcasting to label the forms of all parses:
```julia
parses .|> greekForm .|> label
```
```
1-element Vector{String}:
 "noun: masculine dative singular"
```
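The dotted pipe used here is general Julia syntax, not something specific to Kanónes: |> applies a function to a single value, while .|> broadcasts the pipe over every element of a collection. A minimal stand-alone illustration with ordinary strings (not Kanónes analyses):

```julia
# `|>` pipes a single value through a function;
# `.|>` broadcasts the pipe over each element of a collection.
word = "ἄνθρωπος"
n = word |> length              # number of characters in one string
strs = ["λόγος", "ἀνθρώπῳ"]
lens = strs .|> length          # character count of each string
```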
Use broadcasting to find URNs for the lexeme from each analysis with the lexemeurn function from the CitableParserBuilder package:
```julia
using CitableParserBuilder
lexemelist = parses .|> lexemeurn
```
```
1-element Vector{LexemeUrn}:
 lsj.n8909
```
To label lexemes, you can attach lemmata drawn from Liddell-Scott’s lexicon. lemmatadict is a convenience function that retrieves these data over the internet.
```julia
lsj = lemmatadict()
```
Use broadcasting to label each lexeme for easier human reading:
```julia
lemmalabel.(lexemelist, dict = lsj)
```
```
1-element Vector{String}:
 "lsj.n8909@ἄνθρωπος"
```
To parse a list of words, use the parsewordlist function. For example, we can split this string into a list of 23 word tokens.
```julia
s = "περὶ πολλοῦ ἂν ποιησαίμην ὦ ἄνδρες τὸ τοιούτους ὑμᾶς ἐμοὶ δικαστὰς περὶ τούτου τοῦ πράγματος γενέσθαι οἷοι ἂν ὑμῖν αὐτοῖς εἴητε τοιαῦτα πεπονθότες"
words = split(s)
```
```
23-element Vector{SubString{String}}:
 "περὶ"
 "πολλοῦ"
 "ἂν"
 "ποιησαίμην"
 "ὦ"
 "ἄνδρες"
 "τὸ"
 "τοιούτους"
 "ὑμᾶς"
 "ἐμοὶ"
 "δικαστὰς"
 "περὶ"
 "τούτου"
 "τοῦ"
 "πράγματος"
 "γενέσθαι"
 "οἷοι"
 "ἂν"
 "ὑμῖν"
 "αὐτοῖς"
 "εἴητε"
 "τοιαῦτα"
 "πεπονθότες"
```
We’ll need a parser with more vocabulary than the literary Greek sample we used above. You can build a small parser with some common vocabulary using the Kanones.coreparser function. We’ll further limit it to include only Attic forms.
```julia
p2 = Kanones.coreparser(repo; atticonly = true)
parses = parsewordlist(words, p2)
```
The result is a list of 23 Vectors of analyses, one for each word we submitted.
If all you want is labelled strings, try this:
```julia
using Markdown
labelledlines = formlabels(words, p2)
mdlines = map(ln -> "1. " * ln, labelledlines)
join(mdlines, "\n") |> Markdown.parse
```
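Since formlabels needs a fully built parser and data set, here is a self-contained sketch of the same list-building idiom, using hypothetical labelled strings in place of formlabels output. (Numbering every line "1." works because Markdown renumbers ordered lists automatically.)

```julia
using Markdown

# Hypothetical labelled strings standing in for the output of formlabels:
labelled = ["περὶ: preposition", "ἄνδρες: noun: masculine vocative plural"]
# Prefix each line with "1. " so Markdown parses it as an ordered list:
numbered = map(ln -> "1. " * ln, labelled)
md = join(numbered, "\n") |> Markdown.parse
```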