Build a parser and parse Greek strings

Published

July 12, 2024

Follow along

To replicate all the steps in this tutorial:
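The steps below assume a local clone of the Kanónes repository and a Julia environment with the two packages this tutorial uses. A minimal setup sketch (the local path assigned to repo is an assumption; adjust it for your machine):

```julia
using Pkg
# Add the two packages this tutorial uses:
Pkg.add("Kanones")
Pkg.add("CitableParserBuilder")

# Point `repo` at the root of a local clone of the Kanónes
# repository (this path is an assumption):
repo = joinpath(homedir(), "Kanones")
```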

Building a parser from local files

You can build a parser from delimited-text files organized in directories following Kanónes’ conventions. In this tutorial, we’ll use the files in the literarygreek-rules directory in the datasets directory of the Kanónes github repository.

If you have a variable named repo with the root directory of the Kanónes repository, then the literarygreek-rules directory will be:

srcdir = joinpath(repo, "datasets", "literarygreek-rules")

Instantiate a data set

You can create a Kanones.FilesDataSet from a list of one or more directories.

using Kanones
kds = dataset([srcdir])
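Because dataset accepts a list of directories, you can also combine several sources into a single data set. A sketch, where the second directory name is purely hypothetical:

```julia
# Combine the literary Greek rules with a second source directory
# (the name "my-extra-rules" is hypothetical):
otherdir = joinpath(repo, "datasets", "my-extra-rules")
combined = dataset([srcdir, otherdir])
```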

Compile a parser

You can then build a parser from a data set.

p = kParser(kds)

Interactive parsing

Use the parsetoken function to parse a string with a parser.

s = "ἀνθρώπῳ"
parses = parsetoken(s, p)
1-element Vector{CitableParserBuilder.Analysis}:
 CitableParserBuilder.Analysis("ἀνθρώπῳ", lsj.n8909, forms.2010001300, nounstems.n8909, nouninfl.os_ou3, "ἀνθρωπῳ", "a")

The result is a Vector of analyses. Each Analysis includes a morphological form object and an identifier for a lexeme (or vocabulary item). You can use the greekForm and lexemeurn functions to extract these from an Analysis; for a human-readable string, apply the label function to the result. For example, label the form of the first parse:

parses[1] |> greekForm |> label
"noun: masculine dative singular"

or use Julia broadcasting to label the forms of all parses:

parses .|> greekForm .|> label
1-element Vector{String}:
 "noun: masculine dative singular"

Use broadcasting to find URNs for the lexeme from each analysis with the lexemeurn function from the CitableParserBuilder package:

using CitableParserBuilder
lexemelist = parses .|> lexemeurn
1-element Vector{LexemeUrn}:
 lsj.n8909

To label lexemes, you can attach lemmata drawn from the Liddell-Scott lexicon. lemmatadict is a convenience function that retrieves these data over the internet.

lsj = lemmatadict()

Use broadcasting to label each lexeme for easier human reading:

lemmalabel.(lexemelist, dict = lsj)
1-element Vector{String}:
 "lsj.n8909@ἄνθρωπος"

To parse a list of words, use the parsewordlist function. For example, we can split this string into a list of 23 word tokens.

s = "περὶ πολλοῦ ἂν ποιησαίμην ὦ ἄνδρες τὸ τοιούτους ὑμᾶς ἐμοὶ δικαστὰς περὶ τούτου τοῦ πράγματος γενέσθαι οἷοι ἂν ὑμῖν αὐτοῖς εἴητε τοιαῦτα πεπονθότες"
words = split(s)
23-element Vector{SubString{String}}:
 "περὶ"
 "πολλοῦ"
 "ἂν"
 "ποιησαίμην"
 "ὦ"
 "ἄνδρες"
 "τὸ"
 "τοιούτους"
 "ὑμᾶς"
 "ἐμοὶ"
 "δικαστὰς"
 "περὶ"
 "τούτου"
 "τοῦ"
 "πράγματος"
 "γενέσθαι"
 "οἷοι"
 "ἂν"
 "ὑμῖν"
 "αὐτοῖς"
 "εἴητε"
 "τοιαῦτα"
 "πεπονθότες"

We’ll need a parser with more vocabulary than the literary Greek sample we used above. You can build a small parser covering some common vocabulary with the Kanones.coreparser function. We’ll further limit it to include only Attic forms.

p2 = Kanones.coreparser(repo; atticonly = true)
parses = parsewordlist(words, p2)

The result is a list of 23 Vectors of analyses, one for each word we submitted.
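Since each token gets its own Vector of analyses, you can gauge how ambiguous each token is by counting its analyses. A sketch, assuming words and p2 from the steps above:

```julia
# One Vector of analyses per token: count analyses to see
# how ambiguous each token is.
parses = parsewordlist(words, p2)
for (w, analyses) in zip(words, parses)
    println(w, " => ", length(analyses), " analyses")
end
```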

If all you want is labelled strings, try this:

labelledlines = formlabels(words, p2)
using Markdown
mdlines = map(ln -> "1. " * ln, labelledlines)
join(mdlines, "\n") |> Markdown.parse