using CitableCorpus, CitableBase
f = joinpath(repo,"test","data","gettysburgcorpus.cex")
corpus = fromcex(f, CitableTextCorpus, FileReader)
using Orthography
tc = tokenizedcorpus(corpus, simpleAscii())
using CitableParserBuilder
parser = CitableParserBuilder.gettysburgParser()
Working with analyzed tokens
Same setup as before: read the corpus, tokenize it, and parse it.
parsed = parsecorpus(tc, parser)
parsed
Collection of 1506 analyzed tokens.
Lexemes
Get a list of unique lexeme identifiers for all parsed tokens.
lexemelist = lexemes(parsed)
length(lexemelist)
154
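As a rough illustration of what a list of unique lexeme identifiers amounts to, the following sketch uses plain Julia tuples in place of the package's analyzed tokens. The tuple layout, the sample data, and the name `lexemeids` are assumptions made for illustration only; `lexemes` itself may be implemented differently.

```julia
# Hypothetical stand-in for analyzed tokens: (surface form, lexeme identifier).
# Illustrative data only; these are not the package's own types or results.
analyses = [
    ("and", "gburglex.and"),
    ("nation", "gburglex.nation"),
    ("nations", "gburglex.nation"),
    ("and", "gburglex.and"),
]

# Pull out the lexeme identifier of each pair, then drop repeats.
# `unique` keeps the first occurrence of each value, in order.
lexemeids = unique(last.(analyses))

length(lexemeids)  # number of distinct lexemes in the sample
```

Because several tokens can share a lexeme, the list of unique identifiers is normally much shorter than the collection of analyzed tokens.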
For a given lexeme, find all surface forms appearing in the corpus. The lexeme "gburglex.and" appears in only one form, "and".
stringsforlexeme(parsed, "gburglex.and")
1-element Vector{AbstractString}:
"and"
Get a dictionary keyed by lexeme that can be used to find all forms and all passages for a given lexeme. It has the same length as the list of lexemes, since the lexemes are its keys.
Note: the following example is currently broken.
ortho = simpleAscii() # hide
tokenindex = corpusindex(corpus, ortho)
lexdict = lexemedictionary(parsed, tokenindex)
length(lexdict)
Each entry in the dictionary is a further dictionary mapping surface forms to passages.
lexdict["gburglex.and"]
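The two-level structure described above (lexeme, then surface form, then passages) can be sketched in plain Julia. This is a minimal illustration under stated assumptions: the record layout, the placeholder passage strings, and the names `nestedindex` and `sketchdict` are all hypothetical, and the sketch does not reproduce how `lexemedictionary` is actually implemented.

```julia
# Hypothetical records: (lexeme identifier, surface form, passage reference).
# The passage strings are made-up placeholders, not real references from the corpus.
records = [
    ("gburglex.and", "and", "passage.1"),
    ("gburglex.nation", "nation", "passage.1"),
    ("gburglex.and", "and", "passage.2"),
    ("gburglex.nation", "nations", "passage.3"),
]

# Outer dictionary: lexeme => inner dictionary.
# Inner dictionary: surface form => passages where that form occurs.
function nestedindex(recs)
    idx = Dict{String, Dict{String, Vector{String}}}()
    for (lexeme, form, passage) in recs
        # Fetch the inner dictionary for this lexeme, creating it if absent.
        forms = get!(idx, lexeme) do
            Dict{String, Vector{String}}()
        end
        # Append the passage to the list for this surface form.
        push!(get!(forms, form, String[]), passage)
    end
    idx
end

sketchdict = nestedindex(records)
length(sketchdict)          # one key per distinct lexeme
sketchdict["gburglex.and"]  # surface forms mapped to passages
```

In this layout, the keys of an inner dictionary give all surface forms of a lexeme, and the values give the passages for each form.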