using CitableCorpus, CitableBase
f = joinpath(repo,"test","data","gettysburgcorpus.cex")
corpus = fromcex(f, CitableTextCorpus, FileReader)
using Orthography
tc = tokenizedcorpus(corpus, simpleAscii())
using CitableParserBuilder
parser = CitableParserBuilder.gettysburgParser()Working with analyzed tokens
Same set up as before: read corpus, tokenize, parse.
parsed = parsecorpus(tc, parser)parsedCollection of 1506 analyzed tokens.
Lexemes
Get a list of unique lexeme identifiers for all parsed tokens.
lexemelist = lexemes(parsed)
length(lexemelist)154
For a given lexeme, find all surface forms appearing in the corpus. The lexeme “gburglex.and” appears in only one form, and.
stringsforlexeme(parsed, "gburglex.and" )1-element Vector{AbstractString}:
"and"
Get a dictionary keyed by lexeme that can be used to find all forms, and all passages for a given lexeme. It will have the same length as the list of lexemes, which are its keys.
THIS IS BORKEN:
ortho = simpleAscii() # hide
tokenindex = corpusindex(corpus, ortho)
lexdict = lexemedictionary(parsed, tokenindex)
length(lexdict)
Each entry in the dictionary is a further dictionary mapping surface forms to passages.
lexdict["gburglex.and"]