Working with analyzed tokens

Published June 14, 2024

Same setup as before: read a corpus, tokenize it, and parse it.

using CitableCorpus, CitableBase

repo = pwd() # assumption: run from the root of the repository; adjust to your local clone
f = joinpath(repo, "test", "data", "gettysburgcorpus.cex")
corpus = fromcex(f, CitableTextCorpus, FileReader)

using Orthography
tc = tokenizedcorpus(corpus, simpleAscii())

using CitableParserBuilder
parser = CitableParserBuilder.gettysburgParser()
parsed = parsecorpus(tc, parser)
parsed
Collection of 1506 analyzed tokens.
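
As a quick sanity check, you can look for tokens the parser failed to analyze. This is a sketch, assuming the collection exposes its analyzed tokens in an analyses vector, and that each analyzed token likewise has an analyses vector of morphological analyses (both field names are assumptions, not documented on this page):

# Assumption: `parsed.analyses` is a Vector of analyzed tokens, and each
# analyzed token has its own `analyses` vector of morphological analyses.
unanalyzed = filter(at -> isempty(at.analyses), parsed.analyses)
length(unanalyzed)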

Lexemes

Get a list of unique lexeme identifiers for all parsed tokens.

lexemelist = lexemes(parsed)
length(lexemelist)
154
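
To see what the identifiers look like, peek at the first few entries (plain Julia, no package-specific calls):

# Show a small sample of lexeme identifiers.
first(lexemelist, 5)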

For a given lexeme, find all surface forms appearing in the corpus. The lexeme “gburglex.and” appears in only one surface form, “and”.

stringsforlexeme(parsed, "gburglex.and")
1-element Vector{AbstractString}:
 "and"

Get a dictionary keyed by lexeme that can be used to find all forms of, and all passages for, a given lexeme. It has the same length as the list of lexemes, which serve as its keys.

ortho = simpleAscii()
tokenindex = corpusindex(corpus, ortho)
lexdict = lexemedictionary(parsed, tokenindex)
length(lexdict)

Each entry in the dictionary is a further dictionary mapping surface forms to passages.

lexdict["gburglex.and"]
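
Since each value is itself a dictionary, you can walk it like any Julia Dict. This sketch assumes the inner values are vectors of passage references (their exact type is not shown on this page):

# Assumption: each inner value is a Vector of passage references.
for (form, psgs) in lexdict["gburglex.and"]
    println(form, ": ", length(psgs), " passage(s)")
end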