You can use a tokenizer to compile a list of the unique token values in a corpus. The tokens are sorted by their frequency in the corpus, most frequent first. Here is the resulting list for the first lines of the Mr. Ed theme song.
using Orthography
using CitableText, CitableCorpus
corpus = CitableTextCorpus([
    CitablePassage(CtsUrn("urn:cts:docstrings:mred.themesong.v1:1"), "A horse is a horse, of course, of course,"),
    CitablePassage(CtsUrn("urn:cts:docstrings:mred.themesong.v1:2"), "And no one can talk to a horse of course,"),
    CitablePassage(CtsUrn("urn:cts:docstrings:mred.themesong.v1:3"), "That is, of course, unless the horse is the famous Mr. Ed."),
])
lexvalues = tokenvalues(corpus, simpleAscii())
18-element Vector{SubString{String}}:
"course"
"horse"
"of"
"is"
"a"
"the"
"That"
"one"
"Ed"
"unless"
"A"
"famous"
"Mr"
"can"
"no"
"to"
"talk"
"And"
By default, the tokenvalues function collects only lexical tokens, but you can filter by any other token type, or pass the value nothing to get a list of all token values regardless of type.
You can also count frequencies of tokens. Like all the other corpus functions, the corpus_histo function counts only lexical tokens by default. To count tokens of all types, pass nothing as the value of the optional filterby parameter.
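To make the idea of a token histogram concrete, here is a minimal sketch in plain Julia that counts and ranks token values for the same three lines. It does not call Orthography.jl: the helper simpletokens is a hypothetical stand-in for a real tokenizer (it treats each run of ASCII letters as one lexical token), and the alphabetical tie-break is an assumption made for this example, not a description of how corpus_histo orders equal counts.

```julia
# Plain-Julia sketch of a token frequency count; no packages required.
lines = [
    "A horse is a horse, of course, of course,",
    "And no one can talk to a horse of course,",
    "That is, of course, unless the horse is the famous Mr. Ed.",
]

# Hypothetical stand-in tokenizer (not part of Orthography.jl):
# each run of ASCII letters counts as one lexical token.
simpletokens(s) = [String(m.match) for m in eachmatch(r"[A-Za-z]+", s)]

# Tally how often each token value occurs across all lines.
counts = Dict{String,Int}()
for line in lines, tok in simpletokens(line)
    counts[tok] = get(counts, tok, 0) + 1
end

# Rank by descending frequency. Ties are broken alphabetically,
# which is an assumption made here to get a stable order.
histo = sort(collect(counts); by = p -> (-p.second, p.first))

# Print the six most frequent token values with their counts.
for (tok, n) in histo[1:6]
    println(rpad(tok, 8), n)
end
```

The variable histo is a vector of token => count pairs, so first.(histo) recovers a frequency-ordered list of unique values comparable to the output of tokenvalues shown earlier.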