Working with citable corpora

Published

June 7, 2024

Token lists

You can use a tokenizer to compile a list of the unique token values in a corpus, sorted by their frequency in the corpus. The example below builds that list for the first lines of the Mr. Ed theme song, then selects its first four tokens.

using Orthography
using CitableText, CitableCorpus
corpus = CitableTextCorpus([
        CitablePassage(CtsUrn("urn:cts:docstrings:mred.themesong.v1:1"),"A horse is a horse, of course, of course,"),
        CitablePassage(CtsUrn("urn:cts:docstrings:mred.themesong.v1:2"),"And no one can talk to a horse of course,"),
        CitablePassage(CtsUrn("urn:cts:docstrings:mred.themesong.v1:3"),"That is, of course, unless the horse is the famous Mr. Ed."),
])
lexvalues = tokenvalues(corpus, simpleAscii())

18-element Vector{SubString{String}}:
 "course"
 "horse"
 "of"
 "is"
 "a"
 "the"
 "That"
 "one"
 "Ed"
 "unless"
 "A"
 "famous"
 "Mr"
 "can"
 "no"
 "to"
 "talk"
 "And"

lexvalues[1:4]

4-element Vector{SubString{String}}:
 "course"
 "horse"
 "of"
 "is"
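
To make the idea of a frequency-sorted list of unique values concrete, here is a minimal sketch in plain Julia that does not use the Orthography or CitableCorpus packages. The names `lines`, `lexical_tokens`, and `unique_by_frequency` are hypothetical, and treating every run of ASCII letters as a lexical token is an assumption, not the library's tokenizer.

```julia
# Plain-Julia sketch; these names are hypothetical and not part of
# Orthography or CitableCorpus.
lines = [
    "A horse is a horse, of course, of course,",
    "And no one can talk to a horse of course,",
    "That is, of course, unless the horse is the famous Mr. Ed.",
]

# Assumption: a lexical token is any run of ASCII letters.
lexical_tokens(text) = [String(m.match) for m in eachmatch(r"[A-Za-z]+", text)]

function unique_by_frequency(texts)
    counts = Dict{String, Int}()
    for text in texts
        for tkn in lexical_tokens(text)
            counts[tkn] = get(counts, tkn, 0) + 1
        end
    end
    # Most frequent first. Ties are broken alphabetically here, which
    # need not match the ordering the library uses for ties.
    sort(collect(keys(counts)); by = t -> (-counts[t], t))
end

values_by_freq = unique_by_frequency(lines)
```

Because ties are broken alphabetically in this sketch, the order of tokens with equal counts may differ from the order in the list above.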

By default, the tokenvalues function collects only lexical tokens. You can filter by any other token type instead, or pass nothing as the filterby parameter to get a list of all token values.

allvalues = tokenvalues(corpus, simpleAscii(); filterby = nothing)
allvalues[1:4]

Token histograms

You can also count frequencies of tokens. Like all the other corpus functions, the corpus_histo function counts only lexical tokens by default. To count tokens of every type, pass nothing as the value of the optional filterby parameter.

histo_all = corpus_histo(corpus, simpleAscii(); filterby = nothing)

OrderedCollections.OrderedDict{AbstractString, Int64} with 20 entries:
  ","      => 6
  "course" => 4
  "horse"  => 4
  "of"     => 4
  "is"     => 3
  "a"      => 2
  "."      => 2
  "the"    => 2
  "That"   => 1
  "one"    => 1
  "Ed"     => 1
  "unless" => 1
  "A"      => 1
  "famous" => 1
  "Mr"     => 1
  "can"    => 1
  "no"     => 1
  "to"     => 1
  "talk"   => 1
  "And"    => 1

histo_all["course"]

There are lots of commas.

histo_all[","]

Optionally, you can pass a token type as the filterby parameter to limit the results. If we consider only lexical tokens, we should get the same count for “course”.

histo_lex = corpus_histo(corpus, simpleAscii(); filterby = LexicalToken())
histo_lex["course"]

But punctuation tokens will not be part of the histogram.

haskey(histo_lex,",")

false
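
The effect of the filter can be illustrated with a small sketch in plain Julia. It is not the implementation of corpus_histo: the names `classified_tokens` and `token_histogram` are hypothetical, the token kinds are represented by the symbols `:lexical` and `:punctuation` rather than by the library's token types, and the regular expression is an assumed stand-in for the simpleAscii tokenizer.

```julia
# Plain-Julia sketch; names and token kinds are hypothetical stand-ins.
lines = [
    "A horse is a horse, of course, of course,",
    "And no one can talk to a horse of course,",
    "That is, of course, unless the horse is the famous Mr. Ed.",
]

# Pair each token with a kind: runs of letters are lexical,
# single punctuation marks are punctuation (an assumption).
function classified_tokens(text)
    [(String(m.match), all(isletter, m.match) ? :lexical : :punctuation)
     for m in eachmatch(r"[A-Za-z]+|[.,;:!?]", text)]
end

# Count tokens, keeping only the requested kind unless filterby is nothing.
function token_histogram(texts; filterby = :lexical)
    counts = Dict{String, Int}()
    for text in texts
        for (tkn, kind) in classified_tokens(text)
            if filterby === nothing || kind == filterby
                counts[tkn] = get(counts, tkn, 0) + 1
            end
        end
    end
    counts
end

all_counts = token_histogram(lines; filterby = nothing)
lex_counts = token_histogram(lines)
```

With this sketch, `all_counts` has an entry for the comma while `lex_counts` does not, and both give the same count for "course".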

Tokenized editions

You can use an orthography’s tokenizer to create a new corpus, citable at the level of the token.

tokenized = tokenizedcorpus(corpus, simpleAscii())

Corpus with 39 citable passages in 1 documents.
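
As a rough illustration of what citing at the level of the token means, the following plain-Julia sketch derives one identifier per lexical token. The function `token_citations` is hypothetical, and the identifier scheme (appending `_tokens` to the version and a token number to the passage reference) is inferred from the URNs in the index output shown later on this page. The sketch numbers lexical tokens only, so it yields fewer entries than the 39 passages reported above, which also include punctuation tokens.

```julia
# Plain-Julia sketch; URNs are handled as plain strings and
# `token_citations` is a hypothetical helper, not a library function.
passages = [
    ("urn:cts:docstrings:mred.themesong.v1:1", "A horse is a horse, of course, of course,"),
    ("urn:cts:docstrings:mred.themesong.v1:2", "And no one can talk to a horse of course,"),
    ("urn:cts:docstrings:mred.themesong.v1:3", "That is, of course, unless the horse is the famous Mr. Ed."),
]

function token_citations(urn, text)
    parts = split(urn, ":")
    # First four components identify the text; the fifth is the passage reference.
    base = join(parts[1:4], ":") * "_tokens"
    ref = parts[5]
    # Assumption: lexical tokens are runs of ASCII letters, numbered from 1.
    [("$(base):$(ref).$(i)", String(m.match))
     for (i, m) in enumerate(eachmatch(r"[A-Za-z]+", text))]
end

# One (identifier, token) pair per lexical token in the whole corpus.
token_level = vcat([token_citations(urn, text) for (urn, text) in passages]...)
```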

Token index

You can index the tokens of a corpus. The result is a dictionary keyed by token strings, where each key yields the list of CTS URNs that cite the token's occurrences in the tokenized edition.

idx = corpusindex(corpus, simpleAscii())

18-element Dictionaries.Dictionary{SubString{String}, Vector{CtsUrn}}
      "A" │ CtsUrn[urn:cts:docstrings:mred.themesong.v1_tokens:1.1]
  "horse" │ CtsUrn[urn:cts:docstrings:mred.themesong.v1_tokens:1.2, urn:cts:doc…
     "is" │ CtsUrn[urn:cts:docstrings:mred.themesong.v1_tokens:1.3, urn:cts:doc…
      "a" │ CtsUrn[urn:cts:docstrings:mred.themesong.v1_tokens:1.4, urn:cts:doc…
     "of" │ CtsUrn[urn:cts:docstrings:mred.themesong.v1_tokens:1.6, urn:cts:doc…
 "course" │ CtsUrn[urn:cts:docstrings:mred.themesong.v1_tokens:1.7, urn:cts:doc…
    "And" │ CtsUrn[urn:cts:docstrings:mred.themesong.v1_tokens:2.1]
     "no" │ CtsUrn[urn:cts:docstrings:mred.themesong.v1_tokens:2.2]
    "one" │ CtsUrn[urn:cts:docstrings:mred.themesong.v1_tokens:2.3]
    "can" │ CtsUrn[urn:cts:docstrings:mred.themesong.v1_tokens:2.4]
   "talk" │ CtsUrn[urn:cts:docstrings:mred.themesong.v1_tokens:2.5]
     "to" │ CtsUrn[urn:cts:docstrings:mred.themesong.v1_tokens:2.6]
   "That" │ CtsUrn[urn:cts:docstrings:mred.themesong.v1_tokens:3.1]
 "unless" │ CtsUrn[urn:cts:docstrings:mred.themesong.v1_tokens:3.5]
    "the" │ CtsUrn[urn:cts:docstrings:mred.themesong.v1_tokens:3.6, urn:cts:doc…
 "famous" │ CtsUrn[urn:cts:docstrings:mred.themesong.v1_tokens:3.10]
     "Mr" │ CtsUrn[urn:cts:docstrings:mred.themesong.v1_tokens:3.11]
     "Ed" │ CtsUrn[urn:cts:docstrings:mred.themesong.v1_tokens:3.12]

idx["horse"]

4-element Vector{CtsUrn}:
 urn:cts:docstrings:mred.themesong.v1_tokens:1.2
 urn:cts:docstrings:mred.themesong.v1_tokens:1.5
 urn:cts:docstrings:mred.themesong.v1_tokens:2.8
 urn:cts:docstrings:mred.themesong.v1_tokens:3.7
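
The structure of such an index can be sketched in plain Julia as a dictionary from token strings to lists of references. This is an illustration under stated assumptions, not the implementation of corpusindex: `token_index` is a hypothetical name, URNs are plain strings, and lexical tokens are taken to be runs of ASCII letters numbered within each line.

```julia
# Plain-Julia sketch of an inverted index; `token_index` is hypothetical.
lines = [
    "A horse is a horse, of course, of course,",
    "And no one can talk to a horse of course,",
    "That is, of course, unless the horse is the famous Mr. Ed.",
]

function token_index(texts; base = "urn:cts:docstrings:mred.themesong.v1_tokens")
    index = Dict{String, Vector{String}}()
    for (n, text) in enumerate(texts)
        for (i, m) in enumerate(eachmatch(r"[A-Za-z]+", text))
            # Create the list on first sight of a token, then append the reference.
            push!(get!(index, String(m.match), String[]), "$(base):$(n).$(i)")
        end
    end
    index
end

index = token_index(lines)
horse_refs = index["horse"]
```

Looking up `index["horse"]` in this sketch gives four references, one for each occurrence of the word in the three lines.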