Tokenization in depth

Published June 7, 2024

Warning

Just a dump of notes here: contents TBA.

Example

using Orthography
ortho = simpleAscii()

The value returned is a SimpleAscii orthographic system:

SimpleAscii("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,-:;!?'\"()[] \t\n", TokenCategory[LexicalToken(), NumericToken(), PunctuationToken()])

Like any orthographic system, it:

  • explicitly defines a set of token types
  • explicitly defines a character set (a set of code points) that can compose tokens
  • uses the semantics of the character set to parse text into a sequence of tokens, each with an associated token type (as the sketch below illustrates)
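
For example, tokenizing a plain string with this orthography classifies each token. This is a minimal sketch: the input string is arbitrary, and the classifications shown in the comments are illustrative.

using Orthography
ortho = simpleAscii()
tokens = tokenize("Call me Ishmael.", ortho)
# Each resulting token pairs a string value with one of the
# orthography's token categories, e.g.:
#   "Call"    -> LexicalToken
#   "me"      -> LexicalToken
#   "Ishmael" -> LexicalToken
#   "."       -> PunctuationToken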

The tokenizing functionality can be applied to plain strings, citable text passages, or entire citable corpora. At the corpus level, it can be used to index a corpus by token, to compile token lists, to compute token histograms, and to generate a new corpus citable at the level of the token.
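
As one illustration, a token histogram can be compiled by tokenizing the text of each passage. This is a minimal sketch assuming a CitableTextCorpus named cn (as in the examples in the next section); the passages and text fields come from CitableCorpus.jl, and the text field on each token is an assumption about Orthography.jl's token type.

using Orthography
ortho = simpleAscii()
# Count occurrences of each distinct token string in the corpus:
histogram = Dict{String,Int}()
for psg in cn.passages
    for tkn in tokenize(psg.text, ortho)
        histogram[tkn.text] = get(histogram, tkn.text, 0) + 1
    end
end
# List the most frequent tokens first:
sort(collect(histogram); by = last, rev = true)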

Specifying resulting URNs

When you tokenize citable content (either CitablePassages or a CitableTextCorpus), you can include optional parameters specifying how the resulting tokenized content is cited:

  • edition will be used as the value of the version identifier
  • exemplar will be used as the value of the exemplar identifier

You may include either or neither. If neither is specified, the resulting URNs are cited at the version level, with a version identifier composed of the source version identifier concatenated with _tokens.
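
For instance, if the source version identifier were ed1 (a hypothetical value), the default would yield token URNs along these lines. The sketch uses CitableText.jl's CtsUrn; appending a token index to the passage reference is an assumption about how the tokenized corpus is cited.

using CitableText
# Source passage, cited at the version level:
source = CtsUrn("urn:cts:demo:group.work.ed1:1.1")
# First token of that passage in the tokenized corpus:
token1 = CtsUrn("urn:cts:demo:group.work.ed1_tokens:1.1.1")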

For example, to set the version identifier explicitly:

labellededition = tokenize(cn, orthography, edition = "special_edition_id")
labellededition[1]

Or to add an exemplar identifier to the source version:

labelledexemplars = tokenize(cn, orthography, exemplar = "tokens")
labelledexemplars[1]
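
Summarizing the three possibilities with the hypothetical ed1 source version, the resulting URNs follow these patterns (all values illustrative):

  • edition = "special_edition_id" replaces the version identifier: urn:cts:demo:group.work.special_edition_id:1.1.1
  • exemplar = "tokens" keeps the source version and adds an exemplar identifier: urn:cts:demo:group.work.ed1.tokens:1.1.1
  • neither appends _tokens to the source version identifier: urn:cts:demo:group.work.ed1_tokens:1.1.1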