Tokenization

Published June 7, 2024

Orthographies allow you to break up a continuous passage of text into a series of tokens. The examples on this page use SimpleAscii, an orthography for a basic alphabetic subset of the ASCII character set.

using Orthography
orthography = simpleAscii()

Tokenization parses a string value into a sequence of classified substrings. You can see the types of tokens that an orthography recognizes with the tokentypes function.

tokentypes(orthography)
3-element Vector{TokenCategory}:
 LexicalToken()
 NumericToken()
 PunctuationToken()

Whenever a token is valid in the orthographic system, its classification will be one of these enumerated token types.
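For example, a string containing digits should produce a NumericToken alongside lexical and punctuation tokens. A minimal sketch (the sample string is invented, and the exact classification of digit sequences is an assumption about SimpleAscii's rules):

tokenize("Chapter 7 ends.", orthography)
# expect something like:
#  OrthographicToken("Chapter", LexicalToken())
#  OrthographicToken("7", NumericToken())
#  OrthographicToken("ends", LexicalToken())
#  OrthographicToken(".", PunctuationToken())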

Tokenizing strings

Tokenize a string with the tokenize function.

s = "Four score and seven years ago..."
tokens = tokenize(s, orthography)
7-element Vector{OrthographicToken}:
 OrthographicToken("Four", LexicalToken())
 OrthographicToken("score", LexicalToken())
 OrthographicToken("and", LexicalToken())
 OrthographicToken("seven", LexicalToken())
 OrthographicToken("years", LexicalToken())
 OrthographicToken("ago", LexicalToken())
 OrthographicToken("...", PunctuationToken())

The result is a vector of OrthographicTokens. You can find the text content of a token with the tokentext function.

tokens[1] |> tokentext
"Four"

The tokencategory function tells you a token's type.

tokens .|> tokencategory
7-element Vector{TokenCategory}:
 LexicalToken()
 LexicalToken()
 LexicalToken()
 LexicalToken()
 LexicalToken()
 LexicalToken()
 PunctuationToken()
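Since the result is an ordinary Julia vector, you can summarize it with Base functions. For example, counting the lexical tokens in the sample (an added sketch, not part of the Orthography API):

count(t -> tokencategory(t) isa LexicalToken, tokens)
6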

A common pattern is to filter the token list to include only tokens of a particular type, e.g., lexical tokens for further analysis (such as morphological parsing).

lextokens = filter(t -> tokencategory(t) isa LexicalToken, tokens)
6-element Vector{OrthographicToken}:
 OrthographicToken("Four", LexicalToken())
 OrthographicToken("score", LexicalToken())
 OrthographicToken("and", LexicalToken())
 OrthographicToken("seven", LexicalToken())
 OrthographicToken("years", LexicalToken())
 OrthographicToken("ago", LexicalToken())

You can use Julia broadcasting to extract the text value of all the lexical tokens.

vocab = lextokens .|> tokentext
6-element Vector{SubString{String}}:
 "Four"
 "score"
 "and"
 "seven"
 "years"
 "ago"

Tokenizing citable texts

The tokenize function is also aware of the structures of citable texts defined in the CitableCorpus package. In addition to tokenizing string values, you can tokenize a CitablePassage or a CitableTextCorpus.

Tip

You can learn about citable text corpora and the CitableCorpus package at https://neelsmith.quarto.pub/citablecorpus/

Citable passages

When you tokenize a CitablePassage, the result resembles an OrthographicToken in that it includes a category for each token. Instead of a simple text value, however, the category is paired with a new CitablePassage. The text value of that passage is the text of the single token, and its URN uniquely identifies the token with a reference one level of citation deeper than the original passage.

using CitableText, CitableCorpus
psgurn = CtsUrn("urn:cts:orthodocs:tokenization.docs.v1:sample")
cn = CitablePassage(psgurn, s)
tokenizedpassages = tokenize(cn, orthography)
7-element Vector{CitableToken}:
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.1> Four (LexicalToken)
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.2> score (LexicalToken)
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.3> and (LexicalToken)
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.4> seven (LexicalToken)
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.5> years (LexicalToken)
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.6> ago (LexicalToken)
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.6a> ... (PunctuationToken)

The urn function gives you this new, token-level URN.

tokenizedpassages[1] |> urn
urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.1
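Because the token-level URN simply extends the original passage reference, you can work with it using CitableText's generic URN functions. A sketch, assuming CitableText's passagecomponent function (which extracts the passage portion of a CtsUrn):

tokenizedpassages[1] |> urn |> passagecomponent
# expect "sample.1"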

To get the text and type of the token, use the same functions you used with OrthographicTokens.

tokenizedpassages[1] |> tokentext
"Four"
tokenizedpassages[1] |> tokencategory
LexicalToken()
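Because CitableTokens support tokencategory, the filtering idiom from earlier carries over unchanged. For example, to keep only the lexical tokens (a sketch reusing the pattern shown above):

filter(t -> tokencategory(t) isa LexicalToken, tokenizedpassages)
# expect the six lexical CitableTokens, sample.1 through sample.6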

If you prefer to get the citable passage as a CitablePassage object, use the passage function.

tokenizedpassages[1] |> passage
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.1> Four

A common idiom is to broadcast the passage function to get a new collection of citable passages.

tokenizedpassages .|> passage
7-element Vector{CitablePassage}:
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.1> Four
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.2> score
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.3> and
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.4> seven
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.5> years
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.6> ago
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.6a> ...

You could construct a new citable corpus from this list.

tokenizedpassages .|> passage |> CitableTextCorpus
Corpus with 7 citable passages in 1 documents.

A citable text corpus

If you tokenize a CitableTextCorpus, you get the same kind of pairing of citable passages with token categories as when you tokenize an individual CitablePassage.

corpus = CitableTextCorpus([cn])
tokenizedcorpus = tokenize(corpus, orthography)
7-element Vector{CitableToken}:
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.1> Four (LexicalToken)
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.2> score (LexicalToken)
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.3> and (LexicalToken)
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.4> seven (LexicalToken)
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.5> years (LexicalToken)
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.6> ago (LexicalToken)
 <urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.6a> ... (PunctuationToken)

If your text corpus has only a single passage, the result will therefore be equal to tokenizing that passage separately.

tokenizedcorpus == tokenizedpassages
true
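With more than one passage, each passage's tokens are cited under its own reference. A sketch with a second, invented passage (the URN and text here are hypothetical):

psgurn2 = CtsUrn("urn:cts:orthodocs:tokenization.docs.v1:sample2")
cn2 = CitablePassage(psgurn2, "Now we are engaged in a great civil war.")
biggercorpus = CitableTextCorpus([cn, cn2])
tokenize(biggercorpus, orthography)
# expect tokens cited as sample.1, sample.2, ... followed by sample2.1, sample2.2, ...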