using Orthography
orthography = simpleAscii()

Tokenization
Orthographies allow you to break up a continuous passage of text into a series of tokens. The examples on this page use SimpleAscii, an orthography for a basic alphabetic subset of the ASCII character set.
Tokenization parses a string value into a sequence of classified substrings. You can see the types of tokens that an orthography recognizes with the tokentypes function.
tokentypes(orthography)
3-element Vector{TokenCategory}:
LexicalToken()
NumericToken()
PunctuationToken()
Whenever the token value is a valid token in the orthographic system, the classification will be one of these enumerated token types.
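For instance, tokenizing a short phrase containing punctuation shows each token classified with one of the categories listed above. This is a sketch using only the functions introduced on this page; the sample string is arbitrary.

```julia
using Orthography

orthography = simpleAscii()
# Punctuation marks like "," and "!" should be classified as
# PunctuationToken, and the words as LexicalToken.
toks = tokenize("Hello, world!", orthography)
categories = tokencategory.(toks)
```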
Tokenizing strings
Tokenize a string with the tokenize function.
s = "Four score and seven years ago..."
tokens = tokenize(s, orthography)
7-element Vector{OrthographicToken}:
OrthographicToken("Four", LexicalToken())
OrthographicToken("score", LexicalToken())
OrthographicToken("and", LexicalToken())
OrthographicToken("seven", LexicalToken())
OrthographicToken("years", LexicalToken())
OrthographicToken("ago", LexicalToken())
OrthographicToken("...", PunctuationToken())
The result is a vector of OrthographicTokens. You can find the text content of a token with the tokentext function.
tokens[1] |> tokentext
"Four"
The tokencategory function tells you a token's category.
tokens .|> tokencategory
7-element Vector{TokenCategory}:
LexicalToken()
LexicalToken()
LexicalToken()
LexicalToken()
LexicalToken()
LexicalToken()
PunctuationToken()
A common pattern is to filter a list of tokens to include only tokens of a particular type, e.g., lexical tokens for further analysis (such as morphological parsing).
lextokens = filter(t -> tokencategory(t) isa LexicalToken, tokens)
6-element Vector{OrthographicToken}:
OrthographicToken("Four", LexicalToken())
OrthographicToken("score", LexicalToken())
OrthographicToken("and", LexicalToken())
OrthographicToken("seven", LexicalToken())
OrthographicToken("years", LexicalToken())
OrthographicToken("ago", LexicalToken())
You can use Julia broadcasting to extract the text value of all the lexical tokens.
vocab = lextokens .|> tokentext
6-element Vector{SubString{String}}:
"Four"
"score"
"and"
"seven"
"years"
"ago"
Tokenizing citable texts
The tokenize function is also aware of the structures of citable texts defined in the CitableCorpus package. In addition to tokenizing string values, you can tokenize a CitablePassage or a CitableTextCorpus.
You can learn about citable text corpora and the CitableCorpus package at https://neelsmith.quarto.pub/citablecorpus/
Citable passages
When you tokenize a CitablePassage, the result resembles an OrthographicToken in that it includes a category for each token. Instead of a simple text value for the token, however, the category is paired with a new CitablePassage. The text value of the passage is the text of the single token, and its URN uniquely identifies it with a reference one level of citation deeper than the original passage. (Note in the example below that the punctuation token following "ago" is cited as sample.6a, distinguishing it from the lexical token sample.6.)
using CitableText, CitableCorpus
psgurn = CtsUrn("urn:cts:orthodocs:tokenization.docs.v1:sample")
cn = CitablePassage(psgurn, s)
tokenizedpassages = tokenize(cn, orthography)
7-element Vector{CitableToken}:
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.1> Four (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.2> score (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.3> and (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.4> seven (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.5> years (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.6> ago (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.6a> ... (PunctuationToken)
The urn function gives you this new, token-level URN.
tokenizedpassages[1] |> urn
urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.1
To get the text and category of the token, use the same functions you used with OrthographicTokens.
tokenizedpassages[1] |> tokentext
"Four"
tokenizedpassages[1] |> tokencategory
LexicalToken()
If you prefer to get the citable passage as a CitablePassage object, use the passage function.
tokenizedpassages[1] |> passage
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.1> Four
A common idiom is to extract a new collection of citable passages.
tokenizedpassages .|> passage
7-element Vector{CitablePassage}:
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.1> Four
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.2> score
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.3> and
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.4> seven
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.5> years
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.6> ago
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.6a> ...
You could construct a new citable corpus from this list.
tokenizedpassages .|> passage |> CitableTextCorpus
Corpus with 7 citable passages in 1 documents.
A citable text corpus
If you tokenize a CitableTextCorpus, you get the same kind of pairing of citable nodes with token categories as when you tokenize an individual CitablePassage.
corpus = CitableTextCorpus([cn])
tokenizedcorpus = tokenize(corpus, orthography)
7-element Vector{CitableToken}:
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.1> Four (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.2> score (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.3> and (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.4> seven (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.5> years (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.6> ago (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.6a> ... (PunctuationToken)
If your text corpus has only a single node, the results will therefore be equal to tokenizing that node separately.
tokenizedcorpus == tokenizedpassages
true
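Because tokencategory works on CitableTokens just as it does on OrthographicTokens, the filtering idiom shown earlier applies here too. A sketch, continuing from the tokenizedcorpus variable defined above:

```julia
# Keep only the lexical tokens from the tokenized corpus,
# e.g., as input to morphological parsing.
lexicaltokens = filter(t -> tokencategory(t) isa LexicalToken, tokenizedcorpus)
```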