using Orthography
orthography = simpleAscii()
Tokenization
Orthographies allow you to break up a continuous passage of text into a series of tokens. The examples on this page use SimpleAscii, an orthography for a basic alphabetic subset of the ASCII character set.
Tokenization parses a string value into a sequence of classified substrings. You can see the types of tokens that an orthography recognizes with the tokentypes function.
tokentypes(orthography)
3-element Vector{TokenCategory}:
LexicalToken()
NumericToken()
PunctuationToken()
Whenever the token value is a valid token in the orthographic system, the classification will be one of these enumerated token types.
Tokenizing strings
Tokenize a string with the tokenize function.
= "Four score and seven years ago..."
s = tokenize(s, orthography) tokens
7-element Vector{OrthographicToken}:
OrthographicToken("Four", LexicalToken())
OrthographicToken("score", LexicalToken())
OrthographicToken("and", LexicalToken())
OrthographicToken("seven", LexicalToken())
OrthographicToken("years", LexicalToken())
OrthographicToken("ago", LexicalToken())
OrthographicToken("...", PunctuationToken())
The result is a vector of OrthographicTokens. You can find the text content of a token with the tokentext function.
tokens[1] |> tokentext
"Four"
The tokencategory function tells you its type.
tokens .|> tokencategory
7-element Vector{TokenCategory}:
LexicalToken()
LexicalToken()
LexicalToken()
LexicalToken()
LexicalToken()
LexicalToken()
PunctuationToken()
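Since every classification comes from the orthography's enumerated types, you can sanity-check a tokenization against tokentypes. A minimal sketch, using only the variables defined above together with Julia's Base all function:

all(c -> c in tokentypes(orthography), tokens .|> tokencategory)
true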
A common pattern is to filter the token list to include only tokens of a particular type, e.g., lexical tokens for further analysis (such as morphological parsing).
lextokens = filter(t -> tokencategory(t) isa LexicalToken, tokens)
6-element Vector{OrthographicToken}:
OrthographicToken("Four", LexicalToken())
OrthographicToken("score", LexicalToken())
OrthographicToken("and", LexicalToken())
OrthographicToken("seven", LexicalToken())
OrthographicToken("years", LexicalToken())
OrthographicToken("ago", LexicalToken())
You can use Julia broadcasting to extract the text value of all the lexical tokens.
vocab = lextokens .|> tokentext
6-element Vector{SubString{String}}:
"Four"
"score"
"and"
"seven"
"years"
"ago"
Tokenizing citable texts
The tokenize function is also aware of the structures of citable texts defined in the CitableCorpus package. In addition to tokenizing string values, you can tokenize a CitablePassage or a CitableTextCorpus.
You can learn about citable text corpora and the CitableCorpus package at https://neelsmith.quarto.pub/citablecorpus/
Citable passages
When you tokenize a CitablePassage, the result resembles an OrthographicToken in that it includes a category for each token. Instead of a simple text value for the token, however, the category is paired with a new CitablePassage. The text value of the passage is the text of the single token. Its URN uniquely identifies the token with a reference one level of citation deeper than the original passage.
using CitableText, CitableCorpus
psgurn = CtsUrn("urn:cts:orthodocs:tokenization.docs.v1:sample")
cn = CitablePassage(psgurn, s)
tokenizedpassages = tokenize(cn, orthography)
7-element Vector{CitableToken}:
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.1> Four (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.2> score (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.3> and (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.4> seven (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.5> years (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.6> ago (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.6a> ... (PunctuationToken)
The urn function gives you this new, token-level URN.
tokenizedpassages[1] |> urn
urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.1
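If you want just the passage reference from that URN, the CitableText package (already loaded above) provides accessors for CtsUrn components; a sketch assuming its passagecomponent function:

tokenizedpassages[1] |> urn |> passagecomponent
"sample.1"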
To get the text and type of the token, use the same functions you used with OrthographicTokens.
tokenizedpassages[1] |> tokentext
"Four"
tokenizedpassages[1] |> tokencategory
LexicalToken()
If you prefer to work directly with the embedded CitablePassage object, use the passage function.
tokenizedpassages[1] |> passage
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.1> Four
A common idiom is to extract a new collection of citable passages.
tokenizedpassages .|> passage
7-element Vector{CitablePassage}:
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.1> Four
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.2> score
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.3> and
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.4> seven
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.5> years
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.6> ago
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.6a> ...
You could construct a new citable corpus from this list.
tokenizedpassages .|> passage |> CitableTextCorpus
Corpus with 7 citable passages in 1 documents.
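Combining this idiom with the filtering pattern from earlier, you could also build a corpus restricted to lexical tokens. A sketch using only functions introduced above (the name lexicalpassages is just illustrative):

lexicalpassages = filter(t -> tokencategory(t) isa LexicalToken, tokenizedpassages)
lexicalpassages .|> passage |> CitableTextCorpus

Given the sample text, this should yield a corpus of six citable passages.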
A citable text corpus
If you tokenize a CitableTextCorpus, you get the same kind of pairing of citable nodes with token categories as when you tokenize an individual CitablePassage.
corpus = CitableTextCorpus([cn])
tokenizedcorpus = tokenize(corpus, orthography)
7-element Vector{CitableToken}:
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.1> Four (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.2> score (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.3> and (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.4> seven (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.5> years (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.6> ago (LexicalToken)
<urn:cts:orthodocs:tokenization.docs.v1_tokens:sample.6a> ... (PunctuationToken)
If your text corpus contains only a single node, the result will therefore be identical to tokenizing that node separately.
tokenizedcorpus == tokenizedpassages
true