Parsing citable texts

Published

June 14, 2024

An example using the CitableCorpus and Orthography packages

The CitablePassage type from the Julia CitableCorpus package represents a passage of citable text with a URN identifier and a string value.

using CitableText, CitableCorpus
s = "Four score and seven years ago..."
psgurn = CtsUrn("urn:cts:parserdocs:example.docs.v1:1")
cpsg = CitablePassage(psgurn, s)
<urn:cts:parserdocs:example.docs.v1:1> Four score and seven years ago...

The tokenize function in the Julia Orthography package includes a method for tokenizing a CitablePassage. It returns a Vector of CitableTokens.

Citable tokens

See this tutorial for a hands-on introduction to tokenizing citable texts with the Orthography package.

using Orthography
orthosystem = simpleAscii()
tokenizedpassages = tokenize(cpsg, orthosystem)
7-element Vector{CitableToken}:
 <urn:cts:parserdocs:example.docs.v1_tokens:1.1> Four (LexicalToken)
 <urn:cts:parserdocs:example.docs.v1_tokens:1.2> score (LexicalToken)
 <urn:cts:parserdocs:example.docs.v1_tokens:1.3> and (LexicalToken)
 <urn:cts:parserdocs:example.docs.v1_tokens:1.4> seven (LexicalToken)
 <urn:cts:parserdocs:example.docs.v1_tokens:1.5> years (LexicalToken)
 <urn:cts:parserdocs:example.docs.v1_tokens:1.6> ago (LexicalToken)
 <urn:cts:parserdocs:example.docs.v1_tokens:1.6a> ... (PunctuationToken)

Each citable token defines a new citable passage whose text value is a single token.

tokenizedpassages .|> passage
7-element Vector{CitablePassage}:
 <urn:cts:parserdocs:example.docs.v1_tokens:1.1> Four
 <urn:cts:parserdocs:example.docs.v1_tokens:1.2> score
 <urn:cts:parserdocs:example.docs.v1_tokens:1.3> and
 <urn:cts:parserdocs:example.docs.v1_tokens:1.4> seven
 <urn:cts:parserdocs:example.docs.v1_tokens:1.5> years
 <urn:cts:parserdocs:example.docs.v1_tokens:1.6> ago
 <urn:cts:parserdocs:example.docs.v1_tokens:1.6a> ...

The tokenizer also extends the canonical citation of the text so that it can refer to individual tokens. The URN of the entire passage has a passage component with a single level of citation (1); the tokens are cited at two levels (1.1, 1.2, etc.), and the version identifier in their URNs gains a _tokens suffix (v1 becomes v1_tokens).
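The pattern visible in the URNs above can be sketched with plain string handling. This is an illustration only, not how the Orthography package builds its URNs: the function names tokenurn and citationdepth are hypothetical, and the "_tokens" suffix is simply copied from the output shown above.

```julia
# Illustration only: plain string handling, not the Orthography implementation.

# Build a token-level URN string from a passage-level CTS URN string.
# `tokenref` is the token's reference within the passage (e.g. "3" or "6a");
# `suffix` is appended to the version identifier (assumed from the output above).
function tokenurn(passageurn::AbstractString, tokenref::AbstractString; suffix = "_tokens")
    parts = split(passageurn, ":")
    # A CTS URN with a passage has five colon-delimited parts:
    # "urn", "cts", namespace, work hierarchy, passage reference.
    length(parts) == 5 || throw(ArgumentError("expected a CTS URN with a passage component"))
    work = string(parts[4], suffix)
    psg = string(parts[5], ".", tokenref)
    join((parts[1], parts[2], parts[3], work, psg), ":")
end

# Count the citation levels in the passage component of a CTS URN string.
citationdepth(urn::AbstractString) = length(split(last(split(urn, ":")), "."))

base = "urn:cts:parserdocs:example.docs.v1:1"
tok = tokenurn(base, "3")
println(tok)                  # urn:cts:parserdocs:example.docs.v1_tokens:1.3
println(citationdepth(base))  # 1
println(citationdepth(tok))   # 2
```

Passing "6a" as the token reference reproduces the form used for the trailing punctuation token (1.6a).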

Each citable token is also assigned a token category, which the tokencategory function returns.

tokenizedpassages .|> tokencategory
7-element Vector{TokenCategory}:
 LexicalToken()
 LexicalToken()
 LexicalToken()
 LexicalToken()
 LexicalToken()
 LexicalToken()
 PunctuationToken()
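
Token categories make it straightforward to select only some tokens for further processing. The following is a minimal sketch, assuming that LexicalToken is exported by Orthography as a type (the output above suggests it is) and that the packages used earlier are installed. The variable names lexical and lexicalpassages are chosen here for illustration.

```julia
using CitableText, CitableCorpus, Orthography

# Rebuild the passage and token list from the earlier examples.
cpsg = CitablePassage(
    CtsUrn("urn:cts:parserdocs:example.docs.v1:1"),
    "Four score and seven years ago..."
)
tokens = tokenize(cpsg, simpleAscii())

# Keep only the tokens whose category is a LexicalToken
# (assumes LexicalToken is an exported type).
lexical = filter(t -> tokencategory(t) isa LexicalToken, tokens)

# Citable passages for the lexical tokens only, using the same
# broadcast-pipe idiom as the examples above.
lexicalpassages = lexical .|> passage
println(length(lexicalpassages))
```

Given the categories listed above, this should keep the six word tokens and drop the trailing punctuation token.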