```julia
using Orthography
ortho = simpleAscii()
```

```
SimpleAscii("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,-:;!?'\"()[] \t\n", TokenCategory[LexicalToken(), NumericToken(), PunctuationToken()])
```
⚠️ Draft documentation in progress: package version 0.22.0, June 7, 2024. Just a dump of notes here: contents TBA.
The tokenizing functionality can be applied to strings of text, citable text passages, or entire citable corpora. For a corpus, it can be used to index the text by token, compile token lists, compute token histograms, and generate a new corpus citable at the level of the individual token.
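As a minimal sketch of the simplest case, tokenizing a plain string: this assumes `tokenize` accepts a string with the same content-first argument order shown in the corpus examples below, and that each result pairs a token's text with its category (lexical, numeric, or punctuation).

```julia
using Orthography

ortho = simpleAscii()

# Tokenize a plain string with the simple-ASCII orthography.
# Each element of the result should identify a token and its
# category (lexical, numeric, or punctuation).
tkns = tokenize("42 apples, please!", ortho)
for t in tkns
    println(t)
end
```

With this orthography, "42" would be classified as a numeric token, "apples" and "please" as lexical tokens, and the comma and exclamation point as punctuation.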
When you are tokenizing citable content (either `CitablePassage`s or a `CitableTextCorpus`), you can include optional parameters to specify the form of the citable tokenized content:

- `edition` will be used as the value of the version identifier
- `exemplar` will be used as the value of the exemplar identifier

You may include either or neither. If neither is specified, the resulting URNs are cited at the version level, with a version identifier composed of the source version identifier concatenated with `_tokens`.
```julia
labellededition = tokenize(cn, orthography, edition = "special_edition_id")
labellededition[1]
```

```julia
labelledexemplars = tokenize(cn, orthography, exemplar = "tokens")
labelledexemplars[1]
```
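For comparison, a sketch of the default behavior described above, with neither keyword supplied (`cn` and `orthography` are assumed to be the same corpus and orthographic system used in the preceding examples):

```julia
# With no `edition` or `exemplar` keyword, the resulting passages
# are cited at the version level, with `_tokens` appended to the
# source version identifier.
defaulttokens = tokenize(cn, orthography)
defaulttokens[1]
```

Inspecting the URN of the first result should show the source version identifier with the `_tokens` suffix.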