Exported types and functions
June 7, 2024
Types
OrthographicSystem
An abstract type for orthographic systems.
TokenCategory
An abstract type for token categories.
LexicalToken
Category of alphabetic tokens.
NumericToken
Category of numeric tokens.
PunctuationToken
Category of punctuation tokens.
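The category types above can be pictured as a small type hierarchy. The following is a minimal sketch, not the package source: it assumes the three categories are empty concrete types under the abstract TokenCategory, and the label function is a hypothetical helper added only to show dispatch on a category.

```julia
# Hypothetical re-creation of the category hierarchy; not the package source.
abstract type TokenCategory end

struct LexicalToken <: TokenCategory end      # alphabetic tokens
struct NumericToken <: TokenCategory end      # numeric tokens
struct PunctuationToken <: TokenCategory end  # punctuation tokens

# Hypothetical helper: dispatch on the category to get a display label.
label(::LexicalToken) = "lexical"
label(::NumericToken) = "numeric"
label(::PunctuationToken) = "punctuation"

categories = TokenCategory[LexicalToken(), NumericToken(), PunctuationToken()]
labels = [label(c) for c in categories]
```

Because each category is its own type, code that consumes tokens can select behavior by method dispatch rather than by comparing strings.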
Functions
Public functions implemented for all subtypes of OrthographicSystem.
codepoints
tokentypes
Delegate to specific functions based on the type’s orthography trait value.
It is an error to invoke the tokentypes function on anything but an orthographic system.
Orthographic systems must implement tokentypes.
Implement the tokentypes function for SimpleAscii.
Implement the tokentypes function for WSTokenizer.
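The delegation described for tokentypes resembles the common Julia trait-dispatch pattern. The sketch below illustrates that pattern under stated assumptions: the trait types, the orthographytrait function, and ToyOrthography are hypothetical names, and the package's own internals may differ.

```julia
# Hypothetical trait types; the names are illustrative only.
abstract type OrthographyTrait end
struct IsOrthographicSystem <: OrthographyTrait end
struct NotOrthographicSystem <: OrthographyTrait end

abstract type OrthographicSystem end
struct ToyOrthography <: OrthographicSystem end  # hypothetical concrete system

# Trait function: by default, a type is not an orthographic system.
orthographytrait(::Type) = NotOrthographicSystem()
orthographytrait(::Type{<:OrthographicSystem}) = IsOrthographicSystem()

# Public entry point: delegate on the trait value of the argument's type.
tokentypes(x::T) where {T} = tokentypes(orthographytrait(T), x)

# Invoking the function on anything else is an error.
tokentypes(::NotOrthographicSystem, x) =
    throw(DomainError(x, "tokentypes is only defined for orthographic systems"))

# Fallback for systems that have not supplied their own method.
tokentypes(::IsOrthographicSystem, x) =
    throw(DomainError(x, "orthographic systems must implement tokentypes"))

# Concrete implementation for the toy system (symbols stand in for categories).
tokentypes(::IsOrthographicSystem, ::ToyOrthography) = [:lexical, :punctuation]

toytypes = tokentypes(ToyOrthography())
```

Calling tokentypes on a value such as 42 takes the NotOrthographicSystem branch and throws, which matches the three documented cases: delegation, error on non-systems, and a required per-system implementation.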
validcp
validstring
tokenize
Delegate to specific functions based on the type’s orthography trait value.
It is an error to invoke the tokenize function on anything but an orthographic system.
Orthographic systems must implement tokenize.
Tokenize citable node cn using the tokenizer of the given orthographic system.
The return value is a list of pairings of a CitablePassage and a token category. Each returned passage is citable at the level of the token.
Tokenize corpus c using the tokenizer of the given orthographic system.
The return value is a list of pairings of a CitablePassage and a token category. Each returned passage is citable at the level of the token.
Tokenize document doc using the tokenizer of the given orthographic system.
The return value is a list of pairings of a CitablePassage and a token category. Each returned passage is citable at the level of the token.
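The shape of the return value described above (a list pairing a token-level passage with a token category) can be sketched as follows. This is an illustration under assumptions, not the package's tokenizer: ToyPassage is a hypothetical stand-in for a citable passage, toytokenize and classify are hypothetical names, and splitting on whitespace with a ".n" identifier suffix is a simplification chosen for the example.

```julia
abstract type TokenCategory end
struct LexicalToken <: TokenCategory end
struct NumericToken <: TokenCategory end
struct PunctuationToken <: TokenCategory end

# Hypothetical stand-in for a citable passage: an identifier plus its text.
struct ToyPassage
    id::String
    text::String
end

# Classify one whitespace-delimited chunk (a simplifying assumption).
function classify(tok::AbstractString)
    all(isdigit, tok) && return NumericToken()
    all(ispunct, tok) && return PunctuationToken()
    return LexicalToken()
end

# Tokenize a passage; each token receives an identifier one level deeper
# than its source passage, so it can be cited at the level of the token.
function toytokenize(p::ToyPassage)
    [(ToyPassage(string(p.id, ".", i), tok), classify(tok))
     for (i, tok) in enumerate(split(p.text))]
end

result = toytokenize(ToyPassage("1.1", "score 4 !"))
```

A corpus or document version would, under the same assumptions, apply toytokenize to each passage and concatenate the resulting lists.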
Implement the tokenize function for the SimpleAscii orthography.
Implement the tokenize function for the WSTokenizer orthography.
Working with text corpora:
corpus_histo
Other utilities
nfkc
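No description is given for nfkc here. Assuming the name refers to Unicode normalization form KC, the following sketch shows that normalization using Julia's standard Unicode library; toynfkc is a hypothetical name and is not claimed to match the package function's signature or behavior.

```julia
using Unicode

# Hypothetical helper, assuming "nfkc" means Unicode NFKC normalization.
toynfkc(s::AbstractString) = Unicode.normalize(s, :NFKC)

ligature = "\ufb01"             # U+FB01 LATIN SMALL LIGATURE FI
normalized = toynfkc(ligature)  # compatibility form decomposes the ligature
```

Normalizing text before validating or tokenizing it means that visually equivalent strings are compared in a single canonical form.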
Example implementation
SimpleAscii
An orthographic system for a basic alphabetic subset of the ASCII character set.
simpleAscii
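As an illustration of how an orthography restricted to an alphabetic ASCII subset might validate input, here is a minimal sketch. Everything in it is an assumption made for the example: ToyAscii, toyascii, toyvalidcp, and toyvalidstring are hypothetical names, and the character set (unaccented letters only) is not taken from the package.

```julia
# Hypothetical orthography holding its permitted characters as a string.
struct ToyAscii
    codepoints::String
end

# Hypothetical constructor: lowercase and uppercase ASCII letters only.
toyascii() = ToyAscii(join('a':'z') * join('A':'Z'))

# A character is valid if it appears among the permitted code points.
toyvalidcp(c::Char, o::ToyAscii) = c in o.codepoints

# A string is valid if every one of its characters is valid.
toyvalidstring(s::AbstractString, o::ToyAscii) = all(c -> toyvalidcp(c, o), s)

o = toyascii()
ok = toyvalidstring("Hello", o)
bad = toyvalidstring("Hello!", o)
```

In this sketch spaces and punctuation are rejected because they are not in the permitted set; a fuller system would list them among its code points.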