Validating and tokenizing text

Published: July 26, 2024

Use the stemortho function to get an orthographic system for working with Greek scientific texts:

using GreekScientificOrthography
o = stemortho()
typeof(o)
GreekSciOrthography

Validate strings

Text including astronomical symbols is valid:

using Orthography
validstring("ὁ 🜚︎", o)
true
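
To check several strings at once, one option is to pair each string with its validstring result. The following is a minimal sketch: the second sample string is a hypothetical input, not taken from the package documentation, and no output is shown for it because the result depends on the orthography's rules.

```julia
using GreekScientificOrthography
using Orthography

o = stemortho()

# Sample inputs: the first comes from the example above,
# the second is a hypothetical extra string for illustration.
candidates = ["ὁ 🜚︎", "ὁ λόγος"]

# Pair each string with the Boolean result of validstring.
report = [(s, validstring(s, o)) for s in candidates]

for (s, ok) in report
    println(s, " => ", ok ? "valid" : "invalid")
end
```

Because validstring returns a Bool, the same call also works as a predicate, for example in filter(s -> validstring(s, o), candidates).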

The following phrase from Archimedes, On the Measurement of the Circle, proposition 3, includes numeric quantities (φοαʹ, ρνγʹ) and figure labels (ΓΕ, ΓΗ). These are also valid.

archimedes = "ἡ ΓΕ πρὸς ΓΗ μείζονα λόγον ἔχει ἤπερ φοαʹ πρὸς ρνγʹ."
validstring(archimedes, o)
true
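
The numeric quantities in this phrase are written in the Milesian (alphabetic) numeral system, where each letter stands for a fixed value. As a rough illustration of how such a numeral is read, the sketch below sums the letter values. The milesianvalue function and the MILESIAN_VALUES table are hypothetical helpers written for this example. They are not part of GreekScientificOrthography, and they ignore thousands and fractions.

```julia
# Hypothetical helper, not part of GreekScientificOrthography.
# Values of the 27 letters of the Milesian (alphabetic) numeral system.
const MILESIAN_VALUES = Dict(
    'α' => 1,   'β' => 2,   'γ' => 3,   'δ' => 4,   'ε' => 5,
    'ϛ' => 6,   'ζ' => 7,   'η' => 8,   'θ' => 9,
    'ι' => 10,  'κ' => 20,  'λ' => 30,  'μ' => 40,  'ν' => 50,
    'ξ' => 60,  'ο' => 70,  'π' => 80,  'ϙ' => 90,
    'ρ' => 100, 'σ' => 200, 'τ' => 300, 'υ' => 400, 'φ' => 500,
    'χ' => 600, 'ψ' => 700, 'ω' => 800, 'ϡ' => 900,
)

# Sum the letter values. Characters missing from the table, such as the
# trailing numeral sign, contribute 0. Thousands and fractions are not handled.
function milesianvalue(s::AbstractString)
    total = 0
    for c in s
        total += get(MILESIAN_VALUES, c, 0)
    end
    return total
end

milesianvalue("φοαʹ")  # 571 (500 + 70 + 1)
milesianvalue("ρνγʹ")  # 153 (100 + 50 + 3)
```

Read this way, the phrase states that the ratio of ΓΕ to ΓΗ is greater than 571 to 153.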

Tokenize strings

A GreekSciOrthography recognizes six types of token, including three specialized for scientific texts (figure labels, Milesian integers, and astronomical symbols):

tokentypes(o)
6-element Vector{DataType}:
 LexicalToken
 PunctuationToken
 Orthography.UnanalyzedToken
 FigureLabelToken
 MilesianIntegerToken
 AstronomicalSymbol
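
The tokenize function returns a vector of OrthographicToken values, each pairing the text of a token with its category:

tokenize("ὁ 🜚︎", o)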
2-element Vector{OrthographicToken}:
 OrthographicToken("ὁ", LexicalToken())
 OrthographicToken("🜚︎", AstronomicalSymbol())
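
Tokenizing the phrase from Archimedes distinguishes figure labels and Milesian integers from ordinary lexical tokens:

tokenize(archimedes, o)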
12-element Vector{OrthographicToken}:
 OrthographicToken("ἡ", LexicalToken())
 OrthographicToken("ΓΕ", FigureLabelToken())
 OrthographicToken("πρὸς", LexicalToken())
 OrthographicToken("ΓΗ", FigureLabelToken())
 OrthographicToken("μείζονα", LexicalToken())
 OrthographicToken("λόγον", LexicalToken())
 OrthographicToken("ἔχει", LexicalToken())
 OrthographicToken("ἤπερ", LexicalToken())
 OrthographicToken("φοαʹ", MilesianIntegerToken())
 OrthographicToken("πρὸς", LexicalToken())
 OrthographicToken("ρνγʹ", MilesianIntegerToken())
 OrthographicToken(".", PunctuationToken())
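
Once a string is tokenized, the category of each token can be used to pull out just the numerals or just the figure labels. The sketch below assumes that OrthographicToken exposes its two parts as fields named text and tokencategory. The output above shows the two values but not the field names, so treat those names as an assumption to check against the Orthography package.

```julia
using GreekScientificOrthography
using Orthography

o = stemortho()
archimedes = "ἡ ΓΕ πρὸς ΓΗ μείζονα λόγον ἔχει ἤπερ φοαʹ πρὸς ρνγʹ."
tokens = tokenize(archimedes, o)

# Assumption: each OrthographicToken has `text` and `tokencategory` fields.
# Keep the text of tokens whose category is a Milesian integer.
numbers = [t.text for t in tokens if t.tokencategory isa MilesianIntegerToken]

# Keep the text of tokens whose category is a figure label.
labels = [t.text for t in tokens if t.tokencategory isa FigureLabelToken]

println(numbers)
println(labels)
```

Given the tokenization shown above, numbers should hold φοαʹ and ρνγʹ, and labels should hold ΓΕ and ΓΗ.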