using BiblicalHebrew, Orthography
= HebrewOrthography() ortho
HebrewOrthography()
Built with package version 0.4.0
July 11, 2024
Create a HebrewOrthography
object and use it to get metadata about the orthography, to validate strings, and to tokenize strings.
All 84 defined codepoints in the Unicode Hebrew range plus four white-space characters (space, \n
, \r
and \n
) are valid in this orthography.
88-element Vector{Char}:
'\t': ASCII/Unicode U+0009 (category Cc: Other, control)
'\n': ASCII/Unicode U+000A (category Cc: Other, control)
'\r': ASCII/Unicode U+000D (category Cc: Other, control)
' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
'֑': Unicode U+0591 (category Mn: Mark, nonspacing)
'֒': Unicode U+0592 (category Mn: Mark, nonspacing)
'֓': Unicode U+0593 (category Mn: Mark, nonspacing)
'֔': Unicode U+0594 (category Mn: Mark, nonspacing)
'֕': Unicode U+0595 (category Mn: Mark, nonspacing)
'֖': Unicode U+0596 (category Mn: Mark, nonspacing)
'֗': Unicode U+0597 (category Mn: Mark, nonspacing)
'֘': Unicode U+0598 (category Mn: Mark, nonspacing)
'֙': Unicode U+0599 (category Mn: Mark, nonspacing)
⋮
'ס': Unicode U+05E1 (category Lo: Letter, other)
'ע': Unicode U+05E2 (category Lo: Letter, other)
'ף': Unicode U+05E3 (category Lo: Letter, other)
'פ': Unicode U+05E4 (category Lo: Letter, other)
'ץ': Unicode U+05E5 (category Lo: Letter, other)
'צ': Unicode U+05E6 (category Lo: Letter, other)
'ק': Unicode U+05E7 (category Lo: Letter, other)
'ר': Unicode U+05E8 (category Lo: Letter, other)
'ש': Unicode U+05E9 (category Lo: Letter, other)
'ת': Unicode U+05EA (category Lo: Letter, other)
'׳': Unicode U+05F3 (category Po: Punctuation, other)
'״': Unicode U+05F4 (category Po: Punctuation, other)
Test whether a string is valid in this orthography:
The orthography can identify three categories of token:
Tokenization associates a string value with a token category. Since punctuation like maqaf doesn’t display properly in this documentation, we’ll use the package’s maqaf_join
function to create a construct chain, then tokenize the resulting string.
3-element Vector{OrthographicToken}:
OrthographicToken("בֵּֽין", LexicalToken())
OrthographicToken("־", PunctuationToken())
OrthographicToken("פָּארָ֧ן", LexicalToken())
Numeric tokens are followed by gershe or gershayim. To compose a string for the numeric value 1, the following example passes a named character constant as a parameter to the package’s gershe
function to append a gershe to it.
codepoints
tokenize
tokentypes
validstring