Quick start

Published

July 11, 2024

Create a HebrewOrthography object and use it to get metadata about the orthography, to validate strings, and to tokenize strings.

using BiblicalHebrew, Orthography
ortho = HebrewOrthography()
HebrewOrthography()

Valid characters

All 84 defined codepoints in the Unicode Hebrew range plus four white-space characters (tab, \n, \r and space) are valid in this orthography, for a total of 88 characters.

codepoints(ortho)
88-element Vector{Char}:
 '\t': ASCII/Unicode U+0009 (category Cc: Other, control)
 '\n': ASCII/Unicode U+000A (category Cc: Other, control)
 '\r': ASCII/Unicode U+000D (category Cc: Other, control)
 ' ': ASCII/Unicode U+0020 (category Zs: Separator, space)
 '֑': Unicode U+0591 (category Mn: Mark, nonspacing)
 '֒': Unicode U+0592 (category Mn: Mark, nonspacing)
 '֓': Unicode U+0593 (category Mn: Mark, nonspacing)
 '֔': Unicode U+0594 (category Mn: Mark, nonspacing)
 '֕': Unicode U+0595 (category Mn: Mark, nonspacing)
 '֖': Unicode U+0596 (category Mn: Mark, nonspacing)
 '֗': Unicode U+0597 (category Mn: Mark, nonspacing)
 '֘': Unicode U+0598 (category Mn: Mark, nonspacing)
 '֙': Unicode U+0599 (category Mn: Mark, nonspacing)
 ⋮
 'ס': Unicode U+05E1 (category Lo: Letter, other)
 'ע': Unicode U+05E2 (category Lo: Letter, other)
 'ף': Unicode U+05E3 (category Lo: Letter, other)
 'פ': Unicode U+05E4 (category Lo: Letter, other)
 'ץ': Unicode U+05E5 (category Lo: Letter, other)
 'צ': Unicode U+05E6 (category Lo: Letter, other)
 'ק': Unicode U+05E7 (category Lo: Letter, other)
 'ר': Unicode U+05E8 (category Lo: Letter, other)
 'ש': Unicode U+05E9 (category Lo: Letter, other)
 'ת': Unicode U+05EA (category Lo: Letter, other)
 '׳': Unicode U+05F3 (category Po: Punctuation, other)
 '״': Unicode U+05F4 (category Po: Punctuation, other)

Test whether a string is valid in this orthography:

validstring("בֵּֽין־פָּארָ֧ן", ortho)
true
validstring("Hi, בֵּֽין־פָּארָ֧ן", ortho)
false
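The check above can be pictured as a membership test against the list of valid characters. The following is a minimal sketch using only Julia Base, not the package's implementation: ALLOWED, isvalid_sketch and invalid_chars are hypothetical names, and the allowed set is approximated by the contiguous range U+0591 through U+05F4, which is slightly looser than the 84 codepoints the package reports.

```julia
# Hypothetical stand-in for codepoints(ortho): the four white-space characters
# plus every codepoint from U+0591 through U+05F4. This range also includes a
# few unassigned codepoints, so it is an approximation of the package's set.
const ALLOWED = Set{Char}(vcat(['\t', '\n', '\r', ' '], collect('\u0591':'\u05f4')))

# A string is valid when every one of its characters belongs to the allowed set.
isvalid_sketch(s::AbstractString) = all(in(ALLOWED), s)

# List the distinct characters that make a string invalid, in order of appearance.
invalid_chars(s::AbstractString) = unique(c for c in s if c ∉ ALLOWED)

println(isvalid_sketch("בֵּֽין־פָּארָ֧ן"))
println(invalid_chars("Hi, בֵּֽין־פָּארָ֧ן"))
```

A helper like invalid_chars can be useful when validstring returns false, since it shows which characters (here the Latin letters and the comma) caused the failure.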

Tokenization

The orthography can identify three categories of token:

tokentypes(ortho)
3-element Vector{DataType}:
 LexicalToken
 PunctuationToken
 NumericToken

Tokenization associates a string value with a token category. Since punctuation like maqaf doesn’t display properly in this documentation, we’ll use the package’s maqaf_join function to create a construct chain, then tokenize the resulting string.

s1 = "בֵּֽין"
"בֵּֽין"
s2 = "פָּארָ֧ן"
"פָּארָ֧ן"
construct = BiblicalHebrew.maqaf_join([s1,s2])
"בֵּֽין־פָּארָ֧ן"
tokens = tokenize(construct, ortho)
3-element Vector{OrthographicToken}:
 OrthographicToken("בֵּֽין", LexicalToken())
 OrthographicToken("־", PunctuationToken())
 OrthographicToken("פָּארָ֧ן", LexicalToken())
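To illustrate why the construct chain yields three tokens, here is a rough sketch of splitting a string at each maqaf (U+05BE) while keeping the maqaf as its own piece. It uses only Julia Base; split_on_maqaf and the :lexical / :punctuation symbols are hypothetical stand-ins for the package's tokenize function and token types, and the sketch ignores white space, other punctuation and numeric tokens.

```julia
const MAQAF = '\u05be'  # Hebrew punctuation maqaf

# Split a string into (text, category) pairs, treating each maqaf as a
# separate punctuation piece and everything between maqafs as lexical.
function split_on_maqaf(s::AbstractString)
    pieces = Tuple{String,Symbol}[]
    buf = IOBuffer()
    for c in s
        if c == MAQAF
            word = String(take!(buf))          # flush the word collected so far
            isempty(word) || push!(pieces, (word, :lexical))
            push!(pieces, (string(c), :punctuation))
        else
            print(buf, c)                      # keep accumulating the current word
        end
    end
    word = String(take!(buf))                  # flush any trailing word
    isempty(word) || push!(pieces, (word, :lexical))
    return pieces
end

for (text, category) in split_on_maqaf("בֵּֽין" * string(MAQAF) * "פָּארָ֧ן")
    println(category, ": ", text)
end
```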

Numeric tokens are followed by a geresh or gershayim. To compose a string for the numeric value 1, the following example converts a named character constant to a string and passes it to the package’s gershe function, which appends a geresh to it. Note that the resulting token keeps only the letter, without the trailing geresh.

aleph = string(BiblicalHebrew.aleph_ch)
one = BiblicalHebrew.gershe(aleph)
"א׳"
tokenize(one, ortho)
1-element Vector{OrthographicToken}:
 OrthographicToken("א", NumericToken())
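The convention described above can be sketched with plain string operations. This is an illustration only, assuming that a numeric string is simply a letter sequence with a trailing geresh (U+05F3) or gershayim (U+05F4); append_geresh and numeric_text are hypothetical helpers, not the package's gershe or tokenize functions.

```julia
const GERESH = '\u05f3'     # Hebrew punctuation geresh
const GERSHAYIM = '\u05f4'  # Hebrew punctuation gershayim

# Append a geresh to a string, marking it as a numeric value.
append_geresh(s::AbstractString) = string(s, GERESH)

# Return the text of a numeric string without its trailing marker,
# or nothing if the string does not end in a geresh or gershayim.
function numeric_text(s::AbstractString)
    isempty(s) && return nothing
    last(s) in (GERESH, GERSHAYIM) || return nothing
    return String(chop(s))  # chop drops the final character
end

one = append_geresh("א")
println(numeric_text(one))
```

Stripping the marker in this way mirrors the output shown above, where the token text is the letter alone.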

To do: check that each of these functions is covered on this page:

codepoints

tokenize

tokentypes

validstring