Utilities for working with character values

Published

July 9, 2024

The BiblicalHebrew package includes a few shortcuts for viewing different representations of codepoints that, although generic, can be useful when working with fully pointed Hebrew text.

Some Unicode basics

The graphemes function in Julia’s Unicode standard library iterates through the graphemes in a string. If we apply it to the string “מִדְבָּר” and collect the results into a vector, the vector has four elements, one for each consonant together with any associated points, such as vowel points or dagesh.

using Unicode
desert = "מִדְבָּר"
graphemev = graphemes(desert) |> collect
4-element Vector{SubString{String}}:
 "מִ"
 "דְ"
 "בָּ"
 "ר"

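Calling collect on a string gathers its characters into a vector of Chars. The first grapheme, for example, consists of the consonant mem followed by the vowel point hiriq.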

charv = collect(graphemev[1])
2-element Vector{Char}:
 'מ': Unicode U+05DE (category Lo: Letter, other)
 'ִ': Unicode U+05B4 (category Mn: Mark, nonspacing)
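
The difference between graphemes and characters is easy to check by counting. The following is a minimal sketch that uses only Base and the Unicode standard library. The word is spelled out with escape sequences so that the codepoints are unambiguous; the order of qamats and dagesh on the bet is an assumption about how the text was entered, and it does not affect the counts.

```julia
using Unicode

# "midbar" written codepoint by codepoint
# (qamats before dagesh on the bet is an assumption)
word = "\u05de\u05b4\u05d3\u05b0\u05d1\u05b8\u05bc\u05e8"

n_graphemes = length(graphemes(word))  # consonants together with their points
n_chars = length(word)                 # individual codepoints
n_bytes = ncodeunits(word)             # UTF-8 code units (bytes)

println((n_graphemes, n_chars, n_bytes))
```

Each Hebrew letter or point in this range takes two bytes in UTF-8, so the three counts differ: four graphemes, eight characters, sixteen bytes.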

Codepoints, integers and hexadecimal strings

BiblicalHebrew.codepoint gives the integer value of a character.

using BiblicalHebrew
BiblicalHebrew.codepoint.(charv)
2-element Vector{UInt32}:
 0x000005de
 0x000005b4

These are unsigned integers (UInt32). If you want signed integers, you can construct them directly from the unsigned values:

BiblicalHebrew.codepoint.(charv) .|> Int64
2-element Vector{Int64}:
 1502
 1460

Converting either integer form back to a Char returns the original character, so these comparisons are tautologies:

(BiblicalHebrew.codepoint.(charv) .|> Char) == 
(BiblicalHebrew.codepoint.(charv) .|> Int64 .|> Char) == charv
true
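
For comparison, the same conversions can be sketched with functions from Julia’s Base alone. This only illustrates the underlying idea; it is not a claim about how BiblicalHebrew.codepoint is implemented.

```julia
mem = '\u05de'     # HEBREW LETTER MEM
hiriq = '\u05b4'   # HEBREW POINT HIRIQ

cps = codepoint.([mem, hiriq])   # Base.codepoint returns UInt32 values
signed = Int64.(cps)             # the same values as signed integers

# Both integer forms convert back to the original characters.
roundtrip_ok = Char.(cps) == Char.(signed) == [mem, hiriq]

println(signed)
println(roundtrip_ok)
```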

Julia’s string function displays integers in decimal notation.

BiblicalHebrew.codepoint.(charv) .|> string
2-element Vector{String}:
 "1502"
 "1460"

BiblicalHebrew.hex gets a hex string for codepoints, integers or characters:

charv .|> BiblicalHebrew.hex
2-element Vector{String}:
 "5de"
 "5b4"

charv .|> BiblicalHebrew.codepoint .|> BiblicalHebrew.hex
2-element Vector{String}:
 "5de"
 "5b4"

charv .|> BiblicalHebrew.codepoint .|> Int64 .|> BiblicalHebrew.hex
2-element Vector{String}:
 "5de"
 "5b4"

And BiblicalHebrew.int converts a hex string into an integer value.

charv .|> BiblicalHebrew.codepoint .|> BiblicalHebrew.hex .|> BiblicalHebrew.int
2-element Vector{UInt32}:
 0x000005de
 0x000005b4

So this is also a tautology:

BiblicalHebrew.codepoint.(charv) == BiblicalHebrew.codepoint.(charv) .|> BiblicalHebrew.hex .|> BiblicalHebrew.int
true
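
The hex round trip can also be sketched with Base functions only: string with the base keyword for integer to hex, and parse with the base keyword for hex back to integer. These are stand-ins for illustration and are not necessarily what BiblicalHebrew.hex and BiblicalHebrew.int do internally.

```julia
cps = UInt32[0x05de, 0x05b4]

hexes = string.(cps, base = 16)           # lowercase hex digits, no "0x" prefix
back = parse.(UInt32, hexes, base = 16)   # hex strings back to integers

println(hexes)
println(back == cps)
```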

Splitting up sequences of codepoints

BiblicalHebrew.codept_split works like Julia’s split function, but by default it keeps the separating character as its own element of the result:

split(desert, charv[2])
2-element Vector{SubString{String}}:
 "מ"
 "דְבָּר"

BiblicalHebrew.codept_split(desert, charv[2])
3-element Vector{String}:
 "מ"
 "ִ"
 "דְבָּר"

You can override that behavior:

BiblicalHebrew.codept_split(desert, charv[2]; keep = false)
2-element Vector{String}:
 "מ"
 "דְבָּר"