Morphological data

Published

January 4, 2025

Hebrew

We are working with two existing sets of data with morphological annotations on the Hebrew Bible.

Open Scriptures Hebrew Bible

The Open Scriptures Hebrew Bible project (OSHB) includes manual annotations on every word of the Hebrew Bible, including a detailed morphological analysis. Lexemes are identified with Strong's numbers. The OSHB data set is available on GitHub here.

Julia package

Neel Smith has published OpenScripturesHebrew.jl, a Julia package for working with the OSHB data set. (See the package documentation.)

Example of usage

using OpenScripturesHebrew
allwords = tanakh()

length(allwords)

allwords[3]

(urn = "urn:cts:compnov:bible.genesis.osh:1.1", code = "HVqp3ms", mtoken = "בָּרָ֣א", otoken = "בָּרָ֣א", otoken_num = 2, lemma = "1254 a")

allwords[3].mtoken

"בָּרָ֣א"

parseword(allwords[3])

finite verb: qal perfect third singular masculine
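Because each record is a named tuple, ordinary Julia collection functions can select words by their morphology code. The following is a minimal sketch rather than part of the package: it works on the single record shown above instead of the full output of tanakh(), the helper names isverb and strongnumber are hypothetical, and it assumes the OSHB convention that a code begins with a one-letter language marker, separates prefixed morphemes with "/", and tags verbs with "V".

```julia
# One record copied from the output above; in practice this would be the
# vector returned by tanakh().
sample = [
    (urn = "urn:cts:compnov:bible.genesis.osh:1.1", code = "HVqp3ms",
     mtoken = "בָּרָ֣א", otoken = "בָּרָ֣א", otoken_num = 2, lemma = "1254 a"),
]

# Hypothetical helper: true if any morpheme in the code is tagged as a verb.
# Assumes the first character is a language marker and "/" separates morphemes.
function isverb(code::AbstractString)
    morphemes = split(chop(code, head = 1, tail = 0), "/")
    any(m -> startswith(m, "V"), morphemes)
end

# Hypothetical helper: read the first run of digits in the lemma field as a
# Strong's number, or return nothing if the field contains no digits.
function strongnumber(lemma::AbstractString)
    m = match(r"\d+", lemma)
    m === nothing ? nothing : parse(Int, m.match)
end

verbs = filter(w -> isverb(w.code), sample)
strongnumber(verbs[1].lemma)
```

Applied to the full vector, the same filter would give a list of verb tokens whose Strong's numbers could then be compared with identifiers from other lexica.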

Sefaria

The Sefaria project makes its data available through an API documented here. The API includes an option to search Sefaria’s online lexica for articles based on any form of a Hebrew word. Sefaria’s search results include identifiers for the Brown-Driver-Briggs lexicon (among others), which for our morphological analyses are often more helpful than the Strong's identifiers used by OSHB. While the Sefaria queries do not include detailed morphological analysis, it is possible to find part-of-speech codes in the related data, taken from Strong.

Precompiled data sets

The response time to a query using Sefaria’s API is typically measured in tenths of a second. While this is perfectly adequate for interactive usage, it is impractical for a corpus with more than 400,000 words. Even at 0.1 second per query, a single pass over the Hebrew Bible would require more than 11 hours (400,000 queries × 0.1 second = 40,000 seconds). In practice, automated runs of queries often result in time-out errors after only a few hundred queries. For our morphological analysis, therefore, we have compiled a set of static files with Sefaria’s resolution of every verb form to a lemma (dictionary form) and an identifier in Brown-Driver-Briggs. The files (one for each book of the Hebrew Bible) are in delimited text format following this example:

urn|form|lemma|bdbid
urn:cts:compnov:bible.genesis.masoretic_tokens:1.1.2|בָּרָ֣א|בָּרָא|BDB01439
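A file in this format can be read with nothing more than Julia's standard library. The sketch below is one possible approach, not the project's own loader: it uses the two lines of the example above in place of a real file, and the names parserecord and bdbindex are hypothetical.

```julia
# Header and record copied from the example above; a real file would be
# read with readlines(path) instead.
lines = [
    "urn|form|lemma|bdbid",
    "urn:cts:compnov:bible.genesis.masoretic_tokens:1.1.2|בָּרָ֣א|בָּרָא|BDB01439",
]

# Hypothetical helper: turn one delimited line into a named tuple.
function parserecord(line::AbstractString)
    cols = String.(split(line, "|"))
    length(cols) == 4 || error("expected 4 columns, got $(length(cols))")
    (urn = cols[1], form = cols[2], lemma = cols[3], bdbid = cols[4])
end

# Skip the header line, then index BDB identifiers by surface form.
records = map(parserecord, lines[2:end])
bdbindex = Dict{String, Vector{String}}()
for r in records
    push!(get!(bdbindex, r.form, String[]), r.bdbid)
end

# Look up the identifiers recorded for the first form in the file.
bdbindex[records[1].form]
```

Keeping a vector of identifiers per form allows for the case where the same surface form is resolved to more than one article in different passages.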

Interactive use

Neel Smith has published BrownDriverBriggs.jl, a Julia package for querying the Sefaria API dynamically. (See the documentation.)

Example of usage

using BrownDriverBriggs 
articles = bdb("בָּרָ֣א")

6-element Vector{Article}:
 בַּר (BDB01437)
 בַּר² (BDB01438)
 בָּרָא (BDB01439)
 בָּרָא² (BDB01442)
 בַּר³ (BDB01500)
 בַּר⁴ (BDB01501)

Greek

We’re using the Kanones system to build Greek parsers tailored to the corpora of the Complutensian’s Greek documents.

The current version of the Greek parser for the Septuagint is available at http://shot.holycross.edu/morphology/complutensian-current.cex.

Example of usage

using Kanones, CitableBase
greekurl = "http://shot.holycross.edu/morphology/complutensian-current.cex"
greekparser = kParser(greekurl, UrlReader)

KanonesStringParser(["ἀγαθός|lsj.n260|forms.7010001110|adjstems.n260|adjinfl.os_h_on_pos1|ἀγαθος|a", "ἀγαθή|lsj.n260|forms.7010002110|adjstems.n260|adjinfl.os_h_on_pos2|ἀγαθη|a", "ἀγαθόν|lsj.n260|forms.7010003110|adjstems.n260|adjinfl.os_h_on_pos3|ἀγαθον|a", "ἀγαθοῦ|lsj.n260|forms.7010001210|adjstems.n260|adjinfl.os_h_on_pos4|ἀγαθου|a", "ἀγαθῆς|lsj.n260|forms.7010002210|adjstems.n260|adjinfl.os_h_on_pos5|ἀγαθης|a", "ἀγαθοῦ|lsj.n260|forms.7010003210|adjstems.n260|adjinfl.os_h_on_pos6|ἀγαθου|a", "ἀγαθῷ|lsj.n260|forms.7010001310|adjstems.n260|adjinfl.os_h_on_pos7|ἀγαθῳ|a", "ἀγαθῇ|lsj.n260|forms.7010002310|adjstems.n260|adjinfl.os_h_on_pos8|ἀγαθῃ|a", "ἀγαθῷ|lsj.n260|forms.7010003310|adjstems.n260|adjinfl.os_h_on_pos9|ἀγαθῳ|a", "ἀγαθόν|lsj.n260|forms.7010001410|adjstems.n260|adjinfl.os_h_on_pos10|ἀγαθον|a"  …  "ὑποτιθῆται|lsj.n109358|forms.3311220000|compounds.n109358|irreginfl.irregular2|ὑποτιθηται|a", "ὑποτιθώμεθα|lsj.n109358|forms.3131220000|compounds.n109358|irreginfl.irregular2|ὑποτιθωμεθα|a", "ὑποτιθῆσθε|lsj.n109358|forms.3231220000|compounds.n109358|irreginfl.irregular2|ὑποτιθησθε|a", "ὑποτιθῶνται|lsj.n109358|forms.3331220000|compounds.n109358|irreginfl.irregular2|ὑποτιθωνται|a", "ὑποτιθῶμαι|lsj.n109358|forms.3111230000|compounds.n109358|irreginfl.irregular2|ὑποτιθωμαι|a", "ὑποτιθῇ|lsj.n109358|forms.3211230000|compounds.n109358|irreginfl.irregular2|ὑποτιθῃ|a", "ὑποτιθῆται|lsj.n109358|forms.3311230000|compounds.n109358|irreginfl.irregular2|ὑποτιθηται|a", "ὑποτιθώμεθα|lsj.n109358|forms.3131230000|compounds.n109358|irreginfl.irregular2|ὑποτιθωμεθα|a", "ὑποτιθῆσθε|lsj.n109358|forms.3231230000|compounds.n109358|irreginfl.irregular2|ὑποτιθησθε|a", "ὑποτιθῶνται|lsj.n109358|forms.3331230000|compounds.n109358|irreginfl.irregular2|ὑποτιθωνται|a"], 
PolytonicGreek.LiteraryGreekOrthography("'αβγδεζηθικλμνξοπρςστυφχψωΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩϊϋόύώάέήίΰΐἀἁἂἃἄἅἆἇἈἉἊἋἌἍἎἏἐἑἒἓἔἕἘἙἚἛἜἝἠἡἢἣἤἥἦἧἨἩἪἫἬἭἮἯἰἱἲἳἴἵἶἷἸἹἺἻἼἽἾἿὀὁὂὃὄὅὈὉὊὋὌὍὐὑὒὓὔὕὖὗὙὛὝὟὠὡὢὣὤὥὦὧὨὩὪὫὬὭὮὯὰάὲέὴήὶίὸόὺύὼώᾀᾁᾂᾃᾄᾅᾆᾇᾈᾉᾊᾋᾌᾍᾎᾏᾐᾑᾒᾓᾔᾕᾖᾗᾘᾙᾚᾛᾜᾝᾞᾟᾠᾡᾢᾣᾤᾥᾦᾧᾨᾩᾪᾫᾬᾭᾮᾯᾲᾳᾴᾶᾷᾸᾹᾺΆᾼῂῃῄῆῇῈΈῊΉῌῒΐῖῗῘῙῚΊῢΰῤῥῦῧῪΎῬῲῳῴῶῷῸΌῺΏῼ \t\n(\".,;:)", DataType[Orthography.LexicalToken, Orthography.PunctuationToken]), "|")

goodparses = parsetoken("ἀγαθός", greekparser)

1-element Vector{CitableParserBuilder.Analysis}:
 CitableParserBuilder.Analysis("ἀγαθός", lsj.n260, forms.7010001110, adjstems.n260, adjinfl.os_h_on_pos1, "ἀγαθος", "a")
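The entries listed in the parser display above are pipe-delimited strings, so they can be inspected with plain string functions. The following sketch is only an illustration on two entries copied from that display: the helper name splitentry is hypothetical, and the column labels are guesses based on the identifier prefixes (lsj, forms, adjstems, adjinfl), not the package's documented schema.

```julia
# Two entries copied from the parser display above.
entries = [
    "ἀγαθός|lsj.n260|forms.7010001110|adjstems.n260|adjinfl.os_h_on_pos1|ἀγαθος|a",
    "ἀγαθή|lsj.n260|forms.7010002110|adjstems.n260|adjinfl.os_h_on_pos2|ἀγαθη|a",
]

# Hypothetical helper: label the first five columns of one entry.
# The labels are assumptions inferred from the identifier prefixes.
function splitentry(entry::AbstractString)
    cols = String.(split(entry, "|"))
    (token = cols[1], lexeme = cols[2], form = cols[3], stem = cols[4], rule = cols[5])
end

parsed = map(splitentry, entries)

# Distinct lexeme identifiers among the sampled entries.
lexemes = unique(e.lexeme for e in parsed)
```

Both sampled entries share the identifier lsj.n260, which is how different inflected forms can be grouped under one lexicon entry.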

Latin

We’re using the Tabulae system to build Latin parsers tailored to the corpora of the Complutensian’s Latin documents.

The current versions of our parsers for Latin documents are available here:

for the 25-letter orthography of the Vulgate: http://shot.holycross.edu/tabulae/complut-lat25-current.cex
for the 23-letter orthography of the Latin glosses: http://shot.holycross.edu/tabulae/complut-lat23-current.cex

Example of usage

using Tabulae, CitableBase  # CitableBase supplies UrlReader, as in the Greek example
url = "http://shot.holycross.edu/tabulae/complut-lat25-current.cex"
latinparser = tabulaeStringParser(url, UrlReader)

Latin parser covering 184181 analyses.

latinparses = parsetoken("creavit", latinparser)

1-element Vector{CitableParserBuilder.Analysis}:
 CitableParserBuilder.Analysis("creavit", ls.n11543, forms.3314110000, latcommon.verbn11543, lat25.c1pftact_pft3, "creavit", "A")