Morphological data
using OpenScripturesHebrew
allwords = tanakh()
Hebrew
We are working with two existing sets of data with morphological annotations on the Hebrew Bible.
Open Scriptures Hebrew Bible
The Open Scriptures Hebrew Bible project (OSHB) includes manual annotations on every word of the Hebrew Bible, including a detailed morphological analysis. Lexemes are identified with Strong’s numbers. The OSHB data set is available on GitHub.
Julia package
Neel Smith has published OpenScripturesHebrew.jl, a Julia package for working with the OSHB data set. (See the package documentation.)
Example of usage
length(allwords)
432307
allwords[3]
(urn = "urn:cts:compnov:bible.genesis.osh:1.1", code = "HVqp3ms", mtoken = "בָּרָ֣א", otoken = "בָּרָ֣א", otoken_num = 2, lemma = "1254 a")
allwords[3].mtoken
"בָּרָ֣א"
parseword(allwords[3])
finite verb: qal perfect third singular masculine
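To make the relation between the code "HVqp3ms" and the parsed description easier to see, here is a minimal sketch that decodes a finite-verb code by hand. It assumes the code is laid out as language prefix, part of speech, stem, conjugation, person, gender, number. The function name and the lookup tables are hypothetical, the tables cover only a handful of values, and none of this is meant to reproduce how parseword is implemented in OpenScripturesHebrew.jl.

```julia
# Hypothetical, deliberately partial lookup tables (assumption: OSHB-style codes).
const STEMS = Dict('q' => "qal", 'N' => "niphal", 'p' => "piel", 'h' => "hiphil")
const CONJUGATIONS = Dict('p' => "perfect", 'i' => "imperfect", 'v' => "imperative")
const PERSONS = Dict('1' => "first", '2' => "second", '3' => "third")
const GENDERS = Dict('m' => "masculine", 'f' => "feminine", 'c' => "common")
const NUMBERS = Dict('s' => "singular", 'p' => "plural")

# Hypothetical helper: describe a seven-character finite-verb code,
# or return nothing when the code is not covered by the tables above.
function describe_finite_verb(code::AbstractString)
    chars = collect(code)
    # Only handle Hebrew ('H') verb ('V') codes such as "HVqp3ms".
    if length(chars) != 7 || chars[1] != 'H' || chars[2] != 'V'
        return nothing
    end
    tables = (STEMS, CONJUGATIONS, PERSONS, GENDERS, NUMBERS)
    labels = [get(table, c, nothing) for (table, c) in zip(tables, chars[3:7])]
    any(isnothing, labels) && return nothing
    stem, conjugation, person, gender, number = labels
    # Mirror the word order of the parseword output shown above.
    join([stem, conjugation, person, number, gender], " ")
end

println(describe_finite_verb("HVqp3ms"))  # qal perfect third singular masculine
```

For real work, parseword from the package is the function to use; the sketch only illustrates how the positions in the code string map to the terms in its output.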
Sefaria
The Sefaria project makes its data available through a documented API. The API includes an option to search Sefaria’s online lexica for articles based on any form of a Hebrew word. Sefaria’s search results include identifiers for the Brown-Driver-Briggs lexicon (among others), which for our morphological analyses are often more helpful than the Strong’s identifiers used by OSHB. While the Sefaria results do not include detailed morphological analysis, the related data do include part-of-speech codes taken from Strong’s.
Precompiled data sets
The response time for a query to Sefaria’s API is typically measured in tenths of a second. While this is perfectly adequate for interactive use, it is impractical for a corpus of more than 400,000 words: even at 0.1 second per query, a single pass over the Hebrew Bible would take more than 11 hours (400,000 queries × 0.1 second = 40,000 seconds). In practice, automated runs of queries often fail with time-out errors after only a few hundred queries. For our morphological analysis, therefore, we have compiled a set of static files recording Sefaria’s resolution of every verb form to a lemma (dictionary form) and an identifier in Brown-Driver-Briggs. The files (one for each book of the Hebrew Bible) are in delimited-text format, as in this example:
urn|form|lemma|bdbid
urn:cts:compnov:bible.genesis.masoretic_tokens:1.1.2|בָּרָ֣א|בָּרָא|BDB01439
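Because the format is a simple pipe-delimited table with a header line, it can be read with Julia's standard library alone. The following is a minimal sketch using the two sample lines above; the function name readverbs and the variable names are hypothetical, and in practice the text would be read from one of the per-book files rather than from a string literal.

```julia
# Sample data copied from the example above.
sample = """
urn|form|lemma|bdbid
urn:cts:compnov:bible.genesis.masoretic_tokens:1.1.2|בָּרָ֣א|בָּרָא|BDB01439
"""

# Hypothetical helper: turn delimited lines into named tuples,
# skipping the header line and any empty lines.
function readverbs(text::AbstractString)
    rows = filter(!isempty, split(text, '\n'))
    map(rows[2:end]) do row
        urn, form, lemma, bdbid = split(row, '|')
        (urn = String(urn), form = String(form),
         lemma = String(lemma), bdbid = String(bdbid))
    end
end

verbs = readverbs(sample)

# Index the Brown-Driver-Briggs identifiers by token URN for fast lookup.
bdbindex = Dict(v.urn => v.bdbid for v in verbs)
println(bdbindex["urn:cts:compnov:bible.genesis.masoretic_tokens:1.1.2"])  # BDB01439
```

A dictionary keyed by URN answers each lookup locally, which is the point of precompiling the data instead of querying the API once per token.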
Interactive use
Neel Smith has published BrownDriverBriggs.jl, a Julia package for querying the Sefaria API dynamically. (See the documentation.)
Example of usage
using BrownDriverBriggs
articles = bdb("בָּרָ֣א")
6-element Vector{Article}:
בַּר (BDB01437)
בַּר² (BDB01438)
בָּרָא (BDB01439)
בָּרָא² (BDB01442)
בַּר³ (BDB01500)
בַּר⁴ (BDB01501)
Greek
We’re using the Kanones system to build Greek parsers tailored to the corpora of the Complutensian’s Greek documents.
The current version of the Greek parser for the Septuagint is available at http://shot.holycross.edu/morphology/complutensian-current.cex.
Example of usage
using Kanones, CitableBase
greekurl = "http://shot.holycross.edu/morphology/complutensian-current.cex"
greekparser = kParser(greekurl, UrlReader)
KanonesStringParser(["ἀγαθός|lsj.n260|forms.7010001110|adjstems.n260|adjinfl.os_h_on_pos1|ἀγαθος|a", "ἀγαθή|lsj.n260|forms.7010002110|adjstems.n260|adjinfl.os_h_on_pos2|ἀγαθη|a", "ἀγαθόν|lsj.n260|forms.7010003110|adjstems.n260|adjinfl.os_h_on_pos3|ἀγαθον|a", "ἀγαθοῦ|lsj.n260|forms.7010001210|adjstems.n260|adjinfl.os_h_on_pos4|ἀγαθου|a", "ἀγαθῆς|lsj.n260|forms.7010002210|adjstems.n260|adjinfl.os_h_on_pos5|ἀγαθης|a", "ἀγαθοῦ|lsj.n260|forms.7010003210|adjstems.n260|adjinfl.os_h_on_pos6|ἀγαθου|a", "ἀγαθῷ|lsj.n260|forms.7010001310|adjstems.n260|adjinfl.os_h_on_pos7|ἀγαθῳ|a", "ἀγαθῇ|lsj.n260|forms.7010002310|adjstems.n260|adjinfl.os_h_on_pos8|ἀγαθῃ|a", "ἀγαθῷ|lsj.n260|forms.7010003310|adjstems.n260|adjinfl.os_h_on_pos9|ἀγαθῳ|a", "ἀγαθόν|lsj.n260|forms.7010001410|adjstems.n260|adjinfl.os_h_on_pos10|ἀγαθον|a" … "ὑποτιθῆται|lsj.n109358|forms.3311220000|compounds.n109358|irreginfl.irregular2|ὑποτιθηται|a", "ὑποτιθώμεθα|lsj.n109358|forms.3131220000|compounds.n109358|irreginfl.irregular2|ὑποτιθωμεθα|a", "ὑποτιθῆσθε|lsj.n109358|forms.3231220000|compounds.n109358|irreginfl.irregular2|ὑποτιθησθε|a", "ὑποτιθῶνται|lsj.n109358|forms.3331220000|compounds.n109358|irreginfl.irregular2|ὑποτιθωνται|a", "ὑποτιθῶμαι|lsj.n109358|forms.3111230000|compounds.n109358|irreginfl.irregular2|ὑποτιθωμαι|a", "ὑποτιθῇ|lsj.n109358|forms.3211230000|compounds.n109358|irreginfl.irregular2|ὑποτιθῃ|a", "ὑποτιθῆται|lsj.n109358|forms.3311230000|compounds.n109358|irreginfl.irregular2|ὑποτιθηται|a", "ὑποτιθώμεθα|lsj.n109358|forms.3131230000|compounds.n109358|irreginfl.irregular2|ὑποτιθωμεθα|a", "ὑποτιθῆσθε|lsj.n109358|forms.3231230000|compounds.n109358|irreginfl.irregular2|ὑποτιθησθε|a", "ὑποτιθῶνται|lsj.n109358|forms.3331230000|compounds.n109358|irreginfl.irregular2|ὑποτιθωνται|a"], 
PolytonicGreek.LiteraryGreekOrthography("'αβγδεζηθικλμνξοπρςστυφχψωΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩϊϋόύώάέήίΰΐἀἁἂἃἄἅἆἇἈἉἊἋἌἍἎἏἐἑἒἓἔἕἘἙἚἛἜἝἠἡἢἣἤἥἦἧἨἩἪἫἬἭἮἯἰἱἲἳἴἵἶἷἸἹἺἻἼἽἾἿὀὁὂὃὄὅὈὉὊὋὌὍὐὑὒὓὔὕὖὗὙὛὝὟὠὡὢὣὤὥὦὧὨὩὪὫὬὭὮὯὰάὲέὴήὶίὸόὺύὼώᾀᾁᾂᾃᾄᾅᾆᾇᾈᾉᾊᾋᾌᾍᾎᾏᾐᾑᾒᾓᾔᾕᾖᾗᾘᾙᾚᾛᾜᾝᾞᾟᾠᾡᾢᾣᾤᾥᾦᾧᾨᾩᾪᾫᾬᾭᾮᾯᾲᾳᾴᾶᾷᾸᾹᾺΆᾼῂῃῄῆῇῈΈῊΉῌῒΐῖῗῘῙῚΊῢΰῤῥῦῧῪΎῬῲῳῴῶῷῸΌῺΏῼ \t\n(\".,;:)", DataType[Orthography.LexicalToken, Orthography.PunctuationToken]), "|")
goodparses = parsetoken("ἀγαθός", greekparser)
1-element Vector{CitableParserBuilder.Analysis}:
CitableParserBuilder.Analysis("ἀγαθός", lsj.n260, forms.7010001110, adjstems.n260, adjinfl.os_h_on_pos1, "ἀγαθος", "a")
Latin
We’re using the Tabulae system to build Latin parsers tailored to the corpora of the Complutensian’s Latin documents.
The current versions of our parsers for Latin documents are available here:
- for the 25-letter orthography of the Vulgate: http://shot.holycross.edu/tabulae/complut-lat25-current.cex
- for the 23-letter orthography of the Latin glosses: http://shot.holycross.edu/tabulae/complut-lat23-current.cex
Example of usage
using Tabulae, CitableBase
url = "http://shot.holycross.edu/tabulae/complut-lat25-current.cex"
latinparser = tabulaeStringParser(url, UrlReader)
Latin parser covering 184181 analyses.
latinparses = parsetoken("creavit", latinparser)
1-element Vector{CitableParserBuilder.Analysis}:
CitableParserBuilder.Analysis("creavit", ls.n11543, forms.3314110000, latcommon.verbn11543, lat25.c1pftact_pft3, "creavit", "A")