Parsing the delimited-text data
The readtranche function reads an entire directory of ChatGPT’s output. Here’s how to use it, assuming you have a variable named repo pointing to a clone of the LexiconMining GitHub repository.
using LexiconMining
trancheroot = joinpath(repo, "suarez", "lewisshort-extracts", "extracts-cycle2")
dir = joinpath(trancheroot, "tranche11")
(data11, failed11) = LexiconMining.readtranche(dir)
The result comes in two parts: a data set composed of named tuples, and a list of files that did not parse properly. Let’s look at the first few entries in each list:
data11[1:5]
5-element Vector{Any}:
(seq = "11000", urn = "urn:cite2:hmt:ls.markdown:n10999", lemma = "converso ", definition = " to turn round; to abide or dwell somewhere ", pos = " verb (compound) ", morphology = " 1, converso, conversare, conversavi, conversatum")
(seq = "11001", urn = "urn:cite2:hmt:ls.markdown:n11000", lemma = "conversus", definition = "turned, changed", pos = "participle", morphology = "uninflected")
(seq = "11002", urn = "urn:cite2:hmt:ls.markdown:n11001", lemma = "conversus", definition = "turned, changed", pos = "participle", morphology = "uninflected")
(seq = "11003", urn = "urn:cite2:hmt:ls.markdown:n11002", lemma = "conversus", definition = "a turning or twisting round", pos = "noun", morphology = "conversus, conversūs, m")
(seq = "11004", urn = "urn:cite2:hmt:ls.markdown:n11003", lemma = "convertĭbĭlis", definition = "changeable", pos = "adjective", morphology = "convertĭbĭlis, convertĭbĭlis, convertĭbĭle")
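Several fields in the entries above carry stray leading or trailing spaces (for example the lemma "converso " and the part of speech " verb (compound) "), which can break exact comparisons. A minimal sketch of one way to tidy them, assuming every field is a string; the `sample` vector is stand-in data copied from the output above, and `tidy` is a hypothetical helper, not part of LexiconMining:

```julia
# Stand-in records shaped like the named tuples shown above
# (copied by hand, not produced by readtranche).
sample = [
    (seq = "11000", urn = "urn:cite2:hmt:ls.markdown:n10999", lemma = "converso ",
     definition = " to turn round; to abide or dwell somewhere ",
     pos = " verb (compound) ",
     morphology = " 1, converso, conversare, conversavi, conversatum"),
    (seq = "11003", urn = "urn:cite2:hmt:ls.markdown:n11002", lemma = "conversus",
     definition = "a turning or twisting round", pos = "noun",
     morphology = "conversus, conversūs, m"),
]

# Hypothetical helper: strip surrounding whitespace from every field.
# Assumes all values are strings; map over a NamedTuple keeps its field names.
tidy(entry::NamedTuple) = map(v -> String(strip(v)), entry)

cleaned = map(tidy, sample)

# With tidy fields, exact matching on part of speech works as expected.
nouns = filter(e -> e.pos == "noun", cleaned)
```

Whether to normalize at read time or only when querying is a design choice; stripping once up front keeps later filters simple.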
failed11[1:5]
5-element Vector{Any}:
"n11089.cex"
"n11098.cex"
"n11149.cex"
"n11250.cex"
"n11349.cex"
What fraction of entries were parseable?
length(data11) / (length(data11) + length(failed11))
0.9878296146044625
Read a list of directories
dirlist = [joinpath(trancheroot, "tranche$(i)") for i in 11:33]
resultpairs = map(d -> LexiconMining.readtranche(d), dirlist)
allgood = map(r -> r[1], resultpairs) |> Iterators.flatten |> collect
allbad = map(r -> r[2], resultpairs) |> Iterators.flatten |> collect
length(allgood) / (length(allgood) + length(allbad))
0.988289979440422
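The aggregate figure can hide variation between directories. A rough sketch of a per-tranche breakdown, assuming each result is a (parsed, failed) pair as in the code above; `demo_pairs` and `demo_names` are made-up stand-ins for `resultpairs` and the directory names, and `rate` is a hypothetical helper:

```julia
# Stand-in for `resultpairs`: each element is a (parsed, failed) pair.
# The contents are invented for illustration only.
demo_pairs = [
    (["a", "b", "c"], ["x.cex"]),
    (["d", "e"], String[]),
]
demo_names = ["tranche11", "tranche12"]

# Hypothetical helper: fraction of parseable entries in one pair.
rate(pair) = length(pair[1]) / (length(pair[1]) + length(pair[2]))

# Print one line per directory, as a percentage rounded to one decimal place.
for (name, pair) in zip(demo_names, demo_pairs)
    println(name, ": ", round(100 * rate(pair); digits = 1), "%")
end

rates = map(rate, demo_pairs)
```

With the real data, the same pattern would apply to `zip(dirlist, resultpairs)`; a directory with no files at all would need a guard against dividing by zero.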