Published

January 15, 2025

Parsing the delimited-text data

The readtranche function reads an entire directory of ChatGPT’s output in one call. Here’s how to use it, assuming you have a variable named repo that points to a local clone of the LexiconMining GitHub repository.

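using LexiconMining
# Root directory holding all tranches of extracts
trancheroot = joinpath(repo, "suarez", "lewisshort-extracts", "extracts-cycle2")
dir = joinpath(trancheroot, "tranche11")
# Returns a pair: parsed entries and the names of files that failed to parse
(data11, failed11) = LexiconMining.readtranche(dir)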

The result comes in two parts: a data set composed of named tuples, and a list of the files that did not parse properly. Let’s look at the first few entries in each list:

data11[1:5]
5-element Vector{Any}:
 (seq = "11000", urn = "urn:cite2:hmt:ls.markdown:n10999", lemma = "converso ", definition = " to turn round; to abide or dwell somewhere ", pos = " verb (compound) ", morphology = " 1, converso, conversare, conversavi, conversatum")
 (seq = "11001", urn = "urn:cite2:hmt:ls.markdown:n11000", lemma = "conversus", definition = "turned, changed", pos = "participle", morphology = "uninflected")
 (seq = "11002", urn = "urn:cite2:hmt:ls.markdown:n11001", lemma = "conversus", definition = "turned, changed", pos = "participle", morphology = "uninflected")
 (seq = "11003", urn = "urn:cite2:hmt:ls.markdown:n11002", lemma = "conversus", definition = "a turning or twisting round", pos = "noun", morphology = "conversus, conversūs, m")
 (seq = "11004", urn = "urn:cite2:hmt:ls.markdown:n11003", lemma = "convertĭbĭlis", definition = "changeable", pos = "adjective", morphology = "convertĭbĭlis, convertĭbĭlis, convertĭbĭle")
failed11[1:5]
5-element Vector{Any}:
 "n11089.cex"
 "n11098.cex"
 "n11149.cex"
 "n11250.cex"
 "n11349.cex"
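The first entry above carries stray leading and trailing whitespace in several fields (for example, lemma = "converso "). A minimal sketch of one way to normalize that follows. The trimfields helper is hypothetical and not part of LexiconMining, and it assumes every field of an entry is a string, as in the output shown.

```julia
# Hypothetical helper (not part of LexiconMining): trim whitespace from
# every field of a named tuple, assuming all fields are strings.
trimfields(entry::NamedTuple) = map(s -> String(strip(s)), entry)

# A sample entry copied from the first record in the output above.
sample = (
    seq = "11000",
    urn = "urn:cite2:hmt:ls.markdown:n10999",
    lemma = "converso ",
    definition = " to turn round; to abide or dwell somewhere ",
    pos = " verb (compound) ",
    morphology = " 1, converso, conversare, conversavi, conversatum",
)

cleaned = trimfields(sample)
println(cleaned.lemma)   # prints converso, without the trailing space
```

Under the same assumption, map(trimfields, data11) would apply this to a whole tranche. Mapping over a named tuple keeps its field names, so the cleaned entries can be used the same way as the originals.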

What fraction of entries were parseable?

length(data11) / (length(data11) + length(failed11))
0.9878296146044625

Read a list of directories

# Directories for tranches 11 through 33
dirlist = [joinpath(trancheroot, "tranche$(i)") for i in 11:33]

# One (parsed, failed) pair per directory
resultpairs = map(LexiconMining.readtranche, dirlist)
# Flatten the per-directory results into single lists
allgood = map(r -> r[1], resultpairs) |> Iterators.flatten |> collect
allbad = map(r -> r[2], resultpairs) |> Iterators.flatten |> collect

length(allgood) / (length(allgood) + length(allbad))
0.988289979440422
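The overall figure can hide variation between tranches. As a rough sketch, the same ratio can be computed for each (parsed, failed) pair separately. The mockpairs data below is invented purely so the example runs on its own, and parserate is a hypothetical helper rather than a LexiconMining function.

```julia
# Invented stand-in for `resultpairs`: each element is a (parsed, failed) pair.
mockpairs = [
    (["entry1", "entry2", "entry3"], ["n00001.cex"]),
    (["entry4", "entry5"], String[]),
]

# Hypothetical helper: fraction of parseable entries in one pair.
parserate(pair) = length(pair[1]) / (length(pair[1]) + length(pair[2]))

rates = map(parserate, mockpairs)

# Express each rate as a percentage rounded to one decimal place.
percentages = map(r -> round(100 * r; digits = 1), rates)
println(percentages)   # prints [75.0, 100.0]
```

With the real data, map(parserate, resultpairs) would give one rate per directory in dirlist, which makes it easy to spot a tranche that parsed unusually poorly.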