Package version: 0.3.0
January 17, 2025
In the LexiconMining.jl GitHub repository, the summaries directory contains ChatGPT’s summaries of Lewis-Short articles, organized in subdirectories of 1,000 entries each, named tranche0 through tranche51.
We’ll start by getting a list of full paths to these directories in your local file system. Define a variable named repo pointing to the root directory of the LexiconMining repository, and collect the directory names:
summariesdir = joinpath(repo, "summaries")
tranchenames = filter(readdir(summariesdir)) do dir
    startswith(dir, "tranche")
end
tranchepaths = map(name -> joinpath(summariesdir, name), tranchenames)
length(tranchepaths)
52
LexiconMining includes a readdata function that takes a list of directories and reads all the summaries into named tuples. It returns two objects: the first is a vector of named tuples, one for each successfully parsed record; the second is a list of the records it could not parse.
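In sketch form, the call looks like the following. This is an assumption about the interface based on the description above: it presumes readdata is exported by the package and uses the tranchepaths vector built earlier.

```julia
# Sketch only: assumes LexiconMining exports `readdata` and that
# `tranchepaths` holds the tranche directory paths collected above.
using LexiconMining

# First value: vector of named tuples for parsed records;
# second value: records that could not be parsed.
data, failures = readdata(tranchepaths)
```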
How many Lewis-Short articles did ChatGPT summarize?
Almost 99% of ChatGPT’s summaries can be parsed into these tuples:
Each tuple has the following fields: a number giving the sequence of the article in Lewis-Short (seq), a Cite2URN identifying the article (urn), a dictionary headword or lemma (lemma), a brief definition (definition), a part of speech (pos), and morphological information whose format varies with the part of speech (morphology). Here is what ChatGPT’s summary of the 100th entry in Lewis-Short looks like:
(seq = 103, urn = "urn:cite2:hmt:ls.markdown:n102", lemma = "ab-jurgo", definition = "to deny or refuse reproachfully", pos = "verb (compound)", morphology = "1st, ab-jurgo, ab-jurgare, ab-jurgavi, ab-jurgatum")
Many articles in Lewis-Short are actually just cross references to other articles. This is helpful for a human reader, but we want to exclude these in building a morphological database.
These duplicate references show up in two categories for part of speech in ChatGPT’s summaries. Most obviously, some entries are identified with crossreference. Others are identified as participle: all such entries refer to the article for the verb the participle is derived from. We can eliminate both categories:
lexicaldata = filter(data) do tpl
tpl.pos != "crossreference" &&
tpl.pos != "participle"
end
lexical = length(lexicaldata)
47379
About 5% of the articles in Lewis-Short are actually just cross references.
The information in these named tuples can be quite useful by itself even before we integrate it into a Tabulae parser. Let’s look at a couple of examples.
We’ll take an arbitrary sliver of 9 entries from the lexical data for this tutorial. Of course, you could identify a more meaningful selection from the data (most reliably, by selecting entries based on each article’s URN).
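A minimal sketch of taking such a slice follows. The starting index is purely illustrative (an assumption, not the index used to produce the listing below); any contiguous range of 9 entries would work the same way.

```julia
# Illustrative only: slice 9 consecutive entries from the filtered data.
# The starting index is an assumption chosen for this sketch.
sliver = lexicaldata[10001:10009]
```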
9-element Vector{Any}:
(seq = 10782, urn = "urn:cite2:hmt:ls.markdown:n10781", lemma = "con-tĕrēbro", definition = "to pierce or bore through", pos = "verb (compound)", morphology = "1st, conterebro, conterebrare, conterebravi, conterebratus")
(seq = 10783, urn = "urn:cite2:hmt:ls.markdown:n10782", lemma = "conterebromius", definition = "humorously-coined epithet", pos = "adjective", morphology = "conterebromius, conterebromia, conterebromium")
(seq = 10784, urn = "urn:cite2:hmt:ls.markdown:n10783", lemma = "contermĭno", definition = "to be a borderer, to border upon", pos = "verb (compound)", morphology = "1st,āre")
(seq = 10785, urn = "urn:cite2:hmt:ls.markdown:n10784", lemma = "contermĭnum", definition = "boundary, border", pos = "noun", morphology = "contermĭnum, contermĭni, neuter")
(seq = 10786, urn = "urn:cite2:hmt:ls.markdown:n10785", lemma = "con-terminus", definition = "bordering upon, neighboring", pos = "adjective", morphology = "con-terminus, con-termina, con-terminum")
(seq = 10788, urn = "urn:cite2:hmt:ls.markdown:n10787", lemma = "conternatio", definition = "a placing of three things together", pos = "noun", morphology = "conternatio, conternationis, feminine")
(seq = 10789, urn = "urn:cite2:hmt:ls.markdown:n10788", lemma = "con-terno", definition = "to put three things together, to make threefold", pos = "verb (compound)", morphology = "1, con-terno, con-ternāre, con-ternāvi, con-ternātum")
(seq = 10790, urn = "urn:cite2:hmt:ls.markdown:n10789", lemma = "contero", definition = "to grind, bruise, diminish by rubbing, waste, destroy", pos = "verb (compound)", morphology = "3, contero, conterere, contrivi, contritum")
(seq = 10791, urn = "urn:cite2:hmt:ls.markdown:n10790", lemma = "con-terraneus", definition = "a fellow-countryman", pos = "noun", morphology = "con-terraneus, con-terranei, masculine")
Here’s a basic string-formatting function that presents an entry like a familiar glossary or vocabulary entry in a textbook. It composes a single string with some labels and light Markdown highlighting to make the entry easier to read.
function markdown_gloss(tpl)
string(
"**", tpl.lemma, "**",
" *", tpl.definition, "*. ",
"Part of speech: *", tpl.pos, "*",
" Forms: **", tpl.morphology, "**"
)
end
markdown_gloss (generic function with 1 method)
We can use our new function to map each article to a single string for one glossary entry, then join the entries together, separating them with two newlines (\n\n).
glosses = map(tpl -> markdown_gloss(tpl), sliver)
glosstext = "#### Sample glossary\n\n" * join(glosses, "\n\n")
"#### Sample glossary\n\n**con-tĕrēbro** *to pierce or bore through*. Part of speech: *verb (compound)* Forms: **1st, conterebro, conterebrare, conterebravi, conterebratus**\n\n**conterebromius** *humorously-coined epithet*. Part of speech: *adjective* Forms: **conterebromius" ⋯ 672 bytes ⋯ "*contero** *to grind, bruise, diminish by rubbing, waste, destroy*. Part of speech: *verb (compound)* Forms: **3, contero, conterere, contrivi, contritum**\n\n**con-terraneus** *a fellow-countryman*. Part of speech: *noun* Forms: **con-terraneus, con-terranei, masculine**"
The result is a single string containing the Markdown glossary for our selection of entries. We can use Markdown.parse to get a visual rendering of the string.
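Markdown.parse comes from Julia’s standard-library Markdown module; in a notebook or the REPL, the parsed object is displayed with the formatting applied.

```julia
using Markdown

# Parse the composed string into a Markdown.MD object, which
# notebook and REPL environments render with formatting applied.
rendered = Markdown.parse(glosstext)
```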
#### Sample glossary
**con-tĕrēbro** *to pierce or bore through*. Part of speech: *verb (compound)* Forms: **1st, conterebro, conterebrare, conterebravi, conterebratus**
**conterebromius** *humorously-coined epithet*. Part of speech: *adjective* Forms: **conterebromius, conterebromia, conterebromium**
**contermĭno** *to be a borderer, to border upon*. Part of speech: *verb (compound)* Forms: **1st,āre**
**contermĭnum** *boundary, border*. Part of speech: *noun* Forms: **contermĭnum, contermĭni, neuter**
**con-terminus** *bordering upon, neighboring*. Part of speech: *adjective* Forms: **con-terminus, con-termina, con-terminum**
**conternatio** *a placing of three things together*. Part of speech: *noun* Forms: **conternatio, conternationis, feminine**
**con-terno** *to put three things together, to make threefold*. Part of speech: *verb (compound)* Forms: **1, con-terno, con-ternāre, con-ternāvi, con-ternātum**
**contero** *to grind, bruise, diminish by rubbing, waste, destroy*. Part of speech: *verb (compound)* Forms: **3, contero, conterere, contrivi, contritum**
**con-terraneus** *a fellow-countryman*. Part of speech: *noun* Forms: **con-terraneus, con-terranei, masculine**
It would be interesting to know how ChatGPT has classified the part of speech for each article. Let’s start by isolating the part-of-speech values.
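A sketch of that step, assuming lexicaldata holds the filtered tuples and each tuple has the pos field described earlier:

```julia
# Pull out the `pos` field from every named tuple in the filtered data.
posvalues = map(tpl -> tpl.pos, lexicaldata)
```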
47379-element Vector{String}:
"uninflected"
"preposition"
"interjection"
"noun"
"preposition"
"uninflected"
"noun"
"noun"
"adjective"
"noun"
⋮
"noun"
"noun"
"noun"
"noun"
"noun"
"noun"
"noun"
"noun"
"noun"
The Julia StatsBase package will count occurrences of each value for us; if we convert the result to an ordered dictionary, we can sort the results by frequency. Sorting in “reverse” order puts the most frequent value first.
using StatsBase, OrderedCollections
poscounts = countmap(posvalues) |> OrderedDict
sort!(poscounts; rev=true, byvalue=true)
OrderedDict{String, Int64} with 151 entries:
"noun" => 25287
"adjective" => 10769
"verb (compound)" => 7242
"adverb" => 2388
"verb" => 549
"uninflected" => 504
"interjection" => 60
"pronoun" => 54
"participle and adjective" => 48
"conjunction" => 43
"preposition" => 40
"adjective and noun" => 40
"adv." => 27
"n/a" => 24
"participle (compound)" => 21
"adverb and preposition" => 17
"" => 14
"adv" => 12
"adjective, noun" => 12
⋮ => ⋮
Although a Latinist might first notice the outlier labels with only a handful of occurrences (e.g., the empty string, or adv), ChatGPT’s classification is very useful even before we regularize the exceptions: the top eight categories (from noun down to pronoun in the list above) cover 99% of the lexical entries in Lewis-Short.
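That coverage figure is easy to check from the counts themselves. A sketch, assuming poscounts is the frequency dictionary built above:

```julia
# Share of entries accounted for by the eight most frequent labels.
counts = sort(collect(values(poscounts)); rev = true)
coverage = sum(counts[1:8]) / sum(counts)
```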