Published January 17, 2025

Read ChatGPT’s summaries of Lewis-Short

Find ChatGPT’s summaries in the GitHub repository

In the LexiconMining.jl GitHub repository, the summaries directory contains ChatGPT’s summaries of Lewis-Short articles, organized in subdirectories of 1,000 entries each, named tranche0 .. tranche51.

We’ll start by getting a list of full paths to these directories on your local file system. First, define a variable named repo pointing to the root directory of your local copy of the LexiconMining repository.
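
For example (this path is hypothetical; point it at your own clone of the repository):

repo = joinpath(homedir(), "repos", "LexiconMining.jl") # hypothetical local path

With repo defined, we can assemble the tranche paths: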

summariesdir = joinpath(repo, "summaries")
tranchenames = filter(readdir(summariesdir)) do dir
    startswith(dir, "tranche")
end
tranchepaths = map(name -> joinpath(summariesdir, name), tranchenames)
length(tranchepaths)
52

Read summaries into named tuples

LexiconMining includes a readdata function that takes a list of directories and reads all the summaries into named tuples. It returns two objects: a vector of named tuples, one for each successfully parsed record, and a list of the records it could not parse.

using LexiconMining
(data, errs) = readdata(tranchepaths)

How many Lewis-Short articles did ChatGPT summarize?

good = length(data)
bad = length(errs)
totalarticles = good + bad
50262

Almost 99% of ChatGPT’s summaries can be parsed into these tuples:

pct = good / totalarticles
0.988599737376149

Working with the named tuples

Each tuple has the following fields: a number giving the article’s sequence in Lewis-Short (seq), a Cite2URN identifying the article (urn), a dictionary headword or lemma (lemma), a brief definition (definition), a part of speech (pos), and morphological information whose format varies with the part of speech (morphology). Here is ChatGPT’s summary for the 100th successfully parsed record:

data[100]
(seq = 103, urn = "urn:cite2:hmt:ls.markdown:n102", lemma = "ab-jurgo", definition = "to deny or refuse reproachfully", pos = "verb (compound)", morphology = "1st, ab-jurgo, ab-jurgare, ab-jurgavi, ab-jurgatum")
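
Individual fields can be read with Julia’s usual dot syntax for named tuples:

data[100].lemma
"ab-jurgo"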

Eliminating cross references

Many articles in Lewis-Short are actually just cross references to other articles. This is helpful for a human reader, but we want to exclude such entries when building a morphological database.

These cross references show up under two part-of-speech labels in ChatGPT’s summaries. Most obviously, some entries are labeled crossreference. Others are labeled participle: all such entries refer to the article for the verb the participle is derived from. We can eliminate both categories:

lexicaldata = filter(data) do tpl
    tpl.pos != "crossreference" &&
    tpl.pos != "participle"
end
lexical = length(lexicaldata)
47379

About 5% of the articles in Lewis-Short are actually just cross references.

lexical / good
0.953510837408682

The information in these named tuples can be quite useful by itself even before we integrate it into a Tabulae parser. Let’s look at a couple of examples.

Example 1: format a brief dictionary entry

We’ll take an arbitrary sliver of 9 entries from the lexical data for this tutorial. Of course, you could identify a more meaningful selection from the data, most reliably by selecting entries based on the article’s URN (see the sketch after the listing below).

sliver = lexicaldata[10002:10010]
9-element Vector{Any}:
 (seq = 10782, urn = "urn:cite2:hmt:ls.markdown:n10781", lemma = "con-tĕrēbro", definition = "to pierce or bore through", pos = "verb (compound)", morphology = "1st, conterebro, conterebrare, conterebravi, conterebratus")
 (seq = 10783, urn = "urn:cite2:hmt:ls.markdown:n10782", lemma = "conterebromius", definition = "humorously-coined epithet", pos = "adjective", morphology = "conterebromius, conterebromia, conterebromium")
 (seq = 10784, urn = "urn:cite2:hmt:ls.markdown:n10783", lemma = "contermĭno", definition = "to be a borderer, to border upon", pos = "verb (compound)", morphology = "1st,āre")
 (seq = 10785, urn = "urn:cite2:hmt:ls.markdown:n10784", lemma = "contermĭnum", definition = "boundary, border", pos = "noun", morphology = "contermĭnum, contermĭni, neuter")
 (seq = 10786, urn = "urn:cite2:hmt:ls.markdown:n10785", lemma = "con-terminus", definition = "bordering upon, neighboring", pos = "adjective", morphology = "con-terminus, con-termina, con-terminum")
 (seq = 10788, urn = "urn:cite2:hmt:ls.markdown:n10787", lemma = "conternatio", definition = "a placing of three things together", pos = "noun", morphology = "conternatio, conternationis, feminine")
 (seq = 10789, urn = "urn:cite2:hmt:ls.markdown:n10788", lemma = "con-terno", definition = "to put three things together, to make threefold", pos = "verb (compound)", morphology = "1, con-terno, con-ternāre, con-ternāvi, con-ternātum")
 (seq = 10790, urn = "urn:cite2:hmt:ls.markdown:n10789", lemma = "contero", definition = "to grind, bruise, diminish by rubbing, waste, destroy", pos = "verb (compound)", morphology = "3, contero, conterere, contrivi, contritum")
 (seq = 10791, urn = "urn:cite2:hmt:ls.markdown:n10790", lemma = "con-terraneus", definition = "a fellow-countryman", pos = "noun", morphology = "con-terraneus, con-terranei, masculine")
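
As a minimal sketch of URN-based selection (the URN value here is copied from the listing above, purely for illustration), you can filter on the urn field:

filter(tpl -> tpl.urn == "urn:cite2:hmt:ls.markdown:n10783", lexicaldata)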

Here’s a basic string-formatting function that presents an entry like a familiar glossary or vocabulary entry in a textbook. It composes a single string with some labels and light Markdown highlighting to make the entry easier to read.

function markdown_gloss(tpl)
    string(
        "**", tpl.lemma, "**",
        " *", tpl.definition, "*. ",
        "Part of speech: *", tpl.pos, "*",
        " Forms: **", tpl.morphology, "**"
    )
end
markdown_gloss (generic function with 1 method)

We can use our new function to map each article to a single string for one glossary entry, then join the entries together, separating them with two newlines (\n\n).

glosses = map(tpl -> markdown_gloss(tpl), sliver)
glosstext = "#### Sample glossary\n\n" * join(glosses, "\n\n")
"#### Sample glossary\n\n**con-tĕrēbro** *to pierce or bore through*. Part of speech: *verb (compound)* Forms: **1st, conterebro, conterebrare, conterebravi, conterebratus**\n\n**conterebromius** *humorously-coined epithet*. Part of speech: *adjective* Forms: **conterebromius" ⋯ 672 bytes ⋯ "*contero** *to grind, bruise, diminish by rubbing, waste, destroy*. Part of speech: *verb (compound)* Forms: **3, contero, conterere, contrivi, contritum**\n\n**con-terraneus** *a fellow-countryman*. Part of speech: *noun* Forms: **con-terraneus, con-terranei, masculine**"

The result is a single string containing the Markdown glossary for our selection of entries. We can use Markdown.parse to get a visual rendering of the string.

using Markdown
Markdown.parse(glosstext)

Sample glossary

con-tĕrēbro to pierce or bore through. Part of speech: verb (compound) Forms: 1st, conterebro, conterebrare, conterebravi, conterebratus

conterebromius humorously-coined epithet. Part of speech: adjective Forms: conterebromius, conterebromia, conterebromium

contermĭno to be a borderer, to border upon. Part of speech: verb (compound) Forms: 1st,āre

contermĭnum boundary, border. Part of speech: noun Forms: contermĭnum, contermĭni, neuter

con-terminus bordering upon, neighboring. Part of speech: adjective Forms: con-terminus, con-termina, con-terminum

conternatio a placing of three things together. Part of speech: noun Forms: conternatio, conternationis, feminine

con-terno to put three things together, to make threefold. Part of speech: verb (compound) Forms: 1, con-terno, con-ternāre, con-ternāvi, con-ternātum

contero to grind, bruise, diminish by rubbing, waste, destroy. Part of speech: verb (compound) Forms: 3, contero, conterere, contrivi, contritum

con-terraneus a fellow-countryman. Part of speech: noun Forms: con-terraneus, con-terranei, masculine

Example 2: count distribution of articles by part of speech

It would be interesting to know how ChatGPT classified the part of speech of each article. Let’s start by isolating the part-of-speech values.

posvalues = map(tpl -> tpl.pos, lexicaldata)
47379-element Vector{String}:
 "uninflected"
 "preposition"
 "interjection"
 "noun"
 "preposition"
 "uninflected"
 "noun"
 "noun"
 "adjective"
 "noun"
 ⋮
 "noun"
 "noun"
 "noun"
 "noun"
 "noun"
 "noun"
 "noun"
 "noun"
 "noun"

The countmap function from the Julia StatsBase package counts occurrences of each value for us; if we convert the result to an ordered dictionary, we can sort the results by frequency. Sorting in “reverse” order puts the most frequent values first.

using StatsBase, OrderedCollections
poscounts = countmap(posvalues) |> OrderedDict
sort!(poscounts; rev=true, byvalue=true)
OrderedDict{String, Int64} with 151 entries:
  "noun"                     => 25287
  "adjective"                => 10769
  "verb (compound)"          => 7242
  "adverb"                   => 2388
  "verb"                     => 549
  "uninflected"              => 504
  "interjection"             => 60
  "pronoun"                  => 54
  "participle and adjective" => 48
  "conjunction"              => 43
  "preposition"              => 40
  "adjective and noun"       => 40
  "adv."                     => 27
  "n/a"                      => 24
  "participle (compound)"    => 21
  "adverb and preposition"   => 17
  ""                         => 14
  "adv"                      => 12
  "adjective, noun"          => 12
  ⋮                          => ⋮

Although a Latinist might first notice the outlier labels with only a handful of occurrences (e.g., the empty string, or adv), ChatGPT’s classification is very useful even before we regularize the exceptions: the top eight categories (from noun down to pronoun in the list above) cover nearly 99% of the lexical entries in Lewis-Short.

top8 = collect(values(poscounts))[1:8]
sum(top8) / lexical
0.9888980349944068
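
As a hedged sketch of the kind of regularization mentioned above, the stray adverb labels could be folded into the main category (this mapping covers only the two variants visible in the counts and is purely illustrative):

normalized = map(posvalues) do pos
    # fold the outlier spellings "adv" and "adv." into "adverb" (illustrative only)
    pos in ("adv", "adv.") ? "adverb" : pos
end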