The GettysburgParser

Published

June 6, 2024

The GettysburgParser used in this demonstration works with a simple dictionary mapping tokens to POS tags. The dictionary was constructed by wrapping the Python NLTK POS tagger in a Julia function. This page documents how to do that, so that you can apply the NLTK tagger to any list of tokens from Julia.

Python prerequisites

You need Python with the nltk package installed.

pip install nltk

Then start Python and, at the Python prompt, run:

import nltk
nltk.download()

A Julia wrapper

# If you're on a system with Python accessible
# and the nltk module installed, you can actually
# execute all the code blocks on this page.
repo = pwd() |> dirname |> dirname |> dirname
gburgfile = joinpath(repo,"test","data","gettysburg","gettysburgcorpus.cex")
using CitableCorpus
corpus = corpus_fromfile(gburgfile, "|")

In Julia, you can make the NLTK module’s tag function available like this:

using Conda
Conda.add("nltk")
using PyCall
# pyimport replaces the deprecated @pyimport macro
ptag = pyimport("nltk.tag")

Now, given a citable corpus named corpus, we can use functions from TextAnalysis to extract a lexicon of unique tokens and apply the NLTK tagger to it.

using CitableCorpusAnalysis
using TextAnalysis
tacorp = tacorpus(corpus)

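# Collect the token list of each document in the corpus
tkns = []
for doc in tacorp.documents
    push!(tkns, tokens(doc))
end
# Flatten into a single list of unique tokens (the lexicon)
tknlist = tkns |> Iterators.flatten |> collect |> unique
# Tag the lexicon with NLTK: each result pairs a token with its POS tag
tagged = ptag.pos_tag(tknlist)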
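
The parser described at the top of this page works from a dictionary of tokens to POS tags. A minimal sketch of building such a dictionary from tagger results might look like the following. It assumes that PyCall hands the tagger's output back to Julia as a collection of (token, tag) pairs. The sample tagged value and the name posdict are hypothetical: the pairs are hand-written for illustration and are not actual NLTK output.

```julia
# Hypothetical stand-in for the result of the tagging step.
# Assumption: each element is a (token, tag) pair of strings.
# These pairs are hand-written for illustration, not real tagger output.
tagged = [
    ("Four", "CD"),
    ("score", "NN"),
    ("and", "CC"),
    ("seven", "CD"),
    ("years", "NNS"),
    ("ago", "RB"),
]

# Build a dictionary mapping each token to its POS tag.
posdict = Dict(token => tag for (token, tag) in tagged)

# Look up a token; `get` returns the fallback value (here `nothing`)
# for tokens that are not in the lexicon.
yearstag = get(posdict, "years", nothing)
missingtag = get(posdict, "Gettysburg", nothing)
```

Using get with a fallback rather than indexing directly avoids a KeyError for tokens that never appeared in the corpus the lexicon was built from.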