Working directly with parser data sources

Published June 8, 2024

The functionality defined by the abstract CitableParser type can be implemented in different ways. Beginning with version 0.26.0 of the CitableParserBuilder.jl package, a second tier of abstractions defines types for parsers backed by a dataframe or by a vector of delimited-text lines. This page shows you how to convert one type of parser to another and how to work directly with the data source underlying a parser.

Dataframe-backed parsers

Example: the GettysburgParser

The GettysburgParser type is a descendant of the abstract CitableParser, so we can use functions like parsetoken and orthography that work with any CitableParser.

graph LR
    AbstractDFParser --> CitableParser
    AbstractStringParser --> CitableParser
    AbstractDictParser --> CitableParser
    GettysburgParser --> AbstractDFParser

using CitableParserBuilder
gparser = CitableParserBuilder.gettysburgParser()
parsetoken("score", gparser)
1-element Vector{Analysis}:
 Analysis(InlineStrings.String15("score"), gburglex.score, pennpos.NN, gburgstem.score, gburgrule.pennid)
orthography(gparser) |> typeof
Orthography.SimpleAscii

It is also a direct subtype of AbstractDFParser, so we can use additional functions that apply to dataframe-backed parsers.

gparser  |> typeof |> supertype
AbstractDFParser

You can get direct access to the backing DataFrame with the datasource function:

df = datasource(gparser)
# display first 5 rows of DataFrame:
df[1:5,:]
5×5 DataFrame
 Row │ Token     Lexeme            Form         Stem               Rule
     │ String15  String31          String15     String31           String31
─────┼─────────────────────────────────────────────────────────────────────────
   1 │ come      gburglex.come     pennpos.VBN  gburgstem.come     gburgrule.pennid
   2 │ brought   gburglex.brought  pennpos.VBD  gburgstem.brought  gburgrule.pennid
   3 │ carried   gburglex.carried  pennpos.VBD  gburgstem.carried  gburgrule.pennid
   4 │ we        gburglex.we       pennpos.PRP  gburgstem.we       gburgrule.pennid
   5 │ equal     gburglex.equal    pennpos.JJ   gburgstem.equal    gburgrule.pennid
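
Since datasource hands back an ordinary DataFrame, you can explore the parser's data with standard DataFrames.jl operations. The grouping and filtering below are only a sketch using the column names shown above; they illustrate working with the data source and are not part of the CitableParserBuilder API.

using DataFrames

df = datasource(gparser)

# Count how many rows share each value of the Form column
# (i.e., how often each part-of-speech tag occurs in the parser's data).
combine(groupby(df, :Form), nrow => :count)

# Select the rows for a single token.
filter(row -> row.Token == "score", df)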

String-backed parsers

String-backed parsers use a Vector of delimited-text strings to store data. You can easily build one manually.

src = """Token|Lexeme|Form|Stem|Rule
et|ls.n16278|morphforms:1000000001|stems.example1|rules.example1
"""
srclines = split(src,"\n")
manual = StringParser(srclines)
StringParser(SubString{String}["Token|Lexeme|Form|Stem|Rule", "et|ls.n16278|morphforms:1000000001|stems.example1|rules.example1", ""], Orthography.SimpleAscii("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,-:;!?'\"()[] \t\n", Orthography.TokenCategory[Orthography.LexicalToken(), Orthography.NumericToken(), Orthography.PunctuationToken()]), "|")
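
The hand-built parser then works like any other CitableParser. For example, you can parse the single token it knows about:

parsetoken("et", manual)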

You can also convert a dataframe-backed parser to a StringParser:

stringbacked = StringParser(gparser)
StringParser(SubString{String}["Token|Lexeme|Form|Stem|Rule", "come|gburglex.come|pennpos.VBN|gburgstem.come|gburgrule.pennid", "brought|gburglex.brought|pennpos.VBD|gburgstem.brought|gburgrule.pennid", "carried|gburglex.carried|pennpos.VBD|gburgstem.carried|gburgrule.pennid", "we|gburglex.we|pennpos.PRP|gburgstem.we|gburgrule.pennid", "equal|gburglex.equal|pennpos.JJ|gburgstem.equal|gburgrule.pennid", "seven|gburglex.seven|pennpos.CD|gburgstem.seven|gburgrule.pennid", "propriety|gburglex.propriety|pennpos.VB|gburgstem.propriety|gburgrule.pennid", "live|gburglex.live|pennpos.VB|gburgstem.live|gburgrule.pennid", "that|gburglex.that|pennpos.DT|gburgstem.that|gburgrule.pennid"  …  "gave|gburglex.gave|pennpos.VBD|gburgstem.gave|gburgrule.pennid", "civil|gburglex.civil|pennpos.JJ|gburgstem.civil|gburgrule.pennid", "men|gburglex.men|pennpos.NNS|gburgstem.men|gburgrule.pennid", "great|gburglex.great|pennpos.JJ|gburgstem.great|gburgrule.pennid", "all|gburglex.all|pennpos.RB|gburgstem.all|gburgrule.pennid", "poor|gburglex.poor|pennpos.JJ|gburgstem.poor|gburgrule.pennid", "But|gburglex.But|pennpos.CC|gburgstem.But|gburgrule.pennid", "living|gburglex.living|pennpos.VBG|gburgstem.living|gburgrule.pennid", "from|gburglex.from|pennpos.IN|gburgstem.from|gburgrule.pennid", ""], Orthography.SimpleAscii("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,-:;!?'\"()[] \t\n", Orthography.TokenCategory[Orthography.LexicalToken(), Orthography.NumericToken(), Orthography.PunctuationToken()]), "|")

String-backed parsers behave identically for the functions that work with any citable parser.

parsetoken("score", stringbacked) == parsetoken("score", gparser)
true
orthography(stringbacked) |> typeof
Orthography.SimpleAscii

The StringParser type is a direct subtype of AbstractStringParser.

stringbacked  |> typeof |> supertype
AbstractStringParser

This gives us access to the delimiter function, which tells us how the lines of delimited text are structured.

delimiter(stringbacked)
"|"

The same datasource function provides the underlying data, but now in the form of a Vector of strings (including a header line).

datalines = datasource(stringbacked)
# The first five lines:
datalines[1:5]
5-element Vector{SubString{String}}:
 "Token|Lexeme|Form|Stem|Rule"
 "come|gburglex.come|pennpos.VBN|gburgstem.come|gburgrule.pennid"
 "brought|gburglex.brought|pennpos.VBD|gburgstem.brought|gburgrule.pennid"
 "carried|gburglex.carried|pennpos.VBD|gburgstem.carried|gburgrule.pennid"
 "we|gburglex.we|pennpos.PRP|gburgstem.we|gburgrule.pennid"