Working directly with parser data sources

Published June 8, 2024

The functionality defined by the abstract CitableParser type can be implemented in different ways. Beginning with version 0.26.0 of the CitableParserBuilder.jl package, a second tier of abstractions defines types for parsers backed by a dataframe or by a vector of delimited-text lines. This page shows you how to convert one type of parser to another and how to work directly with the data source underlying a parser.

Dataframe-backed parsers

Example: the GettysburgParser

The GettysburgParser type is a descendant of the abstract CitableParser, so we can use functions like parsetoken and orthography that work with any CitableParser.

graph LR
    AbstractDFParser --> CitableParser
    AbstractStringParser --> CitableParser
    AbstractDictParser --> CitableParser
    GettysburgParser --> AbstractDFParser

using CitableParserBuilder
gparser = CitableParserBuilder.gettysburgParser()
parsetoken("score", gparser)
1-element Vector{Analysis}:
 Analysis(InlineStrings.String15("score"), gburglex.score, pennpos.NN, gburgstem.score, gburgrule.pennid)
orthography(gparser) |> typeof
Orthography.SimpleAscii

It is also a direct subtype of AbstractDFParser, so we can use additional functions that apply to dataframe-backed parsers.

gparser  |> typeof |> supertype
AbstractDFParser

You can get direct access to the backing DataFrame with the datasource function:

df = datasource(gparser)
# display first 5 rows of DataFrame:
df[1:5,:]
5×5 DataFrame
 Row │ Token     Lexeme            Form         Stem               Rule
     │ String15  String31          String15     String31           String31
─────┼─────────────────────────────────────────────────────────────────────────
   1 │ come      gburglex.come     pennpos.VBN  gburgstem.come     gburgrule.pennid
   2 │ brought   gburglex.brought  pennpos.VBD  gburgstem.brought  gburgrule.pennid
   3 │ carried   gburglex.carried  pennpos.VBD  gburgstem.carried  gburgrule.pennid
   4 │ we        gburglex.we       pennpos.PRP  gburgstem.we       gburgrule.pennid
   5 │ equal     gburglex.equal    pennpos.JJ   gburgstem.equal    gburgrule.pennid
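
Since datasource hands back an ordinary DataFrame, you can explore the parser's data with standard DataFrames.jl operations. The grouping and filtering below are only a sketch using the column names shown above; they illustrate working with the data source and are not part of the CitableParserBuilder API.

using DataFrames

df = datasource(gparser)

# Count how many rows share each value of the Form column
# (i.e., how often each part-of-speech tag occurs in the parser's data).
combine(groupby(df, :Form), nrow => :count)

# Select the rows for a single token.
filter(row -> row.Token == "score", df)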

String-backed parsers

String-backed parsers use a Vector of delimited-text strings to store data. You can easily build one manually.

src = """Token|Lexeme|Form|Stem|Rule
et|ls.n16278|morphforms:1000000001|stems.example1|rules.example1
"""
srclines = split(src,"\n")
manual = StringParser(srclines)
StringParser(SubString{String}["Token|Lexeme|Form|Stem|Rule", "et|ls.n16278|morphforms:1000000001|stems.example1|rules.example1", ""], Orthography.SimpleAscii("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,-:;!?'\"()[] \t\n", Orthography.TokenCategory[Orthography.LexicalToken(), Orthography.NumericToken(), Orthography.PunctuationToken()]), "|")
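
The hand-built parser then works like any other CitableParser. For example, you can parse the single token it knows about:

parsetoken("et", manual)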

You can also convert a dataframe-backed parser to a StringParser:

stringbacked = StringParser(gparser)
StringParser(SubString{String}["Token|Lexeme|Form|Stem|Rule", "come|gburglex.come|pennpos.VBN|gburgstem.come|gburgrule.pennid", "brought|gburglex.brought|pennpos.VBD|gburgstem.brought|gburgrule.pennid", "carried|gburglex.carried|pennpos.VBD|gburgstem.carried|gburgrule.pennid", "we|gburglex.we|pennpos.PRP|gburgstem.we|gburgrule.pennid", "equal|gburglex.equal|pennpos.JJ|gburgstem.equal|gburgrule.pennid", "seven|gburglex.seven|pennpos.CD|gburgstem.seven|gburgrule.pennid", "propriety|gburglex.propriety|pennpos.VB|gburgstem.propriety|gburgrule.pennid", "live|gburglex.live|pennpos.VB|gburgstem.live|gburgrule.pennid", "that|gburglex.that|pennpos.DT|gburgstem.that|gburgrule.pennid"  …  "gave|gburglex.gave|pennpos.VBD|gburgstem.gave|gburgrule.pennid", "civil|gburglex.civil|pennpos.JJ|gburgstem.civil|gburgrule.pennid", "men|gburglex.men|pennpos.NNS|gburgstem.men|gburgrule.pennid", "great|gburglex.great|pennpos.JJ|gburgstem.great|gburgrule.pennid", "all|gburglex.all|pennpos.RB|gburgstem.all|gburgrule.pennid", "poor|gburglex.poor|pennpos.JJ|gburgstem.poor|gburgrule.pennid", "But|gburglex.But|pennpos.CC|gburgstem.But|gburgrule.pennid", "living|gburglex.living|pennpos.VBG|gburgstem.living|gburgrule.pennid", "from|gburglex.from|pennpos.IN|gburgstem.from|gburgrule.pennid", ""], Orthography.SimpleAscii("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,-:;!?'\"()[] \t\n", Orthography.TokenCategory[Orthography.LexicalToken(), Orthography.NumericToken(), Orthography.PunctuationToken()]), "|")

String-backed parsers behave identically for the functions that work with any citable parser.

parsetoken("score", stringbacked) == parsetoken("score", gparser)
true
orthography(stringbacked) |> typeof
Orthography.SimpleAscii

The StringParser type is a direct subtype of AbstractStringParser.

stringbacked  |> typeof |> supertype
AbstractStringParser

This gives us access to the delimiter function, which tells us how the lines of delimited text are structured.

delimiter(stringbacked)
"|"

The same datasource function provides the underlying data, but now in the form of a Vector of strings (including a header line).

datalines = datasource(stringbacked)
# The first five lines:
datalines[1:5]
5-element Vector{SubString{String}}:
 "Token|Lexeme|Form|Stem|Rule"
 "come|gburglex.come|pennpos.VBN|gburgstem.come|gburgrule.pennid"
 "brought|gburglex.brought|pennpos.VBD|gburgstem.brought|gburgrule.pennid"
 "carried|gburglex.carried|pennpos.VBD|gburgstem.carried|gburgrule.pennid"
 "we|gburglex.we|pennpos.PRP|gburgstem.we|gburgrule.pennid"