```mermaid
graph LR
  AbstractDFParser --> CitableParser
  AbstractStringParser --> CitableParser
  AbstractDictParser --> CitableParser
  GettysburgParser --> AbstractDFParser
```
# Working directly with parser data sources
The functionality defined by the abstract `CitableParser` type can be implemented in different ways. Beginning with version 0.26.0 of the CitableParserBuilder.jl package, a second tier of abstractions defines types for parsers backed by a dataframe or by a vector of delimited-text lines. This page shows you how to convert one type of parser to another, and how to work directly with the data source underlying a parser.
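For orientation, you can check those subtype relationships directly in the REPL. This is a minimal sketch, assuming the abstract type names shown in the diagram above are exported by CitableParserBuilder:

```julia
using CitableParserBuilder

# Both second-tier abstractions descend from the common CitableParser abstraction
# (a sketch; assumes these names are exported by the package):
AbstractDFParser <: CitableParser       # expect true
AbstractStringParser <: CitableParser   # expect true
```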
## Dataframe-backed parsers
### Example: the GettysburgParser
The `GettysburgParser` type is a descendant of the abstract `CitableParser` type, so we can use functions like `parsetoken` and `orthography` that work with any `CitableParser`.
```julia
using CitableParserBuilder
gparser = CitableParserBuilder.gettysburgParser()
```
```julia
parsetoken("score", gparser)
```
```
1-element Vector{Analysis}:
 Analysis(InlineStrings.String15("score"), gburglex.score, pennpos.NN, gburgstem.score, gburgrule.pennid)
```
```julia
orthography(gparser) |> typeof
```
```
Orthography.SimpleAscii
```
It is also a direct subtype of `AbstractDFParser`, so we can use additional functions that apply to dataframe-backed parsers.
```julia
gparser |> typeof |> supertype
```
```
AbstractDFParser
```
You can get direct access to the backing DataFrame with the `datasource` function:
```julia
df = datasource(gparser)
# display first 5 rows of DataFrame:
df[1:5,:]
```

| Row | Token | Lexeme | Form | Stem | Rule |
|---|---|---|---|---|---|
|  | String15 | String31 | String15 | String31 | String31 |
| 1 | come | gburglex.come | pennpos.VBN | gburgstem.come | gburgrule.pennid |
| 2 | brought | gburglex.brought | pennpos.VBD | gburgstem.brought | gburgrule.pennid |
| 3 | carried | gburglex.carried | pennpos.VBD | gburgstem.carried | gburgrule.pennid |
| 4 | we | gburglex.we | pennpos.PRP | gburgstem.we | gburgrule.pennid |
| 5 | equal | gburglex.equal | pennpos.JJ | gburgstem.equal | gburgrule.pennid |
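Because `datasource` returns an ordinary DataFrame, the usual DataFrames.jl operations apply to it. A minimal sketch of looking up the rows recorded for a single token (assumes DataFrames.jl is available in your environment, and uses the column names shown in the table above):

```julia
using DataFrames

# Find every row whose Token column matches "score"
# (a sketch; column names follow the table displayed above):
filter(row -> row.Token == "score", df)
```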
## String-backed parsers
String-backed parsers use a Vector of delimited-text strings to store data. You can easily build one manually.
```julia
src = """Token|Lexeme|Form|Stem|Rule
et|ls.n16278|morphforms:1000000001|stems.example1|rules.example1
"""
srclines = split(src,"\n")
manual = StringParser(srclines)
```
```
StringParser(SubString{String}["Token|Lexeme|Form|Stem|Rule", "et|ls.n16278|morphforms:1000000001|stems.example1|rules.example1", ""], Orthography.SimpleAscii("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,-:;!?'\"()[] \t\n", Orthography.TokenCategory[Orthography.LexicalToken(), Orthography.NumericToken(), Orthography.PunctuationToken()]), "|")
```
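The manually built parser behaves like any other citable parser. A minimal sketch of using it, where the shape of the result reflects the one delimited data line supplied above:

```julia
# Parse the single token defined in the delimited source
# (a sketch; the resulting Analysis echoes the lexeme, form, stem and rule columns above):
analyses = parsetoken("et", manual)
length(analyses)   # expect a single analysis for "et"
```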
You can also convert a dataframe-backed parser to a `StringParser`:
```julia
stringbacked = StringParser(gparser)
```
```
StringParser(SubString{String}["Token|Lexeme|Form|Stem|Rule", "come|gburglex.come|pennpos.VBN|gburgstem.come|gburgrule.pennid", "brought|gburglex.brought|pennpos.VBD|gburgstem.brought|gburgrule.pennid", "carried|gburglex.carried|pennpos.VBD|gburgstem.carried|gburgrule.pennid", "we|gburglex.we|pennpos.PRP|gburgstem.we|gburgrule.pennid", "equal|gburglex.equal|pennpos.JJ|gburgstem.equal|gburgrule.pennid", "seven|gburglex.seven|pennpos.CD|gburgstem.seven|gburgrule.pennid", "propriety|gburglex.propriety|pennpos.VB|gburgstem.propriety|gburgrule.pennid", "live|gburglex.live|pennpos.VB|gburgstem.live|gburgrule.pennid", "that|gburglex.that|pennpos.DT|gburgstem.that|gburgrule.pennid" … "gave|gburglex.gave|pennpos.VBD|gburgstem.gave|gburgrule.pennid", "civil|gburglex.civil|pennpos.JJ|gburgstem.civil|gburgrule.pennid", "men|gburglex.men|pennpos.NNS|gburgstem.men|gburgrule.pennid", "great|gburglex.great|pennpos.JJ|gburgstem.great|gburgrule.pennid", "all|gburglex.all|pennpos.RB|gburgstem.all|gburgrule.pennid", "poor|gburglex.poor|pennpos.JJ|gburgstem.poor|gburgrule.pennid", "But|gburglex.But|pennpos.CC|gburgstem.But|gburgrule.pennid", "living|gburglex.living|pennpos.VBG|gburgstem.living|gburgrule.pennid", "from|gburglex.from|pennpos.IN|gburgstem.from|gburgrule.pennid", ""], Orthography.SimpleAscii("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,-:;!?'\"()[] \t\n", Orthography.TokenCategory[Orthography.LexicalToken(), Orthography.NumericToken(), Orthography.PunctuationToken()]), "|")
```
String-backed parsers work identically with functions that apply to any citable parser.
```julia
parsetoken("score", stringbacked) == parsetoken("score", gparser)
```
```
true
```
```julia
orthography(stringbacked) |> typeof
```
```
Orthography.SimpleAscii
```
It is a direct subtype of `AbstractStringParser`.
```julia
stringbacked |> typeof |> supertype
```
```
AbstractStringParser
```
This gives us access to the `delimiter` function to find out how lines of delimited text are structured.
```julia
delimiter(stringbacked)
```
```
"|"
```
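Combined with the `datasource` function demonstrated next, the delimiter lets you split any data line back into its named columns. A minimal sketch, using only functions shown on this page:

```julia
# Split the first data line (after the header) into its five columns
# (a sketch; column order follows the header line "Token|Lexeme|Form|Stem|Rule"):
firstline = datasource(stringbacked)[2]
split(firstline, delimiter(stringbacked))
```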
The same `datasource` function provides the underlying data, but now in the form of a `Vector` of strings (including a header line).
```julia
datalines = datasource(stringbacked)
# The first five lines:
datalines[1:5]
```
```
5-element Vector{SubString{String}}:
 "Token|Lexeme|Form|Stem|Rule"
 "come|gburglex.come|pennpos.VBN|gburgstem.come|gburgrule.pennid"
 "brought|gburglex.brought|pennpos.VBD|gburgstem.brought|gburgrule.pennid"
 "carried|gburglex.carried|pennpos.VBD|gburgstem.carried|gburgrule.pennid"
 "we|gburglex.we|pennpos.PRP|gburgstem.we|gburgrule.pennid"
```