Working directly with parser data sources
The functionality defined by the abstract CitableParser type can be implemented in different ways. Beginning with version 0.26.0 of the CitableParserBuilder.jl package, a second tier of abstractions defines types for parsers backed by a dataframe, or by a vector of delimited-text lines. This page shows you how to convert one type of parser to another, and how to work directly with the data source underlying the parser.
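The idea of a second tier of abstract types can be sketched in plain Julia. This is only an illustration under stated assumptions, not the package's actual source: every name beginning with Demo or demo_ is hypothetical, and the lookup logic is a stand-in for whatever the real parsers do.

```julia
# Hypothetical two-tier hierarchy: one top-level abstraction,
# with a second tier grouping parsers by their kind of data source.
abstract type DemoParser end
abstract type DemoTableParser <: DemoParser end    # stand-in for a dataframe-backed tier
abstract type DemoStringParser <: DemoParser end   # stand-in for a string-backed tier

# A concrete parser that stores delimited-text lines (header line first).
struct DemoLineParser <: DemoStringParser
    lines::Vector{String}
    delim::String
end

# Accessors for the underlying data and its delimiter.
demo_datasource(p::DemoLineParser) = p.lines
demo_delimiter(p::DemoLineParser) = p.delim

# A function written against the second-tier abstraction works for
# any concrete type that supplies the two accessors above.
function demo_lookup(p::DemoStringParser, token::AbstractString)
    prefix = token * demo_delimiter(p)
    filter(line -> startswith(line, prefix), demo_datasource(p)[2:end])
end

demo = DemoLineParser(
    ["Token|Lexeme", "score|gburglex.score", "we|gburglex.we"],
    "|",
)
matches = demo_lookup(demo, "score")
```

Here supertype(typeof(demo)) is DemoStringParser, mirroring the way a concrete parser sits directly under one of the second-tier abstract types.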
Dataframe-backed parsers
Example: the GettysburgParser
The GettysburgParser type is a descendant of the abstract CitableParser, so we can use functions like parsetoken and orthography that work with any CitableParser.
using CitableParserBuilder
gparser = CitableParserBuilder.gettysburgParser()
parsetoken("score", gparser)
1-element Vector{Analysis}:
Analysis(InlineStrings.String15("score"), gburglex.score, pennpos.NN, gburgstem.score, gburgrule.pennid)
orthography(gparser) |> typeof
Orthography.SimpleAscii
It is also a direct subtype of AbstractDFParser, so we can use additional functions that apply to dataframe-backed parsers.
gparser |> typeof |> supertype
AbstractDFParser
You can get direct access to the backing DataFrame, for example, with the datasource function:
df = datasource(gparser)
# display first 5 rows of DataFrame:
df[1:5,:]
Row | Token | Lexeme | Form | Stem | Rule |
---|---|---|---|---|---|
| | String15 | String31 | String15 | String31 | String31 |
1 | come | gburglex.come | pennpos.VBN | gburgstem.come | gburgrule.pennid |
2 | brought | gburglex.brought | pennpos.VBD | gburgstem.brought | gburgrule.pennid |
3 | carried | gburglex.carried | pennpos.VBD | gburgstem.carried | gburgrule.pennid |
4 | we | gburglex.we | pennpos.PRP | gburgstem.we | gburgrule.pennid |
5 | equal | gburglex.equal | pennpos.JJ | gburgstem.equal | gburgrule.pennid |
String-backed parsers
String-backed parsers use a Vector of delimited-text strings to store data. You can easily build one manually.
src = """Token|Lexeme|Form|Stem|Rule
et|ls.n16278|morphforms:1000000001|stems.example1|rules.example1
"""
srclines = split(src,"\n")
manual = StringParser(srclines)
StringParser(SubString{String}["Token|Lexeme|Form|Stem|Rule", "et|ls.n16278|morphforms:1000000001|stems.example1|rules.example1", ""], Orthography.SimpleAscii("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,-:;!?'\"()[] \t\n", Orthography.TokenCategory[Orthography.LexicalToken(), Orthography.NumericToken(), Orthography.PunctuationToken()]), "|")
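Note that the vector shown above ends with an empty string, because the source text ends with a newline. If you would rather not carry that trailing empty element, one option is Base Julia's keepempty keyword for split. The following is a small sketch using only Base functions; whether the empty element matters to StringParser is not something this page states, so treat the cleanup as optional.

```julia
src = """Token|Lexeme|Form|Stem|Rule
et|ls.n16278|morphforms:1000000001|stems.example1|rules.example1
"""

# Default behavior: the final newline yields a trailing empty string.
withempty = split(src, "\n")

# keepempty = false drops empty fields, leaving only the header and data line.
srclines = split(src, "\n"; keepempty = false)
```

The resulting srclines vector holds just the header line and the single data line.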
You can also convert a dataframe-backed parser to a StringParser:
stringbacked = StringParser(gparser)
StringParser(SubString{String}["Token|Lexeme|Form|Stem|Rule", "come|gburglex.come|pennpos.VBN|gburgstem.come|gburgrule.pennid", "brought|gburglex.brought|pennpos.VBD|gburgstem.brought|gburgrule.pennid", "carried|gburglex.carried|pennpos.VBD|gburgstem.carried|gburgrule.pennid", "we|gburglex.we|pennpos.PRP|gburgstem.we|gburgrule.pennid", "equal|gburglex.equal|pennpos.JJ|gburgstem.equal|gburgrule.pennid", "seven|gburglex.seven|pennpos.CD|gburgstem.seven|gburgrule.pennid", "propriety|gburglex.propriety|pennpos.VB|gburgstem.propriety|gburgrule.pennid", "live|gburglex.live|pennpos.VB|gburgstem.live|gburgrule.pennid", "that|gburglex.that|pennpos.DT|gburgstem.that|gburgrule.pennid" … "gave|gburglex.gave|pennpos.VBD|gburgstem.gave|gburgrule.pennid", "civil|gburglex.civil|pennpos.JJ|gburgstem.civil|gburgrule.pennid", "men|gburglex.men|pennpos.NNS|gburgstem.men|gburgrule.pennid", "great|gburglex.great|pennpos.JJ|gburgstem.great|gburgrule.pennid", "all|gburglex.all|pennpos.RB|gburgstem.all|gburgrule.pennid", "poor|gburglex.poor|pennpos.JJ|gburgstem.poor|gburgrule.pennid", "But|gburglex.But|pennpos.CC|gburgstem.But|gburgrule.pennid", "living|gburglex.living|pennpos.VBG|gburgstem.living|gburgrule.pennid", "from|gburglex.from|pennpos.IN|gburgstem.from|gburgrule.pennid", ""], Orthography.SimpleAscii("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,-:;!?'\"()[] \t\n", Orthography.TokenCategory[Orthography.LexicalToken(), Orthography.NumericToken(), Orthography.PunctuationToken()]), "|")
String-backed parsers work identically for functions applying to all citable parsers.
parsetoken("score", stringbacked) == parsetoken("score", gparser)
true
orthography(stringbacked) |> typeof
Orthography.SimpleAscii
The StringParser type is a direct subtype of AbstractStringParser.
stringbacked |> typeof |> supertype
AbstractStringParser
This gives us access to the delimiter function to find out how lines of delimited text are structured.
delimiter(stringbacked)
"|"
The same datasource function provides the underlying data, but now in the form of a Vector of strings (including a header line).
datalines = datasource(stringbacked)
# The first five lines:
datalines[1:5]
5-element Vector{SubString{String}}:
"Token|Lexeme|Form|Stem|Rule"
"come|gburglex.come|pennpos.VBN|gburgstem.come|gburgrule.pennid"
"brought|gburglex.brought|pennpos.VBD|gburgstem.brought|gburgrule.pennid"
"carried|gburglex.carried|pennpos.VBD|gburgstem.carried|gburgrule.pennid"
"we|gburglex.we|pennpos.PRP|gburgstem.we|gburgrule.pennid"
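Because the data source is plain delimited text, you can process it with ordinary string functions. Below is a minimal sketch using only Base Julia. The sample lines are copied from the output above, the helper name splitrecords is hypothetical, and in a live session the lines and the delimiter would come from datasource(stringbacked) and delimiter(stringbacked) instead of being typed in by hand.

```julia
# Sample lines copied from the output above, plus a trailing empty line
# like the one a StringParser's data may carry.
datalines = [
    "Token|Lexeme|Form|Stem|Rule",
    "come|gburglex.come|pennpos.VBN|gburgstem.come|gburgrule.pennid",
    "brought|gburglex.brought|pennpos.VBD|gburgstem.brought|gburgrule.pennid",
    "",
]
delim = "|"   # in a live session: delimiter(stringbacked)

# Hypothetical helper: use the header line as keys and turn each
# non-empty data line into a Dict of column name => value.
function splitrecords(lines, delim)
    header = String.(split(first(lines), delim))
    records = Dict{String,String}[]
    for line in lines[2:end]
        isempty(line) && continue                    # skip blank lines
        cols = String.(split(line, delim))
        length(cols) == length(header) || continue   # skip malformed lines
        push!(records, Dict(zip(header, cols)))
    end
    return records
end

records = splitrecords(datalines, delim)
records[1]["Lexeme"]   # the Lexeme column of the first data line
```

Keying each record by the header names keeps the sketch independent of column order, at the cost of building one Dict per line.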