Build your own `CitableParser`

Published

August 7, 2024

This DIY recipe will “parse” a form by normalizing tokens to lower case, but then simply reporting the same analysis for every token it sees. It illustrates how simple the basic framework for building a parser is: two steps are all it takes.

1. Define your parser type

Use CitableParserBuilder, and define your parser as a subtype of CitableParser.

In Julia, it could look like this if your parser object contained no information beyond a labelling string.

using CitableParserBuilder
struct FakeParser <: CitableParser
    label::AbstractString
end

2. Implement `parsetoken`

The second step is to import CitableParserBuilder: parsetoken and then define a method of parsetoken for your parser’s type. You’re done! You can now use parsetoken and all the other functions of the ParserTrait.

In the following cell, we’ll wrap the whole package in a module named DIYParser, then use DIYParser along with CitableParserBuilder and Orthography to parse a token.

module DIYParser
    using CitableParserBuilder, Orthography
    import CitableParserBuilder: parsetoken
    export FakeParser

    struct FakeParser <: CitableParser
        label::AbstractString
    end

    function parsetoken(s::AbstractString, parser::FakeParser, ortho::T)  where {T <: OrthographicSystem}
        # Returns only the same analysis no matter
        # what the token is
        [
            Analysis(
            s,
            LexemeUrn("fakeparser.nolexemeanalysis"),
            FormUrn("fakeparser.noformanalysis"),
            StemUrn("fakeparser.nostemanalysis"),
            RuleUrn("fakeparser.noruleanalysis"),
            lowercase(s), "A"
        )
        ]
    end
end # end of module


# Use the new module along with CitableParserBuilder:
using .DIYParser, CitableParserBuilder, Orthography
fakeParser = FakeParser("Parser generating dummy values")
tokenparses = parsetoken("Word", fakeParser, simpleAscii())

1-element Vector{Analysis}:
 Analysis("Word", fakeparser.nolexemeanalysis, fakeparser.noformanalysis, fakeparser.nostemanalysis, fakeparser.noruleanalysis, "word", "A")

The result is a generic Vector of Analysis objects which we can use just like the output of any other CitableParser.

tokenparses .|> token

1-element Vector{String}:
 "Word"

tokenparses .|> mtoken

1-element Vector{String}:
 "word"

tokenparses .|> lexemeurn

1-element Vector{LexemeUrn}:
 fakeparser.nolexemeanalysis

tokenparses .|> formurn

1-element Vector{FormUrn}:
 fakeparser.noformanalysis

1. Define your parser type

2. Implement parsetoken

2. Implement `parsetoken`