Earley
Parsing all context-free grammars using Earley's algorithm.
LTS Haskell 22.42: | 0.13.0.1 |
Stackage Nightly 2024-11-18: | 0.13.0.1 |
Latest on Hackage: | 0.13.0.1 |
Earley-0.13.0.1@sha256:266f3ba67527ba4e7f62f2ab78b836f8dc1993ead5d03ab3cc739f9675b95be2,4706
Module documentation for 0.13.0.1
Earley
Go to the API documentation on Hackage.
This (Text.Earley) is a library consisting of a few main parts:
Text.Earley.Grammar
An embedded context-free grammar (CFG) domain-specific language (DSL) with semantic action specification in applicative style.
An example of a typical expression grammar working on an input tokenised into strings is the following:
expr :: Grammar r (Prod r String String Expr)
expr = mdo
x1 <- rule $ Add <$> x1 <* namedToken "+" <*> x2
<|> x2
<?> "sum"
x2 <- rule $ Mul <$> x2 <* namedToken "*" <*> x3
<|> x3
<?> "product"
x3 <- rule $ Var <$> (satisfy ident <?> "identifier")
<|> namedToken "(" *> x1 <* namedToken ")"
return x1
where
ident (x:_) = isAlpha x
ident _ = False
Text.Earley.Parser
An implementation of (a modification of) the Earley parsing algorithm.
To invoke the parser on the above grammar, run e.g. (here using words
as a
stupid tokeniser):
fullParses (parser expr) $ words "a + b * ( c + d )"
= ( [Add (Var "a") (Mul (Var "b") (Add (Var "c") (Var "d")))]
, Report {...}
)
Note that we get a list of all the possible parses (though in this case there is only one).
Another invocation, which shows the error reporting capabilities (giving the last position that the parser reached and what it expected at that point), is the following:
fullParses (parser expr) $ words "a +"
= ( []
, Report { position = 2
, expected = ["(","identifier","product"]
, unconsumed = []
}
)
Text.Earley.Generator
Functionality to generate the members of the language that a grammar generates.
To get the language of a grammar given a list of allowed tokens, run e.g.:
language (generator romanNumeralsGrammar "VIX")
= [(0,""),(1,"I"),(5,"V"),(10,"X"),(20,"XX"),(11,"XI"),(15,"XV"),(6,"VI"),(9,"IX"),(4,"IV"),(2,"II"),(3,"III"),(19,"XIX"),(16,"XVI"),(14,"XIV"),(12,"XII"),(7,"VII"),(21,"XXI"),(25,"XXV"),(30,"XXX"),(31,"XXXI"),(35,"XXXV"),(8,"VIII"),(13,"XIII"),(17,"XVII"),(26,"XXVI"),(29,"XXIX"),(24,"XXIV"),(22,"XXII"),(18,"XVIII"),(36,"XXXVI"),(39,"XXXIX"),(34,"XXXIV"),(32,"XXXII"),(23,"XXIII"),(27,"XXVII"),(33,"XXXIII"),(28,"XXVIII"),(37,"XXXVII"),(38,"XXXVIII")]
The above example shows the language generated by a Roman numerals
grammar limited to the tokens 'V'
, 'I'
, and
'X'
.
Text.Earley.Mixfix
Helper functionality for creating parsers for expressions with mixfix identifiers in the style of Agda.
How do I write grammars?
As hinted at above, the grammars are written inside Grammar
, which is a
Monad
and MonadFix
. For the library to be able to tame the recursion in
the grammars, we have to use the rule
function whenever a production is
recursive.
Whenever you would write e.g.
...
p = foo <|> bar <*> p
...
in a conventional combinator parser library, you instead write the following:
grammar = mdo
...
p <- rule $ foo <|> bar <*> p
...
Apart from making it possible to do recursion (even left-recursion), rule
s
have an additional benefit: they control where work is shared, by the rule that
any rule
is only ever expanded once per position in the input string. If a
rule
is encountered more than once at a position, the work is shared.
Compared to parser generators and combinator libraries
This library differs from the main methods that are used to write parsers in the Haskell ecosystem:
-
Compared to parser generators (YACC, Happy, etc.) it requires very little pre-processing of the grammar. It also allows you to stay in the host language for both grammar and parser, i.e. there is no use of a separate tool. This also means that you are free to use the abstraction facilities of Haskell when writing a grammar. Currently the library requires a linear traversal of the grammar’s rules before use, which is usually fast enough to do at run time, but precludes infinite grammars.
-
The grammar language is similar to that of many parser combinators (Parsec, Attoparsec, parallel parsing processes, etc.), providing an applicative interface, but the parser gracefully handles all finite CFGs, including those with left-recursion. On the other hand, its productions are not monadic meaning that it does not support context-sensitive or infinite grammars, which are supported by many parser combinator libraries.
Note: The
Grammar
type is aMonad
(used to provide observable sharing) but it lives a layer above productions. It cannot be used to decide what production to use depending on the result of a previous production, i.e. it does not give us monadic parsing.
The parsing algorithm
The parsing algorithm that this library uses is based on Earley’s parsing algorithm. The algorithm has been modified to produce online parse results, to give good error messages, and to allow garbage collection of the item sets. Essentially, instead of storing a sequence of sets of items like in the original algorithm, the modified algorithm just stores pointers back to sets of reachable items.
The worst-case run time performance of the Earley parsing algorithm is cubic in the length of the input, but for large classes of grammars it is linear. It should however be noted that this library will likely be slower than most parser generators and parser combinator libraries.
The parser implements an optimisation similar to that presented in Joop M.I.M Leo’s paper A general context-free parsing algorithm running in linear time on every LR(k) grammar without using lookahead, which removes indirections in sequences of non-ambiguous backpointers between item sets.
For more in-depth information about the internals of the library, there are implementation notes currently being written.
Contact
Olle Fredriksson - https://github.com/ollef
Changes
Unreleased
0.13.0.1
- Add a missing test module to the Cabal file
0.13.0.0
- Remove the previously deprecated operators
symbol
,namedSymbol
, andword
- Change
Prod
’sMonoid
andSemigroup
instances to lift their element instances instead of being the same as theAlternative
instance - Add unbalanced parentheses/EOF test
0.12.1.0
- GHC 8.4.1 support
- Update ‘base’ dependency bounds
- Add
Semigroup
instance to theProd
type
0.12.0.1
- Update ‘base’ dependency bounds
0.12.0.0
- Add the
Generator
module for generating grammar members - Change (simplify) the type returned by
parser
, introducing aParser
type synonym for it, and change the signature ofallParses
,fullParses
, andreport
to accept aParser
- The
Text.Earley.Internal
module is nowText.Earley.Parser.Internal
0.11.0.1
- Add missing modules to Cabal file
0.11.0.0
- Add
IsString Prod
instance - Change the signature of
Terminal
to take a functiona -> Maybe b
, and add a new operatorterminal
- Move
satisfy
to theDerived
module - Add the
token
,namedToken
, andlist
operators - Deprecate the
symbol
,namedSymbol
, andword
operators (use the above instead) - Add the
listLike
operator
0.10.1.0
- Fix bug concerning nullable rules (#14)
- Add
runGrammar
0.10.0.1
- Add changelog
0.10
- Remove
Args
, and useResults
instead - Make
parser
function not take input directly - Remove redundant type parameter to
Grammar
0.9
- Optimise handling of nullable non-terminals
- Pass a record of arguments in the parse routine
- Add support for consecutive mixfix holes