Parsec Internals

I’ve spent quite a bit of time looking through the internals of Haskell’s Parsec monadic parser combinator library. This post is intended as a brain dump so what in six months when I’ve forgotten it all I can come back and remember what I’ve forgotten. This post is based on parsec-3.1.3 which is the current version as of April 22, 2013.

What I am particularly interested in for this post is the way the ParsecT monad is constructed and the implications that has for stretching the Parsec Library beyond its original design.

ParsecT

Like all truly well made Haskell code, the lowest level of Parsec is amazingly small and flexible. Most of the code in the Parsec library is there to present concise forms of the most used combinators. It is a big library with a small core. The core consists of the ParsecT monad, the manyAccum and try combinator, the TokenPrimEx parser, and the functions runParsecT and mkPT. The documentation in the code suggests that most of these need not be used directly. All else is some combination this core.

Without going any further, lets list the ParsecT data type:

newtype ParsecT s u m a
    = ParsecT {unParser :: forall b .
		 State s u
	      -> (a -> State s u -> ParseError -> m b) -- consumed ok
	      -> (ParseError -> m b)                   -- consumed err
	      -> (a -> State s u -> ParseError -> m b) -- empty ok
	      -> (ParseError -> m b)                   -- empty err
	      -> m b
	     }

There is a lot of rather opaque stuff in this so lets unpack it beginning with the type parameters (s u m a).

The first s is the type of stream of tokens to be parsed. The original, and still most natural structure is a simple list of tokens. For a parser without a scanner, this could in fact be a list of Char, and Parsec has a higher-level parser that makes that choice. While it is not listed here because it doesn’t need to be, all actual implementations of ParsecT will insist that s be a part of a Stream class. More on this below. A final cautionary note, ‘s’ is more often used in the code as a reference to a State rather then a Stream.

The next parameter u is a user defined state. This is what allows Parsec to parse context-sensitive grammars. There are no restrictions on u and it acts like the state variable of a State monad.

Third, ‘m’ is the underlying monad. ‘ParsecT’ is a monad transformer and can stack on top of any other Monad. If that Monad is a MonadIO then the parser can perform IO during a parse. Note that while functions that use ParsecT insist on m being a monad the type definition does not.

Finally, a is the contained type of the resulting monad. In simple terms it is the result of what has been parsed.

Now lets turn to the structure of ParsecT itself. It is really nothing more then a function. The function takes a state which among other things contains the current token stream, and four continuation functions. The result is absolutely generic, forall b . m b. The only way for a parser to create a result is to call one of the four continuations. The only place where b becomes concrete is in the definition of runParsecT.

To understand what is happening here, lets look back at the definition of ParsecT in 3.0.0:

data ParsecT s u m a
    = ParsecT { runParsecT :: State s u -> m (Consumed (m (Reply s u a))) }

data Consumed a  = Consumed a
                 | Empty !a

data Reply s u a = Ok !a !(State s u) ParseError
                 | Error ParseError

Notice that ‘runParsecT’ was accessors to the data type, so a ParsecT was directly runnable. The result of feeding a state to this function was a somewhat complicated looking result. (Note that this is the same result currently given by runParsecT in 3.1.3) This result can be summarized into one of four results if we for the moment assume m is Identity: 1) input consumed, good result; 2) input consumed, error; 3) no input consumed, good result; 4) no input consumed, error.

The structure is created this way so that the GC can through out the head of the input stream as soon as it is successfully parsed (unless there is a <|> combinator holding some part of the stream in its closure).

The difficulty is that this structure slows down the common case where m is Identity. Bind must evaluate the underlying monad at each combination which slowed the parser down 1.8 times over the non-transformer version. To remedy this the resultant data structure was converted to its functional form (see Latter). The four continuations in the 3.1.3 version each represent one of the possible results: consumed ok cok, consumed error cerr, empty ok eok, and empty error eerr. A parser will call one of these results.

ParsecT as a Monad

The genius of this is in ParsecT’s bind operator. Bind builds up a stack of continuations for good parse, but just passes the continuations for errors through to each parser. When a parser encounters an error it calls one of the error continuations without having to unwind a call stack or plow through the back end of a bound chain of more clearly failing parsers.

Let’s take a look at the code for bind:

parserBind m k
  = ParsecT $ \s cok cerr eok eerr ->
    let
        -- consumed-okay case for m
        mcok x s err =
            let
                 -- if (k x) consumes, those go straigt up
                 pcok = cok
                 pcerr = cerr
                                               
                 -- if (k x) doesn't consume input, but is okay,
                 -- we still return in the consumed continuation
                 peok x s err' = cok x s (mergeError err err')

                 -- if (k x) doesn't consume input, but errors,
                 -- we return the error in the 'consumed-error'
                 -- continuation
                 peerr err' = cerr (mergeError err err')
            in  unParser (k x) s pcok pcerr peok peerr                      

        -- empty-ok case for m
        meok x s err =
            let
                -- in these cases, (k x) can return as empty
                pcok = cok
                peok x s err' = eok x s (mergeError err err')
                pcerr = cerr
                peerr err' = eerr (mergeError err err') 
            in  unParser (k x) s pcok pcerr peok peerr
        -- consumed-error case for m
        mcerr = cerr

        -- empty-error case for m
        meerr = eerr

    in unParser m s mcok mcerr meok meerr

I find it easiest to read this from bottom to top. (I also prefer where to let when I can get away with it.) The phrase unParser m s mcok mcerr meok meerr is going to run the parser to the left of bind (m >>= k) called m in this function. But, instead of using the closures provided, it is going to use four newly created closures (mcok, mcerr, meok, and meerr) found further up in the let binding.

Now in point of fact two of those new functions (those representing errors) are exactly identical to the closures passed into the combined parser. It is now a question of what to do when the first parser, m succeeds. The functions mcok and meok both pass continuations on to the parser derived from the monadic function k. This function (k :: a -> ParsecT s u m b) uses the result of m (called x) to create a new parser. (k x) is this parser. It then runs (k x) using the state remaining after m (confusingly also called s) and a set of closures named with ‘p’ prefixes that are modifications of the input closures.

The difference between mcok and meok stems from needing to describe whether the combined parser consumed input. In the meok case the only changes made to the provided continuations is the merging of error messages a process that is beyond the scope of this document. In mcok however, m has consumed input. This means that even if (k x) is empty the combined parser must always use the consumed closures. peerr, the empty error closure for (k x) get promoted to call cerr. Likewise, peok gets promoted to use the pcok closure.

Before moving on, I want to point out some interesting things about the monad instance of ParsecT. Like in all monads, when we call bind the second argument takes the results of the first as an argument. In the world of parsers this is rather unique. Most parsers are Applicative Functors but distinctly not monads. In the language of parser theory we say they are “context free.” There is a parser combinator very similar to Parsec called Attoparsec that fits this context free, Applicative Functors model. It claims greater speed then Parsec but is less expressive. As a monad, Parsec can form context sensitive parsers.

Another interesting element that comes from using closures instead of Algebraic data types is the lack of constraint on the monad transformer. ParsecT caries an underlying monad and is therefore a monad transformer. Most instances of a monad transformers the monad instance is constrained by having the underlying instance be a Monad, ie instance (Monad m) => Monad (transT m) where .... On the other hand, ParsecT s u m is a Monad whether or not m is also. Only at the point where (k x) requires a monadic action does m need to become a monad. When m is Identity there is no transformer overhead.

Some interesting primitive Parsers and Combinators

An interesting example of how the continuation format can be manipulated is found in the try parser:

try :: ParsecT s u m a -> ParsecT s u m a
try p =
    ParsecT $ \s cok _ eok eerr ->
    unParser p s cok eerr eok eerr

What on its face looks like it ought to be a very complicated combinator is in fact very strait forward. We create a new parser that acts just like p but will call the “empty error (eerr) continuation even if p consumed input. The state s is actually captured in <|> rather then try.

Now there is a small “gotcha” here. Every parser I have written has used some kind of list as its stream; either a list of Char as for a typical string parser or a list of Token’s generated by Alex. But one could, in theory use parsec to read directly from a file handle or network socket. In this case a subtle note in the documentation should through up a big red flag, “A Stream instance is responsible for maintaining the ‘position within the stream’ in the stream state @s@. This is trivial unless you are using the monad in a non-trivial way.” The short version of this is that a stream must be able to “back up” to the point it was at when it was captured in a closure. A simple handle or socket will fail when used with try because there is not logic to rewind the stream.

The obvious solution to this is to read the contents into memory - potentially in a lazy fashion - and allow the GC to collect the head of that stream when it is no longer needed.

There is only one primitive parser that actually consumes input, tokenPrinEx. This function uses a lot of functions to handle error reporting and source position tracking. Ignoring all this we can just look at the final argument and result, [...] -> (t -> Maybe a) -> ParsecT s u m a. It takes a test function and applied that function to the next token in the stream. If the result is Nothing then the parser has failed without consuming input so calls eerr. Otherwise it has succeeded and calls cok with the provided result.

Note that tokens also consumes input. It could well be written as a combination of token and >>= but is not for performance reasons. We therefore do not consider it primitive.

The final combinator to look at is manyAccum which is the primitive combinator for all the sequence combinators such as many, many1, and etc.

manyAccum :: (a -> [a] -> [a])
          -> ParsecT s u m a
          -> ParsecT s u m [a]
manyAccum acc p =
    ParsecT $ \s cok cerr eok eerr ->
    let walk xs x s' err =
            unParser p s'
              (seq xs $ walk $ acc x xs)  -- consumed-ok
              cerr                        -- consumed-err
              manyErr                     -- empty-ok
              (\e -> cok (acc x xs) s' e) -- empty-err
    in unParser p s (walk []) cerr manyErr (\e -> eok [] s e)

References

Parsec: Direct Style Monadic Parser Combinators for the Real World (pdf). Daan Leijen and Erik Meijer. October 2001. This paper by the original authors of parsec goes into the theory and use of Monads to create a Parser Combinator. This paper is rather dated with respect to the current code base, but is a good overview of the basic strategy of Parsec.

Adventures in Parsec. Antoine Latter. December 2009. The current maintainer of Parsec lays out the rational for the changes made from Parsec 2 to Parsec 3. In particular the curious but quite ingenious final form of ParsecT in the 3.1.3 code base.