Markdown and an Introduction to Parsing Expression Grammars (PEG)

Writing an ANTLR LL(*) grammar for Markdown has been the itch I just can’t scratch this month. I keep going back to it as I have a new idea about how to approach the problem or how to solve a previous problem I’ve had. Each time I get further but I still keep hitting a wall.

It’s a shame really because ANTLRWorks is an excellent tool and ANTLR is an extremely mature product. The rewriting rules and tree grammars are extremely elegant.

Over the couple of weeks I’ve been investigating PEGs (“Parsing Expression Grammars”). I highly recommend Parsing Expression Grammars: A Recognition-Based Syntactic Foundation by Bryan Ford. PEGs are relatively new (Ford’s paper was published in 2004) whereas parsing CFGs (“Context Free Grammars”) with LL, LR and LALR parsers has a history going back decades.

Traditional Parsers

Parsing of computer and natural languages (by computers) has it’s roots in Noam Chomsky’s work on generative grammars, particularly the Chomsky hierarchy and the work of Donald Knuth (On the Translation of Languages from Left to Right [1965]) and Frank DeRemer (Practical Translators for LR(k) Languages [1969]).

To understand Parsing Expression Grammars let me first explain the basic workings of a traditional parsers. The first step is lexical analysis that turns an input stream into a series of tokens. The parser will then apply various rules to these tokens. There are varying techniques for dealing with ambiguities and recursive rules.

As Terence Parr (the creator of ANTLR) puts it in his (excellent) The Definitive ANTLR Reference: Building Domain-Specific Languages:

Unfortunately, ANTLR cannot generate a top-down recognizer for every grammar—LL recognizers restrict the class of acceptable grammars somewhat. For example, ANTLR cannot accept left-recursive grammars such as the following (see Section 11.5, Left-Recursive Grammars, on page 274):

/** An expression is defined to be an expression followed by '++' */
expr : expr '++'
     ;

ANTLR translates this grammar to a recursive method called expr() that immediately invokes itself:

void expr() {
  expr();
  match("++");
}

This is something that LALR parsers handle better.

The Lexical Analysis Problem

But the big problem as far as Markdown is concerned is that tokens are not context-free. Take a natural definition for lists:

listItem    : ORDERED inline NEWLINE
            | UNORDERED inline NEWLINE
            ;

ORDERED     : DIGIT+ '.' (' ' | '\t')+ ;
UNORDERED   : ('*' | '-' | '+') (' ' | '\t')+ ;
inline      : (~ NEWLINE)+ ;
NEWLINE     : '\r' '\n'? : '\n' ;

and this Markdown:

1. one
2. two
3. three

will be converted into this lexical stream:

ORDERED inline("one") NEWLINE
ORDERED inline("two") NEWLINE
ORDERED inline("three") NEWLINE

and then this AST ("Absract Syntax Tree") will result:

document
+- listItem
|  +- ORDERED
|  +- inline ("one")
|  +- NEWLINE
+- listItem
|  +- ORDERED
|  +- inline ("two")
|  +- NEWLINE
+- listItem
   +- ORDERED
   +- inline ("three")
   +- NEWLINE

Looks good, right? Wrong. It quickly falls down when the Markdown becomes pathological:

1. 1. one
2. two
3. three

because the input stream is:

ORDERED ORDERED inline("one") NEWLINE
ORDERED inline("two") NEWLINE
ORDERED inline("three") NEWLINE

assuming you can resolve the ambiguity regarding inline being able to technically match "1." (which you can.... kinda).

The above will not be recognized because there is no rule that handles a pair of ORDERED tokens. Really what you want to do is not create an ORDERED token after you’ve already started a list item but at this point you no longer have a context-free grammar.

ANTLR’s semantic and syntactic predicates make an admirable effort of dealing with these kinds of ambiguities and context sensitivities but ultimately it’s just not designed for this kind of grammar.

Enter PEG

PEG parsers take a different approach in two important ways:

  1. PEGs are not ambiguous. Choices in the above can lead to ambiguities. ANTLR resolves many of these by using predicates, which are a way of saying “if it looks like a duck then it’s a duck otherwise it’s something else”. PEGs use a prioritized choice operator, which basically try the choices in order until it finds one that matches. By definition this is unambiguous because the input stream will either be recognized or it won’t; and
  2. PEGs better handle non-CFGs by trying to recognize tokens as part of processing a rule rather than recognizing tokens and then applying rules to them.

Prioritized Choice

So in PEG terms, Markdown becomes easier to describe:

Document <- Line*
Line     <- Heading / ListItem / Inline / Empty
Heading  <= '#'+ WS+ Inline
ListItem <- (DIGIT+ '.' / '*' / '-' / '+') WS+ Inline
Inline   <- (!NEWLINE .)+ NEWLINE

DIGIT    <- [0-9]
WS       <- ' ' | '\t'
NEWLINE  <- '\r\n' / '\r' / '\n'

This is of course partial and a simplification but the important thing here is that prioritized choice resolves what otherwise will be ambiguous. This is the “else” clause I’ve been looking for.

Context-Sensitive Tokenization

Markdown has lots of these issues. For example ‘###’ might indicate a header but only if that line itself isn’t a header (by the next line consisting of all equals signs or hyphens). ANTLR allows you to handle some of these situations by doing something like:

HEADER : {getCharPositionInLine()==0]?=> ‘#’+ WS+ ;

but what about this Markdown?

> # quoted heading
> some text

It’s entirely possible I’m missing some key part of the puzzle here but I’m not hopeful.

Ford illustrates this problem:

...PEGs also create new possibilities for language syntax design. Consider for example a well-known problem with C++ syntax involving nested template type expressions:

vector<vector<float> > MyMatrix;

The space between the two right angle brackets is required because the C++ scanner is oblivious to the language’s hierarchical syntax, and would otherwise interpret the >> incorrectly as a right shift operator. In a language described by a unified PEG, however, it is easy to define the language to permit a >> sequence to be interpreted as either one token or two depending on its context:

TemplType <- PrimType (LANGLE TemplType RANGLE)?
ShiftExpr <- PrimExpr (ShiftOper PrimExpr)*
ShiftOper <- LSHIFT / RSHIFT
LANGLE    <- ’<’ Spacing
RANGLE    <- ’>’ Spacing
LSHIFT    <- ’<<’ Spacing
RSHIFT    <- ’>>’ Spacing

(emphasis added)

Conclusion

This isn’t a new problem and I’m the first to approach the issue of Markdown parsing with a PEG grammar. peg-markdown is an implementation of Markdown in C using a PEG parser.

My own effort is going forward despite this implementation existing for several reasons:

  1. I plan on having implementations in several languages;
  2. I intend to implement various Markdown and Wiki extensions and flavours; and
  3. Because I’m getting a kick out of it.

I learnt compiler theory in university but it was all quite theoretical with simple yet interesting examples. The practical application to a real-world problem is quite something else. Plus PEG is only 6 or so years old so is new to me.

It is my belief that PEGs are a far more natural and robust means of parsing any form of Markdown, Wiki syntax, BBcode or other forum format.

And that’s the direction I’m heading.