Markdown and an Introduction to Parsing Expression Grammars (PEG)

Writing an ANTLR LL(*) grammar for Markdown has been the itch I just can’t scratch this month. I keep going back to it as I have a new idea about how to approach the problem or how to solve a previous problem I’ve had. Each time I get further but I still keep hitting a wall.

It’s a shame, really, because ANTLRWorks is an excellent tool and ANTLR is an extremely mature product; its rewrite rules and tree grammars are particularly elegant.

Over the past couple of weeks I’ve been investigating PEGs (“Parsing Expression Grammars”). I highly recommend Parsing Expression Grammars: A Recognition-Based Syntactic Foundation by Bryan Ford. PEGs are relatively new (Ford’s paper was published in 2004), whereas parsing CFGs (“Context-Free Grammars”) with LL, LR and LALR parsers has a history going back decades.

Traditional Parsers

Parsing of computer and natural languages (by computers) has its roots in Noam Chomsky’s work on generative grammars, particularly the Chomsky hierarchy, and in the work of Donald Knuth (On the Translation of Languages from Left to Right [1965]) and Frank DeRemer (Practical Translators for LR(k) Languages [1969]).

To understand Parsing Expression Grammars, let me first explain the basic workings of a traditional parser. The first step is lexical analysis, which turns an input stream into a series of tokens. The parser then applies grammar rules to these tokens, with varying techniques for dealing with ambiguities and recursive rules.
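To make that two-phase pipeline concrete, here is a minimal hand-rolled sketch in Java (a toy of my own, not ANTLR output) that lexes input like "1+2+3" into tokens and then parses the token stream against a simple sum rule:

import java.util.ArrayList;
import java.util.List;

// Phase 1 (lexer): turn characters into tokens with no knowledge of grammar rules.
// Phase 2 (parser): apply grammar rules to the token stream.
public class ToyPipeline {
  enum Kind { NUMBER, PLUS, EOF }

  record Token(Kind kind, String text) {}

  static List<Token> lex(String input) {
    List<Token> tokens = new ArrayList<>();
    int i = 0;
    while (i < input.length()) {
      char c = input.charAt(i);
      if (Character.isDigit(c)) {
        int start = i;
        while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
        tokens.add(new Token(Kind.NUMBER, input.substring(start, i)));
      } else if (c == '+') {
        tokens.add(new Token(Kind.PLUS, "+"));
        i++;
      } else {
        throw new IllegalArgumentException("unexpected character: " + c);
      }
    }
    tokens.add(new Token(Kind.EOF, ""));
    return tokens;
  }

  // sum : NUMBER ('+' NUMBER)* EOF ;
  static void parseSum(List<Token> tokens) {
    int pos = 0;
    if (tokens.get(pos++).kind() != Kind.NUMBER) throw new IllegalStateException("expected a number");
    while (tokens.get(pos).kind() == Kind.PLUS) {
      pos++;
      if (tokens.get(pos++).kind() != Kind.NUMBER) throw new IllegalStateException("expected a number");
    }
    if (tokens.get(pos).kind() != Kind.EOF) throw new IllegalStateException("trailing input");
  }

  public static void main(String[] args) {
    parseSum(lex("1+2+3")); // accepted; lex("1+") would fail in parseSum
  }
}

The crucial property for what follows is that the lexer runs first and in isolation: by the time the parser sees the tokens, any context that might have influenced how the characters should have been tokenized is gone.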

As Terence Parr (the creator of ANTLR) puts it in his (excellent) The Definitive ANTLR Reference: Building Domain-Specific Languages:

Unfortunately, ANTLR cannot generate a top-down recognizer for every grammar—LL recognizers restrict the class of acceptable grammars somewhat. For example, ANTLR cannot accept left-recursive grammars such as the following (see Section 11.5, Left-Recursive Grammars, on page 274):

/** An expression is defined to be an expression followed by '++' */
expr : expr '++'
     ;

ANTLR translates this grammar to a recursive method called expr() that immediately invokes itself:

void expr() {
  expr();
  match("++");
}

This is something that LALR parsers handle better.
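The usual way to make such a rule palatable to an LL parser is to eliminate the left recursion by hand, replacing it with repetition. A rough sketch of what the rewritten rule and its generated method might look like (my own rewrite, not from the book; primary() and lookahead() are hypothetical helpers, the latter peeking at the input without consuming it):

/** Rewritten without left recursion: a primary followed by zero or more '++' */
// expr : primary ('++')* ;
void expr() {
  primary();                 // some non-left-recursive starting rule
  while (lookahead("++")) {  // peek: is the next token '++'?
    match("++");             // consume it and loop instead of recursing
  }
}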

The Lexical Analysis Problem

But the big problem as far as Markdown is concerned is that its tokens are context-sensitive. Take a natural definition for lists:

listItem    : ORDERED inline NEWLINE
            | UNORDERED inline NEWLINE
            ;

ORDERED     : DIGIT+ '.' (' ' | '\t')+ ;
UNORDERED   : ('*' | '-' | '+') (' ' | '\t')+ ;
inline      : (~ NEWLINE)+ ;
NEWLINE     : '\r' '\n'? | '\n' ;

and this Markdown:

1. one
2. two
3. three

will be converted into this lexical stream:

ORDERED inline("one") NEWLINE
ORDERED inline("two") NEWLINE
ORDERED inline("three") NEWLINE

and then this AST ("Abstract Syntax Tree") will result:

document
+- listItem
|  +- ORDERED
|  +- inline ("one")
|  +- NEWLINE
+- listItem
|  +- ORDERED
|  +- inline ("two")
|  +- NEWLINE
+- listItem
   +- ORDERED
   +- inline ("three")
   +- NEWLINE

Looks good, right? Wrong. It quickly falls down when the Markdown becomes pathological:

1. 1. one
2. two
3. three

because the input stream is:

ORDERED ORDERED inline("one") NEWLINE
ORDERED inline("two") NEWLINE
ORDERED inline("three") NEWLINE

assuming you can resolve the ambiguity arising from inline technically being able to match "1." (which you can... kinda).

The above will not be recognized because there is no rule that handles a pair of ORDERED tokens. Really, what you want is to not create an ORDERED token once you’ve already started a list item, but at that point you no longer have a context-free grammar.

ANTLR’s semantic and syntactic predicates do an admirable job of dealing with these kinds of ambiguities and context sensitivities, but ultimately ANTLR just isn’t designed for this kind of grammar.

Enter PEG

PEG parsers take a different approach in two important ways:

  1. PEGs are not ambiguous. Alternatives in a CFG can lead to ambiguities; ANTLR resolves many of these with predicates, which are a way of saying “if it looks like a duck then it’s a duck, otherwise it’s something else”. PEGs instead use a prioritized choice operator, which tries the choices in order until one matches. By definition this is unambiguous: the input stream is either recognized or it isn’t; and
  2. PEGs handle non-context-free tokenization better by recognizing tokens as part of processing a rule (scannerless parsing), rather than tokenizing first and then applying rules to the tokens.

Prioritized Choice

So in PEG terms, Markdown becomes easier to describe:

Document <- Line*
Line     <- Heading / ListItem / Inline / Empty
Heading  <- '#'+ WS+ Inline
ListItem <- (DIGIT+ '.' / '*' / '-' / '+') WS+ Inline
Inline   <- (!NEWLINE .)+ NEWLINE

DIGIT    <- [0-9]
WS       <- ' ' / '\t'
NEWLINE  <- '\r\n' / '\r' / '\n'

This is of course partial and simplified, but the important thing is that prioritized choice resolves what would otherwise be ambiguous. This is the “else” clause I’ve been looking for.
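To make the ordered-choice behaviour concrete, here is a minimal hand-rolled sketch in Java (the names and structure are my own assumptions, not any particular PEG library): each alternative is tried at the same starting position, the first one that matches wins, and a failed alternative simply backtracks without consuming input.

// A tiny PEG-style "prioritized choice": try each alternative in order at the
// same position, commit to the first that succeeds, fail only if all fail.
public class OrderedChoice {

  // A parsing expression: given (input, position), return the new position on
  // success or -1 on failure. Failure never consumes input.
  interface Expr {
    int parse(String input, int pos);
  }

  static Expr literal(String text) {
    return (input, pos) -> input.startsWith(text, pos) ? pos + text.length() : -1;
  }

  // e1 / e2 / ...: unlike CFG alternation, order matters and at most one
  // alternative is ever chosen, so the result is unambiguous by construction.
  static Expr choice(Expr... alternatives) {
    return (input, pos) -> {
      for (Expr alt : alternatives) {
        int next = alt.parse(input, pos);
        if (next >= 0) return next; // first match wins
      }
      return -1;                    // all alternatives failed: backtrack
    };
  }

  public static void main(String[] args) {
    // NEWLINE <- '\r\n' / '\r' / '\n'; listing '\r\n' first is what gives the
    // longer sequence priority over a lone '\r'.
    Expr newline = choice(literal("\r\n"), literal("\r"), literal("\n"));
    System.out.println(newline.parse("\r\nrest", 0)); // 2: matched "\r\n"
    System.out.println(newline.parse("\nrest", 0));   // 1: matched "\n"
    System.out.println(newline.parse("rest", 0));     // -1: no match
  }
}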

Context-Sensitive Tokenization

Markdown has lots of these issues. For example, ‘###’ might indicate a header, but only if the line isn’t already a header by virtue of the next line consisting entirely of equals signs or hyphens. ANTLR allows you to handle some of these situations by doing something like:

HEADER : {getCharPositionInLine()==0}?=> '#'+ WS+ ;

but what about this Markdown?

> # quoted heading
> some text

It’s entirely possible I’m missing some key part of the puzzle here but I’m not hopeful.

Ford illustrates this problem:

...PEGs also create new possibilities for language syntax design. Consider for example a well-known problem with C++ syntax involving nested template type expressions:

vector<vector<float> > MyMatrix;

The space between the two right angle brackets is required because the C++ scanner is oblivious to the language’s hierarchical syntax, and would otherwise interpret the >> incorrectly as a right shift operator. In a language described by a unified PEG, however, it is easy to define the language to permit a >> sequence to be interpreted as either one token or two depending on its context:

TemplType <- PrimType (LANGLE TemplType RANGLE)?
ShiftExpr <- PrimExpr (ShiftOper PrimExpr)*
ShiftOper <- LSHIFT / RSHIFT
LANGLE    <- '<' Spacing
RANGLE    <- '>' Spacing
LSHIFT    <- '<<' Spacing
RSHIFT    <- '>>' Spacing

(emphasis added)
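Applying the same idea back to the blockquoted heading above: because a scannerless parser recognizes '#' inside whatever rule is currently active, a blockquote rule can strip the '> ' prefix and then hand the rest of the line to the heading rule; the "start of line" context travels with the rule, and no getCharPositionInLine() predicate is needed. A rough hand-rolled sketch in Java (my own naming and structure, not any particular PEG library):

// Scannerless, PEG-style rules for a blockquoted heading. Each rule returns
// the new position on success or -1 on failure.
public class QuotedHeading {

  // Heading <- '#'+ ' '+ (anything up to the end of the line)
  static int heading(String in, int pos) {
    int p = pos;
    if (p >= in.length() || in.charAt(p) != '#') return -1;
    while (p < in.length() && in.charAt(p) == '#') p++;
    if (p >= in.length() || in.charAt(p) != ' ') return -1;
    while (p < in.length() && in.charAt(p) == ' ') p++;
    while (p < in.length() && in.charAt(p) != '\n') p++;
    return p;
  }

  // BlockquoteLine <- '> ' (Heading / plain text)
  // The '#' is recognized *inside* this rule, so it no longer matters that it
  // isn't at column 0 of the physical line.
  static int blockquoteLine(String in, int pos) {
    if (!in.startsWith("> ", pos)) return -1;
    int p = pos + 2;
    int h = heading(in, p);
    if (h >= 0) return h;                            // "> # quoted heading"
    while (p < in.length() && in.charAt(p) != '\n') p++;
    return p;                                        // "> some text"
  }

  public static void main(String[] args) {
    System.out.println(blockquoteLine("> # quoted heading", 0)); // 18 (matched as a heading)
    System.out.println(blockquoteLine("> some text", 0));        // 11 (matched as plain text)
  }
}

The same two rules could just as easily be written as PEG expressions; the point is only that the decision about '#' is made with the enclosing rule's context in hand rather than by a context-free lexer.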

Conclusion

This isn’t a new problem, and I’m not the first to approach the issue of Markdown parsing with a PEG grammar: peg-markdown is an implementation of Markdown in C using a PEG parser.

My own effort is going ahead despite this existing implementation, for several reasons:

  1. I plan on having implementations in several languages;
  2. I intend to implement various Markdown and Wiki extensions and flavours; and
  3. Because I’m getting a kick out of it.

I learnt compiler theory at university, but it was all quite theoretical, with simple yet interesting examples. The practical application to a real-world problem is quite something else. Plus, PEG is only six or so years old, so it’s new to me.

It is my belief that PEGs are a far more natural and robust means of parsing any form of Markdown, wiki syntax, BBCode or other forum format.

And that’s the direction I’m heading.

7 comments:

Anonymous said...

"Scratch you can't itch"?

William Shields said...

Heh woops. Fixed. :)

Wincent Colaiuta said...

Hi. I'm the author of a wikitext-to-html translator (see http://wikitext.rubyforge.org/ for details) and while wiki markup is not Markdown it does share some characteristics.

I also looked at using ANTLR and in the end could only use it for the lexing part, not the parsing, because it really is best suited for well-formed computer languages where syntax errors are usually considered fatal and the translator is not expected to magically recover.

The problem with Markdown and wikitext, apart from its context-sensitivity, is that it isn't a programming language and input is often not well-formed.

So quite close to the beginning of development I came to the conclusion that I was going to need a hand-coded parser. Later on I dumped ANTLR for the lexing and replaced it with Ragel as it was much faster.

I do have an interest in PEG and in fact one of my projects includes a flexible PEG parser generator (see http://walrus.wincent.com/ for info), but for unreliable, error-prone input like this kind of markup I still think hand-coded parsers are the only way to go.

Wincent Colaiuta said...

Whoops, forgot to add that PEG parsers aren't really suitable for use on webservers (where Markdown is most often found) because of their large memory footprints. A malicious user could try to DoS the server by submitting large slabs of text, rapidly exhausting the memory resources.

John MacFarlane said...

Hi - I'm the author of peg-markdown, and also of two other PEG-based markdown parsers: lunamark, in lua (http://github.com/jgm/lunamark), and markdown-peg, in Haskell (http://github.com/jgm/markdown-peg). So I understand about itches you can't scratch. It wouldn't be hard at all to extend peg-markdown or lunamark to parse other wiki-like formats, but I haven't gotten around to that.

PEG isn't *quite* capable of parsing markdown, so I had to resort to some workarounds. For example, the markdown syntax for inline code is: for any n, a sequence of n backticks, then a string of characters (possibly including backticks) ended by a sequence of n backticks. (I'm simplifying; there are also some optional spaces.) It's easy enough in a PEG to specify this rule for any particular n, but there's no way to specify it for arbitrary n. I dealt with this in peg-markdown by limiting n to 1-5, which should be enough for any conceivable ordinary use. In lunamark, I was able to implement the full rule, because the luapeg library is more powerful than standard PEG.

I dealt with quotations, lists, and other constructs that embed lists of block elements by running the parser again on them, recursively. This isn't ideal, but I couldn't think of another way to do it with PEG.

As for memory usage, it's true that PEG parsers use a lot; peg-markdown uses about 4M of heap space while parsing a 179K file.

Scott Frazer said...

Thank. You. So. Much.

I was trying to write markdown in ANTLR and I was pulling my hair out because I felt like I was missing something essential, but this definitely clears things up.

dtm said...

In looking at what else is out there, you're also looking at Pandoc, right? If nothing else the page http://code.google.com/p/pandoc/wiki/PandocVsMarkdownPl should be full of helpful corner cases to test.
