JMD, Markdown and a Brief Overview of Parsing and Compilers

Like most comp sci students, I did a course on compilers in university. I also did some parsing and syntax trees in a data structures course. At the time I wrote a couple of parsers, including one for simplifying boolean expressions (de Morgan’s laws, a AND true = a, etc) and another for evaluating arithmetic expressions.

So I’ve been familiar with the basics and the theory of compiler design but I was by no means an expert.

Recently, I launched JMD, a Java implementation of MarkdownSharp, itself a C# port and extension of the original Perl Markdown scripts. Like the original it relies heavily on regular expressions.

I like the idea of extending and improving Markdown but I’m no fan of using complicated regular expressions for that purpose. The current version is a milestone: it passes the unit tests and gives me a baseline from which to build the replacement, which will be written in the more traditional compiler/translator sense.

Finite State Machines

A finite state machine (“FSM”) defines two things: a finite number of states and the transitions between them. This abstract machine models behaviour of some kind.

Often—but not always—such machines have a start state and one or more end states. FSMs are typically used for games, input processing and many other things.

One important characteristic of such machines is that they typically have no memory. They merely know the current state and what transitions there are.

Typically in computer science we’re more concerned with a special class of FSMs called deterministic finite state machines (“DFSMs”) or deterministic finite automata (“DFAs”). The key difference is that the transitions are deterministic, meaning that for any given state and input symbol there is at most one transition to take.

Another characteristic of finite state machines is whether they are cyclic or acyclic. If there exists a state such that a transition can be taken and it is possible to return to that same state the FSM is cyclic, otherwise it is acyclic. By definition, cyclic FSMs are capable of processing an infinite space of inputs. Acyclic FSMs are not.
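
To make that concrete, here’s a minimal sketch (in Java, with made-up names) of a hand-rolled DFA that accepts an optional sign followed by one or more digits. The loop on the DIGITS state makes it cyclic, so it accepts inputs of any length:

final class IntegerDfa {
    // States of the machine. DIGITS is the only accepting (end) state.
    enum State { START, SIGN, DIGITS, REJECT }

    static boolean accepts(String input) {
        State state = State.START;
        for (char c : input.toCharArray()) {
            state = transition(state, c);
            if (state == State.REJECT) return false;
        }
        return state == State.DIGITS;
    }

    // Deterministic: exactly one next state for each (state, symbol) pair.
    private static State transition(State state, char c) {
        switch (state) {
            case START:
                if (c == '+' || c == '-') return State.SIGN;
                return Character.isDigit(c) ? State.DIGITS : State.REJECT;
            case SIGN:
            case DIGITS:
                return Character.isDigit(c) ? State.DIGITS : State.REJECT;
            default:
                return State.REJECT;
        }
    }

    public static void main(String[] args) {
        System.out.println(accepts("-123")); // true
        System.out.println(accepts("12a"));  // false
    }
}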

Regular Expressions

The most familiar DFA for most programmers will probably be regular expressions. A regular expression (“regex”) is a shorthand way of building a DFA to process text input by specifying the optionality, cardinality, capturing and ordering of character sequences. Typically a program will determine whether a given input matches a specified regex, or whether that regex can be found anywhere in the input, possibly capturing key parts of the input along the way.
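
As a rough illustration (a hypothetical snippet, not tied to any particular project), here’s the same sort of integer language expressed as a regex in Java, showing matching, searching and capturing:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexDemo {
    public static void main(String[] args) {
        // Optionality (?), cardinality (+) and capturing groups in one pattern:
        // an optional sign followed by one or more digits.
        Pattern number = Pattern.compile("([+-]?)(\\d+)");

        // Find the pattern anywhere in the input and capture the parts.
        Matcher m = number.matcher("x = -42; y = 7");
        while (m.find()) {
            System.out.println("sign='" + m.group(1) + "' digits=" + m.group(2));
        }

        // Or test whether the whole input matches.
        System.out.println("12a".matches("[+-]?\\d+")); // false
    }
}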

Undoubtedly regexes are useful but they tend to be overused. Consider them a shining example of how, once you have a hammer, everything starts to look like a nail. In particular, programmers will often try to use them to parse HTML or XML documents, which tends to be a pet peeve of Stackoverflow answerers such as myself.

The reason they are a poor choice is that HTML is not a regular language. What that means is that it is not possible to parse and validate HTML with a DFA. That’s because things like proper nesting of tags, ordering of opening and closing tags, etc require the machine to have some kind of memory, which DFAs don’t have.

You see that in regex-based Markdown parsers where limitations are imposed to make it possible to parse Markdown, such as introducing a nesting depth limit to certain block level elements.

To give you an example: if you’re looking for links, will “<a[ >]” find them? Most of the time? Yes. But not all the time. Consider the case of such expressions appearing in attributes, XML CDATA blocks, inside XML, CSS or Javascript comments, inside Javascript strings, etc. Regexes can’t detect these kinds of corner cases. They simply aren’t capable of it. Not reliably, at least.
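
A contrived but concrete illustration in Java (the input here is made up): the naive pattern happily reports an anchor that only exists inside an HTML comment.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveAnchorSearch {
    public static void main(String[] args) {
        Pattern anchor = Pattern.compile("<a[ >]");
        String html = "<!-- <a href=\"old.html\">this link is commented out</a> -->\n"
                    + "<p>No real anchors here.</p>";

        Matcher m = anchor.matcher(html);
        while (m.find()) {
            // Reports a match even though the only <a is inside a comment.
            System.out.println("\"anchor\" found at offset " + m.start());
        }
    }
}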

As it turns out a fairly simple change greatly enhances the power of DFAs.

Pushdown Automata

Pushdown automata (“PDAs”) make the simple change of adding a stack (hence “pushdown”), giving the machine a memory (of sorts) beyond the current state. The machine can inspect the top of the stack when deciding which transition to take, and each transition can push symbols onto the stack or pop them off.

This allows PDAs to process a much broader set of languages. A language that can be processed by a PDA is called a context-free language (“CFL”), which is a superset of all regular languages.
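
Here’s a sketch of the idea in Java (simplified: it ignores attributes, self-closing tags and so on). The regex only spots the individual tags; the stack supplies the memory that makes proper nesting checkable at all.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestingChecker {
    // Checks that tags like <b>...</b> are properly nested. The Deque is the
    // "pushdown" part: memory beyond the machine's current state.
    static boolean properlyNested(String input) {
        Pattern tag = Pattern.compile("</?([a-zA-Z]+)>");
        Deque<String> open = new ArrayDeque<>();
        Matcher m = tag.matcher(input);
        while (m.find()) {
            if (m.group().startsWith("</")) {
                // A closing tag must match the most recently opened tag.
                if (open.isEmpty() || !open.pop().equals(m.group(1))) return false;
            } else {
                open.push(m.group(1));
            }
        }
        return open.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(properlyNested("<b><i>ok</i></b>"));  // true
        System.out.println(properlyNested("<b><i>bad</b></i>")); // false
    }
}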

If a PDA is deterministic (much like FSMs vs DFAs) then it is called a deterministic pushdown automaton (“DPDA”). Any language that can be parsed by a DPDA is called a deterministic context-free language (“DCFL”); DCFLs are a subset of CFLs.

Going back to regular expressions and HTML/XML parsing: with the addition of this stack, suddenly your parsing becomes much more reliable. You can stop looking for anchors when you enter a Javascript block or a comment and so on.
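
Sticking with the anchor-hunting example, here’s a simplified Java sketch of that idea (made-up code that ignores plenty of real-world HTML messiness): track the surrounding context on a stack and only report anchors found at the top level.

import java.util.ArrayDeque;
import java.util.Deque;

public class ContextAwareAnchorSearch {
    // Walk the document once, pushing a context marker when we enter an HTML
    // comment or a <script> block and popping it when we leave. Anchors are
    // only reported when the context stack is empty.
    static void findAnchors(String html) {
        Deque<String> context = new ArrayDeque<>();
        for (int i = 0; i < html.length(); i++) {
            if (html.startsWith("<!--", i)) {
                context.push("comment");
            } else if (html.startsWith("-->", i) && "comment".equals(context.peek())) {
                context.pop();
            } else if (html.startsWith("<script", i)) {
                context.push("script");
            } else if (html.startsWith("</script>", i) && "script".equals(context.peek())) {
                context.pop();
            } else if (context.isEmpty() && html.startsWith("<a ", i)) {
                System.out.println("real anchor at offset " + i);
            }
        }
    }

    public static void main(String[] args) {
        findAnchors("<!-- <a href=\"old.html\">ignored</a> -->\n"
                  + "<a href=\"new.html\">counted</a>");
    }
}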

Many programming languages are CFLs but certainly not all.

Context-Sensitive Languages

The next broader class of languages is the context-sensitive languages. CFLs are a subset of context-sensitive languages. C++ is the traditional poster-child for hard-to-parse languages. Its grammar is also context-sensitive. Take the following expression:

A a = B();

Is that a method call or object construction? You can’t tell without looking up B in the symbol table.

Ruby and some other programming languages are context-sensitive. HTML/XML is also in this category and so is Markdown. For example:

> Block quote <p
> >

Is the second > on the second line the start of a nested block quote or the closing part of the paragraph tag at the end of the first line?

Not only is this context-sensitive, it’s also ambiguous. Technically we call this non-determinism.

Once again returning to parsing a document for anchors, context-sensitivity bridges the remaining gap. Cases like whether or not you’re inside a comment are exactly the kind of context that regular expressions can’t track.

Grammars, Lexers and Parsers

A formal grammar (or just “grammar” for short) is a set of rules that describe a sequence of tokens. A token sequence is often called a sentence. The set of all sentences described by a given grammar is the language for that grammar. Note: a context-free language is described by a context-free grammar (“CFG”), etc.

Compilers, interpreters and translators are typically written in at least two parts (that are of interest to us): a lexer and a parser. Simple grammars may be implemented where the lexer and parser are combined. More complicated grammars may have many parsing steps.

A lexer or lexical analyzer or scanner or recognizer reads a set of input tokens—most often a character stream—and converts them into lexemes or tokens. For example, a lexer for arithmetic expressions may convert:

4+5*7

into

NUMBER(4) OP(+) NUMBER(5) OP(*) NUMBER(7)
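
A hand-written lexer for that example might look something like this sketch in Java (the Token type and names are just illustrative, not JMD’s actual code):

import java.util.ArrayList;
import java.util.List;

public class ArithmeticLexer {
    // A minimal lexeme: a kind plus the matched text.
    static final class Token {
        final String kind, text;
        Token(String kind, String text) { this.kind = kind; this.text = text; }
        @Override public String toString() { return kind + "(" + text + ")"; }
    }

    static List<Token> tokenize(String input) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isDigit(c)) {
                int start = i;
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                tokens.add(new Token("NUMBER", input.substring(start, i)));
            } else if (c == '+' || c == '-' || c == '*' || c == '/') {
                tokens.add(new Token("OP", String.valueOf(c)));
                i++;
            } else if (Character.isWhitespace(c)) {
                i++; // skip whitespace
            } else {
                throw new IllegalArgumentException("Unexpected character: " + c);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [NUMBER(4), OP(+), NUMBER(5), OP(*), NUMBER(7)]
        System.out.println(tokenize("4+5*7"));
    }
}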

A parser is a program that interprets a stream of lexemes according to a set of rules. Those rules are defined in terms of lexemes and/or other rules, or even the same rule (although left recursion in grammar rules tends to be problematic for many parsers and is usually factored out, either manually because the parser won’t accept it or automatically).

This distinction is somewhat artificial and a little blurred. ANTLR makes the distinction that lexer rules are terminating rules and parser rules are non-terminating. “Terminating” means the rule is not defined in terms of any other rules and as such can at best resolve to a lexeme.

Types of Parsers

At the top level there are two main types of parsers.

The first category is LL parsers. The first L stands for “left to right”, meaning the input tokens are read from the left (ie from the beginning). The second L stands for “leftmost derivation”, which basically means the parser works top-down: it attempts to match the input to a rule and in doing so will attempt to match input tokens to lexemes.

LL parsers vary in their degree of lookahead. The simplest LL parsers are LL(1), meaning they look ahead one token. An LL(1) parser cannot choose between:

r : A B
  | A C
  ;

An LL(2) parser however can. LL parsers with a finite amount of lookahead are also called LL(k) parsers. Arbitrary lookahead LL parsers are called LL(*) parsers.
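
To make the top-down idea concrete, here’s a hand-written recursive-descent parser (the usual way to implement an LL grammar by hand) for the arithmetic tokens from the lexer sketch above. It’s purely illustrative: each rule becomes a method and a single token of lookahead decides which alternative to take.

import java.util.List;

// Grammar (with * and / binding tighter than + and -):
//   expr   : term (('+' | '-') term)* ;
//   term   : factor (('*' | '/') factor)* ;
//   factor : NUMBER ;
public class ArithmeticParser {
    private final List<ArithmeticLexer.Token> tokens;
    private int pos = 0;

    ArithmeticParser(List<ArithmeticLexer.Token> tokens) { this.tokens = tokens; }

    int parseExpr() {
        int value = parseTerm();
        while (peekIs("+") || peekIs("-")) {
            String op = next().text;
            int rhs = parseTerm();
            value = op.equals("+") ? value + rhs : value - rhs;
        }
        return value;
    }

    private int parseTerm() {
        int value = parseFactor();
        while (peekIs("*") || peekIs("/")) {
            String op = next().text;
            int rhs = parseFactor();
            value = op.equals("*") ? value * rhs : value / rhs;
        }
        return value;
    }

    private int parseFactor() {
        ArithmeticLexer.Token t = next();
        if (!t.kind.equals("NUMBER")) {
            throw new IllegalArgumentException("Expected NUMBER, got " + t);
        }
        return Integer.parseInt(t.text);
    }

    // One token of lookahead: this is what makes it LL(1).
    private boolean peekIs(String text) {
        return pos < tokens.size() && tokens.get(pos).text.equals(text);
    }

    private ArithmeticLexer.Token next() { return tokens.get(pos++); }

    public static void main(String[] args) {
        List<ArithmeticLexer.Token> tokens = ArithmeticLexer.tokenize("4+5*7");
        System.out.println(new ArithmeticParser(tokens).parseExpr()); // 39
    }
}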

The other main category is LR parsers, where the R stands for “rightmost derivation”. The input tokens are still read from the beginning, but instead of starting from the rules and working down, the parser works bottom-up: it looks at the input tokens and tries to combine (reduce) them into rules, and those into higher-level rules, until the whole input is matched.

The most important subset of LR parsers are LALR parsers. Frankly it’s been too many years for me to remember the difference between LR and LALR parsers.

Each approach has various strengths and weaknesses, which are beyond the scope of this post. Generally though, LL parsers are easier to understand but LR parsers are more often used by compiler compilers. A compiler compiler is a tool that takes a formal grammar and creates a compiler, parser, interpreter or translator.

Another class of parsers is built on parsing expression grammars (“PEGs”). I know far less about these. They’re fairly new but at least one Markdown parser, Lunamark, has been written with a PEG parser.

ANTLR

Apparently Joel and Jeff discussed this issue in this week's Stackoverflow podcast. They ruminated that it would have been better had Markdown been written using a formal grammar and a tool such as bison. I agree and they provide a pretty good example of how horrifying regex parsing of non-regular languages can be.

My own journey towards a non-regex solution first took me to ANTLR (“ANother Tool for Language Recognition”), which has a good GUI for debugging parsers. ANTLR generates LL(*) parsers, with some extensions. One of these is syntactic predicates. Syntactic predicates increase the recognition power of LL parsers by resolving some ambiguities that LL parsers otherwise can’t handle.

For example, consider the following Markdown:

> This is a *test of
> blockquoting*,
> emphasis and
> http://www.google.com
> (autolinking)
A paragraph.

When I was playing around with ANTLR this was problematic to parse. The natural way is to remove the blockquoting and then parse the remaining text, probably recursively.

One “problem” with LL parsers is that they attempt to match all the rule alternatives, so if you wanted to write the above as:

document : (para | quote)* ;
para     : ((~ '\n') '\n')+ ;
quote    : ('> ' (~ '\n')*)+ ;

you’ve actually created an ambiguous grammar, because quote lines can also be matched by the para rule. Syntactic predicates seek to resolve this kind of ambiguity by saying things like “if it looks like a block quote then it’s a block quote” when choosing between possible alternatives.

Another problem I ran into was how to deal with things like auto-linking URLs.

Throw in limited XML/HTML parsing and it just became a hair-pulling exercise. Basically, ANTLR seemed to be the wrong tool for this particular job. That doesn’t mean it can’t be done: someone more skilled with it than I am could no doubt get it done. But I could see the path forward and it wasn’t pretty.

It’s a shame really because I like ANTLR. Terence Parr, the author of ANTLR and all-round language tool rock star, has written an excellent book, The Definitive ANTLR Reference: Building Domain-Specific Languages, and I can’t recommend it enough.

The Future of JMD

One commenter asked:

How is it better/different than MarkdownJ?

It’s a good question. The answer is that JMD will use a true parser rather than a hash of regexes to parse and process Markdown. It’ll also do a lot more than this, but more on that later when it’s closer to fruition.

The ANTLR exercise wasn’t a complete waste. I had the option of seeing whether LALR parsing would offer a better alternative, and the ANTLR experiment did solidify in my mind how I would go about parsing Markdown.

I’m convinced that a hand-coded parser is not only possible but relatively straightforward.

Preliminary results look extremely promising. It’s not feature-complete yet and I won’t release it until it passes a significant portion of the unit tests inherited from MarkdownSharp.

Initial results indicate it will be 50-100x faster than a regex solution. The lexical analysis so far is being done in a single pass using virtually no memory and is taking between 10 and 20 microseconds to tokenize a document about the size of one of the unit tests.
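
To give a flavour of what “single pass” means here (this is not JMD’s actual code, just a rough sketch of the shape of a hand-rolled, line-oriented scan):

import java.util.ArrayList;
import java.util.List;

// Walk the input once and classify each line into a block-level token,
// leaving inline spans (emphasis, links, etc) for a later pass.
public class BlockScanner {
    enum LineKind { BLANK, HEADING, BLOCKQUOTE, LIST_ITEM, TEXT }

    static List<LineKind> scan(String document) {
        List<LineKind> lines = new ArrayList<>();
        for (String line : document.split("\n", -1)) {
            lines.add(classify(line));
        }
        return lines;
    }

    private static LineKind classify(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty())       return LineKind.BLANK;
        if (trimmed.startsWith("#")) return LineKind.HEADING;
        if (trimmed.startsWith(">")) return LineKind.BLOCKQUOTE;
        if (trimmed.startsWith("* ") || trimmed.startsWith("- "))
                                     return LineKind.LIST_ITEM;
        return LineKind.TEXT;
    }

    public static void main(String[] args) {
        // Prints [HEADING, BLANK, BLOCKQUOTE, TEXT]
        System.out.println(scan("# Title\n\n> quoted\ntext"));
    }
}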

Conclusion

I hope this post has been useful in three respects:

  1. To give a brief overview of the field of compiler compilers;
  2. To explain my thought process behind how to take JMD forward; and
  3. To explain why I’m doing what I’m doing.

Watch this space.

5 comments:

Jeff Atwood said...

very cool, looking forward to seeing some preliminary results -- hoping the code is not just faster, but shorter when all is said and done!

Anonymous said...

You could try jmeta: http://github.com/onnlucky/jmeta or rats! Both PEG based parsers.

Evan said...

A bit of a shame that an existing parser generator such as ANTLR does not appear to be sufficient. How cool would it be to be generating a markdown parser in java, javascript, c#, etc all from the same grammar?

E.

William Shields said...

ANTLR is a really cool tool and the idea was to minimize the language porting work by using it. LL parsers have their limitations though but like I said: I don't think it's impossible, just difficult and messy.

PEG at least seems to solve the precedence problems (the order they appear is the precedence) and it still may be worth investigating but the hand-coded lexer is coming along quite nicely.

night-fairy-tales said...

Nice post, Thanks
