More Details on JMD Markdown Parsing

I’ve reached an important milestone tonight. As previously mentioned I’m working on a non-regex Markdown library for Java (and other languages to follow).

The goals of this project are:

  • To be feature complete for standard Markdown as well as the StackOverflow/Github extensions;
  • To add table support;
  • To support various Wiki markdown flavours; and
  • To convert from Markdown to HTML and from HTML to Markdown.

There looks like being four steps in this process.

The first step I’ve called lexical analysis but it’s part scanning and part parsing mainly because to do it at this stage is convenient and saves me a lot of grief later. The end result of this step is a list of Tokens, which is highly memory efficient. The Token object only requires 4 integers each and for a source file 10K in size you’ll probably end up with between 2,000 and 5,000 tokens.

The second step, which I’m not convinced will remain, is a rewrite step. There are a couple of awkward cases I don’t want to handle in the third step so I filter the list of tokens at this point.

The last step is to take the list of tokens and to generate a Document. A Document is basically an Abstract Syntax Tree and looks a lot like a DOM.

The last step is to use the Visitor pattern to render an HTML document.

Tonight I have working code that does all four steps. It is still very much feature incomplete. Lots of inline styling doesn’t work. Neither do reference images, reference links nor any kind of list. Still it is correctly handling nested block quotes, implicit paragraphs and paragraph breaks and indented code blocks.

As of right now it is converting (the Code_blocks unit test from MarkdownSharp):

 code block on the first line
 
Regular text.

    code block indented by spaces

Regular text.

 the lines in this block  
 all contain trailing spaces  

Regular Text.

 code block on the last line

into this:

<pre><code>code block on the first line
</code></pre>

<p>Regular text.</p>

<pre><code>code block indented by spaces
</code></pre>

<p>Regular text.</p>

<pre><code>the lines in this block  
all contain trailing spaces  
</code></pre>

<p>Regular Text.</p>

<pre><code>code block on the last line</code></pre>

in 5 microseconds.

Let me repeat that: it looped through that conversion one million times in under 5 seconds… in pure Java! To compare, my regex solution is doing this in 600-700 microseconds (that’s based on the 1.006 MarkdownSharp code; 1.009 has improved block handling, which should make a difference).

Now you might look at that document and say it’s not that complicated (and you’d be right) but all the infrastructure is there. I know how I’m going to implement the rest and I can’t imagine anything (other than auto-linking) significantly affecting performance. What’s more even if it was 100 times slower I’d still be happy. I’m working on a worst case of it being 10 times slower when feature complete.

So far I haven’t used a single regular expression and don’t think I’ll need to apart from maybe link validation. I’ll document more about the design in future posts (after the code is released probably) to explain many optimizations you can make to this process as well as the overall parsing strategy. So far there has been almost zero need for lookahead and backtracking, which is generally what kills your performance (without complicated techniques like memoization).

Stay tuned…

3 comments:

Jeff Atwood said...

a lot of it boils down to doing one "pass" over the input text instead of dozens to hundreds, as the regex-based solution does.

William Shields said...

Well, yes and no. There are steps in the regex parsing of, say, escaping and unescaping special characters. I would count these as passes.

Mathias said...

Just came over your post here, I think you should take a look at "pegdown" (http://github.com/sirthias/pegdown), a newly release open-source Java library pretty much doing exactly what you are planning:
Proper Markdown processing without regular expressions but rather an actual parser. In the case of pegdown it's a PEG parser built with "parboiled" (http://github.com/sirthias/parboiled), another relatively new Java approach to parsing.
The PEG grammar underlying pegdown stems largely from John MacFarlanes "peg-markup" implemented in C and also available on GitHub...

Post a Comment