I’ve reached an important milestone tonight. As previously mentioned I’m working on a non-regex Markdown library for Java (and other languages to follow).
The goals of this project are:
- To be feature complete for standard Markdown as well as the StackOverflow/Github extensions;
- To add table support;
- To support various Wiki markdown flavours; and
- To convert from Markdown to HTML and from HTML to Markdown.
There looks like being four steps in this process.
The first step I’ve called lexical analysis but it’s part scanning and part parsing mainly because to do it at this stage is convenient and saves me a lot of grief later. The end result of this step is a list of Tokens, which is highly memory efficient. The Token object only requires 4 integers each and for a source file 10K in size you’ll probably end up with between 2,000 and 5,000 tokens.
The second step, which I’m not convinced will remain, is a rewrite step. There are a couple of awkward cases I don’t want to handle in the third step so I filter the list of tokens at this point.
The last step is to take the list of tokens and to generate a Document. A Document is basically an Abstract Syntax Tree and looks a lot like a DOM.
The last step is to use the Visitor pattern to render an HTML document.
Tonight I have working code that does all four steps. It is still very much feature incomplete. Lots of inline styling doesn’t work. Neither do reference images, reference links nor any kind of list. Still it is correctly handling nested block quotes, implicit paragraphs and paragraph breaks and indented code blocks.
As of right now it is converting (the Code_blocks unit test from MarkdownSharp):
code block on the first line Regular text. code block indented by spaces Regular text. the lines in this block all contain trailing spaces Regular Text. code block on the last line
<pre><code>code block on the first line </code></pre> <p>Regular text.</p> <pre><code>code block indented by spaces </code></pre> <p>Regular text.</p> <pre><code>the lines in this block all contain trailing spaces </code></pre> <p>Regular Text.</p> <pre><code>code block on the last line</code></pre>
in 5 microseconds.
Let me repeat that: it looped through that conversion one million times in under 5 seconds… in pure Java! To compare, my regex solution is doing this in 600-700 microseconds (that’s based on the 1.006 MarkdownSharp code; 1.009 has improved block handling, which should make a difference).
Now you might look at that document and say it’s not that complicated (and you’d be right) but all the infrastructure is there. I know how I’m going to implement the rest and I can’t imagine anything (other than auto-linking) significantly affecting performance. What’s more even if it was 100 times slower I’d still be happy. I’m working on a worst case of it being 10 times slower when feature complete.
So far I haven’t used a single regular expression and don’t think I’ll need to apart from maybe link validation. I’ll document more about the design in future posts (after the code is released probably) to explain many optimizations you can make to this process as well as the overall parsing strategy. So far there has been almost zero need for lookahead and backtracking, which is generally what kills your performance (without complicated techniques like memoization).