I haven’t had much time to work on my Markdown parser lately (sadly) but I thought it was worth posting an update on where I’m at. I have been digging deep into the dark depths of inline parsing. I have previously discussed the two modes of parsing Markdown, which I call block and inline.
But the block parsing is done (well, I have to go back and tweak one thing) so I’m onto the murky world of inline Markdown parsing.
Parsing Block Markup
Various Markdown implementations allow you to create markup blocks. There are usually quite strict requirements about how you can write these blocks. For example, you might need to put the start and end tags on separate lines such as:
<ul>
<li>one</li>
<li>two</li>
<li>three</li>
</ul>
My approach is much more forgiving. It will take this “Markdown”:
This is a paragraph with a <ul><li>nested</li></li>block</li> with some <hr>random<h2>other tags</h2> in it
and convert it to:
<p>This is a paragraph</p>
<ul>
<li>nested</li>
<li>block</li>
</ul>
<p>with some</p>
<hr>
<p>random</p>
<h2>other tags</h2>
<p>in it</p>
This part already works.
But it gets better. It will also take that same input stream and convert it to:
This is a paragraph

- nested
- block

with some

------

random

## other tags ##

in it
So to be clear: this will convert acceptable markup to markdown and filter out unacceptable markup (like script tags).
This will include parsing links and images into Markdown references.
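To make the filtering idea concrete, here is a minimal sketch in Python. It is not the parser described in this post: the class name `MarkdownFilter`, the function `html_to_markdown` and the tiny tag whitelist are all assumptions made for illustration, and it leans on the standard library's `html.parser` rather than a hand-written scanner. Whitelisted block tags become Markdown, unknown tags are dropped, and the contents of script tags are discarded.

```python
from html.parser import HTMLParser


class MarkdownFilter(HTMLParser):
    """Hypothetical sketch: whitelisted block tags become Markdown."""

    DROP_CONTENT = {"script", "style"}  # tags whose text is discarded too

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []      # finished Markdown blocks
        self.text = []        # text collected for the current block
        self.prefix = ""      # Markdown prefix for the current block
        self.suffix = ""      # Markdown suffix for the current block
        self.skip_depth = 0   # > 0 while inside a dropped tag

    def flush(self):
        # Close the current block, collapsing runs of whitespace.
        content = " ".join("".join(self.text).split())
        if content:
            self.blocks.append(self.prefix + content + self.suffix)
        self.text = []
        self.prefix = self.suffix = ""

    def handle_starttag(self, tag, attrs):
        if tag in self.DROP_CONTENT:
            self.skip_depth += 1
        elif tag == "li":
            self.flush()
            self.prefix = "- "
        elif tag == "h2":
            self.flush()
            self.prefix, self.suffix = "## ", " ##"
        elif tag == "hr":
            self.flush()
            self.blocks.append("------")
        elif tag in ("ul", "ol", "p"):
            self.flush()
        # Any other tag is silently dropped; its text is kept.

    def handle_endtag(self, tag):
        if tag in self.DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ("li", "h2", "ul", "ol", "p"):
            self.flush()

    def handle_data(self, data):
        if not self.skip_depth:
            self.text.append(data)


def html_to_markdown(source):
    parser = MarkdownFilter()
    parser.feed(source)
    parser.close()   # forces any buffered trailing text through
    parser.flush()
    return "\n\n".join(parser.blocks)
```

This sketch separates every block with a blank line, so list items come out as a loose list. It ignores links, images and nesting entirely.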
Parsing Inline Markup
This is what I’m working on now. I’m still looking for a good generic way of doing this that correctly captures tag hierarchies (e.g. list items must be children of unordered or ordered lists). What I’m probably going to do is release the current, messy version of the code, then go back and revisit it once I have a working implementation.
This is a good general principle: it’s far easier to fix something that’s complete and working than it is to constantly strive for perfection in incomplete code (basically artists ship).
One thing I’m debating is whether I require tags to be balanced. That means whether I accept this:
<b>this is<i>a</b> test</i>
Ideally I’d like to not accept this. XML/XHTML requires balanced tags but HTML either doesn’t or even if it does, most browsers are quite forgiving of this. XML treats markup essentially as a document tree whereas the HTML view is more like tags are, in certain circumstances, switches to turn behaviour on or off.
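For what it's worth, checking balance is the easy part. A rough sketch in Python, assuming a simple regex is good enough to find tags (it is not for real HTML) and using a made-up `is_balanced` function with a hypothetical list of void tags:

```python
import re

# Matches an opening or closing tag and captures its name.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")

# Tags that never take a closing tag (assumed subset, for illustration).
VOID = {"hr", "br", "img"}


def is_balanced(markup):
    """Return True if every tag is closed in the reverse order it was opened."""
    stack = []
    for match in TAG.finditer(markup):
        closing = match.group(1) == "/"
        name = match.group(2).lower()
        if name in VOID:
            continue
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False  # closed something that wasn't the innermost open tag
    return not stack      # leftover entries mean unclosed tags
```

With this, `is_balanced("<b>this is<i>a</b> test</i>")` is false because `</b>` arrives while `<i>` is still the innermost open tag. The hard part is deciding what to do once you know the answer.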
I went into this problem thinking I could construct a document tree, so that
***this is a* test**
would become:
document
+- bold
   +- italic
   |  +- text: this is a
   +- text: test
but that idea quickly falls down when you consider that this is valid Markdown:
**this is *a** test*
which basically parses to:
BOLD_ON TEXT("this is ") ITALIC_ON TEXT("a") BOLD_OFF TEXT(" test") ITALIC_OFF
Almost any Markdown parser will generate HTML from this that looks like this:
<strong>this is <em>a</strong> test</em>
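Seen this way, rendering is just walking the switches in order. A small Python sketch, where the `(kind, value)` tuple format and the `render` function are my own assumptions for illustration rather than the parser's real data structures:

```python
from html import escape

# Which HTML tag each switch controls.
TAGS = {"BOLD": "strong", "ITALIC": "em"}


def render(tokens):
    """Turn a flat stream of (kind, value) tokens into HTML, with no tree."""
    out = []
    for kind, value in tokens:
        if kind == "TEXT":
            out.append(escape(value))
        else:
            name, state = kind.rsplit("_", 1)   # e.g. "BOLD_ON" -> "BOLD", "ON"
            out.append(("<" if state == "ON" else "</") + TAGS[name] + ">")
    return "".join(out)


tokens = [
    ("BOLD_ON", None), ("TEXT", "this is "), ("ITALIC_ON", None),
    ("TEXT", "a"), ("BOLD_OFF", None), ("TEXT", " test"), ("ITALIC_OFF", None),
]
print(render(tokens))
```

Nothing here knows or cares that the tags overlap, which is exactly why the output is unbalanced.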
That’s unfortunate because I like the document tree. Sadly, giving up the tree doesn’t make the matching problem go away: if a special sequence doesn’t have a matching close, it has to be put into the document as a literal sequence.
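One way to handle that, sketched in Python under heavy simplifying assumptions (only `*` and `**` are recognised, none of Markdown's whitespace rules are applied, and `parse_emphasis` is a made-up name): emit an opening tag optimistically, remember where it went, and swap it back to the literal characters if the end of the text arrives without a matching close.

```python
import re

DELIMS = {"**": "strong", "*": "em"}


def parse_emphasis(text):
    out = []       # output fragments, in order
    open_at = {}   # delimiter -> index in `out` of its pending opening tag
    for piece in re.split(r"(\*\*|\*)", text):
        if piece not in DELIMS:
            out.append(piece)                       # plain text
        elif piece in open_at:
            del open_at[piece]                      # found the matching close
            out.append("</%s>" % DELIMS[piece])
        else:
            open_at[piece] = len(out)               # remember the opener
            out.append("<%s>" % DELIMS[piece])
    # Anything still open never found a close: restore the literal characters.
    for delim, index in open_at.items():
        out[index] = delim
    return "".join(out)
```

So `parse_emphasis("*this is a test")` returns the input unchanged, while `parse_emphasis("**this is *a** test*")` still produces the overlapping tags shown above.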
This leads to some fairly pathological corner cases like:
*this [link* google] : http://google.com
which will translate to
<em>this <a href="http://google.com">link</em> google</a>
and browsers will tend to break that link into two parts (where you can click on “link” or “google”).
But work progresses.
Even with a lot of the Markdown spec being parsed, my transformation times on basic documents (e.g. a couple of lists, a block quote and some paragraphs) are still under 60 microseconds (roughly), and that’s with some messy array manipulation and temporary object creation that I plan to revisit and clean up.
At this stage I’m hoping to have something committed and available for comment within two weeks. It won’t be pretty but my goal is to get feedback earlier rather than later.