Markdown, Inline Parsing and Badly Formed HTML

I haven’t had much time to work on my Markdown parser lately (sadly) but I thought it was worth posting an update on where I’m at. I have been digging deep into the dark depths of inline parsing. I have previously discussed the two modes of parsing Markdown, which I call block and inline.

But the block parsing is done (well, I have to go back and tweak one thing) so I’m onto the murky world of inline Markdown parsing.

Parsing Block Markup

Various Markdown implementations allow you to create markup blocks. There are usually quite strict requirements about how you can write these blocks. For example, you might need to put the start and end tags on separate lines such as:

<ul>
  <li>one</li>
  <li>two</li>
  <li>three</li>
</ul>  

My approach is much more forgiving. It will take this “Markdown”:

This is a paragraph with a <ul><li>nested</li></li>block</li> with
some <hr>random<h2>other tags</h2> in it

and convert it to:

<p>This is a paragraph</p>

<ul>
  <li>nested</li>
  <li>block</li>
</ul>

<p>with some</p>

<hr>

<p>random</p>

<h2>other tags</h2>

<p>in it</p>

This part already works.

But it gets better. It will also take that same input stream and convert it to:

This is a paragraph

- nested
- block

with some

------

random

## other tags ##

in it

So to be clear: this will convert acceptable markup to Markdown and filter out unacceptable markup (like script tags).

This will include parsing links and images into Markdown references.
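As a rough illustration of the idea (not the parser described in this post), here is a minimal Python sketch that leans on the standard library's html.parser to turn a small whitelist of block tags into Markdown and drop script content. The class name MarkdownWriter, the helper to_markdown and the sample input are all made up for the example, and links and images are not handled.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCKED = {"script", "style"}  # content of these tags is dropped entirely


class MarkdownWriter(HTMLParser):
    """Hypothetical converter: a few block tags in, Markdown blocks out."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished Markdown blocks
        self.text = []          # text collected since the last block boundary
        self.in_list = 0        # depth of open <ul>/<ol>
        self.heading = 0        # current heading level, 0 if none
        self.skip = 0           # depth of open blocked tags
        self.last_item = False  # was the last emitted block a list item?

    def flush(self):
        # Collapse whitespace and emit the pending text as one block.
        text = " ".join("".join(self.text).split())
        self.text = []
        if not text:
            return
        if self.heading:
            marks = "#" * self.heading
            self.blocks.append(f"{marks} {text} {marks}")
            self.last_item = False
        elif self.in_list:
            if self.last_item:
                self.blocks[-1] += "\n- " + text  # keep the list tight
            else:
                self.blocks.append("- " + text)
            self.last_item = True
        else:
            self.blocks.append(text)
            self.last_item = False

    def handle_starttag(self, tag, attrs):
        if tag in BLOCKED:
            self.skip += 1
            return
        self.flush()  # every recognised tag ends the current text run
        if tag in ("ul", "ol"):
            self.in_list += 1
        elif tag == "hr":
            self.blocks.append("------")
            self.last_item = False
        elif tag in HEADINGS:
            self.heading = int(tag[1])

    def handle_endtag(self, tag):
        if tag in BLOCKED:
            self.skip = max(0, self.skip - 1)
            return
        self.flush()  # stray close tags simply flush and are ignored
        if tag in ("ul", "ol"):
            self.in_list = max(0, self.in_list - 1)
        elif tag in HEADINGS:
            self.heading = 0

    def handle_data(self, data):
        if not self.skip:
            self.text.append(data)


def to_markdown(html):
    writer = MarkdownWriter()
    writer.feed(html)
    writer.close()
    writer.flush()
    return "\n\n".join(writer.blocks)


sample = ("Intro <ul><li>nested</li></li>block</li></ul> more "
          "<hr><h2>other tags</h2><script>alert(1)</script> done")
print(to_markdown(sample))
```

The stray close tag in the sample is tolerated because the sketch never checks that tags match; it only treats them as block boundaries.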

Parsing Inline Markup

This is what I’m working on now. I’m still looking for a good generic way of doing this that correctly captures tag hierarchies (eg list items must be children of unordered or ordered lists). What I’m probably going to do is release a messy version of the code (being the current version), then go back and revisit it once I have a working implementation.

This is a good general principle: it’s far easier to fix something that’s complete and working than it is to constantly strive for perfection in incomplete code (basically artists ship).

One thing I’m debating is whether I require tags to be balanced. That means whether I accept this:

<b>this is<i>a</b> test</i>

Ideally I’d like to not accept this. XML/XHTML requires balanced tags, but HTML either doesn’t or, even if it does, most browsers are quite forgiving about it. XML treats markup essentially as a document tree, whereas the HTML view is more that tags are, in certain circumstances, switches that turn behaviour on or off.
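To make "balanced" concrete, here is a small Python sketch of a stack-based check. It is only an illustration under simple assumptions: tags are found with a regular expression rather than a real HTML tokenizer, and the function name is_balanced and the short list of void tags are my own choices for the example.

```python
import re

# Matches an opening or closing tag and captures the slash and the tag name.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")
VOID = {"br", "hr", "img"}  # tags that never take a closing tag


def is_balanced(html):
    """Return True if every closing tag matches the most recent open tag."""
    stack = []
    for match in TAG.finditer(html):
        closing = match.group(1) == "/"
        name = match.group(2).lower()
        if name in VOID:
            continue
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False  # close tag with no open tag, or wrong nesting
    return not stack      # leftover open tags are also unbalanced


print(is_balanced("<b>this is<i>a</b> test</i>"))  # False
print(is_balanced("<b>this is<i>a</i> test</b>"))  # True
```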

Markdown Formatting

I went into this problem thinking I could build a document tree, turning

***this is a* test**

into

document
+- bold
   +- italic
   |  +- text: this is a 
   +- text: test

but that idea quickly falls down when you consider that this is valid Markdown:

**this is *a** test*

which basically parses to:

BOLD_ON
TEXT("this is ")
ITALIC_ON
TEXT("a")
BOLD_OFF
TEXT(" test")
ITALIC_OFF
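The switch view is easy to sketch in Python. This is a deliberately naive scanner, not the code from my parser: it only knows the * and ** delimiters, treats each one as a toggle, and the name tokenize and the (kind, text) tuple format are assumptions made for the example.

```python
def tokenize(source):
    """Turn emphasis delimiters into ON/OFF switch events plus text runs."""
    events = []
    state = {"BOLD": False, "ITALIC": False}
    text = []

    def flush_text():
        # Emit any pending characters as a single TEXT event.
        if text:
            events.append(("TEXT", "".join(text)))
            text.clear()

    i = 0
    while i < len(source):
        if source.startswith("**", i):
            name, width = "BOLD", 2
        elif source[i] == "*":
            name, width = "ITALIC", 1
        else:
            text.append(source[i])
            i += 1
            continue
        flush_text()
        state[name] = not state[name]  # each delimiter flips its switch
        events.append((name + ("_ON" if state[name] else "_OFF"), ""))
        i += width
    flush_text()
    return events


for event in tokenize("**this is *a** test*"):
    print(event)
```

Note that a run of three asterisks is read greedily as bold followed by italic, which is one of many cases a real implementation would need to treat more carefully.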

Almost any Markdown parser will generate HTML from this that looks like this:

<strong>this is <em>a</strong> test</em>

That’s unfortunate because I like the document tree. But sadly the matching problem still remains because if a special sequence doesn’t have a matching close it is put into the document as a literal sequence.
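One way to handle the literal fallback, sketched in Python on a list of (kind, text) events shaped like the switch list above: remember where each switch was last turned on, and if it is still on at the end of the input, replace that event with the literal delimiter. The function name demote_unmatched and the event format are hypothetical, chosen for this example only.

```python
LITERAL = {"BOLD": "**", "ITALIC": "*"}  # source text for each switch


def demote_unmatched(events):
    """Replace any switch that is never turned off with literal text."""
    open_at = {}  # switch name -> index of its unmatched ON event
    for index, (kind, _text) in enumerate(events):
        name, _, state = kind.partition("_")
        if state == "ON":
            open_at[name] = index
        elif state == "OFF":
            open_at.pop(name, None)
    result = list(events)
    for name, index in open_at.items():
        result[index] = ("TEXT", LITERAL[name])
    return result


# Events for the input: *this is **a* test
events = [
    ("ITALIC_ON", ""),
    ("TEXT", "this is "),
    ("BOLD_ON", ""),
    ("TEXT", "a"),
    ("ITALIC_OFF", ""),
    ("TEXT", " test"),
]
print(demote_unmatched(events))
```

Because the decision can only be made once the end of the input (or block) is reached, this is a second pass over the events rather than something the scanner can settle as it goes.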

This leads to some fairly pathological corner cases like:

*this [link* google][1]

  [1]: http://google.com

which will translate to

<em>this <a href="http://google.com">link</em> google</a>

and browsers will tend to break that link into two parts (where you can click on “link” or “google”).

But work progresses.

Conclusion

Even with a lot of the Markdown spec being parsed, my transformation times on basic documents (eg a couple of lists, a block quote and some paragraphs) are still under 60 microseconds (roughly), and that’s with some messy array manipulation and temporary object creation that I plan to revisit and clean up.

At this stage I’m hoping to have something committed and available for comment within two weeks. It won’t be pretty but my goal is to get feedback sooner rather than later.

4 comments:

Adam Paynter said...

I have been enjoying your posts, keep up the good work! It's nice seeing people's projects develop. You say you didn't have much time this week for your Markdown parser. Is this a personal project you do in your spare time?

William Shields said...

Yes, this is a personal project, so it comes after my job and the usual suspects.

Thanks for the feedback.

David Pashley said...

Your example should probably come out as:

<strong>this is <em>a</em></strong><em> test</em>

Which would give you your tree, but might be harder to parse and wouldn't match your existing test cases. On the plus side, it would be valid HTML.

Anonymous said...

Hi. Just wondering if you've made any more progress with your markdown parser? I've been enjoying the series of articles so far!
