Markdown, Block Parsing and the Road to Hell

I thought it times to update my status on this particular undertaking, which so far has ended up being far more massive than originally envisioned.

The overall design of the Markdown parser is that there are two parsers… kinda. There is a parser to break your document into blocks and another to interpret the inline content within those blocks. As soon as I made this realization, everything just got a whole lot easier.

I use this this term (and “inline”) because those are the terms HTML uses (“block elements” and “inline elements”). Of course HTML also gets more complex (eg “replaced” vs “non-replaced” elements and inline-block, floats, etc) but fundamentally you can think of a Markdown document—or any hypertext document—as consisting of block and inline elements.

Markdown parsers will often talk about “blocks” and “spans” instead.

Block Parsing

The first level of parsing of Markdown is into blocks.

Such a document can be viewed as a tree. The root node is the document. Every node below that is either a block or an inline node. The tree can be arbitrarily deep and there are certain rules about relationships in that tree. For instance:

  • Block nodes are only ever children of other block nodes (counting the root Document node as a block node);
  • Paragraphs can only contain inline elements;
  • List items must be children of lists;
  • and so on.

The goal of any parser is take an input and build a valid syntax tree based on the rules defined.

This part of the problem for what I’m writing is now done. This includes code blocks, paragraphs, block quotes, ordered and unordered lists, headers and horizontal rules. Tables I plan to return to later.

List Parsing

Today I came across Three Markdown Gotchas, which I hadn’t seen before but it opened my eyes to one particular area of difficulty I had: list processing. Go to StackOverflow, ask a question and type in:

- one
 - two
  - three
   - four

and you probably won’t get you what you expect. You get this:

<ul>
  <li>one
    <ul>
      <li>two</li>
      <li>two</li>
      <li>two</li>
    </ul>
  </li>
</ul>

Let me give you some background: Markdown has the concept of indents. Based on a predefined tab width (typically 4), a single tab or 4 spaces represents one indent. That’s important because code lines are preceded by one indent. A non-indent space is sometimes ignored at the beginning of a line, for example at the start of a paragraph line or the continuation of an existing one.

The original Markdown “spec” says that nesting list items is done by preceding the line with one more indent than the previous line. In vanilla Markdown the above sequence would come out as:

<ul>
  <li>one</li>
  <li>two</li>
  <li>two</li>
  <li>two</li>
</ul>

because none of the lines has a leading indent. That’s logical and consistent. Jeff’s point is basically that even one space should indicate intent and be interpreted as nesting. Sounds reasonable right? Maybe. The problem is that it leads to unintended complexity.

Go back to the above example and put one, two then three spaces in front of the first list item. Watch the preview pane to see how the list changes. The implied nesting changes all over the place? Logical? I think not.

But it gets worse.

- one

 two
 - three

 four

comes out as

<ul>
  <li>
    <p>one</p>
    <p>two</p>
    <ul>
      <li>three</li>
    </ul>
    <p>four</p>
  </li>
</ul>

Okay… bear in mind that there are spaces before two and four so that you continue the list item. Otherwise they would be interpreted as separate paragraphs. But what if you want four to continue the nested list item three? How much indentation do you need? It turns out that the magical number is anything from 5 to 11.

But it gets worse. Put one space before one and suddenly one and three are the same list so four is now indented so far that it becomes a code block run-on from three. Add a second space to the front of one and for some reason it returns to the original nesting even though one is now indented more than three. Huh?

I’ll leave an examination of the MarkdownSharp source code as to the reasons for this as an exercise for the reader. Suffice it to say that it all stems from the motivation that one (more) space indicating nesting being somehow more intuitive.

The Road to Hell

The road to hell is paved with good intentions. It’s one of my favourite sayings. We programmers as a whole are unreasonable people. Through a combination of hubris, stubbornness and even laziness we have a tendency to throw out what’s been done before or simply make breaking changes because we prefer it, we think others will prefer it, we don’t appreciate that someone else may have to deal with the consequences or simply out of ignorance as to what led to the original changes.

We all do this, myself included. It’s worst when it not only manifests itself in company culture but it’s enshrined. Take Microsoft as a prime example. Internet Explorer has “Favourites”. What the hell are favourites? Well, they’re bookmarks. But IE can’t call them that because Netscape called them that first and Microsoft wanted to differentiated themselves and their products. This is of course led to many conversations I know I had at the time that went something like this:

New user: What’s a favourite?
Me: It’s a bookmark.

I couldn’t help but laugh out loud when I first read C# and saw all the things copied from Java had been renamed, sometimes with significantly worse names. Java’s final as C#’s sealed springs to mind. You can just tell that there were people dedicated to the task of finding names to Java concepts and keywords. It’s just sad.

Hyperbolae aside, I digress.

The point of all this is that:

  1. Often things that came before you were done for a reason, whether or not you’re aware of it and whether or not you agree with it if you are;
  2. Breaking changes have a high price so much so that the cure is often far worse than the disease and your delicate sensibilities be damned. Internal consistency and syntactic purity is overrated. Interestingly those overly encumbered with such sensibilities seem to have a disproportionate tendency to become Python programmers.

List Sanity

For this reason my parser has returned to what is probably the original implementation. That is:

  1. A leading non-indent space is ignored before list items. That is, it implies no meaning and is discarded so there is no difference between 0 and 2 leading spaces before a list item;
  2. Up to one leading indent (meaning one tab or 0 to 4 spaces) is consumed from each subsequent line until a new list items is hit or a line with no leading spaces is met. The subsequent list item will be a part of the same list. Text with no leading spaces will end the list and form a new paragraph; and
  3. All lines that continue the list item are combined (with their leading tab or 0 to 4 spaces consumed) and they form a new block context. Meaning they are then parsed as if they were a separate input, meaning it can contain new lists, block quotes, code segments and so on.

(3) provides a lot of consistency. it means that if you have a list item followed by a line with two indents that second line will be a code block (one indent marking a continued list item, the second will be interpreted as a code block within the list item block context).

To me this is supremely more logical—and easier to implement—but I guess if you’re really attached to nesting list items with a single space and figuring out that 5 to 11 spaces is the magical number of spaces to continue a nested list item then you’ll hate it. Too bad.

The nested block context from (3) has one exception. If the nested block context would result in a single paragraph then that paragraph is unwrapped to being inline content of the list item. This has one important effect, which some may consider a breaking change. Namely this Markdown:

- one
- two
- three

and

- one

- two

- three

will both be interpreted as being:

<ul>
  <li>one</li>
  <li>two</li>
  <li>three</li>
</ul>

whereas MarkdownSharp will interpret the latter as:

<ul>
  <li><p>one</p></li>
  <li><p>two</p></li>
  <li><p>three</p></li>
</ul>

which is something I've previously documented and disagreed with.

But this could be interpreted as a breaking change so I will probably add a special case for just this scenario as an option

Conclusion

The block parsing portion is done. The code is ugly and needs to be refactored (again) but it works. I still have an issue with too many temporary objects being created (mainly because it simplified some code) and I’ll need to go back and eliminate that.

What’s been interesting is that I’ve now rewritten the block parsing at least four times before it felt right. John Carmack once said he needs to write something five or six times before he gets it right. I agree with his sentiment. It takes that long to truly understand the domain, in my opinion.

The inline parsing has been a completely different set of problems. I will have a follow-up post on that soon.

4 comments:

Evan said...

I'm not sure you can drop the case of list items with paragraphs. For one, I've always liked a bit more space around my bullet points which this gives you an easy way to do. For another, if you do the following markdown will create HTML with some funky spacing:

 - One

 - Two

   Too

 - Three

 - Four

If I understand your rules correctly, with your parser this will produce the following HTML:

<ul>
  <li>one</li>
  <li>
    <p>two</p>
    <p>too</p>
  </li>
  <li>three</li>
  <li>four</li>
</ul>

Whereas the current parser produces:

<ul>
  <li><p>one</p></li>
  <li>
    <p>two</p>
    <p>too</p>
  </li>
  <li><p>three</p></li>
  <li><p>four</p></li>
</ul>

Which looks a lot better.

P.S. You didn't mention block level HTML tags, which "must be separated from surrounding content by blank lines, and the start and end tags of the block should not be indented with tabs or spaces". That's being handled okay?

William Shields said...

@Evan: your supposition regarding resultant markup is correct. But, like I said, I'll probably code a special case for the previous behaviour at least as an option.

Regarding block-level tags, my code is actually far less strict about their placement and this is the exception to the block-then-inline rule I referred to earlier: the inline parse may find block level tags.

I'm also planning to go further and turn markup into markdown where applicable.

StoneCypher said...

Did you just claim that Java invented the sealed class concept?

William Shields said...

@StoneCypher: did I? Not that I can see. What are you referring to?

Post a Comment