Markdown Musings on Unintended Consequences

It may seem lately that Markdown is my white whale, to which I respond: call me Ahab.

One of the problems with implementing something like this is that no one can quite agree on what exactly constitutes Markdown. It gets worse when you consider Wiki syntaxes. What’s stunning is that someone (CosmoCode) has gone so far as to create a matrix comparing them all.

If you peruse the unit tests you find things like:

Asterisks tight:

* asterisk 1
* asterisk 2
* asterisk 3


Asterisks loose:

* asterisk 1

* asterisk 2

* asterisk 3

is converted to:

<p>Asterisks tight:</p>

<ul>
<li>asterisk 1</li>
<li>asterisk 2</li>
<li>asterisk 3</li>
</ul>

<p>Asterisks loose:</p>

<ul>
<li><p>asterisk 1</p></li>
<li><p>asterisk 2</p></li>
<li><p>asterisk 3</p></li>
</ul>

Now, having gone through the code, I can see why this is: two newlines are typically used as a block delimiter between paragraphs, code blocks and so forth. But I have to wonder three things:

  1. Is this planned behaviour or simply the result of splitting the file into blocks using two or more newlines as a delimiter?
  2. Is this behaviour desirable?
  3. Is this behaviour reasonable?
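To make the first question concrete, here's a minimal Python sketch of how a "loose" list can fall out of a rule keyed on blank lines. It is not taken from Markdown.pl, MarkdownSharp or any other real implementation: the function name `render_list` and the rule "a blank line after any item makes the whole list loose" are assumptions for illustration, and it only handles flat asterisk lists.

```python
import re


def render_list(source: str) -> str:
    """Render a flat asterisk list to HTML (hypothetical helper)."""
    # Split at each line that starts with "* "; the blank lines that
    # follow an item stay attached to that item's text.
    items = re.split(r"(?m)^\* ", source.strip("\n"))[1:]

    # Assumed rule: if any item is followed by a blank line, every item
    # gets wrapped in a paragraph.
    loose = any("\n\n" in item for item in items)

    lines = ["<ul>"]
    for item in items:
        text = item.strip()
        if loose:
            lines.append(f"<li><p>{text}</p></li>")
        else:
            lines.append(f"<li>{text}</li>")
    lines.append("</ul>")
    return "\n".join(lines)


print(render_list("* asterisk 1\n* asterisk 2\n* asterisk 3\n"))
print(render_list("* asterisk 1\n\n* asterisk 2\n\n* asterisk 3\n"))
```

With a rule like this the paragraph wrapping is not a decision about list content at all; it is a side effect of treating blank lines as block boundaries.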

Of course there is a case for paragraphs being nested in list items, namely when you have two or more paragraphs or other nested block content within a list item. This is certainly something you can do, and will do, in HTML, but I’m not convinced that blank lines wrapping list content in a paragraph is anything other than an unintended consequence.

Of course there is no grammar or spec for Markdown so it’s something you can argue til the cows come home. You can also change it and still call what you do “Markdown”. It’s why there are so many Wiki syntaxes.

There are other issues. For example, should you be able to start or end bold or italic styling in the middle of a word? I believe GitHub has taken the approach that underscores for italics can’t start or end intra-word, which is sensible, as intra-word underscores are a common occurrence in source code.
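As a rough illustration of the intra-word rule, here's a Python sketch using a single regular expression. The pattern and the `emphasise` function are my own assumptions, not GitHub's actual implementation: an underscore only opens or closes emphasis when the character on its outer side is not a word character.

```python
import re

# Hypothetical rule: the opening underscore must not follow a word
# character, and the closing underscore must not precede one.
EMPHASIS = re.compile(r"(?<!\w)_(?=\S)(.+?)(?<=\S)_(?!\w)")


def emphasise(text: str) -> str:
    """Replace _text_ with <em>text</em>, ignoring intra-word underscores."""
    return EMPHASIS.sub(r"<em>\1</em>", text)


print(emphasise("call some_variable_name here"))
print(emphasise("an _emphasised_ word"))
```

The first line passes through unchanged because both underscores sit between word characters; the second is wrapped in `<em>` tags.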

Lastly, Markdown preserves HTML. It’s my opinion that such HTML should be replaced with the equivalent Markdown where possible. What should you do with this:

<blockquote>
  <ul id="list">
    <li>one</li>
    <li>two</li>
    <li>three</li>
  </ul>
</blockquote>

In my opinion, it would make sense to convert this to:

> * one
> * two
> * three

Of course you lose information in doing this (namely the id attribute) but you have to decide: are you using Markdown or HTML?
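Here's a minimal Python sketch of that kind of conversion, built on the standard library's `html.parser`. The class `ListToMarkdown` and the function `html_to_markdown` are hypothetical names, and the converter only understands `<blockquote>`, `<ul>` and `<li>`. It emits asterisk bullets because the source is a `<ul>`, and it collects the attributes it has to throw away so the information loss is at least visible.

```python
from html.parser import HTMLParser


class ListToMarkdown(HTMLParser):
    """Hypothetical converter for <blockquote>, <ul> and <li> only."""

    def __init__(self):
        super().__init__()
        self.quote_depth = 0   # how many blockquotes we are inside
        self.in_item = False   # True while inside an <li>
        self.lines = []        # Markdown output, one list item per line
        self.dropped = []      # attributes Markdown cannot express

    def handle_starttag(self, tag, attrs):
        self.dropped.extend(attrs)
        if tag == "blockquote":
            self.quote_depth += 1
        elif tag == "li":
            self.in_item = True
            self.lines.append("> " * self.quote_depth + "* ")

    def handle_endtag(self, tag):
        if tag == "blockquote":
            self.quote_depth -= 1
        elif tag == "li":
            self.in_item = False

    def handle_data(self, data):
        # Whitespace between tags is ignored; only item text is kept.
        if self.in_item:
            self.lines[-1] += data


def html_to_markdown(html: str):
    """Return the Markdown text and the list of dropped attributes."""
    parser = ListToMarkdown()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines), parser.dropped


markdown, dropped = html_to_markdown(
    '<blockquote>\n  <ul id="list">\n    <li>one</li>\n'
    "    <li>two</li>\n    <li>three</li>\n  </ul>\n</blockquote>"
)
print(markdown)
print(dropped)
```

Whether to discard `dropped`, warn about it, or fall back to leaving the HTML untouched when it is non-empty is exactly the Markdown-or-HTML decision above.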

Opinions will of course vary.

Weighty issues indeed! But this is what I’m struggling with as I’m working on my list parsing while trying to prevent my lexer from becoming a pushdown automaton.

6 comments:

Mike Weller said...

I think having the p elements inside the list items is for this use case:

* paragraph 1

    paragraph 2 (still inside list item)

* paragraph 3

Fred Blasdel said...

The PHP Markdown changelog should give you at least a hundred bugs in Markdown.pl to test against — he started with a straight transliteration (much like MarkdownSharp), and gradually made it less shitty. Here's some more from the author of Pandoc.

You're doing the right thing by completely rewriting it using real tools instead of multi-pass regex spaghetti.

Atwood and Gruber really do deserve each other — does MarkdownSharp replicate Markdown.pl's ridiculous MD5-based escaping mechanism?

William Shields said...

MarkdownSharp replaces blocks with hashcodes so they won't be affected by subsequent regexes. I suspect this must be the equivalent of the MD5 hashing you refer to (I haven't looked at the original Markdown source).

Fred Blasdel said...

Ha! A match made in heaven!

Gruber's design 'escapes' blocks by replacing them with their hashcodes, but if the original input contains the same hashcodes — welcome to XSS city!

Some people have 'fixed' this hole by salting the MD5 replacements so an attacker can't guess them.

William Shields said...

The XSS aspect is interesting and something that hadn't occurred to me. MarkdownSharp appears to still use the hash code, so I guess it would be vulnerable to XSS.

I wonder if that means Stack Overflow is subject to XSS in the same way, since I assume SO uses MarkdownSharp?

Jeff Atwood said...

You guys haven't even LOOKED at the MarkdownSharp code, obviously.

William, are you 100% sure nobody else has defined a grammar for Markdown yet?
