It may seem lately that Markdown is my white whale, to which I respond thusly… call me Ahab.
One of the problems with implementing something like this is that no one can quite agree on what exactly constitutes Markdown. It gets worse when you consider Wiki syntaxes. What’s stunning is that someone (CosmoCode) has gone so far as to create a matrix comparing them all.
If you peruse the unit tests you find things like:
    Asterisks tight:

    * asterisk 1
    * asterisk 2
    * asterisk 3

    Asterisks loose:

    * asterisk 1

    * asterisk 2

    * asterisk 3
is converted to:
    <p>Asterisks tight:</p>

    <ul>
    <li>asterisk 1</li>
    <li>asterisk 2</li>
    <li>asterisk 3</li>
    </ul>

    <p>Asterisks loose:</p>

    <ul>
    <li><p>asterisk 1</p></li>
    <li><p>asterisk 2</p></li>
    <li><p>asterisk 3</p></li>
    </ul>
Now, having gone through the code, I can see why this is: two newlines are typically used as a block delimiter between paragraphs, code blocks and so forth. But I have to wonder three things:
- Is this planned behaviour or simply the result of splitting the file into blocks using two or more newlines as a delimiter?
- Is this behaviour desirable?
- Is this behaviour reasonable?
Of course there is a case for paragraphs being nested in list items, namely when you have two or more paragraphs or other nested block content within a list item. This is certainly something you can do, and will do, in HTML, but I’m not convinced that a blank line wrapping list content in a paragraph is anything other than an unintended consequence.
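To make that suspicion concrete, here is a minimal Python sketch of how splitting on blank lines alone could produce the paragraph wrapping. It is not taken from any real Markdown implementation; `split_blocks` and `render_list` are hypothetical names, and it assumes blank lines contain no stray whitespace.

```python
import re

def split_blocks(text):
    """Split text into blocks wherever two or more newlines occur in a row."""
    return [block for block in re.split(r"\n{2,}", text.strip("\n")) if block]

def render_list(text):
    """Render a flat asterisk list as HTML (hypothetical helper)."""
    blocks = split_blocks(text)
    # If a blank line fell between items, the list arrives as several blocks,
    # and a naive renderer treats each item's content as its own paragraph.
    loose = len(blocks) > 1
    items = []
    for block in blocks:
        for line in block.split("\n"):
            # Drop the leading "* " list marker from each item line.
            items.append(re.sub(r"^\*\s+", "", line))
    template = "<li><p>{}</p></li>" if loose else "<li>{}</li>"
    return "\n".join(["<ul>"] + [template.format(item) for item in items] + ["</ul>"])

tight = "* asterisk 1\n* asterisk 2\n* asterisk 3"
loose = "* asterisk 1\n\n* asterisk 2\n\n* asterisk 3"
print(render_list(tight))
print(render_list(loose))
```

In this sketch the `<p>` wrapping is never decided per item; it falls out of how many blocks the splitter happened to return, which is exactly the kind of accident I have in mind.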
Of course there is no grammar or spec for Markdown, so it’s something you can argue about until the cows come home. You can also change it and still call what you do “Markdown”. It’s why there are so many Wiki syntaxes.
There are other issues. For example, should you be able to start or end bold or italic styling in the middle of a word? I believe GitHub has sensibly taken the approach that underscores for italics can’t start or end mid-word, since underscores inside words are a common occurrence in source code.
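As a rough illustration of that rule, the following Python sketch only treats an underscore as an italics delimiter when it is not glued to a word character on its outer side. This is my own approximation, not GitHub's actual implementation; the `ITALIC` pattern and the `italicize` function are hypothetical.

```python
import re

# An opening underscore must not follow a word character, and a closing
# underscore must not precede one. \w covers letters, digits and "_" itself.
ITALIC = re.compile(r"(?<!\w)_(?=\S)(.+?)(?<=\S)_(?!\w)")

def italicize(text):
    """Wrap _emphasised_ spans in <em>, leaving intra-word underscores alone."""
    return ITALIC.sub(r"<em>\1</em>", text)

print(italicize("call _this_ now"))        # call <em>this</em> now
print(italicize("use some_variable_name")) # use some_variable_name
print(italicize("_snake_case_ here"))      # <em>snake_case</em> here
```

Identifiers such as `some_variable_name` pass through untouched because every underscore in them has a word character on both sides.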
Lastly, Markdown preserves HTML. It’s my opinion that HTML should be replaced with the equivalent Markdown where possible. What should you do with this:

    <blockquote>
    <ul id="list">
    <li>one</li>
    <li>two</li>
    <li>three</li>
    </ul>
    </blockquote>
In my opinion, it would make sense to convert this to:
    > * one
    > * two
    > * three
Of course you lose information in doing this (namely the id attribute) but you have to decide: are you using Markdown or HTML?
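A minimal sketch of such a conversion, using Python's standard `html.parser` module, might look like the following. The `HtmlToMarkdown` class and `html_to_markdown` function are hypothetical, and the sketch assumes well-formed input containing only `<blockquote>`, `<ul>`, `<ol>` and `<li>`, with no nested lists. An unordered list becomes asterisk items and an ordered list becomes numbered items.

```python
from html.parser import HTMLParser

class HtmlToMarkdown(HTMLParser):
    """Convert blockquotes and flat lists to Markdown, dropping attributes."""

    def __init__(self):
        super().__init__()
        self.lines = []        # finished Markdown lines
        self.quote_depth = 0   # number of enclosing <blockquote> elements
        self.lists = []        # stack of [tag, item counter] for open lists
        self.item = None       # text pieces of the <li> being read, if any

    def handle_starttag(self, tag, attrs):
        # attrs (such as id="list") are ignored: this is the information loss.
        if tag == "blockquote":
            self.quote_depth += 1
        elif tag in ("ul", "ol"):
            self.lists.append([tag, 0])
        elif tag == "li" and self.lists:
            self.lists[-1][1] += 1
            self.item = []

    def handle_data(self, data):
        # Only text inside a list item is kept in this sketch.
        if self.item is not None:
            self.item.append(data)

    def handle_endtag(self, tag):
        if tag == "blockquote":
            self.quote_depth = max(0, self.quote_depth - 1)
        elif tag == "li" and self.item is not None and self.lists:
            kind, count = self.lists[-1]
            marker = f"{count}." if kind == "ol" else "*"
            prefix = "> " * self.quote_depth
            self.lines.append(f"{prefix}{marker} {''.join(self.item).strip()}")
            self.item = None
        elif tag in ("ul", "ol") and self.lists:
            self.lists.pop()

def html_to_markdown(html):
    parser = HtmlToMarkdown()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)

source = ('<blockquote> <ul id="list"> <li>one</li> <li>two</li> '
          '<li>three</li> </ul> </blockquote>')
print(html_to_markdown(source))
```

Note that `handle_starttag` receives the `id` attribute and simply throws it away, which is the trade-off in miniature.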
Opinions will of course vary.
Weighty issues indeed! But this is what I’m struggling with as I’m working on my list parsing while trying to prevent my lexer from becoming a pushdown automaton.