Markdown Headings, Grief and Unknown Elements to the Rescue

Well it’s not a day to be outside (unless you’re at the beach). It’s 41 degrees and that’s metric (none of this Imperial rubbish that only the US uses). That’s 106F in the old scale.

So I’m tackling the problem of Markdown headings in my parser.

Heading 1
=========

Heading 2
---------

# Heading 1
## Heading 2
### Heading 3

Horizontal rules:

-------
*******
_______

This is really annoying for two reasons:

  1. Ambiguous syntax: a line of hyphens could be a horizontal rule or indicate a heading depending on the context (and again we return to the point of Markdown being context-sensitive); and
  2. From an LL perspective this is left-recursive and requires LL(*) (arbitrary lookahead) in a normal grammar.

Let me explain.

A subset of the grammar for Markdown might look something like this (in ANTLR-like syntax):

document  : block* ;
block     : paragraph | heading | heading1 | heading2 | codeblock ;
paragraph : inline+ END_BLOCK ;
heading   : '#'+ inline NEWLINE ;
heading1  : inline+ NEWLINE '='+ (NEWLINE | END_BLOCK) ;
heading2  : inline+ NEWLINE '-'+ (NEWLINE | END_BLOCK) ;
inline    : '*' inline+ '*'
          | '`' inline+ '`'
          | ...
          | OTHER+ ;
OTHER     : . ;
END_BLOCK : '\n' '\n'+ | EOF ;

Try and plug something like that into ANTLR and it will complain all over the place.

Firstly it’s ambiguous. An input sequence like “*123*” matches two of the inline alternatives. I’m led to believe that PEG parsers can deal with this by simply trying rules in the order they appear, which would fit this situation a lot better. ANTLR can (messily) handle it with syntactic predicates.
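As an aside, that “first match wins” behaviour is easy enough to mimic by hand. Here’s a toy Java sketch of ordered choice over the inline alternatives (the names are invented and this is nothing like my actual code):

import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Toy illustration of PEG-style ordered choice: try each alternative in
// the order written and commit to the first one that matches.
public class OrderedChoice {
    // A "rule" looks at the input and either produces a token or fails.
    interface Rule extends Function<String, Optional<String>> {}

    // Try each alternative in order; the first success wins, PEG-style.
    static Rule firstOf(List<Rule> alternatives) {
        return input -> alternatives.stream()
                .map(rule -> rule.apply(input))
                .flatMap(Optional::stream)
                .findFirst();
    }

    public static void main(String[] args) {
        Rule emphasis = s -> s.matches("\\*[^*]+\\*")
                ? Optional.of("EMPHASIS(" + s + ")") : Optional.empty();
        Rule code = s -> s.matches("`[^`]+`")
                ? Optional.of("CODE(" + s + ")") : Optional.empty();
        Rule other = s -> Optional.of("OTHER(" + s + ")");

        Rule inline = firstOf(List.of(emphasis, code, other));
        System.out.println(inline.apply("*123*").get()); // EMPHASIS(*123*)
        System.out.println(inline.apply("plain").get()); // OTHER(plain)
    }
}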

The other problem is the grammar is left-recursive, most notably with the inline rule.

Yet another problem is that this requires arbitrary lookahead (again, something ANTLR can do with its LL(*) algorithm) because the token that delineates the heading rules is right at the end of a cyclic rule.

It's even worse once you start factoring in paragraphs and lists.

All of this leads to a whole bunch of headaches, but I thought about this long and hard (going so far as to wake up in a sweat after a lexical analysis nightmare) and came up with a much more elegant (imho) solution.

Consider a lexical stream that looks like this:

WORD("Heading") WHITE_SPACE(" ") WORD("1") ...

What's next is important because the parser doesn't yet know if this is a paragraph or a heading. But here is where I was trying to be too clever for my own good by determining the block type in the lexer. After all, it would make the parsing step easier if I could just use a stack to push/pop block elements already knowing what they are.

Instead I decided to treat a stream of inline elements as an Unknown element and determine the type as a parsing action rather than a lexical rule. So the grammar simplifies somewhat to:

document  : block* ;
block     : codeblock | unknown ;
codeblock : ('    ' .* NEWLINE)+ ;
unknown   : inline* END_BLOCK ;

Again, order of rules is useful here: if it looks like a code block it is a code block; otherwise it’s an unknown block. A syntactic predicate could handle this, or you could make an indent at the start of a line an INDENT token, which wouldn’t fit into the inline rule. This makes the grammar unambiguous but still requires arbitrary lookahead. It’s easier to simply make a decision based on the first token and avoid any backtracking whatsoever.

So if the parsing actions come across the right token sequence within the unknown block it changes that block to a heading, otherwise when that block ends it simply defaults to being a paragraph.
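In code that action is simple enough. Here’s a rough Java sketch of the idea (all the names are invented for illustration, and the real rules check more than just '=' and '-' underlines):

import java.util.ArrayList;
import java.util.List;

// Rough sketch of "unknown block" promotion: collect inline tokens into an
// Unknown block, then retype it the moment a deciding token shows up.
public class UnknownBlocks {
    enum BlockType { UNKNOWN, HEADING, PARAGRAPH, CODE_BLOCK }

    static class Block {
        BlockType type = BlockType.UNKNOWN;
        final List<String> tokens = new ArrayList<>();
    }

    // Called for each token inside the current block.
    static void onToken(Block block, String token) {
        block.tokens.add(token);
        // A run of '=' or '-' immediately after a newline decides a heading.
        int n = block.tokens.size();
        if (block.type == BlockType.UNKNOWN && n >= 2
                && block.tokens.get(n - 2).equals("\n")
                && (token.matches("=+") || token.matches("-+"))) {
            block.type = BlockType.HEADING;
        }
    }

    // Called when END_BLOCK is seen: anything still unknown is a paragraph.
    static void onEndBlock(Block block) {
        if (block.type == BlockType.UNKNOWN) {
            block.type = BlockType.PARAGRAPH;
        }
    }

    public static void main(String[] args) {
        Block b = new Block();
        for (String t : new String[] {"Heading", " ", "1", "\n", "========"}) {
            onToken(b, t);
        }
        onEndBlock(b);
        System.out.println(b.type); // HEADING
    }
}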

Think of unknown elements as being the stem cells of Markdown lexical analysis.

Anyway that was my revelation for the week. I still need to finish my list handling, inline styling and links, which is still more than I’d like but it’s getting there.

12 comments:

gabe said...

To clarify, I think Markdown has 2 syntaxes: 1) A block-level syntax that should tokenize whole lines as particular kinds of lines; and, 2) an in-line syntax that can be parsed separately.

I think you could reasonably do something like:

line-grammar:
regular-line
| block-line
| empty-line
| heading-line-block
| heading-line-inline

regular-line: << any line starting with regular characters >>
block-line: << line starting with spaces >>
empty-line: << >>
heading-line-block: << full line of '###' heading characters>>
heading-line-inline: << '##' with text >>

With that, the inline-checking at the start of lines would be fairly minimal and could be easily put into lexing rules (or hand written).
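For example, a hand-written classifier could be as small as this Java sketch (the kinds and names are invented):

// A sketch of the line-level pass: classify each whole line by its first
// few characters before any inline parsing happens.
public class LineClassifier {
    enum LineKind { EMPTY_LINE, BLOCK_LINE, HEADING_LINE_BLOCK, HEADING_LINE_INLINE, REGULAR_LINE }

    static LineKind classify(String line) {
        if (line.trim().isEmpty()) return LineKind.EMPTY_LINE;
        if (line.startsWith("    ") || line.startsWith("\t")) return LineKind.BLOCK_LINE;
        if (line.matches("[#=-]+")) return LineKind.HEADING_LINE_BLOCK; // e.g. "###" or "===="
        if (line.startsWith("#")) return LineKind.HEADING_LINE_INLINE;  // e.g. "## text"
        return LineKind.REGULAR_LINE;
    }

    public static void main(String[] args) {
        System.out.println(classify("## Heading 2")); // HEADING_LINE_INLINE
        System.out.println(classify("========"));     // HEADING_LINE_BLOCK
        System.out.println(classify("    code"));     // BLOCK_LINE
        System.out.println(classify("plain text"));   // REGULAR_LINE
    }
}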

Once you have all of the lines, a block-level parser could collate and segment blocks, producing a line-oriented stream of tokens that could then be parsed by the in-line lexer/parser.

It's certainly more work, but I think it better reflects the syntax of the language. Indeed, I have a feeling that Markdown is unparseable in LL(k) where k is any sane number.

William Shields said...

Part of the problem has been that obviously there's no formal grammar to follow so I've had to play it by ear. Add to that the choices you have about how to handle certain cases (e.g. do you treat something as a lexical or a parsing issue?) and I've had to stop and rethink my approach several times.

Last night I came to the realization that I am going to have to have an extra line-level step that I was otherwise hoping to avoid. I started to realize my previous single-pass approach was going to end up with some hideous corner cases that were simply too complex to reasonably deal with, whereas an extra pass would deal with them much more reasonably and let me better handle the inline parsing to boot.

DisgruntledGoat said...

Love this set of articles. Are you just working in Java, or will the final code be easily portable to Javascript, PHP, etc?

William Shields said...

Java is first. C# will be second.

Javascript is an interesting one because the regex library might be faster than a hand-coded solution. We'll see when it gets to that stage.

Daniel said...

I think you are on the right track treating each line as a token from the lexical analysis. It seems to me that markdown "grammar" is much more concerned with vertical elements than horizontal elements. The horizontal elements -- emphasis, bold, code, hyperlinks -- are rather easily handled.

On the vertical, however, you have normal blocks, quote blocks, code blocks, item blocks, numbered blocks, block separators and whatever combinations might be valid.

My suggestion, in fact, is for you to do two grammars: one at the document level, and one at the block level. You get two lexers. The first will return block tokens. Depending on the type of block, a different second lexer will be applied. Likewise, two grammars.

Once you have such thing working, you can probably reduce it to a single complex lexer/grammar much more easily than do it right from the start.
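Roughly, in Java, the dispatch would look like this sketch (all names invented):

import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch of the two-level idea: the first lexer yields typed blocks, then a
// per-type second lexer handles each block's contents.
public class TwoLevelLexing {
    enum BlockType { NORMAL, QUOTE, CODE }

    static class Block {
        final BlockType type;
        final String text;
        Block(BlockType type, String text) { this.type = type; this.text = text; }
    }

    // One inline lexer per block type; code blocks get no inline lexing.
    static final Map<BlockType, Function<String, List<String>>> INLINE_LEXERS = Map.of(
            BlockType.NORMAL, text -> List.of("INLINE(" + text + ")"),
            BlockType.QUOTE,  text -> List.of("QUOTE", "INLINE(" + text + ")"),
            BlockType.CODE,   text -> List.of("VERBATIM(" + text + ")"));

    public static void main(String[] args) {
        Block b = new Block(BlockType.QUOTE, "some *emphasis*");
        System.out.println(INLINE_LEXERS.get(b.type).apply(b.text));
    }
}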

As a side note, I'm very interested in this project, because I'd like to try doing it with Scala and either the backtracking parser or the packrat parser. And ADTs instead of visitor patterns. :-) So I have a vested interest in having an EBNF for it. :-)

William Shields said...

@Daniel: You're probably not going to get EBNF out of this. Context-sensitivity is the issue. Perhaps that can be unrolled to a CFG but I'm not entirely convinced of that. The code so far makes decisions and looks for tokens based on context.

It'll be something to discuss once I get the first cut out there.

goyuix said...

Your four articles are well worth reading - thank you!

You have hinted along the way that there are certain portions of Markdown that are either troublesome, conflicting or in some way cause grief. I would love to see a wrap up article with a summary of what you would like to see as a "Markdown Mark II", complete with a sensible lexer/parser and spec.

dtm said...

What are you using as the canonical source of what Markdown should be? The original spec of John Gruber (found at http://daringfireball.net/projects/markdown/syntax) is known to be a bit ambiguous and troublesome in some details. Are you aware of the "Markdown Extra" spec at http://michelf.com/specs/markdown-extra/ ?

William Shields said...

@dtm: I haven't seen the Markdown Extra one but yes I've seen the other. Markdown Extra is incomplete (and could probably use some examples).

Reading the token definitions it's pretty much dead on what I'd derived anyway.

Unfortunately Markdown is a really poor candidate for EBNF and the like because there are a lot of "else" clauses like:

<http://example.com>

is clearly a URL but to a lexer it could be:

- a URL
- a tag (although the rule for this would fail); or
- a "textrun" (in MDE-speke)

But the textrun is a fallback case. And there are lots of these fallback cases like:

> blockquote

which could reasonably be scanned as QUOTE TEXTRUN but the fallback case of TEXTRUN which is needed for

sometext

also covers the block quote, which leads to ambiguity unless you specifically enumerate all the possible start characters. That's possible, but I've had issues along that line too.
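In code the fallback chain ends up looking something like this simplified Java sketch (not my actual lexer, and the patterns are only illustrative):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified sketch of fallback lexing: try the most specific token rules
// first and fall back to TEXTRUN when none of them match.
public class FallbackLexer {
    private static final Pattern URL = Pattern.compile("<(https?://[^>\\s]+)>");
    private static final Pattern TAG = Pattern.compile("</?[a-zA-Z][a-zA-Z0-9]*>");

    static String nextToken(String input) {
        Matcher url = URL.matcher(input);
        if (url.lookingAt()) return "URL(" + url.group(1) + ")";
        Matcher tag = TAG.matcher(input);
        if (tag.lookingAt()) return "TAG(" + tag.group() + ")";
        return "TEXTRUN(" + input + ")"; // the fallback case
    }

    public static void main(String[] args) {
        System.out.println(nextToken("<http://example.com>")); // URL(http://example.com)
        System.out.println(nextToken("<em>"));                 // TAG(<em>)
        System.out.println(nextToken("> blockquote"));         // TEXTRUN(> blockquote)
    }
}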

dtm said...

Right; Markdown seems full of fallback cases.

In fact, I wonder if that isn't the right way to specify it, as a series of less-specific constructs. There was a parsing approach I read a paper on a while ago that I unfortunately can't remember the name of - it was slightly heavy on memory usage, but maybe that's OK. Rather than the traditional lexer/parser split, this approach was based on one uniform approach to the whole document, which was a series of functions:

String -> [(ParsedStructure, restOfString)]

where I'm using Haskell notation to mean a list of pairs. The first element of the list would be the most preferred parse, then the second-most-preferred parse, etc. The paper's example implementation language was Haskell, so all these lists were being constructed lazily; in Java you'd probably have to construct something that returned Iterators of (ParsedStructure, restOfString) pairs. The paper showed how to chain together different definitions so that if you had a high-level construct that was something like:

INLINESTUFF = '*' TEXTRUN '*' | TEXTRUN

you could then chain together a "recognize '*'" function and a "recognize TEXTRUN" function into a "recognize INLINESTUFF" function. I wish I could remember the name of this approach so that I could find it because the details of how you combined stuff so that the whole result was efficient escape me at the moment. Throwing seemingly related terms at Google isn't helping.
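Mechanically, though, it translates to something like this in Java (eager rather than lazy, all names invented):

import java.util.ArrayList;
import java.util.List;

// A rough Java rendering of String -> [(ParsedStructure, restOfString)]:
// each parser returns every viable parse, most preferred first.
public class ListOfSuccesses {
    static class Result {
        final String parsed, rest;
        Result(String parsed, String rest) { this.parsed = parsed; this.rest = rest; }
        public String toString() { return "(" + parsed + ", \"" + rest + "\")"; }
    }

    interface Parser { List<Result> parse(String input); }

    // Recognize a literal string at the front of the input.
    static Parser literal(String lit) {
        return input -> input.startsWith(lit)
                ? List.of(new Result(lit, input.substring(lit.length())))
                : List.<Result>of();
    }

    // Ordered choice: parses from the first parser are preferred.
    static Parser or(Parser a, Parser b) {
        return input -> {
            List<Result> out = new ArrayList<>(a.parse(input));
            out.addAll(b.parse(input));
            return out;
        };
    }

    // Sequencing: feed each leftover rest of the first parser into the second.
    static Parser seq(Parser a, Parser b) {
        return input -> {
            List<Result> out = new ArrayList<>();
            for (Result ra : a.parse(input))
                for (Result rb : b.parse(ra.rest))
                    out.add(new Result(ra.parsed + rb.parsed, rb.rest));
            return out;
        };
    }

    public static void main(String[] args) {
        // INLINESTUFF = '*' TEXTRUN '*' | TEXTRUN, with TEXTRUN fixed as "123"
        Parser textrun = literal("123");
        Parser emphasized = seq(seq(literal("*"), textrun), literal("*"));
        Parser inlinestuff = or(emphasized, textrun);
        System.out.println(inlinestuff.parse("*123*")); // [(*123*, "")]
    }
}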

dtm said...

Let me also include a pointer to BabelMark, a Markdown implementation testbed: http://babelmark.bobtfish.net/?markdown=x%3Cmax(a,b)%0D%0A&normalize=on

You may find the grammar linked to for the "PEG Markdown" implementation useful.
