Well it’s not a day to be outside (unless you’re at the beach). It’s 41 degrees and that’s metric (none of this Imperial rubbish that only the US uses). That’s 106F in the old scale.
So I’m tackling the problem of Markdown headings in my parser.
Heading 1 ========= Heading 2 --------- # Heading 1 ## Heading 2 ### Heading 3 Horizontal rules: ------- ******* _______
This is really annoying for two reasons:
- Ambiguous syntax: a line of hyphens could be a horizontal rule or indicate a heading depending on the context (and again we return to the point of Markdown being context-snesitive); and
- From an LL-persepctive this is left-recursive and requires LL(*) (arbitrary lookahead) in a normal grammar.
Let me explain.
A subset of the grammar for Markdown might look something like this (in ANTLR-like syntax):
document : block* ;
block : paragraph | heading | heading1 | heading2 | codeblock ;
paragraph : inline+ END_BLOCK ;
heading : '#'+ inline NEWLINE ;
heading1 : inline+ NEWLINE '='+ (NEWLINE | END_BLOCK) ;
heading2 : inline| NEWLINE '-'+ (NEWLINE | END_BLOCK) ;
inline : '*' inline+ '*'
| '`' inline+ '`'
| ...
| OTHER+
OTHER : '.' ;
END_BLOCK : '\n' '\n'+ | EOF ;
Try and plug something like that into ANTLR and it will complain all over the place.
Firstly it’s ambiguous. An input sequence like “*123*” matches two of the inline alternatives. I’m led to believe that PEG parsers can deal with this by simply trying rules in the order they appear. That would fit a lot better to this situation. ANTLR can (messily) handle it with syntactic predicates.
The other problem is the grammar is left-recursive, most notably with the inline rule.
Yet another problem is that this requires arbitrary lookahead (again, something ANTLR can do with its LL(*) algorithm) because the token that delineates the heading rules is right at the end of a cyclic rule.
It's even worse once you start factoring in paragraphs and lists.
All of this leads to a whole bunch of headaches but I thought about this long and hard (going so far as to wake up in a sweat after a lexical analysis nightmare) and came up with a much more elegant (imho) solution
Consider a lexical stream that looks like this:
WORD("Heading") WHITE_SPACE(" ") WORD("1") ...
What's next is important because the parser doesn't yet know if this is a paragraph or a heading. But here is where I was trying to be too clever for my own good by determining the block quote in the lexer. After all, it would make the parsing step easier if I could just use a stack to push/pop block elements already knowing what they are.
Instead I decided to treat a stream of inline elements as an Unknown element and I could just determine the type as a parsing action rather than a lexcial rule. So the grammar simplifies somewhat to:
document : block* ;
block : codeblock | unknown ;
codeblock : (' ' .* NEWLINE)+ ;
unknown : inline* END_BLOCK ;
Again, order of rules is useful here, meaning if it looks like a code block it is a code block, otherwise it’s an unknown block. A syntactic predicate could handle this or you could make an indent at the start of a line an INDENT token, which wouldn’t fit into the inline rule. This makes the grammar unambiguous but still requires arbitrary lookahead. It’s easier to simply make a decision based on the first token and avoid any backtracking whatsoever.
So if the parsing actions come across the right token sequence within the unknown block it changes that block to a heading, otherwise when that block ends it simply defaults to being a paragraph.
Think of unknown elements as being the stem cells of Markdown lexical analysis.
Anyway that was my revelation for the week. I still need to finish my list handling, inline styling and links, which is still more than I’d like but it’s getting there.