Standing on the Outside

This week I read Life outside .NET, or “How to check out your neighbours”. I really like posts like this. They’re instructive about the culture of a particular community.

For over a decade I’ve been a Java developer (since JDK 1.0.2). Like most Java developers I have a love-hate relationship with the language, the libraries and Sun. Java didn’t invent the virtual machine but it certainly popularized it. 5-10 years ago (in particular) Java was a hotbed for the development of many technologies, concepts and frameworks.

As the author notes, MVC and DI (dependency injection) are simply assumed in Javaland. It’s true. Good luck finding a non-MVC Web framework in Java out of the dozens that exist.

My experience and exposure with .Net is at best peripheral. ASP.NET always struck me as somewhat primitive in the sense that it’s what would’ve happened had JSP been taken to the nth-degree instead of being supplanted by Struts and all that came after. That’s not to say ASP.NET is bad or doesn’t do it’s job but to a Java developer it seems somehow crude.

Beyond the boring and irrelevant comparisons of Java vs. .Net performance, the more interesting comparison is as a proxy for decentralized vs. centralized platform progression.

The Microsoft Way definitely has its advantages. Where once Redmond was playing catch-up on Java (technically speaking), Sun’s inability to lead (and no clue where they were going if they could) has left Java largely stagnant. Java 7 is due at the end of the year but has been delayed years. Thankfully it’s now getting closures if for no other reason than we can all stop bitching about it (frankly, I think some form of function pointers or delegates in “C#-speke” will be sufficient for 99% of use cases).

It can be useful not to have a diaspora of Web development frameworks (even at the cost of innovation). Takes a Struts developer and put them on a Wicket or Tapestry project and their experience won’t be especially applicable.

It will certainly be interesting to see if Oracle can provide more leadership than Sun. Oracle was always heavily invested in Java  so I’m hoping Java isn’t simply collateral damage to Larry’s acquisition of Sun’s server business. Bizarrely Oracle seems committed to JavaFX of all things.

For those of you unfamiliar with it, JavaFX is Sun’s “me too” Flash alternative and a prime example of Sun’s boondoggles of recent years.

I for one welcome our new insect overlords. I’d like to remind them that as a trusted blogger, I can be helpful in rounding up others to toil in their underground sugar caves.

Markdown, Block Parsing and the Road to Hell

I thought it times to update my status on this particular undertaking, which so far has ended up being far more massive than originally envisioned.

The overall design of the Markdown parser is that there are two parsers… kinda. There is a parser to break your document into blocks and another to interpret the inline content within those blocks. As soon as I made this realization, everything just got a whole lot easier.

I use this this term (and “inline”) because those are the terms HTML uses (“block elements” and “inline elements”). Of course HTML also gets more complex (eg “replaced” vs “non-replaced” elements and inline-block, floats, etc) but fundamentally you can think of a Markdown document—or any hypertext document—as consisting of block and inline elements.

Markdown parsers will often talk about “blocks” and “spans” instead.

Block Parsing

The first level of parsing of Markdown is into blocks.

Such a document can be viewed as a tree. The root node is the document. Every node below that is either a block or an inline node. The tree can be arbitrarily deep and there are certain rules about relationships in that tree. For instance:

  • Block nodes are only ever children of other block nodes (counting the root Document node as a block node);
  • Paragraphs can only contain inline elements;
  • List items must be children of lists;
  • and so on.

The goal of any parser is take an input and build a valid syntax tree based on the rules defined.

This part of the problem for what I’m writing is now done. This includes code blocks, paragraphs, block quotes, ordered and unordered lists, headers and horizontal rules. Tables I plan to return to later.

List Parsing

Today I came across Three Markdown Gotchas, which I hadn’t seen before but it opened my eyes to one particular area of difficulty I had: list processing. Go to StackOverflow, ask a question and type in:

- one
 - two
  - three
   - four

and you probably won’t get you what you expect. You get this:

<ul>
  <li>one
    <ul>
      <li>two</li>
      <li>two</li>
      <li>two</li>
    </ul>
  </li>
</ul>

Let me give you some background: Markdown has the concept of indents. Based on a predefined tab width (typically 4), a single tab or 4 spaces represents one indent. That’s important because code lines are preceded by one indent. A non-indent space is sometimes ignored at the beginning of a line, for example at the start of a paragraph line or the continuation of an existing one.

The original Markdown “spec” says that nesting list items is done by preceding the line with one more indent than the previous line. In vanilla Markdown the above sequence would come out as:

<ul>
  <li>one</li>
  <li>two</li>
  <li>two</li>
  <li>two</li>
</ul>

because none of the lines has a leading indent. That’s logical and consistent. Jeff’s point is basically that even one space should indicate intent and be interpreted as nesting. Sounds reasonable right? Maybe. The problem is that it leads to unintended complexity.

Go back to the above example and put one, two then three spaces in front of the first list item. Watch the preview pane to see how the list changes. The implied nesting changes all over the place? Logical? I think not.

But it gets worse.

- one

 two
 - three

 four

comes out as

<ul>
  <li>
    <p>one</p>
    <p>two</p>
    <ul>
      <li>three</li>
    </ul>
    <p>four</p>
  </li>
</ul>

Okay… bear in mind that there are spaces before two and four so that you continue the list item. Otherwise they would be interpreted as separate paragraphs. But what if you want four to continue the nested list item three? How much indentation do you need? It turns out that the magical number is anything from 5 to 11.

But it gets worse. Put one space before one and suddenly one and three are the same list so four is now indented so far that it becomes a code block run-on from three. Add a second space to the front of one and for some reason it returns to the original nesting even though one is now indented more than three. Huh?

I’ll leave an examination of the MarkdownSharp source code as to the reasons for this as an exercise for the reader. Suffice it to say that it all stems from the motivation that one (more) space indicating nesting being somehow more intuitive.

The Road to Hell

The road to hell is paved with good intentions. It’s one of my favourite sayings. We programmers as a whole are unreasonable people. Through a combination of hubris, stubbornness and even laziness we have a tendency to throw out what’s been done before or simply make breaking changes because we prefer it, we think others will prefer it, we don’t appreciate that someone else may have to deal with the consequences or simply out of ignorance as to what led to the original changes.

We all do this, myself included. It’s worst when it not only manifests itself in company culture but it’s enshrined. Take Microsoft as a prime example. Internet Explorer has “Favourites”. What the hell are favourites? Well, they’re bookmarks. But IE can’t call them that because Netscape called them that first and Microsoft wanted to differentiated themselves and their products. This is of course led to many conversations I know I had at the time that went something like this:

New user: What’s a favourite?
Me: It’s a bookmark.

I couldn’t help but laugh out loud when I first read C# and saw all the things copied from Java had been renamed, sometimes with significantly worse names. Java’s final as C#’s sealed springs to mind. You can just tell that there were people dedicated to the task of finding names to Java concepts and keywords. It’s just sad.

Hyperbolae aside, I digress.

The point of all this is that:

  1. Often things that came before you were done for a reason, whether or not you’re aware of it and whether or not you agree with it if you are;
  2. Breaking changes have a high price so much so that the cure is often far worse than the disease and your delicate sensibilities be damned. Internal consistency and syntactic purity is overrated. Interestingly those overly encumbered with such sensibilities seem to have a disproportionate tendency to become Python programmers.

List Sanity

For this reason my parser has returned to what is probably the original implementation. That is:

  1. A leading non-indent space is ignored before list items. That is, it implies no meaning and is discarded so there is no difference between 0 and 2 leading spaces before a list item;
  2. Up to one leading indent (meaning one tab or 0 to 4 spaces) is consumed from each subsequent line until a new list items is hit or a line with no leading spaces is met. The subsequent list item will be a part of the same list. Text with no leading spaces will end the list and form a new paragraph; and
  3. All lines that continue the list item are combined (with their leading tab or 0 to 4 spaces consumed) and they form a new block context. Meaning they are then parsed as if they were a separate input, meaning it can contain new lists, block quotes, code segments and so on.

(3) provides a lot of consistency. it means that if you have a list item followed by a line with two indents that second line will be a code block (one indent marking a continued list item, the second will be interpreted as a code block within the list item block context).

To me this is supremely more logical—and easier to implement—but I guess if you’re really attached to nesting list items with a single space and figuring out that 5 to 11 spaces is the magical number of spaces to continue a nested list item then you’ll hate it. Too bad.

The nested block context from (3) has one exception. If the nested block context would result in a single paragraph then that paragraph is unwrapped to being inline content of the list item. This has one important effect, which some may consider a breaking change. Namely this Markdown:

- one
- two
- three

and

- one

- two

- three

will both be interpreted as being:

<ul>
  <li>one</li>
  <li>two</li>
  <li>three</li>
</ul>

whereas MarkdownSharp will interpret the latter as:

<ul>
  <li><p>one</p></li>
  <li><p>two</p></li>
  <li><p>three</p></li>
</ul>

which is something I've previously documented and disagreed with.

But this could be interpreted as a breaking change so I will probably add a special case for just this scenario as an option

Conclusion

The block parsing portion is done. The code is ugly and needs to be refactored (again) but it works. I still have an issue with too many temporary objects being created (mainly because it simplified some code) and I’ll need to go back and eliminate that.

What’s been interesting is that I’ve now rewritten the block parsing at least four times before it felt right. John Carmack once said he needs to write something five or six times before he gets it right. I agree with his sentiment. It takes that long to truly understand the domain, in my opinion.

The inline parsing has been a completely different set of problems. I will have a follow-up post on that soon.

Java IDEs: the Blue Heeler, the Dachshund and the Labradoodle

I’ve had a frustrating week. I’m on a mission to find out why a piece of code I wrote had a “blowout” in execution time (“blowout” here means 60 microseconds instead of 15 in sustained usage just to keep things in perspective). I suspect it’s to do with temporary objects either auto-boxing/unboxing and/or temporary arrays.

Java, in my opinion, has the best IDEs of any language or platform bar none. Say what you want about the language but the IDEs are, on the whole, first rate. That doesn’t mean there aren’t bumps along the road however.

For the purposes of this completely biased rant I shall liken them to dog breeds.

Blue Heeler

The Blue Heeler is one kind of Australian cattle dog. It’s used on sheep farms and cattle stations to round up livestock. It’s not the prettiest of breeds.

So you won’t see these as family pets or in trendy dog parks or in your neighbourhood. But they’re smart, obedient, protective and hard-working. If you’re herding cattle? You won’t see much else. The Blue Heeler is a working dog and a victory for utilitarianism.

IntelliJ IDEA is the Blue Heeler of the Java IDE world.

Ever hear anyone rave about Resharper when talking about Visual Studio? Or even go so far as to say that Resharper is what makes Visual Studio good? Well, Resharper is adding the functionality to Visual Studio that IntelliJ has for Java.

Yet all is not perfect in IntelliJ-land. The biggest problem is plugins. You certainly don’t have the range that, say, Eclipse does. But nor do you have the “plugin hell” woes so often associated with Eclipse either. Open source frameworks will tend to release plugins for Eclipse and its up to third parties to make IntelliJ versions, which doesn’t always happen.

On the bright side, you don’t actually need that many plugins because nearly everything you need is done out of the box anyway.

What makes it worse is that Jetbrains keeps breaking all the plugins. IntelliJ 9 is relatively new but it once again broke all the plugins. I’ve lost count of the number of times a major version has done this. Seriously, can’t you guys make the plug-in architecture remotely backwards compatible? Is that breaking change you’re making really necessary? Really?

The Jetty plugin still doesn’t work, which is a reasonably big deal. Worse, i can’t find a profiler that works in IntelliJ to save my life, except possibly JProfiler but who can justify $499 for a fixed single license of a profiler? Especially considering the same thing for the whole rest of the IDE is $249 (if you don’t want to use the free version.

Dachshund

The Dachshund is a strange and impractical dog. Just look at it and you can tell it’s no product of evolution. Not without man’s intervention anyway.

Yet people like them. Families own them. They are however stubborn and hard to train and they have their fair share of health problems (including spinal problems unsurprisingly).

Eclipse is the Dachshund of the Java IDE world.

If IntelliJ has not enough plugins then arguably Eclipse has too many. Hell, this goes so far as having two Subversion plugins. A former colleague, who was an avid Eclipse fan, could never get either one to work. Sometimes they lied about checking stuff in (big problem) or just conflicted with other stuff.

Every time I try and use Eclipse for something I’m struck with an overwhelming sense of how awkward and unintuitive it is. Take Maven projects as one example. In IntelliJ or Netbeans you just open one up and it just works. Googling doesn’t really help either. The first link is seemingly out of date and it doesn’t get much better.

Now I realize this is a whole Coke vs Pepsi thing. Many people are no doubt experts in Eclipse. They’re used to the “Eclipse Way” so it all makes sense (which strikes me as a form of Stockholm Syndrome but I digress…). Hell, they may even like the whole perspectives thing, which I’ve always hated.

But if you tell me you’ve never had problems with Eclipse plugins you’re lying.

Labradoodle

The Labradoodle is a strange and relatively new dog breed. Whereas programmers with too much free time come out with bizarre ways of calculating Pi and other such boondoggles, dog breeders with idle hands decide to answer a question that has plagued civilization since Aristotle’s time:

What happens when you cross a Labrador Retriever with a poodle?

Baseless hyperbolae aside, I’m sure there was a reason. I just don’t know what it was.

But what resulted is a friendly, energetic and not-too-bright breed that families tend to like.

Well Netbeans is the Labradoodle of the Java IDE world.

Netbeans does some things very well, particularly Swing development (which admittedly in today’s Web-focused world is a lot like being the best manufacturer of horse bridles and saddles).

Netbeans also faces an uncertain future with Oracle’s acquisition of Sun.

Netbeans at least immediately understand my Maven project. It couldn’t find classes with main() methods that were under the test directory (IntelliJ could) but it otherwise all just worked.

So it was looking good to finally get a profiler running… until I came across a bug. There is at least one open bug against Netbeans that raises an issue against Windows 7. My dev machine is a Windows 7 64 bit machine. Months after Windows 7’s release—nearly a year after the beta version—to still have permission problems is simply unacceptable. Yet that’s what happens when I try and use the profiler.

Conclusion

All this and I still have no profile of my code!

Please don’t waste my time and yours by commenting or sending me a message saying I’m wrong about <insert favourite IDE here>. You’re missing the point of a rant (in that largely there isn’t one).

Markdown and an Introduction to Parsing Expression Grammars (PEG)

Writing an ANTLR LL(*) grammar for Markdown has been the itch I just can’t scratch this month. I keep going back to it as I have a new idea about how to approach the problem or how to solve a previous problem I’ve had. Each time I get further but I still keep hitting a wall.

It’s a shame really because ANTLRWorks is an excellent tool and ANTLR is an extremely mature product. The rewriting rules and tree grammars are extremely elegant.

Over the couple of weeks I’ve been investigating PEGs (“Parsing Expression Grammars”). I highly recommend Parsing Expression Grammars: A Recognition-Based Syntactic Foundation by Bryan Ford. PEGs are relatively new (Ford’s paper was published in 2004) whereas parsing CFGs (“Context Free Grammars”) with LL, LR and LALR parsers has a history going back decades.

Traditional Parsers

Parsing of computer and natural languages (by computers) has it’s roots in Noam Chomsky’s work on generative grammars, particularly the Chomsky hierarchy and the work of Donald Knuth (On the Translation of Languages from Left to Right [1965]) and Frank DeRemer (Practical Translators for LR(k) Languages [1969]).

To understand Parsing Expression Grammars let me first explain the basic workings of a traditional parsers. The first step is lexical analysis that turns an input stream into a series of tokens. The parser will then apply various rules to these tokens. There are varying techniques for dealing with ambiguities and recursive rules.

As Terence Parr (the creator of ANTLR) puts it in his (excellent) The Definitive ANTLR Reference: Building Domain-Specific Languages:

Unfortunately, ANTLR cannot generate a top-down recognizer for every grammar—LL recognizers restrict the class of acceptable grammars somewhat. For example, ANTLR cannot accept left-recursive grammars such as the following (see Section 11.5, Left-Recursive Grammars, on page 274):

/** An expression is defined to be an expression followed by '++' */
expr : expr '++'
     ;

ANTLR translates this grammar to a recursive method called expr() that immediately invokes itself:

void expr() {
  expr();
  match("++");
}

This is something that LALR parsers handle better.

The Lexical Analysis Problem

But the big problem as far as Markdown is concerned is that tokens are not context-free. Take a natural definition for lists:

listItem    : ORDERED inline NEWLINE
            | UNORDERED inline NEWLINE
            ;

ORDERED     : DIGIT+ '.' (' ' | '\t')+ ;
UNORDERED   : ('*' | '-' | '+') (' ' | '\t')+ ;
inline      : (~ NEWLINE)+ ;
NEWLINE     : '\r' '\n'? : '\n' ;

and this Markdown:

1. one
2. two
3. three

will be converted into this lexical stream:

ORDERED inline("one") NEWLINE
ORDERED inline("two") NEWLINE
ORDERED inline("three") NEWLINE

and then this AST ("Absract Syntax Tree") will result:

document
+- listItem
|  +- ORDERED
|  +- inline ("one")
|  +- NEWLINE
+- listItem
|  +- ORDERED
|  +- inline ("two")
|  +- NEWLINE
+- listItem
   +- ORDERED
   +- inline ("three")
   +- NEWLINE

Looks good, right? Wrong. It quickly falls down when the Markdown becomes pathological:

1. 1. one
2. two
3. three

because the input stream is:

ORDERED ORDERED inline("one") NEWLINE
ORDERED inline("two") NEWLINE
ORDERED inline("three") NEWLINE

assuming you can resolve the ambiguity regarding inline being able to technically match "1." (which you can.... kinda).

The above will not be recognized because there is no rule that handles a pair of ORDERED tokens. Really what you want to do is not create an ORDERED token after you’ve already started a list item but at this point you no longer have a context-free grammar.

ANTLR’s semantic and syntactic predicates make an admirable effort of dealing with these kinds of ambiguities and context sensitivities but ultimately it’s just not designed for this kind of grammar.

Enter PEG

PEG parsers take a different approach in two important ways:

  1. PEGs are not ambiguous. Choices in the above can lead to ambiguities. ANTLR resolves many of these by using predicates, which are a way of saying “if it looks like a duck then it’s a duck otherwise it’s something else”. PEGs use a prioritized choice operator, which basically try the choices in order until it finds one that matches. By definition this is unambiguous because the input stream will either be recognized or it won’t; and
  2. PEGs better handle non-CFGs by trying to recognize tokens as part of processing a rule rather than recognizing tokens and then applying rules to them.

Prioritized Choice

So in PEG terms, Markdown becomes easier to describe:

Document <- Line*
Line     <- Heading / ListItem / Inline / Empty
Heading  <= '#'+ WS+ Inline
ListItem <- (DIGIT+ '.' / '*' / '-' / '+') WS+ Inline
Inline   <- (!NEWLINE .)+ NEWLINE

DIGIT    <- [0-9]
WS       <- ' ' | '\t'
NEWLINE  <- '\r\n' / '\r' / '\n'

This is of course partial and a simplification but the important thing here is that prioritized choice resolves what otherwise will be ambiguous. This is the “else” clause I’ve been looking for.

Context-Sensitive Tokenization

Markdown has lots of these issues. For example ‘###’ might indicate a header but only if that line itself isn’t a header (by the next line consisting of all equals signs or hyphens). ANTLR allows you to handle some of these situations by doing something like:

HEADER : {getCharPositionInLine()==0]?=> ‘#’+ WS+ ;

but what about this Markdown?

> # quoted heading
> some text

It’s entirely possible I’m missing some key part of the puzzle here but I’m not hopeful.

Ford illustrates this problem:

...PEGs also create new possibilities for language syntax design. Consider for example a well-known problem with C++ syntax involving nested template type expressions:

vector<vector<float> > MyMatrix;

The space between the two right angle brackets is required because the C++ scanner is oblivious to the language’s hierarchical syntax, and would otherwise interpret the >> incorrectly as a right shift operator. In a language described by a unified PEG, however, it is easy to define the language to permit a >> sequence to be interpreted as either one token or two depending on its context:

TemplType <- PrimType (LANGLE TemplType RANGLE)?
ShiftExpr <- PrimExpr (ShiftOper PrimExpr)*
ShiftOper <- LSHIFT / RSHIFT
LANGLE    <- ’<’ Spacing
RANGLE    <- ’>’ Spacing
LSHIFT    <- ’<<’ Spacing
RSHIFT    <- ’>>’ Spacing

(emphasis added)

Conclusion

This isn’t a new problem and I’m the first to approach the issue of Markdown parsing with a PEG grammar. peg-markdown is an implementation of Markdown in C using a PEG parser.

My own effort is going forward despite this implementation existing for several reasons:

  1. I plan on having implementations in several languages;
  2. I intend to implement various Markdown and Wiki extensions and flavours; and
  3. Because I’m getting a kick out of it.

I learnt compiler theory in university but it was all quite theoretical with simple yet interesting examples. The practical application to a real-world problem is quite something else. Plus PEG is only 6 or so years old so is new to me.

It is my belief that PEGs are a far more natural and robust means of parsing any form of Markdown, Wiki syntax, BBcode or other forum format.

And that’s the direction I’m heading.

Markdown Headings, Grief and Unknown Elements to the Rescue

Well it’s not a day to be outside (unless you’re at the beach). It’s 41 degrees and that’s metric (none of this Imperial rubbish that only the US uses). That’s 106F in the old scale.

So I’m tackling the problem of Markdown headings in my parser.

Heading 1
=========

Heading 2
---------

# Heading 1
## Heading 2
### Heading 3

Horizontal rules:

-------
*******
_______

This is really annoying for two reasons:

  1. Ambiguous syntax: a line of hyphens could be a horizontal rule or indicate a heading depending on the context (and again we return to the point of Markdown being context-snesitive); and
  2. From an LL-persepctive this is left-recursive and requires LL(*) (arbitrary lookahead) in a normal grammar.

Let me explain.

A subset of the grammar for Markdown might look something like this (in ANTLR-like syntax):

document  : block* ;
block     : paragraph | heading | heading1 | heading2 | codeblock ;
paragraph : inline+ END_BLOCK ;
heading   : '#'+ inline NEWLINE ;
heading1  : inline+ NEWLINE '='+ (NEWLINE | END_BLOCK) ;
heading2  : inline| NEWLINE '-'+ (NEWLINE | END_BLOCK) ;
inline    : '*' inline+ '*'
          | '`' inline+ '`'
          | ...
          | OTHER+
OTHER     : '.' ;
END_BLOCK : '\n' '\n'+ | EOF ;

Try and plug something like that into ANTLR and it will complain all over the place.

Firstly it’s ambiguous. An input sequence like “*123*” matches two of the inline alternatives. I’m led to believe that PEG parsers can deal with this by simply trying rules in the order they appear. That would fit a lot better to this situation. ANTLR can (messily) handle it with syntactic predicates.

The other problem is the grammar is left-recursive, most notably with the inline rule.

Yet another problem is that this requires arbitrary lookahead (again, something ANTLR can do with its LL(*) algorithm) because the token that delineates the heading rules is right at the end of a cyclic rule.

It's even worse once you start factoring in paragraphs and lists.

All of this leads to a whole bunch of headaches but I thought about this long and hard (going so far as to wake up in a sweat after a lexical analysis nightmare) and came up with a much more elegant (imho) solution

Consider a lexical stream that looks like this:

WORD("Heading") WHITE_SPACE(" ") WORD("1") ...

What's next is important because the parser doesn't yet know if this is a paragraph or a heading. But here is where I was trying to be too clever for my own good by determining the block quote in the lexer. After all, it would make the parsing step easier if I could just use a stack to push/pop block elements already knowing what they are.

Instead I decided to treat a stream of inline elements as an Unknown element and I could just determine the type as a parsing action rather than a lexcial rule. So the grammar simplifies somewhat to:

document  : block* ;
block     : codeblock | unknown ;
codeblock : ('    ' .* NEWLINE)+ ;
unknown   : inline* END_BLOCK ;

Again, order of rules is useful here, meaning if it looks like a code block it is a code block, otherwise it’s an unknown block. A syntactic predicate could handle this or you could make an indent at the start of a line an INDENT token, which wouldn’t fit into the inline rule. This makes the grammar unambiguous but still requires arbitrary lookahead. It’s easier to simply make a decision based on the first token and avoid any backtracking whatsoever.

So if the parsing actions come across the right token sequence within the unknown block it changes that block to a heading, otherwise when that block ends it simply defaults to being a paragraph.

Think of unknown elements as being the stem cells of Markdown lexical analysis.

Anyway that was my revelation for the week. I still need to finish my list handling, inline styling and links, which is still more than I’d like but it’s getting there.