Java IDEs: the Blue Heeler, the Dachshund and the Labradoodle

I’ve had a frustrating week. I’m on a mission to find out why a piece of code I wrote had a “blowout” in execution time (“blowout” here means 60 microseconds instead of 15 in sustained usage just to keep things in perspective). I suspect it’s to do with temporary objects either auto-boxing/unboxing and/or temporary arrays.

Java, in my opinion, has the best IDEs of any language or platform bar none. Say what you want about the language but the IDEs are, on the whole, first rate. That doesn’t mean there aren’t bumps along the road however.

For the purposes of this completely biased rant I shall liken them to dog breeds.

Blue Heeler

The Blue Heeler is one kind of Australian cattle dog. It’s used on sheep farms and cattle stations to round up livestock. It’s not the prettiest of breeds.

So you won’t see these as family pets or in trendy dog parks or in your neighbourhood. But they’re smart, obedient, protective and hard-working. If you’re herding cattle? You won’t see much else. The Blue Heeler is a working dog and a victory for utilitarianism.

IntelliJ IDEA is the Blue Heeler of the Java IDE world.

Ever hear anyone rave about Resharper when talking about Visual Studio? Or even go so far as to say that Resharper is what makes Visual Studio good? Well, Resharper is adding the functionality to Visual Studio that IntelliJ has for Java.

Yet all is not perfect in IntelliJ-land. The biggest problem is plugins. You certainly don’t have the range that, say, Eclipse does, but nor do you have the “plugin hell” woes so often associated with Eclipse. Open source frameworks tend to release plugins for Eclipse and it’s up to third parties to make IntelliJ versions, which doesn’t always happen.

On the bright side, you don’t actually need that many plugins because nearly everything you need is done out of the box anyway.

What makes it worse is that Jetbrains keeps breaking all the plugins. IntelliJ 9 is relatively new but it once again broke all the plugins. I’ve lost count of the number of times a major version has done this. Seriously, can’t you guys make the plug-in architecture remotely backwards compatible? Is that breaking change you’re making really necessary? Really?

The Jetty plugin still doesn’t work, which is a reasonably big deal. Worse, I can’t find a profiler that works in IntelliJ to save my life, except possibly JProfiler, but who can justify $499 for a fixed single license of a profiler? Especially considering the same thing for the whole rest of the IDE is $249 (if you don’t want to use the free version).

Dachshund

The Dachshund is a strange and impractical dog. Just look at it and you can tell it’s no product of evolution. Not without man’s intervention anyway.

Yet people like them. Families own them. They are, however, stubborn and hard to train, and they have their fair share of health problems (including, unsurprisingly, spinal problems).

Eclipse is the Dachshund of the Java IDE world.

If IntelliJ has not enough plugins then arguably Eclipse has too many. Hell, this goes so far as having two Subversion plugins. A former colleague, who was an avid Eclipse fan, could never get either one to work. Sometimes they lied about checking stuff in (big problem) or just conflicted with other stuff.

Every time I try and use Eclipse for something I’m struck with an overwhelming sense of how awkward and unintuitive it is. Take Maven projects as one example. In IntelliJ or Netbeans you just open one up and it just works. Googling doesn’t really help either. The first link is seemingly out of date and it doesn’t get much better.

Now I realize this is a whole Coke vs Pepsi thing. Many people are no doubt experts in Eclipse. They’re used to the “Eclipse Way” so it all makes sense (which strikes me as a form of Stockholm Syndrome but I digress…). Hell, they may even like the whole perspectives thing, which I’ve always hated.

But if you tell me you’ve never had problems with Eclipse plugins you’re lying.

Labradoodle

The Labradoodle is a strange and relatively new dog breed. Whereas programmers with too much free time come up with bizarre ways of calculating Pi and other such boondoggles, dog breeders with idle hands decide to answer a question that has plagued civilization since Aristotle’s time:

What happens when you cross a Labrador Retriever with a poodle?

Baseless hyperbole aside, I’m sure there was a reason. I just don’t know what it was.

But what resulted is a friendly, energetic and not-too-bright breed that families tend to like.

Well Netbeans is the Labradoodle of the Java IDE world.

Netbeans does some things very well, particularly Swing development (which admittedly in today’s Web-focused world is a lot like being the best manufacturer of horse bridles and saddles).

Netbeans also faces an uncertain future with Oracle’s acquisition of Sun.

Netbeans at least immediately understood my Maven project. It couldn’t find classes with main() methods that were under the test directory (IntelliJ could) but otherwise it all just worked.

So it was looking good to finally get a profiler running… until I came across a bug. There is at least one open bug against Netbeans that raises an issue against Windows 7. My dev machine is a Windows 7 64 bit machine. Months after Windows 7’s release—nearly a year after the beta version—to still have permission problems is simply unacceptable. Yet that’s what happens when I try and use the profiler.

Conclusion

All this and I still have no profile of my code!

Please don’t waste my time and yours by commenting or sending me a message saying I’m wrong about <insert favourite IDE here>. You’re missing the point of a rant (in that largely there isn’t one).

Markdown and an Introduction to Parsing Expression Grammars (PEG)

Writing an ANTLR LL(*) grammar for Markdown has been the itch I just can’t scratch this month. I keep going back to it as I have a new idea about how to approach the problem or how to solve a previous problem I’ve had. Each time I get further but I still keep hitting a wall.

It’s a shame really because ANTLRWorks is an excellent tool and ANTLR is an extremely mature product. The rewriting rules and tree grammars are extremely elegant.

Over the last couple of weeks I’ve been investigating PEGs (“Parsing Expression Grammars”). I highly recommend Parsing Expression Grammars: A Recognition-Based Syntactic Foundation by Bryan Ford. PEGs are relatively new (Ford’s paper was published in 2004) whereas parsing CFGs (“Context Free Grammars”) with LL, LR and LALR parsers has a history going back decades.

Traditional Parsers

Parsing of computer and natural languages (by computers) has its roots in Noam Chomsky’s work on generative grammars, particularly the Chomsky hierarchy, and the work of Donald Knuth (On the Translation of Languages from Left to Right [1965]) and Frank DeRemer (Practical Translators for LR(k) Languages [1969]).

To understand Parsing Expression Grammars, let me first explain the basic workings of a traditional parser. The first step is lexical analysis, which turns an input stream into a series of tokens. The parser then applies various rules to these tokens. There are varying techniques for dealing with ambiguities and recursive rules.

As Terence Parr (the creator of ANTLR) puts it in his (excellent) The Definitive ANTLR Reference: Building Domain-Specific Languages:

Unfortunately, ANTLR cannot generate a top-down recognizer for every grammar—LL recognizers restrict the class of acceptable grammars somewhat. For example, ANTLR cannot accept left-recursive grammars such as the following (see Section 11.5, Left-Recursive Grammars, on page 274):

/** An expression is defined to be an expression followed by '++' */
expr : expr '++'
     ;

ANTLR translates this grammar to a recursive method called expr() that immediately invokes itself:

void expr() {
  expr();
  match("++");
}

This is something that LALR parsers handle better.
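The usual workaround is to eliminate the left recursion by hand: match the initial element once, then loop. A rough sketch of what that looks like as a hand-written method (primary() and lookingAt() are hypothetical helpers of mine, not ANTLR output):

void expr() {
  primary();                 // hypothetical rule for whatever the expression starts with
  while (lookingAt("++")) {  // hypothetical one-token lookahead test
    match("++");
  }
}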

The Lexical Analysis Problem

But the big problem as far as Markdown is concerned is that tokenization is not context-free. Take a natural definition for lists:

listItem    : ORDERED inline NEWLINE
            | UNORDERED inline NEWLINE
            ;

ORDERED     : DIGIT+ '.' (' ' | '\t')+ ;
UNORDERED   : ('*' | '-' | '+') (' ' | '\t')+ ;
inline      : (~ NEWLINE)+ ;
NEWLINE     : '\r' '\n'? | '\n' ;

and this Markdown:

1. one
2. two
3. three

will be converted into this lexical stream:

ORDERED inline("one") NEWLINE
ORDERED inline("two") NEWLINE
ORDERED inline("three") NEWLINE

and then this AST ("Abstract Syntax Tree") will result:

document
+- listItem
|  +- ORDERED
|  +- inline ("one")
|  +- NEWLINE
+- listItem
|  +- ORDERED
|  +- inline ("two")
|  +- NEWLINE
+- listItem
   +- ORDERED
   +- inline ("three")
   +- NEWLINE

Looks good, right? Wrong. It quickly falls down when the Markdown becomes pathological:

1. 1. one
2. two
3. three

because the input stream is:

ORDERED ORDERED inline("one") NEWLINE
ORDERED inline("two") NEWLINE
ORDERED inline("three") NEWLINE

assuming you can resolve the ambiguity of inline technically being able to match "1." (which you can… kinda).

The above will not be recognized because there is no rule that handles a pair of ORDERED tokens. Really, what you want is to not create an ORDERED token after you’ve already started a list item, but at that point you no longer have a context-free grammar.

ANTLR’s semantic and syntactic predicates make an admirable effort of dealing with these kinds of ambiguities and context sensitivities but ultimately it’s just not designed for this kind of grammar.

Enter PEG

PEG parsers take a different approach in two important ways:

  1. PEGs are not ambiguous. Choices in a context-free grammar can lead to ambiguities. ANTLR resolves many of these by using predicates, which are a way of saying “if it looks like a duck then it’s a duck, otherwise it’s something else”. PEGs use a prioritized choice operator, which basically tries the choices in order until one matches. By definition this is unambiguous because the input stream will either be recognized or it won’t; and
  2. PEGs better handle non-CFGs by trying to recognize tokens as part of processing a rule rather than recognizing tokens and then applying rules to them.

Prioritized Choice

So in PEG terms, Markdown becomes easier to describe:

Document <- Line*
Line     <- Heading / ListItem / Inline / Empty
Heading  <- '#'+ WS+ Inline
ListItem <- (DIGIT+ '.' / '*' / '-' / '+') WS+ Inline
Inline   <- (!NEWLINE .)+ NEWLINE

DIGIT    <- [0-9]
WS       <- ' ' / '\t'
NEWLINE  <- '\r\n' / '\r' / '\n'

This is of course partial and a simplification, but the important thing here is that prioritized choice resolves what would otherwise be ambiguous. This is the “else” clause I’ve been looking for.
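To make that concrete, here is a rough sketch of how prioritized choice plays out in a hand-written recursive descent parser (the method names and the pos field are mine, purely illustrative): each alternative is tried in order, the parser rewinds on failure, and the first alternative that matches wins.

// Line <- Heading / ListItem / Inline / Empty, as ordered choice with backtracking
boolean line() {
    int mark = pos;                  // remember where this attempt started
    if (heading()) return true;      // try the alternatives strictly in order...
    pos = mark;                      // ...rewinding the input between attempts
    if (listItem()) return true;
    pos = mark;
    if (inline()) return true;
    pos = mark;
    return empty();                  // the "else" clause
}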

Context-Sensitive Tokenization

Markdown has lots of these issues. For example, ‘###’ might indicate a heading, but only if that line isn’t already a heading of the other kind (that is, followed by a line consisting entirely of equals signs or hyphens). ANTLR allows you to handle some of these situations by doing something like:

HEADER : {getCharPositionInLine()==0}?=> '#'+ WS+ ;

but what about this Markdown?

> # quoted heading
> some text

It’s entirely possible I’m missing some key part of the puzzle here but I’m not hopeful.

Ford illustrates this problem:

...PEGs also create new possibilities for language syntax design. Consider for example a well-known problem with C++ syntax involving nested template type expressions:

vector<vector<float> > MyMatrix;

The space between the two right angle brackets is required because the C++ scanner is oblivious to the language’s hierarchical syntax, and would otherwise interpret the >> incorrectly as a right shift operator. In a language described by a unified PEG, however, it is easy to define the language to permit a >> sequence to be interpreted as either one token or two depending on its context:

TemplType <- PrimType (LANGLE TemplType RANGLE)?
ShiftExpr <- PrimExpr (ShiftOper PrimExpr)*
ShiftOper <- LSHIFT / RSHIFT
LANGLE    <- '<' Spacing
RANGLE    <- '>' Spacing
LSHIFT    <- '<<' Spacing
RSHIFT    <- '>>' Spacing

(emphasis added)

Conclusion

This isn’t a new problem and I’m not the first to approach the issue of Markdown parsing with a PEG grammar. peg-markdown is an implementation of Markdown in C using a PEG parser.

My own effort is going forward despite this existing implementation, for several reasons:

  1. I plan on having implementations in several languages;
  2. I intend to implement various Markdown and Wiki extensions and flavours; and
  3. Because I’m getting a kick out of it.

I learnt compiler theory in university but it was all quite theoretical with simple yet interesting examples. The practical application to a real-world problem is quite something else. Plus PEG is only 6 or so years old so is new to me.

It is my belief that PEGs are a far more natural and robust means of parsing any form of Markdown, Wiki syntax, BBcode or other forum format.

And that’s the direction I’m heading.

Markdown Headings, Grief and Unknown Elements to the Rescue

Well it’s not a day to be outside (unless you’re at the beach). It’s 41 degrees and that’s metric (none of this Imperial rubbish that only the US uses). That’s 106F in the old scale.

So I’m tackling the problem of Markdown headings in my parser.

Heading 1
=========

Heading 2
---------

# Heading 1
## Heading 2
### Heading 3

Horizontal rules:

-------
*******
_______

This is really annoying for two reasons:

  1. Ambiguous syntax: a line of hyphens could be a horizontal rule or indicate a heading depending on the context (and again we return to the point of Markdown being context-sensitive); and
  2. From an LL perspective this is left-recursive and requires LL(*) (arbitrary lookahead) in a normal grammar.

Let me explain.

A subset of the grammar for Markdown might look something like this (in ANTLR-like syntax):

document  : block* ;
block     : paragraph | heading | heading1 | heading2 | codeblock ;
paragraph : inline+ END_BLOCK ;
heading   : '#'+ inline NEWLINE ;
heading1  : inline+ NEWLINE '='+ (NEWLINE | END_BLOCK) ;
heading2  : inline+ NEWLINE '-'+ (NEWLINE | END_BLOCK) ;
inline    : '*' inline+ '*'
          | '`' inline+ '`'
          | ...
          | OTHER+ ;
OTHER     : '.' ;
END_BLOCK : '\n' '\n'+ | EOF ;

Try and plug something like that into ANTLR and it will complain all over the place.

Firstly it’s ambiguous. An input sequence like “*123*” matches two of the inline alternatives. I’m led to believe that PEG parsers can deal with this by simply trying rules in the order they appear, which would fit this situation a lot better. ANTLR can (messily) handle it with syntactic predicates.

The other problem is the grammar is left-recursive, most notably with the inline rule.

Yet another problem is that this requires arbitrary lookahead (again, something ANTLR can do with its LL(*) algorithm) because the token that delineates the heading rules is right at the end of a cyclic rule.

It's even worse once you start factoring in paragraphs and lists.

All of this leads to a whole bunch of headaches, but I thought about this long and hard (going so far as to wake up in a sweat after a lexical analysis nightmare) and came up with a much more elegant (imho) solution.

Consider a lexical stream that looks like this:

WORD("Heading") WHITE_SPACE(" ") WORD("1") ...

What's next is important because the parser doesn't yet know if this is a paragraph or a heading. But here is where I was trying to be too clever for my own good by trying to determine the block type in the lexer. After all, it would make the parsing step easier if I could just use a stack to push/pop block elements already knowing what they are.

Instead I decided to treat a stream of inline elements as an Unknown element and determine the type as a parsing action rather than a lexical rule. So the grammar simplifies somewhat to:

document  : block* ;
block     : codeblock | unknown ;
codeblock : ('    ' .* NEWLINE)+ ;
unknown   : inline* END_BLOCK ;

Again, order of rules is useful here, meaning if it looks like a code block it is a code block, otherwise it’s an unknown block. A syntactic predicate could handle this or you could make an indent at the start of a line an INDENT token, which wouldn’t fit into the inline rule. This makes the grammar unambiguous but still requires arbitrary lookahead. It’s easier to simply make a decision based on the first token and avoid any backtracking whatsoever.

So if the parsing actions come across the right token sequence within the unknown block it changes that block to a heading, otherwise when that block ends it simply defaults to being a paragraph.

Think of unknown elements as being the stem cells of Markdown lexical analysis.
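In code the idea is roughly this (a sketch only; the names are illustrative, not what JMD actually uses): a block starts out as UNKNOWN, a parsing action promotes it when the deciding token arrives, and it defaults to a paragraph when the block ends.

enum BlockType { UNKNOWN, HEADING, PARAGRAPH, CODE_BLOCK }

class Block {
    BlockType type = BlockType.UNKNOWN;

    // parsing action: the next line was all '=' or '-', so this block is a heading
    void sawSetextUnderline() {
        if (type == BlockType.UNKNOWN) type = BlockType.HEADING;
    }

    // parsing action: END_BLOCK reached and nothing promoted the block
    void close() {
        if (type == BlockType.UNKNOWN) type = BlockType.PARAGRAPH;
    }
}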

Anyway, that was my revelation for the week. I still need to finish my list handling, inline styling and links, which is more work than I’d like, but it’s getting there.

Markdown Musings on Unintended Consequences

It may seem lately that Markdown is my white whale to which I respond thusly… call me Ahab.

One of the problems with implementing something like this is that no one can quite agree on what exactly constitutes Markdown. It gets worse when you consider Wiki syntaxes. What’s stunning is that someone (CosmoCode) has gone so far as to create a matrix comparing them all.

If you peruse the unit tests you find things like:

Asterisks tight:

* asterisk 1
* asterisk 2
* asterisk 3


Asterisks loose:

* asterisk 1

* asterisk 2

* asterisk 3

is converted to:

<p>Asterisks tight:</p>

<ul>
<li>asterisk 1</li>
<li>asterisk 2</li>
<li>asterisk 3</li>
</ul>

<p>Asterisks loose:</p>

<ul>
<li><p>asterisk 1</p></li>
<li><p>asterisk 2</p></li>
<li><p>asterisk 3</p></li>
</ul>

Now having gone through the code I can see why this is: two newlines is typically used as a block delimiter, between paragraphs, code blocks and so forth. But I have to wonder three things:

  1. Is this planned behaviour or simply the result of splitting the file into blocks using two or more newlines as a delimiter?
  2. Is this behaviour desirable?
  3. Is this behaviour reasonable?

Of course there is a case for paragraphs being nested in list items, namely when you have two or more paragraphs or other nested block content within a list item. This is certainly something you can do—and will do—in HTML, but I’m not so convinced that newlines wrapping list content in paragraphs is anything other than an unintended consequence.

Of course there is no grammar or spec for Markdown so it’s something you can argue till the cows come home. You can also change it and still call what you do “Markdown”. It’s why there are so many Wiki syntaxes.

There are other issues. For example, should you be able to start or end bold or italic styling in the middle of a word? I believe Github has sensibly taken the approach that underscores for italics can’t start or end mid-word (since underscores within words are a common occurrence in source code).

Lastly, Markdown preserves HTML. It’s my opinion that it should be replaced with Markdown where possible. What should you do with this:

<blockquote>
  <ul id="list">
    <li>one</li>
    <li>two</li>
    <li>three</li>
  </ul>
</blockquote>

In my opinion, it would make sense to convert this to:

> 1. one
> 2. two
> 3. three

Of course you lose information in doing this (namely the id attribute) but you have to decide: are you using Markdown or HTML?

Opinions will of course vary.

Weighty issues indeed! But this is what I’m struggling with as I’m working on my list parsing while trying to prevent my lexer from becoming a pushdown automaton.

More Details on JMD Markdown Parsing

I’ve reached an important milestone tonight. As previously mentioned I’m working on a non-regex Markdown library for Java (and other languages to follow).

The goals of this project are:

  • To be feature complete for standard Markdown as well as the StackOverflow/Github extensions;
  • To add table support;
  • To support various Wiki markdown flavours; and
  • To convert from Markdown to HTML and from HTML to Markdown.

It looks like there will be four steps in this process.

The first step I’ve called lexical analysis, but it’s part scanning and part parsing, mainly because doing it at this stage is convenient and saves me a lot of grief later. The end result of this step is a list of Tokens, which is highly memory efficient: each Token object requires only four integers, and for a source file 10K in size you’ll probably end up with between 2,000 and 5,000 tokens.
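I won’t pretend these are the exact four fields, but a token that is just four integers presumably looks something like this sketch (the field names are my guess):

// Four ints and no strings: the token's text is always a slice of the original source
class Token {
    final int type;   // token type constant
    final int start;  // offset of the first character in the source
    final int end;    // offset one past the last character
    final int line;   // line number, handy for diagnostics

    Token(int type, int start, int end, int line) {
        this.type = type;
        this.start = start;
        this.end = end;
        this.line = line;
    }
}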

The second step, which I’m not convinced will remain, is a rewrite step. There are a couple of awkward cases I don’t want to handle in the third step so I filter the list of tokens at this point.

The third step is to take the list of tokens and generate a Document. A Document is basically an Abstract Syntax Tree and looks a lot like a DOM.

The last step is to use the Visitor pattern to render an HTML document.
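A minimal sketch of what that might look like (illustrative names only, not JMD’s actual API): nodes accept a visitor, and an HTML visitor appends markup as it walks the tree.

interface Node { void accept(Visitor v); }

interface Visitor {
    void visit(Heading h);
    void visit(Paragraph p);
}

class Heading implements Node {
    final int level; final String text;
    Heading(int level, String text) { this.level = level; this.text = text; }
    public void accept(Visitor v) { v.visit(this); }
}

class Paragraph implements Node {
    final String text;
    Paragraph(String text) { this.text = text; }
    public void accept(Visitor v) { v.visit(this); }
}

class HtmlVisitor implements Visitor {
    final StringBuilder out = new StringBuilder();
    public void visit(Heading h) {
        out.append("<h").append(h.level).append('>').append(h.text)
           .append("</h").append(h.level).append(">\n");
    }
    public void visit(Paragraph p) {
        out.append("<p>").append(p.text).append("</p>\n");
    }
    // escaping of '<', '&' etc. is deliberately omitted in this sketch
}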

Tonight I have working code that does all four steps. It is still very much feature incomplete: lots of inline styling doesn’t work, and neither do reference images, reference links nor any kind of list. Still, it is correctly handling nested block quotes, implicit paragraphs, paragraph breaks and indented code blocks.

As of right now it is converting (the Code_blocks unit test from MarkdownSharp):

    code block on the first line
 
Regular text.

    code block indented by spaces

Regular text.

    the lines in this block  
    all contain trailing spaces  

Regular Text.

    code block on the last line

into this:

<pre><code>code block on the first line
</code></pre>

<p>Regular text.</p>

<pre><code>code block indented by spaces
</code></pre>

<p>Regular text.</p>

<pre><code>the lines in this block  
all contain trailing spaces  
</code></pre>

<p>Regular Text.</p>

<pre><code>code block on the last line</code></pre>

in 5 microseconds.

Let me repeat that: it looped through that conversion one million times in under 5 seconds… in pure Java! To compare, my regex solution is doing this in 600-700 microseconds (that’s based on the 1.006 MarkdownSharp code; 1.009 has improved block handling, which should make a difference).

Now you might look at that document and say it’s not that complicated (and you’d be right) but all the infrastructure is there. I know how I’m going to implement the rest and I can’t imagine anything (other than auto-linking) significantly affecting performance. What’s more even if it was 100 times slower I’d still be happy. I’m working on a worst case of it being 10 times slower when feature complete.

So far I haven’t used a single regular expression and don’t think I’ll need to apart from maybe link validation. I’ll document more about the design in future posts (after the code is released probably) to explain many optimizations you can make to this process as well as the overall parsing strategy. So far there has been almost zero need for lookahead and backtracking, which is generally what kills your performance (without complicated techniques like memoization).

Stay tuned…

JMD, Markdown and a Brief Overview of Parsing and Compilers

Like most comp sci students, I did a course on compilers in university. I also did some parsing and syntax trees in a data structures course. At the time I wrote a couple of parsers, including one for simplifying boolean expressions (De Morgan’s laws, a AND true = a, etc.) and another for evaluating arithmetic expressions.

So I’ve been familiar with the basics and the theory of compiler design but I was by no means an expert.

Recently, I launched JMD, a Java implementation of MarkdownSharp, itself a C# port and extension of the original Perl Markdown scripts. Like the original it relies heavily on regular expressions.

I like the idea of extending and improving Markdown but I’m no fan of using complicated regular expressions for that purpose. The current version is a milestone: it passes the unit tests and allows me to better build the replacement, which will be written more in the traditional compiler/translator sense.

Finite State Machines

A finite state machine (“FSM”) defines two things: a finite number of states and the transitions between them. This abstract machine models behaviour of some kind.

Often—but not always—such machines have a start state and one or more end states. FSMs are typically used for games, input processing and many other things.

One important characteristic of such machines is that they typically have no memory. They merely know the current state and what transitions there are.

Typically in computer science we’re more concerned with a special class of FSMs called deterministic finite state machines (“DFSMs”) or deterministic finite automata (“DFAs”). The key difference is that the transitions are deterministic, meaning that from any given state there is at most one transition for a given input symbol.

Another characteristic of finite state machines is whether they are cyclic or acyclic. If there exists a state from which some sequence of transitions leads back to that same state, the FSM is cyclic; otherwise it is acyclic. By definition, cyclic FSMs are capable of processing an infinite space of inputs. Acyclic FSMs are not.
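As a trivial concrete example, here is a DFA coded directly as a state variable and a switch. It accepts numbers like 42 or 3.14 and has no memory beyond the current state (my own sketch, not taken from any of the tools discussed here):

// states: 0 = start, 1 = integer digits, 2 = just saw '.', 3 = fraction digits
static boolean isNumber(String input) {
    int state = 0;
    for (int i = 0; i < input.length(); i++) {
        char c = input.charAt(i);
        boolean digit = c >= '0' && c <= '9';
        switch (state) {
            case 0: state = digit ? 1 : -1; break;
            case 1: state = digit ? 1 : (c == '.' ? 2 : -1); break;
            case 2: state = digit ? 3 : -1; break;
            case 3: state = digit ? 3 : -1; break;
        }
        if (state == -1) return false;   // no valid transition: reject
    }
    return state == 1 || state == 3;     // accepting states
}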

Regular Expressions

The most familiar DFA for most programmers will probably be regular expressions. A regular expression (“regex”) is a shorthand way of building a DFA to process text input by specifying the optionality, cardinality, capturing and ordering of character sequences. Typically programs will determine whether a given input matches a specified regex, or whether that regex can be found anywhere in the input, possibly capturing key parts of that input.

Undoubtedly regexes are useful but they tend to be overused. Consider them a shining example of how, once you have a hammer, everything starts to look like a nail. In particular, programmers will often try to use them to parse HTML or XML documents, which tends to be a pet peeve of Stackoverflow answerers such as myself.

The reason they are a poor choice is that HTML is not a regular language. What that means is that it is not possible to parse and validate HTML with a DFA. That’s because things like proper nesting of tags, ordering of opening and closing tags, etc require the machine to have some kind of memory, which DFAs don’t have.

You see that in regex-based Markdown parsers where limitations are imposed to make it possible to parse Markdown, such as introducing a nesting depth limit to certain block level elements.

To give you an example: if you’re looking for links, will “<a[ >]” find them? Most of the time? Yes. But not all the time. Consider the case of such expressions appearing in attributes, XML CDATA blocks, inside XML, CSS or Javascript comments, inside Javascript strings, etc. Regexes can’t detect these kinds of corner cases. They simply aren’t capable of it, not reliably at least.

As it turns out a fairly simple change greatly enhances the power of DFAs.

Pushdown Automata

Pushdown Automata (“PDAs”) make the simple change of adding a stack (hence “pushdown”), giving the machine a memory (of sorts) beyond the current state. To clarify, the machine can inspect the top of the stack when deciding which transition to take, and each transition can push onto or pop from the stack.

This allows PDAs to process a much broader set of languages. A language that can be processed by a PDA is called a context-free language (“CFL”), which is a superset of all regular languages.

If a PDA is deterministic (much like FSMs vs DFAs) then it is called a deterministic pushdown automaton (“DPDA”). Any language that can be parsed by a DPDA is called a deterministic context-free language (“DCFL”); these are a subset of CFLs.

Going back to regular expressions and HTML/XML parsing: with the addition of this stack, suddenly your parsing becomes much more reliable. You can stop looking for anchors when you enter a Javascript block or a comment and so on.
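Here is a sketch of what that stack buys you (purely illustrative, not code from any real parser): push a context when you enter a comment or script block, pop it when the block closes, and only count anchors found outside any context. It is deliberately naive (case-sensitive, no CDATA or string handling) but it already beats a bare regex on the corner cases above.

import java.util.ArrayDeque;
import java.util.Deque;

// counts "<a " / "<a>" occurrences, ignoring those inside comments or <script> blocks
static int countAnchors(String html) {
    Deque<String> context = new ArrayDeque<String>();
    int count = 0;
    for (int i = 0; i < html.length(); i++) {
        if (html.startsWith("<!--", i)) {
            context.push("comment");
        } else if (html.startsWith("-->", i) && "comment".equals(context.peek())) {
            context.pop();
        } else if (html.startsWith("<script", i) && context.isEmpty()) {
            context.push("script");
        } else if (html.startsWith("</script", i) && "script".equals(context.peek())) {
            context.pop();
        } else if (context.isEmpty() && html.startsWith("<a", i)
                && i + 2 < html.length()
                && (html.charAt(i + 2) == ' ' || html.charAt(i + 2) == '>')) {
            count++;
        }
    }
    return count;
}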

Many programming languages are CFLs but certainly not all.

Context-Sensitive Languages

The next broader class of languages is the context-sensitive languages, of which CFLs are a subset. C++ is the traditional poster-child for hard-to-parse languages; its grammar is also context-sensitive. Take the following expression:

A a = B();

Is that a method call or object construction? You can’t tell without looking up B in the symbol table.

Ruby and some other programming languages are context-sensitive. HTML/XML is also in this category and so is Markdown. For example:

> Block quote <p
> >

Is the second > on the second line the start of a nested block quote or the closing part of the paragraph tag at the end of the first line?

Not only is this context-sensitive, it’s also ambiguous. Technically we call this a non-determinism.

Once again returning to parsing a document for anchors, context-sensitivity bridges the remaining gap. Cases like being inside a comment or not represent the kind of context-sensitivity that is unmanageable for regular languages.

Grammars, Lexers and Parsers

A formal grammar (or just “grammar” for short) is a set of rules that describe a sequence of tokens. A token sequence is often called a sentence. The set of all sentences described by a given grammar is the language for that grammar. Note: a context-free language is described by a context-free grammar (“CFG”), etc.

Compilers, interpreters and translators are typically written in at least two parts (that are of interest to us): a lexer and a parser. Simple grammars may be implemented where the lexer and parser are combined. More complicated grammars may have many parsing steps.

A lexer (also called a lexical analyzer, scanner or recognizer) reads a stream of input symbols—most often characters—and converts them into lexemes or tokens. For example, a lexer for arithmetic expressions may convert:

4+5*7

into

NUMBER(4) OP(+) NUMBER(5) OP(*) NUMBER(7)
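A hand-written lexer for this really is only a few lines (a sketch; a real one would also track positions and reject bad input rather than skipping it):

import java.util.ArrayList;
import java.util.List;

// turns "4+5*7" into [NUMBER(4), OP(+), NUMBER(5), OP(*), NUMBER(7)]
static List<String> lex(String input) {
    List<String> tokens = new ArrayList<String>();
    int i = 0;
    while (i < input.length()) {
        char c = input.charAt(i);
        if (Character.isDigit(c)) {
            int start = i;
            while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
            tokens.add("NUMBER(" + input.substring(start, i) + ")");
        } else if ("+-*/".indexOf(c) >= 0) {
            tokens.add("OP(" + c + ")");
            i++;
        } else {
            i++;  // skip whitespace and anything else in this sketch
        }
    }
    return tokens;
}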

A parser is a program that interprets a stream of lexemes according to a set of rules. Those rules are defined in terms of lexemes and/or other rules, or even the same rule (although anything other than tail recursion in grammar rules tends to be problematic and is usually factored out, either manually because the parser won’t accept it or automatically).

This distinction is somewhat artificial and a little blurred. ANTLR makes the distinction that lexer rules are terminating rules and parser rules are non-terminating. “Terminating” means the rule is not defined in terms of any other rules and as such can at best resolve to a lexeme.

Types of Parsers

At the top level there are two main types of parsers.

The first category is LL parsers. The first L stands for “left to right”: the input tokens are read from the beginning. The second L stands for “leftmost derivation”, which basically means the parser is top-down: it attempts to match the input to a rule and in doing so will attempt to match input tokens to lexemes.

LL parsers vary in their degree of lookahead. The simplest LL parsers are LL(1), meaning they look ahead one token. An LL(1) parser cannot choose between:

r : A B
  | A C
  ;

An LL(2) parser however can. LL parsers with a finite amount of lookahead are also called LL(k) parsers. Arbitrary lookahead LL parsers are called LL(*) parsers.
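In a hand-written recursive descent parser that extra lookahead is literally just peeking one more token before committing. A sketch (peek() and match() are made-up helpers; A, B and C are token-type constants):

// an LL(2) decision for:  r : A B | A C ;
void r() {
    if (peek(0) == A && peek(1) == B) {         // one token of lookahead isn't enough,
        match(A); match(B);                     // two is
    } else if (peek(0) == A && peek(1) == C) {
        match(A); match(C);
    } else {
        throw new RuntimeException("expected A B or A C");
    }
}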

The other main category is LR parsers. The input tokens are still read from the beginning but instead of trying to match rules, the parser will look at the input and try to construct tokens. From those tokens it will then look for rules and match the input that way.

The most important subset of LR parsers are LALR parsers. Frankly it’s been too many years for me to remember the difference between LR and LALR parsers.

There are various strengths and weaknesses of each approach, which is beyond the scope of this post. Generally though, LL parsers are easier to understand but LR parsers are more often used by compiler compilers. A compiler compiler is a tool that takes a formal grammar and creates a compiler, parser, interpreter or translator.

Another class of parsers is based on parsing expression grammars (“PEGs”). I know far less about these. They’re fairly new but at least one Markdown parser, Lunamark, has been written with a PEG parser.

ANTLR

Apparently Joel and Jeff discussed this issue in this week's Stackoverflow podcast. They ruminated that it would have been better had Markdown been written using a formal grammar and a tool such as bison. I agree and they provide a pretty good example of how horrifying regex parsing of non-regular languages can be.

My own journey towards a non-regex solution first took me to ANTLR (“ANother Tool for Language Recognition”), which has a good GUI for debugging parsers. ANTLR is an LL(*) parser with some extensions. One of these is syntactic predicates. Syntactic predicates increase the recognition power of LL parsers by resolving some ambiguities that LL parsers otherwise can’t handle.

For example, consider the following Markdown:

> This is a *test of
> blockquoting*,
> emphasis and
> http://www.google.com
> (autolinking)
A paragraph.

When I was playing around with ANTLR this was problematic to parse. The natural way is to remove the blockquoting and then parse the remaining text, probably recursively.

One “problem” with LL parsers is that they attempt to match all the rule alternatives, so if you wanted to write the above as:

document : (para | quote)* ;
para     : ((~ '\n') '\n')+ ;
quote    : ('> ' (~ '\n')*)+ ;

you’ve actually created an ambiguous grammar because quote lines can also be matched by the para rule. Syntactic predicates seek to resolve this kind of ambiguity by saying things like “if it looks like a block quote then it’s a block quote” when choosing between possible alternatives.

Another problem I ran into was how to deal with things like auto-linking URLs.

Throw in limited XML/HTML parsing and it just became a hair-pulling exercise. Basically it just seemed to be the wrong tool for this particular job. Now that doesn’t mean it can’t be done. Someone more skilled than I with it no doubt could get it done. I could see the path forward and it wasn’t pretty however.

It’s a shame really because I like ANTLR. Terence Parr, the author of ANTLR and all-round language tool rock star, has written an excellent book The Definitive ANTLR Reference: Building Domain-Specific Languages and I can’t recommend this enough.

The Future of JMD

One commenter asked:

How is it better/different than MarkdownJ?

It’s a good question. The answer is that JMD will use a true parser rather than a hash of regexes to parse and process Markdown. It’ll also do a lot more than this, but more on that later when it’s closer to fruition.

The ANTLR exercise wasn’t a complete waste. I still had the option of seeing whether LALR parsing would offer a better alternative, but the ANTLR experiment had already solidified in my mind how I would go about parsing Markdown.

I’m convinced that a hand-coded parser is not only possible but it’s relatively straightforward.

Preliminary results look extremely promising. It’s not feature-complete yet and I won’t release it until it passes a significant portion of the unit tests inherited from MarkdownSharp.

Initial results indicate it will be 50-100x faster than a regex solution. The lexical analysis so far is being done in a single pass using virtually no memory and is taking between 10 and 20 microseconds to tokenize a document about the size of one of the unit tests.

Conclusion

I hope this post has been useful in three respects:

  1. To give a brief overview of the field of compiler compilers;
  2. To explain my thought process behind how to take JMD forward; and
  3. To explain why I’m doing what I’m doing.

Watch this space.

Announcing JMD: Java MarkDown (port of MarkdownSharp)

By a strange coincidence when I was looking for text editing options for another project, the Stackoverflow guys released MarkdownSharp last week, being a C# port and extension to what was originally written in Perl.

A couple of days later I have JMD (Java MarkDown) with the same extensions and unit tests. At this stage—and certainly while the code stabilizes and work progresses on passing all the tests—it is an almost line-for-line translation of the C# source, as this makes it easier to apply patches. This isn’t the Java Way; in particular, Java favours a more DI-centric approach, typified by Spring, rather than static configuration.

Ugliness and architectural issues aside, it will do for now. You can:

  1. Download it from the Downloads page; or
  2. Retrieve it from Github at git://github.com/cletus/jmd.git.

It is built with Maven and should build out of the box (assuming correctly configured Maven). Running on my machine:

  • Intel Q9450 CPU (2.66GHz);
  • 8GB DDR2 RAM;
  • Windows 7 Ultimate 64; and
  • Intel X25-M G2 80GB SSD.

The results are:

JMD test run

1   Amps_and_angle_encoding                                 OK
2   Auto_links                                              OK
3   Backslash_escapes                                       OK
4   Blockquotes_with_code_blocks                            OK
5   Code_Blocks                                             OK
6   Code_Spans                                              OK
7   Hard_wrapped_paragraphs_with_list_like_lines            OK
8   Horizontal_rules                                        OK
9   Images                                                  OK
10  Inline_HTML_Advanced                                    Mismatch
11  Inline_HTML_comments                                    OK
12  Inline_HTML_Simple                                      OK
13  Links_inline_style                                      OK
14  Links_reference_style                                   OK
15  Links_shortcut_references                               OK
16  Literal_quotes_in_titles                                OK
17  Markdown_Documentation_Basics                           OK
18  Markdown_Documentation_Syntax                           OK
19  Nested_blockquotes                                      OK
20  Ordered_and_unordered_lists                             Mismatch
21  Strong_and_em_together                                  OK
22  Tabs                                                    OK
23  Tidyness                                                OK^

Tests      : 23
OK         : 21 (^ 1 whitespace differences)
Mismatch   : 2

input string length: 475
4000 iterations in 6.301 seconds (1.575 ms per iteration)
input string length: 2356
1000 iterations in 6.390 seconds (6.390 ms per iteration)
input string length: 27737
100 iterations in 10.503 seconds (105.031 ms per iteration)
input string length: 11075
1 iteration in 0.037 seconds
input string length: 88607
1 iteration in 0.518 seconds
input string length: 354431
1 iteration in 4.992 seconds

To compare, on the same machine, these are the MarkdownSharp results in Visual Studio 2008:

MarkdownSharp v1.006 test run on \mdtest-1.1

001 Amps_and_angle_encoding                                OK
002 Auto_links                                             OK
003 Backslash_escapes                                      OK^
004 Blockquotes_with_code_blocks                           OK
005 Code_Blocks                                            OK
006 Code_Spans                                             OK
007 Hard_wrapped_paragraphs_with_list_like_lines           OK
008 Horizontal_rules                                       OK
009 Images                                                 OK
010 Inline_HTML_Advanced                                   Mismatch
011 Inline_HTML_comments                                   OK
012 Inline_HTML_Simple                                     OK
013 Links_inline_style                                     OK
014 Links_reference_style                                  OK
015 Links_shortcut_references                              OK
016 Literal_quotes_in_titles                               OK
017 Markdown_Documentation_Basics                          OK
018 Markdown_Documentation_Syntax                          OK
019 Nested_blockquotes                                     OK
020 Ordered_and_unordered_lists                            Mismatch
021 Strong_and_em_together                                 OK
022 Tabs                                                   OK
023 Tidyness                                               OK^

Tests        : 23
OK           : 21 (^ 2 whitespace differences)
Mismatch     : 2

MarkdownSharp v1.006 test run on \test-input

001 markdown-readme                                        OK
002 reality-check                                          OK

Tests        : 2
OK           : 2
Mismatch     : 0


MarkdownSharp v1.006 benchmark, takes 10 ~ 30 seconds...

input string length: 475
4000 iterations in 3827 ms (0.95675 ms per iteration)
input string length: 2356
1000 iterations in 4205 ms (4.205 ms per iteration)
input string length: 27737
100 iterations in 4736 ms (47.36 ms per iteration)
input string length: 11075
1 iteration in 23 ms
input string length: 88607
1 iteration in 191 ms
input string length: 354431
1 iteration in 1025 ms

So Java is roughly half the speed of C# in this regard, which is a bigger difference than I’d expect for what is essentially the same code. At this preliminary stage I can only attribute it to the .Net Regex libraries being better.

JMD is released under the same permissive MIT license as MarkdownSharp. Please feel free to use it, let me know what you think, or contribute.

Java: Why-oh-why still no multi-line strings?

Over the last year or two I’ve been doing a lot of PHP (which I really like). One of the things I use a lot is heredoc syntax, e.g.:

$query = <<<END
SELECT *
FROM tablename
WHERE condition1 = $field
AND condition2 = 345
END;

This is much more convenient than, say:

String query =
  "SELECT * " + // MUST remember to put a space here!
  "FROM tablename " +
  "WHERE condition1 = " + field + " " + 
  "AND condition2 = 345";

I’ve been doing quite a bit of Java recently and it’s really starting to bug me. I don't understand why pretty much every other imperative language invented in the last 15 years can have some form of multi-line string syntax but Java still doesn’t.
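For what it’s worth, the least error-prone workaround I know of is a tiny varargs helper that does the joining for you (my own sketch, nothing standard):

// joins the pieces with single spaces so you can't forget them
static String lines(String... parts) {
    StringBuilder sb = new StringBuilder();
    for (String part : parts) {
        if (sb.length() > 0) sb.append(' ');
        sb.append(part);
    }
    return sb.toString();
}

String query = lines(
    "SELECT *",
    "FROM tablename",
    "WHERE condition1 = " + field,
    "AND condition2 = 345");

It helps, but it’s still papering over something the language should simply have.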

Java 6 was released over three years ago. Java 7—thanks largely to the unexpected (yet welcome) inclusion of closures—isn’t due for nearly another year. Four years between releases.

Surely Java could have gotten something in that time. Even if it’s the rather ugly (imho) triple-quote syntax of Scala/Groovy it’d be better than nothing.

Anyway, I just needed to get that out.

Happy New Year for 2010.