Stackoverflow: Joel and Jeff want VC Money? Say What?

The big news today is that Stackoverflow—started by Joel Spolsky and Jeff Atwood as a programming Q&A site almost 18 months ago—is now looking for VC money. This is huge and deeply worrying. And it raises a whole raft of questions.

Vertical Growth

In its short life Stackoverflow has grown to be probably the largest programming Q&A site on the internet, supplanting the “evil hyphen site” and sitting just outside the top 1000 sites with over 4.5 million visitors a month. While it continues to grow, there’s only so big it can get because there are only so many programmers.

Joel says:

In 18 months we’ve accomplished that: we’ve got 6 million unique visitors every month.

Note: this figure includes Superuser (1M) and Serverfault (730K).

The issue of course is how to turn this traffic into revenue sufficient to cover the site’s running costs, development of the site and profit for its owners. Programmers are a hard group to monetize and you can see Joel and Jeff struggle with this when it comes to the usual method: advertising. See Responsible Advertising: Feed a Programmer, Our Amazon Advertising Experiment and Summary of Amazon Remnant Ad Experiment.

Horizontal Growth

It’s natural for companies that exhaust opportunities in their home markets to look at other markets that are related somehow, fuelled by (sometimes justified) paranoia that if they stop growing they’ll die or simply the need for incessant growth.

Microsoft Operating Profits by Division

Look at where Microsoft's profits come from and you’ll see their core business is Windows and Office. Forays into gaming, music, online services, mobile communication, etc have varied from being lacklustre to haemorrhaging money pits.

Google’s core business is search and advertising.

It takes a rare combination of talent, timing and luck to successfully branch into new areas as Apple did with online music, portable music players and the iPhone.

Joel gave a Google Tech Talk about Stackoverflow last May that’s instructive. A key point is that all software is social and that a given platform that works in one community that’s dropped into another may simply not work.

Programmers respond to the Q&A format of Stackoverflow because a programmer is predisposed to formulating questions, answering them and categorizing (tagging) them. What’s more, the subject matter is sufficiently objective for there to be right and wrong answers most of the time.

To put it another way: programmers talking about programming are self-organizing.

Some criticize the format for making discussion hard, which misses the point entirely.

Sister Sites

Joel and Jeff’s first attempts at horizontal market growth are the sister sites: Serverfault (for sysadmins) and Superuser (for general computer questions), which Jeff calls the League of Justice. There are also loose affiliations with How-to Geek and Doctype (from the guys behind Litmus).

While a million (ish) uniques per month is nothing to sneeze at, it’s clear that these sites haven’t grown like Stackoverflow has. See superuser.com and serverfault.com (this one has started to pick up recently).

Stack Exchange

Fog Creek has adapted the Stackoverflow code to create a hosted white label Q&A solution. For roughly $129/month you can have your own Q&A site to discuss everything from parenting issues to World of Warcraft (no joke).

Such sites rely on communities and building communities takes time. Stackoverflow succeeded in part because it leveraged the existing audiences of Joel and Jeff.

Careers

This is perhaps the more controversial move and something I covered in Joel Inc., Stackoverflow Careers and Jumping Sharks and Hard Numbers on Stackoverflow Careers. It’s something the pair have pushed repeatedly, going so far as heartfelt testimonials.

This one differs from the others in that the revenue model isn’t based on advertising: it’s based on the high cost of recruitment and the unique tie-in with Stackoverflow. My opinion is there simply aren’t enough active Stackoverflow users for this to be a real money spinner but time will tell.

Self-Funding and Control

Self-funding has huge advantages for any venture. If it’s possible it keeps control in the hands of the founders. Investors have their own agenda—being a return on that investment—which doesn’t necessarily coincide with the best long-term interests of the venture.

Some argue Transmeta was derailed by being forced to make a premature product launch.

When you own your own venture you can do whatever you want. Well, you can’t break the law but other than that, there’s not a lot you can’t do.

As soon as you have investors that changes. Investors have rights. Their money comes with conditions like how you can spend the company’s money, reporting requirements and so on.

It gets even worse when you’re a public company and worse again when you’re a publicly listed company.

Debt and Equity

There are two basic sources of funding for a venture: debt and equity.

Debt is borrowing money that you agree to repay the lender, typically at a fixed or floating rate over a given period of time. In the corporate world, there are many sources of debt: bank bills, overdrafts, commercial paper, bonds, swaps, traditional loans (secured and unsecured) and so forth. Many of these you have to be sufficiently large to have access to (eg corporate bonds are an option for the Toyotas of the world).

Equity is ownership of the company. Depending on your jurisdiction there are many forms of equity: ordinary shareholders, preferential shareholders and so on. They have different rights and a different pecking order for being repaid if the company is ever wound up (and typically the debt-holders will be ahead of all of them).

In between there are countless variations (eg convertible notes are a debt instrument that can be converted to equity in certain circumstances).

Companies generally strive for a healthy mix of debt and equity funding options.

The fallacy that many tech companies succumb to is that venture capitalists are their only source of funding. What’s more, VC funding is about the most expensive source of funding. A bank, being your typical source for a loan, will look at your plan and make a decision on your ability to repay the loan. Not your revenue but your income (being revenue minus expenses), both current and projected.

VCs typically look for blue-sky potential, often in ventures that don’t even generate revenue now or in the foreseeable future. Still any business plan will need to answer the questions of “when” and “how” the investors will get a return.

What Does Stackoverflow Want?

This move is surprising considering Joel wrote Fixing Venture Capital and Strategy Letter I: Ben and Jerry's vs. Amazon. Joel is somewhat vague on their motivations, saying only:

Now we’re biting off the bigger goal of changing the way everyone gets answers to their questions on the Internet, and that’s something we can’t do alone.

The infrastructure (hardware and bandwidth) is cheap (almost free) for Q&A. Stackoverflow.com seems to run on three Web servers based on Stack Overflow Architecture (a little outdated but those Web servers are low RAM and single CPU, which means dirt cheap) and Stack Overflow Network Configuration.

It’s fair to say that hardware is ludicrously cheap. Plentyoffish uses less than 10 servers for over a billion monthly page views.

Is it development? Is there some grand Q&A idea that’s going to take 50 man-years of development time to implement? Jeff has repeatedly said that apart from tweaking around the edges, Stackoverflow as a technology platform is basically “done”.

Is it to broaden the scope of Stackoverflow? What about a Wikipedia-like platform? What about the Wikipedia content? Is there any money in that?

Why Venture Capital?

This points to something ridiculously large in scale. Otherwise:

  • Why wouldn’t a bank fund it (based on existing income)?
  • Why wouldn’t Fog Creek fund it?

The last is worth mulling over. Fog Creek has ~34 employees. Joel once said for every $10,000/month Fog Creek made he hired a programmer. Fog Creek is a private company so its profits aren’t published but it would seem reasonable to assume that their revenue is in the order of $4-10 million per annum.
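
The lower end of that range can be reproduced with back-of-envelope arithmetic. This is only a sketch: it uses the two figures quoted above and assumes, purely for illustration, that every one of the ~34 employees counts as a hire under the "$10,000/month per programmer" rule of thumb.

```python
# Back-of-envelope estimate using only the figures quoted in the paragraph above.
employees = 34                    # "~34 employees"
revenue_per_hire_month = 10_000   # "$10,000/month" per programmer hired

# Assumption (not from Fog Creek): every employee counts as one such hire.
annual_revenue_floor = employees * revenue_per_hire_month * 12

print(f"${annual_revenue_floor:,} per annum")
```

That works out to a little over $4 million a year, which is where the bottom of the $4-10 million guess comes from. The upper bound is not derivable from these numbers.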

  3. The business itself could benefit from the publicity of getting an investment from someone who is thought of as being a savvy investor.
  4. The investor will add substantial value to the business in advice, connections, and introductions.

But he also says:

  1. The founders are not in it for their own personal aggrandizement and are happy to give up some control to make the business more successful.

Interesting. Could it be as simple as wanting to cash out?

I suspect (3) and (4) are more what it’s about but without knowing what they want to do it’s largely impossible to figure out the why.

Conclusion

It’s hard not to be concerned by this. The evil hyphen site became evil when they tried to take what was free content and monetize it using a subscription model. I don’t believe this is a likely outcome here but when you give up control, it’s a question of what your investors believe is the path to profitability that matters.

Many businesses fail because they try to apply something that worked one place to another area where it simply doesn’t work. I would hate to see this happen to Stackoverflow as I’m personally a big fan of the site.

There’s something to be said for leaving something that works well enough alone and turning your attention to building something else. Not everyone can or should be Microsoft or Google. Trying to be is typically a surefire way of converting success into failure.

Update: I misspoke regarding the Stackoverflow Web server configuration. Fixed.

Markdown, Inline Parsing and Badly Formed HTML

I haven’t had much time to work on my Markdown parser lately (sadly) but I thought it was worth posting an update on where I’m at. I have been digging deep into the dark depths of inline parsing. I have previously discussed the two modes of parsing Markdown, which I call block and inline.

But the block parsing is done (well, I have to go back and tweak one thing) so I’m onto the murky world of inline Markdown parsing.

Parsing Block Markup

Various Markdown implementations allow you to create markup blocks. There are usually quite strict requirements about how you can write these blocks. For example, you might need to put the start and end tags on separate lines such as:

<ul>
  <li>one</li>
  <li>two</li>
  <li>three</li>
</ul>  

I have a much more forgiving approach to this. My parser will take this “Markdown”:

This is a paragraph with a <ul><li>nested</li></li>block</li> with
some <hr>random<h2>other tags</h2> in it

and convert it to:

<p>This is a paragraph</p>

<ul>
  <li>nested</li>
  <li>block</li>
</ul>

<p>with some</p>

<hr>

<p>random</p>

<h2>other tags</h2>

<p>in it</p>

This part already works.

But it gets better. It will also take that same input stream and convert it to:

This is a paragraph

- nested
- block

with some

------

random

## other tags ##

in it

So to be clear: this will convert acceptable markup to markdown and filter out unacceptable markup (like script tags).

This will include parsing links and images into Markdown references.
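
To make the idea concrete, here is a minimal sketch of markup-to-Markdown conversion with filtering. It is not the parser described in this post: it leans on Python's standard `html.parser`, the whitelist (`BLOCK_PREFIX`) and the dropped-tag set (`DROPPED`) are assumptions chosen for illustration, and it ignores links, images and inline formatting.

```python
from html.parser import HTMLParser

# Hypothetical whitelist: the post only distinguishes "acceptable" from
# "unacceptable" markup, so these sets are illustrative assumptions.
BLOCK_PREFIX = {"h2": "## ", "li": "- "}
DROPPED = {"script", "style"}   # the content of these tags is discarded


class HtmlToMarkdown(HTMLParser):
    def __init__(self):
        super().__init__()
        self.out_blocks = []   # finished Markdown blocks, in document order
        self.cur_prefix = ""   # Markdown prefix of the block being collected
        self.cur_text = []     # text fragments of the block being collected
        self.skip_depth = 0    # greater than zero while inside a dropped tag

    def emit_block(self, suffix=""):
        # Collapse runs of whitespace, then store the block if it is non-empty.
        body = " ".join("".join(self.cur_text).split())
        if body:
            self.out_blocks.append(self.cur_prefix + body + suffix)
        self.cur_text, self.cur_prefix = [], ""

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED:
            self.skip_depth += 1
        elif tag == "hr":
            self.emit_block()
            self.out_blocks.append("------")
        elif tag in BLOCK_PREFIX:
            self.emit_block()
            self.cur_prefix = BLOCK_PREFIX[tag]
        elif tag in ("ul", "ol", "p"):
            self.emit_block()   # a block tag ends whatever text came before it

    def handle_endtag(self, tag):
        if tag in DROPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == "h2":
            self.emit_block(" ##")
        elif tag in ("li", "ul", "ol", "p"):
            self.emit_block()

    def handle_data(self, data):
        if not self.skip_depth:
            self.cur_text.append(data)


def html_to_markdown(html):
    """Return the Markdown blocks recovered from a fragment of HTML."""
    parser = HtmlToMarkdown()
    parser.feed(html)
    parser.close()
    parser.emit_block()   # flush any trailing text
    return parser.out_blocks
```

Returning a list of blocks rather than a single string sidesteps the question of how many blank lines belong between consecutive list items, which a real implementation would have to decide.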

Parsing Inline Markup

This is what I’m working on now. I’m still looking for a good generic way of doing this that correctly captures tag hierarchies (eg list items must be children to unordered and ordered lists). What I’m probably going to do is release a messy version of the code (being the current version) then go back and revisit it once I have a working implementation.

This is a good general principle: it’s far easier to fix something that’s complete and working than it is to constantly strive for perfection in incomplete code (basically artists ship).

One thing I’m debating is whether I require tags to be balanced. That means whether I accept this:

<b>this is<i>a</b> test</i>

Ideally I’d like to not accept this. XML/XHTML requires balanced tags but HTML either doesn’t or even if it does, most browsers are quite forgiving of this. XML treats markup essentially as a document tree whereas the HTML view is more like tags are, in certain circumstances, switches to turn behaviour on or off.
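
The XML-style "document tree" view amounts to a stack check: every closing tag must match the most recently opened tag. A minimal sketch of that check follows, assuming only paired tags appear (void elements such as hr or br are not handled) and using a deliberately simple regular expression rather than a real HTML tokenizer.

```python
import re

# Simplified tag matcher: captures an optional "/" and the tag name.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def is_balanced(markup):
    """Return True if every closing tag matches the most recently opened tag."""
    stack = []
    for match in TAG.finditer(markup):
        closing, name = match.group(1), match.group(2).lower()
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False   # closing tag does not match the innermost open tag
    return not stack       # anything left open is also unbalanced
```

With this check the overlapping example above is rejected, because `</b>` arrives while `<i>` is still the innermost open tag.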

Markdown Formatting

I went into this problem thinking I could build a document tree by turning

***this is a* test**

into

document
+- bold
   +- italic
   |  +- text: this is a 
   +- text: test

but that idea quickly falls down when you consider that this is valid Markdown:

**this is *a** test*

which basically parses to:

BOLD_ON
TEXT("this is ")
ITALIC_ON
TEXT("a")
BOLD_OFF
TEXT(" test")
ITALIC_OFF

Almost any Markdown parser will generate HTML from this that looks like this:

<strong>this is <em>a</strong> test</em>

That’s unfortunate because I like the document tree. But sadly the matching problem still remains because if a special sequence doesn’t have a matching close it is put into the document as a literal sequence.

This leads to some fairly pathological corner cases like:

*this [link* google][1]

  [1]: http://google.com

which will translate to

<em>this <a href="http://google.com">link</em> google</a>

and browsers will tend to break that link into two parts (where you can click on “link” or “google”).
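
The switch-style view of emphasis described above can be sketched as a small tokenizer. This is an illustration only, not the parser's actual code: it handles just `*` and `**`, prefers `**` whenever two asterisks are adjacent, skips HTML escaping, and the token names simply mirror the listing earlier in this section.

```python
NAMES = {"**": "BOLD", "*": "ITALIC"}
HTML = {"BOLD_ON": "<strong>", "BOLD_OFF": "</strong>",
        "ITALIC_ON": "<em>", "ITALIC_OFF": "</em>"}


def tokenize_emphasis(text):
    """Split text into (kind, value) tokens, treating * and ** as switches."""
    tokens = []
    open_at = {}   # marker -> index in tokens of its still-unclosed ON token
    buf = []       # characters of the text run being collected
    i = 0

    def flush_text():
        if buf:
            tokens.append(("TEXT", "".join(buf)))
            buf.clear()

    while i < len(text):
        if text.startswith("**", i):
            marker = "**"
        elif text[i] == "*":
            marker = "*"
        else:
            buf.append(text[i])
            i += 1
            continue
        flush_text()
        if marker in open_at:
            del open_at[marker]
            tokens.append((NAMES[marker] + "_OFF", marker))
        else:
            open_at[marker] = len(tokens)
            tokens.append((NAMES[marker] + "_ON", marker))
        i += len(marker)
    flush_text()

    # A marker with no matching close goes into the output as literal text.
    for marker, index in open_at.items():
        tokens[index] = ("TEXT", marker)
    return tokens


def render(tokens):
    """Turn the token stream into HTML without attempting to balance tags."""
    return "".join(HTML.get(kind, value) for kind, value in tokens)
```

Because the switches are independent, `render(tokenize_emphasis("**this is *a** test*"))` yields the same overlapping strong/em output shown above, which is exactly why a strict tree does not fall out of this representation.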

But work progresses.

Conclusion

Even with a lot of the Markdown spec being parsed my transformation times on basic documents (eg a couple of lists, a block quote and some paragraphs) are still under 60 microseconds (roughly) and that’s with some messy array manipulation and temporary object creation that I plan to revisit and clean up.

At this stage I’m hoping to have something committed and available for comment within two weeks. It won’t be pretty but my goal is to get feedback earlier rather than later.

Standing on the Outside

This week I read Life outside .NET, or “How to check out your neighbours”. I really like posts like this. They’re instructive about the culture of a particular community.

For over a decade I’ve been a Java developer (since JDK 1.0.2). Like most Java developers I have a love-hate relationship with the language, the libraries and Sun. Java didn’t invent the virtual machine but it certainly popularized it. 5-10 years ago (in particular) Java was a hotbed for the development of many technologies, concepts and frameworks.

As the author notes, MVC and DI (dependency injection) are simply assumed in Javaland. It’s true. Good luck finding a non-MVC Web framework in Java out of the dozens that exist.

My experience and exposure with .Net is at best peripheral. ASP.NET always struck me as somewhat primitive in the sense that it’s what would’ve happened had JSP been taken to the nth-degree instead of being supplanted by Struts and all that came after. That’s not to say ASP.NET is bad or doesn’t do its job but to a Java developer it seems somehow crude.

Beyond the boring and irrelevant comparisons of Java vs. .Net performance, the more interesting comparison is as a proxy for decentralized vs. centralized platform progression.

The Microsoft Way definitely has its advantages. Where once Redmond was playing catch-up on Java (technically speaking), Sun’s inability to lead (and no clue where they were going if they could) has left Java largely stagnant. Java 7 is due at the end of the year but has been delayed years. Thankfully it’s now getting closures if for no other reason than we can all stop bitching about it (frankly, I think some form of function pointers or delegates in “C#-speak” will be sufficient for 99% of use cases).

It can be useful not to have a diaspora of Web development frameworks (even at the cost of innovation). Take a Struts developer and put them on a Wicket or Tapestry project and their experience won’t be especially applicable.

It will certainly be interesting to see if Oracle can provide more leadership than Sun. Oracle was always heavily invested in Java so I’m hoping Java isn’t simply collateral damage to Larry’s acquisition of Sun’s server business. Bizarrely Oracle seems committed to JavaFX of all things.

For those of you unfamiliar with it, JavaFX is Sun’s “me too” Flash alternative and a prime example of Sun’s boondoggles of recent years.

I for one welcome our new insect overlords. I’d like to remind them that as a trusted blogger, I can be helpful in rounding up others to toil in their underground sugar caves.

Markdown, Block Parsing and the Road to Hell

I thought it time to update my status on this particular undertaking, which so far has ended up being far more massive than originally envisioned.

The overall design of the Markdown parser is that there are two parsers… kinda. There is a parser to break your document into blocks and another to interpret the inline content within those blocks. As soon as I made this realization, everything just got a whole lot easier.

I use the term “block” (and “inline”) because those are the terms HTML uses (“block elements” and “inline elements”). Of course HTML also gets more complex (eg “replaced” vs “non-replaced” elements and inline-block, floats, etc) but fundamentally you can think of a Markdown document, or any hypertext document, as consisting of block and inline elements.

Markdown parsers will often talk about “blocks” and “spans” instead.

Block Parsing

The first level of parsing of Markdown is into blocks.

Such a document can be viewed as a tree. The root node is the document. Every node below that is either a block or an inline node. The tree can be arbitrarily deep and there are certain rules about relationships in that tree. For instance:

  • Block nodes are only ever children of other block nodes (counting the root Document node as a block node);
  • Paragraphs can only contain inline elements;
  • List items must be children of lists;
  • and so on.

The goal of any parser is to take an input and build a valid syntax tree based on the rules defined.
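
The tree rules listed above lend themselves to a mechanical check. The following is a minimal sketch, not the parser's real data model: the `Node` class, the `BLOCK` and `INLINE` kind names and the `validate` function are all hypothetical and exist only to show how the three stated rules could be enforced.

```python
from dataclasses import dataclass, field

# Hypothetical node kinds, chosen for illustration only.
BLOCK = {"document", "paragraph", "list", "list_item",
         "block_quote", "code_block", "header"}
INLINE = {"text", "bold", "italic", "link"}


@dataclass
class Node:
    kind: str
    children: list = field(default_factory=list)


def validate(node, parent=None):
    """Return a list of rule violations found in the tree rooted at node."""
    errors = []
    if parent is not None:
        # Block nodes are only ever children of other block nodes.
        if node.kind in BLOCK and parent.kind not in BLOCK:
            errors.append(f"block '{node.kind}' inside inline '{parent.kind}'")
        # Paragraphs can only contain inline elements.
        if parent.kind == "paragraph" and node.kind not in INLINE:
            errors.append(f"'{node.kind}' inside a paragraph")
        # List items must be children of lists.
        if node.kind == "list_item" and parent.kind != "list":
            errors.append(f"list_item inside '{parent.kind}'")
    for child in node.children:
        errors.extend(validate(child, node))
    return errors
```

A check like this is mostly useful as a test-time assertion on the parser's output, since a correct parser should never build an invalid tree in the first place.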

This part of the problem for what I’m writing is now done. This includes code blocks, paragraphs, block quotes, ordered and unordered lists, headers and horizontal rules. Tables I plan to return to later.

List Parsing

Today I came across Three Markdown Gotchas, which I hadn’t seen before but it opened my eyes to one particular area of difficulty I had: list processing. Go to StackOverflow, ask a question and type in:

- one
 - two
  - three
   - four

and you probably won’t get what you expect. You get this:

<ul>
  <li>one
    <ul>
      <li>two</li>
      <li>three</li>
      <li>four</li>
    </ul>
  </li>
</ul>

Let me give you some background: Markdown has the concept of indents. Based on a predefined tab width (typically 4), a single tab or 4 spaces represents one indent. That’s important because code lines are preceded by one indent. A non-indent space is sometimes ignored at the beginning of a line, for example at the start of a paragraph line or the continuation of an existing one.
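
The notion of an indent can be pinned down with a few lines of code. This is a sketch under two assumptions that are not stated in the post: tabs advance to the next tab stop (so a space followed by a tab still counts as one indent), and the helper name `split_indent` is made up for this example.

```python
def split_indent(line, tab_width=4):
    """Return (indents, leftover_spaces, rest) for one line of input."""
    columns = 0   # visual width of the leading whitespace
    pos = 0       # number of whitespace characters consumed
    for ch in line:
        if ch == " ":
            columns += 1
        elif ch == "\t":
            # Assumption: a tab advances to the next multiple of tab_width.
            columns += tab_width - columns % tab_width
        else:
            break
        pos += 1
    return columns // tab_width, columns % tab_width, line[pos:]
```

The second element of the result is the "non-indent space": one to three leading spaces that do not add up to a full indent and that the parser may choose to ignore.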

The original Markdown “spec” says that nesting list items is done by preceding the line with one more indent than the previous line. In vanilla Markdown the above sequence would come out as:

<ul>
  <li>one</li>
  <li>two</li>
  <li>three</li>
  <li>four</li>
</ul>

because none of the lines has a leading indent. That’s logical and consistent. Jeff’s point is basically that even one space should indicate intent and be interpreted as nesting. Sounds reasonable right? Maybe. The problem is that it leads to unintended complexity.

Go back to the above example and put one, two then three spaces in front of the first list item. Watch the preview pane to see how the list changes. The implied nesting changes all over the place. Logical? I think not.

But it gets worse.

- one

 two
 - three

 four

comes out as

<ul>
  <li>
    <p>one</p>
    <p>two</p>
    <ul>
      <li>three</li>
    </ul>
    <p>four</p>
  </li>
</ul>

Okay… bear in mind that there are spaces before two and four so that you continue the list item. Otherwise they would be interpreted as separate paragraphs. But what if you want four to continue the nested list item three? How much indentation do you need? It turns out that the magical number is anything from 5 to 11.

But it gets worse. Put one space before one and suddenly one and three are the same list so four is now indented so far that it becomes a code block run-on from three. Add a second space to the front of one and for some reason it returns to the original nesting even though one is now indented more than three. Huh?

I’ll leave an examination of the MarkdownSharp source code as to the reasons for this as an exercise for the reader. Suffice it to say that it all stems from the idea that one (more) space indicating nesting is somehow more intuitive.

The Road to Hell

The road to hell is paved with good intentions. It’s one of my favourite sayings. We programmers as a whole are unreasonable people. Through a combination of hubris, stubbornness and even laziness we have a tendency to throw out what’s been done before or simply make breaking changes because we prefer it, we think others will prefer it, we don’t appreciate that someone else may have to deal with the consequences or simply out of ignorance as to what led to the original decisions.

We all do this, myself included. It’s worst when it not only manifests itself in company culture but it’s enshrined. Take Microsoft as a prime example. Internet Explorer has “Favourites”. What the hell are favourites? Well, they’re bookmarks. But IE can’t call them that because Netscape called them that first and Microsoft wanted to differentiate themselves and their products. This of course led to many conversations I know I had at the time that went something like this:

New user: What’s a favourite?
Me: It’s a bookmark.

I couldn’t help but laugh out loud when I first read C# and saw all the things copied from Java had been renamed, sometimes with significantly worse names. Java’s final as C#’s sealed springs to mind. You can just tell that there were people dedicated to the task of finding new names for Java concepts and keywords. It’s just sad.

Hyperbole aside, I digress.

The point of all this is that:

  1. Often things that came before you were done for a reason, whether or not you’re aware of it and whether or not you agree with it if you are;
  2. Breaking changes have a high price, so much so that the cure is often far worse than the disease and your delicate sensibilities be damned. Internal consistency and syntactic purity are overrated. Interestingly those overly encumbered with such sensibilities seem to have a disproportionate tendency to become Python programmers.

List Sanity

For this reason my parser has returned to what is probably the original implementation. That is:

  1. A leading non-indent space is ignored before list items. That is, it implies no meaning and is discarded so there is no difference between 0 and 2 leading spaces before a list item;
  2. Up to one leading indent (meaning one tab or 0 to 4 spaces) is consumed from each subsequent line until a new list item is hit or a line with no leading spaces is met. The subsequent list item will be a part of the same list. Text with no leading spaces will end the list and form a new paragraph; and
  3. All lines that continue the list item are combined (with their leading tab or 0 to 4 spaces consumed) and they form a new block context. Meaning they are then parsed as if they were a separate input, meaning it can contain new lists, block quotes, code segments and so on.

(3) provides a lot of consistency. It means that if you have a list item followed by a line with two indents that second line will be a code block (one indent marking a continued list item, the second will be interpreted as a code block within the list item block context).

To me this is supremely more logical—and easier to implement—but I guess if you’re really attached to nesting list items with a single space and figuring out that 5 to 11 spaces is the magical number of spaces to continue a nested list item then you’ll hate it. Too bad.
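
The three list rules above can be sketched in a few dozen lines. This is an illustration of the rules as stated, not the actual implementation: it handles only unordered bullets, the names `BULLET`, `strip_one_indent` and `split_list_items` are invented for the example, and the recursive step (feeding each item's lines back into the block parser) is left out.

```python
import re

# Rule 1: up to three stray leading spaces before a bullet carry no meaning.
BULLET = re.compile(r" {0,3}[-*+] +(.*)")


def strip_one_indent(line, tab_width=4):
    """Rule 2: consume at most one indent (a tab, or up to tab_width spaces)."""
    if line.startswith("\t"):
        return line[1:]
    stripped = line.lstrip(" ")
    return line[min(len(line) - len(stripped), tab_width):]


def split_list_items(lines):
    """Return (items, rest); each item holds the lines of its block context."""
    items = []
    i = 0
    while i < len(lines):
        line = lines[i]
        bullet = BULLET.fullmatch(line)
        if bullet:
            items.append([bullet.group(1)])   # a new item in the same list
        elif not items:
            break                             # input does not start with a list
        elif line.strip() == "" or line[0] in " \t":
            items[-1].append(strip_one_indent(line))   # continuation line
        else:
            break                             # unindented text ends the list
        i += 1
    for item in items:
        while item and item[-1] == "":
            item.pop()                        # drop trailing blank lines
    return items, lines[i:]
```

Rule 3 then falls out naturally: each inner list in `items` is a self-contained input. A continuation line that started with two indents still has one indent left after stripping, so the nested parse sees it as a code block, and a line that started with one indent followed by a bullet becomes a nested list.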

The nested block context from (3) has one exception. If the nested block context would result in a single paragraph then that paragraph is unwrapped to being inline content of the list item. This has one important effect, which some may consider a breaking change. Namely this Markdown:

- one
- two
- three

and

- one

- two

- three

will both be interpreted as being:

<ul>
  <li>one</li>
  <li>two</li>
  <li>three</li>
</ul>

whereas MarkdownSharp will interpret the latter as:

<ul>
  <li><p>one</p></li>
  <li><p>two</p></li>
  <li><p>three</p></li>
</ul>

which is something I've previously documented and disagreed with.

But this could be interpreted as a breaking change so I will probably add a special case for just this scenario as an option.
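
The unwrapping exception, together with the option to switch it off, might look something like the following sketch. The function name, the `(kind, html)` pair representation and the `unwrap_single_paragraph` flag are all hypothetical; the post does not say how the option would be exposed.

```python
def render_item(blocks, unwrap_single_paragraph=True):
    """Render one list item from (kind, html) pairs of its nested blocks."""
    lone_paragraph = len(blocks) == 1 and blocks[0][0] == "paragraph"
    if lone_paragraph and unwrap_single_paragraph:
        # The exception: a single paragraph becomes inline content of the item.
        return f"<li>{blocks[0][1]}</li>"
    inner = "".join(
        f"<p>{html}</p>" if kind == "paragraph" else html
        for kind, html in blocks
    )
    return f"<li>{inner}</li>"
```

Passing `unwrap_single_paragraph=False` keeps the paragraph tags around a lone paragraph, which matches the second output shown above.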

Conclusion

The block parsing portion is done. The code is ugly and needs to be refactored (again) but it works. I still have an issue with too many temporary objects being created (mainly because it simplified some code) and I’ll need to go back and eliminate that.

What’s been interesting is that I’ve now rewritten the block parsing at least four times before it felt right. John Carmack once said he needs to write something five or six times before he gets it right. I agree with his sentiment. It takes that long to truly understand the domain, in my opinion.

The inline parsing has been a completely different set of problems. I will have a follow-up post on that soon.