Is it Politically Incorrect to Criticize Open-Source Software?

Political correctness is one of the most virulent and pervasive social diseases of the modern era, one for which I have no time whatsoever. Take, for example, the self-evident phenomenon that programming is a male-dominated industry with one study finding in 2002 that:

Currently, [Computer Science] is a major that has a very low percentage of women. At top research universities, about 15 to 20 percent of majors are female,

Yet some zealots write off anything other than a 50-50 split as the product of sexism or oppression.

Of Listening and Maps

To me, it is blindingly obvious that men and women are wired differently. That’s not to say that there aren’t female programmers or that women should in any way be discouraged or steered away from programming. Quite the contrary: I’m a stalwart believer in meritocracy. Nor does such an observation predict the outcome of any one individual (alone or compared to another). But the fact remains that if you take 1000 men and women in otherwise equal circumstances, those who end up programmers will be disproportionately male.

The problem begins when one group is given preferential treatment over another to cure the alleged imbalance. Let’s say that the faculties of Stanford, U. Washington and MIT got together and decided they would make it easier for female applicants to computer science courses to be accepted. This, to me, does two things:

  1. It reduces the overall quality of computer science graduates since, by definition, you’re not getting the pool of those with the most merit; and
  2. Female programmers could end up being viewed as somehow less qualified than their male counterparts by virtue of the perception that they had an easier ride.

But the point isn't whether this is true or not or what the reasons for it are if it is true. The point is can you even discuss it without being accused of committing some serious social transgression?

I hear you asking: what the heck does this have to do with open source software? Well, I’m glad you asked. In some ways, the situation is exactly the same.

Not Un-Delicious

Recently I posted Spring Batch or How Not to Design an API, which generally received positive feedback. Dave Syer left a comment (emphasis added):

As one of the authors of Spring Batch I also find this article a bit harsh, especially coming from someone who has not been active on the forum and not raised any issues as far as I can tell in the Batch JIRA. Spring projects are community projects and we do care a lot about what people think. All of the issues above would benefit from discussion and clarification, but this is not the right forum, so I warmly invite interested parties to follow up on the Batch forum (http://forum.springsource.org?forumdisplay.php?f=41) or in JIRA (http://jira.springframework.org/browse/BATCH).

That comment got me thinking: what ethical responsibilities do we, as programmers, have with respect to open source software?

For the sake of completeness, I will point out that I have raised a few issues against Spring proper (not Spring Batch) 2.0.x and Spring Webflow, all raised on 18 April 2007. They are:

Issue | Description | Timeframe | Resolution
SPR-3390 | Better error handling from the <form:errors> tag | n/a | Unresolved
SPR-3389 | Nicer handling of Java 5 enums by the Spring MVC form taglib. | 1 week | Fixed in 2.0.x
SPR-3388 | <form:*> tag library assumes the value of a Java 5 enum property is the value of toString() | 1 week | Fixed in 2.0.x
SPR-3387 | DataBinding error with Java 5 enums | 21 months | Fixed in 3.0.x

Are We There Yet?

These were all issues affecting what I was doing at the time (using Spring MVC and Webflow). The second and third issues were fixed in a useful time period. In fact, I was surprised how quickly they were resolved but I attribute that to the quality of the Spring and Spring Webflow projects in terms of the libraries themselves, the teams developing them and the technical leadership driving them.

Sadly, my experience has been that this is the exception rather than the norm.

Anyway, the key point from the above is that when you encounter issues such as these they need to be fixed in a usefully short time period or they are for all intents and purposes not fixed. Interestingly, the concept of usefully short time frames was one of the motivating factors behind StackOverflow.

When you encounter a problem you can’t solve, the immediate response is usually to go to Google and look for an answer. You could ask a question on a forum but forums, well, suck. Even if they have a critical mass of an audience it may take a day or more to get a response.

This is not a usefully short time period.

It’s a Question of Hats

When you’re designing a technical solution or selecting a library or framework to solve a particular programming problem, and you’re unfamiliar with some or all of the choices, you go looking for reviews, tutorials and blog posts. The aim is to ascertain the suitability of each library or framework and to identify any risks or shortcomings it has.

There is nothing worse than selecting a tool only to find out halfway through the process it has a massive problem. It’s too late to change, it’ll take too long for the development team to fix and you’re faced with a messy workaround.

This motivated me to write my previous post because that was exactly my situation.

What Dave implies is that I, as a user of Spring Batch and a blogger, have a responsibility to pursue these issues through “official” channels before (or instead of) posting any kind of public criticism. Or perhaps it’s that I haven’t earned the right to be critical by virtue of being insufficiently community-minded?

So what do I, as a programmer and a blogger, owe any open source tool I use?

As I see it:

  • as a professional programmer I have a responsibility to my employer or client to get the job done first and foremost;
  • as a blogger—or any kind of writer really—I have an ethical responsibility to write with integrity. Particularly as a review, this means giving credit where credit is due but also being critical where justified; and
  • as an open source citizen I have a moral imperative to support such community-driven efforts.

So which hat should I wear?

Of these the first is arguably the most important. You have a duty of care to your employer and/or client that has legal standing and failure to act in the best interests of that party can be grounds for professional misconduct, breach of contract or worse.

Of Apples and Oranges

In one respect I can understand Dave’s displeasure. After all, he’s not getting paid for his involvement in Spring Batch (as far as I know). Having your work criticized at the best of times is usually hard but it sticks in the throat even more when you’re volunteering.

This brings us to the next interesting question: is open source software held to a different standard than commercial software?

Clearly I think it is as everyone’s favourite whipping boy, Microsoft, can surely attest. Is that fair? Should it be held to a different standard?

Some will argue that commercial software, by virtue of funding that could amount to billions of dollars in some cases, has such a huge advantage that open source software should be given a free pass or at least held to a lower standard. After all, it needs every advantage it can get, right?

My opinion is that the opposite is true: the cost structure of most open source software is, well, zero. Put up a Website, register a domain name, host some forums and host a source control repository and, unless you’re wildly successful, you’re spending maybe $300 a year. Users debug your software for you. Programmers volunteer their time.

How can a commercial software vendor possibly compete with that after they pay for programmers, office space, hardware, internet connections, office staff, sales, marketing, accounting, compliance and so on? In some cases—most notably Windows—the vendor is so entrenched that it constitutes a virtual if not actual monopoly.

But in many other cases, such as databases, commercial vendors continue to exist because they produce a better product (than say MySQL). Oracle is expensive and that cost is certainly not justifiable in many business models. Nor is its feature set or performance benefit applicable in all circumstances. But whatever the case, Oracle is clearly better than MySQL and to argue otherwise is naive, ignorant or both.

So with so many advantages shouldn’t we hold open source to the same standard (if not a higher standard)?

Of Female Programmers and Open Source

And at last I return to the original point.

The danger of holding open source projects to a lower standard is that you will end up with a bunch of mediocre (if not outright terrible) projects. And to be perfectly blunt, this is exactly what has happened to a large number of Apache projects.

The documentation of many Apache projects is beyond woeful, even for (allegedly) mature and popular frameworks like log4j. I can understand this because it’s hard enough to get people to write documentation when it’s their job let alone when they’re volunteering. But it’s just not good enough.

Taken to extremes, holding open source to a different standard crosses the line from advocacy to pandering.

Let products—irrespective of their source—stand on their merits and let the chips fall where they may.

If you hold something to a higher standard you will get a better product.

Conclusion

Ultimately, after having given it some thought, I am unapologetic about my criticisms of Spring Batch. If I’d made some factual error, that would be one thing. But what I wrote is my opinion, and you can do with that what you wish whether you agree with it or not.

I post what I believe other developers would want to know and should know about Spring Batch before plunging in.

I say this with all due respect to Dave and the other team members of Spring Batch. I really do appreciate their efforts but the problems I’ve raised are in some cases fundamental to the overall design. These are things that are not going to be fixed or changed in any kind of useful time frame if they are even fixed or changed at all.

Whether or not anyone has the right to criticize open source is an irrelevant distraction. The legitimacy and accuracy of those complaints is paramount.

Spring Batch or How Not to Design an API

Let me start out by saying I’m a huge fan of the Spring framework. It revolutionized enterprise Java development, supplanted J2EE and is probably the single most important Java development in its turgid history.

One of the great things about Spring is that it is largely non-invasive and the documentation is extensive and, for the most part, excellent. The Spring reference manual is running at around 600 pages these days.

In fact about the only negative thing I can say about Spring is that if you get stuck there’s a good chance you’ll have a hard time finding an answer. So much of the Spring-related information is contained in mailing lists (my pet peeve) and forum posts (often unanswered questions), two of the mediums that in part led to the creation of StackOverflow.

Compare that to something as ubiquitous and venerable as Apache’s log4j, where your only real options (beyond the meagre introduction) are to read the source code or to buy some book. Poor or no documentation seems to be the hallmark of Apache projects to the point that my default position when evaluating an unfamiliar one is to be wary.

I’ve read about Spring Batch over the last year or two. Batch jobs tend to be one of those things that we as programmers hate doing, probably because they’re messy. There is a small kernel of technical solution surrounded by layers and layers of questions like:

  • How is the job started?
  • How do we monitor it?
  • How is it restarted manually?
  • How is it called?
  • When is it called?
  • How does it interact with other such jobs?
  • and so on.

But it’s a common problem so I’d approached Spring Batch with high hopes—especially given it's now up to version 2.0.1—that it might ease some of the pain. Unfortunately not but at least it’s a good example of how not to design an API.

The Big Picture

Spring Batch’s unit of work is called a Job. A job involves one or more Steps that consist of Chunks. A particular run of a Job is called a JobInstance and identified by the JobParameters passed to it. Each attempt at a JobInstance (successful or otherwise) is a JobExecution. The state of a particular execution is stored in a JobRepository. A particular Job with a given set of JobParameters is started using a JobLauncher.

A given chunk of work has an ItemReader for a source that can be anything, e.g. a CSV file, a database query, data read from a TCP connection or whatever you like. Data is written out using an ItemWriter, which again can be anything. In between there is optionally an ItemProcessor that transforms items read to items to be written.

Sounds good? It did to me, particularly the simple abstraction of item readers and writers. But the honeymoon was quickly over.

Some background is required here. I needed to load some CSV files once per day into an in-memory cache (Oracle Coherence). That’s a straight load and overwrite existing entries. Neither the CSV files nor the cache are transactional (although Coherence, I believe, can support JTA transactions) and it’s eminently rerunnable: it’ll just overwrite the same data.

Why do I Need a Transaction Manager?

The basic job looks something like this:

<job id="loadData">
  <step id="loadDataStep">
    <tasklet>
      <chunk reader="reader" writer="writer"/>
    </tasklet>
  </step>
</job>

That’s assuming you use the batch schema by default. The names refer to other beans in your application context.

But if you try and load the above you’ll get exceptions thrown if you don’t have a Spring bean named “transactionManager” visible to the job. Why? I’m not doing transactions! This is enough of a problem that Spring Batch has a ResourcelessTransactionManager to use in such situations.

Why do I Need a Job Repository?

Spring Batch is big on the concept that a given job only be run once (successfully). If it fails, the idea is generally to allow it to be restarted and continue. As such, the Job needs to maintain state about attempts to run, what succeeded and so on. That’s all well and good except that sometimes it’s just not appropriate.

Thing is, I don’t care about any of that. My job can run as many times as it pleases. Why am I being forced down this path?

Spring Batch comes with two implementations of the JobRepository interface: one saves it to a database. The other saves it to memory, being simply a collection of static Maps. I chose to use the map implementation because, like I said, I didn’t need the state anyway.

Why Is the CSV Parser So Strict?

CSV parsing is generally done using the bundled FlatFileItemReader. There is a lot of configuration that goes into instantiating one of these, a ridiculous amount in fact.

First problem: when I specify the column names I have to get the exact number of commas right or the parser bombs out (with an exception about the incorrect number of tokens). Can’t I just specify up to the fields I’m interested in rather than putting 30 commas at the end just for the sake of it?

Second problem: if you have different record types in your file, each must have the correct config to parse it whether you use it or not. My first file had a header and footer record. I need config to process these records (technically I can skip the first one with a skip parameter but that doesn’t work for the last).

Third problem: if you don’t care about certain fields and try to ignore them with config like “,,,price,,name” it actually reads them anyway and puts them in a result map with a key of the empty string.

Fourth problem: I did have a price field and it was getting overwritten by nonsense. It took a while to find this one but basically the reader has a concept of distance. Distance is a static field (why??) so you can’t change it with Spring config and it defaults to 5. This means that it will ignore the last 5 characters of your result bean property names when trying to set a value (it’ll try an exact match first then ignore the last character and so on up to 5).

Can you spot the problem? The field “price” is 5 characters long. That means one of my empty key-value pairs with an empty string key matches the price property. What the…?!
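To make the mechanics concrete, here is a minimal sketch of the matching rule as described above. It is not Spring Batch's actual code: the class name DistanceMatchSketch, the match method and the property names are hypothetical, and the rule (exact match first, then ignore up to 5 trailing characters of each property name) is simply the description above turned into Java.

```java
import java.util.List;

// Hypothetical stand-in for the matching behaviour described in the post;
// this is NOT Spring Batch's actual implementation.
final class DistanceMatchSketch {

    // Assumed default, per the post: up to 5 trailing characters are ignored.
    static final int DISTANCE = 5;

    /**
     * Returns the first property the key matches, trying an exact match first
     * and then ignoring 1..DISTANCE trailing characters of each property name.
     * Returns null when nothing matches.
     */
    static String match(String key, List<String> propertyNames) {
        for (int ignored = 0; ignored <= DISTANCE; ignored++) {
            for (String property : propertyNames) {
                if (property.length() < ignored) {
                    continue; // too short to drop this many characters
                }
                String prefix = property.substring(0, property.length() - ignored);
                if (prefix.equals(key)) {
                    return property;
                }
            }
        }
        return null;
    }
}
```

Under these assumptions, an empty key checked against a bean with the properties price and quantity resolves to price once five trailing characters are ignored, which is the kind of accidental overwrite described above.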

Why Is the CSV Reader so Slow?

When loading 100,000 records this initially took 50 seconds. That’s ridiculously slow. Turns out there was a good reason for this.

The “normal” means of creating a value object for your reader is to use a prototype-scoped Spring bean and apply the Spring data binding with the name-value pairs retrieved from the parsing step. This is ridiculously slow. I replaced it with a simple method that just instantiated a new object and manually set its properties and the load time went to under 4 seconds.

That’s a ridiculous difference: an order of magnitude. You may lay the blame on reflection but not so. Ibatis has proven to me beyond a shadow of a doubt that flexible and robust reflection-based property setting can be done very quickly. Of course doing it manually is going to be faster but that much faster?

I don’t mind writing that code either except for this: it’s one less thing Spring Batch is doing for me and another chunk of basically boilerplate code I’d rather not write but have to.
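For illustration, the hand-written mapping could look something like the sketch below. Everything in it is an assumption: the PriceRecord value object, the ManualMapperSketch class and the column positions (taken from the “,,,price,,name” layout mentioned earlier) are hypothetical, and the tokens are assumed to have already been split by the CSV parser.

```java
import java.math.BigDecimal;

// Hypothetical value object; the fields are assumptions for illustration.
final class PriceRecord {
    private final String name;
    private final BigDecimal price;

    PriceRecord(String name, BigDecimal price) {
        this.name = name;
        this.price = price;
    }

    String getName() { return name; }
    BigDecimal getPrice() { return price; }
}

// Hand-written mapping: no reflection, no prototype beans, no data binding.
final class ManualMapperSketch {
    // Assumed zero-based column positions for a ",,,price,,name" layout.
    static final int PRICE_COLUMN = 3;
    static final int NAME_COLUMN = 5;

    static PriceRecord map(String[] tokens) {
        if (tokens.length <= NAME_COLUMN) {
            throw new IllegalArgumentException(
                "Expected at least " + (NAME_COLUMN + 1) + " tokens but got " + tokens.length);
        }
        // Only the columns of interest are read; the rest are ignored.
        return new PriceRecord(
            tokens[NAME_COLUMN].trim(),
            new BigDecimal(tokens[PRICE_COLUMN].trim()));
    }
}
```

This is boilerplate, which is the point of the complaint: it is cheap to run but has to be rewritten for every record type.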

What Do I Do About Parsing Errors?

The Job (or Step specifically) allows you to configure exceptions to ignore as well as how many exceptions you can ignore. These happen for a variety of reasons. In some cases, it’s people editing the CSV file with Excel. Excel adds commas so each line has the same number of fields. A noble gesture but misguided. It breaks my CSV parsing (unfortunately).

The other error I had was commas inadequately escaped in the file. This is a genuine error. The problem is that each time it happens I get a giant stack trace in my log file. I don’t want that.

You can add a listener and listen for these errors. Thing is, that doesn’t stop the default behaviour to dump a giant stack. Annoying. Really annoying. This is basically an event system and any reasonable event system should have some means of halting further propagation of the event, kind of like e.preventDefault() in jQuery.
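As a rough sketch of what halting propagation could look like, consider a dispatcher where a listener returns true to say “I have handled this, skip the default”. This is a hypothetical design, not anything in the Spring Batch API; the SkipEventDispatcherSketch class and its methods are invented for illustration, and the default behaviour is modelled as appending to a list rather than writing a real stack trace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical dispatcher: a listener returning true suppresses the default action.
final class SkipEventDispatcherSketch {
    private final List<Predicate<Exception>> listeners = new ArrayList<>();

    // Stands in for the log file that would otherwise receive the stack trace.
    final List<String> defaultLog = new ArrayList<>();

    void addListener(Predicate<Exception> listener) {
        listeners.add(listener);
    }

    void dispatch(Exception error) {
        for (Predicate<Exception> listener : listeners) {
            if (listener.test(error)) {
                return; // handled: stop propagation, no default behaviour
            }
        }
        // Default behaviour when nobody claimed the event.
        defaultLog.add("stack trace for: " + error.getMessage());
    }
}
```

With this shape, a listener that recognises a badly escaped comma could log one short line of its own and return true, and the giant stack trace would never be produced.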

The Good

It’s worth having an intermission to mention some of the good things.

The CSV parser is (otherwise) good in that it supports quoting and escaping. Splitting a string on the comma is easy. Handling commas that are escaped or within quotes is not exactly hard but it’s tedious and who wants to write that code? I know I don’t.

The commit intervals are good. You can set a property for the number of items read between each commit.

The ItemProcessors are reasonably good too. You can basically use them as adapters between readers and writers such that it’s more possible to reuse the same reader and/or writer in different jobs, which is actually something I did.

The range of events you can listen to is also good, particularly how you can just implement say ItemStream or StepExecutionListener on a custom ItemWriter and then get those events.

Why Does My Job Keep Failing?

This, to me, was the straw that broke the camel’s back. I didn’t necessarily mind having to put in a map-backed JobRepository just to satisfy the “purity” of the API… until it stopped working.

This only became apparent when I tried to run several jobs in parallel. Basically the map-backed JobRepository has huge concurrency problems despite the somewhat misleading usage of synchronized in some parts of the code base.

For example, in MapJobExecutionDao:

private static Map<Long, JobExecution> executionsById = TransactionAwareProxyFactory.createTransactionalMap();
private static long currentId = 0;

public void saveJobExecution(JobExecution jobExecution) {
  Assert.isTrue(jobExecution.getId() == null);
  Long newId = currentId++;
  jobExecution.setId(newId);
  jobExecution.incrementVersion();
  executionsById.put(newId, copy(jobExecution));
}

What’s wrong with this? Quite a few things actually. For a start:

  1. Updating a long is not an atomic operation. That’s why we have AtomicLong;
  2. The post-increment operator is not threadsafe; and
  3. Updating this particular Map in this way is not threadsafe.

Aside from that there’s also the annoying issue that the map is static. Why is this annoying? Because I had so many threading issues with this particular JobRepository implementation I gave up and used one for each Job.
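One way to address the three points above is sketched below, using AtomicLong for the id and a ConcurrentHashMap for storage. This is a simplified illustration and not a patch for Spring Batch: ExecutionStub is a hypothetical stand-in for JobExecution, the class name SafeExecutionDaoSketch is invented, and the copy and version-increment steps of the original are left out.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical, simplified stand-in for Spring Batch's JobExecution.
final class ExecutionStub {
    private Long id;

    Long getId() { return id; }
    void setId(Long id) { this.id = id; }
}

// Sketch of an in-memory DAO that is safe to call from several threads.
final class SafeExecutionDaoSketch {
    // Instance fields rather than statics, so each repository has its own state.
    private final Map<Long, ExecutionStub> executionsById = new ConcurrentHashMap<>();
    private final AtomicLong currentId = new AtomicLong();

    void saveJobExecution(ExecutionStub execution) {
        if (execution.getId() != null) {
            throw new IllegalArgumentException("execution has already been saved");
        }
        // getAndIncrement is a single atomic step, unlike currentId++ on a long.
        long newId = currentId.getAndIncrement();
        execution.setId(newId);
        executionsById.put(newId, execution);
    }

    int size() {
        return executionsById.size();
    }
}
```

Making the fields non-static also deals with the other annoyance: two repositories no longer share one map, so using a separate repository per Job actually isolates them.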

Last Minute Complaints

One of the great things about Ibatis is that it supports grouping of result rows. A fixed commit interval is useful but what if you’re bundling rows to get them into groups? I did have to do this and it became painful. It resulted in a custom ItemReader that internally bundles rows. It works but I’d rather not have had to write it.

The other problem with commit intervals is they work off items read rather than items written. In some cases with a commit interval of 1000 I was doing commits of 8 records because the rest had been rejected for various reasons.

The config required is excessive. Whatever happened to convention over configuration? A good example of this is that you need to create separate line tokenizers and field set mappers (the first creates a property map from a line of CSV whereas the second binds that to an object). That’s a reasonable option but none of my scenarios had this kind of reuse. Couldn’t we simplify this somewhat?

We Have the Technology

Now with all these problems you may reasonably point out that perhaps I’m using the wrong tool or even that what I’m doing is wrong. I believe the above are quite legitimate complaints however.

You need look no further than Spring itself to see what Spring Batch is doing wrong. Spring MVC for example is a really lightweight Web framework in many ways. A request is mapped to a controller. That controller creates a model and passes it to a view. It’s straightforward and simple yet flexible and powerful. Controllers can vary from the very simple to the extremely complex with many implementations you can use out of the box.

That’s the right way to design an API or a library or framework: use as much or as little as you like. Don’t make me create a bunch of stuff I don’t need (and will in fact break what I want to do) for the sake of your API.

The very simplest Spring Batch job should be a simple read and write with no transactions and no repository. Restartability and a repository should be some kind of decorator/observer or just a more complex implementation. Do transactional management the way Spring does it (declaratively, programmatically or not at all, as you see fit) on both your jobs or steps and the repository.

While I can appreciate the flat file reader’s “white list” approach it’s simply too strict. CSV input by its nature tends to be hard to reliably define. It should be trivial to change this strict behaviour and to mask exceptions (or not as you see fit).

Things like property distance are just plain bizarre and need to go or at least be instance variables so config can easily override them.

Much like the groupBy in Ibatis, you should be able to set a commit interval breakpoint that uses one or more fields instead of or in addition to the commit interval to allow clean breaks in writing.

The commit interval should absolutely work off items written not items read.
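A minimal sketch of that idea follows, under stated assumptions: the reader is modelled as a plain Iterator, a rejected item is modelled as the processor returning null, and each inner list stands for one commit. The WrittenCountChunkerSketch class is hypothetical and only illustrates counting accepted items instead of items read.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Hypothetical chunking loop that counts items to be written, not items read.
final class WrittenCountChunkerSketch {

    /**
     * Reads every item, drops those the processor rejects (null result), and
     * groups the rest into chunks of commitInterval. Each inner list
     * represents what would be written in a single commit.
     */
    static <I, O> List<List<O>> chunk(Iterator<I> reader,
                                      Function<I, O> processor,
                                      int commitInterval) {
        if (commitInterval < 1) {
            throw new IllegalArgumentException("commitInterval must be at least 1");
        }
        List<List<O>> commits = new ArrayList<>();
        List<O> pending = new ArrayList<>();
        while (reader.hasNext()) {
            O processed = processor.apply(reader.next());
            if (processed == null) {
                continue; // rejected: does not count towards the interval
            }
            pending.add(processed);
            if (pending.size() == commitInterval) {
                commits.add(pending);
                pending = new ArrayList<>();
            }
        }
        if (!pending.isEmpty()) {
            commits.add(pending); // final, possibly smaller, commit
        }
        return commits;
    }
}
```

With an interval of 1000, every commit except possibly the last would then contain exactly 1000 records, however many were rejected along the way.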

There needs to be something in between hand-coding property setting and Spring’s full data binding that is reasonably powerful yet without the huge cost of Spring data binding prototype beans.

Lastly, the map-based repository needs some serious attention with respect to concurrency.

Conclusion

If you’ve gotten this far you may think I’m quite negative on the whole Spring Batch experience and you’d be right. Honestly I expect more out of something that wears the Spring label, has been out for a year or two and is ostensibly at version 2.0.

Starting a Programming Blog, Part 2

It has now been two months since I started this blog and a month since that post. Since it seemed to be well-received I thought I’d follow it up after another month.

Firstly though I want to stress something I said in the previous post: more than anything, these posts are a journal of my experiences and not any kind of expert opinion. Advice given is the sum of that knowledge, which may prove to be wrong. It just happens to be what my opinion is at that time.

It’s been an interesting month. The vital statistics for the period June 12 to July 11 are:

  • 23,661 visits;
  • 20,1660 Absolute Unique Visitors;
  • 28,224 Pageviews;
  • My most read post Do Programmers Optimize... Life? received 13,338 views;
  • Where last month, DZone accounted for 65% of my traffic, this month that number fell to 13%. This month the biggest source was by far reddit at 51% due mostly to the above post;
  • Google search traffic was up to over 7%. Spring and Ibatis Tutorial, published in the previous month, received over 1,000 visits just from search engine traffic;
  • According to Feedburner, RSS subscribers are up to 77;
  • $5.61 in AdSense revenue. :-)

All in all, I'm incredibly pleased with the results. That being said, I have to consider the above to be an outlier due to the rather unexpected success of one post. So I fully expect next month's results to be lower. It takes time to build an audience.

It’s the Title, Stupid

One conclusion I’ve reached is just how important the title of whatever you post is. This may seem shallow (and it is) but consider for a second how most people find and read things on the internet.

It’s quite easy to be completely deluged with information. Nowadays, there’s not a lot I read that doesn’t come from an RSS or Atom feed of some kind. The days of directly visiting sites are pretty much over for me. Google Reader is my tool of choice. The fact that I can use it from home and work (and an iPhone if I had one, which I don’t) beats any desktop app for that purpose.

In my Reader I have some sites, which themselves are aggregators, like Slashdot, DZone, programming.reddit.com and Hacker News. Between those and dozens of other feeds I probably get 400+ items each and every day. There are lots of duplicates and I could probably drop one of Hacker News and programming.reddit.com due to high incidence of duplication but I know what I’ve read and it takes no time to skim.

Of these 400+ items I read maybe 20 on a good day (beyond any one paragraph blurb). You might say that’s a low signal-to-noise ratio but you’d be wrong. The reason I do it is because if you skim 400+ articles and do that often you get a sense of developer mindshare. What are developers interested in? What are they talking about? What’s hot? What’s not?

That’s incredibly useful information.

Anyway, in my estimation, the vast majority of programmers read very little online. I’d put myself in a more active minority. Within that minority, I’d say the above reading pattern is probably typical.

When skimming a large amount of material, the title is probably the most important deciding factor on whether you read that item or not.

Title also has a great deal to do with people finding your work via search engines.

So the title:

  1. must accurately reflect the contents of your post;
  2. must pique the interest of a reader within a large volume of material; and
  3. must contain likely keywords that people will search for, assuming the post will be of interest to someone searching for such things.

For example, I wrote The Monetization of Java Begins? The title is accurate but it hinges on the word “monetization”, a somewhat esoteric word used primarily by people talking about ad revenues and commercial licensing. The question mark at the end is important because it accurately reflects the view that there is uncertainty over the answer (which was subsequently cleared up by Sun).

Now imagine if that post was called Sun to Start Charging for Java Features? It’s a much better title that still fits all the above criteria.

I can imagine some people thinking to themselves at this point “Boy, that’s a lot to write just about the title” or they may even accuse me of getting caught up in minutiae but if you want people to read what you’ve written, put careful thought into your title.

I’m certain the same purists who (mistakenly) believe that you can realistically write a blog these days by pushing out ASCII text files will jump up and down and say only the content matters. Now of course the content matters. The point of the title is to get that content read.

Anyway, that’s just my opinion.

Search Engine Optimization

SEO is a bit of a strange topic. A lot has been written about it. Some people make their livings out of it. To me, in certain circles, SEO borders on being a religion, not only because of the fervour of its followers but because many of its tenets seem based on blind faith rather than having any basis in fact.

Those making their living out of it just come across as the priests of this kooky cult.

That being said, I do believe in these principles (not just for blogging but for Web development in general):

  1. Title matters (see above);
  2. Put the title of your post or page as the first element of the HTML title; and
  3. The URL should match the title.

(2) is different to the default behaviour of Blogger (I can’t speak for Wordpress or any others) but just requires the following template change:

<b:if cond='data:blog.pageType == &quot;item&quot;'>
<title><data:blog.pageName/></title>
<b:else/>
<title><data:blog.pageTitle/></title>
</b:if>

Alternatively put the blog title at the end of the HTML title:

<b:if cond='data:blog.pageType == &quot;item&quot;'>
<title><data:blog.pageName/> ~ <data:blog.title/></title>
<b:else/>
<title><data:blog.pageTitle/></title>
</b:if>

Whatever the case, don't put the blog title at the front.

Windows Live Writer

When I posted Starting a Programming Blog I had it suggested to me that I use Windows Live Writer to write posts instead of doing what I had been doing, which is hand-coding HTML because the Blogger editor is so awful. Now I have to admit I was sceptical.

It’s not every day I’m surprised by Microsoft. What I’ve come to expect is vertical integration that is somewhere between truly invasive and just plain nauseating but I am stunned at just how good Live Writer is. It’s not perfect but it has an extensible plug-in architecture, it integrates seamlessly with Blogger (and Wordpress, etc) and gives you a pretty darn good preview of what your site will look like.

Plus the HTML produced is pretty clean in exactly the way that HTML produced by Microsoft Word isn’t.

I really can’t believe Microsoft produced this and gives it away for free. It’s so completely unlike what I’ve come to expect from Microsoft. Everything I write now uses it.

The Future

It’s fair to say that this site had—and still has—a lot of JavaScript by virtue of widgets and social news site links. About half of those were removed recently as non-performing (defined as generating no or few referrals).

The code snippet JavaScript plug-in generates attractive code but it is another hit in terms of JavaScript load and execution. I know when developing websites I am absolutely cutthroat when it comes to reducing the amount of JavaScript that needs to be loaded (either by caching of some kind or simply including only what you really need) and executed.

The current code plug-in has the advantage that I can use it on Blogger without hosting anything myself. Even though hosting PHP (for example) is cheap, you still get what you pay for. Shared hosting tends to have issues with unexpected downtime, sometimes for long periods, as well as performance.

Also, once you host your own blog, you then have to worry about issues like being hacked, backup strategies and so on.

Ultimately though, I think I will move to a VPS-based solution, either when I need VPS hosting for some other reason or when I believe the blog has gotten to the point where it is justified. It certainly isn’t there yet despite greatly exceeding any expectations to date.

All of this however illustrates the importance of running a blog under your own domain from day one. Without that I wouldn’t have the freedom to move without breaking any searches or links I’ve built up to this point.

Conclusion

I hope the above is of some use or interest to you. As always, there’ll be another instalment as long as I feel I’ve got something to say.

Plain English Explanation of Big O Notation

I recently read A Beginners’ Guide to Big O Notation and while I appreciate such efforts I don’t think it went far enough. I’m a huge fan of “plain English” explanations to, well, anything. Just look at the formal definition of Big O. The only people who can understand that already know what it means (and probably have a higher degree in mathematics and/or computer science).

On StackOverflow you often get comments like “you should do X because it’s O(2n) and Y is O(3n)”. Such statements originate from a basic misunderstanding of what Big O is and how to apply it. The material in this post is basically a rehash and expansion of what I've previously written on the subject.

What is Big O?

Big O notation seeks to describe the relative complexity of an algorithm by reducing its growth rate to the key factor as that factor tends towards infinity. For this reason, you will often hear the phrase asymptotic complexity. In doing so, all other factors are ignored. It is a relative representation of complexity.

What Isn’t Big O?

Big O isn’t a performance test of an algorithm. It is also notional or abstract in that it tends to ignore other factors. Sorting algorithm complexity is typically reduced to the number of elements being sorted as being the key factor. This is fine but it doesn’t take into account issues such as:

  • Memory Usage: one algorithm might use much more memory than another. Depending on the situation this could be anything from completely irrelevant to critical;
  • Cost of Comparison: It may be that comparing elements is really expensive, which will potentially change any real-world comparison between algorithms;
  • Cost of Moving Elements: copying elements is typically cheap but this isn’t necessarily the case;
  • etc.

Arithmetic

The best example of Big-O I can think of is doing arithmetic. Take two numbers (123456 and 789012). The basic arithmetic operations we learnt in school were:

  • addition;
  • subtraction;
  • multiplication; and
  • division.

Each of these is an operation or a problem. A method of solving these is called an algorithm.

Addition is the simplest. You line the numbers up (to the right) and add the digits in a column, writing the last digit of that addition in the result. The 'tens' part of that number is carried over to the next column.

Let's assume that the addition of these numbers is the most expensive operation in this algorithm. It stands to reason that to add these two numbers together we have to add together 6 digits (and possibly carry a 7th). If we add two 100 digit numbers together we have to do 100 additions. If we add two 10,000 digit numbers we have to do 10,000 additions.

See the pattern? The complexity (being the number of operations) is directly proportional to the number of digits. We call this O(n) or linear complexity. Some argue that this is in fact O(log n) or logarithmic complexity. Why? Because adding 10,000,000 to itself takes twice as long as adding 1,000 to itself as there are 8 digits instead of 4. But 10,000,000 is 10,000 times as large so depending on your application it may be appropriate to define the problem in terms of number of digits (ie order of magnitude) of the input. In others, the number itself may be appropriate.
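To make the counting concrete, here is a minimal sketch of schoolbook addition on digit strings. The class name SchoolbookAddition and the digitAdditions counter are my own, purely for illustration; the counter tracks one column add per digit position, which is the operation treated as expensive above.

```java
// A sketch of schoolbook addition on non-negative decimal digit strings.
// The class name and the counter are hypothetical, for illustration only.
class SchoolbookAddition {
    // Number of column additions performed since the counter was last reset.
    static long digitAdditions = 0;

    static String add(String a, String b) {
        StringBuilder result = new StringBuilder();
        int i = a.length() - 1;
        int j = b.length() - 1;
        int carry = 0;
        // Work from the rightmost column to the leftmost.
        while (i >= 0 || j >= 0) {
            int x = i >= 0 ? a.charAt(i--) - '0' : 0;
            int y = j >= 0 ? b.charAt(j--) - '0' : 0;
            int sum = x + y + carry;                 // one column add
            digitAdditions++;
            result.append((char) ('0' + sum % 10));  // write the last digit
            carry = sum / 10;                        // carry the 'tens' part
        }
        if (carry > 0) {
            result.append((char) ('0' + carry));     // the possible extra digit
        }
        return result.reverse().toString();
    }

    public static void main(String[] args) {
        System.out.println(add("123456", "789012")); // prints 912468
        System.out.println(digitAdditions);          // prints 6
    }
}
```

Doubling the number of digits doubles the value of the counter, which is all that O(n) is saying here.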

Subtraction is similar (except you may need to borrow instead of carry).

Multiplication is different. You line the numbers up, take the first digit in the bottom number and multiply it in turn against each digit in the top number and so on through each digit. So to multiply our two 6 digit numbers we must do 36 multiplications. We may need to do as many as 10 or 11 column adds to get the end result too.

If we have two 100-digit numbers we need to do 10,000 multiplications and 200 adds. For two one-million-digit numbers we need to do one trillion (10^12) multiplications and two million adds.

As the algorithm scales with n-squared, this is O(n^2) or quadratic complexity. This is a good time to introduce another important concept:
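The same kind of sketch works for schoolbook multiplication. Again the class name and counter are hypothetical; the code counts only the single-digit multiplications, not the column adds, and it assumes both inputs are non-negative decimal digit strings.

```java
// A sketch of schoolbook (long) multiplication on decimal digit strings.
// Names are hypothetical; only single-digit multiplications are counted.
class SchoolbookMultiplication {
    static long digitMultiplications = 0;

    static String multiply(String a, String b) {
        // The product of an m-digit and an n-digit number has at most m + n digits.
        int[] product = new int[a.length() + b.length()];
        for (int j = b.length() - 1; j >= 0; j--) {        // each digit of the bottom number
            for (int i = a.length() - 1; i >= 0; i--) {    // against each digit of the top number
                int digitProduct = (a.charAt(i) - '0') * (b.charAt(j) - '0');
                digitMultiplications++;
                int pos = i + j + 1;                       // column this partial product lands in
                int sum = product[pos] + digitProduct;
                product[pos] = sum % 10;
                product[pos - 1] += sum / 10;              // carry into the next column
            }
        }
        // Build the result, skipping leading zeros.
        StringBuilder sb = new StringBuilder();
        for (int digit : product) {
            if (sb.length() == 0 && digit == 0) {
                continue;
            }
            sb.append(digit);
        }
        return sb.length() == 0 ? "0" : sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(multiply("123456", "789012"));
        System.out.println(digitMultiplications); // prints 36 for two 6-digit numbers
    }
}
```

The nested loop is the giveaway: every digit of one number meets every digit of the other.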

We only care about the most significant portion of complexity.

The astute may have realized that we could express the number of operations as: n^2 + 2n. But as you saw from our example with two numbers of a million digits apiece, the second term (2n) becomes insignificant (accounting for 0.0002% of the total operations by that stage).
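If you want to watch the smaller term fade away, a few lines of arithmetic will do it. This is only a sketch: the method name lowerOrderShare is mine, and it simply evaluates 2n as a percentage of n^2 + 2n for a few sizes.

```java
// A sketch showing how quickly the 2n term stops mattering next to n^2.
// The class and method names are hypothetical.
class DominantTerm {
    // Percentage of the n^2 + 2n operations contributed by the 2n term.
    static double lowerOrderShare(double n) {
        return 100.0 * (2 * n) / (n * n + 2 * n);
    }

    public static void main(String[] args) {
        for (double n : new double[] {6, 100, 10_000, 1_000_000}) {
            System.out.printf("n = %,.0f: 2n is %.6f%% of the total%n", n, lowerOrderShare(n));
        }
    }
}
```

For six digits the adds are a quarter of the work. By a million digits they are a rounding error.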

The Telephone Book

The next best example I can think of is the telephone book, normally called the White Pages or similar but it'll vary from country to country. But I'm talking about the one that lists people by surname and then initials or first name, possibly address and then telephone numbers.

Now if you were instructing a computer to look up the phone number for "John Smith", what would you do? Ignoring the fact that you could guess how far in the S's started (let's assume you can't), what would you do?

A typical implementation might be to open up to the middle, take the 500,000th and compare it to "Smith". If it happens to be "Smith, John", we just got real lucky. Far more likely is that "John Smith" will be before or after that name. If it's after we then divide the last half of the phone book in half and repeat. If it's before then we divide the first half of the phone book in half and repeat. And so on.

This is called a bisection search and is used every day in programming whether you realize it or not.

So if you want to find a name in a phone book of a million names you can actually find any name by doing this at most 20 times. In comparing search algorithms we count these comparisons, and 'n' is the number of names in the book.

For a phone book of 3 names it takes 2 comparisons (at most).
For 7 it takes at most 3.
For 15 it takes 4.
...
For 1,000,000 it takes 20.

That is staggeringly good isn't it?
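Here is a rough sketch of that search over a sorted array of names, with a counter for the comparisons. The names PhoneBookSearch, find and worstCase are mine, and the "book" is just an array of strings rather than anything resembling a real directory.

```java
// A sketch of a bisection (binary) search over a sorted array of names.
// All names here are hypothetical, for illustration only.
class PhoneBookSearch {
    // Comparisons made by the most recent call to find().
    static int comparisons;

    // Returns the index of target in sortedNames, or -1 if it is not there.
    static int find(String[] sortedNames, String target) {
        comparisons = 0;
        int low = 0;
        int high = sortedNames.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1;          // open the book in the middle
            comparisons++;
            int cmp = target.compareTo(sortedNames[mid]);
            if (cmp == 0) {
                return mid;                        // got lucky
            } else if (cmp < 0) {
                high = mid - 1;                    // keep the first half
            } else {
                low = mid + 1;                     // keep the last half
            }
        }
        return -1;
    }

    // Worst-case comparisons for a book of n names: keep halving until nothing is left.
    static int worstCase(int n) {
        int count = 0;
        while (n > 0) {
            n /= 2;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 7, 15, 1_000_000}) {
            System.out.println(n + " names: at most " + worstCase(n) + " comparisons");
        }
        // prints 2, 3, 4 and 20 comparisons respectively
    }
}
```

Each extra comparison doubles the size of the book you can handle, which is why the count grows so slowly.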

In Big-O terms this is O(log n) or logarithmic complexity. Now the logarithm in question could be ln (base e), log10, log2 or some other base. It doesn't matter: it's still O(log n), just like O(2n^2) and O(100n^2) are still both O(n^2).

It's worthwhile at this point to explain that Big O can be used to determine three cases with an algorithm:

  • Best Case: In the telephone book search, the best case is that we find the name in one comparison. This is O(1) or constant complexity;
  • Expected Case: As discussed above this is O(log n); and
  • Worst Case: This is also O(log n).

Normally we don't care about the best case. We're interested in the expected and worst case. Sometimes one or the other of these will be more important.

Back to the telephone book.

What if you have a phone number and want to find a name? The police have a reverse phone book but such lookups are denied to the general public. Or are they? Technically you can reverse lookup a number in an ordinary phone book. How?

You start at the first name and compare the number. If it's a match, great, if not, you move on to the next. You have to do it this way because the phone book is unordered (by phone number anyway).

So to find a name:

  • Best Case: O(1);
  • Expected Case: O(n) (for 500,000); and
  • Worst Case: O(n) (for 1,000,000).
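The reverse lookup can be sketched the same way. The entries below are made up, and each one is assumed to be a {name, number} pair held in book order (that is, sorted by name, which is no help at all when searching by number).

```java
// A sketch of a reverse lookup: a linear scan through entries ordered by name.
// The class name, the counter and the sample data are all hypothetical.
class ReverseLookup {
    // Comparisons made by the most recent call to findName().
    static int comparisons;

    // Each entry is {name, number}. Returns the name, or null if the number is absent.
    static String findName(String[][] entries, String number) {
        comparisons = 0;
        for (String[] entry : entries) {
            comparisons++;
            if (entry[1].equals(number)) {
                return entry[0];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String[][] book = {
            {"Adams, A", "555-0199"},
            {"Jones, B", "555-0123"},
            {"Smith, J", "555-0142"},
        };
        System.out.println(findName(book, "555-0142") + " after " + comparisons + " comparisons");
    }
}
```

There is nothing to halve here: every entry you haven't looked at yet is still a candidate.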

The Travelling Salesman

This is quite a famous problem in computer science and deserves a mention. In this problem you have N towns. Each of those towns is linked to 1 or more other towns by a road of a certain distance. The Travelling Salesman problem is to find the shortest tour that visits every town.

Sounds simple? Think again.

If you have 3 towns A, B and C with roads between all pairs then you could go:

A -> B -> C 
A -> C -> B 
B -> C -> A 
B -> A -> C 
C -> A -> B 
C -> B -> A

Well actually there are fewer than that because some of these are equivalent (A -> B -> C and C -> B -> A are equivalent, for example, because they use the same roads, just in reverse).

In actuality there are 3 possibilities.

Take this to 4 towns and you have (iirc) 12 possibilities. With 5 it's 60. 6 becomes 360.

This is a function of a mathematical operation called a factorial. Basically:

5! = 5 * 4 * 3 * 2 * 1 = 120
6! = 6 * 5 * 4 * 3 * 2 * 1 = 720
7! = 7 * 6 * 5 * 4 * 3 * 2 * 1 = 5040
...
25! = 25 * 24 * ... * 2 * 1 = 15,511,210,043,330,985,984,000,000
...
50! = 50 * 49 * ... * 2 * 1 = 3.04140932... × 10^64
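These numbers outgrow a 64-bit long very quickly (21! already overflows), so a sketch needs BigInteger. The names below are mine. distinctRoutes assumes, as in the town example above, that a route and its reverse count as the same route, giving n!/2 for two or more towns.

```java
import java.math.BigInteger;

// A sketch of the factorial growth behind the route counts.
// Class and method names are hypothetical.
class Routes {
    static BigInteger factorial(int n) {
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= n; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    // Routes through n >= 2 towns when a route and its reverse are counted once: n!/2.
    static BigInteger distinctRoutes(int n) {
        return factorial(n).divide(BigInteger.valueOf(2));
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 4, 5, 6, 25}) {
            System.out.println(n + " towns: " + distinctRoutes(n) + " routes");
        }
    }
}
```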

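The obvious way of solving the Travelling Salesman problem is by brute force: try every route. Unfortunately, such a technique has O(n!) complexity. Cleverer exact methods are known, but every one of them still takes more than polynomial time in the worst case.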
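For a handful of towns, brute force is easy enough to sketch. The code below is an illustration under my own assumptions: every pair of towns is joined by a road, distances are held in a symmetric matrix, and a "route" is a path that visits each town once without returning to the start (matching the A -> B -> C listing above). It does not bother to skip reversed routes, so it tries all n! orderings.

```java
// A brute-force sketch: try every ordering of the towns and keep the shortest.
// Class, method and variable names are hypothetical.
class BruteForceSalesman {
    // Complete routes examined by the most recent call to shortestRoute().
    static long routesTried;

    // dist[i][j] is the road length between town i and town j.
    static int shortestRoute(int[][] dist) {
        routesTried = 0;
        return search(dist, new boolean[dist.length], -1, 0, 0);
    }

    private static int search(int[][] dist, boolean[] visited, int last,
                              int visitedCount, int lengthSoFar) {
        if (visitedCount == dist.length) {
            routesTried++;                 // every town visited: one complete route
            return lengthSoFar;
        }
        int best = Integer.MAX_VALUE;
        for (int next = 0; next < dist.length; next++) {
            if (visited[next]) {
                continue;
            }
            visited[next] = true;
            int step = last < 0 ? 0 : dist[last][next];   // no road into the first town
            best = Math.min(best,
                    search(dist, visited, next, visitedCount + 1, lengthSoFar + step));
            visited[next] = false;         // backtrack and try another town
        }
        return best;
    }

    public static void main(String[] args) {
        // Towns A, B, C with made-up distances: AB = 5, AC = 9, BC = 4.
        int[][] dist = {
            {0, 5, 9},
            {5, 0, 4},
            {9, 4, 0},
        };
        System.out.println(shortestRoute(dist)); // prints 9 (A -> B -> C or its reverse)
        System.out.println(routesTried);         // prints 6, i.e. 3!
    }
}
```

Adding one more town multiplies the work by the new number of towns, which is what makes this approach hopeless so quickly.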

By the time you get to 200 towns there isn't enough time left in the universe to solve the problem with traditional computers.

Something to think about.

Polynomial Time

Another point I wanted to make quick mention of is that any algorithm that has a complexity of O(n^k) for any constant k is said to have polynomial complexity or is solvable in polynomial time.

Traditional computers can solve polynomial-time problems in a practical amount of time. Certain things work in the world precisely because some problems have no known polynomial-time solution. Public Key Cryptography is a prime example. It is computationally hard to find the two prime factors of a very large number. If it wasn't, we couldn't use the public key systems we use.

Big Greek Letters

Big O is often misused. Big O or Big Oh is actually short for Big Omicron. It represents the upper bound of asymptotic complexity. So if an algorithm is O(n log n) there exists a constant c such that the upper bound is cn log n.

Θ(n log n) (Big Theta) is more tightly bound than that. Such an algorithm means there exist two constants c1 and c2 such that c1 * n log n <= f(n) <= c2 * n log n.

Ω(n log n) (Big Omega) says that the algorithm has a lower bound of cn log n.

There are others but these are the most common and Big O is the most common of all. Such a distinction is typically unimportant but it is worth noting. The correct notation is the correct notation, after all.
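For the record, and with the caveat that this is the standard textbook form rather than anything specific to this post, the three bounds can be written side by side. Here f(n) is the number of operations, g(n) is the comparison function (n log n in the examples above), and the constants c, c1, c2 and the threshold n0 are the usual existentially quantified ones.

```latex
\begin{align*}
f(n) = O(g(n))      &\iff \exists\, c > 0,\ n_0 : \ f(n) \le c \, g(n) \ \text{ for all } n \ge n_0 \\
f(n) = \Omega(g(n)) &\iff \exists\, c > 0,\ n_0 : \ f(n) \ge c \, g(n) \ \text{ for all } n \ge n_0 \\
f(n) = \Theta(g(n)) &\iff \exists\, c_1, c_2 > 0,\ n_0 : \ c_1 \, g(n) \le f(n) \le c_2 \, g(n) \ \text{ for all } n \ge n_0
\end{align*}
```

Read in plain English: Big O is a ceiling, Big Omega is a floor, and Big Theta is both at once.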

Determinism

Algorithms can also be classified as being either deterministic or probabilistic. It’s important to understand the difference. Sometimes requirements or constraints may dictate the choice of one over the other even if the expected case is worse. You should be able to classify an algorithm as one or the other.

A good example of this is comparing files. Say you have two files A and B and you want to determine if they are the same. The simple implementation for this is:

  1. If the sizes are different, the files are different; else
  2. Compare each file byte-for-byte. If two different bytes are found, the files are different; else
  3. The files are the same.

This is a deterministic algorithm because the probability of a false positive (the algorithm saying the files are the same when they aren’t) and a false negative (saying they are different when they aren’t) is 0 in both cases.
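The three steps above can be sketched directly with the standard java.nio.file API. The class and method names are mine, and the sketch assumes both files are local and are not being modified while the comparison runs.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// A sketch of the deterministic comparison: sizes first, then byte-for-byte.
// Class and method names are hypothetical.
class FileCompare {
    static boolean sameContents(Path a, Path b) throws IOException {
        // Step 1: different sizes means different files.
        if (Files.size(a) != Files.size(b)) {
            return false;
        }
        // Step 2: compare each byte in turn, stopping at the first difference.
        try (InputStream inA = new BufferedInputStream(Files.newInputStream(a));
             InputStream inB = new BufferedInputStream(Files.newInputStream(b))) {
            int byteA;
            while ((byteA = inA.read()) != -1) {
                if (byteA != inB.read()) {
                    return false;
                }
            }
        }
        // Step 3: no difference found, so the files are the same.
        return true;
    }
}
```

Note the cost: in the worst case (identical files) every byte of both files gets read.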

For various reasons however it might be impractical or undesirable to implement the algorithm this way. Many file comparisons may be required, making the operation potentially very expensive on large files. Also the files might be remote to each other and it might be impractical to send a complete copy just so the remote system can see if it's changed.

A more common approach is to use a hash function. A hash function basically just converts a large piece of data into a smaller piece of data (called a hash), usually a 32-128 bit integer. A good hash function will distribute values in the new (smaller) data range as evenly as possible.

A common hash function is an MD5 hash, which generates a 128-bit hash. Let’s say files A and B were on different servers. One could send an MD5 hash of the file to the other, which could compare it to its own MD5 hash. If they’re different, the files are different. If they’re the same, the files are highly likely to be the same.

An MD5 hash comparison is a probabilistic comparison algorithm for this reason.
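A minimal sketch of the hash comparison, using the JDK's own java.security.MessageDigest. The class and method names are mine, and how the 16-byte hash travels between the two servers is left out entirely. The method is deliberately called probablySame rather than same, for the reason just given.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// A sketch of the probabilistic comparison: compare 128-bit MD5 hashes
// instead of the data itself. Class and method names are hypothetical.
class HashCompare {
    static byte[] md5(byte[] data) {
        try {
            return MessageDigest.getInstance("MD5").digest(data);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available on this JVM", e);
        }
    }

    // Different hashes: the data is definitely different.
    // Equal hashes: the data is only highly likely to be the same.
    static boolean probablySame(byte[] localHash, byte[] remoteHash) {
        return MessageDigest.isEqual(localHash, remoteHash);
    }

    public static void main(String[] args) {
        byte[] fileA = "the quick brown fox".getBytes(StandardCharsets.UTF_8);
        byte[] fileB = "the quick brown fax".getBytes(StandardCharsets.UTF_8);
        System.out.println(probablySame(md5(fileA), md5(fileA))); // prints true
        System.out.println(probablySame(md5(fileA), md5(fileB))); // prints false
    }
}
```

Only 16 bytes need to cross the network, however large the files are.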

And before you say that the chance is so remote it’ll never happen, think again. A malicious exploit has been demonstrated of generating two files with the same MD5 hash.

Algorithms such as this, whose safety depends on brute force being the only way to attack them, age relatively quickly. Where once MD5 was considered safe, creating two messages with the same MD5 hash is now feasible (in a matter of days with not unreasonable hardware) such that the more secure SHA-1 algorithm has largely replaced its usage.

Conclusion

Anyway, that's it for my (hopefully plain English) explanation of Big O. I intend to follow this up with applications to some common scenarios in weeks to come.

StackOverflow for... marketing questions?!

Yesterday I saw this question and got to thinking... what would a Q&A site for marketing questions look like? What would people ask? I then got to thinking about how tech products get marketed and this post is the result. It's my take on StackOverflow for marketing.

22
7

I have to name a new video card my company is releasing. What do I do?

edit|close|flag
add comment

4 Answers

12

Our extensive research around the water cooler has clearly indicated that geeks LOVE Xs. Therefore we have the following rules about naming video cards:

  1. It must have a 3 or 4 digit number;
  2. Put as many X's as possible in. Geeks LOVE X's; and
  3. To break up XXX sequences with the obvious connotations, break it up with 'GT';

Thus you can see the ULTIMATE video card name is the X1900 XTX. If only we'd named it the XX1900 XXGTXX, Nvidia would now be bankrupt.

link| edit| flag
7
 
+1 Wow! Truly Amazing! Geeks and Xs!!! - PRweenie
add comment
37
11

My company has spent billions on making a new version of our operating system. It doesn't really do anything more than the old one. It just has really bloated hardware requirements for a pretty interface no one uses. How do we convince people to buy it?

edit|close|flag
add comment

4 Answers

43

Firstly, you kill off the old operating system even though people are happily using it. They'll have to upgrade eventually.

Secondly, you steadily release software and features that only work on the new operating system.

Next, release a bewildering array of versions so your customers think you provide everything. It's also important to make them choose things like if they want a 32 or 64 bit operating system as that's the kind of decision most consumers are well-informed about and like to make.

Next, when it flops, just wait a year, bump up the version, change the name and rerelease it as the new and improved version even though it's basically the same thing.

Don't forget to needlessly ramp up hardware requirements. Customers will surely buy new PCs because they'll be so keen to get their hands on your new OS.

Last, use your monopoly power to force all OEMs to ship it on new PCs whether customers want it or not.

link| edit| flag
add comment
9
2

Apparently our shiny new phones have a new feature our researchers are calling 'data' and it comes in something called 'megabytes'. How much can I charge for it?

edit|close|flag
add comment

4 Answers

17

Well, you give them enough of a free allowance that they'll actually use it for browsing and then wallop them with excess fees of say $100/gigabyte (apparently if you put 1000 megabytes together they form a gigabyte!) when they go above their quota.

Next charge them $1/minute to talk on their phones but give them $300 of credit for their $40 monthly fee and call it a "cap". This way when they go over there are more excess fees!

Now you're charging about $10,000/gigabyte. Wow! How does that work? Well, apparently voice is "data"! And apparently a minute of voice is about 100 kilobytes of data so you are giving them 100K for $1, which is $10,000 per gigabyte. And people are happy to pay it!

P.S. Make sure you pressure your handset manufacturer to disable Skype being used over 3G!

link| edit| flag
add comment

Aggregation vs Joins: Methodology

I promised to outline my methodology for Oracle vs MySQL vs SQL Server: Aggregation vs Joins so here it is.

Oracle

Version: Oracle 10g Express Edition ("XE") running on Windows XP SP3

CREATE TABLE Emp (
  ID NUMBER(19,0) PRIMARY KEY,
  PersonID NUMBER(19,0),
  CompanyID NUMBER(19,0)
);
CREATE SEQUENCE emp_seq START WITH 1;
CREATE INDEX idx1 ON Emp (PersonID, CompanyID);
CREATE INDEX idx2 ON Emp (CompanyID, PersonID);

MySQL

Version: MySQL 5.0.49a (MyISAM) running on Windows XP SP3

CREATE TABLE Emp (
  ID INT(19) AUTO_INCREMENT PRIMARY KEY,
  PersonID INT(19),
  CompanyID INT(19)
);
CREATE INDEX idx1 ON Emp (PersonID, CompanyID);
CREATE INDEX idx2 ON Emp (CompanyID, PersonID);

SQL Server

Version: SQL Server 2008 running on Windows XP SP3

CREATE TABLE Emp (
  ID NUMERIC(19,0) IDENTITY PRIMARY KEY,
  PersonID NUMERIC(19,0),
  CompanyID NUMERIC(19,0)
);
CREATE INDEX idx1 ON Emp (PersonID, CompanyID);
CREATE INDEX idx2 ON Emp (CompanyID, PersonID);

Join Query

SELECT e1.personid
FROM emp e1
JOIN emp e2 ON e1.personid = e2.personid AND e2.companyid = 80
JOIN emp e3 ON e2.personid = e3.personid AND e3.companyid = 95
JOIN emp e4 ON e3.personid = e4.personid AND e4.companyid = 98
WHERE e1.companyid = 99

Aggregation Query

SELECT personid
FROM emp
WHERE companyid IN (80,95,98,99)
GROUP BY personid
HAVING COUNT(1) = 4

Emp.java

package com.cforcoding;

public class Emp {
    private Long id;
    private long personId;
    private long companyId;

    public Emp(long personId, long companyId) { this.personId = personId; this.companyId = companyId; }
    public Long getId() { return id; }
    public void setId(Long id) { this.id = id; } 
    public long getPersonId() { return personId; }
    public void setPersonId(long personId) { this.personId = personId; }
    public long getCompanyId() { return companyId; }
    public void setCompanyId(long companyId) { this.companyId = companyId; }
}

Emp.xml

Note: Ibatis was used to create the sample data. See Spring and Ibatis Tutorial for detailed setup instructions.

<insert id="insertEmp_sequence" parameterClass="com.cforcoding.Emp">
    <selectKey keyProperty="id" resultClass="long">
        SELECT emp_seq.NEXTVAL FROM DUAL
    </selectKey>
    INSERT INTO Emp (ID, PersonID, CompanyID) VALUES (#id#, #personId#, #companyId#)
</insert>

<insert id="insertEmp_auto" parameterClass="com.cforcoding.Emp">
    INSERT INTO Emp (PersonID, CompanyID) VALUES (#personId#, #companyId#)
    <selectKey keyProperty="id" resultClass="long">
        SELECT @@IDENTITY
    </selectKey>
</insert>

The first query is used for Oracle. The second one for SQL Server and MySQL. The obvious DAO class has been skipped.

Create.java

private final static int BATCH_SIZE = 1000;
private final static List<Integer> companies = new ArrayList<Integer>();
private final static Random r = new Random(167234987609003358L);
private static PartyDAO partyDAO = null;

private static void runTest() {
    seedCompanies(100);
    long start = System.currentTimeMillis();
    for (int i=0; i<100; i++) {
        createEmpBatch(i * BATCH_SIZE);
        long now = System.currentTimeMillis();
        double duration = (now - start) / 1000.0D;
        System.out.printf("[+%.3f] Iteration %d%n", duration, i);
    }
}

private static void seedCompanies(int count) {
    double factor = 0.6d;
    double n = 3.0d;
    for (int i=0; i<count; i++) {
        int weight = (int) n;
        for (int j = 0; j < weight; j++) {
            companies.add(i);
        }
        n *= (1.0d + factor);
        if (factor > 0.01d) {
            factor *= 0.94d;
        }
    }
}

@Transactional
private static void createEmpBatch(int start) {
    for (int i=0; i<BATCH_SIZE; i++) {
        long person = start + i;
        int count = r.nextInt(3) + r.nextInt(3) + r.nextInt(5);
        Set<Long> employers = new HashSet<Long>();
        for (int j=0; j<count; j++) {
            long company;
            do {
                company = companies.get(r.nextInt(companies.size()));
            } while (employers.contains(company));
            employers.add(company);
            Emp e = new Emp(person, company);
            partyDAO.createEmp(e);
        }
    }
}

Basically what this does is:

  • Creates a list of 100 companies;
  • Creates a weighted random table for the companies. The higher the company number, the more likely it is to appear, which is why the company numbers in the queries above are also quite high;
  • Generates roughly four million records; and
  • Uses a fixed random seed, so you can create the same dataset in each database.

If you see any issues, please let me know.

Oracle vs MySQL vs SQL Server: Aggregation vs Joins

I’m having a bad week for writing blog posts. I’ve so far started posts on two different topics, both of which I’ve been forced to abandon mid-way because while researching the topic and verifying my assumptions or results I’ve disproven my argument. That’s annoying but it would’ve been more annoying to publish them and make an ass of myself by being grossly factually incorrect. So free tip for you: do your research.

Anyway, one of them led to this one, which allows me to make some points that are (hopefully) worth making.

SQL for a lot of developers is somewhat mystical. As far as I’m concerned, relational algebra is an essential foundation for any programmer’s education but some courses seem to skip it entirely (or pay it lip service briefly before moving on), some programmers didn’t get any sort of (related) formal education, or it’s simply not interesting so it goes in one ear and out the other.

This has led in part to a strong movement in modern programming to treat databases and SQL as a problem that needs fixing. Look no further than the plethora of .Net/Java ORMs or Rails’ ActiveRecord to see proof of that.

There’s also a lot of ignorance about how to construct performant queries. Whereas programs can nearly always be analysed in purely algorithmic terms, SQL needs to be tested. That’s where this topic comes in.

Consider this situation: you have a join table between Employees and Companies (being a many-to-many relationship). For simplicity we’ll look at one table only with three columns: ID, PersonID and CompanyID. Technically the ID column could be dropped in favour of a composite primary key but I prefer single-column keys for a variety of reasons.

How do you get a list of people that have worked for a given list of 4 companies? By that I mean they’ve worked for every one: for a given person P and every company C in the list, there should exist a record (P,C). For this test I've generated roughly 4 million people that have employment records with roughly 2-10 of 100 companies. Further details of the test setup are in Aggregation vs Joins: Methodology.

There are two basic ways of solving this problem: aggregation and joins.

The aggregation approach uses the SQL GROUP BY functionality and will look something like this in its simplest form:

SELECT PersonID
FROM Emp
WHERE CompanyID IN (1,2,3,4)
GROUP BY PersonID
HAVING COUNT(*) = 4

A variation upon this question comes up reasonably often on StackOverflow and the above is a commonly suggested solution. I can understand this to a degree: it is reasonably elegant and lends itself to dynamic SQL writing.

The join version is, for a lot of people, “uglier”:

SELECT e1.PersonID
FROM Emp e1
JOIN Emp e2 ON e1.PersonID = e2.PersonID AND e2.CompanyID = 2
JOIN Emp e3 ON e2.PersonID = e3.PersonID AND e3.CompanyID = 3
JOIN Emp e4 ON e3.PersonID = e4.PersonID AND e4.CompanyID = 4
WHERE e1.CompanyID = 1
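If the SQL is hard to picture, the two ideas can be mimicked in memory. To be clear, this is only an analogy of my own and says nothing about how any of the three databases actually executes these queries. Each row is assumed to be a {personId, companyId} pair, and each pair is assumed to appear at most once (the same assumption the HAVING COUNT(*) = 4 check relies on).

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// An in-memory analogy for the two queries. All names are hypothetical.
class AllCompanies {
    // Aggregation: count matching rows per person, keep people with a full count.
    static Set<Long> byAggregation(long[][] emp, Set<Long> wanted) {
        Map<Long, Integer> counts = new HashMap<>();
        for (long[] row : emp) {                       // WHERE CompanyID IN (...)
            if (wanted.contains(row[1])) {
                counts.merge(row[0], 1, Integer::sum); // GROUP BY PersonID
            }
        }
        Set<Long> result = new TreeSet<>();
        for (Map.Entry<Long, Integer> entry : counts.entrySet()) {
            if (entry.getValue() == wanted.size()) {   // HAVING COUNT(*) = n
                result.add(entry.getKey());
            }
        }
        return result;
    }

    // Joins: take everyone at the first company, then keep only those who
    // also appear at each further company.
    static Set<Long> byJoins(long[][] emp, Set<Long> wanted) {
        Set<Long> result = null;
        for (long company : wanted) {
            Set<Long> people = new TreeSet<>();
            for (long[] row : emp) {
                if (row[1] == company) {
                    people.add(row[0]);
                }
            }
            if (result == null) {
                result = people;
            } else {
                result.retainAll(people);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }
}
```

Both methods return the same people; the interesting question, which only testing on a real database answers, is which one gets there faster.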

Results

Query       | (P,C) index [1] | (C,P) index [2] | Oracle 10g XE | MySQL 5.0.49a (MyISAM) | SQL Server 2008
------------|-----------------|-----------------|---------------|------------------------|----------------
Aggregation | No              | No              | 1.329s        | 0.703s [4]             | 6.920s
Aggregation | Yes             | No              | 1.329s        | 2.219s                 | 5.983s
Aggregation | No              | Yes             | 0.219s        | 0.406s                 | 0.436s
Aggregation | Yes             | Yes             | 0.230s        | 0.406s                 | 0.416s
Join        | No              | No              | 0.729s        | Failed [5]             | 6.656s
Join        | Yes             | No              | 0.719s        | 2.750s                 | 6.796s
Join        | No              | Yes             | 0.094s        | 0.704s                 | 0.670s [3]
Join        | Yes             | Yes             | 0.094s        | 0.813s                 | 0.423s

[1] (PersonID,CompanyID)
[2] (CompanyID,PersonID)
[3] This result was highly variable, ranging from 0.3 to 0.75 seconds
[4] This is correct; the query is faster without the (CompanyID,PersonID) index
[5] The query had failed to return after 4 minutes

Observations

From these results I can make several observations:

  1. (CompanyID,PersonID) was clearly the driving index across all three databases. I had some expectation that (PersonID,CompanyID) would be a factor. Clearly that was not the case;
  2. The optimizers for Oracle and SQL Server were both extremely consistent, with the one exception of the high variability of the result marked with footnote 3 above;
  3. The join version on Oracle was consistently faster than the aggregation version;
  4. SQL Server had basically the same performance with both queries;
  5. Aggregation is generally faster on MySQL; and
  6. The performance order is clearly Oracle then SQL Server then MySQL for this test.

Conclusion

What I hope you take away from this is, first and foremost, the importance of testing queries. Unlike normal programming, where a quicksort will be faster than a bubble sort (on a non-trivially sized data set) regardless of the implementation language, far fewer principles are universal in the database world.

The second thing is that data set size matters. I deliberately tested with millions of records above because all too often I see people draw erroneous conclusions based on datasets that are way too small. It’s a big mistake that can lead to all sorts of scalability problems. If you’re not testing your queries with a dataset analogous in size to a production environment then you are potentially invalidating any test results you get.

It’s worth noting that even at 4 million records, this is, at best, a medium-sized dataset. It is not unusual to get into the hundreds of millions (or even billions) of record ranges.

The third thing I hope you take away from this is that you get what you pay for when it comes to databases. As much as I like MySQL (and, trust me, I do like MySQL), use of commodity hardware and free software has almost become a religion in the blogosphere especially (and the interweb in general). Such stacks can have a lot of merit but they aren’t universally applicable.

The last thing I want to leave you with is hopefully an appreciation of how complicated it is to write a performant database layer. This should be a cautionary tale on the overuse of ORMs that write the SQL for you. If it’s this hard for a human to write, what chance does an ORM have?