I’m having a bad week for writing blog posts. I’ve so far started posts on two different topics, both of which I’ve been forced to abandon midway because, while researching the topic and verifying my assumptions or results, I’ve disproven my own argument. That’s annoying, but it would’ve been far worse to publish them and make an ass of myself by being grossly factually incorrect. So here’s a free tip: do your research.
Anyway, one of them led to this one, which allows me to make some points that are (hopefully) worth making.
SQL is, for a lot of developers, somewhat mystical. As far as I’m concerned, relational algebra is an essential foundation for any programmer’s education, but some courses seem to skip it entirely (or pay it brief lip service before moving on), some programmers never got any sort of (related) formal education, or it’s simply not interesting to them so it goes in one ear and out the other.
This has led in part to a strong movement in modern programming to treat databases and SQL as a problem that needs fixing. Look no further than the plethora of .Net/Java ORMs or Rails’ ActiveRecord to see proof of that.
There’s also a lot of ignorance about how to construct performant queries. Whereas programs can nearly always be analysed in purely algorithmic terms, SQL needs to be tested. That’s where this topic comes in.
Consider this situation: you have a join table between Employees and Companies (being a many-to-many relationship). For simplicity we’ll look at one table only with three columns: ID, PersonID and CompanyID. Technically the ID column could be dropped in favour of a composite primary key but I prefer single-column keys for a variety of reasons.
How do you get a list of people who have worked for a given list of 5 companies? By that I mean they’ve worked for every one: for a given person P and every company C in the list, there should exist a record (P,C). For this test I’ve generated roughly 4 million people with employment records at roughly 2–10 of 100 companies. Further details of the test setup are in Aggregation vs Joins: Methodology.
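To make the requirement concrete, here’s a toy-scale sketch in plain Python of what “worked for every company in the list” means (the sample rows are invented; the real test used millions of people):

```python
# Toy illustration of the requirement: a person P qualifies only if a
# record (P, C) exists for EVERY company C in the target list.
# Sample (PersonID, CompanyID) pairs -- invented, not the benchmark data.
rows = [
    (1, 1), (1, 2), (1, 3), (1, 4),          # person 1: all four target companies
    (2, 1), (2, 2),                          # person 2: only two of them
    (3, 1), (3, 2), (3, 3), (3, 4), (3, 5),  # person 3: all four, plus company 5
]

target = {1, 2, 3, 4}
people = {p for (p, _) in rows}
qualified = sorted(p for p in people
                   if target <= {c for (q, c) in rows if q == p})
print(qualified)  # [1, 3] -- persons 1 and 3 worked for every target company
```

This is relational division in miniature, which is exactly what both SQL approaches below are trying to express.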
There are two basic ways of solving this problem: aggregation and joins.
The aggregation approach uses the SQL GROUP BY functionality and will look something like this in its simplest form:
SELECT PersonID
FROM Emp
WHERE CompanyID IN (1, 2, 3, 4)
GROUP BY PersonID
HAVING COUNT(*) = 4
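As a sanity check of the logic (not the performance), the aggregation query can be run against toy data in SQLite. Note the hidden assumption: `COUNT(*) = 4` only works if (PersonID, CompanyID) pairs are unique; with repeat stints at the same company you’d need `COUNT(DISTINCT CompanyID) = 4`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Emp (ID INTEGER PRIMARY KEY, PersonID INTEGER, CompanyID INTEGER)")
conn.executemany(
    "INSERT INTO Emp (PersonID, CompanyID) VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (1, 4),           # person 1: all four target companies
     (2, 1), (2, 2),                           # person 2: only two
     (3, 1), (3, 2), (3, 3), (3, 4), (3, 5)])  # person 3: all four, plus one more

# HAVING COUNT(*) = 4 relies on (PersonID, CompanyID) pairs being unique;
# with duplicate stints you'd need COUNT(DISTINCT CompanyID) = 4 instead.
agg = conn.execute("""
    SELECT PersonID FROM Emp
    WHERE CompanyID IN (1, 2, 3, 4)
    GROUP BY PersonID
    HAVING COUNT(*) = 4
""").fetchall()
agg_people = sorted(p for (p,) in agg)
print(agg_people)  # [1, 3]
```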
A variation upon this question comes up reasonably often on StackOverflow and the above is a commonly suggested solution. I can understand this to a degree: it is reasonably elegant and lends itself to dynamic SQL writing.
The join version is, for a lot of people, “uglier”:
SELECT e1.PersonID
FROM Emp e1
JOIN Emp e2 ON e1.PersonID = e2.PersonID AND e2.CompanyID = 2
JOIN Emp e3 ON e2.PersonID = e3.PersonID AND e3.CompanyID = 3
JOIN Emp e4 ON e3.PersonID = e4.PersonID AND e4.CompanyID = 4
WHERE e1.CompanyID = 1
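Run against the same toy data in SQLite, the join formulation returns the same people — again, a sketch of the logic only; the timings below came from the real servers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Emp (ID INTEGER PRIMARY KEY, PersonID INTEGER, CompanyID INTEGER)")
conn.executemany(
    "INSERT INTO Emp (PersonID, CompanyID) VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (1, 4),
     (2, 1), (2, 2),
     (3, 1), (3, 2), (3, 3), (3, 4), (3, 5)])

# One self-join per required company. The select list must qualify
# e1.PersonID because PersonID alone is ambiguous across the four aliases.
join_people = sorted(p for (p,) in conn.execute("""
    SELECT e1.PersonID FROM Emp e1
    JOIN Emp e2 ON e1.PersonID = e2.PersonID AND e2.CompanyID = 2
    JOIN Emp e3 ON e2.PersonID = e3.PersonID AND e3.CompanyID = 3
    JOIN Emp e4 ON e3.PersonID = e4.PersonID AND e4.CompanyID = 4
    WHERE e1.CompanyID = 1
""").fetchall())
print(join_people)  # [1, 3] -- same answer as the aggregation version
```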
[Results table: the columns were the indexes in place — (PersonID,CompanyID) and (CompanyID,PersonID) — and the query timings on Oracle 10g XE, MySQL 5.0.49a (MyISAM) and SQL Server 2008; the timing figures themselves have not survived.]

Notes:
3. This result was highly variable, ranging from 0.3 to 0.75 seconds.
4. This is correct; the query is faster without the (CompanyID,PersonID) index.
5. The query had failed to return after 4 minutes.
From these results I can make several observations:
- (CompanyID,PersonID) was clearly the driving index across all three databases. I had some expectation that (PersonID,CompanyID) would be a factor. Clearly that was not the case;
- The optimizers for Oracle and SQL Server were both extremely consistent, with the one exception of the high variability of case (3) above;
- The join version on Oracle was consistently faster than the aggregation version;
- SQL Server had basically the same performance with both queries;
- Aggregation is generally faster on MySQL; and
- The performance order is clearly Oracle then SQL Server then MySQL for this test.
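The driving-index observation can be sketched in SQLite too (the index name `idx_company_person` is mine, and SQLite’s planner is not Oracle’s — treat this purely as an illustration of a composite index whose leading column matches the WHERE clause):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Emp (ID INTEGER PRIMARY KEY, PersonID INTEGER, CompanyID INTEGER)")
conn.executemany(
    "INSERT INTO Emp (PersonID, CompanyID) VALUES (?, ?)",
    [(1, 1), (1, 2), (1, 3), (1, 4),
     (2, 1), (2, 2),
     (3, 1), (3, 2), (3, 3), (3, 4), (3, 5)])

# A composite index leading on CompanyID lets the WHERE clause seek straight
# into the relevant company ranges; the trailing PersonID makes it covering.
conn.execute("CREATE INDEX idx_company_person ON Emp (CompanyID, PersonID)")

query = """
    SELECT PersonID FROM Emp
    WHERE CompanyID IN (1, 2, 3, 4)
    GROUP BY PersonID
    HAVING COUNT(*) = 4
"""
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row)  # plan details vary by SQLite version

indexed_people = sorted(p for (p,) in conn.execute(query).fetchall())
print(indexed_people)  # [1, 3] -- same result, with or without the index
```

An index changes how fast the answer arrives, never what the answer is — which is why the timings above differ so much while every database returned the same rows.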
What I hope you take away from this is, first and foremost, the importance of testing queries. Unlike conventional programming, where a quicksort will beat a bubble sort (on a non-trivially sized data set) regardless of the implementation language, far fewer principles are universal in the database world.
The second thing is that data set size matters. I deliberately tested with millions of records above because all too often I see people draw erroneous conclusions from datasets that are way too small. It’s a big mistake that can lead to all sorts of scalability problems. If you’re not testing your queries against a dataset analogous in size to your production environment, you are potentially invalidating any test results you get.
It’s worth noting that even at 4 million records, this is, at best, a medium-sized dataset. It is not unusual to get into the hundreds of millions (or even billions) of record ranges.
The third thing I hope you take away from this is that you get what you pay for when it comes to databases. As much as I like MySQL (and, trust me, I do like MySQL), use of commodity hardware and free software has almost become a religion in the blogosphere especially (and the interweb in general). Such stacks can have a lot of merit but they aren’t universally applicable.
The last thing I want to leave you with is hopefully an appreciation of how complicated it is to write a performant database layer. This should be a cautionary tale on the overuse of ORMs that write the SQL for you. If it’s this hard for a human to write, what chance does an ORM have?