Oracle vs MySQL vs SQL Server: Aggregation vs Joins

I’m having a bad week for writing blog posts. I’ve so far started posts on two different topics, both of which I’ve been forced to abandon mid-way because while researching the topic and verifying my assumptions or results I’ve disproven my argument. That’s annoying but it would’ve been annoying to publish them and make an ass of myself by being grossly factually incorrect. So free tip for you: do your research.

Anyway, one of them led to this one, which allows me to make some points that are (hopefully) worth making.

SQL for a lot of developers is somewhat mystical. As far as I’m concerned, relational algebra is an essential foundation for any programmer’s education but some courses seem to skip it entirely (or pay it lip service briefly before moving on), some programmers didn’t get any sort of (related) formal education or its simply not interesting so it’s in one ear and out the next.

This has led in part to a strong movement in modern programming to treat databases and SQL as a problem that needs fixing. Look no further than the plethora of .Net/Java ORMs or Rails’ ActiveRecord to see proof of that.

There’s also a lot of ignorance about how to construct performant queries. Whereas programs can nearly always be analysed in purely algorithmic terms, SQL needs to be tested. That’s where this topic comes in.

Consider this situation: you have a join table between Employees and Companies (being a many-to-many relationship). For simplicity we’ll look at one table only with three columns: ID, PersonID and CompanyID. Technically the ID column could be dropped in favour of a composite primary key but I prefer single-column keys for a variety of reasons.

How do you get a list of people that have worked for a given list of 5 companies? By that I mean, they’ve worked for every one, meaning for a given person P every company C, there should exist a record (P,C). For this test I've generated roughly 4 million people that have employment records with roughly 2-10 of 100 companies. Further details of the test setup are in Aggregation vs Joins: Methodology.

There are two basic ways of solving this problem: aggregation and joins.

The aggregation approach uses the SQL GROUP BY functionality and will look something like this in its simplest form:

SELECT PersonID
FROM Emp
WHERE CompanyID IN (1,2,3,4)
GROUP BY PersonID
HAVING COUNT(*) = 4

A variation upon this question comes up reasonably often on StackOverflow and the above is a commonly suggested solution. I can understand this to a degree: it is reasonably elegant and lends itself to dynamic SQL writing.

The join version is, for a lot of people, “uglier”:

SELECT PersonID
FROM Emp e1
JOIN Emp e2 ON e1.PersonID = e2.PersonID AND e2.CompanyID = 2
JOIN Emp e3 ON e2.PersonID = e3.PersonID AND e3.CompanyID = 3
JOIN Emp e4 ON e3.PersonID = e4.PersonID AND e4.CompanyID = 4
WHERE e1.CompanyID = 1

Results

Query Indexes Database
P,C1 C,P2 Oracle 10g XE MySQL 5.0.49a (MyISAM) SQL Server 2008
Aggregation No No 1.329s 0.703s4 6.920s
Yes No 1.329s 2.219s 5.983s
No Yes 0.219s 0.406s 0.436s
Yes Yes 0.230s 0.406s 0.416s
Join No No 0.729s Failed5 6.656s
Yes No 0.719s 2.750s 6.796s
No Yes 0.094s 0.704s 0.670s3
Yes Yes 0.094s 0.813s 0.423s
1 (PersonID,CompanyID)
2 (CompanyID,PersonID)
3 This result was highly variable, ranging from 0.3 to 0.75 seconds
4 This is correct; the query is faster without the (CompanyID,PersonID) index
5 The query had failed to return after 4 minutes

Observations

From these results I can make several observations:

  1. (CompanyID,PersonID) was clearly the driving index across all three databases. I had some expectation that (PersonID,CompanyID) would be a factor. Clearly that was not the case;
  2. The optimizers for both Oracle and SQL Server were both extremely consistent with the one exception of the high variability of case (3) above;
  3. The join version on Oracle was consistently faster than the aggregation version;
  4. SQL Server had basically the same performance with both queries;
  5. Aggregation is generally faster on MySQL; and
  6. The performance order is clearly Oracle then SQL Server then MySQL for this test.

Conclusion

What I hope you take away from this is, first and foremost, the importance of testing queries. Unlike normal programming where the fact that a quicksort will be faster than a bubble sort (on a non-trivially sized data set) regardless of the implementation language, much fewer principles are universal in the database world.

The second thing is that data set size matters. I deliberately tested with millions of records above because all too often I see people draw erroneous conclusions based on datasets that are way too small. It’s a big mistake that can lead to all sorts of scalability problems. If you’re not testing your queries with a dataset analagous in size to a production environment then you are potentially invalidating any test results you get.

It’s worth noting that even at 4 million records, this is, at best, a medium-sized dataset. It is not unusual to get into the hundreds of millions (or even billions) of record ranges.

The third thing I hope you take away from this is that you get what you pay for when it comes to databases. As much as I like MySQL (and, trust me, I do like MySQL), use of commodity hardware and free software has almost become a religion in the blogosphere especially (and the interweb in general). Such stacks can have a lot of merit but they aren’t universally applicable.

The last thing I want to leave you with is hopefully an appreciation of how complicated it is to write a performant database layer. This should be a cautionary tale on the overuse of ORMs that write the SQL for you. If it’s this hard for a human to write, what chance does an ORM have?

7 comments:

Jim Robert said...

Great post!

Just one thing, I understand your sentiment, but I have to respectfully disagree with your last statement, "If it’s this hard for a human to write, what chance does an ORM have?"

I think the ORM has a BETTER chance in the vast majority of case where developers find SQL to be, as you say, "somehwat mystical"

In those cases the ORM will outperform the developer's clumsily handcoded sql nearly every time.

Additionially, while performance is important, I'll take developers that think of interesting applications but write slow sql code over those that write performant code but made useless applications any day.

Once again, very interesting article keep it up!

Programming Language Faqs said...

Hello,

I found this article to be very helpful and consistent with the information I have found throughout the web. I would also like to recommend you one article

http://programminglanguagefaqs.blogspot.com/2009/09/xml-vs-mysql.html

buy viagra said...

I like this comment: "In those cases the ORM will outperform the developer's clumsily handcoded sql nearly every time..." is very interesting!!!

viagra online said...

Competition between the industrial database backends too strong. Each of them supports the standard SQL in a certain volume, and its own dialect of SQL, allows you to squeeze the maximum out of this database. They say that Oracle - the fastest database in the world, but why then she would not compete with MySQL, which, judging by the tests, faster MS SQL Server. And everything was in order with the tests? New version of MS SQL Server - codenamed Yukon - promises to make a revolution. Which answer prepares Oracle?

mother breastfeeding said...

I prefer Oracle

Vefk said...

I prefer Oracle

Anonymous said...

hi, I tried to run a simple test to see how the speed fair between oracle and Ms Sql as follow:
populate two tables with 2 million records, cross join them, filter by a range, and do a sum. Both table structure as fellow:

create table t (
record number,
name , varchar(50),
num number //--random
)

SQL as :
select sum(a.record) from test a cross join test b where a.num between 50001 and 80000 and b.num between 80001 and 110000

Somehow the same query running on oracle perform badly when compare to Ms sql. On oracle, the timing around 20s, but on Ms Sql,the result return within 1s. As I increase the range, the timing on oracle degraded while sql still fair well.

Shall the statement on oracle be written differently? I tried searching for info but cant seem to find any information regarding this.
Wonder if you may provide some insight?

Thanks

Post a Comment