I recently read A Beginners' Guide to Big O Notation and while I appreciate such efforts I don't think it went far enough. I'm a huge fan of "plain English" explanations of, well, anything. Just look at the formal definition of Big O. The only people who can understand that already know what it means (and probably have a higher degree in mathematics and/or computer science).
On StackOverflow you often get comments like “you should do X because it’s O(2n) and Y is O(3n)”. Such statements originate from a basic misunderstanding of what Big O is and how to apply it. The material in this post is basically a rehash and expansion of what I've previously written on the subject.
What is Big O?
Big O notation seeks to describe the relative complexity of an algorithm by reducing it to its growth rate as the key factor (typically the input size) tends towards infinity. For this reason, you will often hear the phrase asymptotic complexity. In doing so, all other factors are ignored. It is a relative, not absolute, representation of complexity.
What Isn’t Big O?
Big O isn’t a performance test of an algorithm. It is also notional or abstract in that it tends to ignore other factors. Sorting algorithm complexity is typically reduced to the number of elements being sorted as being the key factor. This is fine but it doesn’t take into account issues such as:
- Memory Usage: one algorithm might use much more memory than another. Depending on the situation this could be anything from completely irrelevant to critical;
- Cost of Comparison: It may be that comparing elements is really expensive, which will potentially change any real-world comparison between algorithms;
- Cost of Moving Elements: copying elements is typically cheap, but this isn't necessarily the case.
The best example of Big-O I can think of is doing arithmetic. Take two numbers (123456 and 789012). The basic arithmetic operations we learnt in school were:
- addition;
- subtraction;
- multiplication; and
- division.
Each of these is an operation or a problem. A method of solving these is called an algorithm.
Addition is the simplest. You line the numbers up (to the right) and add the digits in each column, writing the last digit of that sum in the result. The 'tens' part of that sum is carried over to the next column.
Let's assume that the addition of two digits is the most expensive operation in this algorithm. It stands to reason that to add these two numbers together we have to add together 6 digits (and possibly carry into a 7th). If we add two 100-digit numbers together we have to do 100 additions. If we add two 10,000-digit numbers we have to do 10,000 additions.
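The column addition just described can be sketched in a few lines (the function name and the digit-string representation are mine, purely for illustration):

```python
def add_digit_strings(a, b):
    """Add two non-negative integers given as digit strings, column by column."""
    result = []
    carry = 0
    # Walk the columns right-to-left, as in schoolbook addition.
    for i in range(1, max(len(a), len(b)) + 1):
        da = int(a[-i]) if i <= len(a) else 0
        db = int(b[-i]) if i <= len(b) else 0
        total = da + db + carry           # one digit addition per column
        result.append(str(total % 10))    # write the last digit of the column sum
        carry = total // 10               # carry the 'tens' part to the next column
    if carry:
        result.append(str(carry))
    return "".join(reversed(result))

print(add_digit_strings("123456", "789012"))  # → 912468
```

The loop body runs once per digit, which is exactly why the work is proportional to the number of digits.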
See the pattern? The complexity (being the number of operations) is directly proportional to the number of digits. We call this O(n) or linear complexity. Some argue that this is in fact O(log n) or logarithmic complexity. Why? Because adding 10,000,000 to itself takes only twice as long as adding 1,000 to itself: there are 8 digits instead of 4, yet the number is 10,000 times as large. So depending on your application it may be appropriate to define the problem in terms of the number of digits (i.e. the order of magnitude) of the input. In others, the number itself may be appropriate.
Subtraction is similar (except you may need to borrow instead of carry).
Multiplication is different. You line the numbers up, take the first digit in the bottom number and multiply it in turn against each digit in the top number and so on through each digit. So to multiply our two 6 digit numbers we must do 36 multiplications. We may need to do as many as 10 or 11 column adds to get the end result too.
If we have two 100-digit numbers we need to do 10,000 multiplications and 200 adds. For two one-million-digit numbers we need to do one trillion (10^12) multiplications and two million adds.
As the algorithm scales with n-squared, this is O(n^2) or quadratic complexity. This is a good time to introduce another important concept:
We only care about the most significant portion of complexity.
The astute may have realized that we could express the number of operations as: n^2 + 2n. But as you saw from our example with two numbers of a million digits apiece, the second term (2n) becomes insignificant (accounting for about 0.0002% of the total operations by that stage).
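Schoolbook long multiplication can be sketched the same way, making the n * n digit multiplications (plus the lower-order carry work) visible. Again, names and representation are just illustrative:

```python
def multiply_digit_strings(a, b):
    """Schoolbook long multiplication: one digit multiply per digit pair."""
    # products[i] accumulates the running sum for the 10**i column.
    products = [0] * (len(a) + len(b))
    for i, da in enumerate(reversed(a)):        # each digit of the bottom number...
        for j, db in enumerate(reversed(b)):    # ...against each digit of the top
            products[i + j] += int(da) * int(db)
    # Propagate carries once at the end (the lower-order 2n-ish work).
    for k in range(len(products) - 1):
        products[k + 1] += products[k] // 10
        products[k] %= 10
    digits = "".join(map(str, reversed(products))).lstrip("0")
    return digits or "0"

# Two 6-digit numbers: the nested loops run 6 * 6 = 36 times.
print(multiply_digit_strings("123456", "789012"))
```

The nested loops are the quadratic part; the single carry pass is the linear term that Big O discards.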
The Telephone Book
The next best example I can think of is the telephone book, normally called the White Pages or similar but it'll vary from country to country. But I'm talking about the one that lists people by surname and then initials or first name, possibly address and then telephone numbers.
Now if you were instructing a computer to look up the phone number for "John Smith", what would you do? Ignoring the fact that you could guess where the S's start (let's assume you can't), what would you do?
A typical implementation might be to open the book to the middle, take the 500,000th name and compare it to "Smith". If it happens to be "Smith, John", we just got really lucky. Far more likely is that "John Smith" will be before or after that name. If it's after, we divide the last half of the phone book in half and repeat. If it's before, we divide the first half of the phone book in half and repeat. And so on.
This is called a binary search (or bisection search) and is used every day in programming whether you realize it or not.
So if you want to find a name in a phone book of a million names, you can actually find any name by doing this at most 20 times (since 2^20 is just over 1,000,000). In comparing search algorithms we decide that this comparison is our 'n'.
For a phone book of 3 names it takes 2 comparisons (at most).
For 7 it takes at most 3.
For 15 it takes 4.
For 1,000,000 it takes at most 20.
That is staggeringly good isn't it?
In Big-O terms this is O(log n) or logarithmic complexity. Now the logarithm in question could be ln (base e), log10, log2 or some other base. It doesn't matter; it's still O(log n), just like O(2n^2) and O(100n^2) are still both O(n^2).
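Here's a minimal sketch of the phone-book bisection, with a comparison counter to make the logarithmic growth visible (the names and list layout are invented for illustration):

```python
def binary_search(names, target):
    """Find target in a sorted list, counting comparisons along the way."""
    low, high = 0, len(names) - 1
    comparisons = 0
    while low <= high:
        mid = (low + high) // 2
        comparisons += 1
        if names[mid] == target:
            return mid, comparisons
        elif names[mid] < target:
            low = mid + 1    # target is after the middle: discard the first half
        else:
            high = mid - 1   # target is before the middle: discard the last half
    return -1, comparisons

# A "phone book" of a million sorted entries.
book = [f"Name{i:07d}" for i in range(1_000_000)]
index, comparisons = binary_search(book, "Name0765432")
print(index, comparisons)  # never more than 20 comparisons for a million entries
```

Each pass through the loop halves the remaining range, which is where the log2 comes from.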
It's worthwhile at this point to explain that Big O can be used to determine three cases with an algorithm:
- Best Case: In the telephone book search, the best case is that we find the name in one comparison. This is O(1) or constant complexity;
- Expected Case: As discussed above this is O(log n); and
- Worst Case: This is also O(log n).
Normally we don't care about the best case. We're interested in the expected and worst case. Sometimes one or the other of these will be more important.
Back to the telephone book.
What if you have a phone number and want to find a name? The police have a reverse phone book but such lookups are denied to the general public. Or are they? Technically you can reverse lookup a number in an ordinary phone book. How?
You start at the first name and compare the number. If it's a match, great, if not, you move on to the next. You have to do it this way because the phone book is unordered (by phone number anyway).
So to find a name:
- Best Case: O(1);
- Expected Case: O(n) (for 500,000); and
- Worst Case: O(n) (for 1,000,000).
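The reverse lookup is just a front-to-back scan. A sketch, with made-up entries:

```python
def reverse_lookup(book, number):
    """Scan a phone book unordered by number, front to back: O(n) comparisons."""
    for name, phone in book:
        if phone == number:   # one comparison per entry until a match
            return name
    return None               # checked all n entries: not listed

book = [("Smith, John", "555-0172"),
        ("Jones, Mary", "555-0168"),
        ("Brown, Ann", "555-0143")]
print(reverse_lookup(book, "555-0168"))  # → Jones, Mary
```

Because the entries are in no useful order for this query, nothing better than checking each one in turn is available.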
The Travelling Salesman
This is quite a famous problem in computer science and deserves a mention. In this problem you have N towns. Each of those towns is linked to 1 or more other towns by a road of a certain distance. The Travelling Salesman problem is to find the shortest tour that visits every town.
Sounds simple? Think again.
If you have 3 towns A, B and C with roads between all pairs then you could go:
A -> B -> C
A -> C -> B
B -> C -> A
B -> A -> C
C -> A -> B
C -> B -> A
Well, actually there are fewer than that because some of these are equivalent (A -> B -> C and C -> B -> A are equivalent, for example, because they use the same roads, just in reverse).
In actuality there are 3 possibilities.
Take this to 4 towns and you have 12 possibilities. With 5 it's 60. 6 becomes 360.
This growth is governed by a mathematical operation called a factorial. Basically:
5! = 5 * 4 * 3 * 2 * 1 = 120
6! = 6 * 5 * 4 * 3 * 2 * 1 = 720
7! = 7 * 6 * 5 * 4 * 3 * 2 * 1 = 5040
...
25! = 25 * 24 * ... * 2 * 1 = 15,511,210,043,330,985,984,000,000
...
50! = 50 * 49 * ... * 2 * 1 = 3.04140932... × 10^64
So far, the only known way of solving the Travelling Salesman problem exactly is essentially brute force; no polynomial-time algorithm is known, and trying every route has O(n!) complexity.
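A brute-force sketch makes the O(n!) plain: it literally tries every ordering of the towns. The distance matrix here is a made-up example for four towns:

```python
from itertools import permutations

def shortest_tour(dist):
    """Brute-force TSP: try every ordering of the towns — O(n!) permutations."""
    towns = range(len(dist))
    best_route, best_length = None, float("inf")
    for route in permutations(towns):  # n! candidate routes
        length = sum(dist[route[i]][route[i + 1]] for i in range(len(route) - 1))
        if length < best_length:
            best_route, best_length = route, length
    return best_route, best_length

# Symmetric road distances between 4 hypothetical towns (0, 1, 2, 3).
dist = [[0, 2, 9, 10],
        [2, 0, 6, 4],
        [9, 6, 0, 3],
        [10, 4, 3, 0]]
print(shortest_tour(dist))
```

Four towns means only 24 permutations; at 25 towns the same loop would need more iterations than there are stars in our galaxy.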
By the time you get to 200 towns there isn't enough time left in the universe to solve the problem with traditional computers.
Something to think about.
Another point I wanted to make quick mention of is that any algorithm that has a complexity of O(n^k) for some constant k is said to have polynomial complexity, or to be solvable in polynomial time.
Traditional computers can solve polynomial-time problems. Certain things are relied on in the world because of this; Public Key Cryptography is a prime example. It is computationally hard to find the two prime factors of a very large number. If it weren't, we couldn't use the public key systems we use.
Big Greek Letters
Big O is often misused. Big O or Big Oh is actually short for Big Omicron. It represents the upper bound of asymptotic complexity. So if an algorithm is O(n log n), there exists a constant c such that, for large enough n, the number of operations is bounded above by cn log n.
Θ(n log n) (Big Theta) is more tightly bound than that. Such an algorithm means there exist two constants c1 and c2 such that c1n log n < f(n) < c2n log n.
Ω(n log n) (Big Omega) says that the algorithm has a lower bound of cn log n.
There are others but these are the most common and Big O is the most common of all. Such a distinction is typically unimportant but it is worth noting. The correct notation is the correct notation, after all.
Algorithms can also be classified as being either deterministic or probabilistic. It's important to understand the difference. Sometimes requirements or constraints may dictate the choice of one over the other, even if the expected case is worse. You should be able to classify an algorithm as one or the other.
A good example of this is comparing files. Say you have two files A and B and you want to determine if they are the same. The simple implementation for this is:
- If the sizes are different, the files are different; else
- Compare each file byte-for-byte. If two different bytes are found, the files are different; else
- The files are the same.
This is a deterministic algorithm because the probability of a false positive (the algorithm saying the files are the same when they aren't) and of a false negative (saying they are different when they are in fact the same) is 0 in both cases.
For various reasons, however, it might be impractical or undesirable to implement the algorithm this way. Many file comparisons may be required, making the operation potentially very expensive on large files. Also, the files might be remote from each other, and it might be impractical to send a complete copy just so the remote system can see if it's changed.
A more common approach is to use a hash function. A hash function basically just converts a large piece of data into a smaller piece of data (called a hash), usually a 32-128 bit integer. A good hash function will distribute values in the new (smaller) data range as evenly as possible.
A common hash function is an MD5 hash, which generates a 128-bit hash. Let’s say files A and B were on different servers. One could send an MD5 hash of the file to the other, which could compare it to its own MD5 hash. If they’re different, the files are different. If they’re the same, the files are highly likely to be the same.
An MD5 hash comparison is a probabilistic comparison algorithm for this reason.
And before you say that the chance is so remote it’ll never happen, think again. A malicious exploit has been demonstrated of generating two files with the same MD5 hash.
Algorithms like this, whose safety rests on brute force being infeasible, age relatively quickly. Where once MD5 was considered safe, creating two messages with the same MD5 hash is now feasible (in a matter of days on not-unreasonable hardware), such that the more secure SHA-1 algorithm has largely replaced its usage.
Anyway, that's it for my (hopefully plain English) explanation of Big O. I intend to follow this up with applications to some common scenarios in weeks to come.