Plain English Explanation of Big O Notation

I recently read A Beginners’ Guide to Big O Notation and while I appreciate such efforts I don’t think it went far enough. I’m a huge fan of “plain English” explanations to, well, anything. Just look at the formal definition of Big O. The only people who can understand that already know what it means (and probably have a higher degree in mathematics and/or computer science).

On StackOverflow you often get comments like “you should do X because it’s O(2n) and Y is O(3n)”. Such statements originate from a basic misunderstanding of what Big O is and how to apply it. The material in this post is basically a rehash and expansion of what I've previously written on the subject.

What is Big O?

Big O notation seeks to describe the relative complexity of an algorithm by reducing the growth rate to a single factor when the key factor tends towards infinity. In doing so, all other factors are ignored. It is a  relative representation of complexity.

What Isn’t Big O?

Big O isn’t a performance test of an algorithm. It is also notional or abstract in that it tends to ignore other factors. Sorting algorithm complexity is typically reduced to the number of elements being sorted as being the key factor. This is fine but it doesn’t take into account issues such as:

  • Memory Usage: one algorithm might use much more memory than another. Depending on the situation this could be anything from completely irrelevant to critical;
  • Cost of Comparison: It may be that comparing elements is really expensive, which will potentially change any real-world comparison between algorithms;
  • Cost of Moving Elements: copying elements is typically cheap but this isn’t necessarily the case;
  • etc.

Arithmetic

The best example of Big-O I can think of is doing arithmetic. Take two numbers (123456 and 789012). The basic arithmetic operations we learnt in school were:

  • addition;
  • subtraction;
  • multiplication; and
  • division.

Each of these is an operation or a problem. A method of solving these is called an algorithm.

Addition is the simplest. You line the numbers up (to the right) and add the digits in a column writing the last number of that addition in the result. The 'tens' part of that number is carried over to the next column.

Let's assume that the addition of these numbers is the most expensive operation in this algorithm. It stands to reason that to add these two numbers together we have to add together 6 digits (and possibly carry a 7th). If we add two 100 digit numbers together we have to do 100 additions. If we add two 10,000 digit numbers we have to do 10,000 additions.

See the pattern? The complexity (being the number of operations) is directly proportional to the number of digits. We call this O(n) or linear complexity.

Subtraction is similar (except you may need to borrow instead of carry).

Multiplication is different. You line the numbers up, take the first digit in the bottom number and multiply it in turn against each digit in the top number and so on through each digit. So to multiply our two 6 digit numbers we must do 36 multiplications. We may need to do as many as 10 or 11 column adds to get the end result too.

If we have 2 100 digit numbers we need to do 10,000 multiplications and 200 adds. For two one million digit numbers we need to do one trillion (1012) multiplications and two million adds.

As the algorithm scales with n-squared, this is O(n2) or quadratic complexity. This is a good time to introduce another important concept:

We only care about the most significant portion of complexity.

The astute may have realized that we could express the number of operations as: n2 + 2n. But as you saw from our example with two numbers of a million digits apiece, the second term (2n) becomes insignificant (accounting for 0.00002% of the total operations by that stage).

The Telephone Book

The next best example I can think of is the telephone book, normally called the White Pages or similar but it'll vary from country to country. But I'm talking about the one that lists people by surname and then initials or first name, possibly address and then telephone numbers.

Now if you were instructing a computer to look up the phone number for "John Smith", what would you do? Ignoring the fact that you could guess how far in the S's started (let's assume you can't), what would you do?

A typical implementation might be to open up to the middle, take the 500,000th and compare it to "Smith". If it happens to be "Smith, John", we just got real lucky. Far more likely is that "John Smith" will be before or after that name. If it's after we then divide the last half of the phone book in half and repeat. If it's before then we divide the first half of the phone book in half and repeat. And so on.

This is called a bisection search and is used every day in programming whether you realize it or not.

So if you want to find a name in a phone book of a million names you can actually find any name by doing this at most 21 or so times (I might be off by 1). In comparing search algorithms we decide that this comparison is our 'n'.

For a phone book of 3 names it takes 2 comparisons (at most).
For 7 it takes at most 3.
For 15 it takes 4.
...
For 1,000,000 it takes 21 or so.

That is staggeringly good isn't it?

In Big-O terms this is O(log n) or logarithmic complexity. Now the logarithm in question could be ln (base e), log10, log2 or some other base. It doesn't matter it's still O(log n) just like O(2n2) and O(100n2) are still both O(n2).

It's worthwhile at this point to explain that Big O can be used to determine three cases with an algorithm:

  • Best Case: In the telephone book search, the best case is that we find the name in one comparison. This is O(1) or constant complexity;
  • Expected Case: As discussed above this is O(log n); and
  • Worst Case: This is also O(log n).

Normally we don't care about the best case. We're interested in the expected and worst case. Sometimes one or the other of these will be more important.

Back to the telephone book.

What if you have a phone number and want to find a name? The police have a reverse phone book but such lookups are denied to the general public. Or are they? Technically you can reverse lookup a number in an ordinary phone book. How?

You start at the first name and compare the number. If it's a match, great, if not, you move on to the next. You have to do it this way because the phone book is unordered (by phone number anyway).

So to find a name:

  • Best Case: O(1);
  • Expected Case: O(n) (for 500,000); and
  • Worst Case: O(n) (for 1,000,000).

The Travelling Salesman

This is quite a famous problem in computer science and deserves a mention. In this problem you have N towns. Each of those towns is linked to 1 or more other towns by a road of a certain distance. The Travelling Salesman problem is to find the shortest tour that visits every town.

Sounds simple? Think again.

If you have 3 towns A, B and C with roads between all pairs then you could go:

A -> B -> C 
A -> C -> B 
B -> C -> A 
B -> A -> C 
C -> A -> B 
C -> B -> A

Well actually there's less than that because some of these are equivalent (A -> B -> C and C -> B -> A are equivalent, for example, because they use the same roads, just in reverse).

In actuality there are 3 possibilities.

Take this to 4 towns and you have (iirc) 12 possibilities. With 5 it's 60. 6 becomes 360.

This is a function of a mathematical operation called a factorial. Basically:

5! = 5 * 4 * 3 * 2 * 1 - 120
6! = 6 * 5 * 4 * 3 * 2 * 1 = 720
7! = 7 * 6 * 5 * 4 * 3 * 2 * 1 = 5040
...
25! = 25 * 24 * ... * 2 * 1 = 15,511,210,043,330,985,984,000,000
...
50! = 50 * 49 * ... * 2 * 1 = 3.04140932... × 10^64

So the Big-O of the Travelling Salesman problem is O(n!) or factorial or combinatorial complexity.

By the time you get to 200 towns there isn't enough time left in the universe to solve the problem with traditional computers.

Something to think about.

Polynomial Time

Another point I wanted to make quick mention of is that any algorithm that has a complexity of O(na) is said to have polynomial complexity or is solvable in polynomial time.

Traditional computers can solve problems in polynomial time. Certain things are used in the world because of this. Public Key Cryptography is a prime example. It is computationally hard to find two prime factors of a very large number. If it wasn't, we couldn't use the public key systems we use.

Determinism

Algorithms can also be classified as being either deterministic or probabilistic.It’s important to understand the difference. Sometimes requirements or constraints may dictate the choice of one over the other even if the expected case is worse. You should be able to classify an algorithm as one or the other.

A good example of this is comparing files. Say you have two files A and B and you want to determine if they are the same. The simple implementation for this is:

  1. If the sizes are different, the files are different; else
  2. Compare each file byte-for-byte. If two different bytes are found, the files are different; else
  3. The files are the same.

This is a deterministic algorithm because the probability of a false positive (the algorithm saying the files are the same when they aren’t) and a false negative (saying they are different when they aren’t) is 0 in both cases.

For various reasons however it might be impractical or undesirable to implement the algorithm this way. Many file comparisons may be required making the operation potentially very expensive on large files. Also the files might be remote to each other and it might be impractical to send a complete copy just so the remote system can see if its changed.

A more common approach is to use a hash function. A hash function basically just converts a large piece of data into a smaller piece of data (called a hash), usually a 32-128 bit integer. A good hash function will distribute values in the new (smaller) data range as evenly as possible.

A common hash function is an MD5 hash, which generates a 128-bit hash. Let’s say files A and B were on different servers. One could send an MD5 hash of the file to the other, which could compare it to its own MD5 hash. If they’re different, the files are different. If they’re the same, the files are highly unlikely to be the same.

An MD5 hash comparison is a probabilistic comparison algorithm for this reason.

And before you say that the chance is so remote it’ll never happen, think again. A malicious exploit has been demonstrated of generating two files with the same MD5 hash.

Conclusion

Anyway, that's it for my (hopefully plain English) explanation of Big O. I intend to follow this up with applications to some common scenarios in weeks to come.

17 comments:

Anonymous said...

Thanks William! This is by far the best explanation of Big O in plain english.

Trey said...

You wrote "highly unlikely" when you meant "highly likely".

Anonymous said...

>> See the pattern? The complexity (being the number of operations) is directly proportional to the number of digits. We call this O(n) or linear complexity.

If an algorithm is directly proportional to *the number of digits* in the problem size, then that algorithm has logarithmic complexity.

An O(n) algorithm would be something like a linear search through the phone book, or a (very) naive primality test, but not simple addition.

Anonymous said...

regarding the MD5 hash at the end of your post, you say:

" If they’re different, the files are different. If they’re the same, the files are highly unlikely to be the same."

Your saying that if the Hashes are the same then then they are *unlikely* to be the same as the full resolution file that generated the hash?

I can't quite grock that, could you have another go at explaining it, please?

If the hashes match and that means that they are unlikely to be the same as the originals, and if the hashes don't match you're saying that they are not the same too; so whats the point of hashes if they can't distinguish between same/non-same files?

I'm sure I'm missing something, its a great post thank you for writing it.

Anonymous said...

"Public Key Cryptography is a prime example."
pun intended? ;)

Anonymous said...

@Anonymous commenter #2

>> If an algorithm is directly proportional to *the number of digits* in the problem size, then that algorithm has logarithmic complexity.

I am pretty sure the author was correct when he said that his human addition algorithm is linear complexity and that you are wrong when you say it is logarithmic. Logarithmic would mean having more digits in the numbers you are adding would yeild smaller changes in the number of operations required. How can you add two numbers without adding each digit? And why would the positive slope of the number of additions you need to make fall as the two numbers got bigger? Maybe you are confused about the details of the addition algorithm.

Anonymous said...

Also, one more explanation. In the function y = Ax , where A is a constant, y and x are "directly proportional". In the function y = log x , their relationship is logarithmic. So saying that an algorithm which is directly proprtional has logarithmic complexity is false.

Anonymous said...

>> If an algorithm is directly proportional to *the number of digits* in the problem size, then that algorithm has logarithmic complexity.

Put more accurately, the algorithm has logarithmic complexity relative to the size of the number. You are correct in this.

Unfortunately, computer scientists (largely) don't care about complexity relative to the size of the number. When dealing with numbers as input to a problem (as in the case of addition), computer scientists tend to care about complexity relative to the size of the *encoding* of the number (that is, its length when you write it down).

So: In a sense you're both right - Big Oh can be used to measure the complexity both ways. Normally, though, we talk about the one used in the article.

Xerxes said...

Excellent article. Thanks for taking the time :)

Dan said...

@Anonymous #3 -

See Trey's comment: "unlikely" in that section should read "likely". If you have the hashes of two files and the hashes are different, then the files are different. If the hashes are the same, then they're very likely the same - although they could be be different (by the Pigeonhole Principle) since people have been able to generate collisions (which is the link provided).

>> “you should do X because it’s O(2n) and Y is O(3n)”

This always pisses me off, since it's people who should know better (since they can presumably determine that an algorithm is O(2n)). Big-O, as you say, deals with the algorithm when the factor in question goes to infinity. In this case, the "2" vs. "3" isn't going to be an issue. Big-O is meant to compare the order of an algorithm's performance, as opposed to the details.

The rule of thumb, when I was taught Big-O, was to use 1000000000 for n - at which point it becomes apparent what the dominant factor is. Big-O isn't meant for comparing 2000000000 [O(2n)] to 3000000000 [O(3n)] - it's meant for comparing 2000000000 [O(2n)] to 1000000000000000000 [O(n^2)]. The other one we were taught is that O(n) is really O(kn), where k is a constant that we supress since it doesn't matter when making comparisons of this nature.

Jass said...

A very good explanation of Big O in plain english!

P.S: Been following your posts from a couple months now... "I'm lovin it" ;)

arachnode.net said...

Great article! Could you go into more detail about calculating the 'log'...

Anonymous said...

That is an excellent article! I wish I'd read it yesterday, might not have screwed up my job interview this afternoon so badly :(

Thanks for writing it.

NonComplacentLabRat said...

I'm a self taught programmer who's still having some trouble with what O(log n) means. I understand logaritms (e.g. log base 2 1024 = 10), but what does the n mean in O(log n)? And what does the O mean. What is the base? 2? I think if the base is 2 then O means "the number of times you need to raise n to the power of 2 to get O?".

But I also think I'm totally wrong and I'm thinking too logically. Does O(log n) represent a concept as opposed to a formula? For instance does O(log n) mean that the worst case scenario is n times? and O(log n squared) mean the worst case scenario is n squared?

Please help!

William Shields said...

Searching for a name in the telephone book (example above) is a classic O(log n) algorithm.

1 name = 1 comparison
2-3 names = 2 comparisons
4-7 names = 3 comparisons
8-15 names = 4 comparisons
16-31 names = 5 comparisons
...
1,000,000 names = 21 comparisons

Logarithms are just the inverse of exponents ("to the power of").

2^20 = 1,048,576
log2 1,048,576 = 20

or

loga(a^n) = n

n in any O(...) expression is the meaningful quantifier and varies depending on the algorithm. For sorting it is typically the number of elements. For multiplication it is the number of digits.

Base isn't specified with O(log n). Whether it's log2 n, log10 n or log100 n is irrelevant just like it's irrelevant if its 2n, 10n or 100n. Mathematically there is a linear relationship between loga n and logb n.

Big O generally refers to expected or worst cases.

Anonymous said...

"So the Big-O of the Travelling Salesman problem is O(n!) or factorial or combinatorial complexity."
I'm not clear what this means. Surely problems don't have complexities, only algorithms?

Also you give "Public Key Cryptography" is given as an algorithm of polynomial complexity. I thought it was unknown what the complexity of integer factorisation was.

Anonymous said...

Well its good explanation but i would like to know what does the Big-O actually do.I mean say we have 2 expressions O(n^2) and O(n^2logn)......how do we compare them and know which is better

Post a Comment