Category: Algorithms

What is a Hash Function? It Depends on Who’s Speaking

Hash functions are one of those wonderful mathematical objects that have a ton of use cases. You can use them to check whether a download failed due to a random error, or to build fast data structures (including the count-min sketch, the simplest data structure that most people haven’t heard of), or as a basis for cryptocurrencies.

Having taught a combination of courses, some more programming-oriented and some more theory-oriented, I’ve found that the term “hash function” and what’s expected of it varies a lot by who’s doing the talking. The idea of a hash code to a veteran Java programmer bears little resemblance to the families of hash functions that theoreticians would talk about, which in turn bear little resemblance to the cryptographic hash functions that computer security folks spend so much time working on. This can make it challenging for programming students to transition to learning about the theory behind hash tables or other data structures, as I learned the hard way the first time I taught algorithms.

This post is my attempt to talk about the different expectations of how hash functions work, or are expected to work, across different areas of computer science and software engineering. My hope is that if you’re a practicing programmer starting to study algorithms, data structures, or cryptography, this post gives you a sense of what the expectations are when you dive into those fields.

Hash Functions: Programmer Edition

If you’re a practicing programmer, you probably first encountered hash functions in the context of hash tables, like the Java HashMap, the C++ std::unordered_map, or the Python dict type. You may understand how these data structures work, or you might just know that “they’re fast” and “they need a hash code.”

From a programming perspective, the most common times you’ll see hash functions are when you have to write them yourself. For example, if you’re a Java programmer, you probably learned to write a hash code like this:

@Override public int hashCode() {
    int result = myFirstField.hashCode();
    result = 31 * result + mySecondField.hashCode();
    result = 31 * result + myPenultimateField.hashCode();
    result = 31 * result + myUltimateField.hashCode();
    return result;
}

You’ve probably heard that hash codes are supposed to work so that different objects generally give different hash codes, and that equal objects have to have equal hash codes.
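Incidentally, since Java 7 the same recipe is often written with the convenience method java.util.Objects.hash, which combines its arguments with essentially the same 31-based scheme (the exact values it produces differ slightly from the handwritten version). A minimal sketch, reusing the made-up field names from above:

@Override public int hashCode() {
    // Objects.hash null-checks each field and folds the pieces together for you.
    return java.util.Objects.hash(myFirstField, mySecondField,
                                  myPenultimateField, myUltimateField);
}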

If you’ve implemented a hash table, perhaps as part of an intro class, or perhaps because you just thought it would be fun, you’ve had to be a client of a hash function. You maintained some kind of array of slots and determined which slot or slots to look for an entry by computing its hash code to get back some value (probably a 32-bit or 64-bit value), modding it by the table size, and then looking in that spot.
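In code, that client-side step usually looks something like the sketch below (key, table, and Entry are hypothetical names; Math.floorMod keeps the index nonnegative even when the hash code is negative):

int slot = Math.floorMod(key.hashCode(), table.length);   // hash, then mod by the table size
Entry bucket = table[slot];                                // then search this slot (or probe from it) for the key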

Then there’s the question of what you likely prioritize in a good hash function. You probably care about things like how well it spreads things out (low spread is bad for hash tables, high spread is good for hash tables) and how fast it runs (the faster, the better!).

To summarize, the key assumptions are that

  • each object has a hash code, which you usually implement by doing something involving multiplying by 31 and adding values together;
  • equal objects should have equal hash codes, and unequal objects should try to have unequal hash codes;
  • hash codes are usually 32-bit or 64-bit integers that get modded by some kind of table size;
  • there’s a bunch of “other” hash functions you can use to get a fingerprint that helps you check if you downloaded something properly; and
  • it’s good to write hash functions that spread things out and run quickly.

Hash Functions: The Data Structure Theory Perspective

If you’re a researcher studying algorithms or data structures, you have a pretty different point of view about what a hash function is and how it should behave.

Mathematically speaking, a hash function h is typically modeled as a mathematical function from some set of possible inputs (often called the key space or the universe) to a set of outputs, which is usually a range of integer values. For example, if you were building a hash table that had, say, 100 slots in it and you wanted to store integers in that hash table, you would want a hash function that took in an integer and then outputted an index between 0 and 99, inclusive. Mathematically, we’d model this as a function h : \mathbb{Z} \to \{0, 1, 2, ..., 99\}.
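For instance, the rule “map each integer x to x \bmod 100” is one perfectly legitimate hash function of this type, and so is any other rule that assigns every integer a slot number between 0 and 99.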

A few things might jump out here. First, we no longer speak of individual objects as having “a” hash code. That is, there’s no longer the idea that every object has exactly one hash code as some sort of intrinsic property. Instead, we think of the objects that we’re hashing as existing on their own, and then we come from the outside with some hash function h that suits our needs. In other words:

If you are a software engineer, you probably think of every object having a hash code. Hash codes are an intrinsic property of objects.

If you are a theoretician, you probably think of there being all sorts of possible hash functions that we could use for a group of objects. We can define lots of different possible hash functions for a given collection of objects. A hash function is an extrinsic object that gets applied to objects.

There’s another detail here. In The Real World, where there’s Object.hashCode(), or std::hash<T>, you have one hash function that needs to work for any size of hash table you might want to toss an object into. This means that hash functions typically output some sort of 32-bit integer or 64-bit integer, or more generally something big enough that no matter how large of a hash table you make, your hash function can provide outputs that can span the whole range.

But in Theoryland, it’s far more common to think of hash functions as being specialized for a particular size of table. If I want a hash function that I’ll use in a string hash table that has 137 slots, I’ll model it as a function that takes in a string and outputs a number between 0 and 136, inclusive. (If you’ve seen the notation \Sigma^* from formal language theory, that means I’d think of my hash function as h : \Sigma^* \to \{0, 1, 2, ..., 136\}.) If I then resize my hash table to have 237 slots, I would model it mathematically as though I’d picked a brand new hash function h' that takes in strings and outputs a number in the range \{0, 1, 2, ..., 236\} (with formal notation, that would be something like h' : \Sigma^* \to \{0, 1, 2, ..., 236\}). For those of you with a discrete mathematics background, the terminology here is that the codomain of the hash function is a formal part of the specification of the hash function in the first place.

Because hash functions in Theoryland are external to objects, rather than intrinsic to them, it’s not a problem if we decide to completely switch the hash function we’re using. In an actual programming language like Java or C++, though, the idea of switching hash codes is hard to model, since it’s not like we can just swap out std::hash<T> or hashCode for some other implementation.

In other words:

In software engineering, hash codes usually output very large values that then get modded down to the size that’s needed. If your table grows in size, you use the same hash code and just change what you mod by.

If you’re a theoretician, you model a hash function producing one of only a few possible values by building the set of possible outputs into the definition of your hash function. If you have a change in a hash table size, you pretend you just grabbed a brand new hash function with a different codomain.

These two differences – that hash functions are external to objects, and that you can just grab new hash functions with different codomains whenever you need to – can come out of nowhere if you’re first learning how to analyze algorithms or data structures, because a professor who’s been working in Theoryland might treat this as so fundamental that it’s not worth pausing to comment on.

But things get a bit weirder from here, especially if you go on to study algorithms or data structures more formally.

Let’s imagine you have a hash function that takes as input some kind of object (say, widgets) and outputs a natural number between 0 and n - 1, inclusive. Let’s have U be the set of all possible widgets. I’m going to assume that there are a lot of different widgets out there, say, way, way more than n of them. In that case, if you know in advance what hash function h : U \to \{0, 1, 2, ..., n - 1\} is going to be used in some hash table, you could envision what would happen if you hashed every single possible widget and looked at what hash code you got back. The handy pigeonhole principle tells you that there would always be some collection of at least \frac{|U|}{n} objects that have the same hash code.
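(To put some numbers on that: if there are 2^{64} possible widgets and n = 1000 slots, then some group of at least \frac{2^{64}}{1000} \approx 1.8 \times 10^{16} widgets all hash to the same slot.)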

If you’re a theoretician, this means that you have to worry about the case where your hash table happens to get \frac{|U|}{n} elements dumped into it that all have the same hash code. It doesn’t matter what hash function you pick – there are simply too many widgets to distribute them in a way that doesn’t have a lot of collisions.

If you’re a Nefarious Scoundrel, you might have this same idea and reach the conclusion that you could break hash tables by doing some preprocessing to figure out what sorts of elements collide with one another, then dump them all into some web server’s internal hash tables at once, degrading performance. And congrats, you’ve just invented the HashDoS attack, and this is why we can’t have nice things.
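If you want to see how little preprocessing this can take, here’s a tiny, well-known example using Java’s actual String.hashCode: the strings “Aa” and “BB” collide, and because they’re equal-length blocks with equal hash codes, every way of concatenating them collides too, giving you exponentially many colliding keys essentially for free.

public class CollisionDemo {
    public static void main(String[] args) {
        // Real collisions in java.lang.String.hashCode().
        System.out.println("Aa".hashCode());                          // 2112
        System.out.println("BB".hashCode());                          // 2112
        System.out.println("AaAa".hashCode() == "BBBB".hashCode());   // true
        System.out.println("AaBB".hashCode() == "BBAa".hashCode());   // true
    }
}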

From a Theoryland perspective, given that there’s this fundamental weakness in hash functions, it makes sense to try to search for some way to build hash tables that can’t get clobbered in this fashion. And this is where we can use the nice idea that hash functions are external to objects rather than an intrinsic property. What if, when we build a hash table, we choose the hash function to use out of a pool of potential candidates? From the perspective of our Nefarious Scoundrel, this makes attacking our hash tables a lot harder, since no matter how much up-front work they do to try to figure out a collection of bad inputs, there’s a chance that we’ll pick a function out of our pool that prevents this attack.

You often see terms like “universal family of hash functions” or “family of 2-independent hash functions” tossed around in Theoryland when discussing hash tables for this precise reason. When we’re reasoning about properties of hash tables, we often imagine that we have a pool of hash functions to pick from (which we’d more commonly call a family of hash functions, just because it sounds nicer) and we choose one such function at random to use in a hash table.
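To make that concrete, here’s a minimal sketch of one classic universal family (the Carter-Wegman construction h_{a,b}(x) = ((ax + b) \bmod p) \bmod m, where p is a prime larger than any key and a and b are drawn at random when the table is built). The class and method names here are mine, not from any particular library:

import java.util.Random;

final class RandomlyChosenHash {
    private static final long P = 2_147_483_647L;   // the prime 2^31 - 1
    private final long a, b;                         // drawn once, at random
    private final int m;                             // table size: the codomain is {0, 1, ..., m - 1}

    RandomlyChosenHash(int m, Random rng) {
        this.m = m;
        this.a = 1 + Math.floorMod(rng.nextLong(), P - 1);   // a in {1, ..., p - 1}
        this.b = Math.floorMod(rng.nextLong(), P);           // b in {0, ..., p - 1}
    }

    int hash(int key) {
        long x = Math.floorMod((long) key, P);               // treat the key as an element of {0, ..., p - 1}
        return (int) ((a * x + b) % P % m);
    }
}

A hash table built on top of something like this re-draws a and b whenever it wants a fresh hash function (say, when it resizes), which is exactly the Theoryland picture of grabbing a new function from the family at random.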

This gets at a very large gap between the theory of hash functions and what software engineers likely encounter:

If you’re a software engineer, you think of each object as having “a” hash code. You might also think of switching between different hash functions like MurmurHash, SAX, etc., but only in specialized circumstances or for performance reasons.

If you’re a theoretician working on algorithms and data structures, you often think of having large families of hash functions to pick from, and design data structures on the assumption that you can choose from these hash functions at random if necessary.

That being said, there isn’t a complete disconnect between theory and practice here. After enough Nefarious Scoundrels started crafting HashDoS messages to web servers to slow them down, many language implementations (like the JVM) started adding some randomness to how they hash fundamental types like integers, strings, and arrays. This meant that primitive types would not produce the same hash codes from program run to program run, which was less than ideal if you liked precomputing hashes and storing them on disk, but made it much harder to HashDoS a website.

Because the hash functions used in algorithm and data structure design typically look so different than the ones that are usually talked about at the language level, the qualities that are optimized over look quite different. Families of hash functions in Theoryland are usually categorized by the strength of the assumptions you can make about them. If you choose a random hash function from a family and evaluate it on a single input, can you predict with better than random chance what it will do on another input? If not, you have a 2-independent family of hash functions. If you pick a random hash function, can you say that any two objects, no matter how they’re chosen, have a low collision probability? If so, you have a universal family of hash functions.
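Written out a bit more formally (these are the standard definitions, with \mathcal{H} a family of functions from U to \{0, 1, ..., m - 1\} and h drawn uniformly at random from \mathcal{H}):

\mathcal{H} is universal if for any two distinct x, y \in U, we have \Pr[h(x) = h(y)] \le \frac{1}{m}.

\mathcal{H} is 2-independent if for any two distinct x, y \in U and any outputs i, j \in \{0, 1, ..., m - 1\}, we have \Pr[h(x) = i \text{ and } h(y) = j] = \frac{1}{m^2}.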

To summarize, in the context of algorithms and data structures:

  • hash functions are external to the objects they work on and are not intrinsic properties;
  • hash functions typically have their codomain specified in a way that bakes in the size of the range that they’re mapping to;
  • hash functions are typically grouped into families, with the idea that a hash function will be picked out of the family at random; and
  • the “power” of a hash function is usually measured in terms of statistical guarantees and precise inequalities bounding collision probabilities, rather than in terms of speed or a more flexible idea of “well-distributed.”

Things Not Covered Here

This is, of course, a very quick and probably somewhat unfair characterization of what hash functions look like in programming practice and in the theory of algorithms and data structures. I didn’t talk about cryptographic hash functions, which get used all the time in both theory and practice. I didn’t address clever custom hash functions like locality-sensitive hash functions or the hash functions you might use in derandomization.

My intent is to help make it easier for people used to writing code and worrying about performance to be able to read about the theory behind why things work, and hopefully to learn a thing or two in the process of doing so. I think it would be great if more people knew about count-min sketches, or more clever hash tables like the FKS perfect hashing scheme, or using LSH to speed up finding related items in a data set.


Negafibonacci Representations

Today I found this amazing question about representing numbers in negafibonacci over on Stack Overflow. The negafibonacci number system represents each integer as a sum of negafibonacci numbers F_{-n} = (-1)^{n+1} F_n, using only the digits 0 and 1. I played around with the problem a bit and found a really cool algorithm that efficiently converts numbers into negafibonacci by subdividing the integers into different ranges bounded by Fibonacci numbers and using those ranges to decide which bits to include.
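I won’t spoil the range-based algorithm here, but if you just want to see the representation in action, a brute-force sketch like the one below (my own throwaway code, not the efficient conversion described above) finds the negafibonacci digits of a small integer by trying every short bit pattern with no two adjacent 1s:

public class NegafibDemo {
    // Returns the digits b_K ... b_2 b_1 with n = sum over k of b_k * F_{-k},
    // where F_{-k} = (-1)^{k+1} F_k, found by exhaustive search over small patterns.
    static String toNegafibonacci(long n) {
        final int K = 16;                              // enough digits for small inputs
        long[] fib = new long[K + 1];
        fib[1] = 1; fib[2] = 1;
        for (int k = 3; k <= K; k++) fib[k] = fib[k - 1] + fib[k - 2];

        for (long mask = 0; mask < (1L << K); mask++) {
            if ((mask & (mask >> 1)) != 0) continue;   // skip patterns with adjacent 1s
            long sum = 0;
            for (int k = 1; k <= K; k++)
                if (((mask >> (k - 1)) & 1) == 1)
                    sum += (k % 2 == 1) ? fib[k] : -fib[k];   // add F_{-k}
            if (sum == n) {
                StringBuilder digits = new StringBuilder();
                for (int k = K; k >= 1; k--) digits.append((mask >> (k - 1)) & 1);
                return digits.toString().replaceFirst("^0+(?=.)", "");
            }
        }
        return "out of range for " + K + " digits";
    }

    public static void main(String[] args) {
        System.out.println(toNegafibonacci(-11));   // 101000, since -11 = F_{-6} + F_{-4} = (-8) + (-3)
        System.out.println(toNegafibonacci(12));    // 1000010, since 12 = F_{-7} + F_{-2} = 13 + (-1)
    }
}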

Nuclear Reactors and the Master Theorem

The Master Theorem is a lovely little theorem for solving the sorts of recurrences that often come up in divide-and-conquer algorithms. I’ve seen a bunch of different versions of this theorem over the years. Some are simpler to state but less powerful, while others are extremely expressive but can be a bit intimidating for students learning it for the first time. When I last taught CS161, I used this version of the Master Theorem:

Theorem: Let T(n) be a recurrence defined by T(n) \le aT(\frac{n}{b}) + cn^d. Then

  • if \log_b a < d, then T(n) = O(n^d).
  • if \log_b a = d, then T(n) = O(n^d \log n).
  • if \log_b a > d, then T(n) = O(n^{\log_b a}).
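(As a quick sanity check: mergesort’s recurrence T(n) \le 2T(\frac{n}{2}) + cn has a = 2, b = 2, and d = 1, so \log_b a = 1 = d and the second case gives T(n) = O(n \log n), exactly as expected.)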

When I first saw this theorem as an undergrad, I remember thinking two things. First, “Wow! It’s amazing that this just happens to work out this way.” Second, “Where on earth did this come from?” I had an amazing algorithms instructor (Tim Roughgarden) who showed a derivation of this result and worked through the intuition behind it. Even so, I remember thinking that the math behind the formal proof of this theorem was surprisingly tricky and very quickly obscured the intuition.

When it was my turn to teach CS161, I gave a presentation similar to the one I’d seen from Tim, which focused on the intuition behind the three cases, followed up with a quick mathematical derivation of each case. I based the math on what I’d read in CLRS, and I remember thinking that it was a complete pain to get everything to work out correctly.

About a year ago I was talking with a former student of mine (Roslyn Cyrus, whose blog has some very cool descriptions of how to model, represent, and solve puzzle games) about a recurrence relation she’d come across that didn’t quite fit the Master Theorem, and in the course of working through it together I came across a different way of presenting the proof of the Master Theorem that, I believe, is much easier than the one in CLRS, makes the intuition behind the theorem much clearer (and therefore more generalizable), and avoids a lot of the denser algebraic manipulations that typically go along with the theorem. I thought I’d share this approach in case anyone feels like adopting it in their own theory classes (or in case any future algorithms students want to get a better sense of where this result comes from!)

Rather than taking on the Master Theorem directly, I’d like to start with a different example.

Let’s talk nuclear reactors.

Nuclear reactors are cool. I say this as someone who enjoys visiting decommissioned nuclear plants and taking photos of old cooling towers. I have a very weak grasp of the physics behind how they work, but at a very high level, a nuclear reactor works like this. You have a bunch of atoms that will break apart if you hit them with a neutron. When those atoms break apart, they release a bunch of energy, and then they kick off more neutrons, which in turn break apart more atoms, which in turn kick off more neutrons, etc. By extracting the energy that’s given off when the atoms break apart, the reactor generates electricity, and by very closely controlling how many of the neutrons are produced at each point in time, the reactor makes sure it doesn’t blow itself apart.

Let’s imagine that you’re operating a nuclear reactor and you want to figure out how much total power your reactor is going to produce over some period of time. In our simplified model, you can figure this out purely from three quantities. First, when an atom splits, how many other atoms will end up splitting as a direct result of the neutrons that atom kicks off? It might be that every time you split an atom, it fires off two neutrons, which then go on to cause two more atoms to split. Or perhaps each atom, on average, releases 1.3 neutrons, causing (on average) 1.3 more atoms to split. Or maybe, on average, each atom releases 0.7 neutrons, causing (on average) 0.7 more atoms to split. Let’s denote the average number of new atoms that will split based on a single fission event to be \alpha. Intuitively, if \alpha = 1, the reaction will continue at the same rate over time, since each time an atom splits it triggers exactly one future atom to “replace” it and split again. If \alpha < 1, the reaction will slow down and eventually die out, since each fission event doesn’t produce enough new fission events to “replace it.” And, if \alpha > 1, then the reaction will grow over time and produce more and more power, since each fission event ends up triggering even more future fission events, which trigger more future fission events, etc. (That last case is generally considered desirable if you’re designing a nuclear weapon and should be treated with care if you’re designing a nuclear power plant.)

(Fun fact: the actual way nuclear reactors work is more complicated than this due to the fact that there are both prompt neutrons generated directly from nuclear fuel fission and delayed neutrons generated from the decay of older fission byproducts. But in this simplified model, we’ll ignore that!)

The next quantity you’d probably want to know is how much power your reactor is initially generating. As a simplifying assumption, we’ll assume that the amount of power generated is directly proportional to the number of fission events occurring per unit time. (In reality it’s more complicated. I don’t know exactly how it’s more complicated, but I’d be surprised if it were really this simple.) We’ll imagine that the reactor’s initial power output at the time you start monitoring it is denoted by P. If P is large, it means that the reactor is initially producing a lot of power, and if P is small it means that the reactor is initially producing very little power.

Finally, there’s the question of how long you let the reactor run. The longer you let the reactor run, the more power you’ll generate. Here we’re going to make a very aggressive simplifying assumption and assume that the reactor works in “stages,” where one “stage” corresponds to a bunch of atoms splitting, the next “stage” corresponds to the next round of atoms splitting, etc. Obviously, this isn’t how real reactors work, but it makes the math a bit easier. Let’s assume that the total number of “stages” we let the reactor run for is given by T, with T standing for “time.”

So we now have three parameters: \alpha, which tells us whether the reaction grows, shrinks, or remains stable over time; P, which tells us the initial level of activity in the reactor; and T, which tells us how long the reaction will go for. Given these parameters, how much total power ends up getting produced?

Well, let’s think about this one step at a time. In the first “stage” of our nuclear reaction, we’ll produce P total power. Each one of the fission events will trigger \alpha new fission events, and since the power generated is proportional to the number of fission events, in the next stage we’ll produce \alpha P total power. Depending on what value we have for \alpha, that might be greater than, less than, or equal to the amount of power we generated in the first stage. Each of the fission events in that second stage will in turn trigger \alpha more fission events in the third stage, and because the power generated is proportional to the number of fission events, in the third stage we’ll have \alpha^2 P total power generated. Similarly, in the fourth stage we’ll have \alpha^3 P total power generated, in the fifth stage we’ll have \alpha^4 P total power generated, etc. In other words, the power generated per stage grows or shrinks geometrically with the number of stages, with ratio \alpha.

So how much total power ends up getting generated over T time steps? Well, that’s going to be the sum of the power drawn at each step, which is given by

P + \alpha P + \alpha^2 P + \alpha^3 P + ... + \alpha^{T-1}P = \sum_{i=0}^{T-1}{\alpha^i P} = P\sum_{i=0}^{T-1}{\alpha^i}.

We now have a nice summation in terms of our three parameters – \alpha, P, and T – that tells us how much total power is going to be produced. So now we can ask – how exactly do you simplify that summation?

The good news is that this is a famous summation – it’s the sum of a geometric series! This is one of those sums that I think everyone at some point learned and then promptly forgot – I’ll admit that until I started routinely teaching CS theory I had completely forgotten it. There are two cases that arise for this sum. First, if \alpha = 1, then notice that \alpha^i = 1 for any choice of i, and so this sum is relatively easy to evaluate:

P\sum_{i=0}^{T-1}{\alpha^i} = P\sum_{i=0}^{T-1}1 = PT.

In other words, the total power generated is equal to the initial power P times the total number of time steps T, which makes sense because the total number of fission events isn’t changing as a function of time.

On the other hand, if \alpha \ne 1, then there are two possibilities. The first possibility is that \alpha < 1, in which case the amount of power generated at each stage in the reactor is lower than the previous stage (it’s decaying geometrically). In that case, using the formula for the sum of an infinite geometric series, we have the nice, handy result that

P\sum_{i=0}^{T-1}{\alpha^i} < P\sum_{i=0}^{\infty}{\alpha^i} = P\frac{1}{1 - \alpha}.

In other words, the total power generated will be at most some constant multiple of the original power, and the specific constant multiple depends on the choice of \alpha.

In the last case, where \alpha > 1, we have this lovely formula at our disposal:

P\sum_{i=0}^{T-1}{\alpha^i} = P\frac{\alpha^T - 1}{\alpha - 1}

That formula is, um, a bit dense. For the purposes of where we’re going, I think it’s a bit easier to see what’s going on if we rewrite it like this:

P\sum_{i=0}^{T-1}{\alpha^i} = P(\alpha^T - 1)\frac{1}{\alpha - 1} < P\alpha^{T-1} \frac{\alpha}{\alpha - 1}

In other words, the total power drawn is exponential in the number of time steps T, with the base of the exponent depending on \alpha and an extra constant factor that also depends on \alpha.

So what’s the net result of all of this mathematical gymnastics? Although we’ve framed this in the context of nuclear reactions, really, the above math applies equally well to any process involving some quantity that changes exponentially as a function of time. In fact, we can summarize what we’ve seen in this nice, handy way:

Theorem: Let \alpha be a growth parameter, P be an initial value, and T be a number of time steps. Then

  • if \alpha < 1, then P \sum_{i=0}^{T-1}{\alpha^i} < P\frac{1}{1 - \alpha}.
  • if \alpha = 1, then P \sum_{i=0}^{T-1}{\alpha^i} = PT.
  • if \alpha > 1, then P \sum_{i=0}^{T-1}{\alpha^i} <P\alpha^{T-1} \frac{\alpha}{\alpha - 1}.
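If you’d like to convince yourself of those bounds numerically, here’s a tiny sketch (the specific parameter values are just made-up examples):

// Sums P + alpha*P + ... + alpha^(T-1)*P directly, so we can compare
// the result against the closed-form bounds from the theorem above.
static double totalPower(double alpha, double P, int T) {
    double sum = 0, term = P;
    for (int i = 0; i < T; i++) { sum += term; term *= alpha; }
    return sum;
}

// totalPower(0.5, 10, 20) = 19.99998...,  below   10 * 1 / (1 - 0.5)        = 20
// totalPower(1.0, 10, 20) = 200.0,        exactly 10 * 20                   = 200
// totalPower(2.0, 10, 20) = 10485750.0,   below   10 * 2^19 * (2 / (2 - 1)) = 10485760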

The general framework of this theorem – some initial parameters, plus a three-way case split that gives three different inequalities bounding some overall quantity – looks a lot like the setup of the Master Theorem, and that’s not a coincidence. In fact, the Master Theorem is essentially what you get if you work out how the work done in a recurrence relation grows or shrinks as a function of time and then just plug the appropriate quantities back into the above theorem!

To pull everything together, let’s return back to the Master Theorem. The setup for the Master Theorem assumes that we have some recurrence relation T(n) \le aT(\frac{n}{b}) + cn^d. This corresponds to the work done by a recursive function that

  • does O(n^d) work at each step, then
  • makes a recursive calls in total,
  • each of which is on an input of size \frac{n}{b}.

The question then is the following: if you call this recursive function on an input of size n, how much total work ends up being done? In other words, what’s a good, reasonable bound on T(n)?

To figure that out, it often helps to visualize exactly what this recursive call tree would look like using a recursion tree. With the recurrence we have up above, the recursion tree would look something like this:

[Figure: the recursion tree for T(n) \le aT(\frac{n}{b}) + cn^d, with one call of size n at the top, a calls of size \frac{n}{b} below it, a^2 calls of size \frac{n}{b^2} below that, and so on.]

Here, our top-level recursive call (on an input of size n) fires off a recursive calls, each of which is on an input of size \frac{n}{b}. Each of those recursive calls then fires off a recursive calls of its own, each of which is of size \frac{n}{b^2}, and though it’s not shown, those calls each fire off calls of size \frac{n}{b^3}, which fire off calls of size \frac{n}{b^4}, etc., until eventually we hit our base case.

Although we haven’t discussed base cases here, let’s assume that our base case occurs when we get the input down to size 1. Since when we’re k layers deep into our recursion the input size is \frac{n}{b^k}, this means that the recursion terminates when k = \log_b n, so the number of layers in our tree will be \log_b n + 1.
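(For example, with n = 64 and b = 2, the input sizes are 64, 32, 16, 8, 4, 2, 1, which is \log_2 64 + 1 = 7 layers.)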

If we want to sum up how much total work is done across all the recursive calls, we could sum up how much work is done at each individual call, but that turns out to be a bit tricky. Rather, we’ll sum up how much work is done at each individual layer in the tree. (If you haven’t seen this before, this is a standard technique that comes up when manipulating recursion trees.)

In the first layer, we have a single recursive call of size n, and since our recurrence relation is T(n) \le aT(\frac{n}{b}) + cn^d, the work done purely in that call is cn^d.

In the second layer, we have a recursive calls made. Each call is on an input of size \frac{n}{b}. This means that each individual recursive call does c(\frac{n}{b})^d = c\frac{n^d}{b^d} = cn^d\frac{1}{b^d} work. Across all a calls, that works out to a total of cn^d\frac{a}{b^d} work.

How about the layer after that? Well, there are a^2 recursive calls here (each of the a calls in the layer above made a calls) and each call is on an input of size \frac{n}{b^2}, since each call in the layer above us was on an input of size \frac{n}{b} and each call shrinks the input size by a factor of b. The work done by each individual call here is c(\frac{n}{b^2})^d = c(\frac{n^d}{b^{2d}}) = cn^d\frac{1}{b^{2d}}, and summed up across all a^2 calls on this row, the work done is cn^d \frac{a^2}{b^{2d}}.

It seems like there’s a very interesting trend here. The first layer did cn^d work. The second layer did cn^d\frac{a}{b^d} work. The third layer did cn^d\frac{a^2}{b^{2d}} work. It turns out that using similar logic the fourth layer ends up doing cn^d\frac{a^3}{b^{3d}} work, the fifth layer does cn^d\frac{a^4}{b^{4d}} work, etc. Notice that the work changes geometrically between levels – each level does \frac{a}{b^d} times as much work as the previous one. We also know how many layers in the tree there are – there are \log_b n + 1 of them – and we know how much work is done at the top level (it’s cn^d). In other words, the work done here is pretty much exactly equal to the work done by our nuclear reactor, with

  • \alpha = \frac{a}{b^d},
  • P = cn^d, and
  • T = \log_b n + 1.

So all that’s left to do now is to plug these quantities into each of the three cases of the earlier theorem and do some algebra!

Our first case is where \frac{a}{b^d} < 1. That’s the exponential decay case – it means that the total work done decays from layer to layer, like in our nuclear reactor where each fission doesn’t trigger enough fissions to replace it. In that case, the sum of the work across the layers of the tree is given by P\frac{1}{1 - \alpha}. Plugging in P = cn^d gives us that the total work here is cn^d\frac{1}{1 - \alpha}, and since in this case \alpha is a fixed constant that doesn’t depend on n, the total work in this case is O(n^d). So if \frac{a}{b^d} < 1, we get that the recurrence solves to O(n^d).

Our second case is where \frac{a}{b^d} = 1. That’s the stable case – it means that the work done in each layer of the recursion tree is the same, and it corresponds to the case in our nuclear reactor where each fission event replaces itself in the next iteration. In that case, the work done is PT. Plugging in our values, we get that the total work done here is cn^d(\log_b n + 1) = O(n^d \log n).

The last case is where \frac{a}{b^d} > 1. That’s the exponential growth case – it means that the work done in each layer of the tree increases as you go down the tree, and it corresponds to the case in our nuclear reactor where the reactor power keeps on growing as more and more fissions occur. In this case, the work done is bounded by P \alpha^{T-1} \frac{\alpha}{\alpha - 1}. If we plug in our values of P, \alpha, and T and simplify, we get that the total work done is bounded as follows:

= cn^d \alpha^{\log_b n}\frac{\alpha}{\alpha - 1}

= cn^d (\frac{a}{b^d})^{\log_b n}\frac{\alpha}{\alpha - 1}

= cn^d (\frac{a^{\log_b n}}{b^{d \log_b n}})\frac{\alpha}{\alpha - 1}

= cn^d (\frac{a^{\log_b n}}{b^{\log_b {n^d}}})\frac{\alpha}{\alpha - 1}

= cn^d (\frac{a^{\log_b n}}{n^d})\frac{\alpha}{\alpha - 1}

= ca^{\log_b n}\frac{\alpha}{\alpha - 1}

That looks horrible – I’m really sorry! – but it’s about to get a lot better. Using the useful but not particularly well known equality that a^{\log_b n} = n^{\log_b a}, we get that this whole thing simplifies to

cn^{\log_b a}\frac{\alpha}{\alpha - 1}

and since \alpha is a constant with respect to n, this whole messy expression ends up being O(n^{\log_b a}).

Summarizing the three cases we found:

  • If \frac{a}{b^d} < 1, then T(n) = O(n^d).
  • If \frac{a}{b^d} = 1, then T(n) = O(n^d \log n).
  • If \frac{a}{b^d} > 1, then T(n) = O(n^{\log_b a}).

Or, if you think that fraction involving a, b, and d is too messy, we can rewrite the same thing with logarithms:

  • If \log_b a < d, then T(n) = O(n^d).
  • If \log_b a = d, then T(n) = O(n^d \log n).
  • If \log_b a > d, then T(n) = O(n^{\log_b a}).

Et voila. We’ve derived the Master Theorem!
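If you like seeing theorems turned into code, here’s a small sketch that applies the three cases mechanically (the method name is mine, and the exact floating-point equality check is only good enough for illustration):

// Given T(n) <= a T(n/b) + c n^d, report which case of the Master Theorem
// applies and the resulting bound.
static String masterTheorem(double a, double b, double d) {
    double critical = Math.log(a) / Math.log(b);   // this is log_b a
    if (critical < d)  return "O(n^" + d + ")";
    if (critical == d) return "O(n^" + d + " log n)";
    return "O(n^" + String.format("%.2f", critical) + ")";
}

// masterTheorem(2, 2, 1)  ->  "O(n^1.0 log n)"   (mergesort-style recurrences)
// masterTheorem(1, 2, 0)  ->  "O(n^0.0 log n)"   (binary search)
// masterTheorem(8, 2, 2)  ->  "O(n^3.00)"        (naive recursive matrix multiplication)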

This presentation is lengthy, but I think it hits on the core ideas behind the Master Theorem. First, that the Master Theorem is a theorem about exponential growth, stability, or decay. The three cases in the Master Theorem correspond to whether some key parameter of the recurrence causes a certain behavior to grow exponentially, remain stable over time, or decay exponentially. Second, that the Master Theorem is a theorem about recursion trees. The specific exponential parameter we get here, \frac{a}{b^d}, is found by looking at the recursion tree for the recurrence, going one layer at a time, and summing up the total work that’s being done.

If you understand that these are the key ideas behind the Master Theorem, then you have more than a handy little tool for most recurrence relations – you have a framework for thinking about other recursion trees that you see in the future.

For example, consider the recurrence relation T(n) \le T(\frac{7n}{10}) + T(\frac{n}{5}) + n, which comes up in the analysis of the median-of-medians algorithm for linear-time selection. This recurrence relation doesn’t fit into the framework given by the Master Theorem because the recursive bits don’t have the same size. But that doesn’t mean we can’t take a stab at it. If we think about what that recursion tree is going to look like, we’ll see that at the first level we have a single call of size n, and it does n work. Below that are two calls of size \frac{7n}{10} and \frac{n}{5}, which collectively will do \frac{7n}{10} + \frac{n}{5} = \frac{9n}{10} work. That looks a lot like exponential decay. In fact, it is – if you keep expanding out the tree, you’ll see that the work at each level is at most \frac{9}{10} of the work at the level above it. Using what we know about geometric series, that suggests we’d see the same sort of behavior we’d see in general exponential decay – that the work at the top dominates the rest of the work – and that’s indeed what we find. This recurrence solves to T(n) = O(n). (The initial recurrence that spawned the conversation that led to this presentation of the Master Theorem was T(n) = T(\frac{n}{2}) + T(\frac{n}{4}) + T(\frac{n}{8}) + O(n). Can you do a similar analysis to solve it?)

Or what about my all-time favorite recurrence relation, T(n) = \sqrt{n}T(\sqrt{n}) + n? That doesn’t look anything like what the Master Theorem talks about. But look at what happens if you draw out the recursion tree – each level does n work (do you see why?) – so we’d expect that the work done here should be the work per level times the number of levels. That makes the recurrence solve to T(n) = O(n \log \log n), once you recognize that the number of levels happens to be O(\log \log n).

If I get another chance to teach CS161, I’m thinking about using this presentation – with some better visuals and a lot more exposition in the math – to explain the Master Theorem and how to attack other sorts of related recurrence relations. But in the meantime, I figured I’d get this out in case anyone else was interested in taking a look at it.