Author: welcometotheoryland

# Bin Packing and Health Care

In an earlier post I mentioned that I enjoy reframing classic algorithms questions as real-world problems to motivate student interest. One problem I’ve used as an exercise in recursive backtracking (both on an exam and in a recent assignment) is a version of the bin packing problem framed in terms of scheduling patients in a hospital. The problem is described like this:

You are working at a hospital trying to coordinate times in which non-emergency patients can meet with their doctors. Each doctor has a maximum number of hours each day that she can see patients. For each patient, you are given how much time she will need to spend with her doctor. Given the amount of time that each doctor is free on a given day, along with the amount of time required for each patient, you are interested in determining whether or not it is possible for every patient to be seen.

I guess that since the doctors have a bunch of patients to see and no fixed plan to see them, they’re Doctors Without Orders!

This problem is essentially the bin packing problem with the modification that each bin might have a different total size. Each doctor forms a bin whose capacity is equal to the number of free hours in the doctor’s schedule, and each patient is an object to place into a bin whose size is equal to the number of hours for which she needs to be seen.

The theming on this problem is quite nice – there’s a clear motivation behind why you might want to find the most efficient way to schedule patients across doctors. Since it’s NP-hard, most greedy solutions will fail to give optimal solutions in some cases (which is a great way to introduce students to some concepts from algorithms design). It also generalizes in a lot of fun ways. Imagine you have a bunch of doctors like these and want to plan a full week’s schedule, or otherwise try to get everyone seen as fast as possible. How would you do that? How might you define a measure of “fairness” in terms of workload across the doctors, then try to maximize fairness while getting everyone seen? What if the doctors are specialists in different subfields and each patient needs to be seen by a doctor with a certain set of skills? And on top of it all, the problem has a really nice recursive backtracking solution: fix an unassigned patient, and for each doctor with time available to see them, try assigning the patient to that doctor.

# Negations, Contrapositives, and Happiness

One of the first skills I teach students in CS103 is how to take the negations of different classes of statements, such as universally-quantified statements, existentially-quantified statements, vacuous truths, and implications. The goal is to help students get comfortable with proofs by contradiction and contrapositive as early as possible.

I accidentally stumbled on this little exercise, which is very effective at forcing students to pause and see how all these concepts interact:

What is the contrapositive of the statement “if someone is happy, then everyone is happy?”

1. If everyone is happy, then someone is happy.
2. If someone is not happy, then everyone is not happy.
3. If everyone is not happy, then someone is not happy.
4. If everyone is happy, then someone is not happy.
5. If someone is not happy, then everyone is happy.

What is the negation of the statement “if someone is happy, then everyone is happy?

1. If someone is not happy, then everyone is not happy.
2. If someone is happy, then someone is not happy.
3. Someone is happy and everyone is not happy.
4. Someone is happy and someone is not happy.
5. Someone is not happy and everyone is not happy.

If you haven’t practiced taking contrapositives and negating formulas recently, it’s worth taking a minute to answer these question for yourself before moving on. (The answers are scattered throughout the rest of this post)

These problems tend to really throw students when they’re getting started and they’re excellent at launching helpful conversations. There are a number of reasons why.

First, the statement at the core of these questions – “if someone is happy, then everyone is happy” – is hard to wrap your head around in a mathematical sense. A plain English reading of this statement leads most people to conclude that it’s totally false. The fact that one person is happy doesn’t cause other people to be happy. But if you interpret this statement in a mathematical sense – as in something given by the first-order logic statement $(\exists x. Happy(x)) \to (\forall x. Happy(x))$ – then the meaning is totally different. That implication is a material conditional, and the truth or falsity of the statement depends on the world in which you evaluate it. So for most students, learning to work with this statement at all involves first recognizing that they’re not dealing with implications or conditionals in the familiar sense of the term.

The second challenge with answering these questions is that they require students to have a fairly good understanding of how to negate universally-quantified statements, existentially-quantified statements, and implications. This is an area I’ve found students to be perennially perplexed by when they’re getting started. Let’s take the contrapositive, for example. Most students can recall independently that

• the negation of a statement of the form $\forall x. P(x)$ is $\exists x. \lnot P(x)$,
• the negation of a statement of the form $\exists x. P(x)$ is $\forall x. \lnot P(x)$, and
• the contrapositive of the implication $A \to B$ is $\lnot B \to \lnot A$,

but they don’t realize how to assemble them together. The majority of students I’ve had will correctly realize that they need to negate the antecedent and consequent and swap their roles, but don’t recognize that negating the statements “someone  is happy” and “everyone is happy” involves negating universally-quantified and existentially-quantified statements.

Then there’s the fact that the correct answer, in some sense, looks wrong. If you negate the antecedent and consequent and swap their order, you get back (in first-order logic) the statement $(\exists x. \lnot Happy(x)) \to (\forall x. \lnot Happy(x))$, which renders in English as “if someone is not happy, then everyone is not happy.” Many students going through a process-of-elimination through the answers will rule this one out because it initially looks like the antecedent and consequent were negated but not swapped.

And finally, there’s a big question that comes up about why, if this really is the contrapositive, it means the same thing as the original statement. Why do the statements “if someone is happy, then everyone is happy” and “if someone is not happy, then everyone is not happy” mean the same thing?

The best way I know to answer that question is to move on to the next part and think about the negation of the statement. Seeing when this statement isn’t true sheds a lot of light on when it is true.

Again, negating this statement gives students a lot of trouble. Most students get this one wrong by choosing the option “if someone is happy, then someone is not happy,” or, in first-order logic, $(\exists x. Happy(x)) \to (\exists x. \lnot Happy(x))$. The issue here is that the negation of an implication of the form $A \to B$ is $A \land \lnot B$, rather than $A \to \lnot B$. The correct answer is “someone is happy and someone is sad.”

So why is that statement the negation of the original one (or its contrapositive)? I like asking students to mull this one over. If you think about it, if you’re in a room where there’s a happy person and a sad person, then the claim “if someone is happy, then everyone is happy” is false, since the sad person disproves it, and the claim “if someone is not happy, then everyone is not happy” is also false because of the happy person. (Most students don’t find that part too hard to get). On the other hand, imagine that you have a room where everyone is happy or everyone is sad. If everyone is happy, then the statement “if someone is happy, then everyone is happy” is trivially true (“trivially” in the sense of “it’s an implication with a true consequent” rather than “obviously,”), and the statement “if someone is not happy, then everyone is not happy” is vacuously true. There’s a similar argument to be made for the case where everyone is sad.

Because this simple statement touches on so many topics – what implications mean, how to take a contrapositive, how to negate universally-quantified and existentially-quantified statements, negating implications, and vacuous truths – I’ve found that it’s an excellent conversation starter. If I see someone struggling with first-order logic, I’ll often ask this question and have them work through it.

So there you have it – a tiny little implication that leads to a ton of interesting edge cases on the boundaries of logic and language. I doubt I’m the first person to have stumbled on this one, and I hope that it finds a use somewhere outside of CS103!

# The Electoral College as Knapsack

One of the challenges in teaching theoretical CS – or at least, the more abstract parts of computer science – is that many of the topics can seem completely divorced from reality. Given that a lot of students take introductory CS courses to build up their programming skills – for interest, because they need them in a research lab, or because they’re looking for a safe and stable career path – it’s just as important to sell the students on the topics as it is to provide a clear and easily understood presentation.

A question I frequently ask myself when I’m teaching a class is whether the material has stakes. Are the motivating examples and questions things that students would consider inherently interesting independent of their technical merits?

About six months ago, I had the pleasure of teaching CS106B, the second course in Stanford’s introductory programming sequence. Topics include recursion, recursive problem-solving, and algorithmic efficiency. This is where students learn to solve problems like “list all the subsets of a set $S$” or “generate all permutations of a given string.” I have to confess that I’m the sort of person who really gets a kick out of problems like these (and likes playing around with ways of using nonstandard number systems to generate them), but the fact that I think it’s cool to generate all partitions of a set doesn’t really matter if I can’t get the students to find the topic interesting as well.

I pretty quickly realized that I’d seen this problem before – this is the 0/1 Knapsack Problem! We can think of each state as an object whose “weight” is the number of popular votes needed to win a majority, and whose “value” is its number of electoral votes. This means that all of the theory built up over the years for the knapsack problem easily transfer over to this new problem of winning the presidency with the fewest number of electoral votes.

I ended up turning this problem into a programming assignment for CS106B. I asked some of my undergraduate section leaders to pull up historical voting data for different presidential elections, and to my astonishment two of them (Nick Bowman and Eli Echt-Wilson) pulled up historical data going back to 1828 by the end of the day. Using some shape data files for US states that I’d downloaded back in 2011 (unfortunately, I cannot remember the source), I whipped up a little visualizer and asked students to code up the back-end logic to take in a list of states (along with popular votes cast and electoral vote totals) and output the minimum collection of states necessary to win the presidency. I gave some high-level guideposts about how to proceed – start off by writing a brute-force recursive backtracking solution, then add memoization – and provided starter code that hooked their answer into the visualizer.

The results were quite interesting. It turns out that the NPR article still had it wrong. Based on voting patterns from 2012, and making some simplifying assumptions (that electors never defect, that states that split their electors instead decide to send them as a bloc, that there’s a majority in the Electoral College, that there aren’t any third-party spoilers, etc.), it turns out that you’d only need around 21.1% of the popular vote to win the presidency. Just win the states highlighted in gold:

The particular states you’d need to win here seem like a pretty eclectic assortment. It turns out it is a good idea to win some big-population states like California and Texas, though you’d forgo others like New York and Florida. I’m curious what sort of platform you’d need to campaign on in order to win these states. 🙂

It’s interesting to run this program over data from different years to see what other trends emerge. You can run it in 1860 versus 1864 to see that, for some reason, in 1864 a bunch of states from the South didn’t cast Electoral College votes. You can run it over a longer period of time to see the addition of new States to the Union and the change in population centers over time.

One of the advantages of framing the knapsack problem this way is that many students were genuinely curious to explore this problem independently of the underlying CS theory. It motivates students to ask questions about how their government (or the government of their temporary host country) operates, its history, and questions of fairness and electoral power. In the end-of-quarter evaluations, many students reported this problem to be one of their favorites from the quarter.

I think there were some rough edges with the way the question was deployed the first time around – it’s a pretty tricky problem to assign to students who are still getting comfortable with recursion – but with a few touchups I think it could have a lot of staying power.

# Nuclear Reactors and the Master Theorem

The Master Theorem is a lovely little theorem for solving the sorts of recurrences that often come up in divide-and-conquer algorithms. I’ve seen a bunch of different versions of this theorem over the years. Some are simpler to state but less powerful, while others are extremely expressive but can be a bit intimidating for students learning it for the first time. When I last taught CS161, I used this version of the Master Theorem:

Theorem: Let $T(n)$ be a recurrence defined by $T(n) \le aT(\frac{n}{b}) + cn^d$. Then

• if $\log_b a < d$, then $T(n) = O(n^d)$.
• if $\log_b a = d$, then $T(n) = O(n^d \log n)$.
• if $\log_b a > d$, then $T(n) = O(n^{\log_b a})$.

When I first saw this theorem as an undergrad, I remember thinking two things. First, “Wow! It’s amazing that this just happens to work out this way.” Second, “Where on earth did this come from?” I had an amazing algorithms instructor (Tim Roughgarden) who showed a derivation of this result and worked through the intuition behind it. Even so, I remember thinking that the math behind the formal proof of this theorem was surprisingly tricky and very quickly obscured in the intuition.

When it was my turn to teach CS161, I gave a presentation similar to the one I’d seen from Tim, which focused on the intuition behind the three cases, followed up with a quick mathematical derivation of each case. I based the math on what I’d read in CLRS, and I remember thinking that it was a complete pain to get everything to work out correctly.

About a year ago I was talking with a former student of mine (Roslyn Cyrus, whose blog has some very cool descriptions of how to model, represent, and solve puzzle games) about a recurrence relation she’d come across that didn’t quite fit the Master Theorem, and in the course of working through it together I came across a different way of presenting the proof of the Master Theorem that, I believe, is much easier than the one in CLRS, makes the intuition behind the theorem much clearer (and therefore more generalizable), and avoids a lot of the denser algebraic manipulations that typically go along with the theorem. I thought I’d share this approach in case anyone feels like adopting it in their own theory classes (or in case any future algorithms students want to get a better sense of where this result comes from!)

Rather than taking on the Master Theorem directly, I’d like to start with a different example.

Let’s talk nuclear reactors.

Nuclear reactors are cool. I say this as someone who enjoys visiting decommissioned nuclear plants and taking photos of old cooling towers. I have a very weak grasp of the physics behind how they work, but at a very high level, a nuclear reactor works like this. You have a bunch of atoms that will break apart if you hit them with a neutron. When those atoms break apart, they release a bunch of energy, and then they kick off more neutrons, which in turn break apart more atoms, which in turn kicks off more neutrons, etc. By extracting the energy that’s given off when the atoms break apart, the reactor generates electricity, and by very closely controlling how many of the neutrons are produced at each point in time, the reactor makes sure it doesn’t blow itself apart.

Let’s imagine that you’re operating a nuclear reactor and you want to figure out how much total power your reactor is going to produce over some period of time. In our simplified model, you can figure this out purely from three quantities. First, when an atom splits, how many other atoms will end up splitting as a direct result of the neutrons that atom kicks off? It might be that every time you split an atom, it fires off two neutrons, which then go on to cause two more atoms to split. Or perhaps each atom, on average, releases 1.3 neutrons, causing (on average) 1.3 more atoms to split. Or maybe, on average, each atom releases 0.7 neutrons, causing (on average) 0.7 more atoms to split. Let’s denote the average number of new atoms that will split based on a single fission event to be $\alpha$. Intuitively, if $\alpha = 1$, the reaction will continue at the same rate over time, since each time an atom splits it triggers exactly one future atom to “replace” it and split again. If $\alpha < 1$, the reaction will slow down and eventually die out, since each fission event doesn’t produce enough new fission events to “replace it.” And, if $\alpha > 1$, then the reaction will grow over time and produce more and more power, since each fission event ends up triggering even more future fission events, which trigger more future fission events, etc. (That last case is generally considered desirable if you’re designing a nuclear weapon and should be treated with care if you’re designing a nuclear power plant.)

(Fun fact: the actual way nuclear reactors work is more complicated than this due to the fact that there are both prompt neutrons generated directly from nuclear fuel fission and delayed neutrons generated from the decay of older fission byproducts. But in this simplified model, we’ll ignore that!)

The next quantity you’d probably want to know is how much power your reactor is initially generating. As a simplifying assumption, we’ll assume that the amount of power generated is directly proportional to the number of fission events occurring per unit time. (In reality it’s more complicated. I don’t know exactly how it’s more complicated, but I’d be surprised if it were really this simple.) We’ll imagine that the reactor’s initial power output at the time you start monitoring it is denoted by $P$. If $P$ is large, it means that the reactor is initially producing a lot of power, and if $P$ is small it means that the reactor is initially producing very little power.

Finally, there’s the question of how long you let the reactor run. The longer you let the reactor run, the more power you’ll generate. Here we’re going to make a very aggressive simplifying assumption and assume that the reactor works in “stages,” where one “stage” corresponds to a bunch of atoms splitting, the next “stage” corresponds to the next round of atoms splitting, etc. Obviously, this isn’t how real reactors work, but it makes the math a bit easier. Let’s assume that the total number of “stages” we let the reactor run for is given by $T$, with $T$ standing for “time.”

So we now have three parameters: $\alpha$, which tells us whether the reaction grows, shrinks, or remains stable over time; $P$, which tells us the initial level of activity in the reactor; and $T$, which tells us how long the reaction will go for. Given these parameters, how much total power ends up getting produced?

Well, let’s think about this one step at a time. In the first “stage” of our nuclear reaction, we’ll produce $P$ total power. Each one of the fission events will trigger $\alpha$ new fission events, and since the power generated is proportional to the number of fission events, in the next stage we’ll produce $\alpha P$ total power. Depending on what value we have for $\alpha$, that might be greater, less than, or equal to the amount of power we generated in the first stage. Each of the fission events in that second stage will in turn trigger $\alpha$ more fission events in the third stage, and because the power generated is proportional to the number of fission events, in the third stage we’ll have $\alpha^2 P$ total power generated. Similarly, in the fourth stage we’ll have $\alpha^3 P$ total power generated, in the fifth stage we’ll have $\alpha^4 P$ total power generated, etc. In other words, the amount of power generated as a function of the number of steps is exponential in $\alpha$.

So how much total power ends up getting generated over $T$ time steps? Well, that’s going to be the sum of the power drawn at each step, which is given by

$P + \alpha P + \alpha^2 P + \alpha^3 P + ... + \alpha^{T-1}P = \sum_{i=0}^{T-1}{\alpha^i P} = P\sum_{i=0}^{T-1}{\alpha^i}$.

We now have a nice summation in terms of our three parameters – $\alpha$, $P$, and $T$ – that tells us how much total power is going to be produced. So now we can ask – how exactly do you simplify that summation?

The good news is that this is a famous summation – it’s the sum of a geometric series! This is one of those sums that I think everyone at some point learned and then promptly forgot – I’ll admit that until I started routinely teaching CS theory I had completely forgotten it. There are two cases that arise for this sum. First, if $\alpha = 1$, then notice that $\alpha^i = 1$ for any choice of $i$, and so this sum is relatively easy to evaluate:

$P\sum_{i=0}^{T-1}{\alpha^i} = P\sum_{i=0}^{T-1}1 = PT$.

In other words, the total power generated is equal to the initial power $P$ times the total number of time steps $T$, which makes sense because the total number of fission events isn’t changing as a function of time.

On the other hand, if $\alpha \ne 1$, then there are two possibilities. The first possibility is that $\alpha < 1$, in which case the amount of power generated at each stage in the reactor is lower than the previous stage (it’s decaying geometrically). In that case, using the formula for the sum of an infinite geometric series, we have the nice, handy result that

$P\sum_{i=0}^{T-1}{\alpha^i} < P\sum_{i=0}^{\infty}{\alpha^i} = P\frac{1}{1 - \alpha}$.

In other words, the total power generated will be at most some constant multiple of the original power, and the specific constant multiple depends on the choice of $\alpha$.

In the last case, where $\alpha > 1$, we have this lovely formula at our disposal:

$P\sum_{i=0}^{T-1}{\alpha^i} = P\frac{\alpha^T - 1}{\alpha - 1}$

That formula is, um, a bit dense. For the purposes of where we’re going, I think it’s a bit easier to see what’s going on if we rewrite it like this:

$P\sum_{i=0}^{T-1}{\alpha^i} = P(\alpha^T - 1)\frac{1}{\alpha - 1} < P\alpha^{T-1} \frac{\alpha}{\alpha - 1}$

In other words, the total power drawn is exponential in the number of time steps $T$, with the base of the exponent depending on $\alpha$ and an extra constant factor that also depends on $\alpha$.

So what’s the net result of all of this mathematical gymnastics? Although we’ve framed this in the context of nuclear reactions, really, the above math applies equally well to any process involving some quantity that changes exponentially as a function of time. In fact, we can summarize what we’ve seen in this nice, handy way:

Theorem: Let $\alpha$ be a growth parameter, $P$ be an initial value, and $T$ be a number of time steps. Then

• if $\alpha < 1$, then $P \sum_{i=0}^{T-1}{\alpha^i} < P\frac{1}{1 - \alpha}$.
• if $\alpha = 1$, then $P \sum_{i=0}^{T-1}{\alpha^i} = PT$.
• if $\alpha > 1$, then $P \sum_{i=0}^{T-1}{\alpha^i} .

The general framework of this theorem – there are some initial parameters, that based on three cases split into three different inequalities bounding some overall quantity – looks a lot like the setup of the Master Theorem, and that’s not a coincidence. In fact, the Master Theorem is essentially what you get if you work out how the work done in a recurrence relation grows or shrinks as a function of time and then just plug the appropriate quantities back up into the above theorem!

To pull everything together, let’s return back to the Master Theorem. The setup for the Master Theorem assumes that we have some recurrence relation $T(n) \le aT(\frac{n}{b}) + cn^d$. This corresponds to the work done by a recursive function that

• does $O(n^d)$ work at each step, then
• makes $a$ total recursive calls,
• each of which is on an input of size $\frac{n}{b}$.

The question then is the following: if you call this recursive function on an input of size $n$, how much total work ends up being done? In other words, what’s a good, reasonable bound on $T(n)$?

To figure that out, it often helps to visualize exactly what this recursive call tree would look like using a recursion tree. With the recurrence we have up above, the recursion tree would look something like this:

Here, our top-level recursive call (on an input of size $n$) fires off $a$ recursive calls, each of which are on an input of size $\frac{n}{b}$. Each of those recursive calls then fire off $a$ new recursive calls, each of which is of size $\frac{n}{b^2}$, and though it’s not shown those calls each fire off calls of size $\frac{n}{b^3}$, which fire off calls of size $\frac{n}{b^4}$, etc. until eventually we hit our base case.

Although we haven’t discussed base cases here, let’s assume that our base case occurs when we get the input down to size 1. Since when we’re $k$ layers deep into our recursion the input size is $\frac{n}{b^k}$, this means that the recursion terminates when $k = \log_b n$, so the number of layers in our tree will be $\log_b n + 1$.

If we want to sum up how much total work is done across all the recursive calls, we could sum up how much work is done at each individual call, but that turns out to be a bit tricky. Rather, we’ll sum up how much work is done at each individual layer in the tree. (If you haven’t seen this before, this is a standard technique that comes up when manipulating recursion trees.)

In the first layer, we have a single recursive call of size $n$, and since our recurrence relation is $T(n) \le aT(\frac{n}{b}) + cn^d$, the work done purely in that call is $cn^d$.

In the second layer, we have $a$ recursive calls made. Each call is on an input of size $\frac{n}{b}$. This means that each individual recursive calls does $c(\frac{n}{b})^d = c(\frac{n^d}{b^d} = cn^d\frac{1}{b^d}$. Across all $a$ calls, that works out to a total of $cn^d\frac{a}{b^d}$ total work.

How about the layer after that? Well, there are $a^2$ recursive calls here (each of the $a$ calls in the layer above made $a$ calls) and each call is on an input of size $\frac{n}{b^2}$, since each call in the layer above us was on an input of size $\frac{n}{b}$ and each call shrinks the input size by a factor of $b$. The work done by each individual call here is $c(\frac{n}{b^2})^d = c(\frac{n^d}{b^{2d}}) = cn^d\frac{1}{b^{2d}}$, and summed up across all $a^2$ calls on this row, the work done is $cn^d \frac{a^2}{b^{2d}}$.

It seems like there’s a very interesting trend here. The first layer did $cn^d$ work. The second layer did $cn^d\frac{a}{b^d}$ work. The third layer did $cn^d\frac{a^2}{b^{2d}}$ work. It turns out that using similar logic the fourth layer ends up doing $cn^d\frac{a^3}{b^{3d}}$ work, the fifth layer does $cn^d\frac{a^4}{b^{4d}}$ work, etc. Notice that the work changes geometrically between levels – each level does a $\frac{a}{b^d}$ factor as much work as the previous one. We also know how many layers in the tree there are – there are $\log_b n + 1$ of them – and we know how much work is done at the top level (it’s $cn^d$). In other words, the work done here is pretty much exactly equal to the work done by our nuclear reactor, with

• $\alpha = \frac{a}{b^d}$,
• $P = cn^d$, and
• $T = \log_b n + 1$.

So all that’s left to do now is to plug these quantities into each of the three cases of the earlier theorem and do some algebra!

Our first case is where $\frac{a}{b^d} < 1$. That’s the exponential decay case – it means that the total work done decays from layer to layer, like in our nuclear reactor where each fission doesn’t trigger enough fissions to replace it. In that case, the sum of the work across the layers of the tree is given by $P\frac{1}{1 - \alpha}$. Plugging in $P = cn^d$ gives us that the total work here is $cn^d\frac{1}{1 - \alpha}$, and since in this case $\alpha$ is a fixed constant that doesn’t depend on $n$, the total work in this case is $O(n^d)$. So if $\frac{a}{b^d} < 1$, we get that the recurrence solves to $O(n^d)$.

Our second case is where $\frac{a}{b^d} = 1$. That’s the stable case – it means that the work done in each layer of the recursion tree is the same, and it corresponds to the case in our nuclear reactor where each fission event replaces itself in the next iteration. In that case, the work done is $PT$. Plugging in our values, we get that the total work done here is $cn^d(\log_b n) = O(n^d \log n)$.

The last case is where $\frac{a}{b^d} > 1$. That’s the exponential growth case – it means that the work done in each layer of the tree increases as you go down the tree, and it corresponds to the case in our nuclear reactor where the reactor power keeps on growing as more and more fissions occur. In this case, the work done is bounded by $P \alpha^{T-1} \frac{\alpha}{\alpha - 1}$. If we plug in our values of $P$, $\alpha$, and $T$ and simplify, we get that the total work done is bounded as follows:

$= cn^d \alpha^{\log_b n}\frac{\alpha}{\alpha - 1}$

$= cn^d (\frac{a}{b^d})^{\log_b n}\frac{\alpha}{\alpha - 1}$

$= cn^d (\frac{a^{\log_b n}}{b^{d \log_b n}})\frac{\alpha}{\alpha - 1}$

$= cn^d (\frac{a^{\log_b n}}{b^{\log_b {n^d}}})\frac{\alpha}{\alpha - 1}$

$= cn^d (\frac{a^{\log_b n}}{n^d})\frac{\alpha}{\alpha - 1}$

$= ca^{\log_b n}\frac{\alpha}{\alpha - 1}$

That looks horrible – I’m really sorry! – but it’s about to get a lot better. Using the useful but not particularly well known equality that $a^{\log_b n} = n^{\log_b a}$, we get that this whole thing simplifies to

$cn^{\log_b a}\frac{\alpha}{\alpha - 1}$

and since $\alpha$ is a constant with respect to $n$, this whole messy expression ends up being $O(n^{\log_b a})$.

Summarizing the three cases we found:

• If $\frac{a}{b^d} < 1$, then $T(n) = O(n^d)$.
• If $\frac{a}{b^d} = 1$, then $T(n) = O(n^d \log n)$.
• If $\frac{a}{b^d} > 1$, then $T(n) = O(n^{\log_b a})$.

Or, if you think that fraction involving $a$, $b$, and $d$ is too messy, we can rewrite the same thing with logarithms:

• If $\log_b a < d$, then $T(n) = O(n^d)$.
• If $\log_b a = d$, then $T(n) = O(n^d \log n)$.
• If $\log_b a > d$, then $T(n) = O(n^{\log_b a})$.

Et voila. We’ve derived the Master Theorem!

This presentation is lengthy, but I think it hits on the core ideas behind the Master Theorem. First, that the Master Theorem is a theorem about exponential growth, stability, or decay. The three cases in the Master Theorem correspond to whether some key parameter of the recurrence causes a certain behavior to grow exponentially, remain stable over time, or decay exponentially. Second, that the Master Theorem is a theorem about recursion trees. The specific exponential parameter we get here, $\frac{a}{b^d}$, is found by looking at the recursion tree for the recurrence, going one layer at a time, and summing up the total work that’s being done.

If you understand that these are the key ideas behind the Master Theorem, then you have more than a handy little tool for most recurrence relations – you have a framework for thinking about other recursion trees that you see in the future.

For example, consider the recurrence relation $T(n) \le T(\frac{7n}{10}) + T(\frac{n}{5}) + n$, which comes up in the analysis of the median-of-medians algorithm for linear-time selection. This recurrence relation doesn’t fit into the framework given by the Master Theorem because the recursive bits don’t have the same size. But that doesn’t mean we can’t take a stab at it. If we think about what that recursion tree is going to look like, we’ll see that at the first level we have a single call of size $n$, and it does $n$ work. Below that are two calls of size $\frac{7n}{10}$ and $\frac{n}{5}$, which collectively will do $\frac{7n}{10} + \frac{n}{5} = \frac{9n}{10}$ work. That looks a lot like exponential decay. In fact, it is – if you keep expanding out the tree, you’ll see that the work at each level drops by at least $\frac{9}{10}$ per level. Using what we know about geometric series, that suggests we’d see the same sort of behavior we’d see in general exponential decay – that the work at the top dominates the rest of the work – and that’s indeed what we find. This recurrence solves to $T(n) = O(n)$. (The initial recurrence that spawned the conversation that led to this presentation of the Master Theorem was $T(n) = T(\frac{n}{2}) + T(\frac{n}{4}) + T(\frac{n}{8}) + O(n)$. Can you do a similar analysis to solve it?)

Or what about my all-time favorite recurrence relation, $T(n) = \sqrt{n}T(\sqrt{n}) + n$? That doesn’t look anything like what the Master Theorem talks about. But look at what happens if you draw out the recursion tree – each level does $n$ work (do you see why?) – so we’d expect that the work done here should be the work per level times the number of levels. That makes the recurrence solve to $T(n) = O(n \log \log n)$, once you recognize that the number of levels happens to be $O(\log \log n)$.

If I get another chance to teach CS161, I’m thinking about using this presentation – with some better visuals and a lot more exposition in the math – to explain the Master Theorem and how to attack other sorts of related recurrence relations. But in the meantime, I figured I’d get this out in case anyone else was interested in taking a look at it.

Enjoy!

# Cantor’s Theorem and the Universal Set

Over the past six years I’ve had the pleasure of teaching CS103, one of the core CS classes in the Stanford Computer Science department. The course moves fast – it spends five weeks on discrete mathematics; four weeks on automata, languages, and computability theory; and one week on complexity theory. There are no prerequisites for the course, and so the course effectively functions as a combination of introduction to proofwriting and a quick survey of some of the major results from computability and complexity theory.

In the course of teaching CS103 I’ve curated and developed a bunch of discrete math and CS theory exercises that I give out in a number of problem sets, practice exams, and extra practice problems. I wanted to share a few of my favorite problems from that group here, along with a discussion of why I like those problems so much.

To kick things off, I thought I’d share this problem, which originated as an exam question and has since become a mainstay on the problem sets:

What happens if we take absolutely everything and throw it into a set? If we do, we would get a set called the universal set, which we denote 𝒰:

𝒰 = { x | x exists }

Absolutely everything would belong to this set: 1 ∈ 𝒰, ℕ ∈ 𝒰, philosophy ∈ 𝒰, CS103 ∈ 𝒰, etc. In fact, we’d even have 𝒰 ∈ 𝒰, which is strange but not immediately a problem.

Unfortunately, the set 𝒰 doesn’t actually exist, as its existence would break mathematics.

1. Prove that if A and B are arbitrary sets where AB, then |A| ≤ |B|. (Hint: Look at the Guide to Cantor’s theorem. Formally speaking, if you want to prove that |A| ≤ |B|, what do you need to prove? Your answer should involve defining some sort of function between A and B and proving that function has some specific property or properties.)

2. Using your result from (i), prove that if 𝒰 exists, then |℘(𝒰)| ≤ |𝒰|.

3. Using your result from (ii) and Cantor’s Theorem, prove that 𝒰 does not exist. Feel free to the following fact in the course of writing up your proof: for any sets A and B, if |A| ≤ |B| and |B| ≤ |A|, then |A| = |B|.

The result you’ve proven shows that there is a collection of objects (namely, the collection of  everything that exists) that cannot be put into a set. When this was discovered at the start of the twentieth century, it caused quite a lot of chaos in the math world and led to a reexamination of logical reasoning itself and a more formal definition of what objects can and cannot be gathered into a set. If you’re curious to learn more about what came out of that, take Math 161 (Set Theory) or Phil 159 (Non-Classical Logic).

This problem has a huge “whoa” factor for students. There is something really cool about proving that the very natural, intuitive concept of a set (a set is just a collection of things) somehow doesn’t quite work. I’ve had a lot of students tell me over the years that this result totally blew their minds.

I also really like the pedagogical value of the question. In the lectures leading up to this problem, I formally define the relations $|A| \le |B|$, $|A| < |B|$, and $|A| = |B|$ using injections, surjections, and bijections, then write up a rigorous proof of Cantor’s theorem using those definitions. Part (i) of this problem asks students to prove a result that, in most students’ minds, is probably “obvious,” that if $A \subseteq B$, then $|A| \le |B|$. The intent is to get students to engage with the formal definition of this relation and to find an injection $f : A \to B$. This pushes students to think about what that function might look like, given that there’s basically no information given about what $A$ and $B$ look like.

Part (ii) of the problem asks people to think about what the power set of the universal set 𝒰 looks like and how it related back to 𝒰. This hits on the difference between the element-of relation and the subset-of relation. Most students recognize that $\wp(\mathcal{U}) \in \mathcal{U}$, but it’s a second step to see why $\wp(\mathcal{U}) \subseteq \mathcal{U}$ holds as well. One of the major issues I see students struggle with when they’re just getting started with discrete mathematics is understanding how these two concepts relate to one another, and I’m always looking for more ways to get them to practice this.

Part (iii) of this problem combines everything together and asks students to think about how everything fits together. The real star here is Cantor’s theorem, which sets up a contradiction between $|\wp(\mathcal{U})| \le |\mathcal{U}|$ and $|\mathcal{U}| < |\wp(\mathcal{U})|$.

As an interesting technical aside, the “textbook” way you’d prove that the universal set doesn’t exist doesn’t proceed this way. Typically, you’d Russell’s paradox: use the axiom schema of specification to select from 𝒰 the set $S = \{ x | x \notin x \}$, and then prove that $S \in S$ if and only if $S \notin S$. There’s a diagonal-esque argument involved in Russell’s paradox (looking at objects that don’t contain themselves), which is still present in the proof route here (embedded in the proof of Cantor’s theorem). In fact, if you trace through the formal proof of Cantor’s theorem using the identity function, which is in a sense what this entire problem is all about, you’ll find that it ends up making precisely the same argument that gives rise to the Russell’s paradox contradiction.

I think this particular problem is one that could be adapted for use in other discrete math courses. The major prerequisites here are Cantor’s theorem and the formal definitions of the cardinality relations, which often come up in some form when talking about countable versus uncountable infinities. If you work through this problem, I think you’ll find that it’s actually not all that tricky and that the sticking points students will hit make for really good teaching moments (expanding out formal definitions, defining functions between sets, element-of versus subset-of, etc.).