Sometimes, data are avocados. Sometimes, data is wine.

I just finished teaching a course in data structures. In the course of doing so, I was made to reflect on what exactly we mean by “data.” The word “data” itself leads a double life. In one sense, data are avocados. In another, data is wine. Which of these sentences is grammatically correct depends on the context and in what way we view data.

The word “data” originated as the plural form of the word “datum,” where “datum” means “a single piece of information.” A single temperature measurement is a datum; temperature measurements spanning a whole week are data. If you think of data in this way, to be grammatically correct, you would use third-person plural verbs when talking about data. For example:

  • “The data show X and Y ” rather than “The data shows X and Y.”
  • “The data are from X and Y” rather than “The data is from X and Y.”

In this sense, data are avocados. If you want to check if a sentence involving the word “data” is grammatically correct, apply the “Avocados Test:” replace the word “data” with “avocados” and see if what you have is syntactically correct. That is:

  • “The avocados show X and Y” is correct; “The avocados shows X and Y” is incorrect.
  • “The avocados are from X and Y” is correct; “The avocados is from X and Y” is incorrect.

On the other hand, in many cases we don’t think of “data” as a collection of discrete units, each of which is a “datum.” Instead, we now have so much data that we think of data as a continuous entity not made of individual pieces. That is, instead of thinking of “data” as a counting noun, something where you can say “one datum” or “two data,” we think of “data” as a mass noun, where you can say “some data” but not “two data.”

For example, take data structuredata science, or data entry. These all fail the Avocados Test. We wouldn’t say avocados structureavocados science, or avocados entry. If we wanted to think about ways of organizing avocados, or studying their deliciousness, or putting them into containers, we’d talk about an avocado structure, the field of avocado science, and the job of avocado entry.

In this sense, data is wine. You have a wine cellar, not a wines cellar. You have a wine tasting, not a wines tasting. (The analogy is not perfect; you can have “ten wines” if you are talking about types of wine, and I suspect “ten data” or “ten datas” would be pretty odd to think about). Let’s call this the “Wine Test:” if replacing “data” with “wine” remains grammatically correct, you’re treating “data” as a mass noun, something that isn’t made of individual constituent parts.

If you think of data this way, the correct choice of verbs to use with the word “data” flip. Take the two earlier example sentences: “The data show X and Y” and “The data are from X and Y.” These were fine with the Avocados Test, but they fail the Wine Test:

  • “The wine show X and Y ” is wrong; “The wine shows X and Y” is correct.
  • “The wine are from X and Y” is wrong; “The wine is from X and Y” is correct.

In rereading this post, I’ve noticed that I default to treating data as a mass noun. Take the phrase “we now have so much data.” Notice that

  • this fails the Avocados Test: “we now have so much avocados” is wrong; but
  • this passes the Wine Test: “we now have so much wine.”

However, in rereading some of my other writings, and in paying close attention to how I speak, when using “data” as the subject of a verb, I tend to treat it as a counting noun and, appropriately, use the plural verbs. Think of it as a decidedly more banal version of wave/particle duality; is this avocados/wine duality?

I’m curious to see if there are trends in how the word “data” is used in different disciplines. I suspect that in computer science, it’s more commonly treated like wine than like avocados. How about in biology? Would we find data to be avocados there? How about philosophy?

Addendum: After finishing the draft of this post, I found this lovely discussion of the word “data” on Wikipedia, which talks about how “data” and “datum” are used in technical contexts in different fields. How fascinating!

A Tablespoon of Salt

I enjoy cooking – I find it really relaxing. There’s something nice about getting to switch to autopilot and follow a recipe after a long day of work.

A few years back I found a soup recipe (I can’t remember which one, unfortunately) that looked really, really good. I invited some friends over for dinner on the promise that I’d trade them food and shelter for good conversation. Nowadays, I think I have a pretty good handle on the basic steps in making soup, but at the time I was still getting a handle on cooking and found myself essentially following recipes more or less verbatim.

One of the first steps in this particular recipe was to cook a bunch of aromatics over low heat, along with some salt to draw out the water and speed things up. The recipe recommended adding in a teaspoon of kosher salt. I misread the directions and instead added a tablespoon of table salt.

Needless to say, although everything smelled amazing, when it came time to serve it up we found that it was essentially inedible. I had to apologize for keeping people waiting, but we managed to salvage the evening by heading downtown to a restaurant where the food was made by trained professionals.

So what went wrong here?

There were two immediate errors. First, I had mixed up tablespoons and teaspoons. That made my measurements off by a factor of three. I also used table salt instead of kosher salt, which has smaller crystals that can pack more tightly, effectively multiplying the oversalting amount by something like an extra 25 – 40%. So I had essentially added in around four times the amount of salt necessary. Oops.

But let’s take a step back. Why didn’t I notice that I was about to make this mistake? Well, I was new to cooking, and I hadn’t made enough recipes to realize that a full tablespoon of table salt is a lot of salt. That’s three times the daily recommended salt intake for a person. (That point is particularly relevant, since I was making food for three people!) In a sense, I hadn’t built up an intuition that would cause me to pause for a minute, say that something didn’t feel right about adding that much salt, and then double-check the recipe.

It also didn’t help that I had never really stopped to think about the difference between kosher salt and table salt. I had just thought that salt was salt was salt and that was that. It had never occurred to me that different types of salt had different properties and different use cases, or that recipes would specify different types for a good reason. (Or, rather, that they often are different and often are there for good reason. In some cases, salt is salt is salt. Just not this case.)

It also didn’t help that I hadn’t yet built up enough cooking experience to realize just how powerful salt is. There are some ingredients you can accidentally triple without breaking things. If I used too much oil in this recipe, I probably wouldn’t have noticed because the last step was to immersion blend everything and chances are it would have (mostly) emulsified in. It just would have been super rich. Salt is not one of those ingredients, but I hadn’t learned that yet.

And finally, it really didn’t help that this was a recipe I’d never tried out before. At the time I was making this recipe, I hadn’t yet developed enough experience to be able to stop and taste something partway through a recipe and figure out whether it needed some sort of adjustment. In this case even, when I was thinking “wow, those aromatics by themselves are really unpleasantly salty!” I didn’t have the context to know “and that’s not what they’re supposed to taste like!”

The reason I’m mentioning this story is that it began with me making what, in some sense, is a pretty minor error: using the wrong amount of salt. But really, this error indicated that I still had a lot to learn about cooking, namely:

  • A tablespoon of salt is a lot of salt – probably way more than what you’d ever need in soup for three people.
  • Kosher salt is not the same as table salt, and the two aren’t one-for-one interchangable.
  • Oversalting something can totally wreck it, and you have to be very careful about how much salt you add to something.
  • You should not invite people over and serve them an untested recipe!

Over the years both taking and teaching classes, I’ve found that there are certain mistakes you can make that seem minor, but which totally break whatever it is that you’re working on. Often times, the reason that those mistakes seem minor is because there’s a lack of a deeper understanding of some key concept or concepts, and really the true error is something more fundamental. I call these errors “tablespoon of salt errors.” A tablespoon of salt error is one where

  • the error appears superficial to the person who made it;
  • the error is sufficiently serious that it’s difficult, if not impossible, to recover from it without starting over; and
  • the error reflects a fundamental process error or a lack of understanding of some key concept.

For example, when I was an undergrad, I was taking a class on advanced object-oriented programming techniques in C++. One of the assignments asked us to simulate objects moving around in a shipping network. I had a helper function that checked whether the weight of a packet was zero and then returned one of two different values based on this. The code essentially looked like this:

WeightClass weightClassFor(unsigned weight) {
    return weight = 0? ZeroWeightClass : NonzeroWeightClass;

There’s an big error here. Specifically, I’m using a single-equals sign (“assign weight to be zero, then evaluate to zero”) rather than a double-equals sign (“compare weight against zero.”) This code always returns the NonzeroWeightClass, which isn’t correct.

I spent many hours working on this project, submitted it, and earned a solid C for it. The error was localized specifically to the above function, and changing that one equals sign would have fixed the issue and passed all the functionality tests.

In the only time I can ever remember doing so as an undergrad, I went to the TAs to beg and plead for some partial credit back, since after all, it was a one-character error! It’s pretty clear what I meant to do, after all.

But the thing is, this wasn’t a one-character error. Somehow, I had written some code that didn’t work, and I hadn’t sufficiently tested things before I submitted it. In reality, the root causes of the error were more accurately things like these:

  • I wasn’t compiling with warnings on the maximum setting. A good compiler probably could have caught this and said something like “so… you’re sure you want to have a comparison statement that always does the same thing?”
  • I hadn’t tested things thoroughly enough. This code clearly doesn’t work, but I had written a whopping zero unit tests for my code and therefore didn’t have a way of noticing.

This is a classic tablespoon of salt error. It seems like a minor error (I missed an equals sign). It broke everything pretty seriously (a quarter of the tests failed, despite the rest of the code being essentially right). And it indicated a serious process error on my part (where were the unit tests? why didn’t you compile with warnings as errors?)

I’m now completely convinced that the TAs were justified in taking off a bunch of points, since there was something fundamentally broken with how I approached the assignment. Plus, I now know not to make that error again!

Now that I’m teaching classes rather than taking them, I’m noticing that tablespoon-of-salt-errors are extremely common, especially in discrete math and proofwriting, where very, very simple mistakes can totally derail a proof.

For example, consider the following proof about multiples of three. The relevant definition we give to students is that an integer n is a multiple of three if n = 3k for some integer k.

Theorem: If n is an integer that is not a multiple of three, then n^2 is not a multiple of three.

Proof: Consider an arbitrary natural number n where n is not a multiple of three. This means that n \ne 3k for any integer k. Consequently, n^2 \ne 3(3k^2) for any integer k. Therefore, if we let m = 3k, we see that n^2 \ne 3m, and therefore n^2 is not a multiple of three. $\qed$

There’s a nice rhythm to the proof, and if I were to read this, I’d think that the (hypothetical) student who wrote it made a good-faith effort to solve the problem. I’d also think that they made a pretty big logic error: just because you can’t write n^2 = 3(3k^2) for any integer k doesn’t necessarily mean that you can’t write n^2 = 3m for some integer m that isn’t of the form 3k^2. Using that same logic, I could prove the (incorrect) claim that if n is not a multiple of nine, then n^2 is not a multiple of nine either:

“Theorem:” If n is an integer that is not a multiple of nine, then n^2 is not a multiple of nine.

Proof: Consider an arbitrary natural number n where n is not a multiple of nine. This means that n \ne 9k for any integer k. Consequently, n^2 \ne 9(9k^2) for any integer k. Therefore, if we let m = 9k, we see that n^2 \ne 9m, and therefore n^2 is not a multiple of nine. $\qed$

The theorem here is not true (pick n = 3, for example), and the proof follows pretty much the same line of reasoning as the one from earlier, so the logic can’t possibly be correct.

Superficially, this error isn’t all that deep. “Oops, I just made a mistake in that last step.” But really, this goes a lot deeper and hits at the challenge of working with arbitrarily-chosen values. If you want to rule out all choices of m, you have to really show that no choice of m is going to work, not just find a family of infinitely many numbers that don’t work. There’s also another issue with setting m = 3k^2 without actually saying what k is in the first place (is it a placeholder? you generally can’t do that to placeholders.)

This is a tablespoon of salt error. The error seems superficial (either “you can’t use inequalities that way” or “not all integers m can be written as 3k^2 for some integer k.”) The error completely breaks the proof (the same reasoning can be used to prove patently false statements without much modification). And the error indicates a fundamental misunderstanding of some key concept (arbitrarily-chosen values, placeholder variables, etc.)

A question that I keep grappling with as an instructor, and which I’m not sure I’ll ever really figure out the best answer to, is how to assign grades to student work with a tablespoon of salt error in it. I remember my own frustration as a student getting slammed for mistakes that I thought were pretty minor. I also remember how valuable those experiences were, though, since they really forced me to admit that there were things I didn’t yet know and they helped me figure out where I needed to study up.

Figuring out how to handle this case in a way that pushes students to do better without making them feel overwhelmed by what they don’t yet know is a challenge. It’s exacerbated in the tablespoon of salt case because the grading feels crazily disconnected from reality.

There’s a case currently up at the US Supreme Court (District of Columbia v. Wesby) where someone hosted a party in a vacant house that they had no business being in. The police came, questioned everyone there, and arrested everyone on suspicion of trespassing. In DC law, trespassing requires a mens rea on the part of the perpetrator – they have to know that they’re not supposed to be there. The question before the Court is about probable cause. The police report seeing a group of people in a house whose situation was so rough that the partygoers should have known that they were breaking it. The partygoers reported that they thought they were invited to a house they were allowed to be in. The disconnect between these narratives is currently being decided by nine very intelligent legal minds and will have a huge impact on law enforcement.

From the perspective of an instructor, a student who made a tablespoon of salt error has a serious gap in their understanding that needs to be corrected. From the perspective of the student, they made a minor error and the graders harshly took off way too many points for it. It’s a clash of narratives, which is never fun.

I’m going to keep tuning my approaches to handling cases like these. I figured I’d post this concept in the meantime in case the metaphor proves useful in other contexts.

What is a Hash Function? It Depends on Who’s Speaking

Hash functions are one of those wonderful mathematical objects that have a ton of use cases. You can use them to check whether a download failed due to a random error, or to build fast data structures (including the count-min sketches, the simplest data structure that most people haven’t heard of), or as a basis for cryptocurrencies.

Having taught a combination of courses, some more programming-oriented and some more theory-oriented, I’ve found that the term “hash function” and what’s expected of it varies a lot by who’s doing the talking. The idea of a hash code to a veteran Java programmer bears little resemblance to the families of hash functions that theoreticians would talk about, which in turn bear little resemblance to the cryptographic hash functions that computer security folks spend so much time working on. This can make it challenging for programming students to transition to learning about the theory behind hash tables or other data structures, as I learned the hard way when I didn’t recognize this when teaching algorithms for the first time.

This post is my attempt to talk about the different expectations of how hash functions work, or are expected to work, across different areas of computer science and software engineering. My hope is that if you’re a practicing programmer and are starting to study algorithms, data structures, or cryptography that this gives you a sense of what the expectations are when you dive into those fields.

Hash Functions: Programmer Edition

If you’re a practicing programmer, you probably first encountered hash functions in the context of hash tables, like the Java HashMap, the C++ std::unordered_map, or the Python dict type. You may understand how these data structures work, or you might just know that “they’re fast” and “they need a hash code.”

From a programming perspective, the most common times you’ll see hash functions are when you have to write them yourself. For example, if you’re a Java programmer, you probably learned to write a hash code like this:

@Override public void hashCode() {
    int result = myFirstField.hashCode();
    result = 31 * result + mySecondField.hashCode();
    result = 31 * result + myPenultimateField.hashCode();
    result = 31 * result + myUltimateField.hashCode();
    return result;

You’ve probably heard that hash codes are supposed to work so that different objects generally give different hash codes, and that equal objects have to have equal hash codes.

If you’ve implemented a hash table, perhaps as part of an intro class, or perhaps because you just thought it would be fun, you’ve had to be a client of a hash function. You maintained some kind of array of slots and determined which slot or slots to look for an entry by computing its hash code to get back some value (probably a 32-bit or 64-bit value), modding it by the table size, and then looking in that spot.

Then there’s the question of what you likely prioritize in a good hash function. You probably care about things like how well it spreads things out (low spread is bad for hash tables, high spread is good for hash tables) and how fast it runs (the faster, the better!).

To summarize, the key assumptions are that

  • each object has a hash code, which you usually implement by doing something involving multiplying by 31 and adding values together;
  • equal objects should have equal hash codes, and unequal objects should try to have unequal hash codes;
  • hash functions are usually 32-bit integers or 64-bit integers that get modded by some kind of table size;
  • there’s a bunch of “other” hash functions you can use to get a fingerprint that helps you check if you downloaded something properly; and
  • it’s good to write hash functions that spread things out and run quickly.

Hash Functions: The Data Structure Theory Perspective

If you’re a researcher studying algorithms or data structures, you have a pretty different point of view about what a hash function is and how it should behave.

Mathematically speaking, a hash function h is typically modeled as a mathematical function from some set of possible inputs (often called the key space or the universe) to a set of outputs, which is usually a range of integer values. For example, if you were building a hash table that had, say, 100 slots in it and you wanted to store integers in that hash table, you would want a hash function that took in an integer and then outputted an index between 0 and 99, inclusive. Mathematically, we’d model this as a function h : \mathbb{Z} \to \{0, 1, 2, ..., 99\}.

A few things might jump out here. First, we no longer speak of individual objects of having “a” hash code. That is, there’s no longer the idea that every object has exactly one hash code as some sort of intrinsic property. Instead, we think of the objects that we’re hashing as existing on their own, and then we come from the outside with some hash function h that suits our needs. In other words:

If you are a software engineer, you probably think of every object having a hash code. Hash codes are an intrinsic property of objects.

If you are a theoretician, you probably think of there being all sorts of possible hash functions that we could use for a group of objects. We can define lots of different possible hash functions for a given collection of objects. A hash function is an extrinsic object that gets applied to objects.

There’s another detail here. In The Real World, where there’s Object.hashCode(), or std::hash<T>, you have one hash function that needs to work for any size of hash table you might want to toss an object into. This means that hash functions typically output some sort of 32-bit integer or 64-bit integer, or more generally something big enough that no matter how large of a hash table you make, your hash function can provide outputs that can span the whole range.

But in Theoryland, it’s far more common to think of hash functions as being specialized for a particular size of table. If I want a hash function that I’ll use in a string hash table that hash 137 slots, I’ll model it as a function that takes in a string and outputs a number between 0 and 136, inclusive. (If you’ve seen the notation \Sigma^* from formal language theory, that means I’d think of my hash code as a function h : \Sigma^* \to \{0, 1, 2, ..., 136\}. If I then resize my hash table to have 237 slots, I would model it mathematically as though I’d picked a brand new hash function h' that takes in strings and outputs a number in the range \{0, 1, 2, ..., 236\} (with formal notation, that would be something like h' : \Sigma^* \to \{0, 1, 2, ..., 236\}. For those of you with a discrete mathematics background, the terminology here is that the codomain of the hash function is a formal part of the specification of the hash function in the first place.

Because hash functions in Theoryland are external to objects, rather than intrinsic to them, it’s not a problem if we decide to completely switch the hash function we’re using. In an actual programming language like Java or C++, though, the idea of switching hash codes is hard to model, since it’s not like we can just swap out std::hash<T> or hashCode for some other implementation.

In other words:

In software engineering, hash codes usually output very large values that then get modded down to the size that’s needed. If your table grows in size, you use the same hash code and just change what you mod by.

If you’re a theoretician, you model a hash function producing one of only a few possible values by building the set of possible outputs into the definition of your hash function. If you have a change in a hash table size, you pretend you just grabbed a brand new hash function with a different codomain.

These two differences – that hash functions are external to objects, and that you can just grab new hash functions with different codomains whenever you need to – can come out of nowhere if you’re first learning how to analyze algorithms or data structures, because a professor who’s been working in Theoryland might treat this as so fundamental that it’s not worth pausing to comment on.

But things get a bit weirder from here, especially if you go on to study algorithms or data structures more formally.

Let’s imagine you have a hash function that takes as input some kind of object (say, widgets) and outputs a natural number between 0 and n - 1, inclusive. Let’s have U be the set of all possible widgets. I’m going to assume that there are a lot of different widgets out there, say, way, way more than n of them. In that case, if you know in advance what hash function h : U \to \{0, 1, 2, ..., n - 1\} is going to be used in some hash table, you could envision what would happen if you hashed every single possible widget and looked at what hash code you got back. The handy pigeonhole principle tells you that there would always be some collection of at least \frac{|U|}{n} objects that have the same hash code.

If you’re a theoretician, this means that you have to worry about the case where your hash table happens to get \frac{|U|}{n} elements dumped into it that all have the same hash code. It doesn’t matter what hash function you pick – there’s simply too many widgets to distribute them in a way that doesn’t have a lot of collisions.

If you’re a Nefarious Scoundrel, you might have this same idea and reach the conclusion that you could break hash tables by doing some preprocessing to figure out what sorts of elements collide with one another, then dump them all in to some web server’s internal hash tables at once, degrading performance. And congrats, you’ve just invented the HashDoS attack, and this is why we can’t have nice things.

From a Theoryland perspective, given that there’s this fundamental weakness in hash functions, it makes sense to try to search for some way to build hash tables that can’t get clobbered in this fashion. And this is where we can use the nice idea that hash functions are external to objects rather than an intrinsic property. What if, when we build a hash table, we choose the hash function to use out of a pool of potential candidates? From the perspective of our Nefarious Scoundrel, this makes attacking our hash tables a lot harder, since no matter how much up-front work they do to try to figure out a collection of bad inputs, there’s a chance that we’ll pick a function out of our pool that prevents this attack

You often see terms like “universal family of hash functions” or “family of 2-independent hash functions” tossed around in Theoryland when discussing hash tables for this precise reason. When we’re reasoning about properties of hash tables, we often imagine that we have a pool of hash functions to pick from (which we’d more commonly call a family of hash functions, just because it sounds nicer) and we choose one such function at random to use in a hash table.

This hits at a very large gap between the theory of hash functions and what software engineers likely encounter:

If you’re a software engineer, you think of each object as having “a” hash code. You might also think of switching between different hash functions like MurmurHash, SAX, etc., but only in specialized circumstances or for performance reasons.

If you’re a theoretician working on algorithms and data structures, you often think of having large families of hash functions to pick from, and design data structures on the assumption that you can choose from these hash functions at random if necessary.

That being said, there isn’t a complete disconnect between theory and practice here. After enough Nefarious Scoundrels started crafting HashDoS messages to web servers to slow them down, many language implementations (like the JVM) started adding some randomness to how they hash fundamental types like integers, strings, and arrays. This meant that primitive types would not produce the same hash codes from program run to program run, which was less than ideal if you liked precomputing hashes and storing them on disk, but makes it much harder to HashDoS a website.

Because the hash functions used in algorithm and data structure design typically look so different than the ones that are usually talked about at the language level, the qualities that are optimized over look quite different. Families of hash functions in Theoryland are usually categorized by the strength of the assumptions you can make about them. If you choose a random hash function from a family and evaluate it on a single input, can you predict with better than random chance what it will do on another input? If not, you have a 2-independent family of hash functions. If you pick a random hash function, can you say that any two objects, no matter how they’re chosen, have a low collision probability? If so, you have a universal family of hash functions.

To summarize, in the context of algorithms and data structures:

  • hash functions are external to the objects they work on and are not intrinsic properties;
  • hash functions typically have their codomain specified in a way that bakes in the size of the range that they’re mapping to;
  • hash functions are typically grouped into families, with the idea that a hash function will be picked out of the family at random; and
  • the “power” of a hash function is usually measured in terms of statistical guarantees and precise inequalities bounding collision probabilities, rather than in terms of speed or a more flexible idea of “well-distributed.”

Things Not Covered Here

This is, of course, a very quick and probably somewhat unfair characterization of what hash functions look like in programming practice and in the theory of algorithms and data structures. I didn’t talk about cryptographic hash functions, which get used all the time in both theory and practice. I didn’t address clever custom hash functions like locality-sensitive hash functions or the hash functions you might use in derandomization.

My intent is to help make it easier for people used to writing code and worrying about performance to be able to read about the theory behind why things work, and hopefully to learn a thing or two in the process of doing so. I think it would be great if more people knew about count-min sketches, or more clever hash tables like the FKS dynamic optimality scheme, or using LSH to speed up finding related items in a data set.

Designing Assignments with Google Forms

Over the last academic year, I’ve gotten a lot of mileage designing online assignments using Google Forms. I’ve typically seen them used for event RSVP systems, but apparently if you’re creative about how you use sections and multiple-choice questions and data validation, you can use Google Forms to ensure that students work through a particular series of exercises, and can even offer targeted feedback along the way.

There are two particular techniques that I’ve gotten the most mileage out of. The first one is an approach I’m calling the multiple choice maze. This is a Google Form that’s a series of multiple choice questions. Each time you provide a wrong answer, the form takes you to a page with either a hint or an explanation about why that answer is incorrect, followed by a link back to the previous page. The form can only be submitted on the last page, which ensures that students have to answer everything correctly before they can submit the assignment.

You can try out an example multiple choice using this link.

There are a number of really nice advantages to using a multiple-choice maze for an assignment. First, it’s a highly scalable way to make sure that everyone in the class is on the same page about something. When I last taught CS106B, I used a multiple choice maze to make sure that the students were familiar with the Honor Code policies for the course; we asked a series of multiple choice questions about what actions were and were not permissible, and only allowed students to submit the assignment by getting all the answers right. Second, it’s an effective and lightweight way to make sure students are on top of lecture material. I sent the spring quarter CS103A students a multiple choice maze assignment every Wednesday evening covering material from the week’s two lecture meetings and specifically asked questions that required them to exercise the skills and concepts we’d taught. Because the feedback is automatic and customizable, students can quickly size up whether their understanding of the concepts is where it needs to be and can get refreshers and reminders on topics that subject. Finally, since everything is handled through Google Forms, once all things are said and done we have a reliable way to determine who’s completed the assignment, since we have timestamps for all of the submissions and can ask for names.

Once you’ve gotten the hang of it, it’s not too hard to assemble a multiple choice maze. For example, here’s the internal view of the first question from the form linked above:

Screenshot from 2017-08-10 17-29-48

Here, I have a multiple-choice question. It’s set to required so that anyone working through the form has to answer it. In the dropdown menu in the bottom corner of the question, I’ve checked the Go to section based on answer option, which lets me have each question jump to a different section in the form. As you can see, the three wrong answers all link to Section 2 (titled “Oops! That wasn’t quite right.”) and the correct answer links to Section 3 (“Some Exposition!”). That way, if you answer correctly, you get taken to a section where we can move on to the next question, and if you answer incorrectly, you get taken to a different section where we can explain why that answer is wrong. Here’s what Section 2 looks like on the back-end:

Screenshot from 2017-08-10 17-29-08

Notice that there are no questions in this section and that the section is purely text describing how to approach the problem. (I’ve never taught addition of fractions, so I suppose that this is probably not a very good explanation… sorry!) Importantly, the After section 2 option, which controls which section to visit after seeing this section is set to Go to section 1, which is the section containing the initial multiple-choice question. That way, if you get the answer wrong, you’ll get this nice helpful text and then get warped back to the question so you can make another guess.

We can take this a step further by creating questions where each wrong answer goes to a separate screen with a follow-up. For example, here’s the back-end view of the second question from the list:

Screenshot from 2017-08-10 17-29-20.png

Notice how each wrong answer goes to its own individual section, each of which is designed like the one above to warp back to this question.

The second approach I’ve found useful for making Google Form assignments is what I’m calling a find the hidden number assignment.

Last time I taught CS106B, I wanted students to get more familiarity using the debugger over the course of the quarter and figured a good way to do this would be to create a debugging assignment. For the very first assignment of the quarter, I took a standard program we’ve given out for a while that computes a rolling hash code, then designed an assignment that asked students to single-step through the hash code computation on a specific input. At some point in the single-stepping, I asked students to write down one of the numbers that was generated at an intermediate step. The Google Form for the assignment then required students to enter that number in order to move on to the next part of the assignment (and, therefore, the “Submit” button.)

Here’s a demo of what the form might look like. The secret number is 137.

The main advantage of setting up an assignment like this is that it functions as a lightweight “proof-of-work” system. Students need to show that they did some amount of work – which you can customize – in order to receive credit for the assignment. In my case, I was able to use this system to ensure that (in principle) everyone in the class had successfully gotten their IDE and debugger set up and were able to run programs and step through them in the debugger.

Setting something like this up isn’t too hard and just requires you to enable data validation and to choose the “Equal to” option:

Screenshot from 2017-08-10 17-43-41

Are these tools perfect for the job? No. But do they work well? Evidence seems to suggest that they do! The multiple choice maze I set up for the Honor Code seemed to work well; we had way fewer copied assignments submitted the quarter we deployed it than what we usually see, and more students seemed to know how to use the debugger.

I’ll probably keep using this technique in the future (once the forms are set up, it’s not at all hard to copy them from quarter to quarter). Let me know if you’re using anything along these lines or find this approach useful or interesting!