What is a Hash Function? It Depends on Who’s Speaking

Hash functions are one of those wonderful mathematical objects that have a ton of use cases. You can use them to check whether a download failed due to a random error, or to build fast data structures (including the count-min sketches, the simplest data structure that most people haven’t heard of), or as a basis for cryptocurrencies.

Having taught a combination of courses, some more programming-oriented and some more theory-oriented, I’ve found that the term “hash function” and what’s expected of it varies a lot by who’s doing the talking. The idea of a hash code to a veteran Java programmer bears little resemblance to the families of hash functions that theoreticians would talk about, which in turn bear little resemblance to the cryptographic hash functions that computer security folks spend so much time working on. This can make it challenging for programming students to transition to learning about the theory behind hash tables or other data structures, as I learned the hard way when I didn’t recognize this when teaching algorithms for the first time.

This post is my attempt to talk about the different expectations of how hash functions work, or are expected to work, across different areas of computer science and software engineering. My hope is that if you’re a practicing programmer and are starting to study algorithms, data structures, or cryptography that this gives you a sense of what the expectations are when you dive into those fields.

Hash Functions: Programmer Edition

If you’re a practicing programmer, you probably first encountered hash functions in the context of hash tables, like the Java HashMap, the C++ std::unordered_map, or the Python dict type. You may understand how these data structures work, or you might just know that “they’re fast” and “they need a hash code.”

From a programming perspective, the most common times you’ll see hash functions are when you have to write them yourself. For example, if you’re a Java programmer, you probably learned to write a hash code like this:

@Override public void hashCode() {
    int result = myFirstField.hashCode();
    result = 31 * result + mySecondField.hashCode();
    result = 31 * result + myPenultimateField.hashCode();
    result = 31 * result + myUltimateField.hashCode();
    return result;

You’ve probably heard that hash codes are supposed to work so that different objects generally give different hash codes, and that equal objects have to have equal hash codes.

If you’ve implemented a hash table, perhaps as part of an intro class, or perhaps because you just thought it would be fun, you’ve had to be a client of a hash function. You maintained some kind of array of slots and determined which slot or slots to look for an entry by computing its hash code to get back some value (probably a 32-bit or 64-bit value), modding it by the table size, and then looking in that spot.

Then there’s the question of what you likely prioritize in a good hash function. You probably care about things like how well it spreads things out (low spread is bad for hash tables, high spread is good for hash tables) and how fast it runs (the faster, the better!).

To summarize, the key assumptions are that

  • each object has a hash code, which you usually implement by doing something involving multiplying by 31 and adding values together;
  • equal objects should have equal hash codes, and unequal objects should try to have unequal hash codes;
  • hash functions are usually 32-bit integers or 64-bit integers that get modded by some kind of table size;
  • there’s a bunch of “other” hash functions you can use to get a fingerprint that helps you check if you downloaded something properly; and
  • it’s good to write hash functions that spread things out and run quickly.

Hash Functions: The Data Structure Theory Perspective

If you’re a researcher studying algorithms or data structures, you have a pretty different point of view about what a hash function is and how it should behave.

Mathematically speaking, a hash function h is typically modeled as a mathematical function from some set of possible inputs (often called the key space or the universe) to a set of outputs, which is usually a range of integer values. For example, if you were building a hash table that had, say, 100 slots in it and you wanted to store integers in that hash table, you would want a hash function that took in an integer and then outputted an index between 0 and 99, inclusive. Mathematically, we’d model this as a function h : \mathbb{Z} \to \{0, 1, 2, ..., 99\}.

A few things might jump out here. First, we no longer speak of individual objects of having “a” hash code. That is, there’s no longer the idea that every object has exactly one hash code as some sort of intrinsic property. Instead, we think of the objects that we’re hashing as existing on their own, and then we come from the outside with some hash function h that suits our needs. In other words:

If you are a software engineer, you probably think of every object having a hash code. Hash codes are an intrinsic property of objects.

If you are a theoretician, you probably think of there being all sorts of possible hash functions that we could use for a group of objects. We can define lots of different possible hash functions for a given collection of objects. A hash function is an extrinsic object that gets applied to objects.

There’s another detail here. In The Real World, where there’s Object.hashCode(), or std::hash<T>, you have one hash function that needs to work for any size of hash table you might want to toss an object into. This means that hash functions typically output some sort of 32-bit integer or 64-bit integer, or more generally something big enough that no matter how large of a hash table you make, your hash function can provide outputs that can span the whole range.

But in Theoryland, it’s far more common to think of hash functions as being specialized for a particular size of table. If I want a hash function that I’ll use in a string hash table that hash 137 slots, I’ll model it as a function that takes in a string and outputs a number between 0 and 136, inclusive. (If you’ve seen the notation \Sigma^* from formal language theory, that means I’d think of my hash code as a function h : \Sigma^* \to \{0, 1, 2, ..., 136\}. If I then resize my hash table to have 237 slots, I would model it mathematically as though I’d picked a brand new hash function h' that takes in strings and outputs a number in the range \{0, 1, 2, ..., 236\} (with formal notation, that would be something like h' : \Sigma^* \to \{0, 1, 2, ..., 236\}. For those of you with a discrete mathematics background, the terminology here is that the codomain of the hash function is a formal part of the specification of the hash function in the first place.

Because hash functions in Theoryland are external to objects, rather than intrinsic to them, it’s not a problem if we decide to completely switch the hash function we’re using. In an actual programming language like Java or C++, though, the idea of switching hash codes is hard to model, since it’s not like we can just swap out std::hash<T> or hashCode for some other implementation.

In other words:

In software engineering, hash codes usually output very large values that then get modded down to the size that’s needed. If your table grows in size, you use the same hash code and just change what you mod by.

If you’re a theoretician, you model a hash function producing one of only a few possible values by building the set of possible outputs into the definition of your hash function. If you have a change in a hash table size, you pretend you just grabbed a brand new hash function with a different codomain.

These two differences – that hash functions are external to objects, and that you can just grab new hash functions with different codomains whenever you need to – can come out of nowhere if you’re first learning how to analyze algorithms or data structures, because a professor who’s been working in Theoryland might treat this as so fundamental that it’s not worth pausing to comment on.

But things get a bit weirder from here, especially if you go on to study algorithms or data structures more formally.

Let’s imagine you have a hash function that takes as input some kind of object (say, widgets) and outputs a natural number between 0 and n - 1, inclusive. Let’s have U be the set of all possible widgets. I’m going to assume that there are a lot of different widgets out there, say, way, way more than n of them. In that case, if you know in advance what hash function h : U \to \{0, 1, 2, ..., n - 1\} is going to be used in some hash table, you could envision what would happen if you hashed every single possible widget and looked at what hash code you got back. The handy pigeonhole principle tells you that there would always be some collection of at least \frac{|U|}{n} objects that have the same hash code.

If you’re a theoretician, this means that you have to worry about the case where your hash table happens to get \frac{|U|}{n} elements dumped into it that all have the same hash code. It doesn’t matter what hash function you pick – there’s simply too many widgets to distribute them in a way that doesn’t have a lot of collisions.

If you’re a Nefarious Scoundrel, you might have this same idea and reach the conclusion that you could break hash tables by doing some preprocessing to figure out what sorts of elements collide with one another, then dump them all in to some web server’s internal hash tables at once, degrading performance. And congrats, you’ve just invented the HashDoS attack, and this is why we can’t have nice things.

From a Theoryland perspective, given that there’s this fundamental weakness in hash functions, it makes sense to try to search for some way to build hash tables that can’t get clobbered in this fashion. And this is where we can use the nice idea that hash functions are external to objects rather than an intrinsic property. What if, when we build a hash table, we choose the hash function to use out of a pool of potential candidates? From the perspective of our Nefarious Scoundrel, this makes attacking our hash tables a lot harder, since no matter how much up-front work they do to try to figure out a collection of bad inputs, there’s a chance that we’ll pick a function out of our pool that prevents this attack

You often see terms like “universal family of hash functions” or “family of 2-independent hash functions” tossed around in Theoryland when discussing hash tables for this precise reason. When we’re reasoning about properties of hash tables, we often imagine that we have a pool of hash functions to pick from (which we’d more commonly call a family of hash functions, just because it sounds nicer) and we choose one such function at random to use in a hash table.

This hits at a very large gap between the theory of hash functions and what software engineers likely encounter:

If you’re a software engineer, you think of each object as having “a” hash code. You might also think of switching between different hash functions like MurmurHash, SAX, etc., but only in specialized circumstances or for performance reasons.

If you’re a theoretician working on algorithms and data structures, you often think of having large families of hash functions to pick from, and design data structures on the assumption that you can choose from these hash functions at random if necessary.

That being said, there isn’t a complete disconnect between theory and practice here. After enough Nefarious Scoundrels started crafting HashDoS messages to web servers to slow them down, many language implementations (like the JVM) started adding some randomness to how they hash fundamental types like integers, strings, and arrays. This meant that primitive types would not produce the same hash codes from program run to program run, which was less than ideal if you liked precomputing hashes and storing them on disk, but makes it much harder to HashDoS a website.

Because the hash functions used in algorithm and data structure design typically look so different than the ones that are usually talked about at the language level, the qualities that are optimized over look quite different. Families of hash functions in Theoryland are usually categorized by the strength of the assumptions you can make about them. If you choose a random hash function from a family and evaluate it on a single input, can you predict with better than random chance what it will do on another input? If not, you have a 2-independent family of hash functions. If you pick a random hash function, can you say that any two objects, no matter how they’re chosen, have a low collision probability? If so, you have a universal family of hash functions.

To summarize, in the context of algorithms and data structures:

  • hash functions are external to the objects they work on and are not intrinsic properties;
  • hash functions typically have their codomain specified in a way that bakes in the size of the range that they’re mapping to;
  • hash functions are typically grouped into families, with the idea that a hash function will be picked out of the family at random; and
  • the “power” of a hash function is usually measured in terms of statistical guarantees and precise inequalities bounding collision probabilities, rather than in terms of speed or a more flexible idea of “well-distributed.”

Things Not Covered Here

This is, of course, a very quick and probably somewhat unfair characterization of what hash functions look like in programming practice and in the theory of algorithms and data structures. I didn’t talk about cryptographic hash functions, which get used all the time in both theory and practice. I didn’t address clever custom hash functions like locality-sensitive hash functions or the hash functions you might use in derandomization.

My intent is to help make it easier for people used to writing code and worrying about performance to be able to read about the theory behind why things work, and hopefully to learn a thing or two in the process of doing so. I think it would be great if more people knew about count-min sketches, or more clever hash tables like the FKS dynamic optimality scheme, or using LSH to speed up finding related items in a data set.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s