Sometimes, data are avocados. Sometimes, data is wine.

I just finished teaching a course in data structures. Along the way, I found myself reflecting on what exactly we mean by “data.” The word “data” itself leads a double life. In one sense, data are avocados. In another, data is wine. Which of these sentences is grammatically correct depends on the context and on how we view data.

The word “data” originated as the plural form of the word “datum,” where “datum” means “a single piece of information.” A single temperature measurement is a datum; temperature measurements spanning a whole week are data. If you think of data in this way, to be grammatically correct, you would use third-person plural verbs when talking about data. For example:

  • “The data show X and Y” rather than “The data shows X and Y.”
  • “The data are from X and Y” rather than “The data is from X and Y.”

In this sense, data are avocados. If you want to check whether a sentence involving the word “data” is grammatically correct, apply the “Avocados Test”: replace the word “data” with “avocados” and see whether the result is still grammatical. That is:

  • “The avocados show X and Y” is correct; “The avocados shows X and Y” is incorrect.
  • “The avocados are from X and Y” is correct; “The avocados is from X and Y” is incorrect.

On the other hand, in many cases we don’t think of “data” as a collection of discrete units, each of which is a “datum.” Instead, we now have so much data that we think of data as a continuous entity not made of individual pieces. That is, instead of thinking of “data” as a count noun, something where you can say “one datum” or “two data,” we think of “data” as a mass noun, where you can say “some data” but not “two data.”

For example, take data structure, data science, or data entry. These all fail the Avocados Test. We wouldn’t say avocados structure, avocados science, or avocados entry. If we wanted to think about ways of organizing avocados, or studying their deliciousness, or putting them into containers, we’d talk about an avocado structure, the field of avocado science, and the job of avocado entry.

In this sense, data is wine. You have a wine cellar, not a wines cellar. You have a wine tasting, not a wines tasting. (The analogy is not perfect; you can have “ten wines” if you are talking about types of wine, and I suspect “ten data” or “ten datas” would be pretty odd to think about.) Let’s call this the “Wine Test”: if the sentence remains grammatically correct after you replace “data” with “wine,” you’re treating “data” as a mass noun, something that isn’t made of individual constituent parts.

If you think of data this way, the correct choice of verbs to use with the word “data” flips. Take the two earlier example sentences: “The data show X and Y” and “The data are from X and Y.” These were fine with the Avocados Test, but they fail the Wine Test:

  • “The wine show X and Y” is wrong; “The wine shows X and Y” is correct.
  • “The wine are from X and Y” is wrong; “The wine is from X and Y” is correct.
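
Both tests are plain word substitution, so they are easy to sketch in code. Here is a minimal Python sketch. The function names `substitute`, `avocados_test`, and `wine_test` are my own inventions for illustration, and the code only performs the swap; judging whether the result sounds right is still up to you.

```python
import re


def substitute(sentence: str, stand_in: str) -> str:
    """Replace each whole-word, lowercase occurrence of "data" with stand_in."""
    # \b keeps words such as "database" or "metadata" untouched.
    return re.sub(r"\bdata\b", stand_in, sentence)


def avocados_test(sentence: str) -> str:
    """Swap in a plural count noun; read the result aloud to judge it."""
    return substitute(sentence, "avocados")


def wine_test(sentence: str) -> str:
    """Swap in a mass noun; read the result aloud to judge it."""
    return substitute(sentence, "wine")


if __name__ == "__main__":
    for sentence in ["The data show X and Y.", "The data is from X and Y."]:
        print(avocados_test(sentence))
        print(wine_test(sentence))
```

Running it prints each example sentence twice, once per test, so you can see at a glance which version sounds right.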

In rereading this post, I’ve noticed that I default to treating data as a mass noun. Take the phrase “we now have so much data.” Notice that

  • this fails the Avocados Test: “we now have so much avocados” is wrong; but
  • this passes the Wine Test: “we now have so much wine.”

However, in rereading some of my other writing and paying close attention to how I speak, I’ve found that when I use “data” as the subject of a verb, I tend to treat it as a count noun and, accordingly, use plural verbs. Think of it as a decidedly more banal version of wave/particle duality; is this avocados/wine duality?

I’m curious to see if there are trends in how the word “data” is used in different disciplines. I suspect that in computer science, it’s more commonly treated like wine than like avocados. How about in biology? Would we find data to be avocados there? How about philosophy?

Addendum: After finishing the draft of this post, I found this lovely discussion of the word “data” on Wikipedia, which talks about how “data” and “datum” are used in technical contexts in different fields. How fascinating!