The data are/is in!

Recently, it has come to my attention that I use data as a singular noun, as in “The data is nominal” rather than the plural Latinate form that it technically is, as in “The data are nominal.” To those who brought it to my attention, it is a simple mistake. Fix it. Move on.

But, hold up. I’m not so sure about that. I do not like my natural language use corrected. I will use singular they in writing. I will use so as an adverbial connector while speaking. And, dammit, I will use data in the singular.

Is this just my youthful rebellious spirit? Am I on the vanguard of language change? Am I right? Are they wrong? The answer, as you can guess, is yes and no. Both “data is” and “data are” seem acceptable. Here’s the evidence:

1. Twitter

I first turned to Twitter with a poll.

Validation! But, I wanted a more reliable data source, so I began looking at more scientific linguistics tools.

 

2. Google N-Gram Viewer

The Google N-Gram viewer (see graph below) tells me that, although “data are” is more common, “data is” is also common and has been rapidly increasing in usage in books since the late 1940s. Interestingly, “data are” skyrocketed in the 1970s, peaked in the early 1980s (when I was born), and has been in free fall since then. “Data is” hit a small peak around 2000, when I was in college and probably was exposed to more data-based ideas. Like “data are,” “data is” has since been in decline (I have no guess as to why), and the two are now similar in frequency.

The conclusion here is that, although “data are” once reigned supreme and “data is” has climbed slowly in popularity, they have been able to literally meet each other in the middle.

Click to view chart

3. Corpus of Contemporary American English (COCA)

A general search of COCA tells me the following:

FORM      | FREQUENCY | EXAMPLE
data were | 4830      | Unfortunately, they did not analyze the impact of fees, as all data were net of fees…
data are  | 3088      | These data are sensitive and consumers have a right to decide whether or not they can be…
data is   | 2069      | …but the resulting data is complex and messy.
data was  | 1075      | Most of the data was culled from studies conducted from the mid-1990s forward at sites in Illinois, Indiana…

One thing that prompted me to do this research was the use of “data is/are” in academic contexts, so a little more searching in COCA revealed the following:

  1. “Data is” is 3.8 times more common in academic texts than in other genres.
  2. “Data are” is 16.8 times more common in academic texts.
  3. “Data were” is 88.9 times more common in academic texts. In fact, this phrase was quite rare in other genres.
  4. “Data was” is 10.8 times more common.

The conclusion from COCA is similar to the conclusion from Google’s N-Gram viewer: “data are” is more common, especially in academic texts, but “data is” is also common, even in academic texts.

4. Other Corpora

Have you visited the BYU corpus page lately? A few years ago, it was dominated by COCA and the BNC, but it has grown quite a bit since then, and we now have our choice of corpora ranging from News on the Web to Wikipedia and even TV and Movie corpora. I chose three to do a quick comparison of “data are/is”.

Source          | “data are” / “data were” frequency | “data is” / “data was” frequency
GloWbE          | 32 / 24                            | 654 / 206
TV Corpus       | 16 / 12                            | 327 / 103
News on the Web | 11,728 / 4,594                     | 83,123 / 18,782

What this data suggests is that “data is” may have grown in tandem with growth in the media, especially online media. This is evident from the higher frequency in the more informal corpora above and the peaks around the year 2000. This growth likely bled over into academic usage as more people were exposed to or participated in these newer forms of communication. However, it is important to recall that the evidence shows “data” as a singular noun existed prior to that growth, too. The growth in media likely amplified the movement of “data is” into common usage.

5. Latinate Cousins

“Data” is not the only plural word used as a singular. Others include agenda, bacteria, criteria, and media. These are all technically plural forms but are often used in the singular. Here is how they are used (from COCA):

Term     | “are / were” frequency | “is / was” frequency
Agenda   | 61 / 34                | 736 / 258
Bacteria | 338 / 156              | 108 / 37
Criteria | 448 / 457              | 119 / 80
Media    | 1060 / 376             | 1655 / 463
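To make the table easier to compare across terms, here is a minimal Python sketch that turns the counts above into a "singular share" for each word. The function name singular_share and the idea of pooling present and past tense are my own assumptions for illustration, not something COCA reports.

```python
# Counts copied from the COCA table above, in the order (are, were, is, was).
counts = {
    "agenda": (61, 34, 736, 258),
    "bacteria": (338, 156, 108, 37),
    "criteria": (448, 457, 119, 80),
    "media": (1060, 376, 1655, 463),
}


def singular_share(are, were, is_, was):
    """Fraction of be-verb hits that use singular agreement (is/was)."""
    plural = are + were
    singular = is_ + was
    return singular / (plural + singular)


shares = {term: singular_share(*c) for term, c in counts.items()}

# Print terms from most singular to least singular.
for term, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:<9} {share:.1%} singular")
```

On these counts, agenda is overwhelmingly singular and media leans singular, while bacteria and criteria still lean plural.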

And here is their growth from Google N-Gram Viewer:

Click to view chart

6. Is it a British thing?

I love British English and its quirky grammar and vocabulary, like “government are” rather than the American “government is”. So maybe the “data are/is” thing is a holdover from British English, since it does prefer plural forms. Interestingly, looking at the British National Corpus, “data are” and “data is” are almost equally frequent (491 vs 452, respectively). Looking at just the US and GB dialect forms in the GloWbE Corpus reinforced this. In fact, “data is” was more common than “data are” for all inner-circle and outer-circle English varieties:

Frequency for all English varieties in GloWbE Corpus

7. PLOS One

For my last check, I used AntCorGen to generate a corpus of 300 results sections from various peer-reviewed journals that are part of PLOS One and then ran a frequency search for “data are” and “data is”. The results were unsurprising: 72 for “data are” and 69 for “data is”. While “data are” had the higher total frequency, “data is” had greater range (that is, it appeared in more texts rather than appearing frequently in a single text): 40 vs 30.
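The difference between frequency and range is easy to see in code. This is only a rough Python sketch, not how AntConc computes its figures: the three mini "texts" are invented for illustration, and frequency_and_range is a hypothetical helper name.

```python
import re

# Hypothetical mini-corpus: three short texts invented for illustration.
texts = [
    "The data are shown in Table 1. The data are consistent. The data are robust.",
    "Our data is available online.",
    "The data is noisy but usable.",
]


def frequency_and_range(phrase, documents):
    """Return (total hits, number of documents with at least one hit)."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    hits_per_doc = [len(pattern.findall(doc)) for doc in documents]
    frequency = sum(hits_per_doc)
    doc_range = sum(1 for hits in hits_per_doc if hits > 0)
    return frequency, doc_range


for phrase in ("data are", "data is"):
    freq, rng = frequency_and_range(phrase, texts)
    print(f"{phrase}: frequency={freq}, range={rng}")
```

In this toy corpus, “data are” wins on frequency because one text repeats it, while “data is” wins on range because it shows up in more texts.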

8. Cognitive Linguistics

Thus far, I have explained “data” in the singular sense as a side effect of media growth. However, I think this is neither a fully accurate nor a fully plausible explanation. So, what is another way it can be explained? One way is through a cognitive linguistic understanding of “data”. In other words, how is “data” represented in the mind? Is “data” conceived of as numerous individual numbers or as a singular set of numbers? Is “data” the part or the whole? I’d argue that, since humans tend towards gestalt grouping, when we speak of data, we are not speaking of figures or numbers or even datums. We are, instead, referring to a singular group or body of information. This mental representation then, when reflected in our lexicogrammar, takes the singular form: “the set is,” “the information is”.

Limitations

I searched mainly for present and some past tense forms of the be verb. I did not, however, check agreement with other common collocating verbs such as show, indicate, suggest, etc.

Conclusion

My exploration through the data has confirmed my original hunch: both forms are acceptable and “data is” has been growing in popularity, growth which has been concomitant with the growth of the media. Furthermore, as data has become more important in our society, it has naturally taken on a singular cognitive gestalt representation.

What does this mean for my own writing? I will continue to use the singular form of data and will proudly point any naysayers to its evidence because the data is in and it clearly shows “data is” is OK!

 

Ants on a Blog – Specialist Corpora and ESP

A quick post today on how I used some specialist corpora during a workshop with visiting Chinese professors. This post is entitled “Ants on a Blog,” a pun that combines the American snack food (ants on a log) with the fact that I utilized two wonderful tools from Laurence Anthony: AntCorGen and AntConc.

The visiting professors come from different fields and I thought this would be a great opportunity during their orientation week to help them explore research trends in their field, common language used in their field, and pronunciation of discipline-specific vocabulary.

Building the Corpora

Building four different corpora? Yes! It only took about five minutes using AntCorGen. In AntCorGen, you simply select the field or subfield you wish to explore, select the type of information you want (e.g. abstract, methods, full-text), how many texts you want, and press “Create Corpus”. In my case, I created four corpora that consisted of 300 abstracts each. Here is an example:

Easy peasy.

Explaining the Purpose

My next step was to demonstrate AntConc and how a corpus is both used and useful. I showed them how to open the corpus and run basic searches using only the “Clusters/N-Gram” tab. I focused on this tab because you can sort single words by range, whereas in the “Wordlist” tab you can only sort them by frequency. For our purposes, range was more important because it showed how words were distributed across texts. Basically, range shows you which words many different people are using, while a very frequent word may simply be used heavily in just a few texts.

The usefulness of corpora is not always easy to grasp. Any English-language corpus will tell you the most frequent words are of, the, in, at. This is not useful stuff. By focusing on range, I explained that they could make guesstimates about trending concepts or research areas. Apart from that, I also explained that a corpus is not necessarily as useful for answering specific questions as it is for simply exploring how language is used. I told them we would be going on language adventures, and none of us could be sure what we would find. I also asked them to give me ideas on how it could be useful, and this immediately elicited responses about writing, especially using correct and common phrases.

Exploring the Corpora

I placed each corpus and a copy of AntConc on separate USBs and headed to the lab with the professors. We used AntConc with the purpose of finding research trends, frequently used words, hard-to-pronounce words (e.g. utilitarianism, pharmozoocognosy), and “interesting” combinations of two, three, or four words.

I gave them a short worksheet to complete independently and offered individual feedback on their searches. One of the activities was about finding hard-to-pronounce words, and when I saw that they had listed about 8-10, I offered one-on-one pronunciation instruction and feedback. What was great about this was not so much the one-off pronunciation practice of infrequent words but the rules these words embodied regarding stress placement, unstressed vowel placement, phonics and word origin (e.g. “ch” in most academic or scientific words is likely to have a /k/ sound due to the words’ Greek origin), and chunking multisyllable words. Some wrote down acronyms or website names thinking these were words (they lose their capitalization in the “Clusters” tab), so I showed them how to examine the concordances for meaning and how to look at the word in its entire context.

The worksheet I used is here. It contains the activities as well as instructions for the different types of analyses.

Takeaways

I think the professors enjoyed exploring their field’s language usage. They found the pronunciation activities very fun and were surprised at some of the words, and the variations of those words, that they found. For example, using the “Regex” option, one professor and I found many different words containing “phono” and explored their meanings. We also enjoyed reviewing the Greek mathematical letters that appeared.
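For readers who want to try the same kind of “phono” search outside AntConc, here is a small Python sketch. It is an approximation under stated assumptions: the sample sentence is invented, and Python’s regex flavor may not match AntConc’s “Regex” option exactly.

```python
import re

# Hypothetical snippet standing in for a few abstracts.
sample = (
    "Phonological awareness predicts reading. Phonons carry heat in solids. "
    "The phonotactic constraints and phonology of the language were compared."
)

# \w* on both sides lets "phono" sit anywhere inside a word.
pattern = re.compile(r"\b\w*phono\w*\b", re.IGNORECASE)

# Lowercase and de-duplicate so each word form is listed once.
matches = sorted({m.lower() for m in pattern.findall(sample)})
print(matches)
```

Listing the unique word forms this way gives a quick set of related words to look up and practice pronouncing.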

These professors are experts in their fields, and while they do often communicate with each other and other international colleagues in English using discipline-specific language, any common ELF communication patterns could cause minor (probably not major) issues on an American campus. I thought that such independent explorations and feedback could only benefit them and give them the tools to do further exploration on their own, thus allowing them to be in even more control of their expertise. And many of them said they would in fact download and use these tools again to help with their writing.

I was happy to see that I was able to spark genuine interest in not only the corpus tools but how language is being used in their field. I hope to get more opportunities like these in the future.

To be or not to be or to not be: An exploration of corpora and viscera

The sentence was “Learn personal safety techniques, but I urge you to not buy a gun.” It appeared in a proofreading exercise on errors in gerund and infinitive usage. Though I had not taught it, many students highlighted the “to not buy” part and corrected it to “not to buy”. I told one of my students that either is acceptable, and he said to me, “That feels weird.” This made me think of two things. First, this student has internalized a grammatical structure to the point where it has a sense of visceralness on par with that of “native speakers”. Second, am I wrong? In this blog post, I will mostly focus on the latter thought, but I will come back to the more philosophical implications of the former.

To me, the placement of “not” with regard to an infinitive is fluid. It feels right to me in either place, though coming right before the verb does also have a feeling of emphasis as opposed to coming before “to”. I have been corrected on this before by a well-respected colleague I work with (one who I really enjoy getting into playful language tiffs with), but I always feel many of their corrections come down to prescriptivism and style rather than straight-up grammar (we still argue about singular “they”). So, in order to answer my question of whether “to not” or “not to” is correct, I turned to my friends Google and COCA.

A Google n-gram search for “not to, to not” returned the following:

[Chart: Google n-gram results for “not to” and “to not”]

Hmm…maybe I am wrong. “To not” barely lifts its head in recognition. But, what’s this? “Not to” seems to be falling with a slight upward tilt at around the same time “to not” makes an appearance. Is one trying to assert its dominance? That is probably a different story. “To not” exists, but may not be as common as thought, at least in books edited by those who follow style guides.

What about COCA?

Well, before drinking a cup of COCA, I noticed that the great corpus gods at Brigham Young have transformed the Google n-gram corpus into a POS-tagged database, which could give me a better look at the above search. A search for “not to [vv0*]”, that is, “not to” + base verb form gave me the following…

[Chart: BYU Google n-gram results for “not to [vv0*]”]

…and “to not [vv0*]” gave me…

[Chart: BYU Google n-gram results for “to not [vv0*]”]

While the actual token counts are still far lower for “to not” than for “not to,” “to not” almost doubled from 1990 to 2000, while “not to” has clearly been on a slow decline. Interesting. Six years later, this trend is likely continuing.
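If you want to try a similar comparison on your own texts, a rough Python sketch follows. It is not the POS-tagged “[vv0*]” query used above: without tags, a plain regex cannot tell infinitival “to” from prepositional “to”, so it simply counts any word after each pattern. The sample text is invented, apart from the proofreading sentence quoted earlier in this post.

```python
import re

# Hypothetical sample; the first sentence is the proofreading sentence from this post.
text = (
    "Learn personal safety techniques, but I urge you to not buy a gun. "
    "She decided not to go. He promised not to tell anyone. "
    "Try to not worry about it."
)

# Without POS tags we cannot require a base verb, so any following word counts.
NOT_TO = re.compile(r"\bnot to (\w+)", re.IGNORECASE)
TO_NOT = re.compile(r"\bto not (\w+)", re.IGNORECASE)

# Each list holds the word that follows the matched pattern.
not_to_hits = NOT_TO.findall(text)
to_not_hits = TO_NOT.findall(text)

print("not to:", len(not_to_hits), not_to_hits)
print("to not:", len(to_not_hits), to_not_hits)
```

Reading through the captured words is a quick sanity check that the hits really are verbs before trusting the counts.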

Time to do some lines of COCA:

“not to”

[Chart: COCA results for “not to”]

“to not”

[Chart: COCA results for “to not”]

COCA mirrors the rise of “to not” from Google, especially in spoken English, though it is not absent in academic English. In fact, here are some KWIC examples of “to not” in Academic English:

[Image: KWIC concordance lines for “to not” in COCA academic texts]

All of this data tells me several things. First, “to not” is on the rise, most likely due to the fact that the ability to separate an infinitive has become more accepted and “to not” has probably rolled in through a snowball effect. Second, the placement of “not” does not necessarily imply emphasis, as can be seen in the sentences above. Third, while my speech may make some of the older generations shake their fists in anger, possibly telling me I am killing English, I can now reply confidently that my speech is the vanguard of an English where “not” is as placement-fluid as “they” is gender-fluid. My speech may be a speech that is likely to boldly go where few have gone before. Or to not boldly go, because language change is really unpredictable, and this is just a tiny thing. Of course, I wouldn’t actually say any of this. I’m neither a grammar pedant nor an in-your-face defender of anything-goes linguistic descriptivism.

However, the last thing it tells me is that grammar is not correct because of writers, style guides, or lines of random sentences. No, grammatical correctness, and what is “correct” to a “native speaker,” is something visceral. It is what “feels” right. Language is not a set of rules but a shared set of feelings about how we communicate, passed on to us as naturally as other concepts, such as love or morality. That is, we begin learning these things at or before birth from family, friends, and our environment. Of course, for second language students, language gets internalized later and in different ways, but at some point, things do get internalized. Students begin to develop gut feelings about the language based on prior experiences, whether or not we consider them correct. Language is the internal made external, and what comes out is never based on a set of rules, but on what “feels” right and has felt right since we began listening to our first sounds of the language.

So, to me, both forms feel right and I am correct. To my student, one form feels right and they are correct. To teach or prescribe otherwise would be to not follow the spirit of communication and to deny the very “feeling” of being a speaker of a language.

(Updated and edited for typos and clarity.)