To be or not to be or to not be: An exploration of corpora and viscera

The sentence was “Learn personal safety techniques, but I urge you to not buy a gun.” It appeared in a proofreading exercise on errors in gerund and infinitive usage. Though I had not taught it, many students highlighted the “to not buy” part and corrected it to “not to buy”. I told one of my students that either is acceptable and he said to me, “that feels weird”. This made me think of two things. First, this student had internalized a grammatical structure to the point where it had a sense of visceralness on par with that of “native speakers”. The second thought was, am I wrong? In this blog post, I will mostly focus on the latter thought, but I will come back to the more philosophical implications of the former.

To me, the placement of “not” with regard to an infinitive is fluid. It feels right to me in either place, though coming right before the verb does also have a feeling of emphasis as opposed to coming before “to”. I have been corrected on this before by a well-respected colleague I work with (one who I really enjoy getting into playful language tiffs with), but I always feel many of their corrections come down to prescriptivism and style rather than straight-up grammar (we still argue about singular “they”). So, in order to answer my question of whether “to not” or “not to” is correct, I turned to my friends Google and COCA.

A Google n-gram search for “not to, to not” returned the following:


Hmm…maybe I am wrong. “To not” barely lifts its head in recognition. But, what’s this? “Not to” seems to be falling with a slight upward tilt at around the same time “to not” makes an appearance. Is one trying to assert its dominance? That is probably a different story. “To not” exists, but may not be as common as thought, at least in books edited by those who follow style guides.

What about COCA?

Well, before drinking a cup of COCA, I noticed that the great corpus gods at Brigham Young have transformed the Google n-gram corpus into a POS-tagged database, which could give me a better look at the above search. A search for “not to [vv0*]”, that is, “not to” + base verb form, gave me the following…

…and “to not [vv0*]” gave me…


While the actual token counts are still far lower for “to not” than for “not to,” “to not” almost doubled from 1990 to 2000, while “not to” has clearly been on a slow decline. Interesting. Six years later, this trend is likely continuing.
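For anyone who wants to try the same comparison on their own texts, here is a minimal Python sketch. It is not how Google n-grams or the BYU interface work: there is no POS tagging, so the word after “not to” or “to not” is not guaranteed to be a verb, and the function name and sample sentences are made up for illustration.

```python
import re
from collections import Counter


def count_not_placement(text):
    """Count 'not to X' and 'to not X' sequences in a text (case-insensitive)."""
    # Crude tokenizer: lowercase runs of letters and straight apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    # Slide a three-word window so that some word always follows the pair.
    # No POS tagging is done, so the third word may not be a verb.
    for first, second, _third in zip(words, words[1:], words[2:]):
        if (first, second) == ("not", "to"):
            counts["not to"] += 1
        elif (first, second) == ("to", "not"):
            counts["to not"] += 1
    return counts


# Invented sample sentences, not corpus data.
sample = (
    "I urge you to not buy a gun. She told me not to worry. "
    "He decided not to go, and they chose to not argue."
)
print(count_not_placement(sample))
```

Run over a larger collection of texts from different years or genres, counts like these would give a rough, untagged version of the trend lines discussed here.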

Time to do some lines of COCA:

“not to”


“to not”


COCA mirrors the rise of “to not” from Google, especially in spoken English, though it is not absent in academic English. In fact, here are some KWIC examples of “to not” in Academic English:


All of this data tells me several things. First, “to not” is on the rise, most likely due to the fact that the ability to separate an infinitive has become more accepted and “to not” has probably rolled in through a snowball effect. Second, the placement of “not” does not necessarily imply emphasis, as can be seen in the sentences above. Third, while my speech may make some of the older generations shake their fists with anger, possibly telling me I am killing English, I can now reply confidently that my speech is the vanguard of an English where “not” is as placement-fluid as “they” is gender-fluid. My speech may be a speech that is likely to boldly go where few have gone before. Or to not boldly go, because language change is really unpredictable, and this is just a tiny thing. Of course, I wouldn’t actually say any of this. I’m neither a grammar pedant nor an in-your-face defender of anything-goes linguistic descriptivism.

However, the last thing it tells me is that grammar is not correct because of writers, style guides, or lines of random sentences. No, grammar correctness, and what is “correct” to a “native speaker” is something visceral. It is what “feels” right. Language is not a set of rules but a shared set of feelings about how we communicate, passed on as naturally to us as other concepts, such as love or morality. That is, we begin learning these things at or before birth from family, friends, and our environment. Of course, as second language students, language gets internalized later and in different ways, but at some point, things do get internalized. Students begin to develop gut feelings about the language based on prior experiences, whether or not we consider them correct. Language is the internal made external, and what comes out is never based on a set of rules, but what “feels” right and has felt right since we began listening to our first sounds of the language.

So, to me, both forms feel right and I am correct. To my student, one form feels right and they are correct. To teach or prescribe otherwise would be to not follow the spirit of communication and to deny the very “feeling” of being a speaker of a language.

(Updated and edited for typos and clarity.)

Research Bites: The Relevance of the Academic Vocabulary List (AVL)

Durrant, P. (2016). To what extent is the Academic Vocabulary List relevant to university student writing? English for Specific Purposes, 43, 49–61.

Durrant compares the Academic Vocabulary List (AVL, Gardner and Davies, 2014) to university writing in order to understand how academic vocabulary is actually represented in undergraduate and graduate writing.

The Wordlists

The AVL is a more updated version of the popular Academic Word List. There are some important differences between the two:

| Academic Word List (Coxhead) | Academic Vocabulary List (Gardner and Davies) |
| --- | --- |
| based on a 3.5-million-word academic corpus | based on 120 million words from the Corpus of Contemporary American English (COCA) |
| based on headwords, without regard to the different meanings caused by changes within word families | based on lemmas (“headwords plus inflectionally-related forms”) to take into account the various meanings of word forms |
| based on the General Service List of high-frequency general English words, which may contain words that also have academic uses (e.g. address) but are not included in the AWL | not based on any pre-existing list |

The Problem

Durrant’s research aims to provide insight into just how relevant the AVL is. One criticism of wordlists is that vocabulary varies too much by discipline for any single list to be of value. Another argument is that wordlists are more useful (insofar as they are actually useful) for reading texts, not producing them. In other words, their productive value is questionable.

The Research

The research compared the AVL word list to the British Academic Written English (BAWE) corpus. Durrant looked at overall use of the AVL, as well as variation by student level, discipline, and genre.

The Findings and Conclusion

  • The AVL accounts for about 34% of the lexical words in the BAWE
    • 20% of this is covered by only 313 words
    • The most frequent 32 AVL items account for 5% of the BAWE lexical items
  • The AVL accounts for slightly more of students’ vocabulary use as their academic levels rise
  • There is wide variation between disciplines
    • While 313 AVL words account for 20% of the BAWE overall, the number of words needed to reach 20% varies greatly by discipline
      • 106 words in architecture account for 20%
      • 1,312 words in classics account for 20%
      • The median is 194
    • There is some overlap between certain disciplines
      • For example, 40 words from the AVL account for 10% of words in linguistics and physics (17% of items are shared)
        • The three words that cover 5% of the BAWE in these disciplines are however, therefore, and theory
      • About 30% of AVL represents shared words which account for 20% of the BAWE
  • There is significant but small variation between text genres
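To make the idea of “coverage” in the findings above concrete, here is a minimal Python sketch. It is not Durrant’s procedure: it matches raw word forms rather than lemmas, counts all tokens rather than only lexical words, and uses an invented toy word list and sentence instead of the AVL and the BAWE. The function names are hypothetical.

```python
from collections import Counter


def coverage(tokens, word_list):
    """Return the share of tokens (0 to 1) that appear in word_list."""
    listed = set(word_list)
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in listed) / len(tokens)


def words_needed(tokens, word_list, target):
    """Return how many of the most frequent listed words reach `target` coverage.

    Returns None if the whole list cannot reach the target.
    """
    listed = set(word_list)
    counts = Counter(token for token in tokens if token in listed)
    covered = 0
    # Add listed words from most to least frequent until the target is met.
    for rank, (_word, freq) in enumerate(counts.most_common(), start=1):
        covered += freq
        if covered / len(tokens) >= target:
            return rank
    return None


# Toy data (assumption): a made-up "corpus" and a made-up academic word list.
toy_tokens = (
    "however the theory suggests that the data therefore "
    "support the theory however"
).split()
toy_list = ["however", "therefore", "theory", "data"]

print(coverage(toy_tokens, toy_list))
print(words_needed(toy_tokens, toy_list, 0.3))
```

Running `words_needed` separately on the texts from each discipline would give per-discipline figures of the same kind as the 106 (architecture) and 1,312 (classics) reported in the findings.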

The Implications

A relatively small number of AVL words account for a great deal of academic writing, while about half contribute very little in terms of coverage. However, the words that do contribute a great deal of coverage vary by discipline. Durrant acknowledges that these results may seem to imply that discipline-specific vocabulary teaching is warranted. Nevertheless, he argues that this is usually neither practical nor desirable “given the cross-disciplinary nature” of academia. Durrant recommends focusing on the most frequently overlapping words (427 lemmas) and then moving on to either more discipline-specific lists or vocabulary strategies such as inferring meaning or skipping unknown words (here, he refers to Nation’s [2011] “Learning vocabulary in another language”).

I have adapted the word list from Durrant’s work into an Excel file. The file contains the most common academic words that are shared among 30 disciplines, sorted by part of speech and frequency. Please click here to download it.

Research Bites: Take Care of Your Concordancing – Using Corpora for Self-Correction

I have written about corpora, concordancing, and DDL on this site before. Last year, my colleague and I completed a semester-long quantitative research project and co-wrote a paper on using DDL in the classroom (which has now been rejected three times!). I used to be a big fan of teaching students how to use these tools as an alternative reference and learning resource. However, due to a lack of patience with computer-illiterate “digital natives”, heaps of incomprehensible input that is difficult for learners to parse, and the paucity of the linguistic sixth sense among students, this kind of practice fell out of favor with me. Then, I stumbled upon Cynthia Quinn’s (2014) article in ELT Journal, and now the interest has been slightly rekindled. A snowball effect took place after reading this article, and I was happy to find a number of new corpus tools and active corpus linguistics websites. I’m not sure what effect this will have on my teaching, but I do present to you the latest Research Bites.


Quinn, C. (2014). Training L2 writers to reference corpora as a self-correction tool. ELT Journal. [$link]

Twitter Summary

New on #researchbites: Quinn shows how to scaffold #corpus use to teach error correction #ddl #corpuslinguistics

KWIC Vocabulary Review Activity

I am a big fan of data-driven learning (DDL) and using linguistic tools (such as COCA, StringNet, and Word and Phrase) in the classroom in order to give students a different perspective on language study. I had been looking through a wonderful little book called “Classroom Games from Corpora” by Ken Lackman and modified one of his activities for one of my classes, with good results. This book presents a number of useful activities that can be used in the classroom after a little preparation from your favorite corpus.

So, I would like to present a fun and useful activity modified from Lackman’s “Guess the Missing Word” activity (p. 14) that can be adapted and expanded in a myriad of different ways. It is based on showing a list of KWIC (keyword in context) concordances with the keyword missing. Students will have to guess the missing word during a line race activity. This is a simple review activity (see note 1 at the end of this post) that works well at the beginning of class, as it gets students up out of their seats and moving around a bit.

Basically, students will be split into two teams and make a line on either side of the screen or board. The instructor will show a slide with a number of concordances all missing the same keyword. The first to guess the keyword wins a point for their team. They go to the back of the line and the next two students continue the game. Afterwards, students can be given the complete concordances and asked to search for common patterns (collocations, colligations), which can then be discussed together as a class.


This activity assumes you have PPT or some way to project something on a screen. If you don’t, it can still be done with the modified KWIC concordances printed (see note 2 at the end of this post).


(Note: there is a video of this method below.)

  1. Using Word and Phrase (or COCA; see note 3 at the end of this post), search for your missing word. You may have to select the correct word form if your word can also be used as a verb, noun, adjective, etc.
  2. Copy all the KWIC concordances at the bottom. The quickest way to do this is to click on the concordance frame and do CTRL+A (select all) and CTRL+C (copy). Then paste them into Excel and remove the first 8 rows (which will leave you with only the concordances) and the first 2 columns. Resize columns to your liking.
  3. Now, for this activity, a maximum of 20 concordances is suggested. You probably have over 100, so select some rows to delete and whittle down until you have 20. I wouldn’t recommend randomly removing rows. Instead, consider what common words, parts of speech, or patterns you think students already know, or you will want them to study as part of the expansion.
    1. For example, for the verb “commit”, most common words on the right would be a crime noun, “by” for the passive form, and maybe “in” for places.
  4. Do steps 1-3 for each vocabulary word. It only takes a few minutes once you’ve done it a few times. Add each word to a separate tab/sheet in Excel.
  5. When you have all your words, in Excel, choose a sheet, copy all the words in the three columns, and paste them into PowerPoint. Resize and format as necessary.
  6. Delete the keywords in the middle.
  7. Repeat this, with each set of concordances on a different slide.
  8. Add whatever bells and whistles you want.
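If you would rather build the gapped lines from your own sentences than copy them from a corpus site, the steps above can be roughly imitated in code. This is only a sketch under assumptions: the function name `kwic_with_gap` is hypothetical, the sentences are invented rather than taken from COCA or Word and Phrase, and the pattern treats any word starting with the keyword as a form of it (so “commit” would also catch “commitment”).

```python
import re


def kwic_with_gap(sentences, keyword, width=30, gap="_____"):
    """Return KWIC-style lines with the keyword replaced by a gap."""
    # Match the keyword plus any trailing letters (commits, committed, ...).
    pattern = re.compile(rf"\b{re.escape(keyword)}\w*", re.IGNORECASE)
    lines = []
    for sentence in sentences:
        match = pattern.search(sentence)
        if match is None:
            continue  # skip sentences that do not contain the keyword
        # Keep up to `width` characters of context on each side.
        left = sentence[:match.start()][-width:]
        right = sentence[match.end():][:width]
        # Right-align the left context and left-align the right context
        # so that the gaps line up in a column, as in a concordance.
        lines.append(f"{left:>{width}}{gap}{right:<{width}}")
    return lines


# Invented example sentences, not corpus data.
sample_sentences = [
    "He was accused of helping to commit a crime in Ohio.",
    "The offences were committed by a small group.",
    "Nobody knows who commits these robberies.",
    "This sentence has no target word.",
]

for line in kwic_with_gap(sample_sentences, "commit"):
    print(line)
```

The printed lines line up best in a monospaced font when pasted onto a slide. Because the whole matched form is blanked out, endings such as “-ted” do not give the answer away.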


  1. Students are split into two teams (evenly or randomly; more than two teams is also possible).
  2. Each team forms a line facing the board.
  3. The teacher explains/demonstrates/models the activity:
    1. The students at the front will see a list of sentences missing a common vocabulary word.
    2. Whoever guesses the word first wins a point for their team.
    3. Both students will go to the back of the line and the next students will continue the game.
    4. The team with the most points is the winner.

Here is an example PPT of concordances I made. It’s really nothing special. For this PPT, I made the concordances in Word, put them into a PDF (for an unrelated reason), and then took a snapshot of the concordances in the PDF and pasted them into PPT. Sounds complicated but it probably took me 2 minutes.


There are a number of ways to expand this activity. Here is one idea I had:

  1. After playing the review game, each group is given one set of complete concordances, so that each group has a different set.
  2. Students look at the examples and try to find any apparent language patterns.
  3. Meanwhile, the teacher writes the vocabulary words on the board.
  4. After a set amount of time, the students come to the board and write the language patterns they found under the respective keyword.
  5. A short class discussion is held.

Some other ideas would be to have students fill in the keyword using the correct verb form (if you are looking at verbs) or to randomly mix concordances so that one set has a range of different vocabulary words. Then students must guess what words are possible there. I did this last idea in class with adjectives where multiple adjectives are possible. It was a good activity, though it seemed difficult for students.

So, what do you think of this activity? Have you tried something similar? Do you have any comments or suggestions?


  1. You could also get students to investigate these words on their own for homework. This may make them better prepared for this activity.
  2. If you don’t have the technology needed for this activity, you can simply print each concordance set on a piece of paper. Instead of showing the set on screen, hand one set to each student at the front of the line. Do this at the same time and it is just like looking at one projected screen. They can then keep the set to use during the expansion stage.
  3. With Word and Phrase, you can only search for one-word terms. However, you can click almost any word in the concordances to show it with the keyword. This might help you find more specific instances of the term you are looking for. You can also click on the words in the “Collocates” box and choose different examples of different collocates. Copy and paste them as you would above. You’ll just have to remove more rows in Excel.