7 Techniques for Mining Vocabulary

Selecting vocabulary to focus on from a text is not always as simple as reading the text and picking out words. It’s hard to determine the frequencies and relevance of words just from reading. So, I’d like to share several methods I use – often in conjunction – to decide what vocabulary I want my students to focus on.

To Pre-Teach or Not To Pre-Teach?

Whether or not to pre-teach vocabulary is a somewhat contentious issue, and I leave it up to teachers and their contexts to decide what is right. In my own context, I mostly pre-teach, or pre-expose students to, vocabulary. This is because I deal with intensive reading, which involves challenging texts that include difficult vocabulary. I want my students to go into their readings armed with enough vocabulary that they do not feel completely overwhelmed. Since these are challenging texts, and being able to guess words from their context requires knowing 95% of the surrounding words, relying solely on context is not a sound strategy. In addition, vocabulary is learned through multiple exposures, and I believe that pre-teaching counts as one. Working with the vocabulary first is likely to lead to greater recognition and internalization than other techniques. Finally, I do not pre-teach every word, especially if it is one I know students can get from context, inference, or related words they already know. However, pre-teaching works for us most of the time.

1. Starting At the Source

This may seem obvious, but the best thing to do when choosing vocabulary is to have a manipulable text. If your text is digital, you are already ahead of the game. All you have to do now is copy and paste. But if you are working with textbooks (readings or transcripts of lectures or conversations), a clean scan is required, followed by an OCR rendering. OCR makes the scanned “picture” readable by making the text recognizable by computers. If you have Adobe Acrobat, there is a built-in function for this. If not, you can use a free online service such as Free Online OCR. OCR is not always 100% accurate, and how accurate it is depends on the scan quality. I still get things such as “are” rendered as “arc” or “history” as “h!story”, but for the most part this is not a problem, and it is easily fixed in Word.
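If you would rather not hunt for OCR slips by eye, a rough check like the one below can flag some of them. This is only a sketch in Python: the function name `suspicious_tokens` is my own invention, and it only catches words with a stray symbol inside them (like “h!story”). It will not catch real-word errors such as “arc” for “are”, and it will also flag legitimate items like “e.g.”, so the results still need a human look.

```python
import re

# A stray non-letter character sandwiched between letters, e.g. "h!story".
# Hyphens and apostrophes are allowed because they occur in normal words.
STRAY_CHAR = re.compile(r"[A-Za-z]+[^A-Za-z\s'’\-]+[A-Za-z]+")

def suspicious_tokens(text):
    """Return whitespace-separated tokens that look like OCR damage."""
    return [token for token in text.split() if STRAY_CHAR.search(token)]

# Hypothetical sample sentence with two OCR errors; only one is detectable.
sample = "The h!story of sleep is well-known, and they arc tired."
print(suspicious_tokens(sample))
```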

Starting at the source in Acrobat

Copying to a text file is essential for some of the tools below.


How to OCR in Acrobat Pro

2. Word Lists

Because I deal with academic texts, I use the Academic Word List (AWL) and the Academic Vocabulary List (AVL) to find words in the text that are important for academia. I have just started using Laurence Anthony’s free AntWordProfiler to compare target texts against lists. It’s quite simple to use and already comes preloaded with the AWL.

Download and run it (no installation necessary). There are three panes. The top pane holds your target text(s). The bottom pane holds your word lists. The right pane is the output, which includes the words in the text that are found on the word list, their frequencies in the text, and other pertinent information.

First, clear out the GSL lists in the bottom pane. That will leave you with only the AWL. You can add the AVL by first downloading my very simple version of it here, then clicking “Choose” to add it. Click “Choose” in the top pane and select your text (which should be in a .txt file). Click “Start” at the bottom and the results will be printed on the right. Here are two examples:


My text compared against the Academic Vocabulary List. Words such as research, change, following, increase, system, and term have a frequency greater than 1 and appear on the Academic Vocabulary List.

AWL results

My text compared against the Academic Word List. Words such as research, consist, psychology, aware, estimate, function, benefit, etc. have a frequency greater than 1 and appear on the Academic Word List.

Having this information helps me quickly sort through what students likely need to know, what they can figure out, and what they already know.
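For anyone curious about what this kind of comparison involves, here is a minimal sketch in Python. It is not how AntWordProfiler itself works; the function name `profile_text`, the tiny word list, and the sample text are all made up for illustration. It also matches exact word forms only, whereas the AWL groups words into families.

```python
import re
from collections import Counter

def profile_text(text, word_list):
    """Return (word, count) pairs for list words found in the text,
    most frequent first, with ties in alphabetical order."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(tokens)
    targets = {word.lower() for word in word_list}
    hits = [(word, n) for word, n in counts.items() if word in targets]
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical mini word list and sample text, for illustration only.
sample_list = ["research", "estimate", "function", "benefit"]
sample_text = ("Research shows that sleep has a function. "
               "More research may estimate the benefit.")

for word, count in profile_text(sample_text, sample_list):
    print(word, count)
# research 2
# benefit 1
# estimate 1
# function 1
```

In practice the word list would be read from a file with one word per line, much like the .txt lists the profiler loads.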

3. Highlighter

An online tool that is related to the above word lists is the AWL Highlighter. Input text (up to 2400 characters) and select what level of sublist you’d like to search (there are 10, with the first having the most common words) and then hit submit. The website will bold the academic vocabulary. You can also select the gap-fill option to make the academic vocabulary disappear! The AWL Highlighter is a pretty good tool for quickly noticing academic vocabulary in context.

Text goes in…

…text comes out
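The highlighting and gap-fill ideas are easy to imitate for texts longer than the site allows. The sketch below is my own approximation in Python, not the AWL Highlighter's code: `mark_words` and `BLANK` are hypothetical names, asterisks stand in for bold, and the word list is a made-up sample rather than a real AWL sublist.

```python
import re

BLANK = "_" * 8  # fixed-width gap so the blank does not give away word length

def mark_words(text, word_list, gap_fill=False):
    """Wrap list words in asterisks, or replace them with blanks."""
    targets = {word.lower() for word in word_list}

    def replace(match):
        word = match.group(0)
        if word.lower() not in targets:
            return word  # leave non-list words untouched
        return BLANK if gap_fill else f"**{word}**"

    return re.sub(r"[A-Za-z]+", replace, text)

# Hypothetical sample sentence and list.
sentence = "Research has a function."
words = ["research", "function"]
print(mark_words(sentence, words))
print(mark_words(sentence, words, gap_fill=True))
```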

4. Vocab Grabber

Another one of my favorite vocabulary mining tools is Visual Thesaurus’ Vocab Grabber. Paste in your text and click “Grab!”. It compares the text against its own word lists and then presents the text to you either as a cloud or a sorted list, which can then be filtered by subject and by list level. I typically arrange it as a list by frequency, and then look at each level individually. Levels 1 and 2 are the most common words. Relevant vocabulary typically appears in lists 3, 4, and 5.

What’s more, Vocab Grabber allows you to quickly get the definition of the word, see the word in its contexts in the text, and, this being Visual Thesaurus, get a visual word association map of the text.


Unorganized Text from Vocab Grabber


Sorted by frequency, words from list 4


How words can be viewed: mind map, definition, examples from text

5. Word Clouds

I use word clouds more as an embellishment on my PPT slides, or as a warmer activity, than for actually mining vocabulary. However, visually displaying a text as a word cloud sometimes reveals vocabulary you may have otherwise missed. Tagxedo is the best word cloud generator I have found, especially as it allows you to organize your clouds into different shapes, and it has loads of customization features. Unfortunately, it doesn’t run on Chrome, but Firefox and IE work well.

The article’s theme was about sleep, so I made the word cloud into a “dream cloud”.

6. Manual Mining

Using the above resources is great. However, manual mining of a text through skimming, scanning, or reading is still useful most of the time. This is especially true for multi-word phrases that the automatic mining tools may miss. For example, the phrase “sleep debt” appears several times in the article but never appeared in any of the lists as a chunk. I wouldn’t actually define this phrase, as deciding on its meaning would be done via discussion. However, I would define other phrases like sleep deprivation, fight off, cut short, wreak havoc, long haul, etc.
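Manual mining can be given a head start by listing word pairs that repeat in the text, which is how a chunk like “sleep debt” would surface. The Python sketch below is just one possible approach, with a hypothetical function name (`repeated_phrases`) and a made-up sample. It will also return uninteresting pairs such as “a sleep”, so the teacher still has to judge which chunks matter.

```python
import re
from collections import Counter

def repeated_phrases(text, n=2, min_count=2):
    """Return (phrase, count) pairs for n-word sequences seen at least
    min_count times. Phrases never cross punctuation boundaries."""
    counts = Counter()
    for chunk in re.split(r"[.!?;:,]", text.lower()):
        tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", chunk)
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return [(phrase, c) for phrase, c in counts.most_common() if c >= min_count]

# Hypothetical sample text, for illustration only.
sample = "A sleep debt builds up quickly. Paying off a sleep debt takes time."
for phrase, count in repeated_phrases(sample):
    print(phrase, count)
```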

7. Student Mining

Getting the students themselves to do the mining is always a great way to work with the vocabulary they want, and it gives them valuable pre-exposure too. You can have students scan for new or unfamiliar words (and phrases) and build a list. Then, they can work with another student to discuss unfamiliar words and come up with a list of words neither can define. I also get students to add words to a Google Form (a simple paragraph text input box) so that I can see the most common unknown words and work from there.
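Once the form responses are in, tallying them takes only a few lines. The sketch below assumes the responses have been exported as a CSV file and that the question column is called “Unknown words”. Both the column name and the sample rows are hypothetical. It counts each word once per student, and assumes students separate their words with commas, semicolons, or new lines.

```python
import csv
import io
import re
from collections import Counter

def tally_unknown_words(csv_file, column="Unknown words"):
    """Count how many students listed each word or phrase."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        entries = re.split(r"[,;\n]+", row.get(column) or "")
        # A set ensures a word repeated by one student is counted once.
        counts.update({e.strip().lower() for e in entries if e.strip()})
    return counts.most_common()

# Hypothetical export with two student responses.
sample_csv = io.StringIO(
    'Timestamp,Unknown words\n'
    't1,"wreak havoc, deprivation"\n'
    't2,"Deprivation\ncut short"\n'
)
results = tally_unknown_words(sample_csv)
print(results[0])  # the word the most students listed
```

With a real export, `open("responses.csv", newline="", encoding="utf-8")` would take the place of the `io.StringIO` sample.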

