Deconstructing the Duolingo English Test (DET)

Update: after writing my original review, I had a chance to talk with someone from Duolingo who could explain a few things about the test and answer some questions. Here are some notes from our conversation:

  • The test was designed to be both quick and efficient, which is why it does not seem like any other English test.
  • The DET was trained on and calibrated to the CEFR and is meant as a test of general English ability, not academic skills, because some literature has shown that scores on academic tests are not predictors of academic success.
  • The items on this test are based on ones that have been published in the literature. These published test items were shown to have great predictive abilities that gave accurate information about students’ reading, writing, comprehension, etc. This is especially true of the real vs. fake English word test items, which are said to be predictive of writing.
  • There are a number of articles worth reading at https://www.duolingo.com/research.

Original Review: Duolingo is a great language learning tool. It can introduce you to the basics of a number of different languages through a fun, game-like app in which grammar and vocabulary are built up and reinforced through translation practice. I can thank Duolingo for my basic ability in Spanish and my deeper understanding of Polish grammar. The Duolingo English Test (DET), on the other hand, is absolutely terrible. Last year, it made some buzz on the internet as a kind of TOEFL/IELTS killer, a serious competitor to the big tests – one which was affordable (at $30) and accurate. I had a chance to take the test last week to see if it could serve our institute and students. I went into it very excited and came away with a very bad taste in my mouth.

I’m an expert in English. Thanks for telling me, Duolingo!

Safety First

One thing I did like – probably the only thing – was that the test was pretty secure. There were a number of safeguards to ensure test takers do not cheat. You have to show ID to the camera, cannot wear headphones, must be the only person in a well-lit room, and must stay visible on camera the entire time. It would be pretty difficult to have someone assisting you or helping you cheat. So, as far as security goes, it’s pretty solid.

QUESTION TYPES

For the test that I took, I encountered 5 question types.

1. Gap-fill Reading Passage

You get a reading passage from what seems like an English literature source. There are gaps throughout the passage and each gap has a dropdown box with the same word choices.

The reading passages were actually quite difficult, and, without further context, some just didn’t make sense. The fact that they seemed to be from English literature from the late 19th or early 20th century is alarming, as they contained a good bit of verbose prose that most ELLs will probably not have encountered. This definitely throws off their ability to choose the correct grammatical form from the word choices.

One passage read like the sentence below. The underlined words are completed gaps. Can you fill in the others? (Note: I have changed the words to avoid any copyright issues but tried to keep the syntax of the original):

After I came in with the vodka, they __________ already arranged on both sides of the General’s dinner-table — Big Bear next to the window and sitting backwards so as to have one eye on his companion and one, as I __________, on his exit.

Choices: had, have, came, come, comes, seated, sitting, seats, think, thought, was, were, did, do

It’s not overly difficult, and it’s not the passage that stumped me (there was one that did). However, it struck me as both out of place and even a bit inappropriate with the mention of alcohol. The prose is so strange that even if a student can get most of the words correct, will they have any idea of the meaning? I doubt it, and Duolingo doesn’t even check. That’s right. There is no reading comprehension other than the ability to follow simple instructions.

To be fair, the DET is an adaptive test, meaning the questions increase in difficulty the more a test-taker gets correct, so it is possible that those who miss more questions will get passages that make some sense and are a bit easier to complete. I took the test again and purposely got most questions wrong. The first gap-fill came after English Word Selection questions, Listen and Write, and Listen and Speak. The text that I got was well above the level of someone who had gotten everything else wrong. I purposely chose wrong answers for everything and did not see another gap-fill during the test. I took it once more, and this time I tried to get most answers correct and a few wrong. I saw two gap-fill prompts, both equally difficult. They included words such as rattled, stirred, and save (the preposition form), among other words students likely had not encountered before. The second one seemed longer, so maybe it is adaptive in terms of length, but certainly not difficulty.
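For readers unfamiliar with adaptive testing, here is a minimal sketch of the general idea: a simple up/down "staircase" in which difficulty rises after a correct answer and falls after a miss. This is not Duolingo's actual algorithm, which I have not seen. The function names, the 1 to 10 difficulty scale, and the starting level are all assumptions made up for illustration.

```python
def next_difficulty(current, was_correct, step=1, lowest=1, highest=10):
    """Move one step up after a correct answer, one step down after a miss.

    The 1-10 scale and the step size are hypothetical, chosen for illustration.
    """
    proposed = current + step if was_correct else current - step
    # Keep the level inside the allowed range.
    return max(lowest, min(highest, proposed))


def run_session(answers, start=5):
    """Return the difficulty level presented for each answer, in order."""
    levels = []
    level = start
    for was_correct in answers:
        levels.append(level)
        level = next_difficulty(level, was_correct)
    return levels


# A test taker who misses every item should drift to the easiest level.
print(run_session([False] * 6))
# A test taker who mostly answers correctly should see harder items.
print(run_session([True, True, False, True]))
```

Under a scheme like this, someone who answers everything wrong should quickly end up with the easiest items, which is not what I saw with the gap-fill passages.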

2. English Word Selection

You will see a number of real English words (such as resentment, vanquish) and words that look like English but are not (such as commemoral, executrive).

On the surface, this seems like a decent way to check one’s vocabulary. That is, until you realize that the real English words such as resentment, vanquish, pub, or floppy are probably not on any high-frequency word lists (I checked the GSL and AWL) and are probably not encountered much in English study, day-to-day use, or even ESP situations (aside from pub!).

So, what is Duolingo testing? Judging by my test results labeling me an “Expert in English,” I suppose it’s a check to see whether or not I am an L1 user of a standard variety of English, or the extent to which I approximate that. I got 99%, so am I 99% a native-speaker? How does that correspond to a TOEFL score or CEFR level? Those scores and levels are not without their issues; however, at least they offer language ability descriptions. The DET does offer some explanation of my score:

Can understand virtually anything heard or read, even intellectually demanding material such as an academic lecture or a book on philosophy. Can use the language fluently and spontaneously in a way that can even be more advanced than an average native speaker. Can scan long texts for relevant information, and differentiate finer shades of meaning in complex social and professional situations.

Like the Gap-fill questions, no comprehension of the words was actually required. How this test knows I can understand something without testing this is beyond me. What’s this about relevant information, or even scanning? This did not come up once during the test. “Finer shades of meaning”? Um, OK…

3. Listen and Write

For this section, you hear a sentence (up to three times) and must write it verbatim. The sentences seemed to increase in length as the test went on. I did not think the vocabulary or grammar of these sentences was very difficult. I feel they did an OK job testing listening skills, though for the longer sentences, they also seemed to test whether the test taker had a decent phonological loop / working memory, as it took a bit of mental rehearsal to remember and write the dictation. Something to note here is more evidence of a major, recurring flaw: no comprehension of the sentences is required.

4. Read and Speak

Here, you had to read a sentence aloud. Like the Listen and Write section, the sentences were pretty mundane, although one of the sentences did contain the word “crap,” which surprised me. It seemed out of place for a test that wants to be taken seriously. This section seems to be assessing your pronunciation and ability to read aloud as, again, no comprehension is required.

5. Oral-interview Type Questions

Finally, there were several questions that required an oral response either based on a spoken question or a picture description prompt. I was asked about a person I thought was adventurous, and I was asked to describe a picture of a woman waiting for a subway train. While answering the first question, I was surprised when the test interrupted me during a short pause to ask me a follow-up question. Duolingo is clearly trying to emulate a more authentic language use situation, but it comes off as robotic and jarring – Duolingo had interrupted me before I had finished my thought, which was the very reason I had paused. The questions were very simplistic, and while an experienced language teacher could make a decent, holistic assessment of a student from these prompts, it’s not enough to base a whole test on, especially when you are trying to compete with the big boys.

“Scientifically Proven”

Throughout the Duolingo English Test FAQ, you can find many references to this test being “scientifically proven”:

  • “The Duolingo English Test provides scientifically-proven language certification.” (here)
  • “The Duolingo English Test is scientifically designed to provide a precise and accurate assessment of real world language ability.” (here)
  • “The Duolingo English Test provides scientifically-proven language certification.” (at the bottom of my certificate)

The TOEFL, IELTS, and other major language tests have gone through years of development, testing, and research and still make no bold claims about being “scientifically proven”. So, what does Duolingo really mean? Label anything with “science” and it seems more believable, but if you read carefully, the claim is that the certification is “scientifically-proven,” meaning that the certificate comes from a scientifically designed test, and by science, I think they mean through their impressive ability to design an adaptive test via computer science.

Or, perhaps they mean that there has been a quantitative study (not peer-reviewed). From the FAQ:

Duolingo English Test scores are significantly correlated with the TOEFL iBT (a standardized English test). Read the validity study here for a comparison of Duolingo English Test scores with scales from other common language tests.

In this thirteen page (!) scientific article, we see that DET scores are pretty well correlated with TOEFL iBT test scores (though the correlation is weaker for individual subskills). On page 11, the authors state: “Scores from the Duolingo English test were found to be substantially correlated with the TOEFL iBT total scores, and moderately correlated with the individual TOEFL iBT section scores, which present strong criterion-related evidence for validity.”
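For anyone who wants to see what a correlation like this actually measures, here is a minimal sketch of a Pearson correlation computed by hand. The scores below are hypothetical numbers I invented for illustration; they are not from the Duolingo study, and the function name is my own.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products of deviations from each mean.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical scores for five imaginary test takers (NOT real data).
det_scores = [40, 55, 60, 72, 85]
toefl_scores = [60, 75, 70, 95, 105]

print(round(pearson_r(det_scores, toefl_scores), 2))
```

A value near 1 only says that the two tests rank the same people in roughly the same order. It says nothing about whether the two tests contain the same kinds of tasks.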

I am by no means a psychometrics expert, but wouldn’t it make more sense to be looking at construct or content validity as opposed to criterion validity? Criterion validity is predictive in nature, concerned with how well a certain measure or test is related to a predicted outcome. In this paper, they looked at how well DET scores predicted TOEFL scores. I think the fact that there is strong correlation is interesting. How can a test with no measure of language comprehension, written expression, or academic language use be as valid as the TOEFL iBT, which contains all three of these constructs? Has Duolingo found the perfect questions that dig deep within a language user to pull out their capacity to, I don’t know, summarize and compare a lecture and reading, and do this only based on their ability to select real English words among some lookalikes? That’s good science!

Another non-peer-reviewed, Duolingo-commissioned study compared Duolingo test scores to faculty assessments of incoming freshman international students. Another high correlation was found, and the authors recommend the DET as a placement test. That was a red flag to me. I took this test with the idea in mind that it could serve as a placement test for my program. After taking the test, I would in no way recommend it.

I think that doing some sort of construct validity test to check whether their questions measure what they say they measure is more warranted than a correlation study. There is very little published about the Duolingo test other than Duolingo-issued research. However, I did find a critical review published in a refereed journal that essentially found all the same issues I did and sums up the test as follows:

In summary, at the time of writing this critique, the DET seems woefully inadequate as a measure of a test taker’s academic English proficiency or for high-stakes university admissions purposes….The test seems to be a case of “the tail wagging the dog,” in that the DET’s reliance on short, computer-scored test tasks has resulted in a test that does not assess the test takers’ communicative competence. Indeed, the test tasks that are used hearken back to the 1950s, when audiolingualism was the dominant theory in language learning. (Wagner & Kunnan, 2015)

Ouch. Sorry Duolingo. You might make a fun, somewhat effective language learning tool, but when it comes to language testing, your owl needs to take off its graduation cap and put its tracksuit back on. TOEFL and IELTS are nearing the finish line, but you are still just warming up.

References

Wagner, E. & Kunnan, A. J. (2015). The Duolingo English test. Language Assessment Quarterly, 12(3), 320-331. DOI: 10.1080/15434303.2015.1061530.