Do you Really Understand 80% of a Language if you Learn 1000 Words?
I read this a lot: learn the most frequent 1000 words (or 2000, depending on the source) and you will understand 80% (or 90%, depending on the source) of the language. The claim is so deeply embedded in language learning that it appears everywhere, from Duolingo’s old fluency percentage to Lingvist’s pretty little graph. Even the good old BBC pushes this theory.
But It’s not about Words, it’s about Meaning
So let’s start by examining the problem with this idea. Language is about communicating something. But if we consider it at the word level, then we’re highly restricted in what we can communicate. Say, for example, that I walked up to you and said:
“Water”
Now you might guess that I want water, right? In fact, your ability to guess that without thinking about it is very important to communicating effectively. But you don’t really know what message I’m trying to put across. Perhaps I’m giving or selling you water. Perhaps I’m warning you that the water is poisoned. Perhaps I want to go to a lake, a pool, or the sea. Perhaps there’s a flood or tidal wave approaching. You can’t really tell what I mean unless I put it together with some more words. The word “water” alone doesn’t communicate a message, but a selection of possible messages. And “water” is an easy one. Many of the most popular 1000 words just won’t convey any sort of meaning alone:
Did you get anything from that? Of course not. And it turns out that a lot of the popular words are of this variety. Or, to put it another way round, those words outside the popular list have a tendency to carry a lot of meaning when interpreted alongside other words.
We can demonstrate this with the following sentence:
The old man was missing
The first four words are among the 1000 most popular English words; the last is not. “The old man was” tells you nothing without the final word.
Meaning Relies not on Single Words, but Collections of Words
This is important because language has, quite sensibly, evolved to put words together to convey meaning. This evolution was necessary for us to communicate properly. If we had to combine words to convey meaning, then it stands to reason that we can’t take words singly and infer understanding of meaning from them.
Just because you can recognise 80% of the words on a bit of paper or in a conversation doesn’t mean you understand the message of the conversation, or even 80% of it. There is no direct relationship between how often a word is used and how much it contributes to the meaning of a sentence.
We all know this, of course. None of us would buy a book that was randomly missing 20% of its words, even at 50% of the price! But that doesn’t stop this bizarre language learning claim from circulating on the internet.
A Little Experiment
The easiest way to show this is to run an experiment. Thankfully, we live in the computer age and computers can help us a bit with this. To do so, though, we need to pick a representation that is more likely to convey meaning than a single word. For ease, let’s settle on the sentence. Each sentence represents a tiny little message that, combined with other sentences, represents a whole. I’m not saying it’s the only representation that could represent meaning, or even necessarily the best one, just that it is better than a “word” and that a sentence is also easy to define for a computer.
Then we can ask the question: if I understand the 1000 most popular words, in how many sentences do I know all the words? If a learner knows all the words then we assume they know the meaning; if not, we assume they don’t. This approach favours the learner a bit, though, as we automatically assume that they understand all the grammar needed to understand the sentence.
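To make the measure concrete, here is a minimal sketch in Python of how word recognition and sentence recognition could be counted. It is not the code used for the experiment: the function names, the naive sentence splitter, and the tiny sample text are all assumptions made for illustration.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    """Lowercase and keep runs of letters and apostrophes as words."""
    return re.findall(r"[a-z']+", sentence.lower())

def coverage(text, known_words):
    """Return (word_coverage, sentence_coverage) as fractions between 0 and 1."""
    known = set(known_words)
    sentences = [tokenize(s) for s in split_sentences(text)]
    sentences = [s for s in sentences if s]  # ignore sentences with no words
    if not sentences:
        return 0.0, 0.0
    total_words = known_count = understood = 0
    for words in sentences:
        hits = sum(1 for w in words if w in known)
        total_words += len(words)
        known_count += hits
        # A sentence only counts as understood if every word is known.
        if hits == len(words):
            understood += 1
    return known_count / total_words, understood / len(sentences)

# Hypothetical sample: the learner knows every word except "missing".
sample = "The old man was missing. The old man was here."
known = {"the", "old", "man", "was", "here"}
word_cov, sent_cov = coverage(sample, known)
print(word_cov, sent_cov)
```

In this toy sample the learner recognises nine of the ten words but only one of the two sentences, which is the gap the experiment sets out to measure.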
We need, of course, a collection of sentences to test on. My choice is a film script. I’ve chosen this for a number of reasons.
Firstly, film scripts are heavily dialogue based. This means the language is generally simpler and there’s less of the creative wordplay (difficult words) that you get in complicated novels. The sentences are more likely to be used in real life and, yes, this favours the learner too.
But the film script also helps the computer simulation. Computers are really bad at recognising some things, in particular names.
“Miss Fluffligums”, said Jibbledywigglydy, “Mr Tomtomtitomtitom has left”
Where you instinctively know which words are names, a computer has no clue. Names, of course, occur quite a lot in life but not in frequent word lists. Film scripts tend to list names in capitals when the character is speaking, which gives the computer a more realistic way of identifying names.
Given that we have a script and a list of popular words, there’s nothing to stop us seeing how many sentences can be recognised with the 1000 most popular words, or even 2000, or 3000, all the way to, say, 20000.
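One way the capitalised speaker names could be used is sketched below in Python. It assumes a script layout where the speaker's name sits alone on a line in capitals. Both the layout and the function name are assumptions for illustration, and a real script would need more care, since scene headings are often in capitals too.

```python
import re

def strip_speaker_cues(script_text):
    """Drop all-capitals lines (assumed speaker cues) and collect their words as names."""
    names = set()
    dialogue = []
    for line in script_text.splitlines():
        stripped = line.strip()
        if stripped and stripped.isupper():
            # e.g. "MISS FLUFFLIGUMS": remember each word as a known name
            names.update(re.findall(r"[a-z]+", stripped.lower()))
        else:
            dialogue.append(line)
    return "\n".join(dialogue), names

# Hypothetical script fragment in the assumed layout.
script = "MISS FLUFFLIGUMS\nHe has left.\nJIBBLEDYWIGGLYDY\nGood."
dialogue, names = strip_speaker_cues(script)
print(dialogue)
print(sorted(names))
```

The collected names can then be treated as known words, so that a sentence is not marked as unknown only because it mentions a character.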
Partly to hold the tension, but also because it’s really important, we need to talk about one more thing you’re really good at but computers aren’t. Otherwise it won’t be a very realistic experiment.
Assume you know the word “jump” and you see the words “jumped” and “jumping”. It’s pretty likely that with a little bit of grammar knowledge or even just a bit of contextual learning, you’re going to have a good idea what the other two words mean. Certainly after reading an entire film script.
The computer, though, will not. That would lead to an unfair experiment, as we’d assume that the learner didn’t understand a sentence with the word “jumping” in it when they probably did. To cater for this I used a stemming algorithm to reduce all the words in the popular words list and the script to their stems, i.e. “jumping” becomes “jump” in both the popular words list and the script. The algorithm isn’t perfect, but neither is a person guessing, so this seems a reasonable approximation.
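The idea can be shown with a deliberately crude suffix stripper in Python. This is a toy written for illustration, not the stemming algorithm used in the experiment (which is not named here), and established stemmers handle far more cases.

```python
def toy_stem(word):
    """Strip a few common English suffixes. A toy, not a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three letters so short words like "sing" or "red" are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Stem both sides, so "jumping" in the script matches "jump" in the word list.
known_stems = {toy_stem(w) for w in ["jump", "the"]}
script_stems = [toy_stem(w) for w in ["jumping", "jumped", "jumps"]]
print(script_stems)
print(all(s in known_stems for s in script_stems))
```

Like any stemmer, this one makes mistakes: it turns “running” into “runn”, for instance, which is the kind of imperfection accepted above.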
In the results, word recognition improves rapidly over the first few thousand words, which explains why the myth is so pervasive. But the gains tail off quickly at higher word counts. The improvement in sentence understanding is less dramatic, though, which suggests that less frequent words matter a great deal for understanding sentences.
In this case, at least, understanding 1000 words means you recognise 65% of the words written on the paper. But you’ll have difficulty with 76% of the sentences.
What This Tells Us
The first takeaway from this should be that we all guess what things mean. Our brains convince us that we know things that we don’t. This comes to light when we misinterpret something, or when somebody asks us for the definition of a word we’re convinced we understand but then can’t give. Even people with large vocabularies do this. And our brains do this for a reason. Take the “Water” example right at the beginning of the article. It is, simply, more efficient to assume meaning than to always seek 100% clarity.
As any second language learner at an advanced stage can tell you, we do this in our second languages too. I can happily watch a foreign language TV programme and think I understand it all. Ask me to transcribe it though, and I notice all the words and things that I’m missing or didn’t quite get.
The second takeaway is that you need to learn an awful lot of words. Remember that we stemmed the words, so we’re not talking about “jump”, “jumping”, and “jumped” being three separate words, but one. Let’s call them word families. Let’s say that you need to know 50% of sentences well to be able to guess the others, albeit subconsciously. That equates to about 7000 word families.
But Wait – a Bonus Experiment
I think it’s safe to say that at this point in time, the argument that 1000 words means you understand 80% of a language (or thereabouts) is busted. But sometimes bad arguments are used to support good points. There’s a second experiment we can run and it’d be a shame not to do it. The dictionary of words I’m using is 20000 words. If I randomised the order, it would no longer be a popularity list but just a dictionary. If I picked the first, say, 1000 words and compared them to 1000 words ordered by popularity, it would be interesting to see what effect that had on sentence understanding. We could do that, say, at intervals of 1000 words. And we get a graph like this:
This would imply that, for most of your learning journey, you won’t really understand 80%, but you will understand more if you learn words on the basis of frequency/popularity rather than in a random order.
Put simply, given no other factors, it is more efficient to learn words in order of word frequency. But, unfortunately, you’re still going to have to learn way more words than people claim.
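For anyone who wants to try the bonus experiment themselves, the comparison between frequency order and random order could be sketched in Python as below. The function names, the fixed random seed, and the six-word list are assumptions for illustration. A real run would use a full frequency list and a stemmed script, and the toy data is far too small to show which order does better.

```python
import random

def sentence_coverage(sentences, known):
    """Fraction of tokenized sentences whose words are all in `known`."""
    if not sentences:
        return 0.0
    return sum(all(w in known for w in s) for s in sentences) / len(sentences)

def compare_orders(sentences, freq_list, step, seed=0):
    """Sentence coverage at each vocabulary size: frequency order vs shuffled order."""
    shuffled = list(freq_list)
    random.Random(seed).shuffle(shuffled)  # the same words, in random order
    rows = []
    for size in range(step, len(freq_list) + 1, step):
        rows.append((
            size,
            sentence_coverage(sentences, set(freq_list[:size])),
            sentence_coverage(sentences, set(shuffled[:size])),
        ))
    return rows

# Hypothetical toy data: sentences already split into lowercase words.
sentences = [
    ["the", "man"],
    ["the", "old", "man"],
    ["the", "man", "was", "missing"],
]
freq_list = ["the", "man", "old", "was", "missing", "here"]  # most frequent first
rows = compare_orders(sentences, freq_list, step=2)
for size, by_frequency, by_random in rows:
    print(size, round(by_frequency, 2), round(by_random, 2))
```

Each row gives a vocabulary size followed by the sentence coverage for the frequency-ordered and the randomly ordered list. Once the whole list is learned the two columns necessarily agree, since both orders contain the same words.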