Is the length of a word a poor way to predict English proficiency?

As mentioned in my Cambridge Learner Corpus research elsewhere, the most noticeable difference between CEFR levels is sentence length: shorter at A1 and longer at C2. In this post, I wanted to do a quick check (20 sentences) of word length (the number of letters in a word) using PELIC examples. Surprisingly, the level 2 average word length was 4.45 characters, and level 5 (the most advanced level) was 4.55. That is a very small increase when all words are counted. But since stop words are used heavily at every level, we might expect that removing them would give a bigger difference in letters per word. What we found is the exact opposite: the pre-intermediate (level 2) students averaged 6.61 letters per word and the advanced (level 5) students 5.94. Now, a sample this small can only give a feel for the notion that ‘longer words are used by more advanced students’, but it does make a small case for ‘you shouldn’t rely on individual tokens as indicators of English proficiency or complexity of a text’.

However, I just don’t believe these results, so I thought I would take another approach. This time I took all the English Vocabulary Profile (EVP) verbs whose lowest-level sense is C1 or A2. The C1 average verb length is 6.43 and the A2 average is 4.60, a difference of 1.83. The uninflected verb form at C1 is almost two letters longer than at A2. One can only imagine that comparing A1 with C2 or unlisted vocabulary would give an average difference of over 2 letters.

The rest of this post shows how I came to these conclusions.

Below are random PELIC Level 2 sentences, in the order they are found here, which were then counted using this software.

  • 526 Characters without white space
  • 118 Words

526/118 = 4.45 average word length.

Level 5 sentences:

  1. Should I visit Brazil, I would go with my friends.
  2. It is not hard to see why pre-natal health is so crucial for life expectancy.
  3. Hardly had I left the house when I realized that I left my sunglasses at home .
  4. Third, exercise can be an extremely effective stress reliever for several reasons.
  5. Every day after finishing school, she almost always practiced dancing with loud music.
  6. Even in the burning hot summer, some of them have spicy hot pot.
  7. Given the fact that having more leisure time can lead to positive results, one can be more creative in his work.
  8. When you think about it, it makes sense.
  9. It has been estimated that in the United States alone, at least 400,000 of such embryos are destroyed.
  • 574 characters without white space
  • 126 words
  • 574/126 = 4.55 average word length
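The calculation used for both levels can be sketched in a few lines of Python. This is only a minimal sketch, not the counting software used above: it assumes that words are split on white space and that punctuation attached to a word counts as a character, which may not match that tool exactly. The `truncate` helper is my own addition; it reflects the fact that the averages in this post are cut off at two decimals rather than rounded.

```python
def average_word_length(text: str) -> float:
    """Non-whitespace characters divided by whitespace-separated words."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    chars = sum(len(word) for word in words)
    return chars / len(words)


def truncate(value: float, places: int = 2) -> float:
    """Cut off extra decimals without rounding."""
    factor = 10 ** places
    return int(value * factor) / factor


# Level 5 sentence 8: 33 characters over 8 words.
print(average_word_length("When you think about it, it makes sense."))  # 4.125

# The counts reported in this post.
print(truncate(526 / 118))  # 4.45 (level 2)
print(truncate(574 / 126))  # 4.55 (level 5)
```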

Next, we put the two texts through the profiler at EnglishGrammar.Pro.

The level 2 sentences are predicted to be from Elementary students. I would say they are closer to A2. Also note the phrases that stand out: ‘sculptures on display and his artistic productions’.

It automatically predicts that the Level 5 texts together are between B2 and C1, which we could give an average IELTS score of 6.5.

Next, we remove anything not tagged as a verb, adjective, adverb, or noun. Proper nouns and modal or auxiliary verbs are also removed. I wasn’t sure whether to remove adverbs like ‘almost’ or ‘always’, so I left them in.
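For readers who want to try a similar filter, here is a rough sketch. It is not the tagger-based method described above: instead of part-of-speech tags it uses a small, hand-picked `FUNCTION_WORDS` set (hypothetical, and far from complete), so it cannot recognise proper nouns such as ‘Brazil’ the way a tagger would.

```python
# Hypothetical, hand-picked list standing in for a part-of-speech tagger.
FUNCTION_WORDS = {
    "a", "an", "the", "i", "you", "it", "my", "his", "she", "of", "in",
    "to", "for", "with", "about", "when", "why", "that", "so", "not",
    "is", "are", "be", "been", "has", "have", "had", "can", "would", "should",
}


def content_words(text: str, function_words=FUNCTION_WORDS) -> list[str]:
    """Strip surrounding punctuation and drop anything in the function-word set."""
    tokens = [token.strip(".,!?;:'\"") for token in text.split()]
    return [t for t in tokens if t and t.lower() not in function_words]


sentence = "When you think about it, it makes sense."
kept = content_words(sentence)
print(kept)  # ['think', 'makes', 'sense']
print(sum(len(w) for w in kept) / len(kept))  # 5.0
```

On this one sentence the result matches the ‘think makes sense’ that appears in the Level 5 list, but a word list like this will diverge from a real tagger on most longer texts.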

Level 2:

has paintings sculptures display artistic productions people looks cute turtle favorite food chicken usually enjoy trip weekend  using computer homework writing homework apartment doing writing homework internet shy girls always tell brother shy best creatures world best friend becoming popular international activities hungry ate sandwiches think sleeping

  • 47 Words
  • 311 characters Without White Space
  • 6.61 average word length

Unmodified text inspector output:

Level 5:

visit go friends hard see pre-natal health crucial life expectancy
Hardly left house realized left sunglasses home exercise extremely
effective stress reliever reasons day finishing school almost always
practiced dancing loud music Even burning hot summer spicy hot pot
Given fact having leisure time lead positive results creative work
think makes sense estimated alone embryos destroyed

  • 56 Words
  • 333 characters without White Space
  • 333/56= 5.94 average word length.

Unmodified text inspector output:

C1 verbs (lowest sense) from the English Vocabulary Profile

*Note: the last letters are often removed so that varied verb endings can be caught.

accelerat accomplish acknowledg address aid alarm allocat alternat anticipat applaud appoint aris assert assign associat attain authoriz bewar brib broaden bull campaign cap cater clarif classif collaborat commut compensat compil complicat compliment compl compris conclud conquer constitut consult contradict conve corrupt crowds crowding crowd cultivat daydream dedicating deepen desir detect deteriorat determin dictat differentiat digest diminish discriminat dismiss displac dissolv distort draft dump dwell eliminat embrac enforc engag enhanc enrich envisag evaluat evolv exaggerat exceed exclud exhaust exhibit exist facilitat fascinat flee forese fund generaliz grad grasp group host imitat impact impos imprison infect insert inspect instruct integrat irritat jam label lengthen lessen march mingl minimiz misbehav misinform mislead misus modif monitor motivat narrow neglect negotiat nominat notif number omit opt outnumber outweigh overdo overestimat overwhelm perceiv pictur plung pos possess presum privatiz proceed prolong pursu puzzl quot rank rat readjust rear reassur recharg reconsider reconstruct recreat recruit rectif redevelop refresh refund regulat rehears reinforc relocat render renovat reorganiz reproduc resembl resolv restart restrain restrict resum rethink sacrific scan scar shield shift shorten simplif sip smuggl sow spar spin sp starv stock sue suing summariz surg surv swap tax thriv total trac transmit trigger twist uncover undergo undertak unfold unit unload unwind urg verif volunteer weaken withdraw worsen worship

  • 209 words
  • 1344 characters without white space
  • 1344/209 = 6.43 average word length

A2 verbs

add agree arriv bak become been believ boil born borrow bother brak bring brought brush build call camp chat check click climb collect complet contact cop cost cover cry cut decid dela describ discuss download dream dr earn end enter explain fail fall fill follow forget grow guess happen hat hit hold hope hurr hurt improv includ join jump keep kill kiss laugh lend let lie los matter mean miss mix mov offer order pack park pass pick point post pra prefer prepar print pull push receiv record rent repair repeat return ring roast sav sell sold serv shar shout shut sound spell spend stand steal surf telephon text thank throw tidy try turn won win

  • 116 words
  • 534 characters without white space
  • 534/116 = 4.60 average word length

6.43-4.6 = 1.83
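The same comparison can be run directly on the two lists. The sketch below uses only the first five entries of each list as an illustration, so its result is not the 1.83 reported above; pasting in the full lists should get close to that figure, assuming the same white-space splitting. The function name `mean_stem_length` is my own.

```python
from statistics import mean


def mean_stem_length(stems: str) -> float:
    """Mean number of letters per whitespace-separated stem."""
    return mean(len(stem) for stem in stems.split())


# First five entries of each list in this post, for illustration only.
c1_sample = "accelerat accomplish acknowledg address aid"
a2_sample = "add agree arriv bak become"

gap = mean_stem_length(c1_sample) - mean_stem_length(a2_sample)
print(round(gap, 2))  # 3.4 for these ten stems only
```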

Finally, it should be mentioned that metrics that estimate the difficulty level of a text also rely on word length: the Flesch-Kincaid readability test uses syllables per word, and the Coleman-Liau index uses letters per word.
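To show how directly word length feeds into such a metric, here is a minimal sketch of the Coleman-Liau index, 0.0588 × L − 0.296 × S − 15.8, where L is letters per 100 words and S is sentences per 100 words. The sentence counter is a naive assumption of mine (it counts runs of ‘.’, ‘!’ and ‘?’), so abbreviations would throw it off. Flesch-Kincaid is left out because it needs a syllable counter.

```python
import re


def coleman_liau_index(text: str) -> float:
    """Coleman-Liau index from letters and sentences per 100 words."""
    words = text.split()
    letters = sum(ch.isalpha() for ch in text)
    # Naive assumption: each run of . ! ? ends one sentence.
    sentences = len(re.findall(r"[.!?]+", text))
    if not words or sentences == 0:
        raise ValueError("need at least one word and one sentence")
    letters_per_100 = letters / len(words) * 100
    sentences_per_100 = sentences / len(words) * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


short_words = coleman_liau_index("The cat sat on it.")
long_words = coleman_liau_index("Exercise relieves stress effectively.")
print(long_words > short_words)  # True
```

Because L carries the largest coefficient, a text with longer words will score as harder even when nothing else about it changes.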



Here is a closely related and interesting post from Cambridge that I came across:

More formal vocabulary commonly involves longer words or words with origins in Latin and Greek. More informal vocabulary commonly involves shorter words, or words with origins in Anglo-Saxon.









