
Etymology Online: Issues with Google Ngram; or, Where the Missing Toast Lies

I cannot tell you much about how AI works. But I can tell you how AI handles something I know how to handle. Here’s another example: “The Great Toast Famine of ’77.”

Ngram says toast almost vanishes from the English language by 1980, and then it pops back up.

WHY THAT’S WRONG

There’s a long-documented flaw in the Ngram formula, inherited from Google Books. The error makes a vast number of English words appear to be diminishing in use through the 20th century only to revive around 1980.

The rough gist of the explanation seems to be that the Google Books corpus is heavily academic. The printed matter Google sucked up from universities included a disproportionate share of modern scientific and academic journals. The articles in those journals and textbooks lean on the same few words (as academics are wont to do when they write).

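Ngram plots each word as a share of all the words printed in a given year, not as a raw count. So the flood of academic text not only bloats the scores for those few words, it falsely drives down the scores of all the others. That creates the mid-20th-century “dip” in the Ngram of almost every word.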
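For anyone who wants to see the arithmetic, here is a rough sketch in Python. Every number in it is invented for illustration (the token counts, the 50 uses of "toast", the size of the academic pile), and the function name is made up too. It is not Google's formula, only the general idea of a word's share of a corpus: the count of "toast" never changes, but its share shrinks as academic text is piled on.

```python
# Hypothetical numbers throughout: a toy corpus, not Google Books data.
GENERAL_TOKENS = 1_000_000   # assumed size of the non-academic text
TOAST_COUNT = 50             # assumed uses of "toast", all in the general text


def toast_share(academic_tokens: int) -> float:
    """Share of the whole corpus taken up by 'toast'.

    The raw count of 'toast' is held constant; only the amount of
    academic text added to the corpus changes.
    """
    total_tokens = GENERAL_TOKENS + academic_tokens
    return TOAST_COUNT / total_tokens


# Same 50 toasts, measured against a corpus with growing academic content.
share_no_academic = toast_share(0)
share_heavy_academic = toast_share(4_000_000)

for label, share in [
    ("no academic text", share_no_academic),
    ("4M academic tokens", share_heavy_academic),
]:
    # Express the share per million tokens for readability.
    print(f"{label}: {share * 1_000_000:.1f} per million")
```

In this toy case "toast" falls from 50 per million to 10 per million without a single slice of toast going missing. The word did not decline; the denominator grew.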

