Using Google to learn the meanings of words

February 15th, 2005

The Web is the largest database on Earth, and Google has the largest index of this database. Two researchers at the University of Amsterdam have proposed a new system that uses Google search to learn and distinguish the meanings of words.

Their work is based on the theory that the meaning of a word can usually be gleaned from the words used around it. Take the word “rider”. Its meaning can be deduced from the fact that it is often found close to words like “horse” and “saddle”.

Instead of relying on a common-sense knowledge base such as Cyc, the researchers use Google search to measure how closely two words relate to each other.

To do this, the system needs to build a word tree, a database of how words relate to each other. It might start with any two words and see how they relate. For example, if it googles “hat” and “head” together it gets nearly 9 million hits, compared to, say, fewer than half a million hits for “hat” and “banana”. Clearly “hat” and “head” are more closely related than “hat” and “banana”.

To gauge just how closely, Vitanyi and Cilibrasi have developed a statistical indicator based on these hit counts that measures the logical distance separating a pair of words. They call this the normalised Google distance, or NGD. The lower the NGD, the more closely the words are related.
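
To make the idea concrete, here is a minimal Python sketch of how such a distance can be computed from hit counts. The formula follows the definition of NGD published by Cilibrasi and Vitanyi: take the larger of the two single-word log counts, subtract the log of the joint count, and divide by the log of the index size minus the smaller single-word log count. Only the two joint counts (about 9 million and about half a million) come from the example above. The single-word counts, the index size N, and the function name ngd are assumptions made up for illustration, and no real search API is called.

```python
import math


def ngd(fx, fy, fxy, n):
    """Normalised Google distance computed from hit counts.

    fx, fy -- hits for each word searched on its own
    fxy    -- hits for both words searched together
    n      -- assumed total number of pages in the index
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Words that never occur (or never co-occur) are treated as
        # infinitely far apart in this sketch.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Hypothetical figures: only the two joint counts echo the example above.
N = 8_000_000_000        # assumed index size
HAT = 50_000_000         # assumed hits for "hat"
HEAD = 100_000_000       # assumed hits for "head"
BANANA = 20_000_000      # assumed hits for "banana"
HAT_HEAD = 9_000_000     # "hat" and "head" together
HAT_BANANA = 500_000     # "hat" and "banana" together

d_head = ngd(HAT, HEAD, HAT_HEAD, N)
d_banana = ngd(HAT, BANANA, HAT_BANANA, N)

print(f"NGD(hat, head)   = {d_head:.3f}")
print(f"NGD(hat, banana) = {d_banana:.3f}")
```

With these made-up counts, “hat” and “head” come out closer together than “hat” and “banana”, which matches the intuition from the raw hit counts. Working with logarithms and dividing by the index size keeps very common words from looking related to everything just because they appear on so many pages.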

See also: “Google’s search for meaning”, New Scientist.