Google announced that it will share an enormous word n-gram dataset culled from a training corpus of one trillion words from public Web pages. In the 1T words of text, Google found 1.1B five-word sequences that appear at least 40 times and 13.8M words that appear at least 200 times. The dataset will be distributed by the Linguistic Data Consortium.
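To make the construction concrete: a dataset like this amounts to counting fixed-length word sequences over a corpus and discarding the rare ones. This is a minimal sketch of that idea, not Google's actual pipeline; the threshold is lowered here so the toy example produces output (Google kept 5-grams occurring at least 40 times).

```python
from collections import Counter

def ngram_counts(tokens, n=5, min_count=2):
    """Count word n-grams, keeping only those at or above a frequency threshold."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy "corpus": one sentence repeated three times.
text = "the quick brown fox jumps over the lazy dog " * 3
kept = ngram_counts(text.split(), n=5, min_count=2)
```

At Web scale the same counting has to be sharded across many machines, which is part of why the raw dataset runs to billions of entries.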
Google describes its motivation as follows:
“We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together.”
They are right — this will be a great resource that will help advance our capabilities in many language-oriented tasks and some related knowledge management problems. I guess it’s time for us to invest in more disk space…