Google Research releasing N-gram data
Tags: analysis | classification | research
Google Research announced today that they will be releasing their N-gram data to the public. An N-gram model is a statistical model that estimates how likely each item is to come next in a sequence; in this case, the items are words. N-gram models are used in a number of computational linguistics tasks, such as translation, part-of-speech tagging, and word sense disambiguation.
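To make the idea of predicting the next word concrete, here is a minimal sketch of the simplest case, a bigram (two-word) model estimated from raw counts. The toy corpus and the function names are made up for illustration and have nothing to do with Google's own tooling or data format.

```python
from collections import Counter, defaultdict


def train_bigram_counts(tokens):
    """Count how often each word follows each preceding word."""
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following


def next_word_probability(following, prev, nxt):
    """Maximum-likelihood estimate of P(nxt | prev); 0.0 if prev was never seen."""
    seen = following.get(prev, Counter())
    total = sum(seen.values())
    return seen[nxt] / total if total else 0.0


# Toy corpus, invented for this example only.
tokens = "the cat sat on the mat and the cat slept".split()
counts = train_bigram_counts(tokens)

# "the" is followed by "cat" twice and "mat" once, so the estimate is 2/3.
p_cat = next_word_probability(counts, "the", "cat")
```

Real systems typically add smoothing so that unseen word pairs do not get a probability of exactly zero; this sketch leaves that out.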
"We believe that the entire research community can benefit from access to such massive amounts of data. It will advance the state of the art, it will focus research in the promising direction of large-scale, data-driven approaches, and it will allow all research groups, no matter how large or small their computing resources, to play together. That's why we decided to share this enormous dataset with everyone. We processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times. There are 13,653,070 unique words, after discarding words that appear less than 200 times."
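The counting-with-a-cutoff step described there can be sketched in a few lines. This is only an illustration of the general technique on a toy sentence, with a hypothetical `ngram_counts` helper and a cutoff of 2 rather than 40; it is not how Google processed its corpus, which at that scale would need a distributed pipeline.

```python
from collections import Counter


def ngram_counts(tokens, n, min_count=1):
    """Count all n-word sequences and keep those seen at least min_count times."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {gram: count for gram, count in grams.items() if count >= min_count}


# Toy text, invented for this example only.
tokens = "to be or not to be that is the question to be".split()

# Keep only the two-word sequences that occur at least twice.
frequent_bigrams = ngram_counts(tokens, n=2, min_count=2)

# Five-word sequences, with no cutoff applied.
all_fivegrams = ngram_counts(tokens, n=5)
```

The cutoff is what keeps the published table manageable: rare sequences make up most of the distinct entries but carry little statistical weight.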
This is the first release from Google of computational models based on their enormous collection of data. Let's hope there will be more releases like this in the future!
The Google N-gram data was culled from a collection of training documents containing a total of one trillion words. Zing! The Linguistic Data Consortium will release it soon as a six-DVD set.

