UMBC WebBase corpus
The UMBC WebBase corpus (http://ebiq.org/r/351) is a dataset of English paragraphs, with over three billion words, processed from the February 2007 crawl of the Stanford WebBase project (http://bit.ly/WebBase). Compressed, it is about 13GB in size.
The February 2007 crawl is one of the largest collections and contains 100 million web pages from more than 50,000 websites. The Stanford WebBase project did an excellent job of extracting textual content from HTML tags, but many instances of duplicated text, truncated text, non-English text and strange characters remain.
We processed the collection to remove undesired sections and produce high-quality English paragraphs. We detected paragraphs using heuristic rules and retained only those whose length was at least two hundred characters. We eliminated non-English text by checking the first twenty words of a paragraph to see if they were valid English words. We used the percentage of punctuation characters in a paragraph as a simple check for typical text. We removed duplicate paragraphs using a hash table. The result is a corpus with approximately three billion words of good-quality English.
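The filtering steps described above can be sketched roughly in Python. This is an illustration under stated assumptions, not the code used to build the corpus: the function names, the `vocabulary` word set, and the two ratio thresholds (`MIN_ENGLISH_RATIO`, `MAX_PUNCT_RATIO`) are hypothetical, since the description gives only the 200-character minimum and the twenty-word window.

```python
import hashlib
import string

MIN_CHARS = 200          # from the description: at least two hundred characters
FIRST_WORDS = 20         # from the description: check the first twenty words
MIN_ENGLISH_RATIO = 0.8  # assumption: the description gives no threshold
MAX_PUNCT_RATIO = 0.2    # assumption: the description gives no threshold


def looks_english(paragraph, vocabulary):
    """Check whether most of the first FIRST_WORDS words are in the vocabulary."""
    words = [w.strip(string.punctuation).lower()
             for w in paragraph.split()[:FIRST_WORDS]]
    words = [w for w in words if w]  # drop tokens that were pure punctuation
    if not words:
        return False
    known = sum(1 for w in words if w in vocabulary)
    return known / len(words) >= MIN_ENGLISH_RATIO


def punctuation_ratio(paragraph):
    """Fraction of characters that are ASCII punctuation."""
    if not paragraph:
        return 0.0
    return sum(1 for c in paragraph if c in string.punctuation) / len(paragraph)


def filter_paragraphs(paragraphs, vocabulary):
    """Yield paragraphs that pass the length, language, punctuation and duplicate checks."""
    seen = set()  # a hash-based set of digests stands in for the hash table
    for p in paragraphs:
        p = p.strip()
        if len(p) < MIN_CHARS:
            continue
        if not looks_english(p, vocabulary):
            continue
        if punctuation_ratio(p) > MAX_PUNCT_RATIO:
            continue
        digest = hashlib.sha1(p.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield p
```

Storing a fixed-size digest rather than the paragraph itself keeps the duplicate table small, which matters at the scale of billions of words; whether the original pipeline did this is not stated.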
The corpus is available as a 13GB compressed tar file, which is about 48GB when uncompressed. It contains 408 files with paragraphs extracted from web pages, one to a line with blank lines between them. A second set of 408 files has the same paragraphs, but with the words tagged with their part of speech (e.g., The_DT Option_NN draws_VBZ on_IN modules_NNS from_IN all_PDT the_DT).
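As a minimal sketch of how the layout described above might be read, the following Python helpers skip the blank separator lines and split tagged tokens such as The_DT into (word, tag) pairs. The function names are hypothetical, and splitting on the last underscore is an assumption about the tagged format, chosen so that words that themselves contain underscores stay intact.

```python
def iter_paragraphs(lines):
    """Yield one paragraph per non-blank line, skipping the blank separators."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def parse_tagged_paragraph(line):
    """Split a tagged paragraph into (word, tag) pairs.

    Assumes each token has the form word_TAG and splits on the last
    underscore. Tokens without an underscore get a tag of None.
    """
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("_")
        if not sep:
            word, tag = token, None
        pairs.append((word, tag))
    return pairs
```

For example, passing an open file object for one of the tagged files to `iter_paragraphs` and each resulting line to `parse_tagged_paragraph` would stream the corpus one paragraph at a time without loading a whole file into memory.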
The dataset has been used in several projects. If you use the dataset, please refer to it by citing the following paper, which describes it and its use in a system that measures the semantic similarity of short text sequences.
UMBC WebBase corpus by Lushan Han, UMBC Ebiquity Lab is licensed under a Creative Commons Attribution 3.0 Unported License. Based on a work at http://ebiq.org/r/351.
Authors: Lushan Han
Date: April 09, 2013
Format: TAR.GZIP Compressed
Number of downloads: 639
Access Control: Publicly Available