UMBC ebiquity

UMBC webbase corpus

Description:

The UMBC WebBase corpus (http://ebiq.org/r/351) is a collection of English paragraphs, totaling over three billion words, processed from the February 2007 crawl of the Stanford WebBase project (http://bit.ly/WebBase). Compressed, it is about 13 GB in size.

The February 2007 crawl is one of the largest collections and contains 100 million web pages from more than 50,000 websites. The Stanford WebBase project did an excellent job of extracting textual content from HTML tags, but the crawl still contains many instances of duplicated text, truncated text, non-English text and strange characters.

We processed the collection to remove undesired sections and produce high quality English paragraphs. We detected paragraphs using heuristic rules and retained only those whose length was at least two hundred characters. We eliminated non-English text by checking the first twenty words of a paragraph to see if they were valid English words. We used the percentage of punctuation characters in a paragraph as a simple check for typical text. We removed duplicate paragraphs using a hash table. The result is a corpus with approximately three billion words of good quality English.
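The filtering steps described above can be illustrated with a minimal sketch. This is not the original processing code: the function names, the word list passed in as `vocabulary`, and the two ratio thresholds are assumptions made for illustration, since the description gives only the 200-character minimum and the twenty-word window.

```python
import hashlib
import string

MIN_CHARS = 200          # minimum paragraph length, as stated in the description
WORDS_TO_CHECK = 20      # number of leading words examined, as stated
MIN_ENGLISH_RATIO = 0.8  # assumption: the actual acceptance rule is not given
MAX_PUNCT_RATIO = 0.15   # assumption: the actual threshold is not given


def looks_english(paragraph, vocabulary):
    """Check the leading words of a paragraph against a set of English words."""
    words = [w.strip(string.punctuation).lower()
             for w in paragraph.split()[:WORDS_TO_CHECK]]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    if not words:
        return False
    known = sum(1 for w in words if w in vocabulary)
    return known / len(words) >= MIN_ENGLISH_RATIO


def punctuation_ratio(paragraph):
    """Fraction of characters that are ASCII punctuation."""
    if not paragraph:
        return 0.0
    return sum(1 for c in paragraph if c in string.punctuation) / len(paragraph)


def filter_paragraphs(paragraphs, vocabulary):
    """Yield paragraphs that pass the length, language, punctuation and duplicate checks."""
    seen = set()  # hash-based duplicate detection
    for p in paragraphs:
        p = p.strip()
        if len(p) < MIN_CHARS:
            continue
        if not looks_english(p, vocabulary):
            continue
        if punctuation_ratio(p) > MAX_PUNCT_RATIO:
            continue
        digest = hashlib.sha1(p.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield p
```

Storing a fixed-size digest rather than the paragraph itself keeps the duplicate table small, which matters at the scale of billions of words; whether the original hash table stored digests or full strings is not stated.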

The corpus is available as a 13 GB compressed tar file which is about 48 GB when uncompressed. It contains 408 files with paragraphs extracted from web pages, one to a line with blank lines between them. A second set of 408 files has the same paragraphs, but with the words tagged with their part of speech (e.g., The_DT Option_NN draws_VBZ on_IN modules_NNS from_IN all_PDT the_DT).
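As a rough sketch of how the two file layouts might be read, the helpers below iterate over paragraphs (one per line, skipping the blank separator lines) and split tagged tokens of the form word_TAG. The function names are hypothetical, and the choice to split on the last underscore is an assumption intended to cope with words that themselves contain underscores.

```python
def iter_paragraphs(lines):
    """Yield one paragraph per non-blank line, skipping the blank separators."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def split_tagged(paragraph):
    """Split a tagged paragraph into (word, tag) pairs.

    Tokens are split on the LAST underscore (an assumption). Tokens without
    an underscore are returned with a tag of None.
    """
    pairs = []
    for token in paragraph.split():
        word, sep, tag = token.rpartition("_")
        if sep:
            pairs.append((word, tag))
        else:
            pairs.append((token, None))
    return pairs
```

Passing an open file object to `iter_paragraphs` streams the corpus line by line, so a multi-gigabyte file never has to be loaded into memory at once. The text encoding of the files is not stated here, so opening them with an explicit encoding and a lenient error handler is a reasonable precaution.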

The dataset has been used in several projects. If you use the dataset, please refer to it by citing the following paper, which describes it and its use in a system that measures the semantic similarity of short text sequences.




UMBC WebBase corpus by Lushan Han, UMBC Ebiquity Lab is licensed under a Creative Commons Attribution 3.0 Unported License. Based on a work at http://ebiq.org/r/351.

Type: Dataset

Authors: Lushan Han

Date: April 09, 2013

Tags: text corpus, natural language processing

Format: TAR.GZIP Compressed

Number of downloads: 639

Access Control: Publicly Available

 

 

Related Projects:

Past Project

 Graph of Relations.

 

Assertions:

  1. (Resource) UMBC webbase corpus is a resource of (Publication) UMBC_EBIQUITY-CORE: Semantic Textual Similarity Systems
  2. (Resource) UMBC webbase corpus is a dataset of (Project) Graph of Relations