Web Spam Reference Collection
Pranam Kolari, 1:00pm 13 December 2006
Researchers at Yahoo! Research Lab - Barcelona are hosting a collection of labeled web spam hosts, which they call WEBSPAM-UK2006. The dataset consists of around 2725 hosts that have agreement across at least two labels.
The goal of our dataset activity is to make available reference collections that should be:
- Large: the collections should include many examples of spam and non-spam content.
- Clean: the collections should contain few classification errors.
- Uniform: the collections should represent a uniform random sample over a set of pages or hosts.
- Broad: the collections should include as many different Web spam aspects as possible.
- Open: the collections should be freely available for researchers.
We came across similar problems while creating a labeled dataset on spam blogs late last year. The creation of this new collection has made important contributions toward addressing some of these issues. A paper describing the collection is also available online [PDF].

