UMBC ebiquity research group Building intelligent systems in open, heterogeneous, dynamic, distributed environments
Web Spam Reference Collection

Pranam Kolari, 1:00pm 13 December 2006

Researchers at Yahoo! Research Lab – Barcelona are hosting a collection of labeled web spam hosts, which they call WEBSPAM-UK2006. The dataset consists of around 2725 hosts for which at least two labels agree.
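As a rough illustration of what "at least two labels agree" can mean in practice, here is a minimal Python sketch that keeps a host only when one label was assigned at least twice and no other label ties with it. The host names, label values, and the `agreed_label` helper are all hypothetical; the actual labeling procedure for WEBSPAM-UK2006 is the one described in the authors' paper, not this sketch.

```python
from collections import Counter


def agreed_label(labels, min_agree=2):
    """Return the label given by at least `min_agree` judgments, or None.

    Ties between the two most frequent labels count as no agreement.
    This is an illustrative rule, not the collection's documented one.
    """
    top = Counter(labels).most_common(2)
    if not top:
        return None
    label, count = top[0]
    if count < min_agree:
        return None
    if len(top) > 1 and top[1][1] == count:
        return None  # two labels tie, so there is no clear agreement
    return label


# Hypothetical judgments: host -> labels from individual assessors.
judgments = {
    "example-a.co.uk": ["spam", "spam"],
    "example-b.co.uk": ["normal", "spam"],
    "example-c.co.uk": ["normal", "normal", "spam"],
    "example-d.co.uk": ["spam"],
}

# Keep only the hosts whose judgments reach agreement.
reference = {}
for host, labels in judgments.items():
    label = agreed_label(labels)
    if label is not None:
        reference[host] = label
```

With these made-up judgments, only the first and third hosts would enter the reference set; hosts with conflicting or single judgments are left out.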

The goal of our dataset activity is to make available reference collections that should be:

  • Large: the collections should include many examples of spam and non-spam content.
  • Clean: the collections should contain few classification errors.
  • Uniform: the collections should represent a uniform random sample over a set of pages or hosts.
  • Broad: the collections should include as many different Web spam aspects as possible.
  • Open: the collections should be freely available for researchers.

We came across similar problems while creating a labeled dataset of spam blogs late last year. The creation of this new collection makes important contributions toward addressing some of these issues. A paper describing the collection is also available online [PDF].
