UMBC ebiquity
Web Spam Reference Collection

Pranam Kolari, 1:00pm 13 December 2006

Researchers at Yahoo! Research Lab – Barcelona are hosting a collection of labeled web spam hosts, which they call WEBSPAM-UK2006. The dataset consists of around 2725 hosts for which at least two labels agree.
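To make the agreement criterion mentioned above concrete, here is a minimal sketch of how hosts might be filtered by label agreement. This is not the procedure the WEBSPAM-UK2006 authors used: the host names, the "spam"/"normal" label values, the `agreed_label` helper, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter


def agreed_label(labels, min_agreement=2):
    """Return the label that at least `min_agreement` judgments share, else None.

    Hypothetical helper: a tie between the two most frequent labels is
    treated as "no agreement" (an assumption, not a documented rule).
    """
    if not labels:
        return None
    ranked = Counter(labels).most_common()
    top_label, top_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == top_count:
        return None  # tie: no single label wins
    return top_label if top_count >= min_agreement else None


# Made-up judgments per host; the real collection has its own file format.
judgments = {
    "example-shop.co.uk": ["spam", "spam"],
    "example-news.co.uk": ["normal", "normal", "spam"],
    "example-blog.co.uk": ["spam", "normal"],
    "example-forum.co.uk": ["normal"],
}

# Keep only hosts where enough judgments agree on one label.
reference = {}
for host, labels in judgments.items():
    label = agreed_label(labels)
    if label is not None:
        reference[host] = label

print(reference)
```

Under these assumptions, hosts with a single judgment or with split judgments are dropped, which trades size for cleanliness.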

The goal of our dataset activity is to make available reference collections that should be:

  • Large: the collections should include many examples of spam and non-spam content.
  • Clean: the collections should contain few classification errors.
  • Uniform: the collections should represent a uniform random sample over a set of pages or hosts.
  • Broad: the collections should include as many different Web spam aspects as possible.
  • Open: the collections should be freely available for researchers.

We came across similar problems while creating a labeled dataset of spam blogs late last year. The creation of this new collection makes important contributions toward addressing some of these issues. A paper describing the collection is also available online [PDF].
