Splog Blog Dataset

Pranam Kolari, Akshay Java, and Anupam Joshi

November 14, 2006

Format: gzip-compressed tar archive (TAR.GZIP)
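For readers without a separate extraction tool, a gzip-compressed tar archive can also be unpacked with Python's standard tarfile module. The following is a minimal sketch, not part of the dataset distribution: the function name extract_dataset, the archive file name splog_dataset.tar.gz, and the destination directory are all hypothetical, and nothing is assumed about the archive's internal layout.

```python
import tarfile
from pathlib import Path


def extract_dataset(archive_path, dest_dir):
    """Unpack a gzip-compressed tar archive; return the archived file names, sorted."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        members = archive.getmembers()
        for member in members:
            target = (dest / member.name).resolve()
            # Refuse any member that would be written outside the destination.
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest)
    # Report regular files only (directories are skipped).
    return sorted(m.name for m in members if m.isfile())


if __name__ == "__main__":
    # Hypothetical file and directory names; substitute the actual download.
    files = extract_dataset("splog_dataset.tar.gz", "splog_dataset")
    print(len(files), "files extracted")
```

The path check before extraction is a precaution for archives obtained from the web; it rejects entries that would escape the destination directory.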

This dataset consists of 3000 blog homepages, of which 700 are labeled as splogs and another 700 as authentic blogs.

This training set was used to produce the results in three papers, with emphasis on identifying blogs [1], detecting spam blogs [2], and analysing the splogosphere [3].

This collection can be used for further experiments with splogs, or for building filters that could be deployed in real-world systems. We and our academic and industrial collaborators have been using such filters to eliminate spam blogs, with good results.

[1] Pranam Kolari, Tim Finin, Anupam Joshi, SVMs for the Blogosphere: Blog Identification and Splog Detection, AAAI Spring Symposium on Computational Approaches to Analysing Weblogs, March 2006

[2] Pranam Kolari, Akshay Java, Tim Finin, Tim Oates, Anupam Joshi, Detecting Spam Blogs: A Machine Learning Approach, Proceedings of the 21st National Conference on Artificial Intelligence (AAAI 2006), July 2006

[3] Pranam Kolari, Akshay Java, Tim Finin, Characterizing the Splogosphere, 3rd Annual Workshop on Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 15th World Wide Web Conference, May 2006



