Dataset
Splog Blog Dataset
November 14, 2006
Format: gzip-compressed tar archive (.tar.gz)
This dataset consists of 3000 blog homepages, of which 700 are labeled as splogs and another 700 as authentic blogs.
This training set was used to produce the results in three papers, which focus on identifying blogs [1], detecting spam blogs [2], and analysing the splogosphere [3].
This collection can be used for further experiments with splogs, or for building filters that could be deployed in real-world systems. We and our academic and industrial collaborators have been using such filters to eliminate spam blogs, with good results.
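Because the dataset is distributed as a gzip-compressed tar archive, it can be inspected before extraction with Python's standard tarfile module. The sketch below is only an illustration: the function name summarize_archive and the filename splog_dataset.tar.gz are hypothetical, and the archive's internal directory layout is not described on this page, so the code simply counts files per top-level directory rather than assuming any particular folder names.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_archive(path):
    """Count regular files per top-level directory in a .tar.gz archive.

    Files stored at the archive root are counted under ".".
    """
    counts = Counter()
    with tarfile.open(path, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            parts = PurePosixPath(member.name).parts
            top = parts[0] if len(parts) > 1 else "."
            counts[top] += 1
    return counts


# Example usage (the filename is an assumption, not the real download name):
#   for directory, n in sorted(summarize_archive("splog_dataset.tar.gz").items()):
#       print(directory, n)
```

If the per-directory counts line up with the 700 splog and 700 authentic labels described above, the directories probably correspond to the labels, but that should be confirmed against the archive itself.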
[1] Pranam Kolari, Tim Finin, Anupam Joshi, SVMs for the Blogosphere: Blog Identification and Splog Detection, AAAI Spring Symposium on Computational Approaches to Analysing Weblogs, March 2006
[2] Pranam Kolari, Akshay Java, Tim Finin, Tim Oates, Anupam Joshi, Detecting Spam Blogs: A Machine Learning Approach, Proceedings of the 21st National Conference on Artificial Intelligence (AAAI 2006), July 2006
[3] Pranam Kolari, Akshay Java, Tim Finin, Characterizing the Splogosphere, 3rd Annual Workshop on Weblogging Ecosystem: Aggregation, Analysis and Dynamics, 15th World Wide Web Conference, May 2006