Detecting Spam Blogs: An Adaptive Online Approach
Tuesday, September 25, 2007, 2:00pm - 4:00pm
Spam blogs (splogs) burden search engines, whether they index the entire Web or only the blogosphere, by increasing computational overhead and reducing user satisfaction. Hence, search engines try to minimize the influence of spam, both before and after indexing, by eliminating splogs, comment spam, social media spam, and generic web spam. In this work we advance the state of the art in splog detection prior to indexing.
First, we have identified and developed techniques that are effective for splog detection in a supervised machine learning setting. While some of these are novel, others confirm that techniques that have worked well for e-mail and Web spam detection also carry over to a new domain, the blogosphere. Specifically, our techniques identify spam blogs using the URL, the home page, and syndication feeds. Because these techniques must be usable prior to indexing, our effort emphasizes fast online detection.
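As an illustration of what a cheap, URL-only signal might look like, the sketch below turns a blog URL into a bag of word tokens and character n-grams that a supervised classifier could consume. The abstract does not specify the actual features used, so the function name `url_features`, the `tok=`/`ng=` prefixes, and the n-gram length are assumptions made purely for this example.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def url_features(url, n=3):
    """Return a bag of word tokens and character n-grams from a blog URL.

    Hypothetical feature extractor: the feature scheme is an assumption,
    not the method described in the talk.
    """
    parsed = urlparse(url)
    # Use only host and path; the scheme carries no useful signal here.
    text = (parsed.netloc + parsed.path).lower()
    # Split on anything that is not a lowercase letter or digit.
    tokens = [t for t in re.split(r"[^a-z0-9]+", text) if t]
    features = Counter("tok=" + t for t in tokens)
    # Character n-grams catch fragments of words glued together in hostnames.
    for t in tokens:
        for i in range(len(t) - n + 1):
            features["ng=" + t[i:i + n]] += 1
    return features
```

Because the extractor needs nothing beyond the URL string, it can run on every incoming ping without fetching any content, which is the property that matters for detection prior to indexing.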
Second, to apply these techniques in a real-world context, we have developed a novel system that filters spam out of a stream of update pings from blogs. The system applies filters serially, in order of increasing detection cost, which makes it easier to balance cost against effectiveness. We have used this system to support multiple blog-related projects, both internally and externally.
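The idea of applying filters serially in order of increasing cost can be sketched as a simple cascade: each filter either returns a verdict or abstains, and a ping only reaches an expensive filter if every cheaper one abstained. The filter names, costs, and rules below are hypothetical placeholders, not the components of the deployed system.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Filter:
    name: str
    cost: float  # relative cost of running this filter (assumed units)
    # Returns True (spam), False (not spam), or None (abstain).
    classify: Callable[[str], Optional[bool]]


def run_cascade(filters: List[Filter], url: str) -> Tuple[Optional[bool], str, float]:
    """Run filters from cheapest to most expensive; stop at the first verdict."""
    spent = 0.0
    for f in sorted(filters, key=lambda f: f.cost):
        spent += f.cost
        verdict = f.classify(url)
        if verdict is not None:
            return verdict, f.name, spent
    return None, "undecided", spent


# Hypothetical filters, for illustration only.
KNOWN_SPLOGS = {"http://spam.example/"}

FILTERS = [
    # Placeholder for an expensive fetch-and-classify step; always decides.
    Filter("home-page", 50.0, lambda url: False),
    Filter("url-model", 5.0, lambda url: True if "cheap-loans" in url else None),
    Filter("blacklist", 1.0, lambda url: True if url in KNOWN_SPLOGS else None),
]
```

With these placeholders, a blacklisted URL is rejected after spending one cost unit, while a URL that no cheap filter recognizes pays the full cost of all three stages. Tuning where each filter abstains is what trades cost against effectiveness.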
Next, motivated by these experiences and by input from real-world deployments of our techniques over more than a year, we have developed an approach for updating classifiers in an adversarial setting. We show how an ensemble of classifiers can co-evolve and adapt when used on a stream of unlabeled instances susceptible to concept drift. We also discuss how our system accommodates such evolution and which approaches can feed into it.
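One way to picture classifiers co-evolving on unlabeled data is a co-training-style loop: when one classifier is confident about an instance and another is not, the confident prediction is used as a training label for the uncertain one. The sketch below is an assumption-laden illustration of that general idea, not the approach presented in the talk; `TokenModel`, `co_adapt`, and the confidence threshold are all hypothetical.

```python
import math
from collections import Counter


class TokenModel:
    """Tiny online naive Bayes over tokens (equal class priors assumed)."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}
        self.totals = {True: 0, False: 0}

    def update(self, tokens, is_spam):
        for t in tokens:
            self.counts[is_spam][t] += 1
            self.totals[is_spam] += 1

    def spam_probability(self, tokens):
        # Add-one smoothing over the vocabulary seen so far.
        vocab = len(set(self.counts[True]) | set(self.counts[False])) + 1
        log_odds = 0.0
        for t in tokens:
            p_spam = (self.counts[True][t] + 1) / (self.totals[True] + vocab)
            p_ham = (self.counts[False][t] + 1) / (self.totals[False] + vocab)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-30.0, min(30.0, log_odds))  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-log_odds))


def co_adapt(models, views, threshold=0.9):
    """Let confident models teach uncertain ones on one unlabeled instance.

    `views` holds one token list per model (e.g. URL tokens, feed tokens).
    """
    probs = [m.spam_probability(v) for m, v in zip(models, views)]
    for i, p in enumerate(probs):
        if p >= threshold or p <= 1 - threshold:
            label = p >= 0.5
            for j, (m, v) in enumerate(zip(models, views)):
                if j != i and 1 - threshold < probs[j] < threshold:
                    m.update(v, label)
    return probs
```

For example, a URL model that already recognizes a spammy hostname can label the instance for a feed model that has never seen its vocabulary, so the feed model picks up new spam terms as they drift. A real deployment would also need safeguards against confident mistakes reinforcing themselves.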
Finally, over the course of this work we have characterized the specific nature of spam blogs along various dimensions, formalized the problem, and raised general awareness of the issue. We are the first to formalize and address the problem of spam in blogs and to identify the general problem of spam in social media. We discuss how the lessons learned can guide follow-up work on spam in social media, an important new problem on the Web.