Abstract: Weblogs, or blogs, are an important new way to publish information, engage in discussions, and form communities on the Internet. Blogs are a global phenomenon, and with numbers well over 100 million they form the core of the emerging paradigm of Social Media. While the utility of blogs is unquestionable, a serious problem now afflicts them: spam. Spam blogs, or splogs, are blogs with auto-generated or plagiarized content whose sole purpose is to host profitable contextual ads and/or inflate the importance of linked-to sites. Though estimates vary, splogs account for more than 50% of blog content and present a serious threat to the continued utility of blogs.
Splogs affect search engines, whether they index the entire Web or just the blogosphere, by increasing computational overhead and reducing user satisfaction. Hence, search engines try to minimize the influence of spam, both prior to indexing and after indexing, by eliminating splogs, comment spam, social media spam, or generic web spam. In this work we further the state of the art of splog detection prior to indexing.
First, we have identified and developed techniques that are effective for splog detection in a supervised machine learning setting. While some of these are novel, others confirm that techniques that have worked well for e-mail and Web spam detection are also useful in a new domain, i.e., the blogosphere. Specifically, our techniques identify spam blogs using the URL, home page, and syndication feeds of a blog. So that our techniques can be applied prior to indexing, we emphasize fast online detection.
Second, to effectively utilize the identified techniques in a real-world context, we have developed a novel system that filters out spam in a stream of update pings from blogs. Our approach applies filters serially in order of increasing detection cost, which better supports balancing cost and effectiveness. We have used such a system to support multiple blog-related projects, both internally and externally.
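The serial arrangement of filters described above can be illustrated with a minimal sketch. It is not the system built in this work: the two stages, the word list, the "buy now" rule, and the names `url_filter`, `make_content_filter`, and `filter_pings` are all hypothetical, chosen only to show how a cheap stage can settle a verdict before a costlier stage has to fetch anything.

```python
from typing import Callable, Iterable, Iterator, Optional, Sequence

# A filter returns True (spam), False (not spam), or None (undecided).
Filter = Callable[[str], Optional[bool]]

SPAMMY_URL_TOKENS = ("cheap", "casino", "pills")  # hypothetical word list


def url_filter(url: str) -> Optional[bool]:
    """Cheapest stage: inspects only the URL string, no network access."""
    lowered = url.lower()
    if any(token in lowered for token in SPAMMY_URL_TOKENS):
        return True
    return None  # undecided, defer to a costlier stage


def make_content_filter(fetch: Callable[[str], str]) -> Filter:
    """Costlier stage: needs the page text, obtained via the supplied fetch()."""
    def content_filter(url: str) -> Optional[bool]:
        text = fetch(url).lower()
        return text.count("buy now") >= 3  # hypothetical rule
    return content_filter


def filter_pings(pings: Iterable[str], filters: Sequence[Filter]) -> Iterator[str]:
    """Yield the pinged URLs that no stage flags as spam."""
    for url in pings:
        verdict: Optional[bool] = None
        for stage in filters:  # ordered from cheapest to most expensive
            verdict = stage(url)
            if verdict is not None:
                break  # decided early; skip the costlier stages
        if not verdict:
            yield url
```

Passing `[url_filter, make_content_filter(fetch)]` as `filters` means a page is fetched only for URLs the first stage could not decide, which is where the trade-off between cost and effectiveness shows up.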
Next, motivated by these experiences and by input from real-world deployments of our techniques for over a year, we have developed an approach for updating classifiers in an adversarial setting. We show how an ensemble of classifiers can co-evolve and adapt when used on a stream of unlabeled instances susceptible to concept drift. We discuss how our system is amenable to such evolution and describe approaches that can feed into it.
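One way to picture classifiers co-evolving on unlabeled data is a co-training-style loop, sketched below. This is an illustrative stand-in and not necessarily the method developed in this work: it assumes two classifiers that see different views of a blog (URL tokens and text tokens), a simple smoothed token-count model, and a hypothetical confidence `margin`. When one classifier is confident about an instance and the other is not, the confident prediction is used as a training label for the other.

```python
import math
from collections import Counter
from typing import Iterable, List, Tuple


class TokenClassifier:
    """Tiny token-count model with add-one smoothing (illustrative only)."""

    def __init__(self) -> None:
        self.counts = {True: Counter(), False: Counter()}
        self.totals = {True: 0, False: 0}

    def update(self, tokens: List[str], is_spam: bool) -> None:
        self.counts[is_spam].update(tokens)
        self.totals[is_spam] += len(tokens)

    def spam_log_odds(self, tokens: List[str]) -> float:
        """Positive values lean spam, negative lean non-spam, 0 is undecided."""
        vocab = len(set(self.counts[True]) | set(self.counts[False])) + 1
        score = 0.0
        for token in tokens:
            p_spam = (self.counts[True][token] + 1) / (self.totals[True] + vocab)
            p_ham = (self.counts[False][token] + 1) / (self.totals[False] + vocab)
            score += math.log(p_spam / p_ham)
        return score


def co_evolve(
    url_clf: TokenClassifier,
    text_clf: TokenClassifier,
    stream: Iterable[Tuple[List[str], List[str]]],
    margin: float = 1.0,  # hypothetical confidence threshold on the log odds
) -> None:
    """Let each classifier teach the other on unlabeled (url, text) pairs."""
    for url_tokens, text_tokens in stream:
        url_score = url_clf.spam_log_odds(url_tokens)
        text_score = text_clf.spam_log_odds(text_tokens)
        url_sure = abs(url_score) >= margin
        text_sure = abs(text_score) >= margin
        if url_sure and not text_sure:
            text_clf.update(text_tokens, url_score > 0)
        elif text_sure and not url_sure:
            url_clf.update(url_tokens, text_score > 0)
```

In this sketch an instance is skipped when both classifiers are confident or both are unsure; a deployed system would also need some guard against the classifiers reinforcing each other's mistakes, which is outside the scope of this example.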
Finally, over the course of this work we have characterized the specific nature of spam blogs along various dimensions, formalized the problem, and created general awareness of the issue. We are the first to formalize and address the problem of spam in blogs and to identify the general problem of spam in Social Media. We discuss how the lessons learned can guide follow-up work on spam in social media, an important new problem on the Web.
Committee: Prof. Tim Finin (Chair), Prof. Anupam Joshi, Prof. Yelena Yesha, Prof. Tim Oates, Dr. James Mayfield (JHU/APL), Dr. Nicolas Nicolov (Umbria)