UMBC ebiquity research group Building intelligent systems in open, heterogeneous, dynamic, distributed environments
16 May 2008, 20:38:19 EDT  
Pranam Kolari PhD dissertation defense: Detecting Spam Blogs

By Tim Finin on Thursday, September 27th, 2007 at 1:00 pm.

Earlier this week Pranam Kolari successfully defended his Ph.D. dissertation, Detecting Spam Blogs: An Adaptive Online Approach. Here’s a video of the defense.

Abstract: Weblogs, or blogs, are an important new way to publish information, engage in discussions, and form communities on the Internet. Blogs are a global phenomenon, and with numbers well over 100 million they form the core of the emerging paradigm of Social Media. While the utility of blogs is unquestionable, a serious problem now afflicts them: spam. Spam blogs, or splogs, are blogs with auto-generated or plagiarized content whose sole purpose is to host profitable contextual ads and/or inflate the importance of linked-to sites. Though estimates vary, splogs account for more than 50% of blog content and present a serious threat to the continued utility of blogs.

Splogs impact search engines that index the entire Web or just the blogosphere by increasing computational overhead and reducing user satisfaction. Hence, search engines try to minimize the influence of spam, both prior to indexing and after indexing, by eliminating splogs, comment spam, social media spam, or generic web spam. In this work we further the state of the art of splog detection prior to indexing.

First, we have identified and developed techniques that are effective for splog detection in a supervised machine learning setting. While some of these are novel, a few others confirm that techniques which have worked well for e-mail and Web spam detection remain useful in a new domain, i.e., the blogosphere. Specifically, our techniques identify spam blogs using a blog's URL, home page, and syndication feeds. So that our techniques can be applied prior to indexing, our effort emphasizes fast online detection.
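As a rough illustration of what a URL-based feature might look like, here is a minimal sketch that turns a blog URL into a bag of tokens a supervised classifier could consume. The function names `url_tokens` and `url_features`, the tokenization rule, and the example URL are all assumptions made for illustration; the abstract does not say how the dissertation actually extracts features.

```python
import re
from collections import Counter


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens (hypothetical tokenizer)."""
    # Drop the scheme, then split on every run of non-letter characters.
    stripped = re.sub(r"^[a-z]+://", "", url.lower())
    return [tok for tok in re.split(r"[^a-z]+", stripped) if tok]


def url_features(url):
    """Bag-of-tokens counts, usable as input features for a classifier."""
    return Counter(url_tokens(url))


if __name__ == "__main__":
    # Made-up URL, used only to show the shape of the features.
    print(url_features("http://cheap-loans4u.example.com/blog"))
```

Features like these are cheap to compute because they need no page fetch, which fits the emphasis on fast online detection.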

Second, to effectively utilize the identified techniques in a real-world context, we have developed a novel system that filters out spam in a stream of update pings from blogs. Our approach applies filters serially, in order of increasing detection cost, which better supports balancing cost and effectiveness. We have used such a system to support multiple blog-related projects, both internally and externally.
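The idea of running filters serially in order of increasing cost can be sketched as a simple cascade in which each filter either decides or passes the ping on to the next, more expensive one. Everything below is hypothetical: the filter names, the blacklist contents, the keyword rule, and the stub for the expensive content step are placeholders, not the filters used in the actual system.

```python
from typing import Callable, List, Optional, Tuple

# A filter returns True (spam), False (not spam), or None (undecided).
Filter = Callable[[str], Optional[bool]]

BLACKLIST = {"spam.example.com"}  # hypothetical known-splog hosts


def blacklist_filter(url: str) -> Optional[bool]:
    """Cheapest step: exact host lookup."""
    host = url.split("//")[-1].split("/")[0]
    return True if host in BLACKLIST else None


def url_keyword_filter(url: str) -> Optional[bool]:
    """Mid-cost step: a toy keyword rule on the URL string."""
    return True if "cheap-pills" in url else None


def content_filter(url: str) -> Optional[bool]:
    """Stand-in for the costly step (fetching and classifying the page)."""
    return False  # this stub always answers "not spam"


FILTERS: List[Tuple[str, Filter]] = [
    ("blacklist", blacklist_filter),
    ("url-keywords", url_keyword_filter),
    ("content", content_filter),
]


def cascade(filters: List[Tuple[str, Filter]], url: str) -> Tuple[bool, str]:
    """Return (is_spam, name of the filter that decided)."""
    for name, check in filters:
        verdict = check(url)
        if verdict is not None:
            return verdict, name  # stop early; later filters never run
    return False, "default"


if __name__ == "__main__":
    for ping in ["http://spam.example.com/", "http://blog.example.org/"]:
        print(ping, cascade(FILTERS, ping))
```

Because most pings are settled by the early, cheap filters, the expensive step runs only on the remainder, which is one way to trade cost against effectiveness.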

Next, motivated by these experiences and by input from real-world deployments of our techniques for over a year, we have developed an approach for updating classifiers in an adversarial setting. We show how an ensemble of classifiers can co-evolve and adapt when used on a stream of unlabeled instances susceptible to concept drift. We also discuss how our system is amenable to such evolution and describe approaches that can feed into it.
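One way to picture classifiers adapting together on unlabeled data is a co-training style update, where a confident classifier supplies a label that retrains a less confident one. The sketch below is an assumption about what such a scheme could look like, not the method from the dissertation; `TokenClassifier`, `co_update`, the 0.9 threshold, and the two "views" of a blog are all invented for illustration.

```python
import math
from collections import Counter


class TokenClassifier:
    """Tiny naive Bayes style classifier over one view's tokens (toy model)."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}

    def update(self, tokens, is_spam):
        self.counts[is_spam].update(tokens)

    def spam_probability(self, tokens):
        # Log-odds with add-one smoothing; equal class priors are assumed.
        spam_total = sum(self.counts[True].values())
        ham_total = sum(self.counts[False].values())
        vocab = len(set(self.counts[True]) | set(self.counts[False])) + 1
        log_odds = 0.0
        for tok in tokens:
            p_spam = (self.counts[True][tok] + 1) / (spam_total + vocab)
            p_ham = (self.counts[False][tok] + 1) / (ham_total + vocab)
            log_odds += math.log(p_spam / p_ham)
        return 1.0 / (1.0 + math.exp(-log_odds))


def co_update(clf_a, clf_b, view_a, view_b, threshold=0.9):
    """If exactly one classifier is confident, it labels the instance for the other."""
    pa = clf_a.spam_probability(view_a)
    pb = clf_b.spam_probability(view_b)
    conf_a = max(pa, 1 - pa) >= threshold
    conf_b = max(pb, 1 - pb) >= threshold
    if conf_a and not conf_b:
        clf_b.update(view_b, pa >= 0.5)
        return "a_taught_b"
    if conf_b and not conf_a:
        clf_a.update(view_a, pb >= 0.5)
        return "b_taught_a"
    return "no_update"


# Seed one classifier with a few labeled examples; the other starts empty.
url_clf, feed_clf = TokenClassifier(), TokenClassifier()
for _ in range(3):
    url_clf.update(["cheap", "pills"], True)
    url_clf.update(["research", "notes"], False)

# An unlabeled instance seen through two views (URL tokens, feed tokens).
outcome = co_update(url_clf, feed_clf, ["cheap", "pills"], ["buy", "now"])
```

Here the URL classifier is confident while the feed classifier is not, so the feed classifier learns from the URL classifier's label without any human labeling. A real deployment would also need safeguards against classifiers reinforcing each other's mistakes.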

Finally, over the course of this work we have characterized the specific nature of spam blogs along various dimensions, formalized the problem, and created general awareness of the issue. We are the first to formalize and address the problem of spam in blogs and to identify the general problem of spam in Social Media. We discuss how lessons learned can guide follow-up work on spam in social media, an important new problem on the Web.

Committee: Prof. Tim Finin (Chair), Prof. Anupam Joshi, Prof. Yelena Yesha, Prof. Tim Oates, Dr. James Mayfield (JHU/APL), Dr. Nicolas Nicolov (Umbria)

Related posts: Vote for Pranam; UMBC blog research on splogs in Baltimore Sun; ebiquity splog research mentioned in Wired article

 

 
