The Multi-Relational Blogosphere: Empirical Characterization and Spam Protection
Wednesday, May 10, 2006, 1:00pm - 2:30pm
Weblogs, or blogs, have become an important new way to publish information, engage in discussions and form communities. Blogs collectively constitute the blogosphere, a highly influential and dynamic subset of the Web. The nature of their content and publishing infrastructure requires that they be modeled, harvested and analyzed differently from the rest of the Web.
We first propose a model for the blog graph that extends the more general web graph. The Web is viewed as a graph G(V, E), where V is the set of pages and E represents the hyperlinks between them. With a focus on the blogosphere, we view the web graph at a much finer granularity. Each entity v in V can also be assigned to a subset, such as the blogosphere or web news sources. In addition, every post hosted by a blog can be considered to consist of a Title, Content, Time, Tag, Author and Comment. This multi-relational conceptualization and its instantiation are made possible through structured publishing on the blogosphere, enabled by RSS (RDF Site Summary), OPML (Outline Processor Markup Language), DC (Dublin Core) and FOAF (Friend of a Friend), all of which are popular metadata vocabularies.
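The model above can be made concrete with a small data-structure sketch. This is only an illustration under our own assumptions, not the representation used in the work: the class names `Post` and `BlogGraph`, the method names, the subset labels "blog" and "news", and the example URLs are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Set, Tuple


@dataclass
class Post:
    """One blog post with the six fields named in the abstract."""
    title: str
    content: str
    time: datetime
    tags: List[str] = field(default_factory=list)
    author: str = ""
    comments: List[str] = field(default_factory=list)


@dataclass
class BlogGraph:
    """Web graph G(V, E) with subset labels on nodes and posts on blogs."""
    # V: page URL -> subset labels (hypothetical labels such as "blog", "news")
    nodes: Dict[str, Set[str]] = field(default_factory=dict)
    # E: directed hyperlinks stored as (source URL, target URL)
    edges: Set[Tuple[str, str]] = field(default_factory=set)
    # Posts hosted by each blog URL
    posts: Dict[str, List[Post]] = field(default_factory=dict)

    def add_node(self, url: str, *labels: str) -> None:
        self.nodes.setdefault(url, set()).update(labels)

    def add_link(self, src: str, dst: str) -> None:
        # Make sure both endpoints exist in V before adding the edge.
        self.add_node(src)
        self.add_node(dst)
        self.edges.add((src, dst))

    def add_post(self, blog_url: str, post: Post) -> None:
        # Hosting a post implies the node belongs to the blog subset.
        self.add_node(blog_url, "blog")
        self.posts.setdefault(blog_url, []).append(post)

    def subset(self, label: str) -> Set[str]:
        """Return every node in V that carries the given subset label."""
        return {url for url, labels in self.nodes.items() if label in labels}


# Example with made-up URLs
g = BlogGraph()
g.add_node("http://news.example.org/story", "news")
g.add_post(
    "http://blog.example.org",
    Post(title="Hello", content="First post", time=datetime(2006, 5, 10),
         tags=["intro"], author="alice"),
)
g.add_link("http://blog.example.org", "http://news.example.org/story")
```

Keeping the subset labels on the nodes, rather than in separate graphs, lets the same hyperlink set E be queried across blogs and news sources.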
We next propose to characterize instances of this multi-relational model, including local content and the link structure among the various entities. We will identify the boundaries of the blogosphere, clarify what features make it different from the rest of the Web, and study the nature of spam. This characterization will be based both on publicly available blog datasets and on data collected using our own system, which is capable of discovering and harvesting blogs. We will share our experiences in implementing a blog harvesting system, describe the approaches we employed and their effectiveness, and provide new mechanisms that could be useful for timely content harvesting on the blogosphere.
We then propose to tackle spam afflicting the multi-relational blogosphere. We will formally distinguish spam in blogs from e-mail spam and generic web spam. We will provide algorithms and techniques that employ both local and relational models to detect and eliminate spam posts and the blogs hosting them. We will then explore how such techniques can be made to adapt and learn in an adversarial classification setting, as the mechanisms employed by spammers evolve. Based on our analysis of the algorithms and their cost, we will finally recommend a multi-step approach to eliminate spam blogs.
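One way to picture a multi-step approach that combines local and relational evidence is a cost-ordered cascade: a cheap content check decides the clear cases, and a link-based check is consulted only for the ambiguous ones. The sketch below is purely illustrative and is not the algorithm proposed in this work; the term list `SPAM_TERMS`, the thresholds, and all function names are hypothetical assumptions.

```python
from typing import Iterable, Set

# Hypothetical spam vocabulary, for illustration only.
SPAM_TERMS = {"casino", "pills", "loans"}


def local_score(text: str) -> float:
    """Local model stand-in: fraction of tokens found in SPAM_TERMS."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in SPAM_TERMS for t in tokens) / len(tokens)


def relational_score(out_links: Iterable[str], known_spam: Set[str]) -> float:
    """Relational model stand-in: fraction of out-links to known spam."""
    links = list(out_links)
    if not links:
        return 0.0
    return sum(link in known_spam for link in links) / len(links)


def classify(text: str, out_links: Iterable[str], known_spam: Set[str],
             low: float = 0.05, high: float = 0.5,
             rel_threshold: float = 0.5) -> str:
    """Two-step cascade; thresholds are arbitrary assumed values."""
    s = local_score(text)
    # Step 1: the cheap local check settles the clear cases.
    if s >= high:
        return "spam"
    if s < low:
        return "legitimate"
    # Step 2: ambiguous cases fall through to the link-based check.
    if relational_score(out_links, known_spam) >= rel_threshold:
        return "spam"
    return "legitimate"
```

In a real system the two scoring functions would be replaced by trained models, and the thresholds would be tuned against the measured cost of each step.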