Akshay Java defended his PhD dissertation this fall on discovering communities in social media systems and the submitted version is now available online. Akshay is now a scientist at Microsoft’s Live Labs. The citation, link and abstract are below.
Akshay Java, Mining Social Media Communities and Content, Ph.D. Dissertation, Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, December 1, 2008. Available at http://ebiquity.umbc.edu/paper/html/id/429/Mining-Social-Media-Communities-and-Content.
Social media is changing the way people find information, share knowledge and communicate with each other. The important factor contributing to the growth of these technologies is the ability to easily produce “user-generated content”. Blogs, Twitter, Wikipedia, Flickr and YouTube are just a few examples of Web 2.0 tools that are drastically changing the Internet landscape today. These platforms allow users to produce and annotate content and, more importantly, empower them to share information with their social network. Friends can, in turn, comment and interact with the producer of the original content and also with each other. Such social interactions foster communities in online social media systems. User-generated content and the social graph are thus the two essential elements of any social media system.
Given the vast amount of user-generated content being produced each day and the easy access to the social graph, how can we analyze the structure and content of social media data to understand the nature of online communication and collaboration in social applications? This thesis presents a systematic study of the social media landscape through the combined analysis of its special properties, structure and content.
First, we have developed a framework for analyzing social media content effectively. The BlogVox opinion retrieval system is a large-scale blog indexing and content analysis engine. For a given query term, the system retrieves and ranks blog posts expressing sentiments (either positive or negative) toward that term. Further, we have developed a framework to index and semantically analyze syndicated feeds from news websites. We use a sophisticated natural language processing system, OntoSem, to semantically analyze news stories and build a rich fact repository of knowledge extracted from real-time feeds. It enables other applications to benefit from such deep semantic analysis by exporting the text meaning representations in the Semantic Web language OWL.
Secondly, we describe novel algorithms that utilize the special structure and properties of social graphs to detect communities in social media. Communities are an essential element of social media systems, and detecting their structure and membership is critical in several real-world applications. Many algorithms for community detection are computationally expensive and generally do not scale well to large networks. In this work we present an approach that benefits from the scale-free distribution of node degrees to extract communities efficiently. Social media sites frequently allow users to provide additional meta-data about the shared resources, usually in the form of tags or folksonomies. We have developed a new community detection algorithm that can combine information from tags and the structural information obtained from the graphs to effectively detect communities. We demonstrate how structure and content analysis in social media can benefit from the availability of rich meta-data and special properties.
Finally, we study social media systems from the user perspective. In the first study we present an analysis of how a large population of users subscribes to and organizes the blog feeds that they read. This study has revealed interesting properties and characteristics of the way we consume information. We are the first to present an approach to what is now known as the “feed distillation” task, which involves finding relevant feeds for a given query term. Based on our understanding of feed subscription patterns we have built a prototype system that provides recommendations for new feeds to subscribe to and measures the readership-based influence of blogs in different topics.
We are also the first to measure the usage and nature of communities in a relatively new phenomenon called microblogging. Microblogging is a new form of communication in which users can describe their current status in short posts distributed by instant messages, mobile phones, email or the Web. In this study, we present our observations of the microblogging phenomenon and user intentions by studying the content, topological and geographical properties of such communities. We find that microblogging provides users with a more immediate form of communication to talk about their daily activities and to seek or share information.
The course of this research has highlighted several challenges that processing social media data presents. This class of problems requires us to re-think our approach to text mining, community and graph analysis. Comprehensive understanding of social media systems allows us to validate theories from social sciences and psychology, but on a scale much larger than ever imagined. Ultimately this leads to a better understanding of how we communicate and interact with each other today and in the future.