Mining Social Media Communities and Content
by Akshay Java
Thursday, October 16, 2008, 10:30am - Thursday, October 16, 2008, 12:00pm
Ph.D. Dissertation Defense
Social media is changing the way we find information, share knowledge and communicate with each other. An important factor contributing to the growth of these technologies is the ability to easily produce "user-generated content". Blogs, Twitter, Wikipedia, Flickr and YouTube are just a few examples of Web 2.0 tools that are drastically changing the Internet landscape today. These platforms allow users to produce, annotate and share information with their social network. Their combined content amounts to roughly four to five times the volume of edited text produced each day on the Web. Given the vast amount of user-generated content and easy access to the underlying social graph, we can now begin to understand the nature of online communication and collaboration in social applications. This thesis presents a systematic study of the social media landscape through the combined analysis of its special properties, structure and content.
First, we have developed techniques to effectively mine content from the blogosphere. The BlogVox opinion retrieval system is a large-scale blog indexing and content analysis engine. For a given query, the system retrieves and ranks blog posts expressing sentiments (either positive or negative) toward the query terms. We evaluate the system on a large, standard corpus of blogs with human-verified relevance assessments for opinions. Further, we have developed a framework to index and semantically analyze syndicated feeds from news websites. This system semantically analyzes news stories and builds a rich fact repository of knowledge extracted from real-time feeds.
Communities are an essential element of social media systems, and detecting their structure and membership is critical in several real-world applications. Many algorithms for community detection are computationally expensive and generally do not scale well to large networks. In this work we present an approach that benefits from the scale-free distribution of node degrees to extract communities efficiently. Social media sites frequently allow users to provide additional meta-data about the shared resources, usually in the form of tags or folksonomies. We have developed a new community detection algorithm that combines information from tags with the structural information obtained from the graphs to detect communities. We demonstrate how structure and content analysis in social media can benefit from the availability of rich meta-data and special properties.
Finally, we study social media systems from the user perspective. We present an analysis of how a large population of users subscribe to and organize the blog feeds that they read. This analysis reveals several interesting properties and characteristics of the way we consume information. With this understanding, we describe how social data can be leveraged for collaborative filtering, feed recommendation and clustering. Recent years have seen a number of new social tools emerge. Microblogging is a new form of communication in which users can describe their current status in short posts distributed by instant messages, mobile phones, email or the Web. We present our observations of the microblogging phenomenon and user intentions by studying the content, topological and geographical properties of such communities.
The course of this study spans an interesting period in the Web's history. Social media is connecting people and building online communities by bridging the gap between content production and consumption. Through our research, we have highlighted how social media data can be leveraged to find sentiments, extract knowledge and identify communities. Ultimately, this helps us understand how we communicate and interact in online social systems.
Committee:
- Dr. Tim Finin (Chair)
- Dr. Anupam Joshi
- Dr. Charles Nicholas
- Dr. Tim Oates
- Dr. James Mayfield, JHU/APL
- Dr. Belle Tseng, Yahoo!