Videos of 2008 ICWSM presentations

December 31st, 2008

The submission deadline for the Third International Conference on Weblogs and Social Media is just three weeks away. To inspire yourself to work on a submission, you can check out the videos from the 2008 ICWSM, which are online. Here are highlights.

UMBC ties for first in 2008 Pan-American Intercollegiate Team Chess Championship

December 30th, 2008

Congratulations to the UMBC Chess team and their advisor and our colleague, UMBC CSEE Professor Alan Sherman, for a first place tie in the 54th Pan-American Intercollegiate Team Chess Championship.

UMBC tied for first place with University of Texas at Dallas (B Team) in the sixth and final round of the three-day 2008 Pan-Am Championship, which was held in Dallas. This year 29 four-person college teams competed in the annual event, which is known as the “World Series of College Chess”. UMBC has now won the Pan-Am tournament a record eight times. The final standings are available at swchess.

The two first-place winners will meet again with the third- and fourth-place teams, the University of Texas at Brownsville and Stanford, in the special Final Four of Chess tournament, which will be held in spring 2009.

The UMBC chess team. Front row, L to R: WGM Sabina Foisor, GM Timur Gareev, GM Sergey Erenburg, and GM Leonid Kritz (board one). Back row: UMBC coaches GM Sam Palatnik and NM Igor Epshteyn. Photo: Alexey Root.

Wikirage tracks what’s hot on Wikipedia

December 30th, 2008

Wikirage is yet another way to track what’s happening in the world via changes in social media, in this case, Wikipedia. As the site suggests, “popular people in the news, the latest fads, and the hottest video games can be quickly identified by monitoring this social phenomenon.”

Wikirage lists the 100 Wikipedia pages that are being heavily edited over any of six time periods from the last hour to the last month. You can see the top 100 by your choice of six metrics: number of quality edits, unique editors, total edits, vandalism, reversions, or undos. Clicking on a result shows a monthly summary for the article, for example, December 2008 Gaza Strip airstrikes, which is at the top of today’s list for number of edits as I write. I understand the Gaza article, but what’s up with the Tasmanian tiger?

The interface has some other nice features, such as marking pages in red that have high revision, vandalism or undo rates and showing associated Wikipedia flags that indicate articles that need attention or don’t live up to standards. Wikirage is also available for the English, Japanese, Spanish, German and French language Wikipedias.

Wikirage was developed by Craig Wood and is a nicely done system.

(via the Porn Sex Viagra Casino Spam site)

Akshay Java Ph.D.: Mining Social Media Communities and Content

December 30th, 2008

Akshay Java defended his PhD dissertation this fall on discovering communities in social media systems and the submitted version is now available online. Akshay is now a scientist at Microsoft’s Live Labs. The citation, link and abstract are below.

Akshay Java, Mining Social Media Communities and Content, Ph.D. Dissertation, Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, December 1, 2008. Available at

Social Media is changing the way people find information, share knowledge and communicate with each other. The important factor contributing to the growth of these technologies is the ability to easily produce “user-generated content”. Blogs, Twitter, Wikipedia, Flickr and YouTube are just a few examples of Web 2.0 tools that are drastically changing the Internet landscape today. These platforms allow users to produce and annotate content and more importantly, empower them to share information with their social network. Friends can in turn, comment and interact with the producer of the original content and also with each other. Such social interactions foster communities in online, social media systems. User-generated content and the social graph are thus the two essential elements of any social media system.

Given the vast amount of user-generated content being produced each day and the easy access to the social graph, how can we analyze the structure and content of social media data to understand the nature of online communication and collaboration in social applications? This thesis presents a systematic study of the social media landscape through the combined analysis of its special properties, structure and content.

First, we have developed a framework for analyzing social media content effectively. The BlogVox opinion retrieval system is a large scale blog indexing and content analysis engine. For a given query term, the system retrieves and ranks blog posts expressing sentiments (either positive or negative) towards the query terms. Further, we have developed a framework to index and semantically analyze syndicated feeds from news websites. We use a sophisticated natural language processing system, OntoSem, to semantically analyze news stories and build a rich fact repository of knowledge extracted from real-time feeds. It enables other applications to benefit from such deep semantic analysis by exporting the text meaning representations in Semantic Web language, OWL.

Secondly, we describe novel algorithms that utilize the special structure and properties of social graphs to detect communities in social media. Communities are an essential element of social media systems and detecting their structure and membership is critical in several real-world applications. Many algorithms for community detection are computationally expensive and generally, do not scale well for large networks. In this work we present an approach that benefits from the scale-free distribution of node degrees to extract communities efficiently. Social media sites frequently allow users to provide additional meta-data about the shared resources, usually in the form of tags or folksonomies. We have developed a new community detection algorithm that can combine information from tags and the structural information obtained from the graphs to effectively detect communities. We demonstrate how structure and content analysis in social media can benefit from the availability of rich meta-data and special properties.

Finally, we study social media systems from the user perspective. In the first study we present an analysis of how a large population of users subscribes and organizes the blog feeds that they read. This study has revealed interesting properties and characteristics of the way we consume information. We are the first to present an approach to what is now known as the “feed distillation” task, which involves finding relevant feeds for a given query term. Based on our understanding of feed subscription patterns we have built a prototype system that provides recommendations for new feeds to subscribe and measures the readership based influence of blogs in different topics.

We are also the first to measure the usage and nature of communities in a relatively new phenomena called Microblogging. Microblogging is a new form of communication in which users can describe their current status in short posts distributed by instant messages, mobile phones, email or the Web. In this study, we present our observations of the microblogging phenomena and user intentions by studying the content, topological and geographical properties of such communities. We find that microblogging provides users with a more immediate form of communication to talk about their daily activities and to seek or share information.

The course of this research has highlighted several challenges that processing social media data presents. This class of problems requires us to re-think our approach to text mining, community and graph analysis. Comprehensive understanding of social media systems allows us to validate theories from social sciences and psychology, but on a scale much larger than ever imagined. Ultimately this leads to a better understanding of how we communicate and interact with each other today and in future.

The true cost of sending SMS messages

December 28th, 2008

The NYT has an article, What Carriers Aren’t Eager to Tell You About Texting, about new interest in understanding why charges for SMS service have been increasing even while volume is up and communication costs are down.

I learned one interesting thing from the article about the length of SMS messages. I’d never thought much about where the limit on the number of characters came from. According to the article, the limit is 160 (7 bit) characters because that’s what will fit into the control channel messages that mobile phones exchange with cell towers.

“The lucrative nature of that revenue increase cannot be appreciated without doing something that T-Mobile chose not to do, which is to talk about whether its costs rose as the industry’s messaging volume grew tenfold. Mr. Kohl’s letter of inquiry noted that “text messaging files are very small, as the size of text messages are generally limited to 160 characters per message, and therefore cost carriers very little to transmit.” A better description might be “cost carriers very, very, very little to transmit.”

A text message initially travels wirelessly from a handset to the closest base-station tower and is then transferred through wired links to the digital pipes of the telephone network, and then, near its destination, converted back into a wireless signal to traverse the final leg, from tower to handset. In the wired portion of its journey, a file of such infinitesimal size is inconsequential. Srinivasan Keshav, a professor of computer science at the University of Waterloo, in Ontario, said: “Messages are small. Even though a trillion seems like a lot to carry, it isn’t.”

Perhaps the costs for the wireless portion at either end are high — spectrum is finite, after all, and carriers pay dearly for the rights to use it. But text messages are not just tiny; they are also free riders, tucked into what’s called a control channel, space reserved for operation of the wireless network. That’s why a message is so limited in length: it must not exceed the length of the message used for internal communication between tower and handset to set up a call. The channel uses space whether or not a text message is inserted.”
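The 160-character limit follows from simple arithmetic: 160 seven-bit characters occupy 1,120 bits, which is exactly 140 eight-bit bytes. Here is a minimal sketch that packs 7-bit codes into bytes to show this. It assumes plain ASCII codes stand in for the GSM default alphabet and that the bits are packed least significant first, which is how I understand GSM does it. `pack_septets` is a hypothetical helper name, not part of any library.

```python
def pack_septets(text):
    """Pack 7-bit character codes into bytes, least significant bits first."""
    bits = 0    # accumulator holding bits not yet written out
    nbits = 0   # number of valid bits currently in the accumulator
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if code > 0x7F:
            raise ValueError("not a 7-bit character: %r" % ch)
        # Append the 7 new bits above whatever is still pending.
        bits |= code << nbits
        nbits += 7
        # Emit every complete byte that has accumulated.
        while nbits >= 8:
            out.append(bits & 0xFF)
            bits >>= 8
            nbits -= 8
    if nbits:
        # Flush the final partial byte, padded with zero bits.
        out.append(bits & 0xFF)
    return bytes(out)


message = "x" * 160
packed = pack_septets(message)
print(len(message) * 7, "bits ->", len(packed), "bytes")  # 1120 bits -> 140 bytes
```

A 161st character would push the payload to 141 bytes, which is why longer texts have to be split across several messages.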

There’s a lot more to the protocols, of course. The Wikipedia SMS article looks like a good place to start.

Yongmei Shi PhD: Linguistic Information for Speech Recognition Error Detection

December 26th, 2008

Yongmei Shi defended her PhD dissertation earlier this fall on using syntactic and semantic information to detect errors in spoken language systems under the direction of Dr. R. Scott Cost (JHU/APL) and Professor Lina Zhou (UMBC). Her dissertation has been submitted and is now available online.

Yongmei Shi, An Investigation of Linguistic Information for Speech Recognition Error Detection, Ph.D. Dissertation, Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, October 2008.

After several decades of effort, significant progress has been made in the area of speech recognition technologies, and various speech-based applications have been developed. However, current speech recognition systems still generate erroneous output, which hinders the wide adoption of speech applications. Given that the goal of error-free output can not be realized in near future, mechanisms for automatically detecting and even correcting speech recognition errors may prove useful for amending imperfect speech recognition systems. This dissertation research focuses on the automatic detection of speech recognition errors for monologue applications, and in particular, dictation applications.

Due to computational complexity and efficiency concerns, limited linguistic information is embedded in speech recognition systems. Furthermore, when identifying speech recognition errors, humans always apply linguistic knowledge to complete the task. This dissertation therefore investigates the effect of linguistic information on automatic error detection by applying two levels of linguistic analysis, specifically syntactic analysis and semantic analysis, to the post processing of speech recognition output. Experiments are conducted on two dictation corpora which differ in both topic and style (daily office communication by students and Wall Street Journal news by journalists).

To catch grammatical abnormalities possibly caused by speech recognition errors, two sets of syntactic features, linkage information and word associations based on syntactic dependency, are extracted for each word from the output of two lexicalized robust syntactic parsers respectively. Confidence measures, which combine features using Support Vector Machines, are used to detect speech recognition errors. A confidence measure that combines syntactic features with non-linguistic features yields consistent performance improvement in one or more aspects over those obtained by using non-linguistic features alone.

Semantic abnormalities possibly caused by speech recognition errors are caught by the analysis of semantic relatedness of a word to its context. Two different methods are used to integrate semantic analysis with syntactic analysis. One approach addresses the problem by extracting features for each word from its relations to other words. To this end, various WordNet-based measures and different context lengths are examined. The addition of semantic features in confidence measures can further yield small but consistent improvement in error detection performance. The other approach applies lexical cohesion analysis by taking both reiteration and collocation relationships into consideration and by augmenting words with probability predicted from syntactic analysis. Two WordNet-based measures and one measure based on Latent Semantic Analysis are used to instantiate lexical cohesion relationships. Additionally, various word probability thresholds and cosine similarity thresholds are examined. The incorporation of lexical cohesion analysis is superior to the use of syntactic analysis alone. In summary, the use of linguistic information as described, including syntactic and semantic information, can provide positive impact on automatic detection of speech recognition errors.

Social media conferences, symposia, workshops and events

December 25th, 2008

JD Lasica’s blog has a post, 2009 conferences: Social media, tech, marketing, that lists “some of the best social media, technology, media and marketing conferences for the upcoming year” in the US. The list doesn’t include any technology research-oriented conferences, but does have quite a range of others. The post invites everyone to suggest additional entries by adding comments about them. (I suggested ICWSM and the AAAI Spring Symposium on the Social Semantic Web.)

This list complements Akshay Java’s Social Media Events calendar which is focused mostly on research conferences. He also invites suggestions which you can submit by email or through comments.

WWGD: Understanding Google’s Technology Stack

December 24th, 2008

It’s popular to ask “What Would Google Do” these days — The Google reports over 7,000 results for the phrase. Of course, it’s not just about Google, which we all use as the archetype for a new Web way of building and thinking about information systems. Asking WWGD can be productive, but only if we know how to implement and exploit the insights the answer gives us. This in turn requires us (well, some of us, anyway) to understand the algorithms, techniques, and software technology that Google and other large scale Web-oriented companies use. We need to ask “How Would Google Do It”.

Michael Nielsen has a nice post on using your laptop to compute PageRank for millions of webpages. His post reviews PageRank and how to compute it and shows a short, but reasonably efficient, Python program that can easily do a graph with a few million nodes. While not sufficient for many applications, like the Web, there are lots of interesting and significant graphs this small Python program can handle: Wikipedia pages, DBLP publications, RDF namespaces, BGP routers, Twitter followers, etc.
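For a feel of the computation, here is a minimal sketch of PageRank by power iteration. It is not Nielsen's program. The function name `pagerank`, the dictionary-of-lists graph format, and the 0.85 damping factor are my own assumptions for illustration, and a version meant for millions of nodes would want arrays rather than dictionaries.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Estimate PageRank; links maps each node to the list of nodes it links to."""
    # Collect every node, including ones that only appear as link targets.
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}

    for _ in range(max_iter):
        # Rank held by pages with no outgoing links is spread evenly over all pages.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {v: base for v in nodes}
        # Each page passes an equal share of its damped rank to every page it links to.
        for v, targets in links.items():
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
        change = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
for page, score in sorted(pagerank(graph).items()):
    print(page, round(score, 4))
```

In this toy graph page c comes out on top because both a and b link to it, while b, which nobody links to, gets only the baseline share.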

The post is part of a series Nielsen is making on the Google Technology Stack including PageRank, MapReduce, BigTable, and GFS. The posts are a byproduct of a series of weekly lectures he’s giving starting earlier this month in Waterloo. Here’s the way that Nielsen describes the series.

“Part of what makes Google such an amazing engine of innovation is their internal technology stack: a set of powerful proprietary technologies that makes it easy for Google developers to generate and process enormous quantities of data. According to a senior Microsoft developer who moved to Google, Googlers work and think at a higher level of abstraction than do developers at many other companies, including Microsoft: “Google uses Bayesian filtering the way Microsoft uses the if statement” (Credit: Joel Spolsky). This series of posts describes some of the technologies that make this high level of abstraction possible.”

Videos of the first two lectures (Introduction to PageRank and Building our PageRank Intuition) are available online. Nielsen illustrates the concepts and algorithms with well-written Python code and provides exercises to help readers master the material as well as “more challenging and often open-ended problems” which he has worked on but not completely solved.

Nielsen was trained as a theoretical physicist but has shifted his attention to “the development of new tools for scientific collaboration and publication”. As far as I can see, he is offering these as free public lectures out of a desire to share his knowledge and also to help (or maybe force) him to deepen his own understanding of the topics and develop better ways of explaining them. In both cases, it is an admirable and inspiring example for us all and appropriate for the holiday season. Merry Christmas!

Videos of Semantic Web talks and tutorials from ISWC 2008 now online

December 22nd, 2008

High quality videos of tutorials and talks from the Seventh International Semantic Web Conference are now available on the excellent Videolectures site. It’s a great opportunity to benefit from the conference if you were not able to attend or, even if you were, to see presentations you were not able to attend.

Videolectures captured the slides for most of the presentations (which are available for downloading) and their site shows both the speaker’s video and slides in synchronization. Videolectures used three camera crews in parallel, so they were able to capture almost all of the presentations. Here are some highlights from the ~90 videos to whet your appetite.

Tom Briggs Ph.D.: Constraint Generation and Reasoning in OWL

December 22nd, 2008

Tom Briggs defended his PhD dissertation last month on discovering domain and range constraints in OWL and the final copy is now available.

Thomas H. Briggs, Constraint Generation and Reasoning in OWL, 2008.

The majority of OWL ontologies in the emerging Semantic Web are constructed from properties that lack domain and range constraints. Constraints in OWL are different from the familiar uses in programming languages and databases. They are actually type assertions that are made about the individuals which are connected by the property. Because they are type assertions these assertions can add vital information to the individuals involved and give information on how the defining property may be used. Three different automated generation techniques are explored in this research: disjunction, least-common named subsumer, and vivification. Each algorithm is compared for the ability to generalize, and the performance impacts with respect to the reasoner. A large sample of ontologies from the Swoogle repository are used to compare real-world performance of these techniques. Using generated facts is a type of default reasoning. This may conflict with future assertions to the knowledge base. While general default reasoning is non-monotonic and undecidable a novel approach is introduced to support efficient contraction of the default knowledge. Constraint generation and default reasoning, together, enable a robust and efficient generation of domain and range constraints which will result in the inference of additional facts and improved performance for a number of Semantic Web applications.

Disco: a Map reduce framework in Python and Erlang

December 21st, 2008

Disco is a Python-friendly, open-source Map-Reduce framework for distributed computing with the slogan “massive data – minimal code”. Disco’s core is written in Erlang, a functional language designed for concurrent programming, and users typically write Disco map and reduce jobs in Python. So what’s wrong with using Hadoop? Nothing, according to the Disco site, but…

“We see that platforms for distributed computing will be of such high importance in the future that it is crucial to have a wide variety of different approaches which produces healthy competition and co-evolution between the projects. In this respect, Hadoop and Disco can be seen as complementary projects, similar to Apache, Lighttpd and Nginx.

It is a matter of taste whether Erlang and Python are more suitable for the task than Java. We feel much more productive with Python than with Java. We also feel that Erlang is a perfect match for the Disco core that needs to handle tens of thousands of tasks in parallel.

Thanks to Erlang, the Disco core remarkably compact, currently less than 2000 lines of code. It is relatively easy to understand how the core works, and start experimenting with it or adapt it to new environments. Thanks to Python, it is easy to add new features around the core which ensures that Disco can respond quickly to real-world needs.”

The Disco tutorial uses the standard word counting task to show how to set up and use Disco on both a local cluster and Amazon EC2. There is also homedisco, which lets programmers develop, debug, profile and test Disco functions on one local machine before running on a cluster. The word counting example from the tutorial is certainly nicely compact:

from disco.core import Disco, result_iterator

def fun_map(e, params):
    return [(w, 1) for w in e.split()]

def fun_reduce(iter, out, params):
    s = {}
    for w, f in iter:
        s[w] = s.get(w, 0) + int(f)
    for w, f in s.iteritems():
        out.add(w, f)

results = Disco("disco://localhost").new_job(
    name="wordcount",
    input=[""],
    map=fun_map,
    reduce=fun_reduce).wait()

for word, frequency in result_iterator(results):
    print word, frequency

Eigenfactor.org measures and visualizes journal impact

December 19th, 2008

Eigenfactor.org is a fascinating site that is exploring new ways to measure and visualize the importance of journals to scientific communities. The site is a result of work by the Bergstrom lab in the Department of Biology at the University of Washington. The project defines two metrics for scientific journals based on a PageRank-like algorithm applied to citation graphs.

“A journal’s Eigenfactor score is our measure of the journal’s total importance to the scientific community. With all else equal, a journal’s Eigenfactor score doubles when it doubles in size. Thus a very large journal such as the Journal of Biological Chemistry which publishes more than 6,000 articles annually, will have extremely high Eigenfactor scores simply based upon its size. Eigenfactor scores are scaled so that the sum of the Eigenfactor scores of all journals listed in Thomson’s Journal Citation Reports (JCR) is 100.

A journal’s Article Influence score is a measure of the average influence of each of its articles over the first five years after publication. Article Influence measures the average influence, per article, of the papers in a journal. As such, it is comparable to Thomson Scientific’s widely-used Impact Factor. Article Influence scores are normalized so that the mean article in the entire Thomson Journal Citation Reports (JCR) database has an article influence of 1.00.”
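The two normalizations described in that passage can be sketched in a few lines. This is only my reading of the quoted description, not the project's actual algorithm: the raw influence values are assumed to come from some PageRank-like computation over the citation graph, and the function name, the per-journal article counts, and the toy numbers are all made up for illustration.

```python
def eigenfactor_style_scores(raw_influence, article_counts):
    """Scale raw journal influence the way the quoted passage describes.

    raw_influence: journal -> unnormalized influence (assumed input)
    article_counts: journal -> number of articles published (assumed input)
    """
    total_influence = sum(raw_influence.values())
    total_articles = sum(article_counts.values())

    # Whole-journal scores are scaled so that they sum to 100.
    eigenfactor = {
        j: 100.0 * x / total_influence for j, x in raw_influence.items()
    }

    # Per-article scores: a journal's share of influence divided by its share
    # of articles, so that the mean article across all journals scores 1.00.
    article_influence = {
        j: (eigenfactor[j] / 100.0) / (article_counts[j] / total_articles)
        for j in raw_influence
    }
    return eigenfactor, article_influence


# Toy data: a large journal and a small one (made-up numbers).
ef, ai = eigenfactor_style_scores(
    {"Big Journal": 3.0, "Small Journal": 1.0},
    {"Big Journal": 600, "Small Journal": 100},
)
for journal in ef:
    print(journal, round(ef[journal], 2), round(ai[journal], 3))
```

With these numbers the large journal gets the higher total score, but the small one has the higher score per article, which is the distinction the two metrics are meant to capture.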

For example, here are the ISI-indexed journals in the AI subject category ranked by the Article Influence score for 2006.

The site makes good use of Google Docs motion charts to visualize the changes of metrics for top journals in a subject area. You can also interactively explore maps that show the influence of different subject categories on one another as estimated from journal citations.

Map of Science

The details of the approach and algorithms are available in various papers by Bergstrom and his colleagues, such as

M. Rosvall and C. T. Bergstrom, Maps of random walks on complex networks reveal community structure, Proceedings of the National Academy of Sciences USA. 105:1118-1123. Also arXiv physics.soc-ph/0707.0609v3 [PDF]

(spotted on Steve Hsu’s blog)