Analytics for Detecting Web and Social Media Abuse
by Justin Ma
Friday, March 16, 2012, 1:00pm - 2:00pm
325b ITE
The Web and online social media provide invaluable communication services to a global Internet user base. The tremendous success of these services, however, has also created valuable opportunities for criminals and other miscreants to abuse them for their own gain. As a result, detecting, monitoring, and curtailing this abuse is an important yet challenging problem. The large scale and diversity of these services, combined with the tactics used by attackers, make it difficult to discern one clear and robust signal for detecting abuse. One approach, relying on domain expertise, is to construct a small set of well-crafted heuristics, but such heuristics tend to become obsolete rapidly. In this talk, I will describe more robust approaches based on machine learning, statistical modeling, and large-scale data analytics.
First, I will describe online learning approaches for detecting malicious Web sites (those involved in criminal scams) using lexical and host-based features of the associated URLs. This application is particularly appropriate for online algorithms because the training data is larger than can be efficiently processed in batch and because the features that typify malicious URLs evolve continuously. Motivated by this application, we built a real-time system to gather URL features and analyze them against a source of labeled URLs from a large Web mail provider. Our system adapts in an online fashion to the evolving characteristics of malicious URLs, achieving daily classification accuracies of up to 99% over a balanced data set.
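To make the online setting concrete, here is a minimal sketch of one way such a classifier could look. It is an illustration under stated assumptions, not the system described in the talk: the choice of a plain perceptron, the tokenization rule, the names `lexical_features` and `OnlinePerceptron`, and the example URLs are all hypothetical, and host-based features are omitted entirely.

```python
import re


def lexical_features(url):
    """Bag-of-tokens lexical features (assumed scheme): split on URL delimiters."""
    feats = {"bias": 1.0}
    for token in re.split(r"[/.?=&_:\-]+", url.lower()):
        if token:
            key = "tok=" + token
            feats[key] = feats.get(key, 0.0) + 1.0
    return feats


class OnlinePerceptron:
    """Perceptron updated one example at a time; labels are +1 (malicious) or -1."""

    def __init__(self):
        self.w = {}  # sparse weight vector, grows as new tokens appear

    def score(self, feats):
        return sum(self.w.get(k, 0.0) * v for k, v in feats.items())

    def predict(self, url):
        return 1 if self.score(lexical_features(url)) > 0 else -1

    def learn_one(self, url, label):
        """Predict first, then update only on a mistake. Returns the prediction."""
        feats = lexical_features(url)
        guess = 1 if self.score(feats) > 0 else -1
        if guess != label:
            for k, v in feats.items():
                self.w[k] = self.w.get(k, 0.0) + label * v
        return guess


# Hypothetical labeled stream; a real feed would arrive continuously.
stream = [
    ("http://paypal.secure-login.example.biz/update/account", 1),
    ("http://www.example.org/docs/index.html", -1),
    ("http://secure-login.example.biz/verify/account", 1),
    ("http://www.example.org/about/team.html", -1),
]

model = OnlinePerceptron()
for url, label in stream:
    model.learn_one(url, label)
```

Because each example is used for a single update and then discarded, the model never needs the full training set in memory, and weights for newly appearing tokens are created on the fly as URL characteristics drift.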
Next, I will describe our ongoing efforts to create analytics for detecting social media abuse. Deciding on a universal definition of social media abuse is difficult, as abuse is often in the eye of the beholder. In light of this challenge, we explore a more formal definition based on information theory. In particular, we hypothesize that messages with low information content are likely to be abusive. From this hypothesis, we develop a measure of content complexity for identifying abusive users, which shows promise in our early evaluations.
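The abstract does not specify the complexity measure, so the following sketch substitutes a common stand-in: the compression ratio of a user's concatenated messages, computed with zlib. The function name `content_complexity` and the sample messages are hypothetical, and compressed size is only a rough proxy for information content.

```python
import zlib


def content_complexity(messages):
    """Compressed size divided by raw size for one user's messages.

    Lower values mean more redundancy, i.e. less information per byte.
    This is an assumed proxy, not the measure used in the talk.
    """
    raw = "\n".join(messages).encode("utf-8")
    if not raw:
        return 0.0
    return len(zlib.compress(raw, 9)) / len(raw)


# Hypothetical users: one repeats a single message, one writes varied text.
repetitive_user = ["Click here for free followers now!"] * 50
varied_user = [
    "Just landed in Lisbon, the weather is perfect.",
    "Does anyone know a good bakery near the old town?",
    "Finished the marathon in under four hours today.",
    "Reading about sparse linear classifiers this weekend.",
]

low = content_complexity(repetitive_user)
high = content_complexity(varied_user)
```

Under this proxy, the repetitive user scores much lower than the varied one. Any cutoff for flagging a user would have to be chosen empirically, and short message histories compress poorly regardless of content, so the ratio is only comparable across users with similar volumes of text.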
Beyond our own experiments in the lab, this work has also found success in practice. Companies serving hundreds of millions of users have adopted these ideas to improve abuse detection within their own services.