]>
First, I will describe online learning approaches for detecting malicious Web sites (those involved in criminal scams) using lexical and host-based features of the associated URLs. This application is particularly well suited to online algorithms because the training data is too large to process efficiently in batch and because the features that typify malicious URLs evolve continuously. Motivated by this application, we built a real-time system that gathers URL features and analyzes them against a source of labeled URLs from a large Web mail provider. Our system adapts in an online fashion to the evolving characteristics of malicious URLs, achieving daily classification accuracies of up to 99% over a balanced data set.
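To make the online setting concrete, here is a minimal sketch of mistake-driven learning over lexical URL tokens. It is an illustration only, not the system described above: the perceptron update, the tokenization rule, and the names `lexical_features` and `OnlinePerceptron` are all assumptions chosen for brevity, and host-based features are omitted entirely.

```python
import re


def lexical_features(url):
    """Bag-of-words tokens from a URL (hypothetical tokenization:
    split on common URL delimiters and lowercase)."""
    tokens = [t for t in re.split(r"[/.?=&_\-:]+", url.lower()) if t]
    return {"tok=" + t: 1.0 for t in tokens}


class OnlinePerceptron:
    """Classic perceptron, used here as a stand-in for any online learner."""

    def __init__(self):
        self.w = {}      # sparse weights, one entry per feature seen so far
        self.bias = 0.0

    def score(self, feats):
        return self.bias + sum(self.w.get(k, 0.0) * v for k, v in feats.items())

    def predict(self, feats):
        # +1 = malicious, -1 = benign
        return 1 if self.score(feats) > 0 else -1

    def update(self, feats, label):
        """Process one labeled example; adjust weights only on a mistake.
        Returns True if the example was misclassified before the update."""
        if label * self.score(feats) <= 0:
            for k, v in feats.items():
                self.w[k] = self.w.get(k, 0.0) + label * v
            self.bias += label
            return True
        return False
```

Each labeled URL is seen once, in arrival order, so memory grows only with the number of distinct features, and features that newly appear in the stream receive weights as soon as they cause a mistake. That is the property that matters when the data is too large for batch training and the feature distribution drifts.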
Next, I will describe our ongoing efforts to build analytics for detecting social media abuse. Deciding on a universal definition of social media abuse is difficult, as abuse is often in the eye of the beholder. In light of this challenge, we explore a more formal definition based on information theory. In particular, we hypothesize that messages with low information content are likely to be abusive. From this hypothesis, we develop a measure of content complexity for identifying abusive users, and this measure shows promise in our early evaluations.
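One simple way to picture a content-complexity measure is a compression ratio: text that compresses very well carries little information per byte. The sketch below uses that proxy. It is an assumption made for illustration, not necessarily the measure developed in this work, and the function name `content_complexity` is hypothetical.

```python
import zlib


def content_complexity(messages):
    """Compressed size divided by raw size for one user's messages.
    Values near 0 indicate highly repetitive (low-information) content.
    Compression ratio is an assumed proxy, chosen for illustration."""
    data = "\n".join(messages).encode("utf-8")
    if not data:
        return 0.0  # no content at all; defined as 0.0 by convention here
    return len(zlib.compress(data, 9)) / len(data)
```

Under this proxy, a user who posts the same promotional message dozens of times scores far lower than a user whose messages differ from one another. Very short inputs are dominated by compressor overhead, so a measure like this is meaningful only when aggregated over enough text per user.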
Beyond our own experiments in the lab, this work has also found success in practice. Companies serving hundreds of millions of users have adopted these ideas to improve abuse detection within their own services.
]]>