Detecting Domain Shift
September 3, 2010
Machine learning systems are typically trained in the lab and then deployed in the wild. But what happens when the data they are exposed to in the wild changes in a way that hurts accuracy? For example, a system may be trained to classify movie reviews as either positive or negative (i.e., sentiment classification), but over time book reviews get mixed into the data stream. The problem of responding to such changes when they are known to have occurred has been studied extensively. In this talk I will describe recent work (with Mark Dredze and Christine Piatko) on the problem of automatically detecting such domain changes. We assume only a stream of unlabeled examples, and we detect significant changes by applying the A-distance, a measure of the difference between probability distributions, to the margin values produced by large margin classifiers (such as support vector machines). I will describe the application domain, which is statistical natural language processing; the approach; experiments on a variety of corpora and with a variety of tasks; and a theoretical analysis of the A-distance that is used to automatically select parameters for the algorithm.
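The core idea above can be sketched in code. The snippet below is a minimal, hedged illustration (not the authors' implementation): it estimates an A-distance over intervals between two samples of classifier margin values by binning them and taking the largest probability-mass discrepancy on any contiguous range of bins. The bin count and all function names are illustrative assumptions.

```python
# Hedged sketch: empirical A-distance (over the family of intervals)
# between two samples of classifier margin values. Binning granularity
# is an illustrative assumption, not the method's prescribed setting.

def a_distance(ref_margins, cur_margins, n_bins=10):
    """Approximate A-distance between a reference window and a current
    window of margin values; larger values suggest domain shift."""
    lo = min(min(ref_margins), min(cur_margins))
    hi = max(max(ref_margins), max(cur_margins))
    width = (hi - lo) / n_bins or 1.0  # guard against a zero-width range

    def hist(values):
        # Normalized histogram of margins over shared bins.
        counts = [0] * n_bins
        for v in values:
            i = min(int((v - lo) / width), n_bins - 1)
            counts[i] += 1
        return [c / len(values) for c in counts]

    p, q = hist(ref_margins), hist(cur_margins)

    # Sup over all intervals of |P(I) - Q(I)| equals the range of the
    # cumulative difference between the two binned distributions.
    diffs, cum = [0.0], 0.0
    for pi, qi in zip(p, q):
        cum += pi - qi
        diffs.append(cum)
    return 2.0 * (max(diffs) - min(diffs))
```

In use, one would feed margins from a trusted reference window and a recent window of the unlabeled stream: near-identical margin distributions yield a distance near 0, while a shifted distribution (e.g., book reviews entering a movie-review stream) pushes the distance toward its maximum of 2, at which point a change can be flagged against a threshold.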