Identifying and Isolating Text Classification Signals from Domain and Genre Noise for Sentiment Analysis

Justin Martineau


December 5, 2011

Sentiment analysis is the automatic detection and measurement of sentiment in text segments by machines. This problem is generally divided into three tasks: a sentiment detection task, a topic detection task, and a sentiment measurement task. The first task attempts to determine whether the author is being objective or whether they are expressing a value judgment on the topic. The second task attempts to determine the topic of the sentiment. The third task attempts to determine whether the author approves or disapproves of the topic and by how much.

The main difficulty in solving these tasks arises from noise in the author's sentiment signal that is caused by the variety of domains (topics) and genres (communication media). The limitless scope of possible domains, plus the effort required to hand-label data for sentiment analysis tasks, implies a lack of labeled data for any given domain. To exacerbate this problem, any of these domains can occur in any known genre, creating further noise in the communicator's sentiment signal. Genres that are freely available or have high data volumes, including web pages, blogs, news feeds, Twitter posts, Facebook updates, and SMS messages, are of particular interest. These data sources cover a broad range of topics and are primarily unedited, so noise and domain dependence are very important issues for sentiment analysis.

Sentiment analysis techniques are widely applicable to both government and private sector problems. On the government side, textual sentiment analysis in blogs could help identify terrorists, terrorist supporters, and potential suicide victims. These techniques can also be used to influence people around the world, and to measure the effectiveness of advertising campaigns. Textual sentiment analysis can be used for market research, financial investments, and politics.

To support these kinds of applications on blogs, news, and other text messages, I developed and evaluated techniques to identify and rank transferable discriminative sentimental terms, and used them on other domains of interest to classify the author's sentiment about the topic of their writing. These techniques form a four-step process. The first step is to determine the sentimental orientation of terms and score their strength in a set of known domains using Delta IDF. The second step is to determine and score how well these terms should transfer to the target domain using my Domain Independence Verification Algorithm (DIVA). The third step is to create a weight vector for the target domain using the Delta IDF weights and the DIVA scores. The final step is to transform documents into term-frequency-weighted document vectors and then classify them using the result of their dot product with the target domain's weight vector.
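The last two steps of the process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not say how the Delta IDF weights and DIVA scores are combined, so the per-term product used here is an assumption, as are the zero decision threshold, the whitespace tokenizer, the function names, and all the numbers.

```python
from collections import Counter

def target_weight_vector(delta_idf, transfer_score):
    """Step 3 (assumed form): scale each term's Delta IDF weight by a
    transfer score in [0, 1]. Terms without a score are dropped (weight 0)."""
    return {t: w * transfer_score.get(t, 0.0) for t, w in delta_idf.items()}

def classify(text, weights):
    """Step 4: dot product of the document's term-frequency vector with the
    target-domain weight vector. The zero threshold is an assumption; a
    score of exactly zero falls to 'negative' here."""
    tf = Counter(text.lower().split())
    score = sum(count * weights.get(term, 0.0) for term, count in tf.items())
    return ("positive" if score > 0 else "negative"), score

# Hypothetical source-domain weights and transfer scores (made-up values).
source_delta_idf = {"great": 2.0, "awful": -2.0, "unpredictable": 1.5, "the": 0.0}
transfer = {"great": 1.0, "awful": 1.0, "unpredictable": 0.1, "the": 1.0}

weights = target_weight_vector(source_delta_idf, transfer)
label, score = classify(
    "the steering is unpredictable and the brakes are awful", weights
)
print(label, score)
```

In this toy input, "unpredictable" carries a positive weight in the source domain but a low transfer score, so it contributes little and the strongly transferable "awful" decides the label.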

Delta IDF works well because I designed it to correct certain problems with the existing best practice. Term frequency times inverse document frequency (TFIDF) is a very successful mainstream practice in the field of Information Retrieval (IR). It was designed to reduce the importance of common English words while simultaneously boosting the importance of context-specific words. It is effective at identifying topical words for a document, but it was not designed to identify discriminative vocabulary for specific classification tasks. Nevertheless, many sentiment analysis researchers still use TFIDF scores as their initial term frequency weights for machine learning algorithms. I designed Delta TFIDF [25] to correct this mismatch between design goals. Classifying documents by their dot product with an in-domain Delta IDF weight vector produces statistically better results than using a Support Vector Machine (SVM).
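The mismatch between IDF and a class-aware weight can be seen on a toy corpus. The sketch below is not taken from the dissertation: it assumes a Delta IDF weight of the form log2 of the ratio between a term's add-one-smoothed document rate in positive documents and in negative documents, with the sign chosen so that positive values indicate positive sentiment. The smoothing, the sign convention, the function names, and the six example documents are all assumptions.

```python
import math
from collections import Counter

def doc_freq(docs):
    """Number of documents that contain each term at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df

def idf_weights(docs):
    """Class-blind, add-one-smoothed IDF over the whole corpus."""
    df, n = doc_freq(docs), len(docs)
    return {t: math.log2((n + 1) / (c + 1)) for t, c in df.items()}

def delta_idf_weights(pos_docs, neg_docs):
    """Assumed Delta IDF form: log ratio of smoothed per-class document rates."""
    pos_df, neg_df = doc_freq(pos_docs), doc_freq(neg_docs)
    n_pos, n_neg = len(pos_docs), len(neg_docs)
    weights = {}
    for term in set(pos_df) | set(neg_df):
        pos_rate = (pos_df[term] + 1) / (n_pos + 1)
        neg_rate = (neg_df[term] + 1) / (n_neg + 1)
        weights[term] = math.log2(pos_rate / neg_rate)
    return weights

positive = ["the film was great", "great acting and the plot held up",
            "the ending was great"]
negative = ["the film was awful", "awful pacing and the plot dragged",
            "the ending was awful"]

plain = idf_weights(positive + negative)
delta = delta_idf_weights(positive, negative)
for term in ("the", "film", "great", "awful"):
    print(f"{term:6s} idf={plain[term]:.2f} delta_idf={delta[term]:+.2f}")
```

On this corpus, IDF ranks the topical word "film" above "great" because "film" is rarer, while the class-aware weight gives "film" and "the" zero and keeps only "great" and "awful" as discriminative.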

DIVA is an effective technique for identifying and scoring terms that will transfer well to a target domain because it relies upon my newly developed concept of sentimental domain independence. Sentimental domain independence is the degree to which a term's sentimental bias remains unchanged in multiple domains. This definition is useful because it exposes a statistical or counting problem. Before this dissertation, researchers lacked a clear, concrete definition of domain independence, a measure of domain independence, and statistics about domain independence. This definition is an important part of my solution. While it is hard to determine what effects an increase or decrease in document-level frequency counts will have on a given term's sentimental bias, it is much easier to determine whether a term's sentimental bias will remain largely unchanged in the target domain if we can verify that it changes little in other domains. The definition, statistics, and algorithm for sentimental domain-independent terms are some of my major contributions to the fields of domain adaptation and sentiment analysis.
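To make the definition concrete, the following sketch scores how stable a term's bias is across several known domains. It is not DIVA, whose details are not given in this abstract. The scoring rule (zero when the sign of the bias flips between domains, otherwise shrinking with the spread of the bias values), the function names, and the per-domain bias numbers are all hypothetical.

```python
def bias_spread(per_domain_bias):
    """Return (same_sign, spread) for one term's bias across domains."""
    values = list(per_domain_bias.values())
    same_sign = all(v > 0 for v in values) or all(v < 0 for v in values)
    spread = max(values) - min(values)
    return same_sign, spread

def independence_score(per_domain_bias):
    """Hypothetical score in [0, 1]: 0 if the bias changes sign in any
    domain, otherwise closer to 1 the less the bias varies."""
    same_sign, spread = bias_spread(per_domain_bias)
    if not same_sign:
        return 0.0
    return 1.0 / (1.0 + spread)

# Made-up per-domain bias values (for example, Delta IDF style weights).
excellent = {"books": 1.8, "kitchen": 2.0, "electronics": 1.9}
unpredictable = {"books": 1.5, "kitchen": -1.2, "electronics": -1.6}

print("excellent    ", round(independence_score(excellent), 3))
print("unpredictable", round(independence_score(unpredictable), 3))
```

A term such as "unpredictable", which might praise a plot but criticize an appliance, scores zero here, while a term whose bias barely moves across the known domains scores close to one. That matches the intuition in the paragraph above: stability across several observed domains is taken as evidence of stability in an unseen one.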

I evaluated this approach as a classification problem. I evaluated Delta IDF using ten-fold cross-validation on different domains and genres against state-of-the-art SVM baselines. These results are statistically superior using two-tailed tests. Other researchers have since published further tests showing Delta IDF works well on additional domains and genres [33]. My results with DIVA on multiple domains against state-of-the-art domain adaptation techniques were clearly superior. In fact, DIVA is so powerful that it can outperform in-domain Delta IDF baselines by leveraging greater amounts of out-of-domain data. Not only is it possible to accurately transfer models between domains, it is advantageous.
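An evaluation of this shape can be sketched as follows. The abstract does not name the statistical test, so the paired t-test over per-fold accuracies is an assumption, as are the round-robin fold assignment and the function names. The per-fold accuracies are invented placeholders for illustration only and are not results from the dissertation.

```python
import math
import statistics

def ten_fold_indices(n_items, k=10):
    """Split range(n_items) into k disjoint test folds (round-robin)."""
    return [list(range(i, n_items, k)) for i in range(k)]

def paired_t_statistic(scores_a, scores_b):
    """t statistic for the mean per-fold difference between two methods."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_err

folds = ten_fold_indices(25)  # each item lands in exactly one test fold

# Placeholder per-fold accuracies for two classifiers (invented numbers).
method_a = [0.88, 0.90, 0.87, 0.91, 0.89, 0.90, 0.88, 0.92, 0.89, 0.90]
method_b = [0.85, 0.88, 0.86, 0.87, 0.86, 0.88, 0.85, 0.89, 0.87, 0.86]

t = paired_t_statistic(method_a, method_b)
T_CRIT = 2.262  # two-tailed critical value for alpha = 0.05 with 9 degrees of freedom
significant = abs(t) > T_CRIT
print(f"t = {t:.2f}, significant at 0.05 (two-tailed): {significant}")
```

Comparing the absolute value of t against the critical value is what makes the test two-tailed: a difference in either direction would count as significant.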

My research made both algorithmic and theoretical contributions to the fields of domain adaptation and sentiment analysis. First, I provided algorithms to discover and weight discriminative, classification-task-specific features within a domain. Second, I produced algorithms to score how well these features transfer to a new target domain. Third, I laid out a general theory for the kinds of information and the types of noise they produce in text classification problems. Fourth, I defined sentimental domain independence and statistically described it. This dissertation gives future readers a firm theoretical foundation and practical algorithms to build upon for a wide array of identified classification problems and for future research in the field.


Type: PhdThesis

Publisher: University of Maryland, Baltimore County

Organization: Computer Science and Electrical Engineering

Address: Computer Science and Electrical Engineering
