Textual Representations for Corpus-Based Bilingual Retrieval

Monday, November 24, 2008, 9:00am - Monday, November 24, 2008, 11:30am

The traditional approach to information retrieval is based on using words as the indexing and search terms for documents. One part of this research investigates alternative methods for representing text, including a method based on overlapping sequences of characters called n-gram tokenization. N-grams are studied in depth and one notable finding is that they achieve a 20% improvement in retrieval effectiveness over words in certain situations.

The other focus of this research is improving retrieval performance when foreign language documents must be searched and translation is required. In this scenario bilingual dictionaries are often used to translate user queries; however even among the most commonly spoken languages, for which large bilingual lexicons exist, dictionary-based translation suffers from several significant problems. These include: difficulty handling proper names, which are often missing; issues related to morphological variation since entries, or query terms, may not be lemmatized; and, an inability to robustly handle multiword phrases, especially non-compositional expressions. These problems can be addressed when translation is accomplished using parallel collections, sets of documents available in more than one language. Using parallel texts enables statistical translation of character n-grams rather than words or stemmed words, and with this technique highly effective bilingual retrieval performance is obtained. Translation of multiword expressions is also explored.

In this dissertation I present an overview of the field of cross- language information retrieval and then introduce the foundational concepts in n-gram tokenization and corpus-based translation. Then monolingual and bilingual experiments on test sets in 13 languages are described. Analysis of these experiments gives insight into: the relative efficacy of various tokenization methods; reasons why n-grams are effective; the utility of automated relevance feedback, in both monolingual and bilingual contexts; the interplay between tokenization and translation; and, how translation resource selection and size influence bilingual retrieval.

Hosted by: Charles Nicholas

OWL Tweet