UMBC ebiquity
Textual Representations for Corpus-Based Bilingual Retrieval
Speaker: Paul McNamee
Start: Monday, November 24, 2008, 09:00AM
End: Monday, November 24, 2008, 11:30AM
Abstract: The traditional approach to information retrieval is based on using
words as the indexing and search terms for documents. One part of this
research investigates alternative methods for representing text,
including a method based on overlapping sequences of characters called
n-gram tokenization. N-grams are studied in depth and one notable
finding is that they achieve a 20% improvement in retrieval
effectiveness over words in certain situations.
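As a rough illustration of the n-gram tokenization mentioned above, the sketch below breaks text into overlapping character sequences. The function name `char_ngrams`, the default length of 4, and the lowercasing and whitespace normalization are assumptions made for this example; the abstract does not specify these details.

```python
def char_ngrams(text, n=4):
    """Return the overlapping character n-grams of `text`, in order.

    Assumption for this sketch: text is lowercased and runs of whitespace
    are collapsed to single spaces, so n-grams may span word boundaries.
    """
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Text shorter than n yields itself as a single term (or nothing).
        return [normalized] if normalized else []
    # Slide a window of width n one character at a time.
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


print(char_ngrams("retrieval", 4))
```

Because adjacent n-grams share characters, morphological variants of a word produce many of the same indexing terms without any language-specific stemming.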
The other focus of this research is improving retrieval performance
when foreign language documents must be searched and translation is
required. In this scenario bilingual dictionaries are often used to
translate user queries; however, even among the most commonly spoken
languages, for which large bilingual lexicons exist, dictionary-based
translation suffers from several significant problems. These include:
difficulty handling proper names, which are often missing; issues
related to morphological variation since entries, or query terms, may
not be lemmatized; and an inability to robustly handle multiword
phrases, especially non-compositional expressions. These problems can
be addressed when translation is accomplished using parallel
collections, sets of documents available in more than one language.
Using parallel texts enables statistical translation of character
n-grams rather than words or stemmed words, and with this technique
highly effective bilingual retrieval performance is obtained.
Translation of multiword expressions is also explored.
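One simple way to picture corpus-based translation of n-grams is to count how often a source-language n-gram and a target-language n-gram occur in aligned passages, and score each pair by association strength. The sketch below uses the Dice coefficient for this purpose. That scoring choice, the toy English-French pairs, and the names `ngram_set`, `dice_table`, and `best_translation` are all assumptions for illustration and are not the translation model used in the dissertation.

```python
from collections import Counter
from itertools import product


def ngram_set(text, n=4):
    """Distinct character n-grams of `text` (lowercased, whitespace collapsed)."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def dice_table(aligned_pairs, n=4):
    """Score every co-occurring (source, target) n-gram pair with Dice.

    `aligned_pairs` is a list of (source_text, target_text) tuples that are
    assumed to be translations of each other.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in aligned_pairs:
        s_grams, t_grams = ngram_set(src, n), ngram_set(tgt, n)
        src_count.update(s_grams)
        tgt_count.update(t_grams)
        joint.update(product(s_grams, t_grams))
    return {
        (s, t): 2 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in joint.items()
    }


def best_translation(gram, table):
    """Return the highest-scoring target n-gram for `gram`, or None."""
    candidates = {t: score for (s, t), score in table.items() if s == gram}
    return max(candidates, key=candidates.get) if candidates else None


# Hypothetical toy parallel data, for illustration only.
pairs = [
    ("milk", "lait"),
    ("cold milk", "lait froid"),
    ("cold water", "eau froide"),
]
table = dice_table(pairs)
print(best_translation("milk", table))
```

Because the unit of translation is a character sequence rather than a dictionary headword, proper names and inflected forms that appear in the parallel text can still receive candidate translations.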
In this dissertation I present an overview of the field of cross-
language information retrieval and then introduce the foundational
concepts in n-gram tokenization and corpus-based translation. Then
monolingual and bilingual experiments on test sets in 13 languages are
described. Analysis of these experiments gives insight into: the
relative efficacy of various tokenization methods; reasons why n-grams
are effective; the utility of automated relevance feedback, in both
monolingual and bilingual contexts; the interplay between tokenization
and translation; and how translation resource selection and size
influence bilingual retrieval.
Host: Charles Nicholas