Why You Should Use N-grams for Multilingual Information Retrieval

Wednesday, October 18, 2006, 11:00am - 12:00pm

2120 A.V. Williams Building, UMCP

While generally accepted for languages such as Chinese and Japanese, the use of character n-gram tokenization has not been widely adopted for information retrieval in alphabetic languages. However, n-grams are a simple representation for text that is surprisingly effective in diverse languages. In this talk I present empirical results in twelve European languages that have been studied in the Cross Language Evaluation Forum (CLEF) competitions. These results demonstrate that:

n-gram tokenization is very effective for monolingual retrieval;
morphologically complex languages benefit from n-gram use;
n-grams have performance disadvantages that can be overcome;
and, given parallel text, n-grams can be projected from one language to another, enabling highly accurate bilingual retrieval.

I will describe issues particular to n-gram indexing and retrieval, such as increased disk space consumption and query times, make a case that n-grams are a synthetic form of morphological normalization, and argue that when linguistic and translation resources are scarce (as is the case with less commonly studied languages), n-grams are an extremely attractive option for multilingual retrieval.
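As a rough illustration of the tokenization discussed above, the following is a minimal sketch of overlapping character n-gram extraction. It is not the system evaluated in the talk: the function name `char_ngrams`, the choice of n = 4, the lowercasing, and the space padding at word edges are all assumptions made for the example.

```python
def char_ngrams(text, n=4):
    """Return overlapping character n-grams of `text` (illustrative sketch).

    Assumptions for this example only: lowercase the text, collapse
    whitespace, and pad with one space on each side so that word
    boundaries show up inside the n-grams.
    """
    padded = " " + " ".join(text.lower().split()) + " "
    if len(padded) < n:
        return [padded]
    # Slide a window of width n across the padded string, one character at a time.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# Two morphological variants share their stem n-grams without any stemmer.
a = char_ngrams("juggling")
b = char_ngrams("juggler")
shared = set(a) & set(b)
print(sorted(shared))   # [' jug', 'jugg', 'uggl']

# One word token becomes several index terms, which is where the extra
# disk space and query time come from.
print(len(a))           # 7
```

The shared n-grams let the two word forms match each other with no language-specific resources, which is the sense in which n-grams approximate morphological normalization. The same example shows the cost: a single word yields seven index terms here.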
