Why You Should Use N-grams for Multilingual Information Retrieval
by Paul McNamee
Wednesday, October 18, 2006, 11:00am - 12:00pm
2120 A.V. Williams Building, UMCP
While generally accepted for languages such as Chinese and Japanese, character n-gram tokenization has not been widely adopted for information retrieval in alphabetic languages. However, n-grams are a simple representation of text that is surprisingly effective across diverse languages. In this talk I present empirical results for twelve European languages that have been studied in the Cross-Language Evaluation Forum (CLEF) competitions. These results demonstrate that:
- n-gram tokenization is very effective for monolingual retrieval;
- morphologically complex languages benefit from n-gram use;
- n-grams have performance disadvantages that can be overcome;
- and, given parallel text, n-grams can be projected from one language to another, enabling highly accurate bilingual retrieval.
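
For readers unfamiliar with the technique, character n-gram tokenization can be sketched as below. This is only an illustration and not the speaker's implementation: the function name char_ngrams, the default length n=4, the lowercasing, and the choice to pad with spaces and let n-grams span word boundaries are all assumptions made for the example.

```python
def char_ngrams(text, n=4):
    """Return the overlapping character n-grams of `text`.

    Assumptions for this sketch: text is lowercased, runs of whitespace
    collapse to one space, and a single space pads each end so that
    word-initial and word-final n-grams are distinguishable.
    """
    words = text.lower().split()
    if not words:
        return []
    normalized = " " + " ".join(words) + " "
    if len(normalized) < n:
        # Text shorter than n: keep it as a single token.
        return [normalized]
    # Slide a window of width n one character at a time.
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("cat", 3))  # [' ca', 'cat', 'at ']
```

Because every indexing term is a fixed-length character window, no language-specific stemmer or word segmenter is needed. Inflected forms of a word still share most of their n-grams.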