]> 2006-10-18T11:00:00-05:00 2006-10-18T12:00:00-05:00

n-gram tokenization is very effective for monolingual retrieval;

morphologically complex languages benefit from n-gram use;

n-grams have performance disadvantages that can be overcome;

and, given parallel text, n-grams can be projected from one language to another, enabling highly accurate bilingual retrieval.

I will describe issues particular to n-gram indexing and retrieval, such as increased disk space consumption and longer query times; make the case that n-grams are a synthetic form of morphological normalization; and argue that when linguistic and translation resources are scarce (as is the case with less commonly studied languages), n-grams are an extremely attractive option for multilingual retrieval. ]]>