<?xml version="1.0"?>

<!DOCTYPE owl [
  <!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <!ENTITY rdfs "http://www.w3.org/2000/01/rdf-schema#">
  <!ENTITY xsd "http://www.w3.org/2001/XMLSchema#">
  <!ENTITY owl "http://www.w3.org/2002/07/owl#">
  <!ENTITY cc "http://web.resource.org/cc/#">
  <!ENTITY event "http://ebiquity.umbc.edu/ontology/event.owl#">
  <!ENTITY person "http://ebiquity.umbc.edu/ontology/person.owl#">
  <!ENTITY assert "http://ebiquity.umbc.edu/ontology/assertion.owl#">]>

<!--
  This ontology document is licensed under the Creative Commons
  Attribution License. To view a copy of this license, visit
  http://creativecommons.org/licenses/by/2.0/ or send a letter to
  Creative Commons, 559 Nathan Abbott Way, Stanford, California
  94305, USA.
-->

<rdf:RDF 
  xmlns:rdf = "&rdf;"
  xmlns:rdfs = "&rdfs;"
  xmlns:xsd = "&xsd;"
  xmlns:owl = "&owl;"
  xmlns:cc = "&cc;"
  xmlns:event = "&event;"
  xmlns:person = "&person;"
  xmlns:assert = "&assert;">
  <event:Event rdf:about="http://ebiquity.umbc.edu/event/html/id/183/Why-You-Should-Use-N-grams-for-Multilingual-Information-Retrieval">
    <rdfs:label><![CDATA[Why You Should Use N-grams for Multilingual Information Retrieval]]></rdfs:label>
    <event:title><![CDATA[Why You Should Use N-grams for Multilingual Information Retrieval]]></event:title>
    <event:speaker><person:PhDStudent rdf:about="http://ebiquity.umbc.edu/person/html/Paul/McNamee/"><person:name><![CDATA[Paul McNamee]]></person:name><rdfs:label><![CDATA[Paul McNamee]]></rdfs:label></person:PhDStudent></event:speaker>
    <event:startDate rdf:datatype="&xsd;dateTime">2006-10-18T11:00:00-05:00</event:startDate>
    <event:endDate rdf:datatype="&xsd;dateTime">2006-10-18T12:00:00-05:00</event:endDate>
    <event:location><![CDATA[2120 A.V. Williams Building, UMCP]]></event:location>
    <event:abstract><![CDATA[While generally accepted for languages such as Chinese and Japanese, the use of character n-gram tokenization has not been widely adopted for information retrieval in alphabetic languages. However, n-grams are a simple representation for text that is surprisingly effective in diverse languages. In this talk I present empirical results in twelve European languages that have been studied in the Cross-Language Evaluation Forum (CLEF) competitions. These results demonstrate that:

<ul>
<li>n-gram tokenization is very effective for monolingual retrieval;</li>
<li>morphologically complex languages benefit from n-gram use;</li>
<li>n-grams have performance disadvantages that can be overcome;</li>
<li>and, given parallel text, n-grams can be projected from one language to another, enabling highly accurate bilingual retrieval.</li>
</ul>

I will describe issues particular to n-gram indexing and retrieval such as increased disk space consumption and query times, make a case that n-grams are a synthetic form of morphological normalization, and argue that when linguistic and translation resources are scarce (as is the case with less commonly studied languages), n-grams are an extremely attractive option for multilingual retrieval.
]]></event:abstract>
  </event:Event>

  <rdf:Description rdf:about="">
    <cc:license rdf:resource="http://creativecommons.org/licenses/by/2.0/" />
  </rdf:Description>

</rdf:RDF>
