Reuters has released an API for its Calais Web service. The free service discovers entities, events and relations in text and returns the results in the form of RDF data. The service uses information extraction technology from ClearForest, which Reuters acquired in April 2007.
“The Calais web service automatically attaches rich semantic metadata to the content you submit – in well under a second. Using natural language processing, machine learning and other methods, Calais categorizes and links your document with entities (people, places, organizations, etc.), facts (person ‘x’ works for company ‘y’), and events (person ‘z’ was appointed chairman of company ‘y’ on date ‘x’). The metadata results are stored centrally and returned to you as industry-standard RDF constructs accompanied by a Globally Unique Identifier (GUID). Using the Calais GUID, any downstream consumer is able to retrieve this metadata via a simple call to Calais.” (link)
The semantic types it recognizes and uses in its annotations are a basic set typical of information extraction systems and include entities, facts, events and categories. See, for example, the description of the person entity type. The brief API documentation describes how to call the web services and interpret the results. As an example of the semantic metadata types supported by Calais, a preprocessed sample content set of about 350 business and economic news articles from WikiNews for the year 2007 is available.
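To make the entity/fact/event distinction concrete, here is a minimal Python sketch of how an event such as "person z was appointed chairman of company y on date x" could be written down as RDF-style triples. The `http://example.org/` namespace, the predicate names and the helper functions are all invented for illustration; they are assumptions, not the vocabulary Calais actually returns.

```python
from collections import namedtuple

# A triple is the basic unit of RDF: (subject, predicate, object).
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])

# Hypothetical namespace and identifiers, invented for illustration only;
# they are not the vocabulary Calais actually emits.
EX = "http://example.org/"


def appointment_event(event_id, person, company, position, date):
    """Describe one 'person appointed to a position at a company' event as triples."""
    event = EX + "event/" + event_id
    return [
        Triple(event, EX + "type", EX + "PersonAppointment"),
        Triple(event, EX + "person", EX + "person/" + person),
        Triple(event, EX + "company", EX + "company/" + company),
        Triple(event, EX + "position", position),
        Triple(event, EX + "date", date),
    ]


def objects(triples, subject, predicate):
    """Return every object linked to the given subject by the given predicate."""
    return [t.obj for t in triples if t.subject == subject and t.predicate == predicate]


# The event from the quoted description, with made-up values.
triples = appointment_event("e1", "z", "y", "chairman", "2008-01-15")
print(objects(triples, EX + "event/e1", EX + "position"))
```

The person and the company are themselves identified by URIs, so the same entity can appear in many facts and events, which is what makes the output linkable rather than just a list of extracted strings.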
The service is free for both commercial and non-commercial purposes, with a generous daily limit on the number of service calls a registered developer can make. A sample Java application is available that reads input from STDIN, writes output to STDOUT and takes processing parameters from a configuration file.
- Update: The sample application requires Java 6 to run. Here’s an example of input and the RDF output.
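The shape of such a client (text in on STDIN, annotations out on STDOUT, parameters from a configuration file) is easy to sketch. The Python below is not the Reuters sample, which is written in Java. The `[calais]` section, the `output_format` key and the `annotate` function are hypothetical, and `annotate` is only a stub standing in for the real web service call.

```python
import configparser
import io


def load_params(config_text):
    """Parse processing parameters from INI-style text into a plain dict."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # The section name "calais" is an assumption made for this sketch.
    return dict(parser["calais"])


def annotate(text, params):
    """Stub for the service call. A real client would send `text` and
    `params` to the web service and return the RDF it replies with."""
    return "<!-- %d characters, output_format=%s -->" % (
        len(text),
        params["output_format"],
    )


def run(stdin, stdout, config_text):
    """Read the whole document from stdin and write the annotation to stdout."""
    params = load_params(config_text)
    stdout.write(annotate(stdin.read(), params) + "\n")


# In-memory streams stand in for the real STDIN and STDOUT here.
out = io.StringIO()
run(io.StringIO("Reuters acquired ClearForest."), out, "[calais]\noutput_format = rdf\n")
print(out.getvalue(), end="")
```

Passing `sys.stdin`, `sys.stdout` and the contents of a configuration file to `run` turns this into a command-line filter, so it can be chained with other tools in a pipeline.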
Making such a service freely available on the Web has the potential to be a disruptive move. Reuters will sponsor “a number of contests and bounties for applications developed using the Calais API.” An initial “bounty” of $5,000 is offered for “A highly configurable plugin for WordPress that enriches a blog with several capabilities” based on OpenCalais.
The kind of content extraction that Calais does falls considerably short of full language understanding. However, it does represent the state of the art in scalable, domain-independent information extraction, is immediately useful, and is an important step toward the ultimate goal of full NLP.