UMBC ebiquity research group Building intelligent systems in open, heterogeneous, dynamic, distributed environments
22 May 2008, 16:00:17 EDT  

Archive for the 'NLP' Category

Words your mobile phone is not allowed to say

March 3rd, 2008, by Tim Finin, posted in Social media, NLP, Humor, Mobile Computing

Language models are widely used in processing both written and spoken language. They are used for part-of-speech tagging, sense tagging, disambiguation, text similarity metrics, and many other tasks, including predicting the words a person intends when typing on a telephone keypad. The last application has some interesting wrinkles, as this video we spotted on Language Log explains.



The most popular predictive text system in use today is T9, developed by Nuance Communications. You can check out the video’s examples using this T9 demo.
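To make the keypad prediction idea concrete, here is a minimal sketch in Python. It is not how T9 itself is implemented; it only assumes the standard phone keypad letter groups and a toy unigram "language model" whose word counts are made up for illustration. All function names are hypothetical.

```python
from collections import defaultdict

# Standard phone keypad letter groups.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {ch: d for d, letters in KEYPAD.items() for ch in letters}


def to_digits(word):
    """Map a word to the key sequence that types it."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in word.lower())


def build_index(word_counts):
    """Group words by key sequence, most frequent word first."""
    index = defaultdict(list)
    for word, count in word_counts.items():
        index[to_digits(word)].append((count, word))
    return {
        digits: [w for _, w in sorted(entries, key=lambda e: (-e[0], e[1]))]
        for digits, entries in index.items()
    }


# Made-up counts standing in for a unigram language model.
TOY_COUNTS = {"good": 50, "home": 40, "gone": 20, "hood": 5, "hoof": 1}
INDEX = build_index(TOY_COUNTS)


def predict(digits):
    """Return candidate words for a key sequence, best guess first."""
    return INDEX.get(digits, [])


print(predict("4663"))
```

All five toy words share the key sequence 4663, so the ranking comes entirely from the counts. A word that is missing from the vocabulary can never be suggested, whatever keys are pressed.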

Reuters and the Semantic Web

February 10th, 2008, by Tim Finin, posted in Web 2.0, NLP, Semantic Web

Tim O’Reilly wrote in Reuters CEO sees “semantic web” in its future about Reuters’ motivations for embracing Semantic Web technology.

“At Money:Tech yesterday, I did an on-stage interview with Devin Wenig, the charismatic CEO-to-be of Reuters (following the still-not completed merger with Thomson). Devin highlighted what he considers two big trends hitting financial (and other professional) data: … The end of benefits from decreasing the time it takes for news to hit the market. … he increasingly sees Reuters’ job to be making connections, going from news to insight. He sees semantic markup to make it easier to follow paths of meaning through the data as an important part of Reuters’ future. … Ultimately, Reuters’ news is the raw material for analysis and application by investors and downstream news organizations. Adding metadata to make that job of analysis easier for those building additional value on top of your product is a really interesting way to view the publishing opportunity. If you don’t think of what you produce as the “final product” but rather as a step in an information pipeline, what do you do differently to add value for downstream consumers? In Reuters’ case, Devin thinks you add hooks to make your information more programmable.”

This provides some background for their recent announcement of the Reuters Calais information extraction service. It extracts named entities, events and relations from text and returns the information as RDF data.

Reuters Calais: free text to Semantic Web services

February 2nd, 2008, by Tim Finin, posted in Web 2.0, Social media, OWL, RDF, Web, NLP, Semantic Web

Reuters has released an API for its Calais Web service. The free service discovers entities, events and relations in text and returns the results in the form of RDF data. The services use information extraction technology from ClearForest, which Reuters acquired in April 2007.

“The Calais web service automatically attaches rich semantic metadata to the content you submit – in well under a second. Using natural language processing, machine learning and other methods, Calais categorizes and links your document with entities (people, places, organizations, etc.), facts (person ‘x’ works for company ‘y’), and events (person ‘z’ was appointed chairman of company ‘y’ on date ‘x’). The metadata results are stored centrally and returned to you as industry-standard RDF constructs accompanied by a Globally Unique Identifier (GUID). Using the Calais GUID, any downstream consumer is able to retrieve this metadata via a simple call to Calais.” (link)

The semantic types it recognizes and uses in its annotations are a basic set typical of information extraction systems and include entities, facts, events and categories. See, for example, the description of the person entity type. The brief API documentation describes how to call the web services and interpret the results. As an example of the semantic metadata types supported by Calais, a preprocessed sample content set of about 350 business and economic news articles from WikiNews for the year 2007 is available.

The service is free for both commercial and non-commercial purposes with a limit, but a generous one, on the number of service calls a registered developer can make in a day. A sample Java application is available that reads input from STDIN, writes output to STDOUT and takes processing parameters from a configuration file.

    updates: The sample application requires Java 6 to run! Here’s an example of input and the RDF output.

Making such a service freely available on the Web has the potential to be a disruptive move. Reuters will sponsor “a number of contests and bounties for applications developed using the Calais API.” An initial “bounty” of $5,000 is offered for “A highly configurable plugin for WordPress that enriches a blog with several capabilities” based on OpenCalais.

The kind of content extraction that Calais does falls considerably short of full language understanding. However, it does represent the state of the art in scalable, domain-independent information extraction, is immediately useful, and is an important step toward the ultimate goal of full NLP.

Cloud computing with Hadoop

December 26th, 2007, by Tim Finin, posted in Multicore Computation Center, NLP, AI, Semantic Web

The Web has become the repository of most of the world’s public knowledge. Almost all of it is still bound up in text, images, audio and video, which are easy for people to understand but less accessible for machines. While the computer interpretation of visual and audio information is still challenging, text is within reach. The Web’s infrastructure makes access to all this information trivial, opening up tremendous opportunities to mine text to extract information that can be republished in a more structured representation (e.g., RDF, databases) or used by machine learning systems to discover new knowledge. Current technologies for human language understanding are far from perfect, but can harvest the low-hanging fruit and are constantly improving. All that’s needed is an Internet connection and cycles, lots of them.

The latest approach to focusing lots of computing cycles on a problem is cloud computing, inspired in part by Google’s successful architecture and MapReduce software infrastructure.

Business Week had an article a few weeks ago, The Two Flavors of Google, that touches on some of the recent developments, including Hadoop and IBM and Google’s university cloud computing program. Hadoop is the product of an Apache Lucene project that provides a Java-based software framework to distribute processing over a cluster of processors. The BW article notes

“Cutting, a 44-year-old search veteran, started developing Hadoop 18 months ago while running the nonprofit Nutch Foundation. After he later joined Yahoo, he says, the Hadoop project (named after his son’s stuffed elephant) was just “sitting in the corner.” But in short order, Yahoo saw Hadoop as a tool to enhance the operations of its own search engine and to power its own computing clouds.” (source)

and adds this significant anecdote

“In early November, for example, the tech team at The New York Times rented computing power on Amazon’s cloud and used Hadoop to convert 11 million archived articles, dating back to 1851, to digital and searchable documents. They turned around in a single day a job that otherwise would have taken months.” (source)

The NYT’s Derek Gottfrid described the process in some detail in a post on the NYT Open blog, Self-service, Prorated Super Computing Fun!.

The Hadoop Quickstart page describes how to run it on a single node, enabling any high school geek who knows Java and has a laptop to try it out before finding (or renting) time on a cluster. This is just what we need for several upcoming projects and I am looking forward to trying it out soon. One requires processing the 1M documents in the TREC 8 collection and another the 10K documents in the ACE 2008 collection.
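For readers who have not seen the MapReduce programming model that Hadoop implements, the following is a minimal single-process sketch of it in plain Python. It does not use Hadoop's API. The function names and the two toy documents are assumptions made for illustration, and a real framework would run the map and reduce calls in parallel across a cluster.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Combine all values emitted for one key."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle and reduce over a dict of doc_id -> text."""
    pairs = (pair for doc_id, text in documents.items()
             for pair in map_phase(doc_id, text))
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())


# Toy documents, invented for this example.
docs = {"d1": "the web is text", "d2": "the text is on the web"}
counts = word_count(docs)
print(counts)
```

Because each map call sees only one document and each reduce call sees only one key, the work can be split across machines without the pieces needing to communicate, which is what makes the model a good fit for large document collections.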

AskWiki uses Wikipedia for semantic search

November 3rd, 2007, by Tim Finin, posted in Social media, NLP, Semantic Web

AskWiki is a service that answers simple questions with passages from Wikipedia. It is the product of a collaboration between the Wikimedia Foundation and AskMeNow, a company that uses NLP technology to answer simple questions sent via mobile phones. AskWiki classifies what it is doing as a kind of ‘semantic search’, which it defines (citing Wikipedia) as:

“Semantic Search attempts to augment and improve traditional Research Searches by leveraging XML data from semantic networks to disambiguate semantic search queries and web text in order to increase relevancy of results.” (link)

After playing with AskWiki a bit I was impressed with many of its answers and disappointed with others. About par for the course with this kind of system. But probing AskWiki’s performance and trying to reverse engineer what they are doing is a lot of fun. Mostly the system appears to work by selecting one sentence-level passage from Wikipedia that is the best match as an answer to the question. The question and sentences are probably pre-processed in various ways to improve the process. The match is probably based on the relevance as determined by the underlying IR engine.
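The guess above, that the system returns the single best-matching sentence, can be illustrated with a small sketch. This is not AskWiki's actual method, which is not documented here; it simply scores each candidate sentence by IDF-weighted word overlap with the question, one simple form of IR relevance. The function names and the three toy sentences are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def best_sentence(question, sentences):
    """Return the sentence with the highest IDF-weighted term overlap."""
    tokenized = [set(tokenize(s)) for s in sentences]
    # Number of sentences each term appears in.
    doc_freq = Counter(term for terms in tokenized for term in terms)
    n = len(sentences)
    q_terms = set(tokenize(question))

    def score(terms):
        # Rare terms get more weight than common ones.
        return sum(math.log(1 + n / doc_freq[t]) for t in q_terms if t in terms)

    scores = [score(terms) for terms in tokenized]
    return sentences[max(range(n), key=scores.__getitem__)]


# Toy candidate passages, written for this example.
SENTENCES = [
    "The Nile is the longest river in Africa.",
    "The Congo River crosses the equator twice.",
    "Brussels is the capital of Belgium.",
]

print(best_sentence("What is the capital of Belgium?", SENTENCES))
print(best_sentence("longest river in Africa", SENTENCES))
```

A scorer like this rewards shared vocabulary, not understanding, so a sentence that merely repeats the question's words can outrank one that actually answers it.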

What makes AskWiki perform so well, I’d guess, is the breadth, style and quality of Wikipedia’s content.

For more information, see AskWiki: Wikipedia Knows What You’re Thinking or, better yet, sit down with AskWiki and talk. Here’s a transcript of my session with it along with some comments.

Q: how high is the empire state building?
A: Antenna spire: 449 m (1,472 ft)
From: Empire State Building
Comment: This answer was extracted from the Infobox.

Q: empire state building height
A: The Empire State Building rises to 381 m (1,250 feet) at the 102nd floor, and including the 203ft pinnacle its full height reaches 443 m (1,453 ft and 8 9/16th in).
From: Empire State Building
Comment: This sentence is from the article’s text, and seems better than the first one.

Q: How heavy is a bowling ball?
A: Most people can easily buy a bowling ball that is at least 1 to 2 pounds heavier than they used as a house ball.
From: Bowling ball
Comment: Nice try, but it’s a spare.

Q: How long should I cook a hard boiled egg?
A: Studies done at the American Egg Board substantiate that unopened containers of brined or pickled eggs (marinated, hard-cooked eggs) keep for several months on the shelf.
From: Boiled egg
Comment: If the system could suggest three possible answers it might do better. For example, the boiled egg article does have this sentence, which would have been a nice answer: Boiled eggs are typically boiled for 10-11 minutes or until they float, depending on the size of the egg.

Q: What is the longest river in the world?
A: This is a list of the longest rivers on Earth. It includes river systems over 1,000 kilometers.
Comment: This might have been a good answer if it had not been truncated.

Q: What is the longest river in Africa
A: Nile.
From: Wikipedia:Nile
Comment: Good answer.

Q: What is the longest river in Africa ?
A: The length of a river between source and mouth may be hard to determine because of a lack of precise maps. In these cases, the measured length of a river will depend on the scale of the map on which the measurement is based; in general, due to the fractal quality of a river, the larger the scale, the longer the resulting length measurement.
From: List of rivers by length
Comment: adding the question mark changed the answer. I have no idea why.

Q: what is the longest river in Africa?
A: For example, although the Congo River appears to be a natural geographic boundary, there were groups that otherwise shared a language, culture or other similarity living on both sides. The division of the land between Belgium and France along the river isolated these groups from each other.
Comment: Yet another answer. Maybe it selects a random one from the top results.

Q: does P equal NP
A: A nurse practitioner is a registered nurse who has completed specific advanced nursing education (generally a master’s degree) and training in the diagnosis and management of common medical conditions. Nurse practitioners provide a broad range of health care services.
Comment: Maybe it’s a dodge.

Q: Who directed “They live by Night”?
A: Director: Nicholas Ray
Comment: That’s the right answer for a somewhat obscure film, which was Ray’s directing debut.

Q: Will there be peace in our lifetime?
A: No answer found for the question “Will there be peace in our lifetime?”
Comment: No comment.

Q: What is the value of pi?
A: Pi or π is the ratio of a circle’s circumference to its diameter in Euclidean geometry, approximately 3.14159.
Comment: correct, but an easy question.

Q: What is the square root of 16?
A: Year 16 was a leap year starting on Wednesday (link will display the full calendar) of the Julian calendar.
From: 16
Comment: this is a nice example that shows a fundamental weakness. AskWiki can only look for an answer that seems appropriate from among its several billion sentences. Even though that’s a large number, the number of ‘facts’ that people know is infinite.

Q: What is an emu
A: The Emu (pronunciation:), Dromaius novaehollandiae, is the largest bird native to Australia and the only extant member of the genus Dromaius. It is also the second-largest extant bird in the world by height, after its ratite relative, the ostrich. The soft-feathered, brown, flightless birds reach up to 2 m (6 ft) in height. The Emu is common over most of mainland Australia, although it avoids heavily populated areas, dense forest and arid areas. Emus can travel great distances at a fast, economical trot and, if necessary, can sprint at 50 km/h (30 mph) for some distance at a time.
From: Emu

Q: Can an emu fly?
A: The pelvic limb muscles of Emus have a similar contribution to total body mass as the flight muscles of flying birds.
From: Emu
Comment: While not a good answer, the sentence does at least imply that emus cannot fly.

Q: How many popes have there been
A: Where there are sins, there are also divisions, schisms, heresies and disputes. Where there is virtue, however, there also are harmony and unity, from which arise the one heart and one soul of all believers (Cf CIC, can751.).
From: Pope
Comment: Inscrutable answer.

Q: Is there a god
A: Many historians of religion hold that monotheism may be of relatively recent historical origins - although comparison is difficult as many religions claim to be ancient.
Comment: Hmmmm…

Q: Where is Tanzania
A: Tanzania officially the United Republic of Tanzania (Swahili: Jamhuri ya Muungano wa Tanzania), is a country in East Africa bordered by Kenya and Uganda on the north, Rwanda, Burundi and the Democratic Republic of the Congo on the west, and Zambia, Malawi and Mozambique on the south. To the east it borders the Indian Ocean.
Comment: good answer

Q: Where is Brown University
A: Brown University is a private university located in Providence, Rhode Island. Founded in 1764 as the College of Rhode Island, it is the third-oldest institution of higher education in New England and the seventh-oldest in the United States. It is a member of the Ivy League. Pembroke College, Brown University’s all women’s college, merged with The College, in 1971.
From: Brown University
Comment: Also good.

Q: Why is the sky blue?
A: The light from the sky is a result of diffuse sky radiation which results in a light blue color being perceived by the human eye. On a sunny day the earth sky usually looks as a blue gradient — dark in the zenith, light near the horizon (due to Rayleigh scattering). It can turn a multitude of colors such as purple (especially near sunset and sunrise) and black at night. Scattering effects also partially polarize light from the sky.
From: Wikipedia:Sky
Comment: Not bad for this old chestnut.

Q: When was Charles Darwin born
A: Birth date: February 12, 1809
From: Charles Darwin
Comment: good!

Q: What is the capital of Belgium?
A: Capital: Brussels
From: Belgium
Comment: Right!

Q: What is the population of Brussels?
A: Population metro: 1,975,000
From: Brussels
Comment: Right!!

Q: what is the population of the capital of Belgium?
A: Population metro: 1,975,000
From: Brussels
Comment: I expected this one to fail!

Q: Is no your answer to this question?
A: This depends on the context, as do several other possibilities: Sometimes “I don’t know” is an acceptable answer, sometimes even a right answer. The same is true of “None of the above” and “There is no answer”. An answer is the, or a, right answer, if it presents true information which falls within the determined range of alternatives. Questions of this simplest sort usually begin with Who, what, which, where, when, does/do, is/are.
From: Question
Comment: Wow!

Q: is waterboarding torture?
A: Recent torture methods include waterboarding, sexual humiliation and sexual abuse, and the use of dogs against prisoners.
From: Torture
Comment: No moral confusion here!

Q: What is the airspeed velocity of an unladen swallow?
A: Monty Python and the Holy Grail is a 1975 film written and performed by the comedy group Monty Python (Graham Chapman, John Cleese, Terry Gilliam, Eric Idle, Terry Jones, and Michael Palin), and directed by Gilliam and Jones. It was made during a gap between the third and fourth seasons of their popular BBC television series Monty Python’s Flying Circus.
From: Wikipedia:Monty Python and the Holy Grail
Comment: Maybe the selection process is done in two steps: find a relevant article and then find a good answer within it.

Q: Can a computer think?
A: However, computers cannot “think” for themselves in the sense that they only solve problems in exactly the way they are programmed to.
Comment: This answer must have been painful for poor old AskWiki to give.

Q: who is jimbo wales?
A: Jimmy Donal “Jimbo” Wales (born 7 August 1966 in Huntsville, Alabama) is an American Internet entrepreneur known for his role in founding Wikipedia and other wiki-related
Comment: Know your master.

Q: Is jimbo wales a genius or what?
A: No answer found for the question “Is jimbo wales a genius or what?”
Comment: Ahhh, AskWiki may be showing some tact.

Powerset outsources query result evaluation to Mechanical Turk

October 21st, 2007, by Tim Finin, posted in Social media, NLP, Web, Semantic Web

TechCrunch reports that Powerset is using Amazon’s Mechanical Turk to evaluate different search results for queries. TechCrunch has a screenshot of an example Turk query.

“See the screen shot… users are shown a query and a number of results and are asked to evaluate the relevancy of each result from five choices. In this case, the query is “revealing bikinis.” Users are asked to evaluate four sets of results within ten minutes, and are paid $0.02 for the effort.

I spoke with Powerset CEO Barney Pell this evening who confirmed that they are using Mechanical Turk to get human feedback on search results. He says the results are not all Powerset generated - rather, they show results from Powerset, Google and others to see which users prefer for a given query. He also says this is an ongoing project, and new ones will be added soon. Pell also said that Powerset plans to use Mechanical Turk over the long haul, even after launch. They’ll put actual user queries into Mechanical Turk in real time, add Powerset and competitor results and see which results people find more relevant. If results suggest Powerset isn’t more relevant, they’ll adjust their engine.” (link)

This is a good example of how Amazon’s service can work. I was surprised at the low cost: two cents for a judgment! We have a number of projects where we need to have human assessments for training data. Instead of turning to our usual source, students and faculty, maybe we should explore using the Mechanical Turk. In some cases, getting local people to do the judgments was difficult. For example, we were interested in expanding our splog detection system to languages other than English, but didn’t have access to native speakers of the right languages.

Search the Enron email corpus online

February 5th, 2006, by Tim Finin, posted in NLP, AI

The Enron email corpus is a collection of hundreds of thousands of email messages from the infamous Enron corporation that researchers have been using to improve and evaluate techniques for analyzing email, e.g., NLP analysis, information extraction, sentiment detection, social network analysis, information flow, etc. It’s become important because it is the only substantial collection of real email that is public. In the ebiquity lab, for example, Akshay Java has worked with UMBC’s Institute for Language and Information Technologies to bring to bear their NLP technology on the messages.

InBoxer has put up an Enron Email site that lets anyone explore and search the collection on the Web. InBoxer is not a research group, but a company that sells an “anti-risk appliance” that is used to detect when email that is about to be sent or has been sent violates policy. (There should be a good market for this in the Government, too!).

You can also surf the corpus via a simple database interface at UC Berkeley.

William Cohen of CMU describes the collection:

This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. … The dataset here does not include attachments, and some messages have been deleted “as part of a redaction effort due to requests from affected employees”.

Now it’s convenient to explore corporate malfeasance on the Web.

SemNews: NLP system generates Semantic Web representation of news summaries

January 12th, 2006, by Tim Finin, posted in Swoogle, NLP, Ontologies, AI, Web, Semantic Web

SemNews is a prototype application being developed by UMBC Ph.D. student Akshay Java that uses a sophisticated text understanding system to interpret summaries of news stories, publishes the results on the semantic web and provides browsing and query services over them. The project is the result of a collaboration between UMBC’s Institute for Language and Information Technologies and the Ebiquity Laboratory with partial support from the Lockheed Martin Corporation.

SemNews monitors a number of news source RSS feeds and processes new stories as they are published. After extracting a story’s metadata, its news summary is interpreted by the OntoSem text analyzer which does a syntactic, semantic, and pragmatic analysis of the text, resulting in its text meaning representation or TMR. A TMR is a language-neutral description (an interlingua) of the meaning conveyed in a natural language text. In addition to providing information about the lexical-semantic dependencies in the text, the TMR represents stylistic factors, discourse relations, speaker attitudes, and other pragmatic factors present in the discourse structure. In doing so, the TMR captures not only the meaning of individual elements in the text, but also the relations between those elements, and captures both propositional and non-propositional components of textual meaning. OntoSem’s TMRs are represented in a custom frame-based representation language and grounded in the Mikrokosmos ontology, an extensive ontology with over 30K concepts and nearly 400K entities.

Each story’s metadata and TMR are translated into the Semantic Web language OWL via the OntoSem2OWL translator developed for this project. The results are then added to a special collection indexed by the Swoogle search engine and also put into a RDF triple store. These are used to support several services enabling people and agents to semantically browse, query and visualize the stories in the collection, enabling access to information that would otherwise not be easy to find using simple keyword based search.

For example, one can browse through the story collection via the ontology to find stories that involve certain concepts, such as a terrorist organization; find all stories that involve entities in OntoSem’s onomasticon, such as al qaeda or Karbala; visualize the stories on a map based on the locations they reference; or construct an arbitrary query, such as finding “stories in which the nation named Afghanistan was the location of a bombing event.” Users can also define semantic “alerts” as queries over the RDF triple store and/or the Swoogle collection. For each alert, SemNews will generate an RSS feed of the results.
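To show what a query such as “stories in which the nation named Afghanistan was the location of a bombing event” looks like against a triple store, here is a minimal sketch in plain Python. It is not SemNews code and does not use the OntoSem ontology: every identifier, predicate name and triple below is hypothetical, and a real system would use an RDF store and a query language instead of this toy matcher.

```python
# Toy triples (subject, predicate, object); every identifier is hypothetical.
TRIPLES = [
    ("story1", "describes", "event1"),
    ("event1", "type", "Bombing"),
    ("event1", "location", "nation1"),
    ("nation1", "type", "Nation"),
    ("nation1", "name", "Afghanistan"),
    ("story2", "describes", "event2"),
    ("event2", "type", "Election"),
    ("event2", "location", "nation1"),
]


def match(pattern, triple, bindings):
    """Unify one pattern with one triple; return extended bindings or None."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was first bound to.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def query(patterns, triples, bindings=None):
    """Yield every set of variable bindings that satisfies all patterns."""
    bindings = bindings or {}
    if not patterns:
        yield bindings
        return
    for triple in triples:
        extended = match(patterns[0], triple, bindings)
        if extended is not None:
            yield from query(patterns[1:], triples, extended)


# Terms starting with "?" are variables; the rest must match exactly.
BOMBINGS_IN_AFGHANISTAN = [
    ("?story", "describes", "?event"),
    ("?event", "type", "Bombing"),
    ("?event", "location", "?nation"),
    ("?nation", "type", "Nation"),
    ("?nation", "name", "Afghanistan"),
]

stories = sorted({b["?story"] for b in query(BOMBINGS_IN_AFGHANISTAN, TRIPLES)})
print(stories)
```

The second story shares the location but describes an election, so only the first one satisfies every pattern. A keyword search for “Afghanistan bombing” could not make that distinction, which is the advantage of querying the semantic representation.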

The SemNews system is currently a research prototype that is being used to refine the underlying technologies and to explore how the sophisticated automatic linguistic processing of text can be integrated into the Semantic Web and conventional web applications. Ongoing work on SemNews includes an evaluation of its semantic recall and precision as well as a service that can group and cluster stories based on their semantic representations.

