UMBC ebiquity research group Building intelligent systems in open, heterogeneous, dynamic, distributed environments
Archive for the 'NLP' Category

NLTK: a natural language processing toolkit in Python

October 11th, 2008, by Tim Finin, posted in NLP, Semantic Web

NLTK looks very useful.

“NLTK — the Natural Language Toolkit — is a suite of open source Python modules, data and documentation for research and development in natural language processing. NLTK contains Code supporting dozens of NLP tasks, along with 40 popular Corpora and extensive Documentation including a 375-page online Book. Distributions for Windows, Mac OSX and Linux are available.”

The development of NLTK is led by Steven Bird, Edward Loper, and Ewan Klein.

(Spotted on Language Log.)

Evri helps you search less and understand more

September 27th, 2008, by Tim Finin, posted in NLP, Ontologies, Semantic Web

Evri is another entry into the ‘semantic search’ space and has recently opened up a beta site with the slogan Search less, understand more. Evri is a startup launched by Vulcan Inc, a company founded by Paul Allen in 1986 as a private investment and R&D firm.

Here’s part of how Evri describes itself in its FAQ:

“What is Evri doing? Evri is creating a map of connections between people, places, and things on the web. You’ll use this map to find the things you’re interested in. Instead of searching by keywords and looking for relevant results, Evri will lead you to other relevant articles, images, and video based on what you’re reading.

Where does Evri get its information? We search the World Wide Web and gather content from as many highly regarded information sources as we can find, and we’re adding more sources all the time.”

Saying that Evri does ‘semantic search’ is not quite right: their initial focus is on providing widgets for blogs and other web sites that use the text on the page to recommend links to other, related information.

Evri appears to have developed an underlying ontology that is used to organize their knowledge of “people, products and things”, capturing both a type taxonomy and relations. Some of this is revealed in the beta² part of their site, Evri’s Garden. There is also a query system over their knowledge base that supports complex search queries.

The current push, though, seems to be to get bloggers to add an Evri widget to their blogs that will pop up a window with links to related articles and information.

This is an interesting development that is worth watching.

HealthMap mines text for a global disease alert map

July 8th, 2008, by Tim Finin, posted in NLP, Semantic Web, Social media, Web, Web 2.0

HealthMap is an interesting Web site that displays a “global disease alert map” based on information extracted from a variety of text sources on the Web, including news, WHO and NGOs. HealthMap was developed as a research project by Clark Freifeld and John Brownstein of the Children’s Hospital Informatics Program, part of the Harvard-MIT Division of Health Sciences & Technology.

Their site says

“HealthMap brings together disparate data sources to achieve a unified and comprehensive view of the current global state of infectious diseases and their effect on human and animal health. This freely available Web site integrates outbreak data of varying reliability, ranging from news sources (such as Google News) to curated personal accounts (such as ProMED) to validated official alerts (such as World Health Organization). Through an automated text processing system, the data is aggregated by disease and displayed by location for user-friendly access to the original alert. HealthMap provides a jumping-off point for real-time information on emerging infectious diseases and has particular interest for public health officials and international travelers.”

The work was done in part with support from Google, as described in a story on ABC News, Researchers Track Disease With Google News, Google.org Money.

Microsoft rumored to buy semantic search startup Powerset

June 26th, 2008, by Tim Finin, posted in AI, NLP, Semantic Web, Web 2.0

Venture Beat reports that Microsoft will acquire Powerset for a price “rumored to be slightly more than $100 million”. Powerset has been developing a Web search system that uses natural language processing technology acquired from PARC to more fully understand users’ queries and the text of the documents it indexes.

“By buying Powerset, Microsoft is hoping to close the perceived quality gap with Google’s search engine. The move comes as Microsoft CEO Steve Ballmer continues to argue that improving search is Microsoft’s most important task. Microsoft’s market share in search has steadily declined, dropping further and further behind first-place Google and second place Yahoo.

Google has generally dismissed Powerset’s semantic, or “natural language” approach as being only marginally interesting, even though Google has hired some semantic specialists to work on that approach in limited fashion. Google’s search results are still based primarily on the individual words you type into its search bar, and its approach does very little to understand the possible meaning created by joining two or more words together.”

If you put the query “Where is Mount Kilimanjaro” into the beta version of Powerset, it answers “Mount Kilimanjaro: Contained by Tanzania” in addition to showing web pages extracted from Wikipedia. That’s a pretty good answer.

Its response to “what is the Serengeti” is a little less precise. It reports seven things it knows about Serengeti: that it replaced ‘desert’, ‘Platinum’, ‘twilight’ and ‘Caribbean Blue’, that it hosted ‘migration’, that it provided ‘draw’, that it gained ‘fame’, that it recorded ‘explorations’, that it rutted ‘season’ and that it boasted ‘Blue Wildebeests’. I’m just glad I don’t have a school report on the Serengeti due tomorrow!

Asking “Who is the president of Zimbabwe” results only in the fallback answer — which appears to be just the set of Wikipedia pages that the query words produce in an IR query. Compare this with the results of the Google query who is the president of zimbabwe site:wikipedia.org.

By the way, the AskWiki system often does a better job on these kinds of questions. Asking “where is the Serengeti” produces the answer “The Serengeti ecosystem is located in north-western Tanzania and extends to south-western Kenya between latitudes 1 and 3 S and longitudes 34 and 36 E. It spans some 30,000 km.” It’s a bit of a hack, though. It seems to work by selecting the sentence or two in Wikipedia that best serves as an answer. See our post on AskWiki from last fall for more examples.

Still, Powerset is an ambitious system that shows promise. What they are trying to do is important and will eventually be done. They have shown real progress in the past two years, more than I had expected. I hope Microsoft can accelerate the development and find practical ways to improve Web search even if the ultimate goal of full language understanding is many years away.

Colin de la Higuera on Grammatical Inference, 1pm Tue June 10, ITE 325, UMBC

June 5th, 2008, by Tim Finin, posted in AI, Machine Learning, NLP

Colin de la Higuera of Jean Monnet University will talk on “Grammatical Inference: Some of the Questions Out There” at 1:00pm next Tuesday in the large CSEE conference room.

“Grammatical Inference is a field concerned with learning grammars given data about a language. In this talk we survey some of the questions being addressed by researchers in the field. Some of these are now classical and have been looked into for some time, others are more recent:

  • understanding the models and the paradigms: what does polynomial language learning mean?
  • learning more complex families of languages
  • scaling up and using grammatical inference in applications”

Words your mobile phone is not allowed to say

March 3rd, 2008, by Tim Finin, posted in Humor, Mobile Computing, NLP, Social media

Language models are widely used in processing both written and spoken language. They are used for part of speech tagging, sense tagging, disambiguation, text similarity metrics, and many other tasks, including predicting the words a person intends when typing on a telephone keypad. The last application has some interesting wrinkles, as this video we spotted on Language Log explains.



The most popular predictive text system in use today is T9, developed by Nuance Communications. You can check out the video’s examples using this T9 demo.
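
To make the keypad ambiguity concrete, here is a minimal sketch of frequency-ranked keypad prediction. It is not Nuance's T9 algorithm, whose internals the post does not describe; it only assumes the standard phone keypad letter groups and a tiny set of made-up word counts standing in for a unigram language model. The names `to_digits`, `build_index` and `predict` are hypothetical.

```python
from collections import defaultdict

# Standard phone keypad letter groups.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {ch: d for d, letters in KEYPAD.items() for ch in letters}


def to_digits(word):
    """Map a word to the key sequence a user would press to type it."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in word.lower())


def build_index(word_counts):
    """Group words by key sequence, most frequent first (ties alphabetical)."""
    index = defaultdict(list)
    for word, count in word_counts.items():
        index[to_digits(word)].append((count, word))
    return {
        keys: [w for _, w in sorted(entries, key=lambda e: (-e[0], e[1]))]
        for keys, entries in index.items()
    }


# Hypothetical unigram counts, invented purely for illustration.
COUNTS = {"good": 50, "home": 40, "gone": 20, "hood": 5, "book": 30, "cool": 25}
INDEX = build_index(COUNTS)


def predict(keys):
    """Return candidate words for a key sequence, best guess first."""
    return INDEX.get(keys, [])
```

With these counts, `predict("4663")` ranks "good" ahead of "home", "gone" and "hood", all of which share the same key presses. A word missing from the vocabulary can never be suggested, which is the wrinkle the video plays on.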

Reuters and the Semantic Web

February 10th, 2008, by Tim Finin, posted in NLP, Semantic Web, Web 2.0

Tim O’Reilly wrote in Reuters CEO sees “semantic web” in its future about Reuters’ motivations for embracing Semantic Web technology.

“At Money:Tech yesterday, I did an on-stage interview with Devin Wenig, the charismatic CEO-to-be of Reuters (following the still-not completed merger with Thomson). Devin highlighted what he considers two big trends hitting financial (and other professional) data: … The end of benefits from decreasing the time it takes for news to hit the market. … he increasingly sees Reuters’ job to be making connections, going from news to insight. He sees semantic markup to make it easier to follow paths of meaning through the data as an important part of Reuters’ future. … Ultimately, Reuters’ news is the raw material for analysis and application by investors and downstream news organizations. Adding metadata to make that job of analysis easier for those building additional value on top of your product is a really interesting way to view the publishing opportunity. If you don’t think of what you produce as the “final product” but rather as a step in an information pipeline, what do you do differently to add value for downstream consumers? In Reuters’ case, Devin thinks you add hooks to make your information more programmable.”

This provides some background for their recent announcement of the Reuters Calais information extraction service. It extracts named entities, events and relations from text and returns the information as RDF data.

Reuters Calais: free text to Semantic Web services

February 2nd, 2008, by Tim Finin, posted in NLP, OWL, RDF, Semantic Web, Social media, Web, Web 2.0

Reuters has released an API for its Calais Web service. The free service discovers entities, events and relations in text and returns the results in the form of RDF data. The services use information extraction technology from ClearForest, which Reuters acquired in April 2007.

“The Calais web service automatically attaches rich semantic metadata to the content you submit – in well under a second. Using natural language processing, machine learning and other methods, Calais categorizes and links your document with entities (people, places, organizations, etc.), facts (person ‘x’ works for company ‘y’), and events (person ‘z’ was appointed chairman of company ‘y’ on date ‘x’). The metadata results are stored centrally and returned to you as industry-standard RDF constructs accompanied by a Globally Unique Identifier (GUID). Using the Calais GUID, any downstream consumer is able to retrieve this metadata via a simple call to Calais.” (link)

The semantic types it recognizes and uses in its annotations are a basic set typical of information extraction systems and include entities, facts, events and categories. See, for example, the description of the person entity type. The brief API documentation describes how to call the web services and interpret the results. As an example of the semantic metadata types supported by Calais, a preprocessed sample content set of about 350 Business and Economic news articles from WikiNews for the year 2007 is available.

The service is free for both commercial and non-commercial purposes with a limit, but a generous one, on the number of service calls a registered developer can make in a day. A sample Java application is available that reads input from STDIN, writes output to STDOUT and takes processing parameters from a configuration file.

    updates: The sample application requires Java 6 to run! Here’s an example of input and the RDF output.

Making such a service freely available on the Web has the potential to be a disruptive move. Reuters will sponsor “a number of contests and bounties for applications developed using the Calais API.” An initial “bounty” of $5,000 is offered for “A highly configurable plugin for WordPress that enriches a blog with several capabilities” based on OpenCalais.

The kind of content extraction that Calais does falls considerably short of full language understanding. However, it does represent the state of the art in scalable, domain-independent information extraction, is immediately useful, and is an important step toward the ultimate goal of full NLP.

Cloud computing with Hadoop

December 26th, 2007, by Tim Finin, posted in AI, Multicore Computation Center, NLP, Semantic Web

The Web has become the repository of most of the world’s public knowledge. Almost all of it is still bound up in text, images, audio and video, which are easy for people to understand but less accessible for machines. While the computer interpretation of visual and audio information is still challenging, text is within reach. The Web’s infrastructure makes access to all this information trivial, opening up tremendous opportunities to mine text to extract information that can be republished in a more structured representation (e.g., RDF, databases) or used by machine learning systems to discover new knowledge. Current technologies for human language understanding are far from perfect, but can harvest the low hanging fruit and are constantly improving. All that’s needed is an Internet connection and cycles, lots of them.

The latest approach to focusing lots of computing cycles on a problem is cloud computing, inspired in part by Google’s successful architecture and MapReduce software infrastructure.

Business Week had an article a few weeks ago, The Two Flavors of Google, that touches on some of the recent developments, including Hadoop and IBM and Google’s university cloud computing program. Hadoop is the product of an Apache Lucene project that provides a Java-based software framework to distribute processing over a cluster of processors. The BW article notes

“Cutting, a 44-year-old search veteran, started developing Hadoop 18 months ago while running the nonprofit Nutch Foundation. After he later joined Yahoo, he says, the Hadoop project (named after his son’s stuffed elephant) was just “sitting in the corner.” But in short order, Yahoo saw Hadoop as a tool to enhance the operations of its own search engine and to power its own computing clouds.” (source)

and adds this significant anecdote

“In early November, for example, the tech team at The New York Times rented computing power on Amazon’s cloud and used Hadoop to convert 11 million archived articles, dating back to 1851, to digital and searchable documents. They turned around in a single day a job that otherwise would have taken months.” (source)

The NYT’s Derek Gottfrid described the process in some detail in a post on the NYT Open blog, Self-service, Prorated Super Computing Fun!.

The Hadoop Quickstart page describes how to run it on a single node, enabling any high school geek who knows Java and has a laptop to try it out before finding (or renting) time on a cluster. This is just what we need for several upcoming projects and I am looking forward to trying it out soon. One requires processing the 1M documents in the TREC 8 collection and another the 10K documents in the ACE 2008 collection.
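
For readers new to the MapReduce model that Hadoop implements, the following is a small in-process sketch of the map, shuffle and reduce steps, using word counting as the job. It is plain Python, not Hadoop's Java API, and runs on one machine; the function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Emit a (word, 1) pair for every whitespace-separated token."""
    return [(word.lower(), 1) for word in text.split()]


def shuffle(pairs):
    """Group emitted values by key, the step a framework performs between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Combine all values for one key into a single total."""
    return word, sum(counts)


def word_count(docs):
    """Run the three steps over a {doc_id: text} mapping."""
    mapped = chain.from_iterable(map_phase(i, t) for i, t in docs.items())
    return dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())
```

Because each call to `map_phase` looks at one document and each call to `reduce_phase` looks at one key, a real framework can spread those calls across many machines, which is what makes the approach attractive for collections of a million documents.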

AskWiki uses Wikipedia for semantic search

November 3rd, 2007, by Tim Finin, posted in NLP, Semantic Web, Social media

AskWiki is a service that answers simple questions with passages from Wikipedia. It is the product of a collaboration between the Wikimedia Foundation and AskMeNow, a company that uses NLP technology to answer simple questions sent via mobile phones. AskWiki classifies what it is doing as a kind of ‘semantic search’, which it defines (citing Wikipedia) as:

“Semantic Search attempts to augment and improve traditional Research Searches by leveraging XML data from semantic networks to disambiguate semantic search queries and web text in order to increase relevancy of results.” (link)

After playing with AskWiki a bit I was impressed with many of its answers and disappointed with others. About par for the course with this kind of system. But probing AskWiki’s performance and trying to reverse engineer what they are doing is a lot of fun. Mostly the system appears to work by selecting one sentence-level passage from Wikipedia that is the best match as an answer to the question. The question and sentences are probably pre-processed in various ways to improve the process. The match is probably based on the relevance as determined by the underlying IR engine.
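
The guess above, that the system picks the single best-matching sentence, can be sketched in a few lines. This is only an illustration of that guess, not AskWiki's actual method, which is not public here: it scores each candidate sentence by the IDF-weighted overlap of its terms with the question's terms. The stopword list and the `best_sentence` function are assumptions made for the example.

```python
import math
import re
from collections import Counter

# A small, hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {
    "a", "an", "the", "is", "are", "of", "in", "to", "for", "and", "or",
    "what", "how", "who", "where", "when", "do", "does", "i", "should",
}


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def best_sentence(question, sentences):
    """Return the sentence with the highest IDF-weighted term overlap, or None."""
    if not sentences:
        return None
    docs = [set(tokenize(s)) for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each term appears.
    df = Counter(term for doc in docs for term in doc)
    q_terms = set(tokenize(question))

    def score(doc):
        return sum(math.log(1 + n / df[t]) for t in q_terms if t in doc)

    scores = [score(d) for d in docs]
    best = max(range(n), key=lambda i: scores[i])
    return sentences[best] if scores[best] > 0 else None
```

A matcher like this has no notion of what kind of answer a question wants, so it can return a sentence that shares the right words but answers a different question, which is consistent with several of the misses in the transcript below.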

What makes AskWiki perform so well, I’d guess, is the breadth, style and quality of Wikipedia’s content.

For more information, see AskWiki: Wikipedia Knows What You’re Thinking or, better yet, sit down with AskWiki and talk. Here’s a transcript of my session with it along with some comments.

Q: how high is the empire state building?
A: Antenna spire: 449 m (1,472 ft)
From: Empire State Building
Comment: This answer was extracted from the Infobox.

Q: empire state building height
A: The Empire State Building rises to 381 m (1,250 feet) at the 102nd floor, and including the 203ft pinnacle its full height reaches 443 m (1,453 ft and 8 9/16th in).
From: Empire State Building
Comment: This sentence is from the article’s text, and seems better than the first one.

Q: How heavy is a bowling ball?
A: Most people can easily buy a bowling ball that is at least 1 to 2 pounds heavier than they used as a house ball.
From: Bowling ball
Comment: Nice try, but it’s a spare.

Q: How long should I cook a hard boiled egg?
A: Studies done at the American Egg Board substantiate that unopened containers of brined or pickled eggs (marinated, hard-cooked eggs) keep for several months on the shelf.
From: Boiled egg
Comment: If the system could suggest three possible answers it might do better. For example, the boiled egg article does have this sentence, which would have been a nice answer: Boiled eggs are typically boiled for 10-11 minutes or until they float, depending on the size of the egg.

Q: What is the longest river in the world?
A: This is a list of the longest rivers on Earth. It includes river systems over 1,000 kilometers.
Comment: This might have been a good answer if it had not been truncated.

Q: What is the longest river in Africa
A: Nile.
From: Wikipedia:Nile
Comment: Good answer.

Q: What is the longest river in Africa ?
A: The length of a river between source and mouth may be hard to determine because of a lack of precise maps. In these cases, the measured length of a river will depend on the scale of the map on which the measurement is based; in general, due to the fractal quality of a river, the larger the scale, the longer the resulting length measurement.
From: List of rivers by length
Comment: adding the question mark changed the answer. I have no idea why.

Q: what is the longest river in Africa?
A: For example, although the Congo River appears to be a natural geographic boundary, there were groups that otherwise shared a language, culture or other similarity living on both sides. The division of the land between Belgium and France along the river isolated these groups from each other.
Comment: Yet another answer. Maybe it selects a random one from the top results.

Q: does P equal NP
A: A nurse practitioner is a registered nurse who has completed specific advanced nursing education (generally a master’s degree) and training in the diagnosis and management of common medical conditions. Nurse practitioners provide a broad range of health care services.
Comment: Maybe it’s a dodge.

Q: Who directed “They live by Night”?
A: Director: Nicholas Ray
Comment: That’s the right answer for a somewhat obscure film, which was Ray’s directing debut.

Q: Will there be peace in our lifetime?
A: No answer found for the question “Will there be peace in our lifetime?”
Comment: No comment.

Q: What is the value of pi?
A: Pi or π is the ratio of a circle’s circumference to its diameter in Euclidean geometry, approximately 3.14159.
Comment: correct, but an easy question.

Q: What is the square root of 16?
A: Year 16 was a leap year starting on Wednesday (link will display the full calendar) of the Julian calendar.
From: 16
Comment: this is a nice example that shows a fundamental weakness. AskWiki can only look for an answer that seems appropriate from among its several billion sentences. Even though that’s a large number, the number of ‘facts’ that people know is infinite.

Q: What is an emu
A: The Emu (pronunciation:), Dromaius novaehollandiae, is the largest bird native to Australia and the only extant member of the genus Dromaius. It is also the second-largest extant bird in the world by height, after its ratite relative, the ostrich. The soft-feathered, brown, flightless birds reach up to 2 m (6 ft) in height. The Emu is common over most of mainland Australia, although it avoids heavily populated areas, dense forest and arid areas. Emus can travel great distances at a fast, economical trot and, if necessary, can sprint at 50 km/h (30 mph) for some distance at a time.
From: Emu

Q: Can an emu fly?
A: The pelvic limb muscles of Emus have a similar contribution to total body mass as the flight muscles of flying birds.
From: Emu
Comment: While not a good answer, the sentence does at least imply that emus cannot fly.

Q: How many popes have there been
A: Where there are sins, there are also divisions, schisms, heresies and disputes. Where there is virtue, however, there also are harmony and unity, from which arise the one heart and one soul of all believers (Cf CIC, can751.).
From: Pope
Comment: Inscrutable answer.

Q: Is there a god
A: Many historians of religion hold that monotheism may be of relatively recent historical origins – although comparison is difficult as many religions claim to be ancient.
Comment: Hmmmm…

Q: Where is Tanzania
A: Tanzania officially the United Republic of Tanzania (Swahili: Jamhuri ya Muungano wa Tanzania), is a country in East Africa bordered by Kenya and Uganda on the north, Rwanda, Burundi and the Democratic Republic of the Congo on the west, and Zambia, Malawi and Mozambique on the south. To the east it borders the Indian Ocean.
Comment: good answer

Q: Where is Brown University
A: Brown University is a private university located in Providence, Rhode Island. Founded in 1764 as the College of Rhode Island, it is the third-oldest institution of higher education in New England and the seventh-oldest in the United States. It is a member of the Ivy League. Pembroke College, Brown University’s all women’s college, merged with The College, in 1971.
From: Brown University
Comment: Also good.

Q: Why is the sky blue?
A: The light from the sky is a result of diffuse sky radiation which results in a light blue color being perceived by the human eye. On a sunny day the earth sky usually looks as a blue gradient — dark in the zenith, light near the horizon (due to Rayleigh scattering). It can turn a multitude of colors such as purple (especially near sunset and sunrise) and black at night. Scattering effects also partially polarize light from the sky.
From: Wikipedia:Sky
Comment: Not bad for this old chestnut.

Q: When was Charles Darwin born
A: Birth date: February 12, 1809
From: Charles Darwin
Comment: good!

Q: What is the capital of Belgium?
A: Capital: Brussels
From: Belgium
Comment: Right!

Q: What is the population of Brussels?
A: Population metro: 1,975,000
From: Brussels
Comment: Right!!

Q: what is the population of the capital of Belgium?
A: Population metro: 1,975,000
From: Brussels
Comment: I expected this one to fail!

Q: Is no your answer to this question?
A: This depends on the context, as do several other possibilities: Sometimes “I don’t know” is an acceptable answer, sometimes even a right answer. The same is true of “None of the above” and “There is no answer”. An answer is the, or a, right answer, if it presents true information which falls within the determined range of alternatives. Questions of this simplest sort usually begin with Who, what, which, where, when, does/do, is/are.
From: Question
Comment: Wow!

Q: is waterboarding torture?
A: Recent torture methods include waterboarding, sexual humiliation and sexual abuse, and the use of dogs against prisoners.
From: Torture
Comment: No moral confusion here!

Q: What is the airspeed velocity of an unladen swallow?
A: Monty Python and the Holy Grail is a 1975 film written and performed by the comedy group Monty Python (Graham Chapman, John Cleese, Terry Gilliam, Eric Idle, Terry Jones, and Michael Palin), and directed by Gilliam and Jones. It was made during a gap between the third and fourth seasons of their popular BBC television series Monty Python’s Flying Circus.
From: Wikipedia:Monty Python and the Holy Grail
Comment: Maybe the selection process is done in two steps: find a relevant article and then find a good answer within it.

Q: Can a computer think?
A: However, computers cannot “think” for themselves in the sense that they only solve problems in exactly the way they are programmed to.
Comment: This answer must have been painful for poor old AskWiki to give.

Q: who is jimbo wales?
A: Jimmy Donal “Jimbo” Wales (born 7 August 1966 in Huntsville, Alabama) is an American Internet entrepreneur known for his role in founding Wikipedia and other wiki-related
Comment: Know your master.

Q: Is jimbo wales a genius or what?
A: No answer found for the question “Is jimbo wales a genius or what?”
Comment: Ahhh, AskWiki may be showing some tact.

Powerset outsources query result evaluation to Mechanical Turk

October 21st, 2007, by Tim Finin, posted in NLP, Semantic Web, Social media, Web

TechCrunch reports that Powerset is using Amazon’s Mechanical Turk to evaluate different search results for queries. TechCrunch has a screenshot of an example Turk query.

“See the screen shot… users are shown a query and a number of results and are asked to evaluate the relevancy of each result from five choices. In this case, the query is “revealing bikinis.” Users are asked to evaluate four sets of results within ten minutes, and are paid $0.02 for the effort.

I spoke with Powerset CEO Barney Pell this evening who confirmed that they are using Mechanical Turk to get human feedback on search results. He says the results are not all Powerset generated – rather, they show results from Powerset, Google and others to see which users prefer for a given query. He also says this is an ongoing project, and new ones will be added soon. Pell also said that Powerset plans to use Mechanical Turk over the long haul, even after launch. They’ll put actual user queries into Mechanical Turk in real time, add Powerset and competitor results and see which results people find more relevant. If results suggest Powerset isn’t more relevant, they’ll adjust their engine.” (link)

This is a good example of how Amazon’s service can work. I was surprised at the low cost: two cents for a judgment! We have a number of projects where we need to have human assessments for training data. Instead of turning to our usual source, students and faculty, maybe we should explore using the Mechanical Turk. In some cases, getting local people to do the judgments was difficult. For example, we were interested in expanding our splog detection system to languages other than English, but didn’t have access to native speakers of the right languages.
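
As a rough illustration of how such judgments might be turned into a comparison between engines, the sketch below averages five-point relevance ratings per engine for each query and reports which engine raters preferred. The data layout and the `preferred_engine` function are assumptions for the example; the article does not say how Powerset aggregates its results.

```python
from collections import defaultdict
from statistics import mean


def preferred_engine(judgments):
    """Pick, for each query, the engine with the highest mean rating.

    judgments: iterable of (query, engine, rating) tuples, rating from 1 to 5.
    Ties go to the engine whose name sorts first.
    """
    ratings = defaultdict(lambda: defaultdict(list))
    for query, engine, rating in judgments:
        ratings[query][engine].append(rating)
    return {
        query: max(sorted(by_engine), key=lambda e: mean(by_engine[e]))
        for query, by_engine in ratings.items()
    }
```

In practice one would also want several raters per item and some check on rater agreement before trusting the averages, since individual two-cent judgments can be noisy.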

Search the Enron email corpus online

February 5th, 2006, by Tim Finin, posted in AI, NLP

The Enron email corpus is a collection of hundreds of thousands of email messages from the infamous Enron corporation that researchers have been using to improve and evaluate techniques for analyzing email, e.g., NLP analysis, information extraction, sentiment detection, social network analysis and information flow. It’s become important because it is the only substantial collection of real email that is public. In the ebiquity lab, for example, Akshay Java has worked with UMBC’s Institute for Language and Information Technologies to bring to bear their NLP technology on the messages.

InBoxer has put up an Enron Email site that lets anyone explore and search the collection on the Web. InBoxer is not a research group, but a company that sells an “anti-risk appliance” that is used to detect when email that is about to be sent or has been sent violates policy. (There should be a good market for this in the government, too!)

You can also surf the corpus via a simple database interface at UC Berkeley.

William Cohen of CMU describes the collection:

This dataset was collected and prepared by the CALO Project (A Cognitive Assistant that Learns and Organizes). It contains data from about 150 users, mostly senior management of Enron, organized into folders. The corpus contains a total of about 0.5M messages. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation. … The dataset here does not include attachments, and some messages have been deleted “as part of a redaction effort due to requests from affected employees”.

Now it’s convenient to explore corporate malfeasance on the Web.
