Archive for the 'NLP' Category
July 25th, 2009, by Tim Finin, posted in AI, NLP, Semantic Web
John Markoff has an article in tomorrow’s New York Times, Scientists Worry Machines May Outsmart Man, on a recent AAAI study on the future of AI.
“A robot that can open doors and find electrical outlets to recharge itself. Computer viruses that no one can stop. Predator drones, which, though still controlled remotely by humans, come close to a machine that can kill autonomously. Impressed and alarmed by advances in artificial intelligence, a group of computer scientists is debating whether there should be limits on research that might lead to loss of human control over computer-based systems that carry a growing share of society’s workload, from waging war to chatting with customers on the phone.”
The study was commissioned by AAAI “to explore and address potential long-term societal influences of AI research and development”. Look for a report published by AAAI later this year. The study involved twenty-five participants who were divided into three subgroups: on concerns, control and guidelines, the nature and timing of disruptive advances, and ethical and legal issues.
There was a panel session earlier this month at IJCAI where some of the study participants discussed highlights from the study. Hopefully this was filmed and the results will be added to the videolectures.net IJCAI09 collection.
While I am generally skeptical of an impending technological singularity, which seems to sum up many of the concerns some have, there are aspects of the future that I do wonder about. At the top of my list is what will happen when virtually all of human knowledge is published on the Web (as it nearly is now) in a form that machines can understand. I’m pretty sure that this will happen in the next decade or two, either through the current Semantic Web approach (as a web of data) or by gradually improving techniques for machine understanding of human languages and images.
April 6th, 2009, by Tim Finin, posted in NLP, Web
Not only do you have to choose the titles of your papers, posts and web pages well, but their first two words should be chosen to carry the message. Jakob Nielsen reports on UI research showing that the first 11 characters of links and headlines are important in forming some idea of what the item is about.
First 2 Words: A Signal for the Scanning Eye
“Our newest usability study … tests how well users understand the first 11 characters of a website’s links and headlines. For example, we’d represent this article by the “First 2 Wor” string. … Why test text that’s so severely truncated? Because online reading is often dominated by the F-pattern. That is, people read the first few listed items somewhat thoroughly — thus the cross-bars of the “F” — but read less and less as they continue down the list, eventually passing their eyes down the text’s left side in a fairly straight line. At this point, users see only the very beginning of the items in a list. …”
Nielsen calls the initial few words in a title “nano-content”. While it’s hard to pack some ideas into 11 characters, it sounds like a good goal.
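To get a feel for how little 11 characters is, here is a minimal Python sketch that truncates a headline the way Nielsen's study describes. The function name nano_content and the second sample headline are my own choices for illustration, not anything from the study.

```python
def nano_content(headline: str, width: int = 11) -> str:
    """Return the first `width` characters of a headline.

    This approximates what a reader scanning down the left edge of a
    list sees; the default of 11 is the figure quoted from Nielsen.
    """
    return headline[:width]


# The first headline is Nielsen's own; the second is a made-up example.
headlines = [
    "First 2 Words: A Signal for the Scanning Eye",
    "Some thoughts on choosing titles for blog posts",
]

for h in headlines:
    print(repr(nano_content(h)))
```

Running a list of your own titles through something like this is a quick way to see which ones still say anything once truncated.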
Choosing the words for a link or title carefully is key to influencing search engines, which give these words higher weight when indexing the associated content. Search engines don’t scan like humans, but putting the most relevant words early in the string still helps when a person is shown a list of results.
March 11th, 2009, by Tim Finin, posted in AI, Datamining, Google, NLP, Semantic Web
There’s been a lot of interest in Wolfram Alpha in the past week, starting with a blog post from Stephen Wolfram, Wolfram|Alpha Is Coming!, in which he described his approach to building a system that integrates vast amounts of knowledge and then tries to answer free-form questions posed to it by people. His post lays out his approach, which does not involve extracting data from online text.
“A lot of it is now on the web—in billions of pages of text. And with search engines, we can very efficiently search for specific terms and phrases in that text. But we can’t compute from that. And in effect, we can only answer questions that have been literally asked before. We can look things up, but we can’t figure anything new out.
So how can we deal with that? Well, some people have thought the way forward must be to somehow automatically understand the natural language that exists on the web. Perhaps getting the web semantically tagged to make that easier.
But armed with Mathematica and NKS I realized there’s another way: explicitly implement methods and models, as algorithms, and explicitly curate all data so that it is immediately computable.”
Nova Spivack took a look at Wolfram Alpha last week and thought that it could be “as important as Google”.
In a nutshell, Wolfram and his team have built what he calls a “computational knowledge engine” for the Web. OK, so what does that really mean? Basically it means that you can ask it factual questions and it computes answers for you.
It doesn’t simply return documents that (might) contain the answers, like Google does, and it isn’t just a giant database of knowledge, like the Wikipedia. It doesn’t simply parse natural language and then use that to retrieve documents, like Powerset, for example.
Instead, Wolfram Alpha actually computes the answers to a wide range of questions — like questions that have factual answers such as “What is the location of Timbuktu?” or “How many protons are in a hydrogen atom?,” “What was the average rainfall in Boston last year?,” “What is the 307th digit of Pi?,” “where is the ISS?” or “When was GOOG worth more than $300?”
Doug Lenat also had a chance to preview Wolfram Alpha and came away impressed:
“Stephen Wolfram generously gave me a two-hour demo of Wolfram Alpha last evening, and I was quite positively impressed. As he said, it’s not AI, and not aiming to be, so it shouldn’t be measured by contrasting it with HAL or Cyc but with Google or Yahoo.”
Doug’s review does a good job of sketching the differences he sees between Wolfram Alpha and systems like Google and Cyc.
Lenat’s description makes Wolfram Alpha sound like a variation on the Semantic Web vision, but one that is more like a giant closed database than a distributed Web of data. The system is set to launch in May 2009 and I’m eager to give it a try.
February 24th, 2009, by Tim Finin, posted in Mobile Computing, NLP, Semantic Web, Social media, Twitter, Web 2.0
WindyCitizen.com is “a crowd-powered front page for the Windy City” that “brings Chicagoans the best of the local web by letting them share, rate and discuss their favorite local news, photos, videos and more.”

Their Windy City Twitter Tracker mashup uses Open Calais as a named entity recognizer to track Tweets about candidates in the special election to fill the US House seat for Chicago’s 5th district that Rahm Emanuel vacated. Calais might be overkill for this, since there is a small set of known candidates, but it’s an impressive semantic mashup nonetheless.
“We’re searching Twitter constantly to keep you up to date with the conversation about the IL-5 special election. The graph above lets you track buzz about the candidates over the last two weeks.”
The Windy City Twitter Tracker is probably written to be easily repurposed, judging from the Web site, which describes it as currently tracking the “Race for the 5th”. The mashup is credited to Whattech.
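To illustrate why a full named entity recognizer might be overkill when the candidates are known in advance, here is a minimal Python sketch of the simpler alternative: matching a fixed list of names against each tweet. This is not how the Windy City Twitter Tracker works. The candidate names, the aliases and the sample tweets are all hypothetical placeholders.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List


def count_mentions(tweets: Iterable[str], aliases: Dict[str, List[str]]) -> Counter:
    """Count how many tweets mention each candidate.

    `aliases` maps a candidate's display name to the strings that count
    as a mention. Matching is case-insensitive and on whole words only.
    """
    patterns = {
        name: re.compile(
            r"\b(?:" + "|".join(re.escape(a) for a in names) + r")\b",
            re.IGNORECASE,
        )
        for name, names in aliases.items()
    }
    counts: Counter = Counter()
    for tweet in tweets:
        for name, pattern in patterns.items():
            if pattern.search(tweet):
                counts[name] += 1  # one count per tweet, not per occurrence
    return counts


# Hypothetical candidates and tweets, for illustration only.
aliases = {
    "Jane Doe": ["Jane Doe", "Doe"],
    "Richard Roe": ["Richard Roe", "Roe"],
}
tweets = [
    "Going to vote for Doe tomorrow",
    "jane doe and Roe both spoke at the forum",
    "Nothing about the election here",
]
print(count_mentions(tweets, aliases))
```

A list like this breaks down as soon as a surname is also a common word or is shared with someone else, which is where an entity recognizer such as Calais starts to earn its keep.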
January 30th, 2009, by Tim Finin, posted in NLP
Next week the JHU Center for Language and Speech Processing will host a talk by Martin Kay of Stanford University, When is a Translation not a Translation? at 4:30pm Tuesday, 3 February 2009. From the announcement:
“A translation is generally taken to be a text that expresses the same meaning as another text in a different language. But the products of the best translators reflects a different, if more illusive, goal. I will seek a somewhat more adequate characterization of translation as it is actually practiced and discuss its consequences for machine translation.
Martin Kay is a professor of linguistics and computer science at Stanford University. For many years, he was also a research fellow at the Xerox Palo Alto Research Center. He made a number of fundamental contributions to computational linguistics, including chart parsing, unification grammar, and applications of finite-state technology, notably in phonology. He has been an intermittent worker on, and skeptical observer of, machine translation since 1958.”
For a preview of what he will probably talk about, you might look at a paper on Professor Kay’s web site that he describes as “some unfinished musings on the nature of translation”.
This is a chance to hear someone who has made many important contributions to several areas of computational linguistics and computer science over a long career.
January 27th, 2009, by Tim Finin, posted in NLP, Semantic Web, Social media, Wikipedia
This year’s Text Analysis Conference (TAC) has an interesting track focused on processing text to populate Wikipedia infoboxes, both for existing entities with missing values as well as newly discovered entities.
TAC is run by the US National Institute of Standards and Technology (NIST) to encourage research in natural language processing and related applications. As in the NIST-sponsored MUC, TREC and ACE workshops, this is done by providing a large test collection, common evaluation procedures, and a forum for organizations to share their results. The first TAC was held in 2008 and included 65 teams from 20 countries who participated in three tracks: question answering, summarization and recognizing textual entailments.
TAC 2009 will include a new track on Knowledge Base Population coordinated by Paul McNamee of the Johns Hopkins University Human Language Technology Center of Excellence.
“The goal of the new Knowledge Base Population track is to augment an existing knowledge representation with information about entities that is discovered from a collection of documents. A snapshot of Wikipedia infoboxes will be used as the original knowledge source, and participants will be expected to fill in empty slots for entities that do exist, add missing entities and their learnable attributes, and provide links between entities and references to text supporting extracted information. The KBP task lies at the intersection of Question Answering and Information Extraction and is expected to be of particular interest to groups that have participated in ACE or TREC QA.”
This is an exciting task and doing well in it will require a mixture of language processing, knowledge-based processing and (probably) machine learning.
The TAC 2009 workshop will be co-located with TREC and held 16-17 November in Gaithersburg, MD. If you are interested in participating, you should register by March 3.
January 7th, 2009, by Tim Finin, posted in AI, NLP, Semantic Web, UMBC
If you are a high school or middle school student who is interested in computers and also in languages, you should consider participating in the 2009 North American Computational Linguistics Olympiad (NACLO). This might be the first step on a path that could lead to your helping to create the next Google!
NACLO is a competition for middle-school and high-school students focused on solving problems involving linguistics and computational linguistics. Working the problems only requires keen analytical ability and good problem-solving skills; no prior background in linguistics, foreign languages or computer science is required.
NACLO consists of two rounds: an initial round on February 4 open to all students and a subsequent invitational round on March 11 for contestants who have advanced from the first. Winners of the second round will be invited to participate in the International Linguistics Olympiad. Last year, two US teams went to Bulgaria to compete in the sixth International Linguistics Olympiad and won gold medals in individual and team events.
Support for NACLO is provided by Google, the Association for Computational Linguistics, and the National Science Foundation, which said in an August press release:
“Aside from being a fun intellectual challenge, the Olympiad mimics the skills used by researchers and scholars in the field of computational linguistics, which is increasingly important for the United States and other countries. Using computational linguistics, these experts can develop automated technologies such as translation software that cut down on the time and training needed to work with other languages, or software that automatically produces informative English summaries of documents in other languages or answer questions about information in these documents. In an increasingly global economy where businesses operate across borders and languages, having a strong pool of computational linguists is a competitive advantage. With threats emerging from different parts of the world, developing computational linguistics skills has also been identified as vital to national defense in the 21st century.”
Students can participate at the NACLO site at UMBC, which is sponsored by the UMBC Institute for Language in Information Technology. Check out their poster and sample problem. If you like this kind of puzzle, sign up to be part of this exciting competition.
Students should register online by January 20. Late registrations may be accepted up to February 3 if space is available. The UMBC NACLO event will take place on Wednesday February 4 in room 312 of the University Center. For more information, contact one of the local organizers: Professors Marjorie McShane (marge@umbc.edu), Sergei Nirenburg (sergei@umbc.edu) and Margaret A. Russell (margaret.a.russell@gmail.com).
December 26th, 2008, by Tim Finin, posted in NLP
Yongmei Shi defended her PhD dissertation earlier this fall on using syntactic and semantic information to detect errors in spoken language systems under the direction of Dr. R. Scott Cost (JHU/APL) and Professor Lina Zhou (UMBC). Her dissertation has been submitted and is now available online.
Yongmei Shi, An Investigation of Linguistic Information for Speech Recognition Error Detection, Ph.D. Dissertation, Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore County, October 2008.
After several decades of effort, significant progress has been made in the area of speech recognition technologies, and various speech-based applications have been developed. However, current speech recognition systems still generate erroneous output, which hinders the wide adoption of speech applications. Given that the goal of error-free output can not be realized in near future, mechanisms for automatically detecting and even correcting speech recognition errors may prove useful for amending imperfect speech recognition systems. This dissertation research focuses on the automatic detection of speech recognition errors for monologue applications, and in particular, dictation applications.
Due to computational complexity and efficiency concerns, limited linguistic information is embedded in speech recognition systems. Furthermore, when identifying speech recognition errors, humans always apply linguistic knowledge to complete the task. This dissertation therefore investigates the effect of linguistic information on automatic error detection by applying two levels of linguistic analysis, specifically syntactic analysis and semantic analysis, to the post processing of speech recognition output. Experiments are conducted on two dictation corpora which differ in both topic and style (daily office communication by students and Wall Street Journal news by journalists).
To catch grammatical abnormalities possibly caused by speech recognition errors, two sets of syntactic features, linkage information and word associations based on syntactic dependency, are extracted for each word from the output of two lexicalized robust syntactic parsers respectively. Confidence measures, which combine features using Support Vector Machines, are used to detect speech recognition errors. A confidence measure that combines syntactic features with non-linguistic features yields consistent performance improvement in one or more aspects over those obtained by using non-linguistic features alone.
Semantic abnormalities possibly caused by speech recognition errors are caught by the analysis of semantic relatedness of a word to its context. Two different methods are used to integrate semantic analysis with syntactic analysis. One approach addresses the problem by extracting features for each word from its relations to other words. To this end, various WordNet-based measures and different context lengths are examined. The addition of semantic features in confidence measures can further yield small but consistent improvement in error detection performance. The other approach applies lexical cohesion analysis by taking both reiteration and collocation relationships into consideration and by augmenting words with probability predicted from syntactic analysis. Two WordNet-based measures and one measure based on Latent Semantic Analysis are used to instantiate lexical cohesion relationships. Additionally, various word probability thresholds and cosine similarity thresholds are examined. The incorporation of lexical cohesion analysis is superior to the use of syntactic analysis alone. In summary, the use of linguistic information as described, including syntactic and semantic information, can provide positive impact on automatic detection of speech recognition errors.
November 20th, 2008, by Tim Finin, posted in NLP
Paul McNamee will defend his dissertation on Textual Representations for Corpus-Based Bilingual Retrieval at 9:00am Monday 24 November 2008 in ITE 325B. His mentor is Charles Nicholas and the dissertation committee includes Tim Finin, James Mayfield (JHU), Sergei Nirenburg and Doug Oard (UMCP). Here is the abstract.
The traditional approach to information retrieval is based on using words as the indexing and search terms for documents. One part of this research investigates alternative methods for representing text, including a method based on overlapping sequences of characters called n-gram tokenization. N-grams are studied in depth and one notable finding is that they achieve a 20% improvement in retrieval effectiveness over words in certain situations.
The other focus of this research is improving retrieval performance when foreign language documents must be searched and translation is required. In this scenario bilingual dictionaries are often used to translate user queries; however even among the most commonly spoken languages, for which large bilingual lexicons exist, dictionary-based translation suffers from several significant problems. These include: difficulty handling proper names, which are often missing; issues related to morphological variation since entries, or query terms, may not be lemmatized; and, an inability to robustly handle multiword phrases, especially non-compositional expressions. These problems can be addressed when translation is accomplished using parallel collections, sets of documents available in more than one language. Using parallel texts enables statistical translation of character n-grams rather than words or stemmed words, and with this technique highly effective bilingual retrieval performance is obtained. Translation of multiword expressions is also explored.
In this dissertation I present an overview of the field of cross-language information retrieval and then introduce the foundational concepts in n-gram tokenization and corpus-based translation. Then monolingual and bilingual experiments on test sets in 13 languages are described. Analysis of these experiments gives insight into: the relative efficacy of various tokenization methods; reasons why n-grams are effective; the utility of automated relevance feedback, in both monolingual and bilingual contexts; the interplay between tokenization and translation; and, how translation resource selection and size influence bilingual retrieval.
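As a rough illustration of the n-gram tokenization the abstract describes, here is a minimal Python sketch that breaks text into overlapping sequences of characters. It is only a sketch of the general idea, not the dissertation's implementation. The function name char_ngrams, the default length of 4, and the choices to lowercase, collapse whitespace and let n-grams span word boundaries are all my assumptions.

```python
from typing import List


def char_ngrams(text: str, n: int = 4) -> List[str]:
    """Split text into overlapping character n-grams.

    The text is lowercased and runs of whitespace are collapsed to one
    space first (an assumption made here for simplicity). N-grams are
    allowed to span word boundaries.
    """
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Too short for even one n-gram: keep the text itself as a token.
        return [normalized] if normalized else []
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


print(char_ngrams("retrieval"))
# ['retr', 'etri', 'trie', 'riev', 'ieva', 'eval']
```

Because the indexing terms are character sequences rather than whole words, morphological variants such as "retrieve" and "retrieval" share several terms ("retr", "etri", "trie", "riev") without any stemming or lemmatization.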
October 13th, 2008, by Tim Finin, posted in AI, GENERAL, NLP

None of the six bots that made the Loebner Prize Competition finals won the prize, but Fred Roberts’ Elbot was declared the best of the lot, winning a bronze medal. Only five of the bots managed to start.
Apparently the sixth was busy elsewhere, rumored to be furiously buying and selling Credit Default Swaps on the weekend market.
The Guardian reports that
“Elbot emerged as the winner, after scooping a 25% success rate at convincing the judges that it was actually human. That’s not enough to please the ghost of Turing, but it was enough to pick up Elbot’s owner, Fred Roberts, a cash prize. Fred’s invention had a few tricks up his sleeve, including trying to the judges off their game by explicitly referring to itself as a machine.
“Hi. How’s it going?” one judge began.
“I feel terrible today,” Elbot replied. “This morning I made a mistake and poured milk over my breakfast instead of oil, and it rusted before I could eat it.”
The BBC has a video on the competition.
October 11th, 2008, by Tim Finin, posted in NLP, Semantic Web
NLTK looks very useful.
“NLTK — the Natural Language Toolkit — is a suite of open source Python modules, data and documentation for research and development in natural language processing. NLTK contains Code supporting dozens of NLP tasks, along with 40 popular Corpora and extensive Documentation including a 375-page online Book. Distributions for Windows, Mac OSX and Linux are available.”
The development of NLTK is led by Steven Bird, Edward Loper, and Ewan Klein.
(Spotted on Language Log.)
September 27th, 2008, by Tim Finin, posted in NLP, Ontologies, Semantic Web
Evri is another entry into the ‘semantic search’ space and has recently opened up a beta site with the slogan Search less, understand more. Evri is a startup launched by Vulcan Inc, a company founded by Paul Allen in 1986 as a private investment and R&D firm.
Here’s part of how Evri describes itself in its FAQ.
“What is Evri doing? Evri is creating a map of connections between people, places, and things on the web. You’ll use this map to find the things you’re interested in. Instead of searching by keywords and looking for relevant results, Evri will lead you to other relevant articles, images, and video based on what you’re reading.
…
Where does Evri get its information? We search the World Wide Web and gather content from as many highly regarded information sources as we can find, and we’re adding more sources all the time.”
Saying that Evri does ‘semantic search’ is not quite right. Their initial focus is on providing widgets for blogs and other web sites that use the text on the page to recommend links to other, related information.
Evri appears to have developed an underlying ontology that is used to organize their knowledge of “people, products and things”, capturing both a type taxonomy and relations. Some of this is revealed in the beta**2 part of their site, Evri’s Garden. There is also a query system over their knowledge base that supports complex search queries.
The current push, though, seems to be to get bloggers to add an Evri widget to their blogs that will pop up a window with links to related articles and information.
This is an interesting development that is worth watching.