Archive for the 'AI' Category
February 23rd, 2011, by Tim Finin, posted in Datamining, NLP, Semantic Web, Social media
The Fifth International AAAI Conference on Weblogs and Social Media is holding a new data challenge using a new dataset that includes about three TB of social media data collected by Spinn3r between January 13 and February 14, 2011.
The dataset consists of over 386M blog posts, news articles, classifieds, forum posts and social media content in a month including events such as the Tunisian revolution and the Egyptian protests. The content includes the syndicated text, its original HTML as found on the web, annotations and metadata (e.g., author information, time of publication and source URL), and boilerplate/chrome extracted content. The data is formatted as Spinn3r’s protostreams – an extension to Google protobuffers. It is also broken down by date, content type and language making it easy to work with selected data.
See the ICWSM Data Challenge pages for more information on the challenge task, its associated ICWSM workshop and procedures for data access.
February 22nd, 2011, by Tim Finin, posted in AI, Machine Learning, Semantic Web

IBM’s Watson’s performance in last week’s Jeopardy Challenge was an amazing accomplishment and a demonstration of how our computer systems are becoming more intelligent and capable of solving difficult tasks.
But I wonder whether the way the questions were given to the human players and Watson gave Watson a short but significant head start. According to the New York Times:
“During the sparring matches, Watson received the questions as electronic texts at the same moment they were made visible to the human players;”
Once Watson received a query, it could process it immediately. While the human contestants got to see the query as written text at the same time, Alex Trebek also started reading the question aloud. When I was watching Jeopardy, I found it almost impossible to read and understand the question more quickly than it was being spoken, and I suspect that Ken Jennings and Brad Rutter might have as well. It’s often observed that people find it very difficult to simultaneously process two language streams. While it took Trebek only a second or two to read the short Jeopardy queries, that could have given Watson a significant head start, enabling it to determine that it had a good answer and press its buzzer before the competition.
If this is the case, I am not sure whether it is an unfair advantage. People and computers each have native advantages and disadvantages. If Jennings and Rutter had received the questions as text without them being simultaneously read aloud, Watson might still have had the advantage of a quicker start.
February 13th, 2011, by Tim Finin, posted in AI, Datamining, Machine Learning, NLP, Semantic Web
On the eve of the big Jeopardy! match, Peter Norvig’s opinion piece in the New York Post (!) today, The Machine Age, looks at AI’s progress over the past sixty years and lays out six surprising lessons we’ve learned.
- The things we thought were hard turned out to be easier.
- Dealing with uncertainty turned out to be more important than thinking with logical precision.
- Learning turned out to be more important than knowing.
- Current systems are more likely to be built from examples than from logical rules.
- The focus shifted from replacing humans to augmenting them.
- The partnership between human and machine is stronger than either one alone.
When I took Pat Winston’s undergraduate AI class in 1970, only the first of those ideas was current. It’s a good essay.
Of course, after we’ve exploited the new data-driven, statistical paradigm for the next decade or so, we’ll probably have to go back to figuring out how to get logic back into the framework.
February 12th, 2011, by Tim Finin, posted in Machine Learning, Semantic Web, Social media
The current (11 February 2011) issue of Science is a special issue on Dealing with Data. It includes a collection of free, online articles that “highlights both the challenges posed by the data deluge and the opportunities that can be realized if we can better organize and access the data.” Some of the articles are drawn from three sister publications: Science Signaling, Science Translational Medicine and Science Careers.
From the issue’s introduction:

“Scientific innovation has been called on to spur economic recovery; science and technology are essential to improving public health and welfare and to inform sustainability; and the scientific community has been criticized for not being sufficiently accountable and transparent. Data collection, curation, and access are central to all of these issues.
…
As you will discover, two themes appear repeatedly: Most scientific disciplines are finding the data deluge to be extremely challenging, and tremendous opportunities can be realized if we can better organize and access the data.”
One of the great things about the “data deluge” is that there is something in it for almost all computer science researchers including areas like machine learning, data mining, NLP, visualization, semantic web, security and privacy, social media, high performance computing, HCI, etc. Here are some of the articles that caught our eye:
and still more that look very interesting:
- Climate Data Challenges in the 21st Century, J. T. Overpeck et al.
- Challenges and Opportunities of Open Data in Ecology, O. J. Reichman et al.
- Challenges and Opportunities in Mining Neuroscience Data, H. Akil et al.
- The Disappearing Third Dimension, T. Rowe and L. R. Frank
- Advancing Global Health Research Through Digital Technology and Sharing Data, T. Lang
- More Is Less: Signal Processing and the Data Deluge, R. G. Baraniuk
- Access to Stem Cells and Data: Persons, Property Rights, and Scientific Progress, D. J. H. Mathews et al.
- On the Future of Genomic Data, S. D. Kahn
- Conquering the Data Mountain, N. R. Gough and M. B. Yaffe
- Power to the People: Participant Ownership of Clinical Trial Data, S. F. Terry and P. F. Terry
- Surfing the Tsunami, E. Pain
- Sharing Data in Biomedical and Clinical Research, K. Travis
February 2nd, 2011, by Tim Finin, posted in AI, UMBC
UMBC will host the 2011 FIRST Lego League Maryland State Championship on Saturday February 26 in the UMBC Retriever Activities Center.
FIRST Lego League (FLL) is an international competition for elementary and middle school students that is run by the FIRST organization with support from Lego. FLL teams use Lego Mindstorms kits to build small autonomous robots, with a limited number of sensors and motors, that compete to perform predefined challenge tasks.
"Guided by adult mentors and their own imaginations, FLL students solve real-world engineering challenges, develop important life skills, and learn to make positive contributions to society. FLL provides students age 9-14 with an opportunity to challenge their math and science skills in an internationally recognized competitive environment. FLL combines a hands-on, interactive robotics program with a sports-like atmosphere. Teams of up to 10 players focus on team building, problem solving, creativity, and analytical thinking to develop a well thought out solution to a problem currently facing the world – the Challenge."
The UMBC organizers, led by UMBC Mechanical Engineering Professor Anne Spence, need volunteers from the UMBC community to help on the tournament day as well as to help set up on Friday. If you are interested in helping, please register online. Volunteering to help in the Maryland FLL championship is a great way to help engage young people in science and technology and have some fun doing it.
December 7th, 2010, by Krishnamurthy Viswanathan, posted in Machine Learning
The Naive Bayes classifier is one of the most versatile machine learning algorithms that I have seen during my meager experience as a graduate student, and I wanted to do a toy implementation for fun. At its core, the implementation reduces to a form of counting, and the entire Python module, including a test harness, took only 50 lines of code. I haven’t really evaluated the performance, so I welcome any comments. I am a Python amateur, and am sure that experienced Python hackers can trim a few rough edges off this code.
Intuition and Design
Here is a definition of the classifier functionality (from Wikipedia):
classify(f1, ..., fn) = argmax over c of p(C = c) * p(F1 = f1 | C = c) * ... * p(Fn = fn | C = c)
This means that, for each possible class label, we multiply together the conditional probabilities of each feature given that label. So to implement the classifier, all we need to do is compute these individual conditional probabilities p(Fi | Cj) for each label and each feature, and multiply them together with the prior probability p(Cj) for that label. The label for which we get the largest product is the label returned by the classifier.
In order to compute these individual conditional probabilities, we use maximum likelihood estimation. In short, we approximate these probabilities using the counts from the input/training vectors.
Hence we have: p(Fi | Cj) = count( Fi ^ Cj) / count(Cj)
That is, from the training corpus we compute the ratio of the number of times the feature Fi and the label Cj occur together to the total number of occurrences of the label Cj.
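As a rough illustration of the counting estimate above, here is a minimal sketch in Python. The toy training rows and the helper name `mle` are made up for this example and are not from the original module.

```python
from collections import Counter

# Hypothetical toy training set: each row is (feature values..., class label).
training = [
    ("sunny", "hot", "no"),
    ("sunny", "mild", "no"),
    ("rainy", "mild", "yes"),
    ("rainy", "hot", "yes"),
    ("sunny", "mild", "yes"),
]

# count(Cj): how often each label occurs.
label_counts = Counter(row[-1] for row in training)

# count(Fi ^ Cj): how often each (label, feature index, value) occurs together.
pair_counts = Counter(
    (row[-1], i, value) for row in training for i, value in enumerate(row[:-1])
)

def mle(feature_index, value, label):
    """Unsmoothed estimate p(Fi = value | Cj = label) = count(Fi ^ Cj) / count(Cj)."""
    return pair_counts[(label, feature_index, value)] / label_counts[label]

print(mle(0, "sunny", "no"))   # both "no" rows are sunny
print(mle(0, "rainy", "no"))   # never seen together, so the estimate is zero
```

The second call shows the weakness of the plain estimate: a combination that never appears in training gets probability zero.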
Zero Probability Problem
What if we have never seen a particular feature Fa and a particular label Cb together in the training dataset? Whenever they occur in the test data, p(Fa | Cb) will be zero, and hence the overall product will also be zero. This is a problem with maximum likelihood estimates: just because a particular observation was not made during training does not mean that it will never occur in the test data. To remedy this, we use what is known as smoothing. The simplest kind of smoothing, which we use in this code, is called “add-one smoothing”. Essentially, the probability of an unseen event should be greater than zero. We achieve this by adding one to every count, so that zero counts become one. The net effect is that we redistribute some of the probability mass from the non-zero count observations to the zero-count observations. Hence, we also need to increase the total count for each label by the number of possible observations, in order to keep the total probability mass at 1.
For example, if we have two classes C = 0 and C = 1, then after smoothing, the smoothed MLE probabilities can be written as:
p-smoothed(Fi | Cj) = [count(Fi ^ Cj) + 1]/[count(Cj) + N] where N is the total number of observations across all features in the training corpus.
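A small sketch of the smoothed estimate may help. It assumes N is the number of possible values of the feature Fi, which is the usual add-one convention and may not be exactly how N is meant above. The function name and the toy counts are hypothetical.

```python
def smoothed_probability(pair_count, label_count, num_values):
    """Add-one estimate: [count(Fi ^ Cj) + 1] / [count(Cj) + N].

    Assumption: N (num_values) is the number of distinct values the
    feature can take, so the estimates for one feature sum to 1.
    """
    return (pair_count + 1) / (label_count + num_values)

# Hypothetical counts of an "outlook" feature among 2 training rows labeled "no".
counts_given_no = {"sunny": 2, "rainy": 0}
label_count = 2

probs = {
    value: smoothed_probability(count, label_count, len(counts_given_no))
    for value, count in counts_given_no.items()
}
print(probs)
print(sum(probs.values()))
```

With this choice of N, the unseen value gets a small non-zero probability and the probabilities over all values of the feature still add up to 1.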
Code
For simplicity, we will use Weka’s ARFF file format as input. We have a single class called Model which has a few dictionaries and lists to store the counts and feature vector details. In this implementation, we only deal with discrete valued features.
The dictionary ‘features’ saves all possible values for a feature. ‘featureNameList’ is simply a list that contains the names of the features in the same order that they appear in the ARFF file. This is because our features dictionary does not have any intrinsic order, and we need to maintain feature order explicitly. ‘featureCounts’ contains the actual counts for co-occurrence of each feature value with each label value. The keys for this dictionary are tuples of the form (class_label, feature_name, feature_value). Hence, if we have observed the feature F1 with the value ‘x’ for the label ‘yes’ fifteen times, then we will have the entry {(‘yes’, ‘F1’, ‘x’): 15} in the dictionary. Note that the default value for counts in this dictionary is 1 instead of 0. This is because we are smoothing the counts. The ‘featureVectors’ list contains all the input feature vectors from the ARFF file. The last feature in each vector is the class label itself, as is the convention with Weka ARFF files. Finally, ‘labelCounts’ stores the counts of the class labels themselves, i.e. how many times we saw the label Ci during training.
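The containers described above could be set up roughly as follows. This is only a sketch: the attribute names come from the description, but the constructor and the choice of `defaultdict` are assumptions, not the original code.

```python
from collections import defaultdict

class Model:
    """Containers for a discrete Naive Bayes model (assumed layout)."""

    def __init__(self):
        self.features = {}          # feature name -> list of possible values
        self.featureNameList = []   # feature names in ARFF order
        # (class_label, feature_name, feature_value) -> count.
        # The default of 1 rather than 0 gives add-one smoothing for free.
        self.featureCounts = defaultdict(lambda: 1)
        self.featureVectors = []    # rows from the data section; label is last
        self.labelCounts = defaultdict(int)  # class label -> count (assumed default 0)

model = Model()
for _ in range(3):
    model.featureCounts[("yes", "F1", "x")] += 1

print(model.featureCounts[("yes", "F1", "x")])  # three observations plus the default
print(model.featureCounts[("no", "F1", "x")])   # never observed: the default alone
```

Keying the counts by a tuple keeps everything in one flat dictionary, at the cost of having to track the feature order separately in the name list.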
We also have the following member functions in the Model class:
The GetValues method simply reads the feature names (including class labels), their possible values, and the feature vectors themselves, and populates the appropriate data structures defined above.
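A reading step of this kind might look like the sketch below. It is not the original method: the function name `parse_arff` is hypothetical, and it assumes only nominal attributes written as `@attribute name {v1, v2}` with no quoting or sparse data.

```python
def parse_arff(lines):
    """Parse a simple nominal-only ARFF file given as an iterable of lines."""
    features = {}            # feature name -> list of possible values
    feature_name_list = []   # feature names in file order
    feature_vectors = []     # one list of values per data row
    in_data = False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            feature_vectors.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@attribute"):
            name = line.split()[1]
            values = line[line.find("{") + 1 : line.find("}")].split(",")
            feature_name_list.append(name)
            features[name] = [v.strip() for v in values]  # tolerate spaces
        elif lower.startswith("@data"):
            in_data = True
    return features, feature_name_list, feature_vectors

sample = """@relation weather
@attribute outlook {sunny, rainy}
@attribute play {yes, no}
@data
sunny,no
rainy, yes
""".splitlines()

features, names, vectors = parse_arff(sample)
print(features)
print(names)
print(vectors)
```

Stripping each value after splitting on commas is what keeps stray spaces in the attribute declaration from leaking into the value names.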
The TrainClassifier method simply counts the number of co-occurrences of each feature value with each class label, and stores them keyed by 3-tuples. These counts are automatically smoothed by add-one smoothing, since the default count in this dictionary is 1. The counts of the labels are also adjusted by incrementing them by the total number of observations.
Finally, we have the Classify method, which accepts a single feature vector (as a list) as its argument and computes the product of the individual conditional probabilities (smoothed MLE) for each label. The final computed probabilities for each label are stored in the ‘probabilityPerLabel’ dictionary. In the last line, we return the entry from probabilityPerLabel that has the highest probability. Note that the multiplication is actually done as addition in the log domain, as the numbers involved are extremely small. Also, one of the factors used in this multiplication is the prior probability of the class label.
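The training and classification steps described above can be sketched as follows. This is an illustrative reimplementation, not the original code: the function names and toy data are hypothetical, and the smoothing denominator assumes N is the number of possible values of each feature.

```python
import math
from collections import defaultdict

def train(feature_vectors, feature_names):
    """Count labels and (label, feature, value) co-occurrences."""
    feature_counts = defaultdict(lambda: 1)   # add-one smoothing via default 1
    label_counts = defaultdict(int)
    for vector in feature_vectors:
        label = vector[-1]                    # class label is the last column
        label_counts[label] += 1
        for name, value in zip(feature_names[:-1], vector[:-1]):
            feature_counts[(label, name, value)] += 1
    return feature_counts, label_counts

def classify(vector, feature_counts, label_counts, feature_names, features):
    """Return the label with the highest smoothed log-probability."""
    total = sum(label_counts.values())
    log_prob_per_label = {}
    for label, count in label_counts.items():
        log_prob = math.log(count / total)    # prior p(Cj)
        for name, value in zip(feature_names[:-1], vector):
            numerator = feature_counts.get((label, name, value), 1)
            denominator = count + len(features[name])  # assumed N
            log_prob += math.log(numerator / denominator)
        log_prob_per_label[label] = log_prob  # sum of logs = log of product
    return max(log_prob_per_label, key=log_prob_per_label.get)

# Hypothetical toy data in the same shape as the parsed ARFF structures.
features = {"outlook": ["sunny", "rainy"], "temp": ["hot", "mild"], "play": ["yes", "no"]}
names = ["outlook", "temp", "play"]
vectors = [
    ["sunny", "hot", "no"],
    ["sunny", "mild", "no"],
    ["rainy", "mild", "yes"],
    ["rainy", "hot", "yes"],
    ["sunny", "mild", "yes"],
]

feature_counts, label_counts = train(vectors, names)
print(classify(["rainy", "hot"], feature_counts, label_counts, names, features))
print(classify(["sunny", "hot"], feature_counts, label_counts, names, features))
```

Summing logarithms gives the same ranking of labels as multiplying the probabilities, while avoiding floating-point underflow when there are many features.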
Here is the complete code, including a test method:
Download the sample ARFF file to try it out.
Update: I found a bug in the second-to-last line of the GetValues() function. This line gets the possible attribute values from the ARFF file and stores them in self.features. It did not deal with whitespace correctly. Update this line to:
self.features[self.featureNameList[len(self.featureNameList) - 1]] = [featureName.strip() for featureName in line[line.find('{')+1: line.find('}')].strip().split(',')]
December 2nd, 2010, by Tim Finin, posted in Agents, AI, Games, Google, Social

The Google-supported Planet Wars Google AI Challenge had over 4000 entries that used AI and game theory to compete against one another. C at the R-Chart blog analyzed the programming languages used by the contestants with some interesting results.
The usual suspects were the most popular languages used: Java, C++, Python, C# and PHP. The winner, Hungarian Gábor Melis, was just one of 33 contestants who used Lisp. Even less common were entries in C, but the 18 “C hippies” did remarkably well.
Blogger C wonders if Lisp was the special sauce:
Paul Graham has stated that Java was designed for “average” programmers while other languages (like Lisp) are for good programmers. The fact that the winner of the competition wrote in Lisp seems to support this assertion. Or should we see Mr. Melis as an anomaly who happened to use Lisp for this task?
October 30th, 2010, by Tim Finin, posted in AI, Datamining, Google, Machine Learning, NLP, sEARCH, Semantic Web, Social media
Recorded Future is a Boston-based startup with backing from Google and In-Q-Tel that uses sophisticated linguistic and statistical algorithms to extract time-related information about entities and events from streams of Web data. Their goal is to help their clients understand how the relationships between entities and events of interest are changing over time and make predictions about the future.
A recent Technology Review article, See the Future with a Search, describes it this way.
“Conventional search engines like Google use links to rank and connect different Web pages. Recorded Future’s software goes a level deeper by analyzing the content of pages to track the “invisible” connections between people, places, and events described online.
“That makes it possible for me to look for specific patterns, like product releases expected from Apple in the near future, or to identify when a company plans to invest or expand into India,” says Christopher Ahlberg, founder of the Boston-based firm.
A search for information about drug company Merck, for example, generates a timeline showing not only recent news on earnings but also when various drug trials registered with the website clinicaltrials.gov will end in coming years. Another search revealed when various news outlets predict that Facebook will make its initial public offering.
That is done using a constantly updated index of what Ahlberg calls “streaming data,” including news articles, filings with government regulators, Twitter updates, and transcripts from earnings calls or political and economic speeches. Recorded Future uses linguistic algorithms to identify specific types of events, such as product releases, mergers, or natural disasters, the date when those events will happen, and related entities such as people, companies, and countries. The tool can also track the sentiment of news coverage about companies, classifying it as either good or bad.”
Pricing for access to their online services and API starts at $149 a month, but there is a free Futures email alert service through which you can get the results of some standing queries on a daily or weekly basis. You can also explore the capabilities they offer through their page on the 2010 US Senate Races.
“Rather than attempt to predict how the races will turn out, we have drawn from our database the momentum, best characterized as online buzz, and sentiment, both positive and negative, associated with the coverage of the 29 candidates in 14 interesting races. This dashboard is meant to give the view of a campaign strategist, as it measures how well a campaign has done in getting the media to speak about the candidate, and whether that coverage has been positive, in comparison to the opponent.”
Their blog reveals some insights on the technology they are using and much more about the business opportunities they see. Clearly the company is leveraging named entity recognition, event recognition and sentiment analysis. A short document, A White Paper on Temporal Analytics, has some details on their overall approach.
October 9th, 2010, by Tim Finin, posted in Agents, AI, Google
No, this is not an article from The Onion: Google is working on a computer-controlled car. Two articles for tomorrow’s New York Times describe a research project at Google on developing an autonomous vehicle.
In the science section, John Markoff has a story, Google Cars Drive Themselves, in Traffic.
“Anyone driving the twists of Highway 1 between San Francisco and Los Angeles recently may have glimpsed a Toyota Prius with a curious funnel-like cylinder on the roof. Harder to notice was that the person at the wheel was not actually driving. The car is a project of Google, which has been working in secret but in plain view on vehicles that can drive themselves, using artificial-intelligence software that can sense anything near the car and mimic the decisions made by a human driver.”
A companion article, also by Markoff, has some additional material, including this interesting note on the current approach.
“One main technique used by the Google team is known as SLAM, or simultaneous localization and mapping, which builds and updates a map of a vehicle’s surroundings while keeping the vehicle located within the map. To make a SLAM map, the car is first driven manually along a route while its sensors capture location, feature and obstacle data. Then a group of software engineers annotates the maps, making certain that road signs, crosswalks, street lights and unusual features are all embedded. The cars then drive autonomously over the mapped routes, recording changes as they occur and updating the map. The researchers said they were surprised to find how frequently the roads their robots drove on had changed.”
The project was the idea of Stanford computer science professor Sebastian Thrun, who is also a Principal Engineer at Google, where he helped invent the Street View mapping service. Thrun led the Stanford team that developed the Stanley robot car, which won the 2005 DARPA Grand Challenge, a competition focused on developing autonomous vehicle technology.
It’s not clear what the business case is for this Google research project. But Google has the cash and the intellectual capital to develop something in this space that might actually make money.
In a Google blog post from earlier today, What we’re driving at, Thrun gives one motivation.
“Larry and Sergey founded Google because they wanted to help solve really big problems using technology. And one of the big problems we’re working on today is car safety and efficiency. Our goal is to help prevent traffic accidents, free up people’s time and reduce carbon emissions by fundamentally changing car use.
So we have developed technology for cars that can drive themselves. Our automated cars, manned by trained operators, just drove from our Mountain View campus to our Santa Monica office and on to Hollywood Boulevard. They’ve driven down Lombard Street, crossed the Golden Gate bridge, navigated the Pacific Coast Highway, and even made it all the way around Lake Tahoe. All in all, our self-driving cars have logged over 140,000 miles. We think this is a first in robotics research.”
update: Techcrunch has an article speculating on the possible business applications, World-Changing Awesome Aside, How Will The Self-Driving Google Car Make Money?.
September 19th, 2010, by Tim Finin, posted in Agents, AI, Social media
The peer review process is central to most research disciplines and is used in the selection of papers for publication and research proposals for funding.
A new paper by Stefan Thurner and Rudolf Hanel develops an agent-based model of the scientific peer review process, Peer-review in a world with rational scientists: Toward selection of the average.
“… we are interested in the effects of rational referees, who might not have any incentive to see high quality work other than their own published or promoted. We find that a small fraction of incorrect (selfish or rational) referees can drastically reduce the quality of the published (accepted) scientific standard. We quantify the fraction for which peer review will no longer select better than pure chance. Decline of quality of accepted scientific work is shown as a function of the fraction of rational and unqualified referees. We show how a simple quality-increasing policy of e.g. a journal can lead to a loss in overall scientific quality, and how mutual support-networks of authors and referees deteriorate the system.”
Their agent model has several reviewer types:
- The correct: Accepts good and rejects bad papers.
- The stupid: This referee can not judge the quality of a paper (e.g. because of incompetence or lack of time) and takes a random decision on a paper.
- The rational: The rational referee knows that work better than his/her own might draw attention away from his/her own work. For him there is no incentive to accept anything better than one’s own work, while it might be fine to accept worse quality.
- The altruist: Accepts all papers.
- The misanthropist: Rejects all papers.
I’ve known them all, as I am sure many of us have. As an editor or program chair I’ve met a few other types, including these:
- The Bartleby: His or her response to an invitation is always “I would prefer not to.”
- The Black Hole: Messages go in and nothing ever comes out.
- The Gary Cooper: A person of few words, even when many are called for.
- The Perseverator: Sees all sides of any decision and keeps them all carefully in balance. Usually recommends “major revision”.
I am sure I’ve overlooked some — suggest your own via a comment.
(h/t Shlomo Argamon)
September 16th, 2010, by Tim Finin, posted in Agents, AI
This is a call for bids to host the Twelfth International Conference on Autonomous Agents and Multiagent Systems (AAMAS) in 2013. Bids will be considered from all geographical regions; however, for the 2013 conference, we particularly encourage bids from the Americas.
Bids are sought from volunteers from the scientific community, though they may be supported by paid meeting professionals.
All correspondence regarding bids should be directed by email to the IFAAMAS Conference Committee Chair (Munindar P. Singh, singh@ncsu.edu) and Chair Elect (Onn Shehory, ONN@il.ibm.com).
Bids should be made by individuals or small groups, with the backing of a host institution, typically a university or research center. Groups or individuals who are planning to submit a bid should notify Drs. Singh and Shehory of their intention as soon as possible.
- Now: Expression of interest and queries
- November 17, 2010: Submission of final bid
- November 18, 2010-February 28, 2011: Potential discussions with bidders; internal discussions in the IFAAMAS Board
- March 1, 2011: Decision
See the full AAMAS-2013 call for bids for more information.
September 7th, 2010, by Tim Finin, posted in AI, GENERAL
According to a recent post in the Microsoft Careers JobsBlog, the three hottest new majors for a career in technology are:
- Data Mining/Machine Learning/AI/Natural Language Processing
- Business Intelligence/Competitive Intelligence
- Analytics/Statistics, specifically Web Analytics, A/B Testing and statistical analysis
Happily these are all strengths of the IT programs at UMBC. In fact, we have placed a large number of graduates at leading edge technology companies in the past few years, including Microsoft, Google, Amazon, IBM, and Yahoo.