UMBC ebiquity
Machine Learning

Archive for the 'Machine Learning' Category

Recorded Future analyses streaming Web data to predict the future

October 30th, 2010, by Tim Finin, posted in AI, Datamining, Google, Machine Learning, NLP, Search, Semantic Web, Social media

Recorded Future, a Boston-based startup with backing from Google and In-Q-Tel, uses sophisticated linguistic and statistical algorithms to extract time-related information about entities and events from streams of Web data. Their goal is to help clients understand how the relationships between entities and events of interest are changing over time and to make predictions about the future.

Recorded Future system architecture

A recent Technology Review article, See the Future with a Search, describes it this way.

“Conventional search engines like Google use links to rank and connect different Web pages. Recorded Future’s software goes a level deeper by analyzing the content of pages to track the “invisible” connections between people, places, and events described online.
   ”That makes it possible for me to look for specific patterns, like product releases expected from Apple in the near future, or to identify when a company plans to invest or expand into India,” says Christopher Ahlberg, founder of the Boston-based firm.
   A search for information about drug company Merck, for example, generates a timeline showing not only recent news on earnings but also when various drug trials registered with the website clinicaltrials.gov will end in coming years. Another search revealed when various news outlets predict that Facebook will make its initial public offering.
   That is done using a constantly updated index of what Ahlberg calls “streaming data,” including news articles, filings with government regulators, Twitter updates, and transcripts from earnings calls or political and economic speeches. Recorded Future uses linguistic algorithms to identify specific types of events, such as product releases, mergers, or natural disasters, the date when those events will happen, and related entities such as people, companies, and countries. The tool can also track the sentiment of news coverage about companies, classifying it as either good or bad.”

Pricing for access to their online services and API starts at $149 a month, but there is a free Futures email alert service through which you can get the results of some standing queries on a daily or weekly basis. You can also explore the capabilities they offer through their page on the 2010 US Senate Races.

“Rather than attempt to predict how the races will turn out, we have drawn from our database the momentum, best characterized as online buzz, and sentiment, both positive and negative, associated with the coverage of the 29 candidates in 14 interesting races. This dashboard is meant to give the view of a campaign strategist, as it measures how well a campaign has done in getting the media to speak about the candidate, and whether that coverage has been positive, in comparison to the opponent.”

Their blog reveals some insights on the technology they are using and much more about the business opportunities they see. Clearly the company is leveraging named entity recognition, event recognition and sentiment analysis. A short paper, A White Paper on Temporal Analytics, has some details on their overall approach.

Smart Grid: the collision of energy and information

August 19th, 2010, by Tim Finin, posted in Machine Learning, UMBC

The Maryland Clean Energy Technology Incubator (CETI) at bwtech@UMBC will host a seminar series this Fall with focus on the Smart Grid. The series will discuss the issues and opportunities and speculate on expected business opportunities in this major restructuring of the electric grid. Huge investments (tens of billions of dollars) are committed to the Smart Grid for the coming decade.

About six seminars are planned for Fall 2010, to be held (mostly) on Wednesdays from 4-6pm, and UMBC faculty, staff and students are encouraged to participate. Each will include a ~45 minute presentation followed by a lively discussion and an opportunity to socialize and enjoy light refreshments.

The first speaker, Peter Kelly-Detwiler, leads a group at Constellation Energy that is developing new methods for data analysis and presentation. He is an “entrepreneur” within Constellation with 20 years of experience in the energy field, and he has a perspective on the Smart Grid like few others.

A smart grid perspective: finding value in
the collision of energy and information

Peter Kelly-Detwiler, Constellation Energy

4-6pm Wednesday, 8 September 2010
2nd floor Courtyard Conference Room
UMBC Tech Center

Many people have heard of the term “smart grid” and there are many varying interpretations of what it means. But everybody can agree on three things:

  • It involves increased and timely access to information
  • There’s money in it
  • It will create new and unforeseen technologies and entrepreneurial opportunities

The discussion will center around why smart grid is needed, how an energy provider views the challenges and opportunities, the forces we see gathering on the horizon, and how Constellation Energy is responding. Issues related to power grid economics, volatility, risk management, and customer elasticities and perspectives will be addressed.

Peter Kelly-Detwiler is Senior Vice President of Energy Technology Services for Constellation NewEnergy, Inc., a subsidiary of Constellation Energy Group. He and his company-wide team oversee the integration of efficiency technologies and applications that help customers better manage their total energy bills and create optimal energy solutions. Peter has 20 years of experience in the energy industry. His accomplishments include managing the development of energy efficiency projects and reviewing economic impact of energy products.

Please RSVP to Bjorn Frogner (bjorn.frogner@umbc.edu), the CETI Entrepreneur in Residence, if you plan to attend.

Training Examples QA: stackoverflow for NLP and ML

June 30th, 2010, by Tim Finin, posted in AI, Machine Learning, NLP, Semantic Web, Social media

Training Examples QA is a site created by Joseph Turian where “data geeks ask and answer questions on machine learning, natural language processing, artificial intelligence, text analysis, information retrieval, search, data mining, statistical modeling, and data visualization!”

It’s a close knockoff of the popular Stack Overflow site and appears to be very well done.

If it catches on in the relevant research communities, it could be a very useful resource. (via LingPipe blog)


Kaggle aims to host data-driven machine learning competitions

February 3rd, 2010, by Tim Finin, posted in Datamining, Machine Learning, Semantic Web, Social media

Kaggle is a site for data-related competitions in machine learning, statistics and econometrics. Companies, researchers, government and other organizations will be able to post their modeling problems and invite researchers to compete to produce the best solutions. The Kaggle demo site currently has three example competitions to illustrate how it will work and expects to host the first real one in March. Kaggle’s competition hosting service will be free, but the site says that it plans to “offer paid-for services in addition to its free competition hosting.”

Gaydar, Facebook and privacy

October 6th, 2009, by Tim Finin, posted in Machine Learning, Privacy, Semantic Web, Social media

In the Fall of 2007, two MIT students carried out a class project exploring how presumably private data could be inferred from an online social networking system. Their experiment was to predict the sexual orientation of Facebook users who make their basic information public by analyzing friendship associations. As reported in the Boston Globe last month, the students had not yet published their results.

Well, now they have, in the October issue of First Monday, “one of the first openly accessible, peer–reviewed journals on the Internet”.

The paper has a lot of detail on the methodology for collecting the data and how it was analyzed. Here’s the abstract.

“Public information about one’s coworkers, friends, family, and acquaintances, as well as one’s associations with them, implicitly reveals private information. Social networking Web sites, e–mail, instant messaging, telephone, and VoIP are all technologies steeped in network data — data relating one person to another. Network data shifts the locus of information control away from individuals, as the individual’s traditional and absolute discretion is replaced by that of his social network. Our research demonstrates a method for accurately predicting the sexual orientation of Facebook users by analyzing friendship associations. After analyzing 4,080 Facebook profiles from the MIT network, we determined that the percentage of a given user’s friends who self–identify as gay male is strongly correlated with the sexual orientation of that user, and we developed a logistic regression classifier with strong predictive power. Although we studied Facebook friendship ties, network data is pervasive in the broader context of computer–mediated communication, raising significant privacy issues for communication technologies to which there are no neat solutions.”

As we had previously noted, this datamining exercise only accesses information that Facebook users explicitly choose to make public. The authors note that their analysis “relies on public self–identification of same–gender interest in Facebook profiles as a sentinel value for LGB identity”. The privacy vulnerability is that, by default, a Facebook account’s friendship relations are public, and you cannot control the privacy settings of your friends. So if you leave your friend list public and many of your Facebook friends open up their profiles, it may be possible to draw reasonable inferences about your age, gender, political leanings, sexual preferences and other attributes.
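To make the mechanics concrete, here is a minimal sketch of the general idea, not the authors’ actual code or data. It computes one feature per user (the fraction of friends who publicly declare some attribute) and fits a one-variable logistic regression by gradient descent. The friend graph, the training values, and the function names are all invented for illustration.

```python
import math

def friend_fraction(friends, declared):
    """Fraction of a user's friends who publicly declare the attribute."""
    if not friends:
        return 0.0
    return sum(1 for f in friends if f in declared) / len(friends)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, steps=2000):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by batch gradient descent on log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b

# Synthetic training set (assumption): friend fractions and known labels.
train_x = [0.0, 0.05, 0.1, 0.1, 0.3, 0.4, 0.5, 0.6]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic(train_x, train_y)

# Hypothetical friend lists for two users whose own profiles reveal nothing.
friends_of = {"u1": ["a", "b", "c", "d"], "u2": ["a", "e", "f", "g"]}
declared = {"a", "b", "c"}  # friends who made the attribute public

for user, friends in friends_of.items():
    x = friend_fraction(friends, declared)
    print(user, f"fraction={x:.2f}", f"score={sigmoid(w * x + b):.2f}")
```

The point of the sketch is that the only inputs are things other people chose to publish. The user being scored contributes nothing but a public friend list.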

$1M Netflix Prize goes to BellKor’s Pragmatic Chaos

September 21st, 2009, by Tim Finin, posted in AI, Machine Learning, Semantic Web, Social media

Netflix announced today that BellKor’s Pragmatic Chaos team was awarded the $1M Netflix Grand Prize.

“It is our great honor to announce the $1M Grand Prize winner of the Netflix Prize contest as team BellKor’s Pragmatic Chaos for their verified submission on July 26, 2009 at 18:18:28 UTC, achieving the winning RMSE of 0.8567 on the test subset. This represents a 10.06% improvement over Cinematch’s score on the test subset at the start of the contest. We congratulate the team of Bob Bell, Martin Chabbert, Michael Jahrer, Yehuda Koren, Martin Piotte, Andreas Töscher and Chris Volinsky for their superb work advancing and integrating many significant techniques to achieve this result.”
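The 10.06% figure can be checked with a few lines of arithmetic. This is a small sketch, assuming the baseline is the Cinematch test-subset RMSE of 0.9525 given in the contest rules and that improvement is measured relative to that baseline. The function name is mine, not Netflix’s.

```python
def relative_improvement(baseline_rmse: float, new_rmse: float) -> float:
    """Percentage by which new_rmse improves on baseline_rmse."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse

CINEMATCH_TEST_RMSE = 0.9525  # Cinematch score on the test subset, per the rules
WINNING_TEST_RMSE = 0.8567    # BellKor's Pragmatic Chaos, per the announcement

improvement = relative_improvement(CINEMATCH_TEST_RMSE, WINNING_TEST_RMSE)
print(f"{improvement:.2f}%")  # prints 10.06%
```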

Netflix announced that it will hold a new Netflix Prize 2 contest with details to be released.

What about the Ensemble’s last-minute entry, the one that seemed to top BellKor’s?

“Team BellKor’s Pragmatic Chaos edged out team The Ensemble with the winning submission coming just 24 minutes before the conclusion of the nearly three-year-long contest. Historically the Leaderboard has only reported team scores on the quiz subset. The Prize is awarded based on teams’ test subset score. Now that the contest is closed we will be updating the Leaderboard to report team scores on both the test and quiz subsets.”

As part of the final submission, teams were required to submit papers describing the approach. Here are the three that the winning team delivered.

The New York Times Bits blog also has an article, Netflix Awards $1 Million Prize and Starts a New Contest.

Who won the Netflix Prize? Ensemble or BellKors Pragmatic Chaos?

July 27th, 2009, by Tim Finin, posted in AI, Machine Learning, Social media, Web

Who won the Netflix Prize? According to a post in the NYT Bits blog, Netflix Challenge Ends, But Winner Is In Doubt, it’s still very much up in the air.

“So The Ensemble won, right? Not necessarily. In an e-mail message Sunday night, Chris Volinsky, a scientist at AT&T Research and a leader of the BellKor’s team, said: “Our team is in first place as we were contacted by Netflix to validate our entry.” And in an online forum, another member of the BellKor team, Yehuda Koren, a researcher for Yahoo in Israel, said his team had “a better Test score than The Ensemble,” despite what the rival team submitted for the leaderboard.

So is BellKor the winner? Certainly not yet, according to a Netflix spokesman, Steve Swasey. “There is no winner,” he said.

A winner, Mr. Swasey said, will probably not be announced until sometime in September at an event hosted by Reed Hastings, Netflix’s chief executive. The movie rental company is not holding off for maximum P.R. effect, Mr. Swasey said, but because the winner has not yet been determined.

The Web leaderboard, he explained, is based on what the teams submit. Next, Netflix’s in-house researchers and outside experts have to validate the teams’ submissions, poring over the submitted code, design documents and other materials. “This is really complex stuff,” Mr. Swasey said.

A leading member of The Ensemble, Domonkos Tikk, a Hungarian computer scientist, did not sound too hopeful. “We didn’t get any notification from Netflix,” Mr. Tikk said in a phone interview from Hungary. “So I think the chances that we won are very slight. It was a nice try.”

It seems strange that Netflix called the BellKor team first, since according to the Leaderboard the Ensemble team submitted the top entry.

UPDATE 7/28: Today’s NYT has a good article on the Netflix Prize and the role of teamwork in developing machine learning systems, Netflix Competitors Learn the Power of Teamwork.

Netflix Prize contest closes; Ensemble wins

July 26th, 2009, by Tim Finin, posted in AI, Machine Learning, Social media, Web

Netflix has announced that the Netflix Prize contest is now closed. Presumably, The Ensemble is the winner, subject to final qualification.

“We are delighted to report that, after almost three years and more than 43,000 entries from over 5,100 teams in over 185 countries, the Netflix Prize Contest stopped accepting entries on 2009-07-26 18:42:37 UTC. The closing of the contest is in accordance with the Rules — thirty (30) days after a submitted prediction set achieved the Grand Prize qualifying RMSE on the quiz subset.

Qualified entries will be evaluated as described in the Rules. We look forward to awarding the Grand Prize, which we expect to announce in a few weeks. However if a Grand Prize cannot be awarded because no submission can be verified by the judges, the Contest will reopen. We will make an announcement on the Forum after the Contest judges reach a decision.”

So what’s left for the judges to do? The rules say that “a panel of senior Netflix engineers and qualified independent judges” need to “ensure that the provided algorithm description and source code could reasonably have generated the prediction sets submitted”. To do this, the candidate winner must produce the algorithm along with a description of how it works. And, of course, before receiving the prize the winner has to grant Netflix

“an irrevocable, royalty free, fully paid up, worldwide non-exclusive license under the Participants’ copyrights, patents or other intellectual property rights in the winning algorithm (“Winning Algorithm”) to reproduce, distribute, display, and create derivative works from the Winning Algorithm and also to make, have made, use, sell, offer for sale, and import products that would otherwise infringe the Winning Algorithm.”

The Netflix Prize was a great idea and generated a lot of interest around the world. It’s been good for the field of AI and its machine learning sub-field, especially. Congratulations to the Ensemble team and condolences to BellKor’s Pragmatic Chaos. I wish there could have been two winners.

UPDATE 7/27: Wait! The winner is still in doubt.

Ensemble leads Netflix Prize contest, besting BellKors Pragmatic Chaos

July 26th, 2009, by Tim Finin, posted in AI, Machine Learning, Social media

The race for the Netflix Prize is still on.

With just one day left in the 30-day last-call period, after which BellKor’s Pragmatic Chaos (BKPC) would have been awarded the $1M Netflix Prize for a better movie recommender system, another team, The Ensemble, has broken the 10% improvement threshold and taken the lead by one hundredth of one percent.

The Ensemble was formed by the merger of two existing Netflix Prize teams that had been ranked second and third behind BKPC: ‘Grand Prize Team’ and ‘Opera Solutions and Vandelay United’. Here’s how The Ensemble describes its genesis.

The crowd is indeed wiser than the individual.

The 10% barrier once seemed distant and insurmountable. But when the contest’s “last call” heralded the heroic achievements of BellKor’s Pragmatic Chaos, the rest of the crowd pondered, and asked why the barrier couldn’t be broken twice.

And lo, as if powered by gravity, Grand Prize Team and Vandelay Industries! began to draw in more and more members. And Vandelay went on to join forces with Opera Solutions, and then Vandelay and Opera united with Grand Prize Team, and then … and then … well, things got so complex we decided just to call ourselves The Ensemble.

We can be sure that there will be a lot of Netflix Prize activity in the coming weeks and maybe months as these two teams compete and perhaps more mergers create super-teams. BKPC and Ensemble could even decide to merge and share the prize. Watch the Netflix Leaderboard for the latest ranking.

UPDATE: I had assumed the 30-day last call would reset with each new leader, like auctions on eBay. Not so. The prize will be won (and lost) today! Here’s the relevant section in the rules:

“To qualify for the Grand Prize the RMSE of a Participant’s submitted predictions on the test subset must be less than or equal to 90% of 0.9525, or 0.8572 (the “qualifying RMSE”). After three (3) months have elapsed from the start of the Contest, when the RMSE of a submitted prediction set on the quiz subset improves beyond the qualifying RMSE an electronic announcement will inform all registered Participants that they have thirty (30) days to submit additional candidate prediction sets to be considered for judging. At the end of this period, qualifying submissions will be judged (see Judging below) in order of the largest improvement over the qualifying RMSE on the test subset. In the case of tied RMSE values on the test subsets, the submission received earliest by the Site will be judged first.”
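For readers who want to see the rule in operational terms, here is a rough sketch of RMSE, the qualifying threshold, and the judging order described above. The ratings and the three submissions are made up, and the function names are my own. Only the 0.9525 and 0.8572 figures come from the rules.

```python
import math
from datetime import datetime

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("need two non-empty sequences of equal length")
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

CINEMATCH_RMSE = 0.9525   # baseline quoted in the rules
QUALIFYING_RMSE = 0.8572  # "90% of 0.9525", as stated in the rules

def qualifies(test_rmse):
    """A submission qualifies if its RMSE is at or below the threshold."""
    return test_rmse <= QUALIFYING_RMSE

def judging_order(submissions):
    """Best (lowest) test RMSE first; ties go to the earliest submission."""
    return sorted(submissions, key=lambda s: (s["test_rmse"], s["received"]))

# Toy ratings (invented) just to exercise rmse().
toy_rmse = rmse([3.5, 4.0, 2.0, 5.0], [4, 4, 1, 5])
print(f"toy RMSE: {toy_rmse:.4f}")

# Hypothetical submissions, not real contest entries.
entries = [
    {"team": "team_x", "test_rmse": 0.8570, "received": datetime(2009, 7, 26, 10, 0)},
    {"team": "team_y", "test_rmse": 0.8570, "received": datetime(2009, 7, 26, 9, 30)},
    {"team": "team_z", "test_rmse": 0.8569, "received": datetime(2009, 7, 26, 11, 0)},
]
ranked = [s["team"] for s in judging_order(entries) if qualifies(s["test_rmse"])]
print(ranked)
```

Note the tie-breaker: with equal test scores, the earlier submission is judged first, so the clock matters as much as the fourth decimal place.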

The August 2009 CACM has a short note, Just for You (pdf), on recommender systems and the Netflix prize by BKPC member Don Monroe that includes a visualization by Ensemble member Chris Hefele.

Spotted on Hacker News. See Techcrunch also.

UPDATE II: The Netflix Prize contest has closed.

The $1M Netflix Grand Prize taken by BellKor’s Pragmatic Chaos?

June 26th, 2009, by Tim Finin, posted in AI, Machine Learning, Social media

BellKor’s Pragmatic Chaos has broken the 10% barrier, a feat that may have won them the $1M Netflix prize. We’ll know for sure in 30 days.

“June 26, 2009: Today our team submitted our solution to the Netflix Prize, resulting in a score of .8558, which corresponds to an improvement over Netflix Cinematch algorithm of 10.05%. This is the first submission in the competition to break the 10% barrier and sets off a 30 day period where all competitors are invited to submit their best and final solutions.”

Netflix is offering the prize in an open competition, started in October 2006, for the best collaborative filtering algorithm for predicting user ratings of films from a database of previous ratings. Today the BellKor’s Pragmatic Chaos team submitted an entry that improved on the existing algorithm by 10.05%, exceeding the 10% improvement threshold required of a winner. The team is a collaboration between people from Pragmatic Theory, Commendo, Yahoo and AT&T.

“The Netflix Prize seeks to substantially improve the accuracy of predictions about how much someone is going to love a movie based on their movie preferences. Improve it enough and you win one (or more) Prizes. Winning the Netflix Prize improves our ability to connect people to the movies they love.”

It wasn’t me — the Bot did it!

September 15th, 2008, by Tim Finin, posted in AI, Machine Learning, Semantic Web, Social media, Web

The NYT has an interesting article, Stuck in Google’s Doghouse, on the importance of search engines to many businesses. Or maybe it’s about ad arbitrage and the ways that some Web business models are based on gaming search engines and Web advertising. In any case, it’s especially relevant in the light of the recently announced Google-Yahoo advertising deal.

One of the most interesting aspects to the story, at least to me, is who gets the credit or blame for significant decisions and events on the Web — people or machines.

“When Mr. Savage asked Google executives what the problem was, he was told that Sourcetool’s “landing page quality” was low. Google had recently changed the algorithm for choosing advertisements for prominent positions on Google search pages, and Mr. Savage’s site had been identified as one that didn’t meet the algorithm’s new standards. (As Google defines it, landing page quality includes a series of attributes — loading speed, user friendliness, relevancy, originality and dozens of other characteristics — that it deems appropriately “googly.”)” source

A more dramatic example of our brave new world was the $1B problem United Airlines stock had last week, as outlined in A Stock-Killer Fueled by Algorithm After Algorithm.

“What made a six-year-old article about a bankruptcy filing by United Airlines reappear on Wall Street traders’ screens on Monday as if it were fresh news, prompting a sell-off that erased $1 billion in the company’s market value in a matter of minutes? The path the article followed from forgotten archive entry to present-day stock-killer has begun to emerge, and it raises some interesting questions about how news rockets around the Web. Both human error and far-from-foolproof technology seem to have played a role in the episode, which involved a 2002 Chicago Tribune report; the web site of the Sun Sentinel, a Florida newspaper owned by the same company; the Bloomberg News financial wire service; and Google, all apparently unwittingly.” source

The automation is inevitable, IMHO, and probably a good thing. Of course, I reserve the right to revise and extend my remarks if my own ox is gored.

Colin de la Higuera on Grammatical Inference, 1pm Tue June 10, ITE 325, UMBC

June 5th, 2008, by Tim Finin, posted in AI, Machine Learning, NLP

Colin de la Higuera of Jean Monnet University will talk on “Grammatical Inference: Some of the Questions Out There” at 1:00pm next Tuesday in the large CSEE conference room.

“Grammatical Inference is a field concerned with learning grammars given data about a language. In this talk we survey some of the questions being addressed by researchers in the field. Some of these are now classical and have been looked into for some time, others are more recent:

  • understanding the models and the paradigms: what does polynomial language learning mean?
  • learning more complex families of languages
  • scaling up and using grammatical inference in applications
