A free PDF version of the new second edition of Mining of Massive Datasets by Anand Rajaraman, Jure Leskovec and Jeffey Ullman is available. New chapters on mining large graphs, dimensionality reduction, and machine learning have been added. Related material from Professor Leskovec’s recent Stanford course on Mining Massive Data Sets is also available.
“President Obama signed an Executive Order directing historic steps to make government-held data more accessible to the public and to entrepreneurs and others as fuel for innovation and economic growth. Under the terms of the Executive Order and a new Open Data Policy released today by the Office of Science and Technology Policy and the Office of Management and Budget, all newly generated government data will be required to be made available in open, machine-readable formats, greatly enhancing their accessibility and usefulness, while ensuring privacy and security.”
While the policy doesn’t mention adding semantic markup to enhance machine understanding, calling for machine-readable datasets with persistent identifiers is a big step forward.
The dataset consists of over 386M blog posts, news articles, classifieds, forum posts and social media content in a month including events such as the Tunisian revolution and the Egyptian protests. The content includes the syndicated text, its original HTML as found on the web, annotations and metadata (e.g., author information, time of publication and source URL), and boilerplate/chrome extracted content. The data is formatted as Spinn3r’s protostreams – an extension to Google protobuffers. It is also broken down by date, content type and language making it easy to work with selected data.
See the ICWSM Data Challenge pages for more information on the challenge task, its associated ICWSM workshop and procedures for data access.
On the eve of the big Jeopardy! match, Peter Norvig’s opinion piece in the New York Post (!) today, The Machine Age looks at AI’s progress over the past sixty years and lays out six surprising lessons we’ve learned.
The things we thought were hard turned out to be easier.
Dealing with uncertainty turned out to be more important than thinking with logical precision.
Learning turned out to be more important than knowing.
Current systems are more likely to be built from examples than from logical rules.
The focus shifted from replacing humans to augmenting them.
The partnership between human and machine is stronger than either one alone.
When took Pat Winston’s undergraduate AI class in 1970, only the first of those ideas was current. It’s a good essay.
Of course, after we we’ve exploited the new data-driven, statistical paradigm for the next decade or so, we’ll probably have to go back to figuring out how to get logic back into the framework.
Recorded Future is a Boston-based startup with backing from Google and In-Q-Tel uses sophisticated linguistic and statistical algorithms to extract time-related information from streams of Web data about entities and events. Their goal is to help their clients to understand how the relationships between entities and events of interest are changing over time and make predictions about the future.
“Conventional search engines like Google use links to rank and connect different Web pages. Recorded Future’s software goes a level deeper by analyzing the content of pages to track the “invisible” connections between people, places, and events described online.
”That makes it possible for me to look for specific patterns, like product releases expected from Apple in the near future, or to identify when a company plans to invest or expand into India,” says Christopher Ahlberg, founder of the Boston-based firm.
A search for information about drug company Merck, for example, generates a timeline showing not only recent news on earnings but also when various drug trials registered with the website clinicaltrials.gov will end in coming years. Another search revealed when various news outlets predict that Facebook will make its initial public offering.
That is done using a constantly updated index of what Ahlberg calls “streaming data,” including news articles, filings with government regulators, Twitter updates, and transcripts from earnings calls or political and economic speeches. Recorded Future uses linguistic algorithms to identify specific types of events, such as product releases, mergers, or natural disasters, the date when those events will happen, and related entities such as people, companies, and countries. The tool can also track the sentiment of news coverage about companies, classifying it as either good or bad.”
Pricing for access to their online services and API starts at $149 a month, but there is a free Futures email alert service through which you can get the results of some standing queries on a daily or weekly basis. You can also explore the capabilities they offer through their page on the 2010 US Senate Races.
“Rather than attempt to predict how the the races will turn out, we have drawn from our database the momentum, best characterized as online buzz, and sentiment, both positive and negative, associated with the coverage of the 29 candidates in 14 interesting races. This dashboard is meant to give the view of a campaign strategist, as it measures how well a campaign has done in getting the media to speak about the candidate, and whether that coverage has been positive, in comparison to the opponent.”
Their blog reveals some insights on the technology they are using and much more about the business opportunities they see. Clearly the company is leveraging named entity recognition, event recognition and sentiment analysis. A short A White Paper on Temporal Analytics has some details on their overall approach.
Kaggle is a site for data-related competitions in machine learning, statistics and econometrics. Companies, researchers, government and other organizations will be able to post their modeling problems and invite researchers to compete to produce the best solutions. The Kaggle demo site currently has three example competitions to illustrate how it will work and expects to host the first real one in March. Kaggle’s competition hosting service will be free, but the site says that it plans to “offer paid-for services in addition to its free competition hosting.”
There’s been a lot of interest in Wolfram Alpha in the past week, starting with a blog post from Steve Wolfram, Wolfram|Alpha Is Coming!, in which he described his approach to building a system that integrates vast amounts of knowledge and then tries to answer free form questions posed to it by people. His post lays out his approach, which does not involve extracting data from online text.
“A lot of it is now on the web—in billions of pages of text. And with search engines, we can very efficiently search for specific terms and phrases in that text. But we can’t compute from that. And in effect, we can only answer questions that have been literally asked before. We can look things up, but we can’t figure anything new out.
So how can we deal with that? Well, some people have thought the way forward must be to somehow automatically understand the natural language that exists on the web. Perhaps getting the web semantically tagged to make that easier.
But armed with Mathematica and NKS I realized there’s another way: explicitly implement methods and models, as algorithms, and explicitly curate all data so that it is immediately computable.”
In a nutshell, Wolfram and his team have built what he calls a “computational knowledge engine” for the Web. OK, so what does that really mean? Basically it means that you can ask it factual questions and it computes answers for you.
It doesn’t simply return documents that (might) contain the answers, like Google does, and it isn’t just a giant database of knowledge, like the Wikipedia. It doesn’t simply parse natural language and then use that to retrieve documents, like Powerset, for example.
Instead, Wolfram Alpha actually computes the answers to a wide range of questions — like questions that have factual answers such as “What is the location of Timbuktu?” or “How many protons are in a hydrogen atom?,” “What was the average rainfall in Boston last year?,” “What is the 307th digit of Pi?,” “where is the ISS?” or “When was GOOG worth more than $300?”
“Stephen Wolfram generously gave me a two-hour demo of Wolfram Alpha last evening, and I was quite positively impressed. As he said, it’s not AI, and not aiming to be, so it shouldn’t be measured by contrasting it with HAL or Cyc but with Google or Yahoo.”
Doug’s review does a good job of sketching the differences he ses between Wolfram Alpha and systems like Google and Cyc.
Lenat’s description makes Wolfram Alpha sound like a variation on the Semantic Web vision, but one that more like a giant closed database than a distributed Web of data. The system is set to launch in May 2009 and I’m anxious to give it a try.
The Data Evolution blog has an interesting post that asks Is Big Data at a tipping point?. It’s suggests that we may be approaching a tipping point in which large amounts of online data will be interlinked and connected to suddenly produce a whole much larger than the parts.
“For the past several decades, an increasing number of business processes– from sales, customer service, shipping – have come online, along with the data they throw off. As these individual databases are linked, via common formats or labels, a tipping point is reached: suddenly, every part of the company organism is connected to the data center. And every action — sales lead, mouse click, and shipping update — is stored. The result: organizations are overwhelmed by what feels like a tsunami of data. The same trend is occurring in the larger universe of data that these organizations inhabit. Big Data unleashed by the “Industrial Revolution of Data”, whether from public agencies, non-profit institutes, or forward-thinking private firms.”
I expected that the post would soon segue into a discussion of the Semantic Web and maybe even the increasingly popular linked data movement, but it did not. Even so, it sets up plenty of nails for which we have a an excellent hammer in hand. I really like this iceberg analogy, by the way.
“At present, much of the world’s Big Data is iceberg-like: frozen and mostly underwater. It’s frozen because format and meta-data standards make it hard to flow from one place to another: comparing the SEC’s financial data with that of Europe’s requires common formats and labels (ahem, XBRL) that don’t yet exist. Data is “underwater” when, whether reasons of competitiveness, privacy, or sheer incompetence it’s not shared: US medical records may contain a wealth of data, but much of it is on paper and offline (not so in Europe, enabling studies with huge cohorts).”
The post also points out some sources of online data and analysis tools, some familiar and some new to me (or maybe just forgotten.)
USA Today reports (Feds may mine blogs for terrorism clues) that the US Department of Homeland Security wants to use data-mining technology to search blogs and Internet message boards to find those used by terrorists to plan attacks.
“Blogging and message boards have played a substantial role in allowing communication among those who would do the United States harm,” DHS said in a recent notice.
Julian Sanchez notes on Ars Technica that the story is not new.
The National Research Council released a report on the effectiveness of collecting and mining personal data, such as such as phone, medical, and travel records or Web sites visited, as a tool for combating terrorism. The report, titled Protecting Individual Privacy in the Struggle Against Terrorists: A Framework for Program Assessment, was produced by a multi-year study was carried out at the request of DHS and NSF.
The NRC’s press release on the study notes that routine datamining can help in “expanding and speeding traditional investigative work”, it questions the effectiveness of automated datamining and behavioral surveillance.
“Far more problematic are automated data-mining techniques that search databases for unusual patterns of activity not already known to be associated with terrorists, the report says. Although these methods have been useful in the private sector for spotting consumer fraud, they are less helpful for counterterrorism precisely because so little is known about what patterns indicate terrorist activity; as a result, they are likely to generate huge numbers of false leads. Such techniques might, however, have some value as secondary components of a counterterrorism system to assist human analysts. Actions such as arrest, search, or denial of rights should never be taken solely on the basis of an automated data-mining result, the report adds.
The committee also examined behavioral surveillance techniques, which try to identify terrorists by observing behavior or measuring physiological states. There is no scientific consensus on whether these techniques are ready for use at all in counterterrorism, the report says; at most they should be used for preliminary screening, to identify those who merit follow-up investigation. Further, they have enormous potential for privacy violations because they will inevitably force targeted individuals to explain and justify their mental and emotional states.”
The report suggested criteria and questions addressing both the technical effectiveness as well as impact on privacy to help agencies and policymakers evaluate data-based counterterrorism programs. It also calls for oversight and both technical and policy safeguards to protect privacy and prevent “mission creep”. Declan McCullagh has a good summary of the key recommendations.
The 352 page report can be downloaded from the National Accademies Press site for $37.00.
Here at Ebiquity, we’ve had a number of great grad students. One of them, Akshay Java, hacked out a search engine for twitter posts around early April last year, and named it twitterment. He blogged about it here first. He did it without the benefit of the XMPP updates, by parsing the public timeline. It got talked about in the blogosphere, (including by Scoble), got some press, and there was an article in the MIT Tech review that used his visualization of some of the twitter links. It even got talked about in Wired’s blog, something we found out only yesterday. We were also told that three days after the post in Wired’s blog, someone somewhere registered the domain twitterment.com (I won’t feed them pagerank by linking!), and set up a page that looks very similar to Akshay’s. It has Google Adsense, and of course just passes the query to Google with a site restriction to twitter. So they’re poaching coffee and cookie money from the students in our lab
So of course we played with Akshay’s hack, hosted it on one of our university boxes for a few months, but didn’t really have the bandwidth or compute (or time) resources to keep up. Startups such as summize appeared later and provided similar functionality. For the last week or two we’ve been moving the code of twitterment to Amazon’s cloud to restart the service. Of course, today comes the news that twitter might buy summize, quasi confirmed by Om Malik. Lesson to you grad students — if you come up with something clever, file an invention disclosure with your university’s tech transfer folks. And don’t listen to your advisors if they think that there isn’t a paper in what you’ve hacked — there may yet be a few million dollars in it
A UMBC led team recently won a MURI award from DoD to work on “Assured Information Sharing Lifecycle”. It is an interesting mix of work on new security models, policy driven security systems, context awareness, privacy preserving data mining, and social networking. The award really brings together many different strains of research in eBiquity, as well as some related reserach in our department. We’re just starting off, and excited about it. UMBC’s web page had a story about this, and more recently, GCN covered it.