Wolfram Alpha: an alternative to Google, the Semantic Web and Cyc?

March 11th, 2009

There’s been a lot of interest in Wolfram Alpha in the past week, starting with a blog post from Stephen Wolfram, Wolfram|Alpha Is Coming!, in which he described his approach to building a system that integrates vast amounts of knowledge and then tries to answer free-form questions posed to it by people. His post lays out his approach, which does not involve extracting data from online text.

“A lot of it is now on the web—in billions of pages of text. And with search engines, we can very efficiently search for specific terms and phrases in that text. But we can’t compute from that. And in effect, we can only answer questions that have been literally asked before. We can look things up, but we can’t figure anything new out.

So how can we deal with that? Well, some people have thought the way forward must be to somehow automatically understand the natural language that exists on the web. Perhaps getting the web semantically tagged to make that easier.

But armed with Mathematica and NKS I realized there’s another way: explicitly implement methods and models, as algorithms, and explicitly curate all data so that it is immediately computable.”

Nova Spivack took a look at Wolfram Alpha last week and thought that it could be “as important as Google”.

In a nutshell, Wolfram and his team have built what he calls a “computational knowledge engine” for the Web. OK, so what does that really mean? Basically it means that you can ask it factual questions and it computes answers for you.

It doesn’t simply return documents that (might) contain the answers, like Google does, and it isn’t just a giant database of knowledge, like the Wikipedia. It doesn’t simply parse natural language and then use that to retrieve documents, like Powerset, for example.

Instead, Wolfram Alpha actually computes the answers to a wide range of questions — like questions that have factual answers such as “What is the location of Timbuktu?” or “How many protons are in a hydrogen atom?,” “What was the average rainfall in Boston last year?,” “What is the 307th digit of Pi?,” “where is the ISS?” or “When was GOOG worth more than $300?”
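The “computes answers” distinction is easy to illustrate with one of the questions above. A query like “What is the 307th digit of Pi?” can’t be answered by retrieving a document; it has to be calculated. Here’s a minimal sketch in Python (our own illustration, not Wolfram Alpha’s implementation) using Machin’s formula with integer arithmetic:

```python
def arccot(x, unity):
    # arctan(1/x) scaled by `unity`, summed via the alternating Taylor series
    total = xpower = unity // x
    n, sign = 3, -1
    while xpower > 0:
        xpower //= x * x
        total += sign * (xpower // n)
        n, sign = n + 2, -sign
    return total

def pi_digits(ndigits):
    # Machin's formula: pi = 16*arctan(1/5) - 4*arctan(1/239)
    unity = 10 ** (ndigits + 10)          # 10 guard digits for rounding safety
    pi = 4 * (4 * arccot(5, unity) - arccot(239, unity))
    return str(pi // 10 ** 10)            # "3" followed by ndigits decimal digits

print(pi_digits(310)[307])  # the 307th decimal digit of pi
```

The point, of course, is that no amount of keyword search over existing pages produces this answer; it has to be computed on demand.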

Doug Lenat also had a chance to preview Wolfram Alpha and came away impressed:

“Stephen Wolfram generously gave me a two-hour demo of Wolfram Alpha last evening, and I was quite positively impressed. As he said, it’s not AI, and not aiming to be, so it shouldn’t be measured by contrasting it with HAL or Cyc but with Google or Yahoo.”

Doug’s review does a good job of sketching the differences he sees between Wolfram Alpha and systems like Google and Cyc.

Lenat’s description makes Wolfram Alpha sound like a variation on the Semantic Web vision, but one that is more like a giant closed database than a distributed Web of data. The system is set to launch in May 2009 and I’m eager to give it a try.


Big (linked?) data

February 8th, 2009

The Data Evolution blog has an interesting post that asks Is Big Data at a tipping point?. It suggests that we may be approaching a tipping point in which large amounts of online data will be interlinked and connected, suddenly producing a whole much larger than the sum of its parts.

“For the past several decades, an increasing number of business processes– from sales, customer service, shipping – have come online, along with the data they throw off. As these individual databases are linked, via common formats or labels, a tipping point is reached: suddenly, every part of the company organism is connected to the data center. And every action — sales lead, mouse click, and shipping update — is stored. The result: organizations are overwhelmed by what feels like a tsunami of data. The same trend is occurring in the larger universe of data that these organizations inhabit. Big Data unleashed by the “Industrial Revolution of Data”, whether from public agencies, non-profit institutes, or forward-thinking private firms.”

I expected that the post would soon segue into a discussion of the Semantic Web and maybe even the increasingly popular linked data movement, but it did not. Even so, it sets up plenty of nails for which we have an excellent hammer in hand. I really like this iceberg analogy, by the way.

“At present, much of the world’s Big Data is iceberg-like: frozen and mostly underwater. It’s frozen because format and meta-data standards make it hard to flow from one place to another: comparing the SEC’s financial data with that of Europe’s requires common formats and labels (ahem, XBRL) that don’t yet exist. Data is “underwater” when, whether reasons of competitiveness, privacy, or sheer incompetence it’s not shared: US medical records may contain a wealth of data, but much of it is on paper and offline (not so in Europe, enabling studies with huge cohorts).”

The post also points out some sources of online data and analysis tools, some familiar and some new to me (or maybe just forgotten).

“Yet there’s a slow thaw underway as evidenced by a number of initiatives: Aaron Swartz’s theinfo.org, Flip Kromer’s infochimps, Carl Malamud’s bulk.resource.org, as well as Numbrary, Swivel, Freebase, and Amazon’s public data sets. These are all ambitious projects, but the challenge of weaving these data sets together is still greater.”
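The “weaving data sets together” step is conceptually simple once common labels exist. A toy sketch in Python (all names and figures below are illustrative, not drawn from any of the projects listed):

```python
# Two tiny "data sets" from hypothetical separate sources that happen
# to share a common label: an ISO country code.
gdp_trillion_usd = {"US": 14.7, "FR": 2.9, "JP": 4.9}    # illustrative figures
population_million = {"US": 304, "FR": 64, "DE": 82}     # illustrative figures

# Linking is just a join on the shared label; records without a
# counterpart in the other set stay "underwater".
linked = {
    code: {
        "gdp_trillion_usd": gdp_trillion_usd[code],
        "population_million": population_million[code],
    }
    for code in gdp_trillion_usd.keys() & population_million.keys()
}
```

Real data sets rarely share labels this cleanly, which is exactly why common formats and identifiers (the post’s “ahem, XBRL”) matter so much.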


DHS wants to mine social media for terrorism-related data

January 5th, 2009

USA Today reports (Feds may mine blogs for terrorism clues) that the US Department of Homeland Security wants to use data-mining technology to search blogs and Internet message boards to find those used by terrorists to plan attacks.

“Blogging and message boards have played a substantial role in allowing communication among those who would do the United States harm,” DHS said in a recent notice.

Julian Sanchez notes on Ars Technica that the story is not new.

“The story is actually pegged to a Sources Sought Notice posted by the Department of Homeland Security back in October. Our colleagues at Wired reported on it at the time.”


NRC study questions use of datamining for counterterrorism

October 7th, 2008

The National Research Council released a report on the effectiveness of collecting and mining personal data, such as phone, medical, and travel records or Web sites visited, as a tool for combating terrorism. The report, titled Protecting Individual Privacy in the Struggle Against Terrorists: A Framework for Program Assessment, was produced by a multi-year study carried out at the request of DHS and NSF.

The NRC’s press release on the study notes that while routine datamining can help in “expanding and speeding traditional investigative work”, it questions the effectiveness of automated datamining and behavioral surveillance.

“Far more problematic are automated data-mining techniques that search databases for unusual patterns of activity not already known to be associated with terrorists, the report says. Although these methods have been useful in the private sector for spotting consumer fraud, they are less helpful for counterterrorism precisely because so little is known about what patterns indicate terrorist activity; as a result, they are likely to generate huge numbers of false leads. Such techniques might, however, have some value as secondary components of a counterterrorism system to assist human analysts. Actions such as arrest, search, or denial of rights should never be taken solely on the basis of an automated data-mining result, the report adds.
    The committee also examined behavioral surveillance techniques, which try to identify terrorists by observing behavior or measuring physiological states. There is no scientific consensus on whether these techniques are ready for use at all in counterterrorism, the report says; at most they should be used for preliminary screening, to identify those who merit follow-up investigation. Further, they have enormous potential for privacy violations because they will inevitably force targeted individuals to explain and justify their mental and emotional states.”
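The report’s point about false leads is essentially the base-rate problem, and a quick back-of-the-envelope calculation shows why (every number below is made up purely for illustration):

```python
# Illustrative base-rate calculation: even a very accurate detector
# produces mostly false leads when the target behavior is extremely rare.
population = 300_000_000        # people screened
detection_rate = 0.99           # detector catches 99% of actual terrorists
false_positive_rate = 0.001     # flags 0.1% of innocent people
actual_terrorists = 3_000       # assumed, for illustration only

true_hits = actual_terrorists * detection_rate                        # ~2,970
false_hits = (population - actual_terrorists) * false_positive_rate   # ~300,000
precision = true_hits / (true_hits + false_hits)
print(f"Fraction of flagged people who are real threats: {precision:.3%}")
```

With these (generous) numbers, under 1% of flagged individuals are actual threats — which is why the committee insists such techniques serve only as secondary aids to human analysts.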

The report suggests criteria and questions addressing both technical effectiveness and impact on privacy to help agencies and policymakers evaluate data-based counterterrorism programs. It also calls for oversight and both technical and policy safeguards to protect privacy and prevent “mission creep”. Declan McCullagh has a good summary of the key recommendations.

The 352-page report can be downloaded from the National Academies Press site for $37.00.


Twitterment, domain grabbing, and grad students who could have been rich!

July 8th, 2008

Here at Ebiquity, we’ve had a number of great grad students. One of them, Akshay Java, hacked out a search engine for twitter posts around early April last year and named it twitterment. He blogged about it here first. He did it without the benefit of the XMPP updates, by parsing the public timeline. It got talked about in the blogosphere (including by Scoble), got some press, and there was an article in the MIT Tech Review that used his visualization of some of the twitter links. It even got talked about in Wired’s blog, something we found out only yesterday. We were also told that three days after the post in Wired’s blog, someone somewhere registered the domain twitterment.com (I won’t feed them pagerank by linking!), and set up a page that looks very similar to Akshay’s. It has Google Adsense, and of course just passes the query to Google with a site restriction to twitter. So they’re poaching coffee and cookie money from the students in our lab 🙂
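For the curious, the core of such a service can be sketched in a few lines: poll the public timeline, tokenize each post, and keep an inverted index from terms to posts. This is just a minimal illustration of the idea, not Akshay’s actual code (which also handled crawling, persistence, and visualization):

```python
# Minimal inverted-index sketch for searching short posts.
from collections import defaultdict

index = defaultdict(set)   # term -> set of post ids
posts = {}                 # post id -> original text

def add_post(post_id, text):
    # Index a post fetched from the public timeline.
    posts[post_id] = text
    for term in text.lower().split():
        index[term].add(post_id)

def search(term):
    # Return the text of every post containing the (case-folded) term.
    return [posts[i] for i in index.get(term.lower(), ())]
```

The hard parts, as we learned, aren’t the indexing — they’re the bandwidth, storage, and uptime needed to keep pace with the stream.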

So of course we played with Akshay’s hack, hosted it on one of our university boxes for a few months, but didn’t really have the bandwidth, compute, or time resources to keep it up. Startups such as Summize appeared later and provided similar functionality. For the last week or two we’ve been moving the code of twitterment to Amazon’s cloud to restart the service. Of course, today comes the news that twitter might buy Summize, quasi-confirmed by Om Malik. Lesson to you grad students — if you come up with something clever, file an invention disclosure with your university’s tech transfer folks. And don’t listen to your advisors if they think that there isn’t a paper in what you’ve hacked — there may yet be a few million dollars in it 🙂


Our MURI grant gets some press

June 12th, 2008

A UMBC-led team recently won a MURI award from DoD to work on “Assured Information Sharing Lifecycle”. It is an interesting mix of work on new security models, policy-driven security systems, context awareness, privacy-preserving data mining, and social networking. The award really brings together many different strains of research in eBiquity, as well as some related research in our department. We’re just starting off, and excited about it. UMBC’s web page had a story about this, and more recently, GCN covered it.

The UMBC team is led by Tim Finin and includes several of us. The other participants are UIUC (led by Jiawei Han), Purdue (led by Elisa Bertino), UTSA (led by Ravi Sandhu), UT Dallas (led by Bhavani Thuraisingham), and Michigan (led by Lada Adamic).


Jiawei Han: Research Challenges In Data Mining, 10am 4/22 LH8 UMBC

April 21st, 2008

Jiawei Han will give a talk tomorrow, Research Challenges In Data Mining, at 10am in UMBC’s LH8 (1st floor of the ITE building). Here’s the abstract.

“Research in data mining has led to advanced knowledge discovery technologies and applications. In this talk, we will discuss some emerging research issues for advanced technologies and applications in data mining and discuss some recent progress in this direction, including (1) exploration of the power of pattern mining, (2) analysis of multidimensional, heterogeneous and evolving information network, (3) mining of fast changing data streams, (4) mining of moving object data, RFID data, and data from sensor networks, (5) spatiotemporal and multimedia data mining, (6) biological data mining, (7) text and Web mining, (8) data mining for software engineering and computer system analysis, and (9) data cube-oriented multidimensional online analytical analysis.”

The talk is part of a distinguished lecture series sponsored by the UMBC Information Systems Department. Here’s a flier.