UMBC ebiquity

Archive for May, 2012

NIST Big Data Workshop, 13-14 June 2012

May 31st, 2012, by Tim Finin, posted in cloud computing, Conferences

NIST will hold a Big Data Workshop 13-14 June 2012 in Gaithersburg to explore key national priority topics in support of the White House Big Data Initiative. The workshop is being held in collaboration with the NSF sponsored Center for Hybrid Multicore Productivity Research, a collaboration between UMBC, Georgia Tech and UCSD.

This first workshop will discuss examples from science, health, disaster management, security, and finance, as well as topics in emerging technology areas, including analytics and architectures. Two issues of special interest are identifying which core technologies for collecting, storing, preserving, managing, analyzing, and sharing big data could be standardized, and developing measurements to ensure the accuracy and robustness of big data methods.

The workshop format will be a mixture of sessions, panels, and posters. Session speakers and panel members are by invitation only, but all interested parties are encouraged to submit extended abstracts and/or posters.

The workshop is being held at NIST’s Gaithersburg facility and is free, although online pre-registration is required. A preliminary agenda is available, though it is subject to change as the workshop date approaches.

Google releases dataset linking strings and concepts

May 19th, 2012, by Tim Finin, posted in AI, Google, KR, NLP, Ontologies, Semantic Web, Wikipedia

Yesterday Google announced a very interesting resource with 175M short, unique text strings that were used to refer to one of 7.6M Wikipedia articles. This should be very useful for research on information extraction from text.

“We consider each individual Wikipedia article as representing a concept (an entity or an idea), identified by its URL. Text strings that refer to concepts were collected using the publicly available hypertext of anchors (the text you click on in a web link) that point to each Wikipedia page, thus drawing on the vast link structure of the web. For every English article we harvested the strings associated with its incoming hyperlinks from the rest of Wikipedia, the greater web, and also anchors of parallel, non-English Wikipedia pages. Our dictionaries are cross-lingual, and any concept deemed too fine can be broadened to a desired level of generality using Wikipedia’s groupings of articles into hierarchical categories.

The data set contains triples, each consisting of (i) text, a short, raw natural language string; (ii) url, a related concept, represented by an English Wikipedia article’s canonical location; and (iii) count, an integer indicating the number of times text has been observed connected with the concept’s url. Our database thus includes weights that measure degrees of association.”

The details of the data and how it was constructed are in an LREC 2012 paper by Valentin Spitkovsky and Angel Chang, A Cross-Lingual Dictionary for English Wikipedia Concepts. Get the data here.
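To see how such a dictionary might be used, here is a minimal Python sketch that loads (text, url, count) triples and looks up the concept most often linked from a given string. The tab-separated layout and the file name are assumptions made for illustration; check the paper and the released files for the actual format.

```python
# Minimal sketch of using the string-to-concept triples for lookup.
# Assumes a tab-separated file with one (text, url, count) triple per line;
# the real distribution format may differ; see the LREC 2012 paper.
from collections import defaultdict

def load_dictionary(path):
    """Map each surface string to its candidate Wikipedia URLs with counts."""
    candidates = defaultdict(dict)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue  # skip malformed lines
            text, url, count = parts
            candidates[text][url] = candidates[text].get(url, 0) + int(count)
    return candidates

def most_likely_concept(candidates, text):
    """Return the Wikipedia URL most often observed with this string, if any."""
    urls = candidates.get(text)
    return max(urls, key=urls.get) if urls else None

# Hypothetical usage (file name is an assumption):
# d = load_dictionary("dictionary.tsv")
# print(most_likely_concept(d, "Big Apple"))
```

The counts act as the association weights mentioned in the quoted description, so picking the highest-count URL for a string amounts to a simple most-popular-sense lookup.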

Google Knowledge Graph: first impressions

May 19th, 2012, by Tim Finin, posted in AI, Google, KR, NLP, Ontologies, Semantic Web

Google’s Knowledge Graph showed up for me this morning; it has been slowly rolling out since the announcement on Wednesday. It builds on lots of research from human language technology (e.g., entity recognition and linking) and the Semantic Web (graphs of linked data). The slogan, “things, not strings”, is brilliant and easily understood.

My first impression is that it’s fast, useful, and a great accomplishment, but it leaves lots of room for improvement and expansion. That last bit is a good thing, at least for those of us in the R&D community. Here are some comments based on initial experimentation.

GKG only works on searches that are simple entity mentions, like people, places, and organizations. It doesn’t do products (Toyota Camry), events (World War II), or diseases (diabetes), but it does recognize that ‘Mercury’ could be a planet or an element.

It’s a bit aggressive about linking: when searching for “John Smith” it zeros in on the 17th-century English explorer. Poor Professor Michael Jordan never gets a chance, and providing context by adding Berkeley just suppresses the GKG sidebar. “Mitt” goes right to you know who. “George Bush” does lead to a disambiguation sidebar, though. Given that GKG doesn’t seem to allow for context information, the only disambiguating evidence it has is popularity (i.e., PageRank).

Speaking of context, the GKG results seem not to draw on user-specific information, like my location or past search history. When I search for “Columbia” from my location here in Maryland, it suggests “Columbia University” and “Columbia, South Carolina” but not “Columbia, Maryland”, which is just five miles away from me.
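To make the popularity-only behavior concrete, here is a toy Python sketch of entity linking with a popularity prior and no context. The candidates and scores are made up for illustration and are not Google’s data or algorithm; the point is simply that a prior alone always resolves a mention like “John Smith” to the same, most popular entity, no matter what else is in the query or where the user is.

```python
# Toy illustration of popularity-only entity linking (hypothetical data).
# With no textual or user context, the prior alone decides, so a mention
# like "John Smith" always resolves to its most popular candidate.
CANDIDATES = {
    "John Smith": [
        ("John Smith (explorer)", 0.62),            # made-up popularity priors
        ("John Smith (Labour Party leader)", 0.21),
        ("John Smith (actor)", 0.17),
    ],
}

def link_by_popularity(mention):
    """Pick the candidate with the highest prior, ignoring any context."""
    candidates = CANDIDATES.get(mention, [])
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])[0]

print(link_by_popularity("John Smith"))  # always the explorer
```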

Places include not just GPEs (geo-political entities) but also locations (Mars, the Patapsco River) and facilities (MoMA, the Empire State Building). To the GKG, the White House is just a place.

Organizations seem like a weak spot. It recognizes schools (UCLA), but company mentions seem not to be handled directly, not even for “Google”. A search for “NBA” suggests three “people associated with NBA”, and “National Basketball Association” is not recognized. Forget finding out about the Cult of the Dead Cow.

Mike Bergman has some insights based on his exploration of the GKG in Deconstructing the Google Knowledge Graph.

The use of structured and semi-structured knowledge in search is an exciting area. I expect we will see much more of this showing up in search engines, including Bing.

Google Knowledge Graph: things, not strings

May 16th, 2012, by Tim Finin, posted in Google, Semantic Web

Google announced its “knowledge graph” today and describes it as “an intelligent model—in geek-speak, a ‘graph’ — that understands real-world entities and their relationships to one another: things, not strings. … It currently contains more than 500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects. And it’s tuned based on what people search for, and what we find out on the web.” Information from the knowledge graph will initially augment search results — the feature is already being rolled out to US English users. A short video explains more.
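For readers wondering what “things, not strings” means in practice, here is a toy Python sketch of a graph of entities and typed relationships. The class, example facts, and relation names are illustrative assumptions on my part, not Google’s actual model or data.

```python
# Toy "things, not strings" graph: entities as nodes, typed relationships
# as labeled edges. Purely illustrative; not Google's actual model or data.
from collections import defaultdict

class KnowledgeGraph:
    def __init__(self):
        self.edges = defaultdict(list)   # entity -> [(relation, entity), ...]

    def add_fact(self, subject, relation, obj):
        self.edges[subject].append((relation, obj))

    def facts_about(self, entity):
        return self.edges.get(entity, [])

kg = KnowledgeGraph()
kg.add_fact("Marie Curie", "born_in", "Warsaw")   # example facts, not Google's
kg.add_fact("Marie Curie", "field", "Physics")
kg.add_fact("Warsaw", "capital_of", "Poland")

for relation, obj in kg.facts_about("Marie Curie"):
    print("Marie Curie --%s--> %s" % (relation, obj))
```

The key idea is that the search term is resolved to a node (a thing), and the sidebar is then populated from the facts attached to that node rather than from matching strings on web pages.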

A CNET article quotes KG project manager Jack Menzel, who pitches the Knowledge Graph without using the word “semantic” even once. While he says, “I dream of the semantic Web,” he takes pains to point out that what Google is announcing today is not what people talk about when they discuss semantic Web concepts. “We do continue to work on how to make search semantic,” he says, “but talking about it brings out the crazy people.” I hope this did not come out the way he intended it to.
