Site devoted to linked open data development

March 26th, 2013


If you are interested in the semantic web and linked data, this looks like a site worth investigating.

“This is a landmark site for academia providing a single point of contact for linked open data development. It not only provides access to the know-how and tools to discuss and create linked data and data aggregation sites, but also enables access to, and the creation of, large aggregated data sets providing powerful and flexible collections of information. Here at we’re working to inform national standards and assist in the development of national data aggregation subdomains.”

Results of the 2013 Semantic Textual Similarity task

March 25th, 2013

The results of the 2013 Semantic Textual Similarity task (STS) are out. We were happy to find that our system did very well on the core task, placing first out of the 35 participating teams. The three runs we submitted were ranked first, second and third in the overall summary score.

Congratulations are in order for Lushan Han and Abhay Kashyap, the two UMBC doctoral students whose research and hard work produced a very effective system.

The STS task

The STS core task is to take two sentences and return a score between 0 and 5 representing how similar they are, with a larger number meaning greater similarity. Compared with word similarity, sentence similarity is harder to define, and different people may judge it differently.

The STS task provides a reasonable and interesting definition. More importantly, the Pearson correlation scores for human raters using Amazon Mechanical Turk on the 2012 STS gold standard datasets are about 0.90 [1], almost the same as the 0.9026 inter-rater agreement [2] on the well-known Miller-Charles word similarity dataset. This shows that human raters largely agree when applying the definitions used in the scale.
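Pearson correlation, the metric used to compare system scores against the gold standard, is simple to compute directly. Here is a minimal sketch; the gold and predicted scores at the bottom are made-up values for illustration, not actual STS data:

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical gold scores (0-5) and system predictions for five sentence pairs.
gold = [5.0, 4.2, 3.1, 1.0, 0.2]
pred = [4.8, 4.0, 2.5, 1.5, 0.4]
print(round(pearson(gold, pred), 3))  # prints a value close to 1.0
```

A score of 0.90 between two human raters, as reported for the 2012 gold standard, indicates that the 0-5 scale is applied quite consistently.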

  • 5: The sentences are completely equivalent, as they mean the same thing, e.g., “The bird is bathing in the sink” and “Birdie is washing itself in the water basin”.
  • 4: The sentences are mostly equivalent, but some unimportant details differ, e.g., “In May 2010, the troops attempted to invade Kabul” and “The US army invaded Kabul on May 7th last year, 2010”.
  • 3: The sentences are roughly equivalent, but some important information differs or is missing, e.g., “John said he is considered a witness but not a suspect.” and “‘He is not a suspect anymore.’ John said.”
  • 2: The sentences are not equivalent, but share some details, e.g., “They flew out of the nest in groups” and “They flew into the nest together”.
  • 1: The sentences are not equivalent, but are on the same topic, e.g., “The woman is playing the violin” and “The young lady enjoys listening to the guitar”.
  • 0: The sentences are on different topics, e.g., “John went horse back riding at dawn with a whole group of friends” and “Sunrise at dawn is a magnificent view to take in if you wake up early enough for it”.

The STS datasets

There were 86 runs submitted from more than 35 teams. Each team could submit up to three runs over sentence pairs drawn from the following four datasets.

  • Headlines (750 pairs): a collection of pairs of headlines mined from several news sources by European Media Monitor using the RSS feed, e.g., “Syrian rebels move command from Turkey to Syria” and “Free Syrian Army moves headquarters from Turkey to Syria”.
  • SMT (750 pairs): a collection of sentence pairs from the DARPA GALE program, where one sentence is the output of a machine translation system and the other is a reference translation provided by a human, for example, “The statement, which appeared on a website used by Islamists, said that Al-Qaeda fighters in Islamic Maghreb had attacked three army centers in the town of Yakouren in Tizi-Ouzo” and the sentence “the pronouncement released that the mujaheddin of al qaeda in islamic maghreb countries attacked 3 stations of the apostates in city of aekorn in tizi ouzou , which was posted upon the web page used by islamists”.
  • OnWN (561 pairs): a collection of sentence pairs describing word senses, one from OntoNotes and another from WordNet, e.g., “the act of advocating or promoting something” and “the act of choosing or selecting”.
  • FNWN (189 pairs): a collection of pairs of sentences describing word senses, one from FrameNet and another from WordNet, for example: “there exist a number of different possible events that may happen in the future. in most cases, there is an agent involved who has to consider which of the possible events will or should occur. a salient_entity which is deeply involved in the event may also be mentioned” and “doing as one pleases or chooses;”.

Our three systems

We used a different system for each of our allowed runs, PairingWords, Galactus and Saiyan. While they shared a lot of the same infrastructure, each used a different mix of ideas and features.

  • PairingWords was built using hybrid word similarity features derived from LSA and WordNet. It used a simple algorithm to pair words/phrases in the two sentences and computed the average word similarity of the resulting pairs, imposing penalties on unmatched words weighted by their part of speech and log frequency. No training data was used. An online demonstration system is available for experimenting with the underlying word similarity model used by this approach.
  • Galactus used unigrams, bigrams, trigrams and skip bigrams derived from the two sentences and paired them with the highest similarity based on exact string match, corpus and Wordnet based similarity metrics. These, along with contrast scores derived from antonym pairs, were used as features to train a support vector regression model to predict the similarity scores.
  • Saiyan was a fine-tuned version of Galactus that used domain-specific features and training data to train a support vector regression model to predict the similarity scores. (Scores for FNWN were taken directly from the PairingWords run.)
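The pairing-and-averaging idea behind PairingWords can be sketched roughly as follows. This is a simplified illustration, not the actual system: the `word_sim` function here is a trivial stand-in for the hybrid LSA/WordNet similarity model, and the flat per-word penalty stands in for the real PoS- and frequency-weighted penalties:

```python
def word_sim(w1, w2):
    """Stand-in for the hybrid LSA/WordNet word similarity model.
    Here: 1.0 for an exact (case-insensitive) match, 0.0 otherwise."""
    return 1.0 if w1.lower() == w2.lower() else 0.0

def sentence_sim(s1, s2, penalty=0.1):
    """Pair each word in s1 with its most similar word in s2, average
    the pair similarities, and penalize words in s2 left unpaired."""
    words1, words2 = s1.split(), s2.split()
    sims = []
    unmatched2 = set(range(len(words2)))
    for w1 in words1:
        j, best = max(enumerate(word_sim(w1, w2) for w2 in words2),
                      key=lambda p: p[1])
        sims.append(best)
        unmatched2.discard(j)
    score = sum(sims) / len(sims) if sims else 0.0
    # Simplified penalty: subtract a small amount per unpaired word in s2;
    # the real system weights this by part of speech and log frequency.
    return max(0.0, score - penalty * len(unmatched2))

print(sentence_sim("the bird is bathing in the sink",
                   "birdie is washing itself in the water basin"))
```

With a real word similarity model, near-synonyms like “bird” and “birdie” would contribute partial credit instead of zero, which is what makes the averaged pairing score informative.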

The results

Here’s how our three runs ranked (out of 86) on each of the four different data sets and on the overall task (mean).

  Ranks (out of 86) for our three systems

  dataset        PairingWords  Galactus  Saiyan
  Headlines            3           7        1
  OnWN glosses         4          11       35
  FNWN glosses         1           3        2
  SMT                  8          11       16
  mean                 1           2        3

Over the next two weeks we will write a short system paper for *SEM 2013, the Second Joint Conference on Lexical and Computational Semantics.


[1] Eneko Agirre, Daniel Cer, Mona Diab and Aitor Gonzalez-Agirre. 2012. SemEval-2012 Task 6: A pilot on semantic textual similarity. In Proc. 6th Int. Workshop on Semantic Evaluation (SemEval 2012), in conjunction with the First Joint Conf. on Lexical and Computational Semantics (*SEM 2012), Montreal, Canada.

[2] P. Resnik, “Using information content to evaluate semantic similarity in a taxonomy,” in Proc. 14th Int. Joint Conf. on Artificial Intelligence, 1995.

Microsoft Bing updates its Satori knowledge base

March 22nd, 2013

A post in Microsoft’s Bing blog, Understand Your World with Bing, announced that an update to their Satori knowledge base allows Bing to do a better job of identifying queries that mention a known entity, i.e., a person, place, or organization. Bing’s use of Satori parallels the efforts of Google and Facebook in developing graph-based knowledge bases to move from “strings” to “things”.

Microsoft is using data from Satori to provide “snapshots” with data about an entity when it detects a likely mention of it in a query. This is very similar to how Google is using its Knowledge Graph KB.

One interesting thing that Satori is now doing is importing data from LinkedIn — data that neither Google’s Knowledge Graph nor Facebook’s Open Graph has. Another difference is that Satori uses RDF as its native model, or at least appears to, based on this description from 2012.

See recent posts in Techcrunch and Search Engine Land for more information.

Google Reader, we hardly knew ye

March 13th, 2013

We felt a great disturbance in the Force, as if millions of voices suddenly cried out in terror and were about to be silenced. We fear something terrible is about to happen. We went to access the blogs we follow on Google Reader and found this.

Powering Down Google Reader
3/13/2013 04:06:00 PM

Posted by Alan Green, Software Engineer

We have just announced on the Official Google Blog that we will soon retire Google Reader (the actual date is July 1, 2013). We know Reader has a devoted following who will be very sad to see it go. We’re sad too.

There are two simple reasons for this: usage of Google Reader has declined, and as a company we’re pouring all of our energy into fewer products. We think that kind of focus will make for a better user experience.

To ensure a smooth transition, we’re providing a three-month sunset period so you have sufficient time to find an alternative feed-reading solution. If you want to retain your Reader data, including subscriptions, you can do so through Google Takeout.

Thank you again for using Reader as your RSS platform.
Labels: reader, sunset

Where is old Bloglines now that we need him again? We should not have been so disloyal.

Memoto lifelogging camera

March 9th, 2013

Memoto is a $279 lifelogging camera that takes a geotagged photo every 30 seconds, holds 6K photos, and runs for several days without recharging. Memoto is made by a Swedish company, initially funded via Kickstarter, that expects to start shipping the wearable camera in April 2013. The company will also offer “safe and secure infinite photo storage at a flat monthly fee, which will always be a lot more affordable than hard drives.”

The lifelogging idea has been around for many years but has yet to become popular. One reason is privacy concerns. DARPA’s IPTO office, for example, started a LifeLog program in 2004 that was almost immediately canceled after criticism from civil libertarians concerning the privacy implications of the system.

Google Wikilinks corpus

March 8th, 2013

Google released the Wikilinks Corpus, a collection of 40M disambiguated mentions from 10M web pages to 3M Wikipedia pages. This data can be used to train systems that do entity linking and cross-document co-reference, problems that Google researchers attacked with an earlier version of this data (see Large-Scale Cross-Document Coreference Using Distributed Inference and Hierarchical Models).

You can download the data as ten 175MB files, along with some additional tools, from UMASS.

This is yet another example of the important role that Wikipedia continues to play in building a common, machine useable semantic substrate for human conceptualizations.

Google Sets Live

March 7th, 2013

Google Sets was the result of an early Google research project that ended in 2011. The idea was to recognize the similarity of a set of terms (e.g., python, lisp and fortran) and automatically identify other similar terms (e.g., c, java, php). Surprisingly (to me), the results of the project live on as an undocumented feature in Google Docs spreadsheets. Try putting a few of the seven deadly sins into a Google spreadsheet and use the feature to see what else you should not do (e.g., creating fire with alchemy, I guess).

Google, of course, continues to work on expanding its use of semantic information, currently through efforts like the Google Knowledge Graph, Freebase, Microdata and Fusion Tables. Other companies, including Microsoft, IBM and a host of startups, are also hard at work on similar projects.