Results of the 2013 Semantic Textual Similarity task

March 25th, 2013

The results of the 2013 Semantic Textual Similarity task (STS) are out. We were happy to find that our system did very well on the core task, placing first out of the 35 participating teams. The three runs we submitted were ranked first, second and third in the overall summary score.

Congratulations are in order for Lushan Han and Abhay Kashyap, the two UMBC doctoral students whose research and hard work produced a very effective system.

The STS task

The STS core task is to take two sentences and to return a score between 0 and 5 representing how similar the sentences are, with a larger number meaning a higher similarity. Compared with word similarity, the definition of sentence similarity tends to be more difficult and different people may have different views.

The STS task provides a reasonable and interesting definition. More importantly, the Pearson correlation scores are about 0.90 [1] for human raters using Amazon Mechanical Turk on the 2012 STS gold standard datasets, almost same to inter-rater agreement level, 0.9026 [2], on the well-known Miller-Charles word similarity dataset. This shows that human raters largely agree on the definitions used in the scale.

  • 5: The sentences are completely equivalent, as they mean the same thing, e.g., “The bird is bathing in the sink” and “Birdie is washing itself in the water basin”.
  • 4: The sentences are mostly equivalent, but some unimportant details differ, e.g., “In May 2010, the troops attempted to invade Kabul” and “The US army invaded Kabul on May 7th last year, 2010”.
  • 3: The sentences are roughly equivalent, but some important information differs/missing, e.g., “John said he is considered a witness but not a suspect.” and “‘He is not a suspect anymore.’ John said.”
  • 2: The sentences are not equivalent, but share some details, e.g., “They flew out of the nest in groups” and “They flew into the nest together”.
  • 1: The sentences are not equivalent, but are on the same topic, e.g., “The woman is playing the violin” and “The young lady enjoys listening to the guitar”.
  • 0: The sentences are on different topics, e.g., “John went horse back riding at dawn with a whole group of friends” and “Sunrise at dawn is a magnificent view to take in if you wake up early enough for it”.

The STS datasets

There were 86 runs submitted from more than 35 teams. Each team could submit up to three runs over sentence pairs drawn from four datasets, which included the following.

  • Headlines (750 pairs): a collection of pairs of headlines mined from several news sources by European Media Monitor using the RSS feed, e.g., “Syrian rebels move command from Turkey to Syria” and “Free Syrian Army moves headquarters from Turkey to Syria”.
  • SMT (750 pairs): a collection with sentence pairs the DARPA GALE program, where one sentence is the output of a machine translation system and the other is a reference translation provided by a human, for example, “The statement, which appeared on a website used by Islamists, said that Al-Qaeda fighters in Islamic Maghreb had attacked three army centers in the town of Yakouren in Tizi-Ouzo” and the sentence “the pronouncement released that the mujaheddin of al qaeda in islamic maghreb countries attacked 3 stations of the apostates in city of aekorn in tizi ouzou , which was posted upon the web page used by islamists”.
  • OnWN (561 pairs): a collection of sentence pairs describing word senses, one from OntoNotes and another from WordNet, e.g., “the act of advocating or promoting something” and “the act of choosing or selecting”.
  • FNWN (189 pairs): a collection of pairs of sentences describing word senses, one from FrameNet and another from WordNet, for example: “there exist a number of different possible events that may happen in the future. in most cases, there is an agent involved who has to consider which of the possible events will or should occur. a salient_entity which is deeply involved in the event may also be mentioned” and “doing as one pleases or chooses;”.

Our three systems

We used a different system for each of our allowed runs, PairingWords, Galactus and Saiyan. While they shared a lot of the same infrastructure, each used a different mix of ideas and features.

  • ParingWords was built using hybrid word similarity features derived from LSA and WordNet. It used a simple algorithm to pair words/phrases in two sentences and compute the average of word similarity of the resulting pairs. It imposes penalties on words that are not matched with the words weighted by their PoS and log frequency. No training data is used. An online demonstration system is available to experiment with the underlying word similarity model used by this approach.
  • Galactus used unigrams, bigrams, trigrams and skip bigrams derived from the two sentences and paired them with the highest similarity based on exact string match, corpus and Wordnet based similarity metrics. These, along with contrast scores derived from antonym pairs, were used as features to train a support vector regression model to predict the similarity scores.
  • Saiyan was a fine tuned version of galactus which used domain specific features and training data to train a support vector regression model to predict the similarity scores. (Scores for FNWN was directly used from the PairingWords run.)

The results

Here’s how our three runs ranked (out of 86) on each of the four different data sets and on the overall task (mean).

  our three systems
dataset PairingWords Galactus Saiyan
Headlines 3 7 1
OnWN glosses 4 11 35
FNWN glosses 1 3 2
SMT 8 11 16
mean 1 2 3

Over the next two weeks we will write a short system paper for the *SEM 2013, the Second Joint Conference on Lexical and Computational Semantics.


[1] Eneko Agirre, Daniel Cer, Mona Diab and Gonzalez-Agirre Aitor. 2012. SemEval-2012 task 6: A pilot on semantic textual similarity. In Proc. 6th Int. Workshop on Semantic Evaluation (SemEval 2012), in conjunction with the First Joint Conf. on Lexical and Computational Semantics (*SEM 2012)., Montreal,Canada.

[2] P. Resnik, “Using information content to evaluate semantic similarity in a taxonomy,” in Proc. 14th Int. Joint Conf. on Artificial Intelligence, 1995.