UMBC ebiquity

Improving Word Similarity by Augmenting PMI with Estimates of Word Polysemy

Authors: Lushan Han, Tim Finin, Paul McNamee, Anupam Joshi, and Yelena Yesha

Date: June 01, 2011

Abstract: Although pointwise mutual information (PMI) has become a commonly used word similarity measure, a clear understanding of how it works has been lacking. In this paper we explore how PMI differs from distributional similarity, and we introduce a novel metric, PMImax, that augments PMI with information about a word's number of senses. The coefficients of PMImax are determined empirically by maximizing a utility function based on the performance of automatic thesaurus generation. We show that PMImax outperforms traditional PMI in the application of automatic thesaurus generation and in word similarity benchmark datasets: human similarity ratings and TOEFL synonym questions. PMImax achieves a correlation coefficient comparable to the best knowledge-based approaches on the Miller-Charles similarity rating dataset.
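For readers unfamiliar with the baseline, the following is a minimal sketch of standard PMI, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), estimated from co-occurrence counts. It is not the paper's method: the PMImax formula and its coefficients are not given in this abstract and are not reproduced here. The function name `pmi_table`, the toy contexts, and the choice to count each context as one co-occurrence window are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(contexts):
    """Return plain PMI (base 2) for every word pair that co-occurs in some context.

    Illustrative only: each context (a list of words) is treated as one
    co-occurrence window, and probabilities are simple relative frequencies.
    """
    n = len(contexts)
    word_counts = Counter()
    pair_counts = Counter()
    for ctx in contexts:
        words = sorted(set(ctx))  # count each word once per context
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))  # pairs in sorted order

    table = {}
    for (a, b), c_ab in pair_counts.items():
        p_ab = c_ab / n
        p_a = word_counts[a] / n
        p_b = word_counts[b] / n
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table


# Hypothetical toy data, not taken from the paper.
contexts = [
    ["car", "automobile", "road"],
    ["car", "automobile"],
    ["car", "road"],
    ["bank", "money"],
]
scores = pmi_table(contexts)
for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```

Pairs are keyed in alphabetical order, so a lookup uses `scores[("automobile", "car")]` rather than `scores[("car", "automobile")]`.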

See also: Improving Word Similarity by Augmenting PMI with Estimates of Word Polysemy, IEEE Transactions on Knowledge and Data Engineering (TKDE), 2012.

Type: TechReport

Organization: Computer Science and Electrical Engineering

Institution: University of Maryland, Baltimore County

Tags: semantic similarity, pointwise mutual information, automatic thesaurus generation, corpus statistics

Related Projects:

Past Project: Graph of Relations