Improving Word Similarity by Augmenting PMI with Estimates of Word Polysemy

Lushan Han; Tim Finin; Paul McNamee; Anupam Joshi; Yelena Yesha

June 1, 2011

Although pointwise mutual information (PMI) has become a commonly used word similarity measure, a clear understanding of how it works has been lacking. In this paper we explore how PMI differs from distributional similarity, and we introduce a novel metric, PMImax, that augments PMI with information about a word's number of senses. The coefficients of PMImax are determined empirically by maximizing a utility function based on the performance of automatic thesaurus generation. We show that PMImax outperforms traditional PMI both in automatic thesaurus generation and on two word similarity benchmarks: human similarity ratings and TOEFL synonym questions. On the Miller-Charles similarity rating dataset, PMImax achieves a correlation coefficient comparable to that of the best knowledge-based approaches.
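For readers unfamiliar with the baseline, the following is a minimal sketch of standard PMI, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), estimated from sentence-level co-occurrence counts. It does not implement PMImax, whose coefficients and sense-count correction are defined in the paper itself. The function name `pmi_table`, the toy corpus, and the choice of the sentence as co-occurrence window are assumptions made only for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(sentences):
    """Return {(w1, w2): PMI in bits} for word pairs sharing a sentence.

    Hypothetical helper: probabilities are estimated as the fraction of
    sentences containing a word (or both words of a pair).
    """
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in sentences:
        unique = sorted(set(tokens))  # count each word once per sentence
        word_counts.update(unique)
        pair_counts.update(combinations(unique, 2))  # keys are alphabetically ordered pairs
    n = len(sentences)
    return {
        (a, b): math.log2((c / n) / ((word_counts[a] / n) * (word_counts[b] / n)))
        for (a, b), c in pair_counts.items()
    }


# Toy corpus (invented for this example, not data from the paper).
corpus = [
    ["car", "auto"],
    ["car", "auto"],
    ["dog", "cat"],
    ["dog", "bone"],
]
table = pmi_table(corpus)
print(table[("auto", "car")])
```

Pairs that never co-occur are simply absent from the table, which sidesteps taking the logarithm of zero; a real system would also need frequency cutoffs or smoothing, since PMI is known to overrate rare words.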

See Improving Word Similarity by Augmenting PMI with Estimates of Word Polysemy, TKDE, 2012.

Tags: automatic thesaurus generation, corpus statistics, pointwise mutual information, semantic similarity

Type: TechReport

Publisher: University of Maryland, Baltimore County

Organization: Computer Science and Electrical Engineering