Improving Word Similarity by Augmenting PMI with Estimates of Word Polysemy


Although pointwise mutual information (PMI) has become a commonly used word similarity measure, a clear understanding of how it works has been lacking. In this paper we explore how PMI differs from distributional similarity, and we introduce a novel metric, PMImax, that augments PMI with information about a word's number of senses. The coefficients of PMImax are determined empirically by maximizing a utility function based on the performance of automatic thesaurus generation. We show that PMImax outperforms traditional PMI both in automatic thesaurus generation and on two word similarity benchmarks: human similarity ratings and TOEFL synonym questions. On the Miller-Charles similarity rating dataset, PMImax achieves a correlation coefficient comparable to that of the best knowledge-based approaches.
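For readers unfamiliar with the baseline measure, the following is a minimal Python sketch of standard PMI, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), estimated from sentence-level co-occurrence counts. It illustrates only traditional PMI, not PMImax, whose form and coefficients are not given in this abstract. The function name `pmi_table`, the toy corpus, and the use of a sentence as the co-occurrence window are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(sentences):
    """Return PMI (base 2) for every word pair that co-occurs in a sentence.

    Assumption: the co-occurrence window is one whole sentence, and
    probabilities are estimated as fractions of sentences.
    """
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        words = set(sentence)  # count each word once per sentence
        word_counts.update(words)
        pair_counts.update(frozenset(p) for p in combinations(sorted(words), 2))

    n = len(sentences)
    scores = {}
    for pair, joint in pair_counts.items():
        x, y = tuple(pair)
        p_xy = joint / n
        p_x = word_counts[x] / n
        p_y = word_counts[y] / n
        scores[pair] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hypothetical toy corpus, for illustration only.
corpus = [
    ["car", "automobile", "road"],
    ["car", "automobile"],
    ["car", "road"],
    ["tree", "forest"],
]
scores = pmi_table(corpus)
for pair, value in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(sorted(pair), round(value, 3))
```

In this toy corpus the pair that occurs only once ("tree", "forest") receives a higher score than the more frequent pair ("car", "automobile"), which reflects the well-known tendency of plain PMI to favor low-frequency words.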

See Improving Word Similarity by Augmenting PMI with Estimates of Word Polysemy, TKDE, 2012.

automatic thesaurus generation, corpus statistics, pointwise mutual information, semantic similarity

TechReport

University of Maryland, Baltimore County

Computer Science and Electrical Engineering

UMBC ebiquity

Past Projects

  1. Graph of Relations