Improving Word Similarity by Augmenting PMI with Estimates of Word Polysemy
June 1, 2011
Although pointwise mutual information (PMI) has become a commonly used word similarity measure, a clear understanding of how it works has been lacking. In this paper we explore how PMI differs from distributional similarity, and we introduce a novel metric, PMImax, that augments PMI with information about a word's number of senses. The coefficients of PMImax are determined empirically by maximizing a utility function based on the performance of automatic thesaurus generation. We show that PMImax outperforms traditional PMI on automatic thesaurus generation and on two word similarity benchmarks: human similarity ratings and TOEFL synonym questions. On the Miller-Charles similarity rating dataset, PMImax achieves a correlation coefficient comparable to that of the best knowledge-based approaches.
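For readers unfamiliar with the baseline, the following is a minimal sketch of standard PMI computed from co-occurrence counts. It is not PMImax: the abstract does not give the PMImax formula or its coefficients, so none are shown. The function name `pmi`, its parameters, and the choice of base-2 logarithm are assumptions made for illustration only.

```python
import math


def pmi(pair_count, count_x, count_y, total):
    """Standard PMI from raw counts: log2(p(x, y) / (p(x) * p(y))).

    Illustrative sketch only; the argument names are hypothetical.
    pair_count: number of contexts in which x and y co-occur
    count_x, count_y: number of contexts containing x and y, respectively
    total: total number of contexts observed
    """
    if pair_count == 0:
        # Words that never co-occur get PMI of negative infinity.
        return float("-inf")
    p_xy = pair_count / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))
```

A positive value means the two words co-occur more often than they would if they were independent, a value near zero means roughly chance-level co-occurrence, and a negative value means they co-occur less often than chance.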
University of Maryland, Baltimore County
Computer Science and Electrical Engineering