Deep Representation of Lyrical Style and Semantics for Music Recommendation

In an increasingly mobile and connected world, digital music consumption has rapidly increased. More recently, faster and cheaper mobile bandwidth has given the average mobile user access to large troves of music through streaming services like Spotify and Google Music, which boast catalogs of tens of millions of songs. At this scale, effective music recommendation is an important part of user experience and music discovery. Collaborative filtering (CF), a popular technique used by recommendation systems, suffers from two major issues: popularity bias, which leads to a long tail, and cold start for new items. In such cases, recommendation systems supplement similarity measures with content features which, for music, are acoustic features extracted from a song’s audio and textual features from its metadata, tags and lyrics. Research in content-based music similarity has largely focused on the acoustic domain, while lyrical content has received little attention and has been limited to traditional Information Retrieval (IR) techniques. Lyrics contain information about the emotion and meaning conveyed in a song that cannot be easily extracted from the audio. This is especially important for lyrics-centric genres like Rap, which was also the most streamed genre in 2016. The goal of this dissertation is to explore and evaluate different lyrical content features that could be useful for content, context and emotion-based models for music recommendation systems.

With Rap as a model use case and a custom dataset comprising over 35,000 songs from over 500 Rap artists, this dissertation focuses on featurizing two main aspects of lyrics: their artistic style of composition and their semantic content. For lyrical style, phonetic representations of lyrics are used to match rhymed syllables and extract a suite of high-level rhyme density features of different types. These are augmented with literary features like the use of figurative language, profanity, and vocabulary strength, along with text statistics. In contrast to these engineered features, Convolutional Neural Networks (CNNs) are used to automatically learn, from raw syllable sequences, the rhyme patterns and other syllable statistics that are most relevant for a task like artist identification. For semantics, lyrics are represented using both traditional IR techniques like LSA and more recent neural embedding methods like doc2vec. In addition to the plain lyrics, their annotations are included to provide an extra layer of contextual information. Finally, to mitigate the long-tail and cold-start problems, these lyrical content features are used to map songs and artists to their corresponding points in the collaborative filtering based latent space using neural networks.
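As a rough illustration of how a phonetic representation can be turned into a rhyme feature, the sketch below computes a very simple end-rhyme density. It is not the dissertation's feature suite, which covers several rhyme types. The toy lexicon, the reduced vowel set, and the names `TOY_LEXICON`, `rhyme_tail` and `end_rhyme_density` are all assumptions made for this example; a real system would presumably use a full pronouncing dictionary or a grapheme-to-phoneme model.

```python
# Hypothetical ARPAbet-style toy lexicon, for illustration only.
TOY_LEXICON = {
    "cat": ["K", "AE", "T"],
    "hat": ["HH", "AE", "T"],
    "sat": ["S", "AE", "T"],
    "mat": ["M", "AE", "T"],
    "dog": ["D", "AO", "G"],
    "log": ["L", "AO", "G"],
    "the": ["DH", "AH"],
    "on": ["AA", "N"],
    "a": ["AH"],
}

# Reduced vowel inventory (assumption): enough for the toy lexicon above.
VOWELS = {"AA", "AE", "AH", "AO", "EH", "IH", "IY", "OW", "UW"}


def rhyme_tail(word):
    """Return the phonemes from the last vowel onward, or None if unknown."""
    phones = TOY_LEXICON.get(word.lower())
    if phones is None:
        return None
    for i in range(len(phones) - 1, -1, -1):
        if phones[i] in VOWELS:
            return tuple(phones[i:])
    return None


def end_rhyme_density(lines):
    """Fraction of consecutive line pairs whose final words share a rhyme tail."""
    tails = [rhyme_tail(line.split()[-1]) for line in lines if line.split()]
    pairs = list(zip(tails, tails[1:]))
    if not pairs:
        return 0.0
    rhymed = sum(1 for a, b in pairs if a is not None and a == b)
    return rhymed / len(pairs)


verse = [
    "the cat sat on a mat",
    "the dog on a hat",
    "a cat on the log",
    "the mat on a dog",
]
density = end_rhyme_density(verse)
```

Here the first and third consecutive line pairs rhyme ("mat"/"hat" and "log"/"dog") while the middle pair does not, so two of the three pairs count as rhymed. Matching only the final vowel and what follows it is a deliberate simplification; internal and multi-syllable rhymes would need syllable-level alignment.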

The usefulness of these lyrical style and semantic features is evaluated on three main tasks: artist identification, artist similarity, and song similarity. It is shown that both rhyme and literary features serve as strong indicators for identifying artists from lyrics, while feature learning methods like CNNs achieve comparable results. In addition to artist identification, which evaluates lyrical features in a purely content space, lyrical similarity between artists and between songs is compared to a real-world, collaborative filtering based recommendation system, and the results indicate a strong relationship between the way listeners consume music and lyrical content. For lyrical semantics, neural embedding methods significantly outperformed traditional LSA methods, and the inclusion of annotations improved song similarity measures. Finally, this dissertation is accompanied by a web application that is dedicated to visualizing all these extracted lyrical features and has been featured in a number of media outlets, most notably Vox, attn: and Metro.

cnn, convolutional neural network, learning, recommendation


University of Maryland, Baltimore County


UMBC ebiquity