Use case: Kanye West
This is a weekend stab at analyzing song lyrics; this time, Kanye West's songs.
We'll attempt the following: compute basic text metrics, explore top words and n-grams, score sentiment, generate new lyrics with a Markov chain, and group songs into topics with a simple NMF model.
Dependencies: Pandas, Numpy, Seaborn, NLTK, scikit-learn
In [1]:
import io
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(context='talk', style='ticks')
%matplotlib inline
Read the text file into a dataframe, splitting it into one song per row (songs are separated by blank lines).
In [2]:
with io.open('data/kanye_verses.txt', 'r', encoding='ascii', errors='ignore') as f:
    df = pd.DataFrame({'lyrics': f.read().split('\n\n')})
Take a look at the dataframe head
In [3]:
df.head()
Out[3]:
Take a closer look at a sample lyric. Each song has multiple lines, like sentences, with no paragraph breaks.
In [4]:
print(df.loc[0, 'lyrics'])
In [5]:
# characters, words, lines
df['#characters'] = df.lyrics.str.len()
df['#words'] = df.lyrics.str.split().str.len()
df['#lines'] = df.lyrics.str.split('\n').str.len()
df['#uniq_words'] = df.lyrics.apply(lambda x: len(set(x.split())))
df['lexical_density'] = df['#uniq_words'] / df['#words']
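As a sanity check on lexical density (the ratio of unique words to total words), here's a toy example, not from the corpus:

toy = 'la la la oh no no'
print(len(set(toy.split())) / len(toy.split()))  # 3 unique / 6 total = 0.5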
We now have 1 text column and 5 metrics.
In [6]:
df.head()
Out[6]:
Now that we have text metrics, here's a quick histogram of the spread of each metric.
In [7]:
df.hist(sharey=True, layout=(2, 3), figsize=(15, 8));
Alternatively, look at violin plots of the distributions.
In [8]:
cols_metrics = df.select_dtypes(include=[np.number]).columns
fig, axs = plt.subplots(ncols=len(cols_metrics), figsize=(16, 5))
for i, c in enumerate(cols_metrics):
    sns.violinplot(x=df[c], ax=axs[i])  # sharing is handled by plt.subplots, not violinplot
In [9]:
# Word length distribution
pd.Series(len(x) for x in ' '.join(df.lyrics).split()).value_counts().sort_index().plot(kind='bar', figsize=(12, 3))
Out[9]:
That moves us to look at the most frequently used words and phrases.
In [10]:
# top words
pd.Series(' '.join(df.lyrics).lower().split()).value_counts()[:20][::-1].plot(kind='barh')
Out[10]:
In [11]:
# top long words
pd.Series([w for w in ' '.join(df.lyrics).lower().split() if len(w) > 7]).value_counts()[:20][::-1].plot(kind='barh')
Out[11]:
In [12]:
from nltk import ngrams
In [13]:
def get_ngrams_from_series(series, n=2):
    # using nltk.ngrams; split lyrics into lines first so
    # n-grams never span a line break
    lines = ' '.join(series).lower().split('\n')
    lgrams = [ngrams(l.split(), n) for l in lines]
    grams = [[' '.join(g) for g in lg] for lg in lgrams]
    # flatten the list of lists
    return [item for sublist in grams for item in sublist]
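As a quick illustration of what nltk.ngrams yields, a toy sentence (not from the corpus):

list(ngrams('I am a god'.split(), 2))
# [('I', 'am'), ('am', 'a'), ('a', 'god')]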
In [14]:
# Top bi-grams
pd.Series(get_ngrams_from_series(df.lyrics, 2)).value_counts()[:20][::-1].plot(kind='barh')
Out[14]:
In [15]:
# Top tri-grams
pd.Series(get_ngrams_from_series(df.lyrics, 3)).value_counts()[:20][::-1].plot(kind='barh')
Out[15]:
In [16]:
# Top four-grams
pd.Series(get_ngrams_from_series(df.lyrics, 4)).value_counts()[:20][::-1].plot(kind='barh')
Out[16]:
In [17]:
# sentiment
import nltk
from nltk import sentiment
nltk.download('vader_lexicon')
Out[17]:
Load NLTK's built-in VADER sentiment analyzer.
In [18]:
senti_analyze = sentiment.vader.SentimentIntensityAnalyzer()
Try it on the first song's lyrics. It returns a dictionary of polarity scores ('neg', 'neu', 'pos', 'compound'); we'll use only the compound score.
In [19]:
senti_analyze.polarity_scores(df.lyrics[0])
Out[19]:
Apply it to all lyrics, then bucket the compound scores into ['negative', 'neutral', 'positive'] segments as well.
In [20]:
df['sentiment_score'] = pd.DataFrame(df.lyrics.apply(senti_analyze.polarity_scores).tolist())['compound']
df['sentiment'] = pd.cut(df['sentiment_score'], [-np.inf, -0.35, 0.35, np.inf], labels=['negative', 'neutral', 'positive'])
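A minimal sketch of how pd.cut buckets compound scores at the ±0.35 thresholds chosen above:

pd.cut(pd.Series([-0.9, 0.1, 0.8]), [-np.inf, -0.35, 0.35, np.inf],
       labels=['negative', 'neutral', 'positive'])
# -> [negative, neutral, positive]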
We now have 1 text column, 1 categorical dimension, and 6 metrics.
In [21]:
df.head()
Out[21]:
In [22]:
df[['sentiment_score']].hist(bins=25)
Out[22]:
In [23]:
corr = df.corr(numeric_only=True)  # numeric_only is needed on pandas >= 2.0 with text columns present
plt.figure(figsize=(8, 8))
sns.heatmap(corr, annot=True, fmt='.2f', square=True,
            xticklabels=corr.columns.values,
            yticklabels=corr.columns.values)
Out[23]:
In [24]:
sns.pairplot(df, hue='sentiment', hue_order=['neutral', 'positive', 'negative'],
plot_kws={'alpha': 0.5})
Out[24]:
In [25]:
# Songs with lower lexical density tend to have strong sentiments
df.plot.scatter(x='sentiment_score', y='lexical_density', s=df['#characters']/20,
c=np.where(df['lexical_density'].le(0.55), '#e41a1c', '#4c72b0'),
figsize=(15, 6))
Out[25]:
In [26]:
cols_metrics = ['lexical_density', '#uniq_words']
fig, axs = plt.subplots(figsize=(18, 6), ncols=len(cols_metrics))
for i, c in enumerate(cols_metrics):
    # lvplot was renamed to boxenplot in seaborn 0.9
    sns.boxenplot(x='sentiment', y=c, data=df, ax=axs[i], order=['neutral', 'positive', 'negative'])
Look at these same distributions with non-overlapping points (swarm plots).
In [27]:
cols_metrics = ['lexical_density', '#uniq_words']
fig, axs = plt.subplots(figsize=(18, 6), ncols=len(cols_metrics))
for i, c in enumerate(cols_metrics):
    sns.swarmplot(x='sentiment', y=c, data=df, ax=axs[i], order=['neutral', 'positive', 'negative'])
In [28]:
# Machine-generated lyrics using a Markov chain
import re
import random
from collections import defaultdict

class MarkovRachaita:
    def __init__(self, corpus='', order=2, length=8):
        self.order = order
        self.length = length
        self.words = re.findall("[a-z']+", corpus.lower())
        # map every `order`-word tuple to the words observed right after it
        self.states = defaultdict(list)
        for i in range(len(self.words) - self.order):
            self.states[tuple(self.words[i:i + self.order])].append(self.words[i + self.order])

    def gen_sentence(self, length=8, startswith=None):
        terms = None
        if startswith:
            # seed with the first state containing the start word, if any
            start_seed = [x for x in self.states.keys() if startswith in x]
            if start_seed:
                terms = list(start_seed[0])
        if terms is None:
            # otherwise seed from a random position in the corpus
            start_seed = random.randint(0, len(self.words) - self.order)
            terms = self.words[start_seed:start_seed + self.order]
        for _ in range(length):
            choices = self.states[tuple(terms[-self.order:])]
            if not choices:  # dead end: this state only occurs at the corpus end
                break
            terms.append(random.choice(choices))
        return ' '.join(terms)

    def gen_song(self, lines=10, length=8, length_range=None, startswith=None):
        song = []
        if startswith:
            song.append(self.gen_sentence(length=length, startswith=startswith))
            lines -= 1
        for _ in range(lines):
            sent_len = random.randint(*length_range) if length_range else length
            song.append(self.gen_sentence(length=sent_len))
        return '\n'.join(song)
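To see how the order-2 state table drives generation, a toy corpus (hypothetical, not from the lyrics):

toy = MarkovRachaita(corpus='we major we major we the best')
print(toy.states[('we', 'major')])  # ['we', 'we']: 'we' followed ('we', 'major') twice
# generation repeatedly samples a successor of the last two words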
In [29]:
kanyai = MarkovRachaita(corpus=' '.join(df.lyrics))
print(kanyai.gen_song(lines=10, length_range=[5, 10]))
In [30]:
print(kanyai.gen_song(lines=10, length_range=[5, 10], startswith='trump'))
We'll create a simplistic topic model using Non-negative Matrix Factorization (NMF) to group lyrics into topics; LDA is another quite popular alternative for topic modelling. We first turn the lyrics into a tf-idf matrix with TfidfVectorizer, then factorize that matrix with NMF.
In [31]:
# Song themes via Simplistic topic modelling
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
no_topics = 5
no_features = 50
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(df.lyrics)
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()  # get_feature_names() on scikit-learn < 1.0
# `alpha` was split into alpha_W/alpha_H in scikit-learn 1.0; alpha_H defaults to 'same'
nmf = NMF(n_components=no_topics, random_state=1, alpha_W=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)

def get_topics(model, feature_names, no_topwords):
    for topic_id, topic in enumerate(model.components_):
        print('topic %d:' % topic_id)
        print(' '.join([feature_names[i] for i in topic.argsort()[:-no_topwords - 1:-1]]))

# assign each song to its highest-weighted topic
s = pd.DataFrame(nmf.transform(tfidf)).idxmax(1)
List the derived topics, each represented as a list of top terms
In [32]:
# NMF topics
get_topics(nmf, tfidf_feature_names, 20)
In [33]:
# Top n-grams from the topics
topics = set(s)
fig, axs = plt.subplots(figsize=(18, 6), ncols=len(topics))
for i, v in enumerate(topics):
    dfsm = df.loc[s.eq(v), 'lyrics']
    ngram = pd.Series(get_ngrams_from_series(dfsm, 3)).value_counts()[:20][::-1]
    ngram.plot(kind='barh', ax=axs[i], title='Topic {} - {} lyrics'.format(v, s.eq(v).sum()))
plt.tight_layout()
df['topic'] = s.astype(str).radd('Topic ')
In [34]:
df.head()
Out[34]:
In [35]:
fig, axs = plt.subplots(figsize=(15, 6))
sns.swarmplot(x='topic', y='sentiment_score', data=df)
Out[35]:
In [36]:
fig, axs = plt.subplots(figsize=(15, 6))
sns.swarmplot(x='topic', y='lexical_density', hue='sentiment',
hue_order=['neutral', 'positive', 'negative'], data=df, dodge=True)
Out[36]:
With that, we've covered what we set out to do. Kanye's corpus shows good variation, with a mix of lexical densities and sentiments.