This work aims to conduct a sentiment analysis across music genres. There are many ways to reach this goal, but it is important that the parameters and analysers we use fit our data. Sentiment analysis is a very large field, and we believe we used the libraries and functions that are most appropriate for the chosen dataset. This work can be divided into 4 parts:
musiXmatch dataset, the official lyrics collection for the Million Song Dataset, available at: http://labrosa.ee.columbia.edu/millionsong/musixmatch
In [32]:
#Importing libraries
%matplotlib inline
%load_ext autoreload
%autoreload 2
import numpy as np
import pandas as pd
import re
import nltk
import scipy
import sklearn
import sklearn.preprocessing
import gensim as gs
import pylab as pl
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Image
from nltk.corpus import stopwords
import os
import pickle
# internal imports
import helpers as HL
# Constants: PS! put in your own paths to the files
GLOVE_FOLDER = 'glove.twitter.27B'
GS_FOLDER = os.path.abspath("doesnt_matter" + "/../../../../" + "Machine_Learning/CD-433-Project-2/gensim_data_folder/") #PS: this path only matches my folder structure
GS_25DIM = GS_FOLDER + "/gensim_glove_vectors_25dim.txt"
GS_50DIM = GS_FOLDER + "/gensim_glove_vectors_50dim.txt"
GS_100DIM = GS_FOLDER + "/gensim_glove_vectors_100dim.txt"
GS_200DIM = GS_FOLDER + "/gensim_glove_vectors_200dim.txt"
For now we only use titles and artist names, so we can handle this part with the musiXmatch data alone. We download the data and put it into a dataframe with the musiXmatch ID (MXM_Tid) and the Million Song Dataset track ID (Tid). Because other data may be keyed by either identifier, we decide to keep both IDs. We are fully aware that having two IDs adds no information; it only makes it easier to merge other datasets later.
For now, we get 779,052 songs' artists and titles.
In [2]:
#Importing the text file in a DataFrame, removing exceptions (incomplete data)
matches = pd.read_table('Data/mxm_779k_matches.txt', error_bad_lines=False)
#Changing the column's title in order to be clearer
matches.columns = ['Raw']
#Getting the Tid
matches['Tid'] = matches['Raw'].str.split('<SEP>', expand=True)[0]
#Extracting artist names
matches['Artist_Name'] = matches['Raw'].str.split('<SEP>', expand=True)[1]
#Extracting titles
matches['Title'] = matches['Raw'].str.split('<SEP>', expand=True)[2]
#Extracting MXM_Tid
matches['MXM_Tid'] = matches['Raw'].str.split('<SEP>', expand=True)[3]
#Dropping header rows we do not need
matches = matches.drop(matches.index[:17])
#Dropping the column with raw data
matches = matches.drop('Raw', axis=1)
#set index Track ID
matches.set_index('Tid',inplace=True)
#Displaying results
display(matches.shape)
display(matches.head())
We download the text file from the Million Song Dataset website. It is provided as an additional feature of the dataset.
We merge the year dataset with the artists and song titles in the same dataframe.
In [3]:
#Loading the year of publication data, skipping incomplete data in order to avoid errors
years = pd.read_table('Data/tracks_per_year.txt', error_bad_lines=False)
#Changing the column's title in order to be clearer
years.columns = ['Raw']
#Getting the publication year
years['year'] = years['Raw'].str.split('<SEP>', expand=True)[0]
#Getting the Tid
years['Tid'] = years['Raw'].str.split('<SEP>', expand=True)[1]
#Dropping the raw data
years = years.drop('Raw', axis=1)
#set index Track ID
years.set_index('Tid',inplace=True)
#Appending the years to the original DataFrame
matches = pd.merge(matches, years, left_index=True, right_index=True)
In [4]:
#display the results
print(matches.shape)
display(matches.head())
We delete the rows without year information, which is why the dataframe now contains fewer songs. In order to be as complete and accurate as possible, we only consider full matches.
We will now append each genre to a specific track.
We download the data from the TagTraum dataset and merge it with our previous dataframe.
In [5]:
#Creating a DataFrame to store the genres:
GenreFrame = pd.read_table('Data/msd-topMAGD-genreAssignment.txt', names=['Tid', 'genre'])
#set index Track ID
GenreFrame.set_index('Tid',inplace=True)
#merge the new data with the previous dataframe
matches = pd.merge(GenreFrame, matches, left_index=True, right_index=True)
In [6]:
#Displaying results
print(matches.shape)
display(matches.head())
In [7]:
#Creating a DataFrame to store the location:
location = pd.read_csv('Data/artist_location.txt', sep="<SEP>",header=None,names=['ArtistID','Latitude','Longitude','Artist_Name','City'])
#Keep useful datas
location.drop(['ArtistID','City'],inplace=True,axis=1)
#matches = pd.merge(location, matches, on='Tid')
matches.reset_index(inplace=True)
matches = pd.merge(location, matches, on='Artist_Name')
matches.set_index('Tid',inplace = True)
In [8]:
#Displaying results
display(matches.head())
print(matches.shape)
We downloaded the training data file, which covers 30% of the whole dataset. It contains the list of the 5,000 most used words across the ... songs. From it we build two objects: the bag of 5,000 words (used later) and a dataframe with the ID of every song and its lyrics, which we merge with our previous dataframe.
The lyrics are presented as follows: [(id of word):(occurrences in song)], e.g. 2:24,5:47,...
We work with only 30% of the whole dataset because we use the musiXmatch dataset and this is the only part that is freely available.
The rest of the data is not free; you can check this page to verify: https://developer.musixmatch.com/plans
In [9]:
#import file
lyrics = pd.read_table('Data/mxm_dataset_train.txt', error_bad_lines=False)
#change name of the column
lyrics.columns = ['Raw_Training']
# keep the bag-of-words header line to use later
words_train = lyrics.iloc[16]
#drop the header rows
lyrics = lyrics[17:].copy()
# get TrackID and lyrics and put them in separate columns
def sortdata(x):
    splitted = x['Raw_Training'].split(',')
    x['Tid'] = splitted[0]
    #x['MXM_Tid'] = splitted[1]
    x['words_freq'] = splitted[2:]
    return x
#Apply the function to every row
lyrics = lyrics.apply(sortdata, axis=1)
lyrics = lyrics[['Tid','words_freq']]
In [10]:
#set index Track ID
lyrics.set_index('Tid',inplace=True)
#Appending the lyrics to the original DataFrame
matches = pd.merge(matches, lyrics, left_index=True, right_index=True)
In [11]:
#Displaying the results
print(matches.shape)
display(matches.head())
We create a function that takes the list of word IDs and their occurrences in one song: [(id of word):(occurrences in the song)], e.g. 2:24,5:47,...
and outputs all the corresponding words in a list.
For example: [1:2,2:5,3:3] gives us --> [i,i,the,the,the,the,the,you,you,you]
In [12]:
#get the data
bag_of_words = words_train
# clean the data and split it to create a list of 5000 words
bag_of_words = bag_of_words.str.replace('%','')
bag_of_words = bag_of_words.str.split(',')
display(bag_of_words.head())
In [13]:
#Defining a function
def create_text(words_freq):
    #create the final list of all words
    list_words = ''
    #iterate over every word id
    for compteur in words_freq:
        word = bag_of_words[0][int(compteur.split(':')[0])-1]
        times = int(compteur.split(':')[1])
        #Separating every word with a space to be able to work on it with libraries during part 2
        for i in range(times):
            list_words += ' ' + word + ' '
    return list_words
In [14]:
#Testing the function
print(create_text(lyrics.iloc[0]['words_freq']))
As is noticeable at each step, we lose data every time we merge datasets. We chose this approach because we only want to deal with complete information in order to stay coherent. We want to compare parameters between items, and we believe the analysis is less relevant if we consider a larger dataset that contains incomplete data.
We now have 38,513 songs, but for each one we have all the features that we want to use. We will analyse our data with different parameters, which is why it is important that each song provides every item. Later in the analysis we may use data from 1.4 (providing 103,401 songs) in order to get a broader overview.
In order to analyse songs, we will use sentiment analysis on the lyrics. We chose 2 key features: polarity and lexical complexity. Because we only use bags of words, parameters such as rhymes and structure are not captured, even though they should be taken into consideration when speaking of the whole complexity of lyrics.
VADER, which stands for Valence Aware Dictionary and sEntiment Reasoner, is a sentiment analysis package that provides a polarity score for a given word or sentence. It is known to be a very powerful tool, especially because it was trained on tweets, meaning that it takes into account much of modern vocabulary. This is especially relevant for our project because we deal with modern music, implying that the words used are as modern as the ones VADER was trained on. The fact that the sentiment analyser takes its roots from the same kind of vocabulary makes the analysis more relevant.
Polarity is expressed between -1 (negative polarity) and 1 (positive polarity).
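As a quick illustration of this scale (not part of our pipeline; the example sentences below are made up), the compound score returned by VADER can be inspected directly:
#Quick illustration only: the compound score is bounded in [-1, 1]
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
demo_analyser = SentimentIntensityAnalyzer()
#These example sentences are made up for the illustration
for sentence in ["I love this beautiful song", "I hate this terrible noise"]:
    scores = demo_analyser.polarity_scores(sentence)
    print(sentence, '->', scores['compound'])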
In [15]:
import nltk.sentiment.sentiment_analyzer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
In [16]:
#Defining the analyser
analyser = SentimentIntensityAnalyzer()
Because we want to know what type of audience a specific type of music is targeting, we need to analyse the complexity of the lyrics. We are aware that dividing an audience into social profiles is far beyond the scope of our analysis; we do not have enough sociological knowledge to categorize an audience in a precise way. This is the reason why we will use broad indicators. We want to know how complex a set of words is, and the only social assumption we make is that complexity is correlated with the age and the educational level of the audience.
We use the occurrence of each word in the whole dataset.
We import the most used words and their counts from the dataset in order to start a text-processing analysis.
Some metadata was provided with the dataset; the total word count is 55,163,335.
Because of the long-tail effect of language, we proceed with the first 50,000 words of the list. This reduces computing time when iterating on the full_word_list.
From the vocabulary we remove stopwords. They are mentioned too often, at every level of language, to be relevant for this analysis.
We then compute the percentage of occurrence, because it helps us when dealing with lyrics' complexity. From the occurrence percentage we derive a complexity weight: a word that is used a lot gets a low weight, while rarely used words get a high weight.
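As a small worked example of this weighting (the count below is made up, not taken from the dataset):
#Illustration only: a hypothetical word appearing 55,163 times out of 55,163,335 words in total
word_count_total = 55163335
example_count = 55163
occurrence_percentage = (example_count / word_count_total) * 100   #~0.1%
weight = 1 / occurrence_percentage                                 #~10: the rarer the word, the higher the weight
print(occurrence_percentage, weight)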
In [17]:
Word_count_total = 55163335
#Importing the data, putting it in a DataFrame
full_word_list = pd.read_table('Data/full_word_list.txt')
#Renaming the columns
full_word_list.columns = ['Word']
#Extracting word count
full_word_list['Count'] = pd.to_numeric(full_word_list['Word'].str.split('<SEP>', expand=True)[1])
#Extracted words that were used
full_word_list['Word'] = full_word_list['Word'].str.split('<SEP>', expand=True)[0]
#Dropping rows we will not use
full_word_list = full_word_list.drop(full_word_list.index[:6])
#Extracting the first 50,000 values, because the rest is not stemmed and not necessarily in English
full_word_list = full_word_list.head(50000)
#Removing English stop words
for word in full_word_list['Word']:
    if word in stopwords.words('english'):
        full_word_list = full_word_list[full_word_list.Word != word]
#Computing the percentage of occurrence:
full_word_list['Occurence_percentage'] = (full_word_list['Count']/ Word_count_total)*100
#Computing the weight of each word
full_word_list['Weight'] = 1/full_word_list['Occurence_percentage']
display(full_word_list.shape)
display(full_word_list.head())
Because they are encountered much less often in the dataset, words that are not in English would be ranked with a very high complexity. In addition to introducing a bias in the lexical complexity analysis, they would also cause trouble when treating the polarity, because the VADER library only analyses English words. We will use the NLTK library to remove each non-English word from the bags of words.
We first need to download the "wordnet" NLTK package:
In [18]:
import nltk
#Using the NLTK downloader to get wordnet
#nltk.download()
from nltk.corpus import wordnet as wn
In [19]:
for j in full_word_list.index:
    if not wn.synsets(full_word_list.Word[j]): #drop words that WordNet does not recognise as English
        full_word_list.drop(j, inplace=True)
In [20]:
full_word_list = full_word_list.sort_values('Weight', ascending=False)
display(full_word_list.head())
pickle.dump(full_word_list, open("full_word_list.pkl", "wb"))
In [21]:
# function to get the complexity of one song by averaging the weights of all its words
def complexity_Song(lyrics):
    #variable to store the sum of the weights of every word of the song
    sum_weight = 0
    #split the lyrics to get an array of words and not just one big string
    lyric = lyrics.split(' ')
    #filtering empty values
    lyric = list(filter(None, lyric))
    #Removing every English stopword from the given lyric
    lyric = [word for word in lyric if word not in stopwords.words('english')]
    for x in lyric:
        #Making sure that the word is present in the weight table
        if len(full_word_list.loc[full_word_list['Word'] == x]['Weight'].values) != 0:
            sum_weight += full_word_list.loc[full_word_list['Word'] == x]['Weight'].values
    return float(sum_weight/len(lyric))
This implementation is inspired by the TF-IDF algorithm: if a word occurs rarely in the dataset, it is less common in the language, which means its lexical complexity is higher.
English stopwords are very common in every sentence; they are used so routinely that they do not add anything relevant to the analysis. This is the reason why we take them out: our complexity analysis must focus on words that do not appear regularly.
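As a quick usage illustration of the function defined above (the printed value is only an example, not a result we report), the complexity of the first reconstructed lyric can be computed directly:
#Illustrative call of complexity_Song on the first song's reconstructed lyrics
sample_lyric = create_text(lyrics.iloc[0]['words_freq'])
print("Complexity of the first song: %.3f" % complexity_Song(sample_lyric))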
We need to go from word frequency to bags of words. Once this is done using our "create_text" function, we will use the polarity analyser.
In [22]:
#Resetting index
matches.reset_index(inplace=True)
#Initiating an empty column, in order to be able to iterate on it
matches['Bags_of_words'] = ''
#Getting all the textual data in the DataFrame
for i in matches.index:
    matches.at[i, 'Bags_of_words'] = create_text(matches.at[i, 'words_freq'])
#Because we now have all the initial data in our DataFrame, we store it as a pickle object
matches.to_pickle('full_table.pkl')
Now that we have the bags of words in the DataFrame, we can conduct the analysis. Let us first work with the polarity:
In [23]:
#taking out the pickle object
matches = pd.read_pickle('full_table.pkl')
#Applying the polarity analysis for the bags of words
for i in matches.index:
    matches.at[i, 'Polarity_score'] = analyser.polarity_scores(matches.at[i, 'Bags_of_words'])['compound']
In [24]:
display(matches.head(4))
In [25]:
sns.set(color_codes=True, style="darkgrid")
def polarity_graph_generator(Data_in, categorization):
    Data_in[categorization] = Data_in[categorization].astype('category')
    for cat in Data_in[categorization].cat.categories:
        Division = Data_in[(Data_in[categorization] == cat)]
        #Sorting values by polarity to create a graph
        Division = Division.sort_values('Polarity_score', ascending=False)
        #Resetting the index
        Division = Division.reset_index()
        #plotting the results
        sns_plot = sns.tsplot(Division['Polarity_score'], color='m').set_title('Polarity in {}'.format(cat))
        x = np.arange(len(Division['Polarity_score']))
        y = Division['Polarity_score']
        ax = sns_plot.axes
        ax.fill_between(x, 0, y)
        fig = sns_plot.get_figure()
        #Storing the graph (the Polarity_plots folder MUST BE CREATED BEFORE!)
        fig.savefig("Polarity_plots/{} polarity.png".format(cat))
        #Clearing the figure
        fig.clf()
    return
Having the data divided into genres is important for our analysis; however, we are still missing one key dimension to make our work relevant for social good: the topic that is addressed in the songs. We must be able to know which subject a song deals with; we can then aggregate the data per genre and understand how a particular genre handles a specific topic. A sketch of one possible scoring approach is given after the topic definitions below. For this part we are still considering two options:
In [27]:
#import global vectors from Stanford's pretrained set, trained on tweets; one can choose the desired dim=25,50,100,200
global_vectors = HL.load_gensim_global_vectors(GS_200DIM)
In [28]:
#Defining the topics
racism = ['racism', 'nigger', 'negro', 'race', 'racist', 'bigot', 'bigotry', 'apartheid', 'discrimination', 'segregation', 'unfairness', 'partiality', 'sectarianism', 'colored']
women = ['women','girl', 'daughter', 'mother', 'she', 'wife', 'aunt', 'gentlewoman', 'girlfriend', 'grandmother', 'matron', 'niece', 'spouse', 'miss', 'genre']
money = ['money','bill', 'capital', 'cash', 'check', 'fund', 'pay', 'payment', 'property', 'salary', 'wage', 'wealth', 'banknote', 'bankroll', 'bread', 'bucks', 'chips', 'coin', 'coinage', 'dough', 'finances', 'funds', 'gold', 'gravy', 'greenback', 'loot', 'pesos', 'ressources', 'riches', 'roll', 'silver', 'specie', 'treasure', 'wad', 'wherewithal']
revolution = ['revolution','change', 'overthrow', 'demand', 'freedom', 'war', 'movement', 'brotherhood', 'reform', 'radical', 'leadership']
politics = ['politics', 'president', 'governor', 'senator', 'campaigning','government','civics','electioneering','legislature','policy','political']
religion = ['religion', 'religious', 'religions', 'atheism', 'secular', 'islam', 'islamic', 'atheist', 'bible', 'christian', 'jew', 'muslim', 'theology', 'god', 'church', 'buddhism', 'hinduism','belief', 'pray', 'prayer', 'worship']
art = ['art', 'movie', 'singing', 'painting', 'ballet', 'theatre']
health = ['health', 'nutrition', 'medical', 'wellness', 'healthy', 'care', 'safety', 'fitness', 'obesity', 'cancer', 'sickness', 'disease']
# make lists so one can iterate through the topics
name_of_topics = ['racism', 'women', 'money', 'revolution', 'politics', 'religion', 'art', 'health']
words_defining_topics = [racism, women, money, revolution, politics, religion, art, health]
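The scoring itself is done by the helpers HL.vocabulary_calculate_topics and HL.score_songs (in helpers.py, not reproduced here). As a rough, hypothetical sketch of the underlying idea, assuming global_vectors behaves like a gensim KeyedVectors object, a word can be scored against each topic by averaging its cosine similarity to the topic's defining words:
#Hypothetical sketch, not the actual implementation in helpers.py
def sketch_topic_scores(word, words_defining_topics, name_of_topics, global_vectors):
    scores = {}
    for topic_name, topic_words in zip(name_of_topics, words_defining_topics):
        sims = []
        for topic_word in topic_words:
            try:
                #cosine similarity between the two word vectors
                sims.append(global_vectors.similarity(word, topic_word))
            except KeyError:
                #skip words that are missing from the pretrained vocabulary
                continue
        scores['topic_' + topic_name] = float(np.mean(sims)) if sims else 0.0
    return scores

#Example (illustration only): how close is 'church' to each topic?
print(sketch_topic_scores('church', words_defining_topics, name_of_topics, global_vectors))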
In [38]:
vocab_topics = HL.vocabulary_calculate_topics(words_defining_topics, name_of_topics, global_vectors)
In [39]:
visual_vocab = vocab_topics.copy(deep=True)
print(visual_vocab.shape)
display(visual_vocab.sort_values('topic_revolution',ascending=False).head(15))
In [40]:
# dataframe with songs
matches = pd.read_pickle('full_table.pkl')
In [41]:
display(matches.head(4))
In [43]:
matches, occz, totz = HL.score_songs(matches, vocab_topics)
In [45]:
print("Percentage of the songs that is not in the vocabulary: %.2f%%" % ((occz/totz)*100))
In [46]:
#Fetch the column names of the topics
column_names = [col for col in vocab_topics.columns if col.startswith('topic')]
#Binarize each topic score: 1 if above the column average, 0 otherwise
for col in column_names:
    col_average = np.mean(matches[col])
    print(col_average)
    matches[col] = matches[col].apply(lambda x: 1 if x > col_average else 0)
In [47]:
# this is how the dataframe looks after manipulation
visual_matches = matches.copy(deep=True)
display(visual_matches.sort_values('topic_racism',ascending=False).head(40))
In [ ]:
matches.to_excel("final_table.xls")
In [ ]: