This work aims to conduct a sentiment analysis across music genres. There are many ways to reach this goal, but it is important that the parameters and analysers we use fit our data. Sentiment analysis is a very large field, and we believe we used the libraries and functions that are most appropriate for the chosen dataset. This work can be divided into 4 parts:
musiXmatch dataset, the official lyrics collection for the Million Song Dataset, available at: http://labrosa.ee.columbia.edu/millionsong/musixmatch
In [32]:
#Importing libraries
%matplotlib inline
%load_ext autoreload
%autoreload 2
import numpy as np
import pandas as pd
import re
import nltk
import scipy
import sklearn
import sklearn.preprocessing
import gensim as gs
import pylab as pl
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Image
from nltk.corpus import stopwords
import os
import pickle
# internal imports
import helpers as HL
# Constants: PS! put in your own paths to the files
GLOVE_FOLDER = 'glove.twitter.27B'
GS_FOLDER = os.path.abspath("doesnt_matter" + "/../../../../" + "Machine_Learning/CD-433-Project-2/gensim_data_folder/") #PS: this path only matches my folder structure
GS_25DIM = GS_FOLDER + "/gensim_glove_vectors_25dim.txt"
GS_50DIM = GS_FOLDER + "/gensim_glove_vectors_50dim.txt"
GS_100DIM = GS_FOLDER + "/gensim_glove_vectors_100dim.txt"
GS_200DIM = GS_FOLDER + "/gensim_glove_vectors_200dim.txt"
For now we only use titles and artist names, so we can handle this part with the musiXmatch data alone. We download the data and put it into a dataframe with the musiXmatch ID (MXM_Tid) and the Million Song Dataset track ID (Tid). Because other data may be keyed by either identifier, we decide to keep both IDs. We are fully aware that having two IDs adds no information; it only makes it easier to merge other datasets later.
For now, we get 779,052 songs' artists and titles.
In [2]:
#Importing the text file in a DataFrame, removing exceptions (incomplete data)
matches = pd.read_table('Data/mxm_779k_matches.txt', error_bad_lines=False)
#Changing the column's title in order to be clearer
matches.columns = ['Raw']
#Getting the Tid
matches['Tid'] = matches['Raw'].str.split('<SEP>', expand=True)[0]
#Extracting artist names
matches['Artist_Name'] = matches['Raw'].str.split('<SEP>', expand=True)[1]
#Extracting titles
matches['Title'] = matches['Raw'].str.split('<SEP>', expand=True)[2]
#Extracting MXM_Tid
matches['MXM_Tid'] = matches['Raw'].str.split('<SEP>', expand=True)[3]
#Dropping header rows we do not need
matches = matches.drop(matches.index[:17])
#Dropping the column with raw data
matches = matches.drop('Raw', axis=1)
#set index Track ID
matches.set_index('Tid',inplace=True)
#Displaying results
display(matches.shape)
display(matches.head())
We download the text file from the Million Song Dataset website. It is provided as an additional feature of the dataset.
We merge the year dataset with the artists and song titles in the same dataframe.
In [3]:
#Loading the year of publication data, skipping incomplete data in order to avoid errors
years = pd.read_table('Data/tracks_per_year.txt', error_bad_lines=False)
#Changing the column's title in order to be clearer
years.columns = ['Raw']
#Getting the publication year
years['year'] = years['Raw'].str.split('<SEP>', expand=True)[0]
#Getting the Tid
years['Tid'] = years['Raw'].str.split('<SEP>', expand=True)[1]
#Dropping the raw data
years = years.drop('Raw', axis=1)
#set index Track ID
years.set_index('Tid',inplace=True)
#Appending the years to the original DataFrame
matches = pd.merge(matches, years, left_index=True, right_index=True)
In [4]:
#display the results
print(matches.shape)
display(matches.head())
We delete the rows without year information, which is why the dataframe now contains fewer songs. In order to be as complete and accurate as possible, we only consider full matches.
We will now append each genre to a specific track.
We download the data from the TagTraum dataset and merge it with our previous dataframe.
In [5]:
#Creating a DataFrame to store the genres:
GenreFrame = pd.read_table('Data/msd-topMAGD-genreAssignment.txt', names=['Tid', 'genre'])
#set index Track ID
GenreFrame.set_index('Tid',inplace=True)
#merge the new data with the previous dataframe
matches = pd.merge(GenreFrame, matches, left_index=True, right_index=True)
In [6]:
#Displaying results
print(matches.shape)
display(matches.head())
In [7]:
#Creating a DataFrame to store the location:
location = pd.read_csv('Data/artist_location.txt', sep="<SEP>",header=None,names=['ArtistID','Latitude','Longitude','Artist_Name','City'])
#Keep useful datas
location.drop(['ArtistID','City'],inplace=True,axis=1)
#matches = pd.merge(location, matches, on='Tid')
matches.reset_index(inplace=True)
matches = pd.merge(location, matches, on='Artist_Name')
matches.set_index('Tid',inplace = True)
In [8]:
#Displaying results
display(matches.head())
print(matches.shape)
We downloaded the training data file, which covers 30% of the whole dataset. It contains the list of the 5,000 most used words across the ... songs. From it we build two objects: the bag of 5,000 words (used later) and a dataframe with the ID of every song and its lyrics, which we merge with our previous dataframe.
The lyrics are presented as follows: [(id of word):(occurrences in song)], e.g. 2:24,5:47,...
We work with only 30% of the whole dataset because we use the musiXmatch dataset and this is the only part that is freely available.
The rest of the data is not free; you can check this page to verify: https://developer.musixmatch.com/plans
In [9]:
#import file
lyrics = pd.read_table('Data/mxm_dataset_train.txt', error_bad_lines=False)
#change name of the column
lyrics.columns = ['Raw_Training']
# keep the bag-of-words header line to use later
words_train = lyrics.iloc[16]
#drop the header rows
lyrics = lyrics[17:].copy()
# get TrackID and lyrics and put them in separate columns
def sortdata(x):
    splitted = x['Raw_Training'].split(',')
    x['Tid'] = splitted[0]
    #x['MXM_Tid'] = splitted[1]
    x['words_freq'] = splitted[2:]
    return x
#Apply the function to every row
lyrics = lyrics.apply(sortdata, axis=1)
lyrics = lyrics[['Tid','words_freq']]
In [10]:
#set index Track ID
lyrics.set_index('Tid',inplace=True)
#Appending the lyrics to the original DataFrame
matches = pd.merge(matches, lyrics, left_index=True, right_index=True)
In [11]:
#Displaying the results
print(matches.shape)
display(matches.head())
We create a function that takes the list of word IDs and their occurrences in one song: [(id of word):(occurrences in the song)], e.g. 2:24,5:47,...
and outputs all the corresponding words in a list.
For example: [1:2,2:5,3:3] gives us --> [i,i,the,the,the,the,the,you,you,you]
In [12]:
#get the data
bag_of_words = words_train
# clean the data and split it to create a list of 5000 words
bag_of_words = bag_of_words.str.replace('%','')
bag_of_words = bag_of_words.str.split(',')
display(bag_of_words.head())
In [13]:
#Defining a function
def create_text(words_freq):
    #create the final list of all words
    list_words = ''
    #iterate over every word id
    for compteur in words_freq:
        word = bag_of_words[0][int(compteur.split(':')[0])-1]
        times = int(compteur.split(':')[1])
        #Separating every word with a space to be able to work on it with libraries during part 2
        for i in range(times):
            list_words += ' ' + word + ' '
    return list_words
In [14]:
#Testing the function
print(create_text(lyrics.iloc[0]['words_freq']))
As is noticeable at each step, we lose data every time we merge datasets. We chose this approach because we only want to deal with complete information in order to stay coherent. We want to compare parameters between items, and we believe the analysis is less relevant if we consider a larger dataset that contains incomplete data.
We now have 38,513 songs, but for each one we have all the features that we want to use. We will analyse our data with different parameters, which is why it is important that each song provides every item. Later in the analysis we may use data from 1.4 (providing 103,401 songs) in order to get a broader overview.
In order to analyse songs, we will use sentiment analysis on the lyrics. We chose 2 key features: polarity and lexical complexity. Because we only use bags of words, parameters such as rhymes and structure are not captured, even though they should be taken into consideration when speaking of the whole complexity of lyrics.
VADER, which stands for Valence Aware Dictionary and sEntiment Reasoner, is a sentiment analysis package that provides a polarity score for a given word or sentence. It is known to be a very powerful tool, especially because it was trained on tweets, meaning that it takes into account much of modern vocabulary. This is especially relevant for our project because we deal with modern music, implying that the words used are as modern as the ones VADER was trained on. The fact that the sentiment analyser takes its roots from the same kind of vocabulary makes the analysis more relevant.
Polarity is expressed between -1 (negative polarity) and 1 (positive polarity).
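As a quick illustration of this scale (not part of our pipeline; the example sentences below are made up), the compound score returned by VADER can be inspected directly:
#Quick illustration only: the compound score is bounded in [-1, 1]
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
demo_analyser = SentimentIntensityAnalyzer()
#These example sentences are made up for the illustration
for sentence in ["I love this beautiful song", "I hate this terrible noise"]:
    scores = demo_analyser.polarity_scores(sentence)
    print(sentence, '->', scores['compound'])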
In [15]:
import nltk.sentiment.sentiment_analyzer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
In [16]:
#Defining the analyser
analyser = SentimentIntensityAnalyzer()
Because we want to know what type of audience a specific type of music is targeting, we need to analyse the complexity of the lyrics. We are aware that dividing an audience into social profiles is far beyond the scope of our analysis; we do not have enough sociological knowledge to categorize an audience in a precise way. This is the reason why we will use broad indicators. We want to know how complex a set of words is, and the only social assumption we make is that complexity is correlated with the age and the educational level of the audience.
We use the occurrence of each word in the whole dataset.
We import the most used words and their counts from the dataset in order to start a text-processing analysis.
Some metadata was provided with the dataset; the total word count is 55,163,335.
Because of the long-tail effect of language, we proceed with the first 50,000 words of the list. This reduces computing time when iterating on the full_word_list.
From the vocabulary we remove stopwords. They are mentioned too often, at every level of language, to be relevant for this analysis.
We then compute the percentage of occurrence, because it helps us when dealing with lyrics' complexity. From the occurrence percentage we derive a complexity weight: a word that is used a lot gets a low weight, while rarely used words get a high weight.
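As a small worked example of this weighting (the count below is made up, not taken from the dataset):
#Illustration only: a hypothetical word appearing 55,163 times out of 55,163,335 words in total
word_count_total = 55163335
example_count = 55163
occurrence_percentage = (example_count / word_count_total) * 100   #~0.1%
weight = 1 / occurrence_percentage                                 #~10: the rarer the word, the higher the weight
print(occurrence_percentage, weight)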
In [17]:
Word_count_total = 55163335
#Importing the data, putting it in a DataFrame
full_word_list = pd.read_table('Data/full_word_list.txt')
#Renaming the columns
full_word_list.columns = ['Word']
#Extracting word count
full_word_list['Count'] = pd.to_numeric(full_word_list['Word'].str.split('<SEP>', expand=True)[1])
#Extracted words that were used
full_word_list['Word'] = full_word_list['Word'].str.split('<SEP>', expand=True)[0]
#Dropping rows we will not use
full_word_list = full_word_list.drop(full_word_list.index[:6])
#Extracting the first 50,000 values, because the rest is not stemmed and not necessarily in English
full_word_list = full_word_list.head(50000)
#Removing English stop words
for word in full_word_list['Word']:
    if word in stopwords.words('english'):
        full_word_list = full_word_list[full_word_list.Word != word]
#Computing the percentage of occurrence:
full_word_list['Occurence_percentage'] = (full_word_list['Count']/ Word_count_total)*100
#Computing the weight of each word
full_word_list['Weight'] = 1/full_word_list['Occurence_percentage']
display(full_word_list.shape)
display(full_word_list.head())
Because they are encountered much less often in the dataset, words that are not in English would be ranked with a very high complexity. In addition to introducing a bias in the lexical complexity analysis, they would also cause trouble when treating the polarity, because the VADER library only analyses English words. We will use the NLTK library to remove each non-English word from the bags of words.
We first need to download the "wordnet" NLTK package:
In [18]:
import nltk
#Using the NLTK downloader to get wordnet
#nltk.download()
from nltk.corpus import wordnet as wn
In [19]:
for j in full_word_list.index:
    if not wn.synsets(full_word_list.Word[j]): #drop words that WordNet does not recognise as English
        full_word_list.drop(j, inplace=True)
In [20]:
full_word_list = full_word_list.sort_values('Weight', ascending=False)
display(full_word_list.head())
pickle.dump(full_word_list, open("full_word_list.pkl", "wb"))
In [21]:
# function to get the complexity of one song by averaging the weights of all its words
def complexity_Song(lyrics):
    #variable to store the sum of the weights of every word of the song
    sum_weight = 0
    #split the lyrics to get an array of words and not just one big string
    lyric = lyrics.split(' ')
    #filtering empty values
    lyric = list(filter(None, lyric))
    #Removing every English stopword from the given lyric
    lyric = [word for word in lyric if word not in stopwords.words('english')]
    for x in lyric:
        #Making sure that the word is present in the weight table
        if len(full_word_list.loc[full_word_list['Word'] == x]['Weight'].values) != 0:
            sum_weight += full_word_list.loc[full_word_list['Word'] == x]['Weight'].values
    return float(sum_weight/len(lyric))
This implementation is inspired by the TF-IDF algorithm: if a word occurs rarely in the dataset, it is less common in the language, which means its lexical complexity is higher.
English stopwords are very common in every sentence; they are used so routinely that they do not add anything relevant to the analysis. This is the reason why we take them out: our complexity analysis must focus on words that do not appear regularly.
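As a quick usage illustration of the function defined above (the printed value is only an example, not a result we report), the complexity of the first reconstructed lyric can be computed directly:
#Illustrative call of complexity_Song on the first song's reconstructed lyrics
sample_lyric = create_text(lyrics.iloc[0]['words_freq'])
print("Complexity of the first song: %.3f" % complexity_Song(sample_lyric))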
We need to go from word frequency to bags of words. Once this is done using our "create_text" function, we will use the polarity analyser.
In [22]:
#Resetting index
matches.reset_index(inplace=True)
#Initiating an empty column, in order to be able to iterate on it
matches['Bags_of_words'] = ''
#Getting all the textual data in the DataFrame
for i in matches.index:
    matches.at[i, 'Bags_of_words'] = create_text(matches.at[i, 'words_freq'])
#Because we now have all the initial data in our DataFrame, we store it as a pickle object
matches.to_pickle('full_table.pkl')
Now that we have the bags of words in the DataFrame, we can conduct the analysis. Let us first work with the polarity:
In [23]:
#taking out the pickle object
matches = pd.read_pickle('full_table.pkl')
#Applying the polarity analysis for the bags of words
for i in matches.index:
    matches.at[i, 'Polarity_score'] = analyser.polarity_scores(matches.at[i, 'Bags_of_words'])['compound']
In [24]:
display(matches.head(4))
In [25]:
sns.set(color_codes=True, style="darkgrid")
def polarity_graph_generator(Data_in, categorization):
    Data_in[categorization] = Data_in[categorization].astype('category')
    for cat in Data_in[categorization].cat.categories:
        Division = Data_in[(Data_in[categorization] == cat)]
        #Sorting values by polarity to create a graph
        Division = Division.sort_values('Polarity_score', ascending=False)
        #Resetting the index
        Division = Division.reset_index()
        #plotting the results
        sns_plot = sns.tsplot(Division['Polarity_score'], color='m').set_title('Polarity in {}'.format(cat))
        x = np.arange(len(Division['Polarity_score']))
        y = Division['Polarity_score']
        ax = sns_plot.axes
        ax.fill_between(x, 0, y)
        fig = sns_plot.get_figure()
        #Storing the graph (the Polarity_plots folder MUST BE CREATED BEFORE!)
        fig.savefig("Polarity_plots/{} polarity.png".format(cat))
        #Clearing the figure
        fig.clf()
    return
Having the data divided into genres is important for our analysis; however, we are still missing one key dimension to make our work relevant for social good: the topic that is addressed in the songs. We must be able to know which subject a song deals with; we can then aggregate the data per genre and understand how a particular genre handles a specific topic. A sketch of one possible scoring approach is given after the topic definitions below. For this part we are still considering two options:
In [27]:
#import global vectors from Stanford's pretrained set, trained on tweets; one can choose the desired dim=25,50,100,200
global_vectors = HL.load_gensim_global_vectors(GS_200DIM)
In [28]:
#Defining the topics
racism = ['racism', 'nigger', 'negro', 'race', 'racist', 'bigot', 'bigotry', 'apartheid', 'discrimination', 'segregation', 'unfairness', 'partiality', 'sectarianism', 'colored']
women = ['women','girl', 'daughter', 'mother', 'she', 'wife', 'aunt', 'gentlewoman', 'girlfriend', 'grandmother', 'matron', 'niece', 'spouse', 'miss', 'genre']
money = ['money','bill', 'capital', 'cash', 'check', 'fund', 'pay', 'payment', 'property', 'salary', 'wage', 'wealth', 'banknote', 'bankroll', 'bread', 'bucks', 'chips', 'coin', 'coinage', 'dough', 'finances', 'funds', 'gold', 'gravy', 'greenback', 'loot', 'pesos', 'ressources', 'riches', 'roll', 'silver', 'specie', 'treasure', 'wad', 'wherewithal']
revolution = ['revolution','change', 'overthrow', 'demand', 'freedom', 'war', 'movement', 'brotherhood', 'reform', 'radical', 'leadership']
politics = ['politics', 'president', 'governor', 'senator', 'campaigning','government','civics','electioneering','legislature','policy','political']
religion = ['religion', 'religious', 'religions', 'atheism', 'secular', 'islam', 'islamic', 'atheist', 'bible', 'christian', 'jew', 'muslim', 'theology', 'god', 'church', 'buddhism', 'hinduism','belief', 'pray', 'prayer', 'worship']
art = ['art', 'movie', 'singing', 'painting', 'ballet', 'theatre']
health = ['health', 'nutrition', 'medical', 'wellness', 'healthy', 'care', 'safety', 'fitness', 'obesity', 'cancer', 'sickness', 'disease']
# make lists so one can iterate through the topics
name_of_topics = ['racism', 'women', 'money', 'revolution', 'politics', 'religion', 'art', 'health']
words_defining_topics = [racism, women, money, revolution, politics, religion, art, health]
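The scoring itself is done by the helpers HL.vocabulary_calculate_topics and HL.score_songs (in helpers.py, not reproduced here). As a rough, hypothetical sketch of the underlying idea, assuming global_vectors behaves like a gensim KeyedVectors object, a word can be scored against each topic by averaging its cosine similarity to the topic's defining words:
#Hypothetical sketch, not the actual implementation in helpers.py
def sketch_topic_scores(word, words_defining_topics, name_of_topics, global_vectors):
    scores = {}
    for topic_name, topic_words in zip(name_of_topics, words_defining_topics):
        sims = []
        for topic_word in topic_words:
            try:
                #cosine similarity between the two word vectors
                sims.append(global_vectors.similarity(word, topic_word))
            except KeyError:
                #skip words that are missing from the pretrained vocabulary
                continue
        scores['topic_' + topic_name] = float(np.mean(sims)) if sims else 0.0
    return scores

#Example (illustration only): how close is 'church' to each topic?
print(sketch_topic_scores('church', words_defining_topics, name_of_topics, global_vectors))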
In [38]:
vocab_topics = HL.vocabulary_calculate_topics(words_defining_topics, name_of_topics, global_vectors)
In [39]:
visual_vocab = vocab_topics.copy(deep=True)
print(visual_vocab.shape)
display(visual_vocab.sort_values('topic_revolution',ascending=False).head(15))
In [40]:
# dataframe with songs
matches = pd.read_pickle('full_table.pkl')
In [41]:
display(matches.head(4))
In [43]:
matches, occz, totz = HL.score_songs(matches, vocab_topics)
In [45]:
print("Percentage of the songs that is not in the vocabulary: %.2f%%" % ((occz/totz)*100))
In [46]:
#Fetch the column names of the topics
column_names = [col for col in vocab_topics.columns if col.startswith('topic')]
#Binarize each topic score: 1 if above the column average, 0 otherwise
for col in column_names:
    col_average = np.mean(matches[col])
    print(col_average)
    matches[col] = matches[col].apply(lambda x: 1 if x > col_average else 0)
In [47]:
# this is how the dataframe looks after manipulation
visual_matches = matches.copy(deep=True)
display(visual_matches.sort_values('topic_racism',ascending=False).head(40))
In [ ]:
matches.to_excel("final_table.xls")
In [ ]: