This work aims to conduct a sentiment analysis across music genres. There are many ways to reach this goal, but it is important that the parameters and analysers we use fit our data. Sentiment analysis is a very large field, and we believe that we used the libraries and functions that are the most appropriate for the chosen dataset. This work can be divided into 4 parts:
musiXmatch dataset, the official lyrics collection for the Million Song Dataset, available at: http://labrosa.ee.columbia.edu/millionsong/musixmatch
In [3]:
#Importing libraries
%matplotlib inline
import numpy as np
import pandas as pd
import re
import nltk
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Image
from nltk.corpus import stopwords
For now we only use titles and artist names, so we can handle this part with the musiXmatch website alone. We download the data and put it into a dataframe with the musiXmatch ID (MXM_Tid) and the Million Song Dataset track ID (Tid). Because other data may be keyed on either identifier, we decide to keep both IDs. We are fully aware that having two IDs does not add information; it only makes merging other datasets easier.
For now, we get the artists and titles of 779,052 songs.
In [4]:
#Importing the text file in a DataFrame, removing exceptions (incomplete data)
matches = pd.read_table('Data/mxm_779k_matches.txt', error_bad_lines=False)
#Changing the column's title in order to be clearer
matches.columns = ['Raw']
#Getting the Tid
matches['Tid'] = matches['Raw'].str.split('<SEP>', expand=True)[0]
#Extracting artist names
matches['Artist_Name'] = matches['Raw'].str.split('<SEP>', expand=True)[1]
#Extracting titles
matches['Title'] = matches['Raw'].str.split('<SEP>', expand=True)[2]
#Extracting MXM_Tid
matches['MXM_Tid'] = matches['Raw'].str.split('<SEP>', expand=True)[3]
#Dropping the header rows we do not need
matches = matches.drop(matches.index[:17])
#Dropping the column with raw data
matches = matches.drop('Raw', axis=1)
#set index Track ID
matches.set_index('Tid',inplace=True)
#Displaying results
display(matches.shape)
display(matches.head())
We download the text file from the Million Song Dataset website. It is provided as an additional feature of the dataset.
We merge the year dataset with the artists and song titles in the same dataframe.
In [5]:
#Loading the year of publication data, skipping incomplete data in order to avoid errors
years = pd.read_table('Data/tracks_per_year.txt', error_bad_lines=False)
#Changing the column's title in order to be clearer
years.columns = ['Raw']
#Getting the publication year
years['year'] = years['Raw'].str.split('<SEP>', expand=True)[0]
#Getting the Tid
years['Tid'] = years['Raw'].str.split('<SEP>', expand=True)[1]
#Dropping the raw data
years = years.drop('Raw', axis=1)
#set index Track ID
years.set_index('Tid',inplace=True)
#Appending the years to the original DataFrame
matches = pd.merge(matches, years, left_index=True, right_index=True)
In [6]:
#display the results
print(matches.shape)
display(matches.head())
We delete the rows without year information, which is why the dataframe contains fewer songs. In order to be as complete and accurate as possible, we only keep fully matched entries.
We will now append each genre to a specific track.
We download the data from the TagTraum dataset and merge it with our previous dataframe.
In [7]:
#Creating a DataFrame to store the genres:
GenreFrame = pd.read_table('Data/msd-topMAGD-genreAssignment.txt', names=['Tid', 'genre'])
#set index Track ID
GenreFrame.set_index('Tid',inplace=True)
#Merging the new data with the previous dataframe
matches = pd.merge(GenreFrame, matches, left_index=True, right_index=True)
In [8]:
#Displaying results
print(matches.shape)
display(matches.head())
In [9]:
#Creating a DataFrame to store the location:
location = pd.read_csv('Data/artist_location.txt', sep="<SEP>",header=None,names=['ArtistID','Latitude','Longitude','Artist_Name','City'])
#Keeping only the useful columns
location.drop(['ArtistID','City'],inplace=True,axis=1)
#matches = pd.merge(location, matches, on='Tid')
matches.reset_index(inplace=True)
matches = pd.merge(location, matches, on='Artist_Name')
matches.set_index('Tid',inplace = True)
In [10]:
#Displaying results
display(matches.head())
print(matches.shape)
We downloaded the training data file, which covers 30% of the whole dataset. It contains the list of the 5,000 most used words in the ... songs. We then make two dataframes:
One with the ID of every song and its lyrics. We merge this with our previous dataframe.
The lyrics are presented as follows: [(id of word),(occurrence in song)][2,24][5,47]...
We work with only 30% of the whole dataset because we use the musiXmatch dataset and this is the only data that is freely available.
The rest of the data is not free; see this page to verify: https://developer.musixmatch.com/plans
In [11]:
#import file
lyrics = pd.read_table('Data/mxm_dataset_train.txt', error_bad_lines=False)
#change name of the column
lyrics.columns = ['Raw_Training']
# keep the bag-of-words line (row 16) to use later
words_train = lyrics.iloc[16]
#drop useless header rows
lyrics = lyrics[17:].copy()
# get the Track ID and the word frequencies and put them in separate columns
def sortdata(x):
    splitted = x['Raw_Training'].split(',')
    x['Tid'] = splitted[0]
    #x['MXM_Tid'] = splitted[1]
    x['words_freq'] = splitted[2:]
    return x
#Applying the function to every row
lyrics = lyrics.apply(sortdata, axis=1)
lyrics = lyrics[['Tid','words_freq']]
In [12]:
#set index Track ID
lyrics.set_index('Tid',inplace=True)
#Appending the lyrics to the original DataFrame
matches = pd.merge(matches, lyrics, left_index=True, right_index=True)
In [13]:
#Displaying the results
print(matches.shape)
display(matches.head())
We create a function that takes the list of word IDs and their occurrences in one song: [(id of word),(occurrence in the song)][2,24][5,47]...
and outputs all the corresponding words in a list.
For example: [1:2,2:5,3:3] gives us --> [i,i,the,the,the,the,the,you,you,you]
In [14]:
#Getting the data
bag_of_words = words_train
# clean the data and split it to create a list of 5000 words
bag_of_words = bag_of_words.str.replace('%','')
bag_of_words = bag_of_words.str.split(',')
display(bag_of_words.head())
In [15]:
#Defining a function
def create_text(words_freq):
    #create the final string of all words
    list_words = ''
    #iterate over every word id / count pair
    for compteur in words_freq:
        word = bag_of_words[0][int(compteur.split(':')[0])-1]
        times = int(compteur.split(':')[1])
        #Separating every word with a space to be able to work on it with libraries during part 2
        for i in range(times):
            list_words += ' ' + word + ' '
    return list_words
In [16]:
#Testing the function
print(create_text(lyrics.iloc[0]['words_freq']))
As is noticeable at each step, we lose data every time we merge datasets. We chose this approach because we only want to deal with complete information in order to stay coherent. We want to compare parameters between items, and we believe that the analysis is less relevant if we consider a larger dataset that contains incomplete data.
We now have 38,513 songs, but for each one we have all the features that we want to use. We will analyse our data with different parameters, which is why it is important that each song provides every item. Later in the analysis we may use the data from 1.4 (providing 103,401 songs) in order to get a broader overview.
In order to analyse songs, we will use sentiment analysis on the lyrics. We chose two key features: polarity and lexical complexity. Because we only work with bags of words, some parameters such as rhymes and structure cannot be captured, although they should be taken into consideration when speaking of the full complexity of lyrics.
VADER, which stands for Valence Aware Dictionary and sEntiment Reasoner, is a sentiment analysis package that provides a polarity score for a given word or sentence. It is known to be a very powerful tool, especially because it was trained on tweets, meaning that it takes into account most modern vocabulary. This is especially relevant for our project because we deal with modern music, implying that the words used are as modern as the ones VADER analysed in tweets. The fact that the sentiment analyser takes its roots from the same kind of vocabulary makes the analysis more relevant.
Polarity is expressed between -1 (negative polarity) and 1 (positive polarity).
In [17]:
import nltk.sentiment.sentiment_analyzer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
In [18]:
#Defining the analyser
analyser = SentimentIntensityAnalyzer()
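As a quick sanity check (a minimal illustrative sketch, not part of the pipeline), we can look at the compound score VADER returns for a few hand-written snippets:
#Quick sanity check on hand-written snippets (illustrative only)
for sample in ['I love this song', 'I hate this song', 'the table is blue']:
    print(sample, '->', analyser.polarity_scores(sample)['compound'])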
Because we want to know what type of audience a specific type of music is targeting, we need to analyse the complexity of the lyrics. We are aware that dividing an audience into social profiles is far beyond the scope of our analysis; we do not have enough sociological knowledge to categorize an audience in a precise way. This is the reason why we will use broad indicators. We want to know how complex a set of words is, and the only social assumption we will make is that complexity is correlated with the age and the educational level of the audience.
We use the occurrence of each word in the whole dataset.
We import the most used words and their counts in the dataset in order to start the text processing analysis.
The dataset comes with some metadata: the total word count is 55,163,335.
Because of the long-tail effect of language, we will proceed with the first 10,000 words of the list. This reduces computing time when iterating over full_word_list.
From this vocabulary we remove stopwords. They appear too often at every level of language to be relevant for this analysis.
We then compute the percentage of occurrence, because it will help us when dealing with the lyrics' complexity. We then use the occurrence percentage to get a complexity weight: a word that is used a lot gets a low weight, while rarely used words get a high weight.
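As a small worked example (with a hypothetical count, not a real entry of the word list), a word seen roughly 1% of the time would get a weight of about 1, and rarer words would get proportionally larger weights:
#Worked example with a hypothetical count (not a real entry of the list)
total_count = 55163335
hypothetical_count = 551633
occurrence_percentage = (hypothetical_count / total_count) * 100   # ~1.0 %
weight = 1 / occurrence_percentage                                 # ~1.0; rarer words get larger weights
print(occurrence_percentage, weight)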
In [19]:
Word_count_total = 55163335
#Importing the data, putting it in a DataFrame
full_word_list = pd.read_table('Data/full_word_list.txt')
#Renaming the columns
full_word_list.columns = ['Word']
#Extracting word count
full_word_list['Count'] = pd.to_numeric(full_word_list['Word'].str.split('<SEP>', expand=True)[1])
#Extracting the words themselves
full_word_list['Word'] = full_word_list['Word'].str.split('<SEP>', expand=True)[0]
#Dropping rows we will not use
full_word_list = full_word_list.drop(full_word_list.index[:6])
#Keeping the first 10 000 values, because the rest is not stemmed and not necessarily in English
full_word_list = full_word_list.head(10000)
#Removing english stop words
for word in full_word_list['Word']:
    if word in stopwords.words('english'):
        full_word_list = full_word_list[full_word_list.Word != word]
#Computing the percentage of occurrence:
full_word_list['Occurence_percentage'] = (full_word_list['Count']/ Word_count_total)*100
#computing weight of words
full_word_list['Weight']= 1/full_word_list['Occurence_percentage']
#Display
display(full_word_list.head(10))
In [20]:
# function to get the complexity of one song by summing the weights of all its words
def complexity_Song(lyrics):
    #variable storing the sum of the weights of every word of the song
    sum_weight = 0
    #split the lyrics to get an array of words instead of one big string
    lyric = lyrics.split(' ')
    #filtering empty values
    lyric = list(filter(None, lyric))
    #Removing every english stopword from the given lyric
    lyric = [word for word in lyric if word not in stopwords.words('english')]
    #Avoiding a division by zero if nothing remains after filtering
    if len(lyric) == 0:
        return 0.0
    for x in lyric:
        #Making sure that the word exists in the reference word list
        if len(full_word_list.loc[full_word_list['Word'] == x]['Weight'].values) != 0:
            sum_weight += full_word_list.loc[full_word_list['Word'] == x]['Weight'].values
    return float(sum_weight/len(lyric))
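As with create_text above, we can sanity-check the function on the first training song before applying it at scale (illustrative only; this assumes the previous cells have been run):
#Testing the function on the first training song (illustrative)
print(complexity_Song(create_text(lyrics.iloc[0]['words_freq'])))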
This implementation is inspired by the TF-IDF algorithm: if the occurrence of a word is low in the dataset, it is less common in the language, meaning that its lexical complexity is higher.
English stopwords are very common in every sentence; they are used so routinely that they do not add anything relevant to the analysis. This is the reason why we take them out: our complexity analysis must focus on words that do not appear regularly.
Having the data divided into genres is important for our analysis; however, we are still missing one key dimension to make our work relevant for social good: the topic that is addressed in the songs. We must be able to know which subject a song deals with; we will then aggregate the data per genre and be able to understand how a particular genre handles a specific topic. For this part we are still considering two options:
The work that was done in the second part of Homework 4. The good point is that we have a tuned algorithm to classify text into 20 different classes with more than 85% accuracy (a generic sketch of such a pipeline is given below). It is very powerful but not the exact algorithm we are looking for: it provides a strict classification, putting a text into one precise class, while we would rather have several tags for a particular song. Artists tend to treat several topics when writing a song, and the 20newsgroup classifier would limit our data treatment.
Some data cleaning will be required if we apply the 20newsgroup classifier. For instance, categories dealing with computer science are likely to be irrelevant when classifying songs, which is why they will be removed.
We are still exploring other possibilities because option 1 does not provide exactly the data we want. We have not yet encountered an algorithm that would provide a categorization with several tags. We are aware that we may not find such a classifier, but since the 20newsgroup one is already fully implemented and ready to use (TF-IDF and RandomForest are coded), we will look for another option for as long as we can.
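For reference, here is a minimal sketch of the kind of pipeline mentioned above: TF-IDF features and a RandomForest classifier trained on the 20newsgroups data shipped with scikit-learn. This is a generic illustration under those assumptions, not the tuned classifier from Homework 4:
#Minimal sketch: TF-IDF + RandomForest on 20newsgroups (illustrative, not the tuned Homework 4 model)
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
topic_clf = make_pipeline(TfidfVectorizer(stop_words='english'), RandomForestClassifier(n_estimators=100))
topic_clf.fit(newsgroups.data, newsgroups.target)
#Predicting a single class for the first training song's bag of words (illustrative)
predicted = topic_clf.predict([create_text(lyrics.iloc[0]['words_freq'])])[0]
print(newsgroups.target_names[predicted])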
In order to show how we will proceed on the dataset, we will use the first 100 songs of one genre as an example (in this case Rap). We know that this sample is too small to draw conclusions; for now we are only interested in running the code and producing visualizations. We want to make sure that this code works and provides appealing visualizations. The generalization of this code will be implemented in the last part of the project.
We start with the polarity analysis:
In [21]:
#Extracting the Data of the genre
Rap_example = matches[(matches['genre'] == 'Rap')].head(100)
#Resetting index
Rap_example.reset_index(inplace=True)
#Initializing an empty column, in order to be able to iterate on it
Rap_example['Bags_of_words'] = ''
#Getting all the textual data in the DataFrame
for i in Rap_example.index:
    Rap_example.at[i, 'Bags_of_words'] = create_text(Rap_example.at[i, 'words_freq'])
In [22]:
#Displaying results
display(Rap_example.head())
In [23]:
#Applying the polarity analysis for the bags of words
for i in Rap_example.index:
    Rap_example.at[i, 'Polarity_score'] = analyser.polarity_scores(Rap_example.at[i, 'Bags_of_words'])['compound']
#Sorting values to have a meaningful plot
Rap_example = Rap_example.sort_values('Polarity_score', ascending=False)
Rap_example = Rap_example.reset_index()
#plotting the results: Polarity
plt.bar(Rap_example.index, Rap_example['Polarity_score'])
plt.title('Polarity in Rap')
plt.show()
#Computing the polarity's mean value for the genre
print("The mean value is")
display(pd.to_numeric(Rap_example['Polarity_score'].mean()))
We like this representation. We find it meaningful and visually appealing. This is the reason why we will keep this representation as our main visualization.
We now compute the lexical complexity:
In [24]:
#Applying the lexical complexity analysis for the bags of words
for i in Rap_example.index:
    Rap_example.at[i,'Lexical_complexity'] = complexity_Song(Rap_example.at[i,'Bags_of_words'])
#Sorting values to have a meaningful plot
Rap_example = Rap_example.sort_values('Lexical_complexity', ascending=False)
#plotting the results: Lexical complexity
plt.bar(Rap_example.index, Rap_example['Lexical_complexity'])
plt.title('Lexical complexity in Rap')
plt.show()
#Computing the lexical complexity's mean value for the genre
print("The mean value is")
display(pd.to_numeric(Rap_example['Lexical_complexity'].mean()))
Finally, we can visualize the evolution of complexity through time for the given subset:
In [25]:
#Sorting values to have a meaningful plot
Rap_example = Rap_example.sort_values('year')
#plotting the results: Evolution of complexity through time
plt.bar(Rap_example['year'], Rap_example['Lexical_complexity'])
plt.title('Rap: Lexical complexity through time')
plt.show()
Now that we have the tools for a sentiment analysis, we must decide on a way to visualize the results. Having a straightforward visualization will help to compare the analysed features between genres, and this is precisely what we try to achieve with the following visualizations.
All the interactive visualization structures presented here will be displayed on a blog.
We first deal with polarity, then with lexical complexity.
Once we have both the topics and the polarity, we will be able to get an overview of how social topics are dealt with across music genres. We cannot draw any conclusions for this part without having every visual, but we believe that the comparison will provide good insight into how music treats social topics.
Once every visualization is generated, here is the representation structure that we want to have:
In [26]:
display(Image(filename='Data/Organigram_polarity_viz.png'))
We believe that there is a correlation between the complexity and the target audience. As asserted earlier in this work, we do not have the sociological expertise to match the complexity with precise social groups, but we assume that there is a correlation between the audience's age and educational level and the complexity of the lyrics.
We want to be able to visualize the complexity in two distinct ways. The first is from a topic perspective, with each genre displaying its complexity when treating a particular subject. Here is the display organigram we want to have:
In [27]:
display(Image(filename='Data/Organigram_complexity_viz.png'))
As before, edges stand for a user's click. When picking a genre, we will visualize the lexical complexity that is used. Comparing this value to the average of the genre will help to see if the target audience is more educated/older than the usual audience for the genre.
The second visualization we want to provide starts with topics on top. For every topic we want to see the mean lexical complexity used by each genre to treat that topic. This will allow us to know the targeted audience of each genre for a precise topic. Here is the data organigram:
In [28]:
display(Image(filename='Data/Organigram_complexity_topic_viz.png'))
We want to know what topic is addressed in each genre, and especially how it is spread across the globe. For that we will use the location data that we gathered. We want to produce maps for:
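Whatever the final list of maps ends up being, a minimal sketch (matplotlib only, using the Latitude and Longitude columns merged earlier) already shows where the artists of our dataset are located:
#Minimal sketch: scatter of the artist coordinates merged earlier (illustrative)
plt.scatter(pd.to_numeric(matches['Longitude']), pd.to_numeric(matches['Latitude']), s=1)
plt.title('Artist locations in the merged dataset')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()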
Because we are solely treating text, and the lyrics are not given in full, the volume of data we use is relatively small. This enables us to keep the original text files stored on our machines; storing intermediate values when using the notebook is not necessary since our machines' memory is large enough to handle the data.
We are convinced that the barplot is the most fitting representation for the generated graphs. We might find another library that makes it look more visually appealing, but we believe that it is the most meaningful way to show our results.
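As one option, seaborn (already imported above) can restyle the same barplot; here is a minimal sketch on the Rap example:
#Minimal sketch: the same polarity barplot restyled with seaborn (illustrative)
sns.set_style('whitegrid')
sorted_scores = Rap_example.sort_values('Polarity_score', ascending=False).reset_index(drop=True)
sns.barplot(x=sorted_scores.index, y=sorted_scores['Polarity_score'], color='steelblue')
plt.xticks([])
plt.title('Polarity in Rap (seaborn styling)')
plt.show()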
Once we have a topic-based split of our data, we will be able to compare how topics are treated in different genres.
04.12.
11.12.
19.12.