This work aims to conduct a sentiment analysis across music genres. There are many ways to reach this goal, but it is important that the parameters and analysers we use fit our data. Sentiment analysis is a very large field, and we believe that we used the libraries and functions that are the most appropriate for the chosen dataset. This work can be divided into 4 parts:
musiXmatch dataset, the official lyrics collection for the Million Song Dataset, available at: http://labrosa.ee.columbia.edu/millionsong/musixmatch
In [3]:
#Importing libraries
%matplotlib inline
import numpy as np
import pandas as pd
import re
import nltk
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Image
from nltk.corpus import stopwords
For now we only use titles and artist names, so we can handle this part with the musiXmatch website alone. We download the data and put it into a dataframe with the musiXmatch ID (MXM_Tid) and the Million Song Dataset track ID (Tid). Because other data may be keyed on either identifier, we decide to keep both IDs. We are fully aware that having two IDs does not add information; it only makes merging other datasets easier.
For now, we get the artists and titles of 779,052 songs.
In [4]:
#Importing the text file in a DataFrame, removing exceptions (incomplete data)
matches = pd.read_table('Data/mxm_779k_matches.txt', error_bad_lines=False)
#Changing the column's title in order to be clearer
matches.columns = ['Raw']
#Getting the Tid
matches['Tid'] = matches['Raw'].str.split('<SEP>', expand=True)[0]
#Extracting artist names
matches['Artist_Name'] = matches['Raw'].str.split('<SEP>', expand=True)[1]
#Extracting titles
matches['Title'] = matches['Raw'].str.split('<SEP>', expand=True)[2]
#Extracting MXM_Tid
matches['MXM_Tid'] = matches['Raw'].str.split('<SEP>', expand=True)[3]
#Dropping the header rows we do not need
matches = matches.drop(matches.index[:17])
#Dropping the column with raw data
matches = matches.drop('Raw', axis=1)
#set index Track ID
matches.set_index('Tid',inplace=True)
#Displaying results
display(matches.shape)
display(matches.head())
We download the text file from the Million Song Dataset website. It is provided as an additional feature of the dataset.
We merge the year dataset with the artists and song titles in the same dataframe.
In [5]:
#Loading the year of publication data, skipping incomplete data in order to avoid errors
years = pd.read_table('Data/tracks_per_year.txt', error_bad_lines=False)
#Changing the column's title in order to be clearer
years.columns = ['Raw']
#Getting the publication year
years['year'] = years['Raw'].str.split('<SEP>', expand=True)[0]
#Getting the Tid
years['Tid'] = years['Raw'].str.split('<SEP>', expand=True)[1]
#Dropping the raw data
years = years.drop('Raw', axis=1)
#set index Track ID
years.set_index('Tid',inplace=True)
#Appending the years to the original DataFrame
matches = pd.merge(matches, years, left_index=True, right_index=True)
In [6]:
#display the results
print(matches.shape)
display(matches.head())
We delete the rows without year information, which is why the dataframe contains fewer songs. In order to be as complete and accurate as possible, we only keep fully matched entries.
We will now append each genre to a specific track.
We download the data from the TagTraum dataset and merge it with our previous dataframe.
In [7]:
#Creating a DataFrame to store the genres:
GenreFrame = pd.read_table('Data/msd-topMAGD-genreAssignment.txt', names=['Tid', 'genre'])
#set index Track ID
GenreFrame.set_index('Tid',inplace=True)
#Merging the new data with the previous dataframe
matches = pd.merge(GenreFrame, matches, left_index=True, right_index=True)
In [8]:
#Displaying results
print(matches.shape)
display(matches.head())
In [9]:
#Creating a DataFrame to store the location:
location = pd.read_csv('Data/artist_location.txt', sep="<SEP>",header=None,names=['ArtistID','Latitude','Longitude','Artist_Name','City'])
#Keeping only the useful columns
location.drop(['ArtistID','City'],inplace=True,axis=1)
#matches = pd.merge(location, matches, on='Tid')
matches.reset_index(inplace=True)
matches = pd.merge(location, matches, on='Artist_Name')
matches.set_index('Tid',inplace = True)
In [10]:
#Displaying results
display(matches.head())
print(matches.shape)
We downloaded the training data file, which covers 30% of the whole dataset. It contains the list of the 5,000 most used words in the ... songs. We then make two dataframes:
One with the ID of every song and its lyrics. We merge this with our previous dataframe.
The lyrics are presented as follows: [(id of word),(occurrence in song)][2,24][5,47]...
We work with only 30% of the whole dataset because we use the musiXmatch dataset and this is the only data that is freely available.
The rest of the data is not free; see this page to verify: https://developer.musixmatch.com/plans
In [11]:
#import file
lyrics = pd.read_table('Data/mxm_dataset_train.txt', error_bad_lines=False)
#change name of the column
lyrics.columns = ['Raw_Training']
# keep the bag-of-words line (row 16) to use later
words_train = lyrics.iloc[16]
#drop useless header rows
lyrics = lyrics[17:].copy()
# get the Track ID and the word frequencies and put them in separate columns
def sortdata(x):
    splitted = x['Raw_Training'].split(',')
    x['Tid'] = splitted[0]
    #x['MXM_Tid'] = splitted[1]
    x['words_freq'] = splitted[2:]
    return x
#Applying the function to every row
lyrics = lyrics.apply(sortdata, axis=1)
lyrics = lyrics[['Tid','words_freq']]
In [12]:
#set index Track ID
lyrics.set_index('Tid',inplace=True)
#Appending the lyrics to the original DataFrame
matches = pd.merge(matches, lyrics, left_index=True, right_index=True)
In [13]:
#Displaying the results
print(matches.shape)
display(matches.head())
We create a function that takes the list of word IDs and their occurrences in one song: [(id of word),(occurrence in the song)][2,24][5,47]...
and outputs all the corresponding words in a list.
For example: [1:2,2:5,3:3] gives us --> [i,i,the,the,the,the,the,you,you,you]
In [14]:
#Getting the data
bag_of_words = words_train
# clean the data and split it to create a list of 5000 words
bag_of_words = bag_of_words.str.replace('%','')
bag_of_words = bag_of_words.str.split(',')
display(bag_of_words.head())
In [15]:
#Defining a function
def create_text(words_freq):
    #create the final string of all words
    list_words = ''
    #iterate over every word id / count pair
    for compteur in words_freq:
        word = bag_of_words[0][int(compteur.split(':')[0])-1]
        times = int(compteur.split(':')[1])
        #Separating every word with a space to be able to work on it with libraries during part 2
        for i in range(times):
            list_words += ' ' + word + ' '
    return list_words
In [16]:
#Testing the function
print(create_text(lyrics.iloc[0]['words_freq']))
As is noticeable at each step, we lose data every time we merge datasets. We chose this approach because we only want to deal with complete information in order to stay coherent. We want to compare parameters between items, and we believe that the analysis is less relevant if we consider a larger dataset that contains incomplete data.
We now have 38,513 songs, but for each one we have all the features that we want to use. We will analyse our data with different parameters, which is why it is important that each song provides every item. Later in the analysis we may use the data from 1.4 (providing 103,401 songs) in order to get a broader overview.
In order to analyse songs, we will use sentiment analysis on the lyrics. We chose two key features: polarity and lexical complexity. Because we only work with bags of words, some parameters such as rhymes and structure cannot be captured, although they should be taken into consideration when speaking of the full complexity of lyrics.
VADER, which stands for Valence Aware Dictionary and sEntiment Reasoner, is a sentiment analysis package that provides a polarity score for a given word or sentence. It is known to be a very powerful tool, especially because it was trained on tweets, meaning that it takes into account most modern vocabulary. This is especially relevant for our project because we deal with modern music, implying that the words used are as modern as the ones VADER analysed in tweets. The fact that the sentiment analyser takes its roots from the same kind of vocabulary makes the analysis more relevant.
Polarity is expressed between -1 (negative polarity) and 1 (positive polarity).
In [17]:
import nltk.sentiment.sentiment_analyzer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
In [18]:
#Defining the analyser
analyser = SentimentIntensityAnalyzer()
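As a quick sanity check (a minimal illustrative sketch, not part of the pipeline), we can look at the compound score VADER returns for a few hand-written snippets:
#Quick sanity check on hand-written snippets (illustrative only)
for sample in ['I love this song', 'I hate this song', 'the table is blue']:
    print(sample, '->', analyser.polarity_scores(sample)['compound'])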
Because we want to know what type of audience a specific type of music is targeting, we need to analyse the complexity of the lyrics. We are aware that dividing an audience into social profiles is far beyond the scope of our analysis; we do not have enough sociological knowledge to categorize an audience in a precise way. This is the reason why we will use broad indicators. We want to know how complex a set of words is, and the only social assumption we will make is that complexity is correlated with the age and the educational level of the audience.
We use the occurrence of each word in the whole dataset.
We import the most used words and their counts in the dataset in order to start the text processing analysis.
The dataset comes with some metadata: the total word count is 55,163,335.
Because of the long-tail effect of language, we will proceed with the first 10,000 words of the list. This reduces computing time when iterating over full_word_list.
From this vocabulary we remove stopwords. They appear too often at every level of language to be relevant for this analysis.
We then compute the percentage of occurrence, because it will help us when dealing with the lyrics' complexity. We then use the occurrence percentage to get a complexity weight: a word that is used a lot gets a low weight, while rarely used words get a high weight.
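As a small worked example (with a hypothetical count, not a real entry of the word list), a word seen roughly 1% of the time would get a weight of about 1, and rarer words would get proportionally larger weights:
#Worked example with a hypothetical count (not a real entry of the list)
total_count = 55163335
hypothetical_count = 551633
occurrence_percentage = (hypothetical_count / total_count) * 100   # ~1.0 %
weight = 1 / occurrence_percentage                                 # ~1.0; rarer words get larger weights
print(occurrence_percentage, weight)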
In [19]:
Word_count_total = 55163335
#Importing the data, putting it in a DataFrame
full_word_list = pd.read_table('Data/full_word_list.txt')
#Renaming the columns
full_word_list.columns = ['Word']
#Extracting word count
full_word_list['Count'] = pd.to_numeric(full_word_list['Word'].str.split('<SEP>', expand=True)[1])
#Extracting the words themselves
full_word_list['Word'] = full_word_list['Word'].str.split('<SEP>', expand=True)[0]
#Dropping rows we will not use
full_word_list = full_word_list.drop(full_word_list.index[:6])
#Keeping the first 10 000 values, because the rest is not stemmed and not necessarily in English
full_word_list = full_word_list.head(10000)
#Removing english stop words
for word in full_word_list['Word']:
    if word in stopwords.words('english'):
        full_word_list = full_word_list[full_word_list.Word != word]
#Computing the percentage of occurrence:
full_word_list['Occurence_percentage'] = (full_word_list['Count']/ Word_count_total)*100
#computing weight of words
full_word_list['Weight']= 1/full_word_list['Occurence_percentage']
#Display
display(full_word_list.head(10))
In [20]:
# function to get the complexity of one song by summing the weights of all its words
def complexity_Song(lyrics):
    #variable storing the sum of the weights of every word of the song
    sum_weight = 0
    #split the lyrics to get an array of words instead of one big string
    lyric = lyrics.split(' ')
    #filtering empty values
    lyric = list(filter(None, lyric))
    #Removing every english stopword from the given lyric
    lyric = [word for word in lyric if word not in stopwords.words('english')]
    #Avoiding a division by zero if nothing remains after filtering
    if len(lyric) == 0:
        return 0.0
    for x in lyric:
        #Making sure that the word exists in the reference word list
        if len(full_word_list.loc[full_word_list['Word'] == x]['Weight'].values) != 0:
            sum_weight += full_word_list.loc[full_word_list['Word'] == x]['Weight'].values
    return float(sum_weight/len(lyric))
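As with create_text above, we can sanity-check the function on the first training song before applying it at scale (illustrative only; this assumes the previous cells have been run):
#Testing the function on the first training song (illustrative)
print(complexity_Song(create_text(lyrics.iloc[0]['words_freq'])))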
This implementation is inspired by the TF-IDF algorithm: if the occurrence of a word is low in the dataset, it is less common in the language, meaning that its lexical complexity is higher.
English stopwords are very common in every sentence; they are used so routinely that they do not add anything relevant to the analysis. This is the reason why we take them out: our complexity analysis must focus on words that do not appear regularly.
Having the data divided into genres is important for our analysis; however, we are still missing one key dimension to make our work relevant for social good: the topic that is addressed in the songs. We must be able to know which subject a song deals with; we will then aggregate the data per genre and be able to understand how a particular genre handles a specific topic. For this part we are still considering two options:
The work that was done in the second part of Homework 4. The good point is that we have a tuned algorithm to classify text into 20 different classes with more than 85% accuracy (a generic sketch of such a pipeline is given below). It is very powerful but not the exact algorithm we are looking for: it provides a strict classification, putting a text into one precise class, while we would rather have several tags for a particular song. Artists tend to treat several topics when writing a song, and the 20newsgroup classifier would limit our data treatment.
Some data cleaning will be required if we apply the 20newsgroup classifier. For instance, categories dealing with computer science are likely to be irrelevant when classifying songs, which is why they will be removed.
We are still exploring other possibilities because option 1 does not provide exactly the data we want. We have not yet encountered an algorithm that would provide a categorization with several tags. We are aware that we may not find such a classifier, but since the 20newsgroup one is already fully implemented and ready to use (TF-IDF and RandomForest are coded), we will look for another option for as long as we can.
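For reference, here is a minimal sketch of the kind of pipeline mentioned above: TF-IDF features and a RandomForest classifier trained on the 20newsgroups data shipped with scikit-learn. This is a generic illustration under those assumptions, not the tuned classifier from Homework 4:
#Minimal sketch: TF-IDF + RandomForest on 20newsgroups (illustrative, not the tuned Homework 4 model)
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
topic_clf = make_pipeline(TfidfVectorizer(stop_words='english'), RandomForestClassifier(n_estimators=100))
topic_clf.fit(newsgroups.data, newsgroups.target)
#Predicting a single class for the first training song's bag of words (illustrative)
predicted = topic_clf.predict([create_text(lyrics.iloc[0]['words_freq'])])[0]
print(newsgroups.target_names[predicted])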
In order to show how we will proceed on the dataset, we will use the first 100 songs of one genre as an example (in this case Rap). We know that this sample is too small to draw conclusions; for now we are only interested in running the code and producing visualizations. We want to make sure that this code works and provides appealing visualizations. The generalization of this code will be implemented in the last part of the project.
We start with the polarity analysis:
In [21]:
#Extracting the Data of the genre
Rap_example = matches[(matches['genre'] == 'Rap')].head(100)
#Resetting index
Rap_example.reset_index(inplace=True)
#Initializing an empty column, in order to be able to iterate on it
Rap_example['Bags_of_words'] = ''
#Getting all the textual data in the DataFrame
for i in Rap_example.index:
    Rap_example.at[i, 'Bags_of_words'] = create_text(Rap_example.at[i, 'words_freq'])
In [22]:
#Displaying results
display(Rap_example.head())
In [23]:
#Applying the polarity analysis for the bags of words
for i in Rap_example.index:
    Rap_example.at[i, 'Polarity_score'] = analyser.polarity_scores(Rap_example.at[i, 'Bags_of_words'])['compound']
#Sorting values to have a meaningful plot
Rap_example = Rap_example.sort_values('Polarity_score', ascending=False)
Rap_example = Rap_example.reset_index()
#plotting the results: Polarity
plt.bar(Rap_example.index, Rap_example['Polarity_score'])
plt.title('Polarity in Rap')
plt.show()
#Computing the polarity's mean value for the genre
print("The mean value is")
display(pd.to_numeric(Rap_example['Polarity_score'].mean()))
We like this representation. We find it meaningful and visually appealing. This is the reason why we will keep this representation as our main visualization.
We now compute the lexical complexity:
In [24]:
#Applying the lexical complexity analysis for the bags of words
for i in Rap_example.index:
    Rap_example.at[i,'Lexical_complexity'] = complexity_Song(Rap_example.at[i,'Bags_of_words'])
#Sorting values to have a meaningful plot
Rap_example = Rap_example.sort_values('Lexical_complexity', ascending=False)
#plotting the results: Lexical complexity
plt.bar(Rap_example.index, Rap_example['Lexical_complexity'])
plt.title('Lexical complexity in Rap')
plt.show()
#Computing the lexical complexity's mean value for the genre
print("The mean value is")
display(pd.to_numeric(Rap_example['Lexical_complexity'].mean()))
Finally, we can visualize the evolution of complexity through time for the given subset:
In [25]:
#Sorting values to have a meaningful plot
Rap_example = Rap_example.sort_values('year')
#plotting the results: Evolution of complexity through time
plt.bar(Rap_example['year'], Rap_example['Lexical_complexity'])
plt.title('Rap: Lexical complexity through time')
plt.show()
Now that we have the tools for a sentiment analysis, we must decide on a way to visualize the results. Having a straightforward visualization will help to compare the analysed features between genres, and this is precisely what we try to achieve with the following visualizations.
All the interactive visualization structures presented here will be displayed on a blog.
We first deal with polarity, then with lexical complexity.
Once we have both the topics and the polarity, we will be able to get an overview of how social topics are dealt with across music genres. We cannot draw any conclusions for this part without having every visual, but we believe that the comparison will provide good insight into how music treats social topics.
Once every visualization is generated, here is the representation structure that we want to have:
In [26]:
display(Image(filename='Data/Organigram_polarity_viz.png'))
We believe that there is a correlation between the complexity and the target audience. As asserted earlier in this work, we do not have the sociological expertise to match the complexity with precise social groups, but we assume that there is a correlation between the audience's age and educational level and the complexity of the lyrics.
We want to be able to visualize the complexity in two distinct ways. The first is from a topic perspective, with each genre displaying its complexity when treating a particular subject. Here is the display organigram we want to have:
In [27]:
display(Image(filename='Data/Organigram_complexity_viz.png'))
As before, edges stand for a user's click. When picking a genre, we will visualize the lexical complexity that is used. Comparing this value to the average of the genre will help to see if the target audience is more educated/older than the usual audience for the genre.
The second visualization we want to provide starts with topics on top. For every topic we want to see the mean lexical complexity used by each genre to treat that topic. This will allow us to know the targeted audience of each genre for a precise topic. Here is the data organigram:
In [28]:
display(Image(filename='Data/Organigram_complexity_topic_viz.png'))
We want to know what topic is addressed in each genre, and especially how it is spread across the globe. For that we will use the location data that we gathered. We want to produce maps for:
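Whatever the final list of maps ends up being, a minimal sketch (matplotlib only, using the Latitude and Longitude columns merged earlier) already shows where the artists of our dataset are located:
#Minimal sketch: scatter of the artist coordinates merged earlier (illustrative)
plt.scatter(pd.to_numeric(matches['Longitude']), pd.to_numeric(matches['Latitude']), s=1)
plt.title('Artist locations in the merged dataset')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.show()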
Because we are solely treating text, and the lyrics are not given in full, the volume of data we use is relatively small. This enables us to keep the original text files stored on our machines; storing intermediate values when using the notebook is not necessary since our machines' memory is large enough to handle the data.
We are convinced that the barplot is the most fitting representation for the generated graphs. We might find another library that makes it look more visually appealing, but we believe that it is the most meaningful way to show our results.
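As one option, seaborn (already imported above) can restyle the same barplot; here is a minimal sketch on the Rap example:
#Minimal sketch: the same polarity barplot restyled with seaborn (illustrative)
sns.set_style('whitegrid')
sorted_scores = Rap_example.sort_values('Polarity_score', ascending=False).reset_index(drop=True)
sns.barplot(x=sorted_scores.index, y=sorted_scores['Polarity_score'], color='steelblue')
plt.xticks([])
plt.title('Polarity in Rap (seaborn styling)')
plt.show()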
Once we have a topic-based split of our data, we will be able to compare how topics are treated in different genres.
04.12.
11.12.
19.12.