A Million Song: Milestone 2

Goals

This work aims to conduct a sentiment analysis across music genres. There are many ways to reach this goal, but it is important that the parameters and analysers we use fit our data. Sentiment analysis is a very large field, and we believe we used the libraries and functions that are the most appropriate for the chosen dataset. This work is divided into 4 parts:

  1. Data imports: structures, sorting and wrangling
  2. Classifiers: Choosing the methods of analysis and extracting features
  3. Visualization: How to meaningfully represent the data
  4. Decisions: What can be said from the outputs, and what is left to do
Dataset credits:

musiXmatch dataset, the official lyrics collection for the Million Song Dataset, available at: http://labrosa.ee.columbia.edu/millionsong/musixmatch


In [3]:
#Importing libraries
%matplotlib inline
import numpy as np
import pandas as pd
import re
import nltk
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display, Image
from nltk.corpus import stopwords

1. Data imports: structures, sorting and wrangling

  • Import the song titles and artists from musiXmatch: 779052 songs
  • Import the year from the Million Song Dataset additional files: 498466 songs
  • Import the genres from TagTraum: 255015 songs
  • Import the lyrics & the bags of words: 91625 songs

1.1. Importing the songs

For now we only use titles and artist names, so we can handle this part with the musiXmatch data alone. We download the data and put it into a dataframe with the musiXmatch ID (MXM_Tid) and the Million Song Dataset track ID (Tid). Because later data may be keyed by either identifier, we keep both IDs; we are fully aware that having two IDs adds no information, it only makes other datasets easier to merge.
For now, we get the artists and titles of 779052 songs.


In [4]:
#Importing the text file in a DataFrame, removing exceptions (incomplete data)
matches = pd.read_table('Data/mxm_779k_matches.txt', error_bad_lines=False)

#Changing the column's title in order to be clearer
matches.columns = ['Raw']

#Getting the Tid
matches['Tid'] = matches['Raw'].str.split('<SEP>', expand=True)[0]

#Extracting artist names
matches['Artist_Name'] = matches['Raw'].str.split('<SEP>', expand=True)[1]

#Extracting titles
matches['Title'] = matches['Raw'].str.split('<SEP>', expand=True)[2]

#Extracting MXM_Tid
matches['MXM_Tid'] = matches['Raw'].str.split('<SEP>', expand=True)[3]

#Dropping the header rows we do not need
matches = matches.drop(matches.index[:17])

#Dropping the column with raw data
matches = matches.drop('Raw', axis=1)

#set index Track ID
matches.set_index('Tid',inplace=True)

#Displaying results
display(matches.shape)
display(matches.head())


b'Skipping line 60821: expected 1 fields, saw 2\nSkipping line 126702: expected 1 fields, saw 2\n'
b'Skipping line 580629: expected 1 fields, saw 2\nSkipping line 632526: expected 1 fields, saw 2\n'
(779052, 3)
Artist_Name Title MXM_Tid
Tid
TRMMMKD128F425225D Karkkiautomaatti Tanssi vaan 4418550
TRMMMRX128F93187D9 Hudson Mohawke No One Could Ever 8898149
TRMMMCH128F425532C Yerba Brava Si Vos Querés 9239868
TRMMMXN128F42936A5 David Montgomery Symphony No. 1 G minor "Sinfonie Serieuse"/All... 5346741
TRMMMBB12903CB7D21 Kris Kross 2 Da Beat Ch'yall 2511405
Remarks:
  • There are two distinct identifiers for the same data. As mentioned above, we keep both IDs only to make merging other datasets easier; they do not add any information.
  • This dataframe only contains the artist and title; we still need further information such as the genre and the bag of words for each song.

1.2. Extracting the Year of the songs

We download the text file from the Million Song Dataset website, where it is provided as an additional file.
We merge the year data with the artists and song titles in the same dataframe.


In [5]:
#Loading the year of publication data, skipping incomplete data in order to avoid errors
years = pd.read_table('Data/tracks_per_year.txt', error_bad_lines=False)
#Changing the column's title in order to be clearer
years.columns = ['Raw']

#Getting the year of publication
years['year'] = years['Raw'].str.split('<SEP>', expand=True)[0]

#Getting the Tid
years['Tid'] = years['Raw'].str.split('<SEP>', expand=True)[1]

#Dropping the raw data
years = years.drop('Raw', axis=1)

#set index Track ID
years.set_index('Tid',inplace=True)

#Appending the years to the original DataFrame
matches = pd.merge(matches, years, left_index=True, right_index=True)


b'Skipping line 487582: expected 1 fields, saw 2\nSkipping line 487590: expected 1 fields, saw 2\n'

In [6]:
#display the results
print(matches.shape)
display(matches.head())


(498466, 4)
Artist_Name Title MXM_Tid year
Tid
TRMMMKD128F425225D Karkkiautomaatti Tanssi vaan 4418550 1995
TRMMMRX128F93187D9 Hudson Mohawke No One Could Ever 8898149 2006
TRMMMCH128F425532C Yerba Brava Si Vos Querés 9239868 2003
TRMMMBB12903CB7D21 Kris Kross 2 Da Beat Ch'yall 2511405 1993
TRMMMNS128F93548E1 3 Gars Su'l Sofa L'antarctique 7503609 2007

Remarks:

We delete the rows without year information, which is why the dataframe contains fewer songs. In order to be as complete and accurate as possible, we only keep full matches.

1.3 Importing genres

We will now append a genre to each track.
We download the data from the TagTraum dataset and merge it with our previous dataframe.


In [7]:
#Creating a DataFrame to store the genres:
GenreFrame = pd.read_table('Data/msd-topMAGD-genreAssignment.txt', names=['Tid', 'genre'])

#set index Track ID
GenreFrame.set_index('Tid',inplace=True)

#Merging the new data with the previous dataframe
matches = pd.merge(GenreFrame, matches, left_index=True, right_index=True)

In [8]:
#Displaying results
print(matches.shape)
display(matches.head())


(255015, 5)
genre Artist_Name Title MXM_Tid year
Tid
TRAAAAK128F9318786 Pop_Rock Adelitas Way Scream 8692587 2009
TRAAAAV128F421A322 Pop_Rock Western Addiction A Poor Recipe For Civic Cohesion 4623710 2005
TRAAABD128F429CF47 Pop_Rock The Box Tops Soul Deep 6477168 1969
TRAAAEF128F4273421 Pop_Rock Adam Ant Something Girls 3759847 1982
TRAAAEM128F93347B9 Electronic Son Kite Game & Watch 2626706 2004
Comment:

The dataframe once again contains fewer songs. We proceed this way for the same reason as mentioned in the previous part.

1.4. Importing Location

We download the file with the location of every artist from the additional files of the Million Song Dataset.


In [9]:
#Creating a DataFrame to store the location:
location = pd.read_csv('Data/artist_location.txt', sep="<SEP>",header=None,names=['ArtistID','Latitude','Longitude','Artist_Name','City'])
#Keeping only the useful columns
location.drop(['ArtistID','City'],inplace=True,axis=1)

#Merging the locations with the previous dataframe on the artist name
matches.reset_index(inplace=True)
matches = pd.merge(location, matches, on='Artist_Name')
matches.set_index('Tid',inplace = True)


C:\Users\santiago\Anaconda3\lib\site-packages\ipykernel\__main__.py:2: ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support regex separators (separators > 1 char and different from '\s+' are interpreted as regex); you can avoid this warning by specifying engine='python'.
  from ipykernel import kernelapp as app

In [10]:
#Displaying results
display(matches.head())
print(matches.shape)


Latitude Longitude Artist_Name genre Title MXM_Tid year
Tid
TRBAWHU128EF3563C5 51.59678 -0.33556 Screaming Lord Sutch Pop_Rock Murder In The Graveyard 7883403 1982
TREPTTY128EF3563C1 51.59678 -0.33556 Screaming Lord Sutch Pop_Rock Penny Penny 2546441 1982
TRJSKEX128EF3563C7 51.59678 -0.33556 Screaming Lord Sutch Pop_Rock London Rocker 2546440 1982
TRKCPUR128F92CF37D 51.59678 -0.33556 Screaming Lord Sutch Pop_Rock Jack The Ripper 4439345 1982
TRKYESP128EF3563C0 51.59678 -0.33556 Screaming Lord Sutch Pop_Rock Monster Rock 2546437 1982
(103401, 7)

1.5. Bags of words, extracting them from the train dataset

We downloaded the train data file, which covers 30% of the whole dataset. It contains a list of the 5000 most used words across the songs. We then build two dataframes:

  • One with the ID of every song and its lyrics. We merge this with our previous dataframe.

    The lyrics are presented as a list of (word id):(occurrence in song) pairs, e.g. 2:24,5:47,...

  • Another one with the 5000 top words of the songs (bag of words)

We work with only 30% of the whole dataset because the musiXmatch lyrics data is the only part that is freely available.
The rest of the data is not free; see https://developer.musixmatch.com/plans to verify.


In [11]:
#import file
lyrics = pd.read_table('Data/mxm_dataset_train.txt', error_bad_lines=False)

#change name of the column
lyrics.columns = ['Raw_Training']

# take the bag-of-words line (the 5000 top words) to use later
words_train = lyrics.iloc[16]

#drop the header rows
lyrics=lyrics[17:].copy()

# get TrackID, MXM_Tid and the lyrics, and put them into separate columns
def sortdata(x):
    splitted = x['Raw_Training'].split(',')
    x['Tid']=splitted[0]
    #x['MXM_Tid']=splitted[1]
    x['words_freq']=splitted[2:]
    return x

#Apply the function to every row
lyrics = lyrics.apply(sortdata,axis=1)
lyrics = lyrics[['Tid','words_freq']]

In [12]:
#set index Track ID
lyrics.set_index('Tid',inplace=True)

#Merging the lyrics with the previous DataFrame
matches = pd.merge(matches, lyrics, left_index=True, right_index=True)

In [13]:
#Displaying the results
print(matches.shape)
display(matches.head())


(38513, 8)
Latitude Longitude Artist_Name genre Title MXM_Tid year words_freq
Tid
TRAAAAV128F421A322 37.77916 -122.42005 Western Addiction Pop_Rock A Poor Recipe For Civic Cohesion 4623710 2005 [1:6, 2:4, 3:2, 4:2, 5:5, 6:3, 7:1, 8:1, 11:1,...
TRAAABD128F429CF47 35.14968 -90.04892 The Box Tops Pop_Rock Soul Deep 6477168 1969 [1:10, 3:17, 4:8, 5:2, 6:2, 7:1, 8:3, 9:2, 10:...
TRAAAEF128F4273421 35.83073 -85.97874 Adam Ant Pop_Rock Something Girls 3759847 1982 [1:5, 2:4, 3:3, 4:2, 5:1, 6:11, 9:4, 12:9, 13:...
TRAAAHJ128F931194C 39.74001 -104.99226 Devotchka Pop_Rock The Last Beat Of My Heart (b-side) 5133845 2004 [1:4, 2:11, 3:2, 4:7, 5:3, 6:5, 8:1, 9:3, 10:6...
TRAABIG128F9356C56 40.71455 -74.00712 Poe Pop_Rock Walk the Walk 678806 2000 [1:28, 2:77, 3:31, 4:41, 5:5, 6:13, 8:17, 9:5,...
Comments on the size:

Since we do not have access to the entire dataset, our analysis is limited to the 30% that is freely available through musiXmatch.

1.6. From generic bags of words to lyrics

We create a function that takes a song's list of word ids and their occurrences, i.e. (id of word):(occurrences in the song) pairs such as 2:24,5:47,...,
and outputs all the corresponding words in a list.

For example: [1:2, 2:5, 3:3] gives us --> [i, i, the, the, the, the, the, you, you, you]


In [14]:
#get the bag-of-words line extracted earlier
bag_of_words = words_train
# clean the data and split it to create a list of 5000 words
bag_of_words = bag_of_words.str.replace('%','')
bag_of_words = bag_of_words.str.split(',')

display(bag_of_words.head())


Raw_Training    [i, the, you, to, and, a, me, it, not, in, my,...
Name: 16, dtype: object

In [15]:
#Defining a function
def create_text(words_freq):
    #create the output string of all words
    list_words=''
    #iterate over every id of words
    for compteur in words_freq:
        
        word = bag_of_words[0][int(compteur.split(':')[0])-1]
        times = int(compteur.split(':')[1])
        
        #Separating every word with a space to be able to work on it with libraries during part 2
        for i in range(times):
            list_words += ' ' + word + ' '
    return list_words

In [16]:
#Testing the function
print(create_text(lyrics.iloc[0]['words_freq']))


 i  i  i  i  i  i  the  the  the  the  you  you  to  to  and  and  and  and  and  a  a  a  me  it  my  is  is  of  of  of  your  that  are  are  we  we  am  am  will  will  for  for  for  for  be  have  have  so  this  like  like  de  up  was  was  if  got  would  been  these  these  seem  someon  understand  pass  river  met  piec  damn  worth  flesh  grace  poor  poor  somehow  ignor  passion  tide  season  seed  resist  order  order  piti  fashion  grant  captur  captur  ici  soil  patienc  social  social  highest  highest  slice  leaf  lifeless  arrang  wilder  shark  devast  element 
Comments on part one:

As is noticeable at each step, we lose data every time we merge datasets. We chose this approach because we only want to deal with complete information in order to stay coherent. We want to compare parameters between items, and we believe the analysis is less relevant if we consider a larger dataset that contains incomplete data.

We now have 38 513 songs, but for each one we have all the features that we want to use. We will analyse our data with different parameters, which is why it is important that each song provides every feature. Later in the analysis we may use the data from 1.4 (103 401 songs) in order to get a broader overview.

2. Classifiers: Choosing the methods of analysis and extracting features

In order to analyse songs, we will use sentiment analysis on the lyrics. We chose two key features: polarity and lexical complexity. Because we only use bags of words, parameters such as rhymes and structure are not available; they should nevertheless be kept in mind when speaking of the overall complexity of lyrics.

2.1. Word polarity

Vader package

VADER, which stands for Valence Aware Dictionary and sEntiment Reasoner, is a sentiment analysis package that provides a polarity score for a given word or sentence. It is known to be a powerful tool, especially because it was built and validated on tweets, meaning that it covers most of modern vocabulary. This is especially relevant for our project because we deal with modern music, so the words used are as modern as the ones VADER was tuned on. The fact that the sentiment analyser draws on the same kind of vocabulary makes the analysis more relevant.

Polarity is expressed between -1 (negative polarity) and 1 (positive polarity).


In [17]:
import nltk.sentiment.sentiment_analyzer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


C:\Users\santiago\Anaconda3\lib\site-packages\nltk\twitter\__init__.py:20: UserWarning: The twython library has not been installed. Some functionality from the twitter package will not be available.
  warnings.warn("The twython library has not been installed. "

In [18]:
#Defining the analyser
analyser = SentimentIntensityAnalyzer()
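
As a quick, illustrative sanity check (the sentences below are made up, not taken from our dataset), the compound score behaves as expected on simple inputs:

#Illustrative check on made-up sentences: the first should be clearly positive, the second clearly negative
print(analyser.polarity_scores('i love this song')['compound'])
print(analyser.polarity_scores('i hate this noise')['compound'])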

2.2. Lyrics' complexity

Because we want to know what type of audience a specific type of music is targeting, we need to analyse the complexity of the lyrics. We are aware that dividing an audience into social profiles is far beyond the scope of our analysis: we do not have enough sociological knowledge to categorize an audience in a precise way. This is why we will use broad indicators. We want to know how complex a set of words is, and the only social assumption we make is that complexity is correlated with the age and educational level of the audience.

We use the occurrence of each word in the whole dataset.

Extracting the vocabulary

Importing the most used words and their counts inside the dataset in order to start a text-processing analysis.

Extracting additional features

The dataset comes with some metadata; the total word count is 55 163 335.

Because of the long-tail effect of language, we only keep the first 10 000 words of the list. This reduces computing time when iterating over the full_word_list.

From the vocabulary we remove stopwords. They appear so often at every level of language that they are not relevant for this analysis.

We then compute each word's percentage of occurrence, which will help us when dealing with lyrics' complexity, and use this percentage to derive a complexity weight: a frequently used word gets a low weight, while a rarely used word gets a high weight.
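
As a quick worked example using the values shown in the output below: the word "love" appears 298043 times, so its occurrence percentage is 298043 / 55163335 × 100 ≈ 0.540 %, and its weight is 1 / 0.540 ≈ 1.85.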


In [19]:
Word_count_total = 55163335

#Importing the data, putting it in a DataFrame
full_word_list = pd.read_table('Data/full_word_list.txt')
#Renaming the columns
full_word_list.columns = ['Word']
#Extracting word count
full_word_list['Count'] = pd.to_numeric(full_word_list['Word'].str.split('<SEP>', expand=True)[1])
#Extracting the words themselves
full_word_list['Word'] = full_word_list['Word'].str.split('<SEP>', expand=True)[0]
#Dropping rows we will not use
full_word_list = full_word_list.drop(full_word_list.index[:6])

#Keeping the first 10 000 values, because the rest is not stemmed and not necessarily in English
full_word_list = full_word_list.head(10000)


#Removing english stop words 
for word in full_word_list['Word']:
    if word in stopwords.words('english'):
        full_word_list = full_word_list[full_word_list.Word != word]
        
#Computing the percentage of occurrence:
full_word_list['Occurence_percentage'] = (full_word_list['Count']/ Word_count_total)*100
#Computing the complexity weight of each word
full_word_list['Weight']= 1/full_word_list['Occurence_percentage']

#Display
display(full_word_list.head(10))


Word Count Occurence_percentage Weight
31 love 298043.0 0.540292 1.850852
34 know 273137.0 0.495142 2.019621
39 like 227624.0 0.412636 2.423441
44 get 192961.0 0.349799 2.858782
46 go 182812.0 0.331401 3.017490
47 time 181248.0 0.328566 3.043528
48 come 180061.0 0.326414 3.063592
49 oh 178751.0 0.324040 3.086043
52 got 170641.0 0.309338 3.232713
55 one 164380.0 0.297988 3.355842

In [20]:
# function to get the complexity of one song by averaging the weights of all its words
def complexity_Song(lyrics):
    #create a variable to store the sum of the weights of every word in the song
    sum_weight= 0
    #split the lyrics to get a list of words and not just one big string
    lyric = lyrics.split(' ')
    
    #filtering empty values
    lyric = list(filter(None, lyric))
    
    #Removing every English stopword from the given lyrics
    lyric = [word for word in lyric if word not in stopwords.words('english')]
    
    for x in lyric:
        #Making sure the word exists in our vocabulary list
        if len(full_word_list.loc[full_word_list['Word'] == x]['Weight'].values) != 0 :
            sum_weight += full_word_list.loc[full_word_list['Word'] == x]['Weight'].values  

    return float(sum_weight/len(lyric))
Comment:

This implementation is inspired by the TF-IDF algorithm: if a word occurs rarely in the dataset, it is less common in the language, which means its lexical complexity is higher.

English stopwords are very common in every sentence; they are used so routinely that they do not add anything relevant to the analysis. This is why we take them out. Our complexity analysis must focus on words that do not appear regularly.
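
A quick illustrative check, assuming full_word_list is loaded as in the cell above and using only the weights shown in the table (the input is a hypothetical bag of words, not a song from the dataset):

#Illustrative check on a hypothetical bag of words (weights taken from the table above)
complexity_Song(' love  know  like ')
#≈ (1.8509 + 2.0196 + 2.4234) / 3 ≈ 2.10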

2.3. Topic classification

Having the data divided into genres is important for our analysis, however we are still missing one key dimension to make our work relevant for social good: the topic that is addressed in the songs. We must be able to know which subject is dealt with in a song; we will then aggregate the data per genre and understand how a particular genre handles a specific topic. For this part we are still considering two options:

Option 1: Use the classifier from the 20newsgroup

This is the work that was done in the second part of Homework 4. The good point is that we have a tuned algorithm that classifies text into 20 different classes with more than 85% accuracy. It is very powerful but not exactly the algorithm we are looking for: it provides a strict classification, putting a text into one precise class, while we would rather have several tags for a particular song. Artists tend to treat several topics when writing a song, and the 20newsgroup classifier would limit our data treatment.

Some data cleaning will be required if we apply the 20newsgroup classifier. For instance, categories dealing with computer science are likely to be irrelevant when classifying songs, so they will be removed.

Option 2: Find a deep learning algorithm/dataset that provides tags for songs

We are still exploring this possibility because option 1 does not provide exactly the data we want. We have not yet encountered an algorithm that provides a categorization with several tags. We are aware that we may not find such a classifier, but since the 20newsgroup one is already fully implemented and ready to use (TF-IDF and RandomForest are coded), we will look for another option for as long as we can.
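
For reference, here is a minimal sketch of what the option-1 pipeline could look like. The parameters and cleaning steps below are assumptions for illustration; our tuned pipeline from Homework 4 may differ:

#Minimal sketch: TF-IDF + RandomForest trained on the 20newsgroup data (assumed parameters)
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))
clf = make_pipeline(TfidfVectorizer(stop_words='english'), RandomForestClassifier(n_estimators=100))
clf.fit(newsgroups.data, newsgroups.target)

#Predicting a topic label for the reconstructed lyrics of one song
predicted = clf.predict([create_text(lyrics.iloc[0]['words_freq'])])
print(newsgroups.target_names[predicted[0]])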

2.4. Example: Polarity and Complexity in Rap

In order to show how we will proceed on the dataset, we use the first 100 songs of one genre as an example (in this case Rap). We know that this sample is too small to draw conclusions, but we are interested in running the analysis and visualization for now. We want to make sure that this code works and provides appealing visualizations. The generalization of this code will be implemented in the last part of the project.

We start with the polarity analysis:


In [21]:
#Extracting the Data of the genre
Rap_example = matches[(matches['genre'] == 'Rap')].head(100)

#Resetting index
Rap_example.reset_index(inplace=True)

#Initiating an empty column, in order to be able to iterate over it
Rap_example['Bags_of_words'] = ''
#Getting all the textual data in the DataFrame
for i in Rap_example.index:
    Rap_example.at[i, 'Bags_of_words'] = create_text(Rap_example.at[i, 'words_freq'])

In [22]:
#Displaying results
display(Rap_example.head())


Tid Latitude Longitude Artist_Name genre Title MXM_Tid year words_freq Bags_of_words
0 TRABIVW12903CFC962 -10.40633 -77.23871 Silkk The Shocker Rap Thug 'N' Me 1014500 1998 [1:22, 2:10, 3:26, 4:8, 5:8, 6:6, 7:36, 8:8, 9... i i i i i i i i i i i i i i i ...
1 TRACHHH128E0788A35 34.05349 -118.24532 Dilated Peoples Rap Big Business 2178075 2004 [1:8, 2:18, 3:4, 4:12, 5:6, 6:4, 7:3, 8:5, 9:1... i i i i i i i i the the the the t...
2 TRADODV128F4283EB6 34.05349 -118.24532 Mobb Deep Rap Got It Twisted 2683192 2004 [1:4, 2:17, 3:14, 4:19, 5:5, 6:6, 7:2, 8:10, 9... i i i i the the the the the the the...
3 TRAGUEL128F92C48CF 29.76045 -95.36978 Lil Mama Rap G-Slide (Tour Bus) 7318348 2007 [1:19, 2:24, 3:23, 4:21, 5:10, 6:7, 7:18, 8:18... i i i i i i i i i i i i i i i ...
4 TRAGVAO128F4289B5D 34.05349 -118.24532 Mobb Deep Rap Up North Trip 8876004 1995 [1:20, 2:40, 3:27, 4:26, 5:18, 6:25, 7:5, 8:16... i i i i i i i i i i i i i i i ...

In [23]:
#Applying the polarity analysis for the bags of words
for i in Rap_example.index:
    Rap_example.at[i, 'Polarity_score'] = analyser.polarity_scores(Rap_example.at[i, 'Bags_of_words'])['compound']
    
#Sorting values to have a meaningful plot
Rap_example = Rap_example.sort_values('Polarity_score', ascending=False)
Rap_example = Rap_example.reset_index()

#plotting the results: Polarity
plt.bar(Rap_example.index, Rap_example['Polarity_score'])
plt.title('Polarity in Rap')
plt.show()

#Computing the polarity's mean value for the genre
print("The mean value is")
display(pd.to_numeric(Rap_example['Polarity_score'].mean()))


The mean value is
-0.09608200000000001

We like this representation. We find it meaningful and visually appealing. This is the reason why we will keep this representation as our main visualization.

And now we compute the lexical complexity:


In [24]:
#Applying the lexical complexity analysis for the bags of words
for i in Rap_example.index:
    Rap_example.at[i,'Lexical_complexity'] = complexity_Song(Rap_example.at[i,'Bags_of_words'])

#Sorting values to have a meaningful plot
Rap_example = Rap_example.sort_values('Lexical_complexity', ascending=False)
#plotting the results: Lexical complexity
plt.bar(Rap_example.index, Rap_example['Lexical_complexity'])
plt.title('Lexical complexity in Rap')
plt.show()

#Computing the lexical complexity's mean value for the genre
print("The mean value is")
display(pd.to_numeric(Rap_example['Lexical_complexity'].mean()))


The mean value is
114.95028863171996

Finally, we can visualize the evolution of complexity through time for the given subset:


In [25]:
#Sorting values to have a meaningful plot
Rap_example = Rap_example.sort_values('year')

#plotting the results: Evolution of complexity through time

plt.bar(Rap_example['year'], Rap_example['Lexical_complexity'])
plt.title('Rap: Lexical complexity through time')
plt.show()


Comments:
  • This allows us to get an overview of the genre
  • The obtained graph could be compared with other genres.
  • Once we have data about the topics, we will be able to analyse and compare how genres treat specific topics.
  • Lexical complexity can be analysed through time, giving an overview of the evolution of the genre itself.

3. Visualization: How to meaningfully represent the data

Now that we have the tools for a sentiment analysis, we must decide how to visualize the results. Having a straightforward visualization will help to compare the analysed features between genres, and this is precisely what we try to achieve with the following visualizations.

All the interactive visualizations described here will be displayed on a blog.

3.1. Topic-based approach

We first deal with polarity, then with lexical complexity.

3.1.1. Dealing with polarity

Once we have the topics and the polarity, we will have an overview of how social topics are dealt with across music genres. We cannot draw any conclusion for this part without having every visual, but we believe that the comparison will provide good insight into how music treats social topics.

Once every visualization is generated, here is the representation structure we want to have:


In [26]:
display(Image(filename='Data/Organigram_polarity_viz.png'))


Description
  • Each edge stands for a user's click.
  • Rectangles represent the polarity barplots (for each genre/artist at that level).
  • The user picks a topic, which displays the polarity barplot of every genre treating that topic. Clicking on a genre's plot then displays the polarity barplot of every artist.

3.1.2. Dealing with complexity

We believe that there is a correlation between the complexity and the target audience. As asserted earlier in this work, we do not have the sociological expertise to match the complexity with precise social groups, but we assume that there is a correlation between the audience's age and educational level and the complexity of the lyrics.

We want to visualize the complexity in two distinct ways: first from a topic perspective, by having each genre display its complexity when treating a particular subject. Here is the display organigram we want to have:


In [27]:
display(Image(filename='Data/Organigram_complexity_viz.png'))


Description

As before, edges stand for a user's click. When picking a genre, we will visualize the lexical complexity it uses. Comparing this value to the average of the genre will help to see whether the target audience is more educated/older than the usual audience for the genre.

The second visualization we want to provide starts with topics on top. For every topic we want to see the mean lexical complexity used by each genre to treat that topic. This will let us know the targeted audience of each genre for a precise topic. Here is the data organigram:


In [28]:
display(Image(filename='Data/Organigram_complexity_topic_viz.png'))


3.2. Data by location

We want to know what topics are addressed in each genre, and especially how they are spread across the globe. For that we will use the location data we gathered. We want to produce maps for:

  • Complexity within a genre, in order to know how elaborate a genre is in different places around the world. We will use a heatmap to visualize the data (see the sketch after this list).
  • Polarity of genres when dealing with a certain topic. For instance we want to be able to know if a subject has a negative polarity in rock in the US while it is positive in the UK.
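
As a rough starting point before choosing a proper map library, the location columns we already merged can be plotted directly. This is only an assumed sketch using the Rap example from part 2.4, not the final heatmap:

#Rough sketch: lexical complexity by artist location for the Rap example (not the final heatmap)
plt.scatter(Rap_example['Longitude'].astype(float), Rap_example['Latitude'].astype(float),
            c=Rap_example['Lexical_complexity'], cmap='viridis', s=10)
plt.colorbar(label='Lexical complexity')
plt.xlabel('Longitude')
plt.ylabel('Latitude')
plt.title('Rap: lexical complexity by artist location')
plt.show()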

3.3. Chronological progression

In order to know how a genre evolved, we want to plot the lexical complexity of every genre through time. We will again use a barplot, with the chronology on the x-axis and the complexity on the y-axis.
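
A minimal sketch of that plot, reusing the Rap example from part 2.4 and aggregating by the yearly mean (the final version will loop over every genre):

#Sketch: mean lexical complexity per year for the Rap example (to be generalized to every genre)
yearly = Rap_example.groupby('year')['Lexical_complexity'].mean()
plt.bar(yearly.index, yearly.values)
plt.title('Rap: mean lexical complexity per year')
plt.xlabel('Year')
plt.ylabel('Mean lexical complexity')
plt.show()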

4. Decisions: What can be said from the outputs, and what is left to do

4.1. Intermediary conclusions

Data handling

Because we only treat text, and the lyrics are not given in full, the volume of data we use is relatively small. This lets us store the original text files on our machines, and no intermediate storage is needed when using the notebook since our computers' memory is large enough to handle the data.

Visualization

We are convinced that the barplot is the most fitting representation for the generated graphs. We might find another library that makes it more visually appealing, but we believe it is the most meaningful way to show our results.

Comparison

Once we have a topic-based split of our data, we will be able to compare how topics are treated in different genres.

4.2. Further objectives

  • Find a classification that provides topics; if no useful algorithm is found, use the one we built on the 20newsgroup data. Establish a host and a structure for the blog: now that the raw material is present, we must look for a layout that fits our project.
  • Expand our analysis to the whole dataset. As discussed earlier, we only treated a sample for this milestone, in order to show what we are aiming at.

4.3. Planning until milestone 3

04.12.

  • Decide which option to use for the topic analysis (if nothing relevant is found for option 2, we will use option 1).
  • Generalize our data treatment to the whole dataset.
  • Generate the visualizations we want to have.
  • Decide what type of storage is the most efficient for our generated graphics.
  • Start building the blog.

11.12.

  • Start building the blog.
  • Finish the written report
  • Debug our code

19.12.

  • Handout milestone

Questions for the TA

  • Is there a deep learning model that could classify our bags of words by topic? We believe that something implemented with TF-IDF as a starting point would be useful, but we could not find anything conclusive yet.
  • Is there a blog platform that could host our data structure and visualization for free?
  • Does it seem to you that the analysis we are aiming at is meaningful and relevant enough to be considered as "Handling data for social good"?