In the present notebook we perform a first descriptive analysis of a subset of the Million Song Dataset. This subset is available here.
In [1]:
import numpy as np
from scipy.stats import kurtosis, skew
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sb
import sqlite3
%matplotlib inline
plt.rcParams['figure.figsize'] = (8,6)
plt.rc('axes', titlesize=18)
plt.rc('axes', labelsize=15)
sb.set_palette('Dark2')
sb.set_style('whitegrid')
path = '../MillionSongSubset/'
For this demo the database (.db) file is in a separate folder named MillionSongSubset/AdditionalFiles/.
An alternative to the lines below can be found at the following address: https://labrosa.ee.columbia.edu/millionsong/sites/default/files/tutorial1.py.txt
That demo was specifically tailored for reading these files.
In [2]:
# creating the connector for the SQL database
con_meta = sqlite3.connect(path+'AdditionalFiles/subset_track_metadata.db')
cur_meta = con_meta.cursor()
In [3]:
# creating and executing an SQL request
res = con_meta.execute("SELECT name FROM sqlite_master WHERE type='table';")
# printing available table names
for name in res:
    print(name[0])
The metadata table contains basic information such as the artist name, song title, album name and year of release, but also measures of artist hotness and familiarity:
What insights can we find in there?
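Before loading everything into pandas, we can peek at the columns of the songs table directly in SQLite. The sketch below uses the standard PRAGMA table_info statement on the connection created above; it only assumes the table is named songs (as used in the next cell).
In [ ]:
# Listing the column names and SQLite types of the 'songs' table
for cid, name, ctype, *rest in cur_meta.execute("PRAGMA table_info(songs);"):
    print('{:<20s} {}'.format(name, ctype))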
In [4]:
# In this cell we load the dataset into a pandas.DataFrame, omitting entries without year information
songs = pd.read_sql_query('SELECT * FROM songs WHERE year!=0', con_meta)
songs.head(5)
Out[4]:
In [5]:
# Then we display some basic information to see
# which kind of data we are dealing with and check for potential NaN values
songs.info()
There are no missing values, but that does not mean there are no suspicious values.
In the following we explore the distribution of songs in the multi-dimensional feature space. Such a descriptive analysis step is essential to later develop a critical understanding of any trends or interesting facts we might discover about the Million Song Dataset.
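As a quick sanity-check sketch (not part of the original analysis), we can look for duplicated song identifiers, non-positive durations, and hotness/familiarity values outside the expected [0, 1] range, using only columns we have already seen.
In [ ]:
# Quick sanity checks for suspicious values
print('duplicated song_id        :', songs.song_id.duplicated().sum())
print('non-positive durations    :', (songs.duration <= 0).sum())
print('hotness outside [0, 1]    :', (~songs.artist_hotttnesss.between(0.0, 1.0)).sum())
print('familiarity outside [0, 1]:', (~songs.artist_familiarity.between(0.0, 1.0)).sum())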
In [6]:
# simple statistical description
songs.year.describe()
Out[6]:
Having information about the minimum, maximum, mean, median, quartiles, etc. is interesting, but it is probably more informative to look at the distribution of songs per year in detail.
In [7]:
# first we count the number of songs per year in chronological order.
songs_per_yr = songs.year.value_counts().sort_index()
songs_per_yr.head(5)
Out[7]:
In [8]:
# add missing years to the series with a count of 0
songs_per_yr = songs_per_yr.reindex(index=list(range(songs.year.min(), songs.year.max() + 1)),
                                    fill_value=0)
In [9]:
# Visualizing the number of songs per year
l_col = songs_per_yr / songs_per_yr.max()
songs_per_yr.plot.bar(color=cm.plasma(l_col), figsize=(12,7))
plt.xlabel('Year')
plt.ylabel('Number of Songs')
plt.title('Number of Songs vs Year');
Looking at the very unbalanced distribution of songs per year, it appears that the initial statistical indicators do not carry very meaningful information. We can clearly see that the dataset is biased toward more recent songs.
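To put a rough number on this bias, here is a small sketch; the 1990 cutoff is an arbitrary choice for illustration.
In [ ]:
# Fraction of songs released in 1990 or later
frac_recent = (songs.year >= 1990).mean()
print('Share of songs from 1990 onwards : {:.1%}'.format(frac_recent))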
Now that we know there is a year bias, let's look at the song authors. Is there another bias?
In [10]:
# Counting the number of entries per artist
songs_per_artist = songs.artist_name.value_counts()
songs_per_artist[songs_per_artist >= 8].sort_values().plot.barh(color='midnightblue', figsize=(7,7))
plt.title('Songs per Artist (8 and more)')
plt.xlabel('Number of Songs')
plt.ylabel('Artist Name');
In [11]:
# Number of different albums released per artist
songs_per_art_and_rel = songs.groupby(['artist_name']).release.nunique()
songs_per_art_and_rel[songs_per_art_and_rel >= 6].sort_values().plot.barh(color='midnightblue', figsize=(6,10))
plt.title('Albums per Artist (6 and more)')
plt.xlabel('Number of Albums')
plt.ylabel('Artist Name');
The dataset is made of 4680 songs, while no artist has more than 11 songs or 10 unique albums... we can reasonably conclude that there is no bias toward a specific artist or album.
In [12]:
songs.duration.describe()
Out[12]:
In [13]:
# Looking at the song duration distribution in bins of 10 seconds
songs.duration.plot.hist(bins=np.arange(0.0, 1610.0, 10.0), figsize=(12,7))
# Visualizing the quartiles
plt.axvline(181.1522, linestyle=':', lw=1.0, c='k')
plt.axvline(227.3824, linestyle='-.', lw=1.0, c='k')
plt.axvline(278.40608, linestyle=':', lw=1.0, c='k')
plt.xlim(0.0,1600)
plt.xlabel('Song Duration [sec]')
plt.ylabel('Number of Songs')
plt.title('Number of Songs VS Duration');
Despite a couple of pretty long songs (nearly 27 minutes for the maximum), the distribution appears roughly symmetrical. This is partially confirmed by the small difference between the mean and the median.
Confirmation of the degree of symmetry can be found in the skewness of the distribution:
In [14]:
# computing distribution skewness
print('Skewness : {:.3f}'.format(skew(songs.duration.values)))
The fact that the skewness is positive but not very large indicates that the distribution is slightly asymmetric, with a longer tail for values higher than the median (as we can see).
The kurtosis is often computed as well to give an idea of how "flat/pointy" the distribution is. In our case we expect a positive value (rather pointy):
In [15]:
# computing the kurtosis
print('Kurtosis : {:.3f}'.format(kurtosis(songs.duration.values)))
In [16]:
songs[['artist_hotttnesss','artist_familiarity']].describe()
Out[16]:
In [19]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12,6), sharex=True, sharey=True)
songs.artist_familiarity.hist(bins=np.arange(0.0,1.02,0.02), ax=ax[0], color='C2');
ax[0].set_title('Artist Familiarity')
ax[0].axvline(0.54075, linestyle=':', lw=1.0, c='k')
ax[0].axvline(0.622875, linestyle='-.', lw=1.0, c='k')
ax[0].axvline(0.72723, linestyle=':', lw=1.0, c='k')
songs.artist_hotttnesss.hist(bins=np.arange(0.0,1.02,0.02), ax=ax[1], color='C1');
ax[1].set_title('Artist Hotness')
ax[1].axvline(0.36682, linestyle=':', lw=1.0, c='k')
ax[1].axvline(0.420938, linestyle='-.', lw=1.0, c='k')
ax[1].axvline(0.511054, linestyle=':', lw=1.0, c='k')
ax[-1].set_xlim(0.0,1.0)
plt.subplots_adjust(wspace=0.05);
There is not much information to extract from the one-dimensional analysis of these two features; later we will come back to the relationship between hotness and familiarity, which shows much more interesting patterns.
One-dimensional analysis is by definition limited, and in this section we investigate correlations between features and possible existing patterns.
An easy way to visualize possible patterns is to use a pair grid where every feature is plotted against the others:
In [20]:
g = sb.PairGrid(songs, vars=['year','duration','artist_familiarity','artist_hotttnesss'])
g.map_upper(sb.kdeplot)
g.map_lower(plt.scatter, marker='.', edgecolor='w')
g.map_diag(plt.hist, bins=51);
We can clearly see that there is a correlation between artist familiarity and hotness. In order to quantify this correlation we compute the (Pearson) correlation coefficient between all pairs of features:
In [21]:
features = ['year','duration','artist_familiarity','artist_hotttnesss']
corr_mat = np.corrcoef(songs[features].values.T)
ax = sb.heatmap(corr_mat, xticklabels=features, yticklabels=features,
cmap='RdBu', vmin=-1.0, vmax=1.0, annot=True)
plt.setp( ax.xaxis.get_majorticklabels(), fontsize=15, rotation=60 )
plt.setp( ax.yaxis.get_majorticklabels(), fontsize=15, rotation=0 );
Well, artist familiarity and hotness are highly correlated, with a coefficient of ~0.81, while the other features are much more independent of each other (coefficients < 0.1).
In the following we will take a closer look at the familiarity-hotness pair and try to better understand the correlation between these two features.
In [22]:
fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, sharey=True,
figsize=(12,6))
songs.plot.scatter(x='artist_familiarity', y='artist_hotttnesss', marker='+', ax=ax[0])
songs.plot.hexbin(x='artist_familiarity', y='artist_hotttnesss', ax=ax[1],
gridsize=41, mincnt=1.0, cmap='viridis', colorbar=False)
ax[0].set_title('Scatter Plot')
ax[1].set_title('Density Plot')
plt.subplots_adjust(wspace=0.03);
When looking at the previous figures one can easily spot a "main sequence" (a term borrowed from astronomy) where hotness increases linearly with familiarity, and a number of cases where hotness is nearly zero, if not exactly zero. In addition, there seems to be a different relationship between hotness and familiarity at high familiarity (>0.8), and numerous outliers surrounding the "main sequence" and at low familiarity (<0.4).
To better understand the pattern(s) in this 2-D space we choose to use an unsupervised clustering approach with the hope that it will reveal interesting structure.
We decided to illustrate this approach with two methods: Gaussian Mixture Models (GMM) and Agglomerative Clustering.
Both methods require the user to define the number of clusters a priori; for GMM this choice can then be validated using measures such as the $\log[Likelihood]$, the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC).
Hereafter we exclude the collection of points with hotness at zero, as it may very well prevent clustering from giving sensible results.
In [24]:
# checking the number of points at hotness = 0
songs[songs.artist_hotttnesss == 0.0].song_id.nunique()
Out[24]:
In [25]:
# importing the methods
from sklearn.mixture import GaussianMixture
from sklearn.cluster import AgglomerativeClustering
In [26]:
# Selecting only the useful part of the dataset (as a copy, since we add columns to it later)
songs_cl = songs.loc[songs.artist_hotttnesss != 0.0, ['artist_familiarity', 'artist_hotttnesss']].copy()
In order to estimate the optimal number of clusters we iteratively compute these metrics with GMM:
In [29]:
# Here we compute the metrics for GMMs with different numbers of clusters,
# averaging over several random initializations
scores_n = []
for i in range(2, 12):
    print('number of clusters : {}'.format(i))
    gmm = GaussianMixture(n_components=i, covariance_type='full')
    bic, aic, log = [], [], []
    for j in range(7):
        gmm.fit(songs_cl)
        bic.append(gmm.bic(songs_cl))
        aic.append(gmm.aic(songs_cl))
        log.append(gmm.score(songs_cl))
    scores_n.append([np.mean(bic), np.mean(aic), np.mean(log)])
scores_n = np.asarray(scores_n)
In [30]:
# Visualizing the results
fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(10,4))
ax[0].plot(range(2,12), scores_n[:,0] - np.min(scores_n[:,0]), 'C0')
ax[0].plot(range(2,12), scores_n[:,1] - np.min(scores_n[:,1]), 'C1')
ax[0].legend(['BIC','AIC'])
ax[0].set_xlabel('Number of Clusters $k$')
ax[1].plot(range(2,12), scores_n[:,2] - np.min(scores_n[:,2]), 'C2')
ax[1].legend(['Log Likelihood'])
ax[1].set_xlabel('Number of Clusters $k$');
When looking at the AIC and BIC, the optimal number of clusters is given by the minimum, while with the $\log[Likelihood]$ we look for a change in the slope (the "elbow").
It is always more sensible to look at several metrics when it comes to clustering.
With this in mind, the metrics seem to indicate that $k=4$ gives the best result for our dataset.
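For completeness, the same reading can be done programmatically from the scores computed above; note that, because of the random GMM initialisations, the exact argmin may vary slightly between runs.
In [ ]:
# k values corresponding to the minimum BIC and AIC (the grid starts at k=2)
k_values = np.arange(2, 12)
print('k with minimum BIC :', k_values[np.argmin(scores_n[:, 0])])
print('k with minimum AIC :', k_values[np.argmin(scores_n[:, 1])])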
Let's look at the resulting clustering:
In [31]:
# Defining GMM with k=4 clusters
gmm_4 = GaussianMixture(n_components=4, covariance_type='full')
gmm_4.fit(songs_cl)
gmm_4_pred = gmm_4.predict_proba(songs_cl)
In [44]:
# Visualizing the result
plt.scatter(songs_cl.artist_familiarity, songs_cl.artist_hotttnesss, c=np.argmax(gmm_4_pred, axis=1),
marker='.', alpha=0.5, cmap='Dark2_r')
plt.title('GMM with k=4')
plt.xlabel('Artist Familiarity')
plt.ylabel('Artist Hotness');
Well... even if the metrics recommend four clusters, the resulting groups differ significantly from our intuition. Let's try with $k=3$:
In [33]:
# Defining GMM with k=3 clusters
gmm_3 = GaussianMixture(n_components=3, covariance_type='full')
gmm_3.fit(songs_cl)
gmm_3_pred = gmm_3.predict_proba(songs_cl)
In [34]:
# Visualizing the result
plt.scatter(songs_cl.artist_familiarity, songs_cl.artist_hotttnesss, c=np.argmax(gmm_3_pred, axis=1),
marker='.', alpha=0.5, cmap='Dark2_r')
plt.title('GMM with k=3')
plt.xlabel('Artist Familiarity')
plt.ylabel('Artist Hotness');
That is much closer to the intuition we had. Let's have a look at what Agglomerative Clustering suggests (with 4 and 3 clusters):
In [35]:
# Agglomerative Clustering with k=4 clusters
agg_4 = AgglomerativeClustering(n_clusters=4, linkage='average')
agg_4_pred = agg_4.fit_predict(songs_cl)
In [45]:
# Agglomerative Clustering with k=3 clusters
agg_3 = AgglomerativeClustering(n_clusters=3, linkage='average')
agg_3_pred = agg_3.fit_predict(songs_cl)
In [46]:
fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, sharey=True, figsize=(12,6))
ax[0].scatter(songs_cl.artist_familiarity, songs_cl.artist_hotttnesss, c=agg_3_pred,
marker='.', alpha=1.0, cmap='Dark2_r')
ax[0].set_title('Agg. Clus. with k=3')
ax[0].set_xlabel('Artist Familiarity')
ax[0].set_ylabel('Artist Hotness')
ax[1].scatter(songs_cl.artist_familiarity, songs_cl.artist_hotttnesss, c=agg_4_pred,
marker='.', alpha=1.0, cmap='Dark2_r')
ax[1].set_title('Agg. Clus. with k=4')
ax[1].set_xlabel('Artist Familiarity')
plt.subplots_adjust(wspace=0.025);
It is also possible to look separately at the clusters produced by each method.
In [47]:
songs_cl['gmm'] = np.argmax(gmm_3_pred, axis=1)
songs_cl['agg_clus'] = agg_3_pred
In [48]:
sb.lmplot(data=songs_cl, x='artist_familiarity', y='artist_hotttnesss',
hue='gmm', col='gmm',
fit_reg=False, markers='.', scatter_kws={'alpha':0.7});
In [49]:
sb.lmplot(data=songs_cl, x='artist_familiarity', y='artist_hotttnesss',
hue='agg_clus', col='agg_clus',
fit_reg=False, markers='.', scatter_kws={'alpha':0.7});
In our case the nature of the GMM method results in two clusters separated by a gap, while both methods recover a fairly similar "main sequence". Therefore we will keep the Agglomerative Clustering results.
Clustering quality assessment can be very tedious and there exist many metrics for this purpose. However, we choose to limit ourselves to the previous qualitative analysis.
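If one did want a quantitative check anyway, a common option is the silhouette score from scikit-learn (values in [-1, 1], higher is better); a minimal sketch comparing the two k=3 labelings:
In [ ]:
from sklearn.metrics import silhouette_score

# Silhouette score of each k=3 labeling in the familiarity-hotness plane
X = songs_cl[['artist_familiarity', 'artist_hotttnesss']].values
print('Silhouette, GMM k=3        : {:.3f}'.format(silhouette_score(X, songs_cl.gmm)))
print('Silhouette, Agg. Clus. k=3 : {:.3f}'.format(silhouette_score(X, songs_cl.agg_clus)))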
In [53]:
# Linear fit of the 'main sequence' cluster (label 2 in the Agglomerative Clustering run above)
songs_linreg = np.polyfit(songs_cl.artist_familiarity[songs_cl.agg_clus == 2].values,
                          songs_cl.artist_hotttnesss[songs_cl.agg_clus == 2].values, 1)
print("Linear Regression of the 'main sequence' :\n\
Hotness = {:.4f} * Familiarity + {:.4f}".format(songs_linreg[0],songs_linreg[1]))
We can also look at what a linear regression would have given on the GMM result:
In [54]:
songs_linreg_gmm = np.polyfit(songs_cl.artist_familiarity[songs_cl.gmm == 0].values,
songs_cl.artist_hotttnesss[songs_cl.gmm == 0].values, 1)
print("Linear Regression of the 'main sequence' :\n\
Hotness = {:.4f} * Familiarity + {:.4f}".format(songs_linreg_gmm[0],songs_linreg_gmm[1]))
As expected (and fortunately), the results are very similar. Let's visualize the result:
In [63]:
plt.figure(figsize=(8,7))
plt.scatter(songs.artist_familiarity, songs.artist_hotttnesss,
marker='.', edgecolors='none', c='C2', alpha=0.7)
plt.plot([0.0, 1.0], np.polyval(songs_linreg, [0.0, 1.0]), 'k', lw=1.0)
plt.xlabel('Artist Familiarity')
plt.ylabel('Artist Hotness');
This line can be viewed as a reference for the expected hotness given the artist familiarity.
In a way, points below the line correspond to artists whose songs are less hot than expected, while points above the line are artists that have exceeded expectations!
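One possible use of this reference line (a sketch, not part of the original analysis) is to rank artists by the gap between their actual hotness and the hotness predicted from their familiarity:
In [ ]:
# Residual = actual hotness minus the hotness predicted from familiarity
resid = songs.artist_hotttnesss - np.polyval(songs_linreg, songs.artist_familiarity)
over = songs.assign(residual=resid).groupby('artist_name').residual.mean()
print('Artists exceeding expectations the most :')
print(over.sort_values(ascending=False).head(5))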
In [ ]: