In the present notebook we perform a first descriptive analysis of a subset of the Million Song Dataset. This subset is available here.
In [1]:
import numpy as np
from scipy.stats import kurtosis, skew
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sb
import sqlite3
%matplotlib inline
plt.rcParams['figure.figsize'] = (8,6)
plt.rc('axes', titlesize=18)
plt.rc('axes', labelsize=15)
sb.set_palette('Dark2')
sb.set_style('whitegrid')
path = '../MillionSongSubset/'
For this demo the database (.db) file is in a separate folder named MillionSongSubset/AdditionalFiles/.
An alternative to the lines below can be found at the following address: https://labrosa.ee.columbia.edu/millionsong/sites/default/files/tutorial1.py.txt
That demo was specifically tailored for reading these files.
In [2]:
# creating the connector for the SQL database
con_meta = sqlite3.connect(path+'AdditionalFiles/subset_track_metadata.db')
cur_meta = con_meta.cursor()
In [3]:
# creating and executing an SQL request
res = con_meta.execute("SELECT name FROM sqlite_master WHERE type='table';")
# printing available table names
for name in res:
    print(name[0])
The metadata table contains basic information such as the artist name, song title, album name and year of release, but also measures of artist hotness and familiarity:
What insights can we find in there?
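Before loading everything into pandas, we can peek at the columns of the songs table directly in SQLite. The sketch below uses the standard PRAGMA table_info statement on the connection created above; it only assumes the table is named songs (as used in the next cell).
In [ ]:
# Listing the column names and SQLite types of the 'songs' table
for cid, name, ctype, *rest in cur_meta.execute("PRAGMA table_info(songs);"):
    print('{:<20s} {}'.format(name, ctype))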
In [4]:
# In this cell we load the dataset into a pandas.DataFrame, omitting entries without year information
songs = pd.read_sql_query('SELECT * FROM songs WHERE year!=0', con_meta)
songs.head(5)
Out[4]:
In [5]:
# Then we display some basic information to see
# which kind of data we are dealing with and check for potential NaN values
songs.info()
There are no missing values, but that does not mean there are no suspicious values.
In the following we explore the distribution of songs in the multi-dimensional feature space. Such a descriptive analysis step is essential to later develop a critical understanding of any trends or interesting facts we might discover about the Million Song Dataset.
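As a quick sanity-check sketch (not part of the original analysis), we can look for duplicated song identifiers, non-positive durations, and hotness/familiarity values outside the expected [0, 1] range, using only columns we have already seen.
In [ ]:
# Quick sanity checks for suspicious values
print('duplicated song_id        :', songs.song_id.duplicated().sum())
print('non-positive durations    :', (songs.duration <= 0).sum())
print('hotness outside [0, 1]    :', (~songs.artist_hotttnesss.between(0.0, 1.0)).sum())
print('familiarity outside [0, 1]:', (~songs.artist_familiarity.between(0.0, 1.0)).sum())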
In [6]:
# simple statistical description
songs.year.describe()
Out[6]:
Having information about the minimum, maximum, mean, median, quartiles, etc. is interesting, but it is probably more informative to look at the distribution of songs per year in detail.
In [7]:
# first we count the number of songs per year in chronological order.
songs_per_yr = songs.year.value_counts().sort_index()
songs_per_yr.head(5)
Out[7]:
In [8]:
# add missing years to the series with a count of 0
songs_per_yr = songs_per_yr.reindex(index=list(range(songs.year.min(), songs.year.max() + 1)),
                                    fill_value=0)
In [9]:
# Visualizing the number of songs per year
l_col = songs_per_yr / songs_per_yr.max()
songs_per_yr.plot.bar(color=cm.plasma(l_col), figsize=(12,7))
plt.xlabel('Year')
plt.ylabel('Number of Songs')
plt.title('Number of Songs vs Year');
Looking at the very unbalanced distribution of songs per year, it appears that the initial statistical indicators do not carry very meaningful information. We can clearly see that the dataset is biased toward more recent songs.
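To put a rough number on this bias, here is a small sketch; the 1990 cutoff is an arbitrary choice for illustration.
In [ ]:
# Fraction of songs released in 1990 or later
frac_recent = (songs.year >= 1990).mean()
print('Share of songs from 1990 onwards : {:.1%}'.format(frac_recent))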
Now that we know there is a year bias, let's look at the song authors. Is there another bias?
In [10]:
# Counting the number of entries per artist
songs_per_artist = songs.artist_name.value_counts()
songs_per_artist[songs_per_artist >= 8].sort_values().plot.barh(color='midnightblue', figsize=(7,7))
plt.title('Songs per Artist (8 and more)')
plt.xlabel('Number of Songs')
plt.ylabel('Artist Name');
In [11]:
# Number of different albums released per artist
songs_per_art_and_rel = songs.groupby(['artist_name']).release.nunique()
songs_per_art_and_rel[songs_per_art_and_rel >= 6].sort_values().plot.barh(color='midnightblue', figsize=(6,10))
plt.title('Albums per Artist (6 and more)')
plt.xlabel('Number of Albums')
plt.ylabel('Artist Name');
The dataset is made of 4680 songs, while no artist has more than 11 songs or 10 unique albums... we can reasonably conclude that there is no bias toward a specific artist or album.
In [12]:
songs.duration.describe()
Out[12]:
In [13]:
# Looking at the song duration distribution in bins of 10 seconds
songs.duration.plot.hist(bins=np.arange(0.0, 1610.0, 10.0), figsize=(12,7))
# Visualizing the quartiles
plt.axvline(181.1522, linestyle=':', lw=1.0, c='k')
plt.axvline(227.3824, linestyle='-.', lw=1.0, c='k')
plt.axvline(278.40608, linestyle=':', lw=1.0, c='k')
plt.xlim(0.0,1600)
plt.xlabel('Song Duration [sec]')
plt.ylabel('Number of Songs')
plt.title('Number of Songs VS Duration');
Despite a couple of pretty long songs (nearly 27 minutes for the maximum), the distribution appears roughly symmetrical. This is partially confirmed by the small difference between the mean and the median.
Confirmation of the degree of symmetry can be found in the skewness of the distribution:
In [14]:
# computing distribution skewness
print('Skewness : {:.3f}'.format(skew(songs.duration.values)))
The fact that the skewness is positive but not very large indicates that the distribution is slightly asymmetric, with a longer tail for values higher than the median (as we can see).
The kurtosis is often computed as well to give an idea of how "flat/pointy" the distribution is. In our case we expect a positive value (rather pointy):
In [15]:
# computing the kurtosis
print('Kurtosis : {:.3f}'.format(kurtosis(songs.duration.values)))
In [16]:
songs[['artist_hotttnesss','artist_familiarity']].describe()
Out[16]:
In [19]:
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(12,6), sharex=True, sharey=True)
songs.artist_familiarity.hist(bins=np.arange(0.0,1.02,0.02), ax=ax[0], color='C2');
ax[0].set_title('Artist Familiarity')
ax[0].axvline(0.54075, linestyle=':', lw=1.0, c='k')
ax[0].axvline(0.622875, linestyle='-.', lw=1.0, c='k')
ax[0].axvline(0.72723, linestyle=':', lw=1.0, c='k')
songs.artist_hotttnesss.hist(bins=np.arange(0.0,1.02,0.02), ax=ax[1], color='C1');
ax[1].set_title('Artist Hotness')
ax[1].axvline(0.36682, linestyle=':', lw=1.0, c='k')
ax[1].axvline(0.420938, linestyle='-.', lw=1.0, c='k')
ax[1].axvline(0.511054, linestyle=':', lw=1.0, c='k')
ax[-1].set_xlim(0.0,1.0)
plt.subplots_adjust(wspace=0.05);
There is not much information to extract from the one-dimensional analysis of these two features; later we will come back to the relationship between hotness and familiarity, which shows much more interesting patterns.
One-dimensional analysis is by definition limited, and in this section we investigate correlations between features and possible existing patterns.
An easy way to visualize possible patterns is to use a pair grid where every feature is plotted against the others:
In [20]:
g = sb.PairGrid(songs, vars=['year','duration','artist_familiarity','artist_hotttnesss'])
g.map_upper(sb.kdeplot)
g.map_lower(plt.scatter, marker='.', edgecolor='w')
g.map_diag(plt.hist, bins=51);
We can clearly see that there is a correlation between artist familiarity and hotness. In order to quantify this correlation we compute the (Pearson) correlation coefficient between all pairs of features:
In [21]:
features = ['year','duration','artist_familiarity','artist_hotttnesss']
corr_mat = np.corrcoef(songs[features].values.T)
ax = sb.heatmap(corr_mat, xticklabels=features, yticklabels=features,
cmap='RdBu', vmin=-1.0, vmax=1.0, annot=True)
plt.setp( ax.xaxis.get_majorticklabels(), fontsize=15, rotation=60 )
plt.setp( ax.yaxis.get_majorticklabels(), fontsize=15, rotation=0 );
Well, artist familiarity and hotness are highly correlated, with a coefficient of ~0.81, while the other features are much more independent of each other (coefficients < 0.1).
In the following we will take a closer look at the familiarity-hotness pair and try to better understand the correlation between these two features.
In [22]:
fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, sharey=True,
figsize=(12,6))
songs.plot.scatter(x='artist_familiarity', y='artist_hotttnesss', marker='+', ax=ax[0])
songs.plot.hexbin(x='artist_familiarity', y='artist_hotttnesss', ax=ax[1],
gridsize=41, mincnt=1.0, cmap='viridis', colorbar=False)
ax[0].set_title('Scatter Plot')
ax[1].set_title('Density Plot')
plt.subplots_adjust(wspace=0.03);
When looking at the previous figures one can easily spot a "main sequence" (a term borrowed from astronomy) where hotness increases linearly with familiarity, and a number of cases where hotness is nearly zero, if not exactly zero. In addition, there seems to be a different relationship between hotness and familiarity at high familiarity (>0.8), and numerous outliers surrounding the "main sequence" and at low familiarity (<0.4).
To better understand the pattern(s) in this 2-D space we choose to use an unsupervised clustering approach with the hope that it will reveal interesting structure.
We decided to illustrate this approach with two methods: Gaussian Mixture Models (GMM) and Agglomerative Clustering.
Both methods require the user to define the number of clusters a priori; for GMM this choice can then be validated using measures such as the $\log[Likelihood]$, the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC).
Hereafter we exclude the collection of points with hotness at zero, as it may very well prevent clustering from giving sensible results.
In [24]:
# checking the number of points at hotness = 0
songs[songs.artist_hotttnesss == 0.0].song_id.nunique()
Out[24]:
In [25]:
# importing the methods
from sklearn.mixture import GaussianMixture
from sklearn.cluster import AgglomerativeClustering
In [26]:
# Selecting only the useful part of the dataset (as a copy, since we add columns to it later)
songs_cl = songs.loc[songs.artist_hotttnesss != 0.0, ['artist_familiarity', 'artist_hotttnesss']].copy()
In order to estimate the optimal number of clusters we iteratively compute these metrics with GMM:
In [29]:
# Here we compute the metrics for GMMs with different numbers of clusters,
# averaging over several random initializations
scores_n = []
for i in range(2, 12):
    print('number of clusters : {}'.format(i))
    gmm = GaussianMixture(n_components=i, covariance_type='full')
    bic, aic, log = [], [], []
    for j in range(7):
        gmm.fit(songs_cl)
        bic.append(gmm.bic(songs_cl))
        aic.append(gmm.aic(songs_cl))
        log.append(gmm.score(songs_cl))
    scores_n.append([np.mean(bic), np.mean(aic), np.mean(log)])
scores_n = np.asarray(scores_n)
In [30]:
# Visualizing the results
fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, figsize=(10,4))
ax[0].plot(range(2,12), scores_n[:,0] - np.min(scores_n[:,0]), 'C0')
ax[0].plot(range(2,12), scores_n[:,1] - np.min(scores_n[:,1]), 'C1')
ax[0].legend(['BIC','AIC'])
ax[0].set_xlabel('Number of Clusters $k$')
ax[1].plot(range(2,12), scores_n[:,2] - np.min(scores_n[:,2]), 'C2')
ax[1].legend(['Log Likelihood'])
ax[1].set_xlabel('Number of Clusters $k$');
When looking at the AIC and BIC, the optimal number of clusters is given by the minimum, while with the $\log[Likelihood]$ we look for a change in the slope (the "elbow").
It is always more sensible to look at several metrics when it comes to clustering.
With this in mind, the metrics seem to indicate that $k=4$ gives the best result for our dataset.
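For completeness, the same reading can be done programmatically from the scores computed above; note that, because of the random GMM initialisations, the exact argmin may vary slightly between runs.
In [ ]:
# k values corresponding to the minimum BIC and AIC (the grid starts at k=2)
k_values = np.arange(2, 12)
print('k with minimum BIC :', k_values[np.argmin(scores_n[:, 0])])
print('k with minimum AIC :', k_values[np.argmin(scores_n[:, 1])])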
Let's look at the resulting clustering:
In [31]:
# Defining GMM with k=4 clusters
gmm_4 = GaussianMixture(n_components=4, covariance_type='full')
gmm_4.fit(songs_cl)
gmm_4_pred = gmm_4.predict_proba(songs_cl)
In [44]:
# Visualizing the result
plt.scatter(songs_cl.artist_familiarity, songs_cl.artist_hotttnesss, c=np.argmax(gmm_4_pred, axis=1),
marker='.', alpha=0.5, cmap='Dark2_r')
plt.title('GMM with k=4')
plt.xlabel('Artist Familiarity')
plt.ylabel('Artist Hotness');
Well... even if the metrics recommend four clusters, the resulting groups differ significantly from our intuition. Let's try with $k=3$:
In [33]:
# Defining GMM with k=3 clusters
gmm_3 = GaussianMixture(n_components=3, covariance_type='full')
gmm_3.fit(songs_cl)
gmm_3_pred = gmm_3.predict_proba(songs_cl)
In [34]:
# Visualizing the result
plt.scatter(songs_cl.artist_familiarity, songs_cl.artist_hotttnesss, c=np.argmax(gmm_3_pred, axis=1),
marker='.', alpha=0.5, cmap='Dark2_r')
plt.title('GMM with k=3')
plt.xlabel('Artist Familiarity')
plt.ylabel('Artist Hotness');
That is much closer to the intuition we had. Let's have a look at what Agglomerative Clustering suggests (with 4 and 3 clusters):
In [35]:
# Agglomerative Clustering with k=4 clusters
agg_4 = AgglomerativeClustering(n_clusters=4, linkage='average')
agg_4_pred = agg_4.fit_predict(songs_cl)
In [45]:
# Agglomerative Clustering with k=3 clusters
agg_3 = AgglomerativeClustering(n_clusters=3, linkage='average')
agg_3_pred = agg_3.fit_predict(songs_cl)
In [46]:
fig, ax = plt.subplots(nrows=1, ncols=2, sharex=True, sharey=True, figsize=(12,6))
ax[0].scatter(songs_cl.artist_familiarity, songs_cl.artist_hotttnesss, c=agg_3_pred,
marker='.', alpha=1.0, cmap='Dark2_r')
ax[0].set_title('Agg. Clus. with k=3')
ax[0].set_xlabel('Artist Familiarity')
ax[0].set_ylabel('Artist Hotness')
ax[1].scatter(songs_cl.artist_familiarity, songs_cl.artist_hotttnesss, c=agg_4_pred,
marker='.', alpha=1.0, cmap='Dark2_r')
ax[1].set_title('Agg. Clus. with k=4')
ax[1].set_xlabel('Artist Familiarity')
plt.subplots_adjust(wspace=0.025);
It is also possible to look separately at the clusters produced by each method.
In [47]:
songs_cl['gmm'] = np.argmax(gmm_3_pred, axis=1)
songs_cl['agg_clus'] = agg_3_pred
In [48]:
sb.lmplot(data=songs_cl, x='artist_familiarity', y='artist_hotttnesss',
hue='gmm', col='gmm',
fit_reg=False, markers='.', scatter_kws={'alpha':0.7});
In [49]:
sb.lmplot(data=songs_cl, x='artist_familiarity', y='artist_hotttnesss',
hue='agg_clus', col='agg_clus',
fit_reg=False, markers='.', scatter_kws={'alpha':0.7});
In our case the nature of the GMM method results in two clusters separated by a gap, while both methods recover a fairly similar "main sequence". Therefore we will keep the Agglomerative Clustering results.
Clustering quality assessment can be very tedious and there exist many metrics for this purpose. However, we choose to limit ourselves to the previous qualitative analysis.
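If one did want a quantitative check anyway, a common option is the silhouette score from scikit-learn (values in [-1, 1], higher is better); a minimal sketch comparing the two k=3 labelings:
In [ ]:
from sklearn.metrics import silhouette_score

# Silhouette score of each k=3 labeling in the familiarity-hotness plane
X = songs_cl[['artist_familiarity', 'artist_hotttnesss']].values
print('Silhouette, GMM k=3        : {:.3f}'.format(silhouette_score(X, songs_cl.gmm)))
print('Silhouette, Agg. Clus. k=3 : {:.3f}'.format(silhouette_score(X, songs_cl.agg_clus)))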
In [53]:
# Linear fit of the 'main sequence' cluster (label 2 in the Agglomerative Clustering run above)
songs_linreg = np.polyfit(songs_cl.artist_familiarity[songs_cl.agg_clus == 2].values,
                          songs_cl.artist_hotttnesss[songs_cl.agg_clus == 2].values, 1)
print("Linear Regression of the 'main sequence' :\n\
Hotness = {:.4f} * Familiarity + {:.4f}".format(songs_linreg[0],songs_linreg[1]))
We can also look at what a linear regression would have given on the GMM result:
In [54]:
songs_linreg_gmm = np.polyfit(songs_cl.artist_familiarity[songs_cl.gmm == 0].values,
songs_cl.artist_hotttnesss[songs_cl.gmm == 0].values, 1)
print("Linear Regression of the 'main sequence' :\n\
Hotness = {:.4f} * Familiarity + {:.4f}".format(songs_linreg_gmm[0],songs_linreg_gmm[1]))
As expected (and fortunately), the results are very similar. Let's visualize the result:
In [63]:
plt.figure(figsize=(8,7))
plt.scatter(songs.artist_familiarity, songs.artist_hotttnesss,
marker='.', edgecolors='none', c='C2', alpha=0.7)
plt.plot([0.0, 1.0], np.polyval(songs_linreg, [0.0, 1.0]), 'k', lw=1.0)
plt.xlabel('Artist Familiarity')
plt.ylabel('Artist Hotness');
This line can be viewed as a reference for the expected hotness given the artist familiarity.
In a way, points below the line correspond to artists whose songs are less hot than expected, while points above the line are artists that have exceeded expectations!
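One possible use of this reference line (a sketch, not part of the original analysis) is to rank artists by the gap between their actual hotness and the hotness predicted from their familiarity:
In [ ]:
# Residual = actual hotness minus the hotness predicted from familiarity
resid = songs.artist_hotttnesss - np.polyval(songs_linreg, songs.artist_familiarity)
over = songs.assign(residual=resid).groupby('artist_name').residual.mean()
print('Artists exceeding expectations the most :')
print(over.sort_values(ascending=False).head(5))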
In [ ]: