Quantifying the Influence of The Beatles and The Rolling Stones

Using data acquired from the MusicBrainz database, cleaned and aggregated in Notebook 1, I have developed a set of observations around the songs recorded and released by two artists: The Beatles and The Rolling Stones. I chose these two artists for the enduring popularity and time-tested legacy of their recorded output. The response by which I will measure their influence is other artists' recorded use of their songs, "cover versions" or "covers" in the vernacular.
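(The acquisition and cleaning live in Notebook 1. As a rough illustration of the kind of query involved, the musicbrainzngs client can pull work data from the MusicBrainz web service; the client choice and field names below are my assumptions, not necessarily what Notebook 1 used.)

In [ ]:
### Illustrative sketch only; assumes the musicbrainzngs package.
### Notebook 1 may have queried the database dump directly instead.
import musicbrainzngs

# MusicBrainz requires a descriptive user agent for all web-service clients.
musicbrainzngs.set_useragent('influence-notebook', '0.1')

# Search for works credited to an artist and list their titles.
works = musicbrainzngs.search_works(artist='The Beatles', limit=25)
for work in works['work-list']:
    print work['id'], work['title']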

I have compiled data for the source artists covering every song recorded, the number of times each song was released by the artist, the number of countries for each release, and the average rating for the song or release as reported by MusicBrainz.
I also acquired the lyrics for as many of the original songs as I could, using Beautiful Soup to scrape [lyrics.wikia.com](http://lyrics.wikia.com). I was able to acquire lyrical content for 96.6% of the original songs and applied sentiment analysis to it using TextBlob. Sentiment polarity was computed separately for the song title and for the lyrics themselves, creating features to augment the song/release data. As I show below, lyric sentiment is one of the more predictive measurements, second only to the year a song was released.
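A minimal sketch of that scraping-and-scoring step, assuming the lyrics live in a div with class 'lyricbox' (the structure lyrics.wikia.com pages used); the helper name and URL handling are illustrative:

In [ ]:
import requests
from bs4 import BeautifulSoup
from textblob import TextBlob

def score_song(lyrics_url, title):
    # Pull the lyric text out of the page; 'lyricbox' is the div class
    # lyrics.wikia.com used for the lyric body (an assumption here).
    soup = BeautifulSoup(requests.get(lyrics_url).text, 'html.parser')
    box = soup.find('div', class_='lyricbox')
    lyrics = box.get_text(separator='\n') if box else None

    # TextBlob polarity runs from -1 (negative) to +1 (positive).
    title_sent = TextBlob(title).sentiment.polarity
    lyric_sent = TextBlob(lyrics).sentiment.polarity if lyrics else None
    return lyrics, title_sent, lyric_sent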

To each row in my dataset (song name, artist, etc.) I then merged the number of times the song was covered ("times_covered") and the number of distinct artists covering it ("artist_cnt"). I then created a binary response ("is_covered") as a simpler indicator of whether the song was covered at all, rather than how many times. This proved easier to predict than the raw count of covers. Also included is the average rating of the cover versions per song, though I suspect that column is of limited use.
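A hedged sketch of that merge, assuming the per-cover rows sit in a second frame, covers_df, keyed by workid (the frame and column names are illustrative):

In [ ]:
### Illustrative sketch; covers_df and its columns are assumptions.
grouped = covers_df.groupby('workid')
cover_stats = pd.DataFrame({
    'times_covered': grouped.size(),
    'artist_cnt': grouped['artist'].nunique(),
}).reset_index()

# Left-join so originals with no covers survive with NaN, then zero-fill.
data = data.merge(cover_stats, on='workid', how='left')
data['times_covered'] = data['times_covered'].fillna(0)

# Binary response: was the song covered at all?
data['is_covered'] = (data['times_covered'] > 0).astype(int)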


In [1]:
### Import the libraries and estimators used throughout the notebook.
### Load the cleaned data from CSV.

%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score
from sklearn.naive_bayes import MultinomialNB

data = pd.read_csv('data/Influence_clean.csv', header=0, encoding='utf-8', delimiter='|')
data['minreleasedate'] = pd.to_datetime(data.minreleasedate)
data['times_covered'].fillna(0, inplace=True)
data['artistid'] = data.artist.map({'The Beatles':0, 'The Rolling Stones':1})

In [2]:
data['artist'] = data.artist.astype('category')
data['songname'] = data.songname.astype('category')
# Add column for year of release for simpler grouping.
data['year'] = data['minreleasedate'].apply(lambda x: x.year)

# Make binary response - song has been covered or not. Far better accuracy over "times covered".
data['is_covered'] = data.times_covered.map(lambda x: 1 if x > 0 else 0)

Check coverage of lyrics for the original songs.

96.6% have lyrics (and a lyric sentiment polarity score) in the dataset; the cell below computes the fraction that is missing.


In [19]:
data[(data.is_cover == 0) & (data.lyrics.isnull())].workid.count().astype(float)/data[(data.is_cover == 0)].workid.count()


Out[19]:
0.033457249070631967

Create the base set of features for fitting the models.


In [24]:
feature_cols = ['year', 'num_releases', 'lyric_sent', 'title_sent', 'countries', 'avg_rating']
X = data[data.is_cover == 0][feature_cols]
y = data[data.is_cover == 0].is_covered
print X.shape
print y.shape


(269, 6)
(269,)

In [88]:
### TODO: replace this with a proper null accuracy score (see the sketch below).
### The loop computes the RMSE between each raw feature and the binary response,
### which mostly reflects each feature's scale rather than its predictive power.
means = []
for col in X.columns:
    newmean = np.sqrt(metrics.mean_squared_error(X[col], y))
    means.append(float(newmean))
means = pd.Series(means)
print means.mean()
print zip(X.columns,means)


337.024746509
[('year', 1972.77953291843), (u'num_releases', 7.4922512635577387), (u'lyric_sent', 0.81210518610143201), (u'title_sent', 0.89708337806483351), (u'countries', 3.2210968138220979), (u'avg_rating', 36.946409497015942)]
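On the TODO above: for a binary response, the null accuracy is simply the accuracy of always predicting the majority class, so no model (or RMSE detour) is needed. A minimal sketch:

In [ ]:
# Majority-class baseline: y.mean() is the share of covered songs,
# so the null accuracy is whichever class share is larger.
null_acc = max(y.mean(), 1 - y.mean())
print null_acc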

Build model with Random Forest Classifier


In [154]:
rfreg = RandomForestClassifier(n_estimators=175,  max_features=6,oob_score=True, random_state=50)
rfreg.fit(X, y)
print rfreg.oob_score_
print cross_val_score(rfreg, X, y, cv=10, scoring='accuracy').mean()


0.921933085502
0.903347578348

At 92% and 90% respectively, both the out-of-bag and cross-validation scores are quite encouraging for the Random Forest Classifier.
I chose max_features=6 after running a looped, cross-validated evaluation of the maximum-features setting (next cell).


In [152]:
feature_range = range(1, len(feature_cols)+1)

# list to store the mean cross-validated accuracy for each value of max_features
Acc_scores = []

# use 10-fold cross-validation with each value of max_features (WARNING: SLOW!)
for feature in feature_range:
    rfreg = RandomForestClassifier(n_estimators=500, max_features=feature, random_state=50)
    acc_val_scores = cross_val_score(rfreg, X, y, cv=10, scoring='accuracy')
    Acc_scores.append(acc_val_scores.mean())

To this point, my somewhat thin collection of features has not proven very useful for predicting the number of times a song is covered (my basic premise for measuring influence). However, the Random Forest appears to prefer a whopping 4 features for its meager gain in accuracy, per the graph below.


In [153]:
# plot max_features (x-axis) versus Accuracy score (y-axis)
plt.plot(feature_range, Acc_scores)
plt.xlabel('max_features')
plt.ylabel('Accuracy (higher is better)')



Test with DecisionTreeClassifier

While seeking a less opaque model than the Random Forest, I tried the DecisionTreeClassifier, which exposes a handy feature_importances_ attribute for ranking the features.


In [155]:
treeclf = DecisionTreeClassifier(max_depth = 10, random_state=1)
treeclf.fit(X, y)


Out[155]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=1, splitter='best')
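To cash in on that transparency, the fitted tree can also be exported and rendered for direct inspection (a sketch; assumes Graphviz is installed to render the .dot file):

In [ ]:
from sklearn.tree import export_graphviz

# Write the fitted tree to a .dot file; render it with, e.g.,
#   dot -Tpng tree.dot -o tree.png
export_graphviz(treeclf, out_file='tree.dot', feature_names=feature_cols,
                class_names=['not covered', 'covered'], filled=True)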

As shown in the bar chart below, year and lyric sentiment are the strongest predictors of whether or not a song is covered.

In [160]:
feat_imp = pd.DataFrame({'feature': feature_cols, 'importance': treeclf.feature_importances_})
feat_imp.sort_values('importance').plot(kind='bar', x='feature', figsize=(16, 5), fontsize=14, title="Feature Importance")



I began trying out time series analysis below; however, this exposed problems in the release dates, stemming either from my original queries or from the aggregation in Notebook 1. This is on the TODO list.
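One quick way to surface those date problems before modeling is to flag release years outside each band's recording span (a sketch; the 1962 lower bound is my assumption for the earliest plausible release):

In [ ]:
# Flag release years outside the plausible window; anything that shows up
# here points back to a query or aggregation bug in Notebook 1.
this_year = pd.Timestamp.now().year
suspect = data[(data.year < 1962) | (data.year > this_year)]
print suspect[['artist', 'songname', 'minreleasedate']]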

A quick measure of songs covered by release year for both artists.


While The Beatles disbanded by 1970, The Stones continue to this day. However, their early work appears far more influential, with the greatest concentration of covers more or less paralleling that of the more-covered Beatles. Let's put a number on that, shall we?


In [109]:
### Sum of times_covered by year of the song's original release, both artists combined.
### (The per-artist breakdown, Beatles first and Stones second, follows in the next cell.)
yticks = np.arange(100, 1000, 100)
data[data.times_covered > 0].groupby('year').times_covered.sum().plot(kind='bar', figsize=(16, 9))
plt.title('Total covers by year of original release', size=28)
plt.ylabel('Times Covered (Sum)', size=24)
plt.yticks(yticks)



In [108]:
### Per-artist subplots: The Beatles first, The Rolling Stones second.
bar = data.sort_values(by='year').groupby(['year', 'artist'])['times_covered'].sum().unstack('artist')
yticks = np.arange(25, 1000, 50)
bar.plot(kind='bar', figsize=(16, 12), subplots=True)
plt.yticks(yticks)
plt.title('Total covers per artist, by year of original release', size=24)
plt.ylabel('Times Covered', size=24)



It appears that a pretty large percentage of both artists' catalogs has been covered at least once.


In [165]:
# Exclude covers recorded by the two bands themselves, then compute the fraction
# of each catalog that has never been covered (lower means more covered).
data[(data.is_cover == 0) & (data.is_covered == 0)].groupby('artist').count()/data[(data.is_cover == 0)].groupby('artist').count()


Out[165]:
                      workid  songname  minreleasedate  num_releases  countries  avg_rating
artist
The Beatles         0.090395  0.090395        0.090395      0.090395   0.090395    0.090395
The Rolling Stones  0.336957  0.336957        0.336957      0.336957   0.336957    0.336957

                      lyrics  title_sent  lyric_sent  is_cover  artist_cnt  cov_rating_avg
artist
The Beatles         0.053254    0.090395    0.090395  0.090395         0.0             0.0
The Rolling Stones  0.340659    0.336957    0.336957  0.336957         0.0             0.0

                    times_covered  artistid      year  is_covered
artist
The Beatles              0.090395  0.090395  0.090395    0.090395
The Rolling Stones       0.336957  0.336957  0.336957    0.336957

In [166]:
# Count of distinct covered originals per artist, by year of release.
data[data.is_covered > 0].groupby(['artist', 'year']).songname.count().unstack('artist').plot(kind='bar', subplots=True, figsize=(16, 9))
plt.title("Original Songs Covered, by Year of Release", size=20)



Test Logistic Regression Model


In [126]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=101)

In [127]:
# Null baseline: RMSE of a "model" that always predicts the mean of y_test
y_null = np.zeros_like(y_test, dtype=float)
# fill the array with the mean value of y_test
y_null.fill(y_test.mean())
np.sqrt(metrics.mean_squared_error(y_test, y_null))


Out[127]:
0.36823482428085047

The result, 0.37, is the null model's RMSE rather than a null accuracy. Because y_test.mean() works out to roughly 0.84, a majority-class baseline that always predicts "covered" would already be right about 84% of the time.

As we will see below, fitting the Logistic Regression estimator and computing a cross-validation score improves on that baseline, though only modestly.


In [132]:
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)
zip(feature_cols, logreg.coef_[0])


Out[132]:
[('year', -6.00176939197862e-05),
 ('num_releases', -0.12490383180456067),
 ('lyric_sent', 2.6363935941680432),
 ('title_sent', -1.0990470630763087),
 ('countries', 0.7737027649489393),
 ('avg_rating', -0.021820203303004219)]
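Since these are log-odds coefficients, exponentiating them gives a more intuitive read: the multiplicative change in the odds of being covered per one-unit increase in each feature. For example, a full point of lyric sentiment polarity multiplies the odds by roughly exp(2.64), about 14.

In [ ]:
# exp(coef) turns each log-odds coefficient into an odds multiplier.
print [(col, np.exp(coef)) for col, coef in zip(feature_cols, logreg.coef_[0])]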

In [133]:
print cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean()


0.866157916158

In [135]:
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
print y_pred_prob.mean()


0.805913171628

Compute ROC curve and AUC score.


In [136]:
# plot ROC curve
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')



In [137]:
# calculate AUC
print metrics.roc_auc_score(y_test, y_pred_prob)
print cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()


0.881977671451
0.797994071146

In [ ]:
# TODO: Work on improving accuracy with the Random Forest model, time series analysis,
# and at least one linear or logistic regression model. Also create some basic bar
# charts to illustrate some basic assumptions about the data.