Using data acquired from the MusicBrainz database, cleaned and aggregated in this notebook, I have developed a set of observations around the songs recorded and released by two artists - The Beatles and The Rolling Stones. I chose these two artists for the enduring popularity and the time-tested legacy of their recorded output. The response by which I will measure their influence is measured by other artists recorded use of their songs, "cover versions" or "covers" in the vernacular.
I have compiled data from the source artists representing all songs recorded, the number of times the song was released by the artist, the number of countries for each release and the average rating for the song or release as reported by MusicBrainz.
I also acquired the lyrics for as many of the original songs as I could using Beautiful Soup to scrape the lyrics from (lyrics.wikia.com)[http://lyrics.wikia.com). I was able to acquire lyrical content for 96.5% of the original songs to apply sentiment analysis using TextBlob. The sentiment polarity was applied separately to both the song title and the lyrics themselves to create features to augment the song/release data. As I am able to show below, the lyric sentiment is one of the more predictive measurements, second only to the year a song was released.
With each row in my dataset - songname, artist, etc., I then merged the number of times the song was covered ("times_covered") and the number of artists covering the song ("artist_cnt"). I then created a binary response ("is_covered") as a simpler indicator that the song was used over the number of times used. This proved to be more predictable than the number of times covered. Also included is the average rating of the cover versions per song though I suspect the data is not so useful.
In [1]:
### Import as many items as possible to have available.
### Import data from CSV
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import cross_val_score
from sklearn.naive_bayes import MultinomialNB
data = pd.read_csv('data/Influence_clean.csv', header=0,encoding= 'utf-8', delimiter='|')
data['minreleasedate'] = pd.to_datetime(pd.Series(data.minreleasedate))
data['times_covered'].fillna(0, inplace=True)
data['artistid'] = data.artist.map({'The Beatles':0, 'The Rolling Stones':1})
In [2]:
data['artist'] = data.artist.astype('category')
data['songname'] = data.songname.astype('category')
# Add column for year of release for simpler grouping.
data['year'] = data['minreleasedate'].apply(lambda x: x.year)
# Make binary response - song has been covered or not. Far better accuracy over "times covered".
data['is_covered'] = data.times_covered.map(lambda x: 1 if x > 0 else 0)
In [19]:
data[(data.is_cover == 0) & (data.lyrics.isnull())].workid.count().astype(float)/data[(data.is_cover == 0)].workid.count()
Out[19]:
In [24]:
feature_cols = [ 'year','num_releases','lyric_sent','title_sent', 'countries', 'avg_rating']
X= data[data.is_cover == 0][feature_cols]
y = data[data.is_cover == 0].is_covered
print X.shape
print y.shape
In [88]:
### TODO - get null accuracy score. ????
means = []
for col in X.columns:
newmean = np.sqrt(metrics.mean_squared_error(X[col], y))
means.append(newmean.astype(float))
means = pd.Series(means)
print means.mean()
print zip(X.columns,means)
In [154]:
rfreg = RandomForestClassifier(n_estimators=175, max_features=6,oob_score=True, random_state=50)
rfreg.fit(X, y)
print rfreg.oob_score_
print cross_val_score(rfreg, X, y, cv=10, scoring='accuracy').mean()
At 92% and 90% respectively, both the out of bag and cross-validation scores are quite positive for the Random Forest Classifier.
I chose 6 features after running a looped evaluation of the maximum features for the model using cross validation.
In [152]:
feature_range = range(1, len(feature_cols)+1)
# list to store the average RMSE for each value of max_features
Acc_scores = []
# use 10-fold cross-validation with each value of max_features (WARNING: SLOW!)
for feature in feature_range:
rfreg = RandomForestClassifier(n_estimators=500, max_features=feature, random_state=50)
acc_val_scores = cross_val_score(rfreg, X, y, cv=10, scoring='accuracy')
Acc_scores.append(acc_val_scores.mean())
In [153]:
# plot max_features (x-axis) versus Accuracy score (y-axis)
plt.plot(feature_range, Acc_scores)
plt.xlabel('max_features')
plt.ylabel('Accuracy (higher is better)')
Out[153]:
In [155]:
treeclf = DecisionTreeClassifier(max_depth = 10, random_state=1)
treeclf.fit(X, y)
Out[155]:
In [157]:
### As shown in the bar chart below, year and lyric sentiment are better predictors of whether or not a song is covered.
In [160]:
pd.DataFrame({'feature':feature_cols, 'importance':treeclf.feature_importances_}).sort_values('importance').plot(kind='bar',x='feature',figsize=(16,5),fontsize='14',title="Feature Importance")
Out[160]:
Beginning to try out time series analysis below, however this exposed problems in the release dates either from my original queries or in the aggregation in Notebook 1. THis is on the TODO list.
While The Beatle disbanded by 1970, The Stones continue to this day. However, their early work appears far more influential, with the greatest body of influence more or less paralelling that of The more covered Beatles. Let's put a number on that, shall we?
In [109]:
### First plot is The Beatles, second The Rolling Stones. Sum of times_covered by year of oriiginal relase of the song.
yticks = np.arange(100, 1000, 100)
data[data.times_covered > 0].groupby('year').times_covered.sum().plot(kind='bar',x='year',y='times_covered',figsize=(16,9))
plt.title('Total songs covered by year of original',size =28)
plt.ylabel('Times Covered(Sum)', size = 24)
plt.yticks(yticks)
Out[109]:
In [108]:
bar = data.sort_values(by='year').groupby(['year', 'artist'])['times_covered'].sum().unstack('artist')
yticks = np.arange(25, 1000, 50)
bar.plot(kind='bar', stacked=True,figsize=(16,12),subplots='True')
plt.yticks(yticks)
plt.title('Total songs covered per Artist, by year of original',size =24)
plt.ylabel('Times Covered', size = 24)
Out[108]:
It appears that a pretty large percentage of the artist's catalogs have been covered at least once.
In [165]:
# Throw out covers recorded by each band and see what percentage of their catalogs have been covered.
data[(data.is_cover == 0) & (data.is_covered == 0)].groupby('artist').count()/data[(data.is_cover == 0)].groupby('artist').count()
Out[165]:
In [166]:
data[data.is_covered > 0].groupby(['artist', 'year']).songname.count().unstack('artist').plot(kind='bar',subplots='True',figsize=(16,9))
plt.title("Original Songs Released Per Year", size=20)
Out[166]:
In [126]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=101)
In [127]:
# Compute Null accuracy
y_null = np.zeros_like(y_test, dtype=float)
# fill the array with the mean value of y_test
y_null.fill(y_test.mean())
y_null
np.sqrt(metrics.mean_squared_error(y_test, y_null))
Out[127]:
In [132]:
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)
zip(feature_cols, logreg.coef_[0])
Out[132]:
In [133]:
print cross_val_score(logreg, X, y, cv=10, scoring='accuracy').mean()
In [135]:
y_pred_prob = logreg.predict_proba(X_test)[:,1 ]
print(y_pred_prob).mean()
In [136]:
# plot ROC curve
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')
Out[136]:
In [137]:
# calculate AUC
print metrics.roc_auc_score(y_test, y_pred_prob)
print cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()
In [ ]:
# TODO: Work on improving accuracy with Random Forest model, TIme Series Analysis
# and at least one Linear or Logistic Regression Model. Also create some basic bar charts to illustrate some basic assumptions about the data.