This notebook contains all steps to import, clean, aggregate and merge the 3 datasets I queried from the MusicBrainz database.

Queries located here

The three datasets consist of:

allalbums_newmeta2.csv - All recordings by the bands The Beatles & The Rolling Stones.

bs_covers.csv - All cover versions (re-recordings) of original material from both bands.

work_cover.csv - A lookup of the workids representing the original songs from each band.


In [132]:
# Import data and modules. References to Plotly may be ignored for now, though the library offers far nicer 
# visualizations over basic Pandas and Matplotlib.

%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.cross_validation import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

from textblob import TextBlob, Word
from nltk.stem.snowball import SnowballStemmer


import plotly.plotly as py
import plotly.tools as tls
tls.embed('https://plot.ly/~cufflinks/8')
import cufflinks as cf

py.sign_in('maxrose', '••••••••••')

cf.set_config_file(offline=True, world_readable=False, theme='ggplot')

plt.rcParams['figure.figsize'] = 8, 6

### Frame with all cover versions
covers = pd.read_table('data/bs_covers.csv', encoding= 'utf-8',header=0,delimiter='|')

### Frame with workid and is_cover (written by either Beatles or Stones)
is_cover = pd.read_csv("data/work_cover.csv")

### Frame with source data for all recordings by Beatle and stones
bs = pd.read_table('data/allalbums_newmeta2.csv', encoding= 'utf-8',header=0,delimiter='|')
### fix incorrect releasedates
bs.set_value(bs.releasename.str.contains("Meet The Beatles"), 'releasedate', '1964-01-20')
bs.set_value(bs.releasename.str.contains("Singles Collection: The London Years"), 'releasedate', '1989-08-15')
bs.set_value(bs.releasename.str.contains("The Very Best of the Rolling Stones 1964-1971"), 'releasedate', '2011-01-01')
bs.set_value(bs.releasename.str.contains("Best Of the Beatles"), 'releasedate', '1997-01-01')
bs.set_value(bs.releasename.str.contains("1962 Live Recordings"), 'releasedate', '1988-01-01')
bs.set_value(bs.releasename.str.contains("Rockin' at the Star-Club"), 'releasedate', '1991-01-01')
bs.set_value(bs.releasename.str.contains("Biggest Mistake"), 'releasedate', '2006-01-01')
bs.set_value(bs.releasename.str.contains("Rolled Gold"), 'releasedate', '2007-11-12')
bs.set_value(bs.releasename.str.contains("Rewind"), 'releasedate', '1984-07-01')
bs.set_value(bs.releasename.str.contains("Rewind (1971–1984)"), 'releasedate', '1984-07-01')
bs.set_value(bs.releasename.str.contains("Hot Rocks"), 'releasedate', '1971-01-01')
bs.set_value(bs.releasename.str.contains("Highwire"), 'releasedate', '1991-01-01')
bs.set_value(bs.releasename.str.contains("Sad Sad Sad"), 'releasedate', '1990-01-01')
bs.set_value(bs.releasename.str.contains("Highwire"), 'releasedate', '1991-01-01')
bs.set_value(bs.releasename.str.contains("Perfect Collection"), 'releasedate', '1997-01-01')
bs.set_value(bs.releasename.str.contains("Reeperbahn"), 'releasedate', '1996-01-01')
bs.set_value(bs.releasename.str.contains("The Beatles Best: Super Star Hit Collection"), 'releasedate', '2010-01-01')
bs.set_value(bs.releasename.str.contains("The Best of Beatles"), 'releasedate', '1999-01-01')
bs.set_value(bs.releasename.str.contains("The Beatles Beat"), 'releasedate', '1964-04-01')
bs.set_value(bs.releasename.str.contains("Star-Club In Hamburg"), 'releasedate', '1977-01-01')
### format releasedate as datetime
bs['releasedate'] = pd.to_datetime(pd.Series(bs.releasedate))
#bs.head(4)


Ultimately the year (of the releasedate) is an important feature,I took the time to manually look up and implement code to fix.

In my inital SQL query I created a dummy releasedate of "1962-01-01" for items that had null release dates. It turns out that was a poor choice, and a number of compilations, oddball foreign releases and live releases skewed the releasedates for a large number of songs.


In [145]:
# Update various songname mistakes - better web-scraping matches for lyrics
bs.set_value(bs.songname.str.contains('Keep'), 'songname', 'Keep Your Hands Off My Baby')
bs.set_value(bs.songname.str.contains(u"Baby You’re a Rich Man"), 'songname', "Baby, You're a Rich Man")
bs.set_value(bs.songname.str.contains(u"It's Only Rock 'n' Roll "), 'songname', "It's Only Rock 'n Roll (But I Like It)")
bs.set_value(bs.songname.str.contains(u"When I’m Sixty‐Four"), 'songname', "When I'm Sixty-Four")
bs.set_value(bs.songname.str.contains(u"Ob‐La‐Di, Ob‐La‐Da"), 'songname', "Ob-La-Di, Ob-La-Da")
bs.set_value(bs.songname.str.contains(u"Sure To Fall"), 'songname', "Sure To Fall (In Love With You)")
bs.set_value(bs.songname.str.contains(u"Honey, Don’t!"), 'songname', "Honey, Don't")
bs.set_value(bs.songname.str.contains(u"Love in Vain Blues"), 'songname', "Love in Vain")
# Replace bad unicode character to apostrophe.
bs['songname'] = bs['songname'].str.replace(u"’", "'")

#misc['product_desc'] = misc['product_desc'].str.replace('\n', '')

In [146]:
### Add Times recorded for each song
vc = pd.DataFrame(bs.songname.value_counts().reset_index())
vc.columns = ['songname', 'timesrec']
bs = pd.merge(bs, vc, on='songname')

In [147]:
print bs.shape
print bs.describe()
print covers.shape
print covers.describe()

Perform aggregations using "agg" method, accepting a dictionary.


In [148]:
### aggregations for Beatles and Rolling Stones recordings
aggregations = {
    'releasename':{'num_releases':'count'}
    ,'releasedate':{'maxreleasedate':'min'}
    ,'rating': {'avg_rating':'min'}
    ,'country_cnt':{'countries':'nunique'}
    #,'songname': {'times_rec':'value_counts'}
}
#bsagg = bs.groupby(['workid', 'artist', 'songname', 'releasename','release_group','releasedate']).agg(aggregations).reset_index()
bsagg = bs.groupby(['workid', 'artist', 'songname']).agg(aggregations).reset_index()
bsagg.columns = bsagg.columns.droplevel(1)
bsagg.columns = ['workid','artist','songname','minreleasedate','num_releases','countries','avg_rating']

In [149]:
### Prepare for scraping Lyrics
# encoding: utf-8
import requests as req
from urllib import quote_plus
from bs4 import BeautifulSoup, SoupStrainer
only_lyrics = SoupStrainer('div',class_='lyricbox')

After trying out the Textblob library to incorporate sentiment polarity as a feature, I found "lyrics.wikia.com" to be simple to scrape the entire lyrics for most of the songs in my data, from which I will also extract sentiment analysis as an added feature.


In [150]:
### Function for scraping lyrics from Wikia using Beautiful Soup.
baseurl = 'http://lyrics.wikia.com/wiki/'
def getLyrics(workid,artist, song):
    #ssong = song.replace("’", "'")
    #print ssong
    tUrl = "%s%s:%s" %(baseurl,artist,song)
    result = req.get(tUrl)
    lyric = BeautifulSoup(result.text, 'html.parser', parse_only=only_lyrics).get_text(' ',strip=True)
    #lyrics.append(workid, song, lyric)
    return lyric

In [151]:
### Get lyrics for each song, add to aggregated data
lyrics = []
for idx, row in bsagg.iterrows():
    lyrics.append(getLyrics(row.workid, row.artist, row.songname))
bsagg['lyrics'] = lyrics

In [156]:
### Get sentiment of text
def detect_sentiment(text):
    return TextBlob(text).sentiment.polarity

In [157]:
### Add sentiment of songname, lyrics to aggregated data
bsagg['title_sent'] = bsagg.songname.apply(detect_sentiment)
bsagg['lyric_sent'] = bsagg.lyrics.apply(detect_sentiment)

In [158]:
### clean up covers dataframe

covers['src_id'] = covers.source_artist.map({303:0,825:1}) 
#covers.drop(['srcid'], axis = 1, inplace = True)

### fill missing ratings with mean for workid where ratings exist.
means = covers[covers.rating > 0].groupby('workid').rating.mean()
covers = covers.set_index(['workid'])
covers['rating'] = covers['rating'].fillna(means)
covers = covers.reset_index()
### Fill with 0 where no ratings exist for this cover song.
covers.rating.fillna(0, inplace=True)

In [159]:
covers.head()


Out[159]:
workid recording_id source_artist recording_artist record_name rec_artist_id rating src_id
0 245459 2889179 825 Züri West (Bitte Baby) Blib no chli bi mir / Lue zersch ... 148091 70.00 1
1 357708 10470131 825 Cat Power (I Can't Get No) Satisfaction 35 62.92 1
2 357708 5974631 825 Bill Cosby (I Can't Get No) Satisfaction 91 40.00 1
3 357708 6718106 825 The Ventures (I Can't Get No) Satisfaction 343 60.00 1
4 357708 3024003 825 Manfred Mann (I Can't Get No) Satisfaction 820 60.00 1

In [160]:
### Aggregations for covers

aggregations = {
    'recording_id':{'times_covered':'count'}
    ,'rec_artist_id':{'artist_cnt':'nunique'}
    ,'rating': {'cov_rating_avg':'mean'}
}


coversagg = covers.groupby(['workid']).agg(aggregations).reset_index()
coversagg.columns = coversagg.columns.droplevel(1)
coversagg.columns = ['workid','artist_cnt','cov_rating_avg','times_covered']
coversagg.shape


Out[160]:
(221, 4)

In [161]:
### merge dataframes for analysis
### adding is_cover binary
bsagg = bsagg.merge(is_cover[['workid', 'is_cover']], on=['workid'])
### adding cover aggregates
bsagg = bsagg.merge(coversagg[['workid','artist_cnt','cov_rating_avg','times_covered']], how='left', on=['workid'])

In [162]:
import plotly
plotly.offline.iplot([{
    'x': bsagg.num_releases,
    'y': bsagg.avg_rating
    
}])



In [163]:
sns.heatmap(bsagg.corr())


Out[163]:
<matplotlib.axes._subplots.AxesSubplot at 0x143f97450>

In [164]:
bsagg.shape


Out[164]:
(374, 14)

In [165]:
from pandas.tools.plotting import scatter_matrix
scatter_matrix(bsagg, alpha=0.5, figsize=(12, 12), diagonal='kde')


Out[165]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x13e5906d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11e066590>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11e14cad0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11e16dc90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11e573250>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11e51f790>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11e981f50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11ea11410>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11ea72a90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x11eaf3b10>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x14380a8d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x143ac3e50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x143ce1dd0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x143d4eb90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x143dd1e50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x143e33e10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x143ec13d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x143e617d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1441a5110>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1442273d0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x144287c10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1444c9c90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x144624a50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1446b0c50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x144734bd0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1447a0d50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x144822bd0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x144892850>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x144a1f090>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x144a6cb90>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x144af8c10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x144b79b90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x144c27e90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x144ca9e10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x144d18990>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x144da71d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x144df4cd0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x144f80d50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x145005cd0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x14506ffd0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1450f3f50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x145360ad0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1453f0310>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x14543ce10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1454c7e90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x14554be10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1455c4150>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1456470d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1456abc10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x145738450>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x145784f50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x14580ffd0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x145893f50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x145a0e290>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x145a90210>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x145af2d50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x145c7f590>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x145ce80d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x145e68150>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x145eec0d0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x145f593d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x145fda350>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x14603de90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1460cc6d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x146231210>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1462b2290>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x146334210>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1463a2510>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x146424490>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x146488fd0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x146515810>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x146578350>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1465fa3d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x14667f350>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1466eb650>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x14716d5d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1471de150>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x14725d950>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1472c5490>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x147345510>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1473c9490>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x147436790>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1474b8710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x147629290>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1476a7a90>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x14770d5d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x14778e650>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1481125d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x14817f8d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x148200850>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x1482713d0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1482f2bd0>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x148449710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1484d7790>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x14865c710>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1486c7a10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x14874a990>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x1487bb510>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x14903bd10>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x149095850>]], dtype=object)

Export cleaned data to csv for use in separate notebook.


In [166]:
bsagg.to_csv('data/Influence_clean.csv',sep='|',encoding='utf-8',index=False)

In [ ]: