Exploring and Visualizing the Rolling Stone's 500 Greatest Albums of All Time

Importing the necessary modules

I'll be using pandas to clean and structure the data and bokeh for visualization.

pandas: http://pandas.pydata.org/pandas-docs/stable/

bokeh: http://bokeh.pydata.org/en/latest/


In [1]:
import pandas as pd
from bokeh.charts import Donut, HeatMap, Histogram, Line, Scatter, show, output_notebook, output_file
from bokeh.plotting import figure

In [2]:
output_notebook()


Loading BokehJS ...

Getting the data and structuring it

Get the album list from the xls file into a pandas dataframe


In [3]:
album_list = pd.read_excel('albumlist.xls')

In [4]:
album_list.dtypes


Out[4]:
Number       int64
Year         int64
Album       object
Artist      object
Genre       object
Subgenre    object
dtype: object

In [5]:
album_list.head()


Out[5]:
Number Year Album Artist Genre Subgenre
0 1 1967 Sgt. Pepper's Lonely Hearts Club Band The Beatles Rock Rock & Roll, Psychedelic Rock
1 2 1966 Pet Sounds The Beach Boys Rock Pop Rock, Psychedelic Rock
2 3 1966 Revolver The Beatles Rock Psychedelic Rock, Pop Rock
3 4 1965 Highway 61 Revisited Bob Dylan Rock Folk Rock, Blues Rock
4 5 1965 Rubber Soul The Beatles Rock, Pop Pop Rock

The Genre and Subgenre categories have multiple comma separated values. I'm going to keep just the first value and drop the others for the category.


In [13]:
#efficent method to do the same thing commented below
for lab, row in album_list.iterrows():
    album_list.loc[lab, "Genre_Refined"] = row["Genre"].split(',')[0]
    album_list.loc[lab, "Subgenre_Refined"] = row["Subgenre"].split(',')[0]

In [14]:
#add Genre_Refined column by selecting just the first value
#album_list['Genre_Refined'] = album_list['Genre']
#for i in range(len(album_list)):
    #album_list['Genre_Refined'][i] = album_list['Genre'][i].split(',')[0]

In [15]:
#add Subgenre_Refined column by selecting just the first value
#album_list['Subgenre_Refined'] = album_list['Subgenre']
#for i in range(len(album_list)):
    #album_list['Subgenre_Refined'][i] = album_list['Subgenre'][i].split(',')[0]

In [16]:
album_list.head()


Out[16]:
Number Year Album Artist Genre Subgenre Genre_Refined Subgenre_Refined
0 1 1967 Sgt. Pepper's Lonely Hearts Club Band The Beatles Rock Rock & Roll, Psychedelic Rock Rock Rock & Roll
1 2 1966 Pet Sounds The Beach Boys Rock Pop Rock, Psychedelic Rock Rock Pop Rock
2 3 1966 Revolver The Beatles Rock Psychedelic Rock, Pop Rock Rock Psychedelic Rock
3 4 1965 Highway 61 Revisited Bob Dylan Rock Folk Rock, Blues Rock Rock Folk Rock
4 5 1965 Rubber Soul The Beatles Rock, Pop Pop Rock Rock Pop Rock

In [17]:
album_list.dtypes


Out[17]:
Number               int64
Year                 int64
Album               object
Artist              object
Genre               object
Subgenre            object
Genre_Refined       object
Subgenre_Refined    object
dtype: object

Top 10 artists having the most number of albums in the list


In [18]:
#get a count of how many times each artist made it into the list
artists_count = album_list.groupby(['Artist'], as_index=False).count()

In [19]:
#find the top 10 artists
top_artists = artists_count.sort_values(by='Number', ascending=False).head(10)
top_artists = top_artists.reset_index().drop(['index', 'Album','Year','Genre','Subgenre', 'Genre_Refined', 'Subgenre_Refined'], axis=1)
top_artists.head(10)


Out[19]:
Artist Number
0 The Beatles 10
1 Bob Dylan 10
2 The Rolling Stones 10
3 Bruce Springsteen 8
4 The Who 7
5 Radiohead 5
6 U2 5
7 Led Zeppelin 5
8 Elton John 5
9 David Bowie 5

In [20]:
#get the artists and corresponding counts into two lists for plotting
top_artists_list = top_artists.Artist.values.tolist()
top_artists_count = top_artists.Number.astype(float).values.tolist()

In [21]:
#visualize the data using bokeh
#output_file("top_artists.html", title="top artists")
p = figure(x_range=top_artists_list, plot_height = 500, plot_width = 500)

#set x-axis properties
p.xgrid.visible = False
p.xaxis.major_label_orientation = 3.14/4
p.xaxis.axis_label = 'Artist Name'

#set y-axis properties
p.ygrid.visible = False
p.yaxis.axis_label = 'Album Count'

#draw circles
p.circle(y=top_artists_count, x=top_artists_list, size=15, fill_color="black")
show(p)


Year wise count of number of albums


In [22]:
#get count of albums in each year
yearwise_albums = album_list.groupby(['Year'], as_index=False).count()
yearwise_albums = yearwise_albums.sort_values(by='Year').reset_index().drop(['index', 'Album', 'Artist','Genre','Subgenre','Subgenre_Refined','Genre_Refined'], axis=1)
yearwise_albums.head()


Out[22]:
Year Number
0 1955 1
1 1956 2
2 1957 2
3 1958 1
4 1959 4

In [23]:
#visulaizing the data using bokeh line graphs
#output_file("yearwise_albums.html", title="yearwise_albums")
line = Line(data=yearwise_albums, x='Year', y='Number')
line.yaxis.axis_label = 'Number of Albums'
show(line)


Top genres and subgenres


In [24]:
#pivot the data and get a subset of the pivoted data where each subgenre has a count of more than 5
pivoted = pd.pivot_table(album_list, index=['Genre_Refined', 'Subgenre_Refined'], values=['Number'], aggfunc='count')
pivoted_subset = pivoted[pivoted['Number'] > 5]
pivoted_subset = pivoted_subset.reset_index()
pivoted_subset


Out[24]:
Genre_Refined Subgenre_Refined Number
0 Electronic New Wave 7
1 Electronic Synth-pop 6
2 Folk Country 9
3 Funk / Soul Rhythm & Blues 11
4 Funk / Soul Soul 19
5 Hip Hop None 16
6 Rock Alternative Rock 33
7 Rock Art Rock 6
8 Rock Blues Rock 37
9 Rock Classic Rock 18
10 Rock Folk Rock 37
11 Rock Garage Rock 8
12 Rock Hard Rock 17
13 Rock Indie Rock 13
14 Rock New Wave 12
15 Rock None 8
16 Rock Pop Rock 37
17 Rock Psychedelic Rock 14
18 Rock Punk 6
19 Rock Rock & Roll 14
20 Rock Soft Rock 6

In [25]:
#visualizing the data using the bokeh donut chart
#output_file("donut.html", title="donut")
from bokeh.palettes import Purples9 as palette1
palette1 = palette1[::-1]
d = Donut(pivoted_subset, label=['Genre_Refined', 'Subgenre_Refined'], values='Number', 
          text_font_size='15pt', plot_height=1000, plot_width=1000, palette=palette1)
show(d)


Songs in each genre by year


In [26]:
#getting yearwise data for each genre
yearwise_data = album_list.groupby(['Year', 'Genre_Refined'], as_index=False).count()
yearwise_data = yearwise_data.sort_values(by='Year').reset_index().drop(['index', 'Album', 'Artist','Genre','Subgenre','Subgenre_Refined'], axis=1)
yearwise_data.head(25)


Out[26]:
Year Genre_Refined Number
0 1955 Jazz 1
1 1956 Jazz 1
2 1956 Rock 1
3 1957 Rock 2
4 1958 Blues 1
5 1959 Jazz 4
6 1960 Funk / Soul 1
7 1960 Jazz 1
8 1960 Rock 1
9 1961 Funk / Soul 1
10 1962 Blues 1
11 1962 Funk / Soul 1
12 1963 Folk 1
13 1963 Funk / Soul 1
14 1963 Rock 3
15 1964 Blues 1
16 1964 Jazz 1
17 1964 Rock 2
18 1965 Rock 11
19 1965 Jazz 1
20 1965 Blues 1
21 1965 Funk / Soul 1
22 1966 Electronic 1
23 1966 Funk / Soul 2
24 1966 Rock 10

In [27]:
#visualizing the data using a bokeh heatmap
#output_file("yearwise_genre.html", title="yearwise_subgenre")
from bokeh.palettes import Reds9 as palette2
palette2 = palette2[::-1]
hm_year = HeatMap(yearwise_data, x='Year', y='Genre_Refined', values='Number', stat=None,
               width=750, plot_height=500, palette=palette2)
#y-axis properties
hm_year.yaxis.axis_label = 'Genre'
hm_year.yaxis.major_label_orientation = 'horizontal'
show(hm_year)


Rock subgenres over the years


In [28]:
#count subgenres yearwise and subset it for rock music
yearwise_subgenres = album_list.groupby(['Year', 'Genre_Refined', 'Subgenre_Refined'], as_index=False).count()
rock_subgenres_yearwise = yearwise_subgenres[yearwise_subgenres['Genre_Refined'] == 'Rock'].reset_index().drop(['index', 'Album', 'Artist','Genre','Subgenre'], axis=1)
rock_subgenres_yearwise.head()


Out[28]:
Year Genre_Refined Subgenre_Refined Number
0 1956 Rock Rock & Roll 1
1 1957 Rock Rock & Roll 1
2 1957 Rock Rockabilly 1
3 1960 Rock Blues Rock 1
4 1963 Rock Beat 1

In [29]:
#visualizing the data using bokeh scatterplot
#output_file("rock_subgenres_yearwise.html", title="rock_subgenres_yearwise")
hm_rock_subgenres = Scatter(rock_subgenres_yearwise, x='Year', y='Subgenre_Refined', width=800, plot_height=800)
#x-axis properties
hm_rock_subgenres.xgrid.visible = False
#y-axis properties
hm_rock_subgenres.yaxis.major_label_orientation = 'horizontal'
hm_rock_subgenres.yaxis.axis_label = 'Subgenres of Rock'
hm_rock_subgenres.ygrid.visible = False
show(hm_rock_subgenres)


Getting a summary of the top 10 Albums


In [30]:
#top 10 albums
top_albums = album_list.head(10)
#Get artists and albums into a new data frame
top_albums_a = top_albums['Artist']
top_albums_b = top_albums['Album']
top_albums_final = pd.concat([top_albums_a, top_albums_b], axis=1)
#groupby and summarize
top_albums_chart = top_albums_final.groupby(['Artist', 'Album']).count()
top_albums_chart


Out[30]:
Artist Album
Bob Dylan Blonde on Blonde
Highway 61 Revisited
Marvin Gaye What's Going On
The Beach Boys Pet Sounds
The Beatles Revolver
Rubber Soul
Sgt. Pepper's Lonely Hearts Club Band
The Beatles ("The White Album")
The Clash London Calling
The Rolling Stones Exile on Main St.

In [ ]: