Plotting

There are many libraries for plotting in Python. The de facto standard is Matplotlib (a third-party package, not part of the Python standard library). Its examples and gallery are particularly useful references.

Matplotlib is most useful if you have data in numpy arrays. We can then plot standard single graphs straightforwardly:


In [1]:
%matplotlib inline

The above command is only needed if you are plotting in a Jupyter notebook.

We now construct some data:


In [2]:
import numpy

# linspace returns 50 points by default, matching the 50 random values below
x = numpy.linspace(0, 1)
y1 = numpy.sin(numpy.pi * x) + 0.1 * numpy.random.rand(50)
y2 = numpy.cos(3.0 * numpy.pi * x) + 0.2 * numpy.random.rand(50)

And then produce a line plot:


In [3]:
from matplotlib import pyplot
pyplot.plot(x, y1)
pyplot.show()


We can add labels and titles:


In [8]:
pyplot.plot(x, y1)
pyplot.xlabel('x')
pyplot.ylabel('y')
pyplot.title('A single line plot')
pyplot.show()


We can change the plotting style, and use LaTeX style notation where needed:


In [9]:
pyplot.plot(x, y1, linestyle='--', color='black', linewidth=3)
pyplot.xlabel(r'$x$')
pyplot.ylabel(r'$y$')
pyplot.title(r'A single line plot, roughly $\sin(\pi x)$')
pyplot.show()


We can plot two lines at once, and add a legend, which we can position:


In [10]:
pyplot.plot(x, y1, label=r'$y_1$')
pyplot.plot(x, y2, label=r'$y_2$')
pyplot.xlabel(r'$x$')
pyplot.ylabel(r'$y$')
pyplot.title('Two line plots')
pyplot.legend(loc='lower left')
pyplot.show()


We would probably prefer to use subplots. At this point we have to leave the simple interface and start building the plot from its individual components: figures and axes, which are objects that we manipulate directly:


In [11]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(10,6))
axis1 = axes[0]
axis1.plot(x, y1)
axis1.set_xlabel(r'$x$')
axis1.set_ylabel(r'$y_1$')
axis2 = axes[1]
axis2.plot(x, y2)
axis2.set_xlabel(r'$x$')
axis2.set_ylabel(r'$y_2$')
fig.tight_layout()
pyplot.show()


The axes variable contains all of the separate axes that you may want. This makes it easy to construct many subplots using a loop:


In [12]:
data = []
for nx in range(2,5):
    for ny in range(2,5):
        data.append(numpy.sin(nx * numpy.pi * x) + numpy.cos(ny * numpy.pi * x))

fig, axes = pyplot.subplots(nrows=3, ncols=3, figsize=(10,10))
for nrow in range(3):
    for ncol in range(3):
        ndata = ncol + 3 * nrow
        axes[nrow, ncol].plot(x, data[ndata])
        axes[nrow, ncol].set_xlabel(r'$x$')
        axes[nrow, ncol].set_ylabel(r'$\sin({} \pi x) + \cos({} \pi x)$'.format(nrow+2, ncol+2))
fig.tight_layout()
pyplot.show()


Matplotlib will allow you to generate and place axes pretty much wherever you like, to use logarithmic scales, to do different types of plot, and so on. Check the examples and gallery for details.
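
As a small illustration of the options mentioned above, the sketch below draws a scatter plot and a line plot with a logarithmic y-axis side by side. The data (an exponential decay with some noise) and the names t, signal and make_example_figure are made up for this example; only standard Matplotlib calls (scatter, set_yscale) are used.

```python
import numpy
from matplotlib import pyplot


def make_example_figure(seed=0):
    """Return a figure with a scatter plot and a semi-log line plot."""
    rng = numpy.random.default_rng(seed)
    t = numpy.linspace(0, 5, 50)
    # Hypothetical data: exponential decay with multiplicative noise
    signal = numpy.exp(-t) * (1.0 + 0.1 * rng.random(50))

    fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(10, 4))
    # Left: a different type of plot (scatter rather than line)
    axes[0].scatter(t, signal, marker='x')
    axes[0].set_xlabel(r'$t$')
    axes[0].set_ylabel('signal')
    # Right: the same data with a logarithmic scale on the y-axis
    axes[1].plot(t, signal)
    axes[1].set_yscale('log')
    axes[1].set_xlabel(r'$t$')
    axes[1].set_ylabel('signal (log scale)')
    fig.tight_layout()
    return fig


fig = make_example_figure()
pyplot.show()
```

On the logarithmic axis the exponential decay appears as a roughly straight line, which is often the reason for choosing that scale.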

Data sets

If the information is not in numpy arrays but in a spreadsheet-like format, Matplotlib may not be the best approach.

For handling large data sets, the de facto standard Python library is pandas. It stores the data in a dataframe, which keeps rectangular data together with its row and column labels.

Let's load the standard Iris data set, which we can get from GitHub:


In [1]:
import pandas

In [2]:
iris = pandas.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv')

Let's get some information about the file we just read in. First, let's see what data fields our dataset has:


In [3]:
iris.columns


Out[3]:
Index(['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Name'], dtype='object')

Now let's see what datatype (i.e. integer, boolean, string, float,...) the data in each field is:


In [4]:
iris.dtypes


Out[4]:
SepalLength    float64
SepalWidth     float64
PetalLength    float64
PetalWidth     float64
Name            object
dtype: object

Finally, let's try printing the first few records in our dataframe:


In [40]:
# print first 5 records
iris.head()


Out[40]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa

Note that pandas can also read Excel files (using pandas.read_excel). Both readers take as their first argument either a URL (as here) or a filename on the local machine.
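
pandas.read_csv also accepts an open file-like object, which is handy for trying things out without a network connection. A minimal sketch, using the first three records printed above written out as CSV text (the variable names csv_text and sample are just for illustration):

```python
import io

import pandas

# The first three Iris records shown above, written out as CSV text
csv_text = """SepalLength,SepalWidth,PetalLength,PetalWidth,Name
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
"""

# read_csv accepts a URL, a local filename or an open file-like object
sample = pandas.read_csv(io.StringIO(csv_text))
print(sample.dtypes)
print(sample.shape)
```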

Once we have the data, <dataframe>.plot gives us lots of options to plot the result. Let's plot a histogram of the Petal Length:


In [16]:
from matplotlib import pyplot  # only needed if not already imported

iris['PetalLength'].plot.hist()
pyplot.show()


We can see that the underlying library is Matplotlib, but plotting large data sets this way is far easier than calling Matplotlib directly.
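
The same .plot accessor offers other kinds of plot. As a sketch, the block below draws a scatter plot of two columns. So that it runs on its own, it uses the five records printed by iris.head() above, typed in by hand as a dataframe called sample; with the full data loaded you would call iris.plot.scatter(...) directly.

```python
import pandas
from matplotlib import pyplot

# The five records shown by iris.head() above, entered by hand
sample = pandas.DataFrame({
    'SepalLength': [5.1, 4.9, 4.7, 4.6, 5.0],
    'SepalWidth': [3.5, 3.0, 3.2, 3.1, 3.6],
    'PetalLength': [1.4, 1.4, 1.3, 1.5, 1.4],
    'PetalWidth': [0.2, 0.2, 0.2, 0.2, 0.2],
})

# DataFrame.plot.scatter returns the Matplotlib axes it drew on,
# so the usual axes methods can be used to refine the plot
ax = sample.plot.scatter(x='SepalLength', y='SepalWidth')
ax.set_title('Sepal width against sepal length')
pyplot.show()
```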

We can get some basic statistics for our data using describe():


In [18]:
iris.describe()


Out[18]:
       SepalLength  SepalWidth  PetalLength  PetalWidth
count   150.000000  150.000000   150.000000  150.000000
mean      5.843333    3.054000     3.758667    1.198667
std       0.828066    0.433594     1.764420    0.763161
min       4.300000    2.000000     1.000000    0.100000
25%       5.100000    2.800000     1.600000    0.300000
50%       5.800000    3.000000     4.350000    1.300000
75%       6.400000    3.300000     5.100000    1.800000
max       7.900000    4.400000     6.900000    2.500000

We can also extract specific metrics:


In [29]:
print(iris['SepalLength'].min())
print(iris['PetalLength'].std())
print(iris['PetalWidth'].count())


4.3
1.7644204199522617
150

However, we often wish to calculate statistics for a subset of our data. For this, we can use pandas' groupby. Let's group our data by Name and run describe again; pandas now calculates the statistics for each type of iris separately.


In [24]:
grouped_iris = iris.groupby('Name')
grouped_iris.describe()


Out[24]:
                       PetalLength  PetalWidth  SepalLength  SepalWidth
Name
Iris-setosa     count    50.000000   50.000000    50.000000   50.000000
                mean      1.464000    0.244000     5.006000    3.418000
                std       0.173511    0.107210     0.352490    0.381024
                min       1.000000    0.100000     4.300000    2.300000
                25%       1.400000    0.200000     4.800000    3.125000
                50%       1.500000    0.200000     5.000000    3.400000
                75%       1.575000    0.300000     5.200000    3.675000
                max       1.900000    0.600000     5.800000    4.400000
Iris-versicolor count    50.000000   50.000000    50.000000   50.000000
                mean      4.260000    1.326000     5.936000    2.770000
                std       0.469911    0.197753     0.516171    0.313798
                min       3.000000    1.000000     4.900000    2.000000
                25%       4.000000    1.200000     5.600000    2.525000
                50%       4.350000    1.300000     5.900000    2.800000
                75%       4.600000    1.500000     6.300000    3.000000
                max       5.100000    1.800000     7.000000    3.400000
Iris-virginica  count    50.000000   50.000000    50.000000   50.000000
                mean      5.552000    2.026000     6.588000    2.974000
                std       0.551895    0.274650     0.635880    0.322497
                min       4.500000    1.400000     4.900000    2.200000
                25%       5.100000    1.800000     6.225000    2.800000
                50%       5.550000    2.000000     6.500000    3.000000
                75%       5.875000    2.300000     6.900000    3.175000
                max       6.900000    2.500000     7.900000    3.800000

In [33]:
grouped_iris['PetalLength'].mean()


Out[33]:
Name
Iris-setosa        1.464
Iris-versicolor    4.260
Iris-virginica     5.552
Name: PetalLength, dtype: float64
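
Grouped data can also be summarised with several functions at once using agg. A minimal sketch on a hand-made dataframe called sample, holding four records copied from the outputs in this section (a dataframe this small is only for illustration):

```python
import pandas

# Four records copied from the outputs in this section (rows 0, 1, 100, 101)
sample = pandas.DataFrame({
    'PetalLength': [1.4, 1.4, 6.0, 5.1],
    'PetalWidth': [0.2, 0.2, 2.5, 1.9],
    'Name': ['Iris-setosa', 'Iris-setosa',
             'Iris-virginica', 'Iris-virginica'],
})

# One row per group, one column per (field, statistic) pair
summary = sample.groupby('Name').agg(['mean', 'max'])
print(summary)
```

The result has two-level column labels, so a single value is selected with a tuple, for example summary.loc['Iris-virginica', ('PetalLength', 'max')].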

We can select subsets of our data using criteria. For example, we can select all records with PetalLength greater than 5:


In [36]:
iris[iris.PetalLength > 5].head()


Out[36]:
     SepalLength  SepalWidth  PetalLength  PetalWidth             Name
83           6.0         2.7          5.1         1.6  Iris-versicolor
100          6.3         3.3          6.0         2.5   Iris-virginica
101          5.8         2.7          5.1         1.9   Iris-virginica
102          7.1         3.0          5.9         2.1   Iris-virginica
103          6.3         2.9          5.6         1.8   Iris-virginica

We can also combine criteria like so:


In [39]:
iris[(iris.Name == 'Iris-setosa') & (iris.PetalWidth < 0.3)].head()


Out[39]:
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa
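
Criteria can be combined with & (and), | (or) and ~ (not). Each comparison needs its own parentheses, because these operators bind more tightly than == or <. A small sketch on a hand-entered dataframe called sample, holding four records copied from the outputs above:

```python
import pandas

# Rows 0, 83, 100 and 101 copied from the outputs above
sample = pandas.DataFrame(
    {
        'PetalLength': [1.4, 5.1, 6.0, 5.1],
        'PetalWidth': [0.2, 1.6, 2.5, 1.9],
        'Name': ['Iris-setosa', 'Iris-versicolor',
                 'Iris-virginica', 'Iris-virginica'],
    },
    index=[0, 83, 100, 101],
)

# Either condition may hold: setosa, or very wide petals
either = sample[(sample.Name == 'Iris-setosa') | (sample.PetalWidth > 2.0)]

# Negation: everything that is not virginica
not_virginica = sample[~(sample.Name == 'Iris-virginica')]

# isin tests membership in a list of values
two_species = sample[sample.Name.isin(['Iris-setosa', 'Iris-versicolor'])]

print(either.index.tolist())         # [0, 100]
print(not_virginica.index.tolist())  # [0, 83]
print(two_species.index.tolist())    # [0, 83]
```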

Data across multiple files

Now let's look at a slightly more complex example where the data is spread across multiple files and contains many different fields of different datatypes.

Spotify provides a web API which can be used to download data about its music. This data includes the audio features of a track, a set of measures including 'acousticness', 'danceability', 'speechiness' and 'valence'. Spotify's documentation describes valence as follows:

A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

We can download this data using a library such as spotipy. In the folder spotify_data, you will find a few .csv files containing data downloaded for tracks from playlists of several different musical genres.

Let's begin by importing our data.


In [5]:
# read one csv file per genre into a dictionary of dataframes
genres = ['indie', 'pop', 'country', 'metal', 'house', 'rap']
dfs = {genre: pandas.read_csv('spotify_data/{}.csv'.format(genre)) for genre in genres}

To compare the data from these different datasets, it will help if we first combine them into a single dataframe. Before we do this, we'll add an extra field to each of our dataframes describing the musical genre so that we do not lose this information when the dataframes are combined.


In [6]:
# add genre field to each dataframe
for name, df in dfs.items():
    df['genre'] = name

# combine into single dataframe
data = pandas.concat(dfs.values())
data


Out[6]:
Unnamed: 0 album artists duration_ms explicit href id name popularity preview_url ... loudness mode speechiness tempo time_signature track_href type.1 uri.1 valence genre
0 1 All Your Fault: Pt. 1 Bebe Rexha, Ty Dolla $ign 197253 True https://api.spotify.com/v1/tracks/4ZJPwET9Jrgp... 4ZJPwET9Jrgpkqi4Vo3Yg8 Bad Bitch (feat. Ty Dolla $ign) 85 https://p.scdn.co/mp3-preview/9284a8a94b8c16ef... ... -6.252 1 0.0539 139.910 4 https://api.spotify.com/v1/tracks/4ZJPwET9Jrgp... audio_features spotify:track:4ZJPwET9Jrgpkqi4Vo3Yg8 0.364 pop
1 2 Good Life (with G-Eazy & Kehlani) G-Eazy, Kehlani 225525 False https://api.spotify.com/v1/tracks/1Eck97uRMlpr... 1Eck97uRMlprKOOJN9oO1E Good Life (with G-Eazy & Kehlani) 83 https://p.scdn.co/mp3-preview/88d8d456dcf9b55e... ... -5.220 1 0.2120 168.385 4 https://api.spotify.com/v1/tracks/1Eck97uRMlpr... audio_features spotify:track:1Eck97uRMlprKOOJN9oO1E 0.551 pop
2 3 13 Reasons Why (A Netflix Original Series Soun... Lord Huron 206933 False https://api.spotify.com/v1/tracks/3FsBtu3gdlfZ... 3FsBtu3gdlfZjBLXyDvmj1 The Night We Met 32 NaN ... -9.560 1 0.0378 87.024 3 https://api.spotify.com/v1/tracks/3FsBtu3gdlfZ... audio_features spotify:track:3FsBtu3gdlfZjBLXyDvmj1 0.117 pop
3 4 Obsession (feat. Jon Bellion) Vice, Jon Bellion 221982 False https://api.spotify.com/v1/tracks/542Xd5qDeLBv... 542Xd5qDeLBvgXZXhfW7LE Obsession (feat. Jon Bellion) 84 https://p.scdn.co/mp3-preview/16ab1dd04110aa3f... ... -7.775 1 0.0300 101.999 4 https://api.spotify.com/v1/tracks/542Xd5qDeLBv... audio_features spotify:track:542Xd5qDeLBvgXZXhfW7LE 0.441 pop
4 5 Slow Hands Niall Horan 188174 False https://api.spotify.com/v1/tracks/27vTihlWXiz9... 27vTihlWXiz9f9lJM3XGVU Slow Hands 83 NaN ... -6.623 1 0.0519 85.899 4 https://api.spotify.com/v1/tracks/27vTihlWXiz9... audio_features spotify:track:27vTihlWXiz9f9lJM3XGVU 0.874 pop
5 6 So Good Clean Bandit, Zara Larsson 214866 False https://api.spotify.com/v1/tracks/4SPLWgCPoKwU... 4SPLWgCPoKwULz2UTM8TKg Symphony 41 NaN ... -4.699 0 0.0429 122.948 4 https://api.spotify.com/v1/tracks/4SPLWgCPoKwU... audio_features spotify:track:4SPLWgCPoKwULz2UTM8TKg 0.470 pop
6 7 Memories...Do Not Open The Chainsmokers 207520 True https://api.spotify.com/v1/tracks/6cPyTS0Kk2sc... 6cPyTS0Kk2sc4xQwC93kOg Break Up Every Night 84 https://p.scdn.co/mp3-preview/119d2078bf607422... ... -5.957 1 0.0437 149.999 4 https://api.spotify.com/v1/tracks/6cPyTS0Kk2sc... audio_features spotify:track:6cPyTS0Kk2sc4xQwC93kOg 0.536 pop
7 8 I'm the One DJ Khaled, Justin Bieber, Quavo, Chance The Ra... 288876 True https://api.spotify.com/v1/tracks/72Q0FQQo32KJ... 72Q0FQQo32KJloivv5xge2 I'm the One 100 https://p.scdn.co/mp3-preview/f6fdecfbaae1ed54... ... -4.267 1 0.0367 80.984 4 https://api.spotify.com/v1/tracks/72Q0FQQo32KJ... audio_features spotify:track:72Q0FQQo32KJloivv5xge2 0.811 pop
8 9 Attention Charlie Puth 211475 False https://api.spotify.com/v1/tracks/4iLqG9SeJSnt... 4iLqG9SeJSnt0cSPICSjxv Attention 94 https://p.scdn.co/mp3-preview/e20bdb50a10a7c5a... ... -4.432 0 0.0432 100.041 4 https://api.spotify.com/v1/tracks/4iLqG9SeJSnt... audio_features spotify:track:4iLqG9SeJSnt0cSPICSjxv 0.758 pop
9 10 The Cure Lady Gaga 211363 False https://api.spotify.com/v1/tracks/51PIvodunv6N... 51PIvodunv6NmX5250zxAh The Cure 88 NaN ... -4.842 1 0.0356 99.977 4 https://api.spotify.com/v1/tracks/51PIvodunv6N... audio_features spotify:track:51PIvodunv6NmX5250zxAh 0.539 pop
10 12 Thunder Imagine Dragons 187761 False https://api.spotify.com/v1/tracks/0oP9pK1D1lNF... 0oP9pK1D1lNF3Lb7jkl6Xx Thunder 84 NaN ... -3.798 1 0.0464 167.969 4 https://api.spotify.com/v1/tracks/0oP9pK1D1lNF... audio_features spotify:track:0oP9pK1D1lNF3Lb7jkl6Xx 0.250 pop
11 13 Most Girls Hailee Steinfeld 204400 False https://api.spotify.com/v1/tracks/10GJQkjRJcZh... 10GJQkjRJcZhGTLagFOC62 Most Girls 81 NaN ... -7.082 1 0.0775 102.974 4 https://api.spotify.com/v1/tracks/10GJQkjRJcZh... audio_features spotify:track:10GJQkjRJcZhGTLagFOC62 0.384 pop
12 15 Stay Zedd, Alessia Cara 210090 False https://api.spotify.com/v1/tracks/6uBhi9gBXWja... 6uBhi9gBXWjanegOb2Phh0 Stay (with Alessia Cara) 90 NaN ... -5.091 0 0.1110 101.384 4 https://api.spotify.com/v1/tracks/6uBhi9gBXWja... audio_features spotify:track:6uBhi9gBXWjanegOb2Phh0 0.535 pop
13 16 9 Cashmere Cat, Ariana Grande 258188 True https://api.spotify.com/v1/tracks/4rwqrKdwlFWJ... 4rwqrKdwlFWJ6LvPYaOtgn Quit (feat. Ariana Grande) 80 NaN ... -7.425 0 0.0446 131.659 4 https://api.spotify.com/v1/tracks/4rwqrKdwlFWJ... audio_features spotify:track:4rwqrKdwlFWJ6LvPYaOtgn 0.125 pop
14 17 No Promises (feat. Demi Lovato) Cheat Codes, Demi Lovato 223503 False https://api.spotify.com/v1/tracks/1louJpMmzEic... 1louJpMmzEicAn7lzDalPW No Promises (feat. Demi Lovato) 90 https://p.scdn.co/mp3-preview/eb769786f45b9fe4... ... -5.445 1 0.1340 112.956 4 https://api.spotify.com/v1/tracks/1louJpMmzEic... audio_features spotify:track:1louJpMmzEicAn7lzDalPW 0.609 pop
15 18 No Vacancy OneRepublic 223189 False https://api.spotify.com/v1/tracks/4QeoDcR16IHp... 4QeoDcR16IHpmmgFGQDrCp No Vacancy 79 NaN ... -3.946 1 0.0472 99.954 4 https://api.spotify.com/v1/tracks/4QeoDcR16IHp... audio_features spotify:track:4QeoDcR16IHpmmgFGQDrCp 0.483 pop
16 19 HUMBLE. Kendrick Lamar 177604 True https://api.spotify.com/v1/tracks/3GnLo84IkdSW... 3GnLo84IkdSWCPYt6tnLll HUMBLE. 74 NaN ... -7.496 0 0.1180 149.996 4 https://api.spotify.com/v1/tracks/3GnLo84IkdSW... audio_features spotify:track:3GnLo84IkdSWCPYt6tnLll 0.419 pop
17 20 First Time Kygo, Ellie Goulding 193511 False https://api.spotify.com/v1/tracks/2Gl0FzuLxflY... 2Gl0FzuLxflY6nPifJp5Dr First Time 93 https://p.scdn.co/mp3-preview/f5e35495b56260a0... ... -7.245 0 0.1120 90.066 4 https://api.spotify.com/v1/tracks/2Gl0FzuLxflY... audio_features spotify:track:2Gl0FzuLxflY6nPifJp5Dr 0.643 pop
18 21 The Things You Do Griffin Stoller 172173 False https://api.spotify.com/v1/tracks/0owbFAS9rCBS... 0owbFAS9rCBSqJYly3fMbE The Things You Do 67 https://p.scdn.co/mp3-preview/0220fac07f28d669... ... -6.746 0 0.0378 91.958 4 https://api.spotify.com/v1/tracks/0owbFAS9rCBS... audio_features spotify:track:0owbFAS9rCBSqJYly3fMbE 0.546 pop
19 22 Sunset Lover Petit Biscuit 237792 False https://api.spotify.com/v1/tracks/0hNduWmlWmEm... 0hNduWmlWmEmuwEFcYvRu1 Sunset Lover 83 https://p.scdn.co/mp3-preview/548545f6df276b92... ... -9.474 1 0.0503 90.838 4 https://api.spotify.com/v1/tracks/0hNduWmlWmEm... audio_features spotify:track:0hNduWmlWmEmuwEFcYvRu1 0.236 pop
20 23 That´s Me HEDEGAARD 205226 True https://api.spotify.com/v1/tracks/1UiEaMpzSMIC... 1UiEaMpzSMICYs0FWQga7S That´s Me 68 NaN ... -4.434 0 0.1500 104.982 4 https://api.spotify.com/v1/tracks/1UiEaMpzSMIC... audio_features spotify:track:1UiEaMpzSMICYs0FWQga7S 0.518 pop
21 24 Empty Streets Kota Banks, MOZA 186857 False https://api.spotify.com/v1/tracks/21xFsQzlsh6c... 21xFsQzlsh6cr5XIDqKc7S Empty Streets 64 NaN ... -7.063 1 0.0399 140.024 4 https://api.spotify.com/v1/tracks/21xFsQzlsh6c... audio_features spotify:track:21xFsQzlsh6cr5XIDqKc7S 0.303 pop
22 26 Human Touch Betty Who 213750 False https://api.spotify.com/v1/tracks/073wmqmf5Kfr... 073wmqmf5KfrQiNukyuqrq Human Touch 70 https://p.scdn.co/mp3-preview/eb8d446290968031... ... -4.090 0 0.0495 102.898 4 https://api.spotify.com/v1/tracks/073wmqmf5Kfr... audio_features spotify:track:073wmqmf5KfrQiNukyuqrq 0.568 pop
23 27 Hung Up Tritonal, Sj, Emma Gatsby 220000 False https://api.spotify.com/v1/tracks/2ULg8Cw0Ckn5... 2ULg8Cw0Ckn5JDGUkCNXko Hung Up 71 https://p.scdn.co/mp3-preview/b0b498715df97ffc... ... -3.745 1 0.1670 119.824 4 https://api.spotify.com/v1/tracks/2ULg8Cw0Ckn5... audio_features spotify:track:2ULg8Cw0Ckn5JDGUkCNXko 0.362 pop
24 28 Now Or Never Halsey 214801 False https://api.spotify.com/v1/tracks/3Px934J0dBoC... 3Px934J0dBoCmpUhUuCQkD Now Or Never 86 NaN ... -4.934 0 0.0367 110.091 4 https://api.spotify.com/v1/tracks/3Px934J0dBoC... audio_features spotify:track:3Px934J0dBoCmpUhUuCQkD 0.485 pop
25 29 Aloha Møme, Merryn Jeann 218400 False https://api.spotify.com/v1/tracks/4uNDs2TBsv2K... 4uNDs2TBsv2KX9b4LIxfdt Aloha 65 NaN ... -5.955 0 0.1470 100.095 4 https://api.spotify.com/v1/tracks/4uNDs2TBsv2K... audio_features spotify:track:4uNDs2TBsv2KX9b4LIxfdt 0.444 pop
26 30 Solo Dance Martin Jensen 174933 False https://api.spotify.com/v1/tracks/3R6dPfF2yBO8... 3R6dPfF2yBO8mHySW1XDAa Solo Dance 82 https://p.scdn.co/mp3-preview/76414b590c5de583... ... -2.432 1 0.0480 114.955 4 https://api.spotify.com/v1/tracks/3R6dPfF2yBO8... audio_features spotify:track:3R6dPfF2yBO8mHySW1XDAa 0.418 pop
27 31 Alright Stupead 185066 False https://api.spotify.com/v1/tracks/1fDLHQ4oKWj7... 1fDLHQ4oKWj7KxQRg9WOHQ Alright 64 https://p.scdn.co/mp3-preview/1449a74ef40e6703... ... -5.929 1 0.3280 124.034 4 https://api.spotify.com/v1/tracks/1fDLHQ4oKWj7... audio_features spotify:track:1fDLHQ4oKWj7KxQRg9WOHQ 0.511 pop
28 32 The Shine ayokay, Chelsea Cutler 188026 False https://api.spotify.com/v1/tracks/5p6eKTYwFYit... 5p6eKTYwFYitUS617IMtyd The Shine 70 https://p.scdn.co/mp3-preview/a11a3662018be947... ... -6.074 1 0.0493 99.983 4 https://api.spotify.com/v1/tracks/5p6eKTYwFYit... audio_features spotify:track:5p6eKTYwFYitUS617IMtyd 0.580 pop
29 34 DGAF (feat. Shiloh Dynasty) Noah Slee, Shiloh Dynasty 196640 False https://api.spotify.com/v1/tracks/4JU4oMSWyzHC... 4JU4oMSWyzHCiszqtsQptt DGAF 64 https://p.scdn.co/mp3-preview/fd21e629686f93c0... ... -7.319 0 0.0485 134.168 4 https://api.spotify.com/v1/tracks/4JU4oMSWyzHC... audio_features spotify:track:4JU4oMSWyzHCiszqtsQptt 0.252 pop
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
20 21 H.O.L.Y. Florida Georgia Line 196960 False https://api.spotify.com/v1/tracks/19PJf5HzeiVW... 19PJf5HzeiVW0mdk82Swmn H.O.L.Y. 19 NaN ... -3.828 0 0.0333 78.132 4 https://api.spotify.com/v1/tracks/19PJf5HzeiVW... audio_features spotify:track:19PJf5HzeiVW0mdk82Swmn 0.560 country
21 22 Ripcord Keith Urban 230600 False https://api.spotify.com/v1/tracks/6wycnu8FWXsj... 6wycnu8FWXsj68ig7BEot9 Blue Ain't Your Color 75 NaN ... -7.787 1 0.0363 82.407 3 https://api.spotify.com/v1/tracks/6wycnu8FWXsj... audio_features spotify:track:6wycnu8FWXsj68ig7BEot9 0.450 country
22 23 Bring You Back Brett Eldredge 179293 False https://api.spotify.com/v1/tracks/3un2KQUaQ2br... 3un2KQUaQ2brEpd8dK93wI Beat Of The Music 63 https://p.scdn.co/mp3-preview/2990273cbb8dc089... ... -4.125 1 0.0269 110.001 4 https://api.spotify.com/v1/tracks/3un2KQUaQ2br... audio_features spotify:track:3un2KQUaQ2brEpd8dK93wI 0.777 country
23 24 High Noon Jerrod Niemann 224813 False https://api.spotify.com/v1/tracks/3M31S6f0z8S3... 3M31S6f0z8S3nkFh3eS06W Drink to That All Night 65 https://p.scdn.co/mp3-preview/3e1f057daa61f7a1... ... -6.184 0 0.0439 115.965 4 https://api.spotify.com/v1/tracks/3M31S6f0z8S3... audio_features spotify:track:3M31S6f0z8S3nkFh3eS06W 0.476 country
24 25 Ignite the Night (Party Edition) Chase Rice 241240 True https://api.spotify.com/v1/tracks/2IDZZPDo5t7E... 2IDZZPDo5t7E5PE8Gv2I28 Ride (feat. Macy Maloy) 13 NaN ... -6.347 1 0.0318 115.039 4 https://api.spotify.com/v1/tracks/2IDZZPDo5t7E... audio_features spotify:track:2IDZZPDo5t7E5PE8Gv2I28 0.340 country
25 26 Anything Goes Florida Georgia Line 230586 False https://api.spotify.com/v1/tracks/2KklDCzQYLA5... 2KklDCzQYLA5YtOL3lbQbH Dirt 47 NaN ... -4.174 1 0.0464 121.959 4 https://api.spotify.com/v1/tracks/2KklDCzQYLA5... audio_features spotify:track:2KklDCzQYLA5YtOL3lbQbH 0.607 country
26 27 Crickets Joe Nichols 230706 False https://api.spotify.com/v1/tracks/0uRlP6bHbSgB... 0uRlP6bHbSgBGklmaCMqG7 Yeah 65 https://p.scdn.co/mp3-preview/279cf907390f4abd... ... -4.638 1 0.0531 165.713 4 https://api.spotify.com/v1/tracks/0uRlP6bHbSgB... audio_features spotify:track:0uRlP6bHbSgBGklmaCMqG7 0.596 country
27 28 Montevallo Sam Hunt 310306 False https://api.spotify.com/v1/tracks/4nrtux8xl5lY... 4nrtux8xl5lYegU8II5rAQ Single For The Summer 65 NaN ... -6.316 0 0.0465 155.877 4 https://api.spotify.com/v1/tracks/4nrtux8xl5lY... audio_features spotify:track:4nrtux8xl5lYegU8II5rAQ 0.474 country
28 29 Two Lanes Of Freedom Tim McGraw, Taylor Swift, Keith Urban 279066 False https://api.spotify.com/v1/tracks/60hGQrn24APq... 60hGQrn24APqEFSLObLeDc Highway Don't Care 54 NaN ... -5.287 1 0.0444 78.905 4 https://api.spotify.com/v1/tracks/60hGQrn24APq... audio_features spotify:track:60hGQrn24APqEFSLObLeDc 0.478 country
29 30 How Country Feels Randy Houser 185093 False https://api.spotify.com/v1/tracks/68A4qiCqD9y3... 68A4qiCqD9y39N3WNEsJVD How Country Feels 69 https://p.scdn.co/mp3-preview/2886b6d192985243... ... -3.608 1 0.0396 104.944 4 https://api.spotify.com/v1/tracks/68A4qiCqD9y3... audio_features spotify:track:68A4qiCqD9y39N3WNEsJVD 0.814 country
30 31 Old Dominion Old Dominion 190346 False https://api.spotify.com/v1/tracks/4pXQMKhsrOab... 4pXQMKhsrOabXRdc0ZjoUE Nowhere Fast 3 NaN ... -7.316 1 0.0297 101.981 4 https://api.spotify.com/v1/tracks/4pXQMKhsrOab... audio_features spotify:track:4pXQMKhsrOabXRdc0ZjoUE 0.500 country
31 32 Traveller Chris Stapleton 253200 False https://api.spotify.com/v1/tracks/5jROdl6MhcmP... 5jROdl6MhcmP3O7h2sVgtw Parachute 70 NaN ... -7.327 1 0.0297 113.033 4 https://api.spotify.com/v1/tracks/5jROdl6MhcmP... audio_features spotify:track:5jROdl6MhcmP3O7h2sVgtw 0.631 country
32 33 Anything Goes Florida Georgia Line 218866 False https://api.spotify.com/v1/tracks/46ZfPS5VpSQV... 46ZfPS5VpSQVU5gb82hg3K Anything Goes 71 NaN ... -3.805 1 0.0450 154.038 4 https://api.spotify.com/v1/tracks/46ZfPS5VpSQV... audio_features spotify:track:46ZfPS5VpSQVU5gb82hg3K 0.757 country
33 34 Chief Eric Church 263386 False https://api.spotify.com/v1/tracks/7L8zhGCm45v9... 7L8zhGCm45v984vCmYBS1x Springsteen 44 NaN ... -5.935 1 0.0255 104.023 4 https://api.spotify.com/v1/tracks/7L8zhGCm45v9... audio_features spotify:track:7L8zhGCm45v984vCmYBS1x 0.930 country
34 35 RISER Dierks Bentley 254466 False https://api.spotify.com/v1/tracks/2a3KlwRjbAFP... 2a3KlwRjbAFPYMV2sdzsFM Drunk On A Plane 50 NaN ... -5.557 1 0.0328 205.932 4 https://api.spotify.com/v1/tracks/2a3KlwRjbAFP... audio_features spotify:track:2a3KlwRjbAFPYMV2sdzsFM 0.667 country
35 36 Crash My Party Luke Bryan 226866 False https://api.spotify.com/v1/tracks/03fT3OHB9KyM... 03fT3OHB9KyMtGMt2zwqCT Play It Again 73 NaN ... -3.150 1 0.0696 144.056 4 https://api.spotify.com/v1/tracks/03fT3OHB9KyM... audio_features spotify:track:03fT3OHB9KyMtGMt2zwqCT 0.595 country
36 37 Yours Russell Dickerson 221280 False https://api.spotify.com/v1/tracks/4U5DDEuYdoMm... 4U5DDEuYdoMm4Xv9yklrWu Yours 15 NaN ... -5.308 1 0.0313 133.975 4 https://api.spotify.com/v1/tracks/4U5DDEuYdoMm... audio_features spotify:track:4U5DDEuYdoMm4Xv9yklrWu 0.544 country
37 38 They Don't Know Jason Aldean 203293 False https://api.spotify.com/v1/tracks/7DEgL4rErTwj... 7DEgL4rErTwjrsxLMlUVhf Any Ol' Barstool 76 https://p.scdn.co/mp3-preview/551f994084946608... ... -4.163 1 0.0284 143.780 4 https://api.spotify.com/v1/tracks/7DEgL4rErTwj... audio_features spotify:track:7DEgL4rErTwjrsxLMlUVhf 0.472 country
38 39 Don't It Billy Currington 190920 False https://api.spotify.com/v1/tracks/5ARuoJ6sMbMR... 5ARuoJ6sMbMRvvGU9Ft18z Don't It 10 NaN ... -5.293 1 0.0369 90.970 4 https://api.spotify.com/v1/tracks/5ARuoJ6sMbMR... audio_features spotify:track:5ARuoJ6sMbMRvvGU9Ft18z 0.674 country
39 40 Thomas Rhett EP Thomas Rhett 227106 False https://api.spotify.com/v1/tracks/3X33opT0KHbk... 3X33opT0KHbkadEKmBPyPD Make Me Wanna 31 NaN ... -5.121 1 0.0420 108.974 4 https://api.spotify.com/v1/tracks/3X33opT0KHbk... audio_features spotify:track:3X33opT0KHbkadEKmBPyPD 0.552 country
40 41 El Rio Frankie Ballard 240413 False https://api.spotify.com/v1/tracks/61jWRBNxdM3s... 61jWRBNxdM3sO5oKLwZV9y You'll Accomp'ny Me 68 https://p.scdn.co/mp3-preview/4729635078f5382e... ... -5.605 1 0.0319 107.970 4 https://api.spotify.com/v1/tracks/61jWRBNxdM3s... audio_features spotify:track:61jWRBNxdM3sO5oKLwZV9y 0.677 country
41 42 How Country Feels Randy Houser 193573 False https://api.spotify.com/v1/tracks/7CL6oaDC9d0Q... 7CL6oaDC9d0QXFMQRkNBmy Runnin' Outta Moonlight 71 https://p.scdn.co/mp3-preview/1d7b5bce4395fb06... ... -3.759 1 0.0393 172.050 4 https://api.spotify.com/v1/tracks/7CL6oaDC9d0Q... audio_features spotify:track:7CL6oaDC9d0QXFMQRkNBmy 0.791 country
42 43 Tailgates & Tanlines Luke Bryan 219973 False https://api.spotify.com/v1/tracks/0cV4xwUA4ue2... 0cV4xwUA4ue2deqq4CZFko I Don't Want This Night to End 66 NaN ... -4.020 0 0.0278 111.934 4 https://api.spotify.com/v1/tracks/0cV4xwUA4ue2... audio_features spotify:track:0cV4xwUA4ue2deqq4CZFko 0.375 country
43 44 From The Ground Up Dan + Shay 254293 False https://api.spotify.com/v1/tracks/0AdjJ2SaOeb5... 0AdjJ2SaOeb5bPJ67nDbsW From the Ground Up 69 https://p.scdn.co/mp3-preview/311f28baa84fbdbe... ... -5.431 1 0.0262 101.214 4 https://api.spotify.com/v1/tracks/0AdjJ2SaOeb5... audio_features spotify:track:0AdjJ2SaOeb5bPJ67nDbsW 0.287 country
44 45 Tangled Up Thomas Rhett 227426 False https://api.spotify.com/v1/tracks/5kNe7PE09d6K... 5kNe7PE09d6Kvw5pAsx23n Die A Happy Man 75 NaN ... -9.377 1 0.0304 83.096 4 https://api.spotify.com/v1/tracks/5kNe7PE09d6K... audio_features spotify:track:5kNe7PE09d6Kvw5pAsx23n 0.380 country
45 46 Stay A Little Longer Brothers Osborne 335266 False https://api.spotify.com/v1/tracks/6rqxivjFHp8K... 6rqxivjFHp8K0yMiefG56g Stay A Little Longer 69 NaN ... -5.515 1 0.0294 97.006 4 https://api.spotify.com/v1/tracks/6rqxivjFHp8K... audio_features spotify:track:6rqxivjFHp8K0yMiefG56g 0.537 country
46 47 Just As I Am Brantley Gilbert 222626 False https://api.spotify.com/v1/tracks/5LPkLveErawy... 5LPkLveErawyu1XDAGXHOj One Hell Of An Amen 46 NaN ... -5.630 1 0.0282 138.027 4 https://api.spotify.com/v1/tracks/5LPkLveErawy... audio_features spotify:track:5LPkLveErawyu1XDAGXHOj 0.556 country
47 48 May We All Florida Georgia Line, Tim McGraw 226173 False https://api.spotify.com/v1/tracks/2cuGAe3C8BHJ... 2cuGAe3C8BHJN57JASaS3P May We All 10 NaN ... -4.280 1 0.0393 75.016 4 https://api.spotify.com/v1/tracks/2cuGAe3C8BHJ... audio_features spotify:track:2cuGAe3C8BHJN57JASaS3P 0.636 country
48 49 Old Boots, New Dirt Jason Aldean 223626 False https://api.spotify.com/v1/tracks/1uhaXll708Tq... 1uhaXll708Tq7NDwu7fJBd Gonna Know We Were Here 47 https://p.scdn.co/mp3-preview/c0e4752ca195f740... ... -5.544 1 0.2100 174.066 4 https://api.spotify.com/v1/tracks/1uhaXll708Tq... audio_features spotify:track:1uhaXll708Tq7NDwu7fJBd 0.538 country
49 50 It Goes Like This Thomas Rhett 186853 False https://api.spotify.com/v1/tracks/1S1u0ausWGi2... 1S1u0ausWGi2msWnuQgTCY It Goes Like This 63 NaN ... -5.914 1 0.0627 167.957 4 https://api.spotify.com/v1/tracks/1S1u0ausWGi2... audio_features spotify:track:1S1u0ausWGi2msWnuQgTCY 0.561 country

513 rows × 32 columns

This has given us a fairly sizeable dataframe with 513 rows and 32 columns. However, if you look closely at the index column you'll notice something dodgy has happened: combining our dataframes has meant that the index is no longer unique (multiple records share the same index label).


In [59]:
data.index.is_unique


Out[59]:
False
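
The effect is easy to reproduce on two tiny dataframes. The sketch below uses made-up values (the names first, second, combined and renumbered are illustrative). It also shows that passing ignore_index=True to pandas.concat is one way to obtain a fresh, unique integer index, although that is not the approach taken in the rest of this section.

```python
import pandas

# Two hypothetical dataframes, each with its own default index 0, 1
first = pandas.DataFrame({'name': ['a', 'b'], 'genre': ['indie', 'indie']})
second = pandas.DataFrame({'name': ['c', 'd'], 'genre': ['pop', 'pop']})

# concat keeps the original labels, so 0 and 1 each appear twice
combined = pandas.concat([first, second])
print(combined.index.tolist())      # [0, 1, 0, 1]
print(combined.index.is_unique)     # False

# ignore_index=True discards the old labels and renumbers the rows
renumbered = pandas.concat([first, second], ignore_index=True)
print(renumbered.index.tolist())    # [0, 1, 2, 3]
print(renumbered.index.is_unique)   # True
```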

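This is not good. Looking at the printout of the dataframe, we see that the last record is LOYALTY. by Kendrick Lamar and has index 46. However, that label is shared with records from the other genres: selecting position 46 with iloc, which counts rows from the top rather than looking up the label, instead returns Rebellion (Lies) by Arcade Fire, which also carries the label 46. (The order of the rows depends on the order in which the dataframes were combined, so your printout may end with a different record.)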


In [60]:
data.iloc[46]


Out[60]:
Unnamed: 0                                                         48
album                                                         Funeral
artists                                                   Arcade Fire
duration_ms                                                    310893
explicit                                                        False
href                https://api.spotify.com/v1/tracks/5qk1xXcERl8R...
id                                             5qk1xXcERl8RW645ztqDAW
name                                                 Rebellion (Lies)
popularity                                                         58
preview_url         https://p.scdn.co/mp3-preview/f891f8274794a442...
track_number                                                        9
type                                                            track
uri                              spotify:track:5qk1xXcERl8RW645ztqDAW
acousticness                                                   0.0068
analysis_url        https://api.spotify.com/v1/audio-analysis/5qk1...
danceability                                                    0.401
duration_ms.1                                                  310893
energy                                                          0.941
id.1                                           5qk1xXcERl8RW645ztqDAW
instrumentalness                                                0.607
key                                                                 8
liveness                                                        0.288
loudness                                                       -5.652
mode                                                                1
speechiness                                                    0.0349
tempo                                                         127.178
time_signature                                                      4
track_href          https://api.spotify.com/v1/tracks/5qk1xXcERl8R...
type.1                                                 audio_features
uri.1                            spotify:track:5qk1xXcERl8RW645ztqDAW
valence                                                         0.738
genre                                                           indie
Name: 46, dtype: object
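
The difference between position-based and label-based selection can be seen on a toy example (the dataframe tracks and its values are made up for illustration): iloc counts rows from the top, while loc looks up the index label and returns every matching row.

```python
import pandas

# Hypothetical combined dataframe with repeated index labels
tracks = pandas.DataFrame(
    {'name': ['a', 'b', 'c', 'd'],
     'genre': ['indie', 'indie', 'pop', 'pop']},
    index=[0, 1, 0, 1],
)

# Position 1 is the second row, regardless of labels: a single record
by_position = tracks.iloc[1]
print(by_position['name'])          # b

# Label 1 occurs twice, so loc returns a dataframe with both records
by_label = tracks.loc[1]
print(by_label['name'].tolist())    # ['b', 'd']
```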

We can remedy this by reindexing. Looking at the fields available, it looks like the tracks' id would be a good choice for a unique index.


In [61]:
data.set_index('id', inplace=True)

In [62]:
data.index.is_unique


Out[62]:
False

Unfortunately, there are still duplicates where the same track appears in multiple playlists. Let's remove these duplicates, keeping only the first instance.


In [68]:
data = data[~data.index.duplicated(keep='first')]
data.index.is_unique


Out[68]:
True
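
To see what duplicated(keep='first') does, here is a sketch on a made-up dataframe in which the id x1 appears in two playlists. The mask is True for every repeat after the first occurrence, and ~ inverts it so that only first occurrences are kept.

```python
import pandas

# Hypothetical tracks indexed by id; 'x1' appears in two genres
tracks = pandas.DataFrame(
    {'name': ['a', 'b', 'a'], 'genre': ['indie', 'indie', 'pop']},
    index=pandas.Index(['x1', 'x2', 'x1'], name='id'),
)

# Boolean array marking the second and later occurrences of each id
mask = tracks.index.duplicated(keep='first')
print(mask.tolist())                      # [False, False, True]

unique_tracks = tracks[~mask]
print(unique_tracks.index.tolist())       # ['x1', 'x2']
print(unique_tracks.loc['x1', 'genre'])   # indie
```

Note that the surviving record keeps the genre of the first playlist it appeared in, so the result depends on the order in which the dataframes were combined.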

Success! Before we do anything else, let's write our single combined dataset to file.


In [127]:
data.to_csv('spotify_data/combined_data.csv')
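
When the combined file is read back in, the index comes back as an ordinary column unless we ask for it to be restored with index_col. A sketch of the round trip, using a made-up two-row dataframe called tracks and a temporary directory instead of spotify_data:

```python
import os
import tempfile

import pandas

# Hypothetical stand-in for the combined dataframe, indexed by track id
tracks = pandas.DataFrame(
    {'name': ['a', 'b'], 'valence': [0.25, 0.5]},
    index=pandas.Index(['x1', 'x2'], name='id'),
)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'combined_data.csv')
    tracks.to_csv(path)
    # index_col='id' turns the saved id column back into the index
    restored = pandas.read_csv(path, index_col='id')

print(restored)
```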

Now onto some analysis. Let's first look at some statistics for each of our genres.


In [72]:
data[['duration_ms', 'explicit', 'popularity', 'acousticness', 'danceability', 'energy', 'instrumentalness', 
     'liveness', 'loudness', 'speechiness', 'tempo', 'valence', 'genre']].groupby('genre').mean()


Out[72]:
duration_ms explicit popularity acousticness danceability energy instrumentalness liveness loudness speechiness tempo valence
genre
country 222361.300000 0.020000 52.260000 0.123496 0.575860 0.755760 0.004834 0.185076 -5.356140 0.041214 125.422320 0.563620
house 413951.619565 0.010870 32.184783 0.071019 0.764717 0.672033 0.741966 0.122800 -10.278283 0.059933 120.842522 0.374475
indie 237722.958333 0.031250 56.583333 0.189482 0.598167 0.693562 0.147994 0.185283 -6.631104 0.045101 122.083937 0.542333
metal 264057.406780 0.084746 30.694915 0.001797 0.420085 0.946678 0.022654 0.229066 -4.277203 0.103297 131.340254 0.358563
pop 214640.508982 0.185629 67.131737 0.170122 0.658557 0.658542 0.017714 0.156259 -6.187772 0.083049 115.435132 0.466786
rap 214801.739130 0.978261 76.695652 0.147325 0.785978 0.595761 0.000365 0.142937 -7.173261 0.213624 135.210065 0.419304

From this alone we can get a lot of information: house tracks are on average almost twice as long as tracks from the other genres, over 97% of rap tracks contain explicit lyrics, metal tracks are the most energetic but tend to be sadder (lower valence) than country or indie. Let's try sorting our data to find the saddest tracks in each genre.

We do this by sorting the data by valence (sort_values('valence')), grouping by genre (groupby('genre')), then taking the first row of each group (head(1)).


In [123]:
data.sort_values('valence')[['album', 'artists', 'name', 'genre', 'valence']].groupby('genre').head(1)


Out[123]:
                        album          artists          name               genre    valence
id
4R1AbCs2wEu4e6j7FB7sRZ  The Touch      Rampa            The Touch          house    0.0354
2yoCtR2C0sMFgII70RosuY  The Raven Age  The Raven Age    Angel In Disgrace  metal    0.0634
2Ce5IyMlVRVvN997ZJjJJA  HNDRXX         Future, Rihanna  Selfish            pop      0.0951
05nbZ1xxVNwUTcGwLbp7CN  NAV            NAV              Myself             rap      0.1000
1gk3FhAV07q9Jg77UxnVjX  ZABA           Glass Animals    Gooey              indie    0.1070
0xwPzLmBAYro8BUz7MrtAo  Montevallo     Sam Hunt         Make You Miss Me   country  0.1670
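The sort, group, head pattern is easier to follow on a tiny made-up frame (the track ids and values below are invented for illustration). Because the rows are sorted before grouping, the first row of each group is the one with the lowest valence. As an alternative sketch, `idxmin` finds the index label of the minimum in each group directly, without sorting.

```python
import pandas

# Toy data: two genres with two tracks each
toy = pandas.DataFrame(
    {'genre': ['pop', 'pop', 'rap', 'rap'],
     'valence': [0.6, 0.2, 0.4, 0.9]},
    index=pandas.Index(['t1', 't2', 't3', 't4'], name='id'))

# Sort by valence, then keep the first (lowest-valence) row of each genre
saddest = toy.sort_values('valence').groupby('genre').head(1)
print(saddest)

# Alternative: look up the id of the minimum valence within each genre
saddest_ids = toy.groupby('genre')['valence'].idxmin()
print(toy.loc[saddest_ids])
```

The two approaches select the same tracks; the first returns them in order of valence, the second in order of genre name.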

We can visualise our data by plotting the various characteristics against each other. In the plot below, we compare the energy and danceability of country, metal and house music. The data from the three different genres separates into three pretty distinct clusters.


In [111]:
colours = ['red', 'blue', 'green', 'orange', 'pink', 'purple']

ax = data[data.genre == 'country'].plot.scatter('danceability', 'energy', c=colours[0], label='country', figsize=(10,10))
data[data.genre == 'metal'].plot.scatter('danceability', 'energy', c=colours[1], marker='x', label='metal', ax=ax)
data[data.genre == 'house'].plot.scatter('danceability', 'energy', c=colours[2], marker='+', label='house', ax=ax)


Out[111]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f20fb5129b0>

More information about pandas can be found in the documentation, in tutorials, or in standard books.

Messy data

In real life, datasets are often messy, with records containing invalid or missing entries. Fortunately, pandas is equipped with several functions that allow us to deal with messy data.

In this example, we shall be using a dataset from the Data Carpentry website, which is a subset of the data from Ernst et al., Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA. This data contains a set of records of animals caught during the study.

Let's begin by reading in the data:


In [ ]:
survey = pandas.read_excel('surveys.xlsx')

In [6]:
survey.head()


Out[6]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN

In the weight column, instead of the numbers we might expect, we see the value NaN ('Not a Number'). If you open the original spreadsheet, you'll see that the original weight data is missing for these records. The count function returns the number of non-NaN entries per column, so if we subtract that from the length of the survey, we can see how many NaN entries there are per column:


In [20]:
len(survey) - survey.count()


Out[20]:
record_id             0
month                 0
day                   0
year                  0
plot_id               0
species_id          763
sex                2511
hindfoot_length    4111
weight             3266
dtype: int64
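An equivalent way to count the missing entries is to mark them with `isna` and sum the resulting booleans. The sketch below compares the two approaches on a small made-up frame (the values are invented; only the column names echo the survey data).

```python
import numpy
import pandas

# Toy frame with one missing sex entry and two missing weights
toy = pandas.DataFrame({'sex': ['M', None, 'F'],
                        'weight': [numpy.nan, numpy.nan, 40.0]})

# Approach used above: total rows minus non-missing entries per column
by_count = len(toy) - toy.count()

# Alternative: isna() gives True for each missing entry; summing counts them
by_isna = toy.isna().sum()

print(by_count)
print(by_isna)
```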

We need to work out a sensible way to deal with this missing data: if we try to do any analysis on the dataset in its current state, Python may raise a ValueError. For example, let's try converting the data in the weight column to an integer:


In [8]:
survey.weight.astype('int')


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-55a36a4f97d3> in <module>()
----> 1 survey.weight.astype('int')

/home/alice/anaconda3/lib/python3.4/site-packages/pandas/core/generic.py in astype(self, dtype, copy, raise_on_error, **kwargs)
   3052         # else, only a single dtype is given
   3053         new_data = self._data.astype(dtype=dtype, copy=copy,
-> 3054                                      raise_on_error=raise_on_error, **kwargs)
   3055         return self._constructor(new_data).__finalize__(self)
   3056 

/home/alice/anaconda3/lib/python3.4/site-packages/pandas/core/internals.py in astype(self, dtype, **kwargs)
   3187 
   3188     def astype(self, dtype, **kwargs):
-> 3189         return self.apply('astype', dtype=dtype, **kwargs)
   3190 
   3191     def convert(self, **kwargs):

/home/alice/anaconda3/lib/python3.4/site-packages/pandas/core/internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3054 
   3055             kwargs['mgr'] = self
-> 3056             applied = getattr(b, f)(**kwargs)
   3057             result_blocks = _extend_blocks(applied, result_blocks)
   3058 

/home/alice/anaconda3/lib/python3.4/site-packages/pandas/core/internals.py in astype(self, dtype, copy, raise_on_error, values, **kwargs)
    459                **kwargs):
    460         return self._astype(dtype, copy=copy, raise_on_error=raise_on_error,
--> 461                             values=values, **kwargs)
    462 
    463     def _astype(self, dtype, copy=False, raise_on_error=True, values=None,

/home/alice/anaconda3/lib/python3.4/site-packages/pandas/core/internals.py in _astype(self, dtype, copy, raise_on_error, values, klass, mgr, **kwargs)
    502 
    503                 # _astype_nansafe works fine with 1-d only
--> 504                 values = _astype_nansafe(values.ravel(), dtype, copy=True)
    505                 values = values.reshape(self.shape)
    506 

/home/alice/anaconda3/lib/python3.4/site-packages/pandas/types/cast.py in _astype_nansafe(arr, dtype, copy)
    529 
    530         if np.isnan(arr).any():
--> 531             raise ValueError('Cannot convert NA to integer')
    532     elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
    533         # work around NumPy brokenness, #1987

ValueError: Cannot convert NA to integer
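The same failure can be reproduced on a three-element series, which makes it clear that a single NaN is enough to stop the conversion. This is a minimal sketch with made-up weights; the exact wording of the error message differs between pandas versions, so we only check that a ValueError is raised.

```python
import numpy
import pandas

# One missing weight among otherwise valid values
weights = pandas.Series([40.0, numpy.nan, 29.0])

try:
    weights.astype('int')
    conversion_failed = False
except ValueError as error:
    # NaN has no integer representation, so pandas refuses to convert
    conversion_failed = True
    print('Conversion failed:', error)

# A series without any NaN converts without complaint
complete = pandas.Series([40.0, 29.0]).astype('int')
print(complete.tolist())  # [40, 29]
```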

There are several different ways we can deal with NaNs - which we choose depends on the individual dataset.

It may be that data is missing because, for example, the machine recording it malfunctioned, in which case the simplest approach is to discard all records containing missing data. We can do that with the dropna function.


In [11]:
survey.dropna()


Out[11]:
record_id month day year plot_id species_id sex hindfoot_length weight
62 63 8 19 1977 3 DM M 35.0 40.0
63 64 8 19 1977 7 DM M 37.0 48.0
64 65 8 19 1977 4 DM F 34.0 29.0
65 66 8 19 1977 4 DM F 35.0 46.0
66 67 8 19 1977 7 DM M 35.0 36.0
67 68 8 19 1977 8 DO F 32.0 52.0
68 69 8 19 1977 2 PF M 15.0 8.0
69 70 8 19 1977 3 OX F 21.0 22.0
70 71 8 19 1977 7 DM F 36.0 35.0
73 74 8 19 1977 8 PF M 12.0 7.0
74 75 8 19 1977 8 DM F 32.0 22.0
77 78 8 19 1977 1 PF M 16.0 9.0
78 79 8 19 1977 7 DM F 34.0 42.0
80 81 8 19 1977 4 PF F 14.0 8.0
81 82 8 19 1977 4 DM F 35.0 41.0
82 83 8 20 1977 6 DM F 37.0 37.0
83 84 8 20 1977 19 DM F 35.0 43.0
84 85 8 20 1977 23 DM F 35.0 41.0
85 86 8 20 1977 18 DM F 33.0 40.0
86 87 8 20 1977 5 PF F 11.0 9.0
87 88 8 20 1977 18 DM F 35.0 45.0
88 89 8 20 1977 12 PP F 20.0 15.0
89 90 8 20 1977 18 DM M 35.0 29.0
91 92 8 20 1977 6 DM M 35.0 39.0
93 94 8 20 1977 18 DM F 36.0 43.0
94 95 8 20 1977 23 DM M 38.0 46.0
95 96 8 20 1977 12 DM M 36.0 41.0
96 97 8 20 1977 18 DM M 36.0 41.0
97 98 8 20 1977 5 DM M 38.0 40.0
98 99 8 20 1977 11 DM M 37.0 45.0
... ... ... ... ... ... ... ... ... ...
35507 35508 12 31 2002 6 PB F 25.0 35.0
35508 35509 12 31 2002 6 PB M 26.0 47.0
35509 35510 12 31 2002 6 PB F 26.0 30.0
35513 35514 12 31 2002 11 PP M 23.0 18.0
35515 35516 12 31 2002 11 DO F 35.0 52.0
35516 35517 12 31 2002 11 DM F 36.0 42.0
35517 35518 12 31 2002 11 DO M 36.0 38.0
35518 35519 12 31 2002 9 DM M 37.0 49.0
35520 35521 12 31 2002 9 DM M 37.0 48.0
35521 35522 12 31 2002 9 DM F 35.0 45.0
35522 35523 12 31 2002 9 DM F 36.0 44.0
35523 35524 12 31 2002 9 PB F 25.0 27.0
35524 35525 12 31 2002 9 OL M 21.0 26.0
35525 35526 12 31 2002 8 OT F 20.0 24.0
35526 35527 12 31 2002 13 DO F 33.0 43.0
35528 35529 12 31 2002 13 PB F 25.0 25.0
35531 35532 12 31 2002 14 DM F 34.0 43.0
35532 35533 12 31 2002 14 DM F 36.0 48.0
35533 35534 12 31 2002 14 DM M 37.0 56.0
35534 35535 12 31 2002 14 DM M 37.0 53.0
35535 35536 12 31 2002 14 DM F 35.0 42.0
35536 35537 12 31 2002 14 DM F 36.0 46.0
35537 35538 12 31 2002 15 PB F 26.0 31.0
35538 35539 12 31 2002 15 SF M 26.0 68.0
35539 35540 12 31 2002 15 PB F 26.0 23.0
35540 35541 12 31 2002 15 PB F 24.0 31.0
35541 35542 12 31 2002 15 PB F 26.0 29.0
35542 35543 12 31 2002 15 PB F 27.0 34.0
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0

30676 rows × 9 columns

We may just wish to discard records with NaNs in a particular column (e.g. if we wish to deal with NaNs in other columns in a different way). We can discard all the records with NaNs in the weight column like so:


In [12]:
survey.dropna(subset=['weight'])


Out[12]:
record_id month day year plot_id species_id sex hindfoot_length weight
62 63 8 19 1977 3 DM M 35.0 40.0
63 64 8 19 1977 7 DM M 37.0 48.0
64 65 8 19 1977 4 DM F 34.0 29.0
65 66 8 19 1977 4 DM F 35.0 46.0
66 67 8 19 1977 7 DM M 35.0 36.0
67 68 8 19 1977 8 DO F 32.0 52.0
68 69 8 19 1977 2 PF M 15.0 8.0
69 70 8 19 1977 3 OX F 21.0 22.0
70 71 8 19 1977 7 DM F 36.0 35.0
73 74 8 19 1977 8 PF M 12.0 7.0
74 75 8 19 1977 8 DM F 32.0 22.0
77 78 8 19 1977 1 PF M 16.0 9.0
78 79 8 19 1977 7 DM F 34.0 42.0
80 81 8 19 1977 4 PF F 14.0 8.0
81 82 8 19 1977 4 DM F 35.0 41.0
82 83 8 20 1977 6 DM F 37.0 37.0
83 84 8 20 1977 19 DM F 35.0 43.0
84 85 8 20 1977 23 DM F 35.0 41.0
85 86 8 20 1977 18 DM F 33.0 40.0
86 87 8 20 1977 5 PF F 11.0 9.0
87 88 8 20 1977 18 DM F 35.0 45.0
88 89 8 20 1977 12 PP F 20.0 15.0
89 90 8 20 1977 18 DM M 35.0 29.0
91 92 8 20 1977 6 DM M 35.0 39.0
92 93 8 20 1977 18 DM NaN NaN 42.0
93 94 8 20 1977 18 DM F 36.0 43.0
94 95 8 20 1977 23 DM M 38.0 46.0
95 96 8 20 1977 12 DM M 36.0 41.0
96 97 8 20 1977 18 DM M 36.0 41.0
97 98 8 20 1977 5 DM M 38.0 40.0
... ... ... ... ... ... ... ... ... ...
35508 35509 12 31 2002 6 PB M 26.0 47.0
35509 35510 12 31 2002 6 PB F 26.0 30.0
35513 35514 12 31 2002 11 PP M 23.0 18.0
35515 35516 12 31 2002 11 DO F 35.0 52.0
35516 35517 12 31 2002 11 DM F 36.0 42.0
35517 35518 12 31 2002 11 DO M 36.0 38.0
35518 35519 12 31 2002 9 DM M 37.0 49.0
35519 35520 12 31 2002 9 SF NaN 24.0 36.0
35520 35521 12 31 2002 9 DM M 37.0 48.0
35521 35522 12 31 2002 9 DM F 35.0 45.0
35522 35523 12 31 2002 9 DM F 36.0 44.0
35523 35524 12 31 2002 9 PB F 25.0 27.0
35524 35525 12 31 2002 9 OL M 21.0 26.0
35525 35526 12 31 2002 8 OT F 20.0 24.0
35526 35527 12 31 2002 13 DO F 33.0 43.0
35528 35529 12 31 2002 13 PB F 25.0 25.0
35531 35532 12 31 2002 14 DM F 34.0 43.0
35532 35533 12 31 2002 14 DM F 36.0 48.0
35533 35534 12 31 2002 14 DM M 37.0 56.0
35534 35535 12 31 2002 14 DM M 37.0 53.0
35535 35536 12 31 2002 14 DM F 35.0 42.0
35536 35537 12 31 2002 14 DM F 36.0 46.0
35537 35538 12 31 2002 15 PB F 26.0 31.0
35538 35539 12 31 2002 15 SF M 26.0 68.0
35539 35540 12 31 2002 15 PB F 26.0 23.0
35540 35541 12 31 2002 15 PB F 24.0 31.0
35541 35542 12 31 2002 15 PB F 26.0 29.0
35542 35543 12 31 2002 15 PB F 27.0 34.0
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0

32283 rows × 9 columns
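The difference between the two calls is easier to see on a handful of rows. In this sketch (made-up values, with column names borrowed from the survey data), plain `dropna` discards any row with a missing entry, while `subset=['weight']` only looks at the weight column and so keeps the row whose sex is missing.

```python
import numpy
import pandas

# Row 1 is missing sex, row 2 is missing weight
toy = pandas.DataFrame({'sex': ['M', None, 'F', 'F'],
                        'weight': [40.0, 42.0, numpy.nan, 29.0]})

# Drop rows with a NaN in any column: rows 1 and 2 both go
all_complete = toy.dropna()
print(all_complete.index.tolist())  # [0, 3]

# Drop rows with a NaN in weight only: row 1 survives
has_weight = toy.dropna(subset=['weight'])
print(has_weight.index.tolist())    # [0, 1, 3]

# dropna returns a new frame; the original is left untouched
print(len(toy))                     # 4
```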

It may be more appropriate to replace all missing data with a certain value. For example, let's set all missing weights to 0:


In [24]:
nan_zeros = survey.copy() # make a copy so we don't overwrite original dataframe
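nan_zeros['weight'] = nan_zeros['weight'].fillna(0) # assign back; in-place fillna on a selected column may not update the frame in recent pandas versions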
nan_zeros.head()


Out[24]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 0.0
1 2 7 16 1977 3 NL M 33.0 0.0
2 3 7 16 1977 2 DM F 37.0 0.0
3 4 7 16 1977 7 DM M 36.0 0.0
4 5 7 16 1977 3 DM M 35.0 0.0

For our dataset, this is not the best choice as it will change the mean of our data:


In [25]:
print(survey.weight.mean(), nan_zeros.weight.mean())


42.672428212991356 38.751976145601844
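The reason for the shift is that `mean` skips NaN entries by default, so the original mean is taken over the observed weights only, whereas the zero-filled column averages over every row. A small worked example with made-up numbers:

```python
import numpy
import pandas

# Two observed weights and two missing ones
weights = pandas.Series([40.0, numpy.nan, 20.0, numpy.nan])

# NaNs are skipped: (40 + 20) / 2
observed_mean = weights.mean()
print(observed_mean)  # 30.0

# Zeros are counted: (40 + 0 + 20 + 0) / 4
zero_filled_mean = weights.fillna(0).mean()
print(zero_filled_mean)  # 15.0
```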

A better solution here is to fill all NaN values with the mean weight value:


In [27]:
nan_mean = survey.copy()
nan_mean['weight'] = nan_mean['weight'].fillna(survey.weight.mean()) # assign back rather than filling a selected column in place
print(survey.weight.mean(), nan_mean.weight.mean())
nan_mean.head()


42.672428212991356 42.67242821299182
Out[27]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 42.672428
1 2 7 16 1977 3 NL M 33.0 42.672428
2 3 7 16 1977 2 DM F 37.0 42.672428
3 4 7 16 1977 7 DM M 36.0 42.672428
4 5 7 16 1977 3 DM M 35.0 42.672428
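One caveat worth keeping in mind: filling with the mean preserves the mean, but it adds values at the centre of the distribution and so reduces the spread. The sketch below shows this on made-up numbers; whether it matters depends on the analysis you go on to do.

```python
import numpy
import pandas

# Two observed weights and two missing ones
weights = pandas.Series([40.0, numpy.nan, 20.0, numpy.nan])

# Replace each NaN with the mean of the observed values
filled = weights.fillna(weights.mean())

# The mean is unchanged by the fill
print(weights.mean(), filled.mean())

# The standard deviation is smaller after the fill
print(weights.std(), filled.std())
```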

Exercises

  1. Create a histogram of the SepalWidth for each of the species groups in the iris dataset.
  2. Plot acousticness against liveness for the music dataset. Use a for loop to add the different datasets to the plot (i.e. rather than typing each out by hand, as done above).

In [39]:
for name, df in iris.groupby('Name'):
    # create a new figure
    pyplot.figure()
    # plot histogram of sepalwidth
    df['SepalWidth'].plot.hist()
    # add title
    pyplot.title(name)


In the solution below for the music genre exercise, we've included a few extra steps in order to format the plot and make it more readable (e.g. changing the axis limits, increasing the figure size and fontsize).


In [33]:
# create a new axis
fig, axis = pyplot.subplots()

# create a dictionary of colours
colours = {'indie': 'red', 'pop': 'blue', 
           'country': 'green', 'metal': 'black', 
           'house': 'orange', 'rap': 'pink'}
# create a dictionary of markers 
markers = {'indie': '+', 'pop': 'x', 
           'country': 'o', 'metal': 'd', 
           'house': 's', 'rap': '*'}

for name, df in data.groupby('genre'):
    df.plot.scatter('acousticness', 'liveness', label=name, s=30, color=colours[name], marker=markers[name],
                    ax=axis, figsize=(10,8), fontsize=16)

# set limits of x and y axes so that they are between 0 and 1
axis.set_xlim([0,1.0])
axis.set_ylim([0,1.0])

# set the font size of the axis labels
axis.xaxis.label.set_fontsize(16)
axis.yaxis.label.set_fontsize(16)
pyplot.show()


Further reading

For a basic pandas tutorial, check out Python for ecologists from the Data Carpentry website. Of particular interest may be the last lesson which shows how to interact with SQL databases using python and pandas.

For a more in-depth pandas tutorial, check out these notebooks by Chris Fonnesbeck. In the last notebook, there is quite a lot of material on using pandas with scikit-learn for machine learning, including regression analysis, decision trees and random forests.

Other libraries

There are many other options depending on what you need to display. If you have large data and want to more easily make nice plots, try seaborn or altair. If you want to make the data interactive, especially online, try plotly or bokeh. For a detailed discussion of plotting in Python in 2017, see this talk by Jake Vanderplas.