Plotting

There are many libraries for plotting in Python. The standard library is matplotlib. Its examples and gallery are particularly useful references.

Matplotlib is most useful if you have data in numpy arrays. We can then plot standard single graphs straightforwardly:


In [1]:
%matplotlib inline

The above command is only needed if you are plotting in a Jupyter notebook.

We now construct some data:


In [2]:
import numpy

x = numpy.linspace(0, 1)
y1 = numpy.sin(numpy.pi * x) + 0.1 * numpy.random.rand(50)
y2 = numpy.cos(3.0 * numpy.pi * x) + 0.2 * numpy.random.rand(50)

And then produce a line plot:


In [3]:
from matplotlib import pyplot
pyplot.plot(x, y1)
pyplot.show()


We can add labels and titles:


In [4]:
pyplot.plot(x, y1)
pyplot.xlabel('x')
pyplot.ylabel('y')
pyplot.title('A single line plot')
pyplot.show()


We can change the plotting style, and use LaTeX style notation where needed:


In [5]:
pyplot.plot(x, y1, linestyle='--', color='black', linewidth=3)
pyplot.xlabel(r'$x$')
pyplot.ylabel(r'$y$')
pyplot.title(r'A single line plot, roughly $\sin(\pi x)$')
pyplot.show()


We can plot two lines at once, and add a legend, which we can position:


In [6]:
pyplot.plot(x, y1, label=r'$y_1$')
pyplot.plot(x, y2, label=r'$y_2$')
pyplot.xlabel(r'$x$')
pyplot.ylabel(r'$y$')
pyplot.title('Two line plots')
pyplot.legend(loc='lower left')
pyplot.show()


We would probably prefer to use subplots. At this point we have to leave the simple interface, and start building the plot using its individual components, figures and axes, which are objects to manipulate:


In [7]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(10,6))
axis1 = axes[0]
axis1.plot(x, y1)
axis1.set_xlabel(r'$x$')
axis1.set_ylabel(r'$y_1$')
axis2 = axes[1]
axis2.plot(x, y2)
axis2.set_xlabel(r'$x$')
axis2.set_ylabel(r'$y_2$')
fig.tight_layout()
pyplot.show()


The axes variable contains all of the separate axes that you may want. This makes it easy to construct many subplots using a loop:


In [8]:
data = []
for nx in range(2,5):
    for ny in range(2,5):
        data.append(numpy.sin(nx * numpy.pi * x) + numpy.cos(ny * numpy.pi * x))

fig, axes = pyplot.subplots(nrows=3, ncols=3, figsize=(10,10))
for nrow in range(3):
    for ncol in range(3):
        ndata = ncol + 3 * nrow
        axes[nrow, ncol].plot(x, data[ndata])
        axes[nrow, ncol].set_xlabel(r'$x$')
        axes[nrow, ncol].set_ylabel(r'$\sin({} \pi x) + \cos({} \pi x)$'.format(nrow+2, ncol+2))
fig.tight_layout()
pyplot.show()


Matplotlib will allow you to generate and place axes pretty much wherever you like, to use logarithmic scales, to do different types of plot, and so on. Check the examples and gallery for details.

Data sets

If the information is not in numpy arrays but in a spreadsheet-like format, Matplotlib may not be the best approach.

For handling large data sets, the standard Python library is pandas. It keeps the data in a dataframe which keeps the rectangular data together with its labels.

Let's load the standard Iris data set, which we can get from GitHub, in:


In [9]:
import pandas

In [10]:
iris = pandas.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv')

Let's get some information about the file we just read in. First, let's see what data fields our dataset has:


In [11]:
iris.columns


Out[11]:
Index(['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', 'Name'], dtype='object')

Now let's see what datatype (i.e. integer, boolean, string, float,...) the data in each field is:


In [12]:
iris.dtypes


Out[12]:
SepalLength    float64
SepalWidth     float64
PetalLength    float64
PetalWidth     float64
Name            object
dtype: object

Finally, let's try printing the first few records in our dataframe:


In [13]:
# print first 5 records
iris.head()


Out[13]:
SepalLength SepalWidth PetalLength PetalWidth Name
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

Note that pandas can read Excel files (using pandas.read_excel), and takes as arguments either the URL (as here) or the filename on the local machine.

Once we have the data, <dataframe>.plot gives us lots of options to plot the result. Let's plot a histogram of the Petal Length:


In [14]:
iris['PetalLength'].plot.hist()
pyplot.show()


We can see the underlying library is Matplotlib, but it's far easier to plot large data sets.

We can get some basic statistics for our data using describe():


In [15]:
iris.describe()


Out[15]:
SepalLength SepalWidth PetalLength PetalWidth
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000

We can also extract specific metrics:


In [16]:
print(iris['SepalLength'].min())
print(iris['PetalLength'].std())
print(iris['PetalWidth'].count())


4.3
1.7644204199522617
150

However, we often wish to calculate statistics for a subset of our data. For this, we can use pandas' groups. Let's group our data by Name and try running describe again. We see that pandas has now calculated statistics for each type of iris separately.


In [17]:
grouped_iris = iris.groupby('Name')
grouped_iris.describe()


Out[17]:
PetalLength PetalWidth ... SepalLength SepalWidth
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
Name
Iris-setosa 50.0 1.464 0.173511 1.0 1.4 1.50 1.575 1.9 50.0 0.244 ... 5.2 5.8 50.0 3.418 0.381024 2.3 3.125 3.4 3.675 4.4
Iris-versicolor 50.0 4.260 0.469911 3.0 4.0 4.35 4.600 5.1 50.0 1.326 ... 6.3 7.0 50.0 2.770 0.313798 2.0 2.525 2.8 3.000 3.4
Iris-virginica 50.0 5.552 0.551895 4.5 5.1 5.55 5.875 6.9 50.0 2.026 ... 6.9 7.9 50.0 2.974 0.322497 2.2 2.800 3.0 3.175 3.8

3 rows × 32 columns


In [18]:
grouped_iris['PetalLength'].mean()


Out[18]:
Name
Iris-setosa        1.464
Iris-versicolor    4.260
Iris-virginica     5.552
Name: PetalLength, dtype: float64

We can select subsets of our data using criteria. For example, we can select all records with PetalLength greater than 5:


In [19]:
iris[iris.PetalLength > 5].head()


Out[19]:
SepalLength SepalWidth PetalLength PetalWidth Name
83 6.0 2.7 5.1 1.6 Iris-versicolor
100 6.3 3.3 6.0 2.5 Iris-virginica
101 5.8 2.7 5.1 1.9 Iris-virginica
102 7.1 3.0 5.9 2.1 Iris-virginica
103 6.3 2.9 5.6 1.8 Iris-virginica

We can also combine criteria like so:


In [20]:
iris[(iris.Name == 'Iris-setosa') & (iris.PetalWidth < 0.3)].head()


Out[20]:
SepalLength SepalWidth PetalLength PetalWidth Name
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

Data across multiple files

Now let's look at a slightly more complex example where the data is spread across multiple files and contains many different fields of different datatypes.

Spotify provide a web API which can be used to download data about its music. This data includes the audio features of a track, a set of measures including 'acousticness', 'danceability', 'speechiness' and 'valence':

A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).

We can download this data using a library such as spotipy. In the folder spotify_data, you shall find a few .csv files containing data downloaded for tracks from playlists of several different musical genres.

Let's begin by importing our data.


In [21]:
dfs = {'indie': pandas.read_csv('spotify_data/indie.csv'), 'pop': pandas.read_csv('spotify_data/pop.csv'), 
       'country': pandas.read_csv('spotify_data/country.csv'), 'metal': pandas.read_csv('spotify_data/metal.csv'), 
       'house': pandas.read_csv('spotify_data/house.csv'), 'rap': pandas.read_csv('spotify_data/rap.csv')}

To compare the data from these different datasets, it will help if we first combine them into a single dataframe. Before we do this, we'll add an extra field to each of our dataframes describing the musical genre so that we do not lose this information when the dataframes are combined.


In [22]:
# add genre field to each dataframe
for name, df in dfs.items():
    df['genre'] = name

# combine into single dataframe
data = pandas.concat(dfs.values())
data


Out[22]:
Unnamed: 0 album artists duration_ms explicit href id name popularity preview_url ... loudness mode speechiness tempo time_signature track_href type.1 uri.1 valence genre
0 0 ZABA Glass Animals 289306 False https://api.spotify.com/v1/tracks/1gk3FhAV07q9... 1gk3FhAV07q9Jg77UxnVjX Gooey 68 NaN ... -13.463 0 0.0433 182.969 4 https://api.spotify.com/v1/tracks/1gk3FhAV07q9... audio_features spotify:track:1gk3FhAV07q9Jg77UxnVjX 0.107 indie
1 1 Mature Themes Ariel Pink 287426 False https://api.spotify.com/v1/tracks/2TUdxbIF3rpY... 2TUdxbIF3rpY887d8m0ZGV Baby 60 https://p.scdn.co/mp3-preview/08dc36572cec683a... ... -14.360 1 0.0331 92.584 4 https://api.spotify.com/v1/tracks/2TUdxbIF3rpY... audio_features spotify:track:2TUdxbIF3rpY887d8m0ZGV 0.328 indie
2 2 Nightlife Phantogram 258186 False https://api.spotify.com/v1/tracks/0UPsdEgJx6yp... 0UPsdEgJx6ypkfFUxD3ZHh Don't Move 62 https://p.scdn.co/mp3-preview/449a5847540951e2... ... -5.206 0 0.0321 99.977 4 https://api.spotify.com/v1/tracks/0UPsdEgJx6yp... audio_features spotify:track:0UPsdEgJx6ypkfFUxD3ZHh 0.492 indie
3 3 Currents Tame Impala 216320 True https://api.spotify.com/v1/tracks/6K4t31amVTZD... 6K4t31amVTZDgR3sKmwUJJ The Less I Know The Better 72 NaN ... -4.077 0 0.0285 116.887 4 https://api.spotify.com/v1/tracks/6K4t31amVTZD... audio_features spotify:track:6K4t31amVTZDgR3sKmwUJJ 0.760 indie
4 4 Gorilla Manor Local Natives 266053 False https://api.spotify.com/v1/tracks/44DnIGEvTlhD... 44DnIGEvTlhDkGv3xLSDcp Wide Eyes 62 https://p.scdn.co/mp3-preview/f400ce8af05f57cd... ... -8.033 0 0.0397 123.120 4 https://api.spotify.com/v1/tracks/44DnIGEvTlhD... audio_features spotify:track:44DnIGEvTlhDkGv3xLSDcp 0.495 indie
5 5 Sunset Lover Petit Biscuit 237792 False https://api.spotify.com/v1/tracks/0hNduWmlWmEm... 0hNduWmlWmEmuwEFcYvRu1 Sunset Lover 83 https://p.scdn.co/mp3-preview/548545f6df276b92... ... -9.474 1 0.0503 90.838 4 https://api.spotify.com/v1/tracks/0hNduWmlWmEm... audio_features spotify:track:0hNduWmlWmEmuwEFcYvRu1 0.236 indie
6 6 I See You The xx 233280 False https://api.spotify.com/v1/tracks/044qU7kwVZaD... 044qU7kwVZaDwzWnRsmVVL I Dare You 69 https://p.scdn.co/mp3-preview/0448f1d3b5983b84... ... -9.483 1 0.0398 126.001 4 https://api.spotify.com/v1/tracks/044qU7kwVZaD... audio_features spotify:track:044qU7kwVZaDwzWnRsmVVL 0.295 indie
7 8 Little Neon Limelight Houndmouth 239880 False https://api.spotify.com/v1/tracks/2kG5litgxXvx... 2kG5litgxXvx6ZRgwO1v7W Sedona 66 https://p.scdn.co/mp3-preview/92c941c847c2841d... ... -7.513 1 0.0354 135.177 4 https://api.spotify.com/v1/tracks/2kG5litgxXvx... audio_features spotify:track:2kG5litgxXvx6ZRgwO1v7W 0.253 indie
8 10 Visions Grimes 255320 False https://api.spotify.com/v1/tracks/4sCYKMatyhaz... 4sCYKMatyhazyy6r2N7Hp2 Genesis 57 https://p.scdn.co/mp3-preview/f65d1db711cd2b66... ... -6.745 0 0.0340 166.040 4 https://api.spotify.com/v1/tracks/4sCYKMatyhaz... audio_features spotify:track:4sCYKMatyhazyy6r2N7Hp2 0.233 indie
9 11 The Singles Collection Silversun Pickups 354360 False https://api.spotify.com/v1/tracks/7eZyj92v6bZs... 7eZyj92v6bZsoQ85MxWZsO Lazy Eye 59 https://p.scdn.co/mp3-preview/4596d41cd35db5dc... ... -5.053 1 0.0461 127.397 4 https://api.spotify.com/v1/tracks/7eZyj92v6bZs... audio_features spotify:track:7eZyj92v6bZsoQ85MxWZsO 0.252 indie
10 12 Dreams The Whitest Boy Alive 191173 False https://api.spotify.com/v1/tracks/08HmtdAdRYiu... 08HmtdAdRYiutP5WO02utw Burning 11 NaN ... -7.536 0 0.0372 150.825 4 https://api.spotify.com/v1/tracks/08HmtdAdRYiu... audio_features spotify:track:08HmtdAdRYiutP5WO02utw 0.864 indie
11 13 Lewis Del Mar Lewis Del Mar 243413 True https://api.spotify.com/v1/tracks/4kK14radw0Xf... 4kK14radw0XfwxJDPt9tnP Painting (Masterpiece) 59 https://p.scdn.co/mp3-preview/862cb96402d381a5... ... -5.167 1 0.0695 91.966 4 https://api.spotify.com/v1/tracks/4kK14radw0Xf... audio_features spotify:track:4kK14radw0XfwxJDPt9tnP 0.710 indie
12 14 Idle Labor Craft Spells 240608 False https://api.spotify.com/v1/tracks/5ozIJrlaCAf3... 5ozIJrlaCAf3g8MMyot483 After the Moment 58 https://p.scdn.co/mp3-preview/a4b01330abdc34c6... ... -3.889 1 0.0293 112.011 4 https://api.spotify.com/v1/tracks/5ozIJrlaCAf3... audio_features spotify:track:5ozIJrlaCAf3g8MMyot483 0.750 indie
13 15 Master of My Make-Believe Santigold 284400 False https://api.spotify.com/v1/tracks/0t2QiRkpag0f... 0t2QiRkpag0fAgs9zuCPlH Disparate Youth 63 https://p.scdn.co/mp3-preview/25b3e506c1b2542e... ... -4.142 1 0.0417 152.005 4 https://api.spotify.com/v1/tracks/0t2QiRkpag0f... audio_features spotify:track:0t2QiRkpag0fAgs9zuCPlH 0.537 indie
14 16 Trouble Cage The Elephant 225973 False https://api.spotify.com/v1/tracks/3pxw9Agh9VB9... 3pxw9Agh9VB9kyx79REpRQ Trouble 58 https://p.scdn.co/mp3-preview/809d189e8d99db82... ... -5.655 1 0.0302 77.861 4 https://api.spotify.com/v1/tracks/3pxw9Agh9VB9... audio_features spotify:track:3pxw9Agh9VB9kyx79REpRQ 0.302 indie
15 17 Brothers The Black Keys 203773 False https://api.spotify.com/v1/tracks/6dU5RxthbuaN... 6dU5RxthbuaN31bRbEDlNw Everlasting Light 59 https://p.scdn.co/mp3-preview/7f9e5423c0c3cfa5... ... -6.554 1 0.0483 181.039 4 https://api.spotify.com/v1/tracks/6dU5RxthbuaN... audio_features spotify:track:6dU5RxthbuaN31bRbEDlNw 0.725 indie
16 18 Fire Diskopunk 197749 False https://api.spotify.com/v1/tracks/5jRJlGB5EP7H... 5jRJlGB5EP7HFKwYBKvlSA Fire 54 https://p.scdn.co/mp3-preview/6a08c9a7b02ad6c1... ... -3.829 0 0.0578 115.980 4 https://api.spotify.com/v1/tracks/5jRJlGB5EP7H... audio_features spotify:track:5jRJlGB5EP7HFKwYBKvlSA 0.494 indie
17 19 Veckatimest Grizzly Bear 243160 False https://api.spotify.com/v1/tracks/0iTpQYzJnYgh... 0iTpQYzJnYgh7kIxyq8A2O Two Weeks 69 https://p.scdn.co/mp3-preview/d7918651392a70a6... ... -5.597 1 0.0286 159.871 4 https://api.spotify.com/v1/tracks/0iTpQYzJnYgh... audio_features spotify:track:0iTpQYzJnYgh7kIxyq8A2O 0.701 indie
18 20 Currents Tame Impala 467586 False https://api.spotify.com/v1/tracks/2X485T9Z5Ly0... 2X485T9Z5Ly0xyaghN73ed Let It Happen 66 NaN ... -5.866 0 0.0448 124.997 4 https://api.spotify.com/v1/tracks/2X485T9Z5Ly0... audio_features spotify:track:2X485T9Z5Ly0xyaghN73ed 0.566 indie
19 21 What Will We Be (Standard) Devendra Banhart 185786 False https://api.spotify.com/v1/tracks/17XvbAd3K1GO... 17XvbAd3K1GOjWy5uQvK2U Baby 59 https://p.scdn.co/mp3-preview/985b135eb9d8f1cf... ... -10.093 0 0.0404 102.020 4 https://api.spotify.com/v1/tracks/17XvbAd3K1GO... audio_features spotify:track:17XvbAd3K1GOjWy5uQvK2U 0.863 indie
20 22 Hurry Up, We're Dreaming M83 241440 False https://api.spotify.com/v1/tracks/1eyzqe2QqGZU... 1eyzqe2QqGZUmfcPZtrIyt Midnight City 70 https://p.scdn.co/mp3-preview/2620971cb1ef9fda... ... -6.525 0 0.0356 105.009 4 https://api.spotify.com/v1/tracks/1eyzqe2QqGZU... audio_features spotify:track:1eyzqe2QqGZUmfcPZtrIyt 0.299 indie
21 23 Oh, Inverted World The Shins 231173 False https://api.spotify.com/v1/tracks/5oUV6yWdDM0R... 5oUV6yWdDM0R9Q2CizRhIt New Slang 70 https://p.scdn.co/mp3-preview/cddd138adb6362fb... ... -11.547 1 0.0306 130.015 4 https://api.spotify.com/v1/tracks/5oUV6yWdDM0R... audio_features spotify:track:5oUV6yWdDM0R9Q2CizRhIt 0.642 indie
22 24 Eyelid Movies Phantogram 249066 False https://api.spotify.com/v1/tracks/3498wF96LsgV... 3498wF96LsgVgMkGmJzJOC When I'm Small 61 https://p.scdn.co/mp3-preview/e6229c2b4d0bbe54... ... -4.340 1 0.0314 91.998 4 https://api.spotify.com/v1/tracks/3498wF96LsgV... audio_features spotify:track:3498wF96LsgVgMkGmJzJOC 0.432 indie
23 25 Walking On A Dream Empire of the Sun 198440 False https://api.spotify.com/v1/tracks/0mBkoM8r7KAQ... 0mBkoM8r7KAQzZij5swTUL Walking On A Dream 68 NaN ... -5.541 0 0.0459 126.967 4 https://api.spotify.com/v1/tracks/0mBkoM8r7KAQ... audio_features spotify:track:0mBkoM8r7KAQzZij5swTUL 0.725 indie
24 26 Hospitality Hospitality 187453 False https://api.spotify.com/v1/tracks/6kaPNiBCzWGz... 6kaPNiBCzWGzY2QoZl4PtR Friends of Friends 49 https://p.scdn.co/mp3-preview/5006247f79b5ec6e... ... -6.291 0 0.0353 104.786 4 https://api.spotify.com/v1/tracks/6kaPNiBCzWGz... audio_features spotify:track:6kaPNiBCzWGzY2QoZl4PtR 0.459 indie
25 27 Being No One, Going Nowhere STRFKR 232946 False https://api.spotify.com/v1/tracks/0xxm1WskKZSC... 0xxm1WskKZSCvbkGjkSMr2 Open Your Eyes 61 https://p.scdn.co/mp3-preview/f84d1e81154865df... ... -4.299 0 0.0322 129.988 4 https://api.spotify.com/v1/tracks/0xxm1WskKZSC... audio_features spotify:track:0xxm1WskKZSCvbkGjkSMr2 0.413 indie
26 28 Writer's Block Peter Bjorn and John 279026 False https://api.spotify.com/v1/tracks/6M6UoxIPn4NO... 6M6UoxIPn4NOWW0x7JPRfv Young Folks 62 https://p.scdn.co/mp3-preview/24b2d194bdffc2f5... ... -5.995 1 0.0380 138.142 4 https://api.spotify.com/v1/tracks/6M6UoxIPn4NO... audio_features spotify:track:6M6UoxIPn4NOWW0x7JPRfv 0.652 indie
27 29 Heza Generationals 226933 False https://api.spotify.com/v1/tracks/4ZFHITy0bAeZ... 4ZFHITy0bAeZ3oFfvR0HaI Put a Light On 62 https://p.scdn.co/mp3-preview/41974e436063af2d... ... -4.463 1 0.0291 100.010 4 https://api.spotify.com/v1/tracks/4ZFHITy0bAeZ... audio_features spotify:track:4ZFHITy0bAeZ3oFfvR0HaI 0.489 indie
28 30 Sylvan Esso Sylvan Esso 269693 False https://api.spotify.com/v1/tracks/71cUqXJ3h1r0... 71cUqXJ3h1r0Ees6YdENLU Coffee 63 https://p.scdn.co/mp3-preview/aad9438fecf5f06a... ... -8.112 1 0.1880 100.049 4 https://api.spotify.com/v1/tracks/71cUqXJ3h1r0... audio_features spotify:track:71cUqXJ3h1r0Ees6YdENLU 0.172 indie
29 31 Manners Passion Pit 174760 False https://api.spotify.com/v1/tracks/4prEPl61C8qZ... 4prEPl61C8qZpeo3IkYSMl Sleepyhead 66 https://p.scdn.co/mp3-preview/de7153eb71eb8a58... ... -5.406 1 0.0354 106.036 4 https://api.spotify.com/v1/tracks/4prEPl61C8qZ... audio_features spotify:track:4prEPl61C8qZpeo3IkYSMl 0.549 indie
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
17 19 Disrespectful G4SHI 202751 True https://api.spotify.com/v1/tracks/0dvu4Irbq0SM... 0dvu4Irbq0SM5BWny30rRR Disrespectful 75 https://p.scdn.co/mp3-preview/c349e2baa9009cce... ... -8.215 1 0.0691 152.003 4 https://api.spotify.com/v1/tracks/0dvu4Irbq0SM... audio_features spotify:track:0dvu4Irbq0SM5BWny30rRR 0.160 rap
18 20 Culture Migos 257545 True https://api.spotify.com/v1/tracks/76Y0gxtTxN0F... 76Y0gxtTxN0FyDCYh5qYQj Get Right Witcha 83 https://p.scdn.co/mp3-preview/0b87da8236cde7dc... ... -5.861 0 0.2380 136.980 4 https://api.spotify.com/v1/tracks/76Y0gxtTxN0F... audio_features spotify:track:76Y0gxtTxN0FyDCYh5qYQj 0.630 rap
19 21 Teenage Emotions Lil Yachty 168320 True https://api.spotify.com/v1/tracks/2hfuZh5FuIOF... 2hfuZh5FuIOFMoUPU01fjv Harley 65 NaN ... -5.364 1 0.0386 143.064 4 https://api.spotify.com/v1/tracks/2hfuZh5FuIOF... audio_features spotify:track:2hfuZh5FuIOFMoUPU01fjv 0.222 rap
20 22 Playboi Carti Playboi Carti, Lil Uzi Vert 235535 True https://api.spotify.com/v1/tracks/59J5nzL1KniF... 59J5nzL1KniFHnU120dQzt wokeuplikethis* 82 NaN ... -6.741 1 0.2540 78.475 4 https://api.spotify.com/v1/tracks/59J5nzL1KniF... audio_features spotify:track:59J5nzL1KniFHnU120dQzt 0.478 rap
21 23 True to Self Bryson Tiller 209040 True https://api.spotify.com/v1/tracks/0OVhQZkNe7lh... 0OVhQZkNe7lh0fQeH96EFW Don't Get Too High 71 https://p.scdn.co/mp3-preview/9663cb843eb7e45c... ... -3.476 1 0.0681 124.055 4 https://api.spotify.com/v1/tracks/0OVhQZkNe7lh... audio_features spotify:track:0OVhQZkNe7lh0fQeH96EFW 0.470 rap
22 24 Painting Pictures Kodak Black 198109 True https://api.spotify.com/v1/tracks/0bXFIF7iL17T... 0bXFIF7iL17TYLyx8JHziM Patty Cake 78 https://p.scdn.co/mp3-preview/61944446d7dbe03b... ... -6.464 0 0.1900 152.094 4 https://api.spotify.com/v1/tracks/0bXFIF7iL17T... audio_features spotify:track:0bXFIF7iL17TYLyx8JHziM 0.419 rap
23 25 bloom Machine Gun Kelly, Quavo, Ty Dolla $ign 203000 True https://api.spotify.com/v1/tracks/5hfpu8saHLGe... 5hfpu8saHLGedbHRtNCYjm Trap Paris (feat. Quavo & Ty Dolla $ign) 74 NaN ... -5.070 0 0.0686 124.992 4 https://api.spotify.com/v1/tracks/5hfpu8saHLGe... audio_features spotify:track:5hfpu8saHLGedbHRtNCYjm 0.188 rap
24 27 True to Self Bryson Tiller 169266 True https://api.spotify.com/v1/tracks/5GG3knKdxKWr... 5GG3knKdxKWrNboRijxeKF Run Me Dry 72 https://p.scdn.co/mp3-preview/a49586868718985f... ... -9.017 0 0.4870 193.881 4 https://api.spotify.com/v1/tracks/5GG3knKdxKWr... audio_features spotify:track:5GG3knKdxKWrNboRijxeKF 0.614 rap
25 28 From The D To The A (feat. Lil Yachty) Tee Grizzley, Lil Yachty 158684 True https://api.spotify.com/v1/tracks/2NyrXRn4tanc... 2NyrXRn4tancYPW6JwtTl2 From The D To The A (feat. Lil Yachty) 80 https://p.scdn.co/mp3-preview/155643656a12e570... ... -5.417 0 0.3030 99.061 4 https://api.spotify.com/v1/tracks/2NyrXRn4tanc... audio_features spotify:track:2NyrXRn4tancYPW6JwtTl2 0.669 rap
26 29 Teenage Emotions Lil Yachty 210973 True https://api.spotify.com/v1/tracks/41nFodiQgbNh... 41nFodiQgbNhVrPsLuKZ2a Priorities 63 NaN ... -8.335 1 0.4310 140.005 4 https://api.spotify.com/v1/tracks/41nFodiQgbNh... audio_features spotify:track:41nFodiQgbNhVrPsLuKZ2a 0.830 rap
27 30 Ransom 2 Mike WiLL Made-It, Kendrick Lamar, Gucci Mane,... 240926 True https://api.spotify.com/v1/tracks/3H1JaS0NUNFj... 3H1JaS0NUNFjwVcJZ76SgC Perfect Pint (feat. Kendrick Lamar, Gucci Mane... 68 NaN ... -5.509 0 0.1460 180.057 4 https://api.spotify.com/v1/tracks/3H1JaS0NUNFj... audio_features spotify:track:3H1JaS0NUNFjwVcJZ76SgC 0.299 rap
28 31 Droptopwop Gucci Mane, 2 Chainz, Young Dolph 261219 True https://api.spotify.com/v1/tracks/5HtziFfFGxXp... 5HtziFfFGxXpugCqZaAHoY Both Eyes Closed (feat. 2 Chainz and Young Dolph) 65 https://p.scdn.co/mp3-preview/5f5224430cbe53e1... ... -11.074 1 0.3540 164.061 4 https://api.spotify.com/v1/tracks/5HtziFfFGxXp... audio_features spotify:track:5HtziFfFGxXpugCqZaAHoY 0.429 rap
29 32 4 Your Eyez Only J. Cole 216520 True https://api.spotify.com/v1/tracks/0utlOiJy2weV... 0utlOiJy2weVl9WTkcEWHy Neighbors 79 NaN ... -13.002 0 0.3300 138.944 4 https://api.spotify.com/v1/tracks/0utlOiJy2weV... audio_features spotify:track:0utlOiJy2weVl9WTkcEWHy 0.413 rap
30 33 What in XXXTarnation Xxxtentacion, Ski Mask The Slump God 164597 True https://api.spotify.com/v1/tracks/3wBAhQLR9VjW... 3wBAhQLR9VjWfyYO3u4oPR What in XXXTarnation 75 https://p.scdn.co/mp3-preview/26f5538a17173cc7... ... -5.583 1 0.1790 119.088 4 https://api.spotify.com/v1/tracks/3wBAhQLR9VjW... audio_features spotify:track:3wBAhQLR9VjWfyYO3u4oPR 0.428 rap
31 34 Flex Like Ouu Lil Pump 108251 True https://api.spotify.com/v1/tracks/4XF1D03U8Lua... 4XF1D03U8LuaLzlCMugIE5 Flex Like Ouu 76 https://p.scdn.co/mp3-preview/475b6e9e01a2be3f... ... -11.346 1 0.0663 141.981 4 https://api.spotify.com/v1/tracks/4XF1D03U8Lua... audio_features spotify:track:4XF1D03U8LuaLzlCMugIE5 0.709 rap
32 35 FUTURE Future 261751 True https://api.spotify.com/v1/tracks/4lozYCMRLtEc... 4lozYCMRLtEc46exlhoK2Q I'm so Groovy 77 https://p.scdn.co/mp3-preview/38d45b2fe9da7cc6... ... -9.739 1 0.2650 138.059 4 https://api.spotify.com/v1/tracks/4lozYCMRLtEc... audio_features spotify:track:4lozYCMRLtEc46exlhoK2Q 0.383 rap
33 36 Love Galore SZA, Travis Scott 260769 True https://api.spotify.com/v1/tracks/1E2ePlJmijVS... 1E2ePlJmijVSX0o1G9TFgR Love Galore 78 https://p.scdn.co/mp3-preview/7c9e74dd63eea9bb... ... -5.314 1 0.0848 134.938 4 https://api.spotify.com/v1/tracks/1E2ePlJmijVS... audio_features spotify:track:1E2ePlJmijVSX0o1G9TFgR 0.382 rap
34 37 DAMN. Kendrick Lamar, Zacari 213400 True https://api.spotify.com/v1/tracks/6PGoSes0D9eU... 6PGoSes0D9eUDeeAafB2As LOVE. FEAT. ZACARI. 88 NaN ... -7.380 1 0.1030 126.075 4 https://api.spotify.com/v1/tracks/6PGoSes0D9eU... audio_features spotify:track:6PGoSes0D9eUDeeAafB2As 0.782 rap
35 38 There's Really A Wolf Russ 164200 True https://api.spotify.com/v1/tracks/4f1QbCjIAgQw... 4f1QbCjIAgQwnf7ms9NXWx Me You 77 https://p.scdn.co/mp3-preview/1673c8436d116e16... ... -2.733 1 0.0328 95.019 4 https://api.spotify.com/v1/tracks/4f1QbCjIAgQw... audio_features spotify:track:4f1QbCjIAgQwnf7ms9NXWx 0.250 rap
36 39 First Day Out Tee Grizzley 254693 True https://api.spotify.com/v1/tracks/0mGgkrD2phGm... 0mGgkrD2phGmsBiDy58z1O First Day Out 70 https://p.scdn.co/mp3-preview/238e15cd82500367... ... -6.312 1 0.4140 97.991 4 https://api.spotify.com/v1/tracks/0mGgkrD2phGm... audio_features spotify:track:0mGgkrD2phGmsBiDy58z1O 0.402 rap
37 40 Up Desiigner 235677 True https://api.spotify.com/v1/tracks/5YYjsDIxJKpV... 5YYjsDIxJKpVgNLkPpBmRA Up 75 NaN ... -7.535 1 0.3750 144.051 4 https://api.spotify.com/v1/tracks/5YYjsDIxJKpV... audio_features spotify:track:5YYjsDIxJKpVgNLkPpBmRA 0.388 rap
38 41 Gilligan (feat. A$AP Rocky & Juicy J) D.R.A.M., A$AP Rocky, Juicy J 207800 True https://api.spotify.com/v1/tracks/2frDm1v0SFNy... 2frDm1v0SFNyeduw3gO632 Gilligan 72 https://p.scdn.co/mp3-preview/34f410a463d5b04f... ... -7.740 1 0.2100 148.010 4 https://api.spotify.com/v1/tracks/2frDm1v0SFNy... audio_features spotify:track:2frDm1v0SFNyeduw3gO632 0.620 rap
39 42 Look At Me! Xxxtentacion 126345 True https://api.spotify.com/v1/tracks/7floNISpH8VF... 7floNISpH8VF4z4459Qo18 Look At Me! 87 https://p.scdn.co/mp3-preview/b1aec4586cd200a8... ... -6.405 1 0.2820 139.059 4 https://api.spotify.com/v1/tracks/7floNISpH8VF... audio_features spotify:track:7floNISpH8VF4z4459Qo18 0.356 rap
40 43 SHINE Wale, J Balvin 227666 True https://api.spotify.com/v1/tracks/6HL9fQAZ5jwa... 6HL9fQAZ5jwajChlEIqVzf Colombia Heights (Te Llamo) [feat. J Balvin] 75 https://p.scdn.co/mp3-preview/7af8bb2a75d0c002... ... -6.284 0 0.1550 135.970 4 https://api.spotify.com/v1/tracks/6HL9fQAZ5jwa... audio_features spotify:track:6HL9fQAZ5jwajChlEIqVzf 0.701 rap
41 44 Everybody Logic 162346 True https://api.spotify.com/v1/tracks/7cGFbx7MP0H2... 7cGFbx7MP0H23iHZTZpqMM Everybody 80 NaN ... -5.968 1 0.0910 110.003 4 https://api.spotify.com/v1/tracks/7cGFbx7MP0H2... audio_features spotify:track:7cGFbx7MP0H23iHZTZpqMM 0.785 rap
42 45 Kill Jill Big Boi, Killer Mike, Jeezy 266856 True https://api.spotify.com/v1/tracks/0rU1aBF8cQ8x... 0rU1aBF8cQ8xS3H25qWuMz Kill Jill 75 https://p.scdn.co/mp3-preview/1cb344e49f240ec1... ... -6.798 0 0.2320 139.934 4 https://api.spotify.com/v1/tracks/0rU1aBF8cQ8x... audio_features spotify:track:0rU1aBF8cQ8xS3H25qWuMz 0.108 rap
43 46 East Coast A$AP Ferg, Remy Ma 260418 True https://api.spotify.com/v1/tracks/4D1aO0inbRZF... 4D1aO0inbRZFsicPx9Jw3B East Coast 72 https://p.scdn.co/mp3-preview/86352a9df1b3eef5... ... -3.298 0 0.3040 134.383 4 https://api.spotify.com/v1/tracks/4D1aO0inbRZF... audio_features spotify:track:4D1aO0inbRZFsicPx9Jw3B 0.176 rap
44 47 NAV NAV 226040 True https://api.spotify.com/v1/tracks/05nbZ1xxVNwU... 05nbZ1xxVNwUTcGwLbp7CN Myself 79 NaN ... -7.621 0 0.2840 159.741 4 https://api.spotify.com/v1/tracks/05nbZ1xxVNwU... audio_features spotify:track:05nbZ1xxVNwUTcGwLbp7CN 0.100 rap
45 48 More Life Drake 283550 True https://api.spotify.com/v1/tracks/2KvHC9z14GSl... 2KvHC9z14GSl4YpkNMX384 Do Not Disturb 79 NaN ... -6.206 0 0.4040 171.029 4 https://api.spotify.com/v1/tracks/2KvHC9z14GSl... audio_features spotify:track:2KvHC9z14GSl4YpkNMX384 0.455 rap
46 49 DAMN. Kendrick Lamar, Rihanna 227360 True https://api.spotify.com/v1/tracks/6SwRhMLwNqEi... 6SwRhMLwNqEi6alNPVG00n LOYALTY. FEAT. RIHANNA. 87 NaN ... -8.926 0 0.0550 110.803 4 https://api.spotify.com/v1/tracks/6SwRhMLwNqEi... audio_features spotify:track:6SwRhMLwNqEi6alNPVG00n 0.424 rap

513 rows × 32 columns

This has given us a fairly sizeable dataframe with 513 rows and 32 columns. However, if you look closely at the index column you'll notice something dodgey has happened - combining our dataframes has meant that the index field is no longer unique (multiple records share the same index).


In [23]:
data.index.is_unique


Out[23]:
False

This is not good. Looking at the printout of the dataframe above, we see that the last record is LOYALTY. by Kendrick Lamar and has index 46. However, if we try to access the record with index 46, we instead get Rebellion (Lies) by Arcade Fire.


In [24]:
data.iloc[46]


Out[24]:
Unnamed: 0                                                         48
album                                                         Funeral
artists                                                   Arcade Fire
duration_ms                                                    310893
explicit                                                        False
href                https://api.spotify.com/v1/tracks/5qk1xXcERl8R...
id                                             5qk1xXcERl8RW645ztqDAW
name                                                 Rebellion (Lies)
popularity                                                         58
preview_url         https://p.scdn.co/mp3-preview/f891f8274794a442...
track_number                                                        9
type                                                            track
uri                              spotify:track:5qk1xXcERl8RW645ztqDAW
acousticness                                                   0.0068
analysis_url        https://api.spotify.com/v1/audio-analysis/5qk1...
danceability                                                    0.401
duration_ms.1                                                  310893
energy                                                          0.941
id.1                                           5qk1xXcERl8RW645ztqDAW
instrumentalness                                                0.607
key                                                                 8
liveness                                                        0.288
loudness                                                       -5.652
mode                                                                1
speechiness                                                    0.0349
tempo                                                         127.178
time_signature                                                      4
track_href          https://api.spotify.com/v1/tracks/5qk1xXcERl8R...
type.1                                                 audio_features
uri.1                            spotify:track:5qk1xXcERl8RW645ztqDAW
valence                                                         0.738
genre                                                           indie
Name: 46, dtype: object

We can remedy this by reindexing. Looking at the fields available, it looks like the tracks' id would be a good choice for a unique index.


In [25]:
data.set_index('id', inplace=True)

In [26]:
data.index.is_unique


Out[26]:
False

Unfortunately, there are still duplicates where the same track appears in multiple playlists. Let's remove these duplicates, keeping only the first instance.


In [27]:
data = data[~data.index.duplicated(keep='first')]
data.index.is_unique


Out[27]:
True

Sucess! Before we do anything else, let's write our single combined dataset to file.


In [28]:
data.to_csv('spotify_data/combined_data.csv')

Now onto some analysis. Let's first look at some statistics for each of our genres.


In [29]:
data[['duration_ms', 'explicit', 'popularity', 'acousticness', 'danceability', 'energy', 'instrumentalness', 
     'liveness', 'loudness', 'speechiness', 'tempo', 'valence', 'genre']].groupby('genre').mean()


Out[29]:
duration_ms explicit popularity acousticness danceability energy instrumentalness liveness loudness speechiness tempo valence
genre
country 223524.061224 0.020408 51.571429 0.116567 0.572694 0.761612 0.004932 0.186751 -5.317980 0.041390 125.962306 0.562510
house 413951.619565 0.010870 32.184783 0.071019 0.764717 0.672033 0.741966 0.122800 -10.278283 0.059933 120.842522 0.374475
indie 237722.958333 0.031250 56.583333 0.189482 0.598167 0.693562 0.147994 0.185283 -6.631104 0.045101 122.083937 0.542333
metal 264057.406780 0.084746 30.694915 0.001797 0.420085 0.946678 0.022654 0.229066 -4.277203 0.103297 131.340254 0.358563
pop 214347.327381 0.184524 67.244048 0.171865 0.658988 0.657414 0.017609 0.155942 -6.193952 0.082749 115.337083 0.467686
rap 214801.739130 0.978261 76.695652 0.147325 0.785978 0.595761 0.000365 0.142937 -7.173261 0.213624 135.210065 0.419304

From this alone we can get a lot of information: house tracks are on average almost twice as long as tracks from the other genres, over 97% of rap tracks contain explicit lyrics, metal tracks are the most energetic but tend to be sadder (lower valence) than country or indie. Let's try sorting our data to find the saddest tracks in each genre.

We do this by sorting the data by valence (sort_values('valence')), grouping by genre (groupby('genre')) then by taking the first value of each group (head(1)).


In [30]:
data.sort_values('valence')[['album', 'artists', 'name', 'genre', 'valence']].groupby('genre').head(1)


Out[30]:
album artists name genre valence
id
4R1AbCs2wEu4e6j7FB7sRZ The Touch Rampa The Touch house 0.0354
2yoCtR2C0sMFgII70RosuY The Raven Age The Raven Age Angel In Disgrace metal 0.0634
2Ce5IyMlVRVvN997ZJjJJA HNDRXX Future, Rihanna Selfish pop 0.0951
05nbZ1xxVNwUTcGwLbp7CN NAV NAV Myself rap 0.1000
1gk3FhAV07q9Jg77UxnVjX ZABA Glass Animals Gooey indie 0.1070
0xwPzLmBAYro8BUz7MrtAo Montevallo Sam Hunt Make You Miss Me country 0.1670

We can visualise our data by plotting the various characteristics against each other. In the plot below, we compare the energy and danceability of country, metal and house music. The data from the three different genres separates into three pretty distinct clusters.


In [31]:
colours = ['red', 'blue', 'green', 'orange', 'pink', 'purple']

ax = data[data.genre == 'country'].plot.scatter('danceability', 'energy', c=colours[0], label='country', figsize=(10,10))
data[data.genre == 'metal'].plot.scatter('danceability', 'energy', c=colours[1], marker='x', label='metal', ax=ax)
data[data.genre == 'house'].plot.scatter('danceability', 'energy', c=colours[2], marker='+', label='house', ax=ax)


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x1199c79b0>

More information about pandas can be found in the documentation, or in tutorials or in standard books.

Messy data

In real life, datasets are often messy, with records containing invalid or missing entries. Fortunately, pandas is equipped with several functions that allow us to deal with messy data.

In this example, we shall be using a dataset from the Data Carpentry website which is a subset of the data from Ernst et al Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA. This data contains a set a records of animals caught during the study.

Let's begin by reading in the data


In [32]:
survey = pandas.read_excel('https://github.com/IanHawke/msc-or-week0/blob/master/excel_data/surveys.xlsx?raw=true')

In [33]:
survey.head()


Out[33]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 NaN
1 2 7 16 1977 3 NL M 33.0 NaN
2 3 7 16 1977 2 DM F 37.0 NaN
3 4 7 16 1977 7 DM M 36.0 NaN
4 5 7 16 1977 3 DM M 35.0 NaN

In the weight column, instead of a number as we may expect, we see the values are 'NaN' or 'Not a Number'. If you open the original spreadsheet, you'll see that the original weight data is missing for these records. The count function returns the number of non-NaN entries per column, so if we subtract that from the length of the survey, we can see how many NaN entries there are per column


In [34]:
len(survey) - survey.count()


Out[34]:
record_id             0
month                 0
day                   0
year                  0
plot_id               0
species_id          763
sex                2511
hindfoot_length    4111
weight             3266
dtype: int64

We need to work out a sensible way to deal with this missing data, as if we try to do any analysis on the dataset in its current state, python may throw value errors. For example, let's try converting the data in the weight column to an integer:


In [35]:
survey.weight.astype('int')


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-35-43f4a4725608> in <module>()
----> 1 survey.weight.astype('int')

/anaconda/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
     89                 else:
     90                     kwargs[new_arg_name] = new_arg_value
---> 91             return func(*args, **kwargs)
     92         return wrapper
     93     return _deprecate_kwarg

/anaconda/lib/python3.6/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
   3408         # else, only a single dtype is given
   3409         new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
-> 3410                                      **kwargs)
   3411         return self._constructor(new_data).__finalize__(self)
   3412 

/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in astype(self, dtype, **kwargs)
   3222 
   3223     def astype(self, dtype, **kwargs):
-> 3224         return self.apply('astype', dtype=dtype, **kwargs)
   3225 
   3226     def convert(self, **kwargs):

/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3089 
   3090             kwargs['mgr'] = self
-> 3091             applied = getattr(b, f)(**kwargs)
   3092             result_blocks = _extend_blocks(applied, result_blocks)
   3093 

/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in astype(self, dtype, copy, errors, values, **kwargs)
    469     def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
    470         return self._astype(dtype, copy=copy, errors=errors, values=values,
--> 471                             **kwargs)
    472 
    473     def _astype(self, dtype, copy=False, errors='raise', values=None,

/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in _astype(self, dtype, copy, errors, values, klass, mgr, raise_on_error, **kwargs)
    519 
    520                 # _astype_nansafe works fine with 1-d only
--> 521                 values = astype_nansafe(values.ravel(), dtype, copy=True)
    522                 values = values.reshape(self.shape)
    523 

/anaconda/lib/python3.6/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy)
    618 
    619         if not np.isfinite(arr).all():
--> 620             raise ValueError('Cannot convert non-finite values (NA or inf) to '
    621                              'integer')
    622 

ValueError: Cannot convert non-finite values (NA or inf) to integer

There are several different ways we can deal with NaNs - which we choose depends on the individual dataset.

It may be that missing data is due to e.g. the machine reading the data in malfunctioning, and the best practice is just to discard all records containing missing data. We can do that with the dropna function.


In [36]:
survey.dropna()


Out[36]:
record_id month day year plot_id species_id sex hindfoot_length weight
62 63 8 19 1977 3 DM M 35.0 40.0
63 64 8 19 1977 7 DM M 37.0 48.0
64 65 8 19 1977 4 DM F 34.0 29.0
65 66 8 19 1977 4 DM F 35.0 46.0
66 67 8 19 1977 7 DM M 35.0 36.0
67 68 8 19 1977 8 DO F 32.0 52.0
68 69 8 19 1977 2 PF M 15.0 8.0
69 70 8 19 1977 3 OX F 21.0 22.0
70 71 8 19 1977 7 DM F 36.0 35.0
73 74 8 19 1977 8 PF M 12.0 7.0
74 75 8 19 1977 8 DM F 32.0 22.0
77 78 8 19 1977 1 PF M 16.0 9.0
78 79 8 19 1977 7 DM F 34.0 42.0
80 81 8 19 1977 4 PF F 14.0 8.0
81 82 8 19 1977 4 DM F 35.0 41.0
82 83 8 20 1977 6 DM F 37.0 37.0
83 84 8 20 1977 19 DM F 35.0 43.0
84 85 8 20 1977 23 DM F 35.0 41.0
85 86 8 20 1977 18 DM F 33.0 40.0
86 87 8 20 1977 5 PF F 11.0 9.0
87 88 8 20 1977 18 DM F 35.0 45.0
88 89 8 20 1977 12 PP F 20.0 15.0
89 90 8 20 1977 18 DM M 35.0 29.0
91 92 8 20 1977 6 DM M 35.0 39.0
93 94 8 20 1977 18 DM F 36.0 43.0
94 95 8 20 1977 23 DM M 38.0 46.0
95 96 8 20 1977 12 DM M 36.0 41.0
96 97 8 20 1977 18 DM M 36.0 41.0
97 98 8 20 1977 5 DM M 38.0 40.0
98 99 8 20 1977 11 DM M 37.0 45.0
... ... ... ... ... ... ... ... ... ...
35507 35508 12 31 2002 6 PB F 25.0 35.0
35508 35509 12 31 2002 6 PB M 26.0 47.0
35509 35510 12 31 2002 6 PB F 26.0 30.0
35513 35514 12 31 2002 11 PP M 23.0 18.0
35515 35516 12 31 2002 11 DO F 35.0 52.0
35516 35517 12 31 2002 11 DM F 36.0 42.0
35517 35518 12 31 2002 11 DO M 36.0 38.0
35518 35519 12 31 2002 9 DM M 37.0 49.0
35520 35521 12 31 2002 9 DM M 37.0 48.0
35521 35522 12 31 2002 9 DM F 35.0 45.0
35522 35523 12 31 2002 9 DM F 36.0 44.0
35523 35524 12 31 2002 9 PB F 25.0 27.0
35524 35525 12 31 2002 9 OL M 21.0 26.0
35525 35526 12 31 2002 8 OT F 20.0 24.0
35526 35527 12 31 2002 13 DO F 33.0 43.0
35528 35529 12 31 2002 13 PB F 25.0 25.0
35531 35532 12 31 2002 14 DM F 34.0 43.0
35532 35533 12 31 2002 14 DM F 36.0 48.0
35533 35534 12 31 2002 14 DM M 37.0 56.0
35534 35535 12 31 2002 14 DM M 37.0 53.0
35535 35536 12 31 2002 14 DM F 35.0 42.0
35536 35537 12 31 2002 14 DM F 36.0 46.0
35537 35538 12 31 2002 15 PB F 26.0 31.0
35538 35539 12 31 2002 15 SF M 26.0 68.0
35539 35540 12 31 2002 15 PB F 26.0 23.0
35540 35541 12 31 2002 15 PB F 24.0 31.0
35541 35542 12 31 2002 15 PB F 26.0 29.0
35542 35543 12 31 2002 15 PB F 27.0 34.0
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0

30676 rows × 9 columns

We may just wish to discard records with NaNs in a particular column (e.g. if we wish to deal with NaNs in other columns in a different way). We can discard all the records with NaNs in the weight column like so:


In [37]:
survey.dropna(subset=['weight'])


Out[37]:
record_id month day year plot_id species_id sex hindfoot_length weight
62 63 8 19 1977 3 DM M 35.0 40.0
63 64 8 19 1977 7 DM M 37.0 48.0
64 65 8 19 1977 4 DM F 34.0 29.0
65 66 8 19 1977 4 DM F 35.0 46.0
66 67 8 19 1977 7 DM M 35.0 36.0
67 68 8 19 1977 8 DO F 32.0 52.0
68 69 8 19 1977 2 PF M 15.0 8.0
69 70 8 19 1977 3 OX F 21.0 22.0
70 71 8 19 1977 7 DM F 36.0 35.0
73 74 8 19 1977 8 PF M 12.0 7.0
74 75 8 19 1977 8 DM F 32.0 22.0
77 78 8 19 1977 1 PF M 16.0 9.0
78 79 8 19 1977 7 DM F 34.0 42.0
80 81 8 19 1977 4 PF F 14.0 8.0
81 82 8 19 1977 4 DM F 35.0 41.0
82 83 8 20 1977 6 DM F 37.0 37.0
83 84 8 20 1977 19 DM F 35.0 43.0
84 85 8 20 1977 23 DM F 35.0 41.0
85 86 8 20 1977 18 DM F 33.0 40.0
86 87 8 20 1977 5 PF F 11.0 9.0
87 88 8 20 1977 18 DM F 35.0 45.0
88 89 8 20 1977 12 PP F 20.0 15.0
89 90 8 20 1977 18 DM M 35.0 29.0
91 92 8 20 1977 6 DM M 35.0 39.0
92 93 8 20 1977 18 DM NaN NaN 42.0
93 94 8 20 1977 18 DM F 36.0 43.0
94 95 8 20 1977 23 DM M 38.0 46.0
95 96 8 20 1977 12 DM M 36.0 41.0
96 97 8 20 1977 18 DM M 36.0 41.0
97 98 8 20 1977 5 DM M 38.0 40.0
... ... ... ... ... ... ... ... ... ...
35508 35509 12 31 2002 6 PB M 26.0 47.0
35509 35510 12 31 2002 6 PB F 26.0 30.0
35513 35514 12 31 2002 11 PP M 23.0 18.0
35515 35516 12 31 2002 11 DO F 35.0 52.0
35516 35517 12 31 2002 11 DM F 36.0 42.0
35517 35518 12 31 2002 11 DO M 36.0 38.0
35518 35519 12 31 2002 9 DM M 37.0 49.0
35519 35520 12 31 2002 9 SF NaN 24.0 36.0
35520 35521 12 31 2002 9 DM M 37.0 48.0
35521 35522 12 31 2002 9 DM F 35.0 45.0
35522 35523 12 31 2002 9 DM F 36.0 44.0
35523 35524 12 31 2002 9 PB F 25.0 27.0
35524 35525 12 31 2002 9 OL M 21.0 26.0
35525 35526 12 31 2002 8 OT F 20.0 24.0
35526 35527 12 31 2002 13 DO F 33.0 43.0
35528 35529 12 31 2002 13 PB F 25.0 25.0
35531 35532 12 31 2002 14 DM F 34.0 43.0
35532 35533 12 31 2002 14 DM F 36.0 48.0
35533 35534 12 31 2002 14 DM M 37.0 56.0
35534 35535 12 31 2002 14 DM M 37.0 53.0
35535 35536 12 31 2002 14 DM F 35.0 42.0
35536 35537 12 31 2002 14 DM F 36.0 46.0
35537 35538 12 31 2002 15 PB F 26.0 31.0
35538 35539 12 31 2002 15 SF M 26.0 68.0
35539 35540 12 31 2002 15 PB F 26.0 23.0
35540 35541 12 31 2002 15 PB F 24.0 31.0
35541 35542 12 31 2002 15 PB F 26.0 29.0
35542 35543 12 31 2002 15 PB F 27.0 34.0
35546 35547 12 31 2002 10 RM F 15.0 14.0
35547 35548 12 31 2002 7 DO M 36.0 51.0

32283 rows × 9 columns

It may be that it's more appropriate for us to set all missing data with a certain value. For example, let's set all missing weights to 0:


In [38]:
nan_zeros = survey.copy() # make a copy so we don't overwrite original dataframe
nan_zeros.weight.fillna(0, inplace=True)
nan_zeros.head()


Out[38]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 0.0
1 2 7 16 1977 3 NL M 33.0 0.0
2 3 7 16 1977 2 DM F 37.0 0.0
3 4 7 16 1977 7 DM M 36.0 0.0
4 5 7 16 1977 3 DM M 35.0 0.0

For our dataset, this is not the best choice as it will change the mean of our data:


In [39]:
print(survey.weight.mean(), nan_zeros.weight.mean())


42.672428212991356 38.751976145601844

A better solution here is to fill all NaN values with the mean weight value:


In [40]:
nan_mean = survey.copy()
nan_mean.weight.fillna(survey.weight.mean(), inplace=True)
print(survey.weight.mean(), nan_mean.weight.mean())
nan_mean.head()


42.672428212991356 42.67242821299182
Out[40]:
record_id month day year plot_id species_id sex hindfoot_length weight
0 1 7 16 1977 2 NL M 32.0 42.672428
1 2 7 16 1977 3 NL M 33.0 42.672428
2 3 7 16 1977 2 DM F 37.0 42.672428
3 4 7 16 1977 7 DM M 36.0 42.672428
4 5 7 16 1977 3 DM M 35.0 42.672428
Exercises
  1. Load the Excel spreadsheet from https://github.com/IanHawke/msc-or-week0/blob/master/excel_data/dice-roll-example.xlsx?raw=true which generates and displays three dice rolls. Work out how to isolate the data for just the dice rolls (eg, use skiprows when loading, and drop unused columns) and compute summary data for each.
  2. Create a histogram of the SepalWidth for each of the species groups in the iris dataset
  3. Plot acousticness against liveness for the music dataset. Use a for loop to add the different datasets to the plot (i.e. rather than typing each out by hand, as done above).

In [41]:
dice = pandas.read_excel('https://github.com/IanHawke/msc-or-week0/blob/master/excel_data/dice-roll-example.xlsx?raw=true', skiprows=5)
print(dice.columns)
dice = dice[['# 1', '# 2', '# 3']]
print(dice.columns)

dice.describe()


Index(['# 1', '# 2', '# 3', '# 4', 'Unnamed: 4', 'Unnamed: 5', 'Unnamed: 6',
       'Unnamed: 7', 'Standard Deviation', 'Unnamed: 9', 'Unnamed: 10',
       'Unnamed: 11', 'Unnamed: 12'],
      dtype='object')
Index(['# 1', '# 2', '# 3'], dtype='object')
Out[41]:
# 1 # 2 # 3
count 2000.000000 2000.000000 2000.000000
mean 3.468500 3.381500 3.507000
std 1.695419 1.682371 1.688898
min 1.000000 1.000000 1.000000
25% 2.000000 2.000000 2.000000
50% 3.000000 3.000000 3.000000
75% 5.000000 5.000000 5.000000
max 6.000000 6.000000 6.000000

In [42]:
for name, df in iris.groupby('Name'):
    # create a new figure
    pyplot.figure()
    # plot histogram of sepalwidth
    df['SepalWidth'].plot.hist()
    # add title
    pyplot.title(name)


In the solution below for the music genre exercise, we've included a few extra steps in order to format the plot and make it more readable (e.g. changing the axis limits, increasing the figure size and fontsize).


In [43]:
# create a new axis
fig, axis = pyplot.subplots()

# create a dictionary of colours
colours = {'indie': 'red', 'pop': 'blue', 
           'country': 'green', 'metal': 'black', 
           'house': 'orange', 'rap': 'pink'}
# create a dictionary of markers 
markers = {'indie': '+', 'pop': 'x', 
           'country': 'o', 'metal': 'd', 
           'house': 's', 'rap': '*'}

for name, df in data.groupby('genre'):
    df.plot.scatter('acousticness', 'liveness', label=name, s=30, color=colours[name], marker=markers[name],
                    ax=axis, figsize=(10,8), fontsize=16)

# set limits of x and y axes so that they are between 0 and 1
axis.set_xlim([0,1.0])
axis.set_ylim([0,1.0])

# set the font size of the axis labels
axis.xaxis.label.set_fontsize(16)
axis.yaxis.label.set_fontsize(16)
pyplot.show()


Further reading

For a basic pandas tutorial, check out Python for ecologists from the Data Carpentry website. Of particular interest may be the last lesson which shows how to interact with SQL databases using python and pandas.

For a more in-depth pandas tutorial, check out these notebooks by Chris Fonnesbeck. In the last notebook, there is quite a lot of material on using pandas with scikit-learn for machine learning, including regression analysis, decision trees and random forests.

Other libraries

There are many other options depending on what you need to display. If you have large data and want to more easily make nice plots, try seaborn or altair. If you want to make the data interactive, especially online, try plotly or bokeh. For a detailed discussion of plotting in Python in 2017, see this talk by Jake Vanderplas.