There are many libraries for plotting in Python. The de facto standard is matplotlib; its examples and gallery are particularly useful references.
Matplotlib is most useful if you have data in numpy arrays. We can then plot standard single graphs straightforwardly:
In [1]:
%matplotlib inline
The above command is only needed if you are plotting in a Jupyter notebook.
We now construct some data:
In [2]:
import numpy
x = numpy.linspace(0, 1)  # 50 points by default
y1 = numpy.sin(numpy.pi * x) + 0.1 * numpy.random.rand(50)
y2 = numpy.cos(3.0 * numpy.pi * x) + 0.2 * numpy.random.rand(50)
And then produce a line plot:
In [3]:
from matplotlib import pyplot
pyplot.plot(x, y1)
pyplot.show()
We can add labels and titles:
In [4]:
pyplot.plot(x, y1)
pyplot.xlabel('x')
pyplot.ylabel('y')
pyplot.title('A single line plot')
pyplot.show()
We can change the plotting style, and use LaTeX style notation where needed:
In [5]:
pyplot.plot(x, y1, linestyle='--', color='black', linewidth=3)
pyplot.xlabel(r'$x$')
pyplot.ylabel(r'$y$')
pyplot.title(r'A single line plot, roughly $\sin(\pi x)$')
pyplot.show()
We can plot two lines at once, and add a legend, which we can position:
In [6]:
pyplot.plot(x, y1, label=r'$y_1$')
pyplot.plot(x, y2, label=r'$y_2$')
pyplot.xlabel(r'$x$')
pyplot.ylabel(r'$y$')
pyplot.title('Two line plots')
pyplot.legend(loc='lower left')
pyplot.show()
We would probably prefer to use subplots. At this point we have to leave the simple interface, and start building the plot using its individual components, figures and axes, which are objects to manipulate:
In [7]:
fig, axes = pyplot.subplots(nrows=1, ncols=2, figsize=(10,6))
axis1 = axes[0]
axis1.plot(x, y1)
axis1.set_xlabel(r'$x$')
axis1.set_ylabel(r'$y_1$')
axis2 = axes[1]
axis2.plot(x, y2)
axis2.set_xlabel(r'$x$')
axis2.set_ylabel(r'$y_2$')
fig.tight_layout()
pyplot.show()
The axes variable contains all of the separate axes that you may want. This makes it easy to construct many subplots using a loop:
In [8]:
data = []
for nx in range(2, 5):
    for ny in range(2, 5):
        data.append(numpy.sin(nx * numpy.pi * x) + numpy.cos(ny * numpy.pi * x))
fig, axes = pyplot.subplots(nrows=3, ncols=3, figsize=(10,10))
for nrow in range(3):
    for ncol in range(3):
        ndata = ncol + 3 * nrow
        axes[nrow, ncol].plot(x, data[ndata])
        axes[nrow, ncol].set_xlabel(r'$x$')
        axes[nrow, ncol].set_ylabel(r'$\sin({} \pi x) + \cos({} \pi x)$'.format(nrow+2, ncol+2))
fig.tight_layout()
pyplot.show()
If the information is not in numpy arrays but in a spreadsheet-like format, Matplotlib may not be the best approach.
For handling large data sets, the standard Python library is pandas. It stores the data in a dataframe, which holds the rectangular data together with its labels.
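To see what a dataframe looks like, here is a minimal sketch that builds one by hand from a dictionary of columns (the column names and values here are made up purely for illustration):
import pandas
df = pandas.DataFrame({'length': [5.1, 4.9, 4.7],
                       'width': [3.5, 3.0, 3.2]})
print(df)  # rows are labelled by an integer index, columns by name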
Let's load in the standard Iris data set, which we can get from GitHub:
In [9]:
import pandas
In [10]:
iris = pandas.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv')
Let's get some information about the file we just read in. First, let's see what data fields our dataset has:
In [11]:
iris.columns
Out[11]:
Now let's see what datatype (e.g. integer, boolean, string, float, ...) the data in each field has:
In [12]:
iris.dtypes
Out[12]:
Finally, let's try printing the first few records in our dataframe:
In [13]:
# print first 5 records
iris.head()
Out[13]:
Note that pandas can also read Excel files (using pandas.read_excel), and its readers take as their argument either a URL (as here) or the name of a file on the local machine.
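For example, reading a local spreadsheet might look like the following sketch (the filename and sheet name here are hypothetical):
# a minimal sketch, assuming a local file 'measurements.xlsx' with a sheet called 'Sheet1'
measurements = pandas.read_excel('measurements.xlsx', sheet_name='Sheet1')
measurements.head()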
Once we have the data, <dataframe>.plot gives us lots of options to plot the result. Let's plot a histogram of the Petal Length:
In [14]:
iris['PetalLength'].plot.hist()
pyplot.show()
We can see that the underlying library is Matplotlib, but the dataframe interface makes it far easier to plot large data sets.
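Because the plot methods return a standard Matplotlib axes object, we can still tweak the result with the Matplotlib calls used above; a minimal sketch:
ax = iris['PetalLength'].plot.hist()
ax.set_xlabel('Petal length')  # standard Matplotlib calls work on the returned axes
ax.set_title('Histogram of petal length')
pyplot.show()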
We can get some basic statistics for our data using describe():
In [15]:
iris.describe()
Out[15]:
We can also extract specific metrics:
In [16]:
print(iris['SepalLength'].min())
print(iris['PetalLength'].std())
print(iris['PetalWidth'].count())
However, we often wish to calculate statistics for a subset of our data. For this, we can use pandas' groups. Let's group our data by Name and try running describe again. We see that pandas has now calculated statistics for each type of iris separately.
In [17]:
grouped_iris = iris.groupby('Name')
grouped_iris.describe()
Out[17]:
In [18]:
grouped_iris['PetalLength'].mean()
Out[18]:
We can select subsets of our data using criteria. For example, we can select all records with PetalLength greater than 5:
In [19]:
iris[iris.PetalLength > 5].head()
Out[19]:
We can also combine criteria like so:
In [20]:
iris[(iris.Name == 'Iris-setosa') & (iris.PetalWidth < 0.3)].head()
Out[20]:
Now let's look at a slightly more complex example where the data is spread across multiple files and contains many different fields of different datatypes.
Spotify provides a web API which can be used to download data about its music. This data includes the audio features of a track, a set of measures including 'acousticness', 'danceability', 'speechiness' and 'valence', the last of which is described as:
A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
We can download this data using a library such as spotipy. In the folder spotify_data, you will find a few .csv files containing data downloaded for tracks from playlists of several different musical genres.
Let's begin by importing our data.
In [21]:
dfs = {'indie': pandas.read_csv('spotify_data/indie.csv'),
       'pop': pandas.read_csv('spotify_data/pop.csv'),
       'country': pandas.read_csv('spotify_data/country.csv'),
       'metal': pandas.read_csv('spotify_data/metal.csv'),
       'house': pandas.read_csv('spotify_data/house.csv'),
       'rap': pandas.read_csv('spotify_data/rap.csv')}
To compare the data from these different datasets, it will help if we first combine them into a single dataframe. Before we do this, we'll add an extra field to each of our dataframes describing the musical genre so that we do not lose this information when the dataframes are combined.
In [22]:
# add genre field to each dataframe
for name, df in dfs.items():
    df['genre'] = name
# combine into single dataframe
data = pandas.concat(dfs.values())
data
Out[22]:
This has given us a fairly sizeable dataframe with 513 rows and 32 columns. However, if you look closely at the index column you'll notice something dodgy has happened: combining our dataframes has meant that the index field is no longer unique (multiple records share the same index).
In [23]:
data.index.is_unique
Out[23]:
This is not good. Looking at the printout of the dataframe above, we see that the last record is LOYALTY. by Kendrick Lamar and has index 46. However, if we try to access the record with index 46, we instead get Rebellion (Lies) by Arcade Fire.
In [24]:
data.iloc[46]
Out[24]:
We can remedy this by reindexing. Looking at the fields available, the tracks' id seems a good choice for a unique index.
In [25]:
data.set_index('id', inplace=True)
In [26]:
data.index.is_unique
Out[26]:
Unfortunately, there are still duplicates where the same track appears in multiple playlists. Let's remove these duplicates, keeping only the first instance.
In [27]:
data = data[~data.index.duplicated(keep='first')]
data.index.is_unique
Out[27]:
Success! Before we do anything else, let's write our single combined dataset to file.
In [28]:
data.to_csv('spotify_data/combined_data.csv')
Now onto some analysis. Let's first look at some statistics for each of our genres.
In [29]:
data[['duration_ms', 'explicit', 'popularity', 'acousticness', 'danceability', 'energy', 'instrumentalness',
'liveness', 'loudness', 'speechiness', 'tempo', 'valence', 'genre']].groupby('genre').mean()
Out[29]:
From this alone we can get a lot of information: house tracks are on average almost twice as long as tracks from the other genres; over 97% of rap tracks contain explicit lyrics; metal tracks are the most energetic, but tend to be sadder (lower valence) than country or indie. Let's try sorting our data to find the saddest tracks in each genre.
We do this by sorting the data by valence (sort_values('valence')), grouping by genre (groupby('genre')), and then taking the first value of each group (head(1)).
In [30]:
data.sort_values('valence')[['album', 'artists', 'name', 'genre', 'valence']].groupby('genre').head(1)
Out[30]:
We can visualise our data by plotting the various characteristics against each other. In the plot below, we compare the energy and danceability of country, metal and house music. The data from the three different genres separates into three pretty distinct clusters.
In [31]:
colours = ['red', 'blue', 'green', 'orange', 'pink', 'purple']
ax = data[data.genre == 'country'].plot.scatter('danceability', 'energy', c=colours[0], label='country', figsize=(10,10))
data[data.genre == 'metal'].plot.scatter('danceability', 'energy', c=colours[1], marker='x', label='metal', ax=ax)
data[data.genre == 'house'].plot.scatter('danceability', 'energy', c=colours[2], marker='+', label='house', ax=ax)
Out[31]:
More information about pandas can be found in the documentation, in tutorials, or in standard books.
In real life, datasets are often messy, with records containing invalid or missing entries. Fortunately, pandas is equipped with several functions that allow us to deal with messy data.
In this example, we shall be using a dataset from the Data Carpentry website which is a subset of the data from Ernst et al., Long-term monitoring and experimental manipulation of a Chihuahuan Desert ecosystem near Portal, Arizona, USA. This data contains a set of records of animals caught during the study.
Let's begin by reading in the data:
In [32]:
survey = pandas.read_excel('https://github.com/IanHawke/msc-or-week0/blob/master/excel_data/surveys.xlsx?raw=true')
In [33]:
survey.head()
Out[33]:
In the weight column, instead of a number as we might expect, we see that some of the values are 'NaN', or 'Not a Number'. If you open the original spreadsheet, you'll see that the weight data is missing for these records. The count function returns the number of non-NaN entries per column, so if we subtract that from the length of the survey, we can see how many NaN entries there are per column:
In [34]:
len(survey) - survey.count()
Out[34]:
We need to work out a sensible way to deal with this missing data, because if we try to do any analysis on the dataset in its current state, Python may raise value errors. For example, let's try converting the data in the weight column to an integer:
In [35]:
survey.weight.astype('int')
There are several different ways we can deal with NaNs; which we choose depends on the individual dataset.
It may be that the missing data is due to, for example, a malfunction in the machine reading the data, and the best practice is simply to discard all records containing missing data. We can do that with the dropna function.
In [36]:
survey.dropna()
Out[36]:
We may just wish to discard records with NaNs in a particular column (e.g. if we wish to deal with NaNs in other columns in a different way). We can discard all the records with NaNs in the weight column like so:
In [37]:
survey.dropna(subset=['weight'])
Out[37]:
It may be that it's more appropriate for us to set all missing data to a certain value. For example, let's set all missing weights to 0:
In [38]:
nan_zeros = survey.copy()  # make a copy so we don't overwrite the original dataframe
nan_zeros['weight'] = nan_zeros['weight'].fillna(0)  # assign back rather than filling the column in place
nan_zeros.head()
Out[38]:
For our dataset, this is not the best choice as it will change the mean of our data:
In [39]:
print(survey.weight.mean(), nan_zeros.weight.mean())
A better solution here is to fill all NaN values with the mean weight value:
In [40]:
nan_mean = survey.copy()
nan_mean['weight'] = nan_mean['weight'].fillna(survey.weight.mean())
print(survey.weight.mean(), nan_mean.weight.mean())
nan_mean.head()
Out[40]:
For the dice-roll example spreadsheet, we need to skip the explanatory rows at the top (using skiprows when loading, and dropping the unused columns) and then compute summary data for each die:
In [41]:
dice = pandas.read_excel('https://github.com/IanHawke/msc-or-week0/blob/master/excel_data/dice-roll-example.xlsx?raw=true', skiprows=5)
print(dice.columns)
dice = dice[['# 1', '# 2', '# 3']]
print(dice.columns)
dice.describe()
Out[41]:
For the Iris data, we can plot a histogram of SepalWidth for each species by looping over the groups:
In [42]:
for name, df in iris.groupby('Name'):
    # create a new figure
    pyplot.figure()
    # plot histogram of SepalWidth
    df['SepalWidth'].plot.hist()
    # add title
    pyplot.title(name)
In the solution below for the music genre exercise, we've included a few extra steps in order to format the plot and make it more readable (e.g. changing the axis limits, increasing the figure size and fontsize).
In [43]:
# create a new axis
fig, axis = pyplot.subplots()
# create a dictionary of colours
colours = {'indie': 'red', 'pop': 'blue',
'country': 'green', 'metal': 'black',
'house': 'orange', 'rap': 'pink'}
# create a dictionary of markers
markers = {'indie': '+', 'pop': 'x',
'country': 'o', 'metal': 'd',
'house': 's', 'rap': '*'}
for name, df in data.groupby('genre'):
    df.plot.scatter('acousticness', 'liveness', label=name, s=30, color=colours[name], marker=markers[name],
                    ax=axis, figsize=(10,8), fontsize=16)
# set limits of x and y axes so that they are between 0 and 1
axis.set_xlim([0,1.0])
axis.set_ylim([0,1.0])
# set the font size of the axis labels
axis.xaxis.label.set_fontsize(16)
axis.yaxis.label.set_fontsize(16)
pyplot.show()
For a basic pandas tutorial, check out Python for ecologists from the Data Carpentry website. Of particular interest may be the last lesson, which shows how to interact with SQL databases using Python and pandas.
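As a flavour of that last point, a minimal sketch of reading a table from an SQLite database into a dataframe might look like this (the database filename and table name here are hypothetical):
import sqlite3
conn = sqlite3.connect('surveys.sqlite')  # hypothetical local SQLite database
surveys_db = pandas.read_sql('SELECT * FROM surveys', conn)  # run a query and get a dataframe back
surveys_db.head()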
For a more in-depth pandas tutorial, check out these notebooks by Chris Fonnesbeck. In the last notebook, there is quite a lot of material on using pandas with scikit-learn for machine learning, including regression analysis, decision trees and random forests.
There are many other options depending on what you need to display. If you have large data and want to more easily make nice plots, try seaborn or altair. If you want to make the data interactive, especially online, try plotly or bokeh. For a detailed discussion of plotting in Python in 2017, see this talk by Jake Vanderplas.
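As a taster, here is a minimal sketch of a grouped scatter plot of the Iris data using seaborn (assuming seaborn is installed; the column names are those of the iris dataframe loaded above):
import seaborn
# one call handles the colouring by species for us
seaborn.scatterplot(data=iris, x='PetalLength', y='PetalWidth', hue='Name')
pyplot.show()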