We previously looked at an introduction to NumPy and Pandas, which are the two core Python libraries for handling data. Here we'll look at Matplotlib and Seaborn as visualization tools.
NumPy is a core package for handling array-based data, and Pandas builds on NumPy to provide operations which work quickly on labeled data. Analogously, Matplotlib is a core package which makes use of NumPy arrays for visualization, and Seaborn builds on Matplotlib to provide operations which work on Pandas data.
Thus, we'll start with a quick intro to the basics of Matplotlib before moving on to show how Seaborn makes your life easier.
The matplotlib library is a powerful tool capable of producing complex publication-quality figures with fine layout control in two and three dimensions; here we will only provide a minimal self-contained introduction to its usage that covers the functionality needed for the rest of the section.
Just as we typically use the shorthand np for NumPy, we will use plt for the matplotlib.pyplot module, where the easy-to-use plotting functions reside (the library also contains a rich object-oriented architecture that we don't have the space to discuss here):
In [1]:
from __future__ import print_function, division
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
Here we'll go through just the basics of using matplotlib to create visualizations in Python.
In [2]:
plt.plot(np.random.rand(100));
In [3]:
x = np.linspace(0, 2*np.pi, 300)
y = np.sin(x)
plt.plot(x, y);
The plt.plot() function is a real workhorse: you can use it for line plots, scatter plots, plotting of multiple lines at one time, etc. By adding other plt functions, you can add other plot elements as well.
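For instance, reusing the x defined above, here is a minimal sketch (not part of the original walkthrough) of passing several x, y pairs to a single call, and of using a format string to draw markers instead of a line:
plt.plot(x, np.sin(x), x, np.cos(x))   # two curves from a single plt.plot call
plt.plot(x, np.sin(2 * x), 'o');       # the 'o' format string draws markers only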
Here is how you can make a simple plot of $\sin(x)$ and $\sin(x^2)$ for $x \in [0, 2\pi]$ with labels and a grid (we use the semicolon in the last line to suppress the display of some information that is unnecessary right now):
In [4]:
y2 = np.sin(x**2)
plt.plot(x, y, label=r'$\sin(x)$')
plt.plot(x, y2, label=r'$\sin(x^2)$')
plt.title('Some functions')
plt.xlabel('x')
plt.ylabel('y')
plt.grid()
plt.legend();
In [5]:
x = np.linspace(0, 2*np.pi, 50)
y = np.sin(x)
plt.plot(x, y, linewidth=3, color='green');
In [6]:
plt.plot(x, y, 'o', markersize=6, color='r');
There is much more that can be done with the simple plt.plot function; for help, take a look at the documentation using IPython's ? functionality:
In [7]:
plt.plot?
In [8]:
# example data
x = np.arange(0.1, 4, 0.5)
y = np.exp(-x)
# example variable error bar values
yerr = 0.1 + 0.2*np.sqrt(x)
xerr = 0.1 + yerr
# First illustrate basic pyplot interface, using defaults where possible.
plt.figure()
plt.errorbar(x, y, xerr=0.2, yerr=0.4, fmt='o')
plt.title("Simplest errorbars, 0.2 in x, 0.4 in y");
In [9]:
x = np.linspace(-5, 5)
y = np.exp(-x**2)
plt.semilogy(x, y);
In [10]:
mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)
# the histogram of the data
n, bins, patches = plt.hist(x, 50, density=True, facecolor='g', alpha=0.75)
plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title('Histogram of IQ')
# This will put a text fragment at the position given:
plt.text(55, .027, r'$\mu=100,\ \sigma=15$', fontsize=14)
plt.axis([40, 160, 0, 0.03])
plt.grid(True)
In [11]:
plt.imshow(np.random.rand(5,10), interpolation='nearest', cmap='Blues');
Images can be shown in very flexible ways; for example, imshow can even handle an RGB tuple at each point.
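Here is a minimal sketch with a made-up array of random RGB values, one (r, g, b) tuple per pixel:
rgb = np.random.rand(5, 10, 3)   # an (M, N, 3) array is treated as one RGB color per pixel
plt.imshow(rgb, interpolation='nearest');
Reading an actual image file with plt.imread gives the same kind of array: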
In [12]:
img = plt.imread('images/stoplight.png')
img.shape
Out[12]:
In [13]:
plt.imshow(img)
plt.xticks([])
plt.yticks([]);
In [14]:
fig, ax = plt.subplots(1, 4, figsize=(10,6),
subplot_kw=dict(xticks=[], yticks=[]))
for i, cmap in enumerate(['Red', 'Green', 'Blue']):
    ax[i].imshow(img[:, :, i], cmap=cmap + 's_r')
    ax[i].set_title(cmap)
ax[3].imshow(img)
ax[3].set_title('RGB');
In [15]:
from mpl_toolkits.mplot3d import Axes3D
Once this has been done, you can create 3D axes by passing the projection='3d' keyword to add_subplot:
fig = plt.figure()
fig.add_subplot(..., projection='3d')
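Before the surface example, here is a minimal sketch of a 3D line plot using that keyword (the helix data is made up purely for illustration):
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, projection='3d')
t = np.linspace(0, 4 * np.pi, 200)
ax.plot(np.cos(t), np.sin(t), t);   # x, y, z coordinates of a helix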
Here is a simple 3D surface plot:
In [16]:
from mpl_toolkits.mplot3d.axes3d import Axes3D
from matplotlib import cm
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1, projection='3d')
X = np.arange(-5, 5, 0.25)
Y = np.arange(-5, 5, 0.25)
X, Y = np.meshgrid(X, Y)
R = np.sqrt(X**2 + Y**2)
Z = np.sin(R)
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap='rainbow',
linewidth=0, antialiased=False)
ax.set_zlim3d(-1.01, 1.01);
There is much, much more that matplotlib can do: we've just scratched the surface here. For more info, check out the matplotlib documentation, and especially the matplotlib gallery. One of the most useful ways to learn how to use matplotlib is to search the gallery for a plot that you're interested in, and then load the code using IPython's %load magic. Then you can run, modify, and experiment with the code:
In [17]:
# %load http://matplotlib.org/mpl_examples/pie_and_polar_charts/polar_scatter_demo.py
Matplotlib is a useful tool, but it leaves much to be desired. Several valid complaints about matplotlib come up again and again: its default plot and color settings look dated; its API is relatively low level, so sophisticated statistical visualization requires a lot of boilerplate code; and it predates Pandas, so it has no built-in notion of DataFrames.
The answer to these problems is Seaborn. Seaborn provides an API on top of matplotlib that uses sane plot and color defaults, defines simple functions for common statistical plot types, and integrates with the functionality provided by Pandas DataFrames.
Let's take a look at seaborn in action. We'll start by importing seaborn, which by convention is imported as sns. We can set the seaborn style as the default matplotlib style by calling sns.set(); after doing this, even simple matplotlib plots will look much better.
Let's look at a before and after:
In [18]:
x = np.linspace(0, 10, 1000)
plt.plot(x, np.sin(x), x, np.cos(x));
In [19]:
import seaborn as sns
sns.set(color_codes=True)
plt.plot(x, np.sin(x), x, np.cos(x));
Ah, much better!
The main idea of Seaborn is that it can create complicated plot types from Pandas data with relatively simple commands.
Let's take a look at a few of the datasets and plot types available in seaborn. Note that all of the following could be done using raw matplotlib commands (this is, in fact, what seaborn does under the hood), but the seaborn API is much more convenient.
In [20]:
data = np.random.multivariate_normal([0, 0], [[5, 2], [2, 2]], size=2000)
data = pd.DataFrame(data, columns=['x', 'y'])
for col in 'xy':
    plt.hist(data[col], density=True, alpha=0.5)
Rather than a histogram, we can get a smooth estimate of the distribution using a kernel density estimation:
In [21]:
for col in 'xy':
    sns.kdeplot(data[col], shade=True)
Histograms and KDE can be combined using distplot:
In [22]:
sns.distplot(data['x']);
If we pass the two variables to kdeplot, we will get a bivariate visualization of the data:
In [23]:
sns.kdeplot(data["x"], data["y"]);
We can see the joint distribution and the marginal distributions together using sns.jointplot. For this plot, we'll set the style to a white background:
In [24]:
with sns.axes_style('white'):
    sns.jointplot("x", "y", data, kind='kde');
There are other parameters which can be passed to jointplot: for example, we can use a hexagonally based histogram instead:
In [25]:
with sns.axes_style('white'):
    sns.jointplot("x", "y", data, kind='hex')
When you generalize joint plots to data sets of larger dimensions, you end up with pair plots. This is very useful for exploring correlations between multi-dimensional data, when you'd like to plot all pairs of values against each other.
We'll demo this with the well-known iris dataset, which lists measurements of petals and sepals of three iris species:
In [26]:
iris = sns.load_dataset("iris")
iris.head()
Out[26]:
Visualizing the multi-dimensional relationships among the samples is as easy as calling sns.pairplot:
In [27]:
sns.pairplot(iris, hue='species');
In [28]:
tips = sns.load_dataset('tips')
tips.head()
Out[28]:
In [29]:
tips['tip_pct'] = 100 * tips['tip'] / tips['total_bill']
grid = sns.FacetGrid(tips, row="sex", col="time", margin_titles=True)
grid.map(plt.hist, "tip_pct", bins=np.linspace(0, 40, 15), color="g");
In [30]:
with sns.axes_style(style='ticks'):
    g = sns.factorplot("day", "total_bill", "sex", data=tips, kind="box")
    g.set_axis_labels("Day", "Total Bill");
In [31]:
with sns.axes_style('white'):
    sns.jointplot("total_bill", "tip", data=tips, kind='hex')
The joint plot can even do some automatic kernel density estimation and regression:
In [32]:
sns.jointplot("total_bill", "tip", data=tips, kind='reg');
In [33]:
planets = sns.load_dataset('planets')
planets.head()
Out[33]:
In [34]:
with sns.axes_style('white'):
    g = sns.factorplot("year", data=planets, aspect=2, kind="count", color="b")
    g.set_xticklabels(step=5)
We can learn more by looking at the method of discovery of each of these planets:
In [35]:
with sns.axes_style('white'):
    methods = planets["method"].value_counts().index
    g = sns.factorplot("year", col='method', col_wrap=2, data=planets, kind="count",
                       size=2, aspect=3, palette="Purples_d",
                       order=range(2001, 2015), col_order=methods)
    g.set_ylabels('Number of discoveries')
For more information on plotting with Seaborn, see the seaborn documentation, the seaborn gallery, and the official seaborn tutorial.
Download this data at https://www.dropbox.com/s/tfy7ygsih7go37j/NYCMresults_2008.csv and move the file into the data directory to use it below.
In [36]:
nyc_data = pd.read_csv('data/NYCMresults_2008.csv')
nyc_data.head()
Out[36]:
In [37]:
nyc_data.dtypes
Out[37]:
We see that Pandas assumed the first row was column labels. Also, we see that the times are of dtype "object". Let's fix both of these by providing a list of column names, and by providing a converter for the times:
In [38]:
from datetime import timedelta

def convert_time(s):
    # convert an 'H:MM:SS' string into a timedelta
    h, m, s = map(int, s.split(':'))
    return timedelta(hours=h, minutes=m, seconds=s)

nyc_data = pd.read_csv('data/NYCMresults_2008.csv',
                       names=['first', 'last', 'age', 'gender', 'split', 'final'],
                       converters={'split': convert_time, 'final': convert_time})
nyc_data.head()
Out[38]:
That looks much better. For the purpose of our seaborn utilities, let's add columns which give the times in seconds:
In [39]:
nyc_data['split_sec'] = nyc_data['split'].astype(int) / 1E9
nyc_data['final_sec'] = nyc_data['final'].astype(int) / 1E9
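Depending on your pandas version, calling astype(int) on these columns may not behave as shown above; an equivalent alternative sketch (not the original approach) is to let pandas parse the times itself and take total seconds explicitly:
# Alternative: convert to timedelta64 and extract seconds directly
nyc_data['split_sec'] = pd.to_timedelta(nyc_data['split']).dt.total_seconds()
nyc_data['final_sec'] = pd.to_timedelta(nyc_data['final']).dt.total_seconds()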
In [40]:
with sns.axes_style('white'):
    g = sns.jointplot("split_sec", "final_sec", nyc_data, kind='hex', stat_func=None)
    # dotted reference line: final_sec = 2 * split_sec, i.e. a perfectly even pace
    g.ax_joint.plot(np.linspace(4000, 16000),
                    np.linspace(8000, 32000), ':k')
The dotted line shows where someone's time would lie if they ran the marathon at a perfectly steady pace. The fact that the distribution lies above this indicates (as you might expect) that most people slow down over the course of the marathon.
Let's create another column in the data, the split fraction, which tells whether someone did a negative split or positive split:
In [41]:
nyc_data['split_frac'] = 1 - 2 * nyc_data['split_sec'] / nyc_data['final_sec']
nyc_data.head()
Out[41]:
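To make the sign convention concrete, here is a tiny worked example with made-up numbers:
split_sec, final_sec = 5600.0, 10800.0      # first half slower than the second
print(1 - 2 * split_sec / final_sec)        # about -0.037: a negative split
split_sec, final_sec = 5000.0, 10800.0      # first half faster than the second
print(1 - 2 * split_sec / final_sec)        # about +0.074: a positive split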
Where this split fraction is less than zero, the person negative-split the race by that fraction. Let's do a distribution plot of this split fraction:
In [42]:
sns.distplot(nyc_data['split_frac'], kde=False);
plt.axvline(0, color="k", linestyle="--");
In [43]:
sum(nyc_data.split_frac < 0)
Out[43]:
There were 240 people who negative-split their race.
Let's see whether there is any correlation between this split fraction and other variables. We'll do this using a PairGrid:
In [44]:
g = sns.PairGrid(nyc_data,
x_vars=['age', 'split_sec', 'final_sec'],
y_vars=['split_frac'],
hue='gender', palette=['r', 'b'], size=4)
g.map(plt.scatter, marker='.')
g.add_legend();
In [45]:
sns.kdeplot(nyc_data.split_frac[nyc_data.gender=='M'], label='men', color='b', shade=True)
sns.kdeplot(nyc_data.split_frac[nyc_data.gender=='W'], label='women', color='r', shade=True)
plt.xlabel('split_frac');
The interesting thing here is that there are many more men than women who are running close to an even split! It almost looks like some kind of bimodal distribution among the men and women. Let's see if we can suss out what's going on by looking at the distributions as a function of age.
A nice way to compare distributions is to use a violin plot:
In [46]:
def age_range(age_min, age_max):
    # boolean mask selecting runners with age_min <= age < age_max (used below)
    return (nyc_data.age >= age_min) & (nyc_data.age < age_max)

sns.violinplot("gender", "split_frac", data=nyc_data, palette=["b", "r"]);
This is yet another way to view the distributions among men and women.
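As a quick aside (a hypothetical use, not in the original flow), the age_range helper defined above gives an easy way to slice out an age band; for example, restricting the same plot to runners in their twenties and thirties:
young = nyc_data[age_range(20, 40)]   # runners aged 20-39 only
sns.violinplot("gender", "split_frac", data=young, palette=["b", "r"]);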
Let's look a little deeper, and compare these violin plots as a function of age. We'll start by creating a new column in the array which specifies the decade of age that each person is in:
In [47]:
nyc_data['age_dec'] = nyc_data.age.map(lambda age: 10 * (age // 10))
nyc_data.head()
Out[47]:
In [48]:
sns.violinplot("age_dec", "split_frac", hue="gender", data=nyc_data,
split=True, inner="quartile", palette=["b", "r"]);
Looking at this, we can see where the distributions of men and women differ: the split distributions of men in their 20s-50s show a pronounced over-density toward lower splits when compared to women of the same age (or of any age, for that matter).
Also surprisingly, the 80-year-old women seem to out-perform everyone in terms of their split time. I'm not sure how to explain that.
Back to the men with fast second halves: who are these runners? Does this split fraction correlate with finishing quickly? We can plot this very easily. We'll use lmplot, which will automatically fit a linear regression to the data:
In [49]:
g = sns.lmplot('final_sec', 'split_frac', col='gender', data=nyc_data,
markers=".", scatter_kws=dict(color='c'))
g.map(plt.axhline, y=0.1, color="k", ls=":");
Apparently the people with fast splits are the elite runners who are finishing within ~15000 sec, or about 4 hours. People slower than that are much less likely to have a fast second split.
I would hypothesize that you could describe the distribution of runners with a two-component Gaussian distribution: there are the elite runners who are in shape and have fast splits, and there are the casual runners who are less in shape and tend to tire out more. When we get to Unsupervised Machine Learning, we'll have a chance to test this theory out.