Plotting and Visualization

In [1]:
import numpy as np
import pandas as pd
import matplotlib as mpl  # used sparingly
import matplotlib.pyplot as plt

In [2]:
pd.set_option("display.notebook_repr_html", False)
pd.set_option("display.max_rows", 10)

Landscape of Plotting Libraries

  • matplotlib
  • mpld3
    • "Bringing matplotlib to the browser"
  • d3py
    • "a plotting library for python based on d3."
  • mayavi
    • "seeks to provide easy and interactive visualization of 3D data."
  • ggplot
    • "Yes, it's another port of ggplot2."
  • bokeh
    • "Bokeh is a Python interactive visualization library that targets modern web browsers for presentation."
  • mpl_toolkits

Matplotlib Orientation


  • Matplotlib is the de facto standard for plotting in Python
  • Understanding matplotlib is key to unlocking its power

Online Documentation

Getting Help

Notebook specifics

In [3]:
%matplotlib inline


  • Potential uses of matplotlib
    • interactively from python shell/IPython
    • Embed in a GUI
    • Generate postscript images in batch scripts
    • In a web application to serve graphs
  • Each of these use cases is enabled by using a backend
  • Two types
    • User interface / Interactive (for use in pygtk, wxpython, tkinter, qt4, or macosx)
    • Hard copy / Non-interactive (PNG, SVG, PDF, PS)
  • Set your backend in your matplotlibrc
  • Or with the use function (before importing pyplot)
from matplotlib import use
use('PS')  # postscript
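
As a rough sketch of the hard-copy case, the snippet below selects the Agg backend (a PNG renderer, chosen here only for illustration instead of PS) and writes a figure to an in-memory buffer rather than a window. The buffer and variable names are made up for this example.

```python
import io

import matplotlib
matplotlib.use("Agg")  # hard-copy backend; select it before importing pyplot
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4, 5])

# Render to an in-memory PNG instead of opening a GUI window
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
png_bytes = buffer.getvalue()
plt.close(fig)

print(matplotlib.get_backend(), len(png_bytes), "bytes")
```

The same script runs unchanged on a machine without a display, which is the usual reason to pick a non-interactive backend in batch jobs.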


  • See Customizing Matplotlib for more information
  • You can edit your matplotlibrc to change the matplotlib defaults

In [4]:
from matplotlib import matplotlib_fname


You can also change them dynamically using the global rcParams object

In [5]:
from matplotlib import rcParams

In [6]:


In [7]:


In [8]:
rcParams[''] = 'monospace'

In [9]:


In [10]:
rcParams[''] = 'sans-serif'

You can also use the rc_context context manager

In [11]:
from matplotlib import rc_context

In [12]:
with rc_context({'': 'monospace'}):
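
A minimal sketch of both approaches, assuming the setting being changed is font.family (an assumption made for this example; any valid rcParams key behaves the same way).

```python
from matplotlib import rcParams, rc_context

original = rcParams["font.family"]

# Global change: stays in effect until it is changed back
rcParams["font.family"] = "monospace"
rcParams["font.family"] = original

# Temporary change: restored automatically when the block exits
with rc_context({"font.family": "monospace"}):
    inside = rcParams["font.family"]

after = rcParams["font.family"]
print(inside, after)
```

The context manager is usually the safer choice in a notebook, since a forgotten global change silently affects every later figure.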


In [13]:


Interactive Plotting with PyPlot

  • Interactive backends allow plotting to the screen
  • Interactive mode plots to the screen without calls to show
  • Interactive mode does not require using pyplot
  • Doing the following at the interpreter will show a plot
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4, 5])
  • At the IPython interpreter, enable interactive mode with (or set it in matplotlibrc)
import matplotlib.pyplot as plt

or with

from matplotlib import interactive

In [14]:
import matplotlib.pyplot as plt

In [15]:
plt.plot([1, 2, 3, 4])

<matplotlib.text.Text at 0x1075836a0>
  • If using object method calls, you must call draw or draw_if_interactive to see changes
  • Again, this is unnecessary in the notebook

In [16]:
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4, 5])

  • By default the plot method takes x values, then y values
  • If only one sequence is given, it is treated as the y values, and the x values are assumed to be the indices of that sequence
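
One way to check this is to look at the data stored on the returned Line2D objects. This is a small sketch; the variable names are made up for illustration.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
(only_y,) = ax.plot([1, 5, 3])                  # one sequence: used as y
(x_and_y,) = ax.plot([10, 20, 30], [1, 5, 3])   # explicit x, then y

# x for the first line is filled in with the indices of the y values
print(only_y.get_xdata(), only_y.get_ydata())
print(x_and_y.get_xdata(), x_and_y.get_ydata())
plt.close(fig)
```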

In [17]:
plt.plot([1, 5, 3])

[<matplotlib.lines.Line2D at 0x109fdecc0>]

What is the pyplot namespace?

  • It's where everything comes together
  • Usually where you want to start
  • Broadly, 3 categories of functions
    • Plotting preparation
    • Plotting functions
    • Plot modifiers

Plotting Preparation

Function Description
autoscale Autoscale the axis view to the data (toggle).
axes Add an axes to the figure.
axis Convenience method to get or set axis properties.
cla Clear the current axes.
clf Clear the current figure.
clim Set the color limits of the current image.
delaxes Remove an axes from the current figure.
locator_params Control behavior of tick locators.
margins Set or retrieve autoscaling margins.
figure Creates a new figure.
gca Return the current axis instance.
gcf Return a reference to the current figure.
gci Get the current colorable artist.
hold Set the hold state.
ioff Turn interactive mode off.
ion Turn interactive mode on.
ishold Return the hold status of the current axes.
isinteractive Return status of interactive mode.
rc Set the current rc params.
rc_context Return a context manager for managing rc settings.
rcdefaults Restore the default rc params.
savefig Save the current figure.
sca Set the current Axes instance.
sci Set the current image.
set_cmap Set the default colormap
setp Set a property on an artist object
show Display a figure
subplot Return a subplot axes positioned by the given grid definition.
subplot2grid Create a subplot in a grid.
subplot_tool Launch a subplot tool window for a figure.
subplots Create a figure with a set of subplots already made.
subplots_adjust Tune the subplot layout.
switch_backend Switch the default backend.
tick_params Change the appearance of ticks and tick labels.
ticklabel_format Change the ScalarFormatter used by default for linear axes.
tight_layout Automatically adjust subplot parameters to give specified padding.
xkcd Turns on XKCD sketch-style drawing mode.
xlabel Set the x axis label of the current axis.
xlim Get or set the x limits of the current axes.
xscale Set the scaling of the x-axis.
xticks Get or set the current tick locations and labels of the x-axis.
ylabel Set the y axis label of the current axis.
ylim Get or set the y-limits of the current axes.
yscale Set the scaling of the y-axis.
yticks Get or set the current tick locations and labels of the y-axis.

Plotting Functions

Function Description
acorr Plot the autocorrelation of x
bar Make a bar plot
barbs Plot a 2-D field of barbs
barh Make a horizontal bar plot
boxplot Make a box and whisker plot
broken_barh Plot horizontal bars
cohere Plot the coherence between x and y
contour Plot contours
contourf Plot filled contours
csd Plot cross-spectral density
errorbar Plot an errorbar graph
eventplot Plot identical parallel lines at specific positions
fill Plot filled polygons
fill_between Make filled polygons between two curves
fill_betweenx Make filled polygons between two horizontal curves
hexbin Make a hexagonal binning plot
hist Plot a histogram
hist2d Make a 2D histogram plot
imshow Display an image on the axes
loglog Make a plot with log scaling on both the x and y axis
matshow Display an array as a matrix in a new figure window
pcolor Create a pseudocolor plot of a 2-D array
pcolormesh Plot a quadrilateral mesh
pie Plot a pie chart
plot Plot lines and/or markers
plot_date Plot with data with dates
polar Make a polar plot
psd Plot the power spectral density
quiver Plot a 2-D field of arrows
scatter Make a scatter plot of x vs y
semilogx Make a plot with log scaling on the x axis
semilogy Make a plot with log scaling on the y axis
specgram Plot a spectrogram
spy Plot the sparsity pattern on a 2-D array
stackplot Draws a stacked area plot
stem Create a stem plot
step Make a step plot
streamplot Draws streamlines of a vector flow
tricontour Draw contours on an unstructured triangular grid
tricontourf Draw filled contours on an unstructured triangular grid
tripcolor Create a pseudocolor plot of an unstructured triangular grid
triplot Draw a unstructured triangular grid as lines and/or markers
xcorr Plot the cross-correlation between x and y

Plot modifiers

Function Description
annotate Create an annotation: a piece of text referring to a data point
arrow Add an arrow to the axes
axhline Add a horizontal line across the axis
axhspan Add a horizontal span (rectangle) across the axis
axvline Add a vertical line across the axes
axvspan Add a vertical span (rectangle) across the axes
box Turn the axes box on or off
clabel Label a contour plot
colorbar Add a colorbar to a plot
grid Turn the axes grids on or off
hlines Plot horizontal lines
legend Place a legend on the current axes
minorticks_off Remove minor ticks from the current plot
minorticks_on Display minor ticks on the current plot
quiverkey Add a key to a quiver plot
rgrids Get or set the radial gridlines on a polar plot
suptitle Add a centered title to the figure
table Add a table to the current axes
text Add text to the axes
title Set a title of the current axes
vlines Plot vertical lines
xlabel Set the x axis label of the current axis
ylabel Set the y axis label of the current axis


  • The Figure is the central object of matplotlib
  • It is the GUI window that contains the plot

In [18]:

In [19]:
fig = plt.figure()
  • Close the last made Figure, by default

In [20]:
  • You can also refer to figures by their number starting at 1
  • plt.close('all') is handy
  • One of the most commonly used options when creating a Figure is figsize, a tuple giving the width and height in inches

In [21]:
fig = plt.figure(figsize=(5, 5))

<matplotlib.figure.Figure at 0x109fc0198>


  • The Axes object is contained within and belongs to a figure
  • This is where the plotting happens
  • You will interact with the Axes most often
  • Use the add_subplot method to put an axes on a figure
  • It takes the shorthand for n_rows, n_cols, plot_number

In [22]:
fig = plt.figure()
ax = fig.add_subplot(111)
lines = ax.plot([1, 2, 3])
text = ax.set_xlabel("X")

  • You may have guessed that you can have more than one axes on a figure

In [23]:
fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(121)
ax1.plot([1, 2, 3])

ax2 = fig.add_subplot(122)
ax2.plot([3, 2, 1])

[<matplotlib.lines.Line2D at 0x10a498860>]

Library Plotting

  • You'll notice above that I stopped using plt for almost everything but figure creation
  • This is usually how I use matplotlib and allows the most flexible, powerful usage
  • In fact, most functions in the pyplot namespace call gca to get the current Axes and then delegate to the corresponding method of that Axes object
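
A short sketch of that delegation: a pyplot call and a method call on the Axes object end up drawing on the same Axes. Variable names are illustrative.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# The Axes just created is the "current" one that pyplot functions act on
same_axes = plt.gca() is ax

plt.plot([1, 2, 3])   # pyplot call: looks up the current Axes, then plots on it
ax.plot([3, 2, 1])    # object method call on the same Axes

n_lines = len(ax.lines)   # both lines live on one Axes
print(same_axes, n_lines)
plt.close(fig)
```

Working with the Axes object directly avoids any ambiguity about which axes is "current" once a figure has several of them.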

In [24]:
  • You'll also notice that I assign the returns from the matplotlib object method calls to variables
  • This is a good habit to get into and we will see why below
  • One last handy function is plt.subplots
  • It's almost all I ever use from the plt namespace with a few exceptions

In [25]:
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(np.random.randn(20), np.random.randn(20))

<matplotlib.collections.PathCollection at 0x10a6fbf98>

Notebook aside

You can work on figures across cells. Just make the existing figure object the last line in the cell.

In [26]:
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(np.random.randn(20), np.random.randn(20))

<matplotlib.collections.PathCollection at 0x10a823978>

In [27]:
ax.scatter(np.random.randn(20), np.random.randn(20), color='r')



Let's make some basic plots. Make a scatter plot as above with 500 points. Draw random numbers from 0 to 100 for the y axis and set the limits of the y axis at 0 and 200.


In [28]:


  • Single letter shortcuts

      b: blue
      g: green
      r: red
      c: cyan
      m: magenta
      y: yellow
      k: black
      w: white
  • Shades of gray: a string containing a float in the 0-1 range

    color = '0.75'

  • HTML hex strings

    color = '#eeefff'

  • R, G, B tuples with R, G, B in [0, 1]
  • HTML names for colors, like ‘red’, ‘burlywood’ and ‘chartreuse’
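
To see that these are different spellings of the same thing, each specification can be converted to an RGBA tuple with matplotlib.colors.to_rgba. This is only a sketch; the list of specs is made up from the examples above.

```python
from matplotlib import colors as mcolors

specs = ["r", "0.75", "#eeefff", (1.0, 0.0, 0.0), "burlywood"]
for spec in specs:
    # every valid color specification maps to (red, green, blue, alpha)
    print(repr(spec), "->", mcolors.to_rgba(spec))

# the single-letter shortcut and the RGB tuple describe the same red
same_red = mcolors.to_rgba("r") == mcolors.to_rgba((1.0, 0.0, 0.0))
gray = mcolors.to_rgba("0.75")
```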


  • See the matplotlib markers documentation for the full list of markers
  • A few commonly used ones are
".":    point
",":    pixel
"o":    circle
"*":    star
"+":    plus
"x":    x
"D":    diamond


'-'     solid
'--'    dashed
'-.'    dash_dot
':'     dotted
'None'  draw nothing
' '     draw nothing
''      draw nothing


Create a figure that holds two subplots in two rows. In the top one, plot a sin curve from $-2\pi$ to $2\pi$ in green. In the second one, plot a dashed red line (Hint: you may find np.linspace to be useful).

In [29]:
x = np.linspace(-2*np.pi, 2*np.pi, 100)

In [30]:
y = np.sin(x)

In [31]:
plt.plot(x, y)

[<matplotlib.lines.Line2D at 0x10a930a90>]

Labels and Legends

You can label many things in matplotlib

Labeling lines allows automatic legend creation

In [32]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.plot([1, 2, 4, 5], label="Line 1")
ax.plot([2, 5, 3, 4], label="Line 2")
legend = ax.legend(loc='best', fontsize=20)

You can label the X and Y axes

In [33]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.plot([1, 2, 4, 5], label="Line 1")
ax.plot([2, 5, 3, 4], label="Line 2")

ax.set_xlabel("X", fontsize=20)
ax.set_ylabel("Y", fontsize=20)
legend = ax.legend(loc='best', fontsize=20)

Label the axes with a title

In [34]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.plot([1, 2, 4, 5], label="Line 1")
ax.plot([2, 5, 3, 4], label="Line 2")

ax.set_xlabel("X", fontsize=20)
ax.set_ylabel("Y", fontsize=20)

ax.set_title("Title", fontsize=20)

legend = ax.legend(loc='best', fontsize=20)

Ticks and Tick Labels

  • The Ticks are the location of the Tick labels
  • The Tick lines denote the Ticks
  • The Tick labels are the text accompanying the tick
  • A Ticker determines the ticks and their labels automatically
  • You can use tick_params to adjust the appearance of the ticks

In [35]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.tick_params(axis='y', which='major', length=15, right=False)
ax.tick_params(axis='x', which='major', length=15, top=False, direction="out", pad=15)
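
Tick locations themselves can be controlled by swapping the locator. The sketch below uses MultipleLocator from matplotlib.ticker to place a major tick every 0.5 units; the spacing and data are arbitrary choices for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# Replace the automatic locator: one major tick every 0.5 units on x
ax.xaxis.set_major_locator(MultipleLocator(0.5))
xticks = ax.get_xticks()
print(xticks)
plt.close(fig)
```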

You can set your own tick labels

In [36]:
fig, ax = plt.subplots(figsize=(8, 8))
ax.tick_params(axis='y', which='major', length=15, right=False)
ax.tick_params(axis='x', which='major', length=15, top=False)

ticklabels = ax.xaxis.set_ticklabels(['aaaa', 'bbbb', 'cccc', 
                                      'dddd', 'eeee', 'ffff'],
                                     rotation=45, fontsize=15)


The spines are the boundaries of the axes, and they can be selectively turned off

In [37]:

{'bottom': <matplotlib.spines.Spine at 0x10ac02ba8>,
 'left': <matplotlib.spines.Spine at 0x10acc7780>,
 'right': <matplotlib.spines.Spine at 0x10acc7da0>,
 'top': <matplotlib.spines.Spine at 0x10ac022e8>}

In [38]:
fig, ax = plt.subplots(figsize=(8, 8))

ax.tick_params(bottom=False, top=False, left=False, right=False)
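
A minimal sketch of turning spines off selectively, here the top and right ones (an arbitrary choice for illustration).

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 8))
ax.plot([1, 2, 3])

# ax.spines maps 'left', 'right', 'top' and 'bottom' to Spine objects
for side in ("top", "right"):
    ax.spines[side].set_visible(False)

visible = {side: spine.get_visible() for side, spine in ax.spines.items()}
print(visible)
plt.close(fig)
```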




More on plot

The plot function is a bit of a workhorse with a flexible API

In [39]:
x, y = np.random.randn(2, 100)

In [40]:
fig, ax = plt.subplots()
ax.plot(y, 'g--')

[<matplotlib.lines.Line2D at 0x109ea0908>]

In [41]:
fig, ax = plt.subplots()
ax.plot(x, y)

[<matplotlib.lines.Line2D at 0x10b7414a8>]

In [42]:
fig, ax = plt.subplots()
ax.plot(x, y, 'o')

[<matplotlib.lines.Line2D at 0x10b7bc400>]

In [43]:
x2, y2 = np.random.randn(2, 200)

In [44]:
fig, ax = plt.subplots()
lines = ax.plot(x, y, 'o', x2, y2, 'ro', ms=8, alpha=.5)

Plotting in Pandas vs Matplotlib

  • Pandas provides a few accessors that allow you to stay fairly high-level without giving up any of the power and flexibility of matplotlib
  • Series and DataFrames have a plot method
  • They take a kind keyword argument which accepts several values for plots other than the default line plot. These include:

    • bar or barh for bar plots
    • hist for histogram
    • box for boxplot
    • kde or density for density plots
    • area for area plots
    • scatter for scatter plots
    • hexbin for hexagonal bin plots
    • pie for pie plots

In [45]:
y = pd.Series(np.random.randn(25))

<matplotlib.axes._subplots.AxesSubplot at 0x10b914ef0>

In [46]:

<matplotlib.axes._subplots.AxesSubplot at 0x10bbae4a8>
  • Notice that these return AxesSubplot objects, so we have our hook into all of the powerful methods from matplotlib
  • So, too, do DataFrames

In [47]:
dta = pd.DataFrame({'normal': np.random.normal(size=100), 
                    'gamma': np.random.gamma(1, size=100), 
                   'poisson': np.random.poisson(size=100)})
ax = dta.cumsum(0).plot()


Without re-plotting any of the above, re-size the fonts for the labels and the legend and display the figure.

  • Alternatively, we can plot the above in separate subplots
  • We can also change the figsize

In [48]:
ax = dta.cumsum(0).plot(subplots=True, figsize=(10, 10))

  • These are just matplotlib objects
  • Note the use of tight_layout below
  • tight_layout automatically adjusts the subplot params so that the subplot fits the figure
  • You can have more fine-grained control using

In [49]:
axes = dta.cumsum(0).plot(subplots=True, figsize=(10, 10))
fig = axes[0].figure
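
As a sketch of the two levels of control, assuming the finer-grained tool meant here is subplots_adjust: tight_layout picks the spacing automatically, and subplots_adjust then overrides individual parameters by hand. The plain matplotlib figure below stands in for the pandas axes.

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(3, 1, figsize=(10, 10))
for ax in axes:
    ax.set_title("panel")

fig.tight_layout()                # automatic padding around the subplots
fig.subplots_adjust(hspace=0.5)   # manual override of the vertical gap
hspace = fig.subplotpars.hspace
print(hspace)
plt.close(fig)
```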

  • We can easily add a secondary y-axis

In [50]:
axes = dta.cumsum().plot(secondary_y='normal')

  • We can also ask pandas to plot on already existing axes

In [51]:
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

for i, ax in enumerate(axes):
    variable = dta.columns[i]
    ax = dta[variable].cumsum().plot(ax=ax)
    ax.set_title(variable, fontsize=16)
axes[0].set_ylabel("Cumulative Sum", fontsize=14);

Bar plots

  • Bar plots are useful for displaying and comparing measurable quantities, such as counts or volumes.
  • We can use the plot method with a kind='bar' argument.
  • Let's use temperature data from NYC 1995 - 2014

In [52]:
dta = pd.read_csv("../data/weather_nyc.csv")


In [ ]:
dta = dta.loc[dta.year < 2015]  # truncate to end of year

Or equivalently

In [ ]:
dta.query("year < 2015")

Recall that pandas.cut can be used to bin continuous data into buckets

In [ ]:
bins = [dta.temp.min(), 32, 55, 80, dta.temp.max()]

In [ ]:
labels = ["freezing", "cold", "warm", "hot"]

dta["temp_bin"] = pd.cut(dta.temp, bins, labels=labels)
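
A small self-contained sketch of the same binning on made-up temperatures. One detail worth knowing: intervals are right-closed by default, so the minimum value falls outside the first bin unless include_lowest=True is passed.

```python
import pandas as pd

temps = pd.Series([10, 32, 33, 55, 70, 80, 95])   # made-up temperatures
bins = [temps.min(), 32, 55, 80, temps.max()]
labels = ["freezing", "cold", "warm", "hot"]

# Default: bins are (10, 32], (32, 55], ... so the minimum itself is dropped
default_cut = pd.cut(temps, bins, labels=labels)

# include_lowest=True closes the first interval on the left as well
binned = pd.cut(temps, bins, labels=labels, include_lowest=True)
print(binned.value_counts(sort=False))
```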

In [ ]:
try:
    from scipy.constants import F2C
except ImportError:  # no scipy installed, or a scipy version without F2C
    def F2C(f):
        return (np.array(f) - 32)/1.8

In [ ]:
lmap = lambda func, x : list(map(func, x))

Celsius bins

In [ ]:
bins = [dta.tempc.min()] + lmap(F2C, (32, 55, 80)) + [dta.tempc.max()]

In [ ]:
labels = ["freezing", "cold", "warm", "hot"]

dta["tempc_bin"] = pd.cut(dta.tempc, bins, labels=labels)

In [ ]:

In [ ]:
ax = dta.groupby("temp_bin").size().plot(kind="bar")
  • What's wrong with this graph?
  • Axis labels and tick labels to start
  • Some things we can do through the plot method
  • Some things we have to do with matplotlib

Make the xticks labels bigger and rotate them

In [ ]:
ax = dta.groupby("temp_bin").size().plot(kind="bar", rot=0, fontsize=16, figsize=(8, 5))

ax.set_ylabel("Number of Days")
ax.set_title("Temperatures from 1995 - 2014");

Horizontal bar chart

In [ ]:
dta.groupby(["season", "temp_bin"]).size().plot(kind="barh", figsize=(6, 8))

Stacked bar chart

The pandas crosstab function creates a cross-tabulation of two or more factors.

In [ ]:
ct = pd.crosstab(dta.temp_bin, dta.season)

In [ ]:
ax = ct.plot(kind="bar", stacked=True, figsize=(12, 8), grid=False, 
  • Matplotlib provides a variety of ColorMaps
  • The Paired colormap is a good qualitative colormap
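
One way to pull a handful of colors out of a colormap is to call it with evenly spaced values in [0, 1]. This is a sketch; taking four colors is an assumption made for the example.

```python
import numpy as np
import matplotlib.pyplot as plt

cmap = plt.get_cmap("Paired")

# Calling a colormap with floats in [0, 1] returns one RGBA row per value
colors = cmap(np.linspace(0, 1, 4))
print(colors.shape)
print(colors)
```

The resulting array can be passed anywhere matplotlib or pandas accepts a sequence of colors.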

In [ ]:
colors =, 1, 4))

In [ ]:
ax = pd.crosstab(dta.temp_bin, dta.season).plot(kind="bar", stacked=True, 
                                                figsize=(12, 8), grid=False, 
                                                legend=True, color=colors, rot=0,

# adjust the fontsize of the legend
legend = ax.get_legend()
for text in legend.get_texts():


  • Frequently it is useful to look at the distribution of data before you analyze it.
  • Histograms display relative frequencies of data values
  • The y-axis is always some measure of frequency, raw counts of values or scaled proportions

In [ ]:

In [ ]:
ax = dta.temp.plot(kind="hist", bins=50)

It's even a good exercise here! Let's turn the -99 values into NaNs.

In [ ]:
dta.loc[dta.temp == -99, ["temp", "tempc"]] = np.nan

Incidentally, pandas will handle nulls in plotting

In [ ]:
ax = dta.temp.plot(kind="hist", bins=50, grid=False, figsize=(10, 6))

# plot a vertical line that spans the axis
line = ax.axvline(dta.temp.mean(), color='r', lw=3, label="Mean")

# specifically add a legend
handles, labels = ax.get_legend_handles_labels()
ax.legend([handles[0]], [labels[0]], fontsize=16)

In [ ]:
  • Optimal number of bins
  • Scott's rule

In [ ]:
def scotts_rule(x):
    x = x.dropna()
    std = x.std()
    return 3.5 * std / (len(x)**(1./3))

def width_to_nbins(x, h):
    x = x.dropna()
    # range of the data divided by the bin width
    return int(round((x.max() - x.min())/h))

In [ ]:
h = scotts_rule(dta.temp)
nbins = width_to_nbins(dta.temp, h)

In [ ]:
ax = dta.temp.plot(kind="hist", bins=nbins, grid=False, figsize=(10, 6))

# plot a vertical line that spans the axis
line = ax.axvline(dta.temp.mean(), color='r', lw=3, label="Mean")

Density Plots

  • Kernel Density Estimators are a kind of smoothed histogram (more on this later)
  • Pandas provides a hook to KDE plots using statsmodels, if installed, or scipy

In [ ]:
ax = dta.temp.plot(kind='kde', grid=False, figsize=(10, 6))
ax.set_xlim(0, 100)

We can compare the KDE to the normed histogram

In [ ]:
ax = dta.temp.plot(kind='kde', grid=False, figsize=(10, 6), color='r', lw=3)
ax = dta.temp.plot(kind="hist", bins=nbins, grid=False, figsize=(10, 6), ax=ax, density=True, alpha=.7)
ax.set_xlim(0, 100)


Create KDE estimates for the temperature in each season on a single plot. Label the plotted lines.

Box plots

  • Boxplots (aka "box and whisker" plots) are a different way to display distributions of data
  • The box contains the quartiles of the data
  • The "whiskers" are typically the lower and upper 5 percent values
    • In matplotlib, by default, they extend to the most extreme data points within 1.5 times the interquartile range of the lower/upper quartiles
  • The horizontal line is the median
  • Boxplots have their own method on DataFrames
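
The numbers behind a boxplot can be inspected with matplotlib.cbook.boxplot_stats. The sketch below compares the default whiskers with percentile-based ones on made-up normal data.

```python
import numpy as np
from matplotlib.cbook import boxplot_stats

rng = np.random.RandomState(0)
data = rng.normal(size=500)   # made-up sample

default = boxplot_stats(data)[0]              # whis=1.5: the 1.5 * IQR rule
pct = boxplot_stats(data, whis=[5, 95])[0]    # whiskers bounded by the 5th/95th percentiles

iqr = default["q3"] - default["q1"]
print("median:", default["med"])
print("default whiskers:", default["whislo"], default["whishi"])
print("percentile whiskers:", pct["whislo"], pct["whishi"])
```

In both cases the whisker ends sit on actual data points, so they lie at or inside the computed limits rather than exactly on them.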

In [ ]:
ax = dta.boxplot(column="temp", by="season", grid=False, figsize=(8, 10), fontsize=16,
                 whis=[5, 95])

ax.set_title(ax.get_title(), fontsize=20)

fig = ax.figure

# Change the size of the figure title

# whitespace between axes and fig boundary
  • We can add some more information by overlaying the original data on the boxplot

In [ ]:
def jitter(x, n, noise=.05):
    return x + np.random.normal(0, noise, size=n)

In [ ]:
ax = dta.boxplot(column="temp", by="season", grid=False, figsize=(8, 10), fontsize=16,
                 whis=[5, 95])

ax.set_title(ax.get_title(), fontsize=20)

fig = ax.figure


# whitespace between axes and fig boundary

for i, season in enumerate(ax.get_xticklabels()):
    y = dta.loc[dta.season == season.get_text()].temp
    x = jitter(i + 1, len(y))
    # there's a lot of data so turn the alpha way down (or sub-sample)
    ax.plot(x, y, 'ro', alpha=.05)


  • Let's load the baseball dataset to look at scatterplots

In [ ]:
baseball = pd.read_csv("../data/baseball.csv")

In [ ]:

In [ ]:
ax = baseball.plot(kind="scatter", x="ab", y="h", grid=False, figsize=(8, 6), s=8**2,
ax.set_xlim(0, 700)
ax.set_ylim(0, 200)
  • We can uncover more information by changing the size of the points

In [ ]:
ax = baseball.plot(kind="scatter", x="ab", y="h", grid=False, figsize=(8, 6),*10,

ax.set_xlim(0, 700)
ax.set_ylim(0, 200)
  • Or by adding color using the c keyword

In [ ]:
ax = baseball.plot(kind="scatter", x="ab", y="h", grid=False, figsize=(8, 6), c="DarkGreen", s=50)
ax = baseball.plot(kind="scatter", x="ab", y="rbi", grid=False, figsize=(8, 6), c="Blue", s=50, 

ax.set_xlim(0, 700)
ax.set_ylim(0, 200);
  • c can also be a color intensity
  • in this case we can specify a colormap through the cmap keyword

In [ ]:
ax = baseball.plot(kind="scatter", x="ab", y="h", grid=False, figsize=(8, 6),*10,
                   s=40, cmap="hot")
ax.set_xlim(0, 700)
ax.set_ylim(0, 200);
  • Notice that there is a colorbar automatically
  • We can adjust it just like all other things matplotlib
  • It's actually implemented as a separate axes subplot in the figure

In [ ]:
ax = baseball.plot(kind="scatter", x="ab", y="h", grid=False, figsize=(8, 6),*10,
                   s=40, cmap="hot")
ax.set_xlim(0, 700)
ax.set_ylim(0, 200)

fig = ax.figure
# colorbars are actually a separate subplot in your figure
colorbar = fig.axes[1]
  • Use pd.plotting.scatter_matrix to view a large number of variables simultaneously

In [ ]:
ax = pd.plotting.scatter_matrix(baseball.loc[:,'r':'sb'], figsize=(14, 10), diagonal='hist')

In [ ]:
ax = pd.plotting.scatter_matrix(baseball.loc[:,'r':'sb'], figsize=(14, 10), diagonal='kde')

Plotting Time-Series

  • Let's convert the temperature data into a TimeSeries for convenience

In [ ]:
idx = pd.to_datetime(dta.year*10000 + dta.month*100 +, format='%Y%m%d')

In [ ]:

In [ ]:
y = dta.set_index(idx).temp

In [ ]:

In [ ]:
  • Pandas plotting is DatetimeIndex aware
  • Outside of the browser, you can pan and zoom and the tick labels adjust dynamically

In [ ]:
#ax = y.plot(figsize=(12, 8))
ax = y.rolling(window=60, min_periods=1, center=True).mean().plot(figsize=(12, 8),
                                                                  label="Rolling 2-month mean")

means = y.groupby(lambda x : x.year).mean()
means.index = pd.DatetimeIndex(pd.to_datetime(means.index * 10000 + 1231, format="%Y%m%d"))
ax = means.plot(ax=ax, label="Yearly Average")

legend = ax.legend()
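
For readers without the weather file, here is a self-contained sketch of the same idea on a made-up seasonal series: a centered rolling mean and yearly averages computed from a DatetimeIndex. All data and names are invented for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

idx = pd.date_range("2013-01-01", "2014-12-31", freq="D")
rng = np.random.RandomState(0)
day_of_year = np.asarray(idx.dayofyear)

# made-up seasonal signal plus noise, standing in for daily temperatures
values = 55 + 20 * np.sin(2 * np.pi * (day_of_year - 110) / 365.25)
y = pd.Series(values + rng.normal(0, 5, size=len(idx)), index=idx)

rolling = y.rolling(window=60, min_periods=1, center=True).mean()
yearly = y.groupby(y.index.year).mean()

ax = rolling.plot(figsize=(12, 8), label="Rolling 2-month mean")
legend = ax.legend()
plt.close(ax.figure)
```

With min_periods=1 the rolling mean is defined even at the edges of the series, at the cost of averaging over fewer points there.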


  • GridSpec provides a high-level abstraction for placing subplots on a grid
  • plt.subplot2grid is a helper function for creating grids of subplots
  • To create a 2x2 figure with a reference to the first axes we could do
ax = plt.subplot(2, 2, 1)
  • Equivalently with subplot2grid

In [ ]:
ax = plt.subplot2grid((2, 2), (0, 0))
  • We can have easier, more fine-grained control with subplot2grid for creating multiple subplots that span columns, for example

In [ ]:
with plt.rc_context(rc={"xtick.labelsize": 0,
                        "ytick.labelsize": 0,
                        "axes.facecolor": "lightgray",
                        "figure.figsize": (8, 8)}):
    ax1 = plt.subplot2grid((3,3), (0,0), colspan=3)
    ax2 = plt.subplot2grid((3,3), (1,0), colspan=2)
    ax3 = plt.subplot2grid((3,3), (1, 2), rowspan=2)
    ax4 = plt.subplot2grid((3,3), (2, 0))
    ax5 = plt.subplot2grid((3,3), (2, 1))
    ax1.figure.suptitle("subplot2grid", fontsize=20)
  • You can use GridSpec class directly to create the same plot

In [ ]:
from matplotlib.gridspec import GridSpec

with plt.rc_context(rc={"xtick.labelsize": 0,
                        "ytick.labelsize": 0,
                        "axes.facecolor": "lightgray"}):

    fig = plt.figure(figsize=(8, 8))

    gs = GridSpec(3, 3)
    ax1 = plt.subplot(gs[0, :])
    # identical to ax1 = plt.subplot(gs.new_subplotspec((0,0), colspan=3))
    ax2 = plt.subplot(gs[1,:-1])
    ax3 = plt.subplot(gs[1:, -1])
    ax4 = plt.subplot(gs[-1,0])
    ax5 = plt.subplot(gs[-1,-2])

    fig.suptitle("GridSpec", fontsize=20)


  • Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
  • It is built on top of matplotlib
  • Provides support for numpy and pandas
  • Coupled with statistical routines from scipy and statsmodels

Trellis plots

"At the heart of quantitative reasoning is a single question: Compared to what? Small multiple designs, multivariate and data bountiful, answer directly by visually enforcing comparisons of changes, of the differences among objects, of the scope of alternatives. For a wide range of problems in data presentation, small multiples are the best design solution."

-Edward Tufte

  • For medium-dimensional data,
  • Multiple instances of the same plot on different subsets of your dataset.
  • Quickly extract a large amount of information about complex data.

In [ ]:
import seaborn as sns
tips = sns.load_dataset("tips")

In [ ]:


  • Used to visualize the distribution of a variable or the relationship between multiple variables within subsets of your data
  • Can be drawn with up to three dimensions: row, col, and hue.
  • These should be discrete variables
  • Say we wanted to examine differences between lunch and dinner in the tips dataset

In [ ]:
with mpl.rc_context(rc={"legend.fontsize": "18", "axes.titlesize": "18"}):
    g = sns.FacetGrid(tips, col="sex", hue="smoker", size=7), "total_bill", "tip", alpha=.7, s=80)
g.axes[0, 0].title.set_fontsize(20)
g.axes[0, 0].xaxis.get_label().set_fontsize(20)
g.axes[0, 1].title.set_fontsize(20)
g.axes[0, 1].xaxis.get_label().set_fontsize(20)

Violin plot

The violin plot is a combination of a boxplot and a kernel density estimator

In [ ]:
ax = dta.boxplot(column="temp", by="season", grid=False, figsize=(8, 10), fontsize=16,
                 whis=[5, 95])

In [ ]:
X = dta[["temp", "season"]].dropna()

In [ ]:
ax = sns.violinplot(x="season", y="temp", data=X)

We can plot the points inside the violins and re-order the seasons

In [ ]:
ax = sns.violinplot(x="season", y="temp", data=X, inner="point",
                    order=['Winter', 'Spring', 'Summer', 'Fall'])

Distribution plots

Seaborn allows you to look at bivariate distributions. Here, we can compare the distribution of the temperatures in 1995 and 2014.

In [ ]:
temp95 = dta.query("year == 1995")[["temp", "month", "day"]]
temp14 = dta.query("year == 2014")[["temp", "month", "day"]]

In [ ]:
temps = temp95.merge(temp14, on=["month", "day"], how="inner", suffixes=("_95", "_14"))

In [ ]:
g = sns.jointplot(x=temps.temp_95, y=temps.temp_14, kind="kde", height=7, space=0)

We can also look at a hexbin plot of the same data with the marginal distributions as histograms.

In [ ]:
g = sns.jointplot(x=temps.temp_95, y=temps.temp_14, kind="hex", color="#4CB391", 
                  joint_kws={"bins": 200})


The mpld3 project brings together Matplotlib and D3js, the popular JavaScript library for creating interactive data visualizations for the web. The result is a simple API for exporting your matplotlib graphics to HTML code which can be used within the browser, within standard web pages, blogs, or tools such as the IPython notebook.

Let's look at a regular scatter plot

In [ ]:
fig, ax = plt.subplots(figsize=(6, 6))
x, y = np.random.normal(size=(2, 200))
color, size = np.random.random((2, 200))

ax.scatter(x, y, c=color, s=500 * size, alpha=0.5, cmap="rainbow")
ax.grid(color='lightgray', alpha=0.7)

Unfortunately, this is just a static image. Let's use mpld3 to change that. Using the display command, you get a fully interactive visualization of the figure.

In [ ]:
import mpld3

# Render the scatter plot above as an interactive D3 figure
mpld3.display(fig)

Notice the toolbar on hover. You can use that to interact with the figure.

You can use mpld3 for every plot that you render in the notebook by executing

mpld3.enable_notebook()

mpld3 plugins

Much like event handling via callback functions in regular matplotlib (not covered in this notebook), you can define plugins for mpld3 to specify additional interactivity.

A number of plugins are built-in, and it is also possible to define new, custom plugins for nearly limitless interactive behaviors. For example, here is the built-in Linked Brushing plugin that allows exploration of multi-dimensional datasets:

In [ ]:
from mpld3 import plugins

fig, ax = plt.subplots(6, 6, figsize=(6, 6))
fig.subplots_adjust(hspace=0.1, wspace=0.1)
ax = ax[::-1]

X = baseball.loc[:, 'r':'rbi']
for i in range(6):
    for j in range(6):
        ax[i, j].xaxis.set_major_formatter(plt.NullFormatter())
        ax[i, j].yaxis.set_major_formatter(plt.NullFormatter())
        points = ax[i, j].scatter(X.values[:, j], X.values[:, i])
        if i == 0:
            ax[i, j].set_xlabel(X.columns[j])
    ax[i, 0].set_ylabel(X.columns[i])
plugins.connect(fig, plugins.LinkedBrush(points))

Putting it all together

  • Let's recreate this graphic inspired by Tufte's

In [ ]:
from IPython.display import Image, HTML
# Image("./tufte.svg")
  • This is a plot of NYC's weather in 2014 versus historical averages
    • Daily historical highs and lows
    • Historical confidence intervals around averages
    • The daily temperatures for 2014
    • Markers for new highs and lows
    • Annotations for points
    • Text for the graphic
    • Custom tick labels
  • Load the data from yesterday

In [ ]:
import os

to_colors = lambda x : x/255.
blue3 = list(map(to_colors, (24, 116, 205)))  # 1874CD
wheat2 = list(map(to_colors, (238, 216, 174)))  # EED8AE
wheat3 = list(map(to_colors, (205, 186, 150)))  # CDBA96
wheat4 = list(map(to_colors, (139, 126, 102)))  # 8B7E66
firebrick3 = list(map(to_colors, (205, 38, 38)))  # CD2626
gray30 = list(map(to_colors, (77, 77, 77)))  # 4D4D4D
  • You probably don't want to work with the (month, day) tuples in their present form for plotting
  • Instead, you can use the index below for the x axis

In [ ]:
idx = range(366)
  • First, make the figure and plot the high and low bars (Hint: see ax.vlines)
  • The color is wheat3
  • Second, plot the confidence intervals around the historical means
  • The color is wheat4
  • Plot the highs and lows of the present year in present_highs and present_lows
  • You will need the x axes of these two objects to line up with your current x axis (Hint: you may find np.where to be helpful)
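
The first step above (the high and low bars) might look like the sketch below. The arrays hist_lows and hist_highs are synthetic stand-ins invented for illustration, not the real weather data, and the hex string is the wheat3 color defined earlier.

```python
import numpy as np
import matplotlib.pyplot as plt

idx = np.arange(366)

# Synthetic stand-ins for the historical record lows and highs (hypothetical data)
seasonal = 55 - 25 * np.cos(2 * np.pi * idx / 366)
hist_lows = seasonal - 15
hist_highs = seasonal + 15

fig, ax = plt.subplots(figsize=(12, 6))
# One vertical line per day, running from the record low to the record high
bars = ax.vlines(idx, hist_lows, hist_highs, colors="#CDBA96", linewidth=1.5)
```

The confidence intervals can be drawn the same way with a second ax.vlines call in wheat4, using narrower bounds.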

In [ ]:
np.where([True, False, False, True, False])[0]
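
As a sketch of how np.where helps here, assume two small arrays of daily highs. The names and values below are made up for illustration. Comparing them gives a boolean mask, and np.where turns that mask into x positions for the record markers.

```python
import numpy as np

# Hypothetical stand-ins for the real series: five days of data
historical_highs = np.array([60, 62, 65, 63, 61])
present_highs = np.array([61, 58, 64, 66, 59])

# Boolean mask of the days on which the present year beat the record
is_record = present_highs > historical_highs

# np.where returns a tuple of index arrays; [0] holds the x positions
record_idx = np.where(is_record)[0]
record_vals = present_highs[record_idx]
```

record_idx and record_vals can then be passed to ax.scatter or ax.plot to draw the markers.
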
  • Annotate one of the 2014 historical lows and one of the 2014 historical highs with the appropriate text (Hint: see ax.annotate)
  • You may want to look at some of the examples below for annotate and arrows
  • Now, add text to the figure. (Hint: see ax.text)
  • Finally, let's add the correct tick labels
  • You can use Unicode to add the degree symbol ($^\circ$)

In [ ]:
yticks = range(-10, 101, 10)
ylabels = [str(i) + u"\u00b0" for i in yticks]
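
One way to apply these labels is sketched below on an empty axes. The y limits are an assumption chosen to match the tick range.

```python
import matplotlib.pyplot as plt

yticks = range(-10, 101, 10)
ylabels = [str(i) + "\u00b0" for i in yticks]  # e.g. "70°"

fig, ax = plt.subplots()
ax.set_ylim(-10, 100)
# Fix the tick positions first, then attach one label per position
ax.set_yticks(list(yticks))
ax.set_yticklabels(ylabels)
```

Setting the positions before the labels keeps the two lists aligned one to one.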

Other frequently used plotting tricks

XKCD and Annotation

In [ ]:
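with plt.xkcd():
    # Based on "Stove Ownership" from XKCD by Randall Munroe

    fig = plt.figure()
    ax = fig.add_axes((0.1, 0.2, 0.8, 0.7))
    ax.set_ylim([-30, 10])

    data = np.ones(100)
    data[70:] -= np.arange(1, 31)

    # The annotate, plot, and xlabel calls are restored from the matplotlib
    # gallery version of this example
    plt.annotate('THE DAY I REALIZED\nI COULD COOK BACON\nWHENEVER I WANTED',
        xy=(70, 1), arrowprops=dict(arrowstyle='->'), xytext=(15, -10), zorder=-1)

    plt.plot(data)

    plt.xlabel('time')
    plt.ylabel('my overall health')
    fig.text(0.5, 0.05,
             '"Stove Ownership" from xkcd by Randall Munroe', ha='center')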

In [ ]:
with plt.xkcd():
    # Based on "The Data So Far" from XKCD by Randall Munroe

    fig = plt.figure()
    ax = fig.add_axes((0.1, 0.2, 0.8, 0.7))
    # Two bars of width 0.25, one at each tick
    ax.bar([-0.125, 1.0-0.125], [0, 100], 0.25)
    ax.set_xticks([0, 1])
    ax.set_xlim([-0.5, 1.5])
    ax.set_ylim([0, 110])

    fig.text(0.5, 0.01,
             '"The Data So Far" from xkcd by Randall Munroe',
             ha='center')

Tick Tricks

In [ ]:
from matplotlib.ticker import MaxNLocator

In [ ]:
x = np.arange(20)
y = np.random.randn(20)

In [ ]:
fig, ax = plt.subplots()
ax.plot(x, y)
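
The MaxNLocator imported above is not yet applied to the plot. A minimal sketch of one way to use it follows. The choice of five bins and integer-only ticks is an assumption for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator

x = np.arange(20)
y = np.random.randn(20)

fig, ax = plt.subplots()
ax.plot(x, y)

# Ask for at most 5 intervals on the x axis, with ticks at integers only
ax.xaxis.set_major_locator(MaxNLocator(nbins=5, integer=True))
xticks = ax.get_xticks()
```

A locator only chooses tick positions. The labels still come from the axis formatter.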


Sharing Axes

In [ ]:
x = np.arange(20)

y1 = np.random.randn(20)
y2 = np.random.randn(20)

In [ ]:
fig, axes = plt.subplots(2, 1, sharex=True)

axes[0].plot(x, y1)
axes[1].plot(x, y2)

Twinning Axes

In [ ]:
t = np.arange(0.01, 10.0, 0.01)
s1 = np.exp(t)
s2 = np.sin(2*np.pi*t)

In [ ]:
fig, ax1 = plt.subplots()

ax1.plot(t, s1, 'b-')
ax1.set_xlabel('time (s)')

# Make the y-axis label and tick labels match the line color.
ax1.set_ylabel('exp', color='b', fontsize=18)
for tl in ax1.get_yticklabels():
    tl.set_color('b')

# Create a second y-axis that shares the same x-axis
ax2 = ax1.twinx()

ax2.plot(t, s2, 'r.')
ax2.set_ylabel('sin', color='r', fontsize=18)

for tl in ax2.get_yticklabels():
    tl.set_color('r')

Image Plots

In [ ]:
fig, ax = plt.subplots()
ax.imshow(np.random.uniform(0, 1, size=(50, 50)), cmap="RdYlGn")


  • By default, matplotlib uses its own $TeX$ engine for text and math layout
  • You can instead call out to an external $TeX$ installation by setting the text.usetex option

In [ ]:
fig, ax = plt.subplots()
ax.set_ylabel("$\\beta^2$", fontsize=20, rotation=0, labelpad=20)

In [ ]:
with mpl.rc_context(rc={"text.usetex": True}):
    fig, ax = plt.subplots(figsize=(5, 5))
    ax.set_ylabel("$\\beta^2$", fontsize=20, rotation=0, labelpad=20)

Contour Plots

In [ ]:
from matplotlib.pylab import bivariate_normal

delta = 0.025
x = np.arange(-3.0, 3.0, delta)
y = np.arange(-2.0, 2.0, delta)
X, Y = np.meshgrid(x, y)

Z1 = bivariate_normal(X, Y, 1.0, 1.0, 0.0, 0.0)
Z2 = bivariate_normal(X, Y, 1.5, 0.5, 1, 1)
# difference of Gaussians
Z = 10.0 * (Z2 - Z1)
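
If the bivariate_normal import above fails on your Matplotlib version (the helper was deprecated and is absent from recent releases), a NumPy-only stand-in is easy to write. The sketch below assumes the same argument order as the old helper (X, Y, sigmax, sigmay, mux, muy, sigmaxy). It is a replacement written for this notebook, not Matplotlib's own code.

```python
import numpy as np

def bivariate_normal(X, Y, sigmax=1.0, sigmay=1.0, mux=0.0, muy=0.0, sigmaxy=0.0):
    """Bivariate Gaussian density evaluated on the grids X and Y."""
    Xmu = X - mux
    Ymu = Y - muy
    rho = sigmaxy / (sigmax * sigmay)  # correlation coefficient
    z = (Xmu**2 / sigmax**2 + Ymu**2 / sigmay**2
         - 2 * rho * Xmu * Ymu / (sigmax * sigmay))
    denom = 2 * np.pi * sigmax * sigmay * np.sqrt(1 - rho**2)
    return np.exp(-z / (2 * (1 - rho**2))) / denom

delta = 0.025
X, Y = np.meshgrid(np.arange(-3.0, 3.0, delta), np.arange(-2.0, 2.0, delta))

Z1 = bivariate_normal(X, Y, 1.0, 1.0, 0.0, 0.0)
Z2 = bivariate_normal(X, Y, 1.5, 0.5, 1, 1)
# difference of Gaussians
Z = 10.0 * (Z2 - Z1)
```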

In [ ]:
with mpl.rc_context(rc={'xtick.direction': 'out',
                        'ytick.direction': 'out'}):
    # Create a simple contour plot with labels using default colors. The
    # inline argument to clabel controls whether the labels are drawn
    # over the line segments of the contour, removing the lines beneath
    # the label.
    fig, ax = plt.subplots(figsize=(8, 8))
    contours = ax.contour(X, Y, Z)
    ax.clabel(contours, inline=True, fontsize=10)


In [ ]:
fig, ax = plt.subplots()
ax.arrow(0, 0, 0.5, 0.5, head_width=0.05, head_length=0.1, fc='k', ec='k')

ax.arrow(0.25, 0, 0.5, 0.5, head_width=0, head_length=0, fc='k', ec='k')

Filling in plots

In [ ]:
x = np.arange(0.0, 2, 0.01)
y1 = np.sin(2*np.pi*x)
y2 = 1.2*np.sin(4*np.pi*x)

In [ ]:
fig, axes = plt.subplots(3, 1, sharex=True, figsize=(6, 10))

axes[0].fill_between(x, 0, y1)
axes[0].set_ylabel('between y1 and 0')

axes[1].fill_between(x, y1, 1)
axes[1].set_ylabel('between y1 and 1')