We illustrate three approaches to graphing data with Python's Matplotlib package:
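Approach #1: apply the plot() method to a dataframe
Approach #2: use the plot(x,y) function
Approach #3: create fig, ax objects, then apply plot methods to them
The last one is the least intuitive but also the most useful. We work up to it gradually. This book chapter covers the same material with more words and fewer pictures.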
This IPython notebook was created by Dave Backus for the NYU Stern course Data Bootcamp.
Reminders:
We access packages with import statements.
We apply the method justdoit to the object x with x.justdoit.
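As a quick illustration of both reminders, here is a minimal sketch. The small dataframe toy is made up for this example and is not part of the course data.

```python
import pandas as pd     # import statement: gives us access to the pandas package

# hypothetical toy data, for illustration only
toy = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# apply the method head to the object toy with toy.head
first_two = toy.head(2)
print(first_two)
```

The same pattern, object first and method after the dot, shows up in every plotting command below.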
Look around, what do you see? Check out the menubar at the top: File, Edit, etc. Also the toolbar below it. Click on Help -> User Interface Tour for a tour of the landscape.
The cells below come in two forms. Those labeled Code (see the menu in the toolbar) are Python code. Those labeled Markdown are text.
Markdown is a user-friendly language for text formatting. You can see how it works by clicking on any of the Markdown cells and looking at the raw text that underlies it. In addition to just plain text, we'll use three things a lot:
The raw text **bold** displays as bold.
The raw text *italics* displays as italics.
The raw text # Heading gives us a large heading. Two hashes give a smaller heading, three hashes smaller still, and so on up to six hashes. In this cell there's a two-hash heading at the top.
Exercise. Click on this cell, then click the + in the toolbar to create a new empty cell below.
Exercise. Click on the new cell below. Choose Markdown in the menubar at the top. Add your name and a description of what we're doing. Execute the cell by either (i) clicking on the "run cell" button in the toolbar or (ii) clicking on "Cell" in the menubar and choosing Run.
In [1]:
import sys # system module
import pandas as pd # data package
import matplotlib as mpl # graphics package
import matplotlib.pyplot as plt # graphics module
import datetime as dt # date and time module
# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Matplotlib version: ', mpl.__version__)
print('Today: ', dt.date.today())
Comment. When you run the code cell above, its output appears below it.
Exercise. Enter pd.read_csv? in the empty cell below. Run the cell (Cell at the top, or shift-enter). Do you see the documentation? This is the Jupyter version of help in Spyder's IPython console.
In [2]:
# This is an IPython command. It puts plots here in the notebook, rather than a separate window.
%matplotlib inline
In [3]:
# US GDP and consumption
gdp = [13271.1, 13773.5, 14234.2, 14613.8, 14873.7, 14830.4, 14418.7,
14783.8, 15020.6, 15369.2, 15710.3]
pce = [8867.6, 9208.2, 9531.8, 9821.7, 10041.6, 10007.2, 9847.0, 10036.3,
10263.5, 10449.7, 10699.7]
year = list(range(2003,2014)) # use range for years 2003-2013
# create dataframe from dictionary
us = pd.DataFrame({'gdp': gdp, 'pce': pce}, index=year)
print(us.head(3))
In [4]:
# GDP per capita (World Bank data, 2013, thousands of USD)
code = ['USA', 'FRA', 'JPN', 'CHN', 'IND', 'BRA', 'MEX']
country = ['United States', 'France', 'Japan', 'China', 'India',
'Brazil', 'Mexico']
gdppc = [53.1, 36.9, 36.3, 11.9, 5.4, 15.0, 16.5]
wbdf = pd.DataFrame({'gdppc': gdppc, 'country': country}, index=code)
wbdf
Out[4]:
Comment. In the previous cell, we used the print() function to produce output. Here we just put the name of the dataframe. The latter displays the dataframe -- and formats it nicely -- if it's the last statement in the cell.
In [5]:
# Fama-French
import pandas_datareader.data as web # formerly pandas.io.data
# read annual data from website and rename variables
ff = web.DataReader('F-F_Research_Data_factors', 'famafrench')[1]
ff.columns = ['xsm', 'smb', 'hml', 'rf']
ff['rm'] = ff['xsm'] + ff['rf']
ff = ff[['rm', 'rf']] # extract rm and rf (return on market, riskfree rate, percent)
ff.head(5)
Out[5]:
Comment. The Pandas DataReader has been spun off into a separate package, pandas-datareader, which is why we import it under that name. Older versions of Pandas included it as pandas.io.data and printed a warning about the move.
Exercise. What kind of object is wbdf? What are its column and row labels?
Exercise. What is ff.index? What does that tell us?
In [ ]:
In [6]:
# This is an IPython command: it puts plots here in the notebook, rather than a separate window.
%matplotlib inline
Approach #1: apply the plot() method to a dataframe
This is a good, simple approach, and we use it a lot. It comes with some useful defaults, starting with the x and y variables. By default, the x variable is the dataframe's index and the y variables are all the columns of the dataframe. All of these things can be changed, but this is the starting point.
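To make the defaults concrete, here is a minimal sketch that plots once with the defaults and once with x and y chosen explicitly. The dataframe us_small is a shortened copy of the US data (first three years only), introduced just for this illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# first three years of the US data, to keep the sketch short
us_small = pd.DataFrame({'gdp': [13271.1, 13773.5, 14234.2],
                         'pce': [8867.6, 9208.2, 9531.8]},
                        index=[2003, 2004, 2005])

ax_default = us_small.plot()                 # x = index, y = every column
ax_custom = us_small.plot(x='gdp', y='pce')  # x and y chosen explicitly

# count the lines drawn in each plot
print(len(ax_default.get_lines()), len(ax_custom.get_lines()))
```

The first call draws one line per column; the second draws a single line of pce against gdp.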
Let's do some examples, see how they work.
In [7]:
# try this with US GDP
us.plot()
Out[7]:
In [8]:
# do GDP alone
us['gdp'].plot()
Out[8]:
In [9]:
# bar chart
us.plot(kind='bar')
Out[9]:
Exercise. Show that we get the same output from us.plot.bar().
In [10]:
us.plot
Out[10]:
In [11]:
# scatter plot
# we need to be explicit about the x and y variables: x = 'gdp', y = 'pce'
us.plot.scatter('gdp', 'pce')
Out[11]:
Exercise. Enter us.plot(kind='bar') and us.plot.bar() in separate cells. Show that they produce the same bar chart.
Exercise. Add each of these arguments, one at a time, to us.plot():
kind='area'
subplots=True
sharey=True
figsize=(3,6)
ylim=(0,16000)
What do they do?
Exercise. Type us.plot? in a new cell. Run the cell (shift-enter or click on the run cell icon). What options do you see for the kind= argument? Which ones have we tried? What are the other ones?
In [ ]:
In [12]:
# now try a few things with the Fama-French data
ff.plot()
Out[12]:
In [13]:
ff.plot()
Out[13]:
Exercise. What does each of the arguments in the code below do?
In [14]:
ff.plot(kind='hist', bins=20, subplots=True)
Out[14]:
In [15]:
# "smoothed" histogram
ff.plot(kind='kde', subplots=True, sharex=True) # smoothed histogram ("kernel density estimate")
Out[15]:
Exercise. Let's see if we can dress up the histogram a little. Try adding, one at a time, the arguments title='Fama-French returns', grid=True, and legend=False. What does the documentation say about them? What do they do?
Exercise. What do the histograms tell us about the two returns? How do they differ?
Exercise. Use the World Bank dataframe wbdf to create a bar chart of GDP per capita, the variable 'gdppc'. Bonus points: Create a horizontal bar chart. Which do you prefer?
In [ ]:
Approach #2: use the plot(x,y) function
In [16]:
# import pyplot module of Matplotlib
import matplotlib.pyplot as plt
In [17]:
plt.plot(us.index, us['gdp'])
Out[17]:
Exercise. What is the x variable here? The y variable?
In [18]:
# we can do two lines together
plt.plot(us.index, us['gdp'])
plt.plot(us.index, us['pce'])
Out[18]:
In [19]:
# or a bar chart
plt.bar(us.index, us['gdp'], align='center')
Out[19]:
Exercise. Experiment with
plt.bar(us.index, us['gdp'],
align='center',
alpha=0.65,
color='red',
edgecolor='green')
Play with the arguments one by one to see what they do. Or use plt.bar? to look them up. Add comments to remind yourself. Bonus points: Can you make this graph even uglier?
In [ ]:
In [20]:
# we can also add things to plots
plt.plot(us.index, us['gdp'])
plt.plot(us.index, us['pce'])
plt.title('US GDP', fontsize=14, loc='left') # add title
plt.ylabel('Billions of 2009 USD') # y axis label
plt.xlim(2002.5, 2013.5) # expand x axis limits
plt.tick_params(labelcolor='red') # change tick labels to red
plt.legend(['GDP', 'Consumption']) # more descriptive variable names
Out[20]:
Comment. All of these statements must be in the same cell for this to work.
Comment. This is overkill -- it looks horrible -- but it makes the point that we control everything in the plot. We recommend you do very little of this until you're more comfortable with the basics.
Exercise. Add a plt.ylim() statement to make the y axis start at zero, as it did in the bar charts. Bonus points: Change the color to magenta and the linewidth to 2. Hint: Use plt.ylim? and plt.plot? to get the documentation.
In [ ]:
Exercise. Create a line plot for the Fama-French dataframe ff that includes both returns. Bonus points: Add a title and label the y axis.
In [ ]:
Approach #3: create fig, ax objects, then apply plot methods to them
This approach is the most foreign to beginners, but now that we're used to it we like it a lot. The idea is to generate an object (two objects, in fact) and apply methods to them to produce the various elements of a graph: the data, their axes, their labels, and so on.
In [21]:
# create fig and ax objects
fig, ax = plt.subplots()
Exercise. What do we have here? What type are fig and ax?
In [ ]:
We say fig is a figure object and ax is an axis object. This means:
fig is a blank canvas for creating a figure.
ax is everything in it: axes, labels, lines or bars, and so on.
Exercise. Use tab completion to see what methods are available for fig and ax. What do you see? Do you feel like screaming?
In [ ]:
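If you want to check your answers about fig and ax, the following minimal sketch prints their types and confirms that the axis object belongs to the figure. It uses only standard Matplotlib classes and nothing specific to our data.

```python
import matplotlib.pyplot as plt
from matplotlib.figure import Figure
from matplotlib.axes import Axes

fig, ax = plt.subplots()

print(type(fig))        # the figure: our blank canvas
print(type(ax))         # the axis object: what we draw on

# confirm the classes and the link between the two objects
print(isinstance(fig, Figure), isinstance(ax, Axes))
print(ax.figure is fig)
```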
In [22]:
# let's try that again, this time with content
# create objects
fig, ax = plt.subplots()
# add things by applying methods to ax
us.plot(ax=ax)
Out[22]:
Comment. Both of these statements must be in the same cell.
In [23]:
# Fama-French example
fig, ax = plt.subplots()
ff.plot(ax=ax,
kind='line', # line plot
color=['blue', 'magenta'], # line color
title='Fama-French market and riskfree returns')
Out[23]:
Exercise. Let's see if we can teach ourselves the rest:
Add the argument kind='bar' to convert this into a bar chart.
Add the argument alpha=0.65 to the bar chart. What does it do?
Exercise (somewhat challenging). Use the same approach to reproduce our earlier histograms of the Fama-French series.
In [ ]:
Take a deep breath. We've covered a lot of ground, let's take stock.
We looked at three ways to use Matplotlib:
Approach #1: apply the plot() method to a dataframe
Approach #2: use the plot(x,y) function
Approach #3: create fig, ax objects, then apply plot methods to them
Same result, different syntax. This is what each of them looks like applied to US GDP:
us['gdp'].plot() # Approach #1
plt.plot(us.index, us['gdp']) # Approach #2
fig, ax = plt.subplots() # Approach #3
ax.plot(us.index, us['gdp'])
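One way to convince yourself that the three approaches really give the same result is to pull the plotted data back out of each line and compare. This is only a sketch: the short series gdp stands in for us['gdp'], and each approach gets its own figure so the lines don't pile up on one plot.

```python
import pandas as pd
import matplotlib.pyplot as plt

# short stand-in for us['gdp'], first three years only
gdp = pd.Series([13271.1, 13773.5, 14234.2], index=[2003, 2004, 2005])

plt.figure()                           # fresh figure for Approach #1
ax1 = gdp.plot()                       # Approach #1: plot method

plt.figure()                           # fresh figure for Approach #2
line2, = plt.plot(gdp.index, gdp)      # Approach #2: plot(x, y) function

fig3, ax3 = plt.subplots()             # Approach #3: fig, ax objects
line3, = ax3.plot(gdp.index, gdp)

# compare the y data stored in each line
y1 = list(ax1.get_lines()[0].get_ydata())
same = y1 == list(line2.get_ydata()) == list(line3.get_ydata())
print('All three lines carry the same data:', same)
```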
In [28]:
fig, ax = plt.subplots()
us.plot(ax=ax)
ax.set_title('US GDP and Consumption', fontsize=14, loc='left')
ax.set_ylabel('Billions of 2013 USD')
ax.legend(['Real GDP', 'Consumption'], loc=0) # more descriptive variable names
ax.set_xlim(2002.5, 2013.5) # expand x axis limits
ax.tick_params(labelcolor='red') # change tick labels to red
ax.set_ylim(0)
Out[28]:
(Your results may differ, but we really enjoyed that.)
Exercise. Use the set_xlabel() method to add an x-axis label. What would you choose? Or would you prefer to leave it empty?
Exercise. Enter ax.legend? to access the documentation for the legend method. What options appeal to you?
Exercise. Change the line width to 2 and the line colors to blue and magenta. Hint: Use us.plot? to get the documentation.
Exercise (challenging). Use the set_ylim() method to start the y axis at zero. Hint: Use ax.set_ylim? to get the documentation.
Exercise. Create a line plot for the Fama-French dataframe ff that includes both returns. Bonus points: Add a title with the set_title method.
In [ ]:
In [29]:
# this creates an array ax that holds two axis objects, one per row
fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True)
print('Object ax has length', len(ax))
In [31]:
# now add some content
fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True, sharey=True)
us['gdp'].plot(ax=ax[0], color='green') # first plot
us['pce'].plot(ax=ax[1], color='red') # second plot
Out[31]:
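With more than one subplot, ax is a NumPy array of axis objects rather than a single one, which is why we index it with ax[0] and ax[1]. The sketch below, with made-up panel titles, shows the array's shape, a loop over its elements, and the two-index version we'd need with several rows and columns.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True)
print(type(ax), ax.shape)        # an array holding two axis objects

# loop over the axis objects instead of indexing them one by one
for number, axis in enumerate(ax):
    axis.set_title('Panel ' + str(number), loc='left')

# with two rows and two columns we need two indexes
fig2, ax2 = plt.subplots(nrows=2, ncols=2)
print(ax2.shape)
ax2[0, 1].set_title('Top right')
```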
In [ ]:
In [32]:
# data input
import pandas as pd
url = 'http://dx.doi.org/10.1787/888932937035'
pisa = pd.read_excel(url,
skiprows=18, # skip the first 18 rows
skipfooter=7, # skip the last 7
usecols=[0,1,9,13], # select columns (called parse_cols in old versions of Pandas)
index_col=0, # set index = first column
header=[0,1] # set variable names
)
pisa = pisa.dropna() # drop blank lines
pisa.columns = ['Math', 'Reading', 'Science'] # simplify variable names
In [33]:
# bar chart of math scores
fig, ax = plt.subplots()
pisa['Math'].plot(kind='barh', ax=ax)
Out[33]:
Comment. Yikes! That's horrible! What can we do about it?
Let's make the figure taller. The figsize argument has the form (width, height). The default is (6, 4). We want a tall figure, so we need to increase the height setting.
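Here is a minimal sketch of how figsize works, using an empty figure. The sizes are in inches, and the second size (4, 8) is an arbitrary choice for illustration.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 13))    # (width, height) in inches
width, height = fig.get_size_inches()
print('width:', width, 'height:', height)

# the size can also be changed after the figure exists
fig.set_size_inches(4, 8)
print(fig.get_size_inches())
```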
In [ ]:
In [35]:
# make the plot taller
fig, ax = plt.subplots(figsize=(4, 13)) # note figsize
pisa['Math'].plot(kind='barh', ax=ax)
ax.set_title('PISA Math Score', loc='left')
Out[35]:
Comment. What if we wanted to make the US bar red? This is ridiculously complicated, but we used our Google fu and found a solution. Remember: The solution to many problems is Google fu + patience.
In [52]:
fig, ax = plt.subplots()
pisa['Math'].plot(kind='barh', ax=ax, figsize=(4,13))
ax.set_title('PISA Math Score', loc='left')
ax.get_children()[36].set_color('r')
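The position 36 in the code above depends on the order in which Matplotlib creates the pieces of the plot, so it may stop working if the data or the library version changes. A possibly more robust sketch looks up the bar by its label. The series scores is made-up stand-in data, and the label 'United States' is an assumption: check pisa.index for the label actually used in the PISA data.

```python
import pandas as pd
import matplotlib.pyplot as plt

# hypothetical stand-in for pisa['Math']: made-up labels and scores
scores = pd.Series([500, 480, 520],
                   index=['Country A', 'United States', 'Country B'])

fig, ax = plt.subplots()
scores.plot(kind='barh', ax=ax)

# find the bar by its label rather than hard-coding its position
position = scores.index.get_loc('United States')
ax.patches[position].set_color('r')      # ax.patches holds the bars, in order
```

This relies on the bars being the only patches in ax and appearing in the same order as the rows of the series, which holds for a simple bar chart like this one.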
Exercise. Create the same graph for the Reading score.
In [ ]:
In [53]:
# load packages (redundancy is ok)
import pandas as pd # data management tools
from pandas_datareader import wb # World Bank api (formerly pandas.io.wb)
import matplotlib.pyplot as plt # plotting tools
# variable list (GDP per capita, GDP, life expectancy)
var = ['NY.GDP.PCAP.PP.KD', 'NY.GDP.MKTP.PP.KD', 'SP.DYN.LE00.IN']
# country list (ISO codes)
iso = ['USA', 'FRA', 'JPN', 'CHN', 'IND', 'BRA', 'MEX']
year = 2013
# get data from World Bank
df = wb.download(indicator=var, country=iso, start=year, end=year)
# massage data
df = df.reset_index(level='year', drop=True)
df.columns = ['gdppc', 'gdp', 'life'] # rename variables
df['pop'] = df['gdp']/df['gdppc'] # population
df['gdp'] = df['gdp']/10**12 # convert to trillions
df['gdppc'] = df['gdppc']/10**3 # convert to thousands
df['order'] = [5, 3, 1, 4, 2, 6, 0] # reorder countries
df = df.sort_values(by='order', ascending=False)
df
Out[53]:
In [54]:
# GDP bar chart
fig, ax = plt.subplots()
df['gdp'].plot(ax=ax, kind='barh', alpha=0.5)
ax.set_title('GDP', loc='left', fontsize=14)
ax.set_xlabel('Trillions of US Dollars')
ax.set_ylabel('')
Out[54]:
In [62]:
# ditto for GDP per capita (per person)
fig, ax = plt.subplots()
df['gdppc'].plot(ax=ax, kind='barh', color='m', alpha=0.50) # 'm' == 'magenta'
ax.set_title('GDP Per Capita', loc='left', fontsize=14)
ax.set_xlabel('Thousands of US Dollars')
ax.set_ylabel('')
Out[62]:
And just because it's fun, here's an example of Tufte-like axes from Matplotlib examples. If you want to do this yourself, copy the last six lines and prepare yourself to sink some time into it.
In [65]:
# ditto for GDP per capita (per person)
fig, ax = plt.subplots()
df['gdppc'].plot(ax=ax, kind='barh', color='b', alpha=0.5)
ax.set_title('GDP Per Capita', loc='left', fontsize=14)
ax.set_xlabel('Thousands of US Dollars')
ax.set_ylabel('')
# Tufte-like axes
ax.spines['left'].set_position(('outward', 7))
ax.spines['bottom'].set_position(('outward', 7))
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.yaxis.set_ticks_position('left')
ax.xaxis.set_ticks_position('bottom')
Exercise (challenging). Make the ticks point out.
In [97]:
# scatterplot of life expectancy vs gdp per capita
fig, ax = plt.subplots()
ax.scatter(df['gdppc'], df['life'], # x,y variables
s=df['pop']/10**6, # size of bubbles
alpha=0.5)
ax.set_title('Life expectancy vs. GDP per capita', loc='left', fontsize=14)
ax.set_xlabel('GDP Per Capita')
ax.set_ylabel('Life Expectancy')
ax.text(58, 66, 'Bubble size represents population', horizontalalignment='right')
Out[97]:
Exercise. Make the bubbles a little larger.
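A hint on how bubble sizes work: the s argument of scatter sets the area of each marker, measured in points squared, so multiplying s by a constant scales every bubble's area by that constant. The sketch below uses made-up numbers in place of df, and the multiplier scale is an arbitrary choice.

```python
import matplotlib.pyplot as plt

# made-up values standing in for df['gdppc'], df['life'] and df['pop']
gdppc = [50.0, 10.0, 5.0]
life = [80.0, 75.0, 65.0]
pop = [300e6, 1300e6, 1200e6]

scale = 2                                  # assumed multiplier: twice the area
sizes = [scale * p / 10**6 for p in pop]

fig, ax = plt.subplots()
bubbles = ax.scatter(gdppc, life, s=sizes, alpha=0.5)
```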
In [98]:
# We'll look at this chart under a variety of styles.
# Let's make a function so we don't have to repeat the
# code to create it.
def gdp_bar():
    fig, ax = plt.subplots()
    df['gdp'].plot(ax=ax, kind='barh', alpha=0.5)
    ax.set_title('Real GDP', loc='left', fontsize=14)
    ax.set_xlabel('Trillions of US Dollars')
    ax.set_ylabel('')

gdp_bar()
Exercise. Create the same graph with this statement at the top:
plt.style.use('fivethirtyeight')
(Once we execute this statement, it stays executed.)
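If you'd rather try a style without changing the settings for the rest of the notebook, one option is Matplotlib's style context manager. In this sketch the toy data is made up, and we track a single setting, the background color of the axes, to show that it goes back to its old value afterwards.

```python
import matplotlib as mpl
import matplotlib.pyplot as plt

before = mpl.rcParams['axes.facecolor']

# the style applies only inside the with block
with plt.style.context('ggplot'):
    inside = mpl.rcParams['axes.facecolor']
    fig, ax = plt.subplots()
    ax.plot([1, 2, 3], [1, 4, 9])        # toy data, for illustration only

after = mpl.rcParams['axes.facecolor']
print(before, inside, after)
```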
Comment. We can get a list of available styles from plt.style.available.
In [88]:
plt.style.available
Out[88]:
Comment. Ignore the seaborn styles, that's a package we don't have yet.
Exercise. Try another one by editing the code below.
In [99]:
plt.style.use('fivethirtyeight')
gdp_bar()
In [100]:
plt.style.use('ggplot')
gdp_bar()
Comment. For aficionados, the always tasteful xkcd style.
In [101]:
plt.xkcd()
gdp_bar()
Comment. We reset the style with these two lines:
In [95]:
mpl.rcParams.update(mpl.rcParamsDefault)
%matplotlib inline
Consider the data from Randal Olson's blog post:
import pandas as pd
data = {'Food': ['French Fries', 'Potato Chips', 'Bacon', 'Pizza', 'Chili Dog'],
'Calories per 100g': [607, 542, 533, 296, 260]}
cals = pd.DataFrame(data)
The dataframe cals contains the calories in 100 grams of several different foods.
Exercise. We'll create and modify visualizations of this data:
Set 'Food' as the index of cals.
Create a bar chart with cals using figure and axis objects.
Add the argument alpha=0.5. What does it do?
In [ ]: