We illustrate three approaches to graphing data with Python's Matplotlib package:
Approach #1: Apply a plot() method to a dataframe
Approach #2: Use the plot(x,y) function
Approach #3: Create fig, ax objects and apply plot methods to them
The last one is the least intuitive but also the most useful. We work up to it gradually. This book chapter covers the same material with more words and fewer pictures.
This IPython notebook was created by Dave Backus for the NYU Stern course Data Bootcamp.
Look around, what do you see? Check out the menubar at the top: File, Edit, etc. Also the toolbar below it. Click on Help -> User Interface Tour for a tour of the landscape.
The cells below come in two forms. Those labeled Code (see the menu in the toolbar) are Python code. Those labeled Markdown are text.
Markdown is a user-friendly language for text formatting. You can see how it works by clicking on any of the Markdown cells and looking at the raw text that underlies it. In addition to just plain text, we'll use three things a lot:
The raw text **bold** displays as bold.
The raw text *italics* displays as italics.
The raw text # Heading gives us a large heading. Two hashes give a smaller heading, three hashes smaller still, up to six hashes. In this cell there's a two-hash heading at the top.
Exercise. Click on the blank cell below. Note that it's labeled Markdown in the menubar. Add a heading and some text. Execute the cell by either (i) clicking on the "run cell" button in the toolbar or (ii) clicking on "Cell" in the menubar and choosing Run.
In [73]:
import sys # system module
import pandas as pd # data package
import matplotlib as mpl # graphics package
import datetime as dt # date and time module
# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Matplotlib version: ', mpl.__version__)
print('Today: ', dt.date.today())
Comment. When you run the code cell above, its output appears below it.
In [74]:
# This is an IPython command. It puts plots here in the notebook, rather than a separate window.
%matplotlib inline
In [75]:
# US GDP and consumption
gdp = [13271.1, 13773.5, 14234.2, 14613.8, 14873.7, 14830.4, 14418.7,
14783.8, 15020.6, 15369.2, 15710.3]
pce = [8867.6, 9208.2, 9531.8, 9821.7, 10041.6, 10007.2, 9847.0, 10036.3,
10263.5, 10449.7, 10699.7]
year = list(range(2003,2014)) # use range for years 2003-2013
# create dataframe from dictionary
us = pd.DataFrame({'gdp': gdp, 'pce': pce}, index=year)
print(us.head(3))
In [76]:
# GDP per capita (World Bank data, 2013, thousands of USD)
code = ['USA', 'FRA', 'JPN', 'CHN', 'IND', 'BRA', 'MEX']
country = ['United States', 'France', 'Japan', 'China', 'India',
'Brazil', 'Mexico']
gdppc = [53.1, 36.9, 36.3, 11.9, 5.4, 15.0, 16.5]
wbdf = pd.DataFrame({'gdppc': gdppc, 'country': country}, index=code)
wbdf
Out[76]:
Comment. In the previous cell, we used the print() function to produce output. Here we just put the name of the dataframe. The latter displays the dataframe -- and formats it nicely -- if it's the last line in the cell.
In [77]:
# Fama-French
import pandas_datareader.data as web
# read annual data from website and rename variables
ff = web.DataReader('F-F_Research_Data_factors', 'famafrench')[1]
ff.columns = ['xsm', 'smb', 'hml', 'rf']
ff['rm'] = ff['xsm'] + ff['rf']
ff = ff[['rm', 'rf']] # extract rm and rf (return on market, riskfree rate, percent)
ff.head(5)
Out[77]:
Comment. Older versions of Pandas printed a warning (in pink) that the DataReader would be spun off into a separate package. That package is pandas_datareader, which we import above.
Exercise. What kind of object is wbdf? How would you access its column and row labels? What are they?
In [ ]:
In [78]:
# This is an IPython command: it puts plots here in the notebook, rather than a separate window.
%matplotlib inline
Approach #1: Apply plot() method to dataframe
Good simple approach, we use it a lot. It comes with some useful defaults for the x and y variables. By default, the x variable is the dataframe's index and the y variables are all the columns of the dataframe. All of these things can be changed, but this is the starting point.
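Comment. Here is a minimal sketch of those defaults and one way to override them. The dataframe df and its numbers are made up for illustration; the x and y arguments of plot() pick out the columns to use.

```python
import pandas as pd

# small made-up dataframe (illustrative numbers, not real data)
df = pd.DataFrame({'year': [2011, 2012, 2013],
                   'a': [1.0, 2.0, 3.0],
                   'b': [2.0, 1.5, 1.0]})

# default: x is the index (0, 1, 2) and y is every numeric column,
# which here includes 'year' itself
ax_default = df.plot()

# override: x is the 'year' column, y is column 'a' only
ax_custom = df.plot(x='year', y='a')

# count the lines drawn in each plot
print(len(ax_default.get_lines()), len(ax_custom.get_lines()))
```

In the US dataframe above we avoided the issue by making the year the index, so the default x variable is already what we want.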
Let's do some examples, see how they work.
In [79]:
# try this with US GDP
us.plot()
Out[79]:
In [80]:
# do GDP alone
us['gdp'].plot()
Out[80]:
In [81]:
# bar chart
us.plot(kind='bar')
Out[81]:
Exercise. Show that we get the same output from us.plot.bar().
In [82]:
us.plot
Out[82]:
In [83]:
# scatter plot
# we need to be explicit about the x and y variables: x = 'gdp', y = 'pce'
us.plot.scatter('gdp', 'pce')
Out[83]:
Comment. We can get help in IPython by adding a question mark after a function or method.
Exercise. How can you get help for us.plot()? Try it and see.
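Comment. The question mark is IPython syntax, so it only works in a notebook or an IPython session. A rough equivalent in plain Python, sketched here with plt.plot as the example function, is the built-in help() function or the inspect module.

```python
import inspect
import matplotlib.pyplot as plt

# inspect.getdoc returns the documentation as a string we can search or print
doc = inspect.getdoc(plt.plot)

# show just the first few lines rather than the whole docstring
for line in doc.splitlines()[:5]:
    print(line)

# help(plt.plot) would print the complete documentation
```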
Exercise. Add each of these arguments/parameters to us.plot() in the code cell below and describe what they do:
kind='area'
subplots=True
sharey=True
figsize=(3,6)
xlim=(0,16000)
In [ ]:
In [84]:
# now try a few things with the Fama-French data
ff.plot()
Out[84]:
Exercise. We can dress up the plots using the arguments of the plot() function. Try adding, one at a time, the arguments title='Fama-French returns', grid=True, and legend=False. What does the documentation say about them? What do they do?
In [85]:
ff.plot()
Out[85]:
Exercise. What does each of the arguments do in the code below?
In [86]:
ff.plot(kind='hist', bins=20, subplots=True)
Out[86]:
Exercise. What do you see here? How do the returns differ?
In [87]:
ff.plot(kind='kde', subplots=True, sharex=True) # smoothed histogram ("kernel density estimate")
Out[87]:
Exercise. Use the World Bank dataframe wbdf to create a bar chart of GDP per capita. Bonus points: Create a horizontal bar chart.
In [ ]:
Approach #2: Use the plot(x,y) function
In [88]:
# import pyplot module of Matplotlib
import matplotlib.pyplot as plt
In [89]:
plt.plot(us.index, us['gdp'])
Out[89]:
Exercise. What is the x variable here? The y variable?
In [90]:
# we can do two lines together
plt.plot(us.index, us['gdp'])
plt.plot(us.index, us['pce'])
Out[90]:
In [91]:
# or a bar chart
plt.bar(us.index, us['gdp'], align='center')
Out[91]:
Exercise. Experiment with
plt.bar(us.index, us['gdp'],
align='center',
alpha=0.65,
color='red',
edgecolor='green')
Play with the arguments one by one to see what they do. Or use plt.bar? to look them up. Add comments to remind yourself. Bonus points: Can you make this graph even uglier?
In [ ]:
In [92]:
# we can also add things to plots
plt.plot(us.index, us['gdp'])
plt.plot(us.index, us['pce'])
plt.title('US GDP', fontsize=14, loc='left') # add title
plt.ylabel('Billions of 2009 USD') # y axis label
plt.xlim(2002.5, 2013.5) # shrink x axis limits
plt.tick_params(labelcolor='red') # change tick labels to red
plt.legend(['GDP', 'Consumption']) # more descriptive variable names
Out[92]:
Comment. All of these statements must be in the same cell for this to work.
Comment. This is overkill -- it looks horrible -- but it makes the point that we control everything in the plot. We recommend you do very little of this until you're more comfortable with the basics.
Exercise. Add a plt.ylim() statement to make the y axis start at zero, as it did in the bar charts. Bonus points: Change the color to magenta and the linewidth to 2. Hint: Use plt.ylim? and plt.plot? to get the documentation.
In [ ]:
Exercise. Create a line plot for the Fama-French dataframe ff that includes both returns. Bonus points: Add a title and label the y axis.
In [ ]:
Approach #3: Create fig, ax objects and apply methods to them
This approach is the most foreign to beginners, but now that we're used to it we like it a lot. We either use it on its own, or adapt its functionality to the dataframe plot methods we saw in Approach #1. The idea is to generate an object (two objects, in fact) and apply methods to them to produce the various elements of a graph: the data, their axes, their labels, and so on.
In [93]:
# create fig and ax objects
fig, ax = plt.subplots()
Exercise. What do we have here? What type are fig and ax?
In [ ]:
We say fig is a figure object and ax is an axis object. This means:
fig is a blank canvas for creating a figure.
ax is everything in it: axes, labels, lines or bars, and so on.
Exercise. Use tab completion to see what methods are available for fig and ax. What do you see? Do you feel like screaming?
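Comment. A short sketch of how the two objects relate to each other. Matplotlib's own names for them are Figure and Axes, which we import here only to check the types.

```python
import matplotlib.pyplot as plt
from matplotlib.axes import Axes
from matplotlib.figure import Figure

fig, ax = plt.subplots()

# the two objects have different types
print(type(fig))
print(type(ax))
print(isinstance(fig, Figure), isinstance(ax, Axes))

# the figure keeps a list of the axis objects drawn on it
print(ax in fig.axes)

# and each axis object knows which figure it belongs to
print(ax.get_figure() is fig)
```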
In [ ]:
In [94]:
# let's try that again, this time with content
# create objects
fig, ax = plt.subplots()
# add things by applying methods to ax
ax.plot(us.index, us['gdp'], linewidth=2, color='magenta')
ax.set_title('US GDP', fontsize=14, loc='left')
ax.set_ylabel('Billions of USD')
ax.set_xticks([2004, 2008, 2012])
ax.grid(True)
Comment. All of these statements must be in the same cell.
In [95]:
# a figure method: save figure as a pdf
fig.savefig('us_gdp.pdf')
Exercise. Use figure and axis objects to create a bar chart of variable rm in the ff dataframe.
In [ ]:
In [96]:
# with two rows, ax is an array holding two axis objects
fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True)
print('Object ax has length', len(ax))
In [97]:
# now add some content
fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True)
ax[0].plot(us.index, us['gdp'], color='green') # first plot
ax[1].plot(us.index, us['pce'], color='red') # second plot
Out[97]:
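Comment. With more than one row and more than one column, ax becomes a two-dimensional array and we need two indexes to pick a plot. A minimal sketch with made-up numbers:

```python
import matplotlib.pyplot as plt

# one column: ax is a one-dimensional array, indexed as ax[0], ax[1]
fig1, ax1 = plt.subplots(nrows=2, ncols=1, sharex=True)
print(ax1.shape)

# two rows and two columns: ax is a 2 x 2 array, indexed as ax[row, col]
fig2, ax2 = plt.subplots(nrows=2, ncols=2, sharex=True)
print(ax2.shape)

# draw in the top-right plot only (made-up numbers)
ax2[0, 1].plot([1, 2, 3], [1, 4, 9], color='green')

# ax2.flat loops over all four plots, whatever the layout
for a in ax2.flat:
    a.grid(True)
```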
In Approach #1, we applied plot() and related methods to a dataframe. We also used arguments to fix up the graph, but that got complicated pretty quickly.
Here we combine Approaches 1 and 3. If we check the documentation of df.plot() we see that it "returns" an axis object. We can assign it to a variable and then apply methods to make the figure more compelling.
In [98]:
# grab the axis
ax = us.plot()
In [99]:
# grab it and apply methods
ax = us.plot()
ax.set_title('US GDP and Consumption', fontsize=14, loc='left')
ax.set_ylabel('Billions of 2009 USD')
ax.legend(loc='center right')
Out[99]:
Comment. If we want the figure object for this plot, we apply a method to the axis object ax:
fig = ax.get_figure()
That's not something we'll do often, but it completes the connection between Approaches #1 and #3.
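Comment. The connection also runs the other way: we can create fig and ax first and tell the dataframe plot method where to draw with its ax argument. A minimal sketch, using a small made-up dataframe df rather than the data above:

```python
import pandas as pd
import matplotlib.pyplot as plt

# made-up numbers for illustration
df = pd.DataFrame({'gdp': [1.0, 2.0, 3.0], 'pce': [0.7, 1.4, 2.1]},
                  index=[2011, 2012, 2013])

# Approach #3: create the figure and two side-by-side axis objects
fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

# Approach #1: dataframe plot methods, pointed at the axis objects we created
left = df['gdp'].plot(ax=ax[0], title='gdp')
right = df['pce'].plot(ax=ax[1], kind='bar', title='pce')

# the plot method hands back the same axis object we passed in
print(left is ax[0], right is ax[1])
print(left.get_figure() is fig)
```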
In [ ]:
Take a deep breath. We've covered a lot of ground, let's take stock.
We looked at three ways to use Matplotlib:
Approach #1: apply plot methods to a dataframe
Approach #2: use the plot(x,y) function
Approach #3: create fig, ax objects, apply plot methods to them
Same result, different syntax. This is what each of them looks like applied to US GDP:
us['gdp'].plot() # Approach #1
plt.plot(us.index, us['gdp']) # Approach #2
fig, ax = plt.subplots() # Approach #3
ax.plot(us.index, us['gdp'])
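Comment. One way to convince ourselves that the results are the same is to pull the plotted values back out of each graph and compare them. This is a sketch, not something we would do routinely. It uses the last four years of the GDP series, and the plt.figure() calls are added so that each approach draws on its own figure.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# last four years of the GDP series used above
us = pd.DataFrame({'gdp': [14783.8, 15020.6, 15369.2, 15710.3]},
                  index=[2010, 2011, 2012, 2013])

plt.figure()                       # fresh figure for Approach #1
ax1 = us['gdp'].plot()

plt.figure()                       # fresh figure for Approach #2
plt.plot(us.index, us['gdp'])
ax2 = plt.gca()                    # grab the axis object pyplot created for us

fig, ax3 = plt.subplots()          # Approach #3
ax3.plot(us.index, us['gdp'])

# extract the y values of the single line in each plot and compare
ydata = [a.get_lines()[0].get_ydata() for a in (ax1, ax2, ax3)]
same = all(np.array_equal(ydata[0], y) for y in ydata[1:])
print('Same y values in all three plots:', same)
```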
In [100]:
# data input
import pandas as pd
url = 'http://dx.doi.org/10.1787/888932937035'
pisa = pd.read_excel(url,
skiprows=18, # skip the first 18 rows
skipfooter=7, # skip the last 7
usecols=[0,1,9,13], # select columns (was parse_cols in older versions of Pandas)
index_col=0, # set index = first column
header=[0,1] # set variable names
)
pisa = pisa.dropna() # drop blank lines
pisa.columns = ['Math', 'Reading', 'Science'] # simplify variable names
In [101]:
# simple plot
pisa['Math'].plot(kind='barh')
Out[101]:
Comment. Yikes! That's horrible! What can we do about it?
Let's make the figure taller. The figsize argument has the form (width, height), measured in inches. The default is roughly (6, 4). We want a tall figure, so we need to increase the height setting.
In [102]:
# make the plot taller
ax = pisa['Math'].plot(kind='barh', figsize=(4,13)) # note figsize
ax.set_title('PISA Math Score', loc='left')
Out[102]:
Comment. What if we wanted to make the US bar red? This is ridiculously complicated, but we used our Google fu and found a solution. Remember: The solution to many problems is Google fu + patience.
In [103]:
ax = pisa['Math'].plot(kind='barh', figsize=(4,13))
ax.set_title('PISA Math Score', loc='left')
# the US bar is child number 38 of the plot; this hard-coded position breaks if the rows change
ax.get_children()[38].set_color('r')
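Comment. A less fragile alternative, sketched here with made-up scores rather than the PISA data, is to build a list of colors from the index labels and pass it to the axis method barh(). The names scores and colors are ours, and we switch to Approach #3 to keep the example simple.

```python
import pandas as pd
import matplotlib.pyplot as plt

# made-up scores for illustration (not the PISA numbers)
scores = pd.Series([500, 480, 510],
                   index=['Canada', 'United States', 'Japan'])

# one color per bar: red for the US, blue for everyone else
colors = ['red' if name == 'United States' else 'blue'
          for name in scores.index]

fig, ax = plt.subplots(figsize=(4, 3))
bars = ax.barh(list(scores.index), scores.values, color=colors)
ax.set_title('Made-up scores', loc='left')
```

Because the color follows the label rather than the position, the right bar stays red if countries are added, dropped, or reordered.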
Exercise. Create the same graph for the Reading score.
In [ ]:
In [104]:
# load packages (redundancy is ok)
import pandas as pd # data management tools
from pandas_datareader import data, wb # World Bank api
import matplotlib.pyplot as plt # plotting tools
# variable list (GDP, GDP per capita, life expectancy)
var = ['NY.GDP.PCAP.PP.KD', 'NY.GDP.MKTP.PP.KD', 'SP.DYN.LE00.IN']
# country list (ISO codes)
iso = ['USA', 'FRA', 'JPN', 'CHN', 'IND', 'BRA', 'MEX']
year = 2013
# get data from World Bank
df = wb.download(indicator=var, country=iso, start=year, end=year)
# massage data
df = df.reset_index(level='year', drop=True)
df.columns = ['gdppc', 'gdp', 'life'] # rename variables
df['pop'] = df['gdp']/df['gdppc'] # population
df['gdp'] = df['gdp']/10**12 # convert to trillions
df['gdppc'] = df['gdppc']/10**3 # convert to thousands
df['order'] = [5, 3, 1, 4, 2, 6, 0] # reorder countries
df = df.sort_values(by='order', ascending=False)
df
Out[104]:
In [105]:
# GDP bar chart
ax = df['gdp'].plot(kind='barh', alpha=0.5)
ax.set_title('GDP', loc='left', fontsize=14)
ax.set_xlabel('Trillions of US Dollars')
ax.set_ylabel('')
Out[105]:
In [106]:
# ditto for GDP per capita (per person)
ax = df['gdppc'].plot(kind='barh', color='m', alpha=0.5)
ax.set_title('GDP Per Capita', loc='left', fontsize=14)
ax.set_xlabel('Thousands of US Dollars')
ax.set_ylabel('')
Out[106]:
And just because it's fun, here's an example of Tufte-like axes from Matplotlib examples. If you want to do this yourself, copy the last six lines and prepare yourself to sink some time into it.
In [107]:
# ditto for GDP per capita (per person)
ax = df['gdppc'].plot(kind='barh', color='b', alpha=0.5)
ax.set_title('GDP Per Capita', loc='left', fontsize=14)
ax.set_xlabel('Thousands of US Dollars')
ax.set_ylabel('')
# Tufte-like axes
ax.spines['left'].set_position(('outward', 10))
ax.spines['bottom'].set_position(('outward', 10))
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.yaxis.set_ticks_position('left')
ax.xaxis.set_ticks_position('bottom')
In [108]:
# scatterplot of life expectancy vs gdp per capita
plt.scatter(df['gdppc'], df['life'], # x,y variables
s=df['pop']/10**6, # size of bubbles
alpha=0.5)
plt.title('Life expectancy vs. GDP per capita', loc='left', fontsize=14)
plt.xlabel('GDP Per Capita')
plt.ylabel('Life Expectancy')
plt.text(58, 66, 'Bubble size represents population', horizontalalignment='right')
Out[108]:
In [109]:
ax = df['gdp'].plot(kind='barh', alpha=0.5)
ax.set_title('GDP', loc='left', fontsize=14)
ax.set_xlabel('Trillions of US Dollars')
ax.set_ylabel('')
Out[109]:
Exercise. Create the same graph with this statement at the top:
plt.style.use('fivethirtyeight')
(Once we execute this statement, it stays executed.)
Comment. We can get a list of the available styles from plt.style.available.
In [110]:
plt.style.available
Out[110]:
Exercise. Try another one by editing the code below.
In [111]:
plt.style.use('fivethirtyeight')
ax = df['gdp'].plot(kind='barh', alpha=0.5)
ax.set_title('GDP', loc='left', fontsize=14)
ax.set_xlabel('Trillions of US Dollars')
ax.set_ylabel('')
Out[111]:
Comment. For aficionados, the always tasteful xkcd style.
In [112]:
plt.xkcd()
ax = df['gdp'].plot(kind='barh', alpha=0.5)
ax.set_title('GDP', loc='left', fontsize=14)
ax.set_xlabel('Trillions of US Dollars')
ax.set_ylabel('')
Out[112]:
Comment. We reset the style with these two lines:
In [113]:
mpl.rcParams.update(mpl.rcParamsDefault)
%matplotlib inline
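Comment. If we want a style for one graph only, an alternative to resetting afterwards is the plt.style.context() context manager, which applies the style inside a with block and restores the previous settings on the way out. A minimal sketch with made-up data and the ggplot style:

```python
import matplotlib as mpl
import matplotlib.pyplot as plt

# remember one setting so we can check that it is restored
before = mpl.rcParams['axes.facecolor']

# the style applies only to code inside the with block
with plt.style.context('ggplot'):
    fig, ax = plt.subplots()
    ax.barh(['A', 'B', 'C'], [1.0, 2.0, 3.0], alpha=0.5)   # made-up data
    ax.set_title('Made-up data, ggplot style', loc='left')

# outside the block, the setting is back to what it was
after = mpl.rcParams['axes.facecolor']
print('Style restored:', before == after)
```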
In [ ]: