We illustrate three approaches to graphing data with Python's Matplotlib package:
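Approach #1: apply the plot() method to a dataframe
Approach #2: use the plot(x,y) function from matplotlib.pyplot
Approach #3: create fig, ax objects and apply methods to them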
The last one is the least intuitive but also the most useful. We work up to it gradually. This book chapter covers the same material with more words and fewer pictures.
Reminders. We load packages with import statements. We apply the method justdoit to the object x with x.justdoit.
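As a quick refresher on the object.method pattern, here is a minimal sketch that uses a plain Python string in place of x. The string and the variable names are made up for illustration.

```python
# Hypothetical illustration: a string object and two of its methods
x = "matplotlib"

upper = x.upper()      # apply the method upper to the object x
count = x.count("t")   # methods can also take arguments

print(upper, count)
```

Dataframes work the same way: us.plot() applies the plot method to the object us.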
Look around, what do you see? Check out the menubar at the top: File, Edit, etc. Also the toolbar below it. Click on Help -> User Interface Tour for a tour of the landscape.
The cells below come in two forms. Those labeled Code (see the menu in the toolbar) are Python code. Those labeled Markdown are text.
Markdown is a simplified version of HTML (the language used to construct basic websites). The best way to learn its rules is to click on a cell that contains text and try to imitate what you see. (More about it later and in the book.)
In [ ]:
# make plots show up in notebook
%matplotlib inline
import pandas as pd # data package
import matplotlib.pyplot as plt # pyplot module
Comment. When you run the code cell above, its output appears below it.
Exercise. Enter pd.read_csv? in the empty cell below. Run the cell (Cell at the top, or shift-enter). Do you see the documentation? This is the Jupyter version of help in Spyder's IPython console.
Last time we saw how to read data with the pandas functions read_csv() and read_excel(). A third way is through APIs.
APIs are "application programming interfaces". A dataset with an API allows access through some method other than a spreadsheet.
The API is the set of rules for accessing the data. People have written easy-to-use code to access the APIs.
The Pandas developers have created what they call a set of Remote Data Access tools and have put them into a package called pandas_datareader.
The St Louis Fed has put together a large collection of time series data that they refer to as FRED: Federal Reserve Economic Data. They started with the US, but now include data for many countries.
The Pandas docs describe how to access FRED. Here's an example that reads in annual data for US real GDP and real consumption.
codes -- not to be confused with "code" -- consists of FRED variable codes. Go to FRED, use the search box to find the series you want, and look for the variable code at the end of the url in your browser.
start and end contain dates in (year, month, day) format using datetime.datetime.
In [ ]:
from pandas_datareader import data
import datetime as dt # package to handle dates
codes = ['GDPCA', 'PCECCA'] # real GDP, real consumption (from FRED website)
start = dt.datetime(2003, 1, 1) # start date
end = dt.datetime(2013, 1, 1) # end date
us = data.DataReader(codes, 'fred', start, end)
us.columns = ['gdp', 'pce'] #us.set_index (we don't need it now)
print(us.head(3))
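If FRED is unreachable, the plotting cells further down can still be tried on a stand-in dataframe with the same layout: an annual date index and columns named gdp and pce. This is only a sketch. The name us_demo is hypothetical and the numbers are placeholders invented for illustration, not FRED data.

```python
import pandas as pd

# Annual dates covering the same 2003-2013 window as the FRED example
dates = pd.date_range(start="2003-01-01", end="2013-01-01", freq="YS")

# Placeholder values (NOT real FRED data): simple linear trends
us_demo = pd.DataFrame(
    {
        "gdp": [13000.0 + 250.0 * i for i in range(len(dates))],
        "pce": [9000.0 + 180.0 * i for i in range(len(dates))],
    },
    index=dates,
)

print(us_demo.head(3))
```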
The World Bank's databank covers economic and social statistics for most countries in the world.
Variables include GDP, population, education, and infrastructure. Here's an example:
In [ ]:
from pandas_datareader import wb
iso_codes = ['BRA', 'CHN', 'FRA', 'IND', 'JPN', 'MEX', 'USA']
var = ['NY.GDP.PCAP.PP.KD']
year = 2013
wbdf = wb.download(indicator=var, country=iso_codes, start=year, end=year)
print(wbdf)
In [ ]:
# Change the index for iso codes
wbdf.index = iso_codes
# Add country variable
country = ['Brazil', 'China', 'France', 'India', 'Japan', 'Mexico', 'United States']
wbdf['country'] = country
# Rename the variables
wbdf.columns = ['gdppc', 'country']
# set the display precision in terms of decimal places
pd.set_option('display.precision', 2)
wbdf['gdppc'] = wbdf['gdppc']/1000
print(wbdf)
Comment. In the previous cell, we used the print() function to produce output. Here we just put the name of the dataframe. The latter displays the dataframe -- and formats it nicely -- if it's the last statement in the cell.
In [ ]:
ff = data.DataReader('F-F_Research_Data_factors', 'famafrench')
ff
What is this object?
In [ ]:
type(ff)
Learn about the structure
In [ ]:
ff.keys()
In [ ]:
ff['DESCR']
In [ ]:
ff = ff[1]
ff
In [ ]:
ff.columns = ['xsm', 'smb', 'hml', 'rf']
ff['rm'] = ff['xsm'] + ff['rf']
ff = ff[['rm', 'rf']] # extract rm (market) and rf (riskfree)
ff.head(5)
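The steps above (inspect the object, pick out one table, rename the columns, build the market return) can be rehearsed offline. The sketch below assumes the reader returns a dictionary of dataframes plus a 'DESCR' entry, as the cells above suggest. The name ff_demo and every number in it are made up.

```python
import pandas as pd

# Hypothetical stand-in for the object returned by DataReader:
# a dict with a dataframe under an integer key and a text description
ff_demo = {
    0: pd.DataFrame(
        {"Mkt-RF": [1.0, -2.0], "SMB": [0.5, 0.1],
         "HML": [0.2, -0.3], "RF": [0.1, 0.1]}
    ),
    "DESCR": "made-up description",
}

print(type(ff_demo))          # what is this object?
print(list(ff_demo.keys()))   # learn about the structure

table = ff_demo[0].copy()                    # extract one dataframe
table.columns = ["xsm", "smb", "hml", "rf"]  # simplify variable names
table["rm"] = table["xsm"] + table["rf"]     # market return = excess return + riskfree rate
table = table[["rm", "rf"]]
print(table)
```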
The simplest way to produce graphics from a dataframe is to apply a plot method to it.
We see that a number of things are preset for us, including the x and y variables. By default, the x variable is the dataframe's index and the y variables are the columns of the dataframe -- all of them that can be plotted (e.g. columns with a numeric dtype).
We can change all of these things, but the defaults are always a good starting point.
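To see these defaults at work, here is a small sketch with a made-up dataframe (the name demo and its contents are hypothetical). It has two numeric columns and one text column. The Agg backend line is an assumption for running outside a notebook. Inside Jupyter, %matplotlib inline does that job.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for use outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data: two numeric columns, one text column, years as the index
demo = pd.DataFrame(
    {"a": [1, 2, 3], "b": [3, 2, 1], "label": ["x", "y", "z"]},
    index=[2001, 2002, 2003],
)

ax = demo.plot()           # no arguments, so we get the defaults
lines = ax.get_lines()     # the lines that were drawn

print(len(lines))                    # number of columns that were plotted
print(list(lines[0].get_xdata()))    # x values used for the first line
print(list(lines[0].get_ydata()))    # y values used for the first line
```

The text column is skipped, and the x values are taken from the index.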
In [ ]:
# try this with US GDP
us.plot()
In [ ]:
# do GDP alone
us['gdp'].plot()
In [ ]:
us.plot(y="gdp")
In [ ]:
# bar chart
us.plot(kind='bar')
In [ ]:
# scatter plot
# we need to be explicit about the x and y variables: x = 'gdp', y = 'pce'
us.plot.scatter('gdp', 'pce')
In [ ]:
us.plot('gdp', 'pce', kind='scatter')
Exercise. Add each of these arguments, one at a time, to us.plot():
kind='area'
subplots=True
sharey=True
figsize=(3,6)
ylim=(0,16000)
What do they do?
In [ ]:
us.plot(kind='area') # fill the area below
In [ ]:
us.plot(subplots=True) # make separate subplots for the variables in the dataframe
In [ ]:
us.plot(subplots=True, sharey = True) # make the y axis the same
In [ ]:
us.plot(figsize = (10, 2)) # first arg: width, second: height (inches)
In [ ]:
us.plot(ylim = (0, 16000)) # change the range of the y axis
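The arguments can also be combined. The sketch below uses a small made-up dataframe (hypothetical name demo) and shows one detail worth knowing: with subplots=True, the plot method hands back one axis object per column rather than a single one.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for use outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data on very different scales
demo = pd.DataFrame({"small": [1, 2, 3], "large": [100, 200, 300]})

# Separate panels, common y axis, custom size
axes = demo.plot(subplots=True, sharey=True, figsize=(6, 4))

print(len(axes))                                   # one axis object per column
print(axes[0].get_ylim() == axes[1].get_ylim())    # do the panels share y limits?
```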
In [ ]:
# now try a few things with the Fama-French data
ff.plot()
Exercise. What does each of the arguments do in the code below?
In [ ]:
ff.plot(kind='hist',
bins=20,
subplots=True)
In [ ]:
# "smoothed" histogram
ff.plot(kind='kde', subplots=True, sharex=True) # smoothed histogram ("kernel density estimate")
Exercise. Try adding the arguments title='Fama-French returns', grid=True, and legend=False.
In [ ]:
ff.plot(kind='kde',
subplots=True,
sharex=True,
title='Fama-French returns',
grid=True,
legend=False)
In [ ]:
plt.plot(us.index, us['gdp'])
In [ ]:
# we can do two lines together
plt.plot(us.index, us['gdp'])
plt.plot(us.index, us['pce'])
In [ ]:
# we can also add things to plots
plt.plot(us.index, us['gdp'])
plt.plot(us.index, us['pce'])
plt.title('US GDP', fontsize=14, loc='left') # add title
plt.ylabel('Billions of 2009 USD') # y axis label
plt.xlabel('Year') # x axis label
plt.tick_params(labelcolor='red') # change tick labels to red
#plt.legend(['GDP', 'Consumption'], loc='best')
Comment. All of these statements must be in the same cell for this to work.
Comment. This is overkill -- it looks horrible -- but it makes the point that we control everything in the plot. We recommend you do very little of this until you're more comfortable with the basics.
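One way to think about why the statements belong together: each plt function acts on whatever the "current" axes happen to be. The sketch below, with made-up numbers, asks pyplot for the current axes after a few calls and checks what has accumulated there. The Agg backend line is an assumption for running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for use outside a notebook
import matplotlib.pyplot as plt

years = [2010, 2011, 2012]   # made-up data

plt.figure()                              # start a fresh figure
plt.plot(years, [1.0, 2.0, 3.0])          # first line
plt.plot(years, [0.5, 1.0, 1.5])          # second line, same axes
plt.title("Two lines", loc="left")
plt.xlabel("Year")

ax_current = plt.gca()                    # the axes all the plt calls above drew on
n_lines = len(ax_current.get_lines())
print(n_lines, ax_current.get_xlabel())
```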
This approach is probably the most mysterious, but it's the best.
The idea is to use the matplotlib.pyplot function subplots(), which creates two objects:
fig: figure object -- blank canvas for creating a figure
ax: axis object -- everything in the figure: axes, labels, legend
We then apply methods on these objects to set the various elements of the graph.
Create objects. We'll see this line over and over:
In [ ]:
fig, ax = plt.subplots(2, 1) # create fig and ax objects -- nrows, ncols
Exercise. What do we have here? What type are fig and ax?
In [ ]:
print('fig is ', type(fig))
print('ax is ', type(ax))
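What ax is depends on how many panels we ask for. A minimal sketch (the variable names are ours): with no arguments we get a single axis object, and with a grid we get a numpy array of them.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for use outside a notebook
import matplotlib.pyplot as plt
from matplotlib.axes import Axes

fig1, ax1 = plt.subplots()        # one panel: ax1 is a single axis object
fig2, ax2 = plt.subplots(2, 1)    # two rows, one column: ax2 is an array
fig3, ax3 = plt.subplots(2, 2)    # two by two grid: ax3 is a 2-d array

print(type(ax1))
print(type(ax2), ax2.shape)
print(type(ax3), ax3.shape)
```

This is why we write ax[0] or ax[0, 1] when there are several panels and plain ax when there is one.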
In [ ]:
# let's try that again, this time with content
fig, axe = plt.subplots(figsize=(8, 4))
# add things to ax
us.plot(ax=axe, color = ['red', 'green'])
Comment. Both of these statements must be in the same cell.
In [ ]:
# Fama-French example
fig, ax = plt.subplots()
ff.plot(ax=ax,
kind='line', # line plot
color=['blue', 'magenta'], # line color
title='Fama-French market and riskfree returns')
In [ ]:
# Fama-French example
fig, ax = plt.subplots(1, 2, figsize=(12, 4))
ff.plot(ax=ax[0],
kind='hist', # histogram
color=['blue', 'magenta'], # bar color
alpha=0.65,
bins=20,
title='Fama-French market and riskfree returns')
ff.plot(ax=ax[1],
kind='kde', # kernel density estimate
color=['blue', 'magenta'], # line color
title='Fama-French market and riskfree returns',
alpha=0.65)
fig.tight_layout()
We looked at three ways to use Matplotlib:
Approach #1: apply the plot() method to a dataframe
Approach #2: use the plot(x,y) function
Approach #3: create fig, ax objects, apply plot methods to them
Same result, different syntax. This is what each of them looks like applied to US GDP:
us['gdp'].plot() # Approach #1
plt.plot(us.index, us['gdp']) # Approach #2
fig, ax = plt.subplots() # Approach #3
ax.plot(us.index, us['gdp'])
# Or
fig, ax = plt.subplots() # Approach #3
us['gdp'].plot(ax=ax)
Each one produces the same graph.
Which one should we use? Use Approach #3. Really. This is a case where choice is confusing.
We also suggest you not commit any of this to memory. If you end up using it a lot, you'll remember it. If you don't, it's not worth remembering.
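To check the "same graph" claim without downloading anything, here is a sketch that runs all three approaches on a tiny made-up series (hypothetical name s) and compares the y values each one draws.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for use outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for us['gdp']
s = pd.Series([1.0, 2.0, 4.0], index=[2001, 2002, 2003], name="gdp")

# Approach #1: plot method applied to the series
ax1 = s.plot()
y1 = list(ax1.get_lines()[0].get_ydata())

# Approach #2: plot(x, y) function on a fresh figure
plt.figure()
plt.plot(s.index, s)
y2 = list(plt.gca().get_lines()[0].get_ydata())

# Approach #3: create fig, ax and apply a method to ax
fig, ax3 = plt.subplots()
ax3.plot(s.index, s)
y3 = list(ax3.get_lines()[0].get_ydata())

print(y1 == y2 == y3)
```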
In [ ]:
fig, ax = plt.subplots()
us.plot(ax=ax)
# Apply axis methods
ax.set_title('US GDP and Consumption', fontsize=14, loc='left')
ax.set_ylabel('Billions of 2009 USD')
ax.legend(['Real GDP', 'Consumption'], loc=0) # more descriptive variable names
ax.tick_params(labelcolor='red') # change tick labels to red
ax.set_ylim(0)
(Your results may differ, but we really enjoyed that.)
Exercise. Use the set_xlabel() method to add an x-axis label. What would you choose? Or would you prefer to leave it empty?
Exercise. Enter ax.legend? to access the documentation for the legend method. What options appeal to you?
Exercise. Change the line width to 2 and the line colors to blue and magenta. Hint: Use us.plot? to get the documentation.
Exercise (challenging). Use the set_ylim() method to start the y axis at zero. Hint: Use ax.set_ylim? to get the documentation.
Exercise. Create a line plot for the Fama-French dataframe ff that includes both returns. Bonus points: Add a title with the set_title method.
In [ ]:
fig, ax = plt.subplots()
us.plot(ax=ax, lw=2)
ax.set_title('US GDP and Consumption', fontsize=14, loc='left')
ax.set_ylabel('Billions of 2009 USD')
ax.legend(['Real GDP', 'Consumption'], loc=2) # more descriptive variable names
ax.tick_params(labelcolor='green') # change tick labels to green
ax.set_ylim(0)
In [ ]:
fig, ax = plt.subplots(nrows=2, ncols=2, sharex=True)
print('Object ax has dimension', ax.shape)
In [ ]:
fig, ax = plt.subplots(nrows=2, ncols=2, sharex=True, sharey=True)
print('Object ax has dimension', ax.shape)
ax[0, 1].plot(us.index, us['pce'])
ax[1, 0].plot(us.index, us['gdp'])
In [ ]:
# now add some content
fig, ax = plt.subplots(nrows=2, ncols=1, sharex=True, sharey=True)
us['gdp'].plot(ax=ax[0], color='green') # first plot
us['pce'].plot(ax=ax[1], color='red') # second plot
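With a grid of panels, indexing them one at a time gets tedious. A minimal sketch, using made-up lines, of looping over every panel with the array's flat attribute:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for use outside a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots(nrows=2, ncols=2, sharex=True, sharey=True)

# ax.flat walks through the panels row by row
for i, panel in enumerate(ax.flat):
    panel.plot([0, 1, 2], [i, i + 1, i + 2])          # made-up data
    panel.set_title("panel " + str(i), loc="left")

titles = [panel.get_title(loc="left") for panel in ax.flat]
print(titles)
```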
In [ ]:
url = 'http://dx.doi.org/10.1787/888932937035'
pisa = pd.read_excel(url,
skiprows=18, # skip the first 18 rows
skipfooter=7, # skip the last 7
usecols=[0,1,9,13], # select columns
index_col=0, # set index = first column
header=[0,1] # set variable names
)
pisa = pisa.dropna() # drop blank lines
pisa.columns = ['Math', 'Reading', 'Science'] # simplify variable names
In [ ]:
# bar chart of math scores
fig, ax = plt.subplots()
pisa['Math'].plot(kind='barh', ax=ax)
Comment. Yikes! That's horrible! What can we do about it? Any suggestions?
Let's make the figure taller. The figsize argument has the form (width, height). The default is (6, 4). We want a tall figure, so we need to increase the height setting.
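The (width, height) order is easy to mix up, so here is a short sketch that asks the figure for its size after creating it. Sizes are in inches.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for use outside a notebook
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 13))     # narrow and tall
width, height = fig.get_size_inches()       # read the size back

print(width, height)
```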
In [ ]:
# make the plot taller
fig, ax = plt.subplots(figsize=(4, 13)) # note figsize
pisa['Math'].plot(kind='barh', ax=ax)
ax.set_title('PISA Math Score', loc='left')
Comment. What if we wanted to make the US bar red? This is far too complicated, but we used our Google fu and found a solution.
In [ ]:
fig, ax = plt.subplots()
pisa['Math'].plot(ax=ax, kind='barh', figsize=(4,13))
ax.set_title('PISA Math Score', loc='left')
us_index = pisa.index.tolist().index('United States')
ax.get_children()[us_index].set_color('r')
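A slightly more targeted variant of the same trick, offered as a sketch rather than the definitive way: the bars live in ax.patches, in the same order as the rows of the data, so we can pick out one bar by position. The scores and country names below are placeholders, not PISA data.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for use outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder scores (NOT real PISA numbers)
scores = pd.Series(
    [500, 481, 494],
    index=["Country A", "United States", "Country B"],
    name="Math",
)

fig, ax = plt.subplots(figsize=(4, 3))
scores.plot(kind="barh", ax=ax)

# Find the row number of the bar we want, then recolor that bar only
position = list(scores.index).index("United States")
ax.patches[position].set_color("red")

print(position, len(ax.patches))
```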
Exercise. Create the same graph for the Reading score.
In [ ]:
# variable list (GDP per capita, GDP, life expectancy)
var = ['NY.GDP.PCAP.PP.KD', 'NY.GDP.MKTP.PP.KD', 'SP.DYN.LE00.IN']
# country list (ISO codes)
iso = ['USA', 'FRA', 'JPN', 'CHN', 'IND', 'BRA', 'MEX']
year = 2013
# get data from World Bank
df = wb.download(indicator=var, country=iso, start=year, end=year)
# massage data
df = df.reset_index(level='year', drop=True)
df.columns = ['gdppc', 'gdp', 'life'] # rename variables
df['pop'] = df['gdp']/df['gdppc'] # population
df['gdp'] = df['gdp']/10**12 # convert to trillions
df['gdppc'] = df['gdppc']/10**3 # convert to thousands
df['order'] = [5, 3, 1, 4, 2, 6, 0] # reorder countries
df = df.sort_values(by='order', ascending=False)
df
In [ ]:
# We'll use this same basic graph a few times.
# Let's make a function so we don't have to repeat the
# code to create it.
def gdp_bar(variable="gdp"):
    fig, ax = plt.subplots()
    df[variable].plot(ax=ax, kind='barh', alpha=0.5)
    ax.set_title('Real GDP', loc='left', fontsize=14)
    ax.set_xlabel('Trillions of US Dollars')
    ax.set_ylabel('')
    return fig, ax
In [ ]:
gdp_bar()
In [ ]:
# ditto for GDP per capita (per person)
fig, ax = gdp_bar("gdppc")
ax.set_title('GDP Per Capita', loc='left', fontsize=14)
And just because it's fun, here's an example of Tufte-like axes from Matplotlib examples:
In [ ]:
fig, ax = gdp_bar()
# Tufte-like axes
ax.spines['left'].set_position(('outward', 7))
ax.spines['bottom'].set_position(('outward', 7))
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.yaxis.set_ticks_position('left')
ax.xaxis.set_ticks_position('bottom')
Exercise (challenging). Make the ticks point out.
In [ ]:
# scatterplot of life expectancy vs gdp per capita
fig, ax = plt.subplots()
ax.scatter(df['gdppc'], df['life'], # x,y variables
s=df['pop']/10**6, # size of bubbles
alpha=0.5)
ax.set_title('Life expectancy vs. GDP per capita', loc='left', fontsize=14)
ax.set_xlabel('GDP Per Capita')
ax.set_ylabel('Life Expectancy')
ax.text(58, 66, 'Bubble size represents population', horizontalalignment='right')
Exercise. Make the bubbles a little larger.
In [ ]:
Exercise (challenging). Add labels to the bubbles so we know which country they correspond to.
In [ ]:
# scatterplot of life expectancy vs gdp per capita
fig, ax = plt.subplots()
ax.scatter(df['gdppc'], df['life'], # x,y variables
s=df['pop']/10**6, # size of bubbles
alpha=0.5)
ax.set_title('Life expectancy vs. GDP per capita', loc='left', fontsize=14)
ax.set_xlabel('GDP Per Capita')
ax.set_ylabel('Life Expectancy')
ax.text(58, 66, 'Bubble size represents population', horizontalalignment='right')
for (x, y, country) in zip(df['gdppc'], df['life'], df.index):
    ax.text(x, y, country)
Consider the data from Randal Olson's blog post:
In [ ]:
import pandas as pd
data = {'Food': ['French Fries', 'Potato Chips', 'Bacon', 'Pizza', 'Chili Dog'],
'Calories per 100g': [607, 542, 533, 296, 260]}
cals = pd.DataFrame(data)
The dataframe cals contains the calories in 100 grams of several different foods.
Exercise. We'll create and modify visualizations of this data:
Set 'Food' as the index of cals.
Plot cals using figure and axis objects.
Add the argument alpha=0.5. What does it do?
In [ ]:
In [ ]:
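One possible way to approach the exercise, as a sketch rather than the official answer. It uses the data from the cell above. The horizontal bar chart, the title, and the hidden legend are our own choices.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend for use outside a notebook
import matplotlib.pyplot as plt
import pandas as pd

data = {'Food': ['French Fries', 'Potato Chips', 'Bacon', 'Pizza', 'Chili Dog'],
        'Calories per 100g': [607, 542, 533, 296, 260]}
cals = pd.DataFrame(data)

cals = cals.set_index('Food')       # use the food names as the index

fig, ax = plt.subplots()
cals.plot(ax=ax,
          kind='barh',              # horizontal bars (our choice)
          alpha=0.5,                # make the bars half transparent
          legend=False)             # one variable, so no legend needed
ax.set_title('Calories per 100g', loc='left')
ax.set_ylabel('')
```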