The second in a series of notebooks that describe Pandas' powerful data management tools. This one covers shaping methods: switching rows and columns, pivoting, and stacking. We'll see that this is all about the indexes: the row and column labels.
Outline:
More data management topics coming.
Note: requires internet access to run.
This Jupyter notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.
Let df be a DataFrame. We use
df.set_index to move columns into the index of df.
df.reset_index to move one or more levels of the index back to columns. If we set drop=True, the requested index levels are simply thrown away instead of made into columns.
df.stack to move column index levels into the row index.
df.unstack to move row index levels into the column index. (Helpful mnemonic: unstack moves index levels up.)
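As a quick orientation, here is a minimal sketch of the four methods on a tiny made-up dataframe. The names toy, indexed, long_form and wide, and all the numbers, are ours for illustration only; they are not the WEO data used below.

```python
import pandas as pd

# Hypothetical toy data: two countries, two variables, two years
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B'],
    'Variable': ['Debt', 'Surplus', 'Debt', 'Surplus'],
    '2015': [50.0, -2.0, 80.0, 1.0],
    '2016': [55.0, -3.0, 78.0, 0.5],
})

# set_index: move columns into the (row) index
indexed = toy.set_index(['Country', 'Variable'])

# reset_index: move index levels back to columns ...
restored = indexed.reset_index()
# ... or throw a level away with drop=True
dropped = indexed.reset_index(level='Country', drop=True)

# stack: move the column labels into the row index (wide -> long)
long_form = indexed.stack()

# unstack: move the innermost row level up into the columns (long -> wide)
wide = long_form.unstack()

print(long_form)
print(wide)
```

Stacking and then unstacking brings us back to where we started, which is a handy sanity check when reshaping.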
In [3]:
%matplotlib inline
import pandas as pd # data package
import matplotlib.pyplot as plt # graphics module
import datetime as dt # date and time module
import numpy as np # foundation for Pandas
We spend most of our time on one of the examples from the previous notebook. The problem in this example is that variables run across rows, rather than down columns. Our want is to flip some of the rows and columns so that we can plot the data against time. The question is how.
We use a small subset of the IMF's World Economic Outlook database that contains two variables and three countries.
In [4]:
url = 'http://www.imf.org/external/pubs/ft/weo/2016/02/weodata/WEOOct2016all.xls'
# (1) define the column indices
col_indices = [1, 2, 3, 4, 6] + list(range(9, 46))
# (2) download the dataset
weo = pd.read_csv(url,
                  sep='\t',
                  #index_col='ISO',
                  usecols=col_indices,
                  skipfooter=1, engine='python',
                  na_values=['n/a', '--'],
                  thousands=',', encoding='windows-1252')
# (3) turn the types of year variables into float
years = [str(year) for year in range(1980, 2017)]
weo[years] = weo[years].astype(float)
print('Variable dtypes:\n', weo.dtypes, sep='')
In [5]:
# create debt and deficits dataframe: two variables and three countries
variables = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
countries = ['ARG', 'DEU', 'GRC']
dd = weo[weo['WEO Subject Code'].isin(variables) & weo['ISO'].isin(countries)]
In [6]:
# change column labels to something more intuitive
dd = dd.rename(columns={'WEO Subject Code': 'Variable',
'Subject Descriptor': 'Description'})
In [7]:
# rename variables (i.e. values of observables)
dd['Variable'] = dd['Variable'].replace(to_replace=['GGXWDG_NGDP', 'GGXCNL_NGDP'], value=['Debt', 'Surplus'])
dd
Out[7]:
In [ ]:
In [8]:
dd.index
Out[8]:
In [9]:
dd.columns
Out[9]:
In [10]:
dd['ISO']
Out[10]:
In [11]:
dd[['ISO', 'Variable']]
Out[11]:
In [12]:
dd[dd['ISO'] == 'ARG']
Out[12]:
In [ ]:
We might imagine doing several different things with this data:
Depending on which we want, we might organize the data differently. We'll focus on the last two.
Here's a brute force approach to the problem: simply transpose the data. This is where that leads:
In [13]:
dd.T
Out[13]:
Comments. The problem here is that the columns include both the numbers (which we want to plot) and some descriptive information (which we don't).
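To see the issue in miniature, here is a sketch with a made-up two-country dataframe (the name mixed and the numbers are hypothetical). Transposing a dataframe that mixes text and numbers leaves every column with mixed content, while moving the text into the index first keeps the numbers numeric.

```python
import pandas as pd

# Hypothetical frame mixing a text column with numeric year columns
mixed = pd.DataFrame({
    'Country': ['Argentina', 'Germany'],
    '2015': [52.6, 71.0],
    '2016': [51.8, 68.2],
})

# brute-force transpose: each column now holds a name and two numbers
print(mixed.T.dtypes)

# put the descriptive column in the index first, then transpose
numeric = mixed.set_index('Country').T
print(numeric.dtypes)
```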
In [ ]:
We start by setting and resetting the index. That may sound like a step backwards -- haven't we done this already? -- but it reminds us of some things that will be handy later.
Take the dataframe dd. What would we like in the index? Eventually we'd like the dates, like [2011, 2012, 2013], but right now the row labels are more naturally the variable or country. Here are some variants.
In [14]:
dd.set_index('Country')
Out[14]:
In [15]:
# we can do the same thing with a list, which will be meaningful soon...
dd.set_index(['Country'])
Out[15]:
Exercise. Set Variable
as the index.
In [ ]:
Comment. Note that the new index brought its name along: Country
in the two examples, Variable
in the exercise. That's incredibly useful because we can refer to index levels by name. If we happen to have an index without a name, we can set it with
df.index.name = 'Whatever name we like'
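Here is a small sketch of naming an index, using a made-up dataframe (the name toy and the values are hypothetical). It shows that the index name is what becomes the column label when we reset the index.

```python
import pandas as pd

# Hypothetical frame whose index starts out unnamed
toy = pd.DataFrame({'Debt': [50.0, 80.0]}, index=['ARG', 'DEU'])
print(toy.index.name)      # no name yet

toy.index.name = 'ISO'     # give the index a name

# the name travels with the index when it moves back to a column
print(toy.reset_index().columns.tolist())
```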
We can put more than one variable in an index, which gives us a multi-index. This is sometimes called a hierarchical index because the levels of the index (as they're called) are ordered.
Multi-indexes are more common than you might think. One reason is that data itself is often multi-dimensional. A typical spreadsheet has two dimensions: the variable and the observation. The WEO data is naturally three dimensional: the variable, the year, and the country. (Think about that for a minute, it's deeper than it sounds.)
The problem we're having is fitting this nicely into two dimensions. A multi-index allows us to manage that. A two-dimensional index would work here -- the country and the variable code -- but right now we have some redundancy.
Example. We push all the descriptive, non-numerical columns into the index, leaving the dataframe itself with only numbers, which seems like a step in the right direction.
In [16]:
ddi = dd.set_index(['Variable', 'Country', 'ISO', 'Description', 'Units'])
ddi
Out[16]:
Let's take a closer look at the index:
In [17]:
ddi.index
Out[17]:
That's a lot to process, so we break it into pieces.
ddi.index.names contains a list of level names. (Remind yourself that lists are ordered, so this tracks levels.)
ddi.index.levels contains the values in each level. Here's what they look like here:
In [18]:
# Chase and Spencer like double quotes
print("The level names are:\n", ddi.index.names, "\n", sep="")
print("The levels (aka level values) are:\n", ddi.index.levels, sep="")
Knowing the order of the index components and being able to inspect their values and names is fundamental to working with a multi-index.
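The same inspection can be tried on a smaller multi-index where everything fits on the screen. This is a sketch on made-up data (toy and toyi are hypothetical names); get_level_values is one more standard way to look at a single level, row by row.

```python
import pandas as pd

# Hypothetical data: two variables, two countries, one year
toy = pd.DataFrame({
    'Variable': ['Debt', 'Debt', 'Surplus', 'Surplus'],
    'ISO': ['ARG', 'DEU', 'ARG', 'DEU'],
    '2015': [50.0, 70.0, -2.0, 1.0],
})
toyi = toy.set_index(['Variable', 'ISO'])

print(toyi.index.names)                    # level names, in order
print(toyi.index.levels[0])                # unique values of the first level
print(toyi.index.get_level_values('ISO'))  # labels of one level for every row
```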
Exercise: What would happen if we had switched the order of the strings in the list when we called dd.set_index
? Try it with this list to find out: ['ISO', 'Country', 'Variable', 'Description', 'Units']
In [19]:
ddi.head(2)
Out[19]:
In [20]:
ddi.reset_index()
Out[20]:
In [21]:
# or we can reset the index by level
ddi.reset_index(level=1).head(2)
Out[21]:
In [22]:
# or by name
ddi.reset_index(level='Units').head(2)
Out[22]:
In [23]:
# or do more than one at a time
ddi.reset_index(level=[1, 3]).head(2)
Out[23]:
Comment. By default, reset_index
pushes one or more index levels into columns. If we want to discard that level of the index altogether, we use the parameter drop=True
.
In [24]:
ddi.reset_index(level=[1, 3], drop=True).head(2)
Out[24]:
Exercise. For the dataframe ddi, do the following in separate code cells:
Use the reset_index method to move the Units level of the index to a column of the dataframe.
Use the drop parameter of reset_index to delete Units from the dataframe.
In [ ]:
The simplest way to flip rows and columns is to use the T
or transpose property. When we do that, we end up with a lot of stuff in the column labels, as the multi-index for the rows gets rotated into the columns. Other than that, we're good. We can even do a plot. The only problem is all the stuff we've pushed into the column labels -- it's kind of a mess.
In [25]:
ddt = ddi.T
ddt
Out[25]:
Comment. We see here that the multi-index for the rows has been turned into a multi-index for the columns. Works the same way.
The only problem here is that the column labels are more complicated than we might want. Here, for example, is what we get with the plot method. As usual, .plot()
plots all the columns of the dataframe, but here that means we're mixing variables. And the legend contains all the levels of the column labels.
In [26]:
ddt.plot()
Out[26]:
In [ ]:
Can we refer to variables in the same way? Sort of, as long as we refer to the top level of the column index. It gives us a dataframe that's a subset of the original one.
Let's try each of these:
ddt['Debt']
ddt['Debt']['Argentina']
ddt['Debt', 'Argentina']
ddt['ARG']
What do you see?
In [27]:
# indexing by variable
debt = ddt['Debt']
debt
Out[27]:
In [28]:
ddt['Debt']['Argentina']
Out[28]:
In [29]:
ddt['Debt', 'Argentina']
Out[29]:
In [30]:
#ddt['ARG']
What's going on? The theme is that we can reference the top level, which in ddt is the Variable. If we try to access a lower level on its own, as in ddt['ARG'], pandas raises a KeyError.
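The same behavior can be reproduced on a small dataframe with a two-level column index. This is a sketch on made-up numbers; the names cols, toy, top and one are ours.

```python
import pandas as pd

# Hypothetical columns with two levels: Variable on top, Country below
cols = pd.MultiIndex.from_tuples(
    [('Debt', 'Argentina'), ('Debt', 'Germany'),
     ('Surplus', 'Argentina'), ('Surplus', 'Germany')],
    names=['Variable', 'Country'])
toy = pd.DataFrame([[50.0, 70.0, -2.0, 1.0],
                    [55.0, 68.0, -3.0, 0.5]],
                   index=['2015', '2016'], columns=cols)

top = toy['Debt']                # top-level label: a dataframe of countries
one = toy['Debt', 'Argentina']   # labels for both levels: a single series

# a lower-level label by itself is not found
try:
    toy['Argentina']
    found = True
except KeyError:
    found = False
print(found)
```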
Exercise. With the dataframe ddt
:
What type of object is ddt["Debt"]
?
Construct a line plot of Debt
over time with one line for each country.
In [ ]:
Example. Let's do this together. How would we fix up the legend? What approaches cross your mind? (No code, just the general approach.)
In [31]:
fig, ax = plt.subplots()
ddt['Debt'].plot(ax=ax)
ax.legend(['ARG', 'DEU', 'GRC'], loc='best')  # ISO codes, matching the countries list above
#ax.axhline(100, color='k', linestyle='--', alpha=.5)
Out[31]:
Since variables refer to the first level of the column index, it's not clear how we would group data by country. Suppose, for example, we wanted to plot Debt
and Surplus
for a specific country. What would we do?
One way to do that is to make the country the top level with the swaplevel
method. Note the axis
parameter. With axis=1
we swap column levels, with axis=0
(the default) we swap row levels.
In [32]:
ddts = ddt.swaplevel(0, 1, axis=1)
ddts
Out[32]:
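Here is the swap on a small made-up dataframe, as a sketch (cols, toy and swapped are hypothetical names). Note that swaplevel only relabels; the columns stay in their original order, so we follow it with sort_index if we want each country's variables side by side.

```python
import pandas as pd

# Hypothetical columns: Variable on top, Country below
cols = pd.MultiIndex.from_product(
    [['Debt', 'Surplus'], ['Argentina', 'Germany']],
    names=['Variable', 'Country'])
toy = pd.DataFrame([[50.0, 70.0, -2.0, 1.0],
                    [55.0, 68.0, -3.0, 0.5]],
                   index=['2015', '2016'], columns=cols)

# swap the two column levels so that Country is on top
swapped = toy.swaplevel(0, 1, axis=1)
print(swapped.columns.names)

# sorting the columns groups each country's variables together
swapped = swapped.sort_index(axis=1)
print(swapped['Argentina'])
```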
Exercise. Use the dataframe ddts
to plot Debt
and Surplus
across time for Argentina. Hint: In the plot
method, set subplots=True
so that each variable is in a separate subplot.
The xs method
Another approach to extracting data that cuts across levels of the row or column index is the xs method. It is an extremely useful method once you get the hang of it.
The basic syntax is
df.xs(item, axis=X, level=N)
where N
is the name or number of an index level and X
describes if we are extracting from the index or column names. Setting X=0
(so axis=0
) will slice up the data along the index, X=1
extracts data for column labels.
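To make the syntax concrete, here is a sketch on a small made-up dataframe (cols, toy, germany and surplus are hypothetical names). It uses xs once on the column labels and once on the row labels. In both cases the level we select on is dropped from the result.

```python
import pandas as pd

# Hypothetical columns: Variable on top, Country below
cols = pd.MultiIndex.from_product(
    [['Debt', 'Surplus'], ['Argentina', 'Germany']],
    names=['Variable', 'Country'])
toy = pd.DataFrame([[50.0, 70.0, -2.0, 1.0],
                    [55.0, 68.0, -3.0, 0.5]],
                   index=['2015', '2016'], columns=cols)

# axis=1: pick one label from a column level
germany = toy.xs('Germany', axis=1, level='Country')
print(germany)

# axis=0 (the default): same idea on a row multi-index
rows = toy.T
surplus = rows.xs('Surplus', level='Variable')
print(surplus)
```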
Here's how we could use xs
to get the Argentina data without swapping the levels of the column labels:
In [33]:
# ddt.xs?
In [34]:
ddt.xs("Argentina", axis=1, level="Country")
Out[34]:
In [35]:
ddt.xs("Argentina", axis=1, level="Country")["Debt"]
Out[35]:
Exercise. Use a combination of xs
and standard slicing with [...]
to extract the variable Debt
for Greece.
Exercise. Use the dataframe ddt
-- and the xs
method -- to plot Debt
and Surplus
across time for Argentina.
The set_index
and reset_index
methods work on the row labels -- the index. They move columns to the index and the reverse. The stack
and unstack
methods move index levels to and from column levels:
stack moves the "innermost" (closest to the data when printed) column label into a row label. This creates a long dataframe.
unstack does the reverse: it moves the innermost level of the index up to become the innermost column label. This creates a wide dataframe.
We use both to shape (or reshape) our data. We use set_index to push things into the index, and then use reset_index to push some of them back to the columns. That gives us pretty fine-grained control over the shape of our data.
In [36]:
ddi.stack?
Single level index
In [37]:
# example from docstring
dic = {'a': [1, 3], 'b': [2, 4]}
s = pd.DataFrame(data=dic, index=['one', 'two'])
print(s)
In [38]:
s.stack()
Out[38]:
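Continuing with the same small example, this sketch shows the return trip. The names stacked, roundtrip and flipped are ours. Unstacking the innermost level undoes the stack, while unstacking the outer level gives the transpose instead.

```python
import pandas as pd

s = pd.DataFrame({'a': [1, 3], 'b': [2, 4]}, index=['one', 'two'])

# stack: a series indexed by (row label, column label)
stacked = s.stack()
print(type(stacked).__name__, stacked.index.nlevels)

# unstack with the default (innermost) level puts 'a' and 'b' back in the columns
roundtrip = stacked.unstack()
print(roundtrip)

# unstack the outer level instead: 'one' and 'two' become the columns
flipped = stacked.unstack(level=0)
print(flipped)
```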
Multi-index
In [39]:
ddi.index
Out[39]:
In [40]:
ddi.unstack() # Units variable has only one value, so this doesn't do much
Out[40]:
In [41]:
ddi.unstack(level='ISO')
Out[41]:
Let's get a smaller subset of this data to work with so we can see things a bit more clearly
In [42]:
# drop some of the index levels (think s for small)
dds = ddi.reset_index(level=[1, 3, 4], drop=True)
dds
Out[42]:
In [43]:
# give a name to the column labels
dds.columns.name = 'Year'
dds
Out[43]:
Let's remind ourselves what we want. We want to
move the column index (Year) into the row index, and
move the Variable and ISO levels the other way, into the column labels.
The first one uses stack, the second one unstack.
In [44]:
# convert to long format. Notice printing is different... what `type` is ds?
ds = dds.stack()
ds
Out[44]:
In [45]:
# same thing with explicit reference to column name
dds.stack(level='Year').head(8)
Out[45]:
In [46]:
# or with level number
dds.stack(level=0).head(8)
Out[46]:
In [47]:
# now go long to wide
ds.unstack() # default is the innermost level, which is Year now
Out[47]:
In [48]:
# different level
ds.unstack(level='Variable')
Out[48]:
In [49]:
# or two at once
ds.unstack(level=['Variable', 'ISO'])
Out[49]:
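The outputs above are large, so here is a sketch of the same moves on a made-up dataframe with two variables, two countries, and two years (idx, toy, long_form, by_iso and by_year are hypothetical names). Checking the level names after each step is a good way to keep track of what moved where.

```python
import pandas as pd

# Hypothetical wide data: rows are (Variable, ISO), columns are years
idx = pd.MultiIndex.from_product([['Debt', 'Surplus'], ['ARG', 'DEU']],
                                 names=['Variable', 'ISO'])
toy = pd.DataFrame([[50.0, 55.0], [70.0, 68.0], [-2.0, -3.0], [1.0, 0.5]],
                   index=idx,
                   columns=pd.Index(['2015', '2016'], name='Year'))

# wide -> long: Year joins the row index
long_form = toy.stack()
print(long_form.index.names)

# long -> wide, one named level: ISO becomes the column labels
by_iso = long_form.unstack(level='ISO')
print(by_iso)

# long -> wide, two levels at once: only Year is left in the rows
by_year = long_form.unstack(level=['Variable', 'ISO'])
print(by_year)
```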
In [ ]:
Exercise. Run the code below and explain what each line of code does.
In [50]:
# stacked dataframe
ds.head(8)
Out[50]:
In [51]:
du1 = ds.unstack()
In [52]:
du2 = du1.unstack()
Exercise (challenging). Take the unstacked dataframe dds
. Use some combination of stack
, unstack
, and plot
to plot the variable Surplus
against Year
for all three countries. Challenging mostly because you need to work out the steps by yourself.
In [ ]:
The Census's Business Dynamics Statistics collects annual information about the hiring decisions of firms by size and age. This table lists the number of firms and total employment by employment size category: 1 to 4 employees, 5 to 9, and so on.
Apply want operator. Our want is to plot total employment (the variable Emp
) against size (variable fsize
). Both are columns in the original data.
Here we construct a subset of the data, where we look at two years rather than the whole 1976-2013 period.
In [53]:
url = 'http://www2.census.gov/ces/bds/firm/bds_f_sz_release.csv'
raw = pd.read_csv(url)
raw.head()
Out[53]:
In [54]:
# Four size categories
sizes = ['a) 1 to 4', 'b) 5 to 9', 'c) 10 to 19', 'd) 20 to 49']
# only defined size categories and only period since 2012
restricted_sample = (raw['year2']>=2012) & raw['fsize'].isin(sizes)
# don't need all variables
var_names = ['year2', 'fsize', 'Firms', 'Emp']
bds = raw[restricted_sample][var_names]
bds
Out[54]:
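Let's think specifically about what we want. We want to graph Emp against fsize for (say) 2013. This calls for:
The index should be the size categories fsize.
The column labels should be the entries of year2, namely 2012, 2013 and 2014.
The data values should come from the variable Emp.
These inputs translate directly into the following pivot method: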
In [55]:
bdsp = bds.pivot(index='fsize', columns='year2', values='Emp')
In [56]:
# divide by a million so bars aren't too long
bdsp = bdsp/10**6
bdsp
Out[56]:
Comment. Note that all the parameters here are columns. That's not a choice, it's the way the pivot
method is written.
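For comparison, here is a sketch showing that pivot does the same job as set_index followed by unstack from the earlier sections. The dataframe toy and its numbers are made up; only the column names mirror the BDS data.

```python
import pandas as pd

# Hypothetical long data with the same column names as the BDS subset
toy = pd.DataFrame({
    'year2': [2012, 2012, 2013, 2013],
    'fsize': ['a) 1 to 4', 'b) 5 to 9', 'a) 1 to 4', 'b) 5 to 9'],
    'Emp':   [100, 200, 110, 190],
})

# pivot: name the columns that supply the index, column labels, and values
wide = toy.pivot(index='fsize', columns='year2', values='Emp')
print(wide)

# the same reshaping with set_index and unstack
same = toy.set_index(['fsize', 'year2'])['Emp'].unstack('year2')
print(same)
```

The pivot version is shorter when everything we need starts out as a column. The set_index and unstack route is the one to reach for when some of the information is already in the index.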
We do a plot for fun:
In [57]:
# plot 2013 as bar chart
fig, ax = plt.subplots()
bdsp[2013].plot(ax=ax, kind='barh')
ax.set_ylabel('')
ax.set_xlabel('Number of Employees (millions)')
Out[57]:
In [ ]:
In [ ]:
In [58]:
url1 = 'http://www.oecd.org/health/health-systems/'
url2 = 'OECD-Health-Statistics-2017-Frequently-Requested-Data.xls'
docs = pd.read_excel(url1+url2,
                     skiprows=3,
                     usecols=[0, 51, 52, 53, 54, 55, 57],
                     sheet_name='Physicians',   # spelled sheetname in old versions of pandas
                     na_values=['..'],
                     skipfooter=21)             # spelled skip_footer in old versions of pandas
# rename country variable
names = list(docs)
docs = docs.rename(columns={names[0]: 'Country'})
# strip footnote numbers from country names
docs['Country'] = docs['Country'].str.rsplit(n=1).str.get(0)
docs = docs.head()
docs
Out[58]:
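The footnote-stripping line in the cell above can be checked on a few strings. This is a sketch with made-up labels, and we have not checked whether the OECD sheet actually contains cases like the last one. It shows that splitting off the last word works when every name ends in a footnote number, but also shortens a multi-word name that has none. A regular expression that removes only trailing digits is one possible alternative.

```python
import pandas as pd

# Hypothetical labels: two with trailing footnote numbers, two without
names = pd.Series(['Australia 1', 'Austria', 'New Zealand 1', 'United States'])

# the approach used above: split once from the right, keep the first piece
stripped = names.str.rsplit(n=1).str.get(0)
print(stripped.tolist())

# alternative: remove only whitespace plus digits at the end of the string
cleaned = names.str.replace(r'\s+\d+$', '', regex=True)
print(cleaned.tolist())
```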
Use this data to:
Set the index to Country.
Apply the drop method to docs to create a dataframe new that's missing the last column.
Use stack and unstack to "pivot" the data so that columns are labeled by country names and rows are labeled by year. This is challenging because we have left out the intermediate steps.
Comment. In the last plot, the x axis labels are non-intuitive. Ignore that.
In [ ]: