Pandas 3: Shaping data

The second in a series of notebooks that describe Pandas' powerful data management tools. This one covers shaping methods: switching rows and columns, pivoting, and stacking. We'll see that this is all about the indexes: the row and column labels.

Outline:

More data management topics coming.

Note: requires internet access to run.

This Jupyter notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.

tl;dr

Let df be a DataFrame

  • We use df.set_index to move columns into the index of df
  • We use df.reset_index to move one or more levels of the index back to columns. If we set drop=True, the requested index levels are simply thrown away instead of made into columns
  • We use df.stack to move column index levels into the row index
  • We use df.unstack to move row index levels into the colunm index (Helpful mnemonic: unstack moves index levels up)

Preliminaries

Import packages, etc.


In [ ]:
import sys                             # system module 
import pandas as pd                    # data package
import matplotlib.pyplot as plt        # graphics module  
import datetime as dt                  # date and time module
import numpy as np                     # foundation for Pandas 

%matplotlib inline                     

# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())

Example: WEO debt and deficits

We spend most of our time on one of the examples from the previous notebook. The problem in this example is that variables run across rows, rather than down columns. Our want is to flip some of the rows and columns so that we can plot the data against time. The question is how.

We use a small subset of the IMF's World Economic Outlook database that contains two variables and three countries.


In [ ]:
url1 = 'http://www.imf.org/external/pubs/ft/weo/2015/02/weodata/'
url2 = 'WEOOct2015all.xls'
url = url1 + url2 
weo = pd.read_csv(url, sep='\t', 
                  usecols=[1,2,3,4,6,40,41,42], 
                  thousands=',', 
                  na_values=['n/a', '--']) 
print('Variable dtypes:\n', weo.dtypes, sep='')

In [ ]:
# create debt and deficits dataframe:  two variables and three countries 
variables = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
countries = ['ARG', 'DEU', 'GRC']
dd = weo[weo['WEO Subject Code'].isin(variables) & weo['ISO'].isin(countries)]

# change column labels to something more intuitive 
dd = dd.rename(columns={'WEO Subject Code': 'Variable', 
                        'Subject Descriptor': 'Description'})

# rename variables 
dd['Variable'] = dd['Variable'].replace(to_replace=['GGXWDG_NGDP', 'GGXCNL_NGDP'], 
                                        value=['Debt', 'Surplus'])
dd

In [ ]:

Reminders

What kind of object does each of the following produce?


In [ ]:
dd.index

In [ ]:
dd.columns

In [ ]:
dd['ISO']

In [ ]:
dd[['ISO', 'Variable']]

In [ ]:
dd[dd['ISO'] == 'ARG']

In [ ]:

Wants

We might imagine doing several different things with this data:

  • Plot a specific variable (debt or surplus) for a given date.
  • Time series plots for a specific country.
  • Time series plots for a specific variable.

Depending on which we want, we might organize the data differently. We'll focus on the last two.

Here's a brute force approach to the problem: simply transpose the data. This is where that leads:


In [ ]:
dd.T

Comments. The problem here is that the columns include both the numbers (which we want to plot) and some descriptive information (which we don't).


In [ ]:

Setting and resetting the index

We start by setting and resetting the index. That may sound like a step backwards -- haven't we done this already? -- but it reminds us of some things that will be handy later.

Take the dataframe dd. What would we like in the index? Evenutally we'd like the dates [2011, 2012, 2013], but right now the row labels are more naturally the variable or country. Here are some varriants.

Setting the index


In [ ]:
dd.set_index('Country')

In [ ]:
# we can do the same thing with a list, which will be meaningful soon...
dd.set_index(['Country'])

Exercise. Set Variable as the index.


In [ ]:

Comment. Note that the new index brought its name along: Country in the two examples, Variable in the exercise. That's incredibly useful because we can refer to index levels by name. If we happen to have an index without a name, we can set it with

df.index.name = 'Whatever name we like'

Multi-indexes

We can put more than one variable in an index, which gives us a multi-index. This is sometimes called a hierarchical index because the levels of the index (as they're called) are ordered.

Multi-indexes are more common than you might think. One reason is that data itself is often multi-dimensional. A typical spreadsheet has two dimensions: the variable and the observation. The WEO data is naturally three dimensional: the variable, the year, and the country. (Think about that for a minute, it's deeper than it sounds.)

The problem we're having is fitting this nicely into two dimensions. A multi-index allows us to manage that. A two-dimensional index would work here -- the country and the variable code -- but right now we have some redundancy.

Example. We push all the descriptive, non-numerical columns into the index, leaving the dataframe itself with only numbers, which seems like a step in thee right direction.


In [ ]:
ddi = dd.set_index(['Variable', 'Country', 'ISO', 'Description', 'Units'])
ddi

Let's take a closer look at the index


In [ ]:
ddi.index

That's a lot to process, so we break it into pieces.

  • ddi.index.names contains a list of level names. (Remind yourself that lists are ordered, so this tracks levels.)
  • ddi.index.levels contains the values in each level.

Here's what they like like here:


In [ ]:
# Chase and Spencer like double quotes 
print("The level names are:\n", ddi.index.names, "\n", sep="")
print("The levels (aka level values) are:\n", ddi.index.levels, sep="")

Knowing the order of the index components and being able to inspect their values and names is fundamental to working with a multi-index.

Exercise: What would happen if we had switched the order of the strings in the list when we called dd.set_index? Try it with this list to find out: ['ISO', 'Country', 'Variable', 'Description', 'Units']

Resetting the index

We've seen that set_index pushes columns into the index. Here we see that reset_index does the reverse: it pushes components of the index back to the columns.

Example.


In [ ]:
ddi.head(2)

In [ ]:
ddi.reset_index()

In [ ]:
# or we can reset the index by level 
ddi.reset_index(level=1).head(2)

In [ ]:
# or by name 
ddi.reset_index(level='Country').head(2)

In [ ]:
# or do more than one at a time
ddi.reset_index(level=[1,3]).head(2)

Comment. By default, reset_index pushes one or more index levels into columns. If we want to discard that level of the index altogether, we use the parameter drop=True.


In [ ]:
ddi.reset_index(level=[1,3], drop=True).head(2)

Exercise. For the dataframe ddi do the following in separate code cells:

  • Use the reset_index method to move the Units level of the index to a column of the dataframe.
  • Use the drop parameter of reset_index to delete Units from the dataframe.

In [ ]:


In [ ]:

Switching rows and columns

If we take the dataframe ddi, we see that the everything's been put into the index but the data itself. Perhaps we can get what we want if we just flip the rows and columns. Roughly speaking, we refer to this as pivoting.

First look at switching rows and columns

The simplest way to flip rows and columns is to use the T or transpose property. When we do that, we end up with a lot of stuff in the column labels, as the multi-index for the rows gets rotated into the columns. Other than that, we're good. We can even do a plot. The only problem is all the stuff we've pushed into the column labels -- it's kind of a mess.


In [ ]:
ddt = ddi.T
ddt

Comment. We see here that the multi-index for the rows has been turned into a multi-index for the columns. Works the same way.

The only problem here is that the column labels are more complicated than we might want. Here, for example, is what we get with the plot method. As usual, .plot() plots all the columns of the dataframe, but here that means we're mixing variables. And the legend contains all the levels of the column labels.


In [ ]:
ddt.plot()

Comment. Ooooh, that's ugly! We're on the right track, but evidently not there yet.


In [ ]:

Referring to variables with a multi-index

Can we refer to variables in the same way? Sort of, as long as we refer to the top level of the column index. It gives us a dataframe that's a subset of the original one.

Let's try each of these:

  • ddt['Debt']
  • ddt['Debt']['Argentina']
  • ddt['Debt', 'Argentina']
  • ddt['ARG']

What do you see? What's going on? The theme is that we can reference the top level, which in ddi is the Variable. If we try to access a lower level, it bombs.


In [ ]:
# indexing by variable 
debt = ddt['Debt']
debt

Exercise. With the dataframe ddt:

  • What type of object is ddt["Debt"]?
  • Construct a line plot of Debt over time with one line for each country.

In [ ]:


In [ ]:


In [ ]:

Example. Let's do this together. How would we fix up the legend? What approaches cross your mind? (No code, just the general approach.)

Swapping levels

Since variables refer to the first level of the column index, it's not clear how we would group data by country. Suppose, for example, we wanted to plot Debt and Surplus for a specific country. What would we do?

One way to do that is to make the country the top level with the swaplevel method. Note the axis parameter. With axis=1 we swap column levels, with axis=0 (the default) we swap row levels.


In [ ]:
ddts = ddt.swaplevel(0,1, axis=1) 
ddts

Exercise. Use the dataframe ddts to plot Debt and Surplus across time for Argentina. Hint: In the plot method, set subplots=True so that each variable is in a separate subplot.


In [ ]:

The xs method

Another approach to extracting data that cuts across levels of the row or column index: the xs method. This is recent addition tpo Pandas and an extremely good method once you get the hang of it.

The basic syntax is df.xs(item, axis=X, level=N), where N is the name or number of an index level and X describes if we are extracting from the index or column names. Setting X=0 (so axis=0) will slice up the data along the index, X=1 extracts data for column labels.

Here's how we could use xs to get the Argentina data without swapping the level of the column labels


In [ ]:
# ddt.xs?

In [ ]:
ddt.xs("Argentina", axis=1, level="Country")

In [ ]:
ddt.xs("Argentina", axis=1, level="Country")["Debt"]

Exercise. Use a combination of xs and standard slicing with [...] to extract the variable Debt for Greece.


In [ ]:

Exercise. Use the dataframe ddt -- and the xs method -- to plot Debt and Surplus across time for Argentina.


In [ ]:

Stacking and unstacking

The set_index and reset_index methods work on the row labels -- the index. They move columns to the index and the reverse. The stack and unstack methods move index levels to and from column levels:

  • stack moves the "inner most" (closest to the data when printed) column label into a row label. This creates a long dataframe.
  • unstack does the reverse, it moves the inner most level of the index up to become the inner most column label. This creates a wide dataframe.

We use both to shape (or reshape) our data. We use set_index to push things into the index. And then use reset_index to push some of them back to the columns. That gives us pretty fine-grainded control over the shape of our data.

We start by simplifying our initial dataframe.


In [ ]:
ddi.stack?

In [ ]:
ddi.index

In [ ]:
ddi.unstack()

Let's get a smaller subset of this data to work with so we can see things a bit more clearly


In [ ]:
# drop some of the index levels (think s for small)
dds = ddi.reset_index(level=[1,3,4], drop=True)
# give a name to the column labels
dds.columns.name = 'Year'  
dds

Let's remind ourselves what we want. We want to

  • move the column index (Year) into the row index
  • move the Variable and ISO levels the other way, into the column labels.

The first one uses stack, the second one unstack.

Stacking

We stack our data, one variable on top of another, with a multi-index to keep track of what's what. In simple terms, we change the data from a wide format to a long format. The stack method takes the inner most column level and makes it the lowest row level.


In [ ]:
# convert to long format. Notice printing is different... what `type` is ds?
ds = dds.stack()    
ds

In [ ]:
# same thing with explicit reference to column name 
dds.stack(level='Year').head(8)

In [ ]:
# or with level number 
dds.stack(level=0).head(8)

In [ ]:

Unstacking

Stacking moves columns into the index, "stacking" the data up into longer columns. Unstacking does the reverse, taking levels of the row index and turning them into column labels. Roughly speaking we're rotating or pivoting the data.


In [ ]:
# now go long to wide 
ds.unstack()  # default is lowest value level='ISO'

In [ ]:
# different level 
ds.unstack(level='Variable')

In [ ]:
# or two at once 
ds.unstack(level=['Variable', 'ISO'])

In [ ]:

Exercise. Run the code below and explain what each line of code does.


In [ ]:
# stacked dataframe 
ds.head(8)

In [ ]:
du1 = ds.unstack()

In [ ]:
du2 = du1.unstack()

In [ ]:

Exercise (challenging). Take the unstacked dataframe dds. Use some combination of stack, unstack, and plot to plot the variable Surplus against Year for all three countries. Challenging mostly because you need to work out the steps by yourself.


In [ ]:


In [ ]:

Pivoting

The pivot method: a short cut to some kinds of unstacking. In rough terms, it takes a wide dataframe and constructs a long one. The inputs are columns, not index levels.

Example: BDS data

The Census's Business Dynamnics Statistics collects annual information about the hiring decisions of firms by size and age. This table list the number of firms and total employment by employment size categories: 1 to 4 employees, 5 to 9, and so on.

Apply want operator. Our want is to plot total employment (the variable Emp) against size (variable fsize). Both are columns in the original data.

Here we construct a subset of the data, where we look at two years rather than the whole 1976-2013 period.


In [ ]:
url = 'http://www2.census.gov/ces/bds/firm/bds_f_sz_release.csv'
raw = pd.read_csv(url)
raw.head()

sizes = ['a) 1 to 4', 'b) 5 to 9', 'c) 10 to 19', 'd) 20 to 49']
bds = raw[(raw['year2']>=2012) & raw['fsize'].isin(sizes)][['year2', 'fsize', 'Firms', 'Emp']]
bds

Pivoting the data

Let's think specifically about what we want. We want to graph Emp against fsize for (say) 2013. This calls for:

  • The index should be the size categories fsize.
  • The column labels should be the entries of year2, namely 2012 and 2013.
  • The data should come from the variable Emp.

These inputs translate directly into the following pivot method:


In [ ]:
# pivot and divide by a million (dividing so bars aren't too long)
bdsp = bds.pivot(index='fsize', columns='year2', values='Emp')/10**6 
bdsp

Comment. Note that all the parameters here are columns. That's not a choice, it's the way the the pivot method is written.

We do a plot for fun:


In [ ]:
# plot 2013 as bar chart 
fig, ax = plt.subplots()
bdsp[2013].plot.barh(ax=ax)
ax.set_ylabel('')
ax.set_xlabel('Number of Employees (millions)')

In [ ]:

Review

We return to the OECD's healthcare data, specifically a subset of their table on the number of doctors per one thousand population. This loads and cleans the data:


In [ ]:
url1 = 'http://www.oecd.org/health/health-systems/'
url2 = 'OECD-Health-Statistics-2015-Frequently-Requested-Data.xls'
docs = pd.read_excel(url1+url2, 
                     skiprows=3, 
                     usecols=[0, 51, 52, 53, 54, 55, 57],
                     sheetname='Physicians', 
                     na_values=['..'],
                     skip_footer=21) 
# rename country variable 
names = list(docs)
docs = docs.rename(columns={names[0]: 'Country'})
# strip footnote numbers from country names 
docs['Country'] = docs['Country'].str.rsplit(n=1).str.get(0)
docs = docs.head()
docs

Use this data to:

  • Set the index as Country.
  • Construct a horizontal bar chart of the number of doctors in each country in "2013 (or nearest year)".
  • Apply the drop method to docs to create a dataframe new that's missing the last column.
  • Challenging. Use stack and unstack to "pivot" the data so that columns are labeled by country names and rows are labeled by year. This is challenging because we have left out the intermediate steps.
  • Plot the number of doctors over time in each country as a line in the same plot.

Comment. In the last plot, the x axis labels are non-intuitive. Ignore that.


In [ ]:


In [ ]:


In [ ]:

Resources

Far and away the best material on this subject is Brandon Rhodes' 2015 Pycon presentation. 2 hours and 25 minutes and worth every second.


In [ ]: