Pandas 3: Shaping data

The second in a series of notebooks that describe Pandas' powerful data management tools. This one covers shaping methods: switching rows and columns, pivoting, and stacking. We'll see that this is all about the indexes: the row and column labels.

Outline:

Example: WEO debt and deficits. Something to work with.
Indexing. Setting and resetting the index. Multi-indexes.
Switching rows and columns. Transpose. Referring to variables with multi-indexes.
Stack and unstack. Managing column structure and labels.
Pivot. Unstack shortcut if we start with wide data.
Review. Apply what we've learned.

Preliminaries

Import packages, etc.



In [18]:

    
import sys                             # system module 
import pandas as pd                    # data package
import matplotlib.pyplot as plt        # graphics module  
import datetime as dt                  # date and time module
import numpy as np                     # foundation for Pandas 

%matplotlib inline                     

# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())









    



Python version: 3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)]
Pandas version:  0.18.0
Today:  2016-04-14

Example: WEO debt and deficits

We spend most of our time on one of the examples from the previous notebook. The problem in this example is that variables run across rows, rather than down columns. Our want is to flip some of the rows and columns so that we can plot the data against time. The question is how.

We use a small subset of the IMF's World Economic Outlook database that contains two variables and three countries.



In [19]:

    
url1 = 'http://www.imf.org/external/pubs/ft/weo/2015/02/weodata/'
url2 = 'WEOOct2015all.xls'
url = url1 + url2 
weo = pd.read_csv(url, sep='\t', 
                  usecols=[1,2,3,4,6,40,41,42], 
                  thousands=',', 
                  na_values=['n/a', '--']) 
print('Variable dtypes:\n', weo.dtypes, sep='')









    



Variable dtypes:
ISO                    object
WEO Subject Code       object
Country                object
Subject Descriptor     object
Units                  object
2011                  float64
2012                  float64
2013                  float64
dtype: object



In [20]:

    
# create debt and deficits dataframe:  two variables and three countries 
variables = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
countries = ['ARG', 'DEU', 'GRC']
dd = weo[weo['WEO Subject Code'].isin(variables) & weo['ISO'].isin(countries)]

# change column labels to something more intuitive 
dd = dd.rename(columns={'WEO Subject Code': 'Variable', 
                        'Subject Descriptor': 'Description'})

# rename variables 
dd['Variable'] = dd['Variable'].replace(to_replace=['GGXWDG_NGDP', 'GGXCNL_NGDP'], 
                                        value=['Debt', 'Surplus'])
dd









    Out[20]:






  
    
      
      ISO
      Variable
      Country
      Description
      Units
      2011
      2012
      2013
    
  
  
    
      252
      ARG
      Surplus
      Argentina
      General government net lending/borrowing
      Percent of GDP
      -1.949
      -2.359
      -1.997
    
    
      260
      ARG
      Debt
      Argentina
      General government gross debt
      Percent of GDP
      35.763
      37.328
      40.217
    
    
      2716
      DEU
      Surplus
      Germany
      General government net lending/borrowing
      Percent of GDP
      -0.861
      0.091
      0.133
    
    
      2724
      DEU
      Debt
      Germany
      General government gross debt
      Percent of GDP
      77.928
      79.311
      76.978
    
    
      2804
      GRC
      Surplus
      Greece
      General government net lending/borrowing
      Percent of GDP
      -10.215
      -6.406
      -2.944
    
    
      2812
      GRC
      Debt
      Greece
      General government gross debt
      Percent of GDP
      170.960
      156.494
      174.951



In [ ]:

Reminders

What kind of object does each of the following produce?



In [21]:

    
dd.index









    Out[21]:





Int64Index([252, 260, 2716, 2724, 2804, 2812], dtype='int64')



In [22]:

    
dd.columns









    Out[22]:





Index(['ISO', 'Variable', 'Country', 'Description', 'Units', '2011', '2012',
       '2013'],
      dtype='object')



In [23]:

    
dd['ISO']









    Out[23]:





252     ARG
260     ARG
2716    DEU
2724    DEU
2804    GRC
2812    GRC
Name: ISO, dtype: object



In [24]:

    
dd[['ISO', 'Variable']]









    Out[24]:






  
    
      
      ISO
      Variable
    
  
  
    
      252
      ARG
      Surplus
    
    
      260
      ARG
      Debt
    
    
      2716
      DEU
      Surplus
    
    
      2724
      DEU
      Debt
    
    
      2804
      GRC
      Surplus
    
    
      2812
      GRC
      Debt



In [25]:

    
dd[dd['ISO'] == 'ARG']









    Out[25]:






  
    
      
      ISO
      Variable
      Country
      Description
      Units
      2011
      2012
      2013
    
  
  
    
      252
      ARG
      Surplus
      Argentina
      General government net lending/borrowing
      Percent of GDP
      -1.949
      -2.359
      -1.997
    
    
      260
      ARG
      Debt
      Argentina
      General government gross debt
      Percent of GDP
      35.763
      37.328
      40.217



In [ ]:

Wants

We might imagine doing several different things with this data:

Plot a specific variable (debt or surplus) for a given date.
Time series plots for a specific country.
Time series plots for a specific variable.

Depending on which we want, we might organize the data differently. We'll focus on the last two.

Here's a brute force approach to the problem: simply transpose the data. This is where that leads:



In [26]:

    
dd.T









    Out[26]:






  
    
      
      252
      260
      2716
      2724
      2804
      2812
    
  
  
    
      ISO
      ARG
      ARG
      DEU
      DEU
      GRC
      GRC
    
    
      Variable
      Surplus
      Debt
      Surplus
      Debt
      Surplus
      Debt
    
    
      Country
      Argentina
      Argentina
      Germany
      Germany
      Greece
      Greece
    
    
      Description
      General government net lending/borrowing
      General government gross debt
      General government net lending/borrowing
      General government gross debt
      General government net lending/borrowing
      General government gross debt
    
    
      Units
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
    
    
      2011
      -1.949
      35.763
      -0.861
      77.928
      -10.215
      170.96
    
    
      2012
      -2.359
      37.328
      0.091
      79.311
      -6.406
      156.494
    
    
      2013
      -1.997
      40.217
      0.133
      76.978
      -2.944
      174.951

Comments. The problem here is that the columns include both the numbers (which we want to plot) and some descriptive information (which we don't).



In [ ]:

Setting and resetting the index

We start by setting and resetting the index. That may sound like a step backwards -- haven't we done this already? -- but it reminds us of some things that will be handy later.

Take the dataframe dd. What would we like in the index? Evenutally we'd like the dates [2011, 2012, 2013], but right now the row labels are more naturally the variable or country. Here are some varriants.

Setting the index



In [27]:

    
dd.set_index('Country')









    Out[27]:






  
    
      
      ISO
      Variable
      Description
      Units
      2011
      2012
      2013
    
    
      Country
      
      
      
      
      
      
      
    
  
  
    
      Argentina
      ARG
      Surplus
      General government net lending/borrowing
      Percent of GDP
      -1.949
      -2.359
      -1.997
    
    
      Argentina
      ARG
      Debt
      General government gross debt
      Percent of GDP
      35.763
      37.328
      40.217
    
    
      Germany
      DEU
      Surplus
      General government net lending/borrowing
      Percent of GDP
      -0.861
      0.091
      0.133
    
    
      Germany
      DEU
      Debt
      General government gross debt
      Percent of GDP
      77.928
      79.311
      76.978
    
    
      Greece
      GRC
      Surplus
      General government net lending/borrowing
      Percent of GDP
      -10.215
      -6.406
      -2.944
    
    
      Greece
      GRC
      Debt
      General government gross debt
      Percent of GDP
      170.960
      156.494
      174.951



In [28]:

    
# we can do the same thing with a list, which will be meaningful soon...
dd.set_index(['Country'])









    Out[28]:






  
    
      
      ISO
      Variable
      Description
      Units
      2011
      2012
      2013
    
    
      Country
      
      
      
      
      
      
      
    
  
  
    
      Argentina
      ARG
      Surplus
      General government net lending/borrowing
      Percent of GDP
      -1.949
      -2.359
      -1.997
    
    
      Argentina
      ARG
      Debt
      General government gross debt
      Percent of GDP
      35.763
      37.328
      40.217
    
    
      Germany
      DEU
      Surplus
      General government net lending/borrowing
      Percent of GDP
      -0.861
      0.091
      0.133
    
    
      Germany
      DEU
      Debt
      General government gross debt
      Percent of GDP
      77.928
      79.311
      76.978
    
    
      Greece
      GRC
      Surplus
      General government net lending/borrowing
      Percent of GDP
      -10.215
      -6.406
      -2.944
    
    
      Greece
      GRC
      Debt
      General government gross debt
      Percent of GDP
      170.960
      156.494
      174.951

Exercise. Set Variable as the index.



In [ ]:

Comment. Note that the new index brought its name along: Country in the two examples, Variable in the exercise. That's incredibly useful because we can refer to index levels by name. If we happen to have an index without a name, we can set it with

df.index.name = 'Whatever name we like'

Multi-indexes

We can put more than one variable in an index, which gives us a multi-index. This is sometimes called a hierarchical index because the levels of the index (as they're called) are ordered.

Multi-indexes are more common than you might think. One reason is that data itself is often multi-dimensional. A typical spreadsheet has two dimensions: the variable and the observation. The WEO data is naturally three dimensional: the variable, the year, and the country. (Think about that for a minute, it's deeper than it sounds.)

The problem we're having is fitting this nicely into two dimensions. A multi-index allows us to manage that. A two-dimensional index would work here -- the country and the variable code -- but right now we have some redundancy.

Example. We push all the descriptive, non-numerical columns into the index, leaving the dataframe itself with only numbers, which seems like a step in thee right direction.



In [29]:

    
ddi = dd.set_index(['Variable', 'Country', 'ISO', 'Description', 'Units'])
ddi









    Out[29]:






  
    
      
      
      
      
      
      2011
      2012
      2013
    
    
      Variable
      Country
      ISO
      Description
      Units
      
      
      
    
  
  
    
      Surplus
      Argentina
      ARG
      General government net lending/borrowing
      Percent of GDP
      -1.949
      -2.359
      -1.997
    
    
      Debt
      Argentina
      ARG
      General government gross debt
      Percent of GDP
      35.763
      37.328
      40.217
    
    
      Surplus
      Germany
      DEU
      General government net lending/borrowing
      Percent of GDP
      -0.861
      0.091
      0.133
    
    
      Debt
      Germany
      DEU
      General government gross debt
      Percent of GDP
      77.928
      79.311
      76.978
    
    
      Surplus
      Greece
      GRC
      General government net lending/borrowing
      Percent of GDP
      -10.215
      -6.406
      -2.944
    
    
      Debt
      Greece
      GRC
      General government gross debt
      Percent of GDP
      170.960
      156.494
      174.951

Let's take a closer look at the index



In [30]:

    
ddi.index









    Out[30]:





MultiIndex(levels=[['Debt', 'Surplus'], ['Argentina', 'Germany', 'Greece'], ['ARG', 'DEU', 'GRC'], ['General government gross debt', 'General government net lending/borrowing'], ['Percent of GDP']],
           labels=[[1, 0, 1, 0, 1, 0], [0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2], [1, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0]],
           names=['Variable', 'Country', 'ISO', 'Description', 'Units'])

That's a lot to process, so we break it into pieces.

ddi.index.names contains a list of level names. (Remind yourself that lists are ordered, so this tracks levels.)
ddi.index.levels contains the values in each level.

Here's what they like like here:



In [31]:

    
# Chase and Spencer like double quotes 
print("The level names are:\n", ddi.index.names, "\n", sep="")
print("The levels (aka level values) are:\n", ddi.index.levels, sep="")









    



The level names are:
['Variable', 'Country', 'ISO', 'Description', 'Units']

The levels (aka level values) are:
[['Debt', 'Surplus'], ['Argentina', 'Germany', 'Greece'], ['ARG', 'DEU', 'GRC'], ['General government gross debt', 'General government net lending/borrowing'], ['Percent of GDP']]

Knowing the order of the index components and being able to inspect their values and names is fundamental to working with a multi-index.

Resetting the index

We've seen that set_index pushes columns into the index. Here we see that reset_index does the reverse: it pushes components of the index back to the columns.

Example.



In [32]:

    
ddi.head(2)









    Out[32]:






  
    
      
      
      
      
      
      2011
      2012
      2013
    
    
      Variable
      Country
      ISO
      Description
      Units
      
      
      
    
  
  
    
      Surplus
      Argentina
      ARG
      General government net lending/borrowing
      Percent of GDP
      -1.949
      -2.359
      -1.997
    
    
      Debt
      Argentina
      ARG
      General government gross debt
      Percent of GDP
      35.763
      37.328
      40.217



In [33]:

    
ddi.reset_index()









    Out[33]:






  
    
      
      Variable
      Country
      ISO
      Description
      Units
      2011
      2012
      2013
    
  
  
    
      0
      Surplus
      Argentina
      ARG
      General government net lending/borrowing
      Percent of GDP
      -1.949
      -2.359
      -1.997
    
    
      1
      Debt
      Argentina
      ARG
      General government gross debt
      Percent of GDP
      35.763
      37.328
      40.217
    
    
      2
      Surplus
      Germany
      DEU
      General government net lending/borrowing
      Percent of GDP
      -0.861
      0.091
      0.133
    
    
      3
      Debt
      Germany
      DEU
      General government gross debt
      Percent of GDP
      77.928
      79.311
      76.978
    
    
      4
      Surplus
      Greece
      GRC
      General government net lending/borrowing
      Percent of GDP
      -10.215
      -6.406
      -2.944
    
    
      5
      Debt
      Greece
      GRC
      General government gross debt
      Percent of GDP
      170.960
      156.494
      174.951



In [34]:

    
# or we can reset the index by level 
ddi.reset_index(level=1).head(2)









    Out[34]:






  
    
      
      
      
      
      Country
      2011
      2012
      2013
    
    
      Variable
      ISO
      Description
      Units
      
      
      
      
    
  
  
    
      Surplus
      ARG
      General government net lending/borrowing
      Percent of GDP
      Argentina
      -1.949
      -2.359
      -1.997
    
    
      Debt
      ARG
      General government gross debt
      Percent of GDP
      Argentina
      35.763
      37.328
      40.217



In [35]:

    
# or by name 
ddi.reset_index(level='Country').head(2)









    Out[35]:






  
    
      
      
      
      
      Country
      2011
      2012
      2013
    
    
      Variable
      ISO
      Description
      Units
      
      
      
      
    
  
  
    
      Surplus
      ARG
      General government net lending/borrowing
      Percent of GDP
      Argentina
      -1.949
      -2.359
      -1.997
    
    
      Debt
      ARG
      General government gross debt
      Percent of GDP
      Argentina
      35.763
      37.328
      40.217



In [36]:

    
# or do more than one at a time
ddi.reset_index(level=[1,3]).head(2)









    Out[36]:






  
    
      
      
      
      Country
      Description
      2011
      2012
      2013
    
    
      Variable
      ISO
      Units
      
      
      
      
      
    
  
  
    
      Surplus
      ARG
      Percent of GDP
      Argentina
      General government net lending/borrowing
      -1.949
      -2.359
      -1.997
    
    
      Debt
      ARG
      Percent of GDP
      Argentina
      General government gross debt
      35.763
      37.328
      40.217

Comment. By default, reset_index pushes one or more index levels into columns. If we want to discard that level of the index altogether, we use the parameter drop=True.



In [37]:

    
ddi.reset_index(level=[1,3], drop=True).head(2)









    Out[37]:






  
    
      
      
      
      2011
      2012
      2013
    
    
      Variable
      ISO
      Units
      
      
      
    
  
  
    
      Surplus
      ARG
      Percent of GDP
      -1.949
      -2.359
      -1.997
    
    
      Debt
      ARG
      Percent of GDP
      35.763
      37.328
      40.217

Exercise. For the dataframe ddi:

Use the reset_index method to move the Units level of the index to a column of the dataframe.
Use the drop parameter of reset_index to delete Units from the dataframe.



In [ ]:



In [ ]:

Switching rows and columns

If we take the dataframe ddi, we see that the everything's been put into the index but the data itself. Perhaps we can get what we want if we just flip the rows and columns. Roughly speaking, we refer to this as pivoting.

First look at switching rows and columns

The simplest way to flip rows and columns is to use the T or transpose property. When we do that, we end up with a lot of stuff in the column labels, as the multi-index for the rows gets rotated into the columns. Other than that, we're good. We can even do a plot. The only problem is all the stuff we've pushed into the column labels -- it's kind of a mess.



In [38]:

    
ddt = ddi.T
ddt









    Out[38]:






  
    
      Variable
      Surplus
      Debt
      Surplus
      Debt
      Surplus
      Debt
    
    
      Country
      Argentina
      Argentina
      Germany
      Germany
      Greece
      Greece
    
    
      ISO
      ARG
      ARG
      DEU
      DEU
      GRC
      GRC
    
    
      Description
      General government net lending/borrowing
      General government gross debt
      General government net lending/borrowing
      General government gross debt
      General government net lending/borrowing
      General government gross debt
    
    
      Units
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
    
  
  
    
      2011
      -1.949
      35.763
      -0.861
      77.928
      -10.215
      170.960
    
    
      2012
      -2.359
      37.328
      0.091
      79.311
      -6.406
      156.494
    
    
      2013
      -1.997
      40.217
      0.133
      76.978
      -2.944
      174.951

Comment. We see here that the multi-index for the rows has been turned into a multi-index for the columns. Works the same way.

The only problem here is that the column labels are more complicated than we might want. Here, for example, is what we get with the plot method. As usual, .plot() plots all the columns of the dataframe, but here that means we're mixing variables. And the legend contains all the levels of the column labels.



In [39]:

    
ddt.plot()









    Out[39]:





<matplotlib.axes._subplots.AxesSubplot at 0x91a4fd0>

Comment. Ooooh, that's ugly! We're on the right track, but evidently not there yet.



In [ ]:

Referring to variables with a multi-index

Can we refer to variables in the same way? Sort of, as long as we refer to the top level of the column index. It gives us a dataframe that's a subset of the original one.

Let's try each of these:

ddt['Debt']
ddt['Debt']['Argentina']
ddt['Debt', 'Argentina']
ddt['ARG']

What do you see? What's going on? The theme is that we can reference the top level, which in ddi is the Variable. If we try to access a lower level, it bombs.



In [40]:

    
# indexing by variable 
debt = ddt['Debt']

Exercise. With the dataframe ddt:

What type of object is Debt?
Construct a line plot of Debt over time with one line for each country.



In [ ]:



In [ ]:



In [ ]:

Example. Let's do this together. How would we fix up the legend? What approaches cross your mind? (No code, just the general approach.)



In [ ]:

Swapping levels

Since variables refer to the first level of the column index, it's not clear how we would group data by country. Suppose, for example, we wanted to plot Debt and Surplus for a specific country. What would we do?

One way to do that is to make the country the top level with the swaplevel method. Note the axis parameter. With axis=1 we swap column levels, with axis=0 (the default) we swap row levels.



In [41]:

    
ddts = ddt.swaplevel(0,1, axis=1) 
ddts









    Out[41]:






  
    
      Country
      Argentina
      Germany
      Greece
    
    
      Variable
      Surplus
      Debt
      Surplus
      Debt
      Surplus
      Debt
    
    
      ISO
      ARG
      ARG
      DEU
      DEU
      GRC
      GRC
    
    
      Description
      General government net lending/borrowing
      General government gross debt
      General government net lending/borrowing
      General government gross debt
      General government net lending/borrowing
      General government gross debt
    
    
      Units
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
    
  
  
    
      2011
      -1.949
      35.763
      -0.861
      77.928
      -10.215
      170.960
    
    
      2012
      -2.359
      37.328
      0.091
      79.311
      -6.406
      156.494
    
    
      2013
      -1.997
      40.217
      0.133
      76.978
      -2.944
      174.951

Exercise. Use the dataframe ddts to plot Debt and Surplus across time for Argentina. Hint: In the plot method, set subplots=True so that each variable is in a separate subplot.



In [ ]:

The `xs` method

Another approach to extracting data that cuts across levels of the row or column index: the xs method. This is recent addition tpo Pandas and an extremely good method once you get the hang of it.

The basic syntax is df.xs(item, axis=X, level=N), where N is the name or number of an index level and X describes if we are extracting from the index or column names. Setting X=0 (so axis=0) will slice up the data along the index, X=1 extracts data for column labels.

Here's how we could use xs to get the Argentina data without swapping the level of the column labels



In [42]:

    
#ddt.xs?



In [43]:

    
ddt.xs("Argentina", axis=1, level="Country")









    Out[43]:






  
    
      Variable
      Surplus
      Debt
    
    
      ISO
      ARG
      ARG
    
    
      Description
      General government net lending/borrowing
      General government gross debt
    
    
      Units
      Percent of GDP
      Percent of GDP
    
  
  
    
      2011
      -1.949
      35.763
    
    
      2012
      -2.359
      37.328
    
    
      2013
      -1.997
      40.217



In [44]:

    
ddt.xs("Argentina", axis=1, level="Country")["Debt"]









    Out[44]:






  
    
      ISO
      ARG
    
    
      Description
      General government gross debt
    
    
      Units
      Percent of GDP
    
  
  
    
      2011
      35.763
    
    
      2012
      37.328
    
    
      2013
      40.217

Exercise. Use a combination of xs and standard slicing with [...] to extract the variable Debt for Greece.



In [ ]:

Exercise. Use the dataframe ddt -- and the xs method -- to plot Debt and Surplus across time for Argentina.



In [ ]:

Stacking and unstacking

The set_index and reset_index methods work on the row labels -- the index. They move columns to the index and the reverse. The stack and unstack methods move index levels to and from column levels:

stack stacks the data up, moving the columns to the index and creating a long dataframe.
unstack does the reverse, moving columns or index levels into the column labels and creating a wide dataframe.

We use both to shape (or reshape) our data. We use set_index to push things into the index. And then use reset_index to push some of them back to the columns. That gives us pretty fine-grainded control over the shape of our data.

We start by simplifying our initial dataframe.



In [45]:

    
# drop some of the index levels (think s for small)
dds = ddi.reset_index(level=[1,3,4], drop=True)
# give a name to the column labels
dds.columns.name = 'Year'  
dds

Let's remind ourselves what we want. We want to

move the column index (Year) into the row index
move the Variable and ISO levels the other way, into the column labels.

The first one uses stack, the second one unstack.



In [ ]:

Stacking

We stack our data, one variable on top of another, with a multi-index to keep track of what's what. In simple terms, we change the data from a wide format to a long format. The stack method takes the lowest column level and makes it the lowest row level.



In [46]:

    
# convert to long format. Notice printing is different... what `type` is ds?
ds = dds.stack()    
ds









    Out[46]:





Variable  ISO  Year
Surplus   ARG  2011     -1.949
               2012     -2.359
               2013     -1.997
Debt      ARG  2011     35.763
               2012     37.328
               2013     40.217
Surplus   DEU  2011     -0.861
               2012      0.091
               2013      0.133
Debt      DEU  2011     77.928
               2012     79.311
               2013     76.978
Surplus   GRC  2011    -10.215
               2012     -6.406
               2013     -2.944
Debt      GRC  2011    170.960
               2012    156.494
               2013    174.951
dtype: float64



In [47]:

    
# same thing with explicit reference to column name 
dds.stack(level='Year').head(8)









    Out[47]:





Variable  ISO  Year
Surplus   ARG  2011    -1.949
               2012    -2.359
               2013    -1.997
Debt      ARG  2011    35.763
               2012    37.328
               2013    40.217
Surplus   DEU  2011    -0.861
               2012     0.091
dtype: float64



In [48]:

    
# or with level number 
dds.stack(level=0).head(8)









    Out[48]:





Variable  ISO  Year
Surplus   ARG  2011    -1.949
               2012    -2.359
               2013    -1.997
Debt      ARG  2011    35.763
               2012    37.328
               2013    40.217
Surplus   DEU  2011    -0.861
               2012     0.091
dtype: float64



In [ ]:

Unstacking

Stacking moves columns into the index, "stacking" the data up into longer columns. Unstacking does the reverse, taking levels of the row index and turning them into column labels. Roughly speaking we're rotating or pivoting the data.



In [49]:

    
# now go long to wide 
ds.unstack()  # default is lowest value level='ISO'



In [50]:

    
# different level 
ds.unstack(level='Variable')



In [51]:

    
# or two at once 
ds.unstack(level=['Variable', 'ISO'])



In [ ]:

Exercise. Run the code below and explain what each line of code does.



In [52]:

    
# stacked dataframe 
ds.head(8)









    Out[52]:





Variable  ISO  Year
Surplus   ARG  2011    -1.949
               2012    -2.359
               2013    -1.997
Debt      ARG  2011    35.763
               2012    37.328
               2013    40.217
Surplus   DEU  2011    -0.861
               2012     0.091
dtype: float64



In [53]:

    
du1 = ds.unstack() 
du2 = du1.unstack()



In [ ]:

Exercise (challenging). Take the unstacked dataframe dds. Use some combination of stack, unstack, and plot to plot the variable Surplus against Year for all three countries. Challenging mostly because you need to work out the steps by yourself.



In [ ]:



In [ ]:

Pivoting

The pivot method: a short cut to some kinds of unstacking. In rough terms, it takes a wide dataframe and constructs a long one. The inputs are columns, not index levels.

Example: BDS data

The Census's Business Dynamnics Statistics collects annual information about the hiring decisions of firms by size and age. This table list the number of firms and total employment by employment size categories: 1 to 4 employees, 5 to 9, and so on.

Apply want operator. Our want is to plot total employment (the variable Emp) against size (variable fsize). Both are columns in the original data.

Here we construct a subset of the data, where we look at two years rather than the whole 1976-2013 period.



In [54]:

    
url = 'http://www2.census.gov/ces/bds/firm/bds_f_sz_release.csv'
raw = pd.read_csv(url)
raw.head()

sizes = ['a) 1 to 4', 'b) 5 to 9', 'c) 10 to 19', 'd) 20 to 49']
bds = raw[(raw['year2']>=2012) & raw['fsize'].isin(sizes)][['year2', 'fsize', 'Firms', 'Emp']]
bds









    Out[54]:






  
    
      
      year2
      fsize
      Firms
      Emp
    
  
  
    
      420
      2012
      a) 1 to 4
      2832751
      6019263
    
    
      421
      2012
      b) 5 to 9
      1017942
      6715603
    
    
      422
      2012
      c) 10 to 19
      595848
      8123271
    
    
      423
      2012
      d) 20 to 49
      368391
      11300613
    
    
      432
      2013
      a) 1 to 4
      2846416
      5998912
    
    
      433
      2013
      b) 5 to 9
      1020772
      6714924
    
    
      434
      2013
      c) 10 to 19
      598153
      8151891
    
    
      435
      2013
      d) 20 to 49
      373345
      11425545

Pivoting the data

Let's think specifically about what we want. We want to graph Emp against fsize for (say) 2013. This calls for:

The index should be the size categories fsize.
The column labels should be the entries of year2, namely 2012 and 2013.
The data should come from the variable Emp.

These inputs translate directly into the following pivot method:



In [55]:

    
# pivot and divide by a million (dividing so bars aren't too long)
bdsp = bds.pivot(index='fsize', columns='year2', values='Emp')/10**6 
bdsp









    Out[55]:






  
    
      year2
      2012
      2013
    
    
      fsize
      
      
    
  
  
    
      a) 1 to 4
      6.019263
      5.998912
    
    
      b) 5 to 9
      6.715603
      6.714924
    
    
      c) 10 to 19
      8.123271
      8.151891
    
    
      d) 20 to 49
      11.300613
      11.425545

Comment. Note that all the parameters here are columns. That's not a choice, it's the way the the pivot method is written.

We do a plot for fun:



In [56]:

    
# plot 2013 as bar chart 
fig, ax = plt.subplots()
bdsp[2013].plot.barh(ax=ax)
ax.set_ylabel('')
ax.set_xlabel('Number of Employees (millions)')









    Out[56]:





<matplotlib.text.Text at 0x9587198>



In [ ]:

Review

We return to the OECD's healthcare data, specifically a subset of their table on the number of doctors per one thousand population. This loads and cleans the data:



In [57]:

    
url1 = 'http://www.oecd.org/health/health-systems/'
url2 = 'OECD-Health-Statistics-2015-Frequently-Requested-Data.xls'
docs = pd.read_excel(url1+url2, 
                     skiprows=3, 
                     usecols=[0, 51, 52, 53, 54, 55, 57],
                     sheetname='Physicians', 
                     na_values=['..'],
                     skip_footer=21) 
# rename country variable 
names = list(docs)
docs = docs.rename(columns={names[0]: 'Country'})
# strip footnote numbers from country names 
docs['Country'] = docs['Country'].str.rsplit(n=1).str.get(0)
docs = docs.head()
docs









    Out[57]:






  
    
      
      Country
      2010
      2011
      2012
      2013
      2014
      2013 (or nearest year)
    
  
  
    
      0
      Australia
      NaN
      3.31
      3.31
      3.39
      NaN
      3.39
    
    
      1
      Austria
      4.80
      4.84
      4.90
      4.99
      NaN
      4.99
    
    
      2
      Belgium
      2.92
      2.91
      2.93
      2.95
      NaN
      2.95
    
    
      3
      Canada
      2.37
      2.44
      2.50
      2.55
      NaN
      2.55
    
    
      4
      Chile
      1.43
      1.58
      1.74
      1.88
      2.02
      1.88

Use this data to:

Set the index as Country.
Construct a horizontal bar chart of the number of doctors in each country in "2013 (or nearest year)".
Apply the drop method to docs to create a dataframe new that's missing the last column.
Challenging. Use stack and unstack to "pivot" the data so that columns are labeled by country names and rows are labeled by year. This is challenging because we have left out the intermediate steps.
Plot the number of doctors over time in each country as a line in the same plot.

Comment. In the last plot, the x axis labels are non-intuitive. Ignore that.



In [ ]:



In [ ]:



In [ ]:

Resources

Far and away the best material on this subject is Brandon Rhodes' 2015 Pycon presentation. 2 hours and 25 minutes and worth every second.

Video: https://youtu.be/5JnMutdy6Fw
Materials: https://github.com/brandon-rhodes/pycon-pandas-tutorial
Outline: https://github.com/brandon-rhodes/pycon-pandas-tutorial/blob/master/script.txt



In [ ]:

	ISO	Variable	Country	Description	Units	2011	2012	2013
252	ARG	Surplus	Argentina	General government net lending/borrowing	Percent of GDP	-1.949	-2.359	-1.997
260	ARG	Debt	Argentina	General government gross debt	Percent of GDP	35.763	37.328	40.217
2716	DEU	Surplus	Germany	General government net lending/borrowing	Percent of GDP	-0.861	0.091	0.133
2724	DEU	Debt	Germany	General government gross debt	Percent of GDP	77.928	79.311	76.978
2804	GRC	Surplus	Greece	General government net lending/borrowing	Percent of GDP	-10.215	-6.406	-2.944
2812	GRC	Debt	Greece	General government gross debt	Percent of GDP	170.960	156.494	174.951

	252	260	2716	2724	2804	2812
ISO	ARG	ARG	DEU	DEU	GRC	GRC
Variable	Surplus	Debt	Surplus	Debt	Surplus	Debt
Country	Argentina	Argentina	Germany	Germany	Greece	Greece
Description	General government net lending/borrowing	General government gross debt	General government net lending/borrowing	General government gross debt	General government net lending/borrowing	General government gross debt
Units	Percent of GDP	Percent of GDP	Percent of GDP	Percent of GDP	Percent of GDP	Percent of GDP
2011	-1.949	35.763	-0.861	77.928	-10.215	170.96
2012	-2.359	37.328	0.091	79.311	-6.406	156.494
2013	-1.997	40.217	0.133	76.978	-2.944	174.951

	Variable	Debt	Surplus
ISO	Year
ARG	2011	35.763	-1.949
	2012	37.328	-2.359
	2013	40.217	-1.997
DEU	2011	77.928	-0.861
	2012	79.311	0.091
	2013	76.978	0.133
GRC	2011	170.960	-10.215
	2012	156.494	-6.406
	2013	174.951	-2.944

	year2	fsize	Firms	Emp
420	2012	a) 1 to 4	2832751	6019263
421	2012	b) 5 to 9	1017942	6715603
422	2012	c) 10 to 19	595848	8123271
423	2012	d) 20 to 49	368391	11300613
432	2013	a) 1 to 4	2846416	5998912
433	2013	b) 5 to 9	1020772	6714924
434	2013	c) 10 to 19	598153	8151891
435	2013	d) 20 to 49	373345	11425545

year2	2012	2013
fsize
a) 1 to 4	6.019263	5.998912
b) 5 to 9	6.715603	6.714924
c) 10 to 19	8.123271	8.151891
d) 20 to 49	11.300613	11.425545

	Country	2010	2011	2012	2013	2014	2013 (or nearest year)
0	Australia	NaN	3.31	3.31	3.39	NaN	3.39
1	Austria	4.80	4.84	4.90	4.99	NaN	4.99
2	Belgium	2.92	2.91	2.93	2.95	NaN	2.95
3	Canada	2.37	2.44	2.50	2.55	NaN	2.55
4	Chile	1.43	1.58	1.74	1.88	2.02	1.88