Advanced Pandas: Shaping data

The second in a series of notebooks that describe Pandas' powerful data management tools. This one covers shaping methods: switching rows and columns, pivoting, and stacking. We'll see that this is all about the indexes: the row and column labels.

Outline:

Example: WEO debt and deficits. Something to work with.
Indexing. Setting and resetting the index. Multi-indexes.
Switching rows and columns. Transpose. Referring to variables with multi-indexes.
Stack and unstack. Managing column structure and labels.
Pivot. Unstack shortcut if we start with wide data.
Review. Apply what we've learned.

tl;dr

Let df be a DataFrame

We use df.set_index to move columns into the index of df
We use df.reset_index to move one or more levels of the index back to columns. If we set drop=True, the requested index levels are simply thrown away instead of made into columns
We use df.stack to move column index levels into the row index
We use df.unstack to move row index levels into the colunm index (Helpful mnemonic: unstack moves index levels up)

Preliminaries

Import packages, etc



In [3]:

    
%matplotlib inline                     

import pandas as pd                    # data package
import matplotlib.pyplot as plt        # graphics module  
import datetime as dt                  # date and time module
import numpy as np                     # foundation for Pandas

Example: WEO debt and deficits

We spend most of our time on one of the examples from the previous notebook. The problem in this example is that variables run across rows, rather than down columns. Our want is to flip some of the rows and columns so that we can plot the data against time. The question is how.

We use a small subset of the IMF's World Economic Outlook database that contains two variables and three countries.



In [4]:

    
url = 'http://www.imf.org/external/pubs/ft/weo/2016/02/weodata/WEOOct2016all.xls'

# (1) define the column indices 
col_indices = [1, 2, 3, 4, 6] + list(range(9, 46))

# (2) download the dataset
weo = pd.read_csv(url, 
                  sep = '\t',
                  #index_col='ISO',
                  usecols=col_indices,
                  skipfooter=1, engine='python',
                  na_values=['n/a', '--'],
                  thousands =',',encoding='windows-1252')

# (3) turn the types of year variables into float 
years = [str(year) for year in range(1980, 2017)]
weo[years] = weo[years].astype(float)

print('Variable dtypes:\n', weo.dtypes, sep='')









    



Variable dtypes:
ISO                    object
WEO Subject Code       object
Country                object
Subject Descriptor     object
Units                  object
1980                  float64
1981                  float64
1982                  float64
1983                  float64
1984                  float64
1985                  float64
1986                  float64
1987                  float64
1988                  float64
1989                  float64
1990                  float64
1991                  float64
1992                  float64
1993                  float64
1994                  float64
1995                  float64
1996                  float64
1997                  float64
1998                  float64
1999                  float64
2000                  float64
2001                  float64
2002                  float64
2003                  float64
2004                  float64
2005                  float64
2006                  float64
2007                  float64
2008                  float64
2009                  float64
2010                  float64
2011                  float64
2012                  float64
2013                  float64
2014                  float64
2015                  float64
2016                  float64
dtype: object



In [5]:

    
# create debt and deficits dataframe:  two variables and three countries 
variables = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
countries = ['ARG', 'DEU', 'GRC']
dd = weo[weo['WEO Subject Code'].isin(variables) & weo['ISO'].isin(countries)]



In [6]:

    
# change column labels to something more intuitive 
dd = dd.rename(columns={'WEO Subject Code': 'Variable', 
                        'Subject Descriptor': 'Description'})



In [7]:

    
# rename variables (i.e. values of observables)
dd['Variable'] = dd['Variable'].replace(to_replace=['GGXWDG_NGDP', 'GGXCNL_NGDP'], value=['Debt', 'Surplus'])
dd









    Out[7]:







  
    
      
      ISO
      Variable
      Country
      Description
      Units
      1980
      1981
      1982
      1983
      1984
      ...
      2007
      2008
      2009
      2010
      2011
      2012
      2013
      2014
      2015
      2016
    
  
  
    
      252
      ARG
      Surplus
      Argentina
      General government net lending/borrowing
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      -0.056
      0.242
      -2.434
      -1.272
      -2.552
      -2.802
      -3.001
      -3.990
      -6.630
      -7.123
    
    
      260
      ARG
      Debt
      Argentina
      General government gross debt
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      50.842
      43.853
      53.828
      42.624
      38.063
      39.434
      42.195
      43.587
      52.131
      51.751
    
    
      2716
      DEU
      Surplus
      Germany
      General government net lending/borrowing
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      0.187
      -0.177
      -3.235
      -4.221
      -0.957
      -0.050
      -0.203
      0.278
      0.745
      0.129
    
    
      2724
      DEU
      Debt
      Germany
      General government gross debt
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      63.542
      64.926
      72.430
      81.004
      78.311
      79.516
      77.058
      74.480
      70.988
      68.169
    
    
      2804
      GRC
      Surplus
      Greece
      General government net lending/borrowing
      Percent of GDP
      -2.774
      -7.847
      -6.092
      -6.847
      -7.976
      ...
      -6.707
      -10.168
      -15.152
      -11.208
      -10.243
      -6.506
      -3.531
      -4.054
      -3.080
      -3.359
    
    
      2812
      GRC
      Debt
      Greece
      General government gross debt
      Percent of GDP
      22.526
      26.681
      29.310
      33.591
      40.061
      ...
      103.103
      109.416
      126.745
      146.250
      172.096
      159.565
      177.677
      180.063
      176.938
      183.436
    
  

6 rows × 42 columns



In [ ]:

Reminders

What kind of object does each of the following produce?



In [8]:

    
dd.index









    Out[8]:





Int64Index([252, 260, 2716, 2724, 2804, 2812], dtype='int64')



In [9]:

    
dd.columns









    Out[9]:





Index(['ISO', 'Variable', 'Country', 'Description', 'Units', '1980', '1981',
       '1982', '1983', '1984', '1985', '1986', '1987', '1988', '1989', '1990',
       '1991', '1992', '1993', '1994', '1995', '1996', '1997', '1998', '1999',
       '2000', '2001', '2002', '2003', '2004', '2005', '2006', '2007', '2008',
       '2009', '2010', '2011', '2012', '2013', '2014', '2015', '2016'],
      dtype='object')



In [10]:

    
dd['ISO']









    Out[10]:





252     ARG
260     ARG
2716    DEU
2724    DEU
2804    GRC
2812    GRC
Name: ISO, dtype: object



In [11]:

    
dd[['ISO', 'Variable']]









    Out[11]:







  
    
      
      ISO
      Variable
    
  
  
    
      252
      ARG
      Surplus
    
    
      260
      ARG
      Debt
    
    
      2716
      DEU
      Surplus
    
    
      2724
      DEU
      Debt
    
    
      2804
      GRC
      Surplus
    
    
      2812
      GRC
      Debt



In [12]:

    
dd[dd['ISO'] == 'ARG']









    Out[12]:







  
    
      
      ISO
      Variable
      Country
      Description
      Units
      1980
      1981
      1982
      1983
      1984
      ...
      2007
      2008
      2009
      2010
      2011
      2012
      2013
      2014
      2015
      2016
    
  
  
    
      252
      ARG
      Surplus
      Argentina
      General government net lending/borrowing
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      -0.056
      0.242
      -2.434
      -1.272
      -2.552
      -2.802
      -3.001
      -3.990
      -6.630
      -7.123
    
    
      260
      ARG
      Debt
      Argentina
      General government gross debt
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      50.842
      43.853
      53.828
      42.624
      38.063
      39.434
      42.195
      43.587
      52.131
      51.751
    
  

2 rows × 42 columns



In [ ]:

Wants

We might imagine doing several different things with this data:

Plot a specific variable (debt or surplus) for a given date.
Time series plots for a specific country.
Time series plots for a specific variable.

Depending on which we want, we might organize the data differently. We'll focus on the last two.

Here's a brute force approach to the problem: simply transpose the data. This is where that leads:



In [13]:

    
dd.T









    Out[13]:







  
    
      
      252
      260
      2716
      2724
      2804
      2812
    
  
  
    
      ISO
      ARG
      ARG
      DEU
      DEU
      GRC
      GRC
    
    
      Variable
      Surplus
      Debt
      Surplus
      Debt
      Surplus
      Debt
    
    
      Country
      Argentina
      Argentina
      Germany
      Germany
      Greece
      Greece
    
    
      Description
      General government net lending/borrowing
      General government gross debt
      General government net lending/borrowing
      General government gross debt
      General government net lending/borrowing
      General government gross debt
    
    
      Units
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
    
    
      1980
      NaN
      NaN
      NaN
      NaN
      -2.774
      22.526
    
    
      1981
      NaN
      NaN
      NaN
      NaN
      -7.847
      26.681
    
    
      1982
      NaN
      NaN
      NaN
      NaN
      -6.092
      29.31
    
    
      1983
      NaN
      NaN
      NaN
      NaN
      -6.847
      33.591
    
    
      1984
      NaN
      NaN
      NaN
      NaN
      -7.976
      40.061
    
    
      1985
      NaN
      NaN
      NaN
      NaN
      -10.603
      46.621
    
    
      1986
      NaN
      NaN
      NaN
      NaN
      -9.559
      47.142
    
    
      1987
      NaN
      NaN
      NaN
      NaN
      -8.972
      52.413
    
    
      1988
      NaN
      NaN
      NaN
      NaN
      -10.675
      57.069
    
    
      1989
      NaN
      NaN
      NaN
      NaN
      -13.195
      59.821
    
    
      1990
      NaN
      NaN
      NaN
      NaN
      -14.738
      73.155
    
    
      1991
      NaN
      NaN
      -3.167
      38.893
      -10.677
      74.683
    
    
      1992
      NaN
      25.746
      -2.594
      41.521
      -11.806
      79.968
    
    
      1993
      -0.025
      27.739
      -3.083
      45.119
      -12.701
      100.288
    
    
      1994
      -1.289
      29.31
      -2.506
      47.421
      -9.401
      98.296
    
    
      1995
      -2.137
      31.673
      -9.427
      54.82
      -10.944
      98.99
    
    
      1996
      -2.898
      33.561
      -3.542
      57.598
      -9.098
      101.335
    
    
      1997
      -1.89
      32.662
      -2.945
      58.773
      -6.671
      99.452
    
    
      1998
      -1.882
      35.2
      -2.532
      59.385
      -6.458
      97.425
    
    
      1999
      -3.813
      40.097
      -1.698
      59.962
      -6.06
      98.906
    
    
      2000
      -3.326
      42.06
      0.859
      58.758
      -4.113
      104.935
    
    
      2001
      -5.424
      49.436
      -3.112
      57.582
      -5.466
      107.081
    
    
      2002
      -2.123
      152.112
      -3.944
      59.22
      -6.024
      104.863
    
    
      2003
      1.173
      128.561
      -4.176
      62.921
      -7.83
      101.456
    
    
      2004
      3.549
      117.116
      -3.741
      64.689
      -8.828
      102.87
    
    
      2005
      2.187
      66.938
      -3.417
      66.909
      -6.188
      107.392
    
    
      2006
      1.709
      58.622
      -1.722
      66.333
      -5.946
      103.574
    
    
      2007
      -0.056
      50.842
      0.187
      63.542
      -6.707
      103.103
    
    
      2008
      0.242
      43.853
      -0.177
      64.926
      -10.168
      109.416
    
    
      2009
      -2.434
      53.828
      -3.235
      72.43
      -15.152
      126.745
    
    
      2010
      -1.272
      42.624
      -4.221
      81.004
      -11.208
      146.25
    
    
      2011
      -2.552
      38.063
      -0.957
      78.311
      -10.243
      172.096
    
    
      2012
      -2.802
      39.434
      -0.05
      79.516
      -6.506
      159.565
    
    
      2013
      -3.001
      42.195
      -0.203
      77.058
      -3.531
      177.677
    
    
      2014
      -3.99
      43.587
      0.278
      74.48
      -4.054
      180.063
    
    
      2015
      -6.63
      52.131
      0.745
      70.988
      -3.08
      176.938
    
    
      2016
      -7.123
      51.751
      0.129
      68.169
      -3.359
      183.436

Comments. The problem here is that the columns include both the numbers (which we want to plot) and some descriptive information (which we don't).



In [ ]:

Setting and resetting the index

We start by setting and resetting the index. That may sound like a step backwards -- haven't we done this already? -- but it reminds us of some things that will be handy later.

Take the dataframe dd. What would we like in the index? Evenutally we'd like the dates llke [2011, 2012, 2013], but right now the row labels are more naturally the variable or country. Here are some varriants.

Setting the index



In [14]:

    
dd.set_index('Country')









    Out[14]:







  
    
      
      ISO
      Variable
      Description
      Units
      1980
      1981
      1982
      1983
      1984
      1985
      ...
      2007
      2008
      2009
      2010
      2011
      2012
      2013
      2014
      2015
      2016
    
    
      Country
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Argentina
      ARG
      Surplus
      General government net lending/borrowing
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      -0.056
      0.242
      -2.434
      -1.272
      -2.552
      -2.802
      -3.001
      -3.990
      -6.630
      -7.123
    
    
      Argentina
      ARG
      Debt
      General government gross debt
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      50.842
      43.853
      53.828
      42.624
      38.063
      39.434
      42.195
      43.587
      52.131
      51.751
    
    
      Germany
      DEU
      Surplus
      General government net lending/borrowing
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      0.187
      -0.177
      -3.235
      -4.221
      -0.957
      -0.050
      -0.203
      0.278
      0.745
      0.129
    
    
      Germany
      DEU
      Debt
      General government gross debt
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      63.542
      64.926
      72.430
      81.004
      78.311
      79.516
      77.058
      74.480
      70.988
      68.169
    
    
      Greece
      GRC
      Surplus
      General government net lending/borrowing
      Percent of GDP
      -2.774
      -7.847
      -6.092
      -6.847
      -7.976
      -10.603
      ...
      -6.707
      -10.168
      -15.152
      -11.208
      -10.243
      -6.506
      -3.531
      -4.054
      -3.080
      -3.359
    
    
      Greece
      GRC
      Debt
      General government gross debt
      Percent of GDP
      22.526
      26.681
      29.310
      33.591
      40.061
      46.621
      ...
      103.103
      109.416
      126.745
      146.250
      172.096
      159.565
      177.677
      180.063
      176.938
      183.436
    
  

6 rows × 41 columns



In [15]:

    
# we can do the same thing with a list, which will be meaningful soon...
dd.set_index(['Country'])









    Out[15]:







  
    
      
      ISO
      Variable
      Description
      Units
      1980
      1981
      1982
      1983
      1984
      1985
      ...
      2007
      2008
      2009
      2010
      2011
      2012
      2013
      2014
      2015
      2016
    
    
      Country
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Argentina
      ARG
      Surplus
      General government net lending/borrowing
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      -0.056
      0.242
      -2.434
      -1.272
      -2.552
      -2.802
      -3.001
      -3.990
      -6.630
      -7.123
    
    
      Argentina
      ARG
      Debt
      General government gross debt
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      50.842
      43.853
      53.828
      42.624
      38.063
      39.434
      42.195
      43.587
      52.131
      51.751
    
    
      Germany
      DEU
      Surplus
      General government net lending/borrowing
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      0.187
      -0.177
      -3.235
      -4.221
      -0.957
      -0.050
      -0.203
      0.278
      0.745
      0.129
    
    
      Germany
      DEU
      Debt
      General government gross debt
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      63.542
      64.926
      72.430
      81.004
      78.311
      79.516
      77.058
      74.480
      70.988
      68.169
    
    
      Greece
      GRC
      Surplus
      General government net lending/borrowing
      Percent of GDP
      -2.774
      -7.847
      -6.092
      -6.847
      -7.976
      -10.603
      ...
      -6.707
      -10.168
      -15.152
      -11.208
      -10.243
      -6.506
      -3.531
      -4.054
      -3.080
      -3.359
    
    
      Greece
      GRC
      Debt
      General government gross debt
      Percent of GDP
      22.526
      26.681
      29.310
      33.591
      40.061
      46.621
      ...
      103.103
      109.416
      126.745
      146.250
      172.096
      159.565
      177.677
      180.063
      176.938
      183.436
    
  

6 rows × 41 columns

Exercise. Set Variable as the index.



In [ ]:

Comment. Note that the new index brought its name along: Country in the two examples, Variable in the exercise. That's incredibly useful because we can refer to index levels by name. If we happen to have an index without a name, we can set it with

df.index.name = 'Whatever name we like'

Multi-indexes

We can put more than one variable in an index, which gives us a multi-index. This is sometimes called a hierarchical index because the levels of the index (as they're called) are ordered.

Multi-indexes are more common than you might think. One reason is that data itself is often multi-dimensional. A typical spreadsheet has two dimensions: the variable and the observation. The WEO data is naturally three dimensional: the variable, the year, and the country. (Think about that for a minute, it's deeper than it sounds.)

The problem we're having is fitting this nicely into two dimensions. A multi-index allows us to manage that. A two-dimensional index would work here -- the country and the variable code -- but right now we have some redundancy.

Example. We push all the descriptive, non-numerical columns into the index, leaving the dataframe itself with only numbers, which seems like a step in thee right direction.



In [16]:

    
ddi = dd.set_index(['Variable', 'Country', 'ISO', 'Description', 'Units'])
ddi









    Out[16]:







  
    
      
      
      
      
      
      1980
      1981
      1982
      1983
      1984
      1985
      1986
      1987
      1988
      1989
      ...
      2007
      2008
      2009
      2010
      2011
      2012
      2013
      2014
      2015
      2016
    
    
      Variable
      Country
      ISO
      Description
      Units
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Surplus
      Argentina
      ARG
      General government net lending/borrowing
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      -0.056
      0.242
      -2.434
      -1.272
      -2.552
      -2.802
      -3.001
      -3.990
      -6.630
      -7.123
    
    
      Debt
      Argentina
      ARG
      General government gross debt
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      50.842
      43.853
      53.828
      42.624
      38.063
      39.434
      42.195
      43.587
      52.131
      51.751
    
    
      Surplus
      Germany
      DEU
      General government net lending/borrowing
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      0.187
      -0.177
      -3.235
      -4.221
      -0.957
      -0.050
      -0.203
      0.278
      0.745
      0.129
    
    
      Debt
      Germany
      DEU
      General government gross debt
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      63.542
      64.926
      72.430
      81.004
      78.311
      79.516
      77.058
      74.480
      70.988
      68.169
    
    
      Surplus
      Greece
      GRC
      General government net lending/borrowing
      Percent of GDP
      -2.774
      -7.847
      -6.092
      -6.847
      -7.976
      -10.603
      -9.559
      -8.972
      -10.675
      -13.195
      ...
      -6.707
      -10.168
      -15.152
      -11.208
      -10.243
      -6.506
      -3.531
      -4.054
      -3.080
      -3.359
    
    
      Debt
      Greece
      GRC
      General government gross debt
      Percent of GDP
      22.526
      26.681
      29.310
      33.591
      40.061
      46.621
      47.142
      52.413
      57.069
      59.821
      ...
      103.103
      109.416
      126.745
      146.250
      172.096
      159.565
      177.677
      180.063
      176.938
      183.436
    
  

6 rows × 37 columns

Let's take a closer look at the index



In [17]:

    
ddi.index









    Out[17]:





MultiIndex(levels=[['Debt', 'Surplus'], ['Argentina', 'Germany', 'Greece'], ['ARG', 'DEU', 'GRC'], ['General government gross debt', 'General government net lending/borrowing'], ['Percent of GDP']],
           labels=[[1, 0, 1, 0, 1, 0], [0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2], [1, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0]],
           names=['Variable', 'Country', 'ISO', 'Description', 'Units'])

That's a lot to process, so we break it into pieces.

ddi.index.names contains a list of level names. (Remind yourself that lists are ordered, so this tracks levels.)
ddi.index.levels contains the values in each level.

Here's what they like like here:



In [18]:

    
# Chase and Spencer like double quotes 
print("The level names are:\n", ddi.index.names, "\n", sep="")
print("The levels (aka level values) are:\n", ddi.index.levels, sep="")









    



The level names are:
['Variable', 'Country', 'ISO', 'Description', 'Units']

The levels (aka level values) are:
[['Debt', 'Surplus'], ['Argentina', 'Germany', 'Greece'], ['ARG', 'DEU', 'GRC'], ['General government gross debt', 'General government net lending/borrowing'], ['Percent of GDP']]

Knowing the order of the index components and being able to inspect their values and names is fundamental to working with a multi-index.

Exercise: What would happen if we had switched the order of the strings in the list when we called dd.set_index? Try it with this list to find out: ['ISO', 'Country', 'Variable', 'Description', 'Units']

Resetting the index

We've seen that set_index pushes columns into the index. Here we see that reset_index does the reverse: it pushes components of the index back to the columns.

Example.



In [19]:

    
ddi.head(2)









    Out[19]:







  
    
      
      
      
      
      
      1980
      1981
      1982
      1983
      1984
      1985
      1986
      1987
      1988
      1989
      ...
      2007
      2008
      2009
      2010
      2011
      2012
      2013
      2014
      2015
      2016
    
    
      Variable
      Country
      ISO
      Description
      Units
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Surplus
      Argentina
      ARG
      General government net lending/borrowing
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      -0.056
      0.242
      -2.434
      -1.272
      -2.552
      -2.802
      -3.001
      -3.990
      -6.630
      -7.123
    
    
      Debt
      Argentina
      ARG
      General government gross debt
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      50.842
      43.853
      53.828
      42.624
      38.063
      39.434
      42.195
      43.587
      52.131
      51.751
    
  

2 rows × 37 columns



In [20]:

    
ddi.reset_index()









    Out[20]:







  
    
      
      Variable
      Country
      ISO
      Description
      Units
      1980
      1981
      1982
      1983
      1984
      ...
      2007
      2008
      2009
      2010
      2011
      2012
      2013
      2014
      2015
      2016
    
  
  
    
      0
      Surplus
      Argentina
      ARG
      General government net lending/borrowing
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      -0.056
      0.242
      -2.434
      -1.272
      -2.552
      -2.802
      -3.001
      -3.990
      -6.630
      -7.123
    
    
      1
      Debt
      Argentina
      ARG
      General government gross debt
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      50.842
      43.853
      53.828
      42.624
      38.063
      39.434
      42.195
      43.587
      52.131
      51.751
    
    
      2
      Surplus
      Germany
      DEU
      General government net lending/borrowing
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      0.187
      -0.177
      -3.235
      -4.221
      -0.957
      -0.050
      -0.203
      0.278
      0.745
      0.129
    
    
      3
      Debt
      Germany
      DEU
      General government gross debt
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      63.542
      64.926
      72.430
      81.004
      78.311
      79.516
      77.058
      74.480
      70.988
      68.169
    
    
      4
      Surplus
      Greece
      GRC
      General government net lending/borrowing
      Percent of GDP
      -2.774
      -7.847
      -6.092
      -6.847
      -7.976
      ...
      -6.707
      -10.168
      -15.152
      -11.208
      -10.243
      -6.506
      -3.531
      -4.054
      -3.080
      -3.359
    
    
      5
      Debt
      Greece
      GRC
      General government gross debt
      Percent of GDP
      22.526
      26.681
      29.310
      33.591
      40.061
      ...
      103.103
      109.416
      126.745
      146.250
      172.096
      159.565
      177.677
      180.063
      176.938
      183.436
    
  

6 rows × 42 columns



In [21]:

    
# or we can reset the index by level 
ddi.reset_index(level=1).head(2)









    Out[21]:







  
    
      
      
      
      
      Country
      1980
      1981
      1982
      1983
      1984
      1985
      1986
      1987
      1988
      ...
      2007
      2008
      2009
      2010
      2011
      2012
      2013
      2014
      2015
      2016
    
    
      Variable
      ISO
      Description
      Units
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Surplus
      ARG
      General government net lending/borrowing
      Percent of GDP
      Argentina
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      -0.056
      0.242
      -2.434
      -1.272
      -2.552
      -2.802
      -3.001
      -3.990
      -6.630
      -7.123
    
    
      Debt
      ARG
      General government gross debt
      Percent of GDP
      Argentina
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      50.842
      43.853
      53.828
      42.624
      38.063
      39.434
      42.195
      43.587
      52.131
      51.751
    
  

2 rows × 38 columns



In [22]:

    
# or by name 
ddi.reset_index(level='Units').head(2)









    Out[22]:







  
    
      
      
      
      
      Units
      1980
      1981
      1982
      1983
      1984
      1985
      1986
      1987
      1988
      ...
      2007
      2008
      2009
      2010
      2011
      2012
      2013
      2014
      2015
      2016
    
    
      Variable
      Country
      ISO
      Description
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Surplus
      Argentina
      ARG
      General government net lending/borrowing
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      -0.056
      0.242
      -2.434
      -1.272
      -2.552
      -2.802
      -3.001
      -3.990
      -6.630
      -7.123
    
    
      Debt
      Argentina
      ARG
      General government gross debt
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      50.842
      43.853
      53.828
      42.624
      38.063
      39.434
      42.195
      43.587
      52.131
      51.751
    
  

2 rows × 38 columns



In [23]:

    
# or do more than one at a time
ddi.reset_index(level=[1, 3]).head(2)









    Out[23]:







  
    
      
      
      
      Country
      Description
      1980
      1981
      1982
      1983
      1984
      1985
      1986
      1987
      ...
      2007
      2008
      2009
      2010
      2011
      2012
      2013
      2014
      2015
      2016
    
    
      Variable
      ISO
      Units
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Surplus
      ARG
      Percent of GDP
      Argentina
      General government net lending/borrowing
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      -0.056
      0.242
      -2.434
      -1.272
      -2.552
      -2.802
      -3.001
      -3.990
      -6.630
      -7.123
    
    
      Debt
      ARG
      Percent of GDP
      Argentina
      General government gross debt
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      50.842
      43.853
      53.828
      42.624
      38.063
      39.434
      42.195
      43.587
      52.131
      51.751
    
  

2 rows × 39 columns

Comment. By default, reset_index pushes one or more index levels into columns. If we want to discard that level of the index altogether, we use the parameter drop=True.



In [24]:

    
ddi.reset_index(level=[1, 3], drop=True).head(2)









    Out[24]:







  
    
      
      
      
      1980
      1981
      1982
      1983
      1984
      1985
      1986
      1987
      1988
      1989
      ...
      2007
      2008
      2009
      2010
      2011
      2012
      2013
      2014
      2015
      2016
    
    
      Variable
      ISO
      Units
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Surplus
      ARG
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      -0.056
      0.242
      -2.434
      -1.272
      -2.552
      -2.802
      -3.001
      -3.990
      -6.630
      -7.123
    
    
      Debt
      ARG
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      50.842
      43.853
      53.828
      42.624
      38.063
      39.434
      42.195
      43.587
      52.131
      51.751
    
  

2 rows × 37 columns

Exercise. For the dataframe ddi do the following in separate code cells:

Use the reset_index method to move the Units level of the index to a column of the dataframe.
Use the drop parameter of reset_index to delete Units from the dataframe.



In [ ]:

Switching rows and columns

If we take the dataframe ddi, we see that the everything's been put into the index but the data itself. Perhaps we can get what we want if we just flip the rows and columns. Roughly speaking, we refer to this as pivoting.

First look at switching rows and columns

The simplest way to flip rows and columns is to use the T or transpose property. When we do that, we end up with a lot of stuff in the column labels, as the multi-index for the rows gets rotated into the columns. Other than that, we're good. We can even do a plot. The only problem is all the stuff we've pushed into the column labels -- it's kind of a mess.



In [25]:

    
ddt = ddi.T
ddt









    Out[25]:







  
    
      Variable
      Surplus
      Debt
      Surplus
      Debt
      Surplus
      Debt
    
    
      Country
      Argentina
      Argentina
      Germany
      Germany
      Greece
      Greece
    
    
      ISO
      ARG
      ARG
      DEU
      DEU
      GRC
      GRC
    
    
      Description
      General government net lending/borrowing
      General government gross debt
      General government net lending/borrowing
      General government gross debt
      General government net lending/borrowing
      General government gross debt
    
    
      Units
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
    
  
  
    
      1980
      NaN
      NaN
      NaN
      NaN
      -2.774
      22.526
    
    
      1981
      NaN
      NaN
      NaN
      NaN
      -7.847
      26.681
    
    
      1982
      NaN
      NaN
      NaN
      NaN
      -6.092
      29.310
    
    
      1983
      NaN
      NaN
      NaN
      NaN
      -6.847
      33.591
    
    
      1984
      NaN
      NaN
      NaN
      NaN
      -7.976
      40.061
    
    
      1985
      NaN
      NaN
      NaN
      NaN
      -10.603
      46.621
    
    
      1986
      NaN
      NaN
      NaN
      NaN
      -9.559
      47.142
    
    
      1987
      NaN
      NaN
      NaN
      NaN
      -8.972
      52.413
    
    
      1988
      NaN
      NaN
      NaN
      NaN
      -10.675
      57.069
    
    
      1989
      NaN
      NaN
      NaN
      NaN
      -13.195
      59.821
    
    
      1990
      NaN
      NaN
      NaN
      NaN
      -14.738
      73.155
    
    
      1991
      NaN
      NaN
      -3.167
      38.893
      -10.677
      74.683
    
    
      1992
      NaN
      25.746
      -2.594
      41.521
      -11.806
      79.968
    
    
      1993
      -0.025
      27.739
      -3.083
      45.119
      -12.701
      100.288
    
    
      1994
      -1.289
      29.310
      -2.506
      47.421
      -9.401
      98.296
    
    
      1995
      -2.137
      31.673
      -9.427
      54.820
      -10.944
      98.990
    
    
      1996
      -2.898
      33.561
      -3.542
      57.598
      -9.098
      101.335
    
    
      1997
      -1.890
      32.662
      -2.945
      58.773
      -6.671
      99.452
    
    
      1998
      -1.882
      35.200
      -2.532
      59.385
      -6.458
      97.425
    
    
      1999
      -3.813
      40.097
      -1.698
      59.962
      -6.060
      98.906
    
    
      2000
      -3.326
      42.060
      0.859
      58.758
      -4.113
      104.935
    
    
      2001
      -5.424
      49.436
      -3.112
      57.582
      -5.466
      107.081
    
    
      2002
      -2.123
      152.112
      -3.944
      59.220
      -6.024
      104.863
    
    
      2003
      1.173
      128.561
      -4.176
      62.921
      -7.830
      101.456
    
    
      2004
      3.549
      117.116
      -3.741
      64.689
      -8.828
      102.870
    
    
      2005
      2.187
      66.938
      -3.417
      66.909
      -6.188
      107.392
    
    
      2006
      1.709
      58.622
      -1.722
      66.333
      -5.946
      103.574
    
    
      2007
      -0.056
      50.842
      0.187
      63.542
      -6.707
      103.103
    
    
      2008
      0.242
      43.853
      -0.177
      64.926
      -10.168
      109.416
    
    
      2009
      -2.434
      53.828
      -3.235
      72.430
      -15.152
      126.745
    
    
      2010
      -1.272
      42.624
      -4.221
      81.004
      -11.208
      146.250
    
    
      2011
      -2.552
      38.063
      -0.957
      78.311
      -10.243
      172.096
    
    
      2012
      -2.802
      39.434
      -0.050
      79.516
      -6.506
      159.565
    
    
      2013
      -3.001
      42.195
      -0.203
      77.058
      -3.531
      177.677
    
    
      2014
      -3.990
      43.587
      0.278
      74.480
      -4.054
      180.063
    
    
      2015
      -6.630
      52.131
      0.745
      70.988
      -3.080
      176.938
    
    
      2016
      -7.123
      51.751
      0.129
      68.169
      -3.359
      183.436

Comment. We see here that the multi-index for the rows has been turned into a multi-index for the columns. Works the same way.

The only problem here is that the column labels are more complicated than we might want. Here, for example, is what we get with the plot method. As usual, .plot() plots all the columns of the dataframe, but here that means we're mixing variables. And the legend contains all the levels of the column labels.



In [26]:

    
ddt.plot()









    Out[26]:





<matplotlib.axes._subplots.AxesSubplot at 0x1094d6518>



In [ ]:

Referring to variables with a multi-index

Can we refer to variables in the same way? Sort of, as long as we refer to the top level of the column index. It gives us a dataframe that's a subset of the original one.

Let's try each of these:

ddt['Debt']
ddt['Debt']['Argentina']
ddt['Debt', 'Argentina']
ddt['ARG']

What do you see?



In [27]:

    
# indexing by variable 
debt = ddt['Debt']
debt









    Out[27]:







  
    
      Country
      Argentina
      Germany
      Greece
    
    
      ISO
      ARG
      DEU
      GRC
    
    
      Description
      General government gross debt
      General government gross debt
      General government gross debt
    
    
      Units
      Percent of GDP
      Percent of GDP
      Percent of GDP
    
  
  
    
      1980
      NaN
      NaN
      22.526
    
    
      1981
      NaN
      NaN
      26.681
    
    
      1982
      NaN
      NaN
      29.310
    
    
      1983
      NaN
      NaN
      33.591
    
    
      1984
      NaN
      NaN
      40.061
    
    
      1985
      NaN
      NaN
      46.621
    
    
      1986
      NaN
      NaN
      47.142
    
    
      1987
      NaN
      NaN
      52.413
    
    
      1988
      NaN
      NaN
      57.069
    
    
      1989
      NaN
      NaN
      59.821
    
    
      1990
      NaN
      NaN
      73.155
    
    
      1991
      NaN
      38.893
      74.683
    
    
      1992
      25.746
      41.521
      79.968
    
    
      1993
      27.739
      45.119
      100.288
    
    
      1994
      29.310
      47.421
      98.296
    
    
      1995
      31.673
      54.820
      98.990
    
    
      1996
      33.561
      57.598
      101.335
    
    
      1997
      32.662
      58.773
      99.452
    
    
      1998
      35.200
      59.385
      97.425
    
    
      1999
      40.097
      59.962
      98.906
    
    
      2000
      42.060
      58.758
      104.935
    
    
      2001
      49.436
      57.582
      107.081
    
    
      2002
      152.112
      59.220
      104.863
    
    
      2003
      128.561
      62.921
      101.456
    
    
      2004
      117.116
      64.689
      102.870
    
    
      2005
      66.938
      66.909
      107.392
    
    
      2006
      58.622
      66.333
      103.574
    
    
      2007
      50.842
      63.542
      103.103
    
    
      2008
      43.853
      64.926
      109.416
    
    
      2009
      53.828
      72.430
      126.745
    
    
      2010
      42.624
      81.004
      146.250
    
    
      2011
      38.063
      78.311
      172.096
    
    
      2012
      39.434
      79.516
      159.565
    
    
      2013
      42.195
      77.058
      177.677
    
    
      2014
      43.587
      74.480
      180.063
    
    
      2015
      52.131
      70.988
      176.938
    
    
      2016
      51.751
      68.169
      183.436



In [28]:

    
ddt['Debt']['Argentina']









    Out[28]:







  
    
      ISO
      ARG
    
    
      Description
      General government gross debt
    
    
      Units
      Percent of GDP
    
  
  
    
      1980
      NaN
    
    
      1981
      NaN
    
    
      1982
      NaN
    
    
      1983
      NaN
    
    
      1984
      NaN
    
    
      1985
      NaN
    
    
      1986
      NaN
    
    
      1987
      NaN
    
    
      1988
      NaN
    
    
      1989
      NaN
    
    
      1990
      NaN
    
    
      1991
      NaN
    
    
      1992
      25.746
    
    
      1993
      27.739
    
    
      1994
      29.310
    
    
      1995
      31.673
    
    
      1996
      33.561
    
    
      1997
      32.662
    
    
      1998
      35.200
    
    
      1999
      40.097
    
    
      2000
      42.060
    
    
      2001
      49.436
    
    
      2002
      152.112
    
    
      2003
      128.561
    
    
      2004
      117.116
    
    
      2005
      66.938
    
    
      2006
      58.622
    
    
      2007
      50.842
    
    
      2008
      43.853
    
    
      2009
      53.828
    
    
      2010
      42.624
    
    
      2011
      38.063
    
    
      2012
      39.434
    
    
      2013
      42.195
    
    
      2014
      43.587
    
    
      2015
      52.131
    
    
      2016
      51.751



In [29]:

    
ddt['Debt', 'Argentina']









    



/Users/wasserman/anaconda3/lib/python3.6/site-packages/ipykernel/kernelbase.py:399: PerformanceWarning: indexing past lexsort depth may impact performance.
  user_expressions, allow_stdin)






    Out[29]:







  
    
      ISO
      ARG
    
    
      Description
      General government gross debt
    
    
      Units
      Percent of GDP
    
  
  
    
      1980
      NaN
    
    
      1981
      NaN
    
    
      1982
      NaN
    
    
      1983
      NaN
    
    
      1984
      NaN
    
    
      1985
      NaN
    
    
      1986
      NaN
    
    
      1987
      NaN
    
    
      1988
      NaN
    
    
      1989
      NaN
    
    
      1990
      NaN
    
    
      1991
      NaN
    
    
      1992
      25.746
    
    
      1993
      27.739
    
    
      1994
      29.310
    
    
      1995
      31.673
    
    
      1996
      33.561
    
    
      1997
      32.662
    
    
      1998
      35.200
    
    
      1999
      40.097
    
    
      2000
      42.060
    
    
      2001
      49.436
    
    
      2002
      152.112
    
    
      2003
      128.561
    
    
      2004
      117.116
    
    
      2005
      66.938
    
    
      2006
      58.622
    
    
      2007
      50.842
    
    
      2008
      43.853
    
    
      2009
      53.828
    
    
      2010
      42.624
    
    
      2011
      38.063
    
    
      2012
      39.434
    
    
      2013
      42.195
    
    
      2014
      43.587
    
    
      2015
      52.131
    
    
      2016
      51.751



In [30]:

    
#ddt['ARG']

What's going on? The theme is that we can reference the top level, which in ddi is the Variable. If we try to access a lower level, it bombs.

Exercise. With the dataframe ddt:

What type of object is ddt["Debt"]?
Construct a line plot of Debt over time with one line for each country.

SOL



In [ ]:

Example. Let's do this together. How would we fix up the legend? What approaches cross your mind? (No code, just the general approach.)



In [31]:

    
fig, ax = plt.subplots()
ddt['Debt'].plot(ax=ax)
ax.legend(['ARG', 'DEU', 'GRE'], loc='best')
#ax.axhline(100, color='k', linestyle='--', alpha=.5)









    Out[31]:





<matplotlib.legend.Legend at 0x10a73fe10>

Swapping levels

Since variables refer to the first level of the column index, it's not clear how we would group data by country. Suppose, for example, we wanted to plot Debt and Surplus for a specific country. What would we do?

One way to do that is to make the country the top level with the swaplevel method. Note the axis parameter. With axis=1 we swap column levels, with axis=0 (the default) we swap row levels.



In [32]:

    
ddts = ddt.swaplevel(0, 1, axis=1) 
ddts









    Out[32]:







  
    
      Country
      Argentina
      Germany
      Greece
    
    
      Variable
      Surplus
      Debt
      Surplus
      Debt
      Surplus
      Debt
    
    
      ISO
      ARG
      ARG
      DEU
      DEU
      GRC
      GRC
    
    
      Description
      General government net lending/borrowing
      General government gross debt
      General government net lending/borrowing
      General government gross debt
      General government net lending/borrowing
      General government gross debt
    
    
      Units
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
    
  
  
    
      1980
      NaN
      NaN
      NaN
      NaN
      -2.774
      22.526
    
    
      1981
      NaN
      NaN
      NaN
      NaN
      -7.847
      26.681
    
    
      1982
      NaN
      NaN
      NaN
      NaN
      -6.092
      29.310
    
    
      1983
      NaN
      NaN
      NaN
      NaN
      -6.847
      33.591
    
    
      1984
      NaN
      NaN
      NaN
      NaN
      -7.976
      40.061
    
    
      1985
      NaN
      NaN
      NaN
      NaN
      -10.603
      46.621
    
    
      1986
      NaN
      NaN
      NaN
      NaN
      -9.559
      47.142
    
    
      1987
      NaN
      NaN
      NaN
      NaN
      -8.972
      52.413
    
    
      1988
      NaN
      NaN
      NaN
      NaN
      -10.675
      57.069
    
    
      1989
      NaN
      NaN
      NaN
      NaN
      -13.195
      59.821
    
    
      1990
      NaN
      NaN
      NaN
      NaN
      -14.738
      73.155
    
    
      1991
      NaN
      NaN
      -3.167
      38.893
      -10.677
      74.683
    
    
      1992
      NaN
      25.746
      -2.594
      41.521
      -11.806
      79.968
    
    
      1993
      -0.025
      27.739
      -3.083
      45.119
      -12.701
      100.288
    
    
      1994
      -1.289
      29.310
      -2.506
      47.421
      -9.401
      98.296
    
    
      1995
      -2.137
      31.673
      -9.427
      54.820
      -10.944
      98.990
    
    
      1996
      -2.898
      33.561
      -3.542
      57.598
      -9.098
      101.335
    
    
      1997
      -1.890
      32.662
      -2.945
      58.773
      -6.671
      99.452
    
    
      1998
      -1.882
      35.200
      -2.532
      59.385
      -6.458
      97.425
    
    
      1999
      -3.813
      40.097
      -1.698
      59.962
      -6.060
      98.906
    
    
      2000
      -3.326
      42.060
      0.859
      58.758
      -4.113
      104.935
    
    
      2001
      -5.424
      49.436
      -3.112
      57.582
      -5.466
      107.081
    
    
      2002
      -2.123
      152.112
      -3.944
      59.220
      -6.024
      104.863
    
    
      2003
      1.173
      128.561
      -4.176
      62.921
      -7.830
      101.456
    
    
      2004
      3.549
      117.116
      -3.741
      64.689
      -8.828
      102.870
    
    
      2005
      2.187
      66.938
      -3.417
      66.909
      -6.188
      107.392
    
    
      2006
      1.709
      58.622
      -1.722
      66.333
      -5.946
      103.574
    
    
      2007
      -0.056
      50.842
      0.187
      63.542
      -6.707
      103.103
    
    
      2008
      0.242
      43.853
      -0.177
      64.926
      -10.168
      109.416
    
    
      2009
      -2.434
      53.828
      -3.235
      72.430
      -15.152
      126.745
    
    
      2010
      -1.272
      42.624
      -4.221
      81.004
      -11.208
      146.250
    
    
      2011
      -2.552
      38.063
      -0.957
      78.311
      -10.243
      172.096
    
    
      2012
      -2.802
      39.434
      -0.050
      79.516
      -6.506
      159.565
    
    
      2013
      -3.001
      42.195
      -0.203
      77.058
      -3.531
      177.677
    
    
      2014
      -3.990
      43.587
      0.278
      74.480
      -4.054
      180.063
    
    
      2015
      -6.630
      52.131
      0.745
      70.988
      -3.080
      176.938
    
    
      2016
      -7.123
      51.751
      0.129
      68.169
      -3.359
      183.436

Exercise. Use the dataframe ddts to plot Debt and Surplus across time for Argentina. Hint: In the plot method, set subplots=True so that each variable is in a separate subplot.

SOL

The `xs` method

Another approach to extracting data that cuts across levels of the row or column index: the xs method. This is recent addition to Pandas and an extremely good method once you get the hang of it.

The basic syntax is

df.xs(item, axis=X, level=N)

where N is the name or number of an index level and X describes if we are extracting from the index or column names. Setting X=0 (so axis=0) will slice up the data along the index, X=1 extracts data for column labels.

Here's how we could use xs to get the Argentina data without swapping the level of the column labels



In [33]:

    
# ddt.xs?



In [34]:

    
ddt.xs("Argentina", axis=1, level="Country")









    Out[34]:







  
    
      Variable
      Surplus
      Debt
    
    
      ISO
      ARG
      ARG
    
    
      Description
      General government net lending/borrowing
      General government gross debt
    
    
      Units
      Percent of GDP
      Percent of GDP
    
  
  
    
      1980
      NaN
      NaN
    
    
      1981
      NaN
      NaN
    
    
      1982
      NaN
      NaN
    
    
      1983
      NaN
      NaN
    
    
      1984
      NaN
      NaN
    
    
      1985
      NaN
      NaN
    
    
      1986
      NaN
      NaN
    
    
      1987
      NaN
      NaN
    
    
      1988
      NaN
      NaN
    
    
      1989
      NaN
      NaN
    
    
      1990
      NaN
      NaN
    
    
      1991
      NaN
      NaN
    
    
      1992
      NaN
      25.746
    
    
      1993
      -0.025
      27.739
    
    
      1994
      -1.289
      29.310
    
    
      1995
      -2.137
      31.673
    
    
      1996
      -2.898
      33.561
    
    
      1997
      -1.890
      32.662
    
    
      1998
      -1.882
      35.200
    
    
      1999
      -3.813
      40.097
    
    
      2000
      -3.326
      42.060
    
    
      2001
      -5.424
      49.436
    
    
      2002
      -2.123
      152.112
    
    
      2003
      1.173
      128.561
    
    
      2004
      3.549
      117.116
    
    
      2005
      2.187
      66.938
    
    
      2006
      1.709
      58.622
    
    
      2007
      -0.056
      50.842
    
    
      2008
      0.242
      43.853
    
    
      2009
      -2.434
      53.828
    
    
      2010
      -1.272
      42.624
    
    
      2011
      -2.552
      38.063
    
    
      2012
      -2.802
      39.434
    
    
      2013
      -3.001
      42.195
    
    
      2014
      -3.990
      43.587
    
    
      2015
      -6.630
      52.131
    
    
      2016
      -7.123
      51.751



In [35]:

    
ddt.xs("Argentina", axis=1, level="Country")["Debt"]









    Out[35]:







  
    
      ISO
      ARG
    
    
      Description
      General government gross debt
    
    
      Units
      Percent of GDP
    
  
  
    
      1980
      NaN
    
    
      1981
      NaN
    
    
      1982
      NaN
    
    
      1983
      NaN
    
    
      1984
      NaN
    
    
      1985
      NaN
    
    
      1986
      NaN
    
    
      1987
      NaN
    
    
      1988
      NaN
    
    
      1989
      NaN
    
    
      1990
      NaN
    
    
      1991
      NaN
    
    
      1992
      25.746
    
    
      1993
      27.739
    
    
      1994
      29.310
    
    
      1995
      31.673
    
    
      1996
      33.561
    
    
      1997
      32.662
    
    
      1998
      35.200
    
    
      1999
      40.097
    
    
      2000
      42.060
    
    
      2001
      49.436
    
    
      2002
      152.112
    
    
      2003
      128.561
    
    
      2004
      117.116
    
    
      2005
      66.938
    
    
      2006
      58.622
    
    
      2007
      50.842
    
    
      2008
      43.853
    
    
      2009
      53.828
    
    
      2010
      42.624
    
    
      2011
      38.063
    
    
      2012
      39.434
    
    
      2013
      42.195
    
    
      2014
      43.587
    
    
      2015
      52.131
    
    
      2016
      51.751

Exercise. Use a combination of xs and standard slicing with [...] to extract the variable Debt for Greece.

SOL

Exercise. Use the dataframe ddt -- and the xs method -- to plot Debt and Surplus across time for Argentina.

SOL

Stacking and unstacking

The set_index and reset_index methods work on the row labels -- the index. They move columns to the index and the reverse. The stack and unstack methods move index levels to and from column levels:

stack moves the "inner most" (closest to the data when printed) column label into a row label. This creates a long dataframe.
unstack does the reverse, it moves the inner most level of the index up to become the inner most column label. This creates a wide dataframe.

We use both to shape (or reshape) our data. We use set_index to push things into the index. And then use reset_index to push some of them back to the columns. That gives us pretty fine-grainded control over the shape of our data. Intuitively

stacking (vertically): wide table $\rightarrow$ long table
unstacking: long table $\rightarrow$ wide table



In [36]:

    
ddi.stack?

Single level index



In [37]:

    
# example from docstring
dic = {'a': [1, 3], 'b': [2, 4]}
s = pd.DataFrame(data=dic, index=['one', 'two'])
print(s)









    



     a  b
one  1  2
two  3  4



In [38]:

    
s.stack()









    Out[38]:





one  a    1
     b    2
two  a    3
     b    4
dtype: int64

Multi-index



In [39]:

    
ddi.index









    Out[39]:





MultiIndex(levels=[['Debt', 'Surplus'], ['Argentina', 'Germany', 'Greece'], ['ARG', 'DEU', 'GRC'], ['General government gross debt', 'General government net lending/borrowing'], ['Percent of GDP']],
           labels=[[1, 0, 1, 0, 1, 0], [0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2], [1, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0]],
           names=['Variable', 'Country', 'ISO', 'Description', 'Units'])



In [40]:

    
ddi.unstack()   # Units variable has only one value, so this doesn't do much









    Out[40]:







  
    
      
      
      
      
      1980
      1981
      1982
      1983
      1984
      1985
      1986
      1987
      1988
      1989
      ...
      2007
      2008
      2009
      2010
      2011
      2012
      2013
      2014
      2015
      2016
    
    
      
      
      
      Units
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      ...
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
      Percent of GDP
    
    
      Variable
      Country
      ISO
      Description
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Debt
      Argentina
      ARG
      General government gross debt
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      50.842
      43.853
      53.828
      42.624
      38.063
      39.434
      42.195
      43.587
      52.131
      51.751
    
    
      Germany
      DEU
      General government gross debt
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      63.542
      64.926
      72.430
      81.004
      78.311
      79.516
      77.058
      74.480
      70.988
      68.169
    
    
      Greece
      GRC
      General government gross debt
      22.526
      26.681
      29.310
      33.591
      40.061
      46.621
      47.142
      52.413
      57.069
      59.821
      ...
      103.103
      109.416
      126.745
      146.250
      172.096
      159.565
      177.677
      180.063
      176.938
      183.436
    
    
      Surplus
      Argentina
      ARG
      General government net lending/borrowing
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      -0.056
      0.242
      -2.434
      -1.272
      -2.552
      -2.802
      -3.001
      -3.990
      -6.630
      -7.123
    
    
      Germany
      DEU
      General government net lending/borrowing
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      0.187
      -0.177
      -3.235
      -4.221
      -0.957
      -0.050
      -0.203
      0.278
      0.745
      0.129
    
    
      Greece
      GRC
      General government net lending/borrowing
      -2.774
      -7.847
      -6.092
      -6.847
      -7.976
      -10.603
      -9.559
      -8.972
      -10.675
      -13.195
      ...
      -6.707
      -10.168
      -15.152
      -11.208
      -10.243
      -6.506
      -3.531
      -4.054
      -3.080
      -3.359
    
  

6 rows × 37 columns



In [41]:

    
ddi.unstack(level='ISO')









    Out[41]:







  
    
      
      
      
      
      1980
      1981
      1982
      1983
      ...
      2013
      2014
      2015
      2016
    
    
      
      
      
      ISO
      ARG
      DEU
      GRC
      ARG
      DEU
      GRC
      ARG
      DEU
      GRC
      ARG
      ...
      GRC
      ARG
      DEU
      GRC
      ARG
      DEU
      GRC
      ARG
      DEU
      GRC
    
    
      Variable
      Country
      Description
      Units
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Debt
      Argentina
      General government gross debt
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      NaN
      43.587
      NaN
      NaN
      52.131
      NaN
      NaN
      51.751
      NaN
      NaN
    
    
      Germany
      General government gross debt
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      NaN
      NaN
      74.480
      NaN
      NaN
      70.988
      NaN
      NaN
      68.169
      NaN
    
    
      Greece
      General government gross debt
      Percent of GDP
      NaN
      NaN
      22.526
      NaN
      NaN
      26.681
      NaN
      NaN
      29.310
      NaN
      ...
      177.677
      NaN
      NaN
      180.063
      NaN
      NaN
      176.938
      NaN
      NaN
      183.436
    
    
      Surplus
      Argentina
      General government net lending/borrowing
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      NaN
      -3.990
      NaN
      NaN
      -6.630
      NaN
      NaN
      -7.123
      NaN
      NaN
    
    
      Germany
      General government net lending/borrowing
      Percent of GDP
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      NaN
      NaN
      0.278
      NaN
      NaN
      0.745
      NaN
      NaN
      0.129
      NaN
    
    
      Greece
      General government net lending/borrowing
      Percent of GDP
      NaN
      NaN
      -2.774
      NaN
      NaN
      -7.847
      NaN
      NaN
      -6.092
      NaN
      ...
      -3.531
      NaN
      NaN
      -4.054
      NaN
      NaN
      -3.080
      NaN
      NaN
      -3.359
    
  

6 rows × 111 columns

Let's get a smaller subset of this data to work with so we can see things a bit more clearly



In [42]:

    
# drop some of the index levels (think s for small)
dds = ddi.reset_index(level=[1, 3, 4], drop=True)
dds









    Out[42]:







  
    
      
      
      1980
      1981
      1982
      1983
      1984
      1985
      1986
      1987
      1988
      1989
      ...
      2007
      2008
      2009
      2010
      2011
      2012
      2013
      2014
      2015
      2016
    
    
      Variable
      ISO
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Surplus
      ARG
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      -0.056
      0.242
      -2.434
      -1.272
      -2.552
      -2.802
      -3.001
      -3.990
      -6.630
      -7.123
    
    
      Debt
      ARG
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      50.842
      43.853
      53.828
      42.624
      38.063
      39.434
      42.195
      43.587
      52.131
      51.751
    
    
      Surplus
      DEU
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      0.187
      -0.177
      -3.235
      -4.221
      -0.957
      -0.050
      -0.203
      0.278
      0.745
      0.129
    
    
      Debt
      DEU
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      63.542
      64.926
      72.430
      81.004
      78.311
      79.516
      77.058
      74.480
      70.988
      68.169
    
    
      Surplus
      GRC
      -2.774
      -7.847
      -6.092
      -6.847
      -7.976
      -10.603
      -9.559
      -8.972
      -10.675
      -13.195
      ...
      -6.707
      -10.168
      -15.152
      -11.208
      -10.243
      -6.506
      -3.531
      -4.054
      -3.080
      -3.359
    
    
      Debt
      GRC
      22.526
      26.681
      29.310
      33.591
      40.061
      46.621
      47.142
      52.413
      57.069
      59.821
      ...
      103.103
      109.416
      126.745
      146.250
      172.096
      159.565
      177.677
      180.063
      176.938
      183.436
    
  

6 rows × 37 columns



In [43]:

    
# give a name to the column labels
dds.columns.name = 'Year'  
dds









    Out[43]:







  
    
      
      Year
      1980
      1981
      1982
      1983
      1984
      1985
      1986
      1987
      1988
      1989
      ...
      2007
      2008
      2009
      2010
      2011
      2012
      2013
      2014
      2015
      2016
    
    
      Variable
      ISO
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Surplus
      ARG
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      -0.056
      0.242
      -2.434
      -1.272
      -2.552
      -2.802
      -3.001
      -3.990
      -6.630
      -7.123
    
    
      Debt
      ARG
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      50.842
      43.853
      53.828
      42.624
      38.063
      39.434
      42.195
      43.587
      52.131
      51.751
    
    
      Surplus
      DEU
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      0.187
      -0.177
      -3.235
      -4.221
      -0.957
      -0.050
      -0.203
      0.278
      0.745
      0.129
    
    
      Debt
      DEU
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      63.542
      64.926
      72.430
      81.004
      78.311
      79.516
      77.058
      74.480
      70.988
      68.169
    
    
      Surplus
      GRC
      -2.774
      -7.847
      -6.092
      -6.847
      -7.976
      -10.603
      -9.559
      -8.972
      -10.675
      -13.195
      ...
      -6.707
      -10.168
      -15.152
      -11.208
      -10.243
      -6.506
      -3.531
      -4.054
      -3.080
      -3.359
    
    
      Debt
      GRC
      22.526
      26.681
      29.310
      33.591
      40.061
      46.621
      47.142
      52.413
      57.069
      59.821
      ...
      103.103
      109.416
      126.745
      146.250
      172.096
      159.565
      177.677
      180.063
      176.938
      183.436
    
  

6 rows × 37 columns

Let's remind ourselves what we want. We want to

move the column index (Year) into the row index
move the Variable and ISO levels the other way, into the column labels.

The first one uses stack, the second one unstack.

Stacking

We stack our data, one variable on top of another, with a multi-index to keep track of what's what. In simple terms, we change the data from a wide format to a long format. The stack method takes the inner most column level and makes it the lowest row level.



In [44]:

    
# convert to long format. Notice printing is different... what `type` is ds?
ds = dds.stack()    
ds









    Out[44]:





Variable  ISO  Year
Surplus   ARG  1993     -0.025
               1994     -1.289
               1995     -2.137
               1996     -2.898
               1997     -1.890
               1998     -1.882
               1999     -3.813
               2000     -3.326
               2001     -5.424
               2002     -2.123
               2003      1.173
               2004      3.549
               2005      2.187
               2006      1.709
               2007     -0.056
               2008      0.242
               2009     -2.434
               2010     -1.272
               2011     -2.552
               2012     -2.802
               2013     -3.001
               2014     -3.990
               2015     -6.630
               2016     -7.123
Debt      ARG  1992     25.746
               1993     27.739
               1994     29.310
               1995     31.673
               1996     33.561
               1997     32.662
                        ...   
          GRC  1987     52.413
               1988     57.069
               1989     59.821
               1990     73.155
               1991     74.683
               1992     79.968
               1993    100.288
               1994     98.296
               1995     98.990
               1996    101.335
               1997     99.452
               1998     97.425
               1999     98.906
               2000    104.935
               2001    107.081
               2002    104.863
               2003    101.456
               2004    102.870
               2005    107.392
               2006    103.574
               2007    103.103
               2008    109.416
               2009    126.745
               2010    146.250
               2011    172.096
               2012    159.565
               2013    177.677
               2014    180.063
               2015    176.938
               2016    183.436
Length: 175, dtype: float64



In [45]:

    
# same thing with explicit reference to column name 
dds.stack(level='Year').head(8)









    Out[45]:





Variable  ISO  Year
Surplus   ARG  1993   -0.025
               1994   -1.289
               1995   -2.137
               1996   -2.898
               1997   -1.890
               1998   -1.882
               1999   -3.813
               2000   -3.326
dtype: float64



In [46]:

    
# or with level number 
dds.stack(level=0).head(8)









    Out[46]:





Variable  ISO  Year
Surplus   ARG  1993   -0.025
               1994   -1.289
               1995   -2.137
               1996   -2.898
               1997   -1.890
               1998   -1.882
               1999   -3.813
               2000   -3.326
dtype: float64

Unstacking

Stacking moves columns into the index, "stacking" the data up into longer columns. Unstacking does the reverse, taking levels of the row index and turning them into column labels. Roughly speaking we're rotating or pivoting the data.



In [47]:

    
# now go long to wide 
ds.unstack()               # default is lowest value wich is year now









    Out[47]:







  
    
      
      Year
      1980
      1981
      1982
      1983
      1984
      1985
      1986
      1987
      1988
      1989
      ...
      2007
      2008
      2009
      2010
      2011
      2012
      2013
      2014
      2015
      2016
    
    
      Variable
      ISO
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      Debt
      ARG
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      50.842
      43.853
      53.828
      42.624
      38.063
      39.434
      42.195
      43.587
      52.131
      51.751
    
    
      DEU
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      63.542
      64.926
      72.430
      81.004
      78.311
      79.516
      77.058
      74.480
      70.988
      68.169
    
    
      GRC
      22.526
      26.681
      29.310
      33.591
      40.061
      46.621
      47.142
      52.413
      57.069
      59.821
      ...
      103.103
      109.416
      126.745
      146.250
      172.096
      159.565
      177.677
      180.063
      176.938
      183.436
    
    
      Surplus
      ARG
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      -0.056
      0.242
      -2.434
      -1.272
      -2.552
      -2.802
      -3.001
      -3.990
      -6.630
      -7.123
    
    
      DEU
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      NaN
      ...
      0.187
      -0.177
      -3.235
      -4.221
      -0.957
      -0.050
      -0.203
      0.278
      0.745
      0.129
    
    
      GRC
      -2.774
      -7.847
      -6.092
      -6.847
      -7.976
      -10.603
      -9.559
      -8.972
      -10.675
      -13.195
      ...
      -6.707
      -10.168
      -15.152
      -11.208
      -10.243
      -6.506
      -3.531
      -4.054
      -3.080
      -3.359
    
  

6 rows × 37 columns



In [48]:

    
# different level 
ds.unstack(level='Variable')









    Out[48]:







  
    
      
      Variable
      Debt
      Surplus
    
    
      ISO
      Year
      
      
    
  
  
    
      ARG
      1992
      25.746
      NaN
    
    
      1993
      27.739
      -0.025
    
    
      1994
      29.310
      -1.289
    
    
      1995
      31.673
      -2.137
    
    
      1996
      33.561
      -2.898
    
    
      1997
      32.662
      -1.890
    
    
      1998
      35.200
      -1.882
    
    
      1999
      40.097
      -3.813
    
    
      2000
      42.060
      -3.326
    
    
      2001
      49.436
      -5.424
    
    
      2002
      152.112
      -2.123
    
    
      2003
      128.561
      1.173
    
    
      2004
      117.116
      3.549
    
    
      2005
      66.938
      2.187
    
    
      2006
      58.622
      1.709
    
    
      2007
      50.842
      -0.056
    
    
      2008
      43.853
      0.242
    
    
      2009
      53.828
      -2.434
    
    
      2010
      42.624
      -1.272
    
    
      2011
      38.063
      -2.552
    
    
      2012
      39.434
      -2.802
    
    
      2013
      42.195
      -3.001
    
    
      2014
      43.587
      -3.990
    
    
      2015
      52.131
      -6.630
    
    
      2016
      51.751
      -7.123
    
    
      DEU
      1991
      38.893
      -3.167
    
    
      1992
      41.521
      -2.594
    
    
      1993
      45.119
      -3.083
    
    
      1994
      47.421
      -2.506
    
    
      1995
      54.820
      -9.427
    
    
      ...
      ...
      ...
      ...
    
    
      GRC
      1987
      52.413
      -8.972
    
    
      1988
      57.069
      -10.675
    
    
      1989
      59.821
      -13.195
    
    
      1990
      73.155
      -14.738
    
    
      1991
      74.683
      -10.677
    
    
      1992
      79.968
      -11.806
    
    
      1993
      100.288
      -12.701
    
    
      1994
      98.296
      -9.401
    
    
      1995
      98.990
      -10.944
    
    
      1996
      101.335
      -9.098
    
    
      1997
      99.452
      -6.671
    
    
      1998
      97.425
      -6.458
    
    
      1999
      98.906
      -6.060
    
    
      2000
      104.935
      -4.113
    
    
      2001
      107.081
      -5.466
    
    
      2002
      104.863
      -6.024
    
    
      2003
      101.456
      -7.830
    
    
      2004
      102.870
      -8.828
    
    
      2005
      107.392
      -6.188
    
    
      2006
      103.574
      -5.946
    
    
      2007
      103.103
      -6.707
    
    
      2008
      109.416
      -10.168
    
    
      2009
      126.745
      -15.152
    
    
      2010
      146.250
      -11.208
    
    
      2011
      172.096
      -10.243
    
    
      2012
      159.565
      -6.506
    
    
      2013
      177.677
      -3.531
    
    
      2014
      180.063
      -4.054
    
    
      2015
      176.938
      -3.080
    
    
      2016
      183.436
      -3.359
    
  

88 rows × 2 columns



In [49]:

    
# or two at once 
ds.unstack(level=['Variable', 'ISO'])



In [ ]:

Exercise. Run the code below and explain what each line of code does.



In [50]:

    
# stacked dataframe 
ds.head(8)









    Out[50]:





Variable  ISO  Year
Surplus   ARG  1993   -0.025
               1994   -1.289
               1995   -2.137
               1996   -2.898
               1997   -1.890
               1998   -1.882
               1999   -3.813
               2000   -3.326
dtype: float64



In [51]:

    
du1 = ds.unstack()



In [52]:

    
du2 = du1.unstack()

Exercise (challenging). Take the unstacked dataframe dds. Use some combination of stack, unstack, and plot to plot the variable Surplus against Year for all three countries. Challenging mostly because you need to work out the steps by yourself.

SOL



In [ ]:

Pivoting

The pivot method: a short cut to some kinds of unstacking. In rough terms, it takes a wide dataframe and constructs a long one. The inputs are columns, not index levels.

Example: BDS data

The Census's Business Dynamnics Statistics collects annual information about the hiring decisions of firms by size and age. This table list the number of firms and total employment by employment size categories: 1 to 4 employees, 5 to 9, and so on.

Apply want operator. Our want is to plot total employment (the variable Emp) against size (variable fsize). Both are columns in the original data.

Here we construct a subset of the data, where we look at two years rather than the whole 1976-2013 period.



In [53]:

    
url = 'http://www2.census.gov/ces/bds/firm/bds_f_sz_release.csv'
raw = pd.read_csv(url)
raw.head()









    Out[53]:







  
    
      
      year2
      fsize
      Firms
      Estabs
      Emp
      Denom
      Estabs_Entry
      Estabs_Entry_Rate
      Estabs_Exit
      Estabs_Exit_Rate
      ...
      Job_Destruction_Rate_Deaths
      Job_Destruction_Rate
      Net_Job_Creation
      Net_Job_Creation_Rate
      Reallocation_Rate
      d_flag
      firmdeath_firms
      firmdeath_estabs
      firmdeath_emp
      sic1
    
  
  
    
      0
      1977
      a) 1 to 4
      2024233
      2032018
      4491493
      4346068
      480220
      24.1
      391696
      19.6
      ...
      22.0
      32.3
      290850
      6.7
      64.6
      0
      296754.0
      297198.0
      750459.0
      NaN
    
    
      1
      1977
      b) 5 to 9
      686103
      708931
      4663342
      4550280
      67327
      9.6
      45318
      6.5
      ...
      11.2
      22.5
      226125
      5.0
      45.0
      0
      31761.0
      32303.0
      390633.0
      NaN
    
    
      2
      1977
      c) 10 to 19
      381170
      418905
      5342410
      5220630
      40578
      9.9
      23785
      5.8
      ...
      9.4
      19.5
      243560
      4.7
      39.0
      0
      13765.0
      14205.0
      359036.0
      NaN
    
    
      3
      1977
      d) 20 to 49
      220469
      281988
      6900793
      6748250
      29488
      10.7
      15544
      5.7
      ...
      8.0
      17.9
      305086
      4.5
      35.8
      0
      6547.0
      6991.0
      377642.0
      NaN
    
    
      4
      1977
      e) 50 to 99
      67179
      117374
      4751217
      4635094
      13373
      11.8
      6247
      5.5
      ...
      6.3
      15.7
      232247
      5.0
      31.4
      0
      1382.0
      1582.0
      185565.0
      NaN
    
  

5 rows × 28 columns



In [54]:

    
# Four size categories
sizes = ['a) 1 to 4', 'b) 5 to 9', 'c) 10 to 19', 'd) 20 to 49']

# only defined size categories and only period since 2012
restricted_sample = (raw['year2']>=2012) & raw['fsize'].isin(sizes)
# don't need all variables
var_names = ['year2', 'fsize', 'Firms', 'Emp']

bds = raw[restricted_sample][var_names]
bds









    Out[54]:







  
    
      
      year2
      fsize
      Firms
      Emp
    
  
  
    
      420
      2012
      a) 1 to 4
      2799853
      5960532
    
    
      421
      2012
      b) 5 to 9
      1013266
      6685019
    
    
      422
      2012
      c) 10 to 19
      593944
      8096882
    
    
      423
      2012
      d) 20 to 49
      367271
      11265423
    
    
      432
      2013
      a) 1 to 4
      2817967
      5951884
    
    
      433
      2013
      b) 5 to 9
      1018276
      6703704
    
    
      434
      2013
      c) 10 to 19
      597802
      8156276
    
    
      435
      2013
      d) 20 to 49
      373209
      11441432
    
    
      444
      2014
      a) 1 to 4
      2832845
      5974204
    
    
      445
      2014
      b) 5 to 9
      1019909
      6722195
    
    
      446
      2014
      c) 10 to 19
      605354
      8251287
    
    
      447
      2014
      d) 20 to 49
      379912
      11679693

Pivoting the data

Let's think specifically about what we want. We want to graph Emp against fsize for (say) 2013. This calls for:

The index should be the size categories fsize.
The column labels should be the entries of year2, namely 2012, 2013 and `2014.
The data should come from the variable Emp.

These inputs translate directly into the following pivot method:



In [55]:

    
bdsp = bds.pivot(index='fsize', columns='year2', values='Emp')



In [56]:

    
# divide by a million so bars aren't too long
bdsp = bdsp/10**6 
bdsp









    Out[56]:







  
    
      year2
      2012
      2013
      2014
    
    
      fsize
      
      
      
    
  
  
    
      a) 1 to 4
      5.960532
      5.951884
      5.974204
    
    
      b) 5 to 9
      6.685019
      6.703704
      6.722195
    
    
      c) 10 to 19
      8.096882
      8.156276
      8.251287
    
    
      d) 20 to 49
      11.265423
      11.441432
      11.679693

Comment. Note that all the parameters here are columns. That's not a choice, it's the way the the pivot method is written.

We do a plot for fun:



In [57]:

    
# plot 2013 as bar chart 
fig, ax = plt.subplots()
bdsp[2013].plot(ax=ax, kind='barh')
ax.set_ylabel('')
ax.set_xlabel('Number of Employees (millions)')









    Out[57]:





Text(0.5,0,'Number of Employees (millions)')



In [ ]:



In [ ]:

Review

We return to the OECD's healthcare data, specifically a subset of their table on the number of doctors per one thousand population. This loads and cleans the data:



In [58]:

    
url1 = 'http://www.oecd.org/health/health-systems/'
url2 = 'OECD-Health-Statistics-2017-Frequently-Requested-Data.xls'
docs = pd.read_excel(url1+url2, 
                     skiprows=3, 
                     usecols=[0, 51, 52, 53, 54, 55, 57],
                     sheetname='Physicians', 
                     na_values=['..'],
                     skip_footer=21) 
# rename country variable 
names = list(docs)
docs = docs.rename(columns={names[0]: 'Country'})
# strip footnote numbers from country names 
docs['Country'] = docs['Country'].str.rsplit(n=1).str.get(0)
docs = docs.head()
docs









    



/Users/wasserman/anaconda3/lib/python3.6/site-packages/pandas/util/_decorators.py:118: FutureWarning: The `sheetname` keyword is deprecated, use `sheet_name` instead
  return func(*args, **kwargs)






    Out[58]:







  
    
      
      Country
      2009
      2010
      2011
      2012
      2013
      2015
    
  
  
    
      0
      Australia
      3.12
      NaN
      3.32
      3.31
      3.39
      3.52
    
    
      1
      Austria
      4.69
      4.80
      4.84
      4.90
      4.99
      5.10
    
    
      2
      Belgium
      2.92
      2.92
      2.91
      2.93
      2.95
      3.02
    
    
      3
      Canada
      2.34
      2.38
      2.46
      2.51
      2.57
      2.66
    
    
      4
      Chile
      NaN
      1.43
      1.58
      1.74
      1.87
      2.14

Use this data to:

Set the index as Country.
Construct a horizontal bar chart of the number of doctors in each country in "2013 (or nearest year)".
Apply the drop method to docs to create a dataframe new that's missing the last column.
Challenging. Use stack and unstack to "pivot" the data so that columns are labeled by country names and rows are labeled by year. This is challenging because we have left out the intermediate steps.
Plot the number of doctors over time in each country as a line in the same plot.

Comment. In the last plot, the x axis labels are non-intuitive. Ignore that.



In [ ]:



In [ ]:



In [ ]:

Resources

Far and away the best material on this subject is Brandon Rhodes' 2015 Pycon presentation. 2 hours and 25 minutes and worth every second.

Video: https://youtu.be/5JnMutdy6Fw
Materials: https://github.com/brandon-rhodes/pycon-pandas-tutorial
Outline: https://github.com/brandon-rhodes/pycon-pandas-tutorial/blob/master/script.txt



In [ ]:

    
#



In [ ]:

	ISO	Variable	Country	Description	Units	1980	1981	1982	1983	1984	...	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016
252	ARG	Surplus	Argentina	General government net lending/borrowing	Percent of GDP	NaN	NaN	NaN	NaN	NaN	...	-0.056	0.242	-2.434	-1.272	-2.552	-2.802	-3.001	-3.990	-6.630	-7.123
260	ARG	Debt	Argentina	General government gross debt	Percent of GDP	NaN	NaN	NaN	NaN	NaN	...	50.842	43.853	53.828	42.624	38.063	39.434	42.195	43.587	52.131	51.751
2716	DEU	Surplus	Germany	General government net lending/borrowing	Percent of GDP	NaN	NaN	NaN	NaN	NaN	...	0.187	-0.177	-3.235	-4.221	-0.957	-0.050	-0.203	0.278	0.745	0.129
2724	DEU	Debt	Germany	General government gross debt	Percent of GDP	NaN	NaN	NaN	NaN	NaN	...	63.542	64.926	72.430	81.004	78.311	79.516	77.058	74.480	70.988	68.169
2804	GRC	Surplus	Greece	General government net lending/borrowing	Percent of GDP	-2.774	-7.847	-6.092	-6.847	-7.976	...	-6.707	-10.168	-15.152	-11.208	-10.243	-6.506	-3.531	-4.054	-3.080	-3.359
2812	GRC	Debt	Greece	General government gross debt	Percent of GDP	22.526	26.681	29.310	33.591	40.061	...	103.103	109.416	126.745	146.250	172.096	159.565	177.677	180.063	176.938	183.436

	252	260	2716	2724	2804	2812
ISO	ARG	ARG	DEU	DEU	GRC	GRC
Variable	Surplus	Debt	Surplus	Debt	Surplus	Debt
Country	Argentina	Argentina	Germany	Germany	Greece	Greece
Description	General government net lending/borrowing	General government gross debt	General government net lending/borrowing	General government gross debt	General government net lending/borrowing	General government gross debt
Units	Percent of GDP	Percent of GDP	Percent of GDP	Percent of GDP	Percent of GDP	Percent of GDP
1980	NaN	NaN	NaN	NaN	-2.774	22.526
1981	NaN	NaN	NaN	NaN	-7.847	26.681
1982	NaN	NaN	NaN	NaN	-6.092	29.31
1983	NaN	NaN	NaN	NaN	-6.847	33.591
1984	NaN	NaN	NaN	NaN	-7.976	40.061
1985	NaN	NaN	NaN	NaN	-10.603	46.621
1986	NaN	NaN	NaN	NaN	-9.559	47.142
1987	NaN	NaN	NaN	NaN	-8.972	52.413
1988	NaN	NaN	NaN	NaN	-10.675	57.069
1989	NaN	NaN	NaN	NaN	-13.195	59.821
1990	NaN	NaN	NaN	NaN	-14.738	73.155
1991	NaN	NaN	-3.167	38.893	-10.677	74.683
1992	NaN	25.746	-2.594	41.521	-11.806	79.968
1993	-0.025	27.739	-3.083	45.119	-12.701	100.288
1994	-1.289	29.31	-2.506	47.421	-9.401	98.296
1995	-2.137	31.673	-9.427	54.82	-10.944	98.99
1996	-2.898	33.561	-3.542	57.598	-9.098	101.335
1997	-1.89	32.662	-2.945	58.773	-6.671	99.452
1998	-1.882	35.2	-2.532	59.385	-6.458	97.425
1999	-3.813	40.097	-1.698	59.962	-6.06	98.906
2000	-3.326	42.06	0.859	58.758	-4.113	104.935
2001	-5.424	49.436	-3.112	57.582	-5.466	107.081
2002	-2.123	152.112	-3.944	59.22	-6.024	104.863
2003	1.173	128.561	-4.176	62.921	-7.83	101.456
2004	3.549	117.116	-3.741	64.689	-8.828	102.87
2005	2.187	66.938	-3.417	66.909	-6.188	107.392
2006	1.709	58.622	-1.722	66.333	-5.946	103.574
2007	-0.056	50.842	0.187	63.542	-6.707	103.103
2008	0.242	43.853	-0.177	64.926	-10.168	109.416
2009	-2.434	53.828	-3.235	72.43	-15.152	126.745
2010	-1.272	42.624	-4.221	81.004	-11.208	146.25
2011	-2.552	38.063	-0.957	78.311	-10.243	172.096
2012	-2.802	39.434	-0.05	79.516	-6.506	159.565
2013	-3.001	42.195	-0.203	77.058	-3.531	177.677
2014	-3.99	43.587	0.278	74.48	-4.054	180.063
2015	-6.63	52.131	0.745	70.988	-3.08	176.938
2016	-7.123	51.751	0.129	68.169	-3.359	183.436

	Variable	Debt	Surplus
ISO	Year
ARG	1992	25.746	NaN
	1993	27.739	-0.025
	1994	29.310	-1.289
	1995	31.673	-2.137
	1996	33.561	-2.898
	1997	32.662	-1.890
	1998	35.200	-1.882
	1999	40.097	-3.813
	2000	42.060	-3.326
	2001	49.436	-5.424
	2002	152.112	-2.123
	2003	128.561	1.173
	2004	117.116	3.549
	2005	66.938	2.187
	2006	58.622	1.709
	2007	50.842	-0.056
	2008	43.853	0.242
	2009	53.828	-2.434
	2010	42.624	-1.272
	2011	38.063	-2.552
	2012	39.434	-2.802
	2013	42.195	-3.001
	2014	43.587	-3.990
	2015	52.131	-6.630
	2016	51.751	-7.123
DEU	1991	38.893	-3.167
	1992	41.521	-2.594
	1993	45.119	-3.083
	1994	47.421	-2.506
	1995	54.820	-9.427
...	...	...	...
GRC	1987	52.413	-8.972
	1988	57.069	-10.675
	1989	59.821	-13.195
	1990	73.155	-14.738
	1991	74.683	-10.677
	1992	79.968	-11.806
	1993	100.288	-12.701
	1994	98.296	-9.401
	1995	98.990	-10.944
	1996	101.335	-9.098
	1997	99.452	-6.671
	1998	97.425	-6.458
	1999	98.906	-6.060
	2000	104.935	-4.113
	2001	107.081	-5.466
	2002	104.863	-6.024
	2003	101.456	-7.830
	2004	102.870	-8.828
	2005	107.392	-6.188
	2006	103.574	-5.946
	2007	103.103	-6.707
	2008	109.416	-10.168
	2009	126.745	-15.152
	2010	146.250	-11.208
	2011	172.096	-10.243
	2012	159.565	-6.506
	2013	177.677	-3.531
	2014	180.063	-4.054
	2015	176.938	-3.080
	2016	183.436	-3.359

	year2	fsize	Firms	Estabs	Emp	Denom	Estabs_Entry	Estabs_Entry_Rate	Estabs_Exit	Estabs_Exit_Rate	...	Job_Destruction_Rate_Deaths	Job_Destruction_Rate	Net_Job_Creation	Net_Job_Creation_Rate	Reallocation_Rate	firmdeath_firms	firmdeath_estabs	firmdeath_emp	sic1
0	1977	a) 1 to 4	2024233	2032018	4491493	4346068	480220	24.1	391696	19.6	...	22.0	32.3	290850	6.7	64.6	296754.0	297198.0	750459.0	NaN
1	1977	b) 5 to 9	686103	708931	4663342	4550280	67327	9.6	45318	6.5	...	11.2	22.5	226125	5.0	45.0	31761.0	32303.0	390633.0	NaN
2	1977	c) 10 to 19	381170	418905	5342410	5220630	40578	9.9	23785	5.8	...	9.4	19.5	243560	4.7	39.0	13765.0	14205.0	359036.0	NaN
3	1977	d) 20 to 49	220469	281988	6900793	6748250	29488	10.7	15544	5.7	...	8.0	17.9	305086	4.5	35.8	6547.0	6991.0	377642.0	NaN
4	1977	e) 50 to 99	67179	117374	4751217	4635094	13373	11.8	6247	5.5	...	6.3	15.7	232247	5.0	31.4	1382.0	1582.0	185565.0	NaN

	year2	fsize	Firms	Emp
420	2012	a) 1 to 4	2799853	5960532
421	2012	b) 5 to 9	1013266	6685019
422	2012	c) 10 to 19	593944	8096882
423	2012	d) 20 to 49	367271	11265423
432	2013	a) 1 to 4	2817967	5951884
433	2013	b) 5 to 9	1018276	6703704
434	2013	c) 10 to 19	597802	8156276
435	2013	d) 20 to 49	373209	11441432
444	2014	a) 1 to 4	2832845	5974204
445	2014	b) 5 to 9	1019909	6722195
446	2014	c) 10 to 19	605354	8251287
447	2014	d) 20 to 49	379912	11679693

year2	2012	2013	2014
fsize
a) 1 to 4	5.960532	5.951884	5.974204
b) 5 to 9	6.685019	6.703704	6.722195
c) 10 to 19	8.096882	8.156276	8.251287
d) 20 to 49	11.265423	11.441432	11.679693

	Country	2009	2010	2011	2012	2013	2015
0	Australia	3.12	NaN	3.32	3.31	3.39	3.52
1	Austria	4.69	4.80	4.84	4.90	4.99	5.10
2	Belgium	2.92	2.92	2.91	2.93	2.95	3.02
3	Canada	2.34	2.38	2.46	2.51	2.57	2.66
4	Chile	NaN	1.43	1.58	1.74	1.87	2.14