Pandas 3: Shaping data

The second in a series of notebooks that describe Pandas' powerful data management tools. This one covers shaping methods: switching rows and columns, pivoting, and stacking. We'll see that this is all about the indexes: the row and column labels.

Outline:

More data management topics coming.

Note: requires internet access to run.

This IPython notebook was created by Dave Backus, Chase Coleman, and Spencer Lyon for the NYU Stern course Data Bootcamp.

Preliminaries

Import packages, etc.


In [18]:
import sys                             # system module 
import pandas as pd                    # data package
import matplotlib.pyplot as plt        # graphics module  
import datetime as dt                  # date and time module
import numpy as np                     # foundation for Pandas 

%matplotlib inline                     

# check versions (overkill, but why not?)
print('Python version:', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())


Python version: 3.5.1 |Anaconda 4.0.0 (64-bit)| (default, Feb 16 2016, 09:49:46) [MSC v.1900 64 bit (AMD64)]
Pandas version:  0.18.0
Today:  2016-04-14

Example: WEO debt and deficits

We spend most of our time on one of the examples from the previous notebook. The problem in this example is that variables run across rows, rather than down columns. Our want is to flip some of the rows and columns so that we can plot the data against time. The question is how.

We use a small subset of the IMF's World Economic Outlook database that contains two variables and three countries.


In [19]:
url1 = 'http://www.imf.org/external/pubs/ft/weo/2015/02/weodata/'
url2 = 'WEOOct2015all.xls'
url = url1 + url2 
weo = pd.read_csv(url, sep='\t', 
                  usecols=[1,2,3,4,6,40,41,42], 
                  thousands=',', 
                  na_values=['n/a', '--']) 
print('Variable dtypes:\n', weo.dtypes, sep='')


Variable dtypes:
ISO                    object
WEO Subject Code       object
Country                object
Subject Descriptor     object
Units                  object
2011                  float64
2012                  float64
2013                  float64
dtype: object

In [20]:
# create debt and deficits dataframe:  two variables and three countries 
variables = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
countries = ['ARG', 'DEU', 'GRC']
dd = weo[weo['WEO Subject Code'].isin(variables) & weo['ISO'].isin(countries)]

# change column labels to something more intuitive 
dd = dd.rename(columns={'WEO Subject Code': 'Variable', 
                        'Subject Descriptor': 'Description'})

# rename variables 
dd['Variable'] = dd['Variable'].replace(to_replace=['GGXWDG_NGDP', 'GGXCNL_NGDP'], 
                                        value=['Debt', 'Surplus'])
dd


Out[20]:
ISO Variable Country Description Units 2011 2012 2013
252 ARG Surplus Argentina General government net lending/borrowing Percent of GDP -1.949 -2.359 -1.997
260 ARG Debt Argentina General government gross debt Percent of GDP 35.763 37.328 40.217
2716 DEU Surplus Germany General government net lending/borrowing Percent of GDP -0.861 0.091 0.133
2724 DEU Debt Germany General government gross debt Percent of GDP 77.928 79.311 76.978
2804 GRC Surplus Greece General government net lending/borrowing Percent of GDP -10.215 -6.406 -2.944
2812 GRC Debt Greece General government gross debt Percent of GDP 170.960 156.494 174.951

In [ ]:

Reminders

What kind of object does each of the following produce?


In [21]:
dd.index


Out[21]:
Int64Index([252, 260, 2716, 2724, 2804, 2812], dtype='int64')

In [22]:
dd.columns


Out[22]:
Index(['ISO', 'Variable', 'Country', 'Description', 'Units', '2011', '2012',
       '2013'],
      dtype='object')

In [23]:
dd['ISO']


Out[23]:
252     ARG
260     ARG
2716    DEU
2724    DEU
2804    GRC
2812    GRC
Name: ISO, dtype: object

In [24]:
dd[['ISO', 'Variable']]


Out[24]:
ISO Variable
252 ARG Surplus
260 ARG Debt
2716 DEU Surplus
2724 DEU Debt
2804 GRC Surplus
2812 GRC Debt

In [25]:
dd[dd['ISO'] == 'ARG']


Out[25]:
ISO Variable Country Description Units 2011 2012 2013
252 ARG Surplus Argentina General government net lending/borrowing Percent of GDP -1.949 -2.359 -1.997
260 ARG Debt Argentina General government gross debt Percent of GDP 35.763 37.328 40.217

In [ ]:

Wants

We might imagine doing several different things with this data:

  • Plot a specific variable (debt or surplus) for a given date.
  • Time series plots for a specific country.
  • Time series plots for a specific variable.

Depending on which we want, we might organize the data differently. We'll focus on the last two.

Here's a brute force approach to the problem: simply transpose the data. This is where that leads:


In [26]:
dd.T


Out[26]:
252 260 2716 2724 2804 2812
ISO ARG ARG DEU DEU GRC GRC
Variable Surplus Debt Surplus Debt Surplus Debt
Country Argentina Argentina Germany Germany Greece Greece
Description General government net lending/borrowing General government gross debt General government net lending/borrowing General government gross debt General government net lending/borrowing General government gross debt
Units Percent of GDP Percent of GDP Percent of GDP Percent of GDP Percent of GDP Percent of GDP
2011 -1.949 35.763 -0.861 77.928 -10.215 170.96
2012 -2.359 37.328 0.091 79.311 -6.406 156.494
2013 -1.997 40.217 0.133 76.978 -2.944 174.951

Comments. The problem here is that the columns include both the numbers (which we want to plot) and some descriptive information (which we don't).


In [ ]:

Setting and resetting the index

We start by setting and resetting the index. That may sound like a step backwards -- haven't we done this already? -- but it reminds us of some things that will be handy later.

Take the dataframe dd. What would we like in the index? Evenutally we'd like the dates [2011, 2012, 2013], but right now the row labels are more naturally the variable or country. Here are some varriants.

Setting the index


In [27]:
dd.set_index('Country')


Out[27]:
ISO Variable Description Units 2011 2012 2013
Country
Argentina ARG Surplus General government net lending/borrowing Percent of GDP -1.949 -2.359 -1.997
Argentina ARG Debt General government gross debt Percent of GDP 35.763 37.328 40.217
Germany DEU Surplus General government net lending/borrowing Percent of GDP -0.861 0.091 0.133
Germany DEU Debt General government gross debt Percent of GDP 77.928 79.311 76.978
Greece GRC Surplus General government net lending/borrowing Percent of GDP -10.215 -6.406 -2.944
Greece GRC Debt General government gross debt Percent of GDP 170.960 156.494 174.951

In [28]:
# we can do the same thing with a list, which will be meaningful soon...
dd.set_index(['Country'])


Out[28]:
ISO Variable Description Units 2011 2012 2013
Country
Argentina ARG Surplus General government net lending/borrowing Percent of GDP -1.949 -2.359 -1.997
Argentina ARG Debt General government gross debt Percent of GDP 35.763 37.328 40.217
Germany DEU Surplus General government net lending/borrowing Percent of GDP -0.861 0.091 0.133
Germany DEU Debt General government gross debt Percent of GDP 77.928 79.311 76.978
Greece GRC Surplus General government net lending/borrowing Percent of GDP -10.215 -6.406 -2.944
Greece GRC Debt General government gross debt Percent of GDP 170.960 156.494 174.951

Exercise. Set Variable as the index.


In [ ]:

Comment. Note that the new index brought its name along: Country in the two examples, Variable in the exercise. That's incredibly useful because we can refer to index levels by name. If we happen to have an index without a name, we can set it with

df.index.name = 'Whatever name we like'

Multi-indexes

We can put more than one variable in an index, which gives us a multi-index. This is sometimes called a hierarchical index because the levels of the index (as they're called) are ordered.

Multi-indexes are more common than you might think. One reason is that data itself is often multi-dimensional. A typical spreadsheet has two dimensions: the variable and the observation. The WEO data is naturally three dimensional: the variable, the year, and the country. (Think about that for a minute, it's deeper than it sounds.)

The problem we're having is fitting this nicely into two dimensions. A multi-index allows us to manage that. A two-dimensional index would work here -- the country and the variable code -- but right now we have some redundancy.

Example. We push all the descriptive, non-numerical columns into the index, leaving the dataframe itself with only numbers, which seems like a step in thee right direction.


In [29]:
ddi = dd.set_index(['Variable', 'Country', 'ISO', 'Description', 'Units'])
ddi


Out[29]:
2011 2012 2013
Variable Country ISO Description Units
Surplus Argentina ARG General government net lending/borrowing Percent of GDP -1.949 -2.359 -1.997
Debt Argentina ARG General government gross debt Percent of GDP 35.763 37.328 40.217
Surplus Germany DEU General government net lending/borrowing Percent of GDP -0.861 0.091 0.133
Debt Germany DEU General government gross debt Percent of GDP 77.928 79.311 76.978
Surplus Greece GRC General government net lending/borrowing Percent of GDP -10.215 -6.406 -2.944
Debt Greece GRC General government gross debt Percent of GDP 170.960 156.494 174.951

Let's take a closer look at the index


In [30]:
ddi.index


Out[30]:
MultiIndex(levels=[['Debt', 'Surplus'], ['Argentina', 'Germany', 'Greece'], ['ARG', 'DEU', 'GRC'], ['General government gross debt', 'General government net lending/borrowing'], ['Percent of GDP']],
           labels=[[1, 0, 1, 0, 1, 0], [0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 2, 2], [1, 0, 1, 0, 1, 0], [0, 0, 0, 0, 0, 0]],
           names=['Variable', 'Country', 'ISO', 'Description', 'Units'])

That's a lot to process, so we break it into pieces.

  • ddi.index.names contains a list of level names. (Remind yourself that lists are ordered, so this tracks levels.)
  • ddi.index.levels contains the values in each level.

Here's what they like like here:


In [31]:
# Chase and Spencer like double quotes 
print("The level names are:\n", ddi.index.names, "\n", sep="")
print("The levels (aka level values) are:\n", ddi.index.levels, sep="")


The level names are:
['Variable', 'Country', 'ISO', 'Description', 'Units']

The levels (aka level values) are:
[['Debt', 'Surplus'], ['Argentina', 'Germany', 'Greece'], ['ARG', 'DEU', 'GRC'], ['General government gross debt', 'General government net lending/borrowing'], ['Percent of GDP']]

Knowing the order of the index components and being able to inspect their values and names is fundamental to working with a multi-index.

Resetting the index

We've seen that set_index pushes columns into the index. Here we see that reset_index does the reverse: it pushes components of the index back to the columns.

Example.


In [32]:
ddi.head(2)


Out[32]:
2011 2012 2013
Variable Country ISO Description Units
Surplus Argentina ARG General government net lending/borrowing Percent of GDP -1.949 -2.359 -1.997
Debt Argentina ARG General government gross debt Percent of GDP 35.763 37.328 40.217

In [33]:
ddi.reset_index()


Out[33]:
Variable Country ISO Description Units 2011 2012 2013
0 Surplus Argentina ARG General government net lending/borrowing Percent of GDP -1.949 -2.359 -1.997
1 Debt Argentina ARG General government gross debt Percent of GDP 35.763 37.328 40.217
2 Surplus Germany DEU General government net lending/borrowing Percent of GDP -0.861 0.091 0.133
3 Debt Germany DEU General government gross debt Percent of GDP 77.928 79.311 76.978
4 Surplus Greece GRC General government net lending/borrowing Percent of GDP -10.215 -6.406 -2.944
5 Debt Greece GRC General government gross debt Percent of GDP 170.960 156.494 174.951

In [34]:
# or we can reset the index by level 
ddi.reset_index(level=1).head(2)


Out[34]:
Country 2011 2012 2013
Variable ISO Description Units
Surplus ARG General government net lending/borrowing Percent of GDP Argentina -1.949 -2.359 -1.997
Debt ARG General government gross debt Percent of GDP Argentina 35.763 37.328 40.217

In [35]:
# or by name 
ddi.reset_index(level='Country').head(2)


Out[35]:
Country 2011 2012 2013
Variable ISO Description Units
Surplus ARG General government net lending/borrowing Percent of GDP Argentina -1.949 -2.359 -1.997
Debt ARG General government gross debt Percent of GDP Argentina 35.763 37.328 40.217

In [36]:
# or do more than one at a time
ddi.reset_index(level=[1,3]).head(2)


Out[36]:
Country Description 2011 2012 2013
Variable ISO Units
Surplus ARG Percent of GDP Argentina General government net lending/borrowing -1.949 -2.359 -1.997
Debt ARG Percent of GDP Argentina General government gross debt 35.763 37.328 40.217

Comment. By default, reset_index pushes one or more index levels into columns. If we want to discard that level of the index altogether, we use the parameter drop=True.


In [37]:
ddi.reset_index(level=[1,3], drop=True).head(2)


Out[37]:
2011 2012 2013
Variable ISO Units
Surplus ARG Percent of GDP -1.949 -2.359 -1.997
Debt ARG Percent of GDP 35.763 37.328 40.217

Exercise. For the dataframe ddi:

  • Use the reset_index method to move the Units level of the index to a column of the dataframe.
  • Use the drop parameter of reset_index to delete Units from the dataframe.

In [ ]:


In [ ]:

Switching rows and columns

If we take the dataframe ddi, we see that the everything's been put into the index but the data itself. Perhaps we can get what we want if we just flip the rows and columns. Roughly speaking, we refer to this as pivoting.

First look at switching rows and columns

The simplest way to flip rows and columns is to use the T or transpose property. When we do that, we end up with a lot of stuff in the column labels, as the multi-index for the rows gets rotated into the columns. Other than that, we're good. We can even do a plot. The only problem is all the stuff we've pushed into the column labels -- it's kind of a mess.


In [38]:
ddt = ddi.T
ddt


Out[38]:
Variable Surplus Debt Surplus Debt Surplus Debt
Country Argentina Argentina Germany Germany Greece Greece
ISO ARG ARG DEU DEU GRC GRC
Description General government net lending/borrowing General government gross debt General government net lending/borrowing General government gross debt General government net lending/borrowing General government gross debt
Units Percent of GDP Percent of GDP Percent of GDP Percent of GDP Percent of GDP Percent of GDP
2011 -1.949 35.763 -0.861 77.928 -10.215 170.960
2012 -2.359 37.328 0.091 79.311 -6.406 156.494
2013 -1.997 40.217 0.133 76.978 -2.944 174.951

Comment. We see here that the multi-index for the rows has been turned into a multi-index for the columns. Works the same way.

The only problem here is that the column labels are more complicated than we might want. Here, for example, is what we get with the plot method. As usual, .plot() plots all the columns of the dataframe, but here that means we're mixing variables. And the legend contains all the levels of the column labels.


In [39]:
ddt.plot()


Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x91a4fd0>

Comment. Ooooh, that's ugly! We're on the right track, but evidently not there yet.


In [ ]:

Referring to variables with a multi-index

Can we refer to variables in the same way? Sort of, as long as we refer to the top level of the column index. It gives us a dataframe that's a subset of the original one.

Let's try each of these:

  • ddt['Debt']
  • ddt['Debt']['Argentina']
  • ddt['Debt', 'Argentina']
  • ddt['ARG']

What do you see? What's going on? The theme is that we can reference the top level, which in ddi is the Variable. If we try to access a lower level, it bombs.


In [40]:
# indexing by variable 
debt = ddt['Debt']

Exercise. With the dataframe ddt:

  • What type of object is Debt?
  • Construct a line plot of Debt over time with one line for each country.

In [ ]:


In [ ]:


In [ ]:

Example. Let's do this together. How would we fix up the legend? What approaches cross your mind? (No code, just the general approach.)


In [ ]:

Swapping levels

Since variables refer to the first level of the column index, it's not clear how we would group data by country. Suppose, for example, we wanted to plot Debt and Surplus for a specific country. What would we do?

One way to do that is to make the country the top level with the swaplevel method. Note the axis parameter. With axis=1 we swap column levels, with axis=0 (the default) we swap row levels.


In [41]:
ddts = ddt.swaplevel(0,1, axis=1) 
ddts


Out[41]:
Country Argentina Germany Greece
Variable Surplus Debt Surplus Debt Surplus Debt
ISO ARG ARG DEU DEU GRC GRC
Description General government net lending/borrowing General government gross debt General government net lending/borrowing General government gross debt General government net lending/borrowing General government gross debt
Units Percent of GDP Percent of GDP Percent of GDP Percent of GDP Percent of GDP Percent of GDP
2011 -1.949 35.763 -0.861 77.928 -10.215 170.960
2012 -2.359 37.328 0.091 79.311 -6.406 156.494
2013 -1.997 40.217 0.133 76.978 -2.944 174.951

Exercise. Use the dataframe ddts to plot Debt and Surplus across time for Argentina. Hint: In the plot method, set subplots=True so that each variable is in a separate subplot.


In [ ]:

The xs method

Another approach to extracting data that cuts across levels of the row or column index: the xs method. This is recent addition tpo Pandas and an extremely good method once you get the hang of it.

The basic syntax is df.xs(item, axis=X, level=N), where N is the name or number of an index level and X describes if we are extracting from the index or column names. Setting X=0 (so axis=0) will slice up the data along the index, X=1 extracts data for column labels.

Here's how we could use xs to get the Argentina data without swapping the level of the column labels


In [42]:
#ddt.xs?

In [43]:
ddt.xs("Argentina", axis=1, level="Country")


Out[43]:
Variable Surplus Debt
ISO ARG ARG
Description General government net lending/borrowing General government gross debt
Units Percent of GDP Percent of GDP
2011 -1.949 35.763
2012 -2.359 37.328
2013 -1.997 40.217

In [44]:
ddt.xs("Argentina", axis=1, level="Country")["Debt"]


Out[44]:
ISO ARG
Description General government gross debt
Units Percent of GDP
2011 35.763
2012 37.328
2013 40.217

Exercise. Use a combination of xs and standard slicing with [...] to extract the variable Debt for Greece.


In [ ]:

Exercise. Use the dataframe ddt -- and the xs method -- to plot Debt and Surplus across time for Argentina.


In [ ]:

Stacking and unstacking

The set_index and reset_index methods work on the row labels -- the index. They move columns to the index and the reverse. The stack and unstack methods move index levels to and from column levels:

  • stack stacks the data up, moving the columns to the index and creating a long dataframe.
  • unstack does the reverse, moving columns or index levels into the column labels and creating a wide dataframe.

We use both to shape (or reshape) our data. We use set_index to push things into the index. And then use reset_index to push some of them back to the columns. That gives us pretty fine-grainded control over the shape of our data.

We start by simplifying our initial dataframe.


In [45]:
# drop some of the index levels (think s for small)
dds = ddi.reset_index(level=[1,3,4], drop=True)
# give a name to the column labels
dds.columns.name = 'Year'  
dds


Out[45]:
Year 2011 2012 2013
Variable ISO
Surplus ARG -1.949 -2.359 -1.997
Debt ARG 35.763 37.328 40.217
Surplus DEU -0.861 0.091 0.133
Debt DEU 77.928 79.311 76.978
Surplus GRC -10.215 -6.406 -2.944
Debt GRC 170.960 156.494 174.951

Let's remind ourselves what we want. We want to

  • move the column index (Year) into the row index
  • move the Variable and ISO levels the other way, into the column labels.

The first one uses stack, the second one unstack.


In [ ]:

Stacking

We stack our data, one variable on top of another, with a multi-index to keep track of what's what. In simple terms, we change the data from a wide format to a long format. The stack method takes the lowest column level and makes it the lowest row level.


In [46]:
# convert to long format. Notice printing is different... what `type` is ds?
ds = dds.stack()    
ds


Out[46]:
Variable  ISO  Year
Surplus   ARG  2011     -1.949
               2012     -2.359
               2013     -1.997
Debt      ARG  2011     35.763
               2012     37.328
               2013     40.217
Surplus   DEU  2011     -0.861
               2012      0.091
               2013      0.133
Debt      DEU  2011     77.928
               2012     79.311
               2013     76.978
Surplus   GRC  2011    -10.215
               2012     -6.406
               2013     -2.944
Debt      GRC  2011    170.960
               2012    156.494
               2013    174.951
dtype: float64

In [47]:
# same thing with explicit reference to column name 
dds.stack(level='Year').head(8)


Out[47]:
Variable  ISO  Year
Surplus   ARG  2011    -1.949
               2012    -2.359
               2013    -1.997
Debt      ARG  2011    35.763
               2012    37.328
               2013    40.217
Surplus   DEU  2011    -0.861
               2012     0.091
dtype: float64

In [48]:
# or with level number 
dds.stack(level=0).head(8)


Out[48]:
Variable  ISO  Year
Surplus   ARG  2011    -1.949
               2012    -2.359
               2013    -1.997
Debt      ARG  2011    35.763
               2012    37.328
               2013    40.217
Surplus   DEU  2011    -0.861
               2012     0.091
dtype: float64

In [ ]:

Unstacking

Stacking moves columns into the index, "stacking" the data up into longer columns. Unstacking does the reverse, taking levels of the row index and turning them into column labels. Roughly speaking we're rotating or pivoting the data.


In [49]:
# now go long to wide 
ds.unstack()  # default is lowest value level='ISO'


Out[49]:
Year 2011 2012 2013
Variable ISO
Debt ARG 35.763 37.328 40.217
DEU 77.928 79.311 76.978
GRC 170.960 156.494 174.951
Surplus ARG -1.949 -2.359 -1.997
DEU -0.861 0.091 0.133
GRC -10.215 -6.406 -2.944

In [50]:
# different level 
ds.unstack(level='Variable')


Out[50]:
Variable Debt Surplus
ISO Year
ARG 2011 35.763 -1.949
2012 37.328 -2.359
2013 40.217 -1.997
DEU 2011 77.928 -0.861
2012 79.311 0.091
2013 76.978 0.133
GRC 2011 170.960 -10.215
2012 156.494 -6.406
2013 174.951 -2.944

In [51]:
# or two at once 
ds.unstack(level=['Variable', 'ISO'])


Out[51]:
Variable Surplus Debt Surplus Debt Surplus Debt
ISO ARG ARG DEU DEU GRC GRC
Year
2011 -1.949 35.763 -0.861 77.928 -10.215 170.960
2012 -2.359 37.328 0.091 79.311 -6.406 156.494
2013 -1.997 40.217 0.133 76.978 -2.944 174.951

In [ ]:

Exercise. Run the code below and explain what each line of code does.


In [52]:
# stacked dataframe 
ds.head(8)


Out[52]:
Variable  ISO  Year
Surplus   ARG  2011    -1.949
               2012    -2.359
               2013    -1.997
Debt      ARG  2011    35.763
               2012    37.328
               2013    40.217
Surplus   DEU  2011    -0.861
               2012     0.091
dtype: float64

In [53]:
du1 = ds.unstack() 
du2 = du1.unstack()

In [ ]:

Exercise (challenging). Take the unstacked dataframe dds. Use some combination of stack, unstack, and plot to plot the variable Surplus against Year for all three countries. Challenging mostly because you need to work out the steps by yourself.


In [ ]:


In [ ]:

Pivoting

The pivot method: a short cut to some kinds of unstacking. In rough terms, it takes a wide dataframe and constructs a long one. The inputs are columns, not index levels.

Example: BDS data

The Census's Business Dynamnics Statistics collects annual information about the hiring decisions of firms by size and age. This table list the number of firms and total employment by employment size categories: 1 to 4 employees, 5 to 9, and so on.

Apply want operator. Our want is to plot total employment (the variable Emp) against size (variable fsize). Both are columns in the original data.

Here we construct a subset of the data, where we look at two years rather than the whole 1976-2013 period.


In [54]:
url = 'http://www2.census.gov/ces/bds/firm/bds_f_sz_release.csv'
raw = pd.read_csv(url)
raw.head()

sizes = ['a) 1 to 4', 'b) 5 to 9', 'c) 10 to 19', 'd) 20 to 49']
bds = raw[(raw['year2']>=2012) & raw['fsize'].isin(sizes)][['year2', 'fsize', 'Firms', 'Emp']]
bds


Out[54]:
year2 fsize Firms Emp
420 2012 a) 1 to 4 2832751 6019263
421 2012 b) 5 to 9 1017942 6715603
422 2012 c) 10 to 19 595848 8123271
423 2012 d) 20 to 49 368391 11300613
432 2013 a) 1 to 4 2846416 5998912
433 2013 b) 5 to 9 1020772 6714924
434 2013 c) 10 to 19 598153 8151891
435 2013 d) 20 to 49 373345 11425545

Pivoting the data

Let's think specifically about what we want. We want to graph Emp against fsize for (say) 2013. This calls for:

  • The index should be the size categories fsize.
  • The column labels should be the entries of year2, namely 2012 and 2013.
  • The data should come from the variable Emp.

These inputs translate directly into the following pivot method:


In [55]:
# pivot and divide by a million (dividing so bars aren't too long)
bdsp = bds.pivot(index='fsize', columns='year2', values='Emp')/10**6 
bdsp


Out[55]:
year2 2012 2013
fsize
a) 1 to 4 6.019263 5.998912
b) 5 to 9 6.715603 6.714924
c) 10 to 19 8.123271 8.151891
d) 20 to 49 11.300613 11.425545

Comment. Note that all the parameters here are columns. That's not a choice, it's the way the the pivot method is written.

We do a plot for fun:


In [56]:
# plot 2013 as bar chart 
fig, ax = plt.subplots()
bdsp[2013].plot.barh(ax=ax)
ax.set_ylabel('')
ax.set_xlabel('Number of Employees (millions)')


Out[56]:
<matplotlib.text.Text at 0x9587198>

In [ ]:

Review

We return to the OECD's healthcare data, specifically a subset of their table on the number of doctors per one thousand population. This loads and cleans the data:


In [57]:
url1 = 'http://www.oecd.org/health/health-systems/'
url2 = 'OECD-Health-Statistics-2015-Frequently-Requested-Data.xls'
docs = pd.read_excel(url1+url2, 
                     skiprows=3, 
                     usecols=[0, 51, 52, 53, 54, 55, 57],
                     sheetname='Physicians', 
                     na_values=['..'],
                     skip_footer=21) 
# rename country variable 
names = list(docs)
docs = docs.rename(columns={names[0]: 'Country'})
# strip footnote numbers from country names 
docs['Country'] = docs['Country'].str.rsplit(n=1).str.get(0)
docs = docs.head()
docs


Out[57]:
Country 2010 2011 2012 2013 2014 2013 (or nearest year)
0 Australia NaN 3.31 3.31 3.39 NaN 3.39
1 Austria 4.80 4.84 4.90 4.99 NaN 4.99
2 Belgium 2.92 2.91 2.93 2.95 NaN 2.95
3 Canada 2.37 2.44 2.50 2.55 NaN 2.55
4 Chile 1.43 1.58 1.74 1.88 2.02 1.88

Use this data to:

  • Set the index as Country.
  • Construct a horizontal bar chart of the number of doctors in each country in "2013 (or nearest year)".
  • Apply the drop method to docs to create a dataframe new that's missing the last column.
  • Challenging. Use stack and unstack to "pivot" the data so that columns are labeled by country names and rows are labeled by year. This is challenging because we have left out the intermediate steps.
  • Plot the number of doctors over time in each country as a line in the same plot.

Comment. In the last plot, the x axis labels are non-intuitive. Ignore that.


In [ ]:


In [ ]:


In [ ]:

Resources

Far and away the best material on this subject is Brandon Rhodes' 2015 Pycon presentation. 2 hours and 25 minutes and worth every second.


In [ ]: