Data management with Pandas

An overview of some of the data management tools in Python's Pandas package. Includes:

  • Selecting variables
  • Selecting observations
  • Indexing
  • Groupby
  • Stacking
  • Doubly indexed dataframes
  • Combining dataframes (concat)
  • Merging dataframes

This notebook was written by Dave Backus for the NYU Stern course Data Bootcamp.


In [1]:
import pandas as pd
%matplotlib inline

Reminders

  • Dataframes
  • Index and columns
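
As a quick refresher on these two reminders, here is a minimal sketch using a small made-up dataframe. The name `df` and its values are purely illustrative and are not part of the course data. It shows that a dataframe carries two sets of labels: the index (rows) and the columns (variables).

```python
import pandas as pd

# hypothetical toy data, for illustration only
df = pd.DataFrame({'x': [1, 2, 3], 'y': [10.0, 20.0, 30.0]},
                  index=['a', 'b', 'c'])

print(type(df))          # the object is a pandas DataFrame
print(list(df.index))    # row labels
print(list(df.columns))  # column (variable) labels
print(df.shape)          # (number of rows, number of columns)
```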

Selecting variables

Datasets

We take these examples from the data input chapter:

  • Penn World Table
  • World Economic Outlook
  • UN Population Data

All of them come in an unfriendly form; our goal is to fix them. Here we extract small subsets to work with so that we can follow all the steps.

Penn World Table

This dataset comes with countries stacked on top of each other: each row is one country in one year.


In [37]:
# small subset of the Penn World Table: three countries, three years
# pop = population, rgdpe = expenditure-side real GDP
data = {'countrycode': ['CHN', 'CHN', 'CHN', 'FRA', 'FRA', 'FRA', 'USA', 'USA', 'USA'],
        'pop': [1124.793924, 1246.840065, 1318.170152,
                58.183174, 60.764325, 64.731126,
                253.339097, 282.49631, 310.383948],
        'rgdpe': [2611027.0, 4951485.0, 11106452.0,
                  1293837.0, 1752570.125, 2031723.25,
                  7964788.5, 11494606.0, 13151344.0],
        'year': [1990, 2000, 2010, 1990, 2000, 2010, 1990, 2000, 2010]}
pwt = pd.DataFrame(data)
pwt


Out[37]:
countrycode pop rgdpe year
0 CHN 1124.793924 2611027.000 1990
1 CHN 1246.840065 4951485.000 2000
2 CHN 1318.170152 11106452.000 2010
3 FRA 58.183174 1293837.000 1990
4 FRA 60.764325 1752570.125 2000
5 FRA 64.731126 2031723.250 2010
6 USA 253.339097 7964788.500 1990
7 USA 282.496310 11494606.000 2000
8 USA 310.383948 13151344.000 2010
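
One possible way to work with this stacked layout is sketched below. It is not necessarily the approach the notebook goes on to take. It rebuilds the same subset (values rounded as in the output above), selects variables and observations, and then uses a double index plus `unstack` to put countries side by side. The names `pop_only`, `y2010`, `pwt_indexed`, and `pop_wide` are illustrative.

```python
import pandas as pd

# same subset as above, rebuilt so the example is self-contained
data = {'countrycode': ['CHN', 'CHN', 'CHN', 'FRA', 'FRA', 'FRA', 'USA', 'USA', 'USA'],
        'pop': [1124.793924, 1246.840065, 1318.170152,
                58.183174, 60.764325, 64.731126,
                253.339097, 282.49631, 310.383948],
        'rgdpe': [2611027.0, 4951485.0, 11106452.0,
                  1293837.0, 1752570.125, 2031723.25,
                  7964788.5, 11494606.0, 13151344.0],
        'year': [1990, 2000, 2010, 1990, 2000, 2010, 1990, 2000, 2010]}
pwt = pd.DataFrame(data)

# select variables (columns) by passing a list of names
pop_only = pwt[['countrycode', 'year', 'pop']]

# select observations (rows) with a boolean condition
y2010 = pwt[pwt['year'] == 2010]

# doubly indexed dataframe: country and year together label each row
pwt_indexed = pwt.set_index(['countrycode', 'year'])

# unstack moves the country level from the index into the columns,
# leaving one row per year and one column per country
pop_wide = pwt_indexed['pop'].unstack(level='countrycode')
print(pop_wide)
```

Note the bracket notation `pwt['pop']`: attribute access (`pwt.pop`) would return the dataframe's `pop` method rather than the column.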

UN Population Data
