Data management datasets

Datasets to use in Pandas class. Convert large datasets to more manageable ones and save as dictionaries that we can turn into dataframes.

This notebook was written by Dave Backus for the NYU Stern course Data Bootcamp.


In [1]:
import pandas as pd
%matplotlib inline

Datasets

We take these examples from the data input chapter:

  • Penn World Table
  • World Economic Outlook
  • Fama-French
  • UN Population Projections

All of them come in an unfriendly form; our goal is to fix them. Here we extract small subsets to work with so that we can follow all the steps.

Penn World Table

This one comes with countries stacked on top of each others.


In [10]:
%%time
url = 'http://www.rug.nl/research/ggdc/data/pwt/v81/pwt81.xlsx'
pwt = pd.read_excel(url, sheetname='Data')
pwt.head()


Wall time: 14.5 s

In [11]:
pwt.shape


Out[11]:
(10357, 47)

In [12]:
variables = [0, 1, 3, 4, 6]
countries = ['CHN', 'FRA', 'USA']
years     = [1990, 2000, 2010]

pwt = pwt[variables]


Out[12]:
(10357, 5)

In [13]:
# one country 
pwt1 = pwt[(pwt['countrycode'] == 'CHN') & (pwt['year'].isin(years))]

In [15]:
# three countries 
pwt3 = pwt[(pwt['countrycode'].isin(countries)) & (pwt['year'].isin(years))]
pwt3 = pwt3[pwt3['country']==]
pwt3.shape


Out[15]:
(9, 5)

In [1]:
pwt3 = pwt3.replace(to_replace=["China, People's Republic of"], value=['China']) 
pwt3.reset_index()
pwt3


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-86de55a87ebb> in <module>()
----> 1 pwt3 = pwt3.replace(to_replace=["China, People's Republic of"], value=['China'])
      2 pwt3.reset_index()
      3 pwt3

NameError: name 'pwt3' is not defined

In [31]:
pw3d = pwt3.to_dict(orient='list')
pw3d


Out[31]:
{'country': ["China, People's Republic of",
  "China, People's Republic of",
  "China, People's Republic of",
  'France',
  'France',
  'France',
  'United States',
  'United States',
  'United States'],
 'countrycode': ['CHN',
  'CHN',
  'CHN',
  'FRA',
  'FRA',
  'FRA',
  'USA',
  'USA',
  'USA'],
 'pop': [1124.7939240000001,
  1246.8400649999999,
  1318.1701519999999,
  58.183173999999994,
  60.764324999999999,
  64.731126000000003,
  253.33909699999998,
  282.49630999999999,
  310.38394799999998],
 'rgdpe': [2611027.0,
  4951485.0,
  11106452.0,
  1293837.0,
  1752570.125,
  2031723.25,
  7964788.5,
  11494606.0,
  13151344.0],
 'year': [1990, 2000, 2010, 1990, 2000, 2010, 1990, 2000, 2010]}

In [ ]:


In [ ]:


In [ ]:


In [ ]: