In [1]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import requests, zipfile, io # So that we can download and unzip files
In [2]:
r = requests.get('http://www.federalreserve.gov/econresdata/scf/files/scfp2013excel.zip')
z = zipfile.ZipFile(io.BytesIO(r.content))
f = z.open('SCFP2013.xlsx')
table = pd.read_excel(f, sheet_name='SCFP2013')
Let's have a quick look at the table:
In [3]:
table.head()
Out[3]:
We can select particular rows of the table using standard Python array slicing notation
In [4]:
table[0:5]
Out[4]:
table is a DataFrame, the multi-column equivalent of a Series (another pandas object). Since it has multiple columns, you can think of it as a matrix whose columns can be accessed by their names. Many numpy-style operations can be performed on it:
In [5]:
table.max().max()
Out[5]:
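For instance, a single column can be pulled out by its name and operated on directly. A minimal sketch (NETWORTH is one of the columns used later in this notebook):

# Sketch: select one column by name (it comes back as a Series)
# and apply a numpy-style reduction to it
table['NETWORTH'].max()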
But DataFrames can do more than that: they have several built-in summary statistics:
In [6]:
table.describe()
Out[6]:
As an example, try to access the normalized income and net worth variables.
In [7]:
table.dtypes[0:5]
table.dtypes.shape
Out[7]:
There are way too many variables in there, so search for the proper column names:
In [8]:
[col for col in table.columns if 'NETWOR' in col]
Out[8]:
In [9]:
[col for col in table.columns if 'INC' in col]
Out[9]:
In [10]:
income = table['NORMINC']
net_worth = table['NETWORTH']
Get the "mean" (non -weighted) income and minimal net worth
In [11]:
income.mean()
Out[11]:
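Note that this is the simple, unweighted mean. The SCF is a weighted survey, so a population-representative mean would use the sample weights. A sketch, assuming the weight column is named 'WGT' (an assumption here; check table.columns to confirm):

# Sketch: weighted mean of income, assuming the sample weight column is 'WGT'
(table['NORMINC'] * table['WGT']).sum() / table['WGT'].sum()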
In [12]:
net_worth.min()
Out[12]:
That is, there is one household with a net worth of -227 million \$! Suppose we do not want our analysis to depend on these extremely low values and we would like to trim our DataFrame. As a first step, create a new DataFrame that only contains the variables of interest (and the id of each observation):
In [13]:
keep = ['YY1','Y1', 'NORMINC', 'NETWORTH']
In [14]:
data = table[keep]
In [15]:
data.head()
Out[15]:
Rename the columns:
In [16]:
data.columns = ['Household', 'Observation', 'Income', 'Net Worth']
In [17]:
data.head()
Out[17]:
Try to get a general picture of what would be a "good" trimming value of net worth by plotting an estimated kernel density
In [18]:
data['Net Worth'].plot(kind='density')
plt.show()
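A complementary way to pick a cutoff is to look at the extreme lower quantiles of the distribution directly (a sketch; the quantile levels are arbitrary choices):

# Sketch: inspect the lowest quantiles of net worth to see where the extreme tail begins
data['Net Worth'].quantile([0.0001, 0.001, 0.01])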
Let's see how many observations we eliminate if we exclude everyone with a net worth below -1 million \$:
In [19]:
data_trimmed = data[data['Net Worth'] > -1000000]
In [20]:
data.shape[0] - data_trimmed.shape[0]
Out[20]:
But how many households are in this category?
In [21]:
data[data['Net Worth'] < -1000000]
Out[21]:
In [22]:
data_trimmed['Net Worth'].plot(kind='density')
plt.show()
In [23]:
data.index
Out[23]:
We simply have a running index for each row. As a first step, replace these indices with the household identifier and the observation number within each household:
In [24]:
# Normally you should not modify a slice in place like this - use assign instead (see below)
new_observations = data.loc[:, 'Observation'] - 10 * data.loc[:, 'Household']
data.loc[:, 'Observation'] = new_observations
data[0:10]
Out[24]:
In [25]:
# Reload the data
data = table[keep]
data.columns = ['Household', 'Observation', 'Income', 'Net Worth']
In [26]:
data = data.assign(Observations=(data['Observation'] - 10.0 * data['Household']).astype(int))
del data['Observation'] # delete the old column
data = data.rename(columns = {'Observations':'Observation'}) # rename the column
data = data[['Household', 'Observation' , 'Income', 'Net Worth']] # reinsert the column
data.head()
Out[26]:
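The same result can be obtained more compactly in one chained expression, since assign can overwrite an existing column directly, making the delete/rename/reorder steps unnecessary. A sketch (assigned to the hypothetical name data_alt here only to leave the notebook's data untouched):

# Sketch: equivalent, more compact construction of the same DataFrame
data_alt = (table[keep]
            .rename(columns={'YY1': 'Household', 'Y1': 'Observation',
                             'NORMINC': 'Income', 'NETWORTH': 'Net Worth'})
            .assign(Observation=lambda d: (d['Observation'] - 10 * d['Household']).astype(int)))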
In [27]:
p = data.pivot(index = 'Household' , columns = 'Observation' , values = 'Income' )
In [28]:
p.head()
Out[28]:
If instead we are interested in both income and net worth grouped by observation:
In [29]:
p = data.pivot(index='Household', columns='Observation')
In [30]:
p.head()
Out[30]:
Use stacking to transform the data into a panel structure we are familiar with (and unstacking to go back to cross-section):
In [31]:
panel_data = p.stack()
In [32]:
panel_data.head()
Out[32]:
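Unstacking reverses the operation and takes us back to the wide, cross-sectional layout (a sketch; this should reproduce the pivoted table from above):

# Sketch: unstack the inner index level to recover the wide cross-section layout
panel_data.unstack().head()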
Using the panel structure, it is even easier to see which households had a net worth below -1 million \$:
In [33]:
panel_data[panel_data['Net Worth'] < -1000000]
Out[33]:
Pandas even has its own data structure for panel data; to construct it, we need to pass DataFrames as inputs:
In [34]:
p = data.pivot(index='Observation', columns='Household')
In [35]:
pdata = {'Observation 1': pd.DataFrame(p.loc[1, :]),
         'Observation 2': pd.DataFrame(p.loc[2, :]),
         'Observation 3': pd.DataFrame(p.loc[3, :]),
         'Observation 4': pd.DataFrame(p.loc[4, :]),
         'Observation 5': pd.DataFrame(p.loc[5, :])}
pdata = pd.Panel(pdata)
In [36]:
pdata
Out[36]:
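Individual DataFrames can still be pulled back out of the Panel by item name (a sketch):

# Sketch: retrieve the DataFrame for one item of the Panel
pdata['Observation 1'].head()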
But this is not as useful: the panel part of the package has been neglected, very few functions are available for it, and pd.Panel was eventually deprecated and removed from pandas altogether, so the MultiIndex DataFrame above is the structure to prefer. Now, as a last exercise, save our DataFrame to an Excel file:
In [39]:
data.to_excel('SCF_2013_inc_netw.xlsx')
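If a plain csv file is preferred instead, to_csv works the same way (a sketch with a hypothetical file name):

# Sketch: the csv equivalent of the export above
data.to_csv('SCF_2013_inc_netw.csv', index=False)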