Intorduction to Pandas - Pan(el)-da(ta)-s

Laszlo Tetenyi



In [1]:

    
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import requests, zipfile, io # So that we can download and unzip files

Survey of Consumer Finances (SCF) 2013

Load and explore data from the SCF website - note that this data cannot be loaded by 2007- Excel due to its size



In [2]:

    
r = requests.get('http://www.federalreserve.gov/econresdata/scf/files/scfp2013excel.zip')
z = zipfile.ZipFile(io.BytesIO(r.content))
f = z.open('SCFP2013.xlsx')
table = pd.read_excel(f, sheetname='SCFP2013')

Lets have a quick look at the table



In [3]:

    
table.head()









    Out[3]:






  
    
      
      YY1
      Y1
      WGT
      HHSEX
      AGE
      AGECL
      EDUC
      EDCL
      MARRIED
      KIDS
      ...
      LLOAN11
      LLOAN12
      NWCAT
      INCCAT
      ASSETCAT
      NINCCAT
      NINC2CAT
      NW10CAT
      INC10CAT
      NINC10CAT
    
  
  
    
      0
      1
      11
      3100.802441
      1
      54
      3
      11
      2
      2
      1
      ...
      0
      0
      1
      1
      1
      2
      1
      2
      2
      4
    
    
      1
      1
      12
      3090.352195
      1
      54
      3
      11
      2
      2
      1
      ...
      0
      0
      1
      1
      1
      2
      1
      2
      2
      4
    
    
      2
      1
      13
      3094.100275
      1
      54
      3
      11
      2
      2
      1
      ...
      0
      0
      1
      1
      1
      2
      1
      2
      2
      4
    
    
      3
      1
      14
      3098.507516
      1
      54
      3
      11
      2
      2
      1
      ...
      0
      0
      1
      1
      1
      2
      1
      2
      2
      4
    
    
      4
      1
      15
      3104.670102
      1
      54
      3
      11
      2
      2
      1
      ...
      0
      0
      1
      1
      1
      2
      1
      2
      2
      4
    
  

5 rows × 324 columns

We can select particular rows of the table using standard Python array slicing notation



In [4]:

    
table[0:5]









    Out[4]:






  
    
      
      YY1
      Y1
      WGT
      HHSEX
      AGE
      AGECL
      EDUC
      EDCL
      MARRIED
      KIDS
      ...
      LLOAN11
      LLOAN12
      NWCAT
      INCCAT
      ASSETCAT
      NINCCAT
      NINC2CAT
      NW10CAT
      INC10CAT
      NINC10CAT
    
  
  
    
      0
      1
      11
      3100.802441
      1
      54
      3
      11
      2
      2
      1
      ...
      0
      0
      1
      1
      1
      2
      1
      2
      2
      4
    
    
      1
      1
      12
      3090.352195
      1
      54
      3
      11
      2
      2
      1
      ...
      0
      0
      1
      1
      1
      2
      1
      2
      2
      4
    
    
      2
      1
      13
      3094.100275
      1
      54
      3
      11
      2
      2
      1
      ...
      0
      0
      1
      1
      1
      2
      1
      2
      2
      4
    
    
      3
      1
      14
      3098.507516
      1
      54
      3
      11
      2
      2
      1
      ...
      0
      0
      1
      1
      1
      2
      1
      2
      2
      4
    
    
      4
      1
      15
      3104.670102
      1
      54
      3
      11
      2
      2
      1
      ...
      0
      0
      1
      1
      1
      2
      1
      2
      2
      4
    
  

5 rows × 324 columns

table is a DataFrame - a multi-dimensional equivalent of Series (another Pandas object), as it has multiple columns, so you can think of it as a matrix, where columns can be accessed by their 'names'. In fact, many operations can be performed on them (coming from numpy):



In [5]:

    
table.max().max()









    Out[5]:





1324540600.0

But they know more than that - they have several built-in statistics :



In [6]:

    
table.describe()









    Out[6]:






  
    
      
      YY1
      Y1
      WGT
      HHSEX
      AGE
      AGECL
      EDUC
      EDCL
      MARRIED
      KIDS
      ...
      LLOAN11
      LLOAN12
      NWCAT
      INCCAT
      ASSETCAT
      NINCCAT
      NINC2CAT
      NW10CAT
      INC10CAT
      NINC10CAT
    
  
  
    
      count
      30075.000000
      30075.000000
      30075.000000
      30075.000000
      30075.000000
      30075.000000
      30075.000000
      30075.000000
      30075.000000
      30075.000000
      ...
      30075.000000
      30075.000000
      30075.000000
      30075.000000
      30075.000000
      30075.000000
      30075.000000
      30075.000000
      30075.000000
      30075.000000
    
    
      mean
      3013.273317
      30135.733167
      4074.150302
      1.236575
      51.753283
      3.232253
      13.963591
      3.025403
      1.373400
      0.840067
      ...
      539.636908
      4231.891604
      2.996143
      3.574763
      3.572835
      3.597905
      1.832086
      6.229061
      6.250208
      6.291837
    
    
      std
      1739.205187
      17392.051924
      2199.735338
      0.424986
      16.170621
      1.532090
      2.699456
      1.048733
      0.483715
      1.174573
      ...
      12572.614911
      124661.121441
      1.493547
      1.781943
      1.840313
      1.776878
      0.784993
      3.170065
      3.096357
      3.081563
    
    
      min
      1.000000
      11.000000
      1.620062
      1.000000
      18.000000
      1.000000
      -1.000000
      1.000000
      1.000000
      0.000000
      ...
      0.000000
      0.000000
      1.000000
      1.000000
      1.000000
      1.000000
      1.000000
      1.000000
      1.000000
      1.000000
    
    
      25%
      1508.000000
      15084.500000
      3037.739309
      1.000000
      39.000000
      2.000000
      12.000000
      2.000000
      1.000000
      0.000000
      ...
      0.000000
      0.000000
      2.000000
      2.000000
      2.000000
      2.000000
      1.000000
      3.000000
      4.000000
      4.000000
    
    
      50%
      3013.000000
      30133.000000
      4303.426803
      1.000000
      52.000000
      3.000000
      14.000000
      3.000000
      1.000000
      0.000000
      ...
      0.000000
      0.000000
      3.000000
      4.000000
      4.000000
      4.000000
      2.000000
      7.000000
      7.000000
      7.000000
    
    
      75%
      4519.000000
      45191.500000
      5469.228758
      1.000000
      63.000000
      4.000000
      16.000000
      4.000000
      2.000000
      2.000000
      ...
      0.000000
      0.000000
      5.000000
      5.000000
      6.000000
      5.000000
      2.000000
      10.000000
      9.000000
      9.000000
    
    
      max
      6026.000000
      60265.000000
      10617.366341
      2.000000
      95.000000
      6.000000
      17.000000
      4.000000
      2.000000
      8.000000
      ...
      720000.000000
      8000000.000000
      5.000000
      6.000000
      6.000000
      6.000000
      3.000000
      10.000000
      10.000000
      10.000000
    
  

8 rows × 324 columns

As an example try to access normalized income and net-worth variables.



In [7]:

    
table.dtypes[0:5]
table.dtypes.shape









    Out[7]:





(324,)

There are way too many variables in there - try to search for the proper column names



In [8]:

    
[col for col in table.columns if 'NETWOR'  in col]









    Out[8]:





['NETWORTH']



In [9]:

    
[col for col in table.columns if 'INC'  in col]









    Out[9]:





['INCOME',
 'WAGEINC',
 'BUSSEFARMINC',
 'INTDIVINC',
 'KGINC',
 'SSRETINC',
 'TRANSFOTHINC',
 'NORMINC',
 'EQUITINC',
 'DEBT2INC',
 'INCCAT',
 'NINCCAT',
 'NINC2CAT',
 'INC10CAT',
 'NINC10CAT']



In [10]:

    
income = table['NORMINC']
net_worth = table['NETWORTH']

Get the "mean" (non -weighted) income and minimal net worth



In [11]:

    
income.mean()









    Out[11]:





713563.8921620903



In [12]:

    
net_worth.min()









    Out[12]:





-227019000.0

That is, there is one person with a net worth of -227 million \$ ! Suppose we do not want our analysis to depend on these extremely low values and we would like to trim our dataframe. As a first step, create a new dataframe that only contains the variables of interest (and the id of each observation)



In [13]:

    
keep = ['YY1','Y1', 'NORMINC', 'NETWORTH']



In [14]:

    
data = table[keep]



In [15]:

    
data.head()









    Out[15]:






  
    
      
      YY1
      Y1
      NORMINC
      NETWORTH
    
  
  
    
      0
      1
      11
      37537.663108
      -400
    
    
      1
      1
      12
      39566.725979
      -400
    
    
      2
      1
      13
      35508.600237
      -400
    
    
      3
      1
      14
      38552.194543
      -400
    
    
      4
      1
      15
      35508.600237
      -400

Rename the columns:



In [16]:

    
data.columns ='Household', 'Observation' , 'Income', 'Net Worth'



In [17]:

    
data.head()









    Out[17]:






  
    
      
      Household
      Observation
      Income
      Net Worth
    
  
  
    
      0
      1
      11
      37537.663108
      -400
    
    
      1
      1
      12
      39566.725979
      -400
    
    
      2
      1
      13
      35508.600237
      -400
    
    
      3
      1
      14
      38552.194543
      -400
    
    
      4
      1
      15
      35508.600237
      -400

Try to get a general picture of what would be a "good" trimming value of net worth by plotting an estimated kernel density



In [18]:

    
data['Net Worth'].plot(kind='density')
plt.show()

Let's see how many observations we eliminate if we exclude everyone below -1 million \$



In [19]:

    
data_trimmed = data[data['Net Worth'] > -1000000]



In [20]:

    
data.shape[0] - data_trimmed.shape[0]









    Out[20]:





15

But how many households are in this category?



In [21]:

    
data[data['Net Worth'] < -1000000]









    Out[21]:






  
    
      
      Household
      Observation
      Income
      Net Worth
    
  
  
    
      2770
      555
      5551
      56813.760380
      -1692510
    
    
      2771
      555
      5552
      54784.697509
      -1760510
    
    
      2772
      555
      5553
      53770.166074
      -1673500
    
    
      2773
      555
      5554
      49712.040332
      -1464610
    
    
      2774
      555
      5555
      49712.040332
      -1962990
    
    
      14230
      2852
      28521
      0.000000
      -215289000
    
    
      14231
      2852
      28522
      0.000000
      -220820200
    
    
      14232
      2852
      28523
      0.000000
      -227019000
    
    
      14233
      2852
      28524
      0.000000
      -226697000
    
    
      14234
      2852
      28525
      0.000000
      -216780000
    
    
      18785
      3764
      37641
      0.000000
      -2037500
    
    
      18786
      3764
      37642
      0.000000
      -1390100
    
    
      18787
      3764
      37643
      0.000000
      -2130900
    
    
      18788
      3764
      37644
      0.000000
      -1293000
    
    
      18789
      3764
      37645
      0.000000
      -1521200



In [22]:

    
data_trimmed['Net Worth'].plot(kind='density')
plt.show()

Pivoting

Note that the data has a nice panel structure which we so far did not exploit - each household has multiple observations. Lets use pivoting so that our dataframe reflects that. First get the index of our dataframe:



In [23]:

    
data.index









    Out[23]:





Int64Index([    0,     1,     2,     3,     4,     5,     6,     7,     8,
                9,
            ...
            30065, 30066, 30067, 30068, 30069, 30070, 30071, 30072, 30073,
            30074],
           dtype='int64', length=30075)

We simply have the index of each observation. As a first step, replace these indeces by the household identifier and group income levels in each observation:



In [24]:

    
new_observations = data.loc[:,'Observation'] - 10 * data.loc[:,'Household']
data.loc[:,'Observation'] = new_observations
data[0:10]
# Normally, you should not do this - instead use assign









    



D:\Programs\Anaconda\lib\site-packages\pandas\core\indexing.py:426: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s






    Out[24]:






  
    
      
      Household
      Observation
      Income
      Net Worth
    
  
  
    
      0
      1
      1
      37537.663108
      -400
    
    
      1
      1
      2
      39566.725979
      -400
    
    
      2
      1
      3
      35508.600237
      -400
    
    
      3
      1
      4
      38552.194543
      -400
    
    
      4
      1
      5
      35508.600237
      -400
    
    
      5
      2
      1
      22319.691578
      -8700
    
    
      6
      2
      2
      22319.691578
      -8600
    
    
      7
      2
      3
      22319.691578
      -8700
    
    
      8
      2
      4
      22319.691578
      -8600
    
    
      9
      2
      5
      22319.691578
      -8600



In [25]:

    
# Reload the data
data = table[keep]
data.columns ='Household', 'Observation' , 'Income', 'Net Worth'



In [26]:

    
data = data.assign(Observations = (data['Observation'] - 10.0 * data['Household']).astype(int))
del data['Observation'] # delete the old column
data = data.rename(columns = {'Observations':'Observation'}) # rename the column
data = data[['Household', 'Observation' , 'Income', 'Net Worth']] # reinsert the column
data.head()









    Out[26]:






  
    
      
      Household
      Observation
      Income
      Net Worth
    
  
  
    
      0
      1
      1
      37537.663108
      -400
    
    
      1
      1
      2
      39566.725979
      -400
    
    
      2
      1
      3
      35508.600237
      -400
    
    
      3
      1
      4
      38552.194543
      -400
    
    
      4
      1
      5
      35508.600237
      -400



In [27]:

    
p = data.pivot(index = 'Household' , columns = 'Observation' , values = 'Income' )



In [28]:

    
p.head()









    Out[28]:






  
    
      Observation
      1
      2
      3
      4
      5
    
    
      Household
      
      
      
      
      
    
  
  
    
      1
      37537.663108
      39566.725979
      35508.600237
      38552.194543
      35508.600237
    
    
      2
      22319.691578
      22319.691578
      22319.691578
      22319.691578
      22319.691578
    
    
      3
      52755.634638
      52755.634638
      52755.634638
      52755.634638
      52755.634638
    
    
      4
      125801.897980
      125801.897980
      125801.897980
      125801.897980
      125801.897980
    
    
      5
      99424.080664
      107540.332150
      108554.863580
      108554.863580
      92322.360617

If instead we are interested in both income and net worth grouped by observations then:



In [29]:

    
p = data.pivot(index='Household', columns='Observation')



In [30]:

    
p.head()









    Out[30]:






  
    
      
      Income
      Net Worth
    
    
      Observation
      1
      2
      3
      4
      5
      1
      2
      3
      4
      5
    
    
      Household
      
      
      
      
      
      
      
      
      
      
    
  
  
    
      1
      37537.663108
      39566.725979
      35508.600237
      38552.194543
      35508.600237
      -400
      -400
      -400
      -400
      -400
    
    
      2
      22319.691578
      22319.691578
      22319.691578
      22319.691578
      22319.691578
      -8700
      -8600
      -8700
      -8600
      -8600
    
    
      3
      52755.634638
      52755.634638
      52755.634638
      52755.634638
      52755.634638
      -500
      -600
      -500
      -500
      -500
    
    
      4
      125801.897980
      125801.897980
      125801.897980
      125801.897980
      125801.897980
      2111950
      2169750
      2149850
      2216950
      2289850
    
    
      5
      99424.080664
      107540.332150
      108554.863580
      108554.863580
      92322.360617
      676600
      675300
      677100
      677400
      677300

Use stacking to transform the data into a panel structure we are familiar with (and unstacking to go back to cross-section):



In [31]:

    
panel_data = p.stack()



In [32]:

    
panel_data.head()









    Out[32]:






  
    
      
      
      Income
      Net Worth
    
    
      Household
      Observation
      
      
    
  
  
    
      1
      1
      37537.663108
      -400
    
    
      2
      39566.725979
      -400
    
    
      3
      35508.600237
      -400
    
    
      4
      38552.194543
      -400
    
    
      5
      35508.600237
      -400

Using the panel structure it is even easier to see the number of households that had fewer than -1 million \$ net worth



In [33]:

    
panel_data[panel_data['Net Worth'] < -1000000]









    Out[33]:






  
    
      
      
      Income
      Net Worth
    
    
      Household
      Observation
      
      
    
  
  
    
      555
      1
      56813.760380
      -1692510
    
    
      2
      54784.697509
      -1760510
    
    
      3
      53770.166074
      -1673500
    
    
      4
      49712.040332
      -1464610
    
    
      5
      49712.040332
      -1962990
    
    
      2852
      1
      0.000000
      -215289000
    
    
      2
      0.000000
      -220820200
    
    
      3
      0.000000
      -227019000
    
    
      4
      0.000000
      -226697000
    
    
      5
      0.000000
      -216780000
    
    
      3764
      1
      0.000000
      -2037500
    
    
      2
      0.000000
      -1390100
    
    
      3
      0.000000
      -2130900
    
    
      4
      0.000000
      -1293000
    
    
      5
      0.000000
      -1521200

Pandas even have their own data-structure for panel data - for it, we need to create dataframes as inputs



In [34]:

    
p = data.pivot(index='Observation', columns='Household')



In [35]:

    
pdata = {'Observation 1' : pd.DataFrame(p.ix[1,:]),
        'Observation 2' : pd.DataFrame(p.ix[2,:]),
        'Observation 3' : pd.DataFrame(p.ix[3,:]),
        'Observation 4' : pd.DataFrame(p.ix[4,:]),
        'Observation 5' : pd.DataFrame(p.ix[5,:])}
pdata = pd.Panel(pdata)



In [36]:

    
pdata









    Out[36]:





<class 'pandas.core.panel.Panel'>
Dimensions: 5 (items) x 12030 (major_axis) x 5 (minor_axis)
Items axis: Observation 1 to Observation 5
Major_axis axis: (Income, 1) to (Net Worth, 6026)
Minor_axis axis: 1 to 5

but this is not as useful - unfortunately the panel part of the package has been neglected. Very few functions are available. Now as a last exercise, save our dataFrame to a csv file



In [39]:

    
data.to_excel('SCF_2013_inc_netw.xlsx')



In [ ]:

	YY1	Y1	WGT	HHSEX	AGE	AGECL	EDUC	EDCL	MARRIED	KIDS	...	NWCAT	INCCAT	ASSETCAT	NINCCAT	NINC2CAT	NW10CAT	INC10CAT	NINC10CAT
0	1	11	3100.802441	1	54	3	11	2	2	1	...	1	1	1	2	1	2	2	4
1	1	12	3090.352195	1	54	3	11	2	2	1	...	1	1	1	2	1	2	2	4
2	1	13	3094.100275	1	54	3	11	2	2	1	...	1	1	1	2	1	2	2	4
3	1	14	3098.507516	1	54	3	11	2	2	1	...	1	1	1	2	1	2	2	4
4	1	15	3104.670102	1	54	3	11	2	2	1	...	1	1	1	2	1	2	2	4

	YY1	Y1	WGT	HHSEX	AGE	AGECL	EDUC	EDCL	MARRIED	KIDS	...	LLOAN11	LLOAN12	NWCAT	INCCAT	ASSETCAT	NINCCAT	NINC2CAT	NW10CAT	INC10CAT	NINC10CAT
count	30075.000000	30075.000000	30075.000000	30075.000000	30075.000000	30075.000000	30075.000000	30075.000000	30075.000000	30075.000000	...	30075.000000	30075.000000	30075.000000	30075.000000	30075.000000	30075.000000	30075.000000	30075.000000	30075.000000	30075.000000
mean	3013.273317	30135.733167	4074.150302	1.236575	51.753283	3.232253	13.963591	3.025403	1.373400	0.840067	...	539.636908	4231.891604	2.996143	3.574763	3.572835	3.597905	1.832086	6.229061	6.250208	6.291837
std	1739.205187	17392.051924	2199.735338	0.424986	16.170621	1.532090	2.699456	1.048733	0.483715	1.174573	...	12572.614911	124661.121441	1.493547	1.781943	1.840313	1.776878	0.784993	3.170065	3.096357	3.081563
min	1.000000	11.000000	1.620062	1.000000	18.000000	1.000000	-1.000000	1.000000	1.000000	0.000000	...	0.000000	0.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
25%	1508.000000	15084.500000	3037.739309	1.000000	39.000000	2.000000	12.000000	2.000000	1.000000	0.000000	...	0.000000	0.000000	2.000000	2.000000	2.000000	2.000000	1.000000	3.000000	4.000000	4.000000
50%	3013.000000	30133.000000	4303.426803	1.000000	52.000000	3.000000	14.000000	3.000000	1.000000	0.000000	...	0.000000	0.000000	3.000000	4.000000	4.000000	4.000000	2.000000	7.000000	7.000000	7.000000
75%	4519.000000	45191.500000	5469.228758	1.000000	63.000000	4.000000	16.000000	4.000000	2.000000	2.000000	...	0.000000	0.000000	5.000000	5.000000	6.000000	5.000000	2.000000	10.000000	9.000000	9.000000
max	6026.000000	60265.000000	10617.366341	2.000000	95.000000	6.000000	17.000000	4.000000	2.000000	8.000000	...	720000.000000	8000000.000000	5.000000	6.000000	6.000000	6.000000	3.000000	10.000000	10.000000	10.000000

	YY1	Y1	NORMINC	NETWORTH
0	1	11	37537.663108	-400
1	1	12	39566.725979	-400
2	1	13	35508.600237	-400
3	1	14	38552.194543	-400
4	1	15	35508.600237	-400

	Household	Observation	Income	Net Worth
2770	555	5551	56813.760380	-1692510
2771	555	5552	54784.697509	-1760510
2772	555	5553	53770.166074	-1673500
2773	555	5554	49712.040332	-1464610
2774	555	5555	49712.040332	-1962990
14230	2852	28521	0.000000	-215289000
14231	2852	28522	0.000000	-220820200
14232	2852	28523	0.000000	-227019000
14233	2852	28524	0.000000	-226697000
14234	2852	28525	0.000000	-216780000
18785	3764	37641	0.000000	-2037500
18786	3764	37642	0.000000	-1390100
18787	3764	37643	0.000000	-2130900
18788	3764	37644	0.000000	-1293000
18789	3764	37645	0.000000	-1521200

Observation	1	2	3	4	5
Household
1	37537.663108	39566.725979	35508.600237	38552.194543	35508.600237
2	22319.691578	22319.691578	22319.691578	22319.691578	22319.691578
3	52755.634638	52755.634638	52755.634638	52755.634638	52755.634638
4	125801.897980	125801.897980	125801.897980	125801.897980	125801.897980
5	99424.080664	107540.332150	108554.863580	108554.863580	92322.360617

	Income					Net Worth
Observation	1	2	3	4	5	1	2	3	4	5
Household
1	37537.663108	39566.725979	35508.600237	38552.194543	35508.600237	-400	-400	-400	-400	-400
2	22319.691578	22319.691578	22319.691578	22319.691578	22319.691578	-8700	-8600	-8700	-8600	-8600
3	52755.634638	52755.634638	52755.634638	52755.634638	52755.634638	-500	-600	-500	-500	-500
4	125801.897980	125801.897980	125801.897980	125801.897980	125801.897980	2111950	2169750	2149850	2216950	2289850
5	99424.080664	107540.332150	108554.863580	108554.863580	92322.360617	676600	675300	677100	677400	677300