New York University

Applied Data Science 2016 Final Project


Measuring household income under Redatam in CensusData


1. Model by Individual



Members:

  • Felipe Gonzales
  • Ilan Reinstein
  • Fernando Melchor
  • Nicolas Metallo

LIBRARIES


In [24]:
# helper functions
import getEPH
import categorize
import schoolYears
import make_dummy
import functionsForModels
# libraries
import pandas as pd
import numpy as np
from scipy import stats
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.sandbox.regression.predstd import wls_prediction_std
from statsmodels.iolib.table import (SimpleTable, default_txt_fmt)
np.random.seed(1024)
%matplotlib inline

DATA HANDLING


In [25]:
# get data using 'getEPHdbf' function

getEPH.getEPHdbf('t310')


('Downloading', 't310')
file in place, creating CSV file
csv file cleanDataHousehold t310 .csv successfully created in folder data/
csv file cleanData t310 .csv successfully created in folder data/

In [26]:
data1 = pd.read_csv('data/cleanDatat310.csv')

In [27]:
data2 = categorize.categorize(data1)
data3 = schoolYears.schoolYears(data2)
data = make_dummy.make_dummy(data3)

In [28]:
dataModel = functionsForModels.prepareDataForModel(data)

In [29]:
dataModel.head()


Out[29]:
   PONDERA  P47T   P21  primary  secondary  university  male_14to24  male_25to34  female_14to24  female_25to34  female_35more  female  age  education  education2  age2  lnIncome  lnIncomeT
2     1674  3000  3000      7.0        0.0         0.0            0            0              0              0              0       0   42        7.0        49.0  1764  8.006368   8.006368
3     1674  2800  2800      7.0        5.0         5.0            0            0              0              0              1       1   44       17.0       289.0  1936  7.937375   7.937375
7     1320  6000  5000      7.0        5.0         5.0            0            0              0              0              0       0   38       17.0       289.0  1444  8.517193   8.699515
8     1320  4000  4000      7.0        5.0         5.0            0            0              0              1              0       1   28       17.0       289.0   784  8.294050   8.294050
9     1281  3800  3800      7.0        5.0         5.0            0            0              0              0              0       0   63       17.0       289.0  3969  8.242756   8.242756
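
The helper `prepareDataForModel` is not shown in this notebook, so the definitions of the derived columns have to be inferred. The sketch below is an assumption-labelled reconstruction: it rebuilds `education`, `education2`, `age2`, `lnIncome` and `lnIncomeT` for three of the rows printed above, using only the standard library. The `derive` function is hypothetical and is not part of the project's helper modules.

```python
import math

# Three rows copied from the dataModel.head() output above (index 2, 7 and 8).
rows = [
    {"P47T": 3000, "P21": 3000, "primary": 7.0, "secondary": 0.0, "university": 0.0, "age": 42},
    {"P47T": 6000, "P21": 5000, "primary": 7.0, "secondary": 5.0, "university": 5.0, "age": 38},
    {"P47T": 4000, "P21": 4000, "primary": 7.0, "secondary": 5.0, "university": 5.0, "age": 28},
]


def derive(row):
    """Rebuild the derived columns (assumed definitions, inferred from the printed rows)."""
    education = row["primary"] + row["secondary"] + row["university"]
    return {
        "education": education,                # total years of schooling
        "education2": education ** 2,          # squared term for the polynomial model
        "age2": row["age"] ** 2,               # squared term for the polynomial model
        "lnIncome": math.log(row["P21"]),      # natural log of main-activity income
        "lnIncomeT": math.log(row["P47T"]),    # natural log of total income
    }


derived = [derive(r) for r in rows]
for d in derived:
    print(d)
```

The row with index 7 is the informative one, because it is the only one of the three where `P21` and `P47T` differ, so it shows which log column belongs to which income variable.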

DATA EXPLORATION

Plot for: Income ~ Education and Age


In [30]:
fig = plt.figure(figsize=(16,12))
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)
ax4 = fig.add_subplot(2,2,4)

# top row: total individual income (P47T)
ax1.plot(dataModel.education,dataModel.P47T,'ro')
ax1.set_ylabel('Total income')
ax1.set_xlabel('Education')
ax2.plot(dataModel.age,dataModel.P47T,'ro')
ax2.set_xlabel('Age')
# bottom row: income from main activity (P21)
ax3.plot(dataModel.education,dataModel.P21,'bo')
ax3.set_ylabel('Labor income')
ax3.set_xlabel('Education')
ax4.plot(dataModel.age,dataModel.P21,'bo')
ax4.set_xlabel('Age')



Reference:

  • P21: Refers to individual income by main activity (occupation)
  • P47T: Refers to total individual income (includes capital gains)
  • lnIncomeT: Refers to ln of P47T
  • lnIncome: Refers to ln of P21

Plot for: LnIncome


In [31]:
fig = plt.figure(figsize=(16,12))
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)
ax4 = fig.add_subplot(2,2,4)

sns.kdeplot(dataModel.P47T,ax=ax1,color = 'red')
sns.kdeplot(dataModel.lnIncomeT,ax=ax2,color = 'red')
sns.kdeplot(dataModel.P21,ax=ax3)
sns.kdeplot(dataModel.lnIncome,ax=ax4)



In [32]:
print 'mean:', dataModel.lnIncome.mean(), 'std:', dataModel.lnIncome.std()


mean: 7.50403461536 std: 0.876203364122

In [33]:
print 'mean:', dataModel.P21.mean(), 'std:', dataModel.P21.std()


mean: 2488.57982262 std: 2063.59290863

In [34]:
plt.boxplot(list(dataModel.P21), 0, 'gD')



Plot for: LnIncome ~ Education and Age


In [35]:
g = sns.JointGrid(x="education", y="lnIncome", data=dataModel)  
g.plot_joint(sns.regplot, order=2)  
g.plot_marginals(sns.distplot)

g2 = sns.JointGrid(x="age", y="lnIncome", data=dataModel)  
g2.plot_joint(sns.regplot, order=2)  
g2.plot_marginals(sns.distplot)



REGRESSION MODEL (ECLAC)

Background:

ECLAC (the Economic Commission for Latin America and the Caribbean) estimates income with a regression model built on education, gender and age. The variables are:

  • x1: primary
  • x2: secondary
  • x3: university
  • x4: male_14to24
  • x5: male_25to34
  • x6: female_14to24
  • x7: female_25to34
  • x8: female_35more
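
The summaries in this section are headed "WLS Regression Results", so `runModel` appears to fit a weighted least squares model. Its internals are not shown, and the use of `PONDERA` as the weight is an assumption. The sketch below shows how a weighted fit works in general, using NumPy on synthetic data. The function `fit_wls` and the toy variables are hypothetical stand-ins, not the project's code.

```python
import numpy as np


def fit_wls(X, y, weights):
    """Weighted least squares: minimise sum(w_i * (y_i - x_i . beta)^2)."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend the intercept column
    sw = np.sqrt(weights)
    # Scaling each row by sqrt(w_i) turns the weighted problem into ordinary least squares.
    beta, _, _, _ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta


rng = np.random.RandomState(1024)
n = 500
years_university = rng.randint(0, 6, size=n).astype(float)  # toy stand-in for 'university'
female_35more = rng.randint(0, 2, size=n).astype(float)     # toy stand-in for a group dummy
weights = rng.randint(500, 2000, size=n).astype(float)      # toy stand-in for PONDERA

true_beta = np.array([7.0, 0.12, -0.5])                     # intercept, schooling, dummy
X = np.column_stack([years_university, female_35more])
y = true_beta[0] + X.dot(true_beta[1:]) + rng.normal(0, 0.1, size=n)

beta_hat = fit_wls(X, y, weights)
print(beta_hat)
```

With statsmodels, which the notebook already imports as `sm`, the equivalent call would be `sm.WLS(y, sm.add_constant(X), weights=weights).fit()`.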

MODEL # 1 - ECLAC (Using Individual Income)


In [36]:
dataModel1 = functionsForModels.runModel(dataModel, income = 'P21')


                            WLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.256
Model:                            WLS   Adj. R-squared:                  0.254
Method:                 Least Squares   F-statistic:                     154.7
Date:                Sun, 11 Dec 2016   Prob (F-statistic):          1.74e-224
Time:                        22:13:00   Log-Likelihood:                -32126.
No. Observations:                3608   AIC:                         6.427e+04
Df Residuals:                    3599   BIC:                         6.433e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       1707.4296    219.005      7.796      0.000      1278.042  2136.817
x1            83.8014     33.072      2.534      0.011        18.960   148.643
x2           162.4009     16.357      9.929      0.000       130.331   194.471
x3           339.7044     18.023     18.848      0.000       304.368   375.041
x4         -1153.2109    118.197     -9.757      0.000     -1384.951  -921.471
x5          -579.4509     90.681     -6.390      0.000      -757.243  -401.659
x6         -1876.6651    136.990    -13.699      0.000     -2145.251 -1608.079
x7         -1540.6495    101.248    -15.217      0.000     -1739.158 -1342.141
x8         -1127.6688     76.237    -14.792      0.000     -1277.140  -978.198
==============================================================================
Omnibus:                     2930.815   Durbin-Watson:                   1.763
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           167283.591
Skew:                           3.443   Prob(JB):                         0.00
Kurtosis:                      35.639   Cond. No.                         60.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
x1: primary
x2: secondary
x3: university
x4: male_14to24
x5: male_25to34
x6: female_14to24
x7: female_25to34
x8: female_35more
IS R-squared for 1000 times is 0.26332064393
OS R-squared for 1000 times is 0.260058876089

MODEL # 2 - ECLAC (Using Log of Individual Income)


In [37]:
dataModel2 = functionsForModels.runModel(dataModel, income = 'lnIncome', variables= [
        'primary','secondary','university',
        'male_14to24','male_25to34',
        'female_14to24', 'female_25to34', 'female_35more'])


                            WLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.274
Model:                            WLS   Adj. R-squared:                  0.272
Method:                 Least Squares   F-statistic:                     169.6
Date:                Sun, 11 Dec 2016   Prob (F-statistic):          1.71e-243
Time:                        22:13:04   Log-Likelihood:                -4124.3
No. Observations:                3608   AIC:                             8267.
Df Residuals:                    3599   BIC:                             8322.
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          6.8690      0.093     73.616      0.000         6.686     7.052
x1             0.0787      0.014      5.585      0.000         0.051     0.106
x2             0.0838      0.007     12.030      0.000         0.070     0.098
x3             0.1221      0.008     15.897      0.000         0.107     0.137
x4            -0.4212      0.050     -8.364      0.000        -0.520    -0.322
x5            -0.1547      0.039     -4.004      0.000        -0.230    -0.079
x6            -0.9296      0.058    -15.926      0.000        -1.044    -0.815
x7            -0.6264      0.043    -14.521      0.000        -0.711    -0.542
x8            -0.5843      0.032    -17.988      0.000        -0.648    -0.521
==============================================================================
Omnibus:                      667.857   Durbin-Watson:                   1.855
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1520.071
Skew:                          -1.050   Prob(JB):                         0.00
Kurtosis:                       5.387   Cond. No.                         60.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
x1: primary
x2: secondary
x3: university
x4: male_14to24
x5: male_25to34
x6: female_14to24
x7: female_25to34
x8: female_35more
IS R-squared for 1000 times is 0.28428319381
OS R-squared for 1000 times is 0.279569691356

MODEL # 3 - ECLAC (Using Total Individual Income)


In [38]:
dataModel3 = functionsForModels.runModel(dataModel, income = 'P47T')


                            WLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.253
Model:                            WLS   Adj. R-squared:                  0.251
Method:                 Least Squares   F-statistic:                     152.0
Date:                Sun, 11 Dec 2016   Prob (F-statistic):          4.78e-221
Time:                        22:13:08   Log-Likelihood:                -32860.
No. Observations:                3608   AIC:                         6.574e+04
Df Residuals:                    3599   BIC:                         6.579e+04
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const       2004.2388    268.448      7.466      0.000      1477.914  2530.564
x1            92.0956     40.538      2.272      0.023        12.616   171.575
x2           175.7131     20.050      8.764      0.000       136.403   215.023
x3           454.4130     22.092     20.569      0.000       411.099   497.727
x4         -1310.4315    144.881     -9.045      0.000     -1594.488 -1026.375
x5          -712.6864    111.153     -6.412      0.000      -930.616  -494.756
x6         -2196.2699    167.917    -13.080      0.000     -2525.492 -1867.048
x7         -1824.5013    124.105    -14.701      0.000     -2067.825 -1581.177
x8         -1051.1493     93.448    -11.249      0.000     -1234.365  -867.934
==============================================================================
Omnibus:                     2829.452   Durbin-Watson:                   1.801
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           117722.951
Skew:                           3.371   Prob(JB):                         0.00
Kurtosis:                      30.159   Cond. No.                         60.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
x1: primary
x2: secondary
x3: university
x4: male_14to24
x5: male_25to34
x6: female_14to24
x7: female_25to34
x8: female_35more
IS R-squared for 1000 times is 0.256591443245
OS R-squared for 1000 times is 0.25320737801

MODEL # 4 - ECLAC (Using Log of Total Individual Income)


In [39]:
dataModel4 = functionsForModels.runModel(dataModel, income = 'lnIncomeT')


                            WLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.269
Model:                            WLS   Adj. R-squared:                  0.267
Method:                 Least Squares   F-statistic:                     165.6
Date:                Sun, 11 Dec 2016   Prob (F-statistic):          2.03e-238
Time:                        22:13:12   Log-Likelihood:                -3972.8
No. Observations:                3608   AIC:                             7964.
Df Residuals:                    3599   BIC:                             8019.
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          7.2029      0.089     80.505      0.000         7.028     7.378
x1             0.0545      0.014      4.030      0.000         0.028     0.081
x2             0.0749      0.007     11.203      0.000         0.062     0.088
x3             0.1303      0.007     17.701      0.000         0.116     0.145
x4            -0.4616      0.048     -9.559      0.000        -0.556    -0.367
x5            -0.1635      0.037     -4.412      0.000        -0.236    -0.091
x6            -0.9301      0.056    -16.618      0.000        -1.040    -0.820
x7            -0.5881      0.041    -14.219      0.000        -0.669    -0.507
x8            -0.4353      0.031    -13.975      0.000        -0.496    -0.374
==============================================================================
Omnibus:                      712.707   Durbin-Watson:                   1.849
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2039.735
Skew:                          -1.031   Prob(JB):                         0.00
Kurtosis:                       6.052   Cond. No.                         60.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
x1: primary
x2: secondary
x3: university
x4: male_14to24
x5: male_25to34
x6: female_14to24
x7: female_25to34
x8: female_35more
IS R-squared for 1000 times is 0.27487241836
OS R-squared for 1000 times is 0.269403751808

REGRESSION MODEL (ALTERNATIVE)

Background:

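We tested an alternative to the ECLAC regression model that uses second-degree polynomials in age and education (the terms education, education2, age and age2) to capture the non-linear relationship of each with income, plus a single female dummy in place of the age-by-gender groups.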

MODEL # 1 - ALTERNATIVE (Using Log of Total Individual Income)


In [40]:
dataModel5 = functionsForModels.runModel(dataModel, income = 'lnIncomeT', variables=['education','education2',
                                'age','age2','female'])


                            WLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.275
Model:                            WLS   Adj. R-squared:                  0.274
Method:                 Least Squares   F-statistic:                     272.8
Date:                Sun, 11 Dec 2016   Prob (F-statistic):          5.89e-248
Time:                        22:13:16   Log-Likelihood:                -3959.0
No. Observations:                3608   AIC:                             7930.
Df Residuals:                    3602   BIC:                             7967.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          6.0980      0.133     45.843      0.000         5.837     6.359
x1            -0.0033      0.015     -0.221      0.825        -0.033     0.026
x2             0.0046      0.001      6.704      0.000         0.003     0.006
x3             0.0499      0.005      9.679      0.000         0.040     0.060
x4            -0.0005   5.94e-05     -7.858      0.000        -0.001    -0.000
x5            -0.4433      0.024    -18.310      0.000        -0.491    -0.396
==============================================================================
Omnibus:                      704.928   Durbin-Watson:                   1.853
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1966.525
Skew:                          -1.029   Prob(JB):                         0.00
Kurtosis:                       5.974   Cond. No.                     2.45e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.45e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
x1: education
x2: education2
x3: age
x4: age2
x5: female
IS R-squared for 1000 times is 0.280300100932
OS R-squared for 1000 times is 0.276369978805
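
A quadratic age profile implies an age at which predicted log income peaks. The sketch below computes it from the summary above. The `age2` coefficient is printed as -0.0005, which is too coarse to use directly, so it is recovered here as t times the standard error. That is an approximation from rounded figures, and the helper `peak_of_quadratic` is hypothetical.

```python
def peak_of_quadratic(linear_coef, squared_coef):
    """Turning point of a*x + b*x**2, where the derivative a + 2*b*x equals zero."""
    return -linear_coef / (2.0 * squared_coef)


# Values copied from the summary above.
age_coef = 0.0499                # x3: age
age2_coef = -7.858 * 5.94e-05    # x4: age2, recovered as t * std err

peak_age = peak_of_quadratic(age_coef, age2_coef)
print(round(peak_age, 1))
```

Under these rounded inputs the peak falls in the early fifties, which matches the concave shape of the age plot in the exploration section.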

MODEL # 2 - ALTERNATIVE (Using Log of Individual Income)


In [41]:
dataModel6 = functionsForModels.runModel(dataModel, income = 'lnIncome', variables=['education','education2',
                                'age','age2','female'])


                            WLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.287
Model:                            WLS   Adj. R-squared:                  0.286
Method:                 Least Squares   F-statistic:                     289.9
Date:                Sun, 11 Dec 2016   Prob (F-statistic):          2.70e-261
Time:                        22:13:19   Log-Likelihood:                -4091.5
No. Observations:                3608   AIC:                             8195.
Df Residuals:                    3602   BIC:                             8232.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          5.5938      0.138     40.537      0.000         5.323     5.864
x1             0.0277      0.016      1.779      0.075        -0.003     0.058
x2             0.0032      0.001      4.525      0.000         0.002     0.005
x3             0.0659      0.005     12.328      0.000         0.055     0.076
x4            -0.0007   6.16e-05    -11.488      0.000        -0.001    -0.001
x5            -0.5528      0.025    -22.012      0.000        -0.602    -0.504
==============================================================================
Omnibus:                      641.219   Durbin-Watson:                   1.858
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1458.019
Skew:                          -1.012   Prob(JB):                         0.00
Kurtosis:                       5.367   Cond. No.                     2.45e+04
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.45e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
x1: education
x2: education2
x3: age
x4: age2
x5: female
IS R-squared for 1000 times is 0.297515826042
OS R-squared for 1000 times is 0.29431490923
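
Warning [2] is expected when a variable and its square enter the same model, because the two are strongly correlated over a positive range. One common remedy, which this notebook does not apply, is to centre the variable before squaring it. The sketch below illustrates the effect on a toy age range. The ages and the `design` helper are hypothetical.

```python
import numpy as np


def design(a):
    """Design matrix with an intercept, the variable and its square."""
    return np.column_stack([np.ones_like(a), a, a ** 2])


age = np.arange(14, 81, dtype=float)   # toy ages, not the survey data
centered = age - age.mean()

cond_raw = np.linalg.cond(design(age))
cond_centered = np.linalg.cond(design(centered))
corr_raw = np.corrcoef(age, age ** 2)[0, 1]
corr_centered = np.corrcoef(centered, centered ** 2)[0, 1]

print("condition number: raw %.0f, centered %.0f" % (cond_raw, cond_centered))
print("corr(x, x^2):     raw %.3f, centered %.3f" % (corr_raw, corr_centered))
```

Centring changes the interpretation of the linear coefficient (it becomes the slope at the mean age) but leaves the fitted values and R-squared unchanged.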
