Reference: http://dump.jazzido.com/CNPHV2010-RADIO/
In [2]:
import pandas as pd
import numpy as np
import os
import sys
import simpledbf
%pylab inline
import matplotlib.pyplot as plt
In [3]:
def getEPHdbf(censusstring):
    print("Downloading", censusstring)
    ### First check that the file is not already in data/
    if not os.path.isfile("data/Individual_" + censusstring + ".DBF"):
        if os.path.isfile('Individual_' + censusstring + ".DBF"):
            # if it is in the current dir, just move it
            if os.system("mv " + 'Individual_' + censusstring + ".DBF " + "data/"):
                print("Error moving file! Please check!")
        # otherwise start looking for the zip file
        else:
            if not os.path.isfile("data/" + censusstring + "_dbf.zip"):
                if not os.path.isfile(censusstring + "_dbf.zip"):
                    os.system(
                        "curl -O http://www.indec.gob.ar/ftp/cuadros/menusuperior/eph/" + censusstring + "_dbf.zip")
                ### os.system() runs bash commands with their arguments
                os.system("mv " + censusstring + "_dbf.zip " + "data/")
            ### unzip the DBF
            os.system("unzip " + "data/" + censusstring + "_dbf.zip -d data/")
    if not os.path.isfile("data/" + 'Individual_' + censusstring + ".DBF"):
        print("WARNING!!! something is wrong: the file is not there!")
    else:
        print("file in place, creating CSV file")
        trimestre = censusstring
        dbf = simpledbf.Dbf5('data/Individual_' + trimestre + '.DBF', codec='latin1')
        df_def = dbf.to_dataframe()
        # A fuller selection with every questionnaire block, kept for reference:
        # df_def_i = df_def.loc[df_def.REGION == 1, ['CODUSU','NRO_HOGAR','PONDERA','CH03','CH04',
        #     'CH06','CH07','CH08','CH09','CH12','CH13',
        #     'CH15','CH16','NIVEL_ED','ESTADO','CAT_OCUP','CAT_INAC',
        #     'PP02C2','PP02C3','PP02C4','PP02C5','PP02C6','PP02C7','PP02C8',
        #     'PP02E','PP02H','PP02I','PP03C','PP03D','PP3E_TOT','PP3F_TOT',
        #     'PP03G','PP03H','PP03I','PP03J','INTENSI','PP04A','PP04B1','PP04B2',
        #     'PP04C','PP04C99','PP04G','PP05C_1','PP05C_2',
        #     'PP05C_3','PP05E','PP05F','PP05H','PP06A','PP06C','PP06D',
        #     'PP06E','PP06H','PP07A','PP07C','PP07D','PP07E','PP07F1',
        #     'PP07F2','PP07F3','PP07F4','PP07F5','PP07G1','PP07G2','PP07G3',
        #     'PP07G4','PP07G_59','PP07H','PP07I','PP07J','PP07K','PP08D1',
        #     'PP08D4','PP08F1','PP08F2','PP08J1','PP08J2','PP08J3','PP09A',
        #     'PP10A','PP10C','PP10D','PP10E','PP11A','PP11B1','PP11C','PP11C99',
        #     'PP11L','PP11L1','PP11M','PP11N','PP11O','PP11P','PP11Q','PP11R',
        #     'PP11S','PP11T','P21',
        #     'V2_M','V3_M','V4_M','V5_M','V8_M','V9_M','V10_M',
        #     'V11_M','V12_M','V18_M','V19_AM','V21_M','ITF']]
        df_def_i = df_def.loc[df_def.REGION == 1,
                              ['CODUSU', 'NRO_HOGAR', 'PONDERA', 'CH03', 'CH04',
                               'CH06', 'CH07', 'CH08', 'CH09', 'CH15', 'NIVEL_ED',
                               'ESTADO', 'CAT_OCUP', 'CAT_INAC', 'ITF']]
        df_def_i.columns = ['CODUSU', 'NRO_HOGAR', 'PONDERA', 'Parentesco', 'Sexo',
                            'Edad', 'Estado_Civil', 'Cobertura_Medica', 'Sabe_leer',
                            'Lugar_Nac', 'NIVEL_ED', 'Trabajo', 'CAT_OCUP',
                            'CAT_INAC', 'ITF']
        df_def_i.index = range(0, df_def_i.shape[0])
        df_def_i.to_csv('clean_' + trimestre + '.csv', index=False, encoding='utf-8')
        print('csv file clean_' + trimestre + '.csv successfully created')
    return
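The helper shells out to curl and unzip via os.system; where those tools are unavailable, the same download-and-extract step can be done with the standard library alone. A minimal sketch (same URL pattern as above; an alternative, not the notebook's method):
In [ ]:
import os
import urllib.request
import zipfile

def fetch_eph(censusstring):
    # hypothetical helper mirroring getEPHdbf's download/unzip step
    url = ('http://www.indec.gob.ar/ftp/cuadros/menusuperior/eph/'
           + censusstring + '_dbf.zip')
    zpath = os.path.join('data', censusstring + '_dbf.zip')
    if not os.path.isfile(zpath):
        urllib.request.urlretrieve(url, zpath)
    with zipfile.ZipFile(zpath) as z:
        z.extractall('data')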
In [4]:
def dummy_variables(data, data_type_dict):
    # Loop over nominal variables (copy the keys first, since the
    # dictionary is modified inside the loop).
    for variable in [v for v in list(data_type_dict)
                     if data_type_dict[v] == 'nominal']:
        # First we create the columns with dummy variables.
        # The argument 'prefix' means the column names will be
        # prefix_value for each unique value in the original column,
        # so we set the prefix to the name of the original variable.
        dummy_df = pd.get_dummies(data[variable], prefix=variable)
        # Remove the old variable from the dictionary.
        data_type_dict.pop(variable)
        # Add the new dummy variables to the dictionary.
        for dummy_variable in dummy_df.columns:
            data_type_dict[dummy_variable] = 'nominal'
        # Add the dummy variables to the main df.
        data = data.drop(variable, axis=1)
        data = data.join(dummy_df)
    return [data, data_type_dict]
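A toy run (invented values, not the survey data) showing what dummy_variables returns for one nominal column:
In [ ]:
toy = pd.DataFrame({'Edad': [30, 45], 'Trabajo': ['Ocupado', 'Inactivo']})
toy_types = {'Trabajo': 'nominal'}
toy_df, toy_types = dummy_variables(toy, toy_types)
print(toy_df.columns.tolist())   # ['Edad', 'Trabajo_Inactivo', 'Trabajo_Ocupado']
print(toy_types)                 # both dummy columns now tagged 'nominal'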
In [5]:
from sklearn.model_selection import train_test_split
from sklearn import linear_model

def Regularization_fit_lambda(model, X_train, y_train, lambdas, p=0.4, Graph=False, logl=False):
    # model: 1 = Ridge, 2 = Lasso
    # lambdas: a list of lambda values to try
    # p: ratio of validation sample size to total training size
    # Graph: plot the OS R^2 values for the different lambdas
    R_2_OS = []
    X_train0, X_valid, y_train0, y_valid = train_test_split(
        X_train, y_train, test_size=p, random_state=200)
    if model == 1:
        RM = lambda a: linear_model.Ridge(fit_intercept=True, alpha=a)
        model_label = 'Ridge'
    else:
        RM = lambda a: linear_model.Lasso(fit_intercept=True, alpha=a)
        model_label = 'Lasso'
    best_R2 = -1
    best_lambda = lambdas[0]
    for i in lambdas:
        lm = RM(i)
        lm.fit(X_train0, y_train0)         # fit the regularized model
        y_predict = lm.predict(X_valid)    # predict on the validation sample
        err_OS = y_predict - y_valid
        R_2_OS_ = 1 - np.var(err_OS) / np.var(y_valid)
        R_2_OS.append(R_2_OS_)
        if R_2_OS_ > best_R2:
            best_R2 = R_2_OS_
            best_lambda = i
    if Graph:
        plt.title('OS R-squared for different Lambda')
        if logl:
            plt.xlabel('ln(Lambda)')
            l = np.log(lambdas)
            bl = np.log(best_lambda)
        else:
            plt.xlabel('Lambda')
            l = lambdas
            bl = best_lambda
        plt.plot(l, R_2_OS, 'b', label=model_label)
        plt.legend(loc='upper right')
        plt.ylabel('R-squared')
        plt.axvline(bl, color='r', linestyle='--')
        plt.show()
    return best_lambda
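A quick self-check of the helper on synthetic data (hypothetical shapes), before it is used on the EPH sample further down:
In [ ]:
from sklearn.datasets import make_regression
Xs, ys = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=0)
print(Regularization_fit_lambda(1, Xs, ys, np.exp(np.linspace(-3, 3, 30))))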
In [127]:
getEPHdbf('t310')
In [6]:
data = pd.read_csv('clean_t310.csv')
data.head()
Out[6]:
In [7]:
cols = data.columns.tolist()
cols = cols[-1:] + cols[:-1]  # move ITF, the target, to the front
data = data[cols]
In [8]:
# Recode the numeric survey codes to labels. Codes outside each map
# (e.g. 9 = no answer, 0 = not applicable) become NaN automatically,
# so no separate replace step is needed.
data['Parentesco'] = data['Parentesco'].map({1: 'Jefe', 2: 'Conyuge', 3: 'Hijo', 4: 'Yerno', 5: 'Nieto',
                                             6: 'Madre-padre', 7: 'Suegro', 8: 'Hermano', 9: 'Otro',
                                             10: 'No-familia'})
data['Sexo'] = data['Sexo'].map({1: 0, 2: 1})
data['Estado_Civil'] = data['Estado_Civil'].map({1: 'Unido', 2: 'Casado', 3: 'Divorciado', 4: 'Viudo', 5: 'Soltero'})
data['Sabe_leer'] = data['Sabe_leer'].map({1: 'Si', 2: 'No', 3: 'Menor'})
data['Lugar_Nac'] = data['Lugar_Nac'].map({1: 'Localidad', 2: 'Otra_loc', 3: 'Otra_prov', 4: 'Pais_limit',
                                           5: 'Otro_pais'})
data['NIVEL_ED'] = data['NIVEL_ED'].map({1: 'Primaria_I', 2: 'Primaria_C', 3: 'Secundaria_I', 4: 'Secundaria_C',
                                         5: 'Univ_I', 6: 'Univ_C', 7: 'Sin_Edu'})
data['Trabajo'] = data['Trabajo'].map({1: 'Ocupado', 2: 'Desocupado', 3: 'Inactivo', 4: 'Menor'})
data['CAT_OCUP'] = data['CAT_OCUP'].map({0: 'No_empleo', 1: 'Patron', 2: 'Cuenta_propia', 3: 'Empleado',
                                         4: 'Sin_sueldo'})
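A quick check (toy values) that .map() sends codes outside the dictionary to NaN, which is why the old replace-9-with-NaN steps were redundant:
In [ ]:
pd.Series([1, 2, 9]).map({1: 'Si', 2: 'No'})
# -> ['Si', 'No', NaN]: unmapped codes become NaN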
In [9]:
data.head()
Out[9]:
In [10]:
data_type_dict = {'NRO_HOGAR': 'nominal', 'Parentesco': 'nominal', 'Estado_Civil': 'nominal',
                  'Cobertura_Medica': 'nominal', 'Sabe_leer': 'nominal', 'Lugar_Nac': 'nominal',
                  'NIVEL_ED': 'nominal', 'Trabajo': 'nominal', 'CAT_OCUP': 'nominal', 'CAT_INAC': 'nominal'}
dummy_var = dummy_variables(data, data_type_dict)
df = dummy_var[0]
df = df.dropna(axis=0)
weights = 1. / df.PONDERA
df = df.drop('PONDERA', axis=1)
In [11]:
data = data.dropna(axis=0)
g = data.columns.to_series().groupby(data.dtypes).groups
# g
In [12]:
import seaborn as sns
sns.set(context="paper", font="monospace")
corrmat = df.corr()
f, ax = plt.subplots(figsize=(18, 16))
sns.heatmap(corrmat, vmax=.8, square=True)
f.tight_layout()
In [17]:
import statsmodels.api as sm
Y = np.asarray(df.ITF)
x = df.iloc[:,1:]
X = sm.add_constant(x)
wls_model = sm.WLS(Y,X, weights = weights)
results = wls_model.fit()
print(results.summary())
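As a sanity check on what WLS is doing, the estimator has the closed form b = (X'WX)^{-1} X'Wy. A minimal sketch (illustrative only; statsmodels handles rank deficiency more carefully, and row-scaling by the weights avoids building the full diagonal W):
In [ ]:
Xm = np.asarray(X, dtype=float)
w = np.asarray(weights)
b_manual = np.linalg.solve(Xm.T @ (Xm * w[:, None]), Xm.T @ (w * Y))
# b_manual should agree with results.params up to numerical precision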
In [69]:
# sk-learn (Y ~ x) with intercept
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
R_IS = []
R_OS = []
n = 10
for i in range(n):
    # split the weights together with the data so they stay aligned
    X_train, X_test, y_train, y_test, w_train, w_test = train_test_split(
        x, Y, weights, test_size=0.40)
    res = LinearRegression(fit_intercept=True)
    res.fit(X_train, y_train, sample_weight=w_train)
    R_IS.append(1 - ((np.asarray(res.predict(X_train)) - y_train) ** 2).sum() / ((y_train - np.mean(y_train)) ** 2).sum())
    R_OS.append(1 - ((np.asarray(res.predict(X_test)) - y_test) ** 2).sum() / ((y_test - np.mean(y_test)) ** 2).sum())
print("IS R-squared averaged over {} splits: {}".format(n, np.mean(R_IS)))
print("OS R-squared averaged over {} splits: {}".format(n, np.mean(R_OS)))
In [72]:
import pandas as pd
import numpy as np
import statsmodels as sm
import sklearn as skl
import sklearn.preprocessing as preprocessing
import sklearn.linear_model as linear_model
import sklearn.model_selection as model_selection
import sklearn.metrics as metrics
import sklearn.tree as tree
import seaborn as sns
X_train, X_test, y_train, y_test = train_test_split(x, Y, train_size=0.70)
scaler = preprocessing.StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = pd.DataFrame(scaler.transform(X_test), columns=X_train.columns)
In [74]:
# Encode the categorical features as numbers
def number_encode_features(df):
    result = df.copy()
    encoders = {}
    for column in result.columns:
        if result.dtypes[column] == object:
            encoders[column] = preprocessing.LabelEncoder()
            result[column] = encoders[column].fit_transform(result[column])
    return result, encoders
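A toy illustration (made-up column, not the EPH data) of what the encoder produces; LabelEncoder assigns integers in sorted class order:
In [ ]:
toy = pd.DataFrame({'Sabe_leer': ['Si', 'No', 'Si', 'Menor']})
toy_enc, toy_encoders = number_encode_features(toy)
print(toy_enc['Sabe_leer'].tolist())        # [2, 1, 2, 0]
print(toy_encoders['Sabe_leer'].classes_)   # ['Menor' 'No' 'Si']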
In [77]:
encoded_data, _ = number_encode_features(df)
In [81]:
cls = linear_model.LogisticRegression()
cls.fit(X_train, y_train)
y_pred = cls.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_pred)
# plt.figure(figsize=(20,20))
# plt.subplot(2,1,1)
# sns.heatmap(cm, annot=True, fmt="d", xticklabels=encoders["Target"].classes_, yticklabels=encoders["Target"].classes_)
# plt.ylabel("Real value")
# plt.xlabel("Predicted value")
# print("F1 score: %f" % skl.metrics.f1_score(y_test, y_pred))
coefs = pd.Series(cls.coef_[0], index=X_train.columns)
coefs = coefs.sort_values()
# ax = plt.subplot(2,1,2)
coefs.plot(kind="bar")
# plt.show()
Out[81]:
In [71]:
#import Quandl
import statsmodels.formula.api as smf
from scipy import stats
# from pandas.stats.api import ols  # removed in modern pandas; unused here
from sklearn import linear_model
In [72]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 1:],
                                                    df.ITF, test_size=0.4, random_state=200)
In [73]:
Ridge = linear_model.Ridge(fit_intercept=True, alpha=7.80899583959)  # alpha selected with Regularization_fit_lambda (see the search below)
Ridge.fit(X_train,y_train)
# In the sample:
p_IS=Ridge.predict(X_train)
err_IS=p_IS-y_train
R_2_IS_Ridge=1-np.var(err_IS)/np.var(y_train)
print("The R-squared we found for IS Ridge is: {0}".format(R_2_IS_Ridge))
Ridge_coef=Ridge.coef_
#Out of sample
p_OS=Ridge.predict(X_test)
err_OS=p_OS-y_test
R_2_OS_Ridge=1-np.var(err_OS)/np.var(y_test)
print("The R-squared we found for OS Ridge is: {0}".format(R_2_OS_Ridge))
In [48]:
#select best lambda for Ridge
lambdas = np.exp(np.linspace(-5,13,200))
lambda_r_optimal=Regularization_fit_lambda(1,X_train,y_train,lambdas,p=0.4,Graph=True)
print('Optimal lambda for Ridge={0}'.format(lambda_r_optimal))
In [74]:
Lasso = linear_model.Lasso(fit_intercept=True, alpha=1)
# try Lasso with a selected regularization parameter lambda
Lasso.fit(X_train, y_train)
# In sample:
p_IS = Lasso.predict(X_train)
err_IS = p_IS - y_train
R_2_IS_Lasso = 1 - np.var(err_IS) / np.var(y_train)
print("The R-squared we found for IS Lasso is: {0}".format(R_2_IS_Lasso))
Lasso_coef = Lasso.coef_
# Out of sample
p_OS = Lasso.predict(X_test)
err_OS = p_OS - y_test
R_2_OS_Lasso = 1 - np.var(err_OS) / np.var(y_test)
print("The R-squared we found for OS Lasso is: {0}".format(R_2_OS_Lasso))
In [ ]:
#select lambdas for Lasso
lambdas=np.exp(np.linspace(-5,6.5,200))
lambda_l_optimal=Regularization_fit_lambda(2,X_train,y_train,lambdas,p=0.4,Graph=True)
print('Optimal lambda for Lasso={0}'.format(lambda_l_optimal))
In [15]:
n=2
from sklearn.decomposition import PCA
pca = PCA(n)
Xproj = pca.fit_transform(df)
eigenvalues = pca.explained_variance_
plt.figure(2, figsize=(8, 6))
plt.scatter(Xproj[:, 0], Xproj[:, 1], c = X.sum(axis=1), cmap=plt.cm.cool)
plt.xlabel('First Component')
plt.ylabel('Second Component')
plt.show()
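A quick diagnostic of how much variance the two components actually retain:
In [ ]:
print(pca.explained_variance_ratio_, pca.explained_variance_ratio_.sum())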
In [181]:
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.cluster import KMeans
X_ = np.asarray(df)
range_n_clusters = [2, 3, 4, 5, 6, 7]
for n_clusters in range_n_clusters:
    km = KMeans(n_clusters=n_clusters, random_state=324)
    cluster_labels = km.fit_predict(X_)
    silhouette_avg = silhouette_score(X_, cluster_labels)
    print("For n_clusters = {}, the average silhouette_score is: {}".format(n_clusters, silhouette_avg))
In [182]:
from sklearn.cluster import KMeans
n = 2
dd = Xproj
km = KMeans(n_clusters=n)
res = km.fit(dd)
with plt.style.context('ggplot'):
    plt.figure(figsize=(6, 5))
    plt.scatter(dd[:, 0], dd[:, 1], c=res.labels_)
    plt.ylabel('X2')
    plt.xlabel('X1')
    plt.xticks(())
    plt.yticks(())
    plt.title("KMeans, K = {} Clusters".format(n))
    plt.show()
In [183]:
from sklearn.mixture import GaussianMixture
gm = GaussianMixture(n_components=n)
res1 = gm.fit(dd)
with plt.style.context('ggplot'):
    plt.figure(figsize=(6, 6))
    plt.scatter(dd[:, 0], dd[:, 1], c=res1.predict(dd))
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.xticks(())
    plt.yticks(())
    plt.title("Gaussian Mixture")
    plt.show()
In [184]:
# Feature Extraction with RFE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# X and Y were defined above
X = x
# feature extraction
model = LogisticRegression()
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
In [23]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
X = x
names = x.columns  # feature names only (ITF, the target, excluded)
# use linear regression as the model
lr = LinearRegression()
# rank all features, i.e. continue the elimination until only one is left
rfe = RFE(lr, n_features_to_select=1)
rfe.fit(X, Y)
print("Features sorted by their rank:")
print(sorted(zip(map(lambda v: round(v, 4), rfe.ranking_), names)))
In [24]:
# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
import pandas
import numpy
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# X and Y were defined above; chi2 requires non-negative features
names = x.columns
X = x
# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# summarize scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5, :])
In [25]:
from sklearn.linear_model import RandomizedLasso  # needs scikit-learn < 0.21
# Features get scaled automatically by sklearn's implementation
X = x
names = x.columns
rlasso = RandomizedLasso(alpha=0.025)
rlasso.fit(X, Y)
print("Features sorted by their score:")
print(sorted(zip(map(lambda v: round(v, 4), rlasso.scores_),
                 names), reverse=True))
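RandomizedLasso was deprecated in scikit-learn 0.19 and removed in 0.21. A minimal sketch of the same stability-selection idea with plain Lasso on bootstrap resamples (selection frequencies stand in for rlasso.scores_; assumes X, Y, names from the cell above):
In [ ]:
from sklearn.linear_model import Lasso
rng = np.random.RandomState(0)
Xa, Ya = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
n_boot = 100
freq = np.zeros(Xa.shape[1])
for _ in range(n_boot):
    idx = rng.randint(0, len(Xa), len(Xa))   # bootstrap resample
    lb = Lasso(alpha=0.025).fit(Xa[idx], Ya[idx])
    freq += (lb.coef_ != 0)                  # count how often each feature is selected
freq /= n_boot
print(sorted(zip(np.round(freq, 4), names), reverse=True))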
In [67]:
# !pip install minepy
from sklearn.linear_model import (LinearRegression, Ridge,
                                  Lasso, RandomizedLasso)  # RandomizedLasso needs scikit-learn < 0.21
from sklearn.feature_selection import RFE, f_regression
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from minepy import MINE

X = x.values
names = x.columns  # feature names only (the target ITF is excluded)
ranks = {}

def rank_to_dict(ranks, names, order=1):
    # rescale every method's scores to [0, 1] so they are comparable
    minmax = MinMaxScaler()
    ranks = minmax.fit_transform(order * np.array([list(ranks)]).T).T[0]
    ranks = map(lambda v: round(v, 2), ranks)
    return dict(zip(names, ranks))

lr = LinearRegression(normalize=True)  # normalize= was removed in scikit-learn 1.2; fine on the older version this cell requires
lr.fit(X, Y)
ranks["Linear reg"] = rank_to_dict(np.abs(lr.coef_), names)
ridge = Ridge(alpha=7)
ridge.fit(X, Y)
ranks["Ridge"] = rank_to_dict(np.abs(ridge.coef_), names)
lasso = Lasso(alpha=.05)
lasso.fit(X, Y)
ranks["Lasso"] = rank_to_dict(np.abs(lasso.coef_), names)
rlasso = RandomizedLasso(alpha=0.04)
rlasso.fit(X, Y)
ranks["Stability"] = rank_to_dict(np.abs(rlasso.scores_), names)
# stop the search when 5 features are left (they will get equal scores)
rfe = RFE(lr, n_features_to_select=5)
rfe.fit(X, Y)
ranks["RFE"] = rank_to_dict(list(map(float, rfe.ranking_)), names, order=-1)
rf = RandomForestRegressor()
rf.fit(X, Y)
ranks["RF"] = rank_to_dict(rf.feature_importances_, names)
f, pval = f_regression(X, Y, center=True)
ranks["Corr."] = rank_to_dict(f, names)
mine = MINE()
mic_scores = []
for i in range(X.shape[1]):
    mine.compute_score(X[:, i], Y)
    m = mine.mic()
    mic_scores.append(m)
ranks["MIC"] = rank_to_dict(mic_scores, names)
r = {}
for name in names:
    r[name] = round(np.mean([ranks[method][name]
                             for method in ranks.keys()]), 2)
methods = sorted(ranks.keys())
ranks["Mean"] = r
methods.append("Mean")
feat_ranking = pd.DataFrame(ranks)
cols = feat_ranking.columns.tolist()
cols.insert(0, cols.pop(cols.index('Mean')))
feat_ranking = feat_ranking.loc[:, cols]
feat_ranking.sort_values(['Mean'], ascending=False)
Out[67]: