Referencia: http://dump.jazzido.com/CNPHV2010-RADIO/

Variables en el CENSO 2010 (INDEC)

  • VIVIENDA.INCALCONS Calidad constructiva de la vivienda
  • VIVIENDA.INCALSERV Calidad de Conexiones a Servicios Básicos
  • VIVIENDA.INMAT Calidad de los materiales
  • VIVIENDA.TIPVV Tipo de vivienda agrupado
  • VIVIENDA.TOTHOG Cantidad de Hogares en la Vivienda
  • VIVIENDA.URP Area Urbano - Rural
  • VIVIENDA.V00 Tipo de vivienda colectiva
  • VIVIENDA.V01 Tipo de vivienda particular
  • VIVIENDA.V02 Condición de ocupación
  • HOGAR.ALGUNBI Al menos un indicador NBI
  • HOGAR.H05 Material predominante de los pisos
  • HOGAR.H06 Material predominante de la cubierta exterior del techo
  • HOGAR.H07 Revestimiento interior o cielorraso del techo
  • HOGAR.H08 Tenencia de agua
  • HOGAR.H09 Procedencia del agua para beber y cocinar
  • HOGAR.H10 Tiene baño / letrina
  • HOGAR.H11 Tiene botón, cadena, mochila para limpieza del inodoro
  • HOGAR.H12 Desagüe del inodoro
  • HOGAR.H13 Baño / letrina de uso exclusivo
  • HOGAR.H14 Combustible usado principalmente para cocinar
  • HOGAR.H15 Total de habitaciones o piezas para dormir
  • HOGAR.H19A Heladera
  • HOGAR.H19B Computadora
  • HOGAR.H19C Teléfono celular
  • HOGAR.H19D Teléfono de línea
  • HOGAR.INDHAC Hacinamiento
  • HOGAR.NHOG Número del hogar en la vivienda
  • HOGAR.PROP Régimen de tenencia
  • HOGAR.TOTPERS Total de Personas en el Hogar
  • PERSONA.CONDACT Condición de actividad
  • PERSONA.EDADAGRU Edad en grandes grupos
  • PERSONA.EDADQUI Edades quinquenales
  • PERSONA.P01 Relación o parentesco con el jefe(a) del hogar
  • PERSONA.P02 Sexo
  • PERSONA.P03 Edad
  • PERSONA.P05 En que país nació
  • PERSONA.P07 Sabe leer y escribir
  • PERSONA.P08 Condición de asistencia escolar
  • PERSONA.P09 Nivel educativo que cursa o cursó
  • PERSONA.P10 Completó el nivel
  • PERSONA.P12 Utiliza computadora

Required libraries


In [2]:
import pandas as pd
import numpy as np
import os
import sys
import simpledbf
%pylab inline
import matplotlib.pyplot as plt


Populating the interactive namespace from numpy and matplotlib

Functions


In [3]:
def getEPHdbf(censusstring):
    print ("Downloading", censusstring)
    ### First I will check that it is not already there
    if not os.path.isfile("data/Individual_" + censusstring + ".DBF"):
        if os.path.isfile('Individual_' + censusstring + ".DBF"):
            # if in the current dir just move it
            if os.system("mv " + 'Individual_' + censusstring + ".DBF " + "data/"):
                print ("Error moving file!, Please check!")
        # otherwise start looking for the zip file
        else:
            if not os.path.isfile("data/" + censusstring + "_dbf.zip"):
                if not os.path.isfile(censusstring + "_dbf.zip"):
                    os.system(
                        "curl -O http://www.indec.gob.ar/ftp/cuadros/menusuperior/eph/" + censusstring + "_dbf.zip")
                ###  To move it I use the os.system() functions to run bash commands with arguments
                os.system("mv " + censusstring + "_dbf.zip " + "data/")
            ### unzip the csv
            os.system("unzip " + "data/" + censusstring + "_dbf.zip -d data/")

    if not os.path.isfile("data/" + 'Individual_' + censusstring + ".DBF"):
        print ("WARNING!!! something is wrong: the file is not there!")

    else:
        print ("file in place, creating CSV file")

    trimestre = censusstring

    dbf = simpledbf.Dbf5('data/Individual_' + trimestre + '.dbf',codec='latin1')
    df_def = dbf.to_dataframe()

#     df_def_i = df_def.loc[df_def.REGION == 1, ['CODUSU','NRO_HOGAR','PONDERA','CH03','CH04',
#                                             'CH06','CH07','CH08','CH09','CH12','CH13',
#                                             'CH15','CH16','NIVEL_ED','ESTADO','CAT_OCUP','CAT_INAC',
# #                                             'PP02C2','PP02C3','PP02C4','PP02C5','PP02C6','PP02C7','PP02C8',
# #                                             'PP02E','PP02H','PP02I','PP03C','PP03D','PP3E_TOT','PP3F_TOT',
# #                                             'PP03G','PP03H','PP03I','PP03J','INTENSI','PP04A',
# #                                             'PP04B1','PP04B2',
# #                                             'PP04C','PP04C99','PP04G','PP05C_1','PP05C_2',
# #                                             'PP05C_3','PP05E','PP05F','PP05H','PP06A','PP06C','PP06D',
# #                                             'PP06E','PP06H','PP07A','PP07C','PP07D','PP07E','PP07F1',
# #                                             'PP07F2','PP07F3','PP07F4','PP07F5','PP07G1','PP07G2','PP07G3',
# #                                             'PP07G4','PP07G_59','PP07H','PP07I','PP07J','PP07K','PP08D1',
# #                                             'PP08D4','PP08F1','PP08F2','PP08J1','PP08J2','PP08J3','PP09A',
# #                                             'PP10A','PP10C','PP10D',
# #                                             'PP10E','PP11A','PP11B1','PP11C','PP11C99',
# #                                             'PP11L','PP11L1','PP11M','PP11N','PP11O','PP11P','PP11Q','PP11R',
# #                                             'PP11S','PP11T','P21',
# #                                             'V2_M','V3_M','V4_M','V5_M','V8_M','V9_M','V10_M',
# #                                             'V11_M','V12_M','V18_M','V19_AM','V21_M',
#                                                'ITF']]

    df_def_i = df_def.loc[df_def.REGION == 1,
                            ['CODUSU',
                            'NRO_HOGAR',
                            'PONDERA',
                            'CH03',
                            'CH04',
                            'CH06',
                            'CH07',
                            'CH08',
                            'CH09',
                            'CH15',
                            'NIVEL_ED',
                            'ESTADO',
                            'CAT_OCUP',
                            'CAT_INAC',
                            'ITF']]

    df_def_i.columns = ['CODUSU',
                            'NRO_HOGAR',
                            'PONDERA',
                            'Parentesco',
                            'Sexo',
                            'Edad',
                            'Estado_Civil',
                            'Cobertura_Medica',
                            'Sabe_leer',
                            'Lugar_Nac',
                            'NIVEL_ED',
                            'Trabajo',
                            'CAT_OCUP',
                            'CAT_INAC',
                            'ITF']
    
    
    df_def_i.index =range(0,df_def_i.shape[0])

    df_def_i.to_csv('clean_' + trimestre + '.csv', index = False, encoding='utf-8')
    
    print 'csv file clean_',trimestre,'.csv successfully created in folder /data'
    return

In [4]:
def dummy_variables(data, data_type_dict):
    #Loop over nominal variables.
    for variable in filter(lambda x: data_type_dict[x]=='nominal',
                           data_type_dict.keys()):
 
        #First we create the columns with dummy variables.
        #Note that the argument 'prefix' means the column names will be
        #prefix_value for each unique value in the original column, so
        #we set the prefix to be the name of the original variable.
        dummy_df=pd.get_dummies(data[variable], prefix=variable)
 
        #Remove old variable from dictionary.
        data_type_dict.pop(variable)
 
        #Add new dummy variables to dictionary.
        for dummy_variable in dummy_df.columns:
            data_type_dict[dummy_variable] = 'nominal'
 
        #Add dummy variables to main df.
        data=data.drop(variable, axis=1)
        data=data.join(dummy_df)
 
    return [data, data_type_dict]

In [5]:
def Regularization_fit_lambda(model,X_train,y_train,lambdas,p=0.4,Graph=False, logl=False):
    #model = 1-Ridge, 2-Lasso
    #lambdas: a list of lambda values to try
    #p: ratio of the validation sample size / total training size
    #Graph: plot the graph of R^2 values for different lambda

    R_2_OS=[]
    X_train0, X_valid, y_train0, y_valid = train_test_split(X_train,
                                    y_train, test_size = 0.4, random_state = 200)

    if model==1:
        RM = lambda a: linear_model.Ridge(fit_intercept=True, alpha=a)
        model_label='Ridge'
    else:
        RM = lambda a: linear_model.Lasso(fit_intercept=True, alpha=a)
        model_label='Lasso'
    
    best_R2 = -1
    best_lambda = lambdas[0]
    
    for i in lambdas:
        lm = RM(i)
        lm.fit(X_train0,y_train0)  #fit the regularization model
        y_predict=lm.predict(X_valid) #compute the prediction for the validation sample 
        err_OS=y_predict-y_valid
        R_2_OS_=1-np.var(err_OS)/np.var(y_valid)
        R_2_OS.append(R_2_OS_)
        if R_2_OS_ > best_R2:
            best_R2 = R_2_OS_
            best_lambda = i
    
    if Graph==True:
        plt.title('IS R-squared vs OS-R-squared for different Lambda')
        if logl:
            plt.xlabel('ln(Lambda)')
            l=log(lambdas)
            bl=log(best_lambda)
        else:
            plt.xlabel('Lambda')
            l=lambdas
            bl=best_lambda
        plt.plot(l,R_2_OS,'b',label=model_label)
        plt.legend(loc='upper right')
        plt.ylabel('R-squared')
        plt.axvline(bl,color='r',linestyle='--')

        plt.show()
    
    return best_lambda

Download data


In [127]:
getEPHdbf('t310')


('Downloading', 't310')
WARNING!!! something is wrong: the file is not there!
csv file clean_ t310 .csv successfully created in folder /data

In [6]:
data = pd.read_csv('clean_t310.csv')
data.head()


Out[6]:
CODUSU NRO_HOGAR PONDERA Parentesco Sexo Edad Estado_Civil Cobertura_Medica Sabe_leer Lugar_Nac NIVEL_ED Trabajo CAT_OCUP CAT_INAC ITF
0 302468 1 1287 1 2 20 5 1 1 1 5 3 0 3 4000
1 302468 1 1287 10 2 20 5 1 1 1 5 3 0 3 4000
2 307861 1 1674 1 1 42 2 1 1 2 2 1 3 0 5800
3 307861 1 1674 2 2 44 2 1 1 2 6 1 3 0 5800
4 307861 1 1674 3 1 13 5 1 1 1 3 3 0 3 5800

Data cleaning


In [7]:
cols = data.columns.tolist()
cols = cols[-1:] + cols[:-1]
data = data[cols]

In [8]:
data['Parentesco'] = data['Parentesco'].map({1:'Jefe', 2:'Conyuge', 3:'Hijo',4:'Yerno',5:'Nieto', 6:'Madre-padre',
                                             7:'Suegro', 8:'Hermano',9:'Otro', 10:'No-familia'})
data['Sexo'] = data['Sexo'].map({1:0,2:1})

data['Estado_Civil'] = data['Estado_Civil'].map({1:'Unido',2:'Casado',3:'Divorciado',4:'Viudo',5:'Soltero'})
data.Estado_Civil.replace(to_replace=[9], value=[np.nan], inplace=True, axis=None)

data['Sabe_leer'] = data['Sabe_leer'].map({1:'Si',2:'No',3:'Menor'})
data.Sabe_leer.replace(to_replace=[9], value=[np.nan], inplace=True, axis=None) 

data['Lugar_Nac'] = data['Lugar_Nac'].map({1:'Localidad',2:'Otra_loc',3:'Otra_prov',4:'Pais_limit',5:'Otro_pais'})
data.Lugar_Nac.replace(to_replace=[9], value=[np.nan], inplace=True, axis=None) 

data['NIVEL_ED'] = data['NIVEL_ED'].map({1:'Primaria_I',2:'Primaria_C',3:'Secundaria_I',4:'Secundaria_C',
                                          5:'Univ_I',6:'Univ_C',7:'Sin_Edu'})
data['Trabajo'] = data['Trabajo'].map({1:'Ocupado',2:'Desocupado',3:'Inactivo',4:'Menor'})
data.Trabajo.replace(to_replace=[0], value=[np.nan], inplace=True, axis=None) 

data['CAT_OCUP'] = data['CAT_OCUP'].map({0:'No_empleo',1:'Patron',2:'Cuenta_propia', 3:'Empleado',4:'Sin_sueldo'})

In [9]:
data.head()


Out[9]:
ITF CODUSU NRO_HOGAR PONDERA Parentesco Sexo Edad Estado_Civil Cobertura_Medica Sabe_leer Lugar_Nac NIVEL_ED Trabajo CAT_OCUP CAT_INAC
0 4000 302468 1 1287 Jefe 1 20 Soltero 1 Si Localidad Univ_I Inactivo No_empleo 3
1 4000 302468 1 1287 No-familia 1 20 Soltero 1 Si Localidad Univ_I Inactivo No_empleo 3
2 5800 307861 1 1674 Jefe 0 42 Casado 1 Si Otra_loc Primaria_C Ocupado Empleado 0
3 5800 307861 1 1674 Conyuge 1 44 Casado 1 Si Otra_loc Univ_C Ocupado Empleado 0
4 5800 307861 1 1674 Hijo 0 13 Soltero 1 Si Localidad Secundaria_I Inactivo No_empleo 3

In [10]:
data_type_dict = {'NRO_HOGAR':'nominal','Parentesco':'nominal','Estado_Civil':'nominal','Cobertura_Medica':'nominal',
                 'Sabe_leer':'nominal','Lugar_Nac':'nominal','NIVEL_ED':'nominal','Trabajo':'nominal',
                 'CAT_OCUP':'nominal','CAT_INAC':'nominal'}
dummy_var = dummy_variables(data, data_type_dict)
df = dummy_var[0]
df = df.dropna(axis=0)
weights = ( 1. / df.PONDERA )
df = df.drop('PONDERA',1)

In [11]:
data = data.dropna(axis=0)
g = data.columns.to_series().groupby(data.dtypes).groups
# g

Correlation Matrix


In [12]:
import seaborn as sns
sns.set(context="paper", font="monospace")

corrmat = df.corr()
f, ax = plt.subplots(figsize=(18, 16))
sns.heatmap(corrmat, vmax=.8, square=True)
f.tight_layout()


Linear Regression (WLS)


In [17]:
import statsmodels.api as sm
Y = np.asarray(df.ITF)
x = df.iloc[:,1:]
X = sm.add_constant(x)
wls_model = sm.WLS(Y,X, weights = weights)
results = wls_model.fit()
print(results.summary())


                            WLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.234
Model:                            WLS   Adj. R-squared:                  0.229
Method:                 Least Squares   F-statistic:                     47.02
Date:                Tue, 22 Nov 2016   Prob (F-statistic):               0.00
Time:                        20:29:15   Log-Likelihood:                -80726.
No. Observations:                8360   AIC:                         1.616e+05
Df Residuals:                    8305   BIC:                         1.619e+05
Df Model:                          54                                         
Covariance Type:            nonrobust                                         
===========================================================================================
                              coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------------
const                    5826.1333   1224.044      4.760      0.000      3426.702  8225.565
CODUSU                     -0.0010      0.001     -1.457      0.145        -0.002     0.000
Sexo                     -100.1707     96.517     -1.038      0.299      -289.368    89.027
Edad                       15.9476      4.638      3.438      0.001         6.856    25.039
Parentesco_Conyuge      -1199.2324    217.546     -5.513      0.000     -1625.677  -772.788
Parentesco_Hermano        742.8127    382.742      1.941      0.052        -7.457  1493.083
Parentesco_Hijo          1024.8679    197.354      5.193      0.000       638.004  1411.732
Parentesco_Jefe         -1732.0786    186.608     -9.282      0.000     -2097.876 -1366.281
Parentesco_Madre-padre     40.0888    434.632      0.092      0.927      -811.898   892.075
Parentesco_Nieto         1447.9581    262.831      5.509      0.000       932.743  1963.173
Parentesco_No-familia    1303.7582    493.523      2.642      0.008       336.329  2271.187
Parentesco_Otro          2247.0768    317.440      7.079      0.000      1624.815  2869.339
Parentesco_Suegro        1440.4202    595.034      2.421      0.016       274.005  2606.835
Parentesco_Yerno          510.4617    381.532      1.338      0.181      -237.437  1258.360
NIVEL_ED_Primaria_C       240.5413    211.832      1.136      0.256      -174.703   655.785
NIVEL_ED_Primaria_I       -54.7701    221.447     -0.247      0.805      -488.861   379.321
NIVEL_ED_Secundaria_C     847.6256    210.391      4.029      0.000       435.206  1260.045
NIVEL_ED_Secundaria_I     204.2312    207.314      0.985      0.325      -202.157   610.619
NIVEL_ED_Sin_Edu         -432.3221    444.018     -0.974      0.330     -1302.708   438.064
NIVEL_ED_Univ_C          3434.0095    223.182     15.387      0.000      2996.517  3871.502
NIVEL_ED_Univ_I          1586.8180    224.801      7.059      0.000      1146.151  2027.485
CAT_OCUP_Cuenta_propia    665.2373    324.484      2.050      0.040        29.168  1301.307
CAT_OCUP_Empleado        1414.5072    307.294      4.603      0.000       812.135  2016.880
CAT_OCUP_No_empleo        971.6218    452.087      2.149      0.032        85.418  1857.825
CAT_OCUP_Patron          3251.5074    383.116      8.487      0.000      2500.505  4002.510
CAT_OCUP_Sin_sueldo      -476.7403    583.412     -0.817      0.414     -1620.373   666.892
Lugar_Nac_Localidad       348.7546   1302.506      0.268      0.789     -2204.482  2901.991
Lugar_Nac_Otra_loc        596.4011   1313.032      0.454      0.650     -1977.469  3170.272
Lugar_Nac_Otra_prov       323.8001   1307.388      0.248      0.804     -2239.007  2886.607
Lugar_Nac_Otro_pais       -74.5450   1323.761     -0.056      0.955     -2669.446  2520.356
Lugar_Nac_Pais_limit      224.7158   1318.779      0.170      0.865     -2360.420  2809.851
CAT_INAC_0              -1312.6524   1350.987     -0.972      0.331     -3960.924  1335.620
CAT_INAC_1               -204.5184    376.184     -0.544      0.587      -941.934   532.897
CAT_INAC_2               1827.3711    773.851      2.361      0.018       310.429  3344.313
CAT_INAC_3               1313.4447    361.684      3.631      0.000       604.453  2022.436
CAT_INAC_4                646.6932    368.208      1.756      0.079       -75.087  1368.473
CAT_INAC_5               1596.9203    534.795      2.986      0.003       548.588  2645.252
CAT_INAC_6               1170.2762    681.163      1.718      0.086      -164.973  2505.525
CAT_INAC_7                788.5986    466.132      1.692      0.091      -125.136  1702.333
NRO_HOGAR_1              2258.7454    935.270      2.415      0.016       425.383  4092.108
NRO_HOGAR_2              1180.7591    957.827      1.233      0.218      -696.822  3058.340
NRO_HOGAR_3               690.7813   1299.233      0.532      0.595     -1856.040  3237.602
NRO_HOGAR_4              1695.8475   2989.463      0.567      0.571     -4164.246  7555.941
Estado_Civil_Casado     -1.114e+04   3918.742     -2.842      0.005     -1.88e+04 -3453.569
Estado_Civil_Divorciado -1.273e+04   3920.522     -3.248      0.001     -2.04e+04 -5049.676
Estado_Civil_Soltero    -1.242e+04   3914.628     -3.172      0.002     -2.01e+04 -4742.431
Estado_Civil_Unido      -1.148e+04   3917.285     -2.932      0.003     -1.92e+04 -3805.142
Estado_Civil_Viudo        -1.2e+04   3922.024     -3.059      0.002     -1.97e+04 -4311.279
Sabe_leer_Menor           688.3558   2014.008      0.342      0.733     -3259.603  4636.315
Sabe_leer_No              775.7056   2008.016      0.386      0.699     -3160.507  4711.918
Sabe_leer_Si              454.6599   2028.178      0.224      0.823     -3521.076  4430.395
Cobertura_Medica_1       2060.7729    563.597      3.656      0.000       955.982  3165.564
Cobertura_Medica_2       2920.5225    578.242      5.051      0.000      1787.023  4054.022
Cobertura_Medica_3       -329.7940   1384.642     -0.238      0.812     -3044.039  2384.451
Cobertura_Medica_4         75.4321    565.195      0.133      0.894     -1032.491  1183.355
Cobertura_Medica_9      -1387.5128   1078.448     -1.287      0.198     -3501.541   726.515
Cobertura_Medica_12      2894.0951    636.543      4.547      0.000      1646.312  4141.878
Cobertura_Medica_13      -407.3824   2868.090     -0.142      0.887     -6029.555  5214.790
Trabajo_Desocupado       5027.3380   2604.752      1.930      0.054       -78.627  1.01e+04
Trabajo_Inactivo         3624.4207   1047.113      3.461      0.001      1571.818  5677.024
Trabajo_Menor            3514.3650   1053.467      3.336      0.001      1449.306  5579.424
Trabajo_Ocupado          6586.3323   2614.445      2.519      0.012      1461.368  1.17e+04
==============================================================================
Omnibus:                     6610.476   Durbin-Watson:                   0.783
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           290520.001
Skew:                           3.435   Prob(JB):                         0.00
Kurtosis:                      31.050   Cond. No.                     1.05e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 3.71e-21. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.

Linear Regression (sklearn)


In [69]:
# sk-learn (Y ~ x) with intercept
from sklearn.linear_model import LinearRegression

R_IS=[]
R_OS=[]

n = 10
from sklearn.cross_validation import train_test_split
for i in range(n):
    X_train, X_test, y_train, y_test = train_test_split(x, Y, test_size=0.40)
    
    res=LinearRegression(fit_intercept=False)
    res.fit(X_train,y_train, sample_weight = weights[:len(X_train)])
    R_IS.append(1-((np.asarray(res.predict(X_train))-y_train)**2).sum()/((y_train-np.mean(y_train))**2).sum())
    R_OS.append(1-((np.asarray(res.predict(X_test))-y_test)**2).sum()/((y_test-np.mean(y_test))**2).sum())
    
print("IS R-squared for {} times is {}".format(n,np.mean(R_IS)))
print("OS R-squared for {} times is {}".format(n,np.mean(R_OS)))


IS R-squared for 10 times is 0.237468349016
OS R-squared for 10 times is 0.216176069094

LOGISTIC REGRESSION


In [72]:
import pandas as pd
import numpy as np
import statsmodels as sm
import sklearn as skl
import sklearn.preprocessing as preprocessing
import sklearn.linear_model as linear_model
import sklearn.cross_validation as cross_validation
import sklearn.metrics as metrics
import sklearn.tree as tree
import seaborn as sns

X_train, X_test, y_train, y_test = train_test_split(x, Y, train_size=0.70)
scaler = preprocessing.StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = scaler.transform(X_test)

In [74]:
# Encode the categorical features as numbers
def number_encode_features(df):
    result = df.copy()
    encoders = {}
    for column in result.columns:
        if result.dtypes[column] == np.object:
            encoders[column] = preprocessing.LabelEncoder()
            result[column] = encoders[column].fit_transform(result[column])
    return result, encoders

In [77]:
encoded_data, _ = number_encode_features(df)

In [81]:
cls = linear_model.LogisticRegression()

cls.fit(X_train, y_train)
y_pred = cls.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_pred)

# plt.figure(figsize=(20,20))
# plt.subplot(2,1,1)
# sns.heatmap(cm, annot=True, fmt="d", xticklabels=encoders["Target"].classes_, yticklabels=encoders["Target"].classes_)
# plt.ylabel("Real value")
# plt.xlabel("Predicted value")
# print "F1 score: %f" % skl.metrics.f1_score(y_test, y_pred)
coefs = pd.Series(cls.coef_[0], index=X_train.columns)
coefs.sort()
# ax = plt.subplot(2,1,2)
coefs.plot(kind="bar")
# plt.show()


/usr/local/lib/python2.7/dist-packages/ipykernel/__main__.py:14: FutureWarning: sort is deprecated, use sort_values(inplace=True) for INPLACE sorting
Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f7771038c90>

In [ ]:


In [ ]:


In [ ]:


In [ ]:

RIDGE REGRESSION


In [71]:
#import Quandl
import statsmodels.formula.api as smf
from scipy import stats
from pandas.stats.api import ols
from sklearn import linear_model

In [72]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.ix[:,1:],
                                    df.ITF, test_size = 0.4, random_state = 200)

In [73]:
Ridge=linear_model.Ridge(fit_intercept=True, alpha=7.80899583959)

Ridge.fit(X_train,y_train)
# In the sample:
p_IS=Ridge.predict(X_train)
err_IS=p_IS-y_train
R_2_IS_Ridge=1-np.var(err_IS)/np.var(y_train)
print("The R-squared we found for IS Ridge is: {0}".format(R_2_IS_Ridge))

Ridge_coef=Ridge.coef_

#Out of sample
p_OS=Ridge.predict(X_test)
err_OS=p_OS-y_test
R_2_OS_Ridge=1-np.var(err_OS)/np.var(y_test)
print("The R-squared we found for OS Ridge is: {0}".format(R_2_OS_Ridge))


The R-squared we found for IS Ridge is: 0.255020716457
The R-squared we found for OS Ridge is: 0.220395819262

In [48]:
#select best lambda for Ridge
lambdas = np.exp(np.linspace(-5,13,200))
lambda_r_optimal=Regularization_fit_lambda(1,X_train,y_train,lambdas,p=0.4,Graph=True)
print('Optimal lambda for Ridge={0}'.format(lambda_r_optimal))


Optimal lambda for Ridge=7.80899583959

LASSO REGRESSION


In [74]:
Lasso=linear_model.Lasso(fit_intercept=True,alpha=1)
#try Ridge with a selected regularization parameter lambda

Lasso.fit(X_train,y_train)
# In the sample:
p_IS=Lasso.predict(X_train)
err_IS=p_IS-y_train
R_2_IS_Lasso=1-np.var(err_IS)/np.var(y_train)
print("The R-squared we found for IS Lasso is: {0}".format(R_2_IS_Ridge))

Lasso_coef=Lasso.coef_
#Out of sample
p_OS=Lasso.predict(X_test)
err_OS=p_OS-y_test
R_2_OS_Lasso=1-np.var(err_OS)/np.var(y_test)
print("The R-squared we found for OS Lasso is: {0}".format(R_2_OS_Lasso))


The R-squared we found for IS Lasso is: 0.255020716457
The R-squared we found for OS Lasso is: 0.219265327439

In [ ]:
#select lambdas for Lasso 
lambdas=np.exp(np.linspace(-5,6.5,200))
lambda_l_optimal=Regularization_fit_lambda(2,X_train,y_train,lambdas,p=0.4,Graph=True)
print('Optimal lambda for Lasso={0}'.format(lambda_l_optimal))

PCA

No funciona bien para multivariate regression. Tenemos muchos datos categoricos


In [15]:
n=2
from sklearn.decomposition import PCA
pca = PCA(n)
Xproj = pca.fit_transform(df)
eigenvalues = pca.explained_variance_

plt.figure(2, figsize=(8, 6))
plt.scatter(Xproj[:, 0], Xproj[:, 1], c = X.sum(axis=1), cmap=plt.cm.cool)
plt.xlabel('First Component')
plt.ylabel('Second Component')
plt.show()



In [181]:
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.cluster import KMeans

X_ = np.asarray(df)
range_n_clusters = [2, 3, 4, 5, 6, 7]
for n_clusters in range_n_clusters:
    km = KMeans(n_clusters=n_clusters, random_state=324)
    cluster_labels = km.fit_predict(X_)
    silhouette_avg = silhouette_score(X_, cluster_labels)
    print("For n_clusters = {},".format(n_clusters)+" the average silhouette_score is :{}".format(silhouette_avg))


For n_clusters = 2, the average silhouette_score is :0.734622497892
For n_clusters = 3, the average silhouette_score is :0.711378747975
For n_clusters = 4, the average silhouette_score is :0.695906072063
For n_clusters = 5, the average silhouette_score is :0.682964255797
For n_clusters = 6, the average silhouette_score is :0.666034528316
For n_clusters = 7, the average silhouette_score is :0.651261959551

Kmeans Clustering


In [182]:
from sklearn.cluster import KMeans

n=2
dd=Xproj

km=KMeans(n_clusters=n)
res=km.fit(dd)

with plt.style.context('ggplot'):
    plt.figure(figsize=(6, 5))
    plt.scatter(dd[:, 0], dd[:, 1], c=res.labels_)
    plt.ylabel('X2')
    plt.xlabel('X1')
    plt.xticks(())
    plt.yticks(())
    plt.title("KMeans, K = {} Clusters".format(n))
    plt.show()


Gaussian Mixture


In [183]:
from sklearn.mixture import GaussianMixture

gm=GaussianMixture(n_components=n)
res1=gm.fit(dd)

with plt.style.context('ggplot'):
    plt.figure(figsize=(6,6))
    plt.scatter(dd[:, 0], dd[:, 1], c=res1.predict(dd))
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.xticks(())
    plt.yticks(())
    plt.title("Gaussian Mixture")
    plt.show()


Feature selection

Recursive Feature Elimination


In [184]:
# Feature Extraction with RFE
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load data
X = x
Y = Y
# feature extraction
model = LogisticRegression()
rfe = RFE(model, 3)
fit = rfe.fit(X, Y)
print("Num Features: %d") % fit.n_features_
print("Selected Features: %s") % fit.support_
print("Feature Ranking: %s") % fit.ranking_



KeyboardInterruptTraceback (most recent call last)
<ipython-input-184-8d312837b618> in <module>()
      9 model = LogisticRegression()
     10 rfe = RFE(model, 3)
---> 11 fit = rfe.fit(X, Y)
     12 print("Num Features: %d") % fit.n_features_
     13 print("Selected Features: %s") % fit.support_

/resources/common/.virtualenv/python2/lib/python2.7/site-packages/sklearn/feature_selection/rfe.pyc in fit(self, X, y)
    133             The target values.
    134         """
--> 135         return self._fit(X, y)
    136 
    137     def _fit(self, X, y, step_score=None):

/resources/common/.virtualenv/python2/lib/python2.7/site-packages/sklearn/feature_selection/rfe.pyc in _fit(self, X, y, step_score)
    167                 print("Fitting estimator with %d features." % np.sum(support_))
    168 
--> 169             estimator.fit(X[:, features], y)
    170 
    171             # Get coefs

/resources/common/.virtualenv/python2/lib/python2.7/site-packages/sklearn/linear_model/logistic.pyc in fit(self, X, y, sample_weight)
   1185                 self.class_weight, self.penalty, self.dual, self.verbose,
   1186                 self.max_iter, self.tol, self.random_state,
-> 1187                 sample_weight=sample_weight)
   1188             self.n_iter_ = np.array([n_iter_])
   1189             return self

/resources/common/.virtualenv/python2/lib/python2.7/site-packages/sklearn/svm/base.pyc in _fit_liblinear(X, y, C, fit_intercept, intercept_scaling, class_weight, penalty, dual, verbose, max_iter, tol, random_state, multi_class, loss, epsilon, sample_weight)
    910         X, y_ind, sp.isspmatrix(X), solver_type, tol, bias, C,
    911         class_weight_, max_iter, rnd.randint(np.iinfo('i').max),
--> 912         epsilon, sample_weight)
    913     # Regarding rnd.randint(..) in the above signature:
    914     # seed for srand in range [0..INT_MAX); due to limitations in Numpy

KeyboardInterrupt: 

In [23]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
 
X = x
Y = Y
names = df.columns
 
#use linear regression as the model
lr = LinearRegression()
#rank all features, i.e continue the elimination until the last one
rfe = RFE(lr, n_features_to_select=1)
rfe.fit(X,Y)
 
print "Features sorted by their rank:"
print sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), names))


Features sorted by their rank:
[(1.0, 'CAT_INAC_7'), (2.0, 'NRO_HOGAR_1'), (3.0, 'NRO_HOGAR_2'), (4.0, 'NRO_HOGAR_3'), (5.0, 'Cobertura_Medica_1'), (6.0, 'Cobertura_Medica_9'), (7.0, 'Sabe_leer_Si'), (8.0, 'Cobertura_Medica_4'), (9.0, 'Cobertura_Medica_3'), (10.0, 'Cobertura_Medica_2'), (11.0, 'Cobertura_Medica_12'), (12.0, 'CAT_OCUP_No_empleo'), (13.0, 'CAT_OCUP_Cuenta_propia'), (14.0, 'NIVEL_ED_Univ_I'), (15.0, 'CAT_OCUP_Patron'), (16.0, 'CAT_OCUP_Empleado'), (17.0, 'CAT_INAC_2'), (18.0, 'CAT_INAC_1'), (19.0, 'CAT_INAC_6'), (20.0, 'CAT_INAC_5'), (21.0, 'CAT_INAC_4'), (22.0, 'CAT_INAC_3'), (23.0, 'Lugar_Nac_Pais_limit'), (24.0, 'CAT_INAC_0'), (25.0, 'NIVEL_ED_Sin_Edu'), (26.0, 'NIVEL_ED_Univ_C'), (27.0, 'NIVEL_ED_Primaria_I'), (28.0, 'NIVEL_ED_Secundaria_C'), (29.0, 'Parentesco_Yerno'), (30.0, 'NIVEL_ED_Primaria_C'), (31.0, 'NIVEL_ED_Secundaria_I'), (32.0, 'Trabajo_Inactivo'), (33.0, 'Trabajo_Desocupado'), (34.0, 'Parentesco_Hijo'), (35.0, 'Edad'), (36.0, 'Parentesco_Jefe'), (37.0, 'Parentesco_Conyuge'), (38.0, 'Parentesco_Hermano'), (39.0, 'Parentesco_Suegro'), (40.0, 'Parentesco_Madre-padre'), (41.0, 'Parentesco_Nieto'), (42.0, 'Parentesco_Otro'), (43.0, 'Parentesco_No-familia'), (44.0, 'Estado_Civil_Casado'), (45.0, 'Estado_Civil_Divorciado'), (46.0, 'Estado_Civil_Unido'), (47.0, 'Estado_Civil_Soltero'), (48.0, 'NRO_HOGAR_4'), (49.0, 'Trabajo_Menor'), (50.0, 'Cobertura_Medica_13'), (51.0, 'Sabe_leer_Menor'), (52.0, 'Estado_Civil_Viudo'), (53.0, 'Sabe_leer_No'), (54.0, 'Lugar_Nac_Localidad'), (55.0, 'CAT_OCUP_Sin_sueldo'), (56.0, 'Lugar_Nac_Otra_loc'), (57.0, 'Lugar_Nac_Otro_pais'), (58.0, 'CODUSU'), (59.0, 'Sexo'), (60.0, 'Lugar_Nac_Otra_prov'), (61.0, 'ITF')]

Univariate Selection


In [24]:
# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
import pandas
import numpy
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# load data
names = df.columns
X = x
Y = Y
# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, Y)
# summarize scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])



ValueErrorTraceback (most recent call last)
<ipython-input-24-4775e1d976d1> in <module>()
     10 # feature extraction
     11 test = SelectKBest(score_func=chi2, k=4)
---> 12 fit = test.fit(X, Y)
     13 # summarize scores
     14 numpy.set_printoptions(precision=3)

/resources/common/.virtualenv/python2/lib/python2.7/site-packages/sklearn/feature_selection/univariate_selection.pyc in fit(self, X, y)
    328 
    329         self._check_params(X, y)
--> 330         score_func_ret = self.score_func(X, y)
    331         if isinstance(score_func_ret, (list, tuple)):
    332             self.scores_, self.pvalues_ = score_func_ret

/resources/common/.virtualenv/python2/lib/python2.7/site-packages/sklearn/feature_selection/univariate_selection.pyc in chi2(X, y)
    213     X = check_array(X, accept_sparse='csr')
    214     if np.any((X.data if issparse(X) else X) < 0):
--> 215         raise ValueError("Input X must be non-negative.")
    216 
    217     Y = LabelBinarizer().fit_transform(y)

ValueError: Input X must be non-negative.

Stability selection


In [25]:
from sklearn.linear_model import RandomizedLasso
 
#using the Boston housing data. 
#Data gets scaled automatically by sklearn's implementation
X = x
Y = Y
names = df.columns
 
rlasso = RandomizedLasso(alpha=0.025)
rlasso.fit(X, Y)
 
print "Features sorted by their score:"
print sorted(zip(map(lambda x: round(x, 4), rlasso.scores_), 
                 names), reverse=True)


Features sorted by their score:
[(1.0, 'Sexo'), (1.0, 'Parentesco_No-familia'), (1.0, 'Parentesco_Madre-padre'), (1.0, 'Parentesco_Hijo'), (1.0, 'NIVEL_ED_Univ_C'), (1.0, 'NIVEL_ED_Sin_Edu'), (1.0, 'NIVEL_ED_Primaria_C'), (1.0, 'Lugar_Nac_Otra_prov'), (1.0, 'Estado_Civil_Divorciado'), (1.0, 'Estado_Civil_Casado'), (1.0, 'Cobertura_Medica_9'), (1.0, 'Cobertura_Medica_13'), (1.0, 'Cobertura_Medica_1'), (1.0, 'CAT_OCUP_Patron'), (1.0, 'CAT_OCUP_No_empleo'), (1.0, 'CAT_INAC_0'), (0.995, 'CODUSU'), (0.985, 'NIVEL_ED_Secundaria_I'), (0.985, 'CAT_INAC_5'), (0.98, 'Parentesco_Nieto'), (0.98, 'Lugar_Nac_Localidad'), (0.98, 'Cobertura_Medica_2'), (0.975, 'Sabe_leer_No'), (0.965, 'Lugar_Nac_Otro_pais'), (0.965, 'ITF'), (0.96, 'CAT_INAC_1'), (0.95, 'NRO_HOGAR_2'), (0.945, 'Cobertura_Medica_4'), (0.935, 'NRO_HOGAR_4'), (0.93, 'Parentesco_Otro'), (0.93, 'Edad'), (0.925, 'NIVEL_ED_Primaria_I'), (0.91, 'Cobertura_Medica_3'), (0.89, 'CAT_INAC_3'), (0.875, 'Estado_Civil_Unido'), (0.87, 'Trabajo_Menor'), (0.84, 'NIVEL_ED_Univ_I'), (0.815, 'CAT_INAC_6'), (0.81, 'CAT_OCUP_Cuenta_propia'), (0.795, 'Parentesco_Conyuge'), (0.78, 'Parentesco_Hermano'), (0.775, 'Parentesco_Jefe'), (0.76, 'CAT_INAC_7'), (0.735, 'Cobertura_Medica_12'), (0.715, 'Lugar_Nac_Otra_loc'), (0.7, 'CAT_INAC_4'), (0.695, 'Parentesco_Suegro'), (0.665, 'Trabajo_Inactivo'), (0.61, 'NRO_HOGAR_3'), (0.605, 'NIVEL_ED_Secundaria_C'), (0.525, 'Estado_Civil_Viudo'), (0.475, 'CAT_INAC_2'), (0.46, 'Sabe_leer_Menor'), (0.39, 'Parentesco_Yerno'), (0.305, 'CAT_OCUP_Sin_sueldo'), (0.275, 'Trabajo_Desocupado'), (0.265, 'NRO_HOGAR_1'), (0.24, 'CAT_OCUP_Empleado'), (0.235, 'Estado_Civil_Soltero'), (0.155, 'Lugar_Nac_Pais_limit'), (0.14, 'Sabe_leer_Si')]

Example: running the methods side by side


In [67]:
# !pip install minepy

from sklearn.linear_model import (LinearRegression, Ridge, 
                                  Lasso, RandomizedLasso)
from sklearn.feature_selection import RFE, f_regression
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from minepy import MINE
 
X = x.values
Y = Y
    
names = df.columns
 
ranks = {}
 
def rank_to_dict(ranks, names, order=1):
    minmax = MinMaxScaler()
    ranks = minmax.fit_transform(order*np.array([ranks]).T).T[0]
    ranks = map(lambda x: round(x, 2), ranks)
    return dict(zip(names, ranks ))
 
lr = LinearRegression(normalize=True)
lr.fit(X, Y)
ranks["Linear reg"] = rank_to_dict(np.abs(lr.coef_), names)
 
ridge = Ridge(alpha=7)
ridge.fit(X, Y)
ranks["Ridge"] = rank_to_dict(np.abs(ridge.coef_), names)
 
 
lasso = Lasso(alpha=.05)
lasso.fit(X, Y)
ranks["Lasso"] = rank_to_dict(np.abs(lasso.coef_), names)
 
 
rlasso = RandomizedLasso(alpha=0.04)
rlasso.fit(X, Y)
ranks["Stability"] = rank_to_dict(np.abs(rlasso.scores_), names)
 
#stop the search when 5 features are left (they will get equal scores)
rfe = RFE(lr, n_features_to_select=5)
rfe.fit(X,Y)
ranks["RFE"] = rank_to_dict(map(float, rfe.ranking_), names, order=-1)
 
rf = RandomForestRegressor()
rf.fit(X,Y)
ranks["RF"] = rank_to_dict(rf.feature_importances_, names)
 
 
f, pval  = f_regression(X, Y, center=True)
ranks["Corr."] = rank_to_dict(f, names)
 
mine = MINE()
mic_scores = []
for i in range(X.shape[1]):
    mine.compute_score(X[:,i], Y)
    m = mine.mic()
    mic_scores.append(m)
 
ranks["MIC"] = rank_to_dict(mic_scores, names) 
 
 
r = {}
for name in names[:-1]:
    r[name] = round(np.mean([ranks[method][name] 
                             for method in ranks.keys()]), 2)
 
methods = sorted(ranks.keys())
ranks["Mean"] = r
methods.append("Mean")
 
feat_ranking = pd.DataFrame(ranks)
cols = feat_ranking.columns.tolist()
cols.insert(0, cols.pop(cols.index('Mean')))
feat_ranking = feat_ranking.ix[:, cols]
feat_ranking.sort_values(['Mean'], ascending=False)


Out[67]:
Mean Corr. Lasso Linear reg MIC RF RFE Ridge Stability
NIVEL_ED_Sin_Edu 0.69 1.00 0.79 0.16 0.49 0.25 0.79 1.00 1.00
Cobertura_Medica_3 0.57 0.86 0.13 0.04 1.00 0.13 0.96 0.56 0.90
Cobertura_Medica_1 0.52 0.30 0.81 0.04 0.46 0.03 0.89 0.64 1.00
CAT_OCUP_No_empleo 0.48 0.17 0.71 0.12 0.16 0.06 0.80 0.81 1.00
CAT_INAC_7 0.45 0.04 0.30 1.00 0.25 0.00 1.00 0.28 0.70
Parentesco_Hijo 0.43 0.06 0.63 0.03 0.15 0.04 0.66 0.88 1.00
Cobertura_Medica_9 0.43 0.04 0.73 0.04 0.21 0.01 0.91 0.51 1.00
NRO_HOGAR_2 0.41 0.01 0.14 1.00 0.03 0.00 1.00 0.18 0.89
CAT_OCUP_Patron 0.39 0.00 0.50 0.12 0.06 0.00 0.86 0.60 1.00
Trabajo_Menor 0.38 0.48 0.75 0.00 0.33 0.05 0.21 0.38 0.84
Parentesco_No-familia 0.38 0.01 0.60 0.03 0.25 0.01 0.50 0.65 1.00
Estado_Civil_Casado 0.37 0.03 1.00 0.00 0.17 0.02 0.30 0.45 1.00
NIVEL_ED_Univ_C 0.36 0.14 0.25 0.16 0.25 0.04 0.77 0.31 1.00
NIVEL_ED_Secundaria_I 0.36 0.03 0.37 0.16 0.17 0.01 0.75 0.43 0.93
NIVEL_ED_Primaria_C 0.36 0.13 0.27 0.16 0.23 0.02 0.70 0.35 1.00
Edad 0.36 0.01 0.46 0.03 0.11 0.02 0.64 0.65 0.92
CAT_INAC_0 0.36 0.15 0.37 0.15 0.40 0.01 0.34 0.43 1.00
ITF 0.35 0.00 0.00 0.00 0.92 1.00 0.00 0.00 0.91
Cobertura_Medica_4 0.34 0.00 0.34 0.04 0.04 0.00 0.95 0.38 0.95
Sabe_leer_Si 0.34 0.29 0.48 0.04 0.72 0.03 0.93 0.21 0.00
Estado_Civil_Divorciado 0.34 0.00 0.90 0.00 0.16 0.02 0.29 0.32 1.00
Cobertura_Medica_2 0.33 0.00 0.24 0.04 0.03 0.00 0.98 0.34 0.98
CAT_OCUP_Cuenta_propia 0.33 0.32 0.13 0.12 0.33 0.03 0.82 0.09 0.79
Parentesco_Madre-padre 0.32 0.00 0.36 0.03 0.31 0.01 0.52 0.36 1.00
NRO_HOGAR_1 0.32 0.03 0.00 1.00 0.24 0.00 1.00 0.11 0.21
NRO_HOGAR_3 0.31 0.00 0.04 1.00 0.00 0.00 1.00 0.01 0.42
NRO_HOGAR_4 0.29 0.08 0.51 0.00 0.34 0.04 0.23 0.17 0.96
Parentesco_Nieto 0.29 0.00 0.34 0.03 0.11 0.00 0.55 0.31 0.96
Estado_Civil_Unido 0.27 0.06 0.75 0.00 0.25 0.01 0.27 0.14 0.70
Parentesco_Otro 0.27 0.00 0.31 0.03 0.07 0.00 0.54 0.27 0.97
... ... ... ... ... ... ... ... ... ...
NIVEL_ED_Primaria_I 0.26 0.00 0.01 0.16 0.23 0.02 0.73 0.01 0.92
Parentesco_Yerno 0.26 0.17 0.19 0.16 0.28 0.02 0.68 0.25 0.29
Sabe_leer_No 0.26 0.02 0.60 0.00 0.18 0.01 0.14 0.26 0.87
NIVEL_ED_Univ_I 0.26 0.01 0.08 0.12 0.16 0.02 0.84 0.19 0.69
CAT_INAC_1 0.26 0.00 0.22 0.15 0.04 0.00 0.48 0.25 0.91
Parentesco_Hermano 0.25 0.01 0.22 0.03 0.16 0.03 0.59 0.19 0.74
CAT_OCUP_Empleado 0.24 0.38 0.01 0.12 0.30 0.01 0.88 0.10 0.15
Cobertura_Medica_12 0.24 0.00 0.12 0.04 0.00 0.00 1.00 0.09 0.70
Cobertura_Medica_13 0.24 0.05 0.28 0.00 0.18 0.01 0.20 0.22 1.00
CAT_INAC_3 0.24 0.07 0.11 0.15 0.16 0.00 0.39 0.11 0.89
Sexo 0.23 0.00 0.01 0.00 0.23 0.57 0.02 0.01 1.00
Parentesco_Jefe 0.23 0.00 0.10 0.03 0.11 0.01 0.61 0.19 0.80
Lugar_Nac_Otra_prov 0.22 0.02 0.19 0.00 0.25 0.01 0.11 0.17 0.98
CAT_INAC_5 0.22 0.00 0.06 0.15 0.06 0.01 0.46 0.09 0.92
Lugar_Nac_Pais_limit 0.21 0.35 0.50 0.08 0.30 0.00 0.32 0.06 0.06
Parentesco_Conyuge 0.19 0.00 0.01 0.03 0.14 0.01 0.63 0.07 0.66
CAT_INAC_6 0.19 0.01 0.05 0.15 0.13 0.01 0.41 0.04 0.74
Parentesco_Suegro 0.19 0.01 0.07 0.03 0.13 0.01 0.57 0.02 0.67
Lugar_Nac_Otro_pais 0.19 0.04 0.13 0.00 0.30 0.01 0.04 0.10 0.90
CAT_INAC_4 0.19 0.02 0.16 0.15 0.17 0.00 0.45 0.18 0.38
Estado_Civil_Viudo 0.18 0.01 0.52 0.00 0.12 0.00 0.16 0.15 0.48
CAT_INAC_2 0.18 0.01 0.08 0.15 0.19 0.01 0.43 0.13 0.43
Estado_Civil_Soltero 0.18 0.01 0.62 0.00 0.28 0.02 0.25 0.02 0.26
Lugar_Nac_Localidad 0.18 0.00 0.01 0.00 0.26 0.03 0.13 0.05 0.97
CODUSU 0.17 0.01 0.04 0.00 0.11 0.08 0.09 0.05 0.99
Trabajo_Inactivo 0.17 0.04 0.03 0.07 0.19 0.00 0.38 0.01 0.67
Lugar_Nac_Otra_loc 0.16 0.05 0.11 0.00 0.27 0.05 0.05 0.07 0.66
Sabe_leer_Menor 0.15 0.02 0.51 0.00 0.16 0.01 0.18 0.15 0.20
Trabajo_Desocupado 0.14 0.22 0.07 0.07 0.28 0.01 0.36 0.05 0.03
CAT_OCUP_Sin_sueldo 0.10 0.09 0.09 0.00 0.31 0.03 0.07 0.04 0.19

61 rows × 9 columns