Reference:

Variables in the 2010 Census (INDEC)

  • VIVIENDA.INCALCONS Construction quality of the dwelling
  • VIVIENDA.INCALSERV Quality of connections to basic services
  • VIVIENDA.INMAT Quality of building materials
  • VIVIENDA.TIPVV Dwelling type (grouped)
  • VIVIENDA.TOTHOG Number of households in the dwelling
  • VIVIENDA.URP Urban / rural area
  • VIVIENDA.V00 Type of collective dwelling
  • VIVIENDA.V01 Type of private dwelling
  • VIVIENDA.V02 Occupancy status
  • HOGAR.ALGUNBI At least one NBI (unsatisfied basic needs) indicator
  • HOGAR.H05 Predominant floor material
  • HOGAR.H06 Predominant material of the outer roof covering
  • HOGAR.H07 Interior roof lining or ceiling
  • HOGAR.H08 Water supply within the dwelling
  • HOGAR.H09 Source of water for drinking and cooking
  • HOGAR.H10 Has a bathroom / latrine
  • HOGAR.H11 Toilet has a flush button, chain or tank
  • HOGAR.H12 Toilet drainage
  • HOGAR.H13 Bathroom / latrine for exclusive use
  • HOGAR.H14 Main fuel used for cooking
  • HOGAR.H15 Total rooms used for sleeping
  • HOGAR.H19A Refrigerator
  • HOGAR.H19B Computer
  • HOGAR.H19C Mobile phone
  • HOGAR.H19D Landline phone
  • HOGAR.INDHAC Overcrowding
  • HOGAR.NHOG Household number within the dwelling
  • HOGAR.PROP Tenure regime
  • HOGAR.TOTPERS Total persons in the household
  • PERSONA.CONDACT Activity status
  • PERSONA.EDADAGRU Age in broad groups
  • PERSONA.EDADQUI Five-year age groups
  • PERSONA.P01 Relationship to the household head
  • PERSONA.P02 Sex
  • PERSONA.P03 Age
  • PERSONA.P05 Country of birth
  • PERSONA.P07 Can read and write
  • PERSONA.P08 School attendance status
  • PERSONA.P09 Educational level attending or attained
  • PERSONA.P10 Completed the level
  • PERSONA.P12 Uses a computer
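
It can help to keep this data dictionary around in code for labeling columns later. A minimal sketch, where the lookup table and the census_label helper are hypothetical and only cover a handful of the codes listed above:


In [ ]:
# Hypothetical lookup table for a few of the census codes above,
# grouped by record level (dwelling / household / person).
CENSO_2010_LABELS = {
    'VIVIENDA': {'INCALCONS': 'Construction quality of the dwelling',
                 'TOTHOG': 'Number of households in the dwelling',
                 'URP': 'Urban / rural area'},
    'HOGAR': {'ALGUNBI': 'At least one NBI indicator',
              'INDHAC': 'Overcrowding',
              'PROP': 'Tenure regime'},
    'PERSONA': {'P02': 'Sex',
                'P03': 'Age',
                'CONDACT': 'Activity status'},
}

def census_label(level, code):
    """Return the human-readable label for a LEVEL.CODE pair, or the code itself."""
    return CENSO_2010_LABELS.get(level, {}).get(code, code)

print(census_label('PERSONA', 'P03'))   # Age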

Required libraries


In [1]:
import pandas as pd
import numpy as np
import os
import sys
import simpledbf
%pylab inline
import matplotlib.pyplot as plt


Populating the interactive namespace from numpy and matplotlib

Functions


In [2]:
def getEPHdbf(censusstring):
    print ("Downloading", censusstring)
    ### First I will check that it is not already there
    if not os.path.isfile("data/Individual_" + censusstring + ".DBF"):
        if os.path.isfile('Individual_' + censusstring + ".DBF"):
            # if in the current dir just move it
            if os.system("mv " + 'Individual_' + censusstring + ".DBF " + "data/"):
                print ("Error moving file!, Please check!")
        # otherwise start looking for the zip file
        else:
            if not os.path.isfile("data/" + censusstring + "_dbf.zip"):
                if not os.path.isfile(censusstring + "_dbf.zip"):
                    os.system(
                        "curl -O http://www.indec.gob.ar/ftp/cuadros/menusuperior/eph/" + censusstring + "_dbf.zip")
                ###  To move it I use the os.system() function to run bash commands with arguments
                os.system("mv " + censusstring + "_dbf.zip " + "data/")
            ### unzip the DBF into data/
            os.system("unzip " + "data/" + censusstring + "_dbf.zip -d data/")

    if not os.path.isfile("data/" + 'Individual_' + censusstring + ".dbf"):
        print ("WARNING!!! something is wrong: the file is not there!")

    else:
        print ("file in place, creating CSV file")

    trimestre = censusstring

    dbf = simpledbf.Dbf5('data/Individual_' + trimestre + '.dbf',codec='latin1')
    df_def = dbf.to_dataframe()

#     df_def_i = df_def.loc[df_def.REGION == 1, ['CODUSU','NRO_HOGAR','PONDERA','CH03','CH04',
#                                             'CH06','CH07','CH08','CH09','CH12','CH13',
#                                             'CH15','CH16','NIVEL_ED','ESTADO','CAT_OCUP','CAT_INAC',
# #                                             'PP02C2','PP02C3','PP02C4','PP02C5','PP02C6','PP02C7','PP02C8',
# #                                             'PP02E','PP02H','PP02I','PP03C','PP03D','PP3E_TOT','PP3F_TOT',
# #                                             'PP03G','PP03H','PP03I','PP03J','INTENSI','PP04A',
# #                                             'PP04B1','PP04B2',
# #                                             'PP04C','PP04C99','PP04G','PP05C_1','PP05C_2',
# #                                             'PP05C_3','PP05E','PP05F','PP05H','PP06A','PP06C','PP06D',
# #                                             'PP06E','PP06H','PP07A','PP07C','PP07D','PP07E','PP07F1',
# #                                             'PP07F2','PP07F3','PP07F4','PP07F5','PP07G1','PP07G2','PP07G3',
# #                                             'PP07G4','PP07G_59','PP07H','PP07I','PP07J','PP07K','PP08D1',
# #                                             'PP08D4','PP08F1','PP08F2','PP08J1','PP08J2','PP08J3','PP09A',
# #                                             'PP10A','PP10C','PP10D',
# #                                             'PP10E','PP11A','PP11B1','PP11C','PP11C99',
# #                                             'PP11L','PP11L1','PP11M','PP11N','PP11O','PP11P','PP11Q','PP11R',
# #                                             'PP11S','PP11T','P21',
# #                                             'V2_M','V3_M','V4_M','V5_M','V8_M','V9_M','V10_M',
# #                                             'V11_M','V12_M','V18_M','V19_AM','V21_M',
#                                                'ITF']]

    df_def_i = df_def.loc[df_def.REGION == 1,
                            ['CODUSU',
                            'NRO_HOGAR',
                            'PONDERA',
                            'CH03',
                            'CH04',
                            'CH06',
                            'CH07',
                            'CH08',
                            'CH09',
                            'CH15',
                            'NIVEL_ED',
                            'ESTADO',
                            'CAT_OCUP',
                            'CAT_INAC',
                            'ITF']]

    df_def_i.columns = ['CODUSU',
                            'NRO_HOGAR',
                            'PONDERA',
                            'Parentesco',
                            'Sexo',
                            'Edad',
                            'Estado_Civil',
                            'Cobertura_Medica',
                            'Sabe_leer',
                            'Lugar_Nac',
                            'NIVEL_ED',
                            'Trabajo',
                            'CAT_OCUP',
                            'CAT_INAC',
                            'Y']
    
    
    df_def_i.index =range(0,df_def_i.shape[0])

    df_def_i.to_csv('clean_' + trimestre + '.csv', index = False, encoding='utf-8')
    
    print('csv file clean_' + trimestre + '.csv successfully created in the working directory')
    return

In [3]:
def dummy_variables(data, data_type_dict):
    #Loop over nominal variables.
    for variable in filter(lambda x: data_type_dict[x]=='nominal',
                           data_type_dict.keys()):
 
        #First we create the columns with dummy variables.
        #Note that the argument 'prefix' means the column names will be
        #prefix_value for each unique value in the original column, so
        #we set the prefix to be the name of the original variable.
        dummy_df=pd.get_dummies(data[variable], prefix=variable)
 
        #Remove old variable from dictionary.
        data_type_dict.pop(variable)
 
        #Add new dummy variables to dictionary.
        for dummy_variable in dummy_df.columns:
            data_type_dict[dummy_variable] = 'nominal'
 
        #Add dummy variables to main df.
        data=data.drop(variable, axis=1)
        data=data.join(dummy_df)
 
    return [data, data_type_dict]

In [4]:
def Regularization_fit_lambda(model,X_train,y_train,lambdas,p=0.4,Graph=False, logl=False):
    #model = 1-Ridge, 2-Lasso
    #lambdas: a list of lambda values to try
    #p: ratio of the validation sample size / total training size
    #Graph: plot the graph of R^2 values for different lambda

    R_2_OS=[]
    X_train0, X_valid, y_train0, y_valid = train_test_split(X_train,
                                    y_train, test_size = p, random_state = 200)

    if model==1:
        RM = lambda a: linear_model.Ridge(fit_intercept=True, alpha=a)
        model_label='Ridge'
    else:
        RM = lambda a: linear_model.Lasso(fit_intercept=True, alpha=a)
        model_label='Lasso'
    
    best_R2 = -1
    best_lambda = lambdas[0]
    
    for i in lambdas:
        lm = RM(i)
        lm.fit(X_train0,y_train0)  #fit the regularization model
        y_predict=lm.predict(X_valid) #compute the prediction for the validation sample 
        err_OS=y_predict-y_valid
        R_2_OS_=1-np.var(err_OS)/np.var(y_valid)
        R_2_OS.append(R_2_OS_)
        if R_2_OS_ > best_R2:
            best_R2 = R_2_OS_
            best_lambda = i
    
    if Graph==True:
        plt.title('OS R-squared for different Lambda')
        if logl:
            plt.xlabel('ln(Lambda)')
            l=log(lambdas)
            bl=log(best_lambda)
        else:
            plt.xlabel('Lambda')
            l=lambdas
            bl=best_lambda
        plt.plot(l,R_2_OS,'b',label=model_label)
        plt.legend(loc='upper right')
        plt.ylabel('R-squared')
        plt.axvline(bl,color='r',linestyle='--')

        plt.show()
    
    return best_lambda
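
Inside the loop, each candidate lambda is scored by the out-of-sample R-squared on the held-out split, R²_OS = 1 − Var(ŷ − y_valid) / Var(y_valid), and the lambda with the highest score wins. A self-contained toy illustration of that criterion (the numbers are made up):


In [ ]:
# Toy illustration of the selection criterion used above (made-up numbers).
import numpy as np

y_valid   = np.array([1.0, 2.0, 3.0, 4.0])
y_predict = np.array([1.1, 1.9, 3.2, 3.8])   # hypothetical validation predictions

err_OS = y_predict - y_valid
R_2_OS = 1 - np.var(err_OS) / np.var(y_valid)
print(R_2_OS)   # close to 1 when the predictions track the held-out values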

Download data


In [13]:
getEPHdbf('t310')


('Downloading', 't310')
file in place, creating CSV file
csv file clean_ t310 .csv successfully created in folder /data

In [106]:
# @hidden_cell
from io import StringIO
import requests
import json
import pandas as pd


Out[106]:
CODUSU NRO_HOGAR PONDERA Parentesco Sexo Edad Estado_Civil Cobertura_Medica Sabe_leer Lugar_Nac NIVEL_ED Trabajo CAT_OCUP CAT_INAC Y
0 302468 1 1287 1 2 20 5 1 1 1 5 3 0 3 4000
1 302468 1 1287 10 2 20 5 1 1 1 5 3 0 3 4000
2 307861 1 1674 1 1 42 2 1 1 2 2 1 3 0 5800
3 307861 1 1674 2 2 44 2 1 1 2 6 1 3 0 5800
4 307861 1 1674 3 1 13 5 1 1 1 3 3 0 3 5800

Data cleaning



In [69]:
data = pd.read_csv('clean_t310.csv')
cols = data.columns.tolist()
cols = cols[-1:] + cols[:-1]
data = data[cols]
data.head()


Out[69]:
Y CODUSU NRO_HOGAR PONDERA Parentesco Sexo Edad Estado_Civil Cobertura_Medica Sabe_leer Lugar_Nac NIVEL_ED Trabajo CAT_OCUP CAT_INAC
0 4000 302468 1 1287 1 2 20 5 1 1 1 5 3 0 3
1 4000 302468 1 1287 10 2 20 5 1 1 1 5 3 0 3
2 5800 307861 1 1674 1 1 42 2 1 1 2 2 1 3 0
3 5800 307861 1 1674 2 2 44 2 1 1 2 6 1 3 0
4 5800 307861 1 1674 3 1 13 5 1 1 1 3 3 0 3

In [70]:
data['Parentesco'] = data['Parentesco'].map({1:'Jefe', 2:'Conyuge', 3:'Hijo',4:'Yerno',5:'Nieto', 6:'Madre_Padre',
                                             7:'Suegro', 8:'Hermano',9:'Otro', 10:'No_Familia'})
data['Sexo'] = data['Sexo'].map({1:0,2:1})

data['Estado_Civil'] = data['Estado_Civil'].map({1:'Unido',2:'Casado',3:'Divorciado',4:'Viudo',5:'Soltero'})
data.Estado_Civil.replace(to_replace=[9], value=[np.nan], inplace=True, axis=None)

data['Sabe_leer'] = data['Sabe_leer'].map({1:'Si',2:'No',3:'Menor'})
data.Sabe_leer.replace(to_replace=[9], value=[np.nan], inplace=True, axis=None) 

data['Lugar_Nac'] = data['Lugar_Nac'].map({1:'Localidad',2:'Otra_loc',3:'Otra_prov',4:'Pais_limit',5:'Otro_pais'})
data.Lugar_Nac.replace(to_replace=[9], value=[np.nan], inplace=True, axis=None) 

data['NIVEL_ED'] = data['NIVEL_ED'].map({1:'Primaria_I',2:'Primaria_C',3:'Secundaria_I',4:'Secundaria_C',
                                          5:'Univ_I',6:'Univ_C',7:'Sin_Edu'})
data['Trabajo'] = data['Trabajo'].map({1:'Ocupado',2:'Desocupado',3:'Inactivo',4:'Menor'})
data.Trabajo.replace(to_replace=[0], value=[np.nan], inplace=True, axis=None) 

data['CAT_OCUP'] = data['CAT_OCUP'].map({0:'No_empleo',1:'Patron',2:'Cuenta_propia', 3:'Empleado',4:'Sin_sueldo'})

In [71]:
data.head()


Out[71]:
Y CODUSU NRO_HOGAR PONDERA Parentesco Sexo Edad Estado_Civil Cobertura_Medica Sabe_leer Lugar_Nac NIVEL_ED Trabajo CAT_OCUP CAT_INAC
0 4000 302468 1 1287 Jefe 1 20 Soltero 1 Si Localidad Univ_I Inactivo No_empleo 3
1 4000 302468 1 1287 No_Familia 1 20 Soltero 1 Si Localidad Univ_I Inactivo No_empleo 3
2 5800 307861 1 1674 Jefe 0 42 Casado 1 Si Otra_loc Primaria_C Ocupado Empleado 0
3 5800 307861 1 1674 Conyuge 1 44 Casado 1 Si Otra_loc Univ_C Ocupado Empleado 0
4 5800 307861 1 1674 Hijo 0 13 Soltero 1 Si Localidad Secundaria_I Inactivo No_empleo 3

In [72]:
data_type_dict = {'NRO_HOGAR':'nominal','Parentesco':'nominal','Estado_Civil':'nominal','Cobertura_Medica':'nominal',
                 'Sabe_leer':'nominal','Lugar_Nac':'nominal','NIVEL_ED':'nominal','Trabajo':'nominal',
                 'CAT_OCUP':'nominal','CAT_INAC':'nominal'}
dummy_var = dummy_variables(data, data_type_dict)
df = dummy_var[0]
df = df.dropna(axis=0)
df = df[df.Y != 0]
weights = ( 1. / df.PONDERA )
df = df.drop(['PONDERA', 'Edad', 'CODUSU'],1)

In [36]:
# Get quartiles into bin

#q = pd.cut(df.Y, 5, labels=False) # returns categorical
# q = pd.qcut(df.Y.values, 5).codes # returns an array of bins

In [73]:
df.Y = np.log(df.Y)
df.Y.plot(kind = 'density')
df.head()


Out[73]:
Y Sexo Parentesco_Conyuge Parentesco_Hermano Parentesco_Hijo Parentesco_Jefe Parentesco_Madre_Padre Parentesco_Nieto Parentesco_No_Familia Parentesco_Otro ... Cobertura_Medica_2 Cobertura_Medica_3 Cobertura_Medica_4 Cobertura_Medica_9 Cobertura_Medica_12 Cobertura_Medica_13 Trabajo_Desocupado Trabajo_Inactivo Trabajo_Menor Trabajo_Ocupado
0 8.294050 1 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1 8.294050 1 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
2 8.665613 0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
3 8.665613 1 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
4 8.665613 0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0

5 rows × 60 columns


Correlation Matrix


In [53]:
import seaborn as sns
sns.set(context="paper", font="monospace")

corrmat = df.corr()
f, ax = plt.subplots(figsize=(18, 16))
sns.heatmap(corrmat, vmax=.8, square=True)
f.tight_layout()


Linear Regression (WLS)


In [74]:
# get percentile as int

# from scipy import stats
# rankdata = stats.rankdata(Y, "average")/len(Y)

Y = df.Y
X = df.ix[:,1:]

In [75]:
import statsmodels.api as sm
X1 = sm.add_constant(X)
wls_model = sm.WLS(Y,X1, weights = weights)
results = wls_model.fit()
print(results.summary())


                            WLS Regression Results                            
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.303
Model:                            WLS   Adj. R-squared:                  0.299
Method:                 Least Squares   F-statistic:                     69.14
Date:                Wed, 30 Nov 2016   Prob (F-statistic):               0.00
Time:                        11:47:07   Log-Likelihood:                -8221.9
No. Observations:                8323   AIC:                         1.655e+04
Df Residuals:                    8270   BIC:                         1.692e+04
Df Model:                          52                                         
Covariance Type:            nonrobust                                         
===========================================================================================
                              coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------------------
const                       5.1763      0.230     22.514      0.000         4.726     5.627
Sexo                       -0.0270      0.017     -1.626      0.104        -0.060     0.006
Parentesco_Conyuge          0.0927      0.039      2.406      0.016         0.017     0.168
Parentesco_Hermano          0.6015      0.066      9.050      0.000         0.471     0.732
Parentesco_Hijo             0.5492      0.035     15.762      0.000         0.481     0.617
Parentesco_Jefe            -0.0119      0.033     -0.361      0.718        -0.077     0.053
Parentesco_Madre_Padre      0.4801      0.075      6.428      0.000         0.334     0.626
Parentesco_Nieto            0.7297      0.046     15.929      0.000         0.640     0.819
Parentesco_No_Familia       0.5643      0.085      6.615      0.000         0.397     0.732
Parentesco_Otro             0.8665      0.055     15.673      0.000         0.758     0.975
Parentesco_Suegro           0.8560      0.102      8.359      0.000         0.655     1.057
Parentesco_Yerno            0.4483      0.066      6.775      0.000         0.319     0.578
NIVEL_ED_Primaria_C         0.6608      0.039     17.093      0.000         0.585     0.737
NIVEL_ED_Primaria_I         0.5445      0.040     13.504      0.000         0.465     0.624
NIVEL_ED_Secundaria_C       0.7929      0.039     20.555      0.000         0.717     0.869
NIVEL_ED_Secundaria_I       0.6567      0.038     17.268      0.000         0.582     0.731
NIVEL_ED_Sin_Edu            0.4866      0.077      6.284      0.000         0.335     0.638
NIVEL_ED_Univ_C             1.1426      0.041     28.076      0.000         1.063     1.222
NIVEL_ED_Univ_I             0.8923      0.041     21.864      0.000         0.812     0.972
CAT_OCUP_Cuenta_propia      0.9907      0.059     16.852      0.000         0.875     1.106
CAT_OCUP_Empleado           1.0808      0.056     19.281      0.000         0.971     1.191
CAT_OCUP_No_empleo          0.9855      0.080     12.275      0.000         0.828     1.143
CAT_OCUP_Patron             1.2811      0.068     18.708      0.000         1.147     1.415
CAT_OCUP_Sin_sueldo         0.8383      0.102      8.215      0.000         0.638     1.038
Lugar_Nac_Localidad         0.1784      0.224      0.797      0.426        -0.261     0.617
Lugar_Nac_Otra_loc          0.1835      0.226      0.813      0.417        -0.259     0.626
Lugar_Nac_Otra_prov         0.1627      0.225      0.724      0.469        -0.278     0.603
Lugar_Nac_Otro_pais         0.1635      0.228      0.719      0.472        -0.283     0.610
Lugar_Nac_Pais_limit        0.1454      0.227      0.641      0.521        -0.299     0.590
CAT_INAC_0                  0.0894      0.288      0.311      0.756        -0.475     0.653
CAT_INAC_1                  0.5650      0.073      7.772      0.000         0.423     0.708
CAT_INAC_2                  0.9706      0.138      7.029      0.000         0.700     1.241
CAT_INAC_3                  0.8001      0.072     11.080      0.000         0.659     0.942
CAT_INAC_4                  0.7270      0.074      9.880      0.000         0.583     0.871
CAT_INAC_5                  0.7824      0.098      7.954      0.000         0.590     0.975
CAT_INAC_6                  0.5872      0.123      4.774      0.000         0.346     0.828
CAT_INAC_7                  0.6547      0.089      7.385      0.000         0.481     0.828
NRO_HOGAR_1                 1.4517      0.164      8.856      0.000         1.130     1.773
NRO_HOGAR_2                 1.2782      0.168      7.613      0.000         0.949     1.607
NRO_HOGAR_3                 1.2606      0.239      5.277      0.000         0.792     1.729
NRO_HOGAR_4                 1.1858      0.515      2.302      0.021         0.176     2.195
Estado_Civil_Casado        -4.6330      0.754     -6.143      0.000        -6.111    -3.155
Estado_Civil_Divorciado    -5.0277      0.754     -6.666      0.000        -6.506    -3.549
Estado_Civil_Soltero       -5.0059      0.753     -6.644      0.000        -6.483    -3.529
Estado_Civil_Unido         -4.7382      0.754     -6.287      0.000        -6.216    -3.261
Estado_Civil_Viudo         -4.9289      0.754     -6.533      0.000        -6.408    -3.450
Sabe_leer_Menor             0.0395      0.346      0.114      0.909        -0.639     0.718
Sabe_leer_No                0.0887      0.345      0.257      0.797        -0.588     0.766
Sabe_leer_Si               -0.0032      0.349     -0.009      0.993        -0.687     0.680
Cobertura_Medica_1          1.1140      0.098     11.386      0.000         0.922     1.306
Cobertura_Medica_2          1.2335      0.100     12.293      0.000         1.037     1.430
Cobertura_Medica_3          0.4485      0.238      1.881      0.060        -0.019     0.916
Cobertura_Medica_4          0.6088      0.098      6.206      0.000         0.417     0.801
Cobertura_Medica_9          0.0409      0.186      0.220      0.826        -0.324     0.405
Cobertura_Medica_12         1.2728      0.110     11.548      0.000         1.057     1.489
Cobertura_Medica_13         0.4579      0.493      0.928      0.353        -0.509     1.425
Trabajo_Desocupado          3.0229      0.561      5.384      0.000         1.922     4.124
Trabajo_Inactivo            2.5280      0.223     11.349      0.000         2.091     2.965
Trabajo_Menor               2.5590      0.224     11.440      0.000         2.120     2.997
Trabajo_Ocupado             3.3463      0.563      5.946      0.000         2.243     4.450
==============================================================================
Omnibus:                      298.009   Durbin-Watson:                   0.950
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              602.706
Skew:                          -0.253   Prob(JB):                    1.33e-131
Kurtosis:                       4.218   Cond. No.                     2.02e+16
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 8.42e-32. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
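
The condition-number warning is the dummy-variable trap: every categorical variable enters with all of its levels, so each dummy group sums to a column of ones and, together with the constant, the design matrix is singular. A sketch of one way around it, assuming the same `data` frame and nominal columns as above, is to drop a reference level per category:


In [ ]:
# Sketch: one-hot encode with a reference level dropped per category so the
# WLS design matrix keeps full rank (drop_first requires pandas >= 0.18).
import pandas as pd

nominal_cols = ['NRO_HOGAR', 'Parentesco', 'Estado_Civil', 'Cobertura_Medica',
                'Sabe_leer', 'Lugar_Nac', 'NIVEL_ED', 'Trabajo',
                'CAT_OCUP', 'CAT_INAC']
df_full_rank = pd.get_dummies(data, columns=nominal_cols, drop_first=True)
# Each remaining coefficient is then measured relative to the omitted reference level.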

Linear Regression (sklearn)


In [56]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.4, random_state = 200)

In [57]:
# sk-learn (Y ~ X) without an explicit intercept (the full dummy sets already act as one)
from sklearn.linear_model import LinearRegression

R_IS=[]
R_OS=[]

n = 10
for i in range(n):  
    res=LinearRegression(fit_intercept=False)
    res.fit(X_train,y_train, sample_weight = weights[:len(X_train)])
    R_IS.append(1-((np.asarray(res.predict(X_train))-y_train)**2).sum()/((y_train-np.mean(y_train))**2).sum())
    R_OS.append(1-((np.asarray(res.predict(X_test))-y_test)**2).sum()/((y_test-np.mean(y_test))**2).sum())
    
print("IS R-squared for {} times is {}".format(n,np.mean(R_IS)))
print("OS R-squared for {} times is {}".format(n,np.mean(R_OS)))


IS R-squared for 10 times is 0.278070108368
OS R-squared for 10 times is -1.22686989748e+19
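
The absurdly negative out-of-sample R-squared has two causes: the full dummy sets are perfectly collinear (the same issue flagged by the WLS warning above), and `weights[:len(X_train)]` takes the first rows of the weight vector rather than the rows that `train_test_split` actually assigned to the training set. A sketch, assuming the weights should simply follow the same split (the `_tr` / `_te` names are just for this sketch):


In [ ]:
# Sketch: split the observation weights alongside X and Y so sample_weight
# lines up row-by-row with the shuffled training data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(
    X, Y, weights, test_size=0.4, random_state=200)

ols = LinearRegression(fit_intercept=False)
ols.fit(X_tr, y_tr, sample_weight=np.asarray(w_tr))
print("OS R-squared: {}".format(ols.score(X_te, y_te)))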

LOGISTIC REGRESSION


In [58]:
import pandas as pd
import numpy as np
import statsmodels as sm
import sklearn as skl
import sklearn.preprocessing as preprocessing
import sklearn.linear_model as linear_model
import sklearn.cross_validation as cross_validation
import sklearn.metrics as metrics
import sklearn.tree as tree
import seaborn as sns

scaler = preprocessing.StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test = scaler.transform(X_test)


/Users/IlanReinstein/anaconda/lib/python2.7/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)

In [59]:
# Encode the categorical features as numbers
def number_encode_features(df):
    result = df.copy()
    encoders = {}
    for column in result.columns:
        if result.dtypes[column] == np.object:
            encoders[column] = preprocessing.LabelEncoder()
            result[column] = encoders[column].fit_transform(result[column])
    return result, encoders

In [60]:
encoded_data, _ = number_encode_features(df)

In [61]:
# print all features and their relevance (takes some time)
cls = linear_model.LogisticRegression()

cls.fit(X_train, y_train)
y_pred = cls.predict(X_test)
cm = metrics.confusion_matrix(y_test, y_pred)

plt.figure(figsize=(20,20))
coefs = pd.Series(cls.coef_[0], index=X_train.columns)
coefs.sort_values(inplace = True)
coefs.plot(kind="bar")


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-61-388bc3b8277f> in <module>()
      2 cls = linear_model.LogisticRegression()
      3 
----> 4 cls.fit(X_train, y_train)
      5 y_pred = cls.predict(X_test)
      6 cm = metrics.confusion_matrix(y_test, y_pred)

/Users/IlanReinstein/anaconda/lib/python2.7/site-packages/sklearn/linear_model/logistic.pyc in fit(self, X, y, sample_weight)
   1173         X, y = check_X_y(X, y, accept_sparse='csr', dtype=np.float64,
   1174                          order="C")
-> 1175         check_classification_targets(y)
   1176         self.classes_ = np.unique(y)
   1177         n_samples, n_features = X.shape

/Users/IlanReinstein/anaconda/lib/python2.7/site-packages/sklearn/utils/multiclass.pyc in check_classification_targets(y)
    170     if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
    171             'multilabel-indicator', 'multilabel-sequences']:
--> 172         raise ValueError("Unknown label type: %r" % y_type)
    173 
    174 

ValueError: Unknown label type: 'continuous'
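
`Y` here is log income, a continuous target, which is why `LogisticRegression` raises `Unknown label type: 'continuous'`. One option, sketched below in the spirit of the commented-out `pd.qcut` cell further up, is to discretize income into quantile classes first (quintiles and the `_q` names are arbitrary choices for this sketch):


In [ ]:
# Sketch: bin continuous log income into quintile classes so a classifier applies.
import pandas as pd
import sklearn.linear_model as linear_model
import sklearn.metrics as metrics

y_train_q = pd.qcut(y_train, 5, labels=False)   # class 0..4 = income quintile
y_test_q = pd.qcut(y_test, 5, labels=False)     # (cleaner: reuse the training bin edges)

cls = linear_model.LogisticRegression()
cls.fit(X_train, y_train_q)
print(metrics.confusion_matrix(y_test_q, cls.predict(X_test)))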

RIDGE REGRESSION


In [62]:
import statsmodels.formula.api as smf
from scipy import stats
from pandas.stats.api import ols
from sklearn import linear_model

In [63]:
Ridge=linear_model.Ridge(fit_intercept=True, alpha=1)

Ridge.fit(X_train,y_train)
# In the sample:
p_IS=Ridge.predict(X_train)
err_IS=p_IS-y_train
R_2_IS_Ridge=1-np.var(err_IS)/np.var(y_train)
print("The R-squared we found for IS Ridge is: {0}".format(R_2_IS_Ridge))

Ridge_coef=Ridge.coef_

#Out of sample
p_OS=Ridge.predict(X_test)
err_OS=p_OS-y_test
R_2_OS_Ridge=1-np.var(err_OS)/np.var(y_test)
print("The R-squared we found for OS Ridge is: {0}".format(R_2_OS_Ridge))


The R-squared we found for IS Ridge is: 0.312149973091
The R-squared we found for OS Ridge is: 0.276840643729

In [64]:
#select best lambda for Ridge
lambdas = np.exp(np.linspace(-5,13,200))
lambda_r_optimal=Regularization_fit_lambda(1,X_train,y_train,lambdas,p=0.4,Graph=True)
print('Optimal lambda for Ridge={0}'.format(lambda_r_optimal))


Optimal lambda for Ridge=11.2131301574

LASSO REGRESSION


In [65]:
Lasso=linear_model.Lasso(fit_intercept=True,alpha=1000000)
#try Lasso with a selected regularization parameter lambda

Lasso.fit(X_train,y_train)
# In the sample:
p_IS=Lasso.predict(X_train)
err_IS=p_IS-y_train
R_2_IS_Lasso=1-np.var(err_IS)/np.var(y_train)
print("The R-squared we found for IS Lasso is: {0}".format(R_2_IS_Ridge))

Lasso_coef=Lasso.coef_
#Out of sample
p_OS=Lasso.predict(X_test)
err_OS=p_OS-y_test
R_2_OS_Lasso=1-np.var(err_OS)/np.var(y_test)
print("The R-squared we found for OS Lasso is: {0}".format(R_2_OS_Lasso))


The R-squared we found for IS Lasso is: 0.312149973091
The R-squared we found for OS Lasso is: -1.55431223448e-15

In [66]:
#select lambdas for Lasso 
lambdas=np.exp(np.linspace(-5,6.5,200))
lambda_l_optimal=Regularization_fit_lambda(2,X_train,y_train,lambdas,p=0.4,Graph=True)
print('Optimal lambda for Lasso={0}'.format(lambda_l_optimal))


Optimal lambda for Lasso=0.00673794699909
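
scikit-learn also ships estimators that run this alpha search by cross-validation internally; a sketch that can serve as a cross-check on the values above, assuming the same `X_train`, `y_train` and the same (arbitrary) grids:


In [ ]:
# Sketch cross-check: built-in cross-validated alpha search.
import numpy as np
from sklearn import linear_model

ridge_cv = linear_model.RidgeCV(alphas=np.exp(np.linspace(-5, 13, 200)))
ridge_cv.fit(X_train, y_train)
lasso_cv = linear_model.LassoCV(alphas=np.exp(np.linspace(-5, 6.5, 200)), cv=5)
lasso_cv.fit(X_train, y_train)
print("RidgeCV alpha: {}".format(ridge_cv.alpha_))
print("LassoCV alpha: {}".format(lasso_cv.alpha_))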

OLS PERFORMANCE COMPARISON


In [67]:
# Source: ADS Practice Lab 2016

Number_variables=range(len(X_train.columns[:]))

# #select best lambda for Ridge
# lambdas = np.exp(np.linspace(-5,13,200))
# lambda_r_optimal=Regularization_fit_lambda(1,X_train,y_train,lambdas,p=0.4,Graph=True)

# #select lambdas for Lasso 
# lambdas=np.exp(np.linspace(-5,6.5,200))
# lambda_l_optimal=Regularization_fit_lambda(2,X_train,y_train,lambdas,p=0.4,Graph=True)

OLS_R_2_OS_F=[]
OLS_R_2_IS_F=[]
OLS_R_2_Ridge_OS_F=[]
OLS_R_2_Ridge_IS_F=[]
OLS_R_2_Lasso_OS_F=[]
OLS_R_2_Lasso_IS_F=[]

Ridge=linear_model.Ridge(fit_intercept=True,alpha=1)
Lasso=linear_model.Lasso(fit_intercept=True, alpha=1)

for j in Number_variables:
    # OLS
    lm = smf.ols(formula = 'Y ~ '+ '+'.join(X_train.columns[:j+1]), 
                 data = pd.concat([X_train.ix[:,:j+1],y_train], axis=1)).fit()
    error = lm.predict(X_test.ix[:,:j+1]) - y_test
    R_2_OS_OLS=1-error.var()/y_test.var()
    R_2_IS_OLS = lm.rsquared
    OLS_R_2_IS_F.append(R_2_IS_OLS)
    OLS_R_2_OS_F.append(max(R_2_OS_OLS,0))
    
    # Ridge
    Ridge.fit(X_train.ix[:,:j+1],y_train)
    
    # In sample:
    err_IS=Ridge.predict(X_train.ix[:,:j+1]) - y_train
    R_2_IS_Ridge=1-np.var(err_IS)/np.var(y_train)
    OLS_R_2_Ridge_IS_F.append(R_2_IS_Ridge)
    
    #Out of sample
    err_OS=Ridge.predict(X_test.ix[:,:j+1]) - y_test
    R_2_OS_Ridge=1-np.var(err_OS)/np.var(y_test)
    OLS_R_2_Ridge_OS_F.append(max(R_2_OS_Ridge,0))

    # Lasso
    
    Lasso.fit(X_train.ix[:,0:j+1],y_train)
    
    #In sample:
    p_IS=Lasso.predict(X_train.ix[:,0:j+1])
    err_IS=p_IS-y_train
    R_2_IS_Lasso=1-np.var(err_IS)/np.var(y_train)
    OLS_R_2_Lasso_IS_F.append(R_2_IS_Lasso)

    #Out of sample
    p_OS=Lasso.predict(X_test.ix[:,0:j+1])
    err_OS=p_OS-y_test
    R_2_OS_Lasso=1-np.var(err_OS)/np.var(y_test)
    OLS_R_2_Lasso_OS_F.append(max(R_2_OS_Lasso,0))


pylab.rcParams['figure.figsize'] = [14,10]

plt.title('IS and OS performance of OLS, Ridge and Lasso as variables are added')

plt.plot(Number_variables,OLS_R_2_IS_F,'g',label='OLS_IS')
plt.plot(Number_variables,OLS_R_2_Lasso_IS_F,'y',label='Lasso_IS')
plt.plot(Number_variables,OLS_R_2_Ridge_IS_F,'k',label='Ridge_IS')

plt.plot(Number_variables,OLS_R_2_OS_F,'b',label='OLS_OS')
plt.plot(Number_variables,OLS_R_2_Lasso_OS_F,'c',label='Lasso_OS')
plt.plot(Number_variables,OLS_R_2_Ridge_OS_F,'r',label='Ridge_OS')

plt.legend(loc='lower right')
plt.xlabel('Number of independent variables')
plt.ylabel('R-squared')
plt.show()


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-67-ae00022046fe> in <module>()
     25     lm = smf.ols(formula = 'Y ~ '+ '+'.join(X_train.columns[:j+1]), 
     26                  data = pd.concat([X_train.ix[:,:j+1],y_train], axis=1)).fit()
---> 27     error = lm.predict(X_test.ix[:,:j+1]) - y_test
     28     R_2_OS_OLS=1-error.var()/y_test.var()
     29     R_2_IS_OLS = lm.rsquared

AttributeError: 'numpy.ndarray' object has no attribute 'ix'
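
The `.ix` failure is a side effect of the scaling cell in the logistic-regression section: `scaler.transform(X_test)` replaced the test DataFrame with a plain NumPy array, which has no `.ix`. A sketch of a fix, assuming the column order is unchanged, is to wrap it back into a DataFrame:


In [ ]:
# Sketch: restore DataFrame structure (and column names) to the scaled test set
# so positional slicing with .ix / .iloc keeps working.
import pandas as pd

X_test = pd.DataFrame(X_test, columns=X_train.columns)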

Pipeline Anova SVM


In [68]:
from sklearn import svm
from sklearn.datasets import samples_generator
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import make_pipeline

# ANOVA SVM-C
# 1) anova filter, take 3 best ranked features
anova_filter = SelectKBest(f_regression, k=3)
# 2) svm
clf = svm.SVC(kernel='linear')

anova_svm = make_pipeline(anova_filter, clf)
anova_svm.fit(X_train, y_train)
print("ANOVA Score: {}".format(anova_svm.score(X_train,y_train)))


---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-68-2c6ffd8f3685> in <module>()
     11 
     12 anova_svm = make_pipeline(anova_filter, clf)
---> 13 anova_svm.fit(X_train, y_train)
     14 print("ANOVA Score: {}".format(anova_svm.score(X_train,y_train)))

/Users/IlanReinstein/anaconda/lib/python2.7/site-packages/sklearn/pipeline.pyc in fit(self, X, y, **fit_params)
    268         Xt, fit_params = self._fit(X, y, **fit_params)
    269         if self._final_estimator is not None:
--> 270             self._final_estimator.fit(Xt, y, **fit_params)
    271         return self
    272 

/Users/IlanReinstein/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in fit(self, X, y, sample_weight)
    150 
    151         X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
--> 152         y = self._validate_targets(y)
    153 
    154         sample_weight = np.asarray([]

/Users/IlanReinstein/anaconda/lib/python2.7/site-packages/sklearn/svm/base.pyc in _validate_targets(self, y)
    518     def _validate_targets(self, y):
    519         y_ = column_or_1d(y, warn=True)
--> 520         check_classification_targets(y)
    521         cls, y = np.unique(y_, return_inverse=True)
    522         self.class_weight_ = compute_class_weight(self.class_weight, cls, y_)

/Users/IlanReinstein/anaconda/lib/python2.7/site-packages/sklearn/utils/multiclass.pyc in check_classification_targets(y)
    170     if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
    171             'multilabel-indicator', 'multilabel-sequences']:
--> 172         raise ValueError("Unknown label type: %r" % y_type)
    173 
    174 

ValueError: Unknown label type: 'continuous'
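
Same root cause as the logistic regression above: `SVC` is a classifier and the target is continuous. Keeping income continuous, the analogous pipeline would use `f_regression` as the univariate filter and support-vector regression as the final step; a minimal sketch (the `anova_svr` name is just for this sketch):


In [ ]:
# Sketch: ANOVA-style filter + support-vector regression for a continuous target.
from sklearn import svm
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import make_pipeline

anova_svr = make_pipeline(SelectKBest(f_regression, k=3), svm.SVR(kernel='linear'))
anova_svr.fit(X_train, y_train)
print("ANOVA-SVR R-squared: {}".format(anova_svr.score(X_train, y_train)))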

Random Forest


In [17]:
# Random Forest Classification
import pandas
from sklearn import cross_validation
from sklearn.ensemble import RandomForestClassifier

num_folds = 10
num_instances = len(X)
seed = 7
num_trees = 100
max_features = 3
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)
results = cross_validation.cross_val_score(model, X, Y, cv=kfold)
print(results.mean())


0.00921052631579
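
The near-zero score is expected: `RandomForestClassifier` treats every distinct log-income value as a separate class. For a continuous target the regression forest is the natural counterpart; a sketch using the `model_selection` API (the `cross_validation` module imported above is deprecated), assuming the same `X` and `Y`:


In [ ]:
# Sketch: random-forest regression with 10-fold CV, scored by R-squared.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rf = RandomForestRegressor(n_estimators=100, max_features=3, random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
print(cross_val_score(rf, X, Y, cv=kfold).mean())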

Plot the decision boundaries of a VotingClassifier (SVC vs DecisionTree vs K-neighbours)


In [339]:
X = np.asarray(X_train)
Y = np.asarray(y_train)

In [30]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from itertools import product
from sklearn.ensemble import VotingClassifier

# Loading some example data
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data[:, [0, 2]]
y = iris.target

# Training classifiers
clf1 = DecisionTreeClassifier(max_depth=4)
clf2 = KNeighborsClassifier(n_neighbors=7)
clf3 = SVC(kernel='rbf', probability=True)
eclf = VotingClassifier(estimators=[('dt', clf1), ('knn', clf2), ('svc', clf3)], voting='soft', weights=[2,1,2])

clf1 = clf1.fit(X, y)
clf2 = clf2.fit(X, y)
clf3 = clf3.fit(X, y)
eclf = eclf.fit(X, y)

In [ ]:
# Plotting decision regions
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

f, axarr = plt.subplots(2, 2, sharex='col', sharey='row', figsize=(10, 8))

for idx, clf, tt in zip(product([0, 1], [0, 1]),
                        [clf1, clf2, clf3, eclf],
                        ['Decision Tree (depth=4)', 'KNN (k=7)',
                         'Kernel SVM', 'Soft Voting']):

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=0.4)
    axarr[idx[0], idx[1]].scatter(X[:, 0], X[:, 1], c=y, alpha=0.8)
    axarr[idx[0], idx[1]].set_title(tt)

plt.show()

In [391]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_gaussian_quantiles

# Create and fit an AdaBoosted decision tree
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         algorithm="SAMME",
                         n_estimators=200)

bdt.fit(X, y)

plot_colors = "br"
plot_step = 0.02
class_names = "AB"

plt.figure(figsize=(10, 5))

# Plot the decision boundaries
plt.subplot(121)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                     np.arange(y_min, y_max, plot_step))

Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis("tight")

# Plot the training points
for i, n, c in zip(range(2), class_names, plot_colors):
    idx = np.where(y == i)
    plt.scatter(X[idx, 0], X[idx, 1],
                c=c, cmap=plt.cm.Paired,
                label="Class %s" % n)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.legend(loc='upper right')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Decision Boundary')

# Plot the two-class decision scores
twoclass_output = bdt.decision_function(X)
plot_range = (twoclass_output.min(), twoclass_output.max())
plt.subplot(122)
for i, n, c in zip(range(2), class_names, plot_colors):
    plt.hist(twoclass_output[y == i],
             bins=10,
             range=plot_range,
             facecolor=c,
             label='Class %s' % n,
             alpha=.5)
x1, x2, y1, y2 = plt.axis()
plt.axis((x1, x2, y1, y2 * 1.2))
plt.legend(loc='upper right')
plt.ylabel('Samples')
plt.xlabel('Score')
plt.title('Decision Scores')

plt.tight_layout()
plt.subplots_adjust(wspace=0.35)
plt.show()



ValueErrorTraceback (most recent call last)
<ipython-input-391-f08c6b17d38c> in <module>()
     26                      np.arange(y_min, y_max, plot_step))
     27 
---> 28 Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
     29 Z = Z.reshape(xx.shape)
     30 cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)

/resources/common/.virtualenv/python2/lib/python2.7/site-packages/sklearn/ensemble/weight_boosting.pyc in predict(self, X)
    598             The predicted classes.
    599         """
--> 600         pred = self.decision_function(X)
    601 
    602         if self.n_classes_ == 2:

/resources/common/.virtualenv/python2/lib/python2.7/site-packages/sklearn/ensemble/weight_boosting.pyc in decision_function(self, X)
    670             pred = sum((estimator.predict(X) == classes).T * w
    671                        for estimator, w in zip(self.estimators_,
--> 672                                                self.estimator_weights_))
    673 
    674         pred /= self.estimator_weights_.sum()

/resources/common/.virtualenv/python2/lib/python2.7/site-packages/sklearn/ensemble/weight_boosting.pyc in <genexpr>((estimator, w))
    669         else:   # self.algorithm == "SAMME"
    670             pred = sum((estimator.predict(X) == classes).T * w
--> 671                        for estimator, w in zip(self.estimators_,
    672                                                self.estimator_weights_))
    673 

/resources/common/.virtualenv/python2/lib/python2.7/site-packages/sklearn/tree/tree.pyc in predict(self, X, check_input)
    429         """
    430 
--> 431         X = self._validate_X_predict(X, check_input)
    432         proba = self.tree_.predict(X)
    433         n_samples = X.shape[0]

/resources/common/.virtualenv/python2/lib/python2.7/site-packages/sklearn/tree/tree.pyc in _validate_X_predict(self, X, check_input)
    401                              "match the input. Model n_features is %s and "
    402                              "input n_features is %s "
--> 403                              % (self.n_features_, n_features))
    404 
    405         return X

ValueError: Number of features of the model must match the input. Model n_features is 60 and input n_features is 2 
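
`bdt` was fit on the full 60-column design matrix, but the mesh passed to `predict` only carries the two plotting coordinates, hence the mismatch. The scikit-learn example this cell follows builds a two-feature toy dataset with `make_gaussian_quantiles` (imported above but never used); a sketch of that setup, under which the 2-D decision-boundary plot is well defined (the `_toy` names are just for this sketch):


In [ ]:
# Sketch: two-feature toy data (as in the sklearn two-class AdaBoost example)
# so the fitted model and the 2-D meshgrid have the same number of features.
import numpy as np
from sklearn.datasets import make_gaussian_quantiles

X1, y1 = make_gaussian_quantiles(cov=2., n_samples=200, n_features=2,
                                 n_classes=2, random_state=1)
X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5, n_samples=300,
                                 n_features=2, n_classes=2, random_state=1)
X_toy = np.concatenate((X1, X2))
y_toy = np.concatenate((y1, 1 - y2))

bdt.fit(X_toy, y_toy)
# ...then rebuild the meshgrid from X_toy[:, 0] and X_toy[:, 1] before calling bdt.predict.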

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# import some data to play with
X = X_train
y = y_train
h = .02  # step size in the mesh

# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
C = 1.0  # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C).fit(X, y)
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X, y)
poly_svc = svm.SVC(kernel='poly', degree=3, C=C).fit(X, y)
lin_svc = svm.LinearSVC(C=C).fit(X, y)

# create a mesh to plot in
x_min, x_max = X.ix[:, 0].min() - 1, X.ix[:, 0].max() + 1
y_min, y_max = X.ix[:, 1].min() - 1, X.ix[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# title for the plots
titles = ['SVC with linear kernel',
          'LinearSVC (linear kernel)',
          'SVC with RBF kernel',
          'SVC with polynomial (degree 3) kernel']


for i, clf in enumerate((svc, lin_svc, rbf_svc, poly_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])

plt.show()

PCA

PCA does not work well for multivariate regression here: most of our data are categorical.
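
One way to see the problem is that the components of a standardized 0/1 dummy matrix are hard to interpret and the variance tends to spread across many components rather than reflecting a continuous latent structure; a sketch that just inspects that spread, assuming the `df` built above (the `pca_check` name is just for this sketch):


In [ ]:
# Sketch: cumulative share of variance across principal components of the dummy matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

Z = StandardScaler().fit_transform(df.drop('Y', axis=1))
pca_check = PCA().fit(Z)
print(np.cumsum(pca_check.explained_variance_ratio_)[:10])   # first 10 components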


In [16]:
nad = pd.concat([X,Y], axis=1)

In [20]:
X,y = X,Y

SVM-Anova: SVM with univariate feature selection


In [22]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets, feature_selection
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

transform = feature_selection.SelectPercentile(feature_selection.f_classif)

clf = Pipeline([('anova', transform), ('svc', svm.SVC(C=1.0))])

score_means = list()
score_stds = list()
percentiles = (1, 3, 6, 10, 15, 20, 30, 40, 60, 80, 100)

for percentile in percentiles:
    clf.set_params(anova__percentile=percentile)
    # Compute cross-validation score using 1 CPU
    this_scores = cross_val_score(clf, X, y, n_jobs=1)
    score_means.append(this_scores.mean())
    score_stds.append(this_scores.std())

plt.errorbar(percentiles, score_means, np.array(score_stds))

plt.title(
    'Performance of the SVM-Anova varying the percentile of features selected')
plt.xlabel('Percentile')
plt.ylabel('Prediction rate')

plt.axis('tight')
plt.show()


/gpfs/fs01/user/s354-79e1eae297cf06-da1f292c4b52/.local/lib/python2.7/site-packages/sklearn/feature_selection/univariate_selection.py:113: UserWarning: Features [40 55] are constant.
  UserWarning)

Pipelining: chaining a PCA and a logistic regression


In [27]:
from sklearn import linear_model, decomposition, datasets
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

logistic = linear_model.LogisticRegression()

pca = decomposition.PCA()
pipe = Pipeline(steps=[('pca', pca), ('logistic', logistic)])

pca.fit(X_train)

plt.figure(1, figsize=(4, 3))
plt.clf()
plt.axes([.2, .2, .7, .7])
plt.plot(pca.explained_variance_, linewidth=2)
plt.axis('tight')
plt.xlabel('n_components')
plt.ylabel('explained_variance_')


Out[27]:
<matplotlib.text.Text at 0x7f6a5daeba90>

In [29]:
n_components = [20, 40, 60]
Cs = np.logspace(-4, 4, 3)

#Parameters of pipelines can be set using ‘__’ separated parameter names:

estimator = GridSearchCV(pipe,
                         dict(pca__n_components=n_components,
                              logistic__C=Cs))
estimator.fit(X_train, y_train)

plt.axvline(estimator.best_estimator_.named_steps['pca'].n_components,
            linestyle=':', label='n_components chosen')
plt.legend(prop=dict(size=12))
plt.show()



In [17]:
n=2
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

nad_ = StandardScaler().fit_transform(nad)

pca = PCA(n)
Xproj = pca.fit_transform(nad)
eigenvalues = pca.explained_variance_

plt.figure(2, figsize=(8, 6))
plt.scatter(Xproj[:, 0], Xproj[:, 1], c = nad.sum(axis=1), cmap=plt.cm.cool)
plt.xlabel('First Component')
plt.ylabel('Second Component')
plt.show()



In [279]:
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.cluster import KMeans

X_ = np.asarray(nad)
range_n_clusters = [2, 3, 4, 5, 6, 7]
for n_clusters in range_n_clusters:
    km = KMeans(n_clusters=n_clusters, random_state=324)
    cluster_labels = km.fit_predict(X_)
    silhouette_avg = silhouette_score(X_, cluster_labels)
    print("For n_clusters = {},".format(n_clusters)+" the average silhouette_score is :{}".format(silhouette_avg))


For n_clusters = 2, the average silhouette_score is :0.595538068807
For n_clusters = 3, the average silhouette_score is :0.537221913843
For n_clusters = 4, the average silhouette_score is :0.501805531701
For n_clusters = 5, the average silhouette_score is :0.483144350648
For n_clusters = 6, the average silhouette_score is :0.456988255653
For n_clusters = 7, the average silhouette_score is :0.427559903572

Kmeans Clustering


In [18]:
from sklearn.cluster import KMeans

n=5
dd=Xproj

km=KMeans(n_clusters=n)
res=km.fit(dd)

with plt.style.context('ggplot'):
    plt.figure(figsize=(6, 5))
    plt.scatter(dd[:, 0], dd[:, 1], c=res.labels_)
    plt.ylabel('X2')
    plt.xlabel('X1')
    plt.xticks(())
    plt.yticks(())
    plt.title("KMeans, K = {} Clusters".format(n))
    plt.show()


Gaussian Mixture


In [19]:
from sklearn.mixture import GaussianMixture

gm=GaussianMixture(n_components=n)
res1=gm.fit(dd)

with plt.style.context('ggplot'):
    plt.figure(figsize=(6,6))
    plt.scatter(dd[:, 0], dd[:, 1], c=res1.predict(dd))
    plt.xlabel('X1')
    plt.ylabel('X2')
    plt.xticks(())
    plt.yticks(())
    plt.title("Gaussian Mixture")
    plt.show()



In [27]:
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(nad)

db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)

print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))


Estimated number of clusters: 89

NameErrorTraceback (most recent call last)
<ipython-input-27-67f59df0b1ae> in <module>()
     15 
     16 print('Estimated number of clusters: %d' % n_clusters_)
---> 17 print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
     18 print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
     19 print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))

NameError: name 'labels_true' is not defined
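
`labels_true` would have to hold ground-truth cluster assignments, which this dataset does not have, hence the NameError: homogeneity, completeness, V-measure, ARI and AMI are all supervised metrics. With unlabeled data only internal scores apply; a sketch that reports just the silhouette, excluding DBSCAN's noise points (label -1):


In [ ]:
# Sketch: internal evaluation only; no ground-truth cluster labels exist here.
from sklearn import metrics

mask = labels != -1                  # drop points DBSCAN marked as noise
if len(set(labels[mask])) > 1:       # silhouette needs at least two clusters
    print("Silhouette Coefficient: %0.3f"
          % metrics.silhouette_score(X[mask], labels[mask]))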

Feature selection

Recursive Feature Elimination


In [57]:
# Feature Extraction with RFE
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load data
X = X
Y = Y
# feature extraction
model = LogisticRegression()
rfe = RFE(model, 3)
fit = rfe.fit(X, Y)
print("Num Features: %d") % fit.n_features_
print("Selected Features: %s") % fit.support_
print("Feature Ranking: %s") % fit.ranking_


Num Features: 3
Selected Features: [False False False False False False False False False False False False
 False False False False False  True False False False False False False
 False False False False False False False False False False False False
 False False False False False False False False False False False False
 False False False  True  True False False False False False False False]
Feature Ranking: [56 58 13 57 51  3 34 23  8  5 24 52 30 29 54 31 28  1 38 53 40 39 18  7 46
 42 47 44 45 35 26  9 50 32 27 25 49 19  6 17 48 43 14 16 55 15 22 21 20 12
 11  1  1  2 10 41 33 36 37  4]

In [58]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
 
X = X
Y = Y
names = df.columns
 
#use linear regression as the model
lr = LinearRegression()
#rank all features, i.e continue the elimination until the last one
rfe = RFE(lr, n_features_to_select=1)
rfe.fit(X,Y)
 
print "Features sorted by their rank:"
print sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), names))


Features sorted by their rank:
[(1.0, 'Cobertura_Medica_9'), (2.0, 'Cobertura_Medica_1'), (3.0, 'Cobertura_Medica_2'), (4.0, 'Cobertura_Medica_3'), (5.0, 'Sabe_leer_No'), (6.0, 'Cobertura_Medica_4'), (7.0, 'Sabe_leer_Si'), (8.0, 'CAT_OCUP_Empleado'), (9.0, 'NIVEL_ED_Univ_I'), (10.0, 'NIVEL_ED_Univ_C'), (11.0, 'CAT_OCUP_No_empleo'), (12.0, 'CAT_OCUP_Cuenta_propia'), (13.0, 'Trabajo_Desocupado'), (14.0, 'Lugar_Nac_Otro_pais'), (15.0, 'Cobertura_Medica_13'), (16.0, 'NIVEL_ED_Secundaria_I'), (17.0, 'NIVEL_ED_Sin_Edu'), (18.0, 'NIVEL_ED_Primaria_C'), (19.0, 'NIVEL_ED_Primaria_I'), (20.0, 'NIVEL_ED_Secundaria_C'), (21.0, 'Parentesco_Yerno'), (22.0, 'Parentesco_Suegro'), (23.0, 'Parentesco_Nieto'), (24.0, 'Parentesco_Otro'), (25.0, 'Parentesco_Madre_Padre'), (26.0, 'Parentesco_Jefe'), (27.0, 'Parentesco_Conyuge'), (28.0, 'Edad'), (29.0, 'Parentesco_Hijo'), (30.0, 'Sexo'), (31.0, 'Parentesco_No_Familia'), (32.0, 'Parentesco_Hermano'), (33.0, 'CAT_INAC_6'), (34.0, 'CAT_INAC_7'), (35.0, 'NRO_HOGAR_1'), (36.0, 'NRO_HOGAR_2'), (37.0, 'CAT_INAC_4'), (38.0, 'Lugar_Nac_Pais_limit'), (39.0, 'CAT_INAC_2'), (40.0, 'CAT_INAC_3'), (41.0, 'CAT_INAC_1'), (42.0, 'CAT_INAC_5'), (43.0, 'CAT_INAC_0'), (44.0, 'NRO_HOGAR_4'), (45.0, 'Estado_Civil_Casado'), (46.0, 'Estado_Civil_Soltero'), (47.0, 'Estado_Civil_Divorciado'), (48.0, 'NRO_HOGAR_3'), (49.0, 'Trabajo_Inactivo'), (50.0, 'Cobertura_Medica_12'), (51.0, 'Estado_Civil_Viudo'), (52.0, 'Estado_Civil_Unido'), (53.0, 'Sabe_leer_Menor'), (54.0, 'CAT_OCUP_Sin_sueldo'), (55.0, 'CAT_OCUP_Patron'), (56.0, 'Lugar_Nac_Otra_prov'), (57.0, 'Lugar_Nac_Otra_loc'), (58.0, 'Lugar_Nac_Localidad'), (59.0, 'Y'), (60.0, 'CODUSU')]

Univariate Selection


In [61]:
# Feature Extraction with Univariate Statistical Tests (Chi-squared for classification)
import pandas
import numpy
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# load data
names = df.columns
X = X
Y = Y
# feature extraction
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(np.asarray(X), Y)
# summarize scores
numpy.set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(np.asarray(X))
# summarize selected features
print(features[0:5,:])



ValueErrorTraceback (most recent call last)
<ipython-input-61-abc091ee4dde> in <module>()
     10 # feature extraction
     11 test = SelectKBest(score_func=chi2, k=4)
---> 12 fit = test.fit(np.asarray(X), Y)
     13 # summarize scores
     14 numpy.set_printoptions(precision=3)

/gpfs/fs01/user/s354-79e1eae297cf06-da1f292c4b52/.local/lib/python2.7/site-packages/sklearn/feature_selection/univariate_selection.pyc in fit(self, X, y)
    328 
    329         self._check_params(X, y)
--> 330         score_func_ret = self.score_func(X, y)
    331         if isinstance(score_func_ret, (list, tuple)):
    332             self.scores_, self.pvalues_ = score_func_ret

/gpfs/fs01/user/s354-79e1eae297cf06-da1f292c4b52/.local/lib/python2.7/site-packages/sklearn/feature_selection/univariate_selection.pyc in chi2(X, y)
    213     X = check_array(X, accept_sparse='csr')
    214     if np.any((X.data if issparse(X) else X) < 0):
--> 215         raise ValueError("Input X must be non-negative.")
    216 
    217     Y = LabelBinarizer().fit_transform(y)

ValueError: Input X must be non-negative.
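
`chi2` only accepts non-negative feature values (it is meant for counts/frequencies), and by this point `X` has been overwritten with a standardized, zero-mean matrix by one of the clustering cells above, so negatives are present. For a continuous target, `f_regression` is the more natural univariate score in any case; a sketch:


In [ ]:
# Sketch: univariate feature scoring that tolerates signed features and a continuous target.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

test = SelectKBest(score_func=f_regression, k=4)
fit = test.fit(np.asarray(X), Y)
print(fit.scores_)
print(fit.transform(np.asarray(X))[0:5, :])   # the four highest-scoring columns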

Stability selection


In [62]:
from sklearn.linear_model import RandomizedLasso
 
#using the cleaned EPH data
#Data gets scaled automatically by sklearn's implementation
X = X
Y = Y
names = df.columns
 
rlasso = RandomizedLasso(alpha=0.025)
rlasso.fit(X, Y)
 
print "Features sorted by their score:"
print sorted(zip(map(lambda x: round(x, 4), rlasso.scores_), 
                 names), reverse=True)


Features sorted by their score:
[(0.0, 'Y'), (0.0, 'Trabajo_Inactivo'), (0.0, 'Trabajo_Desocupado'), (0.0, 'Sexo'), (0.0, 'Sabe_leer_Si'), (0.0, 'Sabe_leer_No'), (0.0, 'Sabe_leer_Menor'), (0.0, 'Parentesco_Yerno'), (0.0, 'Parentesco_Suegro'), (0.0, 'Parentesco_Otro'), (0.0, 'Parentesco_No_Familia'), (0.0, 'Parentesco_Nieto'), (0.0, 'Parentesco_Madre_Padre'), (0.0, 'Parentesco_Jefe'), (0.0, 'Parentesco_Hijo'), (0.0, 'Parentesco_Hermano'), (0.0, 'Parentesco_Conyuge'), (0.0, 'NRO_HOGAR_4'), (0.0, 'NRO_HOGAR_3'), (0.0, 'NRO_HOGAR_2'), (0.0, 'NRO_HOGAR_1'), (0.0, 'NIVEL_ED_Univ_I'), (0.0, 'NIVEL_ED_Univ_C'), (0.0, 'NIVEL_ED_Sin_Edu'), (0.0, 'NIVEL_ED_Secundaria_I'), (0.0, 'NIVEL_ED_Secundaria_C'), (0.0, 'NIVEL_ED_Primaria_I'), (0.0, 'NIVEL_ED_Primaria_C'), (0.0, 'Lugar_Nac_Pais_limit'), (0.0, 'Lugar_Nac_Otro_pais'), (0.0, 'Lugar_Nac_Otra_prov'), (0.0, 'Lugar_Nac_Otra_loc'), (0.0, 'Lugar_Nac_Localidad'), (0.0, 'Estado_Civil_Viudo'), (0.0, 'Estado_Civil_Unido'), (0.0, 'Estado_Civil_Soltero'), (0.0, 'Estado_Civil_Divorciado'), (0.0, 'Estado_Civil_Casado'), (0.0, 'Edad'), (0.0, 'Cobertura_Medica_9'), (0.0, 'Cobertura_Medica_4'), (0.0, 'Cobertura_Medica_3'), (0.0, 'Cobertura_Medica_2'), (0.0, 'Cobertura_Medica_13'), (0.0, 'Cobertura_Medica_12'), (0.0, 'Cobertura_Medica_1'), (0.0, 'CODUSU'), (0.0, 'CAT_OCUP_Sin_sueldo'), (0.0, 'CAT_OCUP_Patron'), (0.0, 'CAT_OCUP_No_empleo'), (0.0, 'CAT_OCUP_Empleado'), (0.0, 'CAT_OCUP_Cuenta_propia'), (0.0, 'CAT_INAC_7'), (0.0, 'CAT_INAC_6'), (0.0, 'CAT_INAC_5'), (0.0, 'CAT_INAC_4'), (0.0, 'CAT_INAC_3'), (0.0, 'CAT_INAC_2'), (0.0, 'CAT_INAC_1'), (0.0, 'CAT_INAC_0')]

Example: running the methods side by side


In [84]:
Y = q  # q: the income quantile bins from the pd.qcut cell above (currently commented out)
X = df.iloc[:,2:]

In [83]:
# !pip install minepy

from sklearn.linear_model import (LinearRegression, Ridge, 
                                  Lasso, RandomizedLasso)
from sklearn.feature_selection import RFE, f_regression
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from minepy import MINE
 
X = np.asarray(X)
# Y = Y
    
names = X.columns
 
ranks = {}
 
def rank_to_dict(ranks, names, order=1):
    minmax = MinMaxScaler()
    ranks = minmax.fit_transform(order*np.array([ranks]).T).T[0]
    ranks = map(lambda x: round(x, 2), ranks)
    return dict(zip(names, ranks ))
 
lr = LinearRegression(normalize=True)
lr.fit(X, Y)
ranks["Linear reg"] = rank_to_dict(np.abs(lr.coef_), names)
 
ridge = Ridge(alpha=7)
ridge.fit(X, Y)
ranks["Ridge"] = rank_to_dict(np.abs(ridge.coef_), names)
 
 
lasso = Lasso(alpha=.05)
lasso.fit(X, Y)
ranks["Lasso"] = rank_to_dict(np.abs(lasso.coef_), names)
 
 
rlasso = RandomizedLasso(alpha=0.04)
rlasso.fit(X, Y)
ranks["Stability"] = rank_to_dict(np.abs(rlasso.scores_), names)
 
#stop the search when 5 features are left (they will get equal scores)
rfe = RFE(lr, n_features_to_select=5)
rfe.fit(X,Y)
ranks["RFE"] = rank_to_dict(map(float, rfe.ranking_), names, order=-1)
 
rf = RandomForestRegressor()
rf.fit(X,Y)
ranks["RF"] = rank_to_dict(rf.feature_importances_, names)
 
 
f, pval  = f_regression(X, Y, center=True)
ranks["Corr."] = rank_to_dict(f, names)
 
mine = MINE()
mic_scores = []
for i in range(X.shape[1]):
    mine.compute_score(X[:,i], Y)
    m = mine.mic()
    mic_scores.append(m)
 
ranks["MIC"] = rank_to_dict(mic_scores, names) 
 
 
r = {}
for name in names:
    r[name] = round(np.mean([ranks[method][name] 
                             for method in ranks.keys()]), 2)
 
methods = sorted(ranks.keys())
ranks["Mean"] = r
methods.append("Mean")
 
feat_ranking = pd.DataFrame(ranks)
cols = feat_ranking.columns.tolist()
cols.insert(0, cols.pop(cols.index('Mean')))
feat_ranking = feat_ranking.ix[:, cols]
feat_ranking.sort_values(['Mean'], ascending=False)



AttributeErrorTraceback (most recent call last)
<ipython-input-83-c52157088f16> in <module>()
     12 # Y = Y
     13 
---> 14 names = X.columns
     15 
     16 ranks = {}

AttributeError: 'numpy.ndarray' object has no attribute 'columns'
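
`X` is converted to a NumPy array two lines before `X.columns` is read, so the column names are already gone. A sketch of the fix is simply to capture the names while `X` is still a DataFrame (lines left commented since `X` has already been cast by the cell above):


In [ ]:
# Sketch: inside the cell above, grab the column names before the conversion.
# X = df.iloc[:, 2:]            # rebuild the DataFrame version if X was already cast
# names = X.columns.tolist()    # capture the names first...
# X = np.asarray(X)             # ...then convert to an array for minepy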

In [82]:
r['Trabajo_Ocupado']



KeyErrorTraceback (most recent call last)
<ipython-input-82-6c604b4ba5d3> in <module>()
----> 1 r['Trabajo_Ocupado']

KeyError: 'Trabajo_Ocupado'

In [74]:
for name in names:
    print name
    r[name] = round(np.mean([ranks[method][name] 
                             for method in ranks.keys()]), 2)


Sexo
Edad
Parentesco_Conyuge
Parentesco_Hermano
Parentesco_Hijo
Parentesco_Jefe
Parentesco_Madre_Padre
Parentesco_Nieto
Parentesco_No_Familia
Parentesco_Otro
Parentesco_Suegro
Parentesco_Yerno
NIVEL_ED_Primaria_C
NIVEL_ED_Primaria_I
NIVEL_ED_Secundaria_C
NIVEL_ED_Secundaria_I
NIVEL_ED_Sin_Edu
NIVEL_ED_Univ_C
NIVEL_ED_Univ_I
CAT_OCUP_Cuenta_propia
CAT_OCUP_Empleado
CAT_OCUP_No_empleo
CAT_OCUP_Patron
CAT_OCUP_Sin_sueldo
Lugar_Nac_Localidad
Lugar_Nac_Otra_loc
Lugar_Nac_Otra_prov
Lugar_Nac_Otro_pais
Lugar_Nac_Pais_limit
CAT_INAC_0
CAT_INAC_1
CAT_INAC_2
CAT_INAC_3
CAT_INAC_4
CAT_INAC_5
CAT_INAC_6
CAT_INAC_7
NRO_HOGAR_1
NRO_HOGAR_2
NRO_HOGAR_3
NRO_HOGAR_4
Estado_Civil_Casado
Estado_Civil_Divorciado
Estado_Civil_Soltero
Estado_Civil_Unido
Estado_Civil_Viudo
Sabe_leer_Menor
Sabe_leer_No
Sabe_leer_Si
Cobertura_Medica_1
Cobertura_Medica_2
Cobertura_Medica_3
Cobertura_Medica_4
Cobertura_Medica_9
Cobertura_Medica_12
Cobertura_Medica_13
Trabajo_Desocupado
Trabajo_Inactivo
Trabajo_Menor

KeyErrorTraceback (most recent call last)
<ipython-input-74-e3577e4605a3> in <module>()
      2     print name
      3     r[name] = round(np.mean([ranks[method][name] 
----> 4                              for method in ranks.keys()]), 2)

KeyError: 'Trabajo_Menor'
