Get your data here. The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). There are four datasets:
1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010).
2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.
3) bank-full.csv with all examples and 17 inputs, ordered by date (an older version of this dataset with fewer inputs).
4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (an older version of this dataset with fewer inputs).
The smaller datasets are provided for testing the more computationally demanding machine learning algorithms (e.g., SVM).
The classification goal is to predict whether the client will subscribe to a term deposit (variable y: yes/no).
In [2]:
# Download and extract the data (the CSVs are read below from the extracted
# bank-additional/ directory, so the zips need unpacking first):
! wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip"
! wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional.zip"
! unzip -o bank.zip
! unzip -o bank-additional.zip
In [1]:
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split  # pre-0.18 sklearn API; now in sklearn.model_selection
%matplotlib inline
In [2]:
ldata = pd.read_csv("./bank-additional/bank-additional-full.csv", sep=';') # ldata means large data
sdata = pd.read_csv("./bank-additional/bank-additional.csv", sep=';') # sdata means... yup, small.
In [3]:
sdata.head()
Out[3]:
In [4]:
sdata.describe()
Out[4]:
In [5]:
test = sdata.copy()
test.drop('y', axis=1)  # NB: drop returns a new frame; `test` itself is unchanged
Out[5]:
In [6]:
# 1. Preprocessing!
# Instead of label encoding, I'm just going to use get_dummies.
# Then we'll do some scaling of the numerical (non-categorical) values.
def dummies_and_scale(df):
    # separate the target and map yes/no to 1/0 (without mutating the input frame)
    target = df['y'].replace({'no': 0, 'yes': 1})
    df = df.drop('y', axis=1)
    # scale the numerical values; describe() only covers the numeric columns
    numeric_cols = df.describe().columns
    df[numeric_cols] = pd.DataFrame(preprocessing.scale(df[numeric_cols]),
                                    index=df.index, columns=numeric_cols)
    # one-hot encode what's left (the categorical columns)
    dummies = pd.get_dummies(df)
    return dummies, target
In [7]:
data, target = dummies_and_scale(sdata)
In [8]:
target.value_counts()
Out[8]:
In [9]:
sdata['y'].value_counts()
Out[9]:
In [10]:
sdata.describe()
Out[10]:
In [11]:
data.describe()
Out[11]:
In [12]:
# Just double checking: for the scaled columns, the mean is basically 0 and the
# std is close to 1 (except for the dummies obviously, which aren't standardized).
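In [ ]:
# A quick programmatic version of that check -- a sketch, assuming the numeric
# column names from sdata.describe() still identify the scaled columns in `data`:
numeric_cols = sdata.describe().columns
data[numeric_cols].describe().loc[['mean', 'std']]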
In [13]:
from sklearn.learning_curve import learning_curve  # pre-0.18 sklearn API
from sklearn import cross_validation
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 10)):
    """
    Generate a simple plot of the test and training learning curve.
    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.
    title : string
        Title for the chart.
    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.
    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.
    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y values plotted.
    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed; see the
        sklearn.cross_validation module for the list of possible objects.
    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")
    plt.legend(loc="best")
    return plt
In [14]:
# Above from: http://scikit-learn.org/stable/auto_examples/plot_learning_curve.html
In [15]:
cv = cross_validation.ShuffleSplit(data.shape[0], n_iter=2,
test_size=0.2, random_state=0)
estimator=KNeighborsClassifier()
plot_learning_curve(estimator, "KNN", data, target, cv=cv, n_jobs=4)
plt.show()
In [16]:
cv = cross_validation.ShuffleSplit(data.shape[0], n_iter=2,
test_size=0.2, random_state=0)
estimator=RandomForestClassifier()
plot_learning_curve(estimator, "Random Forest", data, target, cv=cv, n_jobs=4)
plt.show()
In [17]:
# Let's try again with a different cross validation value:
cv = 2
estimator=KNeighborsClassifier()
plot_learning_curve(estimator, "KNN", data, target, cv=cv, n_jobs=4)
plt.show()
In [20]:
# That is looking better! I'll try the random forest too...
cv = 2
estimator=RandomForestClassifier()
plot_learning_curve(estimator, "Random Forest", data, target, cv=cv)
plt.show()
In [21]:
# OK, after running the larger set as well, it looks like a few features may be dominating the model.
In [22]:
from sklearn.cross_validation import train_test_split
In [25]:
# Peek at the signature first (IPython's ? help):
train_test_split?
In [26]:
x_train, x_test, y_train, y_test = train_test_split(data, target, test_size=.2)
In [27]:
estimator = RandomForestClassifier()
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)
In [28]:
from sklearn.metrics import confusion_matrix
In [29]:
# NB: sklearn's convention is confusion_matrix(y_true, y_pred); with the
# arguments reversed like this, the matrix comes out transposed.
confusion_matrix(y_pred, y_test)
Out[29]:
In [31]:
estimator.feature_importances_.max()
Out[31]:
In [32]:
estimator.feature_importances_.sum()
Out[32]:
In [ ]:
# So for the random forest, the most important single feature carries only ~23% of the total importance.
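In [ ]:
# A sketch of naming that feature: rank the importances against the columns of
# the frame the forest was actually fit on (`data`, the dummy-encoded inputs),
# so names and values line up.
importances = sorted(zip(estimator.feature_importances_, data.columns), reverse=True)
importances[:10]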
In [33]:
# Let's try the KNN too
estimator = KNeighborsClassifier()
estimator.fit(x_train, y_train)
y_pred = estimator.predict(x_test)
confusion_matrix(y_pred, y_test)
Out[33]:
In [36]:
# So the random forest is better, but only slightly.
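In [ ]:
# To put a number on "only slightly" -- a sketch comparing the two models on the
# same held-out split, with F1 as well as accuracy since the classes are imbalanced:
from sklearn.metrics import accuracy_score, f1_score
for clf in [RandomForestClassifier(), KNeighborsClassifier()]:
    pred = clf.fit(x_train, y_train).predict(x_test)
    print("%s: acc=%.3f, f1=%.3f" % (clf.__class__.__name__,
                                     accuracy_score(y_test, pred),
                                     f1_score(y_test, pred)))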
In [38]:
from sklearn.grid_search import GridSearchCV as gs
rf = RandomForestClassifier()
rf_parameters = {'n_estimators': range(10, 50, 10)}
In [40]:
rf.fit(x_train, y_train)
rfg = gs(rf, rf_parameters)
rfg.fit(x_train, y_train)
Out[40]:
In [41]:
rfg.best_params_
Out[41]:
In [42]:
rf = RandomForestClassifier()
rf_parameters = {'n_estimators': range(25, 50, 1)}
rf.fit(x_train, y_train)
rfg = gs(rf, rf_parameters)
rfg.fit(x_train, y_train)
rfg.best_params_
Out[42]:
In [43]:
rfg.best_score_
Out[43]:
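In [ ]:
# best_score_ is the cross-validated score on the training split; a sketch of
# checking the tuned model against the held-out test split instead:
rfg.best_estimator_.score(x_test, y_test)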
In [44]:
# That seems too good? Note the classes are heavily imbalanced (most clients
# don't subscribe), so a high plain-accuracy score isn't saying much by itself.
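In [ ]:
# Sanity check -- the majority-class baseline (a sketch): always predicting the
# most common class already scores this accuracy, so any model should beat it.
1.0 * target.value_counts().max() / len(target)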
In [45]:
from sklearn.metrics import classification_report
In [46]:
classification_report?
In [47]:
classification_report(y_test, y_pred)
Out[47]:
In [48]:
import pprint
In [49]:
print(classification_report(y_test, y_pred))
In [50]:
rf = RandomForestClassifier(n_estimators=32).fit(x_train, y_train)
y_pred = rf.predict(x_test)
print(confusion_matrix(y_test, y_pred))
In [51]:
print classification_report(y_test, y_pred)
In [56]:
# NB: rf was trained on the dummy-encoded `data`, so feature_importances_ lines
# up with data.columns, not sdata.columns -- this zip silently truncates to the
# 21 original names and mislabels the features. Kept as originally run.
features = list(zip(rf.feature_importances_, sdata.columns))
In [61]:
features.sort(reverse=True)
In [62]:
features
Out[62]:
In [69]:
seven = [
(0.23787866999090784, 'job'),
(0.092757495516714999, 'month'),
(0.074543934592155192, 'age'),
(0.058628621282640972, 'day_of_week'),
(0.032383812326831657, 'marital'),
(0.03207739641329347, 'contact'),
(0.030651974427814175, 'education')
]
In [72]:
f7 = [i for i,j in seven]
In [74]:
sum(f7)
Out[74]:
In [ ]:
# So the top 7 cover about 56% of the importances...
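In [ ]:
# A sketch of the cumulative view: sort the importances and cumsum them, which
# makes picking a cutoff easier than eyeballing individual values.
sorted_imps = np.sort(rf.feature_importances_)[::-1]
np.cumsum(sorted_imps)[:10]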
In [1]:
ten = [
(0.23787866999090784, 'job'),
(0.092757495516714999, 'month'),
(0.074543934592155192, 'age'),
(0.058628621282640972, 'day_of_week'),
(0.032383812326831657, 'marital'),
(0.03207739641329347, 'contact'),
(0.030651974427814175, 'education'),
(0.028920355419687741, 'housing'),
(0.027574574885258604, 'loan'),
(0.015489395304804142, 'default'),
(0.014177318492882254, 'duration')
]
f10 = [i for i,j in ten]
sum(f10)
Out[1]:
In [3]:
# So these ten (well, eleven) features add another ~10% or so... it seems hard to know how many I'd actually want to drop.
all_features = [i for i,j in [(0.23787866999090784, 'job'),
(0.092757495516714999, 'month'),
(0.074543934592155192, 'age'),
(0.058628621282640972, 'day_of_week'),
(0.032383812326831657, 'marital'),
(0.03207739641329347, 'contact'),
(0.030651974427814175, 'education'),
(0.028920355419687741, 'housing'),
(0.027574574885258604, 'loan'),
(0.015489395304804142, 'default'),
(0.014177318492882254, 'duration'),
(0.0098731345417975187, 'nr.employed'),
(0.0082567916457971686, 'campaign'),
(0.0072101629135299236, 'emp.var.rate'),
(0.0070700707499528944, 'poutcome'),
(0.0054013289400914599, 'cons.conf.idx'),
(0.0043193754372234001, 'y'),
(0.0037179746568354031, 'euribor3m'),
(0.0033291093715349398, 'cons.price.idx'),
(0.0030211469603017029, 'pdays'),
(0.00207814438331417, 'previous')]]
sum(all_features)
Out[3]:
In [4]:
# So all 21 listed pairs sum to only ~.699 -- consistent with the truncated zip
# noted above: the importances total 1.0 across all the dummy columns, and the
# remaining mass sits in columns beyond the first 21.
In [75]:
# How might we use certain features? It seems that if we can reduce the complexity of the model, it would be good.
# Yet, all those little features add up...
# One idea is to do more grid search on the variables that matter most,
# then see how close you can get to the whole feature list
# Basically, if we can simplify the model by getting rid of irrelevant features, that is ideal.
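In [ ]:
# One way to act on that idea -- a sketch, assuming sklearn >= 0.17 for
# SelectFromModel: keep only the features above mean importance, then re-score
# a forest on the reduced matrix to see what the pruning costs.
from sklearn.feature_selection import SelectFromModel
selector = SelectFromModel(RandomForestClassifier(n_estimators=32), threshold='mean')
x_train_small = selector.fit_transform(x_train, y_train)
x_test_small = selector.transform(x_test)
print(x_train_small.shape)  # how many features survived the cut
small_rf = RandomForestClassifier(n_estimators=32).fit(x_train_small, y_train)
print(classification_report(y_test, small_rf.predict(x_test_small)))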
In [ ]: