HW 3: KNN & Random Forest

Get your data here. The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). There are four datasets:

1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010)

2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.

3) bank-full.csv with all examples and 17 columns (16 inputs plus the target y), ordered by date (older version of this dataset with fewer inputs).

4) bank.csv with 10% of the examples and 17 columns (16 inputs plus the target y), randomly selected from 3) (older version of this dataset with fewer inputs).

The smaller datasets are provided for testing more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe to a term deposit (yes/no, variable y).

Assignment

Preprocess your data (you may find LabelEncoder useful)
Train both KNN and Random Forest models
Find the best parameters by computing their learning curve (feel free to verify this with grid search)
Create a classification report
Inspect your models: which features are most important? How might you use this information to improve model precision?
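
One way the steps above could fit together, sketched on a small synthetic table rather than the real CSVs. The column values, the `encoders` dict, and the `n_neighbors` grid are all made up for illustration, and a recent scikit-learn (one that provides `sklearn.model_selection`) is assumed:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for the bank data (hypothetical values, NOT the real CSV).
rng = np.random.RandomState(0)
n = 400
df = pd.DataFrame({
    "age": rng.randint(18, 90, size=n),
    "duration": rng.randint(0, 1000, size=n),
    "marital": rng.choice(["married", "single", "divorced"], size=n),
    "housing": rng.choice(["yes", "no"], size=n),
})
# Tie the target to duration (plus noise) so the models have something to learn.
noise = rng.randint(-200, 200, size=n)
df["y"] = np.where(df["duration"] + noise > 500, "yes", "no")

# 1) Preprocess: integer-encode every non-numeric column, one encoder per column.
encoders = {}
for col in df.columns:
    if not pd.api.types.is_numeric_dtype(df[col]):
        encoders[col] = LabelEncoder()
        df[col] = encoders[col].fit_transform(df[col])

X = df.drop(columns="y")
y = df["y"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# 2) KNN: choose n_neighbors with a cross-validated grid search.
knn_search = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": [1, 3, 5, 7, 9]}, cv=5)
knn_search.fit(X_train, y_train)

# 3) Random forest: fit, then rank features by impurity-based importance.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
importances = pd.Series(
    forest.feature_importances_, index=X.columns).sort_values(ascending=False)

# 4) Classification reports on the held-out split.
target_names = list(encoders["y"].classes_)
knn_report = classification_report(
    y_test, knn_search.predict(X_test), target_names=target_names)
rf_report = classification_report(
    y_test, forest.predict(X_test), target_names=target_names)

print("best n_neighbors:", knn_search.best_params_["n_neighbors"])
print(knn_report)
print(rf_report)
print(importances)
```

KNN is distance-based, so on the real data it would probably help to scale the numeric columns first. The integer codes from LabelEncoder also impose an arbitrary order on categories, which may be worth keeping in mind when reading the KNN results.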



In [29]:

    
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image

# This enables inline Plots
%matplotlib inline

# Load each CSV into a pandas DataFrame; the files are semicolon-delimited,
# so the delimiter is set to ';'
bank_additional_full = pd.read_csv('../data/bank-additional-full.csv',header=0,index_col=False, delimiter=';')
bank_additional = pd.read_csv('../data/bank-additional.csv',header=0,index_col=False, delimiter=';')
bank_full = pd.read_csv('../data/bank-full.csv',header=0,index_col=False, delimiter=';')
bank = pd.read_csv('../data/bank.csv',header=0,index_col=False, delimiter=';')



In [34]:

    
bank_additional_full.head(5)









    Out[34]:

   age        job  marital    education  default  housing  loan    contact  month  day_of_week  ...  campaign  pdays  previous     poutcome  emp.var.rate  cons.price.idx  cons.conf.idx  euribor3m  nr.employed   y
0   56  housemaid  married     basic.4y       no       no    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857         5191  no
1   57   services  married  high.school  unknown       no    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857         5191  no
2   37   services  married  high.school       no      yes    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857         5191  no
3   40     admin.  married     basic.6y       no       no    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857         5191  no
4   56   services  married  high.school       no       no   yes  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857         5191  no

5 rows × 21 columns



In [35]:

    
bank_additional.head(5)









    Out[35]:

   age          job  marital          education  default  housing     loan    contact  month  day_of_week  ...  campaign  pdays  previous     poutcome  emp.var.rate  cons.price.idx  cons.conf.idx  euribor3m  nr.employed   y
0   30  blue-collar  married           basic.9y       no      yes       no   cellular    may          fri  ...         2    999         0  nonexistent          -1.8          92.893          -46.2      1.313       5099.1  no
1   39     services   single        high.school       no       no       no  telephone    may          fri  ...         4    999         0  nonexistent           1.1          93.994          -36.4      4.855       5191.0  no
2   25     services  married        high.school       no      yes       no  telephone    jun          wed  ...         1    999         0  nonexistent           1.4          94.465          -41.8      4.962       5228.1  no
3   38     services  married           basic.9y       no  unknown  unknown  telephone    jun          fri  ...         3    999         0  nonexistent           1.4          94.465          -41.8      4.959       5228.1  no
4   47       admin.  married  university.degree       no      yes       no   cellular    nov          mon  ...         1    999         0  nonexistent          -0.1          93.200          -42.0      4.191       5195.8  no

5 rows × 21 columns



In [36]:

    
bank_full.head(5)









    Out[36]:

   age           job  marital  education  default  balance  housing  loan  contact  day  month  duration  campaign  pdays  previous  poutcome   y
0   58    management  married   tertiary       no     2143      yes    no  unknown    5    may       261         1     -1         0   unknown  no
1   44    technician   single  secondary       no       29      yes    no  unknown    5    may       151         1     -1         0   unknown  no
2   33  entrepreneur  married  secondary       no        2      yes   yes  unknown    5    may        76         1     -1         0   unknown  no
3   47   blue-collar  married    unknown       no     1506      yes    no  unknown    5    may        92         1     -1         0   unknown  no
4   33       unknown   single    unknown       no        1       no    no  unknown    5    may       198         1     -1         0   unknown  no



In [37]:

    
bank.head(5)









    Out[37]:

   age          job  marital  education  default  balance  housing  loan   contact  day  month  duration  campaign  pdays  previous  poutcome   y
0   30   unemployed  married    primary       no     1787       no    no  cellular   19    oct        79         1     -1         0   unknown  no
1   33     services  married  secondary       no     4789      yes   yes  cellular   11    may       220         1    339         4   failure  no
2   35   management   single   tertiary       no     1350      yes    no  cellular   16    apr       185         1    330         1   failure  no
3   30   management  married   tertiary       no     1476      yes   yes   unknown    3    jun       199         4     -1         0   unknown  no
4   59  blue-collar  married  secondary       no        0      yes    no   unknown    5    may       226         1     -1         0   unknown  no



In [40]:

    
# bank_additional_full.describe()
# bank_additional.describe()
# bank_full.describe()
# bank.describe()

bank_additional_full.info()
bank_additional.info()
bank_full.info()
bank.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               41188 non-null object
marital           41188 non-null object
education         41188 non-null object
default           41188 non-null object
housing           41188 non-null object
loan              41188 non-null object
contact           41188 non-null object
month             41188 non-null object
day_of_week       41188 non-null object
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null object
emp.var.rate      41188 non-null float64
cons.price.idx    41188 non-null float64
cons.conf.idx     41188 non-null float64
euribor3m         41188 non-null float64
nr.employed       41188 non-null float64
y                 41188 non-null object
dtypes: float64(5), int64(5), object(11)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4119 entries, 0 to 4118
Data columns (total 21 columns):
age               4119 non-null int64
job               4119 non-null object
marital           4119 non-null object
education         4119 non-null object
default           4119 non-null object
housing           4119 non-null object
loan              4119 non-null object
contact           4119 non-null object
month             4119 non-null object
day_of_week       4119 non-null object
duration          4119 non-null int64
campaign          4119 non-null int64
pdays             4119 non-null int64
previous          4119 non-null int64
poutcome          4119 non-null object
emp.var.rate      4119 non-null float64
cons.price.idx    4119 non-null float64
cons.conf.idx     4119 non-null float64
euribor3m         4119 non-null float64
nr.employed       4119 non-null float64
y                 4119 non-null object
dtypes: float64(5), int64(5), object(11)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 45211 entries, 0 to 45210
Data columns (total 17 columns):
age          45211 non-null int64
job          45211 non-null object
marital      45211 non-null object
education    45211 non-null object
default      45211 non-null object
balance      45211 non-null int64
housing      45211 non-null object
loan         45211 non-null object
contact      45211 non-null object
day          45211 non-null int64
month        45211 non-null object
duration     45211 non-null int64
campaign     45211 non-null int64
pdays        45211 non-null int64
previous     45211 non-null int64
poutcome     45211 non-null object
y            45211 non-null object
dtypes: int64(7), object(10)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4521 entries, 0 to 4520
Data columns (total 17 columns):
age          4521 non-null int64
job          4521 non-null object
marital      4521 non-null object
education    4521 non-null object
default      4521 non-null object
balance      4521 non-null int64
housing      4521 non-null object
loan         4521 non-null object
contact      4521 non-null object
day          4521 non-null int64
month        4521 non-null object
duration     4521 non-null int64
campaign     4521 non-null int64
pdays        4521 non-null int64
previous     4521 non-null int64
poutcome     4521 non-null object
y            4521 non-null object
dtypes: int64(7), object(10)



In [53]:

    
from sklearn.datasets import load_svmlight_file 
from sklearn import datasets
from sklearn import preprocessing 
from sklearn.preprocessing import LabelEncoder



In [54]:

    
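# Not used: load_svmlight_file reads svmlight/libsvm-format files and has no
# index_col or delimiter arguments, so it cannot load these CSVs
# (pd.read_csv is used for that above).
# X_train, y_train = load_svmlight_file('../data/bank-additional-full.csv',index_col=False, delimiter=';')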



In [57]:

    
datasets.get_data_home()









    Out[57]:





'/Users/ChristopherRuiz/scikit_learn_data'



In [60]:

    
# for col in df:
#     data[col] =



In [68]:

    
# Note: this fits the encoder on the column *names*, not on the values inside
# each column, so the result is just one integer per column name
le = preprocessing.LabelEncoder()
le.fit(list(bank_additional_full.columns.values))
label_cols = le.transform(le.classes_)
label_cols









    Out[68]:





array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20])



In [69]:

    
bank_full.columns









    Out[69]:





Index([u'age', u'job', u'marital', u'education', u'default', u'balance', u'housing', u'loan', u'contact', u'day', u'month', u'duration', u'campaign', u'pdays', u'previous', u'poutcome', u'y'], dtype='object')



In [70]:

    
bank_full.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 45211 entries, 0 to 45210
Data columns (total 17 columns):
age          45211 non-null int64
job          45211 non-null object
marital      45211 non-null object
education    45211 non-null object
default      45211 non-null object
balance      45211 non-null int64
housing      45211 non-null object
loan         45211 non-null object
contact      45211 non-null object
day          45211 non-null int64
month        45211 non-null object
duration     45211 non-null int64
campaign     45211 non-null int64
pdays        45211 non-null int64
previous     45211 non-null int64
poutcome     45211 non-null object
y            45211 non-null object
dtypes: int64(7), object(10)



In [129]:

    
numeric_cols = []
label_cols = []
for col in bank_full.columns:
    if bank_full[col].dtype == 'object':
        label_cols.append(col)
    else:
        numeric_cols.append(col)
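
The same split can probably be written more compactly with `DataFrame.select_dtypes`. A minimal sketch on a three-row frame whose values are copied from the `bank_full.head()` output (the frame `df` itself is a made-up stand-in, not the loaded CSV):

```python
import pandas as pd

# Tiny stand-in frame with the same mix of dtypes as bank_full.
df = pd.DataFrame({
    "age": [58, 44, 33],
    "job": ["management", "technician", "entrepreneur"],
    "balance": [2143, 29, 2],
    "y": ["no", "no", "no"],
})

# Numeric columns stay as they are; everything else needs encoding.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
label_cols = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(label_cols)
```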



In [134]:

    
# Work on a copy so the original bank_full stays untouched
new_DF = bank_full.copy()

# Integer-encode each string column, keeping one fitted encoder per column
# so the codes can be mapped back later with inverse_transform
encoders = {}
for label in label_cols:
    encoders[label] = preprocessing.LabelEncoder()
    new_DF[label] = encoders[label].fit_transform(new_DF[label])
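
As a quick check of what LabelEncoder does to a single column, here is a small round trip on the five `job` values from `bank_full.head()`. The variable names `jobs`, `encoder`, `codes`, and `decoded` are only illustrative:

```python
from sklearn.preprocessing import LabelEncoder

# The five job values shown in bank_full.head(5).
jobs = ["management", "technician", "entrepreneur", "blue-collar", "unknown"]

encoder = LabelEncoder()
codes = encoder.fit_transform(jobs)          # classes are sorted alphabetically
decoded = encoder.inverse_transform(codes)   # map the integers back to strings

print(list(encoder.classes_))
print(list(codes))
print(list(decoded))
```

Because the classes are sorted before codes are assigned, the integers reflect alphabetical order, not any meaningful ranking of the categories.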



In [133]:

    
bank_full['age'].values









    Out[133]:





array([58, 44, 33, ..., 72, 57, 37])



In [97]:

    
# List the string-valued columns that still need encoding
for col in label_cols:
    print(col)









    



job
marital
education
default
housing
loan
contact
month
poutcome
y



In [ ]:

	age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	previous	poutcome	y
0	30	unemployed	married	primary	no	1787	no	no	cellular	19	oct	79	1	-1	0	unknown	no
1	33	services	married	secondary	no	4789	yes	yes	cellular	11	may	220	1	339	4	failure	no
2	35	management	single	tertiary	no	1350	yes	no	cellular	16	apr	185	1	330	1	failure	no
3	30	management	married	tertiary	no	1476	yes	yes	unknown	3	jun	199	4	-1	0	unknown	no
4	59	blue-collar	married	secondary	no	0	yes	no	unknown	5	may	226	1	-1	0	unknown	no

	age	job	marital	education	default	housing	loan	contact	month	day_of_week	...	campaign	pdays	poutcome	emp.var.rate	cons.price.idx	cons.conf.idx	euribor3m	nr.employed	y
0	56	housemaid	married	basic.4y	no	no	no	telephone	may	mon	...	1	999	nonexistent	1.1	93.994	-36.4	4.857	5191	no
1	57	services	married	high.school	unknown	no	no	telephone	may	mon	...	1	999	nonexistent	1.1	93.994	-36.4	4.857	5191	no
2	37	services	married	high.school	no	yes	no	telephone	may	mon	...	1	999	nonexistent	1.1	93.994	-36.4	4.857	5191	no
3	40	admin.	married	basic.6y	no	no	no	telephone	may	mon	...	1	999	nonexistent	1.1	93.994	-36.4	4.857	5191	no
4	56	services	married	high.school	no	no	yes	telephone	may	mon	...	1	999	nonexistent	1.1	93.994	-36.4	4.857	5191	no

	age	job	marital	education	default	housing	loan	contact	month	day_of_week	...	campaign	pdays	poutcome	emp.var.rate	cons.price.idx	cons.conf.idx	euribor3m	nr.employed	y
0	30	blue-collar	married	basic.9y	no	yes	no	cellular	may	fri	...	2	999	nonexistent	-1.8	92.893	-46.2	1.313	5099.1	no
1	39	services	single	high.school	no	no	no	telephone	may	fri	...	4	999	nonexistent	1.1	93.994	-36.4	4.855	5191.0	no
2	25	services	married	high.school	no	yes	no	telephone	jun	wed	...	1	999	nonexistent	1.4	94.465	-41.8	4.962	5228.1	no
3	38	services	married	basic.9y	no	unknown	unknown	telephone	jun	fri	...	3	999	nonexistent	1.4	94.465	-41.8	4.959	5228.1	no
4	47	admin.	married	university.degree	no	yes	no	cellular	nov	mon	...	1	999	nonexistent	-0.1	93.200	-42.0	4.191	5195.8	no