HW 3: KNN & Random Forest

Get your data here. The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no'). There are four datasets:

1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010)

2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.

3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).

4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (older version of this dataset with fewer inputs).

The smaller datasets are provided for testing more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).

Assignment

  • Preprocess your data (you may find LabelEncoder useful)
  • Train both KNN and Random Forest models
  • Find the best parameters by computing their learning curve (feel free to verify this with grid search)
  • Create a classification report
  • Inspect your models. Which features are most important? How might you use this information to improve model precision?
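
The steps above can be sketched end to end on a small synthetic frame. This is only an illustration, not a reference solution: the column names `age`, `job`, `duration` and `y` mirror the bank data, but the values are randomly generated, and the parameter choices (`n_neighbors=5`, `n_estimators=50`, the three candidate values of k) are arbitrary assumptions. It also assumes a recent scikit-learn in which `sklearn.model_selection` is available.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split, validation_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

# Synthetic stand-in for the bank data (values are made up for illustration).
rng = np.random.RandomState(0)
n = 300
df = pd.DataFrame({
    'age': rng.randint(18, 80, size=n),
    'job': rng.choice(['admin.', 'services', 'management'], size=n),
    'duration': rng.randint(10, 1000, size=n),
})
# Make the target depend on duration so the models have something to learn.
noise = rng.randint(-200, 200, size=n)
df['y'] = np.where(df['duration'] + noise > 500, 'yes', 'no')

# Preprocess: replace every non-numeric column with integer codes.
encoded = df.copy()
for col in encoded.columns:
    if not pd.api.types.is_numeric_dtype(encoded[col]):
        encoded[col] = LabelEncoder().fit_transform(encoded[col])

X = encoded.drop('y', axis=1)
y = encoded['y']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Train both models.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Cross-validated score for several values of k, to choose n_neighbors.
k_values = [1, 5, 15]
train_scores, cv_scores = validation_curve(
    KNeighborsClassifier(), X_train, y_train,
    param_name='n_neighbors', param_range=k_values, cv=3)
best_k = k_values[int(np.argmax(cv_scores.mean(axis=1)))]

# Classification reports and Random Forest feature importances.
knn_report = classification_report(y_test, knn.predict(X_test))
forest_report = classification_report(y_test, forest.predict(X_test))
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(forest_report)
print(importances.sort_values(ascending=False))
```

KNN has no built-in feature importances, so only the Random Forest is inspected here. Because KNN relies on distances, scaling the numeric columns before fitting would likely matter on the real data.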

In [29]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import Image

# This enables inline Plots
%matplotlib inline

# Load each CSV file into a pandas DataFrame with pd.read_csv
# The files are semicolon-separated, so set delimiter=';'
bank_additional_full = pd.read_csv('../data/bank-additional-full.csv',header=0,index_col=False, delimiter=';')
bank_additional = pd.read_csv('../data/bank-additional.csv',header=0,index_col=False, delimiter=';')
bank_full = pd.read_csv('../data/bank-full.csv',header=0,index_col=False, delimiter=';')
bank = pd.read_csv('../data/bank.csv',header=0,index_col=False, delimiter=';')

In [33]:
bank.head(5)


Out[33]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 30 unemployed married primary no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
1 33 services married secondary no 4789 yes yes cellular 11 may 220 1 339 4 failure no
2 35 management single tertiary no 1350 yes no cellular 16 apr 185 1 330 1 failure no
3 30 management married tertiary no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown no
4 59 blue-collar married secondary no 0 yes no unknown 5 may 226 1 -1 0 unknown no

In [34]:
bank_additional_full.head(5)


Out[34]:
age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 56 housemaid married basic.4y no no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no
1 57 services married high.school unknown no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no
2 37 services married high.school no yes no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no
3 40 admin. married basic.6y no no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no
4 56 services married high.school no no yes telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no

5 rows × 21 columns


In [35]:
bank_additional.head(5)


Out[35]:
age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 30 blue-collar married basic.9y no yes no cellular may fri ... 2 999 0 nonexistent -1.8 92.893 -46.2 1.313 5099.1 no
1 39 services single high.school no no no telephone may fri ... 4 999 0 nonexistent 1.1 93.994 -36.4 4.855 5191.0 no
2 25 services married high.school no yes no telephone jun wed ... 1 999 0 nonexistent 1.4 94.465 -41.8 4.962 5228.1 no
3 38 services married basic.9y no unknown unknown telephone jun fri ... 3 999 0 nonexistent 1.4 94.465 -41.8 4.959 5228.1 no
4 47 admin. married university.degree no yes no cellular nov mon ... 1 999 0 nonexistent -0.1 93.200 -42.0 4.191 5195.8 no

5 rows × 21 columns


In [36]:
bank_full.head(5)


Out[36]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no

In [37]:
bank.head(5)


Out[37]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 30 unemployed married primary no 1787 no no cellular 19 oct 79 1 -1 0 unknown no
1 33 services married secondary no 4789 yes yes cellular 11 may 220 1 339 4 failure no
2 35 management single tertiary no 1350 yes no cellular 16 apr 185 1 330 1 failure no
3 30 management married tertiary no 1476 yes yes unknown 3 jun 199 4 -1 0 unknown no
4 59 blue-collar married secondary no 0 yes no unknown 5 may 226 1 -1 0 unknown no

In [40]:
# bank_additional_full.describe()
# bank_additional.describe()
# bank_full.describe()
# bank.describe()

bank_additional_full.info()
bank_additional.info()
bank_full.info()
bank.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               41188 non-null object
marital           41188 non-null object
education         41188 non-null object
default           41188 non-null object
housing           41188 non-null object
loan              41188 non-null object
contact           41188 non-null object
month             41188 non-null object
day_of_week       41188 non-null object
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null object
emp.var.rate      41188 non-null float64
cons.price.idx    41188 non-null float64
cons.conf.idx     41188 non-null float64
euribor3m         41188 non-null float64
nr.employed       41188 non-null float64
y                 41188 non-null object
dtypes: float64(5), int64(5), object(11)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4119 entries, 0 to 4118
Data columns (total 21 columns):
age               4119 non-null int64
job               4119 non-null object
marital           4119 non-null object
education         4119 non-null object
default           4119 non-null object
housing           4119 non-null object
loan              4119 non-null object
contact           4119 non-null object
month             4119 non-null object
day_of_week       4119 non-null object
duration          4119 non-null int64
campaign          4119 non-null int64
pdays             4119 non-null int64
previous          4119 non-null int64
poutcome          4119 non-null object
emp.var.rate      4119 non-null float64
cons.price.idx    4119 non-null float64
cons.conf.idx     4119 non-null float64
euribor3m         4119 non-null float64
nr.employed       4119 non-null float64
y                 4119 non-null object
dtypes: float64(5), int64(5), object(11)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 45211 entries, 0 to 45210
Data columns (total 17 columns):
age          45211 non-null int64
job          45211 non-null object
marital      45211 non-null object
education    45211 non-null object
default      45211 non-null object
balance      45211 non-null int64
housing      45211 non-null object
loan         45211 non-null object
contact      45211 non-null object
day          45211 non-null int64
month        45211 non-null object
duration     45211 non-null int64
campaign     45211 non-null int64
pdays        45211 non-null int64
previous     45211 non-null int64
poutcome     45211 non-null object
y            45211 non-null object
dtypes: int64(7), object(10)
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4521 entries, 0 to 4520
Data columns (total 17 columns):
age          4521 non-null int64
job          4521 non-null object
marital      4521 non-null object
education    4521 non-null object
default      4521 non-null object
balance      4521 non-null int64
housing      4521 non-null object
loan         4521 non-null object
contact      4521 non-null object
day          4521 non-null int64
month        4521 non-null object
duration     4521 non-null int64
campaign     4521 non-null int64
pdays        4521 non-null int64
previous     4521 non-null int64
poutcome     4521 non-null object
y            4521 non-null object
dtypes: int64(7), object(10)

In [53]:
from sklearn.datasets import load_svmlight_file 
from sklearn import datasets
from sklearn import preprocessing 
from sklearn.preprocessing import LabelEncoder

In [54]:
# X_train, y_train = load_svmlight_file('../data/bank-additional-full.csv',index_col=False, delimiter=';')

In [57]:
datasets.get_data_home()


Out[57]:
'/Users/ChristopherRuiz/scikit_learn_data'

In [60]:
# for col in df:
#     data[col] =

In [68]:
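le = preprocessing.LabelEncoder()
# Note: this fits the encoder on the column *names*, not on the values inside
# each column, so the result is just one integer per column name (0..20)
le.fit(list(bank_additional_full.columns.values))
list(le.classes_)
label_cols = le.transform(le.classes_)
label_cols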

# for col in df:
#     data[col] =


Out[68]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20])

In [69]:
bank_full.columns


Out[69]:
Index([u'age', u'job', u'marital', u'education', u'default', u'balance', u'housing', u'loan', u'contact', u'day', u'month', u'duration', u'campaign', u'pdays', u'previous', u'poutcome', u'y'], dtype='object')

In [70]:
bank_full.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 45211 entries, 0 to 45210
Data columns (total 17 columns):
age          45211 non-null int64
job          45211 non-null object
marital      45211 non-null object
education    45211 non-null object
default      45211 non-null object
balance      45211 non-null int64
housing      45211 non-null object
loan         45211 non-null object
contact      45211 non-null object
day          45211 non-null int64
month        45211 non-null object
duration     45211 non-null int64
campaign     45211 non-null int64
pdays        45211 non-null int64
previous     45211 non-null int64
poutcome     45211 non-null object
y            45211 non-null object
dtypes: int64(7), object(10)

In [129]:
numeric_cols = []
label_cols = []
for col in bank_full.columns:
    if bank_full[col].dtype == 'object':
        label_cols.append(col)
    else:
        numeric_cols.append(col)
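
The same split can also be written with `DataFrame.select_dtypes`, which avoids the explicit loop. A minimal sketch on a made-up three-column frame; the frame `toy` and its values are assumptions for illustration and stand in for `bank_full`:

```python
import pandas as pd

# Toy frame standing in for bank_full (values are illustrative only).
toy = pd.DataFrame({
    'age': [58, 44, 33],
    'job': ['management', 'technician', 'entrepreneur'],
    'y': ['no', 'no', 'no'],
})

# Numeric columns are selected by dtype; every other column is treated
# as a label column that still needs encoding.
numeric_cols = list(toy.select_dtypes(include='number').columns)
label_cols = [col for col in toy.columns if col not in numeric_cols]

print(numeric_cols)
print(label_cols)
```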

In [131]:
bank_full.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 45211 entries, 0 to 45210
Data columns (total 17 columns):
age          45211 non-null int64
job          45211 non-null object
marital      45211 non-null object
education    45211 non-null object
default      45211 non-null object
balance      45211 non-null int64
housing      45211 non-null object
loan         45211 non-null object
contact      45211 non-null object
day          45211 non-null int64
month        45211 non-null object
duration     45211 non-null int64
campaign     45211 non-null int64
pdays        45211 non-null int64
previous     45211 non-null int64
poutcome     45211 non-null object
y            45211 non-null object
dtypes: int64(7), object(10)

In [134]:
# Work on a copy so bank_full itself keeps its original string columns
new_DF = bank_full.copy()
for label in label_cols:
    # fit_transform learns the distinct strings in the column and
    # replaces each value with its integer code
    new_DF[label] = le.fit_transform(new_DF[label])


#     for q in range(len(new_DF)):
#         temp_val = le.inverse_transform(numeric_col[q])
#         print new_DF[label]
#         new_DF[label][q] = le.transform(q)


# le = preprocessing.LabelEncoder()
# le.fit(list(bank_full.columns.values))
# list(le.classes_)
# label_col = le.transform(le.classes_)
# label_col
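
The commented-out scratch code above reaches for `inverse_transform`. One way to keep that option open is to store one fitted encoder per column, since a single `LabelEncoder` only remembers the classes of the last column it was fitted on. A minimal sketch, assuming a toy frame in place of `bank_full`; the names `toy`, `encoders` and `restored` are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two string columns (values are illustrative only).
toy = pd.DataFrame({
    'marital': ['married', 'single', 'married'],
    'housing': ['yes', 'yes', 'no'],
})

# Keep a separate fitted encoder for every column.
encoders = {}
encoded = toy.copy()
for col in toy.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(toy[col])

# LabelEncoder sorts its classes, so codes follow alphabetical order.
print(encoded)

# Map one encoded column back to its original strings.
restored = encoders['marital'].inverse_transform(encoded['marital'])
print(list(restored))
```

Integer codes impose an arbitrary order on categories. That is usually harmless for tree-based models, but for distance-based models such as KNN it may be worth comparing against one-hot encoding.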

In [133]:
bank_full['age'].values


Out[133]:
array([58, 44, 33, ..., 72, 57, 37])

In [97]:
# Print the name of every column that holds string labels
for col in label_cols:
    print(col)
# label_cols[3]


job
marital
education
default
housing
loan
contact
month
poutcome
y
