HW 3: KNN & Random Forest

Get your data here. The data relates to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls; often, more than one contact with the same client was required in order to determine whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). There are four datasets:

1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010)

2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.

3) bank-full.csv with all examples and 17 inputs, ordered by date (an older version of this dataset with fewer inputs).

4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (an older version of this dataset with fewer inputs).

The smaller datasets are provided for testing more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).

Assignment

  • Preprocess your data (you may find LabelEncoder useful)
  • Train both KNN and Random Forest models
  • Find the best parameters by computing their learning curves (feel free to verify this with grid search)
  • Create a classification report
  • Inspect your models: which features are most important? How might you use this information to improve model precision?
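Since the assignment points at LabelEncoder, here is a minimal sketch of what it does to a single categorical column (the toy values below are illustrative, not taken from the dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Encode a toy yes/no column; the fitted classes_ are stored in sorted order
le = LabelEncoder()
codes = le.fit_transform(["yes", "no", "no", "yes"])
print(list(codes))        # → [1, 0, 0, 1]
print(list(le.classes_))  # → ['no', 'yes']
```

`le.inverse_transform` maps the integer codes back to the original string labels.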

In [53]:
import pandas as pd
import numpy as np

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import learning_curve

import matplotlib as mpl
import matplotlib.pyplot as plt

%matplotlib inline

In [21]:
df = pd.read_csv('bank-full.csv', sep =";")

In [22]:
df.head()


Out[22]:
age job marital education default balance housing loan contact day month duration campaign pdays previous poutcome y
0 58 management married tertiary no 2143 yes no unknown 5 may 261 1 -1 0 unknown no
1 44 technician single secondary no 29 yes no unknown 5 may 151 1 -1 0 unknown no
2 33 entrepreneur married secondary no 2 yes yes unknown 5 may 76 1 -1 0 unknown no
3 47 blue-collar married unknown no 1506 yes no unknown 5 may 92 1 -1 0 unknown no
4 33 unknown single unknown no 1 no no unknown 5 may 198 1 -1 0 unknown no

In [23]:
df.shape


Out[23]:
(45211, 17)

In [24]:
df.describe()


Out[24]:
age balance day duration campaign pdays previous
count 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000 45211.000000
mean 40.936210 1362.272058 15.806419 258.163080 2.763841 40.197828 0.580323
std 10.618762 3044.765829 8.322476 257.527812 3.098021 100.128746 2.303441
min 18.000000 -8019.000000 1.000000 0.000000 1.000000 -1.000000 0.000000
25% 33.000000 72.000000 8.000000 103.000000 1.000000 -1.000000 0.000000
50% 39.000000 448.000000 16.000000 180.000000 2.000000 -1.000000 0.000000
75% 48.000000 1428.000000 21.000000 319.000000 3.000000 -1.000000 0.000000
max 95.000000 102127.000000 31.000000 4918.000000 63.000000 871.000000 275.000000

In [25]:
# Scan every column for the '?' placeholder sometimes used for missing values
for i in df:
    for x in df[i]:
        if x == '?':
            print(i)
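The element-wise scan above can also be written as a single vectorized comparison; a sketch on a toy frame (the columns here are hypothetical, standing in for `df`):

```python
import pandas as pd

toy = pd.DataFrame({"job": ["admin.", "?", "technician"], "age": [30, 41, 29]})

# Vectorized check: which columns contain the '?' placeholder at all
has_placeholder = (toy == "?").any()
flagged = list(has_placeholder[has_placeholder].index)
print(flagged)  # → ['job']
```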

In [26]:
df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 45211 entries, 0 to 45210
Data columns (total 17 columns):
age          45211 non-null int64
job          45211 non-null object
marital      45211 non-null object
education    45211 non-null object
default      45211 non-null object
balance      45211 non-null int64
housing      45211 non-null object
loan         45211 non-null object
contact      45211 non-null object
day          45211 non-null int64
month        45211 non-null object
duration     45211 non-null int64
campaign     45211 non-null int64
pdays        45211 non-null int64
previous     45211 non-null int64
poutcome     45211 non-null object
y            45211 non-null object
dtypes: int64(7), object(10)
memory usage: 6.2+ MB

In [27]:
df.columns


Out[27]:
Index([u'age', u'job', u'marital', u'education', u'default', u'balance', u'housing', u'loan', u'contact', u'day', u'month', u'duration', u'campaign', u'pdays', u'previous', u'poutcome', u'y'], dtype='object')

In [28]:
bank_data = pd.DataFrame()
label_encoders = {}

for column in df.columns:
    if df[column].dtype == 'object':
        label_encoders[column] = preprocessing.LabelEncoder()
        bank_data[column] = label_encoders[column].fit_transform(df[column])
    else:
        bank_data[column] = df[column]
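One caveat worth noting: LabelEncoder imposes an arbitrary integer ordering on each category. Tree ensembles such as Random Forest can split around this, but distance-based KNN will treat the codes as magnitudes. One-hot encoding with `pd.get_dummies` is a common alternative; a toy sketch:

```python
import pandas as pd

# One 0/1 column per category instead of one arbitrary integer code
toy = pd.DataFrame({"marital": ["married", "single", "married"]})
onehot = pd.get_dummies(toy, columns=["marital"])
print(list(onehot.columns))  # → ['marital_married', 'marital_single']
```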

In [29]:
bank_data.shape


Out[29]:
(45211, 17)

In [37]:
# Drop the duration column: call duration is only known after the call ends,
# so it leaks the outcome and is useless for realistic prediction
bank_data.drop(['duration'], axis=1, inplace=True)
bank_data.head()


Out[37]:
age job marital education default balance housing loan contact day month campaign pdays previous poutcome y
0 58 4 1 2 0 2143 1 0 2 5 8 1 -1 0 3 0
1 44 9 2 1 0 29 1 0 2 5 8 1 -1 0 3 0
2 33 2 1 1 0 2 1 1 2 5 8 1 -1 0 3 0
3 47 1 1 3 0 1506 1 0 2 5 8 1 -1 0 3 0
4 33 11 2 3 0 1 0 0 2 5 8 1 -1 0 3 0

In [38]:
bank_data.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 45211 entries, 0 to 45210
Data columns (total 16 columns):
age          45211 non-null int64
job          45211 non-null int64
marital      45211 non-null int64
education    45211 non-null int64
default      45211 non-null int64
balance      45211 non-null int64
housing      45211 non-null int64
loan         45211 non-null int64
contact      45211 non-null int64
day          45211 non-null int64
month        45211 non-null int64
campaign     45211 non-null int64
pdays        45211 non-null int64
previous     45211 non-null int64
poutcome     45211 non-null int64
y            45211 non-null int64
dtypes: int64(16)
memory usage: 5.9 MB

In [42]:
X = bank_data.iloc[:, 0:15]
X.head()


Out[42]:
age job marital education default balance housing loan contact day month campaign pdays previous poutcome
0 58 4 1 2 0 2143 1 0 2 5 8 1 -1 0 3
1 44 9 2 1 0 29 1 0 2 5 8 1 -1 0 3
2 33 2 1 1 0 2 1 1 2 5 8 1 -1 0 3
3 47 1 1 3 0 1506 1 0 2 5 8 1 -1 0 3
4 33 11 2 3 0 1 0 0 2 5 8 1 -1 0 3

In [44]:
y=bank_data['y']
y.head()


Out[44]:
0    0
1    0
2    0
3    0
4    0
Name: y, dtype: int64

In [70]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
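Only about 12% of the examples are 'yes', so a plain random split can drift from that ratio. Passing `stratify` keeps the class proportions identical in both splits; a sketch on synthetic labels (`X_toy`/`y_toy` stand in for `X`/`y`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.array([0] * 16 + [1] * 4)  # 20% positives, like an imbalanced target

# stratify preserves the positive rate in both the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0)
print(y_te.mean())  # → 0.2
```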

In [71]:
knn_model = KNeighborsClassifier()

In [72]:
knn_model.fit(X_train,y_train)


Out[72]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=5, p=2, weights='uniform')

In [73]:
# Use the model to predict Test Data
y_pred = knn_model.predict(X_test)

In [74]:
print(classification_report(y_test, y_pred))


             precision    recall  f1-score   support

          0       0.89      0.98      0.93     11994
          1       0.40      0.10      0.16      1570

avg / total       0.84      0.88      0.85     13564
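The weak recall on class 1 is partly a scaling issue: KNN ranks neighbors by Euclidean distance, so a wide-range column like balance swamps binary columns like loan. Standardizing inside a pipeline is the usual remedy; a sketch on synthetic data (`X_demo`/`y_demo` stand in for the bank features):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=200, random_state=0)

# Scale features to zero mean / unit variance before the distance computation
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scaled_knn.fit(X_demo, y_demo)
print(scaled_knn.score(X_demo, y_demo))  # training accuracy
```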


In [75]:
rf_model = RandomForestClassifier(n_estimators = 100, max_features="sqrt")

In [76]:
rf_model.fit(X_train, y_train)


Out[76]:
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='sqrt',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)

In [77]:
# Use the model to predict Test Data
y_pred = rf_model.predict(X_test)

In [78]:
print(classification_report(y_test, y_pred))


             precision    recall  f1-score   support

          0       0.91      0.98      0.94     11994
          1       0.62      0.21      0.32      1570

avg / total       0.87      0.89      0.87     13564


In [79]:
# Rank features by Random Forest importance (rf_model is already fitted above)
sorted(zip(rf_model.feature_importances_, X.columns), reverse=True)[:20]

# The 5 most important features are 'balance', 'age', 'day', 'month', 'job'


Out[79]:
[(0.19323735491245256, 'balance'),
 (0.15967526627767437, 'age'),
 (0.13417391043918392, 'day'),
 (0.10242793065310464, 'month'),
 (0.074784685082148347, 'job'),
 (0.061966017660057229, 'campaign'),
 (0.060922390737049199, 'pdays'),
 (0.051942030064884036, 'poutcome'),
 (0.038920126396893413, 'education'),
 (0.029115826374942488, 'previous'),
 (0.02773562292594384, 'marital'),
 (0.026633267252170092, 'housing'),
 (0.023555789966148641, 'contact'),
 (0.01236637542825819, 'loan'),
 (0.0025434058290890894, 'default')]
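One way to act on these importances, as the assignment asks: refit on only the features whose importance clears a threshold. `SelectFromModel` automates this; a sketch on synthetic data (the shapes here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=3, random_state=0)

# Keep only features whose importance is at least the mean importance
selector = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0),
                           threshold="mean").fit(X_demo, y_demo)
X_reduced = selector.transform(X_demo)
print(X_reduced.shape)  # fewer than the original 10 columns
```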

In [64]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std, train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std, test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")

    plt.legend(loc="best")
    return plt

In [65]:
%%time
_ = plot_learning_curve(RandomForestClassifier(n_estimators=100), 'Random Forest learning curve', X_train, y_train)


CPU times: user 23.6 s, sys: 370 ms, total: 24 s
Wall time: 24 s

In [84]:
%%time
_ = plot_learning_curve(KNeighborsClassifier(), 'KNN learning curve', X_train, y_train)


CPU times: user 5.39 s, sys: 54.6 ms, total: 5.44 s
Wall time: 5.43 s
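The assignment also suggests cross-checking the learning-curve picks with a grid search. A sketch with GridSearchCV over `n_neighbors` on synthetic data (the parameter grid is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=200, random_state=0)

# Exhaustively score each candidate k with 3-fold cross-validation
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [3, 5, 7, 9]},
                    cv=3)
grid.fit(X_demo, y_demo)
print(grid.best_params_)
```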
