HW 3: KNN & Random Forest

Get your data here. The data relates to direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls. Often, more than one contact with the same client was required to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). There are four datasets:

1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010)

2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.

3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).

4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (older version of this dataset with fewer inputs).

The smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).

Assignment

Preprocess your data (you may find LabelEncoder useful)
Train both KNN and Random Forest models
Find the best parameters by computing their learning curve (feel free to verify this with grid search)
Create a classification report
Inspect your models. Which features are most important? How might you use this information to improve model precision?



In [1]:

    
#import needed packages
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns

from sklearn import preprocessing
%matplotlib inline



In [2]:

    
bank_full=pd.read_csv('../data/bank/bank-full.csv', sep=';')



In [68]:

    
bank=pd.read_csv('../data/bank/bank.csv', sep=';')



In [3]:

    
bank_additional_full=pd.read_csv('../data/bank-additional/bank-additional-full.csv', sep=';')



In [49]:

    
bank_additional=pd.read_csv('../data/bank-additional/bank-additional.csv', sep=';')



In [4]:

    
bank_additional_full.head(5)









    Out[4]:






  
    
      
   age        job  marital    education  default  housing  loan    contact  month  day_of_week  ...  campaign  pdays  previous     poutcome  emp.var.rate  cons.price.idx  cons.conf.idx  euribor3m  nr.employed   y
0   56  housemaid  married     basic.4y       no       no    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857         5191  no
1   57   services  married  high.school  unknown       no    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857         5191  no
2   37   services  married  high.school       no      yes    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857         5191  no
3   40     admin.  married     basic.6y       no       no    no  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857         5191  no
4   56   services  married  high.school       no       no   yes  telephone    may          mon  ...         1    999         0  nonexistent           1.1          93.994          -36.4      4.857         5191  no
    
  

5 rows × 21 columns



In [5]:

    
bank_additional_full.info()









    



<class 'pandas.core.frame.DataFrame'>
Int64Index: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               41188 non-null object
marital           41188 non-null object
education         41188 non-null object
default           41188 non-null object
housing           41188 non-null object
loan              41188 non-null object
contact           41188 non-null object
month             41188 non-null object
day_of_week       41188 non-null object
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null object
emp.var.rate      41188 non-null float64
cons.price.idx    41188 non-null float64
cons.conf.idx     41188 non-null float64
euribor3m         41188 non-null float64
nr.employed       41188 non-null float64
y                 41188 non-null object
dtypes: float64(5), int64(5), object(11)
memory usage: 6.9+ MB



In [6]:

    
le = preprocessing.LabelEncoder()
le.fit(bank_additional_full.y)









    Out[6]:





LabelEncoder()



In [7]:

    
le.classes_









    Out[7]:





array(['no', 'yes'], dtype=object)



In [8]:

    
le.transform(bank_additional_full.y)









    Out[8]:





array([0, 0, 0, ..., 0, 1, 0])



In [9]:

    
bank_additional_full.y=le.transform(bank_additional_full.y)



In [10]:

    
bank_additional_full.y









    Out[10]:





0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
...
41173    1
41174    1
41175    0
41176    0
41177    0
41178    1
41179    0
41180    0
41181    1
41182    0
41183    1
41184    0
41185    0
41186    1
41187    0
Name: y, Length: 41188, dtype: int64
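The encoding step above can be checked on a small example. The sketch below uses a made-up five-element array (an assumption, not the bank data) to show that LabelEncoder sorts the classes, so 'no' maps to 0 and 'yes' to 1, and that inverse_transform recovers the original strings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy target column standing in for bank_additional_full.y (made-up values)
labels = np.array(["no", "yes", "no", "no", "yes"])

encoder = LabelEncoder()
# Classes are stored in sorted order, so 'no' becomes 0 and 'yes' becomes 1
encoded = encoder.fit_transform(labels)

# inverse_transform maps the integer codes back to the original strings
decoded = encoder.inverse_transform(encoded)

print(encoded)
print(decoded)
```

Keeping the fitted encoder around is what makes it possible to report predictions as 'yes'/'no' again later.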



In [11]:

    
#check for unique values of features
bank_additional_full.marital.unique()









    Out[11]:





array(['married', 'single', 'divorced', 'unknown'], dtype=object)



In [12]:

    
# scatter_matrix lives in pandas.plotting in current pandas releases
pd.plotting.scatter_matrix(bank_additional_full[['age','campaign','pdays','duration','previous','y','emp.var.rate']],figsize=(20,20))









    Out[12]:





[array of 49 matplotlib AxesSubplot objects: a 7 x 7 scatter-matrix grid over age, campaign, pdays, duration, previous, y, emp.var.rate]



In [13]:

    
# .ix has been removed from pandas; .loc performs the same label-based, inclusive slice
X = bank_additional_full.loc[:,'age':'poutcome']
y = bank_additional_full.y



In [14]:

    
X_data=pd.get_dummies(X)
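To see what get_dummies does to the feature frame, here is a minimal sketch on a made-up three-row frame (the values are assumptions for illustration). Numeric columns pass through unchanged, and each categorical column expands into one indicator column per level. The optional drop_first=True drops one level per category, which removes redundant pairs such as a yes/no column and its complement.

```python
import pandas as pd

# Made-up miniature of the bank features: one numeric and two categorical columns
toy = pd.DataFrame({
    "age": [56, 57, 37],
    "marital": ["married", "single", "married"],
    "contact": ["telephone", "cellular", "telephone"],
})

# Default behaviour: one indicator column per category level
dummies = pd.get_dummies(toy)
print(list(dummies.columns))

# drop_first=True keeps k-1 indicators for a category with k levels
reduced = pd.get_dummies(toy, drop_first=True)
print(list(reduced.columns))
```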



In [15]:

    
#Split the Data
# train_test_split moved to sklearn.model_selection; sklearn.cross_validation no longer exists
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_data, y, test_size=0.3)
print(X_train.shape, X_test.shape)









    



(28831, 58) (12357, 58)
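The split above is random and unseeded, so the numbers in this notebook change from run to run. Because only roughly 11% of the rows are positive, it may also help to stratify on the target. The sketch below shows both options on synthetic data (the arrays X_toy and y_toy are made up for illustration): random_state fixes the shuffle, and stratify keeps the class ratio the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with 11% positives (made-up data)
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(1000, 4))
y_toy = np.array([1] * 110 + [0] * 890)

# stratify=y_toy preserves the positive rate; random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42)

print(X_tr.shape, X_te.shape)
print("positive rate train/test:", y_tr.mean(), y_te.mean())
```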

KNN Model



In [16]:

    
from sklearn.neighbors import KNeighborsClassifier

# Instantiate the estimator
clf=KNeighborsClassifier(algorithm='brute')

# Fit the estimator to the Training Data
clf.fit(X_train, y_train)

# Use the model to predict Test Data
y_pred=clf.predict(X_test)



In [17]:

    
from sklearn import metrics

def plot_confusion_matrix(y_true, y_pred):
    # confusion_matrix puts true classes on rows and predicted classes on columns
    plt.imshow(metrics.confusion_matrix(y_true, y_pred),
               cmap=plt.cm.binary, interpolation='nearest')
    plt.colorbar()
    plt.xlabel('predicted value')
    plt.ylabel('true value')
    
print("classification accuracy:", metrics.accuracy_score(y_test, y_pred))
plot_confusion_matrix(y_test, y_pred)









    



classification accuracy: 0.9005422028



In [18]:

    
print("accuracy:", metrics.accuracy_score(y_test, y_pred))
print("precision:", metrics.precision_score(y_test, y_pred))
print("recall:", metrics.recall_score(y_test, y_pred))
print("f1 score:", metrics.f1_score(y_test, y_pred))









    



accuracy: 0.9005422028
precision: 0.557303370787
recall: 0.372652141247
f1 score: 0.44664565511

Random Forest Model



In [19]:

    
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier



In [20]:

    
#plot learning curve
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):

    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt



In [21]:

    
rf_model = RandomForestClassifier(n_estimators=100,max_depth=15,criterion='entropy')
rf_model.fit(X_train,y_train)









    Out[21]:





RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='entropy', max_depth=15, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)

Grid Search



In [23]:

    
from sklearn.model_selection import GridSearchCV



In [24]:

    
param= {'n_estimators':np.arange(50,200,50), 'max_depth':np.arange(5,25,5)}



In [25]:

    
gs=GridSearchCV(RandomForestClassifier(),param)



In [26]:

    
gs.fit(X_train, y_train)









    Out[26]:





GridSearchCV(cv=None,
       estimator=RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'n_estimators': array([ 50, 100, 150]), 'max_depth': array([ 5, 10, 15, 20])},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)



In [27]:

    
gs.best_params_, gs.best_score_









    Out[27]:





({'max_depth': 20, 'n_estimators': 50}, 0.90589989941382543)
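Two things stand out in the result above: the best max_depth (20) sits at the upper edge of the searched range, and the default score is accuracy, which is already high for an imbalanced target. A possible follow-up, sketched here on synthetic data from make_classification (an assumption, not the bank data), is to widen the depth range and select on F1 instead. The grid values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data, about 11% positives (made-up stand-in)
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, weights=[0.89, 0.11], random_state=0)

# None lets trees grow fully, so the search is not capped at the largest listed depth
param_grid = {"n_estimators": [25, 50], "max_depth": [5, 10, None]}

# scoring="f1" ranks candidates by F1 on the positive class rather than accuracy
search = GridSearchCV(
    RandomForestClassifier(random_state=0), param_grid, scoring="f1", cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_, search.best_score_)
```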

Learning Curves



In [28]:

    
_ = plot_learning_curve(RandomForestClassifier(n_estimators=50),'Random Forest (n_estimators=50)',X_train,y_train)



In [25]:

    
_ = plot_learning_curve(KNeighborsClassifier(algorithm='brute'),'KNN (brute force)',X_train,y_train)



In [26]:

    
_ = plot_learning_curve(KNeighborsClassifier(algorithm='kd_tree'),'KNN (kd_tree)',X_train,y_train)
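The learning curves above vary the training-set size, while the number of neighbors stays at its default. To tune n_neighbors itself, one option is scikit-learn's validation_curve, which scores a model across a range of values for a single parameter. The sketch below uses synthetic data and an arbitrary list of k values, both of which are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)

# Candidate neighbor counts (illustrative choice)
k_values = np.array([1, 5, 15, 31])

# Each returned array has one row per k and one column per CV fold
train_scores, cv_scores = validation_curve(
    KNeighborsClassifier(), X_toy, y_toy,
    param_name="n_neighbors", param_range=k_values, cv=5)

mean_cv = cv_scores.mean(axis=1)
best_k = k_values[mean_cv.argmax()]
print("mean CV accuracy per k:", dict(zip(k_values.tolist(), mean_cv.round(3))))
print("best k by CV accuracy:", best_k)
```

With k=1 the training score is perfect because each point is its own nearest neighbor, so only the cross-validation row is useful for picking k.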



In [30]:

    
rf_model = RandomForestClassifier(n_estimators=50,max_depth=25,criterion='entropy')
rf_model.fit(X_train,y_train)









    Out[30]:





RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='entropy', max_depth=25, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)



In [31]:

    
y_pred = rf_model.predict(X_test)



In [32]:

    
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(y_pred, y):
    # confusion_matrix(y, y_pred): rows are true classes, columns are predicted classes
    plt.imshow(confusion_matrix(y, y_pred),
               cmap=plt.cm.binary, interpolation='nearest')
    plt.colorbar()
    plt.xlabel('predicted value')
    plt.ylabel('true value')



In [33]:

    
plot_confusion_matrix( y_pred,y_test)



In [34]:

    
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))









    



             precision    recall  f1-score   support

          0       0.93      0.97      0.95     11026
          1       0.63      0.36      0.46      1331

avg / total       0.89      0.91      0.90     12357
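The report above shows precision 0.63 and recall 0.36 for class 1. One way to trade between the two, relevant to the assignment's question about improving precision, is to threshold the predicted probability of 'yes' instead of calling predict, which roughly corresponds to a 0.5 cut-off. The sketch below runs on synthetic data; the thresholds and the data are assumptions, and the effect on the bank data would need to be measured.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the bank features
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.89, 0.11],
    flip_y=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=0)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Column 1 of predict_proba is the estimated probability of the positive class
proba_yes = model.predict_proba(X_te)[:, 1]

thresholds = (0.3, 0.5, 0.7)
precisions, recalls = [], []
for threshold in thresholds:
    y_hat = (proba_yes >= threshold).astype(int)
    # zero_division=0 avoids a warning if no row clears the threshold
    precisions.append(precision_score(y_te, y_hat, zero_division=0))
    recalls.append(recall_score(y_te, y_hat))
    print(threshold, precisions[-1], recalls[-1])
```

Raising the threshold can only shrink the set of predicted positives, so recall never increases; precision usually rises but is not guaranteed to.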

Feature Importance



In [35]:

    
rf_model.fit(X_train,y_train)
sorted(zip(rf_model.feature_importances_, X_data.columns.values), reverse=True)









    Out[35]:





[(0.31542860259098604, 'duration'),
 (0.096996126650562772, 'age'),
 (0.044095866858841301, 'campaign'),
 (0.035960872775156125, 'pdays'),
 (0.028855110408211843, 'poutcome_success'),
 (0.019054975588938441, 'previous'),
 (0.016520431074296738, 'poutcome_nonexistent'),
 (0.015647003311632553, 'contact_telephone'),
 (0.015645260641902619, 'month_may'),
 (0.015632967583978939, 'month_jun'),
 (0.015142246500243482, 'month_mar'),
 (0.01433790864056679, 'housing_yes'),
 (0.014309079315670879, 'housing_no'),
 (0.0129535055629644, 'day_of_week_mon'),
 (0.012886666537345716, 'contact_cellular'),
 (0.012763579885893146, 'day_of_week_thu'),
 (0.012717040249534182, 'education_university.degree'),
 (0.012566901972371267, 'day_of_week_fri'),
 (0.012406176598208325, 'month_oct'),
 (0.012337874663603114, 'day_of_week_tue'),
 (0.012261652180976669, 'job_admin.'),
 (0.012120240331798583, 'day_of_week_wed'),
 (0.012100393552890337, 'marital_married'),
 (0.011525244884626463, 'education_high.school'),
 (0.010962328444502137, 'marital_single'),
 (0.010713901447142161, 'month_jul'),
 (0.010635219289316815, 'month_apr'),
 (0.010260118559309426, 'poutcome_failure'),
 (0.009824308594848034, 'month_aug'),
 (0.0095848304654248158, 'loan_no'),
 (0.0095775243025992621, 'job_technician'),
 (0.0092642929966690463, 'loan_yes'),
 (0.0091188661615600273, 'month_nov'),
 (0.0089789011091064158, 'job_blue-collar'),
 (0.0088222995271310407, 'education_professional.course'),
 (0.0086708135477546414, 'education_basic.9y'),
 (0.0079546941584079309, 'month_sep'),
 (0.007903223624132821, 'default_no'),
 (0.0076005788008625995, 'marital_divorced'),
 (0.0070914245452314397, 'default_unknown'),
 (0.006798463219944627, 'job_retired'),
 (0.0066258704689579903, 'job_management'),
 (0.0064885828105282727, 'job_services'),
 (0.0063745314291836219, 'education_basic.4y'),
 (0.0063451559315849341, 'job_student'),
 (0.00498166930145878, 'education_unknown'),
 (0.0048325906623528881, 'education_basic.6y'),
 (0.0043378749560467758, 'job_self-employed'),
 (0.0041876341616813133, 'job_unemployed'),
 (0.0041289498092654218, 'job_entrepreneur'),
 (0.0035308521298796748, 'job_housemaid'),
 (0.003107840826279244, 'month_dec'),
 (0.0022321342237142578, 'housing_unknown'),
 (0.0021451594169264372, 'loan_unknown'),
 (0.001720854066830896, 'job_unknown'),
 (0.00058644970529911833, 'marital_unknown'),
 (0.00034420587409966465, 'education_illiterate'),
 (2.1270707667031843e-06, 'default_yes')]
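Because get_dummies splits each categorical feature into several columns, its importance is spread across many small entries in the listing above. A possible way to compare features on equal footing is to sum the dummy importances back onto the original column. The sketch below uses a few values rounded from the listing; the helper source_column is hypothetical, written for this illustration.

```python
import pandas as pd

# A few importances rounded from the listing above
importances = pd.Series({
    "duration": 0.32,
    "age": 0.10,
    "day_of_week_mon": 0.013,
    "day_of_week_thu": 0.013,
    "poutcome_success": 0.029,
    "poutcome_nonexistent": 0.017,
})
original_columns = ["age", "duration", "day_of_week", "poutcome"]

def source_column(dummy_name, originals):
    """Map a get_dummies column name back to the column it came from."""
    # Try longer names first so 'day_of_week_mon' matches 'day_of_week'
    for col in sorted(originals, key=len, reverse=True):
        if dummy_name == col or dummy_name.startswith(col + "_"):
            return col
    return dummy_name

# Group the Series by the original column of each index label and add up
grouped = importances.groupby(
    lambda name: source_column(name, original_columns)).sum()
print(grouped.sort_values(ascending=False))
```

Note that duration dominates the ranking, yet a call's duration is only known once the call has ended, so a model meant to decide whom to call would not have it available.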


