HW 3: KNN & Random Forest

Get your data here. The data relates to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). There are four datasets:

1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010)

2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.

3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).

4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (older version of this dataset with fewer inputs).

The smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe (yes/no) to a term deposit (variable y).

Assignment

  • Preprocess your data (you may find LabelEncoder useful)
  • Train both KNN and Random Forest models
  • Find the best parameters by computing their learning curve (feel free to verify this with grid search)
  • Create a classification report
  • Inspect your models: which features are most important? How might you use this information to improve model precision?

In [1]:
#import needed packages
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns

from sklearn import preprocessing
%matplotlib inline

In [2]:
bank_full=pd.read_csv('../data/bank/bank-full.csv', sep=';')

In [68]:
bank=pd.read_csv('../data/bank/bank.csv', sep=';')

In [3]:
bank_additional_full=pd.read_csv('../data/bank-additional/bank-additional-full.csv', sep=';')

In [49]:
bank_additional=pd.read_csv('../data/bank-additional/bank-additional.csv', sep=';')

In [4]:
bank_additional_full.head(5)


Out[4]:
age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 56 housemaid married basic.4y no no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no
1 57 services married high.school unknown no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no
2 37 services married high.school no yes no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no
3 40 admin. married basic.6y no no no telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no
4 56 services married high.school no no yes telephone may mon ... 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191 no

5 rows × 21 columns


In [5]:
bank_additional_full.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               41188 non-null object
marital           41188 non-null object
education         41188 non-null object
default           41188 non-null object
housing           41188 non-null object
loan              41188 non-null object
contact           41188 non-null object
month             41188 non-null object
day_of_week       41188 non-null object
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null object
emp.var.rate      41188 non-null float64
cons.price.idx    41188 non-null float64
cons.conf.idx     41188 non-null float64
euribor3m         41188 non-null float64
nr.employed       41188 non-null float64
y                 41188 non-null object
dtypes: float64(5), int64(5), object(11)
memory usage: 6.9+ MB

In [6]:
le = preprocessing.LabelEncoder()
le.fit(bank_additional_full.y)


Out[6]:
LabelEncoder()

In [7]:
le.classes_


Out[7]:
array(['no', 'yes'], dtype=object)

In [8]:
le.transform(bank_additional_full.y)


Out[8]:
array([0, 0, 0, ..., 0, 1, 0])

In [9]:
bank_additional_full.y=le.transform(bank_additional_full.y)

In [10]:
bank_additional_full.y


Out[10]:
0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
...
41173    1
41174    1
41175    0
41176    0
41177    0
41178    1
41179    0
41180    0
41181    1
41182    0
41183    1
41184    0
41185    0
41186    1
41187    0
Name: y, Length: 41188, dtype: int64

In [11]:
#check for unique values of features
bank_additional_full.marital.unique()


Out[11]:
array(['married', 'single', 'divorced', 'unknown'], dtype=object)

In [12]:
# scatter_matrix lives in pandas.plotting in current pandas releases
pd.plotting.scatter_matrix(bank_additional_full[['age','campaign','pdays','duration','previous','y','emp.var.rate']],figsize=(20,20))


Out[12]:
array of 49 <matplotlib.axes._subplots.AxesSubplot> objects in a 7 x 7 grid, one per pair of plotted columns (object addresses omitted), dtype=object

In [13]:
# .ix has been removed from pandas; .loc does the same label-based slice
X = bank_additional_full.loc[:,'age':'poutcome']
y = bank_additional_full.y

In [14]:
X_data=pd.get_dummies(X)
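
To make the effect of the one-hot encoding step concrete, here is a minimal sketch on a made-up three-row frame (the rows are illustrative, not taken from the bank data). It assumes a recent pandas; `drop_first=True` is an optional variation, not something this notebook uses.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (made-up rows).
toy = pd.DataFrame({
    "age": [56, 37, 40],
    "marital": ["married", "single", "married"],
})

# Numeric columns pass through; each category becomes its own indicator column.
encoded = pd.get_dummies(toy)
print(list(encoded.columns))

# Optional variation: drop one level per feature to avoid redundant columns.
reduced = pd.get_dummies(toy, drop_first=True)
print(list(reduced.columns))
```

The same expansion is why the 15 columns from age through poutcome turn into 58 columns in X_data.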

In [15]:
#Split the Data
# train_test_split moved to sklearn.model_selection in current scikit-learn
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_data, y, test_size=0.3)
print(X_train.shape, X_test.shape)


(28831, 58) (12357, 58)
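
The split above is neither seeded nor stratified, so the exact scores below change from run to run, and the share of 'yes' rows can drift between the training and test parts. A possible variation, sketched here on synthetic labels (the arrays `X_demo` and `y_demo` are made up for illustration), uses the `stratify` and `random_state` arguments of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 10% positives (illustrative, not the bank data).
X_demo = np.arange(2000).reshape(1000, 2)
y_demo = np.array([0] * 900 + [1] * 100)

# stratify keeps the class ratio the same in both parts;
# random_state makes the split repeatable between runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

On the bank data the equivalent call would pass `stratify=y`; whether that changes the reported metrics noticeably is not something this notebook tests.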

KNN Model


In [16]:
from sklearn.neighbors import KNeighborsClassifier

# Instantiate the estimator
clf=KNeighborsClassifier(algorithm='brute')

# Fit the estimator to the Training Data
clf.fit(X_train, y_train)

# Use the model to predict Test Data
y_pred=clf.predict(X_test)

In [17]:
from sklearn import metrics

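def plot_confusion_matrix(y_pred, y):
    # confusion_matrix(y_true, y_pred): rows are true labels, columns are predictions,
    # so imshow puts predictions on the x axis and true labels on the y axis
    plt.imshow(metrics.confusion_matrix(y, y_pred),
               cmap=plt.cm.binary, interpolation='nearest')
    plt.colorbar()
    plt.xlabel('predicted value')
    plt.ylabel('true value')

print("classification accuracy:", metrics.accuracy_score(y_test, y_pred))
# pass the predictions first, matching the (y_pred, y) signature
plot_confusion_matrix(y_pred, y_test)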


classification accuracy: 0.9005422028

In [18]:
print("accuracy:", metrics.accuracy_score(y_test, y_pred))
print("precision:", metrics.precision_score(y_test, y_pred))
print("recall:", metrics.recall_score(y_test, y_pred))
print("f1 score:", metrics.f1_score(y_test, y_pred))


accuracy: 0.9005422028
precision: 0.557303370787
recall: 0.372652141247
f1 score: 0.44664565511
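
As a quick check on how these numbers relate, the sketch below recomputes F1 from the printed precision and recall, and compares the accuracy with a classifier that always predicts 'no'. The class counts (11026 'no', 1331 'yes') are taken from the classification report on the same test split later in this notebook; everything else is plain arithmetic.

```python
# Values copied from the KNN metrics printed above.
precision = 0.557303370787
recall = 0.372652141247
accuracy = 0.9005422028

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Class supports from the classification report on the same test split.
n_no, n_yes = 11026, 1331
# Accuracy of a classifier that always predicts 'no'.
baseline = n_no / (n_no + n_yes)

print(round(f1, 6), round(baseline, 4), round(accuracy - baseline, 4))
```

Because the classes are imbalanced, accuracy sits less than one percentage point above the always-'no' baseline, which is why precision and recall on the 'yes' class are the more informative numbers here.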

Random Forest Model


In [19]:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

In [20]:
#plot learning curve
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):

    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

In [21]:
rf_model = RandomForestClassifier(n_estimators=100,max_depth=15,criterion='entropy')
rf_model.fit(X_train,y_train)


Out[21]:
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='entropy', max_depth=15, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)

In [23]:
from sklearn.model_selection import GridSearchCV

In [24]:
param= {'n_estimators':np.arange(50,200,50), 'max_depth':np.arange(5,25,5)}

In [25]:
gs=GridSearchCV(RandomForestClassifier(),param)

In [26]:
gs.fit(X_train, y_train)


Out[26]:
GridSearchCV(cv=None,
       estimator=RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'n_estimators': array([ 50, 100, 150]), 'max_depth': array([ 5, 10, 15, 20])},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)

In [27]:
gs.best_params_, gs.best_score_


Out[27]:
({'max_depth': 20, 'n_estimators': 50}, 0.90589989941382543)
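
Two things are worth noting about this search: it scores by accuracy (the default when `scoring=None`), and the best `max_depth` of 20 is the largest value in the grid, so a better depth may lie outside it. The sketch below shows one possible variation on synthetic data: scoring by F1 on the positive class and including `None` (unlimited depth) in the grid. The data, grid values, and `cv=3` are assumptions chosen to keep the example small, not tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the bank data (roughly 10% positives).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0)

# Small illustrative grid; the values are assumptions, not tuned choices.
param_grid = {"n_estimators": [20, 40], "max_depth": [5, 10, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",  # score on the positive class instead of accuracy
    cv=3,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```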

Learning Curves


In [28]:
_ = plot_learning_curve(RandomForestClassifier(n_estimators=50),'Random Forest (n_estimators=50)',X_train,y_train)



In [25]:
_ = plot_learning_curve(KNeighborsClassifier(algorithm='brute'),'KNN (brute force)',X_train,y_train)



In [26]:
_ = plot_learning_curve(KNeighborsClassifier(algorithm='kd_tree'),'KNN (kd_tree)',X_train,y_train)
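
The learning curves above vary the training-set size with the hyperparameters held fixed. To compare values of k directly, one option is scikit-learn's `validation_curve`. The sketch below runs it on synthetic data and also standardizes the features inside a pipeline, since KNN is distance-based and large-scale columns can otherwise dominate. The data, the list of k values, and `cv=5` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the bank data (illustrative only).
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Scale inside the pipeline so each CV fold is scaled using only its own training part.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier())

k_values = [1, 5, 15, 31]
train_scores, cv_scores = validation_curve(
    knn, X_demo, y_demo,
    param_name="kneighborsclassifier__n_neighbors",
    param_range=k_values,
    cv=5,
)

# One row per k, one column per fold; average over folds.
for k, tr, cv in zip(k_values, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
    print(k, round(tr, 3), round(cv, 3))
```

A training score of 1.0 at k=1 alongside a lower cross-validation score is the usual sign of overfitting; the k where the cross-validation score peaks is a reasonable candidate.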



In [30]:
rf_model = RandomForestClassifier(n_estimators=50,max_depth=25,criterion='entropy')
rf_model.fit(X_train,y_train)


Out[30]:
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='entropy', max_depth=25, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=50, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)

In [31]:
y_pred = rf_model.predict(X_test)

In [32]:
from sklearn.metrics import confusion_matrix

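def plot_confusion_matrix(y_pred, y):
    # confusion_matrix(y_true, y_pred): rows are true labels, columns are predictions,
    # so imshow puts predictions on the x axis and true labels on the y axis
    plt.imshow(confusion_matrix(y, y_pred),
               cmap=plt.cm.binary, interpolation='nearest')
    plt.colorbar()
    plt.xlabel('predicted value')
    plt.ylabel('true value')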

In [33]:
plot_confusion_matrix( y_pred,y_test)



In [34]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))


             precision    recall  f1-score   support

          0       0.93      0.97      0.95     11026
          1       0.63      0.36      0.46      1331

avg / total       0.89      0.91      0.90     12357

Feature Importance


In [35]:
rf_model.fit(X_train,y_train)
sorted(zip(rf_model.feature_importances_, X_data.columns.values), reverse=True)


Out[35]:
[(0.31542860259098604, 'duration'),
 (0.096996126650562772, 'age'),
 (0.044095866858841301, 'campaign'),
 (0.035960872775156125, 'pdays'),
 (0.028855110408211843, 'poutcome_success'),
 (0.019054975588938441, 'previous'),
 (0.016520431074296738, 'poutcome_nonexistent'),
 (0.015647003311632553, 'contact_telephone'),
 (0.015645260641902619, 'month_may'),
 (0.015632967583978939, 'month_jun'),
 (0.015142246500243482, 'month_mar'),
 (0.01433790864056679, 'housing_yes'),
 (0.014309079315670879, 'housing_no'),
 (0.0129535055629644, 'day_of_week_mon'),
 (0.012886666537345716, 'contact_cellular'),
 (0.012763579885893146, 'day_of_week_thu'),
 (0.012717040249534182, 'education_university.degree'),
 (0.012566901972371267, 'day_of_week_fri'),
 (0.012406176598208325, 'month_oct'),
 (0.012337874663603114, 'day_of_week_tue'),
 (0.012261652180976669, 'job_admin.'),
 (0.012120240331798583, 'day_of_week_wed'),
 (0.012100393552890337, 'marital_married'),
 (0.011525244884626463, 'education_high.school'),
 (0.010962328444502137, 'marital_single'),
 (0.010713901447142161, 'month_jul'),
 (0.010635219289316815, 'month_apr'),
 (0.010260118559309426, 'poutcome_failure'),
 (0.009824308594848034, 'month_aug'),
 (0.0095848304654248158, 'loan_no'),
 (0.0095775243025992621, 'job_technician'),
 (0.0092642929966690463, 'loan_yes'),
 (0.0091188661615600273, 'month_nov'),
 (0.0089789011091064158, 'job_blue-collar'),
 (0.0088222995271310407, 'education_professional.course'),
 (0.0086708135477546414, 'education_basic.9y'),
 (0.0079546941584079309, 'month_sep'),
 (0.007903223624132821, 'default_no'),
 (0.0076005788008625995, 'marital_divorced'),
 (0.0070914245452314397, 'default_unknown'),
 (0.006798463219944627, 'job_retired'),
 (0.0066258704689579903, 'job_management'),
 (0.0064885828105282727, 'job_services'),
 (0.0063745314291836219, 'education_basic.4y'),
 (0.0063451559315849341, 'job_student'),
 (0.00498166930145878, 'education_unknown'),
 (0.0048325906623528881, 'education_basic.6y'),
 (0.0043378749560467758, 'job_self-employed'),
 (0.0041876341616813133, 'job_unemployed'),
 (0.0041289498092654218, 'job_entrepreneur'),
 (0.0035308521298796748, 'job_housemaid'),
 (0.003107840826279244, 'month_dec'),
 (0.0022321342237142578, 'housing_unknown'),
 (0.0021451594169264372, 'loan_unknown'),
 (0.001720854066830896, 'job_unknown'),
 (0.00058644970529911833, 'marital_unknown'),
 (0.00034420587409966465, 'education_illiterate'),
 (2.1270707667031843e-06, 'default_yes')]
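
Because the categorical inputs were one-hot encoded, their importance is spread over several dummy columns, which makes them look weaker than a single numeric column. A minimal sketch of summing the dummy columns back to their original feature follows. It uses a subset of the pairs printed above, rounded to four decimals; the helper `parent_feature` and the list `original_features` are hypothetical names introduced for illustration.

```python
from collections import defaultdict

# Original (pre-encoding) feature names used to group the dummy columns.
original_features = ["duration", "age", "campaign", "pdays",
                     "previous", "poutcome", "contact"]

# A subset of the (importance, column) pairs printed above, rounded to 4 places.
importances = [
    (0.3154, "duration"), (0.0970, "age"), (0.0441, "campaign"),
    (0.0360, "pdays"), (0.0289, "poutcome_success"), (0.0191, "previous"),
    (0.0165, "poutcome_nonexistent"), (0.0156, "contact_telephone"),
    (0.0129, "contact_cellular"), (0.0103, "poutcome_failure"),
]

def parent_feature(column):
    """Return the original feature a (possibly one-hot) column came from."""
    for name in original_features:
        if column == name or column.startswith(name + "_"):
            return name
    return column

totals = defaultdict(float)
for score, column in importances:
    totals[parent_feature(column)] += score

for name, score in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(name, round(score, 4))
```

Matching on the full original name plus an underscore, rather than splitting on the last underscore, keeps names such as day_of_week intact if the list is extended to all features.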
