HW 3: KNN & Random Forest

Get your data here. The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls; often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). There are four datasets:

1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010)

2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.

3) bank-full.csv with all examples and 17 inputs, ordered by date (an older version of this dataset with fewer inputs).

4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (an older version of this dataset with fewer inputs).

The smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe to a term deposit (yes/no), encoded in the variable y.

Assignment

  • Preprocess your data (you may find LabelEncoder useful)
  • Train both KNN and Random Forest models
  • Find the best parameters by computing their learning curve (feel free to verify this with grid search)
  • Create a classification report
  • Inspect your models: which features are most important? How might you use this information to improve model precision?

In [57]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
%matplotlib inline
from sklearn.learning_curve import learning_curve

from sklearn.metrics import confusion_matrix, classification_report
from sklearn.cross_validation import cross_val_score,train_test_split

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

Preprocess the data


In [7]:
df = pd.read_csv("../data/bank-additional-full.csv", sep = ';')

In [25]:
#convert the target column to 0/1 (no = 0, yes = 1)
df['y'].replace('no', 0, inplace = True)
df['y'].replace('yes', 1, inplace = True)
df['y'].value_counts()


Out[25]:
0    36548
1     4640
dtype: int64

In [29]:
dfdummy = pd.get_dummies(df)

In [46]:
x_train, x_test, y_train, y_test = train_test_split(dfdummy.drop('y', axis = 1), dfdummy['y'])
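
A quick aside on the assignment's LabelEncoder hint: above, the target is mapped by hand with replace() and the categoricals are one-hot encoded with get_dummies(). A roughly equivalent sketch for encoding just the raw 'y' column with LabelEncoder (not run in this notebook) would be:

from sklearn.preprocessing import LabelEncoder

raw = pd.read_csv("../data/bank-additional-full.csv", sep = ';')
le = LabelEncoder()
y_encoded = le.fit_transform(raw['y'])  # classes sort alphabetically, so 'no' -> 0 and 'yes' -> 1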

Train KNN and Random Forest


In [41]:
knn = KNeighborsClassifier()
rdf = RandomForestClassifier()

In [48]:
knn.fit(x_train, y_train)
rdf.fit(x_train, y_train)


Out[48]:
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)

In [50]:
print cross_val_score(knn, x_train, y_train).mean()


0.905182739309

In [51]:
print cross_val_score(rdf, x_train, y_train).mean()


0.904664789097

Find the best parameters by computing their learning curve

(feel free to verify this with grid search)


In [58]:
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y values plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

In [75]:
%%time
#knn learning curve
num_neighbors = range(1, 50, 5)
knn_results = []
for x in num_neighbors:
    knn = KNeighborsClassifier(n_neighbors = x)
    plot_learning_curve(knn, "KNN, n_neighbors = %d" % x, x_train, y_train)
    knn_results.append([cross_val_score(knn, x_train, y_train).mean(), x, knn])


CPU times: user 14min 5s, sys: 11.7 s, total: 14min 16s
Wall time: 14min 42s
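
The section heading also suggests verifying with grid search; only the random forest gets that treatment later, so here is a minimal sketch of the equivalent KNN search (my addition, not run here), using the same sklearn.grid_search module imported further below:

from sklearn.grid_search import GridSearchCV

knn_gs = GridSearchCV(KNeighborsClassifier(), param_grid = {'n_neighbors': range(1, 50, 5)}, cv = 3)
knn_gs.fit(x_train, y_train)
print knn_gs.best_params_, knn_gs.best_score_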

In [77]:
%%time
#rdf learning curve
num_estimators = range(1, 100, 5)
rdf_results = []
for x in num_estimators:
    rdf = RandomForestClassifier(n_estimators = x)
    plot_learning_curve(rdf, "Random Forest, n_estimators = %d" % x, x_train, y_train)
    rdf_results.append([cross_val_score(rdf, x_train, y_train).mean(), x, rdf])


CPU times: user 7min 17s, sys: 13.3 s, total: 7min 31s
Wall time: 7min 44s

Create a classification report


In [87]:
#find the best classifier for knn and rdf by getting the one with the highest score
knn_results.sort(reverse = True)
best_knn = knn_results[0][2]
print knn_results[0]


[0.91175423262438893, 26, KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=26, p=2, weights='uniform')]

In [91]:
best_knn.fit(x_train, y_train)
y_pred = best_knn.predict(x_test)
confusion_matrix(y_test, y_pred)


Out[91]:
array([[8841,  299],
       [ 629,  528]])

In [92]:
print classification_report(y_test, y_pred)


             precision    recall  f1-score   support

          0       0.93      0.97      0.95      9140
          1       0.64      0.46      0.53      1157

avg / total       0.90      0.91      0.90     10297


In [93]:
#do it again for rdf
rdf_results.sort(reverse = True)
best_rdf = rdf_results[0][2]

In [94]:
best_rdf.fit(x_train, y_train)
y_pred = best_rdf.predict(x_test)
confusion_matrix(y_test, y_pred)


Out[94]:
array([[8854,  286],
       [ 626,  531]])

In [95]:
print classification_report(y_test, y_pred)


             precision    recall  f1-score   support

          0       0.93      0.97      0.95      9140
          1       0.65      0.46      0.54      1157

avg / total       0.90      0.91      0.90     10297

Inspect your models: which features are most important? How might you use this information to improve model precision?


In [113]:
#I would use it to prune or down-weight the least important features (this is tried further below)
#Note: the zip below does line up. x_train was built from dfdummy.drop('y', axis = 1),
#so applying the same drop here gives the columns in the same order as feature_importances_
#(63 features on each side; the target column 'y' is excluded by the drop)
sorted(zip(best_rdf.feature_importances_, dfdummy.drop('y', axis = 1).columns.values), reverse = True)


Out[113]:
[(0.28046256420863563, 'duration'),
 (0.090723306047462662, 'euribor3m'),
 (0.076571347099394504, 'age'),
 (0.056068033740963183, 'nr.employed'),
 (0.038638876298731091, 'campaign'),
 (0.02751971903501382, 'cons.conf.idx'),
 (0.024749679421018994, 'poutcome_success'),
 (0.023525453480378093, 'emp.var.rate'),
 (0.021523656576233743, 'cons.price.idx'),
 (0.02145449865147166, 'pdays'),
 (0.013289214396963456, 'housing_yes'),
 (0.013199024279980721, 'housing_no'),
 (0.012023665672458576, 'job_admin.'),
 (0.011817931097993496, 'education_university.degree'),
 (0.011631195176855914, 'marital_married'),
 (0.011096875242214177, 'education_high.school'),
 (0.011077168098602838, 'marital_single'),
 (0.011010938645308572, 'day_of_week_mon'),
 (0.011003108831120441, 'previous'),
 (0.010941952259044471, 'day_of_week_thu'),
 (0.010796280796580683, 'day_of_week_wed'),
 (0.010409942347255929, 'day_of_week_tue'),
 (0.010095922778613577, 'day_of_week_fri'),
 (0.0098856700680860221, 'job_technician'),
 (0.0091182858733987165, 'loan_no'),
 (0.0087064760606784773, 'education_professional.course'),
 (0.0085779310668876816, 'job_blue-collar'),
 (0.0085654517298333274, 'loan_yes'),
 (0.0081744413846524377, 'education_basic.9y'),
 (0.007291827489494653, 'poutcome_nonexistent'),
 (0.007148754847066929, 'contact_telephone'),
 (0.0069930353134876264, 'marital_divorced'),
 (0.0067816831808204005, 'contact_cellular'),
 (0.0067755773576015053, 'job_management'),
 (0.0062912808247787243, 'education_basic.4y'),
 (0.0061167228139090727, 'poutcome_failure'),
 (0.0061045246379153343, 'job_services'),
 (0.0057400194689728066, 'month_oct'),
 (0.0056823754516221988, 'default_unknown'),
 (0.0056139966673029312, 'job_retired'),
 (0.0054028214562713266, 'month_may'),
 (0.0051518970087346006, 'default_no'),
 (0.0049636889360231175, 'education_unknown'),
 (0.0043041174195159069, 'job_student'),
 (0.0042527575118035668, 'month_mar'),
 (0.0041888001095868599, 'education_basic.6y'),
 (0.0039670851872183827, 'job_self-employed'),
 (0.00387146711873777, 'job_entrepreneur'),
 (0.0037843512767358907, 'job_unemployed'),
 (0.0036464208321616816, 'month_apr'),
 (0.0029798053160975313, 'job_housemaid'),
 (0.0029129538046558262, 'month_jun'),
 (0.0028389378321447604, 'month_aug'),
 (0.0027932603508357373, 'month_sep'),
 (0.0026358775366423799, 'month_nov'),
 (0.0023032921647709709, 'month_jul'),
 (0.001949863328099733, 'loan_unknown'),
 (0.0019026643678586555, 'housing_unknown'),
 (0.0015462809313536679, 'job_unknown'),
 (0.00062078782763083326, 'month_dec'),
 (0.00060176016024983674, 'marital_unknown'),
 (0.00018246360281022806, 'education_illiterate'),
 (2.3750125561101476e-07, 'default_yes')]
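
A tidier way to produce the same ranking is a pandas Series indexed by the training columns. A small sketch (Series.sort_values assumes a newer pandas than may be installed here; older releases spelled it .order()):

importances = pd.Series(best_rdf.feature_importances_, index = dfdummy.drop('y', axis = 1).columns)
print importances.sort_values(ascending = False).head(10)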

In [124]:
#this doesn't impress me. Can I do better?
#everything above satisfies the homework; what follows is my own experimentation

In [119]:
%%time
#is multicore training faster, and if so, by how much?
randtree = RandomForestClassifier(n_estimators = 91, n_jobs=-1)
rtree = RandomForestClassifier(n_estimators = 91)


CPU times: user 65 µs, sys: 13 µs, total: 78 µs
Wall time: 74.1 µs

In [122]:
%%time
randtree.fit(x_train, y_train)


CPU times: user 7.74 s, sys: 126 ms, total: 7.87 s
Wall time: 2.38 s
Out[122]:
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=91, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0)

In [123]:
%%time
rtree.fit(x_train, y_train)


CPU times: user 4.54 s, sys: 89.8 ms, total: 4.63 s
Wall time: 4.65 s
Out[123]:
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=91, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)

In [125]:
#success.  now for grid search
from sklearn.grid_search import GridSearchCV

In [140]:
param_grid = {
    "n_estimators": [291, 391, 491],
    "max_depth": [30,50,70,None],
} 
gs = GridSearchCV(RandomForestClassifier(n_jobs = -1), param_grid)

In [142]:
%%time
gs.fit(x_train, y_train)
print gs.best_estimator_
print cross_val_score(gs.best_estimator_, x_train, y_train).mean()


RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=291, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0)
0.911495257518
CPU times: user 13min 14s, sys: 18.3 s, total: 13min 32s
Wall time: 4min

In [143]:
#ok, this isn't doing great. Let's nuke some of the low-importance features
features = sorted(zip(best_rdf.feature_importances_, dfdummy.drop('y', axis = 1).columns.values), reverse = True)

In [144]:
to_remove = [x[1] for x in features if x[0] < .01]

In [145]:
print to_remove


['job_technician', 'loan_no', 'education_professional.course', 'job_blue-collar', 'loan_yes', 'education_basic.9y', 'poutcome_nonexistent', 'contact_telephone', 'marital_divorced', 'contact_cellular', 'job_management', 'education_basic.4y', 'poutcome_failure', 'job_services', 'month_oct', 'default_unknown', 'job_retired', 'month_may', 'default_no', 'education_unknown', 'job_student', 'month_mar', 'education_basic.6y', 'job_self-employed', 'job_entrepreneur', 'job_unemployed', 'month_apr', 'job_housemaid', 'month_jun', 'month_aug', 'month_sep', 'month_nov', 'month_jul', 'loan_unknown', 'housing_unknown', 'job_unknown', 'month_dec', 'marital_unknown', 'education_illiterate', 'default_yes']
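
For reference, newer scikit-learn releases (0.17+) ship SelectFromModel, which performs this importance-threshold pruning in one step. A sketch of that approach, which would not run against the older sklearn imported at the top of this notebook:

from sklearn.feature_selection import SelectFromModel

selector = SelectFromModel(best_rdf, threshold = 0.01, prefit = True)
x_train_reduced = selector.transform(x_train)  # keeps only the columns with importance >= 0.01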

In [151]:
df_simple = dfdummy.drop(to_remove, axis = 1)

In [152]:
x_train2, x_test2, y_train2, y_test2 = train_test_split(df_simple.drop('y', axis = 1), df_simple['y'])

In [153]:
simple_rfc = RandomForestClassifier(n_estimators = 291, n_jobs = -1)

In [154]:
simple_rfc.fit(x_train2, y_train2)


Out[154]:
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=291, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0)

In [155]:
#NB: cross_val_score refits the estimator on the data it is given, so this scores the
#291-tree forest against the full-feature x_train rather than the reduced x_train2
cross_val_score(simple_rfc, x_train, y_train).mean()


Out[155]:
0.91214269528341585

In [161]:
#ok, let's try that again with even fewer features
#(same caveat as above: the cross_val_score call at the end still uses the full-feature x_train)
to_remove = [x[1] for x in features if x[0] < .02]
df_simple = dfdummy.drop(to_remove, axis = 1)
x_train2, x_test2, y_train2, y_test2 = train_test_split(df_simple.drop('y', axis = 1), df_simple['y'])
simple_rfc = RandomForestClassifier(n_estimators = 291, n_jobs = -1)
simple_rfc.fit(x_train2, y_train2)
cross_val_score(simple_rfc, x_train, y_train).mean()


Out[161]:
0.91262827360719945

In [162]:
param_grid = {
    "n_estimators": [291, 391, 491],
    "max_features": ['sqrt','log2',None],
} 
gs = GridSearchCV(RandomForestClassifier(n_jobs = -1), param_grid)

In [163]:
gs.fit(x_train,y_train)


Out[163]:
GridSearchCV(cv=None,
       estimator=RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=10, n_jobs=-1,
            oob_score=False, random_state=None, verbose=0),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'n_estimators': [291, 391, 491], 'max_features': ['sqrt', 'log2', None]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)

In [166]:
gs.best_estimator_.score(x_test, y_test)


Out[166]:
0.91570360299116249

In [167]:
#can't seem to break that .91 wall. Let's try gradient boosting
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()

In [170]:
%%time
gbc.fit(x_train, y_train)


CPU times: user 19.3 s, sys: 180 ms, total: 19.4 s
Wall time: 20.1 s
Out[170]:
GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
              max_depth=3, max_features=None, max_leaf_nodes=None,
              min_samples_leaf=1, min_samples_split=2, n_estimators=100,
              random_state=None, subsample=1.0, verbose=0,
              warm_start=False)

In [172]:
%%time
cross_val_score(gbc, x_train, y_train)


CPU times: user 36.3 s, sys: 453 ms, total: 36.8 s
Wall time: 38.4 s
Out[172]:
array([ 0.91851996,  0.91395552,  0.91774303])

In [174]:
from sklearn.svm import SVC

In [175]:
#note: despite the variable name, SVC() defaults to an RBF kernel, not a linear one
linsvm = SVC()

In [176]:
linsvm.fit(x_train, y_train)


Out[176]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [177]:
cross_val_score(linsvm, x_train, y_train)


Out[177]:
array([ 0.89278431,  0.89259007,  0.8915218 ])
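
One thing left untried: KNN and SVC are both distance-based, and the inputs here are unscaled (ages sit next to euribor rates), which typically hurts them. A minimal sketch of scaling first, assuming StandardScaler behaves the same in this sklearn version:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(x_train)
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)
print cross_val_score(SVC(), x_train_scaled, y_train).mean()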

In [ ]: