HW 3: KNN & Random Forest

Get your data here. The data is related to the direct marketing campaigns of a Portuguese banking institution. The campaigns were based on phone calls; often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). There are four datasets:

1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010)

2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.

3) bank-full.csv with all examples and 17 inputs, ordered by date (an older version of this dataset with fewer inputs).

4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (an older version of this dataset with fewer inputs).

The smaller datasets are provided for testing more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe to a term deposit (variable y: yes/no).

Assignment

  • Preprocess your data (you may find LabelEncoder useful)
  • Train both KNN and Random Forest models
  • Find the best parameters by computing learning curves (feel free to verify this with grid search)
  • Create a classification report
  • Inspect your models: which features are most important? How might you use this information to improve model precision?

In [170]:
# Standard imports for data analysis packages in Python
import pandas as pd
import numpy as np
import seaborn as sns  # for pretty layout of plots
import matplotlib.pyplot as plt
from pprint import pprint  # for pretty printing
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
from sklearn.learning_curve import learning_curve

# This enables inline Plots
%matplotlib inline

In [171]:
# read data
data = pd.read_csv('bank-additional/bank-additional-full.csv', sep=';')

In [172]:
# Use LabelEncoder to transform the categorical columns into integer codes
le = preprocessing.LabelEncoder()
categorical_cols = ['default', 'marital', 'job', 'education', 'housing', 'loan',
                    'contact', 'month', 'day_of_week', 'poutcome', 'y']
for col in categorical_cols:
    data[col] = le.fit_transform(data[col])
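
LabelEncoder maps each category onto an integer, which implicitly imposes an ordering on the values; for a distance-based model such as KNN this can distort the neighborhoods. A hedged alternative (not used in this notebook) is one-hot encoding with pandas; a minimal sketch on the raw file loaded above:

In [ ]:
# Sketch only: one-hot encode the categorical columns instead of label-encoding them.
# Re-reads the raw CSV so the original string categories are still available.
raw = pd.read_csv('bank-additional/bank-additional-full.csv', sep=';')
categorical_cols = ['job', 'marital', 'education', 'default', 'housing', 'loan',
                    'contact', 'month', 'day_of_week', 'poutcome']
data_ohe = pd.get_dummies(raw, columns=categorical_cols)
data_ohe.y = preprocessing.LabelEncoder().fit_transform(data_ohe.y)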

In [173]:
# assign X and y. Split data
y = data.y
X = data.drop('y', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
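
The target is imbalanced (only about 11% of clients subscribe, as the support counts in the reports below show), so a stratified split keeps that ratio consistent between the train and test sets. A hedged sketch, assuming a scikit-learn version whose train_test_split (under sklearn.model_selection in newer releases) supports the stratify argument:

In [ ]:
# Sketch: stratified split so both sets keep the ~11% positive rate.
# The stratify argument is not available in very old scikit-learn versions.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)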

In [168]:
# Instantiate the estimator using KNN
clf = KNeighborsClassifier(n_neighbors=9, leaf_size=5)

# Fit the estimator to the Training Data
clf.fit(X_train, y_train)


Out[168]:
KNeighborsClassifier(algorithm='auto', leaf_size=5, metric='minkowski',
           metric_params=None, n_neighbors=9, p=2, weights='uniform')

In [125]:
# plot learning curve for KNN with the kd_tree algorithm

%time
_ = plot_learning_curve(KNeighborsClassifier(algorithm='kd_tree'), 'test', X_train, y_train)


CPU times: user 4 µs, sys: 1 µs, total: 5 µs
Wall time: 10 µs

In [80]:
#plot learning curve for KNN with brute algorithm
_ = plot_learning_curve(KNeighborsClassifier(algorithm='brute'),'test', X_train, y_train)



In [126]:
#plot learning curve for KNN with ball_tree algorithm
_ = plot_learning_curve(KNeighborsClassifier(algorithm='ball_tree'),'test', X_train, y_train)



In [119]:
# Find and verify the best parameters for KNN using grid search.
# Since the different algorithms have little to no effect on the learning curve,
# we search over n_neighbors and leaf_size instead.

d = {'n_neighbors': range(1, 10)}
gds = GridSearchCV(KNeighborsClassifier(), d)

In [120]:
gds.fit(X_train, y_train)
gds.best_params_


Out[120]:
{'n_neighbors': 9}

In [121]:
e = {'leaf_size' : [1, 3, 5, 10, 15, 30, 60, 120]}
gds = GridSearchCV(KNeighborsClassifier(), e)

In [122]:
gds.fit(X_train, y_train)
gds.best_params_


Out[122]:
{'leaf_size': 5}
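
Tuning n_neighbors and leaf_size in separate grids ignores any interaction between them (and leaf_size only affects tree-building speed, not the predictions). A small sketch that searches both in a single grid, using the same GridSearchCV imported above:

In [ ]:
# Sketch: search n_neighbors and leaf_size together in one grid.
param_grid = {'n_neighbors': range(1, 10),
              'leaf_size': [1, 3, 5, 10, 15, 30, 60, 120]}
gds = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
gds.fit(X_train, y_train)
print(gds.best_params_)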

In [116]:
# Classification report for KNN
y_pred = clf.predict(X_test)
print classification_report(y_test, y_pred)


             precision    recall  f1-score   support

          0       0.93      0.96      0.95     10958
          1       0.60      0.47      0.52      1399

avg / total       0.90      0.90      0.90     12357
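
KNN is distance-based and the raw features sit on very different scales (e.g. duration vs. euribor3m), so scaling the inputs often helps. A hedged sketch with StandardScaler, which the original notebook does not do:

In [ ]:
# Sketch: standardize features before fitting KNN; scaling frequently helps
# distance-based models, though it is not guaranteed to improve things here.
scaler = preprocessing.StandardScaler().fit(X_train)
knn_scaled = KNeighborsClassifier(n_neighbors=9).fit(scaler.transform(X_train), y_train)
print(classification_report(y_test, knn_scaled.predict(scaler.transform(X_test))))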


In [ ]:
# Repeat steps above for random forest

In [174]:
rf = RandomForestClassifier(n_estimators=90, max_depth=10)
rf.fit(X_train, y_train)


Out[174]:
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=10, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=90, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)

In [137]:
# Plot learning curve for random forest classifier

%time
_ = plot_learning_curve(RandomForestClassifier(n_estimators=100),'test',X_train, y_train)


CPU times: user 5 µs, sys: 1 µs, total: 6 µs
Wall time: 11.2 µs

In [95]:
# Find the best parameters for random forest
# We will vary the n_estimators and max_depth parameters

In [132]:
d = {'n_estimators' : [2, 4, 8, 16, 25, 30, 60, 90, 100]}
gds = GridSearchCV(RandomForestClassifier(), d)

In [133]:
gds.fit(X_train, y_train)
gds.best_params_


Out[133]:
{'n_estimators': 90}

In [134]:
e = {'max_depth' : [2, 4, 5, 10, 20, 40, 60, 90, 100]}
gds = GridSearchCV(RandomForestClassifier(), e)

In [135]:
gds.fit(X_train, y_train)
gds.best_params_


Out[135]:
{'max_depth': 10}

In [138]:
# plot learning curve again using the best params
%time
_ = plot_learning_curve(RandomForestClassifier(n_estimators=90, max_depth=10),'test',X_train, y_train)


CPU times: user 5 µs, sys: 1e+03 ns, total: 6 µs
Wall time: 10 µs

In [175]:
# Classification report for random forest
y_pred = rf.predict(X_test)
print classification_report(y_test, y_pred)


             precision    recall  f1-score   support

          0       0.93      0.98      0.95     10956
          1       0.70      0.44      0.54      1401

avg / total       0.90      0.91      0.91     12357
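
confusion_matrix was imported at the top but never used; the raw counts make the class-1 precision/recall trade-off easier to see than the averaged report. A short sketch on the random-forest predictions above:

In [ ]:
# Sketch: rows are the true classes (0 = 'no', 1 = 'yes'), columns the predictions.
print(confusion_matrix(y_test, y_pred))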


In [179]:
#Find the most important features 
sorted(zip(rf.feature_importances_, data.drop('y', axis=1).columns), reverse=True)[:20]


Out[179]:
[(0.36848234232206195, 'duration'),
 (0.11712792421655446, 'euribor3m'),
 (0.097504832322366147, 'nr.employed'),
 (0.055766271037977662, 'pdays'),
 (0.048542090252638961, 'cons.conf.idx'),
 (0.044548200454230052, 'emp.var.rate'),
 (0.04127742774203845, 'poutcome'),
 (0.041197369782866672, 'cons.price.idx'),
 (0.040880285911750025, 'age'),
 (0.022878593565465361, 'month'),
 (0.018875633330684018, 'day_of_week'),
 (0.018308415594707635, 'education'),
 (0.017893747656759846, 'job'),
 (0.016836819599349621, 'campaign'),
 (0.012898382995126344, 'previous'),
 (0.0097214004130339159, 'contact'),
 (0.0094099312864026462, 'marital'),
 (0.0068534054725107284, 'housing'),
 (0.0066635258672835878, 'loan'),
 (0.0043334001761919294, 'default')]

In [144]:
# It looks like the top 3 features account for the majority (58%) of the importance.
# Rather than using all 20 features, perhaps we can use the top 5 most important features
# to build a model with better precision.
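
Instead of hand-picking the top columns, the selection can also be driven directly by the fitted forest's importances. A hedged sketch, assuming a scikit-learn version that provides sklearn.feature_selection.SelectFromModel (added in releases newer than the one used elsewhere in this notebook):

In [ ]:
# Sketch: keep only the features whose importance exceeds the mean importance.
# SelectFromModel may not exist in older scikit-learn versions.
from sklearn.feature_selection import SelectFromModel
selector = SelectFromModel(rf, prefit=True, threshold='mean')
X_reduced = selector.transform(X)
print(X_reduced.shape)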

In [180]:
new_data = data[['duration', 'euribor3m', 'nr.employed', 'pdays', 'cons.conf.idx']]

In [181]:
X_train, X_test, y_train, y_test = train_test_split(new_data, y, test_size=0.3)

In [182]:
rf = RandomForestClassifier(n_estimators=90, max_depth=10)
rf.fit(X_train, y_train)


Out[182]:
RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=10, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=90, n_jobs=1,
            oob_score=False, random_state=None, verbose=0)

In [183]:
# Classification report for random forest
y_pred = rf.predict(X_test)
print classification_report(y_test, y_pred)


             precision    recall  f1-score   support

          0       0.94      0.96      0.95     10968
          1       0.64      0.53      0.58      1389

avg / total       0.91      0.91      0.91     12357


In [ ]:
# It looks like the precision for class 0 increased slightly while the precision for class 1 decreased.
# Perhaps using only the top 5 features doesn't help improve precision after all.
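
Since only about 11% of clients subscribe, the class-1 scores may respond more to handling the imbalance than to dropping features. One hedged option, assuming a scikit-learn version whose RandomForestClassifier accepts class_weight:

In [ ]:
# Sketch: up-weight the minority class; this typically trades some class-0
# precision for better class-1 recall, so check the full report afterwards.
rf_bal = RandomForestClassifier(n_estimators=90, max_depth=10,
                                class_weight='balanced')
rf_bal.fit(X_train, y_train)
print(classification_report(y_test, rf_bal.predict(X_test)))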

In [97]:
#plot learning curve helper function

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines the minimum and maximum y values plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt
