HW 3: KNN & Random Forest

Get your data here. The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). There are four datasets:

1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010)

2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.

3) bank-full.csv with all examples and 17 inputs, ordered by date (an older version of this dataset with fewer inputs).

4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (an older version of this dataset with fewer inputs).

The smaller datasets are provided for testing more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe to a term deposit (variable y, yes/no).

Assignment

  • Preprocess your data (you may find LabelEncoder useful)

In [2]:
# Standard imports for data analysis packages in Python

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# This enables inline Plots

%matplotlib inline

# Limit the rows displayed in dataframe by inserting this line along with your imports.

pd.set_option('display.max_rows', 10)

In [3]:
# Create a data frame from the listings dataset

bank_additional = pd.read_csv('../hw3/bank-additional.csv', delimiter=";", header=0)
bank_additional_full = pd.read_csv('../hw3/bank-additional-full.csv', delimiter=";", header=0)

In [4]:
# Check the header of the file

print bank_additional_full.head()


   age        job  marital    education  default housing loan    contact  \
0   56  housemaid  married     basic.4y       no      no   no  telephone   
1   57   services  married  high.school  unknown      no   no  telephone   
2   37   services  married  high.school       no     yes   no  telephone   
3   40     admin.  married     basic.6y       no      no   no  telephone   
4   56   services  married  high.school       no      no  yes  telephone   

  month day_of_week ...  campaign  pdays  previous     poutcome emp.var.rate  \
0   may         mon ...         1    999         0  nonexistent          1.1   
1   may         mon ...         1    999         0  nonexistent          1.1   
2   may         mon ...         1    999         0  nonexistent          1.1   
3   may         mon ...         1    999         0  nonexistent          1.1   
4   may         mon ...         1    999         0  nonexistent          1.1   

   cons.price.idx  cons.conf.idx  euribor3m  nr.employed   y  
0          93.994          -36.4      4.857         5191  no  
1          93.994          -36.4      4.857         5191  no  
2          93.994          -36.4      4.857         5191  no  
3          93.994          -36.4      4.857         5191  no  
4          93.994          -36.4      4.857         5191  no  

[5 rows x 21 columns]

In [5]:
# Check for missing values

print bank_additional_full.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               41188 non-null object
marital           41188 non-null object
education         41188 non-null object
default           41188 non-null object
housing           41188 non-null object
loan              41188 non-null object
contact           41188 non-null object
month             41188 non-null object
day_of_week       41188 non-null object
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null object
emp.var.rate      41188 non-null float64
cons.price.idx    41188 non-null float64
cons.conf.idx     41188 non-null float64
euribor3m         41188 non-null float64
nr.employed       41188 non-null float64
y                 41188 non-null object
dtypes: float64(5), int64(5), object(11)
memory usage: 6.9+ MB
None

In [6]:
# There are apparently no null values in the datasets so it is not necessary to impute any values
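Had there been gaps, a typical pandas pattern is to count them per column and then impute. A minimal sketch on a hypothetical mini-frame (the column names mirror the bank data, but the values and the `'unknown'` sentinel are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-frame with gaps, just to illustrate the checks
df = pd.DataFrame({'age': [56, np.nan, 37],
                   'job': ['housemaid', 'services', None]})

missing = df.isnull().sum()                     # NaN count per column
filled = df.fillna({'age': df['age'].median(),  # numeric: impute the median
                    'job': 'unknown'})          # categorical: a sentinel label
```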

In [7]:
# Check the values in the target variable (y)

print bank_additional_full.y.value_counts()


no     36548
yes     4640
dtype: int64

In [8]:
# Import the label encoder package

from sklearn import preprocessing

In [9]:
# Create the feature dataset using LabelEncoder for string variables

cols = list(bank_additional_full.columns)

cols.remove('y')

x = pd.DataFrame()

for col in cols:
    if bank_additional_full[col].dtype == 'object':
        le = preprocessing.LabelEncoder()
        x[col] = le.fit_transform(bank_additional_full[col])
    else:
        x[col] = bank_additional_full[col]
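One caveat worth noting: LabelEncoder assigns integer codes in sorted order of the category strings, so KNN's distance metric will treat e.g. 'cellular' and 'telephone' as one unit apart, an ordering the data never implied. A small sketch of the mapping (one-hot encoding via pd.get_dummies is a common alternative when that matters):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['telephone', 'cellular', 'telephone'])

# classes_ holds the categories in sorted order; the codes index into it,
# so 'cellular' -> 0 and 'telephone' -> 1
```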

In [10]:
# Create the target dataset with a fresh LabelEncoder (rather than reusing
# the one left over from the loop above)

le_y = preprocessing.LabelEncoder()
y = le_y.fit_transform(bank_additional_full.y)
  • Train both KNN and Random Forest models

In [11]:
# Split the data into test and training

from sklearn.cross_validation import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
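Because only about 11% of the examples are 'yes', a plain random split can shift the class balance between train and test. Newer scikit-learn versions accept a stratify argument that preserves the proportions; a sketch on toy data (the sklearn.model_selection import path assumes scikit-learn >= 0.18):

```python
import numpy as np
from sklearn.model_selection import train_test_split  # >= 0.18 import path

X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)   # imbalanced, roughly like the bank data

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# the 20-sample test set keeps the 80/20 class ratio: 16 zeros, 4 ones
```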

In [12]:
# Import the KNeighborsClassifier

from sklearn.neighbors import KNeighborsClassifier

# Instantiate the estimator

kn_model = KNeighborsClassifier()

# Fit the estimator to the training data

kn_model.fit(x_train, y_train)

# Print out the accuracy on the training data (optimistic, since these are
# the rows the model was fit on)

print kn_model.score(x_train, y_train)


0.932261001517
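Note that `.score(x_train, y_train)` measures accuracy on the same rows the model was fit on, which is optimistic for KNN. A sketch of the usual train-vs-test comparison on synthetic data (the dataset here is made up; only the pattern carries over):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier().fit(X_tr, y_tr)
train_acc = knn.score(X_tr, y_tr)  # seen data: optimistic
test_acc = knn.score(X_te, y_te)   # held-out data: the honest estimate
```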

In [13]:
# Import the DecisionTreeClassifier and RandomForestClassifier from scikit-learn

from sklearn.cross_validation import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Use the RandomForestClassifier to fit the data

rf_model = RandomForestClassifier()
rf_model.fit(x_train, y_train)

# Print out the score

print cross_val_score(rf_model, x_train, y_train).mean()


0.908710167672
  • Find the best parameters by computing their learning curve (feel free to verify this with grid search)

In [14]:
# Use the plotting functionality that was created

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.learning_curve import learning_curve

def plot_learning_curve(estimator, title, x, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y values plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, x, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

In [14]:
# Plot the learning curve using the training data

_ = plot_learning_curve(KNeighborsClassifier(n_neighbors=100), 'KNN (n_neighbors=100)', x_train, y_train)



In [15]:
# Plot the learning curve using the training data

_ = plot_learning_curve(RandomForestClassifier(n_estimators=100), 'Random Forest (n_estimators=100)', x_train, y_train)



In [25]:
# Import GridSearchCV from scikit-learn

from sklearn.grid_search import GridSearchCV

# Establish the search space for n_neighbors

param = {'n_neighbors':range(1,100)}

# Set up the grid search

gs = GridSearchCV(KNeighborsClassifier(),param)

# Run the grid search on our model

gs.fit(x_train, y_train)


Out[25]:
GridSearchCV(cv=None,
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_neighbors=5, p=2, weights='uniform'),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)

In [26]:
# Display the parameters and score for the best fitting model

gs.best_params_,gs.best_score_


Out[26]:
({'n_neighbors': 99}, 0.91050075872534142)

In [ ]:
# GridSearch suggests that the cross-validation score is best with 99 neighbors
# Since 99 is the upper boundary of the search range, the score may still be
# improving beyond it; the curve keeps rising with the number of neighbors
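To see the whole score-vs-k curve rather than just the winner, the per-parameter CV means can be pulled out of the fitted grid search. In the scikit-learn version used here they live in `gs.grid_scores_`; newer releases expose `cv_results_` instead. A sketch with the newer API on toy data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)

gs = GridSearchCV(KNeighborsClassifier(),
                  {'n_neighbors': [1, 5, 25]}, cv=3)
gs.fit(X, y)

ks = gs.cv_results_['param_n_neighbors']   # one entry per candidate k
means = gs.cv_results_['mean_test_score']  # mean CV accuracy for each k
best_k = gs.best_params_['n_neighbors']
```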

In [16]:
# Import GridSearchCV from scikit-learn

from sklearn.grid_search import GridSearchCV

# Establish the search space for n_estimators

param = {'n_estimators':range(1,100)}

# Set up the grid search

gs = GridSearchCV(RandomForestClassifier(),param)

# Run the grid search on our model

gs.fit(x_train, y_train)


Out[16]:
GridSearchCV(cv=None,
       estimator=RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'n_estimators': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
       verbose=0)

In [17]:
# Display the parameters and score for the best fitting model

gs.best_params_,gs.best_score_


Out[17]:
({'n_estimators': 66}, 0.91344461305007585)

In [18]:
# GridSearch suggests that the cross-validation score is best when we use 66 estimators
# As the learning curve showed, past a certain point the score starts to decline
  • Create a classification report

In [27]:
# Import the metrics package from scikit-learn

from sklearn import metrics

In [29]:
# Use the model to predict test data from KNeighborsClassifier

kn_model = KNeighborsClassifier(n_neighbors=99)
kn_model.fit(x_train, y_train)
y_pred = kn_model.predict(x_test)

# Run the classification report

print metrics.classification_report(y_test, y_pred)


             precision    recall  f1-score   support

          0       0.94      0.97      0.95      7291
          1       0.69      0.50      0.58       947

avg / total       0.91      0.92      0.91      8238


In [30]:
# Print out the accuracy, precision, recall, etc.

print "accuracy:", metrics.accuracy_score(y_test, y_pred)
print "precision:", metrics.precision_score(y_test, y_pred)
print "recall:", metrics.recall_score(y_test, y_pred)
print "f1 score:", metrics.f1_score(y_test, y_pred)


accuracy: 0.916605972323
precision: 0.687319884726
recall: 0.503695881732
f1 score: 0.581352833638

In [31]:
# The accuracy of the model is 0.917 while the precision is 0.687 and the recall is 0.504
# Accuracy is dominated by the majority 'no' class; the low recall shows the model
# finds only about half of the actual 'yes' clients
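The gap between accuracy and recall follows directly from the definitions: with roughly 11% positives, a model can be very accurate overall while still missing half of the 'yes' class. Hand-computing the metrics from hypothetical confusion counts (numbers chosen so the totals echo the report's support of 8238; they are not taken from the model):

```python
# tp: correctly predicted 'yes', fp: predicted 'yes' but actually 'no',
# fn: predicted 'no' but actually 'yes', tn: correctly predicted 'no'
tp, fp, fn, tn = 477, 217, 470, 7074

accuracy  = (tp + tn) / float(tp + fp + fn + tn)
precision = tp / float(tp + fp)   # of predicted positives, how many are right
recall    = tp / float(tp + fn)   # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
```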

In [35]:
# Use the model to predict test data from RandomForestClassifier

rf_model = RandomForestClassifier(n_estimators=66)
rf_model.fit(x_train, y_train)
y_pred = rf_model.predict(x_test)

# Run the classification report

print metrics.classification_report(y_test, y_pred)


             precision    recall  f1-score   support

          0       0.94      0.97      0.95      7291
          1       0.67      0.51      0.58       947

avg / total       0.91      0.92      0.91      8238


In [36]:
# Print out the accuracy, precision, recall, etc.

print "accuracy:", metrics.accuracy_score(y_test, y_pred)
print "precision:", metrics.precision_score(y_test, y_pred)
print "recall:", metrics.recall_score(y_test, y_pred)
print "f1 score:", metrics.f1_score(y_test, y_pred)


accuracy: 0.915392085458
precision: 0.673130193906
recall: 0.513199577614
f1 score: 0.582384661474

In [37]:
# The accuracy of the Random Forest model is 0.915 while the precision is 0.67 and the recall is 0.51
# As with the previous KNN model, overall accuracy is high but only about half of
# the actual 'yes' clients are identified
  • Inspect your models, what features are most important? How might you use this information to improve model precision?

In [39]:
# This prints the 20 features sorted by descending importance from the RandomForestClassifier

rf_model.fit(x_train, y_train)
sorted(zip(rf_model.feature_importances_, cols), reverse=True)[:20]


Out[39]:
[(0.31480692275604805, 'duration'),
 (0.11671819170506352, 'euribor3m'),
 (0.094509101198103518, 'age'),
 (0.056529222881454975, 'nr.employed'),
 (0.049538953973526895, 'job'),
 (0.042635676362556552, 'campaign'),
 (0.042268000385460552, 'education'),
 (0.041157363648545775, 'day_of_week'),
 (0.039173572852304744, 'pdays'),
 (0.030646804283503807, 'cons.conf.idx'),
 (0.024313780711928069, 'marital'),
 (0.021959887146177408, 'emp.var.rate'),
 (0.020990417430912321, 'poutcome'),
 (0.020474405841071915, 'housing'),
 (0.019314368089308701, 'cons.price.idx'),
 (0.018605753488851854, 'month'),
 (0.015082844277081646, 'loan'),
 (0.014058031697684964, 'previous'),
 (0.0090490495436050672, 'contact'),
 (0.008167651726809692, 'default')]

In [ ]:
# The most important features are listed above, but the top 5 are noticeably stronger predictors than the rest:
# duration
# euribor3m
# age
# nr.employed
# job
# Dropping the least predictive features could reduce noise and may improve precision.
# One caveat: duration is only known after the call has ended, so a model meant to
# predict outcomes before calling should exclude it despite its high importance.
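One concrete way to act on the importance ranking is to drop the weakest features and refit; scikit-learn's SelectFromModel wraps that step. A sketch on synthetic data (the threshold='median' choice, which keeps the top half of the features, is an assumption for illustration, not a tuned value):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# keep only the features whose importance clears the median importance
sel = SelectFromModel(rf, threshold='median', prefit=True)
X_reduced = sel.transform(X)
```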