HW 3: KNN & Random Forest

Get your data here. The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). There are four datasets:

1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010)

2) bank-additional.csv with 10% of the examples (4119) and 20 inputs, randomly selected from 1).

3) bank-full.csv with all examples and 17 inputs, ordered by date (an older version of this dataset with fewer inputs).

4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (an older version of this dataset with fewer inputs).

The smallest datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe to a term deposit (variable y: yes/no).

Assignment

  • Preprocess your data (you may find LabelEncoder useful)
  • Train both KNN and Random Forest models
  • Find the best parameters by computing their learning curves (feel free to verify this with grid search)
  • Create a classification report
  • Inspect your models: which features are most important? How might you use this information to improve model precision?

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier, DistanceMetric
from sklearn import preprocessing
from sklearn import metrics
from sklearn.cross_validation import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, mean_squared_error
from sklearn.grid_search import GridSearchCV

%matplotlib inline

In [2]:
!pwd


/Users/sampathweb/ga/DAT_SF_11_homework/jgaw

In [4]:
bank_df = pd.read_csv('../DATA/bank-additional/bank-additional-full.csv', sep=';')

In [5]:
bank_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 41188 entries, 0 to 41187
Data columns (total 21 columns):
age               41188 non-null int64
job               41188 non-null object
marital           41188 non-null object
education         41188 non-null object
default           41188 non-null object
housing           41188 non-null object
loan              41188 non-null object
contact           41188 non-null object
month             41188 non-null object
day_of_week       41188 non-null object
duration          41188 non-null int64
campaign          41188 non-null int64
pdays             41188 non-null int64
previous          41188 non-null int64
poutcome          41188 non-null object
emp.var.rate      41188 non-null float64
cons.price.idx    41188 non-null float64
cons.conf.idx     41188 non-null float64
euribor3m         41188 non-null float64
nr.employed       41188 non-null float64
y                 41188 non-null object
dtypes: float64(5), int64(5), object(11)
memory usage: 6.9+ MB

In [6]:
bank_data = pd.DataFrame()
label_encoders = {}
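# Encode each categorical (object) column with its own LabelEncoder,
# keeping the fitted encoders so the mapping can be inverted later.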

for col in bank_df.columns:
    if bank_df[col].dtype == 'object':
        label_encoders[col] = preprocessing.LabelEncoder()
        bank_data[col] = label_encoders[col].fit_transform(bank_df[col])
    else:
        bank_data[col] = bank_df[col]

In [7]:
label_encoders


Out[7]:
{'contact': LabelEncoder(),
 'day_of_week': LabelEncoder(),
 'default': LabelEncoder(),
 'education': LabelEncoder(),
 'housing': LabelEncoder(),
 'job': LabelEncoder(),
 'loan': LabelEncoder(),
 'marital': LabelEncoder(),
 'month': LabelEncoder(),
 'poutcome': LabelEncoder(),
 'y': LabelEncoder()}
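
Keeping the fitted encoders means the integer codes can be mapped back to the original strings later. A quick sanity check (a sketch; LabelEncoder sorts the classes it sees during fit):

print label_encoders['y'].classes_             # expected: array(['no', 'yes'], dtype=object)
print label_encoders['y'].inverse_transform([0, 1])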

In [8]:
xcols = [col for col in bank_data.columns if col != 'y']

X = bank_data[xcols].values
y = bank_data['y'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=99)
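
Only about 11% of clients subscribe, so a plain random split can shift the class balance slightly between train and test. A stratified split keeps the ratio fixed in both sets; a minimal sketch, assuming a scikit-learn version whose train_test_split accepts a stratify argument:

# Stratified variant: preserve the ~11% 'yes' rate in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=99, stratify=y)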

In [9]:
%%time

param_grid = {'n_neighbors': [6],
              'algorithm': ['auto'],
              'p': [2],
              'leaf_size': [1],
              'metric': ['euclidean']}
clf = GridSearchCV(KNeighborsClassifier(), param_grid=param_grid)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)


CPU times: user 8.55 s, sys: 28.5 ms, total: 8.58 s
Wall time: 8.58 s
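
Each parameter above is pinned to a single value, so this "grid" simply refits one KNN configuration (presumably the best values found in earlier experimentation). A real sweep is a small extension; the ranges below are illustrative, not from the original run:

# Hypothetical multi-value grid over neighborhood size and vote weighting.
wide_grid = {'n_neighbors': [3, 5, 7, 9, 15],
             'weights': ['uniform', 'distance']}
search = GridSearchCV(KNeighborsClassifier(metric='euclidean'), param_grid=wide_grid, cv=3)
search.fit(X_train, y_train)
print 'Best Params: ', search.best_params_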

In [10]:
print 'Best Params: ', clf.best_params_
print 'Score: ', clf.score(X_test, y_test)
print
print 'Classification Report: '
print classification_report(y_test, y_pred)


Best Params:  {'n_neighbors': 6, 'metric': 'euclidean', 'leaf_size': 1, 'algorithm': 'auto', 'p': 2}
Score:  0.905074047099

Classification Report: 
             precision    recall  f1-score   support

          0       0.92      0.97      0.95      7303
          1       0.64      0.38      0.48       935

avg / total       0.89      0.91      0.89      8238
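
Recall on the positive class is weak (0.38). KNN is distance-based, and the features here sit on very different scales (e.g., age vs. nr.employed, which is in the thousands), so standardizing before fitting often helps; a minimal sketch using the already-imported preprocessing module:

# Sketch: standardize features before KNN, since distances are scale-sensitive.
scaler = preprocessing.StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=6, metric='euclidean')
knn.fit(scaler.transform(X_train), y_train)
print 'Scaled KNN score: ', knn.score(scaler.transform(X_test), y_test)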


In [11]:
cm = confusion_matrix(y_test, y_pred)

print(cm)

# Plot the confusion matrix (rendered inline via %matplotlib inline)
plt.matshow(cm)
plt.title('Confusion matrix')
plt.colorbar()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()



[[7097  206]
 [ 576  359]]

In [12]:
def plot_learning_curve(X_train, y_train, X_test, y_test, n_est=5):
    errors_train = []
    errors_test = []
    for i in range(1, n_est + 1):
        est = RandomForestClassifier(n_jobs=-1, n_estimators=i)
        est.fit(X_train, y_train)
        y_pred = est.predict(X_test)
        # On 0/1 labels, mean squared error is just the misclassification rate.
        errors_train.append(mean_squared_error(y_train, est.predict(X_train)))
        errors_test.append(mean_squared_error(y_test, y_pred))
    fig, ax = plt.subplots(figsize=(15, 10))
    ax.plot(range(1, n_est + 1), errors_train, 'o-', color="g", label='mse_train')
    ax.plot(range(1, n_est + 1), errors_test, 'o-', color="r", label='mse_test')
    ax.set_xlabel('n_estimators')
    ax.set_ylabel('error')
    ax.legend(loc=0)
    ax.set_title('Learning Curve of N Estimators')

    # Report on the largest forest; sklearn metrics expect y_true first.
    print 'Score: ', est.score(X_test, y_test)
    print 'Confusion Matrix: '
    print confusion_matrix(y_test, y_pred)
    print
    print classification_report(y_test, y_pred)
    print
    print sorted(zip(est.feature_importances_, xcols), reverse=True)[:10]

In [13]:
plot_learning_curve(X_train, y_train, X_test, y_test, n_est = 50)


Score:  0.913692643846
Confusion Matrix: 
[[7044  259]
 [ 452  483]]

             precision    recall  f1-score   support

          0       0.94      0.96      0.95      7303
          1       0.65      0.52      0.58       935

avg / total       0.91      0.91      0.91      8238


[(0.31385481631095186, 'duration'), (0.1090622725566975, 'euribor3m'), (0.093519825829489769, 'age'), (0.06438357075098522, 'nr.employed'), (0.048099527941428022, 'job'), (0.043019627313560634, 'education'), (0.04266307931767975, 'campaign'), (0.040940208068242633, 'day_of_week'), (0.039562418338822541, 'pdays'), (0.028714623318760232, 'cons.conf.idx')]
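
duration dominates the importances by a wide margin, followed by euribor3m and age. One way to act on the ranking is to retrain on only the top features and compare precision on the held-out set; a sketch (the cutoff of 10 is arbitrary):

# Sketch: keep only the k most important features and refit.
est = RandomForestClassifier(n_estimators=50, n_jobs=-1)
est.fit(X_train, y_train)
top_k = np.argsort(est.feature_importances_)[::-1][:10]  # column indices, best first
est_top = RandomForestClassifier(n_estimators=50, n_jobs=-1)
est_top.fit(X_train[:, top_k], y_train)
print 'Top-10 feature score: ', est_top.score(X_test[:, top_k], y_test)

One caveat: the dataset's documentation notes that duration is only known after a call has ended, so a model meant to target calls in advance should be evaluated with it removed.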
