HW 3: KNN & Random Forest

Get your data here. The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to determine whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). There are four datasets:

1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010)

2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.

3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with fewer inputs).

4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (older version of this dataset with fewer inputs).

The smaller datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe to a term deposit (yes/no; variable y).

Assignment

  • Preprocess your data (you may find LabelEncoder useful)
  • Train both KNN and Random Forest models
  • Find the best parameters by computing their learning curves (feel free to verify this with grid search; see the sketch after this list)
  • Create a classification report
  • Inspect your models: which features are most important? How might you use this information to improve model precision?
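
A minimal sketch of what that grid search might look like with scikit-learn's GridSearchCV, assuming the bank-additional.csv path used later in this notebook; the parameter grids are illustrative, not prescriptive:

In [ ]:
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# Load the 10% sample and one-hot encode the categorical inputs
df = pd.read_csv('bank-additional/bank-additional.csv', sep=';')
X = pd.get_dummies(df.drop('y', axis=1))
y = (df.y == 'yes').astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Illustrative grids; wider or finer ranges may find better settings
searches = {
    'RF': GridSearchCV(RandomForestClassifier(),
                       {'n_estimators': [10, 25, 50, 100]}, cv=5),
    'KNN': GridSearchCV(KNeighborsClassifier(),
                        {'n_neighbors': [5, 15, 25, 45]}, cv=5),
}
for name, search in searches.items():
    search.fit(X_train, y_train)
    print(name, search.best_params_, round(search.best_score_, 3))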

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn import preprocessing
from sklearn import metrics

%matplotlib inline

In [2]:
bank_full_df = pd.read_csv('bank/bank-full.csv', sep = ';')

In [3]:
bank_df = pd.read_csv('bank/bank.csv', sep = ';')

In [4]:
bank_additional_full_df = pd.read_csv('bank-additional/bank-additional-full.csv', sep = ';')

In [5]:
bank_additional_df = pd.read_csv('bank-additional/bank-additional.csv', sep = ';')

In [6]:
bank_additional_df.head()


Out[6]:
age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 30 blue-collar married basic.9y no yes no cellular may fri ... 2 999 0 nonexistent -1.8 92.893 -46.2 1.313 5099.1 no
1 39 services single high.school no no no telephone may fri ... 4 999 0 nonexistent 1.1 93.994 -36.4 4.855 5191.0 no
2 25 services married high.school no yes no telephone jun wed ... 1 999 0 nonexistent 1.4 94.465 -41.8 4.962 5228.1 no
3 38 services married basic.9y no unknown unknown telephone jun fri ... 3 999 0 nonexistent 1.4 94.465 -41.8 4.959 5228.1 no
4 47 admin. married university.degree no yes no cellular nov mon ... 1 999 0 nonexistent -0.1 93.200 -42.0 4.191 5195.8 no

5 rows × 21 columns


In [7]:
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation
from sklearn.metrics import confusion_matrix, classification_report

In [8]:
cat_columns = ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome', 'y']

In [9]:
le = preprocessing.LabelEncoder()

for col in cat_columns:
    bank_additional_full_df[col] = le.fit_transform(bank_additional_full_df[col])
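
LabelEncoder maps each category to an arbitrary integer, which implicitly orders the categories; tree ensembles like Random Forest can split around this, but it distorts distances for KNN. The KNN cells near the end of this notebook switch to pd.get_dummies instead; a quick sketch of the difference, re-reading the raw file since the frame above was encoded in place:

In [ ]:
# Re-read the raw file to demonstrate one-hot encoding, since
# bank_additional_full_df was label-encoded in place above.
raw = pd.read_csv('bank-additional/bank-additional-full.csv', sep=';')
X_onehot = pd.get_dummies(raw.drop('y', axis=1))  # one 0/1 column per category
print(X_onehot.shape)  # wider than the 20 raw inputs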

In [10]:
bank_additional_full_df.head()


Out[10]:
age job marital education default housing loan contact month day_of_week ... campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 56 3 1 0 0 0 0 1 6 1 ... 1 999 0 1 1.1 93.994 -36.4 4.857 5191 0
1 57 7 1 3 1 0 0 1 6 1 ... 1 999 0 1 1.1 93.994 -36.4 4.857 5191 0
2 37 7 1 3 0 2 0 1 6 1 ... 1 999 0 1 1.1 93.994 -36.4 4.857 5191 0
3 40 0 1 1 0 0 0 1 6 1 ... 1 999 0 1 1.1 93.994 -36.4 4.857 5191 0
4 56 7 1 3 0 0 2 1 6 1 ... 1 999 0 1 1.1 93.994 -36.4 4.857 5191 0

5 rows × 21 columns


In [11]:
%%time
# Random Forest: train, evaluate, and report on a held-out test set
def train_and_evaluate_rf(df):
    # Split features from the target
    X = df.drop('y', axis=1)
    y = df.y.values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    # Fit the model
    clf = RF(n_estimators=50)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Report results
    print(y_pred.shape)
    print(clf.score(X_test, y_test))
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

train_and_evaluate_rf(bank_additional_full_df)


(8238,)
0.907380432144
[[7018  271]
 [ 492  457]]
             precision    recall  f1-score   support

          0       0.93      0.96      0.95      7289
          1       0.63      0.48      0.55       949

avg / total       0.90      0.91      0.90      8238

CPU times: user 1.73 s, sys: 42.6 ms, total: 1.77 s
Wall time: 1.82 s
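
Note the imbalance in the report above: roughly 7,300 'no' examples versus 950 'yes', with recall on the positive class below 0.5. One standard experiment is to reweight the classes; a minimal sketch, with no guarantee it helps precision (it usually trades precision for recall):

In [ ]:
# Sketch: weight classes inversely to their frequency during training
X = bank_additional_full_df.drop('y', axis=1)
y = bank_additional_full_df.y.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf_bal = RF(n_estimators=50, class_weight='balanced')
clf_bal.fit(X_train, y_train)
print(classification_report(y_test, clf_bal.predict(X_test)))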

In [12]:
%%time
# KNN: same split / train / report pattern as the Random Forest above
def train_and_evaluate_knn(df):
    # Split features from the target
    X = df.drop('y', axis=1)
    y = df.y.values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    # Fit the model
    clf = KNN(n_neighbors=5, algorithm='kd_tree')
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Report results
    print(y_pred.shape)
    print(clf.score(X_test, y_test))
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

train_and_evaluate_knn(bank_additional_full_df)


(8238,)
0.905074047099
[[6969  322]
 [ 460  487]]
             precision    recall  f1-score   support

          0       0.94      0.96      0.95      7291
          1       0.60      0.51      0.55       947

avg / total       0.90      0.91      0.90      8238

CPU times: user 1.97 s, sys: 16.7 ms, total: 1.99 s
Wall time: 2.15 s
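
The KNN above runs on raw, unscaled features, so wide-range columns such as nr.employed (values near 5000) dominate the distance metric over the small integer codes. A minimal sketch of standardizing first, using the already-encoded bank_additional_full_df; results depend on the split, but scaling usually matters for KNN:

In [ ]:
from sklearn.preprocessing import StandardScaler

X = bank_additional_full_df.drop('y', axis=1)
y = bank_additional_full_df.y.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit the scaler on training data only, to avoid leaking test statistics
scaler = StandardScaler().fit(X_train)
clf = KNN(n_neighbors=5)
clf.fit(scaler.transform(X_train), y_train)
print(clf.score(scaler.transform(X_test), y_test))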

In [13]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve  # formerly sklearn.learning_curve

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds.
        Specific cross-validation objects can be passed; see the
        sklearn.model_selection module for the list of possible objects.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

In [14]:
%%time
def plot_rf_learning_curve(df):
    X = df.drop('y', axis=1)
    y = df.y.values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    # This is one learning curve for one model; try a few more.
    plot_learning_curve(RF(n_estimators=10), 'Random Forest learning curve',
                        X_train, y_train)

plot_rf_learning_curve(bank_additional_full_df)


CPU times: user 2.07 s, sys: 50.7 ms, total: 2.12 s
Wall time: 2.2 s

RF mean squared error


In [15]:
%%time
# Sweep n_estimators and track test-set MSE (lower is better)
def plot_rf_mse(df):
    mse_scores = []
    X = df.drop('y', axis=1)
    y = df.y.values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    for i in range(1, 30):
        clf = RF(n_estimators=i)
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        mse_scores.append(metrics.mean_squared_error(y_test, y_pred))
    pd.DataFrame(mse_scores).plot()

plot_rf_mse(bank_additional_full_df)


CPU times: user 13.8 s, sys: 264 ms, total: 14 s
Wall time: 14 s

RF recall score


In [16]:
%%time
# Sweep n_estimators and track recall on the positive class
def plot_rf_recall(df):
    recall_scores = []
    X = df.drop('y', axis=1)
    y = df.y.values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    for i in range(1, 20):
        clf = RF(n_estimators=i)
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        recall_scores.append(metrics.recall_score(y_test, y_pred))
    pd.DataFrame(recall_scores).plot()

plot_rf_recall(bank_additional_full_df)


CPU times: user 6.15 s, sys: 126 ms, total: 6.27 s
Wall time: 6.27 s

RF precision score


In [17]:
%%time
# Sweep n_estimators and track precision on the positive class
def plot_rf_precision(df):
    precision_scores = []
    X = df.drop('y', axis=1)
    y = df.y.values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    for i in range(1, 20):
        clf = RF(n_estimators=i)
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        precision_scores.append(metrics.precision_score(y_test, y_pred))
    pd.DataFrame(precision_scores).plot()

plot_rf_precision(bank_additional_full_df)


CPU times: user 6.2 s, sys: 125 ms, total: 6.32 s
Wall time: 6.32 s

KNN mean squared error


In [19]:
%%time
# Sweep n_neighbors and track test-set MSE (lower is better)
def plot_knn_mse(df):
    mse_scores = []
    X = pd.get_dummies(df.drop('y', axis=1))
    le = preprocessing.LabelEncoder().fit(['yes', 'no'])
    y = le.transform(df.y.values)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    for i in range(1, 50):
        clf = KNN(n_neighbors=i, algorithm='kd_tree')
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        mse_scores.append(metrics.mean_squared_error(y_test, y_pred))
    pd.DataFrame(mse_scores).plot()

plot_knn_mse(bank_additional_df)


CPU times: user 3.32 s, sys: 14.1 ms, total: 3.34 s
Wall time: 3.34 s

KNN recall score


In [20]:
%%time
# Sweep n_neighbors and track recall on the positive class
def plot_knn_recall(df):
    recall_scores = []
    X = pd.get_dummies(df.drop('y', axis=1))
    le = preprocessing.LabelEncoder().fit(['yes', 'no'])
    y = le.transform(df.y.values)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    for i in range(1, 50):
        clf = KNN(n_neighbors=i, algorithm='kd_tree')
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        recall_scores.append(metrics.recall_score(y_test, y_pred))
    pd.DataFrame(recall_scores).plot()

plot_knn_recall(bank_additional_df)


CPU times: user 3.19 s, sys: 9.46 ms, total: 3.2 s
Wall time: 3.19 s

KNN precision score


In [21]:
%%time
# Sweep n_neighbors and track precision on the positive class
def plot_knn_precision(df):
    precision_scores = []
    X = pd.get_dummies(df.drop('y', axis=1))
    le = preprocessing.LabelEncoder().fit(['yes', 'no'])
    y = le.transform(df.y.values)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    for i in range(1, 50):
        clf = KNN(n_neighbors=i, algorithm='kd_tree')
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        precision_scores.append(metrics.precision_score(y_test, y_pred))
    pd.DataFrame(precision_scores).plot()

plot_knn_precision(bank_additional_df)


CPU times: user 3.18 s, sys: 8.63 ms, total: 3.19 s
Wall time: 3.19 s

KNN classification_report


In [22]:
%%time
# Classification report for KNN at the best n_neighbors found above
def knn_classification_report(df, k=42):
    X = pd.get_dummies(df.drop('y', axis=1))
    le = preprocessing.LabelEncoder().fit(['yes', 'no'])
    y = le.transform(df.y.values)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    clf = KNN(n_neighbors=k, algorithm='kd_tree')
    clf.fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))

knn_classification_report(bank_additional_df)


             precision    recall  f1-score   support

          0       0.95      0.98      0.96       759
          1       0.60      0.37      0.46        65

avg / total       0.92      0.93      0.92       824

CPU times: user 108 ms, sys: 4.77 ms, total: 112 ms
Wall time: 111 ms

RF feature importances


In [23]:
%%time
# Random Forest feature importances
def rf_feature_importances(df):
    X = df.drop('y', axis=1)
    y = df.y.values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    clf = RF(n_estimators=10)
    clf.fit(X_train, y_train)
    # Pair importances with X.columns (not df.columns) so the target
    # column is excluded from the pairing
    print(sorted(zip(clf.feature_importances_, X.columns), reverse=True)[:5])

rf_feature_importances(bank_additional_full_df)


[(0.30022711109339417, 'duration'), (0.13223064311954683, 'euribor3m'), (0.094320775671858204, 'age'), (0.051103975755848727, 'pdays'), (0.051078511808145634, 'job')]
CPU times: user 341 ms, sys: 17.3 ms, total: 359 ms
Wall time: 357 ms

The most important feature is duration, followed by euribor3m and age. The dataset documentation notes that duration is only known after a call ends, so a realistic model would drop it; the sketch below retrains without duration to show how much the model depends on this feature, which suggests precision gains would have to come from the remaining inputs, such as the economic indicators.
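
A minimal sketch of that experiment, reusing the label-encoded bank_additional_full_df from above; exact scores will vary with the random split:

In [ ]:
# Sketch: drop the leaky 'duration' column and compare this report
# against the earlier run that included it.
X = bank_additional_full_df.drop(['y', 'duration'], axis=1)
y = bank_additional_full_df.y.values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

clf = RF(n_estimators=50)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))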

