HW 3: KNN & Random Forest

Get your data here. The data relates to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). There are four datasets:

1) bank-additional-full.csv with all examples (41188) and 20 inputs, ordered by date (from May 2008 to November 2010)

2) bank-additional.csv with 10% of the examples (4119), randomly selected from 1), and 20 inputs.

3) bank-full.csv with all examples and 17 inputs, ordered by date (older version of this dataset with less inputs).

4) bank.csv with 10% of the examples and 17 inputs, randomly selected from 3) (older version of this dataset with less inputs).

The smaller datasets are provided to test more computationally demanding machine learning algorithms (e.g., SVM).

The classification goal is to predict whether the client will subscribe to a term deposit (variable y: yes/no).

Assignment

  • Preprocess your data (you may find LabelEncoder useful)
  • Train both KNN and Random Forest models
  • Find the best parameters by computing their learning curve (feel free to verify this with grid search)
  • Create a classification report
  • Inspect your models: which features are most important? How might you use this information to improve model precision?
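
The grid-search verification mentioned above can be sketched roughly as follows (a minimal sketch in Python 3 syntax with a modern scikit-learn; the parameter grid is illustrative, not prescribed by the assignment, and synthetic data stands in for the bank dataset so the snippet runs standalone):

```python
# Hedged sketch: verifying a KNN hyperparameter choice with GridSearchCV.
# The grid values are arbitrary illustrations; synthetic data for portability.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={'n_neighbors': [1, 2, 5, 10, 20]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

The same pattern applies to RandomForestClassifier with a grid over, e.g., `n_estimators` and `max_depth`.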

In [8]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn import preprocessing

pd.set_option('display.max_rows', 10)

In [9]:
print 'Pandas:', pd.__version__
print 'Numpy:', np.__version__
print 'Matplotlib:', mpl.__version__


Pandas: 0.14.1
Numpy: 1.9.1
Matplotlib: 1.4.2

In [10]:
data_addl = pd.read_csv('bank-additional-full.csv', delimiter=";")
data_addl.describe()
data_addl.education.unique()
data_addl.month.unique()
# Categorical columns to encode: job, marital, education, default, housing,
# loan, contact, month, day_of_week, poutcome, and the target y.


Out[10]:
array(['may', 'jun', 'jul', 'aug', 'oct', 'nov', 'dec', 'mar', 'apr', 'sep'], dtype=object)

In [11]:
obj_fields = ['y','job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'poutcome']
num_fields = ['age','duration','campaign','pdays','previous','emp.var.rate','cons.price.idx','cons.conf.idx','euribor3m','nr.employed']

# Label-encode each categorical column in place.
for field in obj_fields:
    print field
    le = preprocessing.LabelEncoder()
    data_addl[field] = le.fit_transform(data_addl[field])
print data_addl.describe()


y
job
marital
education
default
housing
loan
contact
month
day_of_week
poutcome
               age          job       marital     education       default  \
count  41188.00000  41188.00000  41188.000000  41188.000000  41188.000000   
mean      40.02406      3.72458      1.172769      3.747184      0.208872   
std       10.42125      3.59456      0.608902      2.136482      0.406686   
min       17.00000      0.00000      0.000000      0.000000      0.000000   
25%       32.00000      0.00000      1.000000      2.000000      0.000000   
50%       38.00000      2.00000      1.000000      3.000000      0.000000   
75%       47.00000      7.00000      2.000000      6.000000      0.000000   
max       98.00000     11.00000      3.000000      7.000000      2.000000   

            housing          loan       contact         month   day_of_week  \
count  41188.000000  41188.000000  41188.000000  41188.000000  41188.000000   
mean       1.071720      0.327425      0.365252      4.230868      2.004613   
std        0.985314      0.723616      0.481507      2.320025      1.397575   
min        0.000000      0.000000      0.000000      0.000000      0.000000   
25%        0.000000      0.000000      0.000000      3.000000      1.000000   
50%        2.000000      0.000000      0.000000      4.000000      2.000000   
75%        2.000000      0.000000      1.000000      6.000000      3.000000   
max        2.000000      2.000000      1.000000      9.000000      4.000000   

           ...           campaign         pdays      previous      poutcome  \
count      ...       41188.000000  41188.000000  41188.000000  41188.000000   
mean       ...           2.567593    962.475454      0.172963      0.930101   
std        ...           2.770014    186.910907      0.494901      0.362886   
min        ...           1.000000      0.000000      0.000000      0.000000   
25%        ...           1.000000    999.000000      0.000000      1.000000   
50%        ...           2.000000    999.000000      0.000000      1.000000   
75%        ...           3.000000    999.000000      0.000000      1.000000   
max        ...          56.000000    999.000000      7.000000      2.000000   

       emp.var.rate  cons.price.idx  cons.conf.idx     euribor3m  \
count  41188.000000    41188.000000   41188.000000  41188.000000   
mean       0.081886       93.575664     -40.502600      3.621291   
std        1.570960        0.578840       4.628198      1.734447   
min       -3.400000       92.201000     -50.800000      0.634000   
25%       -1.800000       93.075000     -42.700000      1.344000   
50%        1.100000       93.749000     -41.800000      4.857000   
75%        1.400000       93.994000     -36.400000      4.961000   
max        1.400000       94.767000     -26.900000      5.045000   

        nr.employed             y  
count  41188.000000  41188.000000  
mean    5167.035911      0.112654  
std       72.251528      0.316173  
min     4963.600000      0.000000  
25%     5099.100000      0.000000  
50%     5191.000000      0.000000  
75%     5228.100000      0.000000  
max     5228.100000      1.000000  

[8 rows x 21 columns]
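One caveat worth noting: LabelEncoder imposes an arbitrary integer ordering on nominal categories like job and marital. Tree ensembles tolerate this reasonably well, but distance-based KNN will treat job=7 as "farther" from job=0 than job=3 is, which is meaningless. A one-hot alternative can be sketched with `pandas.get_dummies` (a toy frame is used here so the snippet runs standalone; the column values are illustrative):

```python
# Sketch: one-hot encoding a nominal column with pandas.get_dummies,
# avoiding the artificial ordering that integer label encoding creates.
import pandas as pd

toy = pd.DataFrame({'job': ['admin.', 'blue-collar', 'admin.'],
                    'age': [56, 37, 40]})
encoded = pd.get_dummies(toy, columns=['job'])
print(encoded.columns.tolist())
```

Applied to the full frame this expands the feature count, but it puts KNN's distance metric on sounder footing.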


In [13]:
# Build the feature matrix: every column except the target y.
data_fields = [col for col in data_addl.columns if col not in ['y']]
print data_fields
features = data_addl[data_fields]
features.head()


['age', 'job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'day_of_week', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']
Out[13]:
age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
0 56 3 1 0 0 0 0 1 6 1 261 1 999 0 1 1.1 93.994 -36.4 4.857 5191
1 57 7 1 3 1 0 0 1 6 1 149 1 999 0 1 1.1 93.994 -36.4 4.857 5191
2 37 7 1 3 0 2 0 1 6 1 226 1 999 0 1 1.1 93.994 -36.4 4.857 5191
3 40 0 1 1 0 0 0 1 6 1 151 1 999 0 1 1.1 93.994 -36.4 4.857 5191
4 56 7 1 3 0 0 2 1 6 1 307 1 999 0 1 1.1 93.994 -36.4 4.857 5191

In [14]:
from sklearn.cross_validation import train_test_split  # sklearn.model_selection in sklearn >= 0.18

X_train, X_test, y_train, y_test = train_test_split(features, data_addl.y, test_size=0.3)

In [15]:
from sklearn.cross_validation import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
# train_test_split returns plain ndarrays here; restore a labeled DataFrame.
X_train = pd.DataFrame(X_train, columns=features.columns)

In [16]:
%%time
rf_model = RandomForestClassifier(n_estimators=100,max_depth=15,criterion='entropy')
rf_model.fit(X_train, y_train)
print cross_val_score(rf_model, X_train, y_train).mean()


0.913634758498
Wall time: 9.13 s

In [17]:
# Feature importances, most important first
sorted(zip(rf_model.feature_importances_, features.columns), reverse=True)


Out[17]:
[(0.3610549594527131, 'duration'),
 (0.11609582624771214, 'euribor3m'),
 (0.090188910652724719, 'nr.employed'),
 (0.064971854874751378, 'age'),
 (0.041317250277320597, 'emp.var.rate'),
 (0.033920827816514169, 'job'),
 (0.032878569339131608, 'cons.conf.idx'),
 (0.032082502003733221, 'education'),
 (0.031539582407124664, 'pdays'),
 (0.031100430616478481, 'campaign'),
 (0.030369118288527598, 'day_of_week'),
 (0.02464954291719999, 'month'),
 (0.022665648130125358, 'cons.price.idx'),
 (0.018474401106435095, 'poutcome'),
 (0.01667205779792609, 'marital'),
 (0.013492804429195289, 'housing'),
 (0.010875887461125576, 'loan'),
 (0.010769701699461705, 'previous'),
 (0.0093314597867219812, 'contact'),
 (0.0075486646950773161, 'default')]
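One way to act on these importances is to retrain on only the top-k features and compare precision against the full model. The sketch below (Python 3 syntax, synthetic data so it runs standalone; k=5 is an arbitrary illustrative cutoff, not a recommendation) shows the mechanics:

```python
# Sketch: select the top-k features by random-forest importance and
# retrain on that subset, to test whether a leaner model helps precision.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

k = 5  # illustrative cutoff
top_idx = np.argsort(rf.feature_importances_)[::-1][:k]
rf_small = RandomForestClassifier(n_estimators=50, random_state=0)
rf_small.fit(X[:, top_idx], y)
print(sorted(top_idx.tolist()))
```

On the bank data, note that `duration` dominates; it is only known after the call ends, so a model meant to target clients beforehand should arguably be evaluated without it.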

In [18]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

neigh = KNeighborsClassifier(n_neighbors=2)
neigh.fit(X_train, y_train) 
y_pred = neigh.predict(X_test)


def plot_confusion_matrix(y_pred, y):
    # confusion_matrix puts true labels on rows and predictions on columns,
    # so the x-axis of imshow is the predicted value, the y-axis the true one.
    plt.imshow(confusion_matrix(y, y_pred),
               cmap=plt.cm.binary, interpolation='nearest')
    plt.colorbar()
    plt.xlabel('predicted value')
    plt.ylabel('true value')

plot_confusion_matrix(y_pred, y_test)



In [19]:
from sklearn.metrics import classification_report
print classification_report(y_test,y_pred)


             precision    recall  f1-score   support

          0       0.92      0.98      0.95     10992
          1       0.60      0.29      0.39      1365

avg / total       0.88      0.90      0.88     12357
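
The weak recall on the positive class reflects the class imbalance, but KNN is also being run here on unscaled features, where large-magnitude columns such as nr.employed (~5000) dominate the Euclidean distance. Standardizing before KNN usually helps; a hedged sketch (Python 3 syntax, synthetic data with one deliberately exaggerated feature so it runs standalone):

```python
# Sketch: standardize features before KNN so no single large-magnitude
# column dominates the distance metric. Synthetic data for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X[:, 0] *= 1000.0  # mimic one feature on a much larger scale

raw = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5).mean()
scaled = cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    X, y, cv=5,
).mean()
print(round(raw, 3), round(scaled, 3))
```

Wrapping the scaler in a pipeline keeps the scaling parameters fit on training folds only, avoiding leakage during cross-validation.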


In [20]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.learning_curve import learning_curve  # sklearn.model_selection in sklearn >= 0.18

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y values plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt

In [21]:
%%time
# This plots one learning curve for one model; try a few more.
_ = plot_learning_curve(RandomForestClassifier(n_estimators=100),'test',X_train,y_train)


Wall time: 20.7 s
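
Besides a learning curve over training-set size, a validation curve over a single hyperparameter is another way to pick the "best parameters" the assignment asks for. A hedged sketch for KNN's n_neighbors (Python 3 syntax, modern scikit-learn, synthetic data; the parameter range is illustrative):

```python
# Sketch: a validation curve over n_neighbors to choose KNN's parameter,
# complementing the learning curve above. Synthetic data for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
param_range = [1, 3, 5, 9, 15]
train_scores, test_scores = validation_curve(
    KNeighborsClassifier(), X, y,
    param_name='n_neighbors', param_range=param_range, cv=5)

# Pick the k with the best mean cross-validation score.
best_k = param_range[int(np.argmax(test_scores.mean(axis=1)))]
print('best n_neighbors:', best_k)
```

Plotting the mean train and test scores against the parameter range shows the usual overfit/underfit trade-off: tiny k overfits, very large k underfits.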

In [ ]: