0. Introduction

The following notebook demonstrates prediction-wrapper, a set of utility classes that makes it much easier to run scikit-learn machine learning experiments.

This demo shows how to set up a classification wrapper that runs 5-fold cross-validation on 3 different classification models with 3 performance metrics, using only a few lines of code.
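For a sense of what the wrapper abstracts away, the same kind of experiment in plain scikit-learn typically looks something like the sketch below (synthetic data via make_classification stands in for a real dataset; the setup is illustrative, not prediction-wrapper's actual implementation):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=0),
}

# One scoring metric here; the wrapper adds more metrics and bookkeeping.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```

Doing this for several models and several metrics, while keeping the out-of-fold predictions around, is exactly the boilerplate the wrapper is meant to remove.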


In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

from binary_classifier_wrappers import KfoldBinaryClassifierWrapper
from metric_wrappers import RSquare, AUC, RMSE

1. Load and prepare data

Load the Titanic data from a local file, downloaded from https://www.kaggle.com/c/titanic/data. Since this is only a demo, only the training data is used.


In [2]:
titanic = pd.read_csv("./data/train.csv")

Display some metadata about the data frame.


In [3]:
titanic.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

We see that the 'Age' feature has some missing values, so fill them with the column median.


In [4]:
titanic["Age"].fillna(titanic["Age"].median(), inplace=True)
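Filling with the full-sample median is fine for a demo, but note that it leaks a little information across cross-validation folds (each held-out fold contributes to the median used to impute it). A leakage-free alternative, sketched here with scikit-learn's SimpleImputer inside a Pipeline, learns the median from the training folds only:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# The imputer learns the median from whatever data .fit() sees,
# so inside cross-validation the held-out fold never influences it.
pipeline = Pipeline([
    ('impute_age', SimpleImputer(strategy='median')),
    ('model', LogisticRegression(max_iter=1000)),
])

# Tiny illustrative input with one missing 'Age' value.
X = np.array([[22.0], [38.0], [np.nan], [35.0]])
y = np.array([0, 1, 1, 1])
pipeline.fit(X, y)
```

On a dataset this small the difference is negligible, which is why the simple fillna is acceptable for the demo.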

Drop unused features from the data frame.


In [5]:
titanic = titanic.drop(['PassengerId','Name','Ticket', 'Cabin', 'Embarked'], axis=1)

2. Set up model inputs

Build a list of feature names from the data frame. Note that we need to drop the 'Survived' column from the input features.


In [6]:
all_feature_names = titanic.columns.tolist()
all_feature_names.remove('Survived')
all_feature_names


Out[6]:
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']

Build a list of categorical feature names.


In [7]:
categorical_feature_names = ['Pclass', 'Sex']

And set the name of the label column in the data frame.


In [8]:
label_name = 'Survived'

Initialize 3 classification models.


In [9]:
lr_model = LogisticRegression()
svm_model = SVC(probability=True)
rf_model = RandomForestClassifier()

In [10]:
model_dict = {'Logistic Regression': lr_model,
              'SVM': svm_model,
              'Random Forest': rf_model}

3. Run a classification wrapper with multiple models and multiple metrics

Initialize the classification wrapper with 5-fold cross-validation; this is where the magic happens.


In [11]:
k_fold_binary = KfoldBinaryClassifierWrapper(titanic, label_name,
                                             all_feature_names, categorical_feature_names, k=5)
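The wrapper's internals aren't shown in this demo, but conceptually it presumably does something like the following sketch: one-hot encode the categorical features, then collect out-of-fold predicted probabilities with a stratified k-fold split. StratifiedKFold and get_dummies are standard pandas/sklearn; the function itself is an assumption about prediction-wrapper's design, not its actual code:

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def kfold_predict_proba(df, label_name, feature_names, categorical_names, model, k=5):
    """Sketch of a k-fold classifier wrapper: one-hot encode the
    categorical features, then fill in out-of-fold predicted
    probabilities so every row's prediction comes from a model
    that never saw that row during training."""
    X = pd.get_dummies(df[feature_names], columns=categorical_names)
    y = df[label_name]
    out = pd.DataFrame({'label': y, 'pred_prob': 0.0})
    for train_idx, test_idx in StratifiedKFold(n_splits=k).split(X, y):
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        prob_col = out.columns.get_loc('pred_prob')
        out.iloc[test_idx, prob_col] = model.predict_proba(X.iloc[test_idx])[:, 1]
    return out
```

A result shaped like this (one label and one predicted probability per row) is consistent with the `pred_result.label` / `pred_result.pred_prob` attributes used later in the notebook.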

Build a table to store results.


In [12]:
model_performance_table = pd.DataFrame(index=range(len(model_dict)),
                                       columns=['Model', 'AUC', 'r^2', 'RMSE'])

Run the classification wrapper with each of the 3 models, and evaluate the predictions with 3 performance metrics.


In [13]:
for n, name in enumerate(model_dict.keys()):
    k_fold_binary.set_model(model_dict[name])
    pred_result = k_fold_binary.run()

    # .loc replaces the long-deprecated .ix indexer
    model_performance_table.loc[n, 'Model'] = name
    model_performance_table.loc[n, 'AUC'] = AUC.measure(pred_result.label, pred_result.pred_prob)
    model_performance_table.loc[n, 'r^2'] = RSquare.measure(pred_result.label, pred_result.pred_prob)
    model_performance_table.loc[n, 'RMSE'] = RMSE.measure(pred_result.label, pred_result.pred_prob)

Display results.


In [14]:
model_performance_table = model_performance_table.sort_values(by='AUC', ascending=False).reset_index(drop=True)
model_performance_table


Out[14]:
                 Model       AUC       r^2      RMSE
0  Logistic Regression  0.841261   0.38717   0.38072
1        Random Forest  0.834897  0.356306  0.397467
2                  SVM   0.80583  0.287623  0.410927
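The metric wrappers used above (AUC, RSquare, RMSE) can be cross-checked against scikit-learn's metric functions directly. A sketch on toy labels and probabilities (roc_auc_score, r2_score, and mean_squared_error are standard sklearn; the toy values are illustrative only):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, r2_score, mean_squared_error

# Toy labels and out-of-fold predicted probabilities.
label = np.array([0, 0, 1, 1])
pred_prob = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(label, pred_prob)           # rank-based; one misordered pair here
r2 = r2_score(label, pred_prob)                  # 1 - SS_res / SS_tot
rmse = np.sqrt(mean_squared_error(label, pred_prob))
print(auc, r2, rmse)
```

Note that RMSE is lower-is-better while AUC and r^2 are higher-is-better, which is why sorting the table by AUC descending also happens to put the smallest RMSE first here.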
