0. Introduction

The following notebook demonstrates prediction-wrapper, a set of utility classes that makes it much easier to run scikit-learn machine learning experiments.

This demo shows how to set up a classification wrapper that runs 5-fold cross-validation on 3 different classification models with 3 performance metrics, using only a few lines of code.
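For a sense of what the wrapper abstracts away, the same kind of experiment in plain scikit-learn typically looks something like the sketch below (synthetic data via make_classification stands in for a real dataset; the setup is illustrative, not prediction-wrapper's actual implementation):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a real dataset.
X, y = make_classification(n_samples=500, n_features=6, random_state=0)

models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(random_state=0),
}

# One scoring metric here; the wrapper adds more metrics and bookkeeping.
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc')
    print(f"{name}: mean AUC = {scores.mean():.3f}")
```

Doing this for several models and several metrics, while keeping the out-of-fold predictions around, is exactly the boilerplate the wrapper is meant to remove.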


In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

from binary_classifier_wrappers import KfoldBinaryClassifierWrapper
from metric_wrappers import RSquare, AUC, RMSE

1. Load and prepare data

Load the Titanic data from a local file, downloaded from https://www.kaggle.com/c/titanic/data. Since this is only a demo, only the training data is used.


In [2]:
titanic = pd.read_csv("./data/train.csv")

Display some metadata about the data frame.


In [3]:
titanic.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

We see that the 'Age' feature has some missing values, so fill them with the column median.


In [4]:
titanic["Age"].fillna(titanic["Age"].median(), inplace=True)
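Filling with the full-sample median is fine for a demo, but note that it leaks a little information across cross-validation folds (each held-out fold contributes to the median used to impute it). A leakage-free alternative, sketched here with scikit-learn's SimpleImputer inside a Pipeline, learns the median from the training folds only:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# The imputer learns the median from whatever data .fit() sees,
# so inside cross-validation the held-out fold never influences it.
pipeline = Pipeline([
    ('impute_age', SimpleImputer(strategy='median')),
    ('model', LogisticRegression(max_iter=1000)),
])

# Tiny illustrative input with one missing 'Age' value.
X = np.array([[22.0], [38.0], [np.nan], [35.0]])
y = np.array([0, 1, 1, 1])
pipeline.fit(X, y)
```

On a dataset this small the difference is negligible, which is why the simple fillna is acceptable for the demo.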

Drop unused features from the data frame.


In [5]:
titanic = titanic.drop(['PassengerId','Name','Ticket', 'Cabin', 'Embarked'], axis=1)

2. Set up model inputs

Build a list of feature names from the data frame. Note that we need to drop the 'Survived' column from the input features.


In [6]:
all_feature_names = titanic.columns.tolist()
all_feature_names.remove('Survived')
all_feature_names


Out[6]:
['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']

Build a list of categorical feature names.


In [7]:
categorical_feature_names = ['Pclass', 'Sex']

And set the name of the label column in the data frame.


In [8]:
label_name = 'Survived'

Initialize 3 classification models.


In [9]:
lr_model = LogisticRegression()
svm_model = SVC(probability=True)
rf_model = RandomForestClassifier()

In [10]:
model_dict = {'Logistic Regression': lr_model,
              'SVM': svm_model,
              'Random Forest': rf_model}

3. Run a classification wrapper with multiple models and multiple metrics

Initialize the classification wrapper with 5-fold cross-validation; this is where the magic happens.


In [11]:
k_fold_binary = KfoldBinaryClassifierWrapper(titanic, label_name,
                                             all_feature_names, categorical_feature_names, k=5)
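The wrapper's internals aren't shown in this demo, but conceptually it presumably does something like the following sketch: one-hot encode the categorical features, then collect out-of-fold predicted probabilities with a stratified k-fold split. StratifiedKFold and get_dummies are standard pandas/sklearn; the function itself is an assumption about prediction-wrapper's design, not its actual code:

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

def kfold_predict_proba(df, label_name, feature_names, categorical_names, model, k=5):
    """Sketch of a k-fold classifier wrapper: one-hot encode the
    categorical features, then fill in out-of-fold predicted
    probabilities so every row's prediction comes from a model
    that never saw that row during training."""
    X = pd.get_dummies(df[feature_names], columns=categorical_names)
    y = df[label_name]
    out = pd.DataFrame({'label': y, 'pred_prob': 0.0})
    for train_idx, test_idx in StratifiedKFold(n_splits=k).split(X, y):
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        prob_col = out.columns.get_loc('pred_prob')
        out.iloc[test_idx, prob_col] = model.predict_proba(X.iloc[test_idx])[:, 1]
    return out
```

A result shaped like this (one label and one predicted probability per row) is consistent with the `pred_result.label` / `pred_result.pred_prob` attributes used later in the notebook.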

Build a table to store results.


In [12]:
model_performance_table = pd.DataFrame(index=range(len(model_dict)),
                                       columns=['Model', 'AUC', 'r^2', 'RMSE'])

Run the classification wrapper with each of the 3 models, and evaluate the predictions with 3 performance metrics.


In [13]:
for n, name in enumerate(model_dict.keys()):
    k_fold_binary.set_model(model_dict[name])
    pred_result = k_fold_binary.run()

    # .loc replaces the long-deprecated .ix indexer
    model_performance_table.loc[n, 'Model'] = name
    model_performance_table.loc[n, 'AUC'] = AUC.measure(pred_result.label, pred_result.pred_prob)
    model_performance_table.loc[n, 'r^2'] = RSquare.measure(pred_result.label, pred_result.pred_prob)
    model_performance_table.loc[n, 'RMSE'] = RMSE.measure(pred_result.label, pred_result.pred_prob)

Display results.


In [14]:
model_performance_table = model_performance_table.sort_values(by='AUC', ascending=False).reset_index(drop=True)
model_performance_table


Out[14]:
                 Model       AUC       r^2      RMSE
0  Logistic Regression  0.841261   0.38717   0.38072
1        Random Forest  0.834897  0.356306  0.397467
2                  SVM   0.80583  0.287623  0.410927
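The metric wrappers used above (AUC, RSquare, RMSE) can be cross-checked against scikit-learn's metric functions directly. A sketch on toy labels and probabilities (roc_auc_score, r2_score, and mean_squared_error are standard sklearn; the toy values are illustrative only):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, r2_score, mean_squared_error

# Toy labels and out-of-fold predicted probabilities.
label = np.array([0, 0, 1, 1])
pred_prob = np.array([0.1, 0.4, 0.35, 0.8])

auc = roc_auc_score(label, pred_prob)           # rank-based; one misordered pair here
r2 = r2_score(label, pred_prob)                  # 1 - SS_res / SS_tot
rmse = np.sqrt(mean_squared_error(label, pred_prob))
print(auc, r2, rmse)
```

Note that RMSE is lower-is-better while AUC and r^2 are higher-is-better, which is why sorting the table by AUC descending also happens to put the smallest RMSE first here.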
