This notebook demonstrates prediction-wrapper, a set of utility classes that makes it much easier to run scikit-learn machine learning experiments.
The demo sets up a classification wrapper that runs 5-fold cross-validation on 3 different classification models and evaluates them with 3 performance metrics, all in only a few lines of code.
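For readers unfamiliar with the wrapper, here is a rough equivalent of the workflow in plain scikit-learn (a minimal sketch on toy data; the names and structure here are illustrative, not the wrapper's actual API):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

# Toy data standing in for the Titanic features and labels used below.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 5-fold out-of-fold predicted probabilities for one model; the wrapper
# presumably does something similar for each model you hand it.
pred_prob = cross_val_predict(LogisticRegression(), X, y,
                              cv=5, method='predict_proba')[:, 1]
print(roc_auc_score(y, pred_prob))
```

The wrapper bundles this pattern so the same cross-validation splits can be reused across several models and metrics.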
In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from binary_classifier_wrappers import KfoldBinaryClassifierWrapper
from metric_wrappers import RSquare, AUC, RMSE
Load the Titanic data from a local file, downloaded from https://www.kaggle.com/c/titanic/data. Since this is only a demo, only the training split is used.
In [2]:
titanic = pd.read_csv("./data/train.csv")
Display some metadata about the data frame.
In [3]:
titanic.info()
We see that the 'Age' feature has some missing values, so fill them with the column median.
In [4]:
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
Drop features that are not useful from the data frame.
In [5]:
titanic = titanic.drop(['PassengerId','Name','Ticket', 'Cabin', 'Embarked'], axis=1)
In [6]:
all_feature_names = titanic.columns.tolist()
all_feature_names.remove('Survived')
all_feature_names
Out[6]:
Build a list of categorical feature names.
In [7]:
categorical_feature_names = ['Pclass', 'Sex']
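Categorical columns like these need to be converted to numeric form before most scikit-learn estimators can use them; the wrapper presumably handles this internally. A minimal sketch of one common approach, one-hot encoding with `pd.get_dummies` (on a tiny illustrative frame, not the Titanic data):

```python
import pandas as pd

# Tiny frame mimicking the two categorical Titanic columns.
df = pd.DataFrame({'Pclass': [1, 3, 2], 'Sex': ['male', 'female', 'male']})

# Expand each categorical column into one indicator column per level.
encoded = pd.get_dummies(df, columns=['Pclass', 'Sex'])
print(encoded.columns.tolist())
# → ['Pclass_1', 'Pclass_2', 'Pclass_3', 'Sex_female', 'Sex_male']
```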
And set the name of the label column in the data frame.
In [8]:
label_name = 'Survived'
Initialize 3 classification models.
In [9]:
lr_model = LogisticRegression()
svm_model = SVC(probability=True)
rf_model = RandomForestClassifier()
In [10]:
model_dict = {'Logistic Regression': lr_model,
              'SVM': svm_model,
              'Random Forest': rf_model}
In [11]:
k_fold_binary = KfoldBinaryClassifierWrapper(titanic, label_name,
                                             all_feature_names,
                                             categorical_feature_names, k=5)
Build a table to store results.
In [12]:
model_performance_table = pd.DataFrame(index=range(len(model_dict)),
                                       columns=['Model', 'AUC', 'r^2', 'RMSE'])
Run the classification wrapper with the 3 models, and evaluate each one with the 3 performance metrics.
In [13]:
for n, name in enumerate(model_dict.keys()):
    k_fold_binary.set_model(model_dict[name])
    pred_result = k_fold_binary.run()
    model_performance_table.loc[n, 'Model'] = name
    model_performance_table.loc[n, 'AUC'] = AUC.measure(pred_result.label, pred_result.pred_prob)
    model_performance_table.loc[n, 'r^2'] = RSquare.measure(pred_result.label, pred_result.pred_prob)
    model_performance_table.loc[n, 'RMSE'] = RMSE.measure(pred_result.label, pred_result.pred_prob)
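The metric wrappers compare the true labels against the out-of-fold predicted probabilities. A minimal sketch of what the three metrics compute, using their standard scikit-learn/NumPy counterparts (the `label` and `pred_prob` arrays below are made-up illustrative values, and the equivalence to the wrapper's internals is an assumption):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score, roc_auc_score

# Illustrative true labels and predicted probabilities.
label = np.array([0, 1, 1, 0, 1])
pred_prob = np.array([0.2, 0.8, 0.6, 0.3, 0.9])

auc = roc_auc_score(label, pred_prob)            # ranking quality
r2 = r2_score(label, pred_prob)                  # variance explained
rmse = np.sqrt(mean_squared_error(label, pred_prob))  # probability error
print(auc, r2, rmse)
```

Note that r^2 and RMSE are more commonly regression metrics; applying them to predicted probabilities, as this demo does, treats the 0/1 labels as numeric targets.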
Display results.
In [14]:
model_performance_table = model_performance_table.sort_values(by='AUC', ascending=False).reset_index(drop=True)
model_performance_table
Out[14]:
In [ ]: