Appendix A - Cross-Validation

Suppose we don't have access to the Kaggle leaderboard; we still have a means of checking how well we're doing: cross-validation. The process is as follows: we divide the data set into two parts, using the first to train our model and the second to make predictions. Since we have the actual outcomes for the second part, we can use them as a basis for comparison and calculate the accuracy of our predictions. Dividing the data set into two is called 2-fold cross-validation, with each fold being one of the two partitions of the data set.

Cross-validation is best illustrated by an example. We'll load and process the Titanic training set as before.


In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('../data/train.csv')

# drop the columns we won't use as features
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)

# fill in missing ages with the mean age
age_mean = df['Age'].mean()
df['Age'] = df['Age'].fillna(age_mean)

from scipy.stats import mode

# fill in missing ports of embarkation with the most common port
mode_embarked = mode(df['Embarked'])[0][0]
df['Embarked'] = df['Embarked'].fillna(mode_embarked)

# encode Sex as the numeric column Gender
df['Gender'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)

# create binary dummy columns for the port of embarkation
df = pd.concat([df, pd.get_dummies(df['Embarked'], prefix='Embarked')], axis=1)

df = df.drop(['Sex', 'Embarked'], axis=1)

# move the Survived column to the front
cols = df.columns.tolist()
cols = [cols[1]] + cols[0:1] + cols[2:]

df = df[cols]

We take the values of the training set, renaming the feature columns X and the outcome column y. We also record the halfway point n, which we'll use to split the rows into two folds.


In [2]:
train_data = df.values[:891]

X = train_data[:, 2:]   # feature columns (everything after Survived and PassengerId)
y = train_data[:, 0]    # the Survived column

n = len(df) // 2        # the halfway point at which we split the rows into two folds

We divide X and y into two, using the first fold as our new training set (X_train and y_train) and the second as our new test set (X_test and y_test). We train our model on X_train and y_train, and make predictions on X_test. Finally, we compare our predictions y_prediction against the actual outcomes y_test, and evaluate the accuracy of our predictions.


In [3]:
X_train = X[:n, :]
y_train = y[:n]

X_test = X[n:, :]
y_test = y[n:]

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
model = model.fit(X_train, y_train)
y_prediction = model.predict(X_test)

print "prediction accuracy:", np.sum(y_test == y_prediction)*1./len(y_test)


prediction accuracy: 0.791479820628

Now we swap the order, making the second fold our new training set and the first fold our new test set.


In [4]:
X_train, X_test = X_test, X_train
y_train, y_test = y_test, y_train

model = RandomForestClassifier(n_estimators=100)
model = model.fit(X_train, y_train)
y_prediction = model.predict(X_test)

print "prediction accuracy:", np.sum(y_test == y_prediction)*1./len(y_test)


prediction accuracy: 0.779775280899

Hence we see that our model achieves close to 80% accuracy on both folds. GridSearchCV, which we used previously, applies the same concept of cross-validation when comparing the performance of different tuning parameters.
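
For instance, here is a minimal sketch of how GridSearchCV ties in, assuming the older scikit-learn API used elsewhere in this appendix (GridSearchCV then lived in sklearn.grid_search; in recent releases it is in sklearn.model_selection). The parameter grid shown is purely illustrative.

from sklearn.grid_search import GridSearchCV

# candidate settings to compare; each combination is scored by cross-validation
parameter_grid = {'n_estimators': [50, 100],
                  'max_features': ['sqrt', 'log2']}

grid_search = GridSearchCV(RandomForestClassifier(), parameter_grid, cv=2)
grid_search = grid_search.fit(X, y)

print "best parameters:", grid_search.best_params_
print "best cross-validation accuracy:", grid_search.best_score_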

We can generate cross-validation folds automatically with Scikit-learn. KFold divides our data set into the required number of folds.


In [5]:
from sklearn.cross_validation import KFold

cv = KFold(n=len(train_data), n_folds=2)

for training_set, test_set in cv:
    X_train = X[training_set]
    y_train = y[training_set]
    X_test = X[test_set]
    y_test = y[test_set]
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)
    y_prediction = model.predict(X_test)
    print "prediction accuracy:", np.sum(y_test == y_prediction)*1./len(y_test)


prediction accuracy: 0.780269058296
prediction accuracy: 0.786516853933

It is important to note that with 2-fold cross-validation, each model is trained on only half of the data, a substantially smaller training set. This is why cross-validation is generally recommended with a larger number of folds, usually between 5 and 10. With 10-fold cross-validation, 9 folds serve as the training set and 1 fold as the test set, so each model is trained on 90% of the data.
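
As a sketch, 10-fold cross-validation on the same data can be run in a few lines with cross_val_score, which builds the folds, fits the model on each training portion, and returns the per-fold accuracy (again assuming the older scikit-learn module used above; in recent releases the function lives in sklearn.model_selection).

from sklearn.cross_validation import cross_val_score

model = RandomForestClassifier(n_estimators=100)

# ten folds: each model trains on 90% of the rows and is scored on the remaining 10%
scores = cross_val_score(model, X, y, cv=10)

print "per-fold accuracy:", scores
print "mean accuracy:", scores.mean()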