Even without access to the Kaggle leaderboard, we can still check how well we're doing by using cross-validation. The process is as follows: we divide the data set into two, using the first part to train our model and the second to make predictions. Since we have the actual outcomes for the second part, we can compare them against our predictions and calculate the accuracy. We then swap the roles of the two parts and repeat. Dividing the data set into two in this way is called 2-fold cross-validation, with each fold being one of the partitions of the data set.
Cross-validation is best illustrated by an example. We'll load and process the Titanic training set as before.
In [1]:
import pandas as pd
import numpy as np
df = pd.read_csv('../data/train.csv')
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)
age_mean = df['Age'].mean()
df['Age'] = df['Age'].fillna(age_mean)
# Fill missing ports with the most common one; Series.mode() ignores missing values
mode_embarked = df['Embarked'].mode()[0]
df['Embarked'] = df['Embarked'].fillna(mode_embarked)
df['Gender'] = df['Sex'].map({'female': 0, 'male': 1}).astype(int)
# One-hot encode the port of embarkation; dtype=int keeps the new columns numeric
embarked_dummies = pd.get_dummies(df['Embarked'], prefix='Embarked', dtype=int)
df = pd.concat([df, embarked_dummies], axis=1)
df = df.drop(['Sex', 'Embarked'], axis=1)
cols = df.columns.tolist()
cols = [cols[1]] + cols[0:1] + cols[2:]
df = df[cols]
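The two less obvious preprocessing steps, filling in missing values and one-hot encoding a categorical column, can be tried on a toy frame. This is only a minimal sketch: the `toy` DataFrame and its values are made up for illustration and are not part of the Titanic data, and `dtype=int` is a choice made here so the dummy columns show up as 0 and 1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing age and one missing port
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0, 30.0],
    'Embarked': ['S', np.nan, 'C', 'S'],
})

# Replace the missing age with the mean of the known ages
toy['Age'] = toy['Age'].fillna(toy['Age'].mean())

# Replace the missing port with the most frequent port
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])

# One 0/1 column per distinct port
dummies = pd.get_dummies(toy['Embarked'], prefix='Embarked', dtype=int)
print(toy)
print(dummies)
```

Each row of `dummies` has exactly one 1, marking that passenger's port.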
For convenience, we select an even number of rows. We'll rename the feature columns of the training set as X, and the outcomes as y.
In [2]:
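train_data = df.values[:890]  # 890 rows, so the data splits into two equal halves
X = train_data[:, 2:]  # features (column 1, PassengerId, is skipped)
y = train_data[:, 0]  # outcomes (Survived)
n = len(train_data) // 2  # integer division, so n can be used as a slice index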
We divide X and y into two, using the first fold as our new training set (X_train and y_train) and the second as our new test set (X_test and y_test). We train our model with X_train and y_train, and make predictions on X_test. Finally, we compare our predictions on the second data set, y_prediction, against the actual outcomes y_test, and evaluate the accuracy of our predictions.
In [3]:
X_train = X[:n, :]
y_train = y[:n]
X_test = X[n:, :]
y_test = y[n:]
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)
model = model.fit(X_train, y_train)
y_prediction = model.predict(X_test)
print("prediction accuracy: %.3f" % np.mean(y_test == y_prediction))
Now we swap the order, making the second fold our new training set and the first fold our new test set.
In [4]:
X_train, X_test = X_test, X_train
y_train, y_test = y_test, y_train
model = RandomForestClassifier(n_estimators=100)
model = model.fit(X_train, y_train)
y_prediction = model.predict(X_test)
print("prediction accuracy: %.3f" % np.mean(y_test == y_prediction))
Hence we see that our model has close to 80% accuracy. GridSearchCV, which we used previously, applies the same concept of cross-validation to compare the performance of different tuning parameter values.
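To make the link with GridSearchCV concrete, here is a minimal sketch of a grid search that scores each candidate by 2-fold cross-validation. It runs on synthetic data from `make_classification` rather than the Titanic set, and the parameter grid is a hypothetical one chosen only for illustration. Note that when `cv` is an integer and the model is a classifier, scikit-learn stratifies the folds by class, so the split is not identical to the plain first-half/second-half split used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the features and outcomes (an assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Hypothetical grid: two candidate values for a single tuning parameter
param_grid = {'max_depth': [2, 5]}

# cv=2 requests 2-fold cross-validation: every candidate is trained on one
# fold and scored on the other, then the folds are swapped
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid,
    cv=2,
)
search.fit(X_demo, y_demo)

# best_score_ is the mean accuracy across the folds for the winning candidate
print("best parameters:", search.best_params_)
print("mean cross-validated accuracy: %.3f" % search.best_score_)
```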
We can generate cross-validation folds automatically with Scikit-learn. KFold divides our data set into the required number of folds.
In [5]:
from sklearn.model_selection import KFold
cv = KFold(n_splits=2)
for training_set, test_set in cv.split(X):
    # training_set and test_set are arrays of row indices
    X_train = X[training_set]
    y_train = y[training_set]
    X_test = X[test_set]
    y_test = y[test_set]
    model = RandomForestClassifier(n_estimators=100)
    model.fit(X_train, y_train)
    y_prediction = model.predict(X_test)
    print("prediction accuracy: %.3f" % np.mean(y_test == y_prediction))
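The loop over folds can also be left to scikit-learn. The sketch below uses `cross_val_score`, which fits a fresh copy of the model on each training fold and returns one score per fold; for a classifier the default score is accuracy. The data here is synthetic (generated with `make_classification`), so the numbers it prints say nothing about the Titanic model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the features and outcomes (an assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

model = RandomForestClassifier(n_estimators=20, random_state=0)

# One accuracy value per fold, using the same KFold splitter as above
scores = cross_val_score(model, X_demo, y_demo, cv=KFold(n_splits=2))

print("accuracy per fold:", scores)
print("mean accuracy: %.3f" % scores.mean())
```

Averaging the per-fold scores gives a single summary figure for the model.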
It is important to note that, for 2-fold cross-validation, the model is trained on a substantially smaller data set: only half of the available data. This is why cross-validation is generally recommended with a larger number of folds, usually between 5 and 10. For 10-fold cross-validation, 9 folds serve as the training set and 1 fold as the test set in each round.
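The effect of the number of folds on the training set size is easy to tabulate. The helper below, `fold_sizes`, is a hypothetical function written for this illustration, and the row count of 890 is assumed purely as an example. It spreads any remainder over the first folds, which is one simple way to handle data sets that do not divide evenly.

```python
def fold_sizes(n_samples, n_folds):
    """Return a (training size, test size) pair for each fold."""
    base, extra = divmod(n_samples, n_folds)
    # Spread any remainder over the first folds, one extra sample each
    test_sizes = [base + 1 if i < extra else base for i in range(n_folds)]
    return [(n_samples - t, t) for t in test_sizes]


for k in (2, 5, 10):
    train_size, test_size = fold_sizes(890, k)[0]
    print("%2d folds: train on %d rows, test on %d rows" % (k, train_size, test_size))

# Output:
#  2 folds: train on 445 rows, test on 445 rows
#  5 folds: train on 712 rows, test on 178 rows
# 10 folds: train on 801 rows, test on 89 rows
```

With 10 folds the model sees 90% of the rows in every round, compared with 50% for 2 folds.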