Paris Saclay Center for Data Science

Titanic RAMP: survival prediction of Titanic passengers

Benoit Playe (Institut Curie/Mines ParisTech), Chloé-Agathe Azencott (Institut Curie/Mines ParisTech), Alex Gramfort (LTCI/Télécom ParisTech), Balázs Kégl (LAL/CNRS)

Introduction

This is an initiation project, designed to introduce RAMP and help you get to know how it works.

The goal is to develop prediction models able to identify people who survived the sinking of the Titanic, based on gender, age, and ticketing information.

The data we will manipulate comes from the Titanic Kaggle challenge.

Requirements

  • numpy>=1.10.0
  • matplotlib>=1.5.0
  • pandas>=0.20.0 (for the pandas.plotting module)
  • scikit-learn>=0.20 (make_column_transformer and sklearn.impute require v0.20 or later)
  • seaborn>=0.7.1

In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Exploratory data analysis

Loading the data


In [ ]:
train_filename = 'data/train.csv'
data = pd.read_csv(train_filename)
y_train = data['Survived'].values
X_train = data.drop(['Survived', 'PassengerId'], axis=1)
X_train.head(5)

In [ ]:
data.describe()

In [ ]:
data.count()

The original training data frame has 891 rows. In the starting kit, we give you a subset of 445 rows. Some passengers have missing information: in particular, the Age and Cabin fields can be missing. The meaning of the columns is explained on the challenge website.
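
As a complement to data.count() above, you can count the missing values per column directly (a quick check, not part of the original kit):

In [ ]:
# Number of missing values in each column
data.isnull().sum()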

Predicting survival

The goal is to predict whether a passenger survived from the other known attributes. Let us group the data according to the Survived column:


In [ ]:
data.groupby('Survived').count()

About two-thirds of the passengers perished in the event. A dummy classifier that systematically returns "0" ("did not survive") would therefore have an accuracy of about 62%, higher than that of a random model.
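
You can verify this baseline with scikit-learn's DummyClassifier (a quick sanity check, not part of the original kit):

In [ ]:
from sklearn.dummy import DummyClassifier

# Always predicts the majority class; the features in X are ignored
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
print('baseline accuracy: %.2f' % dummy.score(X_train, y_train))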

Some plots

Feature densities and co-evolution

A scatterplot matrix allows us to visualize:

  • on the diagonal, the density estimation for each feature
  • on each of the off-diagonal plots, a scatterplot between two features. Each dot represents an instance.

In [ ]:
from pandas.plotting import scatter_matrix
scatter_matrix(data[['Fare', 'Pclass', 'Age']], alpha=0.2,
               figsize=(8, 8), diagonal='kde');

Non-linearly transformed data

The Fare variable has a very heavy tail. We can log-transform it.
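
To quantify how heavy the tail is, you can compare the skewness of Fare before and after the transformation (a quick check, not part of the original kit; values far above 0 indicate a long right tail):

In [ ]:
# Skewness of the raw and log-transformed fares
print(data['Fare'].skew())
print(np.log(data['Fare'] + 10.).skew())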


In [ ]:
# Keep the columns we need, then add a log-transformed fare;
# the +10 offset avoids log(0) for free tickets
data_plot = data[['Age', 'Survived', 'Fare']].assign(
    LogFare=lambda x: np.log(x.Fare + 10.))
scatter_matrix(data_plot[['Age', 'LogFare']], alpha=0.2,
               figsize=(8, 8), diagonal='kde');

data_plot.plot(kind='scatter', x='Age', y='LogFare', c='Survived', s=50, cmap=plt.cm.Paired);

Plot the bivariate distributions and marginals of two variables

Another way of visualizing relationships between variables is to plot their bivariate distributions.


In [ ]:
import seaborn as sns

sns.set()
sns.set_style("whitegrid")
sns.jointplot(data_plot.Age[data_plot.Survived == 1],
              data_plot.LogFare[data_plot.Survived == 1],
              kind="kde", size=7, space=0, color="b");

sns.jointplot(data_plot.Age[data_plot.Survived == 0],
              data_plot.LogFare[data_plot.Survived == 0],
              kind="kde", size=7, space=0, color="y");

Making predictions

A basic prediction workflow, using scikit-learn, will be presented below.

First, we will perform some simple preprocessing of our data:

  • one-hot encode the categorical features: Sex, Pclass, Embarked
  • for the numerical columns Age, SibSp, Parch, Fare, fill in missing values with a default value (-1)
  • all remaining columns will be dropped

This can be done succinctly with make_column_transformer, which performs specific transformations on specific features.


In [ ]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

categorical_cols = ['Sex', 'Pclass', 'Embarked']
numerical_cols = ['Age', 'SibSp', 'Parch', 'Fare']

preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    (SimpleImputer(strategy='constant', fill_value=-1), numerical_cols),
)
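
To sanity-check the transformer, you can fit it on a tiny hypothetical data frame and inspect the output (an illustration only, not part of the original kit):

In [ ]:
# Two hypothetical rows: the NaNs in Age and Fare are replaced by -1,
# and each categorical column is expanded into one-hot indicator columns
demo = pd.DataFrame({'Sex': ['male', 'female'], 'Pclass': [3, 1],
                     'Embarked': ['S', 'C'], 'Age': [22., np.nan],
                     'SibSp': [1, 0], 'Parch': [0, 0], 'Fare': [7.25, np.nan]})
preprocessor.fit_transform(demo)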

The preprocessor object created with make_column_transformer can be used in a scikit-learn pipeline. A pipeline assembles several steps together and can be used to cross-validate an entire workflow. Generally, transformation steps are combined with a final estimator.

We will create a pipeline consisting of the preprocessor created above and a final estimator, LogisticRegression.


In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('transformer', preprocessor),
    ('classifier', LogisticRegression()),
])

We can cross-validate our pipeline using cross_val_score. Below we specify cv=8, meaning that 8-fold cross-validation will be used (with stratified folds, since the estimator is a classifier). The Area Under the Receiver Operating Characteristic Curve (ROC AUC) score is calculated for each split, so the output is an array of 8 scores. Their mean and standard deviation are printed at the end.


In [ ]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, X_train, y_train, cv=8, scoring='roc_auc')

print("mean: %e (+/- %e)" % (scores.mean(), scores.std()))

Testing

Once you have created a model with cross-validation scores you are happy with, you can test how well it performs on the independent test data.

First we will read in our test data:


In [ ]:
test_filename = 'data/test.csv'
data = pd.read_csv(test_filename)
y_test = data['Survived'].values
X_test = data.drop(['Survived', 'PassengerId'], axis=1)
X_test.head(5)

Next we need to fit our pipeline on our training data:


In [ ]:
clf = pipeline.fit(X_train, y_train)

Now we can compute predictions on our test data. Since ROC AUC is computed from scores rather than hard class labels, we use the predicted probability of survival:


In [ ]:
# Predicted probability of the positive class (Survived == 1)
y_pred = pipeline.predict_proba(X_test)[:, 1]

Finally, we can calculate how well our model performed on the test data:


In [ ]:
from sklearn.metrics import roc_auc_score

score = roc_auc_score(y_test, y_pred)
score

RAMP submissions

To submit to the RAMP site, you will need to write a submission.py file defining a get_estimator function that returns a scikit-learn pipeline.

For example, to submit our basic example above, we would define our pipeline within the function and return the pipeline at the end. Remember to include all the necessary imports at the beginning of the file.


In [ ]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

def get_estimator():

    categorical_cols = ['Sex', 'Pclass', 'Embarked']
    numerical_cols = ['Age', 'SibSp', 'Parch', 'Fare']

    preprocessor = make_column_transformer(
        (OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        (SimpleImputer(strategy='constant', fill_value=-1), numerical_cols),
    )

    pipeline = Pipeline([
        ('transformer', preprocessor),
        ('classifier', LogisticRegression()),
    ])

    return pipeline
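
Before submitting, you can quickly check that get_estimator returns a working pipeline (this assumes X_train and y_train from earlier in this notebook are still in scope):

In [ ]:
from sklearn.model_selection import cross_val_score

# Smoke test: build the estimator and cross-validate it on the training data
estimator = get_estimator()
print(cross_val_score(estimator, X_train, y_train, cv=3, scoring='roc_auc').mean())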

If you take a look at the sample submission in the directory submissions/starting_kit, you will find a file named submission.py, which has the above code in it.

You can test that the sample submission works by running ramp_test_submission in your terminal (ensure that ramp-workflow has been installed and you are in the titanic ramp kit directory). Alternatively, within this notebook you can run:


In [ ]:
# !ramp_test_submission

To test that your own submission works, create a new folder within submissions and give it any name you wish. Within your new folder, save your submission.py file defining a get_estimator function. Test your submission locally by running:

ramp_test_submission --submission <folder>

where <folder> is the name of the new folder you created above.

Submitting to ramp.studio

Once you have found a good solution, you can submit it to ramp.studio. First, if it is your first time using RAMP, sign up; otherwise, log in. Then find the appropriate open event for the titanic challenge and sign up for the event. Note that both RAMP and event sign-ups are controlled by RAMP administrators, so there can be a delay between asking to sign up and being able to submit.

Once your sign-up request(s) have been accepted, you can go to your sandbox and copy-paste (or upload) your submission.py file. Save your submission, name it, then click 'submit'. The submission is trained and tested on our backend in the same way as ramp_test_submission does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in my submissions. Once it is trained, you will receive an e-mail, and your submission will show up on the public leaderboard.

If there is an error (despite having tested your submission locally with ramp_test_submission), it will show up in the "Failed submissions" table in my submissions. You can click on the error to see part of the trace.

After submission, do not forget to give credit to the previous submissions you reused or integrated into your submission.

The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.

The usual workflow with RAMP is to explore solutions in a notebook setting (refining feature transformations, selecting different models, perhaps doing some AutoML/hyperopt, etc.), then to test them with ramp_test_submission. The script prints mean cross-validation scores:

----------------------------
train auc = 0.85 ± 0.005
train acc = 0.81 ± 0.006
train nll = 0.45 ± 0.007
valid auc = 0.87 ± 0.023
valid acc = 0.81 ± 0.02
valid nll = 0.44 ± 0.024
test auc = 0.83 ± 0.006
test acc = 0.76 ± 0.003
test nll = 0.5 ± 0.005

The official score in this RAMP (the first score column after "historical contributivity" on the leaderboard) is the area under the ROC curve ("auc"), so the relevant line in the output of ramp_test_submission is valid auc = 0.87 ± 0.023.

More information

You can find more information in the README of the ramp-workflow library.

Contact

Don't hesitate to contact us.