Benoit Playe (Institut Curie/Mines ParisTech), Chloé-Agathe Azencott (Institut Curie/Mines ParisTech), Alex Gramfort (LTCI/Télécom ParisTech), Balázs Kégl (LAL/CNRS)
This is an initiation project to introduce RAMP and familiarize you with how it works.
The goal is to develop prediction models able to identify which passengers survived the sinking of the Titanic, based on gender, age, and ticketing information.
The data we will manipulate comes from the Titanic Kaggle challenge.
In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
In [ ]:
train_filename = 'data/train.csv'
data = pd.read_csv(train_filename)
y_train = data['Survived'].values
X_train = data.drop(['Survived', 'PassengerId'], axis=1)
X_train.head(5)
In [ ]:
data.describe()
In [ ]:
data.count()
The original training data frame has 891 rows. In the starting kit, we give you a subset of 445 rows. Some passengers have missing information: in particular, the Age and Cabin values can be missing. The meaning of the columns is explained on the challenge website.
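A quick way to check this is to count the missing values in each column; a short sketch using pandas:
In [ ]:
# Number of missing values per column; Age and Cabin have the most.
data.isnull().sum()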
In [ ]:
data.groupby('Survived').count()
About two-thirds of the passengers perished in the event. A dummy classifier that systematically returns "0" (did not survive) would therefore have an accuracy of about 62%, higher than that of a random model.
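To verify this baseline, here is a minimal sketch using scikit-learn's DummyClassifier, which always predicts the majority class and ignores the input features:
In [ ]:
from sklearn.dummy import DummyClassifier

# Always predict the most frequent class ("0" = did not survive).
dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_train, y_train)
# Mean accuracy on the training data; should be close to 0.62.
dummy.score(X_train, y_train)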
In [ ]:
from pandas.plotting import scatter_matrix
scatter_matrix(data.get(['Fare', 'Pclass', 'Age']), alpha=0.2,
               figsize=(8, 8), diagonal='kde');
In [ ]:
data_plot = data.assign(LogFare=lambda x: np.log(x.Fare + 10.))
scatter_matrix(data_plot.get(['Age', 'LogFare']), alpha=0.2, figsize=(8, 8), diagonal='kde');
data_plot.plot(kind='scatter', x='Age', y='LogFare', c='Survived', s=50, cmap=plt.cm.Paired);
In [ ]:
import seaborn as sns
sns.set()
sns.set_style("whitegrid")
sns.jointplot(x=data_plot.Age[data_plot.Survived == 1],
              y=data_plot.LogFare[data_plot.Survived == 1],
              kind="kde", height=7, space=0, color="b");
sns.jointplot(x=data_plot.Age[data_plot.Survived == 0],
              y=data_plot.LogFare[data_plot.Survived == 0],
              kind="kde", height=7, space=0, color="y");
First, we will perform some simple preprocessing of our data:

- one-hot encode the categorical columns Sex, Pclass, and Embarked;
- for the numerical columns Age, SibSp, Parch, and Fare, fill in missing values with a default value (-1).

This can be done succinctly with make_column_transformer, which performs specific transformations on specific features.
In [ ]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
categorical_cols = ['Sex', 'Pclass', 'Embarked']
numerical_cols = ['Age', 'SibSp', 'Parch', 'Fare']
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore'), categorical_cols),
    (SimpleImputer(strategy='constant', fill_value=-1), numerical_cols),
)
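To see what the preprocessor produces, you can fit and apply it to the training data and inspect the result (a quick sanity check, not required for the pipeline below):
In [ ]:
# Each categorical column is expanded into one indicator column per
# category, while the numerical columns are kept with -1 in place of
# missing values.
X_encoded = preprocessor.fit_transform(X_train)
X_encoded.shape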
The preprocessor object created with make_column_transformer can be used in a scikit-learn pipeline. A pipeline assembles several steps together and can be used to cross-validate an entire workflow. Generally, transformation steps are combined with a final estimator.
We will create a pipeline consisting of the preprocessor created above and a final estimator, LogisticRegression.
In [ ]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
pipeline = Pipeline([
    ('transformer', preprocessor),
    ('classifier', LogisticRegression()),
])
We can cross-validate our pipeline using cross_val_score. Below we specify cv=8, meaning 8-fold cross-validation splitting will be used (for classifiers, scikit-learn defaults to stratified folds). The Area Under the Receiver Operating Characteristic Curve (ROC AUC) is computed for each split. The output, scores, will be an array of the 8 fold scores. The mean and standard deviation of the 8 scores are printed at the end.
In [ ]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline, X_train, y_train, cv=8, scoring='roc_auc')
print("mean: %e (+/- %e)" % (scores.mean(), scores.std()))
In [ ]:
test_filename = 'data/test.csv'
data = pd.read_csv(test_filename)
y_test = data['Survived'].values
X_test = data.drop(['Survived', 'PassengerId'], axis=1)
X_test.head(5)
Next we need to fit our pipeline on our training data:
In [ ]:
clf = pipeline.fit(X_train, y_train)
Now we can predict on our test data. Since ROC AUC is computed from continuous scores rather than hard class labels, we use the predicted probability of the positive class:
In [ ]:
y_score = pipeline.predict_proba(X_test)[:, 1]
Finally, we can calculate how well our model performed on the test data:
In [ ]:
from sklearn.metrics import roc_auc_score
score = roc_auc_score(y_test, y_score)
score
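We can also compute the test accuracy from hard class predictions, to be compared with the roughly 62% accuracy of the dummy baseline discussed earlier:
In [ ]:
from sklearn.metrics import accuracy_score

# Hard class labels, unlike the probabilities used for ROC AUC.
accuracy_score(y_test, pipeline.predict(X_test))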
For submitting to the RAMP site, you will need to write a submission.py file that defines a get_estimator function that returns a scikit-learn pipeline.
For example, to submit our basic example above, we would define our pipeline within the function and return it at the end. Remember to include all the necessary imports at the beginning of the file.
In [ ]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
def get_estimator():
    categorical_cols = ['Sex', 'Pclass', 'Embarked']
    numerical_cols = ['Age', 'SibSp', 'Parch', 'Fare']
    preprocessor = make_column_transformer(
        (OneHotEncoder(handle_unknown='ignore'), categorical_cols),
        (SimpleImputer(strategy='constant', fill_value=-1), numerical_cols),
    )
    pipeline = Pipeline([
        ('transformer', preprocessor),
        ('classifier', LogisticRegression()),
    ])
    return pipeline
If you take a look at the sample submission in the directory submissions/starting_kit, you will find a file named submission.py, which contains the above code.
You can test that the sample submission works by running ramp_test_submission in your terminal (ensure that ramp-workflow has been installed and that you are in the titanic RAMP kit directory). Alternatively, within this notebook you can run:
In [ ]:
# !ramp_test_submission
To test that your own submission works, create a new folder within submissions and name it how you wish. Within your new folder, save your submission.py file that defines a get_estimator function. Test your submission locally by running:
ramp_test_submission --submission <folder>
where <folder> is the name of the new folder you created above.
Once you have found a good solution, you can submit it to ramp.studio. First, if it is your first time using RAMP, sign up; otherwise log in. Then, find the appropriate open event for the titanic challenge and sign up for the event. Note that both RAMP and event signups are controlled by RAMP administrators, so there can be a delay between asking for signup and being able to submit.
Once your signup request(s) have been accepted, you can go to your sandbox and copy-paste (or upload) your submission.py file. Save your submission, name it, then click 'submit'. The submission is trained and tested on our backend in the same way as ramp_test_submission does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in my submissions. Once it is trained, you will receive an email, and your submission will show up on the public leaderboard.
If there is an error (despite having tested your submission locally with ramp_test_submission), it will show up in the "Failed submissions" table in my submissions. You can click on the error to see part of the trace.
After submission, do not forget to give credit to the previous submissions you reused or integrated into your submission.
The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.
The usual workflow with RAMP is to explore solutions in a notebook setting: refining feature transformations, selecting different models, and perhaps doing some AutoML/hyperopt. You then test them with ramp_test_submission. The script prints mean cross-validation scores:
----------------------------
train auc = 0.85 ± 0.005
train acc = 0.81 ± 0.006
train nll = 0.45 ± 0.007
valid auc = 0.87 ± 0.023
valid acc = 0.81 ± 0.02
valid nll = 0.44 ± 0.024
test auc = 0.83 ± 0.006
test acc = 0.76 ± 0.003
test nll = 0.5 ± 0.005
The official score in this RAMP (the first score column after "historical contributivity" on the leaderboard) is the area under the ROC curve ("auc"), so the relevant line in the output of ramp_test_submission is valid auc = 0.87 ± 0.023.
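As an example of this exploration loop, here is a minimal sketch of a hyperparameter search over the pipeline defined above, assuming scikit-learn's GridSearchCV (the parameter grid is illustrative only):
In [ ]:
from sklearn.model_selection import GridSearchCV

# Tune the regularization strength of the logistic regression step;
# parameters of pipeline steps are addressed as '<step name>__<param>'.
param_grid = {'classifier__C': [0.01, 0.1, 1., 10.]}
search = GridSearchCV(pipeline, param_grid, cv=8, scoring='roc_auc')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)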
You can find more information in the README of the ramp-workflow library.
Don't hesitate to contact us.