Iris is a small, standard multi-class classification data set from the UCI Machine Learning Repository. The goal is to predict the species of an iris flower from four measurements of its sepals and petals.
In [ ]:
from __future__ import print_function
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
In [ ]:
local_filename = 'data/train.csv'
# Open file and print the first 3 lines
with open(local_filename) as fid:
    for line in fid.readlines()[:3]:
        print(line)
In [ ]:
data = pd.read_csv(local_filename)
In [ ]:
data.head()
In [ ]:
data.shape
In [ ]:
data.describe()
In [ ]:
data.hist(figsize=(10, 10), bins=50, layout=(3, 2));
In [ ]:
sns.pairplot(data);
First we will split our data into features and the target:
In [ ]:
data.head()
In [ ]:
X_train = data.drop(columns='species')
y_train = data['species'].values
Below we build a basic predictive model using the scikit-learn random forest classifier:
In [ ]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=1, max_leaf_nodes=2, random_state=61)
We can cross-validate our classifier (clf) using cross_val_score. Below we specify cv=8, meaning that KFold cross-validation splitting with 8 folds will be used. The accuracy classification score is computed for each split. The output, scores, will be an array of the 8 fold scores. The mean and standard deviation of these scores are printed at the end.
In [ ]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X_train, y_train, cv=8, scoring='accuracy')
print("mean: %e (+/- %e)" % (scores.mean(), scores.std()))
To submit to the RAMP site, you will need to write a submission file, classifier.py, that defines a get_estimator function returning a scikit-learn estimator.
For example, to submit our basic example above, we would define our classifier clf within the function and return clf at the end. Remember to include all the necessary imports at the beginning of the file.
In [ ]:
from sklearn.ensemble import RandomForestClassifier
def get_estimator():
    clf = RandomForestClassifier(n_estimators=1, max_leaf_nodes=2,
                                 random_state=61)
    return clf
If you take a look at the sample submission in the directory submissions/starting_kit, you will find a file named classifier.py, which contains the code above.
You can test that the sample submission works by running ramp_test_submission in your terminal (ensure that ramp-workflow has been installed and that you are in the iris ramp kit directory). Alternatively, within this notebook you can run:
In [ ]:
!ramp_test_submission
To test that your own submission works, create a new folder within submissions and name it as you wish. Within your new folder, save a classifier.py file that defines a get_estimator function. Test your submission locally by running:
ramp_test_submission --submission <folder>
where <folder> is the name of the new folder you created above.
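For example, a minimal sketch, assuming you are in the iris kit directory (the folder name my_random_forest is purely illustrative): copy the starting kit classifier into a new submission folder, edit it as you like, and test it:
In [ ]:
# Create a new submission folder (illustrative name) and start from a copy
# of the starting kit classifier; edit the copy to try your own model.
!mkdir -p submissions/my_random_forest
!cp submissions/starting_kit/classifier.py submissions/my_random_forest/classifier.py
# Test the new submission locally
!ramp_test_submission --submission my_random_forest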
Once you have found a good solution, you can submit it to ramp.studio. First, if it is your first time using RAMP, sign up; otherwise, log in. Then, find the appropriate open event for the iris challenge and sign up for the event. Note that both RAMP and event signups are controlled by RAMP administrators, so there can be a delay between asking for signup and being able to submit.
Once your signup request(s) have been accepted, you can go to your sandbox and copy-paste (or upload) your classifier.py file. Save your submission, name it, then click 'submit'. The submission is trained and tested on our backend in the same way as ramp_test_submission does it locally. While your submission is waiting in the queue and being trained, you can find it in the "New submissions (pending training)" table in my submissions. Once it is trained, you get a mail, and your submission shows up on the public leaderboard.
If there is an error (despite having tested your submission locally with ramp_test_submission), it will show up in the "Failed submissions" table in my submissions. You can click on the error to see part of the trace.
After submission, do not forget to give credit to the previous submissions you reused or integrated into your submission.
The data set we use at the backend is usually different from what you find in the starting kit, so the score may be different.
The usual workflow with RAMP is to explore solutions in a notebook setting, refining feature transformations, selecting different models, perhaps doing some AutoML/hyperopt, etc., and then to test them with ramp_test_submission. The script prints mean cross-validation scores:
----------------------------
train acc = 0.62 ± 0.033
train err = 0.38 ± 0.033
train nll = 1.01 ± 0.378
train f1_70 = 0.5 ± 0.167
valid acc = 0.63 ± 0.06
valid err = 0.38 ± 0.06
valid nll = 1.41 ± 1.115
valid f1_70 = 0.5 ± 0.167
test acc = 0.55 ± 0.084
test err = 0.45 ± 0.084
test nll = 1.31 ± 0.858
test f1_70 = 0.4 ± 0.133
The official score in this RAMP (the first score column after "historical contributivity" on the leaderboard) is accuracy ("acc"), so the line that is relevant in the output of ramp_test_submission is valid acc = 0.63 ± 0.06. When the score is good enough, you can submit it at the RAMP.
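As an illustration of this notebook-side exploration, the sketch below searches over a small, purely illustrative parameter grid with scikit-learn's GridSearchCV, using the same 8-fold cross-validation and accuracy scoring as above; the best parameters found could then be copied into your classifier.py:
In [ ]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative grid; the values are not tuned recommendations
param_grid = {'n_estimators': [1, 10, 100], 'max_leaf_nodes': [2, 5, None]}
grid = GridSearchCV(RandomForestClassifier(random_state=61),
                    param_grid, cv=8, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)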
You can find more information in the README of the ramp-workflow library.
Don't hesitate to contact us.