import pandas as pd
import sklearn

data_dir      = '../data/raw/'
data_filename = 'blood_train.csv'
df_blood      = pd.read_csv(data_dir+data_filename)


   Unnamed: 0  Months since Last Donation  Number of Donations  Total Volume Donated (c.c.)  Months since First Donation  Made Donation in March 2007
0         619                           2                   50                        12500                           98                            1
1         664                           0                   13                         3250                           28                            1
2         441                           1                   16                         4000                           35                            1
3         160                           2                   20                         5000                           45                            1
4         358                           1                   24                         6000                           77                            0
5         335                           4                    4                         1000                            4                            0
6          47                           2                    7                         1750                           14                            1
7         164                           1                   12                         3000                           35                            0
8         736                           5                   46                        11500                           98                            1
9         436                           0                    3                          750                            4                            0

Transform data for machine learning

  • Used iloc to drop the subject id column and the prediction column (i.e. 'Made Donation in March 2007')
    • Previously, forgot to drop the 'Made Donation in March 2007' column and totally overfitted the data #LessonLearned
  • Converted the features to a NumPy array for input into the predict_proba and train_test_split functions
  • Saved the target labels ('Made Donation in March 2007') in a separate list

X = df_blood.iloc[:,1:5].values  # .as_matrix() was removed in pandas 1.0
y = list(df_blood["Made Donation in March 2007"])
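The leakage mistake noted above is easy to guard against with a quick check. A minimal sketch on a made-up three-row frame with the same column layout; the `df_demo`, `feature_cols`, and `target_col` names are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

# Tiny made-up frame with the same column layout as blood_train.csv
df_demo = pd.DataFrame({
    "Unnamed: 0": [619, 664, 441],
    "Months since Last Donation": [2, 0, 1],
    "Number of Donations": [50, 13, 16],
    "Total Volume Donated (c.c.)": [12500, 3250, 4000],
    "Months since First Donation": [98, 28, 35],
    "Made Donation in March 2007": [1, 1, 1],
})

target_col = "Made Donation in March 2007"

# iloc[:, 1:5] keeps columns 1-4: it skips the ID (column 0)
# and stops before the target (column 5)
feature_cols = list(df_demo.columns[1:5])
X_demo = df_demo.iloc[:, 1:5].values
y_demo = df_demo[target_col].tolist()

# Guard against leaking the target or the ID into the features
assert target_col not in feature_cols
assert "Unnamed: 0" not in feature_cols
print(feature_cols, X_demo.shape)
```

If either assertion fails, the slice bounds need another look before any model is trained.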

Split data into training/test sets for cross-validation


Split the total dataset into training and testing sets via random selection

  • test_size - proportion of the dataset to put into the test set
  • random_state - Seed for pseudo-random number generator

from sklearn.model_selection import train_test_split

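X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    X, y,
    test_size=0.5,  # inferred from the 288/288 split in the output below
    # random_state: the seed used in the original run is not preserved here
)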

print("No. Rows in training set:\t", len(X_train))
print("No. Rows in testing set:\t" , len(X_test))

No. Rows in training set:	 288
No. Rows in testing set:	 288
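To see what the two parameters do, here is a small sketch on made-up data (the arrays and the seed value 42 are arbitrary assumptions chosen for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = [0, 1] * 5

# test_size=0.2 puts 20% of the rows (2 of 10) in the test set;
# random_state fixes the shuffle so the split is reproducible
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))

# Same seed, same rows in the test set
same_split = np.array_equal(split_a[1], split_b[1])
print("Same test rows with same seed:", same_split)
```

Leaving random_state unset gives a different split on every run, which makes scores harder to compare between runs.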

Cross-validation: Split the data into training/test sets by hand

Playing around with different classifiers

With the data loaded, transformed and split, can now pass it into different classifiers and see how they perform

Basic workflow for each classifier:

  1. import classifier
  2. initialize classifier into clf variable
  3. fit data (X_train, y_train) into classifier
  4. predict output (i.e. probabilities) using X_test data
  5. evaluate prediction quality (via sklearn.metrics.log_loss function & y_test data)
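The five steps above can be wrapped in one helper so every classifier is scored the same way. A minimal sketch, assuming synthetic stand-in data from make_classification; the `fit_and_score` name is hypothetical and not part of the original notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression   # step 1: import classifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split


def fit_and_score(clf, X_train, y_train, X_test, y_test):
    """Fit clf, predict class probabilities, and return the log loss."""
    clf.fit(X_train, y_train)            # step 3: fit
    probs = clf.predict_proba(X_test)    # step 4: predict probabilities
    return log_loss(y_test, probs)       # step 5: evaluate


# Synthetic stand-in data (not the blood donation set)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5, random_state=0)

clf = LogisticRegression()               # step 2: initialize
score = fit_and_score(clf, X_tr, y_tr, X_te, y_te)
print("Log-loss score:\t", score)
```

Swapping in another classifier only changes steps 1 and 2; the rest stays identical.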

Logistic Regression Classifier:

from sklearn.linear_model import LogisticRegression

clf       = sklearn.linear_model.LogisticRegression()
clf.fit(X_train, y_train)

clf_probs = clf.predict_proba(X_test)

score     = sklearn.metrics.log_loss(y_test, clf_probs)
print("Log-loss score:\t", score)

Log-loss score:	 0.466860254919
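For reference, binary log loss is the negative mean of y*log(p) + (1-y)*log(1-p), where p is the predicted probability of class 1. A small sketch with made-up probabilities that computes it by hand and compares against sklearn.metrics.log_loss:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 0])
# Column 0 = P(class 0), column 1 = P(class 1), as returned by predict_proba
probs = np.array([[0.2, 0.8],
                  [0.7, 0.3],
                  [0.4, 0.6],
                  [0.9, 0.1]])

p1 = probs[:, 1]
# Binary log loss: -mean(y*log(p) + (1-y)*log(1-p))
manual = -np.mean(y_true * np.log(p1) + (1 - y_true) * np.log(1 - p1))
library = log_loss(y_true, probs)

print(manual, library)
```

Lower is better, and a perfect classifier would score 0.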

Submission code for Logistic Regression

  • Results saved in: /data/processed/

from sklearn.linear_model import LogisticRegression

# Load Test Data
data_filename = 'blood_test.csv'
df_test       = pd.read_csv(data_dir+data_filename)

# Transform data
#  - dropped the ID column
#  - converted to matrix array for input to `predict_proba`
Z             = df_test.iloc[:,1:5].values  # .as_matrix() was removed in pandas 1.0

# Predict data
clf_probs     = clf.predict_proba(Z)

# Add predictions back into test data frame
df_test['Made Donation in March 2007'] = clf_probs[:,1]

# Setup save filename and directory
submit_dir      = '../data/processed/'
submit_filename = 'submit-logistic_regression.csv'

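# Save to CSV-file using only the subject-id and prediction columns
# (call reconstructed from a truncated cell; index=False is an assumption)
df_test.to_csv(submit_dir + submit_filename,
               columns=['Unnamed: 0', 'Made Donation in March 2007'],
               index=False)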

Random Forest Classifier


from sklearn.ensemble import RandomForestClassifier

# Train an uncalibrated random forest classifier
# on the training split
# and evaluate it on the test split
clf       = sklearn.ensemble.RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)

# Get probabilities
clf_probs = clf.predict_proba(X_test)

# Test/Evaluate the model
score     = sklearn.metrics.log_loss(y_test, clf_probs)
print("Log-loss score:\t", score)

Log-loss score:	 1.55880940138
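One plausible reason for the much higher score (an assumption, not verified on this dataset) is that a small forest can output probabilities at or near 0 and 1, and log loss penalizes confident mistakes heavily. A sketch with made-up probabilities, where both models make the same single wrong call:

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = [1, 1, 1, 0]

# Moderately confident model: predicts class 1 everywhere, wrong on the last row
moderate = np.array([[0.3, 0.7], [0.3, 0.7], [0.3, 0.7], [0.4, 0.6]])
# Over-confident model: same hard predictions, but every call is made at 99%
confident = np.array([[0.01, 0.99], [0.01, 0.99], [0.01, 0.99], [0.01, 0.99]])

loss_moderate = log_loss(y_true, moderate)
loss_confident = log_loss(y_true, confident)
print(loss_moderate, loss_confident)
```

Both models have the same accuracy here, yet the over-confident one gets the worse log loss because of its one confident miss.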

Calibrated Random Forest Classifier

  • sklearn.calibration.CalibratedClassifierCV
  • sklearn.ensemble.RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier

# Train random forest classifier
#  - calibrated in the next step
#  - evaluated on the test split
clf       = sklearn.ensemble.RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)
clf_probs = clf.predict_proba(X_test)

from sklearn.calibration import CalibratedClassifierCV

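# Pass the fitted RandomForestClassifier into CalibratedClassifierCV
# Note: cv="prefit" assumes the calibration data was not used to fit clf;
# here X_train is reused, so the calibration is likely optimistic
sig_clf   = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_train, y_train)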

# Get prediction probabilities from model
sig_clf_probs = sig_clf.predict_proba(X_test)

# Test quality of predictions using `log_loss` function
sig_score     = sklearn.metrics.log_loss(y_test, sig_clf_probs)
print("Log-loss score:\t", sig_score)

Log-loss score:	 0.773731302797
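Since cv="prefit" expects calibration data the forest has not seen, one alternative is to let CalibratedClassifierCV do the splitting itself. A hedged sketch of that swapped-in variant (cv=3 instead of "prefit", which is not what the cell above does), run on synthetic stand-in data:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the blood donation set)
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5, random_state=0)

rf = RandomForestClassifier(n_estimators=25, random_state=0)

# With cv=3 each forest is fit on two folds and calibrated on the held-out fold,
# so calibration never reuses the rows that forest was trained on
cal_clf = CalibratedClassifierCV(rf, method="sigmoid", cv=3)
cal_clf.fit(X_tr, y_tr)

cal_probs = cal_clf.predict_proba(X_te)
print("Calibrated log-loss:\t", log_loss(y_te, cal_probs))
```

The unfitted forest is passed in directly; CalibratedClassifierCV clones and fits it once per fold.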

Playing around w/: sklearn.model_selection.cross_val_score

From Katie Malone's Workflows in Python:

The cheapest and easiest way to train on one portion of my dataset and test on another, and to get a measure of model quality at the same time, is to use sklearn.cross_validation.cross_val_score().


  • splits data into 3 equal portions
  • trains on 2 portions
  • tests on the third

This process repeats 3 times. That’s why 3 numbers get printed in the code block below.

Note: cross_val_score reports log loss under the scoring name neg_log_loss, so the values are negated (scores closer to zero are better)
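To make the three-portion procedure concrete, here is a sketch that runs the folds by hand and compares the result with cross_val_score. It uses synthetic stand-in data and StratifiedKFold, which is the splitter cross_val_score applies to classifiers when cv is an integer:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (not the blood donation set)
X_demo, y_demo = make_classification(n_samples=150, n_features=4, random_state=0)

folds = StratifiedKFold(n_splits=3)

# By hand: train on two portions, score on the third, three times over
manual_scores = []
for train_idx, test_idx in folds.split(X_demo, y_demo):
    clf = LogisticRegression()
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    probs = clf.predict_proba(X_demo[test_idx])
    manual_scores.append(-log_loss(y_demo[test_idx], probs))  # negated, like neg_log_loss

auto_scores = cross_val_score(LogisticRegression(), X_demo, y_demo,
                              scoring="neg_log_loss", cv=folds)
print(np.array(manual_scores))
print(auto_scores)
```

Both arrays should agree, one entry per fold.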


Generate data

  • X: Training vector
  • y: Target vector

X = df_blood.iloc[:,1:5].values  # .as_matrix() was removed in pandas 1.0
y = list(df_blood["Made Donation in March 2007"])


from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

clf   = sklearn.linear_model.LogisticRegression()
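score = sklearn.model_selection.cross_val_score(
    clf, X, y,
    scoring="neg_log_loss",  # metric named in the note above
    cv=3)                    # three folds, matching the three scores printed below
print(score)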


[-0.56947507 -0.54819823 -0.48901389]


from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

clf   = sklearn.tree.DecisionTreeClassifier()
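score = sklearn.model_selection.cross_val_score(
    clf, X, y,
    scoring="neg_log_loss",  # metric named in the note above
    cv=3)                    # three folds, matching the three scores printed below
print(score)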

[-14.08570262  -7.84956205  -8.54322446]


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf   = sklearn.ensemble.RandomForestClassifier()
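score = sklearn.model_selection.cross_val_score(
    clf, X, y,
    scoring="neg_log_loss",  # metric named in the note above
    cv=3)                    # three folds, matching the three scores printed below
print(score)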

[-1.82437783 -5.49929871 -1.66941622]
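Three fold scores per classifier are easier to compare once summarized. A short sketch that takes the printed arrays from the three runs above and reports mean and spread (the `cv_scores` dictionary is just an illustrative container):

```python
import numpy as np

# Fold scores copied from the three printed outputs (neg_log_loss, so higher is better)
cv_scores = {
    "LogisticRegression": [-0.56947507, -0.54819823, -0.48901389],
    "DecisionTreeClassifier": [-14.08570262, -7.84956205, -8.54322446],
    "RandomForestClassifier": [-1.82437783, -5.49929871, -1.66941622],
}

for name, scores in cv_scores.items():
    print(f"{name}:\tmean={np.mean(scores):.3f}\tstd={np.std(scores):.3f}")

# The classifier with the highest (least negative) mean score
best = max(cv_scores, key=lambda name: np.mean(cv_scores[name]))
print("Best mean score:", best)
```

With only three folds the spread is a rough guide at best, but the gap between logistic regression and the two tree-based models is large in these runs.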

