Load Data


In [1]:
import pandas as pd
import sklearn
import sklearn.metrics  # needed for the sklearn.metrics.log_loss calls below

In [2]:
data_dir      = '../data/raw/'
data_filename = 'blood_train.csv'
df_blood      = pd.read_csv(data_dir+data_filename)

df_blood.head(10)


Out[2]:
   Unnamed: 0  Months since Last Donation  Number of Donations  Total Volume Donated (c.c.)  Months since First Donation  Made Donation in March 2007
0         619                           2                   50                        12500                           98                            1
1         664                           0                   13                         3250                           28                            1
2         441                           1                   16                         4000                           35                            1
3         160                           2                   20                         5000                           45                            1
4         358                           1                   24                         6000                           77                            0
5         335                           4                    4                         1000                            4                            0
6          47                           2                    7                         1750                           14                            1
7         164                           1                   12                         3000                           35                            0
8         736                           5                   46                        11500                           98                            1
9         436                           0                    3                          750                            4                            0






Transform data for machine learning

  • Used iloc to drop the subject id column and the target column (i.e. 'Made Donation in March 2007')
    • Previously forgot to drop the 'Made Donation in March 2007' column, so the target leaked into the features and the scores looked far better than they really were #LessonLearned
  • Converted to a NumPy array for input into the predict_proba and train_test_split functions
  • Saved the target labels in a separate list
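
One way to make the leakage mistake above harder to repeat is to drop the id and target columns by name instead of by position. A minimal sketch, assuming the same column names as blood_train.csv; the tiny frame reuses the first three rows shown earlier, and df_demo, X_demo and y_demo are made-up names for illustration:

```python
import pandas as pd

# Small stand-in frame with the same columns as blood_train.csv
# (values copied from the first three rows of the head() output).
df_demo = pd.DataFrame({
    "Unnamed: 0": [619, 664, 441],
    "Months since Last Donation": [2, 0, 1],
    "Number of Donations": [50, 13, 16],
    "Total Volume Donated (c.c.)": [12500, 3250, 4000],
    "Months since First Donation": [98, 28, 35],
    "Made Donation in March 2007": [1, 1, 1],
})

id_col = "Unnamed: 0"
target_col = "Made Donation in March 2007"

# Dropping by name keeps the target out of X even if the column order changes.
X_demo = df_demo.drop(columns=[id_col, target_col]).to_numpy()
y_demo = df_demo[target_col].to_numpy()

print(X_demo.shape)  # (3, 4)
print(y_demo.shape)  # (3,)
```

The positional iloc[:,1:5] slice used in this notebook gives the same four feature columns, as long as the CSV layout stays fixed.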

In [3]:
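# Feature columns 1-4 only: skips the id (column 0) and the target (column 5)
# (as_matrix() was removed in pandas 1.0; to_numpy() replaces it)
X = df_blood.iloc[:,1:5].to_numpy()
y = list(df_blood["Made Donation in March 2007"])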






Split data into training/test sets for cross-validation

sklearn.model_selection.train_test_split

Split the total dataset into training and testing sets via random selection

  • test_size - proportion of the dataset to put into the test set
  • random_state - seed for the pseudo-random number generator, so the same split is reproduced on every run
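
To see what the two parameters do, here is a small sketch on synthetic stand-in data (not the blood dataset; X_demo, y_demo and the split_a/split_b names are made up for illustration). It splits the same arrays twice with the same seed and checks that both splits match:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 8 rows, 2 features.
X_demo = np.arange(16).reshape(8, 2)
y_demo = np.array([0, 1] * 4)

# test_size=0.25 puts 2 of the 8 rows into the test set.
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
X_train_b, X_test_b, y_train_b, y_test_b = split_b

print(len(X_train_a), len(X_test_a))       # 6 2
print(np.array_equal(X_test_a, X_test_b))  # True: same seed, same split
```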

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.5,
    random_state=0)

print("No. Rows in training set:\t", len(X_train))
print("No. Rows in testing set:\t" , len(X_test))


No. Rows in training set:	 288
No. Rows in testing set:	 288






Cross-validation: Split the data into training/test sets by hand

Note: I don't actually use this data, since I used train_test_split()

Splits the first third of the rows into the training set, the second third into the validation set, and the last third into the test set


In [5]:
# Split data into 4 partitions
#  - training set
#  - validation set
#  - combined training & validation set
#  - testing set

# nrows_total = len(df_blood)
# nrows_train = int(nrows_total/3)
# nrows_valid = int(nrows_total*2/3)

# X_train, y_train             = X[:nrows_train]           , y[:nrows_train]
# X_valid, y_valid             = X[nrows_train:nrows_valid], y[nrows_train:nrows_valid]
# X_test , y_test              = X[nrows_valid:]           , y[nrows_valid:]
# X_train_valid, y_train_valid = X[:nrows_valid]           , y[:nrows_valid]

# print("Total number of rows:\t", nrows_total)
# print("Training rows:\t\t"     , 0          ,"-", nrows_train)
# print("Validation rows:\t"     , nrows_train,"-", nrows_valid)
# print("Testing rows:\t\t"      ,nrows_valid ,"-" , nrows_total)







Playing around with different classifiers

With the data loaded, transformed, and split, it can now be passed to different classifiers to compare how they perform

Basic workflow for each classifier:

  1. import classifier
  2. initialize classifier into clf variable
  3. fit data (X_train, y_train) into classifier
  4. predict output (i.e. probabilities) using X_test data
  5. evaluate prediction quality (via sklearn.metrics.log_loss function & y_test data)
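
The five steps can be wrapped in one small helper so every classifier is scored the same way. This is only a sketch: fit_and_score is a hypothetical helper name (not part of scikit-learn), and the data is a synthetic stand-in rather than the blood dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # step 1: import classifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split


def fit_and_score(clf, X_train, y_train, X_test, y_test):
    """Fit clf, predict class probabilities, and return the log loss."""
    clf.fit(X_train, y_train)          # step 3: fit on the training split
    probs = clf.predict_proba(X_test)  # step 4: predict probabilities
    return log_loss(y_test, probs)     # step 5: evaluate against y_test


# Synthetic stand-in data: the label depends on the first feature plus noise.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

clf = LogisticRegression()             # step 2: initialize classifier
score = fit_and_score(clf, X_tr, y_tr, X_te, y_te)
print("Log-loss score:\t", score)
```

Swapping in another classifier only changes the import and the line that builds clf.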

Logistic Regression Classifier:


In [6]:
from sklearn.linear_model import LogisticRegression

clf       = sklearn.linear_model.LogisticRegression()
clf.fit(X_train, y_train)

clf_probs = clf.predict_proba(X_test)

score     = sklearn.metrics.log_loss(y_test, clf_probs)
print("Log-loss score:\t", score)


Log-loss score:	 0.466860254919

Submission code for Logistic Regression

  • Results saved in: /data/processed/

In [7]:
from sklearn.linear_model import LogisticRegression

# Load Test Data
data_filename = 'blood_test.csv'
df_test       = pd.read_csv(data_dir+data_filename)

# Transform data
#  - dropped the ID column
#  - converted to NumPy array for input to `predict_proba`
#    (as_matrix() was removed in pandas 1.0; to_numpy() replaces it)
Z             = df_test.iloc[:,1:5].to_numpy()

# Predict data
clf_probs     = clf.predict_proba(Z)

# Add predictions back into test data frame
df_test['Made Donation in March 2007'] = clf_probs[:,1]
df_test.head()

# Setup save filename and directory
submit_dir      = '../data/processed/'
submit_filename = 'submit-logistic_regression.csv'

# Save to CSV file using only the subject-id and prediction columns
df_test.to_csv(submit_dir+submit_filename,
               columns=['Unnamed: 0', 'Made Donation in March 2007'],
               index=False)

Random Forest Classifier

sklearn.ensemble.RandomForestClassifier


In [8]:
from sklearn.ensemble import RandomForestClassifier

# Train uncalibrated random forest classifier
# on the training split
# and evaluate on the test split
clf       = sklearn.ensemble.RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)

# Get probabilities
clf_probs = clf.predict_proba(X_test)

# Test/Evaluate the model
score     = sklearn.metrics.log_loss(y_test, clf_probs)
print("Log-loss score:\t", score)


Log-loss score:	 1.55880940138

Calibrated Random Forest Classifier

  • sklearn.calibration.CalibratedClassifierCV
  • sklearn.ensemble.RandomForestClassifier

In [9]:
from sklearn.ensemble import RandomForestClassifier

# Train random forest classifier
#  - calibrate on the training data (no separate validation split is used here)
#  - evaluate on the test data
clf       = sklearn.ensemble.RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)
clf_probs = clf.predict_proba(X_test)


from sklearn.calibration import CalibratedClassifierCV

# Pass the already-fitted RandomForestClassifier into CalibratedClassifierCV
# (cv="prefit" skips refitting clf; fit() only learns the sigmoid calibration)
sig_clf   = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_train, y_train)

# Get prediction probabilities from model
sig_clf_probs = sig_clf.predict_proba(X_test)

# Test quality of predictions using `log_loss` function
sig_score     = sklearn.metrics.log_loss(y_test, sig_clf_probs)
print("Log-loss score:\t", sig_score)


Log-loss score:	 0.773731302797
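
The cell above calibrates on the same rows the forest was fitted on, and with cv="prefit" it is up to the user to keep fitting data and calibration data separate. One possible alternative, sketched here on synthetic stand-in data (not the blood dataset), is to pass an unfitted forest with an integer cv so that CalibratedClassifierCV does the fit/calibrate split internally. The variable names and the choice of cv=3 are assumptions for illustration, and no claim is made about how this would score on the real data.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: the label depends on the first feature plus noise.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + rng.normal(size=300) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

# Unfitted forest + cv=3: each fold fits the forest on two thirds of the
# training split and learns the sigmoid calibration on the remaining third.
forest = RandomForestClassifier(n_estimators=25, random_state=0)
calibrated = CalibratedClassifierCV(forest, method="sigmoid", cv=3)
calibrated.fit(X_tr, y_tr)

cal_probs = calibrated.predict_proba(X_te)
print("Calibrated log-loss:\t", log_loss(y_te, cal_probs))
```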






Playing around w/: sklearn.model_selection.cross_val_score

From Katie Malone's Workflows in Python:

The cheapest and easiest way to train on one portion of my dataset and test on another, and to get a measure of model quality at the same time, is to use sklearn.cross_validation.cross_val_score().

cross_val_score()

  • splits data into 3 equal portions
  • trains on 2 portions
  • tests on the third

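This process repeats 3 times, with each portion serving as the test set once. That’s why 3 numbers get printed in the code blocks below. (Three folds was the default when this was run; scikit-learn 0.22 and later default to 5 unless cv is set.)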

Note: cross_val_score reports log loss under the scoring name neg_log_loss, and the values are negated so that higher (closer to zero) is better
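
A short sketch of the sign convention, on synthetic stand-in data (not the blood dataset; the variable names are made up for illustration). Flipping the sign of the neg_log_loss scores gives ordinary log-loss values, which can then be averaged across folds:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: the label depends on the first feature plus noise.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(150, 4))
y_demo = (X_demo[:, 0] + rng.normal(size=150) > 0).astype(int)

# cv=3 asks for three folds explicitly, so three scores come back.
neg_scores = cross_val_score(
    LogisticRegression(), X_demo, y_demo,
    cv=3, scoring="neg_log_loss")

# Negate to get back to ordinary log loss (lower is better).
losses = -neg_scores
print("Per-fold log loss:", losses)
print("Mean log loss:\t", losses.mean())
```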

See:

Generate data

  • X: Training vector
  • y: Target vector

In [10]:
# as_matrix() was removed in pandas 1.0; to_numpy() replaces it
X = df_blood.iloc[:,1:5].to_numpy()
y = list(df_blood["Made Donation in March 2007"])

LogisticRegression


In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

clf   = sklearn.linear_model.LogisticRegression()
score = sklearn.model_selection.cross_val_score(
    clf,
    X, y,
    cv=3,  # three folds, matching the three scores printed below
    scoring="neg_log_loss")

print(score)


[-0.56947507 -0.54819823 -0.48901389]

DecisionTreeClassifier


In [12]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

clf   = sklearn.tree.DecisionTreeClassifier()
score = sklearn.model_selection.cross_val_score(
    clf,
    X, y,
    cv=3,  # three folds, matching the three scores printed below
    scoring="neg_log_loss")
print(score)


[-14.08570262  -7.84956205  -8.54322446]

RandomForestClassifier


In [13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf   = sklearn.ensemble.RandomForestClassifier()
score = sklearn.model_selection.cross_val_score(
    clf,
    X, y,
    cv=3,  # three folds, matching the three scores printed below
    scoring="neg_log_loss")
print(score)


[-1.82437783 -5.49929871 -1.66941622]
