Data pipeline

Workflow adapted from Katie Malone's Workflows in Python series

Load data
Transform data to conform to sklearn functions
1. Note: need to incorporate Abbie's outlier data cleaning
Split training data into train/test sets for cross validation
Pick a classifier & evaluate it
Create a CSV file for submission

Load Data



In [1]:

    
import pandas as pd
import sklearn
import sklearn.metrics  # later cells call sklearn.metrics.log_loss



In [2]:

    
data_dir      = '../data/raw/'
data_filename = 'blood_train.csv'
df_blood      = pd.read_csv(data_dir+data_filename)

df_blood.head(10)









    Out[2]:

   Unnamed: 0  Months since Last Donation  Number of Donations  Total Volume Donated (c.c.)  Months since First Donation  Made Donation in March 2007
0         619                           2                   50                        12500                           98                            1
1         664                           0                   13                         3250                           28                            1
2         441                           1                   16                         4000                           35                            1
3         160                           2                   20                         5000                           45                            1
4         358                           1                   24                         6000                           77                            0
5         335                           4                    4                         1000                            4                            0
6          47                           2                    7                         1750                           14                            1
7         164                           1                   12                         3000                           35                            0
8         736                           5                   46                        11500                           98                            1
9         436                           0                    3                          750                            4                            0

Transform data for machine learning

Used iloc to drop the subject id column and the prediction column (i.e. 'Made Donation in March 2007')
- Previously, forgot to drop the 'Made Donation in March 2007' column and totally overfitted the data #LessonLearned
Converted to a NumPy array for input into the predict_proba and train_test_split functions
Saved the target labels in a separate list



In [3]:

    
X = df_blood.iloc[:,1:5].to_numpy()  # as_matrix() was removed in pandas 1.0
y = list(df_blood["Made Donation in March 2007"])
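
Since the positional `iloc` slice is what made the earlier overfitting slip easy to miss, one alternative is to drop the id and target columns by name. A minimal sketch, assuming the column names shown in the `df_blood.head(10)` output; the three sample rows are copied from that output, and `df_sample`, `id_col` and `target_col` are names made up for this illustration:

```python
import pandas as pd

# Three rows copied from the df_blood.head(10) output, for illustration only
df_sample = pd.DataFrame({
    "Unnamed: 0": [619, 664, 441],
    "Months since Last Donation": [2, 0, 1],
    "Number of Donations": [50, 13, 16],
    "Total Volume Donated (c.c.)": [12500, 3250, 4000],
    "Months since First Donation": [98, 28, 35],
    "Made Donation in March 2007": [1, 1, 1],
})

id_col = "Unnamed: 0"                        # subject id
target_col = "Made Donation in March 2007"   # label to predict

# Drop the id and target columns by name, so the label cannot leak into X
X_sample = df_sample.drop(columns=[id_col, target_col]).to_numpy()
y_sample = df_sample[target_col].to_numpy()

print(X_sample.shape)  # (3, 4)
print(y_sample)        # [1 1 1]
```

If the column order in the CSV ever changes, the by-name version keeps selecting the same four features.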

Split data into training/test sets for cross-validation

`sklearn.model_selection.train_test_split`

Split the total dataset into training and testing sets via random selection

test_size - proportion of the dataset to put into the test set
random_state - Seed for pseudo-random number generator



In [4]:

    
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test= sklearn.model_selection.train_test_split(
    X, y, 
    test_size=0.5, 
    random_state=0) 

print("No. Rows in training set:\t", len(X_train))
print("No. Rows in testing set:\t" , len(X_test))









    



No. Rows in training set:	 288
No. Rows in testing set:	 288
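
To see what `test_size` and `random_state` do in isolation, here is a small sketch on toy data (not the blood donation data; `X_toy` and `y_toy` are made up for the illustration). It uses the same `test_size=0.5` and `random_state=0` as the cell above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Two calls with the same seed
split_a = train_test_split(X_toy, y_toy, test_size=0.5, random_state=0)
split_b = train_test_split(X_toy, y_toy, test_size=0.5, random_state=0)

X_tr, X_te, y_tr, y_te = split_a
print(len(X_tr), len(X_te))  # 5 5

# Same seed, same rows in the test set
print(np.array_equal(split_a[1], split_b[1]))  # True
```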

Cross-validation: Split the data into training/test sets by hand

Note: I don't actually use this data, since I used train_test_split()

Splits the first third of the data into the training set, the second third into the validation set, and the last third into the test set



In [5]:

    
# Split data into 4 partitions
#  - training set
#  - validation set
#  - combined training & validation set
#  - testing set

# nrows_total = df_blood.count()[1]
# nrows_train = int(nrows_total/3)
# nrows_valid = int(nrows_total*2/3)

# X_train, y_train             = X[:nrows_train]           , y[:nrows_train]
# X_valid, y_valid             = X[nrows_train:nrows_valid], y[nrows_train:nrows_valid]
# X_test , y_test              = X[nrows_valid:]           , y[nrows_valid:]
# X_train_valid, y_train_valid = X[:nrows_valid]           , y[:nrows_valid]

# print("Total number of rows:\t", nrows_total)
# print("Training rows:\t\t"     , 0          ,"-", nrows_train)
# print("Validation rows:\t"     , nrows_train,"-", nrows_valid)
# print("Testing rows:\t\t"      ,nrows_valid ,"-" , nrows_total)
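
A runnable version of the by-hand split, as a sketch on toy arrays rather than the real `X` and `y` (the `_h` suffixed names are made up here so nothing above gets overwritten). Note that slicing by position does not shuffle, so it only makes sense if the row order carries no pattern:

```python
import numpy as np

# Toy stand-ins for X and y (9 rows so the thirds divide evenly)
X_toy = np.arange(18).reshape(9, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0])

nrows_total = len(X_toy)
nrows_train = nrows_total // 3       # end of the training third
nrows_valid = nrows_total * 2 // 3   # end of the validation third

X_train_h, y_train_h = X_toy[:nrows_train], y_toy[:nrows_train]
X_valid_h, y_valid_h = X_toy[nrows_train:nrows_valid], y_toy[nrows_train:nrows_valid]
X_test_h, y_test_h = X_toy[nrows_valid:], y_toy[nrows_valid:]

print(len(X_train_h), len(X_valid_h), len(X_test_h))  # 3 3 3
```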

Playing around with different classifiers

With the data loaded, transformed, and split, we can now pass it to different classifiers and see how they perform

Basic workflow for each classifier:

import classifier
initialize classifier into clf variable
fit data (X_train, y_train) into classifier
predict output (i.e. probabilities) using X_test data
evaluate prediction quality (via sklearn.metrics.log_loss function & y_test data)
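
The workflow above could be wrapped in a small helper so every classifier is scored the same way. This is only a sketch: `evaluate_classifier` is a hypothetical name, and the data comes from `make_classification` as a synthetic stand-in for the blood donation features.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split


def evaluate_classifier(clf, X_train, y_train, X_test, y_test):
    """Fit clf, predict class probabilities, and return the log loss."""
    clf.fit(X_train, y_train)
    probs = clf.predict_proba(X_test)
    return log_loss(y_test, probs)


# Synthetic data standing in for the blood donation features
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

score = evaluate_classifier(LogisticRegression(), X_tr, y_tr, X_te, y_te)
print("Log-loss score:\t", score)
```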

Logistic Regression Classifier:

sklearn.linear_model.LogisticRegression
- Example: Simple linear regression



In [6]:

    
from sklearn.linear_model import LogisticRegression

clf       = sklearn.linear_model.LogisticRegression()
clf.fit(X_train, y_train)

clf_probs = clf.predict_proba(X_test)

score     = sklearn.metrics.log_loss(y_test, clf_probs)
print("Log-loss score:\t", score)









    



Log-loss score:	 0.466860254919
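
For intuition about the score: binary log loss is the mean of -[y*log(p) + (1-y)*log(1-p)], where p is the predicted probability of class 1. A sketch that computes it by hand and checks it against `sklearn.metrics.log_loss`; the labels and probabilities are made up for the illustration:

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities of class 1
y_true = np.array([1, 0, 1, 1])
p_class1 = np.array([0.9, 0.2, 0.6, 0.3])

# Binary log loss: mean of -[y*log(p) + (1-y)*log(1-p)]
manual = -np.mean(
    y_true * np.log(p_class1) + (1 - y_true) * np.log(1 - p_class1))

# sklearn expects one column per class, like predict_proba returns
probs = np.column_stack([1 - p_class1, p_class1])
from_sklearn = log_loss(y_true, probs)

print(np.isclose(manual, from_sklearn))  # True
```

Lower is better: confident wrong predictions (like 0.3 for a true 1) are penalized most heavily.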

Submission code for Logistic Regression

Results saved in: /data/processed/



In [7]:

    
from sklearn.linear_model import LogisticRegression

# Load Test Data
data_filename = 'blood_test.csv'
df_test       = pd.read_csv(data_dir+data_filename)

# Transform data
#  - dropped the ID column
#  - converted to NumPy array for input to `predict_proba`
Z             = df_test.iloc[:,1:5].to_numpy()  # as_matrix() was removed in pandas 1.0

# Predict data
clf_probs     = clf.predict_proba(Z)

# Add predictions back into test data frame
df_test['Made Donation in March 2007'] = clf_probs[:,1]
df_test.head()

# Setup save filename and directory
submit_dir      = '../data/processed/'
submit_filename = 'submit-logistic_regression.csv'

# Save to CSV file using only the subject-id and prediction columns
df_test.to_csv(submit_dir+submit_filename, 
               columns=('Unnamed: 0', 'Made Donation in March 2007'),
               index=False)
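
A quick way to sanity-check a submission file is to read it back and confirm it has only the two expected columns and probabilities between 0 and 1. A sketch with made-up ids and probabilities standing in for `df_test` and `clf_probs[:,1]`, written to a temporary directory instead of `../data/processed/`:

```python
import os
import tempfile

import pandas as pd

# Made-up ids and probabilities, for illustration only
df_demo = pd.DataFrame({
    "Unnamed: 0": [101, 102, 103],
    "Months since Last Donation": [2, 0, 1],
    "Made Donation in March 2007": [0.25, 0.5, 0.75],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submit-demo.csv")
    df_demo.to_csv(path,
                   columns=["Unnamed: 0", "Made Donation in March 2007"],
                   index=False)
    df_check = pd.read_csv(path)

print(list(df_check.columns))
print(df_check["Made Donation in March 2007"].between(0, 1).all())  # True
```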

Random Forest Classifier

sklearn.ensemble.RandomForestClassifier



In [8]:

    
from sklearn.ensemble import RandomForestClassifier

# Train uncalibrated random forest classifier
# on the training split
# and evaluate on the test split
clf       = sklearn.ensemble.RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)

# Get probabilities
clf_probs = clf.predict_proba(X_test)

# Test/Evaluate the model
score     = sklearn.metrics.log_loss(y_test, clf_probs)
print("Log-loss score:\t", score)









    



Log-loss score:	 1.55880940138

Calibrated Random Forest Classifier

sklearn.calibration.CalibratedClassifierCV
sklearn.ensemble.RandomForestClassifier



In [9]:

    
from sklearn.ensemble import RandomForestClassifier

# Train random forest classifier
#  - calibrate its probabilities (reusing the training split, since there is no separate validation split here)
#  - evaluate on test data
clf       = sklearn.ensemble.RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)
clf_probs = clf.predict_proba(X_test)


from sklearn.calibration import CalibratedClassifierCV

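# Pass the fitted RandomForestClassifier into CalibratedClassifierCV
# Note: with cv="prefit" the calibrator is fit on whatever is passed to fit();
# here that is X_train again, the same data the forest was trained on
sig_clf   = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_train, y_train)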

# Get prediction probabilities from model
sig_clf_probs = sig_clf.predict_proba(X_test)

# Test quality of predictions using `log_loss` function
sig_score     = sklearn.metrics.log_loss(y_test, sig_clf_probs)
print("Log-loss score:\t", sig_score)









    



Log-loss score:	 0.773731302797
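
The scikit-learn docs advise that the data used to fit the classifier and the data used to calibrate it should be disjoint. One way to get that without a hand-made validation split is to pass an integer `cv` instead of `"prefit"`, so the calibrator does its own internal folds. A sketch on synthetic data (a stand-in for the blood donation features, so the score is not comparable to the one above):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the blood donation features
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

# With cv=3 (instead of cv="prefit"), each fold's forest is fit on two thirds
# of X_tr and its sigmoid calibrator on the remaining third
forest = RandomForestClassifier(n_estimators=25, random_state=0)
cal_clf = CalibratedClassifierCV(forest, method="sigmoid", cv=3)
cal_clf.fit(X_tr, y_tr)

cal_probs = cal_clf.predict_proba(X_te)
print("Log-loss score:\t", log_loss(y_te, cal_probs))
```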

Playing around w/: `sklearn.model_selection.cross_val_score`

From Katie Malone's Workflows in Python:

The cheapest and easiest way to train on one portion of my dataset and test on another, and to get a measure of model quality at the same time, is to use sklearn.cross_validation.cross_val_score().

cross_val_score()

splits data into 3 equal portions

trains on 2 portions

tests on the third

This process repeats 3 times. That’s why 3 numbers get printed in the code block below.

Note: for `cross_val_score`, the log-loss scorer is called `neg_log_loss` and its results are negative (the loss is negated)

See:

Sklearn | Quantifying the Quality of Predictions
StackOverflow | Why is log_loss negative?
- Basically, higher score means better performance (less loss)
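
To make the sign convention concrete, here is a sketch that does the 3-fold loop by hand and compares it with `cross_val_score`. It assumes that an integer `cv` on a classifier uses unshuffled stratified folds (my reading of the scikit-learn docs), and it runs on synthetic data rather than the blood donation set:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data standing in for X and y
X_demo, y_demo = make_classification(n_samples=150, n_features=4, random_state=0)

# By hand: train on two folds, compute the log loss on the third, repeat
manual_losses = []
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X_demo, y_demo):
    clf = LogisticRegression()
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    probs = clf.predict_proba(X_demo[test_idx])
    manual_losses.append(log_loss(y_demo[test_idx], probs))

# cross_val_score reports the same quantity with the sign flipped
neg_scores = cross_val_score(LogisticRegression(), X_demo, y_demo,
                             cv=3, scoring="neg_log_loss")

print(np.array(manual_losses))
print(-neg_scores)
```

The two printed arrays should agree up to floating-point noise, which is why a score closer to zero means a smaller loss.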

Generate data

X: Training vector
y: Target vector



In [10]:

    
X = df_blood.iloc[:,1:5].to_numpy()  # as_matrix() was removed in pandas 1.0
y = list(df_blood["Made Donation in March 2007"])

LogisticRegression



In [11]:

    
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

clf   = sklearn.linear_model.LogisticRegression()
score = sklearn.model_selection.cross_val_score( 
    clf, 
    X, y,
    cv=3,  # 3 folds, matching the three scores printed below (newer scikit-learn defaults to 5)
    scoring="neg_log_loss")

print(score)









    



[-0.56947507 -0.54819823 -0.48901389]

DecisionTreeClassifier



In [12]:

    
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

clf   = sklearn.tree.DecisionTreeClassifier()
score = sklearn.model_selection.cross_val_score( 
    clf, 
    X, y,
    cv=3,  # 3 folds, matching the three scores printed below (newer scikit-learn defaults to 5)
    scoring="neg_log_loss")
print(score)









    



[-14.08570262  -7.84956205  -8.54322446]

RandomForestClassifier



In [13]:

    
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf   = sklearn.ensemble.RandomForestClassifier()
score = sklearn.model_selection.cross_val_score( 
    clf, 
    X, y,
    cv=3,  # 3 folds, matching the three scores printed below (newer scikit-learn defaults to 5)
    scoring="neg_log_loss")
print(score)









    



[-1.82437783 -5.49929871 -1.66941622]
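
The three cells above could also be run as one loop that collects the mean score per classifier. A sketch on synthetic data (so the numbers will differ from the outputs above); the `classifiers` dict, the fixed `random_state` values and `n_estimators=25` are choices made for this illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for X and y
X_demo, y_demo = make_classification(n_samples=150, n_features=4, random_state=0)

classifiers = {
    "LogisticRegression": LogisticRegression(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=25, random_state=0),
}

mean_scores = {}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X_demo, y_demo, cv=3, scoring="neg_log_loss")
    mean_scores[name] = scores.mean()
    print(name, scores.mean())

# Higher (closer to zero) neg_log_loss is better
best = max(mean_scores, key=mean_scores.get)
print("Best mean score:", best)
```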

Sources

Examples:

Example: Probability Calibration
- helpful for seeing whole workflow in action from loading to plotting

Documentation

Discussion

CrossValidated | sklearn predict_proba output interpretation


