Data pipeline

Workflow adapted from Katie Malone's Workflows in Python series

Load data
Transform data to conform to sklearn functions
- Note: need to incorporate Abbie's outlier data cleaning
Split training data into train/test sets for cross-validation
Pick a classifier & evaluate it
Create CSV file for submission

Load Data



In [40]:

    
import pandas as pd
import sklearn



In [41]:

    
data_dir      = '../data/raw/'
data_filename = 'blood_train.csv'
df_blood      = pd.read_csv(data_dir+data_filename)

df_blood.head(10)









    Out[41]:

   Unnamed: 0  Months since Last Donation  Number of Donations  Total Volume Donated (c.c.)  Months since First Donation  Made Donation in March 2007
0         619                           2                   50                        12500                           98                            1
1         664                           0                   13                         3250                           28                            1
2         441                           1                   16                         4000                           35                            1
3         160                           2                   20                         5000                           45                            1
4         358                           1                   24                         6000                           77                            0
5         335                           4                    4                         1000                            4                            0
6          47                           2                    7                         1750                           14                            1
7         164                           1                   12                         3000                           35                            0
8         736                           5                   46                        11500                           98                            1
9         436                           0                    3                          750                            4                            0

Transform data for machine learning

Used iloc to drop the subject id column and the target column (i.e. 'Made Donation in March 2007')
- Previously, forgot to drop the 'Made Donation in March 2007' column, so the target leaked into the features and the model totally overfitted the data #LessonLearned
Converted to NumPy array format for input into the predict_proba and train_test_split functions
Saved the target values in a separate list



In [74]:

    
# .values returns a NumPy array (as_matrix() was removed in pandas 1.0)
X = df_blood.iloc[:,1:5].values
y = list(df_blood["Made Donation in March 2007"])
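One way to guard against the leakage mistake described above is to select columns by name instead of by position. The sketch below is only an illustration on a toy frame that reuses the column names and the first three rows of the table above; `toy`, `id_col`, `target_col` and `feature_cols` are made-up names, not part of the original notebook.

```python
import pandas as pd

# Toy frame reusing the column names of blood_train.csv (first three rows shown above)
toy = pd.DataFrame({
    "Unnamed: 0": [619, 664, 441],
    "Months since Last Donation": [2, 0, 1],
    "Number of Donations": [50, 13, 16],
    "Total Volume Donated (c.c.)": [12500, 3250, 4000],
    "Months since First Donation": [98, 28, 35],
    "Made Donation in March 2007": [1, 1, 1],
})

id_col = "Unnamed: 0"
target_col = "Made Donation in March 2007"

# Select features by name so the target can never slip into X by accident
feature_cols = [c for c in toy.columns if c not in (id_col, target_col)]
X_toy = toy[feature_cols].values
y_toy = toy[target_col].values

assert target_col not in feature_cols
print(feature_cols)
print(X_toy.shape)
```

Selecting by name also keeps working if the column order of the CSV file ever changes.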



In [43]:

    
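# Log-transform the features (log1p copes with zeros, e.g. 0 months since last donation)
import numpy as np
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p)
X = transformer.fit_transform(X)

# Normalize the features using StandardScaler (disabled: X_train/X_test are only created in the next section)
# Note: fitting a separate scaler on each split, as below, would scale the two sets inconsistently
#X_test = sklearn.preprocessing.StandardScaler().fit_transform(X_test)
#X_train = sklearn.preprocessing.StandardScaler().fit_transform(X_train)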
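A possible way to combine the log transform and the scaling step is an sklearn `Pipeline`, which fits the scaler only on the data passed to `fit()` and reuses the same statistics at prediction time. This is a sketch on synthetic data, not the notebook's actual method; `X_demo`, `y_demo` and `pipe` are illustrative names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic, non-negative stand-in for the four donation features
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 100, size=(60, 4)).astype(float)
y_demo = (X_demo[:, 1] > 50).astype(int)

# log1p -> standardize -> classify
pipe = make_pipeline(
    FunctionTransformer(np.log1p),
    StandardScaler(),
    LogisticRegression(),
)

# The scaler's mean and standard deviation come from the first 40 rows only
pipe.fit(X_demo[:40], y_demo[:40])
probs_demo = pipe.predict_proba(X_demo[40:])
print(probs_demo.shape)
```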

Split data into training/test sets for cross-validation

`sklearn.model_selection.train_test_split`

Split the total dataset into training and testing sets via random selection

test_size - proportion of the dataset to put into the test set
random_state - seed for the pseudo-random number generator (a fixed seed makes the split reproducible)



In [117]:

    
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test= sklearn.model_selection.train_test_split(
    X, y, 
    test_size=0.5, 
    random_state=0) 

print("No. Rows in training set:\t", len(X_train))
print("No. Rows in testing set:\t" , len(X_test))









    



('No. Rows in training set:\t', 288)
('No. Rows in testing set:\t', 288)
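The effect of `test_size` and `random_state` can be checked on a small synthetic array. The sketch below assumes nothing about the blood donation data; `X_demo` and `y_demo` are made-up arrays.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20) % 2

# test_size=0.25 puts 25% of the 20 rows (5 rows) into the test set
a_train, a_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
b_train, b_test, _, _ = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

print(len(a_train), len(a_test))
# The same seed reproduces exactly the same split
print(np.array_equal(a_test, b_test))
```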

Cross-validation: Split the data into training/test sets by hand

Note: this split is not actually used, since train_test_split() is used instead

Splits the first third of the data into the training set, the second third into the validation set, and the last third into the test set



In [118]:

    
# Split data into 4 partitions
#  - training set
#  - validation set
#  - combined training & validation set
#  - testing set

# nrows_total = df_blood.count()[1]
# nrows_train = int(nrows_total/3)
# nrows_valid = int(nrows_total*2/3)

# X_train, y_train             = X[:nrows_train]           , y[:nrows_train]
# X_valid, y_valid             = X[nrows_train:nrows_valid], y[nrows_train:nrows_valid]
# X_test , y_test              = X[nrows_valid:]           , y[nrows_valid:]
# X_train_valid, y_train_valid = X[:nrows_valid]           , y[:nrows_valid]

# print("Total number of rows:\t", nrows_total)
# print("Training rows:\t\t"     , 0          ,"-", nrows_train)
# print("Validation rows:\t"     , nrows_train,"-", nrows_valid)
# print("Testing rows:\t\t"      ,nrows_valid ,"-" , nrows_total)
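For reference, the by-hand split in thirds can be sketched as runnable code on a small synthetic array. The names below (`X_demo`, `n_train`, `n_valid`, ...) are illustrative. Plain slicing does not shuffle, so it only gives a fair split if the rows are already in random order.

```python
import numpy as np

X_demo = np.arange(24).reshape(12, 2)
y_demo = np.arange(12) % 2

n_total = len(X_demo)
n_train = n_total // 3        # end of the first third
n_valid = n_total * 2 // 3    # end of the second third

X_tr, y_tr = X_demo[:n_train], y_demo[:n_train]
X_va, y_va = X_demo[n_train:n_valid], y_demo[n_train:n_valid]
X_te, y_te = X_demo[n_valid:], y_demo[n_valid:]

print(len(X_tr), len(X_va), len(X_te))
```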

Playing around with different classifiers

With the data loaded, transformed, and split, it can now be passed to different classifiers to see how they perform

Basic workflow for each classifier:

import classifier
initialize classifier into clf variable
fit data (X_train, y_train) into classifier
predict output (i.e. probabilities) using X_test data
evaluate prediction quality (via sklearn.metrics.log_loss function & y_test data)
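The five steps above could be wrapped in a small helper so that every classifier is evaluated the same way. This is a sketch on synthetic data; `evaluate_classifier` is a hypothetical helper and is not defined anywhere in the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split


def evaluate_classifier(clf, X_train, y_train, X_test, y_test):
    """Fit clf, predict class probabilities, and return the log loss."""
    clf.fit(X_train, y_train)
    probs = clf.predict_proba(X_test)
    return log_loss(y_test, probs)


# Synthetic binary problem standing in for the blood donation data
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.5, random_state=0)

demo_score = evaluate_classifier(LogisticRegression(), Xtr, ytr, Xte, yte)
print("Log-loss score:\t", demo_score)
```

Any estimator that implements `predict_proba` can be passed in place of `LogisticRegression()`.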

Logistic Regression Classifier:

sklearn.linear_model.LogisticRegression
- Example: Simple linear regression



In [119]:

    
import sklearn.metrics  # makes sklearn.metrics.log_loss available explicitly
from sklearn.linear_model import LogisticRegression

clf       = sklearn.linear_model.LogisticRegression()
clf.fit(X_train, y_train)

clf_probs = clf.predict_proba(X_test)

score     = sklearn.metrics.log_loss(y_test, clf_probs)
print("Log-loss score:\t", score)









    



('Log-loss score:\t', 0.46686030610135276)



In [81]:

    
# Note: ElasticNet is a regressor, so predict() returns unbounded scores rather than probabilities
clfE       = sklearn.linear_model.ElasticNet(l1_ratio=0.24)
clfE.fit(X_train, y_train)

clfE_probs = clfE.predict(X_test)

score     = sklearn.metrics.log_loss(y_test, clfE_probs)
print("Log-loss score:\t", score)









    



('Log-loss score:\t', 0.49254144740610439)
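Because `ElasticNet` is a regression model, its predictions are not guaranteed to lie in [0, 1], and values outside that range are not valid probabilities for `log_loss`. One possible workaround, sketched here on synthetic data, is to clip the predictions before scoring. The `eps` value and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import log_loss

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

reg = ElasticNet(alpha=0.01, l1_ratio=0.24)
reg.fit(X_demo[:100], y_demo[:100])
raw = reg.predict(X_demo[100:])

# Regression outputs are unbounded, so squeeze them into (0, 1) before scoring
eps = 1e-6
clipped = np.clip(raw, eps, 1 - eps)
clipped_score = log_loss(y_demo[100:], clipped)

print("Raw prediction range:\t", raw.min(), raw.max())
print("Log-loss score:\t", clipped_score)
```

Clipping only makes the scores usable; a classifier with `predict_proba` remains the more natural choice for this metric.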



In [112]:

    
from sklearn.linear_model import LogisticRegression

clfL       = sklearn.linear_model.LogisticRegression(C=1)
clfL.fit(X_train, y_train)

clfL_probs = clfL.predict_proba(X_test)

score     = sklearn.metrics.log_loss(y_test, clfL_probs)
print("Log-loss score:\t", score)









    



('Log-loss score:\t', 0.46686030610135276)



In [102]:

    
# Note: ElasticNetCV is also a regressor; its predict() output is not a calibrated probability
clfECV       = sklearn.linear_model.ElasticNetCV(l1_ratio=0.24)
clfECV.fit(X_train, y_train)

clfECV_probs = clfECV.predict(X_test)
score     = sklearn.metrics.log_loss(y_test, clfECV_probs )
print("Log-loss score:\t", score)









    



('Log-loss score:\t', 0.53195276047172257)



In [100]:

    
dir(clfECV)









    Out[100]:





['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__doc__',
 '__format__',
 '__getattribute__',
 '__getstate__',
 '__hash__',
 '__init__',
 '__module__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_cache',
 '_abc_negative_cache',
 '_abc_negative_cache_version',
 '_abc_registry',
 '_decision_function',
 '_estimator_type',
 '_get_param_names',
 '_preprocess_data',
 '_set_intercept',
 'alpha_',
 'alphas',
 'alphas_',
 'coef_',
 'copy_X',
 'cv',
 'decision_function',
 'dual_gap_',
 'eps',
 'fit',
 'fit_intercept',
 'get_params',
 'intercept_',
 'l1_ratio',
 'l1_ratio_',
 'max_iter',
 'mse_path_',
 'n_alphas',
 'n_iter_',
 'n_jobs',
 'normalize',
 'path',
 'positive',
 'precompute',
 'predict',
 'random_state',
 'score',
 'selection',
 'set_params',
 'tol',
 'verbose']

Submission code for Logistic Regression

Results saved in: ../data/processed/



In [46]:

    
import numpy as np
from sklearn.linear_model import LogisticRegression

# Load test data
data_filename = 'blood_test.csv'
df_test       = pd.read_csv(data_dir+data_filename)

# Transform data
#  - dropped the ID column
#  - converted to a NumPy array for input to `predict_proba`
#  - applied the same log1p transform that was applied to the training features
Z             = np.log1p(df_test.iloc[:,1:5].values)

# Predict probabilities (clf is the LogisticRegression fitted above)
clf_probs     = clf.predict_proba(Z)

# Add the predicted probability of class 1 back into the test data frame
df_test['Made Donation in March 2007'] = clf_probs[:,1]
df_test.head()

# Set up save filename and directory
submit_dir      = '../data/processed/'
submit_filename = 'submit-logistic_regression.csv'

# Save to CSV file using only the subject-id and prediction columns
df_test.to_csv(submit_dir+submit_filename,
               columns=['Unnamed: 0', 'Made Donation in March 2007'],
               index=False)
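The submission step can be tried end to end without the real files. The sketch below uses a toy frame and hand-written probabilities in place of `df_test` and `clf.predict_proba(Z)`, and writes to a temporary directory. The ids, probabilities and the file name `submit-demo.csv` are invented for illustration. It also assumes the classifier's classes are ordered `[0, 1]`, so that column 1 of `predict_proba` is the probability of a donation.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Toy stand-ins for df_test and clf.predict_proba(Z)
df_demo = pd.DataFrame({
    "Unnamed: 0": [1, 2, 3],
    "Months since Last Donation": [2, 21, 4],
})
probs_demo = np.array([[0.7, 0.3], [0.9, 0.1], [0.4, 0.6]])

# Column 1 holds the probability of the positive class (assuming classes_ == [0, 1])
df_demo["Made Donation in March 2007"] = probs_demo[:, 1]

out_path = os.path.join(tempfile.mkdtemp(), "submit-demo.csv")
df_demo.to_csv(out_path,
               columns=["Unnamed: 0", "Made Donation in March 2007"],
               index=False)

# Read the file back to confirm that only the two submission columns were written
check = pd.read_csv(out_path)
print(check)
```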

Random Forest Classifier

sklearn.ensemble.RandomForestClassifier



In [30]:

    
from sklearn.ensemble import RandomForestClassifier

# Train uncalibrated random forest classifier
# on the training split
# and evaluate on the test split
clf       = sklearn.ensemble.RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)

# Get probabilities
clf_probs = clf.predict_proba(X_test)

# Test/Evaluate the model
score     = sklearn.metrics.log_loss(y_test, clf_probs)
print("Log-loss score:\t", score)









    



('Log-loss score:\t', 1.1349310694085468)

Calibrated Random Forest Classifier

sklearn.calibration.CalibratedClassifierCV
sklearn.ensemble.RandomForestClassifier



In [31]:

    
from sklearn.ensemble import RandomForestClassifier

# Train random forest classifier
#  - calibrate with CalibratedClassifierCV (here on the training split, since no separate validation split exists)
#  - evaluate on test data
clf       = sklearn.ensemble.RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)
clf_probs = clf.predict_proba(X_test)


from sklearn.calibration import CalibratedClassifierCV

# Pass the fitted RandomForestClassifier into CalibratedClassifierCV (cv="prefit" skips refitting it)
sig_clf   = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_train, y_train)

# Get prediction probabilities from model
sig_clf_probs = sig_clf.predict_proba(X_test)

# Test quality of predictions using `log_loss` function
sig_score     = sklearn.metrics.log_loss(y_test, sig_clf_probs)
print("Log-loss score:\t", sig_score)









    



('Log-loss score:\t', 0.80482654902433792)
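Calibrating on the same rows the forest was trained on may give an over-optimistic calibration, because the forest fits its own training data very closely. An alternative, sketched here on synthetic data, is to let `CalibratedClassifierCV` run its own internal cross-validation so that each calibrator sees rows the corresponding forest was not trained on. This swaps `cv="prefit"` for `cv=3` and is not the notebook's original setup; all variable names are illustrative.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] + rng.normal(size=300) > 0).astype(int)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.5, random_state=0)

# cv=3: each forest is fit on two folds and calibrated on the remaining fold
forest = RandomForestClassifier(n_estimators=25, random_state=0)
cal_clf = CalibratedClassifierCV(forest, method="sigmoid", cv=3)
cal_clf.fit(Xtr, ytr)

cal_probs = cal_clf.predict_proba(Xte)
cal_score = log_loss(yte, cal_probs)
print("Log-loss score:\t", cal_score)
```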

Support Vector Machine Classifier

sklearn.svm.SVC



In [132]:

    
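from sklearn import svm
from sklearn.preprocessing import StandardScaler

# X_trainNorm / X_testNorm are not defined elsewhere in this notebook; assumed here to be
# standardized copies of X_train / X_test, with the scaler fitted on the training split only
scaler      = StandardScaler().fit(X_train)
X_trainNorm = scaler.transform(X_train)
X_testNorm  = scaler.transform(X_test)

# clf = svm.SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
#     decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
#     max_iter=-1, probability=True, random_state=None, shrinking=True,
#     tol=0.001, verbose=False)
# clf = svm.SVC(kernel='rbf',degree=2, probability=True)
clf = svm.SVC(kernel='linear', probability=True)
clf.fit(X_trainNorm, y_train)

# Get prediction probabilities from model
clf_probs = clf.predict_proba(X_testNorm)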

# Test quality of predictions using `log_loss` function
score     = sklearn.metrics.log_loss(y_test, clf_probs)
print("Log-loss score:\t", score)









    



('Log-loss score:\t', 0.55300082892047209)



In [22]:









    



[[   2    8 2000   38]
 [   2   11 2750   79]
 [   2    5 1250   63]
 ..., 
 [  11    6 1500   41]
 [   2    7 1750   76]
 [   4    5 1250   28]]

Playing around w/: `sklearn.model_selection.cross_val_score`

From Katie Malone's Workflows in Python:

The cheapest and easiest way to train on one portion of my dataset and test on another, and to get a measure of model quality at the same time, is to use sklearn.cross_validation.cross_val_score().

cross_val_score()
- splits data into 3 equal portions
- trains on 2 portions
- tests on the third

This process repeats 3 times. That’s why 3 numbers get printed in the code block below.

Note: `cross_val_score` reports this metric under the name `neg_log_loss`, i.e. the `log_loss` values with the sign flipped, so the results are negative

See:

Sklearn | Quantifying the Quality of Predictions
StackOverflow | Why is log_loss negative?
- Basically, higher score means better performance (less loss)
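To compare these scores with the `log_loss` values computed earlier, the sign can simply be flipped back. A minimal sketch on synthetic data, with illustrative variable names:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(150, 4))
y_demo = (X_demo[:, 0] + rng.normal(size=150) > 0).astype(int)

# One negated log loss per fold
neg_scores = cross_val_score(LogisticRegression(), X_demo, y_demo,
                             cv=3, scoring="neg_log_loss")

# Flip the sign to recover ordinary log-loss values (lower is better)
losses = -neg_scores
print(neg_scores)
print("Mean log loss:\t", losses.mean())
```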

Generate data

X: Training vector
y: Target vector



In [10]:

    
X = df_blood.iloc[:,1:5].values  # as_matrix() was removed in pandas 1.0
y = list(df_blood["Made Donation in March 2007"])

LogisticRegression



In [11]:

    
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

clf   = sklearn.linear_model.LogisticRegression()
score = sklearn.model_selection.cross_val_score(
    clf,
    X, y,
    cv=3,  # 3 folds, matching the 3 scores printed below (newer sklearn versions default to 5)
    scoring="neg_log_loss")

print(score)









    



[-0.56947507 -0.54819823 -0.48901389]

DecisionTreeClassifier



In [12]:

    
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

clf   = sklearn.tree.DecisionTreeClassifier()
score = sklearn.model_selection.cross_val_score(
    clf,
    X, y,
    cv=3,  # 3 folds, matching the 3 scores printed below (newer sklearn versions default to 5)
    scoring="neg_log_loss")
print(score)









    



[-14.08570262  -7.84956205  -8.54322446]

RandomForestClassifier



In [13]:

    
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf   = sklearn.ensemble.RandomForestClassifier()
score = sklearn.model_selection.cross_val_score(
    clf,
    X, y,
    cv=3,  # 3 folds, matching the 3 scores printed below (newer sklearn versions default to 5)
    scoring="neg_log_loss")
print(score)









    



[-1.82437783 -5.49929871 -1.66941622]
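The three comparisons above repeat the same call, so they could be run in one loop. The sketch below does this on synthetic data; the `models` dictionary and the `n_estimators` / `random_state` settings are illustrative choices, not the notebook's. A likely reason for the large tree-based losses is that an unpruned tree outputs probabilities of exactly 0 or 1, and log loss penalizes confident wrong predictions heavily.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(150, 4))
y_demo = (X_demo[:, 0] + rng.normal(size=150) > 0).astype(int)

models = {
    "LogisticRegression": LogisticRegression(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=25, random_state=0),
}

# Mean log loss over 3 folds for each model (sign flipped back to positive)
mean_loss = {}
for name, model in models.items():
    neg = cross_val_score(model, X_demo, y_demo, cv=3, scoring="neg_log_loss")
    mean_loss[name] = -neg.mean()

# Print from best (lowest loss) to worst
for name, loss in sorted(mean_loss.items(), key=lambda kv: kv[1]):
    print(name, round(loss, 3))
```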

Sources

Examples:

Example: Probability Calibration
- helpful for seeing whole workflow in action from loading to plotting

Documentation

Discussion

CrossValidated | sklearn predict_proba output interpretation


