Load Data


In [40]:
import pandas as pd
import sklearn
import sklearn.metrics  # log_loss is used for scoring in the cells below

In [41]:
data_dir      = '../data/raw/'
data_filename = 'blood_train.csv'
df_blood      = pd.read_csv(data_dir+data_filename)

df_blood.head(10)


Out[41]:
   Unnamed: 0  Months since Last Donation  Number of Donations  Total Volume Donated (c.c.)  Months since First Donation  Made Donation in March 2007
0         619                           2                   50                        12500                           98                            1
1         664                           0                   13                         3250                           28                            1
2         441                           1                   16                         4000                           35                            1
3         160                           2                   20                         5000                           45                            1
4         358                           1                   24                         6000                           77                            0
5         335                           4                    4                         1000                            4                            0
6          47                           2                    7                         1750                           14                            1
7         164                           1                   12                         3000                           35                            0
8         736                           5                   46                        11500                           98                            1
9         436                           0                    3                          750                            4                            0






Transform data for machine learning

  • Used iloc to drop the subject id column and the prediction column (i.e. 'Made Donation in March 2007')
    • Previously, forgot to drop the 'Made Donation in March 2007' column and totally overfitted the data #LessonLearned
  • Converted to matrix array format for input into predict_proba, train_test_split functions
  • Saved the predictions in a separate array

In [74]:
X = df_blood.iloc[:,1:5].values  # .as_matrix() was removed in pandas 1.0
y = list(df_blood["Made Donation in March 2007"])

In [43]:
# Log transform the features
import numpy as np
from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log1p)
X = transformer.transform(X)

# Normalize the features using StandardScaler
#X_test = sklearn.preprocessing.StandardScaler().fit_transform(X_test)
#X_train = sklearn.preprocessing.StandardScaler().fit_transform(X_train)
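The commented-out lines above would fit a separate scaler on each set. A minimal sketch of the more usual pattern, fitting the scaler on the training rows only and reusing it on the test rows, is below. The data is synthetic and the `_demo`, `_tr`, `_te` names are made up for illustration; they are not part of this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the four donation features (assumption, not the real data)
rng = np.random.RandomState(0)
X_demo = rng.randint(0, 100, size=(200, 4)).astype(float)
y_demo = rng.randint(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

# Learn the mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_tr)

# Apply those same statistics to both sets
X_tr_norm = scaler.transform(X_tr)
X_te_norm = scaler.transform(X_te)

print(X_tr_norm.mean(axis=0))  # approximately zero for every column
print(X_te_norm.mean(axis=0))  # close to zero, but generally not exactly zero
```

Fitting on the training rows alone keeps information about the test rows out of the preprocessing step.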






Split data into training/test sets for cross-validation

sklearn.model_selection.train_test_split

Split the total dataset into training and testing sets via random selection

  • test_size - proportion of the dataset to put into the test set
  • random_state - Seed for pseudo-random number generator

In [117]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test= sklearn.model_selection.train_test_split(
    X, y, 
    test_size=0.5, 
    random_state=0) 

print("No. Rows in training set:\t", len(X_train))
print("No. Rows in testing set:\t" , len(X_test))


('No. Rows in training set:\t', 288)
('No. Rows in testing set:\t', 288)






Cross-validation: Split the data into training/test sets by hand

Note: I don't actually use this split, since I used train_test_split() above

Puts the first third of the rows into the training set, the second third into the validation set, and the last third into the test set


In [118]:
# Split data into 4 partitions
#  - training set
#  - validation set
#  - combined training & validation set
#  - testing set

# nrows_total = df_blood.count()[1]
# nrows_train = int(nrows_total/3)
# nrows_valid = int(nrows_total*2/3)

# X_train, y_train             = X[:nrows_train]           , y[:nrows_train]
# X_valid, y_valid             = X[nrows_train:nrows_valid], y[nrows_train:nrows_valid]
# X_test , y_test              = X[nrows_valid:]           , y[nrows_valid:]
# X_train_valid, y_train_valid = X[:nrows_valid]           , y[:nrows_valid]

# print("Total number of rows:\t", nrows_total)
# print("Training rows:\t\t"     , 0          ,"-", nrows_train)
# print("Validation rows:\t"     , nrows_train,"-", nrows_valid)
# print("Testing rows:\t\t"      ,nrows_valid ,"-" , nrows_total)







Playing around with different classifiers

With the data loaded, transformed and split, I can now pass it into different classifiers and see how they perform

Basic workflow for each classifier:

  1. import classifier
  2. initialize classifier into clf variable
  3. fit data (X_train, y_train) into classifier
  4. predict output (i.e. probabilities) using X_test data
  5. evaluate prediction quality (via sklearn.metrics.log_loss function & y_test data)
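The five steps above can be wrapped in one small function. This is only a sketch: `fit_and_score` is a hypothetical helper name, and the data comes from `make_classification` rather than the blood-donation file.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split


def fit_and_score(clf, X_train, y_train, X_test, y_test):
    """Hypothetical helper: fit, predict probabilities, return the log loss."""
    clf.fit(X_train, y_train)            # step 3: fit on the training split
    probs = clf.predict_proba(X_test)    # step 4: class probabilities
    return log_loss(y_test, probs)       # step 5: lower is better


# Synthetic data standing in for the four donation features (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1,
    random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

# Steps 1-2: import and initialize the classifier, then hand it to the helper
demo_score = fit_and_score(LogisticRegression(), X_tr, y_tr, X_te, y_te)
print("Log-loss score:\t", demo_score)
```

Any classifier that exposes `predict_proba` can be passed in the same way.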

Logistic Regression Classifier:


In [119]:
from sklearn.linear_model import LogisticRegression

clf       = sklearn.linear_model.LogisticRegression()
clf.fit(X_train, y_train)

clf_probs = clf.predict_proba(X_test)

score     = sklearn.metrics.log_loss(y_test, clf_probs)
print("Log-loss score:\t", score)


('Log-loss score:\t', 0.46686030610135276)

In [81]:
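# Note: ElasticNet is a regressor, so predict() returns unbounded scores
# rather than probabilities; treat the log-loss below as a rough comparison
clfE       = sklearn.linear_model.ElasticNet(l1_ratio=0.24)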
clfE.fit(X_train, y_train)

clfE_probs = clfE.predict(X_test)

score     = sklearn.metrics.log_loss(y_test, clfE_probs)
print("Log-loss score:\t", score)


('Log-loss score:\t', 0.49254144740610439)

In [112]:
from sklearn.linear_model import LogisticRegression

clfL       = sklearn.linear_model.LogisticRegression(C=1)
clfL.fit(X_train, y_train)

clfL_probs = clfL.predict_proba(X_test)

score     = sklearn.metrics.log_loss(y_test, clfL_probs)
print("Log-loss score:\t", score)


('Log-loss score:\t', 0.46686030610135276)

In [102]:
clfECV       = sklearn.linear_model.ElasticNetCV(l1_ratio=0.24)
clfECV.fit(X_train, y_train)

clfECV_probs = clfECV.predict(X_test)
score     = sklearn.metrics.log_loss(y_test, clfECV_probs )
print("Log-loss score:\t", score)


('Log-loss score:\t', 0.53195276047172257)
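ElasticNet and ElasticNetCV are regression models, so their `predict` output is not guaranteed to lie in [0, 1]. One possible alternative, sketched here on synthetic data, is a logistic regression with an elastic-net penalty, which does provide `predict_proba`. This assumes scikit-learn 0.21 or newer; the `l1_ratio=0.24` simply mirrors the value used above, and `max_iter=5000` is an arbitrary choice to give the solver room to converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the four donation features (assumption)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=3, n_redundant=1,
    random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

# The elastic-net penalty is only supported by the "saga" solver
clf_en = LogisticRegression(
    penalty="elasticnet", solver="saga", l1_ratio=0.24, max_iter=5000)
clf_en.fit(X_tr, y_tr)

# predict_proba returns one column per class, each row summing to 1
en_probs = clf_en.predict_proba(X_te)
print("Log-loss score:\t", log_loss(y_te, en_probs))
```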

In [100]:
dir(clfECV)


Out[100]:
['__abstractmethods__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__doc__',
 '__format__',
 '__getattribute__',
 '__getstate__',
 '__hash__',
 '__init__',
 '__module__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_abc_cache',
 '_abc_negative_cache',
 '_abc_negative_cache_version',
 '_abc_registry',
 '_decision_function',
 '_estimator_type',
 '_get_param_names',
 '_preprocess_data',
 '_set_intercept',
 'alpha_',
 'alphas',
 'alphas_',
 'coef_',
 'copy_X',
 'cv',
 'decision_function',
 'dual_gap_',
 'eps',
 'fit',
 'fit_intercept',
 'get_params',
 'intercept_',
 'l1_ratio',
 'l1_ratio_',
 'max_iter',
 'mse_path_',
 'n_alphas',
 'n_iter_',
 'n_jobs',
 'normalize',
 'path',
 'positive',
 'precompute',
 'predict',
 'random_state',
 'score',
 'selection',
 'set_params',
 'tol',
 'verbose']

Submission code for Logistic Regression

  • Results saved in: /data/processed/

In [46]:
from sklearn.linear_model import LogisticRegression

# Load Test Data
data_filename = 'blood_test.csv'
df_test       = pd.read_csv(data_dir+data_filename)

# Transform data
#  - dropped the ID column
#  - converted to a NumPy array for input to `predict_proba`
#  - applied the same log1p transform that was used on the training features
Z             = transformer.transform(df_test.iloc[:,1:5].values)

# Predict data
clf_probs     = clf.predict_proba(Z)

# Add predictions back into test data frame
df_test['Made Donation in March 2007'] = clf_probs[:,1]
df_test.head()

# Setup save filename and directory
submit_dir      = '../data/processed/'
submit_filename = 'submit-logistic_regression.csv'

# Save to CSV-file using only the subject-id and prediction columns
df_test.to_csv(submit_dir+submit_filename, 
               columns=('Unnamed: 0', 'Made Donation in March 2007'),
               index=False)
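The submission step has to repeat every transformation that was applied to the training features. One way to make that hard to forget is to bundle the transform and the classifier into a single `Pipeline`. The sketch below uses synthetic data; `X_fit`, `y_fit`, `Z_new` and the step names `"log"` and `"clf"` are placeholders, not names from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

# Synthetic non-negative features standing in for the donation data (assumption)
rng = np.random.RandomState(0)
X_fit = rng.randint(0, 100, size=(200, 4)).astype(float)
y_fit = (X_fit[:, 1] + rng.normal(0, 20, size=200) > 50).astype(int)
Z_new = rng.randint(0, 100, size=(50, 4)).astype(float)

# The log transform and the classifier travel together
pipe = Pipeline([
    ("log", FunctionTransformer(np.log1p)),
    ("clf", LogisticRegression()),
])
pipe.fit(X_fit, y_fit)

# New rows are log-transformed automatically before prediction
pipe_probs = pipe.predict_proba(Z_new)[:, 1]
print(pipe_probs[:5])
```

With this layout the raw test features can be passed straight to `predict_proba`.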

Random Forest Classifier

sklearn.ensemble.RandomForestClassifier


In [30]:
from sklearn.ensemble import RandomForestClassifier

# Train uncalibrated random forest classifier
# on the training split
# and evaluate on the test split
clf       = sklearn.ensemble.RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)

# Get probabilities
clf_probs = clf.predict_proba(X_test)

# Test/Evaluate the model
score     = sklearn.metrics.log_loss(y_test, clf_probs)
print("Log-loss score:\t", score)


('Log-loss score:\t', 1.1349310694085468)

Calibrated Random Forest Classifier

  • sklearn.calibration.CalibratedClassifierCV
  • sklearn.ensemble.RandomForestClassifier

In [31]:
from sklearn.ensemble import RandomForestClassifier

# Train random forest classifier
#  - fit on the training split
#  - calibrate the fitted forest with a sigmoid
#  - evaluate on test data
clf       = sklearn.ensemble.RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)
clf_probs = clf.predict_proba(X_test)


from sklearn.calibration import CalibratedClassifierCV

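# Pass the fitted RandomForestClassifier into CalibratedClassifierCV
# Note: with cv="prefit" the calibrator is normally fitted on rows the
# forest has not seen; here it reuses X_train
sig_clf   = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_train, y_train)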

# Get prediction probabilities from model
sig_clf_probs = sig_clf.predict_proba(X_test)

# Test quality of predictions using `log_loss` function
sig_score     = sklearn.metrics.log_loss(y_test, sig_clf_probs)
print("Log-loss score:\t", sig_score)


('Log-loss score:\t', 0.80482654902433792)
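A possible variation, sketched on synthetic data: instead of `cv="prefit"`, pass an integer `cv` so that `CalibratedClassifierCV` fits the forest and the sigmoid on different folds of the training split. The `raw_rf` and `cal_rf` names and the `random_state=0` seeds are choices made for this example only.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the four donation features (assumption)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1,
    random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0)

# Uncalibrated forest for reference
raw_rf = RandomForestClassifier(n_estimators=25, random_state=0)
raw_rf.fit(X_tr, y_tr)
raw_score = log_loss(y_te, raw_rf.predict_proba(X_te))

# cv=3: in each fold the forest is fitted on two thirds of the training
# split and the sigmoid is fitted on the remaining third
cal_rf = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=25, random_state=0),
    method="sigmoid", cv=3)
cal_rf.fit(X_tr, y_tr)
cal_probs = cal_rf.predict_proba(X_te)
cal_score = log_loss(y_te, cal_probs)

print("Uncalibrated log-loss:\t", raw_score)
print("Calibrated log-loss:\t", cal_score)
```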

Support Vector Machine Classifier

  • sklearn.svm.SVC

In [132]:
from sklearn import svm
clf = []
clf_probs = []
# clf = svm.SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
#     decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
#     max_iter=-1, probability=True, random_state=None, shrinking=True,
#     tol=0.001, verbose=False)
# clf = svm.SVC(kernel='rbf',degree=2, probability=True)
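# StandardScaler
# X_trainNorm / X_testNorm are not defined in the cells shown here; assuming
# they are standardized copies of X_train / X_test, they could be built as:
scaler      = sklearn.preprocessing.StandardScaler().fit(X_train)
X_trainNorm = scaler.transform(X_train)
X_testNorm  = scaler.transform(X_test)
clf = svm.SVC(kernel='linear', probability=True)
clf.fit(X_trainNorm, y_train)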

# Get prediction probabilities from model
clf_probs = clf.predict_proba(X_testNorm)

# Test quality of predictions using `log_loss` function
score     = sklearn.metrics.log_loss(y_test, clf_probs)
print("Log-loss score:\t", score)


('Log-loss score:\t', 0.55300082892047209)

In [22]:



[[   2    8 2000   38]
 [   2   11 2750   79]
 [   2    5 1250   63]
 ..., 
 [  11    6 1500   41]
 [   2    7 1750   76]
 [   4    5 1250   28]]






Playing around w/: sklearn.model_selection.cross_val_score

From Katie Malone's Workflows in Python:

The cheapest and easiest way to train on one portion of my dataset and test on another, and to get a measure of model quality at the same time, is to use sklearn.cross_validation.cross_val_score().

cross_val_score()

  • splits data into 3 equal portions
  • trains on 2 portions
  • tests on the third

This process repeats 3 times. That’s why 3 numbers get printed in the code block below.

Note: cross_val_score reports log loss as a negative number, under the scoring name neg_log_loss, so that a higher score is always better
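To make the three-way split described above concrete, here is a sketch that runs the same loop by hand and compares it with `cross_val_score`. It assumes synthetic data and passes `cv=3` explicitly, since the default number of folds depends on the scikit-learn version. For classifiers, an integer `cv` produces stratified folds, so the manual loop uses `StratifiedKFold`.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data standing in for the four donation features (assumption)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1,
    random_state=0)

base_clf = LogisticRegression()

# Manual version: train on two portions, score on the third, three times
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X_demo, y_demo):
    fold_clf = clone(base_clf).fit(X_demo[train_idx], y_demo[train_idx])
    fold_probs = fold_clf.predict_proba(X_demo[test_idx])
    fold_loss = log_loss(y_demo[test_idx], fold_probs, labels=fold_clf.classes_)
    manual_scores.append(-fold_loss)  # negated, like neg_log_loss

# Library version of the same procedure
auto_scores = cross_val_score(
    base_clf, X_demo, y_demo, cv=3, scoring="neg_log_loss")

print(np.array(manual_scores))
print(auto_scores)
```

Both printed arrays hold one negative log-loss value per fold.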

See:

Generate data

  • X: Training vector
  • y: Target vector

In [10]:
X = df_blood.iloc[:,1:5].values  # .as_matrix() was removed in pandas 1.0
y = list(df_blood["Made Donation in March 2007"])

LogisticRegression


In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

clf   = sklearn.linear_model.LogisticRegression()
score = sklearn.model_selection.cross_val_score( 
    clf, 
    X, y,
    cv=3,  # three folds, matching the three scores printed below
    scoring="neg_log_loss")

print(score)


[-0.56947507 -0.54819823 -0.48901389]

DecisionTreeClassifier


In [12]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

clf   = sklearn.tree.DecisionTreeClassifier()
score = sklearn.model_selection.cross_val_score( 
    clf, 
    X, y,
    cv=3,  # three folds, matching the three scores printed below
    scoring="neg_log_loss")
print(score)


[-14.08570262  -7.84956205  -8.54322446]

RandomForestClassifier


In [13]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

clf   = sklearn.ensemble.RandomForestClassifier()
score = sklearn.model_selection.cross_val_score( 
    clf, 
    X, y,
    cv=3,  # three folds, matching the three scores printed below
    scoring="neg_log_loss")
print(score)


[-1.82437783 -5.49929871 -1.66941622]

In [ ]: