In [73]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
# import custom modules written by Julio
from ispy1 import predictive_statistics
# reload modules without restarting the kernel (makes development easier)
import importlib
importlib.reload(predictive_statistics);
In [2]:
df = pd.read_csv('./data/I-SPY_1_clean_data.csv')
df.head(2)
Out[2]:
In [3]:
# allocate outcome
outcome = 'PCR'
y = predictive_statistics.labels_to_numbers(df, outcome);
# check how imbalanced the data are
df[outcome].value_counts(normalize = True).plot.barh();
plt.title('Normalized Sample size for PCR')
plt.xlabel('group size (%)');
plt.savefig('Sample_Size_PCR.png')
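`labels_to_numbers` lives in the custom `ispy1` module, so its body does not appear in this notebook. A minimal sketch of what it is assumed to do (encode a binary label column such as Yes/No as integers 0/1); the name `labels_to_numbers_sketch` is hypothetical:
In [ ]:
# Hypothetical sketch, not the ispy1 implementation: encode a binary
# label column as 0/1 using sklearn's LabelEncoder.
from sklearn.preprocessing import LabelEncoder

def labels_to_numbers_sketch(df, column):
    # unique labels are mapped (alphabetically) to 0..n_classes-1
    return LabelEncoder().fit_transform(df[column].values)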
In [4]:
# allocate continuous predictors
cont_predictors = ['age','MRI_LD_Baseline', 'MRI_LD_1_3dAC', 'MRI_LD_Int_Reg', 'MRI_LD_PreSurg']
X_cont = df[cont_predictors].values
# allocate clinical predictors
cat_predictors = ['White', 'ER+', 'PR+', 'HR+'];
X_cat = pd.get_dummies(df[cat_predictors], drop_first=True).values
# allocate a single predictor matrix X
X = np.concatenate( (X_cont, X_cat), axis=1)
# allocate outcome
outcome = 'PCR'
y = predictive_statistics.labels_to_numbers(df, outcome);
In [5]:
from sklearn import metrics
def mymetrics(y_true, y_pred):
    # sklearn metrics take the true labels first, then the predictions
    print(metrics.classification_report(y_true, y_pred))
    k = np.round(metrics.cohen_kappa_score(y_true, y_pred), 3)
    auc = np.round(metrics.roc_auc_score(y_true, y_pred), 3)
    accuracy = np.round(metrics.accuracy_score(y_true, y_pred), 3)
    print("Kappa = " + str(k))
    print("AUC = " + str(auc))
    print("Accuracy = " + str(accuracy))

# degenerate baseline: predict everything negative except a single positive
y_crazy = np.zeros_like(y)
y_crazy[np.argwhere(y > 0)[0]] = 1
mymetrics(y, y_crazy)
In [6]:
# standard
auc1, kappa1, fpr1, tpr1 = predictive_statistics.Logistic_Regression(X, y, oversample = False)
# oversampled
auc2, kappa2, fpr2, tpr2 = predictive_statistics.Logistic_Regression(X, y, oversample = True, K_neighbors = 4)
title ='Effect of oversampling on Logistic Regression for PCR'
predictive_statistics.plot_compare_roc(fpr1, tpr1,fpr2, tpr2, auc1, auc2, title = title)
plt.savefig('PCR_Logistic.png')
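`Logistic_Regression` is defined in `ispy1.predictive_statistics`, so only its calls appear here. A minimal sketch of what it is assumed to do, judging by its signature and return values (hold out a test split, optionally oversample the training data with ADASYN, fit sklearn's `LogisticRegression`, and return the test AUC, Cohen's kappa, and ROC curve); all names below are hypothetical:
In [ ]:
# Hypothetical sketch of the custom wrapper, assuming imbalanced-learn
# (imblearn) provides the ADASYN oversampler used when oversample=True.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score, cohen_kappa_score
from imblearn.over_sampling import ADASYN

def logistic_roc_sketch(X, y, oversample=False, K_neighbors=4, seed=0):
    Xtr, Xte, ytr, yte = train_test_split(X, y, stratify=y, random_state=seed)
    if oversample:
        # ADASYN synthesizes minority-class samples from K nearest neighbors
        Xtr, ytr = ADASYN(n_neighbors=K_neighbors, random_state=seed).fit_resample(Xtr, ytr)
    clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
    scores = clf.predict_proba(Xte)[:, 1]
    fpr, tpr, _ = roc_curve(yte, scores)
    return roc_auc_score(yte, scores), cohen_kappa_score(yte, clf.predict(Xte)), fpr, tpr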
In [7]:
# standard
auc1, kappa1, fpr1, tpr1, forest = predictive_statistics.RandomForest_Classifier(X, y,oversample=False)
t ='Feature Importance for RFC unbalanced'
predictive_statistics.plot_forest_feature_importances_(forest, cont_predictors + cat_predictors, title = t)
plt.savefig('RFC_std_PCR.png')
# imbalanced learning (ADASYN oversampling)
auc2, kappa2, fpr2, tpr2, Forest = predictive_statistics.RandomForest_Classifier(X, y, oversample=True, K_neighbors = 5)
t ='Feature Importance for RFC ADASYN'
predictive_statistics.plot_forest_feature_importances_(Forest, cont_predictors + cat_predictors, title = t)
plt.savefig('RFC_Adasyn_PCR.png')
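`plot_forest_feature_importances_` presumably ranks the fitted forest's impurity-based `feature_importances_` and draws a labeled bar chart; a hypothetical sketch:
In [ ]:
# Hypothetical sketch: bar chart of a fitted forest's feature importances,
# most important first.
import numpy as np
import matplotlib.pyplot as plt

def plot_importances_sketch(forest, feature_names, title=''):
    imp = forest.feature_importances_      # mean impurity decrease per feature
    order = np.argsort(imp)[::-1]          # descending importance
    plt.bar(range(len(imp)), imp[order])
    plt.xticks(range(len(imp)), np.array(feature_names)[order], rotation=90)
    plt.ylabel('importance')
    plt.title(title)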
In [8]:
# compare
title ='Effect of oversampling on RFC for PCR'
predictive_statistics.plot_compare_roc(fpr1, tpr1,fpr2, tpr2, auc1, auc2, title = title)
plt.savefig('RFC_compare_PCR.png')
In [9]:
# sensitivity of ADASYN oversampling to the number of neighbors
K = [4, 5, 6]
for k in K:
    predictive_statistics.Logistic_Regression(X, y, oversample=True, K_neighbors=k)
In [10]:
# allocate outcome
outcome = 'Alive'
y = predictive_statistics.labels_to_numbers(df, outcome);
# check how imbalanced the data are
df[outcome].value_counts(normalize = True).plot.barh();
plt.title('Normalized Sample size for Alive')
plt.xlabel('group size (%)');
plt.savefig('Sample_Size_Alive.png')
In [11]:
# standard
auc1, kappa1, fpr1, tpr1 = predictive_statistics.Logistic_Regression(X, y)
# imbalanced learning (ADASYN oversampling)
auc2, kappa2, fpr2, tpr2 = predictive_statistics.Logistic_Regression(X, y, oversample=True, K_neighbors = 4)
In [12]:
title ='Effect of oversampling on Logistic Regression for Alive'
predictive_statistics.plot_compare_roc(fpr1, tpr1,fpr2, tpr2, auc1, auc2, title = title)
plt.savefig('Alive_Logistic.png')
In [13]:
# standard
auc1, kappa1, fpr1, tpr1, _= predictive_statistics.RandomForest_Classifier(X, y)
# imbalanced learning (ADASYN oversampling)
auc2, kappa2, fpr2, tpr2, _ = predictive_statistics.RandomForest_Classifier(X, y, oversample=True, K_neighbors = 6)
title ='Effect of oversampling on RFC for Alive'
predictive_statistics.plot_compare_roc(fpr1, tpr1,fpr2, tpr2, auc1, auc2, title = title)
plt.savefig('Alive_RFC.png')
In [14]:
# allocate PCR as an additional predictor
pcr = predictive_statistics.labels_to_numbers(df, 'PCR').reshape(-1, 1)
newX = np.concatenate((X,pcr), axis = 1)
In [15]:
# standard
auc1, kappa1, fpr1, tpr1 = predictive_statistics.Logistic_Regression(newX, y)
# imbalanced learning (ADASYN oversampling)
auc2, kappa2, fpr2, tpr2 = predictive_statistics.Logistic_Regression(newX, y, oversample=True, K_neighbors = 10)
In [16]:
title ='Effect of oversampling on Logistic Regression for Alive (PCR included)'
predictive_statistics.plot_compare_roc(fpr1, tpr1,fpr2, tpr2, auc1, auc2, title = title)
In [17]:
# standard
auc1, kappa1, fpr1, tpr1 , _ = predictive_statistics.RandomForest_Classifier(newX, y)
In [18]:
# imbalanced learning (ADASYN oversampling)
auc2, kappa2, fpr2, tpr2, _ = predictive_statistics.RandomForest_Classifier(newX, y, oversample=True, K_neighbors = 10)
In [19]:
rcb = pd.get_dummies(df['RCB']).values
newX = np.concatenate((X,rcb), axis = 1)
# standard
auc1, kappa1, fpr1, tpr1 = predictive_statistics.Logistic_Regression(newX, y)
auc2, kappa2, fpr2, tpr2, _= predictive_statistics.RandomForest_Classifier(newX, y)
In [20]:
# compare
title ='Predicting Survival including RCB as a predictor'
predictive_statistics.plot_compare_roc(fpr1, tpr1,fpr2, tpr2, auc1, auc2, title = title)
plt.legend(['Logistic Regression','Random Forest Classifier']);
plt.savefig('LG_vs_RFC_Alive_RCB.png')
In [21]:
# imbalanced learning (ADASYN oversampling)
auc3, kappa3, fpr3, tpr3 = predictive_statistics.Logistic_Regression(newX, y, oversample=True, K_neighbors = 4)
auc4, kappa4, fpr4, tpr4, _= predictive_statistics.RandomForest_Classifier(newX, y, oversample=True, K_neighbors = 4)
In [38]:
title ='Predicting Survival including RCB as a predictor (ADASYN)'
predictive_statistics.plot_compare_roc(fpr3, tpr3,fpr4, tpr4, auc3, auc4, title = title)
plt.legend(['Logistic Regression','Random Forest Classifier']);
#plt.savefig('LG_vs_RFC_Alive_RCB.png')
In [ ]:
# metrics
mae = metrics.median_absolute_error

def mae_report(Ytest, Yhat, outcome_):
    error = np.round(mae(Ytest, Yhat), decimals=3)
    # report
    print('\n')
    print('==' * 40)
    print('The median absolute error for the testing data set of ' + outcome_ + ' is: ' + str(error))
    print('==' * 40)

def train_test_report(predictor, Xtrain, Ytrain, Xtest, Ytest, outcome):
    # train
    predictor.fit(Xtrain, Ytrain)
    # test
    Yhat = predictor.predict(Xtest)
    # report
    mae_report(Ytest, Yhat, outcome)

# lsq
import statsmodels.api as sm

def lsq(Xtrain, Ytrain, Xtest, Ytest, outcome=''):
    # train (note: sm.OLS fits without an intercept unless a constant column is added)
    OLS = sm.OLS(Ytrain, Xtrain).fit()
    print(OLS.summary())
    # test
    Yhat = OLS.predict(Xtest)
    # report
    mae_report(Ytest, Yhat, outcome)

# SVR
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# GridSearchCV utility
def gridsearch(regressor, grid):
    # greater_is_better=False negates the error so that GridSearchCV's
    # maximization selects the lowest-error model
    optimized_regressor = GridSearchCV(regressor,
                                       param_grid=grid,
                                       cv=3, verbose=0, n_jobs=-1,
                                       scoring=metrics.make_scorer(metrics.median_absolute_error,
                                                                   greater_is_better=False))
    return optimized_regressor

def svr(Xtrain, Ytrain, Xtest, Ytest, outcome=''):
    # define regressor
    regressor = SVR()
    # define parameter grid search
    grid = dict(kernel=['rbf', 'linear', 'sigmoid'],
                C=np.arange(1, 11, 1),
                epsilon=np.arange(1, 11, 1),
                gamma=np.linspace(1/10, 10, 3))
    # perform grid search
    grid_search = gridsearch(regressor, grid)
    # train, test, and report
    train_test_report(grid_search, Xtrain, Ytrain, Xtest, Ytest, outcome)

# ElasticNet
from sklearn.linear_model import ElasticNet as ENet

def ElasticNet(Xtrain, Ytrain, Xtest, Ytest, outcome=''):
    # define regressor
    regressor = ENet(max_iter=5000)
    # define parameter grid search
    grid = dict(alpha=np.arange(1, 20, .5), l1_ratio=np.arange(.1, 1, .05))
    # perform grid search
    grid_search = gridsearch(regressor, grid)
    # train, test, and report
    train_test_report(grid_search, Xtrain, Ytrain, Xtest, Ytest, outcome)

# Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor as RFR

RANDOM_STATE = 0  # assumed seed; RANDOM_STATE is not defined elsewhere in this notebook

def RandomForestRegressor(Xtrain, Ytrain, Xtest, Ytest, outcome=''):
    # define regressor
    regressor = RFR(criterion='mse', random_state=RANDOM_STATE)
    num_features = Xtrain.shape[1]
    # define parameter grid search
    grid = dict(n_estimators=np.arange(5, 100, 5),
                max_features=np.arange(1, num_features, 1),
                max_depth=[None, 1, 2, 3, 4, 5])
    # perform grid search
    grid_search = gridsearch(regressor, grid)
    # train, test, and report
    train_test_report(grid_search, Xtrain, Ytrain, Xtest, Ytest, outcome)
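One subtlety in `gridsearch` above: `median_absolute_error` is a loss, while `GridSearchCV` always maximizes its scorer, so `greater_is_better=False` is needed for the search to pick the lowest-error model. A self-contained toy check of that sign convention (hypothetical data, not the notebook's):
In [ ]:
# Toy check: with greater_is_better=False, make_scorer returns the
# negated error, so maximizing the scorer minimizes the error.
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression

Xtoy = np.arange(10, dtype=float).reshape(-1, 1)
ytoy = 2 * Xtoy.ravel() + 1
model = LinearRegression().fit(Xtoy, ytoy)
scorer = metrics.make_scorer(metrics.median_absolute_error, greater_is_better=False)
print(scorer(model, Xtoy, ytoy))  # ~ -0.0: the negated median absolute error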
In [57]:
from sklearn.preprocessing import StandardScaler
# allocate continuous predictors
cont_predictors = ['age','MRI_LD_Baseline', 'MRI_LD_1_3dAC', 'MRI_LD_Int_Reg', 'MRI_LD_PreSurg']
contX = df[cont_predictors].values
contX = StandardScaler().fit_transform(contX)
# allocate categorical predictors
cat_pred = ['PCR','White', 'ER+', 'PR+', 'HR+'];
catX = pd.get_dummies(df[cat_pred], drop_first=True).values
# concatenate predictors
X = np.concatenate( (catX, contX), axis=1)
In [58]:
# outcome
y = df.RFS.values / 30  # convert days to (approximate) months
#split
X_train, X_test, y_train, y_test = predictive_statistics.split_data(X, y, False)
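`split_data` is another `ispy1` helper; it is assumed to wrap sklearn's `train_test_split`, with the boolean flag controlling stratification (False here, since RFS is a continuous outcome). A hypothetical sketch:
In [ ]:
# Hypothetical sketch, not the ispy1 implementation.
from sklearn.model_selection import train_test_split

def split_data_sketch(X, y, stratify=False, test_size=0.25, seed=0):
    strat = y if stratify else None    # stratify only makes sense for class labels
    return train_test_split(X, y, test_size=test_size, stratify=strat, random_state=seed)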
In [59]:
# LSQ
predictive_statistics.lsq(X_train, y_train, X_test, y_test, outcome =' RFS (months)')
In [60]:
importlib.reload(predictive_statistics);
In [61]:
predictive_statistics.ElasticNet(X_train, y_train, X_test, y_test, outcome =' RFS (months)');
In [62]:
predictive_statistics.svr(X_train, y_train, X_test, y_test, outcome =' RFS (months)');
In [63]:
predictive_statistics.RandomForestRegressor(X_train, y_train, X_test, y_test, outcome =' RFS (months)');
In [74]:
y = df.Survival_length.values / 30  # convert days to (approximate) months
#split
X_train, X_test, y_train, y_test = predictive_statistics.split_data(X, y, False)
In [75]:
# LSQ
predictive_statistics.lsq(X_train, y_train, X_test, y_test, outcome ='Survival_length (months)')
In [76]:
predictive_statistics.svr(X_train, y_train, X_test, y_test, outcome ='Survival length (months)');
In [71]:
predictive_statistics.ElasticNet(X_train, y_train, X_test, y_test, outcome = 'Survival length (months)');
In [72]:
predictive_statistics.RandomForestRegressor(X_train, y_train, X_test, y_test, outcome = 'Survival length (months)');
In [ ]: