Predictive statistics on the I-SPY1 Clinical Trial

0. Load modules and cleaned data


In [73]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
# import custom module written by Julio
from ispy1 import predictive_statistics

# reload modules without restarting the kernel (makes development easier)
import importlib
importlib.reload(predictive_statistics);

In [2]:
df = pd.read_csv('./data/I-SPY_1_clean_data.csv')
df.head(2)


Out[2]:
SUBJECTID age White ER+ PR+ HR+ Bilateral Right_Breast MRI_LD_Baseline MRI_LD_1_3dAC MRI_LD_Int_Reg MRI_LD_PreSurg Alive Survival_length RFS RFS_code PCR RCB
0 1001 38.73 Yes Yes No Yes No No 88.0 78.0 30.0 14.0 No 1264 751 1 No 2.0
1 1002 37.79 Yes Yes Yes Yes No Yes 29.0 26.0 66.0 16.0 No 1155 1043 1 No 3.0

Prediction of categorical outcomes

1.0 Pathological Complete Response (PCR)


In [3]:
# select the outcome variable
outcome = 'PCR'
y = predictive_statistics.labels_to_numbers(df, outcome);

# check how unbalanced the data are
df[outcome].value_counts(normalize = True).plot.barh();
plt.title('Normalized Sample size for PCR')
plt.xlabel('group size (%)');
plt.savefig('Sample_Size_PCR.png')
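
labels_to_numbers comes from the custom ispy1 package, so its code is not shown here. A minimal sketch of what it presumably does (encoding 'Yes'/'No' labels as 0/1), offered as an assumption rather than the package's actual implementation:

# hypothetical sketch of ispy1.predictive_statistics.labels_to_numbers
from sklearn.preprocessing import LabelEncoder

def labels_to_numbers(df, column):
    # LabelEncoder sorts labels alphabetically, so 'No' -> 0 and 'Yes' -> 1
    return LabelEncoder().fit_transform(df[column].values)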



In [4]:
# select continuous predictors
cont_predictors = ['age','MRI_LD_Baseline', 'MRI_LD_1_3dAC', 'MRI_LD_Int_Reg', 'MRI_LD_PreSurg']
X_cont = df[cont_predictors].values

# select categorical (clinical) predictors
cat_predictors = ['White', 'ER+', 'PR+', 'HR+']
X_cat = pd.get_dummies(df[cat_predictors], drop_first=True).values

# assemble a single predictor matrix X
X = np.concatenate( (X_cont, X_cat), axis=1)

# select the outcome variable
outcome = 'PCR'
y = predictive_statistics.labels_to_numbers(df, outcome);
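
A quick sanity check on the assembled design matrix; with 168 patients, five continuous predictors, and four dummy-coded flags, the expected shapes are (168, 9) and (168,):

# sanity check (assumes the cell above has run)
print(X.shape, y.shape)  # expected: (168, 9) (168,)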

Why Kappa is a good metric

With unbalanced classes, accuracy and even AUC can look respectable for a classifier that predicts almost everything as the majority class. Cohen's kappa discounts the agreement expected by chance, so it stays near zero for such a degenerate predictor. The cell below demonstrates this with an almost-all-negative prediction.


In [5]:
from sklearn import metrics

def mymetrics(ypredicted, yexpected):
    # sklearn treats the first argument as ground truth; kappa and accuracy are
    # symmetric in their arguments, so the demo below holds either way
    print(metrics.classification_report(ypredicted, yexpected))
    k = np.round(metrics.cohen_kappa_score(ypredicted, yexpected), 3)
    auc = np.round(metrics.roc_auc_score(ypredicted, yexpected), 3)
    accuracy = np.round(metrics.accuracy_score(ypredicted, yexpected), 3)

    print("Kappa = " + str(k))
    print("AUC = " + str(auc))
    print("Accuracy = " + str(accuracy))

# degenerate prediction: all negative except a single positive observation
y_crazy = np.zeros_like(y)
y_crazy[np.argwhere(y > 0)[0]] = 1
mymetrics(y_crazy, y)


             precision    recall  f1-score   support

          0       1.00      0.74      0.85       167
          1       0.02      1.00      0.04         1

avg / total       0.99      0.74      0.84       168

Kappa = 0.032
AUC = 0.868
Accuracy = 0.738
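
Concretely, kappa is k = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from the class marginals. A hand computation from the confusion matrix, as a sketch equivalent to metrics.cohen_kappa_score:

# Cohen's kappa computed by hand from the confusion matrix
from sklearn.metrics import confusion_matrix

def kappa_by_hand(a, b):
    cm = confusion_matrix(a, b)
    n = cm.sum()
    p_o = np.trace(cm) / n                                # observed agreement
    p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)

print(kappa_by_hand(y_crazy, y))  # ~0.032, matching the value above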
  • ### Logistic Regression

In [6]:
# standard
auc1, kappa1, fpr1, tpr1 = predictive_statistics.Logistic_Regression(X, y, oversample = False)

# oversampled
auc2, kappa2, fpr2, tpr2 = predictive_statistics.Logistic_Regression(X, y, oversample = True, K_neighbors = 4)

title ='Effect of oversampling on Logistic Regression for PCR'
predictive_statistics.plot_compare_roc(fpr1, tpr1,fpr2, tpr2, auc1, auc2, title = title)
plt.savefig('PCR_Logistic.png')


             precision    recall  f1-score   support

          0       0.77      0.87      0.82        39
          1       0.29      0.17      0.21        12

avg / total       0.66      0.71      0.68        51

The estimated Cohen kappa is 0.0449438202247
The estimated AUC is 0.519
============================================================



Data was oversampled using the ADASYN method
             precision    recall  f1-score   support

          0       0.82      0.72      0.77        39
          1       0.35      0.50      0.41        12

avg / total       0.71      0.67      0.68        51

The estimated Cohen kappa is 0.190476190476
The estimated AUC is 0.609
============================================================
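
Logistic_Regression is a wrapper in the custom ispy1 module, so its internals are not shown in this notebook. A minimal sketch of what such a wrapper plausibly does, assuming imblearn's ADASYN and a held-out test set of roughly 30% (the reports above show 51 of 168 samples in testing); illustrative only, not the module's actual code:

# illustrative sketch of an oversampling-aware logistic regression wrapper
from imblearn.over_sampling import ADASYN
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

def logistic_regression_sketch(X, y, oversample=False, K_neighbors=4):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    if oversample:
        # ADASYN synthesizes new minority-class samples from K nearest neighbors
        ada = ADASYN(n_neighbors=K_neighbors, random_state=42)
        X_train, y_train = ada.fit_resample(X_train, y_train)
    clf = LogisticRegression().fit(X_train, y_train)
    y_prob = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = metrics.roc_curve(y_test, y_prob)
    auc = metrics.roc_auc_score(y_test, y_prob)
    kappa = metrics.cohen_kappa_score(y_test, clf.predict(X_test))
    return auc, kappa, fpr, tpr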



  • ### Random Forests

In [7]:
# standard
auc1, kappa1, fpr1, tpr1, forest = predictive_statistics.RandomForest_Classifier(X, y,oversample=False)
t ='Feature Importance for RFC unbalanced'
predictive_statistics.plot_forest_feature_importances_(forest, cont_predictors + cat_predictors, title = t)
plt.savefig('RFC_std_PCR.png')

# with ADASYN oversampling
auc2, kappa2, fpr2, tpr2, Forest = predictive_statistics.RandomForest_Classifier(X, y, oversample=True, K_neighbors = 5)
t ='Feature Importance for RFC ADASYN'
predictive_statistics.plot_forest_feature_importances_(Forest, cont_predictors + cat_predictors, title = t)
plt.savefig('RFC_Adasyn_PCR.png')


             precision    recall  f1-score   support

          0       0.80      0.92      0.86        39
          1       0.50      0.25      0.33        12

avg / total       0.73      0.76      0.73        51

The estimated Cohen kappa is 0.209302325581
The estimated AUC is 0.587
============================================================



Data was oversampled using the ADASYN method
             precision    recall  f1-score   support

          0       0.79      0.87      0.83        39
          1       0.38      0.25      0.30        12

avg / total       0.69      0.73      0.70        51

The estimated Cohen kappa is 0.13768115942
The estimated AUC is 0.561
============================================================



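plot_forest_feature_importances_ is also a custom ispy1 helper; a minimal stand-in that ranks the feature_importances_ of a fitted sklearn forest might look like this (illustrative only):

# minimal stand-in for the custom feature-importance plot; assumes a fitted forest
def plot_feature_importances(forest, feature_names, title=''):
    importances = forest.feature_importances_
    order = np.argsort(importances)
    plt.barh(np.arange(len(order)), importances[order])
    plt.yticks(np.arange(len(order)), np.array(feature_names)[order])
    plt.xlabel('Gini importance')
    plt.title(title)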

In [8]:
# compare standard vs oversampled ROC curves
title ='Effect of oversampling on RFC for PCR'
predictive_statistics.plot_compare_roc(fpr1, tpr1,fpr2, tpr2, auc1, auc2, title = title)
plt.savefig('PCR_RFC_compare.png')  # new filename: 'RFC_std_PCR.png' is already used by the feature-importance figure above



In [9]:
# sweep the ADASYN neighborhood size; K = 5 gives the best kappa and AUC below
K = [4, 5, 6]
for k in K:
    predictive_statistics.Logistic_Regression(X, y, oversample=True, K_neighbors = k)


Data was oversampled using the ADASYN method
             precision    recall  f1-score   support

          0       0.82      0.72      0.77        39
          1       0.35      0.50      0.41        12

avg / total       0.71      0.67      0.68        51

The estimated Cohen kappa is 0.190476190476
The estimated AUC is 0.609
============================================================



Data was oversampled using the ADASYN method
             precision    recall  f1-score   support

          0       0.84      0.69      0.76        39
          1       0.37      0.58      0.45        12

avg / total       0.73      0.67      0.69        51

The estimated Cohen kappa is 0.229333333333
The estimated AUC is 0.638
============================================================



Data was oversampled using the ADASYN method
             precision    recall  f1-score   support

          0       0.82      0.72      0.77        39
          1       0.35      0.50      0.41        12

avg / total       0.71      0.67      0.68        51

The estimated Cohen kappa is 0.190476190476
The estimated AUC is 0.609
============================================================



2.0 Survival (Alive)


In [10]:
# select the outcome variable
outcome = 'Alive'
y = predictive_statistics.labels_to_numbers(df, outcome);

# check how unbalanced the data are
df[outcome].value_counts(normalize = True).plot.barh();
plt.title('Normalized Sample size for Alive')
plt.xlabel('group size (%)');
plt.savefig('Sample_Size_Alive.png')


  • ### Logistic Regression

In [11]:
# standard
auc1, kappa1, fpr1, tpr1 = predictive_statistics.Logistic_Regression(X, y)

# with ADASYN oversampling
auc2, kappa2, fpr2, tpr2 = predictive_statistics.Logistic_Regression(X, y, oversample=True, K_neighbors = 4)


             precision    recall  f1-score   support

          0       0.50      0.27      0.35        11
          1       0.82      0.93      0.87        40

avg / total       0.75      0.78      0.76        51

The estimated Cohen kappa is 0.236734693878
The estimated AUC is 0.599
============================================================



Data was oversampled using the ADASYN method
             precision    recall  f1-score   support

          0       0.36      0.45      0.40        11
          1       0.84      0.78      0.81        40

avg / total       0.73      0.71      0.72        51

The estimated Cohen kappa is 0.208893485005
The estimated AUC is 0.615
============================================================




In [12]:
title ='Effect of oversampling on Logistic Regression for Alive'
predictive_statistics.plot_compare_roc(fpr1, tpr1,fpr2, tpr2, auc1, auc2, title = title)
plt.savefig('Alive_Logistic.png')
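
plot_compare_roc is another custom plotting helper; a plain-matplotlib stand-in with the same call signature (an assumption, since the module code is not shown):

# illustrative stand-in for ispy1's plot_compare_roc
def plot_compare_roc(fpr1, tpr1, fpr2, tpr2, auc1, auc2, title=''):
    plt.plot(fpr1, tpr1, label='standard (AUC = %.3f)' % auc1)
    plt.plot(fpr2, tpr2, label='oversampled (AUC = %.3f)' % auc2)
    plt.plot([0, 1], [0, 1], 'k--', label='chance')
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title(title)
    plt.legend(loc='lower right')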


  • ### Random Forests

In [13]:
# standard
auc1, kappa1, fpr1, tpr1, _ = predictive_statistics.RandomForest_Classifier(X, y)

# with ADASYN oversampling
auc2, kappa2, fpr2, tpr2, _ = predictive_statistics.RandomForest_Classifier(X, y, oversample=True, K_neighbors = 6)

title ='Effect of oversampling on RFC for Alive'
predictive_statistics.plot_compare_roc(fpr1, tpr1,fpr2, tpr2, auc1, auc2, title = title)
plt.savefig('Alive_RFC.png')


             precision    recall  f1-score   support

          0       0.38      0.27      0.32        11
          1       0.81      0.88      0.84        40

avg / total       0.72      0.75      0.73        51

The estimated Cohen kappa is 0.16393442623
The estimated AUC is 0.574
============================================================



Data was oversampled using the ADASYN method
             precision    recall  f1-score   support

          0       0.67      0.36      0.47        11
          1       0.84      0.95      0.89        40

avg / total       0.81      0.82      0.80        51

The estimated Cohen kappa is 0.375510204082
The estimated AUC is 0.657
============================================================



3.0 Survival (Alive) including PCR as a predictor

- Logistic Regression


In [14]:
# add PCR as an additional predictor column
pcr = predictive_statistics.labels_to_numbers(df, 'PCR').reshape(-1, 1)
newX = np.concatenate((X, pcr), axis=1)

In [15]:
# standard
auc1, kappa1, fpr1, tpr1 = predictive_statistics.Logistic_Regression(newX, y)

# with ADASYN oversampling
auc2, kappa2, fpr2, tpr2 = predictive_statistics.Logistic_Regression(newX, y, oversample=True, K_neighbors = 10)


             precision    recall  f1-score   support

          0       0.43      0.27      0.33        11
          1       0.82      0.90      0.86        40

avg / total       0.73      0.76      0.74        51

The estimated Cohen kappa is 0.198952879581
The estimated AUC is 0.586
============================================================



Data was oversampled using the ADASYN method
             precision    recall  f1-score   support

          0       0.38      0.45      0.42        11
          1       0.84      0.80      0.82        40

avg / total       0.74      0.73      0.73        51

The estimated Cohen kappa is 0.238805970149
The estimated AUC is 0.627
============================================================




In [16]:
title ='Effect of oversampling on Logistic Regression for Alive (PCR included)'
predictive_statistics.plot_compare_roc(fpr1, tpr1,fpr2, tpr2, auc1, auc2, title = title)


- Random Forest Classifier


In [17]:
# standard
auc1, kappa1, fpr1, tpr1 , _ = predictive_statistics.RandomForest_Classifier(newX, y)


             precision    recall  f1-score   support

          0       0.33      0.09      0.14        11
          1       0.79      0.95      0.86        40

avg / total       0.69      0.76      0.71        51

The estimated Cohen kappa is 0.0555555555556
The estimated AUC is 0.52
============================================================




In [18]:
# with ADASYN oversampling
auc2, kappa2, fpr2, tpr2, _ = predictive_statistics.RandomForest_Classifier(newX, y, oversample=True, K_neighbors = 10)


Data was oversampled using the ADASYN method
             precision    recall  f1-score   support

          0       0.38      0.45      0.42        11
          1       0.84      0.80      0.82        40

avg / total       0.74      0.73      0.73        51

The estimated Cohen kappa is 0.238805970149
The estimated AUC is 0.627
============================================================



4.0 Survival (Alive) including RCB as a predictor


In [19]:
# one-hot encode RCB (residual cancer burden) class: one column per level
rcb = pd.get_dummies(df['RCB']).values
newX = np.concatenate((X, rcb), axis=1)

# standard
auc1, kappa1, fpr1, tpr1 = predictive_statistics.Logistic_Regression(newX, y)
auc2, kappa2, fpr2, tpr2, _= predictive_statistics.RandomForest_Classifier(newX, y)


             precision    recall  f1-score   support

          0       0.50      0.36      0.42        11
          1       0.84      0.90      0.87        40

avg / total       0.76      0.78      0.77        51

The estimated Cohen kappa is 0.292559899117
The estimated AUC is 0.632
============================================================



             precision    recall  f1-score   support

          0       0.50      0.45      0.48        11
          1       0.85      0.88      0.86        40

avg / total       0.78      0.78      0.78        51

The estimated Cohen kappa is 0.340775558167
The estimated AUC is 0.665
============================================================




In [20]:
# compare
title ='Predicting Survival including RCB as a predictor'
predictive_statistics.plot_compare_roc(fpr1, tpr1,fpr2, tpr2, auc1, auc2, title = title)
plt.legend(['Logistic Regression','Random Forest Classifier']);
plt.savefig('LG_vs_RFC_Alive_RCB.png')



In [21]:
# with ADASYN oversampling
auc3, kappa3, fpr3, tpr3 = predictive_statistics.Logistic_Regression(newX, y, oversample=True, K_neighbors = 4)
auc4, kappa4, fpr4, tpr4, _= predictive_statistics.RandomForest_Classifier(newX, y, oversample=True, K_neighbors = 4)


Data was oversampled using the ADASYN method
             precision    recall  f1-score   support

          0       0.33      0.45      0.38        11
          1       0.83      0.75      0.79        40

avg / total       0.73      0.69      0.70        51

The estimated Cohen kappa is 0.180722891566
The estimated AUC is 0.602
============================================================



Data was oversampled using the ADASYN method
             precision    recall  f1-score   support

          0       0.30      0.27      0.29        11
          1       0.80      0.82      0.81        40

avg / total       0.70      0.71      0.70        51

The estimated Cohen kappa is 0.101057579318
The estimated AUC is 0.549
============================================================




In [38]:
title ='Predicting Survival including RCB as a predictor (ADASYN)'
predictive_statistics.plot_compare_roc(fpr3, tpr3,fpr4, tpr4, auc3, auc4, title = title)
plt.legend(['Logistic Regression','Random Forest Classifier']);

#plt.savefig('LG_vs_RFC_Alive_RCB.png')


Prediction of continuous outcomes

Prepare functions

The helper functions defined in the next cell mirror the versions in ispy1.predictive_statistics that the cells below actually call.


In [ ]:
# metrics
mae = metrics.median_absolute_error

def mae_report(Ytest, Yhat, outcome_):
    error = mae(Ytest, Yhat)
    error = np.round( error, decimals=3)
    # report
    print('\n' )
    print('==' *40)
    print('The median absolute error for testing data set of ' + outcome_ + ' is: ' + str(error))
    print('==' *40)

def train_test_report(predictor, Xtrain, Ytrain, Xtest, Ytest, outcome):
    # train
    predictor.fit(Xtrain, Ytrain)
    # test
    Yhat = predictor.predict(Xtest)
    # report
    mae_report(Ytest, Yhat, outcome)
    
# least squares via statsmodels
import statsmodels.api as sm

def lsq(Xtrain, Ytrain, Xtest, Ytest, outcome=''):
    # train (note: no constant column is added, so the fit has no intercept term)
    OLS = sm.OLS(Ytrain, Xtrain).fit()
    print(OLS.summary())
    # test
    Yhat = OLS.predict(Xtest)
    # report
    mae_report(Ytest, Yhat, outcome)

# SVR
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# GridSearchCV utility
def gridsearch(regressor, grid):
    # median_absolute_error is a loss, so pass greater_is_better=False;
    # otherwise GridSearchCV would select the model with the LARGEST error
    scorer = metrics.make_scorer(metrics.median_absolute_error,
                                 greater_is_better=False)
    return GridSearchCV(regressor,
                        param_grid=grid,
                        cv=3, verbose=0, n_jobs=-1,
                        scoring=scorer)
    


def svr(Xtrain, Ytrain, Xtest, Ytest, outcome=''):
    # define regressor
    regressor = SVR()
    # define parameter grid
    grid = dict(kernel=['rbf', 'linear', 'sigmoid'],
                C=np.arange(1, 11, 1),
                epsilon=np.arange(1, 11, 1),
                gamma=np.linspace(1/10, 10, 3))
    # perform grid search
    grid_search = gridsearch(regressor, grid)
    # train, test, and report
    train_test_report(grid_search, Xtrain, Ytrain, Xtest, Ytest, outcome)

    
# ElasticNet
from sklearn.linear_model import ElasticNet as ENet

def ElasticNet(Xtrain, Ytrain, Xtest, Ytest, outcome=''):
    # define regressor
    regressor = ENet(max_iter=5000)
    # define parameter grid
    grid = dict(alpha=np.arange(1, 20, .5), l1_ratio=np.arange(.1, 1, .05))
    # perform grid search
    grid_search = gridsearch(regressor, grid)
    # train, test, and report
    train_test_report(grid_search, Xtrain, Ytrain, Xtest, Ytest, outcome)
    

# Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor as RFR

RANDOM_STATE = 42  # fixed seed so repeated runs are reproducible

def RandomForestRegressor(Xtrain, Ytrain, Xtest, Ytest, outcome=''):
    # define regressor
    regressor = RFR(criterion='mse', random_state=RANDOM_STATE)

    num_features = Xtrain.shape[1]

    # define parameter grid
    grid = dict(n_estimators=np.arange(5, 100, 5),
                max_features=np.arange(1, num_features, 1),
                max_depth=[None, 1, 2, 3, 4, 5])

    # perform grid search
    grid_search = gridsearch(regressor, grid)

    # train, test, and report
    train_test_report(grid_search, Xtrain, Ytrain, Xtest, Ytest, outcome)

Organize predictors in the right format and split data


In [57]:
from sklearn.preprocessing import StandardScaler


# select continuous predictors and z-score them (SVR and ElasticNet are scale-sensitive)
cont_predictors = ['age','MRI_LD_Baseline', 'MRI_LD_1_3dAC', 'MRI_LD_Int_Reg', 'MRI_LD_PreSurg']
contX = df[cont_predictors].values
contX = StandardScaler().fit_transform(contX)

# select categorical predictors (now including PCR)
cat_pred = ['PCR', 'White', 'ER+', 'PR+', 'HR+']
catX = pd.get_dummies(df[cat_pred], drop_first=True).values

# concatenate predictors
X = np.concatenate( (catX, contX), axis=1)

1.0 Recurrence-Free Survival (RFS, continuous, in months)


In [58]:
# outcome: convert RFS from days to months
y = df.RFS.values / 30

# split
X_train, X_test, y_train, y_test = predictive_statistics.split_data(X, y, False)
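
split_data is presumably a thin wrapper over sklearn's train_test_split, with the third argument toggling stratification (an assumption; the module code is not shown):

# hypothetical sketch of ispy1.predictive_statistics.split_data
from sklearn.model_selection import train_test_split

def split_data(X, y, stratify=False):
    return train_test_split(X, y, test_size=0.3, random_state=42,
                            stratify=y if stratify else None)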

In [59]:
# LSQ
predictive_statistics.lsq(X_train, y_train, X_test, y_test, outcome =' RFS (months)')


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.773
Model:                            OLS   Adj. R-squared:                  0.751
Method:                 Least Squares   F-statistic:                     36.38
Date:                Tue, 20 Jun 2017   Prob (F-statistic):           5.60e-30
Time:                        09:37:26   Log-Likelihood:                -535.00
No. Observations:                 117   AIC:                             1090.
Df Residuals:                     107   BIC:                             1118.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1            27.0400      5.212      5.188      0.000      16.708      37.372
x2            17.8831      4.243      4.215      0.000       9.473      26.294
x3            15.3054     15.097      1.014      0.313     -14.623      45.234
x4             7.5932      7.294      1.041      0.300      -6.865      22.052
x5             9.6444     16.267      0.593      0.555     -22.603      41.891
x6             2.4913      2.422      1.029      0.306      -2.309       7.292
x7             9.0136      5.044      1.787      0.077      -0.985      19.012
x8            -4.0383      5.040     -0.801      0.425     -14.029       5.953
x9            -6.3210      3.774     -1.675      0.097     -13.802       1.160
x10            2.0339      3.237      0.628      0.531      -4.384       8.452
==============================================================================
Omnibus:                       11.597   Durbin-Watson:                   1.873
Prob(Omnibus):                  0.003   Jarque-Bera (JB):               12.486
Skew:                           0.655   Prob(JB):                      0.00194
Kurtosis:                       3.918   Cond. No.                         15.7
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
================================================================================
The median absolute error for testing data set of  RFS (months) is: 14.364
================================================================================
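
Because the predictors were passed as a bare numpy array, statsmodels labels the coefficients x1 through x10: x1 to x5 are the dummy-coded PCR/White/ER+/PR+/HR+ flags and x6 to x10 the z-scored age and MRI measurements. A named summary is easy to get by wrapping the matrix in a DataFrame (column names below assume the construction of X above):

# optional: label the OLS coefficients instead of x1..x10
import statsmodels.api as sm

names = ['PCR_Yes', 'White_Yes', 'ER+_Yes', 'PR+_Yes', 'HR+_Yes',
         'age', 'MRI_LD_Baseline', 'MRI_LD_1_3dAC', 'MRI_LD_Int_Reg', 'MRI_LD_PreSurg']
X_train_named = pd.DataFrame(X_train, columns=names)
print(sm.OLS(y_train, X_train_named).fit().summary())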

ElasticNet


In [60]:
importlib.reload(predictive_statistics);

In [61]:
EN = predictive_statistics.ElasticNet(X_train, y_train, X_test, y_test, outcome =' RFS (months)');


================================================================================
The median absolute error for testing data set of  RFS (months) is: 12.536
================================================================================

SVM Regressor


In [62]:
SVR = predictive_statistics.svr(X_train, y_train, X_test, y_test, outcome =' RFS(months)');


================================================================================
The median absolute error for testing data set of  RFS(months) is: 82.589
================================================================================
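
The SVR error is an order of magnitude worse than the other regressors. One plausible contributor (an assumption, since the module internals are not shown) is an unsigned median_absolute_error scorer like the one flagged in the gridsearch helper above, which would make GridSearchCV maximize the error; a coarse parameter grid could also hurt. A safer baseline sketch:

# safer SVR baseline (sketch): scale inputs inside a pipeline, use a signed error scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

pipe = make_pipeline(StandardScaler(), SVR(kernel='rbf'))
grid = {'svr__C': [1, 10, 100], 'svr__epsilon': [0.1, 1, 10]}
search = GridSearchCV(pipe, grid, cv=3, scoring='neg_median_absolute_error')
search.fit(X_train, y_train)
print(search.best_params_)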

In [63]:
RFR = predictive_statistics.RandomForestRegressor(X_train, y_train, X_test, y_test, outcome =' RFS(months)')


================================================================================
The median absolute error for testing data set of  RFS(months) is: 11.76
================================================================================

2.0 Survival Length (Survival_length, continuous, in months)


In [74]:
y = df.Survival_length.values / 30  # convert days to months

#split
X_train, X_test, y_train, y_test = predictive_statistics.split_data(X, y, False)

In [75]:
# LSQ
predictive_statistics.lsq(X_train, y_train, X_test, y_test, outcome ='Survival_length (months)')


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.795
Model:                            OLS   Adj. R-squared:                  0.775
Method:                 Least Squares   F-statistic:                     41.39
Date:                Tue, 20 Jun 2017   Prob (F-statistic):           2.78e-32
Time:                        10:03:09   Log-Likelihood:                -532.02
No. Observations:                 117   AIC:                             1084.
Df Residuals:                     107   BIC:                             1112.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1            25.8238      5.081      5.083      0.000      15.752      35.896
x2            21.5604      4.136      5.213      0.000      13.361      29.759
x3            11.1381     14.718      0.757      0.451     -18.039      40.315
x4             6.6792      7.110      0.939      0.350      -7.416      20.775
x5            13.6270     15.858      0.859      0.392     -17.810      45.064
x6             1.0752      2.361      0.455      0.650      -3.605       5.755
x7             8.8649      4.917      1.803      0.074      -0.882      18.612
x8            -3.8696      4.913     -0.788      0.433     -13.610       5.871
x9            -4.7785      3.679     -1.299      0.197     -12.072       2.515
x10            1.9332      3.156      0.613      0.541      -4.324       8.190
==============================================================================
Omnibus:                       13.355   Durbin-Watson:                   1.709
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               14.827
Skew:                           0.724   Prob(JB):                     0.000603
Kurtosis:                       3.973   Cond. No.                         15.7
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
================================================================================
The median absolute error for testing data set of Survival_length (months) is: 15.819
================================================================================

In [76]:
predictive_statistics.svr(X_train, y_train, X_test, y_test, outcome ='Survival length (months)');


================================================================================
The median absolute error for testing data set of Survival length (months) is: 102.682
================================================================================

In [71]:
predictive_statistics.ElasticNet(X_train, y_train, X_test, y_test, outcome = 'Survival length (months)');


================================================================================
The median absolute error for testing data set of Survival length (months) is: 9.136
================================================================================

In [72]:
predictive_statistics.RandomForestRegressor(X_train, y_train, X_test, y_test, outcome = 'Survival length (months)');


================================================================================
The median absolute error for testing data set of Survival length (months) is: 9.254
================================================================================
