Data Input

The datasets used in this project are available from 'The Cancer Genome Atlas' (TCGA) consortium. Clinical and RNA-seq gene count data sets can be downloaded via a provided data portal, or using an R package called TCGA2STAT. The R script used to download the data sets for this project and write files locally in feather format are available in the project repository here. Datasets are also stored in this project's repository here. The R script is not necessary for dataset download, but can be run locally to obtain the most up to date clinical information from TCGA.



In [1]:

    
%run -i Dataset_cleanup/arrange_DF.py









    



Clinical data set imported!
Features: 21 
Patients: 499

 The following features do not provide any information: 
 ['Composite.Element.REF' 'ethnicity' 'gender' 'pathologicstage'
 'pathologyMstage' 'tumortissuesite'] 

Variables that are not known at initial diagnosis: 
 ['daystodeath' 'daystolastfollowup' 'daystopsa' 'histologicaltype'
 'numberoflymphnodes' 'pathologyTstage' 'radiationtherapy' 'residualtumor'
 'vitalstatus'] 

Variables that are known at the time of diagnosis:
 ['clinical_index' 'dateofinitialpathologicdiagnosis' 'gleasonscore'
 'pathologyNstage' 'psavalue' 'race' 'yearstobirth']


Gene Counts data set imported!
Features: 20501 
Patients: 497


Transforming gene counts to transcript per million (TPM)

Transformation Successful!

497 Gene count estimate profiles have been transformed from gene counts to transcripts per million reads (TPM)

Data Exploration

Clinical Information and metastasis label

There is missing clinical data in many of the features, including what will become the outcome label ('pathologyNstage' - metastasis state). The series 'y_all' is the full list of pathologyNstage, where 'n1' represents metastasis, and 'n0' represents no metastasis observed to date. Some observations have no metastasis state recorded and are represented by NaN in y_all. These are removed for the trimmed 'y' series.



In [2]:

    
import matplotlib.pyplot as plt
%matplotlib inline



In [3]:

    
D = (('n0', y_all[y_all =='n0'].shape[0]), 
     ('n1', y_all[y_all =='n1'].shape[0]), 
     ('missing' , y_all[y_all.isnull()].shape[0]))
exp_fig = plt.figure(figsize=(5 , 5))
A = exp_fig.add_subplot(1,1,1)
ind = A.bar(range(0,3), height= [D[0][1], D[1][1], D[2][1]], align='center', color = 'grey')
A.set_xticks(range(0,3))
A.set_ylabel('Frequency', fontsize=16)
A.set_xlabel('Metastasis state', fontsize=16)
A.set_ylim(0,400)
A.set_xticklabels([D[0][0], D[1][0], D[2][0]])  #Must be a better way to do this as tuples (D, in this case) are not ordered?
ep = plt.show
exp_fig.savefig('Figures/LabelCount.png')

Gleason Score versus Metastasis analysis

The Gleason score is the gold-standard diagnostic test for cancer severity, but is not highly correlative with metastatic disease.



In [4]:

    
clinical['gleasonscore'] = pd.to_numeric(clinical['gleasonscore'], errors= 'coerce')
clinical['gleasonscore'].isnull().any()  #checks if there are any values in gleason score that are null; False means all are numeric









    Out[4]:





False



In [5]:

    
GS_hist = plt.figure(figsize=(5,5))
A = GS_hist.add_subplot(1,1,1)
sub = (('n1', ))

subs = [clinical.loc[y_all[y_all == 'n0'].index]['gleasonscore'],
        clinical.loc[y_all[y_all == 'n1'].index]['gleasonscore'],
        clinical.loc[y_all[y_all.isnull()].index]['gleasonscore']]

bins = [5.99, 6.99, 7.99, 8.99, 9.99, 10.99]
A.hist(subs, bins, label =['n0', 'n1', 'unknown'], color = ['blue','red','grey'], stacked = True)
A.set_ylim(0,275)
A.set_ylabel('Count', fontsize=16)
A.set_xlabel('Gleason Score', fontsize =16)
A.legend()
plt.show
GS_hist.savefig('Figures/GleasonHist.png')

It appears that no metastases are recorded in those specimens graded at a Gleason score of 6. This represents an opportunity to replace the missing data label with the most likely pathology state, n0, to more efficiently use the small data set.



In [6]:

    
"""Define the indices where gleasonscore == 6 and pathologyNstage is null"""
set(y_all[y_all.isnull()].index).intersection(list(clinical[clinical['gleasonscore'] == 6].index))









    Out[6]:





{'TCGA-2A-A8VO',
 'TCGA-2A-AAYO',
 'TCGA-CH-5738',
 'TCGA-G9-6342',
 'TCGA-G9-6347',
 'TCGA-G9-6371',
 'TCGA-G9-7509',
 'TCGA-H9-A6BX',
 'TCGA-HC-7075',
 'TCGA-HC-7077',
 'TCGA-HC-8213',
 'TCGA-HC-8259',
 'TCGA-J4-A67Q',
 'TCGA-J4-A83K',
 'TCGA-J9-A8CN',
 'TCGA-M7-A720',
 'TCGA-V1-A8WN',
 'TCGA-V1-A8WS',
 'TCGA-V1-A9OF',
 'TCGA-V1-A9OQ',
 'TCGA-V1-A9OT',
 'TCGA-XJ-A9DQ'}



In [7]:

    
"""Set all samples with Gleason grade of 6 and label is missing to 'no'. """
y_all.loc[set(y_all[y_all.isnull()].index).intersection(list(clinical[clinical['gleasonscore'] == 6].index))] = 'n0'









    



/Users/ccthomps/anaconda/lib/python3.5/site-packages/pandas/core/indexing.py:132: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)



In [8]:

    
GS_hist = plt.figure(figsize=(5,5))
A = GS_hist.add_subplot(1,1,1)
sub = (('n1', ))

subs = [clinical.loc[y_all[y_all == 'n0'].index]['gleasonscore'],
        clinical.loc[y_all[y_all == 'n1'].index]['gleasonscore'],
        clinical.loc[y_all[y_all.isnull()].index]['gleasonscore']]

bins = [5.99, 6.99, 7.99, 8.99, 9.99, 10.99]
A.hist(subs, bins, label =['n0', 'n1', 'unknown'], color = ['blue','red','grey'], stacked = True)
A.set_ylim(0,275)
A.set_ylabel('Count', fontsize=16)
A.set_xlabel('Gleason Score', fontsize =16)
A.legend()
plt.show
GS_hist.savefig('Figures/GleasonHist2.png')

Gene Activation (Gene counts) Dataset

The starting dataset, 'X_all', includes the transformed transcript per million (TPM) estimates for all RNA-seq profiles. However some of the observations in this set do not have corresponding y_labels, as the clinical data set contains missing information.

Therefore X_all was trimmed to include only those observations where a finite y label exists, to yield 'X'.



In [9]:

    
print("Total observations in original dataset:",clinical.shape[0])

not_labeled = y_all[y_all.isnull()] 
y = y_all[y_all.notnull()]

print("\nObservations with metastasis label:",y.shape[0])
print("Unlabeled observations (removed:)",not_labeled.shape[0])









    



Total observations in original dataset: 499

Observations with metastasis label: 448
Unlabeled observations (removed:) 51



In [10]:

    
"""Limit X to only observations where a target label is present."""
X = X_all.loc[set(y.index).intersection(X_all.index)]  #Only observations that also have a known metastasis state are kept.
y = y.loc[set(X.index).intersection(y.index)]
print("X dimensions:",X.shape,"\ny dimensions:",y.shape)









    



X dimensions: (446, 20501) 
y dimensions: (446,)



In [11]:

    
X_no_y = X_all.loc[list(not_labeled.index)]
print("Dimensions of unlabeled dataset:",X_no_y.shape)









    



Dimensions of unlabeled dataset: (51, 20501)



In [12]:

    
"""Check to determine if any clinical information is missing"""
X.isnull().values.any() #Are any values in the entire gene count data set null?









    Out[12]:





False

Data Visualization



In [13]:

    
gleason = clinical['gleasonscore']
#print(gleason.shape)
age = pd.to_numeric(clinical['yearstobirth'], errors = 'coerce')
age.fillna(value = np.mean(age), inplace=True)
psa = pd.to_numeric(clinical['psavalue'], errors= 'coerce')
psa.fillna(value = np.mean(psa), inplace =True)
#gleason = gleason.loc[y.index]

clinicalDF_all = pd.DataFrame({'gleason': gleason,
                            'age':age ,
                            'psa' : psa,
                            'y' : y_all}, index=X_all.index)
clinicalDF = clinicalDF_all.loc[y.index, :]



In [14]:

    
clinicalDF.drop(['y'], axis=1, inplace = True)



In [15]:

    
clinicalDF = clinicalDF.reindex(y.index)



In [16]:

    
sm = pd.scatter_matrix(clinicalDF, 
                       alpha=0.25, 
                       figsize= (10,10), 
                       diagonal = 'kde', 
                       c = y.replace({'n1': 'red','n0':'blue'}), 
                       s = 125)

plt.savefig("Figures/ClinScatterMatrix.png")



In [17]:

    
"""Pilot experiment to see if any genes are activated/expressed differently in the 'n0' and 'n1 classes"""

from sklearn.feature_selection import SelectKBest, f_classif
K_selector = SelectKBest(f_classif, k = 'all') # using k=all here and will filter based on F-stat later.  
K = K_selector.fit(X, y)
K_df = pd.DataFrame({'Gene':X.columns, 'F_score':K.scores_})
K_df.dropna(inplace=True)









    



/Users/ccthomps/anaconda/lib/python3.5/site-packages/sklearn/feature_selection/univariate_selection.py:113: UserWarning: Features [ 1595  1815  2267  4324  4341  4342  4343  4344  4345  4346  4347  4779
  4780  4781  4782  4787  4790  4791  4794  4795  4801  4803  5259  6032
  6776  6780  6783  7445  7551  7632  7633  7634  7635  7636  7732  9275
  9277  9287  9291  9419  9421  9423  9495 10092 11101 11929 12028 12488
 12527 12593 12606 13749 13963 14015 14129 14130 14131 14132 14726 14727
 14729 15109 15111 15112 16536 16538 16539 16541 16542 16543 16545 16546
 16547 16548 16549 16550 16576 16605 16608 16648 16668 16669 16670 16671
 16672 16673 16674 16675 16676 16677 16678 16679 16680 16681 16682 16683
 16684 16685 16686 16687 16688 16689 16690 16691 16692 16693 16694 16695
 16696 16697 16698 16699 16700 16701 16702 16703 16704 16705 16706 16707
 16708 16709 16710 16711 16712 16713 16714 16715 16716 16717 16718 16720
 16721 16722 16723 16724 16725 16726 16728 16729 16730 16731 16732 16733
 16734 16735 16736 16737 16738 16739 16740 16741 16742 16743 16745 16746
 16747 16748 16749 16750 16751 16752 16753 16754 16755 16756 16757 16759
 16760 16761 16762 16763 16764 16766 16767 16769 16770 16771 16772 16773
 16774 16775 16776 16777 16778 16779 16780 16781 16782 16783 16786 16788
 16789 16790 16791 16792 16793 16794 16796 16797 16798 16801 16802 16803
 16804 16805 16806 16807 16808 16809 16810 16811 16812 16813 16814 16815
 16816 16817 16818 16819 16820 16821 16822 16823 16824 16825 16826 16827
 16828 16829 16830 16831 16832 16833 16834 16835 16836 16837 16838 16839
 16840 16841 16842 16843 16844 16845 16846 16847 16848 16849 16850 16851
 16852 16853 16854 16855 16856 16857 16858 16859 16860 16861 16862 16863
 16864 16865 16866 16867 16868 16869 16870 16871 16873 16874 16875 16876
 16877 16878 16879 16880 16881 16882 16884 16885 16886 16887 16888 16889
 16891 16892 16893 16895 16896 17041 17045 17127 17466 17871 18589 18788
 18800 18873 18879 18880 18882 18886 18887 18964 19420 19421 19422 19641] are constant.
  UserWarning)



In [18]:

    
"""Figure shows the same data twice, just with a linear and a log y-axis for clarity"""
F_fig = plt.figure(figsize=(15,5))
A = F_fig.add_subplot(1,2,1)
A.hist(K_df['F_score'], bins=range(0,45,1), facecolor='green')
#A.set_yscale('log')
A.set_ylabel('Count', fontsize=16)
A.set_xlabel("F-statistic", fontsize=16)
F_fig.suptitle("Distribution of F-statistics for metatastasis state group comparison of gene expression", fontsize = 18)
B = F_fig.add_subplot(1,2,2)
B.hist(K_df['F_score'], bins=range(0,45,1), facecolor='green')
B.set_yscale('log')
B.set_ylabel('Count', fontsize=16)
B.set_xlabel("F-statistic", fontsize=16)
#B.set_title("Distribution of F-statistics for metatastasis state group comparison of gene expression", fontsize = 18)
plt.show()
F_fig.savefig("Figures/FDist.png")

Analysis

Prepare for Analysis

Generate Scoring Functions



In [19]:

    
from sklearn.metrics import matthews_corrcoef, fbeta_score, classification_report, log_loss
from sklearn.model_selection import cross_val_score



In [20]:

    
"""Scorers needed throughout code:"""
from sklearn.metrics import make_scorer
LL_scorer = make_scorer(log_loss, greater_is_better=False, needs_proba=True, needs_threshold=False)
MCC_scorer = make_scorer(matthews_corrcoef, greater_is_better=True, needs_proba=False, needs_threshold=False)
fbeta_scorer = make_scorer(fbeta_score, greater_is_better=True, needs_proba=False, needs_threshold=False, pos_label='n1', beta = 2)

Import Logistic Regression Algorithms



In [21]:

    
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV

Set Seed and Folds for Cross Validation



In [22]:

    
seed = 123
folds = 4

Handles for Graph legends



In [23]:

    
import matplotlib.patches as mpatches
blue = mpatches.Patch(color ='blue', label ='n0')
red = mpatches.Patch(color='red', label = 'n1')

Train Test Split



In [24]:

    
from sklearn.model_selection import train_test_split
train_k = round(len(y)*0.70)
test_k = len(y) - train_k



In [25]:

    
"""Split samples into train and test sets, stratified by Gleason score"""
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size= test_k, 
                                                    train_size = train_k,
                                                    random_state = seed,
                                                    stratify = clinicalDF['gleason'])



In [26]:

    
"""Push the train and test indices to split the clinical dataframe in the same way"""
clinicalDF_train = clinicalDF.loc[X_train.index,:]
clinicalDF_test = clinicalDF.loc[X_test.index,:]

Benchmark

Clinical Information Benchmark



In [27]:

    
clf_LR_clin = LogisticRegression(penalty='l2',
                              dual=False,
                              tol=0.0001,
                              C=1,
                              fit_intercept=True,
                              intercept_scaling=1,
                              class_weight='balanced',
                              random_state=123,
                              solver='liblinear',
                              max_iter=100,
                              multi_class='ovr',
                              verbose=0,
                              warm_start=False,
                              n_jobs=1)

clf_LR_clin.fit(clinicalDF_train, y_train)  #use training set for model learning









    Out[27]:





LogisticRegression(C=1, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=123,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)



In [28]:

    
clf_LR_clin = LogisticRegression(penalty='l2',
                              dual=False,
                              tol=0.0001,
                              C=1,
                              fit_intercept=True,
                              intercept_scaling=1,
                              class_weight='balanced',
                              random_state=123,
                              solver='liblinear',
                              max_iter=100,
                              multi_class='ovr',
                              verbose=0,
                              warm_start=False,
                              n_jobs=1)

clf_LR_clin.fit(clinicalDF_train, y_train)  #use training set for model learning

bench_fig = plt.figure(figsize=(15,5))
A= bench_fig.add_subplot(1,2,1)
B = bench_fig.add_subplot(1,2,2)
A.scatter(clinicalDF_train.loc[:,'gleason'], 
          clf_LR_clin.predict_proba(clinicalDF_train)[:,1], 
          color = y_train.replace({'n1':'red', 'n0': 'blue', 'NaN':'grey'}),
          alpha = 0.3,
          s = 25)
A.set_ylabel('Benchmark model probability of metastasis')
A.set_ylim(0,1)
B.set_ylim(0,1)
A.set_xlabel('Gleason')
plt.suptitle('Prognosis of metastasis risk based on features available at presentation', fontsize=20)
A.set_title('Training Set')
B.set_title('Validation Set')
B.scatter(clinicalDF_test.loc[:,'gleason'],
          clf_LR_clin.predict_proba(clinicalDF_test)[:,1],
          color = y_test.replace({'n1':'red', 'n0': 'blue', 'NaN':'grey'}),
          alpha = 0.5,
          s = 25)
B.set_xlabel('Gleason')
B.set_ylabel('Benchmark model probability of metastasis')
bench_fig.legend(labels = ['n0', 'n1'], handles = [blue, red], loc = 5)
bench_fig.suptitle("Benchmark Model", fontsize=20)
plt.show
bench_fig.savefig('Figures/benchmark.png')



In [29]:

    
"""The coefficents for each feature in the benchmark model"""
print(pd.DataFrame({'features':clinicalDF.columns, 'coefficient': clf_LR_clin.coef_[0]}))









    



   coefficient features
0    -0.070227      age
1     0.851081  gleason
2     0.085552      psa



In [30]:

    
"Benchmark Performance on Train dataset"
print(clf_LR_clin)
print('\nNull F beta: ', fbeta_score(y_train, clf_LR_clin.predict(clinicalDF_train), pos_label='n1',beta=2))
print('\nMCC: ',matthews_corrcoef(y_train, clf_LR_clin.predict(clinicalDF_train)),"\n")
print(classification_report(y_train, clf_LR_clin.predict(clinicalDF_train), labels = ['n0','n1']))
print('\nLogLoss: ', log_loss(y_train.replace({'n0':0, 'n1':1}), 
         clf_LR_clin.predict_proba(clinicalDF_train)[:,1]))

training_stat = pd.DataFrame({'F2': fbeta_score(y_train, clf_LR_clin.predict(clinicalDF_train), pos_label='n1',beta=2),
                            'MCC': matthews_corrcoef(y_train, clf_LR_clin.predict(clinicalDF_train)),
                            'LogLoss':log_loss(y_train.replace({'n0':0, 'n1':1}), 
                                      clf_LR_clin.predict_proba(clinicalDF_train)[:,1])}, index = ['clinical_only'])









    



LogisticRegression(C=1, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=123,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

Null F beta:  0.574534161491

MCC:  0.316565563781 

             precision    recall  f1-score   support

         n0       0.93      0.69      0.79       261
         n1       0.31      0.73      0.44        51

avg / total       0.83      0.70      0.73       312


LogLoss:  0.585950270317



In [31]:

    
"Benchmark Performance on Test dataset"
print(clf_LR_clin)
print('\nNull F beta: ', fbeta_score(y_test, clf_LR_clin.predict(clinicalDF_test), pos_label='n1',beta=2))
print('\nMCC: ',matthews_corrcoef(y_test, clf_LR_clin.predict(clinicalDF_test)),"\n")
print(classification_report(y_test, clf_LR_clin.predict(clinicalDF_test), labels = ['n0','n1']))
print('\nLogLoss: ', log_loss(y_test.replace({'n0':0, 'n1':1}), 
         clf_LR_clin.predict_proba(clinicalDF_test)[:,1]))

metric_stat = pd.DataFrame({'F2': fbeta_score(y_test, clf_LR_clin.predict(clinicalDF_test), pos_label='n1',beta=2),
                            'MCC': matthews_corrcoef(y_test, clf_LR_clin.predict(clinicalDF_test)),
                            'LogLoss':log_loss(y_test.replace({'n0':0, 'n1':1}), 
                                      clf_LR_clin.predict_proba(clinicalDF_test)[:,1])}, index = ['clinical_only'])









    



LogisticRegression(C=1, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=123,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

Null F beta:  0.65625

MCC:  0.419967363804 

             precision    recall  f1-score   support

         n0       0.92      0.75      0.82       106
         n1       0.44      0.75      0.55        28

avg / total       0.82      0.75      0.77       134


LogLoss:  0.617745640628

Feature Reduction

Import Functions



In [32]:

    
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

Gini Importance Filter



In [33]:

    
clf_RF = RandomForestClassifier(n_estimators=1000, 
                                criterion='gini', 
                                max_depth=None, 
                                min_samples_split=2, 
                                min_samples_leaf=1, 
                                min_weight_fraction_leaf=0.0, 
                                max_features= 'auto', 
                                max_leaf_nodes=None, 
                                bootstrap=True, 
                                oob_score=False, 
                                n_jobs=1, 
                                random_state=seed, 
                                verbose=0, 
                                warm_start=False, 
                                class_weight='balanced')



In [34]:

    
clf_RF.fit(X, y)
Gini_DF = pd.DataFrame({'Gini' :clf_RF.feature_importances_}, index=X.columns).sort_values(by = ['Gini'], axis = 0, ascending = False)
print(Gini_DF.head())









    



            Gini
EDN3    0.003057
GNE     0.002224
NRP1    0.002137
GDI1    0.001829
MAP6D1  0.001797



In [35]:

    
"""Set k number of genes to retain"""
k = 75

Stability Experiment

The code used for the Random Forest to Gini Importance pipeline selection cycle is left as a markdown cell below. To validate its outcome, change code type to 'Code'.

genes = Gini_DF.index[0:k] # gene list from first instance
for x in [5, 54, 543, 5432, 54321] : clf_RF = RandomForestClassifier(n_estimators=1000, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features= 'auto', max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=x, verbose=0, warm_start=False, class_weight='balanced') clf_RF.fit(X, y) Gini_DF = pd.DataFrame({'Gini' :clf_RF.featureimportances}, index=X.columns).sort_values(by = ['Gini'], axis = 0, ascending = False) genes = set(genes).intersection(Gini_DF.index[0:k]) print('Number of Consistent (Stable) Genes:',len(genes))



In [36]:

    
"""The genes most commonly determined in the slow iterative process above are listed below.  By commenting out the cycle above
and just covering up the `genes` variable here, the code runs faster.  For confirmation, one may turn the code type of the 
cell above to 'code' instead of 'markdown' and run the process to check the subset below."""

genes = ['HIST1H3J',
         'EDN3',
         'FKBP15',
         'FOXS1',
         'GNE',
         'HMGB2',
         'MAP6D1',
         'SNAPIN',
         'SPAG1',
         'UBE2D2',
         'ZNRF3', 
         'DPM1',
         'VPS36']



In [37]:

    
"""Subset X to include only the k best features from the Gini Importance table"""
X_k = X.loc[:,genes]



In [38]:

    
fig = plt.figure(figsize=(12,8))
for i in range(0,len(X_k.columns),1) :
    ax = fig.add_subplot(np.ceil(len(X_k.columns)/5),5,i+1)
    df0 = list(X_k.ix[y.loc[y=='n0'].index,i])
    df1 = list(X_k.ix[y.loc[y=='n1'].index,i])
    data = [df0,df1]
    A = ax.boxplot(data, positions = [1,2], vert = False, widths = 0.5, patch_artist=True)
    plt.setp(A['boxes'][1], color='red')
    plt.setp(A['boxes'][0], color='blue')
    plt.setp(A['medians'], color = 'black')
    plt.setp(A['whiskers'], color = 'black')
    plt.setp(A['fliers'], color = 'black', marker = '.', markersize = 3)
    ax.yaxis.set_major_formatter(plt.NullFormatter())
    ax.set_xlabel(X_k.columns[i], fontsize=12)
    ax.set_xscale('log')
fig.tight_layout()
fig.legend(labels = ['n0', 'n1'], handles= [blue,red], ncol = 2, loc = 'lower center', bbox_to_anchor = (.945, 1.05))
fig.text(-.005, 0.5, 'Metastasis state', ha='center', va='center', fontsize=24,rotation='vertical')
fig.text(.5, -.025, 'Transcripts Per Million (TPM)', ha='center', va='center', fontsize=24)
fig.savefig('Figures/boxplots.png', dpi = 300, bbox_inches='tight')
plt.show()

Transform reduced feature dataframe to standard scale



In [39]:

    
scaler = StandardScaler() #mean = 0, variance = 1.
Xs = pd.DataFrame(scaler.fit_transform(X_k), columns = X_k.columns, index = X_k.index)

PCA transformation

Train on the scaled feature set



In [40]:

    
components = 3



In [41]:

    
pca = PCA(n_components = components, whiten=False)
pca.fit(Xs)









    Out[41]:





PCA(copy=True, iterated_power=4, n_components=3, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

Transform with the trained PCA function



In [42]:

    
Xpc = pd.DataFrame(pca.transform(Xs), columns = range(0,components,1), index = X.index)

Visualize the PCA transformation



In [43]:

    
%run -i 'support files/renders_PCA.py' #modified from MLE Nanodegree Project 3
pca_results(Xs, pca)
plt.savefig('Figures/pcaEV.png', dpi =500)



In [44]:

    
sm = pd.scatter_matrix(Xpc, 
                       alpha=0.5, 
                       figsize= (10,10), 
                       diagonal = 'kde', 
                       c = y.replace({'n1': 'red','n0':'blue'}), 
                       s = 50)
plt.savefig("Figures/pcaScatter.png", dpi = 300)

Notice how the first principle component (labeled, 0) seems to distinguish the non-metastatic (blue) and metastatic samples (red).

Appropriate the PCA transformed dataset via the current train, test split index



In [45]:

    
"""Cover up X_train and X_test with the new feature set of 3 PCs"""
X_train, X_test = Xpc.loc[X_train.index, :], Xpc.loc[X_test.index, :]

Train the Logistic Regression Algorithm

Training with the first three principle components with cross validation to determine the optimal C parameter.



In [46]:

    
clf_LR = LogisticRegression(penalty='l2',
                              dual=False,
                              tol=0.0001,
                              C=1,
                              fit_intercept=True,
                              intercept_scaling=1,
                              class_weight='balanced',
                              random_state=123,
                              solver='liblinear',
                              max_iter=100,
                              multi_class='ovr',
                              verbose=0,
                              warm_start=False,
                              n_jobs=1)
clf_LR.fit(X_train, y_train)









    Out[46]:





LogisticRegression(C=1, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=123,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

Visualize Model Performance on Training and Test sets



In [47]:

    
%run -i 'Support Files/capstone_graphs.py'
fig.savefig('Figures/PC3.png')

Measure Performance on the held-out test set



In [48]:

    
run_name = '3_PCs'
%run -i 'Support Files/capstone_performance.py'









    



Model:
 LogisticRegression(C=1, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=123,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

              precision    recall  f1-score   support

         n0       0.94      0.82      0.87       106
         n1       0.54      0.79      0.64        28

avg / total       0.85      0.81      0.82       134


Log Loss:  0.506919319618


                      F2   LogLoss       MCC
clinical_only  0.656250  0.617746  0.419967
3_PCs          0.718954  0.506919  0.535066

Optimize the Logistic Regression Model



In [49]:

    
clf_LR = LogisticRegressionCV(Cs=10, 
                              fit_intercept=True, 
                              cv=folds, 
                              dual=False, 
                              penalty='l2', 
                              scoring='log_loss', 
                              solver='liblinear', 
                              tol=0.0001, 
                              max_iter=100, 
                              class_weight='balanced', 
                              n_jobs=1, 
                              verbose=0, 
                              refit=True, 
                              intercept_scaling=1.0, 
                              multi_class='ovr', 
                              random_state=seed)
clf_LR.fit(X_train, y_train)









    Out[49]:





LogisticRegressionCV(Cs=10, class_weight='balanced', cv=4, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=123,
           refit=True, scoring='log_loss', solver='liblinear', tol=0.0001,
           verbose=0)



In [50]:

    
print("The optimal C term: ",clf_LR.C_)









    



The optimal C term:  [ 10000.]

Visualize Model Performance on Training and Test sets



In [51]:

    
%run -i 'Support Files/capstone_graphs.py'
fig.savefig('Figures/optPC3.png')

Measure Performance on the held-out test set



In [52]:

    
run_name = '3_PCs_optimized'
%run -i 'Support Files/capstone_performance.py'









    



Model:
 LogisticRegressionCV(Cs=10, class_weight='balanced', cv=4, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=123,
           refit=True, scoring='log_loss', solver='liblinear', tol=0.0001,
           verbose=0)

              precision    recall  f1-score   support

         n0       0.94      0.82      0.87       106
         n1       0.54      0.79      0.64        28

avg / total       0.85      0.81      0.82       134


Log Loss:  0.505976619516


                        F2   LogLoss       MCC
clinical_only    0.656250  0.617746  0.419967
3_PCs            0.718954  0.506919  0.535066
3_PCs_optimized  0.718954  0.505977  0.535066

Incorporate Gleason score



In [53]:

    
X_train.loc[:,'gleason'] = clinicalDF.loc[X_train.index,'gleason']
X_test.loc[:,'gleason'] = clinicalDF.loc[X_test.index, 'gleason']

Retrain Logistic Regression model with CV for C term



In [54]:

    
clf_LR = LogisticRegressionCV(Cs=10, 
                              fit_intercept=True, 
                              cv=folds, 
                              dual=False, 
                              penalty='l2', 
                              scoring='log_loss', 
                              solver='liblinear', 
                              tol=0.0001, 
                              max_iter=100, 
                              class_weight='balanced', 
                              n_jobs=1, 
                              verbose=0, 
                              refit=True, 
                              intercept_scaling=1.0, 
                              multi_class='ovr', 
                              random_state=seed)
clf_LR.fit(X_train, y_train)









    Out[54]:





LogisticRegressionCV(Cs=10, class_weight='balanced', cv=4, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=123,
           refit=True, scoring='log_loss', solver='liblinear', tol=0.0001,
           verbose=0)

Visualize Model Performance

Trained on First 3 PCs + gleason score



In [55]:

    
%run -i 'Support Files/capstone_graphs.py'
plt.savefig('Figures/PC3Gleason.png')



In [56]:

    
sm = pd.scatter_matrix(X_train, 
                       alpha=0.5, 
                       figsize= (12,12), 
                       diagonal = 'kde', 
                       c = y_train.replace({'n1': 'red','n0':'blue'}), 
                       s = 50)
plt.savefig('Figures/FeatureSM.png')

Performance Measure



In [57]:

    
run_name = 'Final_Model'
%run -i 'Support Files/capstone_performance.py'









    



Model:
 LogisticRegressionCV(Cs=10, class_weight='balanced', cv=4, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=123,
           refit=True, scoring='log_loss', solver='liblinear', tol=0.0001,
           verbose=0)

              precision    recall  f1-score   support

         n0       0.95      0.82      0.88       106
         n1       0.55      0.82      0.66        28

avg / total       0.86      0.82      0.83       134


Log Loss:  0.484356884231


                        F2   LogLoss       MCC
clinical_only    0.656250  0.617746  0.419967
3_PCs            0.718954  0.506919  0.535066
3_PCs_optimized  0.718954  0.505977  0.535066
Final_Model      0.746753  0.484357  0.562824

Is each feature important for model performance?



In [58]:

    
"""The coefficient of each feature incorporated into the final model"""
pd.DataFrame({'Component': X_train.columns, 'coef': clf_LR.coef_[0]})



In [59]:

    
%run -i 'Support Files/final_graph.py'
fig.savefig('Figures/FF.png')

Conclusions and Discussion

Function for Prostate Cancer Metastasis Risk



In [60]:

    
def risk_function(barcode, DF, clinDF, genes, scaler, fit_pca, fit_clf) :
    self = DF.loc[barcode,genes].reshape(1,-1)
    self = pd.DataFrame(scaler.transform(self))
    self = pd.DataFrame(fit_pca.transform(self), index = [barcode])
    self['gleason'] = clinDF.loc[barcode,'gleasonscore']
    return(fit_clf.predict_proba(self)[0,1])



In [61]:

    
"""Put the non-labeled samples through the risk function"""
risk = []
for x in not_labeled.index :
    risk.append(risk_function(barcode = x,
              DF = X_all,
              clinDF = clinical,
              genes = genes,
              fit_pca = pca,
              scaler = scaler,
              fit_clf = clf_LR))
model_risk = pd.Series(risk, index=not_labeled.index)



In [62]:

    
clinicalDF_all.drop(['y'], axis=1, inplace = True)



In [63]:

    
bench_risk = pd.Series(clf_LR_clin.predict_proba(clinicalDF_all.loc[not_labeled.index,:])[:,1])



In [64]:

    
n0_question = X_all.loc[y[y!='n1'].index, :]



In [65]:

    
"""Put the non-metastatic samples through the risk function as a functional test"""
risk = []
for x in n0_question.index :
    risk.append(risk_function(barcode = x,
              DF = n0_question,
              clinDF = clinical,
              genes = genes,
              fit_pca = pca,
              scaler = scaler,
              fit_clf = clf_LR))
n0_risk = pd.Series(risk, index=n0_question.index)

Visualize unlabeled and 'non-metastatic' cases from the TCGA cohort



In [66]:

    
%run -i 'Support Files/free-form.py'
fig.savefig('Figures/missing.png')



In [67]:

    
%run -i 'Support Files/free-form2.py'
fig.savefig('Figures/N0Analysis.png')

It is important to remember that the dataset utilized in this project is longitudinal and active, and indeed some cases listed as 'n0' will eventually be diagnosed as 'n1'. Even though the model was trained on such false-negative labeled cases, some of the 'n0' class are still predicted as highly likely to metastasize. If the dataset were linked somehow to patient records (for ethical purposes, no such link exists), then these patients should be monitored with extreme care for signs of metastasis based on the predictions made by the project model.



In [68]:

    
n0_risk[n0_risk > 0.75].sort_values(ascending = False)









    Out[68]:





clinical_index
TCGA-EJ-8472    0.998112
TCGA-CH-5761    0.989662
TCGA-HC-A9TE    0.983214
TCGA-KK-A59V    0.959048
TCGA-V1-A9OL    0.958359
TCGA-EJ-5514    0.951237
TCGA-HC-7213    0.945857
TCGA-J4-AATZ    0.930018
TCGA-FC-7961    0.912228
TCGA-V1-A9Z8    0.875559
TCGA-HC-A9TH    0.872729
TCGA-YL-A8SC    0.865229
TCGA-HC-A4ZV    0.846214
TCGA-YL-A8HM    0.846111
TCGA-HC-A631    0.821491
TCGA-EJ-8469    0.818739
TCGA-XJ-A9DI    0.801364
TCGA-HC-A632    0.800009
TCGA-KK-A8IC    0.794476
TCGA-XK-AAIW    0.788813
TCGA-VP-AA1N    0.783299
TCGA-EJ-A65F    0.772900
TCGA-KK-A6E6    0.772893
TCGA-X4-A8KQ    0.767916
TCGA-HI-7168    0.766598
TCGA-ZG-A9LU    0.764319
TCGA-KK-A5A1    0.763620
TCGA-M7-A725    0.756041
dtype: float64

Sensitivity Analysis



In [69]:

    
benign = feather.read_dataframe('feather_files/benign.feather')



In [70]:

    
benign.set_index(['benign_index'], inplace = True)



In [71]:

    
benign.shape









    Out[71]:





(52, 20501)



In [72]:

    
benign_tpm = transformation(benign)









    




Transforming gene counts to transcript per million (TPM)

Transformation Successful!

52 Gene count estimate profiles have been transformed from gene counts to transcripts per million reads (TPM)



In [73]:

    
"""Put the benign samples through the risk function"""
risk = []
for x in benign.index :
    risk.append(risk_function(barcode = x,
              DF = benign,
              clinDF = clinical,
              genes = genes,
              fit_pca = pca,
              scaler = scaler,
              fit_clf = clf_LR))
benign_risk = pd.Series(risk, index=benign.index)



In [74]:

    
fig = plt.figure(figsize=(8,4))
A = fig.add_subplot(1,1,1)
A.hist(benign_risk.sort_values(ascending=False),
       bins=12,
       facecolor='blue')
#A.set_yscale('log')
A.set_ylabel('Count', fontsize=16)
A.set_xlim(0,1)
A.set_xlabel("Model Probability of Metastasis", fontsize=16)
fig.suptitle("Distribution of metastasis probability for benign samples", fontsize = 14)
plt.show()
fig.savefig("Figures/Sensitivity.png")

Sensitivity Analysis experiment shows that specimens from benign regions of the prostate (even when borrowing Gleason scores from an orthogonal malignancy) do not register as high risk for metastasis.



In [ ]:

	Component	coef
0	0	0.692539
1	1	-0.361935
2	2	-0.252021
3	gleason	0.483543