Important! Make sure you have added your email and name here before proceeding further: https://tinyurl.com/y76vk384
In [2]:
# To support both python 2 and python 3
# from __future__ import division, print_function, unicode_literals
# NOTE(review): imports are scattered through this notebook (pandas, numpy,
# matplotlib, seaborn appear in later cells); consolidating them here would
# make a Restart-&-Run-All re-run easier to audit. `urllib` is only used by
# a commented-out download line below.
import os
from zipfile import ZipFile
from six.moves import urllib
import sys
# Record the interpreter version for reproducibility.
print(sys.version)
In [3]:
# Create the data directory (shell escape; on Windows this would need
# os.makedirs instead of `mkdir -p`).
!mkdir -p ../data
# All raw and derived data files live under ../data relative to this notebook.
TADPOLE_PATH = os.path.join("..", "data")
Next, manually download the zip file from: https://ida.loni.usc.edu/pages/access/studyData.jsp?categoryId=43&subCategoryId=94 and place it in the data folder.
In [4]:
def fetch_tadpole_data(tadpole_path=TADPOLE_PATH):
    """Extract the TADPOLE challenge zip archive into `tadpole_path`.

    The archive must have been downloaded manually (ADNI credentials are
    required) and placed at `<tadpole_path>/tadpole_challenge.zip`.

    Args:
        tadpole_path: folder that holds the zip and receives the extracted files.

    Raises:
        ValueError: if the expected zip file is not present.
    """
    if not os.path.isdir(tadpole_path):
        os.makedirs(tadpole_path)
    zip_path = os.path.join(tadpole_path, "tadpole_challenge.zip")
    if not os.path.isfile(zip_path):
        # BUG FIX: report the folder actually searched (the parameter),
        # not the module-level default TADPOLE_PATH.
        raise ValueError("please move the downloaded zipfile to %s folder" % tadpole_path)
    print("extracting from %s" % zip_path)
    # urllib.request.urlretrieve(tadpole_url, zip_path)
    with ZipFile(zip_path) as tadpole_zip:
        # The `with` block closes the archive on exit; the explicit
        # close() call the original made here was redundant.
        tadpole_zip.extractall(path=tadpole_path)
fetch_tadpole_data()
In [5]:
# NOTE(review): star import pollutes the namespace — prefer
# `from makeLeaderboardDataset import generateLBdatasets` so readers can
# see where the name comes from.
from makeLeaderboardDataset import *
import pandas as pd
# Presumably writes the TADPOLE_LB*.csv files consumed below into ../data/
# — confirm against makeLeaderboardDataset.py.
generateLBdatasets(inputFolder='../data/', outputFolder='../data/')
This training dataset contains medical data, including cognitive test scores, MRI, PET and CSF measures, and risk factors.
LB2 is a subset of LB1: it lists the subjects whose outcomes must be predicted in the final submission.
See the GitHub README file ["https://github.com/swhustla/pycon2017-alzheimers-hack/blob/master/README.md"] for more information and explanations of the data sources.
In [6]:
def load_tadpole_data(tadpole_path=TADPOLE_PATH):
    """Read the combined LB1/LB2 spreadsheet from `tadpole_path` into a DataFrame."""
    return pd.read_csv(os.path.join(tadpole_path, "TADPOLE_LB1_LB2.csv"))
tadpole_lb1_lb2 = load_tadpole_data()
In [7]:
# Quick look at the first rows of the merged LB1/LB2 table.
tadpole_lb1_lb2.head()
Out[7]:
In [8]:
# The table has many columns; show only the first 30 names.
print(list(tadpole_lb1_lb2.columns)[:30])
In [9]:
# Dtypes/non-null counts, then summary statistics of the numeric columns.
print(tadpole_lb1_lb2.info())
tadpole_lb1_lb2.describe()
Out[9]:
In [10]:
# Distribution of the diagnosis label DX.
tadpole_lb1_lb2["DX"].value_counts()
Out[10]:
In [11]:
# Distribution of DX_bl — presumably the diagnosis at baseline; confirm
# against the ADNI/TADPOLE data dictionary.
tadpole_lb1_lb2["DX_bl"].value_counts()
Out[11]:
In [12]:
# Visit codes — presumably baseline/month-offset identifiers; confirm
# against the ADNI documentation.
tadpole_lb1_lb2["VISCODE"].value_counts()
Out[12]:
In [13]:
%matplotlib inline
# NOTE(review): plotting imports belong in the notebook's top import cell.
import matplotlib.pyplot as plt
import seaborn as sns
# Histogram every numeric column to eyeball ranges, skew and outliers.
tadpole_lb1_lb2.hist(bins=50, figsize=(20,15))
plt.show()
In [14]:
# Pairwise linear correlations between all numeric columns.
corr_matrix = tadpole_lb1_lb2.corr()
In [15]:
# Columns most (and least) correlated with ADAS13 — one of the three
# prediction targets declared below.
correlations_with_ADAS13 = corr_matrix['ADAS13'].sort_values(ascending=False)
print(correlations_with_ADAS13[:10], correlations_with_ADAS13[-10:])
In [16]:
# The three quantities the challenge asks us to forecast.
prediction_variables = ["ADAS13", "DX", "Ventricles"]
cog_tests_attributes = ["CDRSB", "ADAS11", "MMSE", "RAVLT_immediate"]
# NOTE(review): "FDG" and "AV45" are repeated in pet_measures below;
# keeping them in mri_measures too duplicates those columns when the
# lists are concatenated later (useful_numerical_attribs).
mri_measures = ['Hippocampus', 'WholeBrain', 'Entorhinal', 'MidTemp' , "FDG", "AV45"]
pet_measures = ["FDG", "AV45"]
csf_measures = ["ABETA_UPENNBIOMK9_04_19_17", "TAU_UPENNBIOMK9_04_19_17", "PTAU_UPENNBIOMK9_04_19_17"]
risk_factors = ["APOE4", "AGE"]
In [17]:
from pandas.plotting import scatter_matrix
# NOTE(review): the four near-identical cells below would be cleaner as a
# single helper, e.g. plot_against_targets(attribs, title). Also,
# plt.title() only titles the last subplot of the scatter matrix, not the
# whole figure (plt.suptitle would label the figure).
scatter_matrix(tadpole_lb1_lb2[prediction_variables+cog_tests_attributes], figsize=(12,8), alpha=0.1)
plt.title("cog_tests_attributes")
plt.show()
In [18]:
scatter_matrix(tadpole_lb1_lb2[prediction_variables+mri_measures], figsize=(12,8), alpha=0.1)
plt.title("mri_measures")
plt.show()
In [19]:
scatter_matrix(tadpole_lb1_lb2[prediction_variables+pet_measures], figsize=(12,8), alpha=0.1)
plt.title("pet_measures")
plt.show()
In [20]:
scatter_matrix(tadpole_lb1_lb2[prediction_variables+risk_factors], alpha=0.1, figsize=(12,8))
plt.title("risk_factors")
plt.show()
In [21]:
# Subjects with the most visits (RID is the per-subject identifier).
tadpole_lb1_lb2.RID.value_counts()[:5]
Out[21]:
In [22]:
# Parse exam dates so per-visit time offsets can be computed.
tadpole_lb1_lb2.EXAMDATE = pd.to_datetime(tadpole_lb1_lb2.EXAMDATE)
In [23]:
# Per subject: age at each exam = baseline AGE + years elapsed since that
# subject's first exam (365.25 days/year accounts for leap years).
tadpole_grouped = tadpole_lb1_lb2.groupby("RID").apply(lambda x:(x["EXAMDATE"]-x["EXAMDATE"].min()).dt.days/365.25 + x["AGE"].min())
In [24]:
tadpole_grouped.sort_index(inplace=True)
In [25]:
tadpole_grouped.values
Out[25]:
In [26]:
tadpole_lb1_lb2.sort_values(by=["RID", "EXAMDATE"], inplace=True)
In [27]:
# NOTE(review): this pastes the grouped result back by *position*, relying
# on the sort orders of tadpole_grouped (by RID, then original row index)
# and tadpole_lb1_lb2 (by RID, then EXAMDATE) lining up row-for-row.
# Fragile — confirm, or merge on the (RID, index) MultiIndex instead.
tadpole_lb1_lb2["AGE_AT_EXAM"] = tadpole_grouped.values
In [28]:
# Longitudinal ADAS13 trajectory for a single example subject.
tadpole_lb1_lb2[tadpole_lb1_lb2.RID==259].plot(kind="scatter", x="AGE_AT_EXAM", y="ADAS13")
plt.show()
In [29]:
# Sanity-check the derived ages for a slice of high-RID subjects.
tadpole_lb1_lb2[tadpole_lb1_lb2['RID'] > 5000].plot(kind="scatter", x="RID", y="AGE_AT_EXAM")
Out[29]:
In [30]:
# Integer age bucket used for the group-by plots below.
tadpole_lb1_lb2['AGE_INT'] = tadpole_lb1_lb2['AGE_AT_EXAM'].apply(int)
In [31]:
# Number of non-null ADAS13 measurements per integer age.
tadpole_lb1_lb2[tadpole_lb1_lb2['ADAS13'].notnull()]\
.groupby('AGE_INT')['ADAS13']\
.count().plot()
Out[31]:
In [32]:
# Mean ADAS13 score per integer age.
tadpole_lb1_lb2[tadpole_lb1_lb2['ADAS13'].notnull()]\
.groupby('AGE_INT')['ADAS13']\
.mean().plot()
Out[32]:
In [33]:
# categorical example
tadpole_lb1_lb2["DX"].value_counts()
Out[33]:
In [34]:
# Prediction targets: two numeric regression targets and the categorical
# diagnosis label.
y_num_cols = ["ADAS13", "Ventricles"]
y_cat_cols = ["DX"]
In [35]:
# NOTE(review): these five lists duplicate the definitions from the earlier
# exploration section — a re-run-safe notebook should define them once.
cog_tests_attributes = ["CDRSB", "ADAS11", "MMSE", "RAVLT_immediate"]
mri_measures = ['Hippocampus', 'WholeBrain', 'Entorhinal', 'MidTemp' , "FDG", "AV45"]
pet_measures = ["FDG", "AV45"]
csf_measures = ["ABETA_UPENNBIOMK9_04_19_17", "TAU_UPENNBIOMK9_04_19_17", "PTAU_UPENNBIOMK9_04_19_17"]
risk_factors = ["APOE4", "AGE"]
In [36]:
def convert_float(val):
    """Coerce `val` to float, mapping unconvertible values to NaN.

    Used below to clean CSF biomarker columns that mix numbers with
    non-numeric placeholder entries.

    Args:
        val: anything `float()` accepts, or a junk value.

    Returns:
        float(val), or NaN when conversion fails.
    """
    try:
        return float(val)
    except (ValueError, TypeError):
        # ValueError: non-numeric string; TypeError: None or other
        # non-castable objects. The original caught only ValueError, so a
        # None in the column crashed the map; it also returned np.nan even
        # though numpy is never imported in this notebook — float("nan")
        # is the same value with no extra dependency.
        return float("nan")
In [37]:
# Clean the CSF biomarker columns: coerce mixed string/number entries to floats.
for col in csf_measures:
    tadpole_lb1_lb2[col] = tadpole_lb1_lb2[col].map(convert_float)
In [38]:
# Final feature lists fed to the preprocessing pipelines below.
# NOTE(review): mri_measures already contains "FDG"/"AV45", so concatenating
# it with pet_measures duplicates those two columns in the feature matrix.
useful_numerical_attribs = cog_tests_attributes + mri_measures + pet_measures + csf_measures + ["AGE_AT_EXAM", 'AGE']
useful_numerical_attribs
Out[38]:
In [39]:
tadpole_lb1_lb2.columns[:20]
Out[39]:
In [40]:
useful_categorical_attribs = ['RID', 'SITE', 'DXCHANGE', 'PTGENDER',
'PTEDUCAT', 'PTETHCAT', 'PTRACCAT', 'PTMARRY', 'APOE4']
useful_categorical_attribs
Out[40]:
In [41]:
# Cast the categorical columns to strings for label binarization.
# NOTE(review): astype(str) turns missing values into the literal string
# "nan", which then becomes its own category — confirm that is intended.
tadpole_lb1_lb2[useful_categorical_attribs] = tadpole_lb1_lb2[useful_categorical_attribs].astype(str)
In [42]:
from sklearn.model_selection import StratifiedShuffleSplit
# One stratified 80/20 split so train and test share the same
# marital-status / gender / ethnicity mix.
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index \
        in split.split(tadpole_lb1_lb2, tadpole_lb1_lb2[["PTMARRY", "PTGENDER","PTETHCAT"]]):
    # BUG FIX: split() yields *positional* indices, and the frame was
    # sorted by RID/EXAMDATE above, so its labels are no longer in
    # positional order. .loc selected rows by label — a different subset
    # than the splitter chose, destroying the stratification. .iloc picks
    # the intended rows.
    strat_train_set = tadpole_lb1_lb2.iloc[train_index]
    strat_test_set = tadpole_lb1_lb2.iloc[test_index]
From this point on, work with the training set only.
In [43]:
tadpole = strat_train_set.copy()
print(strat_train_set.AGE)
In [44]:
print(tadpole.head())
print(tadpole_lb1_lb2.head())
print(strat_train_set.keys()[:10])
tadpole = strat_train_set.drop(y_num_cols + y_num_cols, axis=1)
#tadpole_labels_categorical = strat_train_set[useful_categorical_attribs].copy()
#'AGE_AT_EXAM' in strat_train_set.keys()
In [45]:
tadpole = strat_train_set.drop(y_num_cols + y_num_cols, axis=1)
tadpole_labels_categorical = strat_train_set[useful_categorical_attribs].copy()
tadpole_labels_continuous = strat_train_set[useful_numerical_attribs].copy()
In [ ]:
In [46]:
from sklearn.pipeline import Pipeline
# NOTE(review): sklearn.preprocessing.Imputer was removed in scikit-learn
# 0.22 (use sklearn.impute.SimpleImputer instead), so this cell pins the
# notebook to an old scikit-learn version.
from sklearn.preprocessing import StandardScaler, Imputer, LabelBinarizer
In [47]:
class LabelBinarizerPipelineFriendly(LabelBinarizer):
    """LabelBinarizer with the (X, y) signatures that sklearn Pipeline expects.

    Pipeline calls transformers as fit(X, y) / transform(X); this subclass
    adapts LabelBinarizer to accept (and ignore) the extra argument.
    """
    def fit(self, X, y=None):
        """Fit the binarizer on X (y is ignored) and return self."""
        super(LabelBinarizerPipelineFriendly, self).fit(X)
        # BUG FIX: fit() must return self (sklearn estimator contract);
        # the original returned None, breaking `est.fit(X).transform(X)`.
        return self
    def transform(self, X, y=None):
        """Binarize X; y is accepted for Pipeline compatibility only."""
        return super(LabelBinarizerPipelineFriendly, self).transform(X)
    def fit_transform(self, X, y=None):
        """Fit on X, then return the binarized X."""
        return super(LabelBinarizerPipelineFriendly, self).fit(X).transform(X)
In [48]:
from sklearn.base import BaseEstimator, TransformerMixin
# Pipeline step that selects a fixed list of DataFrame columns and returns
# them as a plain numpy array (bridges pandas frames into sklearn pipelines).
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        # Columns to extract, in output order.
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        # Stateless transformer: nothing to learn.
        return self
    def transform(self, X):
        return X[self.attribute_names].values
In [49]:
# Numeric branch: select the numeric features, median-impute missing
# values, then standardize to zero mean / unit variance.
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(useful_numerical_attribs)),
    ('imputer', Imputer(strategy="median")),
    # ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
# Categorical branch: one-hot encode APOE4 (cast to str earlier).
# NOTE(review): the other useful_categorical_attribs are never fed to the
# pipeline — confirm that only APOE4 is intended here.
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(['APOE4'])),
    ('label_binarizer', LabelBinarizerPipelineFriendly()),
])
In [50]:
from sklearn.pipeline import FeatureUnion
# Concatenate the numeric and categorical branches column-wise into a
# single feature matrix.
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])
In [51]:
# Fit the preprocessing on the training features and produce the
# model-ready array.
tadpole_prepared = full_pipeline.fit_transform(tadpole)
In [52]:
tadpole_prepared
Out[52]:
Next, generate a simple forecast from the training data and save it as TADPOLE_Submission_Pycon_TeamName1.csv.
In [53]:
# Run the baseline forecasting script shipped with the repo; it writes
# TADPOLE_Submission_Pycon_TeamName1.csv into ../data/ (renamed below).
!python3 TADPOLE_SimpleForecast1.py
Replace TeamName1 with your team name and a submission index (no underscores allowed), e.g. TADPOLE_Submission_Pycon_TeamAwesome3.csv.
In [54]:
team_name = "TeamFrank1" ## add your own team name here
In [55]:
import os
oldFile = '../data/TADPOLE_Submission_Pycon_TeamName1.csv'
newFile = '../data/TADPOLE_Submission_Pycon_%s.csv' % team_name
os.system('mv %s %s' % (oldFile, newFile))
Out[55]:
Evaluate the forecasts from your renamed submission file against TADPOLE_LB4_dummy.csv (the held-out dataset) using the evaluation function.
In [59]:
# Score the forecast against the dummy held-out set (LB4) with the official
# evaluation script.
# NOTE(review): os.system sends output to the terminal that launched
# jupyter, not the notebook — subprocess.run(..., capture_output=True)
# would surface it here. Also treat shell commands built from interpolated
# strings with care if any component is untrusted.
cmd = 'python3 evalOneSubmission.py --leaderboard --d4File %s --forecastFile %s' % ("../data/TADPOLE_LB4_dummy.csv", newFile)
print(cmd)
os.system(cmd)
# check the console where you launched jupyter from, it should show the outputs.
# Otherwise, run the command from the command line
Out[59]:
In [57]:
# Submit (renamed version of) TADPOLE_Submission_Leaderboard_TeamName1.csv to TADPOLE website via the Submit page
In [ ]: