PyCon UK: Alzheimer's Disease Challenge Hackathon

Important! Make sure you have added your email and name here before proceeding further: https://tinyurl.com/y76vk384


In [2]:
# To support both python 2 and python 3
# from __future__ import division, print_function, unicode_literals

import os
from zipfile import ZipFile
from six.moves import urllib


import sys
print(sys.version)


3.4.5 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:47:47) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]

In [3]:
!mkdir -p ../data
TADPOLE_PATH = os.path.join("..", "data")
  • Check your email; you will have received a login for ida.loni.usc.edu.
  • Log in at https://ida.loni.usc.edu/pages/access/studyData.jsp?categoryId=43&subCategoryId=94 and select Projects -> ADNI -> Download -> Study Data.
  • In the "Find" entry box, type "tadpole".
  • Select "Tadpole Challenge Data" and click on the DOWNLOAD button.
  • Once "tadpole_challenge.zip" is downloaded, save it under ../data/.
  • The function below will extract the files for you into the ../data folder.
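Before extracting, a quick sanity check that the download really is a valid zip can save a confusing traceback later. A minimal sketch (the path mirrors the steps above):

```python
import os
from zipfile import ZipFile, is_zipfile

zip_path = os.path.join("..", "data", "tadpole_challenge.zip")

if os.path.isfile(zip_path) and is_zipfile(zip_path):
    with ZipFile(zip_path) as zf:
        print(zf.namelist()[:5])  # peek at the first few archive members
else:
    print("zip missing or corrupt - (re-)download it to ../data/")
```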

In [4]:
def fetch_tadpole_data(tadpole_path=TADPOLE_PATH):
    if not os.path.isdir(tadpole_path):
        os.makedirs(tadpole_path)
    zip_path = os.path.join(tadpole_path, "tadpole_challenge.zip")
    if not os.path.isfile(zip_path):
        raise ValueError("please move the downloaded zipfile to the %s folder" % tadpole_path)
    print("extracting from %s" % zip_path)
    # the with-statement closes the zipfile for us
    with ZipFile(zip_path) as tadpole_zip:
        tadpole_zip.extractall(path=tadpole_path)

fetch_tadpole_data()


extracting from ../data/tadpole_challenge.zip

Make Leaderboard datasets

First generate the leaderboard datasets:

  • LB1 (Full training set - all subjects)
  • LB2 (Selection of subjects for prediction & testing against)

In [5]:
from makeLeaderboardDataset import *
import pandas as pd

generateLBdatasets(inputFolder='../data/', outputFolder='../data/')


TADPOLE_LB1_LB2.csv created in ../data/
columns ['RID', 'PTID', 'VISCODE', 'SITE', 'D1', 'D2', 'LB1', 'LB2', 'COLPROT', 'ORIGPROT']
TADPOLE_LB4_dummy.csv created in ../data/

Load in datasets

LB1: TADPOLE Standard training set.

This training dataset contains medical data including:

  • MRI scans
  • PET scans
  • DTI scans
  • Cognitive assessment data
  • Demographic data
  • Genetic data
  • CSF data

LB2: TADPOLE Standard prediction set.

This is a subset of LB1: the list of subjects for which predictions must be made in the final submission.

See the GitHub README (https://github.com/swhustla/pycon2017-alzheimers-hack/blob/master/README.md) for more information and explanations of the data sources.
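LB1/LB2 membership is encoded as indicator columns in TADPOLE_LB1_LB2.csv, so each subset can be pulled out with a boolean mask. A toy frame with the same columns:

```python
import pandas as pd

# toy rows mimicking the LB1/LB2 indicator columns in TADPOLE_LB1_LB2.csv
df = pd.DataFrame({
    "RID": [2, 3, 3, 5],
    "LB1": [1, 1, 1, 0],
    "LB2": [0, 1, 0, 1],
})

lb1 = df[df["LB1"] == 1]  # training rows
lb2 = df[df["LB2"] == 1]  # rows whose subjects must be forecast
print(len(lb1), len(lb2))  # → 3 2
```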


In [6]:
def load_tadpole_data(tadpole_path=TADPOLE_PATH):
    csv_path_lb1_lb2 = os.path.join(tadpole_path, "TADPOLE_LB1_LB2.csv")
    return pd.read_csv(csv_path_lb1_lb2)

tadpole_lb1_lb2 = load_tadpole_data()


/home/razvan/anaconda2/envs/pycon/lib/python3.4/site-packages/IPython/core/interactiveshell.py:2850: DtypeWarning: Columns (473,475,476,489,490,491,492,493,494,495,496,497,498,499,500,501,502,503,504,505,506,507,508,509,510,511,512,513,514,515,516,517,518,519,520,521,522,523,524,525,526,527,528,529,530,531,532,533,534,535,536,537,538,539,540,541,542,543,544,545,546,547,548,549,550,551,552,553,554,555,556,557,558,559,560,561,562,563,564,565,571,572,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,601,603,608,609,610,611,612,613,614,615,616,617,618,619,620,621,622,623,626,627,628,629,630,631,632,633,634,635,636,638,639,640,641,642,643,644,645,646,647,648,649,650,651,652,653,654,655,656,657,658,659,660,661,662,663,665,666,667,668,669,670,671,672,673,674,675,676,677,678,679,680,681,682,683,684,685,686,687,688,689,690,691,692,693,694,695,696,697,698,699,700,701,702,703,704,705,706,707,708,709,710,711,712,713,714,715,716,717,718,719,720,721,722,723,724,725,726,727,728,729,730,731,732,733,734,735,736,737,738,739,740,741,747,748,750,751,752,753,754,755,756,757,758,759,760,761,762,763,764,765,766,767,768,769,772,773,778,779,780,781,782,783,784,785,786,787,788,789,790,791,792,793,796,797,799,800,801,802,803,804,805,806,808,809,810,811,812,813,814,815,816,817,818,819,820,821,822,823,824,825,826,827,828,829,830,831,832,833) have mixed types. Specify dtype option on import or set low_memory=False.
  if self.run_code(code, result):

Data Exploration


In [7]:
tadpole_lb1_lb2.head()


Out[7]:
RID PTID VISCODE SITE D1 D2 LB1 LB2 COLPROT ORIGPROT ... PHASE_UPENNBIOMK9_04_19_17 BATCH_UPENNBIOMK9_04_19_17 KIT_UPENNBIOMK9_04_19_17 STDS_UPENNBIOMK9_04_19_17 RUNDATE_UPENNBIOMK9_04_19_17 ABETA_UPENNBIOMK9_04_19_17 TAU_UPENNBIOMK9_04_19_17 PTAU_UPENNBIOMK9_04_19_17 COMMENT_UPENNBIOMK9_04_19_17 update_stamp_UPENNBIOMK9_04_19_17
0 2 011_S_0002 bl 11 1 1 1 0 ADNI1 ADNI1 ...
1 3 011_S_0003 bl 11 1 0 1 0 ADNI1 ADNI1 ... ADNI1 UPENNBIOMK9 P06-MP02-MP01 P06-MP02-MP01/2 2016-12-14 741.5 239.7 22.83 NaN 2017-04-20 14:39:54.0
2 3 011_S_0003 m06 11 1 0 1 0 ADNI1 ADNI1 ...
3 3 011_S_0003 m12 11 1 0 1 0 ADNI1 ADNI1 ... ADNI1 UPENNBIOMK9 P06-MP02-MP01 P06-MP02-MP01/2 2016-12-14 601.4 251.7 24.18 NaN 2017-04-20 14:39:54.0
4 3 011_S_0003 m24 11 1 0 1 0 ADNI1 ADNI1 ...

5 rows × 1909 columns


In [8]:
print(list(tadpole_lb1_lb2.columns)[:30])


['RID', 'PTID', 'VISCODE', 'SITE', 'D1', 'D2', 'LB1', 'LB2', 'COLPROT', 'ORIGPROT', 'EXAMDATE', 'DX_bl', 'DXCHANGE', 'AGE', 'PTGENDER', 'PTEDUCAT', 'PTETHCAT', 'PTRACCAT', 'PTMARRY', 'APOE4', 'FDG', 'PIB', 'AV45', 'CDRSB', 'ADAS11', 'ADAS13', 'MMSE', 'RAVLT_immediate', 'RAVLT_learning', 'RAVLT_forgetting']

In [9]:
print(tadpole_lb1_lb2.info())
tadpole_lb1_lb2.describe()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12012 entries, 0 to 12011
Columns: 1909 entries, RID to update_stamp_UPENNBIOMK9_04_19_17
dtypes: float64(72), int64(10), object(1827)
memory usage: 174.9+ MB
None
Out[9]:
RID SITE D1 D2 LB1 LB2 DXCHANGE AGE PTEDUCAT APOE4 ... EcogSPOrgan_bl EcogSPDivatt_bl EcogSPTotal_bl FDG_bl PIB_bl AV45_bl Years_bl Month_bl Month M
count 12012.000000 12012.000000 12012.000000 12012.000000 12012.000000 12012.000000 8475.000000 12012.000000 12012.000000 12000.000000 ... 5504.000000 5637.000000 5767.000000 8781.000000 120.000000 5750.000000 12012.000000 12012.000000 12012.000000 12012.000000
mean 2336.799118 72.953713 0.994172 0.588745 0.928571 0.071429 2.103363 73.760931 15.986097 0.550167 ... 1.651214 1.845027 1.693177 1.246474 1.610021 1.202087 1.925861 23.062975 22.977772 22.818931
std 1864.695129 107.984561 0.076119 0.492082 0.257550 0.257550 1.077779 7.043616 2.829938 0.660570 ... 0.838895 0.894895 0.700158 0.146611 0.330875 0.222027 1.943552 23.274829 23.210868 23.020114
min 2.000000 2.000000 0.000000 0.000000 0.000000 0.000000 1.000000 54.400000 4.000000 0.000000 ... 1.000000 1.000000 1.000000 0.697264 1.155000 0.838537 0.000000 0.000000 0.000000 0.000000
25% 658.000000 21.000000 1.000000 0.000000 1.000000 0.000000 1.000000 69.400000 14.000000 0.000000 ... 1.000000 1.000000 1.153850 1.157480 1.335000 1.019440 0.490075 5.868850 6.000000 6.000000
50% 1378.000000 41.000000 1.000000 1.000000 1.000000 0.000000 2.000000 73.700000 16.000000 0.000000 ... 1.250000 1.500000 1.435900 1.254910 1.490000 1.125370 1.496235 17.918000 18.000000 18.000000
75% 4371.000000 116.000000 1.000000 1.000000 1.000000 0.000000 3.000000 78.800000 18.000000 1.000000 ... 2.000000 2.333330 2.051280 1.338750 1.835000 1.374980 2.904860 34.786900 36.000000 36.000000
max 5296.000000 941.000000 1.000000 1.000000 1.000000 1.000000 8.000000 91.400000 20.000000 2.000000 ... 4.000000 4.000000 3.948720 1.707170 2.282500 2.025560 10.321700 123.607000 126.000000 120.000000

8 rows × 82 columns


In [10]:
tadpole_lb1_lb2["DX"].value_counts()


Out[10]:
MCI                3801
NL                 2477
Dementia           1686
MCI to Dementia     346
NL to MCI            89
MCI to NL            73
Dementia to MCI      12
NL to Dementia        3
Name: DX, dtype: int64

In [11]:
tadpole_lb1_lb2["DX_bl"].value_counts()


Out[11]:
LMCI    4353
CN      3383
EMCI    2319
AD      1568
SMC      389
Name: DX_bl, dtype: int64

In [12]:
tadpole_lb1_lb2["VISCODE"].value_counts()


Out[12]:
bl      1737
m06     1618
m12     1485
m24     1326
m18     1293
m36      848
m03      793
m30      750
m48      603
m42      303
m60      264
m72      163
m54      146
m66      138
m78      129
m84      123
m96      100
m108      78
m90       77
m120      33
m102       4
m114       1
Name: VISCODE, dtype: int64

In [13]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
tadpole_lb1_lb2.hist(bins=50, figsize=(20,15))
plt.show()


Check for correlations

(both positive and negative)


In [14]:
corr_matrix = tadpole_lb1_lb2.corr()

In [15]:
correlations_with_ADAS13 = corr_matrix['ADAS13'].sort_values(ascending=False)
print(correlations_with_ADAS13[:10], correlations_with_ADAS13[-10:])


ADAS13           1.000000
ADAS11           0.981339
ADAS13_bl        0.823384
CDRSB            0.804806
ADAS11_bl        0.782538
FAQ              0.764608
EcogSPTotal      0.711863
EcogSPVisspat    0.689860
EcogSPMem        0.674288
EcogSPPlan       0.654259
Name: ADAS13, dtype: float64 Hippocampus          -0.545448
FDG_bl               -0.559090
RAVLT_learning       -0.604140
MMSE_bl              -0.627232
FDG                  -0.646439
RAVLT_immediate_bl   -0.680146
MOCA_bl              -0.693297
RAVLT_immediate      -0.790030
MOCA                 -0.838713
MMSE                 -0.838904
Name: ADAS13, dtype: float64

Plot a selection of variables


In [16]:
prediction_variables = ["ADAS13", "DX", "Ventricles"]
cog_tests_attributes = ["CDRSB", "ADAS11", "MMSE", "RAVLT_immediate"]
mri_measures = ['Hippocampus', 'WholeBrain', 'Entorhinal', 'MidTemp' , "FDG", "AV45"]
pet_measures = ["FDG", "AV45"]
csf_measures = ["ABETA_UPENNBIOMK9_04_19_17", "TAU_UPENNBIOMK9_04_19_17", "PTAU_UPENNBIOMK9_04_19_17"]
risk_factors = ["APOE4", "AGE"]

In [17]:
from pandas.plotting import scatter_matrix

scatter_matrix(tadpole_lb1_lb2[prediction_variables+cog_tests_attributes], figsize=(12,8), alpha=0.1)
plt.title("cog_tests_attributes")
plt.show()



In [18]:
scatter_matrix(tadpole_lb1_lb2[prediction_variables+mri_measures], figsize=(12,8), alpha=0.1)
plt.title("mri_measures")
plt.show()



In [19]:
scatter_matrix(tadpole_lb1_lb2[prediction_variables+pet_measures], figsize=(12,8), alpha=0.1)
plt.title("pet_measures")
plt.show()



In [20]:
scatter_matrix(tadpole_lb1_lb2[prediction_variables+risk_factors], alpha=0.1, figsize=(12,8))
plt.title("risk_factors")
plt.show()


Focus on individual patients


In [21]:
tadpole_lb1_lb2.RID.value_counts()[:5]


Out[21]:
906    19
31     19
259    19
61     19
126    19
Name: RID, dtype: int64

In [22]:
tadpole_lb1_lb2.EXAMDATE = pd.to_datetime(tadpole_lb1_lb2.EXAMDATE)

In [23]:
# age at each exam = baseline AGE + years elapsed since the subject's first exam
tadpole_grouped = tadpole_lb1_lb2.groupby("RID").apply(
    lambda x: (x["EXAMDATE"] - x["EXAMDATE"].min()).dt.days/365.25 + x["AGE"].min())

In [24]:
tadpole_grouped.sort_index(inplace=True)

In [25]:
tadpole_grouped.values


Out[25]:
array([ 74.3       ,  74.79007529,  77.26783025, ...,  77.49041752,
        69.3       ,  71.29589322])

In [26]:
tadpole_lb1_lb2.sort_values(by=["RID", "EXAMDATE"], inplace=True)

In [27]:
tadpole_lb1_lb2["AGE_AT_EXAM"] = tadpole_grouped.values

In [28]:
tadpole_lb1_lb2[tadpole_lb1_lb2.RID==259].plot(kind="scatter", x="AGE_AT_EXAM", y="ADAS13")
plt.show()



In [29]:
tadpole_lb1_lb2[tadpole_lb1_lb2['RID'] > 5000].plot(kind="scatter", x="RID", y="AGE_AT_EXAM")


Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0d9153ff98>

In [30]:
tadpole_lb1_lb2['AGE_INT'] = tadpole_lb1_lb2['AGE_AT_EXAM'].apply(int)

In [31]:
tadpole_lb1_lb2[tadpole_lb1_lb2['ADAS13'].notnull()]\
    .groupby('AGE_INT')['ADAS13']\
    .count().plot()


Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0d8f192400>

In [32]:
tadpole_lb1_lb2[tadpole_lb1_lb2['ADAS13'].notnull()]\
    .groupby('AGE_INT')['ADAS13']\
    .mean().plot()


Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0d949e6d30>

Column datatypes

Determine which are numerical, which are categorical
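pandas can make a first pass at this split via select_dtypes (a sketch on a toy frame; note that with this dataset many columns load as object because of mixed types, so the automatic split is only a starting point):

```python
import pandas as pd

df = pd.DataFrame({
    "MMSE": [30.0, 28.0],   # numerical
    "RID": [2, 3],          # integer, though really an ID
    "DX": ["NL", "MCI"],    # categorical, loads as object
})

print(df.select_dtypes(include="number").columns.tolist())  # → ['MMSE', 'RID']
print(df.select_dtypes(include="object").columns.tolist())  # → ['DX']
```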


In [33]:
# categorical example
tadpole_lb1_lb2["DX"].value_counts()


Out[33]:
MCI                3801
NL                 2477
Dementia           1686
MCI to Dementia     346
NL to MCI            89
MCI to NL            73
Dementia to MCI      12
NL to Dementia        3
Name: DX, dtype: int64

In [34]:
y_num_cols = ["ADAS13", "Ventricles"]
y_cat_cols = ["DX"]

In [35]:
cog_tests_attributes = ["CDRSB", "ADAS11", "MMSE", "RAVLT_immediate"]
mri_measures = ['Hippocampus', 'WholeBrain', 'Entorhinal', 'MidTemp', "FDG", "AV45"]  # note: FDG/AV45 are PET summaries, repeated in pet_measures below
pet_measures = ["FDG", "AV45"]
csf_measures = ["ABETA_UPENNBIOMK9_04_19_17", "TAU_UPENNBIOMK9_04_19_17", "PTAU_UPENNBIOMK9_04_19_17"]
risk_factors = ["APOE4", "AGE"]

In [36]:
import numpy as np

def convert_float(val):
    """Coerce a value to float, mapping non-numeric entries to NaN."""
    try:
        return float(val)
    except ValueError:
        return np.nan

In [37]:
for col in csf_measures:
    tadpole_lb1_lb2[col] = tadpole_lb1_lb2[col].map(convert_float)

In [38]:
useful_numerical_attribs = cog_tests_attributes + mri_measures + pet_measures + csf_measures + ["AGE_AT_EXAM", 'AGE']
useful_numerical_attribs


Out[38]:
['CDRSB',
 'ADAS11',
 'MMSE',
 'RAVLT_immediate',
 'Hippocampus',
 'WholeBrain',
 'Entorhinal',
 'MidTemp',
 'FDG',
 'AV45',
 'FDG',
 'AV45',
 'ABETA_UPENNBIOMK9_04_19_17',
 'TAU_UPENNBIOMK9_04_19_17',
 'PTAU_UPENNBIOMK9_04_19_17',
 'AGE_AT_EXAM',
 'AGE']

In [39]:
tadpole_lb1_lb2.columns[:20]


Out[39]:
Index(['RID', 'PTID', 'VISCODE', 'SITE', 'D1', 'D2', 'LB1', 'LB2', 'COLPROT',
       'ORIGPROT', 'EXAMDATE', 'DX_bl', 'DXCHANGE', 'AGE', 'PTGENDER',
       'PTEDUCAT', 'PTETHCAT', 'PTRACCAT', 'PTMARRY', 'APOE4'],
      dtype='object')

In [40]:
useful_categorical_attribs = ['RID', 'SITE', 'DXCHANGE', 'PTGENDER',
       'PTEDUCAT', 'PTETHCAT', 'PTRACCAT', 'PTMARRY', 'APOE4']
useful_categorical_attribs


Out[40]:
['RID',
 'SITE',
 'DXCHANGE',
 'PTGENDER',
 'PTEDUCAT',
 'PTETHCAT',
 'PTRACCAT',
 'PTMARRY',
 'APOE4']

In [41]:
tadpole_lb1_lb2[useful_categorical_attribs] = tadpole_lb1_lb2[useful_categorical_attribs].astype(str)

Data Cleaning

Split the dataset into train and test

Perform a stratified shuffle split of the data, so that marital status, ethnicity and gender keep the same proportions in the train and test sets.


In [42]:
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(
        tadpole_lb1_lb2, tadpole_lb1_lb2[["PTMARRY", "PTGENDER", "PTETHCAT"]]):
    strat_train_set = tadpole_lb1_lb2.loc[train_index]
    strat_test_set = tadpole_lb1_lb2.loc[test_index]

Use the training set only now


In [43]:
tadpole = strat_train_set.copy()
print(strat_train_set.AGE)


9405     82.4
6945     76.1
3164     84.7
8984     81.8
4377     69.5
6879     75.6
5002     82.4
2726     71.8
5339     67.0
9504     61.9
11023    84.5
11340    75.2
2169     90.9
3915     67.7
61       64.1
6962     71.5
7906     74.3
1174     80.7
1468     77.7
9356     65.1
1542     89.6
5459     69.9
11638    70.8
2283     84.6
7016     78.2
8277     77.6
2648     72.5
9341     67.8
3575     62.0
8693     68.2
         ... 
7203     82.0
316      78.4
7244     72.4
4076     66.9
10350    70.9
8172     64.6
9648     76.6
3751     62.3
311      75.2
4092     66.5
2976     76.4
2030     75.0
9575     71.4
7041     76.8
3130     65.5
9920     75.1
5486     59.8
2248     73.7
11922    63.2
11802    67.8
8927     55.9
10830    88.3
6571     85.5
11817    58.8
2735     70.7
9703     77.1
6529     75.6
2754     85.9
4104     61.9
4645     68.5
Name: AGE, Length: 9609, dtype: float64

In [44]:
print(tadpole.head())
print(tadpole_lb1_lb2.head())

print(strat_train_set.keys()[:10])
tadpole = strat_train_set.drop(y_num_cols + y_cat_cols, axis=1)  # drop the target columns from the features



       RID        PTID VISCODE SITE  D1  D2  LB1  LB2 COLPROT ORIGPROT  \
9405  4392  024_S_4392     m24   24   1   0    1    0   ADNI2    ADNI2   
6945   637  012_S_0637     m54   12   1   1    0    1   ADNI1    ADNI1   
3164  2205  099_S_2205      bl   99   1   1    1    0  ADNIGO   ADNIGO   
8984  4086  099_S_4086     m18   99   1   1    1    0   ADNI2    ADNI2   
4377  4631  137_S_4631     m03  137   1   1    1    0   ADNI2    ADNI2   

       ...   KIT_UPENNBIOMK9_04_19_17 STDS_UPENNBIOMK9_04_19_17  \
9405   ...              P06-MP02-MP01           P06-MP02-MP01/2   
6945   ...                                                        
3164   ...              P06-MP02-MP01           P06-MP02-MP01/2   
8984   ...                                                        
4377   ...                                                        

     RUNDATE_UPENNBIOMK9_04_19_17  ABETA_UPENNBIOMK9_04_19_17  \
9405                   2017-01-12                      1099.0   
6945                                                      NaN   
3164                   2016-11-17                       336.6   
8984                                                      NaN   
4377                                                      NaN   

     TAU_UPENNBIOMK9_04_19_17 PTAU_UPENNBIOMK9_04_19_17  \
9405                    342.0                     35.32   
6945                      NaN                       NaN   
3164                    247.8                     23.28   
8984                      NaN                       NaN   
4377                      NaN                       NaN   

     COMMENT_UPENNBIOMK9_04_19_17 update_stamp_UPENNBIOMK9_04_19_17  \
9405                          NaN             2017-04-20 14:39:55.0   
6945                                                                  
3164                          NaN             2017-04-20 14:39:55.0   
8984                                                                  
4377                                                                  

     AGE_AT_EXAM AGE_INT  
9405   84.409582      84  
6945   80.628405      80  
3164   84.700000      84  
8984   83.314031      83  
4377   69.636893      69  

[5 rows x 1911 columns]
     RID        PTID VISCODE SITE  D1  D2  LB1  LB2 COLPROT ORIGPROT   ...    \
0      2  011_S_0002      bl   11   1   1    1    0   ADNI1    ADNI1   ...     
5618   2  011_S_0002     m06   11   1   1    1    0   ADNI1    ADNI1   ...     
5619   2  011_S_0002     m36   11   1   1    1    0   ADNI1    ADNI1   ...     
5620   2  011_S_0002     m60   11   1   1    1    0  ADNIGO    ADNI1   ...     
5621   2  011_S_0002     m66   11   1   1    1    0  ADNIGO    ADNI1   ...     

     KIT_UPENNBIOMK9_04_19_17 STDS_UPENNBIOMK9_04_19_17  \
0                                                         
5618                                                      
5619                                                      
5620                                                      
5621                                                      

     RUNDATE_UPENNBIOMK9_04_19_17  ABETA_UPENNBIOMK9_04_19_17  \
0                                                         NaN   
5618                                                      NaN   
5619                                                      NaN   
5620                                                      NaN   
5621                                                      NaN   

     TAU_UPENNBIOMK9_04_19_17 PTAU_UPENNBIOMK9_04_19_17  \
0                         NaN                       NaN   
5618                      NaN                       NaN   
5619                      NaN                       NaN   
5620                      NaN                       NaN   
5621                      NaN                       NaN   

     COMMENT_UPENNBIOMK9_04_19_17 update_stamp_UPENNBIOMK9_04_19_17  \
0                                                                     
5618                                                                  
5619                                                                  
5620                                                                  
5621                                                                  

     AGE_AT_EXAM AGE_INT  
0      74.300000      74  
5618   74.790075      74  
5619   77.267830      77  
5620   79.337645      79  
5621   79.783915      79  

[5 rows x 1911 columns]
Index(['RID', 'PTID', 'VISCODE', 'SITE', 'D1', 'D2', 'LB1', 'LB2', 'COLPROT',
       'ORIGPROT'],
      dtype='object')

In [45]:
tadpole = strat_train_set.drop(y_num_cols + y_cat_cols, axis=1)
tadpole_labels_categorical = strat_train_set[useful_categorical_attribs].copy()
tadpole_labels_continuous = strat_train_set[useful_numerical_attribs].copy()


In [46]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, Imputer, LabelBinarizer

In [47]:
class LabelBinarizerPipelineFriendly(LabelBinarizer):
    """LabelBinarizer whose fit/transform accept the (X, y) signature
    expected inside a scikit-learn Pipeline."""

    def fit(self, X, y=None):
        super(LabelBinarizerPipelineFriendly, self).fit(X)
        return self

    def transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).transform(X)

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)

In [48]:
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

In [49]:
num_pipeline = Pipeline([
    ('selector', DataFrameSelector(useful_numerical_attribs)),
    ('imputer', Imputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(['APOE4'])),
    ('label_binarizer', LabelBinarizerPipelineFriendly()),
])

In [50]:
from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

In [51]:
tadpole_prepared = full_pipeline.fit_transform(tadpole)

In [52]:
tadpole_prepared


Out[52]:
array([[ 0.2717961 ,  0.16833368, -0.58854353, ...,  0.        ,
         0.        ,  0.        ],
       [-0.34811578, -0.23822369,  0.30000493, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.06515881, -0.50926193,  0.30000493, ...,  0.        ,
         0.        ,  0.        ],
       ..., 
       [-0.14147848,  0.16833368,  0.59618775, ...,  1.        ,
         0.        ,  0.        ],
       [-0.34811578, -0.50926193, -0.29236071, ...,  0.        ,
         1.        ,  0.        ],
       [-0.76139036, -0.78030018, -0.29236071, ...,  1.        ,
         0.        ,  0.        ]])

Pipeline for generating, evaluating and submitting a leaderboard submission

Generate a simple forecast from the training data and save it as TADPOLE_Submission_Pycon_TeamName1.csv


In [53]:
!python3 TADPOLE_SimpleForecast1.py


Generating forecast ...
Constructing the output spreadsheet ../data/TADPOLE_Submission_Pycon_TeamName1.csv ...

Replace TeamName1 with your team name and submission index (no underscores allowed), e.g. TADPOLE_Submission_Pycon_TeamAwesome3.csv


In [54]:
team_name = "TeamFrank1"  # add your own team name here

In [55]:
import os
oldFile = '../data/TADPOLE_Submission_Pycon_TeamName1.csv'
newFile = '../data/TADPOLE_Submission_Pycon_%s.csv' % team_name
os.system('mv %s %s' % (oldFile, newFile))  # os.rename(oldFile, newFile) is the portable equivalent


Out[55]:
0

Evaluate your forecasts from TADPOLE_Submission_Pycon_&lt;TeamName&gt;.csv against TADPOLE_LB4_dummy.csv (the held-out dataset) using the evaluation function


In [59]:
cmd = 'python3 evalOneSubmission.py --leaderboard --d4File %s --forecastFile %s' % ("../data/TADPOLE_LB4_dummy.csv", newFile)
print(cmd)
os.system(cmd)
# Check the console where you launched jupyter from; it should show the outputs.
# Otherwise, run the command from the command line.


python3 evalOneSubmission.py --leaderboard --d4File ../data/TADPOLE_LB4_dummy.csv --forecastFile ../data/TADPOLE_Submission_Pycon_TeamFrank1.csv
Out[59]:
0

In [57]:
# Submit the renamed TADPOLE_Submission_Pycon_<TeamName>.csv to the TADPOLE website via the Submit page

In [ ]: