In [172]:
import pandas as pd
import numpy as np

# The disorder columns were extracted using split.py in the /utils folder. There are 2796 rows left.
# All rows that have NaNs are dropped. This file contains only disorders along with the corresponding patient IDs.
# The rows of the dataframe extracted from disorders.csv are indexed by Patient_ID.
df_disorders = pd.read_csv('disorders.csv', index_col=0)

# Extract patient IDs
patient_ids = df_disorders.index.tolist()

Descriptive

  • What do the features in the vector indicate?

Current guess:

This data is likely from a study of people with ADHD, as it records brain activity during a concentration task. By comparing participants with a single disorder to those with none, we may be able to separate the brain areas associated with different types of ADHD from areas activated due to other disorders. In addition, different kinds of ADHD may involve different locations in the brain during the concentration task.

  • Why are certain types of baseline values not applicable to certain individuals (NaN values)?

It is impossible to know for sure why certain baseline values are not applicable to certain individuals, but we can speculate on a few possibilities. Perhaps baseline data was not recorded, either by design or by error (e.g. a measurement instrument was not calibrated appropriately). Alternatively, the baseline data may not have been collected originally and only later found necessary for comparison against the concentration values. The cost of measuring brain activity could also have factored into the decision to record baseline values for only some of the participants. Nonetheless, as stated above, we have eliminated the missing values from our training data, because without a baseline value the concentration value is not as meaningful. If we find consistent trends within the data, we may be able to estimate the baseline values for the records with missing information.
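If we do attempt such estimation later, a minimal first-pass sketch could use simple column-mean imputation (rather than any trend-based model). The column names below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical example: a small baseline table with missing entries
# (column names are invented for illustration only)
df_base_demo = pd.DataFrame({'baseline_a': [1.0, np.nan, 3.0],
                             'baseline_b': [4.0, 5.0, np.nan]})

# First-pass estimate: fill each missing value with its column mean
df_imputed = df_base_demo.fillna(df_base_demo.mean())
```

A trend-based or model-based imputation would replace the `fillna` step once we understand the features better.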

  • What kind of labels are we going to extract? What will be yi?

For starters, a simple learning goal would be to separate healthy participants from unhealthy ones. Here we identify those who are diagnosed with no disorders as the healthy participants and assign such patient records the label 1; the remaining records are assigned the label 0.

Therefore to answer the question above, our labels at the initial stage are binary, indicating whether a participant is healthy (has no disorders) or not.


In [117]:
# Ignore different types of ADHD for now
df_disorder_results = df_disorders.drop('ADHD_Type', inplace=False, axis=1)

# Find records that have zero values across all the disorder columns
# Extract a list of Patient_IDs corresponding to healthy participants
healthy_ids = df_disorder_results[(df_disorder_results == 0).all(axis=1)].index.tolist()

print 'There are %d healthy participants.\n' % len(healthy_ids)
print 'Their Patient_IDs are', healthy_ids
print '\nFor the records above, label y_i=1 (healthy), for all the other records, y_i=0 (mentally disordered).'


There are 50 healthy participants.

Their Patient_IDs are [792951749, 1352837869, 351767390, 385736291, 1073135719, 2147375616, 1063681694, 1740434093, 643203234, 1412312752, 787197016, 1397640254, 2006204394, 882339396, 1733021018, 2141528074, 2147375576, 898779271, 457650554, 1089994190, 874166793, 921301480, 708633347, 132445998, 1392520338, 523669933, 95466200, 2147375992, 1899420078, 1154083442, 1136199638, 2147375509, 704315387, 261535778, 1733338065, 2146819728, 416826989, 1274042226, 1931808785, 605252192, 1192925290, 1886482913, 230205983, 54081174, 2024686590, 2147375794, 90309090, 1627824129, 1201846788, 129776326]

For the records above, label y_i=1 (healthy), for all the other records, y_i=0 (mentally disordered).

In [118]:
# Now construct the label vector y
y = pd.Series([0] * len(df_disorders), index=patient_ids)
y[healthy_ids] = 1

print 'Finished constructing label vector for healthy/unhealthy.'


Finished constructing label vector for healthy/unhealthy.

Additional prediction goals (for the future):

-whether certain disorders are correlated with certain baseline values (or with the delta values from baseline to concentration)

-whether different parts of the brain are affected by different kinds of ADHD

Exploratory

  • What is the sole metric that can be used to separate healthy people from unhealthy people?

As shown in the label construction procedure above, our sole metric for identifying a person as healthy is that the person's record (row) has zero values across all disorder columns.

  • What is the range of values nominal features can take?

In [176]:
# Read full dataset
df_all = pd.read_csv('Data_Adults_1.csv', index_col=0)

# Extract non-numerical features
non_num_keys = df_all.select_dtypes(exclude=[np.number]).columns.tolist()

print 'Nominal features are', non_num_keys


Nominal features are ['RaceName', 'Age_Group', 'STUDY_NAME', 'BSC_Respondent', 'ADHD_Type', 'locationname', 'LDS_Respondent', 'GSC_Respondent', 'group_name', 'Gendername']

In [184]:
print 'Unique values for the nominal features:\n'
for key in non_num_keys:
    print np.unique(df_all[key])


Unique values for the nominal features:

['African American          ' 'Arab/Middle Eastern       '
 'Asian                     ' 'Asian/Caucasian           '
 'Caucasian                 ' 'Caucasian/African American'
 'Caucasian/Hispanic        ' 'Caucasian/Native American '
 'Declined                  ' 'Hispanic                  '
 'Hispanic/African American ' 'Hispanic/Native American  '
 'Indian                    ' 'Native American/Eskimo    '
 'Other                     ' 'Unknown                   ']
['Adult    ' 'Geriatric' 'Pediatric']
['BigLove']
['      ' 'Other ' 'Parent' 'Self  ' 'Spouse']
['                  ' 'Asymptomatic      ' 'Combined Type     '
 'Hyperactive       ' 'Inattentive       ' 'Mostly Impulsive  '
 'Mostly Inattentive' 'Undetermined      ']
['Atlanta      ' 'Bellevue     ' 'Brisbane     ' 'Fairfield    '
 'Mind Matters ' 'New York     ' 'Newport Beach' 'Not Specified'
 'Reston       ' 'Sierra Tucson' 'Tacoma       ']
['      ' 'Mother' 'Other ' 'Parent' 'Self  ' 'Spouse']
['      ' 'Mother' 'Other ' 'Parent' 'Self  ' 'Spouse']
['Adults        ' 'Healthy Brains']
['Female ' 'Male   ' 'Unknown']

In order not to confuse our future model, for certain features we avoid using a single scalar value to denote the different categories a variable can take; instead, we encode them as small one-hot vectors.

Note that we do not take null values into consideration for now.

Range of values for the nominal features:

1) RaceName: This can be encoded as a one-hot vector of length 16. For example, 'African American' is encoded as [1]+[0]*15 (Python notation): the value at index 0 is set to 1 and the remaining 15 elements are 0. 'Asian' takes the value 1 at index 2, 'Unknown' takes the value 1 at index 15, etc.

2) Age_Group: This can be represented on an ordinal scale: Pediatric=1, Adult=2, Geriatric=3.

3) STUDY_NAME: Only one value, so this column can be removed from the dataset.

4) BSC_Respondent, ADHD_Type, locationname, LDS_Respondent, and GSC_Respondent can be encoded the same way as RaceName, except that their vector lengths will be 4, 7, 11, 5, and 5 respectively (excluding the blank entries).

5) group_name: We are not sure whether these are just arbitrary group names or indicate control versus experimental groups; this has to be decided once we see the data documentation. In the former case the column can be ignored; in the latter case we will represent it with a binary (0-1) indicator.

6) Gendername: Female=0, Male=1, Unknown=0.5 (tentatively).
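The one-hot scheme above can be sketched with `pd.get_dummies` on a few hand-picked RaceName values. Note the fixed-width padding in the raw strings, which should be stripped before encoding:

```python
import pandas as pd

# A few sample RaceName values, space-padded as in the raw data
race = pd.Series(['African American          ',
                  'Asian                     ',
                  'Unknown                   '])

# Strip the fixed-width padding, then expand into one-hot columns
one_hot = pd.get_dummies(race.str.strip())
```

On the full column this yields one indicator column per category, in alphabetical order, matching the index positions described above.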

  • Are the features correlated?

For feature correlation, since we do not yet have complete information about all column headers, only the correlations within the baseline values and within the concentration values are analyzed.


In [131]:
# Get baseline and concentration data
df_base = pd.read_csv('baseline.csv', index_col=0)
df_concen = pd.read_csv('concentration.csv', index_col=0)

# Use numpy matrix format (numerical)
df_base_vals = df_base.values
df_concen_vals = df_concen.values

In [169]:
def check_perfect_corr(coeff):
    # Fill diagonal with 0 (not comparing to oneself)
    np.fill_diagonal(coeff, 0)
    # Perfect correlation: 1 or -1
    return coeff.max()==1 or coeff.min()==-1

# Compute Pearson product-moment correlation coefficients
# and check for perfect correlation row-wise and column-wise
def pearson_corr_test(x):
    # rowvar=1: each row is treated as a variable (row-wise)
    # rowvar=0: each column is treated as a variable (column-wise)
    row_coeff = np.corrcoef(x, rowvar=1)
    col_coeff = np.corrcoef(x, rowvar=0)
    
    # Check for perfect correlation row-wise
    row_corr = check_perfect_corr(row_coeff)
    # Check for perfect correlation column-wise
    col_corr = check_perfect_corr(col_coeff)
    
    # Perfect correlation "exists" if it appears in either direction
    return row_corr or col_corr
    
print 'Perfect correlation exists in baselines?', pearson_corr_test(df_base_vals)
print 'Perfect correlation exists in concentrations?', pearson_corr_test(df_concen_vals)


Perfect correlation exists in baselines? False
Perfect correlation exists in concentrations? False

Therefore, within baseline data and concentration data, perfect correlation does not exist.
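One caveat: comparing `coeff.max()` exactly against 1 can miss coefficients that equal 1 only up to floating-point rounding. A tolerance-based variant of the check, sketched here on synthetic data rather than the project files, would be:

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(100)
# Three synthetic features: the second is (almost exactly) a linear
# function of the first; the third is independent noise
data = np.column_stack([x, 2.0 * x + 1e-9 * rng.rand(100), rng.rand(100)])

coeff = np.corrcoef(data, rowvar=0)
# Ignore the diagonal (each variable is trivially correlated with itself)
np.fill_diagonal(coeff, 0)
# Flag any off-diagonal coefficient equal to 1 or -1 within tolerance
has_near_perfect = bool(np.isclose(np.abs(coeff), 1.0).any())
```

Here the exact `== 1` test would likely return False for the first two features, while the tolerance-based test correctly flags them as (near-)perfectly correlated.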

  • Are we able to identify outliers at this point?

Currently we are unable to identify outliers for the following reasons:

  1. We do not have the documentation to understand all the column headers (feature names).
  2. We do not have enough information about the null values, especially when more than half of our samples and most of the columns contain null values.