Monday, 27th October 2014

These are the things I want to do today:

  • Deal with duplicates
  • Display phenotype profile for a group of subjects
  • Make an automated panel of phenotype profiles for a number of subject clusters

In [1]:
# Imports
import re
import warnings
import numpy as np
import pandas as pd
import brainbox as bb
import matplotlib.pyplot as plt
from collections import Counter
from operator import itemgetter as ig

In [2]:
# Grab image data
in_path = '/data1/abide/Out/Remote/some_failed/out'
metric = 'stability_maps'
file_dict = bb.fileOps.grab_files(in_path, '.nii.gz', sub=metric)
# Grab pheno data
pheno_path = '/home/surchs/Project/abide/pheno/pheno_full.csv'
pheno = pd.read_csv(pheno_path)


I will be pulling files from /data1/abide/Out/Remote/some_failed/out/stability_maps

In [3]:
# Get the real subject ids from the file names
names = file_dict['sub_name']
# Get each subject's session, id, and run
s_ses = np.array([np.int64(re.search(r'(?<=session_)\d{1}', sub_id).group()) for sub_id in names])
s_idn = np.array([np.int64(re.search(r'(?<=\d{2})\d{5}', sub_id).group()) for sub_id in names])
s_run = np.array([np.int64(re.search(r'(?<=_run)\d{1}', sub_id).group()) for sub_id in names])
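
As a quick sanity check, here are the three patterns applied to one of the file names that shows up in the output further down (this snippet is just for illustration, it is not part of the pipeline):

name = 'OHSU_0050142_session_1_run1_stability_maps'
print(re.search(r'(?<=session_)\d{1}', name).group())  # session -> 1
print(re.search(r'(?<=\d{2})\d{5}', name).group())     # id      -> 50142
print(re.search(r'(?<=_run)\d{1}', name).group())      # run     -> 1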

For duplicated subjects, keep either the entry with the lowest session and run number or the entry with a specific, desired run.

The problem here is dealing with advanced indexing across several layers of indices. Advanced (fancy) indexing returns a copy of the data, not a view into it. Therefore something like data[idx1][idx2][idx3] = 10 does not change anything, despite not throwing an error.
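
A minimal toy example of that silent failure (the array and masks here are made up for illustration):

data = np.arange(10)
idx1 = data > 3                # boolean mask, fancy indexing -> copy
idx2 = np.array([0, 1])        # further integer index into the selection
data[idx1][idx2] = 99          # assigns into a temporary copy...
print(data)                    # ...so data itself is unchanged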

The trick here seems to be to replace boolean indexing with integer indexing via numpy.flatnonzero. This returns a flat array of integer indices that can itself be sub-indexed (which again yields a copy). Since we don't want to modify that index array but only use it to slice a second object, we don't care that it is just a copy. So here this would be:

val_list = ['name_a', 'name_b', 'name_c']    # the list of values to be sliced
bool_idx = some_bool_array                   # boolean mask over val_list
flat_idx = np.flatnonzero(bool_idx)          # integer positions of the True entries
val_list[flat_idx[idx2][idx3]]               # idx2, idx3 are further (boolean) indices

Now there may be one more problem here. If flat_idx[idx2][idx3] returns more than one value, we cannot use it to slice a list: a plain list can only be indexed with a single integer, not with an array of integers. So we need to check the number of indices when dealing with lists. A solution in this case is to iterate through the indices:

vals = [names[x] for x in flat_idx[idx2][idx3]]
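
An alternative (not used here) would be to wrap the list in a numpy array once, so that integer-array indexing works directly; a sketch:

names_arr = np.array(names)                     # string array instead of a list
vals = list(names_arr[flat_idx[idx2][idx3]])    # fancy indexing now returns all matches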

Here, I implemented a routine that either lets the user pick the session and run to use, or falls back to the lowest session and run number.


In [4]:
# Count the number of occurrences of each subject id
dupl = [item for item, count in Counter(s_idn).iteritems() if count > 1]
run = 1
for val in dupl:
    case_idx = s_idn == val
    flat_idx = np.flatnonzero(case_idx)
    # List comprehension needed because names is a list, not an array
    cases = [names[x] for x in flat_idx]
    # No iterators needed for arrays
    sessions = s_ses[case_idx]
    runs = s_run[case_idx]
    # Find the lowest session, then either the lowest run (commented out below)
    # or the specifically requested run
    min_ses = np.min(sessions)
    ses_idx = sessions == min_ses
    min_run = np.min(runs[ses_idx])
    # run_idx = runs[ses_idx] == min_run
    run_idx = runs[ses_idx] == run
    if not run_idx.any():
        print('\n    WARNING:Subject {} does not have the sought run\n'.format(val))
        continue
    
    case = names[flat_idx[ses_idx][run_idx][0]]
    print('We take {}'.format(case))


We take OHSU_0050142_session_1_run1_stability_maps
We take OHSU_0050146_session_1_run1_stability_maps
We take OHSU_0050147_session_1_run1_stability_maps
We take OHSU_0050148_session_1_run1_stability_maps
We take OHSU_0050152_session_1_run1_stability_maps
We take OHSU_0050153_session_1_run1_stability_maps
We take OHSU_0050156_session_1_run1_stability_maps

    WARNING:Subject 50159 does not have the sought run

We take OHSU_0050160_session_1_run1_stability_maps
We take OHSU_0050161_session_1_run1_stability_maps
We take OHSU_0050162_session_1_run1_stability_maps
We take OHSU_0050167_session_1_run1_stability_maps
We take OHSU_0050168_session_1_run1_stability_maps
We take OHSU_0050169_session_1_run1_stability_maps
We take OHSU_0050171_session_1_run1_stability_maps
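
If this selection needs to be reused later (e.g. to actually filter the file list), the loop above could be folded into a small helper; a rough sketch (the function name is made up):

def pick_runs(names, s_idn, s_ses, s_run, run=1):
    # Return one file name per duplicated subject id, preferring the
    # requested run within the lowest session (mirrors the loop above)
    keep = []
    dupl = [item for item, count in Counter(s_idn).iteritems() if count > 1]
    for val in dupl:
        flat_idx = np.flatnonzero(s_idn == val)
        sessions = s_ses[flat_idx]
        runs = s_run[flat_idx]
        ses_idx = sessions == np.min(sessions)
        run_idx = runs[ses_idx] == run
        if not run_idx.any():
            continue
        keep.append(names[flat_idx[ses_idx][run_idx][0]])
    return keep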

Show phenotype profile for a number of subjects


In [5]:
# Grab pheno data
pheno_path = '/home/surchs/Project/abide/pheno/pheno_full.csv'
pheno = pd.read_csv(pheno_path)

In [6]:
# Take a look at the pheno data
pheno.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 1112 entries, 0 to 1111
Data columns (total 74 columns):
SITE_ID                            1112 non-null object
SUB_ID                             1112 non-null int64
DX_GROUP                           1112 non-null int64
DSM_IV_TR                          1112 non-null int64
AGE_AT_SCAN                        1112 non-null float64
SEX                                1112 non-null int64
HANDEDNESS_CATEGORY                797 non-null object
HANDEDNESS_SCORES                  370 non-null float64
FIQ                                1077 non-null float64
VIQ                                935 non-null float64
PIQ                                953 non-null float64
FIQ_TEST_TYPE                      947 non-null object
VIQ_TEST_TYPE                      834 non-null object
PIQ_TEST_TYPE                      853 non-null object
ADI_R_SOCIAL_TOTAL_A               412 non-null float64
ADI_R_VERBAL_TOTAL_BV              412 non-null float64
ADI_RRB_TOTAL_C                    412 non-null float64
ADI_R_ONSET_TOTAL_D                331 non-null float64
ADI_R_RSRCH_RELIABLE               411 non-null float64
ADOS_MODULE                        535 non-null float64
ADOS_TOTAL                         442 non-null float64
ADOS_COMM                          417 non-null float64
ADOS_SOCIAL                        417 non-null float64
ADOS_STEREO_BEHAV                  378 non-null float64
ADOS_RSRCH_RELIABLE                380 non-null float64
ADOS_GOTHAM_SOCAFFECT              317 non-null float64
ADOS_GOTHAM_RRB                    317 non-null float64
ADOS_GOTHAM_TOTAL                  317 non-null float64
ADOS_GOTHAM_SEVERITY               317 non-null float64
SRS_VERSION                        269 non-null float64
SRS_RAW_TOTAL                      405 non-null float64
SRS_AWARENESS                      64 non-null float64
SRS_COGNITION                      64 non-null float64
SRS_COMMUNICATION                  64 non-null float64
SRS_MOTIVATION                     64 non-null float64
SRS_MANNERISMS                     64 non-null float64
SCQ_TOTAL                          143 non-null float64
AQ_TOTAL                           58 non-null float64
COMORBIDITY                        72 non-null object
CURRENT_MED_STATUS                 817 non-null object
MEDICATION_NAME                    157 non-null object
OFF_STIMULANTS_AT_SCAN             141 non-null float64
VINELAND_RECEPTIVE_V_SCALED        184 non-null float64
VINELAND_EXPRESSIVE_V_SCALED       184 non-null float64
VINELAND_WRITTEN_V_SCALED          184 non-null float64
VINELAND_COMMUNICATION_STANDARD    184 non-null float64
VINELAND_PERSONAL_V_SCALED         184 non-null float64
VINELAND_DOMESTIC_V_SCALED         184 non-null float64
VINELAND_COMMUNITY_V_SCALED        184 non-null float64
VINELAND_DAILYLVNG_STANDARD        184 non-null float64
VINELAND_INTERPERSONAL_V_SCALED    184 non-null float64
VINELAND_PLAY_V_SCALED             184 non-null float64
VINELAND_COPING_V_SCALED           184 non-null float64
VINELAND_SOCIAL_STANDARD           184 non-null float64
VINELAND_SUM_SCORES                184 non-null float64
VINELAND_ABC_STANDARD              184 non-null float64
VINELAND_INFORMANT                 184 non-null float64
WISC_IV_VCI                        55 non-null float64
WISC_IV_PRI                        55 non-null float64
WISC_IV_WMI                        55 non-null float64
WISC_IV_PSI                        55 non-null float64
WISC_IV_SIM_SCALED                 55 non-null float64
WISC_IV_VOCAB_SCALED               55 non-null float64
WISC_IV_INFO_SCALED                55 non-null float64
WISC_IV_BLK_DSN_SCALED             55 non-null float64
WISC_IV_PIC_CON_SCALED             55 non-null float64
WISC_IV_MATRIX_SCALED              55 non-null float64
WISC_IV_DIGIT_SPAN_SCALED          55 non-null float64
WISC_IV_LET_NUM_SCALED             55 non-null float64
WISC_IV_CODING_SCALED              55 non-null float64
WISC_IV_SYM_SCALED                 55 non-null float64
EYE_STATUS_AT_SCAN                 1112 non-null int64
AGE_AT_MPRAGE                      101 non-null float64
BMI                                214 non-null float64
dtypes: float64(61), int64(5), object(8)

In [7]:
# Make some random subject clusters
n_subs = len(pheno)
n_clus = 10

In [8]:
part = np.random.randint(1, high=n_clus+1, size=n_subs)

In [9]:
# Get one cluster and fetch specified columns
clus = 2
cols = ['DX_GROUP', 'AGE_AT_SCAN', 'SEX', 'EYE_STATUS_AT_SCAN']

In [10]:
cl_id = np.where(part == clus)
cl_pheno = pheno.ix[cl_id]
cl_cols = cl_pheno[cols]

In [11]:
# Display the profile in a little window
dump = cl_cols.hist()
# Kaboooom


Now do this for a couple of clusters and plot them side by side

I am simply amazed by how cool pandas is. Obviously there are better ways to visualize the phenotype profile of a cluster than to spam histograms. But this is just so fast and clean, and I think it is a great way to get a feeling for what the clusters look like. The examples here are obviously random clusters, so they don't mean anything. But if this works, I will roll this out to my real data (provided my memory does not run out).


In [12]:
dist = 4
y = 2.0
x = n_clus / y
f = plt.figure(figsize=(y*dist, x*dist))

for cl_id in range(1, n_clus+1):
    cl_num = np.where(part == cl_id)
    cl_pheno = pheno.ix[cl_num]
    cl_cols = cl_pheno[cols]
    # Subplot indices start at 1, so use cl_id directly
    ax = f.add_subplot(x, y, cl_id)
    cl_cols.hist(ax=ax)
    ax.set_title('cluster {}'.format(cl_id))


/home/surchs/Enthought/Canopy_64bit/User/lib/python2.7/site-packages/pandas/tools/plotting.py:2954: UserWarning: To output multiple subplots, the figure containing the passed axes is being cleared
  "is being cleared", UserWarning)

OK, so pandas' .hist() function does not play nicely with subplots when it is asked to plot more than one variable. This is not a big problem. We simply use native pyplot functions to generate the figures. In fact, instead of plotting 4 or more graphs per network, we will just have one per metric and then combine the different clusters together.

However, this solution only works for a limited number of clusters (around 10). Potentially a better idea would be to have radar plots as explained in this example.
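
To sketch what such a radar plot could look like with matplotlib's polar axes (the values below are made up for illustration and would be rescaled cluster means in practice):

labels = ['DX_GROUP', 'AGE_AT_SCAN', 'SEX', 'EYE_STATUS_AT_SCAN']
values = np.array([0.6, 0.4, 0.8, 0.5])          # hypothetical rescaled cluster means

# Close the polygon by repeating the first point
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)
angles = np.concatenate((angles, [angles[0]]))
values = np.concatenate((values, [values[0]]))

ax = plt.subplot(111, polar=True)
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)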

For now I will not make a multi-view of these clusters. I believe that this does not need to be automated at this point and that, given the limited number of pheno variables, I will probably choose what to display against what and with which visualization technique.


In [18]:
# Write a function to plot one pheno variable (or several) across the clusters
def plot_var(part, pheno, var_name):
    """
    Plot a phenotype profile per cluster.
    
    var_name must be a list of column names: a single column is shown as a
    histogram, several columns as a bar plot of their cluster means.
    """
    n_clus = np.max(part)
    dist = 2
    x = 2.0
    y = n_clus / x
    # Prepare figure
    f = plt.figure(figsize=(y*dist, 4*x*dist))
    
    for cl_id in range(1, n_clus+1):
        ax_t = f.add_subplot(y, x, cl_id)
        cl_memb = np.where(part == cl_id)
        cl_subs = pheno.ix[cl_memb]
        cl_phen = cl_subs[var_name]
        if cl_phen.shape[1] == 1:
            # Only one variable: plain histogram
            n, bins, patches = ax_t.hist(cl_phen.values)
            ax_t.set_title(var_name[0])
        else:
            # More than one variable: bar plot of the cluster means
            cl_phen.mean().plot(kind='bar', stacked=True, ax=ax_t)

In [19]:
var_name = ['AGE_AT_SCAN']
# var_name = ['DX_GROUP', 'AGE_AT_SCAN', 'SEX', 'EYE_STATUS_AT_SCAN']
plot_var(part, pheno, var_name)


Ok, this makes very limited sense for more than one variable. If we want to plot them together, we will need to rescale the values. Maybe I can come up with some nicer way of plotting several phenotype variables for the clusters. For now, this should do.
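
One straightforward option for that rescaling (a sketch, not yet part of the notebook) is to min-max scale each column to [0, 1] before taking the cluster means, so the bars share one axis:

cols = ['DX_GROUP', 'AGE_AT_SCAN', 'SEX', 'EYE_STATUS_AT_SCAN']
# Min-max scale each column to [0, 1]
scaled = (pheno[cols] - pheno[cols].min()) / (pheno[cols].max() - pheno[cols].min())
# e.g. mean profile of cluster 2, now comparable across variables
scaled[part == 2].mean().plot(kind='bar')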