These are the things I want to do today:
In [1]:
# Imports
import re
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import brainbox as bb
from collections import Counter
from operator import itemgetter as ig
In [2]:
# Grab image data
in_path = '/data1/abide/Out/Remote/some_failed/out'
metric = 'stability_maps'
file_dict = bb.fileOps.grab_files(in_path, '.nii.gz', sub=metric)
# Grab pheno data
pheno_path = '/home/surchs/Project/abide/pheno/pheno_full.csv'
pheno = pd.read_csv(pheno_path)
In [3]:
# Get the real subject ids from the file names
names = file_dict['sub_name']
# Extract each subject's session number, id and run number
s_ses = np.array([int(re.search(r'(?<=session_)\d{1}', sub_id).group()) for sub_id in names])
s_idn = np.array([int(re.search(r'(?<=\d{2})\d{5}', sub_id).group()) for sub_id in names])
s_run = np.array([int(re.search(r'(?<=_run)\d{1}', sub_id).group()) for sub_id in names])
The problem here is dealing with advanced indexing across several layers of indices. Advanced indexing returns a copy of the object, not a view into it. Therefore something like data[idx1][idx2][idx3] = 10
does not change anything, despite not throwing an error.
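As a quick illustration (a minimal sketch with made-up toy arrays, not the ABIDE data), chained fancy indexing assigns into a temporary copy and the original array is left untouched:
data = np.arange(10)
idx1 = data > 2          # boolean mask, advanced indexing -> copy
idx2 = np.array([0, 1])  # integer indexing on that copy
data[idx1][idx2] = 10    # silently writes into the temporary copy
print(data)              # unchanged: [0 1 2 3 4 5 6 7 8 9]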
The trick here seems to be to replace boolean indexing with integer indexing by using numpy.flatnonzero. This returns a short array of integer indices that can itself be sub-indexed. Since we don't modify the result but only use it to slice a second object, we don't care that it's just a copy. So here this would be:
val_list = [...]            # list of values to be sliced
bool_idx = some_bool_array  # boolean mask over val_list
flat_idx = np.flatnonzero(bool_idx)
val_list[flat_idx[idx2][idx3]]
Now there may be one more problem here. If flat_idx[idx2][idx3]
returns more than one value, we cannot use it to slice a list: a plain list can only be indexed with a single integer, not an index array. So if we are dealing with lists, we need to check the number of indices. A solution in this case is to iterate over the indices:
vals = [names[x] for x in flat_idx[idx2][idx3]]
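Putting the pieces together in a minimal, self-contained sketch (val_list, mask_a and mask_b are made-up names that stand in for the subject names and the session/run masks used below):
val_list = ['sub_a', 'sub_b', 'sub_c', 'sub_d', 'sub_e']
bool_idx = np.array([True, False, True, True, False])
flat_idx = np.flatnonzero(bool_idx)      # array([0, 2, 3]), integer indices
mask_a = np.array([True, True, False])   # further sub-selection, e.g. by session
mask_b = np.array([False, True])         # e.g. by run
picked = flat_idx[mask_a][mask_b]        # still integer indices into val_list
vals = [val_list[x] for x in picked]     # iterate because val_list is a list
print(vals)                              # ['sub_c']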
Here, I implemented a routine that either lets the user select the run and session they want to use, or falls back to the lowest session and run number.
In [4]:
# Count the number of occurrences of each subject id
dupl = [item for item, count in Counter(s_idn).items() if count > 1]
run = 1
for val in dupl:
    case_idx = s_idn == val
    flat_idx = np.flatnonzero(case_idx)
    # Iterator needed because names is a list, not an array
    cases = [names[x] for x in flat_idx]
    # No iterators needed for arrays
    sessions = s_ses[case_idx]
    runs = s_run[case_idx]
    # Find the case with the lowest session and then run number
    min_ses = np.min(sessions)
    ses_idx = sessions == min_ses
    min_run = np.min(runs[ses_idx])
    # run_idx = runs[ses_idx] == min_run
    run_idx = runs[ses_idx] == run
    if not run_idx.any():
        print('\n WARNING: Subject {} does not have the sought run\n'.format(val))
        continue
    # flat_idx[ses_idx][run_idx] holds the matching integer index (take the first)
    case = names[flat_idx[ses_idx][run_idx][0]]
    print('We take {}'.format(case))
In [5]:
# Grab pheno data
pheno_path = '/home/surchs/Project/abide/pheno/pheno_full.csv'
pheno = pd.read_csv(pheno_path)
In [6]:
# Take a look at the pheno data
pheno.info()
In [7]:
# Make some random subject clusters
n_subs = len(pheno)
n_clus = 10
In [8]:
part = np.random.randint(1, high=n_clus+1, size=n_subs)
In [9]:
# Get one cluster and fetch specified columns
clus = 2
cols = ['DX_GROUP', 'AGE_AT_SCAN', 'SEX', 'EYE_STATUS_AT_SCAN']
In [10]:
cl_id = np.flatnonzero(part == clus)
cl_pheno = pheno.iloc[cl_id]
cl_cols = cl_pheno[cols]
In [11]:
# Display the profile in a little window
dump = cl_cols.hist()
# Kaboooom
I am simply amazed by how cool pandas is. Obviously there are better ways to visualize the phenotype profile of a cluster than to spam histograms, but this is so fast and clean that I think it is a great way to get a feeling for what the clusters look like. The examples here are obviously random clusters, so they don't mean anything. But if this works, I will roll it out to my real data (provided my memory does not run out).
In [12]:
dist = 4
y = 2.0
x = n_clus / y
f = plt.figure(figsize=(y*dist, x*dist))
for cl_id in range(1, n_clus+1):
    cl_num = np.flatnonzero(part == cl_id)
    cl_pheno = pheno.iloc[cl_num]
    cl_cols = cl_pheno[cols]
    # Subplot indices are 1-based
    ax = f.add_subplot(int(x), int(y), cl_id)
    cl_cols.hist(ax=ax)
    ax.set_title('cluster {}'.format(cl_id))
Ok, so pandas' .hist() function does not play nicely with subplots if it is asked to plot more than one variable. This is not a big problem; we can simply use native pyplot functions to generate the figures. In fact, instead of plotting 4 or more graphs per network, we will just have one per metric and then combine the different clusters together.
However, this solution only works for a limited number of clusters (around 10). A potentially better idea would be radar plots, as explained in this example.
For now I will not make a multi-view of these clusters. I believe this does not need to be automated at this point and that, given the limited number of pheno variables, I will probably choose what to display against what and with which visualization technique.
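As a minimal sketch of what "one figure per metric, clusters combined" could look like with native pyplot (using the part and pheno objects from above; the choice of AGE_AT_SCAN is just an example, not a decision):
# One figure per metric, all clusters side by side
var = 'AGE_AT_SCAN'
groups = [pheno[var].values[part == cl] for cl in range(1, n_clus + 1)]
f, ax = plt.subplots(figsize=(8, 4))
ax.boxplot(groups, labels=[str(cl) for cl in range(1, n_clus + 1)])
ax.set_xlabel('cluster')
ax.set_ylabel(var)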
In [18]:
# Write a function to plot one variable across the clusters
def plot_var(part, pheno, var_name):
    """
    Plot things nicely,
    so far we can hist for single and bar for multiple variables
    """
    n_clus = np.max(part)
    dist = 2
    x = 2.0
    y = n_clus / x
    # Prepare figure
    f = plt.figure(figsize=(y*dist, 4*x*dist))
    for cl_id in range(1, n_clus+1):
        ax_t = f.add_subplot(int(y), int(x), cl_id)
        cl_memb = np.flatnonzero(part == cl_id)
        cl_subs = pheno.iloc[cl_memb]
        cl_phen = cl_subs[var_name]
        if cl_phen.shape[1] == 1:
            # Only one variable: plot a histogram
            n, bins, patches = ax_t.hist(cl_phen.values)
            ax_t.set_title(var_name[0])
        else:
            # More than one variable: plot the cluster means as bars
            cl_phen.mean().plot(kind='bar', stacked=True, ax=ax_t)
In [19]:
var_name = ['AGE_AT_SCAN']
# var_name = ['DX_GROUP', 'AGE_AT_SCAN', 'SEX', 'EYE_STATUS_AT_SCAN']
plot_var(part, pheno, var_name)
Ok, this makes very limited sense for more than one variable. If we want to plot them together, we will need to rescale the values. Maybe I can come up with some nicer way of plotting several phenotype variables for the clusters. For now, this should do.
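One possible direction for the rescaling, as a minimal sketch (z-scoring each column over the whole sample before taking per-cluster means is just one way to put the variables on a comparable scale, not the final choice; cols is the same column list used above):
# Z-score each pheno variable over the whole sample, then average per cluster
cols = ['DX_GROUP', 'AGE_AT_SCAN', 'SEX', 'EYE_STATUS_AT_SCAN']
scaled = (pheno[cols] - pheno[cols].mean()) / pheno[cols].std()
profile = pd.DataFrame({cl: scaled.iloc[np.flatnonzero(part == cl)].mean()
                        for cl in range(1, n_clus + 1)})
# Rows are clusters, columns are variables; grouped bars per cluster
profile.T.plot(kind='bar', figsize=(10, 4))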