Pipeline

Cleaning confounds

We first created the confound matrix following Smith et al. (2015). The confound variables are motion (Jenkinson), sex, and age. We also added the squared confounds to account for potentially nonlinear effects of these confounds.
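As a rough illustration, such a matrix could be assembled as below. build_confound_mat is a hypothetical helper (in this pipeline the equivalent step happens inside src.models.clean_confound), and the demeaning convention is an assumption:

import numpy as np

def build_confound_mat(motion, sex, age):
    # Hypothetical helper: stack the raw confounds column-wise
    # (each an (n_subjects, 1) array), demean them, and append
    # their squares (Smith et al., 2015) to capture potentially
    # nonlinear confound effects.
    conf = np.hstack((motion, sex, age)).astype(float)
    conf = conf - conf.mean(axis=0)
    conf_sq = conf ** 2
    conf_sq = conf_sq - conf_sq.mean(axis=0)
    return np.hstack((conf, conf_sq))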

Nested k-fold cross validation

We employed a nested cross-validation approach to accommodate both hyper-parameter selection and model selection. This is a complex and computationally costly method, but our sample size allows us to use it.
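The structure of the procedure is sketched below with a generic fit/score interface. The names nested_cv_sketch, fit, score, and grid are all hypothetical; the actual implementation used here is src.models.nested_kfold_cv_scca.

import numpy as np
from sklearn.model_selection import KFold

def nested_cv_sketch(X, Y, fit, score, grid, out_folds=5, in_folds=5):
    # fit(X, Y, params) -> model; score(model, X, Y) -> out-of-sample
    # performance (higher is better); grid: iterable of parameter sets.
    outer = KFold(n_splits=out_folds, shuffle=True, random_state=0)
    results = []
    for train, test in outer.split(X):
        inner = KFold(n_splits=in_folds, shuffle=True, random_state=1)
        best_params, best_score = None, -np.inf
        for params in grid:
            # inner loop: evaluate params within the training set only
            scores = [score(fit(X[train][tr], Y[train][tr], params),
                            X[train][va], Y[train][va])
                      for tr, va in inner.split(X[train])]
            if np.mean(scores) > best_score:
                best_score, best_params = np.mean(scores), params
        # outer loop: score the selected model on data never seen
        # during hyper-parameter tuning
        model = fit(X[train], Y[train], best_params)
        results.append((best_params, score(model, X[test], Y[test])))
    return results

Keeping the hyper-parameter search and the performance estimate on disjoint data is the point of nesting: each outer-fold score is untouched by the tuning done in the inner loop.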


In [1]:
import copy
import os, sys
import numpy as np
import pandas as pd

import joblib

In [2]:
os.chdir('../')

In [3]:
# load my modules
from src.utils import load_pkl
from src.file_io import save_output
from src.models import nested_kfold_cv_scca, clean_confound, permutate_scca
from src.visualise import set_text_size, show_results, write_pdf, write_png

In [4]:
dat_path = './data/processed/dict_SCCA_data_prepro_revision1.pkl'
# load data
dataset = load_pkl(dat_path)

dataset.keys()


Out[4]:
['FC_BzdokDMN',
 'FC_Yeo7',
 'Yeo7LiberalMask_ROIs',
 'Gender',
 'Age',
 'CognitiveMeasures_labels',
 'IDs',
 'CognitiveMeasures',
 'MRIQ_labels',
 'FC_Yeo17',
 'BzdokDMN_ROIs',
 'Motion_power',
 'FC_Yeo7LiberalMask',
 'Yeo17_ROIs',
 'FC_nodes',
 'Motion_Jenkinson',
 'MRIQ',
 'Yeo7_ROIs']

In [5]:
FC_nodes = dataset['FC_nodes']
MRIQ    = dataset['MRIQ']
mot     = dataset['Motion_Jenkinson']
sex     = dataset['Gender']
age     = dataset['Age']
confound_raw = np.hstack((mot, sex, age))

In [6]:
out_folds = 5   # outer CV folds
in_folds = 5    # inner CV folds
n_selected = 4  # number of canonical components to retain

Confound cleaning in CV loops
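One common way to clean confounds inside the CV loops without test-set leakage is to fit the confound regression on the training fold only and apply it to the held-out fold. A minimal sketch of this idea (clean_fold is a hypothetical helper, not the pipeline's clean_confound):

import numpy as np
from sklearn.linear_model import LinearRegression

def clean_fold(train_data, test_data, R_train, R_test):
    # Fit the confound model on the training fold only, then remove
    # the fitted confound signal from both folds.
    lr = LinearRegression(fit_intercept=False)
    lr.fit(R_train, train_data)
    return (train_data - lr.predict(R_train),
            test_data - lr.predict(R_test))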


In [7]:
%%time
para_search, best_model, pred_errors = nested_kfold_cv_scca(
    FC_nodes, MRIQ, R=confound_raw, n_selected=n_selected,
    out_folds=out_folds, in_folds=in_folds,
    reg_X=(0.1, 0.9), reg_Y=(0.1, 0.9)
)


Fold 1/5
New Best model: 
 4 components,penalty x: 0.1, penalty y: 0.5
OOS performance: 0.0225189060373
Fold 2/5
New Best model: 
 4 components,penalty x: 0.8, penalty y: 0.5
OOS performance: 0.475761403841
Fold 3/5
Fold 4/5
Fold 5/5
Best parameters based on outer fold ev results: X-0.8; Y-0.5

CPU times: user 13min 19s, sys: 8min 28s, total: 21min 47s
Wall time: 11min 58s

Examining the explained variance (%)


In [8]:
X, Y, R = clean_confound(FC_nodes, MRIQ, confound_raw)

In [9]:
from sklearn.linear_model import LinearRegression
from scipy.stats.mstats import zscore

# Fisher r-to-z transform the FC values, then regress on the confounds
lr = LinearRegression(fit_intercept=False)
lr.fit(R, np.arctanh(FC_nodes))
rec_ = lr.coef_.dot(R.T).T  # confound-explained portion of the data
r_2 = 1 - (np.var(np.arctanh(FC_nodes) - rec_) / np.var(np.arctanh(FC_nodes)))
print "confounds explained {}% of the FC data".format(np.round(r_2 * 100, 0))

# same procedure for the z-scored self-report data
lr = LinearRegression(fit_intercept=False)
lr.fit(R, zscore(MRIQ))
rec_ = lr.coef_.dot(R.T).T
r_2 = 1 - (np.var(zscore(MRIQ) - rec_) / np.var(zscore(MRIQ)))
print "confounds explained {}% of the self-report data".format(np.round(r_2 * 100, 0))


confounds explained 36.0% of the FC data
confounds explained 2.0% of the self-report data

Permutation test with data augmentation

After decomposing the data with SCCA, one way to assess the reliability of the canonical components is a permutation test. The purpose of a permutation test is to construct a null distribution of the target statistic, against which we can judge the confidence level of our findings. The target statistic should be the quantity the original model optimises; for SCCA, the canonical correlations are used to build the null distribution. There are two possible ways to permute the data: scrambling the subject-wise or the variable-wise links. For subject-wise scrambling, one dataset is shuffled by row, so each observation is paired with non-matching variables from the other dataset; this scheme assesses the contribution of subject-level information to the model. Alternatively, we can shuffle the order of the variables within each participant to disturb the variable structure, which assesses the contribution of the variable profile to the modelling results. We adopt the permutation test with FWE-corrected p-values from Smith et al. (2015), with data augmentation to increase the size of the resampling datasets.
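A minimal sketch of the subject-wise scheme, assuming a hypothetical fit_scca(X, Y) that refits the model and returns its canonical correlations; the data-augmentation step of the actual permutate_scca is omitted here:

import numpy as np

def permute_pvals(X, Y, cancorr_obs, fit_scca, n_permute=5000, seed=0):
    # Break the subject-wise links by shuffling the rows of X only,
    # refit, and record the largest canonical correlation of each
    # permuted fit.
    rng = np.random.RandomState(seed)
    null_max = np.empty(n_permute)
    for i in range(n_permute):
        perm = rng.permutation(X.shape[0])
        null_max[i] = fit_scca(X[perm], Y).max()
    # FWE-corrected p-value per component: how often the best null
    # correlation matches or beats the observed one (Smith et al., 2015).
    exceed = (null_max[:, None] >= np.asarray(cancorr_obs)).sum(axis=0)
    return (1.0 + exceed) / (1.0 + n_permute)

Comparing every observed component against the maximum correlation per permutation is what makes the resulting p-values family-wise error corrected across components.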


In [10]:
%%time

df_permute = permutate_scca(X, Y, best_model.cancorr_, best_model, n_permute=5000)


CPU times: user 1h 31min 40s, sys: 48.3 s, total: 1h 32min 28s
Wall time: 1h 32min 28s

In [11]:
df_permute


Out[11]:
           P-values  alpha0.001  alpha0.01  alpha0.05
Component
1            0.0002        True       True       True
2            0.0010       False       True       True
3            0.0204       False      False       True
4            0.9968       False      False      False

In [ ]:
u, v = best_model.u, best_model.v

set_text_size(12)
figs = show_results(u, v, range(1,58), dataset['MRIQ_labels'], rank_v=True, sparse=True)
write_png('./reports/revision/bestModel_yeo7nodes_component_{:}.png', figs)
X_scores, Y_scores, df_z = save_output(dataset, confound_raw, best_model, X, Y, path=None)

df_z.to_csv('./data/processed/NYCQ_CCA_score_revision_yeo7nodes_{0:1d}_{1:.1f}_{2:.1f}.csv'.format(
            best_model.n_components, best_model.penX, best_model.penY))
df_z.to_pickle('./data/processed/NYCQ_CCA_score_revision_yeo7nodes_{0:1d}_{1:.1f}_{2:.1f}.pkl'.format(
            best_model.n_components, best_model.penX, best_model.penY))

joblib.dump(best_model, 
            './models/SCCA_Yeo7nodes_revision_{:1d}_{:.2f}_{:.2f}.pkl'.format(
                best_model.n_components, best_model.penX, best_model.penY))