We first created the confound matrix according to Smith et al. (2015). The confound variables are motion (Jenkinson), sex, and age. We also created squared confound measures to help account for potentially nonlinear effects of these confounds.
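For illustration, the confound design could be assembled as in the minimal sketch below (this assumes `mot`, `sex`, and `age` are (n_subjects, 1) arrays; in this notebook the equivalent step is performed inside `clean_confound` later on):

# Sketch only: linear + squared confound terms, following the Smith et al. (2015) recipe.
import numpy as np
from scipy.stats import zscore

conf = zscore(np.hstack((mot, sex, age)).astype(float), axis=0)  # standardise each confound
conf_sq = zscore(conf ** 2, axis=0)                              # squared terms for nonlinear effects
confound_mat = np.hstack((conf, conf_sq))                        # n_subjects x 6 confound design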
We employed a nested cross-validation approach to accommodate both hyper-parameter selection and model selection. This is a complex and computationally costly method, but the sample size allows us to use it. A sketch of the nested structure is given below.
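The sketch only illustrates the nested logic; the actual implementation is `nested_kfold_cv_scca` (used later in this notebook), and `fit_scca`, `prediction_error`, and the `(reg_X, reg_Y)` grid are hypothetical placeholders:

# Illustrative nested k-fold cross-validation: the inner loop selects the penalties,
# the outer loop estimates out-of-sample prediction error with the selected penalties.
import numpy as np
from sklearn.model_selection import KFold

def nested_cv_sketch(X, Y, param_grid, out_folds=5, in_folds=5):
    outer = KFold(n_splits=out_folds, shuffle=True, random_state=0)
    outer_errors = []
    for train_idx, test_idx in outer.split(X):
        # Inner loop: grid-search the penalties on the training folds only.
        inner = KFold(n_splits=in_folds, shuffle=True, random_state=0)
        mean_inner_err = {}
        for reg_X, reg_Y in param_grid:  # param_grid: list of (reg_X, reg_Y) tuples
            errs = []
            for fit_idx, val_idx in inner.split(train_idx):
                model = fit_scca(X[train_idx[fit_idx]], Y[train_idx[fit_idx]], reg_X, reg_Y)
                errs.append(prediction_error(model, X[train_idx[val_idx]], Y[train_idx[val_idx]]))
            mean_inner_err[(reg_X, reg_Y)] = np.mean(errs)
        best = min(mean_inner_err, key=mean_inner_err.get)
        # Outer loop: refit with the selected penalties and score on held-out data.
        model = fit_scca(X[train_idx], Y[train_idx], *best)
        outer_errors.append(prediction_error(model, X[test_idx], Y[test_idx]))
    return outer_errors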
In [1]:
import copy
import os, sys
import numpy as np
import pandas as pd
import joblib
In [2]:
os.chdir('../')
In [3]:
# load my modules
from src.utils import load_pkl
from src.file_io import save_output
from src.models import nested_kfold_cv_scca, clean_confound, permutate_scca
from src.visualise import set_text_size, show_results, write_pdf, write_png
In [4]:
dat_path = './data/processed/dict_SCCA_data_prepro_revision1.pkl'
# load data
dataset = load_pkl(dat_path)
dataset.keys()
Out[4]:
In [5]:
FC_nodes = dataset['FC_nodes']
MRIQ = dataset['MRIQ']
mot = dataset['Motion_Jenkinson']
sex = dataset['Gender']
age = dataset['Age']
confound_raw = np.hstack((mot, sex, age))
In [6]:
out_folds = 5
in_folds = 5
n_selected = 4
In [7]:
%%time
para_search, best_model, pred_errors = nested_kfold_cv_scca(
    FC_nodes, MRIQ, R=confound_raw, n_selected=n_selected,
    out_folds=out_folds, in_folds=in_folds,
    reg_X=(0.1, 0.9), reg_Y=(0.1, 0.9)
)
Examining the percentage of variance explained by the confounds
In [8]:
X, Y, R = clean_confound(FC_nodes, MRIQ, confound_raw)
In [9]:
from sklearn.linear_model import LinearRegression
from scipy.stats.mstats import zscore

# Regress the (Fisher z-transformed) FC data on the confounds and compute
# the proportion of variance the confounds explain.
lr = LinearRegression(fit_intercept=False)
lr.fit(R, np.arctanh(FC_nodes))
rec_ = lr.coef_.dot(R.T).T
r_2 = 1 - (np.var(np.arctanh(FC_nodes) - rec_) / np.var(np.arctanh(FC_nodes)))
print("confounds explained {:.0f}% of the FC data".format(r_2 * 100))

# Repeat for the z-scored self-report (MRIQ) data.
lr = LinearRegression(fit_intercept=False)
lr.fit(R, zscore(MRIQ))
rec_ = lr.coef_.dot(R.T).T
r_2 = 1 - (np.var(zscore(MRIQ) - rec_) / np.var(zscore(MRIQ)))
print("confounds explained {:.0f}% of the self-report data".format(r_2 * 100))
After decomposing the data with SCCA, one way to assess the reliability of the canonical components is a permutation test. The purpose of a permutation test is to construct a null distribution of the target statistic, so that we can gauge the confidence level of our findings. The target statistic should be the quantity optimised in the original model; for SCCA, the canonical correlations are used to construct the null distribution. There are two possible ways to permute the data: scrambling the subject-wise links or the variable-wise links. To perform subject-wise scrambling, we shuffle one dataset by row, so that each observation is paired with non-matching variables; this permutation scheme assesses the contribution of individual-level information to the model. Alternatively, we can shuffle the order of the variables within each participant to disturb the variable structure; this assesses the contribution of the variable profile to the modelling results. We adopt the permutation test with FWE-corrected p-values from Smith et al. (2015), with data augmentation to increase the size of the resampled datasets.
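The two permutation schemes can be illustrated with the minimal sketch below (illustrative only; the analysis itself relies on `permutate_scca` in the next cell, and `rng` is an assumed random-number generator added for this example):

# Subject-wise scrambling: shuffle rows of Y so observations no longer match X.
# Variable-wise scrambling: shuffle values across variables within each participant.
import numpy as np

rng = np.random.RandomState(42)
Y_subject_perm = Y[rng.permutation(Y.shape[0]), :]
Y_variable_perm = np.apply_along_axis(rng.permutation, 1, Y)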
In [10]:
%%time
df_permute = permutate_scca(X, Y, best_model.cancorr_, best_model, n_permute=5000)
In [11]:
df_permute
Out[11]:
In [ ]:
# Extract the canonical weights of the best model and visualise each component.
u, v = best_model.u, best_model.v
set_text_size(12)
figs = show_results(u, v, range(1, 58), dataset['MRIQ_labels'], rank_v=True, sparse=True)
write_png('./reports/revision/bestModel_yeo7nodes_component_{:}.png', figs)

# Save the subject-level canonical scores and the fitted model to disk.
X_scores, Y_scores, df_z = save_output(dataset, confound_raw, best_model, X, Y, path=None)
df_z.to_csv('./data/processed/NYCQ_CCA_score_revision_yeo7nodes_{0:1d}_{1:.1f}_{2:.1f}.csv'.format(
    best_model.n_components, best_model.penX, best_model.penY))
df_z.to_pickle('./data/processed/NYCQ_CCA_score_revision_yeo7nodes_{0:1d}_{1:.1f}_{2:.1f}.pkl'.format(
    best_model.n_components, best_model.penX, best_model.penY))
joblib.dump(best_model,
    './models/SCCA_Yeo7nodes_revision_{:1d}_{:.2f}_{:.2f}.pkl'.format(
        best_model.n_components, best_model.penX, best_model.penY))