If you're working on neuroimaging data, you should check out another Python library, Nilearn, which is designed for fast and easy statistical learning on neuroimaging data. It leverages the scikit-learn Python toolbox for multivariate statistics, with applications such as predictive modeling, classification, decoding, and connectivity analysis. See the Nilearn website to find out more.
As an example of how to use Nilearn, we will use the Haxby 2001 study on a face vs cat discrimination task in a mask of the ventral stream. This part is based on a Nilearn tutorial.
Note that the first time you fetch the data, the download can take a few minutes.
In [1]:
from nilearn import datasets
# By default 2nd subject will be fetched
haxby_dataset = datasets.fetch_haxby()
We can access the anatomical, functional, and mask data, as well as the true labels.
In [2]:
func_file = haxby_dataset.func[0]
mask_file = haxby_dataset.mask_vt[0]
anat_file = haxby_dataset.anat[0]
labels_file = haxby_dataset.session_target[0]
Let's get some information about the BOLD file:
In [3]:
!nib-ls $func_file
We need our functional data as a 2D, samples-by-voxels matrix. To get that, we'll select the set of voxels in the ventral temporal (VT) cortex defined by the mask from the Haxby study:
In [4]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from nilearn import plotting
plotting.plot_roi(mask_file, anat_file, cmap='Paired', dim=-.5)
Out[4]:
Now we will create a masker using NiftiMasker. NiftiMasker is an object that applies a mask to a dataset and returns the masked voxels as a vector at each time point. With standardize=True, the time series are centered and normalized.
In [5]:
from nilearn.input_data import NiftiMasker
masker = NiftiMasker(mask_img=mask_file, standardize=True)
# We give the masker a filename and retrieve a 2D array ready
# for machine learning with scikit-learn
fmri_masked = masker.fit_transform(func_file)
fmri_masked is a NumPy array; its shape is the number of time points by the number of voxels in the mask.
In [6]:
print(fmri_masked.shape)
To recover the original data shape (giving us a masked and z-scored BOLD series), you can use masker.inverse_transform.
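Conceptually, the masker just selects the in-mask voxels as columns of a 2D array, and inverse_transform scatters those columns back into a blank volume. A toy NumPy sketch of that round trip (the shapes here are illustrative, not the Haxby data):

```python
import numpy as np

# Toy 4D series: 3 time points over a 2x2x2 "volume"
volume = np.arange(24).reshape(3, 2, 2, 2)  # (t, x, y, z)
mask = np.zeros((2, 2, 2), dtype=bool)
mask[0, 0, 0] = mask[1, 1, 1] = True        # keep 2 "voxels"

# What fit_transform does, in essence: (n_timepoints, n_voxels)
masked = volume[:, mask]
print(masked.shape)  # (3, 2)

# What inverse_transform does: scatter values back into a zero-filled volume
restored = np.zeros_like(volume)
restored[:, mask] = masked
print(restored[0, 1, 1, 1])  # 7, recovered; out-of-mask voxels stay 0
```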
The labels_file is a CSV file; we can read it using NumPy:
In [7]:
labels = np.recfromcsv(labels_file, delimiter=" ")
labels
Out[7]:
It's a record array: the labels field gives the condition at each time point, and chunks gives the run number. We will use the labels:
In [8]:
conditions = labels['labels']
np.unique(conditions)
Out[8]:
We see that there are 9 different conditions, but we will use face and cat only. Let's create another mask (this time masking the time points) and apply it to fmri_masked.
In [9]:
condition_mask = np.logical_or(conditions == b'face', conditions == b'cat')
conditions_2lb = conditions[condition_mask]
fmri_masked_2lb = fmri_masked[condition_mask]
print(fmri_masked_2lb.shape)
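The time-point selection above is ordinary NumPy boolean indexing — a tiny illustration with made-up labels (not the real conditions array):

```python
import numpy as np

labels_toy = np.array([b'rest', b'face', b'cat', b'house', b'face'])

# True wherever the label is one of the two conditions we keep
mask_toy = np.logical_or(labels_toy == b'face', labels_toy == b'cat')
print(mask_toy)              # [False  True  True False  True]
print(labels_toy[mask_toy])  # [b'face' b'cat' b'face']
```

The same boolean vector indexes the first axis of the 2D fMRI matrix, so labels and samples stay aligned.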
Now we will apply a learning algorithm from scikit-learn to our neuroimaging data. We will use Support Vector Classification (SVC) with a linear kernel.
In [10]:
from sklearn.svm import SVC
svc = SVC(kernel='linear')
Let's split our data and fit the model using the training set:
In [11]:
from sklearn.model_selection import train_test_split
fmri_tr, fmri_ts, cond_tr, cond_ts = train_test_split(fmri_masked_2lb, conditions_2lb)
svc.fit(fmri_tr, cond_tr)
Out[11]:
And we can check the score for the testing set:
In [12]:
svc.score(fmri_ts, cond_ts)
Out[12]:
Run cross_val_score for the SVC with a linear kernel:
In [13]:
from sklearn.model_selection import cross_val_score, LeaveOneOut
svc_ln = SVC(kernel='linear')
scores = cross_val_score(svc_ln, fmri_masked_2lb, conditions_2lb, cv=LeaveOneOut())
print("Linear kernel: scores = {}, mean score = {:03.2f}".format(scores, scores.mean()))
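Leave-one-out runs one fit per sample, which gets slow as the data grow; k-fold cross-validation is a cheaper alternative. A sketch on synthetic data (the arrays below are random stand-ins for fmri_masked_2lb and conditions_2lb, not the Haxby data):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, StratifiedKFold

rng = np.random.RandomState(0)
X = rng.randn(40, 50)               # stand-in for the masked fMRI matrix
y = np.repeat(['face', 'cat'], 20)  # stand-in for the condition labels

# 5-fold stratified CV: 5 fits instead of 40, class balance kept per fold
scores = cross_val_score(SVC(kernel='linear'), X, y, cv=StratifiedKFold(n_splits=5))
print("mean score = {:03.2f}".format(scores.mean()))
```

On this random data the score hovers around chance; on the real data it should be close to the leave-one-out result.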
And now, let's try the default (RBF) kernel:
In [14]:
svc_rbf = SVC()
scores = cross_val_score(svc_rbf, fmri_masked_2lb, conditions_2lb, cv=LeaveOneOut())
print("Default rbf kernel: scores = {}, mean score = {:03.2f}".format(scores, scores.mean()))
In [15]:
# write your solution here
# 1. read about available kernels http://scikit-learn.org/stable/modules/svm.html and initialize models with two different kernels
# 2. run cross_val_score for both models and compare the results
In [16]:
from sklearn.neighbors import KNeighborsClassifier
clf_kn = KNeighborsClassifier()
scores = cross_val_score(clf_kn, fmri_masked_2lb, conditions_2lb, cv=LeaveOneOut())
print("Scores: {}, mean score = {:03.2f}".format(scores, scores.mean()))
As we can see, this model also works pretty well for the dataset.
In [17]:
# write your solution here
# 1. initialize KNeighborsClassifier (you can change the default arguments)
# 2. run cross_val_score
We can check weights assigned to the features by the model:
In [18]:
coef = svc.coef_
print(coef)
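For a binary problem with a linear kernel, coef_ has shape (1, n_features) — one weight per voxel in our case. A quick check on synthetic data (shapes are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

X = np.random.RandomState(0).randn(20, 5)  # 20 samples, 5 features
y = np.repeat([0, 1], 10)
clf = SVC(kernel='linear').fit(X, y)
print(clf.coef_.shape)  # (1, 5): one row of weights, one column per feature
```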
Our array should have the same size as the VT mask:
In [19]:
coef.shape
Out[19]:
We need to turn it back into an original Nifti image, in essence, “inverting” what the NiftiMasker has done. For this, we can call inverse_transform on the NiftiMasker:
In [20]:
coef_img = masker.inverse_transform(coef)
print(coef_img)
If we need, we can save the image:
In [21]:
coef_img.to_filename('haxby_svc_weights.nii.gz')
Now, let's plot our weights on top of the anatomical image:
In [22]:
from nilearn.plotting import plot_stat_map
plot_stat_map(coef_img, anat_file, vmax=0.1, dim=-1,
              title="SVM weights", display_mode="yx")
Out[22]:
Now you can see which areas in the VT cortex are important for distinguishing between the two conditions according to our model.
Try to run the model using all conditions (except the rest state). This is multiclass classification; try the one-vs-all and one-vs-one strategies (you can read more in the scikit-learn documentation). Which one should be faster? Does the new model have as high a score as the one with two conditions only?
We will start by creating a new mask that removes only the rest state.
In [23]:
conditions_new = conditions[conditions != b'rest']
fmri_masked_new = fmri_masked[conditions != b'rest']
fmri_masked_new.shape
Out[23]:
Running one-vs-one multiclass classification. Note that this will take longer than last time; look at the shape of our data.
In [24]:
from sklearn.model_selection import cross_val_score, LeaveOneOut
svc_new_ovo = SVC(kernel='linear', decision_function_shape="ovo")
scores = cross_val_score(svc_new_ovo, fmri_masked_new, conditions_new, cv=LeaveOneOut())
print("Scores: {}, mean score = {:03.2f}".format(scores, scores.mean()))
Now let's try one-vs-rest (one-vs-all):
In [25]:
svc_new_ovr = SVC(kernel='linear', decision_function_shape="ovr")
scores = cross_val_score(svc_new_ovr, fmri_masked_new, conditions_new, cv=LeaveOneOut())
print("Scores: {}, mean score = {:03.2f}".format(scores, scores.mean()))
Both methods give the same result.
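That is less surprising than it looks: in scikit-learn, SVC trains one-vs-one binary classifiers internally either way, and decision_function_shape only changes how the decision values are reported. To genuinely compare the two strategies, the explicit meta-estimators can be used — a sketch on synthetic data (all names and shapes here are illustrative):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

rng = np.random.RandomState(0)
X = rng.randn(32, 10)
y = np.tile(['face', 'cat', 'house', 'shoe'], 8)  # 4 classes

# One-vs-one trains k*(k-1)/2 binary classifiers, one-vs-rest trains k
ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)
print(len(ovo.estimators_), len(ovr.estimators_))  # 6 pairwise vs 4 per-class
```

With few classes and kernel SVMs on small per-pair subsets, one-vs-one is often competitive despite fitting more classifiers.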
In [26]:
# write your solution here
# 1. create a new mask and apply to conditions and fmri_masked
# 2. initialize SVC model with two different decision_function_shape; run cross_val_score and compare results
Let's split the data manually into training and testing sets, and use the one-vs-rest model from the previous exercise.
In [27]:
fmri_new_tr, fmri_new_ts, cond_new_tr, cond_new_ts = train_test_split(fmri_masked_new, conditions_new)
svc_new_ovr.fit(fmri_new_tr, cond_new_tr)
cond_new_pred = svc_new_ovr.predict(fmri_new_ts)
Now we will create a dictionary and for every condition we will check how often the model identified it correctly:
In [28]:
acc_cond = {}
for cn in np.unique(cond_new_ts):
    acc_cond[cn] = cond_new_pred[(cond_new_pred == cn) & (cond_new_ts == cn)].shape[0] / cond_new_ts[cond_new_ts == cn].shape[0]
print(acc_cond)
Looks like house was the easiest for our model.
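scikit-learn can compute this per-condition accuracy directly — it is the per-class recall — and the confusion matrix additionally shows which conditions get mixed up with which. A sketch with toy labels (the two arrays stand in for cond_new_ts and cond_new_pred):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

y_true = np.array(['face', 'cat', 'house', 'face', 'house', 'cat'])
y_pred = np.array(['face', 'cat', 'house', 'cat', 'house', 'face'])

# Per-class recall: fraction of each true condition identified correctly
classes = np.unique(y_true)
per_class = recall_score(y_true, y_pred, labels=classes, average=None)
print(dict(zip(classes, per_class)))  # house perfect, face/cat confused

# Rows are true conditions, columns are predictions
cm = confusion_matrix(y_true, y_pred, labels=classes)
print(cm)
```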
In [29]:
# write your code here
# 1. split the data using train_test_split and fit the model using training data
# 2. run predict on testing data
# 3. for every condition calculate how often model identified it correctly