Welcome to this workshop! In the upcoming couple of hours, you'll work through this (Jupyter) "notebook" - a sort of code editor in your browser that allows mixing of code (Python) and text (markdown) cells. Originally, this notebook was developed for the Research Master Psychology course Neuroimaging: Pattern Analysis at the University of Amsterdam, but it was adapted and extended for this ICON workshop.
In general, this notebook is designed for people who are relatively unfamiliar with MVPA and/or its implementation in Python. We do assume that you are familiar with analyzing (f)MRI data; the purpose of this workshop is to "extend your analysis toolbox", so to say. As such, it's definitely helpful if you have some knowledge about Python's syntax, but it's not strictly necessary. Also, we assume participants are relatively unfamiliar with machine learning (ML) concepts and their implementation; if you're an ML guru, this is probably not the right workshop for you...
At certain points in the tutorial, we've included some more "advanced" material, but feel free to skip this. Also, realize that this tutorial is about the concepts behind MVPA and how to implement them using domain-general Python tools, such as scikit-learn. Importantly, this tutorial is not about how to use MVPA-specific frameworks such as PyMVPA or Nilearn, although we included a short section on skbold, our own "machine learning for fMRI"-package that is built around scikit-learn.
Also, there is actually no single "multivoxel pattern analysis"; rather, MVPA is an umbrella term referring to several kinds of pattern-based analyses. As this tutorial cannot cover all such analyses, we limit the scope to machine learning (classification-based) analyses of fMRI data based on within-subject designs (leaving out between-subject decoding, regression-based analyses, unsupervised learning, etc.). We hope to cover this ground sufficiently well for you to be able to explore other kinds of analyses on your own.
In this workshop, we cover the basics and implementation of multivoxel pattern analysis (MVPA) to analyze (functional) MRI data in Python. After this tutorial, you will be able to:
In short, this tutorial will show how to build your own complete, cross-validated MVPA pipeline for fMRI data.
The dataset we use for this workshop was chosen out of convenience and is not an example of an optimally designed dataset for MVPA-based analyses. Although this workshop is not about optimal design strategies for MVPA, we point out shortcomings of the example dataset wherever appropriate.
We designed this notebook to be interactive. That is, you'll encounter several "ToDos" (yellow boxes) and "ToThinks" (blue boxes) that describe short programming exercises (ToDos) and questions for you to think about (ToThinks), which are aimed at improving your understanding of the material discussed so far. After each ToDo, there is a code block that tests whether your answer is correct. Correct answers to the ToDos and ToThinks are included in the answers notebook (but try it yourself first!). If you're not very comfortable/experienced with Python, or if you just don't feel like it, feel free to skip the ToDos and ToThinks, of course. Also, feel free to ask us questions during this workshop!
The tutorial is split in two parts. In the first part, we cover the basics of experimental designs, and illustrate how to extract patterns from raw fMRI data in a way that MVPA can be applied. In the second part, the major steps in (ML-based) MVPA pipelines are discussed: feature extraction and selection, cross-validation, and model evaluation. We challenge you to optimize your pipelines to maximize your decoding accuracy scores!
Have fun and good luck!
This tutorial is implemented in a "Jupyter notebook", which includes "code cells" - which are able to run Python code - and "text cells" - which render Markdown text. For this tutorial, it suffices to know how to run cells, which you can do by selecting the (code) cell and pressing CTRL + ENTER, or by pressing the "play" button in the toolbar at the top of the page. For more info on Jupyter notebooks, check the python tutorial notebook.
Did you already download the data, as described here? Check it by running the cell below (CTRL + ENTER)!
In [ ]:
import os
if os.path.isdir(os.path.join('..', 'data')):
    print("You already downloaded the data, nice! Continue with Part 1!")
else:
    print("Seems you don't have the data yet ... Go to lukas-snoek.com/ICON2017 for instructions on how to download it.")
Before you can do any fancy machine learning analysis (or any other pattern analysis), there are several decisions you need to make and steps to take in (pre)processing and structuring your data. Roughly, there are three steps to take:
While we won't go into all the design factors that make for an efficient pattern analysis (see this article for a good review), we will now discuss/demonstrate some design considerations and how they impact the rest of the MVPA pipeline.
There are many ways in which you can categorize different types of pattern analyses, but one of the most basic categorizations is in terms of whether analyses are within-subject or between-subject. The major distinction revolves around whether you want to investigate an (experimental) factor that varies or is manipulated within subjects or that varies across subjects (i.e. individual differences or experimental between-subject designs)*.
* N.B.: there are also "hybrid" forms, such as when you decode a within-subject factor (e.g. "negative vs. positive images") from condition-average patterns across subjects. That is, each subject "contains" two patterns (one "negative image" pattern, one "positive image" pattern) and these patterns are decoded across subjects
Importantly, these two types of analyses differ in what is regarded as an instance of a pattern (also referred to as a sample):
This distinction is important, because the choice of type of pattern analysis has major consequences for the design of your experiment and the methods available for pattern extraction. Throughout this tutorial, we focus on within-subject analyses, using a simple two-class working memory data-set.
There are different ways to estimate your patterns for pattern-based analyses. Suppose you want to decode neural representations of faces from representations of houses. In that case, you need to estimate a pattern for each face and each house. Now, suppose we show a face at 9 seconds; we can "extract" the activity at each voxel for that stimulus in three ways (see figure below):
As method 3 (fitting a separate HRF per stimulus) has been shown to yield the most stable pattern estimates (Misaki et al., 2010), we'll focus on that method in the rest of this section.
We will extract patterns of $\beta$-estimates by fitting a hemodynamic response function (HRF) per trial using the general linear model (GLM)*. In other words, "trials" (i.e. instances of your feature-of-interest) in within-subject designs are modelled as separate regressors in a first-level analysis. Below, we included an image of a single-trial design (of the hypothetical faces vs. houses experiment) as created in FSL:
* Note: This single-trial design is also referred to as a "Least Squares All" (LSA) approach. Alternative pattern estimation designs exist, including "Least Squares Single" (LSS); the choice for one or the other depends on the trial-by-trial variability and scan noise (Abdulrahman & Henson, 2017).
As you can see, each trial gets its own regressor (hence the name "single-trial design"). Then, as depicted below the design matrix, a contrast-against-baseline is created for each regressor (trial). After you run a first-level analysis using this design, you'll have whole-brain maps of statistic values ($\beta$-values, t-values, or z-values) for each trial, representing the trial's estimated (whole-brain) pattern. Usually, t-values or z-values are used instead of $\beta$-values (cf. Misaki et al., 2010).
Importantly, this design thus specifies that the activation of each voxel is estimated separately for each trial, i.e. each trial gets its own activation map.
Before you go on, make sure you understand this image! This image represents basically all you need to understand about single-trial designs.
To further illustrate how single-trial designs work, you'll create your own single-trial design matrix corresponding to a (real) working memory experiment. In this experiment, one condition ("ACTIVE") required subjects to remember a configuration of bars and, after a retention period, to respond whether one of the bars had changed in the test-image or not. In the other condition ("PASSIVE") they just watched a blank screen and had to respond with a random answer. The experiment is depicted schematically below:
In total, subjects performed 40 trials, of which 32 were of the "ACTIVE" condition and 8 were of the "PASSIVE" condition*. Next, we'll generate a single-trial design that aims to estimate the pattern for each trial.
* Note: this data-set was not necessarily meant to be analysed with MVPA, as it has some sub-optimal factors w.r.t. pattern analysis, such as class-imbalance (32 active vs. 8 passive trials) and all "instances" (trials) contained in a single run, while it is often recommended to spread out your trials across different runs (see e.g. this study), which will be shortly discussed in section 2.4. (on cross-validation).
Below, we'll load the trial onset times (and trial durations and conditions).
In [ ]:
# First, we need to import some Python packages
import numpy as np
import pandas as pd
import os.path as op
import warnings
import matplotlib.pyplot as plt
plt.style.use('classic')
warnings.filterwarnings("ignore")
%matplotlib inline
In [ ]:
# The onset times are loaded as pandas dataframe with three columns:
# onset times (in seconds) (column 1), durations (column 2), and conditions (column 3).
# N.B.: condition 0 = passive, condition 1 = active
stim_info = pd.read_csv(op.join('example_data', 'onsets.csv'), sep='\t',
names=['onset', 'duration', 'trial_type'])
stim_info
Remember, the onsets (and duration) are here defined in seconds (not TRs). Let's assume that the fMRI-run has a TR of 2. Now, we can convert (very easily!) the onsets/durations-in-seconds to onsets/durations-in-TRs.
In [ ]:
# We repeat loading in the dataframe to avoid dividing the onsets by 2 multiple times ...
stim_info = pd.read_csv(op.join('example_data', 'onsets.csv'), sep='\t',
names=['onset', 'duration', 'trial_type'])
# This is where we divide the onsets/durations by the TR
stim_info[['onset', 'duration']] = (stim_info[['onset', 'duration']] / 2).astype(int)
To perform the first-level analysis, we need to create, for each trial, a regressor of zeros and ones, in which the ones represent the moments at which that particular trial was presented. Let's assume that our moment of interest is the encoding phase, which lasts only 2 seconds; we can thus model it as an "impulse".
So, for example, if you have a (hypothetical) run with a total duration of 15 TRs, and you show a stimulus at TR=3 for the duration of 1 TR (i.e. 2 seconds), then you'd code your regressor as:
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
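To make this concrete, here is a minimal sketch of how such a regressor could be built with numpy (hypothetical values and variable names of our own; note that with Python's 0-based indexing, "TR 3" corresponds to index 2):
In [ ]:
# Minimal sketch (hypothetical 15-TR run): a stimulus presented at TR 3 for 1 TR
import numpy as np

n_timepoints_example = 15
onset_index = 2      # 0-based index of the stimulus onset
duration_in_trs = 1  # duration of the stimulus in TRs

example_regressor = np.zeros(n_timepoints_example, dtype=int)
example_regressor[onset_index:onset_index + duration_in_trs] = 1
print(example_regressor.tolist())  # [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]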
In [ ]:
# ToDo
n_timepoints = 162
n_trials = 40
stim_vec = np.zeros((n_timepoints, n_trials))
# Fill the stim_vec variable with ones at the indices of the onsets per trial!
In [ ]:
np.testing.assert_array_equal(stim_vec, np.load(op.join('example_data', 'stim_vec.npy')))
print("Well done!")
In [ ]:
stim_vec = np.load(op.join('example_data', 'stim_vec.npy'))
In [ ]:
from functions import double_gamma
hrf = double_gamma(np.arange(162))
# List comprehension (fancy for-loop) + stack results back to a matrix
X = np.vstack([np.convolve(hrf, stim_vec[:, i], 'full')[:162] for i in range(40)]).T
plt.figure(figsize=(30, 10))
for plot in range(40):
    is_active = True if stim_info['trial_type'][plot] == 1 else False
    plt.plot(X[:, plot] + plot, np.arange(X.shape[0]), c='blue' if is_active else 'red', lw=3)
    plt.text(plot - 0.5, -15, 'active' if is_active else 'passive',
             rotation=45, fontsize=20)
plt.ylabel('Time (TR)', fontsize=40)
plt.xlabel('Regressors (trials)', fontsize=40)
plt.xlim((-1, 40))
plt.ylim((0, 160))
ax = plt.gca()
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.invert_yaxis()
plt.show()
* Note: This specific data-set does not have a proper design for a machine-learning analysis on single-trial patterns, because the short ISI leads to temporal autocorrelation between subsequent trials and may thus result in inflated accuracy scores (cf. Mumford et al., 2014). It is generally recommended to use a long ISI (> 6 seconds) in combination with fully randomized (and class-balanced) trials, or to perform run-wise cross-validation (something we'll revisit in the section on cross-validation).
The GLM that evaluates this design will yield a $\beta$-estimate per trial per voxel. In other words, every trial is associated with a specific (whole-brain) pattern of $\beta$-values.
Usually, though, t-values instead of $\beta$-values are used, because it has been shown that converting $\beta$-values to t-values (also called univariate normalization) often creates more stable and robust patterns. $\beta$-values can be normalized to t-values by simply dividing each parameter estimate ($\beta$) by its corresponding standard error; we won't show this here, but we're confident you can do that yourself!
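Just as an illustration (with purely hypothetical beta- and standard-error arrays, not the real FSL output), the conversion boils down to an element-wise division:
In [ ]:
# Hypothetical arrays, only to illustrate the beta-to-t conversion
import numpy as np

betas = np.random.randn(1000)                 # hypothetical beta-estimates (one per voxel)
standard_errors = np.random.rand(1000) + 0.5  # hypothetical standard errors (strictly positive)

t_values = betas / standard_errors            # univariate normalization: beta / SE
print(t_values.shape)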
Note: realize that we don't interpret the patterns of t-values in terms of statistical significance. We only use t-values for efficient pattern estimation, not for hypothesis testing as in mass-univariate analyses! As such, we're not thresholding or applying any multiple comparison correction!
So, now you know how to create a "single-trial design" for pattern analyses! As a short summary:
Alright, you now know what a single-trial (within-subject) design looks like and what it produces (i.e. single-trial pattern estimates in the form of whole-brain $\beta$/t-stat/z-stat maps). Before we go on, we will show what the pattern of a single trial looks like in fslview. This will hopefully give you some more "intuition" about what is meant by a single-trial pattern.
You can also do this inside the jupyter notebook:
In [ ]:
from niwidgets import NiWidget
tstat_widget = NiWidget(op.join('..', 'data', 'pi0070', 'wm.feat', 'stats', 'tstat1.nii.gz'))
tstat_widget.nifti_plotter()
In pattern analyses, there is a specific way to 'store' and represent brain patterns: as 2D matrices of shape N-samples $\times$ K-voxels. Important: often (and confusingly), people refer to voxels as (brain) 'features' in pattern analyses. So in articles people often refer to samples-by-features matrices!
Anyway, this is what such a matrix looks like for the hypothetical faces-vs-houses decoding study:
Each row thus represents the voxel pattern of a sample (here: trial)!
As you can see, the originally 3D voxel patterns (e.g. whole-brain patterns of t-values) are flattened (also called "vectorized" or "raveled") such that we can stack all patterns vertically into a 2D matrix. Pattern analyses, and machine learning in particular, need this 2D format (thus discarding spatial information about the voxel patterns) because most (ML) algorithms were not designed to work with 3D (spatial) data (barring convolutional neural networks). Instead, they assume that the data is a 2D samples-by-features matrix.
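To make the flattening step concrete, here is a minimal sketch (using tiny hypothetical 3D patterns rather than real data):
In [ ]:
import numpy as np

# Two hypothetical 3D patterns (e.g. tiny "brains" of 4 x 4 x 3 = 48 voxels)
pattern_trial1 = np.random.randn(4, 4, 3)
pattern_trial2 = np.random.randn(4, 4, 3)

# Flatten ("ravel") each 3D pattern to a 1D vector and stack them into a 2D samples-by-features matrix
X_small = np.vstack([pattern_trial1.ravel(), pattern_trial2.ravel()])
print(X_small.shape)  # (2, 48): 2 samples by 48 features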
Anyway, let's look at an example. We're going to work with the working-memory data (as outlined in the beginning of the notebook). Suppose we want to investigate whether we can predict whether a trial is passive or active (factor: working memory load) from (whole-brain) voxel patterns. Consequently, this is a within-subject design. As such, we model each trial separately by fitting a single-trial design matrix to obtain patterns of t-values per trial (similar to the plot just before Section 1.3). The results are in the directory data/pi0070/wm.feat. Check out the directory (again) and especially the stats folder. You should see a bunch of nifti-files which contain 3D voxel patterns with "tstat" values (FSL also produces pes [betas], copes, and varcopes - we removed these to minimize the data size of this tutorial).
For this analysis, we're going to use patterns of t-stats (as is generally recommended over $\beta$s).
As you can see, there are 40 nifti-files with t-stats; these refer to the 40 trials in the experiment (32 active trials, 8 passive trials)! Given that we need to adhere to the data representation as outlined above, we are in the following situation:
What we have: 40 (3D) nifti-files
What we need: one 2D numpy array of shape 40 x {whatever amount of voxels there are in those niftis}
Alright, time to learn some Python gems that help us load in and transform those patterns into a usable 2D numpy matrix.
As a first thing, we need to find all the paths to the t-stat nifti-files. Python has a nifty (pun intended) tool called "glob" which can find files/directories on disk using wildcards. It is usually imported as follows:
In [ ]:
from glob import glob
glob, in Python, is a function that takes a path (as a string) with one or more wildcard characters (such as the *) and searches for files/directories on disk that match it. For example, let's try to find all the png-images in the "img" directory using glob (these are the images used inside this notebook).
In [ ]:
import os
# the images are in img/ on Linux/Mac systems, but in img\ on Windows (hence the "os.sep" thingie)
my_search_string = 'img' + os.sep + '*.png'
png_files = glob(my_search_string)
print(png_files)
As you can see, it returns a list with all the files/directories that matched the search-string. Note that you can also search for files outside of the current directory. To do so, we simply specify the relative or absolute path in the search-string.
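For example, a search string with a relative path could look something like this (just an illustration; the exact pattern of course depends on where your files live):
In [ ]:
import os
from glob import glob

# Hypothetical example: list everything in the data directory one level up
example_pattern = os.path.join('..', 'data', '*')
print(glob(example_pattern))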
In [ ]:
# Implement your ToDo here
In [ ]:
# To check your answer, run this cell
assert(len(tstat_paths) == 40)
print("Well done! You globbed all the 40 tstat-files correctly!")
In [ ]:
from functions import glob_tstats
tstat_paths = glob_tstats()
Warning: glob returns unsorted paths (so in seemingly random order). It's better if we sort the paths before loading them in, so the order of the paths is more intuitive (the first file is tstat1, the second tstat2, etc.). Python has a builtin function sorted(), which takes a list and sorts it alphabetically. The problem, here, is that if we'd use that - i.e. sorted(tstat_paths) - it would actually sort the files as: tstat1, tstat10, tstat11, etc. See for yourself:
In [ ]:
print(sorted(tstat_paths))
To fix this issue, we wrote a little function (sort_nifti_paths()) that sorts the paths correctly. (If you're interested in how it works, check out the functions.py file.)
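If you're curious, a "natural sort" along these lines could be implemented as sketched below (one possible approach; not necessarily how sort_nifti_paths is implemented in functions.py):
In [ ]:
import re
import os.path as op

def natural_sort_sketch(paths):
    """Sketch: sort paths by the (first) integer appearing in each filename."""
    return sorted(paths, key=lambda p: int(re.search(r'\d+', op.basename(p)).group()))

# e.g. natural_sort_sketch(['tstat10.nii.gz', 'tstat2.nii.gz']) -> ['tstat2.nii.gz', 'tstat10.nii.gz']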
In [ ]:
# Let's fix it
from functions import sort_nifti_paths
tstat_paths = sort_nifti_paths(tstat_paths)
print(tstat_paths)
Now we have the paths to all the nifti-files of a single subject. To load a nifti-file into a numpy array in Python, we can use the awesome nibabel package. This package offers two things you'll need here: the load function and the get_data method, which you use consecutively to load the t-values into memory. For example, to load the whole-brain pattern of t-values of the first trial of subject pi0070:
In [ ]:
import nibabel as nib
data = nib.load(tstat_paths[0]).get_data()
voxel_dims = data.shape
In [ ]:
# Implement your ToDo here
X = np.zeros((40, np.prod(voxel_dims)))
# Start your loop here:
In [ ]:
# Check whether your X is correct by running this cell before continuing to Part 2
np.testing.assert_almost_equal(X, np.load(op.join('example_data', 'X_section1.npz'))['X'])
print("Well done!")
In [ ]:
X = np.load(op.join('example_data', 'X_section1.npz'))['X']
In Part 1, we discussed within-subject single-trial designs, and showed how to load single-trial data in the right format to perform MVPA. It's now time to actually implement machine learning analyses!
We'll be looking at the implementation of concepts like K-fold cross-validation, feature selection/extraction, model fitting/prediction, permutation testing, and feature visualization. To do so, we'll use the scikit-learn machine learning package, the de facto standard (and most-used) ML package in Python.
Remember that, very generally, MVPA is all about inferring a feature-of-interest (dependent variable) using patterns of fMRI data. Let's call the feature-of-interest y, and the patterns of fMRI data X. This inference from X to y is called decoding: we want to decode a dependent variable y from X.
In our case, let's continue with the data from the working memory experiment used in Part 1. Remember that, in this experiment, participants performed two types of trials: "ACTIVE" trials, in which they had to remember the orientation of eight bars, and "PASSIVE" trials, in which they did not do anything (note that the screen briefly flickered in the "PASSIVE" trials in order to activate the visual cortex). Let's try to decode trial type from the patterns of t-values you created in Part 1!
In Section 1.3, you ended with a nice 2D-matrix of N-samples x N-features. This 2D-matrix contains all whole-brain patterns of t-values for all trials: this is your X. However, this leaves out a crucial part of the data: the actual feature-of-interest, trial type, your y.
While there is a kind of generic way to load in voxel patterns, there is usually not a single way to load in your dependent variable (y), because the exact factor that represents y depends on your specific research question (and on how you have stored this data on disk).
In within-subject single-trial designs, trial type or condition is often the dependent variable. The dependent variable can thus be extracted from your design. In fact, we already loaded the dependent variable previously, in the stim_info variable (see Section 1.2). The third column (named 'trial_type') contains the trial types, where 1 is an "ACTIVE" trial and 0 a "PASSIVE" trial.
In [ ]:
# Implement your ToDo here
In [ ]:
np.testing.assert_equal(np.array(y), np.load(op.join('example_data', 'y.npy')))
print('Well done!')
In [ ]:
y = np.load(op.join('example_data', 'y.npy'))
And that's all you need for machine learning-based analyses in terms of preparation!
While it's still relatively easy to load in, structure, and preprocess all of the data necessary for pattern-based analyses, there's quite a lot of "boilerplate code", especially when you need to loop your analysis across multiple participants. At least, that's what we thought, which motivated us to create skbold, a (Python-based) package that offers a set of tools to make machine learning analyses of fMRI data easier. As shown in the figure below, skbold complements the scikit-learn package by offering functionality specific to fMRI data at different stages of your machine learning pipeline. For example, reading in single-trial estimates and the corresponding labels ("y") takes only about 5 lines of Python code using skbold. To avoid distracting you from this tutorial, we included a section on how to use skbold at the very end of this tutorial. Check it out if you are interested!
* Skbold is, of course, not really commercial! It's fully open source (BSD-3 licensed) and can be easily installed with pip (pip install skbold)!
Awesome, we now have everything we need (X and y). This took a while, but you'll soon see it was worth the effort.
In the rest of the tutorial, you'll finally learn how to actually implement decoding pipelines. Typical pipelines consist of the following elements:
In the rest of the tutorial, we'll discuss these topics in a slightly different order. First, we'll discuss how you can preprocess, fit, and cross-validate scikit-learn models, and while we're at it, how to calculate model performance. Subsequently, you'll learn how to embed these concepts in fully cross-validated K-fold data partitioning schemes. Finally, we'll extend our pipelines with feature selection methods and (optionally) show you how to use scikit-learn Pipelines.
Before any machine learning analysis, it is advised to preprocess your patterns such that each feature (here: voxel) is on the same scale. This "feature scaling" process can be done in various ways, but the most often used method is standardization: scaling features such that they have 0 mean and unit variance. Feature scaling is important for the training process of many machine learning algorithms, and omitting scaling may yield sub-optimal results.
Scikit-learn provides a class, StandardScaler, that implements feature standardization. We'll apply it to our entire dataset (X) here, but in a later section we'll explain that - strictly speaking - it's best to "cross-validate" this preprocessing step.
Anyway, let's see how standardization is done using scikit-learn (in section 2.8. we'll explain this scikit-learn object's API in more detail):
In [ ]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() # Here we initialize the StandardScaler object
scaler.fit(X) # Here we "fit" the StandardScaler to our entire dataset (i.e. calculates means and stds of each feature)
X = scaler.transform(X) # And here we transform the dataset using the calculated means/stds
Did the scaling procedure work? Let's check that below (by asserting that the mean of each column is 0, and the std of each column is 1):
In [ ]:
means = np.mean(X, axis=0)
np.testing.assert_almost_equal(means, np.zeros(X.shape[1]))
print("Each column (feature) has mean 0!")
stds = X.std(axis=0)
np.testing.assert_almost_equal(stds[stds != 0], np.ones((stds != 0).sum()))
print("Each column (feature) has std 1!")
Sweet! We correctly preprocessed our patterns. Now it's time for the more interesting stuff.
First, we'll show you how to use scikit-learn models and associated functionality. In fact, the most useful aspect of scikit-learn is probably that it makes fitting and cross-validating models (or, in scikit-learn lingo: estimators) trivially easy.
Note that in the upcoming example, we will fit the model on all our samples for the sake of the example. In practice, this is something you would never do: you always need to cross-validate your model to a new, independent sample.*
* That is, independent with respect to noise: the noise in your training sample should not be correlated with the noise in your test set.
Anyway, let's import a scikit-learn model: the (linear) support vector classifier (SVC), which is one of the most often used models in fMRI pattern analyses. In scikit-learn, this model is part of the (surprise, surprise ...) svm module:
In [ ]:
# Scikit-learn is always imported as 'sklearn'
from sklearn.svm import SVC
Like most of scikit-learn's functionality, SVC is a class (not a function!). So, let's initialize an SVC-object! One important argument that this object needs upon initialization is the "kernel" you want the model to have. Basically, the kernel determines how to treat/process your features: linearly (kernel='linear') or non-linearly (the kernel='rbf' or kernel='poly' options). Most often a linear kernel is the best option (as non-linear kernels often overfit very quickly).
To initialize an SVC-object with a linear kernel, just do as follows:
In [ ]:
# clf = CLassiFier
clf = SVC(kernel='linear')
Alright, we initialized an SVC-object (named clf, short for "classifier") with a linear kernel. Now, you need to do two things to get the prediction for each sample (i.e. whether it's predicted as 0 or 1): fit the model, using the method fit(X, y), and predict the class (0 or 1) for each sample, using the method predict(X). Basically, in the fit method, the parameters of the model (i.e. $\beta$) are estimated. Then, in the predict method, the estimated parameters are used to generate a prediction for each sample (i.e. 0 or 1).
Let's first look at the fit method. As you can see, the fit(X, y) method takes two parameters: X (a samples-by-features matrix) and y (a vector of length n-samples). Let's do that for our data:
In [ ]:
print('Fitting SVC ...', end='')
clf.fit(X, y)
print(' done.')
After calling the fit() method, the clf-object contains an attribute coef_ that represents the model's parameters ('coefficients' in scikit-learn lingo, i.e. $\beta$). Let's check that out:
In [ ]:
coefs = clf.coef_
print("Shape of coefficients: %r" % (coefs.shape,))
Ah, just like we expected: the coef_ attribute is exactly the same size as the number of voxels in our X-matrix (i.e. 80 * 80 * 37).
Anyway, usually you don't have to do anything directly with the weights (perhaps only if you want to do some feature visualization). Scikit-learn handles the calculation of the prediction (let's call that $\hat{y}$) using the trained weights. To get the prediction for any sample, just call the predict(X) method of the model, which returns the predictions as an array:
In [ ]:
y_hat = clf.predict(X)
print("The predictions for my samples are:\n %r" % y_hat.tolist())
In [ ]:
# Try out different estimators below (call fit and predict on X and y!)
A logical next step is to assess how good the model was in predicting the class of the samples. A straightforward metric to summarize performance is accuracy*, which can be defined as:

\begin{align} accuracy = \frac{number\ of\ correct\ predictions}{number\ of\ predictions} \end{align}

* There are waaaay more metrics to summarize model performance for classification models, such as precision, recall, F1-score, and ROC-AUC. These metrics are more appropriate than accuracy when you have imbalanced classes, i.e. more samples in one class (e.g. negative images) than in another class (e.g. positive images), like we have here (i.e. 32 active, 8 passive trials). We'll revisit this issue in the next (optional) section.
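As a toy illustration (hypothetical labels, not our actual data), accuracy can be computed "by hand" or with scikit-learn's accuracy_score function:
In [ ]:
import numpy as np
from sklearn.metrics import accuracy_score

y_true_toy = np.array([1, 1, 0, 1, 0])  # hypothetical true labels
y_pred_toy = np.array([1, 0, 0, 1, 1])  # hypothetical predictions

print(np.mean(y_true_toy == y_pred_toy))       # "by hand": 3 correct out of 5 = 0.6
print(accuracy_score(y_true_toy, y_pred_toy))  # same result using scikit-learn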
In [ ]:
# Implement your to-do here!
If you've done the ToDo above correctly, you should have found out that the accuracy was 1.0 - a perfect score! "Awesome! Nature Neuroscience material!", you might think. But, as is almost always the case: if it seems too good to be true, it probably is indeed too good to be true.
So, what is the issue here?
Well, we didn't cross-validate the model! We fitted it on all the samples and then predicted those very same samples, which leads to an optimistic estimate of model performance. Such optimistic estimates in non-cross-validated models are especially likely when there are many more features (here: voxels) than samples (here: trials). In other words, we are probably overfitting the model here.
Thus, let's check what happens if we actually cross-validate the model. To do so, we need to partition the data into a train- and test-set. For this example, we'll use a simple hold-out scheme, in which we'll reserve half of the data for the test-set (we'll discuss more intricate cross-validation schemes such as K-fold in the next section).
To partition our data in a 50/50 split (50% for train, 50% for test), we'll use the scikit-learn function train_test_split:
In [ ]:
from sklearn.model_selection import train_test_split
if not isinstance(y, np.ndarray):
    y = np.array(y)
# The argument "test_size" indicates the test-size as a proportion
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=y,
random_state=5)
print("Shape X_train: %r" % (X_train.shape,))
print("Shape y_train: %r" % (y_train.shape,))
print("Shape X_test: %r" % (X_test.shape,))
print("Shape y_test: %r" % (y_test.shape,))
As you can see, the train_test_split function nicely partitions our data into two sets, a train-set and a test-set. Note the stratify argument in the function call, which we set to y; this ensures that the class proportion (here: the ratio of "active" to "passive" trials) is the same in the train-set as in the test-set.
In [ ]:
# Implement the ToDo here
Now it's up to you to actually implement the cross-validated equivalent of the fit/predict procedure we showed you before, in the next ToDo!
In [ ]:
# Implement your ToDo here
If you implemented the ToDo correctly, you should see that the test-set accuracy is quite a bit (about 10%) lower than the train-set accuracy! This test-set accuracy is still slightly biased due to the imbalanced classes, which is discussed in the next section.
As hinted at before, one problem with our dataset is that it is imbalanced. That is, the number of samples differs between classes: we have 32 samples in the "active" class and only 8 samples in the "passive" class.
"Why is that a problem?", you might ask?
Well, suppose that our classifier "learns" to predict every sample as "active", regardless of the underlying patterns. In that scenario, it will still achieve an accuracy of:
\begin{align} accuracy = \frac{32}{40} = 0.8 \end{align}

While 0.8 may seem like a great classification score, in this case it may just be due to our class imbalance! Imbalanced datasets therefore need a different evaluation metric than accuracy. Fortunately, scikit-learn has many more performance metrics you can use, including metrics that "correct" for the (potential) bias due to class imbalance (such as the F1-score, the ROC-AUC score, and Cohen's Kappa).
Another way to "correct" for imbalanced datasets is to either undersample your majority class(es) to match the number of samples in the minority class, or to oversample the minority class(es) to match the number of samples in the majority class. Unlike using "class-imbalance-aware" metrics, which deal with the issue post hoc, under/oversampling actually forces your model to learn on a balanced dataset (which is, in our experience, the most sensible strategy).
As the over/undersampling methods are slightly beyond the scope of this tutorial (but check out this library for implementations of different methods to deal with imbalanced datasets), let's check out what happens with our performance estimate if we use a different ('class-imbalance-aware') metric, like the ROC-AUC score. Like all metrics, this function can be imported from the sklearn.metrics module. Any scikit-learn metric function is called using the following format:

performance = some_metric(y_true, y_predicted)
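For example, with some hypothetical labels (every metric in sklearn.metrics follows this same call signature):
In [ ]:
from sklearn.metrics import roc_auc_score

# Hypothetical true labels and predictions, just to illustrate the call signature
print(roc_auc_score([0, 1, 1, 0, 1], [0, 1, 1, 1, 1]))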
In [ ]:
# The roc-auc-curve function is imported as follows:
from sklearn.metrics import roc_auc_score
# Implement your ToDo here
If you've done the ToDo correctly, you should see that the performance drops to 0.75 when using the ROC-AUC score! While this estimate is thus lower, it is more likely to reflect information in your patterns rather than just the class imbalance of your dataset.
That said, it's, of course, always best to avoid imbalance in the first place, but sometimes this is unavoidable, for example when decoding subjects' responses (which are not under control of the experimenter). But if you have imbalanced classes, make sure to deal with it appropriately!
Sweet! You actually performed your first proper decoding analysis in this section! There are however a couple of things we can do to improve the efficiency and results of our analysis. One thing we can do is to use the data more efficiently in cross-validation. In the previous example, we split the data into two sets and fit the model on one (the train-set) and cross-validated to the other (the test-set). There is actually a way to "re-use" the data by using K-fold cross-validation, in which data is iteratively partitioned in train- and test-sets. This is explained in the next section.
There are two major ways of cross-validation:
In the previous section we implemented a form of hold-out cross-validation, which can be seen as a kind of "one-shot cross-validation". In fMRI data-sets, however, we usually have few samples (simply because MRI-data is expensive to acquire!). Because K-fold cross-validation allows you to reuse the data, it is much more common than hold-out cross-validation in MRI studies.
Now, we can finally use some of scikit-learn's functionality. We are going to use the StratifiedKFold class from scikit-learn's model_selection module. Click the highlighted link above and read through the manual to see how it works.
Importantly, if you're dealing with a classification analysis, always use StratifiedKFold (instead of the regular KFold), because this version makes sure that each fold contains the same proportion of the different classes (here: 0 and 1; just like we did with train_test_split when we set stratify=y).
Anyway, enough talking. Let's initialize a StratifiedKFold object with 5 folds:
In [ ]:
# scikit-learn is imported as 'sklearn'
from sklearn.model_selection import StratifiedKFold
# They call folds 'splits' in scikit-learn
skf = StratifiedKFold(n_splits=5)
Alright, we have a StratifiedKFold object now, but not yet any indices for our folds (i.e. indices to split X and y into different samples). To do that, we need to call the split(X, y) method:
In [ ]:
folds = skf.split(X, y)
Now we created the variable folds, which is officially a generator object, but just think of it as a type of list (with indices) which is specialized for looping over. Each entry in folds is a tuple with two elements: an array with train-indices and an array with test-indices. Let's demonstrate that*:

* Note that you can only run the cell below once. After running it, the folds generator object is "exhausted", and you'll need to call skf.split(X, y) again in the cell above.
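(If you prefer, you can also materialize the folds as a list up front - a small convenience sketch - so the cell below can be re-run without calling split() again:)
In [ ]:
# Optional sketch: store the folds in a list so they can be reused without re-calling split()
folds_list = list(skf.split(X, y))
print(len(folds_list))  # number of folds (here: 5)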
In [ ]:
# Notice how we "unpack" the train- and test-indices at the start of the loop
i = 1
for train_idx, test_idx in folds:
    print("Processing fold %i" % i)
    print("Train-indices: %s" % train_idx)
    print("Test-indices: %s\n" % test_idx)
    i += 1
As you can see, StratifiedKFold yields, for each fold, an array of train-indices and an array of test-indices (remember, Python uses 0-based indexing!). In each fold, the samples corresponding to the test-indices are held out for testing and the rest are used for training.
Now we know how to access the train- and test-indices, but we haven't actually indexed our X and y within the for-loop over folds. Nor have we actually fit the model on the train-set and evaluated it on the test-set.
In [ ]:
# Implement the ToDo here by completing the statements in the loop!
from sklearn.linear_model import LogisticRegression
# clf now is a logistic regression model
clf = LogisticRegression()
# run split() again to generate folds
folds = skf.split(X, y)
performance = np.zeros(skf.n_splits)
for i, (train_idx, test_idx) in enumerate(folds):
    # Complete these statements by indexing X and y with train_idx and test_idx
    X_train =
    y_train =
    X_test =
    y_test =
    # ToDo: call fit (on train) and predict (on test)
    # ToDo: calculate roc-auc-score (or accuracy, but that's biased ...)
    this_performance =
    performance[i] = this_performance
# ToDo: calculate average performance (and print it!)
Implementing K-fold cross-validation instead of hold-out cross-validation allowed us to make our analysis a little more efficient (by reusing samples). Another method that (usually) improves performance in decoding analyses is feature selection/extraction, which is discussed in the next section (after the optional run-wise decoding section).
Contrary to our example dataset, in which we decode single trials within a single fMRI run, it is recommended to spread out your samples across different fMRI runs and perform run-wise cross-validation. This type of cross-validation partitions your samples according to their run; one example, referred to as "leave-one-run-out" cross-validation, partitions your data such that you always train on the trials from all runs minus one, and use the left-out run for testing. A nice property of this type of cross-validation is that it effectively avoids the problem of autocorrelated samples (which is highly likely in fMRI, especially if the ISI is short!). While the samples within each run are still autocorrelated, run-wise cross-validation makes sure that the model is tested on samples from an independent run, thus dealing with bias due to autocorrelation.
Anyway, let's look at an example. Since we don't have a real fMRI dataset with different runs, we're just going to simulate some data. Suppose we have four fMRI runs, each with 20 trials, of which 10 are of class "A" (coded with 0) and 10 of class "B" (coded with 1), and each sample is associated with a pattern of 1000 voxels (let's say the activity pattern in the bilateral amygdala). So, in total we have 80 trials, 40 per class, spread out over 4 runs:
In [ ]:
X_r = np.random.randn(80, 1000)
print("Shape of X: %s" % (X_r.shape, ), '\n')
y_r = np.tile([0, 1], 40)
print("Shape of y: %s" % (y_r.shape, ))
print("Y labels:\n%r" % y_r.tolist(), '\n')
runs = np.repeat([1, 2, 3, 4], 20)
print("Shape of runs: %s" % (runs.shape, ))
print("Run-indices: \n%r" % runs.tolist())
Now, scikit-learn offers a nice cross-validation class that partitions your data according to a "grouping" variable: GroupKFold, or variations thereof like LeaveOneGroupOut or LeavePGroupsOut. Let's check out how that can be used with our simulated data:
In [ ]:
# Import from model_selection module
from sklearn.model_selection import GroupKFold
# In fact, when we initialize GroupKFold with 4 splits, as below, it is exactly the same as
# the LeaveOneGroupOut cross-validator, since we only have 4 groups
gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X=X_r, y=y_r, groups=runs):
    print("Indices of our test-samples: %r" % test_idx.tolist())
    print("... which correspond to following runs: %r" % runs[test_idx].tolist(), '\n')
As you can see in the example above, the GroupKFold cross-validator effectively partitioned our data such that in every fold one group was left out for testing (group 4 in the first fold, group 3 in the second fold, etc.)! So keep in mind to use this type of cross-validation object when you want to employ such a run-wise cross-validation routine (which is recommended anyway, to avoid within-run confounds such as autocorrelation between samples!).
An often heard question about cross-validation is what train/test ratio one should choose in their cross-validation paradigm. Usually, larger train sets (and thus smaller test sets), such as Leave-One-Out-Cross-Validation, lead to a higher average performance across folds, but also a larger variance of performance across folds. Or, in ML terms: a large train/test ratio leads to relatively low bias but high variance.
On the other hand, relatively larger test-sets lead to lower performance estimates on average, but also less variance across folds. As such, it has been proposed to use test-sets of about 10% or 20% of the data (that is, 10-fold or 5-fold cross-validation). Furthermore, instead of doing K-fold cross-validation just once, it has been argued that drawing repeated random splits of, let's say, 10% of your data yields more reliable estimates of generalization performance (see Varoquaux (2017)).
This repeated random splits technique is sometimes referred to as bagging (which stands for bootstrap aggregating), a common "ensembling" technique in machine learning. Scikit-learn provides a class that implements this repeated random splits technique: StratifiedShuffleSplit. If you feel like it, try to implement a cross-validation pipeline using this procedure, as indicated in the next ToDo.
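As a starting point, initializing such a cross-validator might look something like this (100 random 90/10 splits; the exact settings are up to you):
In [ ]:
from sklearn.model_selection import StratifiedShuffleSplit

# Sketch: 100 random splits, each holding out 10% of the samples for testing
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1)
# sss.split(X, y) can then be looped over just like the StratifiedKFold folds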
In [ ]:
from sklearn.model_selection import StratifiedShuffleSplit
# Try implementing the ToDo here
While a complete discussion of (the causes of) the phenomenon of "overfitting" is definitely beyond the scope of this workshop, it is generally agreed upon that a small sample/feature ratio will likely lead to overfitting of your models (i.e. optimistic performance estimates on the train-set). A small sample/feature ratio, unfortunately, is very common in machine learning analyses of neuroimaging (and especially fMRI) data, as the number of features (voxels, MEG sensors) often largely outnumbers the number of samples (trials or subjects).
Given that this small sample/feature ratio (usually) leads to overfitting, it follows that reducing the number of features often has a beneficial effect on cross-validated performance estimates. (Alternatively, you could increase your number of samples, but this is not always feasible for practical and financial reasons...)
Feature reduction can be achieved in two principled ways:
Examples of feature extraction are PCA (i.e. transform voxels to orthogonal components) and averaging features within brain regions from an atlas.
Examples of feature selection are ROI-analysis (i.e. restricting your patterns to a specific ROI in the brain) and "univariate feature selection" (UFS). This latter method is an often-used data-driven method to select features based upon their univariate difference, which is basically like using a traditional whole-brain mass-univariate analysis to select potentially useful features!
Fortunately, scikit-learn has a bunch of feature selection/extraction objects for us to use. These objects ("transformers" in scikit-learn lingo) work similarly to estimators: they also have a fit(X, y) method, in which for example the univariate differences (in UFS) or the PCA-components (in PCA-driven feature extraction) are computed. Then, instead of having a predict(X) method, transformers have a transform(X) method.
As an example of feature selection, let's take a closer look at how UFS can be implemented using scikit-learn. Basically, it follows this structure:

transformer = Selector(score_func=ufs_function, other_args)

Here, the score_func parameter takes a function that implements a univariate statistical test (like an F-test or a chi-square test). The Selector class determines how to select the subset of features based on the result of the score_func.
For example, the SelectKBest selector selects the best K features based on their scores on the statistical test. The following transformer would select the best 2000 features based on their F-score:
In [ ]:
from sklearn.feature_selection import SelectKBest, f_classif
# f_classif is a scikit-learn specific implementation of the F-test
select2000best = SelectKBest(score_func=f_classif, k=2000)
Another example of a selector is SelectFwe, which selects only features with statistic values corresponding to p-values lower than a set alpha-level. For example, the following transformer would only select features with p-values from a chi-square test lower than 0.01:
In [ ]:
from sklearn.feature_selection import SelectFwe, chi2
selectfwe_transformer = SelectFwe(score_func=chi2, alpha=0.01)
But how does this work in practice? We'll show you a (not cross-validated!) example using the select2000best transformer initialized earlier:
In [ ]:
# Fit the transformer ...
select2000best.fit(X, y)
# ... which calculates the following attributes (.scores_ and .pvalues_)
# Let's check them out
scores = select2000best.scores_
pvalues = select2000best.pvalues_
# As you can see, each voxel gets its own score (in this case: an F-score)
print(scores.size)
# and its own p-value:
print(pvalues.size)
We can also visualize these scores in brain space:
In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
scores_3d = scores.reshape((80, 80, 37))
plt.figure(figsize=(20, 5))
for i, slce in enumerate(np.arange(15, 65, 5)):
    plt.subplot(2, 5, (i+1))
    plt.title('X = %i' % slce, fontsize=20)
    plt.imshow(scores_3d[slce, :, :].T, origin='lower', cmap='hot')
    plt.axis('off')
plt.tight_layout()
plt.show()
Thus, in the images above, brighter (more yellow) colors represent voxels with higher scores on the univariate test. If we subsequently apply (or "transform", in scikit-learn lingo) our X-matrix using, for example, the select2000best selector, we'll select only the 2000 "most yellow" (i.e. highest F-score) voxels.
Practically, after we've fit the transformer, we can call the transform(X) method, which will actually select the subset according to the selector:
In [ ]:
print("Shape of X before transform: %r ..." % (X.shape,))
X_after_ufs = select2000best.transform(X)
print("... and shape of X after transform: %r." % (X_after_ufs.shape,))
As you can see, the transformer correctly selected a subset of 2000 voxels from our X matrix! Now, both selectors were fit on the entire dataset, which is of course not how it should be done: because they use information from the labels (y), this step should be cross-validated. Thus, what you have to do is: fit the selector on the train-set only, and then use that fitted selector to transform both the train-set and the test-set, as sketched below.
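A minimal sketch of this cross-validated selection step, reusing the X_train/X_test split from before and a fresh SelectKBest transformer (the names below are ours), could look like this:
In [ ]:
from sklearn.feature_selection import SelectKBest, f_classif

# Sketch: fit the selector on the train-set only ...
selector_cv = SelectKBest(score_func=f_classif, k=2000)
selector_cv.fit(X_train, y_train)

# ... and apply the fitted selector to both the train- and the test-set
X_train_selected = selector_cv.transform(X_train)
X_test_selected = selector_cv.transform(X_test)
print(X_train_selected.shape, X_test_selected.shape)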
We summarized the entire pipeline, including data partitioning (indexing), feature selection (transformation), and model fitting and their corresponding cross-validation steps in the image below:
There are many methods for feature extraction, like "downsampling" to ROI-averages (i.e. averaging voxel patterns within brain regions) and dimensionality-reduction techniques like PCA. Scikit-learn provides some of these techniques as "transformer" objects, which again have a fit() and a transform() method.
Note: since feature extraction techniques like PCA do not strictly use information from the labels (y), you might think that you do not have to cross-validate this step in your pipeline. This is not true. For a completely unbiased generalization error, you need to cross-validate every transformation of your data that in some way depends on the data (like UFS and PCA, but not ROI-based feature selection, since that does not depend on the patterns themselves).
In the next ToDo, you'll have to apply PCA-based feature extraction!
In [ ]:
# Implement your ToDo here!
from sklearn.decomposition import PCA
X_train_tmp, X_test_tmp = train_test_split(X, test_size=0.5)
# initialize a PCA object ...
# ... call fit on X_train_tmp ...
# ... and call transform on X_train_tmp and X_test_tmp
X_train_pca_transformed =
X_test_pca_transformed =
# And finally check out the shape of X_pca_transformed!
In [ ]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
folds = skf.split(X, y)
performance = np.zeros(skf.n_splits)
select1000best = SelectKBest(score_func=f_classif, k=1000)
for i, (train_idx, test_idx) in enumerate(folds):
    # ToDo: make X_train, X_test, y_train, y_test
    # ToDo: call the select1000best fit method (on X_train), and transform both X_train and X_test
    # ToDo: fit model on X_train (transformed) and predict X_test (transformed)
    # also, save your performance (e.g. roc-auc-score) in the "performance" variable (performance[i])
# ToDo: calculate average performance
If you did the above ToDo, your performance is likely a bit higher than in the previous section in which we didn't do any feature selection/extraction!
As you have seen in the previous assignment, the code within the K-fold for-loop becomes quite long (and messy) when you add a feature-selection step. Suppose you want to add another feature extraction step, like performing a PCA after an initial feature selection step. The code then becomes even more obscure ...
Fortunately, scikit-learn has an awesome solution for this: "Pipelines".
Pipelines are somewhat more advanced functionality within scikit-learn, but we wanted to show you because it really "cleans up" your code, and additionally, it is a great safeguard for accidental overfitting (e.g. when you accidentally perform feature-selection on all your data instead of only your train-data).
Anyway, how does this work? Well, a picture is worth a thousand words, so check out the image below, which schematically depicts what a pipeline does:
As you can see, a Pipeline-object "strings together" an arbitrary number of transformers (including "preprocessing" transformers like StandardScaler*) and can optionally end in an estimator. Pipeline-objects have two (relevant) methods: fit(X, y) and predict(X) (the latter method only exists if there is an estimator in the pipeline). To use it, you only have to call fit() with X_train and y_train and it'll sequentially fit the transformers, which finally pass the transformed data to the estimator. Then, simply call predict() with X_test as its argument and the pipeline will automatically apply all fitted transformers (and eventually the estimator) to the X_test variable.
Okay, lots of info - let's break this down. First, let's initialize some transformers and an estimator:
* Feature scaling using e.g. StandardScaler, which we discussed earlier, is generally recommended to cross-validate as well (i.e. apply the mean and std from the train-set to standardize the test-set.)
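(As an aside to this note: a minimal sketch of cross-validated scaling outside of a pipeline, using the X_train/X_test split from before, is shown below.)
In [ ]:
from sklearn.preprocessing import StandardScaler

# Sketch: fit the scaler on the train-set only, then apply the train-set means/stds to both sets
scaler_cv = StandardScaler()
scaler_cv.fit(X_train)
X_train_scaled = scaler_cv.transform(X_train)
X_test_scaled = scaler_cv.transform(X_test)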
In [ ]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
scaler = StandardScaler()
ufs = SelectKBest(score_func=f_classif, k=1000)
pca = PCA(n_components=10) # we want to reduce the features to 10 components
svc = SVC(kernel='linear')
Now, to initialize a Pipeline-object, we need to give it a list of tuples, in which the first entry of each tuple is a name for the step in the pipeline and the second entry is the actual transformer/estimator object. Let's do that for our pipeline:
In [ ]:
from sklearn.pipeline import Pipeline
pipeline_to_make = [('preproc', scaler),
('ufs', ufs),
('pca', pca),
('clf', svc)]
my_pipe = Pipeline(pipeline_to_make)
Let's test our pipeline-object (my_pipe) on our data. For this example, we'll use a simple hold-out cross-validation scheme (but pipelines are equally valuable in K-fold CV schemes!).
In [ ]:
X_train, y_train = X[0::2], y[0::2]
X_test, y_test = X[1::2], y[1::2]
my_pipe.fit(X_train, y_train)
predictions = my_pipe.predict(X_test)
performance = roc_auc_score(y_test, predictions)
print("Cross-validated performance on test-set: %.3f" % performance)
Cool stuff huh? Quite efficient, I would say!
In [ ]:
# Implement the ToDo here
from sklearn.feature_selection import VarianceThreshold
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier
Up to now, we examined the model performance of decoding the experimental condition from the fMRI data of a single subject. When averaging over folds, we got one scalar estimate of performance. Usually, though, within-subject decoding studies measure multiple subjects and want to test whether their hypothesis - which is usually that decoding is above chance - generalizes to the population. Chance-level decoding depends on the specific metric you use; for accuracy, for example, chance level equals $\frac{1}{P}$, where $P$ is the number of classes (e.g. 0.5 for two classes, 0.25 for four classes). Decoding with binary classes therefore has a chance level of 50% for accuracy. However, as discussed before, accuracy is biased in scenarios with imbalanced classes, so we should use a 'class-imbalance-aware' metric - such as the F1-score or the ROC-AUC score - instead.
Anyway, let's assume we use the ROC-AUC score as our performance metric (its chance level is also 0.5); our group-level null ($H_{0}$) and alternative ($H_{a}$) hypotheses then become:
Given that we obtained performance estimates from a group of subjects, which should yield independent estimates, we can simply do a one-sample t-test of our performance estimates against chance (0.5).* To do so, we first need to loop our decoding analysis across all our participants to get a sample of decoding performances.
* Note: Last year, Allefeld and colleagues published a paper in which they show that a second-level analysis using a t-test (or other parametric statistics) on "first-level" per-subject performance estimates does not entail a random-effects analysis, but rather a fixed-effects analysis, and thus does not warrant inference to the population. This is because true performance estimates cannot possibly be below chance. We haven't had time to implement this ourselves for this workshop (we're working on implementing it in skbold), so we recommend checking out the paper yourself!
In [ ]:
participant_numbers = # Fill in the right glob call here!
# Next, we need to extract the participant numbers from the paths you just obtained. We do this for you here.
participant_numbers = [x.split(op.sep)[-1] for x in participant_numbers]
print(participant_numbers)
In [ ]:
with open(op.join('example_data', 'subject_names'), 'r') as f:
    participant_numbers = [s.replace('\n', '') for s in f.readlines()]
We can now use this participant_numbers list to loop over the participant names!
In [ ]:
### Set your analyis parameters/pipeline ###
# ToDo: initialize a stratified K-fold class (let's use K=5)
skf =
# ToDo: Initialize the SelectKBest Selector with an f_classif score function
select100best =
# ToDo: initialize an SVC-object
clf =
# Optional: build a pipeline-object!
# Keep track of all performance estimates
all_performance = np.zeros(len(participant_numbers))
### Start loop over subjects ###
for i_sub, participant_number in enumerate(participant_numbers):
# For each participant, get all tstat file paths. We did this for a single subject in Section 1.3.1!
path_this_participant = op.join('..', 'data', participant_number, 'wm.feat', 'stats', 'tstat*.nii.gz')
t_stat_paths_this_participant = sort_nifti_paths(glob(path_this_participant))
# Then, we need to create the pattern matrix X for each participant. Remember how we did this in Section 1.3.1?
# For each participant, we need to get the data dimensions. Let's extract these from the nifti header.
voxel_dims = nib.load(t_stat_paths_this_participant[0]).header.get_data_shape()
X = np.zeros((len(t_stat_paths_this_participant), np.prod(voxel_dims)))
# Load the data of each trial separately. This was the last thing we did as part of Part 1.
for trial, tstat_path in enumerate(t_stat_paths_this_participant):
data = nib.load(tstat_path).get_data()
data = data.ravel()
X[trial,:] = data
# ToDo: scale the patterns (or optionally: cross-validate the scaler by fitting on train)
# All data of the current participant is now loaded! We can start the pattern analysis for this participant now
# Keep track of performance between folds
performance_this_participant = np.zeros(skf.n_splits)
# Loop over the folds
for i_fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
# ToDo: make X_train, X_test, y_train, y_test
# ToDo: call the select100best fit method (on X_train), and transform both X_train and X_test
# ToDo: fit your model on the train data
# ToDo: predict the classes of your test data
# ToDo: calculate the performance (ROC-AUC) in this fold
performance =
performance_this_participant[i_fold] = performance
# ToDo: calculate the average performance (mean across folds!)
mean_performance_this_participant =
print('Mean performance for participant %s: %.3f' % (participant_number, mean_performance_this_participant))
all_performance[i_sub] = mean_performance_this_participant
print('Final performance mean (std): %.3f (%.3f)' % (all_performance.mean(), all_performance.std()))
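If you get stuck on the ToDos above, the sketch below shows one way to fill in the missing pieces. It assumes the pattern matrix X and the labels y are built per participant exactly as in the cell above, and follows the suggested choices of K=5 folds, the 100 best features, and a linear SVM.
In [ ]:
# One possible way to fill in the ToDos (sketch; assumes X, y and np are available as above)
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

skf = StratifiedKFold(n_splits=5)
select100best = SelectKBest(score_func=f_classif, k=100)
clf = SVC(kernel='linear')

performance_this_participant = np.zeros(skf.n_splits)
for i_fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Cross-validate the scaler and the selector: fit on train, transform both train and test
    scaler = StandardScaler().fit(X_train)
    X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
    select100best.fit(X_train, y_train)
    X_train, X_test = select100best.transform(X_train), select100best.transform(X_test)

    # Fit the classifier, predict the test set, and compute ROC-AUC for this fold
    clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    performance_this_participant[i_fold] = roc_auc_score(y_test, predictions)

mean_performance_this_participant = performance_this_participant.mean()
print('Mean performance: %.3f' % mean_performance_this_participant)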
If everything went well, you now have an array all_performance which contains the performance estimates (ROC-AUC, averaged over folds) of all participants! This is all the data we need to do a one-sample t-test.
The one-sample t-test is implemented in the Python package SciPy. Let's import it first:
In [ ]:
from scipy.stats import ttest_1samp
The function requires two arguments: the vector of values to test and the hypothesized population mean. In our case, the data vector is all_performance, and the population mean is 0.50. The ttest_1samp function returns a tuple with the t-value and the p-value.
In [ ]:
# Example answer
t, p = ttest_1samp(all_performance, 0.5)
print('The t-value is %.3f, with a p-value of %.5f' % (t, p))
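A small side note: ttest_1samp performs a two-sided test by default, whereas our alternative hypothesis is one-sided (performance above 0.5). Recent SciPy versions accept an alternative='greater' argument; if yours doesn't, a common shortcut is to halve the two-sided p-value when the t-value is positive:
In [ ]:
# Convert the two-sided p-value to a one-sided p-value for H_a: mean performance > 0.5
p_one_sided = p / 2 if t > 0 else 1 - p / 2
print('One-sided p-value: %.5f' % p_one_sided)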
Finally, we can draw our conclusions: can we decode trial condition from the fMRI data with above-chance performance? (Yes we can! But remember, the results are probably positively biased due to autocorrelated samples and, as discussed earlier, this t-test is not a proper random-effects analysis!)
We hope you enjoyed the workshop. It should be stressed that these are really just the basics of MVPA - every step that was introduced warrants its own workshop (feature selection, feature extraction, model selection, cross-validation, significance testing - and much more that we did not discuss, such as regularization, hyperparameter optimization/grid search, and feature visualization). You are encouraged to discover what's possible with MVPA on your own!
Good luck using MVPA!
If you want some more background information/material on machine learning in the context of neuroimaging, we recommend the following articles:
skbold for your machine learning analyses
As mentioned earlier, we developed skbold, a Python package that offers a set of tools to make machine learning analyses of fMRI data easier. In what follows, we showcase the functionality of skbold at different stages of the typical machine learning pipeline!
As you've seen, there is a lot of "boilerplate" code involved in loading in and structuring the patterns and corresponding labels. In skbold we created a custom object, the Mvp (multivoxel pattern object), which makes the pattern structuring and preprocessing trivially easy. There are two flavours of Mvp-objects: the MvpWithin object and the MvpBetween object, which are appropriate for within-subject and between-subject decoding analyses, respectively. Given that our example dataset is a within-subject dataset, let's look at an example of how to use MvpWithin.
First, we load in the class from skbold.core:
In [ ]:
# we'll reset our namespace to free up some memory
%reset -f
from skbold.core import MvpWithin
import os.path as op
Now, the MvpWithin class is able to quickly load and transform data from FSL-FEAT (i.e. first-level) directories. It looks for patterns in the stats directory (given a specific statistic) and loads condition labels from the design.con file, which are all standard contents of FEAT-directories.
Additionally, we can specify a mask to restrict the patterns to a subset of voxels. This mask can be in MNI-space or native EPI-space - skbold handles the registration between these two spaces (but only if FSL is installed!). We can also choose to remove voxels that have a value of 0 across all samples (these are probably located outside the brain).
We need to specify these parameters upon initialization of an MvpWithin object:
In [ ]:
feat_dir = op.join('..', 'data', 'pi0070', 'wm.feat')
mask_file = None # only works if you have FSL installed (if so, uncomment the next line!)
# UNCOMMENT THE NEXT LINE IF YOU HAVE FSL INSTALLED!
#mask_file = op.join('..', 'data', 'GrayMatter_prob.nii.gz') # mask all non-gray matter!
read_labels = True # parse labels (targets) from design.con file!
ref_space = 'epi' # extract patterns in functional space (alternatively: 'mni')
statistic = 'tstat' # use the tstat*.nii.gz files (in *.feat/stats) as patterns
remove_zeros = True # remove voxels which are zero in each trial
mvp = MvpWithin(source=feat_dir, read_labels=read_labels, ref_space=ref_space,
statistic=statistic, remove_zeros=remove_zeros, mask=mask_file)
Alright, now we can simply call the create() method to start the pattern loading/structuring process!
In [ ]:
mvp.create()
After calling create(), the Mvp-object has a couple of 'new' attributes! Let's check them out.
In [ ]:
print("The attribute .X represents our samples-by-features matrix of shape %s" % (mvp.X.shape,))
print("The attribute .y represents our targets (y) of shape %s" % (mvp.y.shape,))
As you can see, these are exactly the patterns (X) and labels (y) which we created manually earlier in the workshop (except that X contains fewer features due to the setting remove_zeros=True). We can also inspect the names of the patterns (as parsed from the design.con file):
In [ ]:
print(mvp.contrast_labels)
Another feature of skbold is that it offers some neuroimaging-specific transformers (implemented in the same way as scikit-learn transformers). Let's look at, for example, the ClusterThreshold class - a transformer that applies a (3D) cluster-thresholding procedure on top of univariate feature selection and subsequently returns the cluster-average values as features (i.e. a dimensionality reduction procedure that reduces voxels to cluster-averages).
Let's check it out:
In [ ]:
from skbold.feature_extraction import ClusterThreshold
from sklearn.feature_selection import f_classif
clt = ClusterThreshold(mvp=mvp, min_score=10, selector=f_classif)
We initialized our ClusterThreshold object to perform an initial threshold at min_score=10. The voxels that "survived" this threshold are subsequently clustered and averaged within clusters. Below, we'll show that the API is exactly the same as scikit-learn's transformers:
In [ ]:
from sklearn.model_selection import train_test_split
# Let's cross-validate our ClusterThresholding procedure (which you should always do!)
X_train, X_test, y_train, y_test = train_test_split(mvp.X, mvp.y, test_size=0.25)
print("Shape of X_train before cluster-thresholding: %s" % (X_train.shape,))
print("Shape of X_test before cluster-thresholding: %s" % (X_test.shape,), '\n')
# Let's fit and transform
clt.fit(X_train, y_train)
X_train_clt = clt.transform(X_train)
print("Shape of X_train after cluster-thresholding: %s" % (X_train_clt.shape,))
X_test_clt = clt.transform(X_test)
print("Shape of X_test after cluster-thresholding: %s" % (X_test_clt.shape,))
Skbold has many more transformers, such as RoiIndexer, which indexes patterns given a certain mask/ROI. It doesn't matter whether the patterns are in EPI-space and the mask/ROI is in MNI-space; skbold registers the mask/ROI from one space to the other accordingly. (It needs FSL for this, and as we don't know whether you have this installed, we won't showcase it here.)
Apart from the Mvp object, skbold also includes the MvpResults object, which is designed to keep track of your pipeline's performance and feature weights across different folds. You can keep track of any (scikit-learn) metric (like F1-score, accuracy, ROC-AUC, etc.). Keeping track of feature weights comes in three 'flavors': the option "ufs", which keeps track of the 'univariate' feature scores (only applicable if you actually use univariate feature selection in your pipeline); the option "fwm" ('feature weight mapping', see this article), which keeps track of the raw feature weights; and the option "forward", which keeps track of the corresponding forward-model weights given the feature weights (see this article).
Anyway, let's look at how we should initialize such an object:
In [ ]:
from skbold.postproc import MvpResults
from sklearn.metrics import accuracy_score, f1_score
mvpr = MvpResults(mvp=mvp, n_iter=5, feature_scoring='forward', confmat=True,
accuracy=accuracy_score, f1=f1_score)
Importantly, the MvpResults class needs an Mvp object upon initialization to extract some meta-data, and it needs to know how many folds (n_iter) we're going to keep track of (here we assume we'll do 5-fold CV). We also indicate that we want to keep track of the confusion matrices across folds (confmat=True), and after that we specify a couple of metrics (you can pass any number of metrics in the form name_metric=sklearn_metric_function).
Now, we can build an ML pipeline and update our MvpResults object after each fold. Importantly, as you'll see below, to correctly extract feature weights, you need to pass a Pipeline object when calling the method update().
Let's first import some things for our pipeline and define the pipeline and CV-scheme:
In [ ]:
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold
pipe_line = Pipeline([('scaler', StandardScaler()),
('ufs', SelectKBest(score_func=f_classif, k=1000)),
('clf', SVC(kernel='linear'))
])
skf = StratifiedKFold(n_splits=5)
Now we can implement our analysis and simply call mvpr.update() after each fold:
In [ ]:
for i, (train_idx, test_idx) in enumerate(skf.split(mvp.X, mvp.y)):
print("Processing fold %i / %i" % (i+1, skf.n_splits))
X_train, X_test = mvp.X[train_idx], mvp.X[test_idx]
y_train, y_test = mvp.y[train_idx], mvp.y[test_idx]
pipe_line.fit(X_train, y_train)
pred = pipe_line.predict(X_test)
mvpr.update(test_idx=test_idx, y_pred=pred, pipeline=pipe_line)
We can check out the results of our analysis by calling the compute_scores() method:
In [ ]:
performance, feature_scores = mvpr.compute_scores()
This prints out the mean and standard deviation of our metrics across folds and the number of voxels that were part of the analysis. We can check out the per-fold performance by looking at the first returned variable (here: performance):
In [ ]:
performance
Also, we can check out the feature scores (here: the "forward" model corresponding to the classifier weights), which are returned here as feature_scores. This is a nibabel Nifti-object, which we can check out using matplotlib:
In [ ]:
import nibabel as nib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(20, 5))
scores_3d = feature_scores.get_data()
background = op.join('..', 'data', 'pi0070', 'wm.feat', 'reg', 'example_func.nii.gz')
background = nib.load(background).get_data()
for i, slce in enumerate(np.arange(20, 60, 5)):
plt.subplot(2, 4, (i+1))
plt.title('X = %i' % slce, fontsize=20)
to_plot = np.ma.masked_where(scores_3d[slce, :, :] == 0, scores_3d[slce, :, :])
plt.imshow(background[slce, :, :].T, origin='lower', cmap='gray')
plt.imshow(to_plot.T, origin='lower', cmap='hot')
plt.axis('off')
plt.tight_layout()
plt.show()
Almost always you will have more than one subject, so what you can do is loop over subjects, initialize a new MvpResults object for every subject, and store these objects in a list. Once the loop over subjects is completed, simply initialize an MvpAverageResults object and call compute_statistics(), which we'll show below:
In [ ]:
from glob import glob
feat_dirs = glob(op.join('..', 'data', 'pi*', 'wm.feat'))
n_folds = 5
mvp_results_list = []
for feat_dir in feat_dirs:
print("Subject: %s" % feat_dir)
mvp = MvpWithin(source=feat_dir, read_labels=read_labels,
ref_space=ref_space, statistic=statistic,
remove_zeros=remove_zeros,
mask=mask_file)
mvp.create()
mvpr = MvpResults(mvp=mvp, n_iter=5, feature_scoring='forward', confmat=True,
accuracy=accuracy_score, f1=f1_score)
for train_idx, test_idx in skf.split(mvp.X, mvp.y):
X_train, X_test = mvp.X[train_idx], mvp.X[test_idx]
y_train, y_test = mvp.y[train_idx], mvp.y[test_idx]
pipe_line.fit(X_train, y_train)
pred = pipe_line.predict(X_test)
mvpr.update(test_idx=test_idx, y_pred=pred, pipeline=pipe_line)
mvp_results_list.append(mvpr)
In [ ]:
from skbold.postproc import MvpAverageResults
mvpr_average = MvpAverageResults()
subjects = [op.basename(op.dirname(f)) for f in feat_dirs]
results = mvpr_average.compute_statistics(mvp_results_list, identifiers=subjects, metric='f1', h0=.5)
results
This section demonstrated some of the functionality of skbold. If you have any questions, suggestions, or want more information about skbold - feel free to ask (or email!). Details on skbold and its API can be found on its page on ReadTheDocs!