ICON 2017 workshop: MVPA of fMRI data in Python

Welcome to this workshop! In the upcoming couple of hours, you'll work through this (Jupyter) "notebook" - a sort of code editor in your browser that allows mixing of code (Python) and text (markdown) cells. Originally, this notebook was developed for the Research Master Psychology course Neuroimaging: Pattern Analysis at the University of Amsterdam, but it was adapted and extended for this ICON workshop.

Scope

In general, this notebook is designed for people who are relatively unfamiliar with MVPA and/or its implementation in Python. We do assume that you are familiar with analyzing (f)MRI data; the purpose of this workshop is to "extend your analysis toolbox", so to speak. As such, it's definitely helpful if you have some knowledge of Python's syntax, but it's not strictly necessary. Also, we assume participants are relatively unfamiliar with machine learning (ML) concepts and their implementation; if you're an ML guru, this is probably not the right workshop for you...

At certain points in the tutorial, we've included some more "advanced" material, but feel free to skip this. Also, realize that this tutorial is about the concepts behind MVPA and how to implement them using domain-general Python tools, such as scikit-learn. Importantly, this tutorial is not about how to use MVPA-specific frameworks such as PyMVPA or Nilearn, although we included a short section on skbold, our own "machine learning for fMRI"-package that is built around scikit-learn.

Also, there is actually no single "multivoxel pattern analysis"; rather, MVPA is an umbrella term referring to several kinds of pattern-based analyses. As this tutorial cannot cover all such analyses, we limit the scope to machine learning (classification-based) analyses of fMRI data based on within-subject designs (leaving out between-subject decoding, regression-based analyses, unsupervised learning, etc.). We hope to cover this ground sufficiently well for you to be able to explore other kinds of analyses on your own.

Learning goals of this workshop

In this workshop, we cover the basics and implementation of multivoxel pattern analysis (MVPA) to analyze (functional) MRI data in Python. After this tutorial, you will be able to:

  • Understand the differences between within-subject and between-subject pattern analysis;
  • Know how to extract/estimate patterns from fMRI data;
  • Implement different feature extraction and selection methods using scikit-learn;
  • Build cross-validated MVPA pipelines using scikit-learn;
  • Evaluate your models.

In short, this tutorial will show how to build your own complete, cross-validated MVPA pipeline for fMRI data.

A note on the data we use

The dataset we use for this workshop was chosen out of convenience and is not an example of an optimally designed dataset for MVPA-based analyses. Although this workshop is not about optimal design strategies for MVPA, we point out shortcomings of the example dataset wherever appropriate.

Contents of this notebook

We designed this notebook to be interactive. That is, you'll encounter several "ToDos" (yellow boxes) and "ToThinks" (blue boxes) that describe short programming exercises (ToDos) and questions for you to think about (ToThinks), which are aimed at improving your understanding of the material discussed so far. After each ToDo, there is a code block that tests whether your answer is correct. Correct answers to the ToDos and ToThinks are included in the answers notebook (but try it yourself first!). If you're not very comfortable/experienced with Python, or if you just don't feel like it, feel free to skip the ToDos and ToThinks, of course. Also, feel free to ask us questions during this workshop!

The tutorial is split into two parts. In the first part, we cover the basics of experimental designs and illustrate how to extract patterns from raw fMRI data in a way that allows MVPA to be applied. In the second part, the major steps in MVPA (ML-based) pipelines are discussed: feature extraction and selection, cross-validation, and model evaluation. We challenge you to optimize your pipelines to maximize your decoding accuracy scores!

Have fun and good luck!


Using Jupyter notebooks

This tutorial is implemented in a "Jupyter notebook", which includes "code cells" - which are able to run Python code - and "text cells" - which render Markdown text. For this tutorial, it suffices to know how to run cells, which you can do by selecting the (code) cell and pressing CTRL + ENTER or pressing the "play" button in the toolbar on the top of the page. For more info on Jupyter notebooks, check the python tutorial notebook.

Before you go on: did you download the data?

Did you already download the data, as described here? Check it by running the cell below (CTRL + ENTER)!


In [ ]:
import os
if os.path.isdir(os.path.join('..', 'data')):
    print("You already downloaded the data, nice! Continue with Part 1!")
else:
    print("Seems you don't have the data yet ... Go to lukas-snoek.com/ICON2017 for instructions on how to download it.")

Part 1: Experimental design, pattern estimation, and data representation

Before you can do any fancy machine learning analysis (or any other pattern analysis), there are several decisions you need to make and steps to take in (pre)processing and structuring your data. Roughly, there are three steps to take:

  1. Design your study in a way that's appropriate to answer your question through a pattern analysis; this, of course, needs to be done before data acquisition!
  2. Estimate/extract your patterns from the (functional) MRI data;
  3. Structure and represent your data appropriately for pattern analyses;

While we won't go into all the design factors that make for an efficient pattern analysis (see this article for a good review), we will now discuss/demonstrate some design considerations and how they impact the rest of the MVPA pipeline.

1.1. Designs for pattern analysis

There are many ways in which you can categorize different types of pattern analyses, but one of the most basic categorizations is in terms of whether analyses are within-subject or between-subject. The major distinction revolves around whether you want to investigate an (experimental) factor that varies or is manipulated within subjects or that varies across subjects (i.e. individual differences or experimental between-subject designs)*.


* N.B.: there are also "hybrid" forms, such as when you decode a within-subject factor (e.g. "negative vs. positive images") from condition-average patterns across subjects. That is, each subject "contains" two patterns (one "negative image" pattern, one "positive image" pattern) and these patterns are decoded across subjects.

**ToThink**: Decide for the following scenarios whether they correspond to a **within-subject** or **between-subject** design!

1. Subjects view a sequence of negative (violence, mutilation, war scenes) images during fMRI acquisition. One group of subjects is administered [propranolol](https://en.wikipedia.org/wiki/Propranolol) before the fMRI session and another group a placebo. The researcher in charge wants to see whether patterns in the insula change in response to propranolol using a pattern-based analysis.

2. Subjects view images of scenes with either animate or inanimate objects during fMRI acquisition. The researcher wants to decode "animacy" from patterns in the lateral occipital cortex.

Importantly, these two types of analyses differ in what is regarded as an instance of a pattern (also referred to as a sample):

  • in within-subject analyses, each instance of your feature-of-interest represents one pattern (e.g., each single trial/stimulus/response);
  • in between-subject analyses, each subject represents one pattern.

This distinction is important, because the choice of type of pattern analysis has major consequences for the design of your experiment and the methods available for pattern extraction. Throughout this tutorial, we focus on within-subject analyses, using a simple two-class working memory data-set.

1.2. Pattern estimation for within-subject designs

There are different ways to estimate your patterns for pattern-based analyses. Suppose you want to decode neural representations of faces from representations of houses. In that case, you need to estimate a pattern for each face and each house. Now, suppose we show a face at 9 seconds; we can "extract" the activity at each voxel for that stimulus in three ways (see the figure below; a small code sketch of options 1 and 2 follows the list):

  1. Extract just a point estimate of activity at, e.g., 6 seconds post-stimulus (expected HRF peak);
  2. Average across a range of points;
  3. Fit an HRF per stimulus and extract the corresponding $\beta$-parameter;
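
To make options 1 and 2 a bit more concrete, here is a minimal, made-up sketch (the time series, TR, and time windows below are assumptions for illustration only, not part of the actual tutorial data):


In [ ]:
import numpy as np

# Hypothetical single-voxel time series: one value per TR (assumed TR = 2 seconds)
tr = 2.0
timeseries = np.random.randn(100)
onset_sec = 9.0  # stimulus onset in seconds, as in the face example above

# Option 1: point estimate ~6 seconds post-stimulus (around the expected HRF peak)
peak_idx = int(round((onset_sec + 6.0) / tr))
point_estimate = timeseries[peak_idx]

# Option 2: average over a post-stimulus window (here: 4-8 seconds after onset)
start_idx = int(round((onset_sec + 4.0) / tr))
stop_idx = int(round((onset_sec + 8.0) / tr)) + 1
window_average = timeseries[start_idx:stop_idx].mean()

print(point_estimate, window_average)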

As method 3 (fitting a separate HRF per stimulus) has been shown to yield the most stable pattern estimates (Misaki et al., 2010), we'll focus on that method in the rest of this section.

We will extract patterns of $\beta$-estimates by fitting a hemodynamic response function (HRF) per trial using the general linear model (GLM)*. In other words, "trials" (i.e. instances of your feature-of-interest) in within-subject designs are modelled as separate regressors in a first-level analysis. Below, we included an image of a single-trial design (of the hypothetical faces vs. houses experiment) as created in FSL:


* Note: This single-trial design is also referred to as a "Least Squares All" (LSA) approach. Alternative pattern estimation designs exist, including "Least Squares Single" (LSS); the choice for one or the other depends on the trial-by-trial variability and scan noise (Abdulrahman & Henson, 2017).

As you can see, each trial gets its own regressor (hence the name "single-trial design"). Then, as depicted below the design matrix, a contrast-against-baseline is created for each regressor (trial). After you run a first-level analysis using this design, you'll have, for each trial, a whole-brain map of statistic values ($\beta$-values, t-values, or z-values) that represents the trial's estimated (whole-brain) pattern. Usually, t-values or z-values are used instead of $\beta$-values (cf. Misaki et al., 2010).
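
A small aside to make the contrast part concrete: if the design contains only the trial regressors (ignoring any nuisance regressors), the full set of contrasts-against-baseline is simply an identity matrix, with one row per trial containing a 1 for that trial's own regressor and 0 elsewhere. A minimal sketch, assuming 40 trials as in the working-memory data used later:


In [ ]:
import numpy as np

n_trials = 40
# One contrast-against-baseline per trial: 1 for the trial's own regressor, 0 for all others
contrasts = np.eye(n_trials)
print(contrasts.shape)  # (40, 40)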

Importantly, this design thus specifies that the activation per voxel is estimated for each "trial", i.e., each trial gets its own activation map.

Before you go on, make sure you understand this image! This image represents basically all you need to understand about single-trial designs.

1.3. Implementing a single-trial design

To further illustrate how single-trial designs work, let's create our own single-trial design matrix corresponding to a (real) working memory experiment. In this experiment, one condition ("ACTIVE") required subjects to remember a configuration of bars; after a retention period, they had to respond whether one of the bars had changed in the test-image or not. In the other condition ("PASSIVE") they just watched a blank screen and had to respond with a random answer. The experiment is depicted schematically below:

In total, subjects performed 40 trials, of which 32 were of the "ACTIVE" condition and 8 were of the "PASSIVE" condition*. Next, we'll generate a single-trial design that aims to estimate the pattern for each trial.


* Note: this data-set was not necessarily meant to be analysed with MVPA, as it has some sub-optimal factors w.r.t. pattern analysis, such as class imbalance (32 active vs. 8 passive trials) and the fact that all "instances" (trials) are contained in a single run, whereas it is often recommended to spread out your trials across different runs (see e.g. this study); this will be discussed briefly in the section on cross-validation.

Below, we'll load the trial onset times (and trial durations and conditions).


In [ ]:
# First, we need to import some Python packages
import numpy as np
import pandas as pd
import os.path as op
import warnings
import matplotlib.pyplot as plt
plt.style.use('classic')
warnings.filterwarnings("ignore")
%matplotlib inline

In [ ]:
# The onset times are loaded as pandas dataframe with three columns: 
# onset times (in seconds) (column 1), durations (column 2), and conditions (column 3).
# N.B.: condition 0 = passive, condition 1 = active
stim_info = pd.read_csv(op.join('example_data', 'onsets.csv'), sep='\t',
                        names=['onset', 'duration', 'trial_type'])
stim_info

Remember, the onsets (and durations) are here defined in seconds (not TRs). Let's assume that the fMRI-run has a TR of 2 seconds. Now, we can convert (very easily!) the onsets/durations-in-seconds to onsets/durations-in-TRs.


In [ ]:
# We repeat loading in the dataframe to avoid dividing the onsets by 2 multiple times ...
stim_info = pd.read_csv(op.join('example_data', 'onsets.csv'), sep='\t',
                        names=['onset', 'duration', 'trial_type'])

# This is where we divide the onsets/durations by the TR
stim_info[['onset', 'duration']] = (stim_info[['onset', 'duration']] / 2).astype(int)

To perform the first-level analysis, we need to create, for each trial, a regressor of zeros and ones, in which the ones represent the moments at which that particular trial was presented. Let's assume that our moment of interest is the encoding phase, which lasts only 2 seconds; we can thus model it as an "impulse".

So, for example, if you have a (hypothetical) run with a total duration of 15 TRs, and you show a stimulus at TR=3 for the duration of 1 TR (i.e. 2 seconds), then you'd code your regressor as:

[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

**ToDo**: Below, we initialized a stimulus vector (`stim_vec`) of shape=[162, 40], i.e. timepoints x trials (this run was 162 TRs long), with zeros. Each of the 40 columns represents one trial. Loop over the columns of the `stim_vec` matrix and fill the times at onset with a 1. Remember, the first index in Python is zero (not 1!).

In [ ]:
# ToDo
n_timepoints = 162
n_trials = 40
stim_vec = np.zeros((n_timepoints, n_trials))

# Fill the stim_vec variable with ones at the indices of the onsets per trial!

Check your ToDo

Run the cell below to check whether you implemented the ToDo correctly! If it gives no errors, your implementation is correct!


In [ ]:
np.testing.assert_array_equal(stim_vec, np.load(op.join('example_data', 'stim_vec.npy')))
print("Well done!")

Skip this ToDo?

If you want to skip this ToDo, load the correct stim_vec variable below, because we'll need that for the rest of the tutorial!


In [ ]:
stim_vec = np.load(op.join('example_data', 'stim_vec.npy'))

Visualizing the design

Now, we only need to convolve an HRF with the stimulus-vectors (using the numpy function convolve()) and we'll have a complete single-trial design!* Don't worry, we do this for you. We'll also plot it to see how it looks (blue = active trials, red = passive trials):


In [ ]:
from functions import double_gamma

hrf = double_gamma(np.arange(162))

# List comprehension (fancy for-loop) + stack results back to a matrix
X = np.vstack([np.convolve(hrf, stim_vec[:, i], 'full')[:162] for i in range(40)]).T

plt.figure(figsize=(30, 10))
for plot in range(40):
    is_active = True if stim_info['trial_type'][plot] == 1 else False
    plt.plot(X[:, plot] + plot, np.arange(X.shape[0]), c='blue' if is_active else 'red', lw=3)
    plt.text(plot - 0.5, -15, 'active' if is_active else 'passive',
             rotation=45, fontsize=20)

plt.ylabel('Time (TR)', fontsize=40)
plt.xlabel('Regressors (trials)', fontsize=40)
plt.xlim((-1, 40))
plt.ylim((0, 160))
ax = plt.gca()
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.invert_yaxis()
plt.show()

* Note: This specific data-set does not have a proper design for a machine-learning analysis on single-trial patterns, because the short ISI leads to temporal autocorrelation between subsequent trials and may thus result in inflated accuracy scores (cf. Mumford et al., 2014). It is generally recommended to use a long ISI (> 6 seconds) in combination with fully randomized (and class-balanced) trials, or to perform run-wise cross-validation (something we'll revisit in the section on cross-validation).


The GLM that evaluates this design will yield a $\beta$-estimate per trial per voxel. In other words, every trial is associated with a specific (whole-brain) pattern of $\beta$-values.

Usually, though, t-values instead of $\beta$-values are used, because it has been shown that converting $\beta$-values to t-values (also called univariate normalization) often creates more stable and robust patterns. $\beta$-values can be normalized to t-values by simply dividing the parameter estimate ($\beta$) by its corresponding standard error. We won't need to do this ourselves in this tutorial (FSL does it for us), but the sketch below illustrates the idea.
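
A minimal sketch of this univariate normalization, using made-up arrays (in practice, FSL or your GLM software provides the $\beta$-estimates and standard errors for you):


In [ ]:
import numpy as np

# Made-up GLM output for one trial: a beta and a standard error per voxel
betas = np.random.randn(1000)
standard_errors = np.random.rand(1000) + 0.5  # strictly positive, made-up standard errors

# Univariate normalization: t = beta / SE(beta)
t_values = betas / standard_errors
print(t_values.shape)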

Note: realize that we don't interpret the patterns of t-values in terms of statistical significance. This is because we only use t-values for the goal of efficient pattern estimation, not for hypothesis testing like in mass-univariate analyses! As such, we're not thresholding or applying any multiple-comparisons correction!

1.4. Visualizing patterns

So, now you know how to create a "single-trial design" for pattern analyses! As a short summary:

  • Model each "trial" (or more generally, "instance") as a separate HRF-convolved regressor;
  • Additionally define a contrast-matrix in which each trial is contrasted against baseline in order to create t-value/z-value patterns;

**ToDo**: We ran the first-level analysis of the single-trial working-memory design outlined above for fifteen subjects using FSL (which conveniently also calculates t-values from the $\beta$-values). You can find the results in the directory: `../data/{subject_number}/wm.feat`. Pick one of the subjects, and check out the `stats` subdirectory. You will see forty tstat-niftis, each containing a whole-brain pattern of *t*-stats of a single trial. These contain the data we will use in this tutorial for MVPA.

Alright, you now know what a single-trial (within-subject) design looks like and what it produces (i.e. single-trial pattern estimates in the form of whole-brain $\beta$/t-stat/z-stat maps). Before we go on, we will show what a pattern of a single trial looks like in fslview. This will hopefully give you some more "intuition" on what is meant with a single trial pattern.

**If you have FSL installed on your computer, you can do this yourself!** Open up a new terminal or terminal-tab and start fslview (by typing `fslview` in the terminal). Now, click "File" > "Open", navigate to `data/pi0070/wm.feat/stats` and select `tstat1.nii.gz`. Now, to visualize the patterns somewhat more intuitively, set "Min" to 0 and "Max" to 5 in the fslview header. Then, click the little circle with the blue "i":

Under the "lookup table options", select the "Red-Yellow" colormap. What you visualize here are all the voxels that activate during this particular trial. However, a pattern is not necessarily only the voxels that *activate*, but also those that *deactivate*. A pattern is *any* estimated response of the brain during an instance of a sample (here: a trial). Therefore, let's also visualize the deactivating voxels. To do so, click "File" > "Add" and select the `tstat1.nii.gz` image *again*, but now set "Min" to 0 and "Max" to -5. For this file, select the colormap "Blue-Lightblue". What you should see now is something like this:
What we visualized here is an example of a within-subject pattern of a particular trial ("sample"), in which the activated voxels (relative to baseline) are colored red/yellow, and the deactivated voxels (relative to baseline) are colored blue/lightblue. Make sure you understand how this image represents the pattern of a single sample.

You can also do this inside the jupyter notebook:


In [ ]:
from niwidgets import NiWidget

tstat_widget = NiWidget(op.join('..', 'data', 'pi0070', 'wm.feat', 'stats', 'tstat1.nii.gz'))
tstat_widget.nifti_plotter()

1.4.1 Data representation

In pattern analyses, there is a specific way to 'store' and represent brain patterns: as 2D matrices of shape N-samples $\times$ K-voxels. Important: often (and confusingly), people refer to voxels as (brain) 'features' in pattern analyses. So in articles people often refer to samples-by-features matrices!

Anyway, this is what such a matrix looks like for the hypothetical faces-vs-houses decoding study:

Each row thus represents the voxel pattern of a sample (here: trial)!

From 3D patterns to a 2D matrix ...

As you can see, the originally 3D voxel patterns (e.g. whole-brain patterns of t-values) are flattened (also called "vectorized" or "raveled") such that we can stack all patterns vertically into a 2D matrix. Pattern analyses, and machine learning in particular, need this 2D format (thus discarding spatial information about the voxel patterns), because most (ML) algorithms were not designed to work with 3D (spatial) data (barring convolutional neural networks). Instead, they assume that the data is a 2D samples-by-features matrix.
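
To make this flattening-and-stacking step concrete, here is a minimal sketch using made-up 3D patterns (the dimensions are arbitrary and chosen for illustration only):


In [ ]:
import numpy as np

# Two made-up 3D "brain" patterns of shape (4, 4, 2), i.e. 32 voxels each
pattern_trial1 = np.random.randn(4, 4, 2)
pattern_trial2 = np.random.randn(4, 4, 2)

# Flatten ("ravel") each 3D pattern to a 1D vector and stack them into a samples-by-features matrix
X_toy = np.vstack([pattern_trial1.ravel(), pattern_trial2.ravel()])
print(X_toy.shape)  # (2, 32): 2 samples x 32 features (voxels)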

Anyway, let's look at an example. We're going to work with the working-memory data (as outlined in the beginning of the notebook). Suppose we want to investigate whether we can predict whether a trial is passive or active (factor: working memory load) from (whole-brain) voxel patterns. Consequently, this is a within-subject design. As such, we model each trial separately by fitting a single-trial design matrix to obtain patterns of t-values per trial (similar to the design plot shown in Section 1.3). The results are in the directory: data/pi0070/wm.feat. Check out the directory (again) and especially the stats-folder. You should see a bunch of nifti-files which contain 3D voxel patterns with "tstat" values (FSL also produces pes [betas], copes, varcopes - we removed these to minimize the data size of this tutorial).

For this analysis, we're going to use patterns of t-stats (as is generally recommended over $\beta$s).
As you can see, there are 40 nifti-files with t-stats; these refer to the 40 trials in the experiment (32 active trials, 8 passive trials)! Given that we need to adhere to the data representation as outlined above, we are in the following situation:

What we have: 40 (3D) nifti-files
What we need: one 2D numpy array of shape 40 x {whatever amount of voxels there are in those niftis}

Alright, time to learn some Python gems that help us load in and transform those patterns into a usable 2D numpy matrix.

1.4.2. Tips & tricks to load and transform (nifti-)files

As a first thing, we need to find all the paths to the t-stat nifti-files. Python has a nifty (pun intended) tool called "glob" which can find files/directories on disk using wildcards. It is usually imported as follows:


In [ ]:
from glob import glob

glob, in Python, is a function that takes a path (as a string) with one or more wildcard characters (such as the *) and searches for files/directories on disk that match it. For example, let's try to find all the png-images in the "img" directory using glob (these are the images that I used inside this notebook).


In [ ]:
import os
# the images are in img/ on Linux/Mac systems, but in img\ on Windows (hence the "os.sep" thingie)
my_search_string = 'img' + os.sep + '*.png'
png_files = glob(my_search_string)
print(png_files)

As you can see, it returns a list with all the files/directories that matched the search-string. Note that you can also search for files outside of the current directory; to do so, simply specify the relative or absolute path in the search-string.
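
For example, something like the following would search one directory up (the directory name and extension here are hypothetical and just for illustration):


In [ ]:
import os.path as op
from glob import glob

# Hypothetical example: search for nifti-files in a directory one level up
some_files = glob(op.join('..', 'some_directory', '*.nii.gz'))
print(some_files)  # an empty list if nothing matches the pattern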

**ToDo**: Now you have the skills to actually "glob" all the *t*-stats from subject `pi0070` yourself! Use glob to find all the paths to the t-stats and store the results (a list with 40 strings) in a variable called `tstat_paths`. Note: the data directory is one directory above the current directory! Hint: watch out! There might be an `ftest.nii.gz` file in the stats-directory ...

In [ ]:
# Implement your ToDo here

Check your ToDo!

Run the cell below.


In [ ]:
# To check your answer, run this cell
assert(len(tstat_paths) == 40)
print("Well done! You globbed all the 40 tstat-files correctly!")

Skip this ToDo?

To skip this ToDo, run the cell below.


In [ ]:
from functions import glob_tstats
tstat_paths = glob_tstats()

Warning: glob returns unsorted paths (so in seemingly random order). It's better if we sort the paths before loading them in, so that the order of the paths is more intuitive (the first file is tstat1, the second tstat2, etc.). Python has a built-in function sorted(), which takes a list and sorts it alphabetically. The problem here is that if we'd use that - i.e. sorted(tstat_paths) - it would actually sort the files as: tstat1, tstat10, tstat11, etc. See for yourself:


In [ ]:
print(sorted(tstat_paths))

To fix this issue, we wrote a little function (sort_nifti_paths()) that sorts the paths correctly. (If you're interested in how it works, check out the functions.py file.)
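
If you're curious, such a "natural" (numeric) sort can be implemented in a few lines. The sketch below is not necessarily how sort_nifti_paths() is implemented, but it illustrates the idea: sort on the integer embedded in each filename.


In [ ]:
import re
import os.path as op

def natural_sort(paths):
    # Sort paths by the first integer appearing in each filename (e.g. tstat2 before tstat10)
    return sorted(paths, key=lambda p: int(re.search(r'\d+', op.basename(p)).group()))

# Hypothetical file names, just to demonstrate the ordering
print(natural_sort(['tstat10.nii.gz', 'tstat2.nii.gz', 'tstat1.nii.gz']))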


In [ ]:
# Let's fix it
from functions import sort_nifti_paths
tstat_paths = sort_nifti_paths(tstat_paths)
print(tstat_paths)

1.4.3. Loading in nifti-data using nibabel

Now, we have the paths to all the nifti-files of a single subject. To load a nifti-file into a numpy array in Python, we can use the awesome nibabel package. This package has a function, load, which returns an image object, and that object in turn has a get_data method; you use them consecutively to load the t-values into memory. For example, to load the whole-brain pattern of t-values of the first trial of subject pi0070:


In [ ]:
import nibabel as nib
data = nib.load(tstat_paths[0]).get_data()
voxel_dims = data.shape

**ToDo**: in the code block below, write a loop that loads in the tstat nifti-files one by one (using nibabel) and stores them in the already preallocated array "X". Note that "X" is a 2D matrix (samples-by-features), but each tstat-file contains a 3D array, so you need to "flatten" the 3D array to a single vector: use e.g. the numpy function "flatten()" or "ravel()".

In [ ]:
# Implement your ToDo here
X = np.zeros((40, np.prod(voxel_dims)))

# Start your loop here:

Check your ToDo

Run the cell below to see whether you implemented the ToDo above correctly. If you don't get errors, you did it correctly!


In [ ]:
# Check whether your X matches the reference solution before continuing to Part 2
np.testing.assert_almost_equal(X, np.load(op.join('example_data', 'X_section1.npz'))['X'])
print("Well done!")

Want to skip this ToDo?

Run the cell below.


In [ ]:
X = np.load(op.join('example_data', 'X_section1.npz'))['X']

You've reached a milestone!

Starting with 4D fMRI data, you now have patterns in the right format (a 2D-matrix of N-samples x N-features). Now, let's start talking about machine learning!

Part 2. Multivoxel pattern analysis

In Part 1, we discussed within-subject single-trial designs, and showed how to load single-trial data in the right format to perform MVPA. It's now time to actually implement machine learning analyses!

We'll be looking at the implementation of concepts like K-fold cross-validation, feature-selection/extraction, model fitting/prediction, permutation testing, and feature visualization. To do so, we'll use the scikit-learn machine learning package (the de-facto standard and most-used ML package in Python).

Remember that, very generally, MVPA is all about inferring a feature-of-interest (dependent variable) using patterns of fMRI data. Let's call the feature-of-interest y, and the patterns of fMRI data X. This inference from X to y is called decoding: we want to decode a dependent variable y from X.

In our case, let's continue with the data from the working memory experiment used in Part 1. Remember that, in this experiment, participants performed two types of trials: "ACTIVE" trials, in which they had to remember the orientation of eight bars, and "PASSIVE" trials, in which they did not do anything (note that the screen briefly flickered in the "PASSIVE" trials in order to activate the visual cortex). Let's try to decode trial type from the patterns of t-values you created in Part 1!

2.1 Adding the dependent variable y

In the last section of Part 1, you ended with a nice 2D-matrix of N-samples x N-features. This 2D-matrix contains all whole-brain patterns of t-values for all trials: this is your X. However, this leaves out a crucial part of the data: the actual feature-of-interest, trial type, your y.

While there is kind of a generic way to load in voxel patterns, there is usually not a single way to load in your dependent variable (y), because the exact factor that represents y depends on your exact research question (and also on how you have stored this data on disk).

In within-subject single-trial designs, trial type or condition is often the dependent variable. The dependent variable can thus be extracted from your design. In fact, we already loaded the dependent variable previously, in the stim_info variable (see Section 1.3). The third column (named 'trial_type') contains the trial types, where 1 is an "ACTIVE" trial and 0 a "PASSIVE" trial.

**ToDo**: Extract vector *y* from `stim_info`; *y* should only contain the trial types (i.e., y = [1, 1, 1, 0, ..., 0, 1, 1])

In [ ]:
# Implement your ToDo here

Check your answer


In [ ]:
np.testing.assert_equal(np.array(y), np.load(op.join('example_data', 'y.npy')))
print('Well done!')

Run the cell below if you want to skip this ToDo


In [ ]:
y = np.load(op.join('example_data', 'y.npy'))

And that's all you need for machine learning-based analyses in terms of preparation!

Commercial* break: skbold

While it's still relatively easy to load in, structure, and preprocess all of the data necessary for pattern-based analyses, there's quite a lot of "boilerplate code", especially when you need to loop your analysis across multiple participants. At least, that's what we thought, and it motivated us to create skbold, a (Python-based) package that offers a set of tools to make machine learning analyses of fMRI data easier. As shown in the figure below, skbold complements the scikit-learn package by offering functionality specific to fMRI data at different stages of your machine learning pipeline. For example, reading in single-trial estimates and corresponding labels ("y") only takes about 5 lines of Python code using skbold. To avoid distracting you from this tutorial, we included a section on how to use skbold at the very end of this tutorial. Check it out if you are interested!


* Skbold is, of course, not really commercial! It's fully open source (BSD-3 licensed) and can be easily installed with pip (pip install skbold)!

2.2. An overview of a typical machine learning pipeline

Awesome, we now have everything we need (X and y). This took a while, but you'll soon see it was worth the effort.

In the rest of the tutorial, you'll finally learn how to actually implement decoding pipelines. Typical pipelines consist of the following elements:

  • Data preprocessing (scaling)
  • Data partitioning
  • Feature selection/extraction
  • Model fitting (on train-set)
  • Model cross-validation (on test-set)
  • Calculate model performance
  • Statistical analyses of performance
  • Optional: feature visualization (not discussed in this workshop)

In the rest of the tutorial, we'll discuss these topics in a slightly different order. First, we'll discuss how you can preprocess, fit, and cross-validate scikit-learn models, and while we're at it, how to calculate model performance. Subsequently, you'll learn how to embed these concepts in fully cross-validated K-fold data partitioning schemes. Finally, we'll extend our pipelines with feature selection methods and (optionally) show you how to use scikit-learn Pipelines.

2.3. Pattern preprocessing (scaling)

Before any machine learning analysis, it is advised to preprocess your patterns such that each feature (here: voxel) is on the same scale. This "feature scaling" process can be done in various ways, but the most often used method is standardization: scaling features such that they have 0 mean and unit variance. Feature scaling is important for the training process of many machine learning algorithms, and omitting scaling may yield sub-optimal results.

Scikit-learn provides a class, StandardScaler, that implements feature standardization. We'll apply it to our entire dataset (X) here, but in a later section we'll explain that - strictly speaking - it's best to "cross-validate" this preprocessing step.

Anyway, let's see how standardization is done using scikit-learn (in section 2.8. we'll explain this scikit-learn object's API in more detail):


In [ ]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() # Here we initialize the StandardScaler object
scaler.fit(X)             # Here we "fit" the StandardScaler to our entire dataset (i.e. calculates means and stds of each feature)
X = scaler.transform(X)   # And here we transform the dataset using the calculated means/stds

Did the scaling procedure work? Let's check that below (by asserting that the mean of each column is 0, and the std of each column is 1):


In [ ]:
means = np.mean(X, axis=0)
np.testing.assert_almost_equal(means, np.zeros(X.shape[1]))
print("Each column (feature) has mean 0!")
stds = X.std(axis=0)
np.testing.assert_almost_equal(stds[stds != 0], np.ones((stds != 0).sum()))
print("Each column (feature) has std 1!")

Sweet! We correctly preprocessed our patterns. Now it's time for the more interesting stuff.

2.4. Model fitting & cross-validation

First, we'll show you how to use scikit-learn models and associated functionality. In fact, the most useful functionality of scikit-learn is probably that they made fitting and cross-validating models (or, in scikit-learn lingo: estimators) trivially easy.

Note that in the upcoming example, we will fit the model on all our samples for the sake of the example. In practice, this is something you would never do: you always need to cross-validate your model to a new, independent sample.*


* That is, independent with respect to noise: the noise in your training sample should not be correlated with the noise in your test set.

Anyway, let's import a scikit-learn model: the (linear) support vector classifier (SVC), which is one of the most often used models in fMRI pattern analyses. In scikit-learn, this model is part of the (unsurprisingly named) svm module:


In [ ]:
# Scikit-learn is always imported as 'sklearn'
from sklearn.svm import SVC

Like most of scikit-learn's functionality, SVC is a class (not a function!). So, let's initialize an SVC-object! One important argument that this object needs upon initialization is the "kernel" you want the model to have. Basically, the kernel determines how to treat/process your features: linearly (such as kernel='linear'), or non-linearly (such as the kernel='rbf' or kernel='poly' options). Most often a linear kernel is the best option (as non-linear kernels often overfit very quickly).

To initialize an SVC-object with a linear kernel, just do as follows:


In [ ]:
# clf = CLassiFier
clf = SVC(kernel='linear')

Alright, we initialized an SVC-object (named clf, short for "classifier") with a linear kernel. Now, you need to do two things to get the prediction for each sample (i.e. whether it's predicted as 0 or 1): fit, using the method fit(X, y), and predict the class (i.e. 0 or 1) for each sample using the method predict(X). Basically, in the fit method, the parameters of the model (i.e. $\beta$) are estimated. Then, in the predict method, the estimated parameters are used to generate a prediction for each sample (i.e. 0 or 1).

Let's first look at the fit method. As you can see, the fit(X, y) method takes two parameters: X (a samples-by-features matrix) and y (a vector of length n-samples). Let's do that for our data:


In [ ]:
print('Fitting SVC ...', end='')
clf.fit(X, y)
print(' done.')

After calling the fit() method, the clf-object contains an attribute coef_ that represents the model's parameters ('coefficients' in scikit-learn lingo, i.e. $\beta$). Let's check that out:


In [ ]:
coefs = clf.coef_
print("Shape of coefficients: %r" % (coefs.shape,))

Ah, just like we expected: the coef_ attribute contains exactly as many values as there are voxels in our X-matrix (i.e. 80*80*37).

Usually, you don't have to do anything directly with the weights (except perhaps for feature visualization). Scikit-learn handles the calculation of the prediction (let's call that $\hat{y}$) using the trained weights. To get the prediction for any sample, just call the predict(X) method of the model, which returns the predictions as an array:


In [ ]:
y_hat = clf.predict(X)
print("The predictions for my samples are:\n %r" % y_hat.tolist())
**ToDo**: The cool thing about scikit-learn is that their objects have a very consistent API and have sensible defaults. As a consequence, *every* model ("estimator" in scikit-learn terms) is used in the same way using the `fit(X, y)` and `predict(X)` methods. Try it out yourself below!

Try using the [LinearDiscriminantAnalysis](http://scikit-learn.org/stable/modules/generated/sklearn.discriminant_analysis.LinearDiscriminantAnalysis.html#sklearn.discriminant_analysis.LinearDiscriminantAnalysis) class (`from sklearn.discriminant_analysis import LinearDiscriminantAnalysis`) or [GaussianNB](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB) ("Naive Bayes" classifier; `from sklearn.naive_bayes import GaussianNB`) for example (or any other!). All methods should work exactly the same as the previous example using the SVM.

In [ ]:
# Try out different estimators below (call fit and predict on X and y!)

2.5. Model evaluation

A logical next step is to assess how good the model was at predicting the class of the samples. A straightforward metric to summarize performance is accuracy*, which can be defined as:

\begin{align} accuracy = \frac{number\ of\ correct\ predictions}{number\ of\ predictions} \end{align}

* There are waaaay more metrics to summarize model performance for classification models, such as precision, recall, F1-score, and ROC-AUC. These metrics are more appropriate than accuracy when you have imbalanced classes, i.e. more samples in one class (e.g. negative images) than in another class (e.g. positive images), like we have here (i.e. 32 active, 8 passive trials). We'll revisit this issue in the next (optional) section.
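
Scikit-learn also ships ready-made functions for such metrics in its metrics module. You'll compute accuracy by hand in the ToDo below, but just to show what the library version looks like on some made-up labels (not our actual data):


In [ ]:
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels for 5 samples
y_true_toy = [0, 1, 1, 0, 1]
y_pred_toy = [0, 1, 0, 0, 1]

# 4 out of 5 predictions are correct, so this prints 0.8
print(accuracy_score(y_true_toy, y_pred_toy))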

**ToDo**: Can you calculate the accuracy of the above model? Hint 1: you need to compare the true labels (i.e. y) with the predicted labels (i.e. y_hat). Hint 2: if you do arithmetic with boolean values (i.e. `True` and `False`), `True` is interpreted as 1 and `False` is interpreted as 0.

In [ ]:
# Implement your to-do here!

If you've done the ToDo above correctly, you should have found out that the accuracy was 1.0 - a perfect score! "Awesome! Nature Neuroscience material!", you might think. But, as is almost always the case: if it seems too good to be true, it probably is indeed too good to be true.

So, what is the issue here?

Well, we didn't cross-validate the model! We fitted it on all samples in X and then predicted those very same samples, which leads to an optimistic estimate of model performance. Such optimistic estimates in non-cross-validated models are especially likely when there are many more features (here: voxels) than samples (here: trials). In other words, we are probably overfitting the model here.

Thus, let's check what happens if we actually cross-validate the model. To do so, we need to partition the data into a train- and test-set. For this example, we'll use a simple hold-out scheme, in which we'll reserve half of the data for the test-set (we'll discuss more intricate cross-validation schemes such as K-fold in the next section).

To partition our data in a 50/50 split (50% for train, 50% for test), we'll use the scikit-learn function train_test_split:


In [ ]:
from sklearn.model_selection import train_test_split

if not isinstance(y, np.ndarray):
    y = np.array(y)

# The argument "test_size" indicates the test-size as a proportion
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, stratify=y,
                                                    random_state=5)
    
print("Shape X_train: %r" % (X_train.shape,))
print("Shape y_train: %r" % (y_train.shape,))
print("Shape X_test: %r" % (X_test.shape,))
print("Shape y_test: %r" % (y_test.shape,))

As you can see, the train_test_split function nicely partitions our data into two sets, a train- and a test-set. Note the argument stratify in the function, which is set to y; this ensures that the class proportions (here: the ratio of "active" to "passive" trials) are the same in the train-set and the test-set.

**ToDo**: To convince yourself that the `stratify` option does what it supposed to do, check whether the class-ratio is the same for the train and test-set.

In [ ]:
# Implement the ToDo here

Now it's up to you to actually implement the cross-validated equivalent of the fit/predict procedure we showed you before, in the next ToDo!

**ToDo**: Fit your model on `X_train` and `y_train` and then predict `X_test`. Calculate both the accuracy on the train-set (fit and predict on train) *and* the cross-validated accuracy (fit on train, predict on test). Compare the two accuracies - are we overfitting the model?

In [ ]:
# Implement your ToDo here

If you implemented the ToDo correctly, you should see that the test-set accuracy is quite a bit lower (10%) than the train-set accuracy! This test-set accuracy is still slightly biased due to imbalanced classes, which is discussed in the next section.

Advanced: evaluation of imbalanced datasets/models (optional, but recommended)

As hinted at before, one problem with our dataset is that it is imbalanced. That is, the number of samples is imbalanced between classes, because we have 32 samples in the "active" class and only 8 samples in the "passive" class.

"Why is that a problem?", you might ask?

Well, suppose that our classifier "learns" to predict every sample as "active", regardless of the underlying patterns. In that scenario, it will still achieve an accuracy of:

\begin{align} accuracy = \frac{32}{40} = 0.8 \end{align}

While 0.8 may seem like a great classification score, in this case it may just be due to our class imbalance! Imbalanced datasets therefore need a different evaluation metric than accuracy. Fortunately, scikit-learn has many more performance metrics you can use, including metrics that "correct" for the (potential) bias due to class imbalance (such as the F1-score, the ROC-AUC score, and Cohen's kappa).

Another way to "correct" for imbalanced datasets is to either undersample your majority class(es) to match the amount of samples in the minority class or to oversample the minority class(es) to match the amount of samples in the majority class. Unlike using "class-imbalance-aware" metrics, which deals with the issue post-hoc, under/oversampling actually forces your model to learn on a balanced dataset (which is, in our experience, the most sensitive strategy).

As the over/undersampling methods are slightly beyond the scope of this tutorial (but check out this library for implementations of different methods to deal with imbalanced datasets), let's check out what happens with our performance estimate if we use a different ('class-imbalance-aware') metric, like the "ROC-AUC-score". Like all metrics, this function can be imported from the sklearn.metrics module. Any scikit-learn metric function is called using the following format:

performance = some_metric(y_true, y_predicted)

**ToDo**: Evaluate the model from the previous ToDo with the ROC-AUC-score metric instead of accuracy. Check both the performance on the train-set (which should still be 1.0) and the (cross-validated) performance on the test-set. Notice the drop in performance compared to using accuracy!

In [ ]:
# The roc-auc-curve function is imported as follows:
from sklearn.metrics import roc_auc_score

# Implement your ToDo here

If you've done the ToDo correctly, you should see that the performance drops to 0.75 when using the ROC-AUC score! While this estimate is lower, it is more likely to reflect actual information in your patterns rather than just the class imbalance of your dataset.

That said, it's, of course, always best to avoid imbalance in the first place, but sometimes this is unavoidable, for example when decoding subjects' responses (which are not under control of the experimenter). But if you have imbalanced classes, make sure to deal with it appropriately!

Recap

Sweet! You actually performed your first proper decoding analysis in this section! There are however a couple of things we can do to improve the efficiency and results of our analysis. One thing we can do is to use the data more efficiently in cross-validation. In the previous example, we split the data into two sets and fit the model on one (the train-set) and cross-validated to the other (the test-set). There is actually a way to "re-use" the data by using K-fold cross-validation, in which data is iteratively partitioned in train- and test-sets. This is explained in the next section.

2.7. Data partitioning: K-fold cross-validation

There are two major ways of cross-validation:

  • Hold-out cross-validation;
  • K-fold cross-validation;

In the previous section we implemented a form of hold-out cross-validation, which can be seen as a kind of "one-shot cross-validation". In fMRI data-sets, however, we usually have few samples (simply because MRI-data is expensive to acquire!). Because K-fold cross-validation allows you to reuse the data, it is much more common than hold-out cross-validation in MRI studies.

Now, we can finally use some of scikit-learn's functionality. We are going to use the StratifiedKFold class from scikit-learn's model_selection module. Click the highlighted link above and read through the manual to see how it works.

Importantly, if you're dealing with a classification analysis, always use StratifiedKFold (instead of the regular KFold), because this version makes sure that each fold contains the same proportion of the different classes (here: 0 and 1; just like we did with train_test_split when we set stratify=y).

Anyway, enough talking. Let's initialize a StratifiedKFold object with 5 folds:


In [ ]:
# scikit-learn is imported as 'sklearn'
from sklearn.model_selection import StratifiedKFold

# They call folds 'splits' in scikit-learn
skf = StratifiedKFold(n_splits=5)

Alright, we have a StratifiedKFold object now, but not yet any indices for our folds (i.e. indices to split X and y into different samples). To do that, we need to call the split(X, y) method:


In [ ]:
folds = skf.split(X, y)

Now, we created the variable folds, which is officially a generator object, but just think of it as a kind of list (of index arrays) that is meant to be looped over. Each entry in folds is a tuple with two elements: an array with train-indices and an array with test-indices. Let's demonstrate that*:


* Note that you can only run the cell below once. After running it, the folds generator object is "exhausted", and you'll need to call skf.split(X, y) again in the above cell.


In [ ]:
# Notice how we "unpack" the train- and test-indices at the start of the loop
i = 1
for train_idx, test_idx in folds:
    
    print("Processing fold %i" % i)
    print("Train-indices: %s" % train_idx)
    print("Test-indices: %s\n" % test_idx)
    
    i += 1

As you can see, for each fold StratifiedKFold returns one set of sample indices to be used for training and another, non-overlapping set to be used for testing (remember, Python uses 0-based indexing!).

Now, we know how to access the train- and test-indices, but we haven't actually indexed our X and y in the for-loop over folds. Nor have we actually fit the model on the train-set and cross-validated it on the test-set.

**ToDo**: in the code-cell below, complete the statements by indexing X and y to create four different objects in every fold: X_train, y_train, X_test, y_test. Also, we created a new classifier-object (clf) for you based on a different model: scikit-learn's `LogisticRegression` to show you that *every* model in scikit-learn works the same (i.e. has the same `fit` and `predict` methods).

Use this classifier to fit on the train-set and predict the test-set in every fold. Then, calculate the (cross-validated) performance (e.g. ROC-AUC score) in every fold. Keep track of the performance scores across folds, and after the loop over folds, calculate the average performance across folds.

In [ ]:
# Implement the ToDo here by completing the statements in the loop!
from sklearn.linear_model import LogisticRegression

# clf now is a logistic regression model
clf = LogisticRegression()
# run split() again to generate folds
folds = skf.split(X, y)

performance = np.zeros(skf.n_splits)

for i, (train_idx, test_idx) in enumerate(folds):
    
    # Complete these statements by indexing X and y with train_idx and test_idx
    X_train = 
    y_train = 
    
    X_test = 
    y_test = 
    
    # ToDo: call fit (on train) and predict (on test)
    
    # ToDo: calculate roc-auc-score (or accuracy, but that's biased ...)
    this_performance = 
    
    performance[i] = this_performance
    
# ToDo: calculate average performance (and print it!)

Implementing K-fold cross-validation instead of hold-out cross-validation allowed us to make our analysis a little more efficient (by reusing samples). Another method that (usually) improves performance in decoding analyses is feature selection/extraction, which is discussed in the next section (after the optional run-wise decoding section).

Advanced: run-wise decoding (optional)

Contrary to our example data-set, in which we decode single trials within a single fMRI run, it is recommended to spread out your samples across different fMRI runs and perform run-wise cross-validation. This type of cross-validation basically partitions your samples according to their run; one example, referred to as "leave-one-run-out" cross-validation, partitions your data such that you always train on trials from all runs minus one, and use the left-out run for testing. A nice property of this type of cross-validation is that it effectively avoids the problem of autocorrelated samples (which is highly likely in fMRI, especially if the ISI is short!). While the samples within each run are still autocorrelated, run-wise cross-validation makes sure that validation is performed on samples from an independent run, thus dealing with bias due to autocorrelation.

Anyway, let's look at an example. Since we don't have a real fMRI dataset with different runs, we're just going to simulate some data. Suppose we have four fMRI runs, each with 20 trials, of which 10 are of class "A" (coded with 0) and 10 of class "B" (coded with 1), and each sample is associated with a pattern of 1000 voxels (let's say the activity pattern in the bilateral amygdala). So, in total we have 80 trials, 40 per class, spread out over 4 runs:


In [ ]:
X_r = np.random.randn(80, 1000)
print("Shape of X: %s" % (X_r.shape, ), '\n')

y_r = np.tile([0, 1], 40)
print("Shape of y: %s" % (y_r.shape, ))
print("Y labels:\n%r" % y_r.tolist(), '\n')
runs = np.repeat([1, 2, 3, 4], 20)
print("Shape of runs: %s" % (runs.shape, ))
print("Run-indices: \n%r" % runs.tolist())

Now, scikit-learn offers a nice cross-validation class that partitions your data according to a "grouping" variable: GroupKFold, or variations thereof like LeaveOneGroupOut or LeavePGroupsOut. Let's check out how that can be used on our simulated data:


In [ ]:
# Import from model_selection module
from sklearn.model_selection import GroupKFold

# In fact, when we initialize GroupKFold with 4 splits, as below, it is exactly the same as
# the LeaveOneGroupOut cross-validator, since we only have 4 groups
gkf = GroupKFold(n_splits=4)
for train_idx, test_idx in gkf.split(X=X_r, y=y_r, groups=runs):
    
    print("Indices of our test-samples: %r" % test_idx.tolist())
    print("... which correspond to following runs: %r" % runs[test_idx].tolist(), '\n')

As you can see in the example above, the GroupKFold cross-validator effectively partitioned our data such that in every fold one group was left out for testing (group 4 in the first fold, group 3 in the second fold, etc.)! So keep in mind to use this type of cross-validation object when you want to employ such a run-wise cross-validation routine (which is recommended anyway to avoid within-run confounds such as autocorrelation between samples!).

**ToThink**: While stratification can be ensured by using e.g. `StratifiedKFold` cross-validators, stratification cannot be guaranteed when using `GroupKFold`. Why do you think this is the case?

Advanced: repeated random splits

An often heard question about cross-validation is what train/test ratio one should choose in their cross-validation paradigm. Usually, larger train sets (and thus smaller test sets), such as Leave-One-Out-Cross-Validation, lead to a higher average performance across folds, but also a larger variance of performance across folds. Or, in ML terms: a large train/test ratio leads to relatively low bias but high variance.

On the other hand, relatively larger test-sets lead to lower performance estimates on average, but also less variance across folds. As such, it has been proposed to use test-sets of about 10% or 20% (that is, 10- or 5-fold cross-validation). Furthermore, instead of just doing K-fold cross-validation once, it has been argued that drawing repeated random splits of, let's say, 10% of your data improves generalization even more (see Varoquaux (2017)).

This repeated random splits technique is sometimes referred to as bagging (which stands for bootstrap aggregating), a common "ensembling" technique in machine learning. Scikit-learn provides a class that implements this repeated random splits technique: StratifiedShuffleSplit. If you feel like it, try to implement a cross-validation pipeline using this procedure as indicated in the next ToDo.

**ToDo**: Below, try to implement a *repeated random splits* cross-validation routine using `StratifiedShuffleSplit` with 100 repetitions and a test-size of 0.2! Check out the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html#sklearn.model_selection.StratifiedShuffleSplit) for more info about its parameters.

In [ ]:
from sklearn.model_selection import StratifiedShuffleSplit
# Try implementing the ToDo here

2.8. Feature selection/extraction

While a complete discussion on (causes of) the phenomenon of "overfitting" is definitely beyond the scope of this workshop, it is generally agreed upon that a small sample/feature ratio will likely lead to overfitting of your models (optimistic performance estimates on the train-set). A small sample/feature ratio, unfortunately, is very common in machine learning analyses of neuroimaging (and especially fMRI) data, as the number of features (voxels, MEG sensors) often vastly outnumbers the number of samples (trials or subjects).

Given that this small sample/feature ratio (usually) leads to overfitting, it follows that reducing the amount of features often has a beneficial effect on cross-validated performance estimates. (Or alternatively, increasing your number of samples, but this is not always feasible due to practical and financial reasons...)

Feature reduction can be achieved in two principled ways:

  • feature extraction: transform your features into a set of lower-dimensional components;
  • feature selection: select a subset of features

Examples of feature extraction are PCA (i.e. transform voxels to orthogonal components) and averaging features within brain regions from an atlas.

Examples of feature selection are ROI-analysis (i.e. restricting your patterns to a specific ROI in the brain) and "univariate feature selection" (UFS). This latter method is an often-used data-driven method to select features based upon their univariate difference, which is basically like using a traditional whole-brain mass-univariate analysis to select potentially useful features!
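
To give you a feel for the feature extraction route, here is a minimal sketch using scikit-learn's PCA on made-up data (the shapes below are arbitrary and just for illustration):


In [ ]:
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 40 samples x 1000 "voxels"
X_toy = np.random.randn(40, 1000)

# Feature extraction: transform the 1000 voxels into 10 orthogonal components
pca = PCA(n_components=10)
X_components = pca.fit_transform(X_toy)
print(X_components.shape)  # (40, 10): 40 samples x 10 components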

**ToThink**: Suppose a researcher wants to decode gratings with two different orientations from V1. To delineate V1, the subject underwent a retinotopy session in a *different* fMRI run. The data from this retinotopy session was subsequently used to extract ("mask") V1 by excluding non-significant voxels; the significant voxels were in turn used to base the orientation decoding analysis on.

Is masking V1 using the retinotopy data a form of *feature selection* or *feature extraction*? Why?


**ToThink**: What could be a reason to prefer feature selection over feature extraction? And the other way around?

Fortunately, scikit-learn has a bunch of feature selection/extraction objects for us to use. These objects ("transformers" in scikit-learn lingo) work similarly to estimators: they also have a fit(X, y) method, in which for example the univariate differences (in UFS) or PCA-components (in PCA-driven feature extraction) are computed. Then, instead of having a predict(X) method, transformers have a transform(X) method.
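
To illustrate this "transformer API" in isolation, here is a minimal sketch on random data (a hypothetical example, nothing to do with our fMRI patterns yet): fit(X, y) computes the feature scores, and transform(X) returns the reduced data:

In [ ]:
# Minimal sketch of the transformer API: fit(X, y), then transform(X)
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.RandomState(0)
X_demo = rng.randn(20, 50)      # 20 samples, 50 features
y_demo = np.repeat([0, 1], 10)  # two classes

transformer = SelectKBest(score_func=f_classif, k=5)
transformer.fit(X_demo, y_demo)             # computes univariate F-scores
X_reduced = transformer.transform(X_demo)   # keeps only the 5 best features
print(X_demo.shape, '->', X_reduced.shape)  # (20, 50) -> (20, 5)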

2.8.1. (Univariate) feature selection

As an example of feature selection, let's take a closer look at how UFS can be implemented using scikit-learn. Basically, it follows this structure:

transformer = Selector(score_func=ufs_function, other_args)

Here, the score_func parameter takes a function that implements a univariate statistical test (like an f-test or a chi-square test). The Selector class determines how to select the subset of features based on the result of the score_func.

The SelectKBest selector, for example, selects the K best features based on their scores on the statistical test. The following transformer would thus select the 2000 best features based on their F-scores:


In [ ]:
from sklearn.feature_selection import SelectKBest, f_classif

# f_classif is a scikit-learn specific implementation of the F-test
select2000best = SelectKBest(score_func=f_classif, k=2000)

Another example of a selector is SelectFwe, which selects only features whose p-values survive a family-wise error (Bonferroni) correction at a set alpha-level. For example, the following transformer would only select features whose chi-square p-values survive an alpha of 0.01:


In [ ]:
from sklearn.feature_selection import SelectFwe, chi2
selectfwe_transformer = SelectFwe(score_func=chi2, alpha=0.01)

But how does this work in practice? We'll show you a (not cross-validated!) example using the select2000best transformer initialized earlier:


In [ ]:
# Fit the transformer ...
select2000best.fit(X, y)

# ... which calculates the following attributes (.scores_ and .pvalues_)
# Let's check them out
scores = select2000best.scores_
pvalues =  select2000best.pvalues_

# As you can see, each voxel gets its own score (in this case: an F-score)
print(scores.size)

# and its own p-value:
print(pvalues.size)

We can also visualize these scores in brain space:


In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline

scores_3d = scores.reshape((80, 80, 37))
plt.figure(figsize=(20, 5))

for i, slce in enumerate(np.arange(15, 65, 5)):
    plt.subplot(2, 5, (i+1))
    plt.title('X = %i' % slce, fontsize=20)
    plt.imshow(scores_3d[slce, :, :].T, origin='lower', cmap='hot')
    plt.axis('off')
plt.tight_layout()
plt.show()

Thus, in the images above, brighter (more yellow) colors represent voxels with higher scores on the univariate test. If we subsequently transform our X-matrix using, for example, the select2000best selector, we'll select only the 2000 "most yellow" (i.e. highest F-score) voxels.

**ToThink**: Given the image above, what is the major difference between data-driven feature selection (like UFS) and ROI-based feature selection (e.g. only looking at patterns in the amygdala) in terms of the spatial scale of the patterns you'll select? Try to think of a scenario in which you would prefer UFS over ROI-based feature selection, and vice versa.

Practically, after we've fit the transformer, we can call the transform(X) method that will actually select the subset according to the selector:


In [ ]:
print("Shape of X before transform: %r ..." % (X.shape,))
X_after_ufs = select2000best.transform(X)

print("... and shape of X after transform: %r." % (X_after_ufs.shape,))

As you can see, the transformer correctly selected a subset of 2000 voxels from our X matrix! Now, both selectors were fit on the entire dataset, which is of course not how it should be done: because they use information from the labels (y), this step should be cross-validated. Thus, what you have to do is the following (see the sketch right after this list for a generic example):

  • fit your transformer on the train-set;
  • transform your train-set;
  • transform your test-set;
  • (fit your model on the transformed train-set;)
  • (cross-validate to the transformed test-set;)
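
To make this recipe concrete, here is a minimal, generic sketch on randomly generated data (not the workshop data); in the ToDos further below you'll apply the same logic to the actual fMRI patterns within a K-fold loop:

In [ ]:
# Minimal sketch of cross-validated feature selection: fit the selector on the
# train-set only, transform both sets, then fit/evaluate the classifier
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.RandomState(1)
X_demo = rng.randn(40, 500)
y_demo = np.repeat([0, 1], 20)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5,
                                          stratify=y_demo, random_state=1)

selector = SelectKBest(score_func=f_classif, k=50)
selector.fit(X_tr, y_tr)               # 1. fit the transformer on the train-set
X_tr_sel = selector.transform(X_tr)    # 2. transform the train-set
X_te_sel = selector.transform(X_te)    # 3. transform the test-set

clf_demo = SVC(kernel='linear')
clf_demo.fit(X_tr_sel, y_tr)           # 4. fit the model on the transformed train-set
print("Accuracy on the test-set: %.2f" % clf_demo.score(X_te_sel, y_te))  # 5. evaluate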

We summarized the entire pipeline, including data partitioning (indexing), feature selection (transformation), and model fitting and their corresponding cross-validation steps in the image below:

2.8.2. Feature extraction

There are many methods for feature extraction, like "downsampling" to ROI-averages (i.e. averaging voxel patterns within brain regions) and dimensionality-reduction techniques like PCA. Scikit-learn provides some of these techniques as "transformer" objects, which again have a fit() and a transform() method.

Note: since feature extraction techniques like PCA do not strictly use information from the labels (y), you might think that you do not have to cross-validate this step in your pipeline. This is not true. For a completely unbiased generalization error, you need to cross-validate every transformation that is in some way estimated from the data (like UFS and PCA, but not ROI-based feature selection, since a predefined ROI does not depend on the patterns themselves).
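
As a generic illustration of this point (again on random data, not our patterns), the sketch below fits a PCA on a train-set only and then applies the fitted components to both the train- and the test-set:

In [ ]:
# Minimal sketch: estimate PCA components on the train-set only, apply them to both sets
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(2)
X_demo = rng.randn(40, 500)
X_tr, X_te = train_test_split(X_demo, test_size=0.5, random_state=2)

pca_demo = PCA(n_components=5)
pca_demo.fit(X_tr)                      # components estimated on the train-set only
X_tr_pca = pca_demo.transform(X_tr)
X_te_pca = pca_demo.transform(X_te)
print(X_tr_pca.shape, X_te_pca.shape)   # both reduced to 5 components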

In the next ToDo, you'll have to apply PCA-based feature extraction!

**ToDo**: Below, we import the `PCA` class from scikit-learn. Check out the [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html). Now, initialize an object of this `PCA` class with the number of components set to 5, fit it on `X_train_tmp`, and then call transform on `X_train_tmp` and `X_test_tmp` separately; store the results in the variables `X_train_pca_transformed` and `X_test_pca_transformed`. Then, check out the shape of `X_train_pca_transformed`: does it have the shape you expected?

In [ ]:
# Implement your ToDo here!
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X_train_tmp, X_test_tmp = train_test_split(X, test_size=0.5)

# initialize a PCA object ...

# ... call fit on X_train_tmp ...

# ... and call transform on X_train_tmp and X_test_tmp
X_train_pca_transformed = 
X_test_pca_transformed = 

# And finally check out the shape of X_pca_transformed!

2.8.3. Integrating feature selection and model fitting/cross-validation

Now, let's put everything you learned so far together and implement a fully cross-validated pipeline with UFS and model fitting/prediction!

**ToDo**: Below, we set up a K-fold cross-validation loop and prespecified a classifier (`clf`, a logistic regression model) and a transformer (`select1000best`, selecting the 1000 best features based upon an F-test). Now, it's up to you to actually implement the feature selection inside the for-loop. Make sure to fit the transformer only on the train-set, but then transform *both* the train-set and the test-set. Then, fit the model on the transformed train-set and cross-validate to the transformed test-set. Calculate performance (metric of your choice) of the cross-validated model for each fold, and after all folds calculate the average performance (across folds).

In [ ]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
folds = skf.split(X, y)

performance = np.zeros(skf.n_splits)
select1000best = SelectKBest(score_func=f_classif, k=1000)

i = 0
for train_idx, test_idx in folds:
    
    # ToDo: make X_train, X_test, y_train, y_test
    
    # ToDo: call the select1000best fit method (on X_train), and transform both X_train and X_test
    
    # ToDo: fit model on X_train (transformed) and predict X_test (transformed)
    # also, save your performance (e.g. ROC-AUC) in the "performance" variable

# ToDo: calculate average performance

If you did the above ToDo, your performance is likely a bit higher than in the previous section in which we didn't do any feature selection/extraction!

Advanced: cross-validation using Pipelines (optional)

As you have seen in the previous assignment, the code within the K-fold for-loop becomes quite long (and messy) when you add a feature-selection step. Suppose you want to add another feature extraction step, like performing a PCA after an initial feature selection step. The code then becomes even more obscure ...

Fortunately, scikit-learn has an awesome solution for this: "Pipelines".

Pipelines are somewhat more advanced functionality within scikit-learn, but we wanted to show them to you because they really "clean up" your code and, additionally, provide a great safeguard against accidental overfitting (e.g. when you accidentally perform feature selection on all your data instead of only your train-data).

Anyway, how does this work? Well, a picture is worth a thousand words, so check out the image below, which schematically depicts what a pipeline does:

As you can see, a Pipeline-object "strings together" an arbitrary number of transformers (including "preprocessing" transformers like StandardScaler*) and can optionally end in an estimator. Pipeline-objects have two (relevant) methods: fit(X, y) and predict(X) (the latter only exists if there is an estimator in the pipeline). To use it, you only have to call fit() with X_train and y_train and it will sequentially fit the transformers, passing the transformed data along, and finally fit the estimator on the result. Then, simply call predict() with X_test as argument and the pipeline will automatically apply all fitted transformers, and finally the fitted estimator, to the X_test variable.

Okay, lots of info - let's break this down. First, let's initialize some transformers and an estimator:


* Feature scaling using e.g. StandardScaler, which we discussed earlier, is generally recommended to cross-validate as well (i.e. apply the mean and std from the train-set to standardize the test-set.)


In [ ]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

scaler = StandardScaler()
ufs = SelectKBest(score_func=f_classif, k=1000)
pca = PCA(n_components=10)  # we want to reduce the features to 10 components
svc = SVC(kernel='linear')

Now, to initialize a Pipeline-object, we need to give it a list of tuples: the first entry of each tuple is a name for the step in the pipeline, and the second entry is the actual transformer/estimator object. Let's do that for our pipeline:


In [ ]:
from sklearn.pipeline import Pipeline

pipeline_to_make = [('preproc', scaler),
                    ('ufs', ufs),
                    ('pca', pca),
                    ('clf', svc)]

my_pipe = Pipeline(pipeline_to_make)

Let's test our pipeline-object (my_pipe) on our data. For this example, we'll use a simple hold-out cross-validation scheme (but pipelines are equally valuable in K-fold CV schemes!).


In [ ]:
from sklearn.metrics import roc_auc_score

X_train, y_train = X[0::2], y[0::2]
X_test, y_test = X[1::2], y[1::2]

my_pipe.fit(X_train, y_train)
predictions = my_pipe.predict(X_test)
performance = roc_auc_score(y_test, predictions)
print("Cross-validated performance on test-set: %.3f" % performance)

Cool stuff huh? Quite efficient, I would say!

**ToDo**: Test your pipeline-skills! Can you build a pipeline that incorporates a [VarianceThreshold](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.VarianceThreshold.html#sklearn.feature_selection.VarianceThreshold) (removes all voxels outside the brain), K-means clustering (with 8 clusters, such that $K$ voxels --> 8 clusters, using [KMeans](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.k_means.html#sklearn.cluster.k_means)), scaling (`StandardScaler`), and a [RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)?

In [ ]:
# Implement the ToDo here
from sklearn.feature_selection import VarianceThreshold
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

2.9. Assessing significance

Up to now, we examined the performance of decoding the experimental condition from the fMRI data of a single subject. When averaging over folds, we got one scalar estimate of performance. Usually, though, within-subject decoding studies measure multiple subjects and want to test whether their hypothesis - which is usually that decoding is above chance - generalizes to the population. Chance-level decoding depends on the specific metric you use; for accuracy, for example, chance level equals $\frac{1}{P}$, where $P$ is the number of classes. Binary decoding therefore has a chance level of 50% for accuracy. However, as discussed before, accuracy is biased in scenarios with imbalanced classes, so we should use a 'class-imbalance-aware' metric - such as the F1-score or ROC-AUC - instead.
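
As a quick illustration of why accuracy can be misleading here (using made-up labels and predictions, not our data): a "classifier" that always predicts the majority class reaches a high accuracy on imbalanced classes, while ROC-AUC stays at its chance level of 0.5:

In [ ]:
# Minimal sketch: accuracy vs. ROC-AUC for an always-predict-the-majority-class "classifier"
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.array([0] * 90 + [1] * 10)  # imbalanced: 90% of trials are class 0
y_pred = np.zeros(100)                  # always predict class 0

print("Accuracy: %.2f" % accuracy_score(y_true, y_pred))  # 0.90, looks great ...
print("ROC-AUC:  %.2f" % roc_auc_score(y_true, y_pred))   # ... but 0.50, i.e. chance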

Anyway, let's assume we use ROC-AUC as our performance metric (its chance level is also 0.5); our group-level null ($H_{0}$) and alternative ($H_{a}$) hypotheses then become:

  • $H_{0}$: performance = 0.5
  • $H_{a}$: performance > 0.5

Given that we obtained performance estimates from a group of subjects, which should yield independent estimates, we can simply do a one-sample t-test of our performance estimates against chance (0.5).* To do so, we first need to loop our decoding analysis across all our participants to get a sample of decoding performances.


* Note: Last year, Allefeld and colleagues published a paper in which they show that second-level statistics using a t-test (or other parametric statistics) on "first-level" per-subject performance estimates do not entail a random-effects analysis, but rather a fixed-effects analysis, and thus do not warrant inference to the population. This is because true performance estimates cannot be below chance. We haven't had time to implement this ourselves for this workshop (we're working on implementing it in skbold), so we recommend checking out the paper yourself!

**ToDo**: Before we start, we need to get the participant numbers of all our participants. To find these, we can make use of the directory names. In Section **1.4.1** you learned the tricks you need to get all subdirectory names of the `data` directory using a wildcard!

In [ ]:
participant_numbers =  # Fill in the right glob call here!

# Next, we need to extract the participant numbers from the paths you just obtained. We do this for you here.
participant_numbers = [x.split(op.sep)[-1] for x in participant_numbers]
print(participant_numbers)

Want to skip the ToDo?

Run the next cell.


In [ ]:
with open(op.join('example_data', 'subject_names'), 'r') as f:
    participant_numbers = [s.replace('\n', '') for s in f.readlines()]

We can now use this participant_numbers list to loop over the participant names!

**ToDo** (optional!): This final ToDo is a big one, in which everything we learned so far comes together nicely. Write a loop over all participants, implementing a cross-validated classification pipeline including feature selection. We provide you with some "boilerplate" code to start with, but it's up to you to complete the analysis/loop.

Make sure to include the following:
1. Scale the patterns (even better: cross-validate your scaler)
2. Use 5-fold Stratified cross-validation
3. Within each fold, select the best 100 features using an f_classif Selector
4. Use a SVM to decode condition from these 100 features

Optional:
- Use a scikit-learn Pipeline in which you group your scaler, selector, and estimator!

Hints:
- The design for each subject was exactly the same, so you can use the same "y" as in the previous example!

In [ ]:
### Set your analysis parameters/pipeline ###

# ToDo: initialize a stratified K-fold class (let's use K=5) 
skf =  
# ToDo: Initialize the SelectKBest Selector with an f_classif score function
select100best = 
    
# ToDo: initialize an SVC-object
clf = 

# Optional: build a pipeline-object!

# Keep track of all performance estimates
all_performance = np.zeros(len(participant_numbers))

### Start loop over subjects ###
for i_sub, participant_number in enumerate(participant_numbers):
    
    # For each participant, get all tstat file paths. We did this for a single subject in Section 1.3.1!
    path_this_participant = op.join('..', 'data', participant_number, 'wm.feat', 'stats', 'tstat*.nii.gz')
    t_stat_paths_this_participant = sort_nifti_paths(glob(path_this_participant))
    
    # Then, we need to create the pattern matrix X for each participant. Remember how we did this in Section 1.3.1?
    # For each participant, we need to get the data dimensions. Let's extract these from the nifti header. 
    voxel_dims = nib.load(t_stat_paths_this_participant[0]).header.get_data_shape()
    X = np.zeros((len(t_stat_paths_this_participant), np.prod(voxel_dims)))

    # Load the data of each trial separately. This was the last thing we did as part of Part 1.
    for trial, tstat_path in enumerate(t_stat_paths_this_participant):
        data = nib.load(tstat_path).get_data()
        data = data.ravel()
        X[trial,:] = data
    
    # ToDo: scale the patterns (or optionally: cross-validate the scaler by fitting on train)
    
    # All data of the current participant is now loaded! We can start the pattern analysis for this participant now
    
    # Keep track of performance between folds
    performance_this_participant = np.zeros(skf.n_splits)
    
    # Loop over the folds
    for i_fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):

        # ToDo: make X_train, X_test, y_train, y_test

        # ToDo: call the select100best fit method (on X_train), and transform both X_train and X_test
    
        # ToDo: fit your model on the train data
        
        # ToDo: predict the classes of your test data
        
        # ToDo: calculate the performance (ROC-AUC) in this fold
        performance = 
        
        performance_this_participant[i_fold] = performance

    # ToDo: calculate average performance (mean across folds!)
    mean_performance_this_participant = 
    print('Mean performance for participant %s: %.3f' % (participant_number, mean_performance_this_participant))
    
    all_performance[i_sub] = mean_performance_this_participant

print('Final performance mean (std): %.3f (%.3f)' % (all_performance.mean(), all_performance.std()))

If everything went well, you now have an array all_performance which contains the performance estimates (mean over folds) of all participants! This is all the data we need to do a one-sample t-test.

The one-sample t-test is implemented in the Python package Scipy. Let's load it first:


In [ ]:
from scipy.stats import ttest_1samp

The function requires two arguments: the vector of values to test, and the hypothesized population mean. In our case, the data vector is all_performance, and the population mean is 0.50. If done correctly, the ttest_1samp function returns a tuple of the t-value and the p-value.

**ToDo**: Do the one-sample t-test! Can we decode the condition with a significantly above-chance accuracy?

In [ ]:
# Example answer
t, p = ttest_1samp(all_performance, 0.5)
print('The t-value is %.3f, with a p-value of %.5f' % (t, p))

Finally, we can draw our conclusions: can we decode trial condition, using the fMRI data, with an above-chance accuracy? (Yes we can! But remember, the results are probably positively biased due to autocorrelated samples and, as discussed earlier, this t-test is not a proper random-effects analysis!)

Congratulations! You've reached the end of this tutorial

We hope you enjoyed the workshop. It should be stressed that these are really the basics of MVPA - every step that was introduced warrants its own workshop (feature selection, feature extraction, model selection, cross-validation, significance testing - and much more that we did not discuss, such as regularization, hyperparameter optimization/grid-search, and feature visualization). You are encouraged to discover what's possible with MVPA on your own!

Good luck using MVPA!

Extra literature ...

If you want some more background information/material on machine learning in the context of neuroimaging, we recommend the following articles:

On machine learning in Python

On MVPA design

On cross-validation

On (group-level) statistics

Extra! Using skbold for your machine learning analyses

As mentioned earlier, we developed skbold, a Python package that offers a set of tools to make machine learning analyses of fMRI data easier. In what follows, we showcase the functionality of skbold at different stages of the typical machine learning pipeline!

Data structuring and representation

As you've seen, there is a lot of "boilerplate" code involved in loading in/structuring the patterns and corresponding labels. In skbold we created a custom object, the Mvp (multivoxel pattern object), which makes the pattern structuring and preprocessing trivially easy. There are two flavours of Mvp-objects: the MvpWithin object and the MvpBetween object, which are appropriate for within-subject and between-subject decoding analyses, respectively. Given that our example dataset is a within-subject data-set, let's look at an example of how to use MvpWithin.

First, we load in the class from skbold.core:


In [ ]:
# we'll reset our namespace to free up some memory
%reset -f 
from skbold.core import MvpWithin
import os.path as op

Now, the MvpWithin class is able to quickly load and transform data from FSL-feat (i.e. first-level) directories. It looks for patterns in the stats directory (given a specific statistic) and loads condition-labels from the design.con file, which are all standard contents of FEAT-directories.

Additionally, we can specify a mask to restrict the patterns with. This mask can be in MNI-space or native EPI-space - skbold handles the registration between these two spaces (only if FSL is installed!). We can also choose to remove voxels that have a value of 0 across all samples (which are probably located outside the brain).

We need to specify these parameters upon initialization of an MvpWithin object:


In [ ]:
feat_dir = op.join('..', 'data', 'pi0070', 'wm.feat')
mask_file = None   # masking only works if you have FSL installed (if so, uncomment the line below!)

# UNCOMMENT IF YOU HAVE FSL INSTALLED!
# mask_file = op.join('..', 'data', 'GrayMatter_prob.nii.gz')  # mask out all non-gray matter!
read_labels = True      # parse labels (targets) from design.con file!
ref_space = 'epi'       # extract patterns in functional space (alternatively: 'mni')
statistic = 'tstat'     # use the tstat*.nii.gz files (in *.feat/stats) as patterns
remove_zeros = True     # remove voxels which are zero in each trial

mvp = MvpWithin(source=feat_dir, read_labels=read_labels, ref_space=ref_space, 
                statistic=statistic, remove_zeros=remove_zeros, mask=mask_file)

Alright, now we can simply call the create() method to start the pattern loading/structuring process!


In [ ]:
mvp.create()

After calling create(), the Mvp-object has a couple of 'new' attributes! Let's check them out.


In [ ]:
print("The attribute .X represents our samples-by-features matrix of shape %s" % (mvp.X.shape,))
print("The attribute .y represents our targets (y) of shape %s" % (mvp.y.shape,))

As you can see, these are exactly the patterns (X) and labels (y) which we created manually earlier in this workshop (except that X contains fewer features due to the setting remove_zeros=True). We can also inspect the names of the patterns (as parsed from the design.con file):


In [ ]:
print(mvp.contrast_labels)

Feature extraction / selection

Another feature of skbold is that it offers some neuroimaging-specific transformers (implemented the same way as scikit-learn transformers). Let's look at, for example, the ClusterThreshold class - a transformer that applies a (3D) cluster-thresholding procedure on top of univariate feature selection and subsequently returns the cluster-average values as features (i.e. a dimensionality-reduction procedure that reduces voxels to cluster-averages).

Let's check it out:


In [ ]:
from skbold.feature_extraction import ClusterThreshold
from sklearn.feature_selection import f_classif

clt = ClusterThreshold(mvp=mvp, min_score=10, selector=f_classif)

We initialized our ClusterThreshold object to apply an initial (univariate) threshold at min_score=10. The voxels that "survive" this threshold are subsequently clustered, and values are averaged within each cluster. Below, we show that the API is exactly the same as that of scikit-learn's transformers:


In [ ]:
from sklearn.model_selection import train_test_split

# Let's cross-validate our ClusterThresholding procedure (which you should always do!)
X_train, X_test, y_train, y_test = train_test_split(mvp.X, mvp.y, test_size=0.25)
print("Shape of X_train before cluster-thresholding: %s" % (X_train.shape,))
print("Shape of X_test before cluster-thresholding: %s" % (X_test.shape,), '\n')

# Let's fit and transform
clt.fit(X_train, y_train)
X_train_clt = clt.transform(X_train)
print("Shape of X_train after cluster-thresholding: %s" % (X_train_clt.shape,))

X_test_clt = clt.transform(X_test)
print("Shape of X_test after cluster-thresholding: %s" % (X_test_clt.shape,))

Skbold has many more transformers, such as RoiIndexer, which indexes patterns given a certain mask/ROI. It doesn't matter whether the patterns are in EPI-space and the mask/ROI is in MNI-space; skbold registers the mask/ROI from one space to the other accordingly. (It needs FSL for this, and as we don't know whether you have this installed, we won't showcase it here.)

Model evaluation / Feature visualization

Apart from the Mvp object, skbold also includes the MvpResults object, which is designed to keep track of your pipeline's performance and feature-weights across different folds. You can keep track of any (scikit-learn) metric (like F1-score, accuracy, ROC-AUC, etc.). Keeping track of feature weights comes in three 'flavors': the option "ufs", which keeps track of the 'univariate' feature scores (only applicable if you actually use 'univariate feature selection' in your pipeline); the option "fwm" ('feature weight mapping', see this article), which keeps track of the raw feature weights; and the option "forward", which keeps track of the corresponding forward-model weights given the feature weights (see this article).

Anyway, let's look at how we should initialize such an object:


In [ ]:
from skbold.postproc import MvpResults
from sklearn.metrics import accuracy_score, f1_score

mvpr = MvpResults(mvp=mvp, n_iter=5, feature_scoring='forward', confmat=True,
                  accuracy=accuracy_score, f1=f1_score)

Importantly, the MvpResults class needs an Mvp object upon initialization to extract some meta-data, and it needs to know how many folds (n_iter) we're going to keep track of (here we assume we'll do 5-fold CV). We also indicate that we want to keep track of the confusion-matrices across folds (confmat=True), and after that we specify a couple of metrics (you can specify any number of metrics in the form name_of_metric=sklearn_metric_function).

Now, we can build a ML pipeline and update our MvpResults object after each fold. Importantly, as you'll see below, to correctly extract feature-weights, you need to pass a Pipeline object when calling the method update().

Let's first import some things for our pipeline and define the pipeline and CV-scheme:


In [ ]:
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold

pipe_line = Pipeline([('scaler', StandardScaler()),
                      ('ufs', SelectKBest(score_func=f_classif, k=1000)),
                      ('clf', SVC(kernel='linear'))
                     ])

skf = StratifiedKFold(n_splits=5)

Now we can implement our analysis and simply call mvpr.update() after each fold:


In [ ]:
for i, (train_idx, test_idx) in enumerate(skf.split(mvp.X, mvp.y)):
    print("Processing fold %i / %i" % (i+1, skf.n_splits))
    X_train, X_test = mvp.X[train_idx], mvp.X[test_idx]
    y_train, y_test = mvp.y[train_idx], mvp.y[test_idx]
    pipe_line.fit(X_train, y_train)
    pred = pipe_line.predict(X_test)
    mvpr.update(test_idx=test_idx, y_pred=pred, pipeline=pipe_line)

We can check out the results of our analysis by calling the compute_scores() method:


In [ ]:
performance, feature_scores = mvpr.compute_scores()

This prints out the mean and standard deviation of our metrics across folds and the number of voxels that were part of the analysis. We can check out the per-fold performance by looking at the first returned variable (here: performance):


In [ ]:
performance

Also, we can check out the feature-scores (here: the "forward" model corresponding to the classifier weights), returned here as feature_scores. This is a nibabel Nifti-object, which we can visualize using matplotlib:


In [ ]:
import nibabel as nib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(20, 5))
scores_3d = feature_scores.get_data()
background = op.join('..', 'data', 'pi0070', 'wm.feat', 'reg', 'example_func.nii.gz')
background = nib.load(background).get_data()

for i, slce in enumerate(np.arange(20, 60, 5)):
    plt.subplot(2, 4, (i+1))
    plt.title('X = %i' % slce, fontsize=20)
    to_plot = np.ma.masked_where(scores_3d[slce, :, :] == 0, scores_3d[slce, :, :])
    plt.imshow(background[slce, :, :].T, origin='lower', cmap='gray')
    plt.imshow(to_plot.T, origin='lower', cmap='hot')
    plt.axis('off')
plt.tight_layout()
plt.show()

Almost always, you have more than one subject, so what you can do is loop over subjects, initialize a new MvpResults object for every subject, and store these objects in a list. Once the loop over subjects is completed, simply initialize an MvpAverageResults object and call compute_statistics(), as we show below:


In [ ]:
from glob import glob
feat_dirs = glob(op.join('..', 'data', 'pi*', 'wm.feat'))
n_folds = 5

mvp_results_list = []
for feat_dir in feat_dirs:
    print("Subject: %s" % feat_dir)
    mvp = MvpWithin(source=feat_dir, read_labels=read_labels,
                    ref_space=ref_space, statistic=statistic,
                    remove_zeros=remove_zeros,
                    mask=mask_file)
    mvp.create()
    mvpr = MvpResults(mvp=mvp, n_iter=n_folds, feature_scoring='forward', confmat=True,
                      accuracy=accuracy_score, f1=f1_score)

    for train_idx, test_idx in skf.split(mvp.X, mvp.y):
        X_train, X_test = mvp.X[train_idx], mvp.X[test_idx]
        y_train, y_test = mvp.y[train_idx], mvp.y[test_idx]
        pipe_line.fit(X_train, y_train)
        pred = pipe_line.predict(X_test)
        mvpr.update(test_idx=test_idx, y_pred=pred, pipeline=pipe_line)
    
    mvp_results_list.append(mvpr)

In [ ]:
from skbold.postproc import MvpAverageResults
mvpr_average = MvpAverageResults()
subjects = [op.basename(op.dirname(f)) for f in feat_dirs]
results = mvpr_average.compute_statistics(mvp_results_list, identifiers=subjects, metric='f1', h0=.5)
results

Questions? Suggestions?

This section demonstrated some of the functionality of skbold. If you have any questions, suggestions, or want more information about skbold - feel free to ask (or email!). Details on skbold and its API can be found on its page on ReadTheDocs!