pyradigm
Data structures provided by the pyradigm library are well suited for medical data, where each sample/subject/unit needs to be uniquely identified by a single identifier (denoted as subject ID or something similar, depending on the field). That identifier links multiple tables containing diverse types of data, such as features extracted and prediction targets for machine learning analyses, covariates and demographics, which often span multiple data types (numerical, categorical etc.).
Key-level correspondence across data, targets (either as labels such as 1 or 2, or as class names such as healthy and disease) and related tables (demographics and other meta-data) is essential to maintain the integrity of the datasets themselves as well as of the subsequent analyses. It also helps improve provenance, as it is easy to encapsulate the relevant info and necessary history in user-defined attributes and meta-data.
To provide a concrete example, let's look at how a machine learning dataset is handled traditionally.
You typically have a matrix X of size n x p (n samples and p features), and a target vector y containing the target values (numerical, categorical or multi-output arrays). These X and y serve as the training (and test) set for a classifier (such as an SVM), which tries to fit the data X to match y as accurately as possible.
Let's get a little more concrete and produce toy data for X and y:
In [1]:
import sys, os
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
n = 10 # number of samples
p = 3 # number of features
X = np.random.random([n, p]) # random data for illustration
y = [1]*5 + [2]*5 # toy labels: two classes of 5 samples each
np.set_printoptions(precision=2) # save some screen space
print('X : \n{}'.format(X))
print('y : \n{}'.format(y))
Almost all machine learning toolboxes take their input in this form: X and y, regardless of the original source that produced these features in the first place.
This might be fine if all you ever wanted to do is extract some features, run some machine learning once and throw the features away!
But that is almost never the case, because it doesn't simply end there.
At a minimum, you often need to dissect the results and link them back across different tables, e.g. to figure out which subjects were involved and how they relate to the demographics and other meta-data. And all of this info needs to be obtained and kept in correspondence with the original features.
And if you are like me, you would be thinking about how you would organize your workflow so that the aforementioned tasks can be accomplished with ease.
This pyradigm data structure is the result of years of refinement to handle the above complex workflow and related scenarios with ease. It always organizes the extracted features in a dictionary keyed by their samplet ID, along with other important info such as target values, other attributes and associated meta-data. This, by definition, preserves the integrity of the data (making it impossible to, for example, incorrectly label samples).
With user-defined attributes, this data structure allows the capture of provenance info in a way that is meaningful to the user.
NOTE: we define a samplet to be a single row in X, to distinguish it from a sample, which is sometimes (incorrectly) used to refer both to a single instance and to an entire "sample"/dataset.
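Conceptually (a rough plain-Python sketch of the idea, not of pyradigm's actual internals, using hypothetical samplet IDs and values), this amounts to keeping everything keyed by the samplet ID:
In [ ]:
# hypothetical samplet IDs, features, targets and attributes, for illustration only
feat_by_id = {'sub001': np.random.random(3), 'sub002': np.random.random(3)}
target_by_id = {'sub001': 'healthy', 'sub002': 'disease'}
attrs_by_id = {'sub001': {'age': 65, 'site': 'A'}, 'sub002': {'age': 71, 'site': 'B'}}

# the shared keys keep features, targets and attributes in correspondence
for samplet_id in feat_by_id:
    print(samplet_id, target_by_id[samplet_id], attrs_by_id[samplet_id])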
An example application is shown below.
Importing the necessary modules and our fancy class definitions:
In [2]:
from pyradigm import ClassificationDataset as ClfDataset
from pyradigm import RegressionDataset as RegrDataset
The only major difference between the above two data structures is the data type of the target, i.e. categorical/discrete vs. continuous values. The ClassificationDataset accepts only categorical values for its target, often as strings ('healthy', 'disease', 'cat', 'chair' etc.), whereas the RegressionDataset allows continuous floating point numbers (e.g. age, temperature, salary, weight etc.). Depending on the target data type, the analyses (esp. in machine learning) change dramatically in terms of which predictive model is employed, how it is optimized, what performance metrics are considered etc.
Let's get started with a classification dataset, which can be instantiated as shown below. We also give it a simple description:
In [3]:
dataset = ClfDataset()
dataset.description = 'ADNI1: cortical thickness features from Freesurfer v6.0, QCed.'
These datasets have customized and convenient display methods summarizing the important info. For example, just printing the instance shows a summary of its contents (typically the count of samplets per target, along with the description, if any).
In [4]:
dataset
Out[4]:
You can see the dataset has a description attached to it; however, it is still empty. This can be verified in a boolean context as shown below:
In [5]:
bool(dataset)
Out[5]:
Let's add samples to this dataset, which is when this dataset implementation becomes really handy. Before we do that, we will define some convenience routines just to illustrate a simple yet common use of this dataset.
In [6]:
def read_thickness(path):
    """Dummy function to mimic a data reader."""
    # in your actual routine, this might be:
    # pysurfer.read_thickness(path).values()
    return np.random.random(2)

def get_features(work_dir, subj_id):
    """Returns the whole-brain cortical thickness for a given subject ID."""
    # extension to identify the data file; this could be .curv or anything else you choose
    ext_thickness = '.thickness'

    thickness = dict()
    for hemi in ['lh', 'rh']:
        path_thickness = os.path.join(work_dir, subj_id, hemi + ext_thickness)
        thickness[hemi] = read_thickness(path_thickness)

    # concatenating them to build a whole-brain feature set
    thickness_wb = np.concatenate([thickness['lh'], thickness['rh']])
    return thickness_wb
So now we have IO routines to read the data for us. Let's define where the data will come from:
In [7]:
work_dir = '/project/ADNI/FreesurferThickness_v4p3'
class_set = ['Control', 'Alzheimer', 'MCI']
class_sizes = [7, 8, 6]
This would obviously change for your applications, but this has sufficient properties to illustrate the point.
Let's look at what methods this dataset offers us:
In [8]:
[mm for mm in dir(dataset) if not mm.startswith('_') ]
Out[8]:
That's a lot of methods and attributes to use, organize and retrieve datasets.
So let's go through them by their usage sections.
To construct a dataset, one typically starts with a list of subject IDs to be added. We create a few random lists, each to be considered a separate class:
In [9]:
# seed numpy's RNG, since np.random.randint is used below
np.random.seed(42)

def get_id_list(class_name, size=10):
    """Generates a random list of subject IDs for a given class."""
    return ['{}{:04d}'.format(class_name[0], np.random.randint(50 * size))
            for _ in range(size)]
The dataset can be populated by adding all the subjects belonging to one class (referred to by class_id here), one samplet at a time, using the .add_samplet() method. Let's go ahead and add some samplets based on the ID lists we just created.
Note: samplet here refers to a single row in the sample feature matrix X. This new term samplet is defined to distinguish the individual row elements of X from X itself, and to minimize confusion.
In [10]:
for class_index, class_id in enumerate(class_set):
    print('Adding class {:>5}'.format(class_id))
    id_list = get_id_list(class_id, class_sizes[class_index])
    for subj_id in id_list:
        print('\t reading subject {:>7}'.format(subj_id))
        thickness_wb = get_features(work_dir, subj_id)
        # adding the samplet to the dataset
        dataset.add_samplet(subj_id, thickness_wb, class_id)
Nice, isn't it?
So what's nice about this, you ask? The simple fact that you are constructing the dataset as you read the data in its most elemental form (in the units of the dataset, such as the subject ID in our neuroimaging application). You're done as soon as you're done reading the features from disk.
What's more - you can inspect the dataset in an intuitive manner, as shown below:
In [11]:
dataset
Out[11]:
Even better, right? No more coding of several commands to get the complete and concise sense of the dataset.
In [12]:
dataset.num_samplets
Out[12]:
In [13]:
dataset.num_features
Out[13]:
In [14]:
dataset.target_set
Out[14]:
In [15]:
dataset.target_sizes
Out[15]:
In [16]:
dataset.target_sizes['Control']
Out[16]:
If you'd like to take a look at the data inside for a few subjects - shall we call it a glance?
In [17]:
dataset.glance()
Out[17]:
We can control the number of items to glance at by passing a number to the dataset.glance() method:
In [18]:
dataset.glance(2)
Out[18]:
Or you may be wondering what the subject IDs in the dataset are... here they are:
In [19]:
dataset.samplet_ids
Out[19]:
These datasets offer all the convenient methods and attributes you need. Besides, it is quite easy to extend them to fit your needs and improve your workflow. For example, you can look up an individual samplet by its ID, just like with a Python dict:
In [21]:
dataset['M0022']
Out[21]:
Like a Python dict, it raises an error if the key is not in the dataset:
In [22]:
dataset['dlfjdjf']
A more graceful way to handle this is to use dataset.get, which lets you control what value is returned in case the requested ID is not found in the dataset.
In [23]:
dataset.get('dkfjd', np.nan)
Out[23]:
Thanks to built-in iteration, we can easily iterate over all the samplets in the dataset:
In [24]:
for samplet, features in dataset:
    print("{} : {:>10} : {}".format(samplet, dataset.targets[samplet], features))
Did you see that? It's so intuitive and natural! Such a clean traversal of the dataset.
Thanks to the choice of OrderedDict to represent the data, classes and labels underneath, the order of samplet addition is retained. Hence the correspondence across samplets in the dataset is maintained not only key-wise (by samplet ID), but also index-wise.
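For instance, the order of the samplet IDs matches the order of the rows in the matrix returned by data_and_targets (a method used again later in this notebook). A minimal sketch of this check, assuming indexing by ID returns the feature vector as in the dataset['M0022'] example above:
In [ ]:
data_matrix, target_vals, ids = dataset.data_and_targets()
for row_index, samplet_id in enumerate(ids):
    # row i of data_matrix corresponds to ids[i], i.e. to dataset[samplet_id]
    assert np.allclose(data_matrix[row_index], dataset[samplet_id])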
Quite often, we are interested in computing some statistics on the data for a given subject (such as the mean, or an ROI-wise median). Typically this requires a loop, with some computation and reorganization of the results into a new dataset! A simple and routine usage pattern, but one that can't be avoided if you are still fiddling with representing your dataset in medieval matrices! :)
If you organized your dataset in a pyradigm, such a computation is trivial, thanks to the built-in transform method. The mean value for each subject can be computed and organized in a new dataset with a single, intuitive line:
In [ ]:
mean_data = dataset.transform(np.mean)
mean_data.description = 'mean values per subject'
mean_data
We can easily traverse the dataset to check the result:
In [ ]:
for samplet, val in mean_data:
print("{} : {:>10} : {:.3f}".format(samplet, mean_data.targets[samplet], val))
As transform accepts an arbitrary callable, we could do many more sophisticated things, such as accessing a subset of features, e.g. the cortical thickness for a particular region of interest (say, the posterior cingulate gyrus).
In [ ]:
# let's make a toy function to return the indices for the ROI
def get_ROI_indices(x): return x[:3]
Using this "mask" function, we can easily obtain features for an ROI
In [ ]:
pcg = dataset.transform(get_ROI_indices)
We can verify that the new dataset does indeed have only 3 features, for the same subjects/classes:
In [ ]:
pcg
In [ ]:
pcg.num_features
Let's make a histogram of the just-computed numbers:
In [ ]:
data, lbl, keys = pcg.data_and_targets()
In [ ]:
n, bins, patches = plt.hist(data)
Remember, as the original source of the data was random, this has no units, properties or meaning!
In addition to providing a structured way of obtaining the various properties of this dataset, this implementation really comes in handy when you have to slice and dice the dataset (with a large number of classes and features) into smaller subsets (e.g. for binary classification). Let's see how we can retrieve the data for a single class:
In [25]:
ctrl = dataset.get_class('Control')
That's it, obtaining the data for a given class is a simple call away.
Now let's see what it looks like:
In [26]:
ctrl
Out[26]:
Note that the description was updated automatically to indicate its history. Let's see some data from the controls:
In [27]:
ctrl.glance(2)
Out[27]:
We can also query a random subset of samples for manual inspection or cross-validation purposes. For example:
In [28]:
random_subset = dataset.random_subset(perc_in_class=0.3)
random_subset
Out[28]:
You can see which samplets were selected:
In [29]:
random_subset.samplet_ids
Out[29]:
You can verify that it is indeed random by issuing another call:
In [30]:
# each call draws a new random subset
dataset.random_subset(perc_in_class=0.3).samplet_ids
Out[30]:
In [31]:
data = dataset.get_subset(dataset.samplet_ids[1:20])
data
Out[31]:
As simple as that. We can also split the dataset into training and test sets, based on a given percentage:
In [32]:
train_set, test_set = dataset.train_test_split_ids( train_perc = 0.5)
This method returns two sets of samplet IDs: a training set (containing 50% of the samples from each class in the dataset) and a test set with the rest. Let's see what they contain:
In [33]:
train_set, test_set
Out[33]:
We can also get a train/test split by specifying an exact number of subjects we would like from each class (e.g. when you would like to avoid class imbalance in the training set):
In [34]:
train_set, test_set = dataset.train_test_split_ids( count_per_class = 3)
Let's see what the training set contains - we expect 3*3 = 9 subjects:
In [35]:
train_set
Out[35]:
We can indeed verify that this is the case by creating a new, smaller dataset from that list of IDs and getting a summary:
In [36]:
training_dataset = dataset.get_subset(train_set)
training_dataset
Out[36]:
Another programmatic way to look into different classes is this:
In [37]:
target_set, target_sizes = training_dataset.summarize()
target_set, target_sizes
Out[37]:
which returns all the classes (and their sizes) that you can iterate over.
Using these two lists, we can easily obtain subset datasets, as illustrated below.
In [38]:
dataset
Out[38]:
In [39]:
binary_dataset = dataset.get_class(['Control','Alzheimer'])
binary_dataset
Out[39]:
How about selecting a subset of features from all samples?
In [40]:
binary_dataset.get_feature_subset(range(2))
Out[40]:
Great, isn't it? You can also see the two-step history (an initial subset of classes, followed by a subset of features). Now let's save this dataset to disk:
In [41]:
from pathlib import Path
out_file = Path('.') / 'Freesurfer_thickness_v4p3.PyradigmDataset.pkl'
binary_dataset.save(out_file)
That's it - it is saved.
Let's reload it from disk and make sure we can indeed retrieve it:
In [42]:
reloaded = ClfDataset(out_file) # another form of the constructor!
In [43]:
reloaded
Out[43]:
We can check to see they are indeed one and the same:
In [44]:
binary_dataset == reloaded
Out[44]:
In [45]:
dataset_two = ClfDataset(in_dataset=dataset) # yet another constructor: in its copy form!
How can you check if they are "functionally identical"? As in same keys, same data and classes for each key... Easy:
In [46]:
dataset_two == dataset
Out[46]:
Now let's try the arithmetic:
In [47]:
combined = dataset + dataset_two
Great. The add method recognized the identical set of keys and performed a horizontal concatenation of the features, as can be noticed from the doubled number of features in the combined dataset:
In [48]:
combined
Out[48]:
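As a quick sanity check (a minimal sketch using the num_features and num_samplets attributes shown earlier), the doubling can also be verified programmatically:
In [ ]:
# the feature count doubles while the samplet count stays the same
assert combined.num_features == 2 * dataset.num_features
assert combined.num_samplets == dataset.num_samplets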
We can also do some removal in similar fashion:
In [49]:
smaller = combined - dataset
The data structure even produces a warning to let you know that the resulting output would be empty! We can verify that:
In [50]:
bool(smaller)
Out[50]:
This is all well and good. How does it interact with other packages out there, you might ask? It is as simple as you can imagine:
In [51]:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)
In [52]:
data_matrix, target, sample_ids = binary_dataset.data_and_targets()
clf.fit(data_matrix, target)
Out[52]:
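Because data_and_targets returns the samplet IDs in the same order as the rows of the data matrix, predictions can be linked straight back to individual subjects. A minimal sketch (predicting on the training data here, purely for illustration):
In [ ]:
predictions = clf.predict(data_matrix)
for samplet_id, predicted, actual in zip(sample_ids, predictions, target):
    print('{:>10} : predicted {:>10}  actual {:>10}'.format(samplet_id, predicted, actual))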
There you have it - a simple example to show you the utility and convenience of this ClassificationDataset.
Usage of the RegressionDataset is more or less the same, except keeping in mind that the target value is continuous.
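For completeness, here is a minimal RegressionDataset sketch, assuming the same add_samplet interface and using hypothetical subject IDs with age as the continuous target:
In [ ]:
regr_dataset = RegrDataset()
regr_dataset.description = 'toy regression dataset: predicting age from random features'

# hypothetical subject IDs and ages, for illustration only
ages = {'R0001': 62.5, 'R0002': 71.0, 'R0003': 58.3}
for subj_id, age in ages.items():
    regr_dataset.add_samplet(subj_id, np.random.random(4), age)

regr_dataset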