Research data management for medical and machine learning datasets with pyradigm

The data structures provided by the pyradigm library are well suited to medical data, where each sample/subject/unit needs to be uniquely identified by a single identifier (called the subject ID or something similar, depending on the field). This identifier links multiple tables containing diverse types of data: features extracted for machine learning analyses, prediction targets, covariates and demographics, often spanning multiple data types (numerical, categorical etc.).

Key-level correspondence across data, targets (whether numerical labels such as 1 or 2, or class names such as healthy and disease) and related tables (demographics and other metadata) is essential to maintain the integrity of the datasets themselves as well as of the subsequent analyses. It also improves provenance, as the relevant info and history can be encapsulated directly in user-defined attributes and metadata.

To provide a concrete example, let's look at how a machine learning dataset is handled traditionally.

You typically have a matrix X of size n x p (n samples and p features), and a target vector y containing the target values (numerical, categorical or multi-output arrays). X and y serve as the training (and test) set for a classifier (such as an SVM), which is fit so that its predictions on X match y as accurately as possible.

Let's get a little more concrete, and produce toy data for X and y:


In [1]:
import sys, os
import numpy as np
import matplotlib
%matplotlib
%matplotlib inline
import matplotlib.pyplot as plt

n = 10 # number of samples
p = 3  # number of features

X = np.random.random([n, p]) # random data for illustration
y = [1]*5 + [2]*5            # class labels (1 or 2) for illustration

np.set_printoptions(precision=2) # save some screen space
print('X : \n{}'.format(X))
print('y : \n{}'.format(y))


Using matplotlib backend: TkAgg
X : 
[[0.36 0.81 0.75]
 [0.32 0.19 0.9 ]
 [0.17 0.72 0.04]
 [0.84 0.66 0.07]
 [0.41 0.41 0.04]
 [0.27 0.96 0.65]
 [0.53 0.06 0.89]
 [0.04 0.44 0.46]
 [0.6  0.05 0.42]
 [0.5  0.06 0.58]]
y : 
[1, 1, 1, 1, 1, 2, 2, 2, 2, 2]

Almost all the machine learning toolboxes take their input in this form: X and y, regardless of the original source that produced these features in the first place.

This might be fine if all you ever wanted to do was extract some features, run some machine learning once, and throw the features away!

But this is almost never the case!

Because it doesn't simply end there.

At a minimum, you often need to dissect the results and link them across different tables, e.g. to figure out:

  • which samples were misclassified? (these can only be queried by their identifiers, not simply by their row indices in X)
  • what are the characteristics of those samples?
  • which targets/labels/classes do they belong to?

And all this info needs to be obtained

  • without having to write lots of code connecting a few non-obvious links across disparate sources of data (numerical features in X, sample identifiers in a CSV file) just to find the relevant info
  • without having to track down who or which method originally produced these features
  • without having to reverse-engineer how a previous colleague or student organized the whole dataset, if you didn't generate the features yourself from scratch

And if you are like me, you would be thinking about how to organize your workflow so that the aforementioned tasks can be accomplished with ease.

The pyradigm data structure is the result of years of refinement aimed at handling the above workflow and related scenarios with ease. It always organizes the extracted features in a dictionary keyed by samplet ID, along with other important info such as target values, other attributes and associated metadata. This, by design, preserves the integrity of the data (it is impossible to mislabel samples, etc.).

With user-defined attributes, this data structure allows the capture of provenance info in a way that is meaningful to the user.

NOTE: we define a samplet to be a single row in X, to distinguish it from the word sample, which is sometimes used ambiguously to refer both to a single instance and to an entire "sample" / dataset.

Importing the necessary modules and our fancy class definitions:


In [2]:
from pyradigm import ClassificationDataset as ClfDataset
from pyradigm import RegressionDataset as RegrDataset

The only major difference between the above two data structures is the data type of the target, i.e. categorical/discrete vs. continuous values. The ClassificationDataset accepts only categorical values for its target, often as strings ('healthy', 'disease', 'cat', 'chair' etc.), whereas the RegressionDataset allows continuous floating point numbers (e.g. age, temperature, salary, weight etc.). Depending on the target data type, the analysis (especially in machine learning) changes dramatically in terms of which predictive model is employed, how it is optimized and what performance metrics are considered.

Let's get started with a classification dataset, which can be instantiated as shown below. We also give it a simple description:


In [3]:
dataset = ClfDataset()
dataset.description = 'ADNI1: cortical thickness features from Freesurfer v6.0, QCed.'

These datasets have customised and convenient display methods summarizing the important info. For example, just printing the instance shows a content summary (typically the count of samplets per target, and a description if any).


In [4]:
dataset


Out[4]:
ADNI1: cortical thickness features from Freesurfer v6.0, QCed.
Empty dataset.

You can see the dataset has a description attached to it, but it is still empty. This can be verified in a boolean context, as shown below:


In [5]:
bool(dataset)


Out[5]:
False

Let's add samples to this dataset, which is when this implementation becomes really handy. Before we do that, we will define some convenience routines to illustrate a simple yet common use of this dataset.


In [6]:
def read_thickness(path):
    """Dummy function to minic a data reader."""

    # in your actural routine, this might be:
    #   pysurfer.read_thickness(path).values()
    return np.random.random(2)


def get_features(work_dir, subj_id):
    """Returns the whole brain cortical thickness for a given subject ID."""

    # extension to identify the data file; this could be .curv, anything else you choose
    ext_thickness = '.thickness'

    thickness = dict()
    for hemi in ['lh', 'rh']:
        path_thickness = os.path.join(work_dir, subj_id, hemi + ext_thickness)
        thickness[hemi] = read_thickness(path_thickness)

    # concatenating them to build a whole brain feature set
    thickness_wb = np.concatenate([thickness['lh'], thickness['rh']])

    return thickness_wb

So now we have IO routines to read the data for us. Let's define where the data will come from:


In [7]:
work_dir = '/project/ADNI/FreesurferThickness_v4p3'
class_set = ['Control', 'Alzheimer', 'MCI']
class_sizes = [7, 8, 6]

This would obviously change for your application, but it has enough structure to illustrate the point.

Let's look at what methods this dataset offers us:


In [8]:
[mm for mm in dir(dataset) if not mm.startswith('_') ]


Out[8]:
['add_attr',
 'add_dataset_attr',
 'add_samplet',
 'attr',
 'attr_dtype',
 'attr_summary',
 'data',
 'data_and_labels',
 'data_and_targets',
 'dataset_attr',
 'del_attr',
 'del_samplet',
 'description',
 'dtype',
 'extend',
 'feature_names',
 'from_arff',
 'get',
 'get_attr',
 'get_class',
 'get_data_matrix_in_order',
 'get_feature_subset',
 'get_subset',
 'glance',
 'num_features',
 'num_samplets',
 'num_targets',
 'random_subset',
 'random_subset_ids',
 'random_subset_ids_by_count',
 'rename_targets',
 'sample_ids_in_class',
 'samplet_ids',
 'save',
 'shape',
 'summarize',
 'target_set',
 'target_sizes',
 'targets',
 'train_test_split_ids',
 'transform']

That's a lot of methods and attributes for building, organizing and retrieving datasets.

So let's go through them by their usage sections.

Constructor

You can see there are a few methods such as .add_samplet(), .get_subset() etc. The most frequently used is .add_samplet(), which is key to constructing a pyradigm dataset.

To construct a dataset, one typically starts with a list of subject IDs to be added - we create a few random lists, each to be treated as a separate class:


In [9]:
np.random.seed(42)  # seed numpy's generator, since the IDs below are drawn from np.random

def get_id_list(class_name, size=10):
    """Generates a random list of IDs, prefixed by the first letter of the class name."""

    return ['{}{:04d}'.format(class_name[0], np.random.randint(50 * size))
            for _ in range(size)]

The dataset can be populated by adding all the subjects belonging to one class (referred to by class_id below), one samplet at a time, using the .add_samplet() method. Let's go ahead and add some samplets based on the ID lists we just created.

Note: samplet here refers to a single row in the feature matrix X. This new term is defined to distinguish individual rows of X from X itself and to minimize confusion.


In [10]:
for class_index, class_id in enumerate(class_set):
    print('Adding class {:>5}'.format(class_id))

    target_list = get_id_list(class_id,class_sizes[class_index])
    for subj_id in target_list:
        print('\t reading subject {:>7}'.format(subj_id))
        thickness_wb = get_features(work_dir, subj_id)

        # adding the sample to the dataset
        dataset.add_samplet(subj_id, thickness_wb, class_id)


Adding class Control
	 reading subject   C0052
	 reading subject   C0101
	 reading subject   C0166
	 reading subject   C0168
	 reading subject   C0018
	 reading subject   C0267
	 reading subject   C0000
Adding class Alzheimer
	 reading subject   A0179
	 reading subject   A0253
	 reading subject   A0152
	 reading subject   A0164
	 reading subject   A0141
	 reading subject   A0320
	 reading subject   A0197
	 reading subject   A0189
Adding class   MCI
	 reading subject   M0269
	 reading subject   M0078
	 reading subject   M0298
	 reading subject   M0024
	 reading subject   M0022
	 reading subject   M0081

Nice. Isn't it?

So what's nice about this, you say? The simple fact that you are constructing a dataset as you read the data in its most elemental form (in the units of the dataset such as the subject ID in our neuroimaging application). You're done as soon as you're done reading the features from disk.

What's more - you can inspect the dataset in an intuitive manner, as shown below:


In [11]:
dataset


Out[11]:
ADNI1: cortical thickness features from Freesurfer v6.0, QCed.
21 samplets, 3 classes, 4 features
Class   Control : 7 samplets
Class Alzheimer : 8 samplets
Class       MCI : 6 samplets

Even better, right? No more coding of several commands to get the complete and concise sense of the dataset.

Convenient attributes

If you would like, you can always get more specific information, such as:


In [12]:
dataset.num_samplets


Out[12]:
21

In [13]:
dataset.num_features


Out[13]:
4

In [14]:
dataset.target_set


Out[14]:
['MCI', 'Control', 'Alzheimer']

In [15]:
dataset.target_sizes


Out[15]:
Counter({'Control': 7, 'Alzheimer': 8, 'MCI': 6})

In [16]:
dataset.target_sizes['Control']


Out[16]:
7

If you'd like to take a look at the data inside for a few subjects - shall we call it a glance?


In [17]:
dataset.glance()


Out[17]:
{'C0052': array([0.27, 0.94, 0.06, 0.06]),
 'C0101': array([0.02, 0.92, 0.53, 0.72]),
 'C0166': array([0.44, 0.47, 0.64, 0.17]),
 'C0168': array([0.46, 0.83, 0.38, 0.62]),
 'C0018': array([0.3 , 0.95, 0.84, 0.73])}

We can control the number of items to glance at by passing a number to the dataset.glance() method:


In [18]:
dataset.glance(2)


Out[18]:
{'C0052': array([0.27, 0.94, 0.06, 0.06]),
 'C0101': array([0.02, 0.92, 0.53, 0.72])}

Or you may be wondering what the subject IDs in the dataset are... here they are:


In [19]:
dataset.samplet_ids


Out[19]:
['C0052',
 'C0101',
 'C0166',
 'C0168',
 'C0018',
 'C0267',
 'C0000',
 'A0179',
 'A0253',
 'A0152',
 'A0164',
 'A0141',
 'A0320',
 'A0197',
 'A0189',
 'M0269',
 'M0078',
 'M0298',
 'M0024',
 'M0022',
 'M0081']

These datasets offer all the convenient methods and attributes you need. Besides, it is quite easy to extend them to fit your needs and improve your workflow.
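
For instance, the user-defined attribute methods seen in the listing above (add_attr, add_dataset_attr, get_attr) are meant for exactly the kind of provenance capture mentioned at the beginning. The sketch below is illustrative only: the attribute names and values are made up, and the argument order is an assumption, so check the pyradigm documentation before relying on it.


In [ ]:
# illustrative sketch: argument order of add_attr()/add_dataset_attr() is assumed
dataset.add_dataset_attr('feature_info', 'Freesurfer v6.0 cortical thickness, QCed')

# hypothetical per-samplet covariate, keyed by the same samplet ID
dataset.add_attr('age', 'C0052', 72.5)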

Accessing samples

Thanks to its design, the data for a given samplet, e.g. 'M0022', can simply be obtained by:


In [21]:
dataset['M0022']


Out[21]:
array([0.31, 0.86, 0.7 , 0.01])

Like a Python dict, it raises an error if the key is not in the dataset:


In [22]:
dataset['dlfjdjf']


---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-22-4b19d52bac71> in <module>
----> 1 dataset['dlfjdjf']

~/dev/pyradigm/pyradigm/base.py in __getitem__(self, item)
   1020             return self._data[item]
   1021         else:
-> 1022             raise KeyError('{} not found in dataset.'.format(item))
   1023 
   1024 

KeyError: 'dlfjdjf not found in dataset.'

A more graceful way to handle this is to use dataset.get, which lets you control what value is returned when the requested ID is not found in the dataset.


In [23]:
dataset.get('dkfjd', np.nan)


Out[23]:
nan

Iteration

Thanks to built-in iteration, we can easily iterate over all the samplets in the dataset:


In [24]:
for samplet, features in dataset:
    print("{} : {:>10} : {}".format(sample, dataset.targets[samplet], features))


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-24-3c3646e8c9b1> in <module>
      1 for samplet, features in dataset:
----> 2     print("{} : {:>10} : {}".format(sample, dataset.targets[samplet], features))

NameError: name 'sample' is not defined

Did you see that? It's so intuitive and natural! Such a clean traversal of the dataset.

Thanks to the choice of OrderedDict() to represent the data, classes and labels underneath, the order of sample addition is retained. Hence the correspondence across samples in the dataset is maintained not only key-wise (by the samplet ID), but also index-wise.
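
If you want to convince yourself of this index-wise correspondence, a quick sanity check could look like the sketch below. It assumes that data_and_targets() (demonstrated later in this notebook) returns the matrix rows in the same order as the IDs it returns.


In [ ]:
# illustrative sanity check: row i of the matrix form corresponds to ids[i]
data_matrix, targets, ids = dataset.data_and_targets()
for row, sid in zip(data_matrix, ids):
    assert np.allclose(row, dataset[sid])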

Subject-wise transform

Quite often, we are interested in computing some statistic on the data for each subject (such as the mean, or an ROI-wise median). Typically this requires a loop, with some computation, and organizing the results into a new dataset! A simple and routine pattern of usage, but one that can't be avoided if you are still fiddling with representing your dataset in medieval matrices! :)

If you organized your dataset in a pyradigm, such a computation is trivial, thanks to the built-in transform method. The mean value for each subject can be computed and organized into a new dataset with a single, intuitive line:


In [ ]:
mean_data = dataset.transform(np.mean)
mean_data.description = 'mean values per subject'
mean_data

We can easily traverse the dataset to check the result:


In [ ]:
for samplet, val in mean_data:
    print("{} : {:>10} : {:.3f}".format(samplet, mean_data.targets[samplet], val))

As transform accepts an arbitrary callable, we could do many more sophisticated things, such as accessing a subset of features, e.g. the cortical thickness for a particular region of interest (say, the posterior cingulate gyrus).


In [ ]:
# a toy "mask" function that picks out the first 3 features as the ROI
def get_ROI_indices(x): return x[:3]

Using this "mask" function, we can easily obtain features for an ROI


In [ ]:
pcg = dataset.transform(get_ROI_indices)

We can verify that the new dataset does indeed have only 3 features, for the same subjects/classes:


In [ ]:
pcg

In [ ]:
pcg.num_features

Let's make a histogram of the just-computed numbers:


In [ ]:
data, lbl, keys = pcg.data_and_targets()

In [ ]:
n, bins, patches = plt.hist(data)

Remember, as the original data was random, this plot has no units, properties or meaning!

Subset selection

In addition to the structured way of obtaining the various properties of this dataset, this implementation really comes in handy when you have to slice and dice the dataset (with a large number of classes and features) into smaller subsets (e.g. for binary classification). Let's see how we can retrieve the data for a single class:


In [25]:
ctrl = dataset.get_class('Control')

That's it, obtaining the data for a given class is a simple call away.

Now let's see what it looks like:


In [26]:
ctrl


Out[26]:
 Subset derived from: ADNI1: cortical thickness features from Freesurfer v6.0, QCed.
7 samplets, 1 classes, 4 features
Class Control : 7 samplets

The description is even updated automatically to indicate its history. Let's see some data from the controls:


In [27]:
ctrl.glance(2)


Out[27]:
{'C0052': array([0.27, 0.94, 0.06, 0.06]),
 'C0101': array([0.02, 0.92, 0.53, 0.72])}

We can also query a random subset of samples for manual inspection or cross-validation purposes. For example:


In [28]:
random_subset = dataset.random_subset(perc_in_class=0.3)
random_subset


Out[28]:
 Subset derived from: ADNI1: cortical thickness features from Freesurfer v6.0, QCed.
5 samplets, 3 classes, 4 features
Class   Control : 2 samplets
Class Alzheimer : 2 samplets
Class       MCI : 1 samplets

You can see which samplets were selected:


In [29]:
random_subset.samplet_ids


Out[29]:
['C0101', 'C0168', 'A0253', 'A0164', 'M0078']

You can verify that it is indeed random by issuing another call:


In [30]:
# another call produces a different random draw each time
dataset.random_subset(perc_in_class=0.3).samplet_ids


Out[30]:
['C0168', 'C0267', 'A0253', 'A0320', 'M0022']

Let's see how we can retrieve specific samples by their IDs (for which there are many use cases):


In [31]:
data = dataset.get_subset(dataset.samplet_ids[1:20])
data


Out[31]:
 Subset derived from: ADNI1: cortical thickness features from Freesurfer v6.0, QCed.
19 samplets, 3 classes, 4 features
Class   Control : 6 samplets
Class Alzheimer : 8 samplets
Class       MCI : 5 samplets

So as simple as that.

Cross-validation

If you would like to develop a variant of cross-validation and need a random split of the dataset into training and test sets, it is as simple as:


In [32]:
train_set, test_set = dataset.train_test_split_ids( train_perc = 0.5)

This method returns two sets of samplet IDs: a training set (containing 50% of the samples from each class in the dataset) and the rest as the test set. Let's see what they contain:


In [33]:
train_set, test_set


Out[33]:
(['C0018',
  'C0166',
  'C0101',
  'A0152',
  'A0189',
  'A0253',
  'A0320',
  'M0078',
  'M0269',
  'M0024'],
 ['A0197',
  'M0081',
  'A0141',
  'C0000',
  'C0168',
  'M0298',
  'A0164',
  'C0052',
  'A0179',
  'M0022',
  'C0267'])
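
With these two ID lists in hand, a minimal hold-out evaluation could be sketched as below, using scikit-learn (which also appears in the Portability section later); the variable names are illustrative.


In [ ]:
from sklearn import svm

# build the actual train/test datasets from the ID splits
train_data = dataset.get_subset(train_set)
test_data = dataset.get_subset(test_set)

train_X, train_y, _ = train_data.data_and_targets()
test_X, test_y, test_ids = test_data.data_and_targets()

clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(train_X, train_y)
print('hold-out accuracy: {:.2f}'.format(clf.score(test_X, test_y)))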

We can also get a train/test split by specifying an exact number of subjects we would like from each class (e.g. when you would like to avoid class imbalance in the training set):


In [34]:
train_set, test_set = dataset.train_test_split_ids( count_per_class = 3)

Let's see what the training set contains - we expect 3*3 = 9 subjects:


In [35]:
train_set


Out[35]:
['C0166',
 'C0018',
 'C0168',
 'A0320',
 'A0141',
 'A0253',
 'M0024',
 'M0081',
 'M0269']

We can verify that is indeed the case by creating a new, smaller dataset from that list of IDs and getting a summary:


In [36]:
training_dataset = dataset.get_subset(train_set)
training_dataset


Out[36]:
 Subset derived from: ADNI1: cortical thickness features from Freesurfer v6.0, QCed.
9 samplets, 3 classes, 4 features
Class   Control : 3 samplets
Class Alzheimer : 3 samplets
Class       MCI : 3 samplets

Another programmatic way to look into different classes is this:


In [37]:
target_set, target_sizes = training_dataset.summarize()
target_set, target_sizes


Out[37]:
(['MCI', 'Control', 'Alzheimer'], array([3., 3., 3.]))

which returns all the classes and their sizes that you could iterate over.

Using these two lists, we can easily obtain subset datasets, as illustrated below.


In [38]:
dataset


Out[38]:
ADNI1: cortical thickness features from Freesurfer v6.0, QCed.
21 samplets, 3 classes, 4 features
Class   Control : 7 samplets
Class Alzheimer : 8 samplets
Class       MCI : 6 samplets

In [39]:
binary_dataset = dataset.get_class(['Control','Alzheimer'])
binary_dataset


Out[39]:
 Subset derived from: ADNI1: cortical thickness features from Freesurfer v6.0, QCed.
15 samplets, 2 classes, 4 features
Class   Control : 7 samplets
Class Alzheimer : 8 samplets

How about selecting a subset of features from all samples?


In [40]:
binary_dataset.get_feature_subset(range(2))


Out[40]:
Subset features derived from: 
 
 Subset derived from: ADNI1: cortical thickness features from Freesurfer v6.0, QCed.
15 samplets, 2 classes, 2 features
Class   Control : 7 samplets
Class Alzheimer : 8 samplets

Great, isn't it? You can also see the two-step history (an initial subset of classes, followed by a subset of features).

Serialization

Once you have this dataset, you can save and reload it trivially. Let's do some pickling:


In [41]:
from pathlib import Path
out_file = Path('.') / 'Freesurfer_thickness_v4p3.PyradigmDataset.pkl'
binary_dataset.save(out_file)

That's it - it is saved.

Let's reload it from disk and make sure we can indeed retrieve it:


In [42]:
reloaded = ClfDataset(out_file) # another form of the constructor!

In [43]:
reloaded


Out[43]:
 Subset derived from: ADNI1: cortical thickness features from Freesurfer v6.0, QCed.
15 samplets, 2 classes, 4 features
Class   Control : 7 samplets
Class Alzheimer : 8 samplets

We can check to see they are indeed one and the same:


In [44]:
binary_dataset == reloaded


Out[44]:
True

Dataset Arithmetic

You might wonder how you can combine two different types of features (e.g. thickness and shape) for the same set of samplets. Piece of cake, see below...

To concatenate two datasets, we first make a second dataset:


In [45]:
dataset_two = ClfDataset(in_dataset=dataset) # yet another constructor: in its copy form!

How can you check if they are "functionally identical"? As in same keys, same data and classes for each key... Easy:


In [46]:
dataset_two == dataset


Out[46]:
True

Now let's try the arithmetic:


In [47]:
combined = dataset + dataset_two


Identical keys found. Trying to horizontally concatenate features for each samplet.

Great. The add operation recognized the identical set of keys and performed a horizontal concatenation, as can be seen from the doubled number of features in the combined dataset:


In [48]:
combined


Out[48]:
21 samplets, 3 classes, 8 features
Class   Control : 7 samplets
Class Alzheimer : 8 samplets
Class       MCI : 6 samplets

We can also do removal in a similar fashion:


In [49]:
smaller = combined - dataset


C0052 removed.
C0101 removed.
C0166 removed.
C0168 removed.
C0018 removed.
C0267 removed.
C0000 removed.
A0179 removed.
A0253 removed.
A0152 removed.
A0164 removed.
A0141 removed.
A0320 removed.
A0197 removed.
A0189 removed.
M0269 removed.
M0078 removed.
M0298 removed.
M0024 removed.
M0022 removed.
M0081 removed.
/Users/Reddy/dev/pyradigm/pyradigm/base.py:1470: UserWarning: Requested removal of all the samplets - output dataset would be empty.
  warn('Requested removal of all the samplets - '

The data structure even produces a warning to let you know the resulting dataset would be empty! We can verify that:


In [50]:
bool(smaller)


Out[50]:
False

Portability

This is all well and good. How does it interact with other packages out there, you might ask? It is as simple as you can imagine:


In [51]:
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)

In [52]:
data_matrix, target, sample_ids = binary_dataset.data_and_targets()
clf.fit(data_matrix, target)


Out[52]:
SVC(C=100.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
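
Because data_and_targets() also returns the samplet IDs, predictions can be mapped straight back to individual subjects - exactly the kind of bookkeeping raised at the start (e.g. which samples are misclassified). A short sketch, reusing the variables from the previous cell:


In [ ]:
predictions = clf.predict(data_matrix)

# report misclassified subjects by their IDs rather than by row indices
misclassified = [sid for sid, pred, true in zip(sample_ids, predictions, target)
                 if pred != true]
print('misclassified samplets: {}'.format(misclassified))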

There you have it, a simple example to show you the utility and convenience of this ClassificationDataset.

Usage for the RegressionDataset will more or less be the same, except keeping in mind that the target value is continuous.
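
For example, a RegressionDataset could be populated in exactly the same way, with a continuous target such as age. The sketch below assumes RegrDataset shares the add_samplet() interface shown above; the subject IDs, ages and features are purely illustrative.


In [ ]:
regr_dataset = RegrDataset()
regr_dataset.description = 'toy regression dataset with age as the target'

# hypothetical subject IDs and ages, with random stand-in features
for subj_id, age in [('S0001', 64.5), ('S0002', 71.2), ('S0003', 58.9)]:
    regr_dataset.add_samplet(subj_id, np.random.random(4), age)

regr_dataset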

Thanks for checking it out.

I would appreciate it if you could give me feedback on improving or sharpening it further.