Multi-indexing

When dealing with Series data, it is often useful to index each element of the series with multiple labels and then select and aggregate data based on these indices. For example, in a collection of time series data, each point in time might be identified by trial number, block number, one or more experimental conditions, etc. This tutorial shows how to create and use such "multi-indices" when working with Series objects.

Creating a toy data set

Let's start by building a simple Series with only a single record.


In [1]:
from thunder import Series
from numpy import arange, array

In [2]:
# tsc is the ThunderContext that the Thunder shell or notebook creates at startup
data = tsc.loadSeriesFromArray(arange(12))
data.first()


Out[2]:
((0,), array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]))

By default, the index on the series labels the elements with ascending integers.


In [3]:
data.index


Out[3]:
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

For the sake of example, let's assume that these data represent two independent trials. Thus we might have one index describing the trial structure.


In [4]:
trial = array([1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2])

Furthermore, let's assume that each trial is broken into three blocks. This can be described with a second index.


In [5]:
block = array([1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3])

Finally, in this simple example, we have two time points within each block.


In [6]:
point = array([1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2])

A multi-index for this data can then be created as a list of lists (or, as here, a two-dimensional array), where each sub-list contains one value from each of the individual indices.


In [7]:
index = array([trial, block, point]).T
data.index = index

To inspect the index, we look at the transpose so it lines up with the Series.


In [8]:
data.index.T


Out[8]:
array([[1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
       [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3],
       [1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2]])

As a useful piece of terminology, we would say that the resulting multi-index has three levels: level 0 (trial), level 1 (block), and level 2 (time point).
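
For larger designs, typing each level by hand becomes error-prone. The following is a minimal sketch in plain Python (no Thunder required) of one way to generate the same three levels, assuming a fully crossed design in which every trial contains every block and every block contains every time point. The names trials, blocks, and points are illustrative only.

```python
from itertools import product

# Assumed design, matching the toy data set: 2 trials, 3 blocks, 2 time points.
trials = [1, 2]
blocks = [1, 2, 3]
points = [1, 2]

# product() varies the last level fastest, so each tuple is
# (trial, block, point) in the same order as the hand-written arrays.
index = list(product(trials, blocks, points))

# Transposing gives one sequence per level: level 0, level 1, level 2.
trial, block, point = (list(level) for level in zip(*index))

print(trial)
print(block)
print(point)
```

The list of tuples plays the same role as the array assigned to data.index above: one entry per element of the series, with one label per level.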

Selecting

There are two major pieces of multi-index functionality. The first is selection. To select a subset of the Series based on the multi-index, we choose a value and a level; only the elements where that level of the index matches the value are retained. For instance, we could select only the data points from the first trial (level = 0; value = 1).


In [9]:
selected = data.selectByIndex(1, level=0)

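# Helper that prints the (transposed) index followed by the first record's values.
# Single-argument print(...) calls behave the same under Python 2 and Python 3.
def displaySeries(series):
    print("index")
    print("-----")
    print(series.index.T)
    print("series")
    print("------")
    print(series.values().first())
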
displaySeries(selected)


index
-----
[[1 1 1 1 1 1]
 [1 1 2 2 3 3]
 [1 2 1 2 1 2]]
series
------
[0 1 2 3 4 5]

As we see above, once a single value has been selected from a certain level, the index values at that level become redundant and we may want to discard them. This can be accomplished with the "squeeze" option.


In [10]:
selected = data.selectByIndex(1, level=0, squeeze=True)
displaySeries(selected)


index
-----
[[1 1 2 2 3 3]
 [1 2 1 2 1 2]]
series
------
[0 1 2 3 4 5]
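
To make the semantics concrete, here is a minimal plain-Python sketch of what selection with squeezing amounts to. It is not Thunder's implementation; select_by_level is a hypothetical helper, and the index is represented as a list of (trial, block, point) tuples rather than an array.

```python
def select_by_level(index, values, wanted, level, squeeze=False):
    """Hypothetical helper: keep elements whose label at `level` equals `wanted`."""
    keep = [i for i, labels in enumerate(index) if labels[level] == wanted]
    new_index = [index[i] for i in keep]
    if squeeze:
        # The selected level now holds one repeated value, so drop it.
        new_index = [labels[:level] + labels[level + 1:] for labels in new_index]
    return new_index, [values[i] for i in keep]

# Same toy data as above: 2 trials x 3 blocks x 2 time points, values 0..11.
index = [(t, b, p) for t in (1, 2) for b in (1, 2, 3) for p in (1, 2)]
values = list(range(12))

sel_index, sel_values = select_by_level(index, values, 1, level=0, squeeze=True)
print(sel_index)
print(sel_values)
```

With squeeze=False the helper would return the full three-level labels, as in the first selection example.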

We can also select multiple values at a given level by passing a list of values. Here we select data from blocks 2 and 3 (level = 1; value = 2 or 3).


In [11]:
selected = data.selectByIndex([2, 3], level=1)
displaySeries(selected)


index
-----
[[1 1 1 1 2 2 2 2]
 [2 2 3 3 2 2 3 3]
 [1 2 1 2 1 2 1 2]]
series
------
[ 2  3  4  5  8  9 10 11]

In the most general case, we can select multiple values at multiple levels. Let's combine the previous two examples and get the 2nd and 3rd blocks (level = 1; value = 2 or 3), but only for the 1st trial (level = 0; value = 1).


In [12]:
selected = data.selectByIndex([1, [2, 3]], level=[0, 1])
displaySeries(selected)


index
-----
[[1 1 1 1]
 [2 2 3 3]
 [1 2 1 2]]
series
------
[2 3 4 5]
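
The rule for combining levels can be sketched in plain Python as follows: an element is kept only if, at every listed level, its label is one of the accepted values. This is an illustration of the logic under that assumption, not Thunder's code; the criteria dictionary is a hypothetical representation of the value and level arguments.

```python
# Same toy data as above: 2 trials x 3 blocks x 2 time points, values 0..11.
index = [(t, b, p) for t in (1, 2) for b in (1, 2, 3) for p in (1, 2)]
values = list(range(12))

# Hypothetical representation: level -> set of accepted values.
criteria = {0: {1}, 1: {2, 3}}

# Keep an element only if every listed level matches one of its accepted values.
keep = [i for i, labels in enumerate(index)
        if all(labels[level] in accepted for level, accepted in criteria.items())]

multi_index = [index[i] for i in keep]
multi_values = [values[i] for i in keep]
print(multi_index)
print(multi_values)
```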

Finally, we can reverse the process of "selection" (keeping only the elements that match the values) to that of "filtering" (keeping all elements except those that match the values). This is accomplished with the "filter" keyword. To demonstrate, let's get all of the blocks except for the 2nd (level = 1; value = 2).


In [13]:
selected = data.selectByIndex(2, level=1, filter=True)
displaySeries(selected)


index
-----
[[1 1 1 1 2 2 2 2]
 [1 1 3 3 1 1 3 3]
 [1 2 1 2 1 2 1 2]]
series
------
[ 0  1  4  5  6  7 10 11]
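
In terms of the underlying logic, filtering is simply selection with the membership test negated. A minimal plain-Python sketch, with excluded and level as illustrative names and no claim about how Thunder implements it:

```python
# Same toy data as above: 2 trials x 3 blocks x 2 time points, values 0..11.
index = [(t, b, p) for t in (1, 2) for b in (1, 2, 3) for p in (1, 2)]
values = list(range(12))

level = 1        # block
excluded = {2}   # drop the 2nd block

# Negated membership test: keep everything that does NOT match.
keep = [i for i, labels in enumerate(index) if labels[level] not in excluded]

filtered_index = [index[i] for i in keep]
filtered_values = [values[i] for i in keep]
print(filtered_values)
```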

Aggregation

The second major multi-index operation is aggregation. Aggregation can be thought of as a two-step process. First, a level is selected and the series is partitioned into pieces that share the same index value at that level. Second, an aggregating function is applied to each of these partitions, and a new series is reconstituted with one element for the aggregate value computed on each piece. The aggregating function should take an array as input and return a single numeric value as output.

As a simple initial demonstration, let's find the average value of our series for each trial (level = 0).


In [14]:
from numpy import mean
aggregated = data.seriesAggregateByIndex(mean, level=0)
displaySeries(aggregated)


index
-----
[1 2]
series
------
[ 2.5  8.5]

The same operation can be called through the convenience function seriesMeanByIndex.


In [15]:
aggregated = data.seriesMeanByIndex(level=0)
displaySeries(aggregated)


index
-----
[1 2]
series
------
[ 2.5  8.5]
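
The two steps of aggregation (partition by index value, then reduce each partition) can be sketched in plain Python. This illustrates the idea only and is not Thunder's implementation; aggregate_by_level and mean are hypothetical helpers defined here.

```python
def mean(xs):
    # float() keeps the division exact under both Python 2 and Python 3.
    return sum(xs) / float(len(xs))

def aggregate_by_level(index, values, func, level):
    """Hypothetical helper: partition values by the label at `level`, then reduce."""
    groups = {}
    for labels, value in zip(index, values):
        groups.setdefault(labels[level], []).append(value)
    keys = sorted(groups)  # one output element per distinct label
    return keys, [func(groups[k]) for k in keys]

# Same toy data as above: 2 trials x 3 blocks x 2 time points, values 0..11.
index = [(t, b, p) for t in (1, 2) for b in (1, 2, 3) for p in (1, 2)]
values = list(range(12))

agg_index, agg_values = aggregate_by_level(index, values, mean, level=0)
print(agg_index)
print(agg_values)
```

Any function that maps a sequence to a single number, such as min, max, or sum, could be passed in place of mean.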

As a more complex example, we might want to aggregate with respect to the values on multiple levels. For example, we might want to examine how the maximum value at each time point (level = 2) differs across trials (level = 0).


In [16]:
aggregated = data.seriesMaxByIndex(level=[0, 2])
displaySeries(aggregated)


index
-----
[[1 1 2 2]
 [1 2 1 2]]
series
------
[ 4  5 10 11]
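
When several levels are given, each distinct combination of labels at those levels defines one partition. A minimal plain-Python sketch of that grouping, again an illustration under stated assumptions rather than Thunder's implementation, with levels as an illustrative name:

```python
# Same toy data as above: 2 trials x 3 blocks x 2 time points, values 0..11.
index = [(t, b, p) for t in (1, 2) for b in (1, 2, 3) for p in (1, 2)]
values = list(range(12))

levels = (0, 2)  # trial and time point; blocks are pooled together

groups = {}
for labels, value in zip(index, values):
    # The composite key is the tuple of labels at the chosen levels.
    key = tuple(labels[level] for level in levels)
    groups.setdefault(key, []).append(value)

max_index = sorted(groups)
max_values = [max(groups[key]) for key in max_index]
print(max_index)
print(max_values)
```

Each key here corresponds to one column of the two-row index shown in the output above.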