[1] Load data from CRCNS:hc-3 into pandas DataFrame

Etienne Ackermann, 07/02/2015

This series of notebooks will go through the steps of analyzing some data from the CRCNS hc-3 data set, as well as generating some of the figures to be used on my SfN poster.

The data can be downloaded from the CRCNS (Collaborative Research in Computational Neuroscience) website, specifically the hc-3 data set.

Here I will use a modified version of the Python module YAHMM (Yet Another Hidden Markov Model) by Jacob Schreiber.

The (rough) workflow of my YAHMM experiment is as follows:

  1. load and manipulate data from the hc-3 data set (from the crcns.org website) into a pandas DataFrame containing spike counts
    1. extract single-unit spike trains from the data and store them in a list of spike trains
    2. bin and count the spikes and save the results in a DataFrame: df_spk_counts
  2. separate the observations (df_spk_counts) into training and validation sets (see the sketch after this list)
    1. extract sequences of random lengths (bounded by min and max lengths) and summarize the sequence info
    2. use df_spk_counts and the sequence summary info to build sequences of observations suitable for YAHMM; store these in training_sequences and validation_sequences, respectively
  3. define and bake various hidden Markov models
  4. use training sequences to learn the parameters of the various models defined previously
  5. ...
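
As a rough illustration of step 2, here is a minimal sketch of how the binned counts could be cut into random-length observation sequences. The helper make_sequences and the 80/20 split are hypothetical; this is not the code used later in this notebook.

import numpy as np

def make_sequences(df_spk_counts, min_len=5, max_len=50, seed=0):
    # cut the binned spike-count DataFrame into consecutive sequences of random length
    rng = np.random.RandomState(seed)
    sequences, start = [], 0
    while start < len(df_spk_counts):
        length = rng.randint(min_len, max_len + 1)
        seq = df_spk_counts.iloc[start:start + length].values  # (length x num_units) spike counts
        if len(seq) >= min_len:
            sequences.append(seq)
        start += length
    return sequences

# e.g. an 80/20 split into training and validation sets:
# seqs = make_sequences(df_spk_counts1)
# split = int(0.8 * len(seqs))
# training_sequences, validation_sequences = seqs[:split], seqs[split:]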

Summary

From the data set description page:

The data set contains recordings made from multiple hippocampal areas in Long-Evans rats, including: Cornu Ammonis: CA1 and CA3; dentate gyrus (DG), and entorhinal cortex: EC2, EC3, EC4, EC5. The data was obtained during 442 recording sessions and includes responses of 7,737 neurons in eleven animals while they performed one of fourteen behaviors. Total time for all experiments is 204.5 hours. The raw (broadband) data was recorded at 20KHz, simultaneously from 31 to 127 channels. The raw data was processed to extract the LFPs (Local Field Potentials) and detect spikes. Included in the data set are the following items:

  1. Times and waveforms of detected potential spikes.
  2. Results of spike sorting.
  3. LFPs.
  4. For many sessions, the coordinates and direction of the rat's head, and the video files from which the positions were extracted.
  5. Metadata tables giving properties of the neurons and characteristics of the recording sessions.

Subset of interest

I will start by considering two animals, gor01 and vvp01. They have the following numbers of recorded sessions:

Task          gor01   vvp01   Description
linear one    3       5       linear track (250 cm?)
linear two    3       5       linear track (250 cm?)
open          3               open field
sleep         1               sleeping
T maze        2       1
wheel         1               operant wheel running task; see Mizuseki et al., 2009

In this way, I might be able to distinguish between different environments using my HMM implementation. There are (perhaps) easy cases, such as linear vs wheel, and perhaps less easy cases, such as linear one vs linear two.

Ideally, I would like to see

  1. small intratask variability, and
  2. large intertask variability

although this remains to be seen...

Also note that spike sorting was done on a per-day basis, so sorted units are only meaningful within the small number of recording sessions from the same day.

Data description (technical details)

Note that both gor01 and vvp01 were recorded with a Neuralynx recording system at 32,552 Hz, then amplified 1000x, followed by 1–5 kHz bandpass filtering.
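
Spike times in the .res files (and the DataFrame indices below) are sample numbers at this rate, so converting a timestamp to seconds is a single division. A quick sketch; the example timestamp is arbitrary:

fs = 32552                    # sampling rate in Hz
t_samples = 8138              # an arbitrary example timestamp, in samples
t_seconds = t_samples / fs    # = 0.25 s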

Experiment

  • .dat file. Raw data, or "wideband data". Contains everything: LFP (Local Field Potentials) and spikes. Each time step holds up to 32 (four-shank probe) or 64 (eight-shank probe) short integers (2 bytes, signed), one integer per recording site, i.e. per channel. The actual number of channels is specified in the .xml file and is not always a multiple of 8, because bad channels (due to very high impedance or a broken shank) were removed from the data. (See the sketch after this list.)
  • .xml file. Contains information about the recording. It is also used for viewing the data in neuroscope, can be loaded into Matlab with the LoadXml script, and can be browsed with ndmanager (a GUI for the xml file).
  • .mpg file. Video recording of the animal's position during the experiment. The video shows LEDs attached to the animal and is used to generate the .whl file, which contains the coordinates of the LED positions.
  • .led file. Contains recordings (at the 32,552 Hz sampling rate) of the synchronization pulse that drives a stationary flashing LED visible in the .mpg file. The format is one short integer (2 bytes, signed) per sample.
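
As a rough sketch of how these interleaved int16 formats could be read with NumPy; the file names and n_channels below are placeholders, and the real channel count comes from the session's .xml file:

import numpy as np

n_channels = 64                                    # placeholder; read the true value from the .xml file
dat = np.fromfile('session.dat', dtype=np.int16)   # hypothetical file name
dat = dat.reshape(-1, n_channels)                  # one row per time step, one column per channel

led = np.fromfile('session.led', dtype=np.int16)   # sync pulse: one int16 per sample at 32,552 Hz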

Extracting [useful] data from hc-3

I will assume that the data from hc-3 have already been downloaded.


In [1]:
import sys
sys.path.insert(0, '../')

import numpy as np
import pandas as pd
import pickle 
import seaborn as sns
#import yahmm as ym

from matplotlib import pyplot as plt
from pandas import Series, DataFrame

from efunctions import * # load my helper functions

%matplotlib inline


function saveFigure(filename) loaded

Tip: to save a figure, call saveFigure("path/figure.pdf")

Extract spike trains for linear one and linear two, day 1


In [2]:
import os

def get_sessions_for_animal(a_dir, animal):
    # all session directories in a_dir whose names start with the animal's name
    return [name for name in os.listdir(a_dir)
            if os.path.isdir(os.path.join(a_dir, name)) and name.startswith(animal)]

def get_trials_for_session(SessionInfo, session=0):
    # store the lin1 ('LinOne') and lin2 ('LinTwo') trial path prefixes for the given session
    mypath = SessionInfo['datadir'] + '/' + SessionInfo['sessions'][session]
    SessionInfo['LinOne'] = [mypath + '/' + name + '/' + name[:-5]
                             for name in os.listdir(mypath)
                             if os.path.isdir(os.path.join(mypath, name)) and name[-4:] == 'lin1']
    SessionInfo['LinTwo'] = [mypath + '/' + name + '/' + name[:-5]
                             for name in os.listdir(mypath)
                             if os.path.isdir(os.path.join(mypath, name)) and name[-4:] == 'lin2']


SessionInfo=dict()
SessionInfo['datadir'] = '/home/etienne/Dropbox/neoReader/Data'
#SessionInfo['datadir'] = 'C:/Etienne/Dropbox/neoReader/Data'
SessionInfo['animal'] = 'gor01'
SessionInfo['sessions'] = get_sessions_for_animal(SessionInfo['datadir'],SessionInfo['animal'])
get_trials_for_session(SessionInfo, session=0)

SessionInfo


Out[2]:
{'LinOne': ['/home/etienne/Dropbox/neoReader/Data/gor01-6-12/2006-6-12_15-55-31_lin1/2006-6-12_15-55-31'],
 'LinTwo': ['/home/etienne/Dropbox/neoReader/Data/gor01-6-12/2006-6-12_16-53-46_lin2/2006-6-12_16-53-46'],
 'animal': 'gor01',
 'datadir': '/home/etienne/Dropbox/neoReader/Data',
 'sessions': ['gor01-6-12', 'gor01-6-13', 'gor01-6-7']}

In [5]:
# specify session parameters

num_electrodes = 12

for ele in np.arange(num_electrodes):
    # .clu.N: one cluster id per spike on electrode N (the first line, the number of clusters, is skipped)
    # .res.N: one spike time (in samples) per spike on electrode N
    dt1a = pd.read_table( SessionInfo['LinOne'][0] + '.clu.' + str(ele + 1), skiprows=1, names=['u'] )
    dt1b = pd.read_table( SessionInfo['LinOne'][0] + '.res.' + str(ele + 1), header=None, names=['t'] )
    dt2a = pd.read_table( SessionInfo['LinTwo'][0] + '.clu.' + str(ele + 1), skiprows=1, names=['u'] )
    dt2b = pd.read_table( SessionInfo['LinTwo'][0] + '.res.' + str(ele + 1), header=None, names=['t'] )
    ls1a = list(dt1a['u'])
    ls1b = list(dt1b['t'])
    ls2a = list(dt2a['u'])
    ls2b = list(dt2b['t'])
    # one column per electrode ('cluN'), indexed by spike time, holding the cluster id of each spike
    d1 = {'clu' + str( ele + 1 ): Series(ls1a, index=ls1b)}
    d2 = {'clu' + str( ele + 1 ): Series(ls2a, index=ls2b)}
    if ele == 0:
        df1 = DataFrame(d1)
        df2 = DataFrame(d2)
    else:
        # stack the per-electrode frames row-wise (columns are merged, missing entries become NaN)
        df1 = df1.append(DataFrame(d1))
        df2 = df2.append(DataFrame(d2))

print( 'The number of spikes recorded on each electrode is as follows:' )
print( 'For linear one:' )
print( df1.count() )
print( 'For linear two:' )
print( df2.count() )
print( 'The total number of spikes detected (lin1, lin2):' + str(sum(df1.count())) + ', ' + str(sum(df2.count())))


The number of spikes recorded on each electrode is as follows:
For linear one:
clu1      97182
clu10     86211
clu11    464786
clu12    204332
clu2     104050
clu3      84827
clu4      99562
clu5      32485
clu6      49000
clu7     101825
clu8      97978
clu9     125859
dtype: int64
For linear two:
clu1      60837
clu10     61657
clu11    363033
clu12    178708
clu2      67546
clu3      46988
clu4      75359
clu5      41363
clu6      35129
clu7      71768
clu8      66574
clu9     121952
dtype: int64
The total number of spikes detected (lin1, lin2):1548097, 1190914

However, we now need to transform the data so that we can view the spikes by sorted unit, and not by the electrode on which they were recorded.

First, we know that units 0 and 1 should be omitted from any analyses, since these correspond to mechanical noise, and small, unsortable spikes, respectively.

So let's remove all rows of our DataFrames with spikes corresponding only to clusters 0 or 1:


In [6]:
cidx = 0
idx1 = list(df1[df1==cidx].dropna(how='all').index)
idx2 = list(df2[df2==cidx].dropna(how='all').index)
df1 = df1.drop(idx1)
df2 = df2.drop(idx2)

cidx = 1
idx1 = list(df1[df1==cidx].dropna(how='all').index)
idx2 = list(df2[df2==cidx].dropna(how='all').index)
df1 = df1.drop(idx1)
df2 = df2.drop(idx2)

print( 'The number of spikes recorded on each electrode (after removing spikes corresponding to clusters 0 and 1) is as follows:' )
print( 'For linear one:' )
print( df1.count() )
print( 'For linear two:' )
print( df2.count() )
print( 'The total number of spikes detected (lin1, lin2):' + str(sum(df1.count())) + ', ' + str(sum(df2.count())))


The number of spikes recorded on each electrode (after removing spikes corresponding to clusters 0 and 1) is as follows:
For linear one:
clu1         0
clu10    29884
clu11        0
clu12    25596
clu2     25508
clu3         4
clu4        82
clu5         0
clu6         0
clu7     51037
clu8     40370
clu9        46
dtype: int64
For linear two:
clu1         0
clu10    20920
clu11        0
clu12    17930
clu2     19262
clu3         0
clu4        32
clu5         0
clu6         0
clu7     34926
clu8     30873
clu9       416
dtype: int64
The total number of spikes detected (lin1, lin2):172527, 124359

This clean-up operation has significantly reduced the size of our DataFrame, so extracting spike trains should now be a little bit faster, or at least require less memory.

Note that this clean-up is not perfect. The count operation counts only the non-NaN values in each column, so a column can show a count of zero while still containing many NaN entries (rows that were kept because another electrode had a spike at that timestamp). This is the case for df2.clu5, for example, which after clean-up contains only NaNs and no usable spikes.
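
A toy illustration of this count() behaviour (unrelated to the hc-3 data):

import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, np.nan, np.nan]})
print(toy.count())   # a: 2, b: 0 -- NaNs are never counted, even in an all-NaN column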

We can now extract spike trains in much the same way as the clean-up operation: we find all the rows in which any column equals the unit number that we want, and note their timestamps (the row index). This is our spike train for that unit, although the timestamps have not been sorted yet.

Let's do this now. But first, as an example, consider the following:

Remove clu5, or any other electrode without ANY spikes...


In [7]:
# drop columns (electrodes) that have no usable spikes left in either session
zero_spk_cols = [ x|y for (x,y) in zip((df1.count()==0).tolist(), (df2.count()==0).tolist())]
zspi = [i for i, elem in enumerate(zero_spk_cols) if elem]   # indices of the all-empty columns
df1.drop(df1.columns[zspi], axis=1, inplace=True) # WARNING! df1 and df2 must have the same column order!
df2.drop(df2.columns[zspi], axis=1, inplace=True) # WARNING! df1 and df2 must have the same column order!

In [8]:
df1.sort_index().head()


Out[8]:
clu10 clu12 clu2 clu4 clu7 clu8 clu9
278 NaN NaN NaN NaN NaN 17 NaN
283 NaN 10 NaN NaN NaN NaN NaN
301 NaN NaN NaN NaN 2 NaN NaN
537 NaN 3 NaN NaN NaN NaN NaN
575 NaN NaN NaN NaN 2 NaN NaN

here we have called sort_index() (which, by the way, does not change the order of df1 itself); from the first few rows we can see, for example, that the earliest spike (at sample 278) was assigned to unit 17 on shank 8, and the next one (at sample 283) to unit 10 on shank 12.

Also, since we would like to ultimately sort the spike trains by timestamp for all of the units, we may as well sort the entire DataFrame once, and then extract the (already-sorted) spike trains from there.


In [9]:
df1 = df1.sort_index()
df2 = df2.sort_index()

NEW!!! Now I rename all the units sequentially across the shanks, starting with clu1, where unit 2 is renamed to unit 1, and so on...


In [10]:
def rename_units(dfextern):
    # renumber the sorted units sequentially across shanks (clusters 0 and 1 were removed above,
    # so the remaining units on each shank start at 2)
    df = dfextern.copy()
    prevmax = 0
    num_shanks = len(df.columns)
    for ss in np.arange(0, num_shanks):
        col = df.columns[ss]
        if df[col].min() == 2:
            # shift this shank's unit numbers so they continue where the previous shank left off
            mask = df[col].notnull()
            df.loc[mask, col] = df.loc[mask, col] + prevmax - 1
        prevmax = df[col].max()
    return df

In [11]:
df1rn = rename_units(df1)
df2rn = rename_units(df2)

In [12]:
df1rn.max()


Out[12]:
clu10     7
clu12    16
clu2     21
clu4     23
clu7     26
clu8     45
clu9     51
dtype: float64

and now we can request the spike trains:


In [13]:
num_units1 = int(max( df1rn.max() ))
num_units2 = int(max( df2rn.max() ))

spk_counts1 = {}
spk_counts2 = {}
st_list1 = []
st_list2 = []
for uu in np.arange(1,num_units1+1):
    st1 = list((((df1rn[df1rn==uu])).dropna(how='all')).index)
    spk_counts1['u' + str(uu)] = len(st1)
    st2 = list((((df2rn[df2rn==uu])).dropna(how='all')).index)
    spk_counts2['u' + str(uu)] = len(st2)
    st_list1.append(st1)
    st_list2.append(st2)
# convert list of lists to list of ndarrays:
st_array1rn = [np.array(a) for a in st_list1]
st_array2rn = [np.array(a) for a in st_list2]

In [14]:
import pickle
with open('../../Data/st_array1rn126.pickle', 'wb') as f:
    pickle.dump(st_array1rn, f)
with open('../../Data/st_array2rn126.pickle', 'wb') as f:
    pickle.dump(st_array2rn, f)

In [15]:
def list_of_spk_time_arrays_to_spk_counts_arrays(st_array_extern, ds=0, fs=0):
    """
    st_array_extern: list of ndarrays containing spike times (in sample numbers!)
    ds:              bin size, as a number of samples; if fs is also given, ds is a time in seconds
    fs:              sampling frequency (samples per second)

    argument logic: if only st_array_extern is passed, use the default bin size (1000 samples);
    if only ds is passed, use it directly as a number of samples and ignore fs; if both ds and fs
    are passed, interpret ds as a time in seconds and convert it to samples

    returns a (num_bins x num_units) DataFrame of spike counts
    """

    st_array = st_array_extern

    if fs == 0:
        if ds == 0:
            ds = 1000 # assume default interval size (in samples)
    else: # sampling frequency was passed, so interpret ds as a time interval and convert to samples:
        if ds == 0:
            ds = 1000 # assume default interval size (in samples)
        else:
            ds = round(ds*fs)

    # determine number of units:
    num_units = len(st_array)

    # find the latest spike time over all units:
    maxtime = 0
    for uu in np.arange(num_units):
        try:
            maxtime = max(st_array[uu].max(), maxtime)
        except ValueError: # unit has no spikes; leave maxtime unchanged
            pass

    # create list of interval (bin) edges:
    intlist = np.arange(0, maxtime, ds)
    num_bins = len(intlist)

    spks_bin = np.zeros((num_bins, num_units))

    print("iterating over {0} intervals...".format(num_bins))

    for uu in np.arange(num_units):
        # count the number of spikes of unit uu falling in each interval:
        spks_bin[:,uu] = np.histogram(st_array[uu], bins=num_bins, density=False, range=(0,maxtime))[0]

    return pd.DataFrame(spks_bin)

df_spk_counts1 = list_of_spk_time_arrays_to_spk_counts_arrays(st_array1rn, ds=0.25, fs=32552)
df_spk_counts2 = list_of_spk_time_arrays_to_spk_counts_arrays(st_array2rn, ds=0.25, fs=32552)
#df_spk_counts2 = list_of_spk_time_arrays_to_spk_counts_df(st_array2, ds=500000)


iterating over 4536 intervals...
iterating over 3288 intervals...
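
For completeness, the other argument modes described in the function's docstring would be called like this (a sketch, not executed in this notebook):

# default: 1000-sample bins
df_default = list_of_spk_time_arrays_to_spk_counts_arrays(st_array1rn)
# ds given without fs: ds is interpreted directly as a number of samples
# (8138 samples is the same 0.25 s bin at 32,552 Hz)
df_samples = list_of_spk_time_arrays_to_spk_counts_arrays(st_array1rn, ds=8138)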

In [16]:
df_spk_counts1.head()


Out[16]:
0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49 50
0 0 0 0 4 0 0 0 0 1 0 ... 10 0 0 0 0 0 0 0 0 0
1 0 0 0 4 0 0 0 0 0 0 ... 7 0 0 4 0 0 0 0 0 0
2 0 0 0 9 0 0 0 0 2 0 ... 9 0 0 0 0 0 0 0 0 0
3 0 0 0 6 0 0 0 0 1 0 ... 9 0 0 1 0 0 0 0 0 0
4 0 0 0 5 0 0 0 0 2 0 ... 8 0 0 0 0 0 0 0 0 0

5 rows × 51 columns


In [17]:
df_spk_counts2.head()


Out[17]:
0 1 2 3 4 5 6 7 8 9 ... 41 42 43 44 45 46 47 48 49 50
0 0 0 0 1 0 0 0 0 7 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 4 0 0 0 0 0 8 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 6 5 0 0 0 0 11 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 4 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 51 columns

After rewriting the binning function, binning is now fast enough that I don't think I need to store these DataFrames anymore...


In [108]:
#import pickle

# WARNING!!! Pickling does NOT preserve meta-data as per below...
#df_spk_counts1._description = 'gor01-6-7/2006-6-7_11-26-53_lin1'
#df_spk_counts1._timebin = '250 ms'
#df_spk_counts2._description = 'gor01-6-7/2006-6-7_16-40-19_lin2'
#df_spk_counts2._timebin = '250 ms'

#with open('../../Data/df_spk_counts1_250ms_rn.pickle', 'wb') as f:
#    pickle.dump(df_spk_counts1, f)
#
#with open('../../Data/df_spk_counts2_250ms_rn.pickle', 'wb') as f:
#    pickle.dump(df_spk_counts2, f)
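
Since pickling drops ad-hoc attributes such as _description, one simple workaround (a sketch, not what was done here; the file name and description strings are placeholders) is to pickle a small dict that bundles the DataFrame with its metadata:

bundle = {'df': df_spk_counts1,
          'description': 'lin1 session, 250 ms bins',   # placeholder description
          'timebin': '250 ms'}
with open('../../Data/df_spk_counts1_250ms_rn_meta.pickle', 'wb') as f:
    pickle.dump(bundle, f)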
