Data Analysis and Plotting with Python

There are many Python libraries that can aid data analysis and plotting. However, all of the mature, popular ones are built on top of NumPy, a third-party numerics package. You can get Python with NumPy, along with a host of other relevant scientific packages, in one install process through Anaconda.

The following notebook is an example of loading and plotting some brain recording data from the Portfors Hearing Research Lab at Washington State University Vancouver.


In [2]:
import csv

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

Investigating the spike timing of neurons

We are going to look at the spike timing of neurons. A spike is an electrical impulse that cells of the nervous system (e.g. your brain) use to send information. Let's import some brain recording data to see what this kind of data looks like. We have a data set of recordings from a single cell, where a stimulus was repeated 100 times.


In [3]:
# load in the data: one row per stimulus repetition
data = np.loadtxt('spike_sample.csv', delimiter=',')
# display dimensions of imported data
print('data dimensions (rows, columns):', data.shape)

# convert sample index to seconds (recordings sampled at 50 kHz)
x = np.arange(data.shape[-1]) / 50000.
plt.plot(x, data[0,:])
plt.xlabel('time (s)')
plt.ylabel('voltage (V)')
# add a pointer for the spike
ax = plt.gca()
ax.annotate('this is a spike', xy=(0.060, 0.035), xytext=(0.1, 0.035), arrowprops={'facecolor': 'black'});


data dimensions (rows, columns): (100, 10000)
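Only the first repetition is plotted above. To eyeball several repetitions at once, the same pattern extends naturally; here is a quick sketch (the 0.1 V offset is an arbitrary choice for visibility):

for i in range(5):
    plt.plot(x, data[i, :] + i * 0.1)  # offset each trace vertically so they don't overlap
plt.xlabel('time (s)')
plt.ylabel('voltage (V), offset per trace')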

I have saved the spike times from these 100 recordings to a file. The file contains one line per stimulus repetition, listing however many spike times were detected for that repetition (possibly none).

If you are on a unix system, you can use the head command (head spike_times.csv) to take a look at the file. You would get output like this (the first 10 lines of the file):

0.05946
0.05842,0.1589
0.05632
0.00316,0.04972
0.0593
0.06124,0.07648
0.05784

0.04674,0.0602,0.07572,0.12892,0.1964
0.05548

Let's try to use numpy to import this data into our IPython notebook.

Data Munging

If we try to import the spike times using numpy, we get an error because the data is not "well-formed": the rows have different numbers of values. Therefore, we must adjust our import procedure, or prune/reformat the data to get it into a representation our tools can read.

This sort of task typically accounts for the bulk of data analysis work, often estimated at 75-90% of the effort. It is helpful to have enough domain knowledge to recognize when something is very off and a problem needs to be addressed before moving on.


In [4]:
# numpy's loadtxt is too naive for non-rectangular data shapes...
#spike_times = np.loadtxt('spike_times.csv', delimiter=',')

# use csv module in Python standard lib to read in all times
spike_times = []
with open('spike_times.csv', 'r') as df:
    reader = csv.reader(df)
    for row in reader:
        floatrow = [float(item) for item in row]
        spike_times.append(floatrow)

print(len(spike_times))


100

In [5]:
# In this case I don't care which spike times belong to which recording,
# so let's flatten everything into a single list of times.

# Two equivalent alternatives:
# all_spike_times = reduce(lambda a, b: a + b, spike_times)  # needs functools.reduce in Python 3
# all_spike_times = [spike for trace in spike_times for spike in trace]
all_spike_times = sum(spike_times, [])
print(all_spike_times)


[0.05946, 0.05842, 0.1589, 0.05632, 0.00316, 0.04972, 0.0593, 0.06124, 0.07648, 0.05784, 0.04674, 0.0602, 0.07572, 0.12892, 0.1964, 0.05548, 0.0566, 0.16786, 0.05378, 0.06006, 0.09754, 0.11978, 0.193, 0.02288, 0.05838, 0.0565, 0.08508, 0.01994, 0.05622, 0.19814, 0.0605, 0.05784, 0.05658, 0.00982, 0.05682, 0.12018, 0.0545, 0.07718, 0.05738, 0.17944, 0.05968, 0.16882, 0.00688, 0.05954, 0.0576, 0.0598, 0.0597, 0.06062, 0.06184, 0.06134, 0.056, 0.00458, 0.03228, 0.05654, 0.0606, 0.07982, 0.00436, 0.05558, 0.05714, 0.09914, 0.05538, 0.07768, 0.05546, 0.15714, 0.0458, 0.0593, 0.08508, 0.05652, 0.05598, 0.14858, 0.09744, 0.01534, 0.05648, 0.03022, 0.05826, 0.15538, 0.16904, 0.05394, 0.06234, 0.05952, 0.04746, 0.11496, 0.05704, 0.0579, 0.137, 0.0562, 0.09416, 0.04336, 0.1511, 0.05446, 0.03934, 0.16496, 0.05752, 0.133, 0.17518, 0.05734, 0.05708, 0.15976, 0.06036, 0.16582, 0.05676, 0.17776, 0.05696, 0.05542, 0.0058, 0.04544, 0.05736, 0.095, 0.15648, 0.05846, 0.05932, 0.18894, 0.06012, 0.0543, 0.06264, 0.10044, 0.17836, 0.0617, 0.11788, 0.05988, 0.07778, 0.05742, 0.09216, 0.05706, 0.17544, 0.07632, 0.18368, 0.05878, 0.02108, 0.0419, 0.05696, 0.05602, 0.05656, 0.039, 0.056, 0.11696, 0.05846, 0.0612, 0.03856, 0.05634, 0.06998, 0.0741, 0.0009, 0.02516, 0.05728, 0.05804]
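A side note: flattening with sum() concatenates lists repeatedly and is quadratic in the number of spikes. That is fine at this size, but for larger data the standard library's itertools.chain does the same thing in linear time:

import itertools
all_spike_times = list(itertools.chain.from_iterable(spike_times))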

Plotting the spike times

Now that we have our data, let's plot all the spike times together to take a look at the characteristic timing of this cell. We will first plot the data using Matplotlib, a plotting library modeled on the MATLAB interface. It is the most mature and popular plotting library in Python, highly customizable, and capable of producing publication-quality graphs.


In [6]:
# count bins using numpy (np.histogram returns the counts and the bin edges)
counts, edges = np.histogram(all_spike_times, bins=20, range=(0, 0.2))
print(counts, edges)


[ 7  2  3  5  7 60 15  8  2  6  1  4  2  2  1  6  5  5  2  3] [ 0.    0.01  0.02  0.03  0.04  0.05  0.06  0.07  0.08  0.09  0.1   0.11
  0.12  0.13  0.14  0.15  0.16  0.17  0.18  0.19  0.2 ]

In [7]:
# count bins using pure Python, with a coarser 20 ms bin (10 bins over 0-0.2 s)
binsz = 0.02
bins = [int(t / binsz) for t in all_spike_times]
bin_counts = [bins.count(i) for i in range(10)]
bin_edges = [i * binsz for i in range(10)]
print(bin_edges, bin_counts)


[0.0, 0.02, 0.04, 0.06, 0.08, 0.1, 0.12, 0.14, 0.16, 0.18] [9, 8, 67, 23, 8, 5, 4, 7, 10, 5]
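An equivalent pure-Python variant avoids the repeated list.count() scans by using collections.Counter from the standard library (one reasonable alternative, not the approach above):

from collections import Counter
counter = Counter(int(t / binsz) for t in all_spike_times)
bin_counts = [counter[i] for i in range(10)]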

In [8]:
n, bins, patches = plt.hist(all_spike_times, 20, range=(0,0.2))
plt.xlabel("time (s)")
plt.ylabel("no. spikes")
plt.title("Cell Spike Timing");


Seaborn is another Python plotting package. It is built on top of matplotlib and aims to make plots prettier and easier to work with. Just import seaborn and use matplotlib as usual, or use seaborn's high-level plotting helpers:


In [9]:
import seaborn as sns
n, bins, patches = plt.hist(all_spike_times, 20, range=(0,0.2))
plt.xlabel("time (s)")
plt.ylabel("no. spikes")
plt.title("Cell Spike Timing");
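As a sketch of the second option, seaborn's own histogram helper can draw the same figure; the helper is named histplot in seaborn 0.11 and later (earlier versions used distplot):

sns.histplot(all_spike_times, bins=20)
plt.xlabel('time (s)')
plt.ylabel('no. spikes')
plt.title('Cell Spike Timing')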


Bokeh is an interactive visualization library that targets modern web browsers for presentation.


In [10]:
from bokeh.charts import Histogram, show, output_notebook
output_notebook()

hm = Histogram(all_spike_times, bins=20, xlabel='time (s)', ylabel='no. spikes', title='Spike timing')
show(hm)


BokehJS successfully loaded.
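The bokeh.charts interface used above has since been removed from Bokeh. With a modern Bokeh, the stable bokeh.plotting API can produce the same figure by binning with numpy and drawing the bars as quad glyphs (a sketch, assuming Bokeh 1.0 or later):

from bokeh.io import output_notebook, show
from bokeh.plotting import figure

output_notebook()

counts, edges = np.histogram(all_spike_times, bins=20, range=(0, 0.2))
p = figure(title='Spike timing', x_axis_label='time (s)', y_axis_label='no. spikes')
p.quad(top=counts, bottom=0, left=edges[:-1], right=edges[1:])
show(p)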

Pandas provides data structures designed to make data munging, modeling, and plotting easier in Python. It aims to add R-like functionality to Python, and it is best suited to tabular data.


In [11]:
# names=range(5) pads shorter rows with NaN (the widest row in this file has 5 spikes)
spike_table = pd.read_csv('spike_times.csv', sep=',', names=range(5))

# note that the import skips the empty rows:
print('Table shape', spike_table.shape)
spike_table.plot(kind='hist', bins=20, range=(0, 0.2));
plt.xlabel("time (s)")
plt.ylabel("no. spikes")
plt.title("Cell Spike Timing");


Table shape (88, 5)
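One caveat: DataFrame.plot(kind='hist') draws a separate histogram per column, i.e. per spike index within a trial, rather than one pooled histogram. To pool all spikes into a single distribution, you can stack the columns into one Series first (a small sketch):

pooled = spike_table.stack()  # flattens to a single Series, dropping NaN
pooled.plot(kind='hist', bins=20, range=(0, 0.2))
plt.xlabel('time (s)')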

In [16]:
# spike binning with pandas, without plotting

all_spikes = spike_table.values.flatten()
all_spikes = all_spikes[~np.isnan(all_spikes)]  # drop the NaN padding
print('no. total spikes', len(all_spikes))
bin_edges = [i * 0.02 for i in range(10)] + [0.2]
cats = pd.cut(all_spikes, bin_edges, labels=False)
bin_counts = np.bincount(cats)
print(bin_counts, bin_edges)


no. total spikes 146
[ 9  8 67 23  8  5  4  7 10  5] [0.0, 0.02, 0.04, 0.06, 0.08, 0.1, 0.12, 0.14, 0.16, 0.18, 0.2]
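The binning and counting can also stay entirely inside pandas; for example, value_counts on the cut intervals gives the same numbers (one possible variant):

bin_counts = pd.Series(pd.cut(all_spikes, bin_edges)).value_counts(sort=False)
print(bin_counts)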

Resources for learning more:

Interesting notebooks