In this notebook, we show how to write an algorithm and put it in a function that can be submitted to the NeuroFinder challenge. In these examples, the algorithms will use functionality from Spark / Thunder for distributed image and time series processing. See the other tutorials for an example submission that does the entire job using only the core Python scientific stack (numpy
, scipy
, etc.)
In [1]:
%matplotlib inline
from thunder import Colorize
image = Colorize.image
tile = Colorize.tile
First, let's load some example data so we have something to play with. We'll load the first 100 images from one of the data sets.
In [2]:
bucket = "s3n://neuro.datasets/"
path = "challenges/neurofinder/01.00/"
images = tsc.loadImages(bucket + path + 'images', startIdx=0, stopIdx=100)
Our images
is a class from Thunder for representing time-varying image sequences. Let's cache
and count
it, which forces it to be loaded and saved, and we'll also compute a reference mean image, which will be useful for displays
In [3]:
images.cache()
images.count()
ref = images.mean()
We'll also load the ground truth and the metadata for this data set
In [4]:
sources = tsc.loadSources(bucket + path + 'sources')
info = tsc.loadJSON(bucket + path + 'info.json')
We're going to write a function that takes the images
variable as an input, as well as an info
dictionary with data-set specific metadata, and returns identified sources as an output. It'll look like this (for now our function will just pass
and thus do nothing):
In [5]:
def run(data, info=None):
# do an analysis on the images
# optionally make use of the metadata
# return a set of sources
pass
The first thing we could do is use one of Thunder's built-in methods for spatio-temporal feature detection, for example, the localmax
algorithm. This is a very simple algorithm that computes the mean, and then applies some very simple image processing to detect local image peaks.
In [6]:
def run(data, info):
from thunder import SourceExtraction
method = SourceExtraction('localmax')
result = method.fit(data)
return result
Let's run our function on the example data and inspect the output
In [7]:
out = run(images, info)
In [8]:
image(out.masks((512,512), base=ref, outline=True))
Let's see how well it did on the example data
In [9]:
recall, precision, score = sources.similarity(out, metric='distance', minDistance=5)
print('score: %.2f' % score)
This algorithm isn't doing particularly well, but you could submit this right now to the challenge. Take the run
function we wrote, put it in a file run.py
in a folder called run
, and add an empty __init__.py
file in the same folder. Then fork the the neurofinder repository on GitHub and add this folder inside submissions
. See here for more detailed instructions.
Let's try to improve the algorithm a bit. One option is to use the same algorithm, but just tweak the parameters. We can inspect the algorithm we used with ?
to see all the available parameters.
In [10]:
from thunder.extraction.feature.methods.localmax import LocalMaxFeatureAlgorithm
LocalMaxFeatureAlgorithm?
Try increasing the maximum number of sources, and decrease the minimum distance
In [11]:
def run(data, info):
from thunder import SourceExtraction
method = SourceExtraction('localmax', maxSources=500, minDistance=5)
result = method.fit(data)
return result
In [12]:
out = run(images, info)
In [13]:
image(out.masks((512,512), base=ref, outline=True))
In [14]:
recall, precision, score = sources.similarity(out, metric='distance', minDistance=5)
print('score: %.2f' % score)
Hmm, that did a bit better, but still not great. Note that the precision (the number of extra sources the algorithm found) is particularly bad.
In [15]:
print('precision: %.2f' % precision)
You probably don't want to submit this one, but using and tweaking the existing algorithms is a perfectly valid way to submit algorithms! You might end up with something that works really well.
Most likely the algorithm about just isn't the right algorithm for these data. Let's try a block
algorithm, which does more complex spatio-temporal feature extraction on sub-regions, or blocks, of the full movie
In [ ]:
def run(data, info):
from thunder import SourceExtraction
from thunder.extraction import OverlapBlockMerger
merger = OverlapBlockMerger(0.1)
method = SourceExtraction('nmf', merger=merger, componentsPerBlock=5, percentile=95, minArea=100, maxArea=500)
result = method.fit(data, size=(32, 32), padding=8)
return result
Let's run this algorithm. It'll take a little longer because it's more complex, that's one of the reasons we try to parallelize these computations!
In [ ]:
out = run(images, info)
Inspect the result
In [ ]:
image(out.masks((512,512), base=ref, outline=True))
In [ ]:
recall, precision, score = sources.similarity(out, metric='distance', minDistance=5)
print('score: %.2f' % score)
The overall score is worse, but note that the precision is incredibly high. We missed a lot of sources, but the ones we found are all good. You can see that in the image above: every identified region does indeed look like it found a neuron.
In [ ]:
print('precision: %.2f' % precision)
For our final example, we'll build a custom algorithm from strach using the constructors from Thunder
. First, we'll define a function to run on each block. For testing and debugging our function, we'll grab a single block. We'll pick one with a large total standard deviation (in both space and time), so it's likely to have some structure.
In [ ]:
b = images.toBlocks(size=(40,40)).values().filter(lambda x: x.std() > 1000).first()
This should be a single numpy
array with shape (100,40,40)
, corresponding to the dimensions in time and space.
In [ ]:
b.shape
Let's write a function that computes the standard deviation over time, finds the index of the max, draws a circle around the peak, and returns it as a Source
.
In [ ]:
def stdpeak(block):
# compute the standard deviation over time
s = block.std(axis=0)
# get the indices of the peak
from numpy import where
r, c = where(s == s.max())
# define a circle around the center, clipping at the boundaries
from skimage.draw import circle
rr, cc = circle(r[0], c[0], 10, shape=block.shape[1:])
coords = zip(rr, cc)
# return as a list of sources (in this case it's just one)\n",
from thunder.extraction.source import Source
if len(coords) > 0:
return [Source(coords)]
else:
return []
Test that our function does something reasonable on the test block, showing the recovered source and the mean of the block over time side by side
In [ ]:
s = stdpeak(b)
tile([s[0].mask((40,40)), b.std(axis=0)])
Now we can build a block method that uses this function. We just need to import the classes for constructing block methods, and define an extract
function to run on each block. In this case, we'll just call our stdpeak
function from above, but to form a complete submission you'd need to include this function alongside run
. See the inline comments for what we're doing at each step.
In [ ]:
def run(data, info):
# import the classes we need for construction
from thunder.extraction.block.base import BlockAlgorithm, BlockMethod
# create a custom class by extending the base method
class TestBlockAlgorithm(BlockAlgorithm):
# write an extract function which draws a circle around the pixel
# in each block with peak standard deviation
def extract(self, block):
return stdpeak(block)
# now instaitiate our new method and use it to fit the data
method = BlockMethod(algorithm=TestBlockAlgorithm())
result = method.fit(data, size=(40, 40))
return result
Now run and evaluate the algorithm
In [ ]:
out = run(images, info)
In [ ]:
image(out.masks((512,512), base=sources, outline=True))
In [ ]:
recall, precision, score = sources.similarity(out, metric='distance', minDistance=5)
print('score: %.2f' % score)
Performance is poor, but this isn't a particularly good algorithm! Our goal was just demonstrating the API. If you made it this far, you're definitely ready to write your own algorithm. Go forth and find all the neurons!