Human Protein Atlas Notebook

This notebook uses a fraction of the content database built for the OMERO.searcher Local client:

http://murphylab.web.cmu.edu/software/searcher/

The database contains 101 SLF33 feature vectors from images from the Human Protein Atlas.


In [20]:
import cPickle as pickle
from IPython.display import Image
import halcon

In [21]:
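# Pickle files should be opened in binary mode; the with-block closes the file afterwards.
with open('dataset.pkl', 'rb') as f:
    data = pickle.load(f)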

I will use the first image in the dataset as the query image.


In [22]:
url = data[0][0]
print url
Image(url=url,height=400,width=400,retina=True)


http://www.proteinatlas.org/images/10505/100_A12_1_blue_green.jpg
Out[22]:

I will also "cheat" by including this image in the dataset. This way, I can verify that the most similar image in the dataset is the query image itself.

Now, if we take a look at one of the records in the dataset, we will see that it is made of two elements:

  • an image URL
  • a feature vector

In [23]:
datum = data[0]
url = datum[0]
print "Elements in datum: " + str(len(datum))
print "Image URL: " + url
feature_vector = datum[1]
print "Number of features in SLF33 feature vector: " + str(len(feature_vector))


Elements in datum: 2
Image URL: http://www.proteinatlas.org/images/10505/100_A12_1_blue_green.jpg
Number of features in SLF33 feature vector: 162

Now we need to reshape this dataset, since each element in FALCON has three parts:

  • Any string (in this case we are using the image URL as its identifier)
  • An initial score (missing in this dataset, so every image is given a score of 1)
  • A feature vector (in this case an SLF33 feature vector set)

If you are interested in learning more about Subcellular Location Features (SLF), visit:

http://murphylab.web.cmu.edu/services/SLF/
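
The three-part layout described above can be sketched on toy data. The helper `to_triples` and the two made-up records are assumptions for illustration only; they are not part of HALCON. The sketch also checks that every record has two elements and that all feature vectors have the same length, which is worth confirming before running a search.

```python
def to_triples(records, initial_score=1):
    """Turn [identifier, feature_vector] pairs into
    [identifier, initial_score, feature_vector] triples.

    Hypothetical helper for illustration; not part of HALCON.
    """
    triples = []
    for record in records:
        if len(record) != 2:
            raise ValueError("expected [identifier, feature_vector], got %d elements" % len(record))
        identifier, features = record
        triples.append([identifier, initial_score, list(features)])

    # All feature vectors should share one length (162 for the records in this notebook).
    lengths = set(len(triple[2]) for triple in triples)
    if len(lengths) > 1:
        raise ValueError("feature vectors differ in length: %s" % sorted(lengths))
    return triples


# Made-up records standing in for [image URL, SLF33 feature vector] pairs.
toy_records = [
    ["image_a.jpg", [0.1, 0.2, 0.3]],
    ["image_b.jpg", [0.4, 0.5, 0.6]],
]
toy_dataset = to_triples(toy_records)
```

With the real data, `to_triples(data)` would produce the same list that the next cell builds by hand.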


In [24]:
print "Preparing dataset"
dataset = []
for datum in data:
    # Each entry is [identifier, initial score, feature vector]; every image gets the same score of 1.
    dataset.append( [ datum[0], 1, datum[1] ] )
    
print "Preparing query image"
query_image = [dataset[0]]

[iids, scores] = halcon.search.query( query_image, dataset, normalization='standard' )


Preparing dataset
Preparing query image
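
To build some intuition for what a content-based query returns, here is a minimal sketch of a plain nearest-neighbor ranking. It is not HALCON's implementation: it assumes that `normalization='standard'` means z-scoring each feature across the dataset, and it swaps in a simple Euclidean distance for whatever scoring HALCON uses internally. The function names and the toy dataset are made up for illustration.

```python
import math


def zscore_columns(vectors):
    """Scale each feature to zero mean and unit variance across all vectors."""
    n = len(vectors)
    dims = len(vectors[0])
    means = [sum(v[j] for v in vectors) / float(n) for j in range(dims)]
    stds = []
    for j in range(dims):
        variance = sum((v[j] - means[j]) ** 2 for v in vectors) / float(n)
        # Fall back to 1.0 for constant features to avoid dividing by zero.
        stds.append(math.sqrt(variance) or 1.0)
    return [[(v[j] - means[j]) / stds[j] for j in range(dims)] for v in vectors]


def rank_by_distance(query_index, dataset):
    """Rank [identifier, score, features] entries by distance to the query entry."""
    ids = [item[0] for item in dataset]
    normalized = zscore_columns([item[2] for item in dataset])
    query = normalized[query_index]
    distances = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(query, vector)))
        for vector in normalized
    ]
    order = sorted(range(len(ids)), key=lambda i: distances[i])
    return [ids[i] for i in order], [distances[i] for i in order]


# Toy dataset: "a" and "b" are close to each other, "c" is far away.
toy = [
    ["a", 1, [0.0, 10.0]],
    ["b", 1, [0.1, 11.0]],
    ["c", 1, [5.0, 50.0]],
]
ranked_ids, ranked_distances = rank_by_distance(0, toy)
```

Because the query is itself part of the toy dataset, it comes back first with a distance of zero, which mirrors the "cheat" used in this notebook.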

Now, according to HALCON, the image most similar to the query image, other than the query image itself (which ranks first), is


In [25]:
# iids[0] is the query image itself, so the closest other image is iids[1].
url = iids[1]
print url
Image(url=url,height=400,width=400,retina=True)


http://www.proteinatlas.org/images/10549/100_B12_2_blue_green.jpg
Out[25]:

The top 10 images are:


In [26]:
for url in iids[:10]:
    print url


http://www.proteinatlas.org/images/10505/100_A12_1_blue_green.jpg
http://www.proteinatlas.org/images/10549/100_B12_2_blue_green.jpg
http://www.proteinatlas.org/images/9143/100_D8_2_blue_green.jpg
http://www.proteinatlas.org/images/8406/100_F4_2_blue_green.jpg
http://www.proteinatlas.org/images/8527/100_B5_1_blue_green.jpg
http://www.proteinatlas.org/images/8802/100_B7_2_blue_green.jpg
http://www.proteinatlas.org/images/8716/100_G1_1_blue_green.jpg
http://www.proteinatlas.org/images/8411/100_G3_2_blue_green.jpg
http://www.proteinatlas.org/images/6154/100_D5_1_blue_green.jpg
http://www.proteinatlas.org/images/8614/100_B3_1_blue_green.jpg
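
Since the query image was deliberately left in the dataset, the first result should be the query itself. A small check like the one below makes that assumption explicit before dropping the first entry. The helper `neighbors_excluding_query` is hypothetical, and the `ranked` list simply copies the first three URLs printed above.

```python
def neighbors_excluding_query(ranked_ids, query_id):
    """Return the ranked identifiers without the query, after checking it ranks first.

    Hypothetical helper for illustration; not part of HALCON.
    """
    if not ranked_ids or ranked_ids[0] != query_id:
        raise ValueError("expected the query image to be ranked first")
    return ranked_ids[1:]


query_url = "http://www.proteinatlas.org/images/10505/100_A12_1_blue_green.jpg"
ranked = [
    "http://www.proteinatlas.org/images/10505/100_A12_1_blue_green.jpg",
    "http://www.proteinatlas.org/images/10549/100_B12_2_blue_green.jpg",
    "http://www.proteinatlas.org/images/9143/100_D8_2_blue_green.jpg",
]
neighbors = neighbors_excluding_query(ranked, query_url)
```

In the notebook itself, the equivalent call would be `neighbors_excluding_query(iids, query_image[0][0])`.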