Potential Host Queries

Now that we're using SWIRE directly instead of querying Gator, we have to quickly find the potential hosts in a neighbourhood. I can think of two ways, one which is $O(n)$ and one which is $O(\log n)$, but the former is a lot easier. How slow is it? Let's grab 1000 subjects and find all the potential hosts in the $2' \times 2'$ neighbourhood, then average the times.


In [42]:
import csv
import time

import h5py
import numpy

CROWDASTRO_H5_PATH = '../crowdastro.h5'
CROWDASTRO_CSV_PATH = '../crowdastro.csv'
ARCMIN = 0.0166667

In [54]:
with h5py.File(CROWDASTRO_H5_PATH) as f_h5:
    positions = f_h5['/swire/cdfs/catalogue'][:, :2]
    
    times = []
    for i in range(1000):
        now = time.time()
        sx, sy = f_h5['/atlas/cdfs/positions'][i]

        lt_x = positions[:, 0] <= sx + ARCMIN
        gt_x = positions[:, 0] >= sx - ARCMIN
        lt_y = positions[:, 1] <= sy + ARCMIN
        gt_y = positions[:, 1] >= sy - ARCMIN
        enclosed = numpy.all([lt_x, gt_x, lt_y, gt_y], axis=0)
        potential_hosts = positions[enclosed]
        total = time.time() - now
        times.append(total)

    print('{:.02} +- {:1.1} s'.format(numpy.mean(times), numpy.std(times)))


0.0016 +- 0.0004 s

Now let's try using a tree. I'll use a $k$-d tree to store SWIRE indices.


In [50]:
import scipy.spatial

In [63]:
with h5py.File(CROWDASTRO_H5_PATH) as f_h5:
    positions = f_h5['/swire/cdfs/catalogue'][:, :2]
    tree = scipy.spatial.KDTree(positions, leafsize=100)
    radius = numpy.sqrt(2) * ARCMIN
    
    times = []
    for i in range(1000):
        now = time.time()
        point = f_h5['/atlas/cdfs/positions'][i]

        enclosed = tree.query_ball_point(point, radius)
        potential_hosts = positions[enclosed]
        
        lt_x = potential_hosts[:, 0] <= sx + ARCMIN
        gt_x = potential_hosts[:, 0] >= sx - ARCMIN
        lt_y = potential_hosts[:, 1] <= sy + ARCMIN
        gt_y = potential_hosts[:, 1] >= sy - ARCMIN
        enclosed = numpy.all([lt_x, gt_x, lt_y, gt_y], axis=0)
        potential_hosts = potential_hosts[enclosed]
        
        total = time.time() - now
        times.append(total)

    print('{:.02} +- {:1.1} s'.format(numpy.mean(times), numpy.std(times)))


0.0012 +- 0.0003 s

Note that this measurement actually depends a lot on leafsize. Smaller values seem to be slower to some extent. I think this is because I am returning many points.

I kinda prefer sklearn to scipy, so let's have a try of sklearn.


In [71]:
import sklearn.neighbors

In [83]:
with h5py.File(CROWDASTRO_H5_PATH) as f_h5:
    positions = f_h5['/swire/cdfs/catalogue'][:, :2]
    tree = sklearn.neighbors.KDTree(positions, leaf_size=100)
    radius = numpy.sqrt(2) * ARCMIN
    
    times = []
    for i in range(1000):
        now = time.time()
        point = f_h5['/atlas/cdfs/positions'][i]

        (enclosed,) = tree.query_radius(point.reshape((1, -1)), r=radius)
        potential_hosts = positions[enclosed]
        
        lt_x = potential_hosts[:, 0] <= sx + ARCMIN
        gt_x = potential_hosts[:, 0] >= sx - ARCMIN
        lt_y = potential_hosts[:, 1] <= sy + ARCMIN
        gt_y = potential_hosts[:, 1] >= sy - ARCMIN
        enclosed = numpy.all([lt_x, gt_x, lt_y, gt_y], axis=0)
        potential_hosts = potential_hosts[enclosed]
        
        total = time.time() - now
        times.append(total)

    print('{:.02} +- {:1.1} s'.format(numpy.mean(times), numpy.std(times)))


0.00036 +- 0.0002 s

In [88]:
with h5py.File(CROWDASTRO_H5_PATH) as f_h5:
    positions = f_h5['/swire/cdfs/catalogue'][:, :2]
    tree = sklearn.neighbors.KDTree(positions, leaf_size=20)
    radius = numpy.sqrt(2) * ARCMIN
    
    now = time.time()
    points = f_h5['/atlas/cdfs/positions'][:1000]

    all_enclosed = tree.query_radius(points, r=radius)
#     potential_hosts = positions[enclosed]

#     lt_x = potential_hosts[:, 0] <= sx + ARCMIN
#     gt_x = potential_hosts[:, 0] >= sx - ARCMIN
#     lt_y = potential_hosts[:, 1] <= sy + ARCMIN
#     gt_y = potential_hosts[:, 1] >= sy - ARCMIN
#     enclosed = numpy.all([lt_x, gt_x, lt_y, gt_y], axis=0)
#     potential_hosts = potential_hosts[enclosed]

    total = time.time() - now

    print('{:.06f} s'.format(total / 1000))


0.000014 s

So if I can find a nice way to pre-process the data into a tree, then we can very quickly query that tree.


In [ ]: