Now that we're using SWIRE directly instead of querying Gator, we need a fast way to find the potential hosts in a neighbourhood. I can think of two ways: a brute-force scan, which is $O(n)$ per query, and a tree-based lookup, which is $O(\log n)$, though the former is a lot easier to implement. How slow is it? Let's grab 1000 subjects, find all the potential hosts in the $2' \times 2'$ neighbourhood of each, and average the times.
In [42]:
import csv
import time
import h5py
import numpy
CROWDASTRO_H5_PATH = '../crowdastro.h5'
CROWDASTRO_CSV_PATH = '../crowdastro.csv'
ARCMIN = 0.0166667  # 1 arcminute in degrees.
In [54]:
with h5py.File(CROWDASTRO_H5_PATH, 'r') as f_h5:
    positions = f_h5['/swire/cdfs/catalogue'][:, :2]

    times = []
    for i in range(1000):
        now = time.time()

        # Brute force: compare every SWIRE position against the box bounds.
        sx, sy = f_h5['/atlas/cdfs/positions'][i]
        lt_x = positions[:, 0] <= sx + ARCMIN
        gt_x = positions[:, 0] >= sx - ARCMIN
        lt_y = positions[:, 1] <= sy + ARCMIN
        gt_y = positions[:, 1] >= sy - ARCMIN
        enclosed = numpy.all([lt_x, gt_x, lt_y, gt_y], axis=0)
        potential_hosts = positions[enclosed]

        total = time.time() - now
        times.append(total)

    print('{:.02} +- {:1.1} s'.format(numpy.mean(times), numpy.std(times)))
Now let's try using a tree. I'll build a $k$-d tree over the SWIRE positions; querying it returns the indices of nearby SWIRE objects.
In [50]:
import scipy.spatial
In [63]:
with h5py.File(CROWDASTRO_H5_PATH, 'r') as f_h5:
    positions = f_h5['/swire/cdfs/catalogue'][:, :2]
    tree = scipy.spatial.KDTree(positions, leafsize=100)
    # Radius of the circle circumscribing the 2' x 2' box; we query the
    # circle, then cut the result down to the box.
    radius = numpy.sqrt(2) * ARCMIN

    times = []
    for i in range(1000):
        now = time.time()

        point = f_h5['/atlas/cdfs/positions'][i]
        sx, sy = point
        enclosed = tree.query_ball_point(point, radius)
        potential_hosts = positions[enclosed]

        # Filter the circular neighbourhood down to the square box.
        lt_x = potential_hosts[:, 0] <= sx + ARCMIN
        gt_x = potential_hosts[:, 0] >= sx - ARCMIN
        lt_y = potential_hosts[:, 1] <= sy + ARCMIN
        gt_y = potential_hosts[:, 1] >= sy - ARCMIN
        enclosed = numpy.all([lt_x, gt_x, lt_y, gt_y], axis=0)
        potential_hosts = potential_hosts[enclosed]

        total = time.time() - now
        times.append(total)

    print('{:.02} +- {:1.1} s'.format(numpy.mean(times), numpy.std(times)))
Note that this measurement depends a lot on leafsize: smaller values seem somewhat slower, I think because each query returns many points.
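To check that, a quick sweep over leafsize would look something like the sketch below. This is my addition, not a cell from the original run; it reuses the imports and ARCMIN from above and times 100 queries at each of a few leaf sizes.

In [ ]:
with h5py.File(CROWDASTRO_H5_PATH, 'r') as f_h5:
    positions = f_h5['/swire/cdfs/catalogue'][:, :2]
    query_points = f_h5['/atlas/cdfs/positions'][:100]

for leafsize in [10, 100, 1000]:
    # Rebuild the tree with this leaf size and time the queries alone.
    tree = scipy.spatial.KDTree(positions, leafsize=leafsize)
    now = time.time()
    for point in query_points:
        tree.query_ball_point(point, numpy.sqrt(2) * ARCMIN)
    print('leafsize {}: {:.02} s per query'.format(
        leafsize, (time.time() - now) / 100))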
I kinda prefer sklearn to scipy, so let's give sklearn a try.
In [71]:
import sklearn.neighbors
In [83]:
with h5py.File(CROWDASTRO_H5_PATH, 'r') as f_h5:
    positions = f_h5['/swire/cdfs/catalogue'][:, :2]
    tree = sklearn.neighbors.KDTree(positions, leaf_size=100)
    radius = numpy.sqrt(2) * ARCMIN

    times = []
    for i in range(1000):
        now = time.time()

        point = f_h5['/atlas/cdfs/positions'][i]
        sx, sy = point
        (enclosed,) = tree.query_radius(point.reshape((1, -1)), r=radius)
        potential_hosts = positions[enclosed]

        # Filter the circular neighbourhood down to the square box.
        lt_x = potential_hosts[:, 0] <= sx + ARCMIN
        gt_x = potential_hosts[:, 0] >= sx - ARCMIN
        lt_y = potential_hosts[:, 1] <= sy + ARCMIN
        gt_y = potential_hosts[:, 1] >= sy - ARCMIN
        enclosed = numpy.all([lt_x, gt_x, lt_y, gt_y], axis=0)
        potential_hosts = potential_hosts[enclosed]

        total = time.time() - now
        times.append(total)

    print('{:.02} +- {:1.1} s'.format(numpy.mean(times), numpy.std(times)))
Finally, let's batch all 1000 queries into a single query_radius call, skipping the box filter, to see how fast the raw tree queries are.

In [88]:
with h5py.File(CROWDASTRO_H5_PATH, 'r') as f_h5:
    positions = f_h5['/swire/cdfs/catalogue'][:, :2]
    tree = sklearn.neighbors.KDTree(positions, leaf_size=20)
    radius = numpy.sqrt(2) * ARCMIN

    now = time.time()

    points = f_h5['/atlas/cdfs/positions'][:1000]
    all_enclosed = tree.query_radius(points, r=radius)
    # (Box filtering omitted here; this times the tree queries alone.)

    total = time.time() - now

print('{:.06f} s'.format(total / 1000))
So if I can find a nice way to pre-process the data into a tree, we can query that tree very quickly.
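As a sketch of that pre-processing step (my addition, untested here, and the ../swire_tree.pkl path is hypothetical): sklearn's KDTree can be pickled, so we could build it once and reload it in later runs instead of rebuilding it.

In [ ]:
import pickle

with h5py.File(CROWDASTRO_H5_PATH, 'r') as f_h5:
    positions = f_h5['/swire/cdfs/catalogue'][:, :2]

# Build the tree once and serialise it to disk.
tree = sklearn.neighbors.KDTree(positions, leaf_size=20)
with open('../swire_tree.pkl', 'wb') as f:
    pickle.dump(tree, f)

# In a later run: load the tree and query it without rebuilding.
with open('../swire_tree.pkl', 'rb') as f:
    tree = pickle.load(f)
all_enclosed = tree.query_radius(points, r=numpy.sqrt(2) * ARCMIN)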