In [1]:
import numpy
import pqkmeans
import tqdm
import os
import six
import gzip
import texmex_python
In this chapter, we show an example of billion-scale clustering. Since the input vectors are compressed by PQ, our PQk-means can handle a large number of vectors even if they cannot be loaded into memory directly.
From a programming perspective, our PQ-encoder has an iterative encoding function (transform_generator), by which we can handle large-scale data as if it were in memory.
Let's start the tutorial by downloading the SIFT1B data. It consists of one billion 128-dimensional SIFT vectors, and requires 97.9 GB disk space. The download might take several hours.
In [2]:
cache_directory = "." # Please set this according to your environment. 97.9 GB disk space is required.
filename = "bigann_base.bvecs.gz"
url = "ftp://ftp.irisa.fr/local/texmex/corpus/" + filename
path = os.path.join(cache_directory, filename)
if not os.path.exists(path):
    print("downloading {}".format(url))
    %time six.moves.urllib.request.urlretrieve(url, path)
Next, let's open the data and construct an iterator for it in the usual Python way. The texmex_python package provides an iterator interface for bvecs-format data.
In [3]:
f = gzip.open(path, 'rb')
vec_iterator = texmex_python.reader.read_bvec_iter(f)
Then you can read each SIFT vector one by one with a usual for-loop, e.g., "for v in vec_iterator: ...". Note that you do not need to read all the data at once, which would require 97.9 GB of memory.
Before encoding, let us construct a PQ-encoder using a small amount of training data. For simplicity, we use the training data of the SIFTSMALL dataset (for a proper evaluation, you should use the training data of the SIFT1B dataset, which takes 9.7 GB of disk space).
In [4]:
learn_data, _ = pqkmeans.evaluation.get_siftsmall_dataset()
M = 4
encoder = pqkmeans.encoder.PQEncoder(num_subdim=M, Ks=256)
encoder.fit(learn_data)
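As a quick sanity check, we can inspect the learned codebook (a sketch that assumes the encoder exposes it via the codewords attribute, as in the basic tutorial):
In [ ]:
# With 128-dim SIFT vectors, M=4, and Ks=256, each of the four sub-codebooks
# holds 256 codewords of 128/4 = 32 dimensions.
print(encoder.codewords.shape)  # expected: (4, 256, 32)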
Next, we'll encode each SIFT vector into a PQ-code iteratively. To do so, let us create a generator by calling the transform_generator function.
In [5]:
pqcode_generator = encoder.transform_generator(vec_iterator)
The resulting pqcode_generator is a generator of PQ-codes. We can encode each SIFT vector via, e.g., "for code in pqcode_generator: ...", without loading all the data into memory at once.
This design is not specific to the SIFT1B data. Whenever you need to compress big data that cannot be loaded into memory at once, you can write an iterator for your data and pass it to a PQ-encoder, as sketched below.
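For instance, here is a hypothetical sketch that streams vectors from a text file with one comma-separated vector per line (the file name and format are assumptions for illustration):
In [ ]:
# A hypothetical iterator over your own data: one comma-separated vector
# per line, read lazily so the whole file never sits in memory.
def my_vector_iterator(csv_path):
    with open(csv_path) as f:
        for line in f:
            yield numpy.array(line.split(","), dtype=numpy.float32)

# The iterator can be passed to the encoder just like vec_iterator:
# my_codes = encoder.transform_generator(my_vector_iterator("my_vectors.csv"))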
Now let's run the encoding. To avoid unnecessary memory consumption, we first allocate the whole code matrix. Since Ks=256, each PQ-code element fits in a single byte, so the matrix takes only N × M bytes = 4 GB.
In [6]:
N = 1000000000
pqcodes = numpy.empty([N, M], dtype=encoder.code_dtype)
print("pqcodes.shape:\n{}".format(pqcodes.shape))
print("pqcodes.nbytes:\n{} bytes".format(pqcodes.nbytes))
print("pqcodes.dtype:\n{}".format(pqcodes.dtype))
We can encode all the vectors with a plain for-loop. The encoding is automatically parallelized; you do not need to take any additional steps. Encoding SIFT1B takes several hours depending on your computer. Note that this loop does not incur any additional memory cost.
In [7]:
for n, code in enumerate(tqdm.tqdm(pqcode_generator, total=N)):
    pqcodes[n, :] = code
Note that it's also fine to use a list comprehension followed by a numpy conversion, such as "pqcodes = [code for code in pqcode_generator]" and "pqcodes = numpy.array(pqcodes)". But that incurs a memory overhead for temporary data storage.
After encoding, you can save the PQ-codes (and the PQ-encoder itself) if you want. The resulting PQ-codes typically do not take much memory (in this case, only 4 GB), so you can read/write them directly without any iterator/generator.
In [8]:
# import pickle
# pickle.dump(encoder, open('encoder.pkl', 'wb'))
# numpy.save('pqcode.npy', pqcodes)
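A later session can then restore both (a minimal sketch, assuming the file names used above):
In [ ]:
# Restore the saved encoder and PQ-codes in a later session
# (assuming the file names used above).
# import pickle
# encoder = pickle.load(open('encoder.pkl', 'rb'))
# pqcodes = numpy.load('pqcode.npy')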
Finally, we can run clustering over the one billion PQ-codes. Clustering billion-scale data with K=1000 finishes in several hours, depending on your computer.
In [9]:
K = 1000
print("Runtime of clustering:")
%time clustered = pqkmeans.clustering.PQKMeans(encoder=encoder, k=K).fit_predict(pqcodes)
In [10]:
print("The assigned label for the top 100 PQ-codes:\n{}".format(clustered[:100]))