Chapter 3: Billion-scale clustering

This chapter covers the following topics:

  1. Download the SIFT1B dataset
  2. Encode billion-scale data iteratively
  3. Run clustering

Requirements:

  • numpy
  • pqkmeans
  • tqdm
  • six
  • os
  • gzip
  • texmex_python (automatically installed when you pip install pqkmeans)

1. Download the SIFT1B dataset


In [1]:
import numpy
import pqkmeans
import tqdm
import os
import six
import gzip
import texmex_python

In this chapter, we show an example of billion-scale clustering. Since the input vectors are compressed by PQ, our PQk-means can handle a large number of vectors even if they cannot be loaded into memory directly.

From a programming perspective, our PQ-encoder provides an iterative encoding function (transform_generator), with which we can handle large-scale data as if it were in memory.

Let's start the tutorial by downloading the SIFT1B data. It consists of one billion 128-dimensional SIFT vectors and requires 97.9 GB of disk space. The download might take several hours.


In [2]:
cache_directory = "."  # Please set this according to your environment. 97.9 GB disk space is required.  
filename = "bigann_base.bvecs.gz"
url = "ftp://ftp.irisa.fr/local/texmex/corpus/" + filename
path = os.path.join(cache_directory, filename)
if not os.path.exists(path):
    print("downloading {}".format(url))
    %time six.moves.urllib.request.urlretrieve(url, path)

Next, let's open the data and construct an iterator over it in the usual Python way. The texmex_python package provides an iterator interface for bvecs-format data.


In [3]:
f = gzip.open(path, 'rb')
vec_iterator = texmex_python.reader.read_bvec_iter(f)

Then you can read the SIFT vectors one by one with a usual for-loop, e.g., "for v in vec_iterator: ...". Note that you do not need to read all of the data at once, which would require 97.9 GB of memory.
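
For instance, as a quick sanity check you could peek at the first few vectors through a separate file handle, so that the main vec_iterator above is not consumed. This is only a minimal sketch; it assumes each item yielded by the reader is a 128-element SIFT vector.

import itertools

# Open a separate handle so the main vec_iterator stays untouched.
with gzip.open(path, 'rb') as f_peek:
    for v in itertools.islice(texmex_python.reader.read_bvec_iter(f_peek), 3):
        print(len(v))  # each SIFT vector has 128 elements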

2. Encode billion-scale data iteratively

Before encoding, let us construct a PQ-encoder using a small amount of training data. For the sake of simplicity, we use the training data of the SIFTSMALL dataset (for a proper evaluation, you should use the training data of the SIFT1B dataset, which takes 9.7 GB of disk space).


In [4]:
learn_data, _ = pqkmeans.evaluation.get_siftsmall_dataset()
M = 4
encoder = pqkmeans.encoder.PQEncoder(num_subdim=M, Ks=256)
encoder.fit(learn_data)

Next, we encode each SIFT vector into a PQ-code iteratively. To do so, let us create a generator by calling the transform_generator function.


In [5]:
pqcode_generator = encoder.transform_generator(vec_iterator)

The resulting pqcode_generator is a generator of PQ-codes. We can encode each SIFT vector with, e.g., "for code in pqcode_generator: ...", without loading all of the data into memory at once.

This design is not specific to the SIFT1B data. Whenever you need to compress a large dataset that cannot be loaded into memory at once, you can write an iterator over your data and pass it to a PQ-encoder.
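
For example, an iterator over your own data might look like the following sketch. The file name and the line-based parsing are hypothetical; adapt them to your own format.

def my_vector_iterator(filepath):
    # Hypothetical format: one vector per line, values separated by spaces.
    with open(filepath) as f:
        for line in f:
            yield numpy.array(line.split(), dtype=numpy.float32)

# The iterator can then be passed to a PQ-encoder in the same way:
# my_pqcode_generator = encoder.transform_generator(my_vector_iterator("my_data.txt"))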

Now let's run the encoding. To avoid unnecessary memory consumption, we first allocate the output matrix as follows.


In [6]:
N = 1000000000
pqcodes = numpy.empty([N, M], dtype=encoder.code_dtype)
print("pqcodes.shape:\n{}".format(pqcodes.shape))
print("pqcodes.nbytes:\n{} bytes".format(pqcodes.nbytes))
print("pqcodes.dtype:\n{}".format(pqcodes.dtype))


pqcodes.shape:
(1000000000, 4)
pqcodes.nbytes:
4000000000 bytes
pqcodes.dtype:
uint8

We can encode the vectors by simply running a usual for-loop. The encoding is automatically parallelized; you do not need to take any additional steps. Encoding the SIFT1B data takes several hours depending on your computer. Note that this does not incur any additional memory cost.


In [7]:
for n, code in enumerate(tqdm.tqdm(pqcode_generator, total=N)):
    pqcodes[n, :] = code


100%|██████████| 1000000000/1000000000 [4:34:45<00:00, 60660.54it/s] 

Note that it is also fine to use a list comprehension followed by a numpy conversion, such as "pqcodes = [code for code in pqcode_generator]" and "pqcodes = numpy.array(pqcodes)", but this incurs memory overhead for temporary data storage.
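
Spelled out, that alternative looks like the sketch below. It is kept commented out here because pqcode_generator has already been exhausted by the loop above, and because the intermediate list costs extra memory.

# pqcodes_list = [code for code in pqcode_generator]
# pqcodes = numpy.array(pqcodes_list, dtype=encoder.code_dtype)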

After encoding, you can save the PQ-codes (and the PQ-encoder itself) if you want. The resulting PQ-codes typically do not take much space (in this case, only 4 GB), so you can read/write them directly without any iterator/generator.


In [8]:
# import pickle
# pickle.dump(encoder, open('encoder.pkl', 'wb'))
# numpy.save('pqcode.npy', pqcodes)
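
Restoring them later would be the reverse. A sketch, using the same file names as in the cell above:

# encoder = pickle.load(open('encoder.pkl', 'rb'))
# pqcodes = numpy.load('pqcode.npy')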

3. Run clustering

Finally, we can run clustering on the one billion PQ-codes. Clustering billion-scale data with K=1000 finishes in several hours, depending on your computer.


In [9]:
K = 1000
print("Runtime of clustering:")
%time clustered = pqkmeans.clustering.PQKMeans(encoder=encoder, k=K).fit_predict(pqcodes)


Runtime of clustering:
CPU times: user 1d 1h 27min 35s, sys: 4min 16s, total: 1d 1h 31min 51s
Wall time: 3h 41min 36s

In [10]:
print("The assigned label for the top 100 PQ-codes:\n{}".format(clustered[:100]))


The assigned labels for the first 100 PQ-codes:
[695, 981, 119, 124, 630, 471, 709, 287, 630, 240, 395, 721, 214, 769, 993, 742, 812, 742, 205, 812, 467, 26, 535, 709, 890, 699, 984, 446, 331, 984, 299, 947, 905, 874, 984, 962, 326, 366, 872, 26, 113, 1, 113, 321, 983, 295, 766, 4, 946, 399, 641, 164, 848, 71, 169, 415, 75, 205, 96, 144, 198, 671, 131, 4, 706, 859, 152, 459, 623, 361, 991, 38, 723, 400, 941, 15, 190, 840, 638, 371, 140, 805, 56, 345, 578, 442, 250, 779, 333, 111, 585, 974, 52, 52, 991, 12, 415, 408, 819, 54]
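
If you also want to inspect the cluster centers in the original 128-dimensional space, keep a reference to the clustering instance and decode its centers with the encoder. This is only a sketch; it assumes the fitted PQKMeans instance exposes a cluster_centers_ attribute (a list of PQ-codes) and that the encoder provides inverse_transform, and it re-runs the clustering because the instance in the cell above was not kept.

kmeans = pqkmeans.clustering.PQKMeans(encoder=encoder, k=K)
clustered = kmeans.fit_predict(pqcodes)

# The centers are themselves PQ-codes; decode them back to approximate
# 128-dimensional vectors for inspection.
centers_pqcode = numpy.array(kmeans.cluster_centers_, dtype=encoder.code_dtype)
centers = encoder.inverse_transform(centers_pqcode)
print(centers.shape)  # expected: (1000, 128)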