Chapter 4: Comparison to faiss

This chapter contains the following:

  1. Set up the experiment using SIFT1M
  2. Small-scale comparison: N=10^5, K=10^3 (k-means with faiss-CPU vs. k-means with sklearn)
  3. Large-scale comparison: N=10^6, K=10^4 (PQk-means, k-means with faiss-CPU, and k-means with faiss-GPU)

Requirements:

  • numpy
  • pqkmeans
  • sklearn
  • faiss (you can install it via conda)
    1. CPU version: conda install faiss-cpu -c pytorch
2. GPU version (we tested with two NVIDIA GTX 1080s): conda install faiss-gpu -c pytorch

Our final suggestions are as follows:

  • If you have GPU(s) and your GPU memory is large enough (all the data can be loaded onto GPU memory at once), faiss-GPU is the fastest option.
  • Otherwise,
    • If your problem is small enough (all vectors fit easily into RAM), faiss-CPU would be the best option.
    • If the problem is large, e.g., (1) faiss-CPU seems slow, or (2) the vectors cannot be loaded into memory at once, then PQk-means is the best option.

1. Set up the experiment using SIFT1M


In [1]:
import numpy
import pqkmeans
from sklearn.cluster import KMeans
import faiss

In this chapter, we compare our PQk-means to the k-means implementation in the faiss library. Faiss provides one of the most efficient implementations of nearest neighbor algorithms for both CPU(s) and GPU(s). It also provides an implementation of vanilla k-means, against which we compare. The core of faiss is implemented in C++, and a Python binding is available.
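
As a quick illustration of the faiss API that its k-means builds on, here is a minimal sanity-check snippet (not one of the executed cells in this chapter; the shapes and random data are placeholders):

import numpy
import faiss

d = 64
xb = numpy.random.random((1000, d)).astype(numpy.float32)  # database vectors
xq = numpy.random.random((5, d)).astype(numpy.float32)     # query vectors

index = faiss.IndexFlatL2(d)          # exact (brute-force) L2 index
index.add(xb)                         # register the database vectors
distances, ids = index.search(xq, 3)  # 3 nearest neighbors per query
print(ids.shape)  # (5, 3)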

We compare PQk-means to both the CPU and GPU versions. Our configurations are:

  • faiss-CPU: This was built with Intel MKL, which provides one of the fastest BLAS backends. The algorithms in the library are automatically parallelized (see the snippet after this list for controlling the thread count). All evaluations were conducted on a server with a 3.6 GHz Intel Xeon CPU (6 cores, 12 threads).
  • faiss-GPU: The library was built with CUDA 8.0. Two mid-range GPUs, NVIDIA GTX 1080s, are used for the evaluation. The algorithms can run over multiple GPUs.
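
Faiss parallelizes over all available cores by default. If you need reproducible timings with a fixed thread count, the OpenMP thread count can be pinned (omp_set_num_threads is part of the faiss Python bindings; the value 12 below simply matches our Xeon's hardware threads):

import faiss

faiss.omp_set_num_threads(12)  # pin faiss to 12 OpenMP threads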

For the comparison, we leverage the SIFT1M dataset.


In [2]:
Xt, X = pqkmeans.evaluation.get_sift1m_dataset()  #  Xt: the training data.  X: the testing data to be clustered

The helper script above downloads the data automatically. This will take several minutes and consume 168 MB of disk space.


In [3]:
Xt = Xt.astype(numpy.float32)
X = X.astype(numpy.float32)
D = X.shape[1]
print("Xt.shape:{}\nX.shape:{}".format(Xt.shape, X.shape))


Xt.shape:(100000, 128)
X.shape:(1000000, 128)

Because faiss takes 32-bit float vectors as input, the data is converted to float32.
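
Faiss additionally expects the arrays to be C-contiguous. A small defensive helper (our own convenience, not part of the tutorial code) covers both requirements:

import numpy

def to_faiss_input(a):
    # Return `a` as a C-contiguous float32 array (no copy if already compliant)
    return numpy.ascontiguousarray(a, dtype=numpy.float32)

X = to_faiss_input(X)  # equivalent to the astype conversion above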

2. Small-scale comparison: N=10^5, K=10^3 (k-means with faiss-CPU vs. k-means with sklearn)

First, let us compare the k-means implementations of faiss and sklearn using 100K vectors from SIFT1M. We will see that faiss is much faster than sklearn while achieving almost the same error.

Note that it is hard to run sklearn's k-means with a large K because it is too slow (which is why this experiment is small-scale).


In [4]:
K_small = 1000
N_small = 100000

# Set up clustering instances. Each algorithm stops after 10 iterations
kmeans_faiss_cpu_small = faiss.Kmeans(d=D, k=K_small, niter=10)
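# Note: the n_jobs argument was removed in recent scikit-learn releases; drop it if KMeans raises a TypeError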
kmeans_sklearn_small = KMeans(n_clusters=K_small, n_jobs=-1, max_iter=10)

Let's run each algorithm:


In [5]:
%%time
print("faiss-cpu:")
kmeans_faiss_cpu_small.train(X[:N_small])
_, ids_faiss_cpu_small = kmeans_faiss_cpu_small.index.search(X[:N_small], 1)


faiss-cpu:
CPU times: user 18.1 s, sys: 1.82 s, total: 20 s
Wall time: 1.8 s

In [6]:
%%time
print("sklearn:")
ids_sklearn_small = kmeans_sklearn_small.fit_predict(X[:N_small])


sklearn:
/home/matsui/usr/src/anaconda/anaconda3/envs/pqkmeans_faiss/lib/python3.6/site-packages/sklearn/metrics/pairwise.py:257: RuntimeWarning: invalid value encountered in sqrt
  return distances if squared else np.sqrt(distances, out=distances)
(... the same warning is repeated ten times ...)
CPU times: user 2.86 s, sys: 380 ms, total: 3.24 s
Wall time: 2min 53s

In [7]:
_, faiss_cpu_small_error, _ = pqkmeans.evaluation.calc_error(ids_faiss_cpu_small.reshape(-1), X[:N_small], K_small)
_, sklearn_small_error, _ = pqkmeans.evaluation.calc_error(ids_sklearn_small, X[:N_small], K_small)

print("k-means, faiss-cpu, error: {}".format(faiss_cpu_small_error))
print("k-means, sklearn, error: {}".format(sklearn_small_error))


k-means, faiss-cpu, error: 216.8936472830963
k-means, sklearn, error: 216.3501130987549
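
For reference, assuming the reported error is the mean Euclidean distance from each vector to its assigned centroid (our assumption; consult pqkmeans.evaluation.calc_error for the exact definition), an equivalent computation would look like this sketch:

import numpy

def mean_assignment_error(ids, X, K):
    # Mean L2 distance from each vector to the centroid of its assigned cluster
    total, count = 0.0, 0
    for k in range(K):
        members = X[ids == k]
        if len(members) == 0:
            continue  # skip empty clusters
        centroid = members.mean(axis=0)
        total += numpy.linalg.norm(members - centroid, axis=1).sum()
        count += len(members)
    return total / count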

We observed that

  • k-means with faiss-CPU (about 2 sec) is dramatically faster than k-means with sklearn (about 3 min), with almost the same error. This speedup is presumably due to the highly optimized nearest neighbor search in faiss built on Intel MKL BLAS. This suggests that faiss-CPU is the better option for exact k-means on a typical computer.

Because faiss-CPU is faster than sklearn, we do not include sklearn in the next section.

3. Large-scale comparison: N=10^6, K=10^4 (PQk-means, k-means with faiss-CPU, and k-means with faiss-GPU)


In [8]:
# Setup GPUs for faiss-gpu
# In my environment, the first GPU (id=0) is for rendering, and the second (id=1) and the third (id=2) GPUs are for GPGPU computing (GTX 1080).
# We activate only the second and the third GPUs
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # make sure the order is identical to the result of nvidia-smi
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"  # Please change here for your environment

Next, let us compare PQk-means with faiss-CPU and faiss-GPU using the whole dataset (N=10^6, K=10^4). Note that this setting is 100x larger than the one in Sec. 2 (NK=10^10 vs. NK=10^8).

First, as pre-processing for PQk-means, let's train a PQ encoder and encode all data. With num_subdim=4 and Ks=256, each 128-D vector is split into four 32-D sub-vectors, and each sub-vector is quantized to one of 256 codewords, i.e., one byte per sub-vector. This takes around 10 seconds.


In [9]:
%%time
# Train the encoder
encoder = pqkmeans.encoder.PQEncoder(num_subdim=4, Ks=256)
encoder.fit(Xt)

# Encode the vectors to PQ-codes
X_code = encoder.transform(X)


/home/matsui/usr/src/anaconda/anaconda3/envs/pqkmeans_faiss/lib/python3.6/site-packages/scipy/cluster/vq.py:523: UserWarning: One of the clusters is empty. Re-run kmeans with a different initialization.
  warnings.warn("One of the clusters is empty. "
CPU times: user 57.3 s, sys: 7.74 s, total: 1min 5s
Wall time: 11 s

Note that X_code is 128x more memory efficient than X (a 128-D float32 vector occupies 512 bytes, whereas its 4-byte PQ-code occupies only 4 bytes):


In [10]:
print("X.shape: {}, X.dtype: {}, X.nbytes: {} MB".format(X.shape, X.dtype, X.nbytes / 10**6))
print("X_code.shape: {}, X_code.dtype: {}, X_code.nbytes: {} MB".format(X_code.shape, X_code.dtype, X_code.nbytes / 10**6))


X.shape: (1000000, 128), X.dtype: float32, X.nbytes: 512.0 MB
X_code.shape: (1000000, 4), X_code.dtype: uint8, X_code.nbytes: 4.0 MB

Then, each algorithm is instantiated as follows.


In [11]:
K = 10000  # Set larger K

# Set up k-means instances. The number of iterations is set to 20 for all methods

# PQ-kmeans
kmeans_pqkmeans = pqkmeans.clustering.PQKMeans(encoder=encoder, k=K, iteration=20)

# Faiss-cpu
kmeans_faiss_cpu = faiss.Kmeans(d=D, k=K, niter=20)
kmeans_faiss_cpu.cp.max_points_per_centroid = 1000000   #  otherwise the kmeans implementation sub-samples the training set

Because the GPU version requires some extra configuration, we wrap the GPU clustering in a single function:


In [12]:
def run_faiss_gpu(X, K, ngpu):
    # This code is based on https://github.com/facebookresearch/faiss/blob/master/benchs/kmeans_mnist.py
    
    D = X.shape[1]
    clus = faiss.Clustering(D, K)
    
    # otherwise the kmeans implementation sub-samples the training set
    clus.max_points_per_centroid = 10000000
    
    clus.niter = 20
    
    res = [faiss.StandardGpuResources() for i in range(ngpu)]

    flat_config = []
    for i in range(ngpu):
        cfg = faiss.GpuIndexFlatConfig()
        cfg.useFloat16 = False
        cfg.device = i
        flat_config.append(cfg)

    if ngpu == 1:
        index = faiss.GpuIndexFlatL2(res[0], D, flat_config[0])
    else:
        indexes = [faiss.GpuIndexFlatL2(res[i], D, flat_config[i])
                   for i in range(ngpu)]
        index = faiss.IndexProxy()
        for sub_index in indexes:
            index.addIndex(sub_index)

    # Run clustering
    clus.train(X, index)
    
    # Return the assignment
    _, ids = index.search(X, 1)
    return ids
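
In this setup, the multi-GPU path replicates a flat (exact) L2 index on each GPU via faiss.IndexProxy, which dispatches searches across the replicas; clus.train then uses this index for the assignment step of every k-means iteration.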

Run each method and see the computational cost.


In [13]:
%%time
print("PQk-means:")
ids_pqkmeans = kmeans_pqkmeans.fit_predict(X_code)


PQk-means:
CPU times: user 29min 5s, sys: 3.45 s, total: 29min 8s
Wall time: 2min 34s

In [14]:
%%time
print("faiss-cpu:")
kmeans_faiss_cpu.train(X)
_, ids_faiss_cpu = kmeans_faiss_cpu.index.search(X, 1)


faiss-cpu:
CPU times: user 48min 9s, sys: 2min 43s, total: 50min 53s
Wall time: 4min 21s

In [15]:
%%time
print("faiss with GPU:")
ids_faiss_gpu = run_faiss_gpu(X, K, ngpu=2) # Please adjust ngpu for your environment


faiss with GPU:
CPU times: user 1min, sys: 5.44 s, total: 1min 5s
Wall time: 10.6 s

In [16]:
_, pqkmeans_error, _ = pqkmeans.evaluation.calc_error(ids_pqkmeans, X, K)
_, faiss_cpu_error, _ = pqkmeans.evaluation.calc_error(ids_faiss_cpu.reshape(-1), X, K)
_, faiss_gpu_error, _ = pqkmeans.evaluation.calc_error(ids_faiss_gpu.reshape(-1), X, K)

print("PQk-means, error: {}".format(pqkmeans_error))
print("k-means, faiss-cpu, error: {}".format(faiss_cpu_error))
print("k-means, faiss-gpu, error: {}".format(faiss_gpu_error))


PQk-means, error: 218.5069038487091
k-means, faiss-cpu, error: 197.61731756228255
k-means, faiss-gpu, error: 197.61766727244188

Our observations are:

  • PQk-means (around 2 min) is 2x faster than k-means with faiss-CPU (around 4 min). The cost of learning/encoding is marginal (10 sec).
  • PQk-means is memory efficient (128x in this case). Much more data can be processed even if the data itself cannot be loaded into RAM at once (see tutorial3). Note that faiss provides memory-efficient search algorithms such as IVFPQ, but the clustering itself is vanilla k-means (all original vectors must be loaded into memory).
  • Because PQk-means approximates k-means, its clustering error is higher than that of exact k-means with faiss-CPU/GPU.
  • k-means with faiss-GPU (around 10 sec) is surprisingly faster than both PQk-means and faiss-CPU, with the same error as faiss-CPU. We conclude that, if you have several GPUs, faiss-GPU is the fastest option for exact k-means (see the benchmark for more results). Note that PQk-means with GPUs could be faster still, but that has not been implemented yet.

Our final suggestions are as follows:

  • If you have GPU(s) and your GPU memory is large enough (all the data can be loaded onto GPU memory at once), faiss-GPU is the fastest option.
  • Otherwise,
    • If your problem is small enough (all vectors fit easily into RAM), faiss-CPU would be the best option.
    • If the problem is large, e.g., (1) faiss-CPU seems slow, or (2) the vectors cannot be loaded into memory at once, then PQk-means is the best option.