Chapter 1: PQk-means

This chapter contains the followings:

Vector compression by Product Quantization
Clustering by PQk-means
Comparison to other clustering methods

Requisites:

numpy
sklearn
pqkmeans

1. Vector compression by Product Quantization



In [1]:

    
import numpy
import pqkmeans
import sys
import pickle

First , we introduce vector compression by Product Quantization (PQ) [Jegou, TPAMI 11]. The first task is to train an encoder. Let us assume that there are 1000 six-dimensional vectors for training; $X_1 \in \mathbb{R}^{1000\times6}$



In [2]:

    
X1 = numpy.random.random((1000, 6))
print("X1.shape:\n{}\n".format(X1.shape))
print("X1:\n{}".format(X1))









    



X1.shape:
(1000, 6)

X1:
[[ 0.45915011  0.81146178  0.00506377  0.50923565  0.03212935  0.0972893 ]
 [ 0.18389336  0.85654138  0.28500285  0.50927319  0.22965493  0.24099061]
 [ 0.98793263  0.51580904  0.21431657  0.0530903   0.85640898  0.58706595]
 ..., 
 [ 0.51872839  0.55220418  0.21158864  0.60338268  0.93821744  0.80093949]
 [ 0.40972777  0.81493828  0.12831053  0.07235744  0.89317441  0.99813665]
 [ 0.35910136  0.83867699  0.47266625  0.89194177  0.15437061  0.19536952]]

Then we can train a PQEncoder using $X_1$.



In [3]:

    
encoder = pqkmeans.encoder.PQEncoder(num_subdim=2, Ks=256)
encoder.fit(X1)

The encoder takes two parameters: $num\_subdim$ and $Ks$. In the training step, each vector is splitted into $num\_subdim$ sub-vectors, and quantized with $Ks$ codewords. The $num\_subdim$ decides the bit length of PQ-code, and typically set as 4, 8, etc. The $Ks$ is usually set as 256 so as to represent each sub-code by $\log_2 256=8$ bit.

In this example, each 6D training vector is splitted into $num\_subdim(=2)$ sub-vectors (two 3D vectors). Consequently, the 1000 6D training vectors are splitted into the two set of 1000 3D vectors. The k-means clustering is applied for each set of subvectors with $Ks=256$.

Note that, alternatively, you can use fit_generator for a large dataset. This will be covered in the tutorial3.

After the training step, the encoder stores the resulting codewords (2 subpspaces $*$ 256 codewords $*$ 3 dimensions):



In [4]:

    
print("codewords.shape:\n{}".format(encoder.codewords.shape))









    



codewords.shape:
(2, 256, 3)

Note that you can train the encoder preliminary using training data, and write/read the encoder via pickle.



In [5]:

    
# pickle.dump(encoder, open('encoder.pkl', 'wb'))  # Write
# encoder = pickle.load(open('encoder.pkl', 'rb'))  # Read

Next, let us consider database vectors (2000 six-dimensional vectors, $X_2$) that we'd like to compress.



In [6]:

    
X2 = numpy.random.random((2000, 6))
print("X2.shape:\n{}\n".format(X2.shape))
print("X2:\n{}\n".format(X2))
print("Data type of each element:\n{}\n".format(type(X2[0][0])))
print("Memory usage:\n{} byte".format(X2.nbytes))









    



X2.shape:
(2000, 6)

X2:
[[ 0.15180971  0.80327565  0.59099579  0.00774046  0.16339874  0.3198269 ]
 [ 0.16263575  0.59685954  0.70140796  0.53771596  0.355683    0.38158597]
 [ 0.72568346  0.90049619  0.15491865  0.8290363   0.88244248  0.47203929]
 ..., 
 [ 0.70051838  0.47091249  0.19485669  0.21541201  0.61178776  0.1875382 ]
 [ 0.02273475  0.20743724  0.79219709  0.30482205  0.01841295  0.36284165]
 [ 0.07808816  0.13373749  0.9020573   0.24226782  0.10198825  0.31281366]]

Data type of each element:
<class 'numpy.float64'>

Memory usage:
96000 byte

We can compress these vectors by the trained PQ-encoder.



In [7]:

    
X2_pqcode = encoder.transform(X2)
print("X2_pqcode.shape:\n{}\n".format(X2_pqcode.shape))
print("X2_pqcode:\n{}\n".format(X2_pqcode))
print("Data type of each element:\n{}\n".format(type(X2_pqcode[0][0])))
print("Memory usage:\n{} byte".format(X2_pqcode.nbytes))









    



X2_pqcode.shape:
(2000, 2)

X2_pqcode:
[[ 13   8]
 [204 100]
 [168 125]
 ..., 
 [  5 236]
 [ 92 175]
 [ 29 175]]

Data type of each element:
<class 'numpy.uint8'>

Memory usage:
4000 byte

Each vector is splitted into $num\_subdim(=2)$ sub-vectors, and the nearest codeword is searched for each sub-vector. The id of the nearest codeword is recorded, i.e., two integers in this case. This representation is called PQ-code.

PQ-code is a memory efficient data representation. The original 6D vector requies $6 * 64 = 384$ bit if 64 bit float is used for each element. On the other, a PQ-code requires only $2 * \log_2 256 = 16$ bit.

Note that we can approximately recunstruct the original vector from a PQ-code, by fetching the codewords using the PQ-code:



In [8]:

    
X2_reconstructed = encoder.inverse_transform(X2_pqcode)
print("original X2:\n{}\n".format(X2))
print("reconstructed X2:\n{}".format(X2_reconstructed))









    



original X2:
[[ 0.15180971  0.80327565  0.59099579  0.00774046  0.16339874  0.3198269 ]
 [ 0.16263575  0.59685954  0.70140796  0.53771596  0.355683    0.38158597]
 [ 0.72568346  0.90049619  0.15491865  0.8290363   0.88244248  0.47203929]
 ..., 
 [ 0.70051838  0.47091249  0.19485669  0.21541201  0.61178776  0.1875382 ]
 [ 0.02273475  0.20743724  0.79219709  0.30482205  0.01841295  0.36284165]
 [ 0.07808816  0.13373749  0.9020573   0.24226782  0.10198825  0.31281366]]

reconstructed X2:
[[ 0.08264026  0.7781403   0.71078617  0.06181019  0.11324808  0.29554268]
 [ 0.11838154  0.54206636  0.6747955   0.56273941  0.41778562  0.38644548]
 [ 0.70046182  0.94012103  0.15829005  0.72679207  0.84797079  0.536334  ]
 ..., 
 [ 0.76580784  0.45700225  0.19958891  0.15874598  0.53537705  0.18357633]
 [ 0.06143844  0.08844606  0.75682856  0.25799713  0.11335996  0.33009549]
 [ 0.02050005  0.0332833   0.93969899  0.25799713  0.11335996  0.33009549]]

As can be seen, the reconstructed vectors are similar to the original one.

In a large-scale data processing scenario where all data cannot be stored on memory, you can compress input vectors to PQ-codes, and store the PQ-codes only (X2_pqcode).



In [9]:

    
# numpy.save('pqcode.npy', X2_pqcode) # You can store the PQ-codes only

2. Clustering by PQk-means

Let us run the clustering over the PQ-codes. The clustering object is instanciated with the trained encoder. Here, we set the number of cluster as $k=10$.



In [10]:

    
kmeans = pqkmeans.clustering.PQKMeans(encoder=encoder, k=10)

Let's run the PQk-means over X2_pqcode.



In [11]:

    
clustered = kmeans.fit_predict(X2_pqcode)
print(clustered[:100]) # Just show the 100 results









    



[3, 5, 1, 6, 5, 6, 7, 2, 6, 6, 2, 7, 2, 3, 4, 0, 6, 5, 7, 1, 0, 3, 7, 0, 6, 3, 8, 7, 9, 1, 7, 0, 7, 8, 1, 4, 4, 0, 7, 4, 1, 2, 7, 9, 6, 4, 0, 8, 4, 5, 5, 0, 5, 4, 2, 9, 1, 9, 9, 8, 1, 4, 0, 6, 2, 2, 4, 5, 1, 2, 5, 2, 9, 9, 1, 4, 1, 1, 9, 2, 1, 7, 9, 9, 1, 2, 8, 0, 8, 3, 6, 9, 7, 2, 7, 0, 7, 0, 1, 9]

The resulting vector (clustered) contains the id of assigned codeword for each input PQ-code.



In [12]:

    
print("The id of assigned codeword for the 1st PQ-code is {}".format(clustered[0]))
print("The id of assigned codeword for the 2nd PQ-code is {}".format(clustered[1]))
print("The id of assigned codeword for the 3rd PQ-code is {}".format(clustered[2]))









    



The id of assigned codeword for the 1st PQ-code is 3
The id of assigned codeword for the 2nd PQ-code is 5
The id of assigned codeword for the 3rd PQ-code is 1

You can fetch the center of the clustering by:



In [13]:

    
print("clustering centers:{}\n".format(kmeans.cluster_centers_))









    



clustering centers:[[150, 87], [138, 199], [80, 146], [91, 110], [206, 241], [166, 162], [93, 195], [34, 221], [77, 195], [238, 197]]

The centers are also PQ-codes. They can be reconstructed by the PQ-encoder.



In [14]:

    
clustering_centers_numpy = numpy.array(kmeans.cluster_centers_, dtype=encoder.code_dtype)  # Convert to np.array with the proper dtype
clustering_centers_reconstructd = encoder.inverse_transform(clustering_centers_numpy) # From PQ-code to 6D vectors
print("reconstructed clustering centers:\n{}".format(clustering_centers_reconstructd))









    



reconstructed clustering centers:
[[ 0.67236277  0.17107652  0.53619635  0.73929783  0.49237796  0.47861757]
 [ 0.8055603   0.66486587  0.4045706   0.53686282  0.70065459  0.49703257]
 [ 0.19485406  0.12841325  0.44179731  0.66638549  0.39990629  0.52126625]
 [ 0.41317585  0.53819928  0.6209654   0.27129303  0.25384901  0.44805067]
 [ 0.65261487  0.77149634  0.73270011  0.74112421  0.11458843  0.41761175]
 [ 0.24689513  0.56747225  0.75825082  0.62652277  0.71499949  0.30156924]
 [ 0.35128772  0.35390192  0.24642649  0.23423394  0.59162514  0.58040981]
 [ 0.23566112  0.81520581  0.21010959  0.56226355  0.54656536  0.49115437]
 [ 0.43404732  0.32583786  0.79189525  0.23423394  0.59162514  0.58040981]
 [ 0.66513587  0.57157523  0.37975279  0.58714897  0.20513124  0.69112299]]

Let's summalize the result:



In [15]:

    
print("13th input vector:\n{}\n".format(X2[12]))
print("13th PQ code:\n{}\n".format(X2_pqcode[12]))
print("reconstructed 13th PQ code:\n{}\n".format(X2_reconstructed[12]))
print("ID of the assigned center:\n{}\n".format(clustered[12]))
print("Assigned center (PQ-code):\n{}\n".format(kmeans.cluster_centers_[clustered[12]]))
print("Assigned center (reconstructed):\n{}".format(clustering_centers_reconstructd[clustered[12]]))









    



13th input vector:
[ 0.23370251  0.34135083  0.03342117  0.91975159  0.78653989  0.32892066]

13th PQ code:
[ 63 223]

reconstructed 13th PQ code:
[ 0.22967011  0.29144491  0.08561297  0.97148021  0.81842     0.30072818]

ID of the assigned center:
2

Assigned center (PQ-code):
[80, 146]

Assigned center (reconstructed):
[ 0.19485406  0.12841325  0.44179731  0.66638549  0.39990629  0.52126625]

Note that you can pickle the kmeans instace. The instance can be reused later as a vector quantizer for new input vectors.



In [ ]:

    
# pickle.dump(kmeans, open('kmeans.pkl', 'wb'))  # Write
# kmeans = pickle.load(open('kmeans.pkl', 'rb'))  # Read

3. Comparison to other clustering methods

Let us compare PQk-means and the traditional k-means using high-dimensional data.



In [16]:

    
from sklearn.cluster import KMeans



In [17]:

    
X3 = numpy.random.random((1000, 1024))  # 1K 1024-dim vectors, for training 
X4 = numpy.random.random((10000, 1024)) # 10K 1024-dim vectors, for database
K = 100



In [18]:

    
# Train the encoder
encoder_large = pqkmeans.encoder.PQEncoder(num_subdim=4, Ks=256)
encoder_large.fit(X3)

# Encode the vectors to PQ-code
X4_pqcode = encoder_large.transform(X4)

Let's run the PQ-kmeans, and see the computational cost



In [19]:

    
%time clustered_pqkmeans = pqkmeans.clustering.PQKMeans(encoder=encoder_large, k=K).fit_predict(X4_pqcode)









    



CPU times: user 340 ms, sys: 0 ns, total: 340 ms
Wall time: 136 ms

Then, run the traditional k-means clustering



In [20]:

    
%time clustered_kmeans = KMeans(n_clusters=K, n_jobs=-1).fit_predict(X4)









    



CPU times: user 1.8 s, sys: 68 ms, total: 1.87 s
Wall time: 51.3 s

PQk-means would be tens to hundreds of times faster than k-means depending on your machine. Then let's see the accuracy. Since the result of PQk-means is the approximation of that of k-means, k-means achieved the lower error:



In [21]:

    
_, pqkmeans_micro_average_error, _ = pqkmeans.evaluation.calc_error(clustered_pqkmeans, X4, K)
_, kmeans_micro_average_error, _ = pqkmeans.evaluation.calc_error(clustered_kmeans, X4, K)

print("PQk-means, micro avg error: {}".format(pqkmeans_micro_average_error))
print("k-means, micro avg error: {}".format(kmeans_micro_average_error))









    



PQk-means, micro avg error: 9.175470609104421
k-means, micro avg error: 9.151204550465865



In [ ]: