In [1]:
import numpy
import pqkmeans
import sys
import pickle
First, we introduce vector compression by Product Quantization (PQ) [Jegou, TPAMI 11]. The first task is to train an encoder. Let us assume that there are 1000 six-dimensional vectors for training; $X_1 \in \mathbb{R}^{1000\times6}$.
In [2]:
X1 = numpy.random.random((1000, 6))
print("X1.shape:\n{}\n".format(X1.shape))
print("X1:\n{}".format(X1))
Then we can train a PQEncoder using $X_1$.
In [3]:
encoder = pqkmeans.encoder.PQEncoder(num_subdim=2, Ks=256)
encoder.fit(X1)
The encoder takes two parameters: $num\_subdim$ and $Ks$. In the training step, each vector is split into $num\_subdim$ sub-vectors, and each sub-vector is quantized with $Ks$ codewords. $num\_subdim$ determines the bit length of a PQ-code and is typically set to 4, 8, etc. $Ks$ is usually set to 256 so that each sub-code can be represented by $\log_2 256=8$ bits.
In this example, each 6D training vector is split into $num\_subdim(=2)$ sub-vectors (two 3D vectors). Consequently, the 1000 6D training vectors are split into two sets of 1000 3D vectors. k-means clustering is then applied to each set of sub-vectors with $Ks=256$.
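Conceptually, this training step amounts to running an independent k-means per sub-space. The following cell sketches the idea for the first sub-space using scikit-learn's KMeans; it is an illustration only (the encoder performs this internally and may use a different k-means implementation), and the variable names are ours, not part of the pqkmeans API.
In [ ]:
from sklearn.cluster import KMeans
sub_dim = X1.shape[1] // 2  # 6 / 2 = 3 dimensions per sub-vector
X1_sub0 = X1[:, :sub_dim]  # the 1000 sub-vectors belonging to the first sub-space
codewords_sub0 = KMeans(n_clusters=256).fit(X1_sub0).cluster_centers_
print("codewords of the first sub-space: {}".format(codewords_sub0.shape))  # (256, 3)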
Note that, alternatively, you can use fit_generator for a large dataset. This will be covered in tutorial 3.
After the training step, the encoder stores the resulting codewords (2 subspaces $*$ 256 codewords $*$ 3 dimensions):
In [4]:
print("codewords.shape:\n{}".format(encoder.codewords.shape))
Note that you can train the encoder beforehand using training data, and write/read the trained encoder via pickle.
In [5]:
# pickle.dump(encoder, open('encoder.pkl', 'wb')) # Write
# encoder = pickle.load(open('encoder.pkl', 'rb')) # Read
Next, let us consider database vectors (2000 six-dimensional vectors, $X_2$) that we'd like to compress.
In [6]:
X2 = numpy.random.random((2000, 6))
print("X2.shape:\n{}\n".format(X2.shape))
print("X2:\n{}\n".format(X2))
print("Data type of each element:\n{}\n".format(type(X2[0][0])))
print("Memory usage:\n{} byte".format(X2.nbytes))
We can compress these vectors using the trained PQ-encoder.
In [7]:
X2_pqcode = encoder.transform(X2)
print("X2_pqcode.shape:\n{}\n".format(X2_pqcode.shape))
print("X2_pqcode:\n{}\n".format(X2_pqcode))
print("Data type of each element:\n{}\n".format(type(X2_pqcode[0][0])))
print("Memory usage:\n{} byte".format(X2_pqcode.nbytes))
Each vector is split into $num\_subdim(=2)$ sub-vectors, and the nearest codeword is found for each sub-vector. The id of the nearest codeword is recorded, i.e., two integers in this case. This representation is called a PQ-code.
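For intuition, the first sub-code of the first database vector can be recomputed by hand from encoder.codewords, which stores the codewords of each sub-space. This is a sketch with our own variable names; its result should match X2_pqcode[0][0].
In [ ]:
sub_dim = X2.shape[1] // 2  # 3 dimensions per sub-vector
sub_vector = X2[0, :sub_dim]  # first sub-vector of the first database vector
distances = numpy.linalg.norm(encoder.codewords[0] - sub_vector, axis=1)  # distances to the 256 codewords of sub-space 0
print("nearest codeword id (by hand): {}".format(numpy.argmin(distances)))
print("sub-code from the encoder: {}".format(X2_pqcode[0][0]))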
A PQ-code is a memory-efficient data representation. The original 6D vector requires $6 * 64 = 384$ bits if a 64-bit float is used for each element. In contrast, a PQ-code requires only $2 * \log_2 256 = 16$ bits.
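As a quick sanity check of this arithmetic (plain numpy, not a pqkmeans call), the per-vector bit counts and the compression ratio can be read off the arrays that are already in memory:
In [ ]:
print("bits per raw vector: {}".format(X2.nbytes * 8 // len(X2)))  # 6 * 64 = 384
print("bits per PQ-code: {}".format(X2_pqcode.nbytes * 8 // len(X2)))  # 2 * 8 = 16
print("compression ratio: {}x".format(X2.nbytes // X2_pqcode.nbytes))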
Note that we can approximately reconstruct the original vector from a PQ-code by fetching the codewords using the PQ-code:
In [8]:
X2_reconstructed = encoder.inverse_transform(X2_pqcode)
print("original X2:\n{}\n".format(X2))
print("reconstructed X2:\n{}".format(X2_reconstructed))
As can be seen, the reconstructed vectors are similar to the original ones.
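To quantify how similar they are, here is a quick check (plain numpy, added for illustration) that compares the average reconstruction error with the average distance to unrelated random vectors:
In [ ]:
reconstruction_error = numpy.linalg.norm(X2 - X2_reconstructed, axis=1).mean()
random_baseline = numpy.linalg.norm(X2 - numpy.random.random(X2.shape), axis=1).mean()
print("average reconstruction error: {}".format(reconstruction_error))
print("average distance to random 6D vectors: {}".format(random_baseline))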
In a large-scale data processing scenario where all data cannot be stored in memory, you can compress the input vectors to PQ-codes and store only the PQ-codes (X2_pqcode).
In [9]:
# numpy.save('pqcode.npy', X2_pqcode) # You can store the PQ-codes only
Let us run the clustering over the PQ-codes. The clustering object is instantiated with the trained encoder. Here, we set the number of clusters to $k=10$.
In [10]:
kmeans = pqkmeans.clustering.PQKMeans(encoder=encoder, k=10)
Let's run PQk-means over X2_pqcode.
In [11]:
clustered = kmeans.fit_predict(X2_pqcode)
print(clustered[:100]) # Just show the first 100 results
The resulting vector (clustered) contains the id of the assigned cluster center for each input PQ-code.
In [12]:
print("The id of assigned codeword for the 1st PQ-code is {}".format(clustered[0]))
print("The id of assigned codeword for the 2nd PQ-code is {}".format(clustered[1]))
print("The id of assigned codeword for the 3rd PQ-code is {}".format(clustered[2]))
You can fetch the cluster centers by:
In [13]:
print("clustering centers:{}\n".format(kmeans.cluster_centers_))
The centers are also PQ-codes. They can be reconstructed by the PQ-encoder.
In [14]:
clustering_centers_numpy = numpy.array(kmeans.cluster_centers_, dtype=encoder.code_dtype) # Convert to np.array with the proper dtype
clustering_centers_reconstructed = encoder.inverse_transform(clustering_centers_numpy)  # From PQ-codes to 6D vectors
print("reconstructed clustering centers:\n{}".format(clustering_centers_reconstructed))
Let's summarize the result:
In [15]:
print("13th input vector:\n{}\n".format(X2[12]))
print("13th PQ code:\n{}\n".format(X2_pqcode[12]))
print("reconstructed 13th PQ code:\n{}\n".format(X2_reconstructed[12]))
print("ID of the assigned center:\n{}\n".format(clustered[12]))
print("Assigned center (PQ-code):\n{}\n".format(kmeans.cluster_centers_[clustered[12]]))
print("Assigned center (reconstructed):\n{}".format(clustering_centers_reconstructd[clustered[12]]))
Note that you can pickle the kmeans instance. The instance can be reused later as a vector quantizer for new input vectors.
In [ ]:
# pickle.dump(kmeans, open('kmeans.pkl', 'wb')) # Write
# kmeans = pickle.load(open('kmeans.pkl', 'rb')) # Read
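For example, a loaded instance can assign new vectors to the existing clusters. The commented-out sketch below assumes that PQKMeans exposes a predict method taking PQ-codes, as the note above suggests; the new vectors and their names are made up for illustration.
In [ ]:
# kmeans = pickle.load(open('kmeans.pkl', 'rb'))  # Load a previously trained instance
# X_new = numpy.random.random((5, 6))  # Five new 6D vectors (illustration only)
# X_new_pqcode = encoder.transform(X_new)  # Compress them with the same encoder
# print(kmeans.predict(X_new_pqcode))  # Ids of the nearest cluster centers (assumes a predict method)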
Let us compare PQk-means and the traditional k-means using high-dimensional data.
In [16]:
from sklearn.cluster import KMeans
In [17]:
X3 = numpy.random.random((1000, 1024)) # 1K 1024-dim vectors, for training
X4 = numpy.random.random((10000, 1024)) # 10K 1024-dim vectors, for database
K = 100
In [18]:
# Train the encoder
encoder_large = pqkmeans.encoder.PQEncoder(num_subdim=4, Ks=256)
encoder_large.fit(X3)
# Encode the vectors to PQ-code
X4_pqcode = encoder_large.transform(X4)
Let's run PQk-means and see the computational cost.
In [19]:
%time clustered_pqkmeans = pqkmeans.clustering.PQKMeans(encoder=encoder_large, k=K).fit_predict(X4_pqcode)
Then, run the traditional k-means clustering
In [20]:
%time clustered_kmeans = KMeans(n_clusters=K, n_jobs=-1).fit_predict(X4)
PQk-means should be tens to hundreds of times faster than k-means, depending on your machine. Next, let's look at the accuracy. Since the result of PQk-means is an approximation of that of k-means, k-means achieves a lower error:
In [21]:
_, pqkmeans_micro_average_error, _ = pqkmeans.evaluation.calc_error(clustered_pqkmeans, X4, K)
_, kmeans_micro_average_error, _ = pqkmeans.evaluation.calc_error(clustered_kmeans, X4, K)
print("PQk-means, micro avg error: {}".format(pqkmeans_micro_average_error))
print("k-means, micro avg error: {}".format(kmeans_micro_average_error))
In [ ]: