Quickly find nearest neighbors in (very) high dimensions.
Examples:
Find all $s$, $s'$ in $S$, with distance at most $\epsilon$.
In [1]:
def jaccard_sim(a, b):
    """The Jaccard similarity of the two sets a and b."""
    return len(a & b) * 1.0 / len(a | b)

def jaccard_distance(a, b):
    """The Jaccard distance: one minus the Jaccard similarity."""
    return 1 - jaccard_sim(a, b)

x = {1, 5, 6, 10}
y = {2, 5, 6, 20}
print("Similarity: %.2f" % jaccard_sim(x, y))
print("Distance: %.2f" % jaccard_distance(x, y))
Reorder the shingle matrix's rows with a random permutation $\pi$.
$\operatorname{hash}(C) =$ the minimum row number (after permutation) in which column $C$ contains a one ($C$ represents a column, i.e. a document in shingle form).
$\operatorname{hash}(C) = h_\pi(C) = \min_{i:C(i)=1}\pi(i)$ (hence the min in min-hashing)
It turns out that the probability of two documents sharing a hash value equals their Jaccard similarity: $P[h(C_1) = h(C_2)] = \operatorname{Sim}(C_1, C_2)$ (trivial but interesting proof; see slides).
This means we can use many hash functions, count how often they agree for a pair of documents, and get a decent estimate of the documents' Jaccard similarity (we verify this empirically with random permutations further below).
An alternative to min-hashing is sim-hashing (see Information Retrieval)
In [44]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
In [45]:
shingle_matrix = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0]
])
In [46]:
def min_hash(matrix, permutation):
    """Min-hash signature of every column of `matrix` under the given row permutation."""
    signature = [None for col in matrix[0]]
    for index, row in enumerate(permutation):
        for col, byte in enumerate(signature):
            if byte is None and matrix[row][col] == 1:
                signature[col] = index
    return signature

def array_sim(a1, a2):
    """Fraction of positions in which the two arrays agree."""
    in_common = len(a1[a1 == a2])
    total = len(a1)
    # print("In common: %d" % in_common)
    # print("Total: %d" % total)
    return in_common * 1.0 / total
def col_col_sim(matrix, c1_index, c2_index):
    """Jaccard similarity of two columns, each viewed as the set of rows in which it has a one."""
    tr = matrix.T
    c1 = set(np.flatnonzero(tr[c1_index]))
    c2 = set(np.flatnonzero(tr[c2_index]))
    return jaccard_sim(c1, c2)
def sig_sig_sim(matrix, c1_index, c2_index, hash_fns):
    """Fraction of hash functions on which the two columns' min-hash signatures agree."""
    sig_matrix = np.array([min_hash(matrix, hf) for hf in hash_fns])
    tr = sig_matrix.T
    c1 = tr[c1_index]
    c2 = tr[c2_index]
    return array_sim(c1, c2)
In [47]:
# Same permutations as in slides, just 0-indexed and reversed
p1 = [0, 4, 1, 6, 5, 3, 2]
p2 = [2, 1, 3, 0, 6, 4, 5]
p3 = [4, 5, 0, 1, 6, 3, 2]
print(min_hash(shingle_matrix, p1))
print(min_hash(shingle_matrix, p2))
print(min_hash(shingle_matrix, p3))
In [48]:
hash_fns = [p1, p2, p3]
pairs = [(0, 2), (1, 3), (0, 1), (2, 3)]
for c1, c2 in pairs:
    print("Columns %d and %d:" % (c1 + 1, c2 + 1))
    print("Col/col similarity: %.2f" % col_col_sim(shingle_matrix, c1, c2))
    print("Sig/sig similarity: %.2f" % sig_sig_sim(shingle_matrix, c1, c2, hash_fns))
While the above method works well, it's quite problematic to actually permute a huge dataset.
We therefore want to implement the permutations as linear hash functions (see your local linear algebra course for more info about why this works!).
\begin{equation}
\pi(r) = (a \cdot r + b) \bmod n =: h_{a,b}(r), \quad a, b \ \text{random}
\end{equation}
This is huge! It allows us to process any shingle matrix iteratively (row-wise), without having to actually move (huge amounts of) data around, as performing actual permutations would require.
Moreover, $h_{a,b}$ is efficient to store.
Note: on slide 17, try to really figure out the prof's corrections, if time allows it.
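A minimal sketch of this row-wise approach (not from the slides; the values of $a$, $b$ and the number of hash functions below are arbitrary illustrative choices): instead of materializing a permutation, we keep one running minimum per (hash function, column) pair and update it while scanning the rows once.
In [ ]:
# Sketch, assuming the modulus n is the number of rows (7 here, which is prime).
def make_hash_fn(a, b, n):
    """h_{a,b}(r) = (a * r + b) mod n -- stands in for a row permutation."""
    return lambda r: (a * r + b) % n

def min_hash_streaming(matrix, hash_fns):
    """One pass over the rows: per hash function and column, keep the smallest
    hash value seen in a row where the column has a one."""
    n_rows, n_cols = matrix.shape
    sig = np.full((len(hash_fns), n_cols), np.inf)
    for r in range(n_rows):
        hashes = [h(r) for h in hash_fns]
        for c in range(n_cols):
            if matrix[r][c] == 1:
                for i, h_r in enumerate(hashes):
                    sig[i][c] = min(sig[i][c], h_r)
    return sig.astype(int)

# a and b picked arbitrarily for illustration.
fns = [make_hash_fn(1, 1, 7), make_hash_fn(3, 2, 7), make_hash_fn(5, 4, 7)]
print(min_hash_streaming(shingle_matrix, fns))
Each row of the result corresponds to one hash function, each column to a document.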
If we use $k$ independent hash functions and call a pair a candidate ("hit") as soon as at least one hash value agrees, a pair with similarity $s$ is caught with probability $1 - (1 - s)^k$. The curves below plot this hit probability for several values of $k$.
In [56]:
index = np.arange(0, 1, 0.01)
curves = {'k = %d' % k: 1 - (1 - index) ** k for k in range(1, 20, 2)}
fr = pd.DataFrame(data=curves, index=pd.Index(index))
In [57]:
ax = fr.plot()
plt.xlabel("Similarity")
plt.ylabel("P(hit)")
Out[57]: [figure: hit probability $1 - (1 - s)^k$ as a function of the similarity $s$, one curve per $k = 1, 3, \ldots, 19$]
TODO: Numeric example from slides maybe.
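For intuition, a small numeric illustration (not the example from the slides; the numbers are picked arbitrarily): with $k = 10$ hash functions, a pair with similarity $s = 0.6$ is caught with probability $1 - 0.4^{10} \approx 0.9999$, while a pair with $s = 0.1$ still sneaks in with probability $1 - 0.9^{10} \approx 0.65$.
In [ ]:
# Hypothetical values, not from the slides: hit probability for a few (s, k) pairs.
for s in (0.1, 0.3, 0.6, 0.9):
    for k in (1, 5, 10):
        print("s = %.1f, k = %2d -> P(hit) = %.4f" % (s, k, 1 - (1 - s) ** k))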