Implement the cosine similarity metric and show some vector examples:
https://en.wikipedia.org/wiki/Cosine_similarity
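cos_sim(a, b) measures how similar two vectors are using the cosine of the angle between them. In general the value lies in the range [-1, 1]. For non-negative inputs (such as the binary vectors used here) it is always in the range [0, 1], where 0 means the vectors are orthogonal (not related) and 1 means they point in the same direction (totally similar).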
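For reference, the standard definition of cosine similarity can be written as below. The symbols θ (the angle between a and b) and the index i are notation introduced here for illustration; they do not appear in the notebook itself.

```latex
\cos\theta
  = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
  = \frac{\sum_i a_i b_i}{\sqrt{\sum_i a_i^2}\,\sqrt{\sum_i b_i^2}}
```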
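One degenerate case is the all-0s vector (e.g. a == [0, 0, ..., 0]). Its norm is 0, so the similarity is mathematically undefined (0/0, which NumPy would evaluate as NaN). The cos_sim implementation below guards against the zero divisor and returns 0 in this case instead. This similarity measure is not meant for all-0s inputs; use it when at least one binary element is set to True.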
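A minimal sketch of why the guard matters, assuming a hypothetical helper `cos_sim_unguarded` (not part of the notebook) that applies the formula with no zero-norm check:

```python
import numpy as np

def cos_sim_unguarded(a, b):
    """Cosine similarity with no zero-norm check (hypothetical helper, for illustration only)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Silence the RuntimeWarning that NumPy emits for 0/0
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.dot(a, b) / (np.sqrt(np.dot(a, a)) * np.sqrt(np.dot(b, b)))

zeros = np.array([0, 0, 0])
v = np.array([0, 1, 1])

unguarded_result = cos_sim_unguarded(zeros, v)
print("Unguarded result is NaN:", np.isnan(unguarded_result))
```

Whether an all-0s input should map to 0, to NaN, or raise an error is a design choice; returning 0 treats the empty vector as unrelated to everything.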
Note also that Cos Sim is not a true distance metric; it is a similarity measure. Even the derived cosine distance (1 - similarity) doesn't satisfy the triangle inequality property (https://en.wikipedia.org/wiki/Metric_%28mathematics%29), nor some of the other requirements of a metric.
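As a small illustration of the triangle inequality failing, here is a sketch that uses a hypothetical `cos_dist` helper (1 - cosine similarity, assuming non-zero vectors) on three 2D vectors chosen for this example:

```python
import numpy as np

def cos_dist(a, b):
    """Cosine distance, 1 - cosine similarity (hypothetical helper; assumes non-zero vectors)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1, 0])
b = np.array([1, 1])  # 45 degrees from both a and c
c = np.array([0, 1])

direct = cos_dist(a, c)                  # a and c are orthogonal
via_b = cos_dist(a, b) + cos_dist(b, c)  # detour through b

# A true metric requires direct <= via_b; here the detour is "shorter".
triangle_inequality_holds = direct <= via_b
print(direct, via_b, triangle_inequality_holds)
```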
It is commonly used because it is very cheap to calculate (a dot product and two norms) and works well in practice for similarity questions.
In [126]:
import numpy as np
def cos_sim(a, b):
    """a and b are equal length 1D vectors"""
    sqrt = np.sqrt
    dot = np.dot
    # note dot product for dot(a,b) == (a*b).sum()
    numerator = dot(a, b)  # same as: (a*b).sum()
    divisor = sqrt(dot(a, a)) * sqrt(dot(b, b))
    if divisor == 0:
        divisor = 1  # avoid divide-by-zero errors
    return numerator / divisor
In [145]:
# example 1D vectors
v1 = np.array([0, 0, 1])
v2 = np.array([0, 1, 1])
v3 = np.array([1, 1, 1])
v4 = np.array([1, 0, 0])
In [146]:
print("Orthogonal (non-overlapping) binary items {} {}:".format(v2, v4), cos_sim(v2, v4))
print("Less similar binary items {} {}:".format(v1, v3), cos_sim(v1, v3))
print("More similar binary items {} {}:".format(v2, v3), cos_sim(v2, v3))
print("Same binary items {} {}:".format(v1, v1), cos_sim(v1, v1))
print("Same binary items {} {}:".format(v2, v2), cos_sim(v2, v2))
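The printed values can be cross-checked against hand-computed results. The sketch below restates cos_sim and the example vectors so that it runs on its own; the `expected` dictionary holds values worked out by hand from the definition and is an addition for illustration.

```python
import numpy as np

def cos_sim(a, b):
    """Same logic as the notebook's cos_sim, restated so this block is self-contained."""
    numerator = np.dot(a, b)
    divisor = np.sqrt(np.dot(a, a)) * np.sqrt(np.dot(b, b))
    if divisor == 0:
        divisor = 1  # avoid divide-by-zero errors
    return numerator / divisor

v1 = np.array([0, 0, 1])
v2 = np.array([0, 1, 1])
v3 = np.array([1, 1, 1])
v4 = np.array([1, 0, 0])

# Hand-computed from dot(a, b) / (|a| * |b|)
expected = {
    "v2,v4": 0.0,               # no overlapping elements
    "v1,v3": 1 / np.sqrt(3),    # 1 / (1 * sqrt(3))
    "v2,v3": 2 / np.sqrt(6),    # 2 / (sqrt(2) * sqrt(3))
    "v1,v1": 1.0,               # identical vectors
}
actual = {
    "v2,v4": cos_sim(v2, v4),
    "v1,v3": cos_sim(v1, v3),
    "v2,v3": cos_sim(v2, v3),
    "v1,v1": cos_sim(v1, v1),
}
all_match = all(np.isclose(actual[k], expected[k]) for k in expected)
print("All hand-computed values match:", all_match)
```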
In [147]:
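# For validation you could compare with scikit-learn's cosine_similarity
# (its cosine *distance* is 1 - similarity). It expects 2D inputs of shape
# (n_samples, n_features), so each 1D vector is reshaped into a single row.
from sklearn.metrics.pairwise import cosine_similarity as cos_sim_sklearn
def as_row(v):
    return v.reshape(1, -1)
result_v1_v1 = cos_sim_sklearn(as_row(v1), as_row(v1))[0, 0]
result_v1_v2 = cos_sim_sklearn(as_row(v1), as_row(v2))[0, 0]
result_v1_v3 = cos_sim_sklearn(as_row(v1), as_row(v3))[0, 0]
# compare with a tolerance rather than exact float equality
assert np.isclose(result_v1_v1, cos_sim(v1, v1))
assert np.isclose(result_v1_v2, cos_sim(v1, v2))
assert np.isclose(result_v1_v3, cos_sim(v1, v3))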
In [148]:
# With negative components the result can be negative; the full range is [-1, 1]
v1_neg = np.array([0, -1, -1])
cos_sim(v1, v1_neg)  # -1/sqrt(2), approximately -0.707
Out[148]:
In [149]:
# For totally opposite vectors you'll get -1
v3_neg = np.array([-1, -1, -1])
cos_sim(v3, v3_neg)  # approximately -1.0 (up to floating-point rounding)
Out[149]: