Cosine Similarity Demo

Implement the cosine similarity measure and show some vector examples:

https://en.wikipedia.org/wiki/Cosine_similarity

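cos_sim(a, b) measures how closely two vectors point in the same direction. In general the result lies in the range [-1, 1]: 1 means the vectors point the same way (totally similar), 0 means they are orthogonal (not related), and -1 means they point in opposite directions. For non-negative inputs, such as the binary vectors used below, the result is always in the range [0, 1].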
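
For reference, the standard definition from the Wikipedia page linked above can be written as follows. The symbols $a_i$, $b_i$, $n$ and $\theta$ (the angle between the two vectors) are notation introduced here for illustration, not names used in the code.

```latex
\[
\operatorname{cos\_sim}(a, b) = \cos\theta
= \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
= \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}
\]
```

When every component of both vectors is non-negative, each product $a_i b_i$ is non-negative, so the numerator (and hence the result) cannot drop below 0.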

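One degenerate case is the all-0s vector (e.g. a==[0, 0, ..., 0]). Its norm is 0, so the similarity is mathematically undefined (0/0, which NumPy evaluates to NaN). The cos_sim implementation below avoids this by substituting a divisor of 1, so it returns 0 for all-0s inputs. In practice this measure is not used for all-0s inputs; you use it when each vector has at least one non-zero element (for binary vectors, at least one element set to True).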
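
To make the all-0s case concrete, here is a minimal sketch of what happens without a zero-divisor check. The function name cos_sim_unguarded is hypothetical and exists only for this illustration; it is not part of the notebook's cos_sim.

```python
import numpy as np

def cos_sim_unguarded(a, b):
    """Hypothetical variant with no zero-divisor check, for illustration only."""
    return np.dot(a, b) / (np.sqrt(np.dot(a, a)) * np.sqrt(np.dot(b, b)))

zeros = np.array([0, 0, 0])
other = np.array([0, 1, 1])

# This evaluates 0/0 in floating point; silence NumPy's RuntimeWarning for the demo
with np.errstate(divide="ignore", invalid="ignore"):
    unguarded_result = cos_sim_unguarded(zeros, other)

print(np.isnan(unguarded_result))  # True
```

The divisor check inside cos_sim exists to keep this NaN from propagating into later calculations.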

Note also that cosine similarity is a similarity measure, not a distance metric. Even the derived cosine distance (1 - similarity) is not a true metric: it does not satisfy the triangle inequality, nor some of the other requirements listed at https://en.wikipedia.org/wiki/Metric_%28mathematics%29.
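
A small counterexample shows the triangle inequality failing for cosine distance. This is a sketch under the assumption that "distance" means 1 - similarity; the helper cosine_distance and the vectors a, b, c are made up for this illustration.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all-0s."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])  # sits at 45 degrees between a and c
c = np.array([0.0, 1.0])

direct = cosine_distance(a, c)                         # a and c are orthogonal
via_b = cosine_distance(a, b) + cosine_distance(b, c)  # detour through b

# A true metric would require direct <= via_b
print(direct > via_b)  # True
```

Here the direct distance is 1, while the detour through b adds up to roughly 0.586, so going via b is "shorter" than going direct.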

It is commonly used because it is very quick to calculate and works reliably for similarity questions.
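
One reason it is quick in practice is that many similarities can be computed at once with a single matrix product. The following is a minimal sketch, assuming the vectors are stacked as rows of a 2D array with no all-0s rows; pairwise_cos_sim is a hypothetical helper name.

```python
import numpy as np

def pairwise_cos_sim(X):
    """Cosine similarity between every pair of rows of X (2D array, no all-0s rows)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit_rows = X / norms            # scale each row to unit length
    return unit_rows @ unit_rows.T   # entry [i, j] is the similarity of rows i and j

X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 1, 1],
              [1, 0, 0]])
sims = pairwise_cos_sim(X)
print(np.round(sims, 3))
```

Each diagonal entry compares a row with itself, so it should come out as 1 (up to floating-point rounding).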


In [126]:
import numpy as np

def cos_sim(a, b):
    """Return the cosine similarity of a and b (equal-length 1D vectors)."""
    # note: for 1D vectors np.dot(a, b) == (a * b).sum()
    numerator = np.dot(a, b)
    # product of the two Euclidean norms
    divisor = np.sqrt(np.dot(a, a)) * np.sqrt(np.dot(b, b))
    if divisor == 0:
        # all-0s input: the numerator is also 0, so this returns 0 instead of NaN
        divisor = 1
    return numerator / divisor

In [145]:
# example 1D vectors
v1 = np.array([0, 0, 1])
v2 = np.array([0, 1, 1])
v3 = np.array([1, 1, 1])
v4 = np.array([1, 0, 0])

In [146]:
print("Orthogonal binary items {} {}:".format(v2, v4), cos_sim(v2, v4))
print("Less similar binary items {} {}:".format(v1, v3), cos_sim(v1, v3))
print("More similar binary items {} {}:".format(v2, v3), cos_sim(v2, v3))
print("Same binary items {} {}:".format(v1, v1), cos_sim(v1, v1))
print("Same binary items {} {}:".format(v2, v2), cos_sim(v2, v2))


Orthogonal binary items [0 1 1] [1 0 0]: 0.0
Less similar binary items [0 0 1] [1 1 1]: 0.57735026919
More similar binary items [0 1 1] [1 1 1]: 0.816496580928
Same binary items [0 0 1] [0 0 1]: 1.0
Same binary items [0 1 1] [0 1 1]: 1.0

In [147]:
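# For validation you could compare with scikit-learn's cosine_similarity
# (scikit-learn's cosine *distance* is 1 - similarity)
import sklearn.metrics
cos_sim_sklearn = sklearn.metrics.pairwise.cosine_similarity

def sklearn_score(a, b):
    """cosine_similarity expects 2D inputs (n_samples, n_features) and returns a 2D array"""
    return cos_sim_sklearn(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

# compare with a tolerance rather than exact floating-point equality
assert np.isclose(sklearn_score(v1, v1), cos_sim(v1, v1))
assert np.isclose(sklearn_score(v1, v2), cos_sim(v1, v2))
assert np.isclose(sklearn_score(v1, v3), cos_sim(v1, v3))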

In [148]:
# With negative components the result can be negative; the full range is [-1, 1]
v1_neg = np.array([0, -1, -1])
cos_sim(v1, v1_neg)


Out[148]:
-0.70710678118654746

In [149]:
# For vectors pointing in exactly opposite directions you'll get -1 (up to floating-point rounding)
v3_neg = np.array([-1, -1, -1])
cos_sim(v3, v3_neg)


Out[149]:
-1.0000000000000002
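
The Out[149] value above is slightly below -1 because of floating-point rounding in the square roots. If the score is later passed to something that needs a value in [-1, 1], such as np.arccos to recover the angle, one option is to clip it first. A minimal sketch, where cos_sim_clipped is a hypothetical helper name and not part of the original notebook:

```python
import numpy as np

def cos_sim_clipped(a, b):
    """Cosine similarity clipped to [-1, 1]; returns 0.0 if either vector is all-0s."""
    divisor = np.sqrt(np.dot(a, a)) * np.sqrt(np.dot(b, b))
    if divisor == 0:
        return 0.0
    # clip away tiny rounding overshoots such as -1.0000000000000002
    return float(np.clip(np.dot(a, b) / divisor, -1.0, 1.0))

v3 = np.array([1, 1, 1])
v3_neg = np.array([-1, -1, -1])

score = cos_sim_clipped(v3, v3_neg)
angle = np.arccos(score)  # defined because score is within [-1, 1]
print(score, angle)
```

Clipping only hides rounding error of this size; it does not change results that were already inside the valid range.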