The formula

$$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\|\|B\|} = \frac{\displaystyle \sum_{i=1}^{n} A_i \times B_i}{\displaystyle \sqrt {\displaystyle \sum_{i=1}^{n} (A_i)^2} \displaystyle \sqrt {\displaystyle \sum_{i=1}^{n} (B_i)^2}}$$
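
As a quick sanity check of the formula, here is a small worked example. The vectors $A = (1, 2)$ and $B = (2, 1)$ are made up for illustration and are not taken from the test data.

$$\cos(\theta) = \frac{1 \times 2 + 2 \times 1}{\sqrt{1^2 + 2^2} \sqrt{2^2 + 1^2}} = \frac{4}{\sqrt{5} \sqrt{5}} = \frac{4}{5} = 0.8$$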

Implement the numerator

$$\sum_{i=1}^{n} A_i \times B_i$$

For each index $i$ from $1$ to $n$, we multiply the value at that position in array (Series) $A$ by the value at the same position in array $B$, and then sum all of these products.
Implement a function below that takes as input two Series $A$ and $B$ and returns the value required for the numerator of the equation.
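
To make the sum of products concrete, here is a minimal sketch that uses plain Python lists rather than Series. The name `dot_product` and the sample values are hypothetical, chosen only for illustration; your own function should work on Series.

```python
def dot_product(a, b):
    """Sum of element-wise products of two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(x * y for x, y in zip(a, b))


a = [1, 2, 3]  # made-up sample values
b = [4, 5, 6]
print(dot_product(a, b))  # 1*4 + 2*5 + 3*6 = 32
```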


In [ ]:
import pandas as pd

def CS_num(A,B):
    #insert your code here
    

# Run the code below to check your code.
df = pd.read_pickle('test_LL.pickle')
print(CS_num(df.iloc[0], df.iloc[1]))  # .ix was removed from pandas; .iloc selects rows by position

Implement the denominator

$$\sqrt {\sum_{i=1}^{n} (A_i)^2} \sqrt {\sum_{i=1}^{n} (B_i)^2}$$

Part I: $\sqrt {\sum\limits_{i=1}^{n} (A_i)^2}$

Similar to the numerator, we want to multiply each value in Series $A$ by itself and find the sum of all of these squares. We then take the square root of that sum.
Write a function below that takes a Series $A$ as input and returns the appropriate value for the first half of the denominator of our cosine similarity equation.
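
The calculation can be sketched on a plain list as follows. The name `euclidean_norm` is hypothetical, and the input is a made-up example chosen so the result is easy to verify by hand.

```python
import math


def euclidean_norm(a):
    """Square root of the sum of squared values."""
    return math.sqrt(sum(x * x for x in a))


print(euclidean_norm([3, 4]))  # sqrt(9 + 16) = 5.0
```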


In [ ]:
def CS_den_part(A):
    #insert your code here
    
#The lines below are to check your code for errors.
print(CS_den_part(df.iloc[0]))

Part II: $\sqrt {\sum\limits_{i=1}^{n} (B_i)^2}$

A brief look at the second square root in the denominator shows that it has the same form as the first, so we can reuse our previous function (CS_den_part) to calculate the second part of the denominator as well.

Part III: Bring it together

The last pre-calculation for our cosine similarity equation is to bring the two parts of the denominator together. Define a function below that takes two Series $A$ and $B$, calls the CS_den_part function above on each of them, and returns the product of the two results as the value of the denominator.


In [ ]:
def CS_den(A, B):
    #insert your code here
    
#The lines below are to check your code for errors.
print(CS_den(df.iloc[0], df.iloc[1]))

Calculate

In order to do the final calculation of our cosine similarity score, we need to write a function that will take two Series $A$ and $B$ as input, call the appropriate functions to do the calculations for the parts of the CS equation, and return a single number that is the cosine similarity of the two Series.
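
Before writing the Series version, it may help to see the whole calculation on two tiny vectors. This is a sketch on plain lists under assumed names (`cosine_similarity` is hypothetical); raising an error for a zero vector is one possible choice, not something the exercise prescribes.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    numerator = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    denominator = norm_a * norm_b
    if denominator == 0:
        # The formula divides by zero when either vector is all zeros.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return numerator / denominator


print(cosine_similarity([1, 0], [0, 1]))  # orthogonal vectors give 0.0
print(cosine_similarity([1, 2], [2, 4]))  # parallel vectors give a value very close to 1
```

Because of floating-point rounding, parallel vectors may not produce exactly 1.0.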


In [ ]:
def cos_sim(A,B):
    #insert your code here
    
#The lines below are to check your code for errors.
print(cos_sim(df.iloc[0], df.iloc[1]))

Construct the matrix of answers.

Now, finally, we need to wrap everything we have done above in a function that takes a matrix of arrays (DataFrame) $DF\_LL$, calculates the CS of every row with every other row, and builds a new DataFrame $CS\_DF$ that contains the cosine similarity scores. Because there is one score per pair of rows, $CS\_DF$ is square, with one row and one column for each row of $DF\_LL$.
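
The pairwise idea can be sketched with nested lists standing in for the DataFrame. The names `cosine_similarity` and `similarity_matrix` and the three sample rows are hypothetical; the DataFrame version also needs to carry over the row labels.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    numerator = sum(x * y for x, y in zip(a, b))
    denominator = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return numerator / denominator


def similarity_matrix(rows):
    """Entry [i][j] holds the cosine similarity of rows[i] and rows[j]."""
    n = len(rows)
    return [[cosine_similarity(rows[i], rows[j]) for j in range(n)]
            for i in range(n)]


rows = [[1, 0], [0, 2], [1, 1]]  # three made-up rows with two values each
for line in similarity_matrix(rows):
    print(line)
```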


In [ ]:
def CS_matrix(DF_LL):
    #insert your code here
    
#The lines below are meant to check your function for errors.
print(CS_matrix(df))

What do you notice about your matrix? Do these characteristics that you notice make sense? Why or why not?

Compare to sklearn

Our last step is simply to compare our answers to the cosine distance computed by the scikit-learn package. This function is extremely easy to use. It is done like this:


In [ ]:
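from sklearn.metrics.pairwise import pairwise_distances
# pairwise_distances defaults to Euclidean distance, so request cosine explicitly.
print(pairwise_distances(df, metric='cosine'))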

And not only is it easier, but it is also faster. Check this out.


In [ ]:
from timeit import timeit

print('1000 iterations of sklearn takes %s seconds' % 
      timeit("pairwise_distances(df, metric='cosine')", 
             'from __main__ import pairwise_distances, df',
             number = 1000))
print('1000 iterations of our function takes %s seconds' % 
      timeit('CS_matrix(df)', 
             'from __main__ import CS_matrix, df',
             number = 1000))

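But how about the results? Write a function below that takes as input an LL DataFrame and compares the results of our function with those of the sklearn function, returning True if they agree and False if they do not. Because both results are floating-point numbers, compare them within a small tolerance rather than testing for exact equality.

Note: sklearn returns cosine distance, so $\text{similarity} = 1 - \text{distance}$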
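
The conversion and a tolerant comparison can be sketched on nested lists as follows. The helper name `all_close`, the tolerances, and the sample numbers are assumptions made for illustration and are not outputs of either function.

```python
import math


def all_close(matrix_a, matrix_b, rel_tol=1e-9, abs_tol=1e-12):
    """True if two equally shaped nested lists match within a tolerance."""
    if len(matrix_a) != len(matrix_b):
        return False
    for row_a, row_b in zip(matrix_a, matrix_b):
        if len(row_a) != len(row_b):
            return False
        for x, y in zip(row_a, row_b):
            if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True


similarities = [[1.0, 0.1], [0.1, 1.0]]  # made-up similarity scores
distances = [[0.0, 0.9], [0.9, 0.0]]     # the matching cosine distances

# Convert distances to similarities: similarity = 1 - distance.
converted = [[1 - d for d in row] for row in distances]

print(converted == similarities)           # False: 1 - 0.9 is not exactly 0.1 in floating point
print(all_close(converted, similarities))  # True: equal within tolerance
```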


In [ ]:
def CS_compare(LL_DF):
    #insert your code here
    
#The following lines are to check your function for errors.
print(CS_compare(df))