In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel
In [9]:
products = pd.read_csv('data/sample-data.csv')
products.head()
# To-Do: split the titles from the data to allow queries on the data!
Out[9]:
In [10]:
# Initialize the vectorizer to be word-based and to consider uni-, bi-, and tri-grams.
# min_df=1 keeps every term that appears in at least one description (recent
# scikit-learn versions reject an integer min_df of 0).
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tfidf.fit_transform(products.description)
Back to vector algebra! The cosine of the angle between two vectors is 1 when the angle between them is 0 degrees, i.e. when they point in the same direction.
So the closer the cosine similarity is to 1, the more similar the vectors are, and thus the more similar the products are in the content of their descriptions. Very basic!
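Since `TfidfVectorizer` L2-normalizes its rows by default, `linear_kernel` (a plain dot product) on the TF-IDF matrix is exactly cosine similarity. A minimal sketch of the idea with hand-rolled vectors (the vectors and the helper name are illustrative, not from the dataset):

```python
import numpy as np

def cosine(u, v):
    # Cosine of the angle between u and v: dot product over the norms
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 1.0])
c = np.array([0.0, 1.0, 0.0])

cosine(a, b)  # same direction -> 1.0
cosine(a, c)  # orthogonal -> 0.0
```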
In [22]:
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
Let's look at our amazing similarity matrix:
In [28]:
pd.DataFrame(cosine_similarities).head()
Out[28]:
In [55]:
# Build a lookup of the most similar items for every product
similarities = {}
for index, row in products.iterrows():
    # Indices of the 99 highest-scoring items, most similar first (includes the item itself)
    similar_indices = cosine_similarities[index].argsort()[:-100:-1]
    # add product title after you separate it!
    similar_items = [(cosine_similarities[index][i], products['id'][i]) for i in similar_indices]
    # Drop the first entry: it is the item itself, with similarity 1.0
    similarities[row['id']] = similar_items[1:]
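The slice `[:-100:-1]` reads the ascending `argsort` result backwards, yielding the indices of the 99 highest scores, best first. A tiny sketch of the trick on a toy score vector:

```python
import numpy as np

scores = np.array([0.2, 1.0, 0.5, 0.9])
order = scores.argsort()         # ascending indices: [0, 2, 3, 1]
top2 = scores.argsort()[:-3:-1]  # reversed slice -> indices of the 2 highest: [1, 3]
```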
In [80]:
# To-do: Make this great again
pd.DataFrame(similarities).head()
Out[80]:
Function to query for an item given its id and return the name:
In [71]:
def query_item(item_id):
    # The title is the part of the description before the first " - "
    return products.loc[products['id'] == item_id]['description'].tolist()[0].split(' - ')[0]
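`query_item` assumes each description starts with the product title followed by `' - '`. A sketch of the split on a hypothetical description string (not taken from the actual CSV):

```python
# Hypothetical description in the assumed "Title - details" format
desc = "Organic cotton tee - Soft, durable, and made to last."
title = desc.split(' - ')[0]  # -> "Organic cotton tee"
```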
Generate top n similar items given an item's id and n (of course):
In [76]:
def recommend(item_id, n):
    print(str(n) + " products similar to " + query_item(item_id) + ":")
    print("------------------")
    recommendations = similarities[item_id][:n]
    for score, rec_id in recommendations:
        print(query_item(rec_id) + " (score: " + str(score) + ")")
In [78]:
recommend(3,5)
In [79]:
# To-Do: Check to-dos