Simple Product Recommendations

Products can be matched with one another by turning their descriptions into TF-IDF vectors and comparing those vectors with cosine similarity.


In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

The Data

The data looks something like this:


In [9]:
products = pd.read_csv('data/sample-data.csv')
products.head()
# To-Do: split the product title from the description so items can be queried by title


Out[9]:
id description
0 1 Active classic boxers - There's a reason why o...
1 2 Active sport boxer briefs - Skinning up Glory ...
2 3 Active sport briefs - These superbreathable no...
3 4 Alpine guide pants - Skin in, climb ice, switc...
4 5 Alpine wind jkt - On high ridges, steep ice an...

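TF-IDF Vectorising

This converts each description into a vector of TF-IDF weights: a term scores high when it is frequent in one description but rare across the others.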

To-Do: Add explanation and theory behind tfidfs
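
As a rough illustration of the theory, the sketch below computes TF-IDF weights by hand for a made-up three-document corpus (`toy_docs` and `toy_tfidf` are hypothetical names, not part of the notebook). It uses the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation, which is what scikit-learn documents as its default; tokenisation here is a plain whitespace split, so it will not reproduce TfidfVectorizer's output exactly on real text.

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from sample-data.csv
toy_docs = [
    "wool hiking socks",
    "wool running socks",
    "waterproof hiking jacket",
]

def toy_tfidf(docs):
    """Return one {term: weight} dict per document (unigrams only)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: rare terms get a larger value
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {term: tf[term] * idf[term] for term in tf}
        # Scale each document vector to unit length (L2 norm)
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

toy_weights = toy_tfidf(toy_docs)
print(toy_weights[1])
```

In the second document, "running" appears nowhere else, so it ends up with a larger weight than "wool" or "socks", which are shared with the first document.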


In [10]:
# Initialize the vectorizer to be word-based and to consider uni-, bi-, and tri-grams
# min_df=1 (the default) keeps every term, the same effect the original 0 was after
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tfidf.fit_transform(products.description)

Cosine Similarities

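Go back to vector algebra! The cosine of the angle between two vectors is 1 when that angle is 0 degrees, i.e., when the vectors point in the same direction.

So the closer the cosine similarity is to 1, the more similar the vectors are, and thus the more similar the product descriptions. TfidfVectorizer L2-normalises every row by default, so the plain dot product computed by linear_kernel is already the cosine similarity. Very basic!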
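
A small sketch of why a plain dot product can stand in for the cosine: once two vectors are scaled to unit length, their dot product and their cosine coincide. The vectors `a` and `b` and the helper names are made up for illustration.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

def cosine(u, v):
    """Cosine of the angle between u and v."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def l2_normalize(u):
    """Scale u to unit length."""
    length = math.sqrt(dot(u, u))
    return [x / length for x in u]

# Hypothetical term-weight vectors for two product descriptions
a = [3.0, 0.0, 4.0]
b = [1.0, 2.0, 2.0]

cos_ab = cosine(a, b)                               # 11 / (5 * 3)
dot_unit = dot(l2_normalize(a), l2_normalize(b))    # same value after normalising
print(cos_ab, dot_unit)
```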


In [22]:
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)

Let's look at our amazing similarity matrix:


In [28]:
pd.DataFrame(cosine_similarities).head()


Out[28]:
0 1 2 3 4 5 6 7 8 9 ... 490 491 492 493 494 495 496 497 498 499
0 1.000000 0.101106 0.064874 0.054205 0.045668 0.043036 0.038365 0.033483 0.065326 0.023683 ... 0.055639 0.044734 0.049726 0.169390 0.138845 0.138136 0.116579 0.060974 0.065469 0.069556
1 0.101106 1.000000 0.418166 0.054540 0.058340 0.040264 0.043855 0.032784 0.037040 0.027327 ... 0.049276 0.037292 0.048259 0.113034 0.072082 0.074730 0.054444 0.035500 0.069364 0.064805
2 0.064874 0.418166 1.000000 0.050032 0.063913 0.045049 0.051047 0.062232 0.043588 0.044433 ... 0.048307 0.039313 0.044638 0.081965 0.110537 0.061168 0.059371 0.034024 0.045514 0.050385
3 0.054205 0.054540 0.050032 1.000000 0.099679 0.104469 0.065436 0.033349 0.057038 0.042155 ... 0.044300 0.036827 0.040935 0.065442 0.047781 0.062331 0.049173 0.072626 0.047726 0.058470
4 0.045668 0.058340 0.063913 0.099679 1.000000 0.115886 0.080627 0.033233 0.054776 0.056843 ... 0.039783 0.032425 0.043029 0.052072 0.043866 0.053034 0.062629 0.107587 0.042696 0.054810

5 rows × 500 columns


In [55]:
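# Map each product id to its most similar products, best match first
similarities = {}

for index, row in products.iterrows():
    # Positions of the 99 highest scores, descending (assumes the default 0..n-1 index)
    similar_indices = cosine_similarities[index].argsort()[:-100:-1]
    # add product title after you separate it!
    similar_items = [(cosine_similarities[index][i], products['id'][i]) for i in similar_indices]
    # The first entry is normally the product itself (score 1.0), so skip it
    similarities[row['id']] = similar_items[1:]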
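
The `argsort()[:-100:-1]` slice in the loop is easier to follow on a small array. The sketch below uses a made-up five-element score row and k = 3 in place of 99; note that `[:-100:-1]` yields 99 positions, not 100, so 98 neighbours remain once the item itself is dropped.

```python
import numpy as np

# Hypothetical similarity row for item 0 against five items (itself included)
row = np.array([1.0, 0.10, 0.42, 0.05, 0.30])

k = 3
# argsort() sorts ascending; [:-(k + 1):-1] walks backwards from the end,
# giving the positions of the k highest scores in descending order
top_k = row.argsort()[:-(k + 1):-1]
print(top_k)

# The best match is the item itself (score 1.0), so drop it
neighbours = top_k[1:]
print(neighbours)
```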

In [80]:
# To-do: Make this great again 
pd.DataFrame(similarities).head()


Out[80]:
1 2 3 4 5 6 7 8 9 10 ... 491 492 493 494 495 496 497 498 499 500
0 (0.220379214726, 19) (0.418166399216, 3) (0.418166399216, 2) (0.825385675995, 159) (0.955003649316, 308) (0.301900561799, 438) (0.266268232308, 354) (0.913268218046, 220) (0.375508408435, 417) (0.302470602334, 425) ... (0.401490868812, 116) (0.981262273383, 286) (0.417995838571, 138) (0.528361268034, 19) (0.311684484525, 494) (0.615185258188, 173) (0.704989080363, 22) (0.237471568777, 302) (0.386247565848, 462) (0.36281626186, 499)
1 (0.16938950913, 494) (0.115463820986, 19) (0.11401848122, 299) (0.20769755385, 184) (0.183044200891, 96) (0.293119445145, 184) (0.254988934946, 104) (0.449363734332, 262) (0.116546926409, 469) (0.216625222869, 466) ... (0.400336582155, 72) (0.254601905813, 116) (0.38993391691, 116) (0.311684484525, 495) (0.2400276437, 496) (0.415822638361, 22) (0.63344523705, 360) (0.226974058613, 267) (0.384221675213, 463) (0.318046459929, 462)
2 (0.167694580653, 18) (0.113033922454, 494) (0.110537294466, 495) (0.188279918017, 438) (0.180399268592, 281) (0.184122871365, 382) (0.254573476606, 403) (0.446206375211, 255) (0.108203020376, 474) (0.190674483329, 428) ... (0.391326437078, 139) (0.251968784786, 56) (0.356114740564, 347) (0.217157070817, 496) (0.222285329056, 173) (0.385800979652, 23) (0.608114612943, 359) (0.22636494599, 386) (0.383771537386, 32) (0.31778345313, 463)
3 (0.164855277456, 172) (0.112478545211, 300) (0.109176400166, 300) (0.165740268287, 343) (0.157542170023, 293) (0.16468574577, 415) (0.242631339147, 464) (0.38174009592, 291) (0.103801046495, 475) (0.167516510742, 408) ... (0.376283960326, 98) (0.240331749948, 372) (0.346087059195, 98) (0.214422798401, 173) (0.222213654254, 19) (0.383677745095, 359) (0.567262700882, 23) (0.184041254207, 212) (0.36281626186, 500) (0.315561344229, 32)
4 (0.148126154606, 442) (0.111470179244, 299) (0.101723204487, 156) (0.163738275363, 384) (0.152097992726, 210) (0.150837910762, 387) (0.231217902502, 437) (0.369299287912, 240) (0.101759256605, 230) (0.153008463261, 465) ... (0.375307991886, 397) (0.23862073502, 138) (0.330714640509, 56) (0.185013957657, 497) (0.209197225592, 23) (0.383561421069, 497) (0.530864978094, 175) (0.179860082443, 415) (0.283497236892, 34) (0.256628673385, 34)

5 rows × 500 columns

Function to look up an item by its id and return its name (the part of the description before ' - '):


In [71]:
def query_item(item_id):
    return products.loc[products['id'] == item_id]['description'].tolist()[0].split(' - ')[0]

Generate top n similar items given an item's id and n (of course):


In [76]:
def recommend(item_id, n):
    print(str(n) + " products similar to " + query_item(item_id) + " :")
    print("------------------")
    recommendations = similarities[item_id][:n]
    for r in recommendations:
        print(query_item(r[1]) + " (score:" + str(r[0]) + ")")
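
If the recommendations are needed as data rather than printed text, one possible variant returns a list instead. This is only a sketch: `top_n`, `toy_similarities` and `toy_names` are hypothetical stand-ins for the notebook's `similarities` dict and `query_item` lookup.

```python
# Hypothetical stand-ins for the `similarities` dict and the product names
toy_similarities = {
    1: [(0.42, 2), (0.11, 3)],
    2: [(0.42, 1), (0.09, 3)],
    3: [(0.11, 1), (0.09, 2)],
}
toy_names = {1: "Sport briefs", 2: "Boxer briefs", 3: "Wind jacket"}

def top_n(item_id, n, similarities=toy_similarities, names=toy_names):
    """Return up to n (name, score) pairs for item_id, best match first."""
    return [(names[other_id], score) for score, other_id in similarities[item_id][:n]]

print(top_n(1, 1))
```

Slicing with `[:n]` simply returns fewer pairs when an item has fewer than n stored neighbours.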

In [78]:
recommend(3,5)


5 products similar to Active sport briefs :
------------------
Active sport boxer briefs (score:0.418166399216)
Active boy shorts (score:0.11401848122)
Active briefs (score:0.110537294466)
Active briefs (score:0.109176400166)
Active mesh bra (score:0.101723204487)

In [79]:
# To-Do: Check to-dos