In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel
In [9]:
products = pd.read_csv('data/sample-data.csv')
products.head()
# To-Do: split the titles from the data to allow queries on the data!
Out[9]:
In [10]:
# Initialize the vectorizer to be word-based and to consider uni-, bi-, and tri-grams.
# min_df=1 keeps every term that appears in at least one description (recent
# scikit-learn versions reject an integer min_df of 0).
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=1, stop_words='english')
tfidf_matrix = tfidf.fit_transform(products.description)
Back to vector algebra! The cosine of the angle between two vectors is 1 when the angle between them is 0 degrees, i.e. when they point in the same direction.
So the closer the cosine similarity is to 1, the more similar the vectors are, and thus the more similar the products are in the content of their descriptions. Very basic!
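Since `TfidfVectorizer` L2-normalizes its rows by default, `linear_kernel` (a plain dot product) on the TF-IDF matrix is exactly cosine similarity. A minimal sketch of the idea with hand-rolled vectors (the vectors and the helper name are illustrative, not from the dataset):

```python
import numpy as np

def cosine(u, v):
    # Cosine of the angle between u and v: dot product over the norms
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 0.0, 1.0])
c = np.array([0.0, 1.0, 0.0])

cosine(a, b)  # same direction -> 1.0
cosine(a, c)  # orthogonal -> 0.0
```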
In [22]:
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
Let's look at our amazing similarity matrix:
In [28]:
pd.DataFrame(cosine_similarities).head()
Out[28]:
In [55]:
# Build a lookup of the most similar items for every product
similarities = {}
for index, row in products.iterrows():
    # Indices of the 99 highest-scoring items, most similar first (includes the item itself)
    similar_indices = cosine_similarities[index].argsort()[:-100:-1]
    # add product title after you separate it!
    similar_items = [(cosine_similarities[index][i], products['id'][i]) for i in similar_indices]
    # Drop the first entry: it is the item itself, with similarity 1.0
    similarities[row['id']] = similar_items[1:]
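The slice `[:-100:-1]` reads the ascending `argsort` result backwards, yielding the indices of the 99 highest scores, best first. A tiny sketch of the trick on a toy score vector:

```python
import numpy as np

scores = np.array([0.2, 1.0, 0.5, 0.9])
order = scores.argsort()         # ascending indices: [0, 2, 3, 1]
top2 = scores.argsort()[:-3:-1]  # reversed slice -> indices of the 2 highest: [1, 3]
```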
In [80]:
# To-do: Make this great again
pd.DataFrame(similarities).head()
Out[80]:
Function to query for an item given its id and return the name:
In [71]:
def query_item(item_id):
    # The title is the part of the description before the first " - "
    return products.loc[products['id'] == item_id]['description'].tolist()[0].split(' - ')[0]
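`query_item` assumes each description starts with the product title followed by `' - '`. A sketch of the split on a hypothetical description string (not taken from the actual CSV):

```python
# Hypothetical description in the assumed "Title - details" format
desc = "Organic cotton tee - Soft, durable, and made to last."
title = desc.split(' - ')[0]  # -> "Organic cotton tee"
```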
Generate top n similar items given an item's id and n (of course):
In [76]:
def recommend(item_id, n):
    print(str(n) + " products similar to " + query_item(item_id) + ":")
    print("------------------")
    recommendations = similarities[item_id][:n]
    for score, rec_id in recommendations:
        print(query_item(rec_id) + " (score: " + str(score) + ")")
In [78]:
recommend(3,5)
In [79]:
# To-Do: Check to-dos