Home Depot Product Search Relevance

The challenge is to predict a relevance score for the provided combinations of search terms and products. To create the ground truth labels, Home Depot has crowdsourced the search/product pairs to multiple human raters.

GraphLab Create

This notebook uses the GraphLab Create machine learning module for Python. You need a personal license to run this code.


In [1]:
import graphlab as gl

Load data from CSV files


In [2]:
train = gl.SFrame.read_csv("../data/train.csv")


[INFO] This non-commercial license of GraphLab Create is assigned to thomasv1000@hotmail.fr and will expire on October 12, 2016. For commercial licensing options, visit https://dato.com/buy/.

[INFO] Start server at: ipc:///tmp/graphlab_server-32514 - Server binary: /Users/tjaskula/.graphlab/anaconda/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1454970983.log
[INFO] GraphLab Server Version: 1.8.1
PROGRESS: Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/train.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.117518 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,int,str,str,float]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/train.csv
PROGRESS: Parsing completed. Parsed 74067 lines in 0.174436 secs.

In [3]:
test = gl.SFrame.read_csv("../data/test.csv")


PROGRESS: Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/test.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.194729 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,int,str,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/test.csv
PROGRESS: Parsing completed. Parsed 166693 lines in 0.33546 secs.

In [4]:
desc = gl.SFrame.read_csv("../data/product_descriptions.csv")


PROGRESS: Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/product_descriptions.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.484572 secs.
------------------------------------------------------
Inferred types from first line of file as 
column_type_hints=[int,str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------
PROGRESS: Read 61134 lines. Lines per second: 61700.6
PROGRESS: Finished parsing file /Users/tjaskula/Documents/GitHub/Kaggle.HomeDepot/data/product_descriptions.csv
PROGRESS: Parsing completed. Parsed 124428 lines in 1.56952 secs.

Data merging


In [5]:
# merge train with description
train = train.join(desc, on = 'product_uid', how = 'left')

In [6]:
# merge test with description
test = test.join(desc, on = 'product_uid', how = 'left')

Let's explore some data

Let's examine 3 different queries and products:

  • the first row of the training set
  • a row from the middle of the training set
  • the last row of the training set

In [7]:
first_doc = train[0]
first_doc


Out[7]:
{'id': 2,
 'product_description': 'Not only do angles make joints stronger, they also provide more consistent, straight corners. Simpson Strong-Tie offers a wide variety of angles in various sizes and thicknesses to handle light-duty jobs or projects where a structural connection is needed. Some can be bent (skewed) to match the project. For outdoor projects or those where moisture is present, use our ZMAX zinc-coated connectors, which provide extra resistance against corrosion (look for a "Z" at the end of the model number).Versatile connector for various 90 connections and home repair projectsStronger than angled nailing or screw fastening aloneHelp ensure joints are consistently straight and strongDimensions: 3 in. x 3 in. x 1-1/2 in.Made from 12-Gauge steelGalvanized for extra corrosion resistanceInstall with 10d common nails or #9 x 1-1/2 in. Strong-Drive SD screws',
 'product_title': 'Simpson Strong-Tie 12-Gauge Angle',
 'product_uid': 100001,
 'relevance': 3.0,
 'search_term': 'angle bracket'}

The search term 'angle bracket' does not appear verbatim in the description. After stemming, 'angle' would match (the text contains 'angles'), but 'bracket' would still be missing.


In [8]:
middle_doc = train[37033]
middle_doc


Out[8]:
{'id': 113228,
 'product_description': 'PureBond Plywood Project Panels are a convenient and cost-effective way to build cabinets, furniture and other woodworking projects. It provides a beautiful wood veneer face bonded to a strong and flat wood core. These PureBond Project Panels are made with no added formaldehyde, eliminating the concern about off-gassing dangerous fumes during fabrication or when installed in your home. Their smaller size makes them easy to handle and allows you to order just the amount of wood you need. PureBond plywood, in Project Panels sizes or in full sheet sizes, are a Home Depot exclusive.California residents: see Proposition 65 informationDecorative mahogany veneer applied to both sides of this panelB-2 plain sliced mahogany - 7-ply constructionLight weight, all-wood veneer constructionPrecision-cut hardwood plywood panels in convenient small sizesCommon: 3/4 in. x 2 ft. x 4 ft.; Actual: 0.703 in. x 24 in. x 48 in.Grade: B-2',
 'product_title': '3/4 in. x 2 ft. x 4 ft. PureBond Mahogany Plywood Project Panel',
 'product_uid': 137334,
 'relevance': 3.0,
 'search_term': 'table top wood'}

Only 'wood' from the search term appears in the description.


In [9]:
last_doc = train[-1]
last_doc


Out[9]:
{'id': 221473,
 'product_description': 'No. 918 Millennial Ryan heathered texture semi-sheer curtain is a casual solid that adds freshness and a finishing touch to any decor setting. Enhances privacy while allowing light to gently filter through. Clean, simple one-pocket pole top design can be used with a standard or decorative curtain rod. Mix and match with other solids and prints for a look that is all your own.Sheer panel, gently filters lightNo header pole top panelMachine washableWide array of colors to choose from100% polyesterContains 1-curtain panel',
 'product_title': 'LICHTENBERG Pool Blue No. 918 Millennial Ryan Heathered Texture Sheer Curtain Panel, 40 in. W x 63 in. L',
 'product_uid': 206650,
 'relevance': 2.33,
 'search_term': 'fine sheer curtain 63 inches'}

Only 'sheer' and 'curtain' from the search term appear in the description.

How many search terms are absent from the description and title of documents rated 3?

Documents with a relevance of 3 are the most relevant results, but for how many of them do none of the search-term words appear in the description or the title?
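The overlap check behind this question can be sketched in plain Python with a set difference. This is only an illustration: the helper name `missing_words` is hypothetical, and it assumes simple whitespace tokenization, which may not match what `gl.text_analytics.tokenize` does with punctuation.

```python
def missing_words(search_term, text):
    """Return the lowercased search-term words that do not occur in `text`."""
    # Assumption: plain whitespace tokenization, which may differ from
    # gl.text_analytics.tokenize (for example in how punctuation is handled).
    return set(search_term.lower().split()) - set(text.lower().split())


# Query and title copied from the first training row shown earlier.
query = "angle bracket"
title = "Simpson Strong-Tie 12-Gauge Angle"

missing_in_title = missing_words(query, title)

# True only when none of the query words occur in the title.
no_term_used = len(missing_in_title) == len(set(query.lower().split()))
```

An empty result means every query word was found; a result as large as the query's word set means none was.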


In [10]:
train['search_term_word_count'] = gl.text_analytics.count_words(train['search_term'])
ranked3doc = train[train['relevance'] == 3]
print ranked3doc.head()
len(ranked3doc)


+-----+-------------+-------------------------------+
|  id | product_uid |         product_title         |
+-----+-------------+-------------------------------+
|  2  |    100001   | Simpson Strong-Tie 12-Gaug... |
|  9  |    100002   | BEHR Premium Textured Deck... |
|  18 |    100006   | Whirlpool 1.9 cu. ft. Over... |
|  21 |    100006   | Whirlpool 1.9 cu. ft. Over... |
|  27 |    100009   | House of Fara 3/4 in. x 3 ... |
|  35 |    100011   | Toro Personal Pace Recycle... |
|  37 |    100011   | Toro Personal Pace Recycle... |
|  65 |    100016   | Sunjoy Calais 8 ft. x 5 ft... |
| 123 |    100023   | Quikrete 80 lb. Crack-Resi... |
| 162 |    100029   | DecoArt Americana Decor 16... |
+-----+-------------+-------------------------------+
+--------------------------------+-----------+-------------------------------+
|          search_term           | relevance |      product_description      |
+--------------------------------+-----------+-------------------------------+
|         angle bracket          |    3.0    | Not only do angles make jo... |
|           deck over            |    3.0    | BEHR Premium Textured DECK... |
|         convection otr         |    3.0    | Achieving delicious result... |
|           microwaves           |    3.0    | Achieving delicious result... |
|            mdf 3/4             |    3.0    | Get the House of Fara 3/4 ... |
| briggs and stratton lawn mower |    3.0    | Recycler 22 in. Personal P... |
|            gas mowe            |    3.0    | Recycler 22 in. Personal P... |
|          grill gazebo          |    3.0    | Make grilling great with t... |
| CONCRETE & MASONRY CLEANER...  |    3.0    | Quikrete 80 lb. Crack-Resi... |
|          chalk paint           |    3.0    | Achieving a vintage, time-... |
+--------------------------------+-----------+-------------------------------+
+-------------------------------+
|     search_term_word_count    |
+-------------------------------+
|   {'bracket': 1, 'angle': 1}  |
|     {'over': 1, 'deck': 1}    |
|  {'otr': 1, 'convection': 1}  |
|       {'microwaves': 1}       |
|      {'mdf': 1, '3/4': 1}     |
| {'and': 1, 'stratton': 1, ... |
|     {'gas': 1, 'mowe': 1}     |
|   {'grill': 1, 'gazebo': 1}   |
| {'etcher': 1, 'cleaner': 1... |
|    {'chalk': 1, 'paint': 1}   |
+-------------------------------+
[10 rows x 7 columns]

Out[10]:
19125

In [11]:
words_search = gl.text_analytics.tokenize(ranked3doc['search_term'], to_lower = True)
words_description = gl.text_analytics.tokenize(ranked3doc['product_description'], to_lower = True)
words_title = gl.text_analytics.tokenize(ranked3doc['product_title'], to_lower = True)
wordsdiff_desc = []
wordsdiff_title = []
puid = []
search_term = []
ws_count = []
ws_count_used_desc = []
ws_count_used_title = []
for item in xrange(len(ranked3doc)):
    ws = words_search[item]
    pd = words_description[item]
    pt = words_title[item]
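    # Search words missing from the description. A set difference is never
    # None, so no None check is needed; an empty set means every word was found.
    diff = set(ws) - set(pd)
    wordsdiff_desc.append(list(diff))
    
    # Search words missing from the title.
    diff2 = set(ws) - set(pt)
    wordsdiff_title.append(list(diff2))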
    
    puid.append(ranked3doc[item]['product_uid'])
    search_term.append(ranked3doc[item]['search_term'])
    ws_count.append(len(ws))
    ws_count_used_desc.append(len(ws) - len(diff))
    ws_count_used_title.append(len(ws) - len(diff2))
    
differences = gl.SFrame({"puid" : puid,
                         "search term": search_term,
                         "diff desc" : wordsdiff_desc,
                         "diff title" : wordsdiff_title,
                         "ws count" : ws_count, 
                         "ws count used desc" : ws_count_used_desc,
                         "ws count used title" : ws_count_used_title})

In [12]:
differences.sort(['ws count used desc', 'ws count used title'])


Out[12]:
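+---------------------------+---------------------------+--------+-----------------------+
| diff desc                 | diff title                | puid   | search term           |
+---------------------------+---------------------------+--------+-----------------------+
| [recycling, bins]         | [recycling, bins]         | 145727 | recycling bins        |
| [over, deck]              | [over, deck]              | 100002 | deck over             |
| [hammer, electric, drill] | [hammer, electric, drill] | 120061 | electric hammer drill |
| [microwaves]              | [microwaves]              | 100006 | microwaves            |
| [plywoods]                | [plywoods]                | 119996 | plywoods              |
| [coca, cola]              | [coca, cola]              | 120276 | coca cola             |
| [greenhouses]             | [greenhouses]             | 120318 | greenhouses           |
| [pipe, cutters]           | [pipe, cutters]           | 119840 | pipe cutters          |
| [buit, themostat, in]     | [buit, themostat, in]     | 206359 | buit in themostat     |
| [mowers, ridding]         | [mowers, ridding]         | 120366 | ridding mowers        |
+---------------------------+---------------------------+--------+-----------------------+
+----------+--------------------+---------------------+
| ws count | ws count used desc | ws count used title |
+----------+--------------------+---------------------+
| 2        | 0                  | 0                   |
| 2        | 0                  | 0                   |
| 3        | 0                  | 0                   |
| 1        | 0                  | 0                   |
| 1        | 0                  | 0                   |
| 2        | 0                  | 0                   |
| 1        | 0                  | 0                   |
| 2        | 0                  | 0                   |
| 3        | 0                  | 0                   |
| 2        | 0                  | 0                   |
+----------+--------------------+---------------------+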
[19125 rows x 7 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In [13]:
print "No terms used in description : " + str(len(differences[differences['ws count used desc'] == 0]))
print "No terms used in title : " + str(len(differences[differences['ws count used title'] == 0]))
print "No terms used in description and title : " + str(len(differences[(differences['ws count used desc'] == 0) & 
                                                                        (differences['ws count used title'] == 0)]))


No terms used in description : 2666
No terms used in title : 2152
No terms used in description and title : 1206

In [14]:
import matplotlib.pyplot as plt
%matplotlib inline

TF-IDF with linear regression


In [15]:
train_search_tfidf = gl.text_analytics.tf_idf(train['search_term_word_count'])

In [16]:
train['search_tfidf'] = train_search_tfidf

In [17]:
train['product_desc_word_count'] = gl.text_analytics.count_words(train['product_description'])
train_desc_tfidf = gl.text_analytics.tf_idf(train['product_desc_word_count'])

In [18]:
train['desc_tfidf'] = train_desc_tfidf

In [19]:
train['product_title_word_count'] = gl.text_analytics.count_words(train['product_title'])
train_title_tfidf = gl.text_analytics.tf_idf(train['product_title_word_count'])
train['title_tfidf'] = train_title_tfidf

In [20]:
train['distance'] = train.apply(lambda x: gl.distances.cosine(x['search_tfidf'],x['desc_tfidf']))
train['distance2'] = train.apply(lambda x: gl.distances.cosine(x['search_tfidf'],x['title_tfidf']))
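The two `distance` features are cosine distances between sparse TF-IDF dictionaries. The following is a minimal plain-Python sketch of that measure, not GraphLab's implementation: the function name `cosine_distance` is hypothetical, the weights in the example are made up, and returning 1.0 for an empty vector is an assumption of this sketch.

```python
import math


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two {word: weight} dicts."""
    # Only words present in both dicts contribute to the dot product.
    dot = sum(weight * b[word] for word, weight in a.items() if word in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat an empty vector as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Illustrative weights only, not values computed from the data set.
query_tfidf = {"angle": 2.0, "bracket": 3.0}
title_tfidf = {"angle": 2.0, "simpson": 1.0}

partial_overlap = cosine_distance(query_tfidf, title_tfidf)
no_overlap = cosine_distance(query_tfidf, {"curtain": 1.5})
```

With non-negative weights the distance lies between 0 (same direction) and 1 (no shared words), so a query that shares no word with a title or description always gets a distance of 1.0.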

In [21]:
model1 = gl.linear_regression.create(train, target = 'relevance', features = ['distance', 'distance2'], validation_set = None)


PROGRESS: Linear regression:
PROGRESS: --------------------------------------------------------
PROGRESS: Number of examples          : 74067
PROGRESS: Number of features          : 2
PROGRESS: Number of unpacked features : 2
PROGRESS: Number of coefficients    : 3
PROGRESS: Starting Newton Method
PROGRESS: --------------------------------------------------------
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | Iteration | Passes   | Elapsed Time | Training-max_error | Training-rmse |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: | 1         | 2        | 1.054999     | 1.917518           | 0.510175      |
PROGRESS: +-----------+----------+--------------+--------------------+---------------+
PROGRESS: SUCCESS: Optimal solution found.
PROGRESS:

In [23]:
#let's take a look at the weights before we plot
model1.get("coefficients")


Out[23]:
name index value stderr
(intercept) None 3.32490615945 0.0148462020716
distance None -0.483754540522 0.019557894819
distance2 None -0.680280391407 0.0122246518296
[3 rows x 4 columns]
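To read these weights, it helps to plug them into the linear formula by hand. The sketch below copies the three values from the coefficient table and evaluates the model at the two extremes of the distance features; `predict_relevance` is a hypothetical helper, not a GraphLab call.

```python
# Coefficients copied from the Out[23] table.
INTERCEPT = 3.32490615945
COEF_DISTANCE = -0.483754540522    # weight of the search/description distance
COEF_DISTANCE2 = -0.680280391407   # weight of the search/title distance


def predict_relevance(distance, distance2):
    """Evaluate the fitted linear model for one pair of cosine distances."""
    return INTERCEPT + COEF_DISTANCE * distance + COEF_DISTANCE2 * distance2


# No shared words with description or title: both distances are 1.0.
worst_case = predict_relevance(1.0, 1.0)   # about 2.16

# Identical vectors: both distances are 0.0, so the result is the intercept.
best_case = predict_relevance(0.0, 0.0)    # about 3.32, above the 3.0 maximum
```

Both weights are negative, so a larger distance lowers the predicted relevance, and the title distance counts for more than the description distance. Because the intercept exceeds 3, predictions can overshoot the rating scale, which is why the submission cells clamp them.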

In [25]:
test['search_term_word_count'] = gl.text_analytics.count_words(test['search_term'])
test_search_tfidf = gl.text_analytics.tf_idf(test['search_term_word_count'])
test['search_tfidf'] = test_search_tfidf
test['product_desc_word_count'] = gl.text_analytics.count_words(test['product_description'])
test_desc_tfidf = gl.text_analytics.tf_idf(test['product_desc_word_count'])
test['desc_tfidf'] = test_desc_tfidf
test['product_title_word_count'] = gl.text_analytics.count_words(test['product_title'])
test_title_tfidf = gl.text_analytics.tf_idf(test['product_title_word_count'])
test['title_tfidf'] = test_title_tfidf
test['distance'] = test.apply(lambda x: gl.distances.cosine(x['search_tfidf'],x['desc_tfidf']))
test['distance2'] = test.apply(lambda x: gl.distances.cosine(x['search_tfidf'],x['title_tfidf']))

In [27]:
'''
predictions_test = model1.predict(test)
test_errors = predictions_test - test['relevance']
RSS_test = sum(test_errors * test_errors)
print RSS_test
'''


Out[27]:
+----+-------------+----------------------------------------------------+
| id | product_uid | product_title                                      |
+----+-------------+----------------------------------------------------+
| 1  | 100001      | Simpson Strong-Tie 12-Gauge Angle ...              |
| 4  | 100001      | Simpson Strong-Tie 12-Gauge Angle ...              |
| 5  | 100001      | Simpson Strong-Tie 12-Gauge Angle ...              |
| 6  | 100001      | Simpson Strong-Tie 12-Gauge Angle ...              |
| 7  | 100001      | Simpson Strong-Tie 12-Gauge Angle ...              |
| 8  | 100001      | Simpson Strong-Tie 12-Gauge Angle ...              |
| 10 | 100003      | STERLING Ensemble 33-1/4 in. x 60 in. x 75-1/4 ... |
| 11 | 100003      | STERLING Ensemble 33-1/4 in. x 60 in. x 75-1/4 ... |
| 12 | 100003      | STERLING Ensemble 33-1/4 in. x 60 in. x 75-1/4 ... |
| 13 | 100004      | Grape Solar 265-Watt Polycrystalline Solar ...     |
+----+-------------+----------------------------------------------------+
+---------------------------+-------------------------------------------------------+
| search_term               | product_description                                   |
+---------------------------+-------------------------------------------------------+
| 90 degree bracket         | Not only do angles make joints stronger, they ...     |
| metal l brackets          | Not only do angles make joints stronger, they ...     |
| simpson sku able          | Not only do angles make joints stronger, they ...     |
| simpson strong ties       | Not only do angles make joints stronger, they ...     |
| simpson strong tie hcc668 | Not only do angles make joints stronger, they ...     |
| wood connectors           | Not only do angles make joints stronger, they ...     |
| bath and shower kit       | Classic architecture meets contemporary de ...        |
| bath drain kit            | Classic architecture meets contemporary de ...        |
| one piece tub shower      | Classic architecture meets contemporary de ...        |
| solar panel               | The Grape Solar 265-Watt Polycrystalline PV Solar ... |
+---------------------------+-------------------------------------------------------+
+-----------------------------------------------------+-----------------------------------------+
| search_term_word_count                              | search_tfidf                            |
+-----------------------------------------------------+-----------------------------------------+
| {'90': 1, 'bracket': 1, 'degree': 1} ...            | {'90': 6.7821620611958915, ...          |
| {'metal': 1, 'l': 1, 'brackets': 1} ...             | {'metal': 4.761280475281293, 'l': ...   |
| {'sku': 1, 'able': 1, 'simpson': 1} ...             | {'sku': 7.438941597584962, ...          |
| {'ties': 1, 'strong': 1, 'simpson': 1} ...          | {'ties': 7.605068468458936, ...         |
| {'tie': 1, 'strong': 1, 'hcc668': 1, 'simpson': ... | {'tie': 7.13355994803378, 'strong': ... |
| {'connectors': 1, 'wood': 1} ...                    | {'connectors': 6.993471154863098, ...   |
| {'shower': 1, 'and': 1, 'bath': 1, 'kit': 1} ...    | {'shower': 3.890321658594567, 'a ...    |
| {'bath': 1, 'drain': 1, 'kit': 1} ...               | {'bath': 5.149710580802239, ...         |
| {'shower': 1, 'tub': 1, 'piece': 1, 'one': 1} ...   | {'shower': 3.890321658594567, 't ...    |
| {'solar': 1, 'panel': 1}                            | {'solar': 5.732339936697214, ...        |
+-----------------------------------------------------+-----------------------------------------+
+----------------------------------------------------+---------------------------------------------+
| product_desc_word_count                            | desc_tfidf                                  |
+----------------------------------------------------+---------------------------------------------+
| {'outdoor': 1, 'zmax': 1, 'repair': 1, ...         | {'outdoor': 2.2670671040608905, ...         |
| {'outdoor': 1, 'zmax': 1, 'repair': 1, ...         | {'outdoor': 2.2670671040608905, ...         |
| {'outdoor': 1, 'zmax': 1, 'repair': 1, ...         | {'outdoor': 2.2670671040608905, ...         |
| {'outdoor': 1, 'zmax': 1, 'repair': 1, ...         | {'outdoor': 2.2670671040608905, ...         |
| {'outdoor': 1, 'zmax': 1, 'repair': 1, ...         | {'outdoor': 2.2670671040608905, ...         |
| {'outdoor': 1, 'zmax': 1, 'repair': 1, ...         | {'outdoor': 2.2670671040608905, ...         |
| {'and': 2, 'storing': 1, 'series,': 1, 'z124.1 ... | {'and': 0.0800483779106809, ...             |
| {'and': 2, 'storing': 1, 'series,': 1, 'z124.1 ... | {'and': 0.0800483779106809, ...             |
| {'and': 2, 'storing': 1, 'series,': 1, 'z124.1 ... | {'and': 0.0800483779106809, ...             |
| {'polycrystalline': 2, 'module': 1, ...            | {'polycrystalline': 18.056353605403086, ... |
+----------------------------------------------------+---------------------------------------------+
+--------------------------------------------------------+---------------------------------------------+
| product_title_word_count                               | title_tfidf                                 |
+--------------------------------------------------------+---------------------------------------------+
| {'strong-tie': 1, '12-gauge': 1, 'angle': ...          | {'strong-tie': 5.446047718534487, ...       |
| {'strong-tie': 1, '12-gauge': 1, 'angle': ...          | {'strong-tie': 5.446047718534487, ...       |
| {'strong-tie': 1, '12-gauge': 1, 'angle': ...          | {'strong-tie': 5.446047718534487, ...       |
| {'strong-tie': 1, '12-gauge': 1, 'angle': ...          | {'strong-tie': 5.446047718534487, ...       |
| {'strong-tie': 1, '12-gauge': 1, 'angle': ...          | {'strong-tie': 5.446047718534487, ...       |
| {'strong-tie': 1, '12-gauge': 1, 'angle': ...          | {'strong-tie': 5.446047718534487, ...       |
| {'sterling': 1, 'and': 1, 'drain': 1, '75-1/4': 1, ... | {'sterling': 6.237011694888826, 'a ...      |
| {'sterling': 1, 'and': 1, 'drain': 1, '75-1/4': 1, ... | {'sterling': 6.237011694888826, 'a ...      |
| {'sterling': 1, 'and': 1, 'drain': 1, '75-1/4': 1, ... | {'sterling': 6.237011694888826, 'a ...      |
| {'polycrystalline': 1, 'grape': 1, '(4-pack)': ...     | {'polycrystalline': 10.637614715135642, ... |
+--------------------------------------------------------+---------------------------------------------+
+----------------+----------------+
| distance       | distance2      |
+----------------+----------------+
| 0.955471973548 | 1.0            |
| 1.0            | 1.0            |
| 0.939075457861 | 0.748825894796 |
| 0.937712516547 | 0.743206885559 |
| 0.949250747211 | 0.790775642926 |
| 1.0            | 1.0            |
| 0.957250783503 | 0.682196478975 |
| 0.956490934274 | 0.670267588954 |
| 1.0            | 0.938706963098 |
| 0.635642096321 | 0.517535106244 |
+----------------+----------------+
[10 rows x 13 columns]


In [ ]:
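# `output` is never defined earlier in the notebook; this assumes it was meant
# to hold the model's predictions on the test set
output = model1.predict(test)
output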

In [ ]:
submission = gl.SFrame(test['id'])

In [ ]:
submission.add_column(output)
submission.rename({'X1': 'id', 'X2':'relevance'})

In [ ]:
submission['relevance'] = submission.apply(lambda x: 3.0 if x['relevance'] > 3.0 else x['relevance'])
submission['relevance'] = submission.apply(lambda x: 1.0 if x['relevance'] < 1.0 else x['relevance'])
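The two `apply` calls above clamp every prediction to the 1.0 to 3.0 range of the relevance scale. As a plain-Python sketch, the same clamp fits in one expression; `clip_relevance` is a hypothetical helper name, not part of GraphLab.

```python
def clip_relevance(score, low=1.0, high=3.0):
    """Clamp a predicted score to the [low, high] relevance range."""
    return max(low, min(high, score))


# Example scores: one above the range, one below, one already inside it.
clipped = [clip_relevance(s) for s in (3.32, 0.5, 2.33)]
```

Scores already inside the range pass through unchanged, so applying the clamp twice gives the same result as applying it once.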

In [ ]:
submission['relevance'] = submission.apply(lambda x: str(x['relevance']))

In [ ]:
submission.export_csv('../data/submission.csv', quote_level = 3)

In [ ]:
#gl.canvas.set_target('ipynb')