Product Matching

This notebook is presented in the Product Matching webinar, one of many interesting webinars given by Turi. Check out upcoming webinars here.
We will use GraphLab Create to perform product matching between textual descriptions of products from different sources, a task also known as Record Linkage. The data is available here.

The notebook is orginaized into the following sections:

Loading and cleaning the data
Evaluation functions
Record Linker model
Feature Engineering
Exploring parameters

Loading and cleaning the data



In [1]:

    
import graphlab as gl
import re
import matplotlib.pyplot as plt

gl.canvas.set_target('ipynb')
%matplotlib inline









    



This commercial license of GraphLab Create is assigned to engr@turi.com.






    



2016-05-19 11:07:15,655 [INFO] graphlab.cython.cy_server, 176: GraphLab Create v1.9 started. Logging: /tmp/graphlab_server_1463645233.log



In [2]:

    
amazon = gl.SFrame.read_csv('Amazon.csv', verbose=False)
google = gl.SFrame.read_csv('GoogleProducts.csv', verbose=False)
truth = gl.SFrame.read_csv('Amzon_GoogleProducts_perfectMapping.csv', verbose=False)



In [3]:

    
print 'Amazon length: ', amazon.num_rows()

amazon.head(2)









    



Amazon length:  1363






    Out[3]:





    
        id
        title
        description
        manufacturer
        price
    
    
        b000jz4hqo
        clickart 950 000 -
premier image pack (dvd- ...
        
        broderbund
        0.0
    
    
        b0006zf55o
        ca international -
arcserve lap/desktop oem ...
        oem arcserve backup v11.1
win 30u for laptops and ...
        computer associates
        0.0
    

[2 rows x 5 columns]



In [4]:

    
print 'Google length: ', google.num_rows()

google.head(2)









    



Google length:  3226






    Out[4]:





    
        id
        name
        description
        manufacturer
        price
    
    
        http://www.google.com/bas
e/feeds/snippets/1112 ...
        learning quickbooks 2007
        learning quickbooks 2007
        intuit
        38.99
    
    
        http://www.google.com/bas
e/feeds/snippets/1153 ...
        superstart! fun with
reading & writing! ...
        fun with reading &
writing! is designed to ...
        
        8.49
    

[2 rows x 5 columns]



In [5]:

    
print 'Truth length: ', truth.num_rows()

truth.head(2)









    



Truth length:  1300






    Out[5]:





    
        idAmazon
        idGoogleBase
    
    
        b000jz4hqo
        http://www.google.com/bas
e/feeds/snippets/1844 ...
    
    
        b00004tkvy
        http://www.google.com/bas
e/feeds/snippets/1844 ...
    

[2 rows x 2 columns]



In [6]:

    
def transform(truth, amazon, google):
    '''Transform the data into a more manageable format'''
    
    # For the sake of this webinar we will look only at the names of the products
    amazon = amazon[['id', 'title']]
    google = google[['id', 'name']]
    
    # Add a unique numeric label
    amazon = amazon.add_row_number(column_name='label')
    google = google.add_row_number(column_name='label')
    
    # Change labels in truth based on the new numerical labels
    truth = truth.join(amazon, on={'idAmazon' : 'id'})
    truth = truth.join(google, on={'idGoogleBase' : 'id'})
    
    # Rename some columns
    amazon = amazon.rename({'title' : 'name'})
    truth = truth.rename({
        'label' : 'amazon label',
        'title' : 'amazon name',
        'label.1' : 'google label',
        'name' : 'google name'
    })
    
    # Remove some others
    truth.remove_columns(['idGoogleBase', 'idAmazon'])
    amazon = amazon.remove_column('id')
    google = google.remove_column('id')
    
    return truth, amazon, google

truth, amazon, google = transform(truth, amazon, google)



In [7]:

    
amazon.head(3)









    Out[7]:





    
        label
        name
    
    
        0
        clickart 950 000 -
premier image pack (dvd- ...
    
    
        1
        ca international -
arcserve lap/desktop oem ...
    
    
        2
        noah's ark activity
center (jewel case ages ...
    

[3 rows x 2 columns]



In [8]:

    
google.head(3)









    Out[8]:





    
        label
        name
    
    
        0
        learning quickbooks 2007
    
    
        1
        superstart! fun with
reading & writing! ...
    
    
        2
        qb pos 6.0 basic software
    

[3 rows x 2 columns]



In [9]:

    
truth.head(3)









    Out[9]:





    
        amazon label
        amazon name
        google label
        google name
    
    
        539
        learning quickbooks 2007
        0
        learning quickbooks 2007
    
    
        1198
        quickbooks point-of-sale
basic 6.0 ...
        2
        qb pos 6.0 basic software
    
    
        736
        adobe creative suite cs3
production premium ...
        4
        production prem cs3 mac
upgrad ...
    

[3 rows x 4 columns]

Evaluation functions



In [10]:

    
def accuracy_at(results, truth):
    '''Compute the accuracy at k of a record linkage model, given a true mapping'''
    
    joined = truth.join(results, on={'google label' : 'query_label'})
    num_correct_labels = (joined['amazon label'] == joined['reference_label']).sum()
    
    return num_correct_labels / float(truth.num_rows())



In [11]:

    
def get_matches(results, amazon, google):
    '''Reutrn the results of a record linkage model in a readable format'''
    
    joined = results.join(amazon, on={'reference_label' : 'label'}).join(google, on={'query_label' : 'label'})
    joined = joined[['name', 'name.1', 'distance', 'rank']]
    joined = joined.rename({'name' : 'amazon name', 'name.1' : 'google name'})
    
    return joined

Record Linker model



In [12]:

    
base_linker = gl.record_linker.create(amazon, features=['name'])









    




Starting brute force nearest neighbors model training.



In [13]:

    
results = base_linker.link(google, k=3)
results









    




Starting pairwise querying.






    




+--------------+---------+-------------+--------------+






    




| Query points | # Pairs | % Complete. | Elapsed Time |






    




+--------------+---------+-------------+--------------+






    




| 2            | 3226    | 0.0733676   | 9.112ms      |






    




| Done         |         | 100         | 312.228ms    |






    




+--------------+---------+-------------+--------------+






    Out[13]:





    
        query_label
        reference_label
        distance
        rank
    
    
        0
        539
        0.0
        1
    
    
        0
        528
        0.666666666667
        2
    
    
        0
        532
        0.692307692308
        3
    
    
        1
        183
        0.644444444444
        1
    
    
        1
        297
        0.808510638298
        2
    
    
        1
        392
        0.867924528302
        3
    
    
        2
        829
        0.810810810811
        1
    
    
        2
        847
        0.820512820513
        2
    
    
        2
        432
        0.825
        3
    
    
        3
        613
        0.0
        1
    

[9678 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.



In [14]:

    
print 'Accuracy@3', accuracy_at(results, truth)
get_matches(results, amazon, google)









    



Accuracy@3 0.922307692308






    Out[14]:





    
        amazon name
        google name
        distance
        rank
    
    
        learning quickbooks 2007
        learning quickbooks 2007
        0.0
        1
    
    
        quickbooks simple start
2007 ...
        learning quickbooks 2007
        0.666666666667
        2
    
    
        quickbooks pro 2007 for
mac (mac) ...
        learning quickbooks 2007
        0.692307692308
        3
    
    
        jumpstart reading with
karaoke ...
        superstart! fun with
reading & writing! ...
        0.644444444444
        1
    
    
        superstart! science
arcade ...
        superstart! fun with
reading & writing! ...
        0.808510638298
        2
    
    
        clifford reading pack
(dvd box) ...
        superstart! fun with
reading & writing! ...
        0.867924528302
        3
    
    
        avid liquid vs 7 software
        qb pos 6.0 basic software
        0.810810810811
        1
    
    
        vpn client software 5
users ...
        qb pos 6.0 basic software
        0.820512820513
        2
    
    
        cell phone software
solution ...
        qb pos 6.0 basic software
        0.825
        3
    
    
        math missions: the
amazing arcade adventure ...
        math missions: the
amazing arcade adventure ...
        0.0
        1
    

[9678 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Feature Engineering



In [15]:

    
# Example of features that the record linker create
amazon['3 char'] = gl.text_analytics.count_ngrams(amazon['name'], n=3, method='character')
amazon.head(3)









    Out[15]:





    
        label
        name
        3 char
    
    
        0
        clickart 950 000 -
premier image pack (dvd- ...
        {'pre': 1, 'art': 1,
'rom': 1, '00p': 1, ...
    
    
        1
        ca international -
arcserve lap/desktop oem ...
        {'ser': 1, 'rve': 1,
'30p': 1, 'ona': 1, ...
    
    
        2
        noah's ark activity
center (jewel case ages ...
        {'kac': 1, 'sea': 1,
'ivi': 1, 'ahs': 1, ...
    

[3 rows x 3 columns]



In [16]:

    
# Remove the feture for the sake of cleanliness
amazon = amazon.remove_column('3 char')

In product matching, numbers can be highly helpful as they can represent model identifiers, versions, ect.



In [17]:

    
from collections import Counter

# Extract numbers from the name
amazon['numbers'] = amazon['name'].apply(lambda name: dict(Counter(re.findall('\d+\.*\d*', name))))
google['numbers'] = google['name'].apply(lambda name: dict(Counter(re.findall('\d+\.*\d*', name))))

amazon.head(5)









    Out[17]:





    
        label
        name
        numbers
    
    
        0
        clickart 950 000 -
premier image pack (dvd- ...
        {'950': 1, '000': 1}
    
    
        1
        ca international -
arcserve lap/desktop oem ...
        {'30': 1}
    
    
        2
        noah's ark activity
center (jewel case ages ...
        {'8': 1, '3': 1}
    
    
        3
        peachtree by sage premium
accounting for nonpro ...
        {'2007': 1}
    
    
        4
        singing coach unlimited
        {}
    

[5 rows x 3 columns]



In [18]:

    
# Create a record linker using the extracted numeric features
num_linker = gl.record_linker.create(amazon, features=['name', 'numbers'])

results = num_linker.link(google, k=3, verbose=False)

print 'Accuracy@3', accuracy_at(results, truth)
get_matches(results, amazon, google)









    




Starting brute force nearest neighbors model training.






    



Accuracy@3 0.927692307692






    Out[18]:





    
        amazon name
        google name
        distance
        rank
    
    
        learning quickbooks 2007
        learning quickbooks 2007
        0.0
        1
    
    
        quickbooks simple start
2007 ...
        learning quickbooks 2007
        1.2972972973
        2
    
    
        quickbooks pro 2007 for
mac (mac) ...
        learning quickbooks 2007
        1.35
        3
    
    
        jumpstart reading with
karaoke ...
        superstart! fun with
reading & writing! ...
        1.28888888889
        1
    
    
        superstart! science
arcade ...
        superstart! fun with
reading & writing! ...
        1.6170212766
        2
    
    
        clifford reading pack
(dvd box) ...
        superstart! fun with
reading & writing! ...
        1.7358490566
        3
    
    
        quickbooks point-of-sale
basic 6.0 ...
        qb pos 6.0 basic software
        1.63636363636
        1
    
    
        avid liquid vs 7 software
        qb pos 6.0 basic software
        1.64102564103
        2
    
    
        cell phone software
solution ...
        qb pos 6.0 basic software
        1.65853658537
        3
    
    
        math missions: the
amazing arcade adventure ...
        math missions: the
amazing arcade adventure ...
        0.0
        1
    

[9678 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

Exploring k



In [19]:

    
# Calculate accuracy at k for k between 1 and 10 for both models
k_range = range(1, 11)
base_accuracy = [accuracy_at(base_linker.link(google, k, verbose=False), truth) for k in k_range]
num_accuracy = [accuracy_at(num_linker.link(google, k, verbose=False), truth) for k in k_range]



In [20]:

    
# Plot the results
plt.style.use('ggplot')

plt.title('Accuracy@k')
plt.ylabel('Accuracy')
plt.xlabel('k')

plt.plot(k_range, base_accuracy, marker='o', color='b', label='Base Linker')
plt.plot(k_range, num_accuracy, marker='o', color='g', label='Number Linker')

plt.legend(loc=4)
None

We can see that the extracted numbers positively affect our model.
There are many more possible features to explore, and different distance functions to try. Read more in [our userguide](https://turi.com/learn/userguide/data_matching/introduction.html).

Exploring the radius

Some times not all k results make sense, or we just don't want to present a user with too many possibilities.
For this we have the radius parameter which serves as a distance threshold.



In [21]:

    
results = num_linker.link(google, k=10, verbose=False)

print 'Accuracy:', accuracy_at(results, truth)
print 'Possible number of results to go through:', len(results)









    



Accuracy: 0.972307692308
Possible number of results to go through: 32260



In [22]:

    
results['distance'].show()



In [23]:

    
results = num_linker.link(google, k=None, radius=1.61, verbose=False)

print 'Accuracy:', accuracy_at(results, truth)
print 'Possible number of results to go through:', len(results)









    



Accuracy: 0.921538461538
Possible number of results to go through: 12853

id	title	description	manufacturer	price
b000jz4hqo	clickart 950 000 - premier image pack (dvd- ...		broderbund	0.0
b0006zf55o	ca international - arcserve lap/desktop oem ...	oem arcserve backup v11.1 win 30u for laptops and ...	computer associates	0.0

id	name	description	manufacturer	price
http://www.google.com/bas e/feeds/snippets/1112 ...	learning quickbooks 2007	learning quickbooks 2007	intuit	38.99
http://www.google.com/bas e/feeds/snippets/1153 ...	superstart! fun with reading & writing! ...	fun with reading & writing! is designed to ...		8.49

idAmazon	idGoogleBase
b000jz4hqo	http://www.google.com/bas e/feeds/snippets/1844 ...
b00004tkvy	http://www.google.com/bas e/feeds/snippets/1844 ...

label	name
0	clickart 950 000 - premier image pack (dvd- ...
1	ca international - arcserve lap/desktop oem ...
2	noah's ark activity center (jewel case ages ...

label	name
0	learning quickbooks 2007
1	superstart! fun with reading & writing! ...
2	qb pos 6.0 basic software

amazon label	amazon name	google label	google name
539	learning quickbooks 2007	0	learning quickbooks 2007
1198	quickbooks point-of-sale basic 6.0 ...	2	qb pos 6.0 basic software
736	adobe creative suite cs3 production premium ...	4	production prem cs3 mac upgrad ...

query_label	reference_label	distance	rank
0	539	0.0	1
0	528	0.666666666667	2
0	532	0.692307692308	3
1	183	0.644444444444	1
1	297	0.808510638298	2
1	392	0.867924528302	3
2	829	0.810810810811	1
2	847	0.820512820513	2
2	432	0.825	3
3	613	0.0	1

amazon name	google name	distance	rank
learning quickbooks 2007	learning quickbooks 2007	0.0	1
quickbooks simple start 2007 ...	learning quickbooks 2007	0.666666666667	2
quickbooks pro 2007 for mac (mac) ...	learning quickbooks 2007	0.692307692308	3
jumpstart reading with karaoke ...	superstart! fun with reading & writing! ...	0.644444444444	1
superstart! science arcade ...	superstart! fun with reading & writing! ...	0.808510638298	2
clifford reading pack (dvd box) ...	superstart! fun with reading & writing! ...	0.867924528302	3
avid liquid vs 7 software	qb pos 6.0 basic software	0.810810810811	1
vpn client software 5 users ...	qb pos 6.0 basic software	0.820512820513	2
cell phone software solution ...	qb pos 6.0 basic software	0.825	3
math missions: the amazing arcade adventure ...	math missions: the amazing arcade adventure ...	0.0	1

label	name	3 char
0	clickart 950 000 - premier image pack (dvd- ...	{'pre': 1, 'art': 1, 'rom': 1, '00p': 1, ...
1	ca international - arcserve lap/desktop oem ...	{'ser': 1, 'rve': 1, '30p': 1, 'ona': 1, ...
2	noah's ark activity center (jewel case ages ...	{'kac': 1, 'sea': 1, 'ivi': 1, 'ahs': 1, ...

label	name	numbers
0	clickart 950 000 - premier image pack (dvd- ...	{'950': 1, '000': 1}
1	ca international - arcserve lap/desktop oem ...	{'30': 1}
2	noah's ark activity center (jewel case ages ...	{'8': 1, '3': 1}
3	peachtree by sage premium accounting for nonpro ...	{'2007': 1}
4	singing coach unlimited	{}