This notebook was presented in the Product Matching webinar, one of many interesting webinars given by Turi. Check out upcoming webinars here.
We will use GraphLab Create to perform product matching between textual descriptions of products from different sources, a task also known as Record Linkage. The data is available here.
The notebook is organized into the following sections:
In [1]:
import graphlab as gl
import re
import matplotlib.pyplot as plt
gl.canvas.set_target('ipynb')
%matplotlib inline
In [2]:
amazon = gl.SFrame.read_csv('Amazon.csv', verbose=False)
google = gl.SFrame.read_csv('GoogleProducts.csv', verbose=False)
truth = gl.SFrame.read_csv('Amzon_GoogleProducts_perfectMapping.csv', verbose=False)
In [3]:
print 'Amazon length: ', amazon.num_rows()
amazon.head(2)
Out[3]:
In [4]:
print 'Google length: ', google.num_rows()
google.head(2)
Out[4]:
In [5]:
print 'Truth length: ', truth.num_rows()
truth.head(2)
Out[5]:
In [6]:
def transform(truth, amazon, google):
    '''Transform the data into a more manageable format.'''
    # For the sake of this webinar we will look only at the names of the products
    amazon = amazon[['id', 'title']]
    google = google[['id', 'name']]
    # Add a unique numeric label
    amazon = amazon.add_row_number(column_name='label')
    google = google.add_row_number(column_name='label')
    # Change labels in truth based on the new numeric labels
    truth = truth.join(amazon, on={'idAmazon': 'id'})
    truth = truth.join(google, on={'idGoogleBase': 'id'})
    # Rename some columns
    amazon = amazon.rename({'title': 'name'})
    truth = truth.rename({
        'label': 'amazon label',
        'title': 'amazon name',
        'label.1': 'google label',
        'name': 'google name'
    })
    # Remove some others
    truth.remove_columns(['idGoogleBase', 'idAmazon'])
    amazon = amazon.remove_column('id')
    google = google.remove_column('id')
    return truth, amazon, google

truth, amazon, google = transform(truth, amazon, google)
In [7]:
amazon.head(3)
Out[7]:
In [8]:
google.head(3)
Out[8]:
In [9]:
truth.head(3)
Out[9]:
In [10]:
def accuracy_at(results, truth):
    '''Compute the accuracy@k of a record linkage model, given the true mapping.'''
    joined = truth.join(results, on={'google label': 'query_label'})
    num_correct_labels = (joined['amazon label'] == joined['reference_label']).sum()
    return num_correct_labels / float(truth.num_rows())
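The helper above counts a Google record as correct if its true Amazon match appears anywhere among the k candidates returned for it. Here is a minimal pure-Python sketch of the same idea on toy data; `toy_accuracy_at` and the dict-based inputs are illustrative stand-ins, not the SFrame-based versions used in this notebook.

```python
def toy_accuracy_at(results, truth):
    """results: {query_label: list of candidate reference labels},
    truth: {query_label: true reference label}"""
    # A query is a hit if its true reference is anywhere in its candidate list
    hits = sum(1 for q, ref in truth.items() if ref in results.get(q, []))
    return hits / float(len(truth))

results = {0: [5, 2, 9], 1: [3, 7, 1], 2: [4, 8, 6]}
truth = {0: 2, 1: 1, 2: 0}  # query 2's true match was not retrieved
print(toy_accuracy_at(results, truth))  # 2 of 3 queries have a hit
```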
In [11]:
def get_matches(results, amazon, google):
    '''Return the results of a record linkage model in a readable format.'''
    joined = results.join(amazon, on={'reference_label': 'label'}).join(google, on={'query_label': 'label'})
    joined = joined[['name', 'name.1', 'distance', 'rank']]
    joined = joined.rename({'name': 'amazon name', 'name.1': 'google name'})
    return joined
In [12]:
base_linker = gl.record_linker.create(amazon, features=['name'])
In [13]:
results = base_linker.link(google, k=3)
results
Out[13]:
In [14]:
print 'Accuracy@3', accuracy_at(results, truth)
get_matches(results, amazon, google)
Out[14]:
In [15]:
# Example of the features that the record linker creates
amazon['3 char'] = gl.text_analytics.count_ngrams(amazon['name'], n=3, method='character')
amazon.head(3)
Out[15]:
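`count_ngrams` turns each name into a bag of character 3-grams. A rough pure-Python equivalent is sketched below; note this is only an approximation, since GraphLab's tokenization (e.g. its handling of spaces and punctuation) may differ in detail.

```python
from collections import Counter

def char_ngrams(text, n=3):
    # Slide a window of n characters over the lowercased string
    # and count how often each n-gram occurs.
    text = text.lower()
    return dict(Counter(text[i:i + n] for i in range(len(text) - n + 1)))

print(char_ngrams('ipod', n=3))  # {'ipo': 1, 'pod': 1}
```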
In [16]:
# Remove the feature for the sake of cleanliness
amazon = amazon.remove_column('3 char')
In product matching, numbers can be highly informative, as they often represent model identifiers, versions, etc.
In [17]:
from collections import Counter
# Extract numbers from the name
amazon['numbers'] = amazon['name'].apply(lambda name: dict(Counter(re.findall(r'\d+\.*\d*', name))))
google['numbers'] = google['name'].apply(lambda name: dict(Counter(re.findall(r'\d+\.*\d*', name))))
amazon.head(5)
Out[17]:
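To see what the pattern `\d+\.*\d*` actually captures, here is a quick illustration on a made-up product name (the name itself is hypothetical, chosen just to exercise the regex):

```python
import re
from collections import Counter

# \d+ matches a run of digits, \.* an optional dot, \d* optional trailing digits,
# so both integers ('2000') and decimals ('9.0') are captured as single tokens.
name = 'Adobe Photoshop CS2 9.0 upgrade for Windows 2000'
numbers = re.findall(r'\d+\.*\d*', name)
print(numbers)                 # ['2', '9.0', '2000']
print(dict(Counter(numbers)))  # bag-of-numbers feature, as used above
```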
In [18]:
# Create a record linker using the extracted numeric features
num_linker = gl.record_linker.create(amazon, features=['name', 'numbers'])
results = num_linker.link(google, k=3, verbose=False)
print 'Accuracy@3', accuracy_at(results, truth)
get_matches(results, amazon, google)
Out[18]:
In [19]:
# Calculate accuracy at k for k between 1 and 10 for both models
k_range = range(1, 11)
base_accuracy = [accuracy_at(base_linker.link(google, k, verbose=False), truth) for k in k_range]
num_accuracy = [accuracy_at(num_linker.link(google, k, verbose=False), truth) for k in k_range]
In [20]:
# Plot the results
plt.style.use('ggplot')
plt.title('Accuracy@k')
plt.ylabel('Accuracy')
plt.xlabel('k')
plt.plot(k_range, base_accuracy, marker='o', color='b', label='Base Linker')
plt.plot(k_range, num_accuracy, marker='o', color='g', label='Number Linker')
plt.legend(loc=4)
None
We can see that the extracted numeric features improve the model's accuracy across the whole range of k.
There are many more possible features to explore, and different distance functions to try. Read more in [our userguide](https://turi.com/learn/userguide/data_matching/introduction.html).
Sometimes not all k results make sense, or we simply don't want to present a user with too many possibilities.
For such cases there is the radius parameter, which serves as a distance threshold.
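Conceptually, a radius cutoff keeps only candidates whose distance to the query is at or below the threshold, regardless of their rank. A tiny sketch with made-up candidates and distances:

```python
# Hypothetical candidate list for one query: (candidate name, distance)
candidates = [('mp3 player', 0.9), ('usb cable', 1.4), ('desk lamp', 2.7)]
radius = 1.61

# Keep only candidates within the radius; the lamp is pruned
kept = [(name, d) for name, d in candidates if d <= radius]
print(kept)  # [('mp3 player', 0.9), ('usb cable', 1.4)]
```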
In [21]:
results = num_linker.link(google, k=10, verbose=False)
print 'Accuracy:', accuracy_at(results, truth)
print 'Possible number of results to go through:', len(results)
In [22]:
results['distance'].show()
In [23]:
results = num_linker.link(google, k=None, radius=1.61, verbose=False)
print 'Accuracy:', accuracy_at(results, truth)
print 'Possible number of results to go through:', len(results)