Data cleaning is a critical process for improving data quality and ultimately the accuracy of machine learning model output. In this notebook we show how the GraphLab Create Data Matching toolkit can be used to get your data shiny clean.
Note: this notebook requires GraphLab Create 1.6 or higher.
In [1]:
import os
import graphlab as gl
In the first section of this notebook we autotag posts from CrossValidated, the statistics section of the Stack Exchange network. Questions posted on this forum are typically annotated with tags by the authors but responses are not, making it more difficult to quickly scan responses for the most useful information. The raw data is available from the Stack Exchange data dump. For convenience we provide a preprocessed subsample (7.8MB) in the public Turi datasets bucket on Amazon S3, which is downloaded and saved locally with the first code snippet below.
For reference tags we use a lightly-curated list of statistics topics from Wikipedia. The preprocessed list is also available in the datasets S3 bucket.
A more extensive explanation of the code can be found in the Autotagger chapter of the User Guide.
The data is also saved locally to avoid repeated downloads.
In [2]:
if os.path.exists('statistics_topics.csv'):
    stats_topics = gl.SFrame.read_csv('statistics_topics.csv', header=False)
else:
    stats_topics = gl.SFrame.read_csv('https://static.turi.com/datasets/statistics_topics.csv',
                                      header=False)
    stats_topics.save('statistics_topics.csv', format='csv')
In [3]:
stats_topics = stats_topics.rename({'X1': 'tag'})
stats_topics.tail(10)
Out[3]:
In [4]:
model = gl.autotagger.create(stats_topics)
In [5]:
model.list_fields()
Out[5]:
In [6]:
model.tag?
In [7]:
if os.path.exists('stats_overflow_clean'):
    posts = gl.SFrame('stats_overflow_clean')
else:
    posts = gl.SFrame('https://static.turi.com/datasets/stats_overflow_clean')
    posts.save('stats_overflow_clean')
In [8]:
print "Number of posts:", posts.num_rows()
posts[['Body', 'Title', 'PostTypeId', 'Tags']].tail(5)
Out[8]:
In [9]:
posts['doc'] = posts['Title'] + ' ' + posts['Body']
There are two key parameters when querying the model: k, which sets the maximum number of tags to return for each query, and similarity_threshold, which sets the maximum allowable distance between a query document and a returned tag. A typical workflow is to get preliminary results by setting k to 5 and leaving similarity_threshold unspecified, then use the similarity_threshold parameter to tune the final results for the best balance of precision and recall.
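The interplay of these two parameters can be sketched in plain Python. This is illustrative only; the toolkit resolves candidates with nearest-neighbor search internally, and the tag names and distances below are made up:

```python
def select_tags(candidates, k=5, similarity_threshold=None):
    """Pick tags for one query from (tag, distance) candidates.

    Keeps the k closest candidates, then optionally drops any whose
    distance exceeds similarity_threshold.
    """
    ranked = sorted(candidates, key=lambda pair: pair[1])[:k]
    if similarity_threshold is not None:
        ranked = [(t, d) for t, d in ranked if d <= similarity_threshold]
    return ranked

candidates = [('regression', 0.05), ('anova', 0.3), ('bootstrap', 0.8),
              ('bayesian', 0.12), ('survival', 0.6), ('kernel', 0.95)]

print(select_tags(candidates, k=5))                            # preliminary pass
print(select_tags(candidates, k=5, similarity_threshold=0.1))  # tuned pass
```

With the threshold unset, the preliminary pass returns all five closest tags; tightening similarity_threshold to 0.1 keeps only the strongest match.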
In [10]:
tags = model.tag(posts, query_name='doc', k=5, similarity_threshold=0.1)
In [11]:
tags.print_rows(10, max_row_width=110, max_column_width=40)
To illustrate usage of the record linker tool, we use synthetic address data generated by and packaged with the FEBRL program, another data matching tool. For the sake of illustration, suppose the dataset called "post_address" is a relatively error-free set of reference addresses (say, from the Australian postal service). The dataset called "agent_listings" has the same schema but contains many errors; imagine this is data created by real estate agencies.
As with the autotagger data, the datasets downloaded in this section are saved locally for repeated usage. From prior experience, we know only a handful of features are useful for this illustration, and they are enumerated in the address_features list.
In [12]:
col_types = {'street_number': str, 'postcode': str}
address_features = ['street_number', 'address_1', 'suburb', 'state', 'postcode']
if os.path.exists('febrl_F_org_5000.csv'):
    post_address = gl.SFrame.read_csv('febrl_F_org_5000.csv', column_type_hints=col_types)
else:
    url = 'https://static.turi.com/datasets/febrl_synthetic/febrl_F_org_5000.csv'
    post_address = gl.SFrame.read_csv(url, column_type_hints=col_types)
    post_address.save('febrl_F_org_5000.csv')
In [13]:
post_address = post_address[address_features]
post_address.print_rows(5)
In [14]:
model = gl.record_linker.create(post_address, distance='jaccard')
model.summary()
In [15]:
model.list_fields()
Out[15]:
In [16]:
if os.path.exists('febrl_F_dup_5000.csv'):
    agent_listings = gl.SFrame.read_csv('febrl_F_dup_5000.csv',
                                        column_type_hints=col_types)
else:
    url = 'https://static.turi.com/datasets/febrl_synthetic/febrl_F_dup_5000.csv'
    agent_listings = gl.SFrame.read_csv(url, column_type_hints=col_types)
    agent_listings.save('febrl_F_dup_5000.csv')
In [17]:
agent_listings = agent_listings[address_features]
agent_listings.print_rows(5)
Results are obtained with the model's link method, which matches a new set of queries against the reference data passed to the create function above. For our first pass, we set the radius parameter to 0.5: a pair of records is returned only if its Jaccard distance is at most 0.5, meaning the post_address and agent_listings records share at least (roughly) half of their combined features.
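To make the Jaccard distances below concrete, here is a plain-Python sketch of Jaccard distance over character shingles. Note the 3-character shingle size and the example addresses are assumptions for illustration, not necessarily what record_linker uses internally:

```python
def shingles(s, n=3):
    """Set of character n-grams of a string (3-grams assumed for illustration)."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over the two shingle sets."""
    sa, sb = shingles(a), shingles(b)
    return 1.0 - len(sa & sb) / float(len(sa | sb))

print(jaccard_distance('lakeside drive', 'lakeside drv'))  # small: likely a typo
print(jaccard_distance('lakeside drive', 'ocean parade'))  # 1.0: nothing shared
```

A small distance tolerates typos and abbreviations, while completely different strings land near 1.0, which is exactly the behavior we want for fuzzy address matching.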
In [18]:
model.link?
In [19]:
matches = model.link(agent_listings, k=None, radius=0.5)
matches.head(5)
Out[19]:
The results mean that the address in query row 1 matches the address in reference row 2438, although the Jaccard distance is relatively high at 0.42. Inspecting these records manually, we see this is in fact not a good match.
In [20]:
print agent_listings[1]
print post_address[2438]
On the other hand, the match between query number 3 and reference number 2947 has a distance of 0.045, indicating these two records are far more similar. By pulling these records we confirm this to be the case.
In [21]:
print agent_listings[3]
print post_address[2947]
Unfortunately, these records are still not a true match because the street numbers are different (in a way that is not likely to be a typo). Ideally we would like street number differences to be weighted heavily in our distance function, while still allowing for typos and misspellings in the street and city names. To do this we can build a composite distance function.
In this case we'll use Levenshtein distance to measure dissimilarity in the street number, in addition to our existing Jaccard distance measured over all of the address features. Both components are given equal weight. In the summary of the created model, we see the number of distance components is now two (Levenshtein and Jaccard) instead of one, as in our first model.
In [22]:
address_dist = [
    [['street_number'], 'levenshtein', 1],
    [address_features, 'jaccard', 1]
]
model2 = gl.record_linker.create(post_address, distance=address_dist)
model2.summary()
In [23]:
model2['distance']
Out[23]:
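A composite distance is just a weighted sum of its components. The sketch below mirrors the address_dist specification in plain Python; the simple Levenshtein implementation, the 3-character-shingle Jaccard, the concatenation of fields for the Jaccard component, and the sample addresses are all assumptions for illustration, not the toolkit's internals:

```python
ADDRESS_FEATURES = ['street_number', 'address_1', 'suburb', 'state', 'postcode']

def levenshtein(a, b):
    """Classic edit distance via dynamic programming (one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def jaccard_distance(a, b, n=3):
    """Jaccard distance over character n-gram sets (3-grams assumed)."""
    grams = lambda s: {s[i:i + n] for i in range(len(s) - n + 1)}
    sa, sb = grams(a.lower()), grams(b.lower())
    return 1.0 - len(sa & sb) / float(len(sa | sb))

def composite_distance(query, ref):
    """1 * levenshtein(street_number) + 1 * jaccard(all address features)."""
    lev = levenshtein(query['street_number'], ref['street_number'])
    jac = jaccard_distance(' '.join(query[f] for f in ADDRESS_FEATURES),
                           ' '.join(ref[f] for f in ADDRESS_FEATURES))
    return 1 * lev + 1 * jac

query = {'street_number': '12', 'address_1': 'lakeside drive',
         'suburb': 'broome', 'state': 'wa', 'postcode': '6725'}
ref = dict(query, street_number='14')  # same address, different street number

print(composite_distance(query, ref))
```

Even though the records are nearly identical, the differing street number alone contributes a full unit of edit distance, pushing the pair well past any tight radius. This is the heavy street-number weighting we wanted.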
One tricky aspect of using a composite distance is figuring out the best threshold for match quality. A simple way to do this is to first return a relatively high number of matches for each query, then inspect the distribution of distances to choose a good value for the radius parameter. For this notebook, I've captured a screenshot of the canvas output and display it below.
In [24]:
pre_match = model2.link(agent_listings, k=10, verbose=False)
pre_match['distance'].show()
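Canvas shows the distribution graphically, but the same threshold hunting can be done numerically. One simple heuristic, sketched below, is to take the midpoint of the widest gap between consecutive sorted distances; the distance values here are made up for illustration:

```python
def largest_gap(distances):
    """Midpoint of the widest gap between consecutive sorted distances,
    a candidate value for the radius threshold."""
    d = sorted(distances)
    gaps = [(d[i + 1] - d[i], i) for i in range(len(d) - 1)]
    width, i = max(gaps)
    return (d[i] + d[i + 1]) / 2.0

# Hypothetical distances: a cluster of good matches, then a jump to poor ones.
dists = [0.02, 0.05, 0.07, 0.11, 0.60, 0.63, 0.64, 0.66, 0.70]
print(largest_gap(dists))
```

In practice you would pull the distances with something like `pre_match['distance']` and eyeball the result against the histogram before committing to a radius.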
In [25]:
from IPython.display import Image
Image(url='https://static.turi.com/datasets/house_link_distances.png')
Out[25]:
In this distribution we see a stark jump at 0.636 in the distances to the 10 nearest neighbors of each query (remember this is no longer a simple Jaccard distance, but the sum of Jaccard and Levenshtein distances over different sets of features). In our final pass, we set the k parameter to None, but enforce this distance threshold with the radius parameter.
In [26]:
matches = model2.link(agent_listings, k=None, radius=0.64, verbose=False)
matches.head(5)
Out[26]:
There are far fewer results now, but they are much more likely to be true matches than with our first model, even while allowing for typos in many of the address fields.
In [27]:
print agent_listings[6]
print post_address[1266]