Data's messy - clean it up!

Data cleaning is a critical process for improving data quality and ultimately the accuracy of machine learning model output. In this notebook we show how the GraphLab Create Data Matching toolkit can be used to get your data shiny clean.

  1. Auto-tagging Stack Overflow questions and answers
  2. Record linkage of house listings
  3. Composite distances and choosing neighborhood parameters

Note: this notebook requires GraphLab Create 1.6 or higher.

Auto-tagging Stack Overflow questions *and* answers


In [1]:
import os
import graphlab as gl

In the first section of this notebook we autotag posts from CrossValidated, the statistics section of the Stack Exchange network. Questions posted on this forum are typically annotated with tags by the authors but responses are not, making it more difficult to quickly scan responses for the most useful information. The raw data is available from the Stack Exchange data dump. For convenience we provide a preprocessed subsample (7.8MB) in the public Turi datasets bucket on Amazon S3, which is downloaded and saved locally with the first code snippet below.

For reference tags we use a lightly-curated list of statistics topics from Wikipedia. The preprocessed list is also available in the datasets S3 bucket.

A more extensive explanation of the code can be found in the Autotagger chapter of the User Guide.

Read in the metadata

The data is also saved locally to avoid repeated downloads.


In [2]:
if os.path.exists('statistics_topics.csv'):
    stats_topics = gl.SFrame.read_csv('statistics_topics.csv', header=False)
else:
    stats_topics = gl.SFrame.read_csv('https://static.turi.com/datasets/statistics_topics.csv',
                                      header=False)
    stats_topics.save('statistics_topics.csv', format='csv')


[INFO] This commercial license of GraphLab Create is assigned to engr@turi.com.

[INFO] Start server at: ipc:///tmp/graphlab_server-8602 - Server binary: /home/brian/miniconda/envs/glc-1.6.1/lib/python2.7/site-packages/graphlab/unity_server - Server log: /tmp/graphlab_server_1443810428.log
[INFO] GraphLab Server Version: 1.6.1
PROGRESS: Finished parsing file /home/brian/gl-repos/turi.com/src/learn/gallery/notebooks/statistics_topics.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.017293 secs.
------------------------------------------------------
PROGRESS: Finished parsing file /home/brian/gl-repos/turi.com/src/learn/gallery/notebooks/statistics_topics.csv
PROGRESS: Parsing completed. Parsed 2736 lines in 0.008665 secs.
Inferred types from first line of file as 
column_type_hints=[str]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------

In [3]:
stats_topics = stats_topics.rename({'X1': 'tag'})
stats_topics.tail(10)


Out[3]:
tag
z statistic
Z-test
Z-transform
Zakai equation
Zelen's design
Zero–one law
Zeta distribution
Ziggurat algorithm
Zipf–Mandelbrot law
Zipf's law
[10 rows x 1 columns]

Create the autotagger model


In [4]:
model = gl.autotagger.create(stats_topics)


PROGRESS: Starting brute force nearest neighbors model training.
Class                               : NearestNeighborAutoTagger

Settings
--------
Number of examples                  : 2731
Number of feature columns           : 3
Total training time (seconds)       : 1.0918


In [5]:
model.list_fields()


Out[5]:
['distance',
 'verbose',
 'num_features',
 'tag_name',
 'training_time',
 'num_examples',
 'features']

In [6]:
model.tag?

Read in the document data


In [7]:
if os.path.exists('stats_overflow_clean'):
    posts = gl.SFrame('stats_overflow_clean')
else:
    posts = gl.SFrame('https://static.turi.com/datasets/stats_overflow_clean')
    posts.save('stats_overflow_clean')

In [8]:
print "Number of posts:", posts.num_rows()
posts[['Body', 'Title', 'PostTypeId', 'Tags']].tail(5)


Number of posts: 11077
Out[8]:
+-----------------------------------------------------+-----------------------------------------------------+------------+--------------------------------------------------+
| Body                                                | Title                                               | PostTypeId | Tags                                             |
+-----------------------------------------------------+-----------------------------------------------------+------------+--------------------------------------------------+
| This grew too long for a comment, but I think ...   | None                                                | 2          | None                                             |
| Assume a classification problem where there are ... | Detecting a consistent pattern in a dataset via ... | 1          | <classification><cross-validation><decision- ... |
| My goal is to create a formula that can give an ... | How to project video viewcount based on ...         | 1          | <summary-statistics><median><eviews> ...         |
| As a practical answer to the real questions yo ...  | None                                                | 2          | None                                             |
| Decision trees are notoriously <strong>u ...        | None                                                | 2          | None                                             |
+-----------------------------------------------------+-----------------------------------------------------+------------+--------------------------------------------------+
[5 rows x 4 columns]


In [9]:
posts['doc'] = posts['Title'] + ' ' + posts['Body']

Query the model

There are two key parameters when querying the model: k, which sets the maximum number of tags to return for each query, and similarity_threshold, which sets the minimum similarity score a tag must have in order to be returned. A typical workflow is to get preliminary results by setting k to 5 and leaving similarity_threshold unspecified, then use similarity_threshold to tune the final results for the best balance of precision and recall.


In [10]:
tags = model.tag(posts, query_name='doc', k=5, similarity_threshold=0.1)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 692     | 0.00228751  | 59.188ms     |
PROGRESS: | 1774         | 4845384 | 16.0171     | 1.05s        |
PROGRESS: | 3535         | 9656131 | 31.9197     | 2.05s        |
PROGRESS: | 5297         | 1.4e+07 | 47.8269     | 3.05s        |
PROGRESS: | 6918         | 1.9e+07 | 62.4599     | 4.05s        |
PROGRESS: | 8564         | 2.3e+07 | 77.3207     | 5.06s        |
PROGRESS: | 10373        | 2.8e+07 | 93.6523     | 6.06s        |
PROGRESS: | Done         |         | 100         | 6.47s        |
PROGRESS: +--------------+---------+-------------+--------------+

In [11]:
tags.print_rows(10, max_row_width=110, max_column_width=40)


+--------+-----------------------------------------+-----------------------------------------+----------------+
| doc_id |                   doc                   |                   tag                   |     score      |
+--------+-----------------------------------------+-----------------------------------------+----------------+
|   13   | neural network output layer for bina... |          Binary classification          | 0.15503875969  |
|   13   | neural network output layer for bina... |        Artificial neural network        | 0.107913669065 |
|   13   | neural network output layer for bina... |         One-class classification        | 0.101449275362 |
|   13   | neural network output layer for bina... |              Neural network             | 0.100775193798 |
|   13   | neural network output layer for bina... |        Multiclass classification        | 0.10071942446  |
|   37   | Negative predictions for binomial pr... |        Negative predictive value        | 0.104712041885 |
|   55   | Estimating entropy of multidimension... |           Dimension reduction           | 0.101123595506 |
|   80   | Does the sequence satisfy WLLN? Coul... |           Law of large numbers          | 0.197916666667 |
|   80   | Does the sequence satisfy WLLN? Coul... | Independent and identically distribu... | 0.186046511628 |
|   80   | Does the sequence satisfy WLLN? Coul... |     Convergence of random variables     | 0.177570093458 |
+--------+-----------------------------------------+-----------------------------------------+----------------+
[1356 rows x 4 columns]
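
Because the output includes a similarity score for every candidate tag, the results can also be post-filtered without re-querying the model. Below is a minimal sketch (not part of the original workflow) that keeps only the best-scoring tag per post and attaches it back to the posts; it assumes the doc_id values in the tag output correspond to the row indices of posts.

# Sketch: keep only the highest-scoring tag per post and join it back.
# Assumes 'doc_id' in the tag output is the row index of `posts`.
best_tags = tags.groupby('doc_id',
                         {'best_tag': gl.aggregate.ARGMAX('score', 'tag'),
                          'best_score': gl.aggregate.MAX('score')})
tagged_posts = posts.add_row_number('doc_id').join(best_tags, on='doc_id', how='left')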


Record linkage of house listings

To illustrate usage of the record linker tool, we use synthetic address data generated by and packaged with the FEBRL program, another data matching tool. For the sake of illustration, suppose the dataset called "post_address" is a relatively error-free set of reference addresses (say, from the Australian postal service). The dataset called "agent_listings" has the same schema but contains many errors; imagine this is data created by real estate agencies.

Read in the reference data

As with the autotagger data, the datasets downloaded in this section are saved locally for repeated usage. From prior experience, we know only a handful of features are useful for this illustration, and they are enumerated in the address_features list.


In [12]:
col_types = {'street_number': str, 'postcode': str}
address_features = ['street_number', 'address_1', 'suburb', 'state', 'postcode']

if os.path.exists('febrl_F_org_5000.csv'):
    post_address = gl.SFrame.read_csv('febrl_F_org_5000.csv', column_type_hints=col_types)
else:
    url = 'https://static.turi.com/datasets/febrl_synthetic/febrl_F_org_5000.csv'
    post_address = gl.SFrame.read_csv(url, column_type_hints=col_types)
    post_address.save('febrl_F_org_5000.csv')


PROGRESS: Finished parsing file /home/brian/gl-repos/turi.com/src/learn/gallery/notebooks/febrl_F_org_5000.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.027237 secs.
PROGRESS: Finished parsing file /home/brian/gl-repos/turi.com/src/learn/gallery/notebooks/febrl_F_org_5000.csv
PROGRESS: Parsing completed. Parsed 3000 lines in 0.01815 secs.

In [13]:
post_address = post_address[address_features]
post_address.print_rows(5)


+---------------+---------------------+---------------+-------+----------+
| street_number |      address_1      |     suburb    | state | postcode |
+---------------+---------------------+---------------+-------+----------+
|       95      |     galway place    |    st marys   |       |   2681   |
|       12      |    burnie street    |   wycheproof  |  nsw  |   2234   |
|       16      | macrobertson street |    branxton   |  qld  |   3073   |
|      170      |    bonrook street   | brighton east |  nsw  |   3087   |
|       32      |   proserpine court  |   helensvale  |  qld  |   2067   |
+---------------+---------------------+---------------+-------+----------+
[3000 rows x 5 columns]

Create the record linker model


In [14]:
model = gl.record_linker.create(post_address, distance='jaccard')
model.summary()


PROGRESS: Starting brute force nearest neighbors model training.
Class                               : RecordLinker

Schema
------
Number of examples                  : 3000
Number of feature columns           : 5
Number of distance components       : 1
Method                              : brute_force

Training
--------
Total training time (seconds)       : 0.6078


In [15]:
model.list_fields()


Out[15]:
['distance',
 'verbose',
 'num_features',
 'num_distance_components',
 'training_time',
 'num_examples',
 'method',
 'features']

Read in the query data


In [16]:
if os.path.exists('febrl_F_dup_5000.csv'):
    agent_listings = gl.SFrame.read_csv('febrl_F_dup_5000.csv',
                                        column_type_hints=col_types)
else:
    url = 'https://static.turi.com/datasets/febrl_synthetic/febrl_F_dup_5000.csv'
    agent_listings = gl.SFrame.read_csv(url, column_type_hints=col_types)
    agent_listings.save('febrl_F_dup_5000.csv')


PROGRESS: Finished parsing file /home/brian/gl-repos/turi.com/src/learn/gallery/notebooks/febrl_F_dup_5000.csv
PROGRESS: Parsing completed. Parsed 100 lines in 0.021774 secs.
PROGRESS: Finished parsing file /home/brian/gl-repos/turi.com/src/learn/gallery/notebooks/febrl_F_dup_5000.csv
PROGRESS: Parsing completed. Parsed 2000 lines in 0.016147 secs.

In [17]:
agent_listings = agent_listings[address_features]
agent_listings.print_rows(5)


+---------------+-----------------+-------------------+-------+----------+
| street_number |    address_1    |       suburb      | state | postcode |
+---------------+-----------------+-------------------+-------+----------+
|       31      |                 |     reseevoir     |  qld  |   5265   |
|      329      |                 | smithfield plains |  vic  |   5044   |
|       37      | kelleway avenue |     burwooast     |  nsw  |   2770   |
|       15      |  mawalan street |     kallangur     |  nss  |   2506   |
|      380      |   davisktrdet   |    boyne ilsand   |  nss  |   6059   |
+---------------+-----------------+-------------------+-------+----------+
[2000 rows x 5 columns]

Query the model

Results are obtained with the model's link method, which matches a new set of queries against the reference data passed to the create function above. For our first pass, we set the radius parameter to 0.5, meaning a pair is returned only if its Jaccard distance is at most 0.5, i.e. the post_address and agent_listings records share at least roughly half of their combined information.
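
To build intuition for what this means, here is a generic illustration of Jaccard distance computed on character 3-grams. This is a sketch for intuition only; the record linker's exact featurization of string columns may differ, and the misspelled address below is made up for the example.

# Jaccard distance = 1 - (size of intersection / size of union), illustrated
# on character 3-grams. For intuition only; not necessarily the exact
# featurization used by the record linker.
def jaccard_distance(a, b, n=3):
    grams_a = set(a[i:i + n] for i in range(len(a) - n + 1))
    grams_b = set(b[i:i + n] for i in range(len(b) - n + 1))
    return 1.0 - float(len(grams_a & grams_b)) / len(grams_a | grams_b)

# compare an address fragment with a misspelled copy
print jaccard_distance('mawalan street kallangur', 'mawalan street kalangur')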


In [18]:
model.link?

In [19]:
matches = model.link(agent_listings, k=None, radius=0.5)
matches.head(5)


PROGRESS: Starting pairwise querying.
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | Query points | # Pairs | % Complete. | Elapsed Time |
PROGRESS: +--------------+---------+-------------+--------------+
PROGRESS: | 0            | 2000    | 0.0333333   | 4.099ms      |
PROGRESS: | Done         |         | 100         | 461.073ms    |
PROGRESS: +--------------+---------+-------------+--------------+
Out[19]:
+-------------+-----------------+-----------------+------+
| query_label | reference_label |     distance    | rank |
+-------------+-----------------+-----------------+------+
|      1      |       2438      |  0.41935483871  |  1   |
|      1      |       533       |       0.5       |  2   |
|      2      |       688       |  0.192307692308 |  1   |
|      3      |       2947      | 0.0454545454545 |  1   |
|      5      |       1705      |  0.047619047619 |  1   |
+-------------+-----------------+-----------------+------+
[5 rows x 4 columns]

Evaluate

The results mean that the address in row 1 of agent_listings matches the address in row 2438 of post_address, although the Jaccard distance is relatively high at 0.42. Inspecting these records manually, we see this is in fact not a good match.


In [20]:
print agent_listings[1]
print post_address[2438]


{'suburb': 'smithfield plains', 'state': 'vic', 'address_1': '', 'street_number': '329', 'postcode': '5044'}
{'suburb': 'smithfield plains', 'state': 'vic', 'address_1': 'sculptor street', 'street_number': '16', 'postcode': '5044'}

On the other hand, the match between query number 3 and reference number 2947 has a distance of 0.045, indicating these two records are far more similar. By pulling these records we confirm this to be the case.


In [21]:
print agent_listings[3]
print post_address[2947]


{'suburb': 'kallangur', 'state': 'nss', 'address_1': 'mawalan street', 'street_number': '15', 'postcode': '2506'}
{'suburb': 'kallangur', 'state': 'nsw', 'address_1': 'mawalan street', 'street_number': '12', 'postcode': '2506'}
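
These manual comparisons can be wrapped in a small helper so that any candidate pair can be spot-checked with a single call. A minimal sketch (a hypothetical convenience function, not part of the toolkit), using the matches, agent_listings, and post_address objects from above:

# Hypothetical helper: print a matched query/reference pair side by side.
def show_pair(match_row):
    print agent_listings[match_row['query_label']]
    print post_address[match_row['reference_label']]
    print 'distance:', match_row['distance']

show_pair(matches[3])   # the query 3 / reference 2947 pair inspected above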

Unfortunately, these records are still not a true match because the street numbers are different (in a way that is not likely to be a typo). Ideally we would like street number differences to be weighted heavily in our distance function, while still allowing for typos and misspellings in the street and city names. To do this we can build a composite distance function.


Composite distances and choosing neighborhood parameters

Create a composite distance and a new model

In this case we'll use Levenshtein distance to measure the dissimilarity in street number, in addition to our existing Jaccard distance measured over all of the address features. Both of these components will be given equal weight. In the summary of the created model, we see the number of distance components is now two (Levenshtein and Jaccard distances) instead of one in our first model.


In [22]:
address_dist = [
                [['street_number'], 'levenshtein', 1],
                [address_features, 'jaccard', 1]
               ]

model2 = gl.record_linker.create(post_address, distance=address_dist)
model2.summary()


PROGRESS: Starting brute force nearest neighbors model training.
Defaulting to brute force instead of ball tree because there are multiple distance components.
Class                               : RecordLinker

Schema
------
Number of examples                  : 3000
Number of feature columns           : 5
Number of distance components       : 2
Method                              : brute_force

Training
--------
Total training time (seconds)       : 0.4195


In [23]:
model2['distance']


Out[23]:
[[['street_number'], 'levenshtein', 1],
 [['street_number', 'address_1', 'suburb', 'state', 'postcode'], 'jaccard', 1]]

Query the model for a large number of neighbors

One tricky aspect of using a composite distance is figuring out the best threshold for match quality. A simple way to do this is to first return a relatively large number of matches for each query, then inspect the distribution of distances to choose a good value for the radius parameter. For this notebook, I've captured a screenshot of the Canvas output and displayed it below.


In [24]:
pre_match = model2.link(agent_listings, k=10, verbose=False)
pre_match['distance'].show()


Canvas is accessible via web browser at the URL: http://localhost:60810/index.html
Opening Canvas in default web browser.
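
If Canvas isn't convenient (for example, when running non-interactively), the same distribution can be summarized numerically. A minimal sketch, using the pre_match SFrame computed above:

# Numeric alternative to the Canvas histogram: approximate quantiles of the
# composite distances via an SArray sketch summary.
dist_sketch = pre_match['distance'].sketch_summary()
print dist_sketch.quantile(0.5), dist_sketch.quantile(0.75), dist_sketch.quantile(0.9)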

In [25]:
from IPython.display import Image
Image(url='https://static.turi.com/datasets/house_link_distances.png')


Out[25]:
(Image: histogram of the composite link distances for the 10 nearest reference records of each query.)

Calibrate the parameters for result quality

In this distribution we see a stark jump at 0.636 in the distribution of distances for the 10-nearest neighbors of every query (remember this is no longer simple Jaccard distance, but a sum of Jaccard and Levenshtein distances over different sets of features). In our final pass, we set the k parameter to None, but enforce this distance threshold with the radius parameter.


In [26]:
matches = model2.link(agent_listings, k=None, radius=0.64, verbose=False)
matches.head(5)


Out[26]:
+-------------+-----------------+----------------+------+
| query_label | reference_label |    distance    | rank |
+-------------+-----------------+----------------+------+
|      6      |       1266      | 0.333333333333 |  1   |
|      7      |       2377      | 0.208333333333 |  1   |
|      9      |       2804      | 0.575757575758 |  1   |
|      12     |       2208      | 0.181818181818 |  1   |
|      13     |       1346      | 0.111111111111 |  1   |
+-------------+-----------------+----------------+------+
[5 rows x 4 columns]

There are far fewer results now, but they are much more likely to be true matches than with our first model, even while allowing for typos in many of the address fields.
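
To quantify "far fewer", a quick check is simply to count the rows of the returned matches (the exact count will depend on the radius chosen):

# Count the candidate links that fall within the 0.64 radius.
print matches.num_rows()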


In [27]:
print agent_listings[6]
print post_address[1266]


{'suburb': 'clayton south', 'state': '', 'address_1': 'ingham oace', 'street_number': '128', 'postcode': '7520'}
{'suburb': 'clayton south', 'state': 'nsw', 'address_1': 'ingham place', 'street_number': '128', 'postcode': '7052'}