Entity Resolution Workshop

Entity Resolution is the task of disambiguating manifestations of real-world entities by linking and grouping records, and it is often an essential part of the data wrangling process. There are three primary tasks involved in entity resolution: deduplication, record linkage, and canonicalization. Each serves to improve data quality: by removing redundant or repeated data, by joining information from disparate records, and by providing a single authoritative record to perform analytics upon. However, because of data quality issues (misspellings or incorrect values), schema variations across sources, or simply different representations, entity resolution is not a straightforward process, and most ER techniques rely on machine learning and other stochastic approaches.
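
To make the three tasks concrete, here is a minimal sketch (using made-up records, not the workshop fixtures) in which near-duplicate records are grouped under a normalized key and one canonical record is kept per entity:

In [ ]:
## Toy illustration (hypothetical records): deduplication and canonicalization.
## Records whose names normalize to the same key are grouped together, and the
## first record in each group is kept as the canonical representative.
toy = [
    {"name": "Apple iPod Nano 8GB",    "price": 149.00},
    {"name": "apple  ipod nano 8gb",   "price": 145.00},
    {"name": "Canon PowerShot SD1100", "price": 199.00},
]

def normalize(record):
    # Collapse case and whitespace so near-duplicate names share a key.
    return " ".join(record["name"].lower().split())

canonical = {}
for record in toy:
    canonical.setdefault(normalize(record), record)

print "{} raw records, {} canonical entities".format(len(toy), len(canonical))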


In [ ]:
## Imports

import os
import csv
import nltk
import json
import math
import random
import distance 
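## Note: the WordNet lemmatizer used later requires the NLTK wordnet corpus;
## run nltk.download('wordnet') once if it is not already available.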

## Important Paths
FIXTURES = os.path.join(os.getcwd(), "fixtures")
PRODUCTS = os.path.join(FIXTURES, "products")

## Module Constants
GOOGID   = 'http://www.google.com/base/feeds/snippets'

In [ ]:
def load_data(name):
    """
    Create a generator to load data from the products data source.
    """
    with open(os.path.join(PRODUCTS, name), 'r') as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield row

def google_key(key):
    return os.path.join(GOOGID, key)

In [ ]:
## Load Datasets into Memory
amazon  = list(load_data('amazon.csv'))
google  = list(load_data('google.csv'))
mapping = list(load_data('perfect_mapping.csv'))

## Report on contents of the dataset
for name, dataset in (('Amazon', amazon), ('Google Shopping', google)):
    print "{} dataset contains {} records".format(name, len(dataset))
    print "Record keys: {}\n".format(", ".join(dataset[0].keys()))

## Report on the contents of the mapping
print "There are {} matching records to link".format(len(mapping))

## Convert dataset to records indexed by their ID.
amazon  = dict((v['id'], v) for v in amazon)
google  = dict((v['id'], v) for v in google)

In [ ]:
X = amazon['b0000c7fpt']
Y = google[google_key('17175991674191849246')]

## Show example Records
print json.dumps(X, indent=2)
print json.dumps(Y, indent=2)

Similarity Scores

To decide whether two records refer to the same entity, we need distance metrics that score how similar their individual fields are. Numeric distances are fairly easy to compute, but they can be record-specific (e.g. phone numbers can be compared on area codes, city codes, and so on to determine similarity); a small sketch of that idea follows below. The rest of this section focuses on text similarity:
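
The helper below is a hypothetical illustration (not part of the workshop code) of a record-specific numeric comparison: a ten-digit US phone number is split into its parts, and agreement on each part is weighted.

In [ ]:
import re

def phone_similarity(p1, p2):
    """
    Hypothetical sketch of a record-specific comparison: split ten-digit US
    numbers into (area code, exchange, line number) and weight agreement
    on each part.
    """
    parts = []
    for number in (p1, p2):
        digits = re.sub(r"\D", "", number)[-10:]
        parts.append((digits[:3], digits[3:6], digits[6:]))

    weights = (0.5, 0.3, 0.2)
    return sum(w for w, a, b in zip(weights, parts[0], parts[1]) if a == b)

print phone_similarity("(301) 555-0123", "301.555.9876")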


In [ ]:
# Typographic Distances

print distance.levenshtein("lenvestein", "levenshtein")
print distance.hamming("hamming", "hamning")

In [ ]:
# Compare glyphs, syllables, or phonemes 
t1 = ("de", "ci", "si", "ve")
t2 = ("de", "ri", "si", "ve")
print distance.levenshtein(t1, t2)

In [ ]:
# Sentence Comparison
sent1 = "The quick brown fox jumped over the lazy dogs."
sent2 = "The lazy foxes are jumping over the crazy Dog."

print distance.nlevenshtein(sent1.split(), sent2.split(), method=1)

In [ ]:
# Normalization
print distance.hamming("fat", "cat", normalized=True)
print distance.nlevenshtein("abc", "acd", method=1)  # shortest alignment
print distance.nlevenshtein("abc", "acd", method=2)  # longest alignment

In [ ]:
# Set measures
print distance.sorensen("decide", "resize")
print distance.jaccard("decide", "resize")

Preprocessed Text Score

Use text preprocessing with NLTK to split long strings into tokens and normalize them with the WordNet lemmatizer.


In [ ]:
def tokenize(sent):
    """
    When passed a sentence, tokenizes and normalizes the string,
    yielding the lemma of each token.
    """
    lemmatizer = nltk.WordNetLemmatizer()
    for token in nltk.wordpunct_tokenize(sent):
        token = token.lower()
        yield lemmatizer.lemmatize(token)

def normalized_jaccard(*args):
    """
    Tokenizes and lemmatizes the arguments, then returns a Jaccard score
    between 0.0 (no tokens in common) and 1.0 (identical token sets).
    """
    try:
        # distance.jaccard returns a distance (0 means identical), so it is
        # inverted here to produce a similarity score for the match vector.
        return 1.0 - distance.jaccard(*[tokenize(arg) for arg in args])
    except (UnicodeDecodeError, ZeroDivisionError):
        # Undecodable text or two empty fields: treat as no similarity.
        return 0.0

print normalized_jaccard(sent1, sent2)

Similarity Vectors

Each candidate pair of records is reduced to a vector of field-level similarity scores: one entry each for the product name, description, manufacturer, and price.


In [ ]:
def similarity(prod1, prod2):
    """
    Returns a similarity vector of match scores:
    [name_score, description_score, manufacturer_score, price_score]
    """
    pair  = (prod1, prod2)
    names = [r.get('name', None) or r.get('title', None) for r in pair]
    descr = [r.get('description') for r in pair]
    manuf = [r.get('manufacturer') for r in pair]
    price = [float(r.get('price')) for r in pair]
    
    return [
        normalized_jaccard(*names),
        normalized_jaccard(*descr),
        normalized_jaccard(*manuf),
        # Price similarity decays from 1.0 toward 0.0 as the absolute
        # difference between the two prices grows.
        1.0 / (1.0 + abs(price[0] - price[1])),
    ]

print similarity(X, Y)

Weighted Pairwise Matching

Each candidate pair is scored with a weighted sum of its similarity vector (score = sum of weight × field score), and any pair whose score exceeds THRESHOLD is reported as a match. The weights reflect how much each field is trusted; here the product name dominates.


In [ ]:
THRESHOLD = 0.90
WEIGHTS   = (0.6, 0.1, 0.2, 0.1)
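
As a quick sanity check (a sketch, not part of the original notebook), the weighted score can be computed for the example pair X and Y shown earlier, before running the full pairwise loop:

In [ ]:
## Weighted score for the example pair X and Y defined above.
vector = similarity(X, Y)
score  = sum(w * s for w, s in zip(WEIGHTS, vector))
print "weighted score: {:0.3f} (match threshold is {})".format(score, THRESHOLD)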

In [ ]:
matches = 0
for azprod in amazon.values():
    for googprod in google.values():
        vector = similarity(azprod, googprod)
        score  = sum(map(lambda v: v[0]*v[1], zip(WEIGHTS, vector)))
        if score > THRESHOLD:
            matches += 1
            print "{0:0.3f}: {1} {2}".format(
                    score, azprod['id'], googprod['id'].split("/")[-1]
            )

print "\n{} matches discovered".format(matches)