Learning String Similarity for Address Resolution

Ben Johnson, Phronesis LLC, 2015-12-15

Problem

In this notebook, we'll try to learn a custom string metric for address resolution.

Typically, for this kind of record-linkage task, we'd use something like Levenshtein distance or Jaro-Winkler distance to measure the distance between two strings.

This generally makes sense for measuring distances between first names:

Hilary
Hillary
Hillry

are plausibly the same name with various typos.

However, this is clearly not the best method for measuring distances between addresses:

123 Fake Street
123 Fake St
124 Fake Street

From writing conventions, it's obvious that the first two addresses are the same and the third is different, despite the fact that the first and third have the smallest Levenshtein distance (the quick check below confirms this). So these kinds of metrics degrade when applied to addresses because

- different characters have different importance (a single digit can change the house number)
- abbreviations are commonly used ('Street' vs. 'St')
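As a quick sanity check, we can compute raw edit distances for these pairs (assuming the python-Levenshtein package is installed):

import Levenshtein

# Equivalent pair, different abbreviations: 4 edits apart
print Levenshtein.distance('123 Fake Street', '123 Fake St')      # -> 4

# Non-equivalent pair, different house numbers: only 1 edit apart
print Levenshtein.distance('123 Fake Street', '124 Fake Street')  # -> 1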

Approach

We could try to write down a set of weights for different characters and a mapping from words to abbreviations, but that would be tedious and likely brittle. It would also require separate hand-written rules for each data type: we could spend time developing a set of rules for addresses, only to realize that we need to do the same thing for corporation names, and so on. So ideally we want to learn the string similarity metric from data.

(Note, in this document, we'll say two addresses are "equivalent" if they refer to the same physical location.)

But what kind of training data can we use for this task? It turns out that publicly available datasets can be used to construct a reasonably high-fidelity set of non-equivalent address pairs and a set of equivalent address pairs. The first set can be built from a high-quality list of unique addresses in a region: since each address appears exactly once, any pair drawn from the list is non-equivalent. The second set is a little trickier, but can be built by taking a messy dataset and joining addresses on some less messy attribute (e.g. phone number). This kind of dataset may be publicly available in the form of campaign finance records, government contracting records, etc.
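As a rough sketch of that second construction (the filename and column names here are hypothetical):

import pandas as pd

# Hypothetical messy dataset with 'address' and 'phone' columns
df = pd.read_csv('campaign_finance.csv')

# Self-join on the cleaner attribute: rows that share a phone number
# but spell the address differently give us equivalent address pairs
pairs = df.merge(df, on='phone', suffixes=('_a', '_b'))
equiv = pairs[pairs.address_a != pairs.address_b][['address_a', 'address_b']]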

From these two sets of paired addresses, we can construct a training dataset consisting of triplets of addresses:

(anchor, positive, negative)

where

- anchor and positive are equivalent
- anchor and negative are not equivalent
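Assembling the triplets from the two pair sets might look something like this (this helper is illustrative, not part of wit):

import random

def make_triplets(equiv_pairs, unique_addresses, n=20):
    # equiv_pairs      : list of (a, b) equivalent address strings
    # unique_addresses : deduplicated address list, used for negatives
    triplets = []
    for anchor, positive in equiv_pairs:
        for _ in range(n):
            negative = random.choice(unique_addresses)
            if negative not in (anchor, positive):
                triplets.append((anchor, positive, negative))
    return triplets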

We'll use this data to learn a function F that maps address strings to N-dimensional space while minimizing the function

max(0, distance(F(anchor), F(positive)) - distance(F(anchor), F(negative)) + margin)

over the training data, with

distance(x, y) = 1 - cosine_similarity(x, y)
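For a single triplet, the loss is easy to write out directly (a minimal numpy sketch; the margin value is an arbitrary choice):

import numpy as np

def cosine_distance(x, y):
    return 1.0 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Hinge loss: the positive should sit closer to the anchor than
    # the negative does, by at least `margin`
    return max(0.0, cosine_distance(anchor, positive)
                  - cosine_distance(anchor, negative) + margin)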

The model we use is an LSTM neural network with this triplet loss function, implemented in the Keras Python library.
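For intuition, here is a sketch of what such a model can look like (written against the Keras functional API; this is not the actual wit implementation, and the LSTM size, margin, and optimizer are illustrative guesses):

import keras.backend as K
from keras.layers import Input, Embedding, LSTM, Lambda, concatenate
from keras.models import Model

EMB_DIM = 128  # illustrative size of the learned string embedding

def build_encoder(num_chars=128, num_features=75, max_len=350):
    inp = Input(shape=(max_len,), dtype='int32')
    x   = Embedding(num_chars, num_features)(inp)           # character embedding
    x   = LSTM(EMB_DIM)(x)                                  # encode the whole string
    out = Lambda(lambda v: K.l2_normalize(v, axis=-1))(x)   # unit norm: dot = cosine
    return Model(inp, out)

encoder = build_encoder()

# One shared encoder, applied to all three legs of the triplet
anchor, pos, neg = [Input(shape=(350,), dtype='int32') for _ in range(3)]
stacked = concatenate([encoder(anchor), encoder(pos), encoder(neg)])

def triplet_loss(y_true, y_pred, margin=0.2):
    a = y_pred[:, :EMB_DIM]
    p = y_pred[:, EMB_DIM:2 * EMB_DIM]
    n = y_pred[:, 2 * EMB_DIM:]
    d_pos = 1.0 - K.sum(a * p, axis=-1)  # cosine distance (unit vectors)
    d_neg = 1.0 - K.sum(a * n, axis=-1)
    return K.mean(K.maximum(0.0, d_pos - d_neg + margin))

model = Model([anchor, pos, neg], stacked)
model.compile(optimizer='adam', loss=triplet_loss)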

Code

First we load some libraries, set some parameters and initialize some helpers.


In [1]:
import keras
import numpy as np
import pandas as pd
from fuzzywuzzy import fuzz
from matplotlib import pyplot as plt

import sys
sys.path.append('/Users/BenJohnson/projects/what-is-this/wit/')
from wit import *

# --
# Setting parameters
num_features = 75  # Dimension of character embedding (more than enough)
max_len      = 350 # Max length of input string in characters
formatter    = KerasFormatter(num_features, max_len)


Using Theano backend.

Next, we load the data and construct a training set in the format we described above.

(Note: At the moment, we won't actually create the negative training pairs from a high-quality list of unique addresses in a region. Instead, we'll use randomly drawn pairs of addresses, on the assumption that the number of distinct addresses is large enough that collisions are relatively uncommon. We plan to re-run these analyses with higher-quality negative training pairs in the near future.)


In [ ]:
# --
# Construct training set

# Load set of matching addresses
df         = pd.read_csv('real_address.csv')
df['id']   = 0  # constant id column

# Make training set
train      = make_triplet_train(df.head(60000), N = 20)

# Format training set for input into `wit`
trn, levs  = formatter.format(train, ['obj'], 'hash')

Now we can use the wit TripletClassifier model to learn the string embedding we described above.


In [ ]:
classifier = TripletClassifier(trn, levs)
classifier.fit(batch_size = 250, nb_epoch = 3)

And finally, we can compare the behavior of the learned similarity metric to Levenshtein distance. We expect Levenshtein distance to perform adequately on addresses that have different street names, but expect it to fare much worse on addresses with different (but similar) street numbers or different levels of abbreviation.


In [65]:
def compare(a, b):
    # Similarity metric learned by `wit`
    model   = classifier.model  # assumed handle to the trained embedding network
    pred    = model.predict(formatter._format_x([a, b], False))
    wit_sim = np.round(np.dot(pred, pred.T)[0, 1], 3)  # assuming unit-norm embeddings, dot = cosine similarity

    # fuzzywuzzy.fuzz.ratio -- Levenshtein-based similarity, scaled 0-100
    fuzz_sim = int(np.round(fuzz.ratio(a, b)))

    print '\nwit_sim : %f\nlev_sim : %d\n' % (wit_sim, fuzz_sim)

print "\n--\nCase 1, different street names : ('101 fake street', '101 elm street')"
compare('101 fake street', '101 elm street')

print "\n--\nCase 2, different abbrevations : ('101 fake street', '101 fake st')"
compare('101 fake street', '101 fake st')

print "\n--\nCase 3, different house numbers : ('101 fake street', '102 fake street')"
compare('101 fake street', '102 fake street')


--
Case 1, different street names : ('101 fake street', '101 elm street')

wit_sim : 0.790000
lev_sim : 83


--
Case 2, different abbreviations : ('101 fake street', '101 fake st')

wit_sim : 0.959000
lev_sim : 85


--
Case 3, different house numbers : ('101 fake street', '102 fake street')

wit_sim : 0.861000
lev_sim : 93

The dynamic ranges of the two metrics differ, but more importantly, the two methods rank the pairs in different orders:

                     wit   lev
Most similar case     2     3 
                      3     2
Least similar case    1     1

wit seems to have learned the varying importance of the characters in the string, while Levenshtein treats all characters equally.

Further work on a robust comparison between the two methods is needed, ideally on a benchmark de-duplication/record-linkage dataset. We would also like to construct reasonable training datasets for other types of strings where the importance of characters is non-uniform (phone numbers, corporation names, etc.).