In this notebook, we'll try to learn a custom string metric for address resolution.
Generally, when we're doing this kind of thing, we use something like Levenshtein distance or Jaro-Winkler distance to measure the distance between two strings.
This generally makes sense for measuring distances between first names:
Hilary
Hillary
Hillry
are plausibly the same name with various typos.
However, this is clearly not the best method for measuring distances between addresses:
123 Fake Street
123 Fake St
124 Fake Street
From writing conventions, it's obvious that the first two addresses are the same and the third is different, despite the fact that the first and third have the smallest Levenshtein distance. So we see that these kind of metrics are degraded when applied to addresses due to
different characters have different importance
abbreviations are commonly used
We could try to write down a set of weights for different characters and a mapping from words to abbreviations, but that would be tedious and may be brittle. It also would require separate hand-written rules for each datatype. We could spend time developing a set of rules for addresses, then realize that we need to do the same thing for corporation names, etc... So ideally we want to be able to learn the string similarity metric from data.
(Note, in this document, we'll say two addresses are "equivalent" if they refer to the same physical location.)
But what kind of training data can we use for this task? It turns out that it is not impossible to use publicly available datasets to construct a reasonably high fidelity set of pairs of non-equivalent addresses and pairs of equivalent addresses. The first set of pairs could be constructed by finding a high-quality list of unique addresses in a region and thus any pair of addresses is non-equivalent. The second set is a little trickier, but could be constructed by taking a messy dataset and joining addresses on some less messy attribute (i.e. phone number). This kind of dataset may be publicly available in the form of campaign finance records, government contracting records, etc...
From these two sets of paired addresses, we can construct a training dataset consisting of triplets of addresses:
(anchor, positive, negative)
where
- anchor and positive are equivalent
- anchor and negative are not equivalent
We'll use this data to learn a function F
that maps address strings to N-dimensional space while minimizing the function
max(0, distance(F(anchor), F(positive)) - distance(F(anchor), F(negative)) + margin)
over the training data, with
distance(x, y) = 1 - cosine_similarity(x, y)
The model we use is a LSTM neural network with this triplet loss function implemented in the Keras
python
module.
First we load some libraries, set some parameters and initialize some helpers.
In [1]:
import keras
import pandas as pd
from fuzzywuzzy import fuzz
from matplotlib import pyplot as plt
import sys
sys.path.append('/Users/BenJohnson/projects/what-is-this/wit/')
from wit import *
# --
# Setting parameters
num_features = 75 # Dimension of character embedding (more than enough)
max_len = 350 # Max length of input string in characters
formatter = KerasFormatter(num_features, max_len)
Next, we load the data and construct a training set in the format we described above.
(Note: At the moment, we won't actually create the negative training pairs using a high-quality list of unique addresses in a region. Rather, we'll use randomly drawn pairs of addresses under the assumption that the number of addresses is sufficiently high as to make collisions relatively uncommon. We plan to re-run these analyses with higher quality negative training pairs in the near future.)
In [ ]:
# --
# Construct training set
# Load set of matching addresses
df = pd.read_csv('real_address.csv')
df['id'] = 0
# Make training set
train = make_triplet_train(df.head(60000), N = 20)
# Format training set for input into `wit`
trn, levs = formatter.format(train, ['obj'], 'hash')
Now we can use the wit
TripletClassifier model to learn the string embedding we described above.
In [ ]:
classifier = TripletClassifier(trn, levs)
classifier.fit(batch_size = 250, nb_epoch = 3)
And finally, we can compare the behavior of the learned similarity metric to Levenshtein distance. We expect Levenshtein distance to perform adequately on addresses that have different street names, but expect it to fare much worse on addresses with different (but similar) street numbers or different levels of abbreviation.
In [65]:
def compare(a, b):
# Similarity metric learned by `wit`
pred = model.predict(formatter._format_x([a, b], False))
wit_sim = np.round(np.dot(pred, pred.T)[0, 1], 3)
# Fuzzywuzzy.fuzz.ratio Levenshtein similarity
fuzz_sim = int(np.round(fuzz.ratio(a, b)))
print '\nwit_sim : %f\nlev_sim : %d\n' %(wit_sim, fuzz_sim)
print "\n--\nCase 1, different street names : ('101 fake street', '101 elm street')"
compare('101 fake street', '101 elm street')
print "\n--\nCase 2, different abbrevations : ('101 fake street', '101 fake st')"
compare('101 fake street', '101 fake st')
print "\n--\nCase 3, different house numbers : ('101 fake street', '102 fake street')"
compare('101 fake street', '102 fake street')
The dynamic ranges of the two methods are different, but notice that the ordering of pairs is different between the two methods
wit lev
Most similar case 2 3
3 2
Least similar case 1 1
wit
seems to have learned the the varying importance of characters in the string, while Levenshtein continues to treat all characters equally.
Further work on a robust comparison between the two methods is needed, ideally applied to a benchmarker de-duplication/record linkage dataset. We would also like to construct reasonable training datasets for other types of strings where the imporance of characters is non-uniform (phone numbers, etc).