In this notebook, we'll try to learn a custom string metric for address resolution.
Generally, when we're doing this kind of thing, we use something like Levenshtein distance or Jaro-Winkler distance to measure the distance between two strings.
This generally makes sense for measuring distances between first names:
Hilary Hillary Hillry
are plausibly the same name with various typos.
However, this is clearly not the best method for measuring distances between addresses:
123 Fake Street 123 Fake St 124 Fake Street
From writing conventions, it's obvious that the first two addresses are the same and the third is different, despite the fact that the first and third have the smallest Levenshtein distance. So we see that these kind of metrics are degraded when applied to addresses due to
different characters have different importance abbreviations are commonly used
We could try to write down a set of weights for different characters and a mapping from words to abbreviations, but that would be tedious and may be brittle. It also would require separate hand-written rules for each datatype. We could spend time developing a set of rules for addresses, then realize that we need to do the same thing for corporation names, etc... So ideally we want to be able to learn the string similarity metric from data.
(Note, in this document, we'll say two addresses are "equivalent" if they refer to the same physical location.)
But what kind of training data can we use for this task? It turns out that it is not impossible to use publicly available datasets to construct a reasonably high fidelity set of pairs of non-equivalent addresses and pairs of equivalent addresses. The first set of pairs could be constructed by finding a high-quality list of unique addresses in a region and thus any pair of addresses is non-equivalent. The second set is a little trickier, but could be constructed by taking a messy dataset and joining addresses on some less messy attribute (i.e. phone number). This kind of dataset may be publicly available in the form of campaign finance records, government contracting records, etc...
From these two sets of paired addresses, we can construct a training dataset consisting of triplets of addresses:
(anchor, positive, negative)
- anchor and positive are equivalent - anchor and negative are not equivalent
We'll use this data to learn a function
F that maps address strings to N-dimensional space while minimizing the function
max(0, distance(F(anchor), F(positive)) - distance(F(anchor), F(negative)) + margin)
over the training data, with
distance(x, y) = 1 - cosine_similarity(x, y)
The model we use is a LSTM neural network with this triplet loss function implemented in the
First we load some libraries, set some parameters and initialize some helpers.
In :import keras import pandas as pd from fuzzywuzzy import fuzz from matplotlib import pyplot as plt import sys sys.path.append('/Users/BenJohnson/projects/what-is-this/wit/') from wit import * # -- # Setting parameters num_features = 75 # Dimension of character embedding (more than enough) max_len = 350 # Max length of input string in characters formatter = KerasFormatter(num_features, max_len)
Using Theano backend.
Next, we load the data and construct a training set in the format we described above.
(Note: At the moment, we won't actually create the negative training pairs using a high-quality list of unique addresses in a region. Rather, we'll use randomly drawn pairs of addresses under the assumption that the number of addresses is sufficiently high as to make collisions relatively uncommon. We plan to re-run these analyses with higher quality negative training pairs in the near future.)
In [ ]:# -- # Construct training set # Load set of matching addresses df = pd.read_csv('real_address.csv') df['id'] = 0 # Make training set train = make_triplet_train(df.head(60000), N = 20) # Format training set for input into `wit` trn, levs = formatter.format(train, ['obj'], 'hash')
Now we can use the
wit TripletClassifier model to learn the string embedding we described above.
In [ ]:classifier = TripletClassifier(trn, levs) classifier.fit(batch_size = 250, nb_epoch = 3)
And finally, we can compare the behavior of the learned similarity metric to Levenshtein distance. We expect Levenshtein distance to perform adequately on addresses that have different street names, but expect it to fare much worse on addresses with different (but similar) street numbers or different levels of abbreviation.
In :def compare(a, b): # Similarity metric learned by `wit` pred = model.predict(formatter._format_x([a, b], False)) wit_sim = np.round(np.dot(pred, pred.T)[0, 1], 3) # Fuzzywuzzy.fuzz.ratio Levenshtein similarity fuzz_sim = int(np.round(fuzz.ratio(a, b))) print '\nwit_sim : %f\nlev_sim : %d\n' %(wit_sim, fuzz_sim) print "\n--\nCase 1, different street names : ('101 fake street', '101 elm street')" compare('101 fake street', '101 elm street') print "\n--\nCase 2, different abbrevations : ('101 fake street', '101 fake st')" compare('101 fake street', '101 fake st') print "\n--\nCase 3, different house numbers : ('101 fake street', '102 fake street')" compare('101 fake street', '102 fake street')
-- Case 1, different street names : ('101 fake street', '101 elm street') wit_sim : 0.790000 lev_sim : 83 -- Case 2, different abbrevations : ('101 fake street', '101 fake st') wit_sim : 0.959000 lev_sim : 85 -- Case 3, different house numbers : ('101 fake street', '102 fake street') wit_sim : 0.861000 lev_sim : 93
The dynamic ranges of the two methods are different, but notice that the ordering of pairs is different between the two methods
wit lev Most similar case 2 3 3 2 Least similar case 1 1
wit seems to have learned the the varying importance of characters in the string, while Levenshtein continues to treat all characters equally.
Further work on a robust comparison between the two methods is needed, ideally applied to a benchmarker de-duplication/record linkage dataset. We would also like to construct reasonable training datasets for other types of strings where the imporance of characters is non-uniform (phone numbers, etc).