A major topic of ETL is string matching and data matching. While string matching returns a similarity score between two strings, data matching returns a similarity score between two tuples (rows) in a database. For example, we would like to determine whether the tuples (David Smith, 608-245-4367, Madison WI)
and (D. M. Smith, 245-4367, Madison WI)
refer to the same person.
The problem of data matching arises in many integration situations. In the simplest case, we may be merging multiple databases with identical schemas, but without a unique global ID, and want to decide which rows are duplicates. The problem is harder when we need to join rows from sources that have different schemas. Data matching may also arise at query time. For example, a query may refer imprecisely to a particular data item, e.g., a query asking for the phone number of a David Smith who lives in Madison. We need data matching techniques to decide which tuples in the database match the query.
In this project, we perform a cross-dataset join between two terrorism databases with different schemas: the Global Terrorism Database (GTD) and the Global Database of Events, Language, and Tone (GDELT).
We cover several classes of solutions to the data matching problem. The first kind, rule-based matching, employs handcrafted rules to match tuples. These techniques typically make heavy use of domain-specific knowledge in domains where the complexity of the rules is manageable. The second kind, learning-based matching, learns the appropriate rules from labeled examples using supervised learning. The third kind, clustering, does not use training data; instead, it iteratively assigns tuples to clusters, such that all tuples within a single cluster match and those across clusters do not.
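As a quick illustration of rule-based matching, the sketch below applies a handcrafted rule to the David Smith example above. The rule, its thresholds, and the helper name are hypothetical and are not part of this project's matching code; they are only meant to show the flavor of the approach.

# A minimal, hypothetical rule-based matcher for (name, phone, city) tuples.
# It declares a match when the last 7 phone digits, the city, and the surname all agree;
# this is purely illustrative and is not the matching logic used later in this notebook.
def rule_match(t1, t2):
    name1, phone1, city1 = t1
    name2, phone2, city2 = t2
    digits1 = "".join(ch for ch in phone1 if ch.isdigit())
    digits2 = "".join(ch for ch in phone2 if ch.isdigit())
    same_phone = digits1[-7:] == digits2[-7:]  # ignore a missing area code
    same_city = city1.lower() == city2.lower()
    same_surname = name1.split()[-1].lower() == name2.split()[-1].lower()
    return same_phone and same_city and same_surname

print(rule_match(("David Smith", "608-245-4367", "Madison WI"),
                 ("D. M. Smith", "245-4367", "Madison WI")))  # True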
In [1]:
import pandas as pd
import time
import datetime  # needed for the date-similarity functions below
%pylab inline
In [2]:
gtd=pd.read_excel("data/gtd_12to15_0616dist.xlsx")
In [3]:
headers = pd.read_excel("data/GDELT Metadata.xlsx").columns.values
gdelt = pd.read_csv("data/20150108.export.txt", delimiter="\t", names=headers, parse_dates=["Day"])
In [4]:
gtd = gtd.dropna(subset=['latitude', 'longitude', 'iday'])
gdelt = gdelt.dropna(subset=["ActionGeo_Lat", "ActionGeo_Long", "Day"])
In [5]:
gtd.iyear.value_counts()
Out[5]:
The years in GDELT are distributed strangely:
In [6]:
gdelt.Year.value_counts()
Out[6]:
In [7]:
set(gtd.iyear).intersection(set(gdelt.Year))
Out[7]:
Filter to 2014-2015
In [8]:
#limit both datasets to 2014-2015 so we only compare events from overlapping years
gtdf = gtd[(gtd.iyear==2015) | (gtd.iyear==2014)]
gdeltf = gdelt[(gdelt.Year==2015) | (gdelt.Year==2014)]#.sample(10000) #uncomment to sample a subset of the data for fast testing
We need to define the mapping between the columns the two datasets have in common.
I could only identify two attributes that map from GTD to GDELT: the event location (latitude/longitude) and the event date. The mapping is keyed by the GTD column, with the corresponding GDELT column as the value.
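A sketch of that mapping as a Python dict, using the column names the matching code below relies on; the dict itself is only documentation of the mapping and is not referenced anywhere else in the notebook.

# GTD column(s) -> GDELT column, for the two attributes the matcher compares
column_mapping = {
    "latitude": "ActionGeo_Lat",
    "longitude": "ActionGeo_Long",
    # the event date: GTD splits it across iyear/imonth/iday, GDELT keeps it in the parsed "Day" column
    ("iyear", "imonth", "iday"): "Day",
}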
We begin by covering approaches that employ handcrafted matching rules. A simple yet popular type of rule computes the similarity score between a pair of tuples $x$ and $y$ as a linearly weighted combination of the individual similarity scores:
$$
\DeclareMathOperator*{\argmax}{\arg\!\max}
sim(x, y) = \sum_{i=1}^n \alpha_i \, sim_i(x, y) \qquad\qquad
match(x) = \argmax_{y \in GDELT} \; sim(x, y)
$$
where $n$ is the number of columns the two tables have in common (in this case 2), $sim_i(x,y) \in [0,1]$ is a similarity score between the $i$th attributes of $x$ and $y$, and $\alpha_i \in [0, 1]$ is a prespecified weight that indicates the importance of the $i$th attribute to the overall similarity score, such that $\sum_{i=1}^n \alpha_i = 1$. We declare $x$ and $y$ matched if $sim(x, y) \geq \beta$ for a prespecified threshold $\beta$, and not matched otherwise.
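For example (the individual attribute scores here are made up for illustration): with equal weights $\alpha_1 = \alpha_2 = 0.5$, a candidate pair that scores $sim_1(x,y) = 0.9$ on location and $sim_2(x,y) = 0.2$ on date gets an overall score of $sim(x,y) = 0.5 \cdot 0.9 + 0.5 \cdot 0.2 = 0.55$. Since $0.55 \geq \beta = 0.4$ (the threshold used later in this notebook), the pair would be declared a match.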
First we take each lat/lng pair and put it in a 2D vector format so we can take the Euclidean distance.
We take the date from each tuple and put it in native Python datetime format. Then we convert each datetime into an integer representing the seconds since the epoch, so we can take the Euclidean distance between two dates as well.
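As a small illustration of this conversion (the two dates below are arbitrary; EPOCH is the same constant defined further down in the notebook):

import datetime

EPOCH = datetime.datetime.utcfromtimestamp(0)

# GTD stores an event date as separate year/month/day integers
gtd_date = datetime.datetime(2015, 1, 8)
gtd_seconds = (gtd_date - EPOCH).total_seconds()

# GDELT's "Day" column is already a parsed datetime, so it converts directly
gdelt_date = datetime.datetime(2015, 1, 9)
gdelt_seconds = (gdelt_date - EPOCH).total_seconds()

# the distance between two dates is then just the absolute difference in seconds
print(abs(gtd_seconds - gdelt_seconds))  # 86400.0, i.e. exactly one day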
This routine is a nested loop over both tables: O(n·m) complexity, where n and m are the row counts of the filtered GTD and GDELT tables, which works out to 5,150,259,916 computations to run even if we limit the search to 2014-2015.
In [9]:
print(gtdf.shape, gdeltf.shape)
gtdf.shape[0] * gdeltf.shape[0]
Out[9]:
In [10]:
#Python implementation of the sim function above, with 2 matching rules: lat/lng and date
#arguments are pandas Series (rows), one from each dataset
def sim(target, match):
    sim = 0
    latlng_sim = sim_latlng(target, match)
    sim += (W_LATLNG * latlng_sim)
    date_sim = sim_date(target, match)
    sim += (W_DATE * date_sim)
    return sim
In [11]:
#lat/lng similarity rule: extract each row's coordinates into a 1x2 numpy array and compare them
def sim_latlng(target, match):
    target_latlng = np.array(target[["latitude", "longitude"]].values)
    match_latlng = np.array(match[["ActionGeo_Lat", "ActionGeo_Long"]].values)
    sim = euclid_sim(target_latlng, match_latlng)
    return sim
In [12]:
#date similarity rule: convert each date to an integer number of seconds since the epoch and compare
def sim_date(target, match):
    #GTD stores the date as separate year/month/day columns
    target_date_parts = target[["iyear", "imonth", "iday"]].values
    target_date = datetime.datetime(int(target_date_parts[0]), int(target_date_parts[1]), int(target_date_parts[2]))
    target_seconds = (target_date - EPOCH).total_seconds()
    #GDELT's "Day" column was already parsed to a datetime by read_csv
    match_date = match["Day"]
    match_seconds = (match_date - EPOCH).total_seconds()
    sim = euclid_sim(target_seconds, match_seconds)
    return sim
In [13]:
#convert a Euclidean distance into a similarity score in (0, 1]:
#identical values give 1, and the score decays toward 0 as the distance grows
def euclid_sim(a, b):
    dist = np.linalg.norm(a - b)
    prob = 1 / (1 + dist)
    return prob
In [14]:
#utility method to print a match result as a small comparison table
def print_table(gtd, gdelt, score):
    gtd_vals = []
    gdelt_vals = []
    #lat/lng
    gtd_vals.append(gtd[["latitude", "longitude"]].values)
    gdelt_vals.append(gdelt[["ActionGeo_Lat", "ActionGeo_Long"]].values)
    #date
    target_date_parts = gtd[["iyear", "imonth", "iday"]].values
    gtd_vals.append(datetime.datetime(int(target_date_parts[0]), int(target_date_parts[1]), int(target_date_parts[2])))
    gdelt_vals.append(gdelt["Day"])  #already parsed to a datetime by read_csv
    #ids
    gtd_vals.append(gtd.eventid)
    gdelt_vals.append(gdelt.GlobalEventID)
    #the "0"/"1" key prefixes keep the columns in a fixed display order
    t_dict = {"0cols": ["lat/lng", "date", "id"], "1GTD": gtd_vals, "GDELT": gdelt_vals, "Score": score}
    df = pd.DataFrame(t_dict)
    return df
In [15]:
%%time
#weights should sum to 1
W_LATLNG = 0.5
W_DATE = 0.5
EPOCH = datetime.datetime.utcfromtimestamp(0)
# threshold (beta)
THRESHOLD = 0.4
#choose which GTD tuples you want to match against
targets = gtdf#.sample(500)
matches = {}
for _, target in targets.iterrows():
    scores = []
    for _, row in gdeltf.iterrows():
        score = sim(target, row)
        scores.append(score)
    #best-scoring GDELT row for this GTD row
    i = np.nanargmax(scores)
    high_score = scores[i]
    #if above the threshold, consider it a match
    if high_score >= THRESHOLD:
        match = gdeltf.iloc[i]
        matches[target.eventid] = (high_score, match.GlobalEventID)
    else:
        matches[target.eventid] = (None, None)
In [16]:
#keep only the GTD rows that found a match, sorted by score (highest first)
sorted_matches = sorted(((k, v) for k, v in matches.items() if v[0] is not None), key=lambda x: x[1][0], reverse=True)
top200 = sorted_matches[0:200]
In [17]:
#unpack the (gtd_id, (score, gdelt_id)) pairs into columns for the report
gtdids = [m[0] for m in top200]
scores = [m[1][0] for m in top200]
gdeltids = [m[1][1] for m in top200]
report = pd.DataFrame({"GTD": gtdids, "Score": scores, "GDELT": gdeltids})
report[["GTD", "Score", "GDELT"]].head(200)
Out[17]:
In [ ]: