A major topic of ETL is string matching and data matching. While string matching returns a similarity score between two strings, data matching returns a similarity score between two tuples (rows) in a database. For example, we would like to determine whether the tuples (David Smith, 608-245-4367, Madison WI)
and (D. M. Smith, 245-4367, Madison WI)
refer to the same person.
The problem of data matching arises in many integration situations. In the simplest case, we may be merging multiple databases with identical schemas, but without a unique global ID, and want to decide which rows are duplicates. The problem is harder when we need to join rows from sources that have different schemas. Data matching may also arise at query time. For example, a query may refer imprecisely to a particular data item, e.g., a query asking for the phone number of a David Smith who lives in Madison. We need data matching techniques to decide which tuples in the database match the query.
In this project, we perform a cross-dataset join between two terrorism databases with different schemas: the Global Terrorism Database (GTD) and the Global Database of Events, Language, and Tone (GDELT).
We cover several classes of solutions to the data matching problem. The first kind, rule-based matching, employs handcrafted rules to match tuples. These techniques typically make heavy use of domain-specific knowledge in domains where the complexity of the rules is manageable. The second kind, learning-based matching, learns the appropriate rules from labeled examples using supervised learning. The third kind, clustering, does not use training data; instead, it iteratively assigns tuples to clusters, such that all tuples within a single cluster match and those across clusters do not.
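As a quick illustration of rule-based matching, the sketch below applies a handcrafted rule to the David Smith example above. The rule, its thresholds, and the helper name are hypothetical and are not part of this project's matching code; they are only meant to show the flavor of the approach.

# A minimal, hypothetical rule-based matcher for (name, phone, city) tuples.
# It declares a match when the last 7 phone digits, the city, and the surname all agree;
# this is purely illustrative and is not the matching logic used later in this notebook.
def rule_match(t1, t2):
    name1, phone1, city1 = t1
    name2, phone2, city2 = t2
    digits1 = "".join(ch for ch in phone1 if ch.isdigit())
    digits2 = "".join(ch for ch in phone2 if ch.isdigit())
    same_phone = digits1[-7:] == digits2[-7:]  # ignore a missing area code
    same_city = city1.lower() == city2.lower()
    same_surname = name1.split()[-1].lower() == name2.split()[-1].lower()
    return same_phone and same_city and same_surname

print(rule_match(("David Smith", "608-245-4367", "Madison WI"),
                 ("D. M. Smith", "245-4367", "Madison WI")))  # True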
In [1]:
import pandas as pd
import time
import datetime  # needed for the date-similarity functions below
%pylab inline
In [2]:
gtd=pd.read_excel("data/gtd_12to15_0616dist.xlsx")
In [3]:
headers = pd.read_excel("data/GDELT Metadata.xlsx").columns.values
gdelt = pd.read_csv("data/20150108.export.txt", delimiter="\t", names=headers, parse_dates=["Day"])
In [4]:
gtd = gtd.dropna(subset=['latitude', 'longitude', 'iday'])
gdelt = gdelt.dropna(subset=["ActionGeo_Lat", "ActionGeo_Long", "Day"])
In [5]:
gtd.iyear.value_counts()
Out[5]:
The years in GDELT are distributed strangely:
In [6]:
gdelt.Year.value_counts()
Out[6]:
In [7]:
set(gtd.iyear).intersection(set(gdelt.Year))
Out[7]:
Filter to 2014-2015
In [8]:
#limit both datasets to 2014-2015 so we only compare events from overlapping years
gtdf = gtd[(gtd.iyear==2015) | (gtd.iyear==2014)]
gdeltf = gdelt[(gdelt.Year==2015) | (gdelt.Year==2014)]#.sample(10000) #uncomment to sample a subset of the data for fast testing
We need to define the mapping between the columns the two datasets have in common.
I could only identify two attributes that map from GTD to GDELT: the event location (latitude/longitude) and the event date. The mapping is keyed by the GTD column, with the corresponding GDELT column as the value.
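A sketch of that mapping as a Python dict, using the column names the matching code below relies on; the dict itself is only documentation of the mapping and is not referenced anywhere else in the notebook.

# GTD column(s) -> GDELT column, for the two attributes the matcher compares
column_mapping = {
    "latitude": "ActionGeo_Lat",
    "longitude": "ActionGeo_Long",
    # the event date: GTD splits it across iyear/imonth/iday, GDELT keeps it in the parsed "Day" column
    ("iyear", "imonth", "iday"): "Day",
}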
We begin by covering approaches that employ handcrafted matching rules. A simple yet popular type of rule computes the similarity score between a pair of tuples $x$ and $y$ as a linearly weighted combination of the individual similarity scores:
$$
\DeclareMathOperator*{\argmax}{\arg\!\max}
sim(x, y) = \sum_{i=1}^n \alpha_i \, sim_i(x, y) \qquad\qquad
match(x) = \argmax_{y \in GDELT} \; sim(x, y)
$$
where $n$ is the number of columns the two tables have in common (in this case 2), $sim_i(x,y) \in [0,1]$ is a similarity score between the $i$th attributes of $x$ and $y$, and $\alpha_i \in [0, 1]$ is a prespecified weight that indicates the importance of the $i$th attribute to the overall similarity score, such that $\sum_{i=1}^n \alpha_i = 1$. We declare $x$ and $y$ matched if $sim(x, y) \geq \beta$ for a prespecified threshold $\beta$, and not matched otherwise.
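For example (the individual attribute scores here are made up for illustration): with equal weights $\alpha_1 = \alpha_2 = 0.5$, a candidate pair that scores $sim_1(x,y) = 0.9$ on location and $sim_2(x,y) = 0.2$ on date gets an overall score of $sim(x,y) = 0.5 \cdot 0.9 + 0.5 \cdot 0.2 = 0.55$. Since $0.55 \geq \beta = 0.4$ (the threshold used later in this notebook), the pair would be declared a match.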
First we take each lat/lng pair and put it in a 2D vector format so we can take the Euclidean distance.
We take the date from each tuple and put it in native Python datetime format. Then we convert each datetime into an integer representing the seconds since the epoch, so we can take the Euclidean distance between two dates as well.
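As a small illustration of this conversion (the two dates below are arbitrary; EPOCH is the same constant defined further down in the notebook):

import datetime

EPOCH = datetime.datetime.utcfromtimestamp(0)

# GTD stores an event date as separate year/month/day integers
gtd_date = datetime.datetime(2015, 1, 8)
gtd_seconds = (gtd_date - EPOCH).total_seconds()

# GDELT's "Day" column is already a parsed datetime, so it converts directly
gdelt_date = datetime.datetime(2015, 1, 9)
gdelt_seconds = (gdelt_date - EPOCH).total_seconds()

# the distance between two dates is then just the absolute difference in seconds
print(abs(gtd_seconds - gdelt_seconds))  # 86400.0, i.e. exactly one day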
This routine is a nested loop over both tables: O(n·m) complexity, where n and m are the row counts of the filtered GTD and GDELT tables, which works out to 5,150,259,916 computations to run even if we limit the search to 2014-2015.
In [9]:
print(gtdf.shape, gdeltf.shape)
gtdf.shape[0] * gdeltf.shape[0]
Out[9]:
In [10]:
#Python implementation of the sim function above, with 2 matching rules: lat/lng and date
#arguments are pandas Series (rows), one from each dataset
def sim(target, match):
    sim = 0
    latlng_sim = sim_latlng(target, match)
    sim += (W_LATLNG * latlng_sim)
    date_sim = sim_date(target, match)
    sim += (W_DATE * date_sim)
    return sim
In [11]:
#lat/lng similarity rule: extract each row's coordinates into a 1x2 numpy array and compare them
def sim_latlng(target, match):
    target_latlng = np.array(target[["latitude", "longitude"]].values)
    match_latlng = np.array(match[["ActionGeo_Lat", "ActionGeo_Long"]].values)
    sim = euclid_sim(target_latlng, match_latlng)
    return sim
In [12]:
#date similarity rule: convert each date to an integer number of seconds since the epoch and compare
def sim_date(target, match):
    #GTD stores the date as separate year/month/day columns
    target_date_parts = target[["iyear", "imonth", "iday"]].values
    target_date = datetime.datetime(int(target_date_parts[0]), int(target_date_parts[1]), int(target_date_parts[2]))
    target_seconds = (target_date - EPOCH).total_seconds()
    #GDELT's "Day" column was already parsed to a datetime by read_csv
    match_date = match["Day"]
    match_seconds = (match_date - EPOCH).total_seconds()
    sim = euclid_sim(target_seconds, match_seconds)
    return sim
In [13]:
#convert a Euclidean distance into a similarity score in (0, 1]:
#identical values give 1, and the score decays toward 0 as the distance grows
def euclid_sim(a, b):
    dist = np.linalg.norm(a - b)
    prob = 1 / (1 + dist)
    return prob
In [14]:
#utility method to print a match result as a small comparison table
def print_table(gtd, gdelt, score):
    gtd_vals = []
    gdelt_vals = []
    #lat/lng
    gtd_vals.append(gtd[["latitude", "longitude"]].values)
    gdelt_vals.append(gdelt[["ActionGeo_Lat", "ActionGeo_Long"]].values)
    #date
    target_date_parts = gtd[["iyear", "imonth", "iday"]].values
    gtd_vals.append(datetime.datetime(int(target_date_parts[0]), int(target_date_parts[1]), int(target_date_parts[2])))
    gdelt_vals.append(gdelt["Day"])  #already parsed to a datetime by read_csv
    #ids
    gtd_vals.append(gtd.eventid)
    gdelt_vals.append(gdelt.GlobalEventID)
    #the "0"/"1" key prefixes keep the columns in a fixed display order
    t_dict = {"0cols": ["lat/lng", "date", "id"], "1GTD": gtd_vals, "GDELT": gdelt_vals, "Score": score}
    df = pd.DataFrame(t_dict)
    return df
In [15]:
%%time
#weights should sum to 1
W_LATLNG = 0.5
W_DATE = 0.5
EPOCH = datetime.datetime.utcfromtimestamp(0)
# threshold (beta)
THRESHOLD = 0.4
#choose which GTD tuples you want to match against
targets = gtdf#.sample(500)
matches = {}
for _, target in targets.iterrows():
    scores = []
    for _, row in gdeltf.iterrows():
        score = sim(target, row)
        scores.append(score)
    #best-scoring GDELT row for this GTD row
    i = np.nanargmax(scores)
    high_score = scores[i]
    #if above the threshold, consider it a match
    if high_score >= THRESHOLD:
        match = gdeltf.iloc[i]
        matches[target.eventid] = (high_score, match.GlobalEventID)
    else:
        matches[target.eventid] = (None, None)
In [16]:
#keep only the GTD rows that found a match, sorted by score (highest first)
sorted_matches = sorted(((k, v) for k, v in matches.items() if v[0] is not None), key=lambda x: x[1][0], reverse=True)
top200 = sorted_matches[0:200]
In [17]:
#unpack the (gtd_id, (score, gdelt_id)) pairs into columns for the report
gtdids = [m[0] for m in top200]
scores = [m[1][0] for m in top200]
gdeltids = [m[1][1] for m in top200]
report = pd.DataFrame({"GTD": gtdids, "Score": scores, "GDELT": gdeltids})
report[["GTD", "Score", "GDELT"]].head(200)
Out[17]:
In [ ]: