Introduction

This IPython notebook illustrates how to sample and label a table (the candidate set). First, import the py_entitymatching package and the other required libraries:


In [1]:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd

In [2]:
# Get the datasets directory
datasets_dir = os.path.join(em.get_install_path(), 'datasets')

# Paths to the two input tables (A, B) and the candidate set (C)
path_A = os.path.join(datasets_dir, 'DBLP.csv')
path_B = os.path.join(datasets_dir, 'ACM.csv')
path_C = os.path.join(datasets_dir, 'tableC.csv')

In [3]:
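# Read the input tables, declaring 'id' as the key of each
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')
# Read the candidate set; its foreign keys point to the keys of A and B
C = em.read_csv_metadata(path_C, key='_id',
                         fk_ltable='ltable_id', fk_rtable='rtable_id',
                         ltable=A, rtable=B)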


Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.
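The key, fk_ltable, and fk_rtable arguments describe how the candidate set relates to tables A and B. As a rough illustration of what those constraints mean, the sketch below checks them with plain pandas on small made-up tables. The helper name check_candidate_set and the toy data are hypothetical, and this is not the library's own validation code.

```python
import pandas as pd

def check_candidate_set(C, A, B, key, fk_ltable, fk_rtable, l_key, r_key):
    """Return True if C satisfies the key and foreign-key constraints."""
    # The key column must be unique and free of missing values.
    key_ok = C[key].is_unique and C[key].notna().all()
    # Every foreign-key value must refer to an existing row in A or B.
    fk_l_ok = C[fk_ltable].isin(A[l_key]).all()
    fk_r_ok = C[fk_rtable].isin(B[r_key]).all()
    return bool(key_ok and fk_l_ok and fk_r_ok)

# Hypothetical toy tables standing in for A, B, and C
toy_A = pd.DataFrame({'id': ['a1', 'a2'], 'title': ['X', 'Y']})
toy_B = pd.DataFrame({'id': ['b1', 'b2'], 'title': ['X', 'Z']})
toy_C = pd.DataFrame({'_id': [0, 1, 2],
                      'ltable_id': ['a1', 'a1', 'a2'],
                      'rtable_id': ['b1', 'b2', 'b2']})

constraints_hold = check_candidate_set(
    toy_C, toy_A, toy_B,
    key='_id', fk_ltable='ltable_id', fk_rtable='rtable_id',
    l_key='id', r_key='id')
```

A candidate set that fails such a check (for example, an rtable_id that does not appear in B) cannot be traced back to its source tuples, which is why the relationships are declared when the table is read.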

In [4]:
C.head()


Out[4]:
| | _id | ltable_id | rtable_id | ltable_authors | ltable_title | rtable_authors | rtable_title |
| 0 | 0 | conf/sigmod/AbadiC02 | 191915 | Daniel J. Abadi, Mitch Cherniack | Visual COKO: a debugger for query optimizer development | Michael J. Carey, David J. DeWitt, Michael J. Franklin, Nancy E. Hall, Mark L. McAuliffe, Jeffre... | Shoring up persistent applications |
| 1 | 1 | conf/sigmod/AbadiC02 | 191931 | Daniel J. Abadi, Mitch Cherniack | Visual COKO: a debugger for query optimizer development | Daniel J. Dietterich | DEC data distributor: for data replication and data warehousing |
| 2 | 2 | conf/sigmod/AbadiC02 | 233356 | Daniel J. Abadi, Mitch Cherniack | Visual COKO: a debugger for query optimizer development | Mitch Cherniack, Stanley B. Zdonik | Rule languages and internal algebras for rule-based optimizers |
| 3 | 3 | conf/sigmod/AbadiC02 | 276311 | Daniel J. Abadi, Mitch Cherniack | Visual COKO: a debugger for query optimizer development | Mitch Cherniack, Stan Zdonik | Changing the rules: transformations for rule-based optimizers |
| 4 | 4 | conf/sigmod/AbadiC02 | 335432 | Daniel J. Abadi, Mitch Cherniack | Visual COKO: a debugger for query optimizer development | Jianjun Chen, David J. DeWitt, Feng Tian, Yuan Wang | NiagaraCQ: a scalable continuous query system for Internet databases |

In [5]:
len(C)


Out[5]:
14673

Sample Candidate Set

A sample of the candidate set can be drawn for labeling as follows (here, 450 tuple pairs):


In [6]:
S = em.sample_table(C, 450)
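To make the sampling step concrete, the following sketch draws a random sample from a small made-up candidate set using plain pandas. It is only an analogue under the assumption that rows are drawn at random without replacement; em.sample_table additionally keeps track of the table's metadata, which this sketch does not do. The toy table and the fixed random_state are hypothetical choices for illustration.

```python
import pandas as pd

# Hypothetical toy candidate set with ten tuple pairs
toy_C = pd.DataFrame({
    '_id': range(10),
    'ltable_id': ['a%d' % (i % 3) for i in range(10)],
    'rtable_id': ['b%d' % (i % 4) for i in range(10)],
})

# Draw four pairs at random, without replacement.
# random_state is fixed only so that the sketch is reproducible.
toy_S = toy_C.sample(n=4, replace=False, random_state=0)
```

Because the rows are drawn without replacement, no tuple pair appears twice in the sample, so each pair needs to be labeled only once.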

Label the Sampled Set


In [7]:
# Label the sampled set
# Specify the name for the label column
G = em.label_table(S, 'gold_label')


Column name (gold_label) is not present in dataframe

The user must specify 0 for a non-match and 1 for a match. Typically, the sampling and labeling steps are repeated over several iterations until the labeled set contains a sufficient density of matches. Once labeled, the data set will look like this:


In [8]:
# Assume that we have labeled the data and stored it in 
# labeled_data_demo.csv

path_labeled_data = datasets_dir + os.sep + 'labeled_data_demo.csv'
G = em.read_csv_metadata(path_labeled_data, key='_id', 
                         fk_ltable='ltable_id', fk_rtable='rtable_id',
                         ltable=A, rtable=B)


Metadata file is not present in the given path; proceeding to read the csv file.

In [9]:
G.head()


Out[9]:
| | _id | ltable_id | rtable_id | ltable_title | ltable_authors | ltable_year | rtable_title | rtable_authors | rtable_year | label |
| 0 | 0 | l1223 | r498 | Dynamic Information Visualization | Yannis E. Ioannidis | 1996 | Dynamic information visualization | Yannis E. Ioannidis | 1996 | 1 |
| 1 | 1 | l1563 | r1285 | Dynamic Load Balancing in Hierarchical Parallel Database Systems | Luc Bouganim, Daniela Florescu, Patrick Valduriez | 1996 | Dynamic Load Balancing in Hierarchical Parallel Database Systems | Luc Bouganim, Daniela Florescu, Patrick Valduriez | 1996 | 1 |
| 2 | 2 | l1514 | r1348 | Query Processing and Optimization in Oracle Rdb | Gennady Antoshenkov, Mohamed Ziauddin | 1996 | prospector: a content-based multimedia server for massively parallel architectures | S. Choo, W. O'Connell, G. Linerman, H. Chen, K. Ganapathy, A. Biliris, E. Panagos, D. Schrader | 1996 | 0 |
| 3 | 3 | l206 | r1641 | An Asymptotically Optimal Multiversion B-Tree | Thomas Ohler, Peter Widmayer, Bruno Becker, Stephan Gschwind, Bernhard Seeger | 1996 | A complete temporal relational algebra | Debabrata Dey, Terence M. Barron, Veda C. Storey | 1996 | 0 |
| 4 | 4 | l1589 | r495 | Evaluating Probabilistic Queries over Imprecise Data | Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar | 2003 | Evaluating probabilistic queries over imprecise data | Reynold Cheng, Dmitri V. Kalashnikov, Sunil Prabhakar | 2003 | 1 |
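Since sampling and labeling are repeated until the labeled set has enough matches, it can help to check the labels after each round. The sketch below is one possible way to do this with plain pandas: it verifies that every label is 0 or 1 and computes the fraction of matches. The helper name match_density is hypothetical, and the toy labels simply mirror the five rows shown above; what counts as a sufficient density is left to the user.

```python
import pandas as pd

def match_density(labeled, label_col='label'):
    """Validate 0/1 labels and return the fraction of matches."""
    labels = labeled[label_col]
    # Missing or unexpected values fail this check as well.
    if not labels.isin([0, 1]).all():
        raise ValueError('labels must be 0 (non-match) or 1 (match)')
    return float(labels.mean())

# Hypothetical toy table mirroring the labels of the five rows above
toy_G = pd.DataFrame({'_id': [0, 1, 2, 3, 4],
                      'label': [1, 1, 0, 0, 1]})

density = match_density(toy_G)
```

If the density is too low, another sample can be drawn and labeled before moving on.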