Introduction

This IPython notebook illustrates how to perform blocking using a black box blocker.

First, we need to import the py_entitymatching package and other libraries as follows:


In [1]:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd



Then, read the (sample) input tables for blocking purposes.


In [5]:
# Get the datasets directory
datasets_dir = os.path.join(em.get_install_path(), 'datasets')

# Get the paths of the input tables
path_A = os.path.join(datasets_dir, 'person_table_A.csv')
path_B = os.path.join(datasets_dir, 'person_table_B.csv')

In [6]:
# Read the CSV files and set 'ID' as the key attribute
A = em.read_csv_metadata(path_A, key='ID')
B = em.read_csv_metadata(path_B, key='ID')

Different Ways to Block Using the Black Box Blocker

There are three different ways to do black box blocking:

  1. Block two tables to produce a candidate set of tuple pairs.
  2. Block a candidate set of tuple pairs to typically produce a reduced candidate set of tuple pairs.
  3. Block two tuples to check if a tuple pair would get blocked.
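
The three modes can be pictured with a minimal pure-Python sketch. This is not how py_entitymatching implements them: the function names (block_tables_sketch, block_candset_sketch, block_tuples_sketch), the differs_on_city function, and the toy dict-based tables are all invented for illustration. The one convention carried over from the notebook is that the black box function returns True when a pair should be dropped.

```python
from itertools import product

def differs_on_city(x, y):
    # Black box function: True means "block (drop) this pair".
    return x['city'] != y['city']

def block_tables_sketch(table_a, table_b, black_box):
    # Mode 1: keep every pair from the cross product that is NOT blocked.
    return [(a, b) for a, b in product(table_a, table_b) if not black_box(a, b)]

def block_candset_sketch(candset, black_box):
    # Mode 2: filter an existing candidate set further.
    return [(a, b) for a, b in candset if not black_box(a, b)]

def block_tuples_sketch(a, b, black_box):
    # Mode 3: report whether a single pair would be blocked.
    return black_box(a, b)

# Hypothetical toy tables (not the sample datasets).
toy_a = [{'id': 'a1', 'city': 'Madison'}, {'id': 'a2', 'city': 'Chicago'}]
toy_b = [{'id': 'b1', 'city': 'Madison'}, {'id': 'b2', 'city': 'Austin'}]

candidates = block_tables_sketch(toy_a, toy_b, differs_on_city)
candidate_ids = [(a['id'], b['id']) for a, b in candidates]
print(candidate_ids)  # [('a1', 'b1')]
```

Only the pair that shares a city survives; every other pair in the cross product is dropped.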

Block Tables to Produce a Candidate Set of Tuple Pairs

First, define a black box function. It must return True when a tuple pair should be blocked (dropped) and False otherwise:


In [18]:
def address_address_function(x, y):
    # x, y will be of type pandas Series

    # get the address attribute
    x_address = x['address']
    y_address = y['address']
    # get the city (the text after the last comma)
    x_city = x_address.split(',')[-1]
    y_city = y_address.split(',')[-1]
    # block the pair (return True) if the cities do not match
    return x_city != y_city
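
The function above compares the city strings exactly, so a difference in case or spacing alone would cause a pair to be blocked. The following is a minimal sketch of a more tolerant variant, assuming that trimming whitespace and ignoring case is acceptable for the data at hand. The names city_of and address_address_function_normalized are hypothetical, and the oddly formatted address in `right` is made up to show the effect.

```python
def city_of(address):
    # Take the text after the last comma, trim whitespace, ignore case.
    return address.split(',')[-1].strip().lower()

def address_address_function_normalized(x, y):
    # Hypothetical variant: block (return True) when the cities differ.
    return city_of(x['address']) != city_of(y['address'])

# Plain dicts stand in for pandas Series; both support x['address'].
left = {'address': '607 From St, San Francisco'}
right = {'address': '108 Clement St,San francisco '}  # made-up formatting
same_city_blocked = address_address_function_normalized(left, right)
print(same_city_blocked)  # False: same city after normalization
```

Such a function could be passed to set_black_box_function in the same way as the original.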

In [22]:
# Instantiate blackbox blocker
bb = em.BlackBoxBlocker()
# Set the black box function
bb.set_black_box_function(address_address_function)

In [23]:
C = bb.block_tables(A, B, l_output_attrs=['name', 'address'], r_output_attrs=['name', 'address'])


0%                          100%
[##############################] | ETA: 00:00:00
Total time elapsed: 00:00:00

In [24]:
C


Out[24]:
    _id  ltable_ID  rtable_ID  ltable_name       ltable_address                   rtable_name     rtable_address
0   0    a1         b1         Kevin Smith       607 From St, San Francisco       Mark Levene     108 Clement St, San Francisco
1   1    a1         b2         Kevin Smith       607 From St, San Francisco       Bill Bridge     3131 Webster St, San Francisco
2   2    a1         b3         Kevin Smith       607 From St, San Francisco       Mike Franklin   1652 Stockton St, San Francisco
3   3    a1         b4         Kevin Smith       607 From St, San Francisco       Joseph Kuan     108 South Park, San Francisco
4   4    a1         b6         Kevin Smith       607 From St, San Francisco       Michael Brodie  133 Clement Street, San Francisco
5   5    a2         b1         Michael Franklin  1652 Stockton St, San Francisco  Mark Levene     108 Clement St, San Francisco
6   6    a2         b2         Michael Franklin  1652 Stockton St, San Francisco  Bill Bridge     3131 Webster St, San Francisco
7   7    a2         b3         Michael Franklin  1652 Stockton St, San Francisco  Mike Franklin   1652 Stockton St, San Francisco
8   8    a2         b4         Michael Franklin  1652 Stockton St, San Francisco  Joseph Kuan     108 South Park, San Francisco
9   9    a2         b6         Michael Franklin  1652 Stockton St, San Francisco  Michael Brodie  133 Clement Street, San Francisco
10  10   a3         b1         William Bridge    3131 Webster St, San Francisco   Mark Levene     108 Clement St, San Francisco
11  11   a3         b2         William Bridge    3131 Webster St, San Francisco   Bill Bridge     3131 Webster St, San Francisco
12  12   a3         b3         William Bridge    3131 Webster St, San Francisco   Mike Franklin   1652 Stockton St, San Francisco
13  13   a3         b4         William Bridge    3131 Webster St, San Francisco   Joseph Kuan     108 South Park, San Francisco
14  14   a3         b6         William Bridge    3131 Webster St, San Francisco   Michael Brodie  133 Clement Street, San Francisco
15  15   a4         b1         Binto George      423 Powell St, San Francisco     Mark Levene     108 Clement St, San Francisco
16  16   a4         b2         Binto George      423 Powell St, San Francisco     Bill Bridge     3131 Webster St, San Francisco
17  17   a4         b3         Binto George      423 Powell St, San Francisco     Mike Franklin   1652 Stockton St, San Francisco
18  18   a4         b4         Binto George      423 Powell St, San Francisco     Joseph Kuan     108 South Park, San Francisco
19  19   a4         b6         Binto George      423 Powell St, San Francisco     Michael Brodie  133 Clement Street, San Francisco
20  20   a5         b1         Alphonse Kemper   1702 Post Street, San Francisco  Mark Levene     108 Clement St, San Francisco
21  21   a5         b2         Alphonse Kemper   1702 Post Street, San Francisco  Bill Bridge     3131 Webster St, San Francisco
22  22   a5         b3         Alphonse Kemper   1702 Post Street, San Francisco  Mike Franklin   1652 Stockton St, San Francisco
23  23   a5         b4         Alphonse Kemper   1702 Post Street, San Francisco  Joseph Kuan     108 South Park, San Francisco
24  24   a5         b6         Alphonse Kemper   1702 Post Street, San Francisco  Michael Brodie  133 Clement Street, San Francisco

Block Candidate Set

First, define a black box function that blocks tuple pairs whose last names differ:


In [25]:
def name_name_function(x, y):
    # x, y will be of type pandas Series

    # get the name attribute
    x_name = x['name']
    y_name = y['name']
    # get the last names (assumes names of the form 'First Last')
    x_last = x_name.split(' ')[1]
    y_last = y_name.split(' ')[1]
    # block the pair (return True) if the last names do not match
    return x_last != y_last
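
Taking the second token with split(' ')[1] works for the two-word names in the sample tables, but it raises an IndexError for a single-word name and picks the middle name when there are three words. A minimal sketch of a guarded variant follows. The names last_name and name_name_function_guarded are hypothetical, and treating the final token as the last name is itself an assumption that will not suit every dataset.

```python
def last_name(full_name):
    # Last whitespace-separated token, or '' for an empty or blank name.
    tokens = full_name.split()
    return tokens[-1] if tokens else ''

def name_name_function_guarded(x, y):
    # Hypothetical variant: block (return True) when last names differ.
    return last_name(x['name']) != last_name(y['name'])

print(last_name('Michael Franklin'))  # Franklin
print(last_name('Madonna'))           # Madonna (no IndexError)
```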

In [29]:
# Instantiate blackbox blocker
bb = em.BlackBoxBlocker()
# Set the black box function
bb.set_black_box_function(name_name_function)

In [30]:
D = bb.block_candset(C)


0%                     100%
[#########################] | ETA: 00:00:00
Total time elapsed: 00:00:00

In [31]:
D


Out[31]:
    _id  ltable_ID  rtable_ID  ltable_name       ltable_address                   rtable_name    rtable_address
7   7    a2         b3         Michael Franklin  1652 Stockton St, San Francisco  Mike Franklin  1652 Stockton St, San Francisco
11  11   a3         b2         William Bridge    3131 Webster St, San Francisco   Bill Bridge    3131 Webster St, San Francisco
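
To see why only these two pairs remain, the last-name comparison can be replayed by hand on the names that appear in the candidate set C. The sketch below uses plain dictionaries rather than the library, with the names copied from the output above; the variable names are only illustrative.

```python
# Names copied from the candidate set C shown earlier.
left_names = {'a1': 'Kevin Smith', 'a2': 'Michael Franklin',
              'a3': 'William Bridge', 'a4': 'Binto George',
              'a5': 'Alphonse Kemper'}
right_names = {'b1': 'Mark Levene', 'b2': 'Bill Bridge',
               'b3': 'Mike Franklin', 'b4': 'Joseph Kuan',
               'b6': 'Michael Brodie'}

# A pair survives only when the second token (the last name) matches.
survivors = [(l_id, r_id)
             for l_id, l_name in left_names.items()
             for r_id, r_name in right_names.items()
             if l_name.split(' ')[1] == r_name.split(' ')[1]]
print(survivors)  # [('a2', 'b3'), ('a3', 'b2')]
```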

Block Two Tuples to Check If a Tuple Pair Would Get Blocked

First, define the black box function:


In [33]:
def address_address_function(x, y):
    # x, y will be of type pandas Series

    # get the address attribute
    x_address = x['address']
    y_address = y['address']
    # get the city (the text after the last comma)
    x_city = x_address.split(',')[-1]
    y_city = y_address.split(',')[-1]
    # block the pair (return True) if the cities do not match
    return x_city != y_city

In [34]:
# Instantiate black box blocker
bb = em.BlackBoxBlocker()
# Set the black box function
bb.set_black_box_function(address_address_function)

In [35]:
A.iloc[[0]]


Out[35]:
   ID  name         birth_year  hourly_wage  address                     zipcode
0  a1  Kevin Smith  1989        30.0         607 From St, San Francisco  94107

In [36]:
B.iloc[[0]]


Out[36]:
   ID  name         birth_year  hourly_wage  address                        zipcode
0  b1  Mark Levene  1987        29.5         108 Clement St, San Francisco  94107

In [38]:
status = bb.block_tuples(A.iloc[0], B.iloc[0])

print(status)


False
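
A status of False means the pair is not blocked: both addresses end in the same city, so the pair would be kept as a candidate. The sketch below replays that check without the library, using plain dicts in place of pandas Series. The a1 and b1 addresses are copied from the outputs above, while the tuple named `other` is invented purely for contrast and does not appear in the sample tables.

```python
def address_address_function(x, y):
    # Same logic as the notebook's function: True means "block this pair".
    return x['address'].split(',')[-1] != y['address'].split(',')[-1]

a1 = {'ID': 'a1', 'address': '607 From St, San Francisco'}
b1 = {'ID': 'b1', 'address': '108 Clement St, San Francisco'}
# Hypothetical tuple, invented for contrast; not in the sample tables.
other = {'ID': 'x1', 'address': '1 Main St, Madison'}

same_city = address_address_function(a1, b1)
different_city = address_address_function(a1, other)
print(same_city, different_city)  # False True
```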