In [1]:
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd
# Set the seed value
seed = 0
In [2]:
# Get the datasets directory
datasets_dir = em.get_install_path() + os.sep + 'datasets'
# List the files in the datasets directory
!ls $datasets_dir
In [3]:
# Set the paths to the input tables and the labeled data
path_A = datasets_dir + os.sep + 'dblp_demo.csv'
path_B = datasets_dir + os.sep + 'acm_demo.csv'
path_labeled_data = datasets_dir + os.sep + 'labeled_data_demo.csv'
In [5]:
# Load the input tables A and B, setting 'id' as the key attribute
A = em.read_csv_metadata(path_A, key='id')
B = em.read_csv_metadata(path_B, key='id')
# Load the pre-labeled data
S = em.read_csv_metadata(path_labeled_data,
key='_id',
ltable=A, rtable=B,
fk_ltable='ltable_id', fk_rtable='rtable_id')
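read_csv_metadata records the key, foreign keys, and the ltable/rtable links in Magellan's catalog. To verify that they were set, the stored metadata can be inspected; a small check using em.show_properties, which prints the catalog entries for a table:
In [ ]:
# Display the metadata stored for S in the catalog
em.show_properties(S)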
Then, split the labeled data into a development set (I) and an evaluation set (J). Use the development set to select the best learning-based matcher.
In [6]:
# Split S into I and J
IJ = em.split_train_test(S, train_proportion=0.5, random_state=0)
I = IJ['train']
J = IJ['test']
This typically involves the following steps:
First, we need to create a set of learning-based matchers. The following matchers are supported in Magellan: (1) decision tree, (2) random forest, (3) Naive Bayes, (4) SVM, (5) logistic regression, and (6) linear regression. The cell below instantiates five of them; a sketch for the Naive Bayes matcher follows it.
In [7]:
# Create a set of ML-matchers
dt = em.DTMatcher(name='DecisionTree', random_state=0)
svm = em.SVMMatcher(name='SVM', random_state=0)
rf = em.RFMatcher(name='RF', random_state=0)
lg = em.LogRegMatcher(name='LogReg', random_state=0)
ln = em.LinRegMatcher(name='LinReg')
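The list above also mentions Naive Bayes, which the cell does not instantiate. A minimal sketch, assuming em.NBMatcher accepts a name argument like the other matcher constructors:
In [ ]:
# Create a Naive Bayes matcher (sketch; no random_state here, since
# fitting Gaussian Naive Bayes is deterministic)
nb = em.NBMatcher(name='NaiveBayes')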
In [8]:
# Generate a set of features
F = em.get_features_for_matching(A, B, validate_inferred_attr_types=False)
We observe that 20 features were generated. Let's take a look at their names.
In [9]:
# List the names of the generated features
F.feature_name
Out[9]:
In [10]:
# Convert I into a set of feature vectors using F
H = em.extract_feature_vecs(I,
feature_table=F,
attrs_after='label',
show_progress=False)
In [11]:
# Display first few rows
H.head()
Out[11]:
In [12]:
# Check if the feature vectors contain missing values
# A return value of True means that there are missing values
pd.isnull(H).values.any()
Out[12]:
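To see which columns contribute the missing values, plain pandas (not a py_entitymatching API) gives a per-column count:
In [ ]:
# Count the missing values in each column of H (standard pandas)
H.isnull().sum()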
We observe that the extracted feature vectors contain missing values. We have to impute the missing values so that the learning-based matchers can fit their models correctly. For the purposes of this guide, we impute the missing values in a column with the mean of the values in that column.
In [13]:
# Impute feature vectors with the mean of the column values.
H = em.impute_table(H,
exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
strategy='mean')
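For intuition, mean imputation simply replaces each missing entry with the mean of its column. A minimal plain-pandas sketch of the same idea, assuming all feature columns are numeric (em.impute_table additionally carries over the table's catalog metadata, which this sketch does not):
In [ ]:
# Plain-pandas equivalent of mean imputation (sketch only)
feat_cols = H.columns.difference(['_id', 'ltable_id', 'rtable_id', 'label'])
H[feat_cols] = H[feat_cols].fillna(H[feat_cols].mean())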
In [14]:
# Select the best ML matcher using CV
result = em.select_matcher([dt, rf, svm, ln, lg], table=H,
exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
k=5,
target_attr='label', metric_to_select_matcher='f1', random_state=0)
result['cv_stats']
Out[14]:
In [15]:
result['drill_down_cv_stats']['precision']
Out[15]:
In [16]:
result['drill_down_cv_stats']['recall']
Out[16]:
In [17]:
result['drill_down_cv_stats']['f1']
Out[17]:
In [18]:
# Split H into P and Q
PQ = em.split_train_test(H, train_proportion=0.5, random_state=0)
P = PQ['train']
Q = PQ['test']
In [19]:
# Debug RF matcher using GUI
em.vis_debug_rf(rf, P, Q,
exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
target_attr='label')
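vis_debug_rf opens a GUI and therefore needs a desktop environment. As a non-GUI sanity check, one can instead fit the matcher on P, predict on Q, and evaluate the predictions; a sketch using the fit/predict/eval_matches API:
In [ ]:
# Fit the RF matcher on P (non-GUI alternative sketch)
rf.fit(table=P,
       exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
       target_attr='label')
# Predict on Q, appending a 'predicted' column to the output
pred = rf.predict(table=Q,
                  exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
                  target_attr='predicted',
                  append=True,
                  inplace=False)
# Compare the predictions against the gold labels
eval_result = em.eval_matches(pred, 'label', 'predicted')
em.print_eval_summary(eval_result)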
In [20]:
# Create a feature declaratively and add it to F:
# Jaccard on the whitespace-tokenized, lowercased concatenation of title and authors
sim = em.get_sim_funs_for_matching()
tok = em.get_tokenizers_for_matching()
feature_string = """jaccard(wspace((ltuple['title'] + ' ' + ltuple['authors']).lower()),
wspace((rtuple['title'] + ' ' + rtuple['authors']).lower()))"""
feature = em.get_feature_fn(feature_string, sim, tok)
# Add feature to F
em.add_feature(F, 'jac_ws_title_authors', feature)
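As a quick sanity check, the new feature can be applied to a single tuple pair. This assumes the dict returned by get_feature_fn exposes the compiled function under the 'function' key, which is an implementation detail of py_entitymatching; verify against your installed version:
In [ ]:
# Apply the feature to the first tuple pair (assumes a 'function' key)
feature['function'](A.iloc[0], B.iloc[0])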
Out[20]:
In [21]:
# Convert I into feature vectors using the updated F
H = em.extract_feature_vecs(I,
feature_table=F,
attrs_after='label',
show_progress=False)
In [22]:
# Check whether the updated F improves the Random Forest matcher
result = em.select_matcher([rf], table=H,
exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
k=5,
target_attr='label', metric_to_select_matcher='f1', random_state=0)
result['drill_down_cv_stats']['f1']
Out[22]:
In [23]:
# Select the best matcher again using CV
result = em.select_matcher([dt, rf, svm, ln, lg], table=H,
exclude_attrs=['_id', 'ltable_id', 'rtable_id', 'label'],
k=5,
target_attr='label', metric_to_select_matcher='f1', random_state=0)
result['cv_stats']
Out[23]:
In [24]:
result['drill_down_cv_stats']['f1']
Out[24]: