Introduction



In [1]:

    
# Import py_entitymatching package
import py_entitymatching as em
import os
import pandas as pd

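Then, read the (sample) input table. The same CSV file is loaded twice, as A and B, so that each exploration tool below works on its own copy.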



In [2]:

    
# Get the datasets directory
datasets_dir = em.get_install_path() + os.sep + 'datasets'

# Get the path of the input table
path = datasets_dir + os.sep + 'dblp_demo.csv'
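The cell above assembles the path by concatenating pieces with os.sep. As a small aside, the same path can also be built with os.path.join. The sketch below uses a made-up directory (`fake_install`) in place of em.get_install_path(), so it runs without the package installed; that directory name is an assumption for illustration only.

```python
import os

# Hypothetical stand-in for em.get_install_path(); not a real install location.
fake_install = os.path.join('opt', 'py_entitymatching')

# Concatenation with os.sep, mirroring the cell above
path_concat = fake_install + os.sep + 'datasets' + os.sep + 'dblp_demo.csv'

# Equivalent construction with os.path.join
path_joined = os.path.join(fake_install, 'datasets', 'dblp_demo.csv')

print(path_concat == path_joined)  # True
```

Both forms give the same string here; os.path.join simply avoids repeating the separator by hand.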



In [3]:

    
# Read the CSV file twice (tables A and B) and set 'id' as the key attribute
A = em.read_csv_metadata(path, key='id')
B = em.read_csv_metadata(path, key='id')
A.head()









    



Metadata file is not present in the given path; proceeding to read the csv file.
Metadata file is not present in the given path; proceeding to read the csv file.






    Out[3]:







  
    
      
	id	title	authors	venue	year
0	l0	Paradise: A Database System for GIS Applications	Paradise Team	SIGMOD Conference	1995
1	l1	A Query Language and Optimization Techniques for Unstructured Data	Gerd G. Hillebrand, Peter Buneman, Susan B. Davidson, Dan Suciu	SIGMOD Conference	1996
2	l2	Turbo-charging Vertical Mining of Large Databases	Jayant R. Haritsa, Devavrat Shah, S. Sudarshan, Pradeep Shenoy, Mayank Bawa, Gaurav Bhalotia	SIGMOD Conference	2000
3	l3	Maintenance of Data Cubes and Summary Tables in a Warehouse	Inderpal Singh Mumick, Dallan Quass, Barinderpal Singh Mumick	SIGMOD Conference	1997
4	l4	On Relational Support for XML Publishing: Beyond Sorting and Tagging	Raghav Kaushik, Jeffrey F. Naughton, Surajit Chaudhuri	SIGMOD Conference	2003
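The tables are read with `id` as the key attribute. A key is expected to identify each row uniquely and to have no missing values. The check below is a plain-pandas sketch of that requirement, not the library's own validation; `is_valid_key` and `sample` are names invented for this illustration, and the rows are a subset of the table shown above.

```python
import pandas as pd

def is_valid_key(df, key):
    """Return True if `key` is a column of df with no missing and no duplicate values."""
    if key not in df.columns:
        return False
    col = df[key]
    return bool(col.notna().all() and col.is_unique)

# Small hand-built sample taken from the first rows of the table
sample = pd.DataFrame({
    'id': ['l0', 'l1', 'l2'],
    'venue': ['SIGMOD Conference'] * 3,
    'year': [1995, 1996, 2000],
})

print(is_valid_key(sample, 'id'))     # True: unique, no missing values
print(is_valid_key(sample, 'venue'))  # False: the same value repeats
print(is_valid_key(sample, 'ID'))     # False: column names are case-sensitive
```

Running a check like this again after editing a table by hand can catch a key that was accidentally duplicated or blanked.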

Data Exploration

This notebook demonstrates two data exploration tools: OpenRefine and pandastable. OpenRefine is supported for Python 2.7 and 3.5; pandastable is supported only for Python 3.5.

OpenRefine



In [4]:

    
# Invoke the OpenRefine GUI for data exploration
p = em.data_explore_openrefine(A, name='Table')



In [5]:

    
# Save the project back to our DataFrame.
# After export_pandas_frame is called, the OpenRefine project is deleted automatically.
A = p.export_pandas_frame()



In [6]:

    
A.head()









    Out[6]:







  
    
      
	id	title	authors	venue	year
0	l0	You can modify data if necessary using OpenRefine	Paradise Team	SIGMOD Conference	1995
1	l1	A Query Language and Optimization Techniques for Unstructured Data	Gerd G. Hillebrand, Peter Buneman, Susan B. Davidson, Dan Suciu	SIGMOD Conference	1996
2	l2	Turbo-charging Vertical Mining of Large Databases	Jayant R. Haritsa, Devavrat Shah, S. Sudarshan, Pradeep Shenoy, Mayank Bawa, Gaurav Bhalotia	SIGMOD Conference	2000
3	l3	Maintenance of Data Cubes and Summary Tables in a Warehouse	Inderpal Singh Mumick, Dallan Quass, Barinderpal Singh Mumick	SIGMOD Conference	1997
4	l4	On Relational Support for XML Publishing: Beyond Sorting and Tagging	Raghav Kaushik, Jeffrey F. Naughton, Surajit Chaudhuri	SIGMOD Conference	2003
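After editing in OpenRefine, it can be useful to see exactly which cells changed. The sketch below compares a copy taken before the edit with the edited frame, using plain pandas. It assumes both frames have the same index, columns, and dtypes (an exported frame may not preserve dtypes, so that is worth checking first); `before`, `after`, and `changed_cells` are illustrative names, and the data is a hand-built subset of the tables above.

```python
import pandas as pd

# Hand-built "before" frame; in the notebook this would be a copy made prior to editing
before = pd.DataFrame({
    'id': ['l0', 'l1'],
    'title': ['Paradise: A Database System for GIS Applications',
              'A Query Language and Optimization Techniques for Unstructured Data'],
    'year': [1995, 1996],
})

# Simulate the edit made in the GUI
after = before.copy()
after.loc[0, 'title'] = 'You can modify data if necessary using OpenRefine'

# Boolean mask of cells that differ (requires identically labelled frames)
changed = before.ne(after)

# Collect (row label, column name) pairs for every changed cell
changed_cells = [(row, col)
                 for row in changed.index
                 for col in changed.columns
                 if changed.loc[row, col]]

print(changed_cells)  # [(0, 'title')]
```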

Pandastable



In [7]:

    
# Invoke the pandastable GUI for data exploration.
# This call blocks until the GUI is closed.
em.data_explore_pandastable(B)



In [8]:

    
B.head()









    Out[8]:







  
    
      
	id	title	authors	venue	year
0	l0	You can modify data if necessary using pandastable	Paradise Team	SIGMOD Conference	1995
1	l1	A Query Language and Optimization Techniques for Unstructured Data	Gerd G. Hillebrand, Peter Buneman, Susan B. Davidson, Dan Suciu	SIGMOD Conference	1996
2	l2	Turbo-charging Vertical Mining of Large Databases	Jayant R. Haritsa, Devavrat Shah, S. Sudarshan, Pradeep Shenoy, Mayank Bawa, Gaurav Bhalotia	SIGMOD Conference	2000
3	l3	Maintenance of Data Cubes and Summary Tables in a Warehouse	Inderpal Singh Mumick, Dallan Quass, Barinderpal Singh Mumick	SIGMOD Conference	1997
4	l4	On Relational Support for XML Publishing: Beyond Sorting and Tagging	Raghav Kaushik, Jeffrey F. Naughton, Surajit Chaudhuri	SIGMOD Conference	2003
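In the pandastable cells, B shows the edit without being reassigned, which indicates the table is modified in place. If the original values matter, one option is to copy the frame before opening the GUI. The sketch below illustrates this with a hypothetical `edit_in_place` function standing in for the interactive edit; it is not part of py_entitymatching or pandastable, and `B_demo` and `B_backup` are illustrative names.

```python
import pandas as pd

def edit_in_place(df):
    # Hypothetical stand-in for an interactive edit: mutates the frame it is given
    df.loc[0, 'title'] = 'You can modify data if necessary using pandastable'

B_demo = pd.DataFrame({
    'id': ['l0', 'l1'],
    'title': ['Paradise: A Database System for GIS Applications',
              'A Query Language and Optimization Techniques for Unstructured Data'],
})

# DataFrame.copy() makes a deep copy by default, so later edits do not reach it
B_backup = B_demo.copy()

edit_in_place(B_demo)

print(B_demo.loc[0, 'title'])    # the edited title
print(B_backup.loc[0, 'title'])  # the original title is preserved
```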


