This IPython notebook illustrates how to read CSV files from disk as tables and set their metadata.
First, we need to import the py_entitymatching package and other libraries as follows:
In [1]:
import py_entitymatching as em
import pandas as pd
import os, sys
To read a CSV file, we need to specify its path on disk. For convenience, we have included some sample files in the package. The path of a sample CSV file can be obtained like this:
In [2]:
# Get the datasets directory inside the installed package
datasets_dir = os.path.join(em.get_install_path(), 'datasets')
# Get the path of the input table
path_A = os.path.join(datasets_dir, 'person_table_A.csv')
In [3]:
# Display the contents of the file in path_A
!cat $path_A | head -3
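The cell above relies on a Unix shell. A minimal sketch of the same preview in plain Python follows; `preview_lines` is a helper name introduced here for illustration, and the demo writes made-up rows to a temporary file instead of reading the packaged dataset.

```python
import os
import tempfile
from itertools import islice

def preview_lines(path, n=3):
    """Return the first n lines of a text file, without line endings."""
    with open(path, newline='') as handle:
        return [line.rstrip('\r\n') for line in islice(handle, n)]

# Demo on a throwaway file with made-up rows (not the packaged dataset).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    with open(demo_path, 'w') as handle:
        handle.write('ID,name\na1,Alice\na2,Bob\na3,Carol\n')
    for line in preview_lines(demo_path):
        print(line)
```

In the notebook itself, `preview_lines(path_A)` would play the role of the shell command.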
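There are three different ways to read a CSV file and set metadata: read the file first and then set the metadata, read the file and set the metadata in a single step, or read the file and load the metadata from a metadata file on disk.
In the first way, we read the CSV file and then set the metadata. Read the CSV file as follows: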
In [4]:
A = em.read_csv_metadata(path_A)
In [5]:
A.head()
Out[5]:
In [6]:
# Display the 'type' of A
type(A)
Out[6]:
Then set the metadata for the table. We see that ID is the key attribute for the table, since it contains unique values and no value is missing. We can set this metadata as follows:
In [7]:
em.set_key(A, 'ID')
Out[7]:
In [8]:
# Get the metadata that were set for table A
em.get_key(A)
Out[8]:
Now the CSV file has been read into memory and the metadata (i.e., the key) has been set for the table.
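The reasoning that ID qualifies as a key because its values are unique and none is missing can be checked directly with pandas. The sketch below is only an illustration: `is_valid_key` is a hypothetical helper, not part of py_entitymatching, and the table is made up.

```python
import pandas as pd

def is_valid_key(df, column):
    """Return True if `column` has no missing values and no duplicates."""
    values = df[column]
    return bool(values.notna().all() and values.is_unique)

# Hypothetical toy table; 'name' repeats a value, so it cannot be a key.
people = pd.DataFrame({
    'ID': ['a1', 'a2', 'a3'],
    'name': ['Kevin Smith', 'Michael Franklin', 'Kevin Smith'],
})
print(is_valid_key(people, 'ID'))    # True
print(is_valid_key(people, 'name'))  # False
```

Running such a check before setting a key makes it easier to see why a column was rejected or accepted.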
Above, we first read the CSV file and then set the metadata. These two steps can be combined into a single step like this:
In [9]:
A = em.read_csv_metadata(path_A, key='ID')
In [10]:
# Display the 'type' of A
type(A)
Out[10]:
In [11]:
# Get the metadata that were set for the table A
em.get_key(A)
Out[11]:
The user can also specify the metadata in a file.
This file MUST be in the same directory as the CSV file and MUST have the same file name, except that the extension is '.metadata'.
In [12]:
# We set the metadata for table A (stored in person_table_A.csv).
# Get the file name (with full path) where the metadata file must be stored
metadata_file = datasets_dir + os.sep + 'person_table_A.metadata'
# Specify the metadata for table A. Here we specify that 'ID' is the key attribute for the table.
# Note that this step requires write permission to the datasets directory.
!echo '#key=ID' > $metadata_file
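The `echo` redirection above depends on the shell. A minimal sketch of the same step in plain Python follows, assuming only the naming rule stated earlier (same directory and base name as the CSV file, with a '.metadata' extension); `write_key_metadata` is a hypothetical helper, and the demo runs in a temporary directory so the installed datasets are left untouched.

```python
import os
import tempfile

def write_key_metadata(csv_path, key):
    """Write '#key=<key>' to the .metadata file that sits next to csv_path."""
    root, _ext = os.path.splitext(csv_path)
    metadata_path = root + '.metadata'
    with open(metadata_path, 'w') as handle:
        handle.write('#key={}\n'.format(key))
    return metadata_path

# Demo in a temporary directory; only the file name mirrors the notebook.
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'person_table_A.csv')
    written = write_key_metadata(csv_path, 'ID')
    print(os.path.basename(written))  # person_table_A.metadata
    with open(written) as handle:
        print(handle.read().strip())  # #key=ID
```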
In [ ]:
# If you do not have write permission to the datasets directory, first copy the CSV file
# to the local directory and then create a metadata file next to it, like this:
# !cp $path_A .
# metadata_local_file = 'person_table_A.metadata'
# !echo '#key=ID' > $metadata_local_file
In [13]:
# Read the CSV file for table A
A = em.read_csv_metadata(path_A)
In [14]:
# Get the key for table A
em.get_key(A)
Out[14]:
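To make the contents of the metadata file more concrete, here is a rough sketch of how a line such as `#key=ID` could be turned into a name/value pair. It is not py_entitymatching's own parser: `parse_metadata_lines` is a hypothetical function, and the assumption that every metadata line has the form `#name=value` is extrapolated from the single line written above.

```python
def parse_metadata_lines(lines):
    """Collect 'name=value' pairs from lines that start with '#'."""
    metadata = {}
    for line in lines:
        line = line.strip()
        # Skip anything that is not of the assumed '#name=value' form.
        if not line.startswith('#') or '=' not in line:
            continue
        name, value = line[1:].split('=', 1)
        metadata[name.strip()] = value.strip()
    return metadata

print(parse_metadata_lines(['#key=ID\n']))  # {'key': 'ID'}
```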
In [15]:
# Remove the metadata file
!rm $metadata_file
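The `rm` command is also shell-specific and fails if the file is already gone. A small portable sketch of the cleanup follows; `remove_if_exists` is a hypothetical helper, and the demo deletes a throwaway file rather than the real metadata file.

```python
import os
import tempfile

def remove_if_exists(path):
    """Delete `path` if present; return True if a file was removed."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False

# Demo on a throwaway file in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, 'person_table_A.metadata')
    with open(demo_file, 'w') as handle:
        handle.write('#key=ID\n')
    print(remove_if_exists(demo_file))  # True
    print(remove_if_exists(demo_file))  # False
```

In the notebook, `remove_if_exists(metadata_file)` would stand in for the shell command.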