This tutorial explains how to join two tables A and B using jaccard similarity measure.
In [4]:
# Import libraries
import py_stringsimjoin as ssj
import py_stringmatching as sm
import pandas as pd
import os
import sys
In [5]:
print('python version ' + sys.version)
print('py_stringsimjoin version: ' + ssj.__version__)
print('py_stringmatching version: ' + sm.__version__)
print('pandas version: ' + pd.__version__)
We begin by loading two tables. For the purpose of this tutorial, we use the books dataset that comes with the package.
In [6]:
table_A_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'books_A.csv.gz'])
table_B_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'books_B.csv.gz'])
In [7]:
# Load csv files as dataframes. Since we are reading a compressed csv file, we provide the compression argument.
# If you are reading an uncompressed csv file, you should not specify the compression argument.
A = pd.read_csv(table_A_path, compression='gzip')
B = pd.read_csv(table_B_path, compression='gzip')
print('Number of records in A: ' + str(len(A)))
print('Number of records in B: ' + str(len(B)))
In [8]:
A.head(1)
Out[8]:
In [9]:
B.head(1)
Out[9]:
In order to perform the join, you need to identify on which attribute to perform the join. Using the profiling command, you can inspect which attributes are suitable for join. For example, if an attribute contains many missing values, you may not want to perform join on that attribute.
In [10]:
# profile attributes in table A
ssj.profile_table_for_join(A)
Out[10]:
In [11]:
# profile attributes in table B
ssj.profile_table_for_join(B)
Out[11]:
Based on the profile output, we find that the 'Title' attribute in both tables does not contain any missing values. Hence, for the purpose of this tutorial, we will now join tables A and B on 'Title' attribute using jaccard measure. Next, we need to decide on what threshold to use for the join. For this tutorial, we will use a threshold of 0.5. Specifically, the join will now find tuple pairs from A and B such that the jaccard score over the 'Title' attributes is atleast 0.5.
The next step after loading the tables is to create a tokenizer. A tokenizer is used to tokenize a string into a set of tokens. To create a tokenizer, you can use the different tokenizers provided by py_stringmatching package. A whitespace tokenizer can be created as follows:
In [12]:
# create whitespace tokenizer for tokenizing 'Title' attribute. The return_set flag should be set to True since
# jaccard is a set based measure.
ws = sm.WhitespaceTokenizer(return_set=True)
Next, you need to perform the join using the following command:
In [13]:
# find all pairs from A and B such that the jaccard score on 'Title' is at least 0.5. Setting n_jobs=-1 exploits all
# CPU cores available.
output_pairs = ssj.jaccard_join(A, B, 'ID', 'ID', 'Title', 'Title', ws, 0.5,
l_out_attrs=['Title'], r_out_attrs=['Title'], n_jobs=-1)
In [14]:
len(output_pairs)
Out[14]:
In [15]:
# examine the output pairs
output_pairs.head()
Out[15]: