This tutorial explains how to join two tables A and B using jaccard similarity measure.


In [4]:
# Import libraries
import py_stringsimjoin as ssj
import py_stringmatching as sm
import pandas as pd
import os
import sys

In [5]:
print('python version ' + sys.version)
print('py_stringsimjoin version: ' + ssj.__version__)
print('py_stringmatching version: ' + sm.__version__)
print('pandas version: ' + pd.__version__)


python version 2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
py_stringsimjoin version: 0.1.0
py_stringmatching version: 0.1.0
pandas version: 0.16.2

Loading data

We begin by loading two tables. For the purpose of this tutorial, we use the books dataset that comes with the package.


In [6]:
table_A_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'books_A.csv.gz'])
table_B_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'books_B.csv.gz'])

In [7]:
# Load csv files as dataframes. Since we are reading a compressed csv file, we provide the compression argument.
# If you are reading an uncompressed csv file, you should not specify the compression argument.
A = pd.read_csv(table_A_path, compression='gzip')
B = pd.read_csv(table_B_path, compression='gzip')
print('Number of records in A: ' + str(len(A)))
print('Number of records in B: ' + str(len(B)))


Number of records in A: 3022
Number of records in B: 3099

In [8]:
A.head(1)


Out[8]:
ID Title Price Author ISBN13 Publisher Publication_Date Pages Dimensions
0 BN0001 The Maze Runner Series Complete Collection $24.21 James Dashner 9780385388894 Random House Children's Books 7/8/14 NaN 5.60(w) x 8.40(h) x 3.30(d)

In [9]:
B.head(1)


Out[9]:
ID Title UsedPrice NewPrice Author ISBN10 ISBN13 Publisher Publication_Date Pages Dimensions
0 HC0001 The Hunger Games(Hunger Games (Quality)) $4.98 $11.02 Suzanne Collins 0439023521 9780439023528 Scholastic Press Jul. 3rd, 2010 374 5.30 x 8.00 x 0.90 inches

Profiling data

In order to perform the join, you need to identify on which attribute to perform the join. Using the profiling command, you can inspect which attributes are suitable for join. For example, if an attribute contains many missing values, you may not want to perform join on that attribute.


In [10]:
# profile attributes in table A
ssj.profile_table_for_join(A)


Out[10]:
Unique values Missing values Comments
Attribute
ID 3022 (100.0%) 0 (0.0%) This attribute can be used as a key attribute.
Title 2859 (94.61%) 0 (0.0%)
Price 698 (23.1%) 0 (0.0%)
Author 1225 (40.54%) 0 (0.0%)
ISBN13 2869 (94.94%) 0 (0.0%)
Publisher 200 (6.62%) 0 (0.0%)
Publication_Date 1100 (36.4%) 0 (0.0%)
Pages 303 (10.03%) 12 (0.4%) Joining on this attribute will ignore 12 (0.4%) rows.
Dimensions 1764 (58.37%) 20 (0.66%) Joining on this attribute will ignore 20 (0.66%) rows.

In [11]:
# profile attributes in table B
ssj.profile_table_for_join(B)


Out[11]:
Unique values Missing values Comments
Attribute
ID 3099 (100.0%) 0 (0.0%) This attribute can be used as a key attribute.
Title 2913 (94.0%) 0 (0.0%)
UsedPrice 320 (10.33%) 3 (0.1%) Joining on this attribute will ignore 3 (0.1%) rows.
NewPrice 883 (28.49%) 166 (5.36%) Joining on this attribute will ignore 166 (5.36%) rows.
Author 1371 (44.24%) 5 (0.16%) Joining on this attribute will ignore 5 (0.16%) rows.
ISBN10 3099 (100.0%) 0 (0.0%) This attribute can be used as a key attribute.
ISBN13 3099 (100.0%) 0 (0.0%) This attribute can be used as a key attribute.
Publisher 367 (11.84%) 0 (0.0%)
Publication_Date 1165 (37.59%) 0 (0.0%)
Pages 510 (16.46%) 0 (0.0%)
Dimensions 2572 (82.99%) 248 (8.0%) Joining on this attribute will ignore 248 (8.0%) rows.

Based on the profile output, we find that the 'Title' attribute in both tables does not contain any missing values. Hence, for the purpose of this tutorial, we will now join tables A and B on 'Title' attribute using jaccard measure. Next, we need to decide on what threshold to use for the join. For this tutorial, we will use a threshold of 0.5. Specifically, the join will now find tuple pairs from A and B such that the jaccard score over the 'Title' attributes is atleast 0.5.

Creating a tokenizer

The next step after loading the tables is to create a tokenizer. A tokenizer is used to tokenize a string into a set of tokens. To create a tokenizer, you can use the different tokenizers provided by py_stringmatching package. A whitespace tokenizer can be created as follows:


In [12]:
# create whitespace tokenizer for tokenizing 'Title' attribute. The return_set flag should be set to True since
# jaccard is a set based measure.
ws = sm.WhitespaceTokenizer(return_set=True)

Performing join

Next, you need to perform the join using the following command:


In [13]:
# find all pairs from A and B such that the jaccard score on 'Title' is at least 0.5. Setting n_jobs=-1 exploits all
# CPU cores available.
output_pairs = ssj.jaccard_join(A, B, 'ID', 'ID', 'Title', 'Title', ws, 0.5, 
                                l_out_attrs=['Title'], r_out_attrs=['Title'], n_jobs=-1)


0%                          100%
[##############################] | ETA[sec]: 0.000 
Total time elapsed: 0.095 sec

In [14]:
len(output_pairs)


Out[14]:
803

In [15]:
# examine the output pairs
output_pairs.head()


Out[15]:
_id l_ID r_ID l_Title r_Title _sim_score
0 0 BN0019 HC0003 Looking for Alaska Looking for Alaska 1.000
1 1 BN0043 HC0006 The Book Thief The Book Thief 1.000
2 2 BN1684 HC0006 The Good Thief The Book Thief 0.500
3 3 BN0167 HC0008 The 7 Habits of Highly Effective Teens Workbook The 7 Habits of Highly Effective Teens Workbook 1.000
4 4 BN0048 HC0008 The 7 Habits of Highly Effective Teens The 7 Habits of Highly Effective Teens Workbook 0.875