This tutorial explains how to join two tables A and B using jaccard similarity measure.



In [4]:

    
# Import libraries
import py_stringsimjoin as ssj
import py_stringmatching as sm
import pandas as pd
import os
import sys



In [5]:

    
print('python version ' + sys.version)
print('py_stringsimjoin version: ' + ssj.__version__)
print('py_stringmatching version: ' + sm.__version__)
print('pandas version: ' + pd.__version__)









    



python version 2.7.10 |Anaconda 2.3.0 (64-bit)| (default, May 28 2015, 17:02:03) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
py_stringsimjoin version: 0.1.0
py_stringmatching version: 0.1.0
pandas version: 0.16.2

Loading data

We begin by loading two tables. For the purpose of this tutorial, we use the books dataset that comes with the package.



In [6]:

    
table_A_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'books_A.csv.gz'])
table_B_path = os.sep.join([ssj.get_install_path(), 'datasets', 'data', 'books_B.csv.gz'])



In [7]:

    
# Load csv files as dataframes. Since we are reading a compressed csv file, we provide the compression argument.
# If you are reading an uncompressed csv file, you should not specify the compression argument.
A = pd.read_csv(table_A_path, compression='gzip')
B = pd.read_csv(table_B_path, compression='gzip')
print('Number of records in A: ' + str(len(A)))
print('Number of records in B: ' + str(len(B)))









    



Number of records in A: 3022
Number of records in B: 3099



In [8]:

    
A.head(1)









    Out[8]:






  
    
      
      ID
      Title
      Price
      Author
      ISBN13
      Publisher
      Publication_Date
      Pages
      Dimensions
    
  
  
    
      0
      BN0001
      The Maze Runner Series Complete Collection
      $24.21
      James Dashner
      9780385388894
      Random House Children's Books
      7/8/14
      NaN
      5.60(w) x 8.40(h) x 3.30(d)



In [9]:

    
B.head(1)









    Out[9]:






  
    
      
      ID
      Title
      UsedPrice
      NewPrice
      Author
      ISBN10
      ISBN13
      Publisher
      Publication_Date
      Pages
      Dimensions
    
  
  
    
      0
      HC0001
      The Hunger Games(Hunger Games (Quality))
      $4.98
      $11.02
      Suzanne Collins
      0439023521
      9780439023528
      Scholastic Press
      Jul. 3rd, 2010
      374
      5.30 x 8.00 x 0.90 inches

Profiling data

In order to perform the join, you need to identify on which attribute to perform the join. Using the profiling command, you can inspect which attributes are suitable for join. For example, if an attribute contains many missing values, you may not want to perform join on that attribute.



In [10]:

    
# profile attributes in table A
ssj.profile_table_for_join(A)









    Out[10]:






  
    
      
      Unique values
      Missing values
      Comments
    
    
      Attribute
      
      
      
    
  
  
    
      ID
      3022 (100.0%)
      0 (0.0%)
      This attribute can be used as a key attribute.
    
    
      Title
      2859 (94.61%)
      0 (0.0%)
      
    
    
      Price
      698 (23.1%)
      0 (0.0%)
      
    
    
      Author
      1225 (40.54%)
      0 (0.0%)
      
    
    
      ISBN13
      2869 (94.94%)
      0 (0.0%)
      
    
    
      Publisher
      200 (6.62%)
      0 (0.0%)
      
    
    
      Publication_Date
      1100 (36.4%)
      0 (0.0%)
      
    
    
      Pages
      303 (10.03%)
      12 (0.4%)
      Joining on this attribute will ignore 12 (0.4%) rows.
    
    
      Dimensions
      1764 (58.37%)
      20 (0.66%)
      Joining on this attribute will ignore 20 (0.66%) rows.



In [11]:

    
# profile attributes in table B
ssj.profile_table_for_join(B)









    Out[11]:






  
    
      
      Unique values
      Missing values
      Comments
    
    
      Attribute
      
      
      
    
  
  
    
      ID
      3099 (100.0%)
      0 (0.0%)
      This attribute can be used as a key attribute.
    
    
      Title
      2913 (94.0%)
      0 (0.0%)
      
    
    
      UsedPrice
      320 (10.33%)
      3 (0.1%)
      Joining on this attribute will ignore 3 (0.1%) rows.
    
    
      NewPrice
      883 (28.49%)
      166 (5.36%)
      Joining on this attribute will ignore 166 (5.36%) rows.
    
    
      Author
      1371 (44.24%)
      5 (0.16%)
      Joining on this attribute will ignore 5 (0.16%) rows.
    
    
      ISBN10
      3099 (100.0%)
      0 (0.0%)
      This attribute can be used as a key attribute.
    
    
      ISBN13
      3099 (100.0%)
      0 (0.0%)
      This attribute can be used as a key attribute.
    
    
      Publisher
      367 (11.84%)
      0 (0.0%)
      
    
    
      Publication_Date
      1165 (37.59%)
      0 (0.0%)
      
    
    
      Pages
      510 (16.46%)
      0 (0.0%)
      
    
    
      Dimensions
      2572 (82.99%)
      248 (8.0%)
      Joining on this attribute will ignore 248 (8.0%) rows.

Based on the profile output, we find that the 'Title' attribute in both tables does not contain any missing values. Hence, for the purpose of this tutorial, we will now join tables A and B on 'Title' attribute using jaccard measure. Next, we need to decide on what threshold to use for the join. For this tutorial, we will use a threshold of 0.5. Specifically, the join will now find tuple pairs from A and B such that the jaccard score over the 'Title' attributes is atleast 0.5.

Creating a tokenizer

The next step after loading the tables is to create a tokenizer. A tokenizer is used to tokenize a string into a set of tokens. To create a tokenizer, you can use the different tokenizers provided by py_stringmatching package. A whitespace tokenizer can be created as follows:



In [12]:

    
# create whitespace tokenizer for tokenizing 'Title' attribute. The return_set flag should be set to True since
# jaccard is a set based measure.
ws = sm.WhitespaceTokenizer(return_set=True)

Performing join

Next, you need to perform the join using the following command:



In [13]:

    
# find all pairs from A and B such that the jaccard score on 'Title' is at least 0.5. Setting n_jobs=-1 exploits all
# CPU cores available.
output_pairs = ssj.jaccard_join(A, B, 'ID', 'ID', 'Title', 'Title', ws, 0.5, 
                                l_out_attrs=['Title'], r_out_attrs=['Title'], n_jobs=-1)









    



0%                          100%
[##############################] | ETA[sec]: 0.000 
Total time elapsed: 0.095 sec



In [14]:

    
len(output_pairs)









    Out[14]:





803



In [15]:

    
# examine the output pairs
output_pairs.head()









    Out[15]:






  
    
      
      _id
      l_ID
      r_ID
      l_Title
      r_Title
      _sim_score
    
  
  
    
      0
      0
      BN0019
      HC0003
      Looking for Alaska
      Looking for Alaska
      1.000
    
    
      1
      1
      BN0043
      HC0006
      The Book Thief
      The Book Thief
      1.000
    
    
      2
      2
      BN1684
      HC0006
      The Good Thief
      The Book Thief
      0.500
    
    
      3
      3
      BN0167
      HC0008
      The 7 Habits of Highly Effective Teens Workbook
      The 7 Habits of Highly Effective Teens Workbook
      1.000
    
    
      4
      4
      BN0048
      HC0008
      The 7 Habits of Highly Effective Teens
      The 7 Habits of Highly Effective Teens Workbook
      0.875

	Unique values	Missing values	Comments
Attribute
ID	3022 (100.0%)	0 (0.0%)	This attribute can be used as a key attribute.
Title	2859 (94.61%)	0 (0.0%)
Price	698 (23.1%)	0 (0.0%)
Author	1225 (40.54%)	0 (0.0%)
ISBN13	2869 (94.94%)	0 (0.0%)
Publisher	200 (6.62%)	0 (0.0%)
Publication_Date	1100 (36.4%)	0 (0.0%)
Pages	303 (10.03%)	12 (0.4%)	Joining on this attribute will ignore 12 (0.4%) rows.
Dimensions	1764 (58.37%)	20 (0.66%)	Joining on this attribute will ignore 20 (0.66%) rows.

	Unique values	Missing values	Comments
Attribute
ID	3099 (100.0%)	0 (0.0%)	This attribute can be used as a key attribute.
Title	2913 (94.0%)	0 (0.0%)
UsedPrice	320 (10.33%)	3 (0.1%)	Joining on this attribute will ignore 3 (0.1%) rows.
NewPrice	883 (28.49%)	166 (5.36%)	Joining on this attribute will ignore 166 (5.36%) rows.
Author	1371 (44.24%)	5 (0.16%)	Joining on this attribute will ignore 5 (0.16%) rows.
ISBN10	3099 (100.0%)	0 (0.0%)	This attribute can be used as a key attribute.
ISBN13	3099 (100.0%)	0 (0.0%)	This attribute can be used as a key attribute.
Publisher	367 (11.84%)	0 (0.0%)
Publication_Date	1165 (37.59%)	0 (0.0%)
Pages	510 (16.46%)	0 (0.0%)
Dimensions	2572 (82.99%)	248 (8.0%)	Joining on this attribute will ignore 248 (8.0%) rows.

	_id	l_ID	r_ID	l_Title	r_Title	_sim_score
0	0	BN0019	HC0003	Looking for Alaska	Looking for Alaska	1.000
1	1	BN0043	HC0006	The Book Thief	The Book Thief	1.000
2	2	BN1684	HC0006	The Good Thief	The Book Thief	0.500
3	3	BN0167	HC0008	The 7 Habits of Highly Effective Teens Workbook	The 7 Habits of Highly Effective Teens Workbook	1.000
4	4	BN0048	HC0008	The 7 Habits of Highly Effective Teens	The 7 Habits of Highly Effective Teens Workbook	0.875