Using mergepurge

This is an example of how to use the mergepurge Python package, using the same fake data used in the package's tests.

The goal is to take a file of contacts with incomplete data for each record and add an email address from a separate file of more complete contact records.

Load Some Partial Contact Data

Make Standardized Columns to Match Contacts With

Examples of data with supported formats (or lack thereof)

Load Additional Contact Data to Match with the Incomplete Data

Search for Each Incomplete Contact in the DataFrame of Complete Contacts

Add Columns from Complete DataFrame to Matching Records of the Incomplete DataFrame

Cleanup - Get rid of matching columns.

System Info - The env. used to create this notebook.


In [1]:
%load_ext autoreload
%autoreload 2
from IPython.display import HTML

In [2]:
%%html 
<!-- make tables display a little nicer in markdown cells -->
<style>table {float:left;}</style>


Load Some Partial Contact Data

The data loaded here were created for testing, using the barnum Python package.

Data like these are common in purchased email lists, in tools that use them such as Marketo, and in customer relationship management (CRM) platforms like Salesforce.


In [3]:
import os, sys
import pandas as pd

PARTIAL_PATH = os.path.join('tests', 'incomplete.tsv')

# ensure numbers with potentially leading zeros are read as strings
PARTIAL_DTYPES = {'zipcode': str}

partial = pd.read_csv(PARTIAL_PATH, sep='\t', encoding='utf-8', dtype=PARTIAL_DTYPES)

# drop some unneeded testing columns for this notebook
partial.drop(['ID', 'email', 'change'], axis=1, inplace=True)
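The dtype override above matters. Here is a quick, self-contained sketch (using an in-memory CSV rather than the test file) of what happens to ZIP codes with and without it:

```python
import io

import pandas as pd

raw = "zipcode\n01234\n53812\n"

# Without a dtype override, pandas infers integers and drops the leading zero
as_int = pd.read_csv(io.StringIO(raw))
print(as_int['zipcode'].tolist())   # [1234, 53812]

# Forcing str preserves each zipcode exactly as written
as_str = pd.read_csv(io.StringIO(raw), dtype={'zipcode': str})
print(as_str['zipcode'].tolist())   # ['01234', '53812']
```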

In [4]:
partial.head()


Out[4]:
|   | address | city | company | first | last | state | zipcode |
|---|---------|------|---------|-------|------|-------|---------|
| 0 | 3386 Oakgreen Cove | Kieler | Research Software Limited | NaN | NaN | WI | 53812 |
| 1 | 8042 Tall Pine Trail | Sheffield | Solutions Speed Omega | Robert | Bannister | IL | 61361 |
| 2 | 9683 Memorial Trail | Malvern | Source Analysis Vision | NaN | NaN | PA | 19355 |
| 3 | 4172 Hunters Horn Way Suite 100 | Hoople | NaN | NaN | Donahue | ND | 58243 |
| 4 | 4348 Perkins Parkway | Belvidere | Net West | Josefa | Horsley | NC | 27919 |

Make Standardized Columns to Match Contacts With

This is where mergepurge can save you lots of time. You just need to define which columns contain the relevant data; they don't have to be named a certain way or contain values in a specific format. Empty values are okay, as are notes and other "non-contact" info appended to values.

Examples of data with supported formats - or lack thereof

| Name | Address 1 | Address 2 | City | State | Zip |
|------|-----------|-----------|------|-------|-----|
| Dr. Leo Spaceman | 30 Rockefeller Plaza | GE Bldg | New York | NY | 10112 |
| Dr. Spaceman | Attn: Leo | 30 Rockefeller Plz | New York | NY | 10112 |

| name | address | city | state |
|------|---------|------|-------|
| Dr. Leo Spaceman | 30 Rockefeller Plaza, GE Bldg | New York | NY |
| Dr Spaceman | Attn: Leo - 30 Rockefeller Plz | New York | NY |

| title | first_name | last name | Address | City | State |
|-------|------------|-----------|---------|------|-------|
|  | Leo | Spaceman | 30 Rockefeller Plaza, GE Bldg | New York | NY |
| Doctor |  | spaceman | Attn: Leo - 30 Rockefeller Plz | New York | NY |
| Dr | notavailable | spaceman | 30 Rockefeller Plaza | New York | NY |

All of the above tables of data should work fine as input.

Notice that not only do the tables have different column names and different sets of columns, but they also have missing values and pieces of information occurring in the wrong column. All of those scenarios should be handled fine, as long as mixed-up data occur within the same type of column (location, name, or business name) and you pass the columns to build_matching_cols() in the correct order, e.g. contact_cols=['title', 'first', 'last'] and not contact_cols=['last', 'first'].

Now let's build the standardized columns for the incomplete data we have already loaded.


In [5]:
from mergepurge import clean

PART_LOC_COLS     = ['address', 'city', 'state', 'zipcode']
PART_CONTACT_COLS = ['first', 'last']
PART_COMPANY_COLS = ['company']

partial = clean.build_matching_cols(partial, 
                                    PART_LOC_COLS,
                                    PART_CONTACT_COLS,
                                    PART_COMPANY_COLS)

In [6]:
partial.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 18 columns):
address         19 non-null object
city            19 non-null object
company         12 non-null object
first           14 non-null object
last            13 non-null object
state           19 non-null object
zipcode         19 non-null object
aa_streetnum    19 non-null object
aa_street       19 non-null object
aa_city         19 non-null object
aa_state        19 non-null object
aa_zip          19 non-null object
aa_fulladdy     20 non-null object
aa_title        0 non-null float64
aa_firstname    14 non-null object
aa_lastname     13 non-null object
aa_fullname     12 non-null object
aa_company      20 non-null object
dtypes: float64(1), object(17)
memory usage: 2.9+ KB

The DataFrame info shows that a whole set of new columns was appended to the end of the DataFrame, all of them prefixed with aa_.

If you take a look at the data, you'll see it's not going to be useful for much outside of serving as features for a matching algorithm, because many components of the business names and addresses have been omitted. This is done to standardize values as much as possible and to throw away the ambiguous or overly common pieces of information that tend to cause bad matches, e.g. 'LLC' and other business entity types in business names.
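To make that idea concrete, here is a hypothetical sketch of that kind of normalization; normalize_company and ENTITY_TYPES are illustrative stand-ins, not mergepurge's actual implementation:

```python
import re

# Illustrative only: a small set of business entity-type tokens to discard
ENTITY_TYPES = {'llc', 'inc', 'ltd', 'limited', 'corp', 'co'}

def normalize_company(name):
    """Lowercase, strip punctuation, and drop overly common entity-type tokens."""
    if not isinstance(name, str):
        return None
    tokens = re.sub(r'[^\w\s]', '', name.lower()).split()
    kept = [t for t in tokens if t not in ENTITY_TYPES]
    return ' '.join(kept).title() or None

print(normalize_company('Research Software Limited'))  # Research Software
print(normalize_company('Net West, Inc.'))             # Net West
```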


In [7]:
HTML( partial.head(3).to_html() )


Out[7]:
|   | address | city | company | first | last | state | zipcode | aa_streetnum | aa_street | aa_city | aa_state | aa_zip | aa_fulladdy | aa_title | aa_firstname | aa_lastname | aa_fullname | aa_company |
|---|---------|------|---------|-------|------|-------|---------|--------------|-----------|---------|----------|--------|-------------|----------|--------------|-------------|-------------|------------|
| 0 | 3386 Oakgreen Cove | Kieler | Research Software Limited | NaN | NaN | WI | 53812 | 3386 | Oakgreen | Kieler | WI | 53812 | 3386 Oakgreen Cove Kieler WI 53812 | NaN | NaN | NaN | NaN | Research Software |
| 1 | 8042 Tall Pine Trail | Sheffield | Solutions Speed Omega | Robert | Bannister | IL | 61361 | 8042 | Tall Pine | Sheffield | IL | 61361 | 8042 Tall Pine Trail Sheffield IL 61361 | NaN | Robert | Bannister | Robert Bannister | Solutions Speed Omega |
| 2 | 9683 Memorial Trail | Malvern | Source Analysis Vision | NaN | NaN | PA | 19355 | 9683 | Memorial | Malvern | PA | 19355 | 9683 Memorial Trail Malvern PA 19355 | NaN | NaN | NaN | NaN | Source Analysis Vision |

Load Additional Contact Data to Match with the Incomplete Data

This is data you might have from your existing CRM or however you keep track of your existing contacts.

In this example we're using it because it contains the same contacts as our incomplete data but with various changes made to some fields, and it contains a populated email column.


In [8]:
COMP_PATH = os.path.join('tests', 'complete_parsed.tsv')
COMP_DTYPES = {'aa_streetnum': str, 'aa_zip': str, 'zipcode': str}
complete = pd.read_csv(COMP_PATH, sep='\t', encoding='utf-8', dtype=COMP_DTYPES)

HTML(complete.tail(3).to_html())


Out[8]:
|    | ID | address | city | company | email | first | last | state | zipcode | aa_streetnum | aa_street | aa_city | aa_state | aa_zip | aa_fulladdy | aa_title | aa_firstname | aa_lastname | aa_fullname | aa_company |
|----|----|---------|------|---------|-------|-------|------|-------|---------|--------------|-----------|---------|----------|--------|-------------|----------|--------------|-------------|-------------|------------|
| 97 | 97 | 9013 Dogwood Grove Circle | Gabbs | Systems Speed Net | Linnie@blanditeum.gov | Catina | Easter | NV | 89409 | 9013 | Dogwood Grove | Gabbs | NV | 89409 | 9013 Dogwood Grove Circle Gabbs NV 89409 | NaN | Catina | Easter | Catina Easter | Systems Speed Net |
| 98 | 98 | 2240 Meadow Run Trail | Campbell | Innovation Advanced Vision | Erika@utfeugait.us | Isabella | Ulmer | AL | 36727 | 2240 | Meadow Run | Campbell | AL | 36727 | 2240 Meadow Run Trail Campbell AL 36727 | NaN | Isabella | Ulmer | Isabella Ulmer | Innovation Advanced Vision |
| 99 | 99 | 8428 Misty Creek Street | West Willow | General Research Galaxy | J.o@euismodaliquip.gov | Joaquin | Creed | PA | 17583 | 8428 | Misty Creek | West Willow | PA | 17583 | 8428 Misty Creek Street West Willow PA 17583 | NaN | Joaquin | Creed | Joaquin Creed | General Research Galaxy |

These data have already been run through build_matching_cols(), but if we were going to run it again, this is how we would describe which columns contain what info:


In [9]:
COMP_LOC_COLS     = ['address', 'city', 'state', 'zipcode']
COMP_CONTACT_COLS = ['first', 'last']
COMP_COMPANY_COLS = ['company']

Search for Each Incomplete Contact in the DataFrame of Complete Contacts

After preprocessing each DataFrame of contacts and creating the set of standardized aa_ columns, we're ready to search for matches:


In [10]:
from mergepurge import match

matches_found = match.find_related(partial, complete)


95.0% (19) of search_for records have at least 1 matching record.
0.0% (0) of search_for records have multiple matching records.

So we found a matching record in the DataFrame of complete records for all but one of our incomplete records.

The data returned from find_related() is a list with one item per searched record. Each item is a tuple containing the type of match the algorithm used, the index of the searched record, and the indices of the matching records in the DataFrame of complete records.


In [11]:
matches_found[0]


Out[11]:
('ExactAddress', 0, Int64Index([0], dtype='int64'))
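A list shaped like this is easy to work with in plain Python. The data below is a hypothetical stand-in for find_related() output (with plain lists instead of Int64Index objects), not actual notebook results:

```python
from collections import Counter

# Hypothetical stand-in: (match_type, searched_index, matching_indices)
matches_found = [
    ('ExactAddress', 0, [0]),
    ('ExactAddress', 1, [1]),
    ('ExactNameState', 4, [4]),
    (None, 7, []),  # a record with no match found
]

# Tally how each matched record was found
type_counts = Counter(m[0] for m in matches_found if m[2])
print(type_counts)  # Counter({'ExactAddress': 2, 'ExactNameState': 1})

# Flag records that matched more than one complete record
multiple = [m for m in matches_found if len(m[2]) > 1]
print(len(multiple))  # 0
```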

Add Columns from Complete DataFrame to Matching Records of the Incomplete DataFrame

Now that the matches are found, you can pull information from the complete DataFrame by using merge_lists() and specifying a list of the columns you want from the complete DataFrame.

Columns of the same name will have a suffix added to the column name in the result dataframe.
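This mirrors pandas' own behavior when merging DataFrames with overlapping column names; a minimal pandas example (with made-up data) of the suffixing:

```python
import pandas as pd

left = pd.DataFrame({'key': [0, 1],
                     'company': ['Net West', 'Source Analysis']})
right = pd.DataFrame({'key': [0, 1],
                      'company': ['Net West Inc', 'Source Analysis'],
                      'email': ['a@example.com', 'b@example.com']})

# Overlapping column names get suffixes so both copies survive the merge
merged = left.merge(right, on='key', suffixes=('_x', '_y'))
print(list(merged.columns))  # ['key', 'company_x', 'company_y', 'email']
```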


In [12]:
output = match.merge_lists(partial, complete, 
                           matching_indices=matches_found, 
                           wanted_cols=['email'])


ExactAddress      11
ExactNameState     8
Name: source_type, dtype: int64

The summary output above tells us how many of the incomplete records were matched by each match type.

We can now check out the end result, with the email column populated.


In [13]:
output[['email']].info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to 19
Data columns (total 1 columns):
email    19 non-null object
dtypes: object(1)
memory usage: 320.0+ bytes

In [14]:
output[['first','last','company','email']].head()


Out[14]:
|   | first | last | company | email |
|---|-------|------|---------|-------|
| 0 | NaN | NaN | Research Software Limited | K.a@praesentfacilisi.com |
| 1 | Robert | Bannister | Solutions Speed Omega | F.Ingraham@wisilobortis.edu |
| 2 | NaN | NaN | Source Analysis Vision | Barbara@nullasuscipit.com |
| 3 | NaN | Donahue | NaN | A.l@autemautem.tv |
| 4 | Josefa | Horsley | Net West | J.Horsley@lobortiselitsed.eu |

Cleanup

If you no longer have use for the standardized aa_ matching columns, you can easily remove them:


In [15]:
built_cols = [col for col in output.columns if col.startswith('aa_')]

output.drop(built_cols, axis=1, inplace=True)

In [16]:
output.head()


Out[16]:
|   | address | city | company | first | last | state | zipcode | dest_ID | email | multiple_emails | source_type | src_ID |
|---|---------|------|---------|-------|------|-------|---------|---------|-------|-----------------|-------------|--------|
| 0 | 3386 Oakgreen Cove | Kieler | Research Software Limited | NaN | NaN | WI | 53812 | 0 | K.a@praesentfacilisi.com | False | ExactAddress | Int64Index([0], dtype='int64') |
| 1 | 8042 Tall Pine Trail | Sheffield | Solutions Speed Omega | Robert | Bannister | IL | 61361 | 1 | F.Ingraham@wisilobortis.edu | False | ExactAddress | Int64Index([1], dtype='int64') |
| 2 | 9683 Memorial Trail | Malvern | Source Analysis Vision | NaN | NaN | PA | 19355 | 2 | Barbara@nullasuscipit.com | False | ExactAddress | Int64Index([2], dtype='int64') |
| 3 | 4172 Hunters Horn Way Suite 100 | Hoople | NaN | NaN | Donahue | ND | 58243 | 3 | A.l@autemautem.tv | False | ExactAddress | Int64Index([3], dtype='int64') |
| 4 | 4348 Perkins Parkway | Belvidere | Net West | Josefa | Horsley | NC | 27919 | 4 | J.Horsley@lobortiselitsed.eu | False | ExactNameState | Int64Index([4], dtype='int64') |

System Info


In [17]:
try:
    %load_ext watermark
except ImportError as e:
    %install_ext https://raw.githubusercontent.com/rasbt/python_reference/master/ipython_magic/watermark.py
    %load_ext watermark

%watermark


20/03/2017 18:00:13

CPython 3.4.2
IPython 5.3.0

compiler   : GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)
system     : Darwin
release    : 12.6.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit