This is an example of how to use the mergepurge Python package, working with the same fake data used in the package tests.
The goal is to take a file of contacts with incomplete data for each record and add an email address from a separate file of more complete contact records.
- Load Some Partial Contact Data
- Make Standardized Columns to Match Contacts With
- Examples of data with supported formats (or lack thereof)
- Load Additional Contact Data to Match with the Incomplete Data
- Search for Each Incomplete Contact in the DataFrame of Complete Contacts
- Add Columns from Complete DataFrame to Matching Records of the Incomplete DataFrame
- Cleanup - Get rid of the matching columns
- System Info - The environment used to create this notebook
In [1]:
%load_ext autoreload
%autoreload 2
from IPython.display import HTML
In [2]:
%%html
<!-- make tables display a little nicer in markdown cells -->
<style>table {float:left;}</style>
The data loaded here was created for testing and generated with the barnum Python package.
Data like this are common in purchased email lists, in tools that consume them such as Marketo, and in customer relationship management (CRM) platforms like Salesforce.
In [3]:
import os, sys
import pandas as pd
PARTIAL_PATH = os.path.join('tests', 'incomplete.tsv')
# ensure numbers with potentially leading zeros are read as strings
PARTIAL_DTYPES = {'zipcode': str}
partial = pd.read_csv(PARTIAL_PATH, sep='\t', encoding='utf-8', dtype=PARTIAL_DTYPES)
# drop some unneeded testing columns for this notebook
partial.drop(['ID', 'email', 'change'], axis=1, inplace=True)
In [4]:
partial.head()
Out[4]:
This is where mergepurge can save you lots of time. You just need to define which columns contain the relevant data; they don't have to be named a certain way or contain values in a specific format. Empty values are OK, as are notes and other "non-contact" info appended to values.
| Name | Address 1 | Address 2 | City | State | Zip |
|---|---|---|---|---|---|
| Dr. Leo Spaceman | 30 Rockefeller Plaza | GE Bldg | New York | NY | 10112 |
| Dr. Spaceman | Attn: Leo | 30 Rockefeller Plz | New York | NY | 10112 |
| name | address | city | state |
|---|---|---|---|
| Dr. Leo Spaceman | 30 Rockefeller Plaza, GE Bldg | New York | NY |
| Dr Spaceman | Attn: Leo - 30 Rockefeller Plz | New York | NY |
| title | first_name | last name | Address | City | State |
|---|---|---|---|---|---|
| Leo | Spaceman | 30 Rockefeller Plaza, GE Bldg | New York | NY | |
| Doctor | spaceman | Attn: Leo - 30 Rockefeller Plz | New York | NY | |
| Dr | notavailable | spaceman | 30 Rockefeller Plaza | New York | NY |
All of the above tables of data should work fine as input.
Notice that not only are the column names and sets of columns different, but there are also missing values and pieces of information occurring in the wrong column. All of those scenarios should be handled OK as long as mixed-up data occurs in the same type of column (location, name, or business name) and you pass the columns to build_matching_cols() in the correct order, e.g. contact_cols=['title','first','last'] and not contact_cols=['last','first'].
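To make that concrete, here is a minimal sketch of feeding the third table above through the same API. The DataFrame below just reproduces that table; the argument order mirrors the calls later in this notebook, and passing an empty list for the company columns is an assumption, not something shown in the package's own calls.

import pandas as pd
from mergepurge import clean

# the third example table above, reproduced as a DataFrame
messy = pd.DataFrame({
    'title':      ['Leo', 'Doctor', 'Dr'],
    'first_name': ['Spaceman', 'spaceman', 'notavailable'],
    'last name':  ['30 Rockefeller Plaza, GE Bldg',
                   'Attn: Leo - 30 Rockefeller Plz',
                   'spaceman'],
    'Address':    ['New York', 'New York', '30 Rockefeller Plaza'],
    'City':       ['NY', 'NY', 'New York'],
    'State':      [None, None, 'NY'],
})

# location, contact, and company columns, in that order
messy = clean.build_matching_cols(messy,
                                  ['Address', 'City', 'State'],
                                  ['title', 'first_name', 'last name'],
                                  [])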
Now let's build the standardized columns for the incomplete data we have already loaded.
In [5]:
from mergepurge import clean
PART_LOC_COLS = ['address', 'city', 'state', 'zipcode']
PART_CONTACT_COLS = ['first', 'last']
PART_COMPANY_COLS = ['company']
partial = clean.build_matching_cols(partial,
                                    PART_LOC_COLS,
                                    PART_CONTACT_COLS,
                                    PART_COMPANY_COLS)
In [6]:
partial.info()
The dataframe info shows that a whole bunch of columns were appended to the end of the dataframe, all of them prefixed with aa_.
If you take a look at the data you'll see it isn't useful for much beyond serving as features for a matching algorithm, because many components of the business names and addresses have been omitted. This is done to standardize values as much as possible and to throw away the ambiguous or overly common pieces of information that tend to cause bad matches, e.g. 'LLC' and other business entity types in business names.
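For example, you can peek at just the generated columns with a quick comprehension (plain pandas, nothing package-specific):

# select only the generated aa_ matching columns
built = [col for col in partial.columns if col.startswith('aa_')]
partial[built].head(3)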
In [7]:
HTML( partial.head(3).to_html() )
Out[7]:
This is data you might have in your CRM or wherever you keep track of your existing contacts.
In this example we're using it because it contains the same contacts as our incomplete data, but with various changes made to some fields and with a populated email column.
In [8]:
COMP_PATH = os.path.join('tests', 'complete_parsed.tsv')
COMP_DTYPES = {'aa_streetnum': str, 'aa_zip': str, 'zipcode': str}
complete = pd.read_csv(COMP_PATH, sep='\t', encoding='utf-8', dtype=COMP_DTYPES)
HTML(complete.tail(3).to_html())
Out[8]:
These data have already been run through build_matching_cols(), but if we were going to run it again this is how we would describe which columns contain what info:
In [9]:
COMP_LOC_COLS = ['address', 'city', 'state', 'zipcode']
COMP_CONTACT_COLS = ['first', 'last']
COMP_COMPANY_COLS = ['company']
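For reference, the call itself would look just like the earlier one for the partial data. It is commented out here because complete_parsed.tsv already contains the generated aa_ columns:

# complete = clean.build_matching_cols(complete,
#                                      COMP_LOC_COLS,
#                                      COMP_CONTACT_COLS,
#                                      COMP_COMPANY_COLS)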
In [10]:
from mergepurge import match
matches_found = match.find_related(partial, complete)
So we found a matching record in the dataframe of complete records for all but one of our incomplete records.
The data returned from find_related() is a list. Each item is a tuple (x, y), where x is the type of match the algorithm used for that record and y is a list of indices of the matching records in the dataframe of complete records.
In [11]:
matches_found[0]
Out[11]:
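Given that structure, a quick way to tally the match types yourself is a Counter over the first element of each tuple. This is just a sketch assuming every item really is a (match_type, indices) tuple as described above:

from collections import Counter

# count how many incomplete records were matched by each match type
Counter(match_type for match_type, indices in matches_found)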
Now that the matches are found, you can pull in information from the complete dataframe using merge_lists() and specifying the list of columns you want from it.
Columns that exist in both dataframes will have a suffix added to their names in the result dataframe.
In [12]:
output = match.merge_lists(partial, complete,
                           matching_indices=matches_found,
                           wanted_cols=['email'])
We see some summary output above that tells us how many of the incomplete records were matched by each match type.
We can also check out the end result, with the email column now populated.
In [13]:
output[['email']].info()
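If you want to pull up the one record that didn't match, you could filter on the missing email. This assumes unmatched records were simply left empty rather than filled with a placeholder:

# show the incomplete record that found no match in the complete data
output[output['email'].isnull()]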
In [14]:
output[['first','last','company','email']].head()
Out[14]:
In [15]:
# drop the generated aa_ matching columns now that we're done with them
built_cols = [col for col in output.columns if col.startswith('aa_')]
output.drop(built_cols, axis=1, inplace=True)
In [16]:
output.head()
Out[16]:
In [17]:
try:
    %load_ext watermark
except ImportError as e:
    %install_ext https://raw.githubusercontent.com/rasbt/python_reference/master/ipython_magic/watermark.py
    %load_ext watermark
%watermark