Using mergepurge

This is an example of how to use the mergepurge Python package, using the same fake data used in the package's tests.

The goal is to take a file of contacts with incomplete data for each record and add an email address from a separate file of more complete contact records.

Load Some Partial Contact Data

Make Standardized Columns to Match Contacts With

Examples of data with supported formats (or lack thereof)

Load Additional Contact Data to Match with the Incomplete Data

Search for Each Incomplete Contact in the DataFrame of Complete Contacts

Add Columns from Complete DataFrame to Matching Records of the Incomplete DataFrame

Cleanup - Get rid of matching columns.

System Info - The env. used to create this notebook.


In [1]:
%load_ext autoreload
%autoreload 2
from IPython.display import HTML

In [2]:
%%html 
<!-- make tables display a little nicer in markdown cells -->
<style>table {float:left;}</style>


Load Some Partial Contact Data

The data loaded here were created for testing, using the barnum Python package.

Data like these are common in purchased email lists, in tools that use them such as Marketo, and in customer relationship management (CRM) platforms like Salesforce.


In [3]:
import os, sys
import pandas as pd

PARTIAL_PATH = os.path.join('tests', 'incomplete.tsv')

# ensure numbers with potentially leading zeros are read as strings
PARTIAL_DTYPES = {'zipcode': str}

partial = pd.read_csv(PARTIAL_PATH, sep='\t', encoding='utf-8', dtype=PARTIAL_DTYPES)

# drop some unneeded testing columns for this notebook
partial.drop(['ID', 'email', 'change'], axis=1, inplace=True)
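The dtype override above matters. Here is a quick, self-contained sketch (using an in-memory CSV rather than the test file) of what happens to ZIP codes with and without it:

```python
import io

import pandas as pd

raw = "zipcode\n01234\n53812\n"

# Without a dtype override, pandas infers integers and drops the leading zero
as_int = pd.read_csv(io.StringIO(raw))
print(as_int['zipcode'].tolist())   # [1234, 53812]

# Forcing str preserves each zipcode exactly as written
as_str = pd.read_csv(io.StringIO(raw), dtype={'zipcode': str})
print(as_str['zipcode'].tolist())   # ['01234', '53812']
```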

In [4]:
partial.head()


Out[4]:
|   | address | city | company | first | last | state | zipcode |
|---|---------|------|---------|-------|------|-------|---------|
| 0 | 3386 Oakgreen Cove | Kieler | Research Software Limited | NaN | NaN | WI | 53812 |
| 1 | 8042 Tall Pine Trail | Sheffield | Solutions Speed Omega | Robert | Bannister | IL | 61361 |
| 2 | 9683 Memorial Trail | Malvern | Source Analysis Vision | NaN | NaN | PA | 19355 |
| 3 | 4172 Hunters Horn Way Suite 100 | Hoople | NaN | NaN | Donahue | ND | 58243 |
| 4 | 4348 Perkins Parkway | Belvidere | Net West | Josefa | Horsley | NC | 27919 |

Make Standardized Columns to Match Contacts With

This is where mergepurge can save you lots of time. You just need to define which columns contain the relevant data; they don't have to be named a certain way or contain values in a specific format. Empty values are okay, as are notes and other "non-contact" info appended to values.

Examples of data with supported formats - or lack thereof

| Name | Address 1 | Address 2 | City | State | Zip |
|------|-----------|-----------|------|-------|-----|
| Dr. Leo Spaceman | 30 Rockefeller Plaza | GE Bldg | New York | NY | 10112 |
| Dr. Spaceman | Attn: Leo | 30 Rockefeller Plz | New York | NY | 10112 |

| name | address | city | state |
|------|---------|------|-------|
| Dr. Leo Spaceman | 30 Rockefeller Plaza, GE Bldg | New York | NY |
| Dr Spaceman | Attn: Leo - 30 Rockefeller Plz | New York | NY |

| title | first_name | last name | Address | City | State |
|-------|------------|-----------|---------|------|-------|
|  | Leo | Spaceman | 30 Rockefeller Plaza, GE Bldg | New York | NY |
| Doctor |  | spaceman | Attn: Leo - 30 Rockefeller Plz | New York | NY |
| Dr | notavailable | spaceman | 30 Rockefeller Plaza | New York | NY |

All of the above tables of data should work fine as input.

Notice that not only do the tables have different column names and different sets of columns, but they also have missing values and pieces of information occurring in the wrong column. All of those scenarios should be handled fine, as long as mixed-up data occur within the same type of column (location, name, or business name) and you pass the columns to build_matching_cols() in the correct order, e.g. contact_cols=['title', 'first', 'last'] and not contact_cols=['last', 'first'].

Now let's build the standardized columns for the incomplete data we have already loaded.


In [5]:
from mergepurge import clean

PART_LOC_COLS     = ['address', 'city', 'state', 'zipcode']
PART_CONTACT_COLS = ['first', 'last']
PART_COMPANY_COLS = ['company']

partial = clean.build_matching_cols(partial, 
                                    PART_LOC_COLS,
                                    PART_CONTACT_COLS,
                                    PART_COMPANY_COLS)

In [6]:
partial.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 18 columns):
address         19 non-null object
city            19 non-null object
company         12 non-null object
first           14 non-null object
last            13 non-null object
state           19 non-null object
zipcode         19 non-null object
aa_streetnum    19 non-null object
aa_street       19 non-null object
aa_city         19 non-null object
aa_state        19 non-null object
aa_zip          19 non-null object
aa_fulladdy     20 non-null object
aa_title        0 non-null float64
aa_firstname    14 non-null object
aa_lastname     13 non-null object
aa_fullname     12 non-null object
aa_company      20 non-null object
dtypes: float64(1), object(17)
memory usage: 2.9+ KB

The DataFrame info shows that a whole set of new columns was appended to the end of the DataFrame, all of them prefixed with aa_.

If you take a look at the data, you'll see it's not going to be useful for much outside of serving as features for a matching algorithm, because many components of the business names and addresses have been omitted. This is done to standardize values as much as possible and to throw away the ambiguous or overly common pieces of information that tend to cause bad matches, e.g. 'LLC' and other business entity types in business names.
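To make that idea concrete, here is a hypothetical sketch of that kind of normalization; normalize_company and ENTITY_TYPES are illustrative stand-ins, not mergepurge's actual implementation:

```python
import re

# Illustrative only: a small set of business entity-type tokens to discard
ENTITY_TYPES = {'llc', 'inc', 'ltd', 'limited', 'corp', 'co'}

def normalize_company(name):
    """Lowercase, strip punctuation, and drop overly common entity-type tokens."""
    if not isinstance(name, str):
        return None
    tokens = re.sub(r'[^\w\s]', '', name.lower()).split()
    kept = [t for t in tokens if t not in ENTITY_TYPES]
    return ' '.join(kept).title() or None

print(normalize_company('Research Software Limited'))  # Research Software
print(normalize_company('Net West, Inc.'))             # Net West
```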


In [7]:
HTML( partial.head(3).to_html() )


Out[7]:
|   | address | city | company | first | last | state | zipcode | aa_streetnum | aa_street | aa_city | aa_state | aa_zip | aa_fulladdy | aa_title | aa_firstname | aa_lastname | aa_fullname | aa_company |
|---|---------|------|---------|-------|------|-------|---------|--------------|-----------|---------|----------|--------|-------------|----------|--------------|-------------|-------------|------------|
| 0 | 3386 Oakgreen Cove | Kieler | Research Software Limited | NaN | NaN | WI | 53812 | 3386 | Oakgreen | Kieler | WI | 53812 | 3386 Oakgreen Cove Kieler WI 53812 | NaN | NaN | NaN | NaN | Research Software |
| 1 | 8042 Tall Pine Trail | Sheffield | Solutions Speed Omega | Robert | Bannister | IL | 61361 | 8042 | Tall Pine | Sheffield | IL | 61361 | 8042 Tall Pine Trail Sheffield IL 61361 | NaN | Robert | Bannister | Robert Bannister | Solutions Speed Omega |
| 2 | 9683 Memorial Trail | Malvern | Source Analysis Vision | NaN | NaN | PA | 19355 | 9683 | Memorial | Malvern | PA | 19355 | 9683 Memorial Trail Malvern PA 19355 | NaN | NaN | NaN | NaN | Source Analysis Vision |

Load Additional Contact Data to Match with the Incomplete Data

This is data you might have from your existing CRM or however you keep track of your existing contacts.

In this example we're using it because it contains the same contacts as our incomplete data but with various changes made to some fields, and it contains a populated email column.


In [8]:
COMP_PATH = os.path.join('tests', 'complete_parsed.tsv')
COMP_DTYPES = {'aa_streetnum': str, 'aa_zip': str, 'zipcode': str}
complete = pd.read_csv(COMP_PATH, sep='\t', encoding='utf-8', dtype=COMP_DTYPES)

HTML(complete.tail(3).to_html())


Out[8]:
|    | ID | address | city | company | email | first | last | state | zipcode | aa_streetnum | aa_street | aa_city | aa_state | aa_zip | aa_fulladdy | aa_title | aa_firstname | aa_lastname | aa_fullname | aa_company |
|----|----|---------|------|---------|-------|-------|------|-------|---------|--------------|-----------|---------|----------|--------|-------------|----------|--------------|-------------|-------------|------------|
| 97 | 97 | 9013 Dogwood Grove Circle | Gabbs | Systems Speed Net | Linnie@blanditeum.gov | Catina | Easter | NV | 89409 | 9013 | Dogwood Grove | Gabbs | NV | 89409 | 9013 Dogwood Grove Circle Gabbs NV 89409 | NaN | Catina | Easter | Catina Easter | Systems Speed Net |
| 98 | 98 | 2240 Meadow Run Trail | Campbell | Innovation Advanced Vision | Erika@utfeugait.us | Isabella | Ulmer | AL | 36727 | 2240 | Meadow Run | Campbell | AL | 36727 | 2240 Meadow Run Trail Campbell AL 36727 | NaN | Isabella | Ulmer | Isabella Ulmer | Innovation Advanced Vision |
| 99 | 99 | 8428 Misty Creek Street | West Willow | General Research Galaxy | J.o@euismodaliquip.gov | Joaquin | Creed | PA | 17583 | 8428 | Misty Creek | West Willow | PA | 17583 | 8428 Misty Creek Street West Willow PA 17583 | NaN | Joaquin | Creed | Joaquin Creed | General Research Galaxy |

These data have already been run through build_matching_cols(), but if we were going to run it again, this is how we would describe which columns contain what info:


In [9]:
COMP_LOC_COLS     = ['address', 'city', 'state', 'zipcode']
COMP_CONTACT_COLS = ['first', 'last']
COMP_COMPANY_COLS = ['company']

Search for Each Incomplete Contact in the DataFrame of Complete Contacts

After preprocessing each DataFrame of contacts and creating the set of standardized aa_ columns, we're ready to search for matches:


In [10]:
from mergepurge import match

matches_found = match.find_related(partial, complete)


95.0% (19) of search_for records have at least 1 matching record.
0.0% (0) of search_for records have multiple matching records.

So we found a matching record in the DataFrame of complete records for all but one of our incomplete records.

The data returned from find_related() is a list with one item per searched record. Each item is a tuple containing the type of match the algorithm used, the index of the searched record, and the indices of the matching records in the DataFrame of complete records.


In [11]:
matches_found[0]


Out[11]:
('ExactAddress', 0, Int64Index([0], dtype='int64'))
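A list shaped like this is easy to work with in plain Python. The data below is a hypothetical stand-in for find_related() output (with plain lists instead of Int64Index objects), not actual notebook results:

```python
from collections import Counter

# Hypothetical stand-in: (match_type, searched_index, matching_indices)
matches_found = [
    ('ExactAddress', 0, [0]),
    ('ExactAddress', 1, [1]),
    ('ExactNameState', 4, [4]),
    (None, 7, []),  # a record with no match found
]

# Tally how each matched record was found
type_counts = Counter(m[0] for m in matches_found if m[2])
print(type_counts)  # Counter({'ExactAddress': 2, 'ExactNameState': 1})

# Flag records that matched more than one complete record
multiple = [m for m in matches_found if len(m[2]) > 1]
print(len(multiple))  # 0
```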

Add Columns from Complete DataFrame to Matching Records of the Incomplete DataFrame

Now that the matches are found, you can pull information from the complete DataFrame by using merge_lists() and specifying a list of the columns you want from the complete DataFrame.

Columns of the same name will have a suffix added to the column name in the result dataframe.
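This mirrors pandas' own behavior when merging DataFrames with overlapping column names; a minimal pandas example (with made-up data) of the suffixing:

```python
import pandas as pd

left = pd.DataFrame({'key': [0, 1],
                     'company': ['Net West', 'Source Analysis']})
right = pd.DataFrame({'key': [0, 1],
                      'company': ['Net West Inc', 'Source Analysis'],
                      'email': ['a@example.com', 'b@example.com']})

# Overlapping column names get suffixes so both copies survive the merge
merged = left.merge(right, on='key', suffixes=('_x', '_y'))
print(list(merged.columns))  # ['key', 'company_x', 'company_y', 'email']
```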


In [12]:
output = match.merge_lists(partial, complete, 
                           matching_indices=matches_found, 
                           wanted_cols=['email'])


ExactAddress      11
ExactNameState     8
Name: source_type, dtype: int64

The summary output above tells us how many of the incomplete records were matched by each match type.

We can now check out the end result, with the email column populated.


In [13]:
output[['email']].info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 20 entries, 0 to 19
Data columns (total 1 columns):
email    19 non-null object
dtypes: object(1)
memory usage: 320.0+ bytes

In [14]:
output[['first','last','company','email']].head()


Out[14]:
|   | first | last | company | email |
|---|-------|------|---------|-------|
| 0 | NaN | NaN | Research Software Limited | K.a@praesentfacilisi.com |
| 1 | Robert | Bannister | Solutions Speed Omega | F.Ingraham@wisilobortis.edu |
| 2 | NaN | NaN | Source Analysis Vision | Barbara@nullasuscipit.com |
| 3 | NaN | Donahue | NaN | A.l@autemautem.tv |
| 4 | Josefa | Horsley | Net West | J.Horsley@lobortiselitsed.eu |

Cleanup

If you no longer have use for the standardized aa_ matching columns, you can easily remove them:


In [15]:
built_cols = [col for col in output.columns if col.startswith('aa_')]

output.drop(built_cols, axis=1, inplace=True)

In [16]:
output.head()


Out[16]:
|   | address | city | company | first | last | state | zipcode | dest_ID | email | multiple_emails | source_type | src_ID |
|---|---------|------|---------|-------|------|-------|---------|---------|-------|-----------------|-------------|--------|
| 0 | 3386 Oakgreen Cove | Kieler | Research Software Limited | NaN | NaN | WI | 53812 | 0 | K.a@praesentfacilisi.com | False | ExactAddress | Int64Index([0], dtype='int64') |
| 1 | 8042 Tall Pine Trail | Sheffield | Solutions Speed Omega | Robert | Bannister | IL | 61361 | 1 | F.Ingraham@wisilobortis.edu | False | ExactAddress | Int64Index([1], dtype='int64') |
| 2 | 9683 Memorial Trail | Malvern | Source Analysis Vision | NaN | NaN | PA | 19355 | 2 | Barbara@nullasuscipit.com | False | ExactAddress | Int64Index([2], dtype='int64') |
| 3 | 4172 Hunters Horn Way Suite 100 | Hoople | NaN | NaN | Donahue | ND | 58243 | 3 | A.l@autemautem.tv | False | ExactAddress | Int64Index([3], dtype='int64') |
| 4 | 4348 Perkins Parkway | Belvidere | Net West | Josefa | Horsley | NC | 27919 | 4 | J.Horsley@lobortiselitsed.eu | False | ExactNameState | Int64Index([4], dtype='int64') |

System Info


In [17]:
try:
    %load_ext watermark
except ImportError as e:
    %install_ext https://raw.githubusercontent.com/rasbt/python_reference/master/ipython_magic/watermark.py
    %load_ext watermark

%watermark


20/03/2017 18:00:13

CPython 3.4.2
IPython 5.3.0

compiler   : GCC 4.2.1 Compatible Apple LLVM 5.1 (clang-503.0.40)
system     : Darwin
release    : 12.6.0
machine    : x86_64
processor  : i386
CPU cores  : 4
interpreter: 64bit