String Matching

The goal of string matching is to find strings that match a given pattern. As we have seen, Pandas provides some useful functions for this job.


In [5]:
import pandas as pd

names = pd.DataFrame({"name" : ["Alice","Bob","Charlie","Dennis"],
                     "surname" : ["Doe","Smith","Sheen","Quaid"]})
names


Out[5]:
name surname
0 Alice Doe
1 Bob Smith
2 Charlie Sheen
3 Dennis Quaid

In [4]:
names.name.str.match(r"A\w+")


Out[4]:
0     True
1    False
2    False
3    False
Name: name, dtype: bool
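
Note that str.match anchors the regex at the beginning of each string, which is why the pattern above matches only names starting with "A". For plain substring search, str.contains is usually the better fit. A minimal sketch with a toy Series:

```python
import pandas as pd

names = pd.Series(["Alice", "Bob", "Charlie", "Dennis"])

# match: the regex must match from the beginning of the string
starts_with_a = names.str.match(r"A\w+")

# contains: the pattern may occur anywhere in the string
has_li = names.str.contains("li")

print(starts_with_a.tolist())  # [True, False, False, False]
print(has_li.tolist())         # [True, False, True, False]
```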

In [7]:
debts = pd.DataFrame({"debtor":["D.Quaid","C.Sheen"],
                     "amount":[100,10000]})
debts


Out[7]:
amount debtor
0 100 D.Quaid
1 10000 C.Sheen

Imagine I want a list of my friends with the amount of money I lent to each of them, together with their names and surnames.


In [10]:
debts["surname"] = debts.debtor.str.extract(r"\w+\.(\w+)", expand=False)
debts


Out[10]:
amount debtor surname
0 100 D.Quaid Quaid
1 10000 C.Sheen Sheen

To merge two DataFrames we can use the merge function (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html), which joins DataFrame objects with a database-style join on columns or indexes.

We can use merge in two different (and very similar) ways:

  • using the DataFrame method: left_df.merge(right_df, ...)
  • using the Pandas function: pd.merge(left_df, right_df, ...)

Common parameters

  • how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
    • left: use only keys from left frame, similar to a SQL left outer join; preserve key order
    • right: use only keys from right frame, similar to a SQL right outer join; preserve key order
    • outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
    • inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys

on : label or list

Field names to join on. Must be found in both DataFrames. If on is None and not merging on indexes, then it merges on the intersection of the columns by default.

left_on : label or list, or array-like

Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns

right_on : label or list, or array-like

Field names to join on in right DataFrame or vector/list of vectors per left_on docs

left_index : boolean, default False

Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels

right_index : boolean, default False

Use the index from the right DataFrame as the join key. Same caveats as left_index
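
As a sketch of the two call styles on toy frames (the column names key, lval, rval are hypothetical), plus the indicator parameter, which is handy for checking where each row came from:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [10, 20, 30]})

# Method form and function form produce the same result
m1 = left.merge(right, on="key", how="inner")
m2 = pd.merge(left, right, on="key", how="inner")
assert m1.equals(m2)

# indicator=True adds a _merge column telling where each row came from
m3 = pd.merge(left, right, on="key", how="outer", indicator=True)
print(m3["_merge"].tolist())  # left_only, both, both, right_only
```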

In [12]:
names.merge(debts, left_on="surname", right_on="surname", how="left")


Out[12]:
name surname amount debtor
0 Alice Doe NaN NaN
1 Bob Smith NaN NaN
2 Charlie Sheen 10000.0 C.Sheen
3 Dennis Quaid 100.0 D.Quaid

In [13]:
names.merge(debts, left_on="surname", right_on="surname", how="right")


Out[13]:
name surname amount debtor
0 Charlie Sheen 10000 C.Sheen
1 Dennis Quaid 100 D.Quaid

In [14]:
names.merge(debts, left_on="surname", right_on="surname", how="inner")


Out[14]:
name surname amount debtor
0 Charlie Sheen 10000 C.Sheen
1 Dennis Quaid 100 D.Quaid

In [15]:
names.merge(debts, left_on="surname", right_on="surname", how="outer")


Out[15]:
name surname amount debtor
0 Alice Doe NaN NaN
1 Bob Smith NaN NaN
2 Charlie Sheen 10000.0 C.Sheen
3 Dennis Quaid 100.0 D.Quaid

In [17]:
names.merge(debts, left_index=True, right_index=True, how="left")


Out[17]:
name surname_x amount debtor surname_y
0 Alice Doe 100.0 D.Quaid Quaid
1 Bob Smith 10000.0 C.Sheen Sheen
2 Charlie Sheen NaN NaN NaN
3 Dennis Quaid NaN NaN NaN
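
The surname_x and surname_y columns above come from merge's suffix handling: when both frames share a non-key column name, pandas appends suffixes, ('_x', '_y') by default, configurable through the suffixes parameter. A minimal sketch with hypothetical frame and column names:

```python
import pandas as pd

left = pd.DataFrame({"surname": ["Doe", "Smith"], "name": ["Alice", "Bob"]})
right = pd.DataFrame({"surname": ["Quaid", "Sheen"], "amount": [100, 10000]})

# Both frames have a "surname" column that is not a join key,
# so merge disambiguates them with the given suffixes
merged = left.merge(right, left_index=True, right_index=True,
                    suffixes=("_names", "_debts"))
print(merged.columns.tolist())
```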

Approximate String Matching

Fuzzy String Matching, also called Approximate String Matching, is the process of finding strings that approximately match a given pattern.

For example, we can have two datasets with information about local municipalities:


In [20]:
lat_lon_mun = pd.read_excel("lat_lon_municipalities.xls", skiprows=2)
lat_lon_mun.head()


Out[20]:
Comunidad Provincia Población Latitud Longitud Altitud Habitantes Hombres Mujeres
0 Andalucía Almería Abla 37.14114 -2.780104 871.16840 1504 783 721
1 Andalucía Almería Abrucena 37.13305 -2.797098 976.93870 1341 682 659
2 Andalucía Almería Adra 36.74807 -3.022522 10.97898 24373 12338 12035
3 Andalucía Almería Albánchez 37.28710 -2.181163 481.31230 815 422 393
4 Andalucía Almería Alboloduy 37.03319 -2.621750 388.43460 674 334 340

In [27]:
mun_codes = pd.read_excel("11codmun.xls", encoding="latin1", skiprows=1) 
mun_codes.head()


Out[27]:
CPRO CMUN DC NOMBRE
0 1 1 4 Alegría-Dulantzi
1 1 2 9 Amurrio
2 1 49 3 Añana
3 1 3 5 Aramaio
4 1 6 6 Armiñón

In [33]:
lat_lon_mun[lat_lon_mun["Población"].str.match(".*anaria.*")]


Out[33]:
Comunidad Provincia Población Latitud Longitud Altitud Habitantes Hombres Mujeres
1595 Canarias Las Palmas Palmas de Gran Canaria (Las) 28.12482 -15.43001 7.271036 381847 186191 195656
1601 Canarias Las Palmas Santa María de Guía de Gran Canaria 28.13915 -15.63284 184.067500 14069 7022 7047
1610 Canarias Las Palmas Valsequillo de Gran Canaria 27.99119 -15.49928 579.217100 9067 4617 4450

In [34]:
mun_codes[mun_codes["NOMBRE"].str.match(".*anaria.*")]


Out[34]:
CPRO CMUN DC NOMBRE
5251 35 16 7 Palmas de Gran Canaria, Las
5257 35 23 4 Santa María de Guía de Gran Canaria
5266 35 31 1 Valsequillo de Gran Canaria

In [37]:
"Valsequillo de Gran Canaria" == "Valsequillo de Gran Canaria"


Out[37]:
True

In [36]:
"Palmas de Gran Canaria (Las)" == "Palmas de Gran Canaria, Las"


Out[36]:
False

The closeness of a match is often measured in terms of edit distance, which is the number of primitive operations necessary to convert the string into an exact match. Primitive operations are usually: insertion (to insert a new character at a given position), deletion (to delete a particular character) and substitution (to replace a character with a new one).
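
This edit distance (the Levenshtein distance) can be computed with a standard dynamic-programming recurrence over the three primitive operations; a minimal sketch, not an optimized implementation:

```python
def levenshtein(a, b):
    # prev[j] holds the edit distance between the current prefix of a
    # and b[:j]; we build the table one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1           # match or substitution
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution/match
        prev = curr
    return prev[-1]

print(levenshtein("NEW YORK METS", "NEW YORK MEATS"))  # 1: one insertion
```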

Fuzzy String Matching can have different practical applications. Typical examples are spell-checking, text re-use detection (the politically correct way of calling plagiarism detection), spam filtering, as well as several applications in the bioinformatics domain, e.g. matching DNA sequences.

FuzzyWuzzy

The main modules in FuzzyWuzzy are called fuzz, for string-to-string comparisons, and process to compare a string with a list of strings.

Under the hood, FuzzyWuzzy uses difflib from the standard library (it can optionally use python-Levenshtein for speed), so beyond FuzzyWuzzy itself there is nothing extra to install.

String Similarity

The simplest way to compare two strings is with a measurement of edit distance. For example, the following two strings are quite similar:

NEW YORK METS
NEW YORK MEATS

Now, according to the documentation of difflib's ratio, on which fuzz.ratio is based:

Return a measure of the sequences' similarity as a float in the range [0, 1]. Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T.

fuzz.ratio scales this value to an integer between 0 and 100.

In [28]:
from fuzzywuzzy import fuzz
fuzz.ratio("NEW YORK METS","NEW YORK MEATS")


Out[28]:
96

In [35]:
fuzz.ratio("Palmas de Gran Canaria (Las)","Palmas de Gran Canaria, Las")


Out[35]:
95
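
We can verify the 2.0*M / T formula by hand with difflib's SequenceMatcher, which is what fuzz.ratio builds on:

```python
from difflib import SequenceMatcher

a, b = "NEW YORK METS", "NEW YORK MEATS"
sm = SequenceMatcher(None, a, b)

# M = total length of the matching blocks, T = combined length
M = sum(block.size for block in sm.get_matching_blocks())
T = len(a) + len(b)

print(M, T)                           # 13 27
print(int(round(2.0 * M / T * 100)))  # 96, the value fuzz.ratio returns
```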

Partial String Similarity

What should we do when two strings are similar because one contains the other?

In this case the plain ratio will be low, but if we knew how to split the longer string, the match would be perfect. Let's see an example:


In [39]:
"San Millán de Yécora" == "Millán de Yécora"


Out[39]:
False

In [40]:
fuzz.ratio("San Millán de Yécora", "Millán de Yécora")


Out[40]:
89

In fact we can have the following situation:


In [42]:
fuzz.ratio("YANKEES", "NEW YORK YANKEES")


Out[42]:
61

In [44]:
fuzz.ratio("NEW YORK METS", "NEW YORK YANKEES")


Out[44]:
76

partial_ratio searches for the best-matching substring of the longer string and returns its ratio:


In [45]:
fuzz.partial_ratio("San Millán de Yécora", "Millán de Yécora")


Out[45]:
100

In [46]:
fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES")


Out[46]:
100

In [47]:
fuzz.partial_ratio("NEW YORK METS", "NEW YORK YANKEES")


Out[47]:
69
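
As a rough sketch of the idea behind partial_ratio (the real implementation picks candidate substrings from matching blocks; this simplification just slides a window of the shorter string's length over the longer string):

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(s1, s2):
    # Slide the shorter string over the longer one and keep the best
    # window ratio, scaled to 0-100 like fuzz.partial_ratio
    shorter, longer = sorted((s1, s2), key=len)
    best = 0.0
    for i in range(len(longer) - len(shorter) + 1):
        window = longer[i:i + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(best * 100))

print(partial_ratio_sketch("YANKEES", "NEW YORK YANKEES"))  # 100
```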

Out of Order

What happens if we have the same tokens but in a different order? Let's look at an example:


In [48]:
s1 = "Las Palmas de Gran Canaria"
s2 = "Gran Canaria, Las Palmas de"
s3 = "Palmas de Gran Canaria, Las"
s4 = "Palmas de Gran Canaria, (Las)"

FuzzyWuzzy provides two ways to deal with this situation:

Token Sort

The token sort approach involves tokenizing the string in question, sorting the tokens alphabetically, and then joining them back into a string. The transformed strings are then compared with a simple ratio().


In [49]:
fuzz.token_sort_ratio("Las Palmas de Gran Canaria", "Palmas de Gran Canaria Las")


Out[49]:
100

In [50]:
fuzz.ratio("Las Palmas de Gran Canaria", "Palmas de Gran Canaria Las")


Out[50]:
85
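
The token sort transformation is easy to sketch by hand (this simplified version splits on whitespace only, whereas FuzzyWuzzy also lowercases and strips punctuation by default):

```python
from difflib import SequenceMatcher

def token_sort_ratio_sketch(s1, s2):
    # Sort the whitespace-separated tokens alphabetically, rejoin,
    # then compare with a plain ratio scaled to 0-100
    t1 = " ".join(sorted(s1.split()))
    t2 = " ".join(sorted(s2.split()))
    return int(round(SequenceMatcher(None, t1, t2).ratio() * 100))

print(token_sort_ratio_sketch("Las Palmas de Gran Canaria",
                              "Palmas de Gran Canaria Las"))  # 100
```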

Token Set

The token set approach is similar, but a little bit more flexible. Here, we tokenize both strings, but instead of immediately sorting and comparing, we split the tokens into two groups: intersection and remainder. We use those sets to build up a comparison string.

t0 = [SORTED_INTERSECTION]
t1 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING1]
t2 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING2]

max(ratio(t0,t1),ratio(t0,t2),ratio(t1,t2))

In [52]:
t0 = ["Canaria,","de","Gran", "Palmas"]
t1 = ["Canaria,","de","Gran", "Palmas"] + ["Las"]
t2 = ["Canaria,","de","Gran", "Palmas"] + ["(Las)"]

In [53]:
fuzz.token_sort_ratio("Palmas de Gran Canaria, Las", "Palmas de Gran Canaria, (Las)")


Out[53]:
100
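
The t0/t1/t2 construction above can be sketched directly (the preprocessing step, lowercasing and dropping punctuation, is a simplified assumption mirroring FuzzyWuzzy's defaults):

```python
import re
from difflib import SequenceMatcher

def _tokens(s):
    # Simplified preprocessing (an assumption, mirroring FuzzyWuzzy's
    # defaults): lowercase and keep only alphanumeric tokens
    return set(re.sub(r"[^\w\s]", " ", s.lower()).split())

def token_set_ratio_sketch(s1, s2):
    t1, t2 = _tokens(s1), _tokens(s2)
    inter = " ".join(sorted(t1 & t2))                     # t0
    c1 = (inter + " " + " ".join(sorted(t1 - t2))).strip()  # t1
    c2 = (inter + " " + " ".join(sorted(t2 - t1))).strip()  # t2
    ratio = lambda a, b: SequenceMatcher(None, a, b).ratio()
    best = max(ratio(inter, c1), ratio(inter, c2), ratio(c1, c2))
    return int(round(best * 100))

print(token_set_ratio_sketch("Palmas de Gran Canaria, Las",
                             "Palmas de Gran Canaria, (Las)"))  # 100
```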

Example

We want to merge mun_codes and lat_lon_mun, so we need a reliable municipality name in both datasets. With those names we can apply:

  • Exact string matching: match the names that appear identically in both datasets
  • Approximate string matching: match the remaining names by highest similarity

Step 1: Explore datasets


In [54]:
mun_codes.shape


Out[54]:
(8116, 4)

In [56]:
mun_codes.head()


Out[56]:
CPRO CMUN DC NOMBRE
0 1 1 4 Alegría-Dulantzi
1 1 2 9 Amurrio
2 1 49 3 Añana
3 1 3 5 Aramaio
4 1 6 6 Armiñón

In [55]:
lat_lon_mun.shape


Out[55]:
(8112, 9)

In [57]:
lat_lon_mun.head()


Out[57]:
Comunidad Provincia Población Latitud Longitud Altitud Habitantes Hombres Mujeres
0 Andalucía Almería Abla 37.14114 -2.780104 871.16840 1504 783 721
1 Andalucía Almería Abrucena 37.13305 -2.797098 976.93870 1341 682 659
2 Andalucía Almería Adra 36.74807 -3.022522 10.97898 24373 12338 12035
3 Andalucía Almería Albánchez 37.28710 -2.181163 481.31230 815 422 393
4 Andalucía Almería Alboloduy 37.03319 -2.621750 388.43460 674 334 340

Step 2: Merge datasets


In [111]:
df1 = mun_codes.merge(lat_lon_mun, left_on="NOMBRE", right_on="Población",how="inner")
df1.head()


Out[111]:
CPRO CMUN DC NOMBRE Comunidad Provincia Población Latitud Longitud Altitud Habitantes Hombres Mujeres
0 1 1 4 Alegría-Dulantzi País Vasco Álava Alegría-Dulantzi 42.84149 -2.513507 561.6857 2620 1353 1267
1 1 2 9 Amurrio País Vasco Álava Amurrio 43.05265 -3.001022 219.6910 10089 5069 5020
2 1 49 3 Añana País Vasco Álava Añana 42.80235 -2.982607 628.5115 176 92 84
3 1 3 5 Aramaio País Vasco Álava Aramaio 43.05400 -2.566000 381.8797 1499 781 718
4 1 6 6 Armiñón País Vasco Álava Armiñón 42.72305 -2.872574 463.5815 207 103 104

Step 3: Create a new variable called match_ratio


In [112]:
df1["match_ratio"] = 100

Step 4: Merge again with original dataset and select those which have no direct match


In [113]:
df2 = mun_codes.merge(df1, left_on="NOMBRE", right_on="NOMBRE", how="left")
df2.head()
df3 = df2.loc[: ,["CPRO_x","CMUN_x","DC_x","NOMBRE","match_ratio"]]
df3.rename(columns={"CPRO_x": "CPRO", "CMUN_x":"CMUN","DC_x":"DC"},inplace=True)
df3.head()


Out[113]:
CPRO CMUN DC NOMBRE match_ratio
0 1 1 4 Alegría-Dulantzi 100.0
1 1 2 9 Amurrio 100.0
2 1 49 3 Añana 100.0
3 1 3 5 Aramaio 100.0
4 1 6 6 Armiñón 100.0

In [115]:
df3.loc[df3.match_ratio.isnull(),:].head()


Out[115]:
CPRO CMUN DC NOMBRE match_ratio
31 1 58 7 Legutio NaN
63 2 14 3 Ballestero, El NaN
69 2 19 4 Bonillo, El NaN
85 2 35 2 Gineta, La NaN
88 2 38 7 Herrera, La NaN

Step 5: Apply approximate string matching


In [127]:
mun_names = lat_lon_mun["Población"].tolist()

def approx_str_compare(x):
    ratio = [fuzz.ratio(x,m) for m in mun_names]
    res = pd.DataFrame({"ratio" : ratio,
                "name": mun_names})
    return res.sort_values(by="ratio",ascending=False).iloc[0,:]
    
df4 = df3.loc[df3.match_ratio.isnull(),"NOMBRE"].map(approx_str_compare)
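
approx_str_compare rescans the entire candidate list for every name. As a sketch of a simpler alternative from the standard library, difflib.get_close_matches returns the best-scoring candidates directly (the candidate list here is a toy for illustration):

```python
from difflib import get_close_matches

candidates = ["Legutiano", "Ballestero (El)", "Amurrio"]

# n=1 keeps only the best candidate; cutoff filters out weak matches
best = get_close_matches("Legutio", candidates, n=1, cutoff=0.6)
print(best)  # ['Legutiano']
```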

Step 6: Concatenate results


In [135]:
df6 = df3.loc[df3.match_ratio.isnull(), :].copy()  # .copy() avoids the SettingWithCopyWarning
df6["match_ratio"] = df4.map(lambda x: x["ratio"])
df6["NOMBRE"] = df4.map(lambda x: x["name"])
df6.head()


Out[135]:
CPRO CMUN DC NOMBRE match_ratio
31 1 58 7 Legutiano 88
63 2 14 3 Ballestero (El) 90
69 2 19 4 Bonillo (El) 87
85 2 35 2 Gineta (La) 86
88 2 38 7 Herrera (La) 87

In [138]:
df7 = pd.concat([df3,df6])

Exercises

  1. Improve the pipeline, taking into account that we haven't checked that all names are unique. What is happening? Can you solve it?

  2. Work with both the NOMBRE and Población variables to improve the results.

  3. Make the approximate string matching procedure more efficient.

  4. Provide the final merged dataset.