String Matching

The goal of string matching is to find strings that match a given pattern. As we have seen, Pandas provides some useful functions for this job.


In [5]:
import pandas as pd

names = pd.DataFrame({"name" : ["Alice","Bob","Charlie","Dennis"],
                     "surname" : ["Doe","Smith","Sheen","Quaid"]})
names


Out[5]:
name surname
0 Alice Doe
1 Bob Smith
2 Charlie Sheen
3 Dennis Quaid

In [4]:
names.name.str.match(r"A\w+")


Out[4]:
0     True
1    False
2    False
3    False
Name: name, dtype: bool
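
Note that str.match anchors the regex at the beginning of each string, which is why the pattern above matches only names starting with "A". For plain substring search, str.contains is usually the better fit. A minimal sketch with a toy Series:

```python
import pandas as pd

names = pd.Series(["Alice", "Bob", "Charlie", "Dennis"])

# match: the regex must match from the beginning of the string
starts_with_a = names.str.match(r"A\w+")

# contains: the pattern may occur anywhere in the string
has_li = names.str.contains("li")

print(starts_with_a.tolist())  # [True, False, False, False]
print(has_li.tolist())         # [True, False, True, False]
```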

In [7]:
debts = pd.DataFrame({"debtor":["D.Quaid","C.Sheen"],
                     "amount":[100,10000]})
debts


Out[7]:
amount debtor
0 100 D.Quaid
1 10000 C.Sheen

Imagine I want a list of my friends with the amount of money I lent to each of them, together with their names and surnames.


In [10]:
debts["surname"] = debts.debtor.str.extract(r"\w+\.(\w+)", expand=False)
debts


Out[10]:
amount debtor surname
0 100 D.Quaid Quaid
1 10000 C.Sheen Sheen

To merge two DataFrames we can use the merge function (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.merge.html), which joins DataFrame objects with a database-style join on columns or indexes.

We can use merge in two different (and very similar) ways:

  • using the DataFrame method: left_df.merge(right_df, ...)
  • using the Pandas function: pd.merge(left_df, right_df, ...)

Common parameters

  • how : {‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘inner’
    • left: use only keys from left frame, similar to a SQL left outer join; preserve key order
    • right: use only keys from right frame, similar to a SQL right outer join; preserve key order
    • outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically
    • inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys

on : label or list

Field names to join on. Must be found in both DataFrames. If on is None and not merging on indexes, then it merges on the intersection of the columns by default.

left_on : label or list, or array-like

Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns

right_on : label or list, or array-like

Field names to join on in right DataFrame or vector/list of vectors per left_on docs

left_index : boolean, default False

Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels

right_index : boolean, default False

Use the index from the right DataFrame as the join key. Same caveats as left_index
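
As a sketch of the two call styles on toy frames (the column names key, lval, rval are hypothetical), plus the indicator parameter, which is handy for checking where each row came from:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "rval": [10, 20, 30]})

# Method form and function form produce the same result
m1 = left.merge(right, on="key", how="inner")
m2 = pd.merge(left, right, on="key", how="inner")
assert m1.equals(m2)

# indicator=True adds a _merge column telling where each row came from
m3 = pd.merge(left, right, on="key", how="outer", indicator=True)
print(m3["_merge"].tolist())  # left_only, both, both, right_only
```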

In [12]:
names.merge(debts, left_on="surname", right_on="surname", how="left")


Out[12]:
name surname amount debtor
0 Alice Doe NaN NaN
1 Bob Smith NaN NaN
2 Charlie Sheen 10000.0 C.Sheen
3 Dennis Quaid 100.0 D.Quaid

In [13]:
names.merge(debts, left_on="surname", right_on="surname", how="right")


Out[13]:
name surname amount debtor
0 Charlie Sheen 10000 C.Sheen
1 Dennis Quaid 100 D.Quaid

In [14]:
names.merge(debts, left_on="surname", right_on="surname", how="inner")


Out[14]:
name surname amount debtor
0 Charlie Sheen 10000 C.Sheen
1 Dennis Quaid 100 D.Quaid

In [15]:
names.merge(debts, left_on="surname", right_on="surname", how="outer")


Out[15]:
name surname amount debtor
0 Alice Doe NaN NaN
1 Bob Smith NaN NaN
2 Charlie Sheen 10000.0 C.Sheen
3 Dennis Quaid 100.0 D.Quaid

In [17]:
names.merge(debts, left_index=True, right_index=True, how="left")


Out[17]:
name surname_x amount debtor surname_y
0 Alice Doe 100.0 D.Quaid Quaid
1 Bob Smith 10000.0 C.Sheen Sheen
2 Charlie Sheen NaN NaN NaN
3 Dennis Quaid NaN NaN NaN
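
The surname_x and surname_y columns above come from merge's suffix handling: when both frames share a non-key column name, pandas appends suffixes, ('_x', '_y') by default, configurable through the suffixes parameter. A minimal sketch with hypothetical frame and column names:

```python
import pandas as pd

left = pd.DataFrame({"surname": ["Doe", "Smith"], "name": ["Alice", "Bob"]})
right = pd.DataFrame({"surname": ["Quaid", "Sheen"], "amount": [100, 10000]})

# Both frames have a "surname" column that is not a join key,
# so merge disambiguates them with the given suffixes
merged = left.merge(right, left_index=True, right_index=True,
                    suffixes=("_names", "_debts"))
print(merged.columns.tolist())
```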

Approximate String Matching

Fuzzy String Matching, also called Approximate String Matching, is the process of finding strings that approximately match a given pattern.

For example, we can have two datasets with information about local municipalities:


In [20]:
lat_lon_mun = pd.read_excel("lat_lon_municipalities.xls", skiprows=2)
lat_lon_mun.head()


Out[20]:
Comunidad Provincia Población Latitud Longitud Altitud Habitantes Hombres Mujeres
0 Andalucía Almería Abla 37.14114 -2.780104 871.16840 1504 783 721
1 Andalucía Almería Abrucena 37.13305 -2.797098 976.93870 1341 682 659
2 Andalucía Almería Adra 36.74807 -3.022522 10.97898 24373 12338 12035
3 Andalucía Almería Albánchez 37.28710 -2.181163 481.31230 815 422 393
4 Andalucía Almería Alboloduy 37.03319 -2.621750 388.43460 674 334 340

In [27]:
mun_codes = pd.read_excel("11codmun.xls", encoding="latin1", skiprows=1) 
mun_codes.head()


Out[27]:
CPRO CMUN DC NOMBRE
0 1 1 4 Alegría-Dulantzi
1 1 2 9 Amurrio
2 1 49 3 Añana
3 1 3 5 Aramaio
4 1 6 6 Armiñón

In [33]:
lat_lon_mun[lat_lon_mun["Población"].str.match(".*anaria.*")]


Out[33]:
Comunidad Provincia Población Latitud Longitud Altitud Habitantes Hombres Mujeres
1595 Canarias Las Palmas Palmas de Gran Canaria (Las) 28.12482 -15.43001 7.271036 381847 186191 195656
1601 Canarias Las Palmas Santa María de Guía de Gran Canaria 28.13915 -15.63284 184.067500 14069 7022 7047
1610 Canarias Las Palmas Valsequillo de Gran Canaria 27.99119 -15.49928 579.217100 9067 4617 4450

In [34]:
mun_codes[mun_codes["NOMBRE"].str.match(".*anaria.*")]


Out[34]:
CPRO CMUN DC NOMBRE
5251 35 16 7 Palmas de Gran Canaria, Las
5257 35 23 4 Santa María de Guía de Gran Canaria
5266 35 31 1 Valsequillo de Gran Canaria

In [37]:
"Valsequillo de Gran Canaria" == "Valsequillo de Gran Canaria"


Out[37]:
True

In [36]:
"Palmas de Gran Canaria (Las)" == "Palmas de Gran Canaria, Las"


Out[36]:
False

The closeness of a match is often measured in terms of edit distance, which is the number of primitive operations necessary to convert the string into an exact match. Primitive operations are usually: insertion (to insert a new character at a given position), deletion (to delete a particular character) and substitution (to replace a character with a new one).
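
This edit distance (the Levenshtein distance) can be computed with a standard dynamic-programming recurrence over the three primitive operations; a minimal sketch, not an optimized implementation:

```python
def levenshtein(a, b):
    # prev[j] holds the edit distance between the current prefix of a
    # and b[:j]; we build the table one row at a time
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1           # match or substitution
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution/match
        prev = curr
    return prev[-1]

print(levenshtein("NEW YORK METS", "NEW YORK MEATS"))  # 1: one insertion
```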

Fuzzy String Matching can have different practical applications. Typical examples are spell-checking, text re-use detection (the politically correct way of calling plagiarism detection), spam filtering, as well as several applications in the bioinformatics domain, e.g. matching DNA sequences.

FuzzyWuzzy

The main modules in FuzzyWuzzy are called fuzz, for string-to-string comparisons, and process to compare a string with a list of strings.

Under the hood, FuzzyWuzzy uses difflib from the standard library (it can optionally use python-Levenshtein for speed), so beyond FuzzyWuzzy itself there is nothing extra to install.

String Similarity

The simplest way to compare two strings is with a measurement of edit distance. For example, the following two strings are quite similar:

NEW YORK METS
NEW YORK MEATS

Now, according to the documentation of difflib's ratio, on which fuzz.ratio is based:

Return a measure of the sequences' similarity as a float in the range [0, 1]. Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T.

fuzz.ratio scales this value to an integer between 0 and 100.

In [28]:
from fuzzywuzzy import fuzz
fuzz.ratio("NEW YORK METS","NEW YORK MEATS")


Out[28]:
96

In [35]:
fuzz.ratio("Palmas de Gran Canaria (Las)","Palmas de Gran Canaria, Las")


Out[35]:
95
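
We can verify the 2.0*M / T formula by hand with difflib's SequenceMatcher, which is what fuzz.ratio builds on:

```python
from difflib import SequenceMatcher

a, b = "NEW YORK METS", "NEW YORK MEATS"
sm = SequenceMatcher(None, a, b)

# M = total length of the matching blocks, T = combined length
M = sum(block.size for block in sm.get_matching_blocks())
T = len(a) + len(b)

print(M, T)                           # 13 27
print(int(round(2.0 * M / T * 100)))  # 96, the value fuzz.ratio returns
```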

Partial String Similarity

What should we do when two strings are similar because one contains the other?

In this case the plain ratio will be low, but if we knew how to split the longer string, the match would be perfect. Let's see an example:


In [39]:
"San Millán de Yécora" == "Millán de Yécora"


Out[39]:
False

In [40]:
fuzz.ratio("San Millán de Yécora", "Millán de Yécora")


Out[40]:
89

In fact we can have the following situation:


In [42]:
fuzz.ratio("YANKEES", "NEW YORK YANKEES")


Out[42]:
61

In [44]:
fuzz.ratio("NEW YORK METS", "NEW YORK YANKEES")


Out[44]:
76

partial_ratio searches for the best-matching substring of the longer string and returns its ratio:


In [45]:
fuzz.partial_ratio("San Millán de Yécora", "Millán de Yécora")


Out[45]:
100

In [46]:
fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES")


Out[46]:
100

In [47]:
fuzz.partial_ratio("NEW YORK METS", "NEW YORK YANKEES")


Out[47]:
69
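
As a rough sketch of the idea behind partial_ratio (the real implementation picks candidate substrings from matching blocks; this simplification just slides a window of the shorter string's length over the longer string):

```python
from difflib import SequenceMatcher

def partial_ratio_sketch(s1, s2):
    # Slide the shorter string over the longer one and keep the best
    # window ratio, scaled to 0-100 like fuzz.partial_ratio
    shorter, longer = sorted((s1, s2), key=len)
    best = 0.0
    for i in range(len(longer) - len(shorter) + 1):
        window = longer[i:i + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return int(round(best * 100))

print(partial_ratio_sketch("YANKEES", "NEW YORK YANKEES"))  # 100
```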

Out of Order

What happens if we have the same tokens but in a different order? Let's look at an example:


In [48]:
s1 = "Las Palmas de Gran Canaria"
s2 = "Gran Canaria, Las Palmas de"
s3 = "Palmas de Gran Canaria, Las"
s4 = "Palmas de Gran Canaria, (Las)"

FuzzyWuzzy provides two ways to deal with this situation:

Token Sort

The token sort approach involves tokenizing the string in question, sorting the tokens alphabetically, and then joining them back into a string. The transformed strings are then compared with a simple ratio().


In [49]:
fuzz.token_sort_ratio("Las Palmas de Gran Canaria", "Palmas de Gran Canaria Las")


Out[49]:
100

In [50]:
fuzz.ratio("Las Palmas de Gran Canaria", "Palmas de Gran Canaria Las")


Out[50]:
85
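
The token sort transformation is easy to sketch by hand (this simplified version splits on whitespace only, whereas FuzzyWuzzy also lowercases and strips punctuation by default):

```python
from difflib import SequenceMatcher

def token_sort_ratio_sketch(s1, s2):
    # Sort the whitespace-separated tokens alphabetically, rejoin,
    # then compare with a plain ratio scaled to 0-100
    t1 = " ".join(sorted(s1.split()))
    t2 = " ".join(sorted(s2.split()))
    return int(round(SequenceMatcher(None, t1, t2).ratio() * 100))

print(token_sort_ratio_sketch("Las Palmas de Gran Canaria",
                              "Palmas de Gran Canaria Las"))  # 100
```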

Token Set

The token set approach is similar, but a little bit more flexible. Here, we tokenize both strings, but instead of immediately sorting and comparing, we split the tokens into two groups: intersection and remainder. We use those sets to build up a comparison string.

t0 = [SORTED_INTERSECTION]
t1 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING1]
t2 = [SORTED_INTERSECTION] + [SORTED_REST_OF_STRING2]

max(ratio(t0,t1),ratio(t0,t2),ratio(t1,t2))

In [52]:
t0 = ["Canaria,","de","Gran", "Palmas"]
t1 = ["Canaria,","de","Gran", "Palmas"] + ["Las"]
t2 = ["Canaria,","de","Gran", "Palmas"] + ["(Las)"]

In [53]:
fuzz.token_sort_ratio("Palmas de Gran Canaria, Las", "Palmas de Gran Canaria, (Las)")


Out[53]:
100
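
The t0/t1/t2 construction above can be sketched directly (the preprocessing step, lowercasing and dropping punctuation, is a simplified assumption mirroring FuzzyWuzzy's defaults):

```python
import re
from difflib import SequenceMatcher

def _tokens(s):
    # Simplified preprocessing (an assumption, mirroring FuzzyWuzzy's
    # defaults): lowercase and keep only alphanumeric tokens
    return set(re.sub(r"[^\w\s]", " ", s.lower()).split())

def token_set_ratio_sketch(s1, s2):
    t1, t2 = _tokens(s1), _tokens(s2)
    inter = " ".join(sorted(t1 & t2))                     # t0
    c1 = (inter + " " + " ".join(sorted(t1 - t2))).strip()  # t1
    c2 = (inter + " " + " ".join(sorted(t2 - t1))).strip()  # t2
    ratio = lambda a, b: SequenceMatcher(None, a, b).ratio()
    best = max(ratio(inter, c1), ratio(inter, c2), ratio(c1, c2))
    return int(round(best * 100))

print(token_set_ratio_sketch("Palmas de Gran Canaria, Las",
                             "Palmas de Gran Canaria, (Las)"))  # 100
```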

Example

We want to merge mun_codes and lat_lon_mun, so we need a reliable municipality name in both datasets. With those names we can apply:

  • Exact string matching: match the names that appear identically in both datasets
  • Approximate string matching: match the remaining names by highest similarity

Step 1: Explore datasets


In [54]:
mun_codes.shape


Out[54]:
(8116, 4)

In [56]:
mun_codes.head()


Out[56]:
CPRO CMUN DC NOMBRE
0 1 1 4 Alegría-Dulantzi
1 1 2 9 Amurrio
2 1 49 3 Añana
3 1 3 5 Aramaio
4 1 6 6 Armiñón

In [55]:
lat_lon_mun.shape


Out[55]:
(8112, 9)

In [57]:
lat_lon_mun.head()


Out[57]:
Comunidad Provincia Población Latitud Longitud Altitud Habitantes Hombres Mujeres
0 Andalucía Almería Abla 37.14114 -2.780104 871.16840 1504 783 721
1 Andalucía Almería Abrucena 37.13305 -2.797098 976.93870 1341 682 659
2 Andalucía Almería Adra 36.74807 -3.022522 10.97898 24373 12338 12035
3 Andalucía Almería Albánchez 37.28710 -2.181163 481.31230 815 422 393
4 Andalucía Almería Alboloduy 37.03319 -2.621750 388.43460 674 334 340

Step 2: Merge datasets


In [111]:
df1 = mun_codes.merge(lat_lon_mun, left_on="NOMBRE", right_on="Población",how="inner")
df1.head()


Out[111]:
CPRO CMUN DC NOMBRE Comunidad Provincia Población Latitud Longitud Altitud Habitantes Hombres Mujeres
0 1 1 4 Alegría-Dulantzi País Vasco Álava Alegría-Dulantzi 42.84149 -2.513507 561.6857 2620 1353 1267
1 1 2 9 Amurrio País Vasco Álava Amurrio 43.05265 -3.001022 219.6910 10089 5069 5020
2 1 49 3 Añana País Vasco Álava Añana 42.80235 -2.982607 628.5115 176 92 84
3 1 3 5 Aramaio País Vasco Álava Aramaio 43.05400 -2.566000 381.8797 1499 781 718
4 1 6 6 Armiñón País Vasco Álava Armiñón 42.72305 -2.872574 463.5815 207 103 104

Step 3: Create a new variable called match_ratio


In [112]:
df1["match_ratio"] = 100

Step 4: Merge again with original dataset and select those which have no direct match


In [113]:
df2 = mun_codes.merge(df1, left_on="NOMBRE", right_on="NOMBRE", how="left")
df2.head()
df3 = df2.loc[: ,["CPRO_x","CMUN_x","DC_x","NOMBRE","match_ratio"]]
df3.rename(columns={"CPRO_x": "CPRO", "CMUN_x":"CMUN","DC_x":"DC"},inplace=True)
df3.head()


Out[113]:
CPRO CMUN DC NOMBRE match_ratio
0 1 1 4 Alegría-Dulantzi 100.0
1 1 2 9 Amurrio 100.0
2 1 49 3 Añana 100.0
3 1 3 5 Aramaio 100.0
4 1 6 6 Armiñón 100.0

In [115]:
df3.loc[df3.match_ratio.isnull(),:].head()


Out[115]:
CPRO CMUN DC NOMBRE match_ratio
31 1 58 7 Legutio NaN
63 2 14 3 Ballestero, El NaN
69 2 19 4 Bonillo, El NaN
85 2 35 2 Gineta, La NaN
88 2 38 7 Herrera, La NaN

Step 5: Apply approximate string matching


In [127]:
mun_names = lat_lon_mun["Población"].tolist()

def approx_str_compare(x):
    ratio = [fuzz.ratio(x,m) for m in mun_names]
    res = pd.DataFrame({"ratio" : ratio,
                "name": mun_names})
    return res.sort_values(by="ratio",ascending=False).iloc[0,:]
    
df4 = df3.loc[df3.match_ratio.isnull(),"NOMBRE"].map(approx_str_compare)
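
approx_str_compare rescans the entire candidate list for every name. As a sketch of a simpler alternative from the standard library, difflib.get_close_matches returns the best-scoring candidates directly (the candidate list here is a toy for illustration):

```python
from difflib import get_close_matches

candidates = ["Legutiano", "Ballestero (El)", "Amurrio"]

# n=1 keeps only the best candidate; cutoff filters out weak matches
best = get_close_matches("Legutio", candidates, n=1, cutoff=0.6)
print(best)  # ['Legutiano']
```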

Step 6: Concatenate results


In [135]:
df6 = df3.loc[df3.match_ratio.isnull(), :].copy()  # .copy() avoids the SettingWithCopyWarning
df6["match_ratio"] = df4.map(lambda x: x["ratio"])
df6["NOMBRE"] = df4.map(lambda x: x["name"])
df6.head()


Out[135]:
CPRO CMUN DC NOMBRE match_ratio
31 1 58 7 Legutiano 88
63 2 14 3 Ballestero (El) 90
69 2 19 4 Bonillo (El) 87
85 2 35 2 Gineta (La) 86
88 2 38 7 Herrera (La) 87

In [138]:
df7 = pd.concat([df3,df6])

Exercises

  1. Improve the pipeline, taking into account that we haven't checked that all names are unique. What is happening? Can you solve it?

  2. Work with both the NOMBRE and Población variables to improve the results.

  3. Make the approximate string matching procedure more efficient.

  4. Provide the final merged dataset.