Title: String Munging In Dataframe
Slug: pandas_string_munging
Summary: String Munging In Dataframe
Date: 2016-05-01 12:00
Category: Python
Tags: Data Wrangling
Authors: Chris Albon

import modules


In [1]:
import pandas as pd
import numpy as np
import re as re

Create dataframe


In [2]:
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'], 
        'email': ['jas203@gmail.com', 'momomolly@gmail.com', np.NAN, 'battler@milner.com', 'Ames1234@yahoo.com'], 
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'email', 'preTestScore', 'postTestScore'])
df


Out[2]:
first_name last_name email preTestScore postTestScore
0 Jason Miller jas203@gmail.com 4 25
1 Molly Jacobson momomolly@gmail.com 24 94
2 Tina Ali NaN 31 57
3 Jake Milner battler@milner.com 2 62
4 Amy Cooze Ames1234@yahoo.com 3 70

Which strings in the email column contains 'gmail'


In [3]:
df['email'].str.contains('gmail')


Out[3]:
0     True
1     True
2      NaN
3    False
4    False
Name: email, dtype: object

Create a regular expression pattern that breaks apart emails


In [4]:
pattern = '([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'

Find everything in df.email that contains that pattern


In [5]:
df['email'].str.findall(pattern, flags=re.IGNORECASE)


Out[5]:
0       [(jas203, gmail, com)]
1    [(momomolly, gmail, com)]
2                          NaN
3     [(battler, milner, com)]
4     [(Ames1234, yahoo, com)]
Name: email, dtype: object

Create a pandas series containing the email elements


In [6]:
matches = df['email'].str.match(pattern, flags=re.IGNORECASE)
matches


/Users/chrisralbon/anaconda/lib/python3.5/site-packages/ipykernel/__main__.py:1: FutureWarning: In future versions of pandas, match will change to always return a bool indexer.
  if __name__ == '__main__':
Out[6]:
0       (jas203, gmail, com)
1    (momomolly, gmail, com)
2                        NaN
3     (battler, milner, com)
4     (Ames1234, yahoo, com)
Name: email, dtype: object

Select the domains of the df.email


In [7]:
matches.str[1]


Out[7]:
0     gmail
1     gmail
2       NaN
3    milner
4     yahoo
Name: email, dtype: object