Regex



In [ ]:

    
import re



In [2]:

    
# regex pattern: \w matches any alphanumeric character or underscore; "+" greedily matches one or more
word = r'\w+'



In [3]:

    
sentence = "I am Sam; Sam I am."

Let's find all the words in the sentence using the regex:



In [4]:

    
re.findall(word, sentence)









    Out[4]:





['I', 'am', 'Sam', 'Sam', 'I', 'am']
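As a small illustration of what `\w+` does and does not treat as part of a word, the sketch below runs the same pattern on a made-up string (`sample` is a hypothetical example, not part of the notebook). Digits and underscores count as word characters, while punctuation such as an apostrophe splits a word in two.

```python
import re

word = r'\w+'

# Hypothetical sample text, chosen to show the edge cases of \w
sample = "snake_case x2, don't"

tokens = re.findall(word, sample)
print(tokens)  # ['snake_case', 'x2', 'don', 't']
```

If contractions need to stay together, the pattern itself would have to allow the apostrophe; `\w+` alone will not do it.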

search & match



In [7]:

    
sresult = re.search(word, sentence)



In [8]:

    
sresult.group()









    Out[8]:





'I'



In [9]:

    
mresult = re.match(word, sentence)



In [10]:

    
mresult.group()









    Out[10]:





'I'



In [11]:

    
capitalized_word = r'[A-Z]\w+'



In [12]:

    
sresult = re.search(capitalized_word, sentence)



In [17]:

    
sresult









    Out[17]:





<_sre.SRE_Match at 0x103d7e510>



In [13]:

    
sresult.group()









    Out[13]:





'Sam'



In [14]:

    
mresult = re.match(capitalized_word, sentence)



In [16]:

    
mresult

Nothing is displayed because re.match returned None, and calling the group method on None raises an AttributeError.

The reason is that re.match is anchored at the beginning of the string. As the re.match documentation says:

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.

Note: If you want to locate a match anywhere in string, use search() instead.

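re.search is broader: does the pattern match anywhere in the sentence?

re.match is more specific: does the sentence begin with a match for the pattern? Because it only tries the start of the string, it can give up sooner when there is no match.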
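The difference can be seen side by side in a minimal sketch that reuses the sentence and pattern from above and checks for None before calling group. The loop and the variable `name` are only there for illustration.

```python
import re

sentence = "I am Sam; Sam I am."
capitalized_word = r'[A-Z]\w+'

# re.match only tries position 0, so it returns None here:
# "I" is a capital letter, but it is followed by a space, not a word character
mresult = re.match(capitalized_word, sentence)

# re.search scans forward and returns the first match anywhere in the string
sresult = re.search(capitalized_word, sentence)

# Guard against None before calling .group()
for name, result in [("match", mresult), ("search", sresult)]:
    if result is None:
        print(name, "-> no match")
    else:
        print(name, "->", result.group(), "at", result.span())
```

Testing `result is None` (or simply `if result:`) is the usual way to avoid the AttributeError mentioned above.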

numbers



In [20]:

    
# digits only

numbers = r'\d+'



In [27]:

    
nasa_briefing = '''This Saturday at 5:51 a.m. PDT, NASA's Juno \
spacecraft will get closer to the cloud tops of Jupiter than \
at any other time during its prime mission. At the moment of \
closest approach, Juno will be about 2,600 miles (4,200 kilometers) \
above Jupiter's swirling clouds and traveling at 130,000 mph \
(208,000 kilometers per hour) with respect to the planet. '''



In [28]:

    
re.findall(numbers, nasa_briefing)









    Out[28]:





['5', '51', '2', '600', '4', '200', '130', '000', '208', '000']

This does not seem quite right: the "," and ":" characters split numbers that belong together.

Let's handle the thousands separator "," first:



In [29]:

    
numbers = r'(\d+,\d+|\d+)'



In [30]:

    
re.findall(numbers, nasa_briefing)









    Out[30]:





['5', '51', '2,600', '4,200', '130,000', '208,000']

Alternatively, we can use a different regex that gives the same result on this text:



In [35]:

    
numbers = r'(\d*,?\d+)'



In [36]:

    
re.findall(numbers, nasa_briefing)









    Out[36]:





['5', '51', '2,600', '4,200', '130,000', '208,000']

Better result -- but our time representation still needs fixing:



In [43]:

    
numbers = r'(\d*,?:?\d+)'



In [44]:

    
re.findall(numbers, nasa_briefing)









    Out[44]:





['5:51', '2,600', '4,200', '130,000', '208,000']
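The last pattern works on this briefing, but because every piece before the final `\d+` is optional, it can also pick up a stray "," or ":" that merely precedes a digit. The sketch below compares it with a stricter alternative; the pattern `strict` and the strings `briefing` and `tricky` are assumptions made up for this example, not part of the original notebook.

```python
import re

# Pattern from the cell above, without the outer capture group
loose = r'\d*,?:?\d+'

# Hypothetical stricter pattern: a clock time, OR a number with
# comma-separated groups of three digits, OR a plain run of digits
strict = r'\d{1,2}:\d{2}|\d{1,3}(?:,\d{3})+|\d+'

# Shortened stand-in for the NASA text, plus a deliberately awkward string
briefing = "At 5:51 a.m., about 2,600 miles (4,200 kilometers) up at 130,000 mph."
tricky = "see note,5 and ratio:7"

strict_numbers = re.findall(strict, briefing)
loose_tricky = re.findall(loose, tricky)
strict_tricky = re.findall(strict, tricky)

print(strict_numbers)  # ['5:51', '2,600', '4,200', '130,000']
print(loose_tricky)    # [',5', ':7']
print(strict_tricky)   # ['5', '7']
```

The group in `strict` is non-capturing (`(?:...)`), so re.findall still returns whole matches rather than just the group contents.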

named groups



In [46]:

    
city_state = r'(?P<city>[\w\s]+), (?P<state>[A-Z]{2})'



In [47]:

    
sentence = "1600 Amphitheatre Pkwy, Mountain View, CA 94043"



In [48]:

    
re.findall(city_state, sentence)









    Out[48]:





[(' Mountain View', 'CA')]



In [50]:

    
for city_st in re.finditer(city_state, sentence):
    print("city: {}".format(city_st.group('city')))
    print("state: {}".format(city_st.group('state')))









    



city:  Mountain View
state: CA
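Note the leading space in the city name: `[\w\s]+` also matches the space that follows the first comma. One possible way to avoid it, sketched below, is to require the city to start with a word character. The name `city_state_trimmed` is introduced only for this example; `groupdict()` returns all named groups of a match as a dictionary.

```python
import re

# Variant of city_state: the city must begin with a word character,
# so the space after the preceding comma is no longer captured
city_state_trimmed = r'(?P<city>\w[\w\s]*), (?P<state>[A-Z]{2})'

sentence = "1600 Amphitheatre Pkwy, Mountain View, CA 94043"

found = re.search(city_state_trimmed, sentence)

# groupdict() maps each group name to the text it matched
parts = found.groupdict() if found is not None else {}
print(parts)  # {'city': 'Mountain View', 'state': 'CA'}
```

Calling `.strip()` on the captured city would be another option if the original pattern has to stay unchanged.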


