Regex


In [ ]:
import re

In [2]:
# regex string: \w matches any word character (letter, digit, or underscore); "+" greedily matches one or more
word = r'\w+'

In [3]:
sentence = "I am Sam; Sam I am."

Let's find all words in the sentence using regex


In [4]:
re.findall(word, sentence)


Out[4]:
['I', 'am', 'Sam', 'Sam', 'I', 'am']
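Note that \w covers more than letters: it also matches digits and the underscore. A small sketch to illustrate, using a made-up sample string (not part of the original sentence):

```python
import re

word = r'\w+'

# Hypothetical sample text, chosen only to show digits and underscores
sample = "user_id = 42;"

tokens = re.findall(word, sample)
print(tokens)  # ['user_id', '42']
```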

search & match


In [7]:
sresult = re.search(word, sentence)

In [8]:
sresult.group()


Out[8]:
'I'

In [9]:
mresult = re.match(word, sentence)

In [10]:
mresult.group()


Out[10]:
'I'

In [11]:
capitalized_word = r'[A-Z]\w+'

In [12]:
sresult = re.search(capitalized_word, sentence)

In [17]:
sresult


Out[17]:
<_sre.SRE_Match at 0x103d7e510>

In [13]:
sresult.group()


Out[13]:
'Sam'

In [14]:
mresult = re.match(capitalized_word, sentence)

In [16]:
mresult

Nothing is displayed! re.match returned None, and we will get an AttributeError if we try to call the group method on None.

The reason is that re.match is anchored at the beginning of the string. As the re.match documentation says:

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.

Note: If you want to locate a match anywhere in string, use search() instead.

re.search is broader: is the word anywhere in the sentence?

re.match is more specific: does the sentence begin with the word? Because it only tries the start of the string, it can also give up sooner when there is no match.
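Since both functions return None when nothing matches, it is safer to check the result before calling group(). A minimal sketch, reusing the pattern and sentence from above; the helper name first_match is hypothetical and only there for illustration:

```python
import re

capitalized_word = r'[A-Z]\w+'
sentence = "I am Sam; Sam I am."

def first_match(finder, pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    result = finder(pattern, text)
    return result.group() if result is not None else None

# re.search scans the whole string; re.match only tries position 0
search_hit = first_match(re.search, capitalized_word, sentence)
match_hit = first_match(re.match, capitalized_word, sentence)

print(search_hit)  # Sam
print(match_hit)   # None
```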

numbers


In [20]:
# digits only

numbers = r'\d+'

In [27]:
nasa_briefing = '''This Saturday at 5:51 a.m. PDT, NASA's Juno \
spacecraft will get closer to the cloud tops of Jupiter than \
at any other time during its prime mission. At the moment of \
closest approach, Juno will be about 2,600 miles (4,200 kilometers) \
above Jupiter's swirling clouds and traveling at 130,000 mph \
(208,000 kilometers per hour) with respect to the planet. '''

In [28]:
re.findall(numbers, nasa_briefing)


Out[28]:
['5', '51', '2', '600', '4', '200', '130', '000', '208', '000']

This does not seem quite right: the "," and ":" characters split apart digits that belong to the same number or time.

Let's fix the issue with the thousands ",":


In [29]:
numbers = r'(\d+,\d+|\d+)'

In [30]:
re.findall(numbers, nasa_briefing)


Out[30]:
['5', '51', '2,600', '4,200', '130,000', '208,000']

Alternatively, we can use a different regex to accomplish the same thing:


In [35]:
numbers = r'(\d*,?\d+)'

In [36]:
re.findall(numbers, nasa_briefing)


Out[36]:
['5', '51', '2,600', '4,200', '130,000', '208,000']

Better result, but our time representation still needs fixing:


In [43]:
numbers = r'(\d*,?:?\d+)'

In [44]:
re.findall(numbers, nasa_briefing)


Out[44]:
['5:51', '2,600', '4,200', '130,000', '208,000']
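The pattern above works on this text, but it is quite permissive: the "," and ":" are both optional and can appear without a leading digit, so it would also accept fragments such as ",5" or "1,:2". One possible stricter variant is sketched below. It assumes times look like H:MM or HH:MM and that thousands separators are followed by exactly three digits; the name strict_numbers and the shortened excerpt are illustrative only.

```python
import re

# Shortened excerpt of the briefing text, for illustration only
excerpt = ("at 5:51 a.m. PDT, about 2,600 miles (4,200 kilometers), "
           "traveling at 130,000 mph (208,000 kilometers per hour)")

loose_numbers = r'(\d*,?:?\d+)'

# Alternatives are tried left to right: time, comma-grouped number, plain digits
strict_numbers = r'\d{1,2}:\d{2}|\d{1,3}(?:,\d{3})+|\d+'

strict_result = re.findall(strict_numbers, excerpt)
print(strict_result)  # ['5:51', '2,600', '4,200', '130,000', '208,000']

# A made-up string with stray punctuation shows the difference
noisy = "a,5 and 1,:2"
loose_noisy = re.findall(loose_numbers, noisy)
strict_noisy = re.findall(strict_numbers, noisy)
print(loose_noisy)   # [',5', '1,:2']
print(strict_noisy)  # ['5', '1', '2']
```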

named groups


In [46]:
city_state = r'(?P<city>[\w\s]+), (?P<state>[A-Z]{2})'

In [47]:
sentence = "1600 Amphitheatre Pkwy, Mountain View, CA 94043"

In [48]:
re.findall(city_state, sentence)


Out[48]:
[(' Mountain View', 'CA')]

In [50]:
for city_st in re.finditer(city_state, sentence):
    print("city: {}".format(city_st.group('city')))
    print("state: {}".format(city_st.group('state')))


city:  Mountain View
state: CA
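Notice the leading space in the city name: [\w\s]+ also matches whitespace, and the match starts right after the comma that follows "Pkwy". A minimal sketch of one way to clean this up, using groupdict() to get the named groups as a dictionary and strip() to drop surrounding whitespace; the variable name fields is just for illustration:

```python
import re

city_state = r'(?P<city>[\w\s]+), (?P<state>[A-Z]{2})'
sentence = "1600 Amphitheatre Pkwy, Mountain View, CA 94043"

fields = None
result = re.search(city_state, sentence)
if result is not None:
    # groupdict() maps each group name to its matched text
    fields = {name: value.strip() for name, value in result.groupdict().items()}

print(fields)  # {'city': 'Mountain View', 'state': 'CA'}
```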
