In [ ]:
import re
In [2]:
# regex string: \w matches any word character (letter, digit, or underscore); "+" greedily matches one or more
word = r'\w+'
In [3]:
sentence = "I am Sam; Sam I am."
Let's find all words in the sentence using regex
In [4]:
re.findall(word, sentence)
Out[4]:
In [7]:
sresult = re.search(word, sentence)
In [8]:
sresult.group()
Out[8]:
In [9]:
mresult = re.match(word, sentence)
In [10]:
mresult.group()
Out[10]:
In [11]:
capitalized_word = r'[A-Z]\w+'
In [12]:
sresult = re.search(capitalized_word, sentence)
In [17]:
sresult
Out[17]:
In [13]:
sresult.group()
Out[13]:
In [14]:
mresult = re.match(capitalized_word, sentence)
In [16]:
mresult
Nothing is returned! The result is None, and we will get an AttributeError if we try to call the group method on None.
The reason is that re.match is anchored at the beginning of the string. As the re.match documentation says:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
Note: If you want to locate a match anywhere in string, use search() instead.
re.search is broader: is the word anywhere in the sentence?
re.match is more specific: does the sentence begin with the word? Because it only tries the pattern at position 0, it can also give up sooner when there is no match.
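Since both functions return None when nothing matches, it is safer to check the result before calling group(). A minimal sketch of that check, reusing the sentence and pattern from above; the helper name first_match is hypothetical and introduced only for illustration:

```python
import re

sentence = "I am Sam; Sam I am."
capitalized_word = r'[A-Z]\w+'

def first_match(func, pattern, text):
    """Return the matched text, or None if func (re.search or re.match) finds nothing."""
    result = func(pattern, text)
    return result.group() if result is not None else None

search_hit = first_match(re.search, capitalized_word, sentence)  # scans the whole string
match_hit = first_match(re.match, capitalized_word, sentence)    # only tries position 0

print(search_hit)  # Sam
print(match_hit)   # None
```

The sentence starts with "I" followed by a space, so the pattern (a capital letter plus at least one more word character) cannot match at position 0, and only re.search finds "Sam".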
In [20]:
# digits only
numbers = r'\d+'
In [27]:
nasa_briefing = '''This Saturday at 5:51 a.m. PDT, NASA's Juno \
spacecraft will get closer to the cloud tops of Jupiter than \
at any other time during its prime mission. At the moment of \
closest approach, Juno will be about 2,600 miles (4,200 kilometers) \
above Jupiter's swirling clouds and traveling at 130,000 mph \
(208,000 kilometers per hour) with respect to the planet. '''
In [28]:
re.findall(numbers, nasa_briefing)
Out[28]:
This is not quite right: the "," and ":" characters split apart numbers that belong together.
Let's fix the issue with the thousands ",":
In [29]:
numbers = r'(\d+,\d+|\d+)'
In [30]:
re.findall(numbers, nasa_briefing)
Out[30]:
Alternatively, we can use a more compact regex that gives the same result on this text:
In [35]:
numbers = r'(\d*,?\d+)'
In [36]:
re.findall(numbers, nasa_briefing)
Out[36]:
A better result, but our time representation still needs fixing:
In [43]:
numbers = r'(\d*,?:?\d+)'
In [44]:
re.findall(numbers, nasa_briefing)
Out[44]:
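Because the leading \d* may match nothing, the last pattern also accepts fragments such as ",600" or ":51" on their own. One possible tightening, offered as a sketch rather than the approach this notebook takes, spells out each shape explicitly: a clock time, a comma-grouped number, or plain digits. The name strict_numbers and the sample strings are assumptions made for illustration:

```python
import re

# Hypothetical stricter pattern: time (h:mm) | thousands-grouped number | plain digits
strict_numbers = r'\d{1,2}:\d{2}|\d{1,3}(?:,\d{3})+|\d+'

sample = "At 5:51 a.m. Juno will be about 2,600 miles up, traveling at 130,000 mph."
found = re.findall(strict_numbers, sample)
print(found)  # ['5:51', '2,600', '130,000']

# A stray separator is no longer absorbed into the match
stray = re.findall(strict_numbers, "see ,600 and :51")
print(stray)  # ['600', '51']
```

The alternatives are tried left to right, so the more specific shapes (time, grouped number) come before the plain \d+ fallback.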
In [46]:
city_state = r'(?P<city>[\w\s]+), (?P<state>[A-Z]{2})'
In [47]:
sentence = "1600 Amphitheatre Pkwy, Mountain View, CA 94043"
In [48]:
re.findall(city_state, sentence)
Out[48]:
In [50]:
for city_st in re.finditer(city_state, sentence):
    print("city: {}".format(city_st.group('city')))
    print("state: {}".format(city_st.group('state')))
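With this pattern the captured city keeps a leading space, because [\w\s]+ can begin on the whitespace that follows the previous comma. A small sketch of two ways to handle that, using groupdict() to read the named groups as a dictionary; the variant pattern city_state_trimmed is an assumption for illustration, not part of the cells above:

```python
import re

city_state = r'(?P<city>[\w\s]+), (?P<state>[A-Z]{2})'
sentence = "1600 Amphitheatre Pkwy, Mountain View, CA 94043"

# Option 1: keep the pattern and strip whitespace from each named group
hit = re.search(city_state, sentence)
cleaned = {name: value.strip() for name, value in hit.groupdict().items()}
print(cleaned)  # {'city': 'Mountain View', 'state': 'CA'}

# Option 2 (hypothetical variant): require the city to begin with a word character
city_state_trimmed = r'(?P<city>\w[\w\s]*), (?P<state>[A-Z]{2})'
trimmed = re.search(city_state_trimmed, sentence).groupdict()
print(trimmed)  # {'city': 'Mountain View', 'state': 'CA'}
```

Stripping is the smaller change; the variant pattern keeps the cleanup inside the regex itself.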