In [36]:
import re
batRegex = re.compile(r'Bat(wo)?man') # The ()? says this group can appear 0 or 1 times to match; it is optional
mo = batRegex.search('The Adventures of Batman')
print(mo.group())
mo = batRegex.search('The Adventures of Batwoman')
print(mo.group())
However, it cannot match multiple repetitions:
In [37]:
mo = batRegex.search('The Adventures of Batwowowowoman')
print(mo) # None: (wo)? allows at most one 'wo', so calling mo.group() here would raise AttributeError
We can use this to find strings that may or may not include elements, like phone numbers with and without area codes.
In [38]:
phoneNumRegex = re.compile(r'\d\d\d\-\d\d\d-\d\d\d\d') # this requires an area code.
mo = phoneNumRegex.search('My number is 415-555-4242') # matches
print(mo.group())
mo2 = phoneNumRegex.search('My number is 555-4242') # will not match
print(mo2)
phoneNumRegex = re.compile(r'(\d\d\d\-)?\d\d\d-\d\d\d\d') # Make first three digits and dash optional
mo = phoneNumRegex.search('My number is 415-555-4242') # matches
print(mo.group())
mo2 = phoneNumRegex.search('My number is 555-4242') # matches
print(mo2.group())
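To see whether the optional part actually took part in a match, you can inspect the group itself. The following is a minimal sketch that reuses the pattern above; the variable names are illustrative only.

```python
import re

# Same idea as above: the area code and its dash form one optional group.
phone_regex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')

with_area = phone_regex.search('My number is 415-555-4242')
without_area = phone_regex.search('My number is 555-4242')

print(with_area.group(1))     # '415-': the optional group was used
print(without_area.group(1))  # None: the optional group was skipped
print(without_area.group())   # '555-4242': the overall match still succeeds
```

A group that did not participate in the match reports None, even though the search as a whole succeeded.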
In [39]:
import re
batRegex = re.compile(r'Bat(wo)*man') # The ()* says this group can appear zero or more times
print(batRegex.search('The Adventures of Batwoman').group())
print(batRegex.search('The Adventures of Batwowowowoman').group())
In [40]:
import re
batRegex = re.compile(r'Bat(wo)+man') # The ()+ says this group must appear one or more times; it is NOT optional
print(batRegex.search('The Adventures of Batwoman').group())
print(batRegex.search('The Adventures of Batwowowowoman').group())
print(batRegex.search('The Adventures of Batman')) # None: at least one 'wo' is required, so .group() would raise AttributeError
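Because search() returns None when nothing matches, calling .group() directly on the result can fail. One way to guard against that is sketched below; first_match is a hypothetical helper written for this example, not part of the re module.

```python
import re

bat_regex = re.compile(r'Bat(wo)+man')

def first_match(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    mo = pattern.search(text)
    return mo.group() if mo is not None else None

print(first_match(bat_regex, 'The Adventures of Batwoman'))  # Batwoman
print(first_match(bat_regex, 'The Adventures of Batman'))    # None
```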
All of these characters can be escaped for literal matches:
In [41]:
import re
batRegex = re.compile(r'\+\*\?') # The +,*, and ? are escaped.
print(batRegex.search('I learned about +*? RegEx syntax').group())
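When the literal text comes from a variable, escaping each character by hand is error-prone. The standard library's re.escape can do it for you; the sketch below assumes the same sample sentence as above.

```python
import re

literal = '+*?'
escaped = re.escape(literal)  # adds a backslash before each special character
print(escaped)                # \+\*\?

pattern = re.compile(escaped)
print(pattern.search('I learned about +*? RegEx syntax').group())  # +*?
```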
In [42]:
haRegex = re.compile(r'(Ha){3}')
print(haRegex.search('HaHaHa').group())
print(haRegex.search('HaHaHaHa').group()) # {3} matches exactly three repetitions, so only 'HaHaHa' is returned
#print(haRegex.search('HaHa').group()) # No Match
phoneRegex = re.compile(r'(\d{3}-)?\d{3}-\d{4}') # {n} avoids repeating \d
phoneRegex.search('My number is 415-555-4242').group()
Out[42]:
This operator can also take the form {x,y} to set a minimum (x) and a maximum (y) number of repetitions.
In [43]:
haRegex = re.compile(r'(Ha){3,5}')
print(haRegex.search('HaHaHa').group())
print(haRegex.search('HaHaHaHa').group())
print(haRegex.search('HaHaHaHaHa').group())
print(haRegex.search('HaHaHaHaHaHaHaHa').group()) # Matches max of 5
haRegex = re.compile(r'(Ha){,5}') # Either bound can be omitted: {,5} means 0 to 5, {3,} means 3 or more
print(haRegex.search('Ha').group())
print(haRegex.search('HaHa').group())
print(haRegex.search('HaHaHa').group())
print(haRegex.search('HaHaHaHa').group())
print(haRegex.search('HaHaHaHaHa').group())
print(haRegex.search('HaHaHaHaHaHaHaHa').group()) # Matches max of 5
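The cell above only drops the lower bound. As a small sketch of the other case, the example below drops the upper bound instead, and also shows that a pattern with no lower bound can match the empty string. The variable names are illustrative.

```python
import re

# No upper bound: three or more repetitions.
at_least_three = re.compile(r'(Ha){3,}')
print(at_least_three.search('HaHaHaHaHaHaHaHa').group())  # all eight 'Ha'
print(at_least_three.search('HaHa'))                      # None: fewer than three

# No lower bound: zero repetitions is allowed, so an empty match is possible.
up_to_five = re.compile(r'(Ha){,5}')
print(repr(up_to_five.search('Hello').group()))           # ''
```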
Regular expressions are greedy by default: a quantifier matches the longest string it can, not the shortest.
In [44]:
haRegex = re.compile(r'(Ha){1,6}') # at least 1, at most 6
print(haRegex.search('HaHaHaHaHaHaHaHa').group()) # Matches longest string; 6
You can do a non-greedy match by placing a ? directly after the closing brace, as in {1,6}?.
In [45]:
haRegex = re.compile(r'(Ha){1,6}?') # The ? after } makes the quantifier non-greedy: prefer the lower bound
print(haRegex.search('HaHaHaHaHaHaHaHa').group()) # Matches shortest string, 1
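The same trailing ? also works after + and *. The sketch below compares greedy and non-greedy forms side by side; the last line is an added illustration that non-greedy means "as few repetitions as still allow a match", not always the minimum.

```python
import re

text = 'HaHaHaHaHaHaHaHa'

print(re.search(r'(Ha){1,6}', text).group())   # HaHaHaHaHaHa (greedy: 6)
print(re.search(r'(Ha){1,6}?', text).group())  # Ha (non-greedy: 1)
print(re.search(r'(Ha)+', text).group() == text)  # True: greedy + takes everything
print(re.search(r'(Ha)+?', text).group())      # Ha

# The quantifier still expands when the rest of the pattern requires it.
print(re.search(r'(Ha){1,6}?!', 'HaHaHa!').group())  # HaHaHa!
```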
The () characters create a group for matching.
The | character matches one of several alternative patterns in a group.
The ? character makes the preceding group optional (0 or 1 matches).
The * character matches the preceding group zero or more times.
The + character matches the preceding group one or more times.
The {m,n} quantifier matches the preceding group at least m and at most n times.
A ? placed after a quantifier, as in {m,n}?, makes the match non-greedy, favoring the fewest repetitions.
The \ character escapes any of these characters for literal matches.
? says the group matches zero or one time.
* says the group matches zero or more times.
+ says the group matches one or more times.
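As a quick recap, the sketch below runs one small pattern per operator from the summary. The sample strings, including the Batmobile alternative used for the | case, are made up for illustration.

```python
import re

# (pattern, text to search, expected match)
examples = [
    (r'Bat(man|mobile)', 'Batmobile lost a wheel', 'Batmobile'),  # | alternatives
    (r'Bat(wo)?man', 'Batman', 'Batman'),                         # ? zero or one
    (r'Bat(wo)*man', 'Batwowoman', 'Batwowoman'),                 # * zero or more
    (r'Bat(wo)+man', 'Batwoman', 'Batwoman'),                     # + one or more
    (r'(Ha){2,3}', 'HaHaHaHa', 'HaHaHa'),                         # {m,n} greedy
    (r'(Ha){2,3}?', 'HaHaHaHa', 'HaHa'),                          # {m,n}? non-greedy
    (r'\(Ha\)', 'literal (Ha)', '(Ha)'),                          # \ escapes
]

for pattern, text, expected in examples:
    found = re.search(pattern, text).group()
    print(f'{pattern!r} on {text!r} -> {found!r}')
```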