Lesson 25:

RegEx groups and the Pipe Character

The | pipe character matches one of several alternative patterns, but sometimes you need a group to repeat a certain number of times.

The '?' Regex Operator

The ? RegEx operator makes the preceding group optional, so it matches 0 or 1 times:



In [36]:

    
import re

batRegex = re.compile(r'Bat(wo)?man') # The ()? says this group can appear 0 or 1 times to match; it is optional

mo = batRegex.search('The Adventures of Batman')
print(mo.group())

mo = batRegex.search('The Adventures of Batwoman')
print(mo.group())









    



Batman
Batwoman

However, it cannot match multiple repetitions. In that case search() returns None, so calling group() on the result raises an AttributeError:



In [37]:

    
mo = batRegex.search('The Adventures of Batwowowowoman')
print(mo.group())









    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-37-3ca83bd4a86e> in <module>()
      1 mo = batRegex.search('The Adventures of Batwowowowoman')
----> 2 print(mo.group())

AttributeError: 'NoneType' object has no attribute 'group'
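
The traceback above comes from calling group() on None. A minimal sketch of one way to guard against that is shown here; the helper name first_match and the variable optional_wo are made up for this illustration and are not part of the lesson's code.

```python
import re

# Same pattern as the lesson, under a different (hypothetical) name
optional_wo = re.compile(r'Bat(wo)?man')

def first_match(pattern, text):
    """Return the matched text, or None when the pattern is not found."""
    mo = pattern.search(text)
    return mo.group() if mo is not None else None

print(first_match(optional_wo, 'The Adventures of Batwoman'))        # Batwoman
print(first_match(optional_wo, 'The Adventures of Batwowowowoman'))  # None
```

Checking for None before calling group() keeps a failed search from stopping the program.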

We can use this to find strings that may or may not include certain elements, such as phone numbers with and without area codes.



In [38]:

    
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # This requires an area code; the dash needs no escaping

mo = phoneNumRegex.search('My number is 415-555-4242') # matches
print(mo.group())

mo2 = phoneNumRegex.search('My number is 555-4242') # will not match
print(mo2)


phoneNumRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d') # Make the first three digits and the dash optional

mo = phoneNumRegex.search('My number is 415-555-4242') # matches
print(mo.group())
mo2 = phoneNumRegex.search('My number is 555-4242') # matches
print(mo2.group())









    



415-555-4242
None
415-555-4242
555-4242
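
It can also help to see what the optional group itself captured. The sketch below assumes the same optional-area-code pattern under a hypothetical name, area_code_regex, and prints group(1) alongside the full match.

```python
import re

# Hypothetical name; the pattern mirrors the optional-area-code regex above
area_code_regex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')

for text in ('My number is 415-555-4242', 'My number is 555-4242'):
    mo = area_code_regex.search(text)
    # group(1) is the optional area code with its dash, or None if it was skipped
    print(mo.group(), mo.group(1))
```

When the optional group takes no part in the match, group(1) is None rather than an empty string.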

The '*' Regex Operator

The * operator matches the preceding group zero or more times.



In [39]:

    
import re

batRegex = re.compile(r'Bat(wo)*man') # The ()* says this group can appear zero or more times

print(batRegex.search('The Adventures of Batwoman').group())

print(batRegex.search('The Adventures of Batwowowowoman').group())









    



Batwoman
Batwowowowoman
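
The cell above only shows the group repeating one or more times. As a small sketch (star_regex is a name chosen for this example), the same pattern also matches when the group appears zero times:

```python
import re

star_regex = re.compile(r'Bat(wo)*man')

mo = star_regex.search('The Adventures of Batman')
print(mo.group())   # Batman
print(mo.group(1))  # None, because the (wo) group matched zero times
```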

The '+' Regex Operator

The + operator matches the preceding group one or more times.



In [40]:

    
import re

batRegex = re.compile(r'Bat(wo)+man') # The ()+ says this group must appear one or more times; it is NOT optional

print(batRegex.search('The Adventures of Batwoman').group())

print(batRegex.search('The Adventures of Batwowowowoman').group())

print(batRegex.search('The Adventures of Batman').group())









    



Batwoman
Batwowowowoman






    



---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-40-c82c03b4fdec> in <module>()
      7 print(batRegex.search('The Adventures of Batwowowowoman').group())
      8 
----> 9 print(batRegex.search('The Adventures of Batman').group())

AttributeError: 'NoneType' object has no attribute 'group'
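
To compare the three operators side by side, the sketch below runs each pattern against the same three strings and records whether it matched. The names patterns, names, and results are illustrative only.

```python
import re

patterns = {'?': r'Bat(wo)?man', '*': r'Bat(wo)*man', '+': r'Bat(wo)+man'}
names = ['Batman', 'Batwoman', 'Batwowowowoman']

results = {}
for op, pattern in patterns.items():
    # bool() turns a match object into True and None into False
    results[op] = [bool(re.search(pattern, name)) for name in names]
    print(op, results[op])
```

Only * accepts all three strings; ? rejects the repeated "wo" and + rejects plain "Batman".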

All of these characters can be escaped with a backslash to match them literally:



In [41]:

    
import re

batRegex = re.compile(r'\+\*\?') # The +, *, and ? are each escaped with a backslash

print(batRegex.search('I learned about +*? RegEx syntax').group())

+*?
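
When the literal text comes from a variable, typing the backslashes by hand is error-prone. One option is the standard library's re.escape(), sketched here with illustrative variable names:

```python
import re

literal = '+*?'
escaped = re.escape(literal)  # adds a backslash before each special character
print(escaped)

mo = re.search(escaped, 'I learned about +*? RegEx syntax')
print(mo.group())
```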

The '{}' Regex Operator

The {x} quantifier matches the preceding group exactly x times.



In [42]:

    
haRegex = re.compile(r'(Ha){3}')

print(haRegex.search('HaHaHa').group())
print(haRegex.search('HaHaHaHa').group())  # search() stops after three repetitions, so only 'HaHaHa' is returned
#print(haRegex.search('HaHa').group()) # No match: search() returns None, so group() would raise AttributeError

phoneRegex = re.compile(r'(\d{3}-)?\d{3}-\d{4}') # {n} avoids repeating \d; the area code is still optional
phoneRegex.search('My number is 415-555-4242').group()









    



HaHaHa
HaHaHa






    Out[42]:





'415-555-4242'
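
Because search() looks for the pattern anywhere in the string, (Ha){3} still finds three repetitions inside a longer run. If the whole string must be exactly three repetitions, re.fullmatch() is one option. This is a sketch with a made-up variable name, not something the lesson itself uses.

```python
import re

three_ha = re.compile(r'(Ha){3}')

print(three_ha.search('HaHa'))         # None: fewer than three repetitions
print(three_ha.fullmatch('HaHaHa'))    # a match object: exactly three
print(three_ha.fullmatch('HaHaHaHa'))  # None: the whole string is four repetitions
```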

This operator can also take the form {x,y} to set a minimum and a maximum number of repetitions.



In [43]:

    
haRegex = re.compile(r'(Ha){3,5}')

print(haRegex.search('HaHaHa').group())
print(haRegex.search('HaHaHaHa').group())
print(haRegex.search('HaHaHaHaHa').group())  
print(haRegex.search('HaHaHaHaHaHaHaHa').group()) # Matches max of 5


haRegex = re.compile(r'(Ha){,5}') # Omit either number to leave that side unbounded; {,5} means 0 to 5
print(haRegex.search('Ha').group())
print(haRegex.search('HaHa').group())
print(haRegex.search('HaHaHa').group())
print(haRegex.search('HaHaHaHa').group())
print(haRegex.search('HaHaHaHaHa').group())  
print(haRegex.search('HaHaHaHaHaHaHaHa').group()) # Matches max of 5









    



HaHaHa
HaHaHaHa
HaHaHaHaHa
HaHaHaHaHa
Ha
HaHa
HaHaHa
HaHaHaHa
HaHaHaHaHa
HaHaHaHaHa
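
The cell above drops the minimum. The sketch below shows the other case, dropping the maximum, along with one consequence of {,5} allowing zero repetitions. The variable names are chosen for this example.

```python
import re

at_least_three = re.compile(r'(Ha){3,}')  # three or more, no upper bound
print(at_least_three.search('HaHa'))                      # None: only two repetitions
print(at_least_three.search('HaHaHaHaHaHaHaHa').group())  # all eight repetitions

up_to_five = re.compile(r'(Ha){,5}')  # zero to five
# Zero repetitions is allowed, so this matches the empty string at position 0
print(repr(up_to_five.search('Hello').group()))
```

An omitted minimum means zero, so a pattern like (Ha){,5} always matches something, even if it is an empty string.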

RegEx quantifiers are greedy by default: they match as many repetitions as they can, so the longest possible string is returned rather than the shortest.



In [44]:

    
haRegex = re.compile(r'(Ha){1,6}') # At least 1 and at most 6
print(haRegex.search('HaHaHaHaHaHaHaHa').group()) # Matches the longest allowed string: 6 repetitions









    



HaHaHaHaHaHa

You can make a match non-greedy by putting a ? after the closing curly brace, as in {1,6}?.



In [45]:

    
haRegex = re.compile(r'(Ha){1,6}?') # The ? after } makes it non-greedy: prefer the minimum (1) over the maximum (6)
print(haRegex.search('HaHaHaHaHaHaHaHa').group()) # Matches the shortest allowed string: 1 repetition

Ha
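
A non-greedy quantifier prefers the fewest repetitions, but it still takes more when the rest of the pattern requires it. The trailing ? also works after + and *. The sketch below illustrates both points with made-up inputs.

```python
import re

# Alone, the lazy quantifier stops at the minimum
print(re.search(r'(Ha){1,6}?', 'HaHaHa!').group())   # Ha

# Followed by a literal !, it must expand until the ! can match
print(re.search(r'(Ha){1,6}?!', 'HaHaHa!').group())  # HaHaHa!

# The same ? suffix makes + and * non-greedy
print(re.search(r'(Ha)+?', 'HaHaHa').group())        # Ha
print(repr(re.search(r'(Ha)*?', 'HaHaHa').group()))  # '' (zero repetitions)
```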

The Regex Operators

The () characters create a group for matching.
The | character matches one of several alternative patterns in a group.
The ? character allows for optional (0 or 1) matches.
The * character matches zero or more times.
The + character matches one or more times.
The {m,n} quantifier matches at least m and at most n repetitions of the group before it.
A ? placed after a quantifier, as in {m,n}?, makes the match non-greedy, favoring the fewest repetitions.
The \ character escapes any of these characters for literal matches.
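
These operators can be combined. As a small sketch, the pattern below (laugh_regex is a hypothetical name) uses the pipe inside a group and then repeats that group with curly braces:

```python
import re

# Each repetition may be either 'Ha' or 'Ho'; require two to four of them
laugh_regex = re.compile(r'(Ha|Ho){2,4}')

print(laugh_regex.search('HaHoHa').group())  # HaHoHa
print(laugh_regex.search('Ho'))              # None: only one repetition
```

The alternative is chosen again on every repetition, so mixed sequences such as HaHoHa match.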

Recap

The ? says the group matches zero or one times.
The * says the group matches zero or more times.
The + says the group matches one or more times.
The curly braces can match a specific number of times.
The curly braces with two numbers match a minimum and maximum number of times.
Leaving out the first or second number in the curly braces says there is no minimum or no maximum, respectively.
Greedy matching matches the longest string possible; non-greedy matching matches the shortest string possible.
Putting a question mark after the curly braces makes it do a non-greedy match.