Lesson 25:

RegEx groups and the Pipe Character

The | pipe character matches one of several alternative groups, but sometimes you want a group to repeat a certain number of times instead.

The '?' Regex Operator

The ? RegEx operator makes the preceding group optional (it matches 0 or 1 times):


In [36]:
import re

batRegex = re.compile(r'Bat(wo)?man') # The ()? says this group can appear 0 or 1 times to match; it is optional

mo = batRegex.search('The Adventures of Batman')
print(mo.group())

mo = batRegex.search('The Adventures of Batwoman')
print(mo.group())


Batman
Batwoman

However, it cannot match multiple repetitions:


In [37]:
mo = batRegex.search('The Adventures of Batwowowowoman')
print(mo.group())


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-37-3ca83bd4a86e> in <module>()
      1 mo = batRegex.search('The Adventures of Batwowowowoman')
----> 2 print(mo.group())

AttributeError: 'NoneType' object has no attribute 'group'
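
The traceback above comes from calling group() on None: search() returns None when the pattern does not match. A minimal sketch of guarding against that follows; the helper name first_match is hypothetical and not part of the lesson.

```python
import re

bat_regex = re.compile(r'Bat(wo)?man')

def first_match(pattern, text):
    """Return the matched text, or None if the pattern does not match."""
    mo = pattern.search(text)
    # search() returns None on failure, so check before calling group()
    return mo.group() if mo is not None else None

print(first_match(bat_regex, 'The Adventures of Batwoman'))
print(first_match(bat_regex, 'The Adventures of Batwowowowoman'))
```

Returning None instead of raising is only one option; it keeps the examples short, but the caller still has to handle the no-match case.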

We can use this to find strings that may or may not include elements, like phone numbers with and without area codes.


In [38]:
phoneNumRegex = re.compile(r'\d\d\d\-\d\d\d-\d\d\d\d') # this requires an area code.

mo = phoneNumRegex.search('My number is 415-555-4242') # matches
print(mo.group())

mo2 = phoneNumRegex.search('My number is 555-4242') # will not match
print(mo2)


phoneNumRegex = re.compile(r'(\d\d\d\-)?\d\d\d-\d\d\d\d') # Make first three digits and dash optional

mo = phoneNumRegex.search('My number is 415-555-4242') # matches
print(mo.group())
mo2 = phoneNumRegex.search('My number is 555-4242') # matches
print(mo2.group())


415-555-4242
None
415-555-4242
555-4242

The '*' Regex Operator

The * character matches the preceding group zero or more times.


In [39]:
import re

batRegex = re.compile(r'Bat(wo)*man') # The ()* says this group can appear 0 or n times to match

print(batRegex.search('The Adventures of Batwoman').group())

print(batRegex.search('The Adventures of Batwowowowoman').group())


Batwoman
Batwowowowoman

The '+' Regex Operator

The + character matches the preceding group one or more times.


In [40]:
import re

batRegex = re.compile(r'Bat(wo)+man') # The ()+ says this group can appear 1 or n times; it is NOT optional

print(batRegex.search('The Adventures of Batwoman').group())

print(batRegex.search('The Adventures of Batwowowowoman').group())

print(batRegex.search('The Adventures of Batman').group())


Batwoman
Batwowowowoman
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-40-c82c03b4fdec> in <module>()
      7 print(batRegex.search('The Adventures of Batwowowowoman').group())
      8 
----> 9 print(batRegex.search('The Adventures of Batman').group())

AttributeError: 'NoneType' object has no attribute 'group'

All of these characters can be escaped for literal matches:


In [41]:
import re

batRegex = re.compile(r'\+\*\?') # The +,*, and ? are escaped. 

print(batRegex.search('I learned about +*? RegEx syntax').group())


+*?
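
When the literal text comes from a variable, escaping by hand gets tedious. As a sketch, the standard library's re.escape() can add the backslashes instead; the variable names here are illustrative only.

```python
import re

literal = '+*?'
# re.escape() backslash-escapes characters that have special meaning in a pattern
escaped_regex = re.compile(re.escape(literal))

result = escaped_regex.search('I learned about +*? RegEx syntax').group()
print(result)
```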

The '{}' Regex Operator

The {x} operator matches the preceding group exactly x times.


In [42]:
haRegex = re.compile(r'(Ha){3}')

print(haRegex.search('HaHaHa').group())
print(haRegex.search('HaHaHaHa').group())  # {3} matches exactly three repetitions, so only the first three are returned
#print(haRegex.search('HaHa').group()) # No Match

phoneRegex = re.compile(r'(\d{3}-)?\d{3}-\d{4}') # {n} avoids repeating \d by hand
phoneRegex.search('My number is 415-555-4242').group()


HaHaHa
HaHaHa
Out[42]:
'415-555-4242'

This operator also accepts the form {x,y}, which sets a minimum (x) and a maximum (y) number of repetitions.


In [43]:
haRegex = re.compile(r'(Ha){3,5}')

print(haRegex.search('HaHaHa').group())
print(haRegex.search('HaHaHaHa').group())
print(haRegex.search('HaHaHaHaHa').group())  
print(haRegex.search('HaHaHaHaHaHaHaHa').group()) # Matches max of 5


haRegex = re.compile(r'(Ha){,5}') # Omitting a bound leaves that side open: {,5} means 0 to 5
print(haRegex.search('Ha').group())
print(haRegex.search('HaHa').group())
print(haRegex.search('HaHaHa').group())
print(haRegex.search('HaHaHaHa').group())
print(haRegex.search('HaHaHaHaHa').group())  
print(haRegex.search('HaHaHaHaHaHaHaHa').group()) # Matches max of 5


HaHaHa
HaHaHaHa
HaHaHaHaHa
HaHaHaHaHa
Ha
HaHa
HaHaHa
HaHaHaHa
HaHaHaHaHa
HaHaHaHaHa
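
The cell above only drops the minimum. As a sketch of the other direction, {3,} drops the maximum, and the last lines show a consequence of {,5} having a minimum of zero: it can match an empty string. The variable names are illustrative.

```python
import re

at_least_three = re.compile(r'(Ha){3,}')  # minimum of 3, no maximum
up_to_five = re.compile(r'(Ha){,5}')      # minimum of 0, maximum of 5

long_match = at_least_three.search('HaHaHaHaHaHaHaHa')
short_match = at_least_three.search('HaHa')   # too few repetitions
empty_match = up_to_five.search('Hello')      # zero repetitions still counts as a match

print(long_match.group())
print(short_match)
print(repr(empty_match.group()))
```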

RegEx matching is greedy by default: a quantifier matches as many repetitions as it can, not as few.


In [44]:
haRegex = re.compile(r'(Ha){1,6}') # at least 1, at most 6
print(haRegex.search('HaHaHaHaHaHaHaHa').group()) # Matches longest string; 6


HaHaHaHaHaHa

You can do a non-greedy match by adding a ? after the closing curly brace ('}?').


In [45]:
haRegex = re.compile(r'(Ha){1,6}?') # The ? after } says prefer the minimum count, not the maximum; non-greedy
print(haRegex.search('HaHaHaHaHaHaHaHa').group()) # Matches shortest string, 1


Ha
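
Non-greedy does not mean the match always stops at the minimum. As a sketch, if more of the pattern follows the quantifier, the engine adds repetitions one at a time until the rest can match. The trailing ! in the second pattern is an illustrative addition, not from the lesson.

```python
import re

lazy = re.compile(r'(Ha){1,6}?')
lazy_then_bang = re.compile(r'(Ha){1,6}?!')  # the ! must match right after the group

text = 'HaHaHa!'
lazy_result = lazy.search(text).group()            # stops at one repetition
bang_result = lazy_then_bang.search(text).group()  # grows until ! can match

print(lazy_result)
print(bang_result)
```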

The Regex Operators

The () characters create a group for matching.
The | character matches one of several alternative patterns in a group.
The ? character allows for optional (0 or 1) matches.
The * character matches zero or more times.
The + character matches one or more times.
The {m,n} operator matches at least m and at most n repetitions of the group before it.
A ? after a quantifier, as in {m,n}?, makes the match non-greedy, favoring the fewest repetitions.
The \ character escapes any of these characters for literal matches.
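
The operators listed above can be checked side by side. This is a small sketch with illustrative patterns and inputs, not an exhaustive test; each entry pairs a pattern and a text with the match one would expect.

```python
import re

# (pattern, text, expected match)
examples = [
    (r'Bat(man|mobile)', 'Batmobile', 'Batmobile'),   # | picks one alternative
    (r'Bat(wo)?man', 'Batman', 'Batman'),             # ? zero or one
    (r'Bat(wo)*man', 'Batwowoman', 'Batwowoman'),     # * zero or more
    (r'Bat(wo)+man', 'Batwoman', 'Batwoman'),         # + one or more
    (r'(Ha){2,3}', 'HaHaHaHa', 'HaHaHa'),             # greedy: takes the maximum
    (r'(Ha){2,3}?', 'HaHaHaHa', 'HaHa'),              # non-greedy: takes the minimum
    (r'\?', 'Why?', '?'),                             # \ escapes an operator
]

results = [re.search(pattern, text).group() for pattern, text, _ in examples]

for (pattern, text, expected), result in zip(examples, results):
    print(pattern, '->', result, '(expected', expected + ')')
```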

Recap

  • The ? says the group matches zero or one times.
  • The * says the group matches zero or more times.
  • The + says the group matches one or more times.
  • The curly braces can match a specific number of times.
  • The curly braces with two numbers match a minimum and maximum number of times.
  • Leaving out the first or second number in the curly braces says there is no minimum or maximum.
  • Greedy matching matches the longest string possible; non-greedy matching matches the shortest string possible.
  • Putting a question mark after the curly braces makes it do a non-greedy match.