Lesson 27:

RegEx .* Dot-Star, ^ Caret, & $ Dollar Sign Characters

Besides negating a character class, the ^ character can also anchor a pattern to the start of a string.

The $ character anchors a pattern to the end of a string.


In [11]:
import re

beginsWithTheHelloRegex = re.compile(r'^Hello') # String must start exactly with 'Hello'

print(beginsWithTheHelloRegex.findall('Hello there'))
print(beginsWithTheHelloRegex.findall('Wait, did he say Hello just now?'))
print(beginsWithTheHelloRegex.findall('He said Hello'))

endsWithTheHelloRegex = re.compile(r'Hello$') # String must end exactly with 'Hello'

print(endsWithTheHelloRegex.findall('Hello there'))
print(endsWithTheHelloRegex.findall('Wait, did he say Hello just now?'))
print(endsWithTheHelloRegex.findall('He said Hello'))


['Hello']
[]
[]
[]
[]
['Hello']
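
The two roles of ^ mentioned at the top of this lesson are easy to confuse, so here is a small sketch that puts them side by side. The variable names and sample strings are made up for illustration; only the re calls come from the lesson.

```python
import re

# Inside square brackets, a leading ^ negates the character class.
non_vowel_regex = re.compile(r'[^aeiou]')

# Outside brackets, ^ anchors the match to the start of the string.
starts_with_hello_regex = re.compile(r'^Hello')

non_vowels = non_vowel_regex.findall('audio hi')
anchored = starts_with_hello_regex.findall('Hello, Hello')

print(non_vowels)  # ['d', ' ', 'h']
print(anchored)    # ['Hello'] (only the 'Hello' at the very start matches)
```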

Used together, ^ and $ require the entire string to match the pattern:


In [35]:
allDigitsRegex = re.compile(r'^\d+$') # The whole string must be one or more digits and nothing else

print(allDigitsRegex.findall('2153234623462561514')) # Matches the entire string
print(allDigitsRegex.findall('21532346234letters!62561514')) # No match, the string contains non-digit characters


Out[35]:
['2153234623462561514']
[]
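
A common use of the ^...$ combination is input validation. The sketch below wraps the same pattern in a hypothetical helper, is_all_digits (not part of the lesson), and also shows one edge case worth knowing: $ matches just before a trailing newline as well as at the very end, so re.fullmatch() is the stricter choice when that matters.

```python
import re

all_digits_regex = re.compile(r'^\d+$')

def is_all_digits(text):
    """Return True if the pattern ^\\d+$ matches text (illustrative helper)."""
    return all_digits_regex.search(text) is not None

print(is_all_digits('2153234623462561514'))          # True
print(is_all_digits('21532346234letters!62561514'))  # False
print(is_all_digits(''))                             # False, + needs at least one digit
print(is_all_digits('123\n'))                        # True, $ also matches before a final newline

# re.fullmatch() needs no anchors and rejects the trailing newline.
strict_result = re.fullmatch(r'\d+', '123\n')
print(strict_result)                                 # None
```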

The . character is a wildcard: it matches any single character except a newline.


In [18]:
atRegex = re.compile(r'.at') # Any single character followed by 'at'

print(atRegex.findall('The cat in the hat sat on the flat mat.')) # Each match is one character plus 'at'

atRegex = re.compile(r'.{2}at') # Any two characters followed by 'at'

print(atRegex.findall('The cat in the hat sat on the flat mat.')) # Each match is two characters plus 'at'; a space counts as a character


['cat', 'hat', 'sat', 'lat', 'mat']
[' cat', ' hat', ' sat', 'flat', ' mat']
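
Because . is a wildcard, matching an actual period requires escaping it as \. in the pattern. A brief sketch using the same sentence (the variable names are illustrative):

```python
import re

sentence = 'The cat in the hat sat on the flat mat.'

wildcard_regex = re.compile(r'at.')      # 'at' followed by any single character
literal_dot_regex = re.compile(r'at\.')  # 'at' followed by a literal period

wildcard_matches = wildcard_regex.findall(sentence)
literal_matches = literal_dot_regex.findall(sentence)

print(wildcard_matches)  # ['at ', 'at ', 'at ', 'at ', 'at.']
print(literal_matches)   # ['at.']
```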

Because . matches any character and * means "zero or more", .* matches any run of characters of any length:


In [23]:
name = 'First Name: Al, Last Name: Sweigart' # Pulling the names out by hand would require a lot of indexing code
name2 = 'First Name: Vivek, Last Name: Menon'

nameRegex = re.compile(r'First Name: (.*), Last Name: (.*)') # Each group captures whatever text appears in that position

print(nameRegex.findall(name))
print(nameRegex.findall(name2))


[('Al', 'Sweigart')]
[('Vivek', 'Menon')]
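
When only one record is expected per string, search() plus group() is a common alternative to findall(). The following is a minimal sketch under that assumption; extract_names is a hypothetical helper, not something defined in the lesson.

```python
import re

name_regex = re.compile(r'First Name: (.*), Last Name: (.*)')

def extract_names(record):
    """Return (first, last) from a record, or None if the format does not match."""
    match = name_regex.search(record)
    if match is None:
        return None
    return match.group(1), match.group(2)

print(extract_names('First Name: Al, Last Name: Sweigart'))  # ('Al', 'Sweigart')
print(extract_names('Name: unknown'))                        # None
```

Checking for None before calling group() avoids an AttributeError on strings that do not follow the expected format.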

.* is greedy by default, meaning it matches as much text as possible. Adding a ? (written .*?) makes it non-greedy, so it matches as little as possible.


In [26]:
serve = '<To serve humans> for dinner.>'

greedyRegex = re.compile(r'<(.*)>') # Greedy: captures up to the last > in the string
nongreedyRegex = re.compile(r'<(.*?)>') # Non-greedy: captures up to the first > it finds

print(greedyRegex.findall(serve)) # Matches the longest possible string
print(nongreedyRegex.findall(serve)) # Matches the shortest possible string


['To serve humans> for dinner.']
['To serve humans']
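
The difference becomes more visible when a string contains several bracketed pieces. This sketch uses a made-up HTML-like string: the greedy pattern swallows everything between the first < and the last >, while the non-greedy pattern returns each tag separately.

```python
import re

markup = '<b>bold</b> and <i>italic</i>'

greedy_matches = re.findall(r'<(.*)>', markup)
non_greedy_matches = re.findall(r'<(.*?)>', markup)

print(greedy_matches)      # ['b>bold</b> and <i>italic</i']
print(non_greedy_matches)  # ['b', '/b', 'i', '/i']
```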

By default, . (and therefore .*) matches any character except the newline (\n) character.


In [30]:
primeDirectives = 'Serve the public trust.\nProtect the innocent.\nUphold the law.'

print(primeDirectives)

dotStar = re.compile(r'.*')
print(dotStar.findall(primeDirectives))


Serve the public trust.
Protect the innocent.
Uphold the law.
['Serve the public trust.', '', 'Protect the innocent.', '', 'Uphold the law.', '']

Passing re.DOTALL as the second argument to re.compile() makes . match every character, including newlines:


In [34]:
dotStar = re.compile(r'.*', re.DOTALL)
print(dotStar.findall(primeDirectives))


Out[34]:
['Serve the public trust.\nProtect the innocent.\nUphold the law.', '']
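
The empty strings in the two outputs above appear because * allows zero repetitions, so .* can also produce an empty match at the point where a non-empty match stopped. If those are unwanted, one option is .+ (one or more characters), as in this sketch:

```python
import re

prime_directives = 'Serve the public trust.\nProtect the innocent.\nUphold the law.'

# .+ requires at least one character, so no empty matches are returned.
lines = re.findall(r'.+', prime_directives)
whole_text = re.findall(r'.+', prime_directives, re.DOTALL)

print(lines)       # ['Serve the public trust.', 'Protect the innocent.', 'Uphold the law.']
print(whole_text)  # ['Serve the public trust.\nProtect the innocent.\nUphold the law.']
```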

Similarly, pass re.IGNORECASE (or its short form, re.I) to ignore case:


In [38]:
vowelRegex = re.compile(r'[aeiou]', re.I) # Match any vowel, regardless of case 

print(vowelRegex.findall('Al, why does your programming book talk about RoboCop so much?'))


['A', 'o', 'e', 'o', 'u', 'o', 'a', 'i', 'o', 'o', 'a', 'a', 'o', 'u', 'o', 'o', 'o', 'o', 'u']
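
re.compile() takes a single flags argument, so to use re.DOTALL and re.IGNORECASE together they are combined with the bitwise OR operator (|). A short sketch with a made-up sample string:

```python
import re

text = 'ROBOCOP\nprotects the innocent.'

# re.I alone cannot match here, because . will not cross the newline.
case_only_match = re.compile(r'robocop.*innocent', re.I).search(text)

# Combining both flags lets the pattern ignore case and span the newline.
both_flags_regex = re.compile(r'robocop.*innocent', re.IGNORECASE | re.DOTALL)
both_flags_match = both_flags_regex.search(text)

print(case_only_match)                 # None
print(repr(both_flags_match.group()))  # 'ROBOCOP\nprotects the innocent'
```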

Recap

  • The ^ regex character means the match must begin at the start of the string; $ means the match must finish at the end of the string.
  • Using both means the entire string must match the pattern.
  • The . regex character is a wildcard; it matches anything except newlines.
  • The re.DOTALL parameter can be used in re.compile() to make the . match newlines as well.
  • Pass re.I to re.compile() to make the matching case-insensitive.