Lesson 23:

Regular Expression Basics

Regular expressions use pattern matching to find text. They are typically faster to write than the alternative: hand-coded, character-by-character checks like the program below.

Phone Number Search Program


In [2]:
def isPhoneNumber(text):
    if len(text) != 12:
        return False # not phone number-sized
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False # no area code
    if text[3] != '-':
        return False # missing dash
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False # no first 3 digits
    if text[7] != '-':
        return False # missing second dash
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False # missing last 4 digits
    return True

You can then test what strings count as phone numbers using this program.


In [4]:
print(isPhoneNumber('415-555-1234')) # True
print(isPhoneNumber('Hello')) # False
print(isPhoneNumber('415551234')) # False


True
False
False

Use isPhoneNumber() to check whether a longer string contains phone numbers:


In [22]:
message = 'Call me at 415-555-1011 tomorrow, or at 415-555-9999 any other day.'
message2 = 'There are no phone numbers in this message.'

def findNumber(message):
    foundNumber = False # set False to start
    for i in range(len(message)):
        chunk = message[i:i+12] # Take a phone number size 'chunk' of the string, character by character
        #print(chunk) # debug
        if isPhoneNumber(chunk):
            print('Phone number found: ' + chunk)
            foundNumber = True
    if not foundNumber: # Run after loop, not during loop
        print('Could not find any phone numbers.')

findNumber(message)
findNumber(message2)


Phone number found: 415-555-1011
Phone number found: 415-555-9999
Could not find any phone numbers.

This is a lot of code for text pattern matching, which is a very common activity in programming. Regular expressions simplify this process.

Phone Number Matching with RegEx

The re module contains Python's RegEx functions. Patterns are usually written as raw strings (r'') so that backslashes reach the regex unchanged.
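To see why raw strings are preferred, here is a minimal sketch comparing a raw string with an ordinary one. The variable names raw and escaped are only for illustration.

```python
# A raw string keeps backslashes literal, so regex escapes such as \d
# are passed to the re module exactly as typed.
raw = r'\d\d\d'
escaped = '\\d\\d\\d'  # the same pattern without the r prefix

print(raw == escaped)         # True: both hold the same six characters
print(len(r'\n'), len('\n'))  # 2 1: r'\n' is a backslash plus n, '\n' is one newline
```

Without the r prefix, every backslash in the pattern would have to be doubled.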


In [33]:
import re

print(message)

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # Defines the pattern and compiles it into a Regex object
mo = phoneNumRegex.search(message) # .search() scans the string and returns a Match Object for the first match
print(mo.group()) # the .group() method returns the actual text in the Match Object


Call me at 415-555-1011 tomorrow, or at 415-555-9999 any other day.
415-555-1011

This is 3 lines of code in place of about 30, which is significantly more concise.
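One detail the example above does not show: .search() returns None when the pattern is not found, so calling .group() on the result would raise an AttributeError. A minimal sketch of a guard, where firstPhoneNumber is a hypothetical helper name and not part of the re module:

```python
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def firstPhoneNumber(text):
    """Return the first phone number in text, or None if there is none."""
    mo = phoneNumRegex.search(text)
    if mo is None:  # .search() returns None when nothing matches
        return None
    return mo.group()

print(firstPhoneNumber('Call me at 415-555-1011 tomorrow.'))            # 415-555-1011
print(firstPhoneNumber('There are no phone numbers in this message.'))  # None
```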

You can also use the .findall() method to find all RegEx matches, not just the first.


In [38]:
mo = phoneNumRegex.findall(message) # .findall() returns a list of matched strings, not Match Objects
print(mo) # the list already holds plain strings, so it doesn't need .group()


['415-555-1011', '415-555-9999']
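Because .findall() returns an empty list when nothing matches, the earlier findNumber() loop can be sketched in a few lines. The name findNumbers is hypothetical, chosen only to mirror the earlier function, and the returned list is an addition for convenience.

```python
import re

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def findNumbers(text):
    """Print every phone number found in text and return them as a list."""
    numbers = phoneNumRegex.findall(text)  # list of strings; empty if no match
    if numbers:
        for number in numbers:
            print('Phone number found: ' + number)
    else:
        print('Could not find any phone numbers.')
    return numbers

findNumbers('Call me at 415-555-1011 tomorrow, or at 415-555-9999 any other day.')
findNumbers('There are no phone numbers in this message.')
```

An empty list is falsy, so the if test takes the place of the foundNumber flag.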

Recap

  • Regular expressions are a mini-language for specifying text patterns. Writing code to do pattern matching without regular expressions is a huge pain.
  • Regex strings often use backslashes, '\' (like \d), so they are usually written as raw strings: r'\d'
  • Import the re module first to use them.
  • Call the re.compile() function to create a regex object.
  • Call the regex object's .search() method to create a match object.
  • Call the match object's .group() method to get the matched string.
  • Use the regex object's .findall() to get a list of matched strings.
  • '\d' is the regex for a numeric digit.