Lesson 26:

RegEx Character Classes and the .findall() Method

The .findall() method of a regex object returns every matching string in a text, not just the first one.


In [3]:
import re

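phoneRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # \d (backslash, not forward slash) matches a digit

#phoneRegex.search(text) # finds the first match only and returns a Match object
#phoneRegex.findall(text) # finds all matches and returns them as a list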

.findall() returns a list of strings when the regex has zero or one group.
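As a minimal sketch of the difference between .search() and .findall(), the cell below runs both on the same text. The message and the phone numbers in it are made up for illustration.

```python
import re

phoneRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

# Hypothetical sample text containing two phone numbers
message = 'Call me at 415-555-1011 tomorrow, or at 415-555-9999 for my office.'

firstMatch = phoneRegex.search(message)    # Match object for the first match only
print(firstMatch.group())                  # 415-555-1011

allMatches = phoneRegex.findall(message)   # list of every matching string
print(allMatches)                          # ['415-555-1011', '415-555-9999']
```

Note that .search() returns a Match object (or None when nothing matches), while .findall() returns plain strings and an empty list when nothing matches.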

It behaves differently when the regex has two or more groups.


In [4]:
import re

phoneRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)') # Two groups, so findall() returns tuples

#phoneRegex.findall(text) # finds all matches as tuples with one item per group; [('group1', 'group2'),...]

To get the full matched string as well, wrap the entire regex in its own group, so you get [(fullmatch, group1, group2),...].
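The grouping behavior can be sketched as follows. The sample message is hypothetical, and the variable names groupedRegex and wrappedRegex are chosen here for illustration only.

```python
import re

# Hypothetical sample text containing two phone numbers
message = 'Call me at 415-555-1011 tomorrow, or at 415-555-9999 for my office.'

# Two groups: the area code, and the rest of the number
groupedRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
pairs = groupedRegex.findall(message)
print(pairs)
# [('415', '555-1011'), ('415', '555-9999')]

# An outer group wraps the whole pattern, so each tuple starts with the full match
wrappedRegex = re.compile(r'((\d\d\d)-(\d\d\d-\d\d\d\d))')
triples = wrappedRegex.findall(message)
print(triples)
# [('415-555-1011', '415', '555-1011'), ('415-555-9999', '415', '555-9999')]
```

Groups are numbered by the position of their opening parenthesis, which is why the outer group comes first in each tuple.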

RegEx Character Classes

\d is the RegEx shorthand character class for digits.


In [15]:
#digitRegex = re.compile(r'(0|1|2|3|4|5|6|7|8|9)') is equivalent to

#digitRegex = re.compile(r'\d')


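To check the equivalence on a small input, the sketch below runs both patterns over the same text. The sample string and the name altRegex are assumptions made for this example.

```python
import re

altRegex = re.compile(r'(0|1|2|3|4|5|6|7|8|9)')  # explicit alternation of the ten digits
digitRegex = re.compile(r'\d')                    # shorthand character class

sample = 'Room 42, floor 7'   # hypothetical sample text (ASCII digits only)

altDigits = altRegex.findall(sample)
shortDigits = digitRegex.findall(sample)

print(altDigits)     # ['4', '2', '7']
print(shortDigits)   # ['4', '2', '7']
```

One caveat: for ordinary (str) patterns in Python 3, \d also matches decimal digits from other scripts unless the re.ASCII flag is passed, so the two patterns agree only on text whose digits are 0 to 9.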

Other shorthand character classes are:

  • \D Any character that is NOT a numeric digit from 0 to 9.
  • \w Any letter, numeric digit, or the underscore character (word characters). Punctuation is not included.
  • \W Any character that is NOT a letter, numeric digit, or the underscore character.
  • \s Any space, tab, or newline character (whitespace characters).
  • \S Any character that is NOT a whitespace character.
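The classes listed above can be compared side by side on one string. This is only a sketch: the sample text is made up, and it uses re.findall(), the module-level shortcut that compiles the pattern and calls .findall() in one step.

```python
import re

# Hypothetical sample text; \t is a real tab because this is not a raw string
sample = 'Room_42 costs $99.50\tper night'

digits = re.findall(r'\d+', sample)      # runs of digits
nonDigits = re.findall(r'\D+', sample)   # runs of anything that is not a digit
words = re.findall(r'\w+', sample)       # runs of letters, digits, and underscores
nonWords = re.findall(r'\W', sample)     # single non-word characters
spaces = re.findall(r'\s', sample)       # single whitespace characters
nonSpaces = re.findall(r'\S+', sample)   # runs of non-whitespace characters

print(digits)      # ['42', '99', '50']
print(nonDigits)   # ['Room_', ' costs $', '.', '\tper night']
print(words)       # ['Room_42', 'costs', '99', '50', 'per', 'night']
print(nonWords)    # [' ', ' ', '$', '.', '\t', ' ']
print(spaces)      # [' ', ' ', '\t', ' ']
print(nonSpaces)   # ['Room_42', 'costs', '$99.50', 'per', 'night']
```

The \w+ result shows that the underscore counts as a word character while $ and . do not.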


In [23]:
# Example using lyrics from The Twelve Days of Christmas 

lyrics = '''
12 Drummers Drumming
11 Pipers Piping
10 Lords a Leaping
9 Ladies Dancing
8 Maids a Milking
7 Swans a Swimming
6 Geese a Laying
5 Golden Rings
4 Calling Birds
3 French Hens
2 Turtle Doves
and 1 Partridge in a Pear Tree
'''

xmasRegex = re.compile(r'\d+\s\w+') # 1 or more digits, one whitespace character, 1 or more word characters

xmasRegex.findall(lyrics) # Returns every 'number gift' pair, but stops at the next space because \w+ does not match spaces


Out[23]:
['12 Drummers',
 '11 Pipers',
 '10 Lords',
 '9 Ladies',
 '8 Maids',
 '7 Swans',
 '6 Geese',
 '5 Golden',
 '4 Calling',
 '3 French',
 '2 Turtle',
 '1 Partridge']

It is possible to create your own character classes, outside of these shorthand classes, using []:


In [26]:
vowelRegex = re.compile(r'[aeiouAEIOU]') # RegEx for lowercase and uppercase vowels
alphabetRegex = re.compile(r'[a-zA-Z]') # RegEx for lowercase and uppercase alphabet using ranges

print(vowelRegex.findall('Robocop eats baby food.')) # Finds a list of all vowels in string

doublevowelRegex = re.compile(r'[aeiouAEIOU]{2}') # RegEx for two vowels (lowercase or uppercase) in a row; {2} means exactly two repetitions.
print(doublevowelRegex.findall('Robocop eats baby food.')) # Finds a list of all pairs of consecutive vowels in string


['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o']
['ea', 'oo']

A useful feature of custom character classes is the negative character class:


In [30]:
consonantsRegex = re.compile(r'[^aeiouAEIOU]') # RegEx for finding all characters that are NOT vowels

print(consonantsRegex.findall('Robocop eats baby food.')) # Output will include spaces and punctuation, not just consonants.


['R', 'b', 'c', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.']

Recap

  • The regex .findall() method is passed a string, and returns a list of all matches in it, not just the first match.
  • If the regex has 0 or 1 groups, .findall() returns a list of strings.
  • If the regex has 2 or more groups, .findall() returns a list of tuples of strings.
  • \d is the shorthand character class that matches digits.
  • \w is the shorthand character class that matches word characters (letters, digits, and the underscore).
  • \s is the shorthand character class for whitespace.
  • \D is the shorthand character class that matches NOT digits.
  • \W is the shorthand character class that matches NOT word characters.
  • \S is the shorthand character class that matches NOT whitespace.
  • You can make your own character classes with square brackets: [aeiou]
  • The ^ caret symbol makes it a negative character class, matching anything NOT in the brackets: [^aeiou]