Lesson 26:

RegEx Character Classes and the `.findall()` Method

The .findall() method of a regex object finds every matching string in a text, not just the first one.



In [3]:

    
import re

phoneRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # \d (backslash, not forward slash) matches one digit

#phoneRegex.search() # finds first match
#phoneRegex.findall() # finds all matches

.findall() returns a list of strings, one for each match.
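As a minimal sketch of the difference between the two methods, the cell below runs both on a made-up message. The `message` text and its phone numbers are assumptions chosen only for illustration.

```python
import re

# Hypothetical sample text; the phone numbers are invented for this example.
message = 'Call me at 415-555-1011 tomorrow, or at 415-555-9999 after six.'

phoneRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

firstMatch = phoneRegex.search(message)    # a Match object for the first match only
allMatches = phoneRegex.findall(message)   # a plain list of every matching string

print(firstMatch.group())  # 415-555-1011
print(allMatches)          # ['415-555-1011', '415-555-9999']
```

Note that .search() hands back a Match object that still needs .group(), while .findall() gives the strings directly.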

It behaves differently when the regex contains groups.



In [4]:

    
import re

phoneRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)') # Two groups, so findall() returns tuples

#phoneRegex.findall() # finds all matches as tuples of groups; [('group1', 'group2'),...]

To get the full matched string as well, wrap the entire regex in its own outer group, so each tuple becomes (fullstring, group1, group2): [(fullstring, group1, group2),...].
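The grouping behaviour can be sketched as follows. The `message` text and the names `groupedRegex` and `wrappedRegex` are assumptions made for this illustration.

```python
import re

# Hypothetical sample text; the phone numbers are invented for this example.
message = 'Call me at 415-555-1011 tomorrow, or at 415-555-9999 after six.'

# Two groups: each match becomes a tuple (area code, rest of number).
groupedRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
grouped = groupedRegex.findall(message)
print(grouped)
# [('415', '555-1011'), ('415', '555-9999')]

# An extra outer group around everything adds the full string as the first item.
wrappedRegex = re.compile(r'((\d\d\d)-(\d\d\d-\d\d\d\d))')
wrapped = wrappedRegex.findall(message)
print(wrapped)
# [('415-555-1011', '415', '555-1011'), ('415-555-9999', '415', '555-9999')]
```

Groups are numbered by the position of their opening parenthesis, which is why the outer group comes first in each tuple.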

RegEx Character Classes

\d is the RegEx shorthand character class for digits.



In [15]:

    
#digitRegex = re.compile(r'(0|1|2|3|4|5|6|7|8|9)') is equivalent to

digitRegex = re.compile(r'\d')
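A quick way to check this equivalence is to run both patterns on the same text. This is only a sketch: the `text` string and the names `alternationRegex` and `shorthandRegex` are assumptions, and the comparison holds for ordinary ASCII digits (in Python 3, \d on a str pattern can also match digits from other scripts).

```python
import re

# Hypothetical sample text for the comparison.
text = 'Room 101, floor 3'

alternationRegex = re.compile(r'(0|1|2|3|4|5|6|7|8|9)')  # every digit spelled out
shorthandRegex = re.compile(r'\d')                       # shorthand class

longResult = alternationRegex.findall(text)
shortResult = shorthandRegex.findall(text)

print(longResult)   # ['1', '0', '1', '3']
print(shortResult)  # ['1', '0', '1', '3']
```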









    



Other shorthand character classes are:

\D Any character that is NOT a numeric digit from 0 to 9.
\w Any letter, numeric digit, or the underscore character (word characters). Punctuation is not included.
\W Any character that is NOT a letter, numeric digit, or the underscore character.
\s Any space, tab, or newline character (space characters).
\S Any character that is NOT a space character.
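To see each class in action, the sketch below runs all six shorthand classes over one short string. The `sample` string and the `results` dictionary are assumptions made for this illustration.

```python
import re

# Hypothetical sample: a letter, an underscore, a digit, a space,
# a punctuation mark, and a tab.
sample = 'a_1 !\t'

# Map each shorthand class to the list of characters it matches.
results = {}
for pattern in (r'\d', r'\D', r'\w', r'\W', r'\s', r'\S'):
    results[pattern] = re.compile(pattern).findall(sample)
    print(pattern, results[pattern])

# \d ['1']
# \D ['a', '_', ' ', '!', '\t']
# \w ['a', '_', '1']
# \W [' ', '!', '\t']
# \s [' ', '\t']
# \S ['a', '_', '1', '!']
```

Each uppercase class matches exactly the characters its lowercase partner leaves out, and the '!' shows up under \W rather than \w.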



In [23]:

    
# Example using lyrics from The Twelve Days of Christmas 

lyrics = '''
12 Drummers Drumming
11 Pipers Piping
10 Lords a Leaping
9 Ladies Dancing
8 Maids a Milking
7 Swans a Swimming
6 Geese a Laying
5 Golden Rings
4 Calling Birds
3 French Hens
2 Turtle Doves
and 1 Partridge in a Pear Tree
'''

xmasRegex = re.compile(r'\d+\s\w+') # 1 or more digits, one space character, 1 or more word characters

xmasRegex.findall(lyrics) # Returns all 'x gift', but stops at space because \w+ does not include spaces









    Out[23]:





['12 Drummers',
 '11 Pipers',
 '10 Lords',
 '9 Ladies',
 '8 Maids',
 '7 Swans',
 '6 Geese',
 '5 Golden',
 '4 Calling',
 '3 French',
 '2 Turtle',
 '1 Partridge']

It is possible to create your own character classes, outside of these shorthand classes, using []:



In [26]:

    
vowelRegex = re.compile(r'[aeiouAEIOU]') # RegEx for lowercase and uppercase vowels
alphabetRegex = re.compile(r'[a-zA-Z]') # RegEx for lowercase and uppercase alphabet using ranges

print(vowelRegex.findall('Robocop eats baby food.')) # Finds a list of all vowels in string

doublevowelRegex = re.compile(r'[aeiouAEIOU]{2}') # RegEx for two vowels (either case) in a row; {2} means exactly two repetitions
print(doublevowelRegex.findall('Robocop eats baby food.')) # Finds a list of all pairs of consecutive vowels in string









    



['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o']
['ea', 'oo']

A useful feature of custom character classes is the negative character class:



In [30]:

    
consonantsRegex = re.compile(r'[^aeiouAEIOU]') # RegEx for finding all characters that are NOT vowels

print(consonantsRegex.findall('Robocop eats baby food.')) # Output will include spaces and punctuation, not just consonants.









    



['R', 'b', 'c', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.']

Recap

The regex .findall() method is passed a string, and returns a list of all matches in it, not just the first match.
If the regex has 0 or 1 groups, .findall() returns a list of strings.
If the regex has 2 or more groups, .findall() returns a list of tuples of strings.
\d is the shorthand character class that matches digits.
\w is the shorthand character class that matches word characters (letters, digits, and the underscore).
\s is the shorthand character class for whitespace.
\D is the shorthand character class that matches anything that is NOT a digit.
\W is the shorthand character class that matches anything that is NOT a word character.
\S is the shorthand character class that matches anything that is NOT whitespace.
You can make your own character classes with square brackets: [aeiou]
The ^ caret symbol makes it a negative character class, matching anything NOT in the brackets: [^aeiou]
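The single-group case from the recap is worth a quick check, since the list of strings holds the group's text rather than the whole match. This is a minimal sketch; `message` and `areaCodeRegex` are assumptions made for this illustration.

```python
import re

# Hypothetical sample text; the phone numbers are invented for this example.
message = 'Call me at 415-555-1011 tomorrow, or at 415-555-9999 after six.'

# One group: findall() returns a list of strings containing only that group.
areaCodeRegex = re.compile(r'(\d\d\d)-\d\d\d-\d\d\d\d')
areaCodes = areaCodeRegex.findall(message)

print(areaCodes)  # ['415', '415']
```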
