This can serve as a refresher, a primer, or simply more reps. Regular expressions (or regex, as I will refer to them) are a way to work effectively with strings. It is quite possible that you did not even know you could do some of the things that are possible with regex.
Regular expressions are essentially a tiny, highly specialized programming language embedded inside Python (or another language), and in Python they are made available through the re module. You can then ask questions such as “Does this string match the pattern?” or “Is there a match for the pattern anywhere in this string?”. You can also use regex to modify a string or to split it apart in various ways. The official Python docs have a HOWTO that is a great starting place.
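As a quick sketch of those four uses (asking the two questions, modifying, and splitting), here is a small example. The sample string and the separator pattern are made up for illustration.

```python
import re

text = "one, two;three   four"   # hypothetical sample string

# Question 1: does the pattern match at the very start of the string?
starts_with_one = bool(re.match(r"one", text))
# Question 2: is there a match for the pattern anywhere in the string?
contains_three = bool(re.search(r"three", text))
print(starts_with_one, contains_three)

# Modify: replace each run of commas, semicolons, or whitespace with one space.
cleaned = re.sub(r"[,;\s]+", " ", text)
print(cleaned)

# Split: break the string apart on the same separators.
parts = re.split(r"[,;\s]+", text)
print(parts)
```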
Did you know you could use the following regex to pull out all of the email addresses?
In [4]:
import re
my_str = "This (frodo@gmail.com) is what you are looking for, but here is my other address: samwise@minis-tirith.edu"
pattern = r"[A-Za-z0-9\.\+_-]+@[A-Za-z0-9\._-]+\.[a-zA-Z]*"
print(re.findall(pattern, my_str))
The email example used several key ingredients that we will go into in more detail.
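One way to see the pieces of that pattern is to spell it out with the re.VERBOSE flag, which lets you put whitespace and comments inside a pattern. This is only a sketch: the pattern is the same one used above, but the labels in the comments are my own reading of it, and the sample sentence is made up.

```python
import re

# Same pattern as the email example, one piece per line.
email_pattern = re.compile(r"""
    [A-Za-z0-9\.\+_-]+   # local part: one or more letters, digits, or . + _ -
    @                    # the literal at sign
    [A-Za-z0-9\._-]+     # domain: one or more letters, digits, or . _ -
    \.                   # a literal dot before the last label
    [a-zA-Z]*            # last label: zero or more letters
""", re.VERBOSE)

sample = "Write to frodo@gmail.com or samwise@minis-tirith.edu"
found = email_pattern.findall(sample)
print(found)
```

Keep in mind that this pattern is a teaching example and will not cover every valid email address.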
By the way, there is also a standard library option to parse emails. First, let's start with the Python re module. There are a number of ways to search for or match a pattern; the ones used throughout this section are re.search, re.match, and re.findall.
Search and match are similar, and there is a page in the docs that covers the difference. These are the methods that you will use most often to interface with regular expressions.
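To make the difference concrete, here is a minimal sketch that runs re.match, re.search, and re.findall against the same string. The variable names are only for illustration.

```python
import re

s = "la la la ti dum da"

# re.match only succeeds if the pattern matches at the start of the string.
at_start = re.match(r"ti", s)
# re.search scans the string and returns the first match found anywhere.
anywhere = re.search(r"ti", s)
# re.findall returns every non-overlapping match as a list of strings.
every = re.findall(r"la", s)

print(at_start)          # None, because the string starts with "la"
print(anywhere.span())   # (9, 11)
print(every)             # ['la', 'la', 'la']
```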
EXERCISE: Create your own examples from the following code
In [6]:
s = "la la la ti dum da"
m = re.search("la",s)
print(m.group(0))
re.findall("la",s)
Out[6]:
And before we get in too deep, it is worth noting that when there is a match, re.search() and re.match() return a match object. Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement. This means that we can check that a match exists before we try to use it.
match = re.search(pattern, string)
if match:
    process(match)
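The snippet above is schematic, so here is a runnable sketch of the same idea. The helper first_number is a hypothetical name, not part of the re module.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    match = re.search(r"\d+", text)
    if match:                 # a match object is truthy, None is falsy
        return match.group(0)
    return None

print(first_number("Quote 42"))    # 42
print(first_number("no digits"))   # None
```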
In [9]:
m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
print(m.group(0)) # The entire match
print(m.group(1)) # The first parenthesized subgroup.
print(m.group(2)) # The second parenthesized subgroup.
print(m.group(1, 2)) # Multiple arguments give us a tuple.
print(m.groups())
EXERCISES Hint: add a * to the end of your pattern to see what happens.
In [24]:
s = "Today you are you! That is truer than true! "
s += "There is no one alive who is you-er than you! (Quote 42)"
## match All of the 'T's
print(re.findall("T",s))
## match all upper case letters
print(re.findall("[A-Z]",s))
## match only the numbers
print(re.findall(r"\d+", s))
## match words, keeping hyphenated ones such as "you-er" together
print(re.findall(r"[A-Za-z0-9-]+", s))
## match any non-word characters
print(re.findall(r"\W+", s))
Where the fun really begins
EXERCISES
In [44]:
s = "You have brains in your head. "
s += "You have feet in your shoes. "
s += "You can steer yourself any direction you choose. --Dr. Seuss"
## match all words
print(re.findall(r"\w+", s))
## match only the word touching the beginning of the line
print(re.findall(r"^\w+", s))
## match the word touching the end of the line
print(re.findall(r"\w+$", s))
## match every word that comes after a single whitespace character
print(re.findall(r"\s(\w+)", s))
## Can you create a pattern that will match "you anyword"?
print(re.findall(r"[Yy]ou\s(\w+)", s))
Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal. The solution is to use Python’s raw string notation for patterns: backslashes are not handled in any special way in a string literal prefixed with 'r'.
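Here is a small sketch of that collision, using a made-up Windows-style path. Both calls search for a single literal backslash; the only difference is how the pattern is written in the source code.

```python
import re

path = "C:\\temp"      # regular literal: the string holds ONE backslash
print(len(path))       # 7

# Regular string literal: four backslashes in source become the regex \\
plain = re.findall("\\\\", path)
# Raw string literal: two backslashes in source are already the regex \\
raw = re.findall(r"\\", path)

print(plain == raw)    # True
print(len(raw))        # 1, the single backslash in the path
```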
Here is a question that an instructor recently asked in a Slack channel: if I have the following use cases, can we parse the paths?
EXERCISE: Can you create a list where each directory is separated by commas?
In [3]:
import os
a = r'this_is/some directory/'
b = r"this_is/some\ directory"
c = r"this_is/some\ other\ directory/"
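Here is one possible approach, not necessarily the answer the instructor had in mind. It assumes that "/" always separates directories and that a backslash followed by a space is just an escaped space. The helper name split_dirs is hypothetical.

```python
import re

a = r'this_is/some directory/'
b = r"this_is/some\ directory"
c = r"this_is/some\ other\ directory/"

def split_dirs(path):
    """Split a path on '/' and turn escaped spaces into plain spaces."""
    # Drop the empty piece that a trailing slash leaves behind.
    parts = [p for p in re.split(r"/", path) if p]
    # The pattern \\ followed by a space matches a backslash-escaped space.
    return [re.sub(r"\\ ", " ", p) for p in parts]

for path in (a, b, c):
    print(split_dirs(path))
```

Each call returns a regular Python list, so ", ".join(split_dirs(a)) gives the comma-separated form if you need a single string.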