A common workflow with regular expressions is that you write a pattern for the thing you are looking for. **Remember that there are different flavours of regex, so what works in one might not work in another.**
In short, this pattern describes an email address. With the above regex pattern, we can search through a text file to find email addresses, or verify whether a given string looks like an email address.
The most basic regex pattern is a token such as <b>, i.e. a single literal character. In the string "Zebra is an animal.", this will match the very first b, the one in Zebra. Note that, for now, it doesn't matter that it sits in the middle of a word.
Now let me introduce a few very basic things that regex uses to define patterns (remember the e-mail address pattern above; we will now break it down piece by piece).
In the regex discussed in this tutorial, there are 11 characters with special meanings: the opening square bracket <[>, the backslash <\>, the caret <^>, the dollar sign <$>, the period or dot <.>, the vertical bar or pipe symbol <|>, the question mark <?>, the asterisk or star <*>, the plus sign <+>, the opening round bracket <(> and the closing round bracket <)>. These special characters are often called “metacharacters”.
Meta character | Description |
---|---|
. | Period matches any single character except a line break. |
[ ] | Character class. Matches any character contained between the square brackets. |
[^ ] | Negated character class. Matches any character that is not contained between the square brackets . |
* | Matches 0 or more repetitions of the preceding symbol. |
+ | Matches 1 or more repetitions of the preceding symbol. |
? | Makes the preceding symbol optional. |
{n,m} | Braces. Matches at least "n" but not more than "m" repetitions of the preceding symbol. |
(xyz) | Character group. Matches the characters xyz in that exact order. |
\| | Alternation. Matches either the characters before or the characters after the symbol. |
\ | Escapes the next character. This allows you to match reserved characters [ ] ( ) { } . * + ? ^ $ \ . |
^ | Matches the beginning of the input. |
$ | Matches the end of the input. |
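To see why the escape character in the table matters, here is a minimal sketch using Python's `re` module. The sample string `price` is made up for illustration; the point is only that `$` and `.` stop being special once they are escaped.

```python
import re

price = "Total: $5.00 (approx.)"

# Unescaped, "$" is the end-of-input anchor and "." matches any character,
# so this pattern cannot match the literal text "$5.00".
unescaped = re.search(r"$5.00", price)

# Escaping each metacharacter with a backslash makes it literal.
escaped = re.search(r"\$5\.00", price)

# re.escape() adds the backslashes for you when the text comes from elsewhere.
auto = re.search(re.escape("$5.00"), price)

print(unescaped, escaped.group(), auto.group())
```

`re.escape()` is handy when the literal text is user input and you cannot know in advance which metacharacters it contains.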
This is a very important point to understand: a regex-directed engine will always return the leftmost match, even if a better match could be found later. When applying a regex to a string, the engine will start at the first character of the string. It will try all possible permutations of the regular expression at the first character. Only if all possibilities have been tried and found to fail, will the engine continue with the second character in the text. Again, it will try all possible permutations of the regex, in exactly the same order. The result is that the regex-directed engine will return the leftmost match.
When applying <cat> to He captured a catfish for his cat., the engine will try to match the first token in the regex <c> to the first character in the match H. This fails. There are no other possible permutations of this regex, because it merely consists of a sequence of literal characters. So the regex engine tries to match the <c> with the e. This fails too, as does matching the c with the space. Arriving at the 4th character in the match, <c> matches c. The engine will then try to match the second token <a> to the 5th character, a. This succeeds too. But then, <t> fails to match p. At that point, the engine knows the regex cannot be matched starting at the 4th character in the match. So it will continue with the 5th: a. Again, <c> fails to match here and the engine carries on. At the 15th character in the match, <c> again matches c. The engine then proceeds to attempt to match the remainder of the regex at character 15 and finds that <a> matches a and <t> matches t.
The entire regular expression could be matched starting at character 15. The engine is "eager" to report a match. It will therefore report the first three letters of catfish as a valid match. The engine never proceeds beyond this point to see if there are any better matches. The first match is considered good enough.
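The walk-through above can be checked with a short sketch in Python. It assumes the sentence from the text and uses `re.search` for the single eager match and `re.finditer` to list every occurrence; offsets are 0-based, so the "15th character" shows up as index 14.

```python
import re

text = "He captured a catfish for his cat."

# re.search scans left to right and stops at the first position where the pattern fits.
first = re.search(r"cat", text)
print(first.span())   # (start, end) offsets, 0-based

# re.finditer keeps going after each match and reports every non-overlapping one.
all_starts = [m.start() for m in re.finditer(r"cat", text)]
print(all_starts)
```

The standalone word cat at the end of the sentence is only found when you explicitly ask for all matches.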
Character sets are also called character classes. Square brackets are used to specify character sets. Use a hyphen inside a character set to specify a range of characters. The order of the characters inside the square brackets doesn't matter (although the two ends of a range must be given in ascending order). For example, the regular expression [Tt]he means: an uppercase T or lowercase t, followed by the letter h, followed by the letter e.
A period inside a character set, however, means a literal period. The regular expression <ar[.]> means: a lowercase character a, followed by letter r, followed by a period . character.
<ar[.]> => A garage is a good place to park a car.
<[0-9]> => Matches a single digit between 0 and 9. You can use more than one range.
Typing a caret(^) after the opening square bracket will negate the character class. The result is that the character class will match any character that is not in the character class.
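A small sketch of the character-class rules above, in Python. The first test sentence and the strings "Room 42, floor 7" and "regex rules" are invented for the example; the garage sentence is the one used in the text.

```python
import re

# [Tt]he: either case of the first letter, then "he".
the_hits = re.findall(r"[Tt]he", "The car parked in the garage.")

# ar[.]: inside the brackets the dot is a literal period.
dot_hits = re.findall(r"ar[.]", "A garage is a good place to park a car.")

# [0-9]: one digit per match.
digit_hits = re.findall(r"[0-9]", "Room 42, floor 7")

# [^aeiou ]: negated class, anything that is not a vowel or a space.
consonants = re.findall(r"[^aeiou ]", "regex rules")

print(the_hits, dot_hits, digit_hits, consonants)
```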
Regular expressions provide convenient shorthands for commonly used character sets. The shorthand character sets are as follows:
Shorthand | Description |
---|---|
. | Any character except new line. It's the most commonly misused metacharacter. |
\w | Matches alphanumeric characters: [a-zA-Z0-9_] |
\W | Matches non-alphanumeric characters: [^\w] |
\d | Matches digit: [0-9] |
\D | Matches non-digit: [^\d] |
\s | Matches whitespace character: [\t\n\f\r\p{Z}] |
\S | Matches non-whitespace character: [^\s] |
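Here is a minimal sketch of the shorthands in Python, on a made-up `sample` string. Two caveats that are my own notes rather than part of the table: for `str` patterns Python treats `\w`, `\d` and `\s` as Unicode-aware unless you pass `re.ASCII`, and as far as I know the built-in `re` module does not accept the `\p{Z}` form shown for `\s`.

```python
import re

sample = "Order #42 shipped on 2021-03-04\tto Bob_7"

digit_runs = re.findall(r"\d+", sample)     # runs of digits
word_runs = re.findall(r"\w+", sample)      # letters, digits and underscore
fields = re.split(r"\s+", sample)           # split on any run of whitespace
non_word = re.findall(r"[^\w\s]", sample)   # neither word character nor whitespace

print(digit_runs)
print(word_runs)
print(fields)
print(non_word)
```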
The symbol * matches zero or more repetitions of the preceding matcher. The regular expression a* means: zero or more repetitions of the preceding lowercase character a. But if it appears after a character set or class, then it finds repetitions of the whole character set. For example, the regular expression [a-z]* means: any number of lowercase letters in a row. The * symbol can be used with the meta character . to match any string of characters: .* The * symbol can also be used with the whitespace character \s to match a string of whitespace characters. For example, the expression \s*cat\s* means: zero or more whitespace characters, followed by lowercase character c, followed by lowercase character a, followed by lowercase character t, followed by zero or more whitespace characters.
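The star examples above can be tried out with a short Python sketch. I use `re.fullmatch` so that the whole string has to fit the pattern; the test strings are invented for illustration.

```python
import re

# a* accepts the empty string as well as any run of "a".
empty_ok = re.fullmatch(r"a*", "") is not None
run_ok = re.fullmatch(r"a*", "aaaa") is not None

# [a-z]* repeats the whole class, so mixed lowercase letters are fine,
# but an uppercase letter is not part of the class.
lower_ok = re.fullmatch(r"[a-z]*", "regex") is not None
upper_ok = re.fullmatch(r"[a-z]*", "Regex") is not None

# \s*cat\s* also swallows the whitespace around "cat".
padded = re.search(r"\s*cat\s*", "the   cat  sat").group()

print(empty_ok, run_ok, lower_ok, upper_ok, repr(padded))
```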
The symbol + matches one or more repetitions of the preceding character. For example, the regular expression c.+t means: lowercase letter c, followed by at least one character, followed by the lowercase character t. Because + is greedy, that t is the last t in the sentence.
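To make the "last t in the sentence" remark concrete, here is a small Python sketch. The sentence is an assumption chosen for the demo, and the lazy variant `c.+?t` is added only for contrast.

```python
import re

sentence = "The fat cat sat on the mat."

# Greedy: .+ grabs as much as it can, then backs off to the last "t".
greedy = re.search(r"c.+t", sentence).group()

# Lazy: .+? takes as little as it can, so the match ends at the first "t".
lazy = re.search(r"c.+?t", sentence).group()

print(repr(greedy))
print(repr(lazy))
```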
In regular expressions, the meta character ? makes the preceding character optional. This symbol matches zero or one instance of the preceding character. For example, the regular expression [T]?he means: an optional uppercase letter T, followed by the lowercase character h, followed by the lowercase character e.
Regex | Means |
---|---|
abc+ | matches a string that has ab followed by one or more c |
abc? | matches a string that has ab followed by zero or one c |
abc{2} | matches a string that has ab followed by 2 c |
abc{2,} | matches a string that has ab followed by 2 or more c |
abc{2,5} | matches a string that has ab followed by 2 up to 5 c |
a(bc)* | matches a string that has a followed by zero or more copies of the sequence bc |
a(bc){2,5} | matches a string that has a followed by 2 up to 5 copies of the sequence bc |
<.+> | matches <div>simple div</div> |
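The rows of this table can be checked mechanically. The sketch below is one way to do it in Python: the `candidates` list is invented, and I use `re.fullmatch` (the whole string must fit) rather than a substring search, which is a stricter reading than "a string that has".

```python
import re

candidates = ["ab", "abc", "abcc", "abcccccc"]
patterns = ["abc+", "abc?", "abc{2}", "abc{2,}", "abc{2,5}"]

# For each pattern, keep the candidates it matches from start to end.
table = {p: [s for s in candidates if re.fullmatch(p, s)] for p in patterns}
for p, hits in table.items():
    print(f"{p:10} {hits}")

# A quantifier after a group repeats the whole group.
repeated_group = re.fullmatch(r"a(bc)*", "abcbc") is not None

html = "<div>simple div</div>"
greedy = re.search(r"<.+>", html).group()    # runs to the last ">"
lazy = re.search(r"<.+?>", html).group()     # stops at the first ">"
print(greedy, lazy)
```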
In regular expressions, the dot or period is one of the most commonly used metacharacters. Unfortunately, it is also the most commonly misused metacharacter. The dot is short for the negated character class <[^\n]> (UNIX regex flavors) or <[^\r\n]> (Windows regex flavors).
Use The Dot Sparingly
Put in a dot, and everything will match just fine when you test the regex on valid data. The problem is that the regex will also match in cases where it should not match.
Example: let’s say we want to match a date in mm/dd/yy format, but we want to leave the user the choice of date separators. The quick solution is <\d\d.\d\d.\d\d>. It seems fine at first sight: it will match a date like 02/12/03, just what we intended. The trouble is that 02512703 is accepted as well, because the dot matches any character, digits included. A character class that lists the allowed separators, such as <\d\d[- /.]\d\d[- /.]\d\d>, is the safer choice.
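A quick Python sketch of the date example. The names `loose` and `strict` are mine, and `strict` uses a character class of separators as an assumed alternative to the bare dot.

```python
import re

loose = r"\d\d.\d\d.\d\d"             # dot as separator: accepts anything
strict = r"\d\d[- /.]\d\d[- /.]\d\d"  # only the listed separators

# Both accept a well-formed date.
loose_valid = re.fullmatch(loose, "02/12/03") is not None
strict_valid = re.fullmatch(strict, "02/12/03") is not None

# Only the loose pattern accepts eight digits in a row.
loose_junk = re.fullmatch(loose, "02512703") is not None
strict_junk = re.fullmatch(strict, "02512703") is not None

print(loose_valid, strict_valid, loose_junk, strict_junk)
```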
Anchors are a different breed. They do not match any character at all. Instead, they match a position before, after or between characters. They can be used to anchor the regex match at a certain position.
The caret <^> matches the position before the first character in the string. Applying <^a> to abc matches a. <^b> will not match abc at all, because the <b> cannot be matched right after the start of the string, matched by <^>. Similarly, <$> matches the position right after the last character in the string. <c$> matches c in abc, while <a$> does not match abc at all.
The word boundary \b matches positions where one side is a word character (usually a letter, digit or underscore, but see below for variations across engines) and the other side is not a word character (for instance, it may be the beginning of the string or a space character).
The regex \bcat\b would therefore match cat in a black cat, but it wouldn't match it in catatonic, tomcat or certificate. Removing one of the boundaries, \bcat would match cat in catfish, and cat\b would match cat in tomcat, but not vice-versa. Both, of course, would match cat on its own.
Word boundaries are useful when you want to match a sequence of letters (or digits) on their own, or to ensure that they occur at the beginning or the end of a sequence of characters.
Be aware, though, that \bcat\b will not match cat in _cat or in cat25 because there is no boundary between an underscore and a letter, nor between a letter and a digit: these all belong to what regex defines as word characters.
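The anchor and word-boundary cases discussed above can be verified with a short Python sketch. The `words` list simply collects the strings mentioned in the text.

```python
import re

# Anchors match positions, not characters.
caret_a = re.search(r"^a", "abc") is not None
caret_b = re.search(r"^b", "abc") is not None
dollar_c = re.search(r"c$", "abc") is not None
dollar_a = re.search(r"a$", "abc") is not None

words = ["a black cat", "catatonic", "tomcat", "certificate", "catfish", "_cat", "cat25"]

# Boundary on both sides, only at the start, only at the end.
whole_word = [w for w in words if re.search(r"\bcat\b", w)]
starts_word = [w for w in words if re.search(r"\bcat", w)]
ends_word = [w for w in words if re.search(r"cat\b", w)]

print(caret_a, caret_b, dollar_c, dollar_a)
print(whole_word, starts_word, ends_word, sep="\n")
```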
By placing part of a regular expression inside round brackets or parentheses, you can group that part of the regular expression together. This allows you to apply a regex operator, e.g. a repetition operator, to the entire group. Only round brackets can be used for grouping. Square brackets define a character class, and curly braces are used by a special repetition operator.
For example, Set(Value)? matches Set or SetValue. When it matches SetValue, the first backreference will contain Value.
Backreferences allow you to re-use part of the regex match. You can reuse it inside the regular expression itself or afterwards (for instance in replacement text), depending on the regex flavour you are using. Some regex flavours use a backslash (\1), some flavours use a dollar sign ($1), etc. In Perl, you can use the magic variables $1, $2, etc. to access the part of the string matched by the backreference.
Regex: (\w+)\1 on the string seek will match ee.
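In Python's `re` module the backslash form is used both inside the pattern and in replacement text. A minimal sketch of the examples above, plus one invented word-swap to show a backreference in a replacement:

```python
import re

# (\w+)\1: a run of word characters immediately followed by the same run.
doubled = re.search(r"(\w+)\1", "seek")
print(doubled.group(), doubled.group(1))

# An optional group that took part in the match holds the matched text...
with_value = re.fullmatch(r"Set(Value)?", "SetValue").group(1)
# ...and is None when the group was skipped.
without_value = re.fullmatch(r"Set(Value)?", "Set").group(1)

# Backreferences in the replacement text: swap two words.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hello world")
print(with_value, without_value, swapped)
```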
PS I am myself studying this section properly, hence couldn't add more details :))
Suppose you want to use a regex to match a list of function names in a programming language: "Get, GetValue, Set or SetValue."
Get|GetValue|Set|SetValue
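Now look closely at both the regex and the string. Applied to SetValue, this regex matches only Set, because the engine tries the alternatives from left to right and stops at the first one that succeeds. Here are some other ways to do the same task: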
Get(Value)?|Set(Value)?
\b(Get|GetValue|Set|SetValue)\b
\b(Get(Value)?|Set(Value)?)\b
\b(Get|Set)(Value)?\b
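To compare the variants listed above, this Python sketch runs each of them against the string SetValue and records what was matched. It is only a demonstration of the engine's left-to-right behaviour with alternation.

```python
import re

patterns = [
    r"Get|GetValue|Set|SetValue",
    r"Get(Value)?|Set(Value)?",
    r"\b(Get|GetValue|Set|SetValue)\b",
    r"\b(Get(Value)?|Set(Value)?)\b",
    r"\b(Get|Set)(Value)?\b",
]

# What does each pattern actually match in "SetValue"?
results = {p: re.search(p, "SetValue").group() for p in patterns}
for p, matched in results.items():
    print(f"{matched:10} <- {p}")
```

Only the first pattern stops at Set; the optional group or the word boundaries force the others to continue to the end of the name.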
Regex: <[^>]+>
- **What it does**: Finds any HTML tag, such as <a>, <b>, <img />, <br />, etc. You can use this to find segments that have HTML tags you need to deal with, or to remove all HTML tags from a text.
Regex: https?:\/\/[\w\.\/\-?=&%,]+
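A short Python sketch of the tag pattern and the URL pattern above. The `html` snippet and the example.com URL are invented, and this is a rough text-cleaning trick rather than a real HTML parser (an attribute value containing `>` would already confuse it).

```python
import re

html = 'Visit <a href="https://example.com/docs?id=7">the docs</a> <br />now.'

TAG = r"<[^>]+>"
URL = r"https?:\/\/[\w\.\/\-?=&%,]+"

tags = re.findall(TAG, html)        # every tag, opening or closing
stripped = re.sub(TAG, "", html)    # text with the tags removed
urls = re.findall(URL, html)        # URLs, wherever they appear

print(tags)
print(stripped)
print(urls)
```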
Regex: '\w+?'
Regex: ([-A-Za-z0-9_]*?([-A-Za-z_][0-9]|[0-9][-A-Za-z_])[-A-Za-z0-9_]*)
Regex: \b(the|The)\b.*?\b(?=\W?\b(is|are|was|can|shall|must|that|which|about|by|at|if|when|should|among|above|under|$)\b)
The Web based look up is our new feature. A project manager should not proofread... Our Product Name is...
Regex: \b(a|an|A|An)\b.*?\b(?=\W?\b(is|are|was|can|shall|must|that|which|about|by|at|if|when|among|above|under|$)\b)
Regex: \b(this|these|This|These)\b.*?\b(?=\W?\b(is|are|was|can|shall|must|that|which|about|by|at|if|when|among|above|under|$)\b)
- **What it does**: This works much like the Regex shown above, except that it finds text that begins with this or these. This can also be very helpful when you need to extract terminology from a project.
Regex :(.*?)
re.sub(regex, replacement, subject) performs a search-and-replace across subject, replacing all matches of regex in subject with replacement. The result is returned by the sub() function. The subject string you pass is not modified. The re.sub() function applies the same backslash logic to the replacement text as is applied to the regular expression. Therefore, you should use raw strings for the replacement text.
In [1]:
%load_ext autoreload
%autoreload
import re, time
In [2]:
s = 'How do you do this'
print('After applying re.sub -- ', re.sub(r"How do you", "How do I", s), '\nOriginal Text is still -- ', s)
So does that mean we have to write one regex, run it, check it, and only then move on to the next substitution? In other words, can't we stack re.sub(), re.sub(), re.sub()...?
Surely we can. Remember that re.sub() returns a string with the changes made for the pattern you asked for, so that string can be fed straight into the next call.
In [3]:
s_old = 'How do you do this'
# re.sub() returns a new string, so the result of one call can be passed to the next one.
s_new = re.sub(r"How do you", "How do I", s_old)
s_new = re.sub(r"this", "that", s_new)
print(f'Original text is still **{s_old}**, and after two stacked substitutions we get **{s_new}**')
In [4]:
tweet = '#fingerprint #Pregnancy Test https://goo.gl/h1MfQV #android +#apps +#beautiful \
#cute #health #igers #iphoneonly #iphonesia #iphone \
<3 ;D :( :-('
#Let's take care of emojis and the #(hash-tags)...
print(f'Original Tweet ---- \n {tweet}')
## Replacing #hashtag with only hashtag
tweet = re.sub(r'#(\S+)', r' \1 ', tweet)
#This gets a bit technical, as here we are using backreferences and shorthand character sets, and replacing with the captured group.
#\S = [^\s] matches any character that isn't whitespace
print(f'\n Tweet after replacing hashtags ----\n {tweet}')
## Love -- <3, :*
tweet = re.sub(r'(<3|:\*)', ' EMO_POS ', tweet)
print(f'\n Tweet after replacing Emojis for Love with EMO_POS ----\n {tweet}')
#The parentheses are for grouping, so we search (remember the raw string (`r`))
#either for <3 or (|) :\* (as * is a metacharacter, it is preceded by a backslash)
## Wink -- ;-), ;), ;-D, ;D, (;, (-;
tweet = re.sub(r'(;-?\)|;-?D|\(-?;)', ' EMO_POS ', tweet)
print(f'\n Tweet after replacing Emojis for Wink with EMO_POS ----\n {tweet}')
#The parentheses are for grouping as usual. We first focus on `;-), ;)`: we need a `;` first,
#then either a `-` or nothing, which we get by putting `?` after the `-`, and then a literal `)`.
#That gives the first alternative `;-?\)`, and similarly for the others...
## Sad -- :-(, : (, :(, ):, )-:
tweet = re.sub(r'(:\s?\(|:-\(|\)\s?:|\)-:)', ' EMO_NEG ', tweet)
print(f'\n Tweet after replacing Emojis for Sad with EMO_NEG ----\n {tweet}')
In [5]:
##Look at the output carefully: there are unnecessary spaces in between...
## Replace multiple spaces with a single space
tweet = re.sub(r'\s+', ' ', tweet)
print(f'\n Tweet after replacing extra spaces ----\n {tweet}')
##Remove the punctuation (+, ;, etc.)
tweet = re.sub(r'[^\w\s]','',tweet)
print(f'\n Tweet after removing punctuation ----\n {tweet}')
In [6]:
# bags of positive/negative smileys (you can extend the above example to take care of these too; a good exercise)
positive_emojis = set([
":‑)",":)",":-]",":]",":-3",":3",":->",":>","8-)","8)",":-}",":}",":o)",":c)",":^)","=]","=)",":‑D",":D","8‑D","8D",
"x‑D","xD","X‑D","XD","=D","=3","B^D",":-))",";‑)",";)","*-)","*)",";‑]",";]",";^)",":‑,",";D",":‑P",":P","X‑P","XP",
"x‑p","xp",":‑p",":p",":‑Þ",":Þ",":‑þ",":þ",":‑b",":b","d:","=p",">:P", ":'‑)", ":')", ":-*", ":*", ":×"
])
negative_emojis = set([
":‑(",":(",":‑c",":c",":‑<",":<",":‑[",":[",":-||",">:[",":{",":@",">:(","D‑':","D:<","D:","D8","D;","D=","DX",":‑/",
":/",":‑.",'>:\\', ">:/", ":\\", "=/" ,"=\\", ":L", "=L",":S",":‑|",":|","|‑O","<:‑|"
])
In [7]:
## Pattern to match any IP Addresses
pattern = r'\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b'
The above pattern will also match 999.999.999.999, but that isn't a valid IP address at all. How accurate the regex needs to be depends on the data at hand. To restrict all 4 numbers in the IP address to 0..255, you can use this complex beast:
\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b
In [8]:
updated_pattern = r'\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'
updated_pattern
Out[8]:
In [9]:
if re.search(pattern, '999.999.999.999'):
    print('Matched')
if re.search(updated_pattern, '256.999.999.999'):
    print('Matched')
else:
    print('Not Matched')
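The long pattern repeats the same octet rule four times, so one way to keep it readable is to build it from a single piece. This is only a sketch: the names `OCTET`, `IPV4` and `find_ipv4` are hypothetical, and the octet uses a non-capturing group `(?:...)` where the original uses capturing brackets.

```python
import re

# One octet: 250-255, 200-249, or 0-199 (leading zeros allowed, as in the original).
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)"

# Four octets joined by literal dots, with word boundaries on both sides.
IPV4 = re.compile(r"\b" + r"\.".join([OCTET] * 4) + r"\b")

def find_ipv4(text):
    """Return the first IPv4-looking substring of text, or None."""
    match = IPV4.search(text)
    return match.group() if match else None

print(find_ipv4("ping 192.168.0.1 now"))
print(find_ipv4("256.999.999.999"))
```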
In [10]:
#Valid Dates..
pattern = r'(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])'
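This matches a date in yyyy-mm-dd format between 1900-01-01 and 2099-12-31, with a choice of four separators (space included :)).
The year is matched by (19|20)\d\d.
The month is matched by (0[1-9]|1[012]) (the round brackets are necessary to keep both options together): the first option matches a number between 01 and 09, and the second matches 10, 11 or 12.
The day is matched by (0[1-9]|[12][0-9]|3[01]), which consists of three options: the first matches 01 through 09, the second 10 through 29, and the third matches 30 or 31.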
...
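A small Python sketch that tries the date pattern on a few strings. The `samples` list is invented; note that the regex only checks the shape of the date, so it accepts 2024-02-31 even though February never has 31 days.

```python
import re

DATE = re.compile(r"(19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])")

samples = ["2024-02-29", "1999/12/31", "2100-01-01", "2024-13-01", "2024-02-31"]

# Keep the samples that fit the pattern from start to end.
accepted = [s for s in samples if DATE.fullmatch(s)]
print(accepted)

# The three groups hold the century digits, the month and the day.
century, month, day = DATE.fullmatch("1999/12/31").groups()
print(century, month, day)
```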