metachars

. any char
\w any alphanumeric (a-z, A-Z, 0-9, _)
\s any whitespace char (" ", \t, \n)
\S any nonwhitespace
\d any digit (0-9)
\. searches for an actual period
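
a minimal sketch of the difference between . and \. (plus \w and \d), assuming a small made-up list called samples, since the real subjects list is not loaded in these notes:

```python
import re

# hypothetical sample subject lines (the notebook's real `subjects` list is not shown)
samples = ["Re: meeting 12/01/99", "v1.2 released", "no_digits here"]

# "." matches any character, so digit-anything-digit hits both "2/0" and "1.2"
dot_any = [s for s in samples if re.search(r"\d.\d", s)]

# "\." matches only a literal period
dot_literal = [s for s in samples if re.search(r"\d\.\d", s)]

# "\w" includes the underscore, so "no_digits" stays one token
words = re.findall(r"\w+", "no_digits here")

print(dot_any)      # ['Re: meeting 12/01/99', 'v1.2 released']
print(dot_literal)  # ['v1.2 released']
print(words)        # ['no_digits', 'here']
```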



In [ ]:

    
import re
#subject lines that have dates, e.g. 12/01/99
[line for line in subjects if re.search(r"\d\d/\d\d/\d\d", line)]

define your own character classes

inside your regular expression, write [aeiou]



In [ ]:

    
[line for line in subjects if re.search("[aeiou][aeiou][aeiou][aeiou]", line)]



In [ ]:

    
[line for line in subjects if re.search("F[wW]:", line)]

metacharacters 2: anchors

^ beginning of string
$ end of string
\b word boundary
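
a quick sketch of the three anchors, again on a made-up samples list (an assumption, not the Enron data):

```python
import re

# hypothetical sample subject lines, for illustration only
samples = ["New York energy", "Flights to new york", "oil prices",
           "boiling point", "crude oil"]

# ^ anchors the match to the start of the string
starts_ny = [s for s in samples if re.search(r"^[Nn]ew [Yy]ork", s)]

# $ anchors the match to the end of the string
ends_oil = [s for s in samples if re.search(r"oil$", s)]

# \b requires a word boundary, so "boiling" does not count as "oil"
word_oil = [s for s in samples if re.search(r"\boil\b", s)]

print(starts_ny)  # ['New York energy']
print(ends_oil)   # ['crude oil']
print(word_oil)   # ['oil prices', 'crude oil']
```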



In [ ]:

    
[line for line in subjects if re.search(r"^[Nn]ew [Yy]ork", line)]



In [ ]:

    
[line for line in subjects if re.search(r"\boil\b", line)]

aside: metacharacters and escape characters

\n new line
\t tab
\\ single backslash (python interprets these)



In [1]:

    
x = "this is \na test"
print(x)









    



this is 
a test



In [2]:

    
x = "this is\t\t\tanother test"
print(x)









    



this is			another test



In [3]:

    
normal = "hello\nthere"
raw = r"hello\nthere"
print("normal:", normal)
print("raw:", raw)









    



normal: hello
there
raw: hello\nthere
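
why this matters for regular expressions: in a normal string python turns "\b" into a backspace character before the re module ever sees it, so the word-boundary pattern silently stops working. a small sketch, with a made-up text value:

```python
import re

text = "crude oil prices"  # hypothetical example string

# in a normal string "\b" is ONE character (backspace); in a raw string it is two
print(len("\b"), len(r"\b"))  # 1 2

# normal string: the pattern asks for backspace + "oil" + backspace, which is not in text
no_raw = re.search("\boil\b", text)

# raw string: the backslash reaches the regex engine, so \b means word boundary
raw = re.search(r"\boil\b", text)

print(no_raw)       # None
print(raw.group())  # oil
```

this is why the regex patterns in these notes are written with the r"..." prefix.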

metacharacters 3: quantifiers

        * match zero or more times
        {n} matches exactly n times
        {n,m} matches at least n times, but no more than m times
        {n,} matches at least n times, but maybe infinite times
        + match at least once ({1,})
        ? match one time or zero times
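
a rough sketch of each quantifier using re.findall on short made-up strings (the strings are assumptions, chosen only to show the behaviour):

```python
import re

# * : zero or more b's after an a
star = re.findall(r"ab*", "a ab abbb")

# + : one or more digits
plus = re.findall(r"\d+", "call 555 1234 x9")

# ? : the u is optional (zero or one time)
optional = re.findall(r"colou?r", "color colour colouur")

# {n} : exactly two digits between word boundaries
exactly = re.findall(r"\b\d{2}\b", "7 42 365")

# {n,m} : between three and four digits
between = re.findall(r"\d{3,4}", "call 555 1234 x9")

# {n,} : three or more e's in a row
at_least = re.findall(r"e{3,}", "whee wheee wheeeee")

print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['555', '1234', '9']
print(optional)  # ['color', 'colour']
print(exactly)   # ['42']
print(between)   # ['555', '1234']
print(at_least)  # ['eee', 'eeeee']
```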

[line for line in subjects if re.search(r"^R string matches regular expression if at the first line, you encounter .......



In [4]:

    
[line for line in subjects if re.search(r"\b(?:[Cc]at|[kK]itty|[kK]itten)\b", line)]









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-4-b400ec1e8354> in <module>()
----> 1 [line for line in subjects if re.search(r"\b(?:[Cc]at|[kK]itty|[kK]itten)\b", line)]

NameError: name 'subjects' is not defined
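
the cell above fails only because subjects was never defined in this session. a sketch of the same pattern running against a small made-up list (samples is an assumption, not the real data):

```python
import re

# hypothetical stand-in for the missing `subjects` list
samples = ["Cat pictures", "concatenate files", "my kitten",
           "Kitty litter", "category"]

# (?:...) groups the alternatives without capturing; | means "or";
# \b on both sides keeps "concatenate" and "category" from matching
pattern = r"\b(?:[Cc]at|[kK]itty|[kK]itten)\b"
cat_lines = [line for line in samples if re.search(pattern, line)]

print(cat_lines)  # ['Cat pictures', 'my kitten', 'Kitty litter']
```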

more metacharacters: alternation

.......

capturing

read the whole corpus in as one big string



In [6]:

    
all_subjects = open("enronsubjects.txt").read()









    



---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-6-e5018b4c43ed> in <module>()
----> 1 all_subjects = open("enronsubjects.txt").read()

FileNotFoundError: [Errno 2] No such file or directory: 'enronsubjects.txt'



In [ ]:

    
all_subjects[:1000]
#looking for domain names
[line for line in subjects if re.search(r"\b\w+\.(?:com|net|org)\b", line)]
#re.findall(r"\b\w+\.(?:com|net|org)\b", all_subjects)
#"will you pass the pepper?" re.search "yes"
#"will you pass the pepper?" re.findall "yes, here it is" *passes pepper*
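
a sketch of the re.search vs re.findall difference, and of what capturing parentheses change, on a made-up string (text is an assumption; the real corpus file is not available here):

```python
import re

text = "visit enron.com or example.org, not foo.bar"  # hypothetical example string

# re.search only answers "is there a match?" (it returns the first match object or None)
first = re.search(r"\b\w+\.(?:com|net|org)\b", text)

# re.findall hands back every matching piece of text
domains = re.findall(r"\b\w+\.(?:com|net|org)\b", text)

# with ONE capturing group, findall returns just the captured part
names = re.findall(r"\b(\w+)\.(?:com|net|org)\b", text)

# with TWO capturing groups, findall returns a tuple per match
pairs = re.findall(r"\b(\w+)\.(com|net|org)\b", text)

print(first.group())  # enron.com
print(domains)        # ['enron.com', 'example.org']
print(names)          # ['enron', 'example']
print(pairs)          # [('enron', 'com'), ('example', 'org')]
```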