metacharacters 1: character classes

  • . any char (except a newline, by default)
  • \w any alphanumeric char or underscore (a-z, A-Z, 0-9, _)
  • \s any whitespace char (" ", \t, \n)
  • \S any non-whitespace char
  • \d any digit (0-9)
  • \. searches for an actual period

In [ ]:
#subject lines that have dates, e.g. 12/01/99
[line for line in subjects if re.search(r"\d\d/\d\d/\d\d", line)]
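
a minimal, self-contained sketch of the same idea. `sample_subjects` is a made-up stand-in for the `subjects` list (an assumption, not the real corpus), so the cell runs on its own:

```python
import re

# hypothetical stand-in for the `subjects` list used in these notes
sample_subjects = [
    "Re: meeting on 12/01/99",
    "FW: gas prices",
    "invoice no. 42",
]

# \d\d/\d\d/\d\d : two digits, a slash, two digits, a slash, two digits
dated = [line for line in sample_subjects if re.search(r"\d\d/\d\d/\d\d", line)]

# \. matches a literal period; a bare . would match any character
with_period = [line for line in sample_subjects if re.search(r"\.", line)]

print(dated)        # ['Re: meeting on 12/01/99']
print(with_period)  # ['invoice no. 42']
```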

define your own character classes

inside your regular expression, list the characters you want to allow in square brackets: [aeiou] matches any one vowel


In [ ]:
[line for line in subjects if re.search("[aeiou][aeiou][aeiou][aeiou]", line)]

In [ ]:
[line for line in subjects if re.search("F[wW]:", line)]
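
a small sketch of both patterns on a made-up list (`sample_subjects` is hypothetical, standing in for `subjects`):

```python
import re

# hypothetical stand-in for the `subjects` list
sample_subjects = [
    "Fw: beautiful day",
    "FW: queueing theory",
    "fwd: budget",
    "Re: lunch",
]

# four vowels in a row ("queueing" contains "ueuei")
four_vowels = [line for line in sample_subjects
               if re.search(r"[aeiou][aeiou][aeiou][aeiou]", line)]

# capital F, then either w or W, then a colon
forwards = [line for line in sample_subjects if re.search(r"F[wW]:", line)]

print(four_vowels)  # ['FW: queueing theory']
print(forwards)     # ['Fw: beautiful day', 'FW: queueing theory']
```

note that "fwd: budget" is not matched: the class [wW] only covers the second letter, and the F is still case-sensitive.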

metacharacters 2: anchors

  • ^ beginning of string
  • $ end of string
  • \b word boundary

In [ ]:
[line for line in subjects if re.search(r"^[Nn]ew [Yy]ork", line)]

In [ ]:
[line for line in subjects if re.search(r"\boil\b", line)]
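
a sketch of the three anchors side by side, again on a made-up list (`sample_subjects` is an assumption, not the real data):

```python
import re

# hypothetical stand-in for the `subjects` list
sample_subjects = [
    "New York trip",
    "flight to new york",
    "oil prices",
    "boiling point",
    "crude oil",
]

# ^ : the match has to start at the beginning of the string
starts_ny = [s for s in sample_subjects if re.search(r"^[Nn]ew [Yy]ork", s)]

# $ : the match has to finish at the end of the string
ends_oil = [s for s in sample_subjects if re.search(r"oil$", s)]

# \b : "oil" as a whole word, so "boiling" does not count
word_oil = [s for s in sample_subjects if re.search(r"\boil\b", s)]

print(starts_ny)  # ['New York trip']
print(ends_oil)   # ['crude oil']
print(word_oil)   # ['oil prices', 'crude oil']
```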

aside: metacharacters and escape characters

  • \n new line
  • \t tab
  • \\ single backslash (python interprets these escapes in ordinary strings; in a raw string, r"...", they are left alone)

In [1]:
x = "this is \na test"
print(x)


this is 
a test

In [2]:
x = "this is\t\t\tanother test"
print(x)


this is			another test

In [3]:
normal = "hello\nthere"
raw = r"hello\nthere"
print("normal:", normal)
print("raw:", raw)


normal: hello
there
raw: hello\nthere
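
why this matters for regular expressions: in an ordinary string python turns \b into a backspace character before `re` ever sees it, so the word boundary is lost. a minimal sketch (the sample text is made up):

```python
import re

text = "crude oil prices"

# ordinary string: "\b" becomes one backspace character (\x08)
normal_pattern = "\boil\b"
# raw string: r"\b" stays backslash + b, which re reads as a word boundary
raw_pattern = r"\boil\b"

print(len("\b"), len(r"\b"))                 # 1 2
print(re.search(normal_pattern, text))       # None
print(re.search(raw_pattern, text).group())  # oil
```

writing every pattern as a raw string avoids having to remember which escapes python would interpret.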

metacharacters 3: quantifiers

  • * matches zero or more times
  • {n} matches exactly n times
  • {n,m} matches at least n times, but no more than m times
  • {n,} matches at least n times, with no upper limit
  • + matches at least once (same as {1,})
  • ? matches one time or zero times (same as {0,1})
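
a sketch that tries each quantifier on a made-up list. `samples` and the helper `matching` are hypothetical names introduced only for this illustration:

```python
import re

# made-up strings, not from the corpus
samples = ["ww", "wow", "woooow", "color", "colour",
           "call 5551234", "see you in 1999"]

def matching(pattern):
    """Return the samples in which the pattern is found somewhere."""
    return [s for s in samples if re.search(pattern, s)]

print(matching(r"wo*w"))        # * : zero or more o's
print(matching(r"wo+w"))        # + : at least one o
print(matching(r"colou?r"))     # ? : the u is optional
print(matching(r"o{3,}"))       # {n,} : three or more o's in a row
print(matching(r"\b\d{4}\b"))   # {n} : exactly four digits as a whole word
print(matching(r"\d{5,7}"))     # {n,m} : between five and seven digits
```

a quantifier applies to whatever comes immediately before it, so in `colou?r` only the `u` is optional.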

[line for line in subjects if re.search(r"^R string matches regular expression if at the first line, you encounter .......


In [4]:
[line for line in subjects if re.search(r"\b(?:[Cc]at|[kK]itty|[kK]itten)\b", line)]


---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-4-b400ec1e8354> in <module>()
----> 1 [line for line in subjects if re.search(r"\b(?:[Cc]at|[kK]itty|[kK]itten)\b", line)]

NameError: name 'subjects' is not defined

more metacharacters: alternation

.......
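
a minimal sketch of alternation, reusing the cat pattern from the cell above on a made-up list (`sample_subjects` is hypothetical). `|` means "this or that", and `(?:...)` groups the alternatives so the `\b` on each side applies to all of them:

```python
import re

# hypothetical stand-in for the `subjects` list
sample_subjects = [
    "my cat is sick",
    "Kitty pictures",
    "concatenate files",
    "new kitten!",
    "dog show",
]

pattern = r"\b(?:[Cc]at|[kK]itty|[kK]itten)\b"
cat_lines = [line for line in sample_subjects if re.search(pattern, line)]

print(cat_lines)  # ['my cat is sick', 'Kitty pictures', 'new kitten!']
```

"concatenate files" is skipped because the "cat" inside it is not at a word boundary.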

capturing

read the whole corpus in as one big string


In [6]:
all_subjects = open("enronsubjects.txt").read()


---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-6-e5018b4c43ed> in <module>()
----> 1 all_subjects = open("enronsubjects.txt").read()

FileNotFoundError: [Errno 2] No such file or directory: 'enronsubjects.txt'

In [ ]:
all_subjects[:1000]
#looking for domain names
[line for line in subjects if re.search(r"\b\w+\.(?:com|net|org)\b", line)]
#re.findall(r"\b\w+\.(?:com|net|org)\b", all_subjects)
#re.search only answers yes or no: "will you pass the pepper?" -> "yes"
#re.findall returns the matches themselves: "will you pass the pepper?" -> "yes, here it is" *passes pepper*
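
a self-contained sketch of the difference, plus what a capturing group changes. `all_text` is a made-up string standing in for `all_subjects`:

```python
import re

# hypothetical stand-in for the corpus read in as one big string
all_text = "visit enron.com today\nsee example.org and test.net\nno domains here\n"

# re.search stops at the first match and returns a match object (or None)
first = re.search(r"\b\w+\.(?:com|net|org)\b", all_text)
print(first.group())  # enron.com

# re.findall returns every match as a list of strings
domains = re.findall(r"\b\w+\.(?:com|net|org)\b", all_text)
print(domains)        # ['enron.com', 'example.org', 'test.net']

# with a capturing group (...), findall returns only the captured part
names = re.findall(r"\b(\w+)\.(?:com|net|org)\b", all_text)
print(names)          # ['enron', 'example', 'test']
```

the `(?:...)` group is non-capturing, which is why the com/net/org part does not show up in `names`.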