Let's teach you how to extract emails from a mailbox file. This has a variety of applications, the most common being building a list of emails for spamming... er, I meant "mass marketing."
The best way to find the emails in an email inbox file is to search for lines in the file that begin with From: (similar to what we did in the lab). When you find one, keep just the email address (not "From:" itself, the email address only), and then run the filtered text through the isEmail() function I've provided below.
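As a rough sketch of that idea for a single line of text: the helper name address_from_line is hypothetical (it is not part of the assignment), and isEmail is copied from the cell provided below so the example runs on its own. It assumes the address is the only thing following "From:" on the line.

```python
import re

# Copy of the isEmail() function provided in this assignment.
def isEmail(text):
    regex = r'^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$'
    return re.search(regex, text) is not None

# Hypothetical helper: returns the address on a "From:" line, or None.
def address_from_line(line):
    if not line.startswith("From:"):
        return None                           # not a sender line
    candidate = line[len("From:"):].strip()   # drop "From:" and whitespace
    if isEmail(candidate):
        return candidate
    return None                               # text after "From:" is not an email

print(address_from_line("From: mafudge@syr.edu\n"))
print(address_from_line("Subject: hello\n"))
```

Returning None for anything that is not a clean address keeps the caller simple: it only has to check whether it got a value back.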
This program comes with excerpts from 4 user inboxes taken from the infamous Enron Email Dataset. These files are:
NYC5-allen-inbox.txt
NYC5-donohoe-inbox.txt
NYC5-lay-inbox.txt
NYC5-williams-inbox.txt
In the spirit of successive refinement, our program will be divided into 2 parts.
In [ ]:
## This code provides the isEmail(text) function which checks if the text is an email address.
## RUN THIS CELL !!!
import re
# for custom mails use: '^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w+$'
def isEmail(text):
    regex = r'^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$'
    return re.search(regex,text) is not None
print("When mafudge@syr.edu expect True, Actual", isEmail("mafudge@syr.edu"))
print("When ' sdjkhf mafudge@syr.edu' expect False, Actual", isEmail(" sdjkhf mafudge@syr.edu"))
In part one we focus on reading from the input file and then printing the emails to the console. The program should prompt for the name of the mailbox file to read and then print the emails to the screen.
Example Run:
Enter mailbox file to read: NYC5-lay-inbox.txt
wbd_5@hotmail.com
enron_update@concureworkplace.com
jeanine.denicola@newpower.com
garydlindley@visto.com
dan.ayers@enron.com
ealvittor@yahoo.com
dmyers@atlascapitalservices.com
dmyers@atlascapitalservices.com
coa@attglobal.net
slong@exodusenergy.com
ehvaughan@vnsm.com
annette.ambriz@compaq.com
proactiveupdate@proactivenet.com
no.address@enron.com
john.hardy@enron.com
nancy.muchmore@enron.com
nick@cavendishwhite.com
aurora.dimacali@enron.com
...
...
(more emails are present, but left out.)
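One possible shape for the Part 1 loop is sketched below. It is not the only way to do it. The function name extract_emails is hypothetical, and the sample text is a made-up stand-in for a mailbox file (io.StringIO behaves like an open file here), so the real inbox files may contain "From:" lines this sketch does not anticipate.

```python
import io
import re

# Copy of the isEmail() function provided in this assignment.
def isEmail(text):
    regex = r'^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$'
    return re.search(regex, text) is not None

# Hypothetical helper: collect the addresses found on "From:" lines.
def extract_emails(lines):
    emails = []
    for line in lines:
        if line.startswith("From:"):
            candidate = line.replace("From:", "", 1).strip()
            if isEmail(candidate):
                emails.append(candidate)
    return emails

# Made-up sample standing in for an inbox file. In the real program you
# would use: with open(filename, "r") as f: ...
sample = io.StringIO(
    "From: dan.ayers@enron.com\n"
    "Subject: update\n"
    "From: coa@attglobal.net\n"
)
for email in extract_emails(sample):
    print(email)
```

Keeping the extraction in a function that accepts any iterable of lines means the same code works on an open file, a list of strings, or a test sample.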
In [ ]:
## Part 1: Step 2: Write the code here
With the first part complete, we now focus on writing these emails to a file. The name of the file to write depends on the file read: simply replace inbox with emails. For example, if you read in NYC5-donohoe-inbox.txt you will write the emails to NYC5-donohoe-emails.txt. The program should display how many emails were written to the file.
Example Run:
Enter mailbox file to read: NYC5-williams-inbox.txt
Wrote 65 emails to NYC5-williams-emails.txt
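The file-naming and counting steps might look something like the sketch below. The names emails_filename and write_emails are hypothetical, the two addresses are sample values, and the demo writes into a temporary directory so it does not leave files behind. Your program would write to the current folder instead.

```python
import os
import tempfile

# Hypothetical helper: derive the output name from the input name.
# Note that str.replace() swaps every occurrence of "inbox" in the name.
def emails_filename(inbox_filename):
    return inbox_filename.replace("inbox", "emails")

# Hypothetical helper: write one email per line, return how many were written.
def write_emails(emails, out_filename):
    with open(out_filename, "w") as f:
        for email in emails:
            f.write(email + "\n")
    return len(emails)

out_name = emails_filename("NYC5-donohoe-inbox.txt")
with tempfile.TemporaryDirectory() as folder:
    count = write_emails(["dan.ayers@enron.com", "coa@attglobal.net"],
                         os.path.join(folder, out_name))
print("Wrote", count, "emails to", out_name)
```

Returning the count from the function that does the writing keeps the number reported on screen tied to what actually went into the file.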
In [ ]:
## Part 2: Step 2: Write code here
Reflect upon your experience completing this assignment. This should be a personal narrative, in your own voice, and cite specifics relevant to the activity so as to help the grader understand how you arrived at the code you submitted. Things to consider touching upon: Elaborate on the process itself. Did your original problem analysis work as designed? How many iterations did you go through before you arrived at the solution? Where did you struggle along the way and how did you overcome it? What did you learn from completing the assignment? What do you need to work on to get better? What was most valuable and least valuable about this exercise? Do you have any suggestions for improvements?
To make a good reflection, you should journal your thoughts, questions and comments while you complete the exercise.
Keep your response to between 100 and 250 words.
--== Write Your Reflection Below Here ==--
In [ ]:
# RUN THIS CODE CELL TO TURN IN YOUR WORK!
from ist256.submission import Submission
Submission().submit()