Now You Code 5: Email Harvest Training v2

Let's teach you how to extract emails from a mailbox file. This has a variety of applications, the most common being building a list of emails for spamming... er, I meant "mass marketing."

The best way to find the emails in an email inbox file is to search for lines in the file that begin with From: (similar to what we did in the lab). When you find such a line, extract just the email address (not "From:" itself, only the address), and then run the extracted text through the isEmail() function I've provided below.
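To make the "From:" step concrete, here is a minimal sketch of pulling the candidate text out of a single line. The helper name address_from_line and the sample lines are made up for illustration and are not part of the assignment files; the result would still need to go through isEmail() before you treat it as an address.

```python
def address_from_line(line):
    """Return the text after 'From:' with surrounding whitespace removed,
    or None when the line does not begin with 'From:'.
    (Hypothetical helper, for illustration only.)"""
    prefix = "From:"
    if not line.startswith(prefix):
        return None
    # Slice off the prefix, then strip spaces and the trailing newline.
    return line[len(prefix):].strip()


# Made-up sample lines, not taken from the Enron files.
sample_lines = [
    "From: mafudge@syr.edu\n",
    "Subject: From: is only a prefix at the start of a line\n",
    "From: Some Person <someone@example.com>\n",
]

for line in sample_lines:
    print(repr(address_from_line(line)))
```

Note that the third sample comes back as text that is not a bare email address, which is exactly the kind of case the isEmail() check is there to filter out.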

This program uses excerpts from 4 user inboxes that are part of the infamous Enron Email Dataset. These files are:

  • NYC5-allen-inbox.txt
  • NYC5-donohoe-inbox.txt
  • NYC5-lay-inbox.txt
  • NYC5-williams-inbox.txt

In the spirit of successive refinement, our program will be divided into 2 parts.


In [ ]:
## This code provides the isEmail(text) function which checks if the text is an email address.
## RUN THIS CELL !!!
import re

# for custom mails use: r'^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w+$'
def isEmail(text):
    # Raw string so the regex backslashes are not treated as string escapes.
    regex = r'^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$'
    return re.search(regex, text) is not None

print("When mafudge@syr.edu expect True, Actual", isEmail("mafudge@syr.edu"))
print("When ' sdjkhf mafudge@syr.edu' expect False, Actual", isEmail(" sdjkhf mafudge@syr.edu"))

Part One

In part one we focus on reading from the input file and then printing the emails to the console. The program should prompt for the name of the mailbox file to read and then print the emails to the screen.

Example Run:

Enter mailbox file to read: NYC5-lay-inbox.txt

wbd_5@hotmail.com
enron_update@concureworkplace.com
jeanine.denicola@newpower.com
garydlindley@visto.com
dan.ayers@enron.com
ealvittor@yahoo.com
dmyers@atlascapitalservices.com
dmyers@atlascapitalservices.com
coa@attglobal.net
slong@exodusenergy.com
ehvaughan@vnsm.com
annette.ambriz@compaq.com
proactiveupdate@proactivenet.com
no.address@enron.com
john.hardy@enron.com
nancy.muchmore@enron.com
nick@cavendishwhite.com
aurora.dimacali@enron.com
...
...

(more emails are present, but left out.)

Part 1: Step 1 Problem Analysis

Input: Mailbox File to read

Output: emails to the screen

Algorithm:

(todo write here)

In [ ]:
## Part 1: Step 2: Write the code here

Part Two

With the first part complete, we now focus on writing these emails to a file. The name of the file to write depends on the file read: simply replace inbox with emails. For example, if you read in NYC5-donohoe-inbox.txt, you will write the emails to NYC5-donohoe-emails.txt. The program should display how many emails were written to the file.
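The file-naming rule and the counting can be sketched as follows. This is one possible approach, not the required one; the function names emails_filename and write_emails are hypothetical, and the sketch assumes the word inbox appears only once in the name you type in.

```python
def emails_filename(inbox_filename):
    """Derive the output name by swapping 'inbox' for 'emails'.
    (Hypothetical helper; str.replace changes every occurrence, so this
    assumes 'inbox' appears only once in the file name.)"""
    return inbox_filename.replace("inbox", "emails")


def write_emails(emails, out_filename):
    """Write one email per line and return how many were written.
    (Hypothetical helper, for illustration only.)"""
    count = 0
    with open(out_filename, "w") as out_file:
        for email in emails:
            out_file.write(email + "\n")
            count += 1
    return count


print(emails_filename("NYC5-donohoe-inbox.txt"))  # NYC5-donohoe-emails.txt
```

Returning the count from the writing step keeps the "Wrote ... emails to ..." message a single print at the end of the program.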

Example Run:

Enter mailbox file to read: NYC5-williams-inbox.txt

Wrote 65 emails to NYC5-williams-emails.txt

Part 2: Step 1: Problem Analysis

Input: Mailbox File to read

Outputs: the name of the email file and the number of emails to the screen; the email addresses themselves to the file

Algorithm:

(todo write here)

In [ ]:
## Part 2: Step 2:  Write code here

Step 3: Questions

  1. Did a significant amount of your code need to change from part 1 to part 2? Explain.

Answer:

  2. Devise an approach to remove duplicate emails from the output file. You don't have to write it as code, just explain it.

Answer:

Step 4: Reflection

Reflect upon your experience completing this assignment. This should be a personal narrative, in your own voice, and cite specifics relevant to the activity so as to help the grader understand how you arrived at the code you submitted. Things to consider touching upon: Elaborate on the process itself. Did your original problem analysis work as designed? How many iterations did you go through before you arrived at the solution? Where did you struggle along the way and how did you overcome it? What did you learn from completing the assignment? What do you need to work on to get better? What was most valuable and least valuable about this exercise? Do you have any suggestions for improvements?

To make a good reflection, you should journal your thoughts, questions and comments while you complete the exercise.

Keep your response to between 100 and 250 words.

--== Write Your Reflection Below Here ==--


In [ ]:
# RUN THIS CODE CELL TO TURN IN YOUR WORK!
from ist256.submission import Submission
Submission().submit()