The goals of this lab are to help you to understand:
For this lab, we will write a program to read spam confidence headers from a mailbox file like CCL-mbox-tiny.txt or CCL-mbox-small.txt. These files contain raw email data, and in that data is a SPAM confidence number for each message:
X-DSPAM-Confidence: 0.8475
Our goal will be to find each of these lines in the file and extract the confidence number (in this case 0.8475), with the end goal of calculating the average SPAM confidence of all the emails in the file.
Let's start with some code to read the lines of text from CCL-mbox-tiny.txt:
In [ ]:
filename = "CCL-mbox-tiny.txt"
with open(filename, 'r') as f:
    for line in f.readlines():
        print(line.strip())
In [ ]:
# TODO Write code to not print the lines but count the number of lines!
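As a hint for the counting pattern, here is a minimal sketch that counts lines in a small in-memory sample instead of the real mailbox file. The `sample` text and the `line_count` name are made up for illustration, and `io.StringIO` simply lets a string stand in for an open file.

```python
import io

# Hypothetical stand-in for the mailbox file (not real lab data).
sample = "From: a@example.com\nX-DSPAM-Confidence: 0.8475\nSubject: hi\n"

line_count = 0
with io.StringIO(sample) as f:
    for line in f.readlines():
        # Add one to the counter for each line instead of printing it.
        line_count = line_count + 1

print(line_count)
```

The same loop shape should work with `open(filename, 'r')` in place of `io.StringIO(sample)`.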
Next, we'll focus on getting only the X-DSPAM-Confidence: lines. We do this by including an if statement inside the for loop. This is a very common pattern in computing, used to search through massive amounts of data.
You need to edit line 4 of the code below (the if statement) so that it prints only lines which begin with X-DSPAM-Confidence:
You know you got it working if your output contains only 5 rows.
In [ ]:
filename = "CCL-mbox-tiny.txt"
with open(filename, 'r') as f:
    for line in f.readlines():
        if TODO:
            print(line.strip())
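One way to express "begins with" in Python is the string method `startswith()`. The sketch below tries it on a few made-up lines rather than the lab's data file; the `lines` and `matches` names are assumptions for illustration only.

```python
# Hypothetical sample lines (not taken from the lab's data files).
lines = [
    "From: a@example.com",
    "X-DSPAM-Confidence: 0.8475",
    "Subject: X-DSPAM-Confidence: is mentioned here",
    "X-DSPAM-Confidence: 0.6178",
]

matches = []
for line in lines:
    # startswith() is True only when the text begins with the given prefix,
    # so the Subject line above is skipped even though it contains the text.
    if line.startswith("X-DSPAM-Confidence:"):
        matches.append(line)
        print(line.strip())
```

Checking the start of the line, rather than using `in`, avoids matching lines that merely mention the header somewhere in the middle.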
The final step is to figure out how to parse the confidence value out of the string.
For example, for the given line X-DSPAM-Confidence: 0.8475 we need to get the value 0.8475 as a float.
The strategy here is to replace X-DSPAM-Confidence: with an empty string, then call the float() function to convert the result to a float.
Write code to parse the value 0.8475 from the text string 'X-DSPAM-Confidence: 0.8475'.
In [ ]:
line = 'X-DSPAM-Confidence: 0.8475'
number = #TODO remove 'X-DSPAM-Confidence:' , then convert to a float.
print(number)
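The replace-then-convert strategy can be sketched on a different, made-up header value. The names `sample`, `text`, and `value` are illustrative assumptions. Note that `float()` accepts surrounding whitespace, so the space left behind after the replacement is not a problem.

```python
# Hypothetical header line with a made-up confidence value.
sample = "X-DSPAM-Confidence: 0.6178"

# Replace the header name with an empty string, leaving " 0.6178".
text = sample.replace("X-DSPAM-Confidence:", "")

# float() ignores leading and trailing whitespace when converting.
value = float(text)

print(value)
```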
Now that we have all the working parts, let's put it all together.
0. use the file named 'CCL-mbox-short.txt'
1. line count is 0
2. total confidence is 0
3. open mailbox file
4. for each line in file
5. if line starts with `X-DSPAM-Confidence:`
6. remove `X-DSPAM-Confidence:` from line and convert to float
7. increment line count
8. add spam confidence to total confidence
9. print average confidence (total confidence/line count)
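The steps above can be sketched against an in-memory list of lines instead of the mailbox file. This is only one possible arrangement, not the required solution: the helper name `average_confidence`, the `sample_lines` data, and the choice to return `None` when no header is found are all assumptions made for illustration.

```python
def average_confidence(lines):
    """Return the mean X-DSPAM-Confidence value in lines, or None if there are none.

    Hypothetical helper for illustration; not part of the lab's starter code.
    """
    prefix = "X-DSPAM-Confidence:"
    line_count = 0
    total_confidence = 0.0
    for line in lines:
        if line.startswith(prefix):
            # Remove the header name, then convert what is left to a float.
            confidence = float(line.replace(prefix, ""))
            line_count += 1
            total_confidence += confidence
    if line_count == 0:
        # Avoid dividing by zero when no matching lines were found.
        return None
    return total_confidence / line_count


# Made-up sample lines standing in for the contents of a mailbox file.
sample_lines = [
    "From: a@example.com\n",
    "X-DSPAM-Confidence: 0.8475\n",
    "Subject: hello\n",
    "X-DSPAM-Confidence: 0.6178\n",
]

average = average_confidence(sample_lines)
print(average)
```

To use the real file, the list could be replaced by the lines read inside a `with open(filename, 'r') as f:` block.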
In [3]:
## TODO: Write program here:
filename = 'CCL-mbox-short.txt'
Please answer the following questions. This should be a personal narrative, in your own voice. Answer each question by double-clicking on it and placing your answer next to the Answer: prompt.
Answer:
Answer:
Answer:
1 ==> I can do this on my own and explain how to do it.
2 ==> I can do this on my own without any help.
3 ==> I can do this with help or guidance from others. If you choose this level please list those who helped you.
4 ==> I don't understand this at all yet and need extra help. If you choose this please try to articulate that which you do not understand.
Answer:
In [ ]:
# SAVE YOUR WORK FIRST! CTRL+S
# RUN THIS CODE CELL TO TURN IN YOUR WORK!
from ist256.submission import Submission
Submission().submit()