In-Class Coding Lab: Files

The goals of this lab are to help you to understand:

  • Reading data from a file all at once or one line at a time.
  • Searching for data in files.
  • Parsing text data to numerical data.
  • How to build complex programs incrementally.

Average Spam Confidence

For this lab, we will write a program to read spam confidence headers from a mailbox file like CCL-mbox-tiny.txt or CCL-mbox-short.txt. These files contain raw email data, and each message in that data includes a SPAM confidence number:

X-DSPAM-Confidence: 0.8475

Our goal is to find each of these lines in the file and extract the confidence number (in this case 0.8475), with the end goal of calculating the average SPAM confidence of all the emails in the file.

Reading from the file

Let's start with some code to read the lines of text from CCL-mbox-tiny.txt:


In [ ]:
filename = "CCL-mbox-tiny.txt"
with open(filename, 'r') as f:
    for line in f.readlines():
        print(line.strip())
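
For comparison, here is a minimal sketch of the two reading styles named in the lab goals: reading a file all at once, and reading it one line at a time. So that it runs anywhere, it writes its own small temporary file first. The sample text and the names sample_text, sample_filename, contents, and lines are made up for illustration and are not part of the lab files.

```python
import os
import tempfile

# Hypothetical sample data standing in for a mailbox file.
sample_text = "From: a@example.com\nSubject: hello\nX-DSPAM-Confidence: 0.8475\n"

# Write the sample to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample_text)
    sample_filename = tmp.name

# Style 1: read the whole file into a single string.
with open(sample_filename, "r") as f:
    contents = f.read()

# Style 2: loop over the file object to get one line at a time.
lines = []
with open(sample_filename, "r") as f:
    for line in f:
        lines.append(line.strip())  # strip() removes the trailing newline

# Clean up the temporary file.
os.remove(sample_filename)

print(contents)
print(lines)
```

Reading all at once is convenient for small files. Looping line by line is usually the better fit when you want to inspect each line separately, as we do in this lab.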

Now Try It

Now modify the code above to print the number of lines in the file, instead of printing the lines themselves. You'll need to increment a variable each time through the loop and then print it out afterwards.

There should be 332 lines.
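
If the counting pattern is new to you, here is a small sketch of it on a made-up list of strings rather than a file. The list sample_lines is a hypothetical stand-in; your version should loop over the lines of the file instead.

```python
# Hypothetical stand-in for the lines of a file.
sample_lines = ["first line\n", "second line\n", "third line\n"]

count = 0                  # start the counter before the loop
for line in sample_lines:
    count = count + 1      # add one each time through the loop

print(count)               # print once, after the loop has finished
```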


In [ ]:
# TODO Write code to not print the lines but count the number of lines!

Finding the SPAM Confidence

Next, we'll focus on getting only the lines that contain X-DSPAM-Confidence:. We do this by including an if statement inside the for loop. This is a very common pattern in computing, used to search through massive amounts of data.

Edit line 4 of the code below so that it only prints lines which begin with X-DSPAM-Confidence:. You know it is working if your output contains only 5 rows.


In [ ]:
filename = "CCL-mbox-tiny.txt"
with open(filename, 'r') as f:
    for line in f.readlines():
        if TODO: 
            print(line.strip())
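
As a hint, here is the same search pattern applied to a different header. This sketch assumes the built-in string method startswith() is a reasonable way to test how a line begins. The list sample_lines and the Subject: header are made up for illustration.

```python
# Hypothetical stand-in for the lines of a mailbox file.
sample_lines = [
    "From: a@example.com\n",
    "Subject: meeting notes\n",
    "X-DSPAM-Confidence: 0.8475\n",
    "Subject: lunch?\n",
]

matches = []
for line in sample_lines:
    # Keep only the lines that begin with the text we are searching for.
    if line.startswith("Subject:"):
        matches.append(line.strip())

print(matches)
```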

Parsing out the confidence value

The final step is to parse the confidence value out of the string. For example, given the line X-DSPAM-Confidence: 0.8475, we need to get the value 0.8475 as a float.

The strategy here is to replace X-DSPAM-Confidence: with an empty string, then call the float() function to convert the result to a float.
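
Here is a sketch of that replace-then-convert strategy on a made-up line, so you can see each step before trying it yourself. The Temperature: label and the variable names are hypothetical. Note that float() ignores leading and trailing whitespace, so the leftover space and newline do not cause an error.

```python
text = "Temperature: 72.5\n"

# Step 1: replace the label with an empty string, leaving " 72.5\n".
value_text = text.replace("Temperature:", "")

# Step 2: convert what is left to a float.
value = float(value_text)

print(value)
print(type(value))
```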

Now Try It

Write code to parse the value 0.8475 from the text string 'X-DSPAM-Confidence: 0.8475'.


In [ ]:
line = 'X-DSPAM-Confidence: 0.8475'
number =  #TODO remove 'X-DSPAM-Confidence:' , then convert to a float.
print(number)

Putting it all together

Now that we have all the working parts, let's put it all together.

0.  use the file named 'CCL-mbox-short.txt'
1.  line count is 0
2.  total confidence is 0
3.  open mailbox file
4.  for each line in file
5.      if line starts with `X-DSPAM-Confidence:`
6.          remove `X-DSPAM-Confidence:` from line and convert to float
7.          increment line count
8.          add spam confidence to total confidence
9.  print average confidence (total confidence/line count)

In [3]:
## TODO: Write program here:
filename = 'CCL-mbox-short.txt'

Question

How do you know this is right? How can you verify it's right? HINT: You might not be able to hand-calculate the average spam confidence for CCL-mbox-short.txt, but you can do that with CCL-mbox-tiny.txt, right?
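
One way to approach this question is to run your logic on a handful of lines where you already know the answer. The sketch below is only an illustration of that idea: the function average_confidence, the list sample_lines, and the confidence values in it are made up, and it works on an in-memory list rather than on the lab's files. It also assumes that returning None is an acceptable way to handle input with no confidence lines.

```python
import math

def average_confidence(lines):
    """Return the average of the X-DSPAM-Confidence values found in lines."""
    total = 0.0
    count = 0
    for line in lines:
        if line.startswith("X-DSPAM-Confidence:"):
            total += float(line.replace("X-DSPAM-Confidence:", ""))
            count += 1
    if count == 0:
        return None  # no confidence lines found, so there is no average
    return total / count

# Made-up lines with values that are easy to average by hand.
sample_lines = [
    "From: a@example.com\n",
    "X-DSPAM-Confidence: 0.5\n",
    "Subject: hello\n",
    "X-DSPAM-Confidence: 0.75\n",
    "X-DSPAM-Confidence: 1.0\n",
]

expected = (0.5 + 0.75 + 1.0) / 3   # hand calculation
result = average_confidence(sample_lines)

# Compare floats with a tolerance rather than with ==.
print(math.isclose(result, expected))
```

If the program agrees with your hand calculation on a small input, you have good reason to trust it on a larger file that you cannot check by hand.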

Metacognition

Please answer the following questions. This should be a personal narrative, in your own voice. Answer each question by double-clicking on it and placing your answer next to the Answer: prompt.

Questions

  1. Record any questions you have about this lab that you would like to ask in recitation. It is expected you will have questions if you did not complete the code sections correctly. Learning how to articulate what you do not understand is an important skill of critical thinking.

Answer:

  2. What was the most difficult aspect of completing this lab? Least difficult?

Answer:

  3. What aspects of this lab do you find most valuable? Least valuable?

Answer:

  4. Rate your comfort level with this week's material so far.

1 ==> I can do this on my own and explain how to do it.
2 ==> I can do this on my own without any help.
3 ==> I can do this with help or guidance from others. If you choose this level please list those who helped you.
4 ==> I don't understand this at all yet and need extra help. If you choose this please try to articulate that which you do not understand.

Answer:


In [ ]:
# SAVE YOUR WORK FIRST! CTRL+S
# RUN THIS CODE CELL TO TURN IN YOUR WORK!
from ist256.submission import Submission
Submission().submit()