Files

Opening Files

To read a file, you must first open it. Opening returns a file handle, which you can then use to get the contents of the file. If the file doesn't exist, open raises an error (a FileNotFoundError).

file_handle = open('filename.txt')


Once you are done with a file, you need to close it. Leaving files open can cause problems: data you wrote may not be flushed to disk, and on locking filesystems other programs may be blocked from using the file.

file_handle.close()
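
A minimal sketch of the open/close cycle, assuming a throwaway file created in a temporary directory (the name demo.txt is made up for illustration, and the file is created with write mode, which is covered later in this lesson). The closed attribute of a file handle reports whether close has been called on it.

```python
import os
import tempfile

# Hypothetical scratch file so the example runs anywhere
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
fh = open(demo_path, "w")
fh.write("hello\n")
fh.close()

file_handle = open(demo_path)          # read mode is the default
state_while_open = file_handle.closed  # False: the file is still open
file_handle.close()
state_after_close = file_handle.closed # True: the handle has been closed

print(state_while_open, state_after_close)
```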

In [ ]:
# Run these to get some of the files we will be using today. 
# These are the salaries of public workers in California from the website transparentcalifornia
# The last line is downloading a short story for the project

import urllib.request, urllib.parse, urllib.error
urllib.request.urlretrieve("http://transparentcalifornia.com/export/san-francisco-2014.csv", "san-francisco-2014.csv")
urllib.request.urlretrieve("http://transparentcalifornia.com/export/san-francisco-2013.csv", "san-francisco-2013.csv")
urllib.request.urlretrieve("http://www.gutenberg.org/cache/epub/1952/pg1952.txt", "theyellowwallpaper.txt")

In [ ]:
# Opening a file
fh = open('san-francisco-2014.csv')
print(fh)
fh.close()

In [ ]:
# Opening a non-existent file
fh = open('i_dont_exist.txt')
print(fh)
fh.close()

TRY IT

Open and close the san-francisco-2013.csv file


In [ ]:

Text files and lines

A text file is just a sequence of lines. In fact, if you read it in all at once with readlines, you get back a list of strings, one per line.

Each line is separated by the new line character "\n". This is the special character that is inserted into text files when you hit enter (or you can deliberately put it into strings by using the special \n syntax).


In [ ]:
print("Golden\nGate\nBridge")
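
To make the newline character visible, the sketch below prints the same string with repr, splits it on "\n", and strips a trailing newline. The variable names are only for illustration.

```python
text = "Golden\nGate\nBridge"

# repr shows the escape sequences instead of acting on them
print(repr(text))

# Splitting on the newline character recovers the individual lines
parts = text.split("\n")
print(parts)

# rstrip removes a trailing newline (and any other trailing whitespace)
stripped = "Bridge\n".rstrip()
print(repr(stripped))
```

This is why the reading examples in this lesson call rstrip before printing: each line read from a file still ends with its newline character.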

TRY IT

Print your name on two lines using only one print statement


In [ ]:

Reading from files

There are two common ways to read through a file. The first (and usually better) way is to loop through the lines in the file.

for line in file_handle:
    print(line)

The second is to read all the lines at once and store them as a string or a list.

lines = file_handle.read() # stores as a single string
lines = file_handle.readlines() # stores as a list of strings (separates on new lines)

Unless you are going to process the lines in a file several times, use the first method. It uses far less memory because it does not load the whole file at once, which matters for big files.


In [ ]:
fh = open('thingstodo.txt')
for line in fh:
    print(line.rstrip())
fh.close()

In [ ]:
fh = open('thingstodo.txt')
contents = fh.read()
fh.close()

print(contents)
print(type(contents))

fh = open('thingstodo.txt')
lines = fh.readlines()
fh.close()

print(lines)
print(type(lines))
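
One consequence worth seeing: a file handle is consumed as you read it, so a second read on the same handle returns nothing unless you reopen the file or rewind it with seek(0). A small sketch, assuming a scratch file with made-up contents in a temporary directory:

```python
import os
import tempfile

# Hypothetical scratch file so the example runs anywhere
path = os.path.join(tempfile.mkdtemp(), "scratch.txt")
fh = open(path, "w")
fh.write("Golden\nGate\nBridge\n")
fh.close()

fh = open(path)
first_pass = fh.readlines()    # reads every line
second_pass = fh.readlines()   # nothing left: the handle is at the end
fh.seek(0)                     # rewind to the start of the file
third_pass = fh.read()         # the whole file again, as one string
fh.close()

print(first_pass)
print(second_pass)
print(repr(third_pass))
```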

TRY IT

Open 'san-francisco-2013.csv' and print out the first line. You can use either method. If you are using the loop method, you can 'break' after printing the first line.


In [ ]:

Searching through a file

When searching through a file, you can use string methods to discover and parse the contents.

Let's look at a few examples


In [ ]:
# Looking for a line that starts with something

# I want to see salary data of women with my first name
fh = open('san-francisco-2014.csv')
for line in fh:
    if line.startswith('Charlotte'):
        print(line)
fh.close()

In [ ]:
# Looking for lines that contain a specific string
fh = open('san-francisco-2014.csv')
# Looking for all the department heads
for line in fh:
    # Remember if find doesn't find the string, it returns -1
    if line.find('Dept Head') != -1:
        print(line)
fh.close()

In [ ]:
# Counting lines that match criteria
fh = open('san-francisco-2014.csv')
num_trainees = 0
for line in fh:
    # Remember if find doesn't find the string, it returns -1
    if line.find('Trainee') != -1:
        num_trainees += 1
fh.close()
print("There are {0} trainees".format(num_trainees))
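
The in keyword from the strings lesson is an alternative to comparing find against -1. The sketch below runs the same count on a few made-up lines (not real rows from the salary file), so it works without the download.

```python
# Made-up sample lines standing in for rows of the salary file
sample_lines = [
    "Alex Example,Transit Operator,50000\n",
    "Sam Sample,Trainee,30000\n",
    "Pat Placeholder,Dept Head,90000\n",
]

num_trainees = 0
for line in sample_lines:
    if 'Trainee' in line:   # same test as line.find('Trainee') != -1
        num_trainees += 1
print("There are {0} trainees".format(num_trainees))
```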

In [ ]:
# Splitting lines, this is great for excel like data (tsv, csv)
# I want to see salary data of women with my name
fh = open('san-francisco-2014.csv')
for line in fh:
    if line.startswith('Emily'):
        cols = line.split(',')
        print(cols)
        # Title is the 2nd column and salary the 3rd; cols[-1] is the last column
        print(cols[1], cols[2], cols[-1])
fh.close()

* Note that sometimes you get a fragment of a quoted cell instead of the title and salary. If a cell in a csv file contains a comma, that cell is wrapped in quotes, and splitting on commas cuts it into pieces. Thus, splitting is not the proper way to read a csv file, but it will work in a pinch. We'll learn about the csv module, as well as other ways to read in tabular (Excel-like) data, in the second half of the class.
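
The quoting problem can be seen on a single made-up row (the name, title, and salary are invented for illustration). The sketch uses the standard csv module only as a preview; io.StringIO is used here so that the string can be read like a file.

```python
import csv
import io

# Made-up row: the job title contains a comma, so the cell is quoted
raw = 'Alex Example,"Manager, Transit",100000\n'

# Splitting on commas cuts the quoted title into two pieces
naive = raw.rstrip().split(',')
print(naive)

# csv.reader understands the quotes and keeps the cell together
rows = list(csv.reader(io.StringIO(raw)))
parsed = rows[0]
print(parsed)
```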


In [ ]:
# Skipping lines
fh = open('thingstodo.txt')
for line in fh:
    if line.startswith('Golden'):
        continue
    print(line)
fh.close()

Try, except with open

If you are worried that the file might not exist, you can wrap the call to open in a try block and handle the error yourself.

try:
    fh = open('i_dont_exist.txt')
except FileNotFoundError:
    print("File does not exist")
    exit()

In [ ]:
# Opening a non-existent file
try:
    fh = open('i_dont_exist.txt')
    print(fh)
    fh.close()
except FileNotFoundError:
    print("File does not exist")
    #exit()
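
Catching FileNotFoundError by name, rather than using a bare except, means that unrelated mistakes (a misspelled variable, for example) still show up as errors instead of being reported as a missing file. A minimal sketch, assuming a path inside a fresh temporary directory so the file is guaranteed not to exist:

```python
import os
import tempfile

# A path in a brand-new empty directory, so the file cannot exist
filename = os.path.join(tempfile.mkdtemp(), 'i_dont_exist.txt')

opened = False
message = ""
try:
    fh = open(filename)
    opened = True
    fh.close()
except FileNotFoundError as err:
    # err.filename holds the path that could not be opened
    message = "File does not exist: {0}".format(err.filename)
    print(message)
```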

Writing to files

You can write to files very easily. You need to give open a second parameter 'w' to indicate you want to open the file in write mode.

 fh_write = open('new_file.txt', 'w')

Then you call the write method on the file handle. You give it the string you want to write to the file. Be careful, write doesn't add a new line character to the end of strings like print does.

 fh_write.write('line to write\n')

Just like reading files, you need to close your file when you are done.

 fh_write.close()

In [ ]:
fh = open('numbers.txt', 'w')
for i in range(10):
    fh.write(str(i) + '\n')
fh.close()

# Now let's prove that we actually made a file
fh = open('numbers.txt')
lines = fh.readlines()
print(lines)
fh.close()
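
One thing to watch for: opening an existing file with 'w' discards whatever it contained. The sketch below shows this on a scratch file (the name modes.txt is made up), and also shows append mode 'a', which is not otherwise covered in this lesson and keeps the existing contents.

```python
import os
import tempfile

# Hypothetical scratch file so the example runs anywhere
path = os.path.join(tempfile.mkdtemp(), "modes.txt")

fh = open(path, 'w')
fh.write("first\n")
fh.close()

fh = open(path, 'w')   # 'w' again: the old contents are discarded
fh.write("second\n")
fh.close()

fh = open(path, 'a')   # 'a' keeps what is there and adds to the end
fh.write("third\n")
fh.close()

fh = open(path)
contents = fh.read()
fh.close()
print(contents)
```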

TRY IT

Create a file called 'my_favorite_cities.txt' and put your top 3 favorite cities each on its own line.

Bonus: check that you did it correctly by reading the lines in Python.


In [ ]:

With statement and opening files

You can use with to open a file, and it will automatically close the file at the end of the with block. This is the preferred way to open files in Python. (Sorry it took me so long to show you.)

with open('filename.txt') as file_handle:
    for line in file_handle:
        print(line)

# You don't have to close the file

In [ ]:
with open('thingstodo.txt') as fh:
    for line in fh:
        print(line.rstrip())

You can also use with statements to write files


In [ ]:
with open('numbers2.txt', 'w') as fh:
    for i in range(5):
        fh.write(str(i) + '\n')
with open('numbers2.txt') as fh:
    for line in fh:
        print(line.rstrip())
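
The file is closed when the with block ends, and that holds even if the block ends because of an error. A small sketch, assuming a scratch file in a temporary directory (the name with_demo.txt is made up); the closed attribute is checked after each block.

```python
import os
import tempfile

# Hypothetical scratch file so the example runs anywhere
path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")

with open(path, 'w') as fh:
    fh.write("Golden Gate\n")
closed_after_block = fh.closed   # True: closed at the end of the block

try:
    with open(path) as fh2:
        int(fh2.readline())      # raises ValueError: the line is not a number
except ValueError:
    pass
closed_after_error = fh2.closed  # True: closed even though an error occurred

print(closed_after_block, closed_after_error)
```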

TRY IT

Refactor this code to use a with statement:

# Counting lines that match criteria
fh = open('san-francisco-2014.csv')
num_trainees = 0
for line in fh:
    # Remember if find doesn't find the string, it returns -1
    if line.find('Trainee') != -1:
        num_trainees += 1
fh.close()
print("There are {0} trainees".format(num_trainees))

In [ ]:

Project

We will calculate the average length of the first word in sentences in the short story "The Yellow Wallpaper" by Charlotte Perkins Gilman. (Feel free to use a different story. Project Gutenberg has many free ones: https://www.gutenberg.org/) This method works because, in the text file, each sentence is on a separate line. If you are using another story, you may want to go by paragraph instead, or you can try splitting sentences on punctuation.

  1. Open the file in read mode using a with statement
  2. Initialize two variables sum and count to the value of 0
  3. Loop through each line. If the first character of the line is a capital letter (Check the strings lesson for the in keyword):
    • Add 1 to count
    • Split the line on spaces and find the length of the first word. Add this length to sum.
  4. Calculate the average length of first words of sentences using the sum and count variables (be careful about integer division: in Python 3, / gives a float, while // truncates to an integer).
  5. Open a new file 'ave_first_word_length.txt' in write mode using a with statement
  6. Write the title of the story on the first line and the average first word length on the second line.
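
As a starting point for steps 3 and 4, here is one way the check and the split might look on a single made-up line (not taken from the story). The names capitals, first_word_length, total, and average are only for illustration; total is used in place of sum to avoid hiding Python's built-in sum function. Looping over the file and writing the result are left to you.

```python
line = "Golden Gate Bridge opened in 1937.\n"   # made-up sample line
capitals = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

first_word_length = None
if line[0] in capitals:
    # Split on spaces and measure the first piece
    first_word = line.split(' ')[0]
    first_word_length = len(first_word)
print(first_word_length)

# Division check with made-up totals: / keeps the fraction, // drops it
total = 10
count = 4
average = total / count
truncated = total // count
print(average, truncated)
```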

In [ ]: