Reading and writing files in Python

Goals for this assignment

Most of the data that you will use will come in a file of some sort. In fact, we've used data from files at various points during this class, but have mostly glossed over how we actually work with those files. In this assignment, and in class tomorrow, we're going to work with some of the common types of files using standard Python (and numpy) methods.

Your name

// put your name here!

Standard file types

Text files: This is a broad category of file that contains information stored as plain text. That data could be numerical, strings, or some combination of them. It's typical for a text file to have columns of data and have a format that looks like this:

# Columns:
#  1. time
#  2. height
#  3. distance
0.317  0.227  0.197
0.613  0.432  2.715
1.305  0.917  5.613

where the rows starting with a pound sign (#) are meant to be comments and the ones following that have data that's broken up into columns by spaces or tabs. Text files often have the ".txt" file extension. The primary advantage of plain text files is their simplicity and portability; the disadvantages are that (1) there is no standard format for data in these files, so you have to look at each file and figure out exactly how it is structured; (2) it uses disk space very inefficiently, (3) it's difficult to store "non-rectangular" datasets (i.e., datasets where each row has a different number of columns, or missing data). Despite the disadvantages, however, their convenience means that they are useful for many purposes!

Comma-separated value files: This sort of file, also known as a "CSV" file, is a specific type of text file where numbers and strings are stored as a table of data. Each line in the file is an individual "record" that's analogous to a row in a spreadsheet, and each entry within that row is a "field", separated by commas. This type of data file often has the ".csv" file extension. The data above might be stored in a CSV file in the following way:

"height","time","distance"
0.317,0.227,0.197
0.613,0.432,2.715
1.305,0.917,5.613

CSV files share many of the same advantages and disadvantages of plain text files. One significant advantage, however, is that there is something closer to a standard file format for CSV files, and it's easier to deal with missing data/non-rectangular data.

Binary files: This encompasses a wide range of file types that store data in a much more compact format, and allow much more complex data types to be stored. Many of the files that you use - for example, any file that contains audio, video, or image data - are binary files, and Numpy also has its own binary file format. The internal format of these files can vary tremendously, but in general binary files have an extention that tells you what type of format it is (for example, .mp3, .mov, or .jpg), and standard tools exist to read and write those files.

Part 1: Reading and writing text files

The Python tutorial has a useful section on reading and writing text files. In short, you open a file using the open() method:

myfile = open('filename.txt','r')

where the first argument is a string containing the file name (you can use a variable if you want), and the second is a string that explains how a file will be used. 'r' means "read-only"; 'w' means "write-only"; 'r+' means "read and write". If you do not put any argument, the file is assumed to be read-only.

You can read the entire file by:

myfile.read()

or a single line with:

myfile.readline()

Each line of the file is returned as a string, which you can manipulate as you would any other string (by splitting it into a list, converting to a new file type, etc.)

You can also loop over the file and operate on each line sequentally:

for line in myfile:
    print(line)

Note: if you get an extra line in between each printed-out line, that's because there is already a newline ('\n') character at the end of the line in the file. You can make this go away by using this syntax:

print(line,end='')

Also, if you want to get all of the lines of a file in a list you can use list(myfile) or myfile.readlines()

You can write to a file by creating a string and writing it. For example:

string = str(4.0) + 'abcde'
myfile.write(string)

Once you're done with a file, you can close it by using the close() method:

myfile.close()

You can also use the seek() function to move back and forth within a file; see the Python tutorial for more information on this.

You may also find yourself using the Numpy loadtxt() and savetxt() methods to read and write text-based files; we'll talk about those in the next section.

In the space below: write a piece of code that creates a file named your_last_name.txt (with your actual last name), and then write several sentences of information to it with some facts about yourself that end in a newline ('\n'). (Note that opening a file in mode 'w' will create the file!) Look at the file with a text editor (TextEdit in Mac; WordPad or NotePad in Windows) and see what it looks like! Then, read the file in line-by-line, split each line into a list of strings, and print out the first item in the list. You should get the first word in each sentence back!


In [ ]:
# put your code here!

Part 2: Reading and writing CSV files

You're likely to use two primary means of reading and writing CSV files in Python: the Python CSV module and the Numpy loadtxt() and savetxt().

The Python CSV module has reader() and writer() methods that let you read and write to a file with lists. Once you open a file (as you would a normal text file in Python), you have to tell Python that it's a CSV file using csv.reader(), which returns a reader object that you can then iterate over to get rows in the file in the same way you do with lists. In fact, each line of a CSV file is returned as a list. So, to read and print out individual lines of a CSV file, you would do the following:

import csv
csvfile = open('my_file.csv','r')
csvreader = csv.reader(csvfile,delimiter=',')

for row in csvreader:
    print(row)

csvfile.close()

Writing is similarly straightforward. You open the file, use the writer() method to create a file object, and then you can write individual rows with the writerow() method. The writerow() method taks in a list and writes it as a single row in the file. For example, the following code creates two arrays and then writes them into a CSV file:

import csv

x = y = np.arange(0,10,1)  # create a couple of arrays to work with

csvfile = open('my_written_file.csv','w',newline='')
csvwriter = csv.writer(csvfile,delimiter=',')

for i in range(x.size):
    csvwriter.writerow([x[i], y[i]])

csvfile.close()

Numpy's loadtxt() and savetxt() methods do similar things as the CSV reader() and writer() methods, but directly work with arrays. In particular, you could load the two arrays written via the small piece of code directly above this into two numpy arrays as follows:

import numpy as np
xnew, ynew = np.loadtxt('my_written_file.csv',delimiter=',',unpack=True)

or you can read the two arrays into a single numpy array:

import numpy as np
combined_arrays = np.loadtxt('my_written_file.csv',delimiter=',',unpack=True)

where xnew can be accessed as combined_arrays[0] and ynew is combined_arrays[1].

In the space below: write a piece of code that creates two numpy arrays - an array of approximately 100 x-values that are evenly spaced between -10 and 10, and an array of y-values that are equal to sin(x). Use the Python CSV writer to write those to a CSV file with a name of your choosing, and then use numpy to load them into two arrays (with different names) and plot them using pyplot's plot() method. Does it look like a sine wave?


In [ ]:
# put your code here!

Part 3: Reading and writing binary files

There are an astounding number of ways to read and write binary files. For our purposes, we can use the numpy save() and savez() methods to write one or multiple numpy arrays into a binary file, respectively, and the load() method to load them. For example, if you have arrays named arr1, arr2, and arr3, you would write them to a file as follows:

np.savez('mydatafile',arr1,arr2,arr3)

and if you wanted to save them with specific names - say "first_array", "second_array", and "third_array", you would do so using this syntax:

np.savez('mydatafile',first_array=arr1,second_array=arr2,third_array=arr3)

In both circumstances, you would get a file named mydatafile.npz unless you manually put an extension different than .npz on the file.

You would then read in the data files using the Numpy load() method:

all_data = np.load('mydatafile.npz')

all_data then contains all of the array information. To find out what arrays are in there, you can print all_data.files:

print(all_data.files)

and access an individual array by name:

array_one = all_data['first_array']

In the space below: redo what you did in Part 2, but using the numpy savez() and load() methods and thus using a Numpy array file instead of a CSV file. When saving the arrays, make sure to give them descriptive names. If you plot the x and y values, do you still get a sine wave?


In [ ]:
# put your code here!

Feedback

Do you have any remaining questions after going through the activities described above?

// put your answer here!

You're all done!

Submit this assignment to the "Day 19" pre-class assignment dropbox on Desire2Learn.