An introduction to solving biological problems with Python

Session 2.3: Files

Data input and output (I/O)

So far, all the data we have been working with has been written by us into our scripts, and the results of our computations have just been displayed in the terminal output. In the real world, data will be supplied by the user of our programs (who may be you!) by some means, and we will often want to save the results of an analysis somewhere more permanent than the screen. In this session we cover reading data into our programs from files on disk, and writing data out to files.

There are, of course, many other ways of accessing data, such as querying a database or retrieving data over a network such as the internet. We don't cover these here, but Python has excellent support for interacting with databases and networks, either in the standard library or through external modules.

Using files

Frequently the data we want to operate on or analyse will be stored in files, so in our programs we need to be able to open files, read through them (perhaps all at once, perhaps not), and then close them.

We will also frequently want to be able to print output to files rather than always printing out results to the terminal.

Python supports all of these modes of operation on files, and provides a number of useful functions and syntax to make dealing with files straightforward.

Opening files

To open a file, Python provides the open function, which takes a filename as its first argument and returns a file object, which is Python's internal representation of the file.


In [ ]:
path = "data/datafile.txt"
fileObj = open( path )

open takes an optional second argument specifying the mode in which the file is opened, either for reading, writing or appending.

It defaults to 'r' which means open for reading in text mode. Other common values are 'w' for writing (truncating the file if it already exists) and 'a' for appending.


In [ ]:
open( "data/myfile.txt", "r" ) # open for reading, default

In [ ]:
open( "data/myfile.txt", "w" ) # open for writing (existing files will be overwritten)

In [ ]:
open( "data/myfile.txt", "a" ) # open for appending
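The difference between 'w' and 'a' is easiest to see by trying both on the same file. The sketch below is only an illustration: it works in a temporary directory, the file name modes_demo.txt is made up, and it uses the .write, .read and .close methods that are covered later in this session.

```python
import os
import tempfile

# Work in a temporary directory so no real data is touched
# (the file name "modes_demo.txt" is made up for this example).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "modes_demo.txt")

f = open(path, "w")      # creates the file (or empties an existing one)
f.write("first\n")
f.close()

f = open(path, "a")      # keeps the existing content and adds to the end
f.write("second\n")
f.close()

f = open(path, "r")
after_append = f.read()
f.close()

f = open(path, "w")      # opening with "w" again discards what was there
f.write("third\n")
f.close()

f = open(path)           # the mode defaults to "r"
after_rewrite = f.read()
f.close()

print(repr(after_append))    # 'first\nsecond\n'
print(repr(after_rewrite))   # 'third\n'
```

After the second open with 'w', the lines "first" and "second" are gone: only what was written since the file was last opened for writing remains.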

Closing files

To close a file once you have finished with it, call the .close method on the file object.


In [ ]:
fileObj.close()
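If you are unsure whether a file object is still open, its .closed attribute will tell you. The following sketch (using a made-up file name in a temporary directory) also shows what happens if you try to use a file object after closing it.

```python
import os
import tempfile

# Create a small throwaway file to open (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "close_demo.txt")

fileObj = open(path, "w")
print(fileObj.closed)    # False: the file is still open
fileObj.close()
print(fileObj.closed)    # True: the file has been closed

# Using a closed file object raises a ValueError
try:
    fileObj.write("too late\n")
    error_message = None
except ValueError as err:
    error_message = str(err)

print("Error:", error_message)
```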

Mode modifiers

Mode strings can also include modifier characters, which matter when working with non-text files or with files shared across platforms.

'b': binary mode, e.g. 'rb'. Data is read and written as raw bytes, with no translation of end-of-line characters to the platform-specific convention.

| Character | Meaning |
|-----------|---------|
| 'r' | open for reading (default) |
| 'w' | open for writing, truncating the file first |
| 'x' | open for exclusive creation, failing if the file already exists |
| 'a' | open for writing, appending to the end of the file if it exists |
| 'b' | binary mode |
| 't' | text mode (default) |
| '+' | open a disk file for updating (reading and writing) |
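As a rough illustration of text versus binary mode, the sketch below writes a few bytes to a throwaway file (the name binary_demo.txt is made up) and reads them back in both modes. In text mode you get a Python string; in binary mode you get a bytes object.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "binary_demo.txt")

# Write raw bytes so the file content is the same on every platform
fh = open(path, "wb")
fh.write(b"ACGT\n")
fh.close()

fh = open(path, "r")     # text mode, the same as "rt"
text_data = fh.read()
fh.close()

fh = open(path, "rb")    # binary mode
raw_data = fh.read()
fh.close()

print(type(text_data), repr(text_data))
print(type(raw_data), repr(raw_data))
```

For the plain text files used in this course the default text mode is what you want; binary mode is mainly useful for images, compressed files and other non-text data.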

Reading from files

Once we have opened a file for reading, file objects provide a number of methods for accessing the data in a file. The simplest of these is the .read method that reads the entire contents of the file into a string variable.


In [ ]:
fileObj = open( "data/datafile.txt" )
print(fileObj.read()) # everything
fileObj.close()

Note that this means the entire file will be read into memory. If you are operating on a large file and don't actually need all the data at the same time this is rather inefficient.

Frequently, we just need to operate on individual lines of the file, and you can use the .readline method to read a single line from a file and return it as a Python string.

File objects internally keep track of your current location in a file, so to get following lines from the file you can call this method multiple times.

It is important to note that the string representing each line will have a trailing newline "\n" character, which you may want to remove with the .rstrip string method.

Once the end of the file is reached, .readline will return an empty string ''. This is different from an apparently empty line in the file, as even an empty line contains a newline character. Recall that the empty string is treated as False in Python, so you can readily check for this condition with an if statement or a while loop.


In [ ]:
# one line at a time
fileObj = open( "data/datafile.txt" )
print("1st line:", fileObj.readline())
print("2nd line:", fileObj.readline())
print("3rd line:", fileObj.readline())
print("4th line:", fileObj.readline())
fileObj.close()
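The end-of-file check described above can be sketched with a while loop. This example first creates its own small file (a made-up name in a temporary directory) containing a blank line, to show that a blank line does not stop the loop but the end of the file does.

```python
import os
import tempfile

# Create a sample file with a blank line in the middle
path = os.path.join(tempfile.mkdtemp(), "readline_demo.txt")
fileObj = open(path, "w")
fileObj.write("ATGC\n\nGGCC\n")
fileObj.close()

fileObj = open(path)
lines_seen = []
line = fileObj.readline()
while line:    # '' (end of file) counts as False; "\n" (a blank line) counts as True
    lines_seen.append(line.rstrip("\n"))
    line = fileObj.readline()
fileObj.close()

print(lines_seen)    # ['ATGC', '', 'GGCC']
```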

To read in all lines from a file as a list of strings containing the data from each line, use the .readlines method (though note that this will again read all data into memory).


In [ ]:
# all lines
fileObj = open( "data/datafile.txt" )

lines = fileObj.readlines()

print("The file has", len(lines), "lines")

fileObj.close()

Looping over the lines in a file is a very common operation, and Python lets you iterate over a file using a for loop just as if it were a list of strings. This does not read all the data into memory at once, and so is much more memory-efficient than reading the file with .readlines and then looping over the resulting list.


In [ ]:
# as an iterable
fileObj = open( "data/datafile.txt" )

for line in fileObj:
    print(line.rstrip().upper())

fileObj.close()

The with statement

It is important that files are closed when they are no longer required, but writing fileObj.close() is tedious (and more importantly, easy to forget). An alternative syntax is to open the files within a with statement, in which case the file will automatically be closed at the end of the with block.


In [ ]:
# fileObj will be closed when leaving the block
with open( "data/datafile.txt" ) as fileObj:
    for ( i, line ) in enumerate( fileObj, start = 1 ):
        print( i, line.strip() )
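A further advantage of the with statement is that the file is closed even if something goes wrong inside the block. The sketch below is an illustration only: the file name and the "unexpected base" check are invented for the example.

```python
import os
import tempfile

# Create a small sample file to read back
path = os.path.join(tempfile.mkdtemp(), "with_demo.txt")
with open(path, "w") as out:
    out.write("ACGT\nNNNN\n")

try:
    with open(path) as fileObj:
        for line in fileObj:
            if "N" in line:
                # Leave the with block early by raising an error
                raise ValueError("unexpected base in: " + line.strip())
except ValueError as err:
    print("Stopped early:", err)

# The file was closed on the way out of the with block
print(fileObj.closed)    # True
```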

Exercises 2.3.1

Write a script that reads a file containing many lines of nucleotide sequence. For each line in the file, print out the line number, the length of the sequence and the sequence itself (an example file is provided as data/dna.txt in the course materials).

Writing to files

Once a file has been opened for writing, you can use the .write() method on a file object to write data to the file.

The argument to the .write() method must be a string, so if you want to write out numerical data to a file you will have to convert it to a string somehow beforehand.

**Remember** to include a newline character `\n` to separate the lines of your output: unlike the `print()` function, `.write()` does not add one by default.
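As a small illustration of the string requirement, the sketch below tries to write a float directly, then writes it properly using `str()` and an f-string. The value and the file name are made up for the example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "numbers_demo.txt")
gc_content = 0.4567    # made-up value for illustration

with open(path, "w") as output:
    try:
        output.write(gc_content)              # a float is not a string
    except TypeError as err:
        print("Cannot write a float directly:", err)

    output.write(str(gc_content) + "\n")      # convert with str()
    output.write(f"{gc_content:.2f}\n")       # or format to two decimal places

with open(path) as check:
    written = check.read()

print(repr(written))    # '0.4567\n0.46\n'
```

An f-string gives you control over how the number is formatted, whereas `str()` simply uses Python's default representation.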

In [ ]:
read_counts = {
    'BRCA2': 43234,
    'FOXP2': 3245,
    'SORT1': 343792
}

with open( "out.txt", "w" ) as output:
    output.write("GENE\tREAD_COUNT\n")

    for gene in read_counts:
        line = "\t".join( [ gene, str(read_counts[gene]) ] )
        output.write(line + "\n")

To view the output file, open a terminal window, go to the directory where the file has been written, and print the contents of the file using the cat command, or open it in your favourite editor:

cat out.txt

Be cautious when opening a file for writing, as Python will happily let you overwrite any existing data in the file.
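One possible safeguard is the 'x' mode from the table of mode characters, which refuses to open a file that already exists. A minimal sketch, using a made-up file name in a temporary directory:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "results_demo.txt")

# The file does not exist yet, so "x" creates it
with open(path, "x") as output:
    output.write("important results\n")

# The file now exists, so a second attempt with "x" is refused
try:
    with open(path, "x") as output:
        output.write("oops\n")
except FileExistsError:
    print("Refusing to overwrite", path)

with open(path) as check:
    contents = check.read()

print(repr(contents))    # 'important results\n'
```

Whether you want this behaviour depends on the script: for output that is regenerated on every run, 'w' is usually more convenient.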

Exercises 2.3.2

Create a script that writes the values of a list of numbers to a file, with each number on a separate line.

Next session

Go to our next notebook: python_basic_2_4