Lecture 5: File I/O and Modules

CBIO (CSCI) 4835/6835: Introduction to Computational Biology

Overview and Objectives

So far, all the "data" we've worked with have been manually-created lists or other collections. A big part of your careers as computational scientists will involve interacting with data saved in files. Here we'll finally get to go over reading from and writing to the filesystem. By the end of this lecture, you should be able to:

Implement a basic file reader / writer using built-in Python tools
Import and use Python modules

Part 1: Interacting with text files

Text files are probably the most common and pervasive format of data. They can contain almost anything: weather data, stock market activity, literary works, raw web data.

On the biological end, they can contain things like sequence alignments, protein sequences, molecular structure information, and myriad other data.

Text files are also convenient for your own work: once some kind of analysis has finished, it's nice to dump the results into a file you can inspect later.

Reading an entire file

So let's jump into it! Let's start with something simple: a FASTA text file for the BRCA1 gene.



In [1]:

    
f = open("brca1.fasta", "r")
line = f.readline()
print(line)
f.close()









    



>lcl|NC_000017.10_cdsid_NP_009225.1 [gene=BRCA1] [protein=breast cancer type 1 susceptibility protein isoform 1] [protein_id=NP_009225.1] [location=complement(join(41197695..41197819,41199660..41199720,41201138..41201211,41203080..41203134,41209069..41209152,41215350..41215390,41215891..41215968,41219625..41219712,41222945..41223255,41226348..41226538,41228505..41228631,41234421..41234592,41242961..41243049,41243452..41246877,41247863..41247939,41249261..41249306,41251792..41251897,41256139..41256278,41256885..41256973,41258473..41258550,41267743..41267796,41276034..41276113))]

Aside

A quick review on FASTA files (we'll get into this more in a future lecture)

FASTA refers to software from 1985 for DNA and protein sequence alignment. The software is long obsolete, but its namesake lives on in the file format it used: FASTA-format.

A sequence in a FASTA file is represented as a series of lines.

The first line starts with a greater-than sign (>) and contains a "human-readable" description of the sequence in the file. It usually contains an accession number for the sequence, and may contain other information as well.

Following this line are sequences, using single-letter codes. Anything other than a valid sequence is traditionally ignored.
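The layout described above can be sketched in a few lines of Python. This is only a minimal illustration, not a full FASTA parser: the helper name parse_fasta_lines and the toy record are made up for this example, and it assumes the input holds exactly one record.

```python
def parse_fasta_lines(lines):
    """Split the lines of a single-record FASTA file into (header, sequence).

    Hypothetical helper for illustration; assumes exactly one record.
    """
    header = None
    sequence_parts = []
    for line in lines:
        line = line.strip()          # drop surrounding whitespace, including the newline
        if not line:
            continue                 # skip blank lines
        if line.startswith(">"):
            header = line[1:]        # description line, without the leading ">"
        else:
            sequence_parts.append(line)
    return header, "".join(sequence_parts)


# A toy record (not real data), laid out the way a FASTA file would be.
toy_lines = [
    ">toy_sequence_1 [gene=EXAMPLE]\n",
    "ATGGATTTATCTGC\n",
    "TCTTCGCGTTGAAG\n",
]
header, sequence = parse_fasta_lines(toy_lines)
print(header)
print(len(sequence))
```

The sequence lines are joined into one string because the line breaks in a FASTA file are only there for readability.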

Code walkthrough

Back to the code, then. First, we have a function open() that accepts two arguments:



In [2]:

    
f = open("brca1.fasta", "r")

The first argument is the file path. It's like a URL, except to a file on your computer. Note that unless the path is absolute (on macOS and Linux, one that starts with a forward slash "/"), Python interprets it relative to the current working directory, which is usually the directory you ran the script from and not necessarily the directory the script lives in.

The second argument is the mode. This tells Python whether you're reading from a file, writing to a file, or appending to a file. We'll come to each of these.

These two arguments are passed to the function open(), which then returns a file object (these notes also call it a file descriptor or file reference). It's your key to accessing or modifying the contents of the file.
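To make the two arguments a little more concrete, here is a small sketch. It uses a throwaway file called example.txt (a made-up name, so the code runs without brca1.fasta) and inspects the object that open() hands back. The first few lines show where Python would look for a relative path such as "brca1.fasta".

```python
import os
import tempfile

# Relative paths are resolved against the current working directory.
resolved = os.path.abspath("brca1.fasta")
print(os.getcwd())
print(resolved)

# Use a throwaway file (hypothetical name) so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")
    out = open(path, "w")
    out.write("first line\n")
    out.close()

    f = open(path, "r")
    file_name = f.name               # the path that was passed to open()
    file_mode = f.mode               # the mode that was passed to open()
    closed_before_close = f.closed   # False: the file is still open
    f.close()
    closed_after_close = f.closed    # True: close() has been called

print(file_mode, closed_before_close, closed_after_close)
```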

The next line is where the magic happens:



In [3]:

    
line = f.readline()

In this line, we're calling the method readline() on the file reference we got in the previous step. This method goes into the file, pulls out the next unread line (here, the first line), and sticks it in the variable line as one long string.



In [4]:

    
print(line)









    



>lcl|NC_000017.10_cdsid_NP_009225.1 [gene=BRCA1] [protein=breast cancer type 1 susceptibility protein isoform 1] [protein_id=NP_009225.1] [location=complement(join(41197695..41197819,41199660..41199720,41201138..41201211,41203080..41203134,41209069..41209152,41215350..41215390,41215891..41215968,41219625..41219712,41222945..41223255,41226348..41226538,41228505..41228631,41234421..41234592,41242961..41243049,41243452..41246877,41247863..41247939,41249261..41249306,41251792..41251897,41256139..41256278,41256885..41256973,41258473..41258550,41267743..41267796,41276034..41276113))]

...which we then simply print out.

Finally, the last and possibly most important line:



In [5]:

    
f.close()

This statement explicitly closes the file reference, effectively shutting the valve to the file.

Do not underestimate the value of this statement. There are weird errors that can crop up when you forget to close file descriptors. It can be difficult to remember to do this, though; in other languages where you have to manually allocate and release any memory you use, it's a bit easier to remember. Since Python handles all that stuff for us, it's not a force of habit to explicitly shut off things we've turned on.

Fortunately, there's an alternative those of us with bad short-term memory can use.



In [6]:

    
with open("brca1.fasta", "r") as f:
    line = f.readline()
    print(line)









    



>lcl|NC_000017.10_cdsid_NP_009225.1 [gene=BRCA1] [protein=breast cancer type 1 susceptibility protein isoform 1] [protein_id=NP_009225.1] [location=complement(join(41197695..41197819,41199660..41199720,41201138..41201211,41203080..41203134,41209069..41209152,41215350..41215390,41215891..41215968,41219625..41219712,41222945..41223255,41226348..41226538,41228505..41228631,41234421..41234592,41242961..41243049,41243452..41246877,41247863..41247939,41249261..41249306,41251792..41251897,41256139..41256278,41256885..41256973,41258473..41258550,41267743..41267796,41276034..41276113))]

This code works identically to the code before it. The difference is, by using a with block, Python automatically closes the file descriptor at the end of the block. Therefore, no need to remember to do it yourself! Hooray!
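If you want to convince yourself that the with block really does the closing, you can check the closed attribute of the file object afterwards. The sketch below uses a throwaway file (example.txt is a made-up name) and also shows that the file gets closed when the block is interrupted by an error.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.txt")  # hypothetical file name

    with open(path, "w") as f:
        f.write("some text\n")
    closed_after_block = f.closed      # the with block closed the file for us

    # The file is closed even if the block exits with an error.
    try:
        with open(path, "r") as g:
            raise ValueError("something went wrong mid-read")
    except ValueError:
        pass
    closed_after_error = g.closed

print(closed_after_block, closed_after_error)
```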

File modes

What was the "r" file mode from the open() call?

The "mode" is the way you tell Python exactly what you want to do with the file you're accessing. The three basic modes are:

"r" for read mode. The file will only be read from (it must already exist).

"w" for write mode. The file is created or truncated (anything already there is deleted) and can only be written to.

"a" for append mode. The file is created or appended to (does not delete or truncate any existing file) and can only be written to.

Manipulating Files

There are lots of other methods besides open(), close(), and readline() for tinkering with files.

read - return the entire file as a string (can also specify optional size argument)
readlines - return a list of all lines
write - writes a passed string to the file
seek - set current position of the file; seek(0) starts back at beginning

Which methods can be used in read mode? write mode? append mode?
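One way to explore the modes, and the question above, is to try them on a throwaway file. The following is a sketch under that assumption (modes.txt is a made-up file name in a temporary directory); it shows read mode refusing a missing file, write mode truncating, append mode preserving, and what happens if you try to read from a file opened for appending.

```python
import io
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modes.txt")  # hypothetical file name

    # "r" requires the file to exist already.
    try:
        open(path, "r")
        read_missing_failed = False
    except FileNotFoundError:
        read_missing_failed = True

    # "w" creates the file, or wipes it if it is already there.
    with open(path, "w") as f:
        f.write("one\n")
    with open(path, "w") as f:
        f.write("two\n")

    # "a" keeps what is there and adds to the end.
    with open(path, "a") as f:
        f.write("three\n")

    # A file opened only for writing or appending cannot be read from.
    with open(path, "a") as f:
        try:
            f.read()
            read_in_append_failed = False
        except io.UnsupportedOperation:
            read_in_append_failed = True

    with open(path, "r") as f:
        contents = f.read()

print(repr(contents))
```

The line "one" is gone because the second "w" open truncated the file before "two" was written.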

What is the value of line?



In [7]:

    
f = open('brca1.fasta')
f.read()
line = f.readline()



In [8]:

    
print(line)
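The file object keeps track of its current position, and that is the key to the question above. Here is a small sketch using a throwaway stand-in file (toy.fasta is a made-up name and its contents are not real data): once read() has consumed everything, readline() has nothing left to return until seek(0) rewinds to the beginning.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.fasta")  # hypothetical stand-in file
    with open(path, "w") as f:
        f.write(">toy_sequence\nATGGATTTATC\n")

    with open(path) as f:                # the mode defaults to "r"
        everything = f.read()            # reads to the end of the file
        after_read = f.readline()        # nothing left, so this is an empty string
        f.seek(0)                        # rewind to the beginning
        after_seek = f.readline()        # the first line again

print(repr(after_read))
print(repr(after_seek))
```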

Hello....... Newline.

In Python, we've emphasized how whitespace is important. Recall that whitespace is defined as a character you can't necessarily "see": tabs and spaces, for example.

There's a third character in the whitespace category: the newline character. It's what "appears" when you press the Enter key.

Internally, it's seen by Python as a character that looks like this: \n

But whenever you view plain text, the character is invisible. The only way you can tell it's there is by virtue of the fact that text is separated into lines.

However, when you're reading data in from files (and writing it out, too), you can't afford to ignore these newline characters. They can get you in a lot of trouble.



In [9]:

    
with open("brca1.fasta", "r") as f:
    for i in range(5):
        line = f.readline()
        print(line)









    



>lcl|NC_000017.10_cdsid_NP_009225.1 [gene=BRCA1] [protein=breast cancer type 1 susceptibility protein isoform 1] [protein_id=NP_009225.1] [location=complement(join(41197695..41197819,41199660..41199720,41201138..41201211,41203080..41203134,41209069..41209152,41215350..41215390,41215891..41215968,41219625..41219712,41222945..41223255,41226348..41226538,41228505..41228631,41234421..41234592,41242961..41243049,41243452..41246877,41247863..41247939,41249261..41249306,41251792..41251897,41256139..41256278,41256885..41256973,41258473..41258550,41267743..41267796,41276034..41276113))]

ATGGATTTATCTGCTCTTCGCGTTGAAGAAGTACAAAATGTCATTAATGCTATGCAGAAAATCTTAGAGT

GTCCCATCTGTCTGGAGTTGATCAAGGAACCTGTCTCCACAAAGTGTGACCACATATTTTGCAAATTTTG

CATGCTGAAACTTCTCAACCAGAAGAAAGGGCCTTCACAGTGTCCTTTATGTAAGAATGATATAACCAAA

AGGAGCCTACAAGAAAGTACGAGATTTAGTCAACTTGTTGAAGAGCTATTGAAAATCATTTGTGCTTTTC

What's with the blank lines between each DNA sequence?

You can't see it, but there are newline characters at the ends of each of the lines. Those newlines, coupled with the fact that print() implicitly adds its own newline character to the end of whatever you print, means the Enter key was effectively pressed twice.

Hence, the blank line between each sequence.

So how can we handle this?

`strip()`

Strings in Python have a wonderful strip() method. It returns a copy of the string with any whitespace cut off both ends.



In [10]:

    
lots_of_whitespace = "\n\n          this is a valid string              \n"
print(lots_of_whitespace)









    



          this is a valid string



In [11]:

    
stripped = lots_of_whitespace.strip()
print(stripped)









    



this is a valid string

strip() chops and chops from both ends of a string until it reaches non-whitespace characters.
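Applied to lines read from a file, this gets rid of the doubled-up blank lines. The sketch below uses two toy strings in place of what readline() would return, and shows two options: strip the line before printing, or keep the newline and tell print() not to add its own. It also shows rstrip("\n"), which is handy when you want to keep other whitespace intact.

```python
# Toy lines, shaped like what readline() would return (newline included).
lines = ["ATGGATTTATCTGC\n", "GTCCCATCTGTCTG\n"]

for line in lines:
    print(line.strip())        # newline removed; print() adds exactly one back

for line in lines:
    print(line, end="")        # keep the newline; print() adds nothing extra

# rstrip("\n") removes only trailing newlines and leaves other whitespace alone.
padded = "  indented text  \n"
only_newline_removed = padded.rstrip("\n")
fully_stripped = padded.strip()
print(repr(only_newline_removed))
print(repr(fully_stripped))
```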

Writing to files

We've seen reading from files. How about writing to them? (spoiler alert: newlines can be a pain here, too)



In [12]:

    
data_to_save = "This is important data. Definitely worth saving."
with open("outfile.txt", "w") as file_object:
    file_object.write(data_to_save)

You'll notice two important changes from before:

Switch the "r" argument in the open() function to "w" (changing from reading to writing).
Call write() on your file descriptor, and pass in the data you want to write to the file (in this case, data_to_save).

If you try this using a new notebook on JupyterHub (or on your local machine), you should see a new text file named "outfile.txt" appear in the same directory as your script. Give it a shot!



In [13]:

    
!cat outfile.txt









    



This is important data. Definitely worth saving.

And there you have it. Some notes about writing files:

If the file you're writing to does NOT currently exist, Python will try to create it for you. In most cases this should be fine.

If the file you're writing to DOES already exist, Python will overwrite everything in the file with the new content. As in, everything that was in the file before will be erased.

That second point seems a bit harsh, doesn't it? Luckily, there is recourse.

Appending to an existing file

If you find yourself in the situation of writing to a file multiple times, and wanting to keep what you wrote to the file previously, then you're in the market for appending to a file.

This works exactly the same as writing to a file, with one small wrinkle:



In [14]:

    
data_to_save = "This is ALSO important data. BOTH DATA ARE IMPORTANT."
with open("outfile.txt", "a") as file_object:
    file_object.write(data_to_save)

The only change that was made was switching the "w" in the open() function to "a" for append mode. If you look in outfile.txt, you should see both pieces of text we've written.



In [15]:

    
!cat outfile.txt









    



This is important data. Definitely worth saving.This is ALSO important data. BOTH DATA ARE IMPORTANT.

Whoa, why are those two sentences scrunched right up against each other?

Newlines strike again! When you're writing to a file, you'll need to explicitly put in the newlines when you want something to go on a separate line.



In [16]:

    
data_to_save = "This is the first line.\n"
with open("outfile.txt", "w") as file_object:  # "w" mode
    file_object.write(data_to_save)
    
more_data = "Here's the second line.\n"
with open("outfile.txt", "a") as file_object:  # "a" mode
    file_object.write(more_data)

What is in outfile.txt?



In [17]:

    
!cat outfile.txt









    



This is the first line.
Here's the second line.

Some notes about appending to a file.

If the file does NOT already exist, then using "a" in open() is functionally identical to using "w".

You only need to use append mode if you closed the file descriptor to that file previously. If you have an open file descriptor, you can call write() multiple times; each call will append the text to the previous text. It's only when you close a descriptor, but then want to open up another one to the same file, that you'd need to switch to append mode.
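That last point can be sketched as follows: a single open() in "w" mode, followed by several write() calls, each one picking up where the previous one stopped. The file name and the strings here are made up for illustration.

```python
import os
import tempfile

rows = ["line one", "line two", "line three"]   # placeholder text

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "results.txt")  # hypothetical file name
    with open(path, "w") as file_object:
        for row in rows:
            file_object.write(row + "\n")   # each call continues where the last one stopped

    with open(path, "r") as file_object:
        lines = file_object.readlines()

print(lines)
```

No append mode is needed here, because the file stays open for the whole loop.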

Part 2: Importing Modules

Python comes with a number of packages that provide additional functionality. These are called modules.

Any Python file is a module. The code in these modules can be included in your Python program using the import statement.
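To see that "any Python file is a module", the following sketch writes a tiny Python file and then imports it. Everything about the file is invented for this example (the name seqtools_demo, its GREETING variable, and its gc_count function). The juggling with sys.path and importlib is only needed because the file lives in a temporary directory; for a file sitting next to your script, a plain import statement would normally be enough.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write a small Python file; this file *is* the module.
    module_path = os.path.join(tmp_dir, "seqtools_demo.py")  # hypothetical module
    with open(module_path, "w") as f:
        f.write("GREETING = 'hello from a module'\n")
        f.write("def gc_count(seq):\n")
        f.write("    return seq.count('G') + seq.count('C')\n")

    # Tell Python where to look, import the module, then undo the path change.
    sys.path.insert(0, tmp_dir)
    importlib.invalidate_caches()
    try:
        seqtools_demo = importlib.import_module("seqtools_demo")
    finally:
        sys.path.remove(tmp_dir)

# Variables and functions defined in the file are reached through the module.
greeting = seqtools_demo.GREETING
gc = seqtools_demo.gc_count("ATGGATTTATC")
print(greeting)
print(gc)
```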



In [18]:

    
import this









    



The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!

Lots of other packages come with Python by default:



In [19]:

    
import random   # For generating random numbers
import os       # For interacting with the filesystem of your computer
import sys      # Helps with customizing the behavior of your Python program
import re       # For regular expressions. Unrelated: https://xkcd.com/1171/
import datetime # Helps immensely with determining the date and formatting it
import math     # Gives some basic math functions: trig, factorial, exponential, logarithms, etc.

If you are so inclined, you can see the full Python default module index here: https://docs.python.org/3/py-modindex.html.

Keep in mind--those are just the packages that come with Python. These constitute a teeny tiny drop in the proverbial bucket when you include 3rd party packages available through the Python Package Index. At posting, PyPI was tracking 96,893 Python packages.

Once you've imported a package, all the variables and functions in the module are accessible through the module object.



In [20]:

    
import math



In [21]:

    
print(math.pi)









    



3.141592653589793



In [36]:

    
print(pi)









    



---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-36-5fd1868b3769> in <module>()
----> 1 print(pi)

NameError: name 'pi' is not defined

Namespaces

A namespace in Python, without going into too much detail (yet), is the collection of variables / objects / named things that you have at your disposal.

When you get a NameError like in the previous slide when trying to reference pi, this is because pi is not defined in that particular namespace; instead, it's defined in the math namespace.

Of course, you could always define your own variable pi:



In [23]:

    
pi = 3.14  # an approximation

print(math.pi)
print(pi)









    



3.141592653589793
3.14

These are two different variables, because they exist in different namespaces. They just happen to have the same names.

If, on the other hand, you wanted to import the math.pi variable directly into the global namespace, you could adjust the import statement to look like this:



In [24]:

    
from math import pi
print(pi)









    



3.141592653589793

Importing pi directly into the global namespace has the same effect as reassigning our previous variable pi--as in, the old value got wiped out by this new one.

This is why namespaces are useful--they can differentiate between functions and variables with the same name.

There is a way you can import everything into the global namespace from a module, all at once.



In [25]:

    
from math import *

Anything in the math package is now available to you, without needing to specify math. first.



In [26]:

    
print(sin(pi))
print(cos(pi))

# "sin" and "cos" are functions in the math module, but now they're in the global namespace









    



1.2246467991473532e-16
-1.0

In general, it's a really good idea to avoid importing everything from a module at once.

You probably don't know every little thing that's in the module

Some of these things may have names that perfectly match other things you have defined, leading to some very bizarre behavior

If you happen to know of some obscure functions and variables available in the module, readers of your code who don't have a similar in-depth knowledge may get very confused when they see you using functions and variables that seemingly aren't defined anywhere

In short, don't do this. But it's good to know what it is.
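For a concrete case of such a name clash, the sketch below lists which names a star import of math would bring in that are also names of Python built-ins. It does not actually run the star import; it only inspects the module. The approach (every public name when the module has no __all__ list) follows the general rule for star imports, and the exact list printed may vary between Python versions.

```python
import builtins
import math

# A star import brings in the names listed in __all__ if the module defines it,
# and otherwise every name that does not start with an underscore.
star_names = getattr(math, "__all__", None) or [
    name for name in dir(math) if not name.startswith("_")
]

# Which of those would hide a built-in of the same name?
clashes = sorted(name for name in star_names if hasattr(builtins, name))
print(clashes)

# pow is one such name, and the two versions behave differently.
builtin_result = pow(2, 3)        # built-in pow keeps integers as integers
math_result = math.pow(2, 3)      # math.pow always returns a float
print(builtin_result, math_result)
```

After a star import of math, every later call to pow() in your program would quietly use the float version.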

One last useful trick with imports is the ability to rename the module to whatever you want (kind of like a variable).

If, instead, I didn't want to refer to the math module as math, I could rename it anything:



In [27]:

    
import math as anything

print(anything.pi)









    



3.141592653589793

Now, wherever I would have used math, now I just use anything.

`sys`

Let's take a look at one very useful built-in module before wrapping up: sys. This module has an array of useful utilities, especially if you're interacting with the command line (e.g. a bash prompt) from your Python program.



In [28]:

    
import sys

print(sys.argv)









    



['/opt/python/lib/python3.5/site-packages/ipykernel/__main__.py', '-f', '/Users/squinn/Library/Jupyter/runtime/kernel-09963f0a-eb38-4676-a99a-1954ba945ecb.json']

The sys.argv list contains parameters that were passed to the Python program currently running (yes, this very Jupyter notebook). Using this variable, you can actually pass parameters from the command line to a Python program that you create.

Here's a small Python script I wrote (it's just a text file).



In [29]:

    
!cat args.py









    



import sys
print(len(sys.argv))

This script contains two lines: an import statement, and then it prints out the length of the sys.argv list.

What prints out?

$ python args.py hi bye



In [30]:

    
!python args.py hi bye
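To reason about the question without leaving the notebook, here is a sketch that simulates the list sys.argv would hold for that command. The helper summarize_args is made up for this example; in a real script you would pass it sys.argv itself.

```python
def summarize_args(argv):
    """Return the script name and the list of arguments that follow it.

    Hypothetical helper; argv is expected to look like sys.argv.
    """
    script_name = argv[0]     # the first entry is the script itself
    arguments = argv[1:]      # everything after it came from the command line
    return script_name, arguments


# What sys.argv would hold for:  python args.py hi bye
simulated_argv = ["args.py", "hi", "bye"]
script_name, arguments = summarize_args(simulated_argv)
print(len(simulated_argv))
print(script_name)
print(arguments)
```

Note that the script name counts as an entry, so the length is one more than the number of arguments you typed. The entries are always strings, so a number passed on the command line has to be converted with int() or float() before doing math with it.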

The sys module is also where the various output pipes are located.

When you run print() with some string in it, Python has a couple of options for where it sends that string. The default--it shows up right on your console--is known as standard output. Its variable is stored in sys:



In [31]:

    
print(sys.stdout)









    



<ipykernel.iostream.OutStream object at 0x105190a58>

You can "write" to this variable almost like you would a file descriptor:



In [32]:

    
sys.stdout.write("Hello, world!")









    



Hello, world!

Except the file is none other than the console!

There's another pipe known as standard error. This is typically where (you guessed it) error messages are sent. If your Python program ever crashes, this is where the error message and traceback are written.



In [33]:

    
sys.stderr.write("Something went wrong!")









    



Something went wrong!
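The two pipes can also be reached through print() itself, using its file argument. The sketch below shows that, and then temporarily points standard output at an in-memory buffer (io.StringIO standing in for a file) to show that print() and sys.stdout.write() really do send their text to the same place. The messages are placeholders.

```python
import contextlib
import io
import sys

# print() writes to sys.stdout unless told otherwise.
print("a normal message")
print("a warning or error message", file=sys.stderr)

# Temporarily send standard output to an in-memory buffer instead of the console.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    print("captured")                       # print() adds a newline
    sys.stdout.write("also captured")       # write() does not
captured = buffer.getvalue()

print(repr(captured))
```

Keeping regular output and error messages on separate pipes means you can save one to a file while still seeing the other on the console.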

Administrivia

Apologies again for the change-up in lecture topics from the schedule on the website.
Probably won't be the only time that happens this semester...
Biology NEXT WEEK!
How is Assignment 1 going?

Additional Resources

Matthes, Eric. Python Crash Course, Chapter 10. 2016. ISBN-13: 978-1593276034
Model, Mitchell. Bioinformatics Programming Using Python, Chapter 2. 2010. ISBN-13: 978-0596154509