So far, all the "data" we've worked with have been manually-created lists or other collections. A big part of your careers as computational scientists will involve interacting with data saved in files. Here we'll finally get to go over reading from and writing to the filesystem. By the end of this lecture, you should be able to read data from files and write data back out.
Text files are probably the most common and pervasive format of data. They can contain almost anything: weather data, stock market activity, literary works, raw web data.
On the biological end, they can contain things like sequence alignments, protein sequences, molecular structure information, and myriad other data.
Text files are also convenient for your own work: once some kind of analysis has finished, it's nice to dump the results into a file you can inspect later.
In [1]:
f = open("brca1.fasta", "r")
line = f.readline()
print(line)
f.close()
A sequence in a FASTA file is represented as a series of lines. The first line begins with a ">" character and contains a "human-readable" description of the sequence in the file. It usually contains an accession number for the sequence, and may contain other information as well.
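As a quick illustration, a header line can be recognized by its leading ">" character. The header text below is made up for the example:

```python
# A made-up FASTA header line (real headers usually carry an accession
# number plus a free-text description).
line = ">example_sequence a human-readable description"

# Header lines start with ">"; everything after it is the description.
if line.startswith(">"):
    print("header:", line[1:])
```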
In [2]:
f = open("brca1.fasta", "r")
"/"
(an absolute path), Python will interpret this path to be relative to wherever the Python script is that you're running with this command.These two arguments are part of the function open()
, which then returns a file descriptor. It's your key to accessing or modifying the contents of the file.
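You can check which directory relative paths will be resolved against with the os module (a quick sketch; os is covered later in this lecture):

```python
import os

# A relative filename like "brca1.fasta" is looked up inside the
# current working directory, which os.getcwd() reports.
print(os.getcwd())
```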
The next line is where the magic happens:
In [3]:
line = f.readline()
In this line, we're calling the method readline()
on the file reference we got in the previous step. This method goes into the file, pulls out the first line, and sticks it in the variable line
as one long string.
In [4]:
print(line)
...which we then simply print out.
Finally, the last and possibly most important line:
In [5]:
f.close()
This statement explicitly closes the file reference, effectively shutting the valve to the file.
Do not underestimate the value of this statement. Weird errors can crop up when you forget to close file descriptors. It can be difficult to remember to do, though; in languages where you have to manually allocate and release any memory you use, it's a bit easier to remember. Since Python handles all that for us, explicitly shutting off the things we've turned on isn't yet a force of habit.
Fortunately, there's an alternative those of us with bad short-term memory can use.
In [6]:
with open("brca1.fasta", "r") as f:
line = f.readline()
print(line)
This code works identically to the code before it. The difference is that, by using a with block, Python automatically closes the file descriptor at the end of the block. Therefore, no need to remember to do it yourself! Hooray!
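You can verify the automatic close for yourself: a file object's closed attribute flips to True once the with block ends. (The filename here is invented for the demonstration.)

```python
# Write a throwaway file inside a with block...
with open("demo_with.txt", "w") as f:
    f.write("hello\n")

# ...and after the block exits, the descriptor is already closed.
print(f.closed)   # True
```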
"r"
for read mode. The file will only be read from (it must already exist)."w"
for write mode. The file is created or truncated (anything already there is deleted) and can only be written to."a"
for append mode. The file is created or appended to (does not delete or truncate any existing file) and can only be written to.There are lots of other methods besides open()
, close()
, and readline()
for tinkering with files.
read - return the entire file as a string (can also specify an optional size argument)
readlines - return a list of all lines
write - writes a passed string to the file
seek - set the current position in the file; seek(0) starts back at the beginning

Which methods can be used in read mode? write mode? append mode?
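Here's a small sketch exercising several of these methods together (the filename is invented for the demo):

```python
# Create a little file to play with.
with open("demo_methods.txt", "w") as f:
    f.write("line one\nline two\n")

with open("demo_methods.txt", "r") as f:
    everything = f.read()        # the whole file as one string
    f.seek(0)                    # jump back to the start of the file
    all_lines = f.readlines()    # a list of lines, newlines included

print(everything)
print(all_lines)   # ['line one\n', 'line two\n']
```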
What is the value of line
?
In [7]:
f = open('brca1.fasta')
f.read()
line = f.readline()
In [8]:
print(line)
In Python, we've emphasized how whitespace is important. Recall that whitespace is defined as a character you can't necessarily "see": tabs and spaces, for example.
There's a third character in the whitespace category: the newline character. It's what "appears" when you press the Enter
key.
Internally, it's seen by Python as a character that looks like this: \n
But whenever you view plain text, the character is invisible. The only way you can tell it's there is by virtue of the fact that text is separated into lines.
However, when you're reading data in from files (and writing it out, too), you can't afford to ignore these newline characters. They can get you in a lot of trouble.
In [9]:
with open("brca1.fasta", "r") as f:
for i in range(5):
line = f.readline()
print(line)
What's with the blank lines between each DNA sequence?
You can't see it, but there are newline characters at the ends of each of the lines. Those newlines, coupled with the fact that print()
implicitly adds its own newline character to the end of whatever you print, means the Enter
key was effectively pressed twice.
Hence, the blank line between each sequence.
So how can we handle this?
In [10]:
lots_of_whitespace = "\n\n this is a valid string \n"
print(lots_of_whitespace)
In [11]:
stripped = lots_of_whitespace.strip()
print(stripped)
strip() chops whitespace off both ends of a string, stopping as soon as it reaches non-whitespace characters.
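Applied to file reading, strip() cleans the trailing newline off each line. A minimal sketch that builds its own tiny FASTA-like file first, so it's self-contained (the filename and contents are made up):

```python
# Build a tiny FASTA-like file so the example is self-contained.
with open("tiny.fasta", "w") as f:
    f.write(">example_sequence\nGATTACA\nTTGACA\n")

# Reading it back with strip() removes the trailing "\n",
# so print() no longer produces blank lines between rows.
with open("tiny.fasta", "r") as f:
    for line in f:
        print(line.strip())
```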
In [12]:
data_to_save = "This is important data. Definitely worth saving."
with open("outfile.txt", "w") as file_object:
file_object.write(data_to_save)
You'll notice two important changes from before:

1. We changed the "r" argument in the open() function to "w" (changing from reading to writing).
2. Instead of readline(), we call write() on the file descriptor, and pass in the data we want to write to the file (in this case, data_to_save).

If you try this using a new notebook on JupyterHub (or on your local machine), you should see a new text file named "outfile.txt" appear in the same directory as your script. Give it a shot!
In [13]:
!cat outfile.txt
And there you have it. Some notes about writing files:

The file is created for you if it doesn't already exist.
Opening an existing file in "w" mode truncates it: anything that was in the file before is gone.
That second point seems a bit harsh, doesn't it? Luckily, there is recourse.
If you find yourself in the situation of writing to a file multiple times, and wanting to keep what you wrote to the file previously, then you're in the market for appending to a file.
This works exactly the same as writing to a file, with one small wrinkle:
In [14]:
data_to_save = "This is ALSO important data. BOTH DATA ARE IMPORTANT."
with open("outfile.txt", "a") as file_object:
file_object.write(data_to_save)
The only change that was made was switching the "w"
in the open()
method to "a"
for append mode. If you look in outfile.txt
, you should see both lines of text we've written.
In [15]:
!cat outfile.txt
Whoa, why are those two sentences scrunched right up against each other?
Newlines strike again! When you're writing to a file, you'll need to explicitly put in the newlines when you want something to go on a separate line.
In [16]:
data_to_save = "This is the first line.\n"
with open("outfile.txt", "w") as file_object: # "w" mode
file_object.write(data_to_save)
more_data = "Here's the second line.\n"
with open("outfile.txt", "a") as file_object: # "a" mode
file_object.write(more_data)
What is in outfile.txt
?
In [17]:
!cat outfile.txt
Some notes about appending to a file:

If the file doesn't already exist, opening it in "a" mode with open() is functionally identical to using "w".
Within a single open file descriptor, you can call write() multiple times; each call will append the text to the previous text. It's only when you close a descriptor, but then want to open up another one to the same file, that you'd need to switch to append mode.
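A sketch contrasting multiple write() calls on one descriptor with reopening in append mode (the filename is invented):

```python
# Two write() calls on the same descriptor land one after the other.
with open("demo_append.txt", "w") as f:
    f.write("first\n")
    f.write("second\n")

# A fresh descriptor in "a" mode keeps the earlier contents.
with open("demo_append.txt", "a") as f:
    f.write("third\n")

with open("demo_append.txt", "r") as f:
    print(f.read())
```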
In [18]:
import this
Lots of other packages that come default with Python:
In [19]:
import random # For generating random numbers
import os # For interacting with the filesystem of your computer
import sys # Helps with customizing the behavior of your Python program
import re # For regular expressions. Unrelated: https://xkcd.com/1171/
import datetime # Helps immensely with determining the date and formatting it
import math # Gives some basic math functions: trig, factorial, exponential, logarithms, etc.
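A quick taste of a couple of these modules:

```python
import math
import random

print(math.factorial(5))    # 5! = 120
print(math.log(math.e))     # natural log of e is 1.0

random.seed(42)             # seeding makes the "random" draw repeatable
print(random.randint(1, 10))
```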
If you are so inclined, you can see the full Python default module index here: https://docs.python.org/3/py-modindex.html.
Keep in mind--those are just the packages that come with Python. These constitute a teeny tiny drop in the proverbial bucket when you include 3rd party packages available through the Python Package Index. At posting, PyPI was tracking 96,893 Python packages.
Once you've imported a package, all the variables and functions in the module are accessible through the module object.
In [20]:
import math
In [21]:
print(math.pi)
In [36]:
print(pi)
A namespace in Python, without going into too much detail (yet), is the collection of variables / objects / named things that you have at your disposal.
When you get a NameError
like in the previous slide when trying to reference pi
, this is because pi
is not defined in that particular namespace; instead, it's defined in the math
namespace.
Of course, you could always define your own variable pi
:
In [23]:
pi = 3.14 # an approximation
print(math.pi)
print(pi)
These are two different variables, because they exist in different namespaces. They just happen to have the same names.
If, on the other hand, you wanted to import the math.pi
variable directly into the full namespace, you could adjust the import statement to look like this:
In [24]:
from math import pi
print(pi)
By importing pi
directly into the full namespace, it has the same effect as reassigning our previous variable pi
--as in, it got wiped out by this new one.
This is why namespaces are useful--they can differentiate between functions and variables with the same name.
There is a way you can import everything into the global namespace from a module, all at once.
In [25]:
from math import *
Anything in the math
package is now available to you, without needing to specify math.
first.
In [26]:
print(sin(pi))
print(cos(pi))
# "sin" and "cos" are functions in the math module, but now they're in the global namespace
In general, it's a really good idea to avoid importing everything from a module at once.
In short, don't do this. But it's good to know what it is.
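One reason: a wildcard import can silently clobber names you've already defined. A contrived sketch:

```python
# Suppose you've defined your own variable named pow...
pow = "my precious data"

# ...then a wildcard import pulls in math.pow, overwriting it.
from math import *

print(pow)   # now math's pow function, not the string
```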
One last useful trick with imports is the ability to rename the module to whatever you want (kind of like a variable).
If, instead, I didn't want to refer to the math
module as math
, I could rename it anything:
In [27]:
import math as anything
print(anything.pi)
Now, wherever I would have used math
, now I just use anything
.
Let's take a look at one very useful built-in module before wrapping up: sys
. This module has an array of useful utilities, especially if you're interacting with the command/bash prompt from your Python program.
In [28]:
import sys
print(sys.argv)
The sys.argv
list contains parameters that were passed to the Python program currently running (yes, this very Jupyter notebook). Using this variable, you can actually pass parameters from the command line to a Python program that you create.
Here's a small Python script I wrote (it's just a text file).
In [29]:
!cat args.py
This script contains two lines: an import
statement, and then it prints out the length of the sys.argv
list.
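The file itself isn't reproduced here, but based on that description, a version of args.py might look like:

```python
# args.py (a sketch reconstructed from the description above)
import sys
print(len(sys.argv))
```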
What prints out?
$ python args.py hi bye
In [30]:
!python args.py hi bye
The sys
module is also where the various output pipes are located.
When you run print()
with some string in it, Python has a couple of options for where it sends that string. The default--the string shows up right on your console--is known as standard output. Its variable is stored in sys
:
In [31]:
print(sys.stdout)
You can "write" to this variable almost like you would a file descriptor:
In [32]:
sys.stdout.write("Hello, world!")
Except the file is none other than the console!
There's another pipe known as standard error. This is typically where (you guessed it) error messages are sent. If Python ever crashes, this is the variable the output goes to.
In [33]:
sys.stderr.write("Something went wrong!")