Strings

Authors: Tommy Guy, Anthony Scopatz, Will Trimble, Leszek Tarkowski

Lesson goals:

  1. Examine the string class in greater detail.
  2. Use open() to open, read, and write to files.

To start understanding the string type, let's use the built-in help system.


In [ ]:
help(str)

The help page for string is very long, and it may be easier to keep it open in a browser window by going to the online Python documentation for str and the documentation for sequence types while we talk about its properties.

At its heart, a string is just a sequence of characters. Basic strings are defined using single or double quotes.


In [ ]:
s = "This is a string."
s2 = 'This is another string that uses single quotes'

The reason for having two types of quotes to define a string is emphasized in these examples:


In [ ]:
s = "Bob's mom called to say hello."
s = 'Bob's mom called to say hello.'

The second line raises a SyntaxError: Python reads it as s = 'Bob' and then cannot parse the rest of the line.
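
As a small sketch of the ways around this problem, the three assignments below all produce the same string. The variable names a, b, and c are only for illustration, and the code is written so that it should run under both Python 2.7 and Python 3.

```python
# Three ways to write a string that contains an apostrophe.
a = "Bob's mom called to say hello."       # double quotes around a single quote
b = 'Bob\'s mom called to say hello.'      # single quotes, apostrophe escaped with a backslash
c = '''Bob's mom called to say hello.'''   # triple-quoted string

# All three literals describe exactly the same sequence of characters.
print(a == b == c)
```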

Characters in literal strings must come from the ASCII character set, which is a set of 128 character codes (numbered 0 through 127) that is understood by virtually all modern programming languages and computers. Unfortunately, ASCII does not have room for accented characters or non-Latin scripts. Unicode strings in Python are specified with a leading u:


In [ ]:
u = u'abcdé'

For the rest of this lecture, we will deal with ASCII strings, because most scientific data that is stored as text is stored with ASCII.
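
To see why the distinction between characters and bytes matters, here is a minimal sketch that encodes the Unicode string from above as UTF-8. The choice of UTF-8 is an assumption made for illustration (the lesson does not name an encoding), and the é is written as the escape \xe9 so the example does not depend on the encoding of the source file.

```python
u = u'abcd\xe9'                # \xe9 is the code point for é
encoded = u.encode('utf-8')    # turn the characters into bytes

print(len(u))          # number of characters: 5
print(len(encoded))    # number of bytes: 6, because é needs two bytes in UTF-8

# Decoding with the same encoding recovers the original string.
decoded = encoded.decode('utf-8')
print(decoded == u)
```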

Working with Strings

Strings are sequences, which means many of the ideas from lists can also be applied directly to string manipulation. For instance, characters can be accessed individually or in slices:


In [ ]:
s = 'abcdefghijklmnopqrstuvwxyz'
s[0]

In [ ]:
s[-1]

In [ ]:
s[1:4]

They can also be compared using the equality and ordering operators.


In [ ]:
'str1' == 'str2'

In [ ]:
'str1' == 'str1'

In [ ]:
'str1' < 'str2'
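
Ordering comparisons work character by character, using the numeric code of each character. The following sketch, which uses a made-up list of words, shows one consequence: uppercase letters sort before lowercase ones unless you ask for a case-insensitive sort.

```python
words = ['banana', 'apple', 'Cherry']   # hypothetical example data

print('apple' < 'banana')    # True: 'a' comes before 'b'
print('Cherry' < 'apple')    # True: uppercase codes are smaller than lowercase codes

# sorted() uses the same comparison rules.
default_order = sorted(words)
caseless_order = sorted(words, key=str.lower)   # compare lowercased copies instead

print(default_order)     # ['Cherry', 'apple', 'banana']
print(caseless_order)    # ['apple', 'banana', 'Cherry']
```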

In the help screen, which we looked at above, there are lots of functions that look like this:

|  __add__(...)
|      x.__add__(y) <==> x+y

|  __le__(...)
|      x.__le__(y) <==> x<=y

These are special Python functions that interpret operations like < and +. We'll talk more about these in the next lecture on Classes.
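
As a quick check, you can call these special functions by hand and confirm that they give the same answers as the operators. This is only a sketch to show the correspondence; in everyday code you would always write the operator.

```python
x = 'str1'
y = 'str2'

# Each operator on the left is handled by the special method on the right.
print((x + y) == x.__add__(y))     # + is handled by __add__
print((x < y) == x.__lt__(y))      # < is handled by __lt__
print((x <= y) == x.__le__(y))     # <= is handled by __le__
```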

The string class also provides many handy methods for working with text.

Hands on example

Try each of the following functions on a few strings. What does the function do?


In [ ]:
s = "This is a string"

In [ ]:
s.startswith("This")

In [ ]:
s.split(" ")

In [ ]:
s.strip() # This won't change every string!

In [ ]:
s.capitalize()

In [ ]:
s.lower()

In [ ]:
s.upper()

File I/O

Python has a built-in function called "open()" that can be used to manipulate files. The help information for open is below:


In [ ]:
help(open)

The two main parameters we'll need to worry about are the name of the file and the mode, which determines whether we can read from or write to the file. open returns a file object, which acts like a pointer into the file. An example will make this clear. The file below contains two lines:

$ cat testfile.txt
abcde
fghij

Now let's open this file in Python:


In [ ]:
f = open('testfile.txt', 'r')

The second argument, 'r', means I want to open the file for reading only; I cannot write to this handle. The read() method will read a specified number of bytes:


In [ ]:
s = f.read(3)
print(s)

We read the first three characters, where each character is a byte long. We can see that the file handle points to the 4th byte (index number 3) in the file:


In [ ]:
f.tell()

In [ ]:
f.read(1)

In [ ]:
f.close() # close the old handle

In [ ]:
f.read()  # can't read anymore because the file is closed.
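
The cells above can be collected into one self-contained sketch. It first creates its own copy of the two-line file in a temporary directory (the path is an assumption, chosen so the example does not touch your working directory) and also uses seek(), which moves the handle back to a given position.

```python
import os
import tempfile

# Create a scratch copy of the two-line example file.
path = os.path.join(tempfile.mkdtemp(), 'testfile.txt')
f = open(path, 'w')
f.write('abcde\nfghij\n')
f.close()

f = open(path, 'r')
first = f.read(3)    # read three characters: 'abc'
pos = f.tell()       # the handle now points at index 3
f.seek(0)            # move the handle back to the start of the file
again = f.read(3)    # the same three characters again
rest = f.read()      # with no argument, read everything that is left
f.close()

# Reading from a closed handle raises a ValueError.
try:
    f.read()
except ValueError as err:
    print('Cannot read: %s' % err)
```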

The file we are using is one long series of characters, but two of those characters are newline characters. If we looked at the file in sequence, it would look like "abcde\nfghij\n". Separating a file into lines is common enough that there are two ways to read whole lines in a file. The first is to use the readlines() method:


In [ ]:
f = open('testfile.txt', 'r')
lines = f.readlines()
print(lines)
f.close()  # Always close the file when you are done with it

A very important point about the readlines() method is that it keeps the newline character at the end of each line. You can use the strip() method to remove it.

File handles are also iterable, which means we can use them in for loops or list comprehensions:


In [ ]:
f = open('testfile.txt', 'r')
lines = [line.strip() for line in f]
f.close()
print(lines)

In [ ]:
lines = []
f = open('testfile.txt', 'r')
for line in f:
    lines.append(line.strip())
f.close()
print(lines)

These are equivalent operations. It's often best to handle a file one line at a time, particularly when the file is so large it might not fit in memory.
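
Here is a minimal sketch of the one-line-at-a-time style. Instead of storing every line, it keeps only a small summary (the line number and its length), so memory use stays low no matter how big the file is. The temporary path and the name line_lengths are assumptions made for this example.

```python
import os
import tempfile

# Create a scratch copy of the two-line example file.
path = os.path.join(tempfile.mkdtemp(), 'testfile.txt')
f = open(path, 'w')
f.write('abcde\nfghij\n')
f.close()

line_lengths = []
f = open(path, 'r')
for number, line in enumerate(f, 1):    # count lines starting from 1
    text = line.rstrip('\n')            # drop only the trailing newline
    line_lengths.append((number, len(text)))
f.close()

print(line_lengths)    # [(1, 5), (2, 5)]
```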

The other half of the story is writing output to files. We'll talk about two techniques: writing to the shell and writing to files directly.

If your program only creates one stream of output, it's often a good idea to write to the shell using the print function. There are several advantages to this strategy, including the fact that it allows the user to select where they want to store the output without worrying about any command line flags. You can use ">" to direct the output of your program to a file or use "|" to pipe it to another program.

Sometimes you need to send your output directly to a file handle. For instance, if your program produces two output streams, you may want to open two file handles. Opening a file for writing simply requires changing the second argument from 'r' to 'w' or 'a'.

Caution! Opening a file with the 'w' option erases whatever the file already contains and starts writing at the beginning. If you want to append to the file without losing what is already there, open it with 'a'.

Writing to a file uses the write() command, which accepts a string.


In [ ]:
outfile = open('outfile.txt', 'w')
outfile.write('This is the first line!')
outfile.close()

Another way to write to a file is to use writelines(), which accepts a list of strings and writes them in order. Caution! writelines does not append newlines. If you really want to write a newline at the end of each string in the list, add it yourself.
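
A short sketch of writelines() with the newlines added by hand. The list rows and the temporary output path are made up for this example.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'outfile.txt')   # scratch location
rows = ['first', 'second', 'third']                      # hypothetical data

outfile = open(path, 'w')
# writelines() writes the strings back to back, so we add '\n' ourselves.
outfile.writelines([row + '\n' for row in rows])
outfile.close()

# Read the file back to check what was written.
infile = open(path, 'r')
contents = infile.read()
infile.close()
print(repr(contents))    # 'first\nsecond\nthird\n'
```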

Context managers

Closing files is often neglected in Python because it usually happens automatically, either when the script ends or when Python's garbage collector, the part of the interpreter responsible for reclaiming unused resources, disposes of the file object.

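In less trivial scenarios, a file should be closed as soon as you are done with it to prevent data corruption. To ensure this, you can use a special language construct called a context manager, available since Python 2.5 (where it must be enabled with from __future__ import with_statement; from Python 2.6 on it works by default).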

with open('outfile.txt', 'w') as f:
    f.write("Message of Great Importance")

# other instructions; the file is already closed at this point

Also called "with statements", context managers are responsible for releasing resources when they are no longer needed. In this example, the context manager opens a file, creates a file handle variable to the open file named f, and after the block of instructions completes, closes the file.
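
You can check this behaviour yourself with the closed attribute of the file handle. The sketch below also raises an error on purpose inside a with block to show that the file is still closed afterwards. The temporary path and the RuntimeError are assumptions made for the demonstration.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'outfile.txt')   # scratch location

with open(path, 'w') as f:
    f.write('Message of Great Importance\n')
    inside = f.closed     # False: the file is open inside the block
after = f.closed          # True: the file was closed when the block ended

# The file is closed even if the block is interrupted by an error.
try:
    with open(path, 'a') as g:
        g.write('partial line\n')
        raise RuntimeError('something went wrong')   # deliberate failure
except RuntimeError:
    pass

print(inside, after, g.closed)
```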

Aside About File Editing

How do you edit a file in place? Append mode will not do it: you can use f.seek() and f.tell() to verify that even if your file handle is pointing to the middle of a file, write commands go to the end of the file in append mode. A reliable way to change a file is to write the new version to a temporary file in /tmp/ and then move it over the original. On large clusters, /tmp/ is often local to each node, which reduces the I/O bottlenecks associated with writing large amounts of data.
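
One possible way to follow this advice uses the standard tempfile and shutil modules, as in the sketch below. The file name data.txt and the "edit" itself (converting every line to upper case) are placeholders. tempfile.mkstemp() creates the temporary file in the system's default temporary directory, which is usually /tmp/ on Unix-like systems.

```python
import os
import shutil
import tempfile

# Set up a small file to edit (hypothetical name and contents).
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, 'data.txt')
with open(original, 'w') as f:
    f.write('abcde\nfghij\n')

# Write the modified version to a temporary file, one line at a time.
tmp_handle, tmp_path = tempfile.mkstemp()
with os.fdopen(tmp_handle, 'w') as tmp, open(original, 'r') as src:
    for line in src:
        tmp.write(line.upper())    # placeholder for the real edit

# Move the finished temporary file over the original.
shutil.move(tmp_path, original)

with open(original, 'r') as f:
    result = f.read()
print(repr(result))    # 'ABCDE\nFGHIJ\n'
```

Because the original is replaced only after the new version has been written completely, a crash in the middle of the edit leaves the old file untouched.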

