String processing

Let us start by having a look at some of the functionality that is built into Python strings.

The Python string object

The Python string object has many useful features built into it. Let us look at some of these.


In [ ]:
some_text = "  Postman Pat has a cat named Jess.  "

Note that the text starts and ends with some whitespace characters. One often wants to get rid of these.


In [ ]:
stripped_text = some_text.strip()
stripped_text

It is also possible to strip only the left-hand or the right-hand side of the string.


In [ ]:
some_text.lstrip()

In [ ]:
some_text.rstrip()
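
The three strip methods also accept an optional argument listing the characters to remove instead of whitespace. A minimal sketch, using a made-up string that is not part of the example above:

```python
# Hypothetical example string, chosen only to illustrate the chars argument.
padded = "--==Postman Pat==--"

# The argument is treated as a set of characters, not as a prefix or suffix.
print(padded.strip("-="))    # Postman Pat
print(padded.lstrip("-="))   # Postman Pat==--
print(padded.rstrip("-="))   # --==Postman Pat
```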

Another common scenario is to check whether or not a string starts with a particular word.


In [ ]:
stripped_text.startswith("postman")

Note that the above returns False as the words have different capitalisation. One can work around this by forcing the input text to be lower case.


In [ ]:
stripped_text.lower()

In [ ]:
stripped_text.lower().startswith("postman")
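
As a small extension of the above, startswith() also accepts a tuple of prefixes, and casefold() can be used in place of lower() when comparing text regardless of case. The sketch below assumes the same stripped text as above; the prefix "fireman" is made up for illustration.

```python
stripped_text = "Postman Pat has a cat named Jess."

# A tuple of prefixes: the result is True if any one of them matches.
prefixes = ("postman", "fireman")
print(stripped_text.lower().startswith(prefixes))       # True

# casefold() is a more aggressive lower(), intended for caseless comparisons.
print(stripped_text.casefold().startswith("postman"))   # True
```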

There is also an endswith() method that can be useful for checking file extensions.


In [ ]:
"/my/great/picture.png".endswith(".png")

Let us search for particular words within our string.


In [ ]:
some_text.find("cat")

The find() method returns the index of the first letter of the search term.


In [ ]:
some_text[20:23]

If the search term is not found the find() method returns -1.


In [ ]:
some_text.find("dog")

If we search for something that exists more than once we get the index of the first instance.


In [ ]:
some_text.find("at")

We can find the next instance by specifying the index to start the search from.


In [ ]:
some_text.find("at", 12)

In [ ]:
some_text.find("at", 22)
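
The idea of restarting the search from a given index can be wrapped in a loop to collect every occurrence. This is only a sketch: find_all is a hypothetical helper name, not a built-in string method, and it reports non-overlapping occurrences only.

```python
some_text = "  Postman Pat has a cat named Jess.  "

def find_all(text, term):
    """Return the start index of every non-overlapping occurrence of term."""
    if not term:
        raise ValueError("term must not be empty")
    positions = []
    start = text.find(term)
    while start != -1:
        positions.append(start)
        # Resume the search just past the end of the current hit.
        start = text.find(term, start + len(term))
    return positions

print(find_all(some_text, "at"))   # [11, 21]
```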

String objects also have functionality for enabling substitutions.


In [ ]:
"One cat, two cats, three cats.".replace("cat", "dog")

It is possible to specify the number of substitutions that one wishes to make.


In [ ]:
"One cat, two cats, three cats.".replace("cat", "dog", 2)

One of the most useful features of string objects is the ability to split them based on a separator.


In [ ]:
some_text.split(" ")

Note the extra items at the beginning and end of the list arising from the extra white space. One can get around this by using the strip() method.


In [ ]:
some_text.strip().split(" ")
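
An alternative worth knowing is to call split() with no argument at all. In that case the string is split on runs of whitespace, and leading and trailing whitespace is ignored. A short sketch of the difference:

```python
some_text = "  Postman Pat has a cat named Jess.  "

# No separator: split on any run of whitespace, with no empty items.
print(some_text.split())
# ['Postman', 'Pat', 'has', 'a', 'cat', 'named', 'Jess.']

# An explicit separator treats every single space as a boundary.
print("a  b".split(" "))   # ['a', '', 'b']
print("a  b".split())      # ['a', 'b']
```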

The split() method is particularly useful when dealing with CSV files.


In [ ]:
"1,2,3,4".split(",")

However, depending on your CSV file, it may be safer to also strip the white space from each item, as in the example below.

Does everyone know what a list comprehension is?


In [ ]:
[s.strip() for s in "1, 2,3,    4".split(",")]

Furthermore, if you know that you want to deal with integers, you may even want to include the string to integer conversion as well.


In [ ]:
[int(s.strip()) for s in "1, 2,3,    4".split(",")]
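
Splitting on commas breaks down when a field itself contains a comma inside quotes. For such files the csv module in the standard library may be the safer choice. The sketch below uses made-up CSV content, with io.StringIO standing in for a real file.

```python
import csv
import io

# Hypothetical CSV content with a quoted field that contains a comma.
csv_text = 'name,description\n"Pat","Postman, has a cat"\n'

# csv.reader accepts any iterable of lines and handles the quoting for us.
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows)
# [['name', 'description'], ['Pat', 'Postman, has a cat']]

# A naive split cuts the quoted field in two and keeps the quote characters.
naive = csv_text.splitlines()[1].split(",")
print(naive)
```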

There are many variations on the string methods described above. It is useful to familiarise yourself with the Python documentation on strings.

Regular expressions

Regular expressions can be defined as a series of characters that define a search pattern.

Regular expressions can be very powerful. However, they can be difficult to build up. Often it is a process of trial and error. This means that once they have been created, and the trial and error process has been forgotten, it can be extremely difficult to understand what the regular expression does and why it is constructed the way it is.

Use regular expressions only as a last resort!

To use regular expressions in Python we need to import the re module.


In [ ]:
import re
some_text = "  Postman Pat has a cat named Jess.  "

Let us search for the word "cat".


In [ ]:
re.search(r"cat", some_text)

There are two things to note here:

  1. We use a raw string to represent our regular expression
  2. The regular expression search() method returns a match object (or None if no match is found)

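The position of the match can be accessed using the match object's start() and end() methods. Note that start() is the index of the first matched character, whereas end() is the index just past the last matched character, so the two can be used directly as a slice.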


In [ ]:
match = re.search(r"cat", some_text)
if match:
    print(some_text[match.start():match.end()])

Now suppose that we wanted the first letter to be any alphanumeric character. We can achieve this using the regular expression "word" meta character \w, which matches letters, digits and the underscore.


In [ ]:
match = re.search(r"\wat", some_text)
if match:
    print(match.string[match.start():match.end()])

It is also possible to find all matches. However, note that this returns strings as opposed to regular expression match objects.


In [ ]:
matches = re.findall(r"\wat", some_text)
for m in matches:
    print(m)
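
If the positions of the matches are needed as well, re.finditer() can be used instead. It yields match objects rather than strings. A minimal sketch on the same text:

```python
import re

some_text = "  Postman Pat has a cat named Jess.  "

# Each item produced by finditer() is a match object.
spans = []
for m in re.finditer(r"\wat", some_text):
    spans.append((m.group(), m.start(), m.end()))
    print(m.group(), m.start(), m.end())
# Pat 10 13
# cat 20 23
```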

Similarly we can use regular expressions to perform substitutions.


In [ ]:
re.sub(r"\wat", "dog", some_text)

However, more commonly we want to extract particular pieces of information from a string. For example, we may want the accession and version from an NCBI header. (Format: ">gi|xx|dbsrc|accession.version|description".)


In [ ]:
ncbi_header = ">gi|568336023|gb|CM000663.2| Homo sapiens chromosome 1, GRCh38 reference primary assembly."
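# The dot is escaped so that it matches a literal ".", and the repetition is
# placed inside the second group so that all digits of the version are captured.
match = re.search(r">gi\|[0-9]*\|\w*\|(\w*)\.([0-9]*)\|.*", ncbi_header)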

Note how horrible and incomprehensible the regular expression is.

It took me a couple of attempts to get this one right as I forgot that | is a regular expression meta character that needs to be escaped using a backslash \.

However, we can now access the groups specified by the parentheses.


In [ ]:
match.groups()

Individual groups can also be accessed. Note that group 0 is everything matched by the regular expression; the groups defined by parentheses are numbered from 1.


In [ ]:
match.group(0)

In [ ]:
match.group(1)

In [ ]:
match.group(2)
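
One way to make an expression like this easier to read is to combine named groups with the re.VERBOSE flag, which allows white space and comments inside the pattern. The sketch below is one possible rewrite, not the only one; the group names (gi, dbsrc, accession, version, description) are chosen here for illustration.

```python
import re

ncbi_header = ">gi|568336023|gb|CM000663.2| Homo sapiens chromosome 1, GRCh38 reference primary assembly."

ncbi_regex = re.compile(
    r"""
    >gi\|                   # literal ">gi|" prefix
    (?P<gi>[0-9]+)\|        # GI number
    (?P<dbsrc>\w+)\|        # source database, e.g. "gb"
    (?P<accession>\w+)      # accession
    \.                      # literal dot between accession and version
    (?P<version>[0-9]+)\|   # version
    (?P<description>.*)     # everything that remains
    """,
    re.VERBOSE,
)

match = ncbi_regex.search(ncbi_header)
if match:
    print(match.group("accession"), match.group("version"))   # CM000663 2
```

Named groups can still be accessed by number, and match.groupdict() returns all of them as a dictionary.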

Let us have a look at a common pitfall when using regular expressions in Python: the difference between the methods search() and match().


In [ ]:
re.search(r"cat", "my cat has a hat")

In [ ]:
print( re.match(r"cat", "my cat has a hat") )

In [ ]:
re.match(r"my", "my cat has a hat")

Basically match() only looks for a match at the beginning of the string to be searched. For more information see the search() vs match() section in the Python documentation.
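
The difference can be made explicit by anchoring a search() with ^, which behaves like match() on a single-line string. There is also re.fullmatch(), which only succeeds if the whole string matches. A short sketch:

```python
import re

text = "my cat has a hat"

print(re.match(r"cat", text))       # None: "cat" is not at the beginning
print(re.search(r"^cat", text))     # None: ^ anchors the search to the beginning
print(re.search(r"^my", text))      # a match object for "my"

# fullmatch() requires the pattern to cover the entire string.
print(re.fullmatch(r"my", text))                # None
print(re.fullmatch(r"my cat has a hat", text))  # a match object
```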

Finally, if you are using the same regular expression many times, you may find it advantageous to compile it. This may speed up your program.


In [ ]:
cat_regex = re.compile(r"cat")
cat_regex.search("my cat has a hat")
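
A compiled pattern is typically created once and then reused, for example across the items of a list. The sentences below are made up for illustration. Note that the re module keeps a cache of recently used patterns, so the speed-up from compiling may be modest; the gain is often mostly in readability.

```python
import re

# Compile once, outside the loop.
cat_regex = re.compile(r"cat")

# Hypothetical input data.
sentences = ["my cat has a hat", "the dog sat", "concatenate"]

# Keep only the sentences in which the pattern is found somewhere.
with_cat = [s for s in sentences if cat_regex.search(s)]
print(with_cat)   # ['my cat has a hat', 'concatenate']
```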

There is a lot more to regular expressions, in particular all the meta characters. For more information, have a look at the regular expression operations section in the Python documentation.