Working with strings (printing, variables, and lists)

Let's review strings - one of the most basic concepts (variable types) in Python

Which of the following do you think is a string?

a = 'string'
b = 'two'
c = 3
d = '3'

First, we can let Python tell us. It's assumed here that you understand the concept of a variable. In this case - as in most scripting languages - the variable is on the left of the '=' sign, and the value of the variable is on the right.

First, let's let the Python notebook create these variables; run the cell below:


In [ ]:
a = 'string'
b = 'two'
c = 3
d = '3'

Python has a built-in function called type that will display the type of information contained in a variable. Since we want to ask Python which of the variables is a string, we can simply run the function type() on each variable. Try it, completing the cell below:


In [ ]:
print(type(a))
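
As a small illustration of what type() reports, here is a sketch using two made-up variables (gene_name and copy_number are hypothetical and not part of the exercise). It also shows isinstance(), a built-in function that answers the yes/no question "is this a string?":

```python
# Hypothetical example variables (not part of the exercise)
gene_name = 'BRCA1'   # text inside quotes
copy_number = 2       # a bare number

print(type(gene_name))
print(type(copy_number))

# isinstance() checks a value against a type and returns True or False
gene_name_is_string = isinstance(gene_name, str)
copy_number_is_string = isinstance(copy_number, str)
print(gene_name_is_string)
print(copy_number_is_string)
```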

Why does 'd' have the type it does? Write Python code in the cell below to do the following:

1) Create the variable 'd' and assign it the value '3' as a string
2) Print the type of variable 'd' and confirm it is a string
3) Create the variable 'c' and assign it the value 3 as an int
4) Print the type of variable 'c' and confirm it is an int


In [ ]:
d =

What have we learned?

Think about the answers and discuss with your partner. Write out some Python in the cell below that demonstrates the answers when you run the cell.

  • Strings can be ... characters long
  • All strings must be...
  • An integer (e.g. '3') can be a string if...

In [ ]:
#Strings can be ...

#All strings must be...

#An integer can be a string if...

More on Variables

Before we go further, let's consider some variable names. Do they have any rules?

Consider the following experimental dataset:

Group   Number of Mice   Average Mass (g)   Group ID
alpha   3                17.0               CGJ28371
beta    5                16.4               SJW99399
gamma   6                17.8               PWS29382

Discuss with your partner: what variable names would you use to describe

  • Average weight of a mouse group?
  • Number of mice in a group?

Create the variables below, and use a print statement to display each variable as well as its type.


In [ ]:

Did your variable names work on the first try? On the second? Hopefully you got some names to work. Also, hopefully, your variable names were easy to read and as unambiguous as possible. For example, would the following names work?

  • alpha = 3 (3 what?)
  • gamma_mass = 6 (maybe ok, but didn't we mean average mass?)
  • numberofmiceinbetagroup = 5 (very explicit, but hard to read)

In the end, you will have to decide on names that balance explicitness with ease of use. Some simple rules and conventions (based on the much, much longer Python style guide, PEP 8):

  • Avoid uninformative single character variables like 'a','b','c'
  • Separate words either 'using_underscores' or 'ByUsingCaps'
  • Avoid keywords and built-in names ('for','while','True','print') - these words have special meanings in Python; Python will often let you know if you are doing something foolish here!
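
If you are unsure whether a name is allowed, you can ask Python before using it. The sketch below assumes Python 3 and uses the standard keyword module together with the string method isidentifier(); the candidate names are hypothetical examples, not the "right answers" for the table above:

```python
import keyword

# Hypothetical candidate names; the last two are deliberately problematic
candidate_names = ['beta_mouse_count', 'gamma_average_mass', 'for', '2nd_group']

usable_names = []
for name in candidate_names:
    # isidentifier() checks the spelling rules (letters, digits and
    # underscores only, no leading digit); iskeyword() checks whether
    # Python reserves the word for itself
    is_usable = name.isidentifier() and not keyword.iskeyword(name)
    print(name + ' -> ' + str(is_usable))
    if is_usable:
        usable_names.append(name)
```

Note that this only tells you whether Python will accept a name, not whether the name is easy to read; that part is still up to you.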

Slicing, dicing, and Examining strings

Much of the data bioinformaticians work with comes in the form of strings; as you encountered, a DNA sequence 'ATGCGCCGTA' is a string as far as Python is concerned. Let's look at a few Python functions for working with strings. First, create three new variables that represent the 'Group IDs' for each of the mouse groups in the table above.


In [ ]:
alpha_id = 'CGJ28371'
beta_id = ''
gamma_id = ''

Next, let's look at alpha_id; what is the length of the ID?


In [ ]:
len(alpha_id)

Like all the IDs given in the table, alpha_id should have a length of 8. We can also examine each character in the string represented by alpha_id. To do this, we use print again, except this time we add a special bracket notation next to the variable name. See what happens in the next cell.


In [ ]:
print(alpha_id[0])

We can translate the statement above into the following English sentence: "print the 0th element of the string 'CGJ28371'".

What is the 0th element?

In general, most programming languages count items starting from 0. Therefore, breaking apart the string 'CGJ28371', here is how we would count it:

Index 0 1 2 3 4 5 6 7
Value C G J 2 8 3 7 1
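
The same counting can be sketched in code. This example uses the DNA string mentioned earlier rather than a group ID, and relies on the built-in enumerate() function, which has not been introduced in this lesson, to pair each character with its index:

```python
# Example sequence taken from the text above
dna = 'ATGCGCCGTA'

# enumerate() pairs each character with its index, starting from 0
for index, base in enumerate(dna):
    print(str(index) + ' ' + base)

first_base = dna[0]
last_index = len(dna) - 1   # the last valid index is the length minus 1
last_base = dna[last_index]
```

Because counting starts at 0, a string of length 10 has indices 0 through 9; asking for dna[10] would raise an IndexError.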

Challenge: In the cell below, print the alpha_id character by character in reverse.


In [ ]:

Sometimes we will want to be more specific about which characters we would like to pull from a string. We can use the brackets to select specific characters from a string according to this rule:

string[begin:end:step]

Translation:

  • begin - which index to start with
  • end - which index to stop before (the character at this index is not included)
  • step - how far to move between selected characters (one-by-one would be 1, every other character would be 2)
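
One detail that is easy to miss is that the character at the end index is left out. A minimal sketch, using a hypothetical 9-letter sequence, shows how this makes neighbouring slices fit together without overlapping:

```python
codon_source = 'ATGCGCCGT'   # hypothetical 9-letter sequence

first_three = codon_source[0:3]    # indices 0, 1, 2 (index 3 is not included)
next_three = codon_source[3:6]     # indices 3, 4, 5
every_third = codon_source[0:9:3]  # indices 0, 3, 6

print(first_three)
print(next_three)
print(every_third)
```

A handy consequence is that the length of string[begin:end] is simply end minus begin.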

Run the cell below to see some examples:


In [ ]:
my_string = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
print(my_string[0])
print(my_string[1])
print(my_string[2])
print(my_string[0:9])
print(my_string[0:9:2])
print(my_string[0::2])

As you may have noticed, as long as you stick to the convention string[begin:end:step], you can even omit some of the options:


In [ ]:
print(my_string[::])
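
To make the defaults explicit, here is a short sketch on a hypothetical 10-letter string. When an option is omitted, Python fills in the start of the string for begin, the end of the string for end, and 1 for step:

```python
letters = 'ABCDEFGHIJ'   # hypothetical example string

from_start = letters[:4]    # begin omitted: start at index 0
to_end = letters[4:]        # end omitted: run to the end of the string
every_other = letters[::2]  # begin and end omitted, step of 2
backwards = letters[::-1]   # a negative step walks from the end to the start

print(from_start)
print(to_end)
print(every_other)
print(backwards)
```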

Challenge

In the group ID data, the first three characters are the experimenter's initials, and the numbers are a unique ID number. Use the cell below to do the following (as indicated by the comments):


In [ ]:
#Create new variables that contain the initials of the experimenter
#for each mouse group; print the value of these new variables

#Create new variables that contain the unique ID number
#for each mouse group; print the value of these new variables

Making more complex print statements

Another useful property of strings in Python is the ability to concatenate strings - taking two or more strings and placing them together. In Python this is as simple as using the + operator. For example:


In [ ]:
print('ABC' + '123')

This is a property of strings; you cannot concatenate a string and an int:


In [ ]:
print('ABC' + 123)
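
One common way around this error is to convert the number to a string first with the built-in str() function. The sketch below uses hypothetical variable names and a value from the mouse table; the try/except lines are only there so the example keeps running after the error:

```python
group_name = 'beta'
mouse_count = 5   # an int, taken from the table above

try:
    label = group_name + mouse_count   # str + int is not allowed
except TypeError as error:
    print('TypeError: ' + str(error))

# Converting the int with str() makes the concatenation work
label = group_name + ' group has ' + str(mouse_count) + ' mice'
print(label)
```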

It's not uncommon to create a biological dataset in a file or in a table using the print function. Let's talk about perhaps the most important biological file format: the Fasta file.

Fasta Files

The Fasta file is one of the oldest and most widely used file formats in biology/bioinformatics. Fasta files convey a biological sequence (DNA, RNA, or protein) as well as some metadata; usually, at the very least, the name of the sequence. For this lesson, we will treat every fasta file as sharing three properties:

  1. Every fasta file will be two lines (real-world fasta files may wrap a sequence over several lines or hold several sequences, but we will keep things simple here)
  2. The first line of the fasta file begins with >; the remainder of the line contains any information about the sequence, such as its name.
  3. The next line of the file will be the sequence: DNA, RNA, or protein

Here are two examples of fasta files containing DNA sequences:

>sequence 001
ATTCGAGGATCGATTTCGATCGATGCTTAGCTTTAGCTTTTTTAGATCTCCCA

>sequence 002
AAGCTGACGGGGAGCTAGTCTTAGTCGTACGTTCGAT
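
The properties listed above can also be checked in code. This is only a sketch for the simplified two-line format used in this lesson: the record is a shortened, hypothetical version of the first example, and split() and startswith() are string methods that have not been introduced yet:

```python
# A two-line record written as one Python string; '\n' marks the line break
record = '>sequence 001\nATTCGAGGATCG'

lines = record.split('\n')          # split the text into its lines
header = lines[0]
sequence = lines[1]

has_two_lines = len(lines) == 2     # property 1: two lines
header_ok = header.startswith('>')  # property 2: first line begins with >
name = header[1:]                   # everything after the > character

print(has_two_lines)
print(header_ok)
print(name)
print(sequence)
```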

Creating a Fasta file printer

Challenge

Let's write some simple Python code that will do the following:

  1. Have a variable which will hold the name of the sequence
  2. Have a variable which will hold the sequence string
  3. Print the name and sequence in proper fasta format

Discuss what you think you will need to do with your partner and use the cell below to complete the challenge.


In [ ]:
# Tip: You should be able to do this in 3 lines of code
# To make the printed information move to a new line, use the 
# newline character '\n' - used as a string with quotes
# Python will not print '\n' to the screen, but will interpret it
# to mean that one line ends at that location and a new line
# begins

Counting and substitutions

Another important string tool is the ability to determine certain properties of a string. As you have already seen, the len() function allows you to count the number of characters in a string.


In [ ]:
my_string = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
print(len(my_string))

You also can count specific characters within a string using the count() function:


In [ ]:
my_string = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
print(my_string.count('A'))
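
count() is not limited to single characters. The sketch below, on a hypothetical short sequence, counts a longer substring as well and shows one behaviour worth knowing: matches are counted without overlapping.

```python
dna = 'ATGATGAAA'   # hypothetical example sequence

a_count = dna.count('A')      # a single character
atg_count = dna.count('ATG')  # a longer substring works too
aa_count = dna.count('AA')    # matches may not overlap

print(a_count)
print(atg_count)
print(aa_count)
```

The run 'AAA' at the end contains 'AA' at two overlapping positions, but count() reports only one, because after finding a match it continues searching from the end of that match.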

Here we are using a function - one attached to the string itself, called a method - in a new way, using Python dot notation. Python knows all the kinds of things it can do with a string. We can see the help file (which will tell us all the things Python can do with a string) using the help() function:


In [ ]:
help(str)

This is a pretty long help file! Much of it may be hard (and even unnecessary) for you to read right now. However, it's good to know it exists. This, combined with resources like Google and Stack Overflow, will help you answer a lot of your own questions - and also reassure you that your instructor is not just making things up! Looking at the documentation, one of the first clear functions you might notice is the capitalization command. What do you think is happening in the following lines of code?


In [ ]:
my_uppercase_string = 'ABCDEFG'
my_lowercase_string = my_uppercase_string.lower()
print(my_lowercase_string)

Notice how the new variable was assigned using . notation. Can you write a sequence of commands in 2 lines that will print the lowercase version of an uppercase string?


In [ ]:

DNA to RNA

Another thing Python can do is replace characters in a string. The .replace() method works like this:

str.replace('character(s) to be replaced','character(s) to replace with')

So:


In [ ]:
string_1 = 'ABCDEG'
print('original: ' + string_1 + '\n' + 'after replace: ' + string_1.replace('G','F'))
string_2 = 'I like to eat ?'
print('original: ' + string_2 + '\n' + 'after replace: ' + string_2.replace('?','Mom\'s spaghetti'))

In [ ]:
DNA = 'AGATGGGCTTACTGATCGACCCAGTACGATCGTATTTTTCATCGT'
# RNA uses uracil (U) in place of thymine (T)
RNA = DNA.replace('T','U')
print(RNA)
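
Two details of .replace() are worth seeing in action: it is case-sensitive, and it returns a new string rather than changing the original. The sketch below uses a hypothetical lowercase sequence and the .upper() method, which works like the .lower() method shown earlier:

```python
lowercase_dna = 'atgcttag'   # hypothetical lowercase sequence

# There is no uppercase 'T' in the string, so nothing is replaced
unchanged = lowercase_dna.replace('T', 'U')

# Converting to uppercase first lets the replacement find its target
rna = lowercase_dna.upper().replace('T', 'U')

print(unchanged)
print(rna)
print(lowercase_dna)   # the original variable still holds the original text
```

This matters whenever a sequence is stored in lowercase: either convert the case first, or search for the lowercase letter instead.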

Examining the HIV genome

This is the HIV genome:


In [ ]:
# Human immunodeficiency virus type 1 (HXB2), complete genome;
# ACCESSION   K03455
# VERSION     K03455.1 GI:1906382

hiv_genome = 'tggaagggctaattcactcccaacgaagacaagatatccttgatctgtggatctaccacacacaaggctacttccctgattagcagaactacacaccagggccagggatcagatatccactgacctttggatggtgctacaagctagtaccagttgagccagagaagttagaagaagccaacaaaggagagaacaccagcttgttacaccctgtgagcctgcatggaatggatgacccggagagagaagtgttagagtggaggtttgacagccgcctagcatttcatcacatggcccgagagctgcatccggagtacttcaagaactgctgacatcgagcttgctacaagggactttccgctggggactttccagggaggcgtggcctgggcgggactggggagtggcgagccctcagatcctgcatataagcagctgctttttgcctgtactgggtctctctggttagaccagatctgagcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgccttgagtgcttcaagtagtgtgtgcccgtctgttgtgtgactctggtaactagagatccctcagacccttttagtcagtgtggaaaatctctagcagtggcgcccgaacagggacctgaaagcgaaagggaaaccagaggagctctctcgacgcaggactcggcttgctgaagcgcgcacggcaagaggcgaggggcggcgactggtgagtacgccaaaaattttgactagcggaggctagaaggagagagatgggtgcgagagcgtcagtattaagcgggggagaattagatcgatgggaaaaaattcggttaaggccagggggaaagaaaaaatataaattaaaacatatagtatgggcaagcagggagctagaacgattcgcagttaatcctggcctgttagaaacatcagaaggctgtagacaaatactgggacagctacaaccatcccttcagacaggatcagaagaacttagatcattatataatacagtagcaaccctctattgtgtgcatcaaaggatagagataaaagacaccaaggaagctttagacaagatagaggaagagcaaaacaaaagtaagaaaaaagcacagcaagcagcagctgacacaggacacagcaatcaggtcagccaaaattaccctatagtgcagaacatccaggggcaaatggtacatcaggccatatcacctagaactttaaatgcatgggtaaaagtagtagaagagaaggctttcagcccagaagtgatacccatgttttcagcattatcagaaggagccaccccacaagatttaaacaccatgctaaacacagtggggggacatcaagcagccatgcaaatgttaaaagagaccatcaatgaggaagctgcagaatgggatagagtgcatccagtgcatgcagggcctattgcaccaggccagatgagagaaccaaggggaagtgacatagcaggaactactagtacccttcaggaacaaataggatggatgacaaataatccacctatcccagtaggagaaatttataaaagatggataatcctgggattaaataaaatagtaagaatgtatagccctaccagcattctggacataagacaaggaccaaaggaaccctttagagactatgtagaccggttctataaaactctaagagccgagcaagcttcacaggaggtaaaaaattggatgacagaaaccttgttggtccaaaatgcgaacccagattgtaagactattttaaaagcattgggaccagcggctacactagaagaaatgatgacagcatgtcagggagtaggaggacccggccataaggcaagagttttggctgaagcaatgagccaagtaacaaattcagctaccataatgatgcagagaggcaattttaggaaccaaagaaagattgttaagtgtttcaattgtggcaaagaaggg
cacacagccagaaattgcagggcccctaggaaaaagggctgttggaaatgtggaaaggaaggacaccaaatgaaagattgtactgagagacaggctaattttttagggaagatctggccttcctacaagggaaggccagggaattttcttcagagcagaccagagccaacagccccaccagaagagagcttcaggtctggggtagagacaacaactccccctcagaagcaggagccgatagacaaggaactgtatcctttaacttccctcaggtcactctttggcaacgacccctcgtcacaataaagataggggggcaactaaaggaagctctattagatacaggagcagatgatacagtattagaagaaatgagtttgccaggaagatggaaaccaaaaatgatagggggaattggaggttttatcaaagtaagacagtatgatcagatactcatagaaatctgtggacataaagctataggtacagtattagtaggacctacacctgtcaacataattggaagaaatctgttgactcagattggttgcactttaaattttcccattagccctattgagactgtaccagtaaaattaaagccaggaatggatggcccaaaagttaaacaatggccattgacagaagaaaaaataaaagcattagtagaaatttgtacagagatggaaaaggaagggaaaatttcaaaaattgggcctgaaaatccatacaatactccagtatttgccataaagaaaaaagacagtactaaatggagaaaattagtagatttcagagaacttaataagagaactcaagacttctgggaagttcaattaggaataccacatcccgcagggttaaaaaagaaaaaatcagtaacagtactggatgtgggtgatgcatatttttcagttcccttagatgaagacttcaggaagtatactgcatttaccatacctagtataaacaatgagacaccagggattagatatcagtacaatgtgcttccacagggatggaaaggatcaccagcaatattccaaagtagcatgacaaaaatcttagagccttttagaaaacaaaatccagacatagttatctatcaatacatggatgatttgtatgtaggatctgacttagaaatagggcagcatagaacaaaaatagaggagctgagacaacatctgttgaggtggggacttaccacaccagacaaaaaacatcagaaagaacctccattcctttggatgggttatgaactccatcctgataaatggacagtacagcctatagtgctgccagaaaaagacagctggactgtcaatgacatacagaagttagtggggaaattgaattgggcaagtcagatttacccagggattaaagtaaggcaattatgtaaactccttagaggaaccaaagcactaacagaagtaataccactaacagaagaagcagagctagaactggcagaaaacagagagattctaaaagaaccagtacatggagtgtattatgacccatcaaaagacttaatagcagaaatacagaagcaggggcaaggccaatggacatatcaaatttatcaagagccatttaaaaatctgaaaacaggaaaatatgcaagaatgaggggtgcccacactaatgatgtaaaacaattaacagaggcagtgcaaaaaataaccacagaaagcatagtaatatggggaaagactcctaaatttaaactgcccatacaaaaggaaacatgggaaacatggtggacagagtattggcaagccacctggattcctgagtgggagtttgttaatacccctcccttagtgaaattatggtaccagttagagaaagaacccatagtaggagcagaaaccttctatgtagatggggcagctaacagggagactaaattaggaaaagcaggatatgttactaatagaggaagacaaaaagttgtcaccctaactgacacaacaaatcagaagactgagtta
caagcaatttatctagctttgcaggattcgggattagaagtaaacatagtaacagactcacaatatgcattaggaatcattcaagcacaaccagatcaaagtgaatcagagttagtcaatcaaataatagagcagttaataaaaaaggaaaaggtctatctggcatgggtaccagcacacaaaggaattggaggaaatgaacaagtagataaattagtcagtgctggaatcaggaaagtactatttttagatggaatagataaggcccaagatgaacatgagaaatatcacagtaattggagagcaatggctagtgattttaacctgccacctgtagtagcaaaagaaatagtagccagctgtgataaatgtcagctaaaaggagaagccatgcatggacaagtagactgtagtccaggaatatggcaactagattgtacacatttagaaggaaaagttatcctggtagcagttcatgtagccagtggatatatagaagcagaagttattccagcagaaacagggcaggaaacagcatattttcttttaaaattagcaggaagatggccagtaaaaacaatacatactgacaatggcagcaatttcaccggtgctacggttagggccgcctgttggtgggcgggaatcaagcaggaatttggaattccctacaatccccaaagtcaaggagtagtagaatctatgaataaagaattaaagaaaattataggacaggtaagagatcaggctgaacatcttaagacagcagtacaaatggcagtattcatccacaattttaaaagaaaaggggggattggggggtacagtgcaggggaaagaatagtagacataatagcaacagacatacaaactaaagaattacaaaaacaaattacaaaaattcaaaattttcgggtttattacagggacagcagaaatccactttggaaaggaccagcaaagctcctctggaaaggtgaaggggcagtagtaatacaagataatagtgacataaaagtagtgccaagaagaaaagcaaagatcattagggattatggaaaacagatggcaggtgatgattgtgtggcaagtagacaggatgaggattagaacatggaaaagtttagtaaaacaccatatgtatgtttcagggaaagctaggggatggttttatagacatcactatgaaagccctcatccaagaataagttcagaagtacacatcccactaggggatgctagattggtaataacaacatattggggtctgcatacaggagaaagagactggcatttgggtcagggagtctccatagaatggaggaaaaagagatatagcacacaagtagaccctgaactagcagaccaactaattcatctgtattactttgactgtttttcagactctgctataagaaaggccttattaggacacatagttagccctaggtgtgaatatcaagcaggacataacaaggtaggatctctacaatacttggcactagcagcattaataacaccaaaaaagataaagccacctttgcctagtgttacgaaactgacagaggatagatggaacaagccccagaagaccaagggccacagagggagccacacaatgaatggacactagagcttttagaggagcttaagaatgaagctgttagacattttcctaggatttggctccatggcttagggcaacatatctatgaaacttatggggatacttgggcaggagtggaagccataataagaattctgcaacaactgctgtttatccattttcagaattgggtgtcgacatagcagaataggcgttactcgacagaggagagcaagaaatggagccagtagatcctagactagagccctggaagcatccaggaagtcagcctaaaactgcttgtaccaattgctattgtaaaaagtgttgctttcattgccaagtttgtttcataacaaaagccttaggcatctcctatggcaggaagaagcgg
agacagcgacgaagagctcatcagaacagtcagactcatcaagcttctctatcaaagcagtaagtagtacatgtaacgcaacctataccaatagtagcaatagtagcattagtagtagcaataataatagcaatagttgtgtggtccatagtaatcatagaatataggaaaatattaagacaaagaaaaatagacaggttaattgatagactaatagaaagagcagaagacagtggcaatgagagtgaaggagaaatatcagcacttgtggagatgggggtggagatggggcaccatgctccttgggatgttgatgatctgtagtgctacagaaaaattgtgggtcacagtctattatggggtacctgtgtggaaggaagcaaccaccactctattttgtgcatcagatgctaaagcatatgatacagaggtacataatgtttgggccacacatgcctgtgtacccacagaccccaacccacaagaagtagtattggtaaatgtgacagaaaattttaacatgtggaaaaatgacatggtagaacagatgcatgaggatataatcagtttatgggatcaaagcctaaagccatgtgtaaaattaaccccactctgtgttagtttaaagtgcactgatttgaagaatgatactaataccaatagtagtagcgggagaatgataatggagaaaggagagataaaaaactgctctttcaatatcagcacaagcataagaggtaaggtgcagaaagaatatgcatttttttataaacttgatataataccaatagataatgatactaccagctataagttgacaagttgtaacacctcagtcattacacaggcctgtccaaaggtatcctttgagccaattcccatacattattgtgccccggctggttttgcgattctaaaatgtaataataagacgttcaatggaacaggaccatgtacaaatgtcagcacagtacaatgtacacatggaattaggccagtagtatcaactcaactgctgttaaatggcagtctagcagaagaagaggtagtaattagatctgtcaatttcacggacaatgctaaaaccataatagtacagctgaacacatctgtagaaattaattgtacaagacccaacaacaatacaagaaaaagaatccgtatccagagaggaccagggagagcatttgttacaataggaaaaataggaaatatgagacaagcacattgtaacattagtagagcaaaatggaataacactttaaaacagatagctagcaaattaagagaacaatttggaaataataaaacaataatctttaagcaatcctcaggaggggacccagaaattgtaacgcacagttttaattgtggaggggaatttttctactgtaattcaacacaactgtttaatagtacttggtttaatagtacttggagtactgaagggtcaaataacactgaaggaagtgacacaatcaccctcccatgcagaataaaacaaattataaacatgtggcagaaagtaggaaaagcaatgtatgcccctcccatcagtggacaaattagatgttcatcaaatattacagggctgctattaacaagagatggtggtaatagcaacaatgagtccgagatcttcagacctggaggaggagatatgagggacaattggagaagtgaattatataaatataaagtagtaaaaattgaaccattaggagtagcacccaccaaggcaaagagaagagtggtgcagagagaaaaaagagcagtgggaataggagctttgttccttgggttcttgggagcagcaggaagcactatgggcgcagcctcaatgacgctgacggtacaggccagacaattattgtctggtatagtgcagcagcagaacaatttgctgagggctattgaggcgcaacagcatctgttgcaactcacagtctggggcatcaagcagctccaggcaagaatcctggctgtggaaagatacctaa
aggatcaacagctcctggggatttggggttgctctggaaaactcatttgcaccactgctgtgccttggaatgctagttggagtaataaatctctggaacagatttggaatcacacgacctggatggagtgggacagagaaattaacaattacacaagcttaatacactccttaattgaagaatcgcaaaaccagcaagaaaagaatgaacaagaattattggaattagataaatgggcaagtttgtggaattggtttaacataacaaattggctgtggtatataaaattattcataatgatagtaggaggcttggtaggtttaagaatagtttttgctgtactttctatagtgaatagagttaggcagggatattcaccattatcgtttcagacccacctcccaaccccgaggggacccgacaggcccgaaggaatagaagaagaaggtggagagagagacagagacagatccattcgattagtgaacggatccttggcacttatctgggacgatctgcggagcctgtgcctcttcagctaccaccgcttgagagacttactcttgattgtaacgaggattgtggaacttctgggacgcagggggtgggaagccctcaaatattggtggaatctcctacagtattggagtcaggaactaaagaatagtgctgttagcttgctcaatgccacagccatagcagtagctgaggggacagatagggttatagaagtagtacaaggagcttgtagagctattcgccacatacctagaagaataagacagggcttggaaaggattttgctataagatgggtggcaagtggtcaaaaagtagtgtgattggatggcctactgtaagggaaagaatgagacgagctgagccagcagcagatagggtgggagcagcatctcgagacctggaaaaacatggagcaatcacaagtagcaatacagcagctaccaatgctgcttgtgcctggctagaagcacaagaggaggaggaggtgggttttccagtcacacctcaggtacctttaagaccaatgacttacaaggcagctgtagatcttagccactttttaaaagaaaaggggggactggaagggctaattcactcccaaagaagacaagatatccttgatctgtggatctaccacacacaaggctacttccctgattagcagaactacacaccagggccaggggtcagatatccactgacctttggatggtgctacaagctagtaccagttgagccagataagatagaagaggccaataaaggagagaacaccagcttgttacaccctgtgagcctgcatgggatggatgacccggagagagaagtgttagagtggaggtttgacagccgcctagcatttcatcacgtggcccgagagctgcatccggagtacttcaagaactgctgacatcgagcttgctacaagggactttccgctggggactttccagggaggcgtggcctgggcgggactggggagtggcgagccctcagatcctgcatataagcagctgctttttgcctgtactgggtctctctggttagaccagatctgagcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgccttgagtgcttcaagtagtgtgtgcccgtctgttgtgtgactctggtaactagagatccctcagacccttttagtcagtgtggaaaatctctagca'

Given the things we have learned so far, and the information below, complete the tasks in the code cell below. Note that the HIV map contains numbers (positions) for the first and last nucleotides of several genes (see again HIV Genome Landmarks).


In [ ]:
# Determine and print the length of the HIV genome

# Create variables for and print the sequences for the following HIV genes
#* gag
#* pol
#* vif
#* vpr
#* env


# Generate the RNA sequence for each of the listed genes
# Generate a sum for each of the nucleotides ('A','U','G','C')
# Calculate the GC content for each of the genes
# percent GC = 100 * (sum of G + sum of C) / total number of nucleotides
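
A note on positions: genome maps usually number nucleotides starting from 1 and include both the first and last position, while Python slices start from 0 and leave out the end index. The sketch below assumes that convention applies to the map you are using (check this before relying on it) and uses a hypothetical toy sequence and made-up positions, not real HIV coordinates:

```python
toy_genome = 'AACCGGTT'   # hypothetical 8-nucleotide sequence

# Hypothetical map entry: the gene runs from position 3 to position 6,
# counting from 1 and including both ends
gene_start = 3
gene_end = 6

# Python counts from 0 and leaves out the end index, so subtract 1
# from the start and use the end position as-is
toy_gene = toy_genome[gene_start - 1:gene_end]

print(toy_gene)
print(len(toy_gene))
```

A quick sanity check is that the length of the slice should equal gene_end - gene_start + 1.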