Let's review strings, one of the most basic variable types in Python.
Which of the following do you think is a string?
a = 'string'
b = 'two'
c = 3
d = '3'
We can let Python tell us. It's assumed here that you understand the concept of a variable. In this case, as in most scripting languages, the variable is on the left of the '=' sign, and the value of the variable is on the right.
First, let's let the Python notebook create these variables; run the cell below:
In [ ]:
a = 'string'
b = 'two'
c = 3
d = '3'
Python has a built-in function called type that will display the type of information contained in a variable. Since we want to ask Python which of the variables is a string, we can simply run the function type() on the variable. Try it, completing the cell below:
In [ ]:
print(type(a))
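As a minimal sketch of what type() reports, the example below uses two hypothetical variables (count_as_number and count_as_text are illustrative names, not part of the exercise). It shows that the quotes, not the characters themselves, decide whether a value is a string. It is written with print() so that it also runs under Python 3.

```python
# Hypothetical example values: the same digit stored two different ways
count_as_number = 7      # no quotes: Python stores an integer
count_as_text = '7'      # quotes: Python stores a string

print(type(count_as_number))
print(type(count_as_text))

# int() and str() convert between the two types
converted_to_number = int(count_as_text)
converted_to_text = str(count_as_number)
print(converted_to_number == count_as_number)
print(converted_to_text == count_as_text)
```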
Why does 'd' have the value it does? Write Python code in the cell below to do the following:
1) Create the variable 'd' and assign it the value '3' as a string
2) Print the type of variable 'd' and confirm it is a string
3) Create the variable 'c' and assign it the value 3 as an int
4) Print the type of variable 'c' and confirm it is an int
In [ ]:
d =
In [ ]:
#Strings can be ...
#All strings must be...
#An integer can be a string if...
Before we go further, let's consider some variable names. Do they have any rules?
Consider the following experimental dataset:
Group | Number of Mice | Average Mass (g) | Group ID |
---|---|---|---|
alpha | 3 | 17.0 | CGJ28371 |
beta | 5 | 16.4 | SJW99399 |
gamma | 6 | 17.8 | PWS29382 |
Discuss with your partner: what variable names would you use to describe this dataset?
Create the variables below, and use a print statement to display each variable as well as its type.
In [ ]:
Did your variable names work on the first try? On the second? Hopefully you got some names to work. Also, hopefully, your variable names were easy to read and as unambiguous as possible. For example, would the following names work?
In the end, you will have to decide on names that balance explicitness with ease of use. Some simple rules and conventions (based on a much, much longer Python style guide):
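A rough sketch of a few widely used conventions, applied to the alpha group from the table above. The variable names are illustrative choices (assumptions, not the only valid answers), and the style shown is the lowercase_with_underscores convention recommended by the PEP 8 style guide.

```python
# Names may contain letters, digits, and underscores, but cannot start
# with a digit or contain spaces; they are also case-sensitive.
alpha_number_of_mice = 3
alpha_average_mass_g = 17.0      # including the unit avoids ambiguity
alpha_group_id = 'CGJ28371'

# These would each raise a SyntaxError if uncommented:
# 1st_group = 'alpha'        # starts with a digit
# group id = 'CGJ28371'      # contains a space
# class = 'alpha'            # 'class' is a reserved Python keyword

print(type(alpha_number_of_mice))
print(type(alpha_average_mass_g))
print(type(alpha_group_id))
```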
Much of the data bioinformaticians work with comes in the form of strings; for example, a DNA sequence 'ATGCGCCGTA' is a string as far as Python is concerned. Let's look at a few Python functions for working with strings. First, create three new variables that represent the 'Group IDs' for each of the mouse groups in the table above.
In [ ]:
alpha_id = 'CGJ28371'
beta_id = ''
gamma_id = ''
Next, let's look at alpha_id; what is the length of the ID?
In [ ]:
len(alpha_id)
Like all the IDs given in the table, len(alpha_id) should have a value of 8. We can also examine each character in the string represented by alpha_id. To do this, we use the print function again, except this time we add a special bracket notation next to the variable. See what happens in the next cell.
In [ ]:
print(alpha_id[0])
We can translate the statement above into the following English sentence:
"print the 0th element of the string 'CGJ28371'"
What is the 0th element?
In general, most programming languages count items starting from 0. Therefore, breaking apart the string 'CGJ28371', here is how we would count it:
Index | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
Value | C | G | J | 2 | 8 | 3 | 7 | 1 |
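The table above can be reproduced in code. This is a minimal sketch that assumes alpha_id holds 'CGJ28371' as defined earlier; it is redefined here so the example is self-contained.

```python
alpha_id = 'CGJ28371'

# enumerate() pairs each character with its index, counting from 0
for index, character in enumerate(alpha_id):
    print(str(index) + ' ' + character)

# The last valid index is always one less than the length
last_index = len(alpha_id) - 1
print(alpha_id[last_index])
```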
Challenge
In the cell below, print alpha_id character by character in reverse.
In [ ]:
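Sometimes we will want to be more specific about which characters we would like to pull from a string. We can use the brackets to select specific characters from a string according to this rule: string[begin:end:step]
Translation: start at index begin, stop just before index end, and take every step-th character.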
Run the cell below to see some examples:
In [ ]:
my_string = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
print(my_string[0])
print(my_string[1])
print(my_string[2])
print(my_string[0:9])
print(my_string[0:9:2])
print(my_string[0::2])
As you have noticed, as long as you stick to the convention string[begin:end:step], you can even omit some of the options:
In [ ]:
print(my_string[::])
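A few more combinations, as a sketch. Each of begin, end, and step falls back to a default when omitted (the start of the string, the end of the string, and a step of 1, respectively). The variable names on the left are illustrative.

```python
my_string = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'

letters = my_string[:26]        # begin omitted: start from index 0
digits = my_string[26:]         # end omitted: run to the end of the string
every_fifth = my_string[::5]    # begin and end omitted: whole string, step 5

print(letters)
print(digits)
print(every_fifth)
```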
Challenge
In the group ID data, the first three characters are the experimenter's initials, and the numbers are a unique ID number. Use the cell below to do and demonstrate the following (as indicated by the comments):
In [ ]:
#Create new variables that contain the initials of the experimenter
#for each mouse group; print the value of these new variables
#Create new variables that contain the ID of the experimenter
#for each mouse group; print the value of these new variables
Another useful property of strings in Python is the ability to concatenate them: taking two or more strings and placing them together. In Python this is as simple as using the + operator. For example:
In [ ]:
print('ABC' + '123')
This is a property of strings; you cannot concatenate a string and an int:
In [ ]:
print('ABC' + 123)  # raises a TypeError
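One common way around this, sketched below, is to convert the number to a string with str() before concatenating. The variable names are hypothetical.

```python
initials = 'ABC'
id_number = 123

# str() turns the integer into the string '123', so + can join the two
combined = initials + str(id_number)
print(combined)
print(type(combined))
```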
It's not uncommon to create a biological dataset in a file or a table using the print function. Let's talk about perhaps the most important biological file format: the FASTA file.
The FASTA file is one of the oldest and most widely used file formats in biology/bioinformatics. FASTA files convey a biological sequence (DNA, RNA, or protein) as well as some metadata; usually, at the very least, the name of the sequence. Every FASTA file shares two properties:
1) The first line begins with the character >; the remainder of that line contains any information about the file, such as the name of the sequence.
2) The lines that follow contain the sequence itself.
Here are two examples of FASTA files containing DNA sequences:
>sequence 001
ATTCGAGGATCGATTTCGATCGATGCTTAGCTTTAGCTTTTTTAGATCTCCCA
>sequence 002
AAGCTGACGGGGAGCTAGTCTTAGTCGTACGTTCGAT
Challenge
Let's create a simple sequence in Python that will do the following:
Discuss with your partner what you think you will need to do, and use the cell below to complete the challenge.
In [ ]:
# Tip: You should be able to do this in 3 lines of code
# To make the printed information move to a new line, use the
# newline character '\n' - used as a string with quotes
# Python will not print '\n' to the screen, but will interpret it to
# mean that you want to end one line at that location and begin a
# new line
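A small sketch of how '\n' behaves, using placeholder text rather than a sequence so the challenge is left open:

```python
# '\n' is a single character that tells print to start a new line
two_lines = 'first line' + '\n' + 'second line'
print(two_lines)

# len() counts '\n' as one character even though it is typed as two
print(len('\n'))
```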
Another important string tool is the ability to determine certain properties of a string. As you have already seen, the len() function allows you to count the number of characters in a string.
In [ ]:
my_string = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
print(len(my_string))
You can also count specific characters within a string using the count() function:
In [ ]:
my_string = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
print(my_string.count('A'))
Here we are using a function (more precisely a method, which is a function attached to the string) in a new way, using Python's dot notation. Python knows all the functions it can perform on a string. We can see the help file (which will tell us all the things Python can do with a string) using the help() function:
In [ ]:
help(str)
This is a pretty long help file! Much of it may be hard (and even unnecessary) for you to read right now. However, it's good to know it exists. This, combined with resources like Google and Stack Overflow, will help you answer a lot of your own questions, and also reassure you that your instructor is not just making things up! Looking at the documentation, one of the first clear functions you might notice is the capitalization command. What do you think is happening in the following lines of code?
In [ ]:
my_uppercase_string = 'ABCDEFG'
my_lowercase_string = my_uppercase_string.lower()
print(my_lowercase_string)
Notice how the new variable was assigned using dot (.) notation. Can you write a sequence of commands in 2 lines that will print the lowercase version of an uppercase string?
In [ ]:
In [ ]:
string_1 = 'ABCDEG'
print('original: ' + string_1 + '\n' + 'after replace: ' + string_1.replace('G','F'))
string_2 = 'I like to eat ?'
print('original: ' + string_2 + '\n' + 'after replace: ' + string_2.replace('?','Mom\'s spaghetti'))
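One detail worth noticing in the cell above: replace() returns a new string and leaves the original untouched. A minimal sketch (the name corrected is illustrative):

```python
string_1 = 'ABCDEG'

# replace() builds and returns a new string
corrected = string_1.replace('G', 'F')

print(string_1)     # the original is unchanged at this point
print(corrected)

# To keep the change under the same name, assign the result back
string_1 = string_1.replace('G', 'F')
print(string_1)
```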
In [ ]:
DNA = 'AGATGGGCTTACTGATCGACCCAGTACGATCGTATTTTTCATCGT'
# Transcription: RNA uses uracil (U) in place of thymine (T)
RNA = DNA.replace('T','U')
print(RNA)
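Keep in mind that replace() is case-sensitive, which matters when a sequence is stored in lowercase. The sketch below uses a short made-up sequence (not real data) and the upper() method to normalize the case first.

```python
# A short, made-up lowercase DNA sequence for illustration only
dna_lowercase = 'atgcttag'

# 'T' does not match 't', so nothing is replaced here
unchanged = dna_lowercase.replace('T', 'U')

# Converting to uppercase first lets the replacement find every T
rna = dna_lowercase.upper().replace('T', 'U')

print(unchanged)
print(rna)
```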
This is the HIV genome:
In [ ]:
# Human immunodeficiency virus type 1 (HXB2), complete genome;
# ACCESSION K03455
# VERSION K03455.1 GI:1906382
hiv_genome = 'tggaagggctaattcactcccaacgaagacaagatatccttgatctgtggatctaccacacacaaggctacttccctgattagcagaactacacaccagggccagggatcagatatccactgacctttggatggtgctacaagctagtaccagttgagccagagaagttagaagaagccaacaaaggagagaacaccagcttgttacaccctgtgagcctgcatggaatggatgacccggagagagaagtgttagagtggaggtttgacagccgcctagcatttcatcacatggcccgagagctgcatccggagtacttcaagaactgctgacatcgagcttgctacaagggactttccgctggggactttccagggaggcgtggcctgggcgggactggggagtggcgagccctcagatcctgcatataagcagctgctttttgcctgtactgggtctctctggttagaccagatctgagcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgccttgagtgcttcaagtagtgtgtgcccgtctgttgtgtgactctggtaactagagatccctcagacccttttagtcagtgtggaaaatctctagcagtggcgcccgaacagggacctgaaagcgaaagggaaaccagaggagctctctcgacgcaggactcggcttgctgaagcgcgcacggcaagaggcgaggggcggcgactggtgagtacgccaaaaattttgactagcggaggctagaaggagagagatgggtgcgagagcgtcagtattaagcgggggagaattagatcgatgggaaaaaattcggttaaggccagggggaaagaaaaaatataaattaaaacatatagtatgggcaagcagggagctagaacgattcgcagttaatcctggcctgttagaaacatcagaaggctgtagacaaatactgggacagctacaaccatcccttcagacaggatcagaagaacttagatcattatataatacagtagcaaccctctattgtgtgcatcaaaggatagagataaaagacaccaaggaagctttagacaagatagaggaagagcaaaacaaaagtaagaaaaaagcacagcaagcagcagctgacacaggacacagcaatcaggtcagccaaaattaccctatagtgcagaacatccaggggcaaatggtacatcaggccatatcacctagaactttaaatgcatgggtaaaagtagtagaagagaaggctttcagcccagaagtgatacccatgttttcagcattatcagaaggagccaccccacaagatttaaacaccatgctaaacacagtggggggacatcaagcagccatgcaaatgttaaaagagaccatcaatgaggaagctgcagaatgggatagagtgcatccagtgcatgcagggcctattgcaccaggccagatgagagaaccaaggggaagtgacatagcaggaactactagtacccttcaggaacaaataggatggatgacaaataatccacctatcccagtaggagaaatttataaaagatggataatcctgggattaaataaaatagtaagaatgtatagccctaccagcattctggacataagacaaggaccaaaggaaccctttagagactatgtagaccggttctataaaactctaagagccgagcaagcttcacaggaggtaaaaaattggatgacagaaaccttgttggtccaaaatgcgaacccagattgtaagactattttaaaagcattgggaccagcggctacactagaagaaatgatgacagcatgtcagggagtaggaggacccggccataaggcaagagttttggctgaagcaatgagccaagtaacaaattcagctaccataatgatgcagagaggcaattttaggaaccaaagaaagattgttaagtgtttcaattgtggcaaagaaggg
cacacagccagaaattgcagggcccctaggaaaaagggctgttggaaatgtggaaaggaaggacaccaaatgaaagattgtactgagagacaggctaattttttagggaagatctggccttcctacaagggaaggccagggaattttcttcagagcagaccagagccaacagccccaccagaagagagcttcaggtctggggtagagacaacaactccccctcagaagcaggagccgatagacaaggaactgtatcctttaacttccctcaggtcactctttggcaacgacccctcgtcacaataaagataggggggcaactaaaggaagctctattagatacaggagcagatgatacagtattagaagaaatgagtttgccaggaagatggaaaccaaaaatgatagggggaattggaggttttatcaaagtaagacagtatgatcagatactcatagaaatctgtggacataaagctataggtacagtattagtaggacctacacctgtcaacataattggaagaaatctgttgactcagattggttgcactttaaattttcccattagccctattgagactgtaccagtaaaattaaagccaggaatggatggcccaaaagttaaacaatggccattgacagaagaaaaaataaaagcattagtagaaatttgtacagagatggaaaaggaagggaaaatttcaaaaattgggcctgaaaatccatacaatactccagtatttgccataaagaaaaaagacagtactaaatggagaaaattagtagatttcagagaacttaataagagaactcaagacttctgggaagttcaattaggaataccacatcccgcagggttaaaaaagaaaaaatcagtaacagtactggatgtgggtgatgcatatttttcagttcccttagatgaagacttcaggaagtatactgcatttaccatacctagtataaacaatgagacaccagggattagatatcagtacaatgtgcttccacagggatggaaaggatcaccagcaatattccaaagtagcatgacaaaaatcttagagccttttagaaaacaaaatccagacatagttatctatcaatacatggatgatttgtatgtaggatctgacttagaaatagggcagcatagaacaaaaatagaggagctgagacaacatctgttgaggtggggacttaccacaccagacaaaaaacatcagaaagaacctccattcctttggatgggttatgaactccatcctgataaatggacagtacagcctatagtgctgccagaaaaagacagctggactgtcaatgacatacagaagttagtggggaaattgaattgggcaagtcagatttacccagggattaaagtaaggcaattatgtaaactccttagaggaaccaaagcactaacagaagtaataccactaacagaagaagcagagctagaactggcagaaaacagagagattctaaaagaaccagtacatggagtgtattatgacccatcaaaagacttaatagcagaaatacagaagcaggggcaaggccaatggacatatcaaatttatcaagagccatttaaaaatctgaaaacaggaaaatatgcaagaatgaggggtgcccacactaatgatgtaaaacaattaacagaggcagtgcaaaaaataaccacagaaagcatagtaatatggggaaagactcctaaatttaaactgcccatacaaaaggaaacatgggaaacatggtggacagagtattggcaagccacctggattcctgagtgggagtttgttaatacccctcccttagtgaaattatggtaccagttagagaaagaacccatagtaggagcagaaaccttctatgtagatggggcagctaacagggagactaaattaggaaaagcaggatatgttactaatagaggaagacaaaaagttgtcaccctaactgacacaacaaatcagaagactgagtta
caagcaatttatctagctttgcaggattcgggattagaagtaaacatagtaacagactcacaatatgcattaggaatcattcaagcacaaccagatcaaagtgaatcagagttagtcaatcaaataatagagcagttaataaaaaaggaaaaggtctatctggcatgggtaccagcacacaaaggaattggaggaaatgaacaagtagataaattagtcagtgctggaatcaggaaagtactatttttagatggaatagataaggcccaagatgaacatgagaaatatcacagtaattggagagcaatggctagtgattttaacctgccacctgtagtagcaaaagaaatagtagccagctgtgataaatgtcagctaaaaggagaagccatgcatggacaagtagactgtagtccaggaatatggcaactagattgtacacatttagaaggaaaagttatcctggtagcagttcatgtagccagtggatatatagaagcagaagttattccagcagaaacagggcaggaaacagcatattttcttttaaaattagcaggaagatggccagtaaaaacaatacatactgacaatggcagcaatttcaccggtgctacggttagggccgcctgttggtgggcgggaatcaagcaggaatttggaattccctacaatccccaaagtcaaggagtagtagaatctatgaataaagaattaaagaaaattataggacaggtaagagatcaggctgaacatcttaagacagcagtacaaatggcagtattcatccacaattttaaaagaaaaggggggattggggggtacagtgcaggggaaagaatagtagacataatagcaacagacatacaaactaaagaattacaaaaacaaattacaaaaattcaaaattttcgggtttattacagggacagcagaaatccactttggaaaggaccagcaaagctcctctggaaaggtgaaggggcagtagtaatacaagataatagtgacataaaagtagtgccaagaagaaaagcaaagatcattagggattatggaaaacagatggcaggtgatgattgtgtggcaagtagacaggatgaggattagaacatggaaaagtttagtaaaacaccatatgtatgtttcagggaaagctaggggatggttttatagacatcactatgaaagccctcatccaagaataagttcagaagtacacatcccactaggggatgctagattggtaataacaacatattggggtctgcatacaggagaaagagactggcatttgggtcagggagtctccatagaatggaggaaaaagagatatagcacacaagtagaccctgaactagcagaccaactaattcatctgtattactttgactgtttttcagactctgctataagaaaggccttattaggacacatagttagccctaggtgtgaatatcaagcaggacataacaaggtaggatctctacaatacttggcactagcagcattaataacaccaaaaaagataaagccacctttgcctagtgttacgaaactgacagaggatagatggaacaagccccagaagaccaagggccacagagggagccacacaatgaatggacactagagcttttagaggagcttaagaatgaagctgttagacattttcctaggatttggctccatggcttagggcaacatatctatgaaacttatggggatacttgggcaggagtggaagccataataagaattctgcaacaactgctgtttatccattttcagaattgggtgtcgacatagcagaataggcgttactcgacagaggagagcaagaaatggagccagtagatcctagactagagccctggaagcatccaggaagtcagcctaaaactgcttgtaccaattgctattgtaaaaagtgttgctttcattgccaagtttgtttcataacaaaagccttaggcatctcctatggcaggaagaagcgg
agacagcgacgaagagctcatcagaacagtcagactcatcaagcttctctatcaaagcagtaagtagtacatgtaacgcaacctataccaatagtagcaatagtagcattagtagtagcaataataatagcaatagttgtgtggtccatagtaatcatagaatataggaaaatattaagacaaagaaaaatagacaggttaattgatagactaatagaaagagcagaagacagtggcaatgagagtgaaggagaaatatcagcacttgtggagatgggggtggagatggggcaccatgctccttgggatgttgatgatctgtagtgctacagaaaaattgtgggtcacagtctattatggggtacctgtgtggaaggaagcaaccaccactctattttgtgcatcagatgctaaagcatatgatacagaggtacataatgtttgggccacacatgcctgtgtacccacagaccccaacccacaagaagtagtattggtaaatgtgacagaaaattttaacatgtggaaaaatgacatggtagaacagatgcatgaggatataatcagtttatgggatcaaagcctaaagccatgtgtaaaattaaccccactctgtgttagtttaaagtgcactgatttgaagaatgatactaataccaatagtagtagcgggagaatgataatggagaaaggagagataaaaaactgctctttcaatatcagcacaagcataagaggtaaggtgcagaaagaatatgcatttttttataaacttgatataataccaatagataatgatactaccagctataagttgacaagttgtaacacctcagtcattacacaggcctgtccaaaggtatcctttgagccaattcccatacattattgtgccccggctggttttgcgattctaaaatgtaataataagacgttcaatggaacaggaccatgtacaaatgtcagcacagtacaatgtacacatggaattaggccagtagtatcaactcaactgctgttaaatggcagtctagcagaagaagaggtagtaattagatctgtcaatttcacggacaatgctaaaaccataatagtacagctgaacacatctgtagaaattaattgtacaagacccaacaacaatacaagaaaaagaatccgtatccagagaggaccagggagagcatttgttacaataggaaaaataggaaatatgagacaagcacattgtaacattagtagagcaaaatggaataacactttaaaacagatagctagcaaattaagagaacaatttggaaataataaaacaataatctttaagcaatcctcaggaggggacccagaaattgtaacgcacagttttaattgtggaggggaatttttctactgtaattcaacacaactgtttaatagtacttggtttaatagtacttggagtactgaagggtcaaataacactgaaggaagtgacacaatcaccctcccatgcagaataaaacaaattataaacatgtggcagaaagtaggaaaagcaatgtatgcccctcccatcagtggacaaattagatgttcatcaaatattacagggctgctattaacaagagatggtggtaatagcaacaatgagtccgagatcttcagacctggaggaggagatatgagggacaattggagaagtgaattatataaatataaagtagtaaaaattgaaccattaggagtagcacccaccaaggcaaagagaagagtggtgcagagagaaaaaagagcagtgggaataggagctttgttccttgggttcttgggagcagcaggaagcactatgggcgcagcctcaatgacgctgacggtacaggccagacaattattgtctggtatagtgcagcagcagaacaatttgctgagggctattgaggcgcaacagcatctgttgcaactcacagtctggggcatcaagcagctccaggcaagaatcctggctgtggaaagatacctaa
aggatcaacagctcctggggatttggggttgctctggaaaactcatttgcaccactgctgtgccttggaatgctagttggagtaataaatctctggaacagatttggaatcacacgacctggatggagtgggacagagaaattaacaattacacaagcttaatacactccttaattgaagaatcgcaaaaccagcaagaaaagaatgaacaagaattattggaattagataaatgggcaagtttgtggaattggtttaacataacaaattggctgtggtatataaaattattcataatgatagtaggaggcttggtaggtttaagaatagtttttgctgtactttctatagtgaatagagttaggcagggatattcaccattatcgtttcagacccacctcccaaccccgaggggacccgacaggcccgaaggaatagaagaagaaggtggagagagagacagagacagatccattcgattagtgaacggatccttggcacttatctgggacgatctgcggagcctgtgcctcttcagctaccaccgcttgagagacttactcttgattgtaacgaggattgtggaacttctgggacgcagggggtgggaagccctcaaatattggtggaatctcctacagtattggagtcaggaactaaagaatagtgctgttagcttgctcaatgccacagccatagcagtagctgaggggacagatagggttatagaagtagtacaaggagcttgtagagctattcgccacatacctagaagaataagacagggcttggaaaggattttgctataagatgggtggcaagtggtcaaaaagtagtgtgattggatggcctactgtaagggaaagaatgagacgagctgagccagcagcagatagggtgggagcagcatctcgagacctggaaaaacatggagcaatcacaagtagcaatacagcagctaccaatgctgcttgtgcctggctagaagcacaagaggaggaggaggtgggttttccagtcacacctcaggtacctttaagaccaatgacttacaaggcagctgtagatcttagccactttttaaaagaaaaggggggactggaagggctaattcactcccaaagaagacaagatatccttgatctgtggatctaccacacacaaggctacttccctgattagcagaactacacaccagggccaggggtcagatatccactgacctttggatggtgctacaagctagtaccagttgagccagataagatagaagaggccaataaaggagagaacaccagcttgttacaccctgtgagcctgcatgggatggatgacccggagagagaagtgttagagtggaggtttgacagccgcctagcatttcatcacgtggcccgagagctgcatccggagtacttcaagaactgctgacatcgagcttgctacaagggactttccgctggggactttccagggaggcgtggcctgggcgggactggggagtggcgagccctcagatcctgcatataagcagctgctttttgcctgtactgggtctctctggttagaccagatctgagcctgggagctctctggctaactagggaacccactgcttaagcctcaataaagcttgccttgagtgcttcaagtagtgtgtgcccgtctgttgtgtgactctggtaactagagatccctcagacccttttagtcagtgtggaaaatctctagca'
Given the things we have learned so far and the information below, complete the tasks in the code cell below. Note that the HIV map contains numbers (positions) for the first and last nucleotides of several genes (see again HIV Genome Landmarks).
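When turning map positions into slices, remember that Python counts from 0 and excludes the end index. Assuming the map numbers positions starting from 1 and includes both the first and the last nucleotide (a common convention for sequence annotations, but check the map itself), a gene running from position first to position last corresponds to the slice [first - 1:last]. The sequence and positions below are made up for illustration.

```python
# Made-up sequence and positions, for illustration only
sequence = 'aaacccgggttt'
first = 4     # 1-based position of the first nucleotide of the region
last = 6      # 1-based position of the last nucleotide of the region

# Subtract 1 from the start; the end needs no change because
# Python's end index is already exclusive
region = sequence[first - 1:last]

print(region)
print(len(region) == last - first + 1)
```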
In [ ]:
# Determine and print the length of the HIV genome
# Create variables for and print the sequences for the following HIV genes
#* gag
#* pol
#* vif
#* vpr
#* env
# generate the RNA sequence for each of the listed genes
# generate a sum for each of the nucleotides ('A','U','G','C')
# calculate the GC content for each of the genes
# percent GC = (sum of G + sum of C) / total number of nucleotides * 100
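A minimal sketch of the GC calculation on a short made-up RNA string (not one of the HIV genes). The 100.0 is written as a float so that the division is not truncated under Python 2.

```python
# Made-up RNA sequence, for illustration only
rna = 'AUGCGCAU'

g_count = rna.count('G')
c_count = rna.count('C')

# percent GC = (G + C) / total length * 100
percent_gc = 100.0 * (g_count + c_count) / len(rna)
print(percent_gc)
```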