Please download this notebook here: http://bit.ly/intro-python-bio
Darrell Hurt, Ph.D., Acting Branch Chief
Major Areas of Research
Personnell
In the Terminal (or DOS prompt) type:
$ python
You should see something like this:
$ python
Python 2.7.2 (default, Oct 11 2012, 20:14:37)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
Enter
print(‘hello world’)
and hit return
Congratulation! You have just written your first python script! The famous "Hello, World" program.
To exit type:
exit()
Open a text editor (TextEdit on Mac, NotePad in Windows) and type:
print(‘hello world’)
Save the file as "hello_world.py"
Note: Make sure there is not an extra extension on the end such as ".txt"
To run the script, in the Terminal (or DOS prompt) type:
python hello_world.py
and hit return.
You should see:
$ hello world
Advantages of IDEs
IPython provides architecture for interactive computing:
I have already installed iPython using the Anaconda installed (from Continuum Analytics). To launch iPython Notebook, in the Terminal (or DOS prompt) type:
$ ipython notebook
You should be seeing a screen like this:
Insert a cell below, type the code below and run:
print(“Hello world”)
The whole thing is a statement; print is a function
Comments
# This is a comment
Anything after the pound sign (#) is ignored by the python interpreter; great for explaing what is going on in your code
In [ ]:
# this is a comment
print("Hello, world")
In [ ]:
#In a new cell below type, and run:
# store a short DNA sequence in the variable my_dna
my_dna = "ATGCGTA"
# now print the DNA sequence
print(my_dna)
In a new cell below type, and run:
my_dna = "AATT" + "GGCC"
print(my_dna)
What gets printed out? Next type:
upstream = "AAA"
my_dna = upstream + "ATGC"
my_dna should now be "AAAATGC"
In [ ]:
print(my_dna)
In [ ]:
#In a new cell below type, and run:
#store the DNA sequence in a variable
my_dna = "ATGCGAGT"
# calculate the length of the sequence and store it in a variable
dna_length = len(my_dna)
# print a message telling us the DNA sequence lenth
print("The length of the DNA sequence is " + str(dna_length))
In [ ]:
my_dna = "ATGCGAGT"
dna_length = len(my_dna)
print("The length of the DNA sequence is " + str(dna_length))
#What is the length of the sequence?
In [ ]:
#In a new cell below type, and run:
protein = "vlspadktnv"
# replace valine with tyrosine
print(protein.replace("v", "y"))
In [ ]:
protein.replace?
In [ ]:
# we can replace more than one character
print(protein.replace("vls", "ymt"))
# the original variable is not affected
print(protein)
#Is the correct sequence printed out?
In a new cell below type, and run:
protein = "vlspadktnv"
# print positions three to five
print(protein[3:5])
# NOTE: positions start at zero, not one
print(protein[0:6])
# if we use a stop position beyond the end, it's the same as using the end
print(protein[0:60])
What output did you get?
In [ ]:
print(protein[3:])
print(protein[:-3])
In a new cell below type, and run:
protein = "vlspadktnv"
# count amino acid residues
valine_count = protein.count('v')
lsp_count = protein.count('lsp')
tryptophan_count = protein.count('w')
# now print the counts
print("valines: " + str(valine_count))
print("lsp: " + str(lsp_count))
print("tryptophans: " + str(tryptophan_count))
What counts did you get?
Calculating AT content for the following DNA sequence:
ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT
Write a program that will print out the AT content of this DNA sequence. Hint: you can use normal mathematical symbols like add (+), subtract (-), multiply (*), divide (/) and parentheses to carry out calculations on numbers in Python.
Reminder: if you're using Python 2 rather than Python 3, include this line at the top of your program:
from __future__ import division
Load solution...
In a new cell below type, and run:
my_file = open("dna.txt")
file_contents = my_file.read()
print(file_contents)
my_file = open("dna.txt")
my_file_contents = my_file.read()
# remove the newline from the end of the file contents
my_dna = my_file_contents.rstrip("\n")
dna_length = len(my_dna)
print("sequence is " + my_dna + " and length is " + str(dna_length))
How long is the sequence?
In [ ]:
#In a new cell below type, and run:
my_file = open("out.txt", "w")
my_file.write("Hello world")
# remember to close the file
my_file.close()
my_file = open("out.txt")
print(my_file)
Writing a FASTA file
FASTA file format is a commonly-used DNA and protein sequence file format. A single sequence in FASTA format looks like this:
>sequence_name
ATCGACTGATCGATCGTACGAT
Write a program that will create a FASTA file for the following three sequences – make sure that all sequences are in upper case and only contain the bases A, T, G and C.
Sequence header DNA sequence
ABC123 ATCGTACGATCGATCGATCGCTAGACGTATCG
DEF456 actgatcgacgatcgatcgatcacgact
HIJ789 ACTGAC-ACTGT--ACTGTA----CATGTG
In a new cell below type, and run:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
conserved_sites = [24, 56, 132]
print(apes[0])
first_site = conserved_sites[2]
chimp_index = apes.index("Pan troglodytes")
# chimp_index is now 1
last_ape = apes[-1]
print(last_ape)
Which ape is last?
An additional list:
nucleotides = ["T", ”C", ”A”, “G”]
print(nucleotides)
In a new cell below type, and run:
ranks = ["kingdom", "phylum", "class", "order", "family"]
lower_ranks = ranks[2:5]
# lower ranks are class, order and family
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
print("There are " + str(len(apes)) + " apes")
apes.append("Pan paniscus")
print("Now there are " + str(len(apes)) + " apes")
How many apres are there now?
In [ ]:
#In a new cell below type, and run:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
monkeys = ["Papio ursinus", "Macaca mulatta"]
primates = apes + monkeys
print(str(len(apes)) + " apes")
print(str(len(monkeys)) + " monkeys")
print(str(len(primates)) + " primates")
In [ ]:
ranks = ["kingdom", "phylum", "class", "order", "family"]
print("at the start : " + str(ranks))
rank_dup = ranks
rank_dup.reverse()
#rank2 = rank_dup.reverse()
print("after reversing : " + str(rank_dup))
ranks.sort()
print("after sorting : " + str(ranks))
#Is the order alphabetical?
In [ ]:
ranks.reverse?
In [ ]:
#In a new cell below type, and run:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
for ape in apes:
print(ape + " is an ape")
for ape in apes:
name_length = len(ape)
first_letter = ape[0]
print(ape + " is an ape. Its name starts with " + first_letter)
print("Its name has " + str(name_length) + " letters")
#Does the output look correct?
In a new cell below type, and run:
apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]
for ape in apes:
name_length = len(ape)
first_letter = ape[0]
print(ape + " is an ape. Its name starts with " + first_letter)
print("Its name has " + str(name_length) + " letters")
Indentation errors
Use tabs or spaces but not both
In [ ]:
#In a new cell below type, and run:
name = "martin"
for character in name:
print("one character is " + character)
names = "melanogaster,simulans,yakuba,ananassae"
species = names.split(",")
print(str(species))
#What prints out?
What prints out?
In a new cell below type, and run:
file = open("some_input.txt")
for line in file:
# do something with the line
print(line)
Sometimes you want to loop over an interavely increasing set of nucleotides:
protein = "vlspadktnv”
vls vlsp vlspa…
In a new cell below type, and run:
stop_positions = [3,4,5,6,7,8,9,10]
for stop in stop_positions:
substring = protein[0:stop]
print(substring)
A better option is to use the Range command:
for number in range(3, 8):
print(number)
for number in range(6):
print(number)
In [ ]:
range(3, 100, 3)
Processing DNA in a file
The file input.txt contains a number of DNA sequences, one per line. Each sequence starts with the same 14 base pair fragment – a sequencing adapter that should have been removed. Write a program that will (a) trim this adapter and write the cleaned sequences to a new file and (b) print the length of each sequence to the screen.
Previously we wrote code to compute the AT percentage of a sequence:
my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT”
length = len(my_dna)
a_count = my_dna.count('A’)
t_count = my_dna.count('T’)
at_content = (a_count + t_count) / length
print("AT content is " + str(at_content))
In [ ]:
#In a new cell below type, and run:
from __future__ import division #if using python 2
def get_at_content(dna):
length = len(dna)
a_count = dna.count('A')
t_count = dna.count('T')
at_content = (a_count + t_count) / length
return at_content
#==============================================
print("AT content is " + str(get_at_content("ATGACTGGACCA")))
In a new cell below type (or udating the code above), and run:
def get_at_content(dna, sig_figs):
length = len(dna)
a_count = dna.upper().count('A')
t_count = dna.upper().count('T')
at_content = (a_count + t_count) / length
return round(at_content, sig_figs)
test_dna = "ATGCATGCAACTGTAGC"
print(get_at_content(test_dna, 1))
print(get_at_content(test_dna, 2))
print(get_at_content(test_dna, 3))
Functions do not always have to take parameters
Functions do not always have to return a value
def get_at_content():
test_dna = "ATGCATGCAACTGTAGC"
length = len(dna)
a_count = dna.upper().count('A')
t_count = dna.upper().count('T')
at_content = (a_count + t_count) / length
print(round(at_content, sig_figs))
What are the disadvantages of doing these things?
Function arguments can be named
Order then does not matter
get_at_content(dna="ATCGTGACTCG", sig_figs=2)
get_at_content(sig_figs=2, dna="ATCGTGACTCG")
Functions can have default values
Default values do not need to be provided unless a different value is desired
def get_at_content(dna, sig_figs=2):
# function code here
get_at_content(my_dna)
Functions should be testing with know good values
Functions should be tested with known bad values
assert get_at_content("ATGC") == 0.5
assert get_at_content("A") == 1
assert get_at_content("G") == 0
assert get_at_content("ATGC") == 0.5
assert get_at_content("AGG") == 0.33
assert get_at_content("AGG", 1) == 0.3
assert get_at_content("AGG", 5) == 0.33333
Python has a built-in values “True”, “False”
Conditional statements evaluate to True or False
If statements use conditional statements
expression_level = 125
if expression_level > 100:
print("gene is highly expressed")
In [ ]:
expression_level = 125
if not expression_level > 100:
print("gene is highly expressed")
else:
print("gene is lowly expressed")
In [ ]:
if
elif
else
value = True
Use conditional statements to seperate accessions...
In a new cell below type, and run:
file1 = open("one.txt", "w")
file2 = open("two.txt", "w")
file3 = open("three.txt", "w")
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
for accession in accs:
if accession.startswith('a'):
file1.write(accession + "\n")
elif accession.startswith('b'):
file2.write(accession + "\n")
else:
file3.write(accession + "\n")
While loops loop until a condition is met
In a new cell below type, and run:
count = 0
while count<10:
print(count)
count = count + 1
Use “and”, “or”, “not and”, “not or” to build complex conditions
In a new cell below type, and run:
accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']
for accession in accs:
if accession.startswith('a') and accession.endswith('3'):
print(accession)
There are a lot of patterns in biology:
Pattern in strings inside text:
Many problems that we want to solve that require more flexible patterns:
To search for these patterns, we use the regular expression module “re”
In a new cell below type, and run:
import re
re.search(pattern, string)
dna = "ATCGCGAATTCAC"
if re.search(r"GAATTC", dna):
print("restriction site found!")
if re.search(r"GC(A|T|G|C)GC", dna):
print("restriction site found!")
if re.search(r"GC[ATGC]+GC", dna): \d
print("restriction site found!")
Get the string that matched
In a new cell below type, and run:
dna = "ATGACGTACGTACGACTG"
# store the match object in the variable m
m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)
print("entire match: " + m.group())
print("first bit: " + m.group(1))
print("second bit: " + m.group(2))
Get the positions of the match
In a new cell below type, and run:
print("start: " + str(m.start()))
print("end: " + str(m.end()))
In [ ]:
import re
m = re.search("[ATGC]", 'ATCG')
re?
In a new cell below type, and run:
enzymes = {}
enzymes['EcoRI'] = r'GAATTC'
enzymes['AvaII] = r'GG(A|T)CC'
enzymes['BisI'] = r'GC[ATGC]GC'
# remove the EcoRI enzyme from the dict
enzymes.pop('EcoRI')
dna = "AATGATCGATCGTACGCTGA"
counts = {}
for base1 in ['A', 'T', 'G', 'C']:
for base2 in ['A', 'T', 'G', 'C']:
for base3 in ['A', 'T', 'G', 'C']:
trinucleotide = base1 + base2 + base3
count = dna.count(trinucleotide)
counts[trinucleotide] = count
print(counts)
In a new cell below type, and run:
count = {'ACC': 0, 'ATG': 1, 'AAG': 0, 'AAA': 0, 'ATC': 2, 'AAC': 0, 'ATA': 0, 'AGG': 0, 'CCT': 0, 'CTC': 0, 'AGC': 0, 'ACA': 0, 'AGA': 0, 'CAT': 0, 'AAT': 1, 'ATT': 0, 'CTG': 1, 'CTA': 0, 'ACT': 0, 'CAC': 0, 'ACG': 1, 'CAA': 0, 'AGT': 0, 'CAG': 0, 'CCG': 0, 'CCC': 0, 'CTT': 0, 'TAT': 0, 'GGT': 0, 'TGT': 0, 'CGA': 1, 'CCA': 0, 'TCT': 0, 'GAT': 2, 'CGG': 0, 'TTT': 0, 'TGC': 0, 'GGG': 0, 'TAG': 0, 'GGA': 0, 'TAA': 0, 'GGC': 0, 'TAC': 1, 'TTC': 0, 'TCG': 2, 'TTA': 0, 'TTG': 0, 'TCC': 0, 'GAA': 0, 'TGG': 0, 'GCA': 0, 'GTA': 1, 'GCC': 0, 'GTC': 0, 'GCG': 0, 'GTG': 0, 'GAG': 0, 'GTT': 0, 'GCT': 1, 'TGA': 2, 'GAC': 0, 'CGT': 1, 'TCA': 0, 'CGC': 1}
print(counts['TGA'])
if 'AAA' in counts:
print(counts('AAA'))
for trinucleotide in counts.keys():
if counts.get(trinucleotide) == 2:
print(trinucleotide)
for trinucleotide in sorted(counts.keys()):
if counts.get(trinucleotide) == 2:
print(trinucleotide)
for trinucleotide, count in counts.items():
if count == 2:
print(trinucleotide)
Rename a file
In a new cell below type, and run:
import os
os.rename("old.txt", "new.txt")
Rename a folder
os.rename("/home/martin/old_folder", "/home/martin/new_folder")
Check to see if a file exists
if os.path.exists("/home/martin/email.txt"):
print("You have mail!")
Remove a file
os.remove("/home/martin/unwanted_file.txt")
Remove empty folder
os.rmdir("/home/martin/emtpy")
To delete a folder and all the files in it, use shutil.rmtree
from shutil import rmtree
shutil.rmtree("home/martin/full")
Run an external program
In a new cell below type, and run:
import subprocess
subprocess.call("/bin/date")
Run an external program with options
subprocess.call("/bin/date +%B", shell=True)
Saving program output
current_month = subprocess.check_output("/bin/date +%B", shell=True)
Now using IPython magic:
In [ ]:
!/bin/date
Interactive user input
In a new cell below type, and run:
accession = input("Enter the accession name")
# do something with the accession variable
Capture command line arguments
import sys
print(sys.argv)
# python myprogram.py one two three
# sys.argv[1] return script name
Introduce you to the basics of the python programming language
Introduced you to the iPython environment
Prepare you for the next session “Introduction to Biopython for Scientists”
Enable you to write or assemble scripts of your own or modify existing scripts for your own purposes
Websites:
http://pythonforbiologists.com http://www.pythonforbeginners.com http://www.pythontutor.com/visualize.html#mode=display
Free eBook in HTML / PDF
http://pythonforbiologists.com http://greenteapress.com/thinkpython/ http://openbookproject.net/books/bpp4awd/index.html
Python Regular Expressions (pattern matching)
Python Style Guide
Collaborations welcome
R. Burke Squires - richard.squires at nih.gov
or for those on the NIH campus and surronding areas
ScienceApps at niaid.nih.gov
In [ ]: