Introduction to Python Programming

Please download this notebook here: http://bit.ly/intro-python-bio


R. Burke Squires

NIAID Bioinformatics and Computational Biosciences Branch (BCBB)


Python Programming for Scientists Series

  • Introduction to iPython Notebook
  • Introduction to Python Programming
  • Introduction to Biopython Programming
  • Introduction to Data Analysis with Python
  • Introduction to Data Visualization with Python

NIAID Bioinformatics and Computational Biosciences (BCBB)

Darrell Hurt, Ph.D., Acting Branch Chief

Major Areas of Research

  • Computational biology collaboration and training
  • Bioinformatics software development
  • Biocomputing resources

Personnell

  • Computational biologists (Consultants)
  • Scientific programmers
  • Program managers


Why Python?


Introduction to Python Programming

Outline

  • iPython & Integrated Development Environments
  • Printing and manipulating text
  • Reading and writing files
  • Lists and Loops
  • Writing your own functions
  • Conditional tests
  • Regular Expressions
  • Dictionaries
  • Files, programs and user input

Resources:


Goals

  • Introduce you to the basics of the python programming language
  • Introduce you to the iPython environment and integrated development environments (IDE)
  • Enable you to write or assemble scripts of your own or modify existing scripts for your own purposes
  • Prepare you for the next session “Introduction to Biopython Programming”

Where do we start?

Programming…is it Magic?

No…BUT it can seem like it at times! :-)

Working with text files


Executing Python Statements with the Python Interpreter

In the Terminal (or DOS prompt) type:

$ python

You should see something like this:

$ python
Python 2.7.2 (default, Oct 11 2012, 20:14:37)
[GCC 4.2.1 Compatible Apple Clang 4.0 (tags/Apple/clang-418.0.60)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> 

Enter

print(‘hello world’)

and hit return

Congratulation! You have just written your first python script! The famous "Hello, World" program.

To exit type:

exit()

Executing Python Scripts from a File

Open a text editor (TextEdit on Mac, NotePad in Windows) and type:

print(‘hello world’)

Save the file as "hello_world.py"

Note: Make sure there is not an extra extension on the end such as ".txt"

To run the script, in the Terminal (or DOS prompt) type:

python hello_world.py

and hit return.

You should see:

$ hello world

How Do You Write Longers Programs / Scripts?

Integrated Development Environment (IDE)

  • Spyder: Installed as part of Continuum Analytics Anaconda package

  • PyCharm: Great debugging tools, free Community edition; Now support IPython Notebook editing and running.


Advantages of IDEs

  • Intelligent Editor (PyCharm features):
    • Code completion, on-the-fly error highlighting, auto-fixes, etc.
    • Automated code refactorings and rich navigation capabilities
    • Integrated debugger and unit testing support
    • Native version control system (VCS) integrations
    • Customizable UI and key-bindings, with VIM emulation available

iPython

IPython provides architecture for interactive computing:

  • Powerful interactive shells (terminal and Qt-based).
  • A browser-based notebook with support for code, text, mathematical expressions, inline plots and other rich media.
  • Support for interactive data visualization and use of GUI toolkits.
  • Easy to use, high performance tools for parallel computing.

I have already installed iPython using the Anaconda installed (from Continuum Analytics). To launch iPython Notebook, in the Terminal (or DOS prompt) type:

$ ipython notebook

iPython – Home Screen

You should be seeing a screen like this:



Introduction to Python


Printing and manipulating text:

Insert a cell below, type the code below and run:

print(“Hello world”)

The whole thing is a statement; print is a function

Comments

# This is a comment

Anything after the pound sign (#) is ignored by the python interpreter; great for explaing what is going on in your code


In [ ]:
# this is a comment
print("Hello, world")

Storing String in Variables


In [ ]:
#In a new cell below type, and run:

# store a short DNA sequence in the variable my_dna
my_dna = "ATGCGTA"

# now print the DNA sequence
print(my_dna)

Concatenation

In a new cell below type, and run:

my_dna = "AATT" + "GGCC"

print(my_dna)

What gets printed out? Next type:

upstream = "AAA"

my_dna = upstream + "ATGC"

my_dna should now be "AAAATGC"


In [ ]:
print(my_dna)

Finding the Length of a String


In [ ]:
#In a new cell below type, and run:

#store the DNA sequence in a variable

my_dna = "ATGCGAGT"

# calculate the length of the sequence and store it in a variable

dna_length = len(my_dna)

# print a message telling us the DNA sequence lenth

print("The length of the DNA sequence is " + str(dna_length))

In [ ]:
my_dna = "ATGCGAGT"

dna_length = len(my_dna)

print("The length of the DNA sequence is " + str(dna_length))

#What is the length of the sequence?

Replacement


In [ ]:
#In a new cell below type, and run:
    
protein = "vlspadktnv"

# replace valine with tyrosine

print(protein.replace("v", "y"))

In [ ]:
protein.replace?

In [ ]:
# we can replace more than one character

print(protein.replace("vls", "ymt"))

# the original variable is not affected

print(protein)

#Is the correct sequence printed out?

Slicing

In a new cell below type, and run:

protein = "vlspadktnv"

# print positions three to five

print(protein[3:5])

# NOTE: positions start at zero, not one

print(protein[0:6])

# if we use a stop position beyond the end, it's the same as using the end

print(protein[0:60])

What output did you get?


In [ ]:
print(protein[3:])
print(protein[:-3])

Counting

In a new cell below type, and run:

protein = "vlspadktnv"

# count amino acid residues

valine_count = protein.count('v')

lsp_count = protein.count('lsp')

tryptophan_count = protein.count('w')

# now print the counts

print("valines: " + str(valine_count))

print("lsp: " + str(lsp_count))

print("tryptophans: " + str(tryptophan_count))

What counts did you get?

Excersize

Calculating AT content for the following DNA sequence:

ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT

Write a program that will print out the AT content of this DNA sequence. Hint: you can use normal mathematical symbols like add (+), subtract (-), multiply (*), divide (/) and parentheses to carry out calculations on numbers in Python.

Reminder: if you're using Python 2 rather than Python 3, include this line at the top of your program:

from __future__ import division

Load solution...


Reading and Writing Files

Reading a File

In a new cell below type, and run:

my_file = open("dna.txt")

file_contents = my_file.read()

print(file_contents)

my_file = open("dna.txt")

my_file_contents = my_file.read()

# remove the newline from the end of the file contents

my_dna = my_file_contents.rstrip("\n")

dna_length = len(my_dna)

print("sequence is " + my_dna + " and length is " + str(dna_length))

How long is the sequence?

Writing to a File


In [ ]:
#In a new cell below type, and run:

my_file = open("out.txt", "w")

my_file.write("Hello world")

# remember to close the file

my_file.close()

my_file = open("out.txt")

print(my_file)

Excersize

Writing a FASTA file

FASTA file format is a commonly-used DNA and protein sequence file format. A single sequence in FASTA format looks like this:

>sequence_name
ATCGACTGATCGATCGTACGAT

Write a program that will create a FASTA file for the following three sequences – make sure that all sequences are in upper case and only contain the bases A, T, G and C.

Sequence header DNA sequence

ABC123  ATCGTACGATCGATCGATCGCTAGACGTATCG
DEF456  actgatcgacgatcgatcgatcacgact
HIJ789  ACTGAC-ACTGT--ACTGTA----CATGTG

Lists and Loops

Lists

In a new cell below type, and run:

apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]

conserved_sites = [24, 56, 132]

print(apes[0])

first_site = conserved_sites[2]

chimp_index = apes.index("Pan troglodytes")

# chimp_index is now 1

last_ape = apes[-1]

print(last_ape)

Which ape is last?

An additional list:

nucleotides = ["T", ”C", ”A”, “G”]

print(nucleotides)

Slicing & Appending Lists

In a new cell below type, and run:

ranks = ["kingdom", "phylum", "class", "order", "family"]

lower_ranks = ranks[2:5]

# lower ranks are class, order and family

apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]

print("There are " + str(len(apes)) + " apes")

apes.append("Pan paniscus")

print("Now there are " + str(len(apes)) + " apes")

How many apres are there now?

Concatenating, Reversing & Sorting Lists


In [ ]:
#In a new cell below type, and run:

apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]

monkeys = ["Papio ursinus", "Macaca mulatta"]

primates = apes + monkeys

print(str(len(apes)) + " apes")

print(str(len(monkeys)) + " monkeys")

print(str(len(primates)) + " primates")

In [ ]:
ranks = ["kingdom", "phylum", "class", "order", "family"]

print("at the start : " + str(ranks))

rank_dup = ranks
rank_dup.reverse()
#rank2 = rank_dup.reverse()

print("after reversing : " + str(rank_dup))

ranks.sort()

print("after sorting : " + str(ranks))

#Is the order alphabetical?

In [ ]:
ranks.reverse?

Looping through Lists


In [ ]:
#In a new cell below type, and run:

apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]

for ape in apes:
print(ape + " is an ape")

for ape in apes:
    name_length = len(ape)
    first_letter = ape[0]
    print(ape + " is an ape. Its name starts with " + first_letter)
    print("Its name has " + str(name_length) + " letters")

#Does the output look correct?

Indentation

In a new cell below type, and run:

apes = ["Homo sapiens", "Pan troglodytes", "Gorilla gorilla"]

for ape in apes:
name_length = len(ape)
first_letter = ape[0]
print(ape + " is an ape. Its name starts with " + first_letter)
print("Its name has " + str(name_length) + " letters")

Indentation errors

Use tabs or spaces but not both

Using Strings as Lists, Splitting


In [ ]:
#In a new cell below type, and run:

name = "martin"

for character in name:
    print("one character is " + character)

names = "melanogaster,simulans,yakuba,ananassae"

species = names.split(",")

print(str(species))

#What prints out?

What prints out?

Looping through File, Line by Line

In a new cell below type, and run:

file = open("some_input.txt")

for line in file:
    # do something with the line
    print(line)

Looping with Ranges

Sometimes you want to loop over an interavely increasing set of nucleotides:

protein = "vlspadktnv”

vls vlsp vlspa…

In a new cell below type, and run:

stop_positions = [3,4,5,6,7,8,9,10]
for stop in stop_positions:
    substring = protein[0:stop]
    print(substring)

A better option is to use the Range command:

for number in range(3, 8):
    print(number)

for number in range(6):
    print(number)

In [ ]:
range(3, 100, 3)

Excersize

Processing DNA in a file

The file input.txt contains a number of DNA sequences, one per line. Each sequence starts with the same 14 base pair fragment – a sequencing adapter that should have been removed. Write a program that will (a) trim this adapter and write the cleaned sequences to a new file and (b) print the length of each sequence to the screen.

Writing Your Own Functions

Convert Code to Function

Previously we wrote code to compute the AT percentage of a sequence:

my_dna = "ACTGATCGATTACGTATAGTATTTGCTATCATACATATATATCGATGCGTTCAT”
length = len(my_dna)
a_count = my_dna.count('A’)
t_count = my_dna.count('T’)
at_content = (a_count + t_count) / length
print("AT content is " + str(at_content))

In [ ]:
#In a new cell below type, and run:
    
from __future__ import division  #if using python 2

def get_at_content(dna):
    length = len(dna)
    a_count = dna.count('A')
    t_count = dna.count('T')
    at_content = (a_count + t_count) / length
    return at_content

#==============================================
    
print("AT content is " + str(get_at_content("ATGACTGGACCA")))

Improving our Function

In a new cell below type (or udating the code above), and run:

def get_at_content(dna, sig_figs):
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    return round(at_content, sig_figs)


test_dna = "ATGCATGCAACTGTAGC"

print(get_at_content(test_dna, 1))
print(get_at_content(test_dna, 2))
print(get_at_content(test_dna, 3))

Functions do not always have to take parameters

Functions do not always have to return a value

def get_at_content():
    test_dna = "ATGCATGCAACTGTAGC"
    length = len(dna)
    a_count = dna.upper().count('A')
    t_count = dna.upper().count('T')
    at_content = (a_count + t_count) / length
    print(round(at_content, sig_figs))

What are the disadvantages of doing these things?

Defaults & Named Arguments

Function arguments can be named

Order then does not matter

get_at_content(dna="ATCGTGACTCG", sig_figs=2)

get_at_content(sig_figs=2, dna="ATCGTGACTCG")

Functions can have default values

Default values do not need to be provided unless a different value is desired

def get_at_content(dna, sig_figs=2):
    # function code here

get_at_content(my_dna)

Testing Functions

Functions should be testing with know good values

Functions should be tested with known bad values

assert get_at_content("ATGC") == 0.5

assert get_at_content("A") == 1

assert get_at_content("G") == 0

assert get_at_content("ATGC") == 0.5

assert get_at_content("AGG") == 0.33

assert get_at_content("AGG", 1) == 0.3

assert get_at_content("AGG", 5) == 0.33333

Conditional Tests

True, False, If…else…elif…then

Python has a built-in values “True”, “False”

Conditional statements evaluate to True or False

If statements use conditional statements

expression_level = 125
if expression_level > 100:
    print("gene is highly expressed")

In [ ]:
expression_level = 125
if not expression_level > 100:
    print("gene is highly expressed")
else:
    print("gene is lowly expressed")

In [ ]:
if
elif
else

value = True

Use conditional statements to seperate accessions...

In a new cell below type, and run:

file1 = open("one.txt", "w")

file2 = open("two.txt", "w")

file3 = open("three.txt", "w")

accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']

for accession in accs:
    if accession.startswith('a'):
        file1.write(accession + "\n")
    elif accession.startswith('b'):
        file2.write(accession + "\n")
    else:
        file3.write(accession + "\n")

While Loops

While loops loop until a condition is met

In a new cell below type, and run:

count = 0

while count<10:
    print(count)
    count = count + 1

Building Complex Conditions

Use “and”, “or”, “not and”, “not or” to build complex conditions

In a new cell below type, and run:

accs = ['ab56', 'bh84', 'hv76', 'ay93', 'ap97', 'bd72']

for accession in accs:
    if accession.startswith('a') and accession.endswith('3'):
        print(accession)

Regular Expressions

Patterns in Biology

There are a lot of patterns in biology:

  • protein domains
  • DNA transcription factor binding motifs
  • restriction enzyme cut sites
  • runs of mononucleotides

Pattern in strings inside text:

  • read mapping locations
  • geographical sample coordinates
  • taxonomic names
  • gene names
  • gene accession numbers
  • BLAST searches

Many problems that we want to solve that require more flexible patterns:

  • Given a DNA sequence, what's the length of the poly-A tail?
  • Given a gene accession name, extract the part between the third character and the underscore
  • Given a protein sequence, determine if it contains this highly-redundant domain motif

Regular expression module

To search for these patterns, we use the regular expression module “re”

In a new cell below type, and run:

import re

re.search(pattern, string)

dna = "ATCGCGAATTCAC"

if re.search(r"GAATTC", dna):
    print("restriction site found!")

if re.search(r"GC(A|T|G|C)GC", dna):
    print("restriction site found!")

if re.search(r"GC[ATGC]+GC", dna): \d
    print("restriction site found!")

Get String and Position of Match

Get the string that matched

In a new cell below type, and run:

dna = "ATGACGTACGTACGACTG"

# store the match object in the variable m

m = re.search(r"GA([ATGC]{3})AC([ATGC]{2})AC", dna)

print("entire match: " + m.group())

print("first bit: " + m.group(1))

print("second bit: " + m.group(2))


Get the positions of the match

In a new cell below type, and run:

print("start: " + str(m.start()))

print("end: " + str(m.end()))

In [ ]:
import re
m = re.search("[ATGC]", 'ATCG')
re?

Dictionaries

Storing Paired Data

In a new cell below type, and run:

enzymes = {}
enzymes['EcoRI'] = r'GAATTC'
enzymes['AvaII] = r'GG(A|T)CC'
enzymes['BisI'] = r'GC[ATGC]GC'

# remove the EcoRI enzyme from the dict
enzymes.pop('EcoRI')

dna = "AATGATCGATCGTACGCTGA"
counts = {}
for base1 in ['A', 'T', 'G', 'C']:
    for base2 in ['A', 'T', 'G', 'C']:
        for base3 in ['A', 'T', 'G', 'C']:
            trinucleotide = base1 + base2 + base3
            count = dna.count(trinucleotide)
            counts[trinucleotide] = count
print(counts)

In a new cell below type, and run:

count = {'ACC': 0, 'ATG': 1, 'AAG': 0, 'AAA': 0, 'ATC': 2, 'AAC': 0, 'ATA': 0, 'AGG': 0, 'CCT': 0, 'CTC': 0, 'AGC': 0, 'ACA': 0, 'AGA': 0, 'CAT': 0, 'AAT': 1, 'ATT': 0, 'CTG': 1, 'CTA': 0, 'ACT': 0, 'CAC': 0, 'ACG': 1, 'CAA': 0, 'AGT': 0, 'CAG': 0, 'CCG': 0, 'CCC': 0, 'CTT': 0, 'TAT': 0, 'GGT': 0, 'TGT': 0, 'CGA': 1, 'CCA': 0, 'TCT': 0, 'GAT': 2, 'CGG': 0, 'TTT': 0, 'TGC': 0, 'GGG': 0, 'TAG': 0, 'GGA': 0, 'TAA': 0, 'GGC': 0, 'TAC': 1, 'TTC': 0, 'TCG': 2, 'TTA': 0, 'TTG': 0, 'TCC': 0, 'GAA': 0, 'TGG': 0, 'GCA': 0, 'GTA': 1, 'GCC': 0, 'GTC': 0, 'GCG': 0, 'GTG': 0, 'GAG': 0, 'GTT': 0, 'GCT': 1, 'TGA': 2, 'GAC': 0, 'CGT': 1, 'TCA': 0, 'CGC': 1}

print(counts['TGA'])

if 'AAA' in counts:
    print(counts('AAA'))

for trinucleotide in counts.keys():
    if counts.get(trinucleotide) == 2:
        print(trinucleotide)

for trinucleotide in sorted(counts.keys()):
    if counts.get(trinucleotide) == 2:
        print(trinucleotide)

for trinucleotide, count in counts.items():
    if count == 2:
        print(trinucleotide)

Files, Programs, & User Input

Basic File Manipulation

Rename a file

In a new cell below type, and run:

import os

os.rename("old.txt", "new.txt")

Rename a folder

os.rename("/home/martin/old_folder", "/home/martin/new_folder")

Check to see if a file exists

if os.path.exists("/home/martin/email.txt"):
    print("You have mail!")

Remove a file

os.remove("/home/martin/unwanted_file.txt")

Remove empty folder

os.rmdir("/home/martin/emtpy")


To delete a folder and all the files in it, use shutil.rmtree

from shutil import rmtree

shutil.rmtree("home/martin/full")

Running External Programs

Run an external program

In a new cell below type, and run:

import subprocess

subprocess.call("/bin/date")


Run an external program with options

subprocess.call("/bin/date +%B", shell=True)


Saving program output

current_month = subprocess.check_output("/bin/date +%B", shell=True)

Now using IPython magic:


In [ ]:
!/bin/date

User Input

Interactive user input

In a new cell below type, and run:

    accession = input("Enter the accession name")

    # do something with the accession variable

Capture command line arguments

import sys

print(sys.argv)

# python myprogram.py one two three

# sys.argv[1] return script name

To conclude…

Goals

  • Introduce you to the basics of the python programming language

  • Introduced you to the iPython environment

  • Prepare you for the next session “Introduction to Biopython for Scientists”

  • Enable you to write or assemble scripts of your own or modify existing scripts for your own purposes


Resources


Q & A

Collaborations welcome

R. Burke Squires - richard.squires@nih.gov

or

ScienceApps@niaid.nih.gov


In [ ]: