Find out EColi's public genome, extract the intergenic regions, perform local alignment of promoter regions through BLAST and try to find potential regulatory candidate binding sites. Select automatically candidate pairs of genes that share one or several binding sites. Find homologs of the genes into two other organisms that are philogenetic relatives and scan for the same binding sites.
Alternatively translate the sequences and scan for proteins having similar sequence in the other species.
Further study:
Observation:
In [ ]:
Helpful notes:
One of the most important steps in sequencing is external running sequencing programs ... in a sequence. This is also known as a command pipeline. There are various pipelines created for sequencing and some are more general while others very specific. In this course we will learn how to run two BLAST commands in sequence using Python.
Installing BLAST on your local machine or server is up to you and falls out of the scope of this class. In Ubuntu Linux I download and install BLAST with:
sudo apt-get install ncbi-blast+
The following commands will create a database from a fasta file describing a complete set of sequences, and will query an input sequence against this database:
makeblastdb -in sample.fa -dbtype 'prot' -out NewDb blastp -db NewDb -query somesequence.fasta
http://www.rocksclusters.org/roll-documentation/bio/5.4/blast_usage.html
In [ ]:
from subprocess import Popen, PIPE
class Command( object ):
def __init__( self, text ):
self.args = text.split()
def execute( self ):
p = Popen(self.args, shell = False, stdin=None, stdout=PIPE, stderr=None)
#result = p.communicate()[0]
#print result
status = p.wait()
self.stdin, self.stdout, self.stderr = p.stdin, p.stdout, p.stderr
t1 = "makeblastdb -in data/Homo_sapiens.GRCh38.pep.all.fa -dbtype prot -out data/hudb"
t2 = "blastp -db data/hudb -query data/peptide.fasta"
pipeline = [Command(t1), Command(t2)]
for c in pipeline:
print "Running command:", " ".join(c.args)
c.execute()
print "Input:", c.stdin
print "Output:", c.stdout
print "Error:", c.stderr
Next we are doing something similar, but we use the biopython module. We can perform a local search using Biopython, but instead we will perform blast alignment through a remote API call.
http://biopython.org/DIST/docs/api/Bio.Blast.NCBIWWW-module.html
http://www.ebi.ac.uk/Tools/sss/ncbiblast/help/index-protein.html
http://nbviewer.ipython.org/url/www.csc.kth.se/~ksahlin/teaching/dd2404/lectures/lect4.ipynb
In [3]:
from Bio.Blast import NCBIWWW
from Bio import SeqIO
record = SeqIO.read("data/peptide.fasta", format="fasta")
result_handle = NCBIWWW.qblast("blastp", "uniprotkb", record.seq)
# # save the result
# save_file = open("/tmp/my_blast.xml", "w")
# save_file.write(result_handle.read())
# save_file.close()
# result_handle.close()
# result_handle = open("/tmp/my_blast.xml")
from Bio.Blast import NCBIXML
blast_record = NCBIXML.read(result_handle)
#dir(blast_record)
E_VALUE_THRESH = 0.04
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.expect < E_VALUE_THRESH:
print '****Alignment****'
print 'sequence:', alignment.title
print 'length:', alignment.length
print 'e value:', hsp.expect
print hsp.query[0:75] + '...'
print hsp.match[0:75] + '...'
print hsp.sbjct[0:75] + '...'
Biopython has functionality related to sequencing, alignments, mutiple format file handling and population genetics.
http://biopython.org/DIST/docs/tutorial/Tutorial.html
Later:
Read a genebank record and analize it. Understand the structure of a few objects:
Seq:
SeqRecord
http://biopython.org/wiki/SeqRecord
SeqFeature
http://biopython.org/DIST/docs/api/Bio.SeqFeature.SeqFeature-class.html
Next convert it to fasta and have a look at the SeqIO submodule:
In [ ]:
from Bio.Seq import Seq
my_seq = Seq("AGTACACTGGT")
from Bio.Alphabet import generic_dna
my_dna = Seq("AGTACACTGGT", generic_dna)
my_dna.find("ACT")
my_dna.count("A")
my_dna.complement()
my_dna.reverse_complement()
my_rna = my_dna.transcribe()
my_rna.back_transcribe()
cdna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", generic_dna)
cdna.translate()
coding_dna.translate(to_stop=True)
from Bio import SeqIO
# for index, record in enumerate(SeqIO.parse(open("data/ls_orchid.gbk"), "genbank")) :
# print "index %i, ID = %s, length %i, with %i features" \
# % (index, record.id, len(record.seq), len(record.features))
record = SeqIO.parse(open("data/ls_orchid.gbk"), "genbank")[0]
dir(record)
print record.seq
print record.seq.__class__
print record.id, record.name, record.description
#sequencing project info
print record.dbxrefs
print record.dbxrefs.__class__
print record.annotations
print record.annotations.__class__
print record.annotations["source"]
print record.annotations["references"].__class__
print len(record.annotations["references"])
for ref in record.annotations["references"] : print ref.authors
record.reverse_complement(id="test")
#SeqFeatures
print record.features.__class__
print len(record.features)
#Slicing
sub_record = record[4300:4800]
print len(sub_record.features)
sub_record.description = "subset of ..."
print(sub_record.format("genbank"))
##Format convertion
print record.format("fasta")
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
IUPAC.protein),
id="YP_025292.1", name="HokC",
description="toxic membrane protein, small")
print record
SeqIO.write(record, 'customrecord.fasta', "fasta")
Extra task: Convert FASTQ to FASTA
zcat input_file.fastq.gz | awk 'NR%4==1{printf ">%s\n", substr($0,2)}NR%4==2{print}' > output_file.fa
It the Biopython tutorial there is a nice introduction to the method of coalescent simulation. However chances are that you will want a particular type of study that will not be found in Biopython. You could for example manipulate your data in Biopython up to a point and then use either an external program or a different python module. In this next example, we will perform coalescent simulation using a software called FFPopSim, for which a Python binding exists, but it may be too cumbersome to install for the purpose of class.
http://webdav.tuebingen.mpg.de/interference/coalescence.html
http://webdav.tuebingen.mpg.de/ffpopsim/pkg/doc/python/html/Requirements.html#building-requirements
simuPOP
http://simupop.sourceforge.net/Main/HomePage
egglib-py
In [ ]: