Sequencing

Task description

Find out EColi's public genome, extract the intergenic regions, perform local alignment of promoter regions through BLAST and try to find potential regulatory candidate binding sites. Select automatically candidate pairs of genes that share one or several binding sites. Find homologs of the genes into two other organisms that are philogenetic relatives and scan for the same binding sites.

Alternatively translate the sequences and scan for proteins having similar sequence in the other species.

Further study:

Compute the philogenetic distance between the three bacteria, for example by coalescent simulation.
Use the http://etetoolkit.org/ for a nice display.
Find a way to deal with the operons.
Try to find binding sites in a more accurate manner, possibly by confirming them with tools like STEME, TAMO or MOODS.
Try to make this more realistic if you have the proper background into sequencing and regulomics, use positional weight matrices, compute philogenetic distances etc.
If you think you found something then let me know, maybe we publish it!
Use another organism if you hate EColi!

Observation:

The details below will not help you directly but it may inspire you.
Extracting promoter regions is one of the scripts that is part of the BioPython cookbooks and it originates from a blogpost: http://bytesizebio.net/2010/02/11/short-bioinformatic-hacks-reading-between-the-genes/



In [ ]:

Helpful notes:

Making a command pipeline
Manipulate sequences in BioPython
Phylogeny and population genetics

Making a command pipeline

One of the most important steps in sequencing is external running sequencing programs ... in a sequence. This is also known as a command pipeline. There are various pipelines created for sequencing and some are more general while others very specific. In this course we will learn how to run two BLAST commands in sequence using Python.

Installing BLAST on your local machine or server is up to you and falls out of the scope of this class. In Ubuntu Linux I download and install BLAST with:

sudo apt-get install ncbi-blast+

The following commands will create a database from a fasta file describing a complete set of sequences, and will query an input sequence against this database:

makeblastdb -in sample.fa -dbtype 'prot' -out NewDb blastp -db NewDb -query somesequence.fasta

http://www.rocksclusters.org/roll-documentation/bio/5.4/blast_usage.html



In [ ]:

    
from subprocess import Popen, PIPE

class Command( object ):
    def __init__( self, text ):
        self.args = text.split()
    def execute( self ):
        p = Popen(self.args, shell = False, stdin=None, stdout=PIPE, stderr=None)
        #result = p.communicate()[0]
        #print result
        status = p.wait()
        self.stdin, self.stdout, self.stderr = p.stdin, p.stdout, p.stderr
        
t1 = "makeblastdb -in data/Homo_sapiens.GRCh38.pep.all.fa -dbtype prot -out data/hudb"
t2 = "blastp -db data/hudb -query data/peptide.fasta"

pipeline = [Command(t1), Command(t2)]

for c in pipeline:
    print "Running command:", " ".join(c.args)
    c.execute()
    print "Input:", c.stdin
    print "Output:", c.stdout
    print "Error:", c.stderr









    



Running command: makeblastdb -in data/Homo_sapiens.GRCh38.pep.all.fa -dbtype prot -out data/hudb
Input:

Next we are doing something similar, but we use the biopython module. We can perform a local search using Biopython, but instead we will perform blast alignment through a remote API call.

http://biopython.org/DIST/docs/api/Bio.Blast.NCBIWWW-module.html

http://www.ebi.ac.uk/Tools/sss/ncbiblast/help/index-protein.html

http://nbviewer.ipython.org/url/www.csc.kth.se/~ksahlin/teaching/dd2404/lectures/lect4.ipynb



In [3]:

    
from Bio.Blast import NCBIWWW
from Bio import SeqIO
record = SeqIO.read("data/peptide.fasta", format="fasta")
result_handle = NCBIWWW.qblast("blastp", "uniprotkb", record.seq)
# # save the result
# save_file = open("/tmp/my_blast.xml", "w")
# save_file.write(result_handle.read())
# save_file.close()
# result_handle.close()
# result_handle = open("/tmp/my_blast.xml")
from Bio.Blast import NCBIXML
blast_record = NCBIXML.read(result_handle)
#dir(blast_record)
E_VALUE_THRESH = 0.04
for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < E_VALUE_THRESH:
            print '****Alignment****'
            print 'sequence:', alignment.title
            print 'length:', alignment.length
            print 'e value:', hsp.expect
            print hsp.query[0:75] + '...'
            print hsp.match[0:75] + '...'
            print hsp.sbjct[0:75] + '...'

Manipulate sequences in Bioython

Biopython has functionality related to sequencing, alignments, mutiple format file handling and population genetics.

http://biopython.org/DIST/docs/tutorial/Tutorial.html

Later:

Read a genebank record and analize it. Understand the structure of a few objects:

Seq:

http://biopython.org/wiki/Seq

SeqRecord

http://biopython.org/wiki/SeqRecord

SeqFeature

http://biopython.org/DIST/docs/api/Bio.SeqFeature.SeqFeature-class.html

Next convert it to fasta and have a look at the SeqIO submodule:

http://biopython.org/wiki/SeqIO

http://nbviewer.ipython.org/github/tiagoantao/biopython-notebook/blob/master/notebooks/04%20-%20Sequence%20Annotation%20objects.ipynb



In [ ]:

    
from Bio.Seq import Seq
my_seq = Seq("AGTACACTGGT")
from Bio.Alphabet import generic_dna
my_dna = Seq("AGTACACTGGT", generic_dna)
my_dna.find("ACT")
my_dna.count("A")
my_dna.complement()
my_dna.reverse_complement()
my_rna = my_dna.transcribe()
my_rna.back_transcribe()

cdna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", generic_dna)
cdna.translate()
coding_dna.translate(to_stop=True)

from Bio import SeqIO
# for index, record in enumerate(SeqIO.parse(open("data/ls_orchid.gbk"), "genbank")) :
#     print "index %i, ID = %s, length %i, with %i features" \
#         % (index, record.id, len(record.seq), len(record.features))

record = SeqIO.parse(open("data/ls_orchid.gbk"), "genbank")[0]
dir(record)
print record.seq
print record.seq.__class__
print record.id, record.name, record.description

#sequencing project info
print record.dbxrefs
print record.dbxrefs.__class__
print record.annotations
print record.annotations.__class__
print record.annotations["source"]
print record.annotations["references"].__class__
print len(record.annotations["references"])
for ref in record.annotations["references"] : print ref.authors
record.reverse_complement(id="test")

#SeqFeatures
print record.features.__class__
print len(record.features)

#Slicing
sub_record = record[4300:4800]
print len(sub_record.features)
sub_record.description = "subset of ..."
print(sub_record.format("genbank"))

##Format convertion
print record.format("fasta")

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
                   IUPAC.protein),
                   id="YP_025292.1", name="HokC",
                   description="toxic membrane protein, small")
print record
SeqIO.write(record, 'customrecord.fasta', "fasta")

Extra task: Convert FASTQ to FASTA

zcat input_file.fastq.gz | awk 'NR%4==1{printf ">%s\n", substr($0,2)}NR%4==2{print}' > output_file.fa

Philogeny and population genetics

It the Biopython tutorial there is a nice introduction to the method of coalescent simulation. However chances are that you will want a particular type of study that will not be found in Biopython. You could for example manipulate your data in Biopython up to a point and then use either an external program or a different python module. In this next example, we will perform coalescent simulation using a software called FFPopSim, for which a Python binding exists, but it may be too cumbersome to install for the purpose of class.

http://webdav.tuebingen.mpg.de/interference/coalescence.html

http://webdav.tuebingen.mpg.de/ffpopsim/pkg/doc/python/html/Requirements.html#building-requirements

simuPOP

http://simupop.sourceforge.net/Main/HomePage

egglib-py

http://www.biomedcentral.com/1471-2156/13/27



In [ ]: