Transcriptomics

  • Extract the promoter regions using biopython
  • Investigate de-novo motifs on clusters of genes using meme or steme
  • Use TransFac database to search for motif occurences on selected genes
  • Save the found motifs and related data
  • Test which of the motif occurences on your selected genes is significant

In [ ]:
import TR
TR.run() #the task must run from a single access point

Extract the promoter regions using biopython

There are a few transcriptional regulatory modules for Yeast listed here:

http://www.jaist.ac.jp/~h-pham/regulation/inf_TRMs.htm

Using BioPython, extract the promoter regions of these genes. To download the full genome automatically, adapt the code bellow to your needs.

Alternatively, if you can fish out the nucleotide positions for every gene and have a local genome you can also do this in standard Python.


In [ ]:
from Bio import Entrez
Entrez.email = "sn@mail.at"
handle = Entrez.esearch(db='nuccore', term='escherichia coli[orgn] AND complete genome[title]')
genome_ids = Entrez.read(handle)['IdList']
print genome_ids
handle = Entrez.efetch(db="nuccore", id=['556503834'], rettype="gbwithparts", retmode="txt")
#print "First 2000 characters:"
#print(handle.read()[:2000])#first 2000 characters
handle2 = Entrez.efetch(db="nucleotide", id=['556503834'], rettype="fasta", retmode="txt")
print(handle2.read()[:500])
fout = open("data/ecoli.gbk", "w")
fout.write(handle.read())
fout.close()
#handle.close()
print("Saved")

To extract the genomic regions, again adapt the helper code bellow:


In [ ]:
import Bio
from Bio import SeqIO, SeqFeature, SeqRecord
import os

gbf = "data/ecoli.gbk"
promoters = []

record = SeqIO.parse(open(gbf), "genbank").next()
dnaseq = record.seq # seq is an attribute of the record object containing the actual nucleotide sequence as a string

## We fill the CDS list first
cds = []#CDS features list
for feature in record.features:
    if feature.type == 'CDS':
        #print dir(feature)
        #print feature
        #print feature.qualifiers
        #break
        try:
            mystart = feature.location._start.position
            myend = feature.location._end.position
            ## we annotate the promoters with the NumProt id of coresponding to the referenced gene
            ## how did I knew this, by simple object interogation - uncomment the lines above to understand it
            pid = feature.qualifiers['protein_id'][0]
            if feature.strand == 1:
                cds.append((mystart,myend,pid))
        except Exception:
            pass # being careless about the CDSes spanning multiple locations!

## Extract the promoter regions in a list of SeqRecord objects
for i,pospair in enumerate(cds):
    start, end, pid = pospair
    pseq = dnaseq[start-400:start] # dnaseq is a string, so we can slice it directly
    promid = 'p_'+ pid
    pdesc = "%s, 400 bp upstream promoter, %d-%d" % (record.name, start-400, start)
    ## Note that BioPython named a class with the same name as its module, this is frequent in OOP and somewhat funny 
    prec = SeqRecord.SeqRecord(pseq, id=promid, description = pdesc)
    promoters.append(prec)

## Write all promoters on file
outpath = "data/ecoli_prom.fasta"
SeqIO.write(promoters, open(outpath,"w"), "fasta")
print "Done!"

In [ ]:
import TR
genelist = []
TR.extract_promoters(genelist, nnuc = 400)

Investigate de-novo motifs on clusters of genes using meme or steme

De-novo TFBS hunt is not that hard once you manage to install a proper program. But the installation itself can be tedious. This is an optional step, but in case you want to do it I reccommend STEME:

https://pythonhosted.org/STEME/

It will install as a Python module with pip, however it has a fairly complex set of other required programs and libraries.


In [ ]:
TF.de_novo(genelist)

Use a TransFac consensus file to search for motif occurences on selected genes

http://www.yeastract.com/download.php

The link above contains motifs in Transfac format. Use Biopython or standard python to read the consensus matrices. Now match them over your input promoter.

To do that, read carefully the following tutorial on BioPython's motif module and then try to use it:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc213

Second observation, if you need to match motifs from a database on a grand scale, I reccommend this tool, having bindings for Python:

https://github.com/jhkorhonen/MOODS/wiki/Getting-started


In [ ]:
TR.search(genelist)

The optimal way to save your data would be a matrix of promoters vs motifs detailing the occurence positions. Additional files could contain information on matches and the motifs themselves and optionally save their logo image.


In [ ]:
TR.save()

Test which of the motif occurences on your selected genes is significant

A simple significance test would construct a matching background for the entire genome and compute a P-value based on a Fischer test (how many matches for a motif in the selected genes compared to the background).

http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.fisher_exact.html


In [ ]:
input_motif_id=""
TR.significance(genelist, input_motif_id)

Now run the significance test for all gene lists and motifs and see if you can find some relevant associations.


In [ ]: