In [ ]:
import TR
TR.run() #the task must run from a single access point
There are a few transcriptional regulatory modules for Yeast listed here:
http://www.jaist.ac.jp/~h-pham/regulation/inf_TRMs.htm
Using BioPython, extract the promoter regions of these genes. To download the full genome automatically, adapt the code bellow to your needs.
Alternatively, if you can fish out the nucleotide positions for every gene and have a local genome you can also do this in standard Python.
In [ ]:
from Bio import Entrez
Entrez.email = "sn@mail.at"
handle = Entrez.esearch(db='nuccore', term='escherichia coli[orgn] AND complete genome[title]')
genome_ids = Entrez.read(handle)['IdList']
print genome_ids
handle = Entrez.efetch(db="nuccore", id=['556503834'], rettype="gbwithparts", retmode="txt")
#print "First 2000 characters:"
#print(handle.read()[:2000])#first 2000 characters
handle2 = Entrez.efetch(db="nucleotide", id=['556503834'], rettype="fasta", retmode="txt")
print(handle2.read()[:500])
fout = open("data/ecoli.gbk", "w")
fout.write(handle.read())
fout.close()
#handle.close()
print("Saved")
To extract the genomic regions, again adapt the helper code bellow:
In [ ]:
import Bio
from Bio import SeqIO, SeqFeature, SeqRecord
import os
gbf = "data/ecoli.gbk"
promoters = []
record = SeqIO.parse(open(gbf), "genbank").next()
dnaseq = record.seq # seq is an attribute of the record object containing the actual nucleotide sequence as a string
## We fill the CDS list first
cds = []#CDS features list
for feature in record.features:
if feature.type == 'CDS':
#print dir(feature)
#print feature
#print feature.qualifiers
#break
try:
mystart = feature.location._start.position
myend = feature.location._end.position
## we annotate the promoters with the NumProt id of coresponding to the referenced gene
## how did I knew this, by simple object interogation - uncomment the lines above to understand it
pid = feature.qualifiers['protein_id'][0]
if feature.strand == 1:
cds.append((mystart,myend,pid))
except Exception:
pass # being careless about the CDSes spanning multiple locations!
## Extract the promoter regions in a list of SeqRecord objects
for i,pospair in enumerate(cds):
start, end, pid = pospair
pseq = dnaseq[start-400:start] # dnaseq is a string, so we can slice it directly
promid = 'p_'+ pid
pdesc = "%s, 400 bp upstream promoter, %d-%d" % (record.name, start-400, start)
## Note that BioPython named a class with the same name as its module, this is frequent in OOP and somewhat funny
prec = SeqRecord.SeqRecord(pseq, id=promid, description = pdesc)
promoters.append(prec)
## Write all promoters on file
outpath = "data/ecoli_prom.fasta"
SeqIO.write(promoters, open(outpath,"w"), "fasta")
print "Done!"
In [ ]:
import TR
genelist = []
TR.extract_promoters(genelist, nnuc = 400)
De-novo TFBS hunt is not that hard once you manage to install a proper program. But the installation itself can be tedious. This is an optional step, but in case you want to do it I reccommend STEME:
https://pythonhosted.org/STEME/
It will install as a Python module with pip, however it has a fairly complex set of other required programs and libraries.
In [ ]:
TF.de_novo(genelist)
http://www.yeastract.com/download.php
The link above contains motifs in Transfac format. Use Biopython or standard python to read the consensus matrices. Now match them over your input promoter.
To do that, read carefully the following tutorial on BioPython's motif module and then try to use it:
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc213
Second observation, if you need to match motifs from a database on a grand scale, I reccommend this tool, having bindings for Python:
In [ ]:
TR.search(genelist)
In [ ]:
TR.save()
A simple significance test would construct a matching background for the entire genome and compute a P-value based on a Fischer test (how many matches for a motif in the selected genes compared to the background).
http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.fisher_exact.html
In [ ]:
input_motif_id=""
TR.significance(genelist, input_motif_id)
Now run the significance test for all gene lists and motifs and see if you can find some relevant associations.
In [ ]: