The eukaryotic genome is adorned by molecules called transcription factors (TFs). At any given time, some of these are regulating gene expression, e.g. by interacting with RNA polymerase, but others are not. How can we distinguish the functional TF-DNA binding events from a potentially large background of non-functional binding events?
To approach this question, we first need to quantify the strength with which TFs bind DNA. A TF binds DNA by making contact with a sequence of $L$ nucleotides
\begin{equation} \vec{s} = \langle s_1, s_2, \ldots, s_L\rangle, \end{equation}where $s_i \in \{A,C,G,T\}$. Denote by $E(\vec{s})$ the binding energy of a given TF to a DNA sub-sequence $\vec{s}$. With binding lengths of $L = 10-20$ nucleotides, there are too many possible $\vec{s}$ to measure $E(\vec{s})$ exhaustively. Fortunately, the contribution of each nucleotide to the binding energy of the sub-sequence is approximately independent and additive:
\begin{equation} E(\vec{s}) = \sum_{i=1}^L \epsilon_i(s_i), \end{equation}reducing the impractical problem of determining the large number of values of $E(\vec{s})$ to the practical problem of the determining the $L\times 4$ energy matrix, $\epsilon_i(s)$. This matrix has been determined for a TF called Gal4 using in vitro measurements of the equilibrium binding constants
\begin{equation} K(\vec{s}) \propto e^{-\beta E(\vec{s})} \end{equation}for all sequences $\vec{s}$ that differ in just one nucleotide from a given sequence. I manually fetched these data from the literature [Liang et al 1996], and stored them in the file data/Gal4_affinity.in
.
In [1]:
import sys, os
sys.path.append(os.getcwd() + '/source')
from extract import createEnergyMatrix
energy_matrix = createEnergyMatrix('data/Gal4_affinity.in')
Here, I have stored the energy matrix as a list of dictionaries for computational reasons, but we can use pandas
to visualize it:
In [2]:
import pandas as pd
df = pd.DataFrame(energy_matrix)
df
Out[2]:
In the above data structure:
So, for example, a DNA sequence that binds optimally to Gal4 can be extracted by listing the nucleotides with the lowest energy at each position:
In [3]:
print(df.idxmin(axis=1))
I manually downloaded, from the Saccharomyces Genome Database, the DNA sequence of the third chromosome of yeast, stored in FASTA format, and read it into a string:
In [4]:
from extract import getFasta
with open('data/chr03.fsa') as f:
header, chromosome = getFasta(f)
print('\nHere is the beginning of the DNA sequence of the chromosome:\n')
print(chromosome[:100])
print('\nThere are ', len(chromosome), 'nucleotides in this chromosome')
In [5]:
from auxFunctions import calcEnergyListWithMatrix
TFBS = calcEnergyListWithMatrix(chromosome, energy_matrix)
TFBS.head()
Out[5]:
Here is how those TF-DNA binding energies are distributed throughout the genome:
In [6]:
import numpy as np
from matplotlib import pyplot as plt
from auxFunctions import binList
%matplotlib inline
energyBins, numberSites = binList(TFBS['TF-DNA binding energy'], xMin=-5.0, xMax=50, binWidth=0.25)
fontsize = 14
fontsize_tick = 12
fig = plt.figure(figsize=(7,5), facecolor='w')
ax = fig.add_subplot(111)
ax.plot(energyBins, numberSites, linewidth=0, marker='s', markersize=8, color='red')
ax.set_xlabel('TF-DNA binding energy (kT)', fontsize=fontsize)
ax.set_ylabel('number of genomic sites', fontsize=fontsize)
ax.set_yscale('log')
ax.set_xlim(0, 40)
ax.tick_params(axis='both', which='major', labelsize=fontsize_tick)
Putting the y-axis on a log scale reveals that the distribution is approximately parabolic, implying that on a linear scale the distribution is approximately Gaussian. This is expected from the fact that each TF-DNA binding energy is a sum of single-nucleotide energies that are, to a good approximate, independently and identically distributed (Central Limit Theorem).
Notice also that, though highly-specific (low-energy) sites do indeed exist, the sheer number of less-specific (intermediate- to high-energy) sites across the genome can, in principle, soak up a significant number of TFs.
To identify where regulatory regions of genes are likely to be, I downloaded the following General Feature Format file from the Saccharomyces Genome Database:
In [7]:
!head -20 data/saccharomyces_cerevisiae_chr03.gff
This file contains all genomic features on the third chromosome of this species. I located the coding-sequence features, and used them to extract regions of DNA that lie upstream of each transcription start site.
In [8]:
def extractPromoters():
""" parse saccharomyces_cerevisiae_chr03.gff and extract promoters """
promLength = 100
with open('data/saccharomyces_cerevisiae_chr03.gff') as fin:
# skip over header lines
line = fin.readline()
while line[0] == "#":
line = fin.readline()
features = []
while line:
seqid, source, feature_type, start, end, score, strand, phase, attributes = line.split()
if feature_type == 'CDS':
attributes = attributes.split(';')
initDict = [attribute.split('=') for attribute in attributes]
attributes = dict(initDict)
systematicGeneName = attributes['Parent']
if 'orf_classification' in attributes:
classification = attributes['orf_classification']
else:
classification = '.'
if 'gene' in attributes:
standardGeneName = attributes['gene']
else:
standardGeneName = '.'
# which DNA strand the gene is encoded on determines where the promoter is located
if strand == '+':
promStart = int(start) - promLength
promEnd = int(start)
elif strand == '-':
promStart = int(end)
promEnd = int(end) + promLength
promoter = chromosome[max(promStart, 0):promEnd]
features += [(standardGeneName, systematicGeneName, classification, promStart, promEnd, promoter)]
line = fin.readline()
return pd.DataFrame(data=features, columns=['standard gene name', 'systematic gene name',
'classification', 'promoter start position',
'promoter end position',
'promoter sequence'])
promoters = extractPromoters()
promoters.head()
Out[8]:
I have placed a .
in a field to indicate missing data.
In [9]:
import warnings
warnings.filterwarnings('ignore')
TFBS_high_affinity = TFBS[TFBS['TF-DNA binding energy'] < 10]
TFBS_high_affinity_categorical_variable = [0]*len(TFBS_high_affinity)
TFBS_high_affinity_promoter = ['.']*len(TFBS_high_affinity)
count = -1
for TFBS_index, TFBS_row in TFBS_high_affinity.iterrows():
count += 1
TFBS_start = TFBS_row['binding-site start position']
TFBS_end = TFBS_row['binding-site end position']
for promoter_index, promoter_row in promoters.iterrows():
promoter_start = promoter_row['promoter start position']
promoter_end = promoter_row['promoter end position']
if (promoter_start < TFBS_start) and (TFBS_end < promoter_end):
TFBS_high_affinity_categorical_variable[count] = 1
TFBS_high_affinity_promoter[count] = promoter_row['systematic gene name']
break
TFBS_high_affinity['promoter categorical variable'] = TFBS_high_affinity_categorical_variable
TFBS_high_affinity['promoter'] = TFBS_high_affinity_promoter
TFBS_high_affinity.head()
Out[9]:
I used scikit-learn to determine whether subsequences with greater affinity for the TF (i.e. lower energy) tend to be located in promoter regions more often than you'd expect by chance:
In [10]:
energies = TFBS_high_affinity['TF-DNA binding energy']
categories = TFBS_high_affinity['promoter categorical variable']
# sample data
fig = plt.figure(figsize=(7,5), facecolor='w')
ax = fig.add_subplot(111)
ax.plot(energies, categories,
linewidth=0, marker='o', markersize=8, color='red', label='sample data')
ax.set_xlabel('TF-DNA binding energy (kT)', fontsize=fontsize)
ax.set_ylabel('promoter categorical variable', fontsize=fontsize)
ax.tick_params(axis='both', which='major', labelsize=fontsize_tick)
ax.set_ylim(-0.1, 1.1)
# logistic regression
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
energies_rs = energies.values.reshape((len(energies),1))
lr.fit(energies_rs, categories)
x = np.linspace(energies.min(),energies.max())
x_rs = x.reshape((len(x), 1))
probs = lr.predict_proba(x_rs)
ax.plot(x, probs[:,1], linewidth=2, marker=None, color='black',
label='logistic probability of lying in promoter region')
# probability that a randomly choosen site lies in a promoter
number_promoters, number_cols = promoters.shape
promoter_size = len(promoters['promoter sequence'][0])
promoter_size_summed_over_chromosome = number_promoters*promoter_size
chromosome_size = len(chromosome)
null_probability_of_random_bp_lying_in_promoter = float(promoter_size_summed_over_chromosome)/float(chromosome_size)
ax.plot([x.min(), x.max()],
[null_probability_of_random_bp_lying_in_promoter,
null_probability_of_random_bp_lying_in_promoter],
'r--',
linewidth=3, marker=None, color='black',
label='null probability of lying in promoter region')
# label the plot
legend = ax.legend(loc='center', fontsize=fontsize)
The data show that high-affinity subsequences are more likely to lie in promoter regions than expected by chance, suggesting that there has been evolutionary pressure to retain strong binding sites in promoters. The actionable insight is therefore that searches for new functional TF binding sites should be focused on promoter regions.
In [ ]: