This example demonstrates the use of metaseq
for performing a common
task when analyzing ChIP-seq data: what does transcription factor binding signal look like near transcription start sites?
The IPython Notebook of this example can be found in the source directory (doc/source/example_session.ipynb
) or at http://nbviewer.ipython.org/github/daler/metaseq/blob/master/doc/source/example_session.ipynb
The syntax of this document follows that of the IPython notebook. For example, calls out to the shell are prefixed with a "!
"; if not then the code is run in Python.
In [1]:
# Enable in-line plots for this example
%matplotlib inline
This example uses data that can be downloaded from https://github.com/daler/metaseq-example-data. See that repository for details on how the data were prepared; here's how to download the prepared data:
In [2]:
%%bash
example_dir="metaseq-example"
if [ -e $example_dir ]; then echo "already exists";
else
mkdir -p $example_dir
(cd $example_dir \
&& wget --progress=dot:giga https://raw.githubusercontent.com/daler/metaseq-example-data/master/metaseq-example-data.tar.gz \
&& tar -xzf metaseq-example-data.tar.gz \
&& rm metaseq-example-data.tar.gz)
fi
In [3]:
# From now on, we can just use `data_dir`
data_dir = 'metaseq-example/data'
In [4]:
# What did we get?
!ls $data_dir
The .bam
and .bam.bai
files are from an ENCODE project ChIP-Seq experiment in the human erythroid K562 cell line for the ATF3 transcription factor and its associated input control. See the ENCODE page for details.
The .bigWig
files are from the same experiment, downloaded from ENCODE
The GTF annotation files are downloaded from Ensembl (release 66) for hg19.
The Homo_sapiens.GRCh37.66_chr17.gtf.db
file is a gffutils
database that was prepared from the GTF file.
The GSM*.gtf.gz
data are Cufflinks output from GEO accession GSM847565 and GSM847566. These were then parsed and reformatted into the corresponding GSM*.table
files.
Our goal is to look at the ChIP-seq signal over transcription start sites (TSSes) of genes. Typically in this sort of analysis we start with annotations; here we're using the annotations from Ensembl. If we're lucky, TSSes will already be annotated. Failing that, perhaps 5'UTRs are annotated, so we could take the 5' end of the 5'UTR as the TSS. Let's see what the Ensembl data gives us.
In [5]:
!head -n 3 $data_dir/Homo_sapiens.GRCh37.66_chr17.gtf
GTF files have the feature type in the 3rd field. So what kind of featuretypes do we have here?
In [6]:
!cut -f 3 $data_dir/Homo_sapiens.GRCh37.66_chr17.gtf | sort | uniq -c
With only these featuretypes to work with, we would need to do the following to identify the TSS of each transcript:
Luckily, gffutils
is able to infer transcripts and genes from a GTF file. The inferred transcripts and genes are already in the prepared gffutils
database, at $data_dir/Homo_sapiens.GRCh37.66_chr17.gtf.db
. First we connect to it:
In [7]:
import os
import gffutils
db = gffutils.FeatureDB(os.path.join(data_dir, 'Homo_sapiens.GRCh37.66_chr17.gtf.db'))
We'll use pybedtools
for interval manipulation.
Here we create a generator function that iterates through all annotated transcripts in the database. For each transcript, we convert it to a pybedtools.Interval
and use the TSS
function to give us the 1-bp position of the TSS, and save it as a new file.
Here is a general usage pattern for gffutils
and pybedtools
: do the work in a generator function, and pass the generator to pybedtools.BedTool
. This uses very little memory, and scales well to hundreds of thousands of features.
In [8]:
import pybedtools
from pybedtools.featurefuncs import TSS
from gffutils.helpers import asinterval
def tss_generator():
"""
Generator function to yield TSS of each annotated transcript
"""
for transcript in db.features_of_type('transcript'):
yield TSS(asinterval(transcript), upstream=1, downstream=0)
# A BedTool made out of a generator, and saved to file.
tsses = pybedtools.BedTool(tss_generator()).saveas('tsses.gtf')
Now that we have a TSS file, we can modify it in different ways. Maybe we want to look at TSS +/- 1kb. Or 5kb. Or just 3kb upstream.
For this example, let's use pybedtools
to add 1kb to either side of the TSS. This uses the BEDTools slop
routine; see the docs for that program for how to make changes to up/downstream distances.
In [9]:
tsses_1kb = tsses.slop(b=1000, genome='hg19', output='tsses-1kb.gtf')
metaseq
works with the concepts of signal and windows. In this example, the signal is ChIP data, and the windows are TSS +/- 1kb.
The first step is to create “genomic signal” objects out of the data. Since our example files are BAM files, we specify the kind=’bam’, but if you have your own data in a different format (bigWig, bigBed, BED, GFF, GTF, VCF) then specify that format instead (see :func:metaseq.genomic_signal()
).
We need to pass the filenames of the BAM files:
In [10]:
import metaseq
ip_signal = metaseq.genomic_signal(
os.path.join(data_dir, 'wgEncodeHaibTfbsK562Atf3V0416101AlnRep1_chr17.bam'),
'bam')
input_signal = metaseq.genomic_signal(
os.path.join(data_dir, 'wgEncodeHaibTfbsK562RxlchV0416101AlnRep1_chr17.bam'),
'bam')
Now we can create the arrays of signal over each window. Since this can be a time-consuming step, the first time this code is run it will cache the arrays on disk. The next time this code is run, it will be quickly loaded. Trigger a re-run by deleting the .npz
file.
Here, with the BamSignal.array
method, we bin each promoter region into 100 bins, and calculate the signal in parallel across as many CPUs as are available. We do this for the IP signal and input signals separately. Then, since these are BAM files of mapped reads, we scale the arrays to the library size. The scaled arrays are then saved to disk, along with the windows that were used to create them.
In [11]:
import multiprocessing
processes = multiprocessing.cpu_count()
if not os.path.exists('example.npz'):
# The signal is the IP ChIP-seq BAM file.
ip_array = ip_signal.array(
# Look at signal over these windows
tsses_1kb,
# Bin signal into this many bins per window
bins=100,
# Use multiple CPUs. Dramatically speeds up run time.
processes=processes)
# Do the same thing for input.
input_array = input_signal.array(
tsses_1kb,
bins=100,
processes=processes)
# Normalize to library size. The values in the array
# will be in units of "reads per million mapped reads"
ip_array /= ip_signal.mapped_read_count() / 1e6
input_array /= input_signal.mapped_read_count() / 1e6
# Cache to disk. The data will be saved as "example.npz" and "example.features".
metaseq.persistence.save_features_and_arrays(
features=tsses,
arrays={'ip': ip_array, 'input': input_array},
prefix='example',
link_features=True,
overwrite=True)
Now that we’ve saved to disk, at any time in the future we can load the data without having to regenerate them:
In [12]:
features, arrays = metaseq.persistence.load_features_and_arrays(prefix='example')
Let’s do some double-checks.
In [13]:
# How many features?
assert len(features) == 5708
# This ought to be exactly the same as the number of features in `tsses_1kb.gtf`
assert len(features) == len(tsses_1kb) == 5708
# This shows that `arrays` acts like a dictionary
assert sorted(arrays.keys()) == ['input', 'ip']
# This shows that the IP and input arrays have one row per feature, and one column per bin
assert arrays['ip'].shape == (5708, 100) == arrays['input'].shape
In [14]:
import numpy as np
x = np.linspace(-1000, 1000, 100)
Then plot, using standard matplotlib
commands:
In [15]:
# Import plotting tools
from matplotlib import pyplot as plt
# Create a figure and axes
fig = plt.figure()
ax = fig.add_subplot(111)
# Plot the IP:
ax.plot(
# use the x-axis values we created
x,
# axis=0 takes the column-wise mean, so with
# 100 columns we'll have 100 means to plot
arrays['ip'].mean(axis=0),
# Make it red
color='r',
# Label to show up in legend
label='IP')
# Do the same thing with the input
ax.plot(
x,
arrays['input'].mean(axis=0),
color='k',
label='input')
# Add a vertical line at the TSS, at position 0
ax.axvline(0, linestyle=':', color='k')
# Add labels and legend
ax.set_xlabel('Distance from TSS (bp)')
ax.set_ylabel('Average read coverage (per million mapped reads)')
ax.legend(loc='best');
Let's work on improving this plot, one step at a time.
We don't really know if this average signal is due to a handful of really strong peaks, or if it's moderate signal over many peaks. So one improvement would be to include a heatmap of the signal over all the TSSs.
First, let's create a single normalized array by subtracting input from IP:
In [16]:
normalized_subtracted = arrays['ip'] - arrays['input']
metaseq
comes with some helper functions to simplify this kind of plotting. The metaseq.plotutils.imshow
function is one of these; here the arguments are described:
In [17]:
# Tweak some font settings so the results look nicer
plt.rcParams['font.family'] = 'Arial'
plt.rcParams['font.size'] = 10
# the metaseq.plotutils.imshow function does a lot of work,
# we just have to give it the right arguments:
fig = metaseq.plotutils.imshow(
# The array to plot; here, we've subtracted input from IP.
normalized_subtracted,
# X-axis to use
x=x,
# Change the default figure size to something smaller for this example
figsize=(3, 7),
# Make the colorbar limits go from 5th to 99th percentile.
# `percentile=True` means treat vmin/vmax as percentiles rather than
# actual values.
percentile=True,
vmin=5,
vmax=99,
# Style for the average line plot (black line)
line_kwargs=dict(color='k', label='All'),
# Style for the +/- 95% CI band surrounding the
# average line (transparent black)
fill_kwargs=dict(color='k', alpha=0.3),
)
In [18]:
print "asdf"
In [19]:
fig = metaseq.plotutils.imshow(
# These are the same arguments as above.
normalized_subtracted,
x=x,
figsize=(3, 7),
vmin=5, vmax=99, percentile=True,
line_kwargs=dict(color='k', label='All'),
fill_kwargs=dict(color='k', alpha=0.3),
# This is new: sort by mean signal
sort_by=normalized_subtracted.mean(axis=1)
)
We can use any number of arbitrary sorting methods. For example, this sorts the rows by the position of the highest signal in the row. Note that the line plot, which is the column-wise average, remains unchanged since we're still using the same data. The rows are just sorted differently.
In [20]:
fig = metaseq.plotutils.imshow(
# These are the same arguments as above.
normalized_subtracted,
x=x,
figsize=(3, 7),
vmin=5, vmax=99, percentile=True,
line_kwargs=dict(color='k', label='All'),
fill_kwargs=dict(color='k', alpha=0.3),
# This is new: sort by mean signal
sort_by=np.argmax(normalized_subtracted, axis=1)
)
In [21]:
fig = metaseq.plotutils.imshow(
normalized_subtracted,
x=x,
figsize=(3, 7),
vmin=5, vmax=99, percentile=True,
line_kwargs=dict(color='k', label='All'),
fill_kwargs=dict(color='k', alpha=0.3),
sort_by=normalized_subtracted.mean(axis=1)
)
Now we'll make some tweaks to the plot. The figure returned by metaseq.plotutils.imshow
has attributes array_axes
, line_axes
, and cax
, which can be used as an easy way to get handles to the axes for further configuration. Let's make some additional tweaks:
In [22]:
# "line_axes" is our handle for working on the lower axes.
# Add some nicer labels.
fig.line_axes.set_ylabel('Average enrichment');
fig.line_axes.set_xlabel('Distance from TSS (bp)');
# "array_axes" is our handle for working on the upper array axes.
# Add a nicer axis label
fig.array_axes.set_ylabel('Transcripts on chr17')
# Remove the x tick labels, since they're redundant
# with the line axes
fig.array_axes.set_xticklabels([])
# Add a vertical line to indicate zero in both the array axes
# and the line axes
fig.array_axes.axvline(0, linestyle=':', color='k')
fig.line_axes.axvline(0, linestyle=':', color='k')
fig.cax.set_ylabel("Enrichment")
fig
Out[22]:
Often we want to compare ChIP-seq data with RNA-seq data. But RNA-seq data typically is presented as gene ID, while ChIP-seq data is presented as genomic coords. These can be tricky to reconcile.
We will use example data from ATF3 knockdown experiments them to subset the ChIP signal by those TSSs that were affected by knockdown and those that were not.
This example uses pre-processed data downloaded from GEO. We'll use a simple (and naive) 2-fold cutoff to identify transcripts that went up, down, or were unchanged upon ATF3 knockdown. In real-world analysis, you'd probaby have a table from DESeq2 or edgeR analysis that you would use instead.
The metaseq.results_table
module has tools for working with this kind of data (for example, the metaseq.results_table.DESeq2Results
class). Here, we will make a generic ResultsTable
which handles any kind of tab-delimited data. It's important to specify the index column. This is the column that contains the transcript IDs in these files.
In [23]:
from metaseq.results_table import ResultsTable
control = ResultsTable(
os.path.join(data_dir, 'GSM847565_SL2585.table'),
import_kwargs=dict(index_col=0))
knockdown = ResultsTable(
os.path.join(data_dir, 'GSM847566_SL2592.table'),
import_kwargs=dict(index_col=0))
metaseq.results_table.ResultsTable
objects are wrappers around pandas.DataFrame
objects, so if you already know pandas
you know how to manipulate these objects. The pandas.DataFrame
is always available as the data
attribute.
Here are the first 5 rows of the control
object, which show that the index is id
, which are Ensembl transcript IDs, and there are two columns, score
and fpkm
:
In [24]:
# ---------------------------------------------------------
# Inspect results to see what we're working with
print len(control.data)
control.data.head()
Out[24]:
We should ensure that control
and knockdown
have their transcript IDs in the same order as the rows in the heatmap array, and that they only contain transcript IDs from chr17.
The ResultsTable.reindex_to
method is very useful for this -- it takes a pybedtools.BedTool
object and re-indexes the underlying dataframe so that the order of the dataframe matches the order of the features in the file. In this way we can re-align RNA-seq data to ChIP-seq data for more direct comparison.
Remember the tsses_1kb
object that we used to create the array? That defined the order of the rows in the array. We can use that to re-index the dataframes. Let's look at the first line from that file to see how the transcript ID information is stored:
In [25]:
# ---------------------------------------------------------
# Inspect the GTF file originally used to create the array
print tsses_1kb[0]
The Ensembl transcript ID is stored in the transcript_id
field of the GTF attributes:
transcript_id "ENST00000318008"; gene_id "ENSG00000002834";
The ResultsTable
is indexed by transcript ID. Note that DESeq2 and edgeR results are typically indexed by gene, rather than trancscript, ID. So when working with your own data, be sure to select the GTF attribute whose values will be found in the ResultsTable
index.
Here, we tell the ResultsTable.reindex_to
method which attribute it should pay attention to when realigning the data:
In [26]:
# ---------------------------------------------------------
# Re-align the ResultsTables to match the GTF file
control = control.reindex_to(tsses, attribute='transcript_id')
knockdown = knockdown.reindex_to(tsses, attribute='transcript_id')
Note that we now have a different order -- the first 5 rows are now different compared to when we checked before.
Also, the number of rows in the table has decreased dramatically. Recall that tsses_1kb
only contained features from chr17. The original data table had all transcripts. By reindexing the table to match the tsses_1kb
, we lose all of the non-chr17 transcripts.
In [27]:
print len(control)
control.data.head()
Out[27]:
Also note that second transcript, with NaN values. It turns out that transcript was not in the original RNA-seq results data table:
In [28]:
original_control = ResultsTable(
os.path.join(data_dir, 'GSM847565_SL2585.table'),
import_kwargs=dict(index_col=0))
'ENST00000419929' in original_control.data.index
Out[28]:
This may be because the experiment from GEO used something other than Ensembl annotations when running the analysis. It's actually not clear from the GEO entry what they used. Anyway, in order to make sure the rows in the table match the rows in the array, NaNs are added as values.
Let's do some double-checks to make sure things are set up correctly:
In [29]:
# Everything should be the same length
assert len(control.data) == len(knockdown.data) == len(tsses_1kb) == 5708
# Spot-check some values to make sure the GTF file and the DataFrame match up.
assert tsses[0]['transcript_id'] == control.data.index[0]
assert tsses[100]['transcript_id'] == control.data.index[100]
assert tsses[5000]['transcript_id'] == control.data.index[5000]
Now for some more data-wrangling. We'll use basic pandas
operations to merge the control and knockdown data together into a single table. We'll also create a new log2foldchange column.
In [30]:
# Join the dataframes and create a new pandas.DataFrame.
data = control.data.join(knockdown.data, lsuffix='_control', rsuffix='_knockdown')
# Add a log2 fold change variable
data['log2foldchange'] = np.log2(data.fpkm_knockdown / data.fpkm_control)
data.head()
Out[30]:
We can investigate some basic stats:
In [31]:
# ---------------------------------------------------------
# How many transcripts on chr17 changed expression?
print "up:", sum(data.log2foldchange > 1)
print "down:", sum(data.log2foldchange < -1)
Let's return to the heatmap. In addition to the average coverage line we already have, we'd like to add additional lines in another panel. The metaseq.plotutils.imshow
function is very flexible, and uses matplotlib.gridspec
for organizing the axes. This means we can ask for an additional axes by overriding the default height_ratios
tuple, using (3, 1, 1)
. This says to make 3 axes, where the first one is 3x the height of the other two.
In [32]:
fig = metaseq.plotutils.imshow(
# Same as before...
normalized_subtracted,
x=x,
figsize=(3, 7),
vmin=5, vmax=99, percentile=True,
line_kwargs=dict(color='k', label='All'),
fill_kwargs=dict(color='k', alpha=0.3),
sort_by=normalized_subtracted.mean(axis=1),
# Default was (3,1); here we add another number
height_ratios=(3, 1, 1)
)
# `fig.gs` contains the `matplotlib.gridspec.GridSpec` object,
# so we can now create the new axes.
bottom_axes = plt.subplot(fig.gs[2, 0])
The metaseq.plotutils.ci_plot
function takes an array and plots the mean signal +/- 95% CI bands. This was actually called automatically before for our line plot of average signal across all TSSes.
Now, let's create a custom plot that separates TSSes into up, down, and unchanged in the ATF3 knockdown.
Importantly, since we've aligned the RNA-seq data table and the array, we can calculate subsets in the RNA-seq data (as boolean indexes) and use those same indexes into the array itself.
For clarity, let's split up each step separately for the upregulated genes.
In [33]:
# This is a pandas.Series, True where the log2foldchange was >1
upregulated = (data.log2foldchange > 1)
upregulated
Out[33]:
In [34]:
# This gets us the underlying boolean NumPy array which we
# can use to subset the array
index = upregulated.values
index
Out[34]:
In [35]:
# This is the subset of the array where the TSS of the transcript
# went up in the ATF3 knockdown
upregulated_chipseq_signal = normalized_subtracted[index, :]
upregulated_chipseq_signal
Out[35]:
In [36]:
# We can combine the above steps into the following:
subset = normalized_subtracted[(data.log2foldchange > 1).values, :]
Now we just use the same technique for the up, down, and unchanged transcripts. Each one of them gets passed to the ci_plot
method, which plots the line in the color we specify (line_kwargs
, fill_kwargs
) on the axes we specify (bottom_axes
).
In [37]:
# Signal over TSSs of transcripts that were activated upon knockdown.
metaseq.plotutils.ci_plot(
x,
normalized_subtracted[(data.log2foldchange > 1).values, :],
line_kwargs=dict(color='#fe9829', label='up'),
fill_kwargs=dict(color='#fe9829', alpha=0.3),
ax=bottom_axes)
# Signal over TSSs of transcripts that were repressed upon knockdown
metaseq.plotutils.ci_plot(
x,
normalized_subtracted[(data.log2foldchange < -1).values, :],
line_kwargs=dict(color='#8e3104', label='down'),
fill_kwargs=dict(color='#8e3104', alpha=0.3),
ax=bottom_axes)
# Signal over TSSs tof transcripts that did not change upon knockdown
metaseq.plotutils.ci_plot(
x,
normalized_subtracted[((data.log2foldchange >= -1) & (data.log2foldchange <= 1)).values, :],
line_kwargs=dict(color='.5', label='unchanged'),
fill_kwargs=dict(color='.5', alpha=0.3),
ax=bottom_axes);
Finally, we do some cleaning up to make the figure look nicer (axes labels, legend, vertical lines at zero):
In [38]:
# Clean up redundant x tick labels, and add axes labels
fig.line_axes.set_xticklabels([])
fig.array_axes.set_xticklabels([])
fig.line_axes.set_ylabel('Average\nenrichement')
fig.array_axes.set_ylabel('Transcripts on chr17')
bottom_axes.set_ylabel('Average\nenrichment')
bottom_axes.set_xlabel('Distance from TSS (bp)')
fig.cax.set_ylabel('Enrichment')
# Add the vertical lines for TSS position to all axes
for ax in [fig.line_axes, fig.array_axes, bottom_axes]:
ax.axvline(0, linestyle=':', color='k')
# Nice legend
bottom_axes.legend(loc='best', frameon=False, fontsize=8, labelspacing=.3, handletextpad=0.2)
fig.subplots_adjust(left=0.3, right=0.8, bottom=0.05)
fig
Out[38]:
We can save the figure to disk in different formats for manuscript preparation:
In [39]:
fig.savefig('demo.png')
fig.savefig('demo.svg')
It appears that transcripts unchanged by ATF3 knockdown have the strongest ChIP signal. Transcripts that went up upon knockdown (that is, ATF3 normally represses them) had a slightly higher signal than those transcripts that went down (normally activated by ATF3).
Interestingly, even though we used a crude cutoff of 2-fold for a single replicate, and we only looked at chr17, the direction of the relationship we see here -- where ATF3-repressed genes have a higher signal than ATF3-activated -- is consistent with ATF3's known repressive role.
In [40]:
# K-means input data should be normalized (mean=0, stddev=1)
from sklearn import preprocessing
X_scaled = preprocessing.scale(normalized_subtracted)
k = 4
ind, breaks = metaseq.plotutils.new_clustered_sortind(
# The array to cluster
X_scaled,
# Within each cluster, how the rows should be sorted
row_key=np.mean,
# How each cluster should be sorted
cluster_key=np.median,
# Number of clusters
k=k)
In [41]:
# Plot the heatmap again
fig = metaseq.plotutils.imshow(
normalized_subtracted,
x=x,
figsize=(3, 9),
vmin=5, vmax=99, percentile=True,
line_kwargs=dict(color='k', label='All'),
fill_kwargs=dict(color='k', alpha=0.3),
# A little tricky: `sort_by` expects values to sort by
# (say, expression values). But we've pre-calculated
# our actual sort index based on clusters, so we transform
# it like this
sort_by=np.argsort(ind),
# This adds a "strip" axes on the right side, useful
# for adding extra information. We'll add cluster color
# codes here.
strip=True,
)
# De-clutter by hiding labels
plt.setp(
fig.strip_axes.get_yticklabels()
+ fig.strip_axes.get_xticklabels()
+ fig.array_axes.get_xticklabels(),
visible=False)
#
fig.line_axes.set_ylabel('Average\nenrichement')
fig.array_axes.set_ylabel('Transcripts on chr17')
fig.strip_axes.yaxis.set_label_position('right')
fig.strip_axes.set_ylabel('Cluster')
fig.cax.set_ylabel('Enrichment')
# Make colors
import matplotlib
cmap = matplotlib.cm.Spectral
colors = cmap(np.arange(k) / float(k))
# This figure will contain average signal for each cluster
fig2 = plt.figure(figsize=(10,3))
last_break = 0
cluster_number = 1
n_panel_rows = 1
n_panel_cols = k
for color, this_break in zip(colors, breaks):
if cluster_number == 1:
sharex = None
sharey = None
else:
sharex = fig2.axes[0]
sharey = fig2.axes[0]
ax = fig2.add_subplot(
n_panel_rows,
n_panel_cols,
cluster_number,
sharex=sharex,
sharey=sharey)
# The y position is somewhat tricky: the array was
# displayed using matplotlib.imshow with the argument
# `origin="lower"`, which means the row in the plot at y=0
# corresponds to the last row in the array (index=-1).
# But the breaks are in array coordinates. So we convert
# them by subtracting from the total array size.
xpos = 0
width = 1
ypos = len(normalized_subtracted) - this_break
height = this_break - last_break
rect = matplotlib.patches.Rectangle(
(xpos, ypos), width=width, height=height, color=color)
fig.strip_axes.add_patch(rect)
fig.array_axes.axhline(ypos, color=color, linewidth=2)
chunk = normalized_subtracted[last_break:this_break]
metaseq.plotutils.ci_plot(
x,
chunk,
ax=ax,
line_kwargs=dict(color=color),
fill_kwargs=dict(color=color, alpha=0.3),
)
ax.axvline(0, color='k', linestyle=':')
ax.set_title('cluster %s\n(N=%s)' % (cluster_number, len(chunk)))
cluster_number += 1
last_break = this_break
In [42]:
# Convert to ResultsTable so we can take advantage of its
# `scatter` method
rt = ResultsTable(data)
# Get the up/down regulated
up = rt.log2foldchange > 1
dn = rt.log2foldchange < -1
# Go back to the ChIP-seq data and create a boolean array
# that is True only for the top TSSes with the strongest
# mean signal
tss_means = normalized_subtracted.mean(axis=1)
strongest_signal = np.zeros(len(tss_means)) == 1
strongest_signal[np.argsort(tss_means)[-25:]] = True
rt.scatter(
x='fpkm_control',
y='fpkm_knockdown',
xfunc=np.log1p,
yfunc=np.log1p,
genes_to_highlight=[
(up, dict(color='#da3b3a', alpha=0.8)),
(dn, dict(color='#00748e', alpha=0.8)),
(strongest_signal, dict(color='k', s=50, alpha=1)),
],
general_kwargs=dict(marker='.', color='0.5', alpha=0.2, s=5),
one_to_one=dict(color='r', linestyle=':')
);
In [43]:
# Perhaps a better analysis would be to plot average
# ChIP-seq signal vs log2foldchange directly. In an imaginary
# world where biology is simple, we might expect TSSes with stronger
# log2foldchange upon knockdown to have stronger ChIP-seq signal
# in the control.
#
# To take advantage of the `scatter` method of ResultsTable objects,
# we simply add the TSS signal means as another variable in the
# dataframe. Then we can refer to it by name in `scatter`.
#
# We'll also use the same colors and genes to highlight from
# above.
rt.data['tss_means'] = tss_means
rt.scatter(
x='log2foldchange',
y='tss_means',
genes_to_highlight=[
(up, dict(color='#da3b3a', alpha=0.8)),
(dn, dict(color='#00748e', alpha=0.8)),
(strongest_signal, dict(color='k', s=50, alpha=1)),
],
general_kwargs=dict(marker='.', color='0.5', alpha=0.2, s=5),
yfunc=np.log2);