We will do the assembly and merging using the example data in the tests/data directory in the following order:
The example data is the chromosome 1 portion of the RNASeq data from the neurons in superior central nucleus raphe, medial part (CSm) and dosal nucleus raphe (DR) in Fev-Cre mouse line which labels serotoninergic neurons. (Fev-Cre is also called ePet-Cre.)
First we will change the logging level for jupyter notebook and enable inline matplotlib figure:
In [1]:
# This is to change logging level of jupyter notebook
try:
from importlib import reload # for python 3
except:
pass
import logging
reload(logging)
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO, datefmt='%I:%M:%S')
# This is to show matplotlib output in the notebook
%matplotlib inline
import matplotlib.pylab as P
In [2]:
import pandas as PD
si = PD.read_excel('../tests/data/sampleinfo.xlsx')
print(si[['name','mapbed','sjtab']])
The read mapping output from STAR is a BAM file. We convert the BAM file to BED file. You can do this using bedtools bamtobed command, or using the wrapper jgem.bedtools.bam2bed:
from jgem import bedtools as BT
BT.bam2bed(bam_path,bed_path)
STAR also creates a file named SJ.out.tab which contains junction information.
First, we'll create columns containing actual paths of these files in the sampleinfo dataframe.
In [3]:
import os
BASEDIR = '../tests/data'
si['bed_path'] = [os.path.join(BASEDIR, "BED", x) for x in si['mapbed']]
si['sjtab_path'] = [os.path.join(BASEDIR, "SJ", x) for x in si['sjtab']]
print(si[['name','bed_path','sjtab_path']])
We need a BIGWIG file containing normalized coverages and an appropriately formatted BED file containing junction information for the inputs to the assembler.
To normalize coverage, we use average coverage which is the number of the total mapped base pairs divided by the number of covered base pairs. This can be calculated using jgem.bedtools.get_total_bp_bedfile:
In [4]:
# calculate total bp and covered bp, takes ~1min
from jgem import bedtools as BT
tot_cov = [BT.get_total_bp_bedfile(x) for x in si['bed_path']]
si['totbp'] = [x[0] for x in tot_cov]
si['covbp'] = [x[1] for x in tot_cov]
si['acov'] = si['totbp']/si['covbp']
print(si[['name','totbp','covbp','acov']])
Now normalized BIGWIG coverage files can be generated using jgem.bedtools.bed2bw function:
In [5]:
# first prepare destination
si['bw_path'] = [os.path.join(BASEDIR, "bigwig/{0}.bw".format(x)) for x in si['name']]
# then make coverages BIGWIG from BED (~ 2 min)
from jgem import utils as UT
chromsizes = UT.chromsizes('mm10') # a file containing chromosome names and sizes
for bedfile, bwfile, acov in si[['bed_path','bw_path','acov']].values:
scale = 1./acov
BT.bed2bw(bedfile, chromsizes, bwfile, scale)
In [6]:
ls ../tests/data/bigwig/*chr1.bw
To generate junction files from SJ.out.tab, use function jgem.gtfgffbed.sjtab2sjbed:
In [7]:
# sjbed destination
si['sjbed_path'] = [os.path.join(BASEDIR, "SJ/{0}.sj.bed.gz".format(x)) for x in si['name']]
# SJ.out.tab => sjbed
from jgem import gtfgffbed as GGB
for sjtab, sjbed, acov in si[['sjtab_path','sjbed_path','acov']].values:
scale = 1./acov
GGB.sjtab2sjbed(sjtab,sjbed,scale)
In [8]:
ls ../tests/data/SJ/*chr1.sj.bed.gz
Let's check the content of generated SJBED file.
In [9]:
sj = GGB.read_sj(sjbed)
print(sj.head())
Use FileNames class to specify inputs (bigwig file and sjbed file) and output directory, then use Assembler class to do the assembly.
In [10]:
# assemble 4 samples takes about ~10 min
from jgem import filenames as FN
from jgem import assembler as AS
for name,bwfile,sjfile in si[['name','bw_path','sjbed_path']].values:
fn = FN.FileNames(name, bwfile, sjfile, outdir='../tests/out')
p = AS.Assembler(fn,saveintermediates=False)
p.assemble()
Generated figures visualize the process of finding coverage threshold for single exons. (See ...)
In the output directory, several files are generated:
In [11]:
ls ../tests/out/Fev_DR_m70_1623*
Main outputs are files with .ex.txt.gz, .sj.txt.gz suffixes. These contain exons and junctions in tab separated format. BED files are also generated for viewing elements (genes,exons,junctions) with browsers like IGV (https://www.broadinstitute.org/igv/). The screenshot below is a segment of chromosome 1 showing genes.bed files from the 4 samples.
Other files (assemble.params.txt.gz, assemble.stats.txt.gz, findsecovth.params.txt.gz, findsecovth.pdf) contain parameters and statistics of the assembly.
We will evaluate the assemblies against Gencode.VM4 annotation. For this, we use EvalNames class and EvalMatch class.
In [12]:
# Evaluate against Gencode.VM4 (~1 min)
from jgem import evaluate as EV
ens = {} # keep EvalNames objects
ems = {} # keep EvalMatch object
# Gencode.VM4
en_gen4 = EV.EvalNames('../tests/data/assemblies/gencode.vM4.chr1', code='gen4', outdir='../tests/out/')
# assign sample code
si['scode'] = [x.split('_')[-1].split('.')[0] for x in si['name']]
# For each sample
for scode, name, bwfile, sjfile in si[['scode','name','bw_path','sjbed_path']].values:
print('process {0}...'.format(name))
sjexbase = '../tests/out/'+name # prefix to .ex.txt.gz and .sj.txt.gz files
assemblycode = 'a'+str(scode) # assembly identifier
datacode = 'd'+str(scode) # data identifier
ens[name] = EV.EvalNames(sjexbase, code=assemblycode, outdir='../tests/out/')
ems[name] = EV.EvalMatch(en_gen4, ens[name], bwfile, sjfile, datacode=datacode, binsize=100)
ems[name].calculate()
INFO outputs are as follows:
These are assesed within different exon classes:
Classes with b, (5b,3b,sb) allow matching to different classes of exons.
Matching exons with junction boundary requires the boundaries to match. That is, to match internal exons, they have to be exactly the same (including strand). To match 5'exons, donor site has to the same but the other side does not have to be. To match single exon, they just need to have some overlap.
So, we have quite good detection of junctions. About 70% of internal exons are detected. 5',3' exons are worse (< 50%) but if allowed to match to other class of exons, then detection rates get close to the internal exons. Single exon detection rates are bad and about 10%.
We can visualize the detection percentages as follows.
In [13]:
# EvalMatch.get_detection_percentages returns a dataframe containing detection percentages
dp = PD.DataFrame({n:ems[n].get_detection_percentages().stack() for n in ems})
tgt = ['i','5','3','s','j']
dp1 = dp.ix[[(x,'%detected 1') for x in tgt]][si['name'].values]
dp1.index = tgt
ax = dp1.plot(kind='bar', legend=False, figsize=(6,4))
ax.set_ylabel('detection %')
ax.set_xlabel('exons/junctions')
ax.legend(loc='center left', bbox_to_anchor=(1,0.5))
Out[13]:
We can also plot the detection sensitivity against coverages. We expect the detection rate is better at the bigger coverage.
In [14]:
colors = ['ro-','go-','bo-','mo-']
for i,name in enumerate(ems):
em = ems[name]
if i==0:
axr = em.plot_sensitivity(color=colors[i])
else:
axr = em.plot_sensitivity(color=colors[i], axr=axr, ypos=i)
These plots show percent of detected exons calculated in bins sorted by coverage (binsize 100). Numbers in inset indicate normalized area under the curve (AUC, which ranges from 0 to 1). Dashed lines in 5',3' and single exon panels correspond to 5b, 3b, sb category.
How about the precision of the match? We can assess this by length ratio of matched exons:
In [15]:
for name in ems:
ems[name].plot_ratio()
Averages shown inside are geometric averages. Overall length tends to be a bit longer. (Matching of internal exons and junctions always produce ratio of 1, so they are not shown.)
For some genes, we see partial or disconnected reconstruction of the structure. An example is shown below:
These incomplete reconstructions are due to missing junctions or insufficient exon coverages.
To evaluate the completeness of the reconstruction, we calculate three measures. One is the ratio of gene length. We call this GLC (gene length completenesss). The second measure is the ratio of exon counts (ECC, exon count completeness), and the third is the ratio of junction counts (JCC, junction count completeness).
We can visualize these measures by plot_completeness method.
In [16]:
axr = ems[name].plot_completeness(pw='fdat', ca='g.-', cf='r-', cd='b.', xlim=[0,4])
Blue dots are genes, green dots are binned averages, red lines are sigmoid function fit to blue dots and red dashes are 99% position of the sigmoidal fit.
As expected, genes with low coverages tends to have incomplete reconstructions.
Below, we'll plot binned averages of completeness measures for all 4 samples.
In [17]:
colors = ['ro-','go-','bo-','mo-']
for i,name in enumerate(ems):
em = ems[name]
if i==0:
axr = em.plot_completeness(pw='a', ca=colors[i], xlim=(0,4), label=name, title='')
else:
axr = em.plot_completeness(pw='a', ca=colors[i], xlim=(0,4), axr=axr, label=name)
P.legend(bbox_to_anchor=(1,1),loc='upper left')
Out[17]:
Since individual cell types express only subsets of the genes, merging assembies is necessary to obtain a more complete picture of the transcriptomes. Moreover, merging should also improve the sensitivity and the completeness of the reconstruction.
For merging, we will repeat the assembly process but uses aggregated junctions and coverage bigwigs generated from all exons from all assemblies. We need to prepare these two kinds of files first. For this, we use MergeInputs class.
In [18]:
from jgem import merge as MG
fni = MG.MergeInputNames(sampleinfo=si, code='FevM', outdir='../tests/out')
mi = MG.MergeInputs(fnobj=fni, genome='mm10', np=3, th_detected=0, th_maxcnt1=0)
We need to assign sjexpre column to the sampleinfo:
In [20]:
# prefix to *.ex.txt.gz and *.sj.txt.gz files
si['sjexpre'] = [os.path.join('../tests/out/', x) for x in si['name']]
To create bigwigs:
In [21]:
mi.make_ex_bigwigs()
To make aggreagated junction files:
In [24]:
mi.make_sj_bed()
To make average bigwig coverage file (average of original individual bigwig coverages, different from bigwigs generated above by make_ex_bigwigs(), which are based on assembled exons):
In [25]:
mi.aggregate_bigwigs() #~ 1 min
Output files are:
In [26]:
ls ../tests/out/FevM*
Bigwigs generated from assembly exons are separated into multi-exon positive strand, negative strand and single exons:
.ex.mep.bw
.ex.men.bw
.ex.se.bw
Similarly junctions are also separated into two strands:
.sj.p.bed.gz
.sj.n.bed.gz
Averaged bigwig is:
.allsample.bw:
Other files contain stats.
Once we have inputs prepared, we use MergeAssemble class to do the merging.
In [27]:
fna = MG.MergeAssemblyNames('FevA', outdir='../tests/out')
ma = MG.MergeAssemble(fni,fna,np=3,saveintermediates=False)
In [28]:
ma.assemble()
Output files are:
In [30]:
ls ../tests/out/FevA*
Looking at .genes.bed file with IGV, we find previously incomplete reconstructions are now merged into one complete structure:
To evaluate the merging quantitatively, we use EvalMatch class as before:
In [31]:
ems2 = {} # keep EvalMatch object
en_m = EV.EvalNames('../tests/out/FevA', code='FevA', outdir='../tests/out/')
# for each sample data generate EvalMatch between the merged assembly and the gencode4
for scode, name, bwfile, sjfile in si[['scode','name','bw_path','sjbed_path']].values:
print('process {0}...'.format(name))
datacode = 'd'+str(scode)
ems2[name] = EV.EvalMatch(en_gen4, en_m, bwfile, sjfile, datacode=datacode, binsize=100)
ems2[name].calculate()
To visualize detection percentage:
In [32]:
for n in ems2:
dp[n+'_merged'] = ems2[n].get_detection_percentages().stack()
tgt = ['i','5','3','s','j']
cols = list(si['name'])+[x+'_merged' for x in si['name']]
dp1 = dp.ix[[(x,'%detected 1') for x in tgt]][cols]
dp1.index = tgt
ax = dp1.plot(kind='bar', legend=False, figsize=(10,4))
ax.set_ylabel('detection %')
ax.set_xlabel('exons/junctions')
ax.legend(loc='center left', bbox_to_anchor=(1,0.5))
Out[32]:
So we have much better detection of internal exons (~10% improvement) and slight improvements in the detection of 5'exons, 3'exons and junctions.
We can also see the improvements in the gene structure reconstruction in the changes in GLC,ECC,JCC:
In [33]:
for i,name in enumerate(ems):
if i==0:
axr = ems[name].plot_completeness(pw='a', ca='bo-', xlim=(0,4), label=name, title='')
axr = ems2[name].plot_completeness(pw='a', ca='ro-', label=name+'_merged', axr=axr)
else:
axr = ems[name].plot_completeness(pw='a', ca='bo-', label=name, axr=axr)
axr = ems2[name].plot_completeness(pw='a', ca='ro-', label=name+'_merged', axr=axr)
P.legend(bbox_to_anchor=(1,1),loc='upper left')
Out[33]:
Once we have an assembly, we want to know which gene model corresponds to which known gene. We do this using Comparator class.
In [34]:
from jgem import annotate as AN
g4sjexpre = '../tests/data/assemblies/gencode.vM4.chr1'
mergepre = '../tests/out/FevA'
outdir = '../tests/out'
cref = AN.ComparatorNames(g4sjexpre, 'gen4', outdir)
ctgt = AN.ComparatorNames(mergepre, 'FevA', outdir)
cp = AN.Comparator(cref, ctgt, outdir, 'gene_name')
cp.annotate(overwrite=False)
When overwrite is True, it overwrites the original sj,ex files. If False, it creats files in the output direcotry:
In [35]:
ls ../tests/out/FevA.gen4*
Let's check the contents of the exon file:
In [36]:
ex = UT.read_pandas('../tests/out/FevA.gen4.ex.txt.gz')
ex.columns
Out[36]:
In [37]:
acols = ['chr','st','ed','strand','cat','gname','eknown_gen4','etcode_gen4',
'intergenic_gen4', 'ex_as_ovl_gen4', 'gene_as_ovl_gen4',
'gknown_gen4', 'gtcode_gen4','gen4_sym0']
print(ex.head(10)[acols])
Columns are:
cat: exon category (internal, 5',3',single)
gname: assembly gene name
eknown_gen4: exon known against gen4 (k or u)
etcode_gen4: exon code (k.me: known multi-exon, u.se:unknown single exon, etc.)
intergenic_gen4: whether the gene is intergenic
ex_as_ovl_gen4: exon overlaps with gen4 exon in antisense direction
gene_as_ovl_gen4: gene overlaps with gen4 gene in antisense direction
gtcode_gen4: similar to etcode_gen4
gen4_sym0: Gencode.vM4 gene symbol (0 indicates closest one)