Tutorial

We will do the assembly and merging using the example data in the tests/data directory in the following order:

  1. Preparing sample info spreadsheet
  2. Preparing inputs
  3. Assembling individual samples
  4. Evalulation of individual assemblies
  5. Preparing for merge
  6. Merging
  7. Evaluation of the merged assembly
  8. Annotating merged assembly

The example data is the chromosome 1 portion of the RNASeq data from the neurons in superior central nucleus raphe, medial part (CSm) and dosal nucleus raphe (DR) in Fev-Cre mouse line which labels serotoninergic neurons. (Fev-Cre is also called ePet-Cre.)

First we will change the logging level for jupyter notebook and enable inline matplotlib figure:


In [1]:
# This is to change logging level of jupyter notebook
try:
    from importlib import reload # for python 3
except:
    pass
import logging
reload(logging)
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.INFO, datefmt='%I:%M:%S')

# This is to show matplotlib output in the notebook
%matplotlib inline
import matplotlib.pylab as P

Sample Info

For this tutorial, we will use the Excel file sampleinfo.xlsx in the tests/data directory. Following fields are required:

  • name: unique sample name
  • mapbed: name (or path) of BED file containing mapped reads
  • sjtab: name (or path) of STAR SJ.out.tab output file

In [2]:
import pandas as PD

si = PD.read_excel('../tests/data/sampleinfo.xlsx')
print(si[['name','mapbed','sjtab']])


                    name                             mapbed  \
0   Fev_DR_m70_1623.chr1   Fev_DR_m70_1623_star.chr1.bed.gz   
1  Fev_CSm_m70_1624.chr1  Fev_CSm_m70_1624_star.chr1.bed.gz   
2   Fev_DR_m71_1625.chr1   Fev_DR_m71_1625_star.chr1.bed.gz   
3  Fev_CSm_m71_1626.chr1  Fev_CSm_m71_1626_star.chr1.bed.gz   

                              sjtab  
0   Fev_DR_m70_1623.chr1.SJ.out.tab  
1  Fev_CSm_m70_1624.chr1.SJ.out.tab  
2   Fev_DR_m71_1625.chr1.SJ.out.tab  
3  Fev_CSm_m71_1626.chr1.SJ.out.tab  

The read mapping output from STAR is a BAM file. We convert the BAM file to BED file. You can do this using bedtools bamtobed command, or using the wrapper jgem.bedtools.bam2bed:

from jgem import bedtools as BT
BT.bam2bed(bam_path,bed_path)

STAR also creates a file named SJ.out.tab which contains junction information.

First, we'll create columns containing actual paths of these files in the sampleinfo dataframe.


In [3]:
import os

BASEDIR = '../tests/data'
si['bed_path'] = [os.path.join(BASEDIR, "BED", x) for x in si['mapbed']]
si['sjtab_path'] =  [os.path.join(BASEDIR, "SJ", x) for x in si['sjtab']]

print(si[['name','bed_path','sjtab_path']])


                    name                                           bed_path  \
0   Fev_DR_m70_1623.chr1  ../tests/data/BED/Fev_DR_m70_1623_star.chr1.be...   
1  Fev_CSm_m70_1624.chr1  ../tests/data/BED/Fev_CSm_m70_1624_star.chr1.b...   
2   Fev_DR_m71_1625.chr1  ../tests/data/BED/Fev_DR_m71_1625_star.chr1.be...   
3  Fev_CSm_m71_1626.chr1  ../tests/data/BED/Fev_CSm_m71_1626_star.chr1.b...   

                                          sjtab_path  
0   ../tests/data/SJ/Fev_DR_m70_1623.chr1.SJ.out.tab  
1  ../tests/data/SJ/Fev_CSm_m70_1624.chr1.SJ.out.tab  
2   ../tests/data/SJ/Fev_DR_m71_1625.chr1.SJ.out.tab  
3  ../tests/data/SJ/Fev_CSm_m71_1626.chr1.SJ.out.tab  

Preparing Inputs to the assembler

We need a BIGWIG file containing normalized coverages and an appropriately formatted BED file containing junction information for the inputs to the assembler.

To normalize coverage, we use average coverage which is the number of the total mapped base pairs divided by the number of covered base pairs. This can be calculated using jgem.bedtools.get_total_bp_bedfile:


In [4]:
# calculate total bp and covered bp, takes ~1min
from jgem import bedtools as BT
tot_cov = [BT.get_total_bp_bedfile(x) for x in si['bed_path']]
si['totbp'] = [x[0] for x in tot_cov]
si['covbp'] = [x[1] for x in tot_cov]
si['acov'] = si['totbp']/si['covbp']
print(si[['name','totbp','covbp','acov']])


                    name      totbp     covbp       acov
0   Fev_DR_m70_1623.chr1  145699916   9142973  15.935726
1  Fev_CSm_m70_1624.chr1  158992441  11276760  14.099124
2   Fev_DR_m71_1625.chr1  161359243  13946848  11.569585
3  Fev_CSm_m71_1626.chr1  161885717  11661445  13.882132

Now normalized BIGWIG coverage files can be generated using jgem.bedtools.bed2bw function:


In [5]:
# first prepare destination
si['bw_path'] = [os.path.join(BASEDIR, "bigwig/{0}.bw".format(x)) for x in si['name']]
# then make coverages BIGWIG from BED (~ 2 min)
from jgem import utils as UT
chromsizes = UT.chromsizes('mm10') # a file containing chromosome names and sizes
for bedfile, bwfile, acov in si[['bed_path','bw_path','acov']].values:
    scale = 1./acov
    BT.bed2bw(bedfile, chromsizes, bwfile, scale)

In [6]:
ls ../tests/data/bigwig/*chr1.bw


../tests/data/bigwig/Fev_CSm_m70_1624.chr1.bw
../tests/data/bigwig/Fev_CSm_m71_1626.chr1.bw
../tests/data/bigwig/Fev_DR_m70_1623.chr1.bw
../tests/data/bigwig/Fev_DR_m71_1625.chr1.bw

To generate junction files from SJ.out.tab, use function jgem.gtfgffbed.sjtab2sjbed:


In [7]:
# sjbed destination
si['sjbed_path'] = [os.path.join(BASEDIR, "SJ/{0}.sj.bed.gz".format(x)) for x in si['name']]
# SJ.out.tab => sjbed
from jgem import gtfgffbed as GGB
for sjtab, sjbed, acov in si[['sjtab_path','sjbed_path','acov']].values:
    scale = 1./acov
    GGB.sjtab2sjbed(sjtab,sjbed,scale)

In [8]:
ls ../tests/data/SJ/*chr1.sj.bed.gz


../tests/data/SJ/Fev_CSm_m70_1624.chr1.sj.bed.gz
../tests/data/SJ/Fev_CSm_m71_1626.chr1.sj.bed.gz
../tests/data/SJ/Fev_DR_m70_1623.chr1.sj.bed.gz
../tests/data/SJ/Fev_DR_m71_1625.chr1.sj.bed.gz

Let's check the content of generated SJBED file.


In [9]:
sj = GGB.read_sj(sjbed)
print(sj.head())


    chr       st       ed                 name      ucnt strand      mcnt
0  chr1  3121707  3142317   CT/AC-k0-u4-m0-o46  0.288140      -  0.000000
1  chr1  3207318  3213438  CT/AC-k0-u10-m0-o47  0.720350      -  0.000000
2  chr1  3271573  3302745   GT/AG-k0-u0-m1-o37  0.000000      +  0.072035
3  chr1  3421902  3670551  CT/AC-k1-u20-m0-o43  1.440701      -  0.000000
4  chr1  3660890  3670551   CT/AC-k0-u2-m0-o41  0.144070      -  0.000000

SJBED file format

As shown above, the tab separated junction file should contain:

  • chr: chromosome name
  • st: start position
  • ed: end position
  • name: junction name
  • ucnt: unique reads (normalized to average coverage)
  • strand: strand of junction
  • mcnt: non-unique (multimapping) reads (normalized)

Assembling individual samples

Use FileNames class to specify inputs (bigwig file and sjbed file) and output directory, then use Assembler class to do the assembly.


In [10]:
# assemble 4 samples takes about ~10 min 
from jgem import filenames as FN
from jgem import assembler as AS
for name,bwfile,sjfile in si[['name','bw_path','sjbed_path']].values:
    fn = FN.FileNames(name, bwfile, sjfile, outdir='../tests/out')
    p = AS.Assembler(fn,saveintermediates=False)
    p.assemble()


INFO:Executing SELECTSJ ====================
INFO: #sj:7660=>7649
INFO: time: 0.572s
INFO:Executing REMOVEJIE ====================
INFO: #sj:7649=>7639, jie 10
INFO: time: 1.903s
INFO:Executing SJ2EX ====================
INFO:#sj:7639=>7639 after ureadth, mreadth
INFO: #ex:8126, #sj:7639
INFO: time: 0.110s
INFO:Executing MERGEEXONS ====================
INFO: #ex:8204
INFO: time: 0.203s
INFO:Executing FINDEDGES ====================
INFO: time: 1.754s
INFO:Executing FIXSTRAND ====================
INFO: time: 0.429s
INFO:Executing EDGEFIXER ====================
INFO: time: 15.209s
INFO:Executing FINDIRETS ====================
INFO:#irets candidates:170
INFO:168 irets found
INFO: time: 6.295s
INFO:Executing FINDSECOVTH ====================
INFO: SECOVTH=0.7479972670226993
INFO: time: 6.432s
INFO:Executing FINDSE ====================
INFO:#candidate SE:10810
INFO:  processing 10810 SE candidates
INFO:#candidate SE after _fix_se_ext:10712
INFO:write exons ...=> assemble.exons0.txt.gz
INFO: time: 55.827s
INFO:Executing CALCCOV ====================
INFO: time: 3.515s
INFO:Executing SETINFO ====================
INFO: time: 0.445s
INFO:Executing FINDGENES ====================
INFO:finding genes (connected components)...
INFO: time: 2.059s
INFO:assigning gidx...
INFO: #genes:10887
INFO: time: 2.128s
INFO:Executing SELECTSEME ====================
INFO:ex(9835) sj(7639) ==> ex(9073) sj(7639)
INFO:se(438) => se(385) 53 removed
INFO: time: 0.846s
INFO:Executing FIXEDGES2 ====================
INFO: #ae:10056
INFO: time: 9.988s
INFO:Executing CONSISTENTSJ ====================
INFO:  #sj 7639 => 7257
INFO: time: 0.010s
INFO:Executing WRITESJEX ====================
INFO: time: 0.980s
INFO:Executing WRITEGENES ====================
INFO: writing bed12 genes...
INFO: time: 3.860s
INFO:Executing SELECTSJ ====================
INFO: #sj:8150=>8131
INFO: time: 0.625s
INFO:Executing REMOVEJIE ====================
INFO: #sj:8131=>8120, jie 11
INFO: time: 1.970s
INFO:Executing SJ2EX ====================
INFO:#sj:8120=>8120 after ureadth, mreadth
INFO: #ex:8611, #sj:8120
INFO: time: 0.108s
INFO:Executing MERGEEXONS ====================
INFO: #ex:8729
INFO: time: 0.204s
INFO:Executing FINDEDGES ====================
INFO: time: 1.896s
INFO:Executing FIXSTRAND ====================
INFO: time: 0.347s
INFO:Executing EDGEFIXER ====================
INFO: time: 16.161s
INFO:Executing FINDIRETS ====================
INFO:#irets candidates:242
INFO:249 irets found
INFO: time: 7.519s
INFO:Executing FINDSECOVTH ====================
INFO: SECOVTH=0.7762553051224985
INFO: time: 7.387s
INFO:Executing FINDSE ====================
INFO:#candidate SE:13612
INFO:  processing 13612 SE candidates
INFO:#candidate SE after _fix_se_ext:13503
INFO:write exons ...=> assemble.exons0.txt.gz
INFO: time: 73.385s
INFO:Executing CALCCOV ====================
INFO: time: 3.893s
INFO:Executing SETINFO ====================
INFO: time: 0.472s
INFO:Executing FINDGENES ====================
INFO:finding genes (connected components)...
INFO: time: 2.273s
INFO:assigning gidx...
INFO: #genes:13357
INFO: time: 2.357s
INFO:Executing SELECTSEME ====================
INFO:ex(10653) sj(8120) ==> ex(9871) sj(8120)
INFO:se(350) => se(303) 47 removed
INFO: time: 0.869s
INFO:Executing FIXEDGES2 ====================
INFO: #ae:11025
INFO: time: 11.890s
INFO:Executing CONSISTENTSJ ====================
INFO:  #sj 8120 => 7729
INFO: time: 0.011s
INFO:Executing WRITESJEX ====================
INFO: time: 1.039s
INFO:Executing WRITEGENES ====================
INFO: writing bed12 genes...
INFO: time: 3.570s
INFO:Executing SELECTSJ ====================
INFO: #sj:8445=>8430
INFO: time: 0.623s
INFO:Executing REMOVEJIE ====================
INFO: #sj:8430=>8417, jie 13
INFO: time: 2.092s
INFO:Executing SJ2EX ====================
INFO:#sj:8417=>8417 after ureadth, mreadth
INFO: #ex:8920, #sj:8417
INFO: time: 0.108s
INFO:Executing MERGEEXONS ====================
INFO: #ex:9014
INFO: time: 0.215s
INFO:Executing FINDEDGES ====================
INFO: time: 2.139s
INFO:Executing FIXSTRAND ====================
INFO: time: 0.443s
INFO:Executing EDGEFIXER ====================
INFO: time: 19.034s
INFO:Executing FINDIRETS ====================
INFO:#irets candidates:311
INFO:345 irets found
INFO: time: 8.503s
INFO:Executing FINDSECOVTH ====================
INFO: SECOVTH=0.8155224727141477
INFO: time: 7.565s
INFO:Executing FINDSE ====================
INFO:#candidate SE:18923
INFO:  processing 18923 SE candidates
INFO:#candidate SE after _fix_se_ext:18784
INFO:write exons ...=> assemble.exons0.txt.gz
INFO: time: 112.036s
INFO:Executing CALCCOV ====================
INFO: time: 4.803s
INFO:Executing SETINFO ====================
INFO: time: 0.578s
INFO:Executing FINDGENES ====================
INFO:finding genes (connected components)...
INFO: time: 2.722s
INFO:assigning gidx...
INFO: #genes:18181
INFO: time: 2.830s
INFO:Executing SELECTSEME ====================
INFO:ex(11681) sj(8417) ==> ex(10809) sj(8417)
INFO:se(414) => se(338) 76 removed
INFO: time: 0.948s
INFO:Executing FIXEDGES2 ====================
INFO: #ae:12296
INFO: time: 14.969s
INFO:Executing CONSISTENTSJ ====================
INFO:  #sj 8417 => 7980
INFO: time: 0.013s
INFO:Executing WRITESJEX ====================
INFO: time: 1.118s
INFO:Executing WRITEGENES ====================
INFO: writing bed12 genes...
INFO: time: 4.214s
INFO:Executing SELECTSJ ====================
INFO: #sj:8009=>7999
INFO: time: 0.614s
INFO:Executing REMOVEJIE ====================
INFO: #sj:7999=>7994, jie 5
INFO: time: 2.482s
INFO:Executing SJ2EX ====================
INFO:#sj:7994=>7994 after ureadth, mreadth
INFO: #ex:8502, #sj:7994
INFO: time: 0.103s
INFO:Executing MERGEEXONS ====================
INFO: #ex:8587
INFO: time: 0.190s
INFO:Executing FINDEDGES ====================
INFO: time: 2.106s
INFO:Executing FIXSTRAND ====================
INFO: time: 0.340s
INFO:Executing EDGEFIXER ====================
INFO: time: 18.487s
INFO:Executing FINDIRETS ====================
INFO:#irets candidates:206
INFO:204 irets found
INFO: time: 7.766s
INFO:Executing FINDSECOVTH ====================
INFO: SECOVTH=0.8409911574568218
INFO: time: 7.566s
INFO:Executing FINDSE ====================
INFO:#candidate SE:15096
INFO:  processing 15096 SE candidates
INFO:#candidate SE after _fix_se_ext:14951
INFO:write exons ...=> assemble.exons0.txt.gz
INFO: time: 87.202s
INFO:Executing CALCCOV ====================
INFO: time: 4.506s
INFO:Executing SETINFO ====================
INFO: time: 0.524s
INFO:Executing FINDGENES ====================
INFO:finding genes (connected components)...
INFO: time: 2.601s
INFO:assigning gidx...
INFO: #genes:14751
INFO: time: 2.687s
INFO:Executing SELECTSEME ====================
INFO:ex(10698) sj(7994) ==> ex(9844) sj(7994)
INFO:se(456) => se(403) 53 removed
INFO: time: 0.885s
INFO:Executing FIXEDGES2 ====================
INFO: #ae:11100
INFO: time: 12.982s
INFO:Executing CONSISTENTSJ ====================
INFO:  #sj 7994 => 7567
INFO: time: 0.011s
INFO:Executing WRITESJEX ====================
INFO: time: 1.062s
INFO:Executing WRITEGENES ====================
INFO: writing bed12 genes...
INFO: time: 4.381s

Generated figures visualize the process of finding coverage threshold for single exons. (See ...)

In the output directory, several files are generated:


In [11]:
ls ../tests/out/Fev_DR_m70_1623*


../tests/out/Fev_DR_m70_1623.chr1.assemble.params.txt.gz
../tests/out/Fev_DR_m70_1623.chr1.assemble.stats.txt
../tests/out/Fev_DR_m70_1623.chr1.ci.txt.gz
../tests/out/Fev_DR_m70_1623.chr1.ex.bed.gz
../tests/out/Fev_DR_m70_1623.chr1.ex.txt.gz
../tests/out/Fev_DR_m70_1623.chr1.findsecovth.params.txt
../tests/out/Fev_DR_m70_1623.chr1.findsecovth.pdf
../tests/out/Fev_DR_m70_1623.chr1.genes.bed.gz
../tests/out/Fev_DR_m70_1623.chr1.selectsj.sj2.txt.gz
../tests/out/Fev_DR_m70_1623.chr1.sj.bed.gz
../tests/out/Fev_DR_m70_1623.chr1.sj.txt.gz

Main outputs are files with .ex.txt.gz, .sj.txt.gz suffixes. These contain exons and junctions in tab separated format. BED files are also generated for viewing elements (genes,exons,junctions) with browsers like IGV (https://www.broadinstitute.org/igv/). The screenshot below is a segment of chromosome 1 showing genes.bed files from the 4 samples.

Other files (assemble.params.txt.gz, assemble.stats.txt.gz, findsecovth.params.txt.gz, findsecovth.pdf) contain parameters and statistics of the assembly.

Evaluation of individual assemblies

We will evaluate the assemblies against Gencode.VM4 annotation. For this, we use EvalNames class and EvalMatch class.


In [12]:
# Evaluate against Gencode.VM4 (~1 min)
from jgem import evaluate as EV
ens = {} # keep EvalNames objects
ems = {} # keep EvalMatch object
# Gencode.VM4
en_gen4 = EV.EvalNames('../tests/data/assemblies/gencode.vM4.chr1', code='gen4', outdir='../tests/out/')
# assign sample code
si['scode'] = [x.split('_')[-1].split('.')[0] for x in si['name']]
# For each sample
for scode, name, bwfile, sjfile in si[['scode','name','bw_path','sjbed_path']].values:
    print('process {0}...'.format(name))
    sjexbase = '../tests/out/'+name # prefix to .ex.txt.gz and .sj.txt.gz files
    assemblycode = 'a'+str(scode) # assembly identifier
    datacode = 'd'+str(scode) # data identifier
    ens[name] = EV.EvalNames(sjexbase,  code=assemblycode, outdir='../tests/out/') 
    ems[name] = EV.EvalMatch(en_gen4, ens[name], bwfile, sjfile, datacode=datacode, binsize=100) 
    ems[name].calculate()


process Fev_DR_m70_1623.chr1...
INFO:[i] detected1:7766,	matched:5583,	(detected2:5820),	ratio:0.72,	(ratio2:0.96)
INFO:[5] detected1:2341,	matched:1116,	(detected2:1536),	ratio:0.48,	(ratio2:0.73)
INFO:[3] detected1:2607,	matched:1139,	(detected2:1802),	ratio:0.44,	(ratio2:0.63)
INFO:[s] detected1:894,	matched:95,	(detected2:385),	ratio:0.11,	(ratio2:0.25)
INFO:[j] detected1:6892,	matched:6665,	(detected2:7251),	ratio:0.97,	(ratio2:0.92)
INFO:[5b] detected1:2341,	matched:1573,	(detected2:9543),	ratio:0.67,	(ratio2:0.16)
INFO:[3b] detected1:2607,	matched:1706,	(detected2:9543),	ratio:0.65,	(ratio2:0.18)
INFO:[sb] detected1:894,	matched:262,	(detected2:9543),	ratio:0.29,	(ratio2:0.03)
process Fev_CSm_m70_1624.chr1...
INFO:[i] detected1:7990,	matched:5858,	(detected2:6163),	ratio:0.73,	(ratio2:0.95)
INFO:[5] detected1:2418,	matched:1163,	(detected2:1736),	ratio:0.48,	(ratio2:0.67)
INFO:[3] detected1:2734,	matched:1229,	(detected2:2075),	ratio:0.45,	(ratio2:0.59)
INFO:[s] detected1:961,	matched:91,	(detected2:299),	ratio:0.09,	(ratio2:0.30)
INFO:[j] detected1:7153,	matched:6948,	(detected2:7722),	ratio:0.97,	(ratio2:0.90)
INFO:[5b] detected1:2418,	matched:1634,	(detected2:10273),	ratio:0.68,	(ratio2:0.16)
INFO:[3b] detected1:2734,	matched:1832,	(detected2:10273),	ratio:0.67,	(ratio2:0.18)
INFO:[sb] detected1:961,	matched:269,	(detected2:10273),	ratio:0.28,	(ratio2:0.03)
process Fev_DR_m71_1625.chr1...
INFO:[i] detected1:8259,	matched:5950,	(detected2:6327),	ratio:0.72,	(ratio2:0.94)
INFO:[5] detected1:2526,	matched:1185,	(detected2:2032),	ratio:0.47,	(ratio2:0.58)
INFO:[3] detected1:2887,	matched:1318,	(detected2:2544),	ratio:0.46,	(ratio2:0.52)
INFO:[s] detected1:1004,	matched:90,	(detected2:334),	ratio:0.09,	(ratio2:0.27)
INFO:[j] detected1:7336,	matched:7107,	(detected2:7974),	ratio:0.97,	(ratio2:0.89)
INFO:[5b] detected1:2526,	matched:1668,	(detected2:11237),	ratio:0.66,	(ratio2:0.15)
INFO:[3b] detected1:2887,	matched:1910,	(detected2:11237),	ratio:0.66,	(ratio2:0.17)
INFO:[sb] detected1:1004,	matched:273,	(detected2:11237),	ratio:0.27,	(ratio2:0.02)
process Fev_CSm_m71_1626.chr1...
INFO:[i] detected1:7985,	matched:5725,	(detected2:6018),	ratio:0.72,	(ratio2:0.95)
INFO:[5] detected1:2462,	matched:1210,	(detected2:1837),	ratio:0.49,	(ratio2:0.66)
INFO:[3] detected1:2750,	matched:1232,	(detected2:2132),	ratio:0.45,	(ratio2:0.58)
INFO:[s] detected1:969,	matched:88,	(detected2:401),	ratio:0.09,	(ratio2:0.22)
INFO:[j] detected1:7085,	matched:6840,	(detected2:7559),	ratio:0.97,	(ratio2:0.90)
INFO:[5b] detected1:2462,	matched:1692,	(detected2:10388),	ratio:0.69,	(ratio2:0.16)
INFO:[3b] detected1:2750,	matched:1823,	(detected2:10388),	ratio:0.66,	(ratio2:0.18)
INFO:[sb] detected1:969,	matched:256,	(detected2:10388),	ratio:0.26,	(ratio2:0.02)

INFO outputs are as follows:

  • detected1: counts of gencode exons/junctions with positive coverage
  • matched: number of matched exons/junctions
  • detected2: counts of elements in the assembly
  • ratio: matched/detected1
  • ratio2: matched/detected2

These are assesed within different exon classes:

  • i: internal exon (exons bounded by splice junctions)
  • 5: 5'exons
  • 3: 3'exons
  • s: single exons
  • j: junctions

Classes with b, (5b,3b,sb) allow matching to different classes of exons.

Matching exons with junction boundary requires the boundaries to match. That is, to match internal exons, they have to be exactly the same (including strand). To match 5'exons, donor site has to the same but the other side does not have to be. To match single exon, they just need to have some overlap.

So, we have quite good detection of junctions. About 70% of internal exons are detected. 5',3' exons are worse (< 50%) but if allowed to match to other class of exons, then detection rates get close to the internal exons. Single exon detection rates are bad and about 10%.

We can visualize the detection percentages as follows.


In [13]:
# EvalMatch.get_detection_percentages returns a dataframe containing detection percentages 
dp = PD.DataFrame({n:ems[n].get_detection_percentages().stack() for n in ems})
tgt = ['i','5','3','s','j']
dp1 = dp.ix[[(x,'%detected 1') for x in tgt]][si['name'].values]
dp1.index = tgt
ax = dp1.plot(kind='bar', legend=False, figsize=(6,4))
ax.set_ylabel('detection %')
ax.set_xlabel('exons/junctions')
ax.legend(loc='center left', bbox_to_anchor=(1,0.5))


Out[13]:
<matplotlib.legend.Legend at 0x110d903c8>

We can also plot the detection sensitivity against coverages. We expect the detection rate is better at the bigger coverage.


In [14]:
colors = ['ro-','go-','bo-','mo-']
for i,name in enumerate(ems):
    em = ems[name]
    if i==0:
        axr = em.plot_sensitivity(color=colors[i])
    else:
        axr = em.plot_sensitivity(color=colors[i], axr=axr, ypos=i)


These plots show percent of detected exons calculated in bins sorted by coverage (binsize 100). Numbers in inset indicate normalized area under the curve (AUC, which ranges from 0 to 1). Dashed lines in 5',3' and single exon panels correspond to 5b, 3b, sb category.

How about the precision of the match? We can assess this by length ratio of matched exons:


In [15]:
for name in ems:
    ems[name].plot_ratio()


Averages shown inside are geometric averages. Overall length tends to be a bit longer. (Matching of internal exons and junctions always produce ratio of 1, so they are not shown.)

For some genes, we see partial or disconnected reconstruction of the structure. An example is shown below:

These incomplete reconstructions are due to missing junctions or insufficient exon coverages.

To evaluate the completeness of the reconstruction, we calculate three measures. One is the ratio of gene length. We call this GLC (gene length completenesss). The second measure is the ratio of exon counts (ECC, exon count completeness), and the third is the ratio of junction counts (JCC, junction count completeness).

We can visualize these measures by plot_completeness method.


In [16]:
axr = ems[name].plot_completeness(pw='fdat', ca='g.-', cf='r-',  cd='b.', xlim=[0,4])


Blue dots are genes, green dots are binned averages, red lines are sigmoid function fit to blue dots and red dashes are 99% position of the sigmoidal fit.

As expected, genes with low coverages tends to have incomplete reconstructions.

Below, we'll plot binned averages of completeness measures for all 4 samples.


In [17]:
colors = ['ro-','go-','bo-','mo-']
for i,name in enumerate(ems):
    em = ems[name]
    if i==0:
        axr = em.plot_completeness(pw='a', ca=colors[i], xlim=(0,4), label=name, title='')
    else:
        axr = em.plot_completeness(pw='a', ca=colors[i], xlim=(0,4), axr=axr, label=name)
P.legend(bbox_to_anchor=(1,1),loc='upper left')


Out[17]:
<matplotlib.legend.Legend at 0x1118a5320>

Preparing for merge

Since individual cell types express only subsets of the genes, merging assembies is necessary to obtain a more complete picture of the transcriptomes. Moreover, merging should also improve the sensitivity and the completeness of the reconstruction.

For merging, we will repeat the assembly process but uses aggregated junctions and coverage bigwigs generated from all exons from all assemblies. We need to prepare these two kinds of files first. For this, we use MergeInputs class.


In [18]:
from jgem import merge as MG
fni = MG.MergeInputNames(sampleinfo=si, code='FevM', outdir='../tests/out')
mi = MG.MergeInputs(fnobj=fni, genome='mm10', np=3, th_detected=0, th_maxcnt1=0)

We need to assign sjexpre column to the sampleinfo:


In [20]:
# prefix to *.ex.txt.gz and *.sj.txt.gz files
si['sjexpre'] = [os.path.join('../tests/out/', x) for x in si['name']]

To create bigwigs:


In [21]:
mi.make_ex_bigwigs()


INFO:../tests/out/FevM.ex.mep.bed.gz:totbp=22889483,covbp=1785031,scale=0.0779847670652937
INFO:../tests/out/FevM.ex.men.bed.gz:totbp=35393861,covbp=2078926,scale=0.05873690920580832
INFO:../tests/out/FevM.ex.se.bed.gz:totbp=3037419,covbp=675251,scale=0.2223107842546583

To make aggreagated junction files:


In [24]:
mi.make_sj_bed()


INFO:selectsj: in 11398
INFO:selectsj: 107 smaller than th_ratio(0.001)
INFO:selectsj: 0 smaller than th_detected(0)
INFO:selectsj: 0 smaller than th_maxcnt1(0)
INFO:selectsj: 4861 larger than th_maxcnt2(1)
INFO:#selected SJ:11291<=11398

To make average bigwig coverage file (average of original individual bigwig coverages, different from bigwigs generated above by make_ex_bigwigs(), which are based on assembled exons):


In [25]:
mi.aggregate_bigwigs() #~ 1 min

Output files are:


In [26]:
ls ../tests/out/FevM*


../tests/out/FevM.allsample.bw        ../tests/out/FevM.ex.se.bw
../tests/out/FevM.allsj.stats.txt.gz  ../tests/out/FevM.sj.n.bed.gz
../tests/out/FevM.allsj.txt.gz        ../tests/out/FevM.sj.p.bed.gz
../tests/out/FevM.ex.men.bw           ../tests/out/FevM.sj0.bed.gz
../tests/out/FevM.ex.mep.bw           ../tests/out/FevM.sj2.txt.gz

Bigwigs generated from assembly exons are separated into multi-exon positive strand, negative strand and single exons:

.ex.mep.bw
.ex.men.bw
.ex.se.bw

Similarly junctions are also separated into two strands:

.sj.p.bed.gz
.sj.n.bed.gz

Averaged bigwig is:

.allsample.bw: 

Other files contain stats.

Merging

Once we have inputs prepared, we use MergeAssemble class to do the merging.


In [27]:
fna = MG.MergeAssemblyNames('FevA', outdir='../tests/out')
ma = MG.MergeAssemble(fni,fna,np=3,saveintermediates=False)

In [28]:
ma.assemble()


INFO:Executing CHECKSJSUPPORT ====================
INFO: #sj: 5348=>4828
INFO: time: 0.816s
INFO:Executing REMOVEJIE ====================
INFO: #sj:4828=>4828, jie 0
INFO: time: 0.326s
INFO:Executing SJ2EX ====================
INFO:#sj:4828=>4828 after ureadth, mreadth
INFO: #ex:4883, #sj:4828
INFO: time: 0.082s
INFO:Executing MERGEEXONS ====================
INFO: #ex:5708
INFO: time: 0.165s
INFO:Executing FINDEDGES2 ====================
INFO: time: 14.420s
INFO:Executing FIXSTRAND ====================
INFO: time: 0.223s
INFO:Executing FINDIRETS ====================
INFO:#irets candidates:155
INFO:226 irets found
INFO: time: 2.325s
INFO:Executing FIND53IR ====================
INFO:  calculating SECOV...
INFO:write exons ...=> assemble.exons0.txt.gz
INFO: time: 2.649s
INFO:Executing CALCCOV ====================
INFO: time: 0.765s
INFO:Executing SETINFO ====================
INFO: time: 0.199s
INFO:Executing FINDGENES ====================
INFO:finding genes (connected components)...
INFO: time: 1.528s
INFO:assigning gidx...
INFO: #genes:512
INFO: time: 1.559s
INFO:Executing SELECTSEME ====================
INFO:ex(6080) sj(4815) ==> ex(6010) sj(4815)
INFO:se(2) => se(2) 0 removed
INFO: time: 0.550s
INFO:Executing CONSISTENTSJ ====================
INFO:  #sj 4815 => 4780
INFO: time: 0.010s
INFO:Executing WRITESJEX ====================
INFO: time: 0.654s
INFO:Executing WRITEGENES ====================
INFO: writing bed12 genes...
INFO: time: 1.908s
INFO:Executing CHECKSJSUPPORT ====================
INFO: #sj: 5958=>5517
INFO: time: 0.776s
INFO:Executing REMOVEJIE ====================
INFO: #sj:5517=>5509, jie 8
INFO: time: 0.353s
INFO:Executing SJ2EX ====================
INFO:#sj:5509=>5509 after ureadth, mreadth
INFO: #ex:5482, #sj:5509
INFO: time: 0.077s
INFO:Executing MERGEEXONS ====================
INFO: #ex:6411
INFO: time: 0.168s
INFO:Executing FINDEDGES2 ====================
INFO: time: 15.376s
INFO:Executing FIXSTRAND ====================
INFO: time: 0.176s
INFO:Executing FINDIRETS ====================
INFO:#irets candidates:73
INFO:87 irets found
INFO: time: 1.860s
INFO:Executing FIND53IR ====================
INFO:  calculating SECOV...
INFO:write exons ...=> assemble.exons0.txt.gz
INFO: time: 2.905s
INFO:Executing CALCCOV ====================
INFO: time: 0.746s
INFO:Executing SETINFO ====================
INFO: time: 0.242s
INFO:Executing FINDGENES ====================
INFO:finding genes (connected components)...
INFO: time: 1.800s
INFO:assigning gidx...
INFO: #genes:525
INFO: time: 1.828s
INFO:Executing SELECTSEME ====================
INFO:ex(6790) sj(5492) ==> ex(6734) sj(5492)
INFO:se(4) => se(4) 0 removed
INFO: time: 0.611s
INFO:Executing CONSISTENTSJ ====================
INFO:  #sj 5492 => 5464
INFO: time: 0.008s
INFO:Executing WRITESJEX ====================
INFO: time: 0.605s
INFO:Executing WRITEGENES ====================
INFO: writing bed12 genes...
INFO: time: 2.038s
INFO:mep:6012, men:6738
INFO:n0:958, np:474, nn:484, np+nn:958

Output files are:


In [30]:
ls ../tests/out/FevA*


../tests/out/FevA.ci.txt.gz
../tests/out/FevA.ex.bed.gz
../tests/out/FevA.ex.txt.gz
../tests/out/FevA.genes.bed.gz
../tests/out/FevA.genes.txt.gz
../tests/out/FevA.men.assemble.stats.txt
../tests/out/FevA.mep.assemble.stats.txt
../tests/out/FevA.se.bed.gz
../tests/out/FevA.sj.bed.gz
../tests/out/FevA.sj.txt.gz

Looking at .genes.bed file with IGV, we find previously incomplete reconstructions are now merged into one complete structure:

To evaluate the merging quantitatively, we use EvalMatch class as before:


In [31]:
ems2 = {} # keep EvalMatch object 
en_m = EV.EvalNames('../tests/out/FevA',  code='FevA', outdir='../tests/out/') 
# for each sample data generate EvalMatch between the merged assembly and the gencode4
for scode, name, bwfile, sjfile in si[['scode','name','bw_path','sjbed_path']].values:
    print('process {0}...'.format(name))
    datacode = 'd'+str(scode)
    ems2[name] = EV.EvalMatch(en_gen4, en_m, bwfile, sjfile, datacode=datacode, binsize=100) 
    ems2[name].calculate()


process Fev_DR_m70_1623.chr1...
INFO:[i] detected1:7766,	matched:6347,	(detected2:7026),	ratio:0.82,	(ratio2:0.90)
INFO:[5] detected1:2341,	matched:1250,	(detected2:1506),	ratio:0.53,	(ratio2:0.83)
INFO:[3] detected1:2607,	matched:1293,	(detected2:1804),	ratio:0.50,	(ratio2:0.72)
INFO:[s] detected1:894,	matched:82,	(detected2:164),	ratio:0.09,	(ratio2:0.50)
INFO:[j] detected1:6892,	matched:6790,	(detected2:7373),	ratio:0.99,	(ratio2:0.92)
INFO:[5b] detected1:2341,	matched:1771,	(detected2:10500),	ratio:0.76,	(ratio2:0.17)
INFO:[3b] detected1:2607,	matched:1945,	(detected2:10500),	ratio:0.75,	(ratio2:0.19)
INFO:[sb] detected1:894,	matched:285,	(detected2:10500),	ratio:0.32,	(ratio2:0.03)
process Fev_CSm_m70_1624.chr1...
INFO:[i] detected1:7990,	matched:6447,	(detected2:7111),	ratio:0.81,	(ratio2:0.91)
INFO:[5] detected1:2418,	matched:1242,	(detected2:1541),	ratio:0.51,	(ratio2:0.81)
INFO:[3] detected1:2734,	matched:1325,	(detected2:1877),	ratio:0.48,	(ratio2:0.71)
INFO:[s] detected1:961,	matched:82,	(detected2:166),	ratio:0.09,	(ratio2:0.49)
INFO:[j] detected1:7153,	matched:7043,	(detected2:7806),	ratio:0.98,	(ratio2:0.90)
INFO:[5b] detected1:2418,	matched:1782,	(detected2:10695),	ratio:0.74,	(ratio2:0.17)
INFO:[3b] detected1:2734,	matched:2009,	(detected2:10695),	ratio:0.73,	(ratio2:0.19)
INFO:[sb] detected1:961,	matched:297,	(detected2:10695),	ratio:0.31,	(ratio2:0.03)
process Fev_DR_m71_1625.chr1...
INFO:[i] detected1:8259,	matched:6532,	(detected2:7244),	ratio:0.79,	(ratio2:0.90)
INFO:[5] detected1:2526,	matched:1262,	(detected2:1639),	ratio:0.50,	(ratio2:0.77)
INFO:[3] detected1:2887,	matched:1364,	(detected2:1985),	ratio:0.47,	(ratio2:0.69)
INFO:[s] detected1:1004,	matched:82,	(detected2:168),	ratio:0.08,	(ratio2:0.49)
INFO:[j] detected1:7336,	matched:7193,	(detected2:8045),	ratio:0.98,	(ratio2:0.89)
INFO:[5b] detected1:2526,	matched:1825,	(detected2:11036),	ratio:0.72,	(ratio2:0.17)
INFO:[3b] detected1:2887,	matched:2083,	(detected2:11036),	ratio:0.72,	(ratio2:0.19)
INFO:[sb] detected1:1004,	matched:289,	(detected2:11036),	ratio:0.29,	(ratio2:0.03)
process Fev_CSm_m71_1626.chr1...
INFO:[i] detected1:7985,	matched:6401,	(detected2:7075),	ratio:0.80,	(ratio2:0.90)
INFO:[5] detected1:2462,	matched:1306,	(detected2:1556),	ratio:0.53,	(ratio2:0.84)
INFO:[3] detected1:2750,	matched:1347,	(detected2:1902),	ratio:0.49,	(ratio2:0.71)
INFO:[s] detected1:969,	matched:82,	(detected2:167),	ratio:0.08,	(ratio2:0.49)
INFO:[j] detected1:7085,	matched:6956,	(detected2:7667),	ratio:0.98,	(ratio2:0.91)
INFO:[5b] detected1:2462,	matched:1855,	(detected2:10700),	ratio:0.75,	(ratio2:0.17)
INFO:[3b] detected1:2750,	matched:2036,	(detected2:10700),	ratio:0.74,	(ratio2:0.19)
INFO:[sb] detected1:969,	matched:284,	(detected2:10700),	ratio:0.29,	(ratio2:0.03)

To visualize detection percentage:


In [32]:
for n in ems2:
    dp[n+'_merged'] = ems2[n].get_detection_percentages().stack()
tgt = ['i','5','3','s','j']
cols = list(si['name'])+[x+'_merged' for x in si['name']]
dp1 = dp.ix[[(x,'%detected 1') for x in tgt]][cols]
dp1.index = tgt
ax = dp1.plot(kind='bar', legend=False, figsize=(10,4))
ax.set_ylabel('detection %')
ax.set_xlabel('exons/junctions')
ax.legend(loc='center left', bbox_to_anchor=(1,0.5))


Out[32]:
<matplotlib.legend.Legend at 0x12e85a160>

So we have much better detection of internal exons (~10% improvement) and slight improvements in the detection of 5'exons, 3'exons and junctions.

We can also see the improvements in the gene structure reconstruction in the changes in GLC,ECC,JCC:


In [33]:
for i,name in enumerate(ems):
    if i==0:
        axr = ems[name].plot_completeness(pw='a', ca='bo-', xlim=(0,4), label=name, title='')
        axr = ems2[name].plot_completeness(pw='a', ca='ro-', label=name+'_merged', axr=axr)
    else:
        axr = ems[name].plot_completeness(pw='a', ca='bo-', label=name, axr=axr)
        axr = ems2[name].plot_completeness(pw='a', ca='ro-', label=name+'_merged', axr=axr)
        
P.legend(bbox_to_anchor=(1,1),loc='upper left')


Out[33]:
<matplotlib.legend.Legend at 0x12cd28748>

Annotating assembly

Once we have an assembly, we want to know which gene model corresponds to which known gene. We do this using Comparator class.


In [34]:
from jgem import annotate as AN
g4sjexpre = '../tests/data/assemblies/gencode.vM4.chr1'
mergepre = '../tests/out/FevA'
outdir = '../tests/out'
cref = AN.ComparatorNames(g4sjexpre, 'gen4', outdir)
ctgt = AN.ComparatorNames(mergepre, 'FevA', outdir)
cp = AN.Comparator(cref, ctgt, outdir, 'gene_name')
cp.annotate(overwrite=False)

When overwrite is True, it overwrites the original sj,ex files. If False, it creats files in the output direcotry:


In [35]:
ls ../tests/out/FevA.gen4*


../tests/out/FevA.gen4.ex.txt.gz  ../tests/out/FevA.gen4.sj.txt.gz

Let's check the contents of the exon file:


In [36]:
ex = UT.read_pandas('../tests/out/FevA.gen4.ex.txt.gz')
ex.columns


Out[36]:
Index(['chr', 'st', 'ed', 'name', 'sc1', 'strand', '_id', '_gidx', 'gname',
       'cat', 'ptyp', 'cov', 'len', 'a_id', 'd_id', 'a_degree', 'd_degree',
       'a_pos', 'd_pos', 'tlen', 'glen', 'ecov', 'gcov', 'eknown_gen4',
       'etcode_gen4', 'intergenic_gen4', 'ex_as_ovl_gen4', 'gene_as_ovl_gen4',
       'gknown_gen4', 'gtcode_gen4', 'gen4_gidxs', 'gen4_gidx0', 'gen4_syms',
       'gen4_sym0'],
      dtype='object')

In [37]:
acols = ['chr','st','ed','strand','cat','gname','eknown_gen4','etcode_gen4', 
         'intergenic_gen4', 'ex_as_ovl_gen4', 'gene_as_ovl_gen4',
         'gknown_gen4', 'gtcode_gen4','gen4_sym0']
print(ex.head(10)[acols])


    chr       st       ed strand cat   gname eknown_gen4 etcode_gen4  \
0  chr1  3199728  3207317      -   3     NG2           k        k.me   
1  chr1  3213438  3216968      -   i     NG2           k        k.me   
2  chr1  3284679  3285150      .   s  S50000           u        u.se   
3  chr1  3421578  3421901      -   3     NG2           k        k.me   
4  chr1  3421637  3421901      -   3     NG2           k        k.me   
5  chr1  3421701  3421901      -   i     NG2           k        k.me   
6  chr1  3660016  3660889      -   3     NG2           u        u.me   
7  chr1  3670551  3671205      -   5     NG2           k        k.me   
8  chr1  4771187  4772555      .   s  S50001           k        k.se   
9  chr1  4776348  4776801      -   3     NG4           k        k.me   

  intergenic_gen4 ex_as_ovl_gen4 gene_as_ovl_gen4 gknown_gen4 gtcode_gen4  \
0               n              n                n           k        k.me   
1               n              n                n           k        k.me   
2               n              n                n           u        u.se   
3               n              n                n           k        k.me   
4               n              n                n           k        k.me   
5               n              n                n           k        k.me   
6               n              n                n           k        k.me   
7               n              n                n           k        k.me   
8               n              n                n           k        k.se   
9               n              n                n           k        k.me   

      gen4_sym0  
0          Xkr4  
1          Xkr4  
2           NaN  
3          Xkr4  
4          Xkr4  
5          Xkr4  
6          Xkr4  
7          Xkr4  
8  RP23-34E15.4  
9        Mrpl15  

Columns are:

cat: exon category (internal, 5',3',single)
gname: assembly gene name
eknown_gen4: exon known against gen4 (k or u)
etcode_gen4: exon code (k.me: known multi-exon, u.se:unknown single exon, etc.)
intergenic_gen4: whether the gene is intergenic 
ex_as_ovl_gen4: exon overlaps with gen4 exon in antisense direction
gene_as_ovl_gen4: gene overlaps with gen4 gene in antisense direction
gtcode_gen4: similar to etcode_gen4
gen4_sym0: Gencode.vM4 gene symbol (0 indicates closest one)