In [1]:
import pandas as pd
import seaborn as sns

Why Don't Transcriptomes Finish Assembling and What Can Be Done About It?

a brief intro to transcriptome assembly, with preliminary work an ideas in the khmer software package, presented to the steely-eyed nucleotide hackers of the Lab for Data Intensive Biology.



"The assertion that most production software bugs are soft Heisenbugs that go away when you look at them is well known to systems programmers. Bohrbugs, like the Bohr atom, are solid, easily detected by standard techniques, and hence boring. But Heisenbugs may elude a bugcatcher for years of execution. Indeed, the bugcatcher may perturb the situation just enough to make the Heisenbug disappear."

- Jim Gray, June 1985


Jim Gray's amusing quote refers to building fault-tolerant systems (from a an oldie-but-a-goodie that I recommend reading), not assembling transcriptomes. There is, however, a relation: many of us have run into the maddening situation where an otherwise jolly assembly routine decides to fall over three-fourths of the way through its three-day ultramarathon.

**How can we finish assembling?**

Just Don't.




Stream, Baby, Stream

Let's just do streaming assembly instead!

Most of us are likely familiar with, or have at least heard, someone in the lab (probably Titus) ramble on (probably Titus) about streaming algorithms. Let's get a more precise definition.


Streaming algorithms:

  • Operate on sequences of data too big to fit in memory,
  • make a single pass over that data$^1$,
  • and generally, produce approximate solutions $^{2}$.

  1. or sometimes, a small number of passes.
  2. slides from Piot Indyk at Rice Uni.: https://people.csail.mit.edu/indyk/ita-web.pdf </small>

Digital Normalization

As an example, consider diginorm:

  • Operates on a sequence of reads,
  • makes a single pass over those reads,
  • and produces an approximation of the total information in the reads$^1$.

  1. Note: here "approximate" refers to the variation resulting from false positives and order of observation, not the read sampling process. An "exact" diginorm algorithm would be two-pass and use an exact $k$-mer counting data structure. </small>

Transcriptome Assembly

Our goal then is to produce a streaming transcriptome assembler. Such an assembler will make a single pass over the reads and generate an approximate set of transcripts.

Note that now, we will have variation based on false positives (though that portion will be minimal), the order of observation, and the amount of local information available when assembly is triggered.

How to Assemble Things

Let's go over the traditional process for a de Buijn graph assembler.


The de Bruijn Graph

We build the de Bruijn graph by breaking down the input reads into $k$-mers.

![](http://genome.cshlp.org/content/20/9/1165/F2.large.jpg)

image source: http://genome.cshlp.org/content/20/9/1165.full

A de Druijn graph is built from the entire set of reads.

Then de Bruijn graph is walked to extract transcripts. This process uses a number of heuristics$^1$ to guide which branches to take, and hopefully, construct full length transcripts.


1. Recently, an information-optimal assembler was introduced which formalizes a theory of transcriptome assembly. We'll touch on this. Source: http://biorxiv.org/content/early/2016/02/09/039230

After this initial walk of the completed de Bruijn graph, there are usually some post-processing steps, such as resolving splice variants by mapping paired reads.

Of course, this process is not streaming!


Image source: http://www.nature.com/nbt/journal/v29/n7/full/nbt.1883.html


Graph Structures

The main challenges of transcriptome assembly, like genome assembly, are errors, heterozygosity, and repetitive sequence.

All of you have a fair idea of what these things are, so rather than go into them biologically, we're going to explore them via their graph structures.


The Linear Path


![](images/dbg-linear-path-notitle.svg)

The most basic graph component (other than the *node* and *edge*). Each exon and UTR is a linear path, flanked by one, two, or no high degree nodes (HDNs).

de Bruijn Graph Tip


![](images/dbg-tip-notitle.svg)

The tip results from a sequence error at the end of a read or, rarely, from a low-coverage SNP within $k$ bases of the beginning or end of a read. In our case, it is sometimes the result of a false positive.

de Bruijn Graph SNP Bubble


![](images/dbg-snp-bubble-notitle.svg)

A low coverage bubble is an error in the middle of a read; a high coverage bubble, a SNP. Bubbles should almost *never* form from a false positive, unless the bloom filter is ridiculously small.

Now we move on to repeats. Most are from alternative splicing, though some can result from low-complexity sequence.

Intra-transcript Repeat


![](images/intra-transcript-double-repeat-notitle.svg)

A single repeat within a transcript. Low-complexity sequence, or repetitive exons.

Intra-transcript Triple Repeat


![](images/intra-transcript-triple-repeat-notitle.svg)

Intra-transcript Interleaved Repeat


![](images/intra-transcript-interleaved-repeat-notitle.svg)

Inter-transcript Left-Sided Z Repeat


![](images/inter-transcript-left-sided-Z-notitle.svg)

*Inter*-transcript repeats generally result from splice variants. The nomenclature for the following three repeat classes comes from Kannan et al.'s paper "Shannon: An Information-Optimal de Novo RNA-Seq Assembler.

Inter-transcript Double-Sided Z Repeat


![](images/inter-transcript-double-sided-Z-notitle.svg)

Inter-transcript Circular Repeat


![](images/inter-transcript-circular-repeat-notitle.svg)

What About Ours?

Work on our streaming assembler is currently being done here: https://github.com/dib-lab/khmer/pull/1412

So far, we have a simple set of transcript assembly algorithms exposed to Python, which are triggered in a streaming script.


In [2]:
import khmer
from khmer.tests import khmer_tst_utils as test_utils
import itertools
import random
import screed

K = 21

In [3]:
def mutate_base(base):
    if base in 'AT':
        return random.choice('GC')
    elif base in 'GC':
        return random.choice('AT')
    else:
        assert False, 'bad base'


def mutate_sequence(sequence, N=1):
    sequence = list(sequence)
    positions = random.sample(range(len(sequence)), N)

    for i in positions:
        sequence[i] = mutate_base(sequence[i])

    return ''.join(sequence)


def mutate_position(sequence, pos):
    sequence = list(sequence)
    sequence[pos] = mutate_base(sequence[pos])
    return ''.join(sequence)

In [4]:
def reads(sequence, L=100, N=100):
    positions = list(range(len(sequence) - L))
    for i in range(N):
        start = random.choice(positions)
        yield sequence[start:start+L]


def kmers(sequence):
    for i in range(len(sequence)-K+1):
        yield sequence[i:i+K]

In [5]:
contig = list(screed.open(test_utils.get_test_data('simple-genome.fa')))[0].sequence
print(contig)


TGCCGCCCAGCACCGGGTGACTAGGTTGAGCCATGATTAACCTGCAATGAAGGTCATTCACACGCAGCGTCATTTAATGGATTGGTGCACACTTAACTGGGTGCCGCGCTGGTGCTGATCCATGAAGTTCATCTGGACTTGTACGTGCGACAGCTCCTTCCATTTCCGCCTTGCCATACAGACCACCTAAGACCGCAGACCCTCCTCCTTACCACATGCGATGCGTGGGAACCGGTGTCAAAGACGGGTGCCGCTACACAGGAAGGCACCCAGGGAAAGTCGTTTGCCGGAAGAGAGTGGAGCTCCTACGTAAACGGGGAAACCACTTGTTTGGATTCCCCCTTGCCGATTCGGCCCTATCAGGATGTATTTAACTTAGGAGAAACCGAACAACTGCCACCGCTTATTGCCCCGGCAGGCGGTAGTTTCCACGATCTAACAATCGAAGCAATTCGGACAGGCTTAAGCTACAAAGCTCGGATTTTGTAAGTGCTCTATCCTTTGTAGGAAGTGAAAGATGACGTTGCGGCCGTCGCTGTTGGAGGAACCGCAGCACCATGGCGCCTGTGCGAGCTGGAGATCCTCTCATAGCGTCAGAGCACGGGATGCTGTATATTAAGCACACAATAGCCCGGGGACCGGCCCCAACGTGAAATGCCTGGCCTGCCGTTCTTTATAGTGCTCGTGATAGTGTTATAAAGGAACTAACATCAAGTTATGTAAGGACTTTTACAATAGCGTGGTCCGTCAAGTCGTCCACGTGTGTAAATTCATTGGTACCTTTTGCCGAAAAATTTGAAAGCTAAGCACATTCTGCTTACTCACAGGGTAAGTTCCTGAAGTATTAATGTAATGTGGAAAGACAGGCATATGAACACTATTGGGCTTTGTAGACATTCCTCATCCATGCTGTATCAGTAATGTACAATTCGCCCCTTTCGTAAAGGAGAGCCGTGCTAACGTTATATTCGGTCTTACCACGGGCTCGATAGTTTGCCCC

In [6]:
K = 21
graph = khmer.Countgraph(K, 1e6, 4)
labeller = khmer._GraphLabels(graph)

graph.consume(contig)
#bubble = mutate_position(contig, 100)
tip = contig[100:100+K-1] + mutate_base(contig[100+K])

test_reads = list(itertools.chain(reads(contig), [tip]))
random.shuffle(test_reads)

for n, read in enumerate(test_reads):
    graph.consume(read)

    hdns = graph.find_high_degree_nodes(read)
    if hdns:
        print([khmer.reverse_hash(h, K) for h in list(hdns)])
    labeller.label_across_high_degree_nodes(read, hdns, n)

In [7]:
path = graph.assemble_linear_path(contig[:K])
print(path)
print(len(path))
print(contig[101-K:101])
print(bubble[101-K:101])


TGCCGCCCAGCACCGGGTGACTAGGTTGAGCCATGATTAACCTGCAATGAAGGTCATTCACACGCAGCGTCATTTAATGGATTGGTGCACACTTAACTGGGTGCCGCGCTGGTGCTGATCCATGAAGTTCATCTGGACTTGTACGTGCGACAGCTCCTTCCATTTCCGCCTTGCCATACAGACCACCTAAGACCGCAGACCCTCCTCCTTACCACATGCGATGCGTGGGAACCGGTGTCAAAGACGGGTGCCGCTACACAGGAAGGCACCCAGGGAAAGTCGTTTGCCGGAAGAGAGTGGAGCTCCTACGTAAACGGGGAAACCACTTGTTTGGATTCCCCCTTGCCGATTCGGCCCTATCAGGATGTATTTAACTTAGGAGAAACCGAACAACTGCCACCGCTTATTGCCCCGGCAGGCGGTAGTTTCCACGATCTAACAATCGAAGCAATTCGGACAGGCTTAAGCTACAAAGCTCGGATTTTGTAAGTGCTCTATCCTTTGTAGGAAGTGAAAGATGACGTTGCGGCCGTCGCTGTTGGAGGAACCGCAGCACCATGGCGCCTGTGCGAGCTGGAGATCCTCTCATAGCGTCAGAGCACGGGATGCTGTATATTAAGCACACAATAGCCCGGGGACCGGCCCCAACGTGAAATGCCTGGCCTGCCGTTCTTTATAGTGCTCGTGATAGTGTTATAAAGGAACTAACATCAAGTTATGTAAGGACTTTTACAATAGCGTGGTCCGTCAAGTCGTCCACGTGTGTAAATTCATTGGTACCTTTTGCCGAAAAATTTGAAAGCTAAGCACATTCTGCTTACTCACAGGGTAAGTTCCTGAAGTATTAATGTAATGTGGAAAGACAGGCATATGAACACTATTGGGCTTTGTAGACATTCCTCATCCATGCTGTATCAGTAATGTACAATTCGCCCCTTTCGTAAAGGAGAGCCGTGCTAACGTTATATTCGGTCTTACCACGGGCTCGATAGTTTGCCCC
1000
ATTGGTGCACACTTAACTGGG
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-7-5a68329b5cb6> in <module>()
      3 print(len(path))
      4 print(contig[101-K:101])
----> 5 print(bubble[101-K:101])

NameError: name 'bubble' is not defined

In [ ]:
paths = labeller.assemble_labeled_path(contig[:K])
print(*[str(len(p)) + ' ' + p for p in paths], sep='\n\n')