In [1]:
import pandas as pd
import seaborn as sns
a brief intro to transcriptome assembly, with preliminary work an ideas in the khmer software package, presented to the steely-eyed nucleotide hackers of the Lab for Data Intensive Biology.
"The assertion that most production software bugs are soft Heisenbugs that go away when you look at them is well known to systems programmers. Bohrbugs, like the Bohr atom, are solid, easily detected by standard techniques, and hence boring. But Heisenbugs may elude a bugcatcher for years of execution. Indeed, the bugcatcher may perturb the situation just enough to make the Heisenbug disappear."
Jim Gray's amusing quote refers to building fault-tolerant systems (from a an oldie-but-a-goodie that I recommend reading), not assembling transcriptomes. There is, however, a relation: many of us have run into the maddening situation where an otherwise jolly assembly routine decides to fall over three-fourths of the way through its three-day ultramarathon.
Streaming algorithms:
Note that now, we will have variation based on false positives (though that portion will be minimal), the order of observation, and the amount of local information available when assembly is triggered.
A de Druijn graph is built from the entire set of reads.
Then de Bruijn graph is walked to extract transcripts. This process uses a number of heuristics$^1$ to guide which branches to take, and hopefully, construct full length transcripts.
1. Recently, an information-optimal assembler was introduced which formalizes a theory of transcriptome assembly. We'll touch on this. Source: http://biorxiv.org/content/early/2016/02/09/039230
After this initial walk of the completed de Bruijn graph, there are usually some post-processing steps, such as resolving splice variants by mapping paired reads.
Of course, this process is not streaming!
Image source: http://www.nature.com/nbt/journal/v29/n7/full/nbt.1883.html
All of you have a fair idea of what these things are, so rather than go into them biologically, we're going to explore them via their graph structures.
The tip results from a sequence error at the end of a read or, rarely, from a low-coverage SNP within $k$ bases of the beginning or end of a read. In our case, it is sometimes the result of a false positive.
A low coverage bubble is an error in the middle of a read; a high coverage bubble, a SNP. Bubbles should almost *never* form from a false positive, unless the bloom filter is ridiculously small.
A single repeat within a transcript. Low-complexity sequence, or repetitive exons.
*Inter*-transcript repeats generally result from splice variants. The nomenclature for the following three repeat classes comes from Kannan et al.'s paper "Shannon: An Information-Optimal de Novo RNA-Seq Assembler.
Work on our streaming assembler is currently being done here: https://github.com/dib-lab/khmer/pull/1412
So far, we have a simple set of transcript assembly algorithms exposed to Python, which are triggered in a streaming script.
In [2]:
import khmer
from khmer.tests import khmer_tst_utils as test_utils
import itertools
import random
import screed
K = 21
In [3]:
def mutate_base(base):
if base in 'AT':
return random.choice('GC')
elif base in 'GC':
return random.choice('AT')
else:
assert False, 'bad base'
def mutate_sequence(sequence, N=1):
sequence = list(sequence)
positions = random.sample(range(len(sequence)), N)
for i in positions:
sequence[i] = mutate_base(sequence[i])
return ''.join(sequence)
def mutate_position(sequence, pos):
sequence = list(sequence)
sequence[pos] = mutate_base(sequence[pos])
return ''.join(sequence)
In [4]:
def reads(sequence, L=100, N=100):
positions = list(range(len(sequence) - L))
for i in range(N):
start = random.choice(positions)
yield sequence[start:start+L]
def kmers(sequence):
for i in range(len(sequence)-K+1):
yield sequence[i:i+K]
In [5]:
contig = list(screed.open(test_utils.get_test_data('simple-genome.fa')))[0].sequence
print(contig)
In [6]:
K = 21
graph = khmer.Countgraph(K, 1e6, 4)
labeller = khmer._GraphLabels(graph)
graph.consume(contig)
#bubble = mutate_position(contig, 100)
tip = contig[100:100+K-1] + mutate_base(contig[100+K])
test_reads = list(itertools.chain(reads(contig), [tip]))
random.shuffle(test_reads)
for n, read in enumerate(test_reads):
graph.consume(read)
hdns = graph.find_high_degree_nodes(read)
if hdns:
print([khmer.reverse_hash(h, K) for h in list(hdns)])
labeller.label_across_high_degree_nodes(read, hdns, n)
In [7]:
path = graph.assemble_linear_path(contig[:K])
print(path)
print(len(path))
print(contig[101-K:101])
print(bubble[101-K:101])
In [ ]:
paths = labeller.assemble_labeled_path(contig[:K])
print(*[str(len(p)) + ' ' + p for p in paths], sep='\n\n')