Theoretical Efficiency of Read Until Enrichment

The "Read Until" feature of Oxford Nanopore sequencing lets a program watch the data arriving from each pore and, depending on that data, reject the molecule currently inside that pore.

The actual performance of such a method depends on a lot of factors:

  • ratio of desirable to undesirable molecules in the sample
  • accuracy of detection
  • length of event data necessary for the decision
  • latency of event data reaching the controlling program
  • delay between decision and ejecting the molecule
  • time until the pore can accept a new molecule
  • length of DNA strands in the sample

In this notebook I boiled it down to three parameters:

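  • ham_frequency is the fraction of molecules in the sample that are desired
  • ham_duration is the factor by which an accepted molecule is read longer than a rejected one (a rejected read has duration 1)
  • accuracy is the probability that a molecule is classified correctly, assumed equal for desired and undesired molecules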
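
To make the link between the factors listed earlier and ham_duration concrete, here is a minimal sketch of one way the timing factors could be folded into that single ratio. The function name, its parameters, and the numbers are all hypothetical, chosen only for illustration; they are not measured values and not part of any Read Until API.

```python
def estimate_ham_duration(strand_bases, bases_per_s, decision_bases,
                          latency_s, eject_delay_s, recovery_s):
    """Ratio of pore time for a fully read strand to pore time for a rejected one.

    All parameter names are hypothetical; this is a toy timing model.
    """
    # A fully read strand occupies the pore for its whole length, then the pore recovers
    full_read_s = strand_bases / bases_per_s + recovery_s
    # A rejected strand costs: data for the decision + data latency + ejection delay + recovery
    rejected_s = decision_bases / bases_per_s + latency_s + eject_delay_s + recovery_s
    return full_read_s / rejected_s


# Illustrative numbers only, not measured values
example = estimate_ham_duration(
    strand_bases=8000, bases_per_s=400, decision_bases=200,
    latency_s=0.3, eject_delay_s=0.1, recovery_s=0.1,
)
print("estimated ham_duration: %.1f" % example)
```

With these made-up numbers a full read takes about 20 s and a rejection about 1 s, so ham_duration comes out at roughly 20. Doubling the strand length roughly doubles the ratio, while every extra fraction of a second spent on the decision shrinks it.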

The analogy to spam detection is chosen because "ham/spam" makes for catchier variable names. This computation considers time and "amount of data" as equivalent. In reality, event speeds vary a lot, but in the long run, duration of reads and length of the strands correlate very strongly.

The result of the computation is the ratio of desired time/data over total time/data, which is hopefully higher than the original ham_frequency.
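
Written out, the quantity being computed has a simple closed form. The shorthand symbols are my own and not part of the notebook's code: f stands for ham_frequency, d for ham_duration and a for accuracy. Accepted reads last d, rejected reads last 1.

```latex
R(f, d, a) =
  \frac{f\,\bigl(a\,d + (1 - a)\bigr)}
       {f\,\bigl(a\,d + (1 - a)\bigr) + (1 - f)\,\bigl((1 - a)\,d + a\bigr)}
```

As a worked example, f = 0.01, d = 20 and a = 0.9 give a ham term of 0.01 · (18 + 0.1) = 0.181 and a spam term of 0.99 · (2 + 0.9) = 2.871, so R ≈ 0.181 / 3.052 ≈ 0.059, about six times the starting frequency. At a = 0.5 the expression collapses to R = f, meaning a coin-flip classifier gives no enrichment.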


In [138]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

In [139]:
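def sim_ru(ham_frequency, ham_duration, accuracy):
    """Monte Carlo estimate of the fraction of sequencing time spent on ham."""
    n = 1000000
    # True where the molecule is desired (ham)
    ham = np.random.random(size=n) < ham_frequency
    # Every read counts as rejected early (duration 1) unless it is accepted below
    durations = np.ones(n)
    # True where the classifier makes the correct call
    accurate = np.random.random(size=n) < accuracy
    # Correctly accepted ham and wrongly accepted spam are both read in full
    durations[ham & accurate] = ham_duration
    durations[~ham & ~accurate] = ham_duration
    return np.sum(durations[ham]) / np.sum(durations)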


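def sim_ru2(ham_frequency, ham_duration, accuracy):
    """Exact expected fraction of sequencing time spent on ham."""
    # Reads the classifier accepts (true ham + misclassified spam) are read in full
    long = (ham_frequency * accuracy + (1 - ham_frequency) * (1 - accuracy)) * ham_duration
    # Reads the classifier rejects (missed ham + true spam) have duration 1
    short = (ham_frequency * (1 - accuracy) + (1 - ham_frequency) * accuracy) * 1.0
    # Time spent on ham, whether it was accepted or rejected
    ham = ham_frequency * accuracy * ham_duration + ham_frequency * (1 - accuracy) * 1.0
    return ham / (long + short)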
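
A quick sanity check is to confirm that the Monte Carlo version and the exact calculation agree. The sketch below restates both under the hypothetical names exact_ratio and monte_carlo_ratio so that it runs on its own; the seeded generator is my addition for repeatability and is not in the original cells.

```python
import numpy as np


def exact_ratio(f, d, a):
    # Same arithmetic as sim_ru2, restated so this snippet runs on its own
    ham = f * (a * d + (1 - a))
    spam = (1 - f) * ((1 - a) * d + a)
    return ham / (ham + spam)


def monte_carlo_ratio(f, d, a, n=1000000, seed=0):
    # Same sampling scheme as sim_ru, but with a seeded generator for repeatability
    rng = np.random.default_rng(seed)
    ham = rng.random(n) < f
    accurate = rng.random(n) < a
    # A read is accepted (read in full) if it is ham and classified correctly,
    # or spam and classified incorrectly
    accepted = (ham & accurate) | (~ham & ~accurate)
    durations = np.where(accepted, d, 1.0)
    return durations[ham].sum() / durations.sum()


for f, d, a in [(0.5, 10, 0.9), (0.1, 20, 0.8), (0.01, 40, 0.95)]:
    exact = exact_ratio(f, d, a)
    estimate = monte_carlo_ratio(f, d, a)
    print("f=%.2f d=%g a=%.2f  exact=%.4f  monte carlo=%.4f" % (f, d, a, exact, estimate))
```

The two columns should match to within sampling noise, which is largest when ham is rare because few of the simulated molecules are ham.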

In [140]:
def make_plot(ham_frequency):
    """Plot the ham fraction of the output against accuracy for several ham_duration values."""
    f, ax = plt.subplots(figsize=(14, 6))
    ax.set_ylim(0, 1)
    x = np.arange(0.5, 1.0, 0.001)
    for j in reversed([2.5, 5, 10, 20, 40]):
        # sim_ru2 is plain arithmetic, so it accepts the whole accuracy array at once
        y = sim_ru2(ham_frequency, j, x)
        ax.plot(x, y, label="%.1f" % j)
    ax.grid()
    f.suptitle('Ratio of desired data over total data for different values of "desired length"/"rejected length"')
    ax.legend(loc=0)
    ax.set_xlabel("Detection Accuracy")
    ax.set_ylabel("Desired Output / Total Output")

50% ham in sample


In [141]:
make_plot(0.5)


10% ham in sample


In [142]:
make_plot(0.1)


1% ham in sample


In [147]:
make_plot(0.01)


Conclusions

I hope this illustrates how to think about and design Read Until workflows.

In practical applications there is a trade-off between accuracy and ham_duration: the longer a molecule stays in the pore before the decision is made, the more accurate the decision, but the smaller the ratio between the durations of accepted and rejected reads.
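
This trade-off can be explored numerically. The sketch below is a toy model under loudly stated assumptions: the saturating accuracy curve, the time constants and the variable names are all hypothetical and not derived from real data. It only shows how one could scan the decision time for the best enrichment once real curves are available.

```python
import numpy as np


def exact_ratio(f, d, a):
    # Same closed form as sim_ru2: ham time over total time
    ham = f * (a * d + (1 - a))
    spam = (1 - f) * ((1 - a) * d + a)
    return ham / (ham + spam)


# Hypothetical toy model, not measured behaviour
ham_frequency = 0.01
full_read_s = 20.0   # assumed time to read a desired strand completely
overhead_s = 0.5     # assumed fixed cost per rejection (latency, ejection, pore recovery)
tau_s = 1.0          # assumed time scale on which the classifier improves

decision_s = np.linspace(0.05, 10.0, 400)
# Assumed accuracy curve: starts near chance (0.5) and saturates towards 1.0
accuracy = 0.5 + 0.5 * (1.0 - np.exp(-decision_s / tau_s))
# The longer the decision takes, the smaller the accepted/rejected duration ratio
ham_duration = full_read_s / (decision_s + overhead_s)

ratio = exact_ratio(ham_frequency, ham_duration, accuracy)
best = int(np.argmax(ratio))
print("best decision time: %.2f s" % decision_s[best])
print("accuracy %.3f, ham_duration %.1f, ham fraction %.3f"
      % (accuracy[best], ham_duration[best], ratio[best]))
```

In this toy model the optimum lies in the interior of the scanned range: deciding too early wastes the duration advantage on wrong calls, and deciding too late gives the advantage away.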

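The plots also show that Read Until strongly favors long reads: the longer the desired strands are relative to the time needed to reject a molecule, the larger ham_duration and the stronger the enrichment.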
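
How much longer reads help can be read off the same formula. The sketch below restates the exact calculation under the hypothetical name exact_ratio and tabulates the result for a fixed accuracy; the parameter values are illustrative only.

```python
def exact_ratio(f, d, a):
    # Same closed form as sim_ru2: ham time over total time
    ham = f * (a * d + (1 - a))
    spam = (1 - f) * ((1 - a) * d + a)
    return ham / (ham + spam)


ham_frequency, accuracy = 0.01, 0.9
ratios = []
for ham_duration in [2.5, 5, 10, 20, 40, 1000]:
    r = exact_ratio(ham_frequency, ham_duration, accuracy)
    ratios.append(r)
    print("ham_duration %6.1f -> ham fraction %.4f (enrichment x%.1f)"
          % (ham_duration, r, r / ham_frequency))

# Limit for ham_duration -> infinity: only the accepted reads matter
ceiling = (ham_frequency * accuracy
           / (ham_frequency * accuracy + (1 - ham_frequency) * (1 - accuracy)))
print("ceiling for very long reads: %.4f" % ceiling)
```

For any accuracy above 0.5 the ham fraction grows with ham_duration, but with diminishing returns: it approaches a ceiling set by the accuracy alone, so very long reads cannot compensate for a poor classifier.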

