The "Read Until" feature of the Oxford Nanopore sequencing technology means a program can see the data coming in at each pore and, dependend on that data, reject the molecule inside a certain pore.
The actual performance of such a method depends on a lot of factors:
In this notebook I boiled it down to three parameters:
ham_frequency is the frequency of desired moleculesham_duration is the scale by which the desired molecules are read "longer"accuracy is the accuracy of the classificationThe analogy to spam detection is chosen because "ham/spam" makes for catchier variable names. This computation considers time and "amount of data" as equivalent. In reality, event speeds vary a lot, but in the long run, duration of reads and length of the strands correlate very strongly.
The result of the computation is the ratio of desired time/data over undesired time/data, which is hopefully higher than the original ham_frequency.
In [138]:
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
In [139]:
def sim_ru(ham_frequency, ham_duration, accuracy):
# Monte-Carlo Style
n = 1000000
ham = np.random.random(size=n)<ham_frequency
durations = np.ones(n)
accurate = np.random.random(size=n)<accuracy
durations[ham & accurate] = ham_duration
durations[~ham & ~accurate] = ham_duration
return (np.sum(durations[ham]) / np.sum(durations))
def sim_ru2(ham_frequency, ham_duration, accuracy):
# exact calculation
long = ((ham_frequency* accuracy) + (1-ham_frequency)*(1-accuracy)) * ham_duration
short = ((ham_frequency* (1-accuracy)) + (1-ham_frequency)*(accuracy)) * 1.0
ham = (ham_frequency* accuracy)*ham_duration + (ham_frequency* (1-accuracy))*1
return ham / (long+short)
In [140]:
def make_plot(ham_frequency):
f, ax = plt.subplots()
f.set_figwidth(14)
f.set_figheight(6)
ax.set_ylim(0,1)
x = np.arange(0.5, 1.0,0.001)
y = np.zeros(len(x))
handles = []
for j in reversed([2.5,5,10,20,40]):
for i in range(len(x)):
y[i] = sim_ru2(ham_frequency, j, x[i])
handles.append(ax.plot(x,y, label = "%.1f" % j))
ax.grid()
f.suptitle("Ratio of desired data over total data for different values of \"desired length\"/\"rejected length\" ")
ax.legend(loc=0);
ax.xaxis.set_label_text("Detection Accuracy");
ax.yaxis.set_label_text("Desired Output / Total Output");
In [141]:
make_plot(0.5)
In [142]:
make_plot(0.1)
In [147]:
make_plot(0.01)
I hope this illustrates how to think about and design Read Until workflows.
In practical applications there will be tradeoffs between accuracy and the ham_duration: The more time the molecule has to spend inside the pore before ejection the higher the accuracy of the decision and the lower the ratio of the ham/spam duration.
It's also obvious that Read Until strongly favors long reads.
In [ ]: