Quality control

Goal

Remove sequences and regions with low quality or potential adapter contamination from the raw sequence pool.

Protocol

We use Atropos (Didion et al., 2017) for quality control.

The following command is adopted from Oecophylla, under qc.rule.

atropos --threads {threads} {params.atropos} --report-file {log} --report-formats txt -o {temp_dir}/{f_fp} -p {temp_dir}/{r_fp} -pe1 {input.forward} -pe2 {input.reverse}

For parameters (params.atropos), we use the following:

-a GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
-A GATCGGAAGAGCGTCGTGTAGGGAAAGGAGTGT
-q 15 --minimum-length 100 --pair-filter any

Note: the two sequences are adapters to be removed (assuming the library prep kit is Illumina TruSeq or compatible models such as Kapa HyperPlus, which we use).

Benchmark

The following benchmarks were obtained on 692 AGP shotgun samples, using 8 CPUs and 8 GB memory.

Basically, the run time is linear to the sample size, while memory consumption is linear and trivial.

For a typical dataset of 1 million sequences, this step will cost roughly 1 min 15 sec.


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress

In [2]:
%matplotlib inline

In [3]:
df = pd.read_table('support_files/benchmarks/atropos.tsv', index_col=0)
df.head()


Out[3]:
seqs s h:m:s max_rss max_vms max_uss max_pss io_in io_out mean_load
sample
10317.000001062 981681 77.8203 0:01:17 2044.04 5353.94 893.25 1382.30 234.93 400.00 371.43
10317.000001079 1080144 80.1169 0:01:20 2103.91 5411.07 1097.82 1559.92 253.10 139.91 387.32
10317.000001086 302135 26.8115 0:00:26 921.31 4230.70 498.44 572.31 72.55 51.16 376.98
10317.000001089 746192 57.8219 0:00:57 1260.92 4570.15 1024.99 1102.19 175.45 144.57 498.26
10317.000001127 1028459 87.0529 0:01:27 2195.91 5504.84 1413.00 1460.25 242.76 207.27 595.14

In [4]:
df['mseqs'] = df['seqs'] / 1000000
df['mbram'] = df['max_rss'] / 1000

In [5]:
reg = linregress(df['mseqs'].values, df['s'].values)
reg


Out[5]:
LinregressResult(slope=69.387990648100072, intercept=5.6821600677541539, rvalue=0.98989635320073421, pvalue=0.0, stderr=0.37144317405330479)

In [6]:
fig = plt.figure(figsize=(5, 5))
ax = plt.gca()
plt.plot(df['mseqs'], df['s'], 'o', markersize=4)
x0, x1 = plt.xlim()
y0 = x0 * reg.slope + reg.intercept
y1 = x1 * reg.slope + reg.intercept
plt.plot([x0, x1], [y0, y1], '--')
plt.text(0.1, 0.8, '$\it{y} = %.3g %+.3g \it{x}$\n$\it{R}^2 = %.3g$'
         % (reg.intercept, reg.slope, reg.rvalue ** 2),
         transform=ax.transAxes)
plt.xlabel('Million sequences')
plt.ylabel('Wall clock time (sec)');



In [7]:
reg = linregress(df['mseqs'].values, df['mbram'].values)
reg


Out[7]:
LinregressResult(slope=2.1242883118433484, intercept=-0.030892124063678361, rvalue=0.9709008287509695, pvalue=0.0, stderr=0.019581897495066496)

In [8]:
fig = plt.figure(figsize=(5, 5))
ax = plt.gca()
plt.plot(df['mseqs'], df['mbram'], 'o', markersize=4)
x0, x1 = plt.xlim()
y0 = x0 * reg.slope + reg.intercept
y1 = x1 * reg.slope + reg.intercept
plt.plot([x0, x1], [y0, y1], '--')
plt.text(0.1, 0.8, '$\it{y} = %.3g %+.3g \it{x}$\n$\it{R}^2 = %.3g$'
         % (reg.intercept, reg.slope, reg.rvalue ** 2),
         transform=ax.transAxes)
plt.xlabel('Million sequences')
plt.ylabel('Maximum memory usage (MB)');



In [ ]: