Remove sequences and regions with low quality or potential adapter contamination from the raw sequence pool.
We use Atropos (Didion et al., 2017) for quality control.
The following command is adopted from Oecophylla, under qc.rule.
atropos --threads {threads} {params.atropos} --report-file {log} --report-formats txt -o {temp_dir}/{f_fp} -p {temp_dir}/{r_fp} -pe1 {input.forward} -pe2 {input.reverse}
For parameters (params.atropos
), we use the following:
-a GATCGGAAGAGCACACGTCTGAACTCCAGTCAC
-A GATCGGAAGAGCGTCGTGTAGGGAAAGGAGTGT
-q 15 --minimum-length 100 --pair-filter any
Note: the two sequences are adapters to be removed (assuming the library prep kit is Illumina TruSeq or compatible models such as Kapa HyperPlus, which we use).
The following benchmarks were obtained on 692 AGP shotgun samples, using 8 CPUs and 8 GB memory.
Basically, the run time is linear to the sample size, while memory consumption is linear and trivial.
For a typical dataset of 1 million sequences, this step will cost roughly 1 min 15 sec.
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress
In [2]:
%matplotlib inline
In [3]:
df = pd.read_table('support_files/benchmarks/atropos.tsv', index_col=0)
df.head()
Out[3]:
In [4]:
df['mseqs'] = df['seqs'] / 1000000
df['mbram'] = df['max_rss'] / 1000
In [5]:
reg = linregress(df['mseqs'].values, df['s'].values)
reg
Out[5]:
In [6]:
fig = plt.figure(figsize=(5, 5))
ax = plt.gca()
plt.plot(df['mseqs'], df['s'], 'o', markersize=4)
x0, x1 = plt.xlim()
y0 = x0 * reg.slope + reg.intercept
y1 = x1 * reg.slope + reg.intercept
plt.plot([x0, x1], [y0, y1], '--')
plt.text(0.1, 0.8, '$\it{y} = %.3g %+.3g \it{x}$\n$\it{R}^2 = %.3g$'
% (reg.intercept, reg.slope, reg.rvalue ** 2),
transform=ax.transAxes)
plt.xlabel('Million sequences')
plt.ylabel('Wall clock time (sec)');
In [7]:
reg = linregress(df['mseqs'].values, df['mbram'].values)
reg
Out[7]:
In [8]:
fig = plt.figure(figsize=(5, 5))
ax = plt.gca()
plt.plot(df['mseqs'], df['mbram'], 'o', markersize=4)
x0, x1 = plt.xlim()
y0 = x0 * reg.slope + reg.intercept
y1 = x1 * reg.slope + reg.intercept
plt.plot([x0, x1], [y0, y1], '--')
plt.text(0.1, 0.8, '$\it{y} = %.3g %+.3g \it{x}$\n$\it{R}^2 = %.3g$'
% (reg.intercept, reg.slope, reg.rvalue ** 2),
transform=ax.transAxes)
plt.xlabel('Million sequences')
plt.ylabel('Maximum memory usage (MB)');
In [ ]: