Remove potential host genome contamination from sequencing data.
We use Bowtie2 (Langmead and Salzberg, 2012) to align paired-end sequences to the host genome, then use SAMtools (Li et al., 2009) with BEDtools (Quinlan and Hall, 2010) to remove aligned sequences and their mates.
The following commands are adapted from Oecophylla, under qc.rule.
bowtie2 -p {threads} -x {params.filter_db} --very-sensitive -1 {input.forward} -2 {input.reverse} 2> {log.bowtie} | samtools view -f 12 -F 256 -b -o {temp_dir}/{wildcards.sample}.unsorted.bam 2> {log.other}
samtools sort -T {temp_dir}/{wildcards.sample} -@ {threads} -n -o {temp_dir}/{wildcards.sample}.bam {temp_dir}/{wildcards.sample}.unsorted.bam 2> {log.other}
bedtools bamtofastq -i {temp_dir}/{wildcards.sample}.bam -fq {temp_dir}/{wildcards.sample}.R1.trimmed.filtered.fastq -fq2 {temp_dir}/{wildcards.sample}.R2.trimmed.filtered.fastq 2> {log.other}
If necessary, these three commands can be combined into one command:
bowtie2 -p {threads} -x {params.filter_db} --very-sensitive -1 {input.forward} -2 {input.reverse} | samtools view -f 12 -F 256 | samtools sort -@ {threads} -n | samtools view -bS | bedtools bamtofastq -i - -fq {output.forward} -fq2 {output.reverse} &> {log}
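In both forms, samtools view -f 12 keeps only read pairs in which neither mate aligned to the host (SAM flags 4 + 8), -F 256 drops secondary alignments, and samtools sort -n name-sorts the pairs so bedtools bamtofastq can write matched R1/R2 files. If you want to run the same pipeline on a single sample outside of Snakemake, the sketch below wraps it in a small Python function; the function name, all file names, the index path, and the thread count are hypothetical placeholders rather than values defined by Oecophylla.
# Minimal sketch: host-filter one sample outside of Snakemake.
# All names/paths below are illustrative placeholders.
import subprocess

def host_filter(r1_in, r2_in, index, r1_out, r2_out, threads=4):
    """Keep only read pairs in which neither mate aligns to the host index."""
    cmd = (
        f'bowtie2 -p {threads} -x {index} --very-sensitive '
        f'-1 {r1_in} -2 {r2_in} '
        '| samtools view -b -f 12 -F 256 '   # both mates unmapped (4+8), no secondary alignments
        f'| samtools sort -@ {threads} -n '  # name-sort so mates stay adjacent for bamtofastq
        f'| bedtools bamtofastq -i - -fq {r1_out} -fq2 {r2_out}'
    )
    # run under bash with pipefail so a failure in any stage is reported
    subprocess.run(['bash', '-o', 'pipefail', '-c', cmd], check=True)

# Example call (hypothetical paths; see the database note below for the index prefix):
# host_filter('sample.R1.trimmed.fastq', 'sample.R2.trimmed.fastq',
#             '/databases/bowtie/Human_PhiX',
#             'sample.R1.trimmed.filtered.fastq',
#             'sample.R2.trimmed.filtered.fastq')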
Multiple host databases (params.filter_db) are already available on Barnacle, under /databases/bowtie. Of these, Human is for human sequence removal, PhiX removes Illumina's spike-in control, and Human_PhiX covers both (we recommend using this database).
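To confirm that an index actually exists at the prefix you pass as params.filter_db, a tiny helper like the one below can check for index files before launching an alignment. The helper name and the assumed layout (index files sitting directly under /databases/bowtie with standard .bt2 suffixes) are guesses; adjust them to the actual structure on Barnacle.
# Hypothetical helper: resolve a database name to the Bowtie2 -x prefix and
# verify that index files (e.g. Human_PhiX.1.bt2) are present.
from pathlib import Path

def filter_db_path(name='Human_PhiX', root='/databases/bowtie'):
    prefix = Path(root) / name
    if not list(Path(root).glob(name + '.*.bt2*')):
        raise FileNotFoundError(f'no Bowtie2 index files found for {prefix}')
    return str(prefix)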
The following benchmarks were obtained on 692 AGP shotgun samples, using 4 CPUs and 8 GB memory.
The run time scales linearly with the number of sequences, while memory consumption is constant and small. For a typical dataset of one million sequences, this step takes roughly 3 min 30 sec.
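That figure translates into a simple back-of-the-envelope estimate for larger datasets. The helper below is a sketch, not part of the workflow: it just scales the ~3.5 min per million sequences linearly and ignores the small per-sample constant of the fit.
# Rough wall-clock projection from the benchmark above
# (~3 min 30 sec per million sequences on 4 CPUs).
def estimated_minutes(n_seqs, minutes_per_million=3.5):
    return n_seqs / 1e6 * minutes_per_million

estimated_minutes(10_000_000)  # a 10 M-sequence sample: ~35 minutes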
In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress
In [2]:
%matplotlib inline
In [3]:
# per-sample benchmark table: wall-clock seconds (s), peak memory (max_rss)
# and number of sequences (seqs)
df = pd.read_table('support_files/benchmarks/bowtie2.tsv', index_col=0)
df.head()
Out[3]:
In [4]:
df['mseqs'] = df['seqs'] / 1000000  # sequence count in millions
df['mbram'] = df['max_rss'] / 1000  # peak memory in MB
In [5]:
# linear fit of run time (s) against millions of sequences
reg = linregress(df['mseqs'].values, df['s'].values)
reg
Out[5]:
In [6]:
fig = plt.figure(figsize=(5, 5))
ax = plt.gca()
plt.plot(df['mseqs'], df['s'], 'o', markersize=4)
# overlay the fitted line across the plotted x-range
x0, x1 = plt.xlim()
y0 = x0 * reg.slope + reg.intercept
y1 = x1 * reg.slope + reg.intercept
plt.plot([x0, x1], [y0, y1], '--')
# annotate with the fitted equation and R^2
plt.text(0.1, 0.8, '$y = %.3g %+.3g x$\n$R^2 = %.3g$'
         % (reg.intercept, reg.slope, reg.rvalue ** 2),
         transform=ax.transAxes)
plt.xlabel('Million sequences')
plt.ylabel('Wall clock time (sec)');
In [7]:
# regress peak memory against sample size; per the note above, this should be flat
reg = linregress(df['mseqs'].values, df['mbram'].values)
reg
Out[7]:
In [8]:
fig = plt.figure(figsize=(5, 5))
ax = plt.gca()
plt.plot(df['mseqs'], df['mbram'], 'o', markersize=4)
plt.xlabel('Million sequences')
plt.ylabel('Maximum memory usage (MB)');