Remove potential host genome contamination from sequencing data.
We use Bowtie2 (Langmead and Salzberg, 2012) to align paired-end sequences to the host genome, then use SAMtools (Li et al., 2009) and BEDtools (Quinlan and Hall, 2010) to discard aligned reads and their mates, retaining only pairs in which neither read maps to the host.
The following commands are adapted from Oecophylla, under qc.rule.
bowtie2 -p {threads} -x {params.filter_db} --very-sensitive -1 {input.forward} -2 {input.reverse} 2> {log.bowtie}| samtools view -f 12 -F 256 -b -o {temp_dir}/{wildcards.sample}.unsorted.bam 2> {log.other}
samtools sort -T {temp_dir}/{wildcards.sample} -@ {threads} -n -o {temp_dir}/{wildcards.sample}.bam {temp_dir}/{wildcards.sample}.unsorted.bam 2> {log.other}
bedtools bamtofastq -i {temp_dir}/{wildcards.sample}.bam -fq {temp_dir}/{wildcards.sample}.R1.trimmed.filtered.fastq -fq2 {temp_dir}/{wildcards.sample}.R2.trimmed.filtered.fastq 2> {log.other}
If desired, these three commands can be combined into a single pipeline:
bowtie2 -p {threads} -x {params.filter_db} --very-sensitive -1 {input.forward} -2 {input.reverse} \
    | samtools view -f 12 -F 256 \
    | samtools sort -@ {threads} -n \
    | samtools view -bS \
    | bedtools bamtofastq -i - -fq {output.forward} -fq2 {output.reverse} &> {log}
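The filtering logic hinges on the SAM FLAG field: `-f 12` requires both the "read unmapped" (0x4) and "mate unmapped" (0x8) bits, and `-F 256` excludes secondary alignments (0x100), so only pairs in which neither mate aligned to the host survive. A minimal Python sketch of that bit test (the flag constants are from the SAM specification; the function name is illustrative, not part of any tool):

```python
# SAM FLAG bits used by the `samtools view -f 12 -F 256` step above.
READ_UNMAPPED = 0x4    # this read did not align to the host genome
MATE_UNMAPPED = 0x8    # its mate did not align either
SECONDARY = 0x100      # secondary (non-primary) alignment record

def keep_record(flag):
    """Mimic `samtools view -f 12 -F 256`: keep a record only if both
    the read and its mate are unmapped, and it is a primary record."""
    required = READ_UNMAPPED | MATE_UNMAPPED  # -f 12
    excluded = SECONDARY                      # -F 256
    return (flag & required) == required and (flag & excluded) == 0

# Flag 77 (paired + both mates unmapped + first in pair) is kept;
# flag 4 (mate aligned to the host) is dropped along with its pair.
print(keep_record(77), keep_record(0x4))  # → True False
```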
Multiple host databases (params.filter_db) are already available on Barnacle, under: /databases/bowtie. Among them, Human is for human sequence removal, PhiX removes Illumina's spike-in control, and Human_PhiX covers both (we recommend using this database).
The following benchmarks were obtained on 692 AGP shotgun samples, using 4 CPUs and 8 GB memory.
The run time scales linearly with the number of sequences, while memory consumption is constant and negligible.
For a typical dataset of 1 million sequences, this step takes roughly 3 min 30 sec.
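Given that linear scaling, one can roughly budget wall-clock time for a dataset of any size. The sketch below uses the ~3 min 30 sec per million sequences figure stated above as a rule of thumb; the zero intercept and the function itself are illustrative assumptions, not the fitted regression coefficients:

```python
def estimate_runtime_sec(n_seqs, sec_per_million=210.0, intercept=0.0):
    """Rough wall-clock estimate for the host-filtering step.

    `sec_per_million` (~3 min 30 sec) and the zero intercept are
    rule-of-thumb assumptions from the benchmark summary, not the
    exact fitted values."""
    return intercept + sec_per_million * n_seqs / 1e6

# e.g. a 5-million-sequence sample: ~1050 s (about 17.5 min)
print(round(estimate_runtime_sec(5_000_000)))  # → 1050
```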
In [1]:
    
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress
    
In [2]:
    
%matplotlib inline
    
In [3]:
    
df = pd.read_table('support_files/benchmarks/bowtie2.tsv', index_col=0)
df.head()
    
In [4]:
    
df['mseqs'] = df['seqs'] / 1000000  # sequence count in millions
df['mbram'] = df['max_rss'] / 1000  # peak memory (max RSS) in MB
    
In [5]:
    
reg = linregress(df['mseqs'].values, df['s'].values)
reg
    
In [6]:
    
fig = plt.figure(figsize=(5, 5))
ax = plt.gca()
plt.plot(df['mseqs'], df['s'], 'o', markersize=4)
x0, x1 = plt.xlim()
y0 = x0 * reg.slope + reg.intercept
y1 = x1 * reg.slope + reg.intercept
plt.plot([x0, x1], [y0, y1], '--')
plt.text(0.1, 0.8, '$\\it{y} = %.3g %+.3g \\it{x}$\n$\\it{R}^2 = %.3g$'
         % (reg.intercept, reg.slope, reg.rvalue ** 2),
         transform=ax.transAxes)
plt.xlabel('Million sequences')
plt.ylabel('Wall clock time (sec)');
    
    
In [7]:
    
reg = linregress(df['mseqs'].values, df['mbram'].values)
reg
    
In [8]:
    
fig = plt.figure(figsize=(5, 5))
ax = plt.gca()
plt.plot(df['mseqs'], df['mbram'], 'o', markersize=4)
plt.xlabel('Million sequences')
plt.ylabel('Maximum memory usage (MB)')
    