Host filtering

Goal

Remove potential host genome contamination from sequencing data.

Protocol

We use Bowtie2 (Langmead and Salzberg, 2012) to align paired-end sequences to the host genome, then use SAMtools (Li et al., 2009) with BEDtools (Quinlan and Hall, 2010) to remove aligned sequences and their mates.

The following command is adopted from Oecophylla, under qc.rule.

bowtie2 -p {threads} -x {params.filter_db} --very-sensitive -1 {input.forward} -2 {input.reverse} 2> {log.bowtie}| samtools view -f 12 -F 256 -b -o {temp_dir}/{wildcards.sample}.unsorted.bam 2> {log.other}

samtools sort -T {temp_dir}/{wildcards.sample} -@ {threads} -n -o {temp_dir}/{wildcards.sample}.bam {temp_dir}/{wildcards.sample}.unsorted.bam 2> {log.other}

bedtools bamtofastq -i {temp_dir}/{wildcards.sample}.bam -fq {temp_dir}/{wildcards.sample}.R1.trimmed.filtered.fastq -fq2 {temp_dir}/{wildcards.sample}.R2.trimmed.filtered.fastq 2> {log.other}

If necessary, these three commands can be combined into one command:

bowtie2 -p {threads} -x {params.filter_db} --very-sensitive -1 {input.forward} -2 {input.reverse} | samtools view -f 12 -F 256 | samtools sort -@ {threads} -n | samtools view -bS | bedtools bamtofastq -i - -fq {output.forward} -fq2 {output.reverse} &> {log}

Multiple host databases (params.filter_db) are already available on Barnacle, under: /databases/bowtie. Of which, Human is for human sequence removal, PhiX is to remove Illumina’s spike-in control. Human_PhiX is for both (we recommend using this database).

Benchmark

The following benchmarks were obtained on 692 AGP shotgun samples, using 4 CPUs and 8 GB memory.

Basically, the run time is linear to the sample size, while memory consumption is constant and trivial.

For a typical dataset of 1 million sequences, this step will cost roughly 3 min 30 sec.


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress

In [2]:
%matplotlib inline

In [3]:
df = pd.read_table('support_files/benchmarks/bowtie2.tsv', index_col=0)
df.head()


Out[3]:
seqs s h:m:s max_rss max_vms max_uss max_pss io_in io_out mean_load
sample
10317.000001062 981617 200.3494 0:03:20 3450.60 4971.94 3431.92 3436.66 0.0 1248.96 145.12
10317.000001079 1080117 219.5485 0:03:39 3450.71 4975.90 3432.11 3436.83 0.0 1308.75 187.75
10317.000001086 301797 66.3551 0:01:06 3449.45 4966.91 3412.78 3426.54 0.0 209.85 79.11
10317.000001089 746019 159.4735 0:02:39 3457.93 4976.94 3418.17 3433.29 0.0 887.52 177.41
10317.000001127 1028104 210.8806 0:03:30 3467.57 4987.91 3448.89 3453.80 0.0 1282.62 185.06

In [4]:
df['mseqs'] = df['seqs'] / 1000000
df['mbram'] = df['max_rss'] / 1000

In [5]:
reg = linregress(df['mseqs'].values, df['s'].values)
reg


Out[5]:
LinregressResult(slope=201.57923113272986, intercept=9.8897553040653179, rvalue=0.98732196866490929, pvalue=0.0, stderr=1.2111292693713898)

In [6]:
fig = plt.figure(figsize=(5, 5))
ax = plt.gca()
plt.plot(df['mseqs'], df['s'], 'o', markersize=4)
x0, x1 = plt.xlim()
y0 = x0 * reg.slope + reg.intercept
y1 = x1 * reg.slope + reg.intercept
plt.plot([x0, x1], [y0, y1], '--')
plt.text(0.1, 0.8, '$\it{y} = %.3g %+.3g \it{x}$\n$\it{R}^2 = %.3g$'
         % (reg.intercept, reg.slope, reg.rvalue ** 2),
         transform=ax.transAxes)
plt.xlabel('Million sequences')
plt.ylabel('Wall clock time (sec)');



In [7]:
reg = linregress(df['mseqs'].values, df['mbram'].values)
reg


Out[7]:
LinregressResult(slope=0.44711386145708548, intercept=2.9066318382168732, rvalue=0.37098667678108366, pvalue=7.5750394375810717e-25, stderr=0.041826338629199808)

In [8]:
fig = plt.figure(figsize=(5, 5))
ax = plt.gca()
plt.plot(df['mseqs'], df['mbram'], 'o', markersize=4)
plt.xlabel('Million sequences')
plt.ylabel('Maximum memory usage (MB)')


Out[8]:
Text(0,0.5,'Maximum memory usage (MB)')

In [ ]: