kevlar
: human simulation "pico"At this stage, kevlar
takes quite a bit of time to run on human-sized data sets.
To facilitate more rapid method development, I needed a small test data set that can be processed quickly.
And while faithfully modeling a eukaryotic genome in all of its repetitive glory is extremely complicated, I wanted to at least capture a couple of features realistically in this simulation: higher-order nucleotide composition, and shared versus unique variants.
In brief, this notebook shows how I simulated a 2.5 Mb "reference" genome, simulated a trio of 3 individuals from that reference genome, and then invoked the kevlar
workflow to identify variants.
In [1]:
from __future__ import print_function
import subprocess
import kevlar
import random
import sys
In [2]:
def gen_muts():
locs = [random.randint(0, 2500000) for _ in range(10)]
types = [random.choice(['snv', 'ins', 'del', 'inv']) for _ in range(10)]
for l, t in zip(locs, types):
if t == 'snv':
value = random.randint(1, 3)
elif t == 'ins':
length = random.randint(20, 200)
value = ''.join(random.choice('ACGT') for _ in range(length))
elif t == 'del':
value = random.randint(20, 200)
else:
value = random.randint(50, 900)
print(l, t, value, sep='\t')
Rather than generating a truly random genome, I wanted one that shared some compositional features with the human genome.
I used the nuclmm
package to train a 6th-order Markov model of nucleotide composition on the human genome, and then use this to simulate a 2.5 Mb random genome maintaining the same composition.
Downloading the human genome and running nuclmm train
are time intensive, so I've provided the associated commands here only as comments.
In [3]:
# !wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
# !pip install git+https://github.com/standage/nuclmm.git
# !nuclmm train --order 6 --out human.order6.mm GRCh38_full_analysis_set_plus_decoy_hla.fa
# !nuclmm simulate --out human.random.fa --order 6 --numseqs 1 --seqlen 2500000 --seed 42 human.order6.mm
The files [proband|mother|father]-mutations.tsv
contain lists of mutations to apply to the reference genome for each simulated sample.
The proband shares 3 mutations with each parent, and has 10 unique mutations.
The kevlar mutate
command applies the mutations to the provided reference genome to create the mutated genome.
In [4]:
arglist = ['mutate', '-o', 'proband-genome.fa', 'proband-mutations.tsv', 'human.random.fa']
args = kevlar.cli.parser().parse_args(arglist)
kevlar.mutate.main(args)
arglist = ['mutate', '-o', 'mother-genome.fa', 'mother-mutations.tsv', 'human.random.fa']
args = kevlar.cli.parser().parse_args(arglist)
kevlar.mutate.main(args)
arglist = ['mutate', '-o', 'father-genome.fa', 'father-mutations.tsv', 'human.random.fa']
args = kevlar.cli.parser().parse_args(arglist)
kevlar.mutate.main(args)
In [5]:
random.seed(55555555)
# wgsim uses an `int` type for its seed value
# Using extremely large integer values led to non-deterministic behavior,
# so I'm just using what can fit in a 16-bit integer here.
maxint = 65535
seed = random.randint(1, maxint)
cmd = 'wgsim -e 0.005 -r 0.0 -d 450 -s 50 -N 375000 -1 100 -2 100 -S {} proband-genome.fa proband-reads-1.fq proband-reads-2.fq'.format(seed)
_ = subprocess.check_call(cmd, shell=True)
seed = random.randint(1, maxint)
cmd = 'wgsim -e 0.005 -r 0.0 -d 450 -s 50 -N 375000 -1 100 -2 100 -S {} mother-genome.fa mother-reads-1.fq mother-reads-2.fq'.format(seed)
_ = subprocess.check_call(cmd, shell=True)
seed = random.randint(1, maxint)
cmd = 'wgsim -e 0.005 -r 0.0 -d 450 -s 50 -N 375000 -1 100 -2 100 -S {} father-genome.fa father-reads-1.fq father-reads-2.fq'.format(seed)
_ = subprocess.check_call(cmd, shell=True)
Discarding reads that match the reference genome perfectly eliminates many k-mers and allows us to count the remaining k-mers accurately with much less memory.
Typically kevlar dump
would operate on BAM files, but here I'm processing the bwa
SAM output directly and skipping kevlar dump
.
In [6]:
!time bwa index human.random.fa
!time bwa mem human.random.fa proband-reads-[1,2].fq 2> proband-bwa.log | samtools view | perl -ne 'print if !/\t\d+M\t/ || !/NM:i:0/' | perl -ane '$suffix = $F[1] & 64 ? "/1" : "/2"; print "\@" . "$F[0]$suffix\n$F[9]\n+\n$F[10]\n"' | gzip -c > proband-reads-dump.fq.gz
!time bwa mem human.random.fa mother-reads-[1,2].fq 2> mother-bwa.log | samtools view | perl -ne 'print if !/\t\d+M\t/ || !/NM:i:0/' | perl -ane '$suffix = $F[1] & 64 ? "/1" : "/2"; print "\@" . "$F[0]$suffix\n$F[9]\n+\n$F[10]\n"' | gzip -c > mother-reads-dump.fq.gz
!time bwa mem human.random.fa father-reads-[1,2].fq 2> father-bwa.log | samtools view | perl -ne 'print if !/\t\d+M\t/ || !/NM:i:0/' | perl -ane '$suffix = $F[1] & 64 ? "/1" : "/2"; print "\@" . "$F[0]$suffix\n$F[9]\n+\n$F[10]\n"' | gzip -c > father-reads-dump.fq.gz
In [7]:
arglist = ['count', '--ksize', '25', '--memory', '100M', '--mem-frac', '0.25',
'--case', 'proband.counttable', 'proband-reads-dump.fq.gz',
'--control', 'father.counttable', 'father-reads-dump.fq.gz',
'--control', 'mother.counttable', 'mother-reads-dump.fq.gz']
args = kevlar.cli.parser().parse_args(arglist)
kevlar.count.main(args)
In [8]:
arglist = ['novel', '--ctrl-max', '1', '--case-min', '8', '--ksize', '25',
'--case', 'proband-reads-dump.fq.gz', '--case-counts', 'proband.counttable',
'--control-counts', 'mother.counttable', 'father.counttable',
'--out', 'proband.novel.augfastq.gz']
args = kevlar.cli.parser().parse_args(arglist)
kevlar.novel.main(args)
In [9]:
arglist = ['filter', '--refr', 'human.random.fa', '--refr-memory', '50M', '--refr-max-fpr', '0.001',
'--abund-memory', '10M', '--abund-max-fpr', '0.001', '--min-abund', '8',
'--out', 'proband.novel.filtered.fq.gz', '--aug-out', 'proband.novel.filtered.augfastq.gz',
'--ksize', '25', 'proband.novel.augfastq.gz']
args = kevlar.cli.parser().parse_args(arglist)
kevlar.filter.main(args)
In [10]:
arglist = ['partition', 'part', 'proband.novel.filtered.augfastq.gz']
args = kevlar.cli.parser().parse_args(arglist)
kevlar.partition.main(args)
In [11]:
for i in range(10):
print('\n\n==== iter {i} ===='.format(i=i), file=sys.stderr)
# Strip interesting k-mer annotations
cmd = "gunzip -c part.cc{i}.augfastq.gz | grep -v '#$' | gzip -c > part.cc{i}.fq.gz".format(i=i)
subprocess.check_call(cmd, shell=True)
# Perform trimming
cmd = "trim-low-abund.py -M 50M -k 25 --output part.cc{i}.trim.fq.gz --gzip part.cc{i}.fq.gz 2> part.cc{i}.trim.log".format(i=i)
subprocess.check_call(cmd, shell=True)
# Re-annotate interesting k-mers
arglist = ['reaugment', '--out', 'part.cc{i}.trim.augfastq.gz'.format(i=i),
'part.cc{i}.augfastq.gz'.format(i=i), 'part.cc{i}.trim.fq.gz'.format(i=i)]
args = kevlar.cli.parser().parse_args(arglist)
kevlar.reaugment.main(args)
# Assemble
arglist = ['assemble', '--out', 'part.cc{i}.asmbl.augfasta.gz'.format(i=i),
'part.cc{i}.trim.augfastq.gz'.format(i=i)]
args = kevlar.cli.parser().parse_args(arglist)
kevlar.assemble.main(args)
# Plain Fasta file for convenience with downstream analysis.
cmd = "gunzip -c part.cc{i}.asmbl.augfasta.gz | grep -v '#$' > part.cc{i}.fa".format(i=i)
subprocess.check_call(cmd, shell=True)
In [ ]: