Variant calling with kevlar: human simulation "pico"

At this stage, kevlar takes quite a bit of time to run on human-sized data sets. To facilitate more rapid method development, I needed a small test data set that could be processed quickly. And while faithfully modeling a eukaryotic genome in all of its repetitive glory is extremely complicated, I wanted this simulation to capture at least two features realistically: higher-order nucleotide composition, and shared versus unique variants.

In brief, this notebook shows how I simulated a 2.5 Mb "reference" genome, simulated a trio of three individuals from that reference genome, and then invoked the kevlar workflow to identify variants.

Technical preliminaries

Nothing interesting to see here.


In [1]:
from __future__ import print_function
import subprocess
import kevlar
import random
import sys

In [2]:
def gen_muts():
    """Print 10 random mutations, one per line: position, type, value."""
    locs = [random.randint(0, 2500000) for _ in range(10)]
    types = [random.choice(['snv', 'ins', 'del', 'inv']) for _ in range(10)]
    for loc, muttype in zip(locs, types):
        if muttype == 'snv':
            value = random.randint(1, 3)      # offset selecting the alternate base
        elif muttype == 'ins':
            length = random.randint(20, 200)  # insertion: random 20-200 bp sequence
            value = ''.join(random.choice('ACGT') for _ in range(length))
        elif muttype == 'del':
            value = random.randint(20, 200)   # deletion length
        else:
            value = random.randint(50, 900)   # inversion length
        print(loc, muttype, value, sep='\t')
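For illustration, a mutations file in this format could be written to disk as follows. This is only a sketch: `write_mutations` is a hypothetical helper, and the seeds used to produce the actual `*-mutations.tsv` files are not recorded here.

```python
import random

random.seed(2017)  # hypothetical seed; the real files used unrecorded seeds


def write_mutations(filename, n=10, genomelen=2500000):
    """Write n random mutations in the same TSV format gen_muts prints."""
    with open(filename, 'w') as out:
        for _ in range(n):
            loc = random.randint(0, genomelen)
            muttype = random.choice(['snv', 'ins', 'del', 'inv'])
            if muttype == 'snv':
                value = random.randint(1, 3)
            elif muttype == 'ins':
                length = random.randint(20, 200)
                value = ''.join(random.choice('ACGT') for _ in range(length))
            elif muttype == 'del':
                value = random.randint(20, 200)
            else:
                value = random.randint(50, 900)
            print(loc, muttype, value, sep='\t', file=out)


write_mutations('example-mutations.tsv')
```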

Generate a random genome

Rather than generating a truly random genome, I wanted one that shared some compositional features with the human genome. I used the nuclmm package to train a 6th-order Markov model of nucleotide composition on the human genome, and then used that model to simulate a 2.5 Mb random genome maintaining the same composition. Downloading the human genome and running nuclmm train are time-intensive, so I've provided the associated commands here only as comments.


In [3]:
# !wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa
# !pip install git+https://github.com/standage/nuclmm.git
# !nuclmm train --order 6 --out human.order6.mm GRCh38_full_analysis_set_plus_decoy_hla.fa
# !nuclmm simulate --out human.random.fa --order 6 --numseqs 1 --seqlen 2500000 --seed 42 human.order6.mm
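The idea behind nuclmm can be illustrated with a toy higher-order Markov chain (order 2 here for brevity). This is a conceptual sketch, not nuclmm's actual implementation:

```python
import random
from collections import defaultdict, Counter


def train(seq, order=2):
    """Count base frequencies conditioned on the preceding `order` bases."""
    counts = defaultdict(Counter)
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1
    return counts


def simulate(counts, length, order=2):
    """Emit a random sequence following the trained transition frequencies."""
    state = random.choice(list(counts))
    out = list(state)
    while len(out) < length:
        ctx = counts.get(''.join(out[-order:]))
        if ctx is None:
            out.append(random.choice('ACGT'))  # unseen context: uniform fallback
            continue
        bases, weights = zip(*ctx.items())
        out.append(random.choices(bases, weights=weights)[0])
    return ''.join(out)


random.seed(42)
model = train('ACGTACGTGGGACGTACGTT', order=2)
seq = simulate(model, 30, order=2)
```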

Simulate a trio

The files [proband|mother|father]-mutations.tsv contain lists of mutations to apply to the reference genome for each simulated sample. The proband shares 3 mutations with each parent, and has 10 unique mutations. The kevlar mutate command applies the mutations to the provided reference genome to create the mutated genome.


In [4]:
for sample in ('proband', 'mother', 'father'):
    arglist = ['mutate', '-o', '{}-genome.fa'.format(sample),
               '{}-mutations.tsv'.format(sample), 'human.random.fa']
    args = kevlar.cli.parser().parse_args(arglist)
    kevlar.mutate.main(args)


[kevlar::mutate] loading mutations
[kevlar::mutate] mutating genome
[kevlar::mutate] loading mutations
[kevlar::mutate] mutating genome
[kevlar::mutate] loading mutations
[kevlar::mutate] mutating genome

"Sequence" the genomes

Use wgsim to simulate Illumina sequencing of each sample, with a small error rate.


In [5]:
random.seed(55555555)

# wgsim uses an `int` type for its seed value
# Using extremely large integer values led to non-deterministic behavior,
# so I'm just using what can fit in a 16-bit integer here.
maxint = 65535

for sample in ('proband', 'mother', 'father'):
    seed = random.randint(1, maxint)
    cmd = ('wgsim -e 0.005 -r 0.0 -d 450 -s 50 -N 375000 -1 100 -2 100 -S {seed} '
           '{sample}-genome.fa {sample}-reads-1.fq {sample}-reads-2.fq'
           .format(seed=seed, sample=sample))
    _ = subprocess.check_call(cmd, shell=True)

Dump boring reads

Discarding reads that match the reference genome perfectly eliminates many k-mers and allows us to count the remaining k-mers accurately with much less memory. Typically kevlar dump would operate on BAM files, but here I'm processing the bwa SAM output directly and skipping kevlar dump.
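The keep/drop rule encoded in the perl one-liners below can be sketched in Python. This is an illustration only; field indices follow the SAM specification (flag, CIGAR, SEQ, QUAL at columns 2, 6, 10, 11), and `sam_to_fastq_if_interesting` is a hypothetical name:

```python
def sam_to_fastq_if_interesting(samline):
    """Return a FASTQ record for a read that does NOT match the reference
    perfectly (CIGAR is not a single full-length match, or the NM:i:0 tag
    is absent); return None for "boring" perfect matches."""
    fields = samline.rstrip().split('\t')
    flag, cigar, seq, qual = int(fields[1]), fields[5], fields[9], fields[10]
    perfect_cigar = cigar.endswith('M') and cigar[:-1].isdigit()
    zero_mismatches = 'NM:i:0' in fields[11:]
    if perfect_cigar and zero_mismatches:
        return None  # boring read, discard
    suffix = '/1' if flag & 64 else '/2'  # SAM flag bit 0x40 marks read 1
    return '@{name}{suffix}\n{seq}\n+\n{qual}\n'.format(
        name=fields[0], suffix=suffix, seq=seq, qual=qual)


keep = sam_to_fastq_if_interesting(
    'read1\t99\tchr1\t100\t60\t90M10S\t=\t400\t400\tACGT\tIIII')
drop = sam_to_fastq_if_interesting(
    'read2\t147\tchr1\t100\t60\t100M\t=\t100\t-400\tACGT\tIIII\tNM:i:0')
```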


In [6]:
!time bwa index human.random.fa
!time bwa mem human.random.fa proband-reads-[1,2].fq 2> proband-bwa.log | samtools view | perl -ne 'print if !/\t\d+M\t/ || !/NM:i:0/' | perl -ane '$suffix = $F[1] & 64 ? "/1" : "/2"; print "\@" . "$F[0]$suffix\n$F[9]\n+\n$F[10]\n"' | gzip -c > proband-reads-dump.fq.gz
!time bwa mem human.random.fa mother-reads-[1,2].fq 2> mother-bwa.log | samtools view | perl -ne 'print if !/\t\d+M\t/ || !/NM:i:0/' | perl -ane '$suffix = $F[1] & 64 ? "/1" : "/2"; print "\@" . "$F[0]$suffix\n$F[9]\n+\n$F[10]\n"' | gzip -c > mother-reads-dump.fq.gz
!time bwa mem human.random.fa father-reads-[1,2].fq 2> father-bwa.log | samtools view | perl -ne 'print if !/\t\d+M\t/ || !/NM:i:0/' | perl -ane '$suffix = $F[1] & 64 ? "/1" : "/2"; print "\@" . "$F[0]$suffix\n$F[9]\n+\n$F[10]\n"' | gzip -c > father-reads-dump.fq.gz


[bwa_index] Pack FASTA... 0.03 sec
[bwa_index] Construct BWT for the packed sequence...
[bwa_index] 0.88 seconds elapse.
[bwa_index] Update BWT... 0.03 sec
[bwa_index] Pack forward-only FASTA... 0.02 sec
[bwa_index] Construct SA from BWT and Occ... 0.25 sec
[main] Version: 0.7.15-r1140
[main] CMD: bwa index human.random.fa
[main] Real time: 1.241 sec; CPU: 1.220 sec

real	0m1.252s
user	0m1.168s
sys	0m0.054s

real	0m30.560s
user	0m44.665s
sys	0m0.746s

real	0m31.306s
user	0m45.463s
sys	0m0.726s

real	0m31.313s
user	0m45.860s
sys	0m0.714s

Count all remaining k-mers

The first control sample is counted using the full 100M of memory. Each subsequent sample is checked against the first control before counting (there is no need to count a k-mer that is already disqualified in the first sample), so each requires only 100M × 0.25 = 25M of memory for counting.
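The screening idea can be sketched with exact dictionaries. This is a conceptual toy with hypothetical names (`count_kmers`, `screen`); kevlar and khmer actually use probabilistic count tables, which is where the memory savings come from:

```python
from collections import Counter


def count_kmers(reads, ksize, screen=None, ctrl_max=1):
    """Toy exact k-mer counter. When `screen` (the table from the first
    control sample) is supplied, k-mers already abundant there are skipped:
    they can never be novel in the proband, so storing them again would
    waste memory."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - ksize + 1):
            kmer = read[i:i + ksize]
            if screen is not None and screen[kmer] > ctrl_max:
                continue  # already disqualified by the first control
            counts[kmer] += 1
    return counts


father = count_kmers(['ACGTACGTAC'], ksize=4)
mother = count_kmers(['ACGTACGTAC', 'TTTTTTTT'], ksize=4, screen=father)
```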


In [7]:
arglist = ['count', '--ksize', '25', '--memory', '100M', '--mem-frac', '0.25',
           '--case', 'proband.counttable', 'proband-reads-dump.fq.gz',
           '--control', 'father.counttable', 'father-reads-dump.fq.gz',
           '--control', 'mother.counttable', 'mother-reads-dump.fq.gz']
args = kevlar.cli.parser().parse_args(arglist)
kevlar.count.main(args)


[kevlar::count] Loading control samples
[kevlar::counting]     computing k-mer abundances for 2 samples
[kevlar::counting]     loading sample from father-reads-dump.fq.gz
[kevlar::counting]     done loading reads; 295877 reads processed, 22485272 k-mers stored; estimated false positive rate is 0.008; saved to "father.counttable"
[kevlar::counting]     loading sample from mother-reads-dump.fq.gz
[kevlar::counting]     done loading reads; 295581 reads processed, 6951229 k-mers stored; estimated false positive rate is 0.180; saved to "mother.counttable"
[kevlar::count] 2 samples loaded in 92.80 sec
[kevlar::count] Loading case samples
[kevlar::counting]     computing k-mer abundances for 1 samples
[kevlar::counting]     loading sample from proband-reads-dump.fq.gz
[kevlar::counting]     done loading reads; 296034 reads processed, 6617158 k-mers stored; estimated false positive rate is 0.172; saved to "proband.counttable"
[kevlar::count] 1 sample(s) loaded in 66.25 sec
[kevlar::count] Total time: 159.05 seconds

Identify "interesting" k-mers

Select k-mers that are abundant in the proband (abundance ≥ 8) and effectively absent from both controls (abundance ≤ 1 in each), then print the reads that contain these k-mers.
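The selection criterion boils down to a simple predicate over per-sample abundances (a toy sketch with a hypothetical name, not kevlar's code):

```python
def is_interesting(case_abund, control_abunds, case_min=8, ctrl_max=1):
    """A k-mer is "interesting" if abundant in the case sample and
    effectively absent from every control sample."""
    return (case_abund >= case_min
            and all(abund <= ctrl_max for abund in control_abunds))


examples = [
    is_interesting(12, [0, 1]),   # novel: abundant in case, absent in controls
    is_interesting(12, [0, 3]),   # inherited: present in a control
    is_interesting(5, [0, 0]),    # likely sequencing error: too rare in case
]
```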


In [8]:
arglist = ['novel', '--ctrl-max', '1', '--case-min', '8', '--ksize', '25',
           '--case', 'proband-reads-dump.fq.gz', '--case-counts', 'proband.counttable',
           '--control-counts', 'mother.counttable', 'father.counttable',
           '--out', 'proband.novel.augfastq.gz']
args = kevlar.cli.parser().parse_args(arglist)
kevlar.novel.main(args)


[kevlar::novel] Loading control samples
[kevlar::novel]    INFO: counttables for 2 sample(s) provided, any corresponding FASTA/FASTQ input will be ignored for computing k-mer abundances
[kevlar::counting]     loading sketchfile "mother.counttable"...done! estimated false positive rate is 0.180
[kevlar::counting]     loading sketchfile "father.counttable"...done! estimated false positive rate is 0.008
[kevlar::novel] Control samples loaded in 0.47 sec
[kevlar::novel] Loading case samples
[kevlar::novel]    INFO: counttables for 1 samples provided, any corresponding FASTA/FASTQ input will be ignored for computing k-mer abundances
[kevlar::counting]     loading sketchfile "proband.counttable"...done! estimated false positive rate is 0.172
[kevlar::novel] Case samples loaded in 0.09 sec
[kevlar::novel] All samples loaded in 0.56 sec
[kevlar::novel] Iterating over reads from 1 case sample(s)
[kevlar::novel] Iterated over 296033 reads in 62.63 seconds
[kevlar::novel] Found 22684 instances of 1125 unique novel kmers in 938 reads
[kevlar::novel] Total time: 63.19 seconds

Filter "interesting" k-mers

Recompute k-mer abundances using a much smaller volume of input data. Under normal circumstances an effective FPR of 0.0 can be achieved with much less memory than the kevlar novel step required, but here I was lazy and just used the same amount.
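The three fates the filter log reports for each k-mer (masked by the reference, low abundance, validated) amount to a simple classification, sketched here with an exact reference k-mer set. kevlar uses a probabilistic table for the reference (hence the --refr-max-fpr setting), and `validate_kmer` is a hypothetical name:

```python
def validate_kmer(kmer, abund, refr_kmers, min_abund=8):
    """Classify a putatively novel k-mer into the three fates the
    kevlar filter log reports (toy sketch)."""
    if kmer in refr_kmers:
        return 'masked by reference'
    if abund < min_abund:
        return 'low abundance'
    return 'valid'


refr_kmers = {'GATTACAGATTACAGATTACAGATT'}
fate = validate_kmer('GATTACAGATTACAGATTACAGATT', 15, refr_kmers)
```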


In [9]:
arglist = ['filter', '--refr', 'human.random.fa', '--refr-memory', '50M', '--refr-max-fpr', '0.001',
           '--abund-memory', '10M', '--abund-max-fpr', '0.001', '--min-abund', '8',
           '--out', 'proband.novel.filtered.fq.gz', '--aug-out', 'proband.novel.filtered.augfastq.gz',
           '--ksize', '25', 'proband.novel.augfastq.gz']
args = kevlar.cli.parser().parse_args(arglist)
kevlar.filter.main(args)


[kevlar::filter] Loading reference genome from human.random.fa
    1 sequences and 2499976 k-mers consumed; estimated false positive rate is 0.000
[kevlar::filter] Reference genome loaded in 2.52 sec
[kevlar::filter] Loading input; recalculate k-mer abundances, de-duplicate reads and merge k-mers
    - proband.novel.augfastq.gz
    938 instances of 938 reads consumed, annotated with 22684 instances of 1125 distinct "interesting" k-mers; estimated false positive rate is 0.000
[kevlar::filter] Input loaded in 0.73 sec
[kevlar::filter] Validate k-mers and print reads
    processed 22684 instances of 1125 distinct "interesting" k-mers in 938 reads
        838 instances of 95 distinct k-mers masked by the reference genome
        7 instances of 1 distinct k-mers discarded due to low abundance
        21839 instances of 1029 distinct k-mers validated as novel
        407 reads with no surviving valid k-mers ignored
        0 contaminant reads discarded
        531 reads written to output
[kevlar::filter] k-mers validated and reads printed in 1.59 sec
[kevlar::filter] Total time: 4.84 seconds

Partition reads by shared "interesting" k-mers

Here we expect to see 10 connected components, corresponding to the 10 mutations unique to the proband.
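Conceptually, partitioning finds connected components in a graph whose nodes are reads and whose edges join reads sharing an "interesting" k-mer. A toy union-find version (not kevlar's implementation; `partition_reads` is a hypothetical name):

```python
from collections import defaultdict


def partition_reads(read_kmers):
    """Group reads into connected components, joining any two reads that
    share a novel k-mer (toy union-find)."""
    parent = {read: read for read in read_kmers}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    by_kmer = defaultdict(list)
    for read, kmers in read_kmers.items():
        for kmer in kmers:
            by_kmer[kmer].append(read)
    for reads in by_kmer.values():
        for other in reads[1:]:
            parent[find(other)] = find(reads[0])
    groups = defaultdict(set)
    for read in read_kmers:
        groups[find(read)].add(read)
    return list(groups.values())


parts = partition_reads({
    'read1': {'AAAAC', 'CCCCG'},
    'read2': {'CCCCG'},   # shares a k-mer with read1
    'read3': {'GGGGT'},   # shares nothing: its own component
})
```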


In [10]:
arglist = ['partition', 'part', 'proband.novel.filtered.augfastq.gz']
args = kevlar.cli.parser().parse_args(arglist)
kevlar.partition.main(args)


[kevlar::partition] Loading reads from proband.novel.filtered.augfastq.gz
[kevlar::partition] Reads loaded in 0.70 sec
[kevlar::partition] Building read graph in relaxed mode (shared novel k-mer required)
[kevlar::partition] Graph built in 1.09 sec
[kevlar::partition] Writing output to prefix part
[kevlar::overlap] grouped 531 reads into 10 connected components
[kevlar::partition] Output written in 0.57 sec
[kevlar::partition] Total time: 1.66 seconds

Assemble each partition

Perform abundance trimming on the reads from each partition, then assemble each partition separately. Each contig (or set of contigs) should reflect a distinct variant!


In [11]:
for i in range(10):
    print('\n\n==== iter {i} ===='.format(i=i), file=sys.stderr)
    
    # Strip interesting k-mer annotations
    cmd = "gunzip -c part.cc{i}.augfastq.gz | grep -v '#$' | gzip -c > part.cc{i}.fq.gz".format(i=i)
    subprocess.check_call(cmd, shell=True)
    # Perform trimming
    cmd = "trim-low-abund.py -M 50M -k 25 --output part.cc{i}.trim.fq.gz --gzip part.cc{i}.fq.gz 2> part.cc{i}.trim.log".format(i=i)
    subprocess.check_call(cmd, shell=True)
    # Re-annotate interesting k-mers
    arglist = ['reaugment', '--out', 'part.cc{i}.trim.augfastq.gz'.format(i=i),
               'part.cc{i}.augfastq.gz'.format(i=i), 'part.cc{i}.trim.fq.gz'.format(i=i)]
    args = kevlar.cli.parser().parse_args(arglist)
    kevlar.reaugment.main(args)
    
    # Assemble
    arglist = ['assemble', '--out', 'part.cc{i}.asmbl.augfasta.gz'.format(i=i),
               'part.cc{i}.trim.augfastq.gz'.format(i=i)]
    args = kevlar.cli.parser().parse_args(arglist)
    kevlar.assemble.main(args)
    
    # Plain Fasta file for convenience with downstream analysis.
    cmd = "gunzip -c part.cc{i}.asmbl.augfasta.gz | grep -v '#$' > part.cc{i}.fa".format(i=i)
    subprocess.check_call(cmd, shell=True)



==== iter 0 ====
[kevlar::assemble] loaded 129 reads and 414 interesting k-mers
[kevlar::assemble] initialized "shared interesting k-mers" graph with 126 nodes and 1477 edges
[kevlar::assemble] dropped 146 edges, graph now has 1 connected component(s), 111 nodes, and 1331 edges
[kevlar::assemble] assembled 94/111 reads from 1 connected component(s) into 7 contig(s)


==== iter 1 ====
[kevlar::assemble] loaded 77 reads and 182 interesting k-mers
[kevlar::assemble] initialized "shared interesting k-mers" graph with 75 nodes and 971 edges
[kevlar::assemble] dropped 90 edges, graph now has 1 connected component(s), 69 nodes, and 881 edges
[kevlar::assemble] assembled 53/69 reads from 1 connected component(s) into 4 contig(s)


==== iter 2 ====
[kevlar::assemble] loaded 66 reads and 175 interesting k-mers
[kevlar::assemble] initialized "shared interesting k-mers" graph with 62 nodes and 600 edges
[kevlar::assemble] dropped 59 edges, graph now has 1 connected component(s), 56 nodes, and 541 edges
[kevlar::assemble] assembled 49/56 reads from 1 connected component(s) into 2 contig(s)


==== iter 3 ====
[kevlar::assemble] loaded 42 reads and 85 interesting k-mers
[kevlar::assemble] initialized "shared interesting k-mers" graph with 42 nodes and 541 edges
[kevlar::assemble] dropped 46 edges, graph now has 1 connected component(s), 39 nodes, and 495 edges
[kevlar::assemble] assembled 37/39 reads from 1 connected component(s) into 1 contig(s)


==== iter 4 ====
[kevlar::assemble] loaded 26 reads and 25 interesting k-mers
[kevlar::assemble] initialized "shared interesting k-mers" graph with 26 nodes and 225 edges
[kevlar::assemble] dropped 21 edges, graph now has 1 connected component(s), 21 nodes, and 204 edges
[kevlar::assemble] assembled 21/21 reads from 1 connected component(s) into 1 contig(s)


==== iter 5 ====
[kevlar::assemble] loaded 21 reads and 22 interesting k-mers
[kevlar::assemble] initialized "shared interesting k-mers" graph with 21 nodes and 175 edges
[kevlar::assemble] dropped 10 edges, graph now has 1 connected component(s), 19 nodes, and 165 edges
[kevlar::assemble] assembled 18/19 reads from 1 connected component(s) into 1 contig(s)


==== iter 6 ====
[kevlar::assemble] loaded 21 reads and 24 interesting k-mers
[kevlar::assemble] initialized "shared interesting k-mers" graph with 19 nodes and 100 edges
[kevlar::assemble] dropped 8 edges, graph now has 1 connected component(s), 15 nodes, and 92 edges
[kevlar::assemble] assembled 14/15 reads from 1 connected component(s) into 1 contig(s)


==== iter 7 ====
[kevlar::assemble] loaded 20 reads and 24 interesting k-mers
[kevlar::assemble] initialized "shared interesting k-mers" graph with 18 nodes and 117 edges
[kevlar::assemble] dropped 10 edges, graph now has 1 connected component(s), 16 nodes, and 107 edges
[kevlar::assemble] assembled 15/16 reads from 1 connected component(s) into 1 contig(s)


==== iter 8 ====
[kevlar::assemble] loaded 16 reads and 25 interesting k-mers
[kevlar::assemble] initialized "shared interesting k-mers" graph with 15 nodes and 88 edges
[kevlar::assemble] dropped 7 edges, graph now has 1 connected component(s), 14 nodes, and 81 edges
[kevlar::assemble] assembled 12/14 reads from 1 connected component(s) into 1 contig(s)


==== iter 9 ====
[kevlar::assemble] loaded 16 reads and 53 interesting k-mers
[kevlar::assemble] initialized "shared interesting k-mers" graph with 15 nodes and 68 edges
[kevlar::assemble] dropped 6 edges, graph now has 1 connected component(s), 14 nodes, and 62 edges
[kevlar::assemble] assembled 12/14 reads from 1 connected component(s) into 1 contig(s)
