Exploring K-mer hashing in Brachypodium

Ben Bai (CMBE, ANU, Canberra, Australia)

Introduction

  • Exploring read-level clustering of Brachypodium using properties of the kmer distribution.
    • Can we detect peak structures that indicate homeozygosity variation?
    • Can we distinguish technical replicates?
    • Can we distinguish ploidy?
  • Experimental design should account for the range of input genetic diversity. Extremely distant samples may be incomparable.
  • Useful where SNP-based methods suffer from missing data/reference sequences.
  • We can look for similarities at the read-level rather than differences at the haplotypic level.

Check dependencies

  • python 2.x (developed on 2.7.6)
  • R 3.0+ (developed on 3.1.0)
  • khmer 1.2+ (developed on 1.2-rc2-6-gc5dee21)
  • wgsim (developed on 0.3.0, from samtools library)
  • pairs (from the seqqs library)

In [2]:
%%bash

which R
which python

# Locations on percy.anu.edu.au
# export PATH=$PATH:"/mnt/rescued_fast/kevin/ws/gbsqc/tools/seqqs/":"/usr/local/src/samtools/misc/"
which wgsim
which pairs

# IPython notebook should be started within a virtual environment.
# After "workon <virtual_env>", check dependencies:
pip freeze


/usr/bin/R
/home/ben/.virtualenvs/khmer/bin/python
/home/ben/bin/wgsim
/home/ben/bin/pairs
argparse==1.2.1
khmer==1.2-rc2-6-gc5dee21
matplotlib==1.4.0
mock==1.0.1
nose==1.3.4
numpy==1.9.0
pyparsing==2.0.3
python-dateutil==2.2
screed==0.7.1
six==1.8.0
wsgiref==0.1.2

Workflow demo


In [48]:
# Make sure we are working in <download_path>/brachy_khmer/doc/_tmp_dir at this point!
x = !pwd
if not x[0].endswith("_tmp_dir"):
    !mkdir _tmp_dir
    %cd _tmp_dir
else:
    print "Already in detected _tmp_dir:"
    !pwd


Already in detected _tmp_dir:
/home/ben/brachy_khmer/doc/_tmp_dir

In [49]:
# A few variables...

N = 1000000 # Genome length to simulate
n = 250000 # number of read pairs to simulate
k = 15 # kmer length for hashing later on

Generating a reference genome


In [5]:
# Generate a pseudorandom reference sequence
import random
random.seed(42) # Note we set the random seed -S to be 42.

seq_file = "test_ref.fa"
rand_seq = "".join([random.choice("ATCG") for i in range(N)])

with open(seq_file, "w") as outfile:
    outfile.write(">seq1_len%d\n" % N)
    # Wrap to line length
    line_len = 80
    split_seq = "\n".join([rand_seq[i:i+line_len] for i in range(0, N, line_len)])
    outfile.write(split_seq)

# Reread and check length
!head "$seq_file"
x = !tail -n +2 "$seq_file"
assert(len("".join(x)) == N)

# Compress it
!gzip -vf "$seq_file"


>seq1_len1000000
CATACCGATAACAACCACGAGCTAGTAAGCGCCGTCGCGCCAATAAATCTTATGCCACATGCCCGGAATTAGGTCTGTTA
CTCGTAGCAAACGTATGCGGACCCTCATTGGTCAGGTCCAGCGCATAGGGTAGGATAGGATCTGTACCATGCTCAATCAA
ACAGGAACAAGCTGGAATTTCCGATTGAATTTAGTGCAGGGGGATATATGTGTTGGCCATGCCCACCGAGAACCAAGACC
TCCGACATCTTGATGGAATGGGTAGCCGCAGTCGAAACTCCACTTGGATTAGCTCCTAAGGCGCAATGGGGAAAGGTCAG
GGGGACTGGGGTGAGGAGTTGAAATGGTCTGCGAGAGTATCTCCTCTCAGCCCCTACTTGCTTTTTATGCGCTCATTCCT
TATGGAACCTGGCAAAATGATCACTACAGGATAGCAAGTGCGCGGCCCGCGCTTACGCCTATTTCAACCATCTATGCCGG
TATGGGTCCCAATATCTATACATCACATATATGTGGTAACGTCGCGGAAAACCACGACCAGGAATCGTCACCAGGCAGGT
TTCTGGAGCGCTCGTGCTTCTGGAGATCTAGAAGTGCTAGACTGCCTGAACTCTCTCTCTTAAACGAATCGCTAAAAACT
GCCAGTGCTTAAGTGTACAACTGAGCGACTTGGTCCCGAACAAAATCTACCTCGGCCGTCCGGAATCCTACCGCTAATTT
test_ref.fa:	 69.3% -- replaced with test_ref.fa.gz

Simulating whole genome sequencing (WGS) reads with wgsim


In [6]:
# haplotype mode: no heterozygosity. All mutations are applied to one copy of the genome, and all reads are drawn from that copy.
!wgsim


Program: wgsim (short read simulator)
Version: 0.3.0
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   wgsim [options] <in.ref.fa> <out.read1.fq> <out.read2.fq>

Options: -e FLOAT      base error rate [0.020]
         -d INT        outer distance between the two ends [500]
         -s INT        standard deviation [50]
         -N INT        number of read pairs [1000000]
         -1 INT        length of the first read [70]
         -2 INT        length of the second read [70]
         -r FLOAT      rate of mutations [0.0010]
         -R FLOAT      fraction of indels [0.15]
         -X FLOAT      probability an indel is extended [0.30]
         -S INT        seed for random generator [-1]
         -h            haplotype mode


In [7]:
%%bash -s "$n"

# Simulate read pairs of 100 bp reads at the default mutation and sequencing error rates.
# Note we set the random seed -S to be 42.
wgsim -N $1 -1 100 -2 100 -S 42 -e 0.00 test_ref.fa.gz test_reads.r1.fq test_reads.r2.fq > test_reads.geno
head test_reads.geno

# Now we interleave the read pairs.
pairs join test_reads.r1.fq test_reads.r2.fq | gzip -f > test_reads.il.fq.gz


seq1_len1000000	1356	A	W	+
seq1_len1000000	4918	G	K	+
seq1_len1000000	5813	-	T	-
seq1_len1000000	9526	A	C	-
seq1_len1000000	10105	T	Y	+
seq1_len1000000	11233	G	K	+
seq1_len1000000	11520	C	Y	+
seq1_len1000000	13820	T	-	+
seq1_len1000000	15796	G	C	-
seq1_len1000000	17153	A	M	+
[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 1 sequences, total length: 1000000

In [8]:
# Here is the first pair of reads
!zcat test_reads.il.fq.gz | head -n 8


@seq1_len1000000_951464_951949_0:0:0_0:0:0_0/1
AAGTTTAATTTACCCGGTAGAAAGCGTATGCTGGGTGAAATGCACTTCTGTAGAGCGAAAGTCAGTCTTAGTTCTAGGTGTAATACCATACGGCATATTT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@seq1_len1000000_951464_951949_0:0:0_0:0:0_0/2
ACCTTCGCCAGGGGCAGCTAGTGAGACTTCTCTCTCCCTGGACGAGCTCGCTGATAGAACCGCGGGTTGCGTGTAGGTAGAACGGGGTGCGCATCGGATT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

gzip: stdout: Broken pipe

Hashing with khmer


In [9]:
# Step 1.
# Hashes kmers in the reads file using a Count–min sketch (Ref: http://arxiv.org/abs/1309.2975).
# Output is a binary kmer hash table with fixed size rougly equal to N_TABLES * MIN_TABLESIZE bytes.
!load-into-counting.py


|| This is the script 'load-into-counting.py' in khmer.
|| You are running khmer version 1.2-rc2-6-gc5dee21
|| You are also using screed version 0.7
||
|| If you use this script in a publication, please cite EACH of the following:
||
||   * MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190
||   * Q Zhang et al., arXiv:1309.2975 [q-bio.GN]
||
|| Please see http://khmer.readthedocs.org/en/latest/citations.html for details.

usage: load-into-counting.py [-h] [--version] [-q] [--ksize KSIZE] [--n_tables N_TABLES] [--min-tablesize MIN_TABLESIZE] [--threads N_THREADS] [-b]
                             [--report-total-kmers]
                             output_countingtable_filename input_sequence_filename [input_sequence_filename ...]
load-into-counting.py: error: too few arguments

In [10]:
# Step 2.
# Counts abundance of kmers in hash table.
# Output is a kmer abundance-frequency table in text format.
!abundance-dist.py


|| This is the script 'abundance-dist.py' in khmer.
|| You are running khmer version 1.2-rc2-6-gc5dee21
|| You are also using screed version 0.7
||
|| If you use this script in a publication, please cite EACH of the following:
||
||   * MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190
||   * Q Zhang et al., arXiv:1309.2975 [q-bio.GN]
||
|| Please see http://khmer.readthedocs.org/en/latest/citations.html for details.

usage: abundance-dist.py [-h] [-z] [-s] [--version] input_counting_table_filename input_sequence_filename output_histogram_filename
abundance-dist.py: error: too few arguments

In [11]:
%%bash -s "$k"

# Let's get into hashing.

# Parameters:
#     kmer length
#     Number of hash tables: each table has a different offset so that collisions can be resolved
#     Size of each table: the larger the tables, the lower the collision chance
#     Output filename
#     Input interleaved FASTQ
load-into-counting.py --ksize $1 --n_tables 4 --min-tablesize 1e4 test_reads.k$1.htable test_reads.il.fq.gz

# Notice the high false positive (fp) rate and the error:
#     fp rate estimated to be 1.000
#     ** ERROR: the k-mer counting table is too small this data set.  Increase tablesize/# tables.
# This is because we are only allocating 40 KB memory per table, which is far too low for the number of reads.


Saving k-mer counting table to test_reads.k15.htable
Loading kmers from sequences in ['test_reads.il.fq.gz']
making k-mer counting table
consuming input test_reads.il.fq.gz
saving test_reads.k15.htable
fp rate estimated to be 1.000
|| This is the script 'load-into-counting.py' in khmer.
|| You are running khmer version 1.2-rc2-6-gc5dee21
|| You are also using screed version 0.7
||
|| If you use this script in a publication, please cite EACH of the following:
||
||   * MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190
||   * Q Zhang et al., arXiv:1309.2975 [q-bio.GN]
||
|| Please see http://khmer.readthedocs.org/en/latest/citations.html for details.


PARAMETERS:
 - kmer size =    15 		(-k)
 - n tables =     4 		(-N)
 - min tablesize = 1e+04 	(-x)

Estimated memory usage is 4e+04 bytes (n_tables x min_tablesize)
--------
**
** ERROR: the k-mer counting table is too small this data set.  Increase tablesize/# tables.
**

In [12]:
%%bash -s "$k"

# Increasing the parameters such that we now allocate 8 tables of 40 Mb each.
# NOTE: you will need to change this if you vary the k or n parameters.

# For a real dataset, determining the required table size is not trivial, as it depends on the number of unique kmers in your reads, 
# which in turn depends on the number of reads, polymorphism rate etc.

# The official instruction from http://khmer.readthedocs.org/en/v1.0/choosing-table-sizes.html
# is to use as much memory as you have available.
# A good fp rate is well below 1%.
load-into-counting.py --ksize $1 --n_tables 8 --min-tablesize 4e7 test_reads.k$1.htable test_reads.il.fq.gz


Saving k-mer counting table to test_reads.k15.htable
Loading kmers from sequences in ['test_reads.il.fq.gz']
making k-mer counting table
consuming input test_reads.il.fq.gz
saving test_reads.k15.htable
fp rate estimated to be 0.000
DONE.
wrote to: test_reads.k15.htable.info
|| This is the script 'load-into-counting.py' in khmer.
|| You are running khmer version 1.2-rc2-6-gc5dee21
|| You are also using screed version 0.7
||
|| If you use this script in a publication, please cite EACH of the following:
||
||   * MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190
||   * Q Zhang et al., arXiv:1309.2975 [q-bio.GN]
||
|| Please see http://khmer.readthedocs.org/en/latest/citations.html for details.


PARAMETERS:
 - kmer size =    15 		(-k)
 - n tables =     8 		(-N)
 - min tablesize = 4e+07 	(-x)

Estimated memory usage is 3.2e+08 bytes (n_tables x min_tablesize)
--------

In [13]:
%%bash -s "$k"

# Generate abundance frequency table
abundance-dist.py --no-zero --squash test_reads.k$1.htable test_reads.il.fq.gz test_reads.k$1.hist


hashtable from test_reads.k15.htable
K: 15
HT sizes: [40000003L, 40000021L, 40000033L, 40000061L, 40000123L, 40000159L, 40000183L, 40000211L]
outputting to test_reads.k15.hist
preparing hist...
|| This is the script 'abundance-dist.py' in khmer.
|| You are running khmer version 1.2-rc2-6-gc5dee21
|| You are also using screed version 0.7
||
|| If you use this script in a publication, please cite EACH of the following:
||
||   * MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190
||   * Q Zhang et al., arXiv:1309.2975 [q-bio.GN]
||
|| Please see http://khmer.readthedocs.org/en/latest/citations.html for details.


In [14]:
%%bash -s "$k"

# Columns in the output are:
#     kmer abundance
#     freq of kmers at that abundance count
#     cumulative abundance
#     cumulative fraction of total distinct k-mers.
head test_reads.k$1.hist


1 6 6 0.0
2 1 7 0.0
3 30 37 0.0
4 45 82 0.0
5 37 119 0.0
6 44 163 0.0
7 121 284 0.0
8 163 447 0.0
9 212 659 0.001
10 182 841 0.001

In [50]:
# Calculate kmer coverage

# Source: http://dna.med.monash.edu.au/~torsten/velvet_advisor/
# All coverage values in Velvet are provided in k-mer coverage, 
# i.e. how many times has a k-mer been seen among the reads. 
# The relation between k-mer coverage Ck and standard (nucleotide-wise) coverage C is 
# Ck = C * (L - k + 1) / L 
# where k is your hash length, and L you read length. 

# Source: https://groups.google.com/forum/#!topic/bgi-soap/xKS39Nz4SCE
#    base_coverage_depth = total_base_num/genome_size = (read_num*read_length)/genome_size
#    kmer_coverage_depth = total_kmer_num/genome_size=read_num*(read_length-kmer_size+1)/genome_size
# So the relationship between base coverage depth and kmer coverage depth is:
#    kmer_coverage_depth = base_coverage_depth*(read_length-kmer_size+1)/read_length

def get_estimated_coverage(num_read_pairs, genome_len, L=100, k=25):
    C  = num_read_pairs * 100 * 2 / genome_len
    Ck = C*(L - k + 1) / L
    return C, Ck

print n, N, k
print get_estimated_coverage(num_read_pairs=n, genome_len=N, L=100, k=k)

# Given the genome the coverage is sufficiently high, we would expect the main peak to be centered at the kmer coverage.


250000 1000000 15
(50, 43)

In [51]:
# Plotting freq against abundance to produce the kmer spectra, 
# using a wrapper script to set axis limits
!python ../../scripts/plot_hist.py 100 10 test_reads.k$k\.hist

from IPython.core.display import Image, display
Image(filename="test_reads.k{0}.png".format(k)) 

# We can see the characteristic shape of a kmer spectrum of a random genome:
#    Poisson distributed main peak
#    Low abundance error peak

# Why is the main peak shifted left from the expected coverage of 43?
# Because the sequencing error rate of -e 0.02 shifts counts towards the error peak.


Out[51]:

In [17]:
# Repeating the demo using -e 0.00 removes the error peak and causes the main peak to be centered closer to the expected value.
Image(filename="../test_data/test_reads.k15.test_reads.k15.noerror.png")


Out[17]:

Simulating WGS data with subgenomes

We simulate sequencing data from genomes with different characteristics to see if we can detect a subgenome signal.


In [18]:
# Get a Brachy reference genome
!wget -nv https://doc-08-60-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/pq1qrefcuapcfa52g4pm3u32kuq69ad9/1415203200000/12528459609246823118/*/0B4t9UUDOrtFnZDJoZFBoZXRkN1U?e=download \
    -O Bdistachyon_MIPS_1_2.fa.gz


2014-11-02 10:58:56 URL:https://dl.dropboxusercontent.com/content_link/KN3xvpdUuhZv26R3vM9GPOGx4gT0b0w9niyFuKtXrI7nu1yQf7j6k8YxsVAfaNqu [82614448/82614448] -> "Bdistachyon_MIPS_1_2.fa.gz" [1]

In [19]:
%%bash
# Simulating reads from a polyploid genome with a subgenome with different mutation rate.

# sim_subgenomes.sh is is a wrapper script that uses wgsim
# Simulates 2.5M reads at a polymorphism rate of 0.005 (1/200), then simulates 2.5M reads at a polymorphism rate of 0.05 and merges.

# Parameters:
#     Reference file .fa.gz
#     Total number of read pairs and read length
#     Sequencing error rate
#     Number of subgenomes
#     Respective proportion of reads from each subgenome
#     Respective polymorphism rates
../../scripts/sim_subgenomes.sh Bdistachyon_MIPS_1_2.fa.gz \
                                                5000000 100 \
                                                0.02 \
                                                2 \
                                                0.5 0.5 \
                                                0.05 0.005


Output file: Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.il.fq.gz
Simulating 2500000 read pairs at mutation rate 0.05 for subgenome 0
Simulating 2500000 read pairs at mutation rate 0.005 for subgenome 1
[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 83 sequences, total length: 271923306
[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 83 sequences, total length: 271923306

In [20]:
%%bash
# Simulating diploid genome reads.
# Same as above, except both genomes have the same polymorphism rate of 0.005

../../scripts/sim_subgenomes.sh Bdistachyon_MIPS_1_2.fa.gz \
                                                5000000 100 \
                                                0.02 \
                                                2 \
                                                0.5 0.5 \
                                                0.005 0.005


Output file: Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.005,0.005-.il.fq.gz
Simulating 2500000 read pairs at mutation rate 0.005 for subgenome 0
Simulating 2500000 read pairs at mutation rate 0.005 for subgenome 1
[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 83 sequences, total length: 271923306
[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 83 sequences, total length: 271923306

Hashing simulated data


In [21]:
%%bash

# Hashing simulated polyploid reads at k=25 using wrapper script auto_hash_and_count.sh

# (Simulated) .il.fq.gz file
# 4 tables x 2e9 bytes = 8 Gb memory usage
# k=25
../../scripts/auto_hash_and_count.sh Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.il.fq.gz \
                                        4 2e9 \
                                        25

In [22]:
!cat Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.k25.htable.info


through end: Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.il.fq.gz
fp rate estimated to be 0.003

In [23]:
%%bash

# Hashing simulated diploid reads at k=25

# Same parameters as above
../../scripts/auto_hash_and_count.sh Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.005,0.005-.il.fq.gz \
                                        4 2e9 \
                                        25

In [24]:
!cat Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.005,0.005-.k25.htable.info


through end: Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.005,0.005-.il.fq.gz
fp rate estimated to be 0.002

Plotting simulated data


In [25]:
# Play around with different axis limits
!python ../../scripts/plot_hist.py  1000 10 \
    Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.005,0.005-.k25.hist \
    Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.k25.hist
Image(filename="Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.005,0.005-.k25.Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.k25.png")
# No peaks?!


Out[25]:

In [26]:
%%bash
# Is sequencing error obscuring the peak signal?
# Set error rate to 0.00
../../scripts/sim_subgenomes.sh Bdistachyon_MIPS_1_2.fa.gz \
                                                5000000 100 \
                                                0.00 \
                                                2 \
                                                0.5 0.5 \
                                                0.05 0.005

# Same parameters as above
../../scripts/auto_hash_and_count.sh Bdistachyon_MIPS_1_2.-0.00-0.5,0.5-0.05,0.005-.il.fq.gz \
                                        4 2e9 \
                                       25


Output file: Bdistachyon_MIPS_1_2.-0.00-0.5,0.5-0.05,0.005-.il.fq.gz
Simulating 2500000 read pairs at mutation rate 0.05 for subgenome 0
Simulating 2500000 read pairs at mutation rate 0.005 for subgenome 1
[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 83 sequences, total length: 271923306
[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 83 sequences, total length: 271923306

In [27]:
!python ../../scripts/plot_hist.py 1000 10 \
    Bdistachyon_MIPS_1_2.-0.00-0.5,0.5-0.05,0.005-.k25.hist \
    Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.k25.hist

Image(filename="Bdistachyon_MIPS_1_2.-0.00-0.5,0.5-0.05,0.005-.k25.Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.k25.png")
# Fewer kmers with high abundance due to higher error rate creating more unique kmers.
# Still no signal.


Out[27]:

In [47]:
# Calculate expected cov.
lm = !zcat Bdistachyon_MIPS_1_2.fa.gz | grep -v '^>' | wc -lm
genome_len = int(lm[0].split()[1]) - int(lm[0].split()[0])

print genome_len
print get_estimated_coverage(num_read_pairs=5e6, genome_len=genome_len)


271923306
(3.6775075101506745, 2.7949057077145123)

In [29]:
# Lets zoom in around that region
!python ../../scripts/plot_hist.py  10 10 \
    Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.005,0.005-.k25.hist \
    Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.k25.hist
Image(filename="Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.005,0.005-.k25.Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.k25.png")


Out[29]:

Considerations

  • What is the effect of varying the subgenome mutation rate?
  • Is this a correct way of simulating polyploidy?

Results from real demultiplexed WGS data


In [52]:
%cd ../test_data


/home/ben/brachy_khmer/doc/test_data

In [31]:
# Bdis05 WGS reads
# 9.7 M read pairs
print get_estimated_coverage(num_read_pairs=9.7e6, genome_len=genome_len)

# We also hashed a hexaploid whole genome data set from the Vogel lab. 
# Approx. 6M read pairs.

# Possibly Brachypodium hybridum ABR113?
# http://aob.oxfordjournals.org/content/early/2012/01/01/aob.mcr294.full
# Allotetraploid with 2n = 30 chromosomes (x = 10 + x = 5), derived from the cross between B. stacei and B. distachyon. 
# Genome size of 1.265 pg/2C. => 978 Mb
print get_estimated_coverage(num_read_pairs=6e6, genome_len=1.265*978e6/2)
# Contains both relatively small B. stacei-type and large B. distachyon-type chromosomes. 
# The name hybridum refers to its hybrid allotetraploid nature. 

!python ../../scripts/plot_hist.py 4000 10 \
    ../test_data/bdis05/Bdis05-1_qcd.k25.hist \
    ../test_data/vogel_hexaploids/7424.2.73049.ACAGTG.k25.hist

Image(filename="Bdis05-1_qcd.k25.7424.2.73049.ACAGTG.k25.png")

# Interesting, but with such low coverage, peaks at high abundance are likely to be organelle DNA/simple repeats.


(7.134364569692309, 5.422117072966155)
(1.9399112490603554, 1.47433254928587)
Out[31]:

Results from real GBS data

Compared to WGS data, there are additional considerations with GBS data:

  • Inherently lower coverage, even compared to multiplexed samples.
  • Reads are clustered around cut sites, not evenly distributed across the genome.
    • Can result in repetitive sequences?
    • Excludes organelle DNA?
  • Only a subset of possible kmers is available.
  • PCR introduces error.
    • Overall, the noise peak tends to be prominent.

But:

  • GBS sample read sets are small.
    • Samples hash quickly
    • Samples can be hashed using relatively low and constant memory usage

We used two Brachypodium GBS datasets: bd1 (mostly diploid) and bd6 (mostly hexaploid). bd1 is a recombinant inbred line; bd6 is not, but the species is selfing, so low heterozygosity is expected in wild populations anyway.

  • approx. 4M read pairs: bd_1E4, bd_6D1
  • approx. 2M read pairs: bd_1E6, bd_6H11
  • approx. 1M read pairs: bd_1F9, bd_6F11

In [32]:
%matplotlib inline  
import matplotlib.pyplot as plt
# Can we improve kmer coverage by hashing at a lower k?
fig, ax = plt.subplots()
ax.set_xlabel("k")
ax.set_ylabel("C_k")
ax.plot(range(10, 30), 
         [get_estimated_coverage(num_read_pairs=4e6, genome_len=genome_len, k=k)[1] 
         for k in range(10, 30)])
plt.savefig("improve_ck.png", bbox_inches='tight')
# Not really, the range is small.



In [53]:
# We just hashed at k=12 because memory usage is much better.
# Only 400 Mb per sample.
!grep -E 'memory|fp|ms' ../test_data/bd1_gbs/bd_1E4.k12.log

!python ../../scripts/plot_hist.py 2500 10 \
    ../test_data/bd1_gbs/bd_1E4.k12.hist \
    ../test_data/bd6_gbs/bd_6D1.k12.hist
Image(filename="bd_1E4.k12.bd_6D1.k12.png")

print get_estimated_coverage(num_read_pairs=4e6, genome_len=genome_len)
# The hexaploid sample has a much lower count of unique kmers that have high abundance


Estimated memory usage is 4e+08 bytes (n_tables x min_tablesize)
fp rate estimated to be 0.000
(2.9420060081205395, 2.23592456617161)

Conclusions so far

  • No peak signal due to low coverage. All the signal is squashed into the left.
  • We can't improve coverage enough by choosing a smaller k.
  • The distribution is shifted towards the low abundance end in polyploids, possibly due to more heterogeneity in composition.
  • There are interesting features at the high abundance end.
    • In WGS datasets, we need to watch for organelle DNA.

Analysing high abundance 12-mers

One advantage of a smaller k: there are far fewer 12-mers than 25-mers. We can iterate over all kmers and select the most abundant kmers in one of the samples, then construct a counts matrix for that set of kmers over all samples, normalised by library size.


In [34]:
# File with all the samples we want to analyse:
#     Path to sample .htable
#     number of read pairs in the sample to normalise counts by.

# Another option would be to normalise by the total number of kmers instead,
# These numbers are reported in the hashing .log files:
# e.g. Total number of k-mers: 5508702

# The set of most abundant kmers is derived from the first sample in the list.
# Unfortunately this .htable data is not on Dropbox (~9.7 Gb)
!head bd_clustering/k12_tables.txt


/home/ben/data/bd1_bd6_htables/bd_1B12.k12.htable	2305524
/home/ben/data/bd1_bd6_htables/bd_1C6.k12.htable	3072628
/home/ben/data/bd1_bd6_htables/bd_1C9.k12.htable	3564123
/home/ben/data/bd1_bd6_htables/bd_1D11.k12.htable	1034453
/home/ben/data/bd1_bd6_htables/bd_1E4.k12.htable	3974062
/home/ben/data/bd1_bd6_htables/bd_1E6.k12.htable	2011722
/home/ben/data/bd1_bd6_htables/bd_1F1.k12.htable	3100715
/home/ben/data/bd1_bd6_htables/bd_1F9.k12.htable	1003988
/home/ben/data/bd1_bd6_htables/bd_1G12.k12.htable	1150067
/home/ben/data/bd1_bd6_htables/bd_1H10.k12.htable	2219785

In [35]:
%%bash

cd ../test_data/bd_clustering/

# Arguments:
#     List of sample kmer hash tables .htable
#     k=12
#     After skipping the 200 most abundant kmers of first sample, analyse the next 100.

# In order for the matrix to be informative, a mean sample presence that is not too close to 100% is required.
python ../../../scripts/repeat_pres_abs.py k12_tables.txt 12 200 100
# In this case: "Mean sample presence proportion is 83.33%"


Loading sample 1 of 24 (2305524 read pairs): bd_1B12.k12
Counting first sample: bd_1B12.k12
5.96% complete.
11.92% complete.
17.88% complete.
23.84% complete.
29.80% complete.
35.76% complete.
41.72% complete.
47.68% complete.
53.64% complete.
59.60% complete.
65.57% complete.
71.53% complete.
77.49% complete.
83.45% complete.
89.41% complete.
95.37% complete.
Heapifying...
Discarding top 200 most abundant kmers.
Analysing next 100 most abundant kmers.
	Sample has 6.91 total normalised counts.
	Sample has 100.00% presence.
Loading sample 2 of 24 (3072628 read pairs): bd_1C6.k12
	Sample has 4.89 total normalised counts.
	Sample has 98.00% presence.
Loading sample 3 of 24 (3564123 read pairs): bd_1C9.k12
	Sample has 4.54 total normalised counts.
	Sample has 96.00% presence.
Loading sample 4 of 24 (1034453 read pairs): bd_1D11.k12
	Sample has 7.61 total normalised counts.
	Sample has 70.00% presence.
Loading sample 5 of 24 (3974062 read pairs): bd_1E4.k12
	Sample has 4.14 total normalised counts.
	Sample has 86.00% presence.
Loading sample 6 of 24 (2011722 read pairs): bd_1E6.k12
	Sample has 6.17 total normalised counts.
	Sample has 96.00% presence.
Loading sample 7 of 24 (3100715 read pairs): bd_1F1.k12
	Sample has 4.26 total normalised counts.
	Sample has 98.00% presence.
Loading sample 8 of 24 (1003988 read pairs): bd_1F9.k12
	Sample has 9.50 total normalised counts.
	Sample has 86.00% presence.
Loading sample 9 of 24 (1150067 read pairs): bd_1G12.k12
	Sample has 6.62 total normalised counts.
	Sample has 84.00% presence.
Loading sample 10 of 24 (2219785 read pairs): bd_1H10.k12
	Sample has 6.03 total normalised counts.
	Sample has 94.00% presence.
Loading sample 11 of 24 (746953 read pairs): bd_1H2.k12
	Sample has 10.25 total normalised counts.
	Sample has 86.00% presence.
Loading sample 12 of 24 (2015268 read pairs): bd_1H4.k12
	Sample has 5.51 total normalised counts.
	Sample has 84.00% presence.
Loading sample 13 of 24 (2209712 read pairs): bd_6A9.k12
	Sample has 2.73 total normalised counts.
	Sample has 96.00% presence.
Loading sample 14 of 24 (731673 read pairs): bd_6B5.k12
	Sample has 4.53 total normalised counts.
	Sample has 56.00% presence.
Loading sample 15 of 24 (1150553 read pairs): bd_6C3.k12
	Sample has 2.97 total normalised counts.
	Sample has 56.00% presence.
Loading sample 16 of 24 (1045482 read pairs): bd_6C5.k12
	Sample has 3.55 total normalised counts.
	Sample has 58.00% presence.
Loading sample 17 of 24 (2013071 read pairs): bd_6C7.k12
	Sample has 2.27 total normalised counts.
	Sample has 70.00% presence.
Loading sample 18 of 24 (3843959 read pairs): bd_6D1.k12
	Sample has 1.62 total normalised counts.
	Sample has 92.00% presence.
Loading sample 19 of 24 (3540100 read pairs): bd_6D4.k12
	Sample has 1.90 total normalised counts.
	Sample has 96.00% presence.
Loading sample 20 of 24 (3117896 read pairs): bd_6E5.k12
	Sample has 1.96 total normalised counts.
	Sample has 96.00% presence.
Loading sample 21 of 24 (1016378 read pairs): bd_6F11.k12
	Sample has 4.19 total normalised counts.
	Sample has 44.00% presence.
Loading sample 22 of 24 (2307156 read pairs): bd_6F4.k12
	Sample has 2.10 total normalised counts.
	Sample has 88.00% presence.
Loading sample 23 of 24 (2009189 read pairs): bd_6H11.k12
	Sample has 2.58 total normalised counts.
	Sample has 72.00% presence.
Loading sample 24 of 24 (3065407 read pairs): bd_6H6.k12
	Sample has 1.76 total normalised counts.
	Sample has 98.00% presence.
Mean sample presence proportion is 83.33%
Output matrix saved to k12_tables.skip200.N100.pres_abs.csv

Clustering GBS samples by repeat profile


In [36]:
# Cluster samples (default dist measure="euclidean", hclust method="complete")
!Rscript ../../scripts/cluster_samples.R \
    ../test_data/bd_clustering/k12_tables.skip200.N100.pres_abs.csv \
    k12_tables.skip200.N100_cluster.png
    
Image(filename="k12_tables.skip200.N100_cluster.png")


Out[36]:

In [41]:
# Change back to starting dir, ready to re-run
%cd ../


/home/ben/brachy_khmer/doc

Considerations

  • Find out what the samples are biologically (esp. ploidy, replicates)
  • Different clustering methods
  • Lower the mean presence % per sample. Experiment with:
    • Different set of kmers (i.e. the 2nd most abundant 100)
    • Different value of k, although runtime increases exponentially

Source

Acknowledgements