Exploring K-mer hashing in Brachypodium

Ben Bai (CMBE, ANU, Canberra, Australia)

Introduction

Exploring read-level clustering of Brachypodium using properties of the kmer distribution.
- Can we detect peak structures that indicate homeozygosity variation?
- Can we distinguish technical replicates?
- Can we distinguish ploidy?
Experimental design should account for the range of input genetic diversity. Extremely distant samples may be incomparable.
Useful where SNP-based methods suffer from missing data/reference sequences.
We can look for similarities at the read-level rather than differences at the haplotypic level.

Check dependencies

python 2.x (developed on 2.7.6)
R 3.0+ (developed on 3.1.0)
khmer 1.2+ (developed on 1.2-rc2-6-gc5dee21)
wgsim (developed on 0.3.0, from samtools library)
pairs (from the seqqs library)



In [2]:

    
%%bash

which R
which python

# Locations on percy.anu.edu.au
# export PATH=$PATH:"/mnt/rescued_fast/kevin/ws/gbsqc/tools/seqqs/":"/usr/local/src/samtools/misc/"
which wgsim
which pairs

# IPython notebook should be started within a virtual environment.
# After "workon <virtual_env>", check dependencies:
pip freeze









    



/usr/bin/R
/home/ben/.virtualenvs/khmer/bin/python
/home/ben/bin/wgsim
/home/ben/bin/pairs
argparse==1.2.1
khmer==1.2-rc2-6-gc5dee21
matplotlib==1.4.0
mock==1.0.1
nose==1.3.4
numpy==1.9.0
pyparsing==2.0.3
python-dateutil==2.2
screed==0.7.1
six==1.8.0
wsgiref==0.1.2

Workflow demo



In [48]:

    
# Make sure we are working in <download_path>/brachy_khmer/doc/_tmp_dir at this point!
x = !pwd
if not x[0].endswith("_tmp_dir"):
    !mkdir _tmp_dir
    %cd _tmp_dir
else:
    print "Already in detected _tmp_dir:"
    !pwd









    



Already in detected _tmp_dir:
/home/ben/brachy_khmer/doc/_tmp_dir



In [49]:

    
# A few variables...

N = 1000000 # Genome length to simulate
n = 250000 # number of read pairs to simulate
k = 15 # kmer length for hashing later on

Generating a reference genome



In [5]:

    
# Generate a pseudorandom reference sequence
import random
random.seed(42) # Note we set the random seed -S to be 42.

seq_file = "test_ref.fa"
rand_seq = "".join([random.choice("ATCG") for i in range(N)])

with open(seq_file, "w") as outfile:
    outfile.write(">seq1_len%d\n" % N)
    # Wrap to line length
    line_len = 80
    split_seq = "\n".join([rand_seq[i:i+line_len] for i in range(0, N, line_len)])
    outfile.write(split_seq)

# Reread and check length
!head "$seq_file"
x = !tail -n +2 "$seq_file"
assert(len("".join(x)) == N)

# Compress it
!gzip -vf "$seq_file"









    



>seq1_len1000000
CATACCGATAACAACCACGAGCTAGTAAGCGCCGTCGCGCCAATAAATCTTATGCCACATGCCCGGAATTAGGTCTGTTA
CTCGTAGCAAACGTATGCGGACCCTCATTGGTCAGGTCCAGCGCATAGGGTAGGATAGGATCTGTACCATGCTCAATCAA
ACAGGAACAAGCTGGAATTTCCGATTGAATTTAGTGCAGGGGGATATATGTGTTGGCCATGCCCACCGAGAACCAAGACC
TCCGACATCTTGATGGAATGGGTAGCCGCAGTCGAAACTCCACTTGGATTAGCTCCTAAGGCGCAATGGGGAAAGGTCAG
GGGGACTGGGGTGAGGAGTTGAAATGGTCTGCGAGAGTATCTCCTCTCAGCCCCTACTTGCTTTTTATGCGCTCATTCCT
TATGGAACCTGGCAAAATGATCACTACAGGATAGCAAGTGCGCGGCCCGCGCTTACGCCTATTTCAACCATCTATGCCGG
TATGGGTCCCAATATCTATACATCACATATATGTGGTAACGTCGCGGAAAACCACGACCAGGAATCGTCACCAGGCAGGT
TTCTGGAGCGCTCGTGCTTCTGGAGATCTAGAAGTGCTAGACTGCCTGAACTCTCTCTCTTAAACGAATCGCTAAAAACT
GCCAGTGCTTAAGTGTACAACTGAGCGACTTGGTCCCGAACAAAATCTACCTCGGCCGTCCGGAATCCTACCGCTAATTT
test_ref.fa:	 69.3% -- replaced with test_ref.fa.gz

Simulating whole genome sequencing (WGS) reads with wgsim



In [6]:

    
# haplotype mode: no heterozygosity. All mutations are applied to one copy of the genome, and all reads are drawn from that copy.
!wgsim









    



Program: wgsim (short read simulator)
Version: 0.3.0
Contact: Heng Li <lh3@sanger.ac.uk>

Usage:   wgsim [options] <in.ref.fa> <out.read1.fq> <out.read2.fq>

Options: -e FLOAT      base error rate [0.020]
         -d INT        outer distance between the two ends [500]
         -s INT        standard deviation [50]
         -N INT        number of read pairs [1000000]
         -1 INT        length of the first read [70]
         -2 INT        length of the second read [70]
         -r FLOAT      rate of mutations [0.0010]
         -R FLOAT      fraction of indels [0.15]
         -X FLOAT      probability an indel is extended [0.30]
         -S INT        seed for random generator [-1]
         -h            haplotype mode



In [7]:

    
%%bash -s "$n"

# Simulate read pairs of 100 bp reads at the default mutation and sequencing error rates.
# Note we set the random seed -S to be 42.
wgsim -N $1 -1 100 -2 100 -S 42 -e 0.00 test_ref.fa.gz test_reads.r1.fq test_reads.r2.fq > test_reads.geno
head test_reads.geno

# Now we interleave the read pairs.
pairs join test_reads.r1.fq test_reads.r2.fq | gzip -f > test_reads.il.fq.gz









    



seq1_len1000000	1356	A	W	+
seq1_len1000000	4918	G	K	+
seq1_len1000000	5813	-	T	-
seq1_len1000000	9526	A	C	-
seq1_len1000000	10105	T	Y	+
seq1_len1000000	11233	G	K	+
seq1_len1000000	11520	C	Y	+
seq1_len1000000	13820	T	-	+
seq1_len1000000	15796	G	C	-
seq1_len1000000	17153	A	M	+






    



[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 1 sequences, total length: 1000000



In [8]:

    
# Here is the first pair of reads
!zcat test_reads.il.fq.gz | head -n 8









    



@seq1_len1000000_951464_951949_0:0:0_0:0:0_0/1
AAGTTTAATTTACCCGGTAGAAAGCGTATGCTGGGTGAAATGCACTTCTGTAGAGCGAAAGTCAGTCTTAGTTCTAGGTGTAATACCATACGGCATATTT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
@seq1_len1000000_951464_951949_0:0:0_0:0:0_0/2
ACCTTCGCCAGGGGCAGCTAGTGAGACTTCTCTCTCCCTGGACGAGCTCGCTGATAGAACCGCGGGTTGCGTGTAGGTAGAACGGGGTGCGCATCGGATT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII

gzip: stdout: Broken pipe

Hashing with khmer



In [9]:

    
# Step 1.
# Hashes kmers in the reads file using a Count–min sketch (Ref: http://arxiv.org/abs/1309.2975).
# Output is a binary kmer hash table with fixed size rougly equal to N_TABLES * MIN_TABLESIZE bytes.
!load-into-counting.py









    



|| This is the script 'load-into-counting.py' in khmer.
|| You are running khmer version 1.2-rc2-6-gc5dee21
|| You are also using screed version 0.7
||
|| If you use this script in a publication, please cite EACH of the following:
||
||   * MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190
||   * Q Zhang et al., arXiv:1309.2975 [q-bio.GN]
||
|| Please see http://khmer.readthedocs.org/en/latest/citations.html for details.

usage: load-into-counting.py [-h] [--version] [-q] [--ksize KSIZE] [--n_tables N_TABLES] [--min-tablesize MIN_TABLESIZE] [--threads N_THREADS] [-b]
                             [--report-total-kmers]
                             output_countingtable_filename input_sequence_filename [input_sequence_filename ...]
load-into-counting.py: error: too few arguments



In [10]:

    
# Step 2.
# Counts abundance of kmers in hash table.
# Output is a kmer abundance-frequency table in text format.
!abundance-dist.py









    



|| This is the script 'abundance-dist.py' in khmer.
|| You are running khmer version 1.2-rc2-6-gc5dee21
|| You are also using screed version 0.7
||
|| If you use this script in a publication, please cite EACH of the following:
||
||   * MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190
||   * Q Zhang et al., arXiv:1309.2975 [q-bio.GN]
||
|| Please see http://khmer.readthedocs.org/en/latest/citations.html for details.

usage: abundance-dist.py [-h] [-z] [-s] [--version] input_counting_table_filename input_sequence_filename output_histogram_filename
abundance-dist.py: error: too few arguments



In [11]:

    
%%bash -s "$k"

# Let's get into hashing.

# Parameters:
#     kmer length
#     Number of hash tables: each table has a different offset so that collisions can be resolved
#     Size of each table: the larger the tables, the lower the collision chance
#     Output filename
#     Input interleaved FASTQ
load-into-counting.py --ksize $1 --n_tables 4 --min-tablesize 1e4 test_reads.k$1.htable test_reads.il.fq.gz

# Notice the high false positive (fp) rate and the error:
#     fp rate estimated to be 1.000
#     ** ERROR: the k-mer counting table is too small this data set.  Increase tablesize/# tables.
# This is because we are only allocating 40 KB memory per table, which is far too low for the number of reads.









    



Saving k-mer counting table to test_reads.k15.htable
Loading kmers from sequences in ['test_reads.il.fq.gz']
making k-mer counting table
consuming input test_reads.il.fq.gz
saving test_reads.k15.htable
fp rate estimated to be 1.000






    



|| This is the script 'load-into-counting.py' in khmer.
|| You are running khmer version 1.2-rc2-6-gc5dee21
|| You are also using screed version 0.7
||
|| If you use this script in a publication, please cite EACH of the following:
||
||   * MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190
||   * Q Zhang et al., arXiv:1309.2975 [q-bio.GN]
||
|| Please see http://khmer.readthedocs.org/en/latest/citations.html for details.


PARAMETERS:
 - kmer size =    15 		(-k)
 - n tables =     4 		(-N)
 - min tablesize = 1e+04 	(-x)

Estimated memory usage is 4e+04 bytes (n_tables x min_tablesize)
--------
**
** ERROR: the k-mer counting table is too small this data set.  Increase tablesize/# tables.
**



In [12]:

    
%%bash -s "$k"

# Increasing the parameters such that we now allocate 8 tables of 40 Mb each.
# NOTE: you will need to change this if you vary the k or n parameters.

# For a real dataset, determining the required table size is not trivial, as it depends on the number of unique kmers in your reads, 
# which in turn depends on the number of reads, polymorphism rate etc.

# The official instruction from http://khmer.readthedocs.org/en/v1.0/choosing-table-sizes.html
# is to use as much memory as you have available.
# A good fp rate is well below 1%.
load-into-counting.py --ksize $1 --n_tables 8 --min-tablesize 4e7 test_reads.k$1.htable test_reads.il.fq.gz









    



Saving k-mer counting table to test_reads.k15.htable
Loading kmers from sequences in ['test_reads.il.fq.gz']
making k-mer counting table
consuming input test_reads.il.fq.gz
saving test_reads.k15.htable
fp rate estimated to be 0.000
DONE.
wrote to: test_reads.k15.htable.info






    



|| This is the script 'load-into-counting.py' in khmer.
|| You are running khmer version 1.2-rc2-6-gc5dee21
|| You are also using screed version 0.7
||
|| If you use this script in a publication, please cite EACH of the following:
||
||   * MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190
||   * Q Zhang et al., arXiv:1309.2975 [q-bio.GN]
||
|| Please see http://khmer.readthedocs.org/en/latest/citations.html for details.


PARAMETERS:
 - kmer size =    15 		(-k)
 - n tables =     8 		(-N)
 - min tablesize = 4e+07 	(-x)

Estimated memory usage is 3.2e+08 bytes (n_tables x min_tablesize)
--------



In [13]:

    
%%bash -s "$k"

# Generate abundance frequency table
abundance-dist.py --no-zero --squash test_reads.k$1.htable test_reads.il.fq.gz test_reads.k$1.hist









    



hashtable from test_reads.k15.htable
K: 15
HT sizes: [40000003L, 40000021L, 40000033L, 40000061L, 40000123L, 40000159L, 40000183L, 40000211L]
outputting to test_reads.k15.hist
preparing hist...






    



|| This is the script 'abundance-dist.py' in khmer.
|| You are running khmer version 1.2-rc2-6-gc5dee21
|| You are also using screed version 0.7
||
|| If you use this script in a publication, please cite EACH of the following:
||
||   * MR Crusoe et al., 2014. doi: 10.6084/m9.figshare.979190
||   * Q Zhang et al., arXiv:1309.2975 [q-bio.GN]
||
|| Please see http://khmer.readthedocs.org/en/latest/citations.html for details.



In [14]:

    
%%bash -s "$k"

# Columns in the output are:
#     kmer abundance
#     freq of kmers at that abundance count
#     cumulative abundance
#     cumulative fraction of total distinct k-mers.
head test_reads.k$1.hist









    



1 6 6 0.0
2 1 7 0.0
3 30 37 0.0
4 45 82 0.0
5 37 119 0.0
6 44 163 0.0
7 121 284 0.0
8 163 447 0.0
9 212 659 0.001
10 182 841 0.001



In [50]:

    
# Calculate kmer coverage

# Source: http://dna.med.monash.edu.au/~torsten/velvet_advisor/
# All coverage values in Velvet are provided in k-mer coverage, 
# i.e. how many times has a k-mer been seen among the reads. 
# The relation between k-mer coverage Ck and standard (nucleotide-wise) coverage C is 
# Ck = C * (L - k + 1) / L 
# where k is your hash length, and L you read length. 

# Source: https://groups.google.com/forum/#!topic/bgi-soap/xKS39Nz4SCE
#    base_coverage_depth = total_base_num/genome_size = (read_num*read_length)/genome_size
#    kmer_coverage_depth = total_kmer_num/genome_size=read_num*(read_length-kmer_size+1)/genome_size
# So the relationship between base coverage depth and kmer coverage depth is:
#    kmer_coverage_depth = base_coverage_depth*(read_length-kmer_size+1)/read_length

def get_estimated_coverage(num_read_pairs, genome_len, L=100, k=25):
    C  = num_read_pairs * 100 * 2 / genome_len
    Ck = C*(L - k + 1) / L
    return C, Ck

print n, N, k
print get_estimated_coverage(num_read_pairs=n, genome_len=N, L=100, k=k)

# Given the genome the coverage is sufficiently high, we would expect the main peak to be centered at the kmer coverage.









    



250000 1000000 15
(50, 43)



In [51]:

    
# Plotting freq against abundance to produce the kmer spectra, 
# using a wrapper script to set axis limits
!python ../../scripts/plot_hist.py 100 10 test_reads.k$k\.hist

from IPython.core.display import Image, display
Image(filename="test_reads.k{0}.png".format(k)) 

# We can see the characteristic shape of a kmer spectrum of a random genome:
#    Poisson distributed main peak
#    Low abundance error peak

# Why is the main peak shifted left from the expected coverage of 43?
# Because the sequencing error rate of -e 0.02 shifts counts towards the error peak.









    Out[51]:



In [17]:

    
# Repeating the demo using -e 0.00 removes the error peak and causes the main peak to be centered closer to the expected value.
Image(filename="../test_data/test_reads.k15.test_reads.k15.noerror.png")









    Out[17]:

Simulating WGS data with subgenomes

We simulate sequencing data from genomes with different characteristics to see if we can detect a subgenome signal.



In [18]:

    
# Get a Brachy reference genome
!wget -nv https://doc-08-60-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/pq1qrefcuapcfa52g4pm3u32kuq69ad9/1415203200000/12528459609246823118/*/0B4t9UUDOrtFnZDJoZFBoZXRkN1U?e=download \
    -O Bdistachyon_MIPS_1_2.fa.gz









    



2014-11-02 10:58:56 URL:https://dl.dropboxusercontent.com/content_link/KN3xvpdUuhZv26R3vM9GPOGx4gT0b0w9niyFuKtXrI7nu1yQf7j6k8YxsVAfaNqu [82614448/82614448] -> "Bdistachyon_MIPS_1_2.fa.gz" [1]



In [19]:

    
%%bash
# Simulating reads from a polyploid genome with a subgenome with different mutation rate.

# sim_subgenomes.sh is is a wrapper script that uses wgsim
# Simulates 2.5M reads at a polymorphism rate of 0.005 (1/200), then simulates 2.5M reads at a polymorphism rate of 0.05 and merges.

# Parameters:
#     Reference file .fa.gz
#     Total number of read pairs and read length
#     Sequencing error rate
#     Number of subgenomes
#     Respective proportion of reads from each subgenome
#     Respective polymorphism rates
../../scripts/sim_subgenomes.sh Bdistachyon_MIPS_1_2.fa.gz \
                                                5000000 100 \
                                                0.02 \
                                                2 \
                                                0.5 0.5 \
                                                0.05 0.005









    



Output file: Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.il.fq.gz
Simulating 2500000 read pairs at mutation rate 0.05 for subgenome 0
Simulating 2500000 read pairs at mutation rate 0.005 for subgenome 1






    



[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 83 sequences, total length: 271923306
[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 83 sequences, total length: 271923306



In [20]:

    
%%bash
# Simulating diploid genome reads.
# Same as above, except both genomes have the same polymorphism rate of 0.005

../../scripts/sim_subgenomes.sh Bdistachyon_MIPS_1_2.fa.gz \
                                                5000000 100 \
                                                0.02 \
                                                2 \
                                                0.5 0.5 \
                                                0.005 0.005









    



Output file: Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.005,0.005-.il.fq.gz
Simulating 2500000 read pairs at mutation rate 0.005 for subgenome 0
Simulating 2500000 read pairs at mutation rate 0.005 for subgenome 1






    



[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 83 sequences, total length: 271923306
[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 83 sequences, total length: 271923306

Hashing simulated data



In [21]:

    
%%bash

# Hashing simulated polyploid reads at k=25 using wrapper script auto_hash_and_count.sh

# (Simulated) .il.fq.gz file
# 4 tables x 2e9 bytes = 8 Gb memory usage
# k=25
../../scripts/auto_hash_and_count.sh Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.il.fq.gz \
                                        4 2e9 \
                                        25



In [22]:

    
!cat Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.k25.htable.info









    



through end: Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.il.fq.gz
fp rate estimated to be 0.003



In [23]:

    
%%bash

# Hashing simulated diploid reads at k=25

# Same parameters as above
../../scripts/auto_hash_and_count.sh Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.005,0.005-.il.fq.gz \
                                        4 2e9 \
                                        25



In [24]:

    
!cat Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.005,0.005-.k25.htable.info









    



through end: Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.005,0.005-.il.fq.gz
fp rate estimated to be 0.002

Plotting simulated data



In [25]:

    
# Play around with different axis limits
!python ../../scripts/plot_hist.py  1000 10 \
    Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.005,0.005-.k25.hist \
    Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.k25.hist
Image(filename="Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.005,0.005-.k25.Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.k25.png")
# No peaks?!









    Out[25]:



In [26]:

    
%%bash
# Is sequencing error obscuring the peak signal?
# Set error rate to 0.00
../../scripts/sim_subgenomes.sh Bdistachyon_MIPS_1_2.fa.gz \
                                                5000000 100 \
                                                0.00 \
                                                2 \
                                                0.5 0.5 \
                                                0.05 0.005

# Same parameters as above
../../scripts/auto_hash_and_count.sh Bdistachyon_MIPS_1_2.-0.00-0.5,0.5-0.05,0.005-.il.fq.gz \
                                        4 2e9 \
                                       25









    



Output file: Bdistachyon_MIPS_1_2.-0.00-0.5,0.5-0.05,0.005-.il.fq.gz
Simulating 2500000 read pairs at mutation rate 0.05 for subgenome 0
Simulating 2500000 read pairs at mutation rate 0.005 for subgenome 1






    



[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 83 sequences, total length: 271923306
[wgsim_core] calculating the total length of the reference sequence...
[wgsim_core] 83 sequences, total length: 271923306



In [27]:

    
!python ../../scripts/plot_hist.py 1000 10 \
    Bdistachyon_MIPS_1_2.-0.00-0.5,0.5-0.05,0.005-.k25.hist \
    Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.k25.hist

Image(filename="Bdistachyon_MIPS_1_2.-0.00-0.5,0.5-0.05,0.005-.k25.Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.k25.png")
# Fewer kmers with high abundance due to higher error rate creating more unique kmers.
# Still no signal.









    Out[27]:



In [47]:

    
# Calculate expected cov.
lm = !zcat Bdistachyon_MIPS_1_2.fa.gz | grep -v '^>' | wc -lm
genome_len = int(lm[0].split()[1]) - int(lm[0].split()[0])

print genome_len
print get_estimated_coverage(num_read_pairs=5e6, genome_len=genome_len)









    



271923306
(3.6775075101506745, 2.7949057077145123)



In [29]:

    
# Lets zoom in around that region
!python ../../scripts/plot_hist.py  10 10 \
    Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.005,0.005-.k25.hist \
    Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.k25.hist
Image(filename="Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.005,0.005-.k25.Bdistachyon_MIPS_1_2.-0.02-0.5,0.5-0.05,0.005-.k25.png")









    Out[29]:

Considerations

What is the effect of varying the subgenome mutation rate?
Is this a correct way of simulating polyploidy?

Results from real demultiplexed WGS data



In [52]:

    
%cd ../test_data









    



/home/ben/brachy_khmer/doc/test_data



In [31]:

    
# Bdis05 WGS reads
# 9.7 M read pairs
print get_estimated_coverage(num_read_pairs=9.7e6, genome_len=genome_len)

# We also hashed a hexaploid whole genome data set from the Vogel lab. 
# Approx. 6M read pairs.

# Possibly Brachypodium hybridum ABR113?
# http://aob.oxfordjournals.org/content/early/2012/01/01/aob.mcr294.full
# Allotetraploid with 2n = 30 chromosomes (x = 10 + x = 5), derived from the cross between B. stacei and B. distachyon. 
# Genome size of 1.265 pg/2C. => 978 Mb
print get_estimated_coverage(num_read_pairs=6e6, genome_len=1.265*978e6/2)
# Contains both relatively small B. stacei-type and large B. distachyon-type chromosomes. 
# The name hybridum refers to its hybrid allotetraploid nature. 

!python ../../scripts/plot_hist.py 4000 10 \
    ../test_data/bdis05/Bdis05-1_qcd.k25.hist \
    ../test_data/vogel_hexaploids/7424.2.73049.ACAGTG.k25.hist

Image(filename="Bdis05-1_qcd.k25.7424.2.73049.ACAGTG.k25.png")

# Interesting, but with such low coverage, peaks at high abundance are likely to be organelle DNA/simple repeats.









    



(7.134364569692309, 5.422117072966155)
(1.9399112490603554, 1.47433254928587)






    Out[31]:

Results from real GBS data

Compared to WGS data, there are additional considerations with GBS data:

Inherently lower coverage, even compared to multiplexed samples.
Reads are clustered around cut sites, not evenly distributed across the genome.
- Can result in repetitive sequences?
- Excludes organelle DNA?
Only a subset of possible kmers is available.
PCR introduces error.
- Overall, the noise peak tends to be prominent.

But:

GBS sample read sets are small.
- Samples hash quickly
- Samples can be hashed using relatively low and constant memory usage

We used two Brachypodium GBS datasets: bd1 (mostly diploid) and bd6 (mostly hexaploid). bd1 is a recombinant inbred line; bd6 is not, but the species is selfing, so low heterozygosity is expected in wild populations anyway.

approx. 4M read pairs: bd_1E4, bd_6D1
approx. 2M read pairs: bd_1E6, bd_6H11
approx. 1M read pairs: bd_1F9, bd_6F11



In [32]:

    
%matplotlib inline  
import matplotlib.pyplot as plt
# Can we improve kmer coverage by hashing at a lower k?
fig, ax = plt.subplots()
ax.set_xlabel("k")
ax.set_ylabel("C_k")
ax.plot(range(10, 30), 
         [get_estimated_coverage(num_read_pairs=4e6, genome_len=genome_len, k=k)[1] 
         for k in range(10, 30)])
plt.savefig("improve_ck.png", bbox_inches='tight')
# Not really, the range is small.



In [53]:

    
# We just hashed at k=12 because memory usage is much better.
# Only 400 Mb per sample.
!grep -E 'memory|fp|ms' ../test_data/bd1_gbs/bd_1E4.k12.log

!python ../../scripts/plot_hist.py 2500 10 \
    ../test_data/bd1_gbs/bd_1E4.k12.hist \
    ../test_data/bd6_gbs/bd_6D1.k12.hist
Image(filename="bd_1E4.k12.bd_6D1.k12.png")

print get_estimated_coverage(num_read_pairs=4e6, genome_len=genome_len)
# The hexaploid sample has a much lower count of unique kmers that have high abundance









    



Estimated memory usage is 4e+08 bytes (n_tables x min_tablesize)
fp rate estimated to be 0.000
(2.9420060081205395, 2.23592456617161)

Conclusions so far

No peak signal due to low coverage. All the signal is squashed into the left.
We can't improve coverage enough by choosing a smaller k.
The distribution is shifted towards the low abundance end in polyploids, possibly due to more heterogeneity in composition.
There are interesting features at the high abundance end.
- In WGS datasets, we need to watch for organelle DNA.

Analysing high abundance 12-mers

One advantage of a smaller k: there are far fewer 12-mers than 25-mers. We can iterate over all kmers and select the most abundant kmers in one of the samples, then construct a counts matrix for that set of kmers over all samples, normalised by library size.



In [34]:

    
# File with all the samples we want to analyse:
#     Path to sample .htable
#     number of read pairs in the sample to normalise counts by.

# Another option would be to normalise by the total number of kmers instead,
# These numbers are reported in the hashing .log files:
# e.g. Total number of k-mers: 5508702

# The set of most abundant kmers is derived from the first sample in the list.
# Unfortunately this .htable data is not on Dropbox (~9.7 Gb)
!head bd_clustering/k12_tables.txt









    



/home/ben/data/bd1_bd6_htables/bd_1B12.k12.htable	2305524
/home/ben/data/bd1_bd6_htables/bd_1C6.k12.htable	3072628
/home/ben/data/bd1_bd6_htables/bd_1C9.k12.htable	3564123
/home/ben/data/bd1_bd6_htables/bd_1D11.k12.htable	1034453
/home/ben/data/bd1_bd6_htables/bd_1E4.k12.htable	3974062
/home/ben/data/bd1_bd6_htables/bd_1E6.k12.htable	2011722
/home/ben/data/bd1_bd6_htables/bd_1F1.k12.htable	3100715
/home/ben/data/bd1_bd6_htables/bd_1F9.k12.htable	1003988
/home/ben/data/bd1_bd6_htables/bd_1G12.k12.htable	1150067
/home/ben/data/bd1_bd6_htables/bd_1H10.k12.htable	2219785



In [35]:

    
%%bash

cd ../test_data/bd_clustering/

# Arguments:
#     List of sample kmer hash tables .htable
#     k=12
#     After skipping the 200 most abundant kmers of first sample, analyse the next 100.

# In order for the matrix to be informative, a mean sample presence that is not too close to 100% is required.
python ../../../scripts/repeat_pres_abs.py k12_tables.txt 12 200 100
# In this case: "Mean sample presence proportion is 83.33%"









    



Loading sample 1 of 24 (2305524 read pairs): bd_1B12.k12
Counting first sample: bd_1B12.k12
5.96% complete.
11.92% complete.
17.88% complete.
23.84% complete.
29.80% complete.
35.76% complete.
41.72% complete.
47.68% complete.
53.64% complete.
59.60% complete.
65.57% complete.
71.53% complete.
77.49% complete.
83.45% complete.
89.41% complete.
95.37% complete.
Heapifying...
Discarding top 200 most abundant kmers.
Analysing next 100 most abundant kmers.
	Sample has 6.91 total normalised counts.
	Sample has 100.00% presence.
Loading sample 2 of 24 (3072628 read pairs): bd_1C6.k12
	Sample has 4.89 total normalised counts.
	Sample has 98.00% presence.
Loading sample 3 of 24 (3564123 read pairs): bd_1C9.k12
	Sample has 4.54 total normalised counts.
	Sample has 96.00% presence.
Loading sample 4 of 24 (1034453 read pairs): bd_1D11.k12
	Sample has 7.61 total normalised counts.
	Sample has 70.00% presence.
Loading sample 5 of 24 (3974062 read pairs): bd_1E4.k12
	Sample has 4.14 total normalised counts.
	Sample has 86.00% presence.
Loading sample 6 of 24 (2011722 read pairs): bd_1E6.k12
	Sample has 6.17 total normalised counts.
	Sample has 96.00% presence.
Loading sample 7 of 24 (3100715 read pairs): bd_1F1.k12
	Sample has 4.26 total normalised counts.
	Sample has 98.00% presence.
Loading sample 8 of 24 (1003988 read pairs): bd_1F9.k12
	Sample has 9.50 total normalised counts.
	Sample has 86.00% presence.
Loading sample 9 of 24 (1150067 read pairs): bd_1G12.k12
	Sample has 6.62 total normalised counts.
	Sample has 84.00% presence.
Loading sample 10 of 24 (2219785 read pairs): bd_1H10.k12
	Sample has 6.03 total normalised counts.
	Sample has 94.00% presence.
Loading sample 11 of 24 (746953 read pairs): bd_1H2.k12
	Sample has 10.25 total normalised counts.
	Sample has 86.00% presence.
Loading sample 12 of 24 (2015268 read pairs): bd_1H4.k12
	Sample has 5.51 total normalised counts.
	Sample has 84.00% presence.
Loading sample 13 of 24 (2209712 read pairs): bd_6A9.k12
	Sample has 2.73 total normalised counts.
	Sample has 96.00% presence.
Loading sample 14 of 24 (731673 read pairs): bd_6B5.k12
	Sample has 4.53 total normalised counts.
	Sample has 56.00% presence.
Loading sample 15 of 24 (1150553 read pairs): bd_6C3.k12
	Sample has 2.97 total normalised counts.
	Sample has 56.00% presence.
Loading sample 16 of 24 (1045482 read pairs): bd_6C5.k12
	Sample has 3.55 total normalised counts.
	Sample has 58.00% presence.
Loading sample 17 of 24 (2013071 read pairs): bd_6C7.k12
	Sample has 2.27 total normalised counts.
	Sample has 70.00% presence.
Loading sample 18 of 24 (3843959 read pairs): bd_6D1.k12
	Sample has 1.62 total normalised counts.
	Sample has 92.00% presence.
Loading sample 19 of 24 (3540100 read pairs): bd_6D4.k12
	Sample has 1.90 total normalised counts.
	Sample has 96.00% presence.
Loading sample 20 of 24 (3117896 read pairs): bd_6E5.k12
	Sample has 1.96 total normalised counts.
	Sample has 96.00% presence.
Loading sample 21 of 24 (1016378 read pairs): bd_6F11.k12
	Sample has 4.19 total normalised counts.
	Sample has 44.00% presence.
Loading sample 22 of 24 (2307156 read pairs): bd_6F4.k12
	Sample has 2.10 total normalised counts.
	Sample has 88.00% presence.
Loading sample 23 of 24 (2009189 read pairs): bd_6H11.k12
	Sample has 2.58 total normalised counts.
	Sample has 72.00% presence.
Loading sample 24 of 24 (3065407 read pairs): bd_6H6.k12
	Sample has 1.76 total normalised counts.
	Sample has 98.00% presence.
Mean sample presence proportion is 83.33%
Output matrix saved to k12_tables.skip200.N100.pres_abs.csv

Clustering GBS samples by repeat profile



In [36]:

    
# Cluster samples (default dist measure="euclidean", hclust method="complete")
!Rscript ../../scripts/cluster_samples.R \
    ../test_data/bd_clustering/k12_tables.skip200.N100.pres_abs.csv \
    k12_tables.skip200.N100_cluster.png
    
Image(filename="k12_tables.skip200.N100_cluster.png")









    Out[36]:



In [41]:

    
# Change back to starting dir, ready to re-run
%cd ../









    



/home/ben/brachy_khmer/doc

Considerations

Find out what the samples are biologically (esp. ploidy, replicates)
Different clustering methods
Lower the mean presence % per sample. Experiment with:
- Different set of kmers (i.e. the 2nd most abundant 100)
- Different value of k, although runtime increases exponentially

Source

Work in progress at https://github.com/digitase/brachy_khmer

Acknowledgements

http://khmer.readthedocs.org/en/v1.1/scripts.html
Thanks to the Borevitz Lab, ANU esp. Kevin Murray