A tutorial on calling copy number variations (CNVs) from high-throughput sequence data using a hidden Markov model (HMM)

This notebook is a minimal tutorial on using the open-source Python libraries numpy, matplotlib, sklearn, pysam and pysamstats to call copy number variations in a biological sample, given a BAM file of aligned sequence reads.


In [1]:
from __future__ import division, print_function
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import sklearn
import pysam
import pysamstats

The example dataset used in this tutorial is an Illumina sequencing run of genomic DNA from the Plasmodium falciparum clone Dd2.


In [2]:
fasta_filename = 'data/Pf3D7_v3.fa'  # path to the reference genome FASTA file
bam_filename = 'data/Dd2.bam'  # path to the BAM file containing aligned sequence reads

Compute and inspect coverage

First, we need to compute the coverage of aligned sequence reads across a chromosome of interest. We will compute coverage in non-overlapping windows.


In [3]:
fasta = pysam.Fastafile(fasta_filename)
bam = pysam.Samfile(bam_filename)
chrom = 'Pf3D7_05_v3'  # chromosome 5
window_size = 300  # use 300bp non-overlapping windows
coverage = pysamstats.load_coverage_ext_binned(bam, fasta, chrom=chrom, pad=True, window_size=window_size)
coverage


Out[3]:
rec.array([('Pf3D7_05_v3', 150, 47, 48, 7, 17, 21, 1, 2, 0, 2),
       ('Pf3D7_05_v3', 450, 39, 114, 40, 26, 48, 0, 0, 1, 12),
       ('Pf3D7_05_v3', 750, 28, 247, 102, 36, 108, 0, 1, 13, 18), ...,
       ('Pf3D7_05_v3', 1342950, 34, 81, 56, 11, 12, 0, 0, 22, 3),
       ('Pf3D7_05_v3', 1343250, 46, 106, 86, 9, 11, 0, 0, 30, 3),
       ('Pf3D7_05_v3', 1343550, 46, 91, 42, 16, 32, 1, 0, 3, 10)], 
      dtype=[('chrom', 'S12'), ('pos', '<i4'), ('gc', 'u1'), ('reads_all', '<i4'), ('reads_pp', '<i4'), ('reads_mate_unmapped', '<i4'), ('reads_mate_other_chr', '<i4'), ('reads_mate_same_strand', '<i4'), ('reads_faceaway', '<i4'), ('reads_softclipped', '<i4'), ('reads_duplicate', '<i4')])

Let's plot the depth of coverage along the chromosome, to eyeball the data. This will give us a sense of the quality of the raw data, whether it might support accurate CNV calling, and whether there is any obvious signal of copy number variation in the sample.


In [4]:
fig = plt.figure(figsize=(12, 3))
ax = fig.add_subplot(111)
ax.plot(coverage.pos, coverage.reads_all, color='k', marker='o', linestyle=' ', alpha=.3)
ax.set_xlabel('%s position (bp)' % chrom)
ax.set_ylabel('no. reads')
ax.set_xlim(0, np.max(coverage.pos))
ax.set_title('raw coverage');


From this plot you can see an obvious amplification. This is the well-known 3-fold amplification of a region spanning the multi-drug resistance 1 gene (MDR1). But you can also see that there is a fair amount of variance in the number of aligned reads, which may make it difficult to fit an HMM to the raw data. We'll deal with that in the section below.

You can also see that the signal gets messy at the very ends of the chromosome. These are hypervariable sub-telomeric regions, which we will ignore here.
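If you did want to mask them out explicitly, a minimal sketch might look like this. The 50 kb cutoff is an arbitrary illustrative choice, and coverage_core is not used in the rest of this tutorial.

telomere_pad = 50000  # arbitrary distance from each chromosome end to exclude
chrom_length = len(fasta.fetch(chrom))
core = ((coverage.pos > telomere_pad) &
        (coverage.pos < chrom_length - telomere_pad))
coverage_core = coverage[core]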

Coverage bias

The P. falciparum genome is highly AT-rich within non-coding regions, and we happen to know that AT-rich regions tend to be associated with lower sequencing yield on some sequencing platforms (e.g., Illumina). Thus we suspect that variation in the AT content of the genome may cause some systematic bias in the depth of coverage. To test this suspicion, let's plot depth of coverage against the local GC content of the window.


In [5]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hexbin(coverage.gc, coverage.reads_all, cmap='Greys', gridsize=20)
ax.set_xlabel('%GC')
ax.set_ylabel('no. reads')
ax.set_title('coverage bias');


In this sample there is a bias towards lower coverage in windows below 20% GC. There are various methods for dealing with this bias. For this tutorial, we will take the simplest approach, which is to exclude all windows where GC content is less than 20%.


In [6]:
coverage_filtered = coverage[coverage.gc >= 20]
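As an aside, rather than discarding low-GC windows you could instead try to correct for the bias, e.g. by scaling each window's read count by the median count of windows with similar GC. A rough sketch follows; the 5% GC bin width is an arbitrary illustrative choice, and bins containing few windows will give noisy medians.

gc_bins = coverage.gc // 5  # assign each window to a 5% GC bin
corrected = np.empty(len(coverage), dtype=float)
for b in np.unique(gc_bins):
    sel = gc_bins == b
    # scale each window by the median coverage of windows in its GC bin
    corrected[sel] = coverage.reads_all[sel] / np.median(coverage.reads_all[sel])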

Let's plot the filtered coverage data, to see whether the signal is improved.


In [7]:
fig = plt.figure(figsize=(12, 3))
ax = fig.add_subplot(111)
ax.plot(coverage_filtered.pos, coverage_filtered.reads_all, color='k', marker='o', linestyle=' ', alpha=.3)
ax.set_xlabel('%s position (bp)' % chrom)
ax.set_ylabel('no. reads')
ax.set_xlim(0, np.max(coverage.pos))
ax.set_title('filtered coverage');


It's not perfect, but it's a lot cleaner, and good enough for fitting the HMM.

Coverage normalisation

As a final step in preparing the data, let's normalise the coverage values. Here we are dealing with a haploid organism, so we will normalise to 1. If you're dealing with a diploid, normalise to 2 instead.

Here we will normalise by the median depth over the whole chromosome. Ideally you would normalise by the median depth over some large region of the genome where you are confident that the sample has a copy number of 1 (i.e., where large CNVs are highly unlikely).


In [8]:
depth_normed = coverage_filtered.reads_all / np.median(coverage_filtered.reads_all)
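If instead you had a region you were confident was single copy, you could normalise by its median depth, as suggested above. A sketch, with purely illustrative coordinates:

# the coordinates below are purely illustrative, not a recommendation
single_copy = ((coverage_filtered.pos > 100000) &
               (coverage_filtered.pos < 800000))
baseline = np.median(coverage_filtered.reads_all[single_copy])
depth_normed_alt = coverage_filtered.reads_all / baseline
# for a diploid sample, multiply by 2 so single-copy regions sit at 2:
# depth_normed_alt = 2 * coverage_filtered.reads_all / baseline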

In [9]:
fig = plt.figure(figsize=(12, 3))
ax = fig.add_subplot(111)
ax.plot(coverage_filtered.pos, depth_normed, color='k', marker='o', linestyle=' ', alpha=.3)
ax.set_xlabel('%s position (bp)' % chrom)
ax.set_ylabel('copy number')
ax.set_yticks(range(4))
ax.grid(axis='y')
ax.set_xlim(0, np.max(coverage.pos))
ax.set_title('normalised coverage');



In [10]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(depth_normed, color='w', bins=np.arange(0, 3, .1))
ax.set_yticks([])
ax.set_ylabel('frequency')
ax.set_xlabel('copy number')
ax.set_title('normalised coverage');


We are going to use a Gaussian HMM to model copy number variation. You can see by eye from the histogram above that the main peak corresponding to copy number 1 does not have a perfectly Gaussian distribution. However, it is close enough for our purposes here.
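If you want a rough visual check of that assumption, you could overlay a Gaussian fitted to the windows around the main peak. A minimal sketch using scipy.stats; the 0.5-1.5 selection window is an arbitrary illustrative choice.

from scipy import stats

peak = depth_normed[(depth_normed > .5) & (depth_normed < 1.5)]  # windows near copy number 1
mu, sigma = np.mean(peak), np.std(peak)
x = np.arange(0, 3, .01)
fig = plt.figure()
ax = fig.add_subplot(111)
# 'normed' is the older matplotlib keyword; newer versions use density=True
ax.hist(peak, bins=np.arange(0, 3, .1), normed=True, color='w')
ax.plot(x, stats.norm.pdf(x, mu, sigma), color='r')
ax.set_xlabel('copy number')
ax.set_title('Gaussian fit to the copy number 1 peak');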

Copy number prediction

Now that the data are prepared, let's use an HMM to predict copy number as the hidden state, given normalised coverage as the observed data.


In [11]:
# note: the HMM code has since been removed from scikit-learn; in recent
# versions it lives in the separate hmmlearn package, with a slightly
# different API
from sklearn.hmm import GaussianHMM


def fit_hmm(depth_normed,  # normalised coverage array
            transition_probability,  # probability of a transition to each other state
            variance,  # variance in normalised coverage per copy
            variance_fixed,  # baseline variance, i.e., the variance of the zero copy number state
            min_copy_number=0,  # minimum copy number to include in the model
            max_copy_number=5,  # exclusive upper bound: states run from min_copy_number to max_copy_number - 1
            n_iter=0,  # number of iterations to perform when fitting the model
            params='st',  # parameters that can be changed through fitting (s=startprob, t=transmat)
            init_params='',  # parameters to initialise from the data (none: means and covariances are set explicitly below)
           ):
    
    # convenience variable
    n_states = max_copy_number - min_copy_number
    
    # construct the transition matrix
    transmat = np.zeros((n_states, n_states))
    transmat[:] = transition_probability
    transmat[np.diag_indices(n_states)] = 1-((n_states-1)*transition_probability)

    # construct means and covariance
    means = np.array([[n] for n in range(min_copy_number, max_copy_number)])
    covars = np.array([[variance*n + variance_fixed] for n in range(min_copy_number, max_copy_number)])

    # setup HMM 
    model = GaussianHMM(n_states, 
                        covariance_type='diag', 
                        n_iter=n_iter, 
                        transmat=transmat, 
                        params=params,
                        init_params=init_params)
    model.means_ = means
    model.covars_ = covars
    
    # fit HMM
    obs = np.column_stack([depth_normed])
    model.fit([obs])
    
    # predict hidden states
    h = model.predict(obs)
    
    return h

When dealing with multiple samples, where the variance in the coverage data may differ from one sequencing run to the next, you might want to calibrate each sample independently to arrive at an appropriate value for the variance parameter. Here we will use a fixed value of 0.1. Try different values for the variance parameter to get a feeling for its effect. If you have a sample where the correct copy number state is known by some other method, that can help you find a variance value giving the best trade-off between sensitivity and specificity. Similarly, you could vary the other parameters and observe their effect.
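For example, you could eyeball the effect of a few candidate variance values side by side. A minimal sketch, with arbitrary candidate values:

fig, axs = plt.subplots(3, 1, figsize=(12, 6), sharex=True)
for ax, v in zip(axs, [.05, .1, .2]):  # arbitrary candidate variance values
    cn = fit_hmm(depth_normed, transition_probability=.01,
                 variance=v, variance_fixed=.001)
    ax.plot(coverage_filtered.pos, cn, linestyle='-', color='k')
    ax.set_ylabel('variance=%s' % v)
axs[-1].set_xlabel('%s position (bp)' % chrom);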


In [12]:
copy_number = fit_hmm(depth_normed, 
                      transition_probability=.01,
                      variance=.1,
                      variance_fixed=.001)

Now let's plot the predicted copy number overlaid on the coverage data.


In [13]:
fig = plt.figure(figsize=(12, 3))
ax = fig.add_subplot(111)
for i, color in zip(range(5), 'kbgrcmy'):
    ax.plot(coverage_filtered.pos[copy_number == i], 
            depth_normed[copy_number == i], 
            color=color, marker='o', linestyle=' ', alpha=.3)
ax.plot(coverage_filtered.pos, copy_number, linestyle='-', linewidth=2, color='k')
ax.set_xlabel('%s position (bp)' % chrom)
ax.set_ylabel('copy number')
ax.set_yticks(range(4))
ax.grid(axis='y')
ax.set_xlim(0, np.max(coverage.pos))
ax.set_title('predicted copy number');
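Finally, if you want the calls as discrete intervals rather than per-window states, you can collapse runs of consecutive windows that share the same predicted copy number. A rough sketch; positions are window centres, and the GC filter leaves gaps between windows, so segment boundaries are only approximate.

# indices where the predicted state changes between adjacent windows
change = np.where(np.diff(copy_number) != 0)[0]
starts = np.concatenate([[0], change + 1])
ends = np.concatenate([change, [len(copy_number) - 1]])
for s, e in zip(starts, ends):
    print(coverage_filtered.pos[s], coverage_filtered.pos[e], copy_number[s])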