This notebook contains a minimal tutorial on how to use open source Python libraries, in particular numpy, matplotlib, sklearn, pysam and pysamstats, to call copy number variations in a biological sample, given a BAM file with aligned sequence reads.
In [1]:
from __future__ import division, print_function
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
import sklearn
import pysam
import pysamstats
The example dataset used in this tutorial is an Illumina sequencing run of genomic DNA from the Plasmodium falciparum clone Dd2.
In [2]:
fasta_filename = 'data/Pf3D7_v3.fa' # path to the reference genome FASTA file
bam_filename = 'data/Dd2.bam' # path to the BAM file containing aligned sequence reads
First, we need to compute the coverage of aligned sequence reads across a chromosome of interest. We will compute coverage in non-overlapping windows.
In [3]:
fasta = pysam.Fastafile(fasta_filename)
bam = pysam.Samfile(bam_filename)
chrom = 'Pf3D7_05_v3' # chromosome 5
window_size = 300 # use 300bp non-overlapping windows
coverage = pysamstats.load_coverage_ext_binned(bam, fasta, chrom=chrom, pad=True, window_size=window_size)
coverage
Out[3]:
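The load functions in pysamstats return a NumPy record array. If you want to check which fields are available before plotting, you can inspect the dtype; the fields used below are pos, gc and reads_all:
print(coverage.dtype.names)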
Let's plot the depth of coverage along the chromosome, to eyeball the data. This will help us get a sense of what the quality of the raw data is like and whether it might support accurate CNV calling, as well as whether there is any obvious signal of copy number variation in the sample.
In [4]:
fig = plt.figure(figsize=(12, 3))
ax = fig.add_subplot(111)
ax.plot(coverage.pos, coverage.reads_all, color='k', marker='o', linestyle=' ', alpha=.3)
ax.set_xlabel('%s position (bp)' % chrom)
ax.set_ylabel('no. reads')
ax.set_xlim(0, np.max(coverage.pos))
ax.set_title('raw coverage');
From this plot you can see an obvious amplification. This is the well-known 3-fold amplification of a region spanning a gene called multi-drug resistance 1 (MDR1). But you can also see that there is a fair amount of variance in the number of aligned reads from one window to the next, which may make it difficult to fit an HMM to the raw data. We'll deal with that in the section below.
You can also see that the signal gets messy at the very ends of the chromosome. These are hypervariable sub-telomeric regions, which we will ignore here.
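If you wanted to exclude those regions explicitly rather than just ignore them, one simple option is to drop windows near the chromosome ends. A minimal sketch, where the 60kb buffer and the coverage_core name are purely illustrative:
end_buffer = 60000  # illustrative cutoff; sub-telomeric region sizes vary
chrom_length = fasta.lengths[fasta.references.index(chrom)]
core = (coverage.pos > end_buffer) & (coverage.pos < chrom_length - end_buffer)
coverage_core = coverage[core]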
The P. falciparum genome is highly AT-rich within non-coding regions, and we happen to know that AT-rich regions tend to be associated with lower sequencing yield on some sequencing platforms (e.g., Illumina). Thus we suspect that variation in the AT content of the genome may cause some systematic bias in the depth of coverage. To test this suspicion, let's plot depth of coverage against the local GC content of the window.
In [5]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hexbin(coverage.gc, coverage.reads_all, cmap='Greys', gridsize=20)
ax.set_xlabel('%GC')
ax.set_ylabel('no. reads')
ax.set_title('coverage bias');
In this sample there is a clear bias towards lower coverage in windows below 20% GC content. There are different methods for dealing with this bias. For this tutorial we will take the simplest approach, which is to exclude all windows where GC content is below 20%; an alternative correction is sketched after the next cell.
In [6]:
coverage_filtered = coverage[coverage.gc >= 20]
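A common alternative to filtering is to correct for the bias directly, e.g. by dividing each window's coverage by the median coverage of windows with similar GC content. A rough sketch of that idea follows; it is not used in the rest of this tutorial, and real code should guard against empty or zero-median bins:
gc_bins = coverage.gc // 5  # stratify windows into 5% GC bins
gc_medians = {b: np.median(coverage.reads_all[gc_bins == b]) for b in np.unique(gc_bins)}
coverage_gc_corrected = np.array([r / gc_medians[b] for r, b in zip(coverage.reads_all, gc_bins)])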
Let's plot the filtered coverage data, to see whether the signal is improved.
In [7]:
fig = plt.figure(figsize=(12, 3))
ax = fig.add_subplot(111)
ax.plot(coverage_filtered.pos, coverage_filtered.reads_all, color='k', marker='o', linestyle=' ', alpha=.3)
ax.set_xlabel('%s position (bp)' % chrom)
ax.set_ylabel('no. reads')
ax.set_xlim(0, np.max(coverage.pos))
ax.set_title('filtered coverage');
It's not perfect, but it's a lot cleaner, and good enough for fitting the HMM.
As a final step in preparing the data, let's normalise the coverage values. Here we are dealing with a haploid organism, so we will normalise to 1. If you're dealing with a diploid, normalise to 2 instead.
Here we will normalise by the median depth over the whole chromosome. Ideally you would normalise by the median depth over some large region of the genome where you are confident that the sample has copy number 1 (i.e., where large CNVs are highly unlikely); a sketch of that approach follows the next cell.
In [8]:
depth_normed = coverage_filtered.reads_all / np.median(coverage_filtered.reads_all)
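For example, if you happened to know a region of the chromosome that was reliably single copy, the normalisation might look like the following sketch, where the region bounds and the depth_normed_alt name are purely illustrative:
single_copy = (coverage_filtered.pos >= 100000) & (coverage_filtered.pos < 800000)  # illustrative bounds
depth_normed_alt = coverage_filtered.reads_all / np.median(coverage_filtered.reads_all[single_copy])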
In [9]:
fig = plt.figure(figsize=(12, 3))
ax = fig.add_subplot(111)
ax.plot(coverage_filtered.pos, depth_normed, color='k', marker='o', linestyle=' ', alpha=.3)
ax.set_xlabel('%s position (bp)' % chrom)
ax.set_ylabel('copy number')
ax.set_yticks(range(4))
ax.grid(axis='y')
ax.set_xlim(0, np.max(coverage.pos))
ax.set_title('normalised coverage');
In [10]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.hist(depth_normed, color='w', bins=np.arange(0, 3, .1))
ax.set_yticks([])
ax.set_ylabel('frequency')
ax.set_xlabel('copy number')
ax.set_title('normalised coverage');
We are going to use a Gaussian HMM to model copy number variation. You can see by eye from the histogram above that the main peak corresponding to copy number 1 does not have a perfectly Gaussian distribution. However, it is close enough for our purposes here.
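If you want to check how close the fit is by eye, you could overlay a Gaussian density on the histogram. This is just a visual check, not part of the pipeline; scipy is assumed to be available, and the variance matches the 0.1 we will pass to the HMM below:
from scipy import stats
x = np.linspace(0, 3, 100)
fig = plt.figure()
ax = fig.add_subplot(111)
# normed=True is the older matplotlib spelling; use density=True on matplotlib >= 2.1
ax.hist(depth_normed, color='w', bins=np.arange(0, 3, .1), normed=True)
ax.plot(x, stats.norm.pdf(x, loc=1, scale=np.sqrt(.1)), color='r')
ax.set_xlabel('copy number')
ax.set_title('Gaussian fit check');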
Now that the data are prepared, let's use an HMM to predict copy number as the hidden state, given normalised coverage as the observed data.
In [11]:
from sklearn.hmm import GaussianHMM  # note: later removed from scikit-learn; now lives in the separate hmmlearn package

def fit_hmm(depth_normed,            # normalised coverage array
            transition_probability,  # probability of state transition
            variance,                # variance per copy
            variance_fixed,          # variance for the zero copy number state
            min_copy_number=0,       # minimum copy number to consider in the model
            max_copy_number=5,       # maximum copy number to consider in the model (exclusive)
            n_iter=0,                # number of iterations to perform when fitting the model
            params='st',             # parameters that can be changed through fitting
            init_params='',          # parameters that are initialised from the data
            ):

    # convenience variable
    n_states = max_copy_number - min_copy_number

    # construct the transition matrix
    transmat = np.zeros((n_states, n_states))
    transmat[:] = transition_probability
    transmat[np.diag_indices(n_states)] = 1 - ((n_states - 1) * transition_probability)

    # construct the means and covariances, one per copy number state
    means = np.array([[n] for n in range(min_copy_number, max_copy_number)])
    covars = np.array([[variance*n + variance_fixed] for n in range(min_copy_number, max_copy_number)])

    # set up the HMM
    model = GaussianHMM(n_states,
                        covariance_type='diag',
                        n_iter=n_iter,
                        transmat=transmat,
                        params=params,
                        init_params=init_params)
    model.means_ = means
    model.covars_ = covars

    # fit the HMM
    obs = np.column_stack([depth_normed])
    model.fit([obs])

    # predict the hidden states
    h = model.predict(obs)
    return h
When dealing with multiple samples, where the variance in the coverage data may differ from one sequencing run to the next, you might want to calibrate each sample independently to arrive at an appropriate value for the variance parameter. Here we will use a fixed value of 0.1. Try different values for the variance parameter to get a feeling for its effect. If you have a sample where you know the correct copy number states by some other means, that can help you find a variance value giving the best trade-off between sensitivity and specificity. Similarly, you could vary the other parameters and observe their effects, as in the sketch below.
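For example, a quick way to explore this is to refit with a few candidate variance values and compare how the predicted states are distributed (the values here are chosen purely for illustration):
for variance in [.05, .1, .2]:
    cn = fit_hmm(depth_normed,
                 transition_probability=.01,
                 variance=variance,
                 variance_fixed=.001)
    # count how many windows are assigned to each copy number state
    print(variance, np.bincount(cn))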
In [12]:
copy_number = fit_hmm(depth_normed,
                      transition_probability=.01,
                      variance=.1,
                      variance_fixed=.001)
Now let's plot the predicted copy number overlaid on the coverage data.
In [13]:
fig = plt.figure(figsize=(12, 3))
ax = fig.add_subplot(111)
for i, color in zip(range(5), 'kbgrcmy'):
    ax.plot(coverage_filtered.pos[copy_number == i],
            depth_normed[copy_number == i],
            color=color, marker='o', linestyle=' ', alpha=.3)
ax.plot(coverage_filtered.pos, copy_number, linestyle='-', linewidth=2, color='k')
ax.set_xlabel('%s position (bp)' % chrom)
ax.set_ylabel('copy number')
ax.set_yticks(range(4))
ax.grid(axis='y')
ax.set_xlim(0, np.max(coverage.pos))
ax.set_title('predicted copy number');
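As a final, optional step, it can be convenient to summarise the per-window predictions as discrete CNV calls. Below is a minimal sketch, not part of the original pipeline, that collapses runs of constant copy number into intervals; call_cnvs is a hypothetical helper defined here for illustration, and the interval bounds are approximate because the GC filter leaves gaps between windows:
def call_cnvs(pos, copy_number, window_size):
    # collapse runs of constant predicted copy number into (start, stop, cn) tuples
    calls = []
    run_start = 0
    for i in range(1, len(copy_number)):
        if copy_number[i] != copy_number[run_start]:
            calls.append((pos[run_start], pos[i-1] + window_size, copy_number[run_start]))
            run_start = i
    calls.append((pos[run_start], pos[-1] + window_size, copy_number[run_start]))
    return calls

for start, stop, cn in call_cnvs(coverage_filtered.pos, copy_number, window_size):
    if cn != 1:  # report only intervals deviating from single copy
        print('%s:%d-%d copy number %d' % (chrom, start, stop, cn))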