Doud and Bloom (2016) performed deep mutational scanning on the hemagglutinin (HA) from A/WSN/1933 (H1N1) influenza virus. In that original paper, the data was analyzed using the older dms_tools software package. This Jupyter notebook re-analyzes the data using the newer dms_tools2 software.
The goal of the experiment was to quantify the effect of all amino-acid mutations to HA on viral replication in cell culture. To do this, Doud and Bloom (2016) created three independent plasmid mutant libraries carrying nearly all codon mutations of HA. These plasmid mutant libraries are referred to as mutDNA-1, mutDNA-2, and mutDNA-3. They then used these libraries to generate three libraries of mutant viruses, which they passaged at low MOI to create a genotype-phenotype link. These three virus mutant libraries are referred to as mutvirus-1, mutvirus-2, and mutvirus-3. In addition, they performed a single replicate of the same procedure using wildtype (unmutated plasmid) to create non-mutagenized virus -- these plasmid and virus samples are referred to as wtDNA and wtvirus.
All of these plasmid and virus samples were deep sequenced using barcoded-subamplicon sequencing to obtain high accuracy. This notebook analyzes those data to determine the "preference" of each site in HA for each possible amino acid.
Import Python modules, print version of dms_tools2, define some general variables used throughout the notebook.
Some key variables defined below affect how the entire notebook runs:

- `ncpus` is the number of CPUs to use. It should not exceed the number you request if you submit the job via a queueing system such as slurm / sbatch. A value of -1 means use all available CPUs.
- `use_existing` specifies whether we use existing results if they exist, or generate everything anew. Note that the dms_tools2 programs just check for existing output and do not do true dependency tracking -- so if you have changed the inputs, switch this option from `yes` to `no` to re-run everything fresh.
In [1]:
import os
import warnings
warnings.simplefilter('ignore') # don't print warnings to avoid clutter in notebook
import pandas
from IPython.display import display, HTML, Markdown
import dms_tools2
import dms_tools2.plot
import dms_tools2.dssp
import dms_tools2.sra
import dms_tools2.utils # needed below to convert codon counts to amino-acid counts
from dms_tools2.ipython_utils import showPDF
print("Using dms_tools2 version {0}".format(dms_tools2.__version__))
# results will go in this directory
resultsdir = './results/'
if not os.path.isdir(resultsdir):
    os.mkdir(resultsdir)
# CPUs to use, should not exceed the number you request with slurm
ncpus = -1
# do we use existing results or generate everything new?
use_existing = 'yes'
We first define basic information about the samples needed for processing the barcoded-subamplicon deep sequencing data. Specifically, we create a pandas DataFrame that gives the name of each sample and its run accession on the Sequence Read Archive:
In [2]:
samples = pandas.DataFrame.from_records(
        [('mutDNA-1', 'SRR3113656'),
         ('mutDNA-2', 'SRR3113657'),
         ('mutDNA-3', 'SRR3113658'),
         ('mutvirus-1', 'SRR3113660'),
         ('mutvirus-2', 'SRR3113661'),
         ('mutvirus-3', 'SRR3113662'),
         ('wtDNA', 'SRR3113655'),
         ('wtvirus', 'SRR3113659')],
        columns=['name', 'run']
        )
Before doing any analysis, we need to obtain the FASTQ files with the deep sequencing data. Doud and Bloom (2016) uploaded these FASTQ files to the Sequence Read Archive, so we simply download those files.
We download these files using the fastqFromSRA function from the dms_tools2 Python API. Note that the call to this function below uses two external programs that are not part of dms_tools2, and which you therefore must install separately on the computer that you are using: `fastq-dump` from the SRA Toolkit, and the Aspera Connect `ascp` program together with an Aspera key file. Installing Aspera Connect and a key can be somewhat complex, so if you do not want to do this then just set `aspera=None` in the command below and `fastq-dump` will do the downloads on its own (albeit more slowly).
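If you set `aspera=None`, the download reduces conceptually to a loop over the SRA runs. Below is a minimal sketch of such a fallback; it is illustrative only (not part of dms_tools2) and assumes that `fastq-dump` from the SRA Toolkit is on your PATH and that `samples` is the DataFrame defined above:

```python
import os
import subprocess

def download_with_fastq_dump(samples, fastqdir):
    """Download gzipped FASTQ files for each SRA run using only fastq-dump."""
    os.makedirs(fastqdir, exist_ok=True)
    for run in samples['run']:
        # --split-files writes separate R1 / R2 files for paired-end reads
        subprocess.check_call(['fastq-dump', '--split-files', '--gzip',
                               '--outdir', fastqdir, run])
```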
In [3]:
fastqdir = os.path.join(resultsdir, 'FASTQ_files')
print("Downloading FASTQ files from the SRA...")
dms_tools2.sra.fastqFromSRA(
        samples=samples,
        fastq_dump='fastq-dump', # valid path to this program on the Hutch server
        fastqdir=fastqdir,
        aspera=(
            '/app/aspera-connect/3.5.1/bin/ascp', # valid path to ascp on Hutch server
            '/app/aspera-connect/3.5.1/etc/asperaweb_id_dsa.openssh' # Aspera key on Hutch server
            )
        )
print("Here are the names of the downloaded files now found in {0}".format(fastqdir))
display(HTML(samples.to_html(index=False)))
Doud and Bloom (2016) used barcoded-subamplicon sequencing to obtain high accuracy during the Illumina deep sequencing.
We therefore use the dms2_batch_bcsubamp program to analyze these data.
Running that program requires specifying a `--batchfile` that lists the samples, a wildtype `--refseq` to which we make alignments, and `--alignspecs` that tell us where the subamplicons should align.
The batch file that we specify is printed by the cell below.
The alignments are made to the wildtype WSN HA coding sequence, which is available at ./data/WSN-HA.fasta.
The alignment specs are specified based on the exact primers used for the barcoded-subamplicon sequencing by Doud and Bloom (2016).
The alignment specs need to be exactly correct for the subamplicons to align.
We also do some trimming of the reads using the `--R1trim` and `--R2trim` parameters.
Such trimming can be crucial for getting good results, since the read quality often drops off at the end of reads.
In the example below, we have trimmed all subamplicons equally, but in some cases you might want to trim each one differently by specifying a list of values for each subamplicon.
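Each alignspec lists, in order, the start and end of the subamplicon in the reference sequence and the positions at which R1 and R2 begin to align (see the dms2_batch_bcsubamp documentation for details). As a quick sanity check, we can verify that the subamplicons tile the gene contiguously; this check is just an illustrative sketch, not something dms_tools2 requires:

```python
def check_alignspecs(alignspecs):
    """Check that subamplicons tile the reference sequence without gaps."""
    specs = [tuple(map(int, spec.split(','))) for spec in alignspecs.split()]
    for (_, prev_end, _, _), (start, _, _, _) in zip(specs, specs[1:]):
        assert start == prev_end + 1, "subamplicons should tile contiguously"
    return "subamplicons cover sites 1 to {0}".format(specs[-1][1])

print(check_alignspecs('1,285,36,37 286,570,31,32 571,855,37,32 '
                       '856,1140,31,36 1141,1425,29,33 1426,1698,40,43'))
```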
In [4]:
# file containing wildtype WSN HA sequence
refseq = './data/WSN-HA.fasta'
# define subamplicon alignment specifications
alignspecs = ' '.join(['1,285,36,37',
                       '286,570,31,32',
                       '571,855,37,32',
                       '856,1140,31,36',
                       '1141,1425,29,33',
                       '1426,1698,40,43'])
# counts and alignments placed in this directory
countsdir = os.path.join(resultsdir, 'codoncounts')
if not os.path.isdir(countsdir):
    os.mkdir(countsdir)
# write sample information to a batch file for dms2_batch_bcsubamp
countsbatchfile = os.path.join(countsdir, 'batch.csv')
print("Here is the batch file that we write to CSV format to use as input:")
display(HTML(samples[['name', 'R1']].to_html(index=False)))
samples[['name', 'R1']].to_csv(countsbatchfile, index=False)
print('\nNow running dms2_batch_bcsubamp...')
log = !dms2_batch_bcsubamp \
--batchfile {countsbatchfile} \
--refseq {refseq} \
--alignspecs {alignspecs} \
--outdir {countsdir} \
--summaryprefix summary \
--R1trim 200 \
--R2trim 170 \
--fastqdir {fastqdir} \
--ncpus {ncpus} \
--use_existing {use_existing}
print("Completed dms2_batch_bcsubamp.")
These runs create files giving the counts of each codon at each site in each sample.
These files are in the directory specified by `--outdir` in the call to dms2_batch_bcsubamp above:
In [5]:
!ls {countsdir}/*_codoncounts.csv
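To get a feel for the format of these counts files, we can load one with pandas and compute the per-site depth ourselves. This is just an exploratory sketch; it assumes the standard dms2_bcsubamp output format with a `site` column, a `wildtype` column, and one column of counts per codon:

```python
counts = pandas.read_csv(os.path.join(countsdir, 'mutDNA-1_codoncounts.csv'))
codoncols = [c for c in counts.columns if c not in ['site', 'wildtype']]
counts['depth'] = counts[codoncols].sum(axis=1)  # total called codons per site
print(counts[['site', 'wildtype', 'depth']].head())
```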
Now we look at the summary plots created by dms2_batch_bcsubamp.
All of these files are found in the directory specified by `--outdir`, with the prefix specified by `--summaryprefix`. So we define them using this plot prefix plus the suffix for each plot.
In [6]:
countsplotprefix = os.path.join(countsdir, 'summary')
The `*_readstats.pdf` plot below shows the statistics on the reads. Most reads were retained, and a small fraction were discarded because of low-quality barcodes. None failed the Illumina filter, since such reads were already filtered out when we downloaded from the SRA using `fastq-dump`.
In [7]:
showPDF(countsplotprefix + '_readstats.pdf', width=500)
The `*_readsperbc.pdf` plot below shows how many times different barcodes were observed for each sample. Barcodes need to be observed multiple times to be useful for the error correction in barcoded-subamplicon sequencing. Many barcodes are observed multiple times. There is also a large bar at 1, which probably corresponds to a mix of barcodes that really were observed just once and sequencing errors on other barcodes that spuriously gave rise to apparently unique sequences. Overall, the distributions indicate that most multiply sequenced barcodes were sequenced more than twice; this suggests that most of the peak at one is due to sequencing errors on barcodes, and that sequencing to greater depth probably would not help much for these samples.
In [8]:
showPDF(countsplotprefix + '_readsperbc.pdf')
The `*_bcstats.pdf` plot below shows statistics on the barcodes. Some of the barcodes had to be discarded because they had too few reads (these are the single-read barcodes in the plot above), a small fraction with adequate reads were not alignable, and the rest aligned to the HA gene properly. The fact that most barcodes with multiple reads aligned is good -- if that were not the case, it would indicate that something had gone wrong, such as perhaps having the wrong `--alignspecs` when calling `dms2_batch_bcsubamp`.
In [9]:
showPDF(countsplotprefix + '_bcstats.pdf', width=500)
The `*_depth.pdf` plot below shows the depth (number of called codons) at each site in the gene. There is a bit of variation across the gene due to unequal numbers of barcodes for the different subamplicons, but the variation is fairly modest. The typical site has about $3 \times 10^5$ called codons, which makes sense since most samples had about $2 \times 10^6$ total aligned barcodes and the gene was split into 6 subamplicons. If the coverage were wildly uneven across the gene, that would suggest that the subamplicons had been pooled very unevenly. Note that there are often little dips near the middle of the subamplicons. These regions are covered by the ends of the R1 and R2 reads, which are usually of lower sequencing quality. In addition, if one read is of much lower quality, it can sometimes interfere with good sequencing from the other read in regions of overlap. These issues can often be improved by adjusting `--R1trim` and `--R2trim` to dms2_batch_bcsubamp.
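The expected depth is easy to check by hand: each aligned barcode covers exactly one of the 6 subamplicons, so the per-site depth should be roughly the number of aligned barcodes divided by the number of subamplicons. A back-of-the-envelope calculation (the barcode count is just the approximate value read off the plots above):

```python
total_aligned_barcodes = 2e6  # approximate per-sample value from the plots above
nsubamplicons = 6
print("expected per-site depth ~ {0:.1e}".format(
        total_aligned_barcodes / nsubamplicons))  # ~ 3.3e+05
```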
In [10]:
showPDF(countsplotprefix + '_depth.pdf')
The `*_mutfreq.pdf` plot below shows the per-codon frequency of mutations at each site. Generally, things are about as expected: there is a fairly high mutation frequency across all sites in the mutant DNA library samples; mutations remain prevalent at many but not all sites in the mutant viruses (presumably there is more selection against mutations at some sites than others); and the mutation frequency (presumably mostly sequencing errors) is much lower in the wildtype plasmid and wildtype virus. Note that there are a few peaks in mutation frequency (probably sequencing errors) in the wildtype control samples. These peaks often appear to correspond to the centers of the subamplicons, where there are slight dips in coverage in the depth plot shown above. Again, this is usually due to read quality dropping off near the ends of R1 and R2, and the effect can often be somewhat ameliorated by adjusting the `--R1trim` and `--R2trim` parameters to dms2_batch_bcsubamp to avoid the low-quality ends of the reads.
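We can reproduce the gist of this plot directly from the counts files. The sketch below computes the average per-codon mutation frequency (the fraction of called codons at each site that differ from wildtype) for a few samples, again assuming the standard counts-file format:

```python
def avg_mutfreq(countsfile):
    """Average per-codon mutation frequency across all sites."""
    df = pandas.read_csv(countsfile)
    codoncols = [c for c in df.columns if c not in ['site', 'wildtype']]
    depth = df[codoncols].sum(axis=1)
    # counts of the wildtype codon at each site
    wtcounts = df.apply(lambda row: row[row['wildtype']], axis=1)
    return ((depth - wtcounts) / depth).mean()

for name in ['mutDNA-1', 'mutvirus-1', 'wtDNA']:
    countsfile = os.path.join(countsdir, '{0}_codoncounts.csv'.format(name))
    print('{0}: {1:.5f}'.format(name, avg_mutfreq(countsfile)))
```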
In [11]:
showPDF(countsplotprefix + '_mutfreq.pdf')
The `*_cumulmutcounts.pdf` plot below shows the fraction of mutations that are found $\leq$ the indicated number of times. This gives an idea of how completely the possible amino-acid and codon mutations were sampled in each sample. As shown below, nearly all codon and amino-acid mutations were sampled at least once (and usually much more than once) in the mutDNA samples, indicating that our libraries contained virtually all of these mutations as desired. The possible mutations were sampled only $\sim$50% of the time in the mutvirus samples, because selection for viral replication purged many deleterious mutations. The wildtype controls sampled only a relatively small fraction of the possible mutations, since sequencing errors were fairly rare.
In [12]:
showPDF(countsplotprefix + '_cumulmutcounts.pdf')
The `*_codonmuttypes.pdf` plot below shows the per-codon frequency of nonsynonymous, synonymous, and stop-codon mutations across the entire gene. As expected, we see strong selection against stop codons and moderate selection against nonsynonymous mutations in the mutvirus samples relative to the mutDNA samples, indicating purifying selection against deleterious changes. The wildtype controls have lower mutation rates, reflecting the underlying rate of errors in the whole experimental process.
In [13]:
showPDF(countsplotprefix + '_codonmuttypes.pdf', width=500)
The numerical data in the `*_codonmuttypes.pdf` plot shown above may also be of interest if you want to know the precise mutation frequencies. Those data are written to a file called `*_codonmuttypes.csv`, the contents of which are displayed below.
In [14]:
codonmuttypes = pandas.read_csv(countsplotprefix + '_codonmuttypes.csv').sort_index(axis=1)
display(HTML(codonmuttypes.to_html(index=False)))
The `*_codonntchanges.pdf` plot below shows the per-codon frequency across the entire gene of codon mutations that change different numbers of nucleotides (e.g., `ATG` to `AAG` changes 1 nucleotide, `ATG` to `AAC` changes 2 nucleotides, and `ATG` to `CAC` changes 3 nucleotides). Since we are dealing with codon-mutant libraries in this experiment, we see all three types of mutations in the mutagenized mutDNA and mutvirus samples. The "mutations" in the wildtype controls are sequencing errors, and those are almost all single-nucleotide changes, since it is very rare for two nucleotides in the same codon to both experience a sequencing error.
In [15]:
showPDF(countsplotprefix + '_codonntchanges.pdf', width=500)
The `*_singlentchanges.pdf` plot below shows the frequency of each type of nucleotide change among only those codon mutations with a single nucleotide change. This plot is mostly useful for checking whether there is a large bias in which mutations appear. In particular, if you are getting oxidative damage (which causes `G` to `T` mutations) during the library preparation process, you will see a large excess of `C` to `A` or `G` to `T` mutations (or both). For instance, in the case of influenza, when we get bad oxidative damage we see an excess of `C` to `A` mutations in the final cDNA, since the damage occurs on the ssRNA genome. If you are sequencing something without this strand polarity, you might see both types of mutations. There is not much oxidative damage in the samples plotted below, which mostly have a fairly even distribution of nucleotide changes. We do see that transitions (`G` <-> `A` and `C` <-> `T`) are a bit more common than most of the other types of mutations (all of which are transversions). This is expected, since PCR-based sources of mutations preferentially introduce transitions.
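Transitions and transversions are likewise easy to classify programmatically: a change is a transition if both nucleotides are purines or both are pyrimidines, and a transversion otherwise. A minimal sketch:

```python
PURINES = {'A', 'G'}

def change_type(nt1, nt2):
    """Classify a single-nucleotide change as a transition or transversion."""
    assert nt1 != nt2
    return ('transition' if (nt1 in PURINES) == (nt2 in PURINES)
            else 'transversion')

print(change_type('G', 'A'))  # transition
print(change_type('G', 'T'))  # transversion (the oxidative-damage signature)
```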
In [16]:
showPDF(countsplotprefix + '_singlentchanges.pdf', width=500)
We would like to analyze the counts of each mutation pre- and post-selection to estimate the preference of each site for each amino acid. We do this using the dms2_batch_prefs program, analyzing each replicate separately.
Note that for each replicate we have a pre-selection sample (e.g., mutDNA-1) and a post-selection sample (e.g., mutvirus-1). We also have two "error controls" that use the same deep sequencing strategy but on wildtype samples: wtDNA and wtvirus. We use these as the pre-selection and post-selection error controls to estimate the error rates from processes other than the mutations that we care about.
We create a batch file specifying these samples, and then run dms2_batch_prefs to estimate the preferences for each replicate. Note that in the batch file we can simply specify the names used by dms2_batch_bcsubamp along with the directory where the counts files were placed.
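To build intuition for what a "preference" is, here is a deliberately naive estimate: the enrichment of each amino acid post-selection versus pre-selection, normalized at each site so that the preferences sum to one. This is not the actual algorithm used by dms2_batch_prefs (which also corrects for the error rates measured by the wtDNA and wtvirus controls); it is just an illustrative sketch with made-up frequencies:

```python
def naive_prefs(pre_freqs, post_freqs, pseudocount=1e-4):
    """Naive site preferences from pre- and post-selection amino-acid frequencies."""
    enrichment = {aa: (post_freqs[aa] + pseudocount) / (pre_freqs[aa] + pseudocount)
                  for aa in pre_freqs}
    total = sum(enrichment.values())
    return {aa: e / total for aa, e in enrichment.items()}

# toy example at one hypothetical site with just three amino acids
print(naive_prefs(pre_freqs={'A': 0.5, 'G': 0.3, 'W': 0.2},
                  post_freqs={'A': 0.70, 'G': 0.25, 'W': 0.05}))
```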
In [17]:
# prefs placed in this directory
prefsdir = os.path.join(resultsdir, 'prefs')
if not os.path.isdir(prefsdir):
    os.mkdir(prefsdir)
# create batch file for dms2_batch_prefs
prefsbatch = pandas.DataFrame(
        columns=['name', 'pre', 'post', 'errpre', 'errpost'],
        data=[('replicate-1', 'mutDNA-1', 'mutvirus-1', 'wtDNA', 'wtvirus'),
              ('replicate-2', 'mutDNA-2', 'mutvirus-2', 'wtDNA', 'wtvirus'),
              ('replicate-3', 'mutDNA-3', 'mutvirus-3', 'wtDNA', 'wtvirus')]
        )
prefsbatchfile = os.path.join(prefsdir, 'batch.csv')
print("Here is the batch file that we write to CSV format to use as input:")
display(HTML(prefsbatch.to_html(index=False)))
prefsbatch.to_csv(prefsbatchfile, index=False)
print("Running dms2_batch_prefs...")
log = !dms2_batch_prefs \
--indir {countsdir} \
--batchfile {prefsbatchfile} \
--outdir {prefsdir} \
--summaryprefix summary \
--use_existing {use_existing}
print("Completed running dms2_batch_prefs")
This creates files containing the estimated amino-acid preferences for each replicate.
These preferences are in the directory specified by `--outdir` in the call to dms2_batch_prefs above:
In [18]:
!ls {prefsdir}/*_prefs.csv
We can also look at the summary files created by dms2_batch_prefs. These files are also in the directory specified by `--outdir` and have the prefix specified by `--summaryprefix`.
The `*_prefscorr.pdf` file shows the correlation between all pairs of replicates specified in the batch file. Each point represents one amino-acid preference, and the Pearson R is shown in the plot:
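We can also compute one of these correlations directly. The sketch below assumes the standard format of the `*_prefs.csv` files (a `site` column plus one column per amino acid):

```python
import numpy

p1 = pandas.read_csv(os.path.join(prefsdir, 'replicate-1_prefs.csv'))
p2 = pandas.read_csv(os.path.join(prefsdir, 'replicate-2_prefs.csv'))
aacols = [c for c in p1.columns if c != 'site']
r = numpy.corrcoef(p1[aacols].values.ravel(), p2[aacols].values.ravel())[0, 1]
print("Pearson R, replicate-1 vs replicate-2: {0:.2f}".format(r))
```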
In [19]:
showPDF(os.path.join(prefsdir, 'summary_prefscorr.pdf'), width=300)
We can also compare the preferences estimated with dms_tools2 to the original amino-acid preferences estimated by Doud and Bloom (2016) using the older dms_tools software. We have downloaded those older results from the original analysis and placed them in ./data/originalDoud2016prefs. We make this comparison using the `dms_tools2.plot.plotCorrMatrix` function that is part of the dms_tools2 Python API. As can be seen below, the preferences estimated by dms_tools2 correlate very well with the original estimates made using the older dms_tools software.
In [20]:
# directory with original prefs from Doud2016 paper
oldprefsdir = './data/originalDoud2016prefs'
# make lists with new and original names and prefs files
prefsfiles = [os.path.join(xdir, 'replicate-{0}_prefs.csv'.format(r))
              for xdir in [prefsdir, oldprefsdir] for r in [1, 2, 3]]
names = ['{0}replicate-{1}'.format(old, r) for old in ['', 'old-'] for r in [1, 2, 3]]
plotfile = os.path.join(prefsdir, 'old_vs_new_prefscorr.pdf')
dms_tools2.plot.plotCorrMatrix(names, prefsfiles, plotfile, datatype='prefs')
showPDF(plotfile, width=600)
Our run of dms2_batch_prefs also creates a file with the average preferences across replicates, with the prefix specified by `--summaryprefix` and the suffix `_avgprefs.csv`.
Let's use dms2_logoplot to visualize these average preferences in the form of a logo plot, where the height of each letter is proportional to the preference for that amino acid.
In [21]:
avgprefs = os.path.join(prefsdir, 'summary_avgprefs.csv')
For this plot, we'd like to overlay the secondary structure and relative solvent accessibility (RSA) for each residue.
We can get this information by running dssp on a PDB file.
We have already done this using a re-numbered version of PDB 1RVX (a structure of an H1 HA), and the results are in ./data/1RVX_trimer_sequentialnumbering.dssp.
So we first use the `dms_tools2.dssp.processDSSP` function of the dms_tools2 Python API to extract the RSA and secondary structure class from this file and write them to CSV files.
In [22]:
dsspfile = './data/1RVX_trimer_sequentialnumbering.dssp'
dssp_df = dms_tools2.dssp.processDSSP(dsspfile, chain='A')
dssp_df['SS'] = dssp_df['SS_class']
ssfile = os.path.join(prefsdir, 'SS.csv')
dssp_df[['site', 'SS']].to_csv(ssfile, index=False)
rsafile = os.path.join(prefsdir, 'RSA.csv')
dssp_df[['site', 'RSA']].to_csv(rsafile, index=False)
We also want to make an overlay of the wildtype sequence. To do that, we need to create a file with the columns `site` and `wildtype`. We can create this from any of the codon counts files by simply translating the codons to amino acids with the `dms_tools2.utils.codonToAACounts` function of the dms_tools2 Python API, and then excluding any sites where the wildtype identity is a stop codon, since our analyses above used `--excludestop yes` and so excluded stop codons:
In [23]:
wtoverlayfile = os.path.join(prefsdir, 'wildtypeoverlayfile.csv')
aacounts = dms_tools2.utils.codonToAACounts(
pandas.read_csv(os.path.join(countsdir, 'wtDNA_codoncounts.csv')))
aacounts.query('wildtype != "*"')[['site', 'wildtype']].to_csv(wtoverlayfile, index=False)
Now we use dms2_logoplot to make the logo plot of the preferences with the wildtype sequence, secondary structure, and RSA overlaid. Note how the output file is in the directory specified by `--outdir`, with the name given by `--name` and the suffix `_prefs.pdf`.
In [24]:
logoname = 'avgprefs'
log = !dms2_logoplot \
--prefs {avgprefs} \
--name {logoname} \
--outdir {prefsdir} \
--overlay1 {wtoverlayfile} wildtype wildtype \
--overlay2 {ssfile} SS "secondary structure" \
--overlay3 {rsafile} RSA "relative solvent accessibility" \
--nperline 81 \
--use_existing {use_existing}
logoplot = os.path.join(prefsdir, '{0}_prefs.pdf'.format(logoname))
showPDF(logoplot)
Finally, we will take the averaged amino-acid preferences and fit them to natural sequence evolution. The goal here is to test whether the experimentally measured preferences are informative about natural sequence evolution, and if so, to obtain a stringency parameter that can be used to re-scale them to bring them into line with the stringency of natural selection.
All of this is done in a phylogenetic framework using an experimentally informed codon model (ExpCM). For a description of the general concept of an ExpCM, see Bloom (2014). For a more practical discussion of how ExpCM can be applied to real data, see Hilton and Bloom (2017), which describes the phydms software that we will use for the analysis.
Note that the original Doud and Bloom (2016) study performed a similar analysis to what we are doing here -- but they used an older version of phydms and a different sequence alignment, and so got somewhat different numerical results than what we'll get here, although the trends were the same.
First, we use phydms_comprehensive to fit the averaged amino-acid preferences to the HA sequence alignment in ./data/HA_alignment.fasta. Note that running phydms_comprehensive requires that we have RAxML installed at the path specified by `--raxml`. We will look at the results in the modelcomparison.md file created by phydms_comprehensive. Since this program does not have a `--use_existing` option, we code something equivalent in the cell below to avoid re-running if the output already exists.
In [25]:
print("Using the following version of phydms:")
!phydms -v
alignment = './data/HA_alignment.fasta'
phydms_dir = os.path.join(resultsdir, 'phydms_analysis/') # put phydms results here
modelcomparison = os.path.join(phydms_dir, 'modelcomparison.md')
if use_existing == 'yes' and os.path.isfile(modelcomparison):
    print("Results of phydms analysis already exist.")
else:
    print("Running phydms_comprehensive...")
    log = !phydms_comprehensive {phydms_dir} {alignment} {avgprefs} --raxml raxml
    print("Analysis complete.")
Now we look at the results of the analysis in the modelcomparison.md file:
In [26]:
display(Markdown(modelcomparison))
The table shown above compares four different models: an ExpCM informed by the deep mutational scanning preferences, a control ExpCM in which those preferences are averaged across sites, and the standard YNGKP M0 and M5 codon models.
If the deep mutational scanning is informative about evolution, then we expect it to fit the natural sequence evolution much better than any of the other models. That is in fact what we see, as it has a much lower AIC.
Also of relevance is the stringency parameter ($\beta$) that re-scales the preferences to optimally fit natural evolution in an ExpCM. When $\beta > 1$, it means that natural evolution prefers the same amino acids as the experiments but with greater stringency (see Hilton and Bloom (2017), particularly Figure 3 of that paper, for more details).
Above we can see that our ExpCM fit a value of $\beta = 2.05$.
We can use dms2_logoplot to re-plot the replicate-average preferences after re-scaling by this stringency parameter, via the `--stringency` option to that program. Such a plot is shown below; it provides a better estimate of how the mutational effects measured in the deep mutational scanning actually correspond to natural selection.
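The re-scaling itself is simple: following Bloom (2014), each preference is raised to the power $\beta$ and the values are then renormalized at each site. A minimal sketch of that transformation:

```python
def rescale_prefs(prefs, beta):
    """Re-scale the preferences at one site by a stringency parameter beta."""
    powered = {aa: p ** beta for aa, p in prefs.items()}
    total = sum(powered.values())
    return {aa: p / total for aa, p in powered.items()}

# with beta > 1, the preferred amino acids become even more dominant
print(rescale_prefs({'A': 0.6, 'G': 0.3, 'W': 0.1}, beta=2.05))
```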
In [27]:
rescaledlogoname = 'rescaled-avgprefs'
log = !dms2_logoplot \
--prefs {avgprefs} \
--name {rescaledlogoname} \
--outdir {prefsdir} \
--overlay1 {wtoverlayfile} wildtype wildtype \
--overlay2 {ssfile} SS "secondary structure" \
--overlay3 {rsafile} RSA "relative solvent accessibility" \
--nperline 81 \
--stringency 2.05 \
--use_existing {use_existing}
rescaledlogoplot = os.path.join(prefsdir, '{0}_prefs.pdf'.format(rescaledlogoname))
showPDF(rescaledlogoplot)
In [28]: