Doud and Bloom (2016) performed deep mutational scanning on the hemagglutinin (HA) from A/WSN/1933 (H1N1) influenza virus. In that original paper, the data was analyzed using the older dms_tools software package. This Jupyter notebook re-analyzes the data using the newer dms_tools2 software.
The goal of the experiment was to quantify the effect of all amino-acid mutations to HA on viral replication in cell culture. To do this, Doud and Bloom (2016) created three independent plasmid mutant libraries carrying nearly all codon mutations of HA. These plasmid mutant libraries are referred to as mutDNA-1, mutDNA-2, and mutDNA-3. They then used these libraries to generate three libraries of mutant viruses, which they passaged at low MOI to create a genotype-phenotype link. These three virus mutant libraries are referred to as mutvirus-1, mutvirus-2, and mutvirus-3. In addition, they performed a single replicate of the same procedure using wildtype (unmutated plasmid) to create non-mutagenized virus -- these plasmid and virus samples are referred to as wtDNA and wtvirus.
All of these plasmid and virus samples were deep sequenced using barcoded-subamplicon sequencing to obtain high accuracy. This notebook analyzes those data to determine the "preference" of each site in HA for each possible amino acid.
Import Python modules, print version of dms_tools2, define some general variables used throughout the notebook.
Some key variables defined below affect how the entire notebook runs:

- `ncpus` is the number of CPUs to use. It should not exceed the number you request if you submit the job via a queueing system such as slurm / sbatch. A value of -1 means use all available CPUs.
- `use_existing` specifies whether we use existing results if they exist, or generate everything anew. Note that the dms_tools2 programs just check for existing output and do not do true dependency tracking -- so if you have changed the inputs, switch this option from `yes` to `no` to re-run everything fresh.
In [1]:
import os
import warnings
warnings.simplefilter('ignore') # don't print warnings to avoid clutter in notebook
import pandas
from IPython.display import display, HTML, Markdown
import dms_tools2
import dms_tools2.plot
import dms_tools2.dssp
import dms_tools2.sra
import dms_tools2.utils # needed below to convert codon counts to amino-acid counts
from dms_tools2.ipython_utils import showPDF
print("Using dms_tools2 version {0}".format(dms_tools2.__version__))
# results will go in this directory
resultsdir = './results/'
if not os.path.isdir(resultsdir):
    os.mkdir(resultsdir)
# CPUs to use, should not exceed the number you request with slurm
ncpus = -1
# do we use existing results or generate everything new?
use_existing = 'yes'
We first define basic information about the samples needed for processing the barcoded-subamplicon deep sequencing data. Specifically, we create a pandas DataFrame that gives the name of each sample and its run accession on the Sequence Read Archive:
In [2]:
samples = pandas.DataFrame.from_records(
        [('mutDNA-1', 'SRR3113656'),
         ('mutDNA-2', 'SRR3113657'),
         ('mutDNA-3', 'SRR3113658'),
         ('mutvirus-1', 'SRR3113660'),
         ('mutvirus-2', 'SRR3113661'),
         ('mutvirus-3', 'SRR3113662'),
         ('wtDNA', 'SRR3113655'),
         ('wtvirus', 'SRR3113659')],
        columns=['name', 'run']
        )
Before doing any analysis, we need to obtain the FASTQ files with the deep sequencing data. Doud and Bloom (2016) uploaded these FASTQ files to the Sequence Read Archive, so we simply download those files.
We download these files using the fastqFromSRA function from the dms_tools2 Python API. Note that the call to this function below uses two external programs that are not part of dms_tools2, and which you therefore must install separately on the computer that you are using: `fastq-dump` from the SRA Toolkit, and the Aspera Connect `ascp` program together with an Aspera key file. Installing Aspera Connect and a key can be somewhat complex, so if you do not want to do this then just set `aspera=None` in the command below and `fastq-dump` will do the downloads on its own (albeit more slowly).
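If you set `aspera=None`, the download reduces conceptually to a loop over the SRA runs. Below is a minimal sketch of such a fallback; it is illustrative only (not part of dms_tools2) and assumes that `fastq-dump` from the SRA Toolkit is on your PATH and that `samples` is the DataFrame defined above:

```python
import os
import subprocess

def download_with_fastq_dump(samples, fastqdir):
    """Download gzipped FASTQ files for each SRA run using only fastq-dump."""
    os.makedirs(fastqdir, exist_ok=True)
    for run in samples['run']:
        # --split-files writes separate R1 / R2 files for paired-end reads
        subprocess.check_call(['fastq-dump', '--split-files', '--gzip',
                               '--outdir', fastqdir, run])
```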
In [3]:
fastqdir = os.path.join(resultsdir, 'FASTQ_files')
print("Downloading FASTQ files from the SRA...")
dms_tools2.sra.fastqFromSRA(
        samples=samples,
        fastq_dump='fastq-dump', # valid path to this program on the Hutch server
        fastqdir=fastqdir,
        aspera=(
            '/app/aspera-connect/3.5.1/bin/ascp', # valid path to ascp on Hutch server
            '/app/aspera-connect/3.5.1/etc/asperaweb_id_dsa.openssh' # Aspera key on Hutch server
            )
        )
print("Here are the names of the downloaded files now found in {0}".format(fastqdir))
display(HTML(samples.to_html(index=False)))
Doud and Bloom (2016) used barcoded-subamplicon sequencing to obtain high accuracy during the Illumina deep sequencing.
We therefore use the dms2_batch_bcsubamp program to analyze these data.
Running that program requires specifying a `--batchfile` that lists the samples, a wildtype `--refseq` to which we make alignments, and `--alignspecs` that tell us where the subamplicons should align.
The batch file that we specify is printed by the cell below.
The alignments are made to the wildtype WSN HA coding sequence, which is available at ./data/WSN-HA.fasta.
The alignment specs are specified based on the exact primers used for the barcoded-subamplicon sequencing by Doud and Bloom (2016).
The alignment specs need to be exactly correct for the subamplicons to align.
We also do some trimming of the reads using the `--R1trim` and `--R2trim` parameters.
Such trimming can be crucial for getting good results, since the read quality often drops off at the end of reads.
In the example below, we have trimmed all subamplicons equally, but in some cases you might want to trim each one differently by specifying a list of values for each subamplicon.
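Each alignspec lists, in order, the start and end of the subamplicon in the reference sequence and the positions at which R1 and R2 begin to align (see the dms2_batch_bcsubamp documentation for details). As a quick sanity check, we can verify that the subamplicons tile the gene contiguously; this check is just an illustrative sketch, not something dms_tools2 requires:

```python
def check_alignspecs(alignspecs):
    """Check that subamplicons tile the reference sequence without gaps."""
    specs = [tuple(map(int, spec.split(','))) for spec in alignspecs.split()]
    for (_, prev_end, _, _), (start, _, _, _) in zip(specs, specs[1:]):
        assert start == prev_end + 1, "subamplicons should tile contiguously"
    return "subamplicons cover sites 1 to {0}".format(specs[-1][1])

print(check_alignspecs('1,285,36,37 286,570,31,32 571,855,37,32 '
                       '856,1140,31,36 1141,1425,29,33 1426,1698,40,43'))
```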
In [4]:
# file containing wildtype WSN HA sequence
refseq = './data/WSN-HA.fasta'
# define subamplicon alignment specifications
alignspecs = ' '.join(['1,285,36,37',
                       '286,570,31,32',
                       '571,855,37,32',
                       '856,1140,31,36',
                       '1141,1425,29,33',
                       '1426,1698,40,43'])
# counts and alignments placed in this directory
countsdir = os.path.join(resultsdir, 'codoncounts')
if not os.path.isdir(countsdir):
    os.mkdir(countsdir)
# write sample information to a batch file for dms2_batch_bcsubamp
countsbatchfile = os.path.join(countsdir, 'batch.csv')
print("Here is the batch file that we write to CSV format to use as input:")
display(HTML(samples[['name', 'R1']].to_html(index=False)))
samples[['name', 'R1']].to_csv(countsbatchfile, index=False)
print('\nNow running dms2_batch_bcsubamp...')
log = !dms2_batch_bcsubamp \
--batchfile {countsbatchfile} \
--refseq {refseq} \
--alignspecs {alignspecs} \
--outdir {countsdir} \
--summaryprefix summary \
--R1trim 200 \
--R2trim 170 \
--fastqdir {fastqdir} \
--ncpus {ncpus} \
--use_existing {use_existing}
print("Completed dms2_batch_bcsubamp.")
These runs create files giving the counts of each codon at each site in each sample.
These files are in the directory specified by `--outdir` in the call to dms2_batch_bcsubamp above:
In [5]:
!ls {countsdir}/*_codoncounts.csv
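To get a feel for the format of these counts files, we can load one with pandas and compute the per-site depth ourselves. This is just an exploratory sketch; it assumes the standard dms2_bcsubamp output format with a `site` column, a `wildtype` column, and one column of counts per codon:

```python
counts = pandas.read_csv(os.path.join(countsdir, 'mutDNA-1_codoncounts.csv'))
codoncols = [c for c in counts.columns if c not in ['site', 'wildtype']]
counts['depth'] = counts[codoncols].sum(axis=1)  # total called codons per site
print(counts[['site', 'wildtype', 'depth']].head())
```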
Now we look at the summary plots created by dms2_batch_bcsubamp.
All of these files are found in the directory specified by `--outdir`, with the prefix specified by `--summaryprefix`. So we define them using this plot prefix plus the suffix for each plot.
In [6]:
countsplotprefix = os.path.join(countsdir, 'summary')
The `*_readstats.pdf` plot below shows the statistics on the reads. Most reads were retained, and a small fraction were discarded because of low-quality barcodes. None failed the Illumina filter, since such reads were already filtered out when we downloaded from the SRA using `fastq-dump`.
In [7]:
showPDF(countsplotprefix + '_readstats.pdf', width=500)
The `*_readsperbc.pdf` plot below shows how many times different barcodes were observed for each sample. Barcodes need to be observed multiple times to be useful for the error correction in barcoded-subamplicon sequencing. Many barcodes are observed multiple times. There is also a large bar at 1, which probably corresponds to a mix of barcodes that really were observed just once and sequencing errors on other barcodes that spuriously gave rise to apparently unique sequences. Overall, the distributions indicate that most multiply sequenced barcodes were sequenced more than twice; this suggests that most of the peak at one is due to sequencing errors on barcodes, and that sequencing to greater depth probably would not help much for these samples.
In [8]:
showPDF(countsplotprefix + '_readsperbc.pdf')
The `*_bcstats.pdf` plot below shows statistics on the barcodes. Some of the barcodes had to be discarded because they had too few reads (these are the single-read barcodes in the plot above), a small fraction with adequate reads were not alignable, and the rest aligned to the HA gene properly. The fact that most barcodes with multiple reads aligned is good -- if that were not the case, it would indicate that something had gone wrong, such as perhaps having the wrong `--alignspecs` when calling `dms2_batch_bcsubamp`.
In [9]:
showPDF(countsplotprefix + '_bcstats.pdf', width=500)
The `*_depth.pdf` plot below shows the depth (number of called codons) at each site in the gene. There is a bit of variation across the gene due to unequal numbers of barcodes for the different subamplicons, but the variation is fairly modest. The typical site has about $3 \times 10^5$ called codons, which makes sense since most samples had about $2 \times 10^6$ total aligned barcodes and the gene was split into 6 subamplicons. If the coverage were wildly uneven across the gene, that would suggest that the subamplicons had been pooled very unevenly. Note that there are often little dips near the middle of the subamplicons. These regions are covered by the ends of the R1 and R2 reads, which are usually of lower sequencing quality. In addition, if one read is of much lower quality, it can sometimes interfere with good sequencing from the other read in regions of overlap. These issues can often be improved by adjusting `--R1trim` and `--R2trim` to dms2_batch_bcsubamp.
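The expected depth is easy to check by hand: each aligned barcode covers exactly one of the 6 subamplicons, so the per-site depth should be roughly the number of aligned barcodes divided by the number of subamplicons. A back-of-the-envelope calculation (the barcode count is just the approximate value read off the plots above):

```python
total_aligned_barcodes = 2e6  # approximate per-sample value from the plots above
nsubamplicons = 6
print("expected per-site depth ~ {0:.1e}".format(
        total_aligned_barcodes / nsubamplicons))  # ~ 3.3e+05
```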
In [10]:
showPDF(countsplotprefix + '_depth.pdf')
The `*_mutfreq.pdf` plot below shows the per-codon frequency of mutations at each site. Generally, things are about as expected: there is a fairly high mutation frequency across all sites in the mutant DNA library samples; mutations remain prevalent at many but not all sites in the mutant viruses (presumably there is more selection against mutations at some sites than others); and the mutation frequency (presumably mostly sequencing errors) is much lower in the wildtype plasmid and wildtype virus. Note that there are a few peaks in mutation frequency (probably sequencing errors) in the wildtype control samples. These peaks often appear to correspond to the centers of the subamplicons, where there are slight dips in coverage in the depth plot shown above. Again, this is usually due to read quality dropping off near the ends of R1 and R2, and the effect can often be somewhat ameliorated by adjusting the `--R1trim` and `--R2trim` parameters to dms2_batch_bcsubamp to avoid the low-quality ends of the reads.
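We can reproduce the gist of this plot directly from the counts files. The sketch below computes the average per-codon mutation frequency (the fraction of called codons at each site that differ from wildtype) for a few samples, again assuming the standard counts-file format:

```python
def avg_mutfreq(countsfile):
    """Average per-codon mutation frequency across all sites."""
    df = pandas.read_csv(countsfile)
    codoncols = [c for c in df.columns if c not in ['site', 'wildtype']]
    depth = df[codoncols].sum(axis=1)
    # counts of the wildtype codon at each site
    wtcounts = df.apply(lambda row: row[row['wildtype']], axis=1)
    return ((depth - wtcounts) / depth).mean()

for name in ['mutDNA-1', 'mutvirus-1', 'wtDNA']:
    countsfile = os.path.join(countsdir, '{0}_codoncounts.csv'.format(name))
    print('{0}: {1:.5f}'.format(name, avg_mutfreq(countsfile)))
```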
In [11]:
showPDF(countsplotprefix + '_mutfreq.pdf')
The `*_cumulmutcounts.pdf` plot below shows the fraction of mutations that are found $\leq$ the indicated number of times. This gives an idea of how completely the possible amino-acid and codon mutations were sampled in each sample. As shown below, nearly all codon and amino-acid mutations were sampled at least once (and usually much more than once) in the mutDNA samples, indicating that our libraries contained virtually all of these mutations as desired. The possible mutations were sampled only $\sim$50% of the time in the mutvirus samples, because selection for viral replication purged many deleterious mutations. The wildtype controls sampled only a relatively small fraction of the possible mutations, since sequencing errors were fairly rare.
In [12]:
showPDF(countsplotprefix + '_cumulmutcounts.pdf')
The `*_codonmuttypes.pdf` plot below shows the per-codon frequency of nonsynonymous, synonymous, and stop-codon mutations across the entire gene. As expected, we see strong selection against stop codons and moderate selection against nonsynonymous mutations in the mutvirus samples relative to the mutDNA samples, indicating purifying selection against deleterious changes. The wildtype controls have lower mutation rates, reflecting the underlying rate of errors in the whole experimental process.
In [13]:
showPDF(countsplotprefix + '_codonmuttypes.pdf', width=500)
The numerical data in the `*_codonmuttypes.pdf` plot shown above may also be of interest if you want to know the precise mutation frequencies. Those data are written to a file called `*_codonmuttypes.csv`, the contents of which are displayed below.
In [14]:
codonmuttypes = pandas.read_csv(countsplotprefix + '_codonmuttypes.csv').sort_index(axis=1)
display(HTML(codonmuttypes.to_html(index=False)))
The `*_codonntchanges.pdf` plot below shows the per-codon frequency across the entire gene of codon mutations that change different numbers of nucleotides (e.g., `ATG` to `AAG` changes 1 nucleotide, `ATG` to `AAC` changes 2 nucleotides, and `ATG` to `CAC` changes 3 nucleotides). Since we are dealing with codon-mutant libraries in this experiment, we see all three types of mutations in the mutagenized mutDNA and mutvirus samples. The "mutations" in the wildtype controls are sequencing errors, and those are almost all single-nucleotide changes, since it is very rare for two nucleotides in the same codon to both experience a sequencing error.
In [15]:
showPDF(countsplotprefix + '_codonntchanges.pdf', width=500)
The `*_singlentchanges.pdf` plot below shows the frequency of each type of nucleotide change among only those codon mutations with a single nucleotide change. This plot is mostly useful for checking whether there is a large bias in which mutations appear. In particular, if you are getting oxidative damage (which causes `G` to `T` mutations) during the library preparation process, you will see a large excess of `C` to `A` or `G` to `T` mutations (or both). For instance, in the case of influenza, when we get bad oxidative damage we see an excess of `C` to `A` mutations in the final cDNA, since the damage occurs on the ssRNA genome. If you are sequencing something without this strand polarity, you might see both types of mutations. There is not much oxidative damage in the samples plotted below, which mostly have a fairly even distribution of nucleotide changes. We do see that transitions (`G` <-> `A` and `C` <-> `T`) are a bit more common than most of the other types of mutations (all of which are transversions). This is expected, since PCR-based sources of mutations preferentially introduce transitions.
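Transitions and transversions are likewise easy to classify programmatically: a change is a transition if both nucleotides are purines or both are pyrimidines, and a transversion otherwise. A minimal sketch:

```python
PURINES = {'A', 'G'}

def change_type(nt1, nt2):
    """Classify a single-nucleotide change as a transition or transversion."""
    assert nt1 != nt2
    return ('transition' if (nt1 in PURINES) == (nt2 in PURINES)
            else 'transversion')

print(change_type('G', 'A'))  # transition
print(change_type('G', 'T'))  # transversion (the oxidative-damage signature)
```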
In [16]:
showPDF(countsplotprefix + '_singlentchanges.pdf', width=500)
We would like to analyze the counts of each mutation pre- and post-selection to estimate the preference of each site for each amino acid. We do this using the dms2_batch_prefs program, analyzing each replicate separately.
Note that for each replicate we have a pre-selection sample (e.g., mutDNA-1) and a post-selection sample (e.g., mutvirus-1). We also have two "error controls" that use the same deep sequencing strategy but on wildtype samples: wtDNA and wtvirus. We use these as the pre-selection and post-selection error controls to estimate the error rates from processes other than the mutations that we care about.
We create a batch file specifying these samples, and then run dms2_batch_prefs to estimate the preferences for each replicate. Note that in the batch file we can simply specify the names used by dms2_batch_bcsubamp along with the directory where the counts files were placed.
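To build intuition for what a "preference" is, here is a deliberately naive estimate: the enrichment of each amino acid post-selection versus pre-selection, normalized at each site so that the preferences sum to one. This is not the actual algorithm used by dms2_batch_prefs (which also corrects for the error rates measured by the wtDNA and wtvirus controls); it is just an illustrative sketch with made-up frequencies:

```python
def naive_prefs(pre_freqs, post_freqs, pseudocount=1e-4):
    """Naive site preferences from pre- and post-selection amino-acid frequencies."""
    enrichment = {aa: (post_freqs[aa] + pseudocount) / (pre_freqs[aa] + pseudocount)
                  for aa in pre_freqs}
    total = sum(enrichment.values())
    return {aa: e / total for aa, e in enrichment.items()}

# toy example at one hypothetical site with just three amino acids
print(naive_prefs(pre_freqs={'A': 0.5, 'G': 0.3, 'W': 0.2},
                  post_freqs={'A': 0.70, 'G': 0.25, 'W': 0.05}))
```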
In [17]:
# prefs placed in this directory
prefsdir = os.path.join(resultsdir, 'prefs')
if not os.path.isdir(prefsdir):
    os.mkdir(prefsdir)
# create batch file for dms2_batch_prefs
prefsbatch = pandas.DataFrame(
        columns=['name', 'pre', 'post', 'errpre', 'errpost'],
        data=[('replicate-1', 'mutDNA-1', 'mutvirus-1', 'wtDNA', 'wtvirus'),
              ('replicate-2', 'mutDNA-2', 'mutvirus-2', 'wtDNA', 'wtvirus'),
              ('replicate-3', 'mutDNA-3', 'mutvirus-3', 'wtDNA', 'wtvirus')]
        )
prefsbatchfile = os.path.join(prefsdir, 'batch.csv')
print("Here is the batch file that we write to CSV format to use as input:")
display(HTML(prefsbatch.to_html(index=False)))
prefsbatch.to_csv(prefsbatchfile, index=False)
print("Running dms2_batch_prefs...")
log = !dms2_batch_prefs \
--indir {countsdir} \
--batchfile {prefsbatchfile} \
--outdir {prefsdir} \
--summaryprefix summary \
--use_existing {use_existing}
print("Completed running dms2_batch_prefs")
This creates files containing the estimated amino-acid preferences for each replicate.
These preferences are in the directory specified by `--outdir` in the call to dms2_batch_prefs above:
In [18]:
!ls {prefsdir}/*_prefs.csv
We can also look at the summary files created by dms2_batch_prefs. These files are also in the directory specified by `--outdir` and have the prefix specified by `--summaryprefix`.
The `*_prefscorr.pdf` file shows the correlation between all pairs of replicates specified in the batch file. Each point represents one amino-acid preference, and the Pearson R is shown in the plot:
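We can also compute one of these correlations directly. The sketch below assumes the standard format of the `*_prefs.csv` files (a `site` column plus one column per amino acid):

```python
import numpy

p1 = pandas.read_csv(os.path.join(prefsdir, 'replicate-1_prefs.csv'))
p2 = pandas.read_csv(os.path.join(prefsdir, 'replicate-2_prefs.csv'))
aacols = [c for c in p1.columns if c != 'site']
r = numpy.corrcoef(p1[aacols].values.ravel(), p2[aacols].values.ravel())[0, 1]
print("Pearson R, replicate-1 vs replicate-2: {0:.2f}".format(r))
```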
In [19]:
showPDF(os.path.join(prefsdir, 'summary_prefscorr.pdf'), width=300)
We can also compare the preferences estimated with dms_tools2 to the original amino-acid preferences estimated by Doud and Bloom (2016) using the older dms_tools software. We have downloaded those older results from the original analysis and placed them in ./data/originalDoud2016prefs. We make this comparison using the `dms_tools2.plot.plotCorrMatrix` function that is part of the dms_tools2 Python API. As can be seen below, the preferences estimated by dms_tools2 correlate very well with the original estimates made using the older dms_tools software.
In [20]:
# directory with original prefs from Doud2016 paper
oldprefsdir = './data/originalDoud2016prefs'
# make lists with new and original names and prefs files
prefsfiles = [os.path.join(xdir, 'replicate-{0}_prefs.csv'.format(r))
              for xdir in [prefsdir, oldprefsdir] for r in [1, 2, 3]]
names = ['{0}replicate-{1}'.format(old, r) for old in ['', 'old-'] for r in [1, 2, 3]]
plotfile = os.path.join(prefsdir, 'old_vs_new_prefscorr.pdf')
dms_tools2.plot.plotCorrMatrix(names, prefsfiles, plotfile, datatype='prefs')
showPDF(plotfile, width=600)
Our run of dms2_batch_prefs also creates a file with the average preferences across replicates, with the prefix specified by `--summaryprefix` and the suffix `_avgprefs.csv`.
Let's use dms2_logoplot to visualize these average preferences in the form of a logo plot, where the height of each letter is proportional to the preference for that amino acid.
In [21]:
avgprefs = os.path.join(prefsdir, 'summary_avgprefs.csv')
For this plot, we'd like to overlay the secondary structure and relative solvent accessibility (RSA) for each residue.
We can get this information by running dssp on a PDB file.
We have already done this using a re-numbered version of PDB 1RVX (a structure of an H1 HA), and the results are in ./data/1RVX_trimer_sequentialnumbering.dssp.
So we first use the `dms_tools2.dssp.processDSSP` function of the dms_tools2 Python API to extract the RSA and secondary structure class from this file and write them to CSV files.
In [22]:
dsspfile = './data/1RVX_trimer_sequentialnumbering.dssp'
dssp_df = dms_tools2.dssp.processDSSP(dsspfile, chain='A')
dssp_df['SS'] = dssp_df['SS_class']
ssfile = os.path.join(prefsdir, 'SS.csv')
dssp_df[['site', 'SS']].to_csv(ssfile, index=False)
rsafile = os.path.join(prefsdir, 'RSA.csv')
dssp_df[['site', 'RSA']].to_csv(rsafile, index=False)
We also want to make an overlay of the wildtype sequence. To do that, we need to create a file with the columns `site` and `wildtype`. We can create this from any of the codon counts files by simply translating the codons to amino acids with the `dms_tools2.utils.codonToAACounts` function of the dms_tools2 Python API, and then excluding any sites where the wildtype identity is a stop codon, since our analyses above used `--excludestop yes` and so excluded stop codons:
In [23]:
wtoverlayfile = os.path.join(prefsdir, 'wildtypeoverlayfile.csv')
aacounts = dms_tools2.utils.codonToAACounts(
pandas.read_csv(os.path.join(countsdir, 'wtDNA_codoncounts.csv')))
aacounts.query('wildtype != "*"')[['site', 'wildtype']].to_csv(wtoverlayfile, index=False)
Now we use dms2_logoplot to make the logo plot of the preferences with the wildtype sequence, secondary structure, and RSA overlaid. Note how the output file is in the directory specified by `--outdir`, with the name given by `--name` and the suffix `_prefs.pdf`.
In [24]:
logoname = 'avgprefs'
log = !dms2_logoplot \
--prefs {avgprefs} \
--name {logoname} \
--outdir {prefsdir} \
--overlay1 {wtoverlayfile} wildtype wildtype \
--overlay2 {ssfile} SS "secondary structure" \
--overlay3 {rsafile} RSA "relative solvent accessibility" \
--nperline 81 \
--use_existing {use_existing}
logoplot = os.path.join(prefsdir, '{0}_prefs.pdf'.format(logoname))
showPDF(logoplot)
Finally, we will take the averaged amino-acid preferences and fit them to natural sequence evolution. The goal here is to test whether the experimentally measured preferences are informative about natural sequence evolution, and if so, to obtain a stringency parameter that can be used to re-scale them to bring them into line with the stringency of natural selection.
All of this is done in a phylogenetic framework using an experimentally informed codon model (ExpCM). For a description of the general concept of an ExpCM, see Bloom (2014). For a more practical discussion of how ExpCM can be applied to real data, see Hilton and Bloom (2017), which describes the phydms software that we will use for the analysis.
Note that the original Doud and Bloom (2016) study performed a similar analysis to what we are doing here -- but they used an older version of phydms and a different sequence alignment, and so got somewhat different numerical results than what we'll get here, although the trends were the same.
First, we use phydms_comprehensive to fit the averaged amino-acid preferences to the HA sequence alignment in ./data/HA_alignment.fasta. Note that running phydms_comprehensive requires that we have RAxML installed at the path specified by `--raxml`. We will look at the results in the modelcomparison.md file created by phydms_comprehensive. Since this program does not have a `--use_existing` option, we code something equivalent in the cell below to avoid re-running if the output already exists.
In [25]:
print("Using the following version of phydms:")
!phydms -v
alignment = './data/HA_alignment.fasta'
phydms_dir = os.path.join(resultsdir, 'phydms_analysis/') # put phydms results here
modelcomparison = os.path.join(phydms_dir, 'modelcomparison.md')
if use_existing == 'yes' and os.path.isfile(modelcomparison):
    print("Results of phydms analysis already exist.")
else:
    print("Running phydms_comprehensive...")
    log = !phydms_comprehensive {phydms_dir} {alignment} {avgprefs} --raxml raxml
    print("Analysis complete.")
Now we look at the results of the analysis in the modelcomparison.md file:
In [26]:
display(Markdown(modelcomparison))
The table shown above compares four different models: an ExpCM informed by the deep mutational scanning preferences, a control ExpCM in which those preferences are averaged across sites, and the standard YNGKP M0 and M5 codon models.
If the deep mutational scanning is informative about evolution, then we expect it to fit the natural sequence evolution much better than any of the other models. That is in fact what we see, as it has a much lower AIC.
Also of relevance is the stringency parameter ($\beta$) that re-scales the preferences to optimally fit natural evolution in an ExpCM. When $\beta > 1$, it means that natural evolution prefers the same amino acids as the experiments but with greater stringency (see Hilton and Bloom (2017), particularly Figure 3 of that paper, for more details).
Above we can see that our ExpCM fit a value of $\beta = 2.05$.
We can use dms2_logoplot to re-plot the replicate-average preferences after re-scaling by this stringency parameter, via the `--stringency` option to that program. Such a plot is shown below; it provides a better estimate of how the mutational effects measured in the deep mutational scanning actually correspond to natural selection.
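The re-scaling itself is simple: following Bloom (2014), each preference is raised to the power $\beta$ and the values are then renormalized at each site. A minimal sketch of that transformation:

```python
def rescale_prefs(prefs, beta):
    """Re-scale the preferences at one site by a stringency parameter beta."""
    powered = {aa: p ** beta for aa, p in prefs.items()}
    total = sum(powered.values())
    return {aa: p / total for aa, p in powered.items()}

# with beta > 1, the preferred amino acids become even more dominant
print(rescale_prefs({'A': 0.6, 'G': 0.3, 'W': 0.1}, beta=2.05))
```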
In [27]:
rescaledlogoname = 'rescaled-avgprefs'
log = !dms2_logoplot \
--prefs {avgprefs} \
--name {rescaledlogoname} \
--outdir {prefsdir} \
--overlay1 {wtoverlayfile} wildtype wildtype \
--overlay2 {ssfile} SS "secondary structure" \
--overlay3 {rsafile} RSA "relative solvent accessibility" \
--nperline 81 \
--stringency 2.05 \
--use_existing {use_existing}
rescaledlogoplot = os.path.join(prefsdir, '{0}_prefs.pdf'.format(rescaledlogoname))
showPDF(rescaledlogoplot)
In [28]: