In [1]:
# Read in the data
import pandas as pd
import custom_funcs as cf
import numpy as np

from bokeh.charts import Histogram, Scatter
from bokeh import plotting as bplt
from bokeh.plotting import ColumnDataSource
from ipywidgets import widgets, interact, Dropdown
from IPython.display import display
from sklearn.preprocessing import LabelBinarizer
from sklearn import cross_validation as cv
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.metrics import mean_squared_error as mse
from Bio import SeqIO
from Bio.Align import MultipleSeqAlignment

%load_ext autoreload
%autoreload 2
bplt.output_notebook()


BokehJS successfully loaded.

Machine Learning in Python using scikit-learn

Learning Outcomes

From the lectures, you should be able to:

  1. Describe the role of HIV protease and reverse transcriptase.
  2. Describe, at the biochemical level, the mechanism by which protease and reverse transcriptase inhibitors function.

By the end of this portion of the class, you should be equipped to:

  1. Transform protein sequence data into an input suitable for machine learning purposes.
  2. Describe the importance of splitting the data into training, testing, and validation sets.
  3. Use the scikit-learn API to split your data into training and testing sets.
  4. Use the scikit-learn API to train a machine learning model to learn how to map from genotype to phenotype.
  5. Describe the difference between a model parameter and a hyperparameter.
  6. Be ready to tackle the pset!

Question

If I gave you the protein sequence of an HIV protease, could you tell the degree to which the protease is resistant to a particular drug? What kind of data would you need?

  • From a clinical perspective: If a new patient came in, and you had to prescribe a drug combination for their infection, how would you tell which drugs to give, given the genomic data?

  • From an epidemiological perspective: How would you tell whether drug resistance is shaping the evolution of the virus? What kind of data would you need to answer this question?

Machine Learning Tasks

  1. Supervised
  2. Unsupervised

What are some examples of each?

Supervised Machine Learning

  1. Gold standard: The model directly learns the relationship between features and outcomes.
  2. Expensive: Requires structured data collected in a representative fashion.
  3. Worthwhile: As long as your measurements accurately capture what you're trying to infer, supervised ML will work well.
  4. Features are everything:
    1. Feature engineering: transforming raw data into numerical features the model can consume.
    2. Feature selection: figuring out which predictor variables are the most important.

If you can set up your machine learning problem as "learning the relationship between features and outcomes", then you are good to go for doing machine learning.

Feature 1   Feature 2   Feature 3   Outcome 1   Outcome 2
Obs 1       Obs 2       Obs 3       Val 1       Val 2
...         ...         ...         ...         ...
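
For example, here is a minimal sketch of this framing in pandas; the column names and values are made up purely for illustration:

import pandas as pd

# Toy data: each row is one observation.
df = pd.DataFrame({'feature_1': [1.0, 2.0, 3.0],
                   'feature_2': [0.5, 0.1, 0.9],
                   'outcome':   [10.0, 20.0, 30.0]})

X = df[['feature_1', 'feature_2']]  # the feature matrix
Y = df['outcome']                   # the outcome column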

Outcomes

Outcomes can be grouped under two types:

  • Classes: Discrete labels (e.g. dog, cat, human, pig). Problems involving classes are classification problems.
  • Numbers: A continuous measurement. Problems involving numbers are regression problems. (A short sketch contrasting the two follows.)
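
Here is a minimal sketch of the contrast in scikit-learn, with made-up data:

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X = [[0, 1], [1, 0], [1, 1], [0, 0]]
# Discrete labels -> a classification problem:
RandomForestClassifier().fit(X, ['dog', 'cat', 'dog', 'cat'])
# Continuous measurements -> a regression problem:
RandomForestRegressor().fit(X, [0.4, 1.2, 0.9, 0.1])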

Let's get started

We will use one drug and one protein as an example to figure out how to write the code that transforms the data from sequence to insight.

Load the Data

The data are derived from the Stanford HIV Database. The data are collected by measuring the fold drug resistance in a standardized drug resistance assay, also known as the PhenoSense Assay. As such, this is a high-quality, matched genotype-phenotype dataset.

I have written some custom functions in the custom_funcs.py module to aid with data loading. Run the cell below.
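
custom_funcs.py is not reproduced here, so the following is only a hypothetical sketch of what read_data might do, inferred from its call signature and the DataFrame shown below; the real implementation may differ:

import pandas as pd

def read_data(handle, n_data):
    # Hypothetical sketch: load the CSV, using the SeqID column as the index.
    data = pd.read_csv(handle, index_col='SeqID')
    # The first n_data columns are drug resistance measurements;
    # the remainder are the per-position sequence features.
    drug_cols = data.columns[:n_data].tolist()
    feat_cols = data.columns[n_data:].tolist()
    return data, drug_cols, feat_cols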


In [2]:
DATA_HANDLE = 'drug_data/hiv-protease-data.csv'  # specify the relative path to the protease drug resistance data
N_DATA = 8  # specify the number of columns in the CSV file that are drug resistance measurements.
CONSENSUS = 'sequences/hiv-protease-consensus.fasta'  # specify the relative path to the HIV protease consensus sequence

data, drug_cols, feat_cols = cf.read_data(DATA_HANDLE, N_DATA)

consensus_map = cf.read_consensus(CONSENSUS)

data = cf.clean_data(data, feat_cols, consensus_map)

Explore the Data


In [3]:
data.head() # alternatively, data.tail()


Out[3]:
FPV ATV IDV LPV NFV SQV TPV DRV P1 P2 ... P90 P91 P92 P93 P94 P95 P96 P97 P98 P99
SeqID
2996 2.5 NaN 16.3 NaN 38.6 16.1 NaN NaN P Q ... M T Q L G C T L N F
4387 0.7 NaN 0.8 NaN 0.8 1.1 NaN NaN P Q ... L T Q I G C T L N F
4432 1.5 NaN 1.0 NaN 2.2 1.1 NaN NaN P Q ... L T Q I G C T L N F
4482 3.9 NaN 20.2 NaN 21.6 9.2 NaN NaN P Q ... M T Q L G C T L N F
4486 9.5 20 8.2 11 72.0 46.0 NaN NaN P Q ... L T Q I G C T L N F

5 rows × 107 columns

Explore

Before we go any further, I think it is valuable to explore the dataset. Below is an interactive tool for doing so.

As you look through it, discuss the following question with each other:

  1. How would you describe the distribution of the data? What view of the data are you describing it from? (log10 or non-log10?)
  2. Are you surprised by negative values when the log10 transformation is performed?

In [4]:
def histogram(drug, bins=50, log10='False'):
    # log10 arrives as the string 'True' or 'False', because the widget options below are strings.
    src = data[[drug]].dropna()
    xlabel = 'Drug Resistance Values'
    if log10 == "True":
        src = src.apply(lambda x: np.log10(x))
        xlabel = 'log10 (Drug Resistance Value)'
    p = Histogram(data=src, 
                  title='Distribution of {0} Drug Resistance Values'.format(drug),
                  xlabel=xlabel, 
                  ylabel='Count',
                  bins=bins, 
                  width=900, 
                  height=300)
    bplt.show(p)

interact(histogram, drug=list(drug_cols), bins=(10, 80, 10), log10=["False", "True"])  # (10, 80, 10) gives a slider from 10 to 80 in steps of 10


Let's also take a look at the correlations between the drug resistance values. Think about the following question:

  1. Are there correlations between drug resistance values?

In [5]:
def scatter(drug1, drug2, log10="False"):
    p = bplt.figure(x_axis_label=drug1,
                    y_axis_label=drug2,
                    plot_height=300,
                    plot_width=300)
    src = data[[drug1, drug2]].dropna()
    if log10 == "True":
        src = src.apply(np.log10)
    src = ColumnDataSource(src)
    p.circle(source=src, x=drug1, y=drug2)
    bplt.show(p)    

interact(scatter, 
         drug1=[i for i in drug_cols][::-1], 
         drug2=[i for i in drug_cols],
         log10=['True', 'False'])


Let's work with only one drug: FPV. We will reduce the data so that we keep only the FPV drug resistance values alongside the sequence feature matrix.


In [6]:
cols_of_interest = []
cols_of_interest.append('FPV')
cols_of_interest.extend(feat_cols)
fpv = data[cols_of_interest]
fpv.head(10) # change to: fpv.dropna().head(10)


Out[6]:
FPV P1 P2 P3 P4 P5 P6 P7 P8 P9 ... P90 P91 P92 P93 P94 P95 P96 P97 P98 P99
SeqID
2996 2.5 P Q I T L W Q R P ... M T Q L G C T L N F
4387 0.7 P Q I T L W Q R P ... L T Q I G C T L N F
4432 1.5 P Q I T L W Q R P ... L T Q I G C T L N F
4482 3.9 P Q I T L W Q R P ... M T Q L G C T L N F
4486 9.5 P Q I T L W Q R P ... L T Q I G C T L N F
4538 NaN P Q I T L W Q R P ... M T Q I G C T L N F
4664 3.1 P Q I T L W Q R P ... M T Q I G C T L N F
4690 4.9 P Q I T L W Q R P ... M T Q I G C T L N F
4698 1.2 P Q I T L W Q R P ... L T Q L G C T L N F
5221 NaN P Q I T L W Q R P ... L T Q I G C T L N F

10 rows × 100 columns

Notice how there are NaN values present. These will cause issues for us later on.

  • This most probably stems from experimental data not being present for that sequence.
  • Therefore, it should not be used for model training.
  • Filling/substituting it with 0 also does not help, as it was not actually measured to have 0 drug resistance.
  • We will have to drop rows that have NaN values. (A quick toy demonstration follows.)
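
A quick toy demonstration of what dropping NaN rows does; the values here are made up:

import numpy as np
import pandas as pd

toy = pd.DataFrame({'FPV': [2.5, np.nan, 1.5], 'P1': ['P', 'Q', 'I']})
toy.dropna()  # keeps only the rows with no NaN in any column (here, rows 0 and 2)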

Exercise

Let's try to write a function together that will give us a cleaned data set, which you can use later on.


In [7]:
def one_drug_data(drug_name):
    cols_of_interest = []
    
    # append and extend are two methods for lists.
    # append adds a single item to the end of the list.
    # extend iterates over the sequence passed in, appending each of its items to the end of the list.
    cols_of_interest.append(drug_name)
    cols_of_interest.extend(feat_cols) 
    
    # What is the function call to remove rows with NaN values in it?
    drug_data = data[cols_of_interest].dropna()
    
    return drug_data

fpv = one_drug_data('FPV')
fpv.head()


Out[7]:
FPV P1 P2 P3 P4 P5 P6 P7 P8 P9 ... P90 P91 P92 P93 P94 P95 P96 P97 P98 P99
SeqID
2996 2.5 P Q I T L W Q R P ... M T Q L G C T L N F
4387 0.7 P Q I T L W Q R P ... L T Q I G C T L N F
4432 1.5 P Q I T L W Q R P ... L T Q I G C T L N F
4482 3.9 P Q I T L W Q R P ... M T Q L G C T L N F
4486 9.5 P Q I T L W Q R P ... L T Q I G C T L N F

5 rows × 100 columns

As the scatterplots and histograms show, it may be wise to log10 transform the drug resistance value, so that we can work with a less skewed distribution.

Exercise

Fill in the function below.


In [8]:
def log10_drug_data(drug_name):
    # Reuse one_drug_data() from above to get the cleaned single-drug data.
    drug_data = one_drug_data(drug_name)
    drug_data[drug_name] = drug_data[drug_name].apply(np.log10)
    
    return drug_data

fpv = log10_drug_data("FPV")
fpv.head()


Out[8]:
FPV P1 P2 P3 P4 P5 P6 P7 P8 P9 ... P90 P91 P92 P93 P94 P95 P96 P97 P98 P99
SeqID
2996 0.397940 P Q I T L W Q R P ... M T Q L G C T L N F
4387 -0.154902 P Q I T L W Q R P ... L T Q I G C T L N F
4432 0.176091 P Q I T L W Q R P ... L T Q I G C T L N F
4482 0.591065 P Q I T L W Q R P ... M T Q L G C T L N F
4486 0.977724 P Q I T L W Q R P ... L T Q I G C T L N F

5 rows × 100 columns

Exercise

Now that we have cleaned the data, let's split it into the feature and outcome columns.


In [9]:
def split_xy(data, X_colnames, Y_colnames):
    """
    The "X" matrix will be the feature columns.
    The "Y" matrix (or column in this case) will be the outcome column(s).
    """
    
    return data[X_colnames], data[Y_colnames]

X, Y = split_xy(fpv, feat_cols, 'FPV')

Convince yourself that they've been split correctly, by running the following two cells.


In [10]:
X.head()


Out[10]:
P1 P2 P3 P4 P5 P6 P7 P8 P9 P10 ... P90 P91 P92 P93 P94 P95 P96 P97 P98 P99
SeqID
2996 P Q I T L W Q R P I ... M T Q L G C T L N F
4387 P Q I T L W Q R P L ... L T Q I G C T L N F
4432 P Q I T L W Q R P L ... L T Q I G C T L N F
4482 P Q I T L W Q R P V ... M T Q L G C T L N F
4486 P Q I T L W Q R P F ... L T Q I G C T L N F

5 rows × 99 columns


In [11]:
Y.head()


Out[11]:
SeqID
2996    0.397940
4387   -0.154902
4432    0.176091
4482    0.591065
4486    0.977724
Name: FPV, dtype: float64

Problem

scikit-learn ML models can't take in strings. How can we transform this into numbers?

Solution

Binarize each column into 1s and 0s representing whether an amino acid is present in that position. Run the cell below.


In [12]:
lb = LabelBinarizer()
# Fit the binarizer on the 20 amino acid letters.
lb.fit(list('CHIMSVAGLPTRFYWDNEQK'))

X_binarized = pd.DataFrame()

# For each position, one-hot encode the amino acid present, creating one
# new 0/1 column per (position, amino acid) pair, e.g. 'P1_A'.
for col in X.columns:
    binarized_cols = lb.transform(X[col])
    
    for i, c in enumerate(lb.classes_):
        X_binarized[str(col) + '_' + str(c)] = binarized_cols[:, i]

Convince yourself, now, that X_binarized is indeed binarized.


In [13]:
X_binarized.head().iloc[:,0:20]


Out[13]:
P1_A P1_C P1_D P1_E P1_F P1_G P1_H P1_I P1_K P1_L P1_M P1_N P1_P P1_Q P1_R P1_S P1_T P1_V P1_W P1_Y
0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

Train a model!

Let's train the Random Forest Regressor model.

Question:

Briefly, at a high level, how does the Random Forest algorithm work?

Ensemble Learners

An ensemble learner uses an ensemble of (i.e. many) "weak" learners to learn the association between the features and the outcome.

Random Forest is based on an ensemble of decision trees. What might be a "strong" decision tree, and what might be a "weak" decision tree?
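
Here is a minimal sketch of the bagging idea behind Random Forests; it omits the per-split random feature subsampling that the real algorithm adds, and assumes pandas inputs like the ones we build below:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_forest_predict(X_train, Y_train, X_new, n_trees=10):
    preds = []
    for _ in range(n_trees):
        # Each "weak" tree (kept shallow, so deliberately limited) sees a
        # bootstrap resample of the training data...
        idx = np.random.choice(len(X_train), size=len(X_train), replace=True)
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X_train.iloc[idx], Y_train.iloc[idx])
        preds.append(tree.predict(X_new))
    # ...and the ensemble prediction is the average over all the trees.
    return np.mean(preds, axis=0)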

Train/Test Split

Before we train the model, we need to first split the data into two chunks, the training set and the testing set. Why would this be important? Why couldn't we simply train and test on the whole dataset?
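
One way to see the issue (a sketch using X_binarized and Y from above): a model evaluated on the very data it was trained on reports an optimistically low error, which says little about how it handles new sequences.

# In-sample evaluation: the error looks deceptively small.
rfr = RandomForestRegressor()
rfr.fit(X_binarized, Y)
mse(Y, rfr.predict(X_binarized))  # compare against the held-out error computed below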


In [14]:
# Split the data into two sets, the training and testing set.
X_train, X_test, Y_train, Y_test = cv.train_test_split(X_binarized, Y, test_size=0.3)

In [15]:
# NOTE: This is the code you may want to use for your pset! 
# The code associated with cells that are interactive may not be best suited to your use case.
# Initialize the model
rfr = RandomForestRegressor()
# Call on the "model.fit()" function.
rfr.fit(X_train, Y_train)
# Make a prediction on the test set.
Y_preds = rfr.predict(X_test)
# Evaluate the model by calculating the mean-squared error.
mse(Y_test, Y_preds)


Out[15]:
0.11325358659386076

Question

What is the mean-squared error metric actually measuring?
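
It is the average of the squared differences between the predicted and true values; computing it by hand reproduces the number above:

# Mean-squared error, computed by hand:
np.mean((np.array(Y_test) - np.array(Y_preds)) ** 2)  # matches mse(Y_test, Y_preds)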

Model Evaluation: Cross-Validation

During the model training and evaluation phase, we want to know which models will generalize best. Therefore, it is important to use cross-validation on the data.

The cross-validation procedure we are going to use here is called fractional shuffled cross-validation. In each round, we:

  1. Shuffle the rows of the data.
  2. Split the data randomly into a training set and a test set, with the training set being some fraction of the entire data set.
  3. Train on the training set, and measure accuracy on the test set.
  4. Repeat until the procedure has been done X times.

There's nothing special about the name, I made it up. :)
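
In code, the procedure amounts to something like the following sketch; cv.ShuffleSplit, used in the next cell, packages the same idea:

cv_scores = []
for _ in range(10):  # 10 rounds of shuffle/split/train/score
    X_tr, X_te, Y_tr, Y_te = cv.train_test_split(X_binarized, Y, test_size=0.3)
    model = RandomForestRegressor()
    model.fit(X_tr, Y_tr)
    cv_scores.append(mse(Y_te, model.predict(X_te)))

np.mean(cv_scores), np.std(cv_scores)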

Exercise

Play around with the interactive widgets below, and let us know which model you think should be considered the best. Note: some models take a long time to run, and the more iterations you choose, the longer it will take.


In [16]:
def assess_model(model, n_iters):
    # The widget passes both arguments as strings: eval() looks up the
    # model class by its name, and int() converts the iteration count.
    model = eval(model)
    n_iters = int(n_iters)
    cv_iterator = cv.ShuffleSplit(len(X_binarized), n_iter=n_iters, test_size=0.3)
    cv_scores = cv.cross_val_score(model(), 
                                   X_binarized, 
                                   Y, 
                                   cv=cv_iterator, 
                                   scoring='mean_squared_error')
    # Note: scikit-learn reports MSE negated (its convention is that greater
    # scores are better), so we negate the array to recover positive values.
    return -cv_scores

interact(assess_model, 
         model=['RandomForestRegressor', 'AdaBoostRegressor', 'ExtraTreesRegressor', 'GradientBoostingRegressor'],
         n_iters=Dropdown(description='n_iters', options=['5', '6', '7', '8', '9', '10', '11', '12']))


array([ 0.12491055,  0.12728943,  0.10961939,  0.12331918,  0.10460269,
        0.10889568,  0.11338647,  0.09690122,  0.11748701,  0.09267683])

Exercise

Modify the assess_model function such that it returns the mean and the standard deviation of the CV scores.

The following block of code illustrates the pattern of returning multiple values from a function:

def function(param1, param2):
    result1 = ...  # stuff done with param1
    result2 = ...  # stuff done with param2
    return result1, result2

Also, the numpy library has been imported under the np namespace, so you have access to the following functions:

  • np.mean(array_of_values) will give you the mean.
  • np.std(array_of_values) will give you the standard deviation.

When you're done, don't forget to re-run the cell, so that the function can be updated, before playing with the drop-down menu values.

Question

Which model performs the best?

Once we know which model performs the best, we can go on to tweak it further.

Note: We haven't tweaked the models' parameters yet, so we have not optimized their performance. Take a look at each model's API documentation page on the scikit-learn website to see which parameters can be tweaked.

Model Optimization

After choosing the best model, which so far uses default parameters, you can usually gain some scoring performance by tweaking its hyperparameters.


In [17]:
def optimize_random_forest_model(n_iters, n_estimators):
    n_iters = int(n_iters)  # cast to int because the dropdown options below are strings. Likewise on the next line.
    n_estimators = int(n_estimators)
    
    cv_iterator = cv.ShuffleSplit(len(X_binarized), n_iter=n_iters, test_size=0.3)
    cv_scores = cv.cross_val_score(RandomForestRegressor(n_estimators), 
                                   X_binarized, 
                                   Y, 
                                   cv=cv_iterator, 
                                   scoring='mean_squared_error')
    
    return np.mean(-cv_scores), np.std(-cv_scores)


    
interact(optimize_random_forest_model, 
         n_estimators=Dropdown(description='Number of Random Forest Estimators', options=['5', '10', '50', '100']),
         n_iters=Dropdown(description='Number of Train/Test Iterations', options=['5', '6', '7', '8', '9', '10', '11', '12']))


(0.10001749942169384, 0.0049772712462784864)

Feature Importance

We can query the model to ask which amino acids are most important. To do so, access the model's feature_importances_ attribute.


In [18]:
rfr = RandomForestRegressor(n_estimators=100)
rfr.fit(X_binarized, Y)
sorted([i for i in zip(X_binarized.columns, rfr.feature_importances_)], key=lambda x:x[1], reverse=True)


Out[18]:
[('P10_L', 0.41153983081947954),
 ('P33_F', 0.087194130618560378),
 ('P84_I', 0.069016903298581345),
 ('P33_L', 0.052252537189027934),
 ('P46_M', 0.047247483807592687),
 ('P54_I', 0.032236317910918112),
 ('P88_S', 0.023733087519056138),
 ('P32_V', 0.013220165452355387),
 ('P32_I', 0.01254896301870514),
 ('P90_M', 0.012053220672508637),
 ('P50_V', 0.0096164282377393089),
 ('P47_I', 0.0094227256284569686),
 ('P82_V', 0.0088161342275535643),
 ('P46_I', 0.0083532189812597357),
 ('P54_M', 0.0079382733212244577),
 ('P11_V', 0.0075440750789332609),
 ('P84_V', 0.0061204402360191257),
 ('P89_L', 0.0052304622542769085),
 ('P54_V', 0.0048701271355074518),
 ('P10_F', 0.0046550704219256063),
 ('P84_A', 0.0043041175663194008),
 ('P36_M', 0.0034951996867370904),
 ('P20_K', 0.003303991176634124),
 ('P36_I', 0.0030151127685327187),
 ('P50_L', 0.0028043735858380831),
 ('P71_V', 0.0027503753338096858),
 ('P37_N', 0.0027169852865242439),
 ('P77_I', 0.0027149401688160413),
 ('P76_L', 0.0026827612722290419),
 ('P63_P', 0.0026015598576784012),
 ('P89_V', 0.0023707692340096931),
 ('P50_I', 0.0022638047909994134),
 ('P72_I', 0.0022593171820461302),
 ('P13_I', 0.0022097428730436909),
 ('P77_V', 0.0021617853624173286),
 ('P20_T', 0.0021013054014395027),
 ('P82_A', 0.0020957591973437546),
 ('P88_N', 0.0020661030386924452),
 ('P57_R', 0.0020589630885390537),
 ('P43_K', 0.0020045388933561279),
 ('P82_T', 0.0019913017702376484),
 ('P63_L', 0.0019750938781072872),
 ('P64_I', 0.0019331179808237328),
 ('P35_E', 0.0019031623155558364),
 ('P82_F', 0.0018949255286604008),
 ('P64_V', 0.0018483789179354288),
 ('P93_I', 0.0018480964306397704),
 ('P10_I', 0.0018001446173499355),
 ('P15_I', 0.0017682904323598583),
 ('P47_V', 0.0017563697161151459),
 ('P62_I', 0.0017467043608429703),
 ('P35_D', 0.0017256694092155395),
 ('P90_L', 0.0016763186829238),
 ('P41_R', 0.0016057669892383449),
 ('P58_Q', 0.0015665741018824394),
 ('P71_A', 0.0015197330976026548),
 ('P76_V', 0.0015128274646054485),
 ('P15_V', 0.001501196737266989),
 ('P93_L', 0.0014685131744599067),
 ('P67_C', 0.0014625447828393271),
 ('P58_E', 0.0014426090172157183),
 ('P13_V', 0.0014297402562990598),
 ('P19_L', 0.0013999272422072313),
 ('P74_T', 0.0013577292717335801),
 ('P62_V', 0.00134806339495069),
 ('P72_V', 0.0013409214437789358),
 ('P20_R', 0.0013288861881059848),
 ('P46_L', 0.0013048156566990146),
 ('P73_G', 0.0012933172796680565),
 ('P71_T', 0.0012568683279341624),
 ('P12_T', 0.0011931687155049108),
 ('P53_F', 0.0011427855553791347),
 ('P41_K', 0.0011072268242445927),
 ('P14_K', 0.0011047208413678608),
 ('P61_Q', 0.001089104141062899),
 ('P37_D', 0.0010783934841499894),
 ('P48_G', 0.0010742411380537408),
 ('P70_K', 0.0010571953579160628),
 ('P55_K', 0.0010483863918864247),
 ('P71_I', 0.0010465259584454031),
 ('P57_K', 0.0010240215309776232),
 ('P69_K', 0.00098910732006297127),
 ('P14_R', 0.00098896425578508512),
 ('P33_I', 0.00096197743109292387),
 ('P83_D', 0.00094328641219596182),
 ('P24_L', 0.0009360091788359774),
 ('P11_I', 0.00093576235833870401),
 ('P85_I', 0.00093527822192866414),
 ('P69_H', 0.00093160401991238418),
 ('P37_S', 0.00086644051679528376),
 ('P73_S', 0.00085495264306822516),
 ('P83_N', 0.00085198704475787082),
 ('P63_A', 0.00083210233786057441),
 ('P74_S', 0.00082888493862627653),
 ('P30_D', 0.00082776050793102773),
 ('P60_D', 0.00082406594868899825),
 ('P19_I', 0.00080575405603883069),
 ('P20_I', 0.0007892885683315863),
 ('P12_A', 0.00075397748826046631),
 ('P54_L', 0.00073496603386735151),
 ('P64_M', 0.0007310022564313291),
 ('P16_G', 0.00072836805076503005),
 ('P10_V', 0.00071714130867122879),
 ('P30_N', 0.00070589915618455971),
 ('P95_C', 0.000680054480139604),
 ('P57_G', 0.00065446974765938598),
 ('P34_E', 0.00065206268507813801),
 ('P60_E', 0.00063317544862284724),
 ('P92_Q', 0.00062337210287413171),
 ('P61_E', 0.00061139608498838173),
 ('P72_T', 0.00061132431670185897),
 ('P45_K', 0.00061066540510607611),
 ('P66_I', 0.00059493975249950699),
 ('P43_T', 0.00059332478256863207),
 ('P63_S', 0.00058422276088947146),
 ('P73_T', 0.00058210014619499448),
 ('P48_E', 0.00056749557451289158),
 ('P85_V', 0.00054899100487860477),
 ('P36_L', 0.00053519234518127016),
 ('P72_L', 0.00052988695020612323),
 ('P47_A', 0.00052141425635643293),
 ('P37_E', 0.00050882068997392315),
 ('P82_I', 0.00050638306827354243),
 ('P53_L', 0.00049591004086342171),
 ('P12_S', 0.00049444444093046404),
 ('P63_Q', 0.00048113418836110089),
 ('P55_R', 0.00047635995353144742),
 ('P63_T', 0.00047330271205260865),
 ('P37_C', 0.00045976970521385293),
 ('P79_P', 0.00045494106737685078),
 ('P89_M', 0.00044478932567729791),
 ('P39_Q', 0.00043480723699885321),
 ('P39_P', 0.00042164304884982574),
 ('P82_L', 0.00041906577116950736),
 ('P48_V', 0.00040623974619705236),
 ('P88_D', 0.00039546573005195667),
 ('P69_Q', 0.00038913471995895413),
 ('P36_V', 0.00038769373124690503),
 ('P34_D', 0.00038660152200870144),
 ('P16_A', 0.0003817929709447307),
 ('P70_R', 0.00037506167571612194),
 ('P84_C', 0.00036717821458968463),
 ('P24_I', 0.00036601324528808525),
 ('P65_E', 0.00033825291837058014),
 ('P19_T', 0.00032995689711722823),
 ('P33_V', 0.00032664322843920695),
 ('P19_V', 0.0003211871064897612),
 ('P63_C', 0.00032002440977730071),
 ('P74_P', 0.00030705336989327389),
 ('P18_Q', 0.00030562316857759555),
 ('P91_T', 0.00030128291990705206),
 ('P48_M', 0.00026447895916438183),
 ('P34_Q', 0.00025408944468042707),
 ('P72_E', 0.0002472157432683766),
 ('P51_A', 0.00023126849765807769),
 ('P67_E', 0.00022194230158450987),
 ('P66_V', 0.00021900419172412813),
 ('P45_R', 0.00021868037043497573),
 ('P4_T', 0.00021832543012011599),
 ('P63_H', 0.00021730617541711468),
 ('P66_F', 0.0002130624771220087),
 ('P69_Y', 0.00020138655995094814),
 ('P22_A', 0.00019877151268581195),
 ('P64_L', 0.00019743113788426279),
 ('P37_T', 0.00019622467547607543),
 ('P16_E', 0.0001957639811682583),
 ('P39_S', 0.00019506206018162943),
 ('P79_A', 0.00019302457598782571),
 ('P18_H', 0.00019165511832512752),
 ('P51_G', 0.0001897618881742013),
 ('P89_I', 0.00018813828236484919),
 ('P43_R', 0.00018234319758082739),
 ('P93_M', 0.00018195489511511589),
 ('P35_Q', 0.00016969748996990456),
 ('P63_V', 0.00016951765921921475),
 ('P12_P', 0.00016021913678386496),
 ('P92_K', 0.00015879249143131445),
 ('P38_L', 0.00015822824619565103),
 ('P68_G', 0.00014955916397183411),
 ('P74_A', 0.00014549862472385564),
 ('P73_A', 0.00014537222305634887),
 ('P88_G', 0.00014533848145252638),
 ('P87_R', 0.00014149140943430421),
 ('P70_E', 0.00014108330165997686),
 ('P46_V', 0.00013524118325528641),
 ('P35_N', 0.0001348506496495107),
 ('P36_A', 0.00012855987513191541),
 ('P72_M', 0.00012805244982263701),
 ('P92_R', 0.00012310620888986791),
 ('P15_L', 0.00012095375260881275),
 ('P23_I', 0.0001160748461159348),
 ('P61_N', 0.00011191859240583039),
 ('P19_Q', 0.00011125594046780731),
 ('P10_R', 0.00010785451615816458),
 ('P22_V', 0.00010775596576793336),
 ('P65_D', 0.00010610183768890714),
 ('P12_E', 0.00010403970277257653),
 ('P67_F', 0.00010207724370660124),
 ('P75_V', 0.0001020340891817852),
 ('P45_V', 0.00010033649738526352),
 ('P17_G', 9.7417324373543589e-05),
 ('P77_T', 9.2920584110891706e-05),
 ('P3_I', 9.2660668791045002e-05),
 ('P35_G', 8.9708045856869817e-05),
 ('P7_Q', 8.8515134711724498e-05),
 ('P54_T', 8.7275879261750587e-05),
 ('P55_N', 8.5461347680986175e-05),
 ('P37_V', 8.4021472235515355e-05),
 ('P23_L', 8.3575447442899747e-05),
 ('P85_M', 8.3368916348018494e-05),
 ('P60_N', 8.3200259021561933e-05),
 ('P37_H', 8.1094342740502286e-05),
 ('P67_Y', 7.9176861642872153e-05),
 ('P21_E', 7.6628210388349522e-05),
 ('P73_C', 7.5687704267543584e-05),
 ('P69_R', 7.4163086046966835e-05),
 ('P95_F', 7.380712668726351e-05),
 ('P7_R', 7.3511171180983811e-05),
 ('P20_M', 7.2559347989317993e-05),
 ('P13_L', 7.054240446669091e-05),
 ('P43_I', 6.676701958064149e-05),
 ('P70_T', 6.5383128624712261e-05),
 ('P91_S', 6.5283768518533823e-05),
 ('P19_P', 6.4389978408781141e-05),
 ('P20_V', 6.2626355742196287e-05),
 ('P11_L', 6.2429295597099717e-05),
 ('P54_A', 6.2350854863817224e-05),
 ('P4_P', 5.9975733230103323e-05),
 ('P17_E', 5.8716708898585678e-05),
 ('P68_E', 5.8035939912588044e-05),
 ('P12_N', 5.5991924322953159e-05),
 ('P34_K', 5.3774179353196056e-05),
 ('P70_Q', 5.3679178698689031e-05),
 ('P72_K', 5.3677520606556665e-05),
 ('P43_S', 5.3309909075503481e-05),
 ('P29_D', 5.3253083020696207e-05),
 ('P8_R', 5.3061617610428991e-05),
 ('P71_L', 5.1338712261735454e-05),
 ('P95_L', 5.0909529218759862e-05),
 ('P54_S', 5.066919996031553e-05),
 ('P53_Y', 4.9954825248492423e-05),
 ('P79_S', 4.9919264554769768e-05),
 ('P72_R', 4.7579784242564321e-05),
 ('P98_N', 4.4389115923735317e-05),
 ('P24_M', 4.3332865952615581e-05),
 ('P80_T', 4.1927666721229549e-05),
 ('P38_I', 4.0358690170757258e-05),
 ('P45_I', 4.0255527832853821e-05),
 ('P82_S', 3.8367575041485512e-05),
 ('P12_K', 3.8012641535554661e-05),
 ('P24_F', 3.6480242539872378e-05),
 ('P95_G', 3.5630945465820925e-05),
 ('P21_D', 3.3894281035850283e-05),
 ('P74_E', 3.3475480706096768e-05),
 ('P78_G', 3.338617656223879e-05),
 ('P45_Q', 3.2720269782772136e-05),
 ('P79_D', 3.1560667023491287e-05),
 ('P66_L', 3.144842011881572e-05),
 ('P33_M', 3.1121450128436338e-05),
 ('P12_Q', 3.0951193830747272e-05),
 ('P13_A', 3.0444164707539842e-05),
 ('P18_I', 2.9977887158425357e-05),
 ('P67_D', 2.9706250486571396e-05),
 ('P80_P', 2.7245496413826607e-05),
 ('P97_I', 2.7063698182075879e-05),
 ('P73_D', 2.6321849036775375e-05),
 ('P37_Y', 2.5867975837397987e-05),
 ('P13_M', 2.5347257082615825e-05),
 ('P74_K', 2.4667685545061053e-05),
 ('P82_C', 2.3129418573217707e-05),
 ('P67_S', 2.2334057704810714e-05),
 ('P3_V', 2.0912539832182498e-05),
 ('P61_G', 2.0002797137763464e-05),
 ('P12_I', 1.9877857628171848e-05),
 ('P82_M', 1.9587877076772891e-05),
 ('P26_T', 1.9546905160078254e-05),
 ('P63_E', 1.9428632959994613e-05),
 ('P75_I', 1.9171985686204671e-05),
 ('P35_H', 1.8670482444558067e-05),
 ('P77_L', 1.8157998070107192e-05),
 ('P96_T', 1.8087799355920802e-05),
 ('P61_H', 1.7783098576533491e-05),
 ('P43_E', 1.7267655833999767e-05),
 ('P8_Q', 1.7013860659027087e-05),
 ('P37_A', 1.6996086010139557e-05),
 ('P4_S', 1.6757543919542916e-05),
 ('P92_E', 1.6377598672061871e-05),
 ('P10_S', 1.537550501183225e-05),
 ('P14_T', 1.5011809361443381e-05),
 ('P12_V', 1.4418753262698856e-05),
 ('P17_D', 1.2902064597878437e-05),
 ('P85_L', 1.2801009177995132e-05),
 ('P36_T', 1.211382258749827e-05),
 ('P15_N', 1.168154510421818e-05),
 ('P95_V', 1.1487378003377242e-05),
 ('P79_N', 1.0905098625314175e-05),
 ('P10_H', 1.0802659859052213e-05),
 ('P63_R', 1.0664536972265074e-05),
 ('P7_E', 1.0237056219039458e-05),
 ('P61_D', 1.0047686314153811e-05),
 ('P5_L', 9.8316680594555552e-06),
 ('P61_K', 9.7571297428931613e-06),
 ('P77_Q', 9.6175159463520411e-06),
 ('P68_D', 9.5554463254571168e-06),
 ('P67_W', 9.3241462931540595e-06),
 ('P59_Y', 9.3240351097156914e-06),
 ('P10_C', 9.3166818686054921e-06),
 ('P83_P', 8.4287129409392234e-06),
 ('P97_L', 8.415017742532207e-06),
 ('P3_M', 8.3737964855825876e-06),
 ('P6_W', 8.3142367846151225e-06),
 ('P21_K', 8.1815493911086562e-06),
 ('P12_R', 6.4901537445849237e-06),
 ('P41_I', 6.2915509675993552e-06),
 ('P71_M', 5.9892479415750434e-06),
 ('P61_R', 5.9459671040568685e-06),
 ('P12_D', 5.8041103722029448e-06),
 ('P6_P', 5.6872899513252965e-06),
 ('P10_Y', 5.5227085994690547e-06),
 ('P41_P', 4.9389046149368204e-06),
 ('P67_M', 4.9020985554804066e-06),
 ('P53_I', 4.8307150388171605e-06),
 ('P68_W', 4.78408599086649e-06),
 ('P39_L', 4.5274305236290066e-06),
 ('P62_M', 4.4411192377590948e-06),
 ('P72_Y', 4.2217352705228133e-06),
 ('P45_N', 3.7524641211626382e-06),
 ('P38_V', 3.7351611144769503e-06),
 ('P34_N', 3.7007079464327985e-06),
 ('P37_Q', 3.6363823368800333e-06),
 ('P84_T', 3.6220082674267494e-06),
 ('P63_D', 3.5695432477024124e-06),
 ('P45_L', 3.5646919949566004e-06),
 ('P43_Q', 3.5179766145546928e-06),
 ('P48_S', 3.3218118876924127e-06),
 ('P25_D', 3.1243327022979139e-06),
 ('P69_P', 3.1019469884907892e-06),
 ('P74_D', 3.0738765917694013e-06),
 ('P4_D', 2.8972587574406299e-06),
 ('P89_F', 2.8815233944316609e-06),
 ('P18_R', 2.7908238183841288e-06),
 ('P96_S', 2.6854506913732153e-06),
 ('P28_A', 2.6180616519765272e-06),
 ('P59_H', 2.4706368091918713e-06),
 ('P86_G', 2.4684968070542687e-06),
 ('P90_V', 2.4609915923653399e-06),
 ('P18_E', 2.4079941629117072e-06),
 ('P48_Q', 2.3996037599807405e-06),
 ('P91_K', 2.2438367841168825e-06),
 ('P19_R', 2.2094082479664141e-06),
 ('P70_N', 2.1306759550266958e-06),
 ('P60_K', 2.0850898841192426e-06),
 ('P49_G', 2.0661474640021475e-06),
 ('P98_I', 2.0577366429087092e-06),
 ('P34_T', 2.0101719505330234e-06),
 ('P67_L', 1.993129221366554e-06),
 ('P83_S', 1.9812667671588261e-06),
 ('P2_Q', 1.8937262811912424e-06),
 ('P60_Q', 1.8856292780551462e-06),
 ('P34_V', 1.5382901567955588e-06),
 ('P41_N', 1.4626831608828496e-06),
 ('P3_L', 1.4432130469656759e-06),
 ('P15_R', 1.4093407263236209e-06),
 ('P3_F', 1.1949048301555779e-06),
 ('P18_L', 1.1554184095572678e-06),
 ('P39_T', 1.0446814709501417e-06),
 ('P19_M', 1.0038614423018907e-06),
 ('P47_E', 9.7417836409491534e-07),
 ('P73_I', 9.4713122942716337e-07),
 ('P94_G', 8.8135232801322675e-07),
 ('P91_A', 8.6748241335045417e-07),
 ('P67_G', 8.5379787627024132e-07),
 ('P66_T', 8.130148050255334e-07),
 ('P67_H', 7.8623338411704371e-07),
 ('P18_K', 7.8070155068368996e-07),
 ('P14_E', 7.2764552396293119e-07),
 ('P5_F', 7.0292671681227506e-07),
 ('P79_Q', 6.2380316886061577e-07),
 ('P69_N', 6.1126193533577822e-07),
 ('P40_G', 5.6599917314866861e-07),
 ('P89_T', 5.2870804110019888e-07),
 ('P2_V', 4.2660524557558556e-07),
 ('P22_T', 4.1558642665771583e-07),
 ('P20_L', 3.4979953188424672e-07),
 ('P12_M', 3.4175065234959642e-07),
 ('P94_S', 3.2238039240539452e-07),
 ('P31_T', 2.1794218317060274e-07),
 ('P10_M', 1.9246645061347506e-07),
 ('P65_K', 7.8563475477671872e-08),
 ('P14_N', 4.8355369154563939e-08),
 ('P91_N', 5.0815177104389372e-09),
 ('P1_A', 0.0),
 ...]

(The several hundred remaining entries, all with an importance of 0.0, are truncated here.)

Make Predictions

The final thing we need to do before moving on to the next class is to make predictions on the rest of the data.

Recall the motivation:

  • For the doctor: having seen the genomic data, he now needs to choose a drug to prescribe that targets the virus with minimal resistance.
  • For the epidemiologist: she would like to understand the global evolutionary trajectory of HIV, and whether drug resistance is shaping it or not.

Load the global sequence data as a pandas DataFrame

The global HIV protease protein sequence data (proteases_downsampled.fasta) is stored in the FASTA file format, in the sequences/ directory.

Note the structure of the id field (a parsing sketch follows the list):

  • HIV subtype
  • Country code
  • Year of isolation
  • Patient ID
  • Accession Number
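
For instance, a hypothetical helper to pull these fields apart; this assumes every id contains exactly five dot-separated fields, as in the ids shown below:

def parse_id(seq_id):
    # e.g. 'B.US.2004.CA48243.GQ207976'
    subtype, country, year, patient_id, accession = seq_id.split('.')
    return subtype, country, int(year), patient_id, accession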

In [20]:
# Check for any sequences that are not exactly 99 residues long
# (HIV protease is 99 amino acids).
proteases = [s for s in SeqIO.parse('sequences/proteases_downsampled.fasta', 'fasta')]
for s in proteases:
    if len(s.seq) != 99:
        print(s.id)

In [21]:
# Keep only the full-length (99 aa) sequences, and wrap them in an alignment object.
proteases = [s for s in SeqIO.parse('sequences/proteases_downsampled.fasta', 'fasta') if len(s.seq) == 99]
proteases = MultipleSeqAlignment(proteases)
proteases


Out[21]:
<<class 'Bio.Align.MultipleSeqAlignment'> instance (249 records of length 99, SingleLetterAlphabet()) at 1098090b8>

In [22]:
# Iterate over all of the columns in the multiple sequence alignment,
# binarize each one, and append the binarized columns to a DataFrame.
proteases_df = pd.DataFrame()

for col in range(proteases.get_alignment_length()):
    binarized_cols = lb.transform([k for k in proteases[:,col]])
    
    for i, c in enumerate(lb.classes_):
        proteases_df[str(col + 1) + '_' + c] = binarized_cols[:,i]
        
# Add in the index.
index = []
for s in proteases:
    index.append(s.id)
    
proteases_df.index = index
proteases_df.head()


Out[22]:
1_A 1_C 1_D 1_E 1_F 1_G 1_H 1_I 1_K 1_L ... 99_M 99_N 99_P 99_Q 99_R 99_S 99_T 99_V 99_W 99_Y
B.US.2004.CA48243.GQ207976 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
B.US.2004.SD_PIRC_4339.KJ723303 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
B.US.2004.CR0068P.FJ469695 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
B.US.2004.CA51328.GQ208561 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
B.US.2004.CA42953.GQ207851 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 1980 columns


In [23]:
proteases_df.iloc[0:5, 0:20]


Out[23]:
1_A 1_C 1_D 1_E 1_F 1_G 1_H 1_I 1_K 1_L 1_M 1_N 1_P 1_Q 1_R 1_S 1_T 1_V 1_W 1_Y
B.US.2004.CA48243.GQ207976 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
B.US.2004.SD_PIRC_4339.KJ723303 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
B.US.2004.CR0068P.FJ469695 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
B.US.2004.CA51328.GQ208561 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
B.US.2004.CA42953.GQ207851 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0

Model Prediction Outputs

Here, I will show you how to output the predictions of the machine learning model to disk as a CSV file.

Live Coding Together


In [24]:
# Fit all four regressor models

def fit_model(model, X, Y):
    # Instantiate the model
    mdl = model()
    # Fit the model
    mdl.fit(X, Y)
    
    # Return the fitted model and its predictions on the global protease data.
    return mdl, mdl.predict(proteases_df)

rf_mdl, rf_preds = fit_model(RandomForestRegressor, X_binarized, Y)
et_mdl, et_preds = fit_model(ExtraTreesRegressor, X_binarized, Y)
ab_mdl, ab_preds = fit_model(AdaBoostRegressor, X_binarized, Y)
gb_mdl, gb_preds = fit_model(GradientBoostingRegressor, X_binarized, Y)

In [25]:
# Zip together the predicted values with the index of the protease dataframe.
protease_preds = [i for i in zip(proteases_df.index, rf_mdl.predict(proteases_df))]
# Convert it into a dataframe, set the index column to be the IDs, and apply a 10**x transformation to undo the earlier log10 transform.
protease_preds = pd.DataFrame(protease_preds).set_index(0).apply(lambda x: np.power(10, x))
# View the first few lines.
protease_preds.head()


Out[25]:
1
0
B.US.2004.CA48243.GQ207976 1.281633
B.US.2004.SD_PIRC_4339.KJ723303 0.636339
B.US.2004.CR0068P.FJ469695 1.340600
B.US.2004.CA51328.GQ208561 0.684255
B.US.2004.CA42953.GQ207851 0.650719

In [26]:
# Write them to disk.
protease_preds.to_csv('csv/FPV_random-forest_preds.tsv', sep='\t')

In [27]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(rf_preds)


Out[27]:
(array([  43.,  127.,   41.,    5.,    6.,    7.,   12.,    3.,    1.,    4.]),
 array([-0.53364875, -0.2891269 , -0.04460505,  0.1999168 ,  0.44443865,
         0.6889605 ,  0.93348235,  1.1780042 ,  1.42252605,  1.6670479 ,
         1.91156975]),
 <a list of 10 Patch objects>)

In [ ]: