In [2]:
% matplotlib inline
In [53]:
from __future__ import print_function
import os.path
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from enrich2.variant import WILD_TYPE_VARIANT
import enrich2.plots as enrich_plot
pd.set_option("display.max_rows", 10) # rows shown when pretty-printing
Modify the results_path
variable in the next cell to match the output directory of your Enrich2-Example dataset.
In [4]:
results_path = "/path/to/Enrich2-Example/Results/"
Open the Experiment HDF5 file.
In [10]:
my_store = pd.HDFStore(os.path.join(results_path, "BRCA1_Example_exp.h5"))
The pd.HDFStore.keys()
method returns a list of all the tables in this HDF5 file.
In [73]:
my_store.keys()
Out[73]:
First we will work with the barcode-variant map for this analysis, stored in the "/main/barcodemap" table. The index is the barcode and it has a single column for the variant HGVS string.
In [74]:
bcm = my_store['/main/barcodemap']
bcm
Out[74]:
To find out how many unique barcodes are linked to each variant, we'll count the number of times each variant appears in the barcode-variant map using a Counter data structure. We'll then output the top ten variants by number of unique barcodes.
In [75]:
variant_bcs = Counter(bcm['value'])
variant_bcs.most_common(10)
Out[75]:
Next we'll turn the Counter into a data frame.
In [76]:
bc_counts = pd.DataFrame(variant_bcs.most_common(), columns=['variant', 'barcodes'])
bc_counts
Out[76]:
The data frame has the information we want, but it will be easier to use later if it's indexed by variant rather than row number.
In [77]:
bc_counts.index = bc_counts['variant']
bc_counts.index.name = None
del bc_counts['variant']
bc_counts
Out[77]:
We'll use a cutoff to choose variants with a minimum number of unique barcodes, and store this subset in a new index. We'll also exclude the wild type by dropping the first entry of the index.
In [78]:
bc_cutoff = 10
In [79]:
multi_bc_variants = bc_counts.loc[bc_counts['barcodes'] >= bc_cutoff].index[1:]
multi_bc_variants
Out[79]:
We can use this index to get condition-level scores for these variants by querying the "/main/variants/scores" table. Since we are working with an Experiment HDF5 file, the data frame column names are a MultiIndex with two levels, one for experimental conditions and one for data values (see the pandas documentation for more information).
In [80]:
multi_bc_scores = my_store.select('/main/variants/scores', where='index in multi_bc_variants')
multi_bc_scores
Out[80]:
There are fewer rows in multi_bc_scores
than in multi_bc_variants
because some of the variants were not scored in all replicate selections, and therefore do not have a condition-level score.
Now that we're finished getting data out of the HDF5 file, we'll close it.
In [81]:
my_store.close()
We'll add a column to the bc_counts
data frame that contains scores from the multi_bc_scores
data frame. To reference a column in a data frame with a MultiIndex, we need to specify all column levels.
In [82]:
bc_counts['score'] = multi_bc_scores['E3', 'score']
bc_counts
Out[82]:
Many rows in bc_counts
are missing scores (displayed as NaN) because those variants were not in multi_bc_scores
. We'll drop them before continuing.
In [83]:
bc_counts.dropna(inplace=True)
bc_counts
Out[83]:
Now that we have a data frame containing the subset of variants we're interested in, we can make a plot of score vs. number of unique barcodes. This example uses functions and colors from the Enrich2 plotting library.
In [86]:
fig, ax = plt.subplots()
enrich_plot.configure_axes(ax, xgrid=True)
ax.plot(bc_counts['barcodes'],
bc_counts['score'],
linestyle='none', marker='.', alpha=0.6,
color=enrich_plot.plot_colors['bright5'])
ax.set_xlabel("Unique Barcodes")
ax.set_ylabel("Variant Score")
Out[86]:
In [ ]: