The purpose of this notebook is to combine all the digital gene expression data for the retina cells, downloaded from the Gene Expression Omnibus using the accession number GSE63473.
In [ ]:
import altair
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
# Set the plotting style as for a "paper" (smaller labels)
# and using a white background with a grid ("whitegrid")
sns.set(context='paper', style='whitegrid')
%matplotlib inline
We'll import the macosko2015
package, which contains a URL pointing to where we've created clean data
In [ ]:
import macosko2015
macosko2015.BASE_URL
We've created a subset of the data that contains all the cells from batch 1, and only the differentially expressed genes. This is still pretty big!!
In [ ]:
urlname = macosko2015.BASE_URL + 'differential_clusters_expression.csv'
urlname
In [ ]:
expression = pd.read_csv(urlname, index_col=0)
print(expression.shape)
expression.head()
For later, let's also make a logged version the expression matrix:
In [ ]:
expression_log10 = np.log10(expression + 1)
print(expression_log10.shape)
expression_log10.head()
Now let's read the cell metadata
In [ ]:
urlname = macosko2015.BASE_URL + 'differential_clusters_cell_metadata.csv'
cell_metadata = pd.read_csv(urlname, index_col=0)
cell_metadata.head()
For this figure, the caption is:
(A) Pan-amacrine markers. The expression levels of the six genes identified (Nrxn2, Atp1b1, Pax6, Slc32a1, Slc6a1, Elavl3) are represented as dot plots across all 39 clusters; larger dots indicate broader expression within the cluster; deeper red denotes a higher expression level.
I wonder, how did they aggregate their expression per cluster? Mean, median, or mode? Did they log their data? We'll find out :)
You may have noticed that while the gene names are Captialcase in the paper, they're all uppercase in the data. So first, we'll define a function called upperizer
that will make our gene names uppercase.
Then, we'll make a list of genes for Figure 5a
In [ ]:
def upperizer(genes):
return [x.upper() for x in genes]
figure5a_genes = ['Nrxn2', 'Atp1b1', 'Pax6', 'Slc32a1', 'Slc6a1', 'Elavl3']
figure5a_genes_upper = upperizer(figure5a_genes)
figure5a_genes_upper
In [ ]:
# YOUR CODE HERE
In [ ]:
figure5a_expression = expression[figure5a_genes_upper]
print(figure5a_expression.shape)
figure5a_expression.head()
We will use a function called groupby
to grab the cells from each cluster, and use .mean()
In [ ]:
figure5a_expression_mean = figure5a_expression.groupby(cell_metadata['cluster_n'], axis=0).mean()
print(figure5a_expression_mean.shape)
figure5a_expression_mean.head()
To make this "punchcard" style figure, we will use the program altair
. To use this program, we need to reshape our data into a tall, "tidy" data format, where each row is an observation, and each column is a variable:
Source: http://r4ds.had.co.nz/tidy-data.html
First, we will unstack
our data, which will make a very long column of gene expression, where the gene name and cluster number is the index
.
In [ ]:
figure5a_expression_mean_unstack = figure5a_expression_mean.unstack()
print(figure5a_expression_mean_unstack.shape)
figure5a_expression_mean_unstack.head()
Now let's use the function reset_index
, which will take everything that was an index
and now make it a column:
In [ ]:
figure5a_expression_tidy = figure5a_expression_mean_unstack.reset_index()
print(figure5a_expression_tidy.shape)
figure5a_expression_tidy.head()
But our column names aren't so nice anymore ... let's create a dict
to map the old column name to a new, nicer one.
In [ ]:
renamer = {'level_0': 'gene_symbol', 0: 'expression'}
renamer
We can use the dataframe function .rename()
to rename our columns:
In [ ]:
figure5a_expression_tidy = figure5a_expression_tidy.rename(columns=renamer)
print(figure5a_expression_tidy.shape)
figure5a_expression_tidy.head()
Let's also add a log expression column, just in case :)
In [ ]:
figure5a_expression_tidy['expression_log'] = np.log10(figure5a_expression_tidy['expression'] + 1)
print(figure5a_expression_tidy.shape)
figure5a_expression_tidy.head()
In [ ]:
altair.Chart(figure5a_expression_tidy).mark_circle().encode(
size='expression', x=altair.X('gene_symbol'), y=altair.Y('cluster_n'))
In [ ]:
# YOUR CODE HERE
In [ ]:
altair.Chart(figure5a_expression_tidy).mark_circle().encode(
size='expression_log', x=altair.X('gene_symbol'), y=altair.Y('cluster_n'))
Here, we will make violinplots of specific genes within our dataset.
The caption of this figure is:
(B) Identification of known amacrine types among clusters. The 21 amacrine clusters consisted of 12 GABAergic, five glycinergic, one glutamatergic, and three non-GABAergic non-glycinergic clusters. Starburst amacrines were identified in cluster 3 by their expression of Chat; excitatory amacrines by expression of Slc17a8; A-II amacrines by their expression of Gjd2; and SEG amacrine neurons by their expression of Ebf3.
In [ ]:
# YOUR CODE HERE
In [ ]:
# YOUR CODE HERE
In [ ]:
# YOUR CODE HERE
In [ ]:
figure5b_genes = ['Chat', "Gad1", 'Gad2', 'Slc17a8', 'Slc6a9', 'Gjd2', 'Gjd2', 'Ebf3']
figure5b_genes_upper = upperizer(figure5b_genes)
figure5b_genes_upper
Now that we have the genes we want, let's get the cells we want!
We can use the range
function in Python to get numbers in order
In [ ]:
range(5)
Now this returns a "range object" which means Python is being lazy and not telling us what's inside. To force Python into action, we can use list
:
In [ ]:
list(range(5))
So this is getting us all numbers from 0 to 4 (not including 5). We can use this group of numbers to subset our cell metadata! Let's make a variable called rows
that contains True
/False
values telling us whether the cells are in that cluster number:
In [ ]:
rows = cell_metadata.cluster_n.isin(range(5))
rows
Now let's use our rows
variable to subset cell_metadata
In [ ]:
print('cell_metadata.shape', cell_metadata.shape)
cell_metadata_subset = cell_metadata.loc[rows]
print('cell_metadata_subset.shape', cell_metadata_subset.shape)
cell_metadata_subset.head()
Let's make sure we only have the clusters we need:
In [ ]:
cell_metadata_subset.cluster_n.unique()
This is kinda out of order so let's sort it with the sorted
function:
In [ ]:
sorted(cell_metadata_subset.cluster_n.unique())
In [ ]:
# YOUR CODE HERE
In [ ]:
# YOUR CODE HERE
In [ ]:
# YOUR CODE HERE
In [ ]:
rows = cell_metadata.cluster_n.isin(range(3, 24))
figure5b_cell_metadata = cell_metadata.loc[rows]
print(figure5b_cell_metadata.shape)
figure5b_cell_metadata.head()
In [ ]:
sorted(figure5b_cell_metadata.cluster_n.unique())
Now we want to get only the cells from these clusters. To do that, we would use .index
In [ ]:
figure5b_cell_metadata.index
In [ ]:
# YOUR CODE HERE
In [ ]:
figure5b_expression = expression.loc[figure5b_cell_metadata.index, figure5b_genes_upper]
print(figure5b_expression.shape)
figure5b_expression.head()
Again, we'll have to make a tidy
version of the data to be able to make the violinplots
In [ ]:
figure5b_cell_metadata.index
In [ ]:
figure5b_expression.index
In [ ]:
figure5b_tidy = figure5b_expression.unstack().reset_index()
figure5b_tidy = figure5b_tidy.rename(columns={'level_1': 'barcode', 'level_0': 'gene_symbol', 0: 'expression'})
figure5b_tidy['expression_log'] = np.log10(figure5b_tidy['expression'] + 1)
print(figure5b_tidy.shape)
figure5b_tidy.head()
If you want, you could also create a function to simplify the tidying and logging:
In [ ]:
def tidify_and_log(data):
tidy = data.unstack().reset_index()
tidy = tidy.rename(columns={'level_1': 'barcode', 'level_0': 'gene_symbol', 0: 'expression'})
tidy['expression_log'] = np.log10(tidy['expression'] + 1)
return tidy
Now that you have your tidy data, we need to add the cell metadata. We will use .join
, and specify to use the "barcode"
column of figure5b_tidy
In [ ]:
figure5b_tidy_clusters = figure5b_tidy.join(figure5b_cell_metadata, on='barcode')
print(figure5b_tidy_clusters.shape)
figure5b_tidy_clusters.head()
We can make violinplots using seaborn's sns.violinplot
, but that will show us the expression across all genes :(
In [ ]:
sns.violinplot?
The below command specifies "expression"
as the x-axis value (first argument), and "cluster_id"
as the y-axis value (second argument). Then we say that we want the program to look at the data
in our dataframe called figure5b_tidy_clusters
.
In [ ]:
sns.violinplot('expression', 'cluster_id', data=figure5b_tidy_clusters)
sns.FacetGrid
to make multiple violinplotsSince we want to make a separate violinplot for each gene, we need to take multiple steps. We will use the function sns.FacetGrid
to make mini-plots of each gene. If you want to read more about plotting on data-aware grid in Python, check out the seaborn docs on grid plotting.
Let's take a look at the documentation.
In [ ]:
sns.FacetGrid?
In [ ]:
# YOUR CODE HERE
In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol')
facetgrid
I have no idea which gene is where .. so let's add some titles with the convenient function g.set_titles
In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol')
facetgrid.set_titles('{col_name}')
Now let's add our violinplots, using map
on the facetgrid
. Again, we'll use "expression"
as the x-value (first argument) and "cluster_id"
as the second argument.
In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol')
facetgrid.map(sns.violinplot, 'expression', 'cluster_id')
facetgrid.set_titles('{col_name}')
In [ ]:
# YOUR CODE HERE
In [ ]:
# YOUR CODE HERE
In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol', sharex=False)
facetgrid.map(sns.violinplot, 'expression', 'cluster_id')
facetgrid.set_titles('{col_name}')
Okay these violinplots are still pretty weird looking. In the paper, they scale the violinplots to all be the same width, and the lines are much thinner.
Let's look at the documentation of sns.violinplot
and see what we can do.
In [ ]:
sns.violinplot?
Looks like we can set the scale
variable to be "width" and let's try setting the linewidth
to 1.
In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol', sharex=False)
facetgrid.map(sns.violinplot, 'expression', 'cluster_id', scale='width', linewidth=1)
facetgrid.set_titles('{col_name}')
Much better! There's a few more things we need to tweak in sns.violinplot
. Let's get rid of the dotted thing on the inside, and only show the data exactly where it's valued - the ends of the violins should be square, not pointy.
Read the documentation of sns.violinplot and add to the options to cut off the violinplots at the data bounds, and have nothing on the inside of the violins.
In [ ]:
# YOUR CODE HERE
In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol',
gridspec_kws=dict(hspace=0, wspace=0), sharex=False)
facetgrid.map(sns.violinplot, 'expression', 'cluster_id', scale='width', linewidth=1, inner=None, cut=True)
facetgrid.set_titles('{col_name}')
Okay one more thing on the violinplots ... they had a different color for every cluster, so let's do the same thing too. Right now they're all blue but let's make them all a different color. Since we have so many categories (21), and ColorBrewer doesn't have setups for when there are more than 10 colors, we need to use a different set of colors. We'll use the "husl"
colormap, which uses perception research to make colormaps where no one color is too bright or too dark. Read more about it here
Here is an example of what the colormap looks like:
In [ ]:
sns.palplot(sns.color_palette('husl', n_colors=50))
Let's add palette="husl"
to our violinplot command and see what it does:
In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol', sharex=False)
facetgrid.map(sns.violinplot, 'expression', 'cluster_id', scale='width',
linewidth=1, inner=None, cut=True, palette='husl')
facetgrid.set_titles('{col_name}')
Now let's work on resizing the plots so they're each narrower. We'll add the following three options to `sns.FacetGrid to accomplish this:
size=4
(default: size=3
) - Make the relative size of the plot biggeraspect=0.25
(default: aspect=1
) - Make the the width of the plot be 1/4 the size of the heighgridspec_kws=dict(wspace=0)
- Set the width between plots to be zero
In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol',
size=4, aspect=0.25, gridspec_kws=dict(wspace=0),
sharex=False)
facetgrid.map(sns.violinplot, 'expression', 'cluster_id', scale='width',
linewidth=1, palette='husl', inner=None, cut=True)
facetgrid.set_titles('{col_name}')
Hmm.. now we can see that the clusters aren't in numeric order. Is there an option in sns.violinplot
that we can specify the order of the values?
In [ ]:
# YOUR CODE HERE
In [ ]:
# YOUR CODE HERE
In [ ]:
# YOUR CODE HERE
In [ ]:
cluster_order = figure5b_tidy_clusters.cluster_id.sort_values().unique()
cluster_order
In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol', size=4, aspect=0.25,
gridspec_kws=dict(wspace=0), sharex=False)
facetgrid.map(sns.violinplot, 'expression', 'cluster_id', scale='width',
linewidth=1, palette='husl', inner=None, cut=True, order=cluster_order)
facetgrid.set_titles('{col_name}')
Okay one last thing .. let's turn off the "expression" label at the bottom and the value scales (since right now we're just looking comparatively) with:
facetgrid.set(xlabel='', xticks=[])
In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol', size=4, aspect=0.25,
gridspec_kws=dict(wspace=0), sharex=False)
facetgrid.map(sns.violinplot, 'expression', 'cluster_id', scale='width',
linewidth=1, palette='husl', inner=None, cut=True, order=cluster_order)
facetgrid.set(xlabel='', xticks=[])
facetgrid.set_titles('{col_name}')
YOUR ANSWER HERE
In [ ]:
# YOUR CODE HERE
In [ ]:
# YOUR CODE HERE
In [ ]:
figure5b_tidy_clusters.head()
In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol', size=4, aspect=0.25,
gridspec_kws=dict(wspace=0), sharex=False)
facetgrid.map(sns.violinplot, 'expression_log', 'cluster_id', scale='width',
linewidth=1, palette='husl', inner=None, cut=True, order=cluster_order)
facetgrid.set(xlabel='', xticks=[])
facetgrid.set_titles('{col_name}')
Since we worked so hard to get these lines, let's write them as a function to a file called plotting_code.py
. We'll move all the options we fiddled with into the arguments of our violinplot_grid
function.
Notice we have to add import seaborn as sns
into our file. That's because the file must be standalone and include everything it needs, including all the modules.
In [ ]:
%%file plotting_code.py
import seaborn as sns
def violinplot_grid(tidy, col='gene_symbol', size=4, aspect=0.25, gridspec_kws=dict(wspace=0),
sharex=False, scale='width', linewidth=1, palette='husl', inner=None,
cut=True, order=None):
facetgrid = sns.FacetGrid(tidy, col=col, size=size, aspect=aspect,
gridspec_kws=gridspec_kws, sharex=sharex)
facetgrid.map(sns.violinplot, 'expression_log', 'cluster_id', scale=scale,
linewidth=linewidth, palette=palette, inner=inner, cut=cut, order=order)
facetgrid.set(xlabel='', xticks=[])
facetgrid.set_titles('{col_name}')
We can cat
(short for concatenate) the file, which means dump the contents out to the output:
In [ ]:
cat plotting_code.py
In [ ]:
import plotting_code
plotting_code.violinplot_grid(figure5b_tidy_clusters, order=cluster_order)
Now we see more of the "bimodality" they talk about in the paper
In [ ]:
# YOUR CODE HERE
In [ ]:
# YOUR CODE HERE
In [ ]:
# YOUR CODE HERE
In [ ]:
# YOUR CODE HERE
In [ ]:
# YOUR CODE HERE
In [ ]:
figure5c_genes = ['Gng7', 'Gbx2', 'Tpbg', 'Slitrk6', 'Maf', 'Tac2', 'Loxl2', 'Vip', 'Glra1',
'Igfbp5', 'Pdgfra', 'Slc35d3', 'Car3', 'Fgf1', 'Igf1', 'Col12a1', 'Ptgds',
'Ppp1r17', 'Cck', 'Shisa9', 'Pou3f3']
figure5c_genes_upper = upperizer(figure5c_genes)
figure5c_expression = expression.loc[figure5b_cell_metadata.index, figure5c_genes_upper]
print(figure5c_expression.shape)
figure5c_expression.head()
In [ ]:
figure5c_genes_upper
In [ ]:
figure5c_tidy = tidify_and_log(figure5c_expression)
print(figure5c_tidy.shape)
figure5c_tidy.head()
In [ ]:
figure5c_tidy_cell_metadata = figure5c_tidy.join(cell_metadata, on='barcode')
print(figure5c_tidy_cell_metadata.shape)
figure5c_tidy_cell_metadata.head()
In [ ]:
plotting_code.violinplot_grid(figure5c_tidy_cell_metadata, order=cluster_order, aspect=0.2)
They showed that for each amacrine cell cluster, they showed that one gene that was mutually exclusively detected using single-cell RNA-seq. And then, in Figure 5F, they showed that indeed, one of the markers is expressed in some amacrine cells but not others.
But, single-cell RNA seq is plagued with gene dropout -- randomly, one gene will be detected in one cell but not another.
What if there was a way that we could detect the genes that dropped out?
Compressed sensing is a field where they think about problems like, "if we only get 10% of the signal, and it's super noisy, could we reconstruct 100% of what was originally there?(sound familiar??) Turns out yes, you can! Robust PCA is one of the algorithms in compressed sensing which models the data $X$ as the sum of a low-rank matrix $L$ and a sparse matrix $S$.
$X = L + S$
Robust PCA is often used in video analysis to find anomalies. In their case, $L$ is the background and $S$ is the "anomalies" (people walking around).
To understand what Robust PCA does to biological data, we first need to understand what the raw data looks like. Let's look at the gene expression in only amacrine cells, with the RAW data:
In [ ]:
# Import a file I wrote with a cleaned-up clustermap
import fig_code
amacrine_cluster_n = sorted(figure5b_cell_metadata.cluster_n.unique())
amacrine_cluster_to_color = dict(zip(amacrine_cluster_n, sns.color_palette('husl', n_colors=len(amacrine_cluster_n))))
amacrine_cell_colors = [amacrine_cluster_to_color[i] for i in figure5b_cell_metadata['cluster_n']]
amacrine_expression = expression_log10.loc[figure5b_cell_metadata.index]
print(amacrine_expression.shape)
fig_code.clustermap(amacrine_expression, row_colors=amacrine_cell_colors)
In [ ]:
csv = macosko2015.BASE_URL + 'differential_clusters_lowrank_tidy_metadata_amacrine.csv'
lowrank_tidy = pd.read_csv(csv)
print(lowrank_tidy.shape)
# Reshape the data to be a large 2d matrix
lowrank_tidy_2d = lowrank_tidy.pivot(index='barcode', columns='gene_symbol', values='expression_log')
# set minimum value shown to 0 because there's a bunch of small (e.g. -1.1) negative numbers in the lowrank data
fig_code.clustermap(lowrank_tidy_2d, row_colors=amacrine_cell_colors, vmin=0)
In [ ]:
# Subset the genes on only figure 5b
rows = lowrank_tidy.gene_symbol.isin(figure5b_genes_upper)
lowrank_tidy_figure5b = lowrank_tidy.loc[rows]
print(lowrank_tidy_figure5b.shape)
lowrank_tidy_figure5b.head()
plotting_code.violinplot_grid(lowrank_tidy_figure5b, order=cluster_order, aspect=0.25)
In [ ]:
rows = lowrank_tidy.gene_symbol.isin(figure5c_genes_upper)
lowrank_tidy_figure5c = lowrank_tidy.loc[rows]
print(lowrank_tidy_figure5c.shape)
lowrank_tidy_figure5c.head()
plotting_code.violinplot_grid(lowrank_tidy_figure5c, order=cluster_order, aspect=0.2)
Interesting what happens to the "unique cluster identifiers" ...