Recreate Figure 5

The purpose of this notebook is to combine all the digital gene expression data for the retina cells, downloaded from the Gene Expression Omnibus using the accession number GSE63473.


In [ ]:
import altair
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Set the plotting style as for a "paper" (smaller labels) 
# and using a white background with a grid ("whitegrid")
sns.set(context='paper', style='whitegrid')

%matplotlib inline

We'll import the macosko2015 package, which contains a URL pointing to where we've created clean data


In [ ]:
import macosko2015
macosko2015.BASE_URL

We've created a subset of the data that contains all the cells from batch 1, and only the differentially expressed genes. This is still pretty big!!


In [ ]:
urlname = macosko2015.BASE_URL + 'differential_clusters_expression.csv'
urlname

In [ ]:
expression = pd.read_csv(urlname, index_col=0)
print(expression.shape)
expression.head()

For later, let's also make a logged version the expression matrix:


In [ ]:
expression_log10 = np.log10(expression + 1)
print(expression_log10.shape)
expression_log10.head()

Now let's read the cell metadata


In [ ]:
urlname = macosko2015.BASE_URL + 'differential_clusters_cell_metadata.csv'
cell_metadata = pd.read_csv(urlname, index_col=0)
cell_metadata.head()

Figure 5a

For this figure, the caption is:

(A) Pan-amacrine markers. The expression levels of the six genes identified (Nrxn2, Atp1b1, Pax6, Slc32a1, Slc6a1, Elavl3) are represented as dot plots across all 39 clusters; larger dots indicate broader expression within the cluster; deeper red denotes a higher expression level.

I wonder, how did they aggregate their expression per cluster? Mean, median, or mode? Did they log their data? We'll find out :)

You may have noticed that while the gene names are Captialcase in the paper, they're all uppercase in the data. So first, we'll define a function called upperizer that will make our gene names uppercase.

Then, we'll make a list of genes for Figure 5a


In [ ]:
def upperizer(genes):
    return [x.upper() for x in genes]

figure5a_genes = ['Nrxn2', 'Atp1b1', 'Pax6', 'Slc32a1', 'Slc6a1', 'Elavl3']
figure5a_genes_upper = upperizer(figure5a_genes)
figure5a_genes_upper

Exercise 1

Make a subset of the data called figure5a_expression that contains only the genes from figure5a_genes_upper - Remember, genes are columns! How do you select only certain columns for pandas dataframes?


In [ ]:
# YOUR CODE HERE


In [ ]:
figure5a_expression = expression[figure5a_genes_upper]
print(figure5a_expression.shape)
figure5a_expression.head()

We will use a function called groupby to grab the cells from each cluster, and use .mean()


In [ ]:
figure5a_expression_mean = figure5a_expression.groupby(cell_metadata['cluster_n'], axis=0).mean()
print(figure5a_expression_mean.shape)
figure5a_expression_mean.head()

Tidy data

To make this "punchcard" style figure, we will use the program altair. To use this program, we need to reshape our data into a tall, "tidy" data format, where each row is an observation, and each column is a variable:

Source: http://r4ds.had.co.nz/tidy-data.html

First, we will unstack our data, which will make a very long column of gene expression, where the gene name and cluster number is the index.


In [ ]:
figure5a_expression_mean_unstack = figure5a_expression_mean.unstack()
print(figure5a_expression_mean_unstack.shape)
figure5a_expression_mean_unstack.head()

Now let's use the function reset_index, which will take everything that was an index and now make it a column:


In [ ]:
figure5a_expression_tidy = figure5a_expression_mean_unstack.reset_index()
print(figure5a_expression_tidy.shape)
figure5a_expression_tidy.head()

But our column names aren't so nice anymore ... let's create a dict to map the old column name to a new, nicer one.


In [ ]:
renamer = {'level_0': 'gene_symbol', 0: 'expression'}
renamer

We can use the dataframe function .rename() to rename our columns:


In [ ]:
figure5a_expression_tidy = figure5a_expression_tidy.rename(columns=renamer)
print(figure5a_expression_tidy.shape)
figure5a_expression_tidy.head()

Let's also add a log expression column, just in case :)


In [ ]:
figure5a_expression_tidy['expression_log'] = np.log10(figure5a_expression_tidy['expression'] + 1)
print(figure5a_expression_tidy.shape)
figure5a_expression_tidy.head()

In [ ]:
altair.Chart(figure5a_expression_tidy).mark_circle().encode(
    size='expression', x=altair.X('gene_symbol'), y=altair.Y('cluster_n'))

Exercise 2

Make the same plot, but use the logged expression column


In [ ]:
# YOUR CODE HERE


In [ ]:
altair.Chart(figure5a_expression_tidy).mark_circle().encode(
    size='expression_log', x=altair.X('gene_symbol'), y=altair.Y('cluster_n'))

Bonus exercise (if you're feeling ahead)

Try the same thing, but with:

  • Median within clusters, raw counts
  • Median within clusters, log expression

Figure 5b

Here, we will make violinplots of specific genes within our dataset.

The caption of this figure is:

(B) Identification of known amacrine types among clusters. The 21 amacrine clusters consisted of 12 GABAergic, five glycinergic, one glutamatergic, and three non-GABAergic non-glycinergic clusters. Starburst amacrines were identified in cluster 3 by their expression of Chat; excitatory amacrines by expression of Slc17a8; A-II amacrines by their expression of Gjd2; and SEG amacrine neurons by their expression of Ebf3.

Exercise 3

  1. Make a subset of genes called figure5b_genes using the gene names from the paper figure.
    • You may want to use the upperizer() function we used before

Use as many code cells as you need


In [ ]:
# YOUR CODE HERE

In [ ]:
# YOUR CODE HERE

In [ ]:
# YOUR CODE HERE


In [ ]:
figure5b_genes = ['Chat', "Gad1", 'Gad2', 'Slc17a8', 'Slc6a9', 'Gjd2', 'Gjd2', 'Ebf3']
figure5b_genes_upper = upperizer(figure5b_genes)
figure5b_genes_upper

Now that we have the genes we want, let's get the cells we want!

We can use the range function in Python to get numbers in order


In [ ]:
range(5)

Now this returns a "range object" which means Python is being lazy and not telling us what's inside. To force Python into action, we can use list:


In [ ]:
list(range(5))

So this is getting us all numbers from 0 to 4 (not including 5). We can use this group of numbers to subset our cell metadata! Let's make a variable called rows that contains True/False values telling us whether the cells are in that cluster number:


In [ ]:
rows = cell_metadata.cluster_n.isin(range(5))
rows

Now let's use our rows variable to subset cell_metadata


In [ ]:
print('cell_metadata.shape', cell_metadata.shape)
cell_metadata_subset = cell_metadata.loc[rows]
print('cell_metadata_subset.shape', cell_metadata_subset.shape)
cell_metadata_subset.head()

Let's make sure we only have the clusters we need:


In [ ]:
cell_metadata_subset.cluster_n.unique()

This is kinda out of order so let's sort it with the sorted function:


In [ ]:
sorted(cell_metadata_subset.cluster_n.unique())

Exercise 4

Make a subset of the cell metadata, called figure5b_cell_metadata, that contains only the cells in the clusters shown in figure 5b.


In [ ]:
# YOUR CODE HERE

In [ ]:
# YOUR CODE HERE

In [ ]:
# YOUR CODE HERE


In [ ]:
rows = cell_metadata.cluster_n.isin(range(3, 24))
figure5b_cell_metadata = cell_metadata.loc[rows]
print(figure5b_cell_metadata.shape)
figure5b_cell_metadata.head()

In [ ]:
sorted(figure5b_cell_metadata.cluster_n.unique())

Now we want to get only the cells from these clusters. To do that, we would use .index


In [ ]:
figure5b_cell_metadata.index

Exercise 5

Make a subset of gene expression called figure5b_expression using the figure5b_genes and figure5b_cell_metadata. Hint: Use .loc on expression


In [ ]:
# YOUR CODE HERE


In [ ]:
figure5b_expression = expression.loc[figure5b_cell_metadata.index, figure5b_genes_upper]
print(figure5b_expression.shape)
figure5b_expression.head()

Again, we'll have to make a tidy version of the data to be able to make the violinplots

Exercise 6

  1. Make a tidy version of the figure5b_expression called figure5b_tidy
  2. Add a column to the figure5b_tidy dataframe that contains the log10 expression data

In [ ]:
figure5b_cell_metadata.index

In [ ]:
figure5b_expression.index

In [ ]:
figure5b_tidy = figure5b_expression.unstack().reset_index()
figure5b_tidy = figure5b_tidy.rename(columns={'level_1': 'barcode', 'level_0': 'gene_symbol', 0: 'expression'})
figure5b_tidy['expression_log'] = np.log10(figure5b_tidy['expression'] + 1)
print(figure5b_tidy.shape)
figure5b_tidy.head()

If you want, you could also create a function to simplify the tidying and logging:


In [ ]:
def tidify_and_log(data):
    tidy = data.unstack().reset_index()
    tidy = tidy.rename(columns={'level_1': 'barcode', 'level_0': 'gene_symbol', 0: 'expression'})
    tidy['expression_log'] = np.log10(tidy['expression'] + 1)
    return tidy

Now that you have your tidy data, we need to add the cell metadata. We will use .join, and specify to use the "barcode" column of figure5b_tidy


In [ ]:
figure5b_tidy_clusters = figure5b_tidy.join(figure5b_cell_metadata, on='barcode')
print(figure5b_tidy_clusters.shape)
figure5b_tidy_clusters.head()

We can make violinplots using seaborn's sns.violinplot, but that will show us the expression across all genes :(


In [ ]:
sns.violinplot?

The below command specifies "expression" as the x-axis value (first argument), and "cluster_id" as the y-axis value (second argument). Then we say that we want the program to look at the data in our dataframe called figure5b_tidy_clusters.


In [ ]:
sns.violinplot('expression', 'cluster_id', data=figure5b_tidy_clusters)

Using sns.FacetGrid to make multiple violinplots

Since we want to make a separate violinplot for each gene, we need to take multiple steps. We will use the function sns.FacetGrid to make mini-plots of each gene. If you want to read more about plotting on data-aware grid in Python, check out the seaborn docs on grid plotting.

Let's take a look at the documentation.


In [ ]:
sns.FacetGrid?

Exercise 7

What is the first argument of FacetGrid? How can we specify that we want each column to be a gene symbol?

Use sns.FacetGrid on our figure5b_tidy_clusters data, specifying "gene_symbol" as the column in our data to use as columns in the grid.


In [ ]:
# YOUR CODE HERE


In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol')
facetgrid

I have no idea which gene is where .. so let's add some titles with the convenient function g.set_titles


In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol')
facetgrid.set_titles('{col_name}')

Now let's add our violinplots, using map on the facetgrid. Again, we'll use "expression" as the x-value (first argument) and "cluster_id" as the second argument.


In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol')
facetgrid.map(sns.violinplot, 'expression', 'cluster_id')
facetgrid.set_titles('{col_name}')

Hmm, all of these genes are on totally different scales .. how can we make it so that each gene is scaled to its own minimum and maximum?

Exercise 8

Read the documentation for sns.FacetGrid and figure out how to turn of the shared values on the x axis


In [ ]:
# YOUR CODE HERE

In [ ]:
# YOUR CODE HERE


In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol', sharex=False)
facetgrid.map(sns.violinplot, 'expression', 'cluster_id')
facetgrid.set_titles('{col_name}')

Okay these violinplots are still pretty weird looking. In the paper, they scale the violinplots to all be the same width, and the lines are much thinner.

Let's look at the documentation of sns.violinplot and see what we can do.


In [ ]:
sns.violinplot?

Looks like we can set the scale variable to be "width" and let's try setting the linewidth to 1.


In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol', sharex=False)
facetgrid.map(sns.violinplot, 'expression', 'cluster_id', scale='width', linewidth=1)
facetgrid.set_titles('{col_name}')

Much better! There's a few more things we need to tweak in sns.violinplot. Let's get rid of the dotted thing on the inside, and only show the data exactly where it's valued - the ends of the violins should be square, not pointy.

Exercise 9

Read the documentation of sns.violinplot and add to the options to cut off the violinplots at the data bounds, and have nothing on the inside of the violins.


In [ ]:
# YOUR CODE HERE


In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol',
                  gridspec_kws=dict(hspace=0, wspace=0), sharex=False)
facetgrid.map(sns.violinplot, 'expression', 'cluster_id', scale='width', linewidth=1, inner=None, cut=True)
facetgrid.set_titles('{col_name}')

Okay one more thing on the violinplots ... they had a different color for every cluster, so let's do the same thing too. Right now they're all blue but let's make them all a different color. Since we have so many categories (21), and ColorBrewer doesn't have setups for when there are more than 10 colors, we need to use a different set of colors. We'll use the "husl" colormap, which uses perception research to make colormaps where no one color is too bright or too dark. Read more about it here

Here is an example of what the colormap looks like:


In [ ]:
sns.palplot(sns.color_palette('husl', n_colors=50))

Let's add palette="husl" to our violinplot command and see what it does:


In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol', sharex=False)
facetgrid.map(sns.violinplot, 'expression', 'cluster_id', scale='width', 
      linewidth=1, inner=None, cut=True, palette='husl')
facetgrid.set_titles('{col_name}')

Now let's work on resizing the plots so they're each narrower. We'll add the following three options to `sns.FacetGrid to accomplish this:

  • size=4 (default: size=3) - Make the relative size of the plot bigger
  • aspect=0.25 (default: aspect=1) - Make the the width of the plot be 1/4 the size of the heigh
  • gridspec_kws=dict(wspace=0) - Set the width between plots to be zero

In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol', 
                          size=4, aspect=0.25, gridspec_kws=dict(wspace=0), 
                          sharex=False)
facetgrid.map(sns.violinplot, 'expression', 'cluster_id', scale='width', 
      linewidth=1, palette='husl', inner=None, cut=True)
facetgrid.set_titles('{col_name}')

Hmm.. now we can see that the clusters aren't in numeric order. Is there an option in sns.violinplot that we can specify the order of the values?

Exercise 10

  1. Read the sns.violinplot documentation to figure out the keyword to use to specify the order of the clusters
  2. Make a sorted list of the unique cluster ids
  3. Plot the violinplots on the FacetGrid

In [ ]:
# YOUR CODE HERE

In [ ]:
# YOUR CODE HERE

In [ ]:
# YOUR CODE HERE


In [ ]:
cluster_order = figure5b_tidy_clusters.cluster_id.sort_values().unique()
cluster_order

In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol', size=4, aspect=0.25,
                  gridspec_kws=dict(wspace=0), sharex=False)
facetgrid.map(sns.violinplot, 'expression', 'cluster_id', scale='width', 
      linewidth=1, palette='husl', inner=None, cut=True, order=cluster_order)
facetgrid.set_titles('{col_name}')

Okay one last thing .. let's turn off the "expression" label at the bottom and the value scales (since right now we're just looking comparatively) with:

facetgrid.set(xlabel='', xticks=[])

In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol', size=4, aspect=0.25,
                  gridspec_kws=dict(wspace=0), sharex=False)
facetgrid.map(sns.violinplot, 'expression', 'cluster_id', scale='width', 
      linewidth=1, palette='husl', inner=None, cut=True, order=cluster_order)
facetgrid.set(xlabel='', xticks=[])
facetgrid.set_titles('{col_name}')

Exercise 11

Take a step back ... does this look like the actual Figure 5b from the paper? Do you see the bimodality that they claim?

Why or why not?

YOUR ANSWER HERE

We don't see the bimodality here because they used loggged data, not the raw counts.

Exercise 12

Use logged expression (which column was this in our data? Check figure5b_tidy_clusters.head() to remind yourself) on the facetgrid of violinplots we created.


In [ ]:
# YOUR CODE HERE

In [ ]:
# YOUR CODE HERE


In [ ]:
figure5b_tidy_clusters.head()

In [ ]:
facetgrid = sns.FacetGrid(figure5b_tidy_clusters, col='gene_symbol', size=4, aspect=0.25,
                  gridspec_kws=dict(wspace=0), sharex=False)
facetgrid.map(sns.violinplot, 'expression_log', 'cluster_id', scale='width', 
      linewidth=1, palette='husl', inner=None, cut=True, order=cluster_order)
facetgrid.set(xlabel='', xticks=[])
facetgrid.set_titles('{col_name}')

Since we worked so hard to get these lines, let's write them as a function to a file called plotting_code.py. We'll move all the options we fiddled with into the arguments of our violinplot_grid function.

Notice we have to add import seaborn as sns into our file. That's because the file must be standalone and include everything it needs, including all the modules.


In [ ]:
%%file plotting_code.py

import seaborn as sns

def violinplot_grid(tidy, col='gene_symbol', size=4, aspect=0.25, gridspec_kws=dict(wspace=0),
                    sharex=False, scale='width', linewidth=1, palette='husl', inner=None, 
                    cut=True, order=None):
    facetgrid = sns.FacetGrid(tidy, col=col, size=size, aspect=aspect,
                  gridspec_kws=gridspec_kws, sharex=sharex)
    facetgrid.map(sns.violinplot, 'expression_log', 'cluster_id', scale=scale, 
          linewidth=linewidth, palette=palette, inner=inner, cut=cut, order=order)
    facetgrid.set(xlabel='', xticks=[])
    facetgrid.set_titles('{col_name}')

We can cat (short for concatenate) the file, which means dump the contents out to the output:


In [ ]:
cat plotting_code.py

In [ ]:
import plotting_code

plotting_code.violinplot_grid(figure5b_tidy_clusters, order=cluster_order)

Now we see more of the "bimodality" they talk about in the paper

Figure 5c

Figure 5c is all you!

Exercise 13

Use all the amacrine cells, but this time use all the genes from Figure 5c.

Note: You may not have expression in EVERY gene .. this is a subset of the entire dataset (only run 1!) and


In [ ]:
# YOUR CODE HERE

In [ ]:
# YOUR CODE HERE

In [ ]:
# YOUR CODE HERE

In [ ]:
# YOUR CODE HERE

In [ ]:
# YOUR CODE HERE


In [ ]:
figure5c_genes = ['Gng7', 'Gbx2', 'Tpbg', 'Slitrk6', 'Maf', 'Tac2', 'Loxl2', 'Vip', 'Glra1',
                  'Igfbp5', 'Pdgfra', 'Slc35d3', 'Car3', 'Fgf1', 'Igf1', 'Col12a1', 'Ptgds',
                  'Ppp1r17', 'Cck', 'Shisa9', 'Pou3f3']
figure5c_genes_upper = upperizer(figure5c_genes)
figure5c_expression = expression.loc[figure5b_cell_metadata.index, figure5c_genes_upper]
print(figure5c_expression.shape)
figure5c_expression.head()

In [ ]:
figure5c_genes_upper

In [ ]:
figure5c_tidy = tidify_and_log(figure5c_expression)
print(figure5c_tidy.shape)
figure5c_tidy.head()

In [ ]:
figure5c_tidy_cell_metadata = figure5c_tidy.join(cell_metadata, on='barcode')
print(figure5c_tidy_cell_metadata.shape)
figure5c_tidy_cell_metadata.head()

In [ ]:
plotting_code.violinplot_grid(figure5c_tidy_cell_metadata, order=cluster_order, aspect=0.2)

Let's take a step back ... What does this all mean?

They showed that for each amacrine cell cluster, they showed that one gene that was mutually exclusively detected using single-cell RNA-seq. And then, in Figure 5F, they showed that indeed, one of the markers is expressed in some amacrine cells but not others.

But, single-cell RNA seq is plagued with gene dropout -- randomly, one gene will be detected in one cell but not another.

What if there was a way that we could detect the genes that dropped out?

Compressed Sensing/Robust PCA

Compressed sensing is a field where they think about problems like, "if we only get 10% of the signal, and it's super noisy, could we reconstruct 100% of what was originally there?(sound familiar??) Turns out yes, you can! Robust PCA is one of the algorithms in compressed sensing which models the data $X$ as the sum of a low-rank matrix $L$ and a sparse matrix $S$.

$X = L + S$

  • $X$ is the expression data
  • $L$ is the low rank data. In our case, this essentially becomes a smoothed version of the expression matrix
  • $S$ is the sparse data. In our case, this captures the stochastic noise in the data. Some of this data may be biological, it is true. But largely, this data seems to carry the technical noise.

Robust PCA is often used in video analysis to find anomalies. In their case, $L$ is the background and $S$ is the "anomalies" (people walking around).

Cluster on raw (log2) amacrine cell expression

To understand what Robust PCA does to biological data, we first need to understand what the raw data looks like. Let's look at the gene expression in only amacrine cells, with the RAW data:


In [ ]:
# Import a file I wrote with a cleaned-up clustermap
import fig_code

amacrine_cluster_n = sorted(figure5b_cell_metadata.cluster_n.unique())
amacrine_cluster_to_color = dict(zip(amacrine_cluster_n, sns.color_palette('husl', n_colors=len(amacrine_cluster_n))))
amacrine_cell_colors = [amacrine_cluster_to_color[i] for i in figure5b_cell_metadata['cluster_n']]

amacrine_expression = expression_log10.loc[figure5b_cell_metadata.index]
print(amacrine_expression.shape)

fig_code.clustermap(amacrine_expression, row_colors=amacrine_cell_colors)

Cluster on Robust PCA'd amacrine cell expression (lowrank)


In [ ]:
csv = macosko2015.BASE_URL + 'differential_clusters_lowrank_tidy_metadata_amacrine.csv'
lowrank_tidy = pd.read_csv(csv)
print(lowrank_tidy.shape)

# Reshape the data to be a large 2d matrix
lowrank_tidy_2d = lowrank_tidy.pivot(index='barcode', columns='gene_symbol', values='expression_log')

# set minimum value shown to 0 because there's a bunch of small (e.g. -1.1) negative numbers in the lowrank data
fig_code.clustermap(lowrank_tidy_2d, row_colors=amacrine_cell_colors, vmin=0)

Figure 5b using Robust PCA data


In [ ]:
# Subset the genes on only figure 5b
rows = lowrank_tidy.gene_symbol.isin(figure5b_genes_upper)
lowrank_tidy_figure5b = lowrank_tidy.loc[rows]
print(lowrank_tidy_figure5b.shape)
lowrank_tidy_figure5b.head()

plotting_code.violinplot_grid(lowrank_tidy_figure5b, order=cluster_order, aspect=0.25)

Looks like a lot of the signal from the genes was recovered!

Robust PCA data for Figure 5c

Subset the genes on only figure 5c


In [ ]:
rows = lowrank_tidy.gene_symbol.isin(figure5c_genes_upper)
lowrank_tidy_figure5c = lowrank_tidy.loc[rows]
print(lowrank_tidy_figure5c.shape)
lowrank_tidy_figure5c.head()

plotting_code.violinplot_grid(lowrank_tidy_figure5c, order=cluster_order, aspect=0.2)

Interesting what happens to the "unique cluster identifiers" ...