How to make a clustered heatmap

Now we'll break down how to read the clustered heatmap we made in 1.3_explore_gene_dropout_via_distance_correlation_linkage_clustering


In [ ]:
# Import the pandas dataframe library
import pandas as pd

# Import the seaborn library for plotting
import seaborn as sns

# Put all the plots directly into the notebook
%matplotlib inline

Read the expression data using pandas. Notice that pandas can read URLs (!), not just files on your computer!


In [ ]:
csv = "https://media.githubusercontent.com/media/olgabot/macosko2015/" \
         "master/data/05_make_rentina_subsets_for_teaching/big_clusters_expression.csv"
expression = pd.read_csv(csv, index_col=0)
print(expression.shape)
expression.head()

Exercise 1

Now use pd.read_csv to read the csv file of the cell metadata


In [ ]:
csv = "https://media.githubusercontent.com/media/olgabot/macosko2015/" \
         "master/data/05_make_rentina_subsets_for_teaching/big_clusters_cell_metadata.csv"
# YOUR CODE HERE


In [ ]:
csv = "https://media.githubusercontent.com/media/olgabot/macosko2015/" \
         "master/data/05_make_rentina_subsets_for_teaching/big_clusters_cell_metadata.csv"
cell_metadata = pd.read_csv(csv, index_col=0)
print(cell_metadata.shape)
cell_metadata.head()

To correlate columns of dataframes in pandas, you use the function .corr. Let's look at the documentation of .corr

  1. Is the default method Pearson or Spearman correlation?
  2. Can you correlate between rows, or only between columns?

In [ ]:
expression.corr?

Since .corr only correlates between columns, we need to transpose our dataframe. Transposing flips a matrix over its diagonal, so the rows become columns and the columns become rows.

Exercise 2

Transpose the expression matrix so the cells are the columns, which makes it easy to calculate correlations. How do you transpose a dataframe in pandas? (hint: google knows everything)


In [ ]:
# YOUR CODE HERE


In [ ]:
expression_t = expression.T
print(expression_t.shape)
expression_t.head()
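To see why the transpose matters, here is a small sketch on a made-up table of 3 cells by 4 genes. The names `toy`, `cell_*`, and `gene_*` are hypothetical and not part of the dataset. Because .corr returns a columns-by-columns matrix, only the transposed table gives cell-by-cell correlations.

```python
import pandas as pd

# Hypothetical toy data: rows are cells, columns are genes (same layout as `expression`)
toy = pd.DataFrame(
    [[0, 5, 2, 1], [1, 4, 2, 0], [3, 0, 1, 7]],
    index=["cell_a", "cell_b", "cell_c"],
    columns=["gene_1", "gene_2", "gene_3", "gene_4"],
)

gene_corr = toy.corr()    # correlates the columns: genes x genes
cell_corr = toy.T.corr()  # after transposing: cells x cells

print(gene_corr.shape)
print(cell_corr.shape)
```

The row and column labels of `cell_corr` are the cell names, which is the layout we want for a cell-by-cell heatmap.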

Exercise 3

Use .corr to calculate the Spearman correlation of the transposed expression dataframe. Make sure to print the shape, and show the head of the resulting dataframe.


In [ ]:
# YOUR CODE HERE


In [ ]:
expression_corr = expression_t.corr(method='spearman')
print(expression_corr.shape)
expression_corr.head()

Pro tip: if your matrix is really big, here's a trick to make Spearman correlations faster

Remember that Spearman correlation is equal to performing Pearson correlation on the ranks? Well, that's exactly what's happening inside the .corr(method='spearman') function! Every time it's calculating Spearman, it's converting each column to ranks, which means that it's double-converting to ranks since it has to do it for each pair of columns. Let's cut the work in half by converting to ranks FIRST. Let's take a look at the options for .rank:


In [ ]:
expression_t.rank?

Notice we can specify axis=1 or axis=0, but what does that really mean? Was this ascending along rows, or ascending along columns?

To figure this out, let's use a small, simple dataframe:


In [ ]:
df = pd.DataFrame([[5, 6, 7], [5, 6, 7], [5, 6, 7]])
df

Exercise 4

Try axis=0 when using rank on this df


In [ ]:
# YOUR CODE HERE


In [ ]:
df.rank(axis=0)

Did that make ranks ascending along columns or along rows?

Exercise 5

Now try axis=1 when using rank on this df


In [ ]:
# YOUR CODE HERE


In [ ]:
df.rank(axis=1)

Did that make ranks ascending along columns or along rows?
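If the answer is still fuzzy, here is a second sketch with a hypothetical dataframe (`toy`, values made up) where every row and every column has a different order, so the two axes give visibly different ranks.

```python
import pandas as pd

# Hypothetical toy values chosen so that rows and columns are ordered differently
toy = pd.DataFrame([[10, 200, 3], [20, 100, 2], [30, 300, 1]])

# axis=0 ranks the values within each column (comparing down the rows)
down_columns = toy.rank(axis=0)

# axis=1 ranks the values within each row (comparing across the columns)
across_rows = toy.rank(axis=1)

print(down_columns)
print(across_rows)
```

With axis=0, each column gets its own ranking from 1 to 3. With axis=1, each row does.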

Exercise 6

To get the gene (row) ranks for each cell (column), do we want axis=1 or axis=0? Perform .rank on the transposed expression matrix (expression_t), print the shape of the resulting ranks, and show the head() of it.


In [ ]:
# YOUR CODE HERE


In [ ]:
ranks = expression_t.rank(axis=0)
print(ranks.shape)
ranks.head()

Exercise 6 (continued)

Now that you're armed with all this information, let's put the ranks to work. While you're at it, compare the time it takes to run (the "runtime") .corr(method="pearson") on the ranks matrix with .corr(method="spearman") on the expression matrix.

  1. Perform Pearson correlation on the ranks.
  2. Check that it is equal to the expression Spearman correlation.
  3. Use the %timeit magic to check the runtimes of .corr on the ranks and expression matrices. (Feel free to calculate the expression correlation again, below)
    1. Note that when you use timeit, you cannot assign any variables -- using an equals sign doesn't work here.
  4. How much time did it take, in comparison? What's the order of magnitude difference?

Use as many cells as you need.


In [ ]:
# YOUR CODE HERE

In [ ]:
# YOUR CODE HERE

In [ ]:
# YOUR CODE HERE


In [ ]:
%timeit expression_t.corr(method='spearman')
%timeit ranks.corr(method='pearson')
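%timeit only works inside IPython and Jupyter. As a rough sketch of the same comparison in a plain Python script, the standard library's timeit module can be used instead. The dataframe `toy` below is a hypothetical stand-in for expression_t, filled with random counts, so the timings it prints say nothing about the real data.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for `expression_t`: 200 genes (rows) x 30 cells (columns)
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.poisson(2, size=(200, 30)))
toy_ranks = toy.rank(axis=0)

# timeit.timeit calls the function `number` times and returns the total seconds
n_calls = 5
spearman_seconds = timeit.timeit(lambda: toy.corr(method="spearman"), number=n_calls)
pearson_seconds = timeit.timeit(lambda: toy_ranks.corr(method="pearson"), number=n_calls)

print(f"spearman on values: {spearman_seconds / n_calls:.4f} s per call")
print(f"pearson on ranks:   {pearson_seconds / n_calls:.4f} s per call")
```

How large the difference is will depend on your pandas version and the size of the matrix, so treat the numbers as a rough guide.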

In [ ]:
ranks_corr = ranks.corr(method='pearson')
print(ranks_corr.shape)
ranks_corr.head()

Use inequality to see if any points are not the same. If this is equal to zero, then we know that they are ALL the same.


In [ ]:
(ranks_corr != expression_corr).sum().sum()

This is a flip of checking for equality, which is a little trickier because then you have to know exactly how many items are in the matrix. Since we have a 300x300 matrix, that multiplication is fairly easy to do in your head (300 x 300 = 90,000), so you can tell whether you got the right answer.


In [ ]:
(ranks_corr == expression_corr).sum().sum()
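One caveat: exact comparisons of floating point numbers can report a mismatch when two values differ only by rounding error. A more forgiving check, sketched here on a hypothetical random dataframe (`toy` is not the real data), is numpy's allclose, which treats values within a small tolerance as equal.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 50 genes (rows) x 6 cells (columns) of random counts
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.poisson(2, size=(50, 6)))

spearman = toy.corr(method="spearman")
pearson_on_ranks = toy.rank(axis=0).corr(method="pearson")

# True when every pair of entries agrees within numpy's default tolerance
same = np.allclose(spearman, pearson_on_ranks)
print(same)
```

This returns a single True or False, so there is no need to count the matrix entries by hand.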

Make a heatmap!!

Now we are ready to make a clustered heatmap! We'll use seaborn's sns.clustermap. Let's read the documentation for sns.clustermap. What is the default distance metric and linkage method?


In [ ]:
sns.clustermap?

Exercise 7

Now run sns.clustermap on either the ranks or expression correlation matrices, since they are equal :)


In [ ]:
# YOUR CODE HERE


In [ ]:
sns.clustermap(expression_corr)
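To get a feel for what the clustering step does to the row and column order, here is a minimal sketch using scipy directly. It is an illustration under stated assumptions, not seaborn's actual implementation: the data (`toy`, with columns a1, a2, b1, b2) is made up, and the method and metric are written out explicitly as average linkage and Euclidean distance.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import leaves_list, linkage

# Hypothetical toy data: two pairs of nearly identical columns, interleaved
a = np.arange(20.0)
b = (np.arange(20.0) * 7) % 20  # a scrambled ordering of 0..19
a2, b2 = a.copy(), b.copy()
a2[[0, 1]] = a2[[1, 0]]  # swap two values so each pair is similar, not identical
b2[[0, 1]] = b2[[1, 0]]
toy = pd.DataFrame({"a1": a, "b1": b, "a2": a2, "b2": b2})
toy_corr = toy.corr(method="spearman")

# Hierarchically cluster the rows of the correlation matrix
row_linkage = linkage(toy_corr.values, method="average", metric="euclidean")

# The dendrogram's leaf order puts similar rows next to each other
order = leaves_list(row_linkage)
reordered = toy_corr.iloc[order, order]
print(list(reordered.index))
```

After reordering, the two "a" columns sit next to each other and so do the two "b" columns, which is what produces the blocks along the diagonal of a clustered heatmap.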

How can we add the colors labeling the rows and columns? Check the documentation for sns.clustermap again:

Exercise 8


In [ ]:
# YOUR CODE HERE


In [ ]:
sns.clustermap?

Since I am not a color design expert, I defer to color design experts in choosing my color palettes. One such expert is Cynthia Brewer, who made a ColorBrewer (hah!) list of color maps for both increasing quantity (shades), and for categories (differing colors).

As a reference, I like using this demo of every ColorBrewer scale. Hover over the palette to see its name.

Thankfully, seaborn has the ColorBrewer color maps built-in. Let's see what the output looks like.

Remember -- we never make a variable without looking at it first!!


In [ ]:
palette = sns.color_palette('Accent', n_colors=3)
palette

Huh, that's a bunch of weird numbers. What do they mean? Turns out each number is a value from 0 to 1 representing one of the red, green, and blue (RGB) color channels that computers understand. But I'm not a computer .... what am I supposed to do??

Turns out, seaborn also has a very convenient function called palplot to plot the entire palette. This lets us look at the variable without having to convert from RGB


In [ ]:
sns.palplot(palette)
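To make the 0-to-1 RGB values a bit less mysterious, here is a small sketch that converts one RGB tuple into the hex color code you may know from web pages. The helper `rgb_to_hex` and the color `orange` are hypothetical, written just for this illustration.

```python
def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of 0-1 floats into a '#rrggbb' string."""
    # Scale each channel from the 0-1 range to the 0-255 range
    red, green, blue = (round(channel * 255) for channel in rgb)
    # Format each channel as two hexadecimal digits
    return f"#{red:02x}{green:02x}{blue:02x}"


orange = (1.0, 0.4, 0.0)  # hypothetical color: full red, some green, no blue
print(rgb_to_hex(orange))  # prints #ff6600
```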

Exercise 9

  • Get the color palette for the "Set2" colormap and specify that you want 6 colors (read the documentation of sns.color_palette)
  • Plot the color palette

In [ ]:
# YOUR CODE HERE

In [ ]:
# YOUR CODE HERE


In [ ]:
set2 = sns.color_palette('Set2', n_colors=6)
sns.palplot(set2)

If you are more advanced and want access to more colormaps, I recommend checking out palettable.

Assign colors to clusters

To set a specific color to each cluster, we'll need to see the unique clusters here. For an individual column (called a "Series" in pandas-speak), how can we get only the unique items?

Exercise 10

Get the unique values from the column "cluster_celltype_with_id". Remember, always look at the variable you created!


In [ ]:
# YOUR CODE HERE


In [ ]:
cluster_ids_unique = cell_metadata['cluster_celltype_with_id'].unique()
cluster_ids_unique

Detour: zip and dict

To map colors to each cluster name, we need to talk about two built-in functions in Python, called zip and dict.

For this next part, we'll use the built-in function zip which is very useful. It acts like a zipper (like for clothes) to glue together the pairs of items in two lists:


In [ ]:
english = ["hello", "goodbye", "no", "yes", "please", "thank you",]
spanish = ["hola", "adios", "no", "si"]
zip(english, spanish)

To be memory efficient, this doesn't show us what's inside right away. To look inside a zip object, we can use list:


In [ ]:
list(zip(english, spanish))

Exercise 11

What happened to "please" and "thank you" from english? Make another list, called spanish2, that contains the Spanish words for "please" and "thank you" (again, google knows everything), then call zip on english and spanish2. Don't forget to use list on them!


In [ ]:
# YOUR CODE HERE


In [ ]:
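english = ["hello", "goodbye", "no", "yes", "please", "thank you",]
spanish2 = ["hola", "adios", "no", "si", "por favor", "gracias"]
# Keep the complete list under the name `spanish` too, since the cells below use it
spanish = spanish2
list(zip(english, spanish2))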
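One thing to keep in mind: zip stops as soon as the shortest input runs out, and it does so silently. Here is a small sketch with made-up lists (`letters` and `numbers` are hypothetical). It also shows itertools.zip_longest from the standard library, which pads the shorter input instead of dropping items.

```python
from itertools import zip_longest

letters = ["a", "b", "c", "d"]
numbers = [1, 2]

# zip stops at the shortest input, so "c" and "d" are silently dropped
short_pairs = list(zip(letters, numbers))

# zip_longest keeps going and fills the gaps with `fillvalue`
padded_pairs = list(zip_longest(letters, numbers, fillvalue=None))

print(short_pairs)   # [('a', 1), ('b', 2)]
print(padded_pairs)  # [('a', 1), ('b', 2), ('c', None), ('d', None)]
```

This matters later when pairing cluster names with colors: if the palette has fewer colors than there are clusters, the extra clusters get no color at all.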

Now we'll use a dictionary (dict) to make a lookup table from the pairing made by zip, using the first item of each pair as the "key" (what you use to look up) and the second item as the "value" (the result of the lookup).

You can think of it as a translator -- use the word in English to look up the word in Spanish.


In [ ]:
english_to_spanish = dict(zip(english, spanish))
english_to_spanish

Now we can use English words to look up the word in Spanish! We put the English word we want inside square brackets to look up the Spanish word.


In [ ]:
english_to_spanish['hello']

Exercise 12

Make a spanish_to_english dictionary and look up the English word for "por favor"


In [ ]:
# YOUR CODE HERE


In [ ]:
spanish_to_english = dict(zip(spanish, english))
spanish_to_english['por favor']

Okay, detour over! Switching from linguistics back to biology :)

Exercise 13

Use dict and zip to create a variable called id_to_color that assigns labels in cluster_ids_unique to a color in set2


In [ ]:
# YOUR CODE HERE


In [ ]:
id_to_color = dict(zip(cluster_ids_unique, set2))
id_to_color

Now we want to use this id_to_color lookup table to make a long list of colors for each cell.


In [ ]:
cell_metadata.head()

As an example, let's use the celltype column to make a list of each celltype color first. Notice that we can use cell_metadata.celltype or cell_metadata['celltype'] to get the column we want.

We can only use the 'dot' notation because our column name has no unfriendly characters like spaces, dashes, or dots -- characters that mean something special in Python.


In [ ]:
celltypes = cell_metadata.celltype.unique()  # Could also use cell_metadata['celltype'].unique()
celltypes

In [ ]:
celltype_to_color = dict(zip(celltypes, sns.color_palette('Accent', n_colors=len(celltypes))))
celltype_to_color

Now we'll use the existing column cell_metadata.celltype to make a list with one color per cell, based on each cell's celltype


In [ ]:
per_cell_celltype_color = [celltype_to_color[celltype] for celltype in cell_metadata.celltype]

# Since this list is as long as our number of cells (300!), let's slice it and only look at the first 10
per_cell_celltype_color[:10]

Exercise 14

Make a variable called per_cell_cluster_color that uses the id_to_color dictionary to look up the color for each value in the cluster_celltype_with_id column of cell_metadata


In [ ]:
# YOUR CODE HERE


In [ ]:
per_cell_cluster_color = [id_to_color[i] for i in cell_metadata.cluster_celltype_with_id]

# Since this list is as long as our number of cells (300!), let's slice it and only look at the first 10
per_cell_cluster_color[:10]
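As an alternative to the list comprehension, pandas can do the same lookup with Series.map, which keeps the cell names as the index. The sketch below uses hypothetical stand-ins (`toy_metadata`, `toy_id_to_color`, and made-up cluster names and colors) rather than the real cell_metadata and id_to_color.

```python
import pandas as pd

# Hypothetical stand-ins for `cell_metadata` and `id_to_color`
toy_metadata = pd.DataFrame(
    {"cluster": ["rods", "cones", "rods", "bipolar"]},
    index=["cell_1", "cell_2", "cell_3", "cell_4"],
)
toy_id_to_color = {
    "rods": (0.4, 0.76, 0.65),
    "cones": (0.99, 0.55, 0.38),
    "bipolar": (0.55, 0.63, 0.8),
}

# List comprehension: a plain list with one color per cell, in row order
colors_list = [toy_id_to_color[c] for c in toy_metadata["cluster"]]

# Series.map: the same colors, but labeled with the cell names
colors_series = toy_metadata["cluster"].map(toy_id_to_color)
print(colors_series)
```

According to the seaborn documentation, row_colors and col_colors also accept a pandas Series, in which case the colors are matched to the data by index.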

Exercise 15

Now use the cluster colors to label the rows and columns in sns.clustermap.


In [ ]:
# YOUR CODE HERE


In [ ]:
sns.clustermap(expression_corr, row_colors=per_cell_cluster_color, col_colors=per_cell_cluster_color)

We can also combine the celltype and cluster colors we created to create a double-layer colormap!


In [ ]:
combined_colors = [per_cell_cluster_color, per_cell_celltype_color]
len(combined_colors)

In [ ]:
sns.clustermap(expression_corr, row_colors=combined_colors, col_colors=combined_colors)
