# How to make a clustered heatmap

``````

In [ ]:

# Import the pandas dataframe library
import pandas as pd

# Import the seaborn library for plotting
import seaborn as sns

# Put all the plots directly into the notebook
%matplotlib inline

``````

Read the expression data using pandas. Notice that pandas can read URLs (!), not just files on your computer!

``````

In [ ]:

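csv = "https://media.githubusercontent.com/media/olgabot/macosko2015/" \
"master/data/05_make_rentina_subsets_for_teaching/big_clusters_expression.csv"
# Assumes the first column holds the cell IDs, so use it as the row index
expression = pd.read_csv(csv, index_col=0)
print(expression.shape)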

``````
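
`pd.read_csv` accepts a local path, a URL, or any file-like object. Here's a minimal sketch on a tiny made-up table held in memory (the cell and gene names are invented for illustration). It assumes the first column holds the row labels, which is what `index_col=0` asks for:

```python
import io

import pandas as pd

# Made-up CSV text standing in for a real file or URL
csv_text = """,geneA,geneB,geneC
cell1,0,3,1
cell2,2,0,5
"""

# index_col=0 uses the first column as the row labels instead of as data
toy = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(toy.shape)
print(toy)
```

Without `index_col=0`, the cell names would end up as an ordinary text column, which gets in the way of numeric steps later on.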

### Exercise 1

Now use `pd.read_csv` to read the csv file of the cell metadata

``````

In [ ]:

csv = "https://media.githubusercontent.com/media/olgabot/macosko2015/" \

``````

``````

In [ ]:

csv = "https://media.githubusercontent.com/media/olgabot/macosko2015/" \

``````

To correlate columns of dataframes in pandas, you use the function `.corr`. Let's look at the documentation of `.corr`

1. Is the default method Pearson or Spearman correlation?
2. Can you correlate between rows, or only between columns?
``````

In [ ]:

expression.corr?

``````

Since `.corr` only correlates between columns, we need to transpose our dataframe, which swaps its rows and columns.

### Exercise 2

Transpose the expression matrix so the cells are the columns, which makes it easy to calculate correlations. How do you transpose a dataframe in `pandas`? (hint: google knows everything)

``````

In [ ]:

``````

``````

In [ ]:

expression_t = expression.T
print(expression_t.shape)

``````
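
To see why the transpose matters, here's a minimal sketch on a tiny made-up dataframe (the gene and cell names and values are invented). Since `.corr` compares columns, the shape of the result tells you what got correlated:

```python
import pandas as pd

# Made-up data: rows are cells, columns are genes
toy = pd.DataFrame(
    {
        "geneA": [1, 2, 3],
        "geneB": [3, 2, 1],
        "geneC": [1, 3, 2],
        "geneD": [2, 2, 5],
    },
    index=["cell1", "cell2", "cell3"],
)

# Columns are genes, so this is a gene-by-gene correlation matrix
gene_corr = toy.corr()
print(gene_corr.shape)

# After transposing, columns are cells, so this is cell-by-cell
cell_corr = toy.T.corr()
print(cell_corr.shape)
```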

### Exercise 3

Use `.corr` to calculate the Spearman correlation of the transposed expression dataframe. Make sure to `print` the `shape`, and show the `head` of the resulting dataframe.

``````

In [ ]:

``````

``````

In [ ]:

expression_corr = expression_t.corr(method='spearman')
print(expression_corr.shape)
expression_corr.head()

``````

### Pro tip: if your matrix is really big, here's a trick to make spearman correlations faster

Remember that Spearman correlation is equal to performing Pearson correlation on the ranks? Well, that's exactly what's happening inside the `.corr(method='spearman')` function! Every time it calculates Spearman, it converts the columns to ranks, and it may repeat that conversion because it has to do it for each pair. Let's save that work by converting to ranks FIRST. Let's take a look at the options for `.rank`:

``````

In [ ]:

expression_t.rank?

``````

Notice we can specify `axis=1` or `axis=0`, but what does that really mean? Was this ascending along rows, or ascending along columns?

To figure this out, let's use a small, simple dataframe:

``````

In [ ]:

df = pd.DataFrame([[5, 6, 7], [5, 6, 7], [5, 6, 7]])
df

``````

### Exercise 4

Try `axis=0` when using `rank` on this `df`

``````

In [ ]:

``````

``````

In [ ]:

df.rank(axis=0)

``````

Did that make ranks ascending along columns or along rows?

### Exercise 5

Now try `axis=1` when using `rank` on this `df`

``````

In [ ]:

``````

``````

In [ ]:

df.rank(axis=1)

``````

Did that make ranks ascending along columns or along rows?

### Exercise 6

To get the gene (row) ranks for each cell (column), do we want `axis=1` or `axis=0`? Perform `.rank` on the transposed expression matrix (`expression_t`), `print` the `shape` of the resulting ranks, and show the `head()` of it.

``````

In [ ]:

``````

``````

In [ ]:

ranks = expression_t.rank(axis=0)
print(ranks.shape)
ranks.head()

``````
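
As a sanity check on the `axis` choice, here's a small sketch with a made-up genes-by-cells dataframe laid out like `expression_t` (names and values are invented). With `axis=0`, the ranking runs down each column, so every cell gets its own ranking of the genes:

```python
import pandas as pd

# Made-up data: rows are genes, columns are cells (same layout as expression_t)
toy_t = pd.DataFrame(
    {"cell1": [10, 0, 5], "cell2": [1, 7, 3]},
    index=["geneA", "geneB", "geneC"],
)

# axis=0 ranks the values going down each column, one cell at a time
toy_ranks = toy_t.rank(axis=0)
print(toy_ranks)
```

In `cell1` the largest value belongs to `geneA`, so it gets the highest rank there, while in `cell2` the highest rank goes to `geneB`.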

### Exercise 6 (continued)

Now that you're armed with all this information, let's put the ranks to work. We'll compare the time it takes to run (the "runtime") `.corr(method="pearson")` on the ranks matrix vs `.corr(method="spearman")` on the expression matrix.

1. Perform Pearson correlation on the ranks.
2. Check that it is equal to the expression Spearman correlation.
3. Use the `%timeit` magic to check the runtimes of `.corr` on the ranks and expression matrices. (Feel free to calculate the expression correlation again, below)
    1. Note that when you use `%timeit`, you cannot assign any variables -- using an equals sign doesn't work here.
4. How much time did it take, in comparison? What's the order of magnitude difference?

Use as many cells as you need.

``````

In [ ]:

``````
``````

In [ ]:

``````
``````

In [ ]:

``````

``````

In [ ]:

%timeit expression_t.corr(method='spearman')
%timeit ranks.corr(method='pearson')

``````
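
The `%timeit` magic only exists inside IPython and Jupyter. In a plain Python script, the standard-library `timeit` module does a similar job. Here's a rough sketch on made-up random data (not the expression matrix), assuming `numpy` is installed:

```python
import timeit

import numpy as np
import pandas as pd

# Made-up random matrix standing in for the transposed expression data
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((50, 20)))
toy_ranks = toy.rank(axis=0)

# number=5 runs each call five times and returns the total time in seconds
spearman_seconds = timeit.timeit(lambda: toy.corr(method="spearman"), number=5)
pearson_seconds = timeit.timeit(lambda: toy_ranks.corr(method="pearson"), number=5)

print(f"spearman: {spearman_seconds:.4f} s")
print(f"pearson on ranks: {pearson_seconds:.4f} s")
```

The exact timings depend on your machine and your pandas version, so treat the numbers as a comparison, not as absolute values.
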
``````

In [ ]:

ranks_corr = ranks.corr(method='pearson')
print(ranks_corr.shape)

``````

Use inequality to see if any points are not the same. If this is equal to zero, then we know that they are ALL the same.

``````

In [ ]:

(ranks_corr != expression_corr).sum().sum()

``````

This is the flip side of checking for equality, which is a little trickier because then you have to know exactly how many items are in the matrix. Since we have a 300x300 matrix, that multiplication is easy enough to do in your head (300 x 300 = 90,000), so you can tell whether you got the right answer.

``````

In [ ]:

(ranks_corr == expression_corr).sum().sum()

``````
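
One caveat with `==` and `!=`: correlations are floating-point numbers, so two mathematically equal results can differ in the last few decimal places, and an exact comparison would then count them as different. A tolerance-based check is a common alternative. Here's a sketch using `np.allclose` on made-up random data (not the expression matrix), assuming `numpy` is installed:

```python
import numpy as np
import pandas as pd

# Made-up random data: 30 rows, 5 columns
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.random((30, 5)))

# Spearman directly, and Pearson on the column-wise ranks
toy_spearman = toy.corr(method="spearman")
toy_ranks_pearson = toy.rank(axis=0).corr(method="pearson")

# True when every entry matches within a small floating-point tolerance
matches = np.allclose(toy_spearman, toy_ranks_pearson)
print(matches)
```

If the exact comparison above reports a handful of mismatches, this kind of check can tell you whether they are only rounding noise.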

### Make a heatmap!!

Now we are ready to make a clustered heatmap! We'll use `seaborn`'s `sns.clustermap`. Let's read the documentation for `sns.clustermap`. What is the default distance metric and linkage method?

``````

In [ ]:

sns.clustermap?

``````

### Exercise 7

Now run `sns.clustermap` on either the ranks or expression correlation matrices, since they are equal :)

``````

In [ ]:

``````

``````

In [ ]:

sns.clustermap(expression_corr)

``````

How can we add the colors labeling the rows and columns? Check the documentation for `sns.clustermap` again:

### Exercise 8

``````

In [ ]:

``````

``````

In [ ]:

sns.clustermap?

``````

Since I am not a color design expert, I defer to color design experts in choosing my color palettes. One such expert is Cynthia Brewer, who made a ColorBrewer (hah!) list of color maps for both increasing quantity (shades), and for categories (differing colors).

As a reference, I like using this demo of every ColorBrewer scale. Hover over the palette to see its name.

Thankfully, `seaborn` has the ColorBrewer color maps built in. Let's see what the output looks like.

Remember -- we never make a variable without looking at it first!!

``````

In [ ]:

palette = sns.color_palette('Accent', n_colors=3)
palette

``````

Huh, that's a bunch of weird numbers. What do they mean? It turns out each one is a value from 0 to 1 representing the red, green, and blue (RGB) color channels that computers understand. But I'm not a computer... what am I supposed to do??

Turns out, `seaborn` also has a very convenient function called `palplot` to plot the entire palette. This lets us look at the variable without having to convert from RGB

``````

In [ ]:

sns.palplot(palette)

``````
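
If you ever do need those 0-to-1 numbers in a more familiar form, the conversion is simple arithmetic. Here's a small sketch in plain Python on one made-up color (it is not taken from any seaborn palette):

```python
# One made-up RGB color with channels between 0 and 1
rgb = (0.2, 0.6, 1.0)

# Scale each channel to the 0-255 range used by most image tools
rgb_255 = tuple(round(channel * 255) for channel in rgb)

# Format the three channels as a web-style hex code
hex_code = "#{:02x}{:02x}{:02x}".format(*rgb_255)
print(rgb_255, hex_code)
```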

### Exercise 9

- Get the color palette for the "Set2" colormap and specify that you want 6 colors (read the documentation of `sns.color_palette`)
- Plot the color palette

``````

In [ ]:

``````
``````

In [ ]:

``````

``````

In [ ]:

set2 = sns.color_palette('Set2', n_colors=6)
sns.palplot(set2)

``````

If you are more advanced and want access to more colormaps, I recommend checking out `palettable`.

## Assign colors to clusters

To set a specific color to each cluster, we'll need to see the unique clusters here. For an individual column (called a "Series" in pandas-speak), how can we get only the unique items?

### Exercise 10

Get the unique values from the column `"cluster_celltype_with_id"`. Remember, always look at the variable you created!

``````

In [ ]:

``````

``````

In [ ]:

cluster_ids_unique = cell_metadata.cluster_celltype_with_id.unique()
cluster_ids_unique

``````

## Detour: `zip` and `dict`

To map colors to each cluster name, we need to talk about some built-in functions in Python, called `zip` and `dict`

For this next part, we'll use the built-in function `zip` which is very useful. It acts like a zipper (like for clothes) to glue together the pairs of items in two lists:

``````

In [ ]:

english = ["hello", "goodbye", "no", "yes", "please", "thank you",]
spanish = ["hola", "adios", "no", "si"]
zip(english, spanish)

``````

To be memory efficient, this doesn't show us what's inside right away. To look inside a `zip` object, we can use `list`:

``````

In [ ]:

list(zip(english, spanish))

``````

### Exercise 11

What happened to "please" and "thank you" from `english`? Make another list, called `spanish2`, that contains the Spanish words for "please" and "thank you" (again, google knows everything), then call `zip` on `english` and `spanish2`. Don't forget to use `list` on them!

``````

In [ ]:

``````

``````

In [ ]:

english = ["hello", "goodbye", "no", "yes", "please", "thank you",]
spanish = ["hola", "adios", "no", "si", "por favor", "gracias"]
list(zip(english, spanish))

``````
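
The behavior behind this exercise is that `zip` stops as soon as its shortest input runs out. If you would rather keep the leftovers, the standard library has `itertools.zip_longest`. Here's a quick sketch with made-up lists:

```python
from itertools import zip_longest

numbers = [1, 2, 3, 4]
letters = ["a", "b"]

# zip stops as soon as the shorter list runs out
short_pairs = list(zip(numbers, letters))
print(short_pairs)

# zip_longest keeps going and pads the gaps with None by default
padded_pairs = list(zip_longest(numbers, letters))
print(padded_pairs)
```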

Now we'll use a dictionary `dict` to make a lookup table that uses the pairing made by `zip`, using the first item as the "key" (what you use to look up) and the second item as the "value" (the result of the lookup)

You can think of it as a translator -- use the word in English to look up the word in Spanish.

``````

In [ ]:

english_to_spanish = dict(zip(english, spanish))
english_to_spanish

``````

Now we can use English words to look up the word in Spanish! We use square brackets and the English word we want, to look up the Spanish word.

``````

In [ ]:

english_to_spanish['hello']

``````
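
What if the word isn't in the dictionary? Square brackets raise a `KeyError`, while `.get` lets you supply a fallback. Here's a small sketch with a made-up two-word dictionary:

```python
toy_lookup = {"hello": "hola", "goodbye": "adios"}

# Square brackets raise a KeyError when the key is missing
try:
    toy_lookup["cat"]
except KeyError:
    print("'cat' is not in the dictionary")

# .get returns the fallback value instead of raising
fallback = toy_lookup.get("cat", "unknown")
print(fallback)
```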

### Exercise 12

Make a `spanish_to_english` dictionary and look up the English word for "por favor"

``````

In [ ]:

``````

``````

In [ ]:

spanish_to_english = dict(zip(spanish, english))
spanish_to_english['por favor']

``````

Okay, detour over! Switching from linguistics back to biology :)

### Exercise 13

Use `dict` and `zip` to create a variable called `id_to_color` that assigns labels in `cluster_ids_unique` to a color in `set2`

``````

In [ ]:

``````

``````

In [ ]:

id_to_color = dict(zip(cluster_ids_unique, set2))
id_to_color

``````

Now we want to use this `id_to_color` lookup table to make a long list of colors for each cell.

``````

In [ ]:

``````

As an example, let's use the `celltype` column to make a list of each cell's celltype color first. Notice that we can use `cell_metadata.celltype` or `cell_metadata['celltype']` to get the column we want.

We can only use the 'dot' notation because our column name has no unfriendly characters like spaces, dashes, or dots -- characters that mean something special in Python.
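
Here's a small sketch of the difference, using a made-up dataframe with one friendly column name and one containing a space (both the column names and the values are invented for illustration):

```python
import pandas as pd

toy_metadata = pd.DataFrame(
    {"celltype": ["rod", "cone"], "cell type": ["rod", "cone"]}
)

# Dot notation works for a simple column name
dot_values = toy_metadata.celltype.tolist()
print(dot_values)

# A name with a space can only be reached with square brackets
bracket_values = toy_metadata["cell type"].tolist()
print(bracket_values)
```

Square brackets always work, so they are the safer habit when you don't control the column names.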

``````

In [ ]:

celltypes = cell_metadata.celltype.unique()
celltypes

``````
``````

In [ ]:

celltype_to_color = dict(zip(celltypes, sns.color_palette('Accent', n_colors=len(celltypes))))
celltype_to_color

``````

Now we'll use the existing column `cell_metadata.celltype` to make a list of colors for each celltype

``````

In [ ]:

per_cell_celltype_color = [celltype_to_color[celltype] for celltype in cell_metadata.celltype]

# Since this list is as long as our number of cells (300!), let's slice it and only look at the first 10
per_cell_celltype_color[:10]

``````

### Exercise 14

Make a variable called `per_cell_cluster_color` that uses the `id_to_color` dictionary to look up the color for each value in the `cluster_celltype_with_id` column of `cell_metadata`

``````

In [ ]:

``````

``````

In [ ]:

per_cell_cluster_color = [id_to_color[i] for i in cell_metadata.cluster_celltype_with_id]

# Since this list is as long as our number of cells (300!), let's slice it and only look at the first 10
per_cell_cluster_color[:10]

``````
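
A list comprehension is one way to do this lookup. Another option is the pandas `Series.map` method, which accepts a dictionary and keeps the cell labels as the index. Here's a sketch on made-up data (the cell names, celltypes, and hex colors are all invented):

```python
import pandas as pd

# Made-up celltype labels for three cells
toy_celltypes = pd.Series(
    ["rod", "cone", "rod"], index=["cell1", "cell2", "cell3"]
)
toy_to_color = {"rod": "#ff0000", "cone": "#0000ff"}

# .map looks up every value in the dictionary, one cell at a time
toy_colors = toy_celltypes.map(toy_to_color)
print(toy_colors)
```

The result is a Series instead of a list, so each color stays attached to its cell name.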

### Exercise 15

Now use the cluster colors to label the rows and columns in `sns.clustermap`.

``````

In [ ]:

``````

``````

In [ ]:

sns.clustermap(expression_corr, row_colors=per_cell_cluster_color, col_colors=per_cell_cluster_color)

``````

We can also combine the celltype and cluster colors we created to create a double-layer colormap!

``````

In [ ]:

combined_colors = [per_cell_cluster_color, per_cell_celltype_color]
len(combined_colors)

``````
``````

In [ ]:

sns.clustermap(expression_corr, row_colors=combined_colors, col_colors=combined_colors)

``````
``````

In [ ]:

``````