Now we'll break down how to read the clustered heatmap we made in 1.3_explore_gene_dropout_via_distance_correlation_linkage_clustering
In [ ]:
# Import the pandas dataframe library
import pandas as pd
# Import the seaborn library for plotting
import seaborn as sns
# Put all the plots directly into the notebook
%matplotlib inline
Read the expression data using pandas. Notice that pandas can read URLs (!), not just files on your computer!
In [ ]:
csv = "https://media.githubusercontent.com/media/olgabot/macosko2015/" \
"master/data/05_make_rentina_subsets_for_teaching/big_clusters_expression.csv"
expression = pd.read_csv(csv, index_col=0)
print(expression.shape)
expression.head()
In [ ]:
csv = "https://media.githubusercontent.com/media/olgabot/macosko2015/" \
"master/data/05_make_rentina_subsets_for_teaching/big_clusters_cell_metadata.csv"
# YOUR CODE HERE
In [ ]:
csv = "https://media.githubusercontent.com/media/olgabot/macosko2015/" \
"master/data/05_make_rentina_subsets_for_teaching/big_clusters_cell_metadata.csv"
cell_metadata = pd.read_csv(csv, index_col=0)
print(cell_metadata.shape)
cell_metadata.head()
To correlate columns of dataframes in pandas, you use the .corr method. Let's look at the documentation of .corr:
In [ ]:
expression.corr?
Since .corr only correlates between columns, we need to transpose our dataframe. Here's a little animation of matrix transposition from Wikipedia:
Transpose the expression matrix so the cells are the columns, which makes it easy to calculate correlations. How do you transpose a dataframe in pandas? (hint: google knows everything)
In [ ]:
# YOUR CODE HERE
In [ ]:
expression_t = expression.T
print(expression_t.shape)
expression_t.head()
In [ ]:
# YOUR CODE HERE
In [ ]:
expression_corr = expression_t.corr(method='spearman')
print(expression_corr.shape)
expression_corr.head()
Remember that Spearman correlation is equal to performing Pearson correlation on the ranks? Well, that's exactly what's happening inside .corr(method='spearman')! Every time it calculates a Spearman correlation, it converts the values to ranks again for each pair of columns, so the same ranking work gets repeated over and over. Let's cut that repeated work out by converting to ranks FIRST. Let's take a look at the options for .rank:
In [ ]:
expression_t.rank?
Notice we can specify axis=1 or axis=0, but what does that really mean? Do the ranks ascend along rows, or along columns?
To figure this out, let's use a small, simple dataframe:
In [ ]:
df = pd.DataFrame([[5, 6, 7], [5, 6, 7], [5, 6, 7]])
df
In [ ]:
# YOUR CODE HERE
In [ ]:
df.rank(axis=0)
In [ ]:
# YOUR CODE HERE
In [ ]:
df.rank(axis=1)
Did that make ranks ascending along columns or along rows?
In [ ]:
# YOUR CODE HERE
In [ ]:
ranks = expression_t.rank(axis=0)
print(ranks.shape)
ranks.head()
Now that you're armed with all this information, we'll calculate the ranks. While you're at it, let's compare the runtime of .corr(method="pearson") on the ranks matrix vs .corr(method="spearman") on the expression matrix.
Use the %timeit magic to check the runtimes of .corr on the ranks and expression matrices. (Feel free to calculate the expression correlation again, below.) Note that with %timeit, you cannot assign any variables -- using an equals sign doesn't work here. Use as many cells as you need.
In [ ]:
# YOUR CODE HERE
In [ ]:
# YOUR CODE HERE
In [ ]:
# YOUR CODE HERE
In [ ]:
%timeit expression_t.corr(method='spearman')
%timeit ranks.corr(method='pearson')
In [ ]:
ranks_corr = ranks.corr(method='pearson')
print(ranks_corr.shape)
ranks_corr.head()
Use the inequality operator (!=) to count how many entries are not the same. If this sum is equal to zero, then we know that they are ALL the same.
In [ ]:
(ranks_corr != expression_corr).sum().sum()
The flip side is checking for equality, which is a little trickier because then you have to know exactly how many entries are in the matrix. Since we have a 300x300 matrix, that multiplication (90,000) is easy enough to do in your head, so you can confirm you got the right answer.
In [ ]:
(ranks_corr == expression_corr).sum().sum()
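As a side note, pandas DataFrames also have an .equals method that collapses this check into a single True/False. This is a small sketch that assumes the ranks_corr and expression_corr variables created above, and that their values match exactly (not just approximately):
In [ ]:
# Alternative check: a single boolean instead of a count of matching entries
ranks_corr.equals(expression_corr)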
In [ ]:
sns.clustermap?
In [ ]:
# YOUR CODE HERE
In [ ]:
sns.clustermap(expression_corr)
In [ ]:
# YOUR CODE HERE
In [ ]:
sns.clustermap?
Since I am not a color design expert, I defer to the experts when choosing my color palettes. One such expert is Cynthia Brewer, who made ColorBrewer (hah!), a collection of color maps both for increasing quantities (sequential shades) and for categories (distinct colors).
As a reference, I like using this demo of every ColorBrewer scale. Hover over the palette to see its name.
Thankfully, seaborn has the ColorBrewer color maps built in. Let's see what this output is.
Remember -- we never make a variable without looking at it first!!
In [ ]:
palette = sns.color_palette('Accent', n_colors=3)
palette
Huh, that's a bunch of weird numbers. What do they mean? It turns out each number is a value from 0 to 1 representing the red, green, and blue (RGB) color channels that computers understand. But I'm not a computer ... what am I supposed to do??
It turns out seaborn also has a very convenient function called palplot to plot the entire palette. This lets us look at the variable without having to convert from RGB in our heads.
In [ ]:
sns.palplot(palette)
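As an aside, if you prefer the familiar hex color codes, matplotlib can convert these 0-to-1 RGB tuples. This is just a sketch using the palette variable we made above:
In [ ]:
# Convert each (red, green, blue) tuple into a hex string
from matplotlib.colors import to_hex

[to_hex(color) for color in palette]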
In [ ]:
# YOUR CODE HERE
In [ ]:
# YOUR CODE HERE
In [ ]:
set2 = sns.color_palette('Set2', n_colors=6)
sns.palplot(set2)
If you are more advanced and want access to more colormaps, I recommend checking out palettable.
To assign a specific color to each cluster, we'll need to see the unique clusters here. For an individual column (called a "Series" in pandas-speak), how can we get only the unique items?
Get the unique values from the column "cluster_celltype_with_id". Remember, always look at the variable you created!
In [ ]:
# YOUR CODE HERE
In [ ]:
cluster_ids_unique = cell_metadata['cluster_celltype_with_id'].unique()
cluster_ids_unique
zip and dict
To map colors to each cluster name, we need to talk about two of Python's built-in functions: zip and dict.
For this next part, we'll use the built-in function zip, which is very useful. It acts like a zipper (like on clothes), gluing together the pairs of items in two lists:
In [ ]:
english = ["hello", "goodbye", "no", "yes", "please", "thank you",]
spanish = ["hola", "adios", "no", "si", "por favor", "gracias"]
zip(english, spanish)
To be memory efficient, this doesn't show us what's inside right away. To look inside a zip object, we can use list:
In [ ]:
list(zip(english, spanish))
In [ ]:
# YOUR CODE HERE
In [ ]:
english = ["hello", "goodbye", "no", "yes", "please", "thank you",]
spanish = ["hola", "adios", "no", "si", "por favor", "gracias"]
list(zip(english, spanish))
Now we'll use a dictionary (dict) to make a lookup table from the pairing made by zip, using the first item as the "key" (what you use to look up) and the second item as the "value" (the result of the lookup).
You can think of it as a translator -- use the word in English to look up the word in Spanish.
In [ ]:
english_to_spanish = dict(zip(english, spanish))
english_to_spanish
Now we can use English words to look up words in Spanish! We put the English word we want inside square brackets to look up the Spanish word.
In [ ]:
english_to_spanish['hello']
In [ ]:
# YOUR CODE HERE
In [ ]:
spanish_to_english = dict(zip(spanish, english))
spanish_to_english['por favor']
In [ ]:
# YOUR CODE HERE
In [ ]:
id_to_color = dict(zip(cluster_ids_unique, set2))
id_to_color
Now we want to use this id_to_color lookup table to make a long list of colors, one for each cell.
In [ ]:
cell_metadata.head()
As an example, let's use the celltype column to make a list of celltype colors first. Notice that we can use either cell_metadata.celltype or cell_metadata['celltype'] to get the column we want.
We can only use the 'dot' notation because our column name has no unfriendly characters like spaces, dashes, or dots -- characters that mean something special in Python.
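For instance, here's a tiny made-up dataframe (not part of our retina data) showing why a column name with a space forces you to use the bracket notation:
In [ ]:
# A toy dataframe: the column whose name contains a space can only be accessed with brackets
toy = pd.DataFrame({'nice_name': [1, 2], 'has a space': [3, 4]})
print(toy.nice_name)       # dot notation works for a friendly name
print(toy['has a space'])  # bracket notation is required; `toy.has a space` would be a SyntaxError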
In [ ]:
celltypes = cell_metadata.celltype.unique() # Could also use cell_metadata['celltype'].unique()
celltypes
In [ ]:
celltype_to_color = dict(zip(celltypes, sns.color_palette('Accent', n_colors=len(celltypes))))
celltype_to_color
Now we'll use the existing column cell_metadata.celltype to make a list of colors, one for each cell, based on its celltype.
In [ ]:
per_cell_celltype_color = [celltype_to_color[celltype] for celltype in cell_metadata.celltype]
# Since this list is as long as our number of cells (300!), let's slice it and only look at the first 5
per_cell_celltype_color[:5]
In [ ]:
# YOUR CODE HERE
In [ ]:
per_cell_cluster_color = [id_to_color[i] for i in cell_metadata.cluster_celltype_with_id]
# Since this list is as long as our number of cells (300!), let's slice it and only look at the first 10
per_cell_cluster_color[:10]
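As a side note, pandas can do this lookup for us in one go: Series.map applies the id_to_color dictionary to every entry and returns a Series instead of a list. This sketch assumes the id_to_color dictionary and cell_metadata dataframe from above:
In [ ]:
# Equivalent lookup using pandas: map the dictionary over the whole column at once
per_cell_cluster_color_alt = cell_metadata['cluster_celltype_with_id'].map(id_to_color)
per_cell_cluster_color_alt.head()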
In [ ]:
# YOUR CODE HERE
In [ ]:
sns.clustermap(expression_corr, row_colors=per_cell_cluster_color, col_colors=per_cell_cluster_color)
We can also combine the celltype and cluster colors we made to create a double layer of row and column colors!
In [ ]:
combined_colors = [per_cell_cluster_color, per_cell_celltype_color]
len(combined_colors)
In [ ]:
sns.clustermap(expression_corr, row_colors=combined_colors, col_colors=combined_colors)
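If you want to keep this figure, sns.clustermap returns a ClusterGrid object whose savefig method writes it to disk. The filename below is just an example:
In [ ]:
# Save the clustered heatmap to a file (example filename)
grid = sns.clustermap(expression_corr, row_colors=combined_colors, col_colors=combined_colors)
grid.savefig('expression_corr_clustermap.png')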
In [ ]: