Note: This is not really a "network analysis" - we are only looking at the graph and seeing what cells are there. If you want to do more than just zoom in and look around at the cells in the graphs, I recommend using Cytoscape for visualizing networks.
In [1]:
# Interactive jupyter widgets - use IntSlider directly for more control
from ipywidgets import IntSlider, interact
# Convert RGB colors to hex for portability
from matplotlib.colors import rgb2hex
# Visualize networks
import networkx
# Numerical python
import numpy as np
# Pandas for dataframes
import pandas as pd
# K-nearest neighbors cell clustering from Dana Pe'er's lab
import phenograph
# Make color palettes
import seaborn as sns
%matplotlib inline
# Bokeh - interactive plotting in the browser
from bokeh.plotting import figure, show, output_file
from bokeh.models import HoverTool, ColumnDataSource
from bokeh.models.widgets import Panel, Tabs
from bokeh.layouts import widgetbox
from bokeh.io import output_notebook
# Local file: networkplots.py
import networkplots
# This line is required for the plots to appear in the notebooks
output_notebook()
At this point, you can follow along with either the pre-baked Macosko2015 amacrine data, or you can load in your own expression matrices. For the best experience, make sure that the rows are cells and the columns are gene names.
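If you are loading your own data, a minimal sketch with pandas might look like the following (the filenames here are placeholders, not files shipped with this tutorial):
import pandas as pd

# Hypothetical filenames -- replace with your own files.
# Rows should be cells, columns should be gene names.
counts = pd.read_csv('expression_matrix.csv', index_col=0)
cell_metadata = pd.read_csv('cell_metadata.csv', index_col=0)
gene_metadata = pd.read_csv('gene_metadata.csv', index_col=0)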
In [2]:
import macosko2015
counts, cell_metadata, gene_metadata = macosko2015.load_big_clusters()
counts.head()
Out[2]:
Calculate the correlation between cells. Ranking each cell's genes first and then correlating the ranks gives the Spearman (rank) correlation:
In [3]:
correlations = counts.T.rank().corr()
print(correlations.shape)
correlations.head()
Out[3]:
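As an optional sanity check, you can compare one entry of this matrix against scipy.stats.spearmanr; the two should agree up to floating point error:
from scipy.stats import spearmanr

# Spearman correlation of the first two cells, computed directly
cell_a, cell_b = counts.index[:2]
rho, _ = spearmanr(counts.loc[cell_a], counts.loc[cell_b])
print(rho, correlations.loc[cell_a, cell_b])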
Correlation is not the same as distance. If two things are exactly the same, their correlation value is 1; but in space, if two things are exactly the same, the distance between them is 0. So correlation is not a distance: it is a similarity metric, where bigger means more similar, whereas we want a dissimilarity (aka distance) metric.
Take a look for yourself: many of the correlation values are near zero (not correlated), with a blip near 1 (the self-correlations).
In [4]:
sns.distplot(correlations.values.flat)
Out[4]:
But for building a K-nearest neighbors graph, we want the closest things (in distance space) to be actually close. So we'll convert our correlation ($\rho$) into a distance ($d$) using this equation:
$$ d = \sqrt{2(1-\rho)} $$
You can look at the code for networkplots.correlation_to_distance to convince yourself that's actually what it's doing:
In [5]:
networkplots.correlation_to_distance??
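As a rough sketch (an assumption about the implementation, not the actual networkplots code), the conversion boils down to applying that formula elementwise to the correlation dataframe:
import numpy as np

def correlation_to_distance_sketch(correlations):
    # Perfectly correlated cells (rho = 1) end up at distance 0,
    # perfectly anti-correlated cells (rho = -1) at distance 2
    return np.sqrt(2 * (1 - correlations))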
In [6]:
# YOUR CODE HERE
In [7]:
distances = networkplots.correlation_to_distance(correlations)
distances.head()
Out[7]:
In [8]:
# YOUR CODE HERE
In [9]:
sns.distplot(distances.values.flat)
Out[9]:
Now we'll run phenograph.cluster, which returns three items:
communities: the cluster labels of each cell
sparse_matrix: a sparse matrix representing the connections between cells in the graph
Q: the modularity score. Higher is better, and the highest possible value is 1.
In [10]:
communities, sparse_matrix, Q = phenograph.cluster(distances, k=10)
Let's take a look at each of these returned values
In [11]:
communities
Out[11]:
In [12]:
sparse_matrix
Out[12]:
In [13]:
Q
Out[13]:
It looks like communities labels each cell as belonging to a particular cluster, sparse_matrix is a compressed data type whose contents we can't read directly from its summary, and Q is the modularity value.
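If you do want to peek inside the sparse matrix, you can convert a small corner of it to a regular (dense) array. This is just an optional aside, using scipy's standard sparse-matrix methods:
# Look at the connections among the first five cells
dense_corner = sparse_matrix.tocsr()[:5, :5].toarray()
print(dense_corner)
# Total number of nonzero entries (graph connections)
print(sparse_matrix.nnz)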
To be able to lay out our graph in two dimensions, we'll use the networkx Python package to build the graph and lay out the cells and edges.
In [14]:
graph = networkx.from_scipy_sparse_matrix(sparse_matrix)
graph
Out[14]:
We'll use the "Spring layout" which is a force-directed layout that pushes cells and edges away from each other. We'll use the built-in networkx function called spring_layout
on our graph:
In [15]:
positions = networkx.spring_layout(graph)
positions
Out[15]:
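The spring layout is stochastic, so your plot will look a little different every time you run it. In recent versions of networkx, spring_layout accepts a few optional parameters if you want a reproducible or more spread-out layout (the values below are just examples, not recommendations):
# seed fixes the random starting positions; k sets the optimal distance
# between nodes; iterations controls how long the simulation runs
positions = networkx.spring_layout(graph, seed=0, k=0.1, iterations=50)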
In [16]:
networkplots.get_nodes_specs??
Looks like this function can handle the case where we already have some clusters defined in our metadata! Let's look at our cell_metadata and remind ourselves which column we might like to use for the other_cluster_col value.
In [17]:
cell_metadata.head()
Out[17]:
In this case, I'd like to use the cluster_n_celltype column.
Let's take a look at the code again to see how the networkplots.get_nodes_specs function uses the metadata:
In [18]:
networkplots.get_nodes_specs??
Looks like this function uses another one, called labels_to_colors -- what does that do?
In [19]:
networkplots.labels_to_colors??
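In case the ?? output isn't available, here's a rough sketch of what a label-to-color mapping like this typically does, using the seaborn and rgb2hex imports from the top of the notebook (the real labels_to_colors may differ in its details):
def labels_to_colors_sketch(labels, palette='Set2'):
    # One palette color per unique label, converted to hex for Bokeh
    unique_labels = sorted(set(labels))
    colors = sns.color_palette(palette, n_colors=len(unique_labels))
    label_to_hex = dict(zip(unique_labels, (rgb2hex(c) for c in colors)))
    return [label_to_hex[label] for label in labels]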
Now let's use get_nodes_specs to create a dataframe of information about nodes so we can plot them.
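Before calling the real thing, here is a rough, hypothetical sketch of the kind of dataframe such a function assembles: layout coordinates, community labels, the cluster labels from the metadata, and a hex color for each labeling. It reuses labels_to_colors_sketch from above and is only meant to show the idea; the actual networkplots.get_nodes_specs handles more details:
def get_nodes_specs_sketch(positions, cell_metadata, barcodes, communities,
                           other_cluster_col, palette='Set2'):
    # x, y coordinates from the spring layout, one row per cell
    nodes = pd.DataFrame(positions).T
    nodes.columns = ['x', 'y']
    nodes['barcode'] = list(barcodes)
    # Cluster labels from phenograph and from the paper's metadata
    nodes['community'] = communities
    nodes[other_cluster_col] = cell_metadata.loc[barcodes, other_cluster_col].values
    # Hex colors for both labelings
    nodes['community_color'] = labels_to_colors_sketch(communities, palette)
    nodes['other_cluster_color'] = labels_to_colors_sketch(
        nodes[other_cluster_col], palette)
    return nodes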
In [26]:
nodes_specs = networkplots.get_nodes_specs(
positions, cell_metadata, distances.index,
communities, other_cluster_col='cluster_n_celltype',
palette='Set2')
print(nodes_specs.shape)
nodes_specs.head()
Out[26]:
We've now created a dataframe containing the x, y positions, the community labels, and the colors for the communities and the other clusters we were interested in. Now we want to do the same for the edges (the lines between cells): convert the graph and the positions dict into a dataframe with edge information.
Let's take a look at the function we'll use:
In [21]:
networkplots.get_edges_specs??
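Purely as a hypothetical sketch of the idea (the real networkplots.get_edges_specs may differ): for every edge in the graph we need the x and y coordinates of its two endpoints, plus the edge weight so that stronger connections can be drawn more opaque:
def get_edges_specs_sketch(graph, positions):
    rows = []
    for u, v, data in graph.edges(data=True):
        rows.append({
            # Start and end coordinates of the line for this edge
            'xs': [positions[u][0], positions[v][0]],
            'ys': [positions[u][1], positions[v][1]],
            # Use the edge weight (if any) to set the line transparency
            'alphas': data.get('weight', 1.0),
        })
    return pd.DataFrame(rows)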
In [22]:
# YOUR CODE HERE
In [23]:
edges_specs = networkplots.get_edges_specs(graph, positions)
print(edges_specs.shape)
edges_specs.head()
Out[23]:
To be able to use the dataframes with the Bokeh plotting library, we need to convert them into ColumnDataSource objects.
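A ColumnDataSource is essentially a mapping from column names to columns of data that Bokeh glyphs can refer to by name. A trivial illustration, not part of the analysis:
source = ColumnDataSource(pd.DataFrame({'x': [0, 1, 2], 'y': [2, 1, 0]}))
print(source.column_names)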
In [24]:
nodes_source = ColumnDataSource(nodes_specs)
edges_source = ColumnDataSource(edges_specs)
In [25]:
# --- First tab: KNN clustering --- #
tab1 = networkplots.plot_graph(nodes_source, edges_source,
legend_col='community',
color_col='community_color', tab=True,
title='KNN Clustering')
# --- Second tab: Clusters from paper --- #
tab2 = networkplots.plot_graph(nodes_source, edges_source,
legend_col='cluster_n_celltype', tab=True,
color_col='other_cluster_color',
title="Clusters from paper")
tabs = Tabs(tabs=[tab1, tab2])
show(tabs)
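If you want to save the interactive figure as a standalone HTML file you can share, call output_file (already imported above) before show; the filename below is just an example:
output_file('knn_clustering_graph.html', title='KNN clustering of retina cells')
show(tabs)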