The K-means section of this notebook was put together by [Jake Vanderplas](http://www.vanderplas.com). Source and license info is on [GitHub](https://github.com/jakevdp/sklearn_tutorial/).

Clustering: K-Means In-Depth

Here we'll explore K-means clustering, an unsupervised clustering technique.

We'll start with our standard set of initial imports:


In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# use seaborn plotting defaults
import seaborn as sns; sns.set()

Introducing K-Means

K-means is an algorithm for unsupervised clustering: that is, finding clusters in data based on the data attributes alone (not the labels).

K-means is a relatively easy-to-understand algorithm. It searches for cluster centers which are the mean of the points assigned to them, such that every point is closer to its own cluster center than to any other.

Let's look at how KMeans operates on the simple clusters we looked at previously. To emphasize that this is unsupervised, we won't color the points by their cluster labels:


In [ ]:
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=300, centers=4,
                  random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], s=50);

By eye, it is relatively easy to pick out the four clusters. If you were to perform an exhaustive search for the different segmentations of the data, however, the search space would be exponential in the number of points. Fortunately, there is a well-known Expectation Maximization (EM) procedure which scikit-learn implements, so that KMeans can be solved relatively quickly.


In [ ]:
from sklearn.cluster import KMeans
est = KMeans(n_clusters=4)  # 4 clusters
est.fit(X)
y_kmeans = est.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='rainbow');

The algorithm identifies the four clusters of points in a manner very similar to what we would do by eye!
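We can also look at where the cluster centers ended up; scikit-learn stores them in the fitted estimator's cluster_centers_ attribute. A quick sketch overlaying them on the data:


In [ ]:
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='rainbow')

# cluster_centers_ has shape (4, 2): one row per fitted cluster center
centers = est.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);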

The K-Means Algorithm: Expectation Maximization

K-Means is an example of an algorithm which uses an Expectation-Maximization approach to arrive at the solution. Expectation-Maximization is a two-step approach which works as follows:

  1. Guess some cluster centers
  2. Repeat until converged:
    A. Assign points to the nearest cluster center
    B. Set the cluster centers to the mean of their assigned points
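To make the two steps concrete, here is a minimal hand-rolled sketch of that loop on our blob data (the fixed iteration count and variable names are ours, purely for illustration):


In [ ]:
from sklearn.metrics import pairwise_distances_argmin

# 1. Guess some cluster centers: pick 4 random points from X
rng = np.random.RandomState(2)
centers = X[rng.permutation(X.shape[0])[:4]]

for _ in range(10):  # in practice, repeat until the centers stop moving
    # 2A. Assign each point to the nearest cluster center
    labels = pairwise_distances_argmin(X, centers)
    # 2B. Set each center to the mean of its assigned points
    # (this simple sketch assumes no cluster ever ends up empty)
    centers = np.array([X[labels == j].mean(axis=0) for j in range(4)])

plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='rainbow');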

Let's quickly visualize this process:


In [ ]:
from networkplots import plot_kmeans_interactive
plot_kmeans_interactive();

This algorithm will (often) converge to the optimal cluster centers.

K-Means Caveats

  • The algorithm is guaranteed to converge, but only to a local optimum, which may not be the globally best clustering; for that reason, scikit-learn by default runs several random initializations (the n_init parameter) and keeps the best result.

  • Also, the number of clusters must be set beforehand; there are other clustering algorithms for which this requirement may be lifted.

  • Clusters must be of similar size: random initialization tends to place centers in the larger clusters, and the smaller clusters can end up being ignored (see the sketch below).
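As a quick illustration of the second caveat, K-means always returns exactly as many clusters as you ask for, whether or not that number is meaningful. A sketch on the blob data from above, deliberately asking for too many clusters:


In [ ]:
# 6 requested clusters on data with 4 blobs: some blobs get split in two
est_bad = KMeans(n_clusters=6, n_init=10, random_state=0)
y_bad = est_bad.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_bad, s=50, cmap='rainbow');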

Enter... networks!

Let's take a step back and talk about graph definitions for a second. A graph (or "network") is a set of nodes (or "vertices") that are connected to each other via edges (or "links"):

  • A graph $G = (V, E)$ is a set of vertices $V$ and edges $E$

  • Graphs can be directed, where each edge points in a specific direction from one vertex to another

  • Or graphs can be undirected, where the edges have no direction

In this class, we'll be using undirected graphs.
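As a tiny concrete example, here is how such a graph could be built with the networkx library (our choice for illustration; the notebook itself doesn't depend on it):


In [ ]:
import networkx as nx

# An undirected graph G = (V, E) with vertices V = {a, b, c, d}
# and edges E = {(a, b), (b, c), (c, a), (c, d)}
G = nx.Graph()
G.add_edges_from([('a', 'b'), ('b', 'c'), ('c', 'a'), ('c', 'd')])

print(G.number_of_nodes(), G.number_of_edges())  # 4 vertices, 4 edges
print(list(G.neighbors('c')))  # 'c' is connected to 'b', 'a', and 'd'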

Community detection!

Finding community structure within networks is a well-established problem in the social sciences. Given pairwise connections between people, can you guess what the local communities are? How can you partition the graph into a collection of tightly knit sub-graphs?

PhenoGraph

  • PhenoGraph creates a $k$-nearest neighbor graph, where each cell is connected to the $k$ cells it is closest to (in our case, closest in Spearman correlation)
    • Notice that $k$ here indicates the number of connections each cell is allowed to have, whereas in $k$-means clustering $k$ indicated how many clusters you thought were in your data.
  • Then, after graph creation, PhenoGraph detects communities in the graph using a measure called "modularity," which quantifies how densely connected a subgroup is compared to what you would expect if the edges between nodes were randomly distributed (a rough sketch of both pieces follows after this list)
  • Modularity ($Q$) ranges from -1 to 1, where values near 1 mean the subgraphs are densely connected internally and only sparsely connected to each other, and negative values mean the opposite
    • Modularity has a resolution limit. The smallest group it can find is limited by the total number of connections (edges) in the graph. If the number of edges is $m$, then the smallest findable module has on the order of $\sqrt{2m}$ edges. How does the number of neighbors $k$ affect the total number of edges?
  • This is an unsupervised algorithm: you don't need to know the number of groups in the data before you try using it
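Here is a rough sketch of those two pieces, reusing the 2-D blob data from the K-means section just to have something concrete: we build the $k$-nearest-neighbor graph with scikit-learn and score an arbitrary partition with networkx's modularity function. (PhenoGraph itself instead searches for the partition that maximizes $Q$, using the Louvain method.)


In [ ]:
import networkx as nx
from networkx.algorithms.community import modularity
from sklearn.neighbors import NearestNeighbors

# Connect each point to its k nearest neighbors
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own nearest neighbor
_, idx = nn.kneighbors(X)

G = nx.Graph()
G.add_edges_from((i, int(j)) for i, row in enumerate(idx) for j in row[1:])

# Modularity of a made-up two-way split of the nodes, just to show the measure
n = G.number_of_nodes()
split = [set(range(n // 2)), set(range(n // 2, n))]
print(modularity(G, split))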

We'll be using the phenograph package from Dana Pe'er's lab, which was originally published in this paper: http://www.cell.com/cell/abstract/S0092-8674(15)00637-6

As a reference, we'll be performing clustering on the Spearman correlation between cells.
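The networkplots.explore_phenograph() helper below wraps all of this up interactively, but for orientation, a direct call to the package looks roughly like this (a sketch: expression stands in for a hypothetical cells-by-genes matrix):


In [ ]:
import phenograph

# Spearman correlation between every pair of cells
# (stats.spearmanr treats columns as variables, hence the transpose)
rho, _ = stats.spearmanr(expression.T)

# k = number of nearest neighbors per cell, NOT a number of clusters
communities, graph, Q = phenograph.cluster(rho, k=30)
print(Q)  # modularity of the partition PhenoGraph found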


In [ ]:
from bokeh.io import output_notebook

# This line is required for the plots to appear in the notebooks
output_notebook()

In [ ]:
%load_ext autoreload
%autoreload 2

In [ ]:
import networkplots

networkplots.explore_phenograph()

Questions about PhenoGraph, community detection, and $k$-nearest-neighbor graphs

  1. How does changing $k$ affect the graph creation?
    1. Do you get more or fewer clusters with smaller $k$?
    2. Do you get more or fewer clusters with larger $k$?
    3. If you want cells to be more similar, would you use a smaller or larger $k$?
    4. If you want cells to be more different, would you use a smaller or larger $k$?
  2. How does changing min_cluster_size affect the number of clusters?
  3. Which dataset has more distinct clusters, the amacrine cells or the "big clusters"?
    1. For the amacrine data, do you believe that these are all cells of truly different types or on some kind of continuum?
    2. Would you use the clusters as-is for both the datasets, or would you merge some of them?
  4. How does the "lowrank" or "smoothed" data perform, compared to the "raw" counts data? Are cells more or less connected to each other?
  5. How does the metric affect the clustering?