The K-means section of this notebook was put together by [Jake Vanderplas](http://www.vanderplas.com). Source and license info is on [GitHub](https://github.com/jakevdp/sklearn_tutorial/).
Here we'll explore K-means, an unsupervised clustering technique.
We'll start with our standard set of initial imports:
In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
# use seaborn plotting defaults
import seaborn as sns; sns.set()
K-means is an algorithm for unsupervised clustering: that is, finding clusters in data based on the data attributes alone (not the labels).
K-means is a relatively easy-to-understand algorithm. It searches for cluster centers which are the mean of the points within them, such that every point is closest to the cluster center it is assigned to.
Let's look at how K-means operates on the simple clusters we looked at previously. To emphasize that this is unsupervised, we'll not plot the colors of the clusters:
In [ ]:
from sklearn.datasets import make_blobs

# Generate four well-separated blobs of points
X, y = make_blobs(n_samples=300, centers=4,
                  random_state=0, cluster_std=0.60)
plt.scatter(X[:, 0], X[:, 1], s=50);
By eye, it is relatively easy to pick out the four clusters. If you were to perform an exhaustive search for the different segmentations of the data, however, the search space would be exponential in the number of points. Fortunately, there is a well-known Expectation Maximization (EM) procedure which scikit-learn implements, so that KMeans can be solved relatively quickly.
In [ ]:
from sklearn.cluster import KMeans
est = KMeans(n_clusters=4)  # 4 clusters
est.fit(X)
y_kmeans = est.predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='rainbow');
The algorithm identifies the four clusters of points in a manner very similar to what we would do by eye!
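Because each center is just the mean of its assigned points, we can pull the fitted centers out of the estimator via its `cluster_centers_` attribute and overlay them on the data:
In [ ]:
# Overlay the fitted cluster centers (the mean of each cluster) on the data
centers = est.cluster_centers_
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='rainbow')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5);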
K-means is an example of an algorithm which uses an Expectation-Maximization approach to arrive at the solution. Expectation-Maximization here is a two-step procedure which works as follows:

1. Guess some cluster centers.
2. Repeat until converged:
   - E-step: assign each point to the nearest cluster center.
   - M-step: move each cluster center to the mean of the points assigned to it.
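To make the two steps concrete, here is a minimal NumPy sketch of that loop. This is not scikit-learn's implementation (which adds smarter initialization, multiple restarts, and a proper convergence check), and for simplicity it assumes no cluster ever ends up empty:
In [ ]:
def simple_kmeans(X, n_clusters, seed=2, n_iter=10):
    # Guess some cluster centers by picking random data points
    rng = np.random.RandomState(seed)
    centers = X[rng.permutation(X.shape[0])[:n_clusters]]
    for _ in range(n_iter):
        # E-step: assign each point to its nearest center
        labels = ((X[:, None, :] - centers) ** 2).sum(-1).argmin(1)
        # M-step: move each center to the mean of its assigned points
        centers = np.array([X[labels == k].mean(0)
                            for k in range(n_clusters)])
    return centers, labels

centers, labels = simple_kmeans(X, n_clusters=4)
plt.scatter(X[:, 0], X[:, 1], c=labels, s=50, cmap='rainbow');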
Let's quickly visualize this process:
In [ ]:
from networkplots import plot_kmeans_interactive
plot_kmeans_interactive();
This algorithm will often converge to a good set of cluster centers, but convergence to the globally optimal solution is not guaranteed; for that reason, scikit-learn by default runs several random initializations and keeps the best result.
Also, the number of clusters must be set beforehand; there are other clustering algorithms for which this requirement may be lifted.
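Both of these knobs are exposed directly on the scikit-learn estimator: `n_clusters` fixes the number of clusters, and `n_init` controls how many random restarts are tried before keeping the run with the lowest inertia (within-cluster sum of squared distances):
In [ ]:
# Explicitly set the number of clusters and random restarts;
# the fit with the lowest inertia_ is the one that is kept
est = KMeans(n_clusters=4, n_init=10, random_state=0)
est.fit(X)
print(est.inertia_)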
Let's take a step back and talk about graph definitions for a second. A graph (or "network") is a set of nodes (or "vertices") that are connected to each other via edges (or "links"):
In this class, we'll be using undirected graphs.
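As a quick illustration (this uses the `networkx` package, which is our assumption here; the notebook hasn't imported it above), an undirected graph is nothing more than a set of nodes plus the edges joining them:
In [ ]:
import networkx as nx

# Build a small undirected graph: four nodes joined by four edges
G = nx.Graph()
G.add_edges_from([('a', 'b'), ('b', 'c'), ('c', 'a'), ('c', 'd')])
print(sorted(G.nodes()), sorted(G.edges()))
nx.draw(G, with_labels=True)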
Finding community structure within networks is a well-established problem in the social sciences. Given pairwise connections between people, can you guess what the local communities are? How can you partition the graph into a bunch of mini-graphs?
We'll be using the phenograph package from Dana Pe'er's lab, which was originally published in this paper: http://www.cell.com/cell/abstract/S0092-8674(15)00637-6
As a reference, we'll be performing clustering on the Spearman correlation between cells.
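As a rough sketch of what that looks like (the `explore_phenograph` helper below wraps the real workflow; the toy `expression` matrix and the exact `phenograph.cluster` call here are our assumptions, not this notebook's code):
In [ ]:
import phenograph

# Toy stand-in for real data: a hypothetical (n_cells, n_genes) matrix
expression = np.random.RandomState(0).poisson(2, size=(100, 50))

# spearmanr treats columns as variables, so transpose to get a
# cell-by-cell correlation matrix
corr, _ = stats.spearmanr(expression.T)

# PhenoGraph builds a k-nearest-neighbor graph over the rows and runs
# community detection on it, returning cluster labels, the graph itself,
# and the modularity Q of the partition
communities, graph, Q = phenograph.cluster(corr)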
In [ ]:
from bokeh.io import output_notebook
# This line is required for the plots to appear in the notebook
output_notebook()
In [ ]:
%load_ext autoreload
%autoreload 2
In [ ]:
import networkplots
networkplots.explore_phenograph()
How does the `min_cluster_size` parameter affect the number of clusters?