The first dimension of a data analysis problem is the population dimension. The statistical algorithms corresponding to the analysis of that dimension are called clustering algorithms. When fed a data frame containing the values of selected variables for a given population sample, these algorithms spit out clusters of individuals that are similar from the point of view of these variables.
The general idea behind most clustering algorithms is to consider the collection of variable values for a given individual (i.e., a given row in the data frame) as the components of a vector $x=(x_1,\dots,x_m)\in \mathbb{R}^m$. So an individual becomes a point in a geometric space, and two individuals are close to each other if the distance between the corresponding points is small. A cluster is then a subset of individuals whose corresponding points in the geometric space are somehow closer to each other than to the points that are not in the cluster. This notion of cluster is somewhat relative, since it depends on several subjective choices to be made for each concrete problem:
There is some flexibility in defining this distance, and different problems may use different notions of distances. The first idea that comes to mind is to use the Euclidean distance; for instance, if we have another vector of observations $y=(y_1,\dots,y_m)$ for a second individual, the distance between these two individuals can be taken to be the length $d(x,y)$ of the line segment joining the point $x$ to the point $y$ in $\mathbb{R}^m$. In math notation, this gives:
$$d(x,y) = \|x-y\| = \sqrt{(y_1-x_1)^2 + \cdots + (y_m - x_m)^2} .$$
This seems natural, but it may seem just as natural to take the sum of the absolute values of these differences (the so-called Manhattan distance), or a weighted average of them. Moreover, if the variables are binary, the Euclidean distance above may not be the most appropriate one.
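For instance, here is a quick sketch in R (the two vectors x and y are made up for illustration) comparing the Euclidean distance with the sum of absolute differences; the built-in dist function computes both:
# Two hypothetical individuals described by the same three variables
x = c(85, 70, 92)
y = c(60, 75, 88)
sqrt(sum((x - y)^2))                       # Euclidean distance
sum(abs(x - y))                            # sum of absolute differences (Manhattan distance)
dist(rbind(x, y), method='euclidean')      # same values computed by dist()
dist(rbind(x, y), method='manhattan')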
Similarly, there is a need (and some flexibility) to scale the variables: if one of the variables has its values concentrated around, say, 100, with a standard deviation of 10, while a second variable has its values concentrated around, say, 0, with a standard deviation of 1, the first variable may completely mask the clustering information contained in the second variable. One possible way to make use of all the clustering information contained in the variables $X_1,\dots, X_m$ of our data set is to rescale them all, so that they all have mean 0 and standard deviation 1. That is, before performing the clustering, one would transform our initial variables $X_1,\dots, X_m$ into their normalized versions:
$$\tilde X_1=\frac{X_1-\mu_1}{\sigma_1},\dots,\, \tilde X_m=\frac{X_m-\mu_m}{\sigma_m}, $$
where $\mu_i$ is the sample mean and $\sigma_i$ is the sample standard deviation of the $i^{th}$ variable. Of course, one could have chosen different measures of location and scale for the normalization, such as the median and the interquartile range, for instance. Depending on the problem at hand, we may also prefer a rescaling that is not fully democratic, and keep the influence of some of the variables dominant by rescaling them differently.
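As a quick sketch (the variable names below are invented for illustration), R's scale function performs exactly this standardization, column by column:
# Two made-up variables on very different scales
X1 = rnorm(50, mean=100, sd=10)
X2 = rnorm(50, mean=0, sd=1)
standardized = scale(cbind(X1, X2))   # subtract the column means, divide by the column standard deviations
round(colMeans(standardized), 10)     # both means are now (essentially) 0
apply(standardized, 2, sd)            # both standard deviations are now 1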
At last, the notion of cluster, understood vaguely as a group of points closer to each other than to the rest of the observations, is conditioned on the number of clusters that one is ready to see in the data. Some clustering algorithms, such as k-means, need to be given this number of clusters explicitly, while others, like hierarchical clustering, will compute all the possible clusterings, and leave to the statistician the task of deciding which number of clusters makes sense.
R provides many functions implementing various clustering algorithms. Here are some of the most popular (a minimal usage sketch follows the list):
hierarchical clustering: hclust, agnes
k-means: kmeans
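Here is a minimal sketch of how these functions are typically called, on a toy data set invented for illustration (the details are discussed in the rest of this section):
# Toy data: 10 points in the plane
set.seed(1)
pts = matrix(rnorm(20), ncol=2)
km = kmeans(pts, centers=2)   # k-means needs the number of clusters up front
km$cluster
hc = hclust(dist(pts))        # hierarchical clustering works from a distance matrix
cutree(hc, k=2)               # ... and defers the choice of the number of clusters
# agnes, from the 'cluster' package, is an alternative hierarchical implementation:
# library(cluster); agnes(pts)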
READING MATERIAL: Chapter 7 of Phil Spector's lecture notes on clustering:
Consider a population $\Omega=\{S_1,\dots,S_n\}$ of $n$ students in a given class. (Think of $S_i$ as the student ID, or student name, of student $i$, which we will use to represent the actual student.)
Suppose that we have collected information for two variables for each student in this class: the student's major $M$ (a category such as ECON, MATH, STAT, or LITT) and the student's grade $G$ for the class.
At this point, we need to distinguish between two cases depending on the nature of the variables: a variable is called
qualitative, if its possible values (also called levels) indicate membership in a category, such as the major variable above (taking the average does not make sense);
quantitative, if its values are numbers (taking their average makes sense).
Let's investigate clustering for each of these variable categories.
Clustering a population using a qualitative variable is trivial, since each level defines a cluster.
Let's do a simple simulation, to see how to manipulate qualitative variables and their clusters in R. We consider again our class $\Omega$ of $n$ students, for which we will simulate $n$ values among the possible values ECON, MATH, STAT, and LITT for the qualitative variable $M$ indicating a student's major.
Let's first generate our population $\Omega$:
In [1]:
%load_ext rmagic
In [3]:
%%R
student.number = 100
omega = paste('S',1:student.number, sep='')
omega
In R, qualitative variables are better represented by objects of the class factor than by the usual vectors:
In [4]:
%%R
major.names = c('ECON', 'STAT', 'MATH', 'LITT')
M = factor(sample(major.names, student.number, replace=T))
names(M) = omega
M
The generic function summary applied to a factor returns the number of individuals in each of the category clusters; observe that, for a numeric vector, summary would have returned the usual numeric summary (min, first quartile, median, etc.), which does not make sense for factors:
In [6]:
%%R
summary(M)
This frequency information (number of individuals per category) is pretty much all we can get from a single qualitative variable; to visualize it, one can simply use the generic plot function, which displays a barplot representing the frequencies, and not a dotplot, as would be the case for a quantitative variable:
In [7]:
%%R -r 86 -w 500 -h 300
plot(M)
To retrieve the student IDs corresponding to each of the clusters, one can use the lapply function on the levels of the qualitative variable, along with logical indexing, to set up a list whose elements contain the student IDs per cluster. Note that the function levels applied to a factor returns the levels of this factor.
In [8]:
%%R
clusters = lapply(levels(M), function(major) names(M[M==major]))
names(clusters) = levels(M)
clusters
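As a side note, the same list of clusters can be obtained in a single call with the split function, which groups the elements of a vector according to the levels of a factor:
clusters2 = split(names(M), M)   # one list element per major, containing the student IDs
clusters2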
Clustering a population using a quantitative variable is more challenging, since it already involves setting up a distance between the observations, and computing the distance between each pair of observations. However, since in this case the collection of observations for each student amounts to only one number, their grade for the class, the Euclidean distance between two grades coincides with the absolute value of their difference:
$$d(x_i, x_j) = \sqrt{(x_i - x_j)^2} = |x_i - x_j|,$$
where $x_i= G(S_i)$ and $x_j = G(S_j)$ are the grades of students $S_i$ and $S_j$, respectively. Since we only have one variable, we don't really need any variable rescaling either. However, we will see that we will need to specify the number of clusters.
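A quick sanity check of this identity in R, using two made-up grades:
xi = 83; xj = 67
sqrt((xi - xj)^2)    # one-dimensional Euclidean distance
abs(xi - xj)         # absolute value of the difference: same result
dist(c(xi, xj))      # dist() agrees as well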
Let's start by simulating grades for our class $\Omega$. We want the grades to be between a minimum and a maximum score. So let's write a function that makes sure of that:
In [9]:
%%R
# Simulate n grades from a normal distribution, then clamp them to the [0, score.max] range
simulate.grades = function(n, score.max=100, score.mean=score.max/2, score.sd=score.max/5){
    grades = round(rnorm(n, mean=score.mean, sd=score.sd))
    grades[grades < 0] = 0                    # no grade below 0
    grades[grades > score.max] = score.max    # no grade above the maximum score
    return(grades)
}
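A quick check that the function behaves as intended (the exact values will differ from run to run, since they are random):
simulate.grades(10)              # ten grades centered around 50 by default
range(simulate.grades(1000))     # always stays within [0, score.max]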
The best way to visualize the population clusters determined by one quantitative variable $G$ is to draw a dotplot, or stripchart, representing all $n$ observations (i.e. all the student grades), along with the information of the distances between them.
A dotplot (or stripchart) plots each observation $x_i$ as a point on a line, at distance $x_i$ from the origin. Groups of observations close to each other will visually appear as groups, separated from the other groups by larger bands of white space.
We now generate grades for the students in our class, using the major variable to artificially introduce clusters into the grade distribution. The idea is to use
rnorm(n, mean, sd)
to generate the grades, and set different means for different majors.
The function below takes a data frame df and returns the same data frame with its column simField set to simulated values in a range specified by range; the values are simulated using the function rnorm, with a different mean for each level of the data frame column byField. These means are spread across the range indicated by the argument range, perturbed by a little noise, and then shuffled randomly.
In [10]:
%%R -r 86 -w 800 -h 400
# Fill column simField of df with simulated grades whose mean depends on the level
# of the factor column byField, so that each level forms its own cluster
simulate.clusters = function(df, simField=2, byField=1, range=c(min=0,max=100), sigma=3)
{
    cluster.names = levels(df[[byField]])
    # list of row names (student IDs) belonging to each level
    clusters = lapply(cluster.names, function(clname) rownames(df[df[[byField]] == clname, ]))
    names(clusters) = cluster.names
    cluster.number = length(cluster.names)
    # spread the cluster means across the range, add a little noise, then shuffle them
    cluster.means = round((range['max']/cluster.number - 10)*1:cluster.number + rnorm(cluster.number, mean=0, sd=sigma))
    cluster.means = sample(cluster.means, cluster.number)
    # simulate the grades of each cluster around its own mean
    for(i in 1:cluster.number) df[clusters[[i]], simField] = simulate.grades(length(clusters[[i]]),
                                                                             score.mean=cluster.means[i],
                                                                             score.sd=sigma)
    return(df)
}
In [18]:
%%R
score.max = 100
gradeBook = data.frame(Major=M, Score=rep(0, student.number))
gradeBook = simulate.clusters(gradeBook)
head(gradeBook)
Then we can visualize the simulated clusters with a simple stripchart; if the values randomly chosen above for the major grade means have introduced clusters in our population, we should be able to spot them as groups of points close to each other, separated by larger bands of white space (if not, rerun the simulation above until you can spot clusters).
Note the argument method='stack', which tells stripchart to plot observations with the same numerical value stacked on top of each other (the other methods are 'jitter' and the default 'overplot').
In [19]:
%%R -r 86 -w 800 -h 230
scores = gradeBook$Score
names(scores) = rownames(gradeBook)
stripchart(scores, method='stack', xlab='Student scores', main='Stacked stripchart of student grades')
Since, for this simulated grade data, we know what the clusters should correspond to (i.e. the class majors, since we chose different means for different majors to simulate the grades), we can check that the clusters we see in the stripchart above correspond to those majors. For that, we will plot below a fancier dotplot, where instead of plotting a mere point at, say, x=40 to represent a grade of 40 in our data set, we will display vertically at x=40 the major of the student whose grade is 40. Moreover, to make the major/cluster relationship very clear visually, we will use different colors for the major labels.
Make sure you understand the snippet of code below: we start by plotting an empty canvas, by passing the non-plotting option (n stands for 'none') to the argument type='n'; then, we add text to the canvas using the plotting function text at the appropriate coordinate locations (passed as the first two arguments x and y of the plotting function).
In [20]:
%%R -r 86 -w 800 -h 250
xScores = gradeBook$Score; yHeight = rep(1.1, nrow(gradeBook))
plot(x=xScores, y=yHeight, type='n', yaxt='n', ylab='', xlab='', main='Dotplot of student grades')
colors = c('red', 'blue', 'purple', 'orange')[gradeBook$Major]
text(x=gradeBook$Score, y=1.1, labels=gradeBook$Major, srt=90, col=colors)
In [30]:
%%R -r 86 -w 800 -h 250
fit = kmeans(scores, 4)     # k-means clustering of the scores into 4 clusters
plot(scores, rep(1.1, length(scores)), type='n', yaxt='n', xlab='', ylab='')
colors = c('red', 'blue', 'purple', 'orange')[fit$cluster]   # color by k-means cluster this time
text(scores, 1.1, labels=gradeBook$Major, srt=90, col=colors)
In [33]:
%%R
fit
In [36]:
%%R
fit$cluster    # cluster assignment of each student
fit$centers    # mean score of each cluster
fit$size       # number of students in each cluster
In [46]:
%%R -r 86 -w 1000 -h 400
D = dist(scores, method='euclidean')   # matrix of pairwise distances between student scores
In [48]:
%%R
hfit = hclust(D)    # hierarchical clustering of the distance matrix
names(hfit)         # components of the resulting hclust object
In [49]:
%%R
plot(hfit, labels=gradeBook$Major)        # dendrogram labeled by major
rect.hclust(hfit, k=4, border="orange")   # draw rectangles around the 4 main clusters
In [53]:
%%R
hgroups = cutree(hfit, 4)   # cut the dendrogram into 4 clusters
hgroups
In [54]:
%%R
gradeBook$hcluster = cutree(hfit, 4)
head(gradeBook)
We now consider clustering a population for which we have two quantitative variables.
Continuing with our grade example above, we will now suppose that we have two grades for each student in our class: the midterm and the final grades. Let's simulate this data:
In [56]:
%%R
midterm = simulate.grades(student.number, score.mean=70, score.sd=10, score.max=100)
final = simulate.grades(student.number, score.mean=60, score.sd=10, score.max=100)
gradeBook2 = data.frame(Midterm=midterm, Final=final, Major=M, row.names=omega)
head(gradeBook2)
The generalization to two variables (dim = 2) of the dotplots we used to spot clusters for one variable is the scatterplot. The idea is the same as for dotplots: one represents each individual as a point, but now on a plane instead of on a line, since we have two coordinates (the values of both variables) instead of one.
Definition: A scatter plot of two numeric vectors $x = (x_1, \dots, x_n) $ and $y = (y_1, \dots, y_n)$ with the same number $n$ of elements is a graph with two axes representing the two variable ranges and displaying $n$ points with coordinates $(x_1, y_1),\dots ,\, (x_n, y_n)$.
Below is a scatterplot of the midterm and final grades we just generated:
In [57]:
%%R -r 96 -w 600 -h 600
plot(gradeBook2[c(-3)], xlim=c(0, score.max), ylim=c(0, score.max),
main='A scatterplot of midterm grades against final grades')
abline(v=70, col='red')
abline(h=60, col='red')
The scatterplot above does not indicate the presence of clusters. The data points concentrate around the means of both exams. The symmetric, circular shape is characteristic of two independent, normally distributed variables (more on that in the Section on variables).
Now suppose we want to add to our grade model the fact that students from different majors may perform differently in the class exams. For that, we will reuse the function
simulate.clusters(df, simField=2, byField=1, range=c(min=0,max=100), sigma=3)
we wrote above to introduce differences in values for the different levels of a given factor in a data frame. Again, we will use the Major factor in gradeBook2 to clusterize our data.
In [62]:
%%R
gradeBook2 = simulate.clusters(gradeBook2, simField=1, byField=3, sigma=5)
gradeBook2 = simulate.clusters(gradeBook2, simField=2, byField=3, sigma=5)
head(gradeBook2)
We can now display the scatter plot of our two variables. To verify that the clusters correspond to the majors, we'd like to plot our points in different colors depending on the major.
In [63]:
%%R -r 96
colors = c('blue', 'orange', 'purple', 'red')
plot(gradeBook2[-3], xlim=c(0, 100), ylim=c(0, 100), col=colors[gradeBook2$Major])
legend('topleft', legend=levels(gradeBook2$Major), col=colors, pch=1)
Because of the increased noise (sigma = 5), the cluster structure starts to be difficult to spot on a dotplot of each variable taken separately. However, the cluster structure of the same data is quite apparent on the 2D scatterplot above, which takes the clustering information of both variables into account simultaneously. This is clear when comparing the dotplots below with the scatterplot above. Note also that the dotplots below correspond to the orthogonal projections of the data points of the scatterplot onto its x and y axes, respectively.
In [64]:
%%R -r 96 -w 600 -h 600
par(mfrow=c(2,1))
stripchart(gradeBook2$Midterm, ylab='Midterm', method='jitter')
stripchart(gradeBook2$Final, ylab='Final', method='jitter')
With two variables, it's possible to visualize population clusters and variable relations on a single scatter plot, since it only involves plotting points in the plane of the two variables.
With three variables we can still use a scatter plot, although we now need to plot our points in a 3 dimensional space, which renders the interpretation slightly more challenging.
With more than three variables, the only way to visualize the scatter plot (which would involve plotting in an $n$-dimensional space, where $n$ is the number of variables) is to plot 2-dimensional projections of the $n$-dimensional scatter plot. In practice, this means plotting the scatter plots of every pair of variables. This visual situation is, however, more challenging to interpret, and we will need other tools for that, as we will see.
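In R, this matrix of pairwise scatter plots is produced by the pairs function (or, equivalently for an all-numeric data frame, by calling plot on the data frame, as we do below); a minimal sketch on made-up data:
# Pairwise scatter plots of four made-up numeric columns
toy = data.frame(a=rnorm(30), b=rnorm(30), c=rnorm(30), d=rnorm(30))
pairs(toy)    # same picture as plot(toy)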
To get started, let's add a variable to our simulated grade data and plot the 3D scatter plot.
The variable we will add is a homework grade, simulated in the same way as the midterm and final grades, with major-dependent means.
In [65]:
%%R
initVal = rep(0, student.number)
noise = 5
gradeBook3 = data.frame(Midterm=initVal, Final=initVal, Homework=initVal, Major=M, row.names=omega)
gradeBook3 = simulate.clusters(gradeBook3, simField=1, byField=4, sigma=noise)
gradeBook3 = simulate.clusters(gradeBook3, simField=2, byField=4, sigma=noise)
gradeBook3 = simulate.clusters(gradeBook3, simField=3, byField=4, sigma=noise)
head(gradeBook3)
The lattice library offers advanced plotting capabilities in R. To load it, type in
library(lattice)
To display a 3D scatter plot of three continuous variables, say $X$, $Y$, and $Z$, stored in a data frame myFrame, we will use the lattice function
cloud(Z ~ X + Y, data=myFrame, group=cat, auto.key=T)
The first argument is an R formula. For now, you only need to understand that
Z ~ X + Y
means that $Z$ will be plotted on the $z$-axis, while $X$ and $Y$ will be plotted on the $xy$-plane.
The other arguments work as follows:
data is the data frame from which the quantitative variables $X$, $Y$, and $Z$ will be retrieved;
group is a categorical variable from the data frame myFrame indicating that the points from different categories should be displayed with different colors;
auto.key is a Boolean argument indicating whether a legend describing which color corresponds to which category should be plotted.
Let us display a 3D scatter plot of our variables Final, Midterm, and Homework, and plot them with different colors depending on the Major variable.
Remark: in the notebook, one needs to store the output of cloud into a variable, which we then need to print in order for the plot to appear.
In [66]:
%%R -r 96 -w 700 -h 500
library(lattice)
grade_cloud = cloud(Final ~ Midterm + Homework, data=gradeBook3, group=Major, auto.key=T)
print(grade_cloud)
One can still clearly see the major clusters on this 3D scatter plot. The relationship between the variables is, however, not clearly perceptible on the plot.
Another way to display variable relations and population clusters for more than 2 variables is to plot a scatter plot matrix containing a scatter plot of each pair of variables.
One can again pass our color vector to the col argument.
In [67]:
%%R -r 96 -w 500 -h 500
plot(gradeBook3[c(-4)], col=colors[gradeBook3$Major])
The major clusters are clearly visible in each of the scatter plots.
In contrast with our 3D scatter plot, one can still see the linear relationship between the midterm score and the final score that we built into our data.
Although the scatter plot matrix is still very useful for three variables, we begin to get a sense that more sophisticated methods will be needed as the number of variables increases... That's what we are going to learn next.
Let us add a fourth variable to our simulated grade data:
In [50]:
%%R
initVal = rep(0, student.number); noise = 5
gradeBook4 = data.frame(Final=initVal, Midterm=initVal, Homework=initVal, Quiz=initVal, Major=M, row.names=omega)
head(gradeBook4)
gradeBook4 = simulate.clusters(gradeBook4, simField=1, byField=5, sigma=noise)   # Major is column 5 in gradeBook4
gradeBook4 = simulate.clusters(gradeBook4, simField=2, byField=5, sigma=noise)
We can still display a scatter plot matrix:
In [306]:
%%R -r 96 -w 700 -h 700
plot(gradeBook4[,1:4], col=colors[gradeBook4$Major])
We still see some clustering appearing for some variables, but for others (Quiz and Homework) the clustering does not seem to be present.
As the number of variables grows, we need to use other methods to retrieve clusters from the data.
Consider $n$ variables $X_1,\dots,X_n$ for a population sample of $m$ individuals.
All the variable values for all the sample individuals can be arranged into a matrix
$$ \qquad\qquad\qquad\left( \begin{array}{cccc} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & & & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{array} \right) \qquad \textrm{where}\qquad x_{ij} = X_j(s_i)\quad \textrm{(value of the variable}\quad X_j\quad \textrm{ for the individual} \quad s_i \textrm{ )} $$
In other words, this matrix is the data frame stripped of its row and column labels.
The points $v_1,\dots, v_m$ represented in a scatter plots are just the row vectors of this value matrix:
$$ \qquad\qquad\qquad\begin{array}{rcl} v_1 & = & (x_{11},x_{12},\dots,x_{1n}) \\ v_2 & = & (x_{21},x_{22},\dots,x_{2n}) \\ & \vdots & \\ v_m & = & (x_{m1}, x_{m2},\dots,x_{mn}) \end{array} $$
In R, a matrix is represented by a vector whose hidden dimension vector (the dim attribute) is set to the corresponding numbers of rows and columns, in the same way that an object of a given class is represented by a list whose hidden class attribute is set to the class name.
This hidden dimension vector is accessed and set through the function
dim(x)
Changing the dimension of a vector makes R interpret the resulting object as an instance of the class matrix.
In [307]:
%%R
x = c(1,2,3,4,5,6,7,8,9)
print(class(x))
dim(x) = c(3,3)
print(class(x)); cat('\n')
print(x)
One can extract the value matrix from a data frame using the conversion function
as.matrix(x)
In [308]:
%%R
scores = as.matrix(gradeBook4[,1:4])   # the four numeric grade columns
print(class(scores))
The distance matrix of a matrix $A=(x_{ij})$ is the matrix containing all the distances between the matrix row vectors $v_1,v_2,\dots, v_m$:
$$ \qquad\qquad\qquad\operatorname{dist}(A)\quad=\quad\left( \begin{array}{cccc} d(v_1,v_1) & d(v_1,v_2) & \cdots & d(v_1,v_m) \\ d(v_2,v_1) & d(v_2,v_2) & \cdots & d(v_2,v_m) \\ \vdots & & & \vdots \\ d(v_m,v_1) & d(v_m,v_2) & \cdots & d(v_m,v_m) \end{array} \right) $$
where
$$\qquad\qquad\qquad d(v_i,v_j) = \sqrt{\sum_{k=1}^n (x_{ik} - x_{jk})^2}$$
is the Euclidean distance between the row vectors $v_i$ and $v_j$.
In R, one computes the distance matrix using the function
dist(x)
The distance matrix contains all the necessary information to perform clustering.
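To make the definition concrete, here is a small check (on a tiny matrix invented for illustration) that dist reproduces the formula above:
# Three individuals, two variables
A = rbind(c(1, 2), c(4, 6), c(0, 0))
sqrt(sum((A[1, ] - A[2, ])^2))    # manual Euclidean distance between rows 1 and 2: 5
dist(A)                           # all pairwise distances (lower triangle of the distance matrix)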
In [309]:
%%R
scores_dist = dist(scores)
The function
hclust(distance_matrix)
takes a distance matrix and returns an object of the class hclust, which contains a hierarchical grouping of the population.
This hierarchical grouping is computed by hclust as follows: each observation starts in its own cluster; at each step, the two closest clusters are merged into a new cluster; the process stops when all observations have been merged into a single cluster. (By default, the distance between two clusters is taken to be the largest distance between a member of one cluster and a member of the other, a rule known as complete linkage.)
The resulting hierarchy can then be plotted as a dendrogram using the plot function.
In [310]:
%%R -r 96 -w 700 -h 400
hclusters = hclust(scores_dist)
plot(hclusters, main='Dendrogram of the class scores')
We see from the dendrogram that we can obtain different numbers of clusters by cutting the dendrogram at a given height.
In our example,
cutting at a height between 60 and 120 yields two clusters, which most probably correspond to groupings of the majors;
cutting at a height of 40 produces several additional clusters, revealing more structure in the data.
Of course, if the distance between two clusters is small, it may indicate that the cluster structure is not very strong. In our example, only the two main clusters seem to be real.
To retrieve a given number of clusters, one uses the function
cutree(hcluster_object, number_of_clusters)
which returns an integer vector where each element tells which cluster the corresponding individual in our population belongs to. The labels of this vector are the individual names.
In [311]:
%%R
clusters = cutree(hclusters, 2)
print(clusters)
We can now separate the rows of our initial data frame into two categories, corresponding to the returned clusters, and check how they relate to the majors:
In [312]:
%%R
cluster1.names = names(clusters[clusters == 1])
students1 = gradeBook4[cluster1.names,]
print(students1)