The **first dimension** of a data analysis problem is the **population dimension**. The statistical algorithms corresponding to the analysis of that dimension are called **clustering algorithms**. When fed a data frame containing the values of selected variables for a given population sample, these algorithms spit out **clusters** of individuals that are similar from the point of view of these variables.

The general idea behind most clustering algorithms is to consider the collection of variable values for a given individual (i.e., a given row in the data frame) as the components of a vector $x=(x_1,\dots,x_m)\in \mathbb{R}^m$. So an individual becomes a point in a geometric space, and two individuals are close to each other if the distance between the corresponding points is small. A cluster is then a subset of individuals whose corresponding points in the geometric space are somehow closer to each other than to the points that are not in the cluster. This notion of cluster is somewhat relative, since it depends on several subjective choices to be made for each concrete problem:

- the choice of an appropriate **distance** between the observations
- the choice of an appropriate **scale** for each of the variables
- the choice of an appropriate **number of clusters** to identify

There is some flexibility in defining this distance, and different problems may use different notions of distances. The first idea that comes to mind is to use the Euclidean distance; for instance, if we have another vector of observations $y=(y_1,\dots,y_m)$ for a second individual, the distance between these two individuals can be taken to be the length $d(x,y)$ of the line segment joining the point $x$ to the point $y$ in $\mathbb{R}^m$. In math notation, this gives:

$$d(x,y) = \|x-y\| = \sqrt{(y_1-x_1)^2 + \cdots + (y_m - x_m)^2} .$$

This seems natural, but it may seem just as natural to take the sum of the absolute values of these differences, or a weighted average of them. Moreover, if the variables are binary, the Euclidean distance above may not be the most appropriate one.
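As a quick illustration (a hypothetical example, not part of our grade data), here is how these two distances can be computed in R for a pair of observation vectors:

```
# Two hypothetical individuals described by three variables each
x = c(1, 2, 3)
y = c(4, 0, 3)

# Euclidean distance: length of the line segment joining x and y
sqrt(sum((y - x)^2))

# Manhattan distance: sum of the absolute coordinate differences
sum(abs(y - x))

# The built-in dist function implements both on the rows of a matrix
dist(rbind(x, y), method='euclidean')
dist(rbind(x, y), method='manhattan')
```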

Similarly, there is a need and some flexibility to scale the variables: if one of the variables has its values concentrated around, say, 100, with a standard deviation of 10, while a second variable has its values concentrated around, say, 0, with a standard deviation of 1, the first variable may completely mask the clustering information that may be contained in the second variable. One possible choice to make use of all the clustering information contained in the variables $X_1,\dots, X_m$ of our data set is to rescale them all, so they all have mean 0 and standard deviation 1. That is, before performing the clustering, one would transform our initial variables $X_1,\dots, X_m$ into their normalized versions:

$$\tilde X_1=\frac{X_1-\mu_1}{\sigma_1},\dots,\, \tilde X_m=\frac{X_m-\mu_m}{\sigma_m}, $$

where $\mu_i$ is the sample mean and $\sigma_i$ is the sample standard deviation of the $i^{th}$ variable. Of course, one could have chosen different measures of location and scale for the normalization, such as the median and the interquartile range, for instance. Depending on the problem at hand, we may also not want a fully democratic rescaling, and instead keep the influence of some of the variables dominant by choosing a different rescaling.
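In R, this standardization can be done in a single call with the built-in `scale` function; below is a minimal sketch on a small simulated data frame (hypothetical variables, for illustration only):

```
# Two variables on very different scales
df = data.frame(X1 = rnorm(50, mean=100, sd=10), X2 = rnorm(50, mean=0, sd=1))

# scale() subtracts the column means and divides by the column standard deviations
df.scaled = scale(df)

round(colMeans(df.scaled), 10)  # all (essentially) 0
apply(df.scaled, 2, sd)         # all 1
```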

Finally, the notion of cluster, understood vaguely as a *group of points closer to each other than to the rest of the observations*, is conditioned on the number of clusters that one is ready to see in the data. Some clustering algorithms, such as k-means, need to be given this number of clusters explicitly, while others, like hierarchical clustering, will compute all the possible clusterings, and leave to the statistician the task of interpreting which number of clusters makes sense.

R provides many functions implementing various clustering algorithms. Here are some of the most popular:

- hierarchical clustering: `hclust`, `agnes` (from the `cluster` package)
- k-means: `kmeans`

**READING MATERIAL:** Chapter 7 of Phil Spector's lecture notes on clustering.

Consider a population $\Omega=\{S_1,\dots,S_n\}$ of $n$ students in a given class. (Think of $S_i$ as the student ID, or student name, of student $i$, which we will use to represent the actual student.)

Suppose that we have collected information for two variables for each student in this class:

- the student's major, which we denote by $M(S_i)$ for student $S_i$
- the student's score in this class, which we denote by $G(S_i)$ for student $S_i$

At this point, we need to distinguish between two cases depending on the nature of the variables. A variable is called

- **qualitative**, if its possible values (also called *levels*) indicate membership in a **category**, such as the major variable above (taking the average does not make sense);
- **quantitative**, if its values are **numbers** (taking the average makes sense).

Let's investigate clustering for each of these variable categories.

Clustering a population using a qualitative variable is trivial, since each level defines a cluster.

Let's do a simple simulation, to see how to manipulate qualitative variables and their clusters in R. We consider again our class $\Omega$ of $n$ students, for which we will simulate $n$ values among the possible values ECON, MATH, STAT, and LITT for the qualitative variable $M$ indicating a student's major.

Let's first generate our population $\Omega$:

In [1]:
```
%load_ext rmagic
```

In [3]:
```
%%R
student.number = 100
omega = paste('S', 1:student.number, sep='')
omega
```

Qualitative variables are better represented in R with the special class `factor` than with the usual vectors:

In [4]:
```
%%R
major.names = c('ECON', 'STAT', 'MATH', 'LITT')
M = factor(sample(major.names, student.number, replace=T))
names(M) = omega
M
```

The function `summary` applied to a factor returns the number of individuals in each of the category clusters; observe that, for a numeric vector, `summary` would have returned the usual numeric summary (min, first quartile, median, etc.), which does not make sense for factors:

In [6]:
```
%%R
summary(M)
```

A factor can also be passed to the `plot` function, which will display a **barplot** representing the frequencies, and not a **dotplot**, as would have been the case for a quantitative variable:

In [7]:
```
%%R -r 86 -w 500 -h 300
plot(M)
```

We can use the `lapply` function on the levels of the qualitative variable, along with logical indexing, to set up a list whose elements contain the student IDs per cluster. Note that the function `levels` applied to a factor returns the levels of this factor.

In [8]:
```
%%R
clusters = lapply(levels(M), function(major) names(M[M==major]))
names(clusters) = levels(M)
clusters
```
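Incidentally, the same list of clusters can be obtained in one line with the built-in `split` function, which partitions a vector according to the levels of a factor:

```
# Equivalent to the lapply construction above
split(names(M), M)
```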

Clustering a population using a *quantitative* variable is more challenging, since it already involves setting up a distance between the observations, and computing the distance between each pair of observations. However, since in this case our observation for each student amounts to only one number, the grade for the class, the Euclidean distance between two grades coincides with the absolute value of their difference:

$d(x_i, x_j) = \sqrt{(x_i - x_j)^2} = |x_i - x_j|,$

where $x_i= G(S_i)$ and $x_j = G(S_j)$ are the grades of students $S_i$ and $S_j$, respectively. Since we only have one variable, we also don't really need any variable rescaling. However, we will see that we will need to specify the number of clusters.
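In fact, applying R's `dist` function to a one-dimensional vector returns exactly these pairwise absolute differences; for instance, on a small hypothetical vector of grades:

```
# Pairwise distances |x_i - x_j| between the grades 40, 45, and 90
dist(c(40, 45, 90))
```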

Let's start by simulating grades for our class $\Omega$. We want the grades to be between a minimum and a maximum score. So let's write a function that makes sure of that:

In [9]:
```
%%R
simulate.grades = function(n, score.max=100, score.mean=score.max/2, score.sd=score.max/5){
    grades = round(rnorm(n, mean=score.mean, sd=score.sd))
    grades[grades < 0] = 0
    grades[grades > score.max] = score.max
    return(grades)
}
```
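A quick sanity check (not in the original notebook) confirms that the simulated grades stay within the allowed range and are centered near the requested mean:

```
g = simulate.grades(1000)
range(g)   # within [0, 100]
mean(g)    # close to the default score.mean of 50
```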

The best way to visualize population clusters determined by one quantitative variable $G$ is to draw a **dotplot** or **stripchart**, representing all the $n$ observations (i.e., all the student grades), along with the information of the distance between them.

A dotplot (or stripchart) plots each observation $x_i$ as a point on a line, at distance $x_i$ from the origin. Groups of observations close to each other will visually appear as groups, separated from the other groups by larger bands of white space.

We now generate grades for the students in our class, using the major variable to artificially introduce clusters into the grade distribution. The idea is to use `rnorm(n, mean, sd)` to generate the grades, and set different means for different majors.

The function below takes a data frame `df`, and returns the same data frame with its column `simField` set to simulated values in a range specified by `range`; the values are simulated using the function `rnorm`, with different means for the different levels of the data frame column `byField`. These means are chosen randomly in the range indicated by the argument `range`.

In [10]:
```
%%R -r 86 -w 800 -h 400
simulate.clusters = function(df, simField=2, byField=1, range=c(min=0,max=100), sigma=3)
{
    cluster.names = levels(df[[byField]])
    clusters = lapply(cluster.names, function(clname) rownames(df[df[[byField]] == clname, ]))
    names(clusters) = cluster.names
    cluster.number = length(cluster.names)
    cluster.means = round((range['max']/cluster.number - 10)*1:cluster.number + rnorm(cluster.number, mean=0, sd=sigma))
    cluster.means = sample(cluster.means, cluster.number)
    for(i in 1:cluster.number) df[clusters[[i]], simField] = simulate.grades(length(clusters[[i]]),
                                                                             score.mean=cluster.means[i],
                                                                             score.sd=sigma)
    return(df)
}
```

In [18]:
```
%%R
score.max = 100
gradeBook = data.frame(Major=M, Score=rep(0, student.number))
gradeBook = simulate.clusters(gradeBook)
head(gradeBook)
```

Below, we draw a stripchart of the scores with the option `method='stack'`, which tells `stripchart` to plot observations with the same numerical value stacked on top of each other (other methods are `'jitter'` and the default `'overplot'`).

In [19]:
```
%%R -r 86 -w 800 -h 230
scores = gradeBook$Score
names(scores) = rownames(gradeBook)
stripchart(scores, method='stack', xlab='Student scores', main='Stacked stripchart of student grades')
```

Since, for this simulated grade data, we know what the clusters should correspond to (i.e., the class majors, since we randomly chose different means for different majors to simulate the grades), we can check that the clusters we see in the stripchart above correspond to those majors. For that, we will plot below a fancier dotplot: instead of plotting a mere point at, say, x=40 to represent a grade of 40 in our data set, we will display vertically at x=40 the major of the student whose grade is 40. Moreover, to make the major/cluster relationship very clear visually, we will use different colors for the major labels.

Make sure you understand the snippet of code below: we start by plotting an empty canvas, by passing the *no plotting* option `type='n'` ('n' stands for 'none'); then, we add text to the canvas using the plotting function `text` at the appropriate coordinate locations (passed as the first two arguments `x` and `y` of the plotting function).

In [20]:
```
%%R -r 86 -w 800 -h 250
xScores = gradeBook$Score; yHeight = rep(1.1, nrow(gradeBook))
plot(x=xScores, y=yHeight, type='n', yaxt='n', ylab='', xlab='', main='Dotplot of student grades')
colors = c('red', 'blue', 'purple', 'orange')[gradeBook$Major]
text(x=gradeBook$Score, y=1.1, labels=gradeBook$Major, srt=90, col=colors)
```

We can now try to recover these four clusters from the grades alone, using the `kmeans` function, which takes the data and the desired number of clusters; below, we color the major labels according to the cluster found by k-means:

In [30]:
```
%%R -r 86 -w 800 -h 250
fit = kmeans(scores, 4)
plot(scores, rep(1.1, length(scores)), type='n', yaxt='n', xlab='', ylab='')
colors = c('red', 'blue', 'purple', 'orange')[fit$cluster]
text(scores, 1.1, labels=gradeBook$Major, srt=90, col=colors)
```

The object returned by `kmeans` summarizes the fitted clustering:

In [33]:
```
%%R
fit
```

Its most useful components are the cluster assignment of each individual, the cluster centers, and the cluster sizes:

In [36]:
```
%%R
fit$cluster
fit$centers
fit$size
```
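To check how well k-means recovered the majors, we can cross-tabulate the fitted cluster labels against the `Major` factor (a quick verification sketch, not part of the original notebook); if the recovery is good, each row of the table is dominated by a single major:

```
table(fit$cluster, gradeBook$Major)
```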

For hierarchical clustering, we first compute the matrix of pairwise distances between the observations with the `dist` function:

In [46]:
```
%%R -r 86 -w 1000 -h 400
D = dist(scores, method='euclidean')
```

The `hclust` function takes this distance matrix and returns an object describing the full hierarchy of clusterings:

In [48]:
```
%%R
hfit = hclust(D)
names(hfit)
```

Plotting the result displays the hierarchy as a dendrogram; `rect.hclust` draws rectangles around the clusters obtained by cutting the tree into `k` groups:

In [49]:
```
%%R
plot(hfit, labels=gradeBook$Major)
rect.hclust(hfit, k=4, border="orange")
```

The function `cutree` cuts the tree into a given number of clusters and returns the cluster membership of each individual:

In [53]:
```
%%R
hgroups = cutree(hfit, 4)
hgroups
```

We can store these cluster labels as a new column of our grade book:

In [54]:
```
%%R
gradeBook$hcluster = cutree(hfit, 4)
head(gradeBook)
```
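As with k-means, a cross-tabulation (again a small verification sketch) lets us check whether the four hierarchical clusters line up with the majors:

```
table(gradeBook$hcluster, gradeBook$Major)
```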

We now consider clustering a **population** for which we have two quantitative variables.

Continuing with our grade example above, we will now suppose we have two grades for each student in our class: the midterm and the final grades. Let's simulate this data:

In [56]:
```
%%R
midterm = simulate.grades(student.number, score.mean=70, score.sd=10, score.max=100)
final = simulate.grades(student.number, score.mean=60, score.sd=10, score.max=100)
gradeBook2 = data.frame(Midterm=midterm, Final=final, Major=M, row.names=omega)
head(gradeBook2)
```

To visualize two quantitative variables simultaneously, we use **scatterplots**. The idea is the same as for dotplots: one represents each individual as a point, but now on a plane instead of on a line, since we have two coordinates (the values for both variables) instead of one.

**Definition:** A **scatter plot** of **two numeric vectors** $x = (x_1, \dots, x_n) $ and $y = (y_1, \dots, y_n)$ with the same number $n$ of elements is a graph with **two axes** representing the two **variable ranges** and displaying $n$ **points** with coordinates $(x_1, y_1),\dots ,\, (x_n, y_n)$.

Below is a scatterplot of the final and midterm grades we just generated:

In [57]:
```
%%R -r 96 -w 600 -h 600
plot(gradeBook2[c(-3)], xlim=c(0, score.max), ylim=c(0, score.max),
     main='A scatterplot of midterm grades against final grades')
abline(v=70, col='red')
abline(h=60, col='red')
```

Now suppose we want to add to our grade model the fact that students from different majors may perform differently in the class exams. For that, we will reuse the function

```
simulate.clusters(df, simField=2, byField=1, range=c(min=0,max=100), sigma=3)
```

we wrote above to introduce differences in values for the different levels of a given factor in a data frame. Again, we will use the `Major` factor in `gradeBook2` to clusterize our data.

In [62]:
```
%%R
gradeBook2 = simulate.clusters(gradeBook2, simField=1, byField=3, sigma=5)
gradeBook2 = simulate.clusters(gradeBook2, simField=2, byField=3, sigma=5)
head(gradeBook2)
```

In [63]:
```
%%R -r 96
colors = c('blue', 'orange', 'purple', 'red')
plot(gradeBook2[-3], xlim=c(0, 100), ylim=c(0, 100), col=colors[gradeBook2$Major])
legend('topleft', legend=levels(gradeBook2$Major), col=colors, pch=1)
```

When the noise level is high (e.g., `sigma = 8`), the cluster structure starts to be difficult to spot on a dotplot of each variable taken separately. However, the cluster structure for the same data is quite apparent on the 2D scatterplot above, which takes the clustering information of both variables into account simultaneously. This is clear by comparing the dotplots below with the scatterplot above. Note also that the dotplots below correspond to the orthogonal projections of the scatterplot data points onto its x and y axes, respectively.

In [64]:
```
%%R -r 96 -w 600 -h 600
par(mfrow=c(2,1))
stripchart(gradeBook2$Midterm, ylab='Midterm', method='jitter')
stripchart(gradeBook2$Final, ylab='Final', method='jitter')
```

With **two variables**, it is possible to visualize **population clusters** and **variable relations** on a **single scatter plot**, since it only involves plotting points on the **variable plane**.

With **three variables** we can still use a scatter plot, although we now need to plot our points in a **3 dimensional space**, which renders the interpretation slightly more challenging.

With **more than three variables**, the only way to visualize the **scatter plot** (which would involve plotting in an $n$-dimensional space, where $n$ is the number of variables) is to **plot 2-dimensional projections of the $n$-dimensional scatter plot**. This means that we plot the **scatter plots for each pair of variables**. The visual situation is however more challenging to interpret, and we will need other tools for that, as we will see.

To get started, let's add a variable to our simulated grade data and plot the 3D scatter plot.

The variable we will add is a homework grade, which, to keep the simulation simple, we generate in the same way as the midterm and final grades.

In [65]:
```
%%R
initVal = rep(0, student.number)
noise = 5
gradeBook3 = data.frame(Midterm=initVal, Final=initVal, Homework=initVal, Major=M, row.names=omega)
gradeBook3 = simulate.clusters(gradeBook3, simField=1, byField=4, sigma=noise)
gradeBook3 = simulate.clusters(gradeBook3, simField=2, byField=4, sigma=noise)
gradeBook3 = simulate.clusters(gradeBook3, simField=3, byField=4, sigma=noise)
head(gradeBook3)
```

The **lattice library** offers advanced plotting capabilities in R. To load it, type in

```
library(lattice)
```

To display **3D scatter plots** of three continuous variables, say $X$, $Y$, and $Z$, stored in a data frame `myFrame`, we will use the lattice function

```
cloud(Z ~ X + Y, data=myFrame, group=cat, auto.key=T)
```

- The **first argument** is an **R formula**. For now, you only need to understand that `Z ~ X + Y` means that $Z$ will be plotted on the $z$-axis, while $X$ and $Y$ will be plotted on the $xy$-plane.
- The **second argument** `data` is the data frame from which the **quantitative variables** $X$, $Y$, and $Z$ will be retrieved.
- The **third argument** `group` is a **categorical variable** from the data frame `myFrame`, indicating that the points from different categories should be displayed with different colors.
- The **last argument** `auto.key` is a Boolean argument indicating whether a legend describing which color corresponds to which category should be plotted.

Let us display a 3D scatter plot of our variables `Final`, `Midterm`, and `Homework`, and plot them with different colors depending on the `Major` variable.

**Remark:** In the notebook, one needs to store the output of `cloud` into a variable, which we then print in order for the plot to appear.

In [66]:
```
%%R -r 96 -w 700 -h 500
library(lattice)
grade_cloud = cloud(Final ~ Midterm + Homework, data=gradeBook3, group=Major, auto.key=T)
print(grade_cloud)
```

Another way to display **variable relations** and **population clusters** for **more than 2 variables** is to plot a **scatter plot matrix** containing a scatter plot of **each pair** of variables.

One can again pass our **color vector** to the `col` argument.

In [67]:
```
%%R -r 96 -w 500 -h 500
plot(gradeBook3[c(-4)], col=colors[gradeBook3$Major])
```

The **major clusters are clearly visible** in each of the scatter plots.

In contrast with our 3D scatter plot, **one can still see the linear relationship between the midterm score and the final score** that we built into our data.

Although still very useful for three variables, we begin to get a sense that **more sophisticated methods will be needed as the number of variables increases**... That's what we are going to learn next.

Let us add a fourth variable to our simulated grade data:

In [50]:
```
%%R
initVal = rep(0, student.number); noise = 5
gradeBook4 = data.frame(Final=initVal, Midterm=initVal, Homework=initVal, Quiz=initVal, Major=M, row.names=omega)
# Major is now the 5th column, so byField=5; only Final and Midterm get major clusters
gradeBook4 = simulate.clusters(gradeBook4, simField=1, byField=5, sigma=noise)
gradeBook4 = simulate.clusters(gradeBook4, simField=2, byField=5, sigma=noise)
# Homework and Quiz are simulated without clusters
gradeBook4$Homework = simulate.grades(student.number)
gradeBook4$Quiz = simulate.grades(student.number)
head(gradeBook4)
```

We can still display a **scatter plot matrix**:

In [306]:
```
%%R -r 96 -w 700 -h 700
plot(gradeBook4[, 1:4], col=colors[gradeBook4$Major])
```

We still see some clustering appearing for some variables, but for others (Quiz and Homework) the clustering does not seem to be present.

As the number of variables grows, we need other methods to retrieve clusters from the data.

Consider $n$ **variables** $X_1,\dots,X_n$ for a **population sample** of $m$ **individuals**.

All the **variable values** for all the **sample individuals** can be arranged into a matrix

$$A = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix},$$

where $x_{ij}$ is the value of the variable $X_j$ for the $i^{th}$ individual. In other words, this matrix is the **data frame stripped of its row and column labels**.

The **points** $v_1,\dots, v_m$ represented in a **scatter plot** are just the **row vectors** of this value matrix:

$$v_i = (x_{i1}, x_{i2}, \dots, x_{in}), \qquad i=1,\dots,m.$$

In R, a **matrix** is represented by a **vector** with a **hidden dimension attribute** set to the corresponding numbers of rows and columns, in the same way a **class instance** is a **list** with a **hidden class attribute** set to the **class name**.

This **hidden dimension attribute** is accessed and set through the function

```
dim(x)
```

Changing the dimension of a vector makes R interpret the **resulting object as an instance of the class** `matrix`.

In [307]:
```
%%R
x = c(1,2,3,4,5,6,7,8,9)
print(class(x))
dim(x) = c(3,3)
print(class(x)); cat('\n')
print(x)
```

One can extract the **value matrix** from a data frame using the **conversion function** `as.matrix(x)`:

In [308]:
```
%%R
scores = as.matrix(gradeBook4[, 1:4])
print(class(scores))
```

The **distance matrix** of a value matrix $A=(x_{ij})$ is the matrix containing all the **distances between** the matrix **row vectors** $v_1,v_2,\dots, v_m$:

$$D = \begin{pmatrix} d(v_1,v_1) & d(v_1,v_2) & \cdots & d(v_1,v_m) \\ d(v_2,v_1) & d(v_2,v_2) & \cdots & d(v_2,v_m) \\ \vdots & \vdots & & \vdots \\ d(v_m,v_1) & d(v_m,v_2) & \cdots & d(v_m,v_m) \end{pmatrix},$$

where

$$d(v_i,v_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}$$

is the **Euclidean distance** between the **row vectors** $v_i$ and $v_j$.

In R, one computes **the distance matrix** using the function

```
dist(x)
```

The **distance matrix** contains all the necessary information to perform **clustering**.

In [309]:
```
%%R
scores_dist = dist(scores)
```
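As a sanity check (a small verification sketch, not in the original notebook), we can recompute the distance between the first two individuals by hand and compare it with the corresponding entry of `scores_dist`:

```
# Euclidean distance between the first two rows, computed manually
sqrt(sum((scores[1,] - scores[2,])^2))

# The same entry read off the distance matrix
as.matrix(scores_dist)[1, 2]
```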

The function

```
hclust(distance_matrix)
```

takes a **distance matrix** and returns an object of the class `hclust`, which contains a **hierarchical grouping of the population**.

This **hierarchical grouping** is computed by `hclust` as follows:

- it starts by **grouping individuals (i.e., rows) two by two by smallest distance**;
- then it computes a distance between these pairs and **groups these pairs again two by two by smallest distance**;
- the algorithm continues until there is **no group left for grouping**.

The **resulting hierarchy** can then be plotted as a **dendrogram** using the `plot` function.
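To see this agglomerative procedure on a tiny example (hypothetical numbers, chosen for illustration), consider four one-dimensional points; the `merge` and `height` components of the returned object record which groups were merged and at what distances:

```
# Points 1 and 2 merge first (distance 1), then 10 and 12 (distance 2),
# and finally the two resulting groups merge
h = hclust(dist(c(1, 2, 10, 12)))
h$merge    # the sequence of mergers
h$height   # the distances at which the mergers occur
```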

In [310]:
```
%%R -r 96 -w 700 -h 400
hclusters = hclust(scores_dist)
plot(hclusters, main='Dendrogram of the class scores')
```

We see from the dendrogram that we can obtain a different number of clusters by **cutting the dendrogram at a given height**.

In our example,

- **cutting** at a **distance between 60 and 120** makes **two clusters emerge**, which most probably correspond to the two main groups of majors;
- **cutting** at a **distance of 40** produces **several additional clusters**, revealing more structure in the data.

Of course, if the distance between two clusters is **small**, it may indicate that the **cluster structure** is not very strong. In our example, only the two main clusters seem to be real.

To **retrieve a given number of clusters**, one uses the function

```
cutree(hcluster_object, number_of_clusters)
```

which returns an **integer vector** where each element is an **integer** telling **which cluster the corresponding individual in our population belongs to**. The **labels** of this vector are the **individual names**.

In [311]:
```
%%R
clusters = cutree(hclusters, 2)
print(clusters)
```

In [312]:
```
%%R
cluster1.names = names(clusters[clusters == 1])
students1 = gradeBook4[cluster1.names, ]
print(students1)
```