The **first dimension** of a data analysis problem is the **population dimension**. The statistical algorithms corresponding to the analysis of that dimension are called **clustering algorithms**. When fed a data frame containing the values of selected variables for a given population sample, these algorithms spit out **clusters** of individuals that are similar from the point of view of these variables.

The general idea behind most clustering algorithms is to consider the collection of variable values for a given individual (i.e., a given row in the data frame) as the components of a vector $x=(x_1,\dots,x_m)\in \mathbb{R}^m$. So an individual becomes a point in a geometric space, and two individuals are close to each other if the distance between the corresponding points is small. A cluster is then a subset of individuals whose corresponding points in the geometric space are somehow closer to each other than to the points that are not in the cluster. This notion of cluster is somewhat relative, since it depends on several subjective choices to be made for each concrete problem:

- the choice of an appropriate **distance** between the observations
- the choice of an appropriate **scale** for each of the variables
- the choice of an appropriate **number of clusters** to identify

There is some flexibility in defining this distance, and different problems may use different notions of distances. The first idea that comes to mind is to use the Euclidean distance; for instance, if we have another vector of observations $y=(y_1,\dots,y_m)$ for a second individual, the distance between these two individuals can be taken to be the length $d(x,y)$ of the line segment joining the point $x$ to the point $y$ in $\mathbb{R}^m$. In math notation, this gives:

$$d(x,y) = \|x-y\| = \sqrt{(y_1-x_1)^2 + \cdots + (y_m - x_m)^2} .$$

This seems natural, but it may seem just as natural to take the sum of the absolute values of these differences, or a weighted average of them. Moreover, if the variables are binary, the Euclidean distance above may not be the most appropriate one.
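As a quick illustration (a hypothetical example, not part of our grade data), here is how these two distances can be computed in R for a pair of observation vectors:

```
# Two hypothetical individuals described by three variables each
x = c(1, 2, 3)
y = c(4, 0, 3)

# Euclidean distance: length of the line segment joining x and y
sqrt(sum((y - x)^2))

# Manhattan distance: sum of the absolute coordinate differences
sum(abs(y - x))

# The built-in dist function implements both on the rows of a matrix
dist(rbind(x, y), method='euclidean')
dist(rbind(x, y), method='manhattan')
```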

Similarly, there is a need and some flexibility to scale the variables: if one of the variables has its values concentrated around, say, 100, with a standard deviation of 10, while a second variable has its values concentrated around, say, 0, with a standard deviation of 1, the first variable may completely mask the clustering information that may be contained in the second variable. One possible choice to make use of all the clustering information contained in the variables $X_1,\dots, X_m$ of our data set is to rescale them all, so they all have mean 0 and standard deviation 1. That is, before performing the clustering, one would transform our initial variables $X_1,\dots, X_m$ into their normalized versions:

$$\tilde X_1=\frac{X_1-\mu_1}{\sigma_1},\dots,\, \tilde X_m=\frac{X_m-\mu_m}{\sigma_m}, $$

where $\mu_i$ is the sample mean and $\sigma_i$ is the sample standard deviation of the $i^{th}$ variable. Of course, one could have chosen different measures of location and scale for the normalization, such as the median and the interquartile range, for instance. Depending on the problem at hand, we may also not want a fully democratic rescaling, and instead keep the influence of some of the variables dominant by choosing a different rescaling.
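In R, this standardization can be done in a single call with the built-in `scale` function; below is a minimal sketch on a small simulated data frame (hypothetical variables, for illustration only):

```
# Two variables on very different scales
df = data.frame(X1 = rnorm(50, mean=100, sd=10), X2 = rnorm(50, mean=0, sd=1))

# scale() subtracts the column means and divides by the column standard deviations
df.scaled = scale(df)

round(colMeans(df.scaled), 10)  # all (essentially) 0
apply(df.scaled, 2, sd)         # all 1
```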

Finally, the notion of cluster, understood vaguely as a *group of points closer to each other than to the rest of the observations*, is conditioned on the number of clusters that one is ready to see in the data. Some clustering algorithms, such as k-means, need to be given this number of clusters explicitly, while others, like hierarchical clustering, will compute all the possible clusterings, and leave to the statistician the task of interpreting which number of clusters makes sense.

R provides many functions implementing various clustering algorithms. Here are some of the most popular:

- hierarchical clustering: `hclust`, `agnes` (from the `cluster` package)
- k-means: `kmeans`

**READING MATERIAL:** Chapter 7 of Phil Spector's lecture notes on clustering.

Consider a population $\Omega=\{S_1,\dots,S_n\}$ of $n$ students in a given class. (Think of $S_i$ as the student ID, or student name, of student $i$, which we will use to represent the actual student.)

Suppose that we have collected information for two variables for each student in this class:

- the student's major, which we denote by $M(S_i)$ for student $S_i$
- the student's score in this class, which we denote by $G(S_i)$ for student $S_i$

At this point, we need to distinguish between two cases depending on the nature of the variables. A variable is called

- **qualitative**, if its possible values (also called *levels*) indicate membership in a **category**, such as the major variable above (taking the average does not make sense);
- **quantitative**, if its values are **numbers** (taking the average makes sense).

Let's investigate clustering for each of these variable categories.

Clustering a population using a qualitative variable is trivial, since each level defines a cluster.

Let's do a simple simulation, to see how to manipulate qualitative variables and their clusters in R. We consider again our class $\Omega$ of $n$ students, for which we will simulate $n$ values among the possible values ECON, MATH, STAT, and LITT for the qualitative variable $M$ indicating a student's major.

Let's first generate our population $\Omega$:

In [1]:
```
%load_ext rmagic
```

In [3]:
```
%%R
student.number = 100
omega = paste('S', 1:student.number, sep='')
omega
```

Qualitative variables are better represented in R with the special class `factor` than with the usual vectors:

In [4]:
```
%%R
major.names = c('ECON', 'STAT', 'MATH', 'LITT')
M = factor(sample(major.names, student.number, replace=T))
names(M) = omega
M
```

The function `summary` applied to a factor returns the number of individuals in each of the category clusters; observe that, for a numeric vector, `summary` would have returned the usual numeric summary (min, first quartile, median, etc.), which does not make sense for factors:

In [6]:
```
%%R
summary(M)
```

A factor can also be passed to the `plot` function, which will display a **barplot** representing the frequencies, and not a **dotplot**, as would have been the case for a quantitative variable:

In [7]:
```
%%R -r 86 -w 500 -h 300
plot(M)
```

We can use the `lapply` function on the levels of the qualitative variable, along with logical indexing, to set up a list whose elements contain the student IDs per cluster. Note that the function `levels` applied to a factor returns the levels of this factor.

In [8]:
```
%%R
clusters = lapply(levels(M), function(major) names(M[M==major]))
names(clusters) = levels(M)
clusters
```
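Incidentally, the same list of clusters can be obtained in one line with the built-in `split` function, which partitions a vector according to the levels of a factor:

```
# Equivalent to the lapply construction above
split(names(M), M)
```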

Clustering a population using a *quantitative* variable is more challenging, since it already involves setting up a distance between the observations, and computing the distance between each pair of observations. However, since in this case our observation for each student amounts to only one number, the grade for the class, the Euclidean distance between two grades coincides with the absolute value of their difference:

$d(x_i, x_j) = \sqrt{(x_i - x_j)^2} = |x_i - x_j|,$

where $x_i= G(S_i)$ and $x_j = G(S_j)$ are the grades of students $S_i$ and $S_j$, respectively. Since we only have one variable, we also don't really need any variable rescaling. However, we will see that we will need to specify the number of clusters.
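In fact, applying R's `dist` function to a one-dimensional vector returns exactly these pairwise absolute differences; for instance, on a small hypothetical vector of grades:

```
# Pairwise distances |x_i - x_j| between the grades 40, 45, and 90
dist(c(40, 45, 90))
```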

Let's start by simulating grades for our class $\Omega$. We want the grades to be between a minimum and a maximum score. So let's write a function that makes sure of that:

In [9]:
```
%%R
simulate.grades = function(n, score.max=100, score.mean=score.max/2, score.sd=score.max/5){
    grades = round(rnorm(n, mean=score.mean, sd=score.sd))
    grades[grades < 0] = 0
    grades[grades > score.max] = score.max
    return(grades)
}
```
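A quick sanity check (not in the original notebook) confirms that the simulated grades stay within the allowed range and are centered near the requested mean:

```
g = simulate.grades(1000)
range(g)   # within [0, 100]
mean(g)    # close to the default score.mean of 50
```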

The best way to visualize population clusters determined by one quantitative variable $G$ is to draw a **dotplot** or **stripchart**, representing all the $n$ observations (i.e., all the student grades), along with the information of the distance between them.

A dotplot (or stripchart) plots each observation $x_i$ as a point on a line, at distance $x_i$ from the origin. Groups of observations close to each other will visually appear as groups, separated from the other groups by larger bands of white space.

We now generate grades for the students in our class, using the major variable to artificially introduce clusters into the grade distribution. The idea is to use `rnorm(n, mean, sd)` to generate the grades, and set different means for different majors.

The function below takes a data frame `df`, and returns the same data frame with its column `simField` set to simulated values in a range specified by `range`; the values are simulated using the function `rnorm`, with different means for the different levels of the data frame column `byField`. These means are chosen randomly in the range indicated by the argument `range`.

In [10]:
```
%%R -r 86 -w 800 -h 400
simulate.clusters = function(df, simField=2, byField=1, range=c(min=0,max=100), sigma=3)
{
    cluster.names = levels(df[[byField]])
    clusters = lapply(cluster.names, function(clname) rownames(df[df[[byField]] == clname, ]))
    names(clusters) = cluster.names
    cluster.number = length(cluster.names)
    cluster.means = round((range['max']/cluster.number - 10)*1:cluster.number + rnorm(cluster.number, mean=0, sd=sigma))
    cluster.means = sample(cluster.means, cluster.number)
    for(i in 1:cluster.number) df[clusters[[i]], simField] = simulate.grades(length(clusters[[i]]),
                                                                             score.mean=cluster.means[i],
                                                                             score.sd=sigma)
    return(df)
}
```

In [18]:
```
%%R
score.max = 100
gradeBook = data.frame(Major=M, Score=rep(0, student.number))
gradeBook = simulate.clusters(gradeBook)
head(gradeBook)
```

Below, we draw a stripchart of the scores with the option `method='stack'`, which tells `stripchart` to plot observations with the same numerical value stacked on top of each other (other methods are `'jitter'` and the default `'overplot'`).

In [19]:
```
%%R -r 86 -w 800 -h 230
scores = gradeBook$Score
names(scores) = rownames(gradeBook)
stripchart(scores, method='stack', xlab='Student scores', main='Stacked stripchart of student grades')
```

Since, for this simulated grade data, we know what the clusters should correspond to (i.e., the class majors, since we randomly chose different means for different majors to simulate the grades), we can check that the clusters we see in the stripchart above correspond to those majors. For that, we will plot below a fancier dotplot: instead of plotting a mere point at, say, x=40 to represent a grade of 40 in our data set, we will display vertically at x=40 the major of the student whose grade is 40. Moreover, to make the major/cluster relationship very clear visually, we will use different colors for the major labels.

Make sure you understand the snippet of code below: we start by plotting an empty canvas, by passing the *no plotting* option `type='n'` ('n' stands for 'none'); then, we add text to the canvas using the plotting function `text` at the appropriate coordinate locations (passed as the first two arguments `x` and `y` of the plotting function).

In [20]:
```
%%R -r 86 -w 800 -h 250
xScores = gradeBook$Score; yHeight = rep(1.1, nrow(gradeBook))
plot(x=xScores, y=yHeight, type='n', yaxt='n', ylab='', xlab='', main='Dotplot of student grades')
colors = c('red', 'blue', 'purple', 'orange')[gradeBook$Major]
text(x=gradeBook$Score, y=1.1, labels=gradeBook$Major, srt=90, col=colors)
```

We can now try to recover these four clusters from the grades alone, using the `kmeans` function, which takes the data and the desired number of clusters; below, we color the major labels according to the cluster found by k-means:

In [30]:
```
%%R -r 86 -w 800 -h 250
fit = kmeans(scores, 4)
plot(scores, rep(1.1, length(scores)), type='n', yaxt='n', xlab='', ylab='')
colors = c('red', 'blue', 'purple', 'orange')[fit$cluster]
text(scores, 1.1, labels=gradeBook$Major, srt=90, col=colors)
```

The object returned by `kmeans` summarizes the fitted clustering:

In [33]:
```
%%R
fit
```

Its most useful components are the cluster assignment of each individual, the cluster centers, and the cluster sizes:

In [36]:
```
%%R
fit$cluster
fit$centers
fit$size
```
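To check how well k-means recovered the majors, we can cross-tabulate the fitted cluster labels against the `Major` factor (a quick verification sketch, not part of the original notebook); if the recovery is good, each row of the table is dominated by a single major:

```
table(fit$cluster, gradeBook$Major)
```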

For hierarchical clustering, we first compute the matrix of pairwise distances between the observations with the `dist` function:

In [46]:
```
%%R -r 86 -w 1000 -h 400
D = dist(scores, method='euclidean')
```

The `hclust` function takes this distance matrix and returns an object describing the full hierarchy of clusterings:

In [48]:
```
%%R
hfit = hclust(D)
names(hfit)
```

Plotting the result displays the hierarchy as a dendrogram; `rect.hclust` draws rectangles around the clusters obtained by cutting the tree into `k` groups:

In [49]:
```
%%R
plot(hfit, labels=gradeBook$Major)
rect.hclust(hfit, k=4, border="orange")
```

The function `cutree` cuts the tree into a given number of clusters and returns the cluster membership of each individual:

In [53]:
```
%%R
hgroups = cutree(hfit, 4)
hgroups
```

We can store these cluster labels as a new column of our grade book:

In [54]:
```
%%R
gradeBook$hcluster = cutree(hfit, 4)
head(gradeBook)
```
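As with k-means, a cross-tabulation (again a small verification sketch) lets us check whether the four hierarchical clusters line up with the majors:

```
table(gradeBook$hcluster, gradeBook$Major)
```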

We now consider clustering a **population** for which we have two quantitative variables.

Continuing with our grade example above, we will now suppose we have two grades for each student in our class: the midterm and the final grades. Let's simulate this data:

In [56]:
```
%%R
midterm = simulate.grades(student.number, score.mean=70, score.sd=10, score.max=100)
final = simulate.grades(student.number, score.mean=60, score.sd=10, score.max=100)
gradeBook2 = data.frame(Midterm=midterm, Final=final, Major=M, row.names=omega)
head(gradeBook2)
```

To visualize two quantitative variables simultaneously, we use **scatterplots**. The idea is the same as for dotplots: one represents each individual as a point, but now on a plane instead of on a line, since we have two coordinates (the values for both variables) instead of one.

**Definition:** A **scatter plot** of **two numeric vectors** $x = (x_1, \dots, x_n) $ and $y = (y_1, \dots, y_n)$ with the same number $n$ of elements is a graph with **two axes** representing the two **variable ranges** and displaying $n$ **points** with coordinates $(x_1, y_1),\dots ,\, (x_n, y_n)$.

Below is a scatterplot of the final and midterm grades we just generated:

In [57]:
```
%%R -r 96 -w 600 -h 600
plot(gradeBook2[c(-3)], xlim=c(0, score.max), ylim=c(0, score.max),
     main='A scatterplot of midterm grades against final grades')
abline(v=70, col='red')
abline(h=60, col='red')
```

Now suppose we want to add to our grade model the fact that students from different majors may perform differently in the class exams. For that, we will reuse the function

```
simulate.clusters(df, simField=2, byField=1, range=c(min=0,max=100), sigma=3)
```

we wrote above to introduce differences in values for the different levels of a given factor in a data frame. Again, we will use the `Major` factor in `gradeBook2` to clusterize our data.

In [62]:
```
%%R
gradeBook2 = simulate.clusters(gradeBook2, simField=1, byField=3, sigma=5)
gradeBook2 = simulate.clusters(gradeBook2, simField=2, byField=3, sigma=5)
head(gradeBook2)
```

In [63]:
```
%%R -r 96
colors = c('blue', 'orange', 'purple', 'red')
plot(gradeBook2[-3], xlim=c(0, 100), ylim=c(0, 100), col=colors[gradeBook2$Major])
legend('topleft', legend=levels(gradeBook2$Major), col=colors, pch=1)
```

When the noise level is high (e.g., `sigma = 8`), the cluster structure starts to be difficult to spot on a dotplot of each variable taken separately. However, the cluster structure for the same data is quite apparent on the 2D scatterplot above, which takes the clustering information of both variables into account simultaneously. This is clear by comparing the dotplots below with the scatterplot above. Note also that the dotplots below correspond to the orthogonal projections of the scatterplot data points onto its x and y axes, respectively.

In [64]:
```
%%R -r 96 -w 600 -h 600
par(mfrow=c(2,1))
stripchart(gradeBook2$Midterm, ylab='Midterm', method='jitter')
stripchart(gradeBook2$Final, ylab='Final', method='jitter')
```

With **two variables**, it is possible to visualize **population clusters** and **variable relations** on a **single scatter plot**, since it only involves plotting points on the **variable plane**.

With **three variables** we can still use a scatter plot, although we now need to plot our points in a **3 dimensional space**, which renders the interpretation slightly more challenging.

With **more than three variables**, the only way to visualize the **scatter plot** (which would involve plotting in an $n$-dimensional space, where $n$ is the number of variables) is to **plot 2-dimensional projections of the $n$-dimensional scatter plot**. This means that we plot the **scatter plots for each pair of variables**. The visual situation is however more challenging to interpret, and we will need other tools for that, as we will see.

To get started, let's add a variable to our simulated grade data and plot the 3D scatter plot.

The variable we will add is a homework grade, which, to keep the simulation simple, we generate in the same way as the midterm and final grades.

In [65]:
```
%%R
initVal = rep(0, student.number)
noise = 5
gradeBook3 = data.frame(Midterm=initVal, Final=initVal, Homework=initVal, Major=M, row.names=omega)
gradeBook3 = simulate.clusters(gradeBook3, simField=1, byField=4, sigma=noise)
gradeBook3 = simulate.clusters(gradeBook3, simField=2, byField=4, sigma=noise)
gradeBook3 = simulate.clusters(gradeBook3, simField=3, byField=4, sigma=noise)
head(gradeBook3)
```

The **lattice library** offers advanced plotting capabilities in R. To load it, type in

```
library(lattice)
```

To display **3D scatter plots** of three continuous variables, say $X$, $Y$, and $Z$, stored in a data frame `myFrame`, we will use the lattice function

```
cloud(Z ~ X + Y, data=myFrame, group=cat, auto.key=T)
```

- The **first argument** is an **R formula**. For now, you only need to understand that `Z ~ X + Y` means that $Z$ will be plotted on the $z$-axis, while $X$ and $Y$ will be plotted on the $xy$-plane.
- The **second argument** `data` is the data frame from which the **quantitative variables** $X$, $Y$, and $Z$ will be retrieved.
- The **third argument** `group` is a **categorical variable** from the data frame `myFrame`, indicating that the points from different categories should be displayed with different colors.
- The **last argument** `auto.key` is a Boolean argument indicating whether a legend describing which color corresponds to which category should be plotted.

Let us display a 3D scatter plot of our variables `Final`, `Midterm`, and `Homework`, and plot them with different colors depending on the `Major` variable.

**Remark:** In the notebook, one needs to store the output of `cloud` into a variable, which we then print in order for the plot to appear.

In [66]:
```
%%R -r 96 -w 700 -h 500
library(lattice)
grade_cloud = cloud(Final ~ Midterm + Homework, data=gradeBook3, group=Major, auto.key=T)
print(grade_cloud)
```

Another way to display **variable relations** and **population clusters** for **more than 2 variables** is to plot a **scatter plot matrix** containing a scatter plot of **each pair** of variables.

One can again pass our **color vector** to the `col` argument.

In [67]:
```
%%R -r 96 -w 500 -h 500
plot(gradeBook3[c(-4)], col=colors[gradeBook3$Major])
```

The **major clusters are clearly visible** in each of the scatter plots.

In contrast with our 3D scatter plot, **one can still see the linear relationship between the midterm score and the final score** that we built into our data.

Although still very useful for three variables, we begin to get a sense that **more sophisticated methods will be needed as the number of variables increases**... That's what we are going to learn next.

Let us add a fourth variable to our simulated grade data:

In [50]:
```
%%R
initVal = rep(0, student.number); noise = 5
gradeBook4 = data.frame(Final=initVal, Midterm=initVal, Homework=initVal, Quiz=initVal, Major=M, row.names=omega)
# Major is now the 5th column, so byField=5; only Final and Midterm get major clusters
gradeBook4 = simulate.clusters(gradeBook4, simField=1, byField=5, sigma=noise)
gradeBook4 = simulate.clusters(gradeBook4, simField=2, byField=5, sigma=noise)
# Homework and Quiz are simulated without clusters
gradeBook4$Homework = simulate.grades(student.number)
gradeBook4$Quiz = simulate.grades(student.number)
head(gradeBook4)
```

We can still display a **scatter plot matrix**:

In [306]:
```
%%R -r 96 -w 700 -h 700
plot(gradeBook4[, 1:4], col=colors[gradeBook4$Major])
```

We still see some clustering appearing for some variables, but for others (Quiz and Homework) the clustering does not seem to be present.

As the number of variables grows, we need other methods to retrieve clusters from the data.

Consider $n$ **variables** $X_1,\dots,X_n$ for a **population sample** of $m$ **individuals**.

All the **variable values** for all the **sample individuals** can be arranged into a matrix

$$A = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix},$$

where $x_{ij}$ is the value of the variable $X_j$ for the $i^{th}$ individual. In other words, this matrix is the **data frame stripped of its row and column labels**.

The **points** $v_1,\dots, v_m$ represented in a **scatter plot** are just the **row vectors** of this value matrix:

$$v_i = (x_{i1}, x_{i2}, \dots, x_{in}), \qquad i=1,\dots,m.$$

In R, a **matrix** is represented by a **vector** with a **hidden dimension attribute** set to the corresponding numbers of rows and columns, in the same way a **class instance** is a **list** with a **hidden class attribute** set to the **class name**.

This **hidden dimension attribute** is accessed and set through the function

```
dim(x)
```

Changing the dimension of a vector makes R interpret the **resulting object as an instance of the class** `matrix`.

In [307]:
```
%%R
x = c(1,2,3,4,5,6,7,8,9)
print(class(x))
dim(x) = c(3,3)
print(class(x)); cat('\n')
print(x)
```

One can extract the **value matrix** from a data frame using the **conversion function** `as.matrix(x)`:

In [308]:
```
%%R
scores = as.matrix(gradeBook4[, 1:4])
print(class(scores))
```

The **distance matrix** of a value matrix $A=(x_{ij})$ is the matrix containing all the **distances between** the matrix **row vectors** $v_1,v_2,\dots, v_m$:

$$D = \begin{pmatrix} d(v_1,v_1) & d(v_1,v_2) & \cdots & d(v_1,v_m) \\ d(v_2,v_1) & d(v_2,v_2) & \cdots & d(v_2,v_m) \\ \vdots & \vdots & & \vdots \\ d(v_m,v_1) & d(v_m,v_2) & \cdots & d(v_m,v_m) \end{pmatrix},$$

where

$$d(v_i,v_j) = \sqrt{\sum_{k=1}^{n} (x_{ik} - x_{jk})^2}$$

is the **Euclidean distance** between the **row vectors** $v_i$ and $v_j$.

In R, one computes **the distance matrix** using the function

```
dist(x)
```

The **distance matrix** contains all the necessary information to perform **clustering**.

In [309]:
```
%%R
scores_dist = dist(scores)
```
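As a sanity check (a small verification sketch, not in the original notebook), we can recompute the distance between the first two individuals by hand and compare it with the corresponding entry of `scores_dist`:

```
# Euclidean distance between the first two rows, computed manually
sqrt(sum((scores[1,] - scores[2,])^2))

# The same entry read off the distance matrix
as.matrix(scores_dist)[1, 2]
```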

The function

```
hclust(distance_matrix)
```

takes a **distance matrix** and returns an object of the class `hclust`, which contains a **hierarchical grouping of the population**.

This **hierarchical grouping** is computed by `hclust` as follows:

- it starts by **grouping individuals (i.e., rows) two by two by smallest distance**;
- then it computes a distance between these pairs and **groups these pairs again two by two by smallest distance**;
- the algorithm continues until there is **no group left for grouping**.

The **resulting hierarchy** can then be plotted as a **dendrogram** using the `plot` function.
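To see this agglomerative procedure on a tiny example (hypothetical numbers, chosen for illustration), consider four one-dimensional points; the `merge` and `height` components of the returned object record which groups were merged and at what distances:

```
# Points 1 and 2 merge first (distance 1), then 10 and 12 (distance 2),
# and finally the two resulting groups merge
h = hclust(dist(c(1, 2, 10, 12)))
h$merge    # the sequence of mergers
h$height   # the distances at which the mergers occur
```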

In [310]:
```
%%R -r 96 -w 700 -h 400
hclusters = hclust(scores_dist)
plot(hclusters, main='Dendrogram of the class scores')
```

We see from the dendrogram that we can obtain a different number of clusters by **cutting the dendrogram at a given height**.

In our example,

- **cutting** at a **distance between 60 and 120** makes **two clusters emerge**, which most probably correspond to the two main groups of majors;
- **cutting** at a **distance of 40** produces **several additional clusters**, revealing more structure in the data.

Of course, if the distance between two clusters is **small**, it may indicate that the **cluster structure** is not very strong. In our example, only the two main clusters seem to be real.

To **retrieve a given number of clusters**, one uses the function

```
cutree(hcluster_object, number_of_clusters)
```

which returns an **integer vector** where each element is an **integer** telling **which cluster the corresponding individual in our population belongs to**. The **labels** of this vector are the **individual names**.

In [311]:
```
%%R
clusters = cutree(hclusters, 2)
print(clusters)
```

In [312]:
```
%%R
cluster1.names = names(clusters[clusters == 1])
students1 = gradeBook4[cluster1.names, ]
print(students1)
```