Unit III. Clustering and classification.

Topological clustering techniques, hierarchical and linear.

Clustering

A good description and comparison of different clustering algorithms is available on the Scikit Learn site: Clustering

Several libraries in different languages (R, Python, Julia) implement a range of clustering algorithms. In Julia, the main ones are:


In [1]:
using RDatasets

# Load Fisher's iris data set (150 flowers, 4 measurements plus the species)
iris = dataset("datasets", "iris")

# Show the first rows
head(iris)


Out[1]:
     SepalLength  SepalWidth  PetalLength  PetalWidth  Species
1    5.1          3.5         1.4          0.2         setosa
2    4.9          3.0         1.4          0.2         setosa
3    4.7          3.2         1.3          0.2         setosa
4    4.6          3.1         1.5          0.2         setosa
5    5.0          3.6         1.4          0.2         setosa
6    5.4          3.9         1.7          0.4         setosa

In [2]:
# Keep only the four numeric measurements, as a 150×4 matrix
X = convert(Matrix{Float64}, iris[:,1:4])


Out[2]:
150×4 Array{Float64,2}:
 5.1  3.5  1.4  0.2
 4.9  3.0  1.4  0.2
 4.7  3.2  1.3  0.2
 4.6  3.1  1.5  0.2
 5.0  3.6  1.4  0.2
 5.4  3.9  1.7  0.4
 4.6  3.4  1.4  0.3
 5.0  3.4  1.5  0.2
 4.4  2.9  1.4  0.2
 4.9  3.1  1.5  0.1
 5.4  3.7  1.5  0.2
 4.8  3.4  1.6  0.2
 4.8  3.0  1.4  0.1
 ⋮                 
 6.0  3.0  4.8  1.8
 6.9  3.1  5.4  2.1
 6.7  3.1  5.6  2.4
 6.9  3.1  5.1  2.3
 5.8  2.7  5.1  1.9
 6.8  3.2  5.9  2.3
 6.7  3.3  5.7  2.5
 6.7  3.0  5.2  2.3
 6.3  2.5  5.0  1.9
 6.5  3.0  5.2  2.0
 6.2  3.4  5.4  2.3
 5.9  3.0  5.1  1.8

In [6]:
using Distances

In [7]:
# pairwise compares the columns of X: distances between the 4 variables
distancia = pairwise(Euclidean(), X)


Out[7]:
4×4 Array{Float64,2}:
  0.0     36.1578  28.9662  57.183 
 36.1578   0.0     25.7781  25.8641
 28.9662  25.7781   0.0     33.8647
 57.183   25.8641  33.8647   0.0   
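The 4×4 result above follows from how `pairwise` is applied here: given the 150×4 matrix `X`, it compares columns, so the result holds the distances between the four variables rather than between the flowers. A minimal sketch of that column-wise computation in plain Julia, assuming a recent Julia 1.x; `column_distances` and the `toy` matrix are hypothetical names used only for illustration and are not part of Distances.jl:

```julia
# Pairwise Euclidean distances between the columns of a matrix,
# written with Base Julia only (illustrative helper, not a library function).
function column_distances(M::AbstractMatrix{<:Real})
    n = size(M, 2)
    D = zeros(Float64, n, n)
    for i in 1:n, j in (i + 1):n
        d = sqrt(sum(abs2, M[:, i] .- M[:, j]))
        D[i, j] = d
        D[j, i] = d   # the matrix is symmetric
    end
    return D
end

# Toy data: 3 observations (rows) of 2 variables (columns).
toy = [0.0 3.0;
       0.0 4.0;
       0.0 0.0]

D_columns = column_distances(toy)               # 2×2: distances between variables
D_rows    = column_distances(permutedims(toy))  # 3×3: distances between observations
```

Transposing the input is what switches the comparison from variables to observations.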

In [8]:
using RCall

In [9]:
R"""
hc <- hclust(as.dist($distancia), method="complete")
"""


Out[9]:
RCall.RObject{RCall.VecSxp}

Call:
hclust(d = as.dist(`#JL`$distancia), method = "complete")

Cluster method   : complete 
Number of objects: 4 
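The `method` argument of `hclust` selects the linkage rule, that is, how the distance between two clusters is derived from the distances between their members: complete linkage takes the largest pairwise distance and single linkage the smallest. A minimal sketch in plain Julia (assuming Julia 1.x); `linkage_distance`, the matrix `D` and the index vectors are hypothetical and chosen only to illustrate the two rules:

```julia
# Distance between two clusters under a given linkage rule.
# D is a symmetric matrix of pairwise distances; a and b hold member indices.
# `rule` is `minimum` for single linkage and `maximum` for complete linkage.
linkage_distance(D, a, b, rule) = rule(D[i, j] for i in a, j in b)

# Hypothetical distances between four points.
D = [0.0 1.0 4.0 6.0;
     1.0 0.0 2.0 5.0;
     4.0 2.0 0.0 3.0;
     6.0 5.0 3.0 0.0]

left, right = [1, 2], [3, 4]

single_link   = linkage_distance(D, left, right, minimum)  # closest pair: points 2 and 3
complete_link = linkage_distance(D, left, right, maximum)  # farthest pair: points 1 and 4
```

Because the two rules can differ this much for the same pair of clusters, cutting both trees at the same height generally yields different partitions.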


In [10]:
R"""
# Dendrogram with the variable names as leaf labels; boxes mark the clusters cut at height 30
plot(hc, labels=$(names(iris)[1:4]))
rect.hclust(hc, h=30, border="gray")
""";

In [8]:
R"""
# Same distance matrix, now with single linkage
hc <- hclust(as.dist($distancia), method="single")
plot(hc, labels=$(names(iris)[1:4]))
rect.hclust(hc, h=30, border="gray")
""";



In [11]:
# Transposing X puts one flower per column: distances between the 150 flowers
distancia = pairwise(Euclidean(), X')


Out[11]:
150×150 Array{Float64,2}:
 0.0       0.538516  0.509902  0.648074  …  4.45982   4.65081   4.14005 
 0.538516  0.0       0.3       0.331662     4.49889   4.71805   4.15331 
 0.509902  0.3       0.0       0.244949     4.66154   4.84871   4.29884 
 0.648074  0.331662  0.244949  0.0          4.53321   4.71911   4.1497  
 0.141421  0.608276  0.509902  0.648074     4.50444   4.67868   4.17373 
 0.616441  1.09087   1.08628   1.16619   …  4.10244   4.26497   3.81838 
 0.519615  0.509902  0.264575  0.331662     4.59347   4.74974   4.21782 
 0.173205  0.424264  0.412311  0.5          4.39773   4.58912   4.06079 
 0.921954  0.509902  0.43589   0.3          4.70106   4.88876   4.30232 
 0.469042  0.173205  0.316228  0.316228     4.45758   4.67226   4.10609 
 0.374166  0.866025  0.883176  1.0       …  4.31625   4.5111    4.03237 
 0.374166  0.458258  0.374166  0.374166     4.38748   4.5618    4.02244 
 0.591608  0.141421  0.264575  0.264575     4.57602   4.79166   4.21782 
 ⋮                                       ⋱                              
 3.89615   3.91535   4.06694   3.92683      0.67082   0.9       0.316228
 4.79687   4.86004   5.02693   4.91019      0.469042  0.787401  1.09087 
 5.01996   5.07247   5.22877   5.1049    …  0.608276  0.6245    1.1225  
 4.63681   4.70213   4.86826   4.76025      0.519615  0.818535  1.1225  
 4.20833   4.18091   4.33474   4.17732      0.774597  0.948683  0.331662
 5.25738   5.32071   5.4754    5.34977      0.842615  0.806226  1.31909 
 5.13615   5.20673   5.3535    5.23259      0.793725  0.6245    1.25698 
 4.65403   4.7       4.86415   4.74552   …  0.360555  0.67082   0.948683
 4.27668   4.24971   4.43058   4.28836      0.583095  1.06771   0.655744
 4.45982   4.49889   4.66154   4.53321      0.0       0.616441  0.640312
 4.65081   4.71805   4.84871   4.71911      0.616441  0.0       0.768115
 4.14005   4.15331   4.29884   4.1497       0.640312  0.768115  0.0     

In [17]:
R"""
# Complete linkage on the 150 flowers; boxes mark the clusters cut at height 1.0
hcomplete <- hclust(as.dist($distancia), method="complete")
plot(hcomplete)
rect.hclust(hcomplete, h=1.0, border="gray")
""";

In [18]:
R"""
# Single linkage on the same distances, cut at the same height
hsingle <- hclust(as.dist($distancia), method="single")
plot(hsingle)
rect.hclust(hsingle, h=1.0, border="gray")
""";

In [19]:
# Cut the single-linkage tree at height 1.0 and copy the cluster labels to Julia
hsingle = rcopy( R"cutree(hsingle, h=1.0)" )


Out[19]:
150-element Array{Int32,1}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2

In [20]:
# Same cut for the complete-linkage tree
hcomplete = rcopy( R"cutree(hcomplete, h=1.0)" )


Out[20]:
150-element Array{Int32,1}:
  1
  1
  2
  2
  1
  3
  1
  1
  2
  1
  3
  1
  1
  ⋮
 13
 21
 15
 21
 13
 15
 15
 21
 14
 21
 15
 13

In [21]:
using FreqTables

In [22]:
# Cross-tabulate the two labelings (rows: complete linkage, columns: single linkage)
freqtable(hcomplete, hsingle)


Out[22]:
22×2 Named Array{Int64,2}
Dim1 ╲ Dim2 │  1   2
────────────┼───────
1           │ 22   0
2           │  8   0
3           │ 14   0
4           │  5   0
5           │  1   0
6           │  0   9
7           │  0   9
8           │  0   8
9           │  0  14
10          │  0   4
11          │  0   1
12          │  0   3
13          │  0   9
14          │  0   8
15          │  0   8
16          │  0   5
17          │  0   7
18          │  0   3
19          │  0   1
20          │  0   2
21          │  0   7
22          │  0   2
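In the table above every row has a single non-zero entry, so at height 1.0 each complete-linkage cluster lies entirely inside one single-linkage cluster. The sketch below shows how such a cross-tabulation can be built and checked in plain Julia (assuming Julia 1.x); `crosstab` and the toy label vectors are hypothetical and do not reproduce the FreqTables.jl implementation:

```julia
# Cross-tabulate two label vectors: counts[i, j] is the number of items
# with label i in `a` and label j in `b`. Labels are assumed to be 1..k.
function crosstab(a::Vector{Int}, b::Vector{Int})
    length(a) == length(b) || throw(ArgumentError("label vectors differ in length"))
    counts = zeros(Int, maximum(a), maximum(b))
    for (i, j) in zip(a, b)
        counts[i, j] += 1
    end
    return counts
end

# Hypothetical labelings: `fine` splits the items further than `coarse`.
fine   = [1, 1, 2, 2, 3, 3, 3]
coarse = [1, 1, 1, 1, 2, 2, 2]

table = crosstab(fine, coarse)

# `fine` is nested in `coarse` when each row has exactly one non-zero count.
nested = all(count(!iszero, table[i, :]) == 1 for i in 1:size(table, 1))
```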

In [23]:
using Clustering



In [24]:
# Variation of information between the two labelings; this call fails with Int32 labels
varinfo(maximum(hcomplete), hcomplete, maximum(hsingle), hsingle)


MethodError: no method matching varinfo(::Int32, ::Array{Int32,1}, ::Int32, ::Array{Int32,1})

In [25]:
methods(varinfo)


Out[25]:
3 methods for generic function varinfo:

In [26]:
# Convert the Int32 labels copied from R to Int64 so that varinfo accepts them
hcomplete = convert(Vector{Int64}, hcomplete)
hsingle   = convert(Vector{Int64}, hsingle)


Out[26]:
150-element Array{Int64,1}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2
 2

In [27]:
varinfo(maximum(hcomplete), hcomplete, maximum(hsingle), hsingle)


Out[27]:
2.2007191709028677
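The variation of information between two labelings A and B can be written as VI(A, B) = H(A) + H(B) - 2 I(A; B), or equivalently 2 H(A, B) - H(A) - H(B), where H is the entropy of the cluster-size distribution and I the mutual information. The sketch below computes it from scratch in plain Julia (assuming Julia 1.x and natural logarithms); `label_entropy` and `variation_of_information` are hypothetical helpers, and whether they reproduce the value above depends on the logarithm base used by `varinfo`, which is not checked here:

```julia
# Entropy (natural log) of the cluster-size distribution of a labeling.
# Labels are assumed to be positive integers.
function label_entropy(labels::Vector{Int})
    n = length(labels)
    counts = zeros(Int, maximum(labels))
    for l in labels
        counts[l] += 1
    end
    return -sum(c / n * log(c / n) for c in counts if c > 0)
end

# VI(A, B) = 2 H(A, B) - H(A) - H(B), with H(A, B) the joint entropy.
function variation_of_information(a::Vector{Int}, b::Vector{Int})
    length(a) == length(b) || throw(ArgumentError("label vectors differ in length"))
    kb = maximum(b)
    # Encode each (a, b) pair as a single label to obtain the joint labeling.
    joint = [(x - 1) * kb + y for (x, y) in zip(a, b)]
    return 2 * label_entropy(joint) - label_entropy(a) - label_entropy(b)
end

vi_same        = variation_of_information([1, 1, 2, 2], [1, 1, 2, 2])  # identical labelings
vi_independent = variation_of_information([1, 1, 2, 2], [1, 2, 1, 2])  # unrelated labelings
```

Identical labelings give a value of zero, and the value grows as the two partitions share less information.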

In [28]:
using Bootstrap



In [29]:
# Sanity check: the variation of information of a labeling with itself should be 0
( varinfo(maximum(hcomplete), hcomplete, maximum(hcomplete), hcomplete), 
  varinfo(maximum(hsingle), hsingle, maximum(hsingle), hsingle) )


Out[29]:
(8.881784197001252e-16,0.0)

In [30]:
# One row per flower: column 1 = complete-linkage label, column 2 = single-linkage label
asignaciones = hcat(hcomplete, hsingle)


Out[30]:
150×2 Array{Int64,2}:
  1  1
  1  1
  2  1
  2  1
  1  1
  3  1
  1  1
  1  1
  2  1
  1  1
  3  1
  1  1
  1  1
  ⋮   
 13  2
 21  2
 15  2
 21  2
 13  2
 15  2
 15  2
 21  2
 14  2
 21  2
 15  2
 13  2

In [31]:
# Variation of information between both labelings, restricted to the given rows
function VI(indices)
    A = asignaciones[indices,1]
    B = asignaciones[indices,2]
    varinfo(maximum(A), A, maximum(B), B)
end


Out[31]:
VI (generic function with 1 method)

In [36]:
# Bootstrap the row indices (10,000 resamples) and recompute VI on each resample
VI_boot = bootstrap(collect(1:size(asignaciones,1)), VI, BasicSampling(10_000))


Out[36]:
Bootstrap Sampling
  Estimates:
    │ Var │ Estimate │ Bias       │ StdError  │
    ├─────┼──────────┼────────────┼───────────┤
    │ 1   │ 2.20072  │ -0.0707971 │ 0.0717877 │
  Sampling: BasicSampling
  Samples:  10000
  Data:     Array{Int64,1}: { 150 }
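The call above resamples the row indices rather than the labels themselves, which is why `VI` takes a vector of indices. A minimal sketch of that idea in plain Julia, assuming Julia 1.x with the `Random` and `Statistics` standard libraries; `bootstrap_replicates`, `sample_data` and `stat` are hypothetical names, the statistic is a simple mean rather than the variation of information, and bias and standard error follow the usual bootstrap definitions rather than the internals of Bootstrap.jl:

```julia
using Random, Statistics

# Draw `nboot` resamples of the indices 1:n (with replacement) and
# evaluate the statistic on each resample.
function bootstrap_replicates(statistic, n::Int, nboot::Int, rng::AbstractRNG)
    return [statistic(rand(rng, 1:n, n)) for _ in 1:nboot]
end

# Hypothetical toy data; the statistic receives indices, as VI does.
sample_data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
stat(indices) = mean(sample_data[indices])

replicates = bootstrap_replicates(stat, length(sample_data), 2_000, MersenneTwister(1))

estimate  = stat(1:length(sample_data))   # statistic on the original data
bias      = mean(replicates) - estimate   # usual bootstrap bias estimate
std_error = std(replicates)               # usual bootstrap standard error
```

Passing the random number generator explicitly keeps the resampling reproducible.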

In [39]:
# 95% bias-corrected and accelerated (BCa) confidence interval
ci(VI_boot, BCaConfInt(0.95))


Out[39]:
((2.2007191709028677,2.132464155482269,2.3668591663516687),)

In [43]:
using Plots, StatPlots
pyplot(size=(300,300))

# Histogram of the bootstrap replicates, with vertical lines at 0 and at the observed VI
histogram(VI_boot.t1, linecolor=:grey, fillcolor=:grey, legend=false)
vline!([0, VI_boot.t0])


Out[43]:
