Unit III. Clustering and classification.

Decision trees.

  • CHAID segmentation techniques.
  • Relationship to neural networks and Bayesian networks.
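
The examples below work through these ideas with the DecisionTree.jl package, which implements CART-style classification trees and random forests; CHAID-style segmentation follows the same recursive-partitioning idea but chooses splits with chi-squared tests rather than a purity criterion.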

In [1]:
using RDatasets
using DecisionTree


INFO: Precompiling module DecisionTree.

In [2]:
# load the classic iris dataset from RDatasets
iris = dataset("datasets", "iris")

head(iris)


Out[2]:
   SepalLength  SepalWidth  PetalLength  PetalWidth  Species
1  5.1          3.5         1.4          0.2         setosa
2  4.9          3.0         1.4          0.2         setosa
3  4.7          3.2         1.3          0.2         setosa
4  4.6          3.1         1.5          0.2         setosa
5  5.0          3.6         1.4          0.2         setosa
6  5.4          3.9         1.7          0.4         setosa

In [3]:
# feature matrix: the four numeric measurement columns
features = convert(Array, iris[:, 1:4])
# label vector: the Species column
labels = convert(Array, iris[:, 5])


Out[3]:
150-element Array{String,1}:
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 "setosa"   
 ⋮          
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"
 "virginica"

In [4]:
# train full-tree classifier
model = build_tree(labels, features)


Out[4]:
Decision Tree
Leaves: 9
Depth:  5

In [5]:
# prune tree: merge leaves having >= 90% combined purity (default: 100%)
model = prune_tree(model, 0.9)


Out[5]:
Decision Tree
Leaves: 8
Depth:  5
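
One pair of sibling leaves was merged, taking the tree from 9 leaves to 8. Since build_tree grows leaves to purity by default, the merged leaf shows up in the printout below as versicolor : 47/48, i.e. 48 training samples of which 47 are versicolor (~98% purity, above the 0.9 threshold).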

In [6]:
# pretty print of the tree, to a depth of 5 nodes (optional)
print_tree(model, 5)


Feature 3, Threshold 3.0
L-> setosa : 50/50
R-> Feature 4, Threshold 1.8
    L-> Feature 3, Threshold 5.0
        L-> versicolor : 47/48
        R-> Feature 4, Threshold 1.6
            L-> virginica : 3/3
            R-> Feature 1, Threshold 7.2
                L-> versicolor : 2/2
                R-> virginica : 1/1
    R-> Feature 3, Threshold 4.9
        L-> Feature 1, Threshold 6.0
            L-> versicolor : 1/1
            R-> virginica : 2/2
        R-> virginica : 43/43
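
Feature indices refer to columns of the feature matrix: Feature 1 = SepalLength, Feature 2 = SepalWidth, Feature 3 = PetalLength, Feature 4 = PetalWidth. At each split, samples with a feature value below the threshold go left (L) and the rest go right (R); each leaf shows how many of its training samples carry the predicted label.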

In [7]:
# apply learned model
apply_tree(model, [5.9,3.0,5.1,1.9])


Out[7]:
"virginica"

In [8]:
# get the probability of each label
apply_tree_proba(model, [5.9,3.0,5.1,1.9], ["setosa", "versicolor", "virginica"])


Out[8]:
3-element Array{Float64,1}:
 0.0
 0.0
 1.0
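
apply_tree_proba reports the label fractions in the leaf the sample reaches. The sample lands in the pure virginica : 43/43 leaf, so all probability mass falls on virginica.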

In [9]:
# run n-fold cross validation for pruned tree,
# using 90% purity threshold pruning, and 3 CV folds
accuracy = nfoldCV_tree(labels, features, 0.9, 3)


Fold 1
Classes:  Any["setosa","versicolor","virginica"]
Matrix:
3×3 Array{Int64,2}:
 19   0   0
  0  15   0
  0   3  13
Accuracy: 0.94
Kappa:    0.9096929560505719

Fold 2
Classes:  Any["setosa","versicolor","virginica"]
Matrix:
3×3 Array{Int64,2}:
 18   0   0
  1  11   0
  0   2  18
Accuracy: 0.94
Kappa:    0.9086479902557856

Fold 3
Classes:  Any["setosa","versicolor","virginica"]
Matrix:
3×3 Array{Int64,2}:
 13   0   0
  0  21   2
  0   1  13
Accuracy: 0.94
Kappa:    0.9071207430340557

Mean Accuracy: 0.94
Out[9]:
3-element Array{Float64,1}:
 0.94
 0.94
 0.94
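
The Kappa figure is Cohen's kappa: accuracy rescaled by the agreement expected from the class marginals alone. It can be recomputed by hand from a fold's confusion matrix; a minimal sketch for the Fold 1 matrix above (the variable names here are illustrative, not part of DecisionTree.jl):

C = [19 0 0; 0 15 0; 0 3 13]   # Fold 1 confusion matrix from the output above
n = sum(C)

# observed agreement: the diagonal fraction (this is the reported accuracy, 0.94)
p_o = sum(C[i, i] for i in 1:3) / n

# chance agreement expected from the row and column marginals
p_e = sum(sum(C[i, :]) * sum(C[:, i]) for i in 1:3) / n^2

# Cohen's kappa; reproduces the reported 0.9096929560505719
kappa = (p_o - p_e) / (1 - p_e)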

Random Forest


In [10]:
# train random forest classifier
# using 2 random features, 10 trees, and 0.5 portion of samples per tree (optional)
model = build_forest(labels, features, 2, 10, 0.5)


Out[10]:
Ensemble of Decision Trees
Trees:      10
Avg Leaves: 6.2
Avg Depth:  4.5
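
Because each tree sees a random 50% subsample and considers only 2 randomly chosen features per split, the forest is non-deterministic: leaf counts, depths, and the cross-validation figures below will vary between runs.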

In [11]:
# apply learned model
apply_forest(model, [5.9,3.0,5.1,1.9])


Out[11]:
"virginica"

In [12]:
# get the probability of each label
apply_forest_proba(model, [5.9,3.0,5.1,1.9], ["setosa", "versicolor", "virginica"])


Out[12]:
3-element Array{Float64,1}:
 0.0
 0.0
 1.0
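
For an ensemble, the probabilities are vote fractions across the trees; here all 10 trees classify the sample as virginica.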

In [13]:
# run n-fold cross validation for forests
# using 2 random features, 10 trees, 3 folds and 0.5 of samples per tree (optional)
accuracy = nfoldCV_forest(labels, features, 2, 10, 3, 0.5)


Fold 1
Classes:  Any["setosa","versicolor","virginica"]
Matrix:
3×3 Array{Int64,2}:
 24   0   0
  0  13   0
  0   0  13
Accuracy: 1.0
Kappa:    1.0

Fold 2
Classes:  Any["setosa","versicolor","virginica"]
Matrix:
3×3 Array{Int64,2}:
 13   0   0
  1  17   2
  0   0  17
Accuracy: 0.94
Kappa:    0.9093655589123866

Fold 3
Classes:  Any["setosa","versicolor","virginica"]
Matrix:
3×3 Array{Int64,2}:
 13   0   0
  0  16   1
  0   2  18
Accuracy: 0.94
Kappa:    0.9088145896656534

Mean Accuracy: 0.96
Out[13]:
3-element Array{Float64,1}:
 1.0 
 0.94
 0.94
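
On this run the random forest's mean cross-validated accuracy (0.96) edges out the pruned single tree (0.94), though with only 150 samples and randomized training the gap is within run-to-run variation.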