This notebook uses the SOMPY library to demonstrate the Self-Organising Map (SOM) algorithm. The data is the California Housing dataset, which ships with the scikit-learn library and is described below.
In [1]:
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
from sompy.sompy import SOMFactory
from sklearn.datasets import fetch_california_housing
First, the data is loaded into the local environment as a NumPy array.
In [2]:
data = fetch_california_housing()
descr = data.DESCR
names = data.feature_names + ["HouseValue"]  # reuse the dataset already loaded above
data = np.column_stack([data.data, data.target])
print(descr)
print("FEATURES: ", ", ".join(names))
SOM training consists of two phases: a rough phase and a finetune phase. Several parameters can be configured in the training step, among them the number of rough and finetune iterations and the initialization mechanism.
In this example, only the rough and finetune iteration counts and the initialization mechanism are set explicitly. The remaining parameters are left unspecified so that the algorithm chooses them automatically.
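To illustrate the two phases, here is a minimal NumPy sketch of batch SOM training. It is not SOMPY's code: it assumes a rectangular grid, a Gaussian neighborhood, and a linearly shrinking radius, and the function name `train_phase` and the radius values are hypothetical choices made for illustration. The rough phase uses a wide neighborhood so that the map orders itself globally; the finetune phase uses a narrow one to adjust nodes locally.

```python
import numpy as np

def train_phase(codebook, grid, data, n_iter, radius_start, radius_end):
    """Run n_iter batch updates with a linearly shrinking Gaussian neighborhood."""
    # Squared distances between every pair of nodes on the 2-D grid.
    grid_d2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    for radius in np.linspace(radius_start, radius_end, n_iter):
        # Best matching unit (closest node) for every sample.
        d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        bmu = d2.argmin(axis=1)
        # h[j, i]: influence of sample i on node j, based on the grid
        # distance between node j and the sample's BMU.
        h = np.exp(-grid_d2[:, bmu] / (2.0 * radius ** 2))
        # Each node moves to the neighborhood-weighted mean of the data.
        codebook = (h @ data) / h.sum(axis=1, keepdims=True)
    return codebook

rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 3))  # synthetic data, illustration only
rows, cols = 4, 5
grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
# Random initialization: distinct samples become the initial node weights.
codebook = samples[rng.choice(len(samples), rows * cols, replace=False)]

# Rough phase: few iterations, wide radius (assumed values).
codebook = train_phase(codebook, grid, samples, 2, radius_start=3.0, radius_end=1.0)
# Finetune phase: more iterations, narrow radius (assumed values).
codebook = train_phase(codebook, grid, samples, 5, radius_start=1.0, radius_end=0.25)
```

The iteration counts 2 and 5 mirror the `train_rough_len` and `train_finetune_len` values used in this notebook; the radii have no counterpart in the notebook and were picked only to show the wide-then-narrow pattern.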
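Two metrics quantify the error of the approximation: the quantization error, which is the average distance between each data vector and its best matching unit (BMU), and the topographic error, which is the proportion of data vectors whose first and second BMUs are not adjacent on the map.
A rule of thumb is to train several models with different parameters and choose the one that has the lowest quantization error among those whose topographic error is very close to zero. Keeping the topographic error low matters because it makes the component maps smooth and easy to interpret.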
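The rule of thumb can be written as a small selection function. The sketch below is plain Python and independent of SOMPY; the function name `pick_model`, the tolerance of 0.05, and the error values are all made up for illustration. In practice each entry would hold the errors measured on a trained map.

```python
def pick_model(candidates, te_tolerance=0.05):
    """Return the candidate with the lowest quantization error (qe) among
    those whose topographic error (te) is at most te_tolerance."""
    eligible = [c for c in candidates if c["te"] <= te_tolerance]
    if not eligible:
        # No map preserves topology well enough: fall back to the
        # candidate with the smallest topographic error.
        return min(candidates, key=lambda c: c["te"])
    return min(eligible, key=lambda c: c["qe"])

# Made-up error values, for illustration only.
candidates = [
    {"name": "random init, 2+5 iterations", "te": 0.20, "qe": 0.55},
    {"name": "pca init, 10+20 iterations", "te": 0.01, "qe": 0.62},
    {"name": "pca init, 20+40 iterations", "te": 0.02, "qe": 0.58},
]
best = pick_model(candidates)
print(best["name"])
```

The first candidate has the lowest quantization error overall, but it is rejected because its topographic error exceeds the tolerance.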
In [3]:
#msz = calculate_msz(data)
sm = SOMFactory().build(data, normalization = 'var', initialization='random', component_names=names)
sm.train(n_job=1, verbose=False, train_rough_len=2, train_finetune_len=5)
In [4]:
topographic_error = sm.calculate_topographic_error()
quantization_error = np.mean(sm._bmu[1])
print("Topographic error = %s; Quantization error = %s" % (topographic_error, quantization_error))
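To make the two numbers printed above concrete, the following NumPy sketch computes both errors from scratch on a toy map. It is an illustration under stated assumptions, not SOMPY's implementation: the function name `som_errors` is hypothetical, the grid is rectangular, and two nodes count as neighbors when they touch horizontally, vertically, or diagonally.

```python
import numpy as np

def som_errors(data, codebook, grid):
    """Return (quantization_error, topographic_error) for a rectangular map.

    data: (n_samples, n_features); codebook: (n_nodes, n_features);
    grid: (n_nodes, 2) integer row/column position of each node.
    """
    dist = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    order = np.argsort(dist, axis=1)
    first, second = order[:, 0], order[:, 1]
    # Quantization error: mean distance from each sample to its BMU.
    qe = dist[np.arange(len(data)), first].mean()
    # Topographic error: share of samples whose two best nodes are not
    # grid neighbors (assumed: more than one step apart in row or column).
    gap = np.abs(grid[first] - grid[second]).max(axis=1)
    te = (gap > 1).mean()
    return qe, te

# Toy 1x3 map whose middle node is "folded" far away from its neighbors.
grid = np.array([[0, 0], [0, 1], [0, 2]])
codebook = np.array([[0.0], [10.0], [1.0]])
data = np.array([[0.4], [9.0]])
qe, te = som_errors(data, codebook, grid)
print(qe, te)
```

The sample 0.4 has nodes 0 and 2 as its two best matches, which are not adjacent on the grid, so it counts as a topographic error; the sample 9.0 does not.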
In [5]:
from sompy.visualization.mapview import View2D
view2D = View2D(10,10,"rand data",text_size=10)
view2D.show(sm, col_sz=4, which_dim="all", denormalize=True)
In [6]:
from sompy.visualization.bmuhits import BmuHitsView
vhts = BmuHitsView(10,10,"Hits Map",text_size=7)
vhts.show(sm, anotate=True, onlyzeros=False, labelsize=12, cmap="Greys", logaritmic=False)
In [7]:
from sompy.visualization.hitmap import HitMapView
sm.cluster(4)
hits = HitMapView(10,10,"Clustering",text_size=7)
a=hits.show(sm, labelsize=12)
Several conclusions can be drawn from the visualizations above.
Note that some areas of the map have a lower density of instances than others. The hit map shows this, and it should be taken into consideration when interpreting the component maps.
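The density shown in a hit map is simply the number of samples assigned to each node. A minimal NumPy sketch follows, assuming that nodes are numbered row by row and that the BMU index of every sample is already known; the function name `hit_counts`, the example indices, and the threshold of 2 are hypothetical.

```python
import numpy as np

def hit_counts(bmu_index, rows, cols):
    """Count how many samples map to each node, arranged as the map grid."""
    counts = np.bincount(bmu_index, minlength=rows * cols)
    return counts.reshape(rows, cols)

# Hypothetical BMU index for seven samples on a 2x3 map.
bmu_index = np.array([0, 0, 0, 1, 5, 5, 3])
hits = hit_counts(bmu_index, rows=2, cols=3)
# Nodes backed by fewer than 2 samples: read their component values with care.
sparse = hits < 2
print(hits)
print(sparse)
```

A mask such as `sparse` can be used to flag the regions of a component map that rest on little data.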
The clustering map can help identify the different behaviors represented in the component maps.
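One common way to cluster a SOM is to run k-means on the node weight vectors and let every sample inherit the cluster of its BMU. The sketch below assumes that approach and does not claim to reproduce what `sm.cluster` does internally. It uses a small NumPy k-means with a deterministic farthest-point initialization; the function name `kmeans` and all values are illustrative.

```python
import numpy as np

def kmeans(points, k, n_iter=20):
    """Plain k-means (Lloyd's algorithm) with farthest-point initialization."""
    centers = [points[0]]
    for _ in range(1, k):
        # Distance from every point to its nearest already-chosen center.
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[d.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return labels, centers

# Hypothetical node weights: two tight groups of three nodes each.
node_weights = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                         [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
node_labels, _ = kmeans(node_weights, k=2)
# Each sample inherits the cluster of its BMU (indices are made up).
bmu_index = np.array([0, 4, 5, 2])
sample_labels = node_labels[bmu_index]
print(node_labels, sample_labels)
```

Clustering the nodes rather than the raw samples keeps the result aligned with the map, so cluster borders can be read directly against the component maps.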