Self-Organizing Maps example - California Housing

In this document, the SOMPY library is used to provide an example of the Self-Organizing Maps (SOM) algorithm. The data used is the California Housing dataset, included in the scikit-learn library and loaded below.


In [1]:
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
from sompy.sompy import SOMFactory
from sklearn.datasets import fetch_california_housing



Data Loading

First of all, the data is loaded into the local environment as a NumPy array, with the target variable appended as an extra column.


In [2]:
data = fetch_california_housing()
descr = data.DESCR
names = data.feature_names + ["HouseValue"]

# Append the target (median house value) as the last column
data = np.column_stack([data.data, data.target])
print(descr)
print("FEATURES: ", ", ".join(names))


California housing dataset.

The original database is available from StatLib

    http://lib.stat.cmu.edu/datasets/

The data contains 20,640 observations on 9 variables.

This dataset contains the average house value as target variable
and the following input variables (features): average income,
housing average age, average rooms, average bedrooms, population,
average occupation, latitude, and longitude in that order.

References
----------

Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297.


FEATURES:  MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude, HouseValue

SOM Training

The SOM training consists of two phases: a rough phase and a fine-tuning phase. The parameters that can be configured in the training step are listed below (a fully specified call is sketched after the list):

  • The size of the map grid
  • The number of rough and fine-tuning iterations
  • The initial and final radii of the rough and fine-tuning phases
  • The initialization mechanism (random/pca)
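
For reference, a fully specified build/train call might look like the following sketch. It assumes the keyword names of SOMFactory.build and SOM.train in recent SOMPY versions (mapsize, train_rough_radiusin, train_rough_radiusfin, etc.); the map size, iteration counts and radius values are purely illustrative, not recommendations.

# Illustrative fully-specified configuration (all numeric values are examples only)
sm_example = SOMFactory().build(data, mapsize=(30, 30), normalization='var',
                                initialization='pca', component_names=names)
sm_example.train(n_job=1, verbose=False,
                 train_rough_len=10, train_rough_radiusin=5, train_rough_radiusfin=1,
                 train_finetune_len=20, train_finetune_radiusin=1, train_finetune_radiusfin=0.1)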

For the current example, only the rough/fine-tuning iteration counts and the initialization mechanism have been set explicitly. The other parameters have been left unspecified so that the algorithm chooses them automatically.

For quantifying the error of the approximation, two metrics are computed (a standalone sketch of both follows the list):

  • The quantization error: the average distance between each data vector and its best-matching unit (BMU).
  • The topographic error: the proportion of all data vectors for which the first and second BMUs are not adjacent units.
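
As a reference, both metrics can be computed directly from a trained map's codebook. The function below is a minimal, illustrative sketch, assuming a dense (n_units, n_features) codebook array laid out row-major over a rectangular grid and an 8-neighbor adjacency criterion; SOMPY's internal definitions may differ in detail, and the dense distance matrix is not memory-efficient for large datasets.

def som_errors(data, codebook, grid_shape):
    # Distance from every sample to every map unit: shape (n_samples, n_units)
    dists = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
    # Quantization error: mean distance to the best-matching unit (BMU)
    qe = dists.min(axis=1).mean()
    # Topographic error: fraction of samples whose two closest units
    # are not neighbors on the grid
    best2 = np.argsort(dists, axis=1)[:, :2]
    rows, cols = np.unravel_index(best2, grid_shape)
    adjacent = (np.abs(np.diff(rows, axis=1)) <= 1) & (np.abs(np.diff(cols, axis=1)) <= 1)
    te = 1.0 - adjacent.mean()
    return qe, te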

A rule of thumb is to generate several models with different parameters and choose the one which, among those with a topographic error very close to zero, has the lowest quantization error. It is important to keep the topographic error very low in order to make the component planes smooth and easy to understand. A possible selection loop is sketched below.
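
Following that rule of thumb, a small model-selection loop might look like this sketch. The candidate map sizes and the 0.01 topographic-error threshold are arbitrary, illustrative choices.

candidates = []
for mapsize in [(20, 20), (30, 30), (40, 40)]:  # illustrative sizes
    sm_c = SOMFactory().build(data, mapsize=mapsize, normalization='var',
                              initialization='random', component_names=names)
    sm_c.train(n_job=1, verbose=False, train_rough_len=2, train_finetune_len=5)
    te = sm_c.calculate_topographic_error()
    qe = np.mean(sm_c._bmu[1])
    candidates.append((te, qe, sm_c))

# Among models with near-zero topographic error, keep the lowest quantization error
good = [c for c in candidates if c[0] < 0.01] or candidates
best_te, best_qe, best_sm = min(good, key=lambda c: c[1])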


In [3]:
sm = SOMFactory().build(data, normalization='var', initialization='random', component_names=names)
sm.train(n_job=1, verbose=False, train_rough_len=2, train_finetune_len=5)

In [4]:
topographic_error = sm.calculate_topographic_error()
quantization_error = np.mean(sm._bmu[1])
print("Topographic error = %s; Quantization error = %s" % (topographic_error, quantization_error))


Topographic error = 0.0013565891472868217; Quantization error = 0.8164994471732518

Visualization

Components map

Each component plane below shows how one variable is distributed across the map units, so correlations between variables can be inspected visually.


In [5]:
from sompy.visualization.mapview import View2D
view2D = View2D(10, 10, "California Housing", text_size=10)
view2D.show(sm, col_sz=4, which_dim="all", denormalize=True)


Hits map

The hits map shows how many data samples are mapped to each unit, i.e. the density of instances across the map.


In [6]:
from sompy.visualization.bmuhits import BmuHitsView

vhts = BmuHitsView(10, 10, "Hits Map", text_size=7)
# Note: "anotate" and "logaritmic" are the parameter spellings used by SOMPY itself
vhts.show(sm, anotate=True, onlyzeros=False, labelsize=12, cmap="Greys", logaritmic=False)


K-Means clustering

K-Means is applied to the units' codebook vectors to segment the map into regions of similar behavior.


In [7]:
from sompy.visualization.hitmap import HitMapView
sm.cluster(4)  # K-Means with 4 clusters over the codebook vectors
hits = HitMapView(10, 10, "Clustering", text_size=7)
a = hits.show(sm, labelsize=12)
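
Once the map units are clustered, each original sample can be assigned to a cluster through its BMU. A minimal sketch, assuming (as in recent SOMPY versions) that project_data returns the BMU index of each sample and that cluster() stores the unit labels in sm.cluster_labels:

# Map every sample to the cluster of its best-matching unit
bmu_index = sm.project_data(data)
sample_clusters = sm.cluster_labels[bmu_index]
print(np.bincount(sample_clusters))  # number of samples per cluster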



Conclusions

From the visualizations above, several conclusions can be drawn:

  • Houses with more bedrooms on average are generally located in lower average income areas.
  • High average occupancy occurs only in areas where the population is high.
  • The latitude and longitude of the samples have a strong negative correlation, likely because California is oriented diagonally with respect to the coordinate system.
  • The most heavily occupied houses (AveOccup) are placed around latitude 37~38 and longitude -121.6~-121, i.e. near the San Francisco area.
  • Old houses tend to have fewer rooms and bedrooms on average.
  • Low average income areas usually have fewer rooms and bedrooms than high average income ones.
  • The house value seems to be related to the average income of the area where it is located.

It is important to note that some areas of the map have a lower density of instances than others. This density is shown by the hits map and should be taken into consideration when interpreting the components map.

The clustering map can help identify the different behaviors represented in the components map.