This notebook is a brief guide to using Self-Organizing Maps (SOMs) with the SOMPY library in Python. We will use a hexagonal lattice in this example in order to understand the main causes of flight cancellations.
The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on the BTS website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.
This version of the dataset was compiled for the ASA Statistical Computing and Statistical Graphics 2009 Data Expo and is also available from other public sources.
Field descriptions
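The columns used in this notebook, as documented in the Data Expo 2009 dataset:

- Month, DayofMonth, DayOfWeek: date of the flight (DayOfWeek: 1 = Monday, ..., 7 = Sunday)
- DepTime: actual departure time (local, hhmm)
- AirTime: time spent in the air, in minutes
- Distance: distance between origin and destination, in miles
- DepDelay, ArrDelay: departure and arrival delays, in minutes
- CarrierDelay, WeatherDelay, NASDelay, SecurityDelay, LateAircraftDelay: portion of the delay attributed to each cause, in minutes
- Cancelled: whether the flight was cancelled (1 = yes, 0 = no)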
In [1]:
%matplotlib inline
import glob
import logging
import random

import joblib  # sklearn.externals.joblib is deprecated; use the standalone joblib package
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from sompy.sompy import SOMFactory
from sompy.visualization.plot_tools import plot_hex_map
In [2]:
df = pd.read_csv("./DelayedFlights.csv")
df = df[["Month", "DayofMonth", "DayOfWeek", "DepTime", "AirTime",
         "Distance", "SecurityDelay", "WeatherDelay", "NASDelay", "CarrierDelay",
         "ArrDelay", "DepDelay", "LateAircraftDelay", "Cancelled"]]

# Variables the SOM will be trained on
clustering_vars = ["Month", "DayofMonth", "DepTime", "AirTime",
                   "LateAircraftDelay", "DepDelay", "ArrDelay", "CarrierDelay"]

df = df.fillna(0)
data = df[clustering_vars].values
names = clustering_vars
In [3]:
df.describe()
Out[3]:
Since the dataset is fairly large, the models take some time to train. We did not fine-tune the hyperparameters of the algorithm, so this is a potential area for improvement.
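For reference, below is a minimal sketch of a training loop that could produce the "./model*" files loaded in the next cell, using SOMPY's SOMFactory (imported above). The map-size range, training lengths and file names are illustrative assumptions, not the values actually used.
In [ ]:
# Illustrative sketch (not the original training run): train a small pool of
# candidate SOMs with random hexagonal map sizes and persist each one, so the
# selection step below can compare their topographic/quantization errors.
for i in range(5):
    sm = SOMFactory().build(data,
                            mapsize=[random.randint(15, 25), random.randint(15, 25)],
                            normalization="var", initialization="pca",
                            lattice="hexa", component_names=names)
    sm.train(n_job=4, verbose=None, train_rough_len=30, train_finetune_len=100)
    joblib.dump(sm, "model_{}.joblib".format(i))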
In [ ]:
# Study the models trained and plot the errors obtained in order to select the best one
models_pool = glob.glob("./model*")
errors = []
for model_filepath in models_pool:
    sm = joblib.load(model_filepath)
    topographic_error = sm.calculate_topographic_error()
    quantization_error = sm.calculate_quantization_error()
    errors.append((topographic_error, quantization_error))
e_top, e_q = zip(*errors)
In [11]:
plt.scatter(e_top, e_q)
plt.xlabel("Topographic error")
plt.ylabel("Quantization error")
plt.show()
In [5]:
# Manually select the model with the best trade-off. In this case, model #3 has been selected:
# the quantization error is concentrated in a narrow band (roughly 34-40) while the topographic
# error varies much more, so the model with the lowest topographic error was chosen. Keeping the
# topographic error as low as possible is very important to ensure correct prototyping.
selected_model = 3
sm = joblib.load(models_pool[selected_model])
topographic_error = sm.calculate_topographic_error()
quantization_error = sm.calculate_quantization_error()
print("Topographic error = %s\nQuantization error = %s" % (topographic_error, quantization_error))
The components map shows the value of each variable at each prototype and allows us to uncover non-linear relationships between variables. We represent two types of component maps.
If the quantization error is not very high, and a proper visual assessment confirms that the prototype-based and empirical maps look very alike, the prototype visualization can be used as the final product, since it is much more visually appealing.
In [6]:
from sompy.visualization.mapview import View2D
view2D = View2D(10,10,"", text_size=7)
view2D.show(sm, col_sz=5, which_dim="all", denormalize=True)
plt.show()
In [7]:
# Add some exogenous variables to the map: project every instance onto its
# best-matching unit (BMU) and average each variable per unit
exogenous_vars = [c for c in df.columns if c not in clustering_vars + ["Cancelled", "bmus"]]
df["bmus"] = sm.project_data(data)
df = df[clustering_vars + exogenous_vars + ["Cancelled"] + ["bmus"]]
empirical_codebook = df.groupby("bmus").mean().values
matplotlib.rcParams.update({'font.size': 10})
plot_hex_map(empirical_codebook.reshape(sm.codebook.mapsize + [empirical_codebook.shape[-1]]),
             titles=df.columns[:-1], shape=[4, 5], colormap=None)
plt.show()
This visualization is very important because it shows how the instances are spread across the hexagonal lattice. The more instances that fall into a cell, the more flights that prototype represents, and hence the more weight we should give it in the analysis.
In [8]:
from sompy.visualization.bmuhits import BmuHitsView
#sm.codebook.lattice="rect"
vhts = BmuHitsView(12,12,"Hits Map",text_size=7)
vhts.show(sm, anotate=True, onlyzeros=False, labelsize=7, cmap="autumn", logaritmic=False)
plt.show()
This visualization helps us focus on groups of units that share similar characteristics.
In [9]:
from sompy.visualization.hitmap import HitMapView

# Cluster the prototypes (k-means on the codebook) into 4 groups
sm.cluster(4)
hits = HitMapView(12, 12, "Clustering", text_size=10, cmap=plt.cm.jet)
a = hits.show(sm, anotate=True, onlyzeros=False, labelsize=7, cmap="Pastel1")
plt.show()
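As a follow-up sketch (our own addition, not part of the original analysis): after sm.cluster() runs, SOMPY stores the per-prototype labels in sm.cluster_labels, so each flight can be tagged with the cluster of its BMU, e.g. to compare cancellation rates across clusters.
In [ ]:
# Sketch: attach each flight's cluster via its BMU and compare cancellation
# rates per cluster. Assumes df["bmus"] was computed above with project_data.
df["cluster"] = sm.cluster_labels[df["bmus"].astype(int)]
print(df.groupby("cluster")["Cancelled"].mean())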
Looking at the components map and comparing each component with the cancellation component, a set of conclusions can be extracted. The main ones are summarized below.
Apart from these, secondary conclusions can be extracted by comparing all the components with one another. A couple of examples are summarized below.