A common assumption in pattern recognition is that all classes are known in advance. This assumption does not hold in many real-world applications, where classes outside the current model do exist.
In the context of kernel densities, we hypothesize that a novel class is one whose samples occur in regions of the feature space that have low density under each known class-conditional density, but high density under the overall population density.
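As a rough illustration of this hypothesis (a toy sketch only; all log densities below are made up, and the real estimates appear later in the notebook):
In [ ]:
# Hedged toy sketch of the novelty criterion described above; the numbers are invented.
import numpy as np

log_p_overall = np.array([-2.0, -3.5, -2.2])   # log density under the pooled population
log_p_class = np.array([[-3.0, -2.1, -9.0],    # log density under known class A
                        [-8.0, -2.5, -8.5]])   # log density under known class B

# Compare each point's best class-conditional fit with the overall density.
# A strongly negative value means no known class explains the point even though
# the population density there is high, i.e. a candidate novel class.
score = log_p_class.max(axis=0) - log_p_overall
print(score)  # the third point stands out as a novelty candidate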
In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.neighbors import KernelDensity
%matplotlib inline
Load sample glass data.
In [2]:
# Glass data: first column is the class label, the remaining 8 columns are numeric features.
data = pd.read_csv("../data/glass.csv", index_col=False, names=["class"] + list(range(8)))
data_features = list(range(8))
classes = np.unique(data["class"])
data.head()
Out[2]:
Read SDSS data, preprocessed into colour indices with reddening correction.
In [3]:
# data = pd.read_hdf('../data/sdss.h5', 'sdss')
# data.head()
Use the same features as reported in Alasdair Tran's 2015 Honours thesis.
In [4]:
# target_col = 'class'
# data_features = ['psfMag_r_w14', 'psf_u_g_w14', 'psf_g_r_w14', 'psf_r_i_w14',
# 'psf_i_z_w14', 'petroMag_r_w14', 'petro_u_g_w14', 'petro_g_r_w14',
# 'petro_r_i_w14', 'petro_i_z_w14', 'petroRad_r']
In [5]:
# h = 1/np.sqrt(0.02)  # Bandwidth coming from Alasdair's SVM experiments
def percentile_pairwise_distance(X, Y=None):
    """Heuristic bandwidth: the 20th percentile of all pairwise distances."""
    if Y is None:
        Y = X
    distances = metrics.pairwise_distances(X, Y)
    return np.percentile(distances, 20)

h = percentile_pairwise_distance(data[data_features].values)
print("Bandwidth:", h)
(TODO) Define the training, validation, and test sets, and select appropriate Gaussian kernel bandwidth. Use sklearn's grid search to find a good bandwidth.
In [6]:
num_data = len(data)
idx_all = np.random.permutation(num_data)
num_train = int(np.floor(0.7*num_data))
idx_train = idx_all[:num_train]
idx_test = idx_all[num_train:]
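The TODO above asks for a grid search over the bandwidth; a minimal sketch using sklearn's GridSearchCV follows (the candidate grid around the heuristic h and the 5-fold cross-validation are assumptions). The cells below keep the heuristic bandwidth.
In [ ]:
# Hedged sketch of bandwidth selection by cross-validated grid search.
# The candidate grid (one decade around the heuristic h) and cv=5 are assumptions.
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(KernelDensity(kernel='gaussian', rtol=1e-5),
                    {'bandwidth': h * np.logspace(-1, 1, 10)},
                    cv=5)
grid.fit(data[data_features].values[idx_train])
print("Cross-validated bandwidth:", grid.best_params_['bandwidth'])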
Fit a kernel density estimator on the training set.
In [7]:
kde = KernelDensity(kernel='gaussian', bandwidth=h, rtol=1e-5)
Xtrain = data[data_features].iloc[idx_train]
kde.fit(Xtrain)
Out[7]:
Use the fitted model to compute the log density of every item in the test set.
In [8]:
Xtest = data[data_features].iloc[idx_test]
pred = kde.score_samples(Xtest)
Choose an appropriate threshold for identifying outliers
In [9]:
_ = plt.hist(pred, bins=50)
In [10]:
idx_sort = np.argsort(pred)
pred[idx_sort[:10]]
Out[10]:
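Rather than reading a cut-off from the histogram or the sorted scores, the threshold could be tied to a target contamination rate; a sketch assuming a 1% rate, as an alternative to the hard-coded value used in the next cell:
In [ ]:
# Hedged sketch: flag the lowest 1% of test-set log densities as outliers.
# The 1% contamination rate is an assumption.
threshold = np.percentile(pred, 1)
print("Log-density threshold:", threshold)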
Identify the outliers in the dataset. (TODO) Export or visualise appropriately for getting feedback from the astronomers.
In [11]:
idx_outlier = idx_test[np.where(pred < -7)]
data.iloc[idx_outlier]
Out[11]:
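For the export part of the TODO above, one hedged option is to write the flagged rows and their log densities to a CSV for the astronomers to review (the output path is an assumption):
In [ ]:
# Hedged sketch: export the flagged rows together with their log densities.
# The output path is an assumption.
outliers = data.iloc[idx_outlier].copy()
outliers["log_density"] = pred[pred < -7]
outliers.to_csv("../data/glass_outliers.csv")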
Calculate class-specific densities
In [12]:
# Fit one KDE per known class, using only the training rows of that class.
ytrain = data["class"].iloc[idx_train]
densities = {}
for cl in classes:
    Xtrain_cl = Xtrain[ytrain == cl]
    densities[cl] = KernelDensity(kernel='gaussian', bandwidth=h, rtol=1e-5)
    densities[cl].fit(Xtrain_cl)
In [13]:
# For each class, convert the class-conditional log density into a log-density
# ratio against the overall density: log p(x | class) - log p(x).
class_pred = {}
for cl in classes:
    class_pred[cl] = densities[cl].score_samples(Xtest)
    class_pred[cl] -= pred
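Tying this back to the opening hypothesis, a minimal sketch of a combined novelty flag: a test point is a candidate member of a novel class if even its best-matching known class explains it poorly relative to the overall density (the -5 cut-off is an assumption):
In [ ]:
# Hedged sketch: combine the per-class log-density ratios into a single novelty flag.
# The -5 cut-off on the best ratio is an assumption.
log_ratio = np.vstack([class_pred[cl] for cl in classes])  # shape (n_classes, n_test)
best_ratio = log_ratio.max(axis=0)
idx_novel = idx_test[best_ratio < -5]
data.iloc[idx_novel]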
In [14]:
fig = plt.figure(figsize=(16, 10))
# One histogram of the log-density ratio per known class
# (the glass data has classes 1, 2, 3, 5, 6 and 7).
for i, cl in enumerate(classes):
    ax = fig.add_subplot(2, 3, i + 1)
    _ = ax.hist(class_pred[cl], 30)
    ax.set_title("class {}".format(cl))
In [ ]: