We will demonstrate how to use the `contamination` parameter to get predicted labels out of an Isolation Forest model. We will also experiment with a new feature added to the algorithm: using a validation frame to check the quality of the model and to perform early stopping.
Using a validation frame is still an experimental feature in active development. Please send us your feedback and we will try to incorporate your comments!
Please also be careful about drawing conclusions from experiments with this notebook. The data generation is intentionally not seeded, and we encourage you to rerun the experiments several times to see how different random inputs change the performance of the model.
In [1]:
import sys
import h2o
from h2o.frame import H2OFrame
import numpy as np
import pandas as pd
In [2]:
h2o.init(strict_version_check=False)
In [3]:
N = 1000
cont = 0.05 # ratio of outliers/anomalies
In [4]:
regular_data = np.random.normal(0, 0.5, (int(N*(1-cont)), 2))
anomaly_data = np.column_stack((np.random.normal(-1.5, 1, int(N*cont)), np.random.normal(1.5, 1, int(N*cont))))
In [5]:
import matplotlib.pyplot as plt
In [6]:
plt.scatter(anomaly_data[:,0], anomaly_data[:,1])
plt.scatter(regular_data[:,0], regular_data[:,1])
plt.show()
In [7]:
regular_pd = pd.DataFrame({'x': regular_data[:, 0], 'y': regular_data[:, 1], 'label': np.zeros(regular_data.shape[0])})
anomaly_pd = pd.DataFrame({'x': anomaly_data[:, 0], 'y': anomaly_data[:, 1], 'label': np.ones(anomaly_data.shape[0])})
In [8]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
dataset = H2OFrame(pd.concat([regular_pd, anomaly_pd]).sample(frac=1))
In [9]:
train_with_label, test = dataset.split_frame([0.8])
In [10]:
train_with_label["label"].table()
Out[10]:
In [11]:
test["label"].table()
Out[11]:
In [12]:
train = train_with_label.drop(["label"])
test["label"] = test["label"].asfactor()
In [13]:
from h2o.estimators.isolation_forest import H2OIsolationForestEstimator
from h2o.model.metrics_base import H2OAnomalyDetectionModelMetrics, H2OBinomialModelMetrics
We will use a validation frame and enable early stopping. The observations in the validation frame are labeled: anomalies/outliers are marked with label "1" and regular observations with "0". This lets the model calculate binomial classification metrics on the validation frame and stop early based on the performance observed on the validation data.
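To make the stopping metric concrete, here is a minimal sketch of how mean per-class error can be computed for binary labels. This is an illustration in plain Python, not H2O's implementation; the function name and the toy labels are hypothetical.

```python
def mean_per_class_error(actual, predicted):
    """Average of the per-class error rates for labels 0 (regular) and 1 (anomaly)."""
    errors = []
    for cls in (0, 1):
        # Indices of the observations whose true label is `cls`
        idx = [i for i, a in enumerate(actual) if a == cls]
        if not idx:
            continue  # skip a class that does not occur in the data
        wrong = sum(1 for i in idx if predicted[i] != cls)
        errors.append(wrong / len(idx))
    return sum(errors) / len(errors)

# Toy example: 8 regular observations and 2 anomalies (hypothetical labels)
actual = [0] * 8 + [1] * 2
predicted = [0] * 7 + [1] + [1, 0]

mpce = mean_per_class_error(actual, predicted)
plain_error = sum(a != p for a, p in zip(actual, predicted)) / len(actual)
print(mpce, plain_error)
```

Because each class contributes equally to the average, a missed anomaly weighs much more than a misclassified regular point, which is why this kind of metric is a reasonable choice when anomalies are only a small fraction of the data.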
In [14]:
if_model = H2OIsolationForestEstimator(seed=12, ntrees=200,
                                       score_tree_interval=7, stopping_rounds=3,
                                       stopping_metric="mean_per_class_error",
                                       validation_response_column="label")
if_model.train(training_frame=train, validation_frame=test)
The trained model will have different kinds of metrics for the training and validation frames. For the training frame, where we don't have labeled data, anomaly metrics will be returned. For the validation frame we will see binomial model metrics.
In [15]:
if_model
Out[15]:
In [16]:
predicted = if_model.predict(train)
predicted.head()
Out[16]:
The output includes the predicted class of each observation (not anomaly/anomaly). This is made possible by the validation frame: in the current implementation, we pick the threshold that maximizes the F1 score.
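The idea of picking a threshold that maximizes F1 can be sketched in a few lines of plain Python. This is only an illustration of the technique under simple assumptions (every observed score is tried as a candidate cutoff, and higher scores mean "more anomalous"); it is not H2O's code, and the helper names and toy scores are hypothetical.

```python
def f1_score(actual, predicted):
    """F1 for the positive (anomaly) class, computed from TP/FP/FN counts."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def best_f1_threshold(scores, actual):
    """Try each distinct score as a cutoff and keep the one with the highest F1."""
    best_threshold, best_f1 = None, -1.0
    for t in sorted(set(scores)):
        # Observations scoring at or above the cutoff are labeled anomalies
        predicted = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(actual, predicted)
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1

# Hypothetical anomaly scores with known labels (1 = anomaly)
scores = [0.10, 0.20, 0.25, 0.30, 0.35, 0.60, 0.80, 0.90]
actual = [0, 0, 0, 0, 1, 0, 1, 1]

threshold, f1 = best_f1_threshold(scores, actual)
print(threshold, f1)
```

In this toy data the best cutoff accepts one false positive in order to catch all three anomalies, which shows how F1 trades precision against recall.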
In [17]:
predicted_train_labels = predicted["predict"].as_data_frame(use_pandas=True)
train_pd = train.as_data_frame(use_pandas=True)
In [18]:
plt.scatter(train_pd["x"], train_pd["y"], c=predicted_train_labels["predict"])
plt.show()
In [19]:
if_model.model_performance(train_with_label)
Out[19]:
In [20]:
if_model_cont = H2OIsolationForestEstimator(seed=12, contamination=cont)
if_model_cont.train(training_frame=train)
if_model_cont
Out[20]:
In [21]:
predicted_train_labels_cont = if_model_cont.predict(train)["predict"].as_data_frame(use_pandas=True)
In [22]:
plt.scatter(train_pd["x"], train_pd["y"], c=predicted_train_labels_cont["predict"])
plt.show()
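One way to think about the `contamination` parameter used in the second model is as a cutoff on the anomaly scores: the given fraction of highest-scoring observations is labeled as anomalies. The sketch below illustrates that interpretation in plain Python. It is an assumption made for illustration and not a description of H2O's internals; the function name and scores are hypothetical.

```python
def label_by_contamination(scores, contamination):
    """Label the top `contamination` fraction of scores as anomalies (1)."""
    n_anomalies = int(round(len(scores) * contamination))
    if n_anomalies == 0:
        return [0] * len(scores)
    # The score of the n-th highest-scoring observation becomes the cutoff.
    # With tied scores at the cutoff, slightly more points may be labeled.
    cutoff = sorted(scores, reverse=True)[n_anomalies - 1]
    return [1 if s >= cutoff else 0 for s in scores]

# 20 hypothetical, distinct anomaly scores
scores = [0.05 * i for i in range(20)]
labels = label_by_contamination(scores, 0.10)
print(sum(labels), labels[-2:])
```

Unlike the validation-frame approach, this requires no labeled data, but the quality of the labels depends on how well the chosen ratio matches the true share of anomalies.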