New features in Isolation Forest: contamination ratio and using validation frame

We will demonstrate how to use contamination parameter to get predicted labels out of an Isolation Forest model. We will also experiment with a new feature added the algorithm - using a validation frame to check quality of the model and do early stopping.

Using validation frame is still an experimental feauture in active development. Please send us your feedback and we will try to incorporate your comments!

Please also be careful about your conclusions based on experiments with this notebook. The notebook is intentionally not seeded and we encourage you to rerun the experiments several times to see how different random inputs changes the performance of the model.


In [1]:
import sys
import h2o
from h2o.frame import H2OFrame
import numpy as np
import pandas as pd

In [2]:
h2o.init(strict_version_check=False)


Checking whether there is an H2O instance running at http://localhost:54321 . connected.
H2O_cluster_uptime: 52 mins 37 secs
H2O_cluster_timezone: America/New_York
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.31.0.99999
H2O_cluster_version_age: 53 minutes
H2O_cluster_name: mkurka
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.276 Gb
H2O_cluster_total_cores: 8
H2O_cluster_allowed_cores: 8
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
H2O_API_Extensions: Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version: 2.7.14 final

Generate some synthetic data


In [3]:
N = 1000
cont = 0.05 # ratio of outliers/anomalies

In [4]:
regular_data = np.random.normal(0, 0.5, (int(N*(1-cont)), 2))
anomaly_data = np.column_stack((np.random.normal(-1.5, 1, int(N*cont)), np.random.normal(1.5, 1, int(N*cont))))

In [5]:
import matplotlib.pyplot as plt

In [6]:
plt.scatter(anomaly_data[:,0], anomaly_data[:,1])
plt.scatter(regular_data[:,0], regular_data[:,1])
plt.show()



In [7]:
regular_pd = pd.DataFrame({'x': regular_data[:, 0], 'y': regular_data[:, 1], 'label': np.zeros(regular_data.shape[0])})
anomaly_pd = pd.DataFrame({'x': anomaly_data[:, 0], 'y': anomaly_data[:, 1], 'label': np.ones(anomaly_data.shape[0])})

In [8]:
dataset = H2OFrame(regular_pd.append(anomaly_pd).sample(frac=1))


Parse progress: |█████████████████████████████████████████████████████████| 100%

In [9]:
train_with_label, test = dataset.split_frame([0.8])

In [10]:
train_with_label["label"].table()


label Count
0 756
1 39
Out[10]:


In [11]:
test["label"].table()


label Count
0 194
1 11
Out[11]:


In [12]:
train = train_with_label.drop(["label"])
test["label"] = test["label"].asfactor()

In [13]:
from h2o.estimators.isolation_forest import H2OIsolationForestEstimator
from h2o.model.metrics_base import H2OAnomalyDetectionModelMetrics, H2OBinomialModelMetrics

Train Isolation Forest with a validation set

We will use validation frame and enable early stopping. The observations of validation frame are labeled and anomalies/outliers are marked with label "1", regular observations with "0". This lets us use binomial classification metrics to do early stopping. The model will calculate binomial metrics on the validation frame and use early stopping based on the performance observed on the validation data.


In [14]:
if_model = H2OIsolationForestEstimator(seed=12, ntrees=200,
                                       score_tree_interval=7, stopping_rounds=3, stopping_metric="mean_per_class_error",
                                       validation_response_column="label")
if_model.train(training_frame=train, validation_frame=test)


isolationforest Model Build progress: |███████████████████████████████████| 100%

The trained model will have different kind of metrics for training and validation frame. For training - where we don't have labeled data - anomaly metrics will be returned. For validation frame we will see binomial model metrics.


In [15]:
if_model


Model Details
=============
H2OIsolationForestEstimator :  Isolation Forest
Model Key:  IsolationForest_model_python_1593529508000_107


Model Summary: 
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
0 56.0 56.0 39085.0 8.0 8.0 8.0 25.0 78.0 50.767857

ModelMetricsAnomaly: isolationforest
** Reported on train data. **

Anomaly Score: 6.77065108782
Normalized Anomaly Score: 0.0467037784801

ModelMetricsBinomial: isolationforest
** Reported on validation data. **

MSE: 0.0547974682524
RMSE: 0.234088590607
LogLoss: 0.188428153792
Mean Per-Class Error: 0.157919400187
AUC: 0.832474226804
AUCPR: 0.256449326499
Gini: 0.664948453608

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.145454545455: 
0 1 Error Rate
0 0 184.0 10.0 0.0515 (10.0/194.0)
1 1 4.0 7.0 0.3636 (4.0/11.0)
2 Total 188.0 17.0 0.0683 (14.0/205.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
0 max f1 0.145455 0.500000 13.0
1 max f2 0.145455 0.573770 13.0
2 max f0point5 0.167273 0.476190 10.0
3 max accuracy 0.778182 0.941463 0.0
4 max precision 0.167273 0.461538 10.0
5 max recall 0.003636 1.000000 46.0
6 max specificity 0.778182 0.994845 0.0
7 max absolute_mcc 0.145455 0.477875 13.0
8 max min_per_class_accuracy 0.080000 0.818182 25.0
9 max mean_per_class_accuracy 0.080000 0.842081 25.0
10 max tns 0.778182 193.000000 0.0
11 max fns 0.778182 11.000000 0.0
12 max fps 0.000000 194.000000 47.0
13 max tps 0.003636 11.000000 46.0
14 max tnr 0.778182 0.994845 0.0
15 max fnr 0.778182 1.000000 0.0
16 max fpr 0.000000 1.000000 47.0
17 max tpr 0.003636 1.000000 46.0
Scoring History: 
timestamp duration number_of_trees mean_tree_path_length mean_anomaly_score
0 2020-06-30 11:57:47 0.001 sec 0.0 NaN NaN
1 2020-06-30 11:57:47 0.031 sec 7.0 6.777928 0.044414
2 2020-06-30 11:57:47 0.062 sec 14.0 6.764469 0.048492
3 2020-06-30 11:57:47 0.097 sec 21.0 6.769009 0.047557
4 2020-06-30 11:57:47 0.124 sec 28.0 6.763426 0.049805
5 2020-06-30 11:57:47 0.152 sec 35.0 6.770500 0.045900
6 2020-06-30 11:57:47 0.179 sec 42.0 6.772585 0.045267
7 2020-06-30 11:57:47 0.211 sec 49.0 6.770547 0.046268
8 2020-06-30 11:57:47 0.229 sec 56.0 6.770651 0.046704
Out[15]:


In [16]:
predicted = if_model.predict(train)
predicted.head()


isolationforest prediction progress: |████████████████████████████████████| 100%
predict score mean_length
00.123636 6.39286
00 7
00.0145455 6.92857
10.421818 4.92857
00.0509091 6.75
00.0181818 6.91071
00 7
00.00727273 6.96429
00.134545 6.33929
00 7
Out[16]:

The output incudes predicted class of the observation not anomaly/anomaly. This is accomplished by using the validation frame. In current implementation we pick the threshold to maximize the F1 score.


In [17]:
predicted_train_labels = predicted["predict"].as_data_frame(use_pandas=True)
train_pd = train.as_data_frame(use_pandas=True)

In [18]:
plt.scatter(train_pd["x"], train_pd["y"], c=predicted_train_labels["predict"])
plt.show()



In [19]:
if_model.model_performance(train_with_label)


ModelMetricsBinomial: isolationforest
** Reported on test data. **

MSE: 0.0436448796715
RMSE: 0.20891356986
LogLoss: 0.138389352082
Mean Per-Class Error: 0.105667480667
AUC: 0.936440103107
AUCPR: 0.348593089161
Gini: 0.872880206214

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.0909090909091: 
0 1 Error Rate
0 0 692.0 64.0 0.0847 (64.0/756.0)
1 1 8.0 31.0 0.2051 (8.0/39.0)
2 Total 700.0 95.0 0.0906 (72.0/795.0)
Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
0 max f1 0.090909 0.462687 53.0
1 max f2 0.076364 0.629771 56.0
2 max f0point5 0.138182 0.406360 41.0
3 max accuracy 0.927273 0.953459 1.0
4 max precision 1.000000 1.000000 0.0
5 max recall 0.018182 1.000000 72.0
6 max specificity 1.000000 1.000000 0.0
7 max absolute_mcc 0.076364 0.476273 56.0
8 max min_per_class_accuracy 0.065455 0.871795 59.0
9 max mean_per_class_accuracy 0.043636 0.894333 65.0
10 max tns 1.000000 756.000000 0.0
11 max fns 1.000000 38.000000 0.0
12 max fps 0.000000 756.000000 77.0
13 max tps 0.018182 39.000000 72.0
14 max tnr 1.000000 1.000000 0.0
15 max fnr 1.000000 0.974359 0.0
16 max fpr 0.000000 1.000000 77.0
17 max tpr 0.018182 1.000000 72.0
Out[19]:

Train Isolation Forest using contamination parameter


In [20]:
if_model_cont = H2OIsolationForestEstimator(seed=12, contamination=cont)
if_model_cont.train(training_frame=train)
if_model_cont


/Users/mkurka/git/h2o/h2o-3/h2o-py/h2o/estimators/estimator_base.py:200: RuntimeWarning: Stopping tolerance is ignored for _stopping_rounds=0.
  warnings.warn(mesg["message"], RuntimeWarning)
isolationforest Model Build progress: |███████████████████████████████████| 100%
Model Details
=============
H2OIsolationForestEstimator :  Isolation Forest
Model Key:  IsolationForest_model_python_1593529508000_108


Model Summary: 
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
0 50.0 50.0 34937.0 8.0 8.0 8.0 25.0 78.0 50.84

ModelMetricsAnomaly: isolationforest
** Reported on train data. **

Anomaly Score: 6.77064148276
Normalized Anomaly Score: 0.0458717034484

Scoring History: 
timestamp duration number_of_trees mean_tree_path_length mean_anomaly_score
0 2020-06-30 11:57:48 0.003 sec 0.0 NaN NaN
1 2020-06-30 11:57:48 0.012 sec 1.0 6.864504 0.019357
2 2020-06-30 11:57:48 0.034 sec 2.0 6.851371 0.027023
3 2020-06-30 11:57:48 0.057 sec 3.0 6.820479 0.029920
4 2020-06-30 11:57:48 0.080 sec 4.0 6.805021 0.038996
5 2020-06-30 11:57:48 0.092 sec 5.0 6.807712 0.035609
6 2020-06-30 11:57:48 0.100 sec 6.0 6.783165 0.043367
7 2020-06-30 11:57:48 0.108 sec 7.0 6.777928 0.044414
8 2020-06-30 11:57:48 0.115 sec 8.0 6.776496 0.048325
9 2020-06-30 11:57:48 0.127 sec 9.0 6.782595 0.044469
10 2020-06-30 11:57:48 0.142 sec 10.0 6.783853 0.043229
11 2020-06-30 11:57:48 0.151 sec 11.0 6.783153 0.045006
12 2020-06-30 11:57:48 0.161 sec 12.0 6.785637 0.042873
13 2020-06-30 11:57:48 0.171 sec 13.0 6.773668 0.046703
14 2020-06-30 11:57:48 0.180 sec 14.0 6.764469 0.048492
15 2020-06-30 11:57:48 0.188 sec 15.0 6.766426 0.046715
16 2020-06-30 11:57:48 0.197 sec 16.0 6.769173 0.045039
17 2020-06-30 11:57:48 0.204 sec 17.0 6.763707 0.048397
18 2020-06-30 11:57:48 0.212 sec 18.0 6.768759 0.046248
19 2020-06-30 11:57:48 0.221 sec 19.0 6.769960 0.048030
See the whole table with table.as_data_frame()
Out[20]:


In [21]:
predicted_train_labels_cont = if_model_cont.predict(train)["predict"].as_data_frame(use_pandas=True)


isolationforest prediction progress: |████████████████████████████████████| 100%

In [22]:
plt.scatter(train_pd["x"], train_pd["y"], c=predicted_train_labels_cont["predict"])
plt.show()