New features in Isolation Forest: contamination ratio and using validation frame

We will demonstrate how to use the contamination parameter to get predicted labels out of an Isolation Forest model. We will also experiment with a new feature added to the algorithm: using a validation frame to check the quality of the model and to do early stopping.

Using a validation frame is still an experimental feature in active development. Please send us your feedback and we will try to incorporate your comments!

Please also be careful about the conclusions you draw from experiments with this notebook. The notebook is intentionally not seeded, and we encourage you to rerun the experiments several times to see how different random inputs change the performance of the model.



In [1]:

    
import sys
import h2o
from h2o.frame import H2OFrame
import numpy as np
import pandas as pd



In [2]:

    
h2o.init(strict_version_check=False)









    



Checking whether there is an H2O instance running at http://localhost:54321 . connected.






    




H2O_cluster_uptime:          52 mins 37 secs
H2O_cluster_timezone:        America/New_York
H2O_data_parsing_timezone:   UTC
H2O_cluster_version:         3.31.0.99999
H2O_cluster_version_age:     53 minutes
H2O_cluster_name:            mkurka
H2O_cluster_total_nodes:     1
H2O_cluster_free_memory:     3.276 Gb
H2O_cluster_total_cores:     8
H2O_cluster_allowed_cores:   8
H2O_cluster_status:          locked, healthy
H2O_connection_url:          http://localhost:54321
H2O_connection_proxy:        {"http": null, "https": null}
H2O_internal_security:       False
H2O_API_Extensions:          Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version:              2.7.14 final

Generate some synthetic data



In [3]:

    
N = 1000
cont = 0.05 # ratio of outliers/anomalies



In [4]:

    
regular_data = np.random.normal(0, 0.5, (int(N*(1-cont)), 2))
anomaly_data = np.column_stack((np.random.normal(-1.5, 1, int(N*cont)), np.random.normal(1.5, 1, int(N*cont))))



In [5]:

    
import matplotlib.pyplot as plt



In [6]:

    
plt.scatter(anomaly_data[:,0], anomaly_data[:,1])
plt.scatter(regular_data[:,0], regular_data[:,1])
plt.show()



In [7]:

    
regular_pd = pd.DataFrame({'x': regular_data[:, 0], 'y': regular_data[:, 1], 'label': np.zeros(regular_data.shape[0])})
anomaly_pd = pd.DataFrame({'x': anomaly_data[:, 0], 'y': anomaly_data[:, 1], 'label': np.ones(anomaly_data.shape[0])})



In [8]:

    
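# Combine both groups and shuffle the rows; pd.concat also works on pandas versions where DataFrame.append is no longer available
dataset = H2OFrame(pd.concat([regular_pd, anomaly_pd]).sample(frac=1))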









    



Parse progress: |█████████████████████████████████████████████████████████| 100%



In [9]:

    
train_with_label, test = dataset.split_frame([0.8])



In [10]:

    
train_with_label["label"].table()









    






  label   Count


      0     756
      1      39








    Out[10]:



In [11]:

    
test["label"].table()









    






  label   Count


      0     194
      1      11








    Out[11]:



In [12]:

    
train = train_with_label.drop(["label"])
test["label"] = test["label"].asfactor()



In [13]:

    
from h2o.estimators.isolation_forest import H2OIsolationForestEstimator
from h2o.model.metrics_base import H2OAnomalyDetectionModelMetrics, H2OBinomialModelMetrics

Train Isolation Forest with a validation set

We will use a validation frame and enable early stopping. The observations in the validation frame are labeled: anomalies/outliers are marked with label "1" and regular observations with "0". This lets us use binomial classification metrics for early stopping: the model calculates binomial metrics on the validation frame and stops early based on the performance observed on the validation data.
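To make the early-stopping idea more concrete, here is a minimal sketch in plain Python of a moving-average stopping rule. It is an illustration only: the function names, the exact comparison, and the metric values are assumptions made up for this example, and H2O's internal rule may differ in its details.

```python
def moving_average(values, start, k):
    """Average of `k` consecutive values beginning at index `start`."""
    return sum(values[start:start + k]) / float(k)


def should_stop(history, stopping_rounds):
    """Decide whether to stop, for a metric where lower is better.

    `history` holds one metric value per scoring event, for example the
    validation mean per-class error computed every `score_tree_interval` trees.
    """
    k = stopping_rounds
    if k <= 0 or len(history) < 2 * k:
        return False  # early stopping disabled, or not enough scoring events yet
    first = len(history) - 2 * k
    # Reference window: the oldest of the k+1 most recent moving averages.
    reference = moving_average(history, first, k)
    # The k newer moving averages that have to beat the reference.
    recent = [moving_average(history, first + i, k) for i in range(1, k + 1)]
    return min(recent) >= reference


# Hypothetical validation errors, one per scoring event (not taken from this notebook).
errors = [0.30, 0.25, 0.20, 0.20, 0.21, 0.22, 0.22, 0.23, 0.23]

stopped_at = None
for n_events in range(1, len(errors) + 1):
    if should_stop(errors[:n_events], stopping_rounds=3):
        stopped_at = n_events
        break

print("stopped after scoring event: %s" % stopped_at)
```

With `stopping_rounds=3` this sketch needs at least six scoring events before it can stop at all, which is why a small `score_tree_interval` makes early stopping react sooner.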



In [14]:

    
if_model = H2OIsolationForestEstimator(seed=12, ntrees=200,
                                       score_tree_interval=7, stopping_rounds=3, stopping_metric="mean_per_class_error",
                                       validation_response_column="label")
if_model.train(training_frame=train, validation_frame=test)









    



isolationforest Model Build progress: |███████████████████████████████████| 100%

The trained model will have different kinds of metrics for the training and validation frames. For training, where we don't have labeled data, anomaly metrics are returned. For the validation frame we will see binomial model metrics.



In [15]:

    
if_model









    



Model Details
=============
H2OIsolationForestEstimator :  Isolation Forest
Model Key:  IsolationForest_model_python_1593529508000_107


Model Summary: 

    number_of_trees  number_of_internal_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
0   56.0             56.0                      39085.0              8.0        8.0        8.0         25.0        78.0        50.767857








    



ModelMetricsAnomaly: isolationforest
** Reported on train data. **

Anomaly Score: 6.77065108782
Normalized Anomaly Score: 0.0467037784801

ModelMetricsBinomial: isolationforest
** Reported on validation data. **

MSE: 0.0547974682524
RMSE: 0.234088590607
LogLoss: 0.188428153792
Mean Per-Class Error: 0.157919400187
AUC: 0.832474226804
AUCPR: 0.256449326499
Gini: 0.664948453608

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.145454545455: 

        0      1     Error   Rate
0       184.0  10.0  0.0515  (10.0/194.0)
1       4.0    7.0   0.3636  (4.0/11.0)
Total   188.0  17.0  0.0683  (14.0/205.0)








    



Maximum Metrics: Maximum metrics at their respective thresholds

    metric                       threshold  value       idx
0   max f1                       0.145455   0.500000    13.0
1   max f2                       0.145455   0.573770    13.0
2   max f0point5                 0.167273   0.476190    10.0
3   max accuracy                 0.778182   0.941463    0.0
4   max precision                0.167273   0.461538    10.0
5   max recall                   0.003636   1.000000    46.0
6   max specificity              0.778182   0.994845    0.0
7   max absolute_mcc             0.145455   0.477875    13.0
8   max min_per_class_accuracy   0.080000   0.818182    25.0
9   max mean_per_class_accuracy  0.080000   0.842081    25.0
10  max tns                      0.778182   193.000000  0.0
11  max fns                      0.778182   11.000000   0.0
12  max fps                      0.000000   194.000000  47.0
13  max tps                      0.003636   11.000000   46.0
14  max tnr                      0.778182   0.994845    0.0
15  max fnr                      0.778182   1.000000    0.0
16  max fpr                      0.000000   1.000000    47.0
17  max tpr                      0.003636   1.000000    46.0








    



Scoring History: 

    timestamp            duration   number_of_trees  mean_tree_path_length  mean_anomaly_score
0   2020-06-30 11:57:47  0.001 sec  0.0              NaN                    NaN
1   2020-06-30 11:57:47  0.031 sec  7.0              6.777928               0.044414
2   2020-06-30 11:57:47  0.062 sec  14.0             6.764469               0.048492
3   2020-06-30 11:57:47  0.097 sec  21.0             6.769009               0.047557
4   2020-06-30 11:57:47  0.124 sec  28.0             6.763426               0.049805
5   2020-06-30 11:57:47  0.152 sec  35.0             6.770500               0.045900
6   2020-06-30 11:57:47  0.179 sec  42.0             6.772585               0.045267
7   2020-06-30 11:57:47  0.211 sec  49.0             6.770547               0.046268
8   2020-06-30 11:57:47  0.229 sec  56.0             6.770651               0.046704








    Out[15]:



In [16]:

    
predicted = if_model.predict(train)
predicted.head()









    



isolationforest prediction progress: |████████████████████████████████████| 100%






    






  predict  score       mean_length

        0  0.123636    6.39286
        0  0           7
        0  0.0145455   6.92857
        1  0.421818    4.92857
        0  0.0509091   6.75
        0  0.0181818   6.91071
        0  0           7
        0  0.00727273  6.96429
        0  0.134545    6.33929
        0  0           7








    Out[16]:
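The `score` column appears to be a min-max normalization of `mean_length`: the shorter the average path length, the closer the score gets to 1. The sketch below reproduces some of the rows shown above under that assumption. The maximum mean path length of 7 and the minimum of 117/56 (about 2.09) are not reported anywhere in the output; they were inferred by fitting the printed rows, so treat them as assumptions rather than documented values.

```python
def normalized_score(mean_length, min_length, max_length):
    """Min-max normalize a mean path length so that short paths score high."""
    return (max_length - mean_length) / (max_length - min_length)


# Assumed bounds, inferred from the printed predictions (not reported by the model).
MAX_LENGTH = 7.0
MIN_LENGTH = 117.0 / 56.0

# (mean_length, reported score) pairs copied from the prediction output.
rows = [
    (6.39286, 0.123636),
    (7.0, 0.0),
    (6.92857, 0.0145455),
    (4.92857, 0.421818),
    (6.75, 0.0509091),
]

computed_scores = []
for mean_length, reported in rows:
    computed = normalized_score(mean_length, MIN_LENGTH, MAX_LENGTH)
    computed_scores.append(computed)
    print("mean_length=%.5f reported=%.6f computed=%.6f" % (mean_length, reported, computed))
```

The one row labeled as an anomaly in this sample is also the one with the shortest mean path length, which matches the intuition that anomalies are isolated after fewer splits.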

The output includes the predicted class of each observation (not anomaly/anomaly). This is made possible by the validation frame: in the current implementation we pick the threshold that maximizes the F1 score.
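As a quick check on what "maximize the F1 score" means, the following sketch recomputes F1 and F2 from the validation confusion matrix reported for the model above (7 true positives, 10 false positives, 4 false negatives). The helper name `f_beta` is our own and is not part of the H2O API.

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score from confusion-matrix counts; beta=1 gives the usual F1."""
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    if denominator == 0:
        return 0.0
    return (1 + b2) * tp / float(denominator)


# Counts read off the validation confusion matrix at the max-F1 threshold.
tp, fp, fn = 7, 10, 4

precision = tp / float(tp + fp)
recall = tp / float(tp + fn)
f1 = f_beta(tp, fp, fn)
f2 = f_beta(tp, fp, fn, beta=2.0)

print("precision=%.6f recall=%.6f f1=%.6f f2=%.6f" % (precision, recall, f1, f2))
```

The resulting F1 of 0.5 and F2 of about 0.57377 agree with the `max f1` and `max f2` rows of the Maximum Metrics output, both of which are reported at the same threshold.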



In [17]:

    
predicted_train_labels = predicted["predict"].as_data_frame(use_pandas=True)
train_pd = train.as_data_frame(use_pandas=True)



In [18]:

    
plt.scatter(train_pd["x"], train_pd["y"], c=predicted_train_labels["predict"])
plt.show()



In [19]:

    
if_model.model_performance(train_with_label)









    



ModelMetricsBinomial: isolationforest
** Reported on test data. **

MSE: 0.0436448796715
RMSE: 0.20891356986
LogLoss: 0.138389352082
Mean Per-Class Error: 0.105667480667
AUC: 0.936440103107
AUCPR: 0.348593089161
Gini: 0.872880206214

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.0909090909091: 

        0      1     Error   Rate
0       692.0  64.0  0.0847  (64.0/756.0)
1       8.0    31.0  0.2051  (8.0/39.0)
Total   700.0  95.0  0.0906  (72.0/795.0)








    



Maximum Metrics: Maximum metrics at their respective thresholds

    metric                       threshold  value       idx
0   max f1                       0.090909   0.462687    53.0
1   max f2                       0.076364   0.629771    56.0
2   max f0point5                 0.138182   0.406360    41.0
3   max accuracy                 0.927273   0.953459    1.0
4   max precision                1.000000   1.000000    0.0
5   max recall                   0.018182   1.000000    72.0
6   max specificity              1.000000   1.000000    0.0
7   max absolute_mcc             0.076364   0.476273    56.0
8   max min_per_class_accuracy   0.065455   0.871795    59.0
9   max mean_per_class_accuracy  0.043636   0.894333    65.0
10  max tns                      1.000000   756.000000  0.0
11  max fns                      1.000000   38.000000   0.0
12  max fps                      0.000000   756.000000  77.0
13  max tps                      0.018182   39.000000   72.0
14  max tnr                      1.000000   1.000000    0.0
15  max fnr                      1.000000   0.974359    0.0
16  max fpr                      0.000000   1.000000    77.0
17  max tpr                      0.018182   1.000000    72.0








    Out[19]:

Train Isolation Forest using the contamination parameter



In [20]:

    
if_model_cont = H2OIsolationForestEstimator(seed=12, contamination=cont)
if_model_cont.train(training_frame=train)
if_model_cont









    



/Users/mkurka/git/h2o/h2o-3/h2o-py/h2o/estimators/estimator_base.py:200: RuntimeWarning: Stopping tolerance is ignored for _stopping_rounds=0.
  warnings.warn(mesg["message"], RuntimeWarning)






    



isolationforest Model Build progress: |███████████████████████████████████| 100%
Model Details
=============
H2OIsolationForestEstimator :  Isolation Forest
Model Key:  IsolationForest_model_python_1593529508000_108


Model Summary: 

    number_of_trees  number_of_internal_trees  model_size_in_bytes  min_depth  max_depth  mean_depth  min_leaves  max_leaves  mean_leaves
0   50.0             50.0                      34937.0              8.0        8.0        8.0         25.0        78.0        50.84








    



ModelMetricsAnomaly: isolationforest
** Reported on train data. **

Anomaly Score: 6.77064148276
Normalized Anomaly Score: 0.0458717034484

Scoring History: 

    timestamp            duration   number_of_trees  mean_tree_path_length  mean_anomaly_score
0   2020-06-30 11:57:48  0.003 sec  0.0              NaN                    NaN
1   2020-06-30 11:57:48  0.012 sec  1.0              6.864504               0.019357
2   2020-06-30 11:57:48  0.034 sec  2.0              6.851371               0.027023
3   2020-06-30 11:57:48  0.057 sec  3.0              6.820479               0.029920
4   2020-06-30 11:57:48  0.080 sec  4.0              6.805021               0.038996
5   2020-06-30 11:57:48  0.092 sec  5.0              6.807712               0.035609
6   2020-06-30 11:57:48  0.100 sec  6.0              6.783165               0.043367
7   2020-06-30 11:57:48  0.108 sec  7.0              6.777928               0.044414
8   2020-06-30 11:57:48  0.115 sec  8.0              6.776496               0.048325
9   2020-06-30 11:57:48  0.127 sec  9.0              6.782595               0.044469
10  2020-06-30 11:57:48  0.142 sec  10.0             6.783853               0.043229
11  2020-06-30 11:57:48  0.151 sec  11.0             6.783153               0.045006
12  2020-06-30 11:57:48  0.161 sec  12.0             6.785637               0.042873
13  2020-06-30 11:57:48  0.171 sec  13.0             6.773668               0.046703
14  2020-06-30 11:57:48  0.180 sec  14.0             6.764469               0.048492
15  2020-06-30 11:57:48  0.188 sec  15.0             6.766426               0.046715
16  2020-06-30 11:57:48  0.197 sec  16.0             6.769173               0.045039
17  2020-06-30 11:57:48  0.204 sec  17.0             6.763707               0.048397
18  2020-06-30 11:57:48  0.212 sec  18.0             6.768759               0.046248
19  2020-06-30 11:57:48  0.221 sec  19.0             6.769960               0.048030








    



See the whole table with table.as_data_frame()






    Out[20]:



In [21]:

    
predicted_train_labels_cont = if_model_cont.predict(train)["predict"].as_data_frame(use_pandas=True)









    



isolationforest prediction progress: |████████████████████████████████████| 100%



In [22]:

    
plt.scatter(train_pd["x"], train_pd["y"], c=predicted_train_labels_cont["predict"])
plt.show()
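One way to picture what a contamination ratio does is as a quantile cut on the anomaly scores: flag the top `contamination` share of rows as anomalies. The sketch below shows that idea on the ten scores printed earlier in this notebook. Whether H2O derives its threshold in exactly this way is an assumption here, the helper names are hypothetical, and the ratio of 0.2 is chosen only because the sample has ten rows (the notebook itself uses `cont = 0.05`).

```python
def threshold_for_contamination(scores, contamination):
    """Return the score cut-off that flags roughly `contamination` of the rows."""
    ordered = sorted(scores, reverse=True)
    n_flagged = int(round(contamination * len(ordered)))
    if n_flagged == 0:
        return float("inf")  # nothing should be flagged
    return ordered[n_flagged - 1]


def label_anomalies(scores, threshold):
    """Label a row 1 (anomaly) when its score reaches the threshold, else 0."""
    return [1 if score >= threshold else 0 for score in scores]


# Anomaly scores copied from the first ten predictions shown earlier.
sample_scores = [0.123636, 0.0, 0.0145455, 0.421818, 0.0509091,
                 0.0181818, 0.0, 0.00727273, 0.134545, 0.0]

sample_threshold = threshold_for_contamination(sample_scores, contamination=0.2)
sample_labels = label_anomalies(sample_scores, sample_threshold)

print("threshold=%s labels=%s" % (sample_threshold, sample_labels))
```

Note that ties at the threshold would flag more rows than requested; a real implementation has to decide how to handle them.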
