history_diagnostics is a pure-Python package to validate A/B test results based on contexts containing metrics relevant to the test. The test variants are validated by comparing their contexts to the values of a pre-experiment sample, which increases the reliability of the validation. The code is available on GitHub.
In this tutorial, we will validate an A/B test with samples containing only ~100 requests each.
In [1]:
import json
import numpy as np
import pandas as pd
from history_diagnostics import targetspace
First we need to get some data to work on. We use the example data, which is stored in a JSON file. Using pandas, we can easily generate Sample objects.
data/example.json contains three data samples:
pre-experiment contains a sample taken before the A/B test was started.
A and B contain the samples for the two test cases. Since the application's traffic was split 50/50 between the two test cases, we need to tell each sample that it has a traffic_boost given by 1 / traffic_share, i.e. 2 for both A and B.
In [2]:
with open("data/example.json") as f:
    data = json.load(f)
s_pre = targetspace.Sample(pd.DataFrame(data["pre-experiment"]))
s_a = targetspace.Sample(pd.DataFrame(data["A"]), traffic_boost=2)
s_b = targetspace.Sample(pd.DataFrame(data["B"]), traffic_boost=2)
Next, we need to define our context function. In this tutorial, we focus on frontend-centric metrics, so the context consists of the traffic rate, the median request performance, the JavaScript error rate, and the rate of event1 (see the code below).
The function will be called with a single argument: a Sample object.
We annotate the context function with the possible ranges of the metrics, passed as (low, high) tuples. None means that no bound exists. There is no theoretical upper bound for the traffic rate or performance, but the error and event rates are limited to the interval (0, 1).
We also annotate the context function with the direction of the metrics: "upper" means that the metric gets better as it increases (traffic or event rate), and "lower" means the opposite (performance or error rate).
In [3]:
@targetspace.metric_directions("upper", "lower", "lower", "upper")
@targetspace.metric_ranges((0, None), (0, None), (0, 1), (0, 1))
def frontend_context(sample):
    n_requests = len(sample)
    if n_requests == 0:
        return 0, 0, 0, 0
    # requests per unit time, corrected for the sample's share of the traffic
    traffic = n_requests / sample.duration * sample.traffic_boost
    performance = np.median(sample.requests.performance)
    error_rate = sample.requests.js_error.sum() / n_requests
    event1_rate = sample.requests.event1.sum() / n_requests
    return traffic, performance, error_rate, event1_rate
We can now peek into our data:
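A minimal way to do this, using nothing beyond the context function we just defined, is to evaluate it on each sample (the loop and labels below are our own, not part of the package):

for label, sample in (("pre", s_pre), ("A", s_a), ("B", s_b)):
    print(label, frontend_context(sample))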
We see that the traffic and the event rate do not change much. For variant B, the performance improved by a factor of 2 compared to the pre-experiment sample and to A, while the error rate worsened by a factor of 4.
Let's see if we can detect this.
To perform the actual validation of the test, we need a target space. We create a subclass of TargetSpace that also inherits from a metric-combining mixin. In this case, we use OneMinusMaxMixin, which returns the minimum of the probabilities of finding a worse value for any of the metrics in the context function.
In [4]:
class FrontendTargetSpace(targetspace.OneMinusMaxMixin, targetspace.TargetSpace):
    pass
To create an instance, we pass the context function and the pre-experiment sample. The pre-experiment sample can then be accessed via the history attribute of the instance.
In [5]:
ts = FrontendTargetSpace(frontend_context, s_pre)
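As a quick, purely illustrative sanity check, we can confirm that the history sample is attached by printing two attributes we will need again in the calibration step (len(ts.history) and ts.history.duration):

print(len(ts.history), ts.history.duration)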
Then we need to calibrate the target space. We do this for each of the test samples, passing the relative duration s.duration / ts.history.duration together with the remaining calibration parameters shown in the next cell.
In [6]:
ts.calibrate(s_a.duration / ts.history.duration, 200, 0.1 * ts.history.duration / len(ts.history), True)
ts.calibrate(s_b.duration / ts.history.duration, 200, 0.1 * ts.history.duration / len(ts.history), True)
In [7]:
loc_a = ts.locate(s_a)
loc_b = ts.locate(s_b)
print("A: {:.5f}".format(loc_a))
print("B: {:.5f}".format(loc_b))
We see that for variant A the smallest probability of finding a worse metric value than observed is 0.15, well above a critical value of 0.05. For variant B, this probability is only 0.035. This indicates that there might be an unnoticed issue with this variant.
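If we wanted to automate this check, a small loop could compare each location against the critical value; critical_value and the verdict strings below are our own choices, not part of the library:

critical_value = 0.05
for label, loc in (("A", loc_a), ("B", loc_b)):
    verdict = "ok" if loc >= critical_value else "needs investigation"
    print("{}: {:.5f} -> {}".format(label, loc, verdict))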
Let's see if we can find the culprit by comparing the metric values directly:
In [8]:
metric_labels = ("traffic", "performance", "error rate", "event rate")
metric_values = (("pre", frontend_context(ts.history)),
                 ("A", frontend_context(s_a)),
                 ("B", frontend_context(s_b)))
for i, metric_label in enumerate(metric_labels):
    print(metric_label)
    for sample_label, sample_values in metric_values:
        print("  {}: {}".format(sample_label, sample_values[i]))
We see that the error rate is the metric that worsened in variant B compared to the pre-experiment sample: 0.052 vs. 0.018.