history_diagnostics is a pure-Python package to validate A/B test results based on contexts containing metrics relevant to the test. The test variants are validated by comparing their contexts to the values of a pre-experiment sample, which increases the reliability of the validation. The code is available on GitHub.
In this tutorial, we will validate an A/B test with samples containing only ~100 requests each.
In [1]:
import json
import numpy as np
import pandas as pd
from history_diagnostics import targetspace
First we need to get some data to work on. We use the example data, which is stored in a JSON file. Using pandas, we can easily generate Sample objects.
data/example.json contains three data samples:
pre-experiment contains a sample taken before the A/B test was started.
A and B contain the samples for the two test cases. Since the application's traffic was split 50/50 between the two test cases, we need to tell each sample that it has a traffic_boost given by 1 / traffic_share, i.e. 2 for both A and B.
In [2]:
with open("data/example.json") as f:
    data = json.load(f)
s_pre = targetspace.Sample(pd.DataFrame(data["pre-experiment"]))
s_a = targetspace.Sample(pd.DataFrame(data["A"]), traffic_boost=2)
s_b = targetspace.Sample(pd.DataFrame(data["B"]), traffic_boost=2)
Next, we need to define our context function. In this tutorial, we focus on frontend-centric metrics, so the context consists of the traffic rate, the median request performance, the JavaScript error rate, and the rate of event1 (see the code below).
The function will be called with a single argument: a Sample object.
We annotate the context function with the possible ranges of the metrics, passed as (low, high) tuples. None means that no bound exists. There is no theoretical upper bound for the traffic rate or performance, but the error and event rates are limited to the interval (0, 1).
We also annotate the context function with the direction of the metrics: "upper" means that the metric gets better as it increases (traffic or event rate), and "lower" means the opposite (performance or error rate).
In [3]:
@targetspace.metric_directions("upper", "lower", "lower", "upper")
@targetspace.metric_ranges((0, None), (0, None), (0, 1), (0, 1))
def frontend_context(sample):
    n_requests = len(sample)
    if n_requests == 0:
        return 0, 0, 0, 0
    # requests per unit time, corrected for the sample's share of the traffic
    traffic = n_requests / sample.duration * sample.traffic_boost
    performance = np.median(sample.requests.performance)
    error_rate = sample.requests.js_error.sum() / n_requests
    event1_rate = sample.requests.event1.sum() / n_requests
    return traffic, performance, error_rate, event1_rate
We can now peek into our data:
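A minimal way to do this, using nothing beyond the context function we just defined, is to evaluate it on each sample (the loop and labels below are our own, not part of the package):

for label, sample in (("pre", s_pre), ("A", s_a), ("B", s_b)):
    print(label, frontend_context(sample))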
We see that the traffic and the event rate do not change much. For variant B, the performance improved by a factor of 2 compared to the pre-experiment sample and to A, while the error rate worsened by a factor of 4.
Let's see if we can detect this.
To perform the actual validation of the test, we need a target space. We create a subclass of TargetSpace that also inherits from a metric-combining mixin. In this case, we use OneMinusMaxMixin, which returns the minimum of the probabilities of finding a worse value for any of the metrics in the context function.
In [4]:
class FrontendTargetSpace(targetspace.OneMinusMaxMixin, targetspace.TargetSpace):
    pass
To create an instance, we pass the context function and the pre-experiment sample. The pre-experiment sample can then be accessed via the history attribute of the instance.
In [5]:
ts = FrontendTargetSpace(frontend_context, s_pre)
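As a quick, purely illustrative sanity check, we can confirm that the history sample is attached by printing two attributes we will need again in the calibration step (len(ts.history) and ts.history.duration):

print(len(ts.history), ts.history.duration)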
Then we need to calibrate the target space. We do this for each of the test samples, passing the relative duration s.duration / ts.history.duration together with the remaining calibration parameters shown in the next cell.
In [6]:
ts.calibrate(s_a.duration / ts.history.duration, 200, 0.1 * ts.history.duration / len(ts.history), True)
ts.calibrate(s_b.duration / ts.history.duration, 200, 0.1 * ts.history.duration / len(ts.history), True)
In [7]:
loc_a = ts.locate(s_a)
loc_b = ts.locate(s_b)
print("A: {:.5f}".format(loc_a))
print("B: {:.5f}".format(loc_b))
We see that for variant A the smallest probability of finding a worse metric value than observed is 0.15, well above a critical value of 0.05. For variant B, this probability is only 0.035. This indicates that there might be an unnoticed issue with this variant.
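If we wanted to automate this check, a small loop could compare each location against the critical value; critical_value and the verdict strings below are our own choices, not part of the library:

critical_value = 0.05
for label, loc in (("A", loc_a), ("B", loc_b)):
    verdict = "ok" if loc >= critical_value else "needs investigation"
    print("{}: {:.5f} -> {}".format(label, loc, verdict))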
Let's see if we can find the culprit by comparing the metric values directly:
In [8]:
metric_labels = ("traffic", "performance", "error rate", "event rate")
metric_values = (("pre", frontend_context(ts.history)),
                 ("A", frontend_context(s_a)),
                 ("B", frontend_context(s_b)))
for i, metric_label in enumerate(metric_labels):
    print(metric_label)
    for sample_label, sample_values in metric_values:
        print("  {}: {}".format(sample_label, sample_values[i]))
We see that the error rate is the metric that worsened in variant B compared to the pre-experiment sample: 0.052 vs. 0.018.