The difference between two classifiers (algorithms) can be very small; however, *there are no two classifiers whose accuracies are perfectly equivalent*.

With a null hypothesis significance test (NHST), the null hypothesis is that the two classifiers are equal. However, the null hypothesis is practically always false!
When NHST rejects the null hypothesis, it indicates only that the null hypothesis is unlikely; **but this is known even before running the experiment**.

Can we say anything about the probability that two classifiers are practically equivalent (e.g., *j48* is practically equivalent to *j48gr*)?

NHST cannot answer this question, while Bayesian analysis can.

We need to define the meaning of **practically equivalent**.

The rope (region of practical equivalence) depends:

- on the **metric** we use for comparing classifiers (accuracy, log-loss, etc.);
- on our **subjective** definition of practical equivalence (**domain specific**).

Accuracy is a number in $[0,1]$. For practical applications, it is sensible to consider two classifiers practically equivalent when the mean difference of their accuracies is less than $1\%$ ($0.01$): a difference in accuracy of $1\%$ is negligible in practice.

The interval $[-0.01,0.01]$ can thus be used to define a **region of practical equivalence** for classifiers.
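As a toy illustration (the accuracies below are made up, not the data analysed next), a point estimate of the mean difference can be checked against this interval:

```
import numpy as np

# Hypothetical accuracies of two classifiers on five data sets (made up).
acc_a = np.array([0.85, 0.87, 0.90, 0.78, 0.92])
acc_b = np.array([0.86, 0.87, 0.89, 0.78, 0.91])

rope = (-0.01, 0.01)                  # region of practical equivalence
mean_diff = np.mean(acc_a - acc_b)    # point estimate: 0.002
print(rope[0] < mean_diff < rope[1])  # True: inside the rope
```

The Bayesian tests below go further: instead of a yes/no check on a point estimate, they return the posterior probability that the difference lies inside the rope.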

See it in action.

We will load the data from `Data/accuracy_j48_j48gr.csv`. For simplicity, we will skip the header row and the column with data set names.

In [2]:

```
import numpy as np

# Load the two score columns, skipping the header row and the data set names.
scores = np.loadtxt('Data/accuracy_j48_j48gr.csv', delimiter=',', skiprows=1, usecols=(1, 2))
names = ("J48", "J48gr")
```

The function `signtest(x, rope, prior_strength=1, prior_place=ROPE, nsamples=50000, verbose=False, names=('C1', 'C2'))` computes the Bayesian sign test and returns the probabilities that the difference of scores (the score of the second classifier minus the score of the first) is negative, within the rope, or positive.

In [3]:

```
import bayesiantests as bt
left, within, right = bt.signtest(scores, rope=0.01, verbose=True, names=names)
```

The first value (`P(J48 > J48gr)`) is the probability that the first classifier (the left column of `x`) has a higher score than the second (or that the differences are negative, if `x` is given as a vector).

The second value (`P(rope)`) is the probability that the two classifiers are practically equivalent.

The third value (`P(J48gr > J48)`) is equal to `1 - P(J48 > J48gr) - P(rope)`.
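As a small sketch (assuming, per the description above, that a vector input holds second-minus-first differences), the two call forms should give approximately the same result, since the test is based on sampling; and the three probabilities always sum to one:

```
# Same test on a vector of differences instead of the two-column array.
# Assumed sign convention (from the text above): a negative difference means
# the first classifier (J48) has the higher score.
diffs = scores[:, 1] - scores[:, 0]
l2, w2, r2 = bt.signtest(diffs, rope=0.01, names=names)

# The three returned probabilities form a partition of the outcomes.
assert abs(left + within + right - 1) < 1e-9
```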

The probability of the rope is equal to $1$; therefore, we can say that the two classifiers are practically equivalent (for the given rope).
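One way to turn these posterior probabilities into a verbal conclusion is to act only when some event is sufficiently probable; the sketch below uses a $0.95$ threshold, which is a common convention rather than anything prescribed by the library:

```
def decide(left, within, right, names=('C1', 'C2'), threshold=0.95):
    # Map the three posterior probabilities to a verbal conclusion.
    if within > threshold:
        return "practically equivalent"
    if left > threshold:
        return f"{names[0]} is better"
    if right > threshold:
        return f"{names[1]} is better"
    return "no decision"

print(decide(left, within, right, names))  # here: "practically equivalent"
```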

Decision tree grafting (**J48gr**) was developed to demonstrate that a preference for less complex trees (**J48**) does not serve to improve accuracy. The point is that **J48gr** achieves consistent (albeit small) improvements in accuracy over **J48**.

The advantage of having a rope is that we can test this hypothesis from a statistical point of view.
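Before shrinking the rope, a quick look at the raw data (a sketch using the `scores` array loaded above) shows on how many data sets each classifier has the higher accuracy:

```
# On how many data sets does each classifier have the higher accuracy?
wins_j48 = np.sum(scores[:, 0] > scores[:, 1])
wins_j48gr = np.sum(scores[:, 1] > scores[:, 0])
ties = np.sum(scores[:, 0] == scores[:, 1])
print(f"J48 wins: {wins_j48}, J48gr wins: {wins_j48gr}, ties: {ties}")
```

We can then narrow the rope and ask how probable a difference smaller than $0.001$ is: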

In [10]:

```
left, within, right = bt.signtest(scores, rope=0.001, verbose=True, names=names)
```

Now the difference is smaller than $0.001$ with probability $0.99$.

With an even narrower rope of $0.0001$:

In [11]:

```
left, within, right = bt.signtest(scores, rope=0.0001, verbose=True, names=names)
```

To check the sensitivity of this result to the prior, we can place the prior pseudo-observation to the right of the rope, i.e., in favour of **J48gr**:

In [12]:

```
left, within, right = bt.signtest(scores, rope=0.0001, prior_place=bt.RIGHT, verbose=True, names=names)
```
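A minimal sketch of a fuller sensitivity analysis, assuming the module also exposes `bt.LEFT` alongside `bt.ROPE` and `bt.RIGHT`: rerun the test with the pseudo-observation at each location and compare the conclusions.

```
# If the conclusions agree for every placement of the prior pseudo-observation,
# the choice of prior does not affect the result.
for place, label in ((bt.LEFT, 'left'), (bt.ROPE, 'rope'), (bt.RIGHT, 'right')):
    l, w, r = bt.signtest(scores, rope=0.0001, prior_place=place, names=names)
    print(f"prior on {label}: left={l:.3f}, rope={w:.3f}, right={r:.3f}")
```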

We can also plot how the three probabilities change as the rope shrinks:

In [88]:

```
%matplotlib inline
import matplotlib.pyplot as plt

# Run the sign test for ten geometrically shrinking rope widths.
rope_widths = 0.001 / 2**np.arange(10)
left = np.zeros(10)
within = np.zeros(10)
right = np.zeros(10)
for i in range(10):
    left[i], within[i], right[i] = bt.signtest(scores, rope=rope_widths[i], names=names)

plt.plot(rope_widths, within)
plt.plot(rope_widths, left)
plt.plot(rope_widths, right)
plt.legend(('rope', 'left', 'right'))
plt.xlabel('Rope width')
plt.ylabel('Probability')
```
Out[88]:

*(figure: the probabilities of `left`, `rope`, and `right` plotted against the rope width)*
The same analysis can be run with the Bayesian signed-rank test (`signrank`), which, unlike the sign test, also takes the magnitudes of the differences into account:

In [13]:

```
left, within, right = bt.signrank(scores, rope=0.001, verbose=True, names=names)
```

In [14]:

```
left, within, right = bt.signrank(scores, rope=0.0001, verbose=True, names=names)
```
