In [1]:
import numpy as np
np.random.seed(0)
from selection.SPRT import SPRT
%pylab inline


Populating the interactive namespace from numpy and matplotlib

A sanity check

Let's take a test that runs a minimum of 3 steps and a maximum of 5, with decision boundary $\pm 4 \sqrt{n}$ on the running sum.


In [2]:
sequential_test = SPRT(lambda n: 4 * np.sqrt(n), nmin=3, nmax=5)

If we feed this test all-zero data with measurement $\sigma = 1.5$ (so that the standard deviation of one difference is $\sqrt{2} \cdot 1.5$), we expect constraints only on the third and fourth running sums: the third running sum should lie between $\pm 6 \sqrt{6}$ and the fourth between $\pm 12 \sqrt{2}$.
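
As a quick check of that arithmetic (a sketch, not part of the SPRT API): each offset should be the boundary value $4\sqrt{n}$ scaled by the standard deviation of one difference, $\sqrt{2}\,\sigma$.

import numpy as np

sigma = 1.5
sd_diff = np.sqrt(2) * sigma            # standard deviation of one paired difference
for n in (3, 4):
    print(4 * np.sqrt(n) * sd_diff)     # n=3 -> 6*sqrt(6) ~ 14.70, n=4 -> 12*sqrt(2) ~ 16.97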


In [3]:
sigma = 1.5
zero = np.zeros((sequential_test.nmax, 2))
zero_results = sequential_test(zero, sigma=sigma)
zero_results.outcome


Out[3]:
'reached max time'

In [4]:
zero_results.selection_constraints[0].inequality


Out[4]:
array([[-1., -1., -1., -0., -0.],
       [-1., -1., -1., -1., -0.],
       [ 1.,  1.,  1.,  0.,  0.],
       [ 1.,  1.,  1.,  1.,  0.]])

In [5]:
print(zero_results.selection_constraints[0].inequality_offset)
print((6*np.sqrt(6), 12*np.sqrt(2)))


[ 14.69693846  16.97056275  14.69693846  16.97056275]
(14.696938456699067, 16.970562748477143)

If we instead give it data with a strong signal, we expect it to terminate early. The boundaries are unchanged, so we should be able to make it stop at either 3 or 4 steps.
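
As a quick arithmetic check (assuming, as the constraints above suggest, that the statistic is the running sum of the paired differences): the differences fed in below are 2, 4, 10, whose running sum reaches 16 at step 3, beyond the upper bound $6\sqrt{6} \approx 14.70$, so the test should stop there.

import numpy as np

diffs = np.array([2, 4, 10])        # paired differences in the example below
print(np.cumsum(diffs))             # [ 2  6 16]
print(6 * np.sqrt(6))               # 14.6969..., exceeded at step 3, so the upper boundary is hit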


In [6]:
upper = np.zeros((sequential_test.nmax, 2))
upper[:,1] = np.array([2,4,10,0,0])
upper_results = sequential_test(upper, sigma=sigma)
(upper_results.outcome, upper_results.stopping_time)


Out[6]:
('upper boundary', 3)

In [7]:
(upper_results.selection_constraints[0].inequality,
 upper_results.selection_constraints[0].inequality_offset)


Out[7]:
(array([[ 1.,  1.,  1.]]), array([-14.69693846]))

In [8]:
later = np.zeros((sequential_test.nmax, 2))
later[:,1] = -np.array([2,4,4,7,0])
later_results = sequential_test(later, sigma=sigma)
(later_results.outcome, later_results.stopping_time)


Out[8]:
('lower boundary', 4)

In [9]:
(later_results.selection_constraints[0].inequality,
 later_results.selection_constraints[0].inequality_offset)


Out[9]:
(array([[ 1.,  1.,  1.,  0.],
        [-1., -1., -1., -0.],
        [ 1.,  1.,  1.,  1.]]),
 array([ 14.69693846,  14.69693846, -16.97056275]))

The returned object has a truncated Gaussian associated with it.


In [10]:
quantile = later_results.trunc_gauss.quantile(0.05)
later_results.trunc_gauss.CDF(quantile)


Out[10]:
0.050000000000290611

It also has selection intervals attached to it: a naive interval and a UMAU interval computed using Will's UMAU calculations.


In [11]:
later_results.naive_interval, later_results.UMAU_interval


Out[11]:
(array([-0.91663516,  0.40394909]), array([-0.89490976,  0.8548994 ]))

A simulation

Let's try simulating data with a true mean difference of 2.


In [12]:
sequential_test = SPRT(lambda n: 4 * np.sqrt(n), nmin=3, nmax=20)

mu, sigma = np.array([4,6]), 1
simulate = lambda mu: np.random.standard_normal((sequential_test.nmax,2)) * sigma + mu[None,:]
simulate(mu)


Out[12]:
array([[ 5.76405235,  6.40015721],
       [ 4.97873798,  8.2408932 ],
       [ 5.86755799,  5.02272212],
       [ 4.95008842,  5.84864279],
       [ 3.89678115,  6.4105985 ],
       [ 4.14404357,  7.45427351],
       [ 4.76103773,  6.12167502],
       [ 4.44386323,  6.33367433],
       [ 5.49407907,  5.79484174],
       [ 4.3130677 ,  5.14590426],
       [ 1.44701018,  6.6536186 ],
       [ 4.8644362 ,  5.25783498],
       [ 6.26975462,  4.54563433],
       [ 4.04575852,  5.81281615],
       [ 5.53277921,  7.46935877],
       [ 4.15494743,  6.37816252],
       [ 3.11221425,  4.01920353],
       [ 3.65208785,  6.15634897],
       [ 5.23029068,  7.20237985],
       [ 3.61267318,  5.69769725]])

In [13]:
Z = simulate(mu)
results = sequential_test(Z, sigma=sigma)
U = results.UMAU_interval
N = results.naive_interval

results.trunc_gauss.mu = N[0]
results.trunc_gauss.CDF(results.observed)


/Users/jonathantaylor/canopy/lib/python2.7/site-packages/selection/truncated.py:176: RuntimeWarning: invalid value encountered in double_scalars
  (observed - mu)/sigma, use_R=self.use_R)) / P.sum()
Out[13]:
0.97500001420285243

In [14]:
results.trunc_gauss.mu = N[1]
results.trunc_gauss.CDF(results.observed)


Out[14]:
0.024999998872230023

In [15]:
results.naive_interval, results.UMAU_interval


Out[15]:
(array([  0.19083166,  15.93350397]), array([  0.49508863,  15.34686104]))

In [16]:
results.outcome


Out[16]:
'upper boundary'

In [17]:
results.boundary


Out[17]:
array([[ -5.65685425,   5.65685425],
       [ -8.        ,   8.        ],
       [ -9.79795897,   9.79795897],
       [-11.3137085 ,  11.3137085 ]])

In [18]:
quantile = results.trunc_gauss.quantile(0.05)
results.trunc_gauss.CDF(quantile)


Out[18]:
0.0

Some intervals


In [19]:
P = []
pivot = []
nsim = 500

naive = plt.figure(figsize=(10,8)).gca()

for i in range(nsim):
    Z = simulate(mu)

    results = sequential_test(Z, sigma=sigma)
    results.alpha = 0.1
    
    P.append(results.pvalue(truth=0))
    pivot.append(results.pvalue(truth=mu[1]-mu[0]))
    
    if i < 100:
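        # green if the naive interval covers the true difference mu[1] - mu[0], red otherwise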
        if (results.naive_interval[0] < mu[1] - mu[0] and
            results.naive_interval[1] > mu[1] - mu[0]):
            naive.plot([i,i], results.naive_interval, linewidth=3, alpha=0.5, color='green')
        else:
            naive.plot([i,i], results.naive_interval, linewidth=3, alpha=0.5, color='red')
        
naive.plot([0,nsim], [mu[1]-mu[0],mu[1]-mu[0]], 'k--')
naive.plot([0,nsim], [0,0], 'gray')
naive.set_xlim([0,100])


Out[19]:
(0, 100)

In [20]:
%load_ext rmagic

In [21]:
%%R -i P,pivot
plot(ecdf(P), col='green')
lines(ecdf(pivot), col='red')
ecdf(P)(0.05)


[1] 0.374

A tweak

One possible reason the intervals are large is that, when we stop, the observed value may fall very close to the edge of its truncation interval. We might try taking a few additional samples after stopping, if the design permits, to form a better estimate.

This is controlled by the extra_samples argument.


In [22]:
P = []
pivot = []
nsim = 500

naive = plt.figure(figsize=(10,8)).gca()
nominal = plt.figure(figsize=(10,8)).gca()

for i in range(nsim):
    Z = simulate(mu)

    results = sequential_test(Z, sigma=sigma, extra_samples=5)
    results.alpha = 0.1
    P.append(results.pvalue(truth=0))
    pivot.append(results.pvalue(truth=mu[1]-mu[0]))
    
    if i < 100:
        if (results.naive_interval[0] < mu[1] - mu[0] and
            results.naive_interval[1] > mu[1] - mu[0]):
            naive.plot([i,i], results.naive_interval, linewidth=3, alpha=0.5, color='green')
        else:
            naive.plot([i,i], results.naive_interval, linewidth=3, alpha=0.5, color='red')
            
        if (results.nominal_interval[0] < mu[1] - mu[0] and
            results.nominal_interval[1] > mu[1] - mu[0]):
            nominal.plot([i,i], results.nominal_interval, linewidth=3, alpha=0.5, color='green')
        else:
            nominal.plot([i,i], results.nominal_interval, linewidth=3, alpha=0.5, color='red')
        
naive.plot([0,nsim], [mu[1]-mu[0],mu[1]-mu[0]], 'k--')
naive.plot([0,nsim], [0,0], 'gray')
naive.set_xlim([0,100])
naive.set_title('Adjusted intervals')

nominal.plot([0,nsim], [mu[1]-mu[0],mu[1]-mu[0]], 'k--')
nominal.plot([0,nsim], [0,0], 'gray')
nominal.set_xlim([0,100])
nominal.set_title('Nominal intervals for same data')



In [ ]:
%%R -i P,pivot
plot(ecdf(P), col='green')
lines(ecdf(pivot), col='red')
ecdf(P)(0.05)

The effect of the boundary

What if we change to a wider boundary?


In [ ]:
sequential_test = SPRT(lambda n: 2 * np.sqrt(n), nmin=3, nmax=20)

mu, sigma = np.array([4,6]), 1. / np.sqrt(2)
simulate = lambda mu: np.random.standard_normal((sequential_test.nmax,2)) * sigma + mu[None,:]

In [54]:
P = []
pivot = []
nsim = 500

naive = plt.figure(figsize=(10,8)).gca()

for i in range(nsim):
    Z = simulate(mu)

    results = sequential_test(Z, sigma=sigma)
    results.alpha = 0.1
    P.append(results.pvalue(truth=0))
    pivot.append(results.pvalue(truth=mu[1]-mu[0]))
    
    if i < 100:
        if (results.naive_interval[0] < mu[1] - mu[0] and
            results.naive_interval[1] > mu[1] - mu[0]):
            naive.plot([i,i], results.naive_interval, linewidth=3, alpha=0.5, color='green')
        else:
            naive.plot([i,i], results.naive_interval, linewidth=3, alpha=0.5, color='red')
        
naive.plot([0,nsim], [mu[1]-mu[0],mu[1]-mu[0]], 'k--')
naive.plot([0,nsim], [0,0], 'gray')
naive.set_xlim([0,100])

In [55]:
%%R -i P,pivot
plot(ecdf(P), col='green')
lines(ecdf(pivot), col='red')
ecdf(P)(0.05)

In short, with a wider boundary, the intervals seem smaller but our conditional test seems to have less power. Let's also take a few extra samples after stopping.


In [56]:
P = []
pivot = []
nsim = 500

naive = plt.figure(figsize=(10,8)).gca()

for i in range(nsim):
    Z = simulate(mu)

    results = sequential_test(Z, sigma=sigma, extra_samples=5)
    results.alpha = 0.1
    P.append(results.pvalue(truth=0))
    pivot.append(results.pvalue(truth=mu[1]-mu[0]))
    
    if i < 100:
        if (results.naive_interval[0] < mu[1] - mu[0] and
            results.naive_interval[1] > mu[1] - mu[0]):
            naive.plot([i,i], results.naive_interval, linewidth=3, alpha=0.5, color='green')
        else:
            naive.plot([i,i], results.naive_interval, linewidth=3, alpha=0.5, color='red')
        
naive.plot([0,nsim], [mu[1]-mu[0],mu[1]-mu[0]], 'k--')
naive.plot([0,nsim], [0,0], 'gray')
naive.set_xlim([0,100])


Out[56]:
(0, 100)

In [57]:
%%R -i P,pivot
plot(ecdf(P), col='green')
lines(ecdf(pivot), col='red')
ecdf(P)(0.05)
