In [1]:
import numpy as np
np.random.seed(0)
from selection.SPRT import SPRT
%pylab inline


Populating the interactive namespace from numpy and matplotlib

A sanity check

Let's take a test that runs a minimum of 3 steps and a maximum of 5, with decision boundary $\pm 4 \sqrt{n}$ on the running sum.


In [2]:
sequential_test = SPRT(lambda n: 4 * np.sqrt(n), nmin=3, nmax=5)

If we feed this test all-zero data with measurement $\sigma = 1.5$ (so that the standard deviation of one difference is $\sqrt{2} \cdot 1.5$), we expect constraints only on the third and fourth running sums: the third running sum should lie between $\pm 6 \sqrt{6}$ and the fourth between $\pm 12 \sqrt{2}$.
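
As a quick check of that arithmetic (a sketch, not part of the SPRT API): each offset should be the boundary value $4\sqrt{n}$ scaled by the standard deviation of one difference, $\sqrt{2}\,\sigma$.

import numpy as np

sigma = 1.5
sd_diff = np.sqrt(2) * sigma            # standard deviation of one paired difference
for n in (3, 4):
    print(4 * np.sqrt(n) * sd_diff)     # n=3 -> 6*sqrt(6) ~ 14.70, n=4 -> 12*sqrt(2) ~ 16.97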


In [3]:
sigma = 1.5
zero = np.zeros((sequential_test.nmax, 2))
zero_results = sequential_test(zero, sigma=sigma)
zero_results.outcome


Out[3]:
'reached max time'

In [4]:
zero_results.selection_constraints[0].inequality


Out[4]:
array([[-1., -1., -1., -0., -0.],
       [-1., -1., -1., -1., -0.],
       [ 1.,  1.,  1.,  0.,  0.],
       [ 1.,  1.,  1.,  1.,  0.]])

In [5]:
print(zero_results.selection_constraints[0].inequality_offset)
print((6*np.sqrt(6), 12*np.sqrt(2)))


[ 14.69693846  16.97056275  14.69693846  16.97056275]
(14.696938456699067, 16.970562748477143)

If we instead give it data with a strong signal, we expect it to terminate early. The boundaries are unchanged, so we should be able to make it stop at either 3 or 4 steps.
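
As a quick arithmetic check (assuming, as the constraints above suggest, that the statistic is the running sum of the paired differences): the differences fed in below are 2, 4, 10, whose running sum reaches 16 at step 3, beyond the upper bound $6\sqrt{6} \approx 14.70$, so the test should stop there.

import numpy as np

diffs = np.array([2, 4, 10])        # paired differences in the example below
print(np.cumsum(diffs))             # [ 2  6 16]
print(6 * np.sqrt(6))               # 14.6969..., exceeded at step 3, so the upper boundary is hit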


In [6]:
upper = np.zeros((sequential_test.nmax, 2))
upper[:,1] = np.array([2,4,10,0,0])
upper_results = sequential_test(upper, sigma=sigma)
(upper_results.outcome, upper_results.stopping_time)


Out[6]:
('upper boundary', 3)

In [7]:
(upper_results.selection_constraints[0].inequality,
 upper_results.selection_constraints[0].inequality_offset)


Out[7]:
(array([[ 1.,  1.,  1.]]), array([-14.69693846]))

In [8]:
later = np.zeros((sequential_test.nmax, 2))
later[:,1] = -np.array([2,4,4,7,0])
later_results = sequential_test(later, sigma=sigma)
(later_results.outcome, later_results.stopping_time)


Out[8]:
('lower boundary', 4)

In [9]:
(later_results.selection_constraints[0].inequality,
 later_results.selection_constraints[0].inequality_offset)


Out[9]:
(array([[ 1.,  1.,  1.,  0.],
        [-1., -1., -1., -0.],
        [ 1.,  1.,  1.,  1.]]),
 array([ 14.69693846,  14.69693846, -16.97056275]))

The returned object has a truncated Gaussian associated with it.


In [10]:
quantile = later_results.trunc_gauss.quantile(0.05)
later_results.trunc_gauss.CDF(quantile)


Out[10]:
0.050000000000290611

It also has selection intervals attached to it: a naive interval and a UMAU interval computed using Will's UMAU calculations.


In [11]:
later_results.naive_interval, later_results.UMAU_interval


Out[11]:
(array([-0.91663516,  0.40394909]), array([-0.89490976,  0.8548994 ]))

A simulation

Let's try simulating data with a true mean difference of 2.


In [12]:
sequential_test = SPRT(lambda n: 4 * np.sqrt(n), nmin=3, nmax=20)

mu, sigma = np.array([4,6]), 1
simulate = lambda mu: np.random.standard_normal((sequential_test.nmax,2)) * sigma + mu[None,:]
simulate(mu)


Out[12]:
array([[ 5.76405235,  6.40015721],
       [ 4.97873798,  8.2408932 ],
       [ 5.86755799,  5.02272212],
       [ 4.95008842,  5.84864279],
       [ 3.89678115,  6.4105985 ],
       [ 4.14404357,  7.45427351],
       [ 4.76103773,  6.12167502],
       [ 4.44386323,  6.33367433],
       [ 5.49407907,  5.79484174],
       [ 4.3130677 ,  5.14590426],
       [ 1.44701018,  6.6536186 ],
       [ 4.8644362 ,  5.25783498],
       [ 6.26975462,  4.54563433],
       [ 4.04575852,  5.81281615],
       [ 5.53277921,  7.46935877],
       [ 4.15494743,  6.37816252],
       [ 3.11221425,  4.01920353],
       [ 3.65208785,  6.15634897],
       [ 5.23029068,  7.20237985],
       [ 3.61267318,  5.69769725]])

In [13]:
Z = simulate(mu)
results = sequential_test(Z, sigma=sigma)
U = results.UMAU_interval
N = results.naive_interval

results.trunc_gauss.mu = N[0]
results.trunc_gauss.CDF(results.observed)


/Users/jonathantaylor/canopy/lib/python2.7/site-packages/selection/truncated.py:176: RuntimeWarning: invalid value encountered in double_scalars
  (observed - mu)/sigma, use_R=self.use_R)) / P.sum()
Out[13]:
0.97500001420285243

In [14]:
results.trunc_gauss.mu = N[1]
results.trunc_gauss.CDF(results.observed)


Out[14]:
0.024999998872230023

In [15]:
results.naive_interval, results.UMAU_interval


Out[15]:
(array([  0.19083166,  15.93350397]), array([  0.49508863,  15.34686104]))

In [16]:
results.outcome


Out[16]:
'upper boundary'

In [17]:
results.boundary


Out[17]:
array([[ -5.65685425,   5.65685425],
       [ -8.        ,   8.        ],
       [ -9.79795897,   9.79795897],
       [-11.3137085 ,  11.3137085 ]])

In [18]:
quantile = results.trunc_gauss.quantile(0.05)
results.trunc_gauss.CDF(quantile)


Out[18]:
0.0

Some intervals


In [19]:
P = []
pivot = []
nsim = 500

naive = plt.figure(figsize=(10,8)).gca()

for i in range(nsim):
    Z = simulate(mu)

    results = sequential_test(Z, sigma=sigma)
    results.alpha = 0.1
    
    P.append(results.pvalue(truth=0))
    pivot.append(results.pvalue(truth=mu[1]-mu[0]))
    
    if i < 100:
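        # green if the naive interval covers the true difference mu[1] - mu[0], red otherwise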
        if (results.naive_interval[0] < mu[1] - mu[0] and
            results.naive_interval[1] > mu[1] - mu[0]):
            naive.plot([i,i], results.naive_interval, linewidth=3, alpha=0.5, color='green')
        else:
            naive.plot([i,i], results.naive_interval, linewidth=3, alpha=0.5, color='red')
        
naive.plot([0,nsim], [mu[1]-mu[0],mu[1]-mu[0]], 'k--')
naive.plot([0,nsim], [0,0], 'gray')
naive.set_xlim([0,100])


Out[19]:
(0, 100)

In [20]:
%load_ext rmagic

In [21]:
%%R -i P,pivot
plot(ecdf(P), col='green')
lines(ecdf(pivot), col='red')
ecdf(P)(0.05)


[1] 0.374

A tweak

One possible reason the intervals are large is that, when we stop, the observed value may fall very close to the edge of its truncation interval. We might try taking a few additional samples after stopping, if the design permits, to form a better estimate.

This is controlled by the extra_samples argument.


In [22]:
P = []
pivot = []
nsim = 500

naive = plt.figure(figsize=(10,8)).gca()
nominal = plt.figure(figsize=(10,8)).gca()

for i in range(nsim):
    Z = simulate(mu)

    results = sequential_test(Z, sigma=sigma, extra_samples=5)
    results.alpha = 0.1
    P.append(results.pvalue(truth=0))
    pivot.append(results.pvalue(truth=mu[1]-mu[0]))
    
    if i < 100:
        if (results.naive_interval[0] < mu[1] - mu[0] and
            results.naive_interval[1] > mu[1] - mu[0]):
            naive.plot([i,i], results.naive_interval, linewidth=3, alpha=0.5, color='green')
        else:
            naive.plot([i,i], results.naive_interval, linewidth=3, alpha=0.5, color='red')
            
        if (results.nominal_interval[0] < mu[1] - mu[0] and
            results.nominal_interval[1] > mu[1] - mu[0]):
            nominal.plot([i,i], results.nominal_interval, linewidth=3, alpha=0.5, color='green')
        else:
            nominal.plot([i,i], results.nominal_interval, linewidth=3, alpha=0.5, color='red')
        
naive.plot([0,nsim], [mu[1]-mu[0],mu[1]-mu[0]], 'k--')
naive.plot([0,nsim], [0,0], 'gray')
naive.set_xlim([0,100])
naive.set_title('Adjusted intervals')

nominal.plot([0,nsim], [mu[1]-mu[0],mu[1]-mu[0]], 'k--')
nominal.plot([0,nsim], [0,0], 'gray')
nominal.set_xlim([0,100])
nominal.set_title('Nominal intervals for same data')



In [ ]:
%%R -i P,pivot
plot(ecdf(P), col='green')
lines(ecdf(pivot), col='red')
ecdf(P)(0.05)

The effect of the boundary

What if we change to a wider boundary?


In [ ]:
sequential_test = SPRT(lambda n: 2 * np.sqrt(n), nmin=3, nmax=20)

mu, sigma = np.array([4,6]), 1. / np.sqrt(2)
simulate = lambda mu: np.random.standard_normal((sequential_test.nmax,2)) * sigma + mu[None,:]

In [54]:
P = []
pivot = []
nsim = 500

naive = plt.figure(figsize=(10,8)).gca()

for i in range(nsim):
    Z = simulate(mu)

    results = sequential_test(Z, sigma=sigma)
    results.alpha = 0.1
    P.append(results.pvalue(truth=0))
    pivot.append(results.pvalue(truth=mu[1]-mu[0]))
    
    if i < 100:
        if (results.naive_interval[0] < mu[1] - mu[0] and
            results.naive_interval[1] > mu[1] - mu[0]):
            naive.plot([i,i], results.naive_interval, linewidth=3, alpha=0.5, color='green')
        else:
            naive.plot([i,i], results.naive_interval, linewidth=3, alpha=0.5, color='red')
        
naive.plot([0,nsim], [mu[1]-mu[0],mu[1]-mu[0]], 'k--')
naive.plot([0,nsim], [0,0], 'gray')
naive.set_xlim([0,100])

In [55]:
%%R -i P,pivot
plot(ecdf(P), col='green')
lines(ecdf(pivot), col='red')
ecdf(P)(0.05)

In short, with a wider boundary, the intervals seem smaller but our conditional test seems to have less power. Let's also take a few extra samples after stopping.


In [56]:
P = []
pivot = []
nsim = 500

naive = plt.figure(figsize=(10,8)).gca()

for i in range(nsim):
    Z = simulate(mu)

    results = sequential_test(Z, sigma=sigma, extra_samples=5)
    results.alpha = 0.1
    P.append(results.pvalue(truth=0))
    pivot.append(results.pvalue(truth=mu[1]-mu[0]))
    
    if i < 100:
        if (results.naive_interval[0] < mu[1] - mu[0] and
            results.naive_interval[1] > mu[1] - mu[0]):
            naive.plot([i,i], results.naive_interval, linewidth=3, alpha=0.5, color='green')
        else:
            naive.plot([i,i], results.naive_interval, linewidth=3, alpha=0.5, color='red')
        
naive.plot([0,nsim], [mu[1]-mu[0],mu[1]-mu[0]], 'k--')
naive.plot([0,nsim], [0,0], 'gray')
naive.set_xlim([0,100])


Out[56]:
(0, 100)

In [57]:
%%R -i P,pivot
plot(ecdf(P), col='green')
lines(ecdf(pivot), col='red')
ecdf(P)(0.05)
