The usual estimate of the standard error of a set of data assumes that the data points are completely independent. If this is not true, then naively calculating the standard error of the entire data set can give a substantial underestimate of the true error. This arises in, for example, Monte Carlo simulations, where the state at one step depends upon the state at the previous step. Data calculated from the stochastic state hence contain serial correlations.
A simple way to remove these correlations is to repeatedly average neighbouring pairs of data points and calculate the standard error on the new data set. As no data is discarded in this process (assuming the data set contains $2^n$ values), the error estimate should remain approximately constant if the data is truly independent.
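The pair-averaging step can be sketched in plain numpy. Note `reblock_once` is an illustrative helper written for this sketch, not part of pyblock's API:

```python
import numpy

def reblock_once(data):
    """One reblock step: average neighbouring pairs of data points.

    A minimal sketch of the idea; pyblock applies this repeatedly and
    records the mean and standard error at every iteration.
    """
    data = numpy.asarray(data)
    # Drop the final point if the length is odd so every point has a partner.
    data = data[:2 * (len(data) // 2)]
    return 0.5 * (data[::2] + data[1::2])

blocked = reblock_once(numpy.arange(8.0))
# [0, 1, ..., 7] -> [0.5, 2.5, 4.5, 6.5]: half as many points, same mean
```

Each step halves the number of points but leaves the mean unchanged; only the standard error estimate changes as the correlations are averaged out.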
pyblock is a Python module for performing this reblocking analysis.
Normally correlated data comes from an experiment or simulation, but here we'll use randomly generated, serially correlated data to show how pyblock works.
In [1]:
import numpy
def corr_data(N, L):
    '''Generate random correlated data containing 2^N data points.

    Random data is convolved with a flat kernel of length 2^L (scaled by
    1/10) to give a serially correlated signal.'''
    return numpy.convolve(numpy.random.randn(2**N), numpy.ones(2**L)/10, 'same')
rand_data = corr_data(16, 6)
In [2]:
plot(rand_data);
If we zoom in, we can clearly see that neighbouring data points do not immediately appear to be independent:
In [3]:
plot(rand_data[:1000]);
plot(rand_data[40000:41000]);
pyblock can perform a reblocking analysis to get a better estimate of the standard error of the data set:
In [4]:
import pyblock
reblock_data = pyblock.blocking.reblock(rand_data)
for reblock_iter in reblock_data:
    print(reblock_iter)
The standard error of the original data set is clearly around 8 times too small. Note that the standard error of the last few reblock iterations fluctuates substantially; this is simply due to the small number of data points remaining at those iterations.
In addition to the mean and standard error at each iteration, the covariance and an estimate of the error in the standard error are also calculated. Each tuple also contains the number of data points used at the given reblock iteration.
pyblock.blocking can also suggest the reblock iteration at which the standard error has converged (i.e. the iteration at which the serial correlation has been removed and every data point is truly independent).
In [5]:
opt = pyblock.blocking.find_optimal_block(len(rand_data), reblock_data)
print(opt)
print(reblock_data[opt[0]])
Whilst the above uses just a single data set, pyblock is designed to work on multiple data sets at once (e.g. multiple outputs from the same simulation). In that case, different optimal reblock iterations might be found for each data set. The only assumption is that the original data sets are of the same length.
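The equal-length requirement can be seen with a small sketch that reblocks two data sets at once by averaging pairs along the last axis (`reblock_once` here is an illustrative helper written for this sketch, not pyblock's implementation):

```python
import numpy

def reblock_once(data):
    # Average neighbouring pairs along the last axis, so every row
    # (i.e. every data set) is reblocked in lockstep.
    data = numpy.asarray(data)
    n = data.shape[-1] - data.shape[-1] % 2
    data = data[..., :n]
    return 0.5 * (data[..., ::2] + data[..., 1::2])

# Two data sets of the same length, stacked one per row.
two_sets = numpy.vstack([numpy.arange(8.0), 2 * numpy.arange(8.0)])
blocked = reblock_once(two_sets)
# blocked.shape == (2, 4): both rows are halved together
```

Because every data set is blocked in the same way at each iteration, the covariance between data sets can also be tracked, which pyblock uses when combining data sets.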
The core pyblock functionality is built upon numpy. However, it is more convenient to use the pandas-based wrapper around pyblock.blocking, not least because it makes working with multiple data sets more pleasant.
In [6]:
import pandas as pd
rand_data = pd.Series(rand_data)
In [7]:
rand_data.head()
Out[7]:
In [8]:
(data_length, reblock_data, covariance) = pyblock.pd_utils.reblock(rand_data)
In [9]:
# number of data points at each reblock iteration
data_length
Out[9]:
In [10]:
# mean, standard error and estimate of the error in the standard error at each
# reblock iteration.
# Note the suggested reblock iteration is already indicated.
# pyblock names the data series 'data' if no name is provided in the
# pandas.Series/pandas.DataFrame.
reblock_data
Out[10]:
In [11]:
# Covariance matrix is not so relevant for a single data set.
covariance
Out[11]:
We can also plot the convergence of the standard error estimate and obtain a summary of the suggested data to quote:
In [12]:
pyblock.plot.plot_reblocking(reblock_data);
The standard error clearly converges to ~0.022. The suggested reblock iteration (which uses a slightly conservative formula) is indicated by the arrow on the plot.
In [13]:
pyblock.pd_utils.reblock_summary(reblock_data)
Out[13]:
pyblock.error also contains simple error propagation functions for combining multiple noisy data sets and can handle multiple data sets at once (contained either within a numpy array, using pyblock.blocking, or within a pandas.DataFrame).
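As an illustration of the kind of propagation involved, here is the standard first-order formula for the error in a ratio $f = A/B$ of two uncorrelated noisy estimates. This is a hand-rolled sketch, not pyblock.error's API; pyblock's own functions also account for the covariance between the data sets:

```python
import math

def ratio_error(mean_a, err_a, mean_b, err_b):
    # First-order error propagation for f = A/B, neglecting the
    # covariance term (valid only if A and B are uncorrelated).
    f = mean_a / mean_b
    err_f = abs(f) * math.sqrt((err_a / mean_a)**2 + (err_b / mean_b)**2)
    return f, err_f

f, err = ratio_error(2.0, 0.02, 4.0, 0.04)
# 1% relative error in each input -> sqrt(2)% relative error in the ratio
```

The relative errors add in quadrature, so the ratio of two estimates each known to 1% is known to roughly 1.4%.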