The usual estimate of the standard error of a set of data assumes that the data points are completely independent. If this is not true, then naively calculating the standard error of the entire data set can give a substantial underestimate of the true error. This arises in, for example, Monte Carlo simulations, where the state at one step depends upon the state at the previous step. Data calculated from the stochastic state hence contain serial correlations.
A simple way to remove these correlations is to repeatedly average neighbouring pairs of data points and calculate the standard error on the new data set. As no data is discarded in this process (assuming the data set contains $2^n$ values), the error estimate should remain approximately constant if the data is truly independent.
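The pair-averaging step can be sketched in plain numpy. Note `reblock_once` is an illustrative helper written for this sketch, not part of pyblock's API:

```python
import numpy

def reblock_once(data):
    """One reblock step: average neighbouring pairs of data points.

    A minimal sketch of the idea; pyblock applies this repeatedly and
    records the mean and standard error at every iteration.
    """
    data = numpy.asarray(data)
    # Drop the final point if the length is odd so every point has a partner.
    data = data[:2 * (len(data) // 2)]
    return 0.5 * (data[::2] + data[1::2])

blocked = reblock_once(numpy.arange(8.0))
# [0, 1, ..., 7] -> [0.5, 2.5, 4.5, 6.5]: half as many points, same mean
```

Each step halves the number of points but leaves the mean unchanged; only the standard error estimate changes as the correlations are averaged out.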
pyblock is a Python module for performing this reblocking analysis.
Normally correlated data comes from an experiment or simulation, but here we'll use randomly generated, serially correlated data to show how pyblock works.
In [1]:
import numpy
def corr_data(N, L):
    '''Generate random correlated data containing 2^N data points.

    Random data is convolved with a flat kernel of length 2^L (scaled by
    1/10) to give a serially correlated signal.'''
    return numpy.convolve(numpy.random.randn(2**N), numpy.ones(2**L)/10, 'same')
rand_data = corr_data(16, 6)
In [2]:
plot(rand_data);
If we zoom in, we can clearly see that neighbouring data points do not immediately appear to be independent:
In [3]:
plot(rand_data[:1000]);
plot(rand_data[40000:41000]);
pyblock can perform a reblocking analysis to get a better estimate of the standard error of the data set:
In [4]:
import pyblock
reblock_data = pyblock.blocking.reblock(rand_data)
for reblock_iter in reblock_data:
    print(reblock_iter)
The standard error of the original data set is clearly around 8 times too small. Note that the standard error of the last few reblock iterations fluctuates substantially; this is simply due to the small number of data points remaining at those iterations.
In addition to the mean and standard error at each iteration, the covariance and an estimate of the error in the standard error are also calculated. Each tuple also contains the number of data points used at the given reblock iteration.
pyblock.blocking can also suggest the reblock iteration at which the standard error has converged (i.e. the iteration at which the serial correlation has been removed and every data point is truly independent).
In [5]:
opt = pyblock.blocking.find_optimal_block(len(rand_data), reblock_data)
print(opt)
print(reblock_data[opt[0]])
Whilst the above uses just a single data set, pyblock is designed to work on multiple data sets at once (e.g. multiple outputs from the same simulation). In that case, different optimal reblock iterations might be found for each data set. The only assumption is that the original data sets are of the same length.
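The equal-length requirement can be seen with a small sketch that reblocks two data sets at once by averaging pairs along the last axis (`reblock_once` here is an illustrative helper written for this sketch, not pyblock's implementation):

```python
import numpy

def reblock_once(data):
    # Average neighbouring pairs along the last axis, so every row
    # (i.e. every data set) is reblocked in lockstep.
    data = numpy.asarray(data)
    n = data.shape[-1] - data.shape[-1] % 2
    data = data[..., :n]
    return 0.5 * (data[..., ::2] + data[..., 1::2])

# Two data sets of the same length, stacked one per row.
two_sets = numpy.vstack([numpy.arange(8.0), 2 * numpy.arange(8.0)])
blocked = reblock_once(two_sets)
# blocked.shape == (2, 4): both rows are halved together
```

Because every data set is blocked in the same way at each iteration, the covariance between data sets can also be tracked, which pyblock uses when combining data sets.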
The core pyblock functionality is built upon numpy. However, it is more convenient to use the pandas-based wrapper around pyblock.blocking, not least because it makes working with multiple data sets more pleasant.
In [6]:
import pandas as pd
rand_data = pd.Series(rand_data)
In [7]:
rand_data.head()
Out[7]:
In [8]:
(data_length, reblock_data, covariance) = pyblock.pd_utils.reblock(rand_data)
In [9]:
# number of data points at each reblock iteration
data_length
Out[9]:
In [10]:
# mean, standard error and estimate of the error in the standard error at each
# reblock iteration.
# Note the suggested reblock iteration is already indicated.
# pyblock names the data series 'data' if no name is provided in the
# pandas.Series/pandas.DataFrame.
reblock_data
Out[10]:
In [11]:
# Covariance matrix is not so relevant for a single data set.
covariance
Out[11]:
We can also plot the convergence of the standard error estimate and obtain a summary of the suggested data to quote:
In [12]:
pyblock.plot.plot_reblocking(reblock_data);
The standard error clearly converges to ~0.022. The suggested reblock iteration (which uses a slightly conservative formula) is indicated by the arrow on the plot.
In [13]:
pyblock.pd_utils.reblock_summary(reblock_data)
Out[13]:
pyblock.error also contains simple error propagation functions for combining multiple noisy data sets and can handle multiple data sets at once (contained either within a numpy array, using pyblock.blocking, or within a pandas.DataFrame).
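As an illustration of the kind of propagation involved, here is the standard first-order formula for the error in a ratio $f = A/B$ of two uncorrelated noisy estimates. This is a hand-rolled sketch, not pyblock.error's API; pyblock's own functions also account for the covariance between the data sets:

```python
import math

def ratio_error(mean_a, err_a, mean_b, err_b):
    # First-order error propagation for f = A/B, neglecting the
    # covariance term (valid only if A and B are uncorrelated).
    f = mean_a / mean_b
    err_f = abs(f) * math.sqrt((err_a / mean_a)**2 + (err_b / mean_b)**2)
    return f, err_f

f, err = ratio_error(2.0, 0.02, 4.0, 0.04)
# 1% relative error in each input -> sqrt(2)% relative error in the ratio
```

The relative errors add in quadrature, so the ratio of two estimates each known to 1% is known to roughly 1.4%.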