Here we create an array of random values and, for each row of the array, a distinct pebaystats.dstats object to accumulate the descriptive statistics for the values in that row.
Once we have the data, we can use the numpy package to generate the expected mean and variance for each row, as well as the expected mean and variance of the data set as a whole.
We then accumulate each column value of each row into its respective dstats object, and once the data has been accumulated into these partial results, we compare them with the expected row values.
Finally, we aggregate the row results into a single mean and variance for the entire data set and compare those to the expected values.
We will need to import the numpy package as well as the dstats class from the pebaystats package. We also import the nose.tools module to allow direct comparison of expected and generated values.
In [1]:
import numpy as np
import nose.tools as nt
from pebaystats import dstats
Now we set the parameters, including the random seed for repeatability.
In [2]:
np.random.seed(0)
### Random data size
rows = 10
cols = 100
### Each accumulator's size: 2 moments deep (mean and variance), 1 value wide
depth = 2
width = 1
The test array can now be created and its shape checked.
The individual row statistics and the overall mean and variance are also generated at this point to serve as the expected values.
In [3]:
### Test data -- 10 rows of 100 columns each
test_arr = np.random.random((rows,cols))
print('Test data has shape: %d, %d' % test_arr.shape)
### Expected intermediate output
mid_mean = np.mean(test_arr,axis = 1)
mid_var = np.var(test_arr, axis = 1)
### Expected final output
final_mean = np.mean(test_arr)
final_var = np.var(test_arr)
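As a side note, the overall variance is not simply the average of the per-row variances: by the law of total variance it also includes the spread of the row means. Because every row here holds the same number of columns, the decomposition is exact and can be checked directly with numpy:

### Law of total variance: within-row variance plus between-row variance.
### Exact here because all rows have the same number of columns.
print(np.mean(mid_var) + np.var(mid_mean))
print(final_var)

This is why the aggregation step below must combine the partial results carefully rather than merely averaging them.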
Now we can create a dstats object for each row and accumulate the row data into its respective accumulator. We print the generated and expected intermediate (row) values to check that all is working correctly.
In [4]:
### Create an object for each row and accumulate the data in that row
statsobjects = [ dstats(depth,width) for i in range(0,rows) ]
discard = [ statsobjects[i].add(test_arr[i,j])
            for j in range(0,cols)
            for i in range(0,rows) ]

print('\nIntermediate Results\n')
for i in range(0,rows):
    values = statsobjects[i].statistics()
    print('Result %d mean: %11g, variance: %11g (M2/N: %11g/%d)'
          % (i, values[0], values[1], statsobjects[i].moments[1], statsobjects[i].n))
    print('Expected mean: %11g, variance: %11g' % (mid_mean[i], mid_var[i]))
    nt.assert_almost_equal(values[0], mid_mean[i], places = 14)
    nt.assert_almost_equal(values[1], mid_var[i], places = 14)
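For intuition, here is a minimal sketch of the Welford-style online update that an accumulator like dstats presumably performs inside add(); the actual pebaystats internals may differ. Run against row 0, it should reproduce mid_mean[0] and mid_var[0]:

### Welford-style online update (a sketch, not necessarily the pebaystats internals)
n, mean, m2 = 0, 0.0, 0.0
for x in test_arr[0]:
    n += 1
    delta = x - mean
    mean += delta / n              ### update the running mean
    m2 += delta * (x - mean)       ### update the sum of squared deviations (M2)
print(mean, m2 / n)                ### compare to mid_mean[0] and mid_var[0]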
Now we can aggregate each of the intermediate row results into a final mean and variance for the entire data set, and then compare them with the numpy-generated expected values.
In [5]:
### Aggregate result into the index 0 accumulator
discard = [ statsobjects[0].aggregate(statsobjects[i]) for i in range(1,rows) ]
values = statsobjects[0].statistics()
print('\nAggregated Results\n')
print('Result mean: %11g, variance: %11g' %(values[0],values[1]))
print('Expected mean: %11g, variance: %11g' %(final_mean,final_var))
nt.assert_almost_equal(values[0], final_mean, places = 14)
nt.assert_almost_equal(values[1], final_var, places = 14)
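To see how partial results can be merged exactly, here is a minimal sketch of the standard pairwise combination rule for two partial (count, mean, M2) results, due to Chan et al. and generalized by Pébay. The function name and signature are illustrative, not the pebaystats API, though aggregate() presumably applies an update of this form:

### Pairwise combination of two partial (count, mean, M2) results -- a sketch,
### assuming the standard update rule; the variance of a result is M2 / n.
def combine(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2

Folding the ten per-row (count, mean, M2) triples through such a combination reproduces final_mean and final_var, which is exactly what the aggregation above verified for dstats.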