Demonstrate Aggregation of Descriptive Statistics

Here we create an array of random values and for each row of the array, we create a distinct pebaystats.dstats object to accumulate the descriptive statistics for the values in that row.

Once we have the data, we can use the numpy package to generate the expected values for mean and variance of the data for each row. We can also generate the expected mean and variance of the total data set.

We then accumulate each column value of each row into that row's dstats object. Once the data are accumulated into these partial results, we can compare them with the expected row values.

We can then aggregate the row results into a final mean and variance for the entire data set and compare them to the expected values.
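This kind of aggregation works because the count, mean, and second central moment (M2) of two partitions can be combined exactly. The sketch below illustrates the standard pairwise-combination rule; the `combine` function is illustrative only, not the pebaystats API:

```python
import numpy as np

def combine(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    """Merge (count, mean, M2) statistics of two data partitions exactly."""
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta**2 * n_a * n_b / n
    return n, mean, m2

rng = np.random.default_rng(0)
data = rng.random(200)
a, b = data[:120], data[120:]

# M2 = N * population variance for each partition
n, mean, m2 = combine(len(a), a.mean(), a.var() * len(a),
                      len(b), b.mean(), b.var() * len(b))

assert np.isclose(mean, data.mean())
assert np.isclose(m2 / n, data.var())
```

Note that M2/N is the population variance (numpy's default, `ddof = 0`), which matches the variance values compared below.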

We will need to import the numpy package as well as the dstats class from the pebaystats package. We also import the nose.tools module to allow direct comparison of expected and generated values.


In [1]:
import numpy      as np
import nose.tools as nt
from pebaystats import dstats

Now we set the parameters, including the random seed for repeatability.


In [2]:
np.random.seed(0)

### Random data size
rows =  10
cols = 100

### Each accumulator's size
depth = 2
width = 1

The test array can now be created and its shape checked.

The individual row statistics and overall mean and variance can be generated as the expected values at this time as well.


In [3]:
### Test data -- 10 rows of 100 columns each
test_arr = np.random.random((rows,cols))

print('Test data has shape: %d, %d' % test_arr.shape)

### Expected intermediate output
mid_mean = np.mean(test_arr,axis = 1)
mid_var  = np.var(test_arr, axis = 1)

### Expected final output
final_mean = np.mean(test_arr)
final_var  = np.var(test_arr)


Test data has shape: 10, 100

Now we can create a dstats object for each row and accumulate the row data into its respective accumulator. We can print the generated and expected intermediate (row) values to check that all is working correctly.


In [4]:
### Create an object for each row and accumulate the data in that row
statsobjects = [ dstats(depth,width) for i in range(0,rows) ]
discard = [ statsobjects[i].add(test_arr[i,j])
            for j in range(0,cols)
            for i in range(0,rows)]

print('\nIntermediate Results\n')
for i in range(0,rows):
    values = statsobjects[i].statistics()
    print('Result %d mean: %11g, variance: %11g (M2/N: %11g/%d)' %(i,values[0],values[1],statsobjects[i].moments[1],statsobjects[i].n))
    print('Expected mean: %11g, variance: %11g' %(mid_mean[i],mid_var[i]))
    nt.assert_almost_equal(values[0], mid_mean[i], places = 14)
    nt.assert_almost_equal(values[1],  mid_var[i], places = 14)


Intermediate Results

Result 0 mean:    0.472794, variance:   0.0831178 (M2/N:     8.31178/100)
Expected mean:    0.472794, variance:   0.0831178
Result 1 mean:    0.528082, variance:   0.0765679 (M2/N:     7.65679/100)
Expected mean:    0.528082, variance:   0.0765679
Result 2 mean:    0.509632, variance:   0.0910113 (M2/N:     9.10113/100)
Expected mean:    0.509632, variance:   0.0910113
Result 3 mean:    0.472757, variance:   0.0810207 (M2/N:     8.10207/100)
Expected mean:    0.472757, variance:   0.0810207
Result 4 mean:    0.499723, variance:   0.0907343 (M2/N:     9.07343/100)
Expected mean:    0.499723, variance:   0.0907343
Result 5 mean:    0.506229, variance:   0.0906766 (M2/N:     9.06766/100)
Expected mean:    0.506229, variance:   0.0906766
Result 6 mean:     0.48552, variance:   0.0778794 (M2/N:     7.78794/100)
Expected mean:     0.48552, variance:   0.0778794
Result 7 mean:    0.468661, variance:   0.0894583 (M2/N:     8.94583/100)
Expected mean:    0.468661, variance:   0.0894583
Result 8 mean:    0.521702, variance:   0.0735833 (M2/N:     7.35833/100)
Expected mean:    0.521702, variance:   0.0735833
Result 9 mean:    0.494116, variance:   0.0864937 (M2/N:     8.64937/100)
Expected mean:    0.494116, variance:   0.0864937
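The per-sample `add` calls above maintain a running mean and M2 for each row. A minimal sketch of how such a streaming update can work is Welford's online algorithm (shown here for illustration; pebaystats' internals may differ):

```python
import numpy as np

def welford(xs):
    """Streaming mean and population variance via Welford's online update."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)   # second factor uses the updated mean
    return mean, m2 / n            # population variance, i.e. M2/N as printed above

rng = np.random.default_rng(0)
row = rng.random(100)
mean, var = welford(row)
assert np.isclose(mean, row.mean())
assert np.isclose(var, row.var())
```

The single-pass update avoids storing the data and is numerically far more stable than the naive sum-of-squares approach.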

Now we can aggregate each of the intermediate row results into a final mean and variance for the entire data set, then compare with the numpy-generated expected values.


In [5]:
### Aggregate result into the index 0 accumulator
discard = [ statsobjects[0].aggregate(statsobjects[i]) for i in range(1,rows) ]

values = statsobjects[0].statistics()
print('\nAggregated Results\n')
print('Result   mean: %11g, variance: %11g' %(values[0],values[1]))
print('Expected mean: %11g, variance: %11g' %(final_mean,final_var))
nt.assert_almost_equal(values[0], final_mean, places = 14)
nt.assert_almost_equal(values[1],  final_var, places = 14)


Aggregated Results

Result   mean:    0.495922, variance:   0.0844477
Expected mean:    0.495922, variance:   0.0844477
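The same end-to-end check can be reproduced in pure NumPy by folding the ten per-row (count, mean, M2) triples together with the pairwise-combination rule. This is a sketch assuming that rule; the fold order does not matter because the combination is exact:

```python
from functools import reduce
import numpy as np

def combine(s, t):
    """Fold step: merge (n, mean, M2) triples of two partitions."""
    (n_a, mean_a, m2_a), (n_b, mean_b, m2_b) = s, t
    n = n_a + n_b
    delta = mean_b - mean_a
    return (n,
            mean_a + delta * n_b / n,
            m2_a + m2_b + delta**2 * n_a * n_b / n)

np.random.seed(0)
arr = np.random.random((10, 100))

# One (count, mean, M2) triple per row, then fold them all together
row_stats = [(row.size, row.mean(), row.var() * row.size) for row in arr]
n, mean, m2 = reduce(combine, row_stats)

# Should match the aggregated results above (mean ~0.495922, variance ~0.0844477)
assert np.isclose(mean, arr.mean())
assert np.isclose(m2 / n, arr.var())
```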