Calculate the mean of a distribution from its histogram

Patricia Schuster
2/21/18

I am doing this because in the fission project, I want to calculate the average energy within an energy distribution.


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [41]:
import sys
sys.path.append('../scripts/')
import bicorr as bicorr
import bicorr_plot as bicorr_plot
import bicorr_math as bicorr_math

In [3]:
%load_ext autoreload
%autoreload 2

Method for making a step plot


In [7]:
import inspect
print(inspect.getsource(bicorr_plot.step_plot))


def step_plot(edges,y, linewidth=.5, color='k', zorder = 1):
    """
    Plot a step plot. Meant for use with histogram data generated by:
        counts, bin_edges = np.histogram(x_samples,bin_edges)
        bicorr.step_plot(bin_edges,counts)
    
    Parameters
    ----------
    edges : ndarray
        Bin edges
    y : ndarray
        Bin counts
    linewidth : float, optional
        Width of step lines
    color : float, optional
        Color of lines
    zorder : int, optional
        Order of layer. Lower integer = farther back

    Returns
    -------
    n/a
    """
    # Horizontal lines
    for i in range(len(y)):
        plt.hlines(y[i],edges[i],edges[i+1],linewidth=linewidth,color=color,zorder=zorder)
    # Vertical lines
    for i in range(len(y)-1):
        plt.vlines(edges[i+1],y[i],y[i+1],linewidth=linewidth,color=color,zorder=zorder)

Generate some random data to work with

Using Normal-distributed data


In [8]:
x_samples = np.random.normal(0,5,150)

In [9]:
plt.figure(figsize=(4,2))
plt.hist(x_samples,np.arange(-10,10.1,1))
plt.xlabel('x')
plt.ylabel('Number of counts')
plt.title('Histogram of randomly generated x data')
plt.show()


Store histogram data


In [10]:
counts, bin_edges = np.histogram(x_samples,np.arange(-10,10.1,1))

In [11]:
print(bin_edges.shape)
print(counts.shape)


(21,)
(20,)

In [12]:
bin_edges


Out[12]:
array([-10.,  -9.,  -8.,  -7.,  -6.,  -5.,  -4.,  -3.,  -2.,  -1.,   0.,
         1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.,  10.])

In [13]:
counts


Out[13]:
array([ 0,  3,  1,  4,  6, 12,  6, 16, 11, 22, 16,  6,  8,  7, 13,  5,  4,
        2,  1,  2], dtype=int64)

Look at it using my step plot function.


In [16]:
plt.figure(figsize=(4,2))
bicorr_plot.step_plot(bin_edges,counts)
plt.xlabel('x')
plt.ylabel('Number of counts')
plt.ylim([-1,1.1*np.max(counts)])
plt.title('Histogram of randomly generated x data')
plt.show()


Calculate mean and associated error

Calculate the average x value for a distribution where we have $C$ counts for each $x$ value. Assume there is no error in $x_i$.

$x_{ave} = \frac{\Sigma(x_i \cdot C_i)}{\Sigma C_i}$

I am going to split this up into the numerator and the denomenator in order to propagate the error.

$num = \Sigma(x_i \cdot C_i)$

$denom = \Sigma C_i$

Numerator: $\Sigma(x_i\cdot C_i)$

Think of this as the sum of many products:

$num = x_1 c_1 + x_2 c_2 + ... = a + b + ...$

The error in each quantity is:

$\sigma_{x_i c_i} = x_i \sqrt{c_i}$

The error in the sum of these quantities is:

$\sigma_{num} = \sigma_{a+b+...} = \sqrt{\sigma_a^2+\sigma_b^2+...} = \sqrt{x_1^2 c_1 + x_2^2 c_2 + ...}$

Denomenator: $\Sigma(C_i)$

This one is more straight forward. The error in the sum of many count quantities is the square root of the total number of counts.

$denom = c_1 + c_2 + ...$

$\sigma_{denom} = \sqrt{\sigma_{c_1}^2+\sigma_{c_2}^2...} = \sqrt{c_1 + c_2 + ...} = \sqrt{c_{sum}} = \sqrt{denom}$

Put it together

$x_{ave} = \frac{num}{denom} = \frac{\Sigma(x_i \cdot C_i)}{\Sigma C_i}$

$\sigma_{x_{ave}} = x_{ave} \sqrt{(\frac{\sigma_{num}}{num})^2+(\frac{\sigma_{denom}}{denom})^2}$

Functionalize it


In [34]:
def calc_mean(bin_edges, counts, print_flag = False):
    """
    Calculate mean of a count rate distribution, counts vs. x. 
    Errors are calculated under the assumption that you are working
        with counting statistics. (C_err = sqrt(C) in each bin)
    
    Parameters
    ----------
    bin_edges : ndarray
        Bin edges for x
    counts : ndarray
        Bin counts
    print_flag : bool
        Option to print intermediate values
    
    Returns
    -------
    x_mean : float
    x_mean_err : float
    """
    bin_centers = bicorr.centers(bin_edges)
    
    num = np.sum(np.multiply(bin_centers,counts))  
    num_err = np.sqrt(np.sum(np.multiply(bin_centers**2,counts)))
    denom = np.sum(counts)    
    denom_err = np.sqrt(denom)    

    if print_flag:
        print('num: ',num)
        print('num_err: ',num_err)
        print('denom: ',denom)
        print('denom_err: ',denom_err)
    
    x_mean = num/denom
    x_mean_err = x_mean * np.sqrt((num_err/num)**2+(denom_err/denom)**2)
    
    if print_flag:
        print('x_mean: ',x_mean)
        print('x_mean_err:',x_mean_err)
    
    return x_mean, x_mean_err

In [35]:
x_mean, x_mean_err = calc_mean(bin_edges,counts,True)


num:  80.0
num_err:  50.4479930225
denom:  140
denom_err:  11.8321595662
x_mean:  0.571428571429
x_mean_err: 0.36356471266

Plot this and look at the position of the mean.


In [42]:
plt.figure(figsize=(4,2))
bicorr.step_plot(bin_edges,counts)
plt.axvline(x_mean,color='r',linewidth=1)
plt.axvline(x_mean+x_mean_err,linestyle='--',color='r',linewidth=.5)
plt.axvline(x_mean-x_mean_err,linestyle='--',color='r',linewidth=.5)
plt.xlabel('x')
plt.ylabel('Number of counts')
plt.ylim([-1,1.1*np.max(counts)])
plt.title('Histogram of randomly generated x data')
plt.show()


Look more closely to make sure it is evenly spaced.


In [44]:
plt.figure(figsize=(4,2))
bicorr.step_plot(bin_edges,counts)
plt.axvline(x_mean,color='r',linewidth=1)
plt.axvline(x_mean+x_mean_err,linestyle='--',color='r',linewidth=.5)
plt.axvline(x_mean-x_mean_err,linestyle='--',color='r',linewidth=.5)
plt.xlabel('x')
plt.xlim([-1,2])
plt.ylabel('Number of counts')
plt.ylim([-1,1.1*np.max(counts)])
plt.title('Histogram of randomly generated x data')
plt.show()


Try a few examples to make sure it's working

Only 10 samples... Expect mean to be incorrect and large error bars.


In [62]:
real_mean = 25
real_std = 5
num_samples = 10
x_samples = np.random.normal(real_mean,real_std,num_samples)
counts, bin_edges = np.histogram(x_samples,np.arange(15,35.1,1))
x_mean, x_mean_err = calc_mean(bin_edges,counts,True)

plt.figure(figsize=(4,2))
plt.axvline(real_mean,color='b',linewidth=2)
plt.axvline(x_mean,color='r',linewidth=1)
plt.axvline(x_mean+x_mean_err,linestyle='--',color='r',linewidth=.5)
plt.axvline(x_mean-x_mean_err,linestyle='--',color='r',linewidth=.5)
bicorr.step_plot(bin_edges,counts)
plt.legend(['Real mean','Calculated mean'], loc='lower right')
plt.xlabel('x')
plt.ylabel('Number of counts')
plt.ylim([-1,1.1*np.max(counts)])
plt.title('Number of samples = {}'.format(num_samples))
plt.show()


num:  235.0
num_err:  75.5148991921
denom:  10
denom_err:  3.16227766017
x_mean:  23.5
x_mean_err: 10.5948100502

Increase the number of samples. Expect mean to be closer, and errors to be smaller.


In [65]:
real_mean = 25
real_std = 5
num_samples = 30
x_samples = np.random.normal(real_mean,real_std,num_samples)
counts, bin_edges = np.histogram(x_samples,np.arange(15,35.1,1))
x_mean, x_mean_err = calc_mean(bin_edges,counts,True)

plt.figure(figsize=(4,2))
plt.axvline(real_mean,color='b',linewidth=2)
plt.axvline(x_mean,color='r',linewidth=1)
plt.axvline(x_mean+x_mean_err,linestyle='--',color='r',linewidth=.5)
plt.axvline(x_mean-x_mean_err,linestyle='--',color='r',linewidth=.5)
bicorr.step_plot(bin_edges,counts)
plt.legend(['Real mean','Calculated mean'], loc='lower right')
plt.xlabel('x')
plt.ylabel('Number of counts')
plt.ylim([-1,1.1*np.max(counts)])
plt.title('Number of samples = {}'.format(num_samples))
plt.show()


num:  802.0
num_err:  148.4570645
denom:  30
denom_err:  5.47722557505
x_mean:  26.7333333333
x_mean_err: 6.95059016945

In [66]:
real_mean = 25
real_std = 5
num_samples = 100
x_samples = np.random.normal(real_mean,real_std,num_samples)
counts, bin_edges = np.histogram(x_samples,np.arange(15,35.1,1))
x_mean, x_mean_err = calc_mean(bin_edges,counts,True)

plt.figure(figsize=(4,2))
plt.axvline(real_mean,color='b',linewidth=2)
plt.axvline(x_mean,color='r',linewidth=1)
plt.axvline(x_mean+x_mean_err,linestyle='--',color='r',linewidth=.5)
plt.axvline(x_mean-x_mean_err,linestyle='--',color='r',linewidth=.5)
bicorr.step_plot(bin_edges,counts)
plt.legend(['Real mean','Calculated mean'], loc='lower right')
plt.xlabel('x')
plt.ylabel('Number of counts')
plt.ylim([-1,1.1*np.max(counts)])
plt.title('Number of samples = {}'.format(num_samples))
plt.show()


num:  2424.0
num_err:  251.443830706
denom:  96
denom_err:  9.79795897113
x_mean:  25.25
x_mean_err: 3.67444133131

In [67]:
real_mean = 25
real_std = 5
num_samples = 1000
x_samples = np.random.normal(real_mean,real_std,num_samples)
counts, bin_edges = np.histogram(x_samples,np.arange(15,35.1,1))
x_mean, x_mean_err = calc_mean(bin_edges,counts,True)

plt.figure(figsize=(4,2))
plt.axvline(real_mean,color='b',linewidth=2)
plt.axvline(x_mean,color='r',linewidth=1)
plt.axvline(x_mean+x_mean_err,linestyle='--',color='r',linewidth=.5)
plt.axvline(x_mean-x_mean_err,linestyle='--',color='r',linewidth=.5)
bicorr.step_plot(bin_edges,counts)
plt.legend(['Real mean','Calculated mean'], loc='lower right')
plt.xlabel('x')
plt.ylabel('Number of counts')
plt.ylim([-1,1.1*np.max(counts)])
plt.title('Number of samples = {}'.format(num_samples))
plt.show()


num:  23564.0
num_err:  774.904187626
denom:  954
denom_err:  30.886890423
x_mean:  24.7002096436
x_mean_err: 1.13986770074

In [68]:
real_mean = 25
real_std = 5
num_samples = 10000
x_samples = np.random.normal(real_mean,real_std,num_samples)
counts, bin_edges = np.histogram(x_samples,np.arange(15,35.1,1))
x_mean, x_mean_err = calc_mean(bin_edges,counts,True)

plt.figure(figsize=(4,2))
plt.axvline(real_mean,color='b',linewidth=2)
plt.axvline(x_mean,color='r',linewidth=1)
plt.axvline(x_mean+x_mean_err,linestyle='--',color='r',linewidth=.5)
plt.axvline(x_mean-x_mean_err,linestyle='--',color='r',linewidth=.5)
bicorr.step_plot(bin_edges,counts)
plt.legend(['Real mean','Calculated mean'], loc='lower right')
plt.xlabel('x')
plt.ylabel('Number of counts')
plt.ylim([-1,1.1*np.max(counts)])
plt.title('Number of samples = {}'.format(num_samples))
plt.show()


num:  238989.5
num_err:  2483.28789108
denom:  9547
denom_err:  97.7087508875
x_mean:  25.0329422855
x_mean_err: 0.365097808241

I think this looks reasonable. Call it a day.


In [ ]:

In Two Dimensions

If I want to calculate the average sum of $x$ and $y$ across a 2d-histogram, how can I do that?

The equation looks like this:

Calculate the average x value for a distribution where we have $C$ counts for each $x$ value. Assume there is no error in $x_i$ or $y_i$.

$(x+y)_{ave} = \frac{\Sigma((x_i+y_i) \cdot C_i)}{\Sigma C_i}$

I am going to split this up into the numerator and the denomenator in order to propagate the error.

$num = \Sigma((x_i+y_i) \cdot C_i)$

$denom = \Sigma C_i$

Numerator: $\Sigma((x_i+y_i)\cdot C_i)$

Think of this as the sum of many products:

$num = x_1 y_1 c_{11} + x_2 y_1 c_{21} + ... = a + b + ...$

The error in each quantity is:

$\sigma_{(x_i+y_i) c_i} = (x_i+y_i) \sqrt{c_i}$

The error in the sum of these quantities is:

$\sigma_{num} = \sigma_{a+b+...} = \sqrt{\sigma_a^2+\sigma_b^2+...} = \sqrt{(x_1+y_1)^2 c_{11} + (x_2+y_1)^2 c_{21} + ...}$

Denomenator: $\Sigma(C_i)$

This one is more straight forward. The error in the sum of many count quantities is the square root of the total number of counts.

$denom = c_1 + c_2 + ...$

$\sigma_{denom} = \sqrt{\sigma_{c_1}^2+\sigma_{c_2}^2...} = \sqrt{c_1 + c_2 + ...} = \sqrt{c_{sum}} = \sqrt{denom}$

Put it together

$(x+y)_{ave} = \frac{num}{denom} = \frac{\Sigma((x_i+y_i) \cdot C_i)}{\Sigma C_i}$

$\sigma_{x_{ave}} = x_{ave} \sqrt{(\frac{\sigma_{num}}{num})^2+(\frac{\sigma_{denom}}{denom})^2}$


In [26]:
x_samples = np.random.normal(150,20,150);
y_samples = np.random.normal(75,10,150);
sum_samples = x_samples + y_samples

In [27]:
sns.jointplot(x_samples,y_samples)
plt.show()


C:\Users\pfschus\AppData\Local\Continuum\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "
C:\Users\pfschus\AppData\Local\Continuum\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "

In [90]:
np.mean(x_samples+y_samples)


Out[90]:
222.08635797866148

In [23]:
bicorr_plot.histogram_metrics(sum_samples,'sum','counts')


C:\Users\pfschus\AppData\Local\Continuum\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py:6462: UserWarning: The 'normed' kwarg is deprecated, and has been replaced by the 'density' kwarg.
  warnings.warn("The 'normed' kwarg is deprecated, and has been "

Create a 2d-histogram.


In [47]:
x_edges = np.arange(0,201,10)
y_edges = np.arange(0,111,10)

H, x_edges, y_edges = np.histogram2d(x_samples,y_samples, bins=[x_edges,y_edges])
H = H.T

In [48]:
fig = plt.figure(figsize=(4,3))
ax = plt.gca()
mesh = ax.pcolormesh(x_edges, y_edges, H)
plt.xlabel('x'); plt.ylabel('y')
plt.show()



In [49]:
H.shape


Out[49]:
(11, 20)

In [36]:
x_edges.shape


Out[36]:
(21,)

In [37]:
y_edges.shape


Out[37]:
(12,)

In [50]:
X, Y = np.meshgrid(bicorr_math.calc_centers(x_edges),bicorr_math.calc_centers(y_edges))

In [53]:
X.shape


Out[53]:
(11, 20)

In [54]:
Y.shape


Out[54]:
(11, 20)

In [52]:
H.shape


Out[52]:
(11, 20)

In [57]:
X = X.reshape(X.shape[0]*X.shape[1])
Y = Y.reshape(Y.shape[0]*Y.shape[1])
H = H.reshape(H.shape[0]*H.shape[1])

In [82]:
plt.figure(figsize=(4,3))
plt.scatter(X, Y, c=H, marker='s', s=100)
plt.xlabel('X')
plt.ylabel('Y')
plt.show()



In [87]:
plt.figure(figsize=(4,3))
plt.scatter(X+Y, H, marker='s', s=100)
plt.xlabel('X+Y')
plt.ylabel('Counts')
plt.show()


Now try calculating the average.


In [92]:
bicorr_math.calc_histogram_mean(X+Y,H,True,True)


num:  33330.0
num_err:  2735.5255436570137
denom:  150.0
denom_err:  12.24744871391589
x_mean:  222.2
x_mean_err: 25.724200451887494
Out[92]:
(222.2, 25.724200451887494)

This can be used for bhp and bhp_e.


In [ ]: