In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
Let's generate a data array
In [2]:
data = np.random.rand(500)*1200;
Make a histogram of the data
In [3]:
fig = plt.figure();
plt.hist(data)
plt.show()
In [4]:
n, my_bins, my_patches = plt.hist(data, bins=10);
In this case plt.hist returns three objects: n holds the values of the histogram bins, my_bins holds the bin edges, and my_patches holds the rectangle patches drawn for each bin. The values in n are raw counts unless the weights and/or normed options are used.
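As a quick check of this (a small verification added here, not part of the original analysis): the raw counts in n sum to the number of entries, and there is always one more bin edge than there are bins.
In [ ]:
## sanity check on the objects returned by plt.hist (uses the cells above)
print(n.sum() == len(data))        # raw counts add up to the number of data points
print(len(my_bins) == len(n) + 1)  # one more edge than bins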
In [5]:
n
len(n)
Out[5]:
In [6]:
my_bins
Out[6]:
In [7]:
my_patches
Out[7]:
In [8]:
plt.hist(data, bins=100);
In [9]:
plt.hist(data, bins=100, range=(0,1000));
In [10]:
plt.hist(data, normed=True);
This ensures that the integral of the distribution is equal to unity. If stacked is also True, the sum of the histograms is normalized to 1.
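As a small check (added here, using the same data as above): with normed=True the bin heights are densities, so multiplying each height by its bin width and summing should give 1.
In [ ]:
## np.histogram(..., density=True) corresponds to plt.hist(..., normed=True)
counts, edges = np.histogram(data, bins=10, density=True)
print(np.sum(counts * np.diff(edges)))   # ~1.0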
However, sometimes it is useful for the heights of the bins themselves to sum to unity. For this we generate weights for the histogram: each data point gets the weight 1/(number of data points).
N.B.: Using this technique you MUST NOT USE the normed=True option.
This way adding up the bars will give you 1.
In [11]:
weights = np.ones_like(data)/len(data);
In [12]:
plt.hist(data, weights=weights); ## We have NOT used the normed = True option
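To confirm that the bar heights now add up to one, we can capture the counts returned by plt.hist (a small check added here):
In [ ]:
## the weighted bin heights should sum to one
heights, _, _ = plt.hist(data, weights=weights);
print(heights.sum())   # ~1.0 up to floating point rounding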
To weight your data use the weights=(array) option.
The weights array must be of the same shape as the data provided. Each data point in the data array then contributes its associated weight towards the bin count (instead of 1).
If you also use normed=True, the weights will be normalized so that the integral of the density over the range is unity.
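A short illustration of that interaction (a sketch added here): because the weights are re-normalized, scaling all the weights by a constant does not change the resulting density, and the density still integrates to one.
In [ ]:
## with density/normed normalization the overall scale of the weights does not matter
d1, edges = np.histogram(data, weights=weights, density=True)
d2, _ = np.histogram(data, weights=weights * 10.0, density=True)
print(np.allclose(d1, d2))           # True: same density either way
print(np.sum(d1 * np.diff(edges)))   # ~1.0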
In [15]:
plt.hist(data, weights=weights, cumulative=True);
In [16]:
plt.hist(data, weights=weights, bottom=5);
In [17]:
nbins = 10
bot = 5*np.random.rand(nbins)
In [18]:
plt.hist(data, bins=nbins, bottom=bot);
In [19]:
plt.hist(data, bins=nbins,histtype='bar');
In [20]:
plt.hist(data, bins=nbins,histtype='barstacked');
In [21]:
plt.hist(data, bins=nbins,histtype='step');
In [22]:
plt.hist(data, bins=nbins,histtype='stepfilled');
In [23]:
plt.hist(data, align='left');
In [24]:
plt.hist(data, align='mid');
In [25]:
plt.hist(data, align='right');
In [26]:
plt.hist(data, orientation="horizontal");
In [27]:
plt.hist(data, orientation="vertical");
In [28]:
plt.hist(data);
In [29]:
plt.hist(data, rwidth=0.2);
In [30]:
plt.hist(data, rwidth=0.8);
In [31]:
plt.hist(data, log=True);
In [32]:
plt.hist(data, color='red');
In [33]:
plt.hist(data, color=[0.2, 0.3, 0.8, 0.3]); # RGBA
In [34]:
plt.hist(data, label="Histogram");
In [35]:
data2 = np.random.rand(500)*1300;
In [36]:
plt.hist([data, data2], stacked=True); # pass the datasets together in one call so that stacked=True actually stacks them
In [37]:
entries = len(data);
mean = data.mean();
stdev = data.std();
Then create a string containing these values:
In [38]:
textstr = 'Entries=$%i$\nMean=$%.2f$\n$\\sigma$=$%.2f$' % (entries, mean, stdev)
In [39]:
plt.hist(data, label=textstr);
plt.ylim(0,100);
plt.legend(loc='best',markerscale=0.01);
Or using a textbox...
In [40]:
plt.hist(data);
plt.ylim(0,100);
#plt.text(800,80,textstr);
plt.annotate(textstr, xy=(0.7, 0.8), xycoords='axes fraction') # annotate for specifying the
# fraction of the canvas
Out[40]:
In [41]:
fit_data = np.random.randn(500)*200;
plt.hist(fit_data);
Assume now that, seeing these data, we think a Gaussian distribution will fit the given dataset best.
We load the gaussian (normal) distribution from scipy.stats:
In [42]:
from scipy.stats import norm
Now, looking at this function norm, we see it has the loc option and the scale option: loc is the mean value and scale is the standard deviation.
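A quick illustration of these two options (a small check added here): samples drawn with a given loc and scale should have roughly that mean and standard deviation.
In [ ]:
## loc is the mean, scale is the standard deviation
samples = norm.rvs(loc=5.0, scale=2.0, size=100000)
print(samples.mean(), samples.std())   # roughly 5.0 and 2.0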
To fit a Gaussian on top of the histogram, we need the normed histogram and also the mean and std of the Gaussian that fits the data. Therefore we have:
In [43]:
plt.hist(fit_data, normed=True);
In [44]:
mean, std = norm.fit(fit_data);
In [45]:
mean
Out[45]:
In [46]:
std
Out[46]:
Then we create the curve using norm.pdf over the range from fit_data.min() to fit_data.max():
In [47]:
x = np.linspace(fit_data.min(), fit_data.max(), 1000);
In [48]:
fit_gaus_func = norm.pdf(x, mean, std);
In [49]:
plt.hist(fit_data, normed=True);
plt.plot(x,fit_gaus_func, lw=4);
Instead of specifying a distribution we can estimate the best probability density function directly. This can be achieved with the non-parametric technique of kernel density estimation (KDE).
KDE is a non-parametric way to estimate the probability density function of a random variable.
How does it work? Suppose $(x_1, x_2, ..., x_n)$ are i.i.d. with unknown density $f$. We want to estimate the shape of this function $f$. Its kernel density estimator is
$ \hat{f}_{h}(x) = \frac{1}{n} \sum_{i=1}^{n}K_{h}(x-x_i) = \frac{1}{nh}\sum_{i=1}^{n}K\left(\frac{x-x_i}{h}\right)$
where $K(\cdot)$ is the kernel: a non-negative function that integrates to one and has mean zero; $h>0$ is a smoothing parameter called the bandwidth. A kernel with a subscript $h$ is called a scaled kernel and is defined as $K_h(x)=\frac{1}{h}K(\frac{x}{h})$. Usually one wants to use a small $h$, but it is always a trade-off between the bias of the estimator and its variance. (A minimal hand-rolled version of this estimator, using a Gaussian kernel, is sketched after the list of kernels below.)
Kernel functions commonly used:
- uniform
- triangular
- biweight
- triweight
- Epanechnikov
- normal
More under https://en.wikipedia.org/wiki/Kernel_(statistics)#Kernel_functions_in_common_use
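To make the formula above concrete, here is a minimal hand-rolled estimator using the normal (Gaussian) kernel. The bandwidth rule used below, $h = \sigma\, n^{-1/5}$ (Scott's rule), is only an assumption for this sketch, chosen so that the result is roughly comparable to scipy's default.
In [ ]:
## minimal KDE by hand with a Gaussian kernel K(u) = exp(-u**2/2)/sqrt(2*pi)
def kde_by_hand(x_eval, samples, h=None):
    samples = np.asarray(samples)
    x_eval = np.asarray(x_eval)
    n = len(samples)
    if h is None:
        h = samples.std() * n ** (-1.0 / 5.0)   # Scott's rule (assumed here)
    # f_hat(x) = 1/(n*h) * sum_i K((x - x_i)/h)
    u = (x_eval[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (n * h * np.sqrt(2.0 * np.pi))

plt.hist(fit_data, normed=True);
plt.plot(x, kde_by_hand(x, fit_data), lw=2);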
In Python this is done using the scipy.stats.kde submodule. For Gaussian kernel density estimation we use gaussian_kde:
In [50]:
from scipy.stats import gaussian_kde
pdf_gaus = gaussian_kde(fit_data);
pdf_gaus
Out[50]:
In [51]:
pdf_gaus = pdf_gaus.evaluate(x); # get the "y" values from the pdf for the "x" axis, this is an array
In [52]:
pdf_gaus
Out[52]:
In [53]:
plt.hist(fit_data, normed=1);
plt.plot(x, pdf_gaus, 'k', lw=3)
Out[53]:
In [54]:
plt.plot(x,fit_gaus_func, lw=4, label="fit");
plt.plot(x, pdf_gaus, 'k', lw=3, label="KDE");
plt.legend();
N.B.: Notice the difference between the two curves! This comes from the fact that the Gaussian KDE is a mixture of normal distributions; such a mixture may be skewed, heavy- or light-tailed, or multimodal. Thus it does not assume any particular form for the original distribution.
Scipy comes with an optimize submodule that provides several commonly used optimization algorithms.
One of the easiest is curve_fit, which uses non-linear least squares to fit a function $f$ to the data. It assumes that:
$ y_{\mathrm{data}} = f(x_{\mathrm{data}}, *\mathrm{params}) + \epsilon$
The declaration of the function is:
scipy.optimize.curve_fit(f, xdata, ydata, p0=None, sigma=None, absolute_sigma=False, check_finite=True, bounds=(-inf, inf), method=None, jac=None, **kwargs)
where f is the model function, xdata and ydata are the data to fit, p0 is an optional initial guess for the parameters, and bounds can be used to constrain the parameter values.
The function returns:
- popt: array of optimal values for the parameters, such that the sum of the squared error of $f(xdata, popt) - ydata$ is minimized
- pcov: the covariance matrix of popt. To compute one standard deviation errors on the parameters use: $$ perr = np.sqrt(np.diag(pcov)) $$ (this is demonstrated after the fit below)
Errors raised by the module: ValueError if the input data contain NaNs or incompatible options are used, RuntimeError if the least-squares minimization fails, and OptimizeWarning if the covariance of the parameters cannot be estimated.
In [55]:
from scipy.optimize import curve_fit
## define the model function:
def func(x, a,b,c):
return a*np.exp(-b*x)+c
## the x-points
xdata = np.linspace(0,4,50);
## get some data from the model function...
y = func(xdata, 2.5, 1.3, 0.5)
##and then add some gaussian errors to generate the "data"
ydata = y + 0.2*np.random.normal(size=len(xdata))
In [56]:
## now run the curve_fit()
popt, pcov = curve_fit(func, xdata, ydata)
In [57]:
popt
Out[57]:
In [58]:
pcov
Out[58]:
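As noted above, the one standard deviation errors on the fitted parameters follow directly from the diagonal of pcov:
In [ ]:
## one standard deviation errors on a, b, c
perr = np.sqrt(np.diag(pcov))
print(perr)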
In [59]:
### To constrain the optimization to the region of 0<a<3.0 , 0<b<2 and 0<c<1
popt, pcov = curve_fit(func, xdata, ydata, bounds=(0, [3., 2., 1.]))
In [60]:
popt
Out[60]:
In [61]:
pcov
Out[61]:
In [62]:
### Start with the histogram, but now get the values and binEdges
n,bins,patches = plt.hist(fit_data, bins=10);
In [63]:
n
Out[63]:
In [64]:
bins
Out[64]:
In [65]:
### Calculate the bin centers as (bins[1:]+bins[:-1])/2
binCenters = 0.5 * (bins[1:] + bins[:-1]); # bins[1:] drops the first edge, bins[:-1] drops the last
binCenters
Out[65]:
In [66]:
## function to model is a gaussian:
def func_g(x, a, b, c):
return a*np.exp(-(x-b)**2/(2*c**2))
## xdata are the centers, ydata are the values
xdata = binCenters
ydata = n
popt, pcov = curve_fit(func_g, xdata, ydata, p0=[1, ydata.mean(), ydata.std()])#,diag=(1./xdata.mean(),1./ydata.mean())) # setting some initial guesses to "help" the minimizer
In [67]:
popt
Out[67]:
In [68]:
pcov
Out[68]:
In [74]:
plt.plot(xdata,ydata, "ro--", lw=2, label="Data Bin Center");
plt.hist(fit_data, bins=10, label="Data Hist");
plt.plot(np.linspace(xdata.min(),xdata.max(),100),
func_g(np.linspace(xdata.min(),xdata.max(),100),popt[0],popt[1],popt[2]),
"g-", lw=3, label="Fit"); # I increased the x axis points to have smoother curve
plt.legend();
In [77]:
errors = []
for i in range(len(popt)):
try:
errors.append(np.absolute(pcov[i][i])**0.5)
except:
errors.append(0.00)
In [78]:
for i in range(len(popt)):
    print(popt[i], "+/-", errors[i])
However, this works when using curve_fit. The optimize.leastsq method returns the fractional covariance matrix: we have to multiply the elements of this matrix by the residual variance (the reduced chi squared) and then take the square root of the diagonal elements to get an estimate of the standard deviation of the fit parameters.
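A sketch of that procedure for the Gaussian fit above (the helper names here are mine; leastsq may return None for the covariance if the fit is degenerate):
In [ ]:
from scipy.optimize import leastsq

## residuals of the gaussian model for the binned data used above
def residuals(p, x, y):
    return y - func_g(x, *p)

p_ls, cov_frac, infodict, mesg, ier = leastsq(residuals, popt, args=(xdata, ydata),
                                              full_output=True)
## multiply the fractional covariance matrix by the residual variance
## (the reduced chi squared), then take the square root of the diagonal
dof = len(ydata) - len(p_ls)
s_sq = (infodict['fvec'] ** 2).sum() / dof
print(p_ls, np.sqrt(np.diag(cov_frac * s_sq)))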
Getting a proper estimate of the standard error in the fitted parameters is a complicated statistical problem. In detail, the covariance matrix returned by optimize.leastsq and optimize.curve_fit is based on assumptions regarding the probability distribution of the errors and the interactions between parameters; these interactions may exist, depending on the specific fit function $f(x)$. A good way to deal with a complicated $f(x)$ is to use the bootstrap method.
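A minimal bootstrap sketch for those parameter errors (my own illustration, not from the original analysis): resample the raw data with replacement, rebuild the histogram, refit, and take the spread of the fitted parameters.
In [ ]:
## bootstrap estimate of the parameter uncertainties (200 resamples assumed)
boot_params = []
for _ in range(200):
    resample = np.random.choice(fit_data, size=len(fit_data), replace=True)
    counts, edges = np.histogram(resample, bins=10)
    centers = 0.5 * (edges[1:] + edges[:-1])
    try:
        p_boot, _ = curve_fit(func_g, centers, counts, p0=popt)
        boot_params.append(p_boot)
    except RuntimeError:
        continue   # skip resamples where the fit does not converge
boot_params = np.array(boot_params)
print(boot_params.mean(axis=0))   # bootstrap mean of (a, b, c)
print(boot_params.std(axis=0))    # bootstrap standard errors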
In [ ]: