The typical problem we have to solve: Lets say we want to study the properties of a distribution e.g. physics process. Typically we see the histogram and we want to see to which class of distribution belongs to. we can obtain this by fitting our data and checking the residuals (goodness of fit)


In [50]:
#Necessary imports

# lib for numeric calculations
import numpy as np

# standard lib for python plotting
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

# seaborn lib for more option in DS
import seaborn as sns

# so to obtain pseudo-random numbers
import random 
                                     
# fits a curve to data
from scipy.optimize import curve_fit

In [74]:
# Lets say that a physics process gives us a gaussian distribution with a
# mean value somewhere between 3 nad 5
# and a sigma with a value around 1
# We do not know exactly their precise number and this is what we want to figure out with the fit.
mu =  random.randrange(3, 5, 1)
sigma = 0.1 * random.randrange(8, 12, 1)

In [87]:
# Create the data
data = np.random.normal(mu, sigma, size=10000)

bin_values, bin_edges = np.histogram(data, density=False, bins=100)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2

# Define model function to be used to fit to the data above:
def gauss(x, *p):
    A, mu, sigma = p
    return A*np.exp(-(x-mu)**2/(2.*sigma**2))

# p0 is the initial guess for the fitting coefficients (A, mu and sigma above)
p0 = [1., 0., 1.]

coeff, var_matrix = curve_fit(gauss, bin_centres, bin_values, p0=p0)

# Get the fitted curve
hist_fit = gauss(bin_centres, *coeff)

#plt.plot(bin_centres, hist, label='Test data')
plt.plot(bin_centres, hist_fit, label='Fitted data', color='red', linewidth=2)
_ = plt.hist(data, bins=100, color='blue', alpha=.3)

# Finally, lets get the fitting parameters, i.e. the mean and standard deviation:
print 'Fitted mean = ', coeff[1]
print 'Fitted standard deviation = ', coeff[2]

plt.show()


Fitted mean =  4.01415649167
Fitted standard deviation =  1.10782532643

Create the distribution and visualize it


In [60]:
# Lets say that a physics process gives us a gaussian distribution with a
# mean value somewhere between 3 nad 5
# while it has a broad sigma with a value around 1
mu =  random.randrange(3, 5, 1)
sigma = 0.1 * random.randrange(8, 12, 1)

# Lets create now the distribution
data = np.random.normal(mu, sigma, 1000)
 
# Lets visualize it
_ = plt.hist(data, bins=100, alpha=.5)


Fit a function to the distribution and obtain its properties


In [65]:
# get the binned data
# density = True => histo integral = 1, normalized
hist_data = np.histogram(data, density=True, bins=100) 

# Define model function to be used to fit to the data above, we assume here that we know that is a gaussian function:
def gauss(x, *p):
    A, mu, sigma = p
    return A * np.exp( - (x - mu) ** 2 / (2. * sigma ** 2))


# get distribution information
bin_edges = hist_data[1]
bin_centres = (bin_edges[:-1] + bin_edges[1:]) / 2
bin_values = hist_data[0]

# p0 is the initial guess for the fitting coefficients (A, mu and sigma above)
p0 = [1., 4., 1.]

# Fit the data
coeff, var_matrix = curve_fit(gauss, bin_centres, bin_values, p0=p0)
print "The fit coeff. are:", coeff

# Obtain the final fit function
hist_fit = gauss(bin_centres, *coeff)

hist_data = plt.hist(data, bins=100, alpha = 0.5)
plt.plot(bin_centres, hist_fit, label='Fitted data')



#hist_data = plt.hist(s, bins=100)


The fit coeff. are: [ 0.35953265  4.01647755  1.10742563]
Out[65]:
[<matplotlib.lines.Line2D at 0x1106226d0>]

Now do the fit


In [47]:


In [48]:
# Get the fitted curve
hist_fit = gauss(bin_centres, *coeff)


Out[48]:
[<matplotlib.lines.Line2D at 0x10d9cb0d0>]

In [45]:
hist_data = plt.hist(s, bins=100, alpha = 0.5)
plt.plot(bin_centres, hist_fit, label='Fitted data')

coeff


Out[45]:
array([ 0.40272387,  0.00613318,  0.98290819])

In [ ]: