# Distribution Fitting

Goals:

1. Load the raw data from the previous post
2. Inspect the distribution of a feature
3. Postulate a fit function
4. Use scipy.optimize to fit the function to the distribution

### 1. Load the Raw Data



In [1]:

import pandas as pd




In [2]:
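
# The load step didn't survive this export; assuming the dataframe was
# saved to CSV in the previous post (the filename below is a placeholder)
data_df = pd.read_csv("raw_data.csv", index_col="eventID")
data_df.head()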




Out[2]:

| eventID | AfterInhMATRIX5 | PrescaleMATRIX5 | RawMATRIX4 | RawTriggers | label |
|--------:|----------------:|----------------:|-----------:|------------:|------:|
| 430001  | 27094           | 12              | 1598521    | 10759045    | 2     |
| 430002  | 34901           | 14              | 1670878    | 11813291    | 3     |
| 430003  | 36317           | 15              | 1675869    | 12002554    | 3     |
| 430004  | 34088           | 14              | 1637602    | 11564482    | 3     |
| 430005  | 27489           | 12              | 1587623    | 10627391    | 4     |



### 2. Inspect the Distribution for a Feature



In [3]:

import matplotlib.pyplot as plt
import seaborn as sns

# Show plots in notebook
%matplotlib inline

# Set some styling options
sns.set_style("darkgrid")
sns.set_context("paper", font_scale=1.4)




In [4]:

feature = "AfterInhMATRIX5"
sns.distplot(data_df.query("label == 1")[feature])




Out[4]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f7d45caf650>



#### Extract the values of the KDE curve for fitting purposes



In [5]:

from sklearn.neighbors import KernelDensity
import numpy as np

# Put the data you want to apply the KDE to into an array
data = data_df.query("label == 1")[feature].values[:, np.newaxis]
# Create a KDE object and then fit it to the data
kde = KernelDensity(kernel='gaussian', bandwidth=1400).fit(data)



#### Let's plot it to make sure it looks like what we've seen above



In [6]:

X_plot = np.linspace(10000, 100000, 1000)[:, np.newaxis]
# Get log density values for each point on the x-axis
log_dens = kde.score_samples(X_plot)
Y_plot = np.exp(log_dens)

# Plot the two against each other
sns.distplot(data_df.query("label == 1")[feature],
             hist=False, color='black',
             label='Seaborn KDE')
plt.plot(X_plot, Y_plot, '--', color='red', lw=2,
         label='scikit-learn KDE')
plt.legend(loc='best')




Out[6]:

<matplotlib.legend.Legend at 0x7f7d0b4d9490>



Good! The scikit-learn KDE looks like an exact match for the seaborn curve.
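
The bandwidth of 1400 in the KDE above was presumably tuned by hand until the curves lined up. As an aside (a sketch, not part of the original analysis; the grid bounds here are arbitrary), scikit-learn can choose the bandwidth by cross-validated log-likelihood instead:

In [ ]:

from sklearn.model_selection import GridSearchCV

# Score each candidate bandwidth by held-out log-likelihood
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.linspace(500, 3000, 26)},
                    cv=5)
grid.fit(data)
print(grid.best_params_)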

### 3. Postulate a Fit Function

I'm guessing that this distribution is four Gaussian functions added together.

- Define a Gaussian function
- Define our custom function made up of Gaussians
- Make a guess at our parameters
- Use scipy.optimize.curve_fit to optimize the parameters

$$f_{G}(x) = a \cdot \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$$


In [7]:

def gauss_dist(xdata, amp, mean, stddev):
    return amp * np.exp(-np.square(xdata - mean) / (2 * stddev**2))




In [8]:

# Take four amplitudes, means, and standard deviations
# Compute sum of four Gaussians
def my_fit(xdata,
           a1, a2, a3, a4,
           m1, m2, m3, m4,
           s1, s2, s3, s4):
    exp1 = gauss_dist(xdata, a1, m1, s1)
    exp2 = gauss_dist(xdata, a2, m2, s2)
    exp3 = gauss_dist(xdata, a3, m3, s3)
    exp4 = gauss_dist(xdata, a4, m4, s4)
    return exp1 + exp2 + exp3 + exp4
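
As an aside (a sketch, not in the original), the same idea generalizes to any number of components if the parameters are packed into one flat sequence; curve_fit accepts a variadic signature as long as p0 is supplied:

In [ ]:

# Sum of N Gaussians; params packed as [a1..aN, m1..mN, s1..sN]
def multi_gauss(xdata, *params):
    n = len(params) // 3
    amps, means, stds = params[:n], params[n:2*n], params[2*n:]
    return sum(gauss_dist(xdata, a, m, s)
               for a, m, s in zip(amps, means, stds))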



#### Here I make my ballpark guesses for the amplitudes, means, and deviations



In [9]:

p0 = [0.00001, 0.00002, 0.000061, 0.000005,
      31000, 51000, 66000, 83000,
      1000, 1500, 2000, 3000]



#### Take a look at what that gives us



In [10]:

my_guess = my_fit(X_plot[:, 0], *p0)
plt.plot(X_plot, my_guess, '-')




Out[10]:

[<matplotlib.lines.Line2D at 0x7f7d241ab350>]



...This doesn't look good, but it's a nice little nucleation point for an optimization routine.

### 4. Use scipy.optimize to Fit the Function to the Distribution



In [11]:

from scipy.optimize import curve_fit
popt, pcov = curve_fit(my_fit, X_plot[:, 0], Y_plot, p0)




In [12]:

print(popt)




[  7.46200603e-06   1.98656889e-05   5.84597102e-05   4.69580815e-06
   3.03665488e+04   5.30710216e+04   6.64936414e+04   8.05486741e+04
   2.50726590e+03   5.68857720e+03   4.19554544e+03   4.69961442e+03]
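
We never used the covariance matrix pcov that curve_fit returned; as a quick sanity check (a sketch, not in the original), one-sigma uncertainties on the fitted parameters come from the square root of its diagonal:

In [ ]:

# One-sigma parameter uncertainties from the fit covariance matrix
perr = np.sqrt(np.diag(pcov))
print(perr)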



#### Put in these optimized parameters and see what we get



In [13]:

optim_fit = my_fit(X_plot[:, 0], *popt)
# Plot the whole fit
plt.plot(X_plot, optim_fit, '-.', lw=3, color='black')
# Along with the constituent Gaussians
for i in range(0, 4):
    plt.plot(X_plot,
             gauss_dist(X_plot[:, 0], popt[i], popt[i+4], popt[i+8]),
             '-', lw=2)






#### Compare to the actual KDE distribution



In [14]:

plt.plot(X_plot, Y_plot, '-', lw=3, color='black',
         label='scikit-learn KDE')
plt.plot(X_plot, optim_fit, '-', color='red',
         label=r'$\sum_{i=1}^4\ f_G(a_i, \mu_i, \sigma_i)$')
plt.legend(loc='best')
plt.legend(loc='best')




Out[14]:

<matplotlib.legend.Legend at 0x7fa1f5bce990>



### Calculate a $\chi^2$ Value for This Fit

With the hypothesis that these two distributions are the same, we calculate:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$



In [15]:

def calc_chisq(obs, exp):
    chisq = 0.0
    for i in range(0, len(obs)):
        chisq += (obs[i] - exp[i])**2 / exp[i]
    return chisq
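
Since Y_plot and optim_fit are NumPy arrays, the same sum can also be written as one vectorized line (an equivalent form of the loop above):

In [ ]:

# Vectorized equivalent of calc_chisq
print(np.sum((Y_plot - optim_fit)**2 / optim_fit))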




In [16]:

print(calc_chisq(Y_plot, optim_fit))




0.0482006570666



For 12 degrees of freedom (our 12 fit parameters), a $\chi^2$ table gives $p > 0.995$, so this is a good fit to the distribution.

#### Or use scipy.stats to get the $\chi^2$ and p-value



In [17]:

from scipy.stats import chisquare
chisq, pvalue = chisquare(Y_plot, optim_fit, ddof=12)
print("Chi-squared: %.2f\np-value: %.2f" % (chisq, pvalue))




Chi-squared: 0.05
p-value: 1.00



### Overlay onto the Original Data



In [18]:

sns.distplot(data_df.query("label == 1")[feature],
             kde=False, hist=True, norm_hist=True)
for i in range(0, 4):
    plt.plot(X_plot,
             gauss_dist(X_plot[:, 0], popt[i], popt[i+4], popt[i+8]),
             '-', lw=2)
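
Each Gaussian's area is $a_i \sigma_i \sqrt{2\pi}$, so the relative weight of each component falls straight out of the fitted parameters; a quick estimate (a sketch, not in the original):

In [ ]:

# Fraction of the distribution carried by each Gaussian:
# area = amp * stddev * sqrt(2*pi)
areas = popt[0:4] * popt[8:12] * np.sqrt(2 * np.pi)
print(areas / np.sum(areas))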






