In [ ]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import scipy.optimize, scipy.stats
import pandas as pd
You measure an observable over time and would like to extract a rate constant $k$ for the process. You are fitting the data using the code below. Each data point was measured independently of the others. The uncertainty in each measured point is normally distributed with a standard deviation of 0.05.
How robust are your estimates of $A$ and $k$ given the uncertainty in your measured points?
In [ ]:
def first_order(t,A,k):
"""
First-order kinetics model.
"""
return A*(1 - np.exp(-k*t))
def first_order_r(param,t,obs):
"""
Residuals function for first-order model.
"""
return first_order(t,param[0],param[1]) - obs
def fit_model(t,obs,param_guesses=(1,1)):
"""
Fit the first-order model.
"""
fit = scipy.optimize.least_squares(first_order_r,
param_guesses,
args=(t,obs))
fit_A = fit.x[0]
fit_k = fit.x[1]
return fit_A, fit_k
d = pd.read_csv("data/time_course_0.csv")
A, k = fit_model(d.t,d.obs)
plt.plot(d.t,d.obs,'o')
plt.plot(d.t,first_order(d.t,A,k))
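One way to check this (a sketch, not the only approach): repeatedly add normally distributed noise with the stated standard deviation of 0.05 to the observations, refit, and look at the spread of the resulting $A$ and $k$ estimates. The number of samples and the percentile-based interval below are my choices.
In [ ]:
# Sketch: probe the robustness of A and k by re-sampling the data.
# Assumes d, fit_model, and first_order from the cell above.
t = np.asarray(d.t)
obs = np.asarray(d.obs)

num_samples = 1000
A_samples = np.zeros(num_samples)
k_samples = np.zeros(num_samples)
for i in range(num_samples):
    # each pseudo-replicate adds sd = 0.05 noise to every point
    pseudo_obs = obs + np.random.normal(0,0.05,size=len(obs))
    A_samples[i], k_samples[i] = fit_model(t,pseudo_obs)

# 2.5th, 50th, and 97.5th percentiles of the sampled estimates
print("A:",np.percentile(A_samples,[2.5,50,97.5]))
print("k:",np.percentile(k_samples,[2.5,50,97.5]))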
In [ ]:
In [ ]:
In [ ]:
Now analyze data/time_course_1.csv. Is there a statistically significant difference between $k$ from dataset 1 vs. dataset 0?
In [ ]:
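One way to attack this (a sketch building on the sampling approach above; sample_k is a helper name I'm introducing, not part of the original code): build a sampled distribution of $k$ for each dataset and see whether the two distributions overlap.
In [ ]:
# Sketch: compare sampled k distributions for datasets 0 and 1
def sample_k(csv_file,num_samples=1000,sd=0.05):
    data = pd.read_csv(csv_file)
    t = np.asarray(data.t)
    obs = np.asarray(data.obs)
    k_samples = np.zeros(num_samples)
    for i in range(num_samples):
        pseudo_obs = obs + np.random.normal(0,sd,size=len(obs))
        _, k_samples[i] = fit_model(t,pseudo_obs)
    return k_samples

k0 = sample_k("data/time_course_0.csv")
k1 = sample_k("data/time_course_1.csv")

# If the 95% intervals do not overlap, the difference in k is unlikely
# to be explained by the measurement uncertainty alone.
print("k, dataset 0:",np.percentile(k0,[2.5,97.5]))
print("k, dataset 1:",np.percentile(k1,[2.5,97.5]))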
The uncertainty of each point in data/time_course_2.csv is given in the obs_err column. Modify your sampling code so you incorporate these point-specific uncertainties.
Analyze data/time_course_2.csv, data/time_course_3.csv, and data/time_course_4.csv, then report your estimate and 95% confidence intervals of $k$. Do your confidence intervals change with more data?
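A sketch of how the sampling could be modified to use the point-specific uncertainties (this assumes each file has t, obs, and obs_err columns; sample_k_per_point is a helper name I'm introducing):
In [ ]:
# Sketch: draw each point's noise from its own normal distribution (obs_err)
def sample_k_per_point(csv_file,num_samples=1000):
    data = pd.read_csv(csv_file)
    t = np.asarray(data.t)
    obs = np.asarray(data.obs)
    obs_err = np.asarray(data.obs_err)
    k_samples = np.zeros(num_samples)
    for i in range(num_samples):
        # passing the obs_err array as the scale gives per-point noise
        pseudo_obs = obs + np.random.normal(0,obs_err)
        _, k_samples[i] = fit_model(t,pseudo_obs)
    return k_samples

for f in ["data/time_course_2.csv","data/time_course_3.csv","data/time_course_4.csv"]:
    k_samp = sample_k_per_point(f)
    print(f,"k 95% CI:",np.percentile(k_samp,[2.5,97.5]))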
The Akaike Information Criterion (AIC) helps you select between competing models. It penalizes models for the number of fittable parameters, because adding parameters will almost always make a fit better.
$$AIC = -2\ln(\hat{L}) + 2k$$ where $\hat{L}$ is the maximum likelihood and $k$ is (1 + the number of fittable parameters in the model). For a least-squares fit, $\hat{L}$ can be written in terms of the sum of the squared residuals ($SSR$), so we can write:
$$AIC = n\ln(SSR) + 2k$$ where $n$ is the number of observations. To compare $N$ different models, we first normalize the $AIC$ of each model as $\Delta_{i}$:
$$\Delta_{i} = AIC_{i} - AIC_{min}$$ where $AIC_{min}$ is the best (lowest) AIC. The support for a given model is given by its Akaike weight ($w_{i}$):
$$w_{i} = \frac{\exp(-\Delta_{i}/2)}{\sum_{j=1}^{N} \exp(-\Delta_{j}/2)}$$ The weights run from 0.0 to 1.0, with 1.0 representing the strongest support. For a more thorough description, see here.
In [ ]:
# these values
ssr_list = [0.05,0.001,0.00095]
num_parameters = [2,3,10]
num_obs = 10
# should give these weights
weights = [8.69e-9,0.9988,1.18e-3]
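A minimal sketch of what such a calc_aic function could look like (the name comes from the prompt further down; the exact signature is my assumption). It implements the $AIC$-from-$SSR$ formula above and should reproduce the weights in the test cell.
In [ ]:
def calc_aic(ssr_list,num_parameters,num_obs):
    """
    Calculate Akaike weights for a set of models.

    ssr_list: sum of squared residuals for each model
    num_parameters: number of fittable parameters for each model
    num_obs: number of observations (same data fit by every model)

    returns: array of Akaike weights, one per model
    """
    ssr_list = np.asarray(ssr_list,dtype=float)
    num_parameters = np.asarray(num_parameters,dtype=int)

    # k is (1 + number of fittable parameters)
    k = num_parameters + 1
    aic = num_obs*np.log(ssr_list) + 2*k

    # Normalize against the best (lowest) AIC, then convert to weights
    delta = aic - np.min(aic)
    w = np.exp(-delta/2)
    return w/np.sum(w)

calc_aic(ssr_list,num_parameters,num_obs)  # compare to the weights above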
In [ ]:
Sedimentation equilibrium experiments are used to study molecular assemblies. If you spin a solution of molecules in a centrifuge very fast for a very long time, you get an equilibrium distribution of molecules along the length ($r$) of the tube. This is because centrifugal force pulls them to the bottom of the tube, while diffusion tends to spread them out along the tube. If you measure the concentration of molecules along the length of the tube, $c(r)$, you can then fit a model and back out the molecular weight(s) of the species in solution.
The equation describing a simple, non-interacting molecule is below. $c_{0}$ is the concentration of the molecule at $r=0$, $M$ is the molecular weight of the molecule, and $r$ is the position along the tube.
$$c(r) = c_{0}\exp \Big[ M \Big( \frac{r^{2}}{2} \Big) \Big]$$ You are interested in whether there is any dimer present (a dimer is a pair of interacting molecules). If there is dimer, our model gets more complicated: we have to add a term $\theta$ that describes the fraction of the molecules in dimer versus monomer:
$$c(r) = c_{0} \Big( \theta \exp \Big[ M \Big( \frac{r^{2}}{2} \Big) \Big] + (1 - \theta)\exp \Big[ 2M \Big( \frac{r^{2}}{2} \Big) \Big] \Big)$$ (I've collapsed a bunch of constants to simplify the problem. If you want a full discussion of the method, see here, particularly equation 4.)
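Following the pattern of first_order and first_order_r above, the two models might be coded like this (a sketch; the function names and parameter order are my choices):
In [ ]:
def monomer(r,c0,M):
    """Single, non-interacting species."""
    return c0*np.exp(M*(r**2/2))

def monomer_dimer(r,c0,M,theta):
    """Mixture of monomer and dimer (the dimer has molecular weight 2M)."""
    return c0*(theta*np.exp(M*(r**2/2)) + (1 - theta)*np.exp(2*M*(r**2/2)))

def monomer_r(param,r,c_obs):
    """Residuals for the monomer-only model."""
    return monomer(r,param[0],param[1]) - c_obs

def monomer_dimer_r(param,r,c_obs):
    """Residuals for the monomer + dimer model."""
    return monomer_dimer(r,param[0],param[1],param[2]) - c_obs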
In [ ]:
Fit both models to data/sed_eq.csv. What are your estimates of $c_{0}$, $M$, and $\theta$? Are they the same between the two fits?
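A sketch of the two fits using scipy.optimize.least_squares. I'm assuming data/sed_eq.csv has r and c columns and that (1, 1) and (1, 1, 0.5) are reasonable starting guesses; adjust both to the actual file.
In [ ]:
sed = pd.read_csv("data/sed_eq.csv")
r = np.asarray(sed.r)      # assumed column name
c_obs = np.asarray(sed.c)  # assumed column name

fit_mono = scipy.optimize.least_squares(monomer_r,(1,1),args=(r,c_obs))
fit_dimer = scipy.optimize.least_squares(monomer_dimer_r,(1,1,0.5),args=(r,c_obs))

print("monomer only:    c0, M =",fit_mono.x)
print("monomer + dimer: c0, M, theta =",fit_dimer.x)

# Sum of squared residuals for each fit (used for AIC in the next question)
ssr_mono = np.sum(fit_mono.fun**2)
ssr_dimer = np.sum(fit_dimer.fun**2)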
In [ ]:
Use your calc_aic function on these fits. Which model is supported? Can you conclude there is dimer present?
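With the SSR values from the fits above and the calc_aic sketch from earlier, the comparison might look like this (2 and 3 are the numbers of fittable parameters in the two models):
In [ ]:
# Akaike weights for [monomer only, monomer + dimer]
print(calc_aic([ssr_mono,ssr_dimer],[2,3],len(r)))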
In [ ]:
In [ ]:
d = pd.read_csv("data/gaussian.csv")
plt.plot(d.x,d.y)
You find code to analyze this kind of data on the internet. Using the functions below, determine how many gaussians are needed to describe the data in lab-data/gaussian.csv. PS: you should play with your initial guesses. PPS: negative-area gaussians are not allowed.
In [ ]:
def multi_gaussian(x,means,stds,areas):
"""
Function calculating multiple gaussians (built from
values in means, stds, areas). The number of gaussians
is determined by the length of means, stds, and areas.
The gaussian functions are calculated at values in
array x.
"""
if len(means) != len(stds) or len(means) != len(areas):
err = "means, standard deviations and areas should have the same length!\n"
raise ValueError(err)
out = np.zeros(len(x),dtype=float)
for i in range(len(means)):
out += areas[i]*scipy.stats.norm(means[i],stds[i]).pdf(x)
return out
def multi_gaussian_r(params,x,y):
"""
    Residuals function for multi_gaussian.
"""
params = np.array(params)
if params.shape[0] % 3 != 0:
err = "num parameters must be divisible by 3\n"
raise ValueError(err)
means = params[np.arange(0,len(params),3)]
stds = params[np.arange(1,len(params),3)]
areas = params[np.arange(2,len(params),3)]
return multi_gaussian(x,means,stds,areas) - y
def fitter(x,y,means_guess,stds_guess,areas_guess):
"""
Fit an arbitrary number of gaussian functions to x/y data.
The number of gaussians that will be fit is determined by
the length of means_guess.
x: measurement x-values (array)
y: measurement y-values (array)
means_guess: array of guesses for means for gaussians.
length determines number of gaussians
stds_guess: array of guesses of standard deviations for
gaussians. length must match means_guess
areas_guess: array of area guesses for gaussians.
                 length must match means_guess.
returns: means, stds, areas and fit sum-of-squared-residuals
"""
# Sanity check
if len(means_guess) != len(stds_guess) or len(means_guess) != len(areas_guess):
err = "means, standard deviations and areas should have the same length!\n"
raise ValueError(err)
# Construct an array of parameter guesses by assembling
# means, stds, and areas
param_guesses = []
for i in range(len(means_guess)):
param_guesses.append(means_guess[i])
param_guesses.append(stds_guess[i])
param_guesses.append(areas_guess[i])
param_guesses = np.array(param_guesses)
# Fit the multigaussian function
fit = scipy.optimize.least_squares(multi_gaussian_r,param_guesses,
args=(x,y))
# Disassemble into means, stds, areas
means = fit.x[np.arange(0,len(fit.x),3)]
stds = fit.x[np.arange(1,len(fit.x),3)]
areas = fit.x[np.arange(2,len(fit.x),3)]
    # fit.cost is half the sum of squared residuals, so double it to get the SSR
    ssr = 2*fit.cost
    return means, stds, areas, ssr
def plot_gaussians(means,stds,areas):
"""
Plot a collection of gaussians.
means: array of means for gaussians.
length determines number of gaussians
stds: array of standard deviations for gaussians.
          length must match means
areas: array of areas for gaussians.
           length must match means.
"""
plt.plot(d.x,multi_gaussian(d.x,means,stds,areas))
for i in range(len(means)):
plt.plot(d.x,multi_gaussian(d.x,
[means[i]],
[stds[i]],
[areas[i]]))
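A sketch of how these functions might be combined with the calc_aic sketch from earlier to decide how many gaussians the data support. The initial guesses below are placeholders; per the PS above, you will need to play with them.
In [ ]:
# Fit models with one and two gaussians (guesses are placeholders)
fit_results = {}
fit_results[1] = fitter(d.x,d.y,[0],[1],[1])
fit_results[2] = fitter(d.x,d.y,[-1,1],[1,1],[1,1])

# Each gaussian adds 3 fittable parameters (mean, std, area)
ssr_list = [fit_results[n][3] for n in sorted(fit_results)]
num_parameters = [3*n for n in sorted(fit_results)]
print(calc_aic(ssr_list,num_parameters,len(d.x)))

# Plot the data and one of the fits
means, stds, areas, ssr = fit_results[2]
plt.plot(d.x,d.y,"o")
plot_gaussians(means,stds,areas)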
In [ ]: