1. Determine the true means, $\hat{\bar{v}}_{a}$, for $v_{1}$, $v_{2}$, ..., $v_{5}$
In [1]:
include("q1/p1.jl")
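For reference, a minimal sketch of the calculation, assuming the universe of data is stored as an $M \times 5$ matrix v with one column per dataset (the names here are stand-ins; q1/p1.jl may organize this differently):
using Statistics
M = 1_000_000                # stand-in size; the real v is read from the data files
v = randn(M, 5)              # stand-in data matrix, one column per dataset vₐ
v̄̂ = vec(mean(v, dims=1))     # true means: v̄̂[a] = (1/M) Σᵢ v[i,a]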
2. Consider values $N$ = 1,000 and $N$ = 10,000. There are $M/N$ samples of this size in our $M$ values. Histogram the sample means for these values of $N$ and determine the true standard deviation of the means $\hat{\sigma}_{\bar{v}_{a},N}$.
Recall that $\hat{\sigma}_{\bar{v}_{a},N} = \frac{\sigma}{\sqrt{N}}$, where $\sigma$ is the standard deviation of the overall population. So we should see that $$\bigg(\frac{\hat{\sigma}_{\bar{v}_{a},N_{1}}}{\hat{\sigma}_{\bar{v}_{a},N_{2}}}\bigg)^{2} = \frac{N_{2}}{N_{1}} = 10$$ in this case. This is roughly the result we get (see below).
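A sketch of the kind of comparison q1/p2.jl is assumed to make, using uncorrelated stand-in data (only the block-means construction matters here):
using Statistics
M, N1, N2 = 1_000_000, 1_000, 10_000                     # stand-in sizes
v1 = randn(M)                                            # stand-in for one dataset, e.g. v[:,1]
blockmeans(x, N) = vec(mean(reshape(x, N, :), dims=1))   # the M/N sample means of size N
σN1 = std(blockmeans(v1, N1))                            # σ̂ of the means for N = 1,000
σN2 = std(blockmeans(v1, N2))                            # σ̂ of the means for N = 10,000
(σN1 / σN2)^2                                            # should be close to N2/N1 = 10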
In [2]:
include("q1/p2.jl")
In [3]:
histograms[1]
Out[3]:
In [4]:
histograms[2]
Out[4]:
In [5]:
histograms[3]
Out[5]:
In [6]:
histograms[4]
Out[6]:
In [7]:
histograms[5]
Out[7]:
3. You can now determine the true autocorrelation function for each variable, $\hat{C}_{v_{a}, n}$, which is given by:
$$\hat{C}_{v_{a}, n} = \frac{1}{M-n} \sum_{i=1}^{M-n} \big(v_{a,i+n} - \hat{\bar{v}}_{a}\big)\big(v_{a,i} - \hat{\bar{v}}_{a}\big)$$
$n$ goes from 0 to some maximum value $n_{cut}$ with $n_{cut} \ll M$. Plot $\hat{C}_{v_{a},n}/\hat{C}_{v_{a},0}$ versus $n$ for $a = 1...5$.
We want to pick our bin sizes such that $\hat{C}_{v_{a},n}$ is small for lags $n$ larger than the bin size.
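A minimal sketch of the estimator above (the function name autocov and the stand-in data are assumptions; q1/p3.jl may differ in detail):
using Statistics
function autocov(x, v̄, ncut)                     # Ĉ_{v,n} for n = 0 … ncut
    M = length(x)
    [sum((x[i+n] - v̄) * (x[i] - v̄) for i in 1:M-n) / (M - n) for n in 0:ncut]
end
x = randn(10_000)                                # stand-in data; the real script works on v[:,a]
Ĉ = autocov(x, mean(x), 300)
Ĉ ./ Ĉ[1]                                        # the normalized autocorrelation Ĉₙ/Ĉ₀ that gets plotted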
In [8]:
include("q1/p3.jl");
In [9]:
plots[1]
Out[9]:
In [10]:
plots[2]
Out[10]:
In [11]:
plots[3]
Out[11]:
In [12]:
plots[4]
Out[12]:
In [13]:
plots[5]
Out[13]:
4. Find the integrated autocorrelation times
$$ \hat{\tau}_{int,v_{a}} \equiv \frac{1}{2}\frac{1}{\hat{C}_{v_{a},0}} \sum_{n=-n_{cut}}^{n_{cut}} \hat{C}_{v_{a},n} = \sum_{n=0}^{n_{cut}} \frac{\hat{C}_{v_{a},n}}{\hat{C}_{v_{a},0}} - \frac{1}{2} $$
The second form uses the symmetry $\hat{C}_{v_{a},-n} = \hat{C}_{v_{a},n}$; the $-\frac{1}{2}$ appears since the zeroth term equals 1.
Estimate a value for $n_{cut}$ from your plots. $n_{cut}$ should be large enough that $\hat{C}_{v_{a},n}/\hat{C}_{v_{a},0}$ has gotten close enough to zero that the value of $\hat{\tau}_{int,v_{a}}$ is not affected by modest changes in $n_{cut}$.
Judging by the above plots, $\hat{C}_{v_{a},n}/\hat{C}_{v_{a},0}$ reaches zero at around $n = 100$ for each dataset. Since $2\tau_{int}$ is the separation between unrelated measurements, pick $n_{cut} = 300$ to be on the safe side.
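Given the normalized autocorrelation, the sum above is a one-liner; a sketch, re-using Ĉ from the autocov sketch in part 3:
τ̂int(Ĉ) = sum(Ĉ ./ Ĉ[1]) - 0.5    # Σₙ Ĉₙ/Ĉ₀ − 1/2 for n = 0 … n_cut
τ̂ = τ̂int(Ĉ)                       # ≈ 1/2 for uncorrelated data, larger when correlations persist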
In [14]:
include("q1/p4.jl")
5. Calculate the true standard deviation of the data, i.e.
$$ \hat{\sigma}_{v_{a}}^{2} \equiv \frac{1}{M-1} \sum_{i=1}^{M} (v_{a,i} - \hat{\bar{v}}_{a})^{2} $$
For a sample of size $N$, we should have
$$ \hat{\sigma}_{\bar{v}_{a},N} = \sqrt{\frac{2\hat{\tau}_{int,v_{a}}}{N}} \hat{\sigma}_{v_{a}} $$
giving the testable hypothesis that
$$ N = 2 \hat{\tau}_{int,v_{a}} \frac{\hat{\sigma}_{v_{a}}^{2}}{\hat{\sigma}_{\bar{v}_{a},N}^{2}} $$
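A sketch of this consistency check with uncorrelated stand-in data, for which $\hat{\tau}_{int} \approx 1/2$ (q1/p5.jl presumably uses the real datasets and their measured $\hat{\tau}_{int,v_{a}}$):
using Statistics
x = randn(1_000_000); N = 1_000; τ̂ = 0.5        # stand-ins only
σ̂  = std(x)                                     # true std dev of the data, 1/(M−1) normalization
σ̂N = std(vec(mean(reshape(x, N, :), dims=1)))   # true std dev of the means of blocks of size N
2τ̂ * σ̂^2 / σ̂N^2                                 # should come out close to N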
In [15]:
include("q1/p5.jl")
6. Calculate the true covariance matrix for the data, defined by
$$ \hat{c}_{v_{a},v_{b}} = \frac{1}{M} \sum_{i=1}^{M} (v_{a,i} - \hat{\bar{v}}_{a}) (v_{b,i} - \hat{\bar{v}}_{b}) $$
It is customary to define a normalized version of $\hat{c}_{v_{a},v_{b}}$ by
$$ \hat{\rho}_{v_{a},v_{b}} \equiv \frac{\hat{c}_{v_{a},v_{b}}}{\hat{\sigma}_{v_{a}}\hat{\sigma}_{v_{b}}} $$
Some scratch work for problem 6:
We can nicely express the covariance matrix $\hat{c}$ as
$$ \hat{c} = \frac{1}{M} D^{T}D $$
or in code as
ĉ = transpose(D) * D ./ M
where
D = v - ones(M) * transpose(v̄̂)
and we can write $\hat{\rho}_{v_{a},v_{b}}$ as
ρ̂ = ĉ ./ ( σ * transpose(σ) )
In [16]:
include("q1/p6.jl")
Out[16]:
In [17]:
using PyPlot
surf(ρ̂)
xlabel("va")
ylabel("vb")
title("Normalized Covariance between Datasets")
7. Pick two groups of data from the full universe of data. One should have N = 1,000 and the other should have N = 10,000. These two groups represent results one might get from simulations. We want to see how well these groups reproduce the true statistical results for these data.
a. Estimate the autocorrelation function $C_{v_{a},n}$ from these two groups and the integrated autocorrelation time.
b. Use these to determine the standard deviation of the mean $\sigma_{\bar{v}_{a},N}$.
c. Compare this with the results from the universe of data. Also compare the normalized covariance matrix $\rho_{v_{a},v_{b}}$ from these small samples with the universe of data.
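A sketch of the comparison in part (c); since the normalization factors cancel in $\hat{\rho}$, the normalized covariance is just the correlation matrix, so cor() can stand in for the scratch-work code above (the data matrix here is a stand-in):
using Statistics
v = randn(1_000_000, 5)         # stand-in for the full universe of data
ρ̂  = cor(v)                     # "true" normalized covariance from all M values
ρ1 = cor(v[1:1_000, :])         # estimate from a group with N = 1,000
ρ2 = cor(v[1:10_000, :])        # estimate from a group with N = 10,000
ρ̂ ./ ρ1, ρ̂ ./ ρ2                # element-wise comparisons, as in the cells below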
In [18]:
include("q1/p7.jl")
In [19]:
plots1[1]
Out[19]:
In [20]:
plots1[2]
Out[20]:
In [21]:
plots1[3]
Out[21]:
In [22]:
plots1[4]
Out[22]:
In [23]:
plots1[5]
Out[23]:
In [24]:
plots2[1]
Out[24]:
In [25]:
plots2[2]
Out[25]:
In [26]:
plots2[3]
Out[26]:
In [27]:
plots2[4]
Out[27]:
In [28]:
plots2[5]
Out[28]:
In [29]:
ρratio1 = ρ̂ ./ ρ1
In [30]:
ρrelerr1 = ones(5,5) - ρratio1
Even for the smaller of the two sample sizes, $N1 = 1,000$, the estimate is fairly close to the true value, except for the correlations between datasets v1 and v5, which are off for both sample sizes.
In [31]:
ρratio2 = ρ̂ ./ ρ2
In [32]:
ρrelerr2 = ones(5,5) - ρratio2
Again, the estimates are (for the most part) significantly better for the larger sample size. Let's compare the ratio of the relative errors of the two estimators:
In [33]:
ρrelerrratio = ρrelerr1 ./ ρrelerr2
Take the geometric mean of the off-diagonal elements to see how much of a reduction in error we get, on average, in going from $N_{1}$ = 1,000 to $N_{2}$ = 10,000.
In [34]:
ρoffdiagratio = copy(ρrelerrratio)   # copy so ρrelerrratio itself is not modified
for i in 1:5
    ρoffdiagratio[i,i] = 1.0         # set diagonals to 1 so they drop out of the product
end
mn = prod(ρoffdiagratio)^(1/20)      # geometric mean over the 20 off-diagonal elements
While the overall picture is a little muddled (there are some correlation factors where the smaller sample size coincidentally gives a more accurate measurement), the correlation factors estimated in this particular case with the larger sample size are, on average, roughly 2 to 3 times better in terms of relative error than those from the smaller sample size.
1. Break the $M$ measurements up into groups of size $N$, calculate $\bar{v}_{a}$ for each group and then calculate $f_{i}(\bar{v}_{a})$ for each group. Calculate these functions of the data means for all $M/N$ groups and find the standard deviation of $f_{i}(\bar{v}_{a})$, $\hat{\sigma}_{f_{i},N}$.
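A sketch of this step with stand-in names: v is the $M \times 5$ data matrix and f is a hypothetical function standing in for the actual $f_{i}$ (presumably defined in q2/p0.jl):
using Statistics
v = randn(1_000_000, 5); N = 1_000               # stand-ins
f(v̄) = v̄[1] * v̄[2] + v̄[3]^2                      # hypothetical example, not the real fᵢ
groupmean(k) = vec(mean(v[(k-1)*N+1:k*N, :], dims=1))
fvals = [f(groupmean(k)) for k in 1:size(v, 1) ÷ N]
σ̂fN = std(fvals)                                 # σ̂_{fᵢ,N}: std dev of f over all M/N groups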
In [35]:
include("q2/p0.jl")
include("q2/p1.jl")
Out[35]:
In [36]:
include("q2/p2.jl")
In [37]:
include("q2/p3.jl")
In [38]:
σvsbplots[1]
Out[38]:
In [39]:
σvsbplots[2]
Out[39]:
In [40]:
σvsbplots[3]
Out[40]:
In [41]:
σvsbplots[4]
Out[41]:
In [42]:
σvsbplots[5]
Out[42]:
The estimated standard deviations scale with the square root of the bin size, which intuitively makes sense, given that jackknife resampling suppresses variance as the bin count increases. Random fluctuations due to the falling bin count become apparent beyond around b = 60, but near b = 40 (the rough autocorrelation time we calculated for each of our datasets), the estimated variance is stable.
4. Now calculate $f_{i}(v^{\prime}_{a,k})$ for each of the $N/b$ jackknife blocks. You can then determine $\sigma_{f_{i},N}$ from
$$ \sigma^{2}_{f_{i},N} = \frac{N/b - 1}{N/b} \sum^{N/b}_{k=1}(f_{i}(v^{\prime}_{a,k})-f_{i}(\bar{v}_{a}))^2 $$
Again, do this for a few values of $b$ that are comparable to the integrated autocorrelation time. How does $\sigma_{f_{i},N}$ compare with $\hat{\sigma}_{f_{i},N}$ from part 1?
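A sketch of the jackknife estimate for a single group, with the same stand-in names as above; $v^{\prime}_{a,k}$ is taken here to be the usual delete-one-block mean, which may differ in detail from what q2/p4.jl does:
using Statistics
x = randn(1_000, 5); N, b = size(x, 1), 40       # one group of size N, bin size b ≳ 2τ_int
f(v̄) = v̄[1] * v̄[2] + v̄[3]^2                      # hypothetical example function again
v̄ = vec(mean(x, dims=1)); nb = N ÷ b             # nb = N/b jackknife blocks
v′(k) = (N .* v̄ .- b .* vec(mean(x[(k-1)*b+1:k*b, :], dims=1))) ./ (N - b)   # block-k-deleted means
σ²fN = (nb - 1) / nb * sum((f(v′(k)) - f(v̄))^2 for k in 1:nb)
sqrt(σ²fN)                                       # σ_{fᵢ,N} to compare with σ̂_{fᵢ,N} from part 1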
In [43]:
include("q2/p4.jl")
In [44]:
σfvsb[1]
Out[44]:
In [45]:
σfvsb[2]
Out[45]:
In [46]:
σfvsb[3]
Out[46]:
The jackknife-estimated standard deviations for the functions are each larger than our naive estimates. It seems that correlations between the variables were an important factor.
Choosing two values for $N$, estimate $\tau_{int}$ for each of the $M/N$ samples of size $N$ in the universe of data, as a function of $n_{cut}$. Then find the standard deviation $\sigma_{\tau,N}$ of $\tau_{int}$. Does the standard deviation with $n_{cut} \sim N$ decrease as $N$ increases?
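A sketch for one choice of $N$, re-using the autocov and τ̂int sketches from question 1 (the data, $N$, and $n_{cut}$ here are stand-ins; q3/p1.jl does this for both values of $N$):
using Statistics
x = randn(1_000_000); N, ncut = 10_000, 300
samples = [x[(k-1)*N+1:k*N] for k in 1:length(x) ÷ N]      # the M/N samples of size N
τs = [τ̂int(autocov(s, mean(s), ncut)) for s in samples]    # τ_int estimated from each sample
στN = std(τs)                                              # σ_{τ,N}; repeat for the other N and compare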
In [47]:
include("q3/p1.jl");
In [48]:
στintN1./στintN2
Out[48]:
Indeed, the standard deviations are around 3 times higher for the smaller value of N.
1. Make measurements of the temperature, potential energy, and the time average of the virial, which is given by
$$ \sum_{i} \sum_{j>i} r_{ij} \frac{\partial{V}_{ij}}{\partial{r}_{ij}} $$
for every molecular dynamics time step.
Read the temperature, potential energy, and virial datasets into arrays:
In [1]:
include("q4/p1.jl");
Out[1]:
2. Find autocorrelation times for those three quantities.
In [39]:
include("q4/p2.jl")
Note that these autocorrelation times should be multiplied by 10, since that's the number of actual simulation steps between each recorded measurement. Plot autocorrelations against step separation to get a sense of the autocorrelation times:
In [33]:
mdautocorr[1]
Out[33]:
In [34]:
mdautocorr[2]
Out[34]:
In [35]:
mdautocorr[3]
Out[35]:
In [36]:
mdautocorr[4]
Out[36]:
In [37]:
mdautocorr[5]
Out[37]:
In [38]:
mdautocorr[6]
Out[38]:
The distances between uncorrelated steps for the virial, temperature, and potential energy at $T = 1.304$ all seem to be around 100 simulation steps, corresponding to 10 steps in our saved data (since we only saved these quantities every 10 simulation steps). At $T = 1.069$, the autocorrelation settles down for all three quantities at around 250 steps. We use these values in determining $n_{cut}$ for each quantity in the jackknife part.
3. Measure the covariance matrix for these 3 quantities.
In [44]:
include("q4/p3.jl")
Out[44]:
4. Use $\tau_{int}$, binning, and jackknife resampling to get an error estimate on the pressure.
Recall:
$$ \frac{P}{\rho kT} = 1 - \frac{1}{6NkT} \left\langle \sum_{i \neq j} r_{ij} \frac{\partial{V}}{\partial{r_{ij}}} \right\rangle $$
which gives pressure in natural units.
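A sketch of the pressure estimate and a binned-jackknife error, with every name and number a stand-in (q4/p4.jl works from the recorded virial time series). Here W is the $\sum_{i}\sum_{j>i}$ sum recorded in part 1; since $\sum_{i \neq j} = 2\sum_{j>i}$, the $1/(6NkT)$ in the formula above becomes $1/(3NkT)$:
using Statistics
W = randn(10_000) .- 100.0                      # stand-in for the recorded virial series
Np, T, b = 500, 1.304, 25                       # particle count, temperature (k = 1), bin size ≳ 2τ_int
P(w̄) = 1 - w̄ / (3Np * T)                        # P/(ρkT) from the mean of the j>i virial sum
M = length(W); nb = M ÷ b; W̄ = mean(W)
W′ = [(M * W̄ - b * mean(W[(k-1)*b+1:k*b])) / (M - b) for k in 1:nb]   # bin-k-deleted means
σ²P = (nb - 1) / nb * sum((P(w) - P(W̄))^2 for w in W′)
P(W̄), sqrt(σ²P)                                 # P/(ρkT) and its jackknife error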
In [ ]:
include("q4/p4.jl")
Note: I tried changing what we discussed (i.e. calculating the energy by multiplying by 0.5 instead of 16), but it only gave more nonsensical looking results, so I'm keeping things as they were, since the behaviour seems qualitatively correct.