In [1]:
#Code from exercise
srand(42)
N = 10000
true_mean = 10000.
y = true_mean+randn(N)
sample_mean = mean(y)
sample_var = var(y)
println("Mean: ",sample_mean," Varience: ",sample_var)
In [2]:
function mean_and_var(y,sample_mean,sample_var) # made this a function for part 2d.
    y32bit = convert(Array{Float32,1},y);
    y16bit = convert(Array{Float16,1},y);
    mean32,var32 = mean(y32bit),var(y32bit)
    mean16,var16 = mean(y16bit),var(y16bit)
    println("64 bit mean: ",sample_mean," var: ",sample_var)
    println()
    println("32 bit mean: ",mean32," var: ",var32)
    println("dif mean: ",sample_mean-mean32," dif var: ",sample_var-var32)
    println("error mean: ",(sample_mean-mean32)/sample_mean," error var: ",(sample_var-var32)/sample_var)
    println()
    println("16 bit mean: ",mean16," var: ",var16)
    println("dif mean: ",sample_mean-mean16," dif var: ",sample_var-var16)
    println("error mean: ",(sample_mean-mean16)/sample_mean," error var: ",(sample_var-var16)/sample_var)
end
mean_and_var(y,sample_mean,sample_var)
The 32-bit array has a "small" difference compared to the 64-bit array (on the order of 1e-5 to 1e-6), while the 16-bit array gives a very large difference. The error grows faster in the variance calculation because the rounding errors get squared, and there may also be catastrophic cancellation in the variance calculation (it contains a subtraction) when the random error is small.
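To make the Float16 failure concrete, here is a minimal sketch (my own check, using the same values as the exercise above): near 10000 the spacing between representable Float16 values is 8, so the unit-variance scatter of the samples is essentially rounded away on conversion.

println(eps(Float16(10000.)))               # 8.0 -- spacing of Float16 values near 10000
println(Float16(10000.3) - Float16(10000.)) # 0.0 -- the 0.3 offset is lost entirely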
In [3]:
#CASE 1
#Code from exercise
srand(42)
N = 100000
true_mean = 10000.
y = true_mean+randn(N)
sample_mean = mean(y)
sample_var = var(y)
mean_and_var(y,sample_mean,sample_var)
In [4]:
#CASE 2
#Code from exercise
srand(42)
N = 10000
true_mean = 100000.
y = true_mean+randn(N)
sample_mean = mean(y)
sample_var = var(y)
mean_and_var(y,sample_mean,sample_var)
Ah, nice one! In case two it looks like we hit an overflow: the true mean of 100000 is larger than the largest finite Float16 value (65504), so the converted values, and hence $\mu$, come out as Inf. I will blame always working in double precision for not seeing that one coming.
In case one, the errors slightly improved rather than getting worse. Maybe the rounding is canceling out and I got lucky? A quick change of random seed doesn't seem to support this. The law of large numbers at play? Maybe the floating point errors are just another source of random error. So yeah, I am going to go with that.
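A quick check of the Float16 range confirms the conversion itself overflows (a minimal sketch; realmax was renamed floatmax in later Julia versions):

println(realmax(Float16))  # 65504 is the largest finite Float16
println(Float16(100000.))  # Inf -- 1e5 overflows on conversion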
In [41]:
srand(42)
N = 1000000
true_mean = 1e6
y = true_mean+randn(N);
$m = 1/N \times \sum_{i=1}^{N} y_i$
$s^2 = 1/(N-1) \times \sum_{i=1}^N (y_i-m)^2 = 1/(N-1) \times \left[ \left( \sum_{i=1}^N y_i^2 \right) - N m^2 \right]$
Note: A quick glance at http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance cleared up my confusion on what was meant by "one-pass" vs. "two-pass".
In [5]:
function onepass(y::Array)
    N = length(y)
    sUm = zero(y[1]) # oops: naming this `sum` shadowed the built-in sum function, hence the odd capitalization
    sUm_squared = zero(y[1])
    for i in 1:N
        sUm = sUm + y[i]
        sUm_squared = sUm_squared + y[i]^2
    end
    return (sUm_squared - sUm^2/N)*(1/(N-1))
end
Out[5]:
In [6]:
onepass(y)
Out[6]:
In [7]:
function twopass(y::Array)
    N = length(y)
    sUm = zero(y[1]) # same odd capitalization, to avoid shadowing the built-in sum function
    sUm_squared = zero(y[1])
    for i in 1:N
        sUm = sUm + y[i]
    end
    m = sUm/N
    for i in 1:N
        sUm_squared = sUm_squared + (y[i]-m)^2
    end
    return (sUm_squared)*(1/(N-1))
end
Out[7]:
In [8]:
println("One pass: ",onepass(y),", Two pass:", twopass(y),", Julia var(): ",var(y))
println("One pass error: ",(var(y)-onepass(y))/var(y),", Two pass error:", (var(y)-twopass(y))/var(y))
The one-pass algorithm has the larger errors under the given test conditions.
In [11]:
#Smaller N
srand(42)
N = 10000
true_mean = 1e6
y = true_mean+randn(N);
println("One pass: ",onepass(y),", Two pass:", twopass(y),", Julia var(): ",var(y))
println("One pass error: ",(var(y)-onepass(y))/var(y),", Two pass error:", (var(y)-twopass(y))/var(y))
In [13]:
#Smaller mean
srand(42)
N = 1000000
true_mean = 1e3
y = true_mean+randn(N);
println("One pass: ",onepass(y),", Two pass:", twopass(y),", Julia var(): ",var(y))
println("One pass error: ",(var(y)-onepass(y))/var(y),", Two pass error:", (var(y)-twopass(y))/var(y))
If you knew for sure (or made sure) you were dealing with a big array of small values, then you might prefer the faster one-pass algorithm. But the two-pass algorithm also carries a danger of catastrophic cancellation in the $(y_i - m)$ subtraction, so the one-pass version may be more reliable when $y \sim \mu$.
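A quick sketch supporting the first point (my own check with zero-mean data, not part of the exercise): when the values are small, $\sum y_i^2$ and $(\sum y_i)^2/N$ are not nearly equal, so there is little to cancel and both routines should track var() closely.

srand(42)
y_small = randn(1000000) # zero mean, unit variance -- no large offset to cancel
println("One pass: ",onepass(y_small),", Two pass: ",twopass(y_small),", var(): ",var(y_small))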
In [14]:
function var_online(y::Array)
    # Welford-style online update: maintain the running mean and M2,
    # the running sum of squared deviations from the current mean.
    n = length(y)
    m = zero(y[1])
    M2 = zero(y[1])
    for i in 1:n
        diff_by_i = (y[i]-m)/i
        m = m + diff_by_i
        M2 = M2 + (i-1)*diff_by_i*diff_by_i + (y[i]-m)*(y[i]-m)
    end
    variance = M2/(n-1)
end
Out[14]:
In [21]:
println("One pass: ",onepass(y),", Two pass:", twopass(y),", online(): ",var_online(y))
println("One pass error: ",(var(y)-onepass(y))/var(y),", Two pass error:", (var(y)-twopass(y))/var(y),", Online error:",(var(y)-var_online(y))/var(y) )
Seems to work better than the regular one-pass, but with all the subtractions there is a larger risk of catastrophic cancellation, so this may not be a good choice for narrow distributions.
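One design note: the payoff of this update is that it never needs the whole array, so it also works on a stream. Here is a small sketch of the same update consuming values one at a time (var_streaming is a hypothetical wrapper of my own, not part of the exercise):

function var_streaming(itr)
    # identical Welford-style update, but reading one value at a time
    n = 0
    m = 0.0
    M2 = 0.0
    for x in itr
        n += 1
        d = (x - m)/n
        m += d
        M2 += (n-1)*d*d + (x - m)*(x - m)
    end
    return M2/(n-1)
end
# e.g. var_streaming(randn(100000)) should come out close to 1.0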
In [109]:
function make_set(N,mu)
    srand(42) # Random seed
    true_vel = zeros(N)
    m = ones(N)*mu # per-point measurement uncertainty (was hard-coded to 2; now uses the mu argument)
    vel_err = randn(N).*m # Did somebody say z-score? This takes me back to stats 101! :)
    # 0 velocity with no planet, right?
    vel_obs = true_vel+vel_err
    return vel_obs,true_vel,vel_err,m
end
y,z,err,mu = make_set(100,2)
Out[109]:
In [108]:
# I don't like one-line functions as a rule.
Pnorm(y,sigma,z) = 1/sqrt(2*pi*sigma^2)*exp(-(y-z)^2/(2*sigma^2))
#does it work?
[Pnorm(y[i],2,z[i]) for i in 1:length(y)]
#probably
Out[108]:
In [95]:
function Pnorm_set(ys::Array,sigmas::Array,zs::Array)
    n = length(ys)
    ps = [Pnorm(ys[i],sigmas[i],zs[i]) for i in 1:n]
    return prod(ps)
end
Out[95]:
In [110]:
y,z,err,mu = make_set(100,2)
println(Pnorm_set(y,err,z))
y,z,err,mu = make_set(600,2)
println(Pnorm_set(y,err,z))
In [114]:
function Pnorm_set_log(ys::Array,sigmas::Array,zs::Array)
    n = length(ys)
    ps = [log(Pnorm(ys[i],sigmas[i],zs[i])) for i in 1:n] # I LOVE comprehensions!
    return sum(ps)
end
Out[114]:
Well, the plain product version can't handle big N at least; the N = 600 case underflows to zero.
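For intuition, a minimal sketch with made-up probabilities (not the exercise data) showing why summing logs avoids the underflow:

ps = fill(1e-3, 200)               # 200 small probabilities
println(prod(ps))                  # 0.0 -- the product underflows below ~1e-308
println(sum([log(p) for p in ps])) # about -1381.6 -- the log-likelihood stays finite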
In [117]:
y,z,err,mu = make_set(100,2)
println(Pnorm_set_log(y,err,z))
y,z,err,mu = make_set(600,2)
println(Pnorm_set_log(y,err,z))
Ah that is better!