Basic Statistics in Julia



In [ ]:

    
using Statistics, StatsBase



In [ ]:

    
using Random
Random.seed!(1)



In [ ]:

    
x = rand(100)



In [ ]:

    
minimum(x)



In [ ]:

    
median(x)



In [ ]:

    
maximum(x)



In [ ]:

    
quantile(x, [0.0, 0.5, 1.0])



In [ ]:

    
describe(x)

Probability Distributions in Julia



In [ ]:

    
using Distributions



In [ ]:

    
x = rand(Gamma(1, 2), 100)

Standard R Functions with Simpler Names



In [ ]:

    
d = Normal(0, 1)



In [ ]:

    
pdf(d, 0.0)



In [ ]:

    
cdf(d, 0.0)



In [ ]:

    
quantile(d, 0.1)



In [ ]:

    
rand(d)



In [ ]:

    
rand(Categorical([0.1, 0.9]))



In [ ]:

    
rand(sampler(Categorical([0.5, 0.5])))



In [ ]:

    
Categorical([0.5, 0.5])



In [ ]:

    
sampler(Categorical([0.5, 0.5]))

Additional Abstractions around PDFs, CDFs, etc.



In [ ]:

    
quantile.(d, [0.25, 0.75])



In [ ]:

    
-loglikelihood(d, rand(d, 100_000)) / 100_000
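
The cell above is a Monte Carlo estimate of the differential entropy of d: averaging the negative log-density over draws from d approximates the expectation that defines entropy. As a rough guide, and using notation (H, p, n) that is introduced here rather than taken from the notebook, the estimate and the closed form for a normal distribution are:

```latex
H(d) = -\mathbb{E}_{X \sim d}\big[\log p(X)\big]
     \approx -\frac{1}{n} \sum_{i=1}^{n} \log p(x_i),
\qquad
H\big(\mathcal{N}(\mu, \sigma^2)\big) = \tfrac{1}{2} \log\!\big(2 \pi e \sigma^2\big)
```

For d = Normal(0, 1) the closed form gives about 1.4189 nats, so the sampled value should land close to the result of entropy(d).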

Theoretical Properties of Distributions



In [ ]:

    
entropy(d)



In [ ]:

    
mean(d)



In [ ]:

    
skewness(d)



In [ ]:

    
kurtosis(d)



In [ ]:

    
var(d)



In [ ]:

    
modes(d)

Fit Distributions to Data



In [ ]:

    
x = rand(d, 1_000)



In [ ]:

    
fit_mle(Normal, x)



In [ ]:

    
(mean(d), std(d)), (mean(x), std(x))



In [ ]:

    
methods(mean)

Bayesian Updating with Conjugate Priors



In [ ]:

    
x = rand(Bernoulli(0.9), 10_000)



In [ ]:

    
posterior(Beta(3, 3), Bernoulli, x)
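
For reference, the update computed above is the standard Beta-Bernoulli conjugate result. The symbols α, β, n and s below are notation introduced here, not names used in the notebook:

```latex
p \sim \mathrm{Beta}(\alpha, \beta), \quad
x_1, \dots, x_n \mid p \sim \mathrm{Bernoulli}(p), \quad
s = \sum_{i=1}^{n} x_i
\;\Longrightarrow\;
p \mid x \sim \mathrm{Beta}(\alpha + s,\; \beta + n - s)
```

With the Beta(3, 3) prior and 10,000 draws, the posterior is Beta(3 + s, 10003 - s), whose mean (3 + s) / 10006 should sit near the true success probability of 0.9.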

Kernel Density Estimation



In [ ]:

    
using Gadfly, KernelDensity  # kde is provided by KernelDensity.jl



In [ ]:

    
x = rand(Gamma(3, 3), 100_000)



In [ ]:

    
k = kde(x)



In [ ]:

    
fieldnames(typeof(k))



In [ ]:

    
set_default_plot_size(25cm, 15cm)



In [ ]:

    
plot(x = k.x, y = k.density,
     Guide.xlabel("x"), Guide.ylabel("Estimated Density"),
     Geom.line)

Tabular Data and Missing Values in Julia

Representing Missing Values



In [ ]:

    
using DataFrames



In [ ]:

    
missing + 1



In [ ]:

    
x = Union{Int, Missing}[1, 2, 3]



In [ ]:

    
Any[1, 2, missing]



In [ ]:

    
x[1] = missing



In [ ]:

    
mean(x)



In [ ]:

    
x[.!ismissing.(x)]



In [ ]:

    
mean(skipmissing(x))

Factor-Like Variables



In [ ]:

    
using CategoricalArrays
y = categorical([1, 1, 2, 3])



In [ ]:

    
levels(y)

Representing Tabular Data



In [ ]:

    
df = DataFrame(A = float(1:10), B = rand(10))



In [ ]:

    
first(df, 6)



In [ ]:

    
last(df, 6)



In [ ]:

    
df.C = repeat(["G1", "G2"], inner = 5)



In [ ]:

    
using CategoricalArrays
df.C = categorical(df.C)  # convert the column to a factor-like categorical vector



In [ ]:

    
df.C



In [ ]:

    
levels(df.C)



In [ ]:

    
repeat([1 2; 3 4], inner = (2, 1), outer = (1, 2))



In [ ]:

    
z = Union{Complex{Int}, Missing}[1 + 2im]



In [ ]:

    
z[1] = missing



In [ ]:

    
DataFrame(A = [DataFrame(B = 1:2), DataFrame(C = 3:4)])



In [ ]:

    
df[1:10, :]



In [ ]:

    
combine(groupby(df, :C), :B => mean)



In [ ]:

    
filter(:C => ==("G1"), df)



In [ ]:

    
df[df.C .== "G1", :]



In [ ]:

    
df.C .== "G1"



In [ ]:

    
df.A .+ df.B

Accessing Classical Datasets



In [ ]:

    
using RDatasets



In [ ]:

    
iris = dataset("datasets", "iris")



In [ ]:

    
first(iris, 6)



In [ ]:

    
plot(iris,
     x = :PetalLength, y = :PetalWidth, color = :Species,
     Geom.point)

Converting DataFrames to Design Matrices



In [ ]:

    
using StatsModels
ModelMatrix(ModelFrame(@formula(A ~ B), df))

DataFrame I/O



In [ ]:

    
using CSV
CSV.write("df.csv", df)



In [ ]:

    
df



In [ ]:

    
df2 = CSV.read("df.csv", DataFrame)

Merging Data Sets



In [ ]:

    
A = DataFrame(X = 1:3, Z = ["A", "B", "C"])



In [ ]:

    
B = DataFrame(Y = 4:6, Z = ["A", "B", "B"])



In [ ]:

    
innerjoin(A, B, on = :Z)



In [ ]:

    
leftjoin(A, B, on = :Z)



In [ ]:

    
rightjoin(A, B, on = :Z)



In [ ]:

    
outerjoin(A, B, on = :Z)

Split-Apply-Combine Operations



In [ ]:

    
combine(groupby(iris, :Species), nrow)



In [ ]:

    
combine(groupby(iris, :Species), :PetalLength => mean)



In [ ]:

    
combine(groupby(iris, :Species), nrow => :N)

GLMs in Julia



In [ ]:

    
using GLM



In [ ]:

    
glm(@formula(B ~ A), df, Binomial())



In [ ]:

    
glm(@formula(A ~ B), df, Poisson())

Optimization in Julia



In [ ]:

    
using Optim



In [ ]:

    
f(x::Vector) = (10.73 - x[1])^2 + (1134.29 - x[2])^4



In [ ]:

    
f([0.0, 0.0])



In [ ]:

    
optimize(f, [0.0, 0.0])



In [ ]:

    
optimize(f, [0.0, 0.0], LBFGS())
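
The result returned by optimize can be queried instead of read off the printed summary. The following is a minimal sketch, assuming a recent Optim.jl in which the solver is passed as an object such as LBFGS() and the gradient is approximated by finite differences when none is supplied. The variable name res is introduced here for illustration.

```julia
using Optim

# Same objective as above: the true minimum is at (10.73, 1134.29).
f(x::Vector) = (10.73 - x[1])^2 + (1134.29 - x[2])^4

# No gradient is given, so Optim falls back to finite differences.
res = optimize(f, [0.0, 0.0], LBFGS())

Optim.minimizer(res)   # point where the solver stopped
Optim.minimum(res)     # objective value at that point
Optim.converged(res)   # whether a convergence criterion was met
Optim.iterations(res)  # number of iterations used
```

Because the second term is quartic, the objective is very flat near the optimum in x[2], so the second coordinate may be recovered less precisely than the first.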

Maximum Likelihood Estimation in Julia



In [ ]:

    
x = rand(Normal(11, 3), 1_000)



In [ ]:

    
function makenll(x)
    nll(params::Vector) = -loglikelihood(Normal(params[1], 3), x)
end



In [ ]:

    
nll = makenll(x)



In [ ]:

    
nll([0.0])



In [ ]:

    
nll([10.0])



In [ ]:

    
optimize(nll, [0.0])



In [ ]:

    
mean(x)
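
The comparison with mean(x) works because, with the standard deviation held fixed, the maximum likelihood estimate of a normal mean is the sample mean. A short derivation, using symbols (ℓ, μ, σ, n) that are introduced here rather than taken from the notebook:

```latex
\ell(\mu) = -\frac{n}{2} \log\!\big(2 \pi \sigma^2\big)
            - \frac{1}{2 \sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2,
\qquad
\frac{\partial \ell}{\partial \mu}
  = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu) = 0
\;\Longrightarrow\;
\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i
```

Here σ = 3 is fixed inside makenll, so the minimizer reported by optimize should agree with mean(x) up to the solver's tolerance.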

More resources:

NLopt
JuMP

ML Algorithms



In [ ]:

    
using RDatasets



In [ ]:

    
iris = dataset("datasets", "iris")



In [ ]:

    
using Clustering



In [ ]:

    
kmeans(Matrix(iris[:, 1:4])', 3)  # the four numeric columns, one observation per column



In [ ]:

    
combine(groupby(iris, :Species),
        :SepalLength => mean => :A,
        :SepalWidth  => mean => :B,
        :PetalLength => mean => :C,
        :PetalWidth  => mean => :D)