Basic Statistics in Julia


In [ ]:
using Stats

In [ ]:
# Seed the global random number generator so the draws below are reproducible
srand(1)

In [ ]:
x = rand(100)

In [ ]:
minimum(x)

In [ ]:
median(x)

In [ ]:
maximum(x)

In [ ]:
quantile(x, [0.0, 0.5, 1.0])
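The three quantiles requested above correspond to the minimum, median and maximum computed in the preceding cells. A minimal sketch of that check follows. It assumes a current Julia 1.x release, where these functions live in the `Statistics` standard library and `Random.seed!` takes the place of `srand`. That differs from the package setup used in this notebook.

```julia
using Random, Statistics

Random.seed!(1)            # fixed seed so the draws are reproducible
x = rand(100)              # 100 draws from Uniform(0, 1)

# Quantiles at probabilities 0, 0.5 and 1
q = quantile(x, [0.0, 0.5, 1.0])

# Each comparison is expected to print `true`
println(q[1] ≈ minimum(x))
println(q[2] ≈ median(x))
println(q[3] ≈ maximum(x))
```

The comparisons use `≈` rather than `==` because the quantile is computed by interpolation, so the results may differ from the order statistics in the last floating-point digit.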

In [ ]:
describe(x)

Probability Distributions in Julia


In [ ]:
using Distributions

In [ ]:
x = rand(Gamma(1, 2), 100)

Standard R Functions with Simpler Names


In [ ]:
d = Normal(0, 1)

In [ ]:
pdf(d, 0.0)

In [ ]:
cdf(d, 0.0)

In [ ]:
quantile(d, 0.1)

In [ ]:
rand(d)

In [ ]:
rand(Categorical([0.1, 0.9]))

In [ ]:
rand(sampler(Categorical([0.5, 0.5])))

In [ ]:
Categorical([0.5, 0.5])

In [ ]:
sampler(Categorical([0.5, 0.5]))

Additional Abstractions around PDFs, CDFs, etc.


In [ ]:
quantile(d, [0.25, 0.75])

In [ ]:
# Average negative log-likelihood per draw: a Monte Carlo estimate of the entropy of d
-loglikelihood(d, rand(d, 100_000)) / 100_000
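The cell above can be read as a Monte Carlo estimate of the entropy of `d`. By the law of large numbers, the average negative log-density of draws from a distribution converges to its differential entropy. A sketch for the standard normal used here, where $n$, $x_i$ and $p$ are notation introduced for this note:

```latex
-\frac{1}{n}\sum_{i=1}^{n}\log p(x_i)
\;\xrightarrow{\;n\to\infty\;}\;
-\int p(x)\,\log p(x)\,dx \;=\; H(p),
\qquad
H\bigl(\mathcal{N}(0,1)\bigr) \;=\; \tfrac{1}{2}\log(2\pi e) \;\approx\; 1.4189
```

The estimate should therefore land close to the value that `entropy(d)` reports.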

Theoretical Properties of Distributions


In [ ]:
entropy(d)

In [ ]:
mean(d)

In [ ]:
skewness(d)

In [ ]:
kurtosis(d)

In [ ]:
var(d)

In [ ]:
modes(d)

Fit Distributions to Data


In [ ]:
x = rand(d, 1_000)

In [ ]:
fit_mle(Normal, x)

In [ ]:
# Theoretical (mean, std) of d next to the sample estimates from x
(mean(d), std(d)), (mean(x), std(x))

In [ ]:
methods(mean)

Bayesian Updating with Conjugate Priors


In [ ]:
x = rand(Bernoulli(0.9), 10_000)

In [ ]:
posterior(Beta(3, 3), Bernoulli, x)
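For reference, the standard Beta-Bernoulli update behind the call above can be written as follows. The symbols $\alpha$, $\beta$, $n$ and $s$ are notation introduced here: $s$ is the number of ones among the $n$ observations.

```latex
\theta \sim \mathrm{Beta}(\alpha, \beta), \quad
x_i \mid \theta \sim \mathrm{Bernoulli}(\theta)
\;\;\Longrightarrow\;\;
\theta \mid x_1,\dots,x_n \sim \mathrm{Beta}(\alpha + s,\; \beta + n - s),
\qquad s = \sum_{i=1}^{n} x_i
```

With the `Beta(3, 3)` prior and $n = 10{,}000$ draws used here, the posterior is $\mathrm{Beta}(3 + s,\ 10{,}003 - s)$, with mean $(3 + s)/10{,}006$. Since the data were drawn with success probability 0.9, that mean should sit close to 0.9.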

Kernel Density Estimation


In [ ]:
using Gadfly

In [ ]:
x = rand(Gamma(3, 3), 100_000)

In [ ]:
k = kde(x)
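As a reminder of what is being estimated, the textbook form of a kernel density estimate is given below. The kernel $K$ and bandwidth $h$ are generic symbols. Which kernel and bandwidth `kde` picks by default is not shown in this notebook, so no particular choice is assumed.

```latex
\hat{f}_h(t) \;=\; \frac{1}{n h} \sum_{i=1}^{n} K\!\left(\frac{t - x_i}{h}\right),
\qquad K \ge 0, \quad \int K(u)\,du = 1
```

Each observation $x_i$ contributes one bump of width $h$, and the estimate is the average of those bumps.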

In [ ]:
names(Distributions.UnivariateKDE)

In [ ]:
set_default_plot_size(25cm, 15cm)

In [ ]:
plot(x = k.x, y = k.density,
     Guide.XLabel("x"), Guide.YLabel("Estimated Density"),
     Geom.line)

Tabular Data and Missing Values in Julia

Representing Missing Values


In [ ]:
using DataFrames

In [ ]:
NA + 1

In [ ]:
x = DataArray([1, 2, 3])

In [ ]:
{1, 2, NA}

In [ ]:
x[1] = NA

In [ ]:
mean(x)

In [ ]:
x[!isna(x)]

In [ ]:
# Average only the non-missing entries
mean(x[!isna(x)])

Factor-Like Variables


In [ ]:
y = PooledDataArray([1, 1, 2, 3])

In [ ]:
levels(y)

Representing Tabular Data


In [ ]:
df = DataFrame(A = float(1:10), B = rand(10))

In [ ]:
head(df)

In [ ]:
tail(df)

In [ ]:
df["C"] = repeat(["G1", "G2"], inner = [5])

In [ ]:
pool!(df, ["C"])

In [ ]:
df["C"]

In [ ]:
levels(df["C"])

In [ ]:
repeat([1 2; 3 4], inner = [2, 1], outer = [1, 2])

In [ ]:
z = DataArray([1 + 2im])

In [ ]:
z[1] = NA

In [ ]:
DataFrame(A = [DataFrame(B = 1:2), DataFrame(C = 3:4)])

In [ ]:
df[1:10, :]

In [ ]:
by(df, "C", df -> mean(df["B"]))

In [ ]:
select(:(C .== "G1"), df)

In [ ]:
df[:(C .== "G1"), :]

In [ ]:
df["C"] .== "G1"

In [ ]:
with(df, :(A + B))

Accessing Classical Datasets


In [ ]:
using RDatasets

In [ ]:
iris = data("datasets", "iris")

In [ ]:
head(iris)

In [ ]:
plot(iris,
     x = "Petal.Length", y = "Petal.Width", color = "Species",
     Geom.point)

Converting DataFrames to Design Matrices


In [ ]:
ModelMatrix(ModelFrame(:(A ~ B), df))

DataFrame I/O


In [ ]:
writetable("df.csv", df)

In [ ]:
df

In [ ]:
df2 = readtable("df.csv")

Merging Data Sets


In [ ]:
A = DataFrame(X = 1:3, Z = ["A", "B", "C"])

In [ ]:
B = DataFrame(Y = 4:6, Z = ["A", "B", "B"])

In [ ]:
join(A, B, on = "Z")

In [ ]:
join(A, B, on = "Z", kind = :inner)

In [ ]:
join(A, B, on = "Z", kind = :left)

In [ ]:
join(A, B, on = "Z", kind = :right)

In [ ]:
join(A, B, on = "Z", kind = :outer)

Split-Apply-Combine Operations


In [ ]:
by(iris, "Species", nrow)

In [ ]:
by(iris, "Species", df -> mean(df["Petal.Length"]))

In [ ]:
by(iris, "Species", :(N = size(_DF, 1)))

GLMs in Julia


In [ ]:
using GLM

In [ ]:
glm(:(B ~ A), df, Binomial())

In [ ]:
glm(:(A ~ B), df, Poisson())
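The two calls above do not name a link function. Assuming the canonical link for each family (which appears to be what GLM.jl uses when none is given, though this notebook does not confirm it), the fitted models would be the ones below. $\mu_i$ denotes the expected response for row $i$, and the coefficients $\beta_0, \beta_1$ are notation introduced here.

```latex
\text{Binomial:}\quad \log\frac{\mu_i}{1 - \mu_i} \;=\; \beta_0 + \beta_1 A_i,
\qquad \mu_i = \mathbb{E}[B_i]
\\[6pt]
\text{Poisson:}\quad \log \mu_i \;=\; \beta_0 + \beta_1 B_i,
\qquad \mu_i = \mathbb{E}[A_i]
```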

Optimization in Julia


In [ ]:
using Optim

In [ ]:
f(x::Vector) = (10.73 - x[1])^2 + (1134.29 - x[2])^4

In [ ]:
f([0.0, 0.0])

In [ ]:
optimize(f, [0.0, 0.0])

In [ ]:
optimize(f, [0.0, 0.0], method = :l_bfgs)
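The optimizer output can be checked against the known solution of this test function. The objective is a sum of even powers, so it is non-negative and vanishes only where both terms are zero:

```latex
f(x) = (10.73 - x_1)^2 + (1134.29 - x_2)^4,
\qquad
\nabla f(x) =
\begin{pmatrix}
-2\,(10.73 - x_1) \\
-4\,(1134.29 - x_2)^3
\end{pmatrix},
\qquad
x^\ast = (10.73,\; 1134.29), \quad f(x^\ast) = 0
```

The second derivative in $x_2$ is $12\,(1134.29 - x_2)^2$, which goes to zero at $x^\ast$. The surface is therefore very flat around the minimum in that direction, and a solver may stop with the second coordinate noticeably further from its target than the first.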

Maximum Likelihood Estimation in Julia


In [ ]:
x = rand(Normal(11, 3), 1_000)

In [ ]:
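# Build the negative log-likelihood for data x as a closure.
# The standard deviation is fixed at 3, so only the mean (params[1]) is estimated.
function makenll(x)
    nll(params::Vector) = -loglikelihood(Normal(params[1], 3), x)
end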

In [ ]:
nll = makenll(x)

In [ ]:
nll([0.0])

In [ ]:
nll([10.0])

In [ ]:
optimize(nll, [0.0])

In [ ]:
mean(x)
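The sample mean is the natural comparison for the optimizer's answer. For a Normal model with known standard deviation $\sigma$ (fixed at 3 in `makenll`), the negative log-likelihood is minimized exactly at the sample mean. A short derivation, with $n$ denoting the number of observations:

```latex
\mathrm{NLL}(\mu)
= n\log\sigma + \frac{n}{2}\log(2\pi) + \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_i - \mu)^{2},
\qquad
\frac{d\,\mathrm{NLL}}{d\mu}
= -\frac{1}{\sigma^{2}}\sum_{i=1}^{n}(x_i - \mu) = 0
\;\;\Longrightarrow\;\;
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i
```

The value returned by `optimize(nll, [0.0])` should therefore agree with `mean(x)` up to the solver's tolerance.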

More resources:

  • NLopt
  • JuMP

ML Algorithms


In [ ]:
using RDatasets

In [ ]:
iris = data("datasets", "iris")

In [ ]:
using Clustering

In [ ]:
kmeans(matrix(iris[:, 2:5])', 3)

In [ ]:
by(iris, "Species", df -> DataFrame(A = mean(df[2]),
                                    B = mean(df[3]),
                                    C = mean(df[4]),
                                    D = mean(df[5])))