Distributions en Julia

Ulises M. Alvarez

Instalación de software

Python y actualización



In [1]:

    
ENV["PYTHON"] = ""  # Declaramos la variable PYTHON nula, para reproducibilidad
Pkg.update()









    



INFO: Updating METADATA...
INFO: Computing changes...
INFO: No packages to install, update or remove

Función para acelerar y facilitar la instalación de paquetes



In [2]:

    
function p_it(p::String...)
    # Instala el paquete 'p'
    #
    # Args:
    #   p: el nombre completo del paquete::String
    #
    # Returns:
    #   Mensaje de error o notifica de la instalación vía Base.Pkg.add
    #
    ENV["PYTHON"]=""
    for i = 1:length(p)
        if !contains(==, Pkg.available(), p[i])
            println(p[i], ", no disponible.")
        else
            if !haskey(Pkg.installed(), p[i])
                Pkg.add(p[i])
            end
        end
     end
end









    Out[2]:





p_it (generic function with 1 method)

Para saber más sobre la función, lo invitamos a consultar la libreta "Paquetes en Julia II":

https://github.com/umalvarez/TallerJuR/blob/master/julia/PaquetesEnJulia-II.ipynb

Paquetes



In [3]:

    
p_it("Distributions",
    "GLM",
    "RDatasets",
    "Plots",
    "PlotlyJS",
    "PyPlot",
    "GR",
    "UnicodePlots",
    "PGFPlots",
    "StatPlots",
    "PlotRecipes",
    "Blink")



In [4]:

    
using Blink
# Blink.AtomShell.install()  # Ejecutar sólo una vez

Distributions

Normal



In [5]:

    
using Distributions, Plots, StatPlots









    



INFO: Recompiling stale cache file /home/uma/.julia/lib/v0.6/LineSearches.ji for module LineSearches.
INFO: Recompiling stale cache file /home/uma/.julia/lib/v0.6/Optim.ji for module Optim.
INFO: Recompiling stale cache file /home/uma/.julia/lib/v0.6/KernelDensity.ji for module KernelDensity.



In [6]:

    
pyplot()                       # motor de gráficos para Plot









    Out[6]:





Plots.PyPlotBackend()



In [7]:

    
plot(Normal(),               # Gráfica de ~ N(μ, σ)
    label = "~ N(μ, σ)",     # Definimos leyenda
    fill  = (0, .5,:green),  # Color de fondo
    fmt   = :svg)            # Opción para generar las notas

#\mu = μ 
#\sigma = σ









    Out[7]:



In [8]:

    
params(Normal())          # Obtenemos los parámetros: μ, σ









    Out[8]:





(0.0, 1.0)



In [9]:

    
x0 = rand(Normal(), 10000)
histogram!(x0, lab = "")









    Out[9]:



In [10]:

    
plot(Normal(3,5),          # Alteramos los parámetros ~N(μ, σ)
    label = "~ N(3, 5)",   # Definimos leyenda
    fill=(0, .5,:orange),  # Color de fondo
    fmt = :svg)            # Opción para generar las notas









    Out[10]:

Distribución exponencial



In [11]:

    
plot(Exponential(),    # exponencial, λ por defecto
fill=(0, .5, :green),  # Color de fondo con transparencia
    fmt = :svg)        # Opción para generar las notas









    Out[11]:



In [12]:

    
params(Exponential())









    Out[12]:





(1.0,)



In [13]:

    
plot(Exponential(),         # Exponencial, λ por defecto
    fill=(0, .5, :orange),  # Color de fondo con transparencia
    label = "λ = 1.0",      # Definimos leyenda
    fmt = :svg)             # Opción para generar las notas









    Out[13]:



In [14]:

    
plot!(Exponential(0.5),    # Exponencial, λ = 0.5
    fill=(0, .5, :green),  # Color de fondo con transparencia
    label = "λ = 0.5",     # Definimos leyenda
    fmt = :svg)            # Opción para generar las notas









    Out[14]:



In [15]:

    
# savefig("exponential")

Gamma



In [16]:

    
dist = Gamma()

scatter(dist, 
       leg = false)

bar!(dist, 
    func=cdf, 
    alpha=0.3,
    xlabel = "Gamma()",
    ylabel = "n",
    fmt = :svg)            # Opción para generar las notas









    Out[16]:



In [17]:

    
params(Gamma())









    Out[17]:





(1.0, 1.0)



In [18]:

    
dist = Gamma(2)

scatter(dist, leg=false)

bar!(dist, 
    func=cdf, 
    alpha=0.3,
    xlabel = "Gamma(2, 1)",
    ylabel = "n",
    fmt = :svg)            # Opción para generar las notas









    Out[18]:

Aplicaciones

Generación de números aleatorios



In [19]:

    
x = rand(Uniform(),             # Generamos aleatorios con d ~ unif()
    10000)                      # diez mil números

histogram(x,
        xlabel = "unif(0, 1)",  # leyenda en abscisas
        ylabel = "n",           # leyenda en ordenadas
        lab    = "",            # No leyenda
        fmt = :svg)             # Opción para generar las notas









    Out[19]:



In [20]:

    
mean(x)









    Out[20]:





0.497911869135094

Regresión lineal: ~ N(μ, σ)



In [21]:

    
using GLM, RDatasets

Los datos



In [22]:

    
data01 = dataset("datasets",  # usamos los 'datasets' de R
        "Formaldehyde")       # Nombre del conjunto a usar









    Out[22]:




Carb OptDen
1 0.1 0.086
2 0.3 0.269
3 0.5 0.446
4 0.6 0.538
5 0.7 0.626
6 0.9 0.782



In [23]:

    
scatter(data01[:Carb],      # la variable independiente, x
        data01[:OptDen],    # la variable dependiente, y
        xlabel = "Carb",    # Leyenda en las ordenadas
        ylabel = "OptDen",  # Leyenda en las abscisas 
        lab    = "",        # No queremos leyenda
        fmt    = :svg)      # Opción para generar las notas









    Out[23]:

La regresión lineal



In [24]:

    
lm1 = fit(LinearModel,            # El tipo de ajuste: linear
        @formula(OptDen ~ Carb),  # El módelo a evaluar: y ~ x
        data01)                   # El conjunto de datos









    Out[24]:





DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: OptDen ~ 1 + Carb

Coefficients:
               Estimate  Std.Error  t value Pr(>|t|)
(Intercept)  0.00508571 0.00783368 0.649211   0.5516
Carb           0.876286  0.0135345  64.7444    <1e-6



In [25]:

    
plot!(data01[:Carb],  # la variable independiente, x 
    predict(lm1),     # la variable dependiente, y
    lab = "lm",       # La leyenda
    fmt = :svg)       # Opción para generar las notas









    Out[25]:

Regresión "Probit"

Un modelo probit, es aquel en la que la variable dependiente sólo puede tomar uno de dos valores; i.e., es binaria.

Los datos



In [26]:

    
data02 = DataFrame(X=[1,2,3], Y=[1,0,1])  # Generamos datos en un DataFrame









    Out[26]:




X Y
1 1 1
2 2 0
3 3 1

La regresión tipo Probit



In [27]:

    
Probit = fit(GeneralizedLinearModel,  # El tipo de ajuste: linear generalizado 
        @formula(Y ~ X),              # El modelo a evaluar
        data02,                       # El conjunto de datos
        Binomial(),                   # El tipo de distribución a usar
        ProbitLink())                 # Regresión Probit









    Out[27]:





DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Binomial{Float64},GLM.ProbitLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: Y ~ 1 + X

Coefficients:
                 Estimate Std.Error      z value Pr(>|z|)
(Intercept)      0.430727   1.98019     0.217518   0.8278
X            -3.64399e-19   0.91665 -3.97534e-19   1.0000



In [28]:

    
vcov(Probit)









    Out[28]:





2×2 Array{Float64,2}:
  3.92116  -1.6805  
 -1.6805    0.840248

Regresión de Poisson

Un modelo de Poisson es aquel en el que la variable dependiente sigue una distribución de Poisson.

Los datos



In [29]:

    
dobson = DataFrame(Counts = [18.,17,15,20,10,20,25,13,12],
                          Outcome = gl(3,1,9),
                          Treatment = gl(3,3))









    



WARNING: Array{T}(::Type{T}, m::Int) is deprecated, use Array{T}(m) instead.
Stacktrace:
 [1] depwarn(::String, ::Symbol) at ./deprecated.jl:64
 [2] Array(::Type{Int64}, ::Int64) at ./deprecated.jl:51
 [3] gl(::Int64, ::Int64, ::Int64) at /home/uma/.julia/v0.6/DataArrays/src/statistics.jl:30
 [4] execute_request(::ZMQ.Socket, ::IJulia.Msg) at /home/uma/.julia/v0.6/IJulia/src/execute_request.jl:160
 [5] eventloop(::ZMQ.Socket) at /home/uma/.julia/v0.6/IJulia/src/eventloop.jl:8
 [6] (::IJulia.##11#14)() at ./task.jl:335
while loading In[29], in expression starting on line 1






    Out[29]:




Counts Outcome Treatment
1 18.0 1 1
2 17.0 2 1
3 15.0 3 1
4 20.0 1 2
5 10.0 2 2
6 20.0 3 2
7 25.0 1 3
8 13.0 2 3
9 12.0 3 3

La regresión de Poisson



In [30]:

    
gm2 = fit(GeneralizedLinearModel,            # El tipo de ajuste: linear generalizado 
    @formula(Counts ~ Outcome + Treatment),  # El modelo a evaluar
    dobson,                                  # Los datos
    Poisson())                               # Regresión de Poisson









    Out[30]:





DataFrames.DataFrameRegressionModel{GLM.GeneralizedLinearModel{GLM.GlmResp{Array{Float64,1},Distributions.Poisson{Float64},GLM.LogLink},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: Counts ~ 1 + Outcome + Treatment

Coefficients:
                  Estimate Std.Error     z value Pr(>|z|)
(Intercept)        3.04452  0.170899     17.8148   <1e-70
Outcome: 2       -0.454255  0.202171    -2.24689   0.0246
Outcome: 3       -0.292987  0.192742     -1.5201   0.1285
Treatment: 2   2.02908e-16       0.2 1.01454e-15   1.0000
Treatment: 3  -6.41139e-18       0.2 -3.2057e-17   1.0000



In [31]:

    
deviance(gm2)









    Out[31]:





5.129141077001141

Para saber más:

$Distributions.jl$: http://distributionsjl.readthedocs.io/en/stable/starting.html
$GLM.jl$: https://github.com/JuliaStats/GLM.jl
$StatPlots$: https://github.com/JuliaStats/GLM.jl

Reproducibilidad



In [32]:

    
versioninfo()









    



Julia Version 0.6.0-rc2.0
Commit 68e911be53* (2017-05-18 02:31 UTC)
Platform Info:
  OS: Linux (x86_64-solus-linux)
  CPU: Intel(R) Core(TM) i5-4590 CPU @ 3.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (NO_AFFINITY CORE2)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-4.0.0 (ORCJIT, haswell)



In [33]:

    
Dates.today()









    Out[33]:





2017-06-21

	Counts	Outcome	Treatment
1	18.0	1	1
2	17.0	2	1
3	15.0	3	1
4	20.0	1	2
5	10.0	2	2
6	20.0	3	2
7	25.0	1	3
8	13.0	2	3
9	12.0	3	3