A day with (the) Julia (language)

Análisis de datos


In [1]:
ENV["LINES"]   = 10 
ENV["COLUMNS"] = 60;
  • DataFrame para manejar datos tabulares, datos faltantes...
  • Gadfly para hacer rápidamente gráficos estadísticos al estilo de ggplot2
  • Plots para graficar independiente del backend, soporte para animaciones, gráficos interactivos con Plotly o 3D, etc.
  • IJulia para trabajar con IPython/Jupyter notebooks, e Interact
  • Test estadísticos, clustering, etc. con los paquetes de la organización JuliaStats

In [2]:
using DataFrames
benchmark = readtable(joinpath("data","benchmark.csv"), separator=';')


Out[2]:
BenchmarkFortranJuliaPythonRMatlabOctaveMathematicaJavaScriptGoLuaJITJava
1fib0.72.1177.76533.5226.899324.35118.533.361.861.711.21
2parse_int5.051.4517.0245.73802.529581.4415.026.061.25.773.35
3quicksort1.311.1532.89264.544.921866.0143.232.71.292.032.6
4mandel0.810.7915.3253.167.58451.815.130.661.110.671.35
5pi_sum1.01.021.999.561.0299.311.691.011.01.01.0
6rand_mat_stat1.451.6617.9314.5614.5230.935.952.32.963.273.92
7rand_mat_mul3.481.021.141.571.121.121.315.071.421.162.36

In [3]:
longformat = melt(benchmark, [:Benchmark]) # Reshaping: wide -> long


Out[3]:
variablevalueBenchmark
1Fortran0.7fib
2Fortran5.05parse_int
3Fortran1.31quicksort
4Fortran0.81mandel
5Fortran1.0pi_sum
6Fortran1.45rand_mat_stat
7Fortran3.48rand_mat_mul
8Julia2.11fib
9Julia1.45parse_int
10Julia1.15quicksort
&vellip&vellip&vellip&vellip

In [4]:
meantime = by(longformat, :variable, df -> mean( df[:value] )) # Split-Apply-Combine


Out[4]:
variablex1
1Fortran1.9714285714285715
2Go1.5485714285714285
3Java2.255714285714286
4JavaScript4.451428571428571
5Julia1.3114285714285712
6LuaJIT2.23
7Mathematica27.264285714285712
8Matlab122.64999999999999
9Octave3079.281428571429
10Python26.292857142857144
&vellip&vellip&vellip

In [5]:
sort!(meantime, cols=[:x1]) # Sorting
names!(meantime, [:Lenguaje, :Tiempo_Medio])


Out[5]:
LenguajeTiempo_Medio
1Julia1.3114285714285712
2Go1.5485714285714285
3Fortran1.9714285714285715
4LuaJIT2.23
5Java2.255714285714286
6JavaScript4.451428571428571
7Python26.292857142857144
8Mathematica27.264285714285712
9Matlab122.64999999999999
10R131.80571428571426
&vellip&vellip&vellip

Descripción (estadística) del dataset (columnas), similar a summary de R.


In [6]:
describe(meantime)


Lenguaje
Length  11
Type    Symbol
NAs     0
NA%     0.0%
Unique  11

Tiempo_Medio
Min      1.3114285714285712
1st Qu.  2.100714285714286
Median   4.451428571428571
Mean     309.1875324675325
3rd Qu.  74.95714285714286
Max      3079.281428571429
NAs      0
NA%      0.0%

Usando funciones de R


In [8]:
using RCall

In [9]:
ks = R"""ks.test($(benchmark[:Julia]), $(benchmark[:R]))"""


Out[9]:
RCall.RObject{RCall.VecSxp}

	Two-sample Kolmogorov-Smirnov test

data:  `#JL`$`(benchmark[:Julia])` and `#JL`$`(benchmark[:R])`
D = 0.85714, p-value = 0.008159
alternative hypothesis: two-sided


In [10]:
rcopy(ks) # Transforma los datos de R a tipos de datos de Julia


Out[10]:
Dict{Symbol,Any} with 5 entries:
  :statistic          => 0.857143
  :alternative        => "two-sided"
  Symbol("p.value")   => 0.00815851
  :method             => "Two-sample Kolmogorov-Smirnov tes…
  Symbol("data.name") => "`#JL`\$`(benchmark[:Julia])` and …

JuliaStats

Muchos de los tests estadísticos más clásicos están definidos en alguna biblioteca de Julia


In [12]:
using HypothesisTests

In [13]:
ApproximateTwoSampleKSTest(benchmark[:Julia], benchmark[:R])


Out[13]:
Approximate two sample Kolmogorov-Smirnov test
----------------------------------------------
Population details:
    parameter of interest:   Supremum of CDF differences
    value under h_0:         0.0
    point estimate:          0.857142857142857

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.011681952358448919

Details:
    number of observations:   [7,7]
    KS-statistic:              1.603567451474546

In [14]:
SignedRankTest(
convert(Vector{Float64}, benchmark[:Julia]), 
convert(Vector{Float64}, benchmark[:R])
) # Wilcoxon signed rank test


Out[14]:
Exact Wilcoxon signed rank test
-------------------------------
Population details:
    parameter of interest:   Location parameter (pseudomedian)
    value under h_0:         0
    point estimate:          -44.279999999999994
    95% confidence interval: (-291.89,-6.7250000000000005)

Test summary:
    outcome with 95% confidence: reject h_0
    two-sided p-value:           0.015625000000000003

Details:
    number of observations:      7
    Wilcoxon rank-sum statistic: 0.0
    rank sums:                   [0.0,28.0]
    adjustment for ties:         0.0

In [16]:
using GLM

lineal = lm(@formula(R ~ Julia), benchmark)


Out[16]:
DataFrames.DataFrameRegressionModel{GLM.LinearModel{GLM.LmResp{Array{Float64,1}},GLM.DensePredChol{Float64,Base.LinAlg.Cholesky{Float64,Array{Float64,2}}}},Array{Float64,2}}

Formula: R ~ 1 + Julia

Coefficients:
             Estimate Std.Error t value Pr(>|t|)
(Intercept)  -247.159   200.519 -1.2326   0.2725
Julia         288.971   145.487 1.98623   0.1037

In [17]:
inter, pend = coef(lineal)


Out[17]:
2-element Array{Float64,1}:
 -247.159
  288.971

In [22]:
using Plots, StatPlots

In [24]:
scatter(benchmark[:Julia], benchmark[:R], legend=false, xlab="Julia", ylab="R")
Plots.abline!(pend, inter)


Out[24]:

In [25]:
using Distributions

β = rand(Beta(2,2), 1000)


Out[25]:
1000-element Array{Float64,1}:
 0.326723
 0.327877
 0.615056
 ⋮       
 0.787692
 0.636218

Usando bibliotecas de R


In [27]:
# R"""
# install.packages("modeest")
# """

@rimport modeest as rmode

β_mode = rmode.hsm(β) # half-sample mode


This is package 'modeest' written by P. PONCET.
For a complete list of functions, use 'library(help = "modeest")' or 'help.start()'.

Out[27]:
RCall.RObject{RCall.RealSxp}
[1] 0.5082594

In [28]:
rcopy(β_mode)


Out[28]:
0.5082594499045219

Clustering


In [29]:
iris = readtable(joinpath("data", "iris.csv"))


Out[29]:
SepalLengthSepalWidthPetalLengthPetalWidthSpecies
15.13.51.40.2setosa
24.93.01.40.2setosa
34.73.21.30.2setosa
44.63.11.50.2setosa
55.03.61.40.2setosa
65.43.91.70.4setosa
74.63.41.40.3setosa
85.03.41.50.2setosa
94.42.91.40.2setosa
104.93.11.50.1setosa
&vellip&vellip&vellip&vellip&vellip&vellip

In [31]:
using Clustering

cl = kmeans(convert(Matrix{Float64}, iris[:, [:PetalWidth, :PetalLength]])', 3)


Out[31]:
Clustering.KmeansResult{Float64}([1.35926 0.246 2.04783; 4.29259 1.462 5.62609],[2,2,2,2,2,2,2,2,2,2  …  3,3,3,3,3,3,3,3,3,3],[0.00596,0.00596,0.02836,0.00356,0.00596,0.08036,0.00676,0.00356,0.00596,0.02276  …  0.124707,0.340359,0.29862,0.13862,0.209924,0.245142,0.413837,0.183837,0.114707,0.338185],[54,50,46],[54.0,50.0,46.0],31.412885668276868,5,true)

In [32]:
cl.centers'


Out[32]:
3×2 Array{Float64,2}:
 1.35926  4.29259
 0.246    1.462  
 2.04783  5.62609

In [33]:
by(iris, :Species, df -> (mean(df[:PetalWidth]), mean(df[:PetalLength])))


Out[33]:
Speciesx1
1setosa(0.24600000000000002,1.462)
2versicolor(1.3259999999999998,4.260000000000001)
3virginica(2.026,5.5520000000000005)