The package can be added with Julia's package manager:
In [ ]:
Pkg.add("Discretizers")
Once the installation is complete, you can use it anywhere by running:
In [1]:
using Discretizers
A CategoricalDiscretizer maps each unique value in the data to an integer label:
In [2]:
data = [:cat, :dog, :dog, :cat, :cat, :elephant]
catdisc = CategoricalDiscretizer(data);
The resulting object can be used to encode source labels into their integer categorical labels:
In [3]:
println(":cat becomes: ", encode(catdisc, :cat))
println(":dog becomes: ", encode(catdisc, :dog))
println("data becomes: ", encode(catdisc, data))
You can also transform back, decoding integer labels into the original values:
In [4]:
println("1 becomes: ", decode(catdisc, 1))
println("2 becomes: ", decode(catdisc, 2))
println("[1,2,3] becomes: ", decode(catdisc, [1,2,3]))
The CategoricalDiscretizer works with any object type
In [5]:
CategoricalDiscretizer(["A", "B", "C"])
CategoricalDiscretizer([5000, 1200, 100])
CategoricalDiscretizer([:dog, "hello world", NaN]);
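For example, a discretizer built over strings encodes and decodes the same way; a small sketch of our own (the name strdisc is ours):
In [ ]:
strdisc = CategoricalDiscretizer(["A", "B", "C"])
encode(strdisc, "B")  # 2, since labels are assigned in order of first appearance
decode(strdisc, 3)    # "C"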
A LinearDiscretizer maps a continuous range into discrete bins defined by bin edges:
In [6]:
bin_edges = [0.0,0.5,1.0]
lindisc = LinearDiscretizer(bin_edges);
Encoding works the same way
In [7]:
println("0.2 becomes: ", encode(lindisc, 0.2))
println("0.7 becomes: ", encode(lindisc, 0.7))
println("0.5 becomes: ", encode(lindisc, 0.5))
println("it works on arrays: ", encode(lindisc, [0.0,0.8,0.2]))
Decoding is a bit different: we obtain the bin and sample from it uniformly.
In [8]:
println("1 becomes: ", decode(lindisc, 1))
println("2 becomes: ", decode(lindisc, 2))
println("it works on arrays: ", decode(lindisc, [2,1,2]))
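Because decoding samples uniformly at random within the bin, repeated calls generally return different values; a quick illustrative cell of our own:
In [ ]:
# three independent draws from bin 1, which spans [0.0, 0.5]
[decode(lindisc, 1) for i in 1:3]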
Some other functions are supported
In [9]:
println("number of labels: ", nlabels(catdisc), " ", nlabels(lindisc))
println("bin centers: ", bincenters(lindisc))
println("extrema of a bin: ", extrema(lindisc, 2))
Both discretizers can be constructed to map to other integer types
In [10]:
catdisc = CategoricalDiscretizer(data, Int32)
lindisc = LinearDiscretizer(bin_edges, UInt8)
encode(lindisc, 0.2)
Out[10]:
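As a quick check, the encoded labels should now carry the requested integer types (a sketch of our own; we assume encode returns the discretizer's label type):
In [ ]:
typeof(encode(catdisc, :dog))  # expected Int32
typeof(encode(lindisc, 0.2))   # expected UInt8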
Discretizers.jl also provides several algorithms for computing bin edges automatically from data.
Uniform Width
DiscretizeUniformWidth(nbins) - divides the domain evenly into nbins bins.
Uniform Count
DiscretizeUniformCount(nbins) - divides the domain into nbins bins, each containing approximately the same number of samples.
Bayesian Blocks
DiscretizeBayesianBlocks() - determines an appropriate number of bins by maximizing a Bayesian fitness function.
See this website for an overview.
In [11]:
nbins = 3
data = randn(1000)
edges = binedges(DiscretizeUniformWidth(nbins), data)
Out[11]:
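The other two edge-finding algorithms are invoked the same way; a minimal sketch reusing the same data vector (our own cell, not from the original):
In [ ]:
# equal-count bins over the same data
edges_count = binedges(DiscretizeUniformCount(nbins), data)
# Bayesian Blocks chooses the number of bins itself
edges_bayes = binedges(DiscretizeBayesianBlocks(), data)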
In [12]:
using PGFPlots
using Distributions
# draw samples from a mixture of Cauchy distributions
# and filter them to a reasonable range
srand(0)
data = [rand(Cauchy(-5, 1.8),  500);
        rand(Cauchy(-4, 0.8), 2000);
        rand(Cauchy(-1, 0.3),  500);
        rand(Cauchy( 2, 0.8), 1000);
        rand(Cauchy( 4, 1.5),  500)]
data = filter!(x->-15.0 <= x <= 15.0, data)
g = GroupPlot(3, 1, groupStyle = "horizontal sep = 1.75cm")
discalgs = [("Uniform Width",   DiscretizeUniformWidth(15)),
            ("Uniform Count",   DiscretizeUniformCount(15)),
            ("Bayesian Blocks", DiscretizeBayesianBlocks())]
for (name, discalg) in discalgs
    disc = LinearDiscretizer(binedges(discalg, data))
    counts = get_discretization_counts(disc, data)
    arr_x, arr_y = get_histogram_plot_arrays(disc.binedges, counts ./ binwidths(disc))
    push!(g, Axis(Plots.Linear(arr_x, convert(Vector{Float64}, arr_y), style="const plot, mark=none, fill=blue!60"),
                  ymin=0, xlabel="x", ylabel="pdf(x)", title=name))
end
g
Out[12]:
In [13]:
g = GroupPlot(3, 3, groupStyle = "horizontal sep = 1.75cm, vertical sep = 1.5cm")
discalgs = [:sqrt,    # used by Excel and others for its simplicity and speed
            :sturges, # R's default method; only good for near-Gaussian data
            :rice,    # commonly overestimates the number of bins required
            :doane,   # improves on Sturges' for non-normal datasets
            :scott,   # less robust estimator that takes data variability and data size into account
            :fd,      # Freedman-Diaconis estimator; robust
            :auto,    # maximum of :fd and :sturges; good all-round performance
            ]
for discalg in discalgs
    disc = LinearDiscretizer(binedges(DiscretizeUniformWidth(discalg), data))
    counts = get_discretization_counts(disc, data)
    arr_x, arr_y = get_histogram_plot_arrays(disc.binedges, counts ./ binwidths(disc))
    push!(g, Axis(Plots.Linear(arr_x, convert(Vector{Float64}, arr_y), style="const plot, mark=none, fill=blue!60"),
                  ymin=0, title=string(discalg)))
end
g
Out[13]:
Another algorithm, MODL, finds optimal bin edges given both a continuous data set and an associated set of discrete labels.
In [14]:
data = [randn(100); randn(100)+1.0]
labels = [fill(:cat, 100); fill(:dog, 100)]
integer_labels = encode(CategoricalDiscretizer([:cat, :dog]), labels)
edges = binedges(DiscretizeMODL_Optimal(), data, integer_labels)
Out[14]:
More information on MODL can be found here.
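For larger data sets, the package also provides faster greedy MODL variants; a minimal sketch with the same inputs (constructor names are taken from the package README; the usage here is our own):
In [ ]:
# greedy search, and greedy search with post-optimization
edges_greedy = binedges(DiscretizeMODL_Greedy(), data, integer_labels)
edges_post   = binedges(DiscretizeMODL_PostGreedy(), data, integer_labels)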