You will mainly need two packages to make statistical plots: DataFrames and StatPlots.
In [20]:
# import packages
using DataFrames #data manipulation package
import RDatasets #collection of well-known datasets
using StatPlots #a package for statistical plots
gr() # set gr as plotting backend (the thing that actually builds the plot)
school = RDatasets.dataset("mlmRev","Hsb82"); #Load a test dataset
head(school) # display first few rows of dataset
Out[20]:
Your data should be organized in a dataframe, just like the one displayed above. If you are unsure what a dataframe is, it is basically a series of labelled columns of the same length, where each column is a variable and each row is a datapoint. Please see the DataFrames.jl documentation.
The main function we'll be using is groupapply, which in turn calls get_groupederror. To see what a function does, type ?function_name and press return. For more information, please refer to the StatPlots.jl README.
In [21]:
?groupapply #this information explains the pipeline of the data analysis performed to produce the plot
Out[21]:
Below you'll see examples from the StatPlots README showing how to perform sophisticated analyses with simple commands.
Here we want to compute the cumulative distribution function of the variable :MAch in our dataframe school. We want the data to be grouped by the variable :Sx (i.e. first you divide your data points by :Sx, run a separate analysis on each subdataframe and plot the results together in different colors) and we want to compute the s.e.m. across different schools.
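To make "s.e.m. across different schools" concrete, here is a minimal sketch of the idea, not the StatPlots implementation: compute one empirical CDF per school on a common grid, then take the pointwise mean and standard error across those curves. The helper names (ecdf_at, mean_and_sem) and the scores are made up for illustration, and the assumption that (:across, :School) works roughly this way comes from the description above rather than from the package source. Only Base functions are used.

```julia
# Empirical CDF of `values`, evaluated at each point of `grid`:
# the fraction of observations less than or equal to the grid point.
ecdf_at(values, grid) = [count(v -> v <= g, values) / length(values) for g in grid]

# Pointwise mean and s.e.m. of a collection of curves (one curve per school).
function mean_and_sem(curves)
    n = length(curves)
    m = sum(curves) / n                                  # pointwise mean
    v = sum([(c .- m) .^ 2 for c in curves]) / (n - 1)   # pointwise sample variance
    return m, sqrt.(v ./ n)                              # s.e.m. = std / sqrt(n)
end

# Made-up scores for three hypothetical schools (not taken from Hsb82).
scores_by_school = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 5.0], [1.0, 4.0, 4.0, 6.0, 7.0]]
grid = [2.0, 4.0, 6.0]

curves = [ecdf_at(s, grid) for s in scores_by_school]
cdf_mean, cdf_sem = mean_and_sem(curves)
```

With grouping, the same computation would simply be repeated once per value of the grouping variable, giving one mean curve and one error band per group.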
In [22]:
#Run the statistical analysis
grp_error = groupapply(:cumulative, school, :MAch; group = :Sx, compute_error = (:across, :School))
#Plot the outcome (possibly personalized)
plt = plot(grp_error)
Out[22]:
In [23]:
# Save a pdf of the plot:
savefig(plt,"cdf.pdf")
Once we have the outcome of the statistical analysis (in this case grp_error::StatPlots.GroupedError) we can plot it in potentially personalized ways. The keyword line will determine the type of plot: path (default), scatter or bar. The other keywords work as in any other plot with Plots.jl. As is always the case in Plots.jl, you can pass different values of the keyword to the different traces by inputting them as a row vector (see color = ["blue" "black"] in the following example).
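The row-vector convention is easy to get wrong, so here is a short illustration in plain Julia of the difference between the two literals. The variable names are only for this example.

```julia
row = ["blue" "black"]   # spaces, no commas: a 1×2 matrix, one entry per series
col = ["blue", "black"]  # commas: a 2-element column vector

size(row)  # (1, 2)
size(col)  # (2,)
```

Plots.jl treats the columns of an attribute as belonging to different series, which is why the per-trace colors are written with spaces rather than commas.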
In [24]:
#Plot the outcome (possibly personalized)
plt = plot(grp_error, line = :path, grid = false, xlabel = "MAch", ylabel = "cdf",
color = ["blue" "black"], fillalpha = 0.2)
Out[24]:
groupapply also supports local regression. For a continuous x axis it uses Loess.jl, whereas the discrete case is done by binning together all the y values corresponding to a given x value and computing the mean.
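The discrete case described above can be sketched in a few lines of plain Julia. This is an illustration of the binning idea only, with a hypothetical helper name (binned_mean) and made-up data; it is not the code StatPlots runs.

```julia
# Collect the y values that share an x value and average each bin.
function binned_mean(x, y)
    levels = sort(unique(x))
    means = [sum(y[x .== v]) / count(x .== v) for v in levels]
    return levels, means
end

x = [1, 2, 1, 3, 2, 1]
y = [2.0, 4.0, 4.0, 10.0, 6.0, 6.0]

levels, means = binned_mean(x, y)
# levels == [1, 2, 3], means == [4.0, 5.0, 10.0]
```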
In [25]:
grp_error = groupapply(:locreg, school, :MAch, :CSES; group = :Sx, compute_error = (:across, :School))
plot(grp_error, line = :path)
Out[25]:
In [26]:
# two alternative syntaxes are also available
grp_error1 = groupapply(school, :MAch, :CSES; group = :Sx, compute_error = (:across, :School))
grp_error2 = groupapply(:CSES, school, :MAch; group = :Sx, compute_error = (:across, :School))
p1 = plot(grp_error1, line = :path)
p2 = plot(grp_error2, line = :path)
plot(p1,p2)
Out[26]:
Let's look at density plots. Keywords used by the analysis function can be given directly to groupapply, which passes them on to the function that performs the analysis. Here the keyword is bandwidth, which determines the degree of smoothing: a small value gives an irregular curve, while a large value may flatten out important information.
This is shown in the example below. The example also shows that bootstrap is a slow way of computing the error: if your dataset is big, prefer across wherever it makes sense.
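To see why the bandwidth matters, here is a minimal Gaussian kernel density estimate written with Base functions only. It is a sketch of the general technique under the assumption that the density analysis is a standard kernel density estimate; the function name gauss_kde and the sample are made up, and this is not what StatPlots calls internally.

```julia
# Gaussian kernel density estimate of `data`, evaluated at the point `x`.
# Each observation contributes a bell curve of width `bw` centred on itself.
function gauss_kde(data, x, bw)
    total = sum(exp(-((x - d) / bw)^2 / 2) for d in data)
    return total / (length(data) * bw * sqrt(2pi))
end

sample = [-1.0, 0.0, 0.2, 1.5]

narrow = gauss_kde(sample, 0.0, 0.1)  # small bandwidth: spiky estimate
wide = gauss_kde(sample, 0.0, 5.0)    # large bandwidth: nearly flat estimate
```

With a small bandwidth the estimate is dominated by the observations closest to x, while a large bandwidth spreads every observation over a wide range, which is the flattening described above.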
In [27]:
# one analysis and one plot per bandwidth value
grp_errors = Array(StatPlots.GroupedError,4)
plts = Array(Plots.Plot, 4)
bandwidths = [0.01, 0.1, 1., 5.]
for i = 1:4
    grp_errors[i] = groupapply(:density, school, :CSES;
        bandwidth = bandwidths[i], compute_error = (:bootstrap,500), group = :Sx)
    plts[i] = plot(grp_errors[i], line = :path, legend = false, title = "bandwidth = $(bandwidths[i])")
end
# show the four plots side by side
plot(plts...)
Out[27]:
In case you are still waiting for the previous cell to output a plot: yes, as mentioned in the StatPlots documentation, bootstrap works on everything but is computationally very demanding.
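The cost comes from repeating the whole analysis on many resampled datasets. The sketch below shows the idea for the simplest possible statistic, the mean, using Base functions only. The name bootstrap_sem and the data are hypothetical, and the exact resampling scheme behind (:bootstrap, 500) is an assumption here.

```julia
# Bootstrap estimate of the standard error of the mean:
# resample the data with replacement `nboot` times, compute the mean of each
# resample, and take the standard deviation of those means.
function bootstrap_sem(data, nboot)
    n = length(data)
    boot_means = [sum(data[rand(1:n, n)]) / n for _ in 1:nboot]
    m = sum(boot_means) / nboot
    return sqrt(sum(abs2, boot_means .- m) / (nboot - 1))
end

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 10.0, 12.0]
se = bootstrap_sem(data, 500)   # varies slightly from run to run
```

Here each resample only costs a sum, but for a density or a local regression every one of the 500 resamples requires a full fit, which is why the cell above takes so long.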
You can also choose a categorical x variable. In that case groupapply can be used as a convenient syntax to get grouped bar plots. Here we compute the mean and s.e.m. of :MAch for males and females, subdividing each group according to the variable :Minrty.
In [28]:
pool!(school, :Sx) #specify that a variable is categorical
grp_error = groupapply(school, :Sx, :MAch; compute_error = :across, group = :Minrty)
plot(grp_error, line = :bar)
Out[28]:
Plots.jl can also be combined with Interact.jl, a particularly useful library for interactive programming in Jupyter notebooks.
In [29]:
using Interact
If a plot depends on some "dynamic" variables, it will update as soon as they are changed.
In [30]:
# simple example: play with the n slider to get more or less points and with the s slider
# if you want to increase or decrease their size
@manipulate for n = 1:100, s = 1:0.5:10
    scatter(rand(n), rand(n), markersize = s, markerstrokealpha = 0, grid = false, legend = false)
end
Out[30]:
In [31]:
# choose a dataset
school = RDatasets.dataset("mlmRev","Hsb82");
# possible things to plot on the y axis
ys = vcat([:density, :cumulative], names(school))
# possible things to plot on the x axis
xs = vcat(names(school), :constant)
# possible ways of splitting the data
groups = vcat(names(school), [:no_grouping])
# possible axis types
axis_types = [:discrete, :continuous]
# Add a constant column, to be used if you want to plot only one value that doesn't depend on anything.
# E.g. mean and sem of :MAch can be thought of as :MAch as a function of :constant
school[:constant] = "constant"
# Add another constant column in case you don't want to group your data
school[:no_grouping] = "";
Play with the widgets below to explore the dataset (the example dataset school is provided here, but you can use your own data). Step by step:
x: select the variable on the x axis; choose "constant" if you don't have any relevant x variable.
y: select what to plot on the y axis (density, cumulative, or one of the columns).
group: select the variable used to split the data (you could also set this up to use more than one variable to group the data).
err: by default the error is computed across :School; you can turn it off by setting err = false.
As an example, try setting x = CSES, y = density, group = Sx, err = true, axis_type = continuous, plot_type = bar and play with the bandwidth slider to smooth the plot. The span slider is used in local regression with a continuous axis.
Sometimes you may, by mistake, attempt an impossible plot. For example, you may try to group by School, even though this variable is already used to compute the standard error. In this case, you'll get the message "Choose carefully!!!". Sometimes your choice made sense but, after splitting the data, some condition had too little data for a sophisticated analysis (like :locreg), in which case you may still get an error. A good thing to try is to stop splitting the data across schools by setting err = false.
In [32]:
s = fill(plot(),()) # a trick to be able to save the plot! Every time we run an analysis, s[1] gets updated, see next cell
@manipulate for x in xs,
    y in ys,
    group in groups,
    err in [true, false],
    axis_type in axis_types,
    plot_type in [:path, :scatter, :bar],
    bandwidth in 0.01:0.01:2,
    span in 0.01:0.01:1
    # Add extra keyword needed for smoothing
    kwargs = []
    if axis_type == :continuous
        if y in names(school)
            kwargs = [(:span, span)]
        elseif y == :density
            kwargs = [(:bandwidth, bandwidth)]
        end
    end
    # try the plot, otherwise make a silly plot and tell the user to try again!
    try
        ge = groupapply(y, school, x; axis_type = axis_type, group = group,
            compute_error = err ? (:across, :School) : :none, kwargs...)
        s[:] = plot(ge, line = plot_type, grid = false, xlabel = string(x), ylabel = string(y))
    catch
        s[:] = plot(; legend = false, grid = false, annotations = [(0.5, 0.5, "Choose carefully!!!")])
    end
end
Out[32]:
In [33]:
# to save your plot
savefig(s[1], "exploratoryplot.pdf")