Example notebook showing how to build exploratory visualizations of your data, both programmatically and interactively.

You will mainly need two packages to make statistical plots: DataFrames and StatPlots.


In [20]:
# import packages
using DataFrames #data manipulation package
import RDatasets #collection of well-known datasets
using StatPlots #a package for statistical plots
gr() # set gr as plotting backend (the thing that actually builds the plot)

school = RDatasets.dataset("mlmRev","Hsb82"); #Load a test dataset
head(school) # display first few rows of dataset


Out[20]:
Row  School  Minrty  Sx      SSS     MAch    MeanSES               Sector  CSES
1    1224    No      Female  -1.528  5.876   -0.43438297872340426  Public  -1.0936170212765957
2    1224    No      Female  -0.588  19.708  -0.43438297872340426  Public  -0.1536170212765957
3    1224    No      Male    -0.528  20.349  -0.43438297872340426  Public  -0.09361702127659577
4    1224    No      Male    -0.668  8.781   -0.43438297872340426  Public  -0.23361702127659578
5    1224    No      Male    -0.158  17.898  -0.43438297872340426  Public  0.2763829787234042
6    1224    No      Male    0.022   4.583   -0.43438297872340426  Public  0.4563829787234043

Your data should be organized in a dataframe, just like the one displayed above. If you are unsure what a dataframe is, it is essentially a collection of labelled columns of the same length, where each column is a variable and each row is a data point. Please see the DataFrames.jl documentation.
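To make the "labelled columns of the same length" idea concrete, here is a minimal sketch in plain Julia (written for Julia 1.x, no packages). It uses a NamedTuple of vectors as a stand-in for a dataframe; the values and the helper `row` are made up for illustration and are not part of DataFrames.jl.

```julia
# A dataframe-like table sketched as a NamedTuple of equal-length columns.
# The values are made up for illustration; they are not the Hsb82 data.
table = (School = [1, 1, 2, 2],
         Sx     = ["Female", "Male", "Female", "Male"],
         MAch   = [5.9, 19.7, 20.3, 8.8])

# Every column has the same length: one entry per data point (row).
nrows = length(table.School)
@assert all(length(col) == nrows for col in values(table))

# Hypothetical helper: reading one row means taking the same index from every column.
row(t, i) = map(col -> col[i], t)
println(row(table, 2))
```

A real DataFrame adds much more (missing values, joins, grouping), but the layout is the same: one named vector per variable.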

The main function we'll be using is groupapply, which in turn will call get_groupederror. To see what a function does, you can type ?function_name and press Return. For more information, please refer to the StatPlots.jl README.


In [21]:
?groupapply # this help text explains the data analysis pipeline used to produce the plot


search: groupapply GroupApplied

Out[21]:
groupapply(f::Function, df, args...;
            axis_type = :auto, compute_error = :none, group = [],
            summarize = (get_symbol(compute_error) == :bootstrap) ? (mean, std) : (mean, sem),
            kwargs...)

Split df by group. Then apply get_groupederror to get a population summary of the grouped data. Output is a GroupedError with error computed according to the keyword compute_error. It can be plotted using plot(g::GroupedError). Seriestype can be specified to be :path, :scatter or :bar.

groupapply(s::Symbol, df, args...; kwargs...)

s can be :locreg, :density or :cumulative, in which case the corresponding built in analysis function is used. s can also be a symbol of a column of df, in which case the call is equivalent to groupapply(:locreg, df, args[1], s; kwargs...)

groupapply(df::AbstractDataFrame, x, y; kwargs...)

Equivalent to groupapply(:locreg, df::AbstractDataFrame, x, y; kwargs...)

Below you'll see examples from the StatPlots README showing how to perform sophisticated analyses with simple commands.

An explained example

Here we want to compute the cumulative distribution function (cdf) of the variable :MAch in our dataframe school. We want the data to be grouped by the variable :Sx (i.e. the data points are first split by :Sx, a separate analysis is run on each sub-dataframe, and the results are plotted together in different colors), and we want to compute the s.e.m. across different schools.


In [22]:
#Run the statistical analysis
grp_error = groupapply(:cumulative, school, :MAch; group = :Sx,compute_error = (:across, :School))
#Plot the outcome (possibly personalized)
plt = plot(grp_error)


Out[22]:
[Plot: cumulative distribution curves for Male and Female; x axis 0 to 20, y axis 0.0 to 1.0]

In [23]:
# Save a pdf of the plot:
savefig(plt,"cdf.pdf")
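The idea of a cdf with s.e.m. computed across schools can be sketched in plain Julia (Julia 1.x, standard library only). This is not StatPlots' implementation, only an illustration under simple assumptions: one cdf is computed per school on a common grid, then the mean and s.e.m. are taken across schools at each grid point. The data and the helper `ecdf_on` are made up; in the grouped case the same steps would be repeated for each value of the grouping variable.

```julia
using Statistics

# Toy data (made up): one entry per student.
school = [1, 1, 1, 2, 2, 2, 3, 3, 3]
mach   = [4.0, 9.0, 15.0, 6.0, 12.0, 20.0, 3.0, 10.0, 18.0]

# Hypothetical helper: empirical cdf of `xs` evaluated at each point of `grid`.
ecdf_on(xs, grid) = [count(v -> v <= g, xs) / length(xs) for g in grid]

grid = 0.0:5.0:20.0

# One cdf per school, stored as the columns of a matrix.
ids  = unique(school)
cdfs = hcat([ecdf_on(mach[school .== id], grid) for id in ids]...)

# Summary across schools: mean and standard error of the mean at each grid point.
cdf_mean = vec(mean(cdfs, dims = 2))
cdf_sem  = vec(std(cdfs, dims = 2)) ./ sqrt(length(ids))

println(cdf_mean)
println(cdf_sem)
```

Here the error bars describe the variability between schools rather than between individual students.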

Once we have the outcome of the statistical analysis (in this case grp_error::StatPlots.GroupedError), we can plot it and customize the result. The keyword line will determine the type of plot: path (default), scatter or bar. The other keywords work as in any other plot with Plots.jl. As is always the case in Plots.jl, you can pass different values of a keyword to the different traces by giving them as a row vector (see color = ["blue" "black"] in the following example).


In [24]:
#Plot the outcome (possibly personalized)
plt = plot(grp_error, line = :path, grid = false, xlabel = "MAch", ylabel = "cdf",
color = ["blue" "black"], fillalpha = 0.2)


Out[24]:
[Plot: cdf of MAch for Male and Female with customized styling; x axis "MAch" (0 to 20), y axis "cdf" (0.0 to 1.0)]

groupapply also supports local regression. For a continuous x axis it uses Loess.jl, whereas the discrete case is done by binning together all the y values corresponding to a given x value and computing the mean.


In [25]:
grp_error = groupapply(:locreg, school, :MAch, :CSES; group = :Sx, compute_error = (:across, :School))
plot(grp_error, line = :path)


Out[25]:
[Plot: local regression curves for Male and Female; x axis 0 to 20, y axis -0.4 to 0.4]
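The discrete case described above (pooling all y values that share an x value and taking their mean) is simple enough to sketch in plain Julia. The data below are made up, and this is only an illustration of the binning idea, not the code StatPlots runs.

```julia
using Statistics

# Toy data (made up): a discrete x variable and a continuous y variable.
x = [1, 2, 1, 3, 2, 3, 1]
y = [2.0, 4.0, 4.0, 9.0, 6.0, 11.0, 3.0]

# For each distinct x value, pool the matching y values and take their mean.
xvals  = sort(unique(x))
ymeans = [mean(y[x .== v]) for v in xvals]

for (v, m) in zip(xvals, ymeans)
    println("x = $v: mean y = $m")
end
```

The continuous case replaces this exact matching with a smooth local fit, which is where the span parameter comes in.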

In [26]:
# two alternative syntaxes are also available
grp_error1 = groupapply(school, :MAch, :CSES; group = :Sx, compute_error = (:across, :School))
grp_error2 = groupapply(:CSES, school, :MAch; group = :Sx, compute_error = (:across, :School))

p1 = plot(grp_error1, line = :path)
p2 = plot(grp_error2, line = :path)

plot(p1,p2)


Out[26]:
[Plot: two side-by-side panels, each showing local regression curves for Male and Female; x axis 0 to 20, y axis -0.4 to 0.4]

Let's look at density plots. Keywords used by the analysis function can be given directly to groupapply, which passes them on to the function that performs the analysis. In this case the keyword is bandwidth, which determines the degree of smoothing: a small value gives an irregular curve, while a large one may flatten out important information. This is shown in the example below. The example also shows that bootstrap is a slow way of computing the error; if your dataset is big, across is the better choice wherever it makes sense.


In [27]:
grp_errors = Array(StatPlots.GroupedError,4)
plts = Array(Plots.Plot, 4)
bandwidths = [0.01, 0.1, 1., 5.]
for i = 1:4
    grp_errors[i] = groupapply(:density, school, :CSES;
                                bandwidth = bandwidths[i], compute_error = (:bootstrap,500), group = :Sx)
    plts[i] = plot(grp_errors[i], line = :path,legend = false, title = "bandwidth = $(bandwidths[i])")
end
plot(plts...)


Out[27]:
[Plot: four density panels titled "bandwidth = 0.01", "bandwidth = 0.1", "bandwidth = 1.0" and "bandwidth = 5.0"; x axis -2 to 2 in each panel]
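To see why the bandwidth changes the picture so much, here is a minimal Gaussian kernel density estimate in plain Julia (Julia 1.x, no packages). It is a textbook sketch, not the estimator StatPlots calls internally; the sample and the function names `gauss` and `kde` are made up for illustration.

```julia
# Standard normal density, used as the smoothing kernel.
gauss(u) = exp(-u^2 / 2) / sqrt(2pi)

# Gaussian kernel density estimate of `data` at the point `t` with bandwidth `h`.
kde(data, t, h) = sum(gauss((t - d) / h) for d in data) / (length(data) * h)

data = [-1.0, -0.8, 0.1, 0.3, 1.2]   # made-up sample
grid = -6:0.01:6

for h in (0.1, 1.0)
    dens = [kde(data, t, h) for t in grid]
    # Riemann sum of the estimate (close to 1 for a density) and its highest peak.
    area = sum(dens) * step(grid)
    println("bandwidth = $h: area = ", round(area, digits = 3),
            ", peak = ", round(maximum(dens), digits = 3))
end
```

A small h puts a narrow bump on every data point, giving tall, irregular peaks; a large h spreads each bump out, lowering the peaks and blurring the structure.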

In case you are still waiting for the previous cell to output a plot: yes, as mentioned in the StatPlots documentation, bootstrap works on everything but is computationally very demanding.
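The cost of bootstrap comes from recomputing the statistic on many resampled copies of the data. A minimal sketch of the generic technique in plain Julia (Julia 1.x, standard library only) is shown here; the helper `bootstrap_std` and the data are made up, and StatPlots' own implementation may differ.

```julia
using Random, Statistics

# Hypothetical helper: bootstrap standard error of `stat`.
# Resample the data with replacement `nboot` times, recompute `stat` each time,
# and return the standard deviation of the recomputed values.
function bootstrap_std(stat, data, nboot, rng)
    n = length(data)
    replicates = [stat(data[rand(rng, 1:n, n)]) for _ in 1:nboot]
    return std(replicates)
end

rng  = MersenneTwister(1)     # fixed seed so the run is repeatable
data = randn(rng, 200)        # made-up sample

err_boot = bootstrap_std(mean, data, 500, rng)
err_sem  = std(data) / sqrt(length(data))   # closed-form s.e.m. for comparison

println("bootstrap: ", err_boot, "   s.e.m.: ", err_sem)
```

For the mean, the two estimates should be close, but the bootstrap needed 500 passes over the data to get there. Its advantage is that the same recipe works for statistics that have no simple formula for the error.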

Bar plots

You can also choose a categorical x variable. In that case groupapply can be used as a convenient syntax to get grouped bar plots. Here we compute the mean and s.e.m. of :MAch for males and females, subdividing each group according to the variable :Minrty.


In [28]:
pool!(school, :Sx) #specify that a variable is categorical
grp_error = groupapply(school, :Sx, :MAch; compute_error = :across, group = :Minrty)
plot(grp_error, line = :bar)


Out[28]:
[Plot: grouped bar chart; x axis categories Female and Male, y axis 0 to 15; groups: No, Yes]
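The numbers behind a grouped bar plot are one mean and one s.e.m. per combination of x category and group. A minimal sketch in plain Julia follows (Julia 1.x, standard library only); the data are made up, the helper `sem` is defined here, and the error is taken across individual data points, which is an assumption about what the cell above computes.

```julia
using Statistics

# Toy data (made up): two categorical columns and one numeric column.
sx     = ["Female", "Female", "Male", "Male", "Female", "Male", "Female", "Male"]
minrty = ["No", "Yes", "No", "Yes", "No", "No", "Yes", "Yes"]
mach   = [10.0, 8.0, 14.0, 9.0, 12.0, 16.0, 6.0, 11.0]

sem(v) = std(v) / sqrt(length(v))   # standard error of the mean

# One (mean, sem) pair per combination of x category and group.
stats = Dict{Tuple{String,String},Tuple{Float64,Float64}}()
for s in unique(sx), m in unique(minrty)
    v = mach[(sx .== s) .& (minrty .== m)]
    stats[(s, m)] = (mean(v), sem(v))
end

for (key, (mu, err)) in stats
    println(key, ": mean = ", mu, ", sem = ", err)
end
```

Each dictionary entry corresponds to one bar: the mean sets its height and the s.e.m. its error bar.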

Adding interactivity to your plots

Plots.jl can also be combined with Interact.jl, a particularly useful library for interactive programming in Jupyter notebooks.


In [29]:
using Interact

If a plot depends on some "dynamic" variables, it will update as soon as they are changed.


In [30]:
# simple example: use the n slider to get more or fewer points and the s slider
# to increase or decrease their size
@manipulate for n = 1:100, s = 1:0.5:10
    scatter(rand(n), rand(n), markersize = s, markerstrokealpha = 0, grid = false, legend = false)
end


Out[30]:
[Plot: scatter plot of random points; both axes span roughly 0 to 1]

Let's put it all together! We can set up a generic analysis of our data with the groupapply functionality and change all we want interactively.


In [31]:
# choose a dataset
school = RDatasets.dataset("mlmRev","Hsb82");
# possible things to plot on the y axis
ys = vcat([:density, :cumulative], names(school))
# possible things to plot on the x axis
xs = vcat(names(school), :constant)
# possible ways of splitting the data
groups = vcat(names(school), [:no_grouping])
# possible axis types
axis_types = [:discrete, :continuous]

# Add a constant column, to be used if you want to plot only one value that doesn't depend on anything.
# E.g. mean and sem of :MAch can be thought of as :MAch as a function of :constant
school[:constant] = "constant"

# Add another constant column in case you don't want to group your data
school[:no_grouping] = "";

Play with the widgets below to explore the dataset (here the example dataset "school" is provided, but you can use your own data). Step by step:

  • Click on the widgets next to x to select the variable on the x axis; choose "constant" if you don't have any relevant x variable.
  • Choose what to plot on the y axis: either the pdf or cdf of x, or another variable y as a function of x.
  • Group the data according to the value of one column using group (you could also set this up to use more than one variable to group the data).
  • The error is assumed to be the s.e.m. across :School; you can turn it off by setting err = false.
  • Specify whether the x axis should be continuous or discrete.
  • Specify what type of plot you want (path, scatter, bar).
  • Finally, if your analysis has some smoothing variables, you can play with those interactively using the sliders.

Suggested first example

As an example, try setting x = CSES, y = density, group = Sx, err = true, axis_type = continuous, plot_type = bar and play with the bandwidth slider to smooth the plot. The span slider is used for local regression with a continuous axis.

In case of error

Sometimes you may, by mistake, ask for an impossible plot. For example, you may try to group by School, whereas this variable is already used to compute the standard error. In this case, you'll get the message: "Choose carefully!!!". Sometimes your choice made sense, but after splitting the data, some condition had too little data for a sophisticated analysis (like :locreg), in which case you may still get an error. A good thing to try is to stop splitting the data across schools by setting err = false.


In [32]:
s = fill(plot(),()) # a trick to be able to save the plot! Every time we run an analysis, s[1] gets updated, see next cell

@manipulate for x in xs,
                y in ys,
                group in groups,
                err in [true, false],
                axis_type in axis_types,
                plot_type in [:path, :scatter, :bar],
                bandwidth in 0.01:0.01:2,
                span in 0.01:0.01:1
    # Add extra keyword needed for smoothing
    kwargs = []
    if axis_type == :continuous
        if y in names(school)
            kwargs = [(:span, span)]
        elseif y == :density
            kwargs = [(:bandwidth, bandwidth)]
        end
    end
    # try the plot, otherwise make a silly plot and tell the user to try again!
    try
        ge = groupapply(y, school, x; axis_type = axis_type, group = group,
                        compute_error = err ? (:across, :School) : :none, kwargs...)
        s[:] = plot(ge, line = plot_type, grid = false, xlabel = string(x), ylabel = string(y))
    catch
        s[:] = plot(;legend = false, grid = false, annotations = [(0.5,0.5, "Choose carefully!!!")])
    end
end


Out[32]:
[Plot: x axis "School", y axis "density", both ranging from 0.0 to 1.0; the legend has one entry per school ID]
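The cell above forwards a smoothing keyword only when it applies to the chosen analysis. That pattern can be sketched on its own in plain Julia (Julia 1.x). The function `analyse` is a hypothetical stand-in for an analysis function, `smoothing_kwargs` is a made-up helper, and the default values are arbitrary; keywords are stored as Symbol => value pairs and splatted after the semicolon.

```julia
# Hypothetical stand-in for an analysis function with optional smoothing keywords.
analyse(; span = nothing, bandwidth = nothing) = (span = span, bandwidth = bandwidth)

# Made-up helper: pick the smoothing keyword that matches the requested analysis, if any.
function smoothing_kwargs(y, axis_type; span = 0.75, bandwidth = 0.5)
    axis_type == :continuous || return Pair{Symbol,Float64}[]
    y == :density    && return [:bandwidth => bandwidth]
    y == :cumulative && return Pair{Symbol,Float64}[]
    return [:span => span]   # any other y is treated as a column name: local regression
end

# Only the relevant keyword reaches the analysis function.
println(analyse(; smoothing_kwargs(:density, :continuous)...))
println(analyse(; smoothing_kwargs(:MAch, :continuous)...))
println(analyse(; smoothing_kwargs(:MAch, :discrete)...))
```

Building the keyword list first keeps the call site identical for every combination of widget values.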

In [33]:
# to save your plot
savefig(s[1], "exploratoryplot.pdf")
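The saving trick relies on a mutable container that outlives each run of the interactive block, so the most recent result can be read back afterwards. A minimal sketch of the same idea in plain Julia (Julia 1.x) uses a Ref instead of the zero-dimensional array from the notebook; this is a swapped-in technique, and the strings stand in for plots.

```julia
# A mutable one-slot container; `latest[]` reads or writes its content.
latest = Ref{Any}(nothing)

# Stand-in for the body of an interactive block: each run overwrites the slot.
for n in (10, 20, 30)
    latest[] = "result computed with n = $n"
end

# The value from the most recent run is still available here.
println(latest[])
```

Whatever was stored last is what a later cell would save.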
