Example notebook to learn how to do exploratory visualizations of your data, both programmatically and interactively

You will mainly need two packages for statistical plots: DataFrames and StatPlots.



In [20]:

    
# import packages
using DataFrames #data manipulation package
import RDatasets #collection of well-known datasets
using StatPlots #a package for statistical plots
gr() # set gr as plotting backend (the thing that actually builds the plot)

school = RDatasets.dataset("mlmRev","Hsb82"); #Load a test dataset
head(school) # display first few rows of dataset









    Out[20]:




     School  Minrty  Sx      SSS     MAch    MeanSES               Sector  CSES
1    1224    No      Female  -1.528  5.876   -0.43438297872340426  Public  -1.0936170212765957
2    1224    No      Female  -0.588  19.708  -0.43438297872340426  Public  -0.1536170212765957
3    1224    No      Male    -0.528  20.349  -0.43438297872340426  Public  -0.09361702127659577
4    1224    No      Male    -0.668  8.781   -0.43438297872340426  Public  -0.23361702127659578
5    1224    No      Male    -0.158  17.898  -0.43438297872340426  Public  0.2763829787234042
6    1224    No      Male    0.022   4.583   -0.43438297872340426  Public  0.4563829787234043

Your data should be organized in a dataframe, just like the one displayed above. If you are unsure what a dataframe is, it is basically a series of labelled columns of the same length, where each column is a variable and each row is a datapoint. Please see the DataFrames.jl documentation.
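To make the "labelled columns of the same length" idea concrete, here is a minimal sketch that uses a plain named tuple of vectors instead of a real DataFrame. It assumes a recent Julia (1.x); the names table and getrow are made up for this illustration, and DataFrames.jl offers a much richer type.

```julia
# A table as labelled columns of equal length
# (values taken from rows 1, 3 and 4 of the output above)
table = (Sx = ["Female", "Male", "Male"], MAch = [5.876, 20.349, 8.781])

ncols = length(table)            # number of variables (columns)
nrows = length(table.MAch)       # number of datapoints (rows)
same_length = all(col -> length(col) == nrows, values(table))

# Row i is one datapoint: the i-th entry of every column
getrow(t, i) = map(col -> col[i], t)   # hypothetical helper
third = getrow(table, 3)
```

Each field of table is a variable, and getrow collects one datapoint across all of them.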

The main function we'll be using is groupapply, which in turn calls get_groupederror. To see what a function does, type ?function_name and press Return. For more information, please refer to the StatPlots.jl README.



In [21]:

    
?groupapply # this information explains the pipeline of the data analysis performed to produce the plot









    



search: groupapply GroupApplied







    Out[21]:





groupapply(f::Function, df, args...;
            axis_type = :auto, compute_error = :none, group = [],
            summarize = (get_symbol(compute_error) == :bootstrap) ? (mean, std) : (mean, sem),
            kwargs...)
Split df by group. Then apply get_groupederror to get a population summary of the grouped data. Output is a GroupedError with error computed according to the keyword compute_error. It can be plotted using plot(g::GroupedError). Seriestype can be specified to be :path, :scatter or :bar.

groupapply(s::Symbol, df, args...; kwargs...)
s can be :locreg, :density or :cumulative, in which case the corresponding built in analysis function is used. s can also be a symbol of a column of df, in which case the call is equivalent to groupapply(:locreg, df, args[1], s; kwargs...)

groupapply(df::AbstractDataFrame, x, y; kwargs...)
Equivalent to groupapply(:locreg, df::AbstractDataFrame, x, y; kwargs...)

Below you'll see examples from the StatPlots README showing how to perform sophisticated analyses with simple commands.

An explained example

Here we want to compute the cumulative distribution function of the variable :MAch in our dataframe school. We want the data to be grouped by the variable :Sx (i.e. the data points are first divided by :Sx, a separate analysis is run on each sub-dataframe, and the results are plotted together in different colors), and we want to compute the s.e.m. across different schools.



In [22]:

    
# Run the statistical analysis
grp_error = groupapply(:cumulative, school, :MAch; group = :Sx, compute_error = (:across, :School))
# Plot the outcome (possibly personalized)
plt = plot(grp_error)









    Out[22]:
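To give an idea of what "cdf with s.e.m. across schools" means, the following is a rough sketch of the computation for a single group, written with the Julia standard library only (assuming Julia 1.x). It is not the code that groupapply runs: the helper ecdf_at and the toy scores are made up for illustration.

```julia
using Statistics

# Empirical CDF of the sample `xs`, evaluated at each point of `grid`
ecdf_at(xs, grid) = [count(v -> v <= g, xs) / length(xs) for g in grid]

# Toy data (made up): scores of one group, split by school
scores_by_school = Dict(
    "school A" => [5.0, 10.0, 15.0, 20.0],
    "school B" => [8.0, 12.0, 16.0],
    "school C" => [4.0, 9.0, 14.0, 19.0, 24.0],
)

grid = 0.0:5.0:25.0

# One curve per school: each column of `curves` is the ECDF of one school
curves = reduce(hcat, [ecdf_at(v, grid) for v in values(scores_by_school)])

# Average the curves across schools; the s.e.m. is std / sqrt(number of schools)
n_schools = size(curves, 2)
mean_curve = vec(mean(curves, dims = 2))
sem_curve = vec(std(curves, dims = 2)) ./ sqrt(n_schools)
```

In this picture, mean_curve would be the plotted line and sem_curve the width of the shaded band around it.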



In [23]:

    
# Save a pdf of the plot:
savefig(plt,"cdf.pdf")

Once we have the outcome of the statistical analysis (in this case grp_error::StatPlots.GroupedError), we can customize how it is plotted. The keyword line determines the type of plot: path (default), scatter or bar. The other keywords work as in any other Plots.jl plot. As always in Plots.jl, you can pass a different value of a keyword to each series by giving the values as a row vector (see color = ["blue" "black"] in the following example).



In [24]:

    
#Plot the outcome (possibly personalized)
plt = plot(grp_error, line = :path, grid = false, xlabel = "MAch", ylabel = "cdf",
color = ["blue" "black"], fillalpha = 0.2)









    Out[24]:
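The row vector used for color above is easy to confuse with an ordinary vector. The short sketch below (plain Julia 1.x, no plotting involved) shows the difference in shape; Plots.jl treats each column as one series, which is why the space-separated form gives one color per group.

```julia
colors_row = ["blue" "black"]    # spaces: a 1×2 matrix, one entry per series
colors_col = ["blue", "black"]   # commas: a 2-element (column) vector

row_shape = size(colors_row)
col_shape = size(colors_col)

# A column vector can be turned into a row vector with permutedims
converted = permutedims(colors_col)
```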

groupapply also supports local regression. For a continuous x axis it uses Loess.jl, whereas the discrete case is done by binning together all the y values corresponding to a given x value and computing the mean.



In [25]:

    
grp_error = groupapply(:locreg, school, :MAch, :CSES; group = :Sx, compute_error = (:across, :School))
plot(grp_error, line = :path)









    Out[25]:
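For the discrete case described above, the binning step can be sketched in a few lines: collect the y values that share an x value and average them. This is an illustration with made-up data using only the standard library (Julia 1.x assumed), not the StatPlots implementation.

```julia
using Statistics

# Toy data (made up): x takes a few discrete values, y is the measurement
x = [1, 2, 1, 3, 2, 1]
y = [2.0, 4.0, 4.0, 9.0, 6.0, 6.0]

# One bin per distinct x value, then the mean of the y values in each bin
xvals = sort(unique(x))
ymeans = [mean(y[x .== v]) for v in xvals]
```

The continuous case replaces this per-value average with a smooth local fit, which is where the span keyword comes in.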



In [26]:

    
# two alternative syntaxes are also available
grp_error1 = groupapply(school, :MAch, :CSES; group = :Sx, compute_error = (:across, :School))
grp_error2 = groupapply(:CSES, school, :MAch; group = :Sx, compute_error = (:across, :School))

p1 = plot(grp_error1, line = :path)
p2 = plot(grp_error2, line = :path)

plot(p1,p2)









    Out[26]:

Let's look at density plots. Keywords used by the analysis function can be given directly to groupapply, which passes them on to the function that performs the analysis. In this case the keyword is bandwidth, which sets the degree of smoothing: a small value gives an irregular curve, while a large one may flatten out important features. This is shown in the example below. The example also shows that bootstrap is a slow way of computing the error; if your dataset is big, prefer across wherever it makes sense.



In [27]:

    
grp_errors = Array(StatPlots.GroupedError,4)
plts = Array(Plots.Plot, 4)
bandwidths = [0.01, 0.1, 1., 5.]
for i = 1:4
    grp_errors[i] = groupapply(:density, school, :CSES;
                                bandwidth = bandwidths[i], compute_error = (:bootstrap,500), group = :Sx)
    plts[i] = plot(grp_errors[i], line = :path,legend = false, title = "bandwidth = $(bandwidths[i])")
end
plot(plts...)









    Out[27]:
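To see why the bandwidth changes the curve so much, here is a hand-written Gaussian kernel density estimate. It is only a sketch under the assumption of a Gaussian kernel, with made-up data and the hypothetical helper names gaussian_kde and area; it is not the routine that groupapply calls.

```julia
# Gaussian kernel density estimate of `data`, evaluated at each point of `grid`
function gaussian_kde(data, grid, bandwidth)
    kernel(u) = exp(-u^2 / 2) / sqrt(2pi)
    return [sum(kernel((g - d) / bandwidth) for d in data) / (length(data) * bandwidth)
            for g in grid]
end

data = [-1.2, -0.4, -0.3, 0.1, 0.2, 0.9]   # made-up values
grid = -6.0:0.01:6.0

rough = gaussian_kde(data, grid, 0.05)    # small bandwidth: a spike near each point
smooth = gaussian_kde(data, grid, 1.0)    # large bandwidth: one broad bump

# Both curves are densities, so the area under each should be close to 1
area(density) = sum(density) * step(grid)
```

With the small bandwidth every datapoint contributes a narrow, tall peak, while the large bandwidth spreads each point so widely that individual features merge.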

In case you are still waiting for the previous cell to output a plot: yes, as mentioned in the StatPlots documentation, bootstrap works on everything but is computationally very demanding.
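The cost of bootstrap comes from repeating the analysis on many resampled copies of the data (500 in the cell above). The sketch below shows the idea for the simplest possible statistic, the mean, using only the standard library (Julia 1.x assumed). The function bootstrap_error is a hypothetical helper written for this illustration, not part of StatPlots.

```julia
using Random, Statistics

# Bootstrap estimate of the standard error of `statistic` on `data`:
# resample with replacement, recompute the statistic, and take the spread
function bootstrap_error(rng, statistic, data; n_resamples = 500)
    estimates = [statistic(rand(rng, data, length(data))) for _ in 1:n_resamples]
    return std(estimates)
end

rng = MersenneTwister(1)
data = randn(rng, 200)                         # made-up sample

boot = bootstrap_error(rng, mean, data)        # 500 passes over the data
analytic = std(data) / sqrt(length(data))      # s.e.m. formula, a single pass
```

For the mean the two numbers should be close, but the bootstrap version needs n_resamples full passes over the data. When the statistic is a whole density or regression curve, each pass is itself expensive.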

Bar plots

You can also choose a categorical x variable. In that case groupapply can be used as a convenient syntax to get grouped bar plots. Here we compute the mean and s.e.m. of :MAch for males and females, subdividing each group according to the variable :Minrty.



In [28]:

    
pool!(school, :Sx) #specify that a variable is categorical
grp_error = groupapply(school, :Sx, :MAch; compute_error = :across, group = :Minrty)
plot(grp_error, line = :bar)









    Out[28]:
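The numbers behind such a grouped bar plot can be sketched by hand: one bar per combination of the x variable and the grouping variable, with the mean as bar height and the s.e.m. as error bar. The example below uses made-up rows and the hypothetical names standard_error, bars and bar_stats, and it assumes that the error is the s.e.m. over the observations falling in each bar.

```julia
using Statistics

standard_error(v) = std(v) / sqrt(length(v))

# Toy rows (made up): (Sx, Minrty, MAch)
rows = [("Female", "No", 10.0), ("Female", "No", 14.0),
        ("Female", "Yes", 8.0), ("Female", "Yes", 12.0),
        ("Male", "No", 15.0), ("Male", "No", 17.0),
        ("Male", "Yes", 9.0), ("Male", "Yes", 13.0)]

# Collect the MAch values belonging to each (Sx, Minrty) combination
bars = Dict{Tuple{String,String},Vector{Float64}}()
for (sx, minrty, mach) in rows
    push!(get!(bars, (sx, minrty), Float64[]), mach)
end

# Height and error bar of each bar
bar_stats = Dict(key => (mean(v), standard_error(v)) for (key, v) in bars)
```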

Adding interactivity to your plots

Plots.jl can also be combined with Interact.jl, a particularly useful library for interactive programming in Jupyter notebooks.



In [29]:

    
using Interact

If a plot depends on some "dynamic" variables, it will update as soon as they are changed.



In [30]:

    
# simple example: play with the n slider to get more or less points and with the s slider
# if you want to increase or decrease their size
@manipulate for n = 1:100, s = 1:0.5:10
    scatter(rand(n), rand(n), markersize = s, markerstrokealpha = 0, grid = false, legend = false)
end

Let's put it all together! We can set up a generic analysis of our data with the groupapply functionality and change all we want interactively.



In [31]:

    
# choose a dataset
school = RDatasets.dataset("mlmRev","Hsb82");
# possible things to plot on the y axis
ys = vcat([:density, :cumulative], names(school))
# possible things to plot on the x axis
xs = vcat(names(school), :constant)
# possible ways of splitting the data
groups = vcat(names(school), [:no_grouping])
# possible axis types
axis_types = [:discrete, :continuous]

# Add a constant column, to be used if you want to plot only one value that doesn't depend on anything.
# E.g. mean and sem of :MAch can be thought of as :MAch as a function of :constant
school[:constant] = "constant"

# Add another constant column in case you don't want to group your data
school[:no_grouping] = "";

Play with the widgets below to explore the dataset (here the example dataset "school" is provided, but you can use your own data). Step by step:

Click on the widgets next to x to select the variable on the x axis; choose "constant" if you don't have any relevant x variable.
Choose what to plot on the y axis: either the pdf or cdf of x, or another of your variables, y, as a function of x.
Group the data according to the value of one column using group (you could also set this up to use more than one variable to group the data).
The error is assumed to be s.e.m. across :School; you can turn it off by setting err = false.
Specify whether the x axis should be continuous or discrete.
Specify what type of plot you want (path, scatter, bar).
Finally, if your analysis has some smoothing variables, you can play with those interactively using the sliders.

Suggested first example

As an example, try setting x = CSES, y = density, group = Sx, err = true, axis_type = continuous, plot_type = bar and play with the bandwidth slider to smooth the plot. The slider span is used in local regression with continuous axis.

In case of error

Sometimes you may try to make an impossible plot by mistake. For example, you may try to group by School, although this variable is already used to compute the standard error. In this case, you'll get the message "Choose carefully!!!". Sometimes your choice made sense but, after splitting the data, some condition had too little data for a sophisticated analysis (like :locreg), in which case you may still get an error. A good first thing to try is to stop splitting the data across schools by setting err = false.



In [32]:

    
s = fill(plot(),()) # a trick to be able to save the plot! Every time we run an analysis, s[1] gets updated, see next cell

@manipulate for x in xs,
                y in ys,
                group in groups,
                err in [true, false],
                axis_type in axis_types,
                plot_type in [:path, :scatter, :bar],
                bandwidth in 0.01:0.01:2,
                span in 0.01:0.01:1
    # Add extra keyword needed for smoothing
    kwargs = []
    if axis_type == :continuous
        if y in names(school)
            kwargs = [(:span, span)]
        elseif y == :density
            kwargs = [(:bandwidth, bandwidth)]
        end
    end
    # try the plot, otherwise make a silly plot and tell the user to try again!
    try
        ge = groupapply(y, school, x; axis_type = axis_type, group = group,
                        compute_error = err ? (:across, :School) : :none, kwargs...)
        s[:] = plot(ge, line = plot_type, grid = false, xlabel = string(x), ylabel = string(y))
    catch
        s[:] = plot(;legend = false, grid = false, annotations = [(0.5,0.5, "Choose carefully!!!")])
    end
end



In [33]:

    
# to save your plot
savefig(s[1], "exploratoryplot.pdf")
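The "trick" used for s above is a zero-dimensional array: a container that holds exactly one value and can be overwritten in place, so that code running inside the widget can leave its latest result somewhere accessible. Here is a minimal sketch of the same pattern with strings instead of plots. It assumes Julia 1.x, where the in-place assignment is written holder[] = value, and the names holder and run_analysis! are made up for this illustration.

```julia
holder = fill("first result", ())   # zero-dimensional array holding one value

# Stand-in for the body of the interactive loop: store the latest result
function run_analysis!(holder, label)
    holder[] = label     # overwrite the stored value in place
    return nothing
end

run_analysis!(holder, "second result")
latest = holder[1]       # linear index 1 also reads the single element
```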


