R's ggplot2 package, written by Hadley Wickham, is a graphics system based on the Grammar of Graphics.
Plotting your data with ggplot2 can be as simple as a plot(x,y) command, or as complex as several lines of R commands. However, the result, even with the most basic theme, is publication-ready.
I will give a short introduction to the package, split into two parts. The first part covers the qplot function, which is the quick-and-dirty way of plotting data with this package. The second part covers the more low-level plotting configuration, where I will try to glance over its various aspects.
One important thing to remember is that ggplot2 operates on data frames. Therefore, for the purpose of this notebook, I will keep using the mpg dataset that is included in the package. Let's load it and see what it holds.
In [1]:
## if you don't have ggplot2 installed, run: install.packages("ggplot2")
In [2]:
library(ggplot2)
In [3]:
str(mpg)
The qplot() function is the most basic function of the package. It works like the plot() function of the base graphics system.
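Since ggplot2 operates on data frames, even ad hoc vectors have to be wrapped in one before they can be plotted. The following is a minimal sketch using a made-up toy data frame (the names `toy`, `x`, `y` and `p_toy` are illustrative and not part of the original notebook). Recent ggplot2 releases mark qplot() as deprecated, so the call may emit a warning while still producing the plot.

```r
library(ggplot2)

# Hypothetical toy data: two numeric vectors wrapped in a data frame
toy <- data.frame(x = 1:10, y = (1:10)^2)

# Columns are referred to by bare name and looked up inside `data`
p_toy <- qplot(x, y, data = toy)
p_toy
```

The equivalent base graphics call would be plot(toy$x, toy$y), which takes the vectors directly instead of a data frame.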
The plots in ggplot2 are made of objects that belong to two categories: aesthetics (such as color, shape and size) and geoms (such as points and lines).
As already mentioned, ggplot2 works on data frames. In fact, the qplot() function acts as a "graphical wrapper" around the data frame, extracting all the information needed to plot the dataset. This means that it has access to all numerical and categorical information within the dataset, as well as to their labels. In fact, it will use the labels of the columns as axis titles in the final plot. Therefore, one important thing to keep in mind is to always have your factor variables labeled.
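As an illustration of labeling a factor variable, the sketch below recodes the drv column of mpg into a factor with descriptive level names. The copy `mpg_labeled` and the label strings are my own choices for illustration; the meaning of the codes (f, r, 4) follows the description of the mpg dataset.

```r
library(ggplot2)

# Work on a copy so the original mpg dataset stays untouched
mpg_labeled <- mpg

# Recode the single-character drive-train codes into a labeled factor
mpg_labeled$drv <- factor(
  mpg$drv,
  levels = c("f", "r", "4"),
  labels = c("front-wheel", "rear-wheel", "4-wheel")
)

levels(mpg_labeled$drv)   # the descriptive labels that legends would show
table(mpg_labeled$drv)    # number of cars per drive-train type
```

Any plot that maps drv from `mpg_labeled` would then show these labels in its legend instead of the raw codes.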
Let's make our first plot, using the highway miles-per-gallon variable (hwy), which describes the fuel consumption of the car on the highway:
In [4]:
qplot(hwy, data=mpg)
Since only one variable was given, the function chose to draw a histogram as the best way to describe the input data.
In the function call:
qplot(hwy, data=mpg)
the first argument is the name of the variable in the dataset to be plotted, while the second (keyword) argument is the name of the dataset from which to load the information.
Notice how the name of the variable was used to label the x-axis, while the y-axis label shows that the histogram holds counts per bin. In addition, the range of the histogram is chosen automatically to cover all data points, and by default the data are split into 30 bins, i.e. a bin width of roughly $\frac{\mathrm{range}}{30}$.
Since the function identified a single variable and chose to plot it as a histogram, the plot can be tuned further by passing additional arguments. For example, changing the default width of the bins:
In [5]:
qplot(hwy, data=mpg, binwidth=4)
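To make the default binning concrete, the sketch below computes the range/30 width by hand and passes it explicitly as binwidth. The variable names `hwy_range`, `default_width` and `p_bins` are illustrative. The resulting plot should look close to the default histogram, although the exact bin edges may differ.

```r
library(ggplot2)

# Span of the highway consumption values
hwy_range <- diff(range(mpg$hwy))

# Approximate default bin width: the range split into 30 bins
default_width <- hwy_range / 30
default_width

# Pass the computed width explicitly instead of relying on the default
p_bins <- qplot(hwy, data = mpg, binwidth = default_width)
p_bins
```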
To create a two-dimensional scatter plot of two variables of the data frame, one can simply do:
In [6]:
qplot(displ, hwy, data=mpg)
The order of the arguments corresponds to x-axis, y-axis and dataset.
One additional level of information can be added by modifying the aesthetics. For example, one could add the information of the factor variable drv, which shows whether the vehicle is front-, rear- or 4-wheel drive. To show such information on the same plot, one could use different colors to encode this categorical feature.
So we project 3D information onto a 2D plot by simply adding one more argument:
In [7]:
qplot(displ, hwy, data=mpg, color=drv)
Notice how the plot points are colorised to indicate the different levels of the drv variable. Moreover, the legend for the coloring has been added automatically.
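The color aesthetic behaves differently for numerical and categorical variables. As a small sketch (the use of the cyl column and the names `p_gradient` and `p_discrete` are my own choices for illustration), mapping the integer column cyl directly gives a continuous color gradient, while wrapping it in factor() gives one distinct color per level:

```r
library(ggplot2)

# cyl is stored as an integer, so it is treated as a continuous variable
p_gradient <- qplot(displ, hwy, data = mpg, color = cyl)

# Converting to a factor makes every cylinder count its own category
p_discrete <- qplot(displ, hwy, data = mpg, color = factor(cyl))

p_gradient
p_discrete
```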
Assume now that we also want to add statistics to the plot. For example, let's try to describe our data using a smooth curve to see the overall trends.
This can be done by adding geoms. In fact, the previous example already invoked a geom: the data points. When qplot(), and ggplot2 in general, is given only the variables and the dataset to which they belong, it has no idea how to draw them; at that level, no plot has been created. What qplot() does is automatically pick a geom, for example the "point" geom for the scatter plot of two variables.
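The automatic choice can be checked by naming the geom explicitly. In this sketch (the object names are illustrative), the first call lets qplot() choose and the second asks for the "point" geom directly; inspecting the layers of the two plot objects suggests they use the same geom.

```r
library(ggplot2)

# Let qplot() choose the geom for two numeric variables
p_default <- qplot(displ, hwy, data = mpg)

# Request the point geom explicitly
p_explicit <- qplot(displ, hwy, data = mpg, geom = "point")

# Both plots hold a single layer drawn with the point geom
class(p_default$layers[[1]]$geom)[1]
class(p_explicit$layers[[1]]$geom)[1]
```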
Let's add a geom, including a smooth line:
In [8]:
qplot(displ, hwy, data=mpg, geom="smooth")
This plots a smooth line that describes the dataset, while the gray area marks the $95\%$ confidence interval around it. Notice that the points have disappeared. This is because specifying the geom explicitly overrides the default configuration. If I also want to include the points, I have to specify them explicitly.
In [9]:
qplot(displ, hwy, data=mpg, geom=c("smooth","point"))
Finally, notice that I have removed the color=drv argument. This is because, when such a categorisation is present, the geoms are drawn for each category separately. For example:
In [10]:
qplot(displ, hwy, data=mpg, color=drv ,geom=c("smooth","point"))
One additional way to visualise the separation of the data would be not to separate them by color but rather separate them by shape.
In [11]:
qplot(displ, hwy, data=mpg, shape=drv)
Here circles, triangles and squares are used as markers for the different categories.
Finally, one could use different statistical methods for the smoothing. For example, if instead of the default smoother one wanted to fit a linear model to identify the linear relation, one could specify:
In [12]:
qplot(displ, hwy, data=mpg, color=drv, geom=c("smooth","point"), method=lm)
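To see what the straight lines represent, one can fit the same kind of model outside the plot. The sketch below assumes that each line is an ordinary least-squares fit of hwy on displ within one drv level, which is what lm() computes; the names `by_drv` and `fits` are illustrative.

```r
library(ggplot2)

# Split the dataset into one data frame per drive-train type
by_drv <- split(mpg, mpg$drv)

# Fit hwy ~ displ separately in each group and keep the coefficients
fits <- lapply(by_drv, function(d) coef(lm(hwy ~ displ, data = d)))

# Each element holds the intercept and the slope for one drv level
fits
```

The slope for each group tells how many highway miles per gallon are lost, on average, per additional litre of engine displacement.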
Another interesting feature of ggplot2 is the introduction of facets. These are panel-like plots that separate the distribution of variables based on the levels of one categorical variable. This is similar to the effect of the color and fill arguments, with the difference that the categories are not overlaid but rather split into different plots.
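Facets can be passed as an argument to the qplot() function. They have a very distinct syntax built around the ~ (tilde) symbol: the variable on the left-hand side defines the rows and the variable on the right-hand side defines the columns. Where no variable is wanted, a . is placed, to keep the syntax rows ~ columns.
For example, to split the plots by the drv variable and organise them in one row with three columns: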
In [13]:
qplot(displ, hwy, data=mpg, facets = .~drv)
While to do the same thing but split them in rows in one column:
In [14]:
qplot(displ, hwy, data=mpg, facets = drv~.)
Again, the same logic applies to geoms.
In [15]:
qplot(displ, hwy, data=mpg, facets = drv~., geom=c("point","smooth"))
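The rows ~ columns syntax also accepts a variable on both sides of the tilde. As a sketch (choosing cyl for the columns is my own example, and `p_grid` is an illustrative name), the following arranges the panels in a grid with one row per drv level and one column per cylinder count:

```r
library(ggplot2)

# Rows are split by drv, columns by cyl
p_grid <- qplot(displ, hwy, data = mpg, facets = drv ~ cyl)
p_grid
```

Combinations that do not occur in the data simply show up as empty panels.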
By the way, the same categorisation can be achieved in all types of plots by setting the proper arguments. For example, in a histogram we don't have color, but rather "fill color". So if we try this for one variable:
In [16]:
qplot(hwy, data=mpg, fill=drv)
qplot() will create a histogram for this variable that consists of three stacked histograms, each corresponding to a different level of the drv variable and filled automatically with a different colour.
For a single variable, another geom that can be used instead of the histogram is the "density" smooth.
In [17]:
qplot(hwy, data=mpg, geom="density")
And to see where these two peaks come from, we may want to split this density distribution by the drv variable, visualising the categories by line color:
In [18]:
qplot(hwy, data=mpg, color=drv, geom="density")
or by the fill color:
In [19]:
qplot(hwy, data=mpg, fill=drv, geom="density")
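With filled densities the curves can hide one another. A possible remedy, sketched below, is to make the fill semi-transparent. Wrapping the value in I() is the qplot() idiom for setting an aesthetic to a constant instead of mapping it to a variable; the value 0.5 and the name `p_alpha` are arbitrary choices for illustration.

```r
library(ggplot2)

# alpha = I(0.5) sets a fixed 50% transparency for all density fills
p_alpha <- qplot(hwy, data = mpg, fill = drv, geom = "density", alpha = I(0.5))
p_alpha
```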
While the qplot() function is quite handy, it hides the real power of ggplot2: the level of customisation.
In this part, I want to cover the basic ideas of the package.
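Contrary to the qplot() function, when using the full ggplot2 package we should explicitly define the dataset, the aesthetic mappings and the geoms.
Using the package, the plots are made in steps (layers), each one added on top of the previous one with the + operator.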
So let's generate a basic plot using ggplot():
In [20]:
g <- ggplot(mpg, aes(displ, hwy))
So we generate a ggplot object that loads the mpg dataset and uses the displ and hwy variables as aesthetics (aes()).
We have the plot object in memory, but we made no actual plot yet, since we have not defined the geometry!
This is easily seen by printing out the g object:
In [21]:
summary(g)
In [22]:
g
So indeed no plot is created yet since it has no layers.
If we now add on top a geom :
In [23]:
p <- g + geom_point()
In [24]:
p
geom_point() needs no arguments in this case, since all the information is held in the g object, on top of which geom_point() is added.
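The layered structure can also be inspected directly. In this sketch, the objects are rebuilt so that the block runs on its own, and the number of layers is counted before and after adding the geom:

```r
library(ggplot2)

# Base object: data and aesthetics only
g <- ggplot(mpg, aes(displ, hwy))

# Same object with one geom layered on top
p <- g + geom_point()

length(g$layers)  # 0: nothing to draw yet
length(p$layers)  # 1: the point layer
```

Note that g itself is left unchanged by the + operation; the sum returns a new plot object.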
Adding on top a smooth line...
In [40]:
g + geom_point() + geom_smooth() # default smoother is lo(w)ess (LOcally (WEighted) Scatter plot Smoothing)
... or for a specific smoothing method
In [25]:
g + geom_point() + geom_smooth(method="lm")
And to reproduce the final plot made with qplot(), I should also include the facets. In the syntax of ggplot(), just add one more layer:
In [26]:
g + geom_point() + geom_smooth(method="lm") + facet_grid(.~drv)
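The same layered approach can reproduce the colored qplot() example from the first part. The sketch below maps drv to color inside aes(), so that both the points and the linear fits are drawn per category. The labs() call with custom axis and legend titles is an addition of mine, with wording based on the description of the mpg dataset.

```r
library(ggplot2)

# Mapping color in the base object makes every layer inherit it
p_full <- ggplot(mpg, aes(displ, hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon",
    color = "Drive train"
  )
p_full
```

Because the color mapping lives in the base object, it does not have to be repeated in geom_point() or geom_smooth().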