In Class: An introduction to the Tidyverse

Loading packages and data

Today we're going to continue to practice the basics of dplyr and ggplot with a dataset from the 2017 Iditarod Trail Sled Dog Race. We begin by loading our packages:



In [0]:

    
library(ggplot2)
library(dplyr)
library(lubridate) # This package will help us parse date/time data into values we can plot and do arithmetic on
options(repr.plot.width=10, repr.plot.height=3) #set size for plots in this notebook

Now, write the commands to read in the data (the file "Iditarod.txt" is tab-separated and contains a header line) and convert it to the dplyr tbl format. Store your table as a variable called "idtrd".
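
As a reminder of the general pattern, here is a minimal sketch on made-up data rather than the real file. The text in toy_text and the names toy_df and toy_tbl are hypothetical; read.delim() is one of several base R readers that would work, and it assumes tab-separated input with a header line by default.

```r
library(dplyr)

# Hypothetical tab-separated text standing in for a file on disk
toy_text <- "Name\tCheckpoint\tDogs\nA\tFairbanks\t16\nA\tNenana\t15\n"

# read.delim() defaults to sep = "\t" and header = TRUE;
# for a real file you would pass the file name instead of text =
toy_df <- read.delim(text = toy_text, stringsAsFactors = FALSE)

# as_tibble() converts a plain data frame to the tbl format used by dplyr
toy_tbl <- as_tibble(toy_df)
print(toy_tbl)
```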



In [0]:

Below, we have provided code to convert the date/time data into date-time values we can use for plotting, and to take a peek at what's in our dataset.



In [0]:

    
idtrd$Departure = ymd_hms(idtrd$Departure)
idtrd$Arrival = ymd_hms(idtrd$Arrival)
head(idtrd)
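
To see what ymd_hms() does on its own, here is a small sketch with two made-up timestamps (they are not taken from the dataset). It parses "year-month-day hour:minute:second" strings into date-time values, which R stores as seconds since 1970, so they can be subtracted and placed on a plot axis.

```r
library(lubridate)

# Hypothetical timestamps in the same "YYYY-MM-DD HH:MM:SS" layout
start <- ymd_hms("2017-03-06 11:00:00")
later <- ymd_hms("2017-03-06 14:30:00")

class(start)        # a POSIXct date-time, not a character string
as.numeric(start)   # the underlying number of seconds since 1970-01-01 UTC

# Date-times can be subtracted; asking for explicit units avoids surprises
elapsed <- difftime(later, start, units = "hours")
print(elapsed)
```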

Let's make some plots!!!

Each row contains the data for one musher at one checkpoint. This includes the time and number of dogs at departure and arrival, along with the musher's home country and rookie/veteran status. This means that each individual has one row for every checkpoint they reached. Let's use a ggplot type we haven't talked about yet, the bar plot, to see how many mushers reached each checkpoint. For the bar plot we only need one aesthetic variable, because the height of each bar is simply the number of times that value appears in our variable.



In [0]:

    
ggplot(idtrd) + geom_bar(aes(x=Checkpoint))

Let's plot the number of competitors per country. We can't just use ggplot(idtrd) + geom_bar(aes(x=Country)) because there are multiple rows per individual, and so we'd be counting each individual multiple times. Using dplyr, we can first reduce the data to one row per individual, and then create our bar plot. The first() function inside summarise returns the first value of a variable within each group. Since each musher's country is the same in every one of their rows, we can just take the first one.



In [0]:

    
countryCount = idtrd %>% group_by(Name) %>% summarise(Country=first(Country))
ggplot(countryCount) + geom_bar(aes(x=Country))
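
To make the double-counting problem concrete, here is a small sketch on a made-up table (three hypothetical mushers A, B, and C; none of this is real race data). It uses count(), which tallies rows per value in the same way geom_bar() does.

```r
library(dplyr)

# Hypothetical data: one row per musher per checkpoint
toy <- tibble(
  Name    = c("A", "A", "A", "B", "B", "C"),
  Country = c("USA", "USA", "USA", "Norway", "Norway", "USA")
)

# Counting rows directly counts each musher once per checkpoint reached
row_counts <- toy %>% count(Country)

# Reducing to one row per musher first gives the number of competitors
musher_counts <- toy %>%
  group_by(Name) %>%
  summarise(Country = first(Country)) %>%
  count(Country)

print(row_counts)
print(musher_counts)
```

In the toy table the row counts are inflated by the number of checkpoints each musher reached, while the per-musher counts are not.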

Now, try using the same approach to make a bar plot of the number of rookies and veterans (the "Status" column).



In [0]:

As we have a row for each musher at each checkpoint, we can use the number of entries for each individual to see who finished the race. Using group_by, summarise, and n() (which counts the number of entries in a group), fill in the code below to make a data tbl called "chkptCounts" with counts of how many checkpoints each musher reached. Then make a bar plot of those checkpoint counts. Wherever there is a "_" you should enter code.
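
If n() is unfamiliar, the following sketch shows it on an unrelated made-up table. The names visits, Runner, Station, and nStations are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical data: one row per runner per station visited
visits <- tibble(
  Runner  = c("A", "A", "A", "B", "B", "C"),
  Station = c("s1", "s2", "s3", "s1", "s2", "s1")
)

# n() takes no arguments and returns the number of rows in the current group
stationCounts <- visits %>%
  group_by(Runner) %>%
  summarise(nStations = n())

print(stationCounts)
```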



In [0]:

    
chkptCounts = ____ %>% group_by(____) %>% summarise(_____ = n())
ggplot(_____) + geom_bar(aes(x=_______))

You should see that the majority of individuals reached all 17 checkpoints. Now filter chkptCounts to include only the individuals that finished the race (reached all 17 checkpoints), and select just the Name column. Store this as "finishers".



In [0]:

Now we can choose an individual that reached all 17 checkpoints and filter idtrd to contain only the entries for that individual. We will then make a plot (using geom_point() and geom_path()) of the latitude and longitude variables to draw that individual's route. The x-axis should be longitude and the y-axis latitude.

Note: geom_path() uses the order of the entries in the data table, while geom_line() uses the order along the x-axis, to connect points. In this dataset, we want to connect the points in the order of the checkpoints, so we use geom_path().
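
The difference is easiest to see on a route that doubles back along the x-axis. The sketch below uses a made-up four-point route (the data frame route and its coordinates are hypothetical) and builds the same plot with each geom.

```r
library(ggplot2)

# Hypothetical route whose x values are not in increasing order
route <- data.frame(x = c(1, 3, 2, 4), y = c(1, 2, 3, 4))

# geom_path() connects points in row order: x = 1, 3, 2, 4
p_path <- ggplot(route, aes(x = x, y = y)) + geom_point() + geom_path()

# geom_line() sorts by x before connecting: x = 1, 2, 3, 4
p_line <- ggplot(route, aes(x = x, y = y)) + geom_point() + geom_line()

print(p_path)
print(p_line)
```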



In [0]:

    
Noah = idtrd %>% filter(Name=="Noah Burmeister")
ggplot(Noah) + geom_point(aes(x=Longitude,y=Latitude)) + geom_path(aes(x=Longitude,y=Latitude))

Now, copy the code to make the map plot in the cell below and make the following changes:

Instead of geom_point(), use geom_text() to write the name of the checkpoint at its latitude/longitude. geom_text() requires the additional aesthetic label, where you should give the variable Checkpoint (i.e., label=Checkpoint).
Write the labels in red.



In [0]:

Let's take a look at the Departure times. Write code below, using geom_point(), to plot Noah's Departure times on the x-axis and the number of dogs at each departure on the y-axis.



In [0]:

Now, try building a plot to compare the arrival and departure times for all individuals at one particular checkpoint. Pick a checkpoint in the middle of the race (i.e., not Fairbanks or Nome) so we have both arrival and departure times. Then use geom_point to build a scatter plot with Arrival on the x-axis, Departure on the y-axis, and Status as the color of the points.



In [0]:

Here's a challenge: let's determine which mushers (on average) take the quickest breaks at each checkpoint. For this task, you will need to:

Calculate a new variable of time spent at each checkpoint ("breakTime").
Group the observations by Name and calculate the mean of the break time ("meanBreakTime"). Use na.rm=T when calculating the mean or all the values will be "NA" because not every checkpoint has both arrival and departure times.
Sort the data by "meanBreakTime" to find those with the shortest breaks (try arrange).
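
The three steps above can be sketched on a small made-up table; this is one possible approach, not the only one. The timestamps and the two mushers A and B are hypothetical, and measuring the break in hours is an assumption (any fixed unit would do).

```r
library(dplyr)
library(lubridate)

# Hypothetical data: the first checkpoint has no arrival time
toy <- tibble(
  Name      = c("A", "A", "B", "B"),
  Arrival   = ymd_hms(c(NA, "2017-03-07 10:00:00",
                        NA, "2017-03-07 11:00:00")),
  Departure = ymd_hms(c("2017-03-06 11:00:00", "2017-03-07 14:00:00",
                        "2017-03-06 11:02:00", "2017-03-07 13:00:00"))
)

breaks <- toy %>%
  # Fixing the units keeps every row on the same scale (hours here)
  mutate(breakTime = as.numeric(difftime(Departure, Arrival, units = "hours"))) %>%
  group_by(Name) %>%
  # na.rm = TRUE drops the rows where the break time is missing
  summarise(meanBreakTime = mean(breakTime, na.rm = TRUE)) %>%
  # Shortest mean break first
  arrange(meanBreakTime)

print(breaks)
```

Plain subtraction (Departure - Arrival) also works, but R then picks the units automatically, so stating them explicitly makes the result easier to interpret.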



In [0]:

Now use a plot to compare the mean break time with the order at the finish. Here you'll need to:

Extract just the observations from Nome to get the time each musher reached the finish line.
Combine that result in a tbl with the meanBreakTime variable you calculated above (save that in a tbl if you haven't already). Here you'll want to join tbls; check out the "Combine Data Sets" section at https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf. Note: not every musher reached Nome!
Use geom_point to plot the arrival time on the x-axis with the meanBreakTime on the y-axis. Color the points by the country of the musher.
Add a purple linear regression line with geom_smooth(method="lm",aes(...),color=...).
Change the axis labels to "Time of Finish" (x-axis) and "Mean Break Time" (y-axis).
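
Because not every musher reached Nome, the choice of join matters. The sketch below shows the difference on two made-up tbls (the names finish and meanBreaks and all of their values are hypothetical); it covers only the join step, not the plot.

```r
library(dplyr)

# Hypothetical finish table: musher C did not reach the finish
finish <- tibble(Name = c("A", "B"), FinishHour = c(10.5, 12.0))

# Hypothetical mean break times for all three mushers
meanBreaks <- tibble(Name = c("A", "B", "C"), meanBreakTime = c(4, 2, 3))

# inner_join() keeps only names present in both tbls
joined_inner <- inner_join(finish, meanBreaks, by = "Name")

# left_join() keeps every row of the first tbl and fills gaps with NA
joined_left <- left_join(meanBreaks, finish, by = "Name")

print(joined_inner)
print(joined_left)
```

An inner join is probably the more convenient choice here, since a musher without a finish time cannot be placed on the x-axis anyway.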



In [0]:

Homework

Using the idtrd dataset and the commands you have learned in ggplot and dplyr, make 2 new plots. For each plot, you should:

Use at least one pipe (%>%) and one dplyr command (mutate, group_by, summarise, filter, arrange, ...) to manipulate the dataset.
Use a different plot type (geom_line, geom_point, geom_boxplot, geom_bar, or more! Check out https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf to see more options).
Use color, shape, or size as an aesthetic.

Plot 1:



In [0]:

Plot 2:



In [0]: