Prelab: An introduction to the tidyverse

In this module, we will be learning the basics of a cluster of R packages collectively known as the "tidyverse". This is a set of tools that makes data manipulation and visualization in R easier and more flexible than in base R. Specifically, we will be using the packages dplyr and ggplot2. The capabilities of these packages are much greater than what we can cover in this module, and there is a list of resources at the end of this prelab to help you continue to learn beyond the basics. Please execute the commands in each of the following cells, and answer the question at the end of the prelab to receive full credit.

Loading packages and data

As you will remember from our previous R modules, we need to load the libraries we will be using before we begin any analysis.


In [0]:
library(dplyr)
library(ggplot2)
options(repr.plot.width=10, repr.plot.height=3) #set size for plots in this notebook

For this prelab, we will be using some cancer incidence statistics (downloaded from https://www.cdc.gov/cancer/npcr/uscs/download_data.htm). This dataset contains statistics for a set of seven types of cancer, stratified by year, race, and sex.


In [0]:
data = read.table("Cancer_Incidence.txt",header=TRUE,sep="\t") #tab-separated file with a header row
head(data)

Manipulating data with dplyr

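First, we will convert our data from the standard "data frame" format R creates when you use read.table into the "tbl" (tibble) format designed for use in dplyr, using the command as_tibble. (Older code often uses tbl_df for this step, but that function is deprecated as of dplyr 1.0.0.)


In [0]:
data = as_tibble(data)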

A key difference between dplyr and base R is the "pipe" operator (%>%). This operator lets us string together commands, like the | operator in Unix, resulting in clearer and more modular code. Let's combine %>% with the command select, which allows us to select a subset of columns from our data frame. Below we take our entire dataset, select two columns, and then pipe the result to head to see just the first few rows. Notice how we can string together multiple pipes, with our old friend head at the end to print just the beginning of the tbl:


In [0]:
data %>% select(YEAR,RACE) %>% head()
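To see what the pipe is doing, it can help to compare a piped command with the equivalent nested call. The sketch below uses a small made-up tbl called toy (the values are invented for illustration and are not from the CDC dataset); the names toy, nested, and piped are just placeholders.

```r
library(dplyr)

# Made-up toy data (hypothetical values, NOT the CDC dataset)
toy <- tibble(
  YEAR = c(1999, 1999, 2000, 2000),
  RACE = c("A", "B", "A", "B"),
  SEX  = c("Female", "Male", "Female", "Male")
)

# Nested form: read from the inside out
nested <- head(select(toy, YEAR, RACE), 2)

# Piped form: read from left to right
piped <- toy %>% select(YEAR, RACE) %>% head(2)

# Check that both forms give the same result
identical(nested, piped)
```

Both forms do the same work; the piped version simply lists the steps in the order they happen.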

We can also use select to remove the SEX and YEAR columns using "-":


In [0]:
data %>% select(-SEX,-YEAR) %>% head()

Now let's try the filter command, which lets us subset the data by choosing the rows that satisfy the given conditions. Here we subset the data to only look at the female statistics. Note that we use the boolean condition == because we are evaluating a true/false statement, rather than setting a variable's value.


In [0]:
table(data$SEX) #prints the counts of each unique value in the SEX column
female = data %>% filter(SEX=="Female") #filter to only include rows where the SEX column is "Female"
table(female$SEX) #now we only have females after using filter
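filter can also take several conditions at once. As a minimal sketch on made-up data (the toy tbl and its values are invented for illustration), the cell below shows three common patterns: conditions separated by commas (combined with AND), %in% for matching any of several values, and | for OR.

```r
library(dplyr)

# Made-up toy data (hypothetical values, NOT the CDC dataset)
toy <- tibble(
  SEX   = c("Female", "Male", "Female", "Male", "Female"),
  YEAR  = c(1999, 1999, 2000, 2000, 2001),
  COUNT = c(10, 12, 11, 15, 9)
)

# Conditions separated by commas must ALL be true (logical AND)
both <- toy %>% filter(SEX == "Female", YEAR >= 2000)

# %in% keeps rows whose value matches any element of a vector
some_years <- toy %>% filter(YEAR %in% c(1999, 2001))

# | keeps rows where EITHER condition is true (logical OR)
either <- toy %>% filter(SEX == "Male" | COUNT < 10)

both
some_years
either
```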

The command arrange will sort the tbl by a given column. Let's filter to include only data on females of all races in the year 1999, and look at the most and least common cancers.


In [0]:
#arrange: for 1999, Females, All Races, sort by rate lowest->highest
data %>% filter(SEX=="Female",YEAR==1999,RACE=="All Races") %>% arrange(AGE_ADJUSTED_RATE) %>% head()
#arrange: for 1999, Females, All Races, sort by rate highest->lowest
data %>% filter(SEX=="Female",YEAR==1999,RACE=="All Races") %>% arrange(desc(AGE_ADJUSTED_RATE)) %>% head()
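arrange also accepts more than one column, in which case later columns break ties in earlier ones. A small sketch on made-up data (the toy tbl and its values are invented for illustration):

```r
library(dplyr)

# Made-up toy data (hypothetical values, NOT the CDC dataset)
toy <- tibble(
  SITE = c("X", "Y", "Z", "W"),
  YEAR = c(2000, 1999, 2000, 1999),
  RATE = c(5, 20, 12, 8)
)

# Sort by YEAR (lowest first), then by RATE (highest first) within each year
sorted <- toy %>% arrange(YEAR, desc(RATE))
sorted
```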

The mutate command creates a new column by performing a specified calculation on each row. Here we calculate the crude rate for each row by dividing the count by the total population size, and name our new column "NEW_RATE". Comparing the two columns, you can see that the reported CRUDE_RATE is our NEW_RATE multiplied by 1e5 (so the reported "RATE" is per 100,000 individuals).


In [0]:
data %>% mutate(NEW_RATE = COUNT/POPULATION) %>% head()
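The per-100,000 scaling can be made explicit by creating two columns in a single mutate call; a column created earlier in the call can be used later in the same call. This is a sketch on made-up numbers (the toy tbl and the column name RATE_PER_1E5 are invented for illustration):

```r
library(dplyr)

# Made-up toy data (hypothetical values, NOT the CDC dataset)
toy <- tibble(
  COUNT      = c(50, 200),
  POPULATION = c(100000, 1000000)
)

toy <- toy %>%
  mutate(
    NEW_RATE     = COUNT / POPULATION,  # proportion of the population
    RATE_PER_1E5 = NEW_RATE * 1e5       # cases per 100,000 individuals
  )
toy
```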

Finally, the group_by command lets you split your data into groups by a variable and calculate summary statistics (via summarise) for each group. For example, below we keep only the rows for the "All Races" category, then group by sex and calculate the mean of the reported CRUDE_RATE for each sex.


In [0]:
data %>% filter(RACE=="All Races") %>% group_by(SEX) %>% summarise(meanRate = mean(CRUDE_RATE)) %>% head

Notice that this string of commands resulted in only 2 columns, because we only grouped by SEX, thus combining all other variables (e.g. SITE, the type of cancer). Let's try grouping by multiple variables. Here we don't use head(), and will get the whole table as output:


In [0]:
#group by sex and site
data %>% filter(RACE=="All Races") %>% group_by(SEX,SITE) %>% summarise(meanRate = mean(CRUDE_RATE))
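summarise can compute several statistics per group in one call. The sketch below, on made-up data (the toy tbl and its values are invented for illustration), adds a row count with n() alongside the mean. It assumes dplyr 1.0.0 or later for the .groups argument, which here returns an ungrouped tbl after summarising by two variables.

```r
library(dplyr)

# Made-up toy data (hypothetical values, NOT the CDC dataset)
toy <- tibble(
  SEX        = c("Female", "Female", "Male", "Male", "Male"),
  SITE       = c("X", "Y", "X", "X", "Y"),
  CRUDE_RATE = c(10, 20, 30, 50, 40)
)

# One row per sex: mean rate and number of rows that went into it
by_sex <- toy %>%
  group_by(SEX) %>%
  summarise(meanRate = mean(CRUDE_RATE), nRows = n())

# One row per sex/site combination; .groups = "drop" removes the grouping afterwards
by_sex_site <- toy %>%
  group_by(SEX, SITE) %>%
  summarise(meanRate = mean(CRUDE_RATE), nRows = n(), .groups = "drop")

by_sex
by_sex_site
```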

Plotting in ggplot

Now that we can easily manipulate our data, let's play around with making graphs! ggplot allows you to build graphs easily in a modular fashion. The basic format for a ggplot command includes calling ggplot and then adding the type of graph or feature you want to include.

For example, the code below makes a tbl called "rates", which is filtered to include only rates for all races and males and females combined. Then we call ggplot(rates) to say what data we want to plot, and call geom_point to add points to our plot by specifying variables for the X and Y coordinates. Anytime we want to plot something where each row is a data point, as in this plot, we put the variables we are using in the aesthetic call ("aes()"). So here, we are telling ggplot to plot a point for each row, with the X value as the YEAR and the Y value as the AGE_ADJUSTED_RATE.


In [0]:
rates = data %>% filter(RACE=="All Races",SEX=="Male and Female") #filter data to all races, male and female
head(rates)
ggplot(rates) + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_RATE)) #make a plot of the year and adjusted rates!

Looking at the data in rates, we have multiple types of cancers (SITE), but we plotted them all together without telling ggplot to differentiate them in any way. Let's add an additional "color" aesthetic, which will color the points by whatever variable we provide.


In [0]:
ggplot(rates) + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_RATE,color=SITE)) #color the points by cancer site

Now we can see each site separately! And ggplot has automatically added a nice legend for us. This will happen any time you use "color" or another aesthetic (in the "aes()" call) to differentiate by a variable.

Notice what happens when we put "color" outside aes(), below. We get an error (R cannot find an object named SITE) because outside aes() ggplot does not look for SITE in the data. Any aesthetic that is mapped to a variable in the data must be specified within aes().


In [0]:
ggplot(rates) + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_RATE),color=SITE) #color=SITE is outside aes(), so this line produces an error

If you just want to make everything the same color regardless of its variables, you can use color outside aes(), and assign a particular color.


In [0]:
ggplot(rates) + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_RATE),color="red") #make every point red

Now let's connect the points with lines. ggplot is "buildable", meaning if you want to add something you can just add an extra command.


In [0]:
ggplot(rates) + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_RATE,color=SITE)) + geom_line(aes(x=YEAR,y=AGE_ADJUSTED_RATE,color=SITE))
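When several layers use the same aesthetics, as in the cell above, one option is to put aes() inside the ggplot() call so that every layer inherits it. This is a sketch on made-up data (the toy data frame and its values are invented for illustration), not a required style:

```r
library(ggplot2)

# Made-up toy data (hypothetical values, NOT the CDC dataset)
toy <- data.frame(
  YEAR = rep(1999:2001, times = 2),
  SITE = rep(c("X", "Y"), each = 3),
  AGE_ADJUSTED_RATE = c(10, 11, 12, 20, 19, 18)
)

# Aesthetics given in ggplot() are inherited by every layer added afterwards
p <- ggplot(toy, aes(x = YEAR, y = AGE_ADJUSTED_RATE, color = SITE)) +
  geom_point() +
  geom_line()
p
```

A layer can still add or override aesthetics in its own aes() call when it needs something different.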

The rates reported in our dataset include confidence intervals. Let's try plotting the confidence intervals in our point & line plot. To do this, we simply add two more calls of "geom_point", one for the lower bound and one for the upper bound. To keep our plot from getting too busy, we'll filter the data down to just lymphomas in females. Notice how we have changed the size of our points and lines using the size argument. (In ggplot2 3.4.0 and later, line thickness is set with linewidth instead; size still works for lines but gives a deprecation warning.)


In [0]:
lymphFem = data %>% filter(SITE=="Lymphomas",SEX=="Female",RACE!="All Races")
ggplot(lymphFem) + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_RATE,color=RACE),size=3) + geom_line(aes(x=YEAR,y=AGE_ADJUSTED_RATE,color=RACE),size=1) + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_CI_LOWER,color=RACE),size=1)  + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_CI_UPPER,color=RACE),size=1)

Now, let's add line segments to connect our confidence intervals. For line segments we need to specify where the line starts and stops on both the x and y axes, thus there are now 4 variables going into the aesthetic call in addition to color.


In [0]:
ggplot(lymphFem) + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_RATE,color=RACE),size=3) + geom_line(aes(x=YEAR,y=AGE_ADJUSTED_RATE,color=RACE),size=1) + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_CI_LOWER,color=RACE),size=1)  + geom_point(aes(x=YEAR,y=AGE_ADJUSTED_CI_UPPER,color=RACE),size=1) + geom_segment(aes(x=YEAR,xend=YEAR,y=AGE_ADJUSTED_CI_LOWER,yend=AGE_ADJUSTED_CI_UPPER,color=RACE))
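As an alternative sketch, ggplot2 also provides geom_errorbar, which draws the interval from ymin to ymax in a single layer instead of combining points and segments. The example below uses made-up data (the toy data frame, its column names, and its values are invented for illustration); width controls the width of the end caps.

```r
library(ggplot2)

# Made-up toy data (hypothetical values, NOT the CDC dataset)
toy <- data.frame(
  YEAR  = 1999:2001,
  RATE  = c(10, 11, 12),
  LOWER = c(9, 10, 10.5),
  UPPER = c(11, 12, 13.5)
)

# Points and a line for the rate, plus one error bar per year for the interval
p <- ggplot(toy, aes(x = YEAR, y = RATE)) +
  geom_point() +
  geom_line() +
  geom_errorbar(aes(ymin = LOWER, ymax = UPPER), width = 0.2)
p
```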

ggplot can make many types of plots. Let's try another type, a boxplot. Let's plot the range of rates for each site/sex combination in a boxplot:


In [0]:
ggplot(data) + geom_boxplot(aes(x=SITE,y= AGE_ADJUSTED_RATE,color=SEX))

Different plot types use different aesthetics. For boxplots, if we want to color the whole box, we use "fill". We can also combine a filter command with our ggplot:


In [0]:
ggplot(data %>% filter(RACE!="All Races")) + geom_boxplot(aes(x=SITE,y= AGE_ADJUSTED_RATE,fill=RACE))

What if we want to use our own colors instead of the ones ggplot chooses automatically? We can do that too, by adding another command to our code:


In [0]:
myColors = c("red","orange","purple","chartreuse","magenta")
ggplot(data %>% filter(RACE!="All Races")) + geom_boxplot(aes(x=SITE,y= AGE_ADJUSTED_RATE,fill=RACE)) + scale_fill_manual(values=myColors)

We also might want to change the axis labels instead of using the names of the variables being plotted. To do that we can use xlab() and ylab(), again simply adding on to our existing code.


In [0]:
ggplot(data %>% filter(RACE!="All Races")) + geom_boxplot(aes(x=SITE,y= AGE_ADJUSTED_RATE,fill=RACE)) + scale_fill_manual(values=myColors) + xlab("Cancer Site") + ylab("Rate per 100,000")

Question

We've now gone through some basic features of dplyr and ggplot on an example dataset. How could you use these tools in your own work (or in a made-up example)? You don't need to write code; just give an example in a sentence or two.

Your answer here!

Resources

We highly recommend you take a look at these cheatsheets to see the many options available in dplyr & ggplot:

tidyr & dplyr cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

ggplot cheatsheet: https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

Additional tutorials:

A dplyr tutorial: http://genomicsclass.github.io/book/pages/dplyr_tutorial.html

Data processing in tidyr & dplyr: https://rpubs.com/bradleyboehmke/data_wrangling

DataCamp dplyr tutorial: https://www.datacamp.com/courses/dplyr-data-manipulation-r-tutorial/

DataCamp ggplot tutorial: https://www.datacamp.com/courses/data-visualization-with-ggplot2-1