Introduction to R and RStudio

Complete all **Exercises**, and submit answers to **Questions** on the Coursera platform.

The goal of this lab is to introduce you to R and RStudio, which you'll be using throughout the course both to learn the statistical concepts discussed in the course and to analyze real data and come to informed conclusions. To straighten out which is which: R is the name of the programming language itself and RStudio is a convenient interface.

As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R. Today we begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and basic commands.

RStudio

Your RStudio window has four panels.

Your R Markdown file (this document) is in the upper left panel.

The panel on the lower left is where the action happens. It's called the console. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you're running. Below that information is the prompt. As its name suggests, this prompt is really a request, a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.

The panel in the upper right contains your workspace as well as a history of the commands that you've previously entered.

Any plots that you generate will show up in the panel in the lower right corner. This is also where you can browse your files, access help, manage packages, etc.

R Packages

R is an open-source programming language, meaning that users can contribute packages that make our lives easier, and we can use them for free. For this lab, and many others in the future, we will use the following R packages:

statsr: for data files and functions used in this course
dplyr: for data wrangling
ggplot2: for data visualization

You should have already installed these packages using commands like install.packages and install_github.

Next, you need to load the packages in your working environment. We do this with the library function. Note that you only need to install packages once, but you need to load them each time you relaunch RStudio.



In [1]:

    
library(dplyr)
library(ggplot2)
library(statsr)









    



Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Warning message:
: package ‘ggplot2’ was built under R version 3.2.4

To do so, you can

click on the green arrow at the top of the code chunk in the R Markdown (Rmd) file, or
highlight these lines, and hit the Run button on the upper right corner of the pane, or
type the code in the console.

Going forward you will be asked to load any relevant packages at the beginning of each lab.
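
The install-once, load-every-session distinction can be checked from the console. The following is a minimal sketch using only base R: `requireNamespace()` reports whether each package named in this lab is installed, without attaching it. The variable names `pkgs`, `installed`, and `missing_pkgs` are illustrative only.

```r
# Packages this lab relies on
pkgs <- c("statsr", "dplyr", "ggplot2")

# TRUE if the package is installed (its namespace can be loaded), FALSE otherwise.
# This does not attach the package; library() is still needed for that.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Names of any packages that still need to be installed
missing_pkgs <- pkgs[!installed]
print(missing_pkgs)
```

If `missing_pkgs` is empty, every package is installed and only the `library` calls are needed in a new session.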

Dataset 1: Dr. Arbuthnot's Baptism Records

To get you started, run the following command to load the data.



In [ ]:

    
data(arbuthnot)

To do so, once again, you can

click on the green arrow at the top of the code chunk in the R Markdown (Rmd) file, or
put your cursor on this line, and hit the Run button on the upper right corner of the pane, or
type the code in the console.

This command instructs R to load some data: the Arbuthnot baptism counts for boys and girls. You should see that the workspace area in the upper righthand corner of the RStudio window now lists a data set called arbuthnot that has 82 observations on 3 variables. As you interact with R, you will create a series of objects. Sometimes you load them as we have done here, and sometimes you create them yourself as the byproduct of a computation or some analysis you have performed.

The Arbuthnot data set refers to Dr. John Arbuthnot, an 18th-century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children born in London for every year from 1629 to 1710. We can take a look at the data by typing its name into the console.



In [2]:

    
arbuthnot









    Out[2]:





year boys girls

	1 1629 5218 4683
	2 1630 4858 4457
	3 1631 4422 4102
	4 1632 4994 4590
	5 1633 5158 4839
	6 1634 5035 4820
	7 1635 5106 4928
	8 1636 4917 4605
	9 1637 4703 4457
	10 1638 5359 4952
	11 1639 5366 4784
	12 1640 5518 5332
	13 1641 5470 5200
	14 1642 5460 4910
	15 1643 4793 4617
	16 1644 4107 3997
	17 1645 4047 3919
	18 1646 3768 3395
	19 1647 3796 3536
	20 1648 3363 3181
	21 1649 3079 2746
	22 1650 2890 2722
	23 1651 3231 2840
	24 1652 3220 2908
	25 1653 3196 2959
	26 1654 3441 3179
	27 1655 3655 3349
	28 1656 3668 3382
	29 1657 3396 3289
	30 1658 3157 3013
	⋮ ⋮ ⋮ ⋮
	53 1681 6822 6533
	54 1682 6909 6744
	55 1683 7577 7158
	56 1684 7575 7127
	57 1685 7484 7246
	58 1686 7575 7119
	59 1687 7737 7214
	60 1688 7487 7101
	61 1689 7604 7167
	62 1690 7909 7302
	63 1691 7662 7392
	64 1692 7602 7316
	65 1693 7676 7483
	66 1694 6985 6647
	67 1695 7263 6713
	68 1696 7632 7229
	69 1697 8062 7767
	70 1698 8426 7626
	71 1699 7911 7452
	72 1700 7578 7061
	73 1701 8102 7514
	74 1702 8031 7656
	75 1703 7765 7683
	76 1704 6113 5738
	77 1705 8366 7779
	78 1706 7952 7417
	79 1707 8379 7687
	80 1708 8239 7623
	81 1709 7840 7380
	82 1710 7640 7288

However, printing the whole dataset in the console is not that useful. One advantage of RStudio is that it comes with a built-in data viewer. Click on the name arbuthnot in the Environment pane (upper right window) that lists the objects in your workspace. This will bring up an alternative display of the data set in the Data Viewer (upper left window). You can close the data viewer by clicking on the x in the upper lefthand corner.

What you should see are four columns of numbers, each row representing a different year: the first entry in each row is simply the row number (an index we can use to access the data from individual years if we want), the second is the year, and the third and fourth are the numbers of boys and girls baptized that year, respectively. Use the scrollbar on the right side of the console window to examine the complete data set.

Note that the row numbers in the first column are not part of Arbuthnot's data. R adds them as part of its printout to help you make visual comparisons. You can think of them as the index that you see on the left side of a spreadsheet. In fact, the comparison to a spreadsheet will generally be helpful. R has stored Arbuthnot's data in a kind of spreadsheet or table called a data frame.

You can see the dimensions of this data frame by typing:



In [3]:

    
dim(arbuthnot)









    Out[3]:





	82
	3

This command should output [1] 82 3, indicating that there are 82 rows and 3 columns (we'll get to what the [1] means in a bit), just as it says next to the object in your workspace. You can see the names of these columns (or variables) by typing:



In [4]:

    
names(arbuthnot)









    Out[4]:





	'year'
	'boys'
	'girls'

How many variables are included in this data set?
1. 2
2. 3
3. 4
4. 82
5. 1710

**Exercise**: What years are included in this dataset? Hint: Take a look at the year variable in the Data Viewer to answer this question.

You should see that the data frame contains the columns year, boys, and girls. At this point, you might notice that many of the commands in R look a lot like functions from math class; that is, invoking R commands means supplying a function with some number of arguments. The dim and names commands, for example, each took a single argument, the name of a data frame.
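
To see the "function with arguments" pattern on something small, here is a sketch that rebuilds the first three rows of the printout by hand. The object name `mini` is made up for illustration and is not part of the lab's data.

```r
# First three rows of the arbuthnot printout, typed in by hand
mini <- data.frame(
  year  = c(1629, 1630, 1631),
  boys  = c(5218, 4858, 4422),
  girls = c(4683, 4457, 4102)
)

# Each function takes a single argument: the name of a data frame
dim(mini)    # number of rows, then number of columns
names(mini)  # column (variable) names
nrow(mini)   # number of rows only
```

Swapping `mini` for `arbuthnot` in these calls gives the corresponding values for the full data set.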

**Tip:** If you use the up and down arrow keys, you can scroll through your previous commands, your so-called command history. You can also access it by clicking on the history tab in the upper right panel. This will save you a lot of typing in the future.

R Markdown

So far we asked you to type your commands in the console. The console is a great place for playing around with some code; however, it is not a good place for documenting your work. Working exclusively in the console makes it difficult to document your work as you go and to reproduce it later.

R Markdown is a great solution for this problem. And you have already worked with an R Markdown document -- this lab! Going forward, type the code for the questions in the code chunks provided in the R Markdown (Rmd) document for the lab, and Knit the document to see the results.

Some Exploration

Let's start to examine the data a little more closely. We can access the data in a single column of a data frame separately using a command like



In [5]:

    
arbuthnot$boys

This command will only show the number of boys baptized each year. The dollar sign basically says "go to the data frame that comes before me, and find the variable that comes after me".

What command would you use to extract just the counts of girls born?
1. `arbuthnot$boys`
2. `arbuthnot$girls`
3. `girls`
4. `arbuthnot[girls]`
5. `$girls`



In [6]:

    
# type your code for Question 2 here, and Knit
arbuthnot$girls

Notice that the way R has printed these data is different. When we looked at the complete data frame, we saw 82 rows, one on each line of the display. These data are no longer structured in a table with other variables, so they are displayed one right after another. Objects that print out in this way are called vectors; they represent a set of numbers. R has added numbers in [brackets] along the left side of the printout to indicate locations within the vector. For example, 5218 follows [1], indicating that 5218 is the first entry in the vector. And if [43] starts a line, then that would mean the first number on that line would represent the 43rd entry in the vector.
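
The bracketed positions can also be used to pull out individual entries. The sketch below assumes a hand-typed three-row stand-in for the data frame (the name `mini` is hypothetical) so that it runs without loading any package.

```r
# Hand-typed stand-in for the first three rows of arbuthnot
mini <- data.frame(
  year  = c(1629, 1630, 1631),
  boys  = c(5218, 4858, 4422),
  girls = c(4683, 4457, 4102)
)

# The dollar sign extracts one column as a vector
boys <- mini$boys
print(boys)

boys[1]          # the entry at position 1, i.e. the value printed after [1]
boys[2:3]        # the entries at positions 2 and 3
is.vector(boys)  # TRUE: this is a vector, not a data frame
```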

R has some powerful functions for making graphics. We can create a simple plot of the number of girls baptized per year with the command



In [7]:

    
ggplot(data = arbuthnot, aes(x = year, y = girls)) +
  geom_point()

Before we review the code for this plot, let's summarize the trends we see in the data.

Which of the following best describes the number of girls baptised over the years included in this dataset?
1. There appears to be no trend in the number of girls baptised from 1629 to 1710.
2. There is initially an increase in the number of girls baptised, which peaks around 1640. After 1640 there is a decrease in the number of girls baptised, but the number begins to increase again in 1660. Overall the trend is an increase in the number of girls baptised.
3. There is initially an increase in the number of girls baptised. This number peaks around 1640 and then after 1640 the number of girls baptised decreases.
4. The number of girls baptised has decreased over time.
5. There is an initial increase in the number of girls baptised but this number appears to level around 1680 and not change after that time point.

Back to the code... We use the ggplot() function to build plots. If you run the plotting code in your console, you should see the plot appear under the Plots tab of the lower right panel of RStudio. Notice that the command above again looks like a function, this time with arguments separated by commas.

The first argument is always the dataset.
Next, we provide the variables from the dataset to be assigned to aesthetic elements of the plot, e.g. the x and the y axes.
Finally, we use another layer, separated by a +, to specify the geometric object for the plot. Since we want a scatterplot, we use geom_point.

You might wonder how you are supposed to know the syntax for the ggplot function. Thankfully, R documents all of its functions extensively. To read what a function does and learn the arguments that are available to you, just type in a question mark followed by the name of the function that you're interested in. Try the following in your console:



In [8]:

    
?ggplot









    Out[8]:





ggplot {ggplot2} R Documentation

Create a new ggplot plot.

Description

ggplot() initializes a ggplot object. It can be used to
declare the input data frame for a graphic and to specify the
set of plot aesthetics intended to be common throughout all
subsequent layers unless specifically overridden.



Usage

ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())



Arguments


data

Default dataset to use for plot. If not already a data.frame,
will be converted to one by fortify. If not specified,
must be suppled in each layer added to the plot.

mapping

Default list of aesthetic mappings to use for plot.
If not specified, must be suppled in each layer added to the plot.

...

Other arguments passed on to methods. Not currently used.

environment

If an variable defined in the aesthetic mapping is not
found in the data, ggplot will look for it in this environment. It defaults
to using the environment in which ggplot() is called.




Details

ggplot() is typically used to construct a plot
incrementally, using the + operator to add layers to the
existing ggplot object. This is advantageous in that the
code is explicit about which layers are added and the order
in which they are added. For complex graphics with multiple
layers, initialization with ggplot is recommended.

There are three common ways to invoke ggplot:



 ggplot(df, aes(x, y, <other aesthetics>))


 ggplot(df)


 ggplot()



The first method is recommended if all layers use the same
data and the same set of aesthetics, although this method
can also be used to add a layer using data from another
data frame. See the first example below. The second
method specifies the default data frame to use for the plot,
but no aesthetics are defined up front. This is useful when
one data frame is used predominantly as layers are added,
but the aesthetics may vary from one layer to another. The
third method initializes a skeleton ggplot object which
is fleshed out as layers are added. This method is useful when
multiple data frames are used to produce different layers, as
is often the case in complex graphics.



Examples

df <- data.frame(gp = factor(rep(letters[1:3], each = 10)),
                 y = rnorm(30))
# Compute sample mean and standard deviation in each group
ds <- plyr::ddply(df, "gp", plyr::summarise, mean = mean(y), sd = sd(y))

# Declare the data frame and common aesthetics.
# The summary data frame ds is used to plot
# larger red points in a second geom_point() layer.
# If the data = argument is not specified, it uses the
# declared data frame from ggplot(); ditto for the aesthetics.
ggplot(df, aes(x = gp, y = y)) +
   geom_point() +
   geom_point(data = ds, aes(y = mean),
              colour = 'red', size = 3)
# Same plot as above, declaring only the data frame in ggplot().
# Note how the x and y aesthetics must now be declared in
# each geom_point() layer.
ggplot(df) +
   geom_point(aes(x = gp, y = y)) +
   geom_point(data = ds, aes(x = gp, y = mean),
                 colour = 'red', size = 3)
# Set up a skeleton ggplot object and add layers:
ggplot() +
  geom_point(data = df, aes(x = gp, y = y)) +
  geom_point(data = ds, aes(x = gp, y = mean),
                        colour = 'red', size = 3) +
  geom_errorbar(data = ds, aes(x = gp, y = mean,
                    ymin = mean - sd, ymax = mean + sd),
                    colour = 'red', width = 0.4)


[Package ggplot2 version 2.1.0 ]

Notice that the help file replaces the plot in the lower right panel. You can toggle between plots and help files using the tabs at the top of that panel.
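
Besides the question mark, base R offers a few other ways to inspect a function. The sketch below uses the base function `sd` as an arbitrary example, chosen here because it needs no extra packages; the lab itself does not use it.

```r
# help("sd") is the function-call form of ?sd. It is left commented out
# so that running this script does not open a help page.
# help("sd")

# Names of the arguments that sd() accepts
arg_names <- names(formals(sd))
print(arg_names)

# args() prints the full signature, including default values
args(sd)
```

The same calls work for any function that is currently loaded, for instance `args(ggplot)` once `ggplot2` has been attached with `library`.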

More extensive help for plotting with the `ggplot2` package can be found at http://docs.ggplot2.org/current/. The best (and easiest) way to learn the syntax is to take a look at the sample plots provided on that page, and modify the code bit by bit until you achieve the plot you want.

R as a big calculator

Now, suppose we want to plot the total number of baptisms. To compute this, we could use the fact that R is really just a big calculator. We can type in mathematical expressions like



In [9]:

    
5218 + 4683









    Out[9]:




9901

to see the total number of baptisms in 1629. We could repeat this once for each year, but there is a faster way. If we add the vector for baptisms for boys to that of girls, R will compute all sums simultaneously.



In [10]:

    
arbuthnot$boys + arbuthnot$girls

What you will see are 82 numbers (in that packed display, because we aren't looking at a data frame here), each one representing the sum we're after. Take a look at a few of them and verify that they are right.
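
One way to verify a few of the sums is to redo them on a handful of values. This sketch types in the first three years from the printout; the vector names are illustrative.

```r
# Counts for 1629, 1630 and 1631, copied from the printout
boys  <- c(5218, 4858, 4422)
girls <- c(4683, 4457, 4102)

# Adding two vectors of equal length adds them position by position
total <- boys + girls
print(total)

# The first entry should match the single sum computed earlier
total[1] == 5218 + 4683
```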

Adding a new variable to the data frame

We'll be using this new vector to generate some plots, so we'll want to save it as a permanent column in our data frame.



In [11]:

    
arbuthnot <- arbuthnot %>% 
  mutate(total = boys + girls)

What in the world is going on here? The %>% operator is called the piping operator. Basically, it takes the output of the current line and pipes it into the following line of code.

**A note on piping:** We can read these two lines of code as follows: *"Take the `arbuthnot` dataset and **pipe** it into the `mutate` function. Use `mutate` to create a new variable called `total` that is the sum of the variables called `boys` and `girls`. Then assign the resulting dataset to the object called `arbuthnot`, i.e. overwrite the old `arbuthnot` dataset with the new one containing the new variable."* This is essentially equivalent to going through each row, adding up the boys and girls counts for that year, and recording that value in a new column called total.

**Where is the new variable?** When you make changes to variables in your dataset, click on the name of the dataset again to update it in the data viewer.

You'll see that there is now a new column called total that has been tacked on to the data frame. The special symbol <- performs an assignment, taking the output of one line of code and saving it into an object in your workspace. In this case, you already have an object called arbuthnot, so this command updates that data set with the new mutated column.
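
For comparison, the same "compute a column, then assign the result" idea can be written without the pipe. The sketch below is not the lab's method: it swaps in base R's `transform()` and direct column assignment on a hand-typed three-row stand-in (`mini`, a hypothetical name), purely to show what the assignment step does.

```r
# Hand-typed stand-in for the first three rows of arbuthnot
mini <- data.frame(
  year  = c(1629, 1630, 1631),
  boys  = c(5218, 4858, 4422),
  girls = c(4683, 4457, 4102)
)

# Option 1: assign a new column directly into a copy
a <- mini
a$total <- a$boys + a$girls

# Option 2: transform() returns a modified copy; nothing is saved
# until the result is assigned to a name with <-
b <- transform(mini, total = boys + girls)

# Both routes produce the same totals
all.equal(a$total, b$total)

# The original object is unchanged because it was never reassigned
"total" %in% names(mini)
```

The last line prints `FALSE`, which is why the lab's command assigns the result back to `arbuthnot`.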

We can make a plot of the total number of baptisms per year with the following command.



In [12]:

    
ggplot(data = arbuthnot, aes(x = year, y = total)) +
  geom_line()

Note that using geom_line() instead of geom_point() results in a line plot instead of a scatter plot. You want both? Just layer them on:



In [13]:

    
ggplot(data = arbuthnot, aes(x = year, y = total)) +
  geom_line() +
  geom_point()

**Exercise**: Now, generate a plot of the proportion of boys born over time. What do you see?



In [14]:

    
# type your code for the Exercise here, and Knit
ggplot(data = arbuthnot, aes(x = year, y = boys / total)) +
  geom_line() +
  geom_point()

Finally, in addition to simple mathematical operators like subtraction and division, you can ask R to make comparisons like greater than, >, less than, <, and equality, ==. For example, we can ask if boys outnumber girls in each year with the expression



In [15]:

    
arbuthnot <- arbuthnot %>%
  mutate(more_boys = boys > girls)



In [19]:

    
arbuthnot$more_boys









    Out[19]:





	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE
	TRUE

This command adds a new variable to the arbuthnot data frame containing the value TRUE if that year had more boys than girls, or FALSE if it did not (the answer may surprise you). This variable contains a different kind of data than we have considered so far. All other columns in the arbuthnot data frame have values that are numerical (the year, the number of boys and girls). Here, we've asked R to create logical data, data where the values are either TRUE or FALSE. In general, data analysis will involve many different kinds of data types, and one reason for using R is that it is able to represent and compute with many of them.
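
Logical vectors can be summarized with ordinary arithmetic, because R treats TRUE as 1 and FALSE as 0. A small sketch on the first three years of the data, with illustrative variable names:

```r
# Counts for 1629, 1630 and 1631, copied from the printout
boys  <- c(5218, 4858, 4422)
girls <- c(4683, 4457, 4102)

# A comparison between two vectors is made position by position
more_boys <- boys > girls
print(more_boys)

class(more_boys)  # "logical"
sum(more_boys)    # number of TRUE values
mean(more_boys)   # proportion of TRUE values
all(more_boys)    # TRUE only if every entry is TRUE
```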

Dataset 2: Present birth records

In the previous few pages, you recreated some of the displays and preliminary analysis of Arbuthnot's baptism data. Next you will do a similar analysis, but for present day birth records in the United States. Load up the present day data with the following command.



In [20]:

    
data(present)

The data are stored in a data frame called present which should now be loaded in your workspace.

How many variables are included in this data set?

1. 2
2. 3
3. 4
4. 74
5. 2013



In [21]:

    
# type your code for Question 4 here, and Knit
str(present)









    



Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	74 obs. of  3 variables:
 $ year : num  1940 1941 1942 1943 1944 ...
 $ boys : num  1211684 1289734 1444365 1508959 1435301 ...
 $ girls: num  1148715 1223693 1364631 1427901 1359499 ...

**Exercise**: What years are included in this dataset? **Hint:** Use the `range` function and `present$year` as its argument.



In [22]:

    
# type your code for Exercise here, and Knit
summary(present)









    Out[22]:





      year           boys             girls        
 Min.   :1940   Min.   :1211684   Min.   :1148715  
 1st Qu.:1958   1st Qu.:1823071   1st Qu.:1731210  
 Median :1976   Median :1988038   Median :1897810  
 Mean   :1976   Mean   :1917502   Mean   :1825037  
 3rd Qu.:1995   3rd Qu.:2076156   3rd Qu.:1979778  
 Max.   :2013   Max.   :2208071   Max.   :2108162



In [23]:

    
range(present$year)









    Out[23]:





	1940
	2013

Calculate the total number of births for each year and store these values in a new variable called total in the present dataset. Then, calculate the proportion of boys born each year and store these values in a new variable called prop_boys in the same dataset. Plot these values over time and based on the plot determine if the following statement is true or false: The proportion of boys born in the US has decreased over time.
1. True
2. False



In [24]:

    
# type your code for Question 5 here, and Knit
# create the total column first; the plot below depends on it
present$total <- present$boys + present$girls



In [32]:

    
ggplot(data = present, aes(x = year, y = boys / total)) +
  geom_line() +
  geom_point()
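
The question also asks for the proportion to be stored in a column called prop_boys. A minimal sketch of that step, assuming a two-row stand-in (`mini`, a hypothetical name) built from the first two years shown in the `str(present)` output:

```r
# First two years of present, copied from the str() output
mini <- data.frame(
  year  = c(1940, 1941),
  boys  = c(1211684, 1289734),
  girls = c(1148715, 1223693)
)

# Total births per year, then the share of those births that were boys
mini$total     <- mini$boys + mini$girls
mini$prop_boys <- mini$boys / mini$total
print(mini)
```

Applying the same two assignments to `present` would let the plot use `y = prop_boys` directly.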

Create a new variable called more_boys which contains the value of either TRUE if that year had more boys than girls, or FALSE if that year did not. Based on this variable which of the following statements is true?
1. Every year there are more girls born than boys.
2. Every year there are more boys born than girls.
3. Half of the years there are more boys born, and the other half more girls born.



In [25]:

    
# type your code for Question 6 here, and Knit
present$more_boys <- present$boys > present$girls

In [34]:

    
sum(present$more_boys)/nrow(present)



    Out[34]:

1

Calculate the boy-to-girl ratio each year, and store these values in a new variable called prop_boy_girl in the present dataset. Plot these values over time. Which of the following best describes the trend?
1. There appears to be no trend in the boy-to-girl ratio from 1940 to 2013.
2. There is initially an increase in boy-to-girl ratio, which peaks around 1960. After 1960 there is a decrease in the boy-to-girl ratio, but the number begins to increase in the mid 1970s.
3. There is initially a decrease in the boy-to-girl ratio, and then an increase between 1960 and 1970, followed by a decrease.
4. The boy-to-girl ratio has increased over time.
5. There is an initial decrease in the boy-to-girl ratio, but this number appears to level off around 1960 and remain constant since then.



In [26]:

    
# type your code for Question 7 here, and Knit
present$prop_boy_girl <- present$boys / present$girls



In [35]:

    
ggplot(data = present, aes(x = year, y = prop_boy_girl)) +
  geom_line() +
  geom_point()

In what year did we see the largest total number of births in the U.S.? Hint: Sort your dataset in descending order based on the total column. You can do this interactively in the data viewer by clicking on the arrows next to the variable names. Alternatively, sort the data with the arrange function, wrapping the column name in a new function: desc (for descending order).
1. 1940
2. 1957
3. 1961
4. 1991
5. 2007



In [31]:

    
# type your code for Question 8 here
# sample code is provided below, edit as necessary, uncomment, and then Knit
present %>% arrange(desc(total))









    Out[31]:





year boys girls total more_boys prop_boy_girl

	1 2007 2208071 2108162 4316233 TRUE 1.04739151924757
	2 1961 2186274 2082052 4268326 TRUE 1.05005734727087
	3 2006 2184237 2081318 4265555 TRUE 1.04944895494105
	4 1960 2179708 2078142 4257850 TRUE 1.04887346485466
	5 1957 2179960 2074824 4254784 TRUE 1.05067224979083
	6 2008 2173625 2074069 4247694 TRUE 1.0480003317151
	7 1959 2173638 2071158 4244796 TRUE 1.04947956650338
	8 1958 2152546 2051266 4203812 TRUE 1.04937438635457
	9 1962 2132466 2034896 4167362 TRUE 1.04794839637996
	10 1956 2133588 2029502 4163090 TRUE 1.05128647323334
	11 1990 2129495 2028717 4158212 TRUE 1.0496757310162
	12 2005 2118982 2019367 4138349 TRUE 1.04932981473898
	13 2009 2113739 2016926 4130665 TRUE 1.04800027368381
	14 2004 2104661 2007391 4112052 TRUE 1.0484559311066
	15 1991 2101518 2009389 4110907 TRUE 1.0458492606459
	16 1963 2101632 1996388 4098020 TRUE 1.05271720727634
	17 2003 2093535 1996415 4089950 TRUE 1.04864720010619
	18 1992 2082097 1982917 4065014 TRUE 1.05001722210259
	19 2000 2076969 1981845 4058814 TRUE 1.0479976991137
	20 1955 2073719 1973576 4047295 TRUE 1.05074190200935
	21 1989 2069490 1971468 4040958 TRUE 1.04972030994163
	22 1964 2060162 1967328 4027490 TRUE 1.04718786089559
	23 2001 2057922 1968011 4025933 TRUE 1.04568622837982
	24 2002 2057979 1963747 4021726 TRUE 1.0479858148733
	25 1954 2059068 1958294 4017362 TRUE 1.05146009741132
	26 1993 2048861 1951379 4000240 TRUE 1.04995544176708
	27 2010 2046561 1952825 3999386 TRUE 1.04800020483146
	28 1999 2026854 1932563 3959417 TRUE 1.0487906474459
	29 2011 2024068 1929522 3953590 TRUE 1.04899970044394
	30 2012 2021800 1931041 3952841 TRUE 1.04700003780344
	⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
	45 1970 1915378 1816008 3731386 TRUE 1.05471892194308
	46 1947 1899876 1800064 3699940 TRUE 1.05544913958615
	47 1982 1885676 1794861 3680537 TRUE 1.05059723287764
	48 1984 1879490 1789651 3669141 TRUE 1.05019917291137
	49 1983 1865553 1773380 3638933 TRUE 1.05197588785258
	50 1981 1860272 1768966 3629238 TRUE 1.05161546349675
	51 1980 1852616 1759642 3612258 TRUE 1.05283688386615
	52 1966 1845862 1760412 3606274 TRUE 1.0485397736439
	53 1969 1846572 1753634 3600206 TRUE 1.05299737573519
	54 1949 1826352 1733177 3559529 TRUE 1.05375965639978
	55 1971 1822910 1733060 3555970 TRUE 1.05184471397413
	56 1950 1823555 1730594 3554149 TRUE 1.05371623847072
	57 1948 1813852 1721216 3535068 TRUE 1.0538200899829
	58 1967 1803388 1717571 3520959 TRUE 1.04996416450907
	59 1968 1796326 1705238 3501564 TRUE 1.05341659052871
	60 1979 1791267 1703131 3494398 TRUE 1.05174939567185
	61 1978 1709394 1623885 3333279 TRUE 1.052657053917
	62 1977 1705916 1620716 3326632 TRUE 1.05256935823426
	63 1946 1691220 1597452 3288672 TRUE 1.05869847732514
	64 1972 1669927 1588484 3258411 TRUE 1.0512708972832
	65 1976 1624436 1543352 3167788 TRUE 1.05253759349779
	66 1974 1622114 1537844 3159958 TRUE 1.05479749571478
	67 1975 1613135 1531063 3144198 TRUE 1.05360458713978
	68 1973 1608326 1528639 3136965 TRUE 1.05212937783218
	69 1943 1508959 1427901 2936860 TRUE 1.05676724086614
	70 1942 1444365 1364631 2808996 TRUE 1.0584289819006
	71 1944 1435301 1359499 2794800 TRUE 1.05575730471299
	72 1945 1404587 1330869 2735456 TRUE 1.05539087618691
	73 1941 1289734 1223693 2513427 TRUE 1.05396860160187
	74 1940 1211684 1148715 2360399 TRUE 1.05481690410589
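
Sorting in descending order can also be done in base R. The sketch below is an alternative to `arrange(desc(total))`, not the lab's approach: it uses `order()` on a three-row stand-in (`mini`, a hypothetical name) whose totals are copied from the sorted output.

```r
# Three rows of present, with totals copied from the sorted output
mini <- data.frame(
  year  = c(1940, 1961, 2007),
  total = c(2360399, 4268326, 4316233)
)

# order() returns row positions; decreasing = TRUE puts the largest total first
sorted <- mini[order(mini$total, decreasing = TRUE), ]
print(sorted)

# Year with the largest total among these three rows
sorted$year[1]
```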

Resources for learning R and working in RStudio

That was a short introduction to R and RStudio, but we will provide you with more functions and a more complete sense of the language as the course progresses. You might find the following tips and resources helpful.

In this course we will be using the dplyr (for data wrangling) and ggplot2 (for data visualization) packages extensively. If you are googling for R code, make sure to also include these package names in your search query. For example, instead of googling "scatterplot in R", google "scatterplot in R with ggplot2".
The following cheatsheets may come in handy throughout the course. Note that some of the code on these cheatsheets may be too advanced for this course; however, the majority of it will become useful as you progress through the course material.
While you will get plenty of exercise working with these packages in the labs of this course, if you would like further opportunities to practice we recommend checking out the relevant courses at DataCamp.

This is a derivative of an [OpenIntro](https://www.openintro.org/stat/labs.php) lab, and is released under an [Attribution-NonCommercial-ShareAlike 3.0 United States](https://creativecommons.org/licenses/by-nc-sa/3.0/us/) license.
