Introduction to R and RStudio

Complete all **Exercises**, and submit answers to **Questions** on the Coursera platform.

The goal of this lab is to introduce you to R and RStudio, which you'll be using throughout the course both to learn the statistical concepts discussed in the course and to analyze real data and come to informed conclusions. To straighten out which is which: R is the name of the programming language itself and RStudio is a convenient interface.

As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R. Today we begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and basic commands.

RStudio

Your RStudio window has four panels.

Your R Markdown file (this document) is in the upper left panel.

The panel on the lower left is where the action happens. It's called the console. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you're running. Below that information is the prompt. As its name suggests, this prompt is really a request, a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.

The panel in the upper right contains your workspace as well as a history of the commands that you've previously entered.

Any plots that you generate will show up in the panel in the lower right corner. This is also where you can browse your files, access help, manage packages, etc.

R Packages

R is an open-source programming language, meaning that users can contribute packages that make our lives easier, and we can use them for free. For this lab, and many others in the future, we will use the following R packages:

  • statsr: for data files and functions used in this course
  • dplyr: for data wrangling
  • ggplot2: for data visualization

You should have already installed these packages using commands like install.packages and install_github.

Next, you need to load the packages in your working environment. We do this with the library function. Note that you only need to install packages once, but you need to load them each time you relaunch RStudio.


In [1]:
library(dplyr)
library(ggplot2)
library(statsr)


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Warning message:
: package ‘ggplot2’ was built under R version 3.2.4

To do so, you can

  • click on the green arrow at the top of the code chunk in the R Markdown (Rmd) file, or
  • highlight these lines, and hit the Run button on the upper right corner of the pane, or
  • type the code in the console.

Going forward you will be asked to load any relevant packages at the beginning of each lab.
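
The difference between installing and loading can be checked from the console. The following is a minimal sketch, not part of the original lab: it uses base R's requireNamespace() to test whether each of the three packages listed above is installed, without attaching any of them. The object names pkgs, installed, and missing_pkgs are made up for this illustration.

```r
# Package names taken from the list earlier in this lab
pkgs <- c("statsr", "dplyr", "ggplot2")

# requireNamespace() returns TRUE when a package is installed.
# It loads the package's namespace but does not attach it, so
# library() is still needed before using its functions by name.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Keep only the names of packages that were not found
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All three packages are installed; load them with library().")
}
```

If a package shows up as not installed, install it once; after that, only the library() calls need to be repeated in each new session.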

Dataset 1: Dr. Arbuthnot's Baptism Records

To get you started, run the following command to load the data.


In [ ]:
data(arbuthnot)

To do so, once again, you can

  • click on the green arrow at the top of the code chunk in the R Markdown (Rmd) file, or
  • put your cursor on this line, and hit the Run button on the upper right corner of the pane, or
  • type the code in the console.

This command instructs R to load some data: the Arbuthnot baptism counts for boys and girls. You should see that the workspace area in the upper righthand corner of the RStudio window now lists a data set called arbuthnot that has 82 observations on 3 variables. As you interact with R, you will create a series of objects. Sometimes you load them as we have done here, and sometimes you create them yourself as the byproduct of a computation or some analysis you have performed.

The Arbuthnot data set refers to Dr. John Arbuthnot, an 18th century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children born in London for every year from 1629 to 1710. We can take a look at the data by typing its name into the console.


In [2]:
arbuthnot


Out[2]:
   year boys girls
1  1629 5218  4683
2  1630 4858  4457
3  1631 4422  4102
4  1632 4994  4590
5  1633 5158  4839
6  1634 5035  4820
7  1635 5106  4928
8  1636 4917  4605
9  1637 4703  4457
10 1638 5359  4952
11 1639 5366  4784
12 1640 5518  5332
13 1641 5470  5200
14 1642 5460  4910
15 1643 4793  4617
16 1644 4107  3997
17 1645 4047  3919
18 1646 3768  3395
19 1647 3796  3536
20 1648 3363  3181
21 1649 3079  2746
22 1650 2890  2722
23 1651 3231  2840
24 1652 3220  2908
25 1653 3196  2959
26 1654 3441  3179
27 1655 3655  3349
28 1656 3668  3382
29 1657 3396  3289
30 1658 3157  3013
⋮
53 1681 6822  6533
54 1682 6909  6744
55 1683 7577  7158
56 1684 7575  7127
57 1685 7484  7246
58 1686 7575  7119
59 1687 7737  7214
60 1688 7487  7101
61 1689 7604  7167
62 1690 7909  7302
63 1691 7662  7392
64 1692 7602  7316
65 1693 7676  7483
66 1694 6985  6647
67 1695 7263  6713
68 1696 7632  7229
69 1697 8062  7767
70 1698 8426  7626
71 1699 7911  7452
72 1700 7578  7061
73 1701 8102  7514
74 1702 8031  7656
75 1703 7765  7683
76 1704 6113  5738
77 1705 8366  7779
78 1706 7952  7417
79 1707 8379  7687
80 1708 8239  7623
81 1709 7840  7380
82 1710 7640  7288

However, printing the whole dataset in the console is not that useful. One advantage of RStudio is that it comes with a built-in data viewer. Click on the name arbuthnot in the Environment pane (upper right window) that lists the objects in your workspace. This will bring up an alternative display of the data set in the Data Viewer (upper left window). You can close the data viewer by clicking on the x in the upper lefthand corner.

What you should see are four columns of numbers, each row representing a different year: the first entry in each row is simply the row number (an index we can use to access the data from individual years if we want), the second is the year, and the third and fourth are the numbers of boys and girls baptized that year, respectively. Use the scrollbar on the right side of the console window to examine the complete data set.

Note that the row numbers in the first column are not part of Arbuthnot's data. R adds them as part of its printout to help you make visual comparisons. You can think of them as the index that you see on the left side of a spreadsheet. In fact, the comparison to a spreadsheet will generally be helpful. R has stored Arbuthnot's data in a kind of spreadsheet or table called a data frame.

You can see the dimensions of this data frame by typing:


In [3]:
dim(arbuthnot)


Out[3]:
  1. 82
  2. 3

This command should output [1] 82 3, indicating that there are 82 rows and 3 columns (we'll get to what the [1] means in a bit), just as it says next to the object in your workspace. You can see the names of these columns (or variables) by typing:


In [4]:
names(arbuthnot)


Out[4]:
  1. 'year'
  2. 'boys'
  3. 'girls'

  1. How many variables are included in this data set?
    1. 2
    2. 3
    3. 4
    4. 82
    5. 1710

**Exercise**: What years are included in this dataset? Hint: Take a look at the year variable in the Data Viewer to answer this question.

You should see that the data frame contains the columns year, boys, and girls. At this point, you might notice that many of the commands in R look a lot like functions from math class; that is, invoking R commands means supplying a function with some number of arguments. The dim and names commands, for example, each took a single argument, the name of a data frame.
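
To make the "function with arguments" idea concrete, here is a small sketch that builds a three-row data frame by hand and passes it to dim and names. The values are the first three rows of the printout above; the name toy is hypothetical and is used only so the example runs without loading the course data.

```r
# A hand-built data frame holding the first three rows shown earlier
toy <- data.frame(
  year  = c(1629, 1630, 1631),
  boys  = c(5218, 4858, 4422),
  girls = c(4683, 4457, 4102)
)

# Each function takes one argument: the data frame
toy_dim   <- dim(toy)    # number of rows, then number of columns
toy_names <- names(toy)  # the column (variable) names

print(toy_dim)
print(toy_names)

# nrow() and ncol() return the two dimensions one at a time
print(nrow(toy))
print(ncol(toy))
```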

**Tip:** If you use the up and down arrow keys, you can scroll through your previous commands, your so-called command history. You can also access it by clicking on the history tab in the upper right panel. This will save you a lot of typing in the future.

R Markdown

So far we have asked you to type your commands in the console. The console is a great place for playing around with some code; however, it is not a good place for documenting your work. Working in the console exclusively makes it difficult to document your work as you go and to reproduce it later.

R Markdown is a great solution for this problem. And you have already worked with an R Markdown document -- this lab! Going forward, type the code for the questions in the code chunks provided in the R Markdown (Rmd) document for the lab, and Knit the document to see the results.

Some Exploration

Let's start to examine the data a little more closely. We can access the data in a single column of a data frame separately using a command like


In [5]:
arbuthnot$boys


Out[5]:
  1. 5218
  2. 4858
  3. 4422
  4. 4994
  5. 5158
  6. 5035
  7. 5106
  8. 4917
  9. 4703
  10. 5359
  11. 5366
  12. 5518
  13. 5470
  14. 5460
  15. 4793
  16. 4107
  17. 4047
  18. 3768
  19. 3796
  20. 3363
  21. 3079
  22. 2890
  23. 3231
  24. 3220
  25. 3196
  26. 3441
  27. 3655
  28. 3668
  29. 3396
  30. 3157
  31. 3209
  32. 3724
  33. 4748
  34. 5216
  35. 5411
  36. 6041
  37. 5114
  38. 4678
  39. 5616
  40. 6073
  41. 6506
  42. 6278
  43. 6449
  44. 6443
  45. 6073
  46. 6113
  47. 6058
  48. 6552
  49. 6423
  50. 6568
  51. 6247
  52. 6548
  53. 6822
  54. 6909
  55. 7577
  56. 7575
  57. 7484
  58. 7575
  59. 7737
  60. 7487
  61. 7604
  62. 7909
  63. 7662
  64. 7602
  65. 7676
  66. 6985
  67. 7263
  68. 7632
  69. 8062
  70. 8426
  71. 7911
  72. 7578
  73. 8102
  74. 8031
  75. 7765
  76. 6113
  77. 8366
  78. 7952
  79. 8379
  80. 8239
  81. 7840
  82. 7640

This command will only show the number of boys baptized each year. The dollar sign basically says "go to the data frame that comes before me, and find the variable that comes after me".

  2. What command would you use to extract just the counts of girls born?
    1. `arbuthnot$boys`
    2. `arbuthnot$girls`
    3. `girls`
    4. `arbuthnot[girls]`
    5. `$girls`

In [6]:
# type your code for the Question 2 here, and Knit
arbuthnot$girls


Out[6]:
  1. 4683
  2. 4457
  3. 4102
  4. 4590
  5. 4839
  6. 4820
  7. 4928
  8. 4605
  9. 4457
  10. 4952
  11. 4784
  12. 5332
  13. 5200
  14. 4910
  15. 4617
  16. 3997
  17. 3919
  18. 3395
  19. 3536
  20. 3181
  21. 2746
  22. 2722
  23. 2840
  24. 2908
  25. 2959
  26. 3179
  27. 3349
  28. 3382
  29. 3289
  30. 3013
  31. 2781
  32. 3247
  33. 4107
  34. 4803
  35. 4881
  36. 5681
  37. 4858
  38. 4319
  39. 5322
  40. 5560
  41. 5829
  42. 5719
  43. 6061
  44. 6120
  45. 5822
  46. 5738
  47. 5717
  48. 5847
  49. 6203
  50. 6033
  51. 6041
  52. 6299
  53. 6533
  54. 6744
  55. 7158
  56. 7127
  57. 7246
  58. 7119
  59. 7214
  60. 7101
  61. 7167
  62. 7302
  63. 7392
  64. 7316
  65. 7483
  66. 6647
  67. 6713
  68. 7229
  69. 7767
  70. 7626
  71. 7452
  72. 7061
  73. 7514
  74. 7656
  75. 7683
  76. 5738
  77. 7779
  78. 7417
  79. 7687
  80. 7623
  81. 7380
  82. 7288

Notice that the way R has printed these data is different. When we looked at the complete data frame, we saw 82 rows, one on each line of the display. These data are no longer structured in a table with other variables, so they are displayed one right after another. Objects that print out in this way are called vectors; they represent a set of numbers. R has added numbers in [brackets] along the left side of the printout to indicate locations within the vector. For example, 5218 follows [1], indicating that 5218 is the first entry in the vector. And if [43] starts a line, then that would mean the first number on that line would represent the 43rd entry in the vector.
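
As a small illustration of how positions in a vector work, the sketch below builds a short vector by hand from the first five boys counts printed earlier and pulls out entries with square brackets. The name boys_head is made up for this example and is not part of the lab's data.

```r
# First five yearly counts of boys, copied from the printout above
boys_head <- c(5218, 4858, 4422, 4994, 5158)

# Square brackets pick out entries by position
first_entry    <- boys_head[1]    # the entry that follows [1] in a printout
middle_entries <- boys_head[2:4]  # positions 2 through 4

# length() reports how many entries the vector holds
n_entries <- length(boys_head)

print(first_entry)
print(middle_entries)
print(n_entries)
```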

R has some powerful functions for making graphics. We can create a simple plot of the number of girls baptized per year with the command


In [7]:
ggplot(data = arbuthnot, aes(x = year, y = girls)) +
  geom_point()


Before we review the code for this plot, let's summarize the trends we see in the data.

  3. Which of the following best describes the number of girls baptised over the years included in this dataset?
    1. There appears to be no trend in the number of girls baptised from 1629 to 1710.
    2. There is initially an increase in the number of girls baptised, which peaks around 1640. After 1640 there is a decrease in the number of girls baptised, but the number begins to increase again in 1660. Overall the trend is an increase in the number of girls baptised.
    3. There is initially an increase in the number of girls baptised. This number peaks around 1640 and then after 1640 the number of girls baptised decreases.
    4. The number of girls baptised has decreased over time.
    5. There is an initial increase in the number of girls baptised but this number appears to level around 1680 and not change after that time point.

Back to the code... We use the ggplot() function to build plots. If you run the plotting code in your console, you should see the plot appear under the Plots tab of the lower right panel of RStudio. Notice that the command above again looks like a function, this time with arguments separated by commas.

  • The first argument is always the dataset.
  • Next, we provide the variables from the dataset to be assigned to aesthetic elements of the plot, e.g. the x and the y axes.
  • Finally, we use another layer, separated by a +, to specify the geometric object for the plot. Since we want a scatterplot, we use geom_point.

You might wonder how you are supposed to know the syntax for the ggplot function. Thankfully, R documents all of its functions extensively. To read what a function does and learn the arguments that are available to you, just type in a question mark followed by the name of the function that you're interested in. Try the following in your console:


In [8]:
?ggplot


Out[8]:
ggplot {ggplot2}R Documentation

Create a new ggplot plot.

Description

ggplot() initializes a ggplot object. It can be used to declare the input data frame for a graphic and to specify the set of plot aesthetics intended to be common throughout all subsequent layers unless specifically overridden.

Usage

ggplot(data = NULL, mapping = aes(), ..., environment = parent.frame())

Arguments

data

Default dataset to use for plot. If not already a data.frame, will be converted to one by fortify. If not specified, must be suppled in each layer added to the plot.

mapping

Default list of aesthetic mappings to use for plot. If not specified, must be suppled in each layer added to the plot.

...

Other arguments passed on to methods. Not currently used.

environment

If an variable defined in the aesthetic mapping is not found in the data, ggplot will look for it in this environment. It defaults to using the environment in which ggplot() is called.

Details

ggplot() is typically used to construct a plot incrementally, using the + operator to add layers to the existing ggplot object. This is advantageous in that the code is explicit about which layers are added and the order in which they are added. For complex graphics with multiple layers, initialization with ggplot is recommended.

There are three common ways to invoke ggplot:

  • ggplot(df, aes(x, y, <other aesthetics>))

  • ggplot(df)

  • ggplot()

The first method is recommended if all layers use the same data and the same set of aesthetics, although this method can also be used to add a layer using data from another data frame. See the first example below. The second method specifies the default data frame to use for the plot, but no aesthetics are defined up front. This is useful when one data frame is used predominantly as layers are added, but the aesthetics may vary from one layer to another. The third method initializes a skeleton ggplot object which is fleshed out as layers are added. This method is useful when multiple data frames are used to produce different layers, as is often the case in complex graphics.

Examples

df <- data.frame(gp = factor(rep(letters[1:3], each = 10)),
                 y = rnorm(30))
# Compute sample mean and standard deviation in each group
ds <- plyr::ddply(df, "gp", plyr::summarise, mean = mean(y), sd = sd(y))

# Declare the data frame and common aesthetics.
# The summary data frame ds is used to plot
# larger red points in a second geom_point() layer.
# If the data = argument is not specified, it uses the
# declared data frame from ggplot(); ditto for the aesthetics.
ggplot(df, aes(x = gp, y = y)) +
   geom_point() +
   geom_point(data = ds, aes(y = mean),
              colour = 'red', size = 3)
# Same plot as above, declaring only the data frame in ggplot().
# Note how the x and y aesthetics must now be declared in
# each geom_point() layer.
ggplot(df) +
   geom_point(aes(x = gp, y = y)) +
   geom_point(data = ds, aes(x = gp, y = mean),
                 colour = 'red', size = 3)
# Set up a skeleton ggplot object and add layers:
ggplot() +
  geom_point(data = df, aes(x = gp, y = y)) +
  geom_point(data = ds, aes(x = gp, y = mean),
                        colour = 'red', size = 3) +
  geom_errorbar(data = ds, aes(x = gp, y = mean,
                    ymin = mean - sd, ymax = mean + sd),
                    colour = 'red', width = 0.4)

[Package ggplot2 version 2.1.0 ]

Notice that the help file replaces the plot in the lower right panel. You can toggle between plots and help files using the tabs at the top of that panel.

More extensive help for plotting with the `ggplot2` package can be found at http://docs.ggplot2.org/current/. The best (and easiest) way to learn the syntax is to take a look at the sample plots provided on that page, and modify the code bit by bit until you achieve the plot you want.

R as a big calculator

Now, suppose we want to plot the total number of baptisms. To compute this, we could use the fact that R is really just a big calculator. We can type in mathematical expressions like


In [9]:
5218 + 4683


Out[9]:
9901

to see the total number of baptisms in 1629. We could repeat this once for each year, but there is a faster way. If we add the vector for baptisms for boys to that of girls, R will compute all sums simultaneously.


In [10]:
arbuthnot$boys + arbuthnot$girls


Out[10]:
  1. 9901
  2. 9315
  3. 8524
  4. 9584
  5. 9997
  6. 9855
  7. 10034
  8. 9522
  9. 9160
  10. 10311
  11. 10150
  12. 10850
  13. 10670
  14. 10370
  15. 9410
  16. 8104
  17. 7966
  18. 7163
  19. 7332
  20. 6544
  21. 5825
  22. 5612
  23. 6071
  24. 6128
  25. 6155
  26. 6620
  27. 7004
  28. 7050
  29. 6685
  30. 6170
  31. 5990
  32. 6971
  33. 8855
  34. 10019
  35. 10292
  36. 11722
  37. 9972
  38. 8997
  39. 10938
  40. 11633
  41. 12335
  42. 11997
  43. 12510
  44. 12563
  45. 11895
  46. 11851
  47. 11775
  48. 12399
  49. 12626
  50. 12601
  51. 12288
  52. 12847
  53. 13355
  54. 13653
  55. 14735
  56. 14702
  57. 14730
  58. 14694
  59. 14951
  60. 14588
  61. 14771
  62. 15211
  63. 15054
  64. 14918
  65. 15159
  66. 13632
  67. 13976
  68. 14861
  69. 15829
  70. 16052
  71. 15363
  72. 14639
  73. 15616
  74. 15687
  75. 15448
  76. 11851
  77. 16145
  78. 15369
  79. 16066
  80. 15862
  81. 15220
  82. 14928

What you will see are 82 numbers (in that packed display, because we aren't looking at a data frame here), each one representing the sum we're after. Take a look at a few of them and verify that they are right.
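
One way to verify a few of the sums is to redo them on a short slice of the data. The sketch below copies the first three boys and girls counts from the printouts above into two hand-made vectors (the names boys_head, girls_head, and total_head are hypothetical) and adds them element by element.

```r
# First three yearly counts, copied from the printouts above
boys_head  <- c(5218, 4858, 4422)
girls_head <- c(4683, 4457, 4102)

# Adding two vectors of the same length adds them position by position
total_head <- boys_head + girls_head
print(total_head)

# Compare the first element with the sum computed by hand earlier
first_matches <- total_head[1] == 5218 + 4683
print(first_matches)
```

The three results should agree with the first three entries of the full list of totals shown above.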

Adding a new variable to the data frame

We'll be using this new vector to generate some plots, so we'll want to save it as a permanent column in our data frame.


In [11]:
arbuthnot <- arbuthnot %>% 
  mutate(total = boys + girls)

What in the world is going on here? The %>% operator is called the piping operator. Basically, it takes the output of the current line and pipes it into the following line of code.

**A note on piping:** Note that we can read these two lines of code as the following: *"Take the `arbuthnot` dataset and **pipe** it into the `mutate` function. Mutate the dataset by creating a new variable called `total` that is the sum of the variables called `boys` and `girls`. Then assign the resulting dataset to the object called `arbuthnot`, i.e. overwrite the old `arbuthnot` dataset with the new one containing the new variable."* This is essentially equivalent to going through each row and adding up the boys and girls counts for that year and recording that value in a new column called total.

**Where is the new variable?** When you make changes to variables in your dataset, click on the name of the dataset again to update it in the data viewer.

You'll see that there is now a new column called total that has been tacked on to the data frame. The special symbol <- performs an assignment, taking the output of one line of code and saving it into an object in your workspace. In this case, you already have an object called arbuthnot, so this command updates that data set with the new mutated column.
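
For comparison only, here is a sketch of what the same step computes, written in base R so that it runs without any packages. This is not the approach the lab uses (the lab uses mutate and the pipe); the data frame toy and the other object names are hypothetical, with values taken from the first three rows of the data.

```r
# A hand-built stand-in for the first three rows of the data
toy <- data.frame(
  year  = c(1629, 1630, 1631),
  boys  = c(5218, 4858, 4422),
  girls = c(4683, 4457, 4102)
)

# Option 1: assign a new column directly with $ and <-
toy_a <- toy
toy_a$total <- toy_a$boys + toy_a$girls

# Option 2: transform() returns a copy of the data frame with the new
# column added, which is then saved with <- just as in the lab's code
toy_b <- transform(toy, total = boys + girls)

print(toy_a)
print(names(toy_b))
```

In both versions the original object is unchanged until the result is assigned with <-, which is the same role the assignment plays in the piped code above.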

We can make a plot of the total number of baptisms per year with the following command.


In [12]:
ggplot(data = arbuthnot, aes(x = year, y = total)) +
  geom_line()


Note that using geom_line() instead of geom_point() results in a line plot instead of a scatter plot. You want both? Just layer them on:


In [13]:
ggplot(data = arbuthnot, aes(x = year, y = total)) +
  geom_line() +
  geom_point()


**Exercise**: Now, generate a plot of the proportion of boys born over time. What do you see?

In [14]:
# type your code for the Exercise here, and Knit
ggplot(data = arbuthnot, aes(x = year, y = boys / total)) +
  geom_line() +
  geom_point()


Finally, in addition to simple mathematical operators like subtraction and division, you can ask R to make comparisons like greater than, >, less than, <, and equality, ==. For example, we can ask if boys outnumber girls in each year with the expression


In [15]:
arbuthnot <- arbuthnot %>%
  mutate(more_boys = boys > girls)

In [19]:
arbuthnot$more_boys


Out[19]:
  1. TRUE
  2. TRUE
  3. TRUE
  4. TRUE
  5. TRUE
  6. TRUE
  7. TRUE
  8. TRUE
  9. TRUE
  10. TRUE
  11. TRUE
  12. TRUE
  13. TRUE
  14. TRUE
  15. TRUE
  16. TRUE
  17. TRUE
  18. TRUE
  19. TRUE
  20. TRUE
  21. TRUE
  22. TRUE
  23. TRUE
  24. TRUE
  25. TRUE
  26. TRUE
  27. TRUE
  28. TRUE
  29. TRUE
  30. TRUE
  31. TRUE
  32. TRUE
  33. TRUE
  34. TRUE
  35. TRUE
  36. TRUE
  37. TRUE
  38. TRUE
  39. TRUE
  40. TRUE
  41. TRUE
  42. TRUE
  43. TRUE
  44. TRUE
  45. TRUE
  46. TRUE
  47. TRUE
  48. TRUE
  49. TRUE
  50. TRUE
  51. TRUE
  52. TRUE
  53. TRUE
  54. TRUE
  55. TRUE
  56. TRUE
  57. TRUE
  58. TRUE
  59. TRUE
  60. TRUE
  61. TRUE
  62. TRUE
  63. TRUE
  64. TRUE
  65. TRUE
  66. TRUE
  67. TRUE
  68. TRUE
  69. TRUE
  70. TRUE
  71. TRUE
  72. TRUE
  73. TRUE
  74. TRUE
  75. TRUE
  76. TRUE
  77. TRUE
  78. TRUE
  79. TRUE
  80. TRUE
  81. TRUE
  82. TRUE

This command adds a new variable to the arbuthnot data frame containing the value TRUE if that year had more boys than girls, or FALSE if that year did not (the answer may surprise you). This variable contains a different kind of data than we have considered so far. All other columns in the arbuthnot data frame have values that are numerical (the year, the number of boys and girls). Here, we've asked R to create logical data, data where the values are either TRUE or FALSE. In general, data analysis will involve many different kinds of data types, and one reason for using R is that it is able to represent and compute with many of them.
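
Logical vectors can also be summarized with ordinary arithmetic, because R treats TRUE as 1 and FALSE as 0 when computing. The sketch below illustrates this on three hand-copied rows; the object names are hypothetical and the data are only a small slice, so the results say nothing about the full dataset.

```r
# First three yearly counts, copied from the printouts above
boys_head  <- c(5218, 4858, 4422)
girls_head <- c(4683, 4457, 4102)

# A comparison between two vectors is made position by position
more_boys_head <- boys_head > girls_head
print(more_boys_head)
print(class(more_boys_head))  # the type of data stored in the vector

# TRUE counts as 1 and FALSE as 0, so sum() counts the TRUE values
n_true <- sum(more_boys_head)

# all() is TRUE only if every entry is TRUE
every_year <- all(more_boys_head)

print(n_true)
print(every_year)
```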

Dataset 2: Present birth records

In the previous few pages, you recreated some of the displays and preliminary analysis of Arbuthnot's baptism data. Next you will do a similar analysis, but for present day birth records in the United States. Load up the present day data with the following command.


In [20]:
data(present)

The data are stored in a data frame called present which should now be loaded in your workspace.

  4. How many variables are included in this data set?
    1. 2
    2. 3
    3. 4
    4. 74
    5. 2013

In [21]:
# type your code for Question 4 here, and Knit
str(present)


Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	74 obs. of  3 variables:
 $ year : num  1940 1941 1942 1943 1944 ...
 $ boys : num  1211684 1289734 1444365 1508959 1435301 ...
 $ girls: num  1148715 1223693 1364631 1427901 1359499 ...
**Exercise**: What years are included in this dataset? **Hint:** Use the `range` function and `present$year` as its argument.

In [22]:
# type your code for Exercise here, and Knit
summary(present)


Out[22]:
      year           boys             girls        
 Min.   :1940   Min.   :1211684   Min.   :1148715  
 1st Qu.:1958   1st Qu.:1823071   1st Qu.:1731210  
 Median :1976   Median :1988038   Median :1897810  
 Mean   :1976   Mean   :1917502   Mean   :1825037  
 3rd Qu.:1995   3rd Qu.:2076156   3rd Qu.:1979778  
 Max.   :2013   Max.   :2208071   Max.   :2108162  

In [23]:
range(present$year)


Out[23]:
  1. 1940
  2. 2013
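
As a brief aside on what range returns, the sketch below applies it to a short, deliberately unordered vector. The vector years is made up for this illustration.

```r
# A short, unordered vector of years (hypothetical values)
years <- c(1942, 1940, 1941)

# range() returns two numbers: the minimum, then the maximum
yr_range <- range(years)
print(yr_range)

# It gives the same result as calling min() and max() separately
same_as_min_max <- identical(yr_range, c(min(years), max(years)))
print(same_as_min_max)
```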

  5. Calculate the total number of births for each year and store these values in a new variable called total in the present dataset. Then, calculate the proportion of boys born each year and store these values in a new variable called prop_boys in the same dataset. Plot these values over time and based on the plot determine if the following statement is true or false: The proportion of boys born in the US has decreased over time.
    1. True
    2. False

In [24]:
# type your code for Question 5 here, and Knit
present$total = present$boys + present$girls

In [32]:
ggplot(data = present, aes(x = year, y = boys / total)) +
  geom_line() +
  geom_point()

  6. Create a new variable called more_boys which contains the value of either TRUE if that year had more boys than girls, or FALSE if that year did not. Based on this variable which of the following statements is true?
    1. Every year there are more girls born than boys.
    2. Every year there are more boys born than girls.
    3. Half of the years there are more boys born, and the other half more girls born.

In [25]:
# type your code for Question 6 here, and Knit
present$more_boys = present$boys > present$girls

In [34]:
sum(present$more_boys)/nrow(present)


Out[34]:
1

  7. Calculate the boy-to-girl ratio each year, and store these values in a new variable called prop_boy_girl in the present dataset. Plot these values over time. Which of the following best describes the trend?
    1. There appears to be no trend in the boy-to-girl ratio from 1940 to 2013.
    2. There is initially an increase in boy-to-girl ratio, which peaks around 1960. After 1960 there is a decrease in the boy-to-girl ratio, but the number begins to increase in the mid 1970s.
    3. There is initially a decrease in the boy-to-girl ratio, and then an increase between 1960 and 1970, followed by a decrease.
    4. The boy-to-girl ratio has increased over time.
    5. There is an initial decrease in the boy-to-girl ratio born but this number appears to level around 1960 and remain constant since then.

In [26]:
# type your code for Question 7 here, and Knit
present$prop_boy_girl = present$boys / present$girls

In [35]:
ggplot(data = present, aes(x = year, y = prop_boy_girl)) +
  geom_line() +
  geom_point()


  8. In what year did we see the largest total number of births in the U.S.? Hint: Sort your dataset in descending order based on the total column. You can do this interactively in the data viewer by clicking on the arrows next to the variable names. Alternatively, use the arrange function together with a new function, desc (for descending order), to sort the data in descending order.
    1. 1940
    2. 1957
    3. 1961
    4. 1991
    5. 2007

In [31]:
# type your code for Question 8 here
# sample code is provided below, edit as necessary, uncomment, and then Knit
present %>% arrange(desc(total))


Out[31]:
   year    boys   girls   total more_boys    prop_boy_girl
1  2007 2208071 2108162 4316233      TRUE 1.04739151924757
2  1961 2186274 2082052 4268326      TRUE 1.05005734727087
3  2006 2184237 2081318 4265555      TRUE 1.04944895494105
4  1960 2179708 2078142 4257850      TRUE 1.04887346485466
5  1957 2179960 2074824 4254784      TRUE 1.05067224979083
6  2008 2173625 2074069 4247694      TRUE  1.0480003317151
7  1959 2173638 2071158 4244796      TRUE 1.04947956650338
8  1958 2152546 2051266 4203812      TRUE 1.04937438635457
9  1962 2132466 2034896 4167362      TRUE 1.04794839637996
10 1956 2133588 2029502 4163090      TRUE 1.05128647323334
11 1990 2129495 2028717 4158212      TRUE  1.0496757310162
12 2005 2118982 2019367 4138349      TRUE 1.04932981473898
13 2009 2113739 2016926 4130665      TRUE 1.04800027368381
14 2004 2104661 2007391 4112052      TRUE  1.0484559311066
15 1991 2101518 2009389 4110907      TRUE  1.0458492606459
16 1963 2101632 1996388 4098020      TRUE 1.05271720727634
17 2003 2093535 1996415 4089950      TRUE 1.04864720010619
18 1992 2082097 1982917 4065014      TRUE 1.05001722210259
19 2000 2076969 1981845 4058814      TRUE  1.0479976991137
20 1955 2073719 1973576 4047295      TRUE 1.05074190200935
21 1989 2069490 1971468 4040958      TRUE 1.04972030994163
22 1964 2060162 1967328 4027490      TRUE 1.04718786089559
23 2001 2057922 1968011 4025933      TRUE 1.04568622837982
24 2002 2057979 1963747 4021726      TRUE  1.0479858148733
25 1954 2059068 1958294 4017362      TRUE 1.05146009741132
26 1993 2048861 1951379 4000240      TRUE 1.04995544176708
27 2010 2046561 1952825 3999386      TRUE 1.04800020483146
28 1999 2026854 1932563 3959417      TRUE  1.0487906474459
29 2011 2024068 1929522 3953590      TRUE 1.04899970044394
30 2012 2021800 1931041 3952841      TRUE 1.04700003780344
⋮
45 1970 1915378 1816008 3731386      TRUE 1.05471892194308
46 1947 1899876 1800064 3699940      TRUE 1.05544913958615
47 1982 1885676 1794861 3680537      TRUE 1.05059723287764
48 1984 1879490 1789651 3669141      TRUE 1.05019917291137
49 1983 1865553 1773380 3638933      TRUE 1.05197588785258
50 1981 1860272 1768966 3629238      TRUE 1.05161546349675
51 1980 1852616 1759642 3612258      TRUE 1.05283688386615
52 1966 1845862 1760412 3606274      TRUE  1.0485397736439
53 1969 1846572 1753634 3600206      TRUE 1.05299737573519
54 1949 1826352 1733177 3559529      TRUE 1.05375965639978
55 1971 1822910 1733060 3555970      TRUE 1.05184471397413
56 1950 1823555 1730594 3554149      TRUE 1.05371623847072
57 1948 1813852 1721216 3535068      TRUE  1.0538200899829
58 1967 1803388 1717571 3520959      TRUE 1.04996416450907
59 1968 1796326 1705238 3501564      TRUE 1.05341659052871
60 1979 1791267 1703131 3494398      TRUE 1.05174939567185
61 1978 1709394 1623885 3333279      TRUE   1.052657053917
62 1977 1705916 1620716 3326632      TRUE 1.05256935823426
63 1946 1691220 1597452 3288672      TRUE 1.05869847732514
64 1972 1669927 1588484 3258411      TRUE  1.0512708972832
65 1976 1624436 1543352 3167788      TRUE 1.05253759349779
66 1974 1622114 1537844 3159958      TRUE 1.05479749571478
67 1975 1613135 1531063 3144198      TRUE 1.05360458713978
68 1973 1608326 1528639 3136965      TRUE 1.05212937783218
69 1943 1508959 1427901 2936860      TRUE 1.05676724086614
70 1942 1444365 1364631 2808996      TRUE  1.0584289819006
71 1944 1435301 1359499 2794800      TRUE 1.05575730471299
72 1945 1404587 1330869 2735456      TRUE 1.05539087618691
73 1941 1289734 1223693 2513427      TRUE 1.05396860160187
74 1940 1211684 1148715 2360399      TRUE 1.05481690410589
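
The same kind of descending sort can be sketched in base R with order(), shown here as an alternative to arrange and desc rather than as the lab's method. The data frame toy is hypothetical; its three rows are the 1940 to 1942 values from the table above.

```r
# Three rows copied from the table above (years 1940 to 1942)
toy <- data.frame(
  year  = c(1940, 1941, 1942),
  total = c(2360399, 2513427, 2808996)
)

# order() returns the row positions that would sort the column;
# decreasing = TRUE puts the largest value first
row_order <- order(toy$total, decreasing = TRUE)

# Indexing the rows by that ordering gives the sorted data frame
toy_sorted <- toy[row_order, ]
print(toy_sorted)

# The first row now holds the year with the largest total in this slice
top_year <- toy_sorted$year[1]
print(top_year)
```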

Resources for learning R and working in RStudio

That was a short introduction to R and RStudio, but we will provide you with more functions and a more complete sense of the language as the course progresses. You might find the following tips and resources helpful.

  • In this course we will be using the dplyr (for data wrangling) and ggplot2 (for data visualization) packages extensively. If you are googling for R code, make sure to also include these package names in your search query. For example, instead of googling "scatterplot in R", google "scatterplot in R with ggplot2".

  • The following cheatsheets may come in handy throughout the course. Note that some of the code on these cheatsheets may be too advanced for this course; however, the majority of it will become useful as you progress through the course material.

  • While you will get plenty of exercise working with these packages in the labs of this course, if you would like further opportunities to practice we recommend checking out the relevant courses at DataCamp.

This is a derivative of an [OpenIntro](https://www.openintro.org/stat/labs.php) lab, and is released under an [Attribution-NonCommercial-ShareAlike 3.0 United States](https://creativecommons.org/licenses/by-nc-sa/3.0/us/) license.