In [ ]:
library(nycflights13)
library(tidyverse)

All notes

5.1 Introduction

  • filter(): pick observations (rows) by their values
  • arrange(): reorder the rows
  • select(): pick variables (columns) by name
  • mutate(): create new variables as functions of existing variables
  • summarise(): collapse many values down to a single summary
  • group_by(): used with any of the above to apply it group by group

All of them work the same way:

  1. The first argument is a data frame.
  2. The subsequent arguments describe what to do with the data frame, using variable names without quotes.
  3. The result is a new data frame.
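
A minimal sketch of that shared pattern, using a small made-up tibble (`toy` and its columns are hypothetical, not part of nycflights13):

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(id = 1:4, grp = c("a", "b", "a", "b"), value = c(10, 20, 30, 40))

# 1. the data frame comes first; 2. columns are named without quotes
big <- filter(toy, value > 15)
ordered <- arrange(toy, desc(value))

# 3. each verb returns a new data frame and leaves the input untouched
nrow(toy)   # 4
nrow(big)   # 3
```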

5.2 filter() notes

  • To assign to a variable and print the result at the same time, wrap the assignment in ()
  • Comparison operators: >, >=, <, <=, !=, ==
    • when comparing floating-point numbers, use near() instead of ==, e.g. near(sqrt(2)^2, 2)
  • Logical operators: &, |, !
    • handy shorthand: x %in% y
  • Almost any operation involving NA returns NA.
  • filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA
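
The points about near(), %in% and NA can be checked with a short sketch. The tibble `df2` is a made-up example, not data from the notes:

```r
library(dplyr)

# Floating-point comparison: == is too strict, near() allows a small tolerance
sqrt(2)^2 == 2       # FALSE
near(sqrt(2)^2, 2)   # TRUE

# Hypothetical data for illustration
df2 <- tibble(x = c(1, 3, 5, 7))

# x %in% c(1, 5) is shorthand for x == 1 | x == 5
in_set <- filter(df2, x %in% c(1, 5))

# Almost any operation involving NA gives NA
NA > 5     # NA
NA + 10    # NA
NA == NA   # NA
```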

5.3 arrange() notes

  • arrange() changes the order of the rows.
  • use desc() to sort a column in descending order.
  • NAs are always sorted at the end, whether ascending or descending
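
A small sketch of the sorting behaviour, on a made-up one-column tibble (`df3` is hypothetical):

```r
library(dplyr)

# Hypothetical data with one missing value
df3 <- tibble(x = c(5, 2, NA, 8))

asc <- arrange(df3, x)         # x: 2, 5, 8, NA
dsc <- arrange(df3, desc(x))   # x: 8, 5, 2, NA  (NA still last)
```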

5.4 select() notes

  • Select a subset of the variables
  • There are several helper functions
    • e.g. starts_with() and matches(), see ?select_helpers
  • rename() changes a variable name and keeps all the other variables
  • the everything() helper is handy for moving a variable to the front
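
As a rough illustration of these helpers, the sketch below uses a one-row tibble whose column names mimic flights (the values and the new name `airline` are made up):

```r
library(dplyr)

# Hypothetical one-row data frame with flights-like column names
toy <- tibble(dep_time = 517, dep_delay = 2, arr_time = 830,
              arr_delay = 11, carrier = "UA")

dep_cols   <- select(toy, starts_with("dep"))    # dep_time, dep_delay
delay_cols <- select(toy, ends_with("delay"))    # dep_delay, arr_delay

# rename() keeps every column and only changes the name
renamed <- rename(toy, airline = carrier)

# everything() picks up the remaining columns, so carrier moves to the front
front <- select(toy, carrier, everything())
```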

5.5 mutate() notes

  • mutate() can refer to variables just created in the same call
  • transmute() keeps only the new variables
  • Useful creation functions: they take vectors in and produce vectors of the same length
    • Arithmetic operators: +, -, *, /, ^
    • Arithmetic operators with aggregate functions: x / sum(x)
    • Modular arithmetic: %/% (integer division), %% (remainder)
    • logs: log(), log2(), log10()
    • offsets: lead() and lag(), useful for computing running differences or finding when values change.
    • cumulative functions: cumsum(), cumprod(), cummin(), cummax(), cummean()
    • ranking functions: min_rank() etc., see ?min_rank
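
Several of these creation functions can be combined in one mutate() call. The sketch below runs on made-up values; all of the new column names are hypothetical:

```r
library(dplyr)

# Hypothetical data: dep_time in HHMM format plus an arbitrary numeric column
toy <- tibble(dep_time = c(517, 1345, 2359), x = c(3, 1, 4))

out <- mutate(toy,
  hour    = dep_time %/% 100,    # integer division: 5, 13, 23
  minute  = dep_time %% 100,     # remainder: 17, 45, 59
  since_midnight = hour * 60 + minute,  # uses columns created just above
  share   = x / sum(x),          # arithmetic with an aggregate
  change  = x - lag(x),          # running difference: NA, -2, 3
  running = cumsum(x),           # cumulative sum: 3, 4, 8
  rnk     = min_rank(x)          # smallest value gets rank 1: 2, 1, 3
)

# transmute() would return only the newly created columns
only_new <- transmute(toy, hour = dep_time %/% 100)
```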

5.6 summarise() notes

  • collapses a data frame to a single row
  • Much more useful together with group_by()
  • The pipe %>% makes code much easier to write and read
    • x %>% f(y) %>% g(z) is equivalent to g(f(x, y), z)
    • example:
delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")
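  • Missing values: aggregation functions such as mean() take na.rm = TRUE to drop NAs before computing.
    • Alternatively, remove the rows with missing values first using filter()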
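
The two ways of handling missing values can be compared on a small made-up tibble. This is only a sketch: `toy` and its values are hypothetical, and the two approaches agree on the mean here but would differ in anything that counts rows, such as n().

```r
library(dplyr)

# Hypothetical data with one missing arrival delay
toy <- tibble(dest = c("A", "A", "B", "B"),
              arr_delay = c(10, NA, 20, 40))

# Option 1: drop NAs inside the summary function
by_narm <- toy %>%
  group_by(dest) %>%
  summarise(delay = mean(arr_delay, na.rm = TRUE))

# Option 2: filter out the rows with NA first, then summarise
by_filter <- toy %>%
  filter(!is.na(arr_delay)) %>%
  group_by(dest) %>%
  summarise(delay = mean(arr_delay))
```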

5.2 filter()


In [ ]:
flights

In [ ]:
(jan1 <- filter(flights, month == 1, day == 1))

In [ ]:
filter(flights, arr_delay <= 120 & dep_delay <= 120)

In [ ]:
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)

In [ ]:
filter(df, is.na(x) | x > 1)

5.2.1 Exercise


In [ ]:
# 1.1 Had an arrival delay of two or more hours (arr_delay is in minutes)
filter(flights, arr_delay >= 120)

In [ ]:
# Flew to Houston (IAH or HOU)
filter(flights, dest %in% c("IAH", "HOU"))

In [ ]:
# Were operated by United, American, or Delta
filter(flights, carrier %in% c("UA", "AA", "DL"))

In [ ]:
# Departed in summer (July, August, and September)
filter(flights, month %in% c(7, 8, 9))

In [ ]:
# Flights with a missing dep_time
filter(flights, is.na(dep_time))

In [ ]:
# Departed at least an hour late but arrived no more than 30 minutes late
filter(flights, dep_delay >= 60 & arr_delay <= 30)

In [ ]:
# Departed between midnight and 6am (inclusive); midnight is recorded as 2400, not 0
filter(flights, dep_time <= 600 | dep_time == 2400)