In [ ]:

    
library(nycflights13)
library(tidyverse)

All notes

5.1 Introduction

filter, pick observations by their values
arrange, reorder the rows
select, pick variables by name
mutate, create new variables as functions of existing variables
summarise, collapse many values down to a single summary
group_by, used together with any of the above to apply it group by group

All of these verbs work the same way:

The first argument is a data frame.
The subsequent arguments describe what to do with the data frame, using variable names without quotes.
The result is a new data frame.
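
A minimal sketch of that shared calling pattern. The `toy` data frame and its values are made up for illustration and are not part of nycflights13.

```r
library(dplyr)

# Hypothetical toy data, not taken from nycflights13
toy <- tibble(
  carrier = c("UA", "AA", "UA", "DL"),
  delay   = c(10, 45, 5, 30)
)

# Every verb: data frame first, then unquoted column names
late    <- filter(toy, delay > 20)
sorted  <- arrange(toy, delay)
by_carr <- summarise(group_by(toy, carrier), mean_delay = mean(delay))

# The input is never modified; each call returns a new data frame
nrow(toy)
```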

5.2 filter() notes

To assign the result to a variable and print it at the same time, wrap the assignment in parentheses.
Comparison operators: >, >=, <, <=, !=, ==
- when comparing floating-point numbers, use near() instead of ==, e.g. near(sqrt(2)^2, 2)
Logical operators: &, |, !
- handy shorthand: x %in% y
Almost any operation involving NA returns NA.
filter() only keeps rows where the condition is TRUE; it drops rows where the condition is FALSE or NA.
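
A short sketch of the three points above: `near()` for floating-point comparison, `%in%` as shorthand for chained `|`, and how `filter()` treats NA. The `toy` tibble is a hypothetical stand-in for `flights`.

```r
library(dplyr)

# Floating-point arithmetic makes == unreliable
exact  <- sqrt(2)^2 == 2       # FALSE
approx <- near(sqrt(2)^2, 2)   # TRUE

# Hypothetical toy data with one missing month
toy <- tibble(month = c(1, 11, 12, NA))

# %in% is shorthand for month == 11 | month == 12
nov_dec <- filter(toy, month %in% c(11, 12))

# NA > 5 evaluates to NA, so filter() drops that row
gt5 <- filter(toy, month > 5)
```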

5.3 arrange() notes

arrange() changes the order of the rows.
Use desc() to sort a column in descending order.
NA is always sorted to the end, whether ascending or descending.
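
A minimal sketch of the NA behaviour, using a made-up single-column tibble.

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(x = c(5, 2, NA))

asc <- arrange(toy, x)        # 2, 5, NA
dsc <- arrange(toy, desc(x))  # 5, 2, NA (NA still last)
```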

5.4 select() notes

select() picks a subset of variables.
There are several helper functions:
- e.g. starts_with() and matches(), see ?select_helpers
rename() changes a variable's name and keeps all other variables.
The everything() helper is useful for moving a variable to the front: select(df, var, everything())
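
A small sketch of the helpers mentioned above. The one-row `toy` tibble borrows column names from `flights`, but its values are made up.

```r
library(dplyr)

# Hypothetical one-row data frame with flights-like column names
toy <- tibble(dep_time = 517, dep_delay = 2, arr_time = 830, carrier = "UA")

# Helper: keep every column whose name starts with "dep"
dep_cols <- select(toy, starts_with("dep"))

# rename() takes new_name = old_name and keeps the other columns
renamed <- rename(toy, airline = carrier)

# everything() moves carrier to the front and keeps the rest in order
reordered <- select(toy, carrier, everything())
```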

5.5 mutate() notes

mutate() can refer to variables created earlier in the same call.
transmute() keeps only the new variables.
Useful creation functions take vectors in and return vectors of the same length:
- Arithmetic operators: +, -, *, /, ^
- Arithmetic operators combined with aggregate functions: x / sum(x)
- Modular arithmetic: %/% (integer division), %% (remainder)
- Logs: log(), log2(), log10()
- Offsets: lead() / lag(), useful for computing running differences or finding where a value changes.
- Cumulative functions: cumsum(), cumprod(), cummin(), cummax(), cummean()
- Ranking functions: min_rank() etc., see ?min_rank
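
A sketch that puts several of these creation functions into one mutate() call. The data and the new column names (`hour`, `minute`, `since_midnight`, `change`, `running`, `share`, `rnk`) are illustrative assumptions.

```r
library(dplyr)

# Hypothetical toy data: dep_time in HHMM format plus an arbitrary numeric x
toy <- tibble(dep_time = c(517, 1345, 2359), x = c(10, 30, 20))

out <- mutate(toy,
  hour           = dep_time %/% 100,    # integer division: 5, 13, 23
  minute         = dep_time %% 100,     # remainder: 17, 45, 59
  since_midnight = hour * 60 + minute,  # uses columns created just above
  change         = x - lag(x),          # running difference, first is NA
  running        = cumsum(x),           # cumulative sum
  share          = x / sum(x),          # arithmetic with an aggregate
  rnk            = min_rank(x)          # smallest value gets rank 1
)

# transmute() returns only the newly created column
only_new <- transmute(toy, hour = dep_time %/% 100)
```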

5.6 summarise() notes

summarise() collapses a data frame to a single row.
It is much more useful together with group_by(), which gives one row per group.
Using the pipe %>% makes the code easier to write and read.
- x %>% f(y) %>% g(z) is equivalent to g(f(x, y), z)
- example:

delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")

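Missing values: aggregate functions such as mean() take na.rm = TRUE to drop NA before computing.
- Alternatively, remove the rows with missing values using filter() before summarising.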
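
A minimal sketch of both ways to handle missing values in a grouped summary, assuming the filter-based alternative means dropping NA rows before summarising. The `toy` data is hypothetical.

```r
library(dplyr)

# Hypothetical toy data with one missing arrival delay
toy <- tibble(
  dest      = c("IAH", "IAH", "MIA"),
  arr_delay = c(10, NA, 30)
)

# Option 1: let mean() drop the NA
with_na_rm <- toy %>%
  group_by(dest) %>%
  summarise(delay = mean(arr_delay, na.rm = TRUE))

# Option 2 (assumed reading): drop NA rows first, then summarise
pre_filtered <- toy %>%
  filter(!is.na(arr_delay)) %>%
  group_by(dest) %>%
  summarise(delay = mean(arr_delay))
```

Both give the same means here; filtering first has the side effect that counts from n() would also exclude the NA rows.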

5.2 filter()



In [ ]:

    
flights



In [ ]:

    
(jan1 <- filter(flights, month == 1, day == 1))



In [ ]:

    
filter(flights, arr_delay <= 120 & dep_delay <= 120)



In [ ]:

    
df <- tibble(x = c(1, NA, 3))
filter(df, x > 1)



In [ ]:

    
filter(df, is.na(x) | x > 1)

5.2.1 Exercise



In [ ]:

    
# 1.1 Had an arrival delay of two or more hours
# arr_delay is recorded in minutes, so two hours is 120
filter(flights, arr_delay >= 120)



In [ ]:

    
# Flew to Houston (IAH or HOU)
filter(flights, dest %in% c("IAH", "HOU"))



In [ ]:

    
# Were operated by United, American, or Delta
filter(flights, carrier %in% c("UA", "AA", "DL"))



In [ ]:

    
# Departed in summer (July, August, and September)
filter(flights, month %in% c(7, 8, 9))



In [ ]:

    
# Flights with a missing dep_time
filter(flights, is.na(dep_time))



In [ ]:

    
# Departed at least an hour late but arrived no more than 30 minutes late
filter(flights, dep_delay >= 60 & arr_delay <= 30)



In [ ]:

    
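# Departed between midnight and 6am (inclusive)
# dep_time is HHMM and midnight is recorded as 2400, not 0
filter(flights, dep_time <= 600 | dep_time == 2400)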