In [ ]:
options(jupyter.rich_display=FALSE)
In [ ]:
library(dplyr)
library(stringr)
taxi_url <- "http://alizaidi.blob.core.windows.net/training/taxi_df.rds"
taxi_df <- readRDS(gzcon(url(taxi_url)))
In [ ]:
ls()
In [ ]:
class(taxi_df)
In [ ]:
taxi_df <- taxi_df %>% mutate(tip_pct = tip_amount/fare_amount)
In [ ]:
taxi_df
In [ ]:
## Some useful functions
# summary
# quantile
# ggplot() + geom_histogram
# ggplot() + geom_density
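One way to use the functions hinted at above is sketched below. The toy `tips` data frame and its values are made up for illustration; with the real data you would use `taxi_df` and its `tip_pct` column instead.

```r
library(ggplot2)

# Hypothetical toy data standing in for taxi_df
tips <- data.frame(tip_pct = c(0, 0.10, 0.15, 0.18, 0.20, 0.20, 0.25, 0.30))

# Minimum, quartiles, mean and maximum in one call
tip_summary <- summary(tips$tip_pct)

# Selected quantiles; na.rm = TRUE guards against missing values
tip_quantiles <- quantile(tips$tip_pct, probs = c(0.05, 0.5, 0.95), na.rm = TRUE)

# Histogram and density plot objects for the same column
# (print them, e.g. print(p_hist), to draw the plots)
p_hist <- ggplot(tips, aes(x = tip_pct)) + geom_histogram(bins = 10)
p_dens <- ggplot(tips, aes(x = tip_pct)) + geom_density()
```

Comparing the histogram with the density plot is a quick way to spot spikes at round tip percentages that a summary table would hide.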
In [ ]:
## Some useful functions
# group_by(payment_type) %>% summarise(mean(tip_amount))
# ggplot() + facet_wrap(~payment_type)
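As a minimal sketch of the grouped summary hinted at above, the example below uses a small made-up `trips` data frame in place of `taxi_df`. The column names `mean_tip` and `n_trips` are illustrative choices, not names from the original notebook.

```r
library(dplyr)

# Hypothetical toy data standing in for taxi_df
trips <- data.frame(
  payment_type = c("card", "card", "card", "cash", "cash"),
  tip_amount   = c(2, 3, 4, 0, 1)
)

# summarise() collapses each group to one row, so it needs an
# aggregate such as mean() rather than a bare column name
tip_by_payment <- trips %>%
  group_by(payment_type) %>%
  summarise(mean_tip = mean(tip_amount), n_trips = n())
```

The same grouping variable can then be passed to `facet_wrap(~payment_type)` to draw one panel per payment type.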
In [ ]:
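library(broom)
# Draw one sample so the coefficients and the metrics describe the same fits
taxi_sample <- taxi_df %>% sample_n(10^5)
# One row per model term per day of week, with confidence intervals;
# keep `term` so each interval can be matched to its coefficient
taxi_coefs <- taxi_sample %>%
  group_by(dropoff_dow) %>%
  do(tidy(lm(tip_pct ~ pickup_nhood + passenger_count + pickup_hour,
             data = .), conf.int = TRUE)) %>%
  select(dropoff_dow, term, estimate, conf.low, conf.high)
# One row of fit metrics per day of week
taxi_metrics <- taxi_sample %>%
  group_by(dropoff_dow) %>%
  do(glance(lm(tip_pct ~ pickup_nhood + passenger_count + pickup_hour,
               data = .)))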
Use the `left_join` function in `dplyr` to append the model metrics to the coefficients.
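A minimal sketch of such a join is shown below, assuming both tables share a `dropoff_dow` key. The `coefs` and `metrics` data frames and all their values are made-up stand-ins for `taxi_coefs` and `taxi_metrics`.

```r
library(dplyr)

# Hypothetical stand-in for taxi_coefs: several rows per day of week
coefs <- data.frame(
  dropoff_dow = c("Mon", "Mon", "Tue", "Tue"),
  term        = c("(Intercept)", "passenger_count",
                  "(Intercept)", "passenger_count"),
  conf.low    = c(0.10, -0.02, 0.12, -0.01),
  conf.high   = c(0.20, 0.01, 0.22, 0.02)
)

# Hypothetical stand-in for taxi_metrics: one row per day of week
metrics <- data.frame(
  dropoff_dow = c("Mon", "Tue"),
  r.squared   = c(0.03, 0.04)
)

# Every coefficient row picks up the metrics of its day's model
coefs_with_metrics <- left_join(coefs, metrics, by = "dropoff_dow")
```

Because `left_join` keeps every row of its first argument, the result has as many rows as the coefficient table, with the per-model metrics repeated for each term.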
tidyr
The `tidyr` package is handy for reshaping data between wide and tall layouts.
Take a look at the `tidyr` cheatsheet and try to convert the `taxi_coefs` data from tall to wide.
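One possible approach is sketched below with `pivot_wider`, which assumes a recent `tidyr` (version 1.0.0 or later); older versions offer `spread` for the same job. The `coefs` data frame and its values are made up for illustration and stand in for a coefficient table with one row per term.

```r
library(tidyr)

# Hypothetical tall table: one row per (day of week, term) pair
coefs <- data.frame(
  dropoff_dow = c("Mon", "Mon", "Tue", "Tue"),
  term        = c("intercept", "passenger_count",
                  "intercept", "passenger_count"),
  estimate    = c(0.15, -0.005, 0.17, 0.004)
)

# Wide table: one row per day of week, one column per term
coefs_wide <- pivot_wider(coefs, names_from = term, values_from = estimate)
```

The wide layout only works when the table has a column identifying each coefficient, since those identifiers become the new column names.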
In [ ]: