Lab 1 - dplyr examples


In [ ]:
options(jupyter.rich_display=FALSE)

In [ ]:
library(dplyr)
library(stringr)
taxi_url <- "http://alizaidi.blob.core.windows.net/training/taxi_df.rds"
taxi_df  <- readRDS(gzcon(url(taxi_url)))

In [ ]:
ls()

In [ ]:
class(taxi_df)

In [ ]:
taxi_df <- taxi_df %>% mutate(tip_pct = tip_amount/fare_amount)

In [ ]:
taxi_df

Exploratory Data Analysis - Data Validation

Let's look at the numeric fields tip_amount and fare_amount and see whether we can spot any outliers. How should we deal with them?


In [ ]:
## Some useful functions

# summary
# quantile
# ggplot() + geom_histogram
# ggplot() + geom_density
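As a starting point, the functions listed above could be combined along the following lines. This is only a sketch on a small, hypothetical data frame (`toy_df`, with invented values) that borrows the column names of `taxi_df`; the rule of dropping non-positive fares is one possible choice, not the lab's prescribed answer.

```r
library(dplyr)

# Hypothetical toy data standing in for taxi_df; values are invented for illustration
toy_df <- data.frame(
  fare_amount = c(5.5, 12.0, 8.0, 0.0, 52.0, 9.5),
  tip_amount  = c(1.0, 2.4, 0.0, 3.0, 10.4, 1.9)
)

# Min, quartiles, mean and max for each numeric column
summary(toy_df)

# Upper-tail quantiles help locate extreme values
quantile(toy_df$tip_amount, probs = c(0.5, 0.9, 0.99))

# One possible rule (an assumption): keep only positive fares so that
# tip_amount / fare_amount is always finite
clean_df <- toy_df %>%
  filter(fare_amount > 0) %>%
  mutate(tip_pct = tip_amount / fare_amount)

clean_df
```

A zero fare would otherwise turn `tip_pct` into `Inf` or `NaN`, which is worth checking before fitting any model on that column.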

Summarize data by payment type

There is a payment_type column that labels the type of payment used for the taxi ride. Let's see if we can find anything strange about the various payment types.


In [ ]:
## some useful functions
# group_by(payment_type) %>% summarise(mean(tip_amount))
# ggplot() + facet_wrap(~payment_type)
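The grouped summary hinted at above might look like the sketch below. The data frame `toy_df` and its payment labels ("card", "cash") are hypothetical; the actual values stored in `payment_type` may differ.

```r
library(dplyr)

# Hypothetical toy data; payment labels and amounts are invented for illustration
toy_df <- data.frame(
  payment_type = c("card", "card", "cash", "cash", "card"),
  fare_amount  = c(10, 20, 8, 12, 30),
  tip_amount   = c(2, 4, 0, 0, 6),
  stringsAsFactors = FALSE
)

# One row per payment type, with a trip count and tip statistics
by_payment <- toy_df %>%
  group_by(payment_type) %>%
  summarise(
    n_trips  = n(),
    mean_tip = mean(tip_amount),
    max_tip  = max(tip_amount)
  )

by_payment
```

A group whose tips are all zero, as with the invented cash rows here, is the kind of pattern worth investigating in the real data.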

Two-table joins

Let's see examples of the two-table functions in dplyr.


In [ ]:
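library(broom)

# Fit one linear model per drop-off day of week on a random sample of 100,000 trips,
# and keep the confidence interval of each coefficient (identified by term)
taxi_coefs <- taxi_df %>% sample_n(10^5) %>%
  group_by(dropoff_dow) %>%
  do(tidy(lm(tip_pct ~ pickup_nhood + passenger_count + pickup_hour,
     data = .), conf.int = TRUE)) %>% select(dropoff_dow, term, conf.low, conf.high)

# One row of model-level metrics per drop-off day of week.
# Note: this pipeline draws its own random sample, separate from the one above
taxi_metrics <- taxi_df %>% sample_n(10^5) %>%
  group_by(dropoff_dow) %>%
  do(glance(lm(tip_pct ~ pickup_nhood + passenger_count + pickup_hour,
     data = .)))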

Use the left_join function in dplyr to append the model metrics to the coefficients.
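To illustrate the mechanics of such a join, here is a minimal sketch on two hypothetical tables, `toy_coefs` and `toy_metrics`, that mimic the shape of the coefficient and metric tables (several coefficient rows per day, one metric row per day). All values are invented.

```r
library(dplyr)

# Hypothetical stand-in for the coefficient table: several rows per day
toy_coefs <- data.frame(
  dropoff_dow = c("Mon", "Mon", "Tue", "Tue"),
  conf.low    = c(0.10, -0.02, 0.12, -0.01),
  conf.high   = c(0.20, 0.03, 0.22, 0.04),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the metrics table: one row per day
toy_metrics <- data.frame(
  dropoff_dow = c("Mon", "Tue"),
  r.squared   = c(0.05, 0.07),
  stringsAsFactors = FALSE
)

# left_join keeps every row of the left table and repeats the matching
# metrics row for each coefficient of the same day
joined <- left_join(toy_coefs, toy_metrics, by = "dropoff_dow")

joined
```

Because the join key is unique in the right-hand table, the result has exactly as many rows as the left-hand table.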

tidyr

The tidyr package is very handy for reshaping data between wide and tall layouts.

Take a look at the tidyr cheatsheet and try to convert the taxi_coefs data from tall to wide.
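A tall-to-wide conversion could be sketched as follows. The table `toy_coefs` is hypothetical and assumes the coefficient data retains the `term` and `estimate` columns that `broom::tidy` produces; the numbers are invented. The sketch uses `spread()`, and newer tidyr releases also offer `pivot_wider()` for the same task.

```r
library(tidyr)

# Hypothetical tall table: one row per (day, term) pair
toy_coefs <- data.frame(
  dropoff_dow = c("Mon", "Mon", "Tue", "Tue"),
  term        = c("passenger_count", "pickup_hour",
                  "passenger_count", "pickup_hour"),
  estimate    = c(0.010, 0.002, 0.012, 0.003),
  stringsAsFactors = FALSE
)

# spread() turns each distinct term into its own column,
# leaving one row per day
wide_coefs <- spread(toy_coefs, key = term, value = estimate)

wide_coefs
```

Each (day, term) pair must be unique for this reshape to work, which is why the identifying `term` column has to be present in the tall table.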


In [ ]: