## Lab 1 - `dplyr` examples

``````

options(jupyter.rich_display=FALSE)

``````
``````

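library(dplyr)
library(stringr)
taxi_url <- "http://alizaidi.blob.core.windows.net/training/taxi_df.rds"
# Read the serialized data frame from the URL; gzcon() decompresses the stream for readRDS()
taxi_df <- readRDS(gzcon(url(taxi_url)))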

``````
``````

ls()

``````
``````

class(taxi_df)

``````
``````

taxi_df <- taxi_df %>% mutate(tip_pct = tip_amount/fare_amount)

``````
``````

taxi_df

``````

## Exploratory Data Analysis - Data Validation

Let's examine the numeric fields `tip_amount` and `fare_amount` and see whether we can spot any outliers. How should we deal with them?

``````

## Some useful functions

# summary
# quantile
# ggplot() + geom_histogram
# ggplot() + geom_density

``````
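As a starting point, the functions listed above can be combined roughly as follows. This is only a sketch on a small made-up data frame (`toy_df`) that borrows the column names from `taxi_df`; the values and the `fare_cap` cutoff are assumptions chosen for illustration, not recommendations from the lab.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for taxi_df; the values are invented for illustration
toy_df <- data.frame(
  fare_amount = c(5.5, 12.0, 52.0, 8.0, 0.0, 300.0),
  tip_amount  = c(1.0, 2.5, 10.0, 0.0, 0.0, 150.0)
)

# Minimum, quartiles, mean and maximum of each numeric column
summary(toy_df)

# Upper quantiles show how far the extreme values sit from the bulk of the data
fare_q <- quantile(toy_df$fare_amount, probs = c(0.5, 0.95, 0.99))

# One possible rule: keep rides with a positive fare at or below a chosen cutoff
fare_cap <- 100  # assumed cutoff, not taken from the lab
clean_df <- toy_df %>%
  filter(fare_amount > 0, fare_amount <= fare_cap)

# Histogram of the retained fares (print p to draw it)
p <- ggplot(clean_df, aes(x = fare_amount)) +
  geom_histogram(bins = 10)
```

Filtering is only one way to handle outliers; capping extreme values or flagging them in a new column are alternatives worth comparing on the real data.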

## Summarize data by payment type

The `payment_type` column is a label for the type of payment used for the taxi ride. Let's see if anything looks strange across the various payment types.

``````

## Some useful functions
# group_by(payment_type) %>% summarise(mean(tip_amount))
# ggplot() + facet_wrap(~payment_type)

``````
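A grouped summary along these lines might look like the sketch below. The data frame `toy_df` and its `payment_type` labels (`"card"`, `"cash"`) are hypothetical; the real column may use different codes, so check its distinct values first.

```r
library(dplyr)

# Hypothetical stand-in for taxi_df; labels and values are invented
toy_df <- tibble(
  payment_type = c("card", "card", "cash", "cash", "card"),
  fare_amount  = c(10, 20, 10, 30, 40),
  tip_amount   = c(2, 4, 0, 0, 10)
)

# One row per payment type with a few summary statistics
by_payment <- toy_df %>%
  group_by(payment_type) %>%
  summarise(
    n_rides  = n(),
    avg_tip  = mean(tip_amount),
    zero_tip = mean(tip_amount == 0)  # share of rides with no recorded tip
  )

by_payment
```

A payment type whose `zero_tip` share is close to 1 would be a candidate for closer inspection before modeling `tip_pct`.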

## Two-table joins

Let's see examples of the two-table functions in `dplyr`.

``````

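library(broom)

# Fit one linear model per drop-off day on a random sample and keep the
# confidence interval of every coefficient (term identifies the coefficient)
taxi_coefs <- taxi_df %>%
  sample_n(10^5) %>%
  group_by(dropoff_dow) %>%
  do(tidy(lm(tip_pct ~ pickup_nhood + passenger_count + pickup_hour,
             data = .), conf.int = TRUE)) %>%
  select(dropoff_dow, term, conf.low, conf.high)

# Fit the same model on a second random sample and keep one row of
# model-level metrics per drop-off day
taxi_metrics <- taxi_df %>%
  sample_n(10^5) %>%
  group_by(dropoff_dow) %>%
  do(glance(lm(tip_pct ~ pickup_nhood + passenger_count + pickup_hour,
               data = .)))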

``````

Use the `left_join` function in `dplyr` to append the model metrics to the coefficients, joining on `dropoff_dow`.
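The join can be sketched on two small hand-made tables that mimic the shape of the coefficient and metric tables. `toy_coefs`, `toy_metrics` and all their values are hypothetical; only the key column name `dropoff_dow` comes from the lab.

```r
library(dplyr)

# Hypothetical coefficient table: several rows per day, one per model term
toy_coefs <- tibble(
  dropoff_dow = c("Mon", "Mon", "Tue", "Tue"),
  term        = c("(Intercept)", "passenger_count",
                  "(Intercept)", "passenger_count"),
  conf.low    = c(0.10, -0.02, 0.12, -0.01),
  conf.high   = c(0.20, 0.01, 0.22, 0.03)
)

# Hypothetical metrics table: exactly one row per day
toy_metrics <- tibble(
  dropoff_dow = c("Mon", "Tue"),
  r.squared   = c(0.05, 0.07)
)

# Every coefficient row is kept; the metrics of its day are repeated alongside it
coefs_with_metrics <- left_join(toy_coefs, toy_metrics, by = "dropoff_dow")

coefs_with_metrics
```

Because the metrics table has one row per key, the result has the same number of rows as the coefficient table. Naming the key with `by =` avoids the message `dplyr` prints when it has to guess the join columns.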

## `tidyr`

The `tidyr` package is handy for reshaping data between wide and tall layouts.

Take a look at the `tidyr` cheatsheet and try to convert the `taxi_coefs` data from tall to wide.

``````

``````
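For reference, a tall-to-wide reshape could look like the sketch below. It uses a hypothetical `toy_coefs` table with invented values, and it assumes `tidyr` 1.0.0 or later for `pivot_wider()`; on older versions `spread()` plays the same role.

```r
library(dplyr)
library(tidyr)

# Hypothetical tall table: one row per (day, term) pair
toy_coefs <- tibble(
  dropoff_dow = c("Mon", "Mon", "Tue", "Tue"),
  term        = c("(Intercept)", "passenger_count",
                  "(Intercept)", "passenger_count"),
  conf.low    = c(0.10, -0.02, 0.12, -0.01)
)

# Wide table: one row per day, one column per term holding its lower bound
wide_coefs <- toy_coefs %>%
  pivot_wider(names_from = term, values_from = conf.low)

wide_coefs
```

On a grouped table such as the output of `group_by() %>% do()`, calling `ungroup()` before reshaping keeps the grouping from carrying over into the wide result.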