In [ ]:
library(dplyr)
load(url("http://alizaidi.blob.core.windows.net/training/taxi_df.RData"))
# tbl_df() is deprecated in current dplyr; as_tibble() is its replacement
(taxi_df <- as_tibble(taxi_df))
The goal of this lab is to teach you how to write functions in R that are easy to use and debug.
There are three major components of a function:
In [ ]:
any_function <- function(args1, args2, ...) {
  #
  # BODY
  #
  return(value)
}
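In base R terms, the three components are usually described as the arguments (formals), the body, and the environment in which the function was created. A minimal sketch for inspecting them, using a hypothetical `add_tip` function that is not part of the lab:

```r
# Hypothetical example function, used only for illustration
add_tip <- function(fare, rate = 0.2) {
  fare * (1 + rate)
}

formals(add_tip)      # the argument list: fare (no default) and rate = 0.2
body(add_tip)         # the code between the braces
environment(add_tip)  # the environment where the function was defined

add_tip(10)           # uses the default rate
```

The last expression evaluated is returned automatically, so the explicit `return()` in the template above is a style choice rather than a requirement.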
In [ ]:
tip_lm <- lm(tip_amount ~ trip_distance, data = taxi_df)
summary(tip_lm)
But suppose we now want to calculate this model for a specific pickup_nhood. For instance, let's calculate it for the Upper West Side.
In [ ]:
tip_uws <- lm(tip_amount ~ trip_distance,
              data = filter(taxi_df, pickup_nhood == "Upper West Side"))
Now suppose we want the same model for the Upper East Side. Should we copy and paste the code above and change the last argument? Sure, but that gets annoying if we have to do it more than once. R is lazy, and so am I!
Before you go out and create your function, ask yourself the important questions:
In [ ]:
## Starter code
est_lm_nhood <- function(nhood) {
##
# return(something)
}
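One possible shape for such a function, sketched on the built-in `mtcars` data so that it runs without the taxi download. Here `mpg`, `wt`, and `cyl` stand in for `tip_amount`, `trip_distance`, and `pickup_nhood`, and the name `est_lm_group` is hypothetical. This is an illustration of the pattern, not the lab's reference solution.

```r
# Fit mpg ~ wt on the rows of `df` where cyl equals `group`.
# `df` defaults to mtcars only to keep the sketch self-contained.
est_lm_group <- function(group, df = mtcars) {
  subset_df <- df[df$cyl == group, , drop = FALSE]
  fit <- lm(mpg ~ wt, data = subset_df)
  return(fit)
}

fit_4 <- est_lm_group(4)
fit_6 <- est_lm_group(6)
coef(fit_4)
```

The only thing that changes between calls is the argument, which is exactly the part we were editing by hand when copying and pasting.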
In [ ]:
char_vector <- c("batman", "superman", "magneto", "ironman", "deadpool")
class(char_vector)
If I want to convert this vector to a different type, say a factor, I can use the helper function as.factor.
In [ ]:
(factor_vector <- as.factor(char_vector))
class(factor_vector)
class(as.character(factor_vector))
Principle 1 from Day 1: everything that exists in R is an object. That means that anything that exists in R has some class (maybe many classes!), and if we want to change that class, we can try to find an appropriate as.otherclass function for it.
That means we can even create rather complex objects by simply chaining together simpler functions. For instance, the formula object that is needed in all modeling functions can be created programmatically with character (string) functions:
In [ ]:
dep_var <- "tip_amount"
indep_vars <- c("trip_distance", "passenger_count", "pickup_nhood")
## The paste function pastes objects together using a separator
## It has two important arguments: collapse and sep
## Use collapse when you want to take a vector and paste all its elements into 1 element
## Use sep when you have multiple vectors (or scalars: vectors of length 1) and want to paste them together
(rhs <- paste(indep_vars, collapse = " + "))
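The difference between `collapse` and `sep` is easiest to see side by side. A small sketch, where the `log_` prefix is only an illustrative string and not something the lab uses:

```r
vars <- c("trip_distance", "passenger_count")

# collapse: a single vector is squashed into one string
collapsed <- paste(vars, collapse = " + ")
collapsed
# "trip_distance + passenger_count"

# sep: separate arguments are joined element by element,
# so the result is as long as the longest argument
joined <- paste("log", vars, sep = "_")
joined
# "log_trip_distance" "log_passenger_count"

# both together: join element-wise first, then collapse into one string
both <- paste("log", vars, sep = "_", collapse = " + ")
both
# "log_trip_distance + log_passenger_count"
```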
In [ ]:
make_model <- function(depvar, indepvars) {
# body
# return value
}
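One way to fill in this skeleton, offered as a sketch rather than the lab's official answer: build the right-hand side with `collapse`, attach the dependent variable with `sep`, and convert the string with `as.formula`.

```r
make_model <- function(depvar, indepvars) {
  # "x1 + x2 + ..." from the vector of independent variables
  rhs <- paste(indepvars, collapse = " + ")
  # "y ~ x1 + x2 + ..."
  form_string <- paste(depvar, rhs, sep = " ~ ")
  # convert the character string into a formula object
  return(as.formula(form_string))
}

model_form <- make_model("tip_amount", c("trip_distance", "passenger_count"))
model_form
class(model_form)
```

Because the result is an ordinary formula, it can be passed to `lm` or any other modeling function in place of a hand-typed formula.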
In [ ]:
est_lm_nhood <- function(nhood, model_form) {
## Body
## return value
}
A functional is simply a function that takes another function as one of its arguments. Strictly speaking, a functional should take a function as its primary input and output a single vector/list.
Suppose we want to use our est_lm_nhood function to estimate not one model, but several models for different values of pickup_nhood. We could create a for loop, and iterate over a vector of pickup_nhood values. However, you have probably heard that for loops are signs of weakness.
Instead, you can use the most popular functional in R: lapply.
lapply is actually a very simple functional, and is absolutely worth learning because it makes functional programming in R easy and effective.
lapply works by taking a list and a function as its inputs, and then applies the function to each element of the list.
lapply Example
In [ ]:
summarise_col <- function(colname, df = taxi_df) {
  return(summary(df[[colname]]))
}
lapply(list("tip_amount", "fare_amount"), summarise_col)
# same as
lapply(c("tip_amount", "fare_amount"), summarise_col)
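To see what lapply is doing behind the scenes, it can help to write the equivalent for loop next to it. A minimal sketch on the built-in `mtcars` data (an assumption, chosen so the code runs without taxi_df):

```r
cols <- c("mpg", "wt")

# for-loop version: pre-allocate a list, then fill it one element at a time
loop_out <- vector("list", length(cols))
for (i in seq_along(cols)) {
  loop_out[[i]] <- summary(mtcars[[cols[i]]])
}

# lapply version: the same work without the bookkeeping
lapply_out <- lapply(cols, function(col) summary(mtcars[[col]]))

# compare the two results
all.equal(loop_out, lapply_out)
```

The loop is not wrong; lapply simply handles the allocation and indexing for you and always returns a list of the same length as its input.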
lapply to Estimate Many Models In One Call
In this exercise, take the lapply function, and use est_lm_nhood as the function argument. Make a list/vector argument of neighborhood names.
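The pattern can be sketched on the built-in `mtcars` data, with `cyl` standing in for the neighborhood column. The helper name `fit_by_cyl` and the choice of `mpg ~ wt` are assumptions made for illustration, not the lab's solution.

```r
# Hypothetical helper: fit `model_form` on the rows of `df` where cyl == group
fit_by_cyl <- function(group, model_form, df = mtcars) {
  lm(model_form, data = df[df$cyl == group, , drop = FALSE])
}

groups <- c(4, 6, 8)

# Extra arguments after the function are passed on to every call
models <- lapply(groups, fit_by_cyl, model_form = mpg ~ wt)
names(models) <- paste0("cyl_", groups)

# Pull the slope on wt out of each fitted model
slopes <- vapply(models, function(m) coef(m)[["wt"]], numeric(1))
slopes
```

Naming the list elements makes the output much easier to read, since lapply on an unnamed vector returns an unnamed list.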
Your functions will not be perfect the first time you write them. They will have bugs.
The best way to become a better programmer and human being is to write buggy software and then stay up at night debugging.
Let's say we use an incorrect column name inside of lapply in the chunk above:
In [ ]:
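# misspelled column name: df[[colname]] returns NULL, so summary() reports Length 0
lapply(c("tp_amount", "fare_amount"), summarise_col)
# debugonce() flags the function so the next call opens the interactive debugger
debugonce(summarise_col)
lapply(c("tp_amount", "fare_amount"), summarise_col)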
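Once the debugger has shown that the column lookup silently returns NULL, one option is to make the function fail early with a readable message. A sketch, assuming a hypothetical `summarise_col_safe` variant and the built-in `mtcars` data in place of taxi_df:

```r
summarise_col_safe <- function(colname, df = mtcars) {
  # Fail early with a readable message instead of summarising NULL
  if (!colname %in% names(df)) {
    stop("Column '", colname, "' not found in data frame", call. = FALSE)
  }
  summary(df[[colname]])
}

good <- summarise_col_safe("mpg")
good

# tryCatch lets the misspelled call fail without stopping the script
result <- tryCatch(
  summarise_col_safe("mpgg"),
  error = function(e) conditionMessage(e)
)
result
```

An explicit check like this turns a confusing downstream result into an error that points at the actual mistake.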
If you are eager to stick to the tidyverse of packages, take a look at the purrr package.
The map function in purrr is very similar to the lapply function. The main difference in how I use it is that map fits naturally in a pipeline: given a data.frame as its first argument, it applies a function to each column (a data.frame is a list of columns).
For example, if I wanted summary statistics (the five-number summary plus the mean) for the columns in taxi_df, I can use purrr's map function.
There's also a handy function called keep, which I mainly use as a way of doing select but based on column types rather than names/indices. This way, I can select numeric columns only.
In [ ]:
library(purrr)
map(taxi_df, summary)
taxi_df %>% map(summary)
taxi_df %>% keep(is.numeric) %>% map(summary)
In [ ]:
taxi_df %>% keep(is.numeric) %>% map(mean)
taxi_df %>% keep(is.numeric) %>% map_dbl(mean)
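The last two lines differ only in the return type: `map` always returns a list, while `map_dbl` returns a numeric vector and raises an error if the function does not return a single number for each element. A small sketch on the built-in `iris` data (an assumption, so the code runs without taxi_df; it does require the purrr package to be installed):

```r
library(purrr)

numeric_cols <- keep(iris, is.numeric)     # drops the factor column Species

means_list <- map(numeric_cols, mean)      # a named list, one element per column
means_dbl  <- map_dbl(numeric_cols, mean)  # a named numeric vector

str(means_list)
means_dbl
```

The typed variants are handy when the result feeds straight into further numeric work, because there is no list to unpack afterwards.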