In [4]:
knitr::opts_chunk$set(warning=FALSE, message=FALSE, fig.align = 'center')
In [5]:
options(jupiter.rich_display=FALSE)
dplyr
SparkR
, and the sparklyr
packageSpark
and Microsoft RR follows the Unix philosophy
Everything that exist in R is an *object*
Everything that happens in R is a *function call*
R was born to *interface*
_—John Chambers_
dplyr
, tidyr
, data.table
ggplot2
, ggvis
, htmlwidgets
, shiny
haven
, RODBC
, readr
, foreign
magrittr
, rmarkdown
, caret
Symbol | Meaning |
---|---|
<- |
assignment operator |
> |
ready for a new command |
+ |
awaiting the completion of an existing command |
? |
get help for following function |
Can change options either permanently at startup (see ?Startup
) or manually at each session with the options
function, options(repos = " ")
for example.
Check your CRAN mirror with getOption("repos")
.
"Bad programmers worry about the code. Good programmers worry about data structures and their relationships."
- Linus Torvalds
Homogeneous | Heterogeneous | |
---|---|---|
1d | Atomic vector | List |
2d | Matrix | Data frame |
nd | Array |
logical
(boolean). Values: TRUE
| FALSE
integer
double
(often called numeric)character
complex
raw
c
: c(1, 4, 1, 3)
list
: list(1, 'hi', data.frame(1:10, letters[1:10]))
[ ]
[[ ]]
class
(higher level representation)typeof
(lower level representation)dplyr
PackageRather than describing the nitty gritty details of writing R code, I'd like you to get started at immediately writing R code.
As most of you are data scientists/data enthusiasts, I will showcase one of the most useful data manipulation packages in R, dplyr
.
At the end of this session, you will have learned:
dplyr
using a very intuitive "grammar"dplyr
to perform common exploratory analysis data manipulation proceduresdplyr
with mutate()
, summarise()
and do()
dplyr
is currently the most downloaded package from CRANdplyr
makes data manipulation easier by providing a few functions for the most common tasks and proceduresdplyr
achieves remarkable speed-up gains by using a C++ backenddplyr
has multiple backends for working with data stored in various sources: SQLite, MySQL, bigquery, SQL Server, and many moredplyr
was inspired to give data manipulation a simple, cohesive grammar (similar philosophy to ggplot
- grammar of graphics)dplyr
has inspired many new packages, which now adopt it's easy to understand syntax.dplyrXdf
and SparkR/sparklyr
brings much of the same functionality of dplyr
to XDF
s data and Spark DataFrames
filter
: select rows based on matching criteria
slice
: select rows by number
select
: select columns by column names
arrange
: reorder rows by column values
mutate
: add new variables based on transformations of existing variables
transmute
: transform and drop other variables
group_by
: identify grouping variables for calculating groupwise summary statistics
count
: count the number of records per group
summarise
| summarize
: calculate one or more summary functions per group, returning one row of results per group (or one for the entire dataset)
saveRDS
function; read an rds object with the readRDS
objectdplyr
includes a wrapper called tbl_df
that adds an additional class attribute onto data.frames
that provides some better data manipulation aesthetics (there's now a dedicated package tibble
for this wrapper and it's class)tbl_df
and data.frame
s is the console output: tbl_df
s will only print what the current R console window can displayoptions(dplyr.width = Inf)
In [6]:
library(dplyr)
library(stringr)
taxi_url <- "http://alizaidi.blob.core.windows.net/training/taxi_df.rds"
taxi_df <- readRDS(gzcon(url(taxi_url)))
(taxi_df <- tbl_df(taxi_df))
In [7]:
class(taxi_df)
dplyr
makes subsetting by rows very easyfilter
verb takes conditions for filtering rows based on conditionsdplyr
function uses a data.frame/tbl as it's first argument
In [8]:
filter(taxi_df,
dropoff_dow %in% c("Fri", "Sat", "Sun"),
tip_amount > 1)
In [9]:
library(stringr)
#table(taxi_df$pickup_nhood)
#harlem_pickups <- filter(taxi_df, str_detect(pickup_nhood, "Harlem")) #str_detect: equivalent of 'Like' in SQL
harlem_pickups <- filter(taxi_df, pickup_nhood == "Harlem" | pickup_nhood == "East Harlem")
nrow(harlem_pickups)
#findistr_dropoffs <- filter(harlem_pickups, str_detect(dropoff_nhood, "Financial District"))
findistr_dropoffs <- filter(harlem_pickups, dropoff_nhood == "Financial District")
nrow(findistr_dropoffs)
# or all together (without creating additional objects == memory)
nrow(filter(taxi_df, str_detect(pickup_nhood, "Harlem"), dropoff_nhood == "Financial District"))
select()
verb to specify which columns of a dataset you wantkeep
option in SAS's data step.:
to select all the columns between two variables (inclusive)contains
to take any columns containing a certain word/phrase/character
In [10]:
select(taxi_df, pickup_nhood, dropoff_nhood,
fare_amount, dropoff_hour, trip_distance)
starts_with(x, ignore.case = FALSE)
: name starts with x
ends_with(x, ignore.case = FALSE)
: name ends with x
matches(x, ignore.case = FALSE)
: selects all variables whose name matches the regular expression x
num_range("V", 1:5, width = 1)
: selects all variables (numerically) from V1
to V5
.
-
to drop variables.arrange()
verbdesc
function to sort in descending order rather than ascending order (default)
In [11]:
#select(arrange(taxi_df, desc(fare_amount), pickup_nhood),
# fare_amount, pickup_nhood)
head(select(arrange(taxi_df, desc(fare_amount), pickup_nhood, dropoff_nhood),
fare_amount, pickup_nhood, dropoff_nhood), 10)
Use arrange()
to sort on the basis of tip_amount
, dropoff_nhood
, and pickup_dow
, with descending order for tip amount
filter
: Extract subsets of rows. See also slice()
select
: Extract subsets of columns. See also rename()
arrange
: Sort your data
mutate()
verb can be used to make new columns
In [12]:
#Exercise
head(select(arrange(taxi_df, desc(tip_amount), dropoff_nhood, pickup_nhood), tip_amount, dropoff_nhood, pickup_nhood))
In [13]:
taxi_df <- mutate(taxi_df, tip_pct = tip_amount/fare_amount)
head(select(taxi_df, tip_pct, fare_amount, tip_amount))
head(transmute(taxi_df, tip_pct = tip_amount/fare_amount))
In [14]:
str(taxi_df)
In [15]:
class(taxi_df)
In [16]:
grouped_taxi <- group_by(taxi_df, dropoff_nhood)
class(grouped_taxi)
head(grouped_taxi)#grouping is not visible, but the class knows about the grouping
In [17]:
nrow(summarize(group_by(taxi_df, dropoff_nhood),
Num = n(), ave_tip_pct = mean(tip_pct)))
nrow(summarize(grouped_taxi, Num=n(), ave_tip_pct=mean(tip_pct))) #calculated above
In [18]:
summarise(group_by(taxi_df, pickup_nhood, dropoff_nhood),
Num = n(), ave_tip_pct = mean(tip_pct))
dplyr
installation includes the magrittr
package as a dependencymagrittr
package includes a pipe operator that allows you to pass the current dataset to another function
In [19]:
filter(arrange(summarise(group_by(taxi_df, pickup_nhood, dropoff_nhood), Num = n(), ave_tip_pct = mean(tip_pct)), desc(ave_tip_pct)), Num >= 10)
In [20]:
filter(
arrange(
summarise(
group_by(taxi_df,
pickup_nhood, dropoff_nhood),
Num = n(),
ave_tip_pct = mean(tip_pct)),
desc(ave_tip_pct)),
Num >= 10)
|
, and F# forward pipe |>
, magrittr
introduces the funny character (%>%
, the then operator)%>%
pipes the object on the left hand side to the first argument of the function on the right hand sidedplyr
has a slot for data.frame/tbl
as it's first argument, so this works beautifully!
In [21]:
taxi_df %>%
group_by(pickup_nhood, dropoff_nhood) %>%
summarize(Num = n(),
ave_tip_pct = mean(tip_pct)) %>%
arrange(desc(ave_tip_pct)) %>%
filter(Num >= 10)
In [22]:
mht_url <- "http://alizaidi.blob.core.windows.net/training/manhattan.rds"
manhattan_hoods <- readRDS(gzcon(url(mht_url)))
taxi_df %>%
filter(pickup_nhood %in% manhattan_hoods,
dropoff_nhood %in% manhattan_hoods) %>%
group_by(dropoff_nhood, pickup_nhood) %>%
summarize(ave_tip = mean(tip_pct),
ave_dist = mean(trip_distance)) %>%
filter(ave_dist > 3, ave_tip > 0.05)
In [23]:
library(ggplot2)
taxi_df %>%
filter(pickup_nhood %in% manhattan_hoods,
dropoff_nhood %in% manhattan_hoods) %>%
group_by(dropoff_nhood, pickup_nhood) %>%
summarize(ave_tip = mean(tip_pct),
ave_dist = mean(trip_distance)) %>%
filter(ave_dist > 3, ave_tip > 0.05) %>%
ggplot(aes(x = pickup_nhood, y = dropoff_nhood)) +
geom_tile(aes(fill = ave_tip), colour = "white") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = 'bottom') +
scale_fill_gradient(low = "white", high = "steelblue")
In [24]:
library(ggplot2)
taxi_df %>%
filter(pickup_nhood %in% manhattan_hoods,
dropoff_nhood %in% manhattan_hoods) %>%
group_by(dropoff_nhood, pickup_nhood) %>%
summarize(ave_tip = mean(tip_pct),
ave_dist = mean(trip_distance)) %>%
filter(ave_dist > 3, ave_tip > 0.05) %>%
ggplot(aes(x = pickup_nhood, y = dropoff_nhood)) +
geom_tile(aes(fill = ave_tip), colour = "white") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = 'bottom') +
scale_fill_gradient(low = "white", high = "steelblue")
In [25]:
taxi_df %>%
filter(pickup_nhood %in% manhattan_hoods,
dropoff_nhood %in% manhattan_hoods) %>%
group_by(dropoff_nhood, pickup_nhood) %>%
summarize(ave_tip = mean(tip_pct),
ave_dist = mean(trip_distance)) %>%
lm(ave_tip ~ ave_dist, data = .) -> taxi_model # -> to assign left hand size to the right hand size
summary(taxi_model)
#OR
#taxi_df %>%
# filter(pickup_nhood %in% manhattan_hoods,
# dropoff_nhood %in% manhattan_hoods) %>%
# group_by(dropoff_nhood, pickup_nhood) %>%
# summarize(ave_tip = mean(tip_pct),
# ave_dist = mean(trip_distance)) %>%
# ungroup() %>% # <-
# lm(ave_tip ~ ., data = .) -> taxi_model # <-
#summary(taxi_model)
In [26]:
str(manhattan_hoods)
In [27]:
library(ggplot2)
taxi_df %>%
filter(dropoff_nhood %in% manhattan_hoods) %>%
group_by(dropoff_nhood, pickup_dow) %>%
summarize(ave_fare = mean(fare_amount)) %>%
ggplot(aes(x = pickup_dow, y = dropoff_nhood)) +
geom_tile(aes(fill = ave_fare), colour = "white") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = 'bottom') +
scale_fill_gradient(low = "white", high = "steelblue")
In [28]:
taxi_hood_sum <- function(taxi_data = taxi_df) {
mht_url <- "http://alizaidi.blob.core.windows.net/training/manhattan.rds"
manhattan_hoods <- readRDS(gzcon(url(mht_url)))
taxi_data %>%
filter(pickup_nhood %in% manhattan_hoods,
dropoff_nhood %in% manhattan_hoods) %>%
group_by(dropoff_nhood, pickup_nhood) %>%
summarize(ave_tip = mean(tip_pct),
ave_dist = mean(trip_distance)) %>%
filter(ave_dist > 3, ave_tip > 0.05) -> sum_df
return(sum_df)
}
In [29]:
tile_plot_hood <- function(df = taxi_hood_sum()) {
library(ggplot2)
ggplot(data = df, aes(x = pickup_nhood, y = dropoff_nhood)) +
geom_tile(aes(fill = ave_tip), colour = "white") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = 'bottom') +
scale_fill_gradient(low = "white", high = "steelblue") -> gplot
return(gplot)
}
In [30]:
#library(plotly)
#taxi_hood_sum(taxi_df) %>% tile_plot_hood
#OR
tile_plot_hood()
In [ ]:
In [31]:
library(plotly)
In [32]:
embed_notebook(ggplotly(tile_plot_hood()))
In [ ]:
taxi_df %>% group_by(dropoff_dow) %>%
filter(!is.na(dropoff_nhood), !is.na(pickup_nhood)) %>%
arrange(desc(tip_pct)) %>%
do(slice(., 1:2)) %>%
select(dropoff_dow, tip_amount, tip_pct,
fare_amount, dropoff_nhood, pickup_nhood)
In [ ]:
dow_lms <- taxi_df %>% sample_n(10^4) %>%
group_by(dropoff_dow) %>%
do(lm_tip = lm(tip_pct ~ pickup_nhood + passenger_count + pickup_hour,
data = .))
In [ ]:
dow_lms
In [ ]:
summary(dow_lms$lm_tip[[1]])
library(broom)
dow_lms %>% tidy(lm_tip)
dplyr
returns a data.frame
data.frame
with a column of S3
lm
objectsmutate
function to extract intercept or statisticsbroom
package for cleaning up such objects into data.frames
In [ ]:
library(broom)
taxi_df %>% sample_n(10^5) %>%
group_by(dropoff_dow) %>%
do(glance(lm(tip_pct ~ pickup_nhood + passenger_count + pickup_hour,
data = .)))
In [ ]:
taxi_df %>% sample_n(10^5) %>%
group_by(dropoff_dow) %>%
do(tidy(lm(tip_pct ~ pickup_nhood + passenger_count + pickup_hour,
data = .)))
mutate
: Create transformations
summarise
: Aggregate
group_by
: Group your dataset by levels
do
: Evaluate complex operations on a tbl
Chaining with the %>%
operator can result in more readable code.
dplyr
and functional programming landscapevignette('databases', package = 'dplyr')
vignette('two-table', package = 'dplyr')
dplyr
,vignette('nse', package = 'dplyr')
summarize_each
and mutate_each