At the end of this session, you will have learned how to:
dplyr
module to manipulate RxXdfData
data objectsRxXdfData
objects quickly and easilydplyrXdf
package and when to use functions from the RevoScaleR
packageRevoScaleR
package enables R users to manipulate data that is larger than memoryxdf
(short for eXternal Data Frame), which are highly efficient out-of-memory objectsRevoScaleR
functions have a dramatically different syntax from base R functionsdplyr
package is an exceptionally popular, due to its appealing syntax, and it's extensibilitydplyrXdf
that exposes most of the dplyr
functionality to xdf
objectsdplyrXdf
abstracts this task of file management, so that you can focus on the data itself, rather than the management of intermediate filesdplyr
, or other base R packages, dplyrXdf
allows you to work with data residing outside of memory, and therefore scales to datasets of arbitrary sizedplyr
trainingdevtools
)[github.com/hadley/devtools] (and if on a Windows machine, Rtools)dplyrXdf
package is not yet on CRANdevtools
package provides a very handy function, install_github
, for installing R packages saved in github repositories
In [ ]:
your_name <- "alizaidi"
your_dir <- paste0('/datadrive/', your_name)
# File Path to your Data
your_data <- file.path(your_dir, 'tripdata_2015.xdf')
dir.create(your_dir)
download.file("http://alizaidi.blob.core.windows.net/training/trainingData/manhattan.xdf",
destfile = your_data)
In [ ]:
library(dplyrXdf)
taxi_xdf <- RxXdfData(your_data)
taxi_xdf %>% head
taxi_xdf %>% nrow
In [ ]:
class(taxi_xdf)
dplyrXdf
package can also be completed
by using the rxDataStep
function in the RevoScaleR
package included with your MRS installationdplyrXdf
consists almost entirely of wrapper functions that call on other RevoScaleR functionsrxDataStep
vs dplyrXdf
In [ ]:
taxi_xdf %>% rxGetInfo(getVarInfo = TRUE, numRows = 4)
In [ ]:
taxi_transform <- RxXdfData(your_data)
In [ ]:
system.time(rxDataStep(inData = taxi_xdf,
outFile = taxi_transform,
transforms = list(tip_pct = tip_amount/fare_amount),
overwrite = TRUE))
In [ ]:
rxGetInfo(RxXdfData(taxi_transform), numRows = 2)
In [ ]:
system.time(taxi_transform <- taxi_xdf %>% mutate(tip_pct = tip_amount/fare_amount))
taxi_transform %>% rxGetInfo(numRows = 2)
rxDataStep
operation and the dplyrXdf
method, is that we do not specify an outFile
argument anywhere in the dplyrXdf
pipelinemutate
value to a new variable called taxi_transform
xdf
, and only saves the most recent output of a pipeline, where a pipeline is defined as all operations starting from a raw xdf file.persist
verb
In [ ]:
taxi_transform@file
In [ ]:
persist(taxi_transform, outFile = "/datadrive/alizaidi/taxiTransform.xdf") -> taxi_transform
dplyrXdf
package really shines when used for data aggregations and summarizationsrxSummary
, rxCube
, and rxCrossTabs
can compute a few summary statistics and do aggregations very quickly, they are not sufficiently general to be used in all places
In [ ]:
taxi_group <- taxi_transform %>%
group_by(pickup_nhood) %>%
summarise(ave_tip = mean(tip_pct))
taxi_group %>% head
rxCube
as well, but would require additional considerationspickup_nhood
column was a factor (can't mutate in place because of different data types)rxCube
can only provide summations and averages, so we cannot get standard deviations for instance.
In [ ]:
rxFactors(inData = taxi_transform,
outFile = "/datadrive/alizaidi/taxi_factor.xdf",
factorInfo = c("pickup_nhood"),
overwrite = TRUE)
head(rxCube(tip_pct ~ pickup_nhood,
means = TRUE,
data = "/datadrive/alizaidi/taxi_factor.xdf"))
# file.remove("data/taxi_factor.xdf")
As we saw above, it's pretty easy to create a summarization or aggregation script. We can encapsulate our aggregation into it's own function.
Suppose we wanted to calculate average tip as a function of dropoff and pickup neighborhoods. In the dplyr
nonmenclature, this means grouping by dropoff and pickup neighborhoods, and summarizing/averaging tip percent.
In [ ]:
rxGetInfo(taxi_transform, numRows = 5)
In [ ]:
mht_url <- "http://alizaidi.blob.core.windows.net/training/manhattan.rds"
manhattan_hoods <- readRDS(gzcon(url(mht_url)))
In [ ]:
taxi_transform %>%
filter(pickup_nhood %in% mht_hoods,
dropoff_nhood %in% mht_hoods,
.rxArgs = list(transformObjects = list(mht_hoods = manhattan_hoods))) %>%
group_by(dropoff_nhood, pickup_nhood) %>%
summarize(ave_tip = mean(tip_pct),
ave_dist = mean(trip_distance)) %>%
filter(ave_dist > 3, ave_tip > 0.05) -> sum_df
In [ ]:
sum_df %>% rxGetInfo(getVarInfo = TRUE, numRows = 5)
class(sum_df)
Alternatively, we can encapsulate this script into a function, so that we can easily call it in a functional pipeline.
In [ ]:
taxi_hood_sum <- function(taxi_data = taxi_df, ...) {
taxi_data %>%
filter(pickup_nhood %in% manhattan_hoods,
dropoff_nhood %in% manhattan_hoods, ...) %>%
group_by(dropoff_nhood, pickup_nhood) %>%
summarize(ave_tip = mean(tip_pct),
ave_dist = mean(trip_distance)) %>%
filter(ave_dist > 3, ave_tip > 0.05) -> sum_df
return(sum_df)
}
The resulting summary object isn't very large (about 408 rows in this case), so it shouldn't cause any memory overhead issues if we covert it now to a data.frame
. We can plot our results using our favorite plotting library.
In [ ]:
tile_plot_hood <- function(df = taxi_hood_sum()) {
library(ggplot2)
ggplot(data = df, aes(x = pickup_nhood, y = dropoff_nhood)) +
geom_tile(aes(fill = ave_tip), colour = "white") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = 'bottom') +
scale_fill_gradient(low = "white", high = "steelblue") -> gplot
return(gplot)
}
In [ ]:
# tile_plot_hood(as.data.frame(sum_df))
taxi_transform <- taxi_xdf %>% mutate(tip_pct = tip_amount/fare_amount)
library(plotly)
sum_df <- taxi_hood_sum(taxi_transform,
.rxArgs = list(transformObjects = list(manhattan_hoods = manhattan_hoods))) %>%
persist("/datadrive/alizaidi/summarized.xdf")
ggplotly(tile_plot_hood(as.data.frame(sum_df)))
The do
verb is an exception to the rule that dplyrXdf verbs write their output as xdf files. This is because do executes arbitrary R code, and can return arbitrary R objects; while a data frame is capable of storing these objects, an xdf file is limited to character and numeric vectors only.
The doXdf verb is similar to do, but where do splits its input into one data frame per group, doXdf splits it into one xdf file per group. This allows do-like functionality with grouped data, where each group can be arbitrarily large. The syntax for the two functions is essentially the same, although the code passed to doXdf must obviously know how to handle xdfs.
In [ ]:
taxi_models <- taxi_xdf %>% group_by(pickup_dow) %>% doXdf(model = rxLinMod(tip_amount ~ fare_amount, data = .))
In [ ]:
taxi_models
taxi_models$model[[1]]
All the caveats that go with working with data.frames
apply here. While each grouped partition is it's own RxXdfData
object, the return value must be a data.frame
, and hence, must fit in memory.
Moreover, the function you apply against the splits will determine how they are operated. If you use an rx
function, you'll get the nice fault-tolerant, parallel execution strategies the RevoScaleR
package provides, but for any vanilla/CRAN function will work with data.frames and can easily cause your session to crash.
In [ ]:
library(broom)
taxi_broom <- taxi_xdf %>% group_by(pickup_dow) %>% doXdf(model = lm(tip_amount ~ fare_amount, data = .))
Now we can apply the broom::tidy
function at the row level to get summary statistics:
In [ ]:
library(broom)
tbl_df(taxi_broom) %>% tidy(model)