In [ ]:
knitr::opts_chunk$set(warning=FALSE, message=FALSE, fig.align = 'center')

Course Logistics

Day One

R U Ready?

  • Overview of The R Project for Statistical Computing
  • The Microsoft R Family
  • R's capabilities and its limitations
  • What types of problems R might be useful for
  • How to manage data with the exceptionally popular open source package dplyr
  • How to develop models and write functions in R

Day Two

Scalable Data Analysis with Microsoft R

  • Moving the compute to your data
  • WODA - Write Once, Deploy Anywhere
  • High Performance Analytics
  • High Performance Computing
  • Machine Learning with Microsoft R

Day Three

Distributed Computing on Spark Clusters with R

  • Overview of the Apache Spark Project
  • Taming the Hadoop Zoo with HDInsight
  • Provisioning and Managing HDInsight Clusters
  • Spark DataFrames, SparkR, and the sparklyr package
  • Developing Machine Learning Pipelines with Spark and Microsoft R

Prerequisites

Computing Environments

Development Environments

Where to Write R Code

  • The most popular integrated development environment for R is RStudio
  • The RStudio IDE is entirely html/javascript based, so completely cross-platform
  • RStudio Server provides a full IDE in your browser: perfect for cloud instances
  • For Windows machines, R Tools for Visual Studio (RTVS) reached general availability in 2016
  • RTVS supports remote connections to Azure and SQL Server

What is R?

Why should I care?

  • R is the successor to the S language, which originated at AT&T's Bell Labs
  • Its interpreter design is based on Scheme
  • Originally designed by two University of Auckland professors (Ross Ihaka and Robert Gentleman) for their introductory statistics course

R's Philosophy

What R Thou?

R follows the Unix philosophy

  • Write programs that do one thing and do it well (modularity)
  • Write programs that work together (cohesiveness)
  • R is extensible with more than 10,000 packages available on CRAN (http://crantastic.org/packages)

The aRt of Being Lazy

Lazy Evaluation in R

  • R, like its inspiration, Scheme, is a functional programming language
  • R evaluates lazily, delaying evaluation until a value is needed, which can make it very flexible
  • R is an interpreted, dynamically typed language, allowing you to mutate variables and analyze datasets quickly, but it is significantly slower than compiled, statically typed languages like C or Java
  • R has a high memory footprint, which can easily lead to crashes if you aren't careful

R's Programming Paradigm

Keys to R

Everything that exists in R is an *object*
Everything that happens in R is a *function call*
R was born to *interface*

_—John Chambers_

Strengths of R

Where R Succeeds

  • Expressive
  • Open source
  • Extendable -- more than 10,000 packages with functions to use, and that list continues to grow
  • Focused on statistics and machine learning -- cutting-edge algorithms and powerful data manipulation packages
  • Advanced data structures and graphical capabilities
  • Large user community, both within academia and industry
  • It is designed by statisticians

Weaknesses of R

Where R Falls Short

  • It is designed by statisticians
  • Inefficient at element-by-element computations
  • May make large demands on system resources, namely memory
  • Data capacity limited by memory
  • Single-threaded

Distributions of R

Some Essential Open Source Packages

  • There are over 10,000 R packages to choose from, so where should you start?
  • Data Management: dplyr, tidyr, data.table
  • Visualization: ggplot2, ggvis, htmlwidgets, shiny
  • Data Importing: haven, RODBC, readr, foreign
  • Other favorites: magrittr, rmarkdown, caret

R Foundations

Command line prompts

Symbol   Meaning
<-       assignment operator
>        ready for a new command
+        awaiting the completion of an existing command
?        get help for the following function

You can change options either permanently at startup (see ?Startup) or per session with the options function, for example options(repos = "https://cran.r-project.org") to set your CRAN mirror.

Check your CRAN mirror with getOption("repos").

I'm Lost!

Getting Help for R

Quick Tour of Things You Need to Know

Data Structures

"Bad programmers worry about the code. Good programmers worry about data structures and their relationships."

  • Linus Torvalds
  • R's data structures can be described by their dimensionality, and their type.
Homogeneous Heterogeneous
1d Atomic vector List
2d Matrix Data frame
nd Array

Quick Tour of Things You Need to Know

Data Types

  • Atomic vectors come in one of four common types:
    • logical (boolean). Values: TRUE | FALSE
    • integer
    • double (often called numeric)
    • character
  • Rare types:
    • complex
    • raw
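
A quick check of each type with typeof() (a minimal sketch):

In [ ]:
typeof(TRUE)    # "logical"
typeof(1L)      # "integer"
typeof(3.14)    # "double"
typeof("hi")    # "character"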

Manipulating Data Structures

Subsetting Operators

  • To create a vector, use c: c(1, 4, 1, 3)
  • To create a list, use list: list(1, 'hi', data.frame(1:10, letters[1:10]))
  • To subset a vector or list, use the square brackets [ ]
    • inside the brackets, supply:
      • positive integer vectors for the indices you want to keep
      • negative integer vectors for the indices you want to drop
      • logical vectors for the indices you want to keep/drop (TRUE/FALSE)
      • character vectors to select elements by name (for named vectors)
      • subsetting a list with a single square bracket always returns a list
  • To subset a list and get back just that element, use [[ ]]
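
A minimal sketch of these rules on a toy vector and list:

In [ ]:
x <- c(a = 1, b = 4, c = 1, d = 3)
x[c(1, 3)]       # positive indices: keep elements 1 and 3
x[-2]            # negative index: drop element 2
x[x > 2]         # logical: keep values greater than 2
x[c("a", "d")]   # character: keep the named elements a and d
l <- list(1, 'hi', letters[1:3])
l[2]             # single bracket: a list of length one
l[[2]]           # double bracket: the element itself, "hi"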

Object Representation

  • To find the type of an object, use class (higher-level representation)
  • To find how the object is stored in memory, use typeof (lower-level representation)
  • After the quick sketch below, it's a good time to do Lab 1!
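
A quick sketch of the difference (class describes the object, typeof its storage):

In [ ]:
class(3.14)    # "numeric" -- the higher-level representation
typeof(3.14)   # "double"  -- how it is stored in memory
class(iris)    # "data.frame"
typeof(iris)   # "list" -- a data.frame is stored as a list of columns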

Data Manipulation with the dplyr Package

Overview

Rather than describing the nitty-gritty details of writing R code, I'd like you to start writing R code immediately.

As most of you are data scientists or data enthusiasts, I will showcase one of the most useful data manipulation packages in R, dplyr. By the end of this session, you will have learned:

  • How to manipulate data quickly with dplyr using a very intuitive "grammar"
  • How to use dplyr to perform common data manipulation procedures in exploratory analysis
  • How to apply your own custom functions to grouped manipulations in dplyr with mutate(), summarise(), and do()
  • How to connect to remote databases to work with larger-than-memory datasets

Why use dplyr?

The Grammar of Data Manipulation

  • dplyr is currently the most downloaded package from CRAN
  • dplyr makes data manipulation easier by providing a few functions for the most common tasks and procedures
  • dplyr achieves remarkable speed-up gains by using a C++ backend
  • dplyr has multiple backends for working with data stored in various sources: SQLite, MySQL, bigquery, SQL Server, and many more
  • dplyr was designed to give data manipulation a simple, cohesive grammar (similar in philosophy to ggplot2's grammar of graphics)
  • dplyr has inspired many new packages, which adopt its easy-to-understand syntax
  • The more recent packages dplyrXdf and SparkR/sparklyr bring much of the same functionality to XDF data and Spark DataFrames

Tidy Data and Happier Coding

Premature Optimization

  • For a data scientist, the most important parameter to optimize in the data science development cycle is YOUR time
  • It is therefore important to be able to write efficient code, quickly
  • Goals: write fast code that is portable, platform-invariant, easy to understand, and easy to debug
    • Be serious about CReUse!

Manipulation verbs

filter

: select rows based on matching criteria

slice

: select rows by number

select

: select columns by column names

arrange

: reorder rows by column values

mutate

: add new variables based on transformations of existing variables

transmute

: transform and drop other variables

Aggregation verbs

group_by

: identify grouping variables for calculating groupwise summary statistics

count

: count the number of records per group

summarise | summarize

: calculate one or more summary functions per group, returning one row of results per group (or one for the entire dataset)

NYC Taxi Data

Data for Class

  • The data we will examine in this module is derived from the NYC Taxi and Limousine Commission
  • Data contains taxi trips in NYC, and includes spatial features (pickup and dropoff neighborhoods), temporal features, and monetary features (fare and tip amounts)
  • The dataset for this module is saved as an rds file in a public-facing Azure storage blob
  • An rds file is a compressed, serialized R object
  • Save an object to rds with the saveRDS function; read an rds file back with the readRDS function
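
A minimal sketch (using a hypothetical file name):

In [ ]:
saveRDS(mtcars, "mtcars.rds")      # serialize and compress an R object to disk
my_cars <- readRDS("mtcars.rds")   # read it back into memory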

Viewing Data

tibble

  • dplyr includes a wrapper called tbl_df that adds a class attribute onto data.frames to provide better data manipulation aesthetics (there's now a dedicated package, tibble, for this wrapper and its class)
  • The most noticeable difference between tbl_dfs and data.frames is the console output: tbl_dfs will only print what the current R console window can display
  • You can change the default number of displayed columns with the options function: options(dplyr.width = Inf)

In [ ]:
library(dplyr)
library(stringr)
taxi_url <- "http://alizaidi.blob.core.windows.net/training/trainingData/manhattan_df.rds"
taxi_df  <- readRDS(gzcon(url(taxi_url)))
(taxi_df <- tbl_df(taxi_df))

Filtering and Reordering Data

Subsetting Data

  • dplyr makes subsetting by rows very easy
  • The filter verb selects the rows that match the conditions you supply
  • Every dplyr function takes a data.frame/tbl as its first argument
  • Additional conditions are passed as separate arguments (no need to build one insanely complicated expression; split them up!)

Filter


In [ ]:
filter(taxi_df,
       dropoff_dow %in% c("Fri", "Sat", "Sun"),
       tip_amount > 1)

Exercise

Your turn:

  • How many observations started in Harlem?
    • pick both sides of Harlem, including East Harlem
    • hint: it might be useful to use the str_detect function from stringr
  • How many observations that started in Harlem ended in the Financial District?

Solution


In [ ]:
library(stringr)
table(taxi_df$pickup_nhood)
harlem_pickups <- filter(taxi_df, str_detect(pickup_nhood, "Harlem"))
harlem_pickups
# uncomment the line below (ctrl+shift+c) and filter harlem_pickups on Financial District
# how many rows?
# fidi <- filter(harlem_pickups, ...)

Select a set of columns

  • You can use the select() verb to specify which columns of a dataset you want
  • This is similar to the keep option in SAS's data step.
  • Use a colon : to select all the columns between two variables (inclusive)
  • Use contains to take any columns containing a certain word/phrase/character

Select Example


In [ ]:
select(taxi_df, pickup_nhood, dropoff_nhood,
       fare_amount, dropoff_hour, trip_distance)

Select: Other Options

starts_with(x, ignore.case = FALSE)

: name starts with x

ends_with(x, ignore.case = FALSE)

: name ends with x

matches(x, ignore.case = FALSE)

: selects all variables whose name matches the regular expression x

num_range("V", 1:5, width = 1)

: selects all variables (numerically) from V1 to V5.

  • You can also use a - to drop variables.
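
A few sketches of these helpers on the taxi data (column names as used elsewhere in this module):

In [ ]:
select(taxi_df, contains("amount"))      # e.g. fare_amount, tip_amount
select(taxi_df, starts_with("pickup"))   # all pickup_* columns
select(taxi_df, -trip_distance)          # drop a single column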

Reordering Data

  • You can reorder your dataset by column values using the arrange() verb
  • Use the desc function to sort in descending rather than ascending order (the default)

Arrange


In [ ]:
select(arrange(taxi_df, desc(fare_amount), pickup_nhood),
       fare_amount, pickup_nhood)

head(select(arrange(taxi_df, desc(fare_amount), pickup_nhood),
       fare_amount, pickup_nhood), 10)

Exercise

Use arrange() to sort on the basis of tip_amount, dropoff_nhood, and pickup_dow, with descending order for tip amount
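
One possible solution (a sketch; pickup_dow is assumed to be the pickup day-of-week column):

In [ ]:
arrange(taxi_df, desc(tip_amount), dropoff_nhood, pickup_dow)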

Summary

filter

: Extract subsets of rows. See also slice()

select

: Extract subsets of columns. See also rename()

arrange

: Sort your data

Data Aggregations and Transformations

Transformations

  • The mutate() verb can be used to make new columns

In [ ]:
taxi_df <- mutate(taxi_df, tip_pct = tip_amount/fare_amount)
select(taxi_df, tip_pct, fare_amount, tip_amount)
transmute(taxi_df, tip_pct = tip_amount/fare_amount)

Summarise Data by Groups

  • The group_by verb creates a grouping by a categorical variable
  • Functions can be placed inside summarise to create summary functions

In [ ]:
grouped_taxi <- group_by(taxi_df, dropoff_nhood)
class(grouped_taxi)
grouped_taxi

In [ ]:
summarize(group_by(taxi_df, dropoff_nhood),
          Num = n(), ave_tip_pct = mean(tip_pct))

Group By Neighborhoods Example


In [ ]:
summarise(group_by(taxi_df, pickup_nhood, dropoff_nhood),
          Num = n(), ave_tip_pct = mean(tip_pct))

Chaining/Piping

  • A dplyr installation includes the magrittr package as a dependency
  • The magrittr package provides a pipe operator that lets you pass the current dataset to another function
  • This makes a nested sequence of operations much easier to read

Standard Code

  • Code is executed inside-out.
  • Let's take the neighborhood summary from above, arrange it in descending order of average tip percentage, and keep only the pickup/dropoff pairs with at least 10 trips.

In [ ]:
filter(arrange(summarise(group_by(taxi_df, pickup_nhood, dropoff_nhood), Num = n(), ave_tip_pct = mean(tip_pct)), desc(ave_tip_pct)), Num >= 10)

Reformatted


In [ ]:
filter(
  arrange(
    summarise(
      group_by(taxi_df,
               pickup_nhood, dropoff_nhood),
      Num = n(),
      ave_tip_pct = mean(tip_pct)),
    desc(ave_tip_pct)),
  Num >= 10)

Magrittr

  • Inspired by the Unix pipe | and F#'s forward pipe |>, magrittr introduces the funny-looking %>% (the "then" operator)
  • %>% pipes the object on the left-hand side into the first argument of the function on the right-hand side
  • Every function in dplyr reserves a slot for a data.frame/tbl as its first argument, so this works beautifully!

Put that Function in Your Pipe and...


In [ ]:
taxi_df %>%
  group_by(pickup_nhood, dropoff_nhood) %>%
  summarize(Num = n(),
            ave_tip_pct = mean(tip_pct)) %>%
  arrange(desc(ave_tip_pct)) %>%
  filter(Num >= 10)

Pipe + group_by()

  • The pipe operator is very helpful for group-by summaries
  • Let's calculate the average tip percentage and average trip distance for each pickup and dropoff neighborhood pair
  • First, filter to the neighborhoods in the vector manhattan_hoods


In [ ]:
mht_url <- "http://alizaidi.blob.core.windows.net/training/manhattan.rds"
manhattan_hoods <- readRDS(gzcon(url(mht_url)))
taxi_df %>%
  filter(pickup_nhood %in% manhattan_hoods,
         dropoff_nhood %in% manhattan_hoods) %>%
  group_by(dropoff_nhood, pickup_nhood) %>%
  summarize(ave_tip = mean(tip_pct),
            ave_dist = mean(trip_distance)) %>%
  filter(ave_dist > 3, ave_tip > 0.05)

Pipe and Plot

Piping is not limited to dplyr functions; it can be used everywhere!


In [ ]:
library(ggplot2)
taxi_df %>%
  filter(pickup_nhood %in% manhattan_hoods,
         dropoff_nhood %in% manhattan_hoods) %>%
  group_by(dropoff_nhood, pickup_nhood) %>%
  summarize(ave_tip = mean(tip_pct),
            ave_dist = mean(trip_distance)) %>%
  filter(ave_dist > 3, ave_tip > 0.05) %>%
  ggplot(aes(x = pickup_nhood, y = dropoff_nhood)) +
    geom_tile(aes(fill = ave_tip), colour = "white") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = 'bottom') +
    scale_fill_gradient(low = "white", high = "steelblue")


Piping to other arguments

  • Although dplyr takes great care to be amenable to piping, other functions may not reserve their first argument for the object you are passing in
  • You can use the special . placeholder to specify where the piped object should enter

In [ ]:
taxi_df %>%
  filter(pickup_nhood %in% manhattan_hoods,
         dropoff_nhood %in% manhattan_hoods) %>%
  group_by(dropoff_nhood, pickup_nhood) %>%
  summarize(ave_tip = mean(tip_pct),
            ave_dist = mean(trip_distance)) %>%
  lm(ave_tip ~ ave_dist, data = .) -> taxi_model
summary(taxi_model)

Exercise

Your turn:

  • Use the pipe operator to group by day of week and dropoff neighborhood
  • Filter to Manhattan neighborhoods
  • Make a tile plot with average fare amount (in dollars) as the fill; one possible solution follows
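
One possible solution (a sketch following the pattern above; dropoff_dow and fare_amount are columns used earlier in this module):

In [ ]:
taxi_df %>%
  filter(pickup_nhood %in% manhattan_hoods,
         dropoff_nhood %in% manhattan_hoods) %>%
  group_by(dropoff_dow, dropoff_nhood) %>%
  summarise(ave_fare = mean(fare_amount)) %>%
  ggplot(aes(x = dropoff_dow, y = dropoff_nhood)) +
    geom_tile(aes(fill = ave_fare), colour = "white") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = 'bottom') +
    scale_fill_gradient(low = "white", high = "steelblue")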

Functional Programming

Creating Functional Pipelines

Too Many Pipes?


Reusable code

  • The examples above create a rather messy pipeline operation
  • Such pipelines can be very hard to debug
  • The operation is pretty readable, but lacks reusability
  • Since R is a functional language, we benefit from splitting these operations into functions and calling them separately
  • This allows reusability; don't write the same code twice!

Functional Pipelines

Summarization

  • Let's create a function that takes an argument for the data, and applies the summarization by neighborhood to calculate average tip and trip distance


In [ ]:
taxi_hood_sum <- function(taxi_data = taxi_df) {

  mht_url <- "http://alizaidi.blob.core.windows.net/training/manhattan.rds"

  manhattan_hoods <- readRDS(gzcon(url(mht_url)))
  taxi_data %>%
    filter(pickup_nhood %in% manhattan_hoods,
           dropoff_nhood %in% manhattan_hoods) %>%
    group_by(dropoff_nhood, pickup_nhood) %>%
    summarize(ave_tip = mean(tip_pct),
              ave_dist = mean(trip_distance)) %>%
    filter(ave_dist > 3, ave_tip > 0.05) -> sum_df

  return(sum_df)

}

Functional Pipelines

Plotting Function

  • We can create a second function for the plot

In [ ]:
tile_plot_hood <- function(df = taxi_hood_sum()) {

  library(ggplot2)

  ggplot(data = df, aes(x = pickup_nhood, y = dropoff_nhood)) +
    geom_tile(aes(fill = ave_tip), colour = "white") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = 'bottom') +
    scale_fill_gradient(low = "white", high = "steelblue") -> gplot

  return(gplot)
}

Calling Our Pipeline

  • Now we can create our plot by simply calling our two functions, and make that baby interactive with plotly::ggplotly

In [ ]:
library(plotly)
taxi_hood_sum(taxi_df) %>% tile_plot_hood %>% ggplotly


Creating Complex Pipelines with do

  • The summarise function is handy, and can compute many numeric/scalar summaries
  • But what if you want multiple values/rows back, not just a scalar summary?
  • Meet the do verb -- arbitrary tbl operations


In [ ]:
taxi_df %>% group_by(dropoff_dow) %>%
  filter(!is.na(dropoff_nhood), !is.na(pickup_nhood)) %>%
  arrange(desc(tip_pct)) %>%
  do(slice(., 1:2)) %>%
  select(dropoff_dow, tip_amount, tip_pct,
         fare_amount, dropoff_nhood, pickup_nhood)

Estimating Multiple Models with do

  • A common use of do is to calculate many different models by a grouping variable

In [ ]:
dow_lms <- taxi_df %>% sample_n(10^4) %>%
  group_by(dropoff_dow) %>%
  do(lm_tip = lm(tip_pct ~ pickup_nhood + passenger_count + pickup_hour,
     data = .))


In [ ]:
dow_lms

Where are our results?

Cleaning Output


In [ ]:
summary(dow_lms$lm_tip[[1]])
library(broom)
dow_lms %>% tidy(lm_tip)

  • By design, every function in dplyr returns a data.frame
  • In the example above, we get back a spooky data.frame with a column of S3 lm objects
  • You can still modify each element as you would normally, or pass it to mutate to extract intercepts or other statistics
  • But there's also a very handy package, broom, for cleaning up such objects into data.frames

Brooming Up the Mess

Model Metrics


In [ ]:
library(broom)
taxi_df %>% sample_n(10^5) %>%
  group_by(dropoff_dow) %>%
  do(glance(lm(tip_pct ~ pickup_nhood + passenger_count + pickup_hour,
     data = .)))

Model Coefficients

The most commonly used function in the broom package is tidy. It expands our data.frame and gives us the model coefficients.


In [ ]:
taxi_df %>% sample_n(10^5) %>%
  group_by(dropoff_dow) %>%
  do(tidy(lm(tip_pct ~ pickup_nhood + passenger_count + pickup_hour,
     data = .)))

Spatial Visualizations with ggplot2 and purrr

Visualizing Pickups by Time

  • Let's try another example
  • We will visualize pickups and index them by time

In [ ]:
# min and max coordinates:
min_lat <- 40.5774
max_lat <- 40.9176
min_long <- -74.15
max_long <- -73.7004

pickups <- taxi_df %>%
  # keep trips whose pickup and dropoff coordinates fall inside the bounding box
  filter(pickup_longitude > min_long, pickup_longitude < max_long,
         pickup_latitude > min_lat, pickup_latitude < max_lat,
         dropoff_longitude > min_long, dropoff_longitude < max_long,
         dropoff_latitude > min_lat, dropoff_latitude < max_lat) %>%
  group_by(pickup_hour,
           pickup_longitude,
           pickup_latitude) %>%
  summarise(num_pickups = n())

Load Additional Libraries


In [ ]:
library(purrr)
library(lubridate)
library(RColorBrewer)
library(magick)

Visualize Pickups

ggplot2 Theme

  • ggplot will give very aesthetically appealing plots by default
  • However, it really shines in its ability to be customized
  • See the ggthemes package for some template themes
  • We'll use the theme below, inspired by Max Woolf's Tweet on this dataset

In [ ]:
theme_map_dark <- function(palate_color = "Greys") {

  palate <- brewer.pal(palate_color, n=9)
  color.background = "black"
  color.grid.minor = "black"
  color.grid.major = "black"
  color.axis.text = palate[1]
  color.axis.title = palate[1]
  color.title = palate[1]

  font.title <- "Source Sans Pro"
  font.axis <- "Open Sans Condensed Bold"

  theme_bw(base_size=5) +
    theme(panel.background=element_rect(fill=color.background, color=color.background)) +
    theme(plot.background=element_rect(fill=color.background, color=color.background)) +
    theme(panel.border=element_rect(color=color.background)) +
    theme(panel.grid.major=element_blank()) +
    theme(panel.grid.minor=element_blank()) +
    theme(axis.ticks=element_blank()) +
    theme(legend.background = element_rect(fill=color.background)) +
    theme(legend.text = element_text(size=3,colour=color.axis.title,family=font.axis)) +
    theme(legend.title = element_blank(), legend.position="top", legend.direction="horizontal") +
    theme(legend.key.width=unit(1, "cm"), legend.key.height=unit(0.25, "cm"), legend.margin=unit(-0.5,"cm")) +
    theme(plot.title=element_text(colour=color.title,family=font.title, size=14)) +
    theme(plot.subtitle = element_text(colour=color.title,family=font.title, size=12)) +
    theme(axis.text.x=element_blank()) +
    theme(axis.text.y=element_blank()) +
    theme(axis.title.y=element_blank()) +
    theme(axis.title.x=element_blank()) +
    theme(strip.background = element_rect(fill=color.background,
                                          color=color.background),
          strip.text=element_text(size=7,colour=color.axis.title,family=font.title))

}

Plot Function

Complete the Function Below


In [ ]:
# x axis should be longitude
# y axis should be latitude
map_nyc <- function(df, pickup_hr) {

  gplot <- ggplot(df,
                  aes(x=...,
                      y=...)) +
    geom_point(color="white", size=0.06) +
    scale_x_continuous(limits=c(min_long, max_long)) +
    scale_y_continuous(limits=c(min_lat, max_lat)) +
    theme_map_dark() +
    labs(title = "Map of NYC Taxi Pickups",
         subtitle = paste0("Pickups between ", pickup_hr))

  return(gplot)

}
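
One possible completion of the exercise, following the axis comments above (a sketch; try it yourself before peeking):

In [ ]:
# per the comments: x is longitude, y is latitude
map_nyc <- function(df, pickup_hr) {

  gplot <- ggplot(df,
                  aes(x = pickup_longitude,
                      y = pickup_latitude)) +
    geom_point(color = "white", size = 0.06) +
    scale_x_continuous(limits = c(min_long, max_long)) +
    scale_y_continuous(limits = c(min_lat, max_lat)) +
    theme_map_dark() +
    labs(title = "Map of NYC Taxi Pickups",
         subtitle = paste0("Pickups between ", pickup_hr))

  return(gplot)

}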

Iterate and Plot!

Now we can Iterate and Plot


In [ ]:
hour_plots <- ungroup(pickups) %>%
  filter(num_pickups > 1) %>%
  split(.$pickup_hour) %>%
  map(~ map_nyc(.x, pickup_hr = .x$pickup_hour[1]))

hour_plots

Summary

mutate

: Create transformations

summarise

: Aggregate

group_by

: Group your dataset by levels

do

: Evaluate complex operations on a tbl

Chaining with the %>% operator can result in more readable code.

What We Didn't Cover

  • There are many additional topics that fit well into the dplyr and functional programming landscape
  • There are too many to cover in one session. Fortunately, most are well documented. The most notable omissions:
    1. Connecting to remote databases, see vignette('databases', package = 'dplyr')
    2. Merging and Joins, see vignette('two-table', package = 'dplyr')
    3. Programming with dplyr, see vignette('nse', package = 'dplyr')
    4. summarize_each and mutate_each

Thanks for Attending!

  • Any questions?