In [4]:
knitr::opts_chunk$set(warning=FALSE, message=FALSE, fig.align = 'center')

In [5]:
options(jupiter.rich_display=FALSE)

Course Logistics

Day One

R U Ready?

  • Overview of The R Project for Statistical Computing
  • The Microsoft R Family
  • R's capabilities and its limitations
  • What types of problems R might be useful for
  • How to manage data with the exceptionally popular open source package dplyr
  • How to develop models and write functions in R

Day Two

Scalable Data Analysis with Microsoft R

  • Moving the compute to your data
  • WODA - Write Once, Deploy Anywhere
  • High Performance Analytics
  • High Performance Computing
  • Machine Learning with Microsoft R

Day Three

Distributing Computing on Spark Clusters with R

  • Overview of the Apache Spark Project
  • Taming the Hadoop Zoo with HDInsight
  • Provisioing and Managing HDInsight Clusters
  • Spark DataFrames, SparkR, and the sparklyr package
  • Developing Machine Learning Pipelines with Spark and Microsoft R

Prerequisites

Computing Environments

Development Environments

Where to Write R Code

  • The most popular integrated development environment for R is RStudio
  • The RStudio IDE is entirely html/javascript based, so completely cross-platform
  • RStudio Server provides a full IDE in your browser: perfect for cloud instances
  • For Windows machines, we have recently announced the general availability of R Tools for Visual Studio, RTVS
  • RTVS supports connectivity to Azure and SQL Server

What is R?

Why should I care?

  • R is the successor to the S Language, originated at Bell Labs AT&T
  • It is based on the Scheme interpreter
  • Originally designed by two University of Auckland Professors for their introductory statistics course

R's Philosophy

What R Thou?

R follows the Unix philosophy

  • Write programs that do one thing and do it well (modularity)
  • Write programs that work together (cohesiveness)
  • R is extensible with more than 10,000 packages available at CRAN (http://crantastic.org/packages)

The aRt of Being Lazy

Lazy Evaluation in R

  • R, like it's inspiration, Scheme, is a functional programming language
  • R evaluates lazily, delaying evaluation until necessary, which can make it very flexible
  • R is a highly interpreted dynamically typed language, allowing you to mutate variables and analyze datasets quickly, but is significantly slower than low-level, statically typed languages like C or Java
  • R has a high memory footprint, and can easily lead to crashes if you aren't careful

R's Programming Paradigm

Keys to R

Everything that exist in R is an *object*
Everything that happens in R is a *function call*
R was born to *interface*

_—John Chambers_

Strengths of R

Where R Succeeds

  • Expressive
  • Open source
  • Extendable -- nearly 10,000 packages with functions to use, and that list continues to grow
  • Focused on statistics and machine learning -- cutting-edge algorithms and powerful data manipulation packages
  • Advanced data structures and graphical capabilities
  • Large user community, both within academia and industry
  • It is designed by statisticians

Weaknesses of R

Where R Falls Short

  • It is designed by statisticians
  • Inefficient at element-by-element computations
  • May make large demands on system resources, namely memory
  • Data capacity limited by memory
  • Single-threaded

Some Essential Open Source Packages

  • There are over 10,000 R packages to choose from, what do I start with?
  • Data Management: dplyr, tidyr, data.table
  • Visualization: ggplot2, ggvis, htmlwidgets, shiny
  • Data Importing: haven, RODBC, readr, foreign
  • Other favorites: magrittr, rmarkdown, caret

R Foundations

Command line prompts

Symbol Meaning
<- assignment operator
> ready for a new command
+ awaiting the completion of an existing command
? get help for following function

Can change options either permanently at startup (see ?Startup) or manually at each session with the options function, options(repos = " ") for example.

Check your CRAN mirror with getOption("repos").

I'm Lost!

Getting Help for R

Quick Tour of Things You Need to Know

Data Structures

"Bad programmers worry about the code. Good programmers worry about data structures and their relationships."

  • Linus Torvalds
  • R's data structures can be described by their dimensionality, and their type.
Homogeneous Heterogeneous
1d Atomic vector List
2d Matrix Data frame
nd Array

Quick Tour of Things You Need to Know

Data Types

  • Atomic vectors come in one of four types
  • logical (boolean). Values: TRUE | FALSE
  • integer
  • double (often called numeric)
  • character
  • Rare types:
  • complex
  • raw

Manipulating Data Structures

Subsetting Operators

  • To create a vector, use c: c(1, 4, 1, 3)
  • To create a list, use list: list(1, 'hi', data.frame(1:10, letters[1:10]))
  • To subset a vector or list, use the [ ]
    • inside the brackets:
      • positive integer vectors for indices you want
      • negative integer vectors for indices you want to drop
      • logical vectors for indices you want to keep/drop (TRUE/FALSE)
      • character vectors for named vectors corresponding to which named elements you want to keep
      • subsetting a list with a single square bracket always returns a list
  • To subset a list and get back just that element, use [[ ]]

Object Representation

  • To find the type of an object, use class (higher level representation)
  • To find how the object is stored in memory, use typeof (lower level representation)
  • Good time to do Lab 0!

Data Manipulation with the dplyr Package

Overview

Rather than describing the nitty gritty details of writing R code, I'd like you to get started at immediately writing R code.

As most of you are data scientists/data enthusiasts, I will showcase one of the most useful data manipulation packages in R, dplyr. At the end of this session, you will have learned:

  • How to manipulate data quickly with dplyr using a very intuitive "grammar"
  • How to use dplyr to perform common exploratory analysis data manipulation procedures
  • How to apply your own custom functions to group manipulations dplyr with mutate(), summarise() and do()
  • Connect to remote databases to work with larger than memory datasets

Why use dplyr?

The Grammar of Data Manipulation

  • dplyr is currently the most downloaded package from CRAN
  • dplyr makes data manipulation easier by providing a few functions for the most common tasks and procedures
  • dplyr achieves remarkable speed-up gains by using a C++ backend
  • dplyr has multiple backends for working with data stored in various sources: SQLite, MySQL, bigquery, SQL Server, and many more
  • dplyr was inspired to give data manipulation a simple, cohesive grammar (similar philosophy to ggplot - grammar of graphics)
  • dplyr has inspired many new packages, which now adopt it's easy to understand syntax.
  • The recent packages dplyrXdf and SparkR/sparklyr brings much of the same functionality of dplyr to XDFs data and Spark DataFrames

Tidy Data and Happier Coding

Premature Optimization

  • For a dats scientist, the most important parameter to optimize in a data science development cycle is YOUR time
  • It is therefore important to be able to write efficient code, quickly
  • Goals: writing fast code that is: portable, platform invariant, easy to understand, and easy to debug
    • Be serious about CReUse!

Manipulation verbs

filter

: select rows based on matching criteria

slice

: select rows by number

select

: select columns by column names

arrange

: reorder rows by column values

mutate

: add new variables based on transformations of existing variables

transmute

: transform and drop other variables

Aggregation verbs

group_by

: identify grouping variables for calculating groupwise summary statistics

count

: count the number of records per group

summarise | summarize

: calculate one or more summary functions per group, returning one row of results per group (or one for the entire dataset)

NYC Taxi Data

Data for Class

  • The data we will be examining in this module is derived from the NYC Taxi and Limousine Commission
  • Data contains taxi trips in NYC, and includes spatial features (pickup and dropoff neighborhoods), temporal features, and monetary features (fare and tip amounts)
  • The dataset for this module is saved as an rds file in a public facing Azure storage blob
  • An rds file is a compressed, serialized R object
  • Save an object to rds by using the saveRDS function; read an rds object with the readRDS object

Viewing Data

tibble

  • dplyr includes a wrapper called tbl_df that adds an additional class attribute onto data.frames that provides some better data manipulation aesthetics (there's now a dedicated package tibble for this wrapper and it's class)
  • Most noticeable differential between tbl_df and data.frames is the console output: tbl_dfs will only print what the current R console window can display
  • Can change the default setting for number of displayed columns by changing the options parameter: options(dplyr.width = Inf)

In [6]:
library(dplyr)
library(stringr)
taxi_url <- "http://alizaidi.blob.core.windows.net/training/taxi_df.rds"
taxi_df  <- readRDS(gzcon(url(taxi_url)))
(taxi_df <- tbl_df(taxi_df))


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

VendorIDpassenger_counttrip_distanceRateCodeIDstore_and_fwd_flagpayment_typefare_amounttip_amounttolls_amountpickup_hourpickup_dowdropoff_hourdropoff_dowpickup_nhooddropoff_nhoodkSplits
1 1 1.80 1 N 2 9.5 0.00 0.00 6-10 Sat 6-10 Sat Morningside HeightsHamilton Heights A
1 2 0.90 1 N 1 6.5 1.55 0.00 6-10 Sat 6-10 Sat Midtown Midtown A
1 1 0.90 1 N 1 7.0 1.66 0.00 6-10 Sat 6-10 Sat Lower East Side Soho A
1 1 0.30 1 N 2 3.0 0.00 0.00 6-10 Sat 6-10 Sat Financial District Financial District A
2 1 0.96 1 N 1 5.5 1.30 0.00 6-10 Thu 6-10 Thu Chelsea West Village A
2 1 2.01 1 N 1 9.5 2.16 0.00 10-5 Sun 10-5 Sun Upper East Side Harlem A
2 3 3.14 1 N 1 12.5 2.50 0.00 12-4 Sun 12-4 Sun Fort Green Soho A
1 1 0.50 1 N 1 4.0 1.00 0.00 12-4 Sun 12-4 Sun Upper East Side Upper East Side A
2 1 0.67 1 N 1 5.0 1.00 0.00 12-4 Thu 12-4 Thu Upper West Side Upper West Side A
2 1 15.20 2 N 1 52.0 14.33 5.33 12-4 Thu 12-4 Thu NA Clinton A
2 5 2.96 1 N 2 20.5 0.00 0.00 12-4 Thu 12-4 Thu Upper East Side Garment District A
1 1 0.70 1 N 2 6.0 0.00 0.00 9-12 Mon 9-12 Mon Upper East Side Upper East Side A
1 1 2.60 1 N 1 16.0 3.35 0.00 9-12 Thu 9-12 Thu Upper East Side Gramercy A
2 2 0.79 1 N 2 5.0 0.00 0.00 12-4 Wed 12-4 Wed NA NA A
2 1 3.37 1 N 1 18.0 3.60 0.00 12-4 Wed 12-4 Wed Upper East Side Chelsea A
1 3 2.40 1 N 2 11.0 0.00 0.00 6-10 Tue 10-5 Tue East Village Garment District A
1 1 16.30 1 Y 1 45.0 11.57 0.00 6-10 Tue 10-5 Tue NA NA A
1 1 5.70 1 N 1 25.0 5.16 0.00 12-4 Mon 12-4 Mon Midtown NA A
1 1 3.20 1 N 2 16.0 0.00 0.00 12-4 Mon 12-4 Mon Midtown Upper West Side A
1 1 0.70 1 N 2 4.5 0.00 0.00 12-4 Mon 12-4 Mon Upper West Side Harlem A
1 1 1.00 1 N 2 6.0 0.00 0.00 10-5 Tue 10-5 Tue Midtown Upper West Side A
1 1 1.50 1 N 1 8.5 1.00 0.00 10-5 Tue 10-5 Tue West Village East Village A
1 2 5.00 1 N 2 21.5 0.00 0.00 12-4 Mon 12-4 Mon Midtown Jackson Heights A
1 1 2.00 1 N 1 11.5 3.69 0.00 12-4 Mon 12-4 Mon Downtown Fort Green A
1 1 1.40 1 N 1 8.0 1.75 0.00 12-4 Mon 12-4 Mon Upper East Side Upper East Side A
1 1 1.40 1 N 1 8.5 1.85 0.00 12-4 Mon 12-4 Mon Midtown Gramercy A
1 1 1.80 1 N 2 9.5 0.00 0.00 12-4 Mon 12-4 Mon Upper West Side Midtown A
2 1 6.08 1 N 1 18.5 3.70 0.00 6-10 Sun 6-10 Sun Financial District Upper East Side A
2 1 1.96 1 N 1 8.5 2.12 0.00 6-10 Sun 6-10 Sun Chelsea Midtown A
2 6 1.26 1 N 2 7.0 0.00 0.00 6-10 Sun 6-10 Sun Garment District Murray Hill A
2 1 14.21 1 N 2 39.5 0.00 0.00 6-10 Tue 6-10 Tue NA Sunny Side A
2 1 1.79 1 N 2 10.0 0.00 0.00 6-10 Tue 6-10 Tue Upper East Side Upper West Side A
2 1 5.34 1 N 2 20.0 0.00 0.00 6-10 Tue 6-10 Tue Upper West Side Battery Park A
2 1 1.51 1 N 1 8.0 1.86 0.00 6-10 Tue 6-10 Tue Midtown Upper East Side A
2 6 2.92 1 N 1 12.5 1.00 0.00 4-6 Sat 4-6 Sat Upper East Side Midtown A
2 6 1.41 1 N 2 7.5 0.00 0.00 4-6 Sun 4-6 Sun West Village Soho A
2 2 4.91 1 N 1 17.0 3.00 0.00 10-5 Tue 10-5 Tue NA Upper East Side A
2 3 0.92 1 N 2 5.0 0.00 0.00 10-5 Tue 10-5 Tue East Village Gramercy A
2 2 13.31 1 N 2 37.0 0.00 0.00 10-5 Tue 10-5 Tue NA NA A
2 1 3.46 1 N 2 12.5 0.00 0.00 10-5 Tue 10-5 Tue East Village Clinton A
2 5 6.08 1 N 1 20.5 4.36 0.00 10-5 Tue 10-5 Tue Garment District Cobble Hill A
2 1 1.41 1 N 1 7.0 1.66 0.00 10-5 Tue 10-5 Tue Soho East Village A
2 1 0.83 1 N 1 5.5 1.36 0.00 10-5 Fri 10-5 Fri Lower East Side East Village A
2 3 1.44 1 N 1 8.5 1.86 0.00 12-4 Thu 12-4 Thu Upper West Side Clinton A
2 1 1.00 1 N 2 7.5 0.00 0.00 12-4 Thu 12-4 Thu Upper East Side Upper West Side A
2 1 5.72 1 N 1 24.0 4.96 0.00 5-9 Tue 9-12 Tue Financial District Upper East Side A
2 1 8.83 1 N 1 34.0 10.03 5.33 5-9 Tue 9-12 Tue NA Midtown A
2 2 1.38 1 N 1 6.5 1.46 0.00 5-9 Tue 5-9 Tue Upper East Side Yorkville A
2 1 1.61 1 N 1 14.5 3.06 0.00 5-9 Tue 9-12 Tue Gramercy Midtown A
2 1 2.28 1 N 1 15.0 3.16 0.00 5-9 Tue 9-12 Tue Upper East Side Murray Hill A
2 1 1.00 1 N 1 6.0 0.00 0.00 5-9 Tue 5-9 Tue Chelsea Chelsea A
2 1 3.24 1 N 2 13.5 0.00 0.00 10-5 Sat 10-5 Sat NA Fort Green A
2 1 1.13 1 N 1 6.5 1.95 0.00 10-5 Sat 10-5 Sat Midtown Midtown A
2 1 2.27 1 N 2 9.5 0.00 0.00 10-5 Sat 10-5 Sat Upper East Side Harlem A
2 1 1.87 1 N 1 8.5 1.00 0.00 10-5 Sat 10-5 Sat Morningside HeightsHamilton Heights A
2 5 1.57 1 N 2 7.0 0.00 0.00 10-5 Sun 10-5 Sun Midtown Chelsea A
2 2 2.75 1 N 2 12.0 0.00 0.00 10-5 Sun 10-5 Sun Tribeca Garment District A
2 5 9.18 1 N 1 30.5 6.36 0.00 10-5 Sun 10-5 Sun Midtown Park Slope A
2 1 0.79 1 N 1 5.0 1.16 0.00 6-10 Sat 6-10 Sat Upper West Side Upper West Side A
2 1 0.75 1 N 1 4.5 1.00 0.00 6-10 Sat 6-10 Sat Chelsea Chelsea A

In [7]:
class(taxi_df)


  1. 'tbl_df'
  2. 'tbl'
  3. 'data.frame'

Filtering and Reordering Data

Subsetting Data

  • dplyr makes subsetting by rows very easy
  • The filter verb takes conditions for filtering rows based on conditions
  • every dplyr function uses a data.frame/tbl as it's first argument
  • Additional conditions are passed as new arguments (no need to make an insanely complicated expression, split em up!)

Filter


In [8]:
filter(taxi_df,
       dropoff_dow %in% c("Fri", "Sat", "Sun"),
       tip_amount > 1)


VendorIDpassenger_counttrip_distanceRateCodeIDstore_and_fwd_flagpayment_typefare_amounttip_amounttolls_amountpickup_hourpickup_dowdropoff_hourdropoff_dowpickup_nhooddropoff_nhoodkSplits
1 2 0.90 1 N 1 6.5 1.55 0.00 6-10 Sat 6-10 Sat Midtown Midtown A
1 1 0.90 1 N 1 7.0 1.66 0.00 6-10 Sat 6-10 Sat Lower East Side Soho A
2 1 2.01 1 N 1 9.5 2.16 0.00 10-5 Sun 10-5 Sun Upper East Side Harlem A
2 3 3.14 1 N 1 12.5 2.50 0.00 12-4 Sun 12-4 Sun Fort Green Soho A
2 1 6.08 1 N 1 18.5 3.70 0.00 6-10 Sun 6-10 Sun Financial DistrictUpper East Side A
2 1 1.96 1 N 1 8.5 2.12 0.00 6-10 Sun 6-10 Sun Chelsea Midtown A
1 1 0.40 1 N 1 4.0 1.06 0.00 10-5 Sat 10-5 Sat Greenwich Village East Village A
1 2 1.70 1 N 1 10.0 2.00 0.00 12-4 Fri 4-6 Fri Midtown Upper East Side A
1 2 0.60 1 N 1 4.5 1.70 0.00 6-10 Sat 6-10 Sat Tribeca Tribeca A
1 2 9.30 1 N 1 28.5 5.86 0.00 6-10 Sat 6-10 Sat NA Upper East Side A
1 2 5.20 1 N 1 17.5 3.85 0.00 6-10 Fri 6-10 Fri Upper East Side Soho A
1 1 8.50 1 N 1 24.5 6.30 5.33 6-10 Fri 6-10 Fri NA Upper East Side A
1 1 2.20 1 N 1 15.5 4.07 0.00 6-10 Sat 6-10 Sat Garment District East Village A
1 1 5.30 1 N 1 20.5 6.39 0.00 6-10 Sat 6-10 Sat Financial DistrictMidtown A
2 1 1.01 1 N 1 5.5 1.58 0.00 4-6 Sun 4-6 Sun Yorkville Upper East Side A
1 1 0.40 1 N 1 4.5 1.58 0.00 6-10 Fri 6-10 Fri East Village East Village A
1 1 7.70 1 N 1 22.0 5.83 5.33 6-10 Fri 6-10 Fri NA Harlem A
2 1 1.01 1 N 1 7.0 1.95 0.00 5-9 Fri 5-9 Fri Gramercy Murray Hill A
2 1 0.79 1 N 1 8.0 2.20 0.00 5-9 Fri 5-9 Fri Midtown Midtown A
1 1 0.50 1 N 1 4.0 1.20 0.00 5-9 Sat 5-9 Sat Upper East Side Upper East Side A
1 1 2.90 1 N 1 10.5 2.25 0.00 5-9 Sat 5-9 Sat Greenwich Village Upper East Side A
2 1 1.16 1 N 1 6.0 1.82 0.00 10-5 Fri 10-5 Fri Financial DistrictBattery Park A
1 1 0.60 1 N 1 5.5 1.35 0.00 6-10 Sat 6-10 Sat Lower East Side East Village A
2 1 1.09 1 N 1 14.0 2.80 0.00 6-10 Sat 6-10 Sat Greenwich Village East Village A
1 1 2.40 1 N 1 12.5 2.00 0.00 4-6 Fri 4-6 Fri Midtown Chelsea A
2 1 1.28 1 N 1 5.5 1.20 0.00 6-10 Sun 6-10 Sun Midtown Upper East Side A
2 1 5.08 1 N 1 17.5 3.60 0.00 6-10 Sun 6-10 Sun East Village Upper West Side A
2 5 1.29 1 N 1 7.0 1.88 0.00 6-10 Sun 6-10 Sun Upper East Side Upper East Side A
2 1 2.83 1 N 1 10.0 2.10 0.00 6-10 Sun 6-10 Sun Upper East Side Gramercy A
1 1 2.50 1 N 1 10.5 2.00 0.00 6-10 Sat 10-5 Sat Midtown Upper East Side A
2 5 2.36 1 N 1 10.0 3.24 0 5-9 Fri 5-9 Fri Gramercy Tribeca A
2 1 1.24 1 N 1 6.0 1.46 0 10-5 Sat 10-5 Sat Garment District Midtown A
2 2 1.24 1 N 1 7.0 1.66 0 6-10 Sat 6-10 Sat Gramercy Midtown A
2 1 2.63 1 N 1 11.0 2.46 0 6-10 Sat 6-10 Sat Garment District Upper East Side A
2 5 3.14 1 N 1 20.0 4.16 0 5-9 Fri 5-9 Fri Sunny Side Upper East Side A
2 1 2.25 1 N 1 11.0 3.54 0 5-9 Fri 5-9 Fri Upper East Side Upper West Side A
2 1 0.71 1 N 1 8.5 1.86 0 5-9 Fri 5-9 Fri Upper East Side Upper East Side A
2 1 3.04 1 N 1 10.0 2.26 0 6-10 Sat 6-10 Sat Financial DistrictLower East Side A
2 3 10.36 1 N 1 31.0 4.00 0 6-10 Sat 6-10 Sat Chelsea Inwood A
2 2 1.56 1 N 1 8.5 1.50 0 12-4 Sat 12-4 Sat Upper East Side Upper West Side A
2 2 2.04 1 N 1 8.5 1.86 0 4-6 Sun 4-6 Sun Upper East Side Yorkville A
2 1 7.44 1 N 1 27.5 5.66 0 6-10 Sat 6-10 Sat Chelsea Bushwick A
2 1 1.96 1 N 1 9.5 2.06 0 12-4 Sat 12-4 Sat Financial DistrictGreenwich Village A
2 3 3.30 1 N 1 13.5 3.58 0 12-4 Sat 4-6 Sat Garment District Upper West Side A
2 1 1.32 1 N 1 9.0 1.96 0 12-4 Sat 4-6 Sat Midtown Garment District A
2 1 2.85 1 N 1 12.0 2.56 0 12-4 Sat 4-6 Sat Upper East Side Midtown A
2 2 0.83 1 N 1 6.0 1.36 0 12-4 Sat 12-4 Sat Clinton Garment District A
2 1 2.96 1 N 1 12.0 2.00 0 4-6 Sun 4-6 Sun Yorkville North Sutton Area A
2 1 2.64 1 N 1 11.5 2.46 0 12-4 Sat 12-4 Sat Garment District Central Park A
2 5 1.40 1 N 1 7.5 1.66 0 9-12 Sat 9-12 Sat Garment District Midtown A
2 1 0.59 1 N 1 4.5 1.32 0 9-12 Sun 9-12 Sun Chelsea Chelsea A
2 1 1.34 1 N 1 7.5 1.76 0 10-5 Fri 10-5 Sat East Village Greenwich Village A
2 1 4.62 1 N 1 20.0 4.26 0 10-5 Fri 10-5 Sat Soho Bushwick A
2 2 0.76 1 N 1 4.5 1.16 0 10-5 Sat 10-5 Sat Yorkville Yorkville A
2 1 7.03 1 N 1 24.0 1.70 0 10-5 Sat 10-5 Sat Little Italy Upper West Side A
2 4 1.25 1 N 1 6.5 1.56 0 10-5 Sat 10-5 Sat Little Italy West Village A
2 1 0.83 1 N 1 5.5 1.36 0 10-5 Fri 10-5 Fri Lower East Side East Village A
2 1 1.13 1 N 1 6.5 1.95 0 10-5 Sat 10-5 Sat Midtown Midtown A
2 5 9.18 1 N 1 30.5 6.36 0 10-5 Sun 10-5 Sun Midtown Park Slope A
2 1 0.79 1 N 1 5.0 1.16 0 6-10 Sat 6-10 Sat Upper West Side Upper West Side A

Exercise

Your turn:

  • How many observations started in Harlem?
    • pick both sides of Harlem, including east harlem
  • How many observations that started in Harlem ended in the Financial District?

Solution


In [9]:
library(stringr)
#table(taxi_df$pickup_nhood)
#harlem_pickups <- filter(taxi_df, str_detect(pickup_nhood, "Harlem")) #str_detect: equivalent of 'Like' in SQL
  harlem_pickups <- filter(taxi_df, pickup_nhood == "Harlem" | pickup_nhood == "East Harlem")
nrow(harlem_pickups)
#findistr_dropoffs <- filter(harlem_pickups, str_detect(dropoff_nhood, "Financial District"))
    findistr_dropoffs <- filter(harlem_pickups, dropoff_nhood == "Financial District")
nrow(findistr_dropoffs)

# or all together (without creating additional objects == memory)
nrow(filter(taxi_df, str_detect(pickup_nhood, "Harlem"), dropoff_nhood == "Financial District"))


32025
99
99

Select a set of columns

  • You can use the select() verb to specify which columns of a dataset you want
  • This is similar to the keep option in SAS's data step.
  • Use a colon : to select all the columns between two variables (inclusive)
  • Use contains to take any columns containing a certain word/phrase/character

Select Example


In [10]:
select(taxi_df, pickup_nhood, dropoff_nhood,
       fare_amount, dropoff_hour, trip_distance)


pickup_nhooddropoff_nhoodfare_amountdropoff_hourtrip_distance
Morningside HeightsHamilton Heights 9.5 6-10 1.80
Midtown Midtown 6.5 6-10 0.90
Lower East Side Soho 7.0 6-10 0.90
Financial District Financial District 3.0 6-10 0.30
Chelsea West Village 5.5 6-10 0.96
Upper East Side Harlem 9.5 10-5 2.01
Fort Green Soho 12.5 12-4 3.14
Upper East Side Upper East Side 4.0 12-4 0.50
Upper West Side Upper West Side 5.0 12-4 0.67
NA Clinton 52.0 12-4 15.20
Upper East Side Garment District 20.5 12-4 2.96
Upper East Side Upper East Side 6.0 9-12 0.70
Upper East Side Gramercy 16.0 9-12 2.60
NA NA 5.0 12-4 0.79
Upper East Side Chelsea 18.0 12-4 3.37
East Village Garment District 11.0 10-5 2.40
NA NA 45.0 10-5 16.30
Midtown NA 25.0 12-4 5.70
Midtown Upper West Side 16.0 12-4 3.20
Upper West Side Harlem 4.5 12-4 0.70
Midtown Upper West Side 6.0 10-5 1.00
West Village East Village 8.5 10-5 1.50
Midtown Jackson Heights 21.5 12-4 5.00
Downtown Fort Green 11.5 12-4 2.00
Upper East Side Upper East Side 8.0 12-4 1.40
Midtown Gramercy 8.5 12-4 1.40
Upper West Side Midtown 9.5 12-4 1.80
Financial District Upper East Side 18.5 6-10 6.08
Chelsea Midtown 8.5 6-10 1.96
Garment District Murray Hill 7.0 6-10 1.26
NA Sunny Side 39.5 6-10 14.21
Upper East Side Upper West Side 10.0 6-10 1.79
Upper West Side Battery Park 20.0 6-10 5.34
Midtown Upper East Side 8.0 6-10 1.51
Upper East Side Midtown 12.5 4-6 2.92
West Village Soho 7.5 4-6 1.41
NA Upper East Side 17.0 10-5 4.91
East Village Gramercy 5.0 10-5 0.92
NA NA 37.0 10-5 13.31
East Village Clinton 12.5 10-5 3.46
Garment District Cobble Hill 20.5 10-5 6.08
Soho East Village 7.0 10-5 1.41
Lower East Side East Village 5.5 10-5 0.83
Upper West Side Clinton 8.5 12-4 1.44
Upper East Side Upper West Side 7.5 12-4 1.00
Financial District Upper East Side 24.0 9-12 5.72
NA Midtown 34.0 9-12 8.83
Upper East Side Yorkville 6.5 5-9 1.38
Gramercy Midtown 14.5 9-12 1.61
Upper East Side Murray Hill 15.0 9-12 2.28
Chelsea Chelsea 6.0 5-9 1.00
NA Fort Green 13.5 10-5 3.24
Midtown Midtown 6.5 10-5 1.13
Upper East Side Harlem 9.5 10-5 2.27
Morningside HeightsHamilton Heights 8.5 10-5 1.87
Midtown Chelsea 7.0 10-5 1.57
Tribeca Garment District 12.0 10-5 2.75
Midtown Park Slope 30.5 10-5 9.18
Upper West Side Upper West Side 5.0 6-10 0.79
Chelsea Chelsea 4.5 6-10 0.75

Select: Other Options

starts_with(x, ignore.case = FALSE)

: name starts with x

ends_with(x, ignore.case = FALSE)

: name ends with x

matches(x, ignore.case = FALSE)

: selects all variables whose name matches the regular expression x

num_range("V", 1:5, width = 1)

: selects all variables (numerically) from V1 to V5.

  • You can also use a - to drop variables.

Reordering Data

  • You can reorder your dataset based on conditions using the arrange() verb
  • Use the desc function to sort in descending order rather than ascending order (default)

Arrange


In [11]:
#select(arrange(taxi_df, desc(fare_amount), pickup_nhood),
#       fare_amount, pickup_nhood)

head(select(arrange(taxi_df, desc(fare_amount), pickup_nhood, dropoff_nhood),
       fare_amount, pickup_nhood, dropoff_nhood), 10)


fare_amountpickup_nhooddropoff_nhood
3130.30 Borough Park Borough Park
3130.30 Upper West SideUpper West Side
990.00 NA NA
900.00 Little Italy Little Italy
900.00 NA NA
630.01 Chelsea Murray Hill
600.00 Throggs Neck City Island
500.00 Chinatown Chinatown
500.00 Chinatown Chinatown
500.00 Clinton NA

Exercise

Use arrange() to sort on the basis of tip_amount, dropoff_nhood, and pickup_dow, with descending order for tip amount

Summary

filter

: Extract subsets of rows. See also slice()

select

: Extract subsets of columns. See also rename()

arrange

: Sort your data

Data Aggregations and Transformations

Transformations

  • The mutate() verb can be used to make new columns

In [12]:
#Exercise
head(select(arrange(taxi_df, desc(tip_amount), dropoff_nhood, pickup_nhood), tip_amount, dropoff_nhood, pickup_nhood))


tip_amountdropoff_nhoodpickup_nhood
454.00 Bedford-StuyvesantBedford-Stuyvesant
300.00 Greenwich Village Chelsea
239.07 Clinton Greenwich Village
221.20 Clinton Central Park
202.00 Lower East Side Upper East Side
173.00 NA NA

In [13]:
taxi_df <- mutate(taxi_df, tip_pct = tip_amount/fare_amount)
head(select(taxi_df, tip_pct, fare_amount, tip_amount))
head(transmute(taxi_df, tip_pct = tip_amount/fare_amount))


tip_pctfare_amounttip_amount
0.00000009.5 0.00
0.23846156.5 1.55
0.23714297.0 1.66
0.00000003.0 0.00
0.23636365.5 1.30
0.22736849.5 2.16
tip_pct
0.0000000
0.2384615
0.2371429
0.0000000
0.2363636
0.2273684

In [14]:
str(taxi_df)


Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	3770319 obs. of  17 variables:
 $ VendorID          : chr  "1" "1" "1" "1" ...
 $ passenger_count   : int  1 2 1 1 1 1 3 1 1 1 ...
 $ trip_distance     : num  1.8 0.9 0.9 0.3 0.96 2.01 3.14 0.5 0.67 15.2 ...
 $ RateCodeID        : chr  "1" "1" "1" "1" ...
 $ store_and_fwd_flag: chr  "N" "N" "N" "N" ...
 $ payment_type      : chr  "2" "1" "1" "2" ...
 $ fare_amount       : num  9.5 6.5 7 3 5.5 9.5 12.5 4 5 52 ...
 $ tip_amount        : num  0 1.55 1.66 0 1.3 ...
 $ tolls_amount      : num  0 0 0 0 0 0 0 0 0 5.33 ...
 $ pickup_hour       : chr  "6-10" "6-10" "6-10" "6-10" ...
 $ pickup_dow        : chr  "Sat" "Sat" "Sat" "Sat" ...
 $ dropoff_hour      : chr  "6-10" "6-10" "6-10" "6-10" ...
 $ dropoff_dow       : chr  "Sat" "Sat" "Sat" "Sat" ...
 $ pickup_nhood      : chr  "Morningside Heights" "Midtown" "Lower East Side" "Financial District" ...
 $ dropoff_nhood     : chr  "Hamilton Heights" "Midtown" "Soho" "Financial District" ...
 $ kSplits           : chr  "A" "A" "A" "A" ...
 $ tip_pct           : num  0 0.238 0.237 0 0.236 ...

Summarise Data by Groups

  • The group_by verb creates a grouping by a categorical variable
  • Functions can be placed inside summarise to create summary functions

In [15]:
class(taxi_df)


  1. 'tbl_df'
  2. 'tbl'
  3. 'data.frame'

In [16]:
grouped_taxi <- group_by(taxi_df, dropoff_nhood)
class(grouped_taxi)
head(grouped_taxi)#grouping is not visible, but the class knows about the grouping


  1. 'grouped_df'
  2. 'tbl_df'
  3. 'tbl'
  4. 'data.frame'
VendorIDpassenger_counttrip_distanceRateCodeIDstore_and_fwd_flagpayment_typefare_amounttip_amounttolls_amountpickup_hourpickup_dowdropoff_hourdropoff_dowpickup_nhooddropoff_nhoodkSplitstip_pct
1 1 1.80 1 N 2 9.5 0.00 0 6-10 Sat 6-10 Sat Morningside HeightsHamilton Heights A 0.0000000
1 2 0.90 1 N 1 6.5 1.55 0 6-10 Sat 6-10 Sat Midtown Midtown A 0.2384615
1 1 0.90 1 N 1 7.0 1.66 0 6-10 Sat 6-10 Sat Lower East Side Soho A 0.2371429
1 1 0.30 1 N 2 3.0 0.00 0 6-10 Sat 6-10 Sat Financial District Financial District A 0.0000000
2 1 0.96 1 N 1 5.5 1.30 0 6-10 Thu 6-10 Thu Chelsea West Village A 0.2363636
2 1 2.01 1 N 1 9.5 2.16 0 10-5 Sun 10-5 Sun Upper East Side Harlem A 0.2273684

In [17]:
nrow(summarize(group_by(taxi_df, dropoff_nhood),
          Num = n(), ave_tip_pct = mean(tip_pct)))

nrow(summarize(grouped_taxi, Num=n(), ave_tip_pct=mean(tip_pct)))  #calculated above


122
122

Group By Neighborhoods Example


In [18]:
summarise(group_by(taxi_df, pickup_nhood, dropoff_nhood),
          Num = n(), ave_tip_pct = mean(tip_pct))


pickup_nhooddropoff_nhoodNumave_tip_pct
Ardon Heights Ardon Heights 1 0.00000000
Astoria-Long Island CityAstoria-Long Island City6714 0.09712743
Astoria-Long Island CityAuburndale 13 0.06520367
Astoria-Long Island CityBattery Park 14 0.10140865
Astoria-Long Island CityBay Ridge 5 0.15388657
Astoria-Long Island CityBedford Park 3 0.05550679
Astoria-Long Island CityBedford-Stuyvesant 68 0.09819024
Astoria-Long Island CityBensonhurst 2 0.00000000
Astoria-Long Island CityBoerum Hill 12 0.09941085
Astoria-Long Island CityBorough Park 9 0.08205749
Astoria-Long Island CityBrownsville 4 0.10376598
Astoria-Long Island CityBushwick 53 0.05616446
Astoria-Long Island CityCanarsie 3 0.08080808
Astoria-Long Island CityCarnegie Hill 29 0.10760679
Astoria-Long Island CityCarroll Gardens 19 0.15358826
Astoria-Long Island CityCentral Park 72 0.12755575
Astoria-Long Island CityChelsea 224 0.11327865
Astoria-Long Island CityChinatown 8 0.09015545
Astoria-Long Island CityClearview 7 0.02959184
Astoria-Long Island CityClinton 163 0.11348404
Astoria-Long Island CityCobble Hill 2 0.10235294
Astoria-Long Island CityDouglastown-Little Neck 2 0.00000000
Astoria-Long Island CityDowntown 22 0.13353090
Astoria-Long Island CityDyker Heights 1 0.00000000
Astoria-Long Island CityEast Brooklyn 8 0.04740602
Astoria-Long Island CityEast Harlem 79 0.07145051
Astoria-Long Island CityEast Village 171 0.14311855
Astoria-Long Island CityFinancial District 117 0.13240868
Astoria-Long Island CityFlushing 54 0.08736876
Astoria-Long Island CityFordham 5 0.04092308
NA Soundview 114 0.02754077
NA South Beach 15 0.11433528
NA South Bronx 236 0.05923333
NA Springfield Gardens 655 0.06109878
NA Spuyten Duyvil 131 0.11685827
NA Sunny Side 2678 0.12713336
NA Sunset Park 288 0.11079174
NA The Rockaways 259 0.06733444
NA Throggs Neck 80 0.08751933
NA Todt Hill 12 0.03739845
NA Tottensville 2 0.14492754
NA Tremont 119 0.02535450
NA Tribeca 1931 0.16457058
NA Union Port 64 0.04806241
NA University Heights 82 0.03717917
NA Upper East Side 10544 0.16246302
NA Upper West Side 7967 0.16120439
NA Utopia 677 0.06395129
NA Wakefield-Williamsbridge 107 0.04609067
NA Washington Heights 1241 0.13664189
NA West Village 1726 0.16814002
NA Westerleigh-Castleton 11 0.04190773
NA Whitestone 1 0.00000000
NA Williams Bridge 65 0.06279349
NA Williamsburg 5111 0.14403349
NA Woodhaven-Richmond Hill 480 0.03536936
NA Woodlawn-Nordwood 24 0.08135023
NA Woodside 765 0.08492267
NA Yorkville 957 0.15686904
NA NA 27834 0.47254372

Chaining/Piping

  • A dplyr installation includes the magrittr package as a dependency
  • The magrittr package includes a pipe operator that allows you to pass the current dataset to another function
  • This makes interpreting a nested sequence of operations much easier to understand

Standard Code

  • Code is executed inside-out.
  • Let's arrange the above average tips in descending order, and only look at the locations that had at least 10 dropoffs and pickups.

In [19]:
filter(arrange(summarise(group_by(taxi_df, pickup_nhood, dropoff_nhood), Num = n(), ave_tip_pct = mean(tip_pct)), desc(ave_tip_pct)), Num >= 10)


pickup_nhooddropoff_nhoodNumave_tip_pct
Kings Bridge Kings Bridge 11 1.2256871
Upper East Side Brownsville 23 0.9447013
Bay Ridge Bay Ridge 60 0.7915420
The Rockaways The Rockaways 33 0.7828405
Gravesend-Sheepshead BayGravesend-Sheepshead Bay 80 0.5122171
NA NA 27834 0.4725437
Bensonhurst Bensonhurst 59 0.4512018
Upper East Side Spuyten Duyvil 65 0.3463543
Soho Mott Haven 12 0.3239580
Bedford Park Bedford Park 30 0.3078915
Midtown Westerleigh-Castleton 10 0.3017644
Bay Ridge Gravesend-Sheepshead Bay 10 0.2951905
Clinton Bay Ridge 90 0.2893189
Midtown South Beach 17 0.2473338
Battery Park Greenwood 15 0.2396805
Fordham Fordham 15 0.2326797
Bedford-Stuyvesant Bedford-Stuyvesant 936 0.2293516
Greenwood Financial District 11 0.2273952
Carnegie Hill Carroll Gardens 13 0.2223942
Lower East Side Spuyten Duyvil 10 0.2105018
Murray Hill Clearview 10 0.2102742
Nkew Gardens Midtown 10 0.2101923
Battery Park Park Slope 134 0.2032499
Hunts Point Hunts Point 13 0.2014353
Washington Heights NA 150 0.1995337
Forest Hills Gramercy 26 0.1990748
Battery Park Battery Park 1145 0.1987076
Jackson Heights Financial District 13 0.1970842
Carroll Gardens West Village 51 0.1966033
Financial District Spuyten Duyvil 10 0.1964487
Yorkville South Bronx 127 0.018011226
Parkchester Parkchester 28 0.017944170
East Harlem Bedford Park 17 0.017535651
Clinton Soundview 27 0.017476971
Sunny Side Woodhaven-Richmond Hill 13 0.017094017
Yorkville Woodlawn-Nordwood 12 0.016944444
Williamsburg Brownsville 11 0.016161616
Greenwich Village Soundview 21 0.015615496
Brownsville Bedford-Stuyvesant 14 0.015566502
Harlem Morris Heights 31 0.015263697
South Bronx East Harlem 12 0.015151515
Brownsville East Brooklyn 12 0.014800000
Mott Haven Tremont 18 0.014422658
Jamaica Utopia 11 0.013722127
West Village High Bridge 15 0.013529412
Fordham Tremont 12 0.013333333
Utopia Jamaica 21 0.012786596
Harlem Parkchester 23 0.012385605
Garment District East Brooklyn 37 0.012137732
Harlem Wakefield-Williamsbridge 11 0.006060606
East Harlem Hunts Point 25 0.005934066
High Bridge Mott Haven 43 0.004651163
Woodhaven-Richmond Hill Jamaica 16 0.004166667
Yorkville Soundview 26 0.003870043
Hamilton Heights Morris Heights 16 0.000000000
Jamaica Woodhaven-Richmond Hill 13 0.000000000
Mott Haven Soundview 16 0.000000000
Mott Haven University Heights 10 0.000000000
Tremont Fordham 11 0.000000000
University Heights Inwood 11 0.000000000

Reformatted


In [20]:
filter(
  arrange(
    summarise(
      group_by(taxi_df,
               pickup_nhood, dropoff_nhood),
      Num = n(),
      ave_tip_pct = mean(tip_pct)),
    desc(ave_tip_pct)),
  Num >= 10)


pickup_nhooddropoff_nhoodNumave_tip_pct
Kings Bridge Kings Bridge 11 1.2256871
Upper East Side Brownsville 23 0.9447013
Bay Ridge Bay Ridge 60 0.7915420
The Rockaways The Rockaways 33 0.7828405
Gravesend-Sheepshead BayGravesend-Sheepshead Bay 80 0.5122171
NA NA 27834 0.4725437
Bensonhurst Bensonhurst 59 0.4512018
Upper East Side Spuyten Duyvil 65 0.3463543
Soho Mott Haven 12 0.3239580
Bedford Park Bedford Park 30 0.3078915
Midtown Westerleigh-Castleton 10 0.3017644
Bay Ridge Gravesend-Sheepshead Bay 10 0.2951905
Clinton Bay Ridge 90 0.2893189
Midtown South Beach 17 0.2473338
Battery Park Greenwood 15 0.2396805
Fordham Fordham 15 0.2326797
Bedford-Stuyvesant Bedford-Stuyvesant 936 0.2293516
Greenwood Financial District 11 0.2273952
Carnegie Hill Carroll Gardens 13 0.2223942
Lower East Side Spuyten Duyvil 10 0.2105018
Murray Hill Clearview 10 0.2102742
Nkew Gardens Midtown 10 0.2101923
Battery Park Park Slope 134 0.2032499
Hunts Point Hunts Point 13 0.2014353
Washington Heights NA 150 0.1995337
Forest Hills Gramercy 26 0.1990748
Battery Park Battery Park 1145 0.1987076
Jackson Heights Financial District 13 0.1970842
Carroll Gardens West Village 51 0.1966033
Financial District Spuyten Duyvil 10 0.1964487
Yorkville South Bronx 127 0.018011226
Parkchester Parkchester 28 0.017944170
East Harlem Bedford Park 17 0.017535651
Clinton Soundview 27 0.017476971
Sunny Side Woodhaven-Richmond Hill 13 0.017094017
Yorkville Woodlawn-Nordwood 12 0.016944444
Williamsburg Brownsville 11 0.016161616
Greenwich Village Soundview 21 0.015615496
Brownsville Bedford-Stuyvesant 14 0.015566502
Harlem Morris Heights 31 0.015263697
South Bronx East Harlem 12 0.015151515
Brownsville East Brooklyn 12 0.014800000
Mott Haven Tremont 18 0.014422658
Jamaica Utopia 11 0.013722127
West Village High Bridge 15 0.013529412
Fordham Tremont 12 0.013333333
Utopia Jamaica 21 0.012786596
Harlem Parkchester 23 0.012385605
Garment District East Brooklyn 37 0.012137732
Harlem Wakefield-Williamsbridge 11 0.006060606
East Harlem Hunts Point 25 0.005934066
High Bridge Mott Haven 43 0.004651163
Woodhaven-Richmond Hill Jamaica 16 0.004166667
Yorkville Soundview 26 0.003870043
Hamilton Heights Morris Heights 16 0.000000000
Jamaica Woodhaven-Richmond Hill 13 0.000000000
Mott Haven Soundview 16 0.000000000
Mott Haven University Heights 10 0.000000000
Tremont Fordham 11 0.000000000
University Heights Inwood 11 0.000000000

Magrittr

  • Inspired by unix |, and F# forward pipe |>, magrittr introduces the funny character (%>%, the then operator)
  • %>% pipes the object on the left hand side to the first argument of the function on the right hand side
  • Every function in dplyr has a slot for data.frame/tbl as it's first argument, so this works beautifully!

Put that Function in Your Pipe and...


In [21]:
taxi_df %>%
  group_by(pickup_nhood, dropoff_nhood) %>%
  summarize(Num = n(),
            ave_tip_pct = mean(tip_pct)) %>%
  arrange(desc(ave_tip_pct)) %>%
  filter(Num >= 10)


pickup_nhooddropoff_nhoodNumave_tip_pct
Kings Bridge Kings Bridge 11 1.2256871
Upper East Side Brownsville 23 0.9447013
Bay Ridge Bay Ridge 60 0.7915420
The Rockaways The Rockaways 33 0.7828405
Gravesend-Sheepshead BayGravesend-Sheepshead Bay 80 0.5122171
NA NA 27834 0.4725437
Bensonhurst Bensonhurst 59 0.4512018
Upper East Side Spuyten Duyvil 65 0.3463543
Soho Mott Haven 12 0.3239580
Bedford Park Bedford Park 30 0.3078915
Midtown Westerleigh-Castleton 10 0.3017644
Bay Ridge Gravesend-Sheepshead Bay 10 0.2951905
Clinton Bay Ridge 90 0.2893189
Midtown South Beach 17 0.2473338
Battery Park Greenwood 15 0.2396805
Fordham Fordham 15 0.2326797
Bedford-Stuyvesant Bedford-Stuyvesant 936 0.2293516
Greenwood Financial District 11 0.2273952
Carnegie Hill Carroll Gardens 13 0.2223942
Lower East Side Spuyten Duyvil 10 0.2105018
Murray Hill Clearview 10 0.2102742
Nkew Gardens Midtown 10 0.2101923
Battery Park Park Slope 134 0.2032499
Hunts Point Hunts Point 13 0.2014353
Washington Heights NA 150 0.1995337
Forest Hills Gramercy 26 0.1990748
Battery Park Battery Park 1145 0.1987076
Jackson Heights Financial District 13 0.1970842
Carroll Gardens West Village 51 0.1966033
Financial District Spuyten Duyvil 10 0.1964487
Yorkville South Bronx 127 0.018011226
Parkchester Parkchester 28 0.017944170
East Harlem Bedford Park 17 0.017535651
Clinton Soundview 27 0.017476971
Sunny Side Woodhaven-Richmond Hill 13 0.017094017
Yorkville Woodlawn-Nordwood 12 0.016944444
Williamsburg Brownsville 11 0.016161616
Greenwich Village Soundview 21 0.015615496
Brownsville Bedford-Stuyvesant 14 0.015566502
Harlem Morris Heights 31 0.015263697
South Bronx East Harlem 12 0.015151515
Brownsville East Brooklyn 12 0.014800000
Mott Haven Tremont 18 0.014422658
Jamaica Utopia 11 0.013722127
West Village High Bridge 15 0.013529412
Fordham Tremont 12 0.013333333
Utopia Jamaica 21 0.012786596
Harlem Parkchester 23 0.012385605
Garment District East Brooklyn 37 0.012137732
Harlem Wakefield-Williamsbridge 11 0.006060606
East Harlem Hunts Point 25 0.005934066
High Bridge Mott Haven 43 0.004651163
Woodhaven-Richmond Hill Jamaica 16 0.004166667
Yorkville Soundview 26 0.003870043
Hamilton Heights Morris Heights 16 0.000000000
Jamaica Woodhaven-Richmond Hill 13 0.000000000
Mott Haven Soundview 16 0.000000000
Mott Haven University Heights 10 0.000000000
Tremont Fordham 11 0.000000000
University Heights Inwood 11 0.000000000

Pipe + group_by()

  • The pipe operator is very helpful for group by summaries
  • Let's calculate average tip amount, and average trip distance, controlling for dropoff day of the week and dropoff location
  • First filter with the vector manhattan_hoods


In [22]:
mht_url <- "http://alizaidi.blob.core.windows.net/training/manhattan.rds"
manhattan_hoods <- readRDS(gzcon(url(mht_url)))
taxi_df %>%
  filter(pickup_nhood %in% manhattan_hoods,
         dropoff_nhood %in% manhattan_hoods) %>%
  group_by(dropoff_nhood, pickup_nhood) %>%
  summarize(ave_tip = mean(tip_pct),
            ave_dist = mean(trip_distance)) %>%
  filter(ave_dist > 3, ave_tip > 0.05)


dropoff_nhoodpickup_nhoodave_tipave_dist
Battery Park Central Park 0.12694563 6.149281
Battery Park Clinton 0.11996579 4.016902
Battery Park East Harlem 0.07116177 10.124000
Battery Park East Village 0.14717019 3.537367
Battery Park Garment District 0.13463903 3.965532
Battery Park Gramercy 0.14396885 4.153174
Battery Park Hamilton Heights 0.06770436 8.843571
Battery Park Harlem 0.13829591 9.039286
Battery Park Inwood 0.18235294 11.950000
Battery Park Midtown 0.13428280 5.496734
Battery Park Morningside Heights0.15595534 7.947419
Battery Park Murray Hill 0.14230493 5.457552
Battery Park North Sutton Area 0.13318941 6.866011
Battery Park Upper East Side 0.13112513 7.677720
Battery Park Upper West Side 0.13287111 5.773742
Battery Park Washington Heights 0.12779408 10.206923
Battery Park Yorkville 0.14815491 8.544762
Central Park Battery Park 0.11153630 5.983153
Central Park Chinatown 0.12276618 5.425789
Central Park East Village 0.13408203 4.644099
Central Park Financial District 0.09623395 6.620338
Central Park Gramercy 0.12756278 3.222990
Central Park Greenwich Village 0.12932688 3.919407
Central Park Little Italy 0.11914271 4.660275
Central Park Lower East Side 0.11894056 5.403105
Central Park Soho 0.10882615 4.381541
Central Park Tribeca 0.13259363 5.202881
Central Park Washington Heights 0.09936008 4.700000
Central Park West Village 0.13151131 3.813692
Chelsea Central Park 0.13334925 3.164956
Washington Heights Yorkville 0.08562844 5.209286
West Village Central Park 0.11439411 3.897739
West Village East Harlem 0.12856174 7.203200
West Village Hamilton Heights 0.11837929 7.470263
West Village Harlem 0.10529889 7.113514
West Village Inwood 0.20812500 10.460000
West Village Morningside Heights0.14703183 6.389273
West Village North Sutton Area 0.13483326 3.603191
West Village Upper East Side 0.13367464 4.646157
West Village Upper West Side 0.13723914 3.919866
West Village Washington Heights 0.14651280 8.680526
West Village Yorkville 0.13939565 6.126538
Yorkville Battery Park 0.14375157 8.927209
Yorkville Chelsea 0.12630881 4.988345
Yorkville Chinatown 0.08176344 7.073214
Yorkville Clinton 0.11385074 4.278752
Yorkville East Village 0.12104080 5.233184
Yorkville Financial District 0.11838887 8.146121
Yorkville Garment District 0.10531276 4.162365
Yorkville Gramercy 0.12129103 4.255384
Yorkville Greenwich Village 0.12414755 5.513012
Yorkville Inwood 0.07585581 6.710000
Yorkville Little Italy 0.18410847 5.886102
Yorkville Lower East Side 0.11105280 6.343524
Yorkville Midtown 0.12631612 3.006321
Yorkville Murray Hill 0.12157790 3.433576
Yorkville Soho 0.14101288 6.410496
Yorkville Tribeca 0.13259729 7.757073
Yorkville Washington Heights 0.09903927 4.951351
Yorkville West Village 0.12226814 6.025485

Pipe and Plot

Piping is not limited to dplyr functions, can be used everywhere!


In [23]:
library(ggplot2)
taxi_df %>%
  filter(pickup_nhood %in% manhattan_hoods,
         dropoff_nhood %in% manhattan_hoods) %>%
  group_by(dropoff_nhood, pickup_nhood) %>%
  summarize(ave_tip = mean(tip_pct),
            ave_dist = mean(trip_distance)) %>%
  filter(ave_dist > 3, ave_tip > 0.05) %>%
  ggplot(aes(x = pickup_nhood, y = dropoff_nhood)) +
    geom_tile(aes(fill = ave_tip), colour = "white") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = 'bottom') +
    scale_fill_gradient(low = "white", high = "steelblue")




In [24]:
library(ggplot2)
taxi_df %>%
  filter(pickup_nhood %in% manhattan_hoods,
         dropoff_nhood %in% manhattan_hoods) %>%
  group_by(dropoff_nhood, pickup_nhood) %>%
  summarize(ave_tip = mean(tip_pct),
            ave_dist = mean(trip_distance)) %>%
  filter(ave_dist > 3, ave_tip > 0.05) %>%
  ggplot(aes(x = pickup_nhood, y = dropoff_nhood)) +
    geom_tile(aes(fill = ave_tip), colour = "white") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = 'bottom') +
    scale_fill_gradient(low = "white", high = "steelblue")


Piping to other arguments

  • Although dplyr takes great care to make it particularly amenable to piping, other functions may not reserve the first argument to the object you are passing into it.
  • You can use the special . placeholder to specify where the object should enter

In [25]:
taxi_df %>%
  filter(pickup_nhood %in% manhattan_hoods,
         dropoff_nhood %in% manhattan_hoods) %>%
  group_by(dropoff_nhood, pickup_nhood) %>%
  summarize(ave_tip = mean(tip_pct),
            ave_dist = mean(trip_distance)) %>%
  lm(ave_tip ~ ave_dist, data = .) -> taxi_model         # -> to assign left hand size to the right hand size
summary(taxi_model)

#OR 

#taxi_df %>%
#  filter(pickup_nhood %in% manhattan_hoods,
#         dropoff_nhood %in% manhattan_hoods) %>%
#  group_by(dropoff_nhood, pickup_nhood) %>%
#  summarize(ave_tip = mean(tip_pct),
#            ave_dist = mean(trip_distance)) %>%
#  ungroup() %>%                                       # <-
#  lm(ave_tip ~ ., data = .) -> taxi_model             # <-
#summary(taxi_model)


Call:
lm(formula = ave_tip ~ ave_dist, data = .)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.112258 -0.010882  0.002727  0.014168  0.140976 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.1324307  0.0016071  82.402  < 2e-16 ***
ave_dist    -0.0017345  0.0003004  -5.773 1.15e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.02468 on 724 degrees of freedom
Multiple R-squared:  0.04401,	Adjusted R-squared:  0.04269 
F-statistic: 33.33 on 1 and 724 DF,  p-value: 1.153e-08

Exercise

Your turn:

  • Use the pipe operator to group by day of week and dropoff neighborhood
  • Filter to Manhattan neighborhoods
  • Make tile plot with average fare amount in dollars as the fill

In [26]:
str(manhattan_hoods)


 chr [1:28] "Chinatown" "Little Italy" "Tribeca" "Lower East Side" ...

In [27]:
library(ggplot2)
taxi_df %>%
  filter(dropoff_nhood %in% manhattan_hoods) %>%
  group_by(dropoff_nhood, pickup_dow) %>%
  summarize(ave_fare = mean(fare_amount)) %>%
  ggplot(aes(x = pickup_dow, y = dropoff_nhood)) +
    geom_tile(aes(fill = ave_fare), colour = "white") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = 'bottom') +
    scale_fill_gradient(low = "white", high = "steelblue")


Functional Programming

Creating Functional Pipelines

Too Many Pipes?


Reusable code

  • The examples above create a rather messy pipeline operation
  • Can be very hard to debug
  • The operation is pretty readable, but lacks reusability
  • Since R is a functional language, we benefit by splitting these operations into functions and calling them separately
  • This allows resuability; don't write the same code twice!

Functional Pipelines

Summarization

  • Let's create a function that takes an argument for the data, and applies the summarization by neighborhood to calculate average tip and trip distance


In [28]:
taxi_hood_sum <- function(taxi_data = taxi_df) {

  mht_url <- "http://alizaidi.blob.core.windows.net/training/manhattan.rds"

  manhattan_hoods <- readRDS(gzcon(url(mht_url)))
  taxi_data %>%
    filter(pickup_nhood %in% manhattan_hoods,
           dropoff_nhood %in% manhattan_hoods) %>%
    group_by(dropoff_nhood, pickup_nhood) %>%
    summarize(ave_tip = mean(tip_pct),
              ave_dist = mean(trip_distance)) %>%
    filter(ave_dist > 3, ave_tip > 0.05) -> sum_df

  return(sum_df)

}

Functional Pipelines

Plotting Function

  • We can create a second function for the plot

In [29]:
tile_plot_hood <- function(df = taxi_hood_sum()) {

  library(ggplot2)

  ggplot(data = df, aes(x = pickup_nhood, y = dropoff_nhood)) +
    geom_tile(aes(fill = ave_tip), colour = "white") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = 'bottom') +
    scale_fill_gradient(low = "white", high = "steelblue") -> gplot

  return(gplot)
}

Calling Our Pipeline

  • Now we can create our plot by simply calling our two functions

In [30]:
#library(plotly)

#taxi_hood_sum(taxi_df) %>% tile_plot_hood 
#OR
tile_plot_hood()



In [ ]:


In [31]:
library(plotly)

In [32]:
embed_notebook(ggplotly(tile_plot_hood()))


Let's make that baby interactive.

Creating Complex Pipelines with do

  • The summarize function is fun, can summarize many numeric/scalar quantities
  • But what if you want multiple values/rows back, not just a scalar summary?
  • Meet the do verb -- arbitrary tbl operations


In [ ]:
taxi_df %>% group_by(dropoff_dow) %>%
  filter(!is.na(dropoff_nhood), !is.na(pickup_nhood)) %>%
  arrange(desc(tip_pct)) %>%
  do(slice(., 1:2)) %>%
  select(dropoff_dow, tip_amount, tip_pct,
         fare_amount, dropoff_nhood, pickup_nhood)

Estimating Multiple Models with do

  • A common use of do is to calculate many different models by a grouping variable

In [ ]:
dow_lms <- taxi_df %>% sample_n(10^4) %>%
  group_by(dropoff_dow) %>%
  do(lm_tip = lm(tip_pct ~ pickup_nhood + passenger_count + pickup_hour,
     data = .))


In [ ]:
dow_lms

Where are our results?

Cleaning Output


In [ ]:
summary(dow_lms$lm_tip[[1]])
library(broom)
dow_lms %>% tidy(lm_tip)
  • By design, every function in dplyr returns a data.frame
  • In the example above, we get back a spooky data.frame with a column of S3 lm objects
  • You can still modify each element as you would normally, or pass it to a mutate function to extract intercept or statistics
  • But there's also a very handy broom package for cleaning up such objects into data.frames

Brooming Up the Mess

Model Metrics


In [ ]:
library(broom)
taxi_df %>% sample_n(10^5) %>%
  group_by(dropoff_dow) %>%
  do(glance(lm(tip_pct ~ pickup_nhood + passenger_count + pickup_hour,
     data = .)))

Model Coefficients

The most commonly used function in the broom package is the tidy function. This will expand our data.frame and give us the model coefficients


In [ ]:
taxi_df %>% sample_n(10^5) %>%
  group_by(dropoff_dow) %>%
  do(tidy(lm(tip_pct ~ pickup_nhood + passenger_count + pickup_hour,
     data = .)))

Summary

mutate

: Create transformations

summarise

: Aggregate

group_by

: Group your dataset by levels

do

: Evaluate complex operations on a tbl

Chaining with the %>% operator can result in more readable code.

What We Didn't Cover

  • There are many additional topics that fit well into the dplyr and functional programming landscape
  • There are too many to cover in one session. Fortunately, most are well documented. The most notable omissions:
    1. Connecting to remote databases, see vignette('databases', package = 'dplyr')
    2. Merging and Joins, see vignette('two-table', package = 'dplyr')
    3. Programming with dplyr,vignette('nse', package = 'dplyr')
    4. summarize_each and mutate_each

Thanks for Attending!

  • Any questions?