In [4]:
knitr::opts_chunk$set(warning=FALSE, message=FALSE, fig.align = 'center')

In [5]:
options(jupiter.rich_display=FALSE)

Course Logistics

Day One

R U Ready?

  • Overview of The R Project for Statistical Computing
  • The Microsoft R Family
  • R's capabilities and its limitations
  • What types of problems R might be useful for
  • How to manage data with the exceptionally popular open source package dplyr
  • How to develop models and write functions in R

Day Two

Scalable Data Analysis with Microsoft R

  • Moving the compute to your data
  • WODA - Write Once, Deploy Anywhere
  • High Performance Analytics
  • High Performance Computing
  • Machine Learning with Microsoft R

Day Three

Distributing Computing on Spark Clusters with R

  • Overview of the Apache Spark Project
  • Taming the Hadoop Zoo with HDInsight
  • Provisioing and Managing HDInsight Clusters
  • Spark DataFrames, SparkR, and the sparklyr package
  • Developing Machine Learning Pipelines with Spark and Microsoft R

Prerequisites

Computing Environments

Development Environments

Where to Write R Code

  • The most popular integrated development environment for R is RStudio
  • The RStudio IDE is entirely html/javascript based, so completely cross-platform
  • RStudio Server provides a full IDE in your browser: perfect for cloud instances
  • For Windows machines, we have recently announced the general availability of R Tools for Visual Studio, RTVS
  • RTVS supports connectivity to Azure and SQL Server

What is R?

Why should I care?

  • R is the successor to the S Language, originated at Bell Labs AT&T
  • It is based on the Scheme interpreter
  • Originally designed by two University of Auckland Professors for their introductory statistics course

R's Philosophy

What R Thou?

R follows the Unix philosophy

  • Write programs that do one thing and do it well (modularity)
  • Write programs that work together (cohesiveness)
  • R is extensible with more than 10,000 packages available at CRAN (http://crantastic.org/packages)

The aRt of Being Lazy

Lazy Evaluation in R

  • R, like it's inspiration, Scheme, is a functional programming language
  • R evaluates lazily, delaying evaluation until necessary, which can make it very flexible
  • R is a highly interpreted dynamically typed language, allowing you to mutate variables and analyze datasets quickly, but is significantly slower than low-level, statically typed languages like C or Java
  • R has a high memory footprint, and can easily lead to crashes if you aren't careful

R's Programming Paradigm

Keys to R

Everything that exist in R is an *object*
Everything that happens in R is a *function call*
R was born to *interface*

_—John Chambers_

Strengths of R

Where R Succeeds

  • Expressive
  • Open source
  • Extendable -- nearly 10,000 packages with functions to use, and that list continues to grow
  • Focused on statistics and machine learning -- cutting-edge algorithms and powerful data manipulation packages
  • Advanced data structures and graphical capabilities
  • Large user community, both within academia and industry
  • It is designed by statisticians

Weaknesses of R

Where R Falls Short

  • It is designed by statisticians
  • Inefficient at element-by-element computations
  • May make large demands on system resources, namely memory
  • Data capacity limited by memory
  • Single-threaded

Some Essential Open Source Packages

  • There are over 10,000 R packages to choose from, what do I start with?
  • Data Management: dplyr, tidyr, data.table
  • Visualization: ggplot2, ggvis, htmlwidgets, shiny
  • Data Importing: haven, RODBC, readr, foreign
  • Other favorites: magrittr, rmarkdown, caret

R Foundations

Command line prompts

Symbol Meaning
<- assignment operator
> ready for a new command
+ awaiting the completion of an existing command
? get help for following function

Can change options either permanently at startup (see ?Startup) or manually at each session with the options function, options(repos = " ") for example.

Check your CRAN mirror with getOption("repos").

I'm Lost!

Getting Help for R

Quick Tour of Things You Need to Know

Data Structures

"Bad programmers worry about the code. Good programmers worry about data structures and their relationships."

  • Linus Torvalds
  • R's data structures can be described by their dimensionality, and their type.
Homogeneous Heterogeneous
1d Atomic vector List
2d Matrix Data frame
nd Array

Quick Tour of Things You Need to Know

Data Types

  • Atomic vectors come in one of four types
  • logical (boolean). Values: TRUE | FALSE
  • integer
  • double (often called numeric)
  • character
  • Rare types:
  • complex
  • raw

Manipulating Data Structures

Subsetting Operators

  • To create a vector, use c: c(1, 4, 1, 3)
  • To create a list, use list: list(1, 'hi', data.frame(1:10, letters[1:10]))
  • To subset a vector or list, use the [ ]
    • inside the brackets:
      • positive integer vectors for indices you want
      • negative integer vectors for indices you want to drop
      • logical vectors for indices you want to keep/drop (TRUE/FALSE)
      • character vectors for named vectors corresponding to which named elements you want to keep
      • subsetting a list with a single square bracket always returns a list
  • To subset a list and get back just that element, use [[ ]]

Object Representation

  • To find the type of an object, use class (higher level representation)
  • To find how the object is stored in memory, use typeof (lower level representation)
  • Good time to do Lab 0!

Data Manipulation with the dplyr Package

Overview

Rather than describing the nitty gritty details of writing R code, I'd like you to get started at immediately writing R code.

As most of you are data scientists/data enthusiasts, I will showcase one of the most useful data manipulation packages in R, dplyr. At the end of this session, you will have learned:

  • How to manipulate data quickly with dplyr using a very intuitive "grammar"
  • How to use dplyr to perform common exploratory analysis data manipulation procedures
  • How to apply your own custom functions to group manipulations dplyr with mutate(), summarise() and do()
  • Connect to remote databases to work with larger than memory datasets

Why use dplyr?

The Grammar of Data Manipulation

  • dplyr is currently the most downloaded package from CRAN
  • dplyr makes data manipulation easier by providing a few functions for the most common tasks and procedures
  • dplyr achieves remarkable speed-up gains by using a C++ backend
  • dplyr has multiple backends for working with data stored in various sources: SQLite, MySQL, bigquery, SQL Server, and many more
  • dplyr was inspired to give data manipulation a simple, cohesive grammar (similar philosophy to ggplot - grammar of graphics)
  • dplyr has inspired many new packages, which now adopt it's easy to understand syntax.
  • The recent packages dplyrXdf and SparkR/sparklyr brings much of the same functionality of dplyr to XDFs data and Spark DataFrames

Tidy Data and Happier Coding

Premature Optimization

  • For a dats scientist, the most important parameter to optimize in a data science development cycle is YOUR time
  • It is therefore important to be able to write efficient code, quickly
  • Goals: writing fast code that is: portable, platform invariant, easy to understand, and easy to debug
    • Be serious about CReUse!

Manipulation verbs

filter

: select rows based on matching criteria

slice

: select rows by number

select

: select columns by column names

arrange

: reorder rows by column values

mutate

: add new variables based on transformations of existing variables

transmute

: transform and drop other variables

Aggregation verbs

group_by

: identify grouping variables for calculating groupwise summary statistics

count

: count the number of records per group

summarise | summarize

: calculate one or more summary functions per group, returning one row of results per group (or one for the entire dataset)

NYC Taxi Data

Data for Class

  • The data we will be examining in this module is derived from the NYC Taxi and Limousine Commission
  • Data contains taxi trips in NYC, and includes spatial features (pickup and dropoff neighborhoods), temporal features, and monetary features (fare and tip amounts)
  • The dataset for this module is saved as an rds file in a public facing Azure storage blob
  • An rds file is a compressed, serialized R object
  • Save an object to rds by using the saveRDS function; read an rds object with the readRDS object

Viewing Data

tibble

  • dplyr includes a wrapper called tbl_df that adds an additional class attribute onto data.frames that provides some better data manipulation aesthetics (there's now a dedicated package tibble for this wrapper and it's class)
  • Most noticeable differential between tbl_df and data.frames is the console output: tbl_dfs will only print what the current R console window can display
  • Can change the default setting for number of displayed columns by changing the options parameter: options(dplyr.width = Inf)

In [6]:
library(dplyr)
library(stringr)
taxi_url <- "http://alizaidi.blob.core.windows.net/training/taxi_df.rds"
taxi_df  <- readRDS(gzcon(url(taxi_url)))
(taxi_df <- tbl_df(taxi_df))


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

VendorIDpassenger_counttrip_distanceRateCodeIDstore_and_fwd_flagpayment_typefare_amounttip_amounttolls_amountpickup_hourpickup_dowdropoff_hourdropoff_dowpickup_nhooddropoff_nhoodkSplits
1 1 1.80 1 N 2 9.5 0.00 0.00 6-10 Sat 6-10 Sat Morningside HeightsHamilton Heights A
1 2 0.90 1 N 1 6.5 1.55 0.00 6-10 Sat 6-10 Sat Midtown Midtown A
1 1 0.90 1 N 1 7.0 1.66 0.00 6-10 Sat 6-10 Sat Lower East Side Soho A
1 1 0.30 1 N 2 3.0 0.00 0.00 6-10 Sat 6-10 Sat Financial District Financial District A
2 1 0.96 1 N 1 5.5 1.30 0.00 6-10 Thu 6-10 Thu Chelsea West Village A
2 1 2.01 1 N 1 9.5 2.16 0.00 10-5 Sun 10-5 Sun Upper East Side Harlem A
2 3 3.14 1 N 1 12.5 2.50 0.00 12-4 Sun 12-4 Sun Fort Green Soho A
1 1 0.50 1 N 1 4.0 1.00 0.00 12-4 Sun 12-4 Sun Upper East Side Upper East Side A
2 1 0.67 1 N 1 5.0 1.00 0.00 12-4 Thu 12-4 Thu Upper West Side Upper West Side A
2 1 15.20 2 N 1 52.0 14.33 5.33 12-4 Thu 12-4 Thu NA Clinton A
2 5 2.96 1 N 2 20.5 0.00 0.00 12-4 Thu 12-4 Thu Upper East Side Garment District A
1 1 0.70 1 N 2 6.0 0.00 0.00 9-12 Mon 9-12 Mon Upper East Side Upper East Side A
1 1 2.60 1 N 1 16.0 3.35 0.00 9-12 Thu 9-12 Thu Upper East Side Gramercy A
2 2 0.79 1 N 2 5.0 0.00 0.00 12-4 Wed 12-4 Wed NA NA A
2 1 3.37 1 N 1 18.0 3.60 0.00 12-4 Wed 12-4 Wed Upper East Side Chelsea A
1 3 2.40 1 N 2 11.0 0.00 0.00 6-10 Tue 10-5 Tue East Village Garment District A
1 1 16.30 1 Y 1 45.0 11.57 0.00 6-10 Tue 10-5 Tue NA NA A
1 1 5.70 1 N 1 25.0 5.16 0.00 12-4 Mon 12-4 Mon Midtown NA A
1 1 3.20 1 N 2 16.0 0.00 0.00 12-4 Mon 12-4 Mon Midtown Upper West Side A
1 1 0.70 1 N 2 4.5 0.00 0.00 12-4 Mon 12-4 Mon Upper West Side Harlem A
1 1 1.00 1 N 2 6.0 0.00 0.00 10-5 Tue 10-5 Tue Midtown Upper West Side A
1 1 1.50 1 N 1 8.5 1.00 0.00 10-5 Tue 10-5 Tue West Village East Village A
1 2 5.00 1 N 2 21.5 0.00 0.00 12-4 Mon 12-4 Mon Midtown Jackson Heights A
1 1 2.00 1 N 1 11.5 3.69 0.00 12-4 Mon 12-4 Mon Downtown Fort Green A
1 1 1.40 1 N 1 8.0 1.75 0.00 12-4 Mon 12-4 Mon Upper East Side Upper East Side A
1 1 1.40 1 N 1 8.5 1.85 0.00 12-4 Mon 12-4 Mon Midtown Gramercy A
1 1 1.80 1 N 2 9.5 0.00 0.00 12-4 Mon 12-4 Mon Upper West Side Midtown A
2 1 6.08 1 N 1 18.5 3.70 0.00 6-10 Sun 6-10 Sun Financial District Upper East Side A
2 1 1.96 1 N 1 8.5 2.12 0.00 6-10 Sun 6-10 Sun Chelsea Midtown A
2 6 1.26 1 N 2 7.0 0.00 0.00 6-10 Sun 6-10 Sun Garment District Murray Hill A
2 1 14.21 1 N 2 39.5 0.00 0.00 6-10 Tue 6-10 Tue NA Sunny Side A
2 1 1.79 1 N 2 10.0 0.00 0.00 6-10 Tue 6-10 Tue Upper East Side Upper West Side A
2 1 5.34 1 N 2 20.0 0.00 0.00 6-10 Tue 6-10 Tue Upper West Side Battery Park A
2 1 1.51 1 N 1 8.0 1.86 0.00 6-10 Tue 6-10 Tue Midtown Upper East Side A
2 6 2.92 1 N 1 12.5 1.00 0.00 4-6 Sat 4-6 Sat Upper East Side Midtown A
2 6 1.41 1 N 2 7.5 0.00 0.00 4-6 Sun 4-6 Sun West Village Soho A
2 2 4.91 1 N 1 17.0 3.00 0.00 10-5 Tue 10-5 Tue NA Upper East Side A
2 3 0.92 1 N 2 5.0 0.00 0.00 10-5 Tue 10-5 Tue East Village Gramercy A
2 2 13.31 1 N 2 37.0 0.00 0.00 10-5 Tue 10-5 Tue NA NA A
2 1 3.46 1 N 2 12.5 0.00 0.00 10-5 Tue 10-5 Tue East Village Clinton A
2 5 6.08 1 N 1 20.5 4.36 0.00 10-5 Tue 10-5 Tue Garment District Cobble Hill A
2 1 1.41 1 N 1 7.0 1.66 0.00 10-5 Tue 10-5 Tue Soho East Village A
2 1 0.83 1 N 1 5.5 1.36 0.00 10-5 Fri 10-5 Fri Lower East Side East Village A
2 3 1.44 1 N 1 8.5 1.86 0.00 12-4 Thu 12-4 Thu Upper West Side Clinton A
2 1 1.00 1 N 2 7.5 0.00 0.00 12-4 Thu 12-4 Thu Upper East Side Upper West Side A
2 1 5.72 1 N 1 24.0 4.96 0.00 5-9 Tue 9-12 Tue Financial District Upper East Side A
2 1 8.83 1 N 1 34.0 10.03 5.33 5-9 Tue 9-12 Tue NA Midtown A
2 2 1.38 1 N 1 6.5 1.46 0.00 5-9 Tue 5-9 Tue Upper East Side Yorkville A
2 1 1.61 1 N 1 14.5 3.06 0.00 5-9 Tue 9-12 Tue Gramercy Midtown A
2 1 2.28 1 N 1 15.0 3.16 0.00 5-9 Tue 9-12 Tue Upper East Side Murray Hill A
2 1 1.00 1 N 1 6.0 0.00 0.00 5-9 Tue 5-9 Tue Chelsea Chelsea A
2 1 3.24 1 N 2 13.5 0.00 0.00 10-5 Sat 10-5 Sat NA Fort Green A
2 1 1.13 1 N 1 6.5 1.95 0.00 10-5 Sat 10-5 Sat Midtown Midtown A
2 1 2.27 1 N 2 9.5 0.00 0.00 10-5 Sat 10-5 Sat Upper East Side Harlem A
2 1 1.87 1 N 1 8.5 1.00 0.00 10-5 Sat 10-5 Sat Morningside HeightsHamilton Heights A
2 5 1.57 1 N 2 7.0 0.00 0.00 10-5 Sun 10-5 Sun Midtown Chelsea A
2 2 2.75 1 N 2 12.0 0.00 0.00 10-5 Sun 10-5 Sun Tribeca Garment District A
2 5 9.18 1 N 1 30.5 6.36 0.00 10-5 Sun 10-5 Sun Midtown Park Slope A
2 1 0.79 1 N 1 5.0 1.16 0.00 6-10 Sat 6-10 Sat Upper West Side Upper West Side A
2 1 0.75 1 N 1 4.5 1.00 0.00 6-10 Sat 6-10 Sat Chelsea Chelsea A

In [7]:
class(taxi_df)


  1. 'tbl_df'
  2. 'tbl'
  3. 'data.frame'

Filtering and Reordering Data

Subsetting Data

  • dplyr makes subsetting by rows very easy
  • The filter verb takes conditions for filtering rows based on conditions
  • every dplyr function uses a data.frame/tbl as it's first argument
  • Additional conditions are passed as new arguments (no need to make an insanely complicated expression, split em up!)

Filter


In [8]:
filter(taxi_df,
       dropoff_dow %in% c("Fri", "Sat", "Sun"),
       tip_amount > 1)


VendorIDpassenger_counttrip_distanceRateCodeIDstore_and_fwd_flagpayment_typefare_amounttip_amounttolls_amountpickup_hourpickup_dowdropoff_hourdropoff_dowpickup_nhooddropoff_nhoodkSplits
1 2 0.90 1 N 1 6.5 1.55 0.00 6-10 Sat 6-10 Sat Midtown Midtown A
1 1 0.90 1 N 1 7.0 1.66 0.00 6-10 Sat 6-10 Sat Lower East Side Soho A
2 1 2.01 1 N 1 9.5 2.16 0.00 10-5 Sun 10-5 Sun Upper East Side Harlem A
2 3 3.14 1 N 1 12.5 2.50 0.00 12-4 Sun 12-4 Sun Fort Green Soho A
2 1 6.08 1 N 1 18.5 3.70 0.00 6-10 Sun 6-10 Sun Financial DistrictUpper East Side A
2 1 1.96 1 N 1 8.5 2.12 0.00 6-10 Sun 6-10 Sun Chelsea Midtown A
1 1 0.40 1 N 1 4.0 1.06 0.00 10-5 Sat 10-5 Sat Greenwich Village East Village A
1 2 1.70 1 N 1 10.0 2.00 0.00 12-4 Fri 4-6 Fri Midtown Upper East Side A
1 2 0.60 1 N 1 4.5 1.70 0.00 6-10 Sat 6-10 Sat Tribeca Tribeca A
1 2 9.30 1 N 1 28.5 5.86 0.00 6-10 Sat 6-10 Sat NA Upper East Side A
1 2 5.20 1 N 1 17.5 3.85 0.00 6-10 Fri 6-10 Fri Upper East Side Soho A
1 1 8.50 1 N 1 24.5 6.30 5.33 6-10 Fri 6-10 Fri NA Upper East Side A
1 1 2.20 1 N 1 15.5 4.07 0.00 6-10 Sat 6-10 Sat Garment District East Village A
1 1 5.30 1 N 1 20.5 6.39 0.00 6-10 Sat 6-10 Sat Financial DistrictMidtown A
2 1 1.01 1 N 1 5.5 1.58 0.00 4-6 Sun 4-6 Sun Yorkville Upper East Side A
1 1 0.40 1 N 1 4.5 1.58 0.00 6-10 Fri 6-10 Fri East Village East Village A
1 1 7.70 1 N 1 22.0 5.83 5.33 6-10 Fri 6-10 Fri NA Harlem A
2 1 1.01 1 N 1 7.0 1.95 0.00 5-9 Fri 5-9 Fri Gramercy Murray Hill A
2 1 0.79 1 N 1 8.0 2.20 0.00 5-9 Fri 5-9 Fri Midtown Midtown A
1 1 0.50 1 N 1 4.0 1.20 0.00 5-9 Sat 5-9 Sat Upper East Side Upper East Side A
1 1 2.90 1 N 1 10.5 2.25 0.00 5-9 Sat 5-9 Sat Greenwich Village Upper East Side A
2 1 1.16 1 N 1 6.0 1.82 0.00 10-5 Fri 10-5 Fri Financial DistrictBattery Park A
1 1 0.60 1 N 1 5.5 1.35 0.00 6-10 Sat 6-10 Sat Lower East Side East Village A
2 1 1.09 1 N 1 14.0 2.80 0.00 6-10 Sat 6-10 Sat Greenwich Village East Village A
1 1 2.40 1 N 1 12.5 2.00 0.00 4-6 Fri 4-6 Fri Midtown Chelsea A
2 1 1.28 1 N 1 5.5 1.20 0.00 6-10 Sun 6-10 Sun Midtown Upper East Side A
2 1 5.08 1 N 1 17.5 3.60 0.00 6-10 Sun 6-10 Sun East Village Upper West Side A
2 5 1.29 1 N 1 7.0 1.88 0.00 6-10 Sun 6-10 Sun Upper East Side Upper East Side A
2 1 2.83 1 N 1 10.0 2.10 0.00 6-10 Sun 6-10 Sun Upper East Side Gramercy A
1 1 2.50 1 N 1 10.5 2.00 0.00 6-10 Sat 10-5 Sat Midtown Upper East Side A
2 5 2.36 1 N 1 10.0 3.24 0 5-9 Fri 5-9 Fri Gramercy Tribeca A
2 1 1.24 1 N 1 6.0 1.46 0 10-5 Sat 10-5 Sat Garment District Midtown A
2 2 1.24 1 N 1 7.0 1.66 0 6-10 Sat 6-10 Sat Gramercy Midtown A
2 1 2.63 1 N 1 11.0 2.46 0 6-10 Sat 6-10 Sat Garment District Upper East Side A
2 5 3.14 1 N 1 20.0 4.16 0 5-9 Fri 5-9 Fri Sunny Side Upper East Side A
2 1 2.25 1 N 1 11.0 3.54 0 5-9 Fri 5-9 Fri Upper East Side Upper West Side A
2 1 0.71 1 N 1 8.5 1.86 0 5-9 Fri 5-9 Fri Upper East Side Upper East Side A
2 1 3.04 1 N 1 10.0 2.26 0 6-10 Sat 6-10 Sat Financial DistrictLower East Side A
2 3 10.36 1 N 1 31.0 4.00 0 6-10 Sat 6-10 Sat Chelsea Inwood A
2 2 1.56 1 N 1 8.5 1.50 0 12-4 Sat 12-4 Sat Upper East Side Upper West Side A
2 2 2.04 1 N 1 8.5 1.86 0 4-6 Sun 4-6 Sun Upper East Side Yorkville A
2 1 7.44 1 N 1 27.5 5.66 0 6-10 Sat 6-10 Sat Chelsea Bushwick A
2 1 1.96 1 N 1 9.5 2.06 0 12-4 Sat 12-4 Sat Financial DistrictGreenwich Village A
2 3 3.30 1 N 1 13.5 3.58 0 12-4 Sat 4-6 Sat Garment District Upper West Side A
2 1 1.32 1 N 1 9.0 1.96 0 12-4 Sat 4-6 Sat Midtown Garment District A
2 1 2.85 1 N 1 12.0 2.56 0 12-4 Sat 4-6 Sat Upper East Side Midtown A
2 2 0.83 1 N 1 6.0 1.36 0 12-4 Sat 12-4 Sat Clinton Garment District A
2 1 2.96 1 N 1 12.0 2.00 0 4-6 Sun 4-6 Sun Yorkville North Sutton Area A
2 1 2.64 1 N 1 11.5 2.46 0 12-4 Sat 12-4 Sat Garment District Central Park A
2 5 1.40 1 N 1 7.5 1.66 0 9-12 Sat 9-12 Sat Garment District Midtown A
2 1 0.59 1 N 1 4.5 1.32 0 9-12 Sun 9-12 Sun Chelsea Chelsea A
2 1 1.34 1 N 1 7.5 1.76 0 10-5 Fri 10-5 Sat East Village Greenwich Village A
2 1 4.62 1 N 1 20.0 4.26 0 10-5 Fri 10-5 Sat Soho Bushwick A
2 2 0.76 1 N 1 4.5 1.16 0 10-5 Sat 10-5 Sat Yorkville Yorkville A
2 1 7.03 1 N 1 24.0 1.70 0 10-5 Sat 10-5 Sat Little Italy Upper West Side A
2 4 1.25 1 N 1 6.5 1.56 0 10-5 Sat 10-5 Sat Little Italy West Village A
2 1 0.83 1 N 1 5.5 1.36 0 10-5 Fri 10-5 Fri Lower East Side East Village A
2 1 1.13 1 N 1 6.5 1.95 0 10-5 Sat 10-5 Sat Midtown Midtown A
2 5 9.18 1 N 1 30.5 6.36 0 10-5 Sun 10-5 Sun Midtown Park Slope A
2 1 0.79 1 N 1 5.0 1.16 0 6-10 Sat 6-10 Sat Upper West Side Upper West Side A

Exercise

Your turn:

  • How many observations started in Harlem?
    • pick both sides of Harlem, including east harlem
  • How many observations that started in Harlem ended in the Financial District?

Solution


In [9]:
library(stringr)
#table(taxi_df$pickup_nhood)
#harlem_pickups <- filter(taxi_df, str_detect(pickup_nhood, "Harlem")) #str_detect: equivalent of 'Like' in SQL
  harlem_pickups <- filter(taxi_df, pickup_nhood == "Harlem" | pickup_nhood == "East Harlem")
nrow(harlem_pickups)
#findistr_dropoffs <- filter(harlem_pickups, str_detect(dropoff_nhood, "Financial District"))
    findistr_dropoffs <- filter(harlem_pickups, dropoff_nhood == "Financial District")
nrow(findistr_dropoffs)

# or all together (without creating additional objects == memory)
nrow(filter(taxi_df, str_detect(pickup_nhood, "Harlem"), dropoff_nhood == "Financial District"))


32025
99
99

Select a set of columns

  • You can use the select() verb to specify which columns of a dataset you want
  • This is similar to the keep option in SAS's data step.
  • Use a colon : to select all the columns between two variables (inclusive)
  • Use contains to take any columns containing a certain word/phrase/character

Select Example


In [10]:
select(taxi_df, pickup_nhood, dropoff_nhood,
       fare_amount, dropoff_hour, trip_distance)


pickup_nhooddropoff_nhoodfare_amountdropoff_hourtrip_distance
Morningside HeightsHamilton Heights 9.5 6-10 1.80
Midtown Midtown 6.5 6-10 0.90
Lower East Side Soho 7.0 6-10 0.90
Financial District Financial District 3.0 6-10 0.30
Chelsea West Village 5.5 6-10 0.96
Upper East Side Harlem 9.5 10-5 2.01
Fort Green Soho 12.5 12-4 3.14
Upper East Side Upper East Side 4.0 12-4 0.50
Upper West Side Upper West Side 5.0 12-4 0.67
NA Clinton 52.0 12-4 15.20
Upper East Side Garment District 20.5 12-4 2.96
Upper East Side Upper East Side 6.0 9-12 0.70
Upper East Side Gramercy 16.0 9-12 2.60
NA NA 5.0 12-4 0.79
Upper East Side Chelsea 18.0 12-4 3.37
East Village Garment District 11.0 10-5 2.40
NA NA 45.0 10-5 16.30
Midtown NA 25.0 12-4 5.70
Midtown Upper West Side 16.0 12-4 3.20
Upper West Side Harlem 4.5 12-4 0.70
Midtown Upper West Side 6.0 10-5 1.00
West Village East Village 8.5 10-5 1.50
Midtown Jackson Heights 21.5 12-4 5.00
Downtown Fort Green 11.5 12-4 2.00
Upper East Side Upper East Side 8.0 12-4 1.40
Midtown Gramercy 8.5 12-4 1.40
Upper West Side Midtown 9.5 12-4 1.80
Financial District Upper East Side 18.5 6-10 6.08
Chelsea Midtown 8.5 6-10 1.96
Garment District Murray Hill 7.0 6-10 1.26
NA Sunny Side 39.5 6-10 14.21
Upper East Side Upper West Side 10.0 6-10 1.79
Upper West Side Battery Park 20.0 6-10 5.34
Midtown Upper East Side 8.0 6-10 1.51
Upper East Side Midtown 12.5 4-6 2.92
West Village Soho 7.5 4-6 1.41
NA Upper East Side 17.0 10-5 4.91
East Village Gramercy 5.0 10-5 0.92
NA NA 37.0 10-5 13.31
East Village Clinton 12.5 10-5 3.46
Garment District Cobble Hill 20.5 10-5 6.08
Soho East Village 7.0 10-5 1.41
Lower East Side East Village 5.5 10-5 0.83
Upper West Side Clinton 8.5 12-4 1.44
Upper East Side Upper West Side 7.5 12-4 1.00
Financial District Upper East Side 24.0 9-12 5.72
NA Midtown 34.0 9-12 8.83
Upper East Side Yorkville 6.5 5-9 1.38
Gramercy Midtown 14.5 9-12 1.61
Upper East Side Murray Hill 15.0 9-12 2.28
Chelsea Chelsea 6.0 5-9 1.00
NA Fort Green 13.5 10-5 3.24
Midtown Midtown 6.5 10-5 1.13
Upper East Side Harlem 9.5 10-5 2.27
Morningside HeightsHamilton Heights 8.5 10-5 1.87
Midtown Chelsea 7.0 10-5 1.57
Tribeca Garment District 12.0 10-5 2.75
Midtown Park Slope 30.5 10-5 9.18
Upper West Side Upper West Side 5.0 6-10 0.79
Chelsea Chelsea 4.5 6-10 0.75

Select: Other Options

starts_with(x, ignore.case = FALSE)

: name starts with x

ends_with(x, ignore.case = FALSE)

: name ends with x

matches(x, ignore.case = FALSE)

: selects all variables whose name matches the regular expression x

num_range("V", 1:5, width = 1)

: selects all variables (numerically) from V1 to V5.

  • You can also use a - to drop variables.

Reordering Data

  • You can reorder your dataset based on conditions using the arrange() verb
  • Use the desc function to sort in descending order rather than ascending order (default)

Arrange


In [11]:
#select(arrange(taxi_df, desc(fare_amount), pickup_nhood),
#       fare_amount, pickup_nhood)

head(select(arrange(taxi_df, desc(fare_amount), pickup_nhood, dropoff_nhood),
       fare_amount, pickup_nhood, dropoff_nhood), 10)


fare_amountpickup_nhooddropoff_nhood
3130.30 Borough Park Borough Park
3130.30 Upper West SideUpper West Side
990.00 NA NA
900.00 Little Italy Little Italy
900.00 NA NA
630.01 Chelsea Murray Hill
600.00 Throggs Neck City Island
500.00 Chinatown Chinatown
500.00 Chinatown Chinatown
500.00 Clinton NA

Exercise

Use arrange() to sort on the basis of tip_amount, dropoff_nhood, and pickup_dow, with descending order for tip amount

Summary

filter

: Extract subsets of rows. See also slice()

select

: Extract subsets of columns. See also rename()

arrange

: Sort your data

Data Aggregations and Transformations

Transformations

  • The mutate() verb can be used to make new columns

In [12]:
#Exercise
head(select(arrange(taxi_df, desc(tip_amount), dropoff_nhood, pickup_nhood), tip_amount, dropoff_nhood, pickup_nhood))


tip_amountdropoff_nhoodpickup_nhood
454.00 Bedford-StuyvesantBedford-Stuyvesant
300.00 Greenwich Village Chelsea
239.07 Clinton Greenwich Village
221.20 Clinton Central Park
202.00 Lower East Side Upper East Side
173.00 NA NA

In [13]:
taxi_df <- mutate(taxi_df, tip_pct = tip_amount/fare_amount)
head(select(taxi_df, tip_pct, fare_amount, tip_amount))
head(transmute(taxi_df, tip_pct = tip_amount/fare_amount))


tip_pctfare_amounttip_amount
0.00000009.5 0.00
0.23846156.5 1.55
0.23714297.0 1.66
0.00000003.0 0.00
0.23636365.5 1.30
0.22736849.5 2.16
tip_pct
0.0000000
0.2384615
0.2371429
0.0000000
0.2363636
0.2273684

In [14]:
str(taxi_df)


Classes ‘tbl_df’, ‘tbl’ and 'data.frame':	3770319 obs. of  17 variables:
 $ VendorID          : chr  "1" "1" "1" "1" ...
 $ passenger_count   : int  1 2 1 1 1 1 3 1 1 1 ...
 $ trip_distance     : num  1.8 0.9 0.9 0.3 0.96 2.01 3.14 0.5 0.67 15.2 ...
 $ RateCodeID        : chr  "1" "1" "1" "1" ...
 $ store_and_fwd_flag: chr  "N" "N" "N" "N" ...
 $ payment_type      : chr  "2" "1" "1" "2" ...
 $ fare_amount       : num  9.5 6.5 7 3 5.5 9.5 12.5 4 5 52 ...
 $ tip_amount        : num  0 1.55 1.66 0 1.3 ...
 $ tolls_amount      : num  0 0 0 0 0 0 0 0 0 5.33 ...
 $ pickup_hour       : chr  "6-10" "6-10" "6-10" "6-10" ...
 $ pickup_dow        : chr  "Sat" "Sat" "Sat" "Sat" ...
 $ dropoff_hour      : chr  "6-10" "6-10" "6-10" "6-10" ...
 $ dropoff_dow       : chr  "Sat" "Sat" "Sat" "Sat" ...
 $ pickup_nhood      : chr  "Morningside Heights" "Midtown" "Lower East Side" "Financial District" ...
 $ dropoff_nhood     : chr  "Hamilton Heights" "Midtown" "Soho" "Financial District" ...
 $ kSplits           : chr  "A" "A" "A" "A" ...
 $ tip_pct           : num  0 0.238 0.237 0 0.236 ...

Summarise Data by Groups

  • The group_by verb creates a grouping by a categorical variable
  • Functions can be placed inside summarise to create summary functions

In [15]:
class(taxi_df)


  1. 'tbl_df'
  2. 'tbl'
  3. 'data.frame'

In [16]:
grouped_taxi <- group_by(taxi_df, dropoff_nhood)
class(grouped_taxi)
head(grouped_taxi)#grouping is not visible, but the class knows about the grouping


  1. 'grouped_df'
  2. 'tbl_df'
  3. 'tbl'
  4. 'data.frame'
VendorIDpassenger_counttrip_distanceRateCodeIDstore_and_fwd_flagpayment_typefare_amounttip_amounttolls_amountpickup_hourpickup_dowdropoff_hourdropoff_dowpickup_nhooddropoff_nhoodkSplitstip_pct
1 1 1.80 1 N 2 9.5 0.00 0 6-10 Sat 6-10 Sat Morningside HeightsHamilton Heights A 0.0000000
1 2 0.90 1 N 1 6.5 1.55 0 6-10 Sat 6-10 Sat Midtown Midtown A 0.2384615
1 1 0.90 1 N 1 7.0 1.66 0 6-10 Sat 6-10 Sat Lower East Side Soho A 0.2371429
1 1 0.30 1 N 2 3.0 0.00 0 6-10 Sat 6-10 Sat Financial District Financial District A 0.0000000
2 1 0.96 1 N 1 5.5 1.30 0 6-10 Thu 6-10 Thu Chelsea West Village A 0.2363636
2 1 2.01 1 N 1 9.5 2.16 0 10-5 Sun 10-5 Sun Upper East Side Harlem A 0.2273684

In [17]:
nrow(summarize(group_by(taxi_df, dropoff_nhood),
          Num = n(), ave_tip_pct = mean(tip_pct)))

nrow(summarize(grouped_taxi, Num=n(), ave_tip_pct=mean(tip_pct)))  #calculated above


122
122

Group By Neighborhoods Example


In [18]:
summarise(group_by(taxi_df, pickup_nhood, dropoff_nhood),
          Num = n(), ave_tip_pct = mean(tip_pct))


pickup_nhooddropoff_nhoodNumave_tip_pct
Ardon Heights Ardon Heights 1 0.00000000
Astoria-Long Island CityAstoria-Long Island City6714 0.09712743
Astoria-Long Island CityAuburndale 13 0.06520367
Astoria-Long Island CityBattery Park 14 0.10140865
Astoria-Long Island CityBay Ridge 5 0.15388657
Astoria-Long Island CityBedford Park 3 0.05550679
Astoria-Long Island CityBedford-Stuyvesant 68 0.09819024
Astoria-Long Island CityBensonhurst 2 0.00000000
Astoria-Long Island CityBoerum Hill 12 0.09941085
Astoria-Long Island CityBorough Park 9 0.08205749
Astoria-Long Island CityBrownsville 4 0.10376598
Astoria-Long Island CityBushwick 53 0.05616446
Astoria-Long Island CityCanarsie 3 0.08080808
Astoria-Long Island CityCarnegie Hill 29 0.10760679
Astoria-Long Island CityCarroll Gardens 19 0.15358826
Astoria-Long Island CityCentral Park 72 0.12755575
Astoria-Long Island CityChelsea 224 0.11327865
Astoria-Long Island CityChinatown 8 0.09015545
Astoria-Long Island CityClearview 7 0.02959184
Astoria-Long Island CityClinton 163 0.11348404
Astoria-Long Island CityCobble Hill 2 0.10235294
Astoria-Long Island CityDouglastown-Little Neck 2 0.00000000
Astoria-Long Island CityDowntown 22 0.13353090
Astoria-Long Island CityDyker Heights 1 0.00000000
Astoria-Long Island CityEast Brooklyn 8 0.04740602
Astoria-Long Island CityEast Harlem 79 0.07145051
Astoria-Long Island CityEast Village 171 0.14311855
Astoria-Long Island CityFinancial District 117 0.13240868
Astoria-Long Island CityFlushing 54 0.08736876
Astoria-Long Island CityFordham 5 0.04092308
NA Soundview 114 0.02754077
NA South Beach 15 0.11433528
NA South Bronx 236 0.05923333
NA Springfield Gardens 655 0.06109878
NA Spuyten Duyvil 131 0.11685827
NA Sunny Side 2678 0.12713336
NA Sunset Park 288 0.11079174
NA The Rockaways 259 0.06733444
NA Throggs Neck 80 0.08751933
NA Todt Hill 12 0.03739845
NA Tottensville 2 0.14492754
NA Tremont 119 0.02535450
NA Tribeca 1931 0.16457058
NA Union Port 64 0.04806241
NA University Heights 82 0.03717917
NA Upper East Side 10544 0.16246302
NA Upper West Side 7967 0.16120439
NA Utopia 677 0.06395129
NA Wakefield-Williamsbridge 107 0.04609067
NA Washington Heights 1241 0.13664189
NA West Village 1726 0.16814002
NA Westerleigh-Castleton 11 0.04190773
NA Whitestone 1 0.00000000
NA Williams Bridge 65 0.06279349
NA Williamsburg 5111 0.14403349
NA Woodhaven-Richmond Hill 480 0.03536936
NA Woodlawn-Nordwood 24 0.08135023
NA Woodside 765 0.08492267
NA Yorkville 957 0.15686904
NA NA 27834 0.47254372

Chaining/Piping

  • A dplyr installation includes the magrittr package as a dependency
  • The magrittr package includes a pipe operator that allows you to pass the current dataset to another function
  • This makes interpreting a nested sequence of operations much easier to understand

Standard Code

  • Code is executed inside-out.
  • Let's arrange the above average tips in descending order, and only look at the locations that had at least 10 dropoffs and pickups.

In [19]:
filter(arrange(summarise(group_by(taxi_df, pickup_nhood, dropoff_nhood), Num = n(), ave_tip_pct = mean(tip_pct)), desc(ave_tip_pct)), Num >= 10)


pickup_nhooddropoff_nhoodNumave_tip_pct
Kings Bridge Kings Bridge 11 1.2256871
Upper East Side Brownsville 23 0.9447013
Bay Ridge Bay Ridge 60 0.7915420
The Rockaways The Rockaways 33 0.7828405
Gravesend-Sheepshead BayGravesend-Sheepshead Bay 80 0.5122171
NA NA 27834 0.4725437
Bensonhurst Bensonhurst 59 0.4512018
Upper East Side Spuyten Duyvil 65 0.3463543
Soho Mott Haven 12 0.3239580
Bedford Park Bedford Park 30 0.3078915
Midtown Westerleigh-Castleton 10 0.3017644
Bay Ridge Gravesend-Sheepshead Bay 10 0.2951905
Clinton Bay Ridge 90 0.2893189
Midtown South Beach 17 0.2473338
Battery Park Greenwood 15 0.2396805
Fordham Fordham 15 0.2326797
Bedford-Stuyvesant Bedford-Stuyvesant 936 0.2293516
Greenwood Financial District 11 0.2273952
Carnegie Hill Carroll Gardens 13 0.2223942
Lower East Side Spuyten Duyvil 10 0.2105018
Murray Hill Clearview 10 0.2102742
Nkew Gardens Midtown 10 0.2101923
Battery Park Park Slope 134 0.2032499
Hunts Point Hunts Point 13 0.2014353
Washington Heights NA 150 0.1995337
Forest Hills Gramercy 26 0.1990748
Battery Park Battery Park 1145 0.1987076
Jackson Heights Financial District 13 0.1970842
Carroll Gardens West Village 51 0.1966033
Financial District Spuyten Duyvil 10 0.1964487
Yorkville South Bronx 127 0.018011226
Parkchester Parkchester 28 0.017944170
East Harlem Bedford Park 17 0.017535651
Clinton Soundview 27 0.017476971
Sunny Side Woodhaven-Richmond Hill 13 0.017094017
Yorkville Woodlawn-Nordwood 12 0.016944444
Williamsburg Brownsville 11 0.016161616
Greenwich Village Soundview 21 0.015615496
Brownsville Bedford-Stuyvesant 14 0.015566502
Harlem Morris Heights 31 0.015263697
South Bronx East Harlem 12 0.015151515
Brownsville East Brooklyn 12 0.014800000
Mott Haven Tremont 18 0.014422658
Jamaica Utopia 11 0.013722127
West Village High Bridge 15 0.013529412
Fordham Tremont 12 0.013333333
Utopia Jamaica 21 0.012786596
Harlem Parkchester 23 0.012385605
Garment District East Brooklyn 37 0.012137732
Harlem Wakefield-Williamsbridge 11 0.006060606
East Harlem Hunts Point 25 0.005934066
High Bridge Mott Haven 43 0.004651163
Woodhaven-Richmond Hill Jamaica 16 0.004166667
Yorkville Soundview 26 0.003870043
Hamilton Heights Morris Heights 16 0.000000000
Jamaica Woodhaven-Richmond Hill 13 0.000000000
Mott Haven Soundview 16 0.000000000
Mott Haven University Heights 10 0.000000000
Tremont Fordham 11 0.000000000
University Heights Inwood 11 0.000000000

Reformatted


In [20]:
filter(
  arrange(
    summarise(
      group_by(taxi_df,
               pickup_nhood, dropoff_nhood),
      Num = n(),
      ave_tip_pct = mean(tip_pct)),
    desc(ave_tip_pct)),
  Num >= 10)


pickup_nhooddropoff_nhoodNumave_tip_pct
Kings Bridge Kings Bridge 11 1.2256871
Upper East Side Brownsville 23 0.9447013
Bay Ridge Bay Ridge 60 0.7915420
The Rockaways The Rockaways 33 0.7828405
Gravesend-Sheepshead BayGravesend-Sheepshead Bay 80 0.5122171
NA NA 27834 0.4725437
Bensonhurst Bensonhurst 59 0.4512018
Upper East Side Spuyten Duyvil 65 0.3463543
Soho Mott Haven 12 0.3239580
Bedford Park Bedford Park 30 0.3078915
Midtown Westerleigh-Castleton 10 0.3017644
Bay Ridge Gravesend-Sheepshead Bay 10 0.2951905
Clinton Bay Ridge 90 0.2893189
Midtown South Beach 17 0.2473338
Battery Park Greenwood 15 0.2396805
Fordham Fordham 15 0.2326797
Bedford-Stuyvesant Bedford-Stuyvesant 936 0.2293516
Greenwood Financial District 11 0.2273952
Carnegie Hill Carroll Gardens 13 0.2223942
Lower East Side Spuyten Duyvil 10 0.2105018
Murray Hill Clearview 10 0.2102742
Nkew Gardens Midtown 10 0.2101923
Battery Park Park Slope 134 0.2032499
Hunts Point Hunts Point 13 0.2014353
Washington Heights NA 150 0.1995337
Forest Hills Gramercy 26 0.1990748
Battery Park Battery Park 1145 0.1987076
Jackson Heights Financial District 13 0.1970842
Carroll Gardens West Village 51 0.1966033
Financial District Spuyten Duyvil 10 0.1964487
Yorkville South Bronx 127 0.018011226
Parkchester Parkchester 28 0.017944170
East Harlem Bedford Park 17 0.017535651
Clinton Soundview 27 0.017476971
Sunny Side Woodhaven-Richmond Hill 13 0.017094017
Yorkville Woodlawn-Nordwood 12 0.016944444
Williamsburg Brownsville 11 0.016161616
Greenwich Village Soundview 21 0.015615496
Brownsville Bedford-Stuyvesant 14 0.015566502
Harlem Morris Heights 31 0.015263697
South Bronx East Harlem 12 0.015151515
Brownsville East Brooklyn 12 0.014800000
Mott Haven Tremont 18 0.014422658
Jamaica Utopia 11 0.013722127
West Village High Bridge 15 0.013529412
Fordham Tremont 12 0.013333333
Utopia Jamaica 21 0.012786596
Harlem Parkchester 23 0.012385605
Garment District East Brooklyn 37 0.012137732
Harlem Wakefield-Williamsbridge 11 0.006060606
East Harlem Hunts Point 25 0.005934066
High Bridge Mott Haven 43 0.004651163
Woodhaven-Richmond Hill Jamaica 16 0.004166667
Yorkville Soundview 26 0.003870043
Hamilton Heights Morris Heights 16 0.000000000
Jamaica Woodhaven-Richmond Hill 13 0.000000000
Mott Haven Soundview 16 0.000000000
Mott Haven University Heights 10 0.000000000
Tremont Fordham 11 0.000000000
University Heights Inwood 11 0.000000000

Magrittr

  • Inspired by unix |, and F# forward pipe |>, magrittr introduces the funny character (%>%, the then operator)
  • %>% pipes the object on the left hand side to the first argument of the function on the right hand side
  • Every function in dplyr has a slot for data.frame/tbl as it's first argument, so this works beautifully!

Put that Function in Your Pipe and...


In [21]:
taxi_df %>%
  group_by(pickup_nhood, dropoff_nhood) %>%
  summarize(Num = n(),
            ave_tip_pct = mean(tip_pct)) %>%
  arrange(desc(ave_tip_pct)) %>%
  filter(Num >= 10)


pickup_nhooddropoff_nhoodNumave_tip_pct
Kings Bridge Kings Bridge 11 1.2256871
Upper East Side Brownsville 23 0.9447013
Bay Ridge Bay Ridge 60 0.7915420
The Rockaways The Rockaways 33 0.7828405
Gravesend-Sheepshead BayGravesend-Sheepshead Bay 80 0.5122171
NA NA 27834 0.4725437
Bensonhurst Bensonhurst 59 0.4512018
Upper East Side Spuyten Duyvil 65 0.3463543
Soho Mott Haven 12 0.3239580
Bedford Park Bedford Park 30 0.3078915
Midtown Westerleigh-Castleton 10 0.3017644
Bay Ridge Gravesend-Sheepshead Bay 10 0.2951905
Clinton Bay Ridge 90 0.2893189
Midtown South Beach 17 0.2473338
Battery Park Greenwood 15 0.2396805
Fordham Fordham 15 0.2326797
Bedford-Stuyvesant Bedford-Stuyvesant 936 0.2293516
Greenwood Financial District 11 0.2273952
Carnegie Hill Carroll Gardens 13 0.2223942
Lower East Side Spuyten Duyvil 10 0.2105018
Murray Hill Clearview 10 0.2102742
Nkew Gardens Midtown 10 0.2101923
Battery Park Park Slope 134 0.2032499
Hunts Point Hunts Point 13 0.2014353
Washington Heights NA 150 0.1995337
Forest Hills Gramercy 26 0.1990748
Battery Park Battery Park 1145 0.1987076
Jackson Heights Financial District 13 0.1970842
Carroll Gardens West Village 51 0.1966033
Financial District Spuyten Duyvil 10 0.1964487
Yorkville South Bronx 127 0.018011226
Parkchester Parkchester 28 0.017944170
East Harlem Bedford Park 17 0.017535651
Clinton Soundview 27 0.017476971
Sunny Side Woodhaven-Richmond Hill 13 0.017094017
Yorkville Woodlawn-Nordwood 12 0.016944444
Williamsburg Brownsville 11 0.016161616
Greenwich Village Soundview 21 0.015615496
Brownsville Bedford-Stuyvesant 14 0.015566502
Harlem Morris Heights 31 0.015263697
South Bronx East Harlem 12 0.015151515
Brownsville East Brooklyn 12 0.014800000
Mott Haven Tremont 18 0.014422658
Jamaica Utopia 11 0.013722127
West Village High Bridge 15 0.013529412
Fordham Tremont 12 0.013333333
Utopia Jamaica 21 0.012786596
Harlem Parkchester 23 0.012385605
Garment District East Brooklyn 37 0.012137732
Harlem Wakefield-Williamsbridge 11 0.006060606
East Harlem Hunts Point 25 0.005934066
High Bridge Mott Haven 43 0.004651163
Woodhaven-Richmond Hill Jamaica 16 0.004166667
Yorkville Soundview 26 0.003870043
Hamilton Heights Morris Heights 16 0.000000000
Jamaica Woodhaven-Richmond Hill 13 0.000000000
Mott Haven Soundview 16 0.000000000
Mott Haven University Heights 10 0.000000000
Tremont Fordham 11 0.000000000
University Heights Inwood 11 0.000000000

Pipe + group_by()

  • The pipe operator is very helpful for group by summaries
  • Let's calculate average tip amount, and average trip distance, controlling for dropoff day of the week and dropoff location
  • First filter with the vector manhattan_hoods


In [22]:
mht_url <- "http://alizaidi.blob.core.windows.net/training/manhattan.rds"
manhattan_hoods <- readRDS(gzcon(url(mht_url)))
taxi_df %>%
  filter(pickup_nhood %in% manhattan_hoods,
         dropoff_nhood %in% manhattan_hoods) %>%
  group_by(dropoff_nhood, pickup_nhood) %>%
  summarize(ave_tip = mean(tip_pct),
            ave_dist = mean(trip_distance)) %>%
  filter(ave_dist > 3, ave_tip > 0.05)


dropoff_nhoodpickup_nhoodave_tipave_dist
Battery Park Central Park 0.12694563 6.149281
Battery Park Clinton 0.11996579 4.016902
Battery Park East Harlem 0.07116177 10.124000
Battery Park East Village 0.14717019 3.537367
Battery Park Garment District 0.13463903 3.965532
Battery Park Gramercy 0.14396885 4.153174
Battery Park Hamilton Heights 0.06770436 8.843571
Battery Park Harlem 0.13829591 9.039286
Battery Park Inwood 0.18235294 11.950000
Battery Park Midtown 0.13428280 5.496734
Battery Park Morningside Heights0.15595534 7.947419
Battery Park Murray Hill 0.14230493 5.457552
Battery Park North Sutton Area 0.13318941 6.866011
Battery Park Upper East Side 0.13112513 7.677720
Battery Park Upper West Side 0.13287111 5.773742
Battery Park Washington Heights 0.12779408 10.206923
Battery Park Yorkville 0.14815491 8.544762
Central Park Battery Park 0.11153630 5.983153
Central Park Chinatown 0.12276618 5.425789
Central Park East Village 0.13408203 4.644099
Central Park Financial District 0.09623395 6.620338
Central Park Gramercy 0.12756278 3.222990
Central Park Greenwich Village 0.12932688 3.919407
Central Park Little Italy 0.11914271 4.660275
Central Park Lower East Side 0.11894056 5.403105
Central Park Soho 0.10882615 4.381541
Central Park Tribeca 0.13259363 5.202881
Central Park Washington Heights 0.09936008 4.700000
Central Park West Village 0.13151131 3.813692
Chelsea Central Park 0.13334925 3.164956
Washington Heights Yorkville 0.08562844 5.209286
West Village Central Park 0.11439411 3.897739
West Village East Harlem 0.12856174 7.203200
West Village Hamilton Heights 0.11837929 7.470263
West Village Harlem 0.10529889 7.113514
West Village Inwood 0.20812500 10.460000
West Village Morningside Heights0.14703183 6.389273
West Village North Sutton Area 0.13483326 3.603191
West Village Upper East Side 0.13367464 4.646157
West Village Upper West Side 0.13723914 3.919866
West Village Washington Heights 0.14651280 8.680526
West Village Yorkville 0.13939565 6.126538
Yorkville Battery Park 0.14375157 8.927209
Yorkville Chelsea 0.12630881 4.988345
Yorkville Chinatown 0.08176344 7.073214
Yorkville Clinton 0.11385074 4.278752
Yorkville East Village 0.12104080 5.233184
Yorkville Financial District 0.11838887 8.146121
Yorkville Garment District 0.10531276 4.162365
Yorkville Gramercy 0.12129103 4.255384
Yorkville Greenwich Village 0.12414755 5.513012
Yorkville Inwood 0.07585581 6.710000
Yorkville Little Italy 0.18410847 5.886102
Yorkville Lower East Side 0.11105280 6.343524
Yorkville Midtown 0.12631612 3.006321
Yorkville Murray Hill 0.12157790 3.433576
Yorkville Soho 0.14101288 6.410496
Yorkville Tribeca 0.13259729 7.757073
Yorkville Washington Heights 0.09903927 4.951351
Yorkville West Village 0.12226814 6.025485

Pipe and Plot

Piping is not limited to dplyr functions, can be used everywhere!


In [23]:
library(ggplot2)
taxi_df %>%
  filter(pickup_nhood %in% manhattan_hoods,
         dropoff_nhood %in% manhattan_hoods) %>%
  group_by(dropoff_nhood, pickup_nhood) %>%
  summarize(ave_tip = mean(tip_pct),
            ave_dist = mean(trip_distance)) %>%
  filter(ave_dist > 3, ave_tip > 0.05) %>%
  ggplot(aes(x = pickup_nhood, y = dropoff_nhood)) +
    geom_tile(aes(fill = ave_tip), colour = "white") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = 'bottom') +
    scale_fill_gradient(low = "white", high = "steelblue")




In [24]:
library(ggplot2)
taxi_df %>%
  filter(pickup_nhood %in% manhattan_hoods,
         dropoff_nhood %in% manhattan_hoods) %>%
  group_by(dropoff_nhood, pickup_nhood) %>%
  summarize(ave_tip = mean(tip_pct),
            ave_dist = mean(trip_distance)) %>%
  filter(ave_dist > 3, ave_tip > 0.05) %>%
  ggplot(aes(x = pickup_nhood, y = dropoff_nhood)) +
    geom_tile(aes(fill = ave_tip), colour = "white") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = 'bottom') +
    scale_fill_gradient(low = "white", high = "steelblue")


Piping to other arguments

  • Although dplyr takes great care to make it particularly amenable to piping, other functions may not reserve the first argument to the object you are passing into it.
  • You can use the special . placeholder to specify where the object should enter

In [25]:
taxi_df %>%
  filter(pickup_nhood %in% manhattan_hoods,
         dropoff_nhood %in% manhattan_hoods) %>%
  group_by(dropoff_nhood, pickup_nhood) %>%
  summarize(ave_tip = mean(tip_pct),
            ave_dist = mean(trip_distance)) %>%
  lm(ave_tip ~ ave_dist, data = .) -> taxi_model         # -> to assign left hand size to the right hand size
summary(taxi_model)

#OR 

#taxi_df %>%
#  filter(pickup_nhood %in% manhattan_hoods,
#         dropoff_nhood %in% manhattan_hoods) %>%
#  group_by(dropoff_nhood, pickup_nhood) %>%
#  summarize(ave_tip = mean(tip_pct),
#            ave_dist = mean(trip_distance)) %>%
#  ungroup() %>%                                       # <-
#  lm(ave_tip ~ ., data = .) -> taxi_model             # <-
#summary(taxi_model)


Call:
lm(formula = ave_tip ~ ave_dist, data = .)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.112258 -0.010882  0.002727  0.014168  0.140976 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.1324307  0.0016071  82.402  < 2e-16 ***
ave_dist    -0.0017345  0.0003004  -5.773 1.15e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.02468 on 724 degrees of freedom
Multiple R-squared:  0.04401,	Adjusted R-squared:  0.04269 
F-statistic: 33.33 on 1 and 724 DF,  p-value: 1.153e-08

Exercise

Your turn:

  • Use the pipe operator to group by day of week and dropoff neighborhood
  • Filter to Manhattan neighborhoods
  • Make tile plot with average fare amount in dollars as the fill

In [26]:
str(manhattan_hoods)


 chr [1:28] "Chinatown" "Little Italy" "Tribeca" "Lower East Side" ...

In [27]:
library(ggplot2)
taxi_df %>%
  filter(dropoff_nhood %in% manhattan_hoods) %>%
  group_by(dropoff_nhood, pickup_dow) %>%
  summarize(ave_fare = mean(fare_amount)) %>%
  ggplot(aes(x = pickup_dow, y = dropoff_nhood)) +
    geom_tile(aes(fill = ave_fare), colour = "white") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = 'bottom') +
    scale_fill_gradient(low = "white", high = "steelblue")


Functional Programming

Creating Functional Pipelines

Too Many Pipes?


Reusable code

  • The examples above create a rather messy pipeline operation
  • Can be very hard to debug
  • The operation is pretty readable, but lacks reusability
  • Since R is a functional language, we benefit by splitting these operations into functions and calling them separately
  • This allows resuability; don't write the same code twice!

Functional Pipelines

Summarization

  • Let's create a function that takes an argument for the data, and applies the summarization by neighborhood to calculate average tip and trip distance


In [28]:
taxi_hood_sum <- function(taxi_data = taxi_df) {

  mht_url <- "http://alizaidi.blob.core.windows.net/training/manhattan.rds"

  manhattan_hoods <- readRDS(gzcon(url(mht_url)))
  taxi_data %>%
    filter(pickup_nhood %in% manhattan_hoods,
           dropoff_nhood %in% manhattan_hoods) %>%
    group_by(dropoff_nhood, pickup_nhood) %>%
    summarize(ave_tip = mean(tip_pct),
              ave_dist = mean(trip_distance)) %>%
    filter(ave_dist > 3, ave_tip > 0.05) -> sum_df

  return(sum_df)

}

Functional Pipelines

Plotting Function

  • We can create a second function for the plot

In [29]:
tile_plot_hood <- function(df = taxi_hood_sum()) {

  library(ggplot2)

  ggplot(data = df, aes(x = pickup_nhood, y = dropoff_nhood)) +
    geom_tile(aes(fill = ave_tip), colour = "white") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = 'bottom') +
    scale_fill_gradient(low = "white", high = "steelblue") -> gplot

  return(gplot)
}

Calling Our Pipeline

  • Now we can create our plot by simply calling our two functions

In [30]:
#library(plotly)

#taxi_hood_sum(taxi_df) %>% tile_plot_hood 
#OR
tile_plot_hood()



In [ ]:


In [31]:
library(plotly)

In [32]:
embed_notebook(ggplotly(tile_plot_hood()))


Let's make that baby interactive.

Creating Complex Pipelines with do

  • The summarize function is fun, can summarize many numeric/scalar quantities
  • But what if you want multiple values/rows back, not just a scalar summary?
  • Meet the do verb -- arbitrary tbl operations


In [38]:
taxi_df %>% group_by(dropoff_dow) %>%
  filter(!is.na(dropoff_nhood), !is.na(pickup_nhood), tip_amount < fare_amount) %>%
  arrange(desc(tip_pct)) %>%
  do(slice(., 1:2)) %>%
  select(dropoff_dow, tip_amount, tip_pct,
         fare_amount, dropoff_nhood, pickup_nhood)


dropoff_dowtip_amounttip_pctfare_amountdropoff_nhoodpickup_nhood
Fri 4.45 0.9888889 4.5 Upper East Side Upper East Side
Fri 9.83 0.9637255 10.2 Carnegie Hill Midtown
Mon 13.40 0.9925926 13.5 Financial District Chelsea
Mon 6.70 0.9571429 7.0 Astoria-Long Island CityAstoria-Long Island City
Sat 11.45 0.9956522 11.5 West Village Murray Hill
Sat 11.00 0.9565217 11.5 Financial District Greenwich Village
Sun 17.00 0.9714286 17.5 Midtown Financial District
Sun 8.20 0.9647059 8.5 Clinton Greenwich Village
Thu 5.88 0.9800000 6.0 Midtown Midtown
Thu 4.40 0.9777778 4.5 Astoria-Long Island CityAstoria-Long Island City
Tue 13.00 0.9629630 13.5 Chinatown Chinatown
Tue 7.70 0.9625000 8.0 Upper East Side Murray Hill
Wed 8.89 0.9877778 9.0 Greenwich Village Lower East Side
Wed 20.00 0.9756098 20.5 Gramercy East Harlem

Estimating Multiple Models with do

  • A common use of do is to calculate many different models by a grouping variable

In [39]:
dow_lms <- taxi_df %>% sample_n(10^4) %>%
  group_by(dropoff_dow) %>%
  do(lm_tip = lm(tip_pct ~ pickup_nhood + passenger_count + pickup_hour,
     data = .))


In [ ]:
dow_lms

Where are our results?

Cleaning Output


In [42]:
summary(dow_lms$lm_tip[[1]])
library(broom)
dow_lms %>% tidy(lm_tip)


Call:
lm(formula = tip_pct ~ pickup_nhood + passenger_count + pickup_hour, 
    data = .)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.23314 -0.12115  0.01163  0.09153  1.07217 

Coefficients:
                                     Estimate Std. Error t value Pr(>|t|)   
(Intercept)                          0.063260   0.050935   1.242  0.21444   
pickup_nhoodBattery Park             0.049273   0.058673   0.840  0.40117   
pickup_nhoodBushwick                 0.275042   0.132531   2.075  0.03813 * 
pickup_nhoodCarnegie Hill            0.058877   0.056755   1.037  0.29972   
pickup_nhoodCarroll Gardens          0.282313   0.132284   2.134  0.03300 * 
pickup_nhoodCentral Park             0.118818   0.056775   2.093  0.03654 * 
pickup_nhoodChelsea                  0.096741   0.051445   1.880  0.06024 . 
pickup_nhoodChinatown                0.095589   0.074304   1.286  0.19849   
pickup_nhoodClinton                  0.053276   0.053285   1.000  0.31756   
pickup_nhoodDowntown                 0.122684   0.099925   1.228  0.21974   
pickup_nhoodEast Harlem             -0.026572   0.100227  -0.265  0.79096   
pickup_nhoodEast Village             0.066278   0.052988   1.251  0.21121   
pickup_nhoodFinancial District       0.110258   0.054445   2.025  0.04303 * 
pickup_nhoodForest Hills             0.172633   0.132284   1.305  0.19209   
pickup_nhoodFort Green               0.003003   0.074314   0.040  0.96777   
pickup_nhoodGarment District         0.085783   0.051929   1.652  0.09877 . 
pickup_nhoodGramercy                 0.104515   0.051268   2.039  0.04167 * 
pickup_nhoodGreenwich Village        0.075102   0.052262   1.437  0.15092   
pickup_nhoodGreenwood                0.196979   0.132284   1.489  0.13669   
pickup_nhoodHamilton Heights         0.006894   0.070996   0.097  0.92266   
pickup_nhoodHarlem                   0.050250   0.059364   0.846  0.39743   
pickup_nhoodLittle Italy             0.131509   0.059914   2.195  0.02832 * 
pickup_nhoodLower East Side          0.067765   0.055113   1.230  0.21906   
pickup_nhoodMidtown                  0.105960   0.050759   2.088  0.03702 * 
pickup_nhoodMorningside Heights      0.067949   0.062321   1.090  0.27576   
pickup_nhoodMott Haven               0.126787   0.100056   1.267  0.20530   
pickup_nhoodMurray Hill              0.116079   0.052675   2.204  0.02770 * 
pickup_nhoodNorth Sutton Area        0.093483   0.055572   1.682  0.09274 . 
pickup_nhoodSoho                     0.182122   0.055565   3.278  0.00107 **
pickup_nhoodSouth Beach             -0.053322   0.132546  -0.402  0.68753   
pickup_nhoodSunny Side               0.108672   0.068236   1.593  0.11147   
pickup_nhoodSunset Park              0.271265   0.132284   2.051  0.04048 * 
pickup_nhoodTribeca                  0.105135   0.056768   1.852  0.06423 . 
pickup_nhoodUpper East Side          0.097675   0.050747   1.925  0.05446 . 
pickup_nhoodUpper West Side          0.096521   0.051161   1.887  0.05941 . 
pickup_nhoodWashington Heights      -0.034046   0.100203  -0.340  0.73408   
pickup_nhoodWest Village             0.102613   0.053786   1.908  0.05661 . 
pickup_nhoodWilliamsburg             0.067892   0.063605   1.067  0.28597   
pickup_nhoodWoodhaven-Richmond Hill -0.033376   0.132525  -0.252  0.80119   
pickup_nhoodWoodside                -0.003993   0.100056  -0.040  0.96817   
pickup_nhoodYorkville                0.046151   0.068338   0.675  0.49958   
passenger_count                     -0.004969   0.002358  -2.108  0.03522 * 
pickup_hour12-4                     -0.024915   0.011469  -2.172  0.02999 * 
pickup_hour4-6                      -0.034891   0.012667  -2.754  0.00595 **
pickup_hour5-9                      -0.011822   0.011302  -1.046  0.29571   
pickup_hour6-10                     -0.007270   0.010213  -0.712  0.47665   
pickup_hour9-12                     -0.033554   0.011910  -2.817  0.00491 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1224 on 1454 degrees of freedom
  (58 observations deleted due to missingness)
Multiple R-squared:  0.05485,	Adjusted R-squared:  0.02495 
F-statistic: 1.834 on 46 and 1454 DF,  p-value: 0.0006361
dropoff_dowtermestimatestd.errorstatisticp.value
Fri (Intercept) 0.063260435 0.05093509 1.24198134 0.214443737
Fri pickup_nhoodBattery Park 0.049272773 0.05867317 0.83978367 0.401167749
Fri pickup_nhoodBushwick 0.275042208 0.13253142 2.07529817 0.038134382
Fri pickup_nhoodCarnegie Hill 0.058877395 0.05675452 1.03740449 0.299719875
Fri pickup_nhoodCarroll Gardens 0.282312579 0.13228388 2.13414203 0.032998046
Fri pickup_nhoodCentral Park 0.118817792 0.05677513 2.09277875 0.036541878
Fri pickup_nhoodChelsea 0.096740812 0.05144475 1.88047989 0.060242445
Fri pickup_nhoodChinatown 0.095589155 0.07430385 1.28646299 0.198486322
Fri pickup_nhoodClinton 0.053275622 0.05328498 0.99982432 0.317561893
Fri pickup_nhoodDowntown 0.122683687 0.09992544 1.22775223 0.219738716
Fri pickup_nhoodEast Harlem -0.026571743 0.10022719 -0.26511512 0.790958357
Fri pickup_nhoodEast Village 0.066277805 0.05298827 1.25080150 0.211208142
Fri pickup_nhoodFinancial District 0.110258372 0.05444459 2.02514836 0.043034538
Fri pickup_nhoodForest Hills 0.172633092 0.13228388 1.30501991 0.192092555
Fri pickup_nhoodFort Green 0.003002924 0.07431395 0.04040861 0.967772912
Fri pickup_nhoodGarment District 0.085782538 0.05192899 1.65192003 0.098766774
Fri pickup_nhoodGramercy 0.104514525 0.05126820 2.03858384 0.041672166
Fri pickup_nhoodGreenwich Village 0.075102376 0.05226202 1.43703555 0.150923036
Fri pickup_nhoodGreenwood 0.196979246 0.13228388 1.48906467 0.136687199
Fri pickup_nhoodHamilton Heights 0.006893659 0.07099617 0.09709903 0.922661131
Fri pickup_nhoodHarlem 0.050249891 0.05936449 0.84646381 0.397433343
Fri pickup_nhoodLittle Italy 0.131508577 0.05991356 2.19497186 0.028323243
Fri pickup_nhoodLower East Side 0.067765488 0.05511333 1.22956611 0.219058551
Fri pickup_nhoodMidtown 0.105960115 0.05075893 2.08751676 0.037015174
Fri pickup_nhoodMorningside Heights 0.067948678 0.06232050 1.09031017 0.275757192
Fri pickup_nhoodMott Haven 0.126786596 0.10005570 1.26716012 0.205300961
Fri pickup_nhoodMurray Hill 0.116078743 0.05267518 2.20367042 0.027703808
Fri pickup_nhoodNorth Sutton Area 0.093483466 0.05557174 1.68221222 0.092742412
Fri pickup_nhoodSoho 0.182122072 0.05556494 3.27764391 0.001071483
Fri pickup_nhoodSouth Beach -0.053321816 0.13254613 -0.40228874 0.687530671
Wed pickup_nhoodGarment District 1.395108e-01 0.065074891 2.1438503284 0.032235654
Wed pickup_nhoodGramercy 1.349188e-01 0.064323041 2.0975195551 0.036146275
Wed pickup_nhoodGreenwich Village 1.436234e-01 0.065636025 2.1881790615 0.028839544
Wed pickup_nhoodHamilton Heights -1.234508e-02 0.109115757 -0.1131374670 0.909939580
Wed pickup_nhoodHarlem 2.493713e-02 0.081456427 0.3061407739 0.759548000
Wed pickup_nhoodJackson Heights -1.106369e-02 0.140531185 -0.0787276533 0.937261740
Wed pickup_nhoodLittle Italy 1.465534e-01 0.071101933 2.0611735424 0.039491126
Wed pickup_nhoodLower East Side 1.423642e-01 0.068156090 2.0887964652 0.036926168
Wed pickup_nhoodMaspeth 9.110012e-05 0.140001218 0.0006507095 0.999480912
Wed pickup_nhoodMidtown 1.349769e-01 0.063631696 2.1212211074 0.034097864
Wed pickup_nhoodMorningside Heights 1.547385e-01 0.079071010 1.9569559359 0.050573038
Wed pickup_nhoodMurray Hill 1.230506e-01 0.065712635 1.8725558213 0.061361000
Wed pickup_nhoodNorth Sutton Area 1.020566e-01 0.069715070 1.4639106443 0.143467594
Wed pickup_nhoodPark Slope 2.818446e-01 0.140451176 2.0067084923 0.044993864
Wed pickup_nhoodSoho 1.096566e-01 0.067844900 1.6162832785 0.106283210
Wed pickup_nhoodSunny Side -1.756478e-02 0.095856425 -0.1832405535 0.854638733
Wed pickup_nhoodTribeca 1.098985e-01 0.067428858 1.6298436467 0.103384292
Wed pickup_nhoodUpper East Side 1.145725e-01 0.063860156 1.7941151796 0.073034269
Wed pickup_nhoodUpper West Side 1.301787e-01 0.064170913 2.0286249860 0.042706297
Wed pickup_nhoodWashington Heights 1.171645e-01 0.108974344 1.0751566405 0.282510276
Wed pickup_nhoodWest Village 1.072339e-01 0.066935176 1.6020556861 0.109393747
Wed pickup_nhoodWilliamsburg 9.131228e-02 0.081174544 1.1248880847 0.260850529
Wed pickup_nhoodWoodside -1.234508e-02 0.140499550 -0.0878656220 0.929997432
Wed pickup_nhoodYorkville 8.249680e-02 0.079146551 1.0423296782 0.297458646
Wed passenger_count -7.940443e-03 0.002767425 -2.8692531754 0.004183029
Wed pickup_hour12-4 3.214347e-03 0.014599876 0.2201626383 0.825780130
Wed pickup_hour4-6 3.644005e-04 0.015825663 0.0230259232 0.981633238
Wed pickup_hour5-9 4.495737e-03 0.014305672 0.3142625517 0.753373622
Wed pickup_hour6-10 2.621518e-02 0.013747320 1.9069301670 0.056756692
Wed pickup_hour9-12 1.087159e-02 0.015030233 0.7233145894 0.469620775

In [45]:
dow_lms %>% tidy(lm_tip)


dropoff_dowtermestimatestd.errorstatisticp.value
Fri (Intercept) 0.063260435 0.05093509 1.24198134 0.214443737
Fri pickup_nhoodBattery Park 0.049272773 0.05867317 0.83978367 0.401167749
Fri pickup_nhoodBushwick 0.275042208 0.13253142 2.07529817 0.038134382
Fri pickup_nhoodCarnegie Hill 0.058877395 0.05675452 1.03740449 0.299719875
Fri pickup_nhoodCarroll Gardens 0.282312579 0.13228388 2.13414203 0.032998046
Fri pickup_nhoodCentral Park 0.118817792 0.05677513 2.09277875 0.036541878
Fri pickup_nhoodChelsea 0.096740812 0.05144475 1.88047989 0.060242445
Fri pickup_nhoodChinatown 0.095589155 0.07430385 1.28646299 0.198486322
Fri pickup_nhoodClinton 0.053275622 0.05328498 0.99982432 0.317561893
Fri pickup_nhoodDowntown 0.122683687 0.09992544 1.22775223 0.219738716
Fri pickup_nhoodEast Harlem -0.026571743 0.10022719 -0.26511512 0.790958357
Fri pickup_nhoodEast Village 0.066277805 0.05298827 1.25080150 0.211208142
Fri pickup_nhoodFinancial District 0.110258372 0.05444459 2.02514836 0.043034538
Fri pickup_nhoodForest Hills 0.172633092 0.13228388 1.30501991 0.192092555
Fri pickup_nhoodFort Green 0.003002924 0.07431395 0.04040861 0.967772912
Fri pickup_nhoodGarment District 0.085782538 0.05192899 1.65192003 0.098766774
Fri pickup_nhoodGramercy 0.104514525 0.05126820 2.03858384 0.041672166
Fri pickup_nhoodGreenwich Village 0.075102376 0.05226202 1.43703555 0.150923036
Fri pickup_nhoodGreenwood 0.196979246 0.13228388 1.48906467 0.136687199
Fri pickup_nhoodHamilton Heights 0.006893659 0.07099617 0.09709903 0.922661131
Fri pickup_nhoodHarlem 0.050249891 0.05936449 0.84646381 0.397433343
Fri pickup_nhoodLittle Italy 0.131508577 0.05991356 2.19497186 0.028323243
Fri pickup_nhoodLower East Side 0.067765488 0.05511333 1.22956611 0.219058551
Fri pickup_nhoodMidtown 0.105960115 0.05075893 2.08751676 0.037015174
Fri pickup_nhoodMorningside Heights 0.067948678 0.06232050 1.09031017 0.275757192
Fri pickup_nhoodMott Haven 0.126786596 0.10005570 1.26716012 0.205300961
Fri pickup_nhoodMurray Hill 0.116078743 0.05267518 2.20367042 0.027703808
Fri pickup_nhoodNorth Sutton Area 0.093483466 0.05557174 1.68221222 0.092742412
Fri pickup_nhoodSoho 0.182122072 0.05556494 3.27764391 0.001071483
Fri pickup_nhoodSouth Beach -0.053321816 0.13254613 -0.40228874 0.687530671
Wed pickup_nhoodGarment District 1.395108e-01 0.065074891 2.1438503284 0.032235654
Wed pickup_nhoodGramercy 1.349188e-01 0.064323041 2.0975195551 0.036146275
Wed pickup_nhoodGreenwich Village 1.436234e-01 0.065636025 2.1881790615 0.028839544
Wed pickup_nhoodHamilton Heights -1.234508e-02 0.109115757 -0.1131374670 0.909939580
Wed pickup_nhoodHarlem 2.493713e-02 0.081456427 0.3061407739 0.759548000
Wed pickup_nhoodJackson Heights -1.106369e-02 0.140531185 -0.0787276533 0.937261740
Wed pickup_nhoodLittle Italy 1.465534e-01 0.071101933 2.0611735424 0.039491126
Wed pickup_nhoodLower East Side 1.423642e-01 0.068156090 2.0887964652 0.036926168
Wed pickup_nhoodMaspeth 9.110012e-05 0.140001218 0.0006507095 0.999480912
Wed pickup_nhoodMidtown 1.349769e-01 0.063631696 2.1212211074 0.034097864
Wed pickup_nhoodMorningside Heights 1.547385e-01 0.079071010 1.9569559359 0.050573038
Wed pickup_nhoodMurray Hill 1.230506e-01 0.065712635 1.8725558213 0.061361000
Wed pickup_nhoodNorth Sutton Area 1.020566e-01 0.069715070 1.4639106443 0.143467594
Wed pickup_nhoodPark Slope 2.818446e-01 0.140451176 2.0067084923 0.044993864
Wed pickup_nhoodSoho 1.096566e-01 0.067844900 1.6162832785 0.106283210
Wed pickup_nhoodSunny Side -1.756478e-02 0.095856425 -0.1832405535 0.854638733
Wed pickup_nhoodTribeca 1.098985e-01 0.067428858 1.6298436467 0.103384292
Wed pickup_nhoodUpper East Side 1.145725e-01 0.063860156 1.7941151796 0.073034269
Wed pickup_nhoodUpper West Side 1.301787e-01 0.064170913 2.0286249860 0.042706297
Wed pickup_nhoodWashington Heights 1.171645e-01 0.108974344 1.0751566405 0.282510276
Wed pickup_nhoodWest Village 1.072339e-01 0.066935176 1.6020556861 0.109393747
Wed pickup_nhoodWilliamsburg 9.131228e-02 0.081174544 1.1248880847 0.260850529
Wed pickup_nhoodWoodside -1.234508e-02 0.140499550 -0.0878656220 0.929997432
Wed pickup_nhoodYorkville 8.249680e-02 0.079146551 1.0423296782 0.297458646
Wed passenger_count -7.940443e-03 0.002767425 -2.8692531754 0.004183029
Wed pickup_hour12-4 3.214347e-03 0.014599876 0.2201626383 0.825780130
Wed pickup_hour4-6 3.644005e-04 0.015825663 0.0230259232 0.981633238
Wed pickup_hour5-9 4.495737e-03 0.014305672 0.3142625517 0.753373622
Wed pickup_hour6-10 2.621518e-02 0.013747320 1.9069301670 0.056756692
Wed pickup_hour9-12 1.087159e-02 0.015030233 0.7233145894 0.469620775
  • By design, every function in dplyr returns a data.frame
  • In the example above, we get back a spooky data.frame with a column of S3 lm objects
  • You can still modify each element as you would normally, or pass it to a mutate function to extract intercept or statistics
  • But there's also a very handy broom package for cleaning up such objects into data.frames

Brooming Up the Mess

Model Metrics


In [43]:
library(broom)
taxi_df %>% sample_n(10^5) %>%
  group_by(dropoff_dow) %>%
  do(glance(lm(tip_pct ~ pickup_nhood + passenger_count + pickup_hour,
     data = .)))


dropoff_dowr.squaredadj.r.squaredsigmastatisticp.valuedflogLikAICBICdeviancedf.residual
Fri 0.01587300 0.011176775 0.1383786 3.379950 9.744936e-2071 8272.639 -16401.278 -15854.199 280.8915 14669
Mon 0.37108607 0.367400716 0.1900269 100.692069 0.000000e+0070 2897.464 -5652.928 -5128.973 425.1977 11775
Sat 0.02660795 0.022272406 0.1274534 6.137163 3.115533e-5271 10156.087 -20168.175 -19616.155 255.2965 15716
Sun 0.01632989 0.011066050 0.1294762 3.102279 5.279886e-1772 8377.867 -16609.734 -16062.342 222.4257 13268
Thu 0.04535462 0.040931498 0.1203320 10.253988 6.973106e-9866 9879.306 -19624.612 -19118.522 203.1371 14029
Tue 0.01172907 0.006699497 0.1796764 2.332021 1.366208e-0864 3735.856 -7341.711 -6858.832 399.6389 12379
Wed 0.02089842 0.015580984 0.1215562 3.930169 2.061605e-2570 8829.915 -17517.830 -16988.507 187.7280 12705

Model Coefficients

The most commonly used function in the broom package is the tidy function. This will expand our data.frame and give us the model coefficients


In [44]:
taxi_df %>% sample_n(10^5) %>%
  group_by(dropoff_dow) %>%
  do(tidy(lm(tip_pct ~ pickup_nhood + passenger_count + pickup_hour,
     data = .)))


dropoff_dowtermestimatestd.errorstatisticp.value
Fri (Intercept) 0.092335815 0.01447957 6.37697105 1.859967e-10
Fri pickup_nhoodBattery Park 0.069134184 0.01791031 3.86002210 1.138596e-04
Fri pickup_nhoodBedford Park -0.099137113 0.12456954 -0.79583750 4.261393e-01
Fri pickup_nhoodBedford-Stuyvesant 0.059284406 0.03847831 1.54072260 1.234057e-01
Fri pickup_nhoodBoerum Hill 0.032812837 0.03407523 0.96295270 3.355870e-01
Fri pickup_nhoodBorough Park 0.067849329 0.07285284 0.93132028 3.517031e-01
Fri pickup_nhoodBushwick 0.056196643 0.04368206 1.28649239 1.982914e-01
Fri pickup_nhoodCarnegie Hill 0.031766089 0.01714010 1.85331944 6.385652e-02
Fri pickup_nhoodCarroll Gardens 0.073350737 0.02902766 2.52692566 1.151694e-02
Fri pickup_nhoodCentral Park 0.046524977 0.01704461 2.72960097 6.348616e-03
Fri pickup_nhoodChelsea 0.061768303 0.01477198 4.18145180 2.913246e-05
Fri pickup_nhoodChinatown 0.045964772 0.02185517 2.10315300 3.546927e-02
Fri pickup_nhoodClinton 0.044584654 0.01539272 2.89647601 3.779335e-03
Fri pickup_nhoodCobble Hill -0.007276768 0.03850576 -0.18897871 8.501121e-01
Fri pickup_nhoodDowntown 0.001545011 0.02585081 0.05976645 9.523425e-01
Fri pickup_nhoodDyker Heights -0.090995908 0.12456502 -0.73050929 4.650905e-01
Fri pickup_nhoodEast Harlem -0.037517366 0.02153468 -1.74218364 8.149714e-02
Fri pickup_nhoodEast Village 0.060576675 0.01521603 3.98110907 6.892280e-05
Fri pickup_nhoodFinancial District 0.039675800 0.01584096 2.50463373 1.226852e-02
Fri pickup_nhoodFlushing -0.090995908 0.12456502 -0.73050929 4.650905e-01
Fri pickup_nhoodFordham -0.090995908 0.12456502 -0.73050929 4.650905e-01
Fri pickup_nhoodForest Hills 0.017216138 0.08865984 0.19418193 8.460361e-01
Fri pickup_nhoodFort Green 0.004221563 0.02464461 0.17129761 8.639921e-01
Fri pickup_nhoodGarment District 0.024393541 0.01490931 1.63612827 1.018341e-01
Fri pickup_nhoodGramercy 0.055185069 0.01470055 3.75394646 1.747394e-04
Fri pickup_nhoodGreenwich Village 0.054852819 0.01500759 3.65500562 2.580681e-04
Fri pickup_nhoodGreenwood 0.016797387 0.08866708 0.18944333 8.497480e-01
Fri pickup_nhoodHamilton Heights 0.040616826 0.02533434 1.60323228 1.089048e-01
Fri pickup_nhoodHarlem -0.005078447 0.01945315 -0.26106040 7.940496e-01
Fri pickup_nhoodHigh Bridge -0.078805000 0.12457074 -0.63261243 5.269966e-01
Wed pickup_nhoodJackson Heights 0.008972805 0.0541068845 0.16583481 0.8682895624
Wed pickup_nhoodJamaica -0.075498874 0.0699637973 -1.07911343 0.2805574364
Wed pickup_nhoodLittle Italy 0.072606464 0.0217161688 3.34342881 0.0008298660
Wed pickup_nhoodLower East Side 0.048356560 0.0197387087 2.44983401 0.0143054683
Wed pickup_nhoodMidtown 0.060540602 0.0177148642 3.41750305 0.0006339448
Wed pickup_nhoodMorningside Heights 0.036199236 0.0241864260 1.49667568 0.1345022497
Wed pickup_nhoodMorris Park -0.079981182 0.1366071813 -0.58548300 0.5582332878
Wed pickup_nhoodMott Haven -0.003259940 0.0579887458 -0.05621678 0.9551699915
Wed pickup_nhoodMurray Hill 0.069460027 0.0184681630 3.76106854 0.0001699406
Wed pickup_nhoodNkew Gardens -0.087631641 0.0801256201 -1.09367817 0.2741166736
Wed pickup_nhoodNorth Sutton Area 0.026271188 0.0207620392 1.26534721 0.2057698561
Wed pickup_nhoodPark Slope 0.017829346 0.0414496068 0.43014513 0.6670972952
Wed pickup_nhoodSoho 0.067486557 0.0193186064 3.49334502 0.0004786161
Wed pickup_nhoodSouth Bronx -0.100848358 0.1365632616 -0.73847356 0.4602402258
Wed pickup_nhoodSpringfield Gardens 0.181608956 0.0973702029 1.86513893 0.0621847287
Wed pickup_nhoodSunny Side 0.025947213 0.0317963552 0.81604361 0.4144903334
Wed pickup_nhoodTribeca 0.063379260 0.0195975259 3.23404394 0.0012235931
Wed pickup_nhoodUpper East Side 0.048340470 0.0177728012 2.71991281 0.0065386994
Wed pickup_nhoodUpper West Side 0.038770943 0.0179751083 2.15692401 0.0310300884
Wed pickup_nhoodWashington Heights 0.051585857 0.0381177421 1.35332929 0.1759742856
Wed pickup_nhoodWest Village 0.073147363 0.0193200618 3.78608330 0.0001537374
Wed pickup_nhoodWilliamsburg 0.032407747 0.0276644776 1.17145705 0.2414368541
Wed pickup_nhoodWoodside -0.026366082 0.0509885335 -0.51709826 0.6050964753
Wed pickup_nhoodYorkville 0.037307628 0.0222607453 1.67593795 0.0937746974
Wed passenger_count -0.001614221 0.0008982713 -1.79703103 0.0723541887
Wed pickup_hour12-4 -0.011357149 0.0046405744 -2.44735850 0.0144040659
Wed pickup_hour4-6 0.005875346 0.0051653421 1.13745540 0.2553691871
Wed pickup_hour5-9 0.001790360 0.0046066245 0.38864910 0.6975421864
Wed pickup_hour6-10 0.014410291 0.0043658767 3.30066372 0.0009671944
Wed pickup_hour9-12 -0.013349910 0.0048771872 -2.73721498 0.0062047080

Summary

mutate

: Create transformations

summarise

: Aggregate

group_by

: Group your dataset by levels

do

: Evaluate complex operations on a tbl

Chaining with the %>% operator can result in more readable code.

What We Didn't Cover

  • There are many additional topics that fit well into the dplyr and functional programming landscape
  • There are too many to cover in one session. Fortunately, most are well documented. The most notable omissions:
    1. Connecting to remote databases, see vignette('databases', package = 'dplyr')
    2. Merging and Joins, see vignette('two-table', package = 'dplyr')
    3. Programming with dplyr,vignette('nse', package = 'dplyr')
    4. summarize_each and mutate_each

Thanks for Attending!

  • Any questions?