It is based on Thomas Augspurger's comparison notebook.
This is the introductory dplyr vignette for analyzing some flight data, played via an IPython/Jupyter notebook.
This time, we play just the "R" code part.
Note: Hadley Wickham would recommend that you add a few more packages to the list: caret, readr, shiny.
%R install.packages("tidyr")
%R install.packages("dplyr")
%R install.packages("ggplot2")
%R install.packages("rvest")
%R install.packages('RSQLite')
%R install.packages("zoo")
%R install.packages("forecast")
%R install.packages('R.utils')
%R install.packages("nycflights13")
%R install.packages('hflights')
%R update.packages()
In [1]:
library("dplyr") # for functions
library("nycflights13")
write.csv(flights, "flights.csv")
In [2]:
dim(flights)
Out[2]:
In [3]:
head(flights)
Out[3]:
dplyr has a small set of nicely defined verbs. I've listed their closest pandas verbs:
| dplyr | pandas |
|-------|--------|
| filter() (and slice()) | query() (and loc[], iloc[]) |
| arrange() | sort() |
| select() (and rename()) | \_\_getitem\_\_ (and rename()) |
| distinct() | drop_duplicates() |
| mutate() (and transmute()) | None |
| summarise() | None |
| sample_n() and sample_frac() | None |
Some of the "missing" verbs in pandas are missing because there are other, different ways of achieving the same goal. For example, summarise is spread across mean, std, etc. Others, like sample_n, just haven't been implemented yet.
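For a rough sense of the pandas side, here is a sketch (not part of the R session; it assumes the flights.csv written above and a reasonably recent pandas, since DataFrame.sample() only arrived in 0.16.1):

```python
import pandas as pd

# read the CSV written by write.csv() above; R stores row names in the
# first column, hence index_col=0
flights = pd.read_csv("flights.csv", index_col=0)

# summarise() is spread across the individual aggregation methods
flights.dep_delay.mean()
flights.dep_delay.std()

# sample_n() / sample_frac() now map onto DataFrame.sample()
flights.sample(n=10)
flights.sample(frac=0.01)
```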
filter() allows you to select a subset of the rows of a data frame. The first argument is the name of the data frame, and the second and subsequent are filtering expressions evaluated in the context of that data frame:
For example, we can select all flights on January 1st with:
In [4]:
filter(flights, month == 1, day == 1)
Out[4]:
The more verbose version:
In [5]:
flights[flights$month == 1 & flights$day == 1, ]
Out[5]:
filter() works similarly to subset() except that you can give it any number of filtering conditions which are joined together with & (not && which is easy to do accidentally!). You can use other boolean operators explicitly:
In [6]:
filter(flights, month == 1 | month == 2)
Out[6]:
In [7]:
slice(flights, 1:10)
Out[7]:
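For comparison, the same row selections in pandas (a sketch using the flights DataFrame from the earlier pandas snippet):

```python
# filter(flights, month == 1, day == 1)
flights.query("month == 1 & day == 1")

# filter(flights, month == 1 | month == 2)
flights[(flights.month == 1) | (flights.month == 2)]

# slice(flights, 1:10)
flights.iloc[:10]
```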
arrange() works similarly to filter() except that instead of filtering or selecting rows, it reorders them. It takes a data frame, and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
In [8]:
arrange(flights, year, month, day)
Out[8]:
Use desc() to order a column in descending order:
In [9]:
arrange(flights, desc(arr_delay))
Out[9]:
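The pandas counterpart, sketched with the newer sort_values() (the table above lists the older sort(), which sort_values() replaced in pandas 0.17):

```python
# arrange(flights, year, month, day)
flights.sort_values(["year", "month", "day"])

# arrange(flights, desc(arr_delay))
flights.sort_values("arr_delay", ascending=False)
```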
In [10]:
select(flights, year, month, day)
Out[10]:
In [11]:
select(flights, year:day)
Out[11]:
In [12]:
select(flights, -(year:day))
Out[12]:
In [13]:
select(flights, tail_num = tailnum)
Out[13]:
But as Hadley mentions, this isn't that useful since it only returns the one column. dplyr and pandas compare well here.
In [14]:
rename(flights, tail_num = tailnum)
Out[14]:
In [15]:
distinct(select(flights, tailnum))
Out[15]:
FYI, the pandas counterpart (e.g. flights.tailnum.unique()) returns a NumPy array instead of a Series.
In [16]:
distinct(select(flights, origin, dest))
Out[16]:
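A pandas sketch of select(), rename(), and distinct(), again on the flights DataFrame from the earlier snippet:

```python
# select(flights, year, month, day)
flights[["year", "month", "day"]]

# rename(flights, tail_num = tailnum)
flights.rename(columns={"tailnum": "tail_num"})

# distinct(select(flights, origin, dest))
flights[["origin", "dest"]].drop_duplicates()
```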
In [17]:
mutate(flights, gain = arr_delay - dep_delay, speed = distance / air_time * 60)
Out[17]:
In [18]:
mutate(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60) )
Out[18]:
In [19]:
# The dplyr version, for reference:
# mutate(flights,
#        gain = arr_delay - dep_delay,
#        gain_per_hour = gain / (air_time / 60)
# )
# The pandas equivalent (Python), kept from the original comparison notebook:
flights['gain'] = flights.arr_delay - flights.dep_delay
flights['gain_per_hour'] = flights.gain / (flights.air_time / 60)
flights
Out[19]:
dplyr's approach may be nicer here since you get to refer to the variables in subsequent statements within the mutate(). To achieve this with pandas, you have to add the gain variable as another column in flights. If I don't want it around, I would have to explicitly drop it.
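pandas has since narrowed this gap a bit: assign() returns a new DataFrame without touching flights, and from pandas 0.23 onward a callable argument can refer to columns created earlier in the same call. A sketch:

```python
# a mutate()-like chain: the lambda sees the gain column created in the
# same assign() call, and flights itself is left unchanged
flights.assign(
    gain=flights.arr_delay - flights.dep_delay,
    gain_per_hour=lambda df: df.gain / (df.air_time / 60),
)
```

There is still no direct transmute() analogue; the closest is keeping only the new columns afterwards.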
In [20]:
transmute(flights, gain = arr_delay - dep_delay, gain_per_hour = gain / (air_time / 60) )
Out[20]:
In [21]:
summarise(flights,
delay = mean(dep_delay, na.rm = TRUE))
Out[21]:
In [22]:
sample_n(flights, 10)
Out[22]:
In [23]:
sample_frac(flights, 0.01)
Out[23]:
In [24]:
library("ggplot2")
by_tailnum <- group_by(flights, tailnum)
delay <- summarise(by_tailnum,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE))
delay <- filter(delay, count > 20, dist < 2000)
# Interestingly, the average delay is only slightly related to the
# average distance flown by a plane.
ggplot(delay, aes(dist, delay)) +
geom_point(aes(size = count), alpha = 1/2) +
geom_smooth() +
scale_size_area()
Out[24]:
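The grouped summary feeding that plot translates fairly directly; a sketch using pandas named aggregation (which needs pandas 0.25+):

```python
# group_by(flights, tailnum) %>% summarise(...) %>% filter(...)
delay = (
    flights.groupby("tailnum")
    .agg(
        count=("tailnum", "size"),
        dist=("distance", "mean"),    # .mean() skips NaN, like na.rm = TRUE
        delay=("arr_delay", "mean"),
    )
    .query("count > 20 & dist < 2000")
)
```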
In [25]:
destinations <- group_by(flights, dest)
summarise(destinations,
planes = n_distinct(tailnum),
flights = n()
)
Out[25]:
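The destinations summary follows the same pattern (again a sketch, with the same pandas 0.25+ named-aggregation caveat):

```python
flights.groupby("dest").agg(
    planes=("tailnum", "nunique"),   # n_distinct(tailnum)
    flights=("tailnum", "size"),     # n()
)
```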
In [26]:
filter(
summarise(
select(
group_by(flights, year, month, day),
arr_delay, dep_delay
),
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
),
arr > 30 | dep > 30
)
Out[26]:
Similar to how dplyr provides optimized C++ versions of most of the summarise functions, pandas uses Cython-optimized versions of most of the agg methods.
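The fast path is triggered by the built-in names; handing agg an arbitrary Python callable gives the same answer via a much slower Python-level loop:

```python
g = flights.groupby(["year", "month", "day"]).dep_delay

g.agg("mean")              # dispatches to the Cython-optimized implementation
g.agg(lambda x: x.mean())  # same result, computed group-by-group in Python
```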
In [27]:
daily <- group_by(flights, year, month, day)
(per_day <- summarise(daily, flights = n()))
Out[27]:
In [28]:
(per_month <- summarise(per_day, flights = sum(flights)))
Out[28]:
In [29]:
(per_year <- summarise(per_month, flights = sum(flights)))
Out[29]:
I'm not sure how dplyr is handling the other columns, like year, in the last example. With pandas, it's clear that we're grouping by them since they're included in the groupby. For the last example, we didn't group by anything, so they aren't included in the result.
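In pandas the equivalent rollup keeps the grouping levels explicit at every step, e.g. (a sketch):

```python
per_day = flights.groupby(["year", "month", "day"]).size()
per_month = per_day.groupby(level=["year", "month"]).sum()
per_year = per_month.groupby(level="year").sum()
```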
Any follower of Hadley's Twitter account will know how much R users love the %>% (pipe) operator. And for good reason!
In [30]:
flights %>%
group_by(year, month, day) %>%
select(arr_delay, dep_delay) %>%
summarise(
arr = mean(arr_delay, na.rm = TRUE),
dep = mean(dep_delay, na.rm = TRUE)
) %>%
filter(arr > 30 | dep > 30)
Out[30]:
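pandas' analogue is method chaining, and DataFrame.pipe() lets ordinary functions join the chain much like %>% does; top_delays below is a made-up helper for illustration:

```python
def top_delays(df, cutoff=30):
    # @cutoff resolves against this function's local scope
    return df.query("arr_delay > @cutoff | dep_delay > @cutoff")

(flights
 .groupby(["year", "month", "day"])
 [["arr_delay", "dep_delay"]]
 .mean()
 .pipe(top_delays))
```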
pandas has tons of IO tools to help you get data in and out, including SQL databases via SQLAlchemy.
I think pandas held up pretty well, considering this was a vignette written for dplyr. I found the degree of similarity more interesting than the differences. The most difficult task was renaming columns within an operation; these had to be followed up with a call to rename after the operation, which honestly isn't that burdensome.
More and more, it looks like we're moving towards a future where being a language or package partisan just doesn't make sense. Not when you can load up a Jupyter (formerly IPython) notebook to call a library written in R, hand those results off to Python or Julia or whatever for follow-up, and then go back to R to make a cool Shiny web app.
There will always be a place for your "utility belt" package like dplyr or pandas, but it wouldn't hurt to be familiar with both.
If you want to contribute to pandas, we're always looking for help at https://github.com/pydata/pandas/. You can get ahold of me directly on twitter.