Define all the libraries which needs to be set for operations here
In [1]:
library("dplyr")
library("nycflights13")
Creates the my_db database using the command src_sqlite
In [2]:
my_db <- src_sqlite("my_db.sqlite3", create = T)
Import the flights data using the convenient copy_to() function.
In [3]:
flights_sqlite <- copy_to(my_db, flights, temporary = FALSE, indexes = list(c("year", "month", "day"), "carrier", "tailnum"))
Print a few rows
In [4]:
head(flights_sqlite)
A built-in src that will cache flights in a standard location
In [5]:
flights_sqlite <- tbl(nycflights13_sqlite(), "flights")
Prints contents of flights_sqlite
In [6]:
flights_sqlite
An alternative for flights_sqlite <- tbl(nycflights13_sqlite(), "flights")
In [7]:
tbl(my_db, sql("SELECT * FROM flights"))
Select the mentioned columns from the flights_sqlite
In [8]:
select(flights_sqlite, year:day, dep_delay, arr_delay)
Selects the rows based on a parameter
In [9]:
filter(flights_sqlite, dep_delay > 240)
Arranges the rows based on given parameter
In [10]:
arrange(flights_sqlite, year, month, day)
Mutate adds the column in table with a given formula using other fields
In [11]:
mutate(flights_sqlite, speed = air_time / distance)
Summarises as per a given parameter
In [12]:
summarise(flights_sqlite, delay = mean(dep_time))
Laziness Example When working with databases, dplyr tries to be as lazy as possible:
In [13]:
c1 <- filter(flights_sqlite, year == 2013, month == 1, day == 1)
c2 <- select(c1, year, month, day, carrier, dep_delay, air_time, distance)
c3 <- mutate(c2, speed = distance / air_time * 60)
c4 <- arrange(c3, year, month, day, carrier)
Fetches the data from database when requested. This is how Laziness works
In [14]:
c4
To pull down all the results use collect(), which returns a tbl_df()
In [15]:
collect(c4)
You can see the query dplyr has generated by looking at the query component of the object
In [16]:
c4$query
You can also ask the database how it plans to execute the query with explain(). It’s helpful if you’re trying to figure out which indexes are being used.
In [17]:
explain(c4)
When performing the simple mathematical operations used when filtering, mutating or summarising, translating R code to SQL (or indeed to any programming language) is relatively straightforward. dplyr knows how to convert the R functions to SQL
In [18]:
translate_sql(x)
In [19]:
translate_sql("x")
In [20]:
translate_sql(x == 1 && (y < 2 || z > 3))
In [21]:
translate_sql(x ^ 2 < 10)
In [22]:
translate_sql(x %% 2 == 10)
In [23]:
translate_sql(1)
In [24]:
translate_sql(1L)
In [26]:
translate_sql(glob(x, y))
In [27]:
translate_sql(x %like% "ab*")
SQLite lacks the window functions that are needed for grouped mutation and filtering. This means that the only really useful operations for grouped SQLite tables are found in summarise(). The grouped summarise from the introduction translates well - the only difference is that databases always drop NULLs (their equivalent of missing values), so we don’t supply na.rm = TRUE.
In [28]:
by_tailnum <- group_by(flights_sqlite, tailnum)
delay <- summarise(by_tailnum,
count = n(),
dist = mean(distance),
delay = mean(arr_delay)
)
delay <- filter(delay, count > 20, dist < 2000)
delay_local <- collect(delay)