DIC LAB 2 Problem 0 : dplyr database Vignette

Define all the libraries which needs to be set for operations here


In [1]:
library("dplyr")
library("nycflights13")


Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Creates the my_db database using the command src_sqlite


In [2]:
my_db <- src_sqlite("my_db.sqlite3", create = T)

Import the flights data using the convenient copy_to() function.


In [3]:
flights_sqlite <- copy_to(my_db, flights, temporary = FALSE, indexes = list(c("year", "month", "day"), "carrier", "tailnum"))

Print a few rows


In [4]:
head(flights_sqlite)


Source:   query [?? x 19]
Database: sqlite 3.11.1 [my_db.sqlite3]

   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dbl>

A built-in src that will cache flights in a standard location


In [5]:
flights_sqlite <- tbl(nycflights13_sqlite(), "flights")


Caching nycflights db at /tmp/Rtmpa6cr7q/nycflights13.sqlite
Creating table: airlines
Creating table: airports
Creating table: flights
Creating table: planes
Creating table: weather

Prints contents of flights_sqlite


In [6]:
flights_sqlite


Source:   query [?? x 19]
Database: sqlite 3.11.1 [/tmp/Rtmpa6cr7q/nycflights13.sqlite]

    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1   2013     1     1      517            515         2      830            819
2   2013     1     1      533            529         4      850            830
3   2013     1     1      542            540         2      923            850
4   2013     1     1      544            545        -1     1004           1022
5   2013     1     1      554            600        -6      812            837
6   2013     1     1      554            558        -4      740            728
7   2013     1     1      555            600        -5      913            854
8   2013     1     1      557            600        -3      709            723
9   2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>

An alternative for flights_sqlite <- tbl(nycflights13_sqlite(), "flights")


In [7]:
tbl(my_db, sql("SELECT * FROM flights"))


Source:   query [?? x 19]
Database: sqlite 3.11.1 [my_db.sqlite3]

    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1   2013     1     1      517            515         2      830            819
2   2013     1     1      533            529         4      850            830
3   2013     1     1      542            540         2      923            850
4   2013     1     1      544            545        -1     1004           1022
5   2013     1     1      554            600        -6      812            837
6   2013     1     1      554            558        -4      740            728
7   2013     1     1      555            600        -5      913            854
8   2013     1     1      557            600        -3      709            723
9   2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>

Select the mentioned columns from the flights_sqlite


In [8]:
select(flights_sqlite, year:day, dep_delay, arr_delay)


Source:   query [?? x 5]
Database: sqlite 3.11.1 [/tmp/Rtmpa6cr7q/nycflights13.sqlite]

    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
1   2013     1     1         2        11
2   2013     1     1         4        20
3   2013     1     1         2        33
4   2013     1     1        -1       -18
5   2013     1     1        -6       -25
6   2013     1     1        -4        12
7   2013     1     1        -5        19
8   2013     1     1        -3       -14
9   2013     1     1        -3        -8
10  2013     1     1        -2         8
# ... with more rows

Selects the rows based on a parameter


In [9]:
filter(flights_sqlite, dep_delay > 240)


Source:   query [?? x 19]
Database: sqlite 3.11.1 [/tmp/Rtmpa6cr7q/nycflights13.sqlite]

    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1   2013     1     1      848           1835       853     1001           1950
2   2013     1     1     1815           1325       290     2120           1542
3   2013     1     1     1842           1422       260     1958           1535
4   2013     1     1     2115           1700       255     2330           1920
5   2013     1     1     2205           1720       285       46           2040
6   2013     1     1     2343           1724       379      314           1938
7   2013     1     2     1332            904       268     1616           1128
8   2013     1     2     1412            838       334     1710           1147
9   2013     1     2     1607           1030       337     2003           1355
10  2013     1     2     2131           1512       379     2340           1741
# ... with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>

Arranges the rows based on given parameter


In [10]:
arrange(flights_sqlite, year, month, day)


Source:   query [?? x 19]
Database: sqlite 3.11.1 [/tmp/Rtmpa6cr7q/nycflights13.sqlite]

    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1   2013     1     1      517            515         2      830            819
2   2013     1     1      533            529         4      850            830
3   2013     1     1      542            540         2      923            850
4   2013     1     1      544            545        -1     1004           1022
5   2013     1     1      554            600        -6      812            837
6   2013     1     1      554            558        -4      740            728
7   2013     1     1      555            600        -5      913            854
8   2013     1     1      557            600        -3      709            723
9   2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>

Mutate adds the column in table with a given formula using other fields


In [11]:
mutate(flights_sqlite, speed = air_time / distance)


Source:   query [?? x 20]
Database: sqlite 3.11.1 [/tmp/Rtmpa6cr7q/nycflights13.sqlite]

    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1   2013     1     1      517            515         2      830            819
2   2013     1     1      533            529         4      850            830
3   2013     1     1      542            540         2      923            850
4   2013     1     1      544            545        -1     1004           1022
5   2013     1     1      554            600        -6      812            837
6   2013     1     1      554            558        -4      740            728
7   2013     1     1      555            600        -5      913            854
8   2013     1     1      557            600        -3      709            723
9   2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with more rows, and 12 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>, speed <dbl>

Summarises as per a given parameter


In [12]:
summarise(flights_sqlite, delay = mean(dep_time))


Source:   query [?? x 1]
Database: sqlite 3.11.1 [/tmp/Rtmpa6cr7q/nycflights13.sqlite]

    delay
    <dbl>
1 1349.11

Laziness Example When working with databases, dplyr tries to be as lazy as possible:

  1. It never pulls data into R unless you explicitly ask for it.
  2. It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step.

In [13]:
c1 <- filter(flights_sqlite, year == 2013, month == 1, day == 1)
c2 <- select(c1, year, month, day, carrier, dep_delay, air_time, distance)
c3 <- mutate(c2, speed = distance / air_time * 60)
c4 <- arrange(c3, year, month, day, carrier)

Fetches the data from database when requested. This is how Laziness works


In [14]:
c4


Source:   query [?? x 8]
Database: sqlite 3.11.1 [/tmp/Rtmpa6cr7q/nycflights13.sqlite]

    year month   day carrier dep_delay air_time distance    speed
   <int> <int> <int>   <chr>     <dbl>    <dbl>    <dbl>    <dbl>
1   2013     1     1      9E         0      189     1029 326.6667
2   2013     1     1      9E        -9       57      228 240.0000
3   2013     1     1      9E        -3       68      301 265.5882
4   2013     1     1      9E        -6       57      209 220.0000
5   2013     1     1      9E        -8       66      264 240.0000
6   2013     1     1      9E         0       40      184 276.0000
7   2013     1     1      9E         6      146      740 304.1096
8   2013     1     1      9E         0      139      665 287.0504
9   2013     1     1      9E        -8      150      765 306.0000
10  2013     1     1      9E        -6       41      187 273.6585
# ... with more rows

To pull down all the results use collect(), which returns a tbl_df()


In [15]:
collect(c4)


yearmonthdaycarrierdep_delayair_timedistancespeed
2013 1 1 9E 0 189 1029 326.6667
2013 1 1 9E -9 57 228 240.0000
2013 1 1 9E -3 68 301 265.5882
2013 1 1 9E -6 57 209 220.0000
2013 1 1 9E -8 66 264 240.0000
2013 1 1 9E 0 40 184 276.0000
2013 1 1 9E 6 146 740 304.1096
2013 1 1 9E 0 139 665 287.0504
2013 1 1 9E -8 150 765 306.0000
2013 1 1 9E -6 41 187 273.6585
2013 1 1 9E -10 35 94 161.1429
2013 1 1 9E 52 173 1029 356.8786
2013 1 1 9E 88 72 427 355.8333
2013 1 1 9E -4 107 509 285.4206
2013 1 1 9E -3 106 636 360.0000
2013 1 1 9E 0 207 1182 342.6087
2013 1 1 9E 0 141 740 314.8936
2013 1 1 9E -2 75 427 341.6000
2013 1 1 9E 59 NA 1391 NA
2013 1 1 9E 5 129 828 385.1163
2013 1 1 9E -5 76 340 268.4211
2013 1 1 9E -8 41 184 269.2683
2013 1 1 9E 0 30 94 188.0000
2013 1 1 9E 10 62 301 291.2903
2013 1 1 9E 42 91 425 280.2198
2013 1 1 9E 38 118 589 299.4915
2013 1 1 9E 11 43 213 297.2093
2013 1 1 9E 255 115 589 307.3043
2013 1 1 AA 2 160 1089 408.3750
2013 1 1 AA -2 138 733 318.6957
2013 1 1 VX 0 362 2475 410.2210
2013 1 1 VX 0 364 2586 426.2637
2013 1 1 VX -8 349 2475 425.5014
2013 1 1 WN -1 40 185 277.5000
2013 1 1 WN -1 279 1620 348.3871
2013 1 1 WN -3 146 711 292.1918
2013 1 1 WN 4 264 1605 364.7727
2013 1 1 WN -2 145 725 300.0000
2013 1 1 WN -3 143 738 309.6503
2013 1 1 WN -1 43 185 258.1395
2013 1 1 WN -5 176 888 302.7273
2013 1 1 WN -2 38 169 266.8421
2013 1 1 WN 0 142 711 300.4225
2013 1 1 WN -2 137 738 323.2117
2013 1 1 WN 0 264 1620 368.1818
2013 1 1 WN 6 239 1411 354.2259
2013 1 1 WN 16 334 2133 383.1737
2013 1 1 WN 5 141 725 308.5106
2013 1 1 WN -4 141 711 302.5532
2013 1 1 WN 16 133 738 332.9323
2013 1 1 WN 3 254 1605 379.1339
2013 1 1 WN -3 181 888 294.3646
2013 1 1 WN 7 181 872 289.0608
2013 1 1 WN 8 245 1411 345.5510
2013 1 1 WN -1 137 725 317.5182
2013 1 1 WN 6 148 711 288.2432
2013 1 1 WN 10 46 185 241.3043
2013 1 1 WN -2 131 738 338.0153
2013 1 1 WN 31 134 711 318.3582
2013 1 1 WN -2 134 725 324.6269

You can see the query dplyr has generated by looking at the query component of the object


In [16]:
c4$query


NULL

You can also ask the database how it plans to execute the query with explain(). It’s helpful if you’re trying to figure out which indexes are being used.


In [17]:
explain(c4)


<SQL>
SELECT *
FROM (SELECT `year`, `month`, `day`, `carrier`, `dep_delay`, `air_time`, `distance`, `distance` / `air_time` * 60.0 AS `speed`
FROM (SELECT `year` AS `year`, `month` AS `month`, `day` AS `day`, `carrier` AS `carrier`, `dep_delay` AS `dep_delay`, `air_time` AS `air_time`, `distance` AS `distance`
FROM (SELECT *
FROM `flights`
WHERE ((`year` = 2013.0) AND (`month` = 1.0) AND (`day` = 1.0)))))
ORDER BY `year`, `month`, `day`, `carrier`


<PLAN>
   addr       opcode p1    p2 p3           p4 p5 comment
1     0         Init  0    56  0              00      NA
2     1   SorterOpen  4     9  0       k(1,B) 00      NA
3     2     OpenRead  3   131  0           19 00      NA
4     3     OpenRead  5 22656  0     k(4,,,,) 02      NA
5     4         Real  0     1  0         2013 00      NA
6     5         Real  0     2  0            1 00      NA
7     6         Real  0     3  0            1 00      NA
8     7     Affinity  1     3  0          DDD 00      NA
9     8       SeekGE  5    36  1            3 00      NA
10    9        IdxGT  5    36  1            3 00      NA
11   10         Seek  5     0  3              00      NA
12   11       Column  5     0  8              00      NA
13   12       Column  5     1  9              00      NA
14   13       Column  5     2 10              00      NA
15   14       Column  3     9 11              00      NA
16   15       Column  3     5 12              00      NA
17   16 RealAffinity 12     0  0              00      NA
18   17       Column  3    14 13              00      NA
19   18 RealAffinity 13     0  0              00      NA
20   19       Column  3    15 14              00      NA
21   20 RealAffinity 14     0  0              00      NA
22   21       Divide 13    14 16              00      NA
23   22     Multiply 18    16 15              00      NA
24   23         Copy  8     4  0              00      NA
25   24         Copy  9     5  0              00      NA
26   25         Copy 10     6  0              00      NA
27   26         Copy 11     7  0              00      NA
28   27   MakeRecord  7     9 19              00      NA
29   28 SequenceTest  4    33  0              00      NA
30   29      Compare 20     4  3 k(4,B,B,B,B) 00      NA
31   30         Jump 31    34 31              00      NA
32   31        Gosub 23    40  0              00      NA
33   32  ResetSorter  4     0  0              00      NA
34   33         Move  4    20  3              00      NA
35   34 SorterInsert  4    19  0              00      NA
36   35         Next  5     9  1              00      NA
37   36        Close  3     0  0              00      NA
38   37        Close  5     0  0              00      NA
39   38        Gosub 23    40  0              00      NA
40   39         Goto  0    55  0              00      NA
41   40         Once  0    42  0              00      NA
42   41   OpenPseudo  6    24 10              00      NA
43   42   SorterSort  4    55  0              00      NA
44   43   SorterData  4    24  6              00      NA
45   44       Column  6     1  8              00      NA
46   45       Column  6     2  9              00      NA
47   46       Column  6     3 10              00      NA
48   47       Column  6     4 11              00      NA
49   48       Column  6     5 12              00      NA
50   49       Column  6     6 13              00      NA
51   50       Column  6     7 14              00      NA
52   51       Column  6     8 15              00      NA
53   52    ResultRow  8     8  0              00      NA
54   53   SorterNext  4    43  0              00      NA
55   54       Return 23     0  0              00      NA
56   55         Halt  0     0  0              00      NA
57   56  Transaction  0     0 16            0 01      NA
58   57    TableLock  0   131  0      flights 00      NA
59   58         Real  0    18  0           60 00      NA
60   59         Goto  0     1  0              00      NA

When performing the simple mathematical operations used when filtering, mutating or summarising, translating R code to SQL (or indeed to any programming language) is relatively straightforward. dplyr knows how to convert the R functions to SQL


In [18]:
translate_sql(x)


'"x"'

In [19]:
translate_sql("x")


'\'x''

In [20]:
translate_sql(x == 1 && (y < 2 || z > 3))


'"x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)'

In [21]:
translate_sql(x ^ 2 < 10)


'POWER("x", 2.0) < 10.0'

In [22]:
translate_sql(x %% 2 == 10)


'"x" % 2.0 = 10.0'

In [23]:
translate_sql(1)


'1.0'

In [24]:
translate_sql(1L)


'1'

In [26]:
translate_sql(glob(x, y))


'GLOB("x", "y")'

In [27]:
translate_sql(x %like% "ab*")


'"x" LIKE \'ab*''

SQLite lacks the window functions that are needed for grouped mutation and filtering. This means that the only really useful operations for grouped SQLite tables are found in summarise(). The grouped summarise from the introduction translates well - the only difference is that databases always drop NULLs (their equivalent of missing values), so we don’t supply na.rm = TRUE.


In [28]:
by_tailnum <- group_by(flights_sqlite, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance),
  delay = mean(arr_delay)
)
delay <- filter(delay, count > 20, dist < 2000)
delay_local <- collect(delay)

References