DIC LAB 2 Problem 0 : dplyr database Vignette

Define all the libraries which needs to be set for operations here



In [1]:

    
library("dplyr")
library("nycflights13")









    



Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

Creates the my_db database using the command src_sqlite



In [2]:

    
my_db <- src_sqlite("my_db.sqlite3", create = T)

Import the flights data using the convenient copy_to() function.



In [3]:

    
flights_sqlite <- copy_to(my_db, flights, temporary = FALSE, indexes = list(c("year", "month", "day"), "carrier", "tailnum"))

Print a few rows



In [4]:

    
head(flights_sqlite)









    





Source:   query [?? x 19]
Database: sqlite 3.11.1 [my_db.sqlite3]

   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ... with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dbl>

A built-in src that will cache flights in a standard location



In [5]:

    
flights_sqlite <- tbl(nycflights13_sqlite(), "flights")









    



Caching nycflights db at /tmp/Rtmpa6cr7q/nycflights13.sqlite
Creating table: airlines
Creating table: airports
Creating table: flights
Creating table: planes
Creating table: weather

Prints contents of flights_sqlite



In [6]:

    
flights_sqlite









    





Source:   query [?? x 19]
Database: sqlite 3.11.1 [/tmp/Rtmpa6cr7q/nycflights13.sqlite]

    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1   2013     1     1      517            515         2      830            819
2   2013     1     1      533            529         4      850            830
3   2013     1     1      542            540         2      923            850
4   2013     1     1      544            545        -1     1004           1022
5   2013     1     1      554            600        -6      812            837
6   2013     1     1      554            558        -4      740            728
7   2013     1     1      555            600        -5      913            854
8   2013     1     1      557            600        -3      709            723
9   2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>

An alternative for flights_sqlite <- tbl(nycflights13_sqlite(), "flights")



In [7]:

    
tbl(my_db, sql("SELECT * FROM flights"))









    





Source:   query [?? x 19]
Database: sqlite 3.11.1 [my_db.sqlite3]

    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1   2013     1     1      517            515         2      830            819
2   2013     1     1      533            529         4      850            830
3   2013     1     1      542            540         2      923            850
4   2013     1     1      544            545        -1     1004           1022
5   2013     1     1      554            600        -6      812            837
6   2013     1     1      554            558        -4      740            728
7   2013     1     1      555            600        -5      913            854
8   2013     1     1      557            600        -3      709            723
9   2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>

Select the mentioned columns from the flights_sqlite



In [8]:

    
select(flights_sqlite, year:day, dep_delay, arr_delay)









    





Source:   query [?? x 5]
Database: sqlite 3.11.1 [/tmp/Rtmpa6cr7q/nycflights13.sqlite]

    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
1   2013     1     1         2        11
2   2013     1     1         4        20
3   2013     1     1         2        33
4   2013     1     1        -1       -18
5   2013     1     1        -6       -25
6   2013     1     1        -4        12
7   2013     1     1        -5        19
8   2013     1     1        -3       -14
9   2013     1     1        -3        -8
10  2013     1     1        -2         8
# ... with more rows

Selects the rows based on a parameter



In [9]:

    
filter(flights_sqlite, dep_delay > 240)









    





Source:   query [?? x 19]
Database: sqlite 3.11.1 [/tmp/Rtmpa6cr7q/nycflights13.sqlite]

    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1   2013     1     1      848           1835       853     1001           1950
2   2013     1     1     1815           1325       290     2120           1542
3   2013     1     1     1842           1422       260     1958           1535
4   2013     1     1     2115           1700       255     2330           1920
5   2013     1     1     2205           1720       285       46           2040
6   2013     1     1     2343           1724       379      314           1938
7   2013     1     2     1332            904       268     1616           1128
8   2013     1     2     1412            838       334     1710           1147
9   2013     1     2     1607           1030       337     2003           1355
10  2013     1     2     2131           1512       379     2340           1741
# ... with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>

Arranges the rows based on given parameter



In [10]:

    
arrange(flights_sqlite, year, month, day)









    





Source:   query [?? x 19]
Database: sqlite 3.11.1 [/tmp/Rtmpa6cr7q/nycflights13.sqlite]

    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1   2013     1     1      517            515         2      830            819
2   2013     1     1      533            529         4      850            830
3   2013     1     1      542            540         2      923            850
4   2013     1     1      544            545        -1     1004           1022
5   2013     1     1      554            600        -6      812            837
6   2013     1     1      554            558        -4      740            728
7   2013     1     1      555            600        -5      913            854
8   2013     1     1      557            600        -3      709            723
9   2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>

Mutate adds the column in table with a given formula using other fields



In [11]:

    
mutate(flights_sqlite, speed = air_time / distance)









    





Source:   query [?? x 20]
Database: sqlite 3.11.1 [/tmp/Rtmpa6cr7q/nycflights13.sqlite]

    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1   2013     1     1      517            515         2      830            819
2   2013     1     1      533            529         4      850            830
3   2013     1     1      542            540         2      923            850
4   2013     1     1      544            545        -1     1004           1022
5   2013     1     1      554            600        -6      812            837
6   2013     1     1      554            558        -4      740            728
7   2013     1     1      555            600        -5      913            854
8   2013     1     1      557            600        -3      709            723
9   2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with more rows, and 12 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>, speed <dbl>

Summarises as per a given parameter



In [12]:

    
summarise(flights_sqlite, delay = mean(dep_time))









    





Source:   query [?? x 1]
Database: sqlite 3.11.1 [/tmp/Rtmpa6cr7q/nycflights13.sqlite]

    delay
    <dbl>
1 1349.11

Laziness Example When working with databases, dplyr tries to be as lazy as possible:

It never pulls data into R unless you explicitly ask for it.
It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step.



In [13]:

    
c1 <- filter(flights_sqlite, year == 2013, month == 1, day == 1)
c2 <- select(c1, year, month, day, carrier, dep_delay, air_time, distance)
c3 <- mutate(c2, speed = distance / air_time * 60)
c4 <- arrange(c3, year, month, day, carrier)

Fetches the data from database when requested. This is how Laziness works



In [14]:

    
c4









    





Source:   query [?? x 8]
Database: sqlite 3.11.1 [/tmp/Rtmpa6cr7q/nycflights13.sqlite]

    year month   day carrier dep_delay air_time distance    speed
   <int> <int> <int>   <chr>     <dbl>    <dbl>    <dbl>    <dbl>
1   2013     1     1      9E         0      189     1029 326.6667
2   2013     1     1      9E        -9       57      228 240.0000
3   2013     1     1      9E        -3       68      301 265.5882
4   2013     1     1      9E        -6       57      209 220.0000
5   2013     1     1      9E        -8       66      264 240.0000
6   2013     1     1      9E         0       40      184 276.0000
7   2013     1     1      9E         6      146      740 304.1096
8   2013     1     1      9E         0      139      665 287.0504
9   2013     1     1      9E        -8      150      765 306.0000
10  2013     1     1      9E        -6       41      187 273.6585
# ... with more rows

To pull down all the results use collect(), which returns a tbl_df()



In [15]:

    
collect(c4)









    





year month day carrier dep_delay air_time distance speed

	2013    1       1       9E        0     189     1029    326.6667
	2013    1       1       9E       -9      57      228    240.0000
	2013    1       1       9E       -3      68      301    265.5882
	2013    1       1       9E       -6      57      209    220.0000
	2013    1       1       9E       -8      66      264    240.0000
	2013    1       1       9E        0      40      184    276.0000
	2013    1       1       9E        6     146      740    304.1096
	2013    1       1       9E        0     139      665    287.0504
	2013    1       1       9E       -8     150      765    306.0000
	2013    1       1       9E       -6      41      187    273.6585
	2013    1       1       9E      -10      35       94    161.1429
	2013    1       1       9E       52     173     1029    356.8786
	2013    1       1       9E       88      72      427    355.8333
	2013    1       1       9E       -4     107      509    285.4206
	2013    1       1       9E       -3     106      636    360.0000
	2013    1       1       9E        0     207     1182    342.6087
	2013    1       1       9E        0     141      740    314.8936
	2013    1       1       9E       -2      75      427    341.6000
	2013    1       1       9E       59      NA     1391          NA
	2013    1       1       9E        5     129      828    385.1163
	2013    1       1       9E       -5      76      340    268.4211
	2013    1       1       9E       -8      41      184    269.2683
	2013    1       1       9E        0      30       94    188.0000
	2013    1       1       9E       10      62      301    291.2903
	2013    1       1       9E       42      91      425    280.2198
	2013    1       1       9E       38     118      589    299.4915
	2013    1       1       9E       11      43      213    297.2093
	2013    1       1       9E      255     115      589    307.3043
	2013    1       1       AA        2     160     1089    408.3750
	2013    1       1       AA       -2     138      733    318.6957
	⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
	2013    1       1       VX       0      362     2475    410.2210
	2013    1       1       VX       0      364     2586    426.2637
	2013    1       1       VX      -8      349     2475    425.5014
	2013    1       1       WN      -1       40      185    277.5000
	2013    1       1       WN      -1      279     1620    348.3871
	2013    1       1       WN      -3      146      711    292.1918
	2013    1       1       WN       4      264     1605    364.7727
	2013    1       1       WN      -2      145      725    300.0000
	2013    1       1       WN      -3      143      738    309.6503
	2013    1       1       WN      -1       43      185    258.1395
	2013    1       1       WN      -5      176      888    302.7273
	2013    1       1       WN      -2       38      169    266.8421
	2013    1       1       WN       0      142      711    300.4225
	2013    1       1       WN      -2      137      738    323.2117
	2013    1       1       WN       0      264     1620    368.1818
	2013    1       1       WN       6      239     1411    354.2259
	2013    1       1       WN      16      334     2133    383.1737
	2013    1       1       WN       5      141      725    308.5106
	2013    1       1       WN      -4      141      711    302.5532
	2013    1       1       WN      16      133      738    332.9323
	2013    1       1       WN       3      254     1605    379.1339
	2013    1       1       WN      -3      181      888    294.3646
	2013    1       1       WN       7      181      872    289.0608
	2013    1       1       WN       8      245     1411    345.5510
	2013    1       1       WN      -1      137      725    317.5182
	2013    1       1       WN       6      148      711    288.2432
	2013    1       1       WN      10       46      185    241.3043
	2013    1       1       WN      -2      131      738    338.0153
	2013    1       1       WN      31      134      711    318.3582
	2013    1       1       WN      -2      134      725    324.6269

You can see the query dplyr has generated by looking at the query component of the object



In [16]:

    
c4$query









    





NULL

You can also ask the database how it plans to execute the query with explain(). It’s helpful if you’re trying to figure out which indexes are being used.



In [17]:

    
explain(c4)









    



<SQL>
SELECT *
FROM (SELECT `year`, `month`, `day`, `carrier`, `dep_delay`, `air_time`, `distance`, `distance` / `air_time` * 60.0 AS `speed`
FROM (SELECT `year` AS `year`, `month` AS `month`, `day` AS `day`, `carrier` AS `carrier`, `dep_delay` AS `dep_delay`, `air_time` AS `air_time`, `distance` AS `distance`
FROM (SELECT *
FROM `flights`
WHERE ((`year` = 2013.0) AND (`month` = 1.0) AND (`day` = 1.0)))))
ORDER BY `year`, `month`, `day`, `carrier`


<PLAN>
   addr       opcode p1    p2 p3           p4 p5 comment
1     0         Init  0    56  0              00      NA
2     1   SorterOpen  4     9  0       k(1,B) 00      NA
3     2     OpenRead  3   131  0           19 00      NA
4     3     OpenRead  5 22656  0     k(4,,,,) 02      NA
5     4         Real  0     1  0         2013 00      NA
6     5         Real  0     2  0            1 00      NA
7     6         Real  0     3  0            1 00      NA
8     7     Affinity  1     3  0          DDD 00      NA
9     8       SeekGE  5    36  1            3 00      NA
10    9        IdxGT  5    36  1            3 00      NA
11   10         Seek  5     0  3              00      NA
12   11       Column  5     0  8              00      NA
13   12       Column  5     1  9              00      NA
14   13       Column  5     2 10              00      NA
15   14       Column  3     9 11              00      NA
16   15       Column  3     5 12              00      NA
17   16 RealAffinity 12     0  0              00      NA
18   17       Column  3    14 13              00      NA
19   18 RealAffinity 13     0  0              00      NA
20   19       Column  3    15 14              00      NA
21   20 RealAffinity 14     0  0              00      NA
22   21       Divide 13    14 16              00      NA
23   22     Multiply 18    16 15              00      NA
24   23         Copy  8     4  0              00      NA
25   24         Copy  9     5  0              00      NA
26   25         Copy 10     6  0              00      NA
27   26         Copy 11     7  0              00      NA
28   27   MakeRecord  7     9 19              00      NA
29   28 SequenceTest  4    33  0              00      NA
30   29      Compare 20     4  3 k(4,B,B,B,B) 00      NA
31   30         Jump 31    34 31              00      NA
32   31        Gosub 23    40  0              00      NA
33   32  ResetSorter  4     0  0              00      NA
34   33         Move  4    20  3              00      NA
35   34 SorterInsert  4    19  0              00      NA
36   35         Next  5     9  1              00      NA
37   36        Close  3     0  0              00      NA
38   37        Close  5     0  0              00      NA
39   38        Gosub 23    40  0              00      NA
40   39         Goto  0    55  0              00      NA
41   40         Once  0    42  0              00      NA
42   41   OpenPseudo  6    24 10              00      NA
43   42   SorterSort  4    55  0              00      NA
44   43   SorterData  4    24  6              00      NA
45   44       Column  6     1  8              00      NA
46   45       Column  6     2  9              00      NA
47   46       Column  6     3 10              00      NA
48   47       Column  6     4 11              00      NA
49   48       Column  6     5 12              00      NA
50   49       Column  6     6 13              00      NA
51   50       Column  6     7 14              00      NA
52   51       Column  6     8 15              00      NA
53   52    ResultRow  8     8  0              00      NA
54   53   SorterNext  4    43  0              00      NA
55   54       Return 23     0  0              00      NA
56   55         Halt  0     0  0              00      NA
57   56  Transaction  0     0 16            0 01      NA
58   57    TableLock  0   131  0      flights 00      NA
59   58         Real  0    18  0           60 00      NA
60   59         Goto  0     1  0              00      NA

When performing the simple mathematical operations used when filtering, mutating or summarising, translating R code to SQL (or indeed to any programming language) is relatively straightforward. dplyr knows how to convert the R functions to SQL



In [18]:

    
translate_sql(x)









    




'"x"'



In [19]:

    
translate_sql("x")









    




'\'x''



In [20]:

    
translate_sql(x == 1 && (y < 2 || z > 3))









    




'"x" = 1.0 AND ("y" < 2.0 OR "z" > 3.0)'



In [21]:

    
translate_sql(x ^ 2 < 10)









    




'POWER("x", 2.0) < 10.0'



In [22]:

    
translate_sql(x %% 2 == 10)









    




'"x" % 2.0 = 10.0'



In [23]:

    
translate_sql(1)









    




'1.0'



In [24]:

    
translate_sql(1L)

'1'



In [26]:

    
translate_sql(glob(x, y))









    




'GLOB("x", "y")'



In [27]:

    
translate_sql(x %like% "ab*")









    




'"x" LIKE \'ab*''

SQLite lacks the window functions that are needed for grouped mutation and filtering. This means that the only really useful operations for grouped SQLite tables are found in summarise(). The grouped summarise from the introduction translates well - the only difference is that databases always drop NULLs (their equivalent of missing values), so we don’t supply na.rm = TRUE.



In [28]:

    
by_tailnum <- group_by(flights_sqlite, tailnum)
delay <- summarise(by_tailnum,
  count = n(),
  dist = mean(distance),
  delay = mean(arr_delay)
)
delay <- filter(delay, count > 20, dist < 2000)
delay_local <- collect(delay)

References

https://cran.r-project.org/web/packages/dplyr/vignettes/databases.html

year	month	day	carrier	dep_delay	air_time	distance	speed
2013	1	1	9E	0	189	1029	326.6667
2013	1	1	9E	-9	57	228	240.0000
2013	1	1	9E	-3	68	301	265.5882
2013	1	1	9E	-6	57	209	220.0000
2013	1	1	9E	-8	66	264	240.0000
2013	1	1	9E	0	40	184	276.0000
2013	1	1	9E	6	146	740	304.1096
2013	1	1	9E	0	139	665	287.0504
2013	1	1	9E	-8	150	765	306.0000
2013	1	1	9E	-6	41	187	273.6585
2013	1	1	9E	-10	35	94	161.1429
2013	1	1	9E	52	173	1029	356.8786
2013	1	1	9E	88	72	427	355.8333
2013	1	1	9E	-4	107	509	285.4206
2013	1	1	9E	-3	106	636	360.0000
2013	1	1	9E	0	207	1182	342.6087
2013	1	1	9E	0	141	740	314.8936
2013	1	1	9E	-2	75	427	341.6000
2013	1	1	9E	59	NA	1391	NA
2013	1	1	9E	5	129	828	385.1163
2013	1	1	9E	-5	76	340	268.4211
2013	1	1	9E	-8	41	184	269.2683
2013	1	1	9E	0	30	94	188.0000
2013	1	1	9E	10	62	301	291.2903
2013	1	1	9E	42	91	425	280.2198
2013	1	1	9E	38	118	589	299.4915
2013	1	1	9E	11	43	213	297.2093
2013	1	1	9E	255	115	589	307.3043
2013	1	1	AA	2	160	1089	408.3750
2013	1	1	AA	-2	138	733	318.6957
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
2013	1	1	VX	0	362	2475	410.2210
2013	1	1	VX	0	364	2586	426.2637
2013	1	1	VX	-8	349	2475	425.5014
2013	1	1	WN	-1	40	185	277.5000
2013	1	1	WN	-1	279	1620	348.3871
2013	1	1	WN	-3	146	711	292.1918
2013	1	1	WN	4	264	1605	364.7727
2013	1	1	WN	-2	145	725	300.0000
2013	1	1	WN	-3	143	738	309.6503
2013	1	1	WN	-1	43	185	258.1395
2013	1	1	WN	-5	176	888	302.7273
2013	1	1	WN	-2	38	169	266.8421
2013	1	1	WN	0	142	711	300.4225
2013	1	1	WN	-2	137	738	323.2117
2013	1	1	WN	0	264	1620	368.1818
2013	1	1	WN	6	239	1411	354.2259
2013	1	1	WN	16	334	2133	383.1737
2013	1	1	WN	5	141	725	308.5106
2013	1	1	WN	-4	141	711	302.5532
2013	1	1	WN	16	133	738	332.9323
2013	1	1	WN	3	254	1605	379.1339
2013	1	1	WN	-3	181	888	294.3646
2013	1	1	WN	7	181	872	289.0608
2013	1	1	WN	8	245	1411	345.5510
2013	1	1	WN	-1	137	725	317.5182
2013	1	1	WN	6	148	711	288.2432
2013	1	1	WN	10	46	185	241.3043
2013	1	1	WN	-2	131	738	338.0153
2013	1	1	WN	31	134	711	318.3582
2013	1	1	WN	-2	134	725	324.6269