Bases de datos en R

Ulises M. Alvarez

Porqué usar bases de datos

Podemos interrogar nuestros datos (hacerles preguntas).
Podemos buscar datos con relativa rapidez.
Podemos relacionar los datos de dos tablas diferentes utilizando JOINs.
Podemos crear informes significativos de nuestros datos.
Los datos tienen una estructura incorporada.
La información de un tipo, cualquiera que este sea, sólo se almacena una única vez.
Las DB son ACID (acrónimo de Atomicity, Consistency, Isolation and Durability)
Las DB son tolerantes a fallas.
Las DB pueden manejar grandes conjuntos de datos.
Las DB son concurrentes, esto es, múltiples usuarios pueden acceder al mismo tiempo a los datos sin corromperlos.
Las DB son escalables.

Fuente

https://softwareengineering.stackexchange.com/questions/190482/why-use-a-database-instead-of-just-saving-your-data-to-disk

ACID (Atomicity, Consistency, Isolation and Durability)

Atomicidad.

Si cuando una operación consiste en una serie de pasos, bien todos ellos se ejecutan o bien ninguno, es decir, las transacciones son completas.

Consistencia (integridad).

Es la propiedad que asegura que sólo se empieza aquello que se puede acabar. Por lo tanto se ejecutan aquellas operaciones que no van a romper las reglas y directrices de Integridad de la base de datos.

Aislamiento.

Esta propiedad asegura que una operación no puede afectar a otras.

Durabilidad (persistencia).

Esta propiedad asegura que una vez realizada la operación, ésta persistirá y no se podrá deshacer aunque falle el sistema.

Fuente:

https://es.wikipedia.org/wiki/ACID

Bases de datos en R

Vamos a usar DB en R cuando:

Deseamos trabajar con los datos de una base sin exportarlos, o
Tenemos un conjunto de datos que no cabe en la RAM de nuestro equipo.



In [1]:

    
options(repos = "https://cloud.r-project.org")      # Definimos repositorio 
if (!require("pacman")) install.packages("pacman")  # Si no tenemos, pacman, lo instalamos









    



Loading required package: pacman



In [2]:

    
pacman::p_load(dplyr,                               # librería para manipular datos
              RSQLite,                              # librería para acceder a SQLite
              nycflights13)                         # data for all, 336776, flights departing NYC in 2013



In [18]:

    
my_db <- src_sqlite("my_db.sqlite3", create = T)    # just give the path and the ok to create a table.

Ahora, copiaremos los datos mediante:

copy_to()

El uso de esta función no es recomendable con grandes conjuntos de datos.



In [20]:

    
flights_sqlite <- copy_to(my_db,                           # destino
                          flights,                         # DF de origen 
                          temporary = FALSE,               # hacemos nuestro objeto no temporal
                          indexes =                        # Generamos el "index" para acelerar análisis
                          list(c("year", "month", "day"),  
                               "carrier", "tailnum"))









    



Error: Table flights already exists.
Traceback:

1. copy_to(my_db, flights, temporary = FALSE, indexes = list(c("year", 
 .     "month", "day"), "carrier", "tailnum"))
2. copy_to.src_sql(my_db, flights, temporary = FALSE, indexes = list(c("year", 
 .     "month", "day"), "carrier", "tailnum"))
3. stop("Table ", name, " already exists.", call. = FALSE)



In [22]:

    
flights_sqlite <- tbl(nycflights13_sqlite(), "flights")    # Caching nycflights db



In [24]:

    
explain(flights_sqlite)









    



<SQL>
SELECT *
FROM `flights`


<PLAN>
   addr       opcode p1  p2 p3      p4 p5 comment
1     0         Init  0  33  0         00      NA
2     1     OpenRead  0 131  0      19 00      NA
3     2       Rewind  0  31  0         00      NA
4     3       Column  0   0  1         00      NA
5     4       Column  0   1  2         00      NA
6     5       Column  0   2  3         00      NA
7     6       Column  0   3  4         00      NA
8     7       Column  0   4  5         00      NA
9     8       Column  0   5  6         00      NA
10    9 RealAffinity  6   0  0         00      NA
11   10       Column  0   6  7         00      NA
12   11       Column  0   7  8         00      NA
13   12       Column  0   8  9         00      NA
14   13 RealAffinity  9   0  0         00      NA
15   14       Column  0   9 10         00      NA
16   15       Column  0  10 11         00      NA
17   16       Column  0  11 12         00      NA
18   17       Column  0  12 13         00      NA
19   18       Column  0  13 14         00      NA
20   19       Column  0  14 15         00      NA
21   20 RealAffinity 15   0  0         00      NA
22   21       Column  0  15 16         00      NA
23   22 RealAffinity 16   0  0         00      NA
24   23       Column  0  16 17         00      NA
25   24 RealAffinity 17   0  0         00      NA
26   25       Column  0  17 18         00      NA
27   26 RealAffinity 18   0  0         00      NA
28   27       Column  0  18 19         00      NA
29   28 RealAffinity 19   0  0         00      NA
30   29    ResultRow  1  19  0         00      NA
31   30         Next  0   3  0         01      NA
32   31        Close  0   0  0         00      NA
33   32         Halt  0   0  0         00      NA
34   33  Transaction  0   0 16       0 01      NA
35   34    TableLock  0 131  0 flights 00      NA
36   35         Goto  0   1  0         00      NA



In [25]:

    
tbl(my_db, sql("SELECT * FROM flights"))  # Probemos a ejecutar la búsqueda con SQL...









    





Source:   query [?? x 19]
Database: sqlite 3.11.1 [my_db.sqlite3]

# A tibble: ?? x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>

Acciones básicas

Select columns with select()



In [26]:

    
select(flights_sqlite,  # los datos 
       year:day,        # seleccionamos desde la columna 'year', hasta 'day'
       dep_delay,       # también 'dep_delay'
       arr_delay)       # y 'arr_delay'









    





Source:   query [?? x 5]
Database: sqlite 3.11.1 [/tmp/Rtmplxf76Y/nycflights13.sqlite]

# A tibble: ?? x 5
    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
 1  2013     1     1         2        11
 2  2013     1     1         4        20
 3  2013     1     1         2        33
 4  2013     1     1        -1       -18
 5  2013     1     1        -6       -25
 6  2013     1     1        -4        12
 7  2013     1     1        -5        19
 8  2013     1     1        -3       -14
 9  2013     1     1        -3        -8
10  2013     1     1        -2         8
# ... with more rows

Filter rows with filter()



In [27]:

    
filter(flights_sqlite,   # los datos
       dep_delay > 240)  # seleccionamos un retraso mayor a 240...









    





Source:   query [?? x 19]
Database: sqlite 3.11.1 [/tmp/Rtmplxf76Y/nycflights13.sqlite]

# A tibble: ?? x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      848           1835       853     1001           1950
 2  2013     1     1     1815           1325       290     2120           1542
 3  2013     1     1     1842           1422       260     1958           1535
 4  2013     1     1     2115           1700       255     2330           1920
 5  2013     1     1     2205           1720       285       46           2040
 6  2013     1     1     2343           1724       379      314           1938
 7  2013     1     2     1332            904       268     1616           1128
 8  2013     1     2     1412            838       334     1710           1147
 9  2013     1     2     1607           1030       337     2003           1355
10  2013     1     2     2131           1512       379     2340           1741
# ... with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>

Arrange rows with arrange()

Parecido a filter, pero reordena. Si se selecciona más de una columna, cada columna adicional se usará para reordenar la precedente.



In [30]:

    
arrange(flights_sqlite,    # los datos
        year, month, day)  # las columnas









    





Source:   query [?? x 19]
Database: sqlite 3.11.1 [/tmp/Rtmplxf76Y/nycflights13.sqlite]

# A tibble: ?? x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>

Add new columns with mutate()



In [31]:

    
mutate(flights_sqlite,               # los datos
       speed = air_time / distance)  # agregamos la columna 'speed'









    





Source:   query [?? x 20]
Database: sqlite 3.11.1 [/tmp/Rtmplxf76Y/nycflights13.sqlite]

# A tibble: ?? x 20
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with more rows, and 12 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>, speed <dbl>

Summarise values with summarise()



In [32]:

    
summarise(flights_sqlite,          # los datos
          delay = mean(dep_time))  # la función a aplicar, mean()









    





Source:   query [?? x 1]
Database: sqlite 3.11.1 [/tmp/Rtmplxf76Y/nycflights13.sqlite]

# A tibble: ?? x 1
    delay
    <dbl>
1 1349.11

"Laziness"

When working with databases, dplyr tries to be as lazy as possible:

It never pulls data into R unless you explicitly ask for it.
It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step.

For example, take the following code:



In [13]:

    
c1 <- filter(flights_sqlite, year == 2013, month == 1, day == 1)
c2 <- select(c1, year, month, day, carrier, dep_delay, air_time, distance)
c3 <- mutate(c2, speed = distance / air_time * 60)
c4 <- arrange(c3, year, month, day, carrier)

Suprisingly, this sequence of operations never actually touches the database. It’s not until you ask for the data (e.g. by printing c4) that dplyr generates the SQL and requests the results from the database. Even then it only pulls down 10 rows.



In [14]:

    
c4









    





Source:   query [?? x 8]
Database: sqlite 3.11.1 [/tmp/Rtmplxf76Y/nycflights13.sqlite]

# A tibble: ?? x 8
    year month   day carrier dep_delay air_time distance    speed
   <int> <int> <int>   <chr>     <dbl>    <dbl>    <dbl>    <dbl>
 1  2013     1     1      9E         0      189     1029 326.6667
 2  2013     1     1      9E        -9       57      228 240.0000
 3  2013     1     1      9E        -3       68      301 265.5882
 4  2013     1     1      9E        -6       57      209 220.0000
 5  2013     1     1      9E        -8       66      264 240.0000
 6  2013     1     1      9E         0       40      184 276.0000
 7  2013     1     1      9E         6      146      740 304.1096
 8  2013     1     1      9E         0      139      665 287.0504
 9  2013     1     1      9E        -8      150      765 306.0000
10  2013     1     1      9E        -6       41      187 273.6585
# ... with more rows

To pull down all the results use collect(), which returns a tbl_df():



In [15]:

    
collect(c4)









    





year month day carrier dep_delay air_time distance speed

	2013    1       1       9E        0     189     1029    326.6667
	2013    1       1       9E       -9      57      228    240.0000
	2013    1       1       9E       -3      68      301    265.5882
	2013    1       1       9E       -6      57      209    220.0000
	2013    1       1       9E       -8      66      264    240.0000
	2013    1       1       9E        0      40      184    276.0000
	2013    1       1       9E        6     146      740    304.1096
	2013    1       1       9E        0     139      665    287.0504
	2013    1       1       9E       -8     150      765    306.0000
	2013    1       1       9E       -6      41      187    273.6585
	2013    1       1       9E      -10      35       94    161.1429
	2013    1       1       9E       52     173     1029    356.8786
	2013    1       1       9E       88      72      427    355.8333
	2013    1       1       9E       -4     107      509    285.4206
	2013    1       1       9E       -3     106      636    360.0000
	2013    1       1       9E        0     207     1182    342.6087
	2013    1       1       9E        0     141      740    314.8936
	2013    1       1       9E       -2      75      427    341.6000
	2013    1       1       9E       59      NA     1391          NA
	2013    1       1       9E        5     129      828    385.1163
	2013    1       1       9E       -5      76      340    268.4211
	2013    1       1       9E       -8      41      184    269.2683
	2013    1       1       9E        0      30       94    188.0000
	2013    1       1       9E       10      62      301    291.2903
	2013    1       1       9E       42      91      425    280.2198
	2013    1       1       9E       38     118      589    299.4915
	2013    1       1       9E       11      43      213    297.2093
	2013    1       1       9E      255     115      589    307.3043
	2013    1       1       AA        2     160     1089    408.3750
	2013    1       1       AA       -2     138      733    318.6957
	⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮ ⋮
	2013    1       1       VX       0      362     2475    410.2210
	2013    1       1       VX       0      364     2586    426.2637
	2013    1       1       VX      -8      349     2475    425.5014
	2013    1       1       WN      -1       40      185    277.5000
	2013    1       1       WN      -1      279     1620    348.3871
	2013    1       1       WN      -3      146      711    292.1918
	2013    1       1       WN       4      264     1605    364.7727
	2013    1       1       WN      -2      145      725    300.0000
	2013    1       1       WN      -3      143      738    309.6503
	2013    1       1       WN      -1       43      185    258.1395
	2013    1       1       WN      -5      176      888    302.7273
	2013    1       1       WN      -2       38      169    266.8421
	2013    1       1       WN       0      142      711    300.4225
	2013    1       1       WN      -2      137      738    323.2117
	2013    1       1       WN       0      264     1620    368.1818
	2013    1       1       WN       6      239     1411    354.2259
	2013    1       1       WN      16      334     2133    383.1737
	2013    1       1       WN       5      141      725    308.5106
	2013    1       1       WN      -4      141      711    302.5532
	2013    1       1       WN      16      133      738    332.9323
	2013    1       1       WN       3      254     1605    379.1339
	2013    1       1       WN      -3      181      888    294.3646
	2013    1       1       WN       7      181      872    289.0608
	2013    1       1       WN       8      245     1411    345.5510
	2013    1       1       WN      -1      137      725    317.5182
	2013    1       1       WN       6      148      711    288.2432
	2013    1       1       WN      10       46      185    241.3043
	2013    1       1       WN      -2      131      738    338.0153
	2013    1       1       WN      31      134      711    318.3582
	2013    1       1       WN      -2      134      725    324.6269

You can see the query dplyr has generated by looking at the query component of the object:



In [16]:

    
c4$query









    





NULL

You can also ask the database how it plans to execute the query with explain(). The output for SQLite is described in more detail on the SQLite website. It’s helpful if you’re trying to figure out which indexes are being used.



In [17]:

    
explain(c4)









    



<SQL>
SELECT *
FROM (SELECT `year`, `month`, `day`, `carrier`, `dep_delay`, `air_time`, `distance`, `distance` / `air_time` * 60.0 AS `speed`
FROM (SELECT `year` AS `year`, `month` AS `month`, `day` AS `day`, `carrier` AS `carrier`, `dep_delay` AS `dep_delay`, `air_time` AS `air_time`, `distance` AS `distance`
FROM (SELECT *
FROM `flights`
WHERE ((`year` = 2013.0) AND (`month` = 1.0) AND (`day` = 1.0)))))
ORDER BY `year`, `month`, `day`, `carrier`


<PLAN>
   addr       opcode p1    p2 p3           p4 p5 comment
1     0         Init  0    56  0              00      NA
2     1   SorterOpen  4     9  0       k(1,B) 00      NA
3     2     OpenRead  3   131  0           19 00      NA
4     3     OpenRead  5 22656  0     k(4,,,,) 02      NA
5     4         Real  0     1  0         2013 00      NA
6     5         Real  0     2  0            1 00      NA
7     6         Real  0     3  0            1 00      NA
8     7     Affinity  1     3  0          DDD 00      NA
9     8       SeekGE  5    36  1            3 00      NA
10    9        IdxGT  5    36  1            3 00      NA
11   10         Seek  5     0  3              00      NA
12   11       Column  5     0  8              00      NA
13   12       Column  5     1  9              00      NA
14   13       Column  5     2 10              00      NA
15   14       Column  3     9 11              00      NA
16   15       Column  3     5 12              00      NA
17   16 RealAffinity 12     0  0              00      NA
18   17       Column  3    14 13              00      NA
19   18 RealAffinity 13     0  0              00      NA
20   19       Column  3    15 14              00      NA
21   20 RealAffinity 14     0  0              00      NA
22   21       Divide 13    14 16              00      NA
23   22     Multiply 18    16 15              00      NA
24   23         Copy  8     4  0              00      NA
25   24         Copy  9     5  0              00      NA
26   25         Copy 10     6  0              00      NA
27   26         Copy 11     7  0              00      NA
28   27   MakeRecord  7     9 19              00      NA
29   28 SequenceTest  4    33  0              00      NA
30   29      Compare 20     4  3 k(4,B,B,B,B) 00      NA
31   30         Jump 31    34 31              00      NA
32   31        Gosub 23    40  0              00      NA
33   32  ResetSorter  4     0  0              00      NA
34   33         Move  4    20  3              00      NA
35   34 SorterInsert  4    19  0              00      NA
36   35         Next  5     9  1              00      NA
37   36        Close  3     0  0              00      NA
38   37        Close  5     0  0              00      NA
39   38        Gosub 23    40  0              00      NA
40   39         Goto  0    55  0              00      NA
41   40         Once  0    42  0              00      NA
42   41   OpenPseudo  6    24 10              00      NA
43   42   SorterSort  4    55  0              00      NA
44   43   SorterData  4    24  6              00      NA
45   44       Column  6     1  8              00      NA
46   45       Column  6     2  9              00      NA
47   46       Column  6     3 10              00      NA
48   47       Column  6     4 11              00      NA
49   48       Column  6     5 12              00      NA
50   49       Column  6     6 13              00      NA
51   50       Column  6     7 14              00      NA
52   51       Column  6     8 15              00      NA
53   52    ResultRow  8     8  0              00      NA
54   53   SorterNext  4    43  0              00      NA
55   54       Return 23     0  0              00      NA
56   55         Halt  0     0  0              00      NA
57   56  Transaction  0     0 16            0 01      NA
58   57    TableLock  0   131  0      flights 00      NA
59   58         Real  0    18  0           60 00      NA
60   59         Goto  0     1  0              00      NA

year	month	day	carrier	dep_delay	air_time	distance	speed
2013	1	1	9E	0	189	1029	326.6667
2013	1	1	9E	-9	57	228	240.0000
2013	1	1	9E	-3	68	301	265.5882
2013	1	1	9E	-6	57	209	220.0000
2013	1	1	9E	-8	66	264	240.0000
2013	1	1	9E	0	40	184	276.0000
2013	1	1	9E	6	146	740	304.1096
2013	1	1	9E	0	139	665	287.0504
2013	1	1	9E	-8	150	765	306.0000
2013	1	1	9E	-6	41	187	273.6585
2013	1	1	9E	-10	35	94	161.1429
2013	1	1	9E	52	173	1029	356.8786
2013	1	1	9E	88	72	427	355.8333
2013	1	1	9E	-4	107	509	285.4206
2013	1	1	9E	-3	106	636	360.0000
2013	1	1	9E	0	207	1182	342.6087
2013	1	1	9E	0	141	740	314.8936
2013	1	1	9E	-2	75	427	341.6000
2013	1	1	9E	59	NA	1391	NA
2013	1	1	9E	5	129	828	385.1163
2013	1	1	9E	-5	76	340	268.4211
2013	1	1	9E	-8	41	184	269.2683
2013	1	1	9E	0	30	94	188.0000
2013	1	1	9E	10	62	301	291.2903
2013	1	1	9E	42	91	425	280.2198
2013	1	1	9E	38	118	589	299.4915
2013	1	1	9E	11	43	213	297.2093
2013	1	1	9E	255	115	589	307.3043
2013	1	1	AA	2	160	1089	408.3750
2013	1	1	AA	-2	138	733	318.6957
⋮	⋮	⋮	⋮	⋮	⋮	⋮	⋮
2013	1	1	VX	0	362	2475	410.2210
2013	1	1	VX	0	364	2586	426.2637
2013	1	1	VX	-8	349	2475	425.5014
2013	1	1	WN	-1	40	185	277.5000
2013	1	1	WN	-1	279	1620	348.3871
2013	1	1	WN	-3	146	711	292.1918
2013	1	1	WN	4	264	1605	364.7727
2013	1	1	WN	-2	145	725	300.0000
2013	1	1	WN	-3	143	738	309.6503
2013	1	1	WN	-1	43	185	258.1395
2013	1	1	WN	-5	176	888	302.7273
2013	1	1	WN	-2	38	169	266.8421
2013	1	1	WN	0	142	711	300.4225
2013	1	1	WN	-2	137	738	323.2117
2013	1	1	WN	0	264	1620	368.1818
2013	1	1	WN	6	239	1411	354.2259
2013	1	1	WN	16	334	2133	383.1737
2013	1	1	WN	5	141	725	308.5106
2013	1	1	WN	-4	141	711	302.5532
2013	1	1	WN	16	133	738	332.9323
2013	1	1	WN	3	254	1605	379.1339
2013	1	1	WN	-3	181	888	294.3646
2013	1	1	WN	7	181	872	289.0608
2013	1	1	WN	8	245	1411	345.5510
2013	1	1	WN	-1	137	725	317.5182
2013	1	1	WN	6	148	711	288.2432
2013	1	1	WN	10	46	185	241.3043
2013	1	1	WN	-2	131	738	338.0153
2013	1	1	WN	31	134	711	318.3582
2013	1	1	WN	-2	134	725	324.6269