Bases de datos en R
Ulises M. Alvarez

Porqué usar bases de datos

  1. Podemos interrogar nuestros datos (hacerles preguntas).
  2. Podemos buscar datos con relativa rapidez.
  3. Podemos relacionar los datos de dos tablas diferentes utilizando JOINs.
  4. Podemos crear informes significativos de nuestros datos.
  5. Los datos tienen una estructura incorporada.
  6. La información de un tipo, cualquiera que este sea, sólo se almacena una única vez.
  7. Las DB son ACID (acrónimo de Atomicity, Consistency, Isolation and Durability)
  8. Las DB son tolerantes a fallas.
  9. Las DB pueden manejar grandes conjuntos de datos.
  10. Las DB son concurrentes, esto es, múltiples usuarios pueden acceder al mismo tiempo a los datos sin corromperlos.
  11. Las DB son escalables.

Fuente

ACID (Atomicity, Consistency, Isolation and Durability)

Atomicidad.

Si cuando una operación consiste en una serie de pasos, bien todos ellos se ejecutan o bien ninguno, es decir, las transacciones son completas.

Consistencia (integridad).

Es la propiedad que asegura que sólo se empieza aquello que se puede acabar. Por lo tanto se ejecutan aquellas operaciones que no van a romper las reglas y directrices de Integridad de la base de datos.

Aislamiento.

Esta propiedad asegura que una operación no puede afectar a otras.

Durabilidad (persistencia).

Esta propiedad asegura que una vez realizada la operación, ésta persistirá y no se podrá deshacer aunque falle el sistema.

Fuente:

Bases de datos en R

Vamos a usar DB en R cuando:

  • Deseamos trabajar con los datos de una base sin exportarlos, o
  • Tenemos un conjunto de datos que no cabe en la RAM de nuestro equipo.

In [1]:
options(repos = "https://cloud.r-project.org")      # Definimos repositorio 
if (!require("pacman")) install.packages("pacman")  # Si no tenemos, pacman, lo instalamos


Loading required package: pacman

In [2]:
pacman::p_load(dplyr,                               # librería para manipular datos
              RSQLite,                              # librería para acceder a SQLite
              nycflights13)                         # data for all, 336776, flights departing NYC in 2013

In [18]:
my_db <- src_sqlite("my_db.sqlite3", create = T)    # just give the path and the ok to create a table.

Ahora, copiaremos los datos mediante:

copy_to()

El uso de esta función no es recomendable con grandes conjuntos de datos.


In [20]:
flights_sqlite <- copy_to(my_db,                           # destino
                          flights,                         # DF de origen 
                          temporary = FALSE,               # hacemos nuestro objeto no temporal
                          indexes =                        # Generamos el "index" para acelerar análisis
                          list(c("year", "month", "day"),  
                               "carrier", "tailnum"))


Error: Table flights already exists.
Traceback:

1. copy_to(my_db, flights, temporary = FALSE, indexes = list(c("year", 
 .     "month", "day"), "carrier", "tailnum"))
2. copy_to.src_sql(my_db, flights, temporary = FALSE, indexes = list(c("year", 
 .     "month", "day"), "carrier", "tailnum"))
3. stop("Table ", name, " already exists.", call. = FALSE)

In [22]:
flights_sqlite <- tbl(nycflights13_sqlite(), "flights")    # Caching nycflights db

In [24]:
explain(flights_sqlite)


<SQL>
SELECT *
FROM `flights`


<PLAN>
   addr       opcode p1  p2 p3      p4 p5 comment
1     0         Init  0  33  0         00      NA
2     1     OpenRead  0 131  0      19 00      NA
3     2       Rewind  0  31  0         00      NA
4     3       Column  0   0  1         00      NA
5     4       Column  0   1  2         00      NA
6     5       Column  0   2  3         00      NA
7     6       Column  0   3  4         00      NA
8     7       Column  0   4  5         00      NA
9     8       Column  0   5  6         00      NA
10    9 RealAffinity  6   0  0         00      NA
11   10       Column  0   6  7         00      NA
12   11       Column  0   7  8         00      NA
13   12       Column  0   8  9         00      NA
14   13 RealAffinity  9   0  0         00      NA
15   14       Column  0   9 10         00      NA
16   15       Column  0  10 11         00      NA
17   16       Column  0  11 12         00      NA
18   17       Column  0  12 13         00      NA
19   18       Column  0  13 14         00      NA
20   19       Column  0  14 15         00      NA
21   20 RealAffinity 15   0  0         00      NA
22   21       Column  0  15 16         00      NA
23   22 RealAffinity 16   0  0         00      NA
24   23       Column  0  16 17         00      NA
25   24 RealAffinity 17   0  0         00      NA
26   25       Column  0  17 18         00      NA
27   26 RealAffinity 18   0  0         00      NA
28   27       Column  0  18 19         00      NA
29   28 RealAffinity 19   0  0         00      NA
30   29    ResultRow  1  19  0         00      NA
31   30         Next  0   3  0         01      NA
32   31        Close  0   0  0         00      NA
33   32         Halt  0   0  0         00      NA
34   33  Transaction  0   0 16       0 01      NA
35   34    TableLock  0 131  0 flights 00      NA
36   35         Goto  0   1  0         00      NA

In [25]:
tbl(my_db, sql("SELECT * FROM flights"))  # Probemos a ejecutar la búsqueda con SQL...


Source:   query [?? x 19]
Database: sqlite 3.11.1 [my_db.sqlite3]

# A tibble: ?? x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>

Acciones básicas

Select columns with select()


In [26]:
select(flights_sqlite,  # los datos 
       year:day,        # seleccionamos desde la columna 'year', hasta 'day'
       dep_delay,       # también 'dep_delay'
       arr_delay)       # y 'arr_delay'


Source:   query [?? x 5]
Database: sqlite 3.11.1 [/tmp/Rtmplxf76Y/nycflights13.sqlite]

# A tibble: ?? x 5
    year month   day dep_delay arr_delay
   <int> <int> <int>     <dbl>     <dbl>
 1  2013     1     1         2        11
 2  2013     1     1         4        20
 3  2013     1     1         2        33
 4  2013     1     1        -1       -18
 5  2013     1     1        -6       -25
 6  2013     1     1        -4        12
 7  2013     1     1        -5        19
 8  2013     1     1        -3       -14
 9  2013     1     1        -3        -8
10  2013     1     1        -2         8
# ... with more rows

Filter rows with filter()


In [27]:
filter(flights_sqlite,   # los datos
       dep_delay > 240)  # seleccionamos un retraso mayor a 240...


Source:   query [?? x 19]
Database: sqlite 3.11.1 [/tmp/Rtmplxf76Y/nycflights13.sqlite]

# A tibble: ?? x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      848           1835       853     1001           1950
 2  2013     1     1     1815           1325       290     2120           1542
 3  2013     1     1     1842           1422       260     1958           1535
 4  2013     1     1     2115           1700       255     2330           1920
 5  2013     1     1     2205           1720       285       46           2040
 6  2013     1     1     2343           1724       379      314           1938
 7  2013     1     2     1332            904       268     1616           1128
 8  2013     1     2     1412            838       334     1710           1147
 9  2013     1     2     1607           1030       337     2003           1355
10  2013     1     2     2131           1512       379     2340           1741
# ... with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>

Arrange rows with arrange()

Parecido a filter, pero reordena. Si se selecciona más de una columna, cada columna adicional se usará para reordenar la precedente.


In [30]:
arrange(flights_sqlite,    # los datos
        year, month, day)  # las columnas


Source:   query [?? x 19]
Database: sqlite 3.11.1 [/tmp/Rtmplxf76Y/nycflights13.sqlite]

# A tibble: ?? x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>

Add new columns with mutate()


In [31]:
mutate(flights_sqlite,               # los datos
       speed = air_time / distance)  # agregamos la columna 'speed'


Source:   query [?? x 20]
Database: sqlite 3.11.1 [/tmp/Rtmplxf76Y/nycflights13.sqlite]

# A tibble: ?? x 20
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ... with more rows, and 12 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>, speed <dbl>

Summarise values with summarise()


In [32]:
summarise(flights_sqlite,          # los datos
          delay = mean(dep_time))  # la función a aplicar, mean()


Source:   query [?? x 1]
Database: sqlite 3.11.1 [/tmp/Rtmplxf76Y/nycflights13.sqlite]

# A tibble: ?? x 1
    delay
    <dbl>
1 1349.11

"Laziness"

When working with databases, dplyr tries to be as lazy as possible:

  • It never pulls data into R unless you explicitly ask for it.
  • It delays doing any work until the last possible moment: it collects together everything you want to do and then sends it to the database in one step.

For example, take the following code:


In [13]:
c1 <- filter(flights_sqlite, year == 2013, month == 1, day == 1)
c2 <- select(c1, year, month, day, carrier, dep_delay, air_time, distance)
c3 <- mutate(c2, speed = distance / air_time * 60)
c4 <- arrange(c3, year, month, day, carrier)

Suprisingly, this sequence of operations never actually touches the database. It’s not until you ask for the data (e.g. by printing c4) that dplyr generates the SQL and requests the results from the database. Even then it only pulls down 10 rows.


In [14]:
c4


Source:   query [?? x 8]
Database: sqlite 3.11.1 [/tmp/Rtmplxf76Y/nycflights13.sqlite]

# A tibble: ?? x 8
    year month   day carrier dep_delay air_time distance    speed
   <int> <int> <int>   <chr>     <dbl>    <dbl>    <dbl>    <dbl>
 1  2013     1     1      9E         0      189     1029 326.6667
 2  2013     1     1      9E        -9       57      228 240.0000
 3  2013     1     1      9E        -3       68      301 265.5882
 4  2013     1     1      9E        -6       57      209 220.0000
 5  2013     1     1      9E        -8       66      264 240.0000
 6  2013     1     1      9E         0       40      184 276.0000
 7  2013     1     1      9E         6      146      740 304.1096
 8  2013     1     1      9E         0      139      665 287.0504
 9  2013     1     1      9E        -8      150      765 306.0000
10  2013     1     1      9E        -6       41      187 273.6585
# ... with more rows

To pull down all the results use collect(), which returns a tbl_df():


In [15]:
collect(c4)


yearmonthdaycarrierdep_delayair_timedistancespeed
2013 1 1 9E 0 189 1029 326.6667
2013 1 1 9E -9 57 228 240.0000
2013 1 1 9E -3 68 301 265.5882
2013 1 1 9E -6 57 209 220.0000
2013 1 1 9E -8 66 264 240.0000
2013 1 1 9E 0 40 184 276.0000
2013 1 1 9E 6 146 740 304.1096
2013 1 1 9E 0 139 665 287.0504
2013 1 1 9E -8 150 765 306.0000
2013 1 1 9E -6 41 187 273.6585
2013 1 1 9E -10 35 94 161.1429
2013 1 1 9E 52 173 1029 356.8786
2013 1 1 9E 88 72 427 355.8333
2013 1 1 9E -4 107 509 285.4206
2013 1 1 9E -3 106 636 360.0000
2013 1 1 9E 0 207 1182 342.6087
2013 1 1 9E 0 141 740 314.8936
2013 1 1 9E -2 75 427 341.6000
2013 1 1 9E 59 NA 1391 NA
2013 1 1 9E 5 129 828 385.1163
2013 1 1 9E -5 76 340 268.4211
2013 1 1 9E -8 41 184 269.2683
2013 1 1 9E 0 30 94 188.0000
2013 1 1 9E 10 62 301 291.2903
2013 1 1 9E 42 91 425 280.2198
2013 1 1 9E 38 118 589 299.4915
2013 1 1 9E 11 43 213 297.2093
2013 1 1 9E 255 115 589 307.3043
2013 1 1 AA 2 160 1089 408.3750
2013 1 1 AA -2 138 733 318.6957
2013 1 1 VX 0 362 2475 410.2210
2013 1 1 VX 0 364 2586 426.2637
2013 1 1 VX -8 349 2475 425.5014
2013 1 1 WN -1 40 185 277.5000
2013 1 1 WN -1 279 1620 348.3871
2013 1 1 WN -3 146 711 292.1918
2013 1 1 WN 4 264 1605 364.7727
2013 1 1 WN -2 145 725 300.0000
2013 1 1 WN -3 143 738 309.6503
2013 1 1 WN -1 43 185 258.1395
2013 1 1 WN -5 176 888 302.7273
2013 1 1 WN -2 38 169 266.8421
2013 1 1 WN 0 142 711 300.4225
2013 1 1 WN -2 137 738 323.2117
2013 1 1 WN 0 264 1620 368.1818
2013 1 1 WN 6 239 1411 354.2259
2013 1 1 WN 16 334 2133 383.1737
2013 1 1 WN 5 141 725 308.5106
2013 1 1 WN -4 141 711 302.5532
2013 1 1 WN 16 133 738 332.9323
2013 1 1 WN 3 254 1605 379.1339
2013 1 1 WN -3 181 888 294.3646
2013 1 1 WN 7 181 872 289.0608
2013 1 1 WN 8 245 1411 345.5510
2013 1 1 WN -1 137 725 317.5182
2013 1 1 WN 6 148 711 288.2432
2013 1 1 WN 10 46 185 241.3043
2013 1 1 WN -2 131 738 338.0153
2013 1 1 WN 31 134 711 318.3582
2013 1 1 WN -2 134 725 324.6269

You can see the query dplyr has generated by looking at the query component of the object:


In [16]:
c4$query


NULL

You can also ask the database how it plans to execute the query with explain(). The output for SQLite is described in more detail on the SQLite website. It’s helpful if you’re trying to figure out which indexes are being used.


In [17]:
explain(c4)


<SQL>
SELECT *
FROM (SELECT `year`, `month`, `day`, `carrier`, `dep_delay`, `air_time`, `distance`, `distance` / `air_time` * 60.0 AS `speed`
FROM (SELECT `year` AS `year`, `month` AS `month`, `day` AS `day`, `carrier` AS `carrier`, `dep_delay` AS `dep_delay`, `air_time` AS `air_time`, `distance` AS `distance`
FROM (SELECT *
FROM `flights`
WHERE ((`year` = 2013.0) AND (`month` = 1.0) AND (`day` = 1.0)))))
ORDER BY `year`, `month`, `day`, `carrier`


<PLAN>
   addr       opcode p1    p2 p3           p4 p5 comment
1     0         Init  0    56  0              00      NA
2     1   SorterOpen  4     9  0       k(1,B) 00      NA
3     2     OpenRead  3   131  0           19 00      NA
4     3     OpenRead  5 22656  0     k(4,,,,) 02      NA
5     4         Real  0     1  0         2013 00      NA
6     5         Real  0     2  0            1 00      NA
7     6         Real  0     3  0            1 00      NA
8     7     Affinity  1     3  0          DDD 00      NA
9     8       SeekGE  5    36  1            3 00      NA
10    9        IdxGT  5    36  1            3 00      NA
11   10         Seek  5     0  3              00      NA
12   11       Column  5     0  8              00      NA
13   12       Column  5     1  9              00      NA
14   13       Column  5     2 10              00      NA
15   14       Column  3     9 11              00      NA
16   15       Column  3     5 12              00      NA
17   16 RealAffinity 12     0  0              00      NA
18   17       Column  3    14 13              00      NA
19   18 RealAffinity 13     0  0              00      NA
20   19       Column  3    15 14              00      NA
21   20 RealAffinity 14     0  0              00      NA
22   21       Divide 13    14 16              00      NA
23   22     Multiply 18    16 15              00      NA
24   23         Copy  8     4  0              00      NA
25   24         Copy  9     5  0              00      NA
26   25         Copy 10     6  0              00      NA
27   26         Copy 11     7  0              00      NA
28   27   MakeRecord  7     9 19              00      NA
29   28 SequenceTest  4    33  0              00      NA
30   29      Compare 20     4  3 k(4,B,B,B,B) 00      NA
31   30         Jump 31    34 31              00      NA
32   31        Gosub 23    40  0              00      NA
33   32  ResetSorter  4     0  0              00      NA
34   33         Move  4    20  3              00      NA
35   34 SorterInsert  4    19  0              00      NA
36   35         Next  5     9  1              00      NA
37   36        Close  3     0  0              00      NA
38   37        Close  5     0  0              00      NA
39   38        Gosub 23    40  0              00      NA
40   39         Goto  0    55  0              00      NA
41   40         Once  0    42  0              00      NA
42   41   OpenPseudo  6    24 10              00      NA
43   42   SorterSort  4    55  0              00      NA
44   43   SorterData  4    24  6              00      NA
45   44       Column  6     1  8              00      NA
46   45       Column  6     2  9              00      NA
47   46       Column  6     3 10              00      NA
48   47       Column  6     4 11              00      NA
49   48       Column  6     5 12              00      NA
50   49       Column  6     6 13              00      NA
51   50       Column  6     7 14              00      NA
52   51       Column  6     8 15              00      NA
53   52    ResultRow  8     8  0              00      NA
54   53   SorterNext  4    43  0              00      NA
55   54       Return 23     0  0              00      NA
56   55         Halt  0     0  0              00      NA
57   56  Transaction  0     0 16            0 01      NA
58   57    TableLock  0   131  0      flights 00      NA
59   58         Real  0    18  0           60 00      NA
60   59         Goto  0     1  0              00      NA