--- title: "Introduction to dbi.table" output: rmarkdown::html_vignette: number_sections: true vignette: > %\VignetteIndexEntry{Introduction to dbi.table} %\VignetteEngine{knitr::rmarkdown}x %\VignetteEncoding{UTF-8} --- ```{r, set_options, include = FALSE} old_opts <- options(width = 90) ``` Query database tables and views over a DBI connection using `data.table`'s `[i, j, by]` syntax, attach database schemas to the search path, and programmatically load database catalogs. This vignette assumes that you are already fluent with `data.table`'s syntax and that you know how to open a database connection using the `DBI` package. # Installation The `dbi.table` package is hosted on GitHub. Use the following command to install the package. ```{r install, eval = FALSE} #install.packages("devtools") devtools::install_github("kjellpk/dbi.table") ``` Note: if the `install_github` function is not found, you will need to first install the `devtools` package using `install.packages`. # Getting Started This section uses the sample [Chinook Database](https://github.com/lerocha/chinook-database) (included in the package) to demonstrate how to 1. create a single `dbi.table` using the `dbi.table` function, 2. maniuplate a `dbi.table` using `data.table`'s `[i, j, by]` syntax, 3. attach a schema to the search path using the `dbi.attach` function, and 4. load a database catalog using the `dbi.catalog` function. The function `chinook.duckdb` that returns an open `duckdb` (DBI) connection to the sample Chinook Database. This connection is a typical DBI connection as returned by `DBI::dbConnect` that can be used as the `conn` argument in DBI package functions. Let's get started by loading the package and opening the connection. ```{r library, message = FALSE} library(dbi.table) chinook <- chinook.duckdb() ``` ## Create a Single `dbi.table` The `dbi.table` function takes 2 arguments: a DBI connection, and an Id indentifying a database table or view. ```{r single_table} my_album <- dbi.table(chinook, DBI::Id("Album")) ``` The object `my_album` is a `dbi.table`, a data structure that represents a SQL query (which we refer to as the `dbi.table`'s *underlying SQL query*). The `print` method displays a preview of the underlying SQL query. ```{r my_album_print} #print(my_album) my_album ``` The preview has a format similar to a `data.table` with two notable exceptions. 1. The row numbers are omitted. SQL queries do not necessarily return the result set in a reliable order (even on subsequent evaluations of the same query), and `dbi.table` does not make any extra effort to order the rows by default. Thus the row numbers are omitted. 2. Only 5 rows of the `dbi.table` are displayed (`data.table` displays the first 5 and the last 5). Again, since the result set does not have a reliable order, it is not possible to say which rows are the first and which are the last. The rows displayed are the first 5 returned by the RDBMS. The function `as.data.table` executes the `dbi.table`'s underlying SQL query and retrieves the result set as a `data.table`. Pro tip: calling the extracts method (`[]`) with no arguments is a shortcut for `as.data.table`. ```{r fetch_data_table} #as.data.table(my_album) my_album[] ``` Since the result set is instantiated locally as a `data.table`, the row numbers and the last 5 rows are displayed. Note: by default, `as.data.table` (and the empty extracts shortcut) fetch a maximum of 10,000 rows. To override this limit, either set the option `dbi_table_max_fetch` or call `as.data.table` and provide the `n` argument (e.g., `n = -1` to fetch the entire result set). The `csql` utility displays the query. ```{r my_album_csql} csql(my_album) ``` The underlying SQL query of a newly created `dbi.table` selects all the columns from the database table. ## Manipulate a `dbi.table` using `data.table` Syntax This table from `data.table`'s *Introduction to data.table* vignette pretty much sums up what `dbi.table` does. ```{r xref, eval = FALSE} DT[i, j, by] ## R: i j by ## SQL: where | order by select | update group by ``` In general, `dbi.table` should be able to handle basic `data.table` syntax. SQL translation is done by `dbplyr::translate_sql_` which works with a wide variety of R functions. However, complicated expressions (e.g., custom functions in `j`, nested aggregation functions, most special symbols) do not work. Best practice is to use `dbi.table` to subset and wrangle on the database, then `data.table` to fine tune locally. The remainder of this section demonstrates how `i`, `j`, and `by` manipulate a `dbi.table`'s underlying SQL query. When `i` is a logical expression of the variables in the `dbi.table` then it becomes the *WHERE* clause in the `dbi.table`'s underlying SQL query. ```{r i_where} csql(my_album[AlbumId == ArtistId + 1]) ``` When `i` is a call to `order` (or `forder`), it becomes the *ORDER BY* clause in the `dbi.table`'s underlying SQL query. ```{r i_order} csql(my_album[order(nchar(Title), -AlbumId)]) ``` When `j` is a list of expressions of the variables in the `dbi.table`, then `j` becomes the *SELECT* clause in the `dbi.table`'s underlying SQL query. ```{r j_list} csql(my_album[, .(AlbumId, Title)]) ``` When `by` is a list of expressions of the variables in the `dbi.table`, then `by` becomes the *GROUP BY* clause in the `dbi.table`'s underlying SQL query. ```{r by_list} csql(my_album[, .("# of Albums" = .N), .(ArtistId)]) ``` ## Attach a Schema to the Search Path The `dbi.attach` function *attaches* a DBI connection to the search path. That is, `dbi.attach` creates a `dbi.table` for each table and each view in the schema associated with the DBI connection, then assigns these `dbi.table`s to an environment on the search path. ```{r dbi.attach} dbi.attach(chinook) ``` A quick look at the search path shows the database attached in position 2. ```{r search_path} head(search(), 3) ``` The tables and views in the database schema are queriable as `dbi.table`s in the attached environment `r search()[[2L]]`. ```{r ls_chinook} ls("duckdb:chinook_duckdb") ``` Note: Attaching a DBI connection is intended for an interactive exploratory analysis of a database (schema). For programatic use cases, see the *Load a Database Catalog* section. Merging two `dbi.table`s results in a SQL join that describes the same result set as the associated `data.table` merge. That is, ```{r merge_dts, eval = FALSE} merge(as.data.table(Album), as.data.table(Artist), by = "ArtistId") ``` and ```{r merge_dbit, eval = FALSE} as.data.table(merge(Album, Artist, by = "ArtistId")) ``` are the same `data.table` up to row order. ```{r merge_dt_like} csql(merge(Album, Artist, by = "ArtistId")) ``` When a DBI connection is attached to the search path, `dbi.attach` also loads the schema's relational meta data (whether this works depends on how the underlying database implments an *information schema*). In particular, *foreign key* constraints are used as the default `by` when merging. In the previous example, the `ArtistId` column is a foreign key referencing the `Album` table. For this example, when the `by` argument is omitted, `dbi.table` still merges *by* `ArtistId`. ```{r merge_no_by} csql(merge(Album, Artist)) ``` When the `y` argument is omitted, `dbi.table`'s `merge` uses the foreign key constraints that reference `x` to determin the `y` (or `y`s) to merge with. ```{r merge_no_y} csql(merge(Track)) ``` When the optional `recursive` argument is `TRUE`, `merge.dbi.table` recursively merges on each of the just-merged tables. In this example, `Track` has a foreign key that references `Album` and `Album` has a foreign key that references `Artist`. ```{r merge_no_y_rec} csql(merge(Track, recursive = TRUE)) ``` ## Load a Database Catalog As a best practice for programatic use, it is better to load the catalog in order to avoid modifying the search path. ```{r dbi_catalog} catalog <- dbi.catalog(chinook) ``` Printing the catalog lists its schemas. ```{r print_dbi_catalog} catalog ``` Individual tables can be accessed using `catalog$schema$table` syntax. ```{r dbi_catalog_table} catalog$main$Album ``` When a catalog is loaded, all of its tables have access to the relational data in the information schema. ```{r dbi_catalog_merge} merge(catalog$main$Album) ``` # Scope This section provides a brief explanation of what the `dbi.table` package is trying to do. Suppose that `x` is a `dbi.table` and that `e` is an expression involving `x` that returns either a `dbi.table` or a `data.table`. ```{r scope_example} x <- dbi.table(chinook, DBI::Id("Album")) e <- quote(x[, .("# of Albums" = .N), .(ArtistId)]) ``` Since `dbi.table`'s syntax is a subset of `data.table`'s syntax, if `e` can be evaluated successfully (i.e., `eval(e)` does not throw an error), then `e` should also be able to be successfully evaluated when `x` is a `data.table`. There are thus 2 paths to the final `data.table` result: 1. evaluate `e` then coerce the result using `as.data.table`, or 2. coerce `x` to a `data.table` then evaluate `e`. Path 2 is referred to as the reference implementation and describes the *correct* answer: the *reference result set*. The design goal of `dbi.table` is to get the same result set as the reference result set, up to row order. ```{r reference_check} result_set <- as.data.table(eval(e)) x <- as.data.table(x) reference_result_set <- eval(e) all.equal(reference_result_set, result_set, ignore.row.order = TRUE) ``` The `dbi.table` package includes the function `reference.test` that compares the result set to the reference result set in the more general case where `expr` (the function's first argument) is an expression involving 1 or more `dbi.table`s. ```{r reference.test} x <- dbi.table(chinook, DBI::Id("Album")) reference.test({ x[, .("# of Albums" = .N), .(ArtistId)] }) ``` This function is used extensively in `dbi.table`'s unit/regression tests. # Cleaning Up We used the `chinook.duckdb` function to open a DBI connection at the beginning of this vignette and now it is up to us to close it. ```{r disconnect} DBI::dbDisconnect(chinook) ``` However, this leaves our R session in a wonky state. The environment "duckdb:chinook_duckdb" is still attached and there are several `dbi.table`s in the global environment - all of these `dbi.table`s are associated with an invalid DBI connection. ```{r bork, error = TRUE} #A dbi.table in the duckdb:chinook_duckdb environment Genre ``` The R objects associated with our now-closed DBI connection need to be cleaned up manually (or you could just restart R). ```{r clean_up_manually} detach("duckdb:chinook_duckdb") rm(catalog, my_album, x) ``` ## Connection Management Alternatively, when using either `dbi.attach` or `dbi.catalog`, the first arguement can be a zero-argument function that returns an open DBI connection. When `dbi.table` uses a function to open the DBI connection, then that connection belongs to `dbi.table` and `dbi.table` will take care of closing it when it is no longer needed. ```{r attach_function} dbi.attach(chinook.duckdb) ``` When `dbi.table` is managing the connection, then all the user has to do is detach (or delete if a catalog). The DBI connection will be closed when the object is garbage collected. ```{r clean_up} detach("duckdb:chinook_duckdb") ``` Further, when `dbi.table` owns the connection, it is able to reconnect in the event that the connection unexpectedly drops. ```{r, restore_options, include = FALSE} options(old_opts) ```