---
title: "Introduction to dbi.table"
output:
  rmarkdown::html_vignette:
    number_sections: true

vignette: >
  %\VignetteIndexEntry{Introduction to dbi.table}
  %\VignetteEngine{knitr::rmarkdown}x
  %\VignetteEncoding{UTF-8}
---

```{r, set_options, include = FALSE}
old_opts <- options(width = 90)
```

Query database tables and views over a DBI connection using `data.table`'s
`[i, j, by]` syntax, attach database schemas to the search path, and
programmatically load database catalogs.

This vignette assumes that you are already fluent with `data.table`'s syntax and
that you know how to open a database connection using the `DBI` package.

# Installation

The `dbi.table` package is hosted on GitHub. Use the following command to
install the package.

```{r install, eval = FALSE}
#install.packages("devtools")
devtools::install_github("kjellpk/dbi.table")
```

Note: if the `install_github` function is not found, you will need to first
install the `devtools` package using `install.packages`.


# Getting Started

This section uses the sample
[Chinook Database](https://github.com/lerocha/chinook-database)
(included in the package) to demonstrate how to

  1. create a single `dbi.table` using the `dbi.table` function,
  2. maniuplate a `dbi.table` using `data.table`'s `[i, j, by]` syntax,
  3. attach a schema to the search path using the `dbi.attach` function, and
  4. load a database catalog using the `dbi.catalog` function.

The function `chinook.duckdb` that returns an open `duckdb` (DBI) connection to
the sample Chinook Database.  This connection is a typical DBI connection as
returned by `DBI::dbConnect` that can be used as the `conn` argument in DBI
package functions. Let's get started by loading the package and opening the
connection.

```{r library, message = FALSE}
library(dbi.table)
chinook <- chinook.duckdb()
```


## Create a Single `dbi.table`

The `dbi.table` function takes 2 arguments: a DBI connection, and an Id
indentifying a database table or view.

```{r single_table}
my_album <- dbi.table(chinook, DBI::Id("Album"))
```

The object `my_album` is a `dbi.table`, a data structure that represents a SQL
query (which we refer to as the `dbi.table`'s *underlying SQL query*). The
`print` method displays a preview of the underlying SQL query. 

```{r my_album_print}
#print(my_album)
my_album
```

The preview has a format similar to a `data.table` with two notable exceptions.

1. The row numbers are omitted. SQL queries do not necessarily return the
   result set in a reliable order (even on subsequent evaluations of the same
   query), and `dbi.table` does not make any extra effort to order the rows by
   default. Thus the row numbers are omitted.

2. Only 5 rows of the `dbi.table` are displayed (`data.table` displays the first
   5 and the last 5). Again, since the result set does not have a reliable
   order, it is not possible to say which rows are the first and which are the
   last. The rows displayed are the first 5 returned by the RDBMS.

The function `as.data.table` executes the `dbi.table`'s underlying SQL query and
retrieves the result set as a `data.table`. Pro tip: calling the extracts method
(`[]`) with no arguments is a shortcut for `as.data.table`.

```{r fetch_data_table}
#as.data.table(my_album)
my_album[]
```

Since the result set is instantiated locally as a `data.table`, the row numbers
and the last 5 rows are displayed.

Note: by default, `as.data.table` (and the empty extracts shortcut) fetch a
maximum of 10,000 rows. To override this limit, either set the option
`dbi_table_max_fetch` or call `as.data.table` and provide the `n` argument
(e.g., `n = -1` to fetch the entire result set). 

The `csql` utility displays the query.

```{r my_album_csql}
csql(my_album)
```

The underlying SQL query of a newly created `dbi.table` selects all the columns
from the database table.

## Manipulate a `dbi.table` using `data.table` Syntax

This table from `data.table`'s *Introduction to data.table* vignette pretty much
sums up what `dbi.table` does.

```{r xref, eval = FALSE}
DT[i, j, by]

##   R:                 i                 j        by
## SQL:  where | order by   select | update  group by
```

In general, `dbi.table` should be able to handle basic `data.table` syntax. SQL
translation is done by `dbplyr::translate_sql_` which works with a wide variety
of R functions. However, complicated expressions (e.g., custom functions in `j`,
nested aggregation functions, most special symbols) do not work.

Best practice is to use `dbi.table` to subset and wrangle on the database, then
`data.table` to fine tune locally.

The remainder of this section demonstrates how `i`, `j`, and `by` manipulate a
`dbi.table`'s underlying SQL query.

When `i` is a logical expression of the variables in the `dbi.table` then it
becomes the *WHERE* clause in the `dbi.table`'s underlying SQL query.

```{r i_where}
csql(my_album[AlbumId == ArtistId + 1])
```

When `i` is a call to `order` (or `forder`), it becomes the *ORDER BY* clause in
the `dbi.table`'s underlying SQL query.

```{r i_order}
csql(my_album[order(nchar(Title), -AlbumId)])
```

When `j` is a list of expressions of the variables in the `dbi.table`, then `j`
becomes the *SELECT* clause in the `dbi.table`'s underlying SQL query.

```{r j_list}
csql(my_album[, .(AlbumId, Title)])
```

When `by` is a list of expressions of the variables in the `dbi.table`, then
`by` becomes the *GROUP BY* clause in the `dbi.table`'s underlying SQL query.

```{r by_list}
csql(my_album[, .("# of Albums" = .N), .(ArtistId)])
```


## Attach a Schema to the Search Path

The `dbi.attach` function *attaches* a DBI connection to the search path. That
is, `dbi.attach` creates a `dbi.table` for each table and each view in the
schema associated with the DBI connection, then assigns these `dbi.table`s to an
environment on the search path.

```{r dbi.attach}
dbi.attach(chinook)
```

A quick look at the search path shows the database attached in position 2.

```{r search_path}
head(search(), 3)
```

The tables and views in the database schema are queriable as `dbi.table`s in the
attached environment `r search()[[2L]]`.

```{r ls_chinook}
ls("duckdb:chinook_duckdb")
```

Note: Attaching a DBI connection is intended for an interactive exploratory
analysis of a database (schema). For programatic use cases, see the *Load a
Database Catalog* section.

Merging two `dbi.table`s results in a SQL join that describes the same result
set as the associated `data.table` merge. That is,

```{r merge_dts, eval = FALSE}
merge(as.data.table(Album), as.data.table(Artist), by = "ArtistId")
```

and

```{r merge_dbit, eval = FALSE}
as.data.table(merge(Album, Artist, by = "ArtistId"))
```

are the same `data.table` up to row order.

```{r merge_dt_like}
csql(merge(Album, Artist, by = "ArtistId"))
```

When a DBI connection is attached to the search path, `dbi.attach` also loads
the schema's relational meta data (whether this works depends on how the
underlying database implments an *information schema*). In particular, *foreign
key* constraints are used as the default `by` when merging. In the previous
example, the `ArtistId` column is a foreign key referencing the `Album` table.
For this example, when the `by` argument is omitted, `dbi.table` still merges
*by* `ArtistId`.

```{r merge_no_by}
csql(merge(Album, Artist))
```
When the `y` argument is omitted, `dbi.table`'s `merge` uses the foreign key
constraints that reference `x` to determin the `y` (or `y`s) to merge with.

```{r merge_no_y}
csql(merge(Track))
```

When the optional `recursive` argument is `TRUE`, `merge.dbi.table` recursively
merges on each of the just-merged tables. In this example, `Track` has a foreign
key that references `Album` and `Album` has a foreign key that references
`Artist`.

```{r merge_no_y_rec}
csql(merge(Track, recursive = TRUE))
```


## Load a Database Catalog

As a best practice for programatic use, it is better to load the catalog in
order to avoid modifying the search path.

```{r dbi_catalog}
catalog <- dbi.catalog(chinook)
```

Printing the catalog lists its schemas.

```{r print_dbi_catalog}
catalog
```

Individual tables can be accessed using `catalog$schema$table` syntax.

```{r dbi_catalog_table}
catalog$main$Album
```

When a catalog is loaded, all of its tables have access to the relational data
in the information schema.

```{r dbi_catalog_merge}
merge(catalog$main$Album)
```


# Scope

This section provides a brief explanation of what the `dbi.table` package is
trying to do.

Suppose that `x` is a `dbi.table` and that `e` is an expression involving `x`
that returns either a `dbi.table` or a `data.table`.

```{r scope_example}
x <- dbi.table(chinook, DBI::Id("Album"))
e <- quote(x[, .("# of Albums" = .N), .(ArtistId)])
```

Since `dbi.table`'s syntax is a subset of `data.table`'s syntax, if `e` can be
evaluated successfully (i.e., `eval(e)` does not throw an error), then `e`
should also be able to be successfully evaluated when `x` is a `data.table`.
There are thus 2 paths to the final `data.table` result:

1. evaluate `e` then coerce the result using `as.data.table`, or

2. coerce `x` to a `data.table` then evaluate `e`.

Path 2 is referred to as the reference implementation and describes the
*correct* answer: the *reference result set*. The design goal of `dbi.table` is
to get the same result set as the reference result set, up to row order.

```{r reference_check}
result_set <- as.data.table(eval(e))
x <- as.data.table(x)
reference_result_set <- eval(e)
all.equal(reference_result_set, result_set, ignore.row.order = TRUE)
```

The `dbi.table` package includes the function `reference.test` that compares the
result set to the reference result set in the more general case where `expr`
(the function's first argument) is an expression involving 1 or more
`dbi.table`s.

```{r reference.test}
x <- dbi.table(chinook, DBI::Id("Album"))
reference.test({
  x[, .("# of Albums" = .N), .(ArtistId)]
})
```

This function is used extensively in `dbi.table`'s unit/regression tests.

# Cleaning Up

We used the `chinook.duckdb` function to open a DBI connection at the beginning
of this vignette and now it is up to us to close it.

```{r disconnect}
DBI::dbDisconnect(chinook)
```

However, this leaves our R session in a wonky state. The environment
"duckdb:chinook_duckdb" is still attached and there are several `dbi.table`s in
the global environment - all of these `dbi.table`s are associated with an
invalid DBI connection.

```{r bork, error = TRUE}
#A dbi.table in the duckdb:chinook_duckdb environment
Genre
```

The R objects associated with our now-closed DBI connection need to be cleaned
up manually (or you could just restart R).

```{r clean_up_manually}
detach("duckdb:chinook_duckdb")
rm(catalog, my_album, x)
```

## Connection Management

Alternatively, when using either `dbi.attach` or `dbi.catalog`, the first
arguement can be a zero-argument function that returns an open DBI connection.
When `dbi.table` uses a function to open the DBI connection, then that
connection belongs to `dbi.table` and `dbi.table` will take care of closing it
when it is no longer needed.

```{r attach_function}
dbi.attach(chinook.duckdb)
```

When `dbi.table` is managing the connection, then all the user has to do is
detach (or delete if a catalog). The DBI connection will be closed when the
object is garbage collected.

```{r clean_up}
detach("duckdb:chinook_duckdb")
```

Further, when `dbi.table` owns the connection, it is able to reconnect in the
event that the connection unexpectedly drops.

```{r, restore_options, include = FALSE}
options(old_opts)
```