--- title: "Introduction to nabla" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to nabla} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` ```{r setup} library(nabla) ``` ## What is nabla? `nabla` differentiates any R function exactly — no symbolic algebra, no finite-difference grid, no loss of precision. Pass a function and a point to `D()`, and you get the exact derivative: ```{r} f <- function(x) x[1]^2 + sin(x[1]) D(f, pi / 4) # exact first derivative at pi/4 ``` Functions passed to `D()` use `x[1]`, `x[2]`, ... indexing — the same convention as R's `optim()`. For multi-parameter functions, `gradient()`, `hessian()`, and `jacobian()` provide the standard objects of calculus: ```{r} g <- function(x) x[1]^2 * exp(x[2]) gradient(g, c(2, 0)) # c(4, 4) hessian(g, c(2, 0)) # 2x2 Hessian matrix ``` ## Quick start The `D` operator is the core of `nabla`. Applied once, it gives the first derivative. Applied twice (via `order = 2`), it gives the second derivative: ```{r} f <- function(x) x[1]^2 + sin(x[1]) # First derivative: f'(x) = 2x + cos(x) D(f, pi / 4) # Verify against the analytical formula 2 * (pi / 4) + cos(pi / 4) # Second derivative: f''(x) = 2 - sin(x) D(f, pi / 4, order = 2) ``` `gradient()` is equivalent to `D(f, x)` for scalar-valued functions: ```{r} gradient(f, pi / 4) ``` The function and its AD-computed derivative over $[0, 2\pi]$, with the evaluation point $x = \pi/4$ marked: ```{r fig-function-derivative, fig.width=6, fig.height=4} f <- function(x) x[1]^2 + sin(x[1]) xs <- seq(0, 2 * pi, length.out = 200) # Compute f(x) and f'(x) at each grid point using D() fx <- sapply(xs, function(xi) f(xi)) fpx <- sapply(xs, function(xi) D(f, xi)) # Mark the evaluation point x = pi/4 x_mark <- pi / 4 oldpar <- par(mar = c(4, 4, 2, 1)) plot(xs, fx, type = "l", col = "steelblue", lwd = 2, xlab = "x", ylab = "y", main = expression(f(x) == x^2 + sin(x) ~ "and its derivative"), ylim = range(c(fx, fpx))) lines(xs, fpx, col = "firebrick", lwd = 2, lty = 2) points(x_mark, f(x_mark), pch = 19, col = "steelblue", cex = 1.3) points(x_mark, D(f, x_mark), pch = 19, col = "firebrick", cex = 1.3) abline(v = x_mark, col = "grey60", lty = 3) legend("topleft", legend = c("f(x)", "f'(x) via AD"), col = c("steelblue", "firebrick"), lty = c(1, 2), lwd = 2, bty = "n") par(oldpar) ``` ## Works through any R code `D()` handles control flow, loops, and higher-order functions transparently. 
## Works through any R code

`D()` handles control flow, loops, and higher-order functions transparently. No special annotation is needed — just write ordinary R:

```{r}
# if/else branching
safe_log <- function(x) {
  if (x[1] > 0) log(x[1]) else -Inf
}
D(safe_log, 2)   # 1/2 = 0.5
D(safe_log, -1)  # 0 (constant branch)

# for loop
poly <- function(x) {
  result <- 0
  for (i in 1:5) result <- result + x[1]^i
  result
}
D(poly, 2)  # 1 + 2*2 + 3*4 + 4*8 + 5*16 = 129

# Reduce
f_reduce <- function(x) Reduce("+", lapply(1:4, function(i) x[1]^i))
D(f_reduce, 2)
```

More complex compositions also propagate correctly:

```{r}
g <- function(x) exp(-x[1]^2 / 2) / sqrt(2 * pi)  # standard normal PDF

# D(g, 1) should equal -x * dnorm(x) at x = 1
D(g, 1)
-1 * dnorm(1)
D(g, 1) - (-1 * dnorm(1))  # ~0
```

## Multi-parameter functions

For functions of several variables, `gradient()` returns the gradient vector, `hessian()` returns the Hessian matrix, and `jacobian()` handles vector-valued functions:

```{r}
# Gradient of a 2-parameter function
f2 <- function(x) x[1]^2 * x[2]
gradient(f2, c(3, 4))  # c(2*3*4, 3^2) = c(24, 9)

# Hessian
hessian(f2, c(3, 4))

# Jacobian of a vector-valued function
f_vec <- function(x) list(x[1] * x[2], x[1]^2 + x[2])
jacobian(f_vec, c(3, 4))
```

The `D` operator composes for higher-order derivatives. `D(f, x, order = 2)` returns the Hessian; `D(f, x, order = 3)` returns a third-order tensor:

```{r}
f2 <- function(x) x[1]^2 * x[2]
D(f2, c(3, 4), order = 2)  # Hessian
D(f2, c(3, 4), order = 3)  # 2x2x2 third-order tensor
```

## Quick comparison: three ways to differentiate

Three methods for computing the derivative of $f(x) = x^3 \sin(x)$ at $x = 2$:

```{r}
f <- function(x) x[1]^3 * sin(x[1])
f_num <- function(x) x^3 * sin(x)  # plain numeric version
x0 <- 2

# 1. Analytical: f'(x) = 3x^2 sin(x) + x^3 cos(x)
analytical <- 3 * x0^2 * sin(x0) + x0^3 * cos(x0)

# 2. Finite differences
h <- 1e-8
finite_diff <- (f_num(x0 + h) - f_num(x0 - h)) / (2 * h)

# 3. Automatic differentiation
ad_result <- D(f, x0)

# Compare
data.frame(
  method = c("Analytical", "Finite Diff", "AD (nabla)"),
  derivative = c(analytical, finite_diff, ad_result),
  error_vs_analytical = c(0, finite_diff - analytical, ad_result - analytical)
)
```

The AD result matches the analytical derivative to machine precision — zero error — because dual-number arithmetic propagates exact derivative rules through every operation. Finite differences, by contrast, trade $O(h^2)$ truncation error against rounding error from subtractive cancellation, which grows like $\epsilon_{\mathrm{mach}}/h$. With $h = 10^{-8}$ the cancellation term dominates, giving an error of roughly $10^{-7}$; even at the optimal step for double-precision central differences, roughly $\epsilon_{\mathrm{mach}}^{1/3} \approx 6 \times 10^{-6}$, the error bottoms out many orders of magnitude above machine precision.

A bar chart makes this difference vivid:

```{r fig-error-comparison, fig.width=5, fig.height=3.5}
errors <- abs(c(finite_diff - analytical, ad_result - analytical))

# Use log10 scale; clamp AD error to .Machine$double.eps if exactly zero
errors[errors == 0] <- .Machine$double.eps

oldpar <- par(mar = c(4, 5, 2, 1))
bp <- barplot(log10(errors),
              names.arg = c("Finite Diff", "AD (nabla)"),
              col = c("coral", "steelblue"),
              ylab = expression(log[10] ~ "|error|"),
              main = "Absolute error vs analytical derivative",
              ylim = c(-16, 0), border = NA)
abline(h = log10(.Machine$double.eps), lty = 2, col = "grey40")
text(mean(bp), log10(.Machine$double.eps) + 0.8, "machine epsilon",
     cex = 0.8, col = "grey40")
par(oldpar)
```
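The step-size claim can be checked directly. The following supplementary sweep (not part of the original comparison; it reuses `f_num`, `x0`, and `analytical` from the chunk above) evaluates the central-difference error over a range of `h` values and should show the characteristic U-shape: truncation error dominates for large `h`, cancellation for small `h`, and no choice reaches the accuracy of AD.

```{r}
# Supplementary: central-difference error as a function of the step size h
hs <- 10^(-(2:12))
fd_err <- sapply(hs, function(h) {
  abs((f_num(x0 + h) - f_num(x0 - h)) / (2 * h) - analytical)
})
data.frame(h = hs, abs_error = signif(fd_err, 3))
```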
This advantage becomes critical for optimization algorithms that depend on accurate gradients.

## Why exact derivatives matter

Any optimizer can find the MLE — the point estimate is the easy part. What `nabla` gives you is the **exact Hessian** at the MLE, and that's where the real statistical payoff lies.

The observed information matrix $J(\hat\theta) = -H(\hat\theta)$ is the negative Hessian of the log-likelihood at the MLE. Its inverse gives the asymptotic variance-covariance matrix:

$$\text{Var}(\hat\theta) \approx J(\hat\theta)^{-1}$$

Inaccurate Hessians mean inaccurate standard errors and confidence intervals. Finite-difference Hessians carry truncation and rounding error that propagates directly into your CIs. With `nabla`, the Hessian is exact to machine precision — so your standard errors and confidence intervals are as accurate as the asymptotic theory allows.

Higher-order derivatives push further: `D(f, x, order = 3)` yields the third-order derivative tensor, which quantifies the **asymptotic skewness** of the MLE's sampling distribution — a correction that standard Wald intervals miss entirely.

## How it works: dual numbers

Under the hood, `nabla` uses **dual numbers** — values of the form $a + b\varepsilon$, where $\varepsilon^2 = 0$. Evaluating a function $f$ on a dual number $x + \varepsilon$ yields:

$$f(x + \varepsilon) = f(x) + f'(x)\,\varepsilon$$

The "value" part gives $f(x)$ and the "epsilon" part gives $f'(x)$, computed automatically by propagating $\varepsilon$ through every arithmetic operation. This is **forward-mode automatic differentiation (AD)**.

The low-level constructors expose dual numbers directly. In this style, functions use bare `x` (not `x[1]`) because they receive a single `dualr` object rather than a vector:

```{r}
# A dual variable: value = 3, derivative seed = 1
x <- dual_variable(3)
value(x)
deriv(x)

# A dual constant: value = 5, derivative seed = 0
k <- dual_constant(5)
value(k)
deriv(k)

# Explicit constructor
y <- dual(2, 1)
value(y)
deriv(y)
```

Setting `deriv = 1` means "I'm differentiating with respect to this variable." Setting `deriv = 0` means "this is a constant."

All standard arithmetic operators propagate derivatives:

```{r}
x <- dual_variable(3)

# Addition: d/dx(x + 2) = 1
r_add <- x + 2
value(r_add)
deriv(r_add)

# Subtraction: d/dx(5 - x) = -1
r_sub <- 5 - x
value(r_sub)
deriv(r_sub)

# Multiplication: d/dx(x * 4) = 4
r_mul <- x * 4
value(r_mul)
deriv(r_mul)

# Division: d/dx(1/x) = -1/x^2 = -1/9
r_div <- 1 / x
value(r_div)
deriv(r_div)

# Power: d/dx(x^3) = 3*x^2 = 27
r_pow <- x^3
value(r_pow)
deriv(r_pow)
```

All standard mathematical functions are supported, with derivatives computed via the chain rule:

```{r}
x <- dual_variable(1)

# exp: d/dx exp(x) = exp(x)
r_exp <- exp(x)
value(r_exp)
deriv(r_exp)

# log: d/dx log(x) = 1/x
r_log <- log(x)
value(r_log)
deriv(r_log)

# sin: d/dx sin(x) = cos(x)
x2 <- dual_variable(pi / 4)
r_sin <- sin(x2)
value(r_sin)
deriv(r_sin)  # cos(pi/4)

# sqrt: d/dx sqrt(x) = 1/(2*sqrt(x))
x3 <- dual_variable(4)
r_sqrt <- sqrt(x3)
value(r_sqrt)
deriv(r_sqrt)  # 1/(2*2) = 0.25

# Gamma-related
x4 <- dual_variable(3)
r_lgamma <- lgamma(x4)
value(r_lgamma)  # log(2!) = log(2)
deriv(r_lgamma)  # digamma(3)
```
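As a supplementary illustration (using only the constructors and operators shown above), the same mechanism can reproduce the quick-start result by hand: seed the input as a dual variable, evaluate the expression with ordinary arithmetic, and read the value and derivative off the two parts.

```{r}
# Differentiate f(x) = x^2 + sin(x) at pi/4 by propagating a dual number manually
x <- dual_variable(pi / 4)
res <- x^2 + sin(x)
value(res)  # (pi/4)^2 + sin(pi/4)
deriv(res)  # 2*(pi/4) + cos(pi/4), matching D(f, pi/4) from the quick start
```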
Dual numbers integrate with standard R patterns:

```{r}
# sum() works
a <- dual_variable(2)
b <- dual_constant(3)
total <- sum(a, b, dual_constant(1))
value(total)  # 6
deriv(total)  # 1 (only a has deriv = 1)

# prod() works
p <- prod(a, dual_constant(3))
value(p)  # 6
deriv(p)  # 3 (product rule: 3*1 + 2*0)

# c() creates a dual_vector
v <- c(a, b)
length(v)

# is.numeric returns TRUE (for compatibility)
is.numeric(dual_variable(1))
```

In most cases you won't need these constructors directly — `D()`, `gradient()`, `hessian()`, and `jacobian()` handle dual number creation internally. But understanding the mechanism helps when debugging or building custom AD workflows.

## What's next?

This vignette introduced `D()` and showed that AD computes exact derivatives where finite differences cannot. For practical applications:

- **`vignette("mle-workflow")`** applies `gradient()`, `hessian()`, and the observed information matrix to five increasingly complex MLE problems, from Normal to logistic regression, and builds a Newton-Raphson optimizer powered entirely by AD.
- **`vignette("higher-order")`** demonstrates `D(f, x, order = 2)` for second derivatives, with applications to curvature analysis, Taylor expansion, and exact Hessians for MLE standard errors.
- **`vignette("optimizer-integration")`** shows how to plug `nabla` gradients into R's built-in optimizers (`optim`, `nlminb`) for production-quality MLE fitting.
- **`vignette("mle-skewness")`** uses `D(f, x, order = 3)` to compute the asymptotic skewness of MLEs — a third-order analysis that goes beyond standard Wald confidence intervals.