---
title: "Likelihood contributions model"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{Likelihood contributions model}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
old_opts <- options(digits = 2)
```

## Installation

You can install the development version of `likelihood.model` from [GitHub](https://github.com/queelius/likelihood.model) with:

```{r install-likelihood-model, eval = FALSE}
if (!require(devtools)) {
  install.packages("devtools")
}
devtools::install_github("queelius/likelihood.model")
```

```{r load-packages, message=FALSE, warning=FALSE, echo = FALSE}
library(dplyr)
library(likelihood.model)
library(generics)
```

## Exponential series system

Consider a series system with $m$ components, where the $p$-th component in the $i$-th system has lifetime $T_{ip}$, exponentially distributed with rate $\lambda_p$, and the $T_{ip}$ are independent across all $i$ and $p$. The system lifetime is the minimum of the component lifetimes, i.e., $T_i = \min\{T_{i1}, \ldots, T_{im}\}$.

Let's draw a sample from this model with $m = 3$, $\lambda = (1.1, 1.2, 1.3)$, and sample size $n = 150$.

```{r data-gen-exp-series}
n <- 150
rates <- c(1.1, 1.2, 1.3)
set.seed(1235)
df <- data.frame(
  t1 = rexp(n, rates[1]),
  t2 = rexp(n, rates[2]),
  t3 = rexp(n, rates[3])
)
# the system lifetime is the minimum of the component lifetimes
df$t <- apply(df, 1, min)
# map each observation to the corresponding index (component) that was minimum
df$k <- apply(df, 1, which.min)
# mask the lifetimes of the components that did not cause the failure
for (i in 1:n) {
  for (p in 1:3) {
    if (p == df$k[i]) next
    df[i, p] <- NA
  }
}
head(df)
```

Column $t_k$ is the failure time of the $k$-th component, column $t$ is the series system lifetime (equal to $t_k$ for the failed component), and column $k$ is the component that caused the system failure. The component cause of failure is observed, while the lifetimes of the other components are not observed (`NA`).

In what follows, we consider two cases of masking or censoring, i.e., *incomplete data* of some form:

1. We observe the system lifetime and the component cause of failure (columns $t$ and $k$).
2. We observe the system lifetime subject to a right-censoring mechanism and, when the system failure is observed, the component cause of failure.

## Case 1: Observed system failure time and component cause of failure

Suppose we have a series system: whenever a component fails, the system fails, so the system lifetime equals the minimum of the component lifetimes. Suppose the component lifetimes are exponentially distributed with rate $\lambda_k$ for component $k$. In this case, we assume we can observe the system lifetime and the component that caused the system failure, which means we know a few things about each system $i$ in the sample:

1. The component cause of failure $K_i$.
2. The system failure time $T_i$.
3. The component failure time $T_{i K_i}$, which equals $T_i$.
4. Each of the non-failed components $p \neq K_i$ survived at least until $T_{i K_i}$.

We might be tempted to estimate each component's parameter from the failure times of the systems in which that component caused the failure. However, this is incorrect: by conditioning on the component that caused the system failure, we are not observing a random sample of that component's lifetime, but a conditional sample.
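We can see the problem empirically before deriving it. As a quick check, the naive estimator of component 1's rate (the reciprocal of the mean lifetime among the systems where component 1 caused the failure) should land near the sum of the rates, $1.1 + 1.2 + 1.3 = 3.6$, rather than near $\lambda_1 = 1.1$:

```{r naive-component-estimate}
# Naive estimate of component 1's rate: treat the lifetimes of the systems in
# which component 1 caused the failure as an ordinary exponential sample.
# We expect a value near the sum of the rates (about 3.6), not near 1.1.
1 / mean(df$t[df$k == 1])
```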
To see why, note that if $k = 1$, then we are observing $T_{i1} \mid T_{i1} < T_{i2} \text{ and } T_{i1} < T_{i3}$. What is this distribution? We can derive it as
$$
f_{T_i|K_i}(t_i | k_i) = f_{T_i,K_i}(t_i,k_i) / f_{K_i}(k_i) = f_{K_i|T_i}(k_i|t_i) f_{T_i}(t_i) / f_{K_i}(k_i).
$$
Because exponential lifetimes have constant failure rates (the memoryless property), the probability that a given component is the cause of a system failure does not depend on the time of the system failure, so
$$
f_{K_i|T_i}(k_i|t_i) = f_{K_i}(k_i).
$$
That means that
$$
f_{T_i|K_i}(t_i | k_i) = f_{T_i}(t_i),
$$
which in the exponential case is just the pdf of the series system lifetime. Since the minimum of independent exponentially distributed random variables is exponentially distributed with a failure rate equal to the sum of the components' failure rates, the naive estimator above estimates the sum of the failure rates (the failure rate of the series system), not $\lambda_1$.

Instead, we treat this as a kind of right-censoring problem: with $m = 3$ components, when component $1$ is the cause of failure at $t_i$, we observe an exact failure time $T_{i1} = t_i$ and right-censor the other two components at $t_i$. The log-likelihood contribution for this observation is
$$
L_i = \log f_1(t_i|\lambda_1) + \log R_2(t_i|\lambda_2) + \log R_3(t_i|\lambda_3),
$$
which simplifies to
$$
L_i = \log \lambda_1 - (\lambda_1 + \lambda_2 + \lambda_3) t_i,
$$
which has score
$$
s_i = \frac{1}{\lambda_1} \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} - t_i \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}
$$
and Hessian
$$
H_i = -\frac{1}{\lambda_1^2} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}.
$$
To get the total log-likelihood, score, and Hessian, we simply sum the contributions over the (i.i.d.) observations, and that is exactly what the likelihood contribution model does. There are two ways to configure the model to dispatch the `loglik` and `score` functions for each observation type: define methods named after the type (as we do below), or pass the functions directly to the constructor (sketched after the next code block). Either way, dispatch is based on the observation type returned by `obs_type`.

```{r observed-likelihood, cache=TRUE}
score.observed <- function(df, rates, ...) {
  # score for component j: n_j / lambda_j - sum(t)
  rep(-sum(df$t), length(rates)) + (df |> dplyr::count(k))$n / rates
}

loglik.observed <- function(df, rates, ...) {
  sum(log(rates[df$k])) - sum(rates) * sum(df$t)
}

model.observed <- likelihood_contr_model$new(
  obs_type = function(df) {
    "observed"
  }
)
```
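The second way is to pass the contribution functions directly to the constructor via its `logliks` and `scores` arguments (and `hess_logliks`, when available), rather than relying on `loglik.observed` and `score.observed` being found by name. The sketch below is not evaluated and assumes these arguments accept lists keyed by observation type:

```{r observed-likelihood-direct, eval=FALSE}
# Sketch (not run): configure an equivalent model by passing the contribution
# functions directly to the constructor. This assumes the `logliks` and
# `scores` arguments take lists keyed by observation type.
model.observed2 <- likelihood_contr_model$new(
  obs_type = function(df) "observed",
  logliks = list(observed = loglik.observed),
  scores = list(observed = score.observed)
)
```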
Now that we have the likelihood model, we can use it to estimate the parameters. We use the `optim` function to do the numerical optimization, with `control = list(fnscale = -1)` so that `optim` maximizes rather than minimizes the log-likelihood.

```{r mle-optim-observed, cache=TRUE}
optim(rates, fn = loglik(model.observed), df = df, control = list(fnscale = -1))
```

The estimates are close to the true rates $\lambda = (1.1, 1.2, 1.3)$, and convergence code 0 indicates successful optimization.

Note that `likelihood_model` objects have a default method for fitting MLEs, `fit`. Internally, this method uses `optim` to maximize the log-likelihood, but it packs in some additional information and returns the result as a `fisher_mle` object.

```{r mle-observed-likelihood-model, cache=TRUE}
rates.hat <- fit(model.observed)(df, par = rates)
summary(rates.hat)
```

The 95% confidence intervals all contain the true parameter values, and the standard errors are small relative to the estimates, reflecting the information in $n = 150$ observations.

We implement `hess_loglik.observed` for completeness.

```{r hess-loglike}
hess_loglik.observed <- function(df, rates, ...) {
  p <- length(rates)
  H <- matrix(0, p, p)
  counts <- df |> dplyr::count(k)
  for (j in 1:p) {
    H[j, j] <- -counts$n[j] / rates[j]^2
  }
  H
}

# Observed information from MLE (negative Hessian)
cat("Observed information matrix (from MLE):\n")
print(-rates.hat$hessian)

# Computed directly via hess_loglik
cat("\nHessian at MLE:\n")
print(hess_loglik(model.observed)(df, coef(rates.hat)))
```

Up to the sign convention (the observed information is the negative of the log-likelihood Hessian), the two matrices agree, confirming that our analytical `hess_loglik.observed` implementation is correct.

Next, we use the bootstrap to estimate the sampling distribution of the MLE.

```{r bootstrap, cache=TRUE}
model.samp <- sampler(model.observed, df = df, par = rates)
boot_result <- model.samp(n = 500)

cat("Bootstrap MLE:\n")
print(coef(boot_result))
cat("\nBootstrap standard errors:\n")
print(se(boot_result))
cat("\nBootstrap bias:\n")
print(bias(boot_result))
cat("\nBootstrap covariance:\n")
print(vcov(boot_result))
```

The bootstrap standard errors should be close to the asymptotic standard errors from the Fisher information matrix, and the bootstrap bias estimates should be negligible (all under 0.01), confirming that the MLE is nearly unbiased for this sample size.

## Case 2: Right-censoring

We now add right-censoring to the model. We assume that the right-censoring times are independent of the component lifetimes. In particular, we consider the data described earlier, but with right-censoring at time $\tau = 0.5$. This right-censoring means that for a censored observation we only know the event occurred after time $\tau$; we do not know exactly when (and it may still be ongoing).

As a first stab, you might be inclined to just ignore the censored data. However, this is not a good idea, because it leads to biased estimates: we would be estimating the parameters of the truncated distribution $T_i \mid T_i < \tau$ rather than modeling the observed data $S_i$, where $S_i = T_i$ if $T_i < \tau$ and $S_i = \tau$ otherwise, along with the indicator of whether the observation was censored. This is a well-known problem in statistics, and it is why we need *survival analysis* techniques to handle right-censored data.

This is similar to the earlier problem, where we said it was a mistake to focus only on the component cause of failure and estimate each component's parameters independently. In that case, we would be estimating component $j$'s parameters from the conditional sample $T_{ij} \mid K_i = j$ rather than from the full data on the system lifetime $T_i = \min\{T_{i1}, \ldots, T_{im}\}$ and its cause of failure. In either case, the censoring or masking mechanism must not be ignored, otherwise we will get biased estimates.
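To see the bias concretely, we can keep only the systems that failed before $\tau = 0.5$ and fit the complete-data model from Case 1 to them as if nothing had been censored. Because this is a sample from the truncated distribution $T_i \mid T_i < 0.5$, we expect the rate estimates to be noticeably larger than the true rates:

```{r naive-truncation-bias}
# Drop the would-be censored systems and fit the Case 1 model as if the
# remaining data were a complete sample. The estimates should be inflated
# relative to the true rates (1.1, 1.2, 1.3).
df_trunc <- df[df$t <= 0.5, ]
coef(fit(model.observed)(df_trunc, par = rates))
```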
Here's what that right-censored data looks like, with right-censoring time $\tau = 0.5$.

```{r}
# if df$t > .5, then the system was right-censored
df.censored <- df |> mutate(
  censored = t > .5,
  t = ifelse(censored, .5, t)
)
# for censored systems, the cause of failure and component lifetimes are unknown
df.censored[df.censored[, "censored"], "k"] <- NA
df.censored[df.censored[, "censored"], paste0("t", 1:3)] <- rep(NA, 3)
head(df.censored)
```

We define the contribution functions for the censored observations with:

```{r}
loglik.censored <- function(df, rates, ...) {
  -sum(rates) * sum(df$t)
}

score.censored <- function(df, rates, ...) {
  rep(-sum(df$t), length(rates))
}

hess_loglik.censored <- function(df, rates, ...) {
  p <- length(rates)
  matrix(0, p, p)
}

model.censored <- likelihood_contr_model$new(
  obs_type = function(df) {
    ifelse(df$censored, "censored", "observed")
  }
)

mle.censored <- fit(model.censored)(df.censored, par = rates)
summary(mle.censored)
```

Recall that the true parameters are $\lambda = (`r rates`)'$. The estimates are reasonable, particularly given the relatively small sample size and the loss of information due to censoring.

### Inference on censored estimates

Let's examine the censored estimates more closely and compare them to the uncensored case. Censoring reduces information, so we expect wider confidence intervals:

```{r case2-inference}
cat("Uncensored estimates:\n")
cat("  Estimates:", coef(rates.hat), "\n")
cat("  Std errors:", se(rates.hat), "\n")

cat("\nCensored estimates:\n")
cat("  Estimates:", coef(mle.censored), "\n")
cat("  Std errors:", se(mle.censored), "\n")

cat("\n95% Confidence Intervals (uncensored):\n")
print(confint(rates.hat))
cat("\n95% Confidence Intervals (censored):\n")
print(confint(mle.censored))

cat("\nTrue parameters:", rates, "\n")
```

As expected, censoring inflates the standard errors and widens the confidence intervals, because censored observations carry less information than exact ones. Both sets of intervals still contain the true parameter values.

We can also use Fisherian likelihood inference to examine the strength of evidence for the censored estimates:

```{r case2-fisherian}
# Support at MLE (should be 0)
s_at_mle <- support(mle.censored, coef(mle.censored), df.censored, model.censored)
cat("Support at MLE:", s_at_mle, "\n")

# Evidence for MLE vs true parameters
ev <- evidence(model.censored, df.censored, coef(mle.censored), rates)
cat("Evidence (MLE vs true):", ev, "\n")

# Likelihood interval for rate 1 (first component)
li <- likelihood_interval(mle.censored, df.censored, model.censored, k = 8, param = 1)
cat("\n1/8 likelihood interval for lambda_1:\n")
print(li)
```

An evidence value near 0 means the MLE and the true parameters are about equally well supported by the data, which is expected when the MLE is close to the truth. The $1/8$ likelihood interval for $\lambda_1$ provides a plausible range for the parameter without making probability statements --- it contains all parameter values whose likelihood is at least $1/8$ of the maximum.
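To make the likelihood interval concrete, here is a rough sketch (not evaluated) that traces the relative likelihood for $\lambda_1$ over a grid, holding the other rates fixed at their MLEs. Fixing the nuisance rates is only a crude approximation to the interval reported by `likelihood_interval`, and the sketch assumes the wrapper returned by `loglik` takes the data frame and parameter vector in that order, as in the `hess_loglik` example above:

```{r likelihood-interval-sketch, eval=FALSE}
# Rough sketch (not run): relative likelihood for lambda_1 on a grid, with
# lambda_2 and lambda_3 held at their MLEs. Grid values with relative
# likelihood of at least 1/8 approximate the 1/8 likelihood interval.
ll <- loglik(model.censored)
ll_max <- ll(df.censored, coef(mle.censored))
lambda1_grid <- seq(0.2, 3, by = 0.01)
rel_lik <- sapply(lambda1_grid, function(l1) {
  par <- coef(mle.censored)
  par[1] <- l1
  exp(ll(df.censored, par) - ll_max)
})
range(lambda1_grid[rel_lik >= 1 / 8])
```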
## Case 3: Single-parameter model (exponential rate)

The `likelihood_contr_model` also works well with single-parameter models. Here we estimate a single exponential rate from data where some observations are exact and others are right-censored, using a separate contribution type for each. This example also demonstrates that the `score` function works correctly with scalar parameters (a case that requires careful handling of R's dimension-dropping behavior).

```{r single-param}
# Define contribution types for a single-rate exponential
loglik.exp_exact <- function(df, par, ...) {
  lambda <- par[1]
  sum(log(lambda) - lambda * df$t)
}

loglik.exp_right <- function(df, par, ...) {
  lambda <- par[1]
  -lambda * sum(df$t)
}

score.exp_exact <- function(df, par, ...) {
  lambda <- par[1]
  nrow(df) / lambda - sum(df$t)
}

score.exp_right <- function(df, par, ...) {
  -sum(df$t)
}

# Simulate right-censored exponential data
set.seed(42)
lambda_true <- 2.0
n_single <- 200
censor_time_single <- 0.4
raw_lifetimes <- rexp(n_single, lambda_true)
df_single <- data.frame(
  t = pmin(raw_lifetimes, censor_time_single),
  type = ifelse(raw_lifetimes > censor_time_single, "exp_right", "exp_exact")
)
cat("Exact observations:", sum(df_single$type == "exp_exact"), "\n")
cat("Right-censored:", sum(df_single$type == "exp_right"), "\n")

model_single <- likelihood_contr_model$new(
  obs_type = function(df) df$type,
  assumptions = c("exponential lifetimes", "non-informative right censoring")
)

# Fit and verify
mle_single <- fit(model_single)(df_single, par = c(lambda = 1))
cat("\nMLE:", coef(mle_single), "(true:", lambda_true, ")\n")

# Closed-form MLE for comparison: d / T
d <- sum(df_single$type == "exp_exact")
total_T <- sum(df_single$t)
cat("Closed-form MLE:", d / total_T, "\n")

cat("95% CI:", confint(mle_single)[1, ], "\n")
```

The `optim`-based MLE matches the closed-form estimate $\hat\lambda = d/T$ (number of exact observations divided by total observation time), and the 95% confidence interval contains the true value $\lambda = 2$.

## Best Practices

### When to use which model type

| Scenario | Recommended approach |
|----------|---------------------|
| Standard distribution, possibly censored | `likelihood_name()` |
| Heterogeneous observation types | `likelihood_contr_model` |
| Known analytical derivatives | Specialized model (e.g., `weibull_uncensored`) |
| Closed-form MLE exists | Custom class with `fit` override (e.g., `exponential_lifetime`) |
| Standard distribution as a contribution | Define `loglik.<type>` using `likelihood_name` internally |

### Providing analytical derivatives

Numerical differentiation (the default fallback) works but is slow --- each gradient evaluation requires $2r \times p$ function evaluations (where $r = 6$ and $p$ is the number of parameters). When analytical derivatives are available, providing `score.<type>` and `hess_loglik.<type>` functions gives 10--100x speedups:

```{r analytical-example, eval=FALSE}
# Slow (default): numerical differentiation of loglik.my_type
loglik.my_type <- function(df, par, ...) { ... }

# Fast: provide analytical derivatives
score.my_type <- function(df, par, ...) { ... }
hess_loglik.my_type <- function(df, par, ...) { ... }
```

The `likelihood_contr_model` checks for these functions in this order:

1. Functions passed directly via the `logliks`, `scores`, and `hess_logliks` arguments.
2. Functions found by name in the global environment (`loglik.<type>`, etc.).
3. Numerical differentiation of the log-likelihood (last resort).

### Performance tip: the S3 cache

The S3 wrappers (`loglik(model)`, `score(model)`, `hess_loglik(model)`) cache the data frame split and the resolved dispatcher functions. During optimization, `optim` calls the returned function hundreds of times with the same `df` but varying `par` --- caching eliminates the repeated `split()` and `obs_type()` overhead. Always use the S3 interface for optimization:

```{r cache-tip, eval=FALSE}
# Good: cached S3 wrapper
ll <- loglik(model)
result <- optim(par, fn = ll, df = df, control = list(fnscale = -1))

# Also good: using fit(), which does this internally
result <- fit(model)(df, par = par)

# Slower: calling the R6 method directly in a loop (no caching)
# model$loglik(df, par) -- re-splits df every call
```
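As a rough way to see the effect of the cache (not evaluated here), one could time repeated calls through the cached wrapper against repeated calls to the R6 method; the exact numbers depend on the data and the machine:

```{r cache-benchmark, eval=FALSE}
# Rough sketch (not run): the cached wrapper avoids re-splitting df and
# re-resolving the dispatch on every call, so the first loop should be
# noticeably faster than calling the R6 method directly.
ll <- loglik(model.censored)
system.time(for (i in 1:200) ll(df.censored, rates))
system.time(for (i in 1:200) model.censored$loglik(df.censored, rates))
```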
```{r cleanup, include=FALSE}
options(old_opts)
```