Type: Package
Title: VCG Sampling using Energy-Based Covariate Balancing
Version: 0.9.2
Description: Provides a principled framework for sampling a Virtual Control Group (VCG) using energy distance-based covariate balancing. The package offers visualization tools to assess covariate balance and includes a permutation test to evaluate the statistical significance of observed deviations.
License: MIT + file LICENSE
Encoding: UTF-8
RoxygenNote: 7.3.3
Depends: ggplot2
Imports: ggforce, osqp, patchwork
Suggests: knitr, rmarkdown
VignetteBuilder: knitr
NeedsCompilation: no
Packaged: 2025-12-16 15:18:45 UTC; I0559285
Author: Andreas Schulz [aut, cre] (Created in 2025), Sanofi [cph, fnd]
Maintainer: Andreas Schulz <andreas.schulz2@sanofi.com>
Repository: CRAN
Date/Publication: 2025-12-19 20:10:17 UTC

Find the Optimal Size for the VCG

Description

The function evaluates a range of VCG sizes and searches for the size that yields the smallest energy distance.

Usage

BestVCGsize(formula, data = data, plot = TRUE)

Arguments

formula

A formula specifying the treated indicator and covariates, e.g., 'treated ~ cov1 + cov2 | stratum'. The treated variable must be binary (0=pool, 1=treated).

data

A data frame containing the variables specified in the formula.

plot

Logical. If 'TRUE', returns a ggplot2 plot. Default: TRUE

Details

This function is intended for exploratory use only, because the VCG size is normally fixed in advance. It can nevertheless be used to check how well a given size fits. The recommended VCG size is based solely on the distance and does not take into account other aspects such as power or validity.
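The idea of trying several sizes and keeping the one with the smallest distance can be sketched in base R. This is a minimal illustration and not the package's implementation: `energy_1d()` and `best_size_sketch()` are hypothetical helpers, only a single covariate is used, and pool units are ranked by a placeholder criterion (absolute distance to the treated median), which is an assumption made purely for this example.

```r
# Energy distance between two numeric vectors (V-statistic form).
energy_1d <- function(x, y) {
  2 * mean(abs(outer(x, y, "-"))) -
    mean(abs(outer(x, x, "-"))) -
    mean(abs(outer(y, y, "-")))
}

# Hypothetical size search: rank pool units, then evaluate each size k.
best_size_sketch <- function(pool, treated, sizes = 2:length(pool)) {
  # Placeholder ranking (assumption): closeness to the treated median
  ranked <- pool[order(abs(pool - median(treated)))]

  # Energy distance between the k best-ranked pool units and the treated group
  curve <- vapply(
    sizes,
    function(k) energy_1d(ranked[seq_len(k)], treated),
    numeric(1)
  )

  list(optimal_n = sizes[which.min(curve)], curve = curve)
}

set.seed(2342)
pool    <- rnorm(50, 11, 2)
treated <- rnorm(30, 10, 1)

res <- best_size_sketch(pool, treated)
res$optimal_n      # size with the smallest distance
head(res$curve)    # distance for the first few sizes
```

As the Details note, a size chosen this way reflects distance only, so it should be read as a diagnostic rather than a design decision.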

Value

If 'plot = TRUE', returns a list with:

optimal_n

The estimated optimal VCG size (integer).

plot

A ggplot2 object visualizing the energy distance curve and plateau.

Examples


set.seed(2342)
dat <- data.frame(
  treat = rep(0:1, c(50, 30)),
  cov1 = c(rnorm(50, 11, 2),  rnorm(30, 10, 1)),
  cov2 = c(rnorm(50, 12, 2),  rnorm(30, 10, 1)),
  cov3 = c(rnorm(50, 9,  2),  rnorm(30, 10, 1))
)
BestVCGsize(treat ~ cov1 + cov2 + cov3, data = dat)


VCG Sampler for Energy Distance Balancing

Description

This function performs energy distance-based balancing and selects a subset from the pool, based on energy distance, to approximate a randomized controlled trial. Optionally, it visualizes the balancing results.

Usage

VCG_sampler(formula, data, n, c_w = NULL, random = FALSE, plot = TRUE)

Arguments

formula

A formula specifying the treated indicator and covariates, e.g., 'treated ~ cov1 + cov2 | stratum'. The treated variable must be binary (0=pool, 1=treated)

data

A data frame containing the variables specified in the formula.

n

Integer. Number of observations to sample from the pool, or a vector giving n for each stratum.

c_w

Optional: Vector of positive weights for covariates, reflecting the relative importance of the covariates for balancing.

random

Logical. If 'TRUE', observations are sampled with a probability inversely proportional to their energy distance; otherwise, the nearest observations are selected (deterministic). Default: FALSE

plot

Logical. If 'TRUE', returns a visualization of the balancing effect.

Details

If 'random' is FALSE, the function selects the 'n' units from the pool with the lowest energy distance and assigns them to the VCG. If 'random' is TRUE, the function samples 'n' units from the pool with a sampling probability inversely proportional to the energy distance.

The quality of covariate balancing is visualized using differences in medians and median absolute deviations (MADs). Permutation ellipses are generated by randomly permuting the pool and treated groups to estimate the variability expected by chance. Only the X and Y axes are computed directly; the ellipse is interpolated between the axes. This method is intended as a visual approximation rather than a precise statistical test.
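The two selection modes and the balance summary can be illustrated with a short base R sketch. It is not the package's code: `select_vcg_sketch()` and `balance_sketch()` are hypothetical helpers, the per-unit distances `d` are placeholder values (how the package derives them is not described here), and the sign convention of the differences (VCG minus treated) is an assumption.

```r
# Hypothetical selection rule given one distance value per pool unit.
select_vcg_sketch <- function(d, n, random = FALSE) {
  if (random) {
    # Sampling probability inversely proportional to the distance
    # (assumes all d > 0); sample() normalises the weights itself
    sample(seq_along(d), size = n, prob = 1 / d)
  } else {
    # Deterministic: indices of the n smallest distances
    order(d)[seq_len(n)]
  }
}

# One possible balance summary for a single covariate:
# differences in medians and in MADs (VCG minus treated, an assumption).
balance_sketch <- function(x_vcg, x_treated) {
  c(median_diff = median(x_vcg) - median(x_treated),
    mad_diff    = mad(x_vcg) - mad(x_treated))
}

d <- c(0.9, 0.2, 0.5, 0.1, 0.7, 0.3)   # placeholder distances

select_vcg_sketch(d, n = 3)             # 4 2 6

set.seed(1)
select_vcg_sketch(d, n = 3, random = TRUE)
```

With `random = FALSE` the result is reproducible without a seed, whereas the random mode returns a different subset on each call unless a seed is set.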

Value

If 'plot = TRUE', returns a list with:

If 'plot = FALSE', returns only the modified data frame.

Examples


dat   <- data.frame(
  cov1  = rnorm(50, 10, 1),
  cov2  = rnorm(50, 7,  1),
  cov3  = rnorm(50, 5,  1),
  treated = rep(c(0, 1), c(35, 15))
)
VCG_sampler(treated ~ cov1 + cov2 + cov3, data = dat, n = 5)


Combine data from pool and treated groups

Description

If your data is stored in separate files, you can use this function to merge them.

Usage

combine_data(POOL_data, TG_data, indicator_name = "treated")

Arguments

POOL_data

Data frame with the POOL data, i.e. the observations to sample from.

TG_data

Data frame with the TG (treated groups) data. All treated groups must be combined in this one data frame.

indicator_name

Name of the indicator variable that is created for further use in the package. Default: 'treated'

Value

Data frame with all covariates that were present in both inputs, plus a new indicator factor distinguishing POOL from TG.
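The described result (common covariates only, plus an indicator) can be reproduced with a few lines of base R. The sketch below is an illustration, not the package's implementation: `combine_sketch()` is a hypothetical helper, and the factor levels "POOL" and "TG" are assumed from the description above.

```r
# Hypothetical helper: stack two data frames on their shared columns.
combine_sketch <- function(pool, tg, indicator_name = "treated") {
  # Keep only the columns that occur in both inputs
  common <- intersect(names(pool), names(tg))

  out <- rbind(pool[, common, drop = FALSE],
               tg[, common, drop = FALSE])

  # Indicator factor marking the origin of each row (levels are an assumption)
  out[[indicator_name]] <- factor(
    rep(c("POOL", "TG"), c(nrow(pool), nrow(tg))),
    levels = c("POOL", "TG")
  )
  out
}

pool <- data.frame(cov1 = 1:3, cov2 = 4:6, sex = c(0, 1, 0))
tg   <- data.frame(cov2 = 7:8, cov4 = 9:10, sex = c(1, 1))

str(combine_sketch(pool, tg))
```

Columns that exist in only one input (`cov1` and `cov4` here) are dropped, which matches the "present in both" wording of the return value.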

Examples


pool_data <- data.frame(
  cov1   = rnorm(100, 11, 2),
  cov2   = rnorm(100, 11, 2),
  cov3   = rnorm(100, 11, 2),
  sex    = rbinom(100, 1, 0.5))

tg_data <- data.frame(
  cov2   = rnorm(20, 12, 1),
  cov3   = rnorm(20, 12, 1),
  cov4   = rnorm(20, 12, 1),
  sex    = rbinom(20, 1, 0.5))

dx <- combine_data(pool_data, tg_data)
str(dx)


Compute Energy Distance Between Two Groups

Description

Calculates the energy distance between two groups.

Usage

energy_distance(formula, data, standardized = TRUE)

Arguments

formula

A formula specifying the treated indicator and covariates, e.g., 'treated ~ cov1 + cov2'. The treated variable must be binary (0=pool, 1=treated).

data

A data frame containing the variables specified in the formula.

standardized

Logical. If TRUE, the standardized energy distance, the so-called E-coefficient, is returned; it lies in the range 0 to 1. If FALSE, the unscaled energy distance is returned, which can exceed 1.

Details

Energy distance is a non-parametric measure of distributional difference. It is sensitive to differences in location, scale, and shape between groups. Before calculation, the covariates are scaled to a mean value of 0 and a standard deviation of 1.
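To make the definition concrete, the following base R sketch computes a two-sample energy distance from pairwise Euclidean distances. It is an illustration under stated assumptions and not the package's implementation: `energy_distance_sketch()` is a hypothetical helper, the covariates are scaled on the pooled sample, the V-statistic form is used, and the standardized value is taken to be the energy distance divided by twice the mean between-group distance.

```r
# Hypothetical helper: energy distance between the rows of x and y.
energy_distance_sketch <- function(x, y, standardized = TRUE) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  n <- nrow(x)
  m <- nrow(y)

  # Scale the pooled covariates to mean 0 and standard deviation 1
  z <- scale(rbind(x, y))

  # Pairwise Euclidean distances between all pooled observations
  d  <- as.matrix(dist(z))
  ix <- seq_len(n)
  iy <- n + seq_len(m)

  between  <- mean(d[ix, iy])   # mean distance between the groups
  within_x <- mean(d[ix, ix])   # mean distance within the first group
  within_y <- mean(d[iy, iy])   # mean distance within the second group

  e <- 2 * between - within_x - within_y
  if (standardized) e / (2 * between) else e
}

set.seed(1)
pool    <- cbind(rnorm(50, 11, 2), rnorm(50, 12, 2))
treated <- cbind(rnorm(30, 10, 1), rnorm(30, 10, 1))

energy_distance_sketch(pool, treated)                        # in [0, 1]
energy_distance_sketch(pool, treated, standardized = FALSE)  # unscaled
```

Because the within-group means include the zero self-distances, this form cannot become negative, and it is exactly zero when both groups contain the same observations.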

Value

A numeric value representing the energy distance between the two groups.

Examples

dat <- data.frame(
  treated = rep(0:1, c(50, 30)),
  age     = c(rnorm(50, 5, 2),    rnorm(30, 5, 1)),
  weight  = c(rnorm(50, 11, 2),   rnorm(30, 10, 1)),
  class   = c(rbinom(50, 3, 0.6), rbinom(30, 3, 0.4))
)

energy_distance(treated ~ age + weight + class, data = dat)


Permutation Energy Test for Covariate Imbalance

Description

Performs a permutation-based energy distance test to assess whether two groups (defined by a binary treated variable) are balanced across a set of covariates. Optionally, it visualizes the distribution of permuted energy distances and highlights the observed test statistic and critical value.

Usage

energy_test(formula, data, alpha = 0.05, R = 2000, plot = TRUE)

Arguments

formula

A formula specifying the treated indicator and covariates, e.g., 'treated ~ cov1 + cov2 | stratum'.

data

A data frame containing the variables specified in the formula.

alpha

Significance level for the test (default is 0.05).

R

Number of permutations to perform (default is 2000).

plot

Logical. If 'TRUE', returns a ggplot2 visualization of the permutation distribution.

Details

The energy distance is a non-parametric measure of distributional difference. This test evaluates whether the covariate distributions of the two groups are statistically distinguishable. A small p-value indicates imbalance between the groups. A one-sided test is used because the energy distance is non-negative; only values in the permutation distribution that exceed the observed statistic are relevant.
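The logic of the one-sided permutation test can be sketched in base R. This is an illustration, not the package's implementation: `energy_stat()` and `perm_energy_test_sketch()` are hypothetical helpers, the unscaled V-statistic form of the energy distance is assumed, and the p-value uses the common "plus one" convention that counts the observed labelling as one of the permutations.

```r
# Energy distance from a precomputed distance matrix and group labels.
energy_stat <- function(d, is_treated) {
  2 * mean(d[is_treated, !is_treated]) -
    mean(d[is_treated, is_treated]) -
    mean(d[!is_treated, !is_treated])
}

# Hypothetical one-sided permutation test on the energy distance.
perm_energy_test_sketch <- function(x, treated, R = 500, alpha = 0.05) {
  # Distances are computed once; permuting labels only re-indexes the matrix
  d <- as.matrix(dist(scale(as.matrix(x))))
  is_treated <- treated == 1

  observed <- energy_stat(d, is_treated)
  permuted <- replicate(R, energy_stat(d, sample(is_treated)))

  list(
    statistic = observed,
    # Upper tail only: share of permuted values at least as large as observed
    p.value   = (1 + sum(permuted >= observed)) / (R + 1),
    # Critical value at level alpha from the permutation distribution
    critical  = unname(quantile(permuted, 1 - alpha))
  )
}

set.seed(42)
x <- cbind(c(rnorm(50, 5, 2),  rnorm(30, 5, 1)),
           c(rnorm(50, 11, 2), rnorm(30, 10, 1)))
treated <- rep(0:1, c(50, 30))

perm_energy_test_sketch(x, treated, R = 500)
```

Shuffling the labels keeps the group sizes fixed, so each permuted statistic is directly comparable with the observed one.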

Value

If 'plot = TRUE', returns a list with:

If 'plot = FALSE', returns only the '"htest"' result list.

Examples


dat <- data.frame(
  treated = rep(0:1, c(50, 30)),
  age     = c(rnorm(50, 5, 2),    rnorm(30, 5, 1)),
  weight  = c(rnorm(50, 11, 2),   rnorm(30, 10, 1)),
  class   = c(rbinom(50, 3, 0.6), rbinom(30, 3, 0.4))
)

energy_test(treated ~ age + weight + class, data = dat, R = 500)


Multi-Sample VCG Generator and Overlap Visualization

Description

Repeatedly samples VCGs from the pool (via 'VCG_sampler' with 'random=TRUE') and optionally plots the overlap between the VCGs.

Usage

multiSampler(formula, data, n, c_w = NULL, Nsamples = 20, plot = TRUE)

Arguments

formula

A formula specifying the treated indicator and covariates, e.g., 'treated ~ cov1 + cov2'. The treated variable must be binary (0=pool, 1=treated).

data

A data frame containing the variables specified in the formula.

n

Integer. Number of observations to sample from the pool, or a vector giving n for each stratum.

c_w

Optional: Vector of positive weights for covariates, reflecting the relative importance of the covariates for the balancing.

Nsamples

Number of VCGs to generate (default is 20).

plot

Logical; if 'TRUE', returns a ggplot2 plot showing the overlap of VCGs (default is 'TRUE').

Details

The function repeatedly calls 'VCG_sampler' with 'random' set to TRUE to generate multiple VCGs. It calculates how often each observation is selected and computes the average percentage of overlapping observations. This function should only be used if multiple VCGs are really needed, e.g. for PoC studies. It is not intended for generating several VCGs and then selecting one of them afterwards. If a single VCG is needed, the VCG_sampler function should be used directly and only one VCG should be generated.
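The two summaries mentioned above can be sketched in base R from a membership matrix. This is an illustration only: `overlap_sketch()` and the logical matrix `member` (pool units in rows, VCG samples in columns) are hypothetical, all VCGs are assumed to have the same size, and "average overlap" is interpreted here as the mean share of units that two VCGs have in common, which may differ from the package's definition.

```r
# Hypothetical helper: selection frequency and mean pairwise overlap.
overlap_sketch <- function(member) {
  n_per_vcg <- unname(colSums(member)[1])   # assumes equal VCG sizes

  # How many times each pool unit was selected across all VCGs
  freq <- rowSums(member)

  # shared[i, j] = number of units that are in both VCG i and VCG j
  shared <- crossprod(member * 1)
  pairs  <- shared[upper.tri(shared)]

  list(freq = freq,
       mean_overlap_pct = 100 * mean(pairs) / n_per_vcg)
}

# Purely random membership as stand-in data (not produced by VCG_sampler)
set.seed(1)
n_pool    <- 50
n         <- 10
n_samples <- 10
member <- replicate(n_samples, seq_len(n_pool) %in% sample(n_pool, n))

res <- overlap_sketch(member)
table(res$freq)          # distribution of selection counts
res$mean_overlap_pct     # average pairwise overlap in percent
```

A high overlap indicates that the samples keep returning largely the same pool units, so the VCGs are far from independent of each other.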

Value

If 'plot = TRUE', returns a list with:

data

The original data frame with additional VCG columns ('VCG_1', ..., 'VCG_Nsamples').

p

A 'ggplot2' object showing the number of times each observation was selected across VCG samples.

If 'plot = FALSE', returns the modified data frame only.

Examples


dat <- data.frame(
  treat = rep(0:1, c(50, 30)),
  cov_1 = c(rnorm(50, 5, 2),   rnorm(30, 5, 1)),
  cov_2 = c(rnorm(50, 11, 2),  rnorm(30, 10, 1))
)

result <- multiSampler(treat ~ cov_1 + cov_2, data = dat, n = 10, Nsamples = 10)


Visualize Covariate Distribution Across TG, VCG, and POOL

Description

Creates a plot to compare the distribution of a selected variable across three groups: TG (treated groups), VCG (virtual control group), and POOL (data pool).

Usage

plot_var(data, what = NULL, stratum = "in_stratum", group = "VCG", title = "")

Arguments

data

A balanced data frame (output of the VCG_sampler function).

what

A string specifying the name of the variable to be visualized.

stratum

A string specifying the name of the stratum variable (default is '"in_stratum"').

group

A string specifying the column name used to define group membership (default is '"VCG"').

title

Optional title for the plot.

Details

The function uses energy distance to quantify distributional differences between groups. For continuous variables, it overlays dashed lines for TG group statistics (mean, min, max) and displays sample sizes. For categorical variables, it uses color-coded bars and cumulative proportion lines to highlight imbalance.

Value

A ggplot2 object showing either:

Examples


dat   <- data.frame(
  cov1  = rnorm(50, 10, 1),
  cov2  = rnorm(50, 7,  1),
  cov3  = rnorm(50, 5,  1),
  treated = rep(c(0, 1), c(35, 15))
)
out <- VCG_sampler(treated ~ cov1 + cov2 + cov3, data = dat, n = 5, plot = FALSE)
plot_var(out, what = 'cov1', group = 'VCG')
plot_var(out, what = 'cov2', group = 'VCG')


Robust Scaling of Numeric and Categorical Variables

Description

Applies robust scaling to numeric and categorical variables. For numeric variables, the function centers by the median and scales by the MAD. For categorical variables with 2–4 unique levels, it applies a custom transformation to map them to numeric values.

Usage

robust_scale(x, group)

Arguments

x

A numeric vector, factor, matrix, or data frame. If a matrix or data frame is provided, scaling is applied column-wise.

group

A vector indicating which observations belong to the TG, the group to scale to.

Details

This function is designed to make numeric and categorical variables comparable. This is an internal function that should not be used by package users.
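For the numeric case, median/MAD scaling relative to a reference group can be sketched as follows. This is an illustration and not the internal implementation: `robust_scale_sketch()` is a hypothetical helper, the argument `is_tg` (a logical vector marking TG observations) is an assumed interface, and the transformation applied to categorical variables is not covered.

```r
# Hypothetical helper: centre and scale x by the TG median and MAD.
robust_scale_sketch <- function(x, is_tg) {
  ref <- x[is_tg]                      # reference group (TG)
  (x - median(ref)) / mad(ref)         # assumes mad(ref) > 0
}

set.seed(1)
x     <- c(rnorm(70, 11, 2), rnorm(30, 10, 1))
is_tg <- rep(c(FALSE, TRUE), c(70, 30))

z <- robust_scale_sketch(x, is_tg)

round(median(z[is_tg]), 2)   # TG median after scaling
round(mad(z[is_tg]), 2)      # TG MAD after scaling
```

After scaling, the TG has median 0 and MAD 1 by construction, so the values of the pool units express their position relative to the TG.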

Value

A scaled numeric vector or a data frame with scaled columns.

Examples


dat <- data.frame(x = rnorm(100, 10, 3),
                  sex = factor(rbinom(100, 1, 0.5), labels = c("M", "F")))

x <- robust_scale(dat$x, dat$sex)
round(median(x), 2)
round(mad(x), 2)