The riskdiff package provides robust methods for calculating risk differences (also known as prevalence differences in cross-sectional studies) using generalized linear models with automatic link function selection and boundary detection.
riskdiff now includes cutting-edge boundary detection capabilities that identify when maximum likelihood estimates lie at the edge of the parameter space - a common issue with identity link models that other packages ignore.
John D. Murphy, MPH, PhD ORCID: 0000-0002-7714-9976
You can install the development version of riskdiff from GitHub with:
# install.packages("devtools")
::install_github("jackmurphy2351/riskdiff") devtools
library(riskdiff)
# Load example data
data(cachar_sample)
# Simple risk difference with boundary detection
<- calc_risk_diff(
result data = cachar_sample,
outcome = "abnormal_screen",
exposure = "smoking"
)#> Waiting for profiling to be done...
print(result)
#> Risk Difference Analysis Results (v0.2.0+)
#> ==========================================
#>
#> Confidence level: 95%
#> Number of comparisons: 1
#>
#> Exposure Risk Difference 95% CI P-value Model Boundary CI Method
#> smoking 10.68% (5.95%, 15.75%) <0.001 identity wald
# Create data that challenges standard GLM methods
set.seed(123)
<- data.frame(
challenging_data outcome = c(rep(0, 40), rep(1, 60)), # High baseline risk
exposure = factor(c(rep("No", 50), rep("Yes", 50))),
age = rnorm(100, 45, 10)
)
# riskdiff handles this gracefully with boundary detection
<- calc_risk_diff(
result data = challenging_data,
outcome = "outcome",
exposure = "exposure",
adjust_vars = "age",
verbose = TRUE # Shows diagnostic information
)#> Formula: outcome ~ exposure + age
#> Sample size: 100
#> Trying identity link...
#> Using starting values: 0.2, 0.8, 0.004
#> Identity link error: cannot find valid starting values: please specify some
#> Trying log link...
#> log link error: no valid set of coefficients has been found: please supply starting values
#> Trying logit link...
#> [Huzzah!]logit link converged
#> Waiting for profiling to be done...
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#> Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm):
#> collapsing to unique 'x' values
#> Boundary case detected: separation
#> Warning: Logit model may have separation issues. Very large coefficient estimates detected.
#> Note: 1 of 1 analyses had MLE on parameter space boundary. Robust confidence intervals were used.
print(result)
#> Risk Difference Analysis Results (v0.2.0+)
#> ==========================================
#>
#> Confidence level: 95%
#> Number of comparisons: 1
#> Boundary cases detected: 1 of 1
#> Boundary CI method: auto
#>
#> Exposure Risk Difference 95% CI P-value Model Boundary
#> exposure 80.06% (-199.05%, 359.17%) 0.993 logit [Uh oh]separation
#> CI Method
#> wald_conservative
#>
#> Boundary Case Details:
#> =====================
#> Row 1 ( exposure ): Logit model may have separation issues. Very large coefficient estimates detected.
#>
#> Boundary Type Guide:
#> - upper_bound: Fitted probabilities near 1
#> - lower_bound: Fitted probabilities near 0
#> - separation: Complete/quasi-separation detected
#> - both_bounds: Probabilities near both 0 and 1
#> - [Uh oh] indicates robust confidence intervals were used
#>
#> Note: Standard asymptotic theory may not apply for boundary cases.
#> Confidence intervals use robust methods when boundary detected.
# Check if boundary cases were detected
if (any(result$on_boundary)) {
cat("\n🚨 Boundary case detected! Using robust inference methods.\n")
cat("Boundary type:", unique(result$boundary_type[result$on_boundary]), "\n")
cat("CI method:", unique(result$ci_method[result$on_boundary]), "\n")
}#>
#> 🚨 Boundary case detected! Using robust inference methods.
#> Boundary type: separation
#> CI method: wald_conservative
# Age-adjusted risk difference with boundary detection
<- calc_risk_diff(
rd_adjusted data = cachar_sample,
outcome = "abnormal_screen",
exposure = "smoking",
adjust_vars = "age",
boundary_method = "auto" # Automatic robust method selection
)#> Waiting for profiling to be done...
print(rd_adjusted)
#> Risk Difference Analysis Results (v0.2.0+)
#> ==========================================
#>
#> Confidence level: 95%
#> Number of comparisons: 1
#>
#> Exposure Risk Difference 95% CI P-value Model Boundary CI Method
#> smoking 10.94% (7.57%, 14.32%) <0.001 logit wald
# Stratified by residence with boundary detection
<- calc_risk_diff(
rd_stratified data = cachar_sample,
outcome = "abnormal_screen",
exposure = "smoking",
adjust_vars = "age",
strata = "residence"
)#> Waiting for profiling to be done...
#> Waiting for profiling to be done...
#> Waiting for profiling to be done...
print(rd_stratified)
#> Risk Difference Analysis Results (v0.2.0+)
#> ==========================================
#>
#> Confidence level: 95%
#> Number of comparisons: 3
#>
#> Exposure Risk Difference 95% CI P-value Model Boundary CI Method
#> smoking 11.63% (7.83%, 15.44%) <0.001 logit wald
#> smoking 9.99% (-5.89%, 25.87%) 0.218 identity wald
#> smoking -3.86% (-9.05%, 1.32%) 0.706 log wald
# Summary of boundary cases across strata
<- rd_stratified[rd_stratified$on_boundary,
boundary_summary c("residence", "boundary_type", "ci_method")]
if (nrow(boundary_summary) > 0) {
cat("\nBoundary cases by stratum:\n")
print(boundary_summary)
}
# Create a simple text table with boundary information
cat(create_simple_table(rd_stratified, "Risk by Smoking Status and Residence"))
#> Risk by Smoking Status and Residence
#> ====================================================================================
#> Exposure Risk Diff 95% CI P-value Model
#> ====================================================================================
#> smoking 11.63% (7.83%, 15.44%) <0.001 logit
#> smoking 9.99% (-5.89%, 25.87%) 0.218 identity
#> smoking -3.86% (-9.05%, 1.32%) 0.706 log
#> ====================================================================================
# Create publication-ready table (requires kableExtra)
library(kableExtra)
create_rd_table(rd_stratified,
caption = "Risk of Abnormal Screening Result by Smoking Status",
include_model_type = TRUE)
The package uses generalized linear models with different link functions:
New in v0.2.0: When models hit parameter space
boundaries (common with identity links), the package: - 🔍
Detects boundary cases automatically - ⚠️ Warns
users about potential inference issues
- 🛡️ Uses robust confidence intervals when appropriate
- 📊 Reports methodology transparently
# Force specific boundary handling
<- calc_risk_diff(
rd_conservative
cachar_sample,"abnormal_screen",
"smoking",
boundary_method = "auto" # Options: "auto", "profile", "wald"
)#> Waiting for profiling to be done...
# Check which methods were used
table(rd_conservative$ci_method)
#>
#> wald
#> 1
# Force a specific link function
<- calc_risk_diff(
rd_logit
cachar_sample, "abnormal_screen",
"smoking",
link = "logit"
)#> Waiting for profiling to be done...
# Check which model was used and if boundaries detected
cat("Model used:", rd_logit$model_type, "\n")
#> Model used: logit
cat("Boundary detected:", rd_logit$on_boundary, "\n")
#> Boundary detected: FALSE
# 90% confidence intervals with boundary detection
<- calc_risk_diff(
rd_90
cachar_sample,"abnormal_screen",
"smoking",
alpha = 0.10 # 1 - 0.10 = 90% CI
)#> Waiting for profiling to be done...
print(rd_90)
#> Risk Difference Analysis Results (v0.2.0+)
#> ==========================================
#>
#> Confidence level: 90%
#> Number of comparisons: 1
#>
#> Exposure Risk Difference 95% CI P-value Model Boundary CI Method
#> smoking 10.68% (6.68%, 14.91%) <0.001 identity wald
# The package automatically uses appropriate CI methods for boundary cases
# Examine the enhanced result structure
data(cachar_sample)
<- calc_risk_diff(cachar_sample, "abnormal_screen", "smoking")
result #> Waiting for profiling to be done...
names(result)
#> [1] "exposure_var" "rd" "ci_lower" "ci_upper"
#> [5] "p_value" "model_type" "on_boundary" "boundary_type"
#> [9] "ci_method" "n_obs"
# Key new columns:
# - on_boundary: Was a boundary case detected?
# - boundary_type: What type of boundary?
# - boundary_warning: Detailed diagnostic message
# - ci_method: Which CI method was used?
The package includes a realistic simulated cancer screening dataset:
data(cachar_sample)
str(cachar_sample)
#> 'data.frame': 2500 obs. of 11 variables:
#> $ id : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ age : int 53 25 18 28 51 25 56 20 58 18 ...
#> $ sex : Factor w/ 2 levels "male","female": 2 1 2 2 1 2 1 1 1 1 ...
#> $ residence : Factor w/ 3 levels "rural","urban",..: 3 1 1 1 1 1 1 1 1 1 ...
#> $ smoking : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 1 1 1 ...
#> $ tobacco_chewing : Factor w/ 2 levels "No","Yes": 2 1 1 2 2 1 2 1 2 2 ...
#> $ areca_nut : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 1 2 1 2 2 ...
#> $ alcohol : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 2 1 2 ...
#> $ abnormal_screen : int 0 0 0 0 0 0 1 0 1 0 ...
#> $ head_neck_abnormal: int 0 0 0 0 0 0 0 0 0 0 ...
#> $ age_group : Factor w/ 3 levels "Under 40","40-60",..: 2 1 1 1 2 1 2 1 2 1 ...
# Summary statistics showing realistic associations
table(cachar_sample$smoking, cachar_sample$abnormal_screen)
#>
#> 0 1
#> No 1851 317
#> Yes 248 84
# Risk difference analysis
<- calc_risk_diff(cachar_sample, "abnormal_screen", "smoking")
rd_analysis #> Waiting for profiling to be done...
cat("Smoking increases risk of abnormal screening result by",
sprintf("%.1f", rd_analysis$rd * 100), "percentage points\n")
#> Smoking increases risk of abnormal screening result by 10.7 percentage points
Risk differences are particularly valuable when:
Measure | Interpretation | Best When | riskdiff Advantage |
---|---|---|---|
Risk Difference | Absolute change in risk | Common outcomes, policy | Boundary detection |
Risk Ratio | Relative change in risk | Rare outcomes | Standard methods only |
Odds Ratio | Change in odds | Case-control studies | Standard methods only |
This package implements methods based on:
browseVignettes("riskdiff")
If you use this package in your research, please cite:
citation("riskdiff")
riskdiff uniquely provides boundary detection for robust inference!
Please note that the riskdiff project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.