Title: Machine Learning Modelling for Everyone
Version: 0.2.2
Description: A minimal library specifically designed to make the estimation of Machine Learning (ML) techniques as easy and accessible as possible, particularly within the framework of the Knowledge Discovery in Databases (KDD) process in data mining. The package provides essential tools to structure and execute each stage of a predictive or classification modeling workflow, aligning closely with the fundamental steps of the KDD methodology, from data selection and preparation, through model building and tuning, to the interpretation and evaluation of results using Sensitivity Analysis. The 'MLwrap' workflow is organized into four core steps: preprocessing(), build_model(), fine_tuning(), and sensitivity_analysis(). These steps correspond, respectively, to data preparation and transformation, model construction, hyperparameter optimization, and sensitivity analysis. The user can access comprehensive model evaluation results including fit assessment metrics, plots, predictions, and performance diagnostics for ML models implemented through 'Neural Networks', 'Random Forest', 'XGBoost' (Extreme Gradient Boosting), and 'Support Vector Machines' (SVM) algorithms. By streamlining these phases, 'MLwrap' aims to simplify the implementation of ML techniques, allowing analysts and data scientists to focus on extracting actionable insights and meaningful patterns from large datasets, in line with the objectives of the KDD process.
License: GPL-3
Encoding: UTF-8
RoxygenNote: 7.3.3
Depends: R (≥ 4.1.0)
Imports: R6, tidyr, magrittr, dials, parsnip, recipes, rsample, tune, workflows, yardstick, vip, glue, innsight, fastshap, DiagrammeR, ggbeeswarm, ggplot2, sensitivity, dplyr, rlang, tibble, patchwork, cli, scales
Suggests: testthat (≥ 3.0.0), torch, brulee, ranger, kernlab, xgboost
Config/testthat/edition: 3
URL: https://github.com/AlbertSesePsy/MLwrap
BugReports: https://github.com/AlbertSesePsy/MLwrap/issues
LazyData: true
NeedsCompilation: no
Packaged: 2025-11-05 12:11:17 UTC; uib
Author: Javier Martínez García [aut], Juan José Montaño Moreno [ctb], Albert Sesé [cre, ctb]
Maintainer: Albert Sesé <albert.sese@uib.es>
Repository: CRAN
Date/Publication: 2025-11-05 12:30:02 UTC

Pipe operator

Description

See magrittr::%>% for details.

Usage

lhs %>% rhs

Arguments

lhs

A value or the magrittr placeholder.

rhs

A function call using the magrittr semantics.

Value

The result of calling rhs(lhs).
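
Examples

# A minimal sketch of piping data into the MLwrap workflow (assumes the
# sim_data dataset and the preprocessing() function documented in this manual).

library(MLwrap)

data(sim_data)

wrap_object <- sim_data %>%
     preprocessing(
       formula = psych_well ~ depression + life_sat,
       task = "regression"
       )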


Create ML Model

Description

The function build_model() is designed to construct and attach an ML model to an existing analysis object, which contains the preprocessed dataset generated in the previous step using the preprocessing() function. Based on the specified model type and optional hyperparameters, it supports several popular algorithms—including Neural Network, Random Forest, XGBoost, and SVM (James et al., 2021)—by initializing the corresponding hyperparameter class, updating the analysis object with these settings, and invoking the appropriate model creation function. For SVM models, it further distinguishes between kernel types (rbf, polynomial, linear) to ensure the correct implementation. The function also updates the analysis object with the model name, the fitted model, and the current processing stage before returning the enriched object, thereby streamlining the workflow for subsequent training, evaluation, or prediction steps. This modular approach facilitates flexible and reproducible ML pipelines by encapsulating both the model and its configuration within a single structured object.

Usage

build_model(analysis_object, model_name, hyperparameters = NULL)

Arguments

analysis_object

analysis_object created from preprocessing function.

model_name

Name of the ML Model. A string of the model name: "Neural Network", "Random Forest", "SVM" or "XGBOOST".

hyperparameters

Hyperparameters of the ML model. List containing the name of the hyperparameter and its value or range of values.

Value

An updated analysis_object containing the fitted machine learning model, the model name, the specified hyperparameters, and the current processing stage. This enriched object retains all previously stored information from the preprocessing step and incorporates the results of the model-building process, ensuring a coherent and reproducible workflow for subsequent training, evaluation, or prediction tasks.

Hyperparameters

Neural Network

Parsnip model using brulee engine. Hyperparameters:

Random Forest

Parsnip model using ranger engine. Hyperparameters:

XGBOOST

Parsnip model using xgboost engine. Hyperparameters:

SVM

Parsnip model using kernlab engine. Hyperparameters:

References

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: with Applications in R (2nd ed.). Springer. https://doi.org/10.1007/978-1-0716-1418-1

Examples

# Example 1: Random Forest for regression task

library(MLwrap)

data(sim_data) # sim_data is a simulated dataset with psychological variables

wrap_object <- preprocessing(
     df = sim_data,
     formula = psych_well ~ depression + emot_intel + resilience + life_sat,
     task = "regression"
     )

wrap_object <- build_model(
               analysis_object = wrap_object,
               model_name = "Random Forest",
               hyperparameters = list(
                                 mtry = 2,
                                 trees = 10
                                 )
                           )
# It is safe to reuse the same object name (e.g., wrap_object or any other name)
# from step to step, as all previous results and information are retained within
# the updated analysis object.

# Example 2: SVM for classification task

data(sim_data) # sim_data is a simulated dataset with psychological variables

wrap_object <- preprocessing(
         df = sim_data,
         formula = psych_well_bin ~ depression + emot_intel + resilience + life_sat,
         task = "classification"
         )

wrap_object <- build_model(
               analysis_object = wrap_object,
               model_name = "SVM",
               hyperparameters = list(
                                 type = "rbf",
                                 cost = 1,
                                 margin = 0.1,
                                 rbf_sigma = 0.05
                                 )
                           )

Fine Tune ML Model

Description

The fine_tuning() function performs automated hyperparameter optimization for ML workflows encapsulated within an AnalysisObject. It supports two tuning strategies: Bayesian Optimization (with cross-validation) and Grid Search Cross-Validation, allowing the user to specify evaluation metrics and whether to visualize tuning results. The function first validates arguments and updates the workflow and metric settings within the AnalysisObject. If hyperparameter tuning is enabled, it executes the selected tuning procedure, identifies the best hyperparameter configuration based on the specified metrics, and updates the workflow accordingly. For neural network models, it also manages the creation and integration of new model instances and provides additional visualization of training dynamics. Finally, the function fits the optimized model to the training data and updates the AnalysisObject, ensuring a reproducible and efficient model selection process (Bartz et al., 2023).

Usage

fine_tuning(analysis_object, tuner, metrics = NULL)

Arguments

analysis_object

analysis_object created from build_model function.

tuner

Name of the Hyperparameter Tuner. A string of the tuner name: "Bayesian Optimization" or "Grid Search CV".

metrics

Metric used for model selection. A string with the name of the metric (see Metrics). Defaults to "rmse" (regression) or "roc_auc" (classification).

Value

An updated analysis_object containing the fitted model with optimized hyperparameters, the tuning results, and all relevant workflow modifications. This object includes the final trained model, the best hyperparameter configuration, tuning diagnostics, and, if applicable, plots of the tuning process. It can be used for further model evaluation, prediction, or downstream analysis within the package workflow.

Tuners

Bayesian Optimization (with cross-validation)

Grid Search CV

Metrics

Regression Metrics

Classification Metrics

References

Bartz, E., Bartz-Beielstein, T., Zaefferer, M., & Mersmann, O. (2023). Hyperparameter Tuning for Machine and Deep Learning with R: A Practical Guide. Springer. doi:10.1007/978-981-19-5170-1

Examples

# Fine tuning function applied to a regression task using Random Forest

wrap_object <- preprocessing(
           df = sim_data[1:500 ,],
           formula = psych_well ~ depression + life_sat,
           task = "regression"
           )
wrap_object <- build_model(
               analysis_object = wrap_object,
               model_name = "Random Forest",
               hyperparameters = list(
                     mtry = 2,
                     trees = 3
                     )
                 )
set.seed(123) # For reproducibility
wrap_object <- fine_tuning(wrap_object,
                tuner = "Grid Search CV",
                metrics = c("rmse")
                )

Plotting Calibration Curve

Description

The plot_calibration_curve() function generates calibration plots for binary classification models by evaluating the agreement between predicted probabilities and observed class frequencies within binned prediction intervals. It implements reliability diagrams that compare the empirical success rate within each probability bin against the predicted probability level, identifying systematic calibration errors such as overconfidence (predicted probabilities exceeding observed frequencies) and underconfidence across the prediction range.

Usage

plot_calibration_curve(analysis_object)

Arguments

analysis_object

Fitted analysis_object with 'fine_tuning()'.

Value

analysis_object

Examples

# Note: To obtain the calibration curve plot, the user must run the MLwrap
# pipeline through the fine_tuning() function, and only with a binary
# outcome.

wrap_object <- preprocessing(df = sim_data[1:300 ,],
                             formula = psych_well_bin ~ depression + resilience,
                             task = "classification")
wrap_object <- build_model(wrap_object, "Random Forest",
                           hyperparameters = list(mtry = 2, trees = 5))
set.seed(123) # For reproducibility
wrap_object <- fine_tuning(wrap_object, "Grid Search CV")

# And then, you can obtain the calibration curve plot.

plot_calibration_curve(wrap_object)
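
# Conceptual sketch (not MLwrap internals): a calibration table can be built
# by binning toy predicted probabilities and comparing the mean predicted
# probability in each bin with the observed event rate.

library(dplyr)

toy <- data.frame(
  obs  = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0),
  prob = c(0.95, 0.80, 0.75, 0.60, 0.55, 0.40, 0.35, 0.20, 0.15, 0.05)
)

toy %>%
  mutate(bin = cut(prob, breaks = seq(0, 1, by = 0.25), include.lowest = TRUE)) %>%
  group_by(bin) %>%
  summarise(mean_predicted = mean(prob),  # predicted probability level
            observed_rate  = mean(obs),   # empirical success rate in the bin
            n = n())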

Plotting Confusion Matrix

Description

The plot_confusion_matrix() function generates confusion matrices from classification predictions displaying the contingency table of true class labels versus predicted class labels. Visualizes true positives, true negatives, false positives, and false negatives for both training and test sets, enabling computation of derived performance metrics (sensitivity, specificity, precision, F1-score) and identification of specific class pair misclassification patterns.

Usage

plot_confusion_matrix(analysis_object)

Arguments

analysis_object

Fitted analysis_object with 'fine_tuning()'.

Value

analysis_object

See Also

plot_calibration_curve

Examples

# Note: To obtain the confusion matrix plot, the user must run the MLwrap
# pipeline through the fine_tuning() function, and only with a categorical
# outcome.
# See the full pipeline example under plot_calibration_curve()
# Final call signature:
# plot_confusion_matrix(wrap_object)

Plotting Output Distribution By Class

Description

The plot_distribution_by_class() function visualizes kernel density estimates or histograms of predicted probability distributions stratified by true class labels. Enables assessment of class separability through probability overlap quantification and identification of prediction probability ranges where different classes exhibit substantial overlap, indicating classification ambiguity regions.

Usage

plot_distribution_by_class(analysis_object)

Arguments

analysis_object

Fitted analysis_object with 'fine_tuning()'.

Value

analysis_object

See Also

plot_calibration_curve

Examples

# Note: To obtain the distribution by class plot, the user must run the
# MLwrap pipeline through the fine_tuning() function, and only with a
# categorical outcome.
# See the full pipeline example under plot_calibration_curve()
# Final call signature:
# plot_distribution_by_class(wrap_object)

Plotting Gain Curve

Description

The plot_gain_curve() function plots cumulative gain as a function of the sorted population percentile when observations are ranked by descending predicted probability. For each percentile threshold, it calculates the ratio of the positive class proportion in the top-ranked subset to the overall positive class proportion, quantifying the model's efficiency in concentrating target cases at the top of the ranking.

Usage

plot_gain_curve(analysis_object)

Arguments

analysis_object

Fitted analysis_object with 'fine_tuning()'.

Value

analysis_object

See Also

plot_calibration_curve

Examples

# Note: To obtain the gain curve plot, the user must run the MLwrap pipeline
# through the fine_tuning() function, and only with a categorical outcome.
# See the full pipeline example under plot_calibration_curve()
# Final call signature:
# plot_gain_curve(wrap_object)
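
# Illustrative only (not MLwrap internals): the cumulative gain computation
# can be reproduced with yardstick on a toy table of true classes and
# predicted probabilities.

library(yardstick)
library(ggplot2)

toy <- data.frame(
  truth = factor(c("High", "High", "Low", "High", "Low", "Low"),
                 levels = c("High", "Low")),
  .pred_High = c(0.9, 0.7, 0.6, 0.4, 0.3, 0.1)
)

autoplot(gain_curve(toy, truth, .pred_High))  # cumulative gain vs % of cases tested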

Plot Neural Network Architecture

Description

Renders a directed acyclic graph representation of Neural Network architecture showing layer stacking order, layer-specific dimensions (neurons per layer), activation functions applied at each layer, and optimized hyperparameter values (learning rate, batch size, dropout rates, regularization coefficients) obtained from hyperparameter tuning procedures.

Usage

plot_graph_nn(analysis_object)

Arguments

analysis_object

Fitted analysis_object with 'fine_tuning()'.

Value

analysis_object

See Also

table_best_hyperparameters

Examples

# Note: To obtain the Neural Network architecture graph, the user must run
# the MLwrap pipeline through the fine_tuning() function.
# See the full pipeline example under table_best_hyperparameters()
# (Neural Network engine required)
# Final call signature:
# plot_graph_nn(wrap_object)

Plotting Integrated Gradients Plots

Description

The plot_integrated_gradients() function implements interpretability visualizations of integrated gradient attributions measuring feature importance through accumulated gradients along the interpolation path from baseline (zero vector) to observed input. Provides four visualization modalities: mean absolute attributions (bar plots), directional effects showing positive and negative contribution patterns (directional plots), distributional properties of attributions across instances (box plots), and individual-level attribution contributions (swarm plots).

Usage

plot_integrated_gradients(analysis_object, show_table = FALSE)

Arguments

analysis_object

Fitted analysis_object with 'sensitivity_analysis(methods = "Integrated Gradients")'.

show_table

Boolean. Whether to print Integrated Gradients summarized results table.

Value

analysis_object

See Also

sensitivity_analysis

Examples

# Note: To obtain the Integrated Gradients plots, the user must run the
# MLwrap pipeline through the sensitivity_analysis() function using the
# Integrated Gradients method.
# See the full pipeline example under sensitivity_analysis()
# (Requires sensitivity_analysis(methods = "Integrated Gradients"))
# Final call signature:
# plot_integrated_gradients(wrap_object)
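
# Conceptual sketch of the integrated gradients computation described above,
# for a toy differentiable model f(x) = 3*x1^2 + 2*x2 with a zero baseline
# (illustrative only, not MLwrap internals).

grad_f <- function(x) c(6 * x[1], 2)      # analytic gradient of f

integrated_gradients <- function(x, baseline = c(0, 0), steps = 50) {
  alphas <- seq(0, 1, length.out = steps)
  # average the gradient along the straight path from the baseline to x
  grads <- lapply(alphas, function(a) grad_f(baseline + a * (x - baseline)))
  avg_grad <- Reduce(`+`, grads) / steps
  (x - baseline) * avg_grad               # one attribution per input feature
}

integrated_gradients(c(1.5, -2))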

Plotting Lift Curve

Description

The plot_lift_curve() function plots the lift factor as a function of the population percentile when observations are ranked by descending predicted probability. The lift factor quantifies the model's ranking efficiency relative to a random-ordering baseline at each cumulative population segment, showing how much better model-based selection performs compared to random case selection.

Usage

plot_lift_curve(analysis_object)

Arguments

analysis_object

Fitted analysis_object with 'fine_tuning()'.

Value

analysis_object

See Also

plot_calibration_curve

Examples

# Note: To obtain the lift curve plot, the user must run the MLwrap pipeline
# through the fine_tuning() function, and only with a categorical outcome.
# See the full pipeline example under plot_calibration_curve()
# Final call signature:
# plot_lift_curve(wrap_object)

Plot Neural Network Loss Curve

Description

Displays the loss trajectory computed on the validation set across training epochs. Enables visual diagnosis of convergence dynamics, identification of appropriate early stopping points, detection of overfitting patterns (where validation loss increases while training loss decreases), and assessment of optimization stability throughout the training process.

Usage

plot_loss_curve(analysis_object)

Arguments

analysis_object

Fitted analysis_object with 'fine_tuning()'.

Value

analysis_object

See Also

table_best_hyperparameters

Examples

# Note: To obtain the loss curve plot, the user must run the MLwrap pipeline
# through the fine_tuning() function.
# See the full pipeline example under table_best_hyperparameters()
# (Neural Network engine required)
# Final call signature:
# plot_loss_curve(wrap_object)

Plotting Olden Values Barplot

Description

The plot_olden() function visualizes Olden sensitivity values computed from products of input-to-hidden layer connection weights and hidden-to-output layer connection weights for each feature. Provides relative feature importance rankings specific to feedforward Neural Networks based on synaptic weight magnitude and directionality analysis across network layers.

Usage

plot_olden(analysis_object, show_table = FALSE)

Arguments

analysis_object

Fitted analysis_object with 'sensitivity_analysis(methods = "Olden")'.

show_table

Boolean. Whether to print Olden results table.

Value

analysis_object

See Also

sensitivity_analysis

Examples

# Note: To obtain the Olden plot, the user must run the MLwrap pipeline
# through the sensitivity_analysis() function using the Olden method.
# See the full pipeline example under sensitivity_analysis()
# (Requires sensitivity_analysis(methods = "Olden"))
# Final call signature:
# plot_olden(wrap_object)
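
# Conceptual sketch of the Olden computation described above (toy weights,
# not MLwrap internals): input-to-hidden weights multiplied by hidden-to-output
# weights and summed over hidden units, giving one signed value per input.

W_ih <- matrix(c( 0.8, -0.4,
                  0.2,  0.6,
                 -0.5,  0.3),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("x1", "x2", "x3"), c("h1", "h2")))
w_ho <- c(h1 = 0.7, h2 = -0.9)

olden_importance <- W_ih %*% w_ho
olden_importance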

Plotting Permutation Feature Importance Barplot

Description

The plot_pfi() function generates feature importance estimates via Permutation Feature Importance measuring performance degradation when each feature's values are randomly permuted while holding all other features constant. Provides model-agnostic importance ranking independent of feature-target correlation patterns, capturing both linear and non-linear predictive contributions to model performance.

Usage

plot_pfi(analysis_object, show_table = FALSE)

Arguments

analysis_object

Fitted analysis_object with 'sensitivity_analysis(methods = "PFI")'.

show_table

Boolean. Whether to print PFI results table.

Value

analysis_object

See Also

sensitivity_analysis

Examples

# Note: To obtain the PFI plots, the user must run the MLwrap pipeline
# through the sensitivity_analysis() function using the PFI method.
# See the full pipeline example under sensitivity_analysis()
# (Requires sensitivity_analysis(methods = "PFI"))
# Final call signature:
# plot_pfi(wrap_object)
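
# Conceptual sketch (not MLwrap internals): permutation importance for a
# single feature, approximated by permuting that column and measuring the
# increase in RMSE of a plain lm() fit used as a stand-in for any learner.

library(MLwrap)
data(sim_data)

fit  <- lm(psych_well ~ depression + life_sat, data = sim_data)
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
baseline <- rmse(sim_data$psych_well, predict(fit, sim_data))

set.seed(123)
permuted <- sim_data
permuted$depression <- sample(permuted$depression)  # break the feature-target link
rmse(sim_data$psych_well, predict(fit, permuted)) - baseline  # PFI for depression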

Plotting Precision-Recall Curve

Description

The plot_pr_curve() function generates a Precision-Recall curve tracing the relationship between precision and recall across all classification probability thresholds. It is particularly informative for imbalanced datasets, where ROC curves may be misleading, since PR curves remain sensitive to class distribution changes and provide an intuitive performance assessment when one class is substantially rarer than the other.

Usage

plot_pr_curve(analysis_object)

Arguments

analysis_object

Fitted analysis_object with 'fine_tuning()'.

Value

analysis_object

See Also

plot_calibration_curve

Examples

# See the full pipeline example under plot_calibration_curve()
# Final call signature:
# plot_pr_curve(wrap_object)

Plotting Residuals Distribution

Description

The plot_residuals_distribution() function generates histogram and kernel density visualizations of residuals for regression models on training and test datasets. Enables assessment of residual normality through visual inspection of histogram shape, detection of systematic biases indicating omitted variables or model specification errors, and identification of heavy tails suggesting outliers or influential observations.

Usage

plot_residuals_distribution(analysis_object)

Arguments

analysis_object

Fitted analysis_object with 'fine_tuning()'.

Value

analysis_object

See Also

table_best_hyperparameters

Examples

# Note: To obtain the residuals distribution plot, the user must run the
# MLwrap pipeline through the fine_tuning() function.
# See the full pipeline example under table_best_hyperparameters()
# Final call signature:
# plot_residuals_distribution(wrap_object)

Plotting ROC Curve

Description

The plot_roc_curve() function plots the Receiver Operating Characteristic (ROC) curve, displaying the true positive rate versus the false positive rate across all classification probability thresholds. It computes the Area Under the Curve (AUC) as an aggregate discrimination performance metric independent of threshold selection, providing a comprehensive assessment of the classifier's discrimination ability across the entire decision boundary range.

Usage

plot_roc_curve(analysis_object)

Arguments

analysis_object

Fitted analysis_object with 'fine_tuning()'.

Value

analysis_object

See Also

plot_calibration_curve

Examples

# Note: To obtain the ROC curve plot, the user must run the MLwrap pipeline
# through the fine_tuning() function, and only with a categorical outcome.
# See the full pipeline example under plot_calibration_curve()
# Final call signature:
# plot_roc_curve(wrap_object)
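
# Illustrative only (not MLwrap internals): the underlying ROC computation
# can be reproduced with yardstick on a toy table of true classes and
# predicted probabilities.

library(yardstick)
library(ggplot2)

toy <- data.frame(
  truth = factor(c("High", "High", "Low", "High", "Low", "Low"),
                 levels = c("High", "Low")),
  .pred_High = c(0.9, 0.7, 0.6, 0.4, 0.3, 0.1)
)

roc_auc(toy, truth, .pred_High)               # area under the ROC curve
autoplot(roc_curve(toy, truth, .pred_High))   # TPR vs FPR across thresholds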

Plotting Observed vs Predictions

Description

The plot_scatter_predictions() function generates scatter plots with 45-degree reference lines comparing observed values (vertical axis) against model predictions (horizontal axis) for training and test data. Enables visual assessment of prediction accuracy through distance from the reference line, identification of systematic bias patterns, detection of heteroscedastic prediction errors, and quantification of generalization performance gaps between training and test sets.

Usage

plot_scatter_predictions(analysis_object)

Arguments

analysis_object

Fitted analysis_object with 'fine_tuning()'.

Value

analysis_object

See Also

table_best_hyperparameters

Examples

# Note: To obtain the observed vs. predicted values plot, the user must run
# the MLwrap pipeline through the fine_tuning() function.
# See the full pipeline example under table_best_hyperparameters()
# Final call signature:
# plot_scatter_predictions(wrap_object)

Plotting Residuals vs Predictions

Description

The plot_scatter_residuals() function visualizes residuals plotted against fitted values to detect violations of standard regression assumptions, including homoscedasticity (constant error variance), linearity, and independence. It identifies heteroscedastic patterns (non-constant variance across the predictor range), systematic curvature indicating omitted polynomial terms, and outlier points with extreme residual magnitudes.

Usage

plot_scatter_residuals(analysis_object)

Arguments

analysis_object

Fitted analysis_object with 'fine_tuning()'.

Value

analysis_object

See Also

table_best_hyperparameters

Examples

# Note: To obtain the residuals vs. predicted values plot, the user must run
# the MLwrap pipeline through the fine_tuning() function.
# See the full pipeline example under table_best_hyperparameters()
# Final call signature:
# plot_scatter_residuals(wrap_object)

Plotting SHAP Plots

Description

The plot_shap() function implements comprehensive SHAP (SHapley Additive exPlanations) value visualizations where SHAP values represent each feature's marginal contribution to model output based on cooperative game theory principles. Provides four visualization modalities: bar plots of mean absolute SHAP values ranking features by average impact magnitude, directional plots showing feature-value correlation with SHAP magnitude and sign, box plots illustrating SHAP value distributions across instances, and swarm plots combining individual prediction contributions with distributional information.

Usage

plot_shap(analysis_object, show_table = FALSE)

Arguments

analysis_object

Fitted analysis_object with 'sensitivity_analysis(methods = "SHAP")'.

show_table

Boolean. Whether to print SHAP summarized results table.

Value

analysis_object

See Also

sensitivity_analysis

Examples

# Note: To obtain the SHAP plots, the user must run the MLwrap pipeline
# through the sensitivity_analysis() function using the SHAP method.
# See the full pipeline example under sensitivity_analysis()
# (Requires sensitivity_analysis(methods = "SHAP"))
# Final call signature:
# plot_shap(wrap_object)

Plotting Sobol-Jansen Values Barplot

Description

The plot_sobol_jansen() function displays first-order and total-order Sobol indices decomposing total output variance into contributions from individual features and higher-order interaction terms. Implements variance-based global sensitivity analysis providing comprehensive understanding of feature contributions to output uncertainty, with application restricted to continuous predictor variables.

Usage

plot_sobol_jansen(analysis_object, show_table = FALSE)

Arguments

analysis_object

Fitted analysis_object with 'sensitivity_analysis(methods = "Sobol_Jansen")'.

show_table

Boolean. Whether to print Sobol-Jansen results table.

Value

analysis_object

See Also

sensitivity_analysis

Examples

# Note: To obtain the Sobol_Jansen plot, the user must run the MLwrap
# pipeline through the sensitivity_analysis() function using the Sobol_Jansen
# method.
# See the full pipeline example under sensitivity_analysis()
# (Requires sensitivity_analysis(methods = "Sobol_Jansen"))
# Final call signature:
# plot_sobol_jansen(wrap_object)

Plotting Tuner Search Results

Description

The plot_tuning_results() function visualizes hyperparameter optimization search results, adapting the output format to the optimization methodology employed. For Bayesian Optimization, it displays the iteration-by-iteration evolution of the loss function, the acquisition function values guiding sequential hyperparameter sampling, and the final hyperparameter configuration with cross-validation performance metrics. For Grid Search, it displays performance surfaces across hyperparameter dimensions and configurations rank-ordered by validation performance.

Usage

plot_tuning_results(analysis_object)

Arguments

analysis_object

Fitted analysis_object with 'fine_tuning()'.

Value

analysis_object

See Also

table_best_hyperparameters

Examples

# Note: To obtain the tuning results plot, the user must run the MLwrap
# pipeline through the fine_tuning() function.
# See the full pipeline example under table_best_hyperparameters()
# Final call signature:
# plot_tuning_results(wrap_object)

Preprocessing Data Matrix

Description

The preprocessing() function streamlines data preparation for regression and classification tasks by integrating variable selection, type conversion, normalization, and categorical encoding into a single workflow. It takes a data frame and a formula, applies user-specified transformations to numeric and categorical variables using the recipes package, and ensures the outcome variable is properly formatted. The function returns an AnalysisObject containing both the processed data and the transformation pipeline, supporting reproducible and efficient modeling (Kuhn & Wickham, 2020).

Usage

preprocessing(
  df,
  formula,
  task = "regression",
  num_vars = NULL,
  cat_vars = NULL,
  norm_num_vars = "all",
  encode_cat_vars = "all",
  y_levels = NULL
)

Arguments

df

Input DataFrame. Either a data.frame or tibble.

formula

Modelling formula. Either a character string or a formula object.

task

Modelling Task. Either "regression" or "classification".

num_vars

Optional vector of names of the numerical features.

cat_vars

Optional vector of names of the categorical features.

norm_num_vars

Normalize numeric features as z-scores. Either vector of names of numerical features to be normalized or "all" (default).

encode_cat_vars

One Hot Encode Categorical Features. Either vector of names of categorical features to be encoded or "all" (default).

y_levels

Optional ordered vector with names of the target variable levels (Classification task only).

Value

The object returned by the preprocessing function encapsulates a dataset specifically prepared for ML analysis. This object contains the preprocessed data—where variables have been selected, standardized, encoded, and formatted according to the requirements of the chosen modeling task (regression or classification) —as well as a recipes::recipe object that documents all preprocessing steps applied. By automating essential transformations such as normalization, one-hot encoding of categorical variables, and the handling of missing values, the function ensures the data is optimally structured for input into machine learning algorithms. This comprehensive preprocessing not only exposes the underlying structure of the data and reduces the risk of errors, but also provides a robust foundation for subsequent modeling, validation, and interpretation within the machine learning workflow (Kuhn & Johnson, 2019).

References

Kuhn, M., & Johnson, K. (2019). Feature Engineering and Selection: A Practical Approach for Predictive Models. Chapman and Hall/CRC. doi:10.1201/9781315108230

Kuhn, M., & Wickham, H. (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org.

Examples

# Example 1: Dataset with preformatted categorical variables
# In this case, internal options for variable types are not needed since
# categorical features are already formatted as factors.

library(MLwrap)

data(sim_data) # sim_data is a simulated dataset with psychological variables

wrap_object <- preprocessing(
          df = sim_data,
          formula = psych_well ~ depression + emot_intel + resilience + life_sat + gender,
          task = "regression"
         )

# Example 2: Dataset where neither the outcome nor the categorical features
# are formatted as factors, so the categorical variables are specified
# explicitly via cat_vars and converted during preprocessing

wrap_object <- preprocessing(
           df = sim_data,
           formula = psych_well_bin ~ gender + depression + age + life_sat,
           task = "classification",
           cat_vars = c("gender")
         )
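
# Example 3 (sketch): normalizing and encoding only selected variables via
# the documented norm_num_vars and encode_cat_vars arguments

wrap_object <- preprocessing(
           df = sim_data,
           formula = psych_well ~ gender + depression + life_sat,
           task = "regression",
           norm_num_vars = c("depression"),
           encode_cat_vars = c("gender")
         )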

Perform Sensitivity Analysis and Interpretable ML methods

Description

As the final step in the MLwrap package workflow, this function performs Sensitivity Analysis (SA) on a fitted ML model stored in an analysis_object (in the examples, wrap_object). It evaluates the importance of features using various methods such as Permutation Feature Importance (PFI), SHAP (SHapley Additive exPlanations), Integrated Gradients, Olden sensitivity analysis, and Sobol indices. The function generates numerical results and visualizations (e.g., bar plots, box plots, beeswarm plots) to help interpret the impact of each feature on the model's predictions for both regression and classification tasks, providing critical insights after model training and evaluation.

Following the steps of data preprocessing, model fitting, and performance assessment in the MLwrap pipeline, sensitivity_analysis() processes the training and test data using the preprocessing recipe stored in the analysis_object, applies the specified SA methods, and stores the results within the analysis_object. It supports different metrics for evaluation and handles multi-class classification by producing class-specific analyses and plots, ensuring a comprehensive understanding of model behavior (Iooss & Lemaître, 2015).

Usage

sensitivity_analysis(analysis_object, methods = c("PFI"), metric = NULL)

Arguments

analysis_object

analysis_object created from fine_tuning function.

methods

Method to be used. A string of the method name: "PFI" (Permutation Feature Importance), "SHAP" (SHapley Additive exPlanations), "Integrated Gradients" (Neural Network only), "Olden" (Neural Networks only), "Sobol_Jansen" (only when all input features are continuous).

metric

Metric used for "PFI" method (Permutation Feature Importance). A string of the name of metric (see Metrics).

Details

As the concluding phase of the MLwrap workflow—after data preparation, model training, and evaluation—this function interprets models by quantifying and visualizing feature importance. It validates input with check_args_sensitivity_analysis(), preprocesses data using the recipe stored in analysis_object$transformer, then calculates feature importance via the specified methods (see the methods argument).

For classification tasks with more than two outcome levels, the function generates separate results and plots for each class. Visualizations include bar plots for importance metrics, box plots for distribution of values, and beeswarm plots for detailed feature impact across observations. All results are stored in the analysis_object under the sensitivity_analysis slot, finalizing the MLwrap pipeline with a deep understanding of model drivers.

Value

An updated analysis_object containing sensitivity analysis results. Results are stored in the sensitivity_analysis slot as a list, with each method's results accessible by name. Generates bar, box, and beeswarm plots for feature importance visualization, completing the workflow with actionable insights.

References

Iooss, B., & Lemaître, P. (2015). A review on global sensitivity analysis methods. In: G. Dellino & C. Meloni (Eds.), Uncertainty Management in Simulation-Optimization of Complex Systems. Operations Research/Computer Science Interfaces Series (vol. 59). Springer, Boston, MA. doi:10.1007/978-1-4899-7547-8_5

Jansen, M. J. W. (1999). Analysis of variance designs for model output. Computer Physics Communications, 117(1-2), 35–43. doi:10.1016/S0010-4655(98)00154-4

Examples

# Example: Using PFI

wrap_object <- preprocessing(
       df = sim_data,
       formula = psych_well ~ depression + life_sat,
       task = "regression"
       )
wrap_object <- build_model(
               analysis_object = wrap_object,
               model_name = "Random Forest",
               hyperparameters = list(
                                 mtry = 2,
                                 trees = 3
                                 )
                           )
set.seed(123) # For reproducibility
wrap_object <- fine_tuning(wrap_object,
                tuner = "Grid Search CV",
                metrics = c("rmse")
                )
wrap_object <- sensitivity_analysis(wrap_object, methods = "PFI")

# Extracting Results

table_pfi <- table_pfi_results(wrap_object)


sim_data

Description

This dataset, included in the MLwrap package, is a simulated dataset (Martínez-García et al., 2025) designed to capture relationships among psychological and demographic variables influencing psychological wellbeing, the primary outcome variable. It comprises data for 1,000 individuals.

Usage

data(sim_data)

Format

A data frame with 1,000 rows and 10 columns:

psych_well

Psychological Wellbeing Indicator. Continuous (0, 100)

psych_well_bin

Psychological Wellbeing Binary Indicator. Factor with levels "Low", "High"

psych_well_pol

Psychological Wellbeing Polytomous Indicator. Factor with levels "Low", "Somewhat", "Quite a bit", "Very Much"

gender

Patient Gender. Factor with levels "Female", "Male"

age

Patient Age. Continuous (18, 85)

socioec_status

Socioeconomic Status Indicator. Factor with levels "Low", "Medium", "High"

emot_intel

Emotional Intelligence Indicator. Continuous (24, 120)

resilience

Resilience Indicator. Continuous (4, 20)

depression

Depression Indicator. Continuous (0, 63)

life_sat

Life Satisfaction Indicator. Continuous (5, 35)

Details

The predictor variables include gender (50.7% female), age (range: 18-85 years, mean = 51.63, median = 52, SD = 17.11), and socioeconomic status, categorized as Low (n = 343), Medium (n = 347), and High (n = 310). Additional predictors (features) are emotional intelligence (range: 24-120, mean = 71.97, median = 71, SD = 23.79), resilience (range: 4-20, mean = 11.93, median = 12, SD = 4.46), life satisfaction (range: 5-35, mean = 20.09, median = 20, SD = 7.42), and depression (range: 0-63, mean = 31.45, median = 32, SD = 14.85). The primary outcome variable is emotional wellbeing, measured on a scale from 0 to 100 (mean = 50.22, median = 49, SD = 24.45).

The dataset incorporates correlations as conditions for the simulation. Psychological wellbeing is positively correlated with emotional intelligence (r = 0.50), resilience (r = 0.40), and life satisfaction (r = 0.60), indicating that higher levels of these factors are associated with better emotional health outcomes. Conversely, a strong negative correlation exists between depression and psychological wellbeing (r = -0.80), suggesting that higher depression scores are linked to lower emotional wellbeing. Age shows a slight positive correlation with emotional wellbeing (r = 0.15), reflecting the expectation that older individuals might experience greater emotional stability. Gender and socioeconomic status are included as potential predictors, but the simulation assumes no statistically significant differences in psychological wellbeing across these categories.

Additionally, the dataset includes categorical transformations of psychological wellbeing into binary and polytomous formats: a binary version ("Low" = 477, "High" = 523) and a polytomous version with four levels: "Low" (n = 161), "Somewhat" (n = 351), "Quite a bit" (n = 330), and "Very much" (n = 158). The polytomous transformation uses the 25th, 50th, and 75th percentiles as thresholds for categorizing psychological wellbeing scores. These transformations enable analyses using machine learning models for regression (continuous outcome) and classification (binary or polytomous outcomes) tasks.
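
A conceptual sketch of this percentile-based categorization (illustrative only; the published sim_data variables were generated during the simulation itself):

library(MLwrap)
data(sim_data)

breaks <- quantile(sim_data$psych_well, probs = c(0, 0.25, 0.50, 0.75, 1))
table(cut(sim_data$psych_well, breaks = breaks, include.lowest = TRUE,
          labels = c("Low", "Somewhat", "Quite a bit", "Very much")))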

Test Performance Exceeding Training Performance

If machine learning models, including SVMs, show better evaluation metrics on the test set than on the training set, this anomaly usually signals methodological issues rather than genuine model quality. Typical causes reported in the literature include sampling variability from a single random train-test split, especially with small samples (An et al., 2021; Vabalas et al., 2019); data leakage between partitions (Kapoor & Narayanan, 2023); and chance distributional differences that make the test partition easier to predict than the training partition (Hastie et al., 2017).

MLwrap implementation: MLwrap's hyperparameter optimization (via Bayesian Optimization or Grid Search CV) implements 5-fold cross-validation during the tuning process, which provides more robust parameter selection than single train-test splits. Users should examine evaluation metrics across both training and test sets, and review diagnostic plots (residuals, predictions) to identify potential distribution differences between partitions. When working with small datasets where partition variability may be substantial, running the complete workflow with different random seeds can help assess the stability of results and conclusions. The sim_data dataset included in MLwrap is a simulated matrix provided for demonstration purposes only. As synthetic data, it may occasionally exhibit some of these anomalous phenomena (e.g., better test than training performance) due to artificial patterns in the data generation process. Users working with real-world data should always verify results through careful examination of evaluation metrics and diagnostic plots across multiple runs.

References

An, C., Park, Y. W., Ahn, S. S., Han, K., Kim, H., & Lee, S. K. (2021). Radiomics machine learning study with a small sample size: Single random training-test set split may lead to unreliable results. PLOS ONE, 16(8), e0256152. doi:10.1371/journal.pone.0256152

Hastie, T., Tibshirani, R., & Friedman, J. (2017). The elements of statistical learning: Data mining, inference, and prediction (2nd ed., corrected 12th printing, Chapter 7). Springer. doi:10.1007/978-0-387-84858-7

Kapoor, S., & Narayanan, A. (2023). Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 4(9), 100804. doi:10.1016/j.patter.2023.100804

Martínez-García, J., Montaño, J. J., Jiménez, R., Gervilla, E., Cajal, B., Núñez, A., Leguizamo, F., & Sesé, A. (2025). Decoding Artificial Intelligence: A Tutorial on Neural Networks in Behavioral Research. Clinical and Health, 36(2), 77-95. doi:10.5093/clh2025a13

Vabalas, A., Gowen, E., Poliakoff, E., & Casson, A. J. (2019). Machine learning algorithm validation with a limited sample size. PLOS ONE, 14(11), e0224365. doi:10.1371/journal.pone.0224365


Best Hyperparameters Configuration

Description

The table_best_hyperparameters() function extracts and presents the optimal hyperparameter configuration identified during the model fine-tuning process. This function validates that the model has been properly trained and that hyperparameter tuning has been performed, combining both constant and optimized hyperparameters to generate a comprehensive table with the configuration that maximizes performance according to the specified primary metric. The function includes optional interactive visualization capabilities through the show_table parameter.

Usage

table_best_hyperparameters(analysis_object, show_table = FALSE)

Arguments

analysis_object

Fitted analysis_object with 'fine_tuning()'.

show_table

Boolean. Whether to print the table.

Value

Tibble with best hyperparameter configuration.

Examples

# Note: To obtain the best hyperparameters table, the user must run the
# MLwrap pipeline through the fine_tuning() function.

set.seed(123) # For reproducibility
wrap_object <- preprocessing(df = sim_data[1:300 ,],
                             formula = psych_well ~ depression + resilience,
                             task = "regression")
wrap_object <- build_model(wrap_object, "Random Forest",
                           hyperparameters = list(mtry = 2, trees = 3))
wrap_object <- fine_tuning(wrap_object, "Grid Search CV")

# And then, you can obtain the best hyperparameters table.

table_best_hyp <- table_best_hyperparameters(wrap_object)

Evaluation Results

Description

The table_evaluation_results() function provides access to trained model evaluation metrics, automatically adapting to the type of problem being analyzed. For binary classification problems, it returns a unified table with performance metrics, while for multiclass classification it generates separate tables for training and test data, enabling comparative performance evaluation and detection of potential overfitting.

Usage

table_evaluation_results(analysis_object, show_table = FALSE)

Arguments

analysis_object

Fitted analysis_object with 'fine_tuning()'.

show_table

Boolean. Whether to print the table.

Value

Tibble or list of tibbles (multiclass classification) with evaluation results.

See Also

table_best_hyperparameters

Examples

# Note: To obtain the evaluation results table, the user must run the
# MLwrap pipeline through the fine_tuning() function.
# See the full pipeline example under table_best_hyperparameters()
# Final call signature:
# table_evaluation_results(wrap_object)

Integrated Gradients Summarized Results Table

Description

The table_integrated_gradients_results() function implements a summarized metrics scheme for Integrated Gradients values. This methodology, specifically designed for neural networks, calculates feature importance through gradient integration along paths from baseline to input. Three different summary metrics are computed for each feature.

Usage

table_integrated_gradients_results(analysis_object, show_table = FALSE)

Arguments

analysis_object

Fitted analysis_object with 'sensitivity_analysis(methods = "Integrated Gradients")'.

show_table

Boolean. Whether to print the table.

Value

Tibble or list of tibbles (multiclass classification) with Integrated Gradient summarized results.

See Also

sensitivity_analysis

Examples

# Note: To obtain the table with Integrated Gradients results, the user must
# run the MLwrap pipeline through the sensitivity_analysis() function using
# the Integrated Gradients method.
# See the full pipeline example under sensitivity_analysis
# (Requires sensitivity_analysis(methods = "Integrated Gradients"))
# Final call signature:
# table_integrated_gradients_results(wrap_object)

Olden Results Table

Description

The table_olden_results() function extracts results from the Olden method, a technique specific to neural networks that calculates relative importance of input variables through analysis of connection weights between network layers. This method provides a measure of each variable's contribution based on the magnitude and direction of synaptic connections.

Usage

table_olden_results(analysis_object, show_table = FALSE)

Arguments

analysis_object

Fitted analysis_object with 'sensitivity_analysis(methods = "Olden")'.

show_table

Boolean. Whether to print the table.

Value

Tibble or list of tibbles (multiclass classification) with Olden results.

See Also

sensitivity_analysis

Examples

# Note: To obtain the table with Olden method results, the user must run the
# MLwrap pipeline through the sensitivity_analysis() function using the Olden
# method. Remember that the Olden method can only be used with a neural
# network model.
# See the full pipeline example under sensitivity_analysis
# (Requires sensitivity_analysis(methods = "Olden"))
# Final call signature:
# table_olden_results(wrap_object)

Permutation Feature Importance Results Table

Description

The table_pfi_results() function extracts Permutation Feature Importance results, a model-agnostic technique that evaluates variable importance through performance degradation when randomly permuting each feature's values.

Usage

table_pfi_results(analysis_object, show_table = FALSE)

Arguments

analysis_object

Fitted analysis_object with 'sensitivity_analysis(methods = "PFI")'.

show_table

Boolean. Whether to print the table.

Value

Tibble or list of tibbles (multiclass classification) with PFI results.

Examples

# Note: To obtain the table with PFI method results, the user must run the
# MLwrap pipeline through the sensitivity_analysis() function using the PFI
# method.

set.seed(123) # For reproducibility
wrap_object <- preprocessing(df = sim_data[1:300 ,],
                             formula = psych_well ~ depression + emot_intel,
                             task = "regression")
wrap_object <- build_model(wrap_object, "Random Forest",
                           hyperparameters = list(mtry = 2, trees = 3))
wrap_object <- fine_tuning(wrap_object, "Grid Search CV")
wrap_object <- sensitivity_analysis(wrap_object, methods = "PFI")

# And then, you can obtain the PFI results table.

table_pfi <- table_pfi_results(wrap_object)

SHAP Summarized Results Table

Description

The table_shap_results() function processes previously calculated SHAP (SHapley Additive exPlanations) values and generates summarized metrics including the mean absolute value, the standard deviation of the absolute values, and a directional sensitivity value calculated as the covariance between feature values and SHAP values divided by the variance of the feature values. This directional metric provides information about the nature of the relationship between each variable and the model predictions. These three metrics summarize the calculated SHAP values for each feature.

Usage

table_shap_results(analysis_object, show_table = FALSE)

Arguments

analysis_object

Fitted analysis_object with 'sensitivity_analysis(methods = "SHAP")'.

show_table

Boolean. Whether to print the table.

Value

Tibble or list of tibbles (multiclass classification) with SHAP summarized results.

See Also

sensitivity_analysis

Examples

# Note: To obtain the table with SHAP method results, the user must run the
# MLwrap pipeline through the sensitivity_analysis() function using the SHAP
# method.
# See the full pipeline example under sensitivity_analysis
# (Requires sensitivity_analysis(methods = "SHAP"))
# Final call signature:
# table_shap_results(wrap_object)
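
# Conceptual sketch of the directional sensitivity metric described above:
# the covariance between feature values and SHAP values divided by the
# variance of the feature values (toy vectors, not MLwrap internals).

feature_vals <- c(1.2, 0.5, -0.3, 2.1, -1.0)
shap_vals    <- c(0.30, 0.10, -0.05, 0.55, -0.25)

cov(feature_vals, shap_vals) / var(feature_vals)  # > 0: higher feature values push predictions up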

Sobol-Jansen Results Table

Description

The table_sobol_jansen_results() function processes results from Sobol-Jansen global sensitivity analysis, a variance decomposition-based methodology that quantifies each variable's contribution and their interactions to the total variability of model predictions. This technique is particularly valuable for identifying higher-order effects and complex interactions between variables.

Usage

table_sobol_jansen_results(analysis_object, show_table = FALSE)

Arguments

analysis_object

Fitted analysis_object with 'sensitivity_analysis(methods = "Sobol_Jansen")'.

show_table

Boolean. Whether to print the table.

Value

Tibble or list of tibbles (multiclass classification) with Sobol-Jansen results.

See Also

sensitivity_analysis

Examples

# Note: To obtain the table with Sobol_Jansen method results, the user must
# run the MLwrap pipeline through the sensitivity_analysis() function using
# the Sobol_Jansen method. The Sobol_Jansen method only works when all input
# features are continuous.
# See the full pipeline example under sensitivity_analysis
# (Requires sensitivity_analysis(methods = "Sobol_Jansen"))
# Final call signature:
# table_sobol_jansen_results(wrap_object)

MLwrap Comprehensive Tutorial

Description

A comprehensive tutorial demonstrating the complete MLwrap workflow is available. The tutorial provides detailed guidance on data preprocessing, model building, hyperparameter tuning, model evaluation, and sensitivity analysis across all supported machine learning algorithms (Neural Networks, Random Forests, SVM, and XGBoost) within the Knowledge Discovery in Databases (KDD) framework.

Usage

MLwrap_tutorial()

Details

Citation: Jiménez, R., Martínez-García, J., Montaño, J. J., & Sesé, A. (2025). MLwrap: Simplifying Machine Learning workflows in R. PsyarXiv. doi:10.31234/osf.io/j6m4z_v1

Value

Character string with the PsyArXiv preprint URL

Preprint

Available at doi:10.31234/osf.io/j6m4z_v1

Why consult the tutorial

While MLwrap provides a streamlined and user-friendly interface for implementing machine learning workflows, the underlying models represent sophisticated algorithms with substantial theoretical and computational complexity. The tutorial bridges this gap by explaining the rationale behind preprocessing decisions, hyperparameter choices, and interpretation of model outputs. Understanding these concepts ensures appropriate application of the methods, proper interpretation of results, and awareness of potential limitations in specific contexts.

The tutorial demonstrates practical applications through complete workflows, helping users navigate the balance between methodological rigor and implementation simplicity that MLwrap offers. This is particularly valuable for researchers transitioning from traditional statistical methods to machine learning approaches, or those seeking to ensure reproducible and theoretically sound applications in their work.

Users are strongly encouraged to consult the tutorial for detailed examples and best practices.

Tutorial for implementing ML with Python

This paper is also of interest to ML users, as it serves as a primer for estimating ML models with Python code, particularly in the context of Social, Health, and Behavioral research.

Martínez-García, J., Montaño, J. J., Jiménez, R., Gervilla, E., Cajal, B., Núñez, A., Leguizamo, F., & Sesé, A. (2025). Decoding Artificial Intelligence: A Tutorial on Neural Networks in Behavioral Research. Clinical and Health, 36(2), 77-95. doi:10.5093/clh2025a13

Examples

MLwrap_tutorial()