vtable Bonus Functions

Nick Huntington-Klein

2024-12-20

The vtable package outputs automatic variable documentation that can be easily viewed while you continue to work with your data.

vtable contains four main functions: vtable() (or vt()), sumtable() (or st()), labeltable(), and dftoHTML()/dftoLaTeX().

This vignette focuses on some bonus helper functions that vtable exports because they may be handy to you. They can save a little time, and can help you avoid writing an anonymous function when you need to pass a function as an argument.


Shortcut Helper Functions

vtable includes five shortcut functions. These are generally intended for use with the summ option in vtable and sumtable because nested functions don’t look very nice in a vtable, or in a sumtable unless you explicitly set the summ.names.

nuniq

nuniq(x) returns length(unique(x)), the number of unique values in the vector.
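As a quick illustration of the equivalence described above, the following sketch compares nuniq() with the nested base-R call. The vector x is made up for this example.

```r
library(vtable)

# A small vector with repeated values (made up for illustration)
x <- c(1, 1, 2, 3, 3, 3)

# The shortcut and the nested version should agree
nuniq(x)
## [1] 3
length(unique(x))
## [1] 3
```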

countNA, propNA, and notNA

These three functions are shortcuts for dealing with missing data. You have probably written out the nested versions of these many times!

Function    Short for
countNA()   sum(is.na())
propNA()    mean(is.na())
notNA()     sum(!is.na())

Note that notNA() also has some additional formatting options, which you would probably ignore if using it interactively.
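A minimal sketch of the three shortcuts next to their nested equivalents, using a made-up vector with two missing values. The formatting options of notNA() are left at their defaults here.

```r
library(vtable)

# Five values, two of them missing (made up for illustration)
x <- c(4, NA, 7, NA, 10)

countNA(x)  # same as sum(is.na(x)): 2 missing values
propNA(x)   # same as mean(is.na(x)): a share of 0.4 missing
notNA(x)    # same as sum(!is.na(x)): 3 non-missing values
```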

is.round

This function is a shortcut for !any(!(x == round(x,digits))).

It takes two arguments: a vector x and a number of digits (0 by default). It checks whether you can round to digits digits without losing any information.
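The sketch below applies is.round() to a few made-up vectors. The results in the comments follow from the definition given above.

```r
library(vtable)

# Whole numbers lose nothing when rounded to 0 digits
is.round(c(1, 2, 3))                # TRUE

# 1.5 is changed by rounding to 0 digits...
is.round(c(1.5, 2))                 # FALSE

# ...but not by rounding to 1 digit
is.round(c(1.5, 2), digits = 1)     # TRUE

# 0.25 needs two digits, so one digit loses information
is.round(c(0.25, 0.5), digits = 1)  # FALSE
```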


Other Helper Functions

formatfunc

formatfunc() is a function that returns a function, which itself helps format numbers using the format() function, in the same spirit as the label_ functions in the scales package. It is largely used for the numformat argument of sumtable().

formatfunc() for the most part takes the same arguments as format(), and so help(format) can be a guide for using it. However, there are some differences.

Some defaults are changed. By default, scientific = FALSE, trim = TRUE.

There are four new arguments as well. percent = TRUE will format the number as a percentage by multiplying it by 100 and adding a % at the end. You can instead set percent to some number, and that number, rather than 1, will be treated as 100%. So percent = 100, for example, will just add a % at the end without doing any multiplying.

prefix and suffix will, naturally, add prefixes or suffixes to the formatted number. So prefix = '$', suffix = 'M', for example, will produce a function that will turn 3 into $3M. scale will multiply the number by scale before formatting it. So prefix = '$', suffix = 'M', scale = 1/1000000 will turn 3000000 into $3M.

library(vtable)
my_formatter_func <- formatfunc(percent = TRUE, digits = 3, nsmall = 2, big.mark = ',')
my_formatter_func(523.2355987)
## [1] "52,323.56%"
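The prefix, suffix, scale, and numeric percent arguments described above can be tried the same way. This is a small sketch. The names in_millions and already_pct are made up for this example. Going by the description above, the first call should give "$3M" and the second should only append a % sign.

```r
library(vtable)

# Dollar amounts shown in millions: scale first, then add prefix and suffix
in_millions <- formatfunc(prefix = '$', suffix = 'M', scale = 1/1000000)
in_millions(3000000)

# Values already on a 0-100 scale: treat 100 as 100%, so nothing is multiplied
already_pct <- formatfunc(percent = 100)
already_pct(45)
```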

pctile

pctile(x) is short for quantile(x,1:100/100). So in one sense this is another shortcut function. But this inherently lets you interact with percentiles a bit differently.

While quantile() has you specify which percentile you want in the function call, pctile() returns an object with all integer percentiles, and you can pull out which ones you want afterwards. pctile(x)[50] is the 50th percentile, etc. This can be convenient in several applications, an obvious one being in sumtable.

library(vtable)
#Some random normal data, and its percentiles
d <- rnorm(1000)
pc <- pctile(d)

#25th, 50th, 75th percentile
pc[c(25,50,75)]
##          25%          50%          75% 
## -0.634294888 -0.009370405  0.703579880
#Inverse normal CDF with 100 points of articulation
plot(pc)
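To show how the shortcut functions fit into the summ option mentioned earlier, here is a sketch that asks sumtable() for a few percentiles alongside notNA() and nuniq(). The choice of variables and column labels is arbitrary, and out = 'return' is used so the table comes back as a data frame rather than being displayed.

```r
library(vtable)
data(iris)

# Each summ entry is a function of x written as a string;
# summ.names supplies matching column labels
tab <- sumtable(iris,
                vars = c('Sepal.Length', 'Sepal.Width'),
                summ = c('notNA(x)', 'nuniq(x)',
                         'pctile(x)[10]', 'pctile(x)[50]', 'pctile(x)[90]'),
                summ.names = c('N', 'Unique', 'P10', 'Median', 'P90'),
                out = 'return')
tab
```

Without summ.names, the column headers would show the function calls themselves, which is why short calls like pctile(x)[50] read better than nested ones.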

weighted.sd

weighted.sd(x, w) is a function to calculate a weighted standard deviation of x using w as weights, much like the base weighted.mean() does for means. It is mostly used as a helper function for sumtable() when group.weights is specified. However, you can use it on its own if you like. Unlike weighted.mean(), setting na.rm = TRUE will account for missings both in x and w.

The weighted standard deviation is calculated as

\[ \sqrt{\frac{\sum_i w_i(x_i-\bar{x}_w)^2}{\frac{N_w-1}{N_w}\sum_i w_i}} \]

where \(\bar{x}_w\) is the weighted mean of \(x\), and \(N_w\) is the number of observations with a nonzero weight.

x <- 1:100
w <- 1:100
weighted.mean(x, w)
## [1] 67
sd(x)
## [1] 29.01149
weighted.sd(x, w)
## [1] 23.80476
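As a check on the calculation, the following base-R sketch computes the weighted variance by hand for the same x and w and takes its square root, which reproduces the weighted.sd(x, w) value printed above. The variable names are chosen for this example only.

```r
x <- 1:100
w <- 1:100

# Weighted mean of x
xbar_w <- sum(w * x) / sum(w)

# Number of observations with a nonzero weight
n_w <- sum(w != 0)

# Weighted variance, with the (N_w - 1)/N_w correction in the denominator
wvar <- sum(w * (x - xbar_w)^2) / (((n_w - 1) / n_w) * sum(w))

# The weighted standard deviation is the square root of that variance
sqrt(wvar)
## [1] 23.80476
```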

independence.test

independence.test is a helper function for sumtable(group.test=TRUE) that tests for independence between a categorical variable x and another variable y that may be categorical or numerical.

It returns a formatted string, with significance stars, for printing.

The function takes the format

independence.test(x,y,w=NA,
                  factor.test = NA,
                  numeric.test = NA,
                  star.cutoffs = c(.01,.05,.1),
                  star.markers = c('***','**','*'),
                  digits = 3,
                  fixed.digits = FALSE,
                  format = '{name}={stat}{stars}',
                  opts = list())

factor.test and numeric.test

These are functions that actually perform the independence test. numeric.test is used when y is numeric, and factor.test is used in all other instances.

Specifically, these functions should take only x, y, and w=NULL as arguments, and should return a list with three elements: the name of the test statistic, the test statistic itself, and the p-value of the test.

By default, these are the internal functions vtable:::chisq.it for factor.test and vtable:::groupf.it for numeric.test, so you can take a look at those (just type vtable:::chisq.it in the R console and it will show you the function’s code) if you’d like to make your own test functions.
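As a sketch of what a custom test function might look like, the example below swaps in a Kruskal-Wallis test for numeric y. The function name kw.it and the label 'KW' are hypothetical, and the function ignores the weights w, which is a simplification made for this illustration rather than how the built-in tests behave.

```r
library(vtable)

# Hypothetical custom test: Kruskal-Wallis instead of the default test.
# Takes x, y, and w = NULL as described above; w is ignored in this sketch.
kw.it <- function(x, y, w = NULL) {
  result <- stats::kruskal.test(y, g = as.factor(x))
  # Return the three expected elements:
  # statistic name, statistic value, p-value
  list('KW',
       unname(result$statistic),
       result$p.value)
}

data(iris)
independence.test(iris$Species,
                  iris$Sepal.Length,
                  numeric.test = kw.it)
```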

star.cutoffs and star.markers

These are numeric and character vectors, respectively, used for p-value cutoffs and to create significance markers.

star.cutoffs indicates the cutoffs, and star.markers indicates the markers to be used with each cutoff, in the same order. So with star.cutoffs = c(.01,.05,.1) and star.markers = c('***','**','*'), each p-value below .01 will get marked with '***', each from .01 to .05 will get '**', and each from .05 to .1 will get '*'.

Defaults are set to “economics defaults” (.1, .05, .01). But these are of course easy to change.

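data(iris)
# Cutoffs pair with the default markers in order:
# .05 with '***', .01 with '**', .001 with '*'.
# The p-value here is below .001, so it gets the single star.
independence.test(iris$Species,
                  iris$Sepal.Length,
                  star.cutoffs = c(.05,.01,.001))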
## [1] "F=119.265*"

digits and fixed.digits

digits indicates how many digits after the decimal place of the test statistic and p-value should be displayed. fixed.digits determines whether trailing zeros are maintained.

independence.test(iris$Species,
                  iris$Sepal.Width,
                  digits=1)
## [1] "F=49.2***"
independence.test(iris$Species,
                  iris$Sepal.Width,
                  digits=4,
                  fixed.digits = TRUE)
## [1] "F=49.1600***"

format

This is the printing format that the output will produce, incorporating the name of the test statistic {name}, the test statistic {stat}, the significance markers {stars}, and the p-value {pval}.

If your independence.test is heading out to another format besides being printed in the R console, you may want to add additional markup like '{name}$={stat}^{stars}$' in LaTeX or '{name}={stat}<sup>{stars}</sup>' in HTML. If you do this, be sure to think carefully about escaping or not escaping characters as appropriate when you print!

independence.test(iris$Species,
                  iris$Sepal.Width,
                  format = 'Pr(>{name}): {pval}{stars}')
## [1] "Pr(>F): <0.001***"
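For output headed to HTML, the superscript markup mentioned above can be passed in the same way. This is a minimal sketch. The variable name html_result is made up, and the resulting string still needs to be written out without escaping the tags.

```r
library(vtable)
data(iris)

# Wrap the significance stars in <sup> tags for HTML output
html_result <- independence.test(iris$Species,
                                 iris$Sepal.Width,
                                 format = '{name}={stat}<sup>{stars}</sup>')
html_result
```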

opts

You can create a named list where the names are the above options and the values are the settings for those options, and input it into independence.test using opts=. This is an easy way to set the same options for many independence.tests.
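A sketch of that pattern, assuming opts accepts the option names exactly as listed in the function signature above: one list of settings is reused to test each numeric column of iris against Species. The names test_opts and results are made up for this example.

```r
library(vtable)
data(iris)

# One shared set of options, named after the arguments above
test_opts <- list(digits = 2,
                  fixed.digits = TRUE,
                  star.cutoffs = c(.001, .01, .05))

# Apply the same test settings to every numeric column
results <- sapply(iris[, 1:4],
                  function(v) independence.test(iris$Species, v,
                                                opts = test_opts))
results
```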