Loading [MathJax]/jax/element/mml/optable/BasicLatin.js

Mathematical Framework for latentcor

Mingze Huang, Christian L. Müller, Irina Gaynanova

2022-09-05

Latent Gaussian Copula Model for Mixed Data

latentcor utilizes the powerful semi-parametric latent Gaussian copula models to estimate latent correlations between mixed data types (continuous/binary/ternary/truncated or zero-inflated). Below we review the definitions for each type.

Definition of continuous model (Fan et al. 2017)

A random XRp satisfies the Gaussian copula (or nonparanormal) model if there exist monotonically increasing f=(fj)pj=1 with Zj=fj(Xj) satisfying ZNp(0,Σ), σjj=1; we denote XNPN(0,Σ,f).

X = gen_data(n = 6, types = "con")$X
X
#>            [,1]
#> [1,]  1.8912120
#> [2,]  0.9633587
#> [3,]  0.0038847
#> [4,] -0.2654720
#> [5,] -0.8249745
#> [6,]  0.6162170

Definition of binary model (Fan et al. 2017)

A random XRp satisfies the binary latent Gaussian copula model if there exists WNPN(0,Σ,f) such that Xj=I(Wj>cj), where I() is the indicator function and cj are constants.

X = gen_data(n = 6, types = "bin")$X
X
#>      [,1]
#> [1,]    1
#> [2,]    0
#> [3,]    0
#> [4,]    1
#> [5,]    0
#> [6,]    1

Definition of ternary model (Quan, Booth, and Wells 2018)

A random XRp satisfies the ternary latent Gaussian copula model if there exists WNPN(0,Σ,f) such that Xj=I(Wj>cj)+I(Wj>cj), where I() is the indicator function and cj<cj are constants.

X = gen_data(n = 6, types = "ter")$X
X
#>      [,1]
#> [1,]    0
#> [2,]    1
#> [3,]    1
#> [4,]    0
#> [5,]    2
#> [6,]    1

Definition of truncated or zero-inflated model (Yoon, Carroll, and Gaynanova 2020)

A random XRp satisfies the truncated latent Gaussian copula model if there exists WNPN(0,Σ,f) such that Xj=I(Wj>cj)Wj, where I() is the indicator function and cj are constants.

X = gen_data(n = 6, types = "tru")$X
X
#>            [,1]
#> [1,] 2.31888927
#> [2,] 0.06226585
#> [3,] 1.06087423
#> [4,] 0.00000000
#> [5,] 0.00000000
#> [6,] 0.00000000

Mixed latent Gaussian copula model

The mixed latent Gaussian copula model jointly models W=(W1,W2,W3,W4)NPN(0,Σ,f) such that X1j=W1j, X2j=I(W2j>c2j), X3j=I(W3j>c3j)+I(W3j>c3j) and X4j=I(W4j>c4j)W4j.

set.seed("234820")
X = gen_data(n = 100, types = c("con", "bin", "ter", "tru"))$X
head(X)
#>            [,1] [,2] [,3]      [,4]
#> [1,] -0.5728663    0    1 0.0000000
#> [2,] -1.5632883    0    0 0.0000000
#> [3,]  0.4600555    1    1 0.2634213
#> [4,] -1.5186510    1    1 0.0000000
#> [5,] -1.5438165    0    1 0.0000000
#> [6,] -0.5656219    0    0 0.0000000

Moment-based estimation of Σ based on bridge functions

The estimation of latent correlation matrix Σ is achieved via the bridge function F which is defined such that E(ˆτjk)=F(σjk), where σjk is the latent correlation between variables j and k, and ˆτjk is the corresponding sample Kendall’s τ.

Kendall’s τ (τa)

Given observed xj,xkRn,

ˆτjk=ˆτ(xj,xk)=2n(n1)1i<insign(xijxij)sign(xikxik), where n is the sample size.

latentcor calculates pairwise Kendall’s ˆτ as part of the estimation process

estimate = latentcor(X, types = c("con", "bin", "ter", "tru"))
K = estimate$K
K
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 1.0000000 0.2557576 0.2456566 0.3331313
#> [2,] 0.2557576 1.0000000 0.1555556 0.2339394
#> [3,] 0.2456566 0.1555556 1.0000000 0.2183838
#> [4,] 0.3331313 0.2339394 0.2183838 1.0000000

Using F and ˆτjk, a moment-based estimator is ˆσjk=F1(ˆτjk) with the corresponding ˆΣ being consistent for Σ (Fan et al. 2017; Quan, Booth, and Wells 2018; Yoon, Carroll, and Gaynanova 2020).

The explicit form of bridge function F has been derived for all combinations of continuous(C)/binary(B)/ternary(N)/truncated(T) variable types, and we summarize the corresponding references. Each of this combinations is implemented in latentcor.

Type continuous binary ternary zero-inflated (truncated)
continuous Liu, Lafferty, and Wasserman (2009) - - -
binary Fan et al. (2017) Fan et al. (2017) - -
ternary Quan, Booth, and Wells (2018) Quan, Booth, and Wells (2018) Quan, Booth, and Wells (2018) -
zero-inflated (truncated) Yoon, Carroll, and Gaynanova (2020) Yoon, Carroll, and Gaynanova (2020) See Appendix Yoon, Carroll, and Gaynanova (2020)

Below we provide an explicit form of F for each combination.

Theorem (explicit form of bridge function) Let W1Rp1, W2Rp2, W3Rp3, W4Rp4 be such that W=(W1,W2,W3,W4)NPN(0,Σ,f) with p=p1+p2+p3+p4. Let X=(X1,X2,X3,X4)Rp satisfy Xj=Wj for j=1,...,p1, Xj=I(Wj>cj) for j=p1+1,...,p1+p2, Xj=I(Wj>cj)+I(Wj>cj) for j=p1+p2+1,...,p3 and Xj=I(Wj>cj)Wj for j=p1+p2+p3+1,...,p with Δj=f(cj). The rank-based estimator of Σ based on the observed n realizations of X is the matrix ˆR with ˆrjj=1, ˆrjk=ˆrkj=F1(ˆτjk) with block structure

ˆR=(F1CC(ˆτ)F1CB(ˆτ)F1CN(ˆτ)F1CT(ˆτ)F1BC(ˆτ)F1BB(ˆτ)F1BN(ˆτ)F1BT(ˆτ)F1NC(ˆτ)F1NB(ˆτ)F1NN(ˆτ)F1NT(ˆτ)F1TC(ˆτ)F1TB(ˆτ)F1TN(ˆτ)F1TT(ˆτ)) F()={CC: 2sin1(r)/πBC: 4Φ2(Δj,0;r/2)2Φ(Δj)BB: 2{Φ2(Δj,Δk;r)Φ(Δj)Φ(Δk)}NC: 4Φ2(Δ2j,0;r/2)2Φ(Δ2j)+4Φ3(Δ1j,Δ2j,0;Σ3a(r))2Φ(Δ1j)Φ(Δ2j)NB: 2Φ2(Δ2j,Δk,r){1Φ(Δ1j)}2Φ(Δ2j){Φ(Δk)Φ2(Δ1j,Δk,r)}NN: 2Φ2(Δ2j,Δ2k;r)Φ2(Δ1j,Δ1k;r)2{Φ(Δ2j)Φ2(Δ2j,Δ1k;r)}{Φ(Δ2k)Φ2(Δ1j,Δ2k;r)}TC: 2Φ2(Δj,0;1/2)+4Φ3(Δj,0,0;Σ3b(r))TB: 2{1Φ(Δj)}Φ(Δk)2Φ3(Δj,Δk,0;Σ3c(r))2Φ3(Δj,Δk,0;Σ3d(r))TN: 2Φ(Δ1k)Φ(Δ2k)+2Φ3(Δ1k,Δ2k,Δj;Σ3e(r))+2Φ4(Δ1k,Δ2k,Δj,0;Σ4a(r))+2Φ4(Δ1k,Δ2k,Δj,0;Σ4b(r))TT: 2Φ4(Δj,Δk,0,0;Σ4c(r))+2Φ4(Δj,Δk,0,0;Σ4d(r))

where Δj=Φ1(π0j), Δk=Φ1(π0k), Δ1j=Φ1(π0j), Δ2j=Φ1(π0j+π1j), Δ1k=Φ1(π0k), Δ2k=Φ1(π0k+π1k),

Σ3a(r)=(10r201r2r2r21),Σ3b(r)=(112r2121rr2r1),Σ3c(r)=(1r12r1r212r21),

Σ3d(r)=(101201r212r21),Σ3e(r)=(10001r0r1),Σ4a(r)=(100r201rr20r112r2r2121),

Σ4b(r)=(10rr2010r2r0112r2r2121),Σ4c(r)=(1012r201r21212r21rr212r1)andΣ4d(r)=(1r12r2r1r21212r21rr212r1).

Estimation methods

Given the form of bridge function F, obtaining a moment-based estimation ˆσjk requires inversion of F. latentcor implements two methods for calculation of the inversion:

Both methods calculate inverse bridge function applied to each element of sample Kendall’s τ matrix. Because the calculation is performed point-wise (separately for each pair of variables), the resulting point-wise estimator of correlation matrix may not be positive semi-definite. latentcor performs projection of the pointwise-estimator to the space of positive semi-definite matrices, and allows for shrinkage towards identity matrix using the parameter nu (see Subsection describing adjustment of point-wise estimator and relevant parameter nu).

Original method (method = "original")

Original estimation approach relies on numerical inversion of F based on solving uni-root optimization problem. Given the calculated ˆτjk (sample Kendall’s τ between variables j and k), the estimate of latent correlation ˆσjk is obtained by calling optimize function to solve the following optimization problem: ˆrjk=argmin The parameter tol controls the desired accuracy of the minimizer and is passed to optimize, with the default precision of 1e-8:

estimate = latentcor(X, types = c("con", "bin", "ter", "tru"), method = "original", tol = 1e-8)

Algorithm for Original method

Input: F(r)=F(r, \mathbf{\Delta}) - bridge function based on the type of variables j, k

estimate$K
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 1.0000000 0.2557576 0.2456566 0.3331313
#> [2,] 0.2557576 1.0000000 0.1555556 0.2339394
#> [3,] 0.2456566 0.1555556 1.0000000 0.2183838
#> [4,] 0.3331313 0.2339394 0.2183838 1.0000000
estimate$zratios
#> [[1]]
#> [1] NA
#> 
#> [[2]]
#> [1] 0.5
#> 
#> [[3]]
#> [1] 0.3 0.8
#> 
#> [[4]]
#> [1] 0.5
estimate$Rpointwise
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 1.0000000 0.5529903 0.4480984 0.5826171
#> [2,] 0.5529903 1.0000000 0.4050223 0.5821513
#> [3,] 0.4480984 0.4050223 1.0000000 0.4653875
#> [4,] 0.5826171 0.5821513 0.4653875 1.0000000

Approximation method (method = "approx")

A faster approximation method is based on multi-linear interpolation of pre-computed inverse bridge function on a fixed grid of points (Yoon, Müller, and Gaynanova 2021). This is possible as the inverse bridge function is an analytic function of at most 5 parameters:

In short, d-dimensional multi-linear interpolation uses a weighted average of 2^{d} neighbors to approximate the function values at the points within the d-dimensional cube of the neighbors, and to perform interpolation, latentcor takes advantage of the R package chebpol (Gaure 2019). This approximation method has been first described in (Yoon, Müller, and Gaynanova 2021) for continuous/binary/truncated cases. In latentcor, we additionally implement ternary case, and optimize the choice of grid as well as interpolation boundary for faster computations with smaller memory footprint.

#estimate = latentcor(X, types = c("con", "bin", "ter", "tru"), method = "approx")

Algorithm for Approximation method

Input: Let \check{g}=h(g), pre-computed values F^{-1}(h^{-1}(\check{g})) on a fixed grid \check{g}\in\check{\cal{G}} based on the type of variables j and k. For binary/continuous case, \check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}); for binary/binary case, \check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}, \check{\Delta}_{k}); for truncated/continuous case, \check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}); for truncated/truncated case, \check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}, \check{\Delta}_{k}); for ternary/continuous case, \check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}^{1}, \check{\Delta}_{j}^{2}); for ternary/binary case, \check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}^{1}, \check{\Delta}_{j}^{2}, \check{\Delta}_{k}); for ternary/truncated case, \check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}^{1}, \check{\Delta}_{j}^{2}, \check{\Delta}_{k}); for ternay/ternary case, \check{g}=(\check{\tau}_{jk}, \check{\Delta}_{j}^{1}, \check{\Delta}_{j}^{2}, \check{\Delta}_{k}^{1}, \check{\Delta}_{k}^{2}).

To avoid interpolation in areas with high approximation errors close to the boundary, we use hybrid scheme in Step 3. The parameter ratio controls the size of the region where the interpolation is performed (ratio = 0 means no interpolation, ratio = 1 means interpolation is always performed). For the derivation of approximate bound for BC, BB, TC, TB, TT cases see Yoon, Müller, and Gaynanova (2021). The derivation of approximate bound for NC, NB, NN, NT case is in the Appendix.

\bar{\tau}_{jk}(\cdot)= \begin{cases} 2\pi_{0j}(1-\pi_{0j}) & for \; BC \; case\\ 2\min(\pi_{0j},\pi_{0k})\{1-\max(\pi_{0j}, \pi_{0k})\} & for \; BB \; case\\ 2\{\pi_{0j}(1-\pi_{0j})+\pi_{1j}(1-\pi_{0j}-\pi_{1j})\} & for \; NC \; case\\ 2\min(\pi_{0j}(1-\pi_{0j})+\pi_{1j}(1-\pi_{0j}-\pi_{1j}),\pi_{0k}(1-\pi_{0k})) & for \; NB \; case\\ 2\min(\pi_{0j}(1-\pi_{0j})+\pi_{1j}(1-\pi_{0j}-\pi_{1j}), \\ \;\;\;\;\;\;\;\;\;\;\pi_{0k}(1-\pi_{0k})+\pi_{1k}(1-\pi_{0k}-\pi_{1k})) & for \; NN \; case\\ 1-(\pi_{0j})^{2} & for \; TC \; case\\ 2\max(\pi_{0k},1-\pi_{0k})\{1-\max(\pi_{0k},1-\pi_{0k},\pi_{0j})\} & for \; TB \; case\\ 1-\{\max(\pi_{0j},\pi_{0k},\pi_{1k},1-\pi_{0k}-\pi_{1k})\}^{2} & for \; TN \; case\\ 1-\{\max(\pi_{0j},\pi_{0k})\}^{2} & for \; TT \; case\\ \end{cases}

By default, latentcor uses ratio = 0.9 as this value was recommended in Yoon, Müller, and Gaynanova (2021) having a good balance of accuracy and computational speed. This value, however, can be modified by the user.

#latentcor(X, types = c("con", "bin", "ter", "tru"), method = "approx", ratio = 0.99)$R
#latentcor(X, types = c("con", "bin", "ter", "tru"), method = "approx", ratio = 0.4)$R
latentcor(X, types = c("con", "bin", "ter", "tru"), method = "original")$R
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 1.0000000 0.5524373 0.4476503 0.5820345
#> [2,] 0.5524373 1.0000000 0.4046173 0.5815691
#> [3,] 0.4476503 0.4046173 1.0000000 0.4649222
#> [4,] 0.5820345 0.5815691 0.4649222 1.0000000

The lower is the ratio, the closer is the approximation method to original method (with ratio = 0 being equivalent to method = "original"), but also the higher is the cost of computations.

library(microbenchmark)
#microbenchmark(latentcor(X, types = c("con", "bin", "ter", "tru"), method = "approx", ratio = 0.99)$R)
#microbenchmark(latentcor(X, types = c("con", "bin", "ter", "tru"), method = "approx", ratio = 0.4)$R)
microbenchmark(latentcor(X, types = c("con", "bin", "ter", "tru"), method = "original")$R)
#> Unit: milliseconds
#>                                                                        expr
#>  latentcor(X, types = c("con", "bin", "ter", "tru"), method = "original")$R
#>      min       lq     mean  median       uq     max neval
#>  39.4301 40.65735 41.72452 41.0003 42.33695 50.8438   100

Rescaled Grid for Interpolation

Since |\hat{\tau}|\le \bar{\tau}, the grid does not need to cover the whole domain \tau\in[-1, 1]. To optimize memory associated with storing the grid, we rescale \tau as follows: \check{\tau}_{jk}=\tau_{jk}/\bar{\tau}_{jk}\in[-1, 1], where \bar{\tau}_{jk} is as defined above.

In addition, for ternary variable j, it always holds that \Delta_{j}^{2}>\Delta_{j}^{1} since \Delta_{j}^{1}=\Phi^{-1}(\pi_{0j}) and \Delta_{j}^{2}=\Phi^{-1}(\pi_{0j}+\pi_{1j}). Thus, the grid should not cover the the area corresponding to \Delta_{j}^{2}\le\Delta_{j}^{1}. We thus rescale as follows: \check{\Delta}_{j}^{1}=\Delta_{j}^{1}/\Delta_{j}^{2}\in[0, 1]; \check{\Delta}_{j}^{2}=\Delta_{j}^{2}\in[0, 1].

Speed Comparison

To illustrate the speed improvement by method = "approx", we plot the run time scaling behavior of method = "approx" and method = "original" (setting types for gen_data by replicating c("con", "bin", "ter", "tru") multiple times) with increasing dimensions p = [20, 40, 100, 200, 400] at sample size n = 100 using simulation data. Figure below summarizes the observed scaling in a log-log plot. For both methods we observe the expected O(p^2) scaling behavior with dimension p, i.e., a linear scaling in the log-log plot. However, method = "approx" is at least one order of magnitude faster than method = "original" independent of the dimension of the problem.

Adjustment of pointwise-estimator for positive-definiteness

Since the estimation is performed point-wise, the resulting matrix of estimated latent correlations is not guaranteed to be positive semi-definite. For example, this could be expected when the sample size is small (and so the estimation error for each pairwise correlation is larger)

set.seed("234820")
X = gen_data(n = 6, types = c("con", "bin", "ter", "tru"))$X
X
#>            [,1] [,2] [,3]      [,4]
#> [1,] -0.5182800    0    1 0.1021738
#> [2,] -1.3017092    0    0 0.0000000
#> [3,]  0.3145191    1    2 0.4213514
#> [4,] -0.6093291    0    1 1.2771610
#> [5,] -1.3175490    1    0 0.0000000
#> [6,] -0.7807245    1    1 0.0000000
out = latentcor(X, types = c("con", "bin", "ter", "tru"))
out$Rpointwise
#>            [,1]       [,2]      [,3]       [,4]
#> [1,]  1.0000000 -0.1477240 0.9990000  0.8548518
#> [2,] -0.1477240  1.0000000 0.3523666 -0.5030324
#> [3,]  0.9990000  0.3523666 1.0000000  0.9114307
#> [4,]  0.8548518 -0.5030324 0.9114307  1.0000000
eigen(out$Rpointwise)$values
#> [1]  2.85954424  1.29130852  0.09944544 -0.25029820

latentcor automatically corrects the pointwise estimator to be positive definite by making two adjustments. First, if Rpointwise has smallest eigenvalue less than zero, the latentcor projects this matrix to the nearest positive semi-definite matrix. The user is notified of this adjustment through the message (supressed in previous code chunk), e.g.

out = latentcor(X, types = c("con", "bin", "ter", "tru"))

Second, latentcor shrinks the adjusted matrix of correlations towards identity matrix using the parameter \nu with default value of 0.001 (nu = 0.001), so that the resulting R is strictly positive definite with the minimal eigenvalue being greater or equal to \nu. That is R = (1 - \nu) \widetilde R + \nu I, where \widetilde R is the nearest positive semi-definite matrix to Rpointwise.

out = latentcor(X, types = c("con", "bin", "ter", "tru"), nu = 0.001)
out$Rpointwise
#>            [,1]       [,2]      [,3]       [,4]
#> [1,]  1.0000000 -0.1477240 0.9990000  0.8548518
#> [2,] -0.1477240  1.0000000 0.3523666 -0.5030324
#> [3,]  0.9990000  0.3523666 1.0000000  0.9114307
#> [4,]  0.8548518 -0.5030324 0.9114307  1.0000000
out$R
#>            [,1]       [,2]      [,3]       [,4]
#> [1,]  1.0000000 -0.1053533 0.9232992  0.9048072
#> [2,] -0.1053533  1.0000000 0.2372115 -0.4244433
#> [3,]  0.9232992  0.2372115 1.0000000  0.7723678
#> [4,]  0.9048072 -0.4244433 0.7723678  1.0000000

As a result, R and Rpointwise could be quite different when sample size n is small. When n is large and p is moderate, the difference is typically driven by parameter nu.

set.seed("234820")
X = gen_data(n = 100, types = c("con", "bin", "ter", "tru"))$X
out = latentcor(X, types = c("con", "bin", "ter", "tru"), nu = 0.001)
out$Rpointwise
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 1.0000000 0.5528213 0.4476819 0.5823120
#> [2,] 0.5528213 1.0000000 0.4058967 0.5808889
#> [3,] 0.4476819 0.4058967 1.0000000 0.4567771
#> [4,] 0.5823120 0.5808889 0.4567771 1.0000000
out$R
#>           [,1]      [,2]      [,3]      [,4]
#> [1,] 1.0000000 0.5522684 0.4472342 0.5817297
#> [2,] 0.5522684 1.0000000 0.4054908 0.5803080
#> [3,] 0.4472342 0.4054908 1.0000000 0.4563203
#> [4,] 0.5817297 0.5803080 0.4563203 1.0000000

Appendix

Derivation of bridge function F for ternary/truncated case

Without loss of generality, let j=1 and k=2. By the definition of Kendall’s \tau, \tau_{12}=E(\hat{\tau}_{12})=E[\frac{2}{n(n-1)}\sum_{1\leq i\leq i' \leq n} sign\{(X_{i1}-X_{i'1})(X_{i2}-X_{i'2})\}]. Since X_{1} is ternary, \begin{align} &sign(X_{1}-X_{1}') \nonumber\\ =&[I(U_{1}>C_{11},U_{1}'\leq C_{11})+I(U_{1}>C_{12},U_{1}'\leq C_{12})-I(U_{1}>C_{12},U_{1}'\leq C_{11})] \nonumber\\ &-[I(U_{1}\leq C_{11}, U_{1}'>C_{11})+I(U_{1}\leq C_{12}, U_{1}'>C_{12})-I(U_{1}\leq C_{11}, U_{1}'>C_{12})] \nonumber\\ =&[I(U_{1}>C_{11})-I(U_{1}>C_{11},U_{1}'>C_{11})+I(U_{1}>C_{12})-I(U_{1}>C_{12},U_{1}'>C_{12}) \nonumber\\ &-I(U_{1}>C_{12})+I(U_{1}>C_{12},U_{1}'>C_{11})] \nonumber\\ &-[I(U_{1}'>C_{11})-I(U_{1}>C_{11},U_{1}'>C_{11})+I(U_{1}'>C_{12})-I(U_{1}>C_{12},U_{1}'>C_{12}) \nonumber\\ &-I(U_{1}'>C_{12})+I(U_{1}>C_{11},U_{1}'>C_{12})] \nonumber\\ =&I(U_{1}>C_{11})+I(U_{1}>C_{12},U_{1}'>C_{11})-I(U_{1}'>C_{11})-I(U_{1}>C_{11},U_{1}'>C_{12}) \nonumber\\ =&I(U_{1}>C_{11},U_{1}'\leq C_{12})-I(U_{1}'>C_{11},U_{1}\leq C_{12}). \end{align} Since X_{2} is truncated, C_{1}>0 and \begin{align} sign(X_{2}-X_{2}')=&-I(X_{2}=0,X_{2}'>0)+I(X_{2}>0,X_{2}'=0) \nonumber\\ &+I(X_{2}>0,X_{2}'>0)sign(X_{2}-X_{2}') \nonumber\\ =&-I(X_{2}=0)+I(X_{2}'=0)+I(X_{2}>0,X_{2}'>0)sign(X_{2}-X_{2}'). \end{align} Since f is monotonically increasing, sign(X_{2}-X_{2}')=sign(Z_{2}-Z_{2}'), \begin{align} \tau_{12}=&E[I(U_{1}>C_{11},U_{1}'\leq C_{12}) sign(X_{2}-X_{2}')] \nonumber\\ &-E[I(U_{1}'>C_{11},U_{1}\leq C_{12}) sign(X_{2}-X_{2}')] \nonumber\\ =&-E[I(U_{1}>C_{11},U_{1}'\leq C_{12}) I(X_{2}=0)] \nonumber\\ &+E[I(U_{1}>C_{11},U_{1}'\leq C_{12}) I(X_{2}'=0)] \nonumber\\ &+E[I(U_{1}>C_{11},U_{1}'\leq C_{12})I(X_{2}>0,X_{2}'>0)sign(Z_{2}-Z_{2}')] \nonumber\\ &+E[I(U_{1}'>C_{11},U_{1}\leq C_{12}) I(X_{2}=0)] \nonumber\\ &-E[I(U_{1}'>C_{11},U_{1}\leq C_{12}) I(X_{2}'=0)] \nonumber\\ &-E[I(U_{1}'>C_{11},U_{1}\leq C_{12})I(X_{2}>0,X_{2}'>0)sign(Z_{2}-Z_{2}')] \nonumber\\ =&-2E[I(U_{1}>C_{11},U_{1}'\leq C_{12}) I(X_{2}=0)] \nonumber\\ &+2E[I(U_{1}>C_{11},U_{1}'\leq C_{12}) I(X_{2}'=0)] \nonumber\\ &+E[I(U_{1}>C_{11},U_{1}'\leq C_{12})I(X_{2}>0,X_{2}'>0)sign(Z_{2}-Z_{2}')] \nonumber\\ &-E[I(U_{1}'>C_{11},U_{1}\leq C_{12})I(X_{2}>0,X_{2}'>0)sign(Z_{2}-Z_{2}')]. \end{align} From the definition of U, let Z_{j}=f_{j}(U_{j}) and \Delta_{j}=f_{j}(C_{j}) for j=1,2. Using sign(x)=2I(x>0)-1, we obtain \begin{align} \tau_{12}=&-2E[I(Z_{1}>\Delta_{11},Z_{1}'\leq \Delta_{12},Z_{2}\leq \Delta_{2})]+2E[I(Z_{1}>\Delta_{11},Z_{1}'\leq \Delta_{12},Z_{2}'\leq \Delta_{2})] \nonumber\\ &+2E[I(Z_{1}>\Delta_{11},Z_{1}'\leq \Delta_{12})I(Z_{2}>\Delta_{2},Z_{2}'>\Delta_{2},Z_{2}-Z_{2}'>0)] \nonumber\\ &-2E[I(Z_{1}'>\Delta_{11},Z_{1}\leq \Delta_{12})I(Z_{2}>\Delta_{2},Z_{2}'>\Delta_{2},Z_{2}-Z_{2}'>0)] \nonumber\\ =&-2E[I(Z_{1}>\Delta_{11},Z_{1}'\leq \Delta_{12}, Z_{2}\leq \Delta_{2})]+2E[I(Z_{1}>\Delta_{11},Z_{1}'\leq \Delta_{12}, Z_{2}'\leq \Delta_{2})] \nonumber\\ &+2E[I(Z_{1}>\Delta_{11},Z_{1}'\leq\Delta_{12},Z_{2}'>\Delta_{2},Z_{2}>Z_{2}')] \nonumber\\ &-2E[I(Z_{1}'>\Delta_{11},Z_{1}\leq\Delta_{12},Z_{2}'>\Delta_{2},Z_{2}>Z_{2}')]. \end{align} Since \{\frac{Z_{2}'-Z_{2}}{\sqrt{2}}, -Z{1}\}, \{\frac{Z_{2}'-Z_{2}}{\sqrt{2}}, Z{1}'\} and \{\frac{Z_{2}'-Z_{2}}{\sqrt{2}}, -Z{2}'\} are standard bivariate normally distributed variables with correlation -\frac{1}{\sqrt{2}}, r/\sqrt{2} and -\frac{r}{\sqrt{2}}, respectively, by the definition of \Phi_3(\cdot,\cdot, \cdot;\cdot) and \Phi_4(\cdot,\cdot, \cdot,\cdot;\cdot) we have \begin{align} F_{NT}(r;\Delta_{j}^{1},\Delta_{j}^{2},\Delta_{k})= & -2\Phi_{3}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},\Delta_{k};\begin{pmatrix} 1 & 0 & -r \\ 0 & 1 & 0 \\ -r & 0 & 1 \end{pmatrix} \right\} \nonumber\\ &+2\Phi_{3}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},\Delta_{k};\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & r \\ 0 & r & 1 \end{pmatrix}\right\}\nonumber \\ & +2\Phi_{4}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},-\Delta_{k},0;\begin{pmatrix} 1 & 0 & 0 & \frac{r}{\sqrt{2}} \\ 0 & 1 & -r & \frac{r}{\sqrt{2}} \\ 0 & -r & 1 & -\frac{1}{\sqrt{2}} \\ \frac{r}{\sqrt{2}} & \frac{r}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 1 \end{pmatrix}\right\} \nonumber\\ &-2\Phi_{4}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},-\Delta_{k},0;\begin{pmatrix} 1 & 0 & r & -\frac{r}{\sqrt{2}} \\ 0 & 1 & 0 & -\frac{r}{\sqrt{2}} \\ r & 0 & 1 & -\frac{1}{\sqrt{2}} \\ -\frac{r}{\sqrt{2}} & -\frac{r}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 1 \end{pmatrix}\right\}. \end{align} Using the facts that \begin{align} &\Phi_{4}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},-\Delta_{k},0;\begin{pmatrix} 1 & 0 & r & -\frac{r}{\sqrt{2}} \\ 0 & 1 & 0 & -\frac{r}{\sqrt{2}} \\ r & 0 & 1 & -\frac{1}{\sqrt{2}} \\ -\frac{r}{\sqrt{2}} & -\frac{r}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 1 \end{pmatrix}\right\} \nonumber\\ &+\Phi_{4}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},-\Delta_{k},0;\begin{pmatrix} 1 & 0 & r & \frac{r}{\sqrt{2}} \\ 0 & 1 & 0 & \frac{r}{\sqrt{2}} \\ r & 0 & 1 & \frac{1}{\sqrt{2}} \\ \frac{r}{\sqrt{2}} & \frac{r}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 1 \end{pmatrix}\right\} \nonumber\\ =&\Phi_{3}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},-\Delta_{k};\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & r \\ 0 & r & 1 \end{pmatrix}\right\} \end{align} and \begin{align} &\Phi_{3}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},-\Delta_{k};\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & r \\ 0 & r & 1 \end{pmatrix}\right\}+\Phi_{3}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},\Delta_{k};\begin{pmatrix} 1 & 0 & -r \\ 0 & 1 & 0 \\ -r & 0 & 1 \end{pmatrix} \right\} \nonumber\\ =&\Phi_{2}(-\Delta_{j}^{1},\Delta_{j}^{2};0) =\Phi(-\Delta_{j}^{1})\Phi(\Delta_{j}^{2}). \end{align} So that, \begin{align} F_{NT}(r;\Delta_{j}^{1},\Delta_{j}^{2},\Delta_{k})= & -2\Phi(-\Delta_{j}^{1})\Phi(\Delta_{j}^{2}) \nonumber\\ &+2\Phi_{3}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},\Delta_{k};\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & r \\ 0 & r & 1 \end{pmatrix}\right\}\nonumber \\ & +2\Phi_{4}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},-\Delta_{k},0;\begin{pmatrix} 1 & 0 & 0 & \frac{r}{\sqrt{2}} \\ 0 & 1 & -r & \frac{r}{\sqrt{2}} \\ 0 & -r & 1 & -\frac{1}{\sqrt{2}} \\ \frac{r}{\sqrt{2}} & \frac{r}{\sqrt{2}} & -\frac{1}{\sqrt{2}} & 1 \end{pmatrix}\right\} \nonumber\\ &+2\Phi_{4}\left\{-\Delta_{j}^{1},\Delta_{j}^{2},-\Delta_{k},0;\begin{pmatrix} 1 & 0 & r & \frac{r}{\sqrt{2}} \\ 0 & 1 & 0 & \frac{r}{\sqrt{2}} \\ r & 0 & 1 & \frac{1}{\sqrt{2}} \\ \frac{r}{\sqrt{2}} & \frac{r}{\sqrt{2}} & \frac{1}{\sqrt{2}} & 1 \end{pmatrix}\right\}. \end{align}

It is easy to get the bridge function for truncated/ternary case by switching j and k.

Derivation of approximate bound for the ternary/continuous case

Let n_{0x}=\sum_{i=1}^{n_x}I(x_{i}=0), n_{2x}=\sum_{i=1}^{n_x}I(x_{i}=2), \pi_{0x}=\frac{n_{0x}}{n_{x}} and \pi_{2x}=\frac{n_{2x}}{n_{x}}, then \begin{align} |\tau(\mathbf{x})|\leq & \frac{n_{0x}(n-n_{0x})+n_{2x}(n-n_{0x}-n_{2x})}{\begin{pmatrix} n \\ 2 \end{pmatrix}} \nonumber\\ = & 2\{\frac{n_{0x}}{n-1}-(\frac{n_{0x}}{n})(\frac{n_{0x}}{n-1})+\frac{n_{2x}}{n-1}-(\frac{n_{2x}}{n})(\frac{n_{0x}}{n-1})-(\frac{n_{2x}}{n})(\frac{n_{2x}}{n-1})\} \nonumber\\ \approx & 2\{\frac{n_{0x}}{n}-(\frac{n_{0x}}{n})^2+\frac{n_{2x}}{n}-(\frac{n_{2x}}{n})(\frac{n_{0x}}{n})-(\frac{n_{2x}}{n})^2\} \nonumber\\ = & 2\{\pi_{0x}(1-\pi_{0x})+\pi_{2x}(1-\pi_{0x}-\pi_{2x})\} \end{align}

For ternary/binary and ternary/ternary cases, we combine the two individual bounds.

Derivation of approximate bound for the ternary/truncated case

Let \mathbf{x}\in\mathcal{R}^{n} and \mathbf{y}\in\mathcal{R}^{n} be the observed n realizations of ternary and truncated variables, respectively. Let n_{0x}=\sum_{i=0}^{n}I(x_{i}=0), \pi_{0x}=\frac{n_{0x}}{n}, n_{1x}=\sum_{i=0}^{n}I(x_{i}=1), \pi_{1x}=\frac{n_{1x}}{n}, n_{2x}=\sum_{i=0}^{n}I(x_{i}=2), \pi_{2x}=\frac{n_{2x}}{n}, n_{0y}=\sum_{i=0}^{n}I(y_{i}=0), \pi_{0y}=\frac{n_{0y}}{n}, n_{0x0y}=\sum_{i=0}^{n}I(x_{i}=0 \;\& \; y_{i}=0), n_{1x0y}=\sum_{i=0}^{n}I(x_{i}=1 \;\& \; y_{i}=0) and n_{2x0y}=\sum_{i=0}^{n}I(x_{i}=2 \;\& \; y_{i}=0) then \begin{align} |\tau(\mathbf{x}, \mathbf{y})|\leq & \frac{\begin{pmatrix}n \\ 2\end{pmatrix}-\begin{pmatrix}n_{0x} \\ 2\end{pmatrix}-\begin{pmatrix}n_{1x} \\ 2\end{pmatrix}-\begin{pmatrix} n_{2x} \\ 2 \end{pmatrix}-\begin{pmatrix}n_{0y} \\ 2\end{pmatrix}+\begin{pmatrix}n_{0x0y} \\ 2 \end{pmatrix}+\begin{pmatrix}n_{1x0y} \\ 2\end{pmatrix}+\begin{pmatrix}n_{2x0y} \\ 2\end{pmatrix}}{\begin{pmatrix}n \\ 2\end{pmatrix}} \nonumber \end{align} Since n_{0x0y}\leq\min(n_{0x},n_{0y}), n_{1x0y}\leq\min(n_{1x},n_{0y}) and n_{2x0y}\leq\min(n_{2x},n_{0y}) we obtain \begin{align} |\tau(\mathbf{x}, \mathbf{y})|\leq & \frac{\begin{pmatrix}n \\ 2\end{pmatrix}-\begin{pmatrix}n_{0x} \\ 2\end{pmatrix}-\begin{pmatrix}n_{1x} \\ 2\end{pmatrix}-\begin{pmatrix} n_{2x} \\ 2 \end{pmatrix}-\begin{pmatrix}n_{0y} \\ 2\end{pmatrix}}{\begin{pmatrix}n \\ 2\end{pmatrix}} \nonumber\\ & + \frac{\begin{pmatrix}\min(n_{0x},n_{0y}) \\ 2 \end{pmatrix}+\begin{pmatrix}\min(n_{1x},n_{0y}) \\ 2\end{pmatrix}+\begin{pmatrix}\min(n_{2x},n_{0y}) \\ 2\end{pmatrix}}{\begin{pmatrix}n \\ 2\end{pmatrix}} \nonumber\\ \leq & \frac{\begin{pmatrix}n \\ 2\end{pmatrix}-\begin{pmatrix}\max(n_{0x},n_{1x},n_{2x},n_{0y}) \\ 2\end{pmatrix}}{\begin{pmatrix}n \\ 2\end{pmatrix}} \nonumber\\ \leq & 1-\frac{\max(n_{0x},n_{1x},n_{2x},n_{0y})(\max(n_{0x},n_{1x},n_{2x},n_{0y})-1)}{n(n-1)} \nonumber\\ \approx & 1-(\frac{\max(n_{0x},n_{1x},n_{2x},n_{0y})}{n})^{2} \nonumber\\ =& 1-\{\max(\pi_{0x},\pi_{1x},\pi_{2x},\pi_{0y})\}^{2} \nonumber\\ =& 1-\{\max(\pi_{0x},(1-\pi_{0x}-\pi_{2x}),\pi_{2x},\pi_{0y})\}^{2} \end{align}

It is easy to get the approximate bound for truncated/ternary case by switching \mathbf{x} and \mathbf{y}.

References

Croux, Christophe, Peter Filzmoser, and Heinrich Fritz. 2013. “Robust Sparse Principal Component Analysis.” Technometrics 55 (2): 202–14.
Fan, Jianqing, Han Liu, Yang Ning, and Hui Zou. 2017. “High Dimensional Semiparametric Latent Graphical Model for Mixed Data.” Journal of the Royal Statistical Society. Series B: Statistical Methodology 79 (2): 405–21.
Filzmoser, Peter, Heinrich Fritz, and Klaudius Kalcher. 2021. pcaPP: Robust PCA by Projection Pursuit. https://CRAN.R-project.org/package=pcaPP.
Fox, John. 2019. Polycor: Polychoric and Polyserial Correlations. https://CRAN.R-project.org/package=polycor.
Gaure, Simen. 2019. Chebpol: Multivariate Interpolation. https://github.com/sgaure/chebpol.
Liu, Han, John Lafferty, and Larry Wasserman. 2009. “The Nonparanormal: Semiparametric Estimation of High Dimensional Undirected Graphs.” Journal of Machine Learning Research 10 (10).
Quan, Xiaoyun, James G Booth, and Martin T Wells. 2018. “Rank-Based Approach for Estimating Correlations in Mixed Ordinal Data.” arXiv Preprint arXiv:1809.06255.
Yoon, Grace, Raymond J Carroll, and Irina Gaynanova. 2020. “Sparse Semiparametric Canonical Correlation Analysis for Data of Mixed Types.” Biometrika 107 (3): 609–25.
Yoon, Grace, Christian L Müller, and Irina Gaynanova. 2021. “Fast Computation of Latent Correlations.” Journal of Computational and Graphical Statistics, 1–8.