Scalable MCMC for the horseshoe prior Anirban Bhattacharya Department of Statistics, Texas A&M University Joint work with James Johndrow and Paolo Orenstein September 7, 2018 Cornell Day of Statistics 2018 Department of Statistical Science, Cornell University
Introduction Original motivation: develop scalable MCMC algorithms for large (N, p) regression with continuous shrinkage priors, for example the horseshoe prior of Carvalho et al. (2010). A one-group alternative to traditional two-group (spike-and-slab) priors (George & McCulloch, 1997; Scott & Berger, 2010; Rockova & George, 2014). Interesting in their own right, with many attractive theoretical properties. But there is a lack of algorithms that scale to large (N, p) problems.
Approximations in MCMC Our proposed algorithm introduces certain approximations at each MCMC step, yielding substantial computational advantages. How do we quantify the effect of such approximations? Perturbation theory for MCMC algorithms (Alquier et al. 2014; Rudolf & Schweizer 2018; Johndrow & Mattingly 2018; ...). A new general result + bounds on the approximation error for our algorithm.
Bayesian shrinkage: motivation and background
Global-local shrinkage priors Consider a Gaussian linear model $z = W\beta + \varepsilon$, $\varepsilon \sim N(0, \sigma^2 I_N)$, where $W$ is $N \times p$, with $N, p$ both possibly large. The basic form of the prior is $\beta_j \mid \sigma, \xi, \eta \overset{ind}{\sim} N(0, \sigma^2 \xi^{-1} \eta_j^{-1})$. The $\eta_j^{-1/2}$ are the local scales and $\xi^{-1/2}$ the global scale. A popular choice for $\pi(\xi, \eta)$ is the horseshoe: $\eta_j^{-1/2} \overset{ind}{\sim} \mathrm{Cauchy}^+(0, 1)$, $\xi^{-1/2} \sim \mathrm{Cauchy}^+(0, 1)$.
Global-local shrinkage priors The basic form of the prior is $\beta_j \mid \sigma, \xi, \eta \overset{ind}{\sim} N(0, \sigma^2 \xi^{-1} \eta_j^{-1})$. The $\eta_j^{-1/2}$ are the local scales and $\xi^{-1/2}$ the global scale. A global scale alone gives ridge-type shrinkage; the local scales help adapt to sparsity. The global scale $\xi^{-1/2}$ controls how many $\beta_j$ are signals, and the $\eta_j^{-1/2}$ control their identities.
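To make the global-local structure concrete, here is a minimal sketch (ours, not from the slides) that draws $\beta$ from the horseshoe prior under this parameterization; the dimension and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
p, sigma = 10_000, 1.0
xi_inv_half = np.abs(rng.standard_cauchy())       # global scale xi^{-1/2} ~ Cauchy+(0, 1)
eta_inv_half = np.abs(rng.standard_cauchy(p))     # local scales eta_j^{-1/2} ~ Cauchy+(0, 1)
beta = sigma * xi_inv_half * eta_inv_half * rng.standard_normal(p)
# beta_j | sigma, xi, eta ~ N(0, sigma^2 * xi^{-1} * eta_j^{-1})
```

Most coordinates are pulled toward zero by the shared global scale, while the heavy-tailed local scales let a few coordinates escape shrinkage.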
Continuous shrinkage via one group models [Figure: comparison of prior densities (Laplace, Cauchy, Horseshoe, DL(1/2)) in the central region and in the tails.]
Properties Minimax optimal posterior concentration (van der Pas et al. 2014; Ghosh & Chakraborty 2015; van der Pas et al. 2016a,b; Chakraborty, B. & Mallick 2017; etc.) and frequentist coverage of credible sets (van der Pas et al. 2016b). No result on MCMC convergence (e.g., geometric ergodicity) to the best of our knowledge; because of polynomial tails?
Example Simulation with N = 2,000 and p = 20,000: first 50 $\beta_j$ (rest are zero). Posterior medians, 95 percent credible intervals, along with the truth.
Computational challenges
MCMC review We need to compute the posterior expectation, and would like to have access to marginals for $\beta$. We won't get this from optimization (not a convex problem anyway), so instead we do MCMC. Basic idea of MCMC: construct a Markov transition kernel $P$ with invariant measure the posterior, i.e. $\mu P = \mu$, where $\mu$ is the posterior measure. Then approximate $\mu\varphi \equiv \int \varphi(x)\,\mu(dx) \approx n^{-1}\sum_{k=0}^{n-1} \varphi(X_k)$ for $X_k \sim \nu P^k$.
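A toy illustration of the ergodic-average idea (our own sketch, not the horseshoe sampler): a random-walk Metropolis chain whose invariant measure is $N(0, 1)$, with sample averages approximating $\mu\varphi$.

```python
import numpy as np

def rw_metropolis(n, step=1.0, seed=0):
    """Random-walk Metropolis chain targeting N(0, 1)."""
    rng = np.random.default_rng(seed)
    x, out = 0.0, np.empty(n)
    for k in range(n):
        prop = x + step * rng.standard_normal()
        # log acceptance ratio for the N(0, 1) target
        if np.log(rng.uniform()) < 0.5 * (x ** 2 - prop ** 2):
            x = prop
        out[k] = x
    return out

samples = rw_metropolis(50_000)
print(samples.mean(), (samples ** 2).mean())   # ergodic averages: ~0 and ~1
```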
Computational cost What is the computational cost? Two factors 1. The cost of taking one step with P 2. How long the Markov chain needs to be to make approximation good
Computational cost per step Perform various matrix operations: multiplication, solving linear systems, Cholesky factorization, etc.
Length of path required How long of a Markov chain do we need to approximate the posterior well? Informally, the higher the autocorrelations, the longer the path we will need Another performance metric: effective sample size, the equivalent number of independent samples (larger is better)
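For concreteness, a rough sketch (ours) of the usual autocorrelation-based effective sample size estimate for a scalar chain:

```python
import numpy as np

def effective_sample_size(x):
    """ESS ~ n / (1 + 2 * sum of autocorrelations up to the first non-positive lag)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    # sample autocorrelations at lags 0, 1, ..., n-1
    acf = np.correlate(xc, xc, mode="full")[n - 1:] / (np.arange(n, 0, -1) * xc.var())
    nonpos = np.nonzero(acf[1:] <= 0)[0]
    cutoff = nonpos[0] + 1 if nonpos.size else n
    return n / (1.0 + 2.0 * acf[1:cutoff].sum())
```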
Computational cost What is the computational cost? Two factors 1. The cost of taking one step with P 2. How long the Markov chain needs to be to make approximation good For the horseshoe, both of these present challenges Linear algebra with large matrices High autocorrelation
Algorithmic developments
Gibbs sampling for the Horseshoe The full horseshoe hierarchical structure is $z \mid \beta, \sigma^2, \xi, \eta \sim N(W\beta, \sigma^2 I_N)$, $\beta_j \mid \sigma^2, \xi, \eta \overset{iid}{\sim} N(0, \sigma^2 \xi^{-1} \eta_j^{-1})$, $\xi^{-1/2} \sim \mathrm{Cauchy}^+(0, 1)$, $\eta_j^{-1/2} \overset{iid}{\sim} \mathrm{Cauchy}^+(0, 1)$, $\sigma^2 \sim \mathrm{InvGamma}(a/2, b/2)$. Typical computational approach: blocked Gibbs sampling (Polson et al. 2012), cycling through $\eta \mid \beta, \sigma^2, \xi, z$; $\xi \mid \beta, \sigma^2, \eta, z$; $(\beta, \sigma^2) \mid \eta, \xi, z$.
Mixing issues The algorithm is known to exhibit poor mixing for ξ (Polson et al. (2012))
Mixing issues This algorithm is known to exhibit poor mixing for ξ (Polson et al (2012))
Remedy: Johndrow, Orenstein, B. (2018+) Our approach: blocking. Alternate between $(\beta, \sigma^2, \xi) \mid \eta, z$ and $\eta \mid (\beta, \sigma^2, \xi), z$. The first step is done compositionally by sampling $\xi \mid \eta, z$; then $\sigma^2 \mid \eta, \xi, z$; then $\beta \mid \eta, \sigma^2, \xi, z$.
Results: Autocorrelations for ξ
Geometric ergodicity Theorem. The blocked sampler above is geometrically ergodic. Specifically, let $x = (\eta, \xi, \sigma^2, \beta)$ and $P$ the transition kernel. Also, let $\mu$ be the invariant measure, i.e., the posterior. We show: 1. Foster-Lyapunov condition. There exists a function $V : \mathcal{X} \to [0, \infty)$ and constants $0 < \gamma < 1$ and $K > 0$ such that $(PV)(x) \equiv \int V(y)\,P(x, dy) \le \gamma V(x) + K$. 2. Minorization. For every $R > 0$ there exists $\alpha \in (0, 1)$ (depending on $R$) such that $\sup_{x, y \in S(R)} \|P(x, \cdot) - P(y, \cdot)\|_{TV} \le 2(1 - \alpha)$, where $S(R) = \{x : V(x) < R\}$.
Geometric ergodicity Theorem (continued). Let $x = (\eta, \xi, \sigma^2, \beta)$ and $P$ the transition kernel. Also, let $\mu$ be the invariant measure, i.e., the posterior. Together, (1) and (2) imply $\sup_{|\varphi| < 1 + V} \left| \int \varphi(y)\,(P^n(x, \cdot) - \mu)(dy) \right| \le C \bar{\alpha}^n V(x)$.
Our exact algorithm Blocking significantly improves mixing, and the sampler is provably geometrically ergodic. But what about the cost per step?
Cost-per-iteration Let's focus on the update of $\beta$: $\beta \mid \sigma^2, \xi, \eta, z \sim N\big( (W^T W + (\xi^{-1} D)^{-1})^{-1} W^T z,\ \sigma^2 (W^T W + (\xi^{-1} D)^{-1})^{-1} \big)$, where $D = \mathrm{diag}(\eta_j^{-1})$. The usual Cholesky-based sampler (Rue, 2001) for $N(Q^{-1} b, Q^{-1})$ requires $O(p^3)$ computation for non-sparse $Q$: a highly prohibitive per-iteration complexity when $p \gg N$.
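For reference, a minimal sketch (ours) of the Cholesky-based draw from $N(Q^{-1} b, Q^{-1})$; for the $\beta$ update, $Q$ is essentially $W^T W + (\xi^{-1} D)^{-1}$ (up to the $\sigma^2$ scaling) and $b = W^T z$, and the $p \times p$ factorization is what drives the $O(p^3)$ cost.

```python
import numpy as np

def rue_sampler(Q, b, rng):
    """Draw from N(Q^{-1} b, Q^{-1}) via a Cholesky factorization Q = L L^T (O(p^3))."""
    L = np.linalg.cholesky(Q)
    mu = np.linalg.solve(L.T, np.linalg.solve(L, b))        # mean Q^{-1} b
    return mu + np.linalg.solve(L.T, rng.standard_normal(b.size))
```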
(Partial) Remedy In B., Chakraborty, Mallick (2016), we propose an alternative exact sampler with $O(N^2 p)$ complexity. (i) Sample $u \sim N(0, \xi^{-1} D)$ and $f \sim N(0, I_N)$ independently. (ii) Set $v = W u + f$. (iii) Solve $M_\xi v_* = (z/\sigma - v)$, where $M_\xi = I_N + \xi^{-1} W D W^T$. (iv) Set $\beta = \sigma (u + \xi^{-1} D W^T v_*)$. Step (iii) is the costliest, taking $\max\{O(N^2 p), O(N^3)\}$ operations. Significant savings when $p \gg N$.
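A minimal NumPy sketch (ours; the function name and interface are not from the paper) of steps (i)-(iv), writing $d_j = \xi^{-1} \eta_j^{-1}$:

```python
import numpy as np

def fast_beta_draw(W, z, sigma, xi, eta, rng):
    """One exact draw of beta | sigma, xi, eta, z in O(N^2 p) operations."""
    N, p = W.shape
    d = 1.0 / (xi * eta)                        # d_j = xi^{-1} eta_j^{-1}
    u = np.sqrt(d) * rng.standard_normal(p)     # (i)  u ~ N(0, xi^{-1} D)
    f = rng.standard_normal(N)                  #      f ~ N(0, I_N)
    v = W @ u + f                               # (ii)
    M = np.eye(N) + (W * d) @ W.T               #      M_xi = I_N + xi^{-1} W D W^T
    v_star = np.linalg.solve(M, z / sigma - v)  # (iii) the costliest step
    return sigma * (u + d * (W.T @ v_star))     # (iv)
```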
Cost-per-iteration However, this is still $O(N^2 p)$ computation, and $N$ can be on the order of tens of thousands in GWAS studies. The remaining bottleneck is only in calculating $M_\xi = I_N + \xi^{-1} W D W^T$, which is needed by the updates for $\beta$, $\sigma^2$, and $\xi$.
Approximations in MCMC
Approximation The horseshoe is designed to shrink most coordinates of $\beta$ toward zero, so most of the $\xi^{-1} \eta_j^{-1}$ will typically be tiny at any iteration. Idea: choose a small threshold $\delta$ and approximate $M_\xi$ by $M_{\xi,\delta} = I_N + \xi^{-1} W_S D_S W_S^T$, $S = \{j : \xi^{-1} \eta_j^{-1} > \delta\}$, where $W_S$ is the submatrix consisting of the columns in the set $S$, etc. Carefully replace all calculations involving $M_\xi$ with $M_{\xi,\delta}$. This reduces the cost per step to $\max\{O(N s^2), O(N p)\}$, where $s = |S|$. Note: this is different from setting some $\beta_j = 0$ at each scan.
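A sketch (ours) of the thresholded matrix; only the columns in the active set $S$ enter the product, so the cost scales with $|S|$ rather than $p$:

```python
import numpy as np

def approx_M(W, xi, eta, delta):
    """M_{xi,delta} = I_N + xi^{-1} W_S D_S W_S^T with S = {j : xi^{-1} eta_j^{-1} > delta}."""
    d = 1.0 / (xi * eta)              # xi^{-1} eta_j^{-1}
    S = d > delta                     # coordinates not heavily shrunk at this iteration
    W_S = W[:, S]
    return np.eye(W.shape[0]) + (W_S * d[S]) @ W_S.T
```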
Perturbation bounds A general strategy to reduce cost per step: approximate $P$ by $P_\epsilon$. Question: is $\mu\varphi \approx n^{-1}\sum_{k=0}^{n-1} \varphi(X_k^\epsilon)$ for $X_k^\epsilon \sim \nu P_\epsilon^k$? Answer: yes, if (along with other verifiable conditions) $\sup_x \|P(x, \cdot) - P_\epsilon(x, \cdot)\|_{TV}$ is small, i.e., we have uniform total variation control between the exact and approximate chains.
Approximation error condition We show that our approximation achieves $\sup_x \|P(x, \cdot) - P_\epsilon(x, \cdot)\|_{TV}^2 \le \delta \|W\|^2 \big[ 4N(\|z\|^2 / b_0) + 9 \big] + O(\delta^2)$ for any fixed threshold $\delta \in (0, 1)$. Practically: we set $\delta = 10^{-4}$ and have observed no advantage from smaller values. We don't have a result in the Wasserstein metric, but we can numerically compute the pathwise Wasserstein-2 distance between the exact and approximate chains.
Simulation studies The results that follow use a common simulation structure: $w_i \overset{iid}{\sim} N_p(0, \Sigma)$, $z_i \sim N(w_i^T \beta, 4)$, with $\beta_j = 2^{(j/4 - 9/4)}$ for $j < 24$ and $\beta_j = 0$ for $j > 23$. So there are always small and large signals, and true nulls. We consider both $\Sigma = I$ (independent design) and $\Sigma_{ij} = 0.9^{|i-j|}$ (correlated design).
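A data-generating sketch matching this description as we read it (the helper and parameter names are ours; forming the full $p \times p$ covariance is only for illustration):

```python
import numpy as np

def simulate(N, p, rho=0.0, seed=0):
    """Rows of W are N(0, Sigma) with Sigma_ij = rho^|i-j|; z_i ~ N(w_i' beta, 4)."""
    rng = np.random.default_rng(seed)
    if rho > 0:
        idx = np.arange(p)
        L = np.linalg.cholesky(rho ** np.abs(idx[:, None] - idx[None, :]))
        W = rng.standard_normal((N, p)) @ L.T
    else:
        W = rng.standard_normal((N, p))
    beta = np.zeros(p)
    j = np.arange(1, 24)
    beta[:23] = 2.0 ** (j / 4 - 9 / 4)   # small and large signals; remaining entries are nulls
    z = W @ beta + rng.normal(scale=2.0, size=N)
    return W, z, beta
```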
Effective sample size Recall effective sample size $n_e$, a measure of the number of independent samples your Markov path is worth. If $n_e = n$ then your MCMC is giving essentially independent samples (like vanilla Monte Carlo). If $n_e \ll n$ then your MCMC has very high autocorrelations, and a very long path is needed to get a good approximation to the posterior.
Mixing as p increases Effective sample sizes are essentially independent of p, even when the design matrix is highly correlated
Mixing as N increases Effective sample sizes are essentially independent of N, even when the design matrix is highly correlated
Effective samples per second Recall effective sample size $n_e$, a measure of the number of independent samples your Markov path is worth. So if $t$ is computation time in seconds, effective samples per second $n_e / t$ is an empirical measurement of overall computational efficiency.
Results: Effective samples per second The approximate algorithm is fifty times more efficient when N = 2, 000 and p = 20, 000
Results: Accuracy The old algorithm often failed to identify components of $\beta$ with bimodal marginals. This conveys uncertainty about whether $\beta_j$ is a true signal, which is one of the nice features of taking a Bayesian approach to multiple testing.
Accuracy The old algorithm actually failed to converge in some cases, a result of numerical truncation rules used to control underflow
Scaling to a small GWAS dataset We consider a GWAS dataset with p = 98,385 covariates from N = 2,267 inbred maize lines in the USDA Ames, IA seed bank. The phenotype is growing degree days to silking, which is basically a measurement of how fast the maize bears fruit, controlling for temperature. Our goal is only to demonstrate that our algorithm can handle this application. GWAS data obtained from Professors Xiaolei Liu and Xiang Zhou.
Application to GWAS Ran 30,000 iterations in about 4 hours, despite the signals (apparently) not being very sparse ($s \approx 2000$ typically).
Inference: marginals for β j with largest posterior means Lasso solution indicated in red, horseshoe estimated posterior mean in blue
Conclusion Computational cost for MCMC shouldn't massively differ from alternatives designed for the same problem. But making the algorithm fast takes work, and the work is often problem-specific. More emphasis should be placed on computing posteriors that we know have nice properties.
References Carvalho, C., Polson, N. & Scott, J. (2010). The horseshoe estimator for sparse signals. Biometrika. Bhattacharya, A., Chakraborty, A. & Mallick, B. (2016). Fast sampling with Gaussian scale-mixture priors in high-dimensional regression. Biometrika; arXiv preprint arXiv:1506.04778. Johndrow, J. E., Orenstein, P. & Bhattacharya, A. (2018). Scalable MCMC for Bayes shrinkage priors. arXiv preprint arXiv:1705.00841. Johndrow, J. E. & Mattingly, J. C. (2017). Error bounds for approximations of Markov chains. arXiv preprint arXiv:1711.05382.
Thank You
Performance in $p \gg n$ settings
Data generation Replicated simulation study with the horseshoe prior (Carvalho et al. 2010), $n = 200$ and $p = 5{,}000$. True $\beta_0$ has 5 non-zero entries and $\sigma = 1.5$. Two signal strengths: (i) weak, $\beta_{0S} = \pm(0.75, 1, 1.25, 1.5, 1.75)^T$; (ii) moderate, $\beta_{0S} = \pm(1.5, 1.75, 2, 2.25, 2.5)^T$. Two types of design matrix: (i) independent, rows i.i.d. $N(0, I_p)$; (ii) compound symmetry, rows i.i.d. $N(0, \Sigma)$ with $\Sigma_{jj'} = 0.5 + 0.5\,\delta_{jj'}$. Summary over 100 datasets.
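A small design-generation sketch (ours), assuming the compound-symmetry rows have unit variance and pairwise correlation 0.5 as the $\Sigma_{jj'}$ formula suggests:

```python
import numpy as np

def make_design(n, p, compound=False, seed=0):
    """Rows i.i.d. N(0, I_p), or N(0, Sigma) with Sigma = 0.5 I + 0.5 J (compound symmetry)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    if compound:
        shared = rng.normal(scale=np.sqrt(0.5), size=(n, 1))   # common factor per row
        X = np.sqrt(0.5) * X + shared                          # Var = 1, Cov = 0.5
    return X
```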
Weak signal case [Figure: boxplots of $\|\hat\beta - \beta_0\|_1$, $\|\hat\beta - \beta_0\|_2$ and prediction error $\|X\hat\beta - X\beta_0\|_2$ across 100 simulation replicates for HS_me, HS_m, MCP and SCAD. HS_me and HS_m are the pointwise posterior median and mean under the horseshoe prior. Top row: independent covariates; bottom row: compound symmetry.]
Moderate signal case [Figure: boxplots of $\|\hat\beta - \beta_0\|_1$, $\|\hat\beta - \beta_0\|_2$ and prediction error $\|X\hat\beta - X\beta_0\|_2$ across 100 simulation replicates for HS_me, HS_m, MCP and SCAD. HS_me and HS_m are the pointwise posterior median and mean under the horseshoe prior. Top row: independent covariates; bottom row: compound symmetry.]
Frequentist coverage of 95% credible intervals Frequentist coverages (%) and 100 × lengths of pointwise 95% intervals. Average coverages and lengths are reported after averaging across all signal variables (rows 1 and 2) and noise variables (rows 3 and 4). Subscripts denote 100 × standard errors for coverages. LASSO and SS respectively stand for the methods in van de Geer et al. (2014) and Javanmard & Montanari (2014). The intervals for the horseshoe (HS) are the symmetric posterior credible intervals.
Variable selection by postprocessing $Q(\beta) = \frac{1}{2}\|X\hat\beta - X\beta\|_2^2 + \sum_{j=1}^{p} \mu_j |\beta_j|, \qquad \mu_j = |\hat\beta_j|^{-2}.$
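Holding the other coordinates at $\hat\beta$, the coordinate-wise minimizer of $Q$ is a soft-thresholding rule; a sketch (ours, with $\hat\beta$ the posterior mean from the horseshoe fit):

```python
import numpy as np

def savs_postprocess(X, beta_hat):
    """Coordinate-wise soft-thresholding with weights mu_j = |beta_hat_j|^{-2}."""
    col_norm2 = (X ** 2).sum(axis=0)             # ||X_j||^2
    mu = 1.0 / beta_hat ** 2                     # penalty weights; huge when beta_hat_j is small
    return np.sign(beta_hat) * np.maximum(np.abs(beta_hat) - mu / col_norm2, 0.0)
```

Small posterior means receive a very large penalty weight and are thresholded to exactly zero, while large ones are essentially untouched.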
Variable selection performance SAVS: variable selection by post-processing the posterior mean from the HS prior. Plot of Matthews correlation coefficient (MCC) over 1000 simulations for various methods. MCC values closer to 1 indicate better variable selection performance.