Scalable MCMC for the horseshoe prior

Scalable MCMC for the horseshoe prior. Anirban Bhattacharya, Department of Statistics, Texas A&M University. Joint work with James Johndrow and Paolo Orenstein. September 7, 2018, Cornell Day of Statistics 2018, Department of Statistical Science, Cornell University.

Introduction. Original motivation: develop scalable MCMC algorithms for large (N, p) regression with continuous shrinkage priors, for example the horseshoe prior of Carvalho et al. (2010). These are one-group alternatives to traditional two-group (spike-and-slab) priors (George & McCulloch, 1997; Scott & Berger, 2010; Rockova & George, 2014). They are interesting in their own right, with many attractive theoretical properties, but there is a lack of algorithms that scale to large (N, p) problems.

Approximations in MCMC. Our proposed algorithm introduces certain approximations at each MCMC step, yielding substantial computational advantages. How do we quantify the effect of such approximations? Perturbation theory for MCMC algorithms (Alquier et al. (2014), Rudolf & Schweizer (2018), Johndrow & Mattingly (2018), ...). We give a new general result plus bounds on the approximation error for our algorithm.

Bayesian shrinkage: motivation and background

Global-local shrinkage priors. Consider a Gaussian linear model $z = W\beta + \varepsilon$, $\varepsilon \sim N(0, \sigma^2 I_N)$, where $W$ is $N \times p$, with $N, p$ both possibly large. The basic form of the prior is $\beta_j \mid \sigma, \xi, \eta \overset{ind}{\sim} N(0, \sigma^2 \xi^{-1} \eta_j^{-1})$. The $\eta_j^{-1/2}$ are the local scales and $\xi^{-1/2}$ is the global scale. A popular choice for $\pi(\xi, \eta)$ is the horseshoe: $\eta_j^{-1/2} \overset{ind}{\sim} \mathrm{Cauchy}^+(0, 1)$, $\xi^{-1/2} \sim \mathrm{Cauchy}^+(0, 1)$.

Global-local shrinkage priors. The basic form of the prior is $\beta_j \mid \sigma, \xi, \eta \overset{ind}{\sim} N(0, \sigma^2 \xi^{-1} \eta_j^{-1})$. The $\eta_j^{-1/2}$ are the local scales and $\xi^{-1/2}$ is the global scale. A global scale alone gives ridge-type shrinkage; the local scales help adapt to sparsity. The global scale $\xi^{-1/2}$ controls how many $\beta_j$ are signals, and the $\eta_j^{-1/2}$ control their identities.
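
As a concrete illustration of this parameterization, here is a minimal sketch that simulates $\beta$ from the horseshoe prior itself (prior simulation only, not a posterior update; the variable names are mine):

    import numpy as np

    # Draw beta from the horseshoe prior in the (xi, eta) parameterization above:
    # eta_j^{-1/2} ~ Cauchy+(0,1), xi^{-1/2} ~ Cauchy+(0,1), beta_j ~ N(0, sigma^2 / (xi * eta_j)).
    rng = np.random.default_rng(0)
    p, sigma = 10_000, 1.0

    xi_inv_sqrt = np.abs(rng.standard_cauchy())      # global scale xi^{-1/2}
    eta_inv_sqrt = np.abs(rng.standard_cauchy(p))    # local scales eta_j^{-1/2}
    beta = sigma * xi_inv_sqrt * eta_inv_sqrt * rng.standard_normal(p)

    # Most draws are shrunk near zero while a few are very large (heavy tails).
    print(np.quantile(np.abs(beta), [0.5, 0.99]))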

Continuous shrinkage via one-group models. [Figure: comparison of prior densities (Laplace, Cauchy, Horseshoe, DL$_{1/2}$) near the origin and in the tails.]

Properties. Minimax-optimal posterior concentration (van der Pas et al. (2014); Ghosh & Chakraborty (2015); van der Pas et al. (2016a,b); Chakraborty, B., Mallick (2017), etc.) and frequentist coverage of credible sets (van der Pas et al. (2016b)). No result on MCMC convergence (e.g., geometric ergodicity) to the best of our knowledge; because of the polynomial tails?

Example. Simulation with N = 2,000 and p = 20,000: first 50 $\beta_j$ (the rest are zero). Posterior medians and 95 percent credible intervals, along with the truth.

Computational challenges

MCMC review. We need to compute posterior expectations, and would like access to the marginals for $\beta$. We won't get this from optimization (not a convex problem anyway); instead we do MCMC. Basic idea of MCMC: construct a Markov transition kernel $P$ whose invariant measure is the posterior, i.e. $\mu P = \mu$ where $\mu$ is the posterior measure. Then approximate $\mu\varphi \equiv \int \varphi(x)\,\mu(dx) \approx n^{-1} \sum_{k=0}^{n-1} \varphi(X_k)$ for $X_k \sim \nu P^k$.
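
To make the ergodic-average idea concrete, here is a toy sketch (my own example, not the horseshoe sampler): a random-walk Metropolis kernel targeting a standard normal, with the running average approximating $\mu\varphi$ for $\varphi(x) = x$.

    import numpy as np

    rng = np.random.default_rng(0)

    def log_target(x):
        return -0.5 * x**2            # log density of N(0, 1), up to a constant

    n, x = 50_000, 0.0
    samples = np.empty(n)
    for k in range(n):
        prop = x + 0.5 * rng.standard_normal()        # propose one step of the kernel P
        if np.log(rng.uniform()) < log_target(prop) - log_target(x):
            x = prop                                  # accept
        samples[k] = x

    print("ergodic average of phi(x) = x:", samples.mean())   # approximates mu(phi) = 0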

Computational cost. What is the computational cost? Two factors: 1. the cost of taking one step with $P$; 2. how long the Markov chain needs to be for the approximation to be good.

Computational cost per step. Perform various matrix operations: multiplication, solving linear systems, Cholesky factorizations, etc.

Length of path required. How long a Markov chain do we need to approximate the posterior well? Informally, the higher the autocorrelations, the longer the path we will need. Another performance metric: effective sample size, the equivalent number of independent samples (larger is better).

Computational cost. What is the computational cost? Two factors: 1. the cost of taking one step with $P$; 2. how long the Markov chain needs to be for the approximation to be good. For the horseshoe, both of these present challenges: linear algebra with large matrices, and high autocorrelation.

Algorithmic developments

Gibbs sampling for the horseshoe. The full horseshoe hierarchical structure is
$z \mid \beta, \sigma^2, \xi, \eta \sim N(W\beta, \sigma^2 I_N)$,
$\beta_j \mid \sigma^2, \xi, \eta \overset{ind}{\sim} N(0, \sigma^2 \xi^{-1} \eta_j^{-1})$,
$\xi^{-1/2} \sim \mathrm{Cauchy}^+(0, 1)$, $\eta_j^{-1/2} \overset{ind}{\sim} \mathrm{Cauchy}^+(0, 1)$,
$\sigma^2 \sim \mathrm{InvGamma}(a/2, b/2)$.
Typical computational approach: blocked Gibbs sampling (Polson et al. (2012)), cycling through
$\eta \mid \beta, \sigma^2, \xi, z$;
$\xi \mid \beta, \sigma^2, \eta, z$;
$(\beta, \sigma^2) \mid \eta, \xi, z$.

Mixing issues The algorithm is known to exhibit poor mixing for ξ (Polson et al. (2012))

Remedy: Johndrow, Orenstein, B. (2018+). Our approach: block the updates as
$(\beta, \sigma^2, \xi) \mid \eta, z$;
$\eta \mid (\beta, \sigma^2, \xi), z$.
The first step is done compositionally by sampling
$\xi \mid \eta, z$; then $\sigma^2 \mid \eta, \xi, z$; then $\beta \mid \eta, \sigma^2, \xi, z$.

Results: Autocorrelations for ξ

Geometric ergodicity. Theorem. The blocked sampler above is geometrically ergodic. Specifically, let $x = (\eta, \xi, \sigma^2, \beta)$ and $P$ the transition kernel. Also, let $\mu$ be the invariant measure, i.e., the posterior. We show:
1. Foster-Lyapunov condition. There exists a function $V : \mathcal{X} \to [0, \infty)$ and constants $0 < \gamma < 1$ and $K > 0$ such that $(PV)(x) \equiv \int V(y)\,P(x, dy) \le \gamma V(x) + K$.
2. Minorization. For every $R > 0$ there exists $\alpha \in (0, 1)$ (depending on $R$) such that $\sup_{x, y \in S(R)} \|P(x, \cdot) - P(y, \cdot)\|_{TV} \le 2(1 - \alpha)$, where $S(R) = \{x : V(x) < R\}$.

Geometric ergodicity. Theorem (continued). Let $x = (\eta, \xi, \sigma^2, \beta)$ and $P$ the transition kernel, and let $\mu$ be the invariant measure, i.e., the posterior. Together, (1) and (2) imply
$\sup_{|\varphi| < 1 + V} \left| \int \varphi(y)\,(P^n(x, \cdot) - \mu)(dy) \right| \le C\,\bar{\alpha}^n\, V(x)$.

Our exact algorithm. Blocking significantly improves mixing, and the sampler is provably geometrically ergodic. But what about the cost per step?

Cost per iteration. Let's focus on the update of $\beta$:
$\beta \mid \sigma^2, \xi, \eta, z \sim N\big( (W^T W + (\xi^{-1} D)^{-1})^{-1} W^T z,\ \sigma^2 (W^T W + (\xi^{-1} D)^{-1})^{-1} \big)$,
where $D = \mathrm{diag}(\eta_j^{-1})$. The usual Cholesky-based sampler (Rue, 2001) for $N(Q^{-1} b, Q^{-1})$ requires $O(p^3)$ computation for non-sparse $Q$: a highly prohibitive per-iteration complexity when $p \gg N$.
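
For reference, a minimal sketch of the standard Cholesky-based draw from $N(Q^{-1}b, Q^{-1})$ mentioned above (the function name is mine; in practice one would use dedicated triangular solvers):

    import numpy as np

    def sample_gaussian_canonical(Q, b, rng):
        # Draw from N(Q^{-1} b, Q^{-1}); O(p^3) for a dense p x p precision matrix Q.
        L = np.linalg.cholesky(Q)                               # Q = L L^T
        mean = np.linalg.solve(L.T, np.linalg.solve(L, b))      # Q^{-1} b via two solves
        z = rng.standard_normal(Q.shape[0])
        return mean + np.linalg.solve(L.T, z)                   # noise with covariance Q^{-1}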

(Partial) remedy. In B., Chakraborty, Mallick (2016), we propose an alternative exact sampler with $O(N^2 p)$ complexity:
(i) Sample $u \sim N(0, \xi^{-1} D)$ and $f \sim N(0, I_N)$ independently.
(ii) Set $v = W u + f$.
(iii) Solve $M_\xi v^* = (z/\sigma - v)$, where $M_\xi = I_N + \xi^{-1} W D W^T$.
(iv) Set $\beta = \sigma (u + \xi^{-1} D W^T v^*)$.
Step (iii) is the costliest, taking $\max\{O(N^2 p), O(N^3)\}$ operations. Significant savings when $p \gg N$.
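
A minimal NumPy transcription of steps (i)-(iv) above (a sketch under the earlier parameterization; here `eta` holds the local precisions $\eta_j$, and the function name is mine):

    import numpy as np

    def fast_horseshoe_beta(W, z, sigma, xi, eta, rng):
        # Exact draw of beta | sigma^2, xi, eta, z following steps (i)-(iv).
        N, p = W.shape
        d = 1.0 / (xi * eta)                          # diagonal of xi^{-1} D, i.e. xi^{-1} eta_j^{-1}
        u = rng.standard_normal(p) * np.sqrt(d)       # (i) u ~ N(0, xi^{-1} D)
        f = rng.standard_normal(N)                    #     f ~ N(0, I_N)
        v = W @ u + f                                 # (ii)
        M = np.eye(N) + (W * d) @ W.T                 # M_xi = I_N + xi^{-1} W D W^T
        v_star = np.linalg.solve(M, z / sigma - v)    # (iii) solve M_xi v* = z/sigma - v
        return sigma * (u + d * (W.T @ v_star))       # (iv)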

Cost per iteration. However, this is still $O(N^2 p)$ computation, and $N$ can be on the order of tens of thousands in GWAS studies. The remaining bottleneck is only in calculating $M_\xi = I_N + \xi^{-1} W D W^T$, which is needed by the updates for $\beta$, $\sigma^2$, and $\xi$.

Approximations in MCMC

Approximation. The horseshoe is designed to shrink most coordinates of $\beta$ toward zero, so most of the $\xi^{-1}\eta_j^{-1}$ will typically be tiny at any iteration. Idea: choose a small threshold $\delta$ and approximate $M_\xi$ by
$M_{\xi,\delta} = I_N + \xi^{-1} W_S D_S W_S^T$, where $S = \{j : \xi^{-1}\eta_j^{-1} > \delta\}$,
$W_S$ is the submatrix consisting of the columns in $S$, etc. Carefully replace all calculations involving $M_\xi$ with $M_{\xi,\delta}$. This reduces the cost per step to roughly $\max\{N s^2, N p\}$, where $s = |S|$. Note: this is different from setting some $\beta_j = 0$ at each scan.
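
A minimal sketch of the thresholding step, transcribed directly from the formula above (my naive version forms the $N \times N$ matrix explicitly for clarity; an implementation can instead work with the $s \times s$ matrix $W_S^T W_S$ via the Woodbury identity to reach the stated cost):

    import numpy as np

    def thresholded_M(W, xi, eta, delta=1e-4):
        # Active set S = {j : xi^{-1} eta_j^{-1} > delta}; delta = 1e-4 as suggested on a later slide.
        d = 1.0 / (xi * eta)                  # xi^{-1} eta_j^{-1}
        S = np.flatnonzero(d > delta)
        W_S = W[:, S]
        # M_{xi,delta} = I_N + xi^{-1} W_S D_S W_S^T
        M = np.eye(W.shape[0]) + (W_S * d[S]) @ W_S.T
        return M, S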

Perturbation bounds. A general strategy to reduce the cost per step: approximate $P$ by $P_\epsilon$. Question: is $\mu\varphi \approx n^{-1} \sum_{k=0}^{n-1} \varphi(X_k^\epsilon)$ for $X_k^\epsilon \sim \nu P_\epsilon^k$? Answer: yes, if (along with other verifiable conditions) $\sup_x \|P(x, \cdot) - P_\epsilon(x, \cdot)\|_{TV}$ is small, i.e., we have uniform total variation control between the exact and approximate chains.

Approximation error condition. We show that our approximation achieves
$\sup_x \|P(x, \cdot) - P_\epsilon(x, \cdot)\|_{TV}^2 \le \delta \|W\|^2 \big[ 4N(\|z\|^2 / b_0) + 9 \big] + O(\delta^2)$
for any fixed threshold $\delta \in (0, 1)$.

Approximation error condition. We show that this approximation achieves
$\sup_x \|P(x, \cdot) - P_\epsilon(x, \cdot)\|_{TV}^2 \le \delta \|W\|^2 \big[ 4N(\|z\|^2 / b_0) + 9 \big] + O(\delta^2)$
for any fixed threshold $\delta$. Practically, we set $\delta = 10^{-4}$ and have observed no advantage from smaller values. We don't have a result in the Wasserstein metric, but we can compute the pathwise Wasserstein-2 distance numerically.

Simulation studies. The results that follow use a common simulation structure:
$w_i \overset{iid}{\sim} N_p(0, \Sigma)$, $z_i \sim N(w_i^T \beta, 4)$,
$\beta_j = 2^{(j/4 - 9/4)}$ for $j < 24$ and $\beta_j = 0$ for $j > 23$.
So there are always small and large signals, and true nulls. We consider both $\Sigma = I$ (independent design) and $\Sigma_{ij} = 0.9^{|i-j|}$ (correlated design).
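
A sketch of how one might generate data from this design (my own code; the values of N and p are illustrative and smaller than in the talk's examples, and the beta formula follows my reading of the garbled display above). The AR(1) recursion across columns gives $\Sigma_{ij} = 0.9^{|i-j|}$ without forming $\Sigma$:

    import numpy as np

    rng = np.random.default_rng(3)
    N, p, rho = 200, 2_000, 0.9

    # Correlated design: each row of W is an AR(1) process, so Cov(W_ij, W_ik) = rho^|j-k|.
    W = np.empty((N, p))
    W[:, 0] = rng.standard_normal(N)
    for j in range(1, p):
        W[:, j] = rho * W[:, j - 1] + np.sqrt(1 - rho**2) * rng.standard_normal(N)

    # True coefficients (assumed form): beta_j = 2^{(j/4 - 9/4)} for j = 1, ..., 23, else 0.
    beta = np.zeros(p)
    j = np.arange(1, 24)
    beta[:23] = 2.0 ** (j / 4 - 9 / 4)

    z = W @ beta + 2.0 * rng.standard_normal(N)   # z_i ~ N(w_i' beta, 4), i.e. sd = 2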

Effective sample size. Recall the effective sample size $n_e$, a measure of the number of independent samples your Markov path is worth. If $n_e \approx n$ then your MCMC is giving essentially independent samples (like vanilla Monte Carlo). If $n_e \ll n$ then your MCMC has very high autocorrelations, and you need a very long path to get a good approximation to the posterior.
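
A simple sketch of one common ESS estimator, $n_e = n / (1 + 2\sum_{k \ge 1}\rho_k)$, truncating the autocorrelation sum at the first non-positive lag (my choice of convention; not necessarily the estimator used for the results below):

    import numpy as np

    def effective_sample_size(x):
        # n_e = n / (1 + 2 * sum of positive-lag autocorrelations), truncated early.
        x = np.asarray(x, dtype=float)
        n = x.size
        xc = x - x.mean()
        var = xc.var()                      # assumes the chain is not constant
        acf = np.correlate(xc, xc, mode="full")[n - 1:] / (np.arange(n, 0, -1) * var)
        total = 0.0
        for rho in acf[1:]:
            if rho <= 0:                    # stop at the first non-positive autocorrelation
                break
            total += rho
        return n / (1.0 + 2.0 * total)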

Mixing as p increases Effective sample sizes are essentially independent of p, even when the design matrix is highly correlated

Mixing as N increases Effective sample sizes are essentially independent of N, even when the design matrix is highly correlated

Effective samples per second. Recall the effective sample size $n_e$, a measure of the number of independent samples your Markov path is worth. So if $t$ is the computation time in seconds, effective samples per second $n_e / t$ is an empirical measure of overall computational efficiency.

Results: Effective samples per second. The approximate algorithm is fifty times more efficient when N = 2,000 and p = 20,000.

Results: Accuracy. The old algorithm often failed to identify components of $\beta$ with bimodal marginals. This bimodality conveys uncertainty about whether $\beta_j$ is a true signal, which is one of the nice features of taking a Bayesian approach to multiple testing.

Accuracy The old algorithm actually failed to converge in some cases, a result of numerical truncation rules used to control underflow

Scaling to a small GWAS dataset. We consider a GWAS dataset consisting of p = 98,385 markers from N = 2,267 inbred maize lines in the USDA Ames, IA seed bank. The phenotype is growing degree days to silking, which is basically a measurement of how fast the maize bears fruit, controlling for temperature. Our goal is only to demonstrate that our algorithm can handle this application. GWAS data obtained from Professors Xiaolei Liu and Xiang Zhou.

Application to GWAS. Ran 30,000 iterations in about 4 hours, despite the signals (apparently) not being very sparse ($s \approx 2000$ typically).

Inference: marginals for the $\beta_j$ with the largest posterior means. Lasso solution indicated in red, horseshoe estimated posterior mean in blue.

Conclusion. Computational cost for MCMC shouldn't massively differ from alternatives designed for the same problem. But making the algorithm fast takes work, and that work is often problem-specific. More effort should go toward computing posteriors that we know have nice properties.

References.
Carvalho, C., Polson, N. & Scott, J. (2010). The horseshoe estimator for sparse signals. Biometrika.
Bhattacharya, A., Chakraborty, A. & Mallick, B. (2016). Fast sampling with Gaussian scale-mixture priors in high-dimensional regression. Biometrika. arXiv preprint arXiv:1506.04778.
Johndrow, J. E., Orenstein, P. & Bhattacharya, A. (2018). Scalable MCMC for Bayes shrinkage priors. arXiv preprint arXiv:1705.00841.
Johndrow, J. E. & Mattingly, J. C. (2017). Error bounds for approximations of Markov chains. arXiv preprint arXiv:1711.05382.

Thank You

Performance in $p \gg n$ settings

Data generation. Replicated simulation study with the horseshoe prior (Carvalho et al. (2010)), $n = 200$ and $p = 5000$. The true $\beta_0$ has 5 non-zero entries and $\sigma = 1.5$. Two signal strengths: (i) weak, $\beta_{0S} = \pm(0.75, 1, 1.25, 1.5, 1.75)^T$; (ii) moderate, $\beta_{0S} = \pm(1.5, 1.75, 2, 2.25, 2.5)^T$. Two types of design matrix: (i) independent, $X_j \overset{iid}{\sim} N(0, I_p)$; (ii) compound symmetry, $X_j \overset{iid}{\sim} N(0, \Sigma)$ with $\Sigma_{jj'} = 0.5 + 0.5\,\delta_{jj'}$. Summary over 100 datasets.

Weak signal case. [Figure: boxplots of $\|\hat\beta - \beta_0\|_1$, $\|\hat\beta - \beta_0\|_2$ and prediction error $\|X\hat\beta - X\beta_0\|_2$ across 100 simulation replicates for HS_me, HS_m, MCP and SCAD. HS_me and HS_m are the pointwise posterior median and mean for the horseshoe prior. Top row: independent covariates; bottom row: compound symmetry.]

Moderate signal case. [Figure: boxplots of $\|\hat\beta - \beta_0\|_1$, $\|\hat\beta - \beta_0\|_2$ and prediction error $\|X\hat\beta - X\beta_0\|_2$ across 100 simulation replicates for HS_me, HS_m, MCP and SCAD. HS_me and HS_m are the pointwise posterior median and mean for the horseshoe prior. Top row: independent covariates; bottom row: compound symmetry.]

Frequentist coverage of 95% credible intervals. Frequentist coverages (%) and 100 × lengths of pointwise 95% intervals. Average coverages and lengths are reported after averaging across all signal variables (rows 1 and 2) and noise variables (rows 3 and 4). Subscripts denote 100 × standard errors for the coverages. LASSO and SS respectively stand for the methods in van de Geer et al. (2014) and Javanmard & Montanari (2014). The intervals for the horseshoe (HS) are the symmetric posterior credible intervals.

Variable selection by post-processing. Minimize $Q(\beta) = \tfrac{1}{2}\|X\hat\beta - X\beta\|_2^2 + \sum_{j=1}^{p} \mu_j |\beta_j|$, with $\mu_j = |\hat\beta_j|^{-2}$.
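
A minimal sketch of one way to carry out this post-processing: a single coordinate-wise soft-thresholding pass of the penalized objective, evaluated at $\hat\beta$ with $\mu_j = |\hat\beta_j|^{-2}$ (my own implementation choice and function name; I have not verified that it matches the exact rule behind the results on the next slide):

    import numpy as np

    def savs_select(beta_hat, X):
        # mu_j = |beta_hat_j|^{-2}: coefficients already near zero get heavier penalties.
        with np.errstate(divide="ignore"):
            mu = np.abs(beta_hat) ** (-2.0)
        col_norm2 = np.sum(X**2, axis=0)                      # ||X_j||^2 for each column
        # Coordinate-wise soft-thresholding evaluated at beta_hat:
        # beta_j <- sign(beta_hat_j) * max(|beta_hat_j| * ||X_j||^2 - mu_j, 0) / ||X_j||^2
        shrunk = np.maximum(np.abs(beta_hat) * col_norm2 - mu, 0.0)
        return np.sign(beta_hat) * shrunk / col_norm2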

Variable selection performance. SAVS: variable selection by post-processing the posterior mean from the HS prior. Plot of the Matthews correlation coefficient (MCC) over 1000 simulations for various methods. MCC values closer to 1 indicate better variable selection performance.