Bayesian Grouped Horseshoe Regression with Application to Additive Models

Zemei Xu¹·², Daniel F. Schmidt¹, Enes Makalic¹, Guoqi Qian², John L. Hopper¹

¹ Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health
² School of Mathematics and Statistics, The University of Melbourne
Introduction
Model estimation and selection
- The objective is to find the important explanatory factors for predicting the response variable
- There are potentially a large number of predictors, and only a few of them are associated with the response variable
- Select the best subset of predictors for fitting or predicting the response variable
- Estimate a sparse coefficient vector
Introduction
Consider the linear regression model:
$$y = X\beta + \varepsilon, \qquad (1)$$
where
- $y$ is an $n \times 1$ observation vector of the response variable
- $X$ is an $n \times p$ observation (design) matrix of the regressors or predictors
- $\beta = (\beta_1, \ldots, \beta_p)^T$ is a $p \times 1$ vector of regression coefficients to be estimated
- $\varepsilon$ is an $n \times 1$ vector of i.i.d. $N(0, \sigma^2)$ random errors with $\sigma^2$ unknown
Here, $\beta$ is assumed to be sparse.
Introduction
Penalised likelihood methods
The approach selects a model by minimising a loss function that is usually proportional to the negative log-likelihood plus a penalty term:
$$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \left\{ (y - X\beta)^T (y - X\beta) + \lambda\, q(\beta) \right\}, \qquad (2)$$
where $\lambda > 0$ is the tuning parameter and $q(\cdot)$ is a penalty function.
Well-known example: the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996)
$$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \left\{ (y - X\beta)^T (y - X\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \right\}. \qquad (3)$$
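For concreteness, a minimal sketch of fitting the lasso objective in (3), not part of the original slides. It assumes scikit-learn is available; scikit-learn scales the squared-error term by 1/(2n), so its `alpha` corresponds to $\lambda$ only up to that factor.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:2] = [2.0, -1.5]        # sparse truth: only two active predictors
y = X @ beta_true + rng.standard_normal(n)

# alpha plays the role of the tuning parameter lambda (up to the 1/(2n) scaling)
fit = Lasso(alpha=0.1).fit(X, y)
print(np.round(fit.coef_, 2))      # most coefficients are shrunk exactly to zero
```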
Introduction
Bayesian approaches
The motivation is that a good solution for $\beta$ in the linear model
$$y = X\beta + \varepsilon, \qquad (4)$$
can be interpreted as the posterior mode of $\beta$ in the Bayesian model when $\beta$ follows a certain prior distribution.
Two main sparse-estimation alternatives:
- Discrete mixtures: a point mass at 0 mixed with an absolutely continuous alternative
- Shrinkage priors: absolutely continuous shrinkage priors centred at 0 (example: the Bayesian lasso with a double-exponential prior (Park & Casella, 2008))
The horseshoe prior
Bayesian horseshoe model (Carvalho, Polson, & Scott, 2009)
- Shrinkage approach
- A one-component prior
The horseshoe prior:
$$\beta_i \mid \delta_i, \tau \sim N(0, \delta_i^2 \tau^2), \qquad \delta_i \sim C^+(0, 1), \qquad (5)$$
where the $\delta_i$ are the local shrinkage parameters, $\tau$ is the global shrinkage parameter, and $C^+(0, 1)$ is the standard half-Cauchy distribution with probability density function
$$f(x) = \frac{2}{\pi(1 + x^2)}, \qquad x > 0. \qquad (6)$$
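To see the shape shown in the figure that follows numerically, a small sketch (numpy assumed, $\tau = 1$; illustrative only) of drawing coefficients from the horseshoe prior in (5):

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 1.0

# delta_i ~ C+(0, 1): absolute value of a standard Cauchy draw
delta = np.abs(rng.standard_cauchy(100_000))
# beta_i | delta_i, tau ~ N(0, delta_i^2 * tau^2)
beta = rng.normal(0.0, delta * tau)

# Strong concentration near zero together with heavy, Cauchy-like tails
print(np.mean(np.abs(beta) < 0.1), np.quantile(np.abs(beta), 0.99))
```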
The horseshoe prior Flat, Cauchy-like tails Infinitely tall spike at the origin Figure: The horseshoe prior and two close cousins: Laplacian and Student-t. 7 of 22
Bayesian horseshoe model
Without loss of generality, we assume the response $y$ is centred and the covariates $X$ are column-centred and standardised.
The Bayesian hierarchical representation of the full model:
$$
\begin{aligned}
y \mid X, \beta, \sigma^2 &\sim N(X\beta, \sigma^2 I_n) \\
\beta \mid \sigma^2, \tau^2, \delta_1^2, \ldots, \delta_p^2 &\sim N(0, \sigma^2 \tau^2 D_\delta), \quad D_\delta = \mathrm{diag}(\delta_1^2, \ldots, \delta_p^2) \\
\delta_j &\sim C^+(0, 1), \quad j = 1, \ldots, p \\
\tau &\sim C^+(0, 1) \\
\sigma^2 &\sim \frac{1}{\sigma^2}\, d\sigma^2,
\end{aligned}
$$
where the scale parameters $\delta_j$ are local shrinkage parameters and $\tau$ is the global shrinkage parameter.
Group structures
Group structures naturally exist in predictor variables:
- A multi-level categorical predictor - a group of dummy variables
- A continuous predictor - a composition of basis functions
- Prior knowledge, such as genes in the same biological pathway - a natural group
Bayesian grouped horseshoe model
Suppose there are $G \in \{1, \ldots, p\}$ groups of predictors in the data and the $g$th group has size $s_g$, $g = 1, \ldots, G$ (i.e., there are $s_g$ variables in group $g$). The horseshoe hierarchical representation of the full model for grouped variables can be constructed as:
$$
\begin{aligned}
y \mid X, \beta, \sigma^2 &\sim N(X\beta, \sigma^2 I_n), \\
\beta \mid \sigma^2, \tau^2, \lambda_1^2, \ldots, \lambda_G^2 &\sim N(0, \sigma^2 \tau^2 D_\lambda), \quad D_\lambda = \mathrm{diag}(\lambda_1^2 I_{s_1}, \ldots, \lambda_G^2 I_{s_G}), \\
\lambda_g &\sim C^+(0, 1), \quad g = 1, \ldots, G, \\
\tau &\sim C^+(0, 1), \\
\sigma^2 &\sim \frac{1}{\sigma^2}\, d\sigma^2,
\end{aligned}
$$
where the $\lambda_g$ are the shrinkage parameters at the group level.
Hierarchical Bayesian grouped horseshoe model
Suppose the total number of groups is $G\,(> 1)$; the full hierarchical Bayesian grouped horseshoe model is:
$$
\begin{aligned}
y \mid X, \beta, \sigma^2 &\sim N(X\beta, \sigma^2 I_n) \\
\beta \mid \sigma^2, \tau^2, \lambda_1^2, \ldots, \lambda_G^2, \delta_1^2, \ldots, \delta_p^2 &\sim N(0, \sigma^2 \tau^2 D_\lambda D_\delta), \\
&\quad D_\lambda = \mathrm{diag}(\lambda_1^2 I_{s_1}, \ldots, \lambda_G^2 I_{s_G}), \quad D_\delta = \mathrm{diag}(\delta_1^2, \ldots, \delta_p^2) \\
\lambda_g &\sim C^+(0, 1), \quad g = 1, \ldots, G \\
\delta_j &\sim C^+(0, 1), \quad j = 1, \ldots, p \\
\tau &\sim C^+(0, 1) \\
\sigma^2 &\sim \frac{1}{\sigma^2}\, d\sigma^2,
\end{aligned}
$$
where $\delta_1, \ldots, \delta_p$ are the shrinkage parameters for the individual predictors and $\lambda_1, \ldots, \lambda_G$ are the shrinkage parameters for the groups.
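To make the grouped prior concrete, a minimal sketch (illustrative numbers, numpy assumed) of how the diagonal of the prior covariance $\sigma^2 \tau^2 D_\lambda D_\delta$ would be assembled from a vector of group labels:

```python
import numpy as np

# Hypothetical example: p = 5 predictors in G = 2 groups
groups = np.array([0, 0, 0, 1, 1])              # group label for each predictor
lam2 = np.array([0.8, 2.5])                     # group-level shrinkage lambda_g^2
delta2 = np.array([0.1, 1.2, 0.05, 3.0, 0.4])   # local shrinkage delta_j^2
tau2, sigma2 = 0.5, 1.0                         # global shrinkage and noise variance

# Each coefficient's prior variance is scaled by its own delta_j^2
# and by the lambda_g^2 of the group it belongs to
prior_var = sigma2 * tau2 * lam2[groups] * delta2
print(prior_var)
```

The group-level $\lambda_g$ shrinks all coefficients in group $g$ together, while the local $\delta_j$ still allows individual coefficients within a retained group to be shrunk.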
Sampling the Bayesian horseshoe model
Gibbs sampling
A simple sampler proposed for the Bayesian horseshoe hierarchy (Makalic & Schmidt, 2016b) enables straightforward sampling of the full conditional posterior distributions.
Key identity: if $x^2 \mid a \sim \mathrm{IG}(1/2, 1/a)$ and $a \sim \mathrm{IG}(1/2, 1/A^2)$, then $x \sim C^+(0, A)$.
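A compact sketch of one possible Gibbs sweep implied by this decomposition for the (ungrouped) horseshoe hierarchy. Helper names are hypothetical, numpy is assumed, and this is not the Bayesreg implementation referenced later; it only illustrates how the auxiliary inverse-gamma variables make every conditional standard.

```python
import numpy as np

rng = np.random.default_rng(0)

def inv_gamma(shape, rate, size=None):
    # If G ~ Gamma(shape, rate) then 1/G ~ Inverse-Gamma(shape, rate)
    return 1.0 / rng.gamma(shape, 1.0 / rate, size=size)

def horseshoe_gibbs(X, y, n_iter=2000):
    """One possible Gibbs sweep for the horseshoe hierarchy using the
    inverse-gamma decomposition of the half-Cauchy priors."""
    n, p = X.shape
    beta = np.zeros(p)
    sigma2 = tau2 = xi = 1.0
    del2 = np.ones(p)   # local shrinkage delta_j^2
    nu = np.ones(p)     # auxiliary variables for the local half-Cauchy priors
    XtX, Xty = X.T @ X, X.T @ y
    draws = np.empty((n_iter, p))
    for it in range(n_iter):
        # beta | rest ~ N(A^{-1} X'y, sigma2 * A^{-1}), with A = X'X + (tau2 * D_delta)^{-1}
        A = XtX + np.diag(1.0 / (tau2 * del2))
        L = np.linalg.cholesky(A)
        mean = np.linalg.solve(A, Xty)
        beta = mean + np.sqrt(sigma2) * np.linalg.solve(L.T, rng.standard_normal(p))
        # sigma2 | rest ~ IG((n+p)/2, ||y - X beta||^2 / 2 + beta' D_delta^{-1} beta / (2 tau2))
        resid = y - X @ beta
        sigma2 = inv_gamma(0.5 * (n + p),
                           0.5 * resid @ resid + 0.5 * np.sum(beta**2 / del2) / tau2)
        # delta_j^2 | rest ~ IG(1, 1/nu_j + beta_j^2 / (2 tau2 sigma2)); nu_j | rest ~ IG(1, 1 + 1/delta_j^2)
        del2 = inv_gamma(1.0, 1.0 / nu + beta**2 / (2.0 * tau2 * sigma2))
        nu = inv_gamma(1.0, 1.0 + 1.0 / del2)
        # tau2 | rest ~ IG((p+1)/2, 1/xi + sum_j beta_j^2 / delta_j^2 / (2 sigma2)); xi | rest ~ IG(1, 1 + 1/tau2)
        tau2 = inv_gamma(0.5 * (p + 1),
                         1.0 / xi + np.sum(beta**2 / del2) / (2.0 * sigma2))
        xi = inv_gamma(1.0, 1.0 + 1.0 / tau2)
        draws[it] = beta
    return draws

# Toy usage: a sparse signal with p = 10 predictors
X = rng.standard_normal((100, 10))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * rng.standard_normal(100)
print(np.round(horseshoe_gibbs(X, y, 1000)[500:].mean(axis=0), 2))
```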
Application to additive models
Additive models allow for nonlinear effects and grouped structures.
Given a data set $\{y_i, x_{i1}, \ldots, x_{ip}\}_{i=1}^{n}$, the additive model has the form:
$$y = \mu_0 + \sum_{j=1}^{p} f_j(X_j) + \varepsilon, \qquad (7)$$
where $\mu_0$ is an intercept term and the $f_j(\cdot)$ are unknown smooth functions. The estimates of the selected smooth functions are expected to be as close as possible to the corresponding true underlying (target) functions.
Application to additive models
Various classes of basis functions: polynomials, spline functions.
Let $g_j(x)$, $j = 1, \ldots, p$, be a set of basis functions. Each smooth function component in the additive model can be represented as:
$$f(x) = a_0 + a_1 g_1(x) + a_2 g_2(x) + \cdots + a_p g_p(x). \qquad (8)$$
A special case of orthogonal polynomials: the Legendre polynomials, defined on the interval $[-1, 1]$.
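As an illustration of the basis expansion (not taken from the slides; numpy is assumed), one way to build a grouped design matrix from Legendre polynomials, where each original predictor contributes one group of $K$ columns:

```python
import numpy as np
from numpy.polynomial import legendre

def legendre_design(X, K):
    """Expand each column of X (assumed scaled to [-1, 1]) into Legendre
    polynomials of degree 1..K, so predictor j yields one group of K columns."""
    n, p = X.shape
    blocks, groups = [], []
    for j in range(p):
        # legvander returns degrees 0..K; drop the constant column (degree 0)
        V = legendre.legvander(X[:, j], K)[:, 1:]
        blocks.append(V)
        groups.extend([j] * K)
    return np.hstack(blocks), np.array(groups)

# Example: 100 samples, 3 predictors, cubic expansion -> 9 grouped columns
X = np.random.default_rng(1).uniform(-1, 1, size=(100, 3))
Phi, groups = legendre_design(X, K=3)
print(Phi.shape, groups)   # (100, 9) [0 0 0 1 1 1 2 2 2]
```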
Simulation
Function 1 (simple linear function):
$$y = X_1 + X_2 - X_3 - X_4 \qquad (9)$$
Function 2 (nonlinear function):
$$y = \cos(8 X_1) + X_2^2 + \mathrm{sign}(X_3) + X_4 + X_5 + X_5^2 - X_5^3 \qquad (10)$$
Function 3:
$$y = f_1(X_1) + f_2(X_2) + f_3(X_3), \qquad (11)$$
where $f_j = \beta_{j1} P_1(X_j) + \beta_{j2} P_2(X_j) + \beta_{j3} P_3(X_j)$, $j = 1, 2, 3$, consists of Legendre polynomials of order up to three, and the unscaled true coefficients are
$$\beta = (2, 1, 1/2, 1, 1, 1, 1, 4, 1)^T \in \mathbb{R}^{9 \times 1}.$$
Simulation
For each of the three test functions:
- 100 data sets
- p = 10 predictors
- Maximum degree of the Legendre polynomial expansions: K ∈ {3, 6, 9, 12}
- Number of samples: n ∈ {100, 200}
- Signal-to-noise ratio: SNR ∈ {1, 5, 10}
- Methods: BHS, HBGHS, lasso-BIC and BHS-NE
- Comparison metric: the mean squared prediction error (MSPE)
$$\mathrm{MSPE} = \frac{1}{n} \sum_{i=1}^{n} \left[ E(y_i \mid x_i) - \hat{y}_i \right]^2 \qquad (12)$$
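A small sketch (hypothetical names, numpy assumed) of how the MSPE in (12) would be computed in such a simulation, where the true conditional mean is known from the simulation design and predictions come from the posterior mean of $\beta$:

```python
import numpy as np

def mspe(true_mean, y_hat):
    """Mean squared prediction error: average squared gap between the
    true conditional mean E(y_i | x_i) and the model's prediction."""
    true_mean, y_hat = np.asarray(true_mean), np.asarray(y_hat)
    return np.mean((true_mean - y_hat) ** 2)

# Hypothetical usage with the Gibbs sampler and Legendre design sketched earlier:
# draws = horseshoe_gibbs(Phi_train, y_train)
# beta_hat = draws[500:].mean(axis=0)           # posterior mean after burn-in
# print(mspe(f_true_test, Phi_test @ beta_hat))
```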
Simulation results
Test function 1 (simple linear function)
- BHS-NE consistently produces the smallest MSPE
- BHS and HBGHS are competitive when n = 100
- HBGHS improves significantly over BHS when n = 200
Test function 2 (nonlinear function)
- HBGHS wins in most scenarios
- BHS slightly outperforms HBGHS when SNR = 1
- BHS-NE performs poorly
Test function 3 (polynomial functions)
- HBGHS gives the smallest MSPE in all scenarios
- BHS is better than lasso-BIC
- BHS-NE is the worst in almost all scenarios
Simulation
Figure: Boxplots of component-wise squared prediction error for BHS and HBGHS with p = 10 predictors, n = 100 samples, SNR = 5, and Legendre polynomial expansions of degree K = 3.
Discussion
The Bayesian grouped horseshoe method and the hierarchical Bayesian grouped horseshoe method:
- Perform both group-wise and within-group selection
- Show good performance in terms of mean squared prediction error on simulated data
- Outperform the regular BHS when applied to nonlinear functions and additive models
- Are competitive with the regular BHS even when there is no underlying group structure
- Demonstrate promising performance in real data analysis
Package
The package implementing Bayesian regularised regression (Makalic & Schmidt, 2016a) can be downloaded from:
http://au.mathworks.com/matlabcentral/fileexchange/60335-bayesian-regularized-linear-and-logistic-regression
References
Alcalá, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., & Herrera, F. (2010). KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17(2-3), 255-287.
Carvalho, C. M., Polson, N. G., & Scott, J. G. (2009). Handling sparsity via the horseshoe. In JMLR Workshop and Conference Proceedings (Vol. 5, pp. 73-80).
Makalic, E., & Schmidt, D. F. (2016a). High-dimensional Bayesian regularised regression with the Bayesreg package. arXiv:1611.06649.
Makalic, E., & Schmidt, D. F. (2016b). A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters, 23(1), 179-182.
Park, T., & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482), 681-686.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267-288.
Thank you!