Bayesian Grouped Horseshoe Regression with Application to Additive Models


Bayesian Grouped Horseshoe Regression with Application to Additive Models
Zemei Xu¹,², Daniel F. Schmidt¹, Enes Makalic¹, Guoqi Qian², John L. Hopper¹
¹ Centre for Epidemiology and Biostatistics, Melbourne School of Population and Global Health
² School of Mathematics and Statistics, The University of Melbourne

Introduction: Model estimation and selection
- The objective is to find the important explanatory factors for predicting the response variable
- There are potentially a large number of predictors, and only a few of them are associated with the response variable
- Select the best subset of predictors for fitting or predicting the response variable
- Equivalently, estimate a sparse coefficient vector

Introduction
Consider the linear regression model
    y = Xβ + ε,    (1)
where
- y is an n × 1 observation vector of the response variable
- X is an n × p observation (design) matrix of the regressors or predictors
- β = (β_1, ..., β_p)^T is a p × 1 vector of regression coefficients to be estimated
- ε is an n × 1 vector of i.i.d. N(0, σ²) random errors, with σ² unknown
Here, β is assumed to be sparse.

Introduction: Penalised likelihood methods
The approach selects a model by minimising a loss function that is usually proportional to the negative log-likelihood plus a penalty term:
    β̂ = arg min_{β ∈ R^p} { (y − Xβ)^T (y − Xβ) + λ q(β) },    (2)
where λ > 0 is the tuning parameter and q(·) is a penalty function.
Well-known example: the least absolute shrinkage and selection operator (LASSO) (Tibshirani, 1996):
    β̂ = arg min_{β ∈ R^p} { (y − Xβ)^T (y − Xβ) + λ Σ_{j=1}^p |β_j| }.    (3)
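For illustration (a sketch, not part of the original slides), the LASSO objective in (3) can be fit with scikit-learn; note that scikit-learn's Lasso minimises (1/(2n))·||y − Xβ||² + α·Σ|β_j|, so its α plays the role of λ/(2n) in the notation above.

```python
# Minimal LASSO illustration (hedged sketch, not from the slides).
# scikit-learn's Lasso minimises (1/(2n))||y - Xb||^2 + alpha*||b||_1,
# so alpha corresponds to lambda/(2n) in equation (3).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]      # sparse truth: only 3 nonzero coefficients
y = X @ beta_true + rng.normal(scale=0.5, size=n)

fit = Lasso(alpha=0.1).fit(X, y)
print(np.round(fit.coef_, 2))         # most coefficients are shrunk exactly to zero
```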

Introduction: Bayesian approaches
The motivation is that a good solution for β in the linear model
    y = Xβ + ε,    (4)
can be interpreted as the posterior mode of β in the Bayesian model when β follows a certain prior distribution.
Two main sparse-estimation alternatives:
- Discrete mixtures: a point mass at 0 mixed with an absolutely continuous alternative
- Shrinkage priors: absolutely continuous shrinkage priors centered at 0 (example: the Bayesian Lasso with a double-exponential prior (Park & Casella, 2008))

The horseshoe prior
Bayesian horseshoe model (Carvalho, Polson, & Scott, 2009)
- A shrinkage approach
- A one-component prior
The horseshoe prior:
    β_i | δ_i, τ ~ N(0, δ_i² τ²),    δ_i ~ C⁺(0, 1),    (5)
where the δ_i are the local shrinkage parameters, τ is the global shrinkage parameter, and C⁺(0, 1) is the standard half-Cauchy distribution with probability density function
    f(x) = 2 / (π(1 + x²)),    x > 0.    (6)
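A small sketch (an illustration, not from the slides; the global scale is fixed at τ = 1 for simplicity) of drawing coefficients from the horseshoe prior in (5) by first sampling the local scales from a standard half-Cauchy; the spike at the origin and the heavy tails show up directly in the draws.

```python
# Sketch: draw coefficients from the horseshoe prior (5) with tau fixed at 1.
import numpy as np

rng = np.random.default_rng(1)
tau = 1.0
delta = np.abs(rng.standard_cauchy(100_000))   # delta_i ~ C+(0, 1)
beta = rng.normal(0.0, delta * tau)            # beta_i | delta_i, tau ~ N(0, delta_i^2 tau^2)

# Tall spike at the origin and heavy, Cauchy-like tails:
print("P(|beta| < 0.01) =", np.mean(np.abs(beta) < 0.01))
print("P(|beta| > 10)   =", np.mean(np.abs(beta) > 10))
```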

The horseshoe prior Flat, Cauchy-like tails Infinitely tall spike at the origin Figure: The horseshoe prior and two close cousins: Laplacian and Student-t. 7 of 22

Bayesian horseshoe model
Without loss of generality, we assume the response y is centered and the covariates X are column centered and standardised.
The Bayesian hierarchical representation of the full model:
    y | X, β, σ² ~ N(Xβ, σ² I_n)
    β | σ², τ², δ_1², ..., δ_p² ~ N(0, σ² τ² D_δ),    where D_δ = diag(δ_1², ..., δ_p²)
    δ_j ~ C⁺(0, 1),    j = 1, ..., p
    τ ~ C⁺(0, 1)
    σ² ~ (1/σ²) dσ²
where the scale parameters δ_j are local shrinkage parameters and τ is the global shrinkage parameter.

Group structures
Group structures naturally exist in predictor variables:
- A multi-level categorical predictor: a group of dummy variables
- A continuous predictor: a composition of basis functions
- Prior knowledge, such as genes in the same biological pathway: a natural group
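As a small illustration of the first case (a sketch, not from the slides; the variable name is made up), a three-level categorical predictor expands into a group of dummy columns that would naturally share a single group-level shrinkage parameter.

```python
# Sketch: a multi-level categorical predictor becomes a group of dummy variables.
import pandas as pd

colour = pd.Series(["red", "green", "blue", "green", "red"], name="colour")
dummies = pd.get_dummies(colour, prefix="colour", drop_first=True)
print(dummies)   # the resulting dummy columns form one natural group for group-wise shrinkage
```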

Bayesian grouped horseshoe model
Suppose there are G ∈ {1, ..., p} groups of predictors in the data and the gth group has size s_g, g = 1, ..., G (i.e. there are s_g variables in group g). The horseshoe hierarchical representation of the full model for grouped variables can be constructed as:
    y | X, β, σ² ~ N(Xβ, σ² I_n)
    β | σ², τ², λ_1², ..., λ_G² ~ N(0, σ² τ² D_λ),    where D_λ = diag(λ_1² I_{s_1}, ..., λ_G² I_{s_G})
    λ_g ~ C⁺(0, 1),    g = 1, ..., G
    τ ~ C⁺(0, 1)
    σ² ~ (1/σ²) dσ²
where the λ_g are the shrinkage parameters at the group level.

Hierarchical Bayesian grouped horseshoe model
Suppose the total number of groups is G (> 1). The full hierarchical Bayesian grouped horseshoe model is:
    y | X, β, σ² ~ N(Xβ, σ² I_n)
    β | σ², τ², λ_1², ..., λ_G², δ_1², ..., δ_p² ~ N(0, σ² τ² D_λ D_δ),
        where D_λ = diag(λ_1² I_{s_1}, ..., λ_G² I_{s_G}) and D_δ = diag(δ_1², ..., δ_p²)
    λ_g ~ C⁺(0, 1),    g = 1, ..., G
    δ_j ~ C⁺(0, 1),    j = 1, ..., p
    τ ~ C⁺(0, 1)
    σ² ~ (1/σ²) dσ²
where δ_1, ..., δ_p are the shrinkage parameters for the individual predictor variables and λ_1, ..., λ_G are the shrinkage parameters for the groups.
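A minimal sketch (assuming only the diagonal structure stated above; this is not the authors' code) of assembling the prior scale matrix σ²τ²D_λD_δ for given group sizes and drawing β from its conditional prior.

```python
# Sketch: assemble the diagonal prior covariance sigma^2 tau^2 D_lambda D_delta
# for the hierarchical grouped horseshoe, given group sizes s_g.
import numpy as np

rng = np.random.default_rng(2)
group_sizes = [3, 2, 4]                              # s_1, ..., s_G, with p = sum(s_g)
p = sum(group_sizes)

lam = np.abs(rng.standard_cauchy(len(group_sizes)))  # lambda_g ~ C+(0, 1), one per group
delta = np.abs(rng.standard_cauchy(p))               # delta_j ~ C+(0, 1), one per predictor
tau, sigma2 = np.abs(rng.standard_cauchy()), 1.0     # tau ~ C+(0, 1); sigma^2 fixed for the sketch

lam_expanded = np.repeat(lam, group_sizes)           # expand lambda_g to every predictor in group g
prior_var = sigma2 * tau**2 * lam_expanded**2 * delta**2   # diagonal of sigma^2 tau^2 D_lambda D_delta
beta = rng.normal(0.0, np.sqrt(prior_var))           # one conditional prior draw of beta
print(np.round(prior_var, 3))
```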

Sampling the Bayesian horseshoe model
Gibbs sampling: a simple sampler proposed for the Bayesian horseshoe hierarchy (Makalic & Schmidt, 2016b) enables straightforward sampling from the full conditional posterior distributions. It relies on the scale-mixture identity:
    if x² | a ~ IG(1/2, 1/a) and a ~ IG(1/2, 1/A²), then x ~ C⁺(0, A).
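A quick numerical check of this scale-mixture identity (a sketch, not code from the bayesreg package), using scipy's shape/scale parameterisation of the inverse-gamma distribution. This decomposition is what turns the half-Cauchy priors into conjugate inverse-gamma updates inside the Gibbs sampler.

```python
# Sketch: verify that x ~ C+(0, A) can be generated via the scale mixture
#   a ~ IG(1/2, 1/A^2),  x^2 | a ~ IG(1/2, 1/a).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
A, m = 2.0, 200_000

a = stats.invgamma.rvs(a=0.5, scale=1.0 / A**2, size=m, random_state=rng)
x = np.sqrt(stats.invgamma.rvs(a=0.5, scale=1.0 / a, random_state=rng))

direct = np.abs(A * rng.standard_cauchy(m))   # direct C+(0, A) draws for comparison
# Compare a few quantiles of the two samples; they should be close.
for q in (0.25, 0.5, 0.75):
    print(q, round(np.quantile(x, q), 3), round(np.quantile(direct, q), 3))
```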

Application to additive models
Additive models allow for nonlinear effects and grouped structures. Given a data set {y_i, x_{i1}, ..., x_{ip}}, i = 1, ..., n, the additive model has the form
    y = μ_0 + Σ_{j=1}^p f_j(X_j) + ε,    (7)
where μ_0 is an intercept term and the f_j(·) are unknown smooth functions. The estimates of the selected smooth functions should be as close as possible to the corresponding true underlying (target) functions.

Application to additive models
Various classes of basis functions can be used: polynomials, spline functions.
Let g_j(x), j = 1, ..., p, be a set of basis functions. Each smooth function component in the additive model can be represented as
    f(x) = a_0 + a_1 g_1(x) + a_2 g_2(x) + ... + a_p g_p(x).    (8)
A special case of orthogonal polynomials is used here: the Legendre polynomials, which are defined on the interval [−1, 1].
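A sketch (assuming the predictors are first rescaled to [−1, 1], which the Legendre basis requires; the grouping convention is an illustration, not taken from the slides) of expanding each predictor into a group of Legendre basis columns with numpy.

```python
# Sketch: expand each predictor into K Legendre basis columns, giving p groups
# of size K for the grouped horseshoe.  Assumes predictors lie in [-1, 1].
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(4)
n, p, K = 100, 10, 3
X = rng.uniform(-1.0, 1.0, size=(n, p))

# legvander(x, K) returns columns [P_0(x), P_1(x), ..., P_K(x)]; drop the constant P_0.
groups = [legendre.legvander(X[:, j], K)[:, 1:] for j in range(p)]
X_expanded = np.hstack(groups)       # n x (p*K) design matrix
group_sizes = [K] * p                # each original predictor contributes one group
print(X_expanded.shape, group_sizes[:3])
```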

Simulation
Function 1 (simple linear function):
    y = X_1 + X_2 − X_3 − X_4    (9)
Function 2 (nonlinear function):
    y = cos(8 X_1) + X_2² + sign(X_3) + X_4 + X_5 + X_5² − X_5³    (10)
Function 3 (polynomial functions):
    y = f_1(X_1) + f_2(X_2) + f_3(X_3),    (11)
where f_j = β_{j1} P_1(X_j) + β_{j2} P_2(X_j) + β_{j3} P_3(X_j), j = 1, 2, 3, consists of Legendre polynomials of order up to three, and the unscaled true coefficients are β = (2, 1, 1/2, 1, 1, 1, 1, 4, 1)^T, a 9 × 1 vector.

Simulation
- For each of the three test functions: 100 data sets
- p = 10 predictors
- Maximum degree of the Legendre polynomial expansions: K ∈ {3, 6, 9, 12}
- Number of samples: n ∈ {100, 200}
- Signal-to-noise ratio: SNR ∈ {1, 5, 10}
- Methods: BHS, HBGHS, lasso-BIC and BHS-NE
- Comparison metric: the mean squared prediction error (MSPE)
    MSPE = (1/n) Σ_{i=1}^n [E(y_i | x_i) − ŷ_i]²    (12)
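For clarity, a sketch of how the noise level can be calibrated from a target SNR and how the MSPE in (12) is computed; the definition SNR = Var(signal)/σ² is an assumption here, since the slides do not spell it out.

```python
# Sketch: set the noise variance from a target SNR and compute the MSPE of (12).
# Assumes SNR = Var(signal) / sigma^2 (not stated explicitly in the slides).
import numpy as np

rng = np.random.default_rng(5)
n = 200
X1, X2 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
signal = X1 + X2                          # E(y | x): a simple additive signal
snr = 5.0
sigma = np.sqrt(np.var(signal) / snr)     # noise level implied by the target SNR
y = signal + rng.normal(scale=sigma, size=n)

y_hat = np.full(n, y.mean())              # a trivial "fit", just to demonstrate the metric
mspe = np.mean((signal - y_hat) ** 2)     # (1/n) * sum_i [E(y_i | x_i) - yhat_i]^2
print(round(mspe, 3))
```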

Simulation results
Test function 1 (simple linear function):
- BHS-NE consistently produces the smallest MSPE
- BHS and HBGHS are competitive when n = 100
- HBGHS improves significantly over BHS when n = 200
Test function 2 (nonlinear function):
- HBGHS wins in most scenarios
- BHS slightly outperforms HBGHS when SNR = 1
- BHS-NE performs poorly
Test function 3 (polynomial functions):
- HBGHS gives the smallest MSPE in all scenarios
- BHS is better than lasso-BIC
- BHS-NE is the worst in almost all scenarios

Simulation 0.04 BHS 0.04 HBHSG Component wise squared prediction error 0.035 0.03 0.025 0.02 0.015 0.01 0.005 Component wise squared prediction error 0.035 0.03 0.025 0.02 0.015 0.01 0.005 0 1 2 3 4 5 6 7 8 9 10 X 0 1 2 3 4 5 6 7 8 9 10 X Figure: Boxplots of component-wise prediction error for BHS and HBGHS when there are p = 10 predictors, n = 100 samples, SNR = 5, K = 3 degree of Legendre polynomial expansions. 18 of 22

Discussion
The Bayesian grouped horseshoe method and the hierarchical Bayesian grouped horseshoe method:
- Perform both group-wise and within-group selection
- Show good performance in terms of mean squared prediction error on simulated data
- Outperform the regular BHS when applied to nonlinear functions and additive models
- Are competitive with the regular BHS even when there is no underlying group structure
- Demonstrate promising performance in real data analysis

Package
The package implementing Bayesian regularised regression (Makalic & Schmidt, 2016a) can be downloaded from
http://au.mathworks.com/matlabcentral/fileexchange/60335-bayesian-regularized-linear-and-logistic-regression

References
Alcalá, J., Fernández, A., Luengo, J., Derrac, J., García, S., Sánchez, L., & Herrera, F. (2010). KEEL data-mining software tool: Data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic and Soft Computing, 17(2-3), 255-287.
Carvalho, C. M., Polson, N. G., & Scott, J. G. (2009). Handling sparsity via the horseshoe. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (JMLR W&CP, Vol. 5, pp. 73-80).
Makalic, E., & Schmidt, D. F. (2016a). High-dimensional Bayesian regularised regression with the Bayesreg package. arXiv:1611.06649.
Makalic, E., & Schmidt, D. F. (2016b). A simple sampler for the horseshoe estimator. IEEE Signal Processing Letters, 23(1), 179-182.
Park, T., & Casella, G. (2008). The Bayesian lasso. Journal of the American Statistical Association, 103(482), 681-686.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1), 267-288.

Thank you!