Estimating Sparse High Dimensional Linear Models using Global-Local Shrinkage


Daniel F. Schmidt
Centre for Biostatistics and Epidemiology, The University of Melbourne / Monash University
May 11, 2017

Outline
1. Bayesian Global-Local Shrinkage Hierarchies: Linear Models; Bayesian Estimation of Linear Models; Global-Local Shrinkage Hierarchies
2. A Bayesian regression toolbox

Problem Description
Consider the linear model
y = Xβ + β_0 1_n + ε,
where y ∈ R^n is a vector of targets; X ∈ R^{n×p} is a matrix of features; β ∈ R^p is a vector of regression coefficients; β_0 ∈ R is the intercept; and ε ∈ R^n is a vector of random disturbances. Let β* denote the true values of the coefficients.
Task: we observe y and X and must estimate β. We do not require that p < n.
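
As a running illustration (mine, not from the slides), the R sketch below simulates data from a sparse linear model; the dimensions n, p and the number s of non-zero coefficients are arbitrary choices, and later sketches reuse X and y.

    # Simulate y = X beta* + beta_0 + eps with only s of the p coefficients non-zero
    set.seed(1)
    n <- 100; p <- 500; s <- 5                        # p > n is allowed
    X <- matrix(rnorm(n * p), n, p)
    beta.star <- c(rnorm(s, sd = 2), rep(0, p - s))   # sparse truth: ||beta*||_0 = s
    beta0 <- 1; sigma <- 1
    y <- drop(X %*% beta.star + beta0 + rnorm(n, sd = sigma))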

Sparse Linear Models (1)
The value β*_j = 0 is special: it means that feature j is not associated with the targets. Define the index of sparsity by ||β*||_0, where ||x||_0 is the ℓ_0 (counting) norm. A linear model is sparse if ||β*||_0 ≪ p. Sparsity is useful when p ≫ n, as it enables us to estimate fewer entries of β. If we are trying to find which β*_j ≠ 0, conditions on ||β*||_0 are required.

Sparse Linear Models (2)
Why is sparsity useful? Loosely, an estimator sequence θ̂_n is asymptotically efficient if
lim sup_{n→∞} n E[(θ̂_n - θ*)^2] = 1/J(θ*),
where J(θ*) is the Fisher information. Estimators exist for which this bound can be beaten, but only on a set of measure zero (Hodges '51, Le Cam '53). Sparse models have a special set of measure zero: the set {β*_j = 0} has measure zero, but is extremely important. Good sparse estimators achieve superefficiency at β*_j = 0.

Maximum Likelihood Estimation of β
Assume we have a probabilistic model for the disturbances ε_i. Then a standard way of estimating β is maximum likelihood:
{β̂, β̂_0} = arg max_{β, β_0} p(y | β, β_0, X).
If ε_i ~ N(0, σ^2), then β̂ is the least-squares estimator. This has several drawbacks: it requires p < n for uniqueness, it has potentially high variance, and it cannot produce sparse estimates. The traditional fix to maximum likelihood is to remove some covariates, which exploits sparsity.

Penalized Regression (1)

Method                                   Type         Comments
Ridge                                    Convex       (+) Computationally efficient
                                                      (-) Suffers from potentially high estimation bias
Lasso / Elastic net                      Convex       (+) Convex optimisation problem
                                                      (+) Can produce sparse estimates
                                                      (-) Suffers from potentially high estimation bias
                                                      (-) Can have model selection consistency problems
Non-convex shrinkers (SCAD, MCP, etc.)   Non-convex   (+) Reduced estimation bias
                                                      (+) Improved model selection consistency
                                                      (+) Can produce sparse estimates
                                                      (-) Non-convex optimisation: difficult, multi-modal
Subset selection                         Non-convex   (+) Model selection consistency
                                                      (-) Computationally intractable
                                                      (-) Highly statistically unstable

Penalized Regression (2)
All of these methods require an additional model selection step: cross-validation, information criteria, or asymptotically optimal choices. Quantifying statistical uncertainty is problematic: uncertainty in λ is difficult to incorporate, standard errors of β are difficult to obtain for sparse methods, and the bootstrap requires special modifications. Bayesian inference provides natural solutions to these problems.
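
For concreteness, a short sketch of that extra model-selection step using the glmnet package (lasso path plus cross-validation over λ), applied to the simulated data above; it illustrates the workflow only and does not resolve the uncertainty issues listed.

    # Lasso path + cross-validation to choose the penalty lambda
    library(glmnet)
    cv.fit     <- cv.glmnet(X, y, alpha = 1)          # alpha = 1 gives the lasso
    beta.lasso <- coef(cv.fit, s = "lambda.min")      # coefficients at the CV-chosen lambda
    sum(as.vector(beta.lasso) != 0)                   # number of non-zero terms (incl. intercept)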

Bayesian Linear Regression (1)
Assuming normal disturbances, the Bayesian regression hierarchy is
y | β, β_0 ~ N(Xβ + β_0 1_n, σ^2 I_n),
β_0 ~ dβ_0,
β | σ^2 ~ π(β | σ^2) dβ,
where π(β | σ^2) is a prior distribution over β and σ^2 is the noise variance. Inferences about β are formed using the posterior distribution
π(β, β_0 | y) ∝ p(y | β, β_0, σ^2) π(β | σ^2).
Inference is usually performed by MCMC sampling.

Bayesian Linear Regression (2): spike-and-slab variable selection
y | β, β_0 ~ N(Xβ + β_0 1_n, σ^2 I_n),
β_0 ~ dβ_0,
β_j | I_j, σ^2 ~ [ I_j π(β_j | σ^2) + (1 - I_j) δ_0(β_j) ] dβ_j,
I_j ~ Be(α),
α ~ π(α) dα,
where the I_j ∈ {0, 1} are indicators and Be(·) is a Bernoulli distribution; δ_z(x) denotes a Dirac point-mass at x = z; and α ∈ (0, 1) is the a priori inclusion probability. This is considered the gold standard, but it is computationally intractable as it involves exploring 2^p models.

Bayesian Linear Regression (3): variable selection with continuous shrinkage priors
Treat the prior distribution for β as a Bayesian penalty. Taking β_j | σ^2, λ ~ N(0, λ^2 σ^2) leads to Bayesian ridge regression; taking β_j | σ^2, λ ~ La(0, λ/σ), where La(a, b) is a Laplace distribution with location a and scale b, leads to the Bayesian lasso. More generally...

Global-Local Shrinkage Hierarchies (1)
The global-local shrinkage hierarchy generalises many popular Bayesian regression priors:
y | β, β_0, σ^2 ~ N(Xβ + β_0 1_n, σ^2 I_n),
β_0 ~ dβ_0,
β_j | λ_j^2, τ^2, σ^2 ~ N(0, λ_j^2 τ^2 σ^2),
λ_j ~ π(λ_j) dλ_j,
τ ~ π(τ) dτ.
This models the priors for the β_j as scale-mixtures of normals; the choice of π(λ_j) and π(τ) controls the behaviour.

Global-Local Shrinkage Hierarchies (2)
(Same hierarchy as above.) The local shrinkers λ_j control the selection of important variables: they play the role of the indicators I_j in the spike-and-slab.

Global-Local Shrinkage Hierarchies (3)
(Same hierarchy as above.) The global shrinker τ controls for multiplicity: it plays the role of the inclusion probability α in the spike-and-slab.
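
The following R sketch (my illustration, not from the slides) draws coefficients from this hierarchy using half-Cauchy mixing densities for both λ_j and τ, as in the horseshoe prior introduced below; substituting other choices of π(λ_j) gives other estimators.

    # Draw beta_1..beta_p from the global-local prior beta_j ~ N(0, lambda_j^2 tau^2 sigma^2)
    p <- 1000; sigma <- 1
    tau    <- abs(rcauchy(1))     # global shrinker, tau ~ C+(0,1)
    lambda <- abs(rcauchy(p))     # local shrinkers, lambda_j ~ C+(0,1)
    beta   <- rnorm(p, mean = 0, sd = lambda * tau * sigma)
    # Most draws sit near zero with a few very large ones: a sparsity-promoting prior
    quantile(abs(beta), c(0.5, 0.9, 0.99))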

Local Shrinkage Priors (1)
What makes a good prior for the local variance components? Denote the marginal prior of β_j by
π(β_j | τ, σ) = ∫_0^∞ (2π λ_j^2 τ^2 σ^2)^{-1/2} exp( -β_j^2 / (2 λ_j^2 τ^2 σ^2) ) π(λ_j) dλ_j.
Carvalho, Polson and Scott (2010) proposed two desirable properties of π(β_j | τ, σ).

Local Shrinkage Priors (2)
Two desirable properties:
1. The prior should concentrate sufficient mass near β_j = 0, so that lim_{β_j → 0} π(β_j | τ, σ) = ∞, to guarantee a fast rate of posterior contraction when β*_j = 0.
2. The prior should have sufficiently heavy tails, so that E[β_j | y] = β̂_j + o(1) as |β̂_j| → ∞, to guarantee asymptotic (in effect size) unbiasedness.

Local Shrinkage Priors (3)
Classic shrinkage priors do not satisfy either property.
The Bayesian ridge takes λ_j ~ δ_1(λ_j) dλ_j, leading to β_j | τ, σ^2 ~ N(0, τ^2 σ^2), which expects the β_j to be of the same squared magnitude: (1) it does not model sparsity; (2) it has large bias if β* is a mix of weak and strong signals.
The Bayesian lasso takes λ_j ~ Exp(1), leading to β_j | τ, σ ~ La(0, 2^{3/2} στ), which expects the β_j to be of the same absolute magnitude: (1) it is super-efficient at β*_j = 0, but contraction is not fast enough; (2) it has large bias if β* is sparse with a few strong signals.

The horseshoe prior (1)
The horseshoe prior satisfies both properties. It takes λ_j ~ C^+(0, 1), where C^+(0, A) is a half-Cauchy distribution with scale A. The marginal prior does not admit a closed form, but it satisfies the bounds
(K/2) log(1 + 4/b^2) < π(β_j | τ, σ) < K log(1 + 2/b^2),
where b = β_j/(τσ) and K = (2π^3)^{-1/2}. The prior has a pole at β_j = 0 and polynomial tails in β_j.

The horseshoe prior (2)
[Figure omitted] Flat, Cauchy-like tails and an infinitely tall spike at the origin: the horseshoe prior and two close cousins, the Laplace and Student-t.
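
The figure itself is not reproduced here, but its message can be recomputed; the sketch below (my own, with τ = σ = 1) obtains the horseshoe marginal π(β_j) by numerical integration over λ_j and compares it with Laplace and Student-t densities near the origin and in the tails.

    # Horseshoe marginal density at beta (tau = sigma = 1) by numerical integration over lambda
    hs.density <- function(beta) {
      sapply(beta, function(b)
        integrate(function(l) dnorm(b, sd = l) * 2 / (pi * (1 + l^2)),  # N(0, l^2) x half-Cauchy(l)
                  lower = 0, upper = Inf)$value)
    }
    b <- c(0.1, 1, 5, 10)
    cbind(beta      = b,
          horseshoe = hs.density(b),
          laplace   = 0.5 * exp(-abs(b)),   # Laplace(0, 1)
          student.t = dt(b, df = 3))        # Student-t with 3 degrees of freedom
    # The horseshoe density blows up as beta -> 0 and decays only polynomially in the tails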

Higher order horseshoe priors
More generally, we can model λ_j as a product of k half-Cauchy variables. The HS^k prior (our notation) is λ_j ~ C_1 · C_2 · ... · C_k, where C_i ~ C^+(0, 1), i = 1, ..., k. This generalises several existing priors: HS^0 is ridge regression; HS^1 is the usual horseshoe; HS^2 is the horseshoe+ prior (Bhadra et al., 2015). Tail weight and mass at β_j = 0 increase as k grows, modelling β as increasingly sparse.
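
A short extension of the earlier prior-simulation sketch (again my illustration): the HS^k local scale is a product of k independent half-Cauchy draws, and increasing k visibly fattens both the spike at zero and the tails.

    # lambda_j for the HS^k prior: product of k half-Cauchy(0,1) variables
    r.hsk.lambda <- function(p, k) {
      if (k == 0) rep(1, p)   # HS^0: ridge, lambda_j = 1
      else apply(matrix(abs(rcauchy(p * k)), p, k), 1, prod)
    }
    beta.hs1 <- rnorm(1e5, sd = r.hsk.lambda(1e5, 1))   # horseshoe
    beta.hs2 <- rnorm(1e5, sd = r.hsk.lambda(1e5, 2))   # horseshoe+
    c(hs1 = mean(abs(beta.hs1) < 0.01),                 # more prior mass near zero for k = 2
      hs2 = mean(abs(beta.hs2) < 0.01))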

Horseshoe estimator
The horseshoe estimator also takes τ ~ C^+(0, 1), though most heavy-tailed priors on τ will perform similarly. How does the horseshoe prior work in practice? It works well in high dimensional, sparse regressions. Experiments show it performs similarly to spike-and-slab at variable selection; the continuous nature of the prior means mixing is much better for large p; and the posterior mean has strong prediction properties.

Outline
1. Bayesian Global-Local Shrinkage Hierarchies: Linear Models; Bayesian Estimation of Linear Models; Global-Local Shrinkage Hierarchies
2. A Bayesian regression toolbox

A Bayesian regression toolbox (1)
Motivation: we have lots of genomic and epigenomic data, with large numbers of genomic markers measured along the genome and associated disease outcomes (breast and prostate cancer, etc.). The dimensionality is large (total p > 5,000,000 in some cases), while the number of true associations is expected to be small, and both association discovery and prediction are of importance. We wished to apply horseshoe-type methods, but no flexible, easy-to-use and efficient toolbox existed, so we (myself and Enes Makalic) wrote the bayesreg toolbox for MATLAB and R.

A Bayesian regression toolbox (2)
Requirements: handle large p (at least 10,000+); implement both normal and logistic linear regression; implement the horseshoe priors. Additionally: handle group structures within variables (for example, genes); perform (grouped) variable selection even when p > n.

Bayesreg (1)
Bayesreg uses the following hierarchy:
z_i | x_i, β, β_0, ω_i^2, σ^2 ~ N(x_i'β + β_0, σ^2 ω_i^2),
σ^2 ~ σ^{-2} dσ^2,
ω_i^2 ~ π(ω_i^2) dω_i^2,
β_0 ~ dβ_0,
β_j | λ_j^2, τ^2, σ^2 ~ N(0, λ_j^2 τ^2 σ^2),
λ_j^2 ~ π(λ_j^2) dλ_j^2,
τ^2 ~ π(τ^2) dτ^2.
We use a scale-mixture representation of the likelihood: for continuous data z_i = y_i; for binary data z_i = (y_i - 1/2)/ω_i^2.

Bayesreg (2)
(Same hierarchy as above.) The z_i follow a (potentially) heteroskedastic Gaussian; π(ω_i^2) determines the data model (normal, logistic, etc.).

Bayesreg (3)
(Same hierarchy as above.) The priors for β follow a global-local shrinkage hierarchy; π(λ_j^2) determines the estimator (horseshoe, lasso, etc.).

Samplers (1)
Sampler for β: the conditional is a multivariate normal of the form β ~ N(A^{-1} e, A^{-1}), where A = B + D and D is diagonal. This form allows for specialised sampling algorithms: if p/n < 2 we use Rue's algorithm, O(p^3); otherwise we use Bhattacharya's algorithm, O(n^2 p).
Sampler for σ^2: we integrate out the βs to improve mixing; the conditional distribution is an inverse-gamma and reuses quantities computed when sampling β.
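
The sketch below (my reconstruction of the two cited algorithms, not code from bayesreg) samples β from the conjugate conditional N(A^{-1} X'z, σ^2 A^{-1}) with A = X'X + D^{-1} and D = diag(λ_j^2 τ^2); Rue's algorithm factorises the p × p matrix A, while the Bhattacharya et al. style sampler only factorises an n × n matrix and so scales better when p ≫ n.

    # Rue (2001)-style sampler: Cholesky of the p x p posterior precision, O(p^3)
    sample.beta.rue <- function(X, z, d, sigma2) {
      A <- crossprod(X) + diag(1 / d)                          # A = X'X + D^{-1}, D = diag(d)
      R <- chol(A)                                             # A = R'R, R upper triangular
      m <- backsolve(R, forwardsolve(t(R), crossprod(X, z)))   # posterior mean A^{-1} X'z
      drop(m + sqrt(sigma2) * backsolve(R, rnorm(ncol(X))))    # add N(0, sigma^2 A^{-1}) noise
    }

    # Bhattacharya, Chakraborty & Mallick (2016)-style sampler, O(n^2 p), for p >> n
    sample.beta.bhat <- function(X, z, d, sigma2) {
      n <- nrow(X); p <- ncol(X)
      Phi <- X / sqrt(sigma2); alpha <- z / sqrt(sigma2); Dt <- sigma2 * d
      u <- rnorm(p, sd = sqrt(Dt))                             # u ~ N(0, Dt)
      v <- Phi %*% u + rnorm(n)                                # v = Phi u + N(0, I_n)
      w <- solve(Phi %*% (Dt * t(Phi)) + diag(n), alpha - v)   # (Phi Dt Phi' + I_n) w = alpha - v
      drop(u + Dt * crossprod(Phi, w))                         # beta = u + Dt Phi' w
    }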

Samplers (2)
Sampler for λ_j: recall that λ_j ~ C^+(0, 1). The conditional distribution for λ_j is
π(λ_j | β_j, τ, σ) ∝ (λ_j^2 τ^2 σ^2)^{-1/2} exp( -β_j^2 / (2 λ_j^2 τ^2 σ^2) ) (1 + λ_j^2)^{-1},
which is not a standard distribution. When we started this work in 2015, only one slow slice sampler (monomvn) existed for the horseshoe; since our implementation, several competing samplers have appeared.

Alternative Horseshoe Samplers
Slice sampling: heavy tails can cause mixing issues; requires CDF inversions; does not easily extend to higher-order horseshoe priors.
NUTS sampler (Stan implementation): very slow; numerically unstable for the true horseshoe; unable to handle heavier-tailed priors.
Elliptical slice sampler: computationally efficient, but cannot be applied if p > n and cannot handle grouped variable structures.

Our approach (1)
Our approach is based on auxiliary variables. Let x and a be random variables such that
x^2 | a ~ IG(1/2, 1/a) and a ~ IG(1/2, 1/A^2);
then x ~ C^+(0, A), where IG(·, ·) denotes the inverse-gamma distribution with pdf
p(z | α, β) = (β^α / Γ(α)) z^{-α-1} exp(-β/z).
The inverse-gamma is conjugate with the normal for scale parameters, and is also conjugate with itself.
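
A quick Monte Carlo check of this decomposition (my own sketch; the rinvgamma helper is defined here, not taken from a package): drawing a ~ IG(1/2, 1/A^2) and then x^2 | a ~ IG(1/2, 1/a) should reproduce half-Cauchy(0, A) draws.

    # Verify: x^2 | a ~ IG(1/2, 1/a), a ~ IG(1/2, 1/A^2)  =>  x ~ C+(0, A)
    rinvgamma <- function(n, shape, scale) 1 / rgamma(n, shape = shape, rate = scale)  # IG as in the pdf above
    A <- 2; m <- 1e5
    a  <- rinvgamma(m, 1/2, 1 / A^2)
    x  <- sqrt(rinvgamma(m, 1/2, 1 / a))
    x0 <- abs(rcauchy(m, scale = A))               # direct half-Cauchy draws for comparison
    round(rbind(auxiliary = quantile(x,  c(0.25, 0.5, 0.75, 0.9)),
                direct    = quantile(x0, c(0.25, 0.5, 0.75, 0.9))), 2)   # quantiles should agree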

Our approach (2)
Rewrite the prior for β_j as
β_j | λ_j^2, τ^2, σ^2 ~ N(0, λ_j^2 τ^2 σ^2),
λ_j^2 | ν_j ~ IG(1/2, 1/ν_j),
ν_j ~ IG(1/2, 1).
This leads to a simple Gibbs sampler for λ_j and ν_j:
λ_j^2 | · ~ IG( 1, 1/ν_j + β_j^2/(2 τ^2 σ^2) ),
ν_j | · ~ IG( 1, 1 + 1/λ_j^2 ).
Both are simply inverted exponential random variables, giving extremely quick and stable sampling.
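
Because IG(1, s) draws are just inverted exponentials, the two conditional updates above are one line each; this sketch (mine, reusing the rinvgamma helper above, with τ^2 and σ^2 held fixed) performs one Gibbs sweep over the local shrinkage parameters.

    # One Gibbs sweep over the local shrinkage parameters (tau^2, sigma^2 fixed for illustration)
    update.local <- function(beta, nu, tau2, sigma2) {
      lambda2 <- rinvgamma(length(beta), 1, 1 / nu + beta^2 / (2 * tau2 * sigma2))
      nu      <- rinvgamma(length(beta), 1, 1 + 1 / lambda2)
      list(lambda2 = lambda2, nu = nu)
    }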

Higher order horseshoe priors
The HS^k prior is λ_j ~ C_1 · C_2 · ... · C_k, where C_i ~ C^+(0, 1). The prior for λ_j has a very complex form, but it can be rewritten as the hierarchy
λ_j ~ C^+(0, φ_j^(1)),
φ_j^(1) ~ C^+(0, φ_j^(2)),
...,
φ_j^(k-1) ~ C^+(0, 1).
We can apply our expansion to easily sample λ_j and the φ_j^(·): this is currently the only sampler that can efficiently handle k > 1.

Group structures (1)
Group structures exist naturally in predictor variables: a multi-level categorical predictor is a group of dummy variables; a continuous predictor may be represented by a composition of basis functions (additive models); and prior knowledge, such as genes grouped in the same biological pathway, defines a natural group. We wanted our toolbox to exploit such structures.

Group structures (2)
We add additional group-specific shrinkage parameters. For convenience, we assume K levels of disjoint groupings. If β_j belongs to group g_k at level k, then
β_j ~ N(0, λ_j^2 δ_{1,g_1}^2 ... δ_{K,g_K}^2 τ^2 σ^2).
The group shrinkers are given appropriate prior distributions. Our horseshoe sampler is trivially adapted to group shrinkers, as the conditional distribution of δ_{k,g_k} is inverse-gamma; in contrast, the slice sampler requires inversions of gamma CDFs. A paper detailing this work is about to be submitted.

Group structures (3)
[Figure omitted] An illustration of possible group structures over p variables: individual shrinkage parameters λ_1, ..., λ_p, K levels of group shrinkage parameters δ_{k,1}, ..., δ_{k,G_k}, and a single global shrinkage parameter τ.

The bayesreg toolbox (1)
The toolbox currently has:
Data models: Gaussian, logistic, Laplace and robust Student-t regression.
Priors: Bayesian ridge and g-prior regression; Bayesian lasso; horseshoe; higher order horseshoe (horseshoe+, etc.).
Other features: variable ranking; some basic variable selection criteria; easy to use.
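
A minimal usage sketch for the R package, continuing the simulated data from the first sketch; the argument names (model, prior, n.samples) are my recollection of the CRAN documentation rather than a quotation of it, so check ?bayesreg before relying on them.

    # Fit a horseshoe regression with the bayesreg R package (argument names assumed; see ?bayesreg)
    library(bayesreg)
    df  <- data.frame(y = y, X)                   # simulated data from the first sketch
    fit <- bayesreg(y ~ ., data = df, model = "gaussian", prior = "hs", n.samples = 1000)
    summary(fit)                                  # posterior summaries for each coefficient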

The bayesreg toolbox (2)
In comparison to Stan: on a simple problem with p = 10 and n = 442, Stan took 50 seconds to produce 1,000 samples; bayesreg took < 0.1s.
In comparison to other slice-sampling implementations: speed and mixing for small p are considerably better; for large p, performance is similar; and their scope of options is much smaller (no horseshoe+, no grouping).
The toolbox is currently being used by a group at University College London to fit logistic regressions for brain-lesion work involving p = 50,000 predictors.

The bayesreg toolbox (3)
In the current development version: negative binomial regression for count data; multi-level variable grouping; higher order horseshoe priors (beyond HS^2).
To be added in the near future: posterior sparsification tools; autoregressive (time-series) residuals; additional diagnostics.

Sparse Bayesian Point Estimates
We obtain m samples from the posterior; what if we want a single point estimate? An attractive choice is the Bayes estimator under squared-prediction loss: it is (potentially) admissible and invariant to reparameterisation, and it reduces to the posterior mean β̄ of β in the standard parameterisation. Ironically, even if the prior promotes sparsity (e.g., the horseshoe), the posterior mean will not be sparse. The exact posterior mode may be sparse, but it is impossible to find from samples. A number of simple sparsification rules exist, but most do not work when p > n and largely consider only marginal effects.

Decoupled Shrinkage and Selection (DSS) estimator (1)
Polson et al. (2016) recently introduced the DSS procedure:
1. Obtain samples for β from the posterior distribution.
2. Form a new data vector incorporating the effects of shrinkage: ȳ = Xβ̄.
3. Find sparsified approximations of β̄ by solving
β_λ = arg min_β { ||ȳ - Xβ||_2^2 + λ||β||_0 },
or a similar penalised estimator, for different values of λ.
4. Select one of the sparsified models β_λ.
The authors use the adaptive lasso in place of the intractable ℓ_0 penalisation.
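
The sketch below (my illustration of steps 2-3, not the authors' code) uses glmnet's adaptive lasso as the surrogate for the ℓ_0 problem; beta.samples is an assumed name for an m × p matrix of posterior draws of β, and X is the design matrix from the earlier sketches.

    # DSS sparsification: refit the posterior-mean predictions with an adaptive lasso
    library(glmnet)
    beta.bar <- colMeans(beta.samples)            # posterior mean of beta (p-vector)
    y.bar    <- drop(X %*% beta.bar)              # "data" carrying the effect of shrinkage
    w        <- 1 / pmax(abs(beta.bar), 1e-10)    # adaptive-lasso weights
    path     <- glmnet(X, y.bar, penalty.factor = w)   # candidate sparsified beta_lambda over a path
    B.lambda <- as.matrix(coef(path))[-1, ]       # p x nlambda matrix of sparse approximations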

Decoupled Shrinkage and Selection (DSS) estimator (2)
While clever, the initial DSS proposal has several weaknesses: (1) it does not apply to non-continuous data; (2) selection of the degree of sparsification is done by an ad hoc rule; (3) it cannot be applied to the selection of groups of variables. Current work with PhD student Zemei Xu addresses all three problems.

Generalised DSS estimator (1)
We first generalise the procedure to arbitrary data types. Let p(y | θ, X) be the data model in our Bayesian hierarchy, and partition the parameter vector as θ = (β, γ), where γ are additional parameters, such as σ^2 if the model is normal. Given X, the posterior predictive density defines a probability distribution over the possible values ỹ of y that could arise:
p(ỹ | y, X) = ∫ p(ỹ | θ, X) π(θ | y, X) dθ.
It incorporates all posterior beliefs about θ and defines a complete distribution over ỹ.

Generalised DSS estimator (2)
Recall the ideal sparsification scheme from DSS:
β_λ = arg min_β { ||ȳ - Xβ||_2^2 + λ||β||_0 }.
We can now replace the sum-of-squares goodness-of-fit term by an expected (negative log-) likelihood goodness-of-fit term,
Lỹ(β, γ) = -Eỹ[ log p(ỹ | β, γ) ],
and use a sparsifying penalized likelihood estimator. This is simple for binary data, as a mixture of Bernoullis is a Bernoulli.

Generalised DSS estimator (3)
The second problem is solved by selecting the degree of sparsification using information criteria, formed as the sum of a likelihood term plus a dimensionality penalty. We adapt information criteria to the DSS problem by using
GIC(λ) = min_γ { Lỹ(β_λ, γ) + α(n, k_λ, γ) },
where k_λ = ||β_λ||_0 is the degrees of freedom of β_λ. Some common choices for α(·) are α(n, k_λ, γ) = (k_λ/2) log n for the BIC, and α(n, k_λ, γ) = n k_λ/(n - k_λ - 2) for the corrected AIC. We also considered an MML criterion. We select the β_λ that minimises the information criterion score.
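
Continuing the DSS sketch for Gaussian data, the following illustrative (not the authors') implementation scores each candidate β_λ on the path with the BIC-style penalty and keeps the minimiser; B.lambda and beta.bar come from the DSS sketch above, sigma2.samples is an assumed name for the posterior draws of σ^2, and predictive-variance terms that do not depend on β are dropped because they do not change which λ is chosen.

    # BIC-style generalised information criterion over the sparsified path (Gaussian sketch)
    sigma2.bar <- mean(sigma2.samples)                      # plug-in posterior mean of sigma^2
    k.lambda   <- colSums(B.lambda != 0)                    # degrees of freedom ||beta_lambda||_0
    fit.term   <- colSums((drop(X %*% beta.bar) - X %*% B.lambda)^2) / (2 * sigma2.bar)
    gic        <- fit.term + (k.lambda / 2) * log(nrow(X))  # alpha(n, k) = (k/2) log n  (BIC)
    beta.dss   <- B.lambda[, which.min(gic)]                # selected sparse point estimate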

Generalised DSS estimator (4)
We have extended this further to selecting groups of variables, which is very relevant for testing genes and pathways in genomic data. We compared our generalised DSS to a grouped spike-and-slab: the information criterion approach (using MML) outperformed the original ad hoc proposal of Polson et al., and performed as well as spike-and-slab in overall selection error while being over 20,000 times faster. Analysis of the iCOGS data is about to begin: a very large dataset with n = 120,000 and p = 2,000,000.

Conclusion
MATLAB toolbox: http://au.mathworks.com/matlabcentral/fileexchange/60335-bayesian-regularized-linear-and-logistic-regression
R package: available as package bayesreg from CRAN.
A pre-print describing the toolbox in detail: "High-Dimensional Bayesian Regularised Regression with the BayesReg Package", E. Makalic and D. F. Schmidt, arXiv preprint: https://arxiv.org/pdf/1611.06649v1/
Thank you. Questions?