Horseshoe, Lasso and Related Shrinkage Methods


Readings: Christensen Chapter 15
Merlise Clyde
October 15, 2015

Bayesian Lasso

Park & Casella (JASA 2008) and Hans (Biometrika 2010) propose Bayesian versions of the Lasso:

$Y \mid \alpha, \beta^s, \phi \sim \mathrm{N}(1_n \alpha + X^s \beta^s, I_n/\phi)$
$\beta^s \mid \alpha, \phi, \tau, \lambda \sim \mathrm{N}(0, \mathrm{diag}(\tau^2)/\phi)$
$\tau_1^2, \ldots, \tau_p^2 \mid \alpha, \phi, \lambda \overset{iid}{\sim} \mathrm{Exp}(\lambda^2/2)$
$p(\alpha, \phi) \propto 1/\phi$

Can show that $\beta_j \mid \phi, \lambda \overset{iid}{\sim} \mathrm{DE}(\lambda\sqrt{\phi})$:

$$\int_0^\infty \sqrt{\frac{\phi}{2\pi s}}\, e^{-\frac{1}{2}\frac{\phi \beta^2}{s}}\, \frac{\lambda^2}{2}\, e^{-\frac{\lambda^2 s}{2}}\, ds = \frac{\lambda \phi^{1/2}}{2}\, e^{-\lambda \phi^{1/2} |\beta|},$$

a scale mixture of normals (Andrews and Mallows 1974).
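Mixture identities like this are easy to get wrong, so a quick numerical check helps. A minimal R sketch (my illustration, not lecture code; the test values of beta, phi, and lambda are arbitrary):

    # Check: int_0^Inf N(beta; 0, s/phi) Exp(s; rate = lambda^2/2) ds
    #        = (lambda sqrt(phi) / 2) exp(-lambda sqrt(phi) |beta|)
    beta <- 1.3; phi <- 2.0; lambda <- 0.7
    mixture <- integrate(function(s)
      dnorm(beta, mean = 0, sd = sqrt(s / phi)) * dexp(s, rate = lambda^2 / 2),
      lower = 0, upper = Inf)$value
    closed <- (lambda * sqrt(phi) / 2) * exp(-lambda * sqrt(phi) * abs(beta))
    c(mixture = mixture, closed = closed)   # the two values should agree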

Gibbs Sampling

Prior: $\lambda^2 \sim \mathrm{Gamma}(r, \delta)$

Integrate out $\alpha$: $\alpha \mid Y, \phi \sim \mathrm{N}(\bar{y}, 1/(n\phi))$

Full conditionals:
$\beta^s \mid \tau, \phi, \lambda, Y \sim \mathrm{N}(\cdot, \cdot)$
$\phi \mid \tau, \beta^s, \lambda, Y \sim \mathrm{G}(\cdot, \cdot)$
$\lambda^2 \mid \beta^s, \phi, \tau^2, Y \sim \mathrm{G}(\cdot, \cdot)$
$1/\tau_j^2 \mid \beta^s, \phi, \lambda, Y \sim \mathrm{InvGaussian}(\cdot, \cdot)$

If $X \sim \mathrm{InvGaussian}(\mu, \lambda)$, its density is
$$f(x) = \sqrt{\frac{\lambda}{2\pi}}\, x^{-3/2}\, e^{-\frac{1}{2}\frac{\lambda (x-\mu)^2}{\mu^2 x}}, \qquad x > 0.$$

Homework (next week): derive the full conditionals for $\beta^s$, $\phi$, and $1/\tau^2$; see http://www.stat.ufl.edu/~casella/papers/lasso.pdf
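The inverse-Gaussian draw is the only nonstandard step in this sampler. In R it is available as statmod::rinvgauss, whose (mean, shape) parameterization matches the density above; a small sketch (mine, assuming the statmod package is installed) checking the match by simulation:

    library(statmod)
    mu <- 1.5; lam <- 2.0
    x <- rinvgauss(1e5, mean = mu, shape = lam)
    # Analytic density from the slide, same parameterization
    f <- function(x) sqrt(lam / (2 * pi)) * x^(-3/2) *
           exp(-lam * (x - mu)^2 / (2 * mu^2 * x))
    hist(x, breaks = 100, freq = FALSE, main = "InvGaussian(1.5, 2)")
    curve(f, from = 0.01, to = max(x), add = TRUE, col = "red")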

Horseshoe

Carvalho, Polson & Scott propose the prior distribution
$\beta^s \mid \phi, \tau \sim \mathrm{N}(0_p, \mathrm{diag}(\tau^2)/\phi)$
$\tau_j \mid \lambda \overset{iid}{\sim} \mathrm{C}^+(0, \lambda^2)$ (the notation differs from CPS)
$\lambda \sim \mathrm{C}^+(0, 1)$
$p(\alpha, \phi) \propto 1/\phi$

In the case $\lambda = \phi = 1$ and with the canonical representation $Y = I\beta + \epsilon$,
$$E[\beta_i \mid Y] = \int_0^1 (1 - \kappa_i)\, y_i\, p(\kappa_i \mid Y)\, d\kappa_i = (1 - E[\kappa_i \mid y_i])\, y_i,$$
where $\kappa_i = 1/(1 + \tau_i^2)$ is the shrinkage factor.

The half-Cauchy prior induces a $\mathrm{Beta}(1/2, 1/2)$ distribution on $\kappa_i$ a priori.
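The Beta(1/2, 1/2) claim is a one-line change of variables, worth recording since it is the source of the name. With $\tau_i \sim \mathrm{C}^+(0,1)$, so $p(\tau_i) = \frac{2}{\pi(1+\tau_i^2)}$, and $\kappa_i = 1/(1+\tau_i^2)$:

$$\tau_i = \sqrt{\frac{1-\kappa_i}{\kappa_i}}, \qquad \left|\frac{d\tau_i}{d\kappa_i}\right| = \frac{1}{2\,\kappa_i^{3/2}(1-\kappa_i)^{1/2}},$$

$$p(\kappa_i) = \frac{2\kappa_i}{\pi} \cdot \frac{1}{2\,\kappa_i^{3/2}(1-\kappa_i)^{1/2}} = \frac{1}{\pi\,\kappa_i^{1/2}(1-\kappa_i)^{1/2}},$$

which is the $\mathrm{Beta}(1/2, 1/2)$ density.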

[Figure: the Beta(1/2, 1/2) density of $\kappa$ on (0, 1), unbounded at both endpoints: the horseshoe shape.]
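The figure is easy to reproduce; a short R sketch (mine) that also confirms the change of variables by simulating $\kappa_i$ from half-Cauchy draws:

    tau   <- abs(rcauchy(1e5))          # half-Cauchy C+(0, 1) draws
    kappa <- 1 / (1 + tau^2)            # induced shrinkage factors
    hist(kappa, breaks = 50, freq = FALSE,
         main = "Induced prior on kappa", xlab = expression(kappa))
    curve(dbeta(x, 0.5, 0.5), add = TRUE, col = "red", lwd = 2)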

Prior Comparison

[Figure from PSC comparing the prior densities.]

Bounded Influence

Normal means case: $Y_i \overset{iid}{\sim} \mathrm{N}(\beta_i, 1)$ (equivalent to the canonical case). The posterior mean is
$$E[\beta \mid y] = y + \frac{d}{dy} \log m(y),$$
where $m(y)$ is the predictive density under the prior (known $\lambda$).

HS has bounded influence: $\lim_{y \to \infty} \frac{d}{dy} \log m(y) = 0$, so $E[\beta \mid y] \to y$ (the MLE) as $y \to \infty$.

DE is also bounded influence, but its bound does not decay to zero in the tails.
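To see the tail behavior concretely, here is a rough numerical sketch (my illustration, fixing $\lambda = 1$) that computes $m(y)$ by quadrature for both priors and evaluates the posterior mean through the identity above:

    # m(y) for the horseshoe, via y | tau ~ N(0, 1 + tau^2), tau ~ C+(0, 1)
    m_hs <- function(y) sapply(y, function(yy)
      integrate(function(tau) dnorm(yy, 0, sqrt(1 + tau^2)) * 2 / (pi * (1 + tau^2)),
                0, Inf)$value)
    # m(y) for the double exponential DE(1) prior
    m_de <- function(y) sapply(y, function(yy)
      integrate(function(b) dnorm(yy - b) * 0.5 * exp(-abs(b)), -Inf, Inf)$value)
    post_mean <- function(m, y, h = 1e-3)
      y + (log(m(y + h)) - log(m(y - h))) / (2 * h)
    y <- c(1, 3, 6, 10)
    rbind(y, HS = post_mean(m_hs, y), DE = post_mean(m_de, y))
    # HS estimates approach y as y grows; DE stays offset by about 1 (= lambda)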

R packages

The monomvn package in R includes blasso and bhs. See the Diabetes.R code.
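A minimal usage sketch (mine, not the lecture's Diabetes.R; I am assuming monomvn's blasso-class output stores the intercept and coefficient draws as mu and beta):

    library(monomvn)
    library(lars)
    data(diabetes)                       # x (442 x 10 matrix) and response y
    X <- unclass(diabetes$x); y <- diabetes$y
    fit.bl <- blasso(X, y, T = 2000)     # Bayesian lasso (Park & Casella)
    fit.hs <- bhs(X, y, T = 2000)        # horseshoe prior
    burn <- 500                          # discard early draws
    colMeans(fit.hs$beta[-(1:burn), ])   # posterior means of the coefficients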

Simulation Study with Diabetes Data

[Figure: boxplots of RMSE, roughly 50 to 80, for OLS, LASSO, and HORSESHOE on the diabetes data.]
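One replicate of such a comparison could look like the following sketch (my reconstruction, not the lecture's code; a single random train/test split rather than the full simulation):

    library(lars); library(monomvn)
    data(diabetes)
    X <- unclass(diabetes$x); y <- diabetes$y
    set.seed(1)
    test <- sample(nrow(X), 100)
    rmse <- function(pred) sqrt(mean((y[test] - pred)^2))
    # OLS
    b.ols <- coef(lm(y[-test] ~ X[-test, ]))
    # Lasso, shrinkage fraction chosen by cross-validation
    cvl <- cv.lars(X[-test, ], y[-test], plot.it = FALSE)
    s.best <- cvl$index[which.min(cvl$cv)]
    pred.las <- predict(lars(X[-test, ], y[-test]), X[test, ],
                        s = s.best, mode = "fraction")$fit
    # Horseshoe posterior mean (element names as in the previous sketch)
    fit.hs <- bhs(X[-test, ], y[-test], T = 2000)
    pred.hs <- mean(fit.hs$mu) + X[test, ] %*% colMeans(fit.hs$beta)
    c(OLS = rmse(cbind(1, X[test, ]) %*% b.ols),
      LASSO = rmse(pred.las), HS = rmse(pred.hs))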

Other Options

A range of other scale mixtures has been used:

Generalized Double Pareto (Armagan, Dunson & Lee): $\lambda \sim \mathrm{Gamma}(\alpha, \eta)$, then $\beta_j^s \sim \mathrm{GDP}(\xi = \eta/\alpha, \alpha)$ with
$$f(\beta_j^s) = \frac{1}{2\xi} \left(1 + \frac{|\beta_j^s|}{\xi \alpha}\right)^{-(1+\alpha)};$$
see http://arxiv.org/pdf/1104.0861.pdf

Normal-Exponential-Gamma (Griffin & Brown 2005): $\lambda^2 \sim \mathrm{Gamma}(\alpha, \eta)$

Bridge: power exponential priors (stable mixing density)

See the monomvn package on CRAN.

Choice of prior? Properties? Fan & Li (JASA 2001) discuss variable selection via nonconcave penalties and oracle properties.
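To make the tail differences concrete, a short R sketch (mine) plotting the DE(1) and GDP densities on the log scale; the GDP's polynomial tail is what keeps large coefficients from being over-shrunk:

    # GDP density as on the slide (xi = alpha = 1 gives Cauchy-like tails)
    dgdp <- function(b, xi = 1, alpha = 1)
      1 / (2 * xi) * (1 + abs(b) / (xi * alpha))^(-(1 + alpha))
    b <- seq(0, 10, length.out = 400)
    plot(b, log(0.5 * exp(-b)), type = "l",
         xlab = expression(beta), ylab = "log density")   # DE(1)
    lines(b, log(dgdp(b)), col = "red")                   # GDP(1, 1)
    legend("topright", c("DE(1)", "GDP(1, 1)"),
           col = c("black", "red"), lty = 1)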

Choice of Estimator & Selection?

Posterior mode (may set some coefficients to zero)

Posterior mean (no selection, just shrinkage)

The Bayesian posterior assigns no probability to $\beta_j^s = 0$, so selection based on the posterior mode is an ad hoc rule; another ad hoc rule is to select variable $i$ if $\kappa_i < 0.5$. See the article by Datta & Ghosh: http://ba.stat.cmu.edu/journal/forthcoming/datta.pdf

Selection can be solved as a post-analysis decision problem.

Selection can also be made part of the model uncertainty: add prior probability that $\beta_j^s = 0$ and combine with the decision problem.

Remember: all models are wrong, but some may be useful!
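As a final illustration, a minimal sketch (mine) of the $\kappa_i < 0.5$ rule in the normal means model with $\lambda = \phi = 1$, computing $E[\kappa_i \mid y_i]$ by quadrature rather than MCMC (the horseshoe package on CRAN automates a thresholding rule of this kind):

    # E[kappa | y] under the horseshoe, kappa = 1/(1 + tau^2), tau ~ C+(0, 1)
    post_kappa <- function(y) {
      w <- function(tau) dnorm(y, 0, sqrt(1 + tau^2)) * 2 / (pi * (1 + tau^2))
      num <- integrate(function(tau) w(tau) / (1 + tau^2), 0, Inf)$value
      num / integrate(w, 0, Inf)$value
    }
    y  <- c(0.5, 1, 2, 3, 5)
    Ek <- sapply(y, post_kappa)
    rbind(y, E.kappa = round(Ek, 3), select = Ek < 0.5)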