Horseshoe, Lasso and Related Shrinkage Methods
Readings: Chapter 15, Christensen
Merlise Clyde
October 15, 2015
Bayesian Lasso

Park & Casella (JASA 2008) and Hans (Biometrika 2010) propose Bayesian versions of the Lasso:

$Y \mid \alpha, \beta^s, \phi \sim N(1_n \alpha + X^s \beta^s, I_n/\phi)$
$\beta^s \mid \alpha, \phi, \tau, \lambda \sim N(0, \mathrm{diag}(\tau^2)/\phi)$
$\tau_1^2, \ldots, \tau_p^2 \mid \alpha, \phi, \lambda \overset{iid}{\sim} \mathrm{Exp}(\lambda^2/2)$
$p(\alpha, \phi) \propto 1/\phi$

Can show that $\beta_j \mid \phi, \lambda \overset{iid}{\sim} \mathrm{DE}(\lambda\sqrt{\phi})$:

$\int_0^\infty \frac{\sqrt{\phi}}{\sqrt{2\pi s}}\, e^{-\frac{1}{2}\frac{\phi\beta^2}{s}} \,\frac{\lambda^2}{2}\, e^{-\frac{\lambda^2 s}{2}} \, ds = \frac{\lambda\phi^{1/2}}{2}\, e^{-\lambda\phi^{1/2}|\beta|}$

a scale mixture of normals (Andrews and Mallows 1974).
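The scale-mixture identity above is easy to check numerically. A quick sketch in Python (illustrative only; the course code is in R, and the values of $\phi$, $\lambda$, and $\beta$ below are arbitrary choices):

```python
import numpy as np
from scipy.integrate import quad

def mixture_density(beta, phi, lam):
    """Integrate the N(0, s/phi) density for beta against the
    Exp(lam^2/2) mixing density on the local scale s."""
    def integrand(s):
        norm = np.sqrt(phi / (2 * np.pi * s)) * np.exp(-phi * beta**2 / (2 * s))
        mix = (lam**2 / 2) * np.exp(-lam**2 * s / 2)
        return norm * mix
    val, _ = quad(integrand, 0, np.inf)
    return val

def de_density(beta, phi, lam):
    """Closed-form double-exponential DE(lam * sqrt(phi)) density."""
    rate = lam * np.sqrt(phi)
    return (rate / 2) * np.exp(-rate * abs(beta))

phi, lam, beta = 2.0, 1.5, 0.7
print(mixture_density(beta, phi, lam), de_density(beta, phi, lam))
```

The two numbers agree to numerical precision, confirming that the marginal prior on each coefficient is double exponential.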
Gibbs Sampling

Prior: $\lambda^2 \sim \mathrm{Gamma}(r, \delta)$

Integrate out $\alpha$: $\alpha \mid Y, \phi \sim N(\bar{y}, 1/(n\phi))$

Full conditionals:
- $\beta^s \mid \tau, \phi, \lambda, Y \sim N(\cdot, \cdot)$
- $\phi \mid \tau, \beta^s, \lambda, Y \sim \mathrm{G}(\cdot, \cdot)$
- $\lambda^2 \mid \beta^s, \phi, \tau^2, Y \sim \mathrm{G}(\cdot, \cdot)$
- $1/\tau_j^2 \mid \beta^s, \phi, \lambda, Y \sim \mathrm{InvGaussian}(\cdot, \cdot)$

If $X \sim \mathrm{InvGaussian}(\mu, \lambda)$, then

$f(x) = \sqrt{\frac{\lambda}{2\pi}}\, x^{-3/2}\, e^{-\frac{1}{2}\frac{\lambda(x-\mu)^2}{\mu^2 x}}, \quad x > 0$

Homework next week: derive the full conditionals for $\beta^s$, $\phi$, $1/\tau_j^2$.
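Sampling the $1/\tau_j^2$ full conditional requires inverse-Gaussian draws; a standard approach is the transformation method of Michael, Schucany and Haas (1976). A sketch in Python (illustrative; inside the Gibbs sampler the parameters $\mu$ and $\lambda$ would come from the full conditional derived in the homework):

```python
import numpy as np

def rinvgauss(mu, lam, size, rng):
    """Draw from InvGaussian(mu, lam) via the transformation method
    of Michael, Schucany and Haas (1976)."""
    nu = rng.standard_normal(size)
    y = nu**2
    x = mu + mu**2 * y / (2 * lam) - (mu / (2 * lam)) * np.sqrt(
        4 * mu * lam * y + mu**2 * y**2)
    u = rng.uniform(size=size)
    # Accept x with probability mu/(mu + x); otherwise take mu^2/x.
    return np.where(u <= mu / (mu + x), x, mu**2 / x)

rng = np.random.default_rng(0)
draws = rinvgauss(2.0, 3.0, 200_000, rng)
print(draws.mean())  # should be close to mu = 2
```

The sample mean and variance should be close to $\mu$ and $\mu^3/\lambda$, the moments of the inverse-Gaussian distribution.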
Horseshoe

Carvalho, Polson & Scott propose the prior distribution

$\beta^s \mid \phi, \tau \sim N(0_p, \mathrm{diag}(\tau^2)/\phi)$
$\tau_j \mid \lambda \overset{iid}{\sim} \mathrm{C}^+(0, \lambda^2)$ (difference is CPS notation)
$\lambda \sim \mathrm{C}^+(0, 1)$
$p(\alpha, \phi) \propto 1/\phi$

In the case $\lambda = \phi = 1$ and with the canonical representation $Y = I\beta + \epsilon$,

$E[\beta_i \mid Y] = \int_0^1 (1 - \kappa_i)\, y_i \, p(\kappa_i \mid Y) \, d\kappa_i = (1 - E[\kappa_i \mid y_i])\, y_i$

where $\kappa_i = 1/(1 + \tau_i^2)$ is the shrinkage factor.

The half-Cauchy prior induces a Beta(1/2, 1/2) distribution on $\kappa_i$ a priori.
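The induced Beta(1/2, 1/2) prior on $\kappa_i$ follows from a change of variables, with density $p(\kappa) = \pi^{-1}\kappa^{-1/2}(1-\kappa)^{-1/2}$, and can be verified by simulation. A sketch (with $\lambda = 1$; Python rather than the course's R):

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(1)
# tau ~ C+(0, 1): absolute value of a standard Cauchy draw
tau = np.abs(rng.standard_cauchy(200_000))
kappa = 1.0 / (1.0 + tau**2)

# Compare the empirical CDF of kappa with the Beta(1/2, 1/2) CDF
grid = np.array([0.1, 0.25, 0.5, 0.75, 0.9])
ecdf = np.array([(kappa <= g).mean() for g in grid])
print(np.abs(ecdf - beta.cdf(grid, 0.5, 0.5)).max())  # small
```

The U-shaped Beta(1/2, 1/2) density puts mass near both $\kappa = 0$ (no shrinkage) and $\kappa = 1$ (total shrinkage), which is the "horseshoe" shape.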
Horseshoe

[Figure: Beta(1/2, 1/2) density of the shrinkage factor $\kappa$]
Prior Comparison (from PSC)

[Figure: comparison of prior densities]
Bounded Influence

Normal means case: $Y_i \overset{iid}{\sim} N(\beta_i, 1)$ (equivalent to the canonical case).

Posterior mean: $E[\beta \mid y] = y + \frac{d}{dy} \log m(y)$, where $m(y)$ is the predictive density under the prior (known $\lambda$).

HS has bounded influence: $\lim_{y \to \infty} \frac{d}{dy} \log m(y) = 0$, so $\lim_{y \to \infty} E[\beta \mid y] \to y$ (the MLE).

DE is also bounded influence, but the bound does not decay to zero in the tails.
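This tail behaviour can be seen numerically. In the canonical case with $\lambda = \phi = 1$, marginally $y \mid \kappa \sim N(0, 1/\kappa)$, so the posterior for $\kappa$ under its Beta(1/2, 1/2) prior has kernel $(1-\kappa)^{-1/2} e^{-\kappa y^2/2}$, and $E[\beta \mid y] = (1 - E[\kappa \mid y])\,y$. A sketch (my own derivation of the kernel, under the assumptions just stated):

```python
import numpy as np
from scipy.integrate import quad

def post_mean_kappa(y):
    """E[kappa | y] under kappa ~ Beta(1/2, 1/2) and y | kappa ~ N(0, 1/kappa).
    Posterior kernel: (1 - kappa)^(-1/2) * exp(-kappa * y^2 / 2)."""
    w = lambda k: (1 - k) ** -0.5 * np.exp(-k * y**2 / 2)
    num, _ = quad(lambda k: k * w(k), 0, 1)
    den, _ = quad(w, 0, 1)
    return num / den

for y in [0.5, 2.0, 5.0, 10.0]:
    # Posterior mean of beta is (1 - E[kappa | y]) * y
    print(y, 1 - post_mean_kappa(y))
```

The shrinkage weight $E[\kappa \mid y]$ is large for small $|y|$ and vanishes as $|y|$ grows, so the posterior mean approaches the observation itself: bounded influence.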
R Packages

The monomvn package in R includes blasso and bhs; see the Diabetes.R code.
Simulation Study with Diabetes Data

[Figure: RMSE comparison of OLS, LASSO, and HORSESHOE]
Other Options

A range of other scale mixtures has been used:

- Generalized Double Pareto (Armagan, Dunson & Lee): if $\lambda \sim \mathrm{Gamma}(\alpha, \eta)$ then $\beta_j^s \sim \mathrm{GDP}(\xi = \eta/\alpha, \alpha)$ with density
  $f(\beta_j^s) = \frac{1}{2\xi}\left(1 + \frac{|\beta_j^s|}{\xi\alpha}\right)^{-(1+\alpha)}$
- Normal-Exponential-Gamma (Griffin & Brown 2005): $\lambda^2 \sim \mathrm{Gamma}(\alpha, \eta)$
- Bridge - Power Exponential priors (stable mixing density)
- See the monomvn package on CRAN

Choice of prior? Properties? Fan & Li (JASA 2001) discuss variable selection via nonconcave penalties and oracle properties.
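The GDP density arises by mixing a double-exponential prior $\beta \mid \lambda \sim \mathrm{DE}(\lambda)$ over $\lambda \sim \mathrm{Gamma}(\alpha, \eta)$, which can be checked numerically. A sketch ($\alpha$, $\eta$, and the evaluation point are arbitrary illustrative choices):

```python
import math
import numpy as np
from scipy.integrate import quad

def gdp_mixture(b, alpha, eta):
    """Numerically mix DE(lam) over lam ~ Gamma(alpha, rate=eta)."""
    def integrand(lam):
        laplace = (lam / 2) * np.exp(-lam * abs(b))
        gamma_pdf = (eta**alpha / math.gamma(alpha)) * lam**(alpha - 1) * np.exp(-eta * lam)
        return laplace * gamma_pdf
    val, _ = quad(integrand, 0, np.inf)
    return val

def gdp_density(b, alpha, eta):
    """Closed-form GDP(xi = eta/alpha, alpha) density."""
    xi = eta / alpha
    return (1 / (2 * xi)) * (1 + abs(b) / (xi * alpha)) ** -(1 + alpha)

b, alpha, eta = 0.8, 2.0, 1.5
print(gdp_mixture(b, alpha, eta), gdp_density(b, alpha, eta))
```

The numerical mixture and the closed-form GDP density agree, mirroring how the exponential mixture earlier yields the double exponential.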
Choice of Estimator & Selection?

- Posterior mode (may set some coefficients to zero)
- Posterior mean (no selection, just shrinkage)
- The Bayesian posterior does not assign any probability to $\beta_j^s = 0$
- Selection based on the posterior mode is an ad hoc rule: select if $\kappa_i < 0.5$
- See the article by Datta & Ghosh: http://ba.stat.cmu.edu/journal/forthcoming/datta.pdf
- Selection can be solved as a post-analysis decision problem
- Selection as part of model uncertainty: add prior probability that $\beta_j^s = 0$ and combine with the decision problem

Remember: all models are wrong, but some may be useful!
More informationDirichlet-Laplace priors for optimal shrinkage
Dirichlet-Laplace priors for optimal shrinkage Anirban Bhattacharya, Debdeep Pati, Natesh S. Pillai, David B. Dunson January 22, 2014 arxiv:1401.5398v1 [math.st] 21 Jan 2014 Abstract Penalized regression
More informationRegression. Machine Learning and Pattern Recognition. Chris Williams. School of Informatics, University of Edinburgh.
Regression Machine Learning and Pattern Recognition Chris Williams School of Informatics, University of Edinburgh September 24 (All of the slides in this course have been adapted from previous versions
More informationSparse Bayesian Logistic Regression with Hierarchical Prior and Variational Inference
Sparse Bayesian Logistic Regression with Hierarchical Prior and Variational Inference Shunsuke Horii Waseda University s.horii@aoni.waseda.jp Abstract In this paper, we present a hierarchical model which
More informationConsistent change point estimation. Chi Tim Ng, Chonnam National University Woojoo Lee, Inha University Youngjo Lee, Seoul National University
Consistent change point estimation Chi Tim Ng, Chonnam National University Woojoo Lee, Inha University Youngjo Lee, Seoul National University Outline of presentation Change point problem = variable selection
More informationMixtures of Prior Distributions
Mixtures of Prior Distributions Hoff Chapter 9, Liang et al 2007, Hoeting et al (1999), Clyde & George (2004) November 10, 2016 Bartlett s Paradox The Bayes factor for comparing M γ to the null model:
More informationIEOR165 Discussion Week 5
IEOR165 Discussion Week 5 Sheng Liu University of California, Berkeley Feb 19, 2016 Outline 1 1st Homework 2 Revisit Maximum A Posterior 3 Regularization IEOR165 Discussion Sheng Liu 2 About 1st Homework
More informationChris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010
Model-Averaged l 1 Regularization using Markov Chain Monte Carlo Model Composition Technical Report No. 541 Department of Statistics, University of Washington Chris Fraley and Daniel Percival August 22,
More informationBayesian Lasso Binary Quantile Regression
Noname manuscript No. (will be inserted by the editor) Bayesian Lasso Binary Quantile Regression Dries F Benoit Rahim Alhamzawi Keming Yu the date of receipt and acceptance should be inserted later Abstract
More informationMixtures of Prior Distributions
Mixtures of Prior Distributions Hoff Chapter 9, Liang et al 2007, Hoeting et al (1999), Clyde & George (2004) November 9, 2017 Bartlett s Paradox The Bayes factor for comparing M γ to the null model: BF
More informationarxiv: v1 [math.st] 13 Feb 2017
Adaptive posterior contraction rates for the horseshoe arxiv:72.3698v [math.st] 3 Feb 27 Stéphanie van der Pas,, Botond Szabó,,, and Aad van der Vaart, Leiden University and Budapest University of Technology
More informationNew Statistical Methods That Improve on MLE and GLM Including for Reserve Modeling GARY G VENTER
New Statistical Methods That Improve on MLE and GLM Including for Reserve Modeling GARY G VENTER MLE Going the Way of the Buggy Whip Used to be gold standard of statistical estimation Minimum variance
More informationThe Horseshoe Estimator for Sparse Signals
The Horseshoe Estimator for Sparse Signals Carlos M. Carvalho Nicholas G. Polson University of Chicago Booth School of Business James G. Scott Duke University Department of Statistical Science October
More informationHomework 1: Solutions
Homework 1: Solutions Statistics 413 Fall 2017 Data Analysis: Note: All data analysis results are provided by Michael Rodgers 1. Baseball Data: (a) What are the most important features for predicting players
More information