Readings: Chapter 15 of Christensen. Merlise Clyde, October 15, 2015.
Bayesian Lasso

Park & Casella (JASA 2008) and Hans (Biometrika 2010) propose Bayesian versions of the Lasso:
$$Y \mid \alpha, \beta^s, \phi \sim N(1_n \alpha + X^s \beta^s,\; I_n/\phi)$$
$$\beta^s \mid \alpha, \phi, \tau, \lambda \sim N(0,\; \operatorname{diag}(\tau^2)/\phi)$$
$$\tau_1^2, \ldots, \tau_p^2 \mid \alpha, \phi, \lambda \overset{iid}{\sim} \operatorname{Exp}(\lambda^2/2)$$
$$p(\alpha, \phi) \propto 1/\phi$$

Can show that $\beta_j \mid \phi, \lambda \overset{iid}{\sim} \operatorname{DE}(\lambda \sqrt{\phi})$:
$$\int_0^\infty \frac{1}{\sqrt{2\pi s/\phi}}\, e^{-\frac{1}{2}\frac{\phi \beta^2}{s}}\; \frac{\lambda^2}{2}\, e^{-\frac{\lambda^2 s}{2}}\, ds = \frac{\lambda \phi^{1/2}}{2}\, e^{-\lambda \phi^{1/2} |\beta|},$$
a Scale Mixture of Normals (Andrews and Mallows 1974).
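A quick Monte Carlo check of this scale-mixture identity (a sketch of my own, not from the slides; the values of lambda and phi are arbitrary): drawing $\tau^2 \sim \operatorname{Exp}(\lambda^2/2)$ and then $\beta \mid \tau^2 \sim N(0, \tau^2/\phi)$ should reproduce the double-exponential density with rate $\lambda\sqrt{\phi}$.

```r
# Monte Carlo check of the scale-mixture-of-normals representation of the
# double-exponential prior; lambda and phi below are arbitrary choices.
set.seed(1)
lambda <- 2
phi    <- 1.5

tau2 <- rexp(1e5, rate = lambda^2 / 2)               # tau_j^2 ~ Exp(lambda^2 / 2)
beta <- rnorm(1e5, mean = 0, sd = sqrt(tau2 / phi))  # beta_j | tau_j^2 ~ N(0, tau_j^2 / phi)

rate <- lambda * sqrt(phi)                           # implied DE rate
hist(beta, breaks = 100, freq = FALSE, main = "Scale mixture of normals vs DE")
curve(rate / 2 * exp(-rate * abs(x)), add = TRUE, lwd = 2)
```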
Gibbs Sampling

Prior: $\lambda^2 \sim \operatorname{Gamma}(r, \delta)$

Integrate out $\alpha$: $\alpha \mid Y, \phi \sim N(\bar{y},\; 1/(n\phi))$

Full conditionals:
$$\beta^s \mid \tau, \phi, \lambda, Y \sim N(\cdot, \cdot)$$
$$\phi \mid \tau, \beta^s, \lambda, Y \sim G(\cdot, \cdot)$$
$$\lambda^2 \mid \beta^s, \phi, \tau^2, Y \sim G(\cdot, \cdot)$$
$$1/\tau_j^2 \mid \beta^s, \phi, \lambda, Y \sim \operatorname{InvGaussian}(\cdot, \cdot)$$

If $X \sim \operatorname{InvGaussian}(\mu, \lambda)$, its density is
$$f(x) = \sqrt{\frac{\lambda}{2\pi}}\, x^{-3/2}\, e^{-\frac{1}{2}\, \frac{\lambda (x - \mu)^2}{\mu^2 x}}, \qquad x > 0$$

Homework for next week: derive the full conditionals for $\beta^s$, $\phi$, and $1/\tau^2$; see http://www.stat.ufl.edu/~casella/papers/lasso.pdf
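A small sanity check of this inverse-Gaussian density (my own sketch, not from the slides; the values of mu and lambda are arbitrary test values), handy when coding the $1/\tau_j^2$ update:

```r
# Check that the inverse-Gaussian density above integrates to 1 and has mean mu.
dinvgauss_slide <- function(x, mu, lambda) {
  sqrt(lambda / (2 * pi)) * x^(-3/2) * exp(-lambda * (x - mu)^2 / (2 * mu^2 * x))
}
integrate(dinvgauss_slide, 0, Inf, mu = 2, lambda = 3)$value                     # ~ 1
integrate(function(x) x * dinvgauss_slide(x, mu = 2, lambda = 3), 0, Inf)$value  # ~ mu = 2
```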
Horseshoe

Carvalho, Polson & Scott propose the prior distribution
$$\beta^s \mid \phi, \tau \sim N(0_p,\; \operatorname{diag}(\tau^2)/\phi)$$
$$\tau_j \mid \lambda \overset{iid}{\sim} C^+(0, \lambda^2) \quad \text{(notation differs from CPS)}$$
$$\lambda \sim C^+(0, 1)$$
$$p(\alpha, \phi) \propto 1/\phi$$

In the case $\lambda = \phi = 1$ and with the canonical representation $Y = I\beta + \epsilon$,
$$E[\beta_i \mid Y] = \int_0^1 (1 - \kappa_i)\, y_i\, p(\kappa_i \mid Y)\, d\kappa_i = (1 - E[\kappa_i \mid y_i])\, y_i$$
where $\kappa_i = 1/(1 + \tau_i^2)$ is the shrinkage factor.

The half-Cauchy prior induces a Beta(1/2, 1/2) distribution on $\kappa_i$ a priori.
Horseshoe: Beta(1/2, 1/2) density of the shrinkage factor $\kappa_i$ on (0, 1) (figure).
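A quick Monte Carlo check of that induced prior (my own sketch, not from the slides):

```r
# With tau ~ C+(0, 1), the shrinkage factor kappa = 1/(1 + tau^2) should be
# Beta(1/2, 1/2) a priori; compare a Monte Carlo sample with the Beta density.
set.seed(1)
tau   <- abs(rcauchy(1e5))       # half-Cauchy C+(0, 1)
kappa <- 1 / (1 + tau^2)

hist(kappa, breaks = 50, freq = FALSE, main = "Induced prior on kappa")
curve(dbeta(x, 0.5, 0.5), add = TRUE, lwd = 2)
```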
Prior Comparison (from PSC)
Bounded Influence

Normal means case: $Y_i \overset{iid}{\sim} N(\beta_i, 1)$ (equivalent to the canonical case)

Posterior mean: $E[\beta \mid y] = y + \frac{d}{dy} \log m(y)$, where $m(y)$ is the predictive density of $y$ under the prior (known $\lambda$)

HS has bounded influence: $\lim_{y \to \infty} \frac{d}{dy} \log m(y) = 0$, so $\lim_{y \to \infty} E[\beta \mid y] \to y$ (the MLE)

DE also has bounded influence, but its bound does not decay to zero in the tails
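A numerical illustration of this contrast (my own sketch, not from the slides; the DE rate $\lambda = 1$ and the standard half-Cauchy scale for the HS prior are arbitrary choices):

```r
# Compare d/dy log m(y) for the horseshoe and double-exponential priors in the
# normal means problem Y ~ N(beta, 1), using one-dimensional numerical integration.
m_hs <- function(y) {
  # beta | tau ~ N(0, tau^2), tau ~ C+(0, 1)  =>  y | tau ~ N(0, 1 + tau^2)
  integrate(function(tau) dnorm(y, 0, sqrt(1 + tau^2)) * (2 / pi) / (1 + tau^2),
            0, Inf)$value
}
m_de <- function(y, lambda = 1) {
  # double-exponential (Lasso) prior with rate lambda
  integrate(function(b) dnorm(y - b) * (lambda / 2) * exp(-lambda * abs(b)),
            y - 10, y + 10)$value
}
score <- function(m, y, h = 1e-4) (log(m(y + h)) - log(m(y - h))) / (2 * h)

ys <- c(2, 5, 10)
sapply(ys, function(y) score(m_hs, y))  # decays toward 0, so E[beta | y] -> y
sapply(ys, function(y) score(m_de, y))  # approaches -lambda, a constant shift in the tail
```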
R packages

The monomvn package in R includes blasso and bhs; see the Diabetes.R code.
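A minimal sketch of how these samplers might be called (assuming the diabetes data from the lars package is what Diabetes.R uses; the chain length, seed, and burn-in are arbitrary, and RJ = FALSE is my choice, as I recall the interface, to turn off reversible-jump selection and keep these as pure shrinkage fits):

```r
# Bayesian Lasso and Horseshoe fits with monomvn on the diabetes data.
library(monomvn)
library(lars)                       # provides the diabetes data
data(diabetes)

X <- scale(diabetes$x)              # standardized predictors
y <- diabetes$y - mean(diabetes$y)  # centered response

set.seed(42)
fit.bl <- blasso(X, y, T = 2000, RJ = FALSE)  # Bayesian Lasso (Park & Casella)
fit.hs <- bhs(X, y, T = 2000, RJ = FALSE)     # Horseshoe (Carvalho, Polson & Scott)

# Posterior means of the coefficients after discarding a short burn-in
burn <- 200
colMeans(fit.bl$beta[-(1:burn), ])
colMeans(fit.hs$beta[-(1:burn), ])
```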
Simulation Study with Diabetes Data: RMSE compared for OLS, LASSO, and HORSESHOE (figure).
Other Options

Range of other scale mixtures used:

Generalized Double Pareto (Armagan, Dunson & Lee): if $\lambda \sim \operatorname{Gamma}(\alpha, \eta)$ then $\beta^s_j \sim \operatorname{GDP}(\xi = \eta/\alpha, \alpha)$ with density
$$f(\beta^s_j) = \frac{1}{2\xi} \left(1 + \frac{|\beta^s_j|}{\xi \alpha}\right)^{-(1+\alpha)}$$
see http://arxiv.org/pdf/1104.0861.pdf (a Monte Carlo sketch of this mixture follows below)

Normal-Exponential-Gamma (Griffin & Brown 2005): $\lambda^2 \sim \operatorname{Gamma}(\alpha, \eta)$

Bridge - Power Exponential priors (stable mixing density)

See the monomvn package on CRAN

Choice of prior? Properties? Fan & Li (JASA 2001) discuss variable selection via nonconcave penalties and oracle properties
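A Monte Carlo sketch of the GDP mixture mentioned above (my own illustration; $\alpha = 3$ and $\eta = 2$ are arbitrary, and $\phi = 1$ for simplicity): drawing $\lambda \sim \operatorname{Gamma}(\alpha, \eta)$ and then $\beta \mid \lambda$ from a double exponential with rate $\lambda$ should reproduce the $\operatorname{GDP}(\xi = \eta/\alpha, \alpha)$ density.

```r
# GDP as a gamma mixture of double exponentials.
set.seed(1)
alpha <- 3
eta   <- 2

lambda <- rgamma(1e5, shape = alpha, rate = eta)
beta   <- rexp(1e5, rate = lambda) * sample(c(-1, 1), 1e5, replace = TRUE)

xi   <- eta / alpha
dgdp <- function(b) 1 / (2 * xi) * (1 + abs(b) / (xi * alpha))^(-(1 + alpha))

hist(beta, breaks = 200, freq = FALSE, xlim = c(-5, 5), main = "GDP as a DE mixture")
curve(dgdp(x), add = TRUE, lwd = 2)
```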
Choice of Estimator & Selection?

Posterior mode (may set some coefficients to zero)

Posterior mean (no selection, just shrinkage)

The Bayesian posterior does not assign any probability to $\beta^s_j = 0$: selection based on the posterior mode, or an ad hoc rule (select if $\kappa_i < 0.5$); see the article by Datta & Ghosh http://ba.stat.cmu.edu/journal/forthcoming/datta.pdf

Selection solved as a post-analysis decision problem

Selection as part of model uncertainty: add a prior probability that $\beta^s_j = 0$ and combine with the decision problem

Remember: all models are wrong, but some may be useful!