Lasso & Bayesian Lasso


1 Lasso & Bayesian Lasso
Readings: Christensen, Chapter 15
Merlise Clyde, October 6, 2015

2 Lasso
Tibshirani (JRSS B 1996) proposed estimating coefficients through L1-constrained least squares: the Least Absolute Shrinkage and Selection Operator. The constraint controls how large the coefficients may grow:

\min_{\beta^c} (Y_c - X_c \beta^c)^T (Y_c - X_c \beta^c) \quad \text{subject to} \quad \sum_j |\beta_j^c| \le t

Equivalent quadratic programming problem for the penalized likelihood:

\min_{\beta^c} \|Y_c - X_c \beta^c\|^2 + \lambda \|\beta^c\|_1

Posterior mode:

\max_{\beta^c} \; -\frac{\phi}{2} \left\{ \|Y_c - X_c \beta^c\|^2 + \lambda \|\beta^c\|_1 \right\}
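In the special case of an orthonormal design (X_c^T X_c = I), the penalized problem above has a closed-form solution: each OLS coefficient is soft-thresholded at λ/2, which is where the exact zeros come from. A minimal sketch in Python (the slides use R; this orthonormal special case is for illustration only, since a general X needs coordinate descent or LARS):

```python
import numpy as np

def soft_threshold(b, thr):
    """Soft-thresholding operator: sign(b) * max(|b| - thr, 0)."""
    return np.sign(b) * np.maximum(np.abs(b) - thr, 0.0)

def lasso_orthonormal(X, y, lam):
    """Lasso solution when X'X = I: soft-threshold each OLS coefficient at lam/2."""
    b_ols = X.T @ y              # OLS estimate in the orthonormal case
    return soft_threshold(b_ols, lam / 2.0)
```

Coefficients smaller in magnitude than λ/2 are set exactly to zero, while larger ones are shrunk toward zero by λ/2: shrinkage and selection in one operation.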

7 Picture
[figure: not recoverable from the transcription]

8 R Code
> library(lars)
> longley.lars = lars(as.matrix(longley[,-7]), longley[,7], type="lasso")
> plot(longley.lars)
[figure: LASSO coefficient paths, standardized coefficients plotted against |beta|/max|beta|]

9 Solutions
> round(coef(longley.lars),5)
[table: coefficient matrix, rows [1,] through [11,], columns GNP.deflator, GNP, Unemployed, Armed.Forces, Population, Year; numeric values lost in transcription]

10 Cp Solution
Choose the step that minimizes

C_p = \frac{SSE_p}{\hat{\sigma}^2_F} - n + 2p

> summary(longley.lars)
LARS/LASSO
Call: lars(x = as.matrix(longley[, -7]), y = longley[, 7], type = "lasso")
[table: Df, Rss, and Cp for each step; numeric values lost in transcription]
[table: coefficients of the Cp-minimizing solution, row [7,], columns GNP.deflator, GNP, Unemployed, Armed.Forces, Population, Year; values lost in transcription]
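A useful sanity check on the C_p formula above is the standard fact that the full model's C_p equals its number of parameters, since then SSE_p/(n - p) is exactly the variance estimate. A minimal sketch (the numbers here are hypothetical, since the transcribed longley output lost its values):

```python
def mallows_cp(sse_p, sigma2_full, n, p):
    """Mallows' Cp = SSE_p / sigma^2_F - n + 2p, where sigma^2_F is the
    error-variance estimate from the full model."""
    return sse_p / sigma2_full - n + 2 * p

# Full model: sigma^2_F = SSE_F / (n - p_F), so Cp reduces to p_F exactly.
n, p_full = 16, 7                      # hypothetical sizes
sse_full = 3.5                         # hypothetical full-model SSE
sigma2 = sse_full / (n - p_full)
print(mallows_cp(sse_full, sigma2, n, p_full))   # equals p_full = 7
```

Submodels with C_p well above their own p are flagged as underfitting; the C_p-minimizing step is the one `summary(longley.lars)` would pick.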

13 Features
Combines shrinkage (like ridge regression) with selection (like stepwise selection).
Uncertainty? Interval estimates?

15 Bayesian Lasso
Park & Casella (JASA 2008) and Hans (Biometrika 2010) propose Bayesian versions of the Lasso:

Y | α, β, φ ~ N(1_n α + X_c β, I_n/φ)
β | α, φ, τ ~ N(0, diag(τ²)/φ)
τ_1², …, τ_p² | α, φ ~ iid Exp(λ²/2)
p(α, φ) ∝ 1/φ

Can show that β_j | φ, λ ~ iid DE(λ√φ):

\int_0^\infty \sqrt{\frac{\phi}{2\pi s}}\, e^{-\frac{1}{2}\frac{\phi \beta^2}{s}} \; \frac{\lambda^2}{2} e^{-\frac{\lambda^2 s}{2}} \, ds = \frac{\lambda \phi^{1/2}}{2} e^{-\lambda \phi^{1/2} |\beta|}

a scale mixture of normals (Andrews and Mallows 1974).
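The mixture identity above is easy to check by simulation: drawing s ~ Exp(λ²/2) and then β | s ~ N(0, s/φ) should reproduce a double exponential with rate λ√φ, which has E|β| = 1/(λ√φ) and Var(β) = 2/(λ²φ). A sketch (numpy only; illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, phi, m = 2.0, 1.5, 400_000

# s ~ Exp(rate = lam^2 / 2); numpy parametrizes by the mean = 2 / lam^2
s = rng.exponential(scale=2.0 / lam**2, size=m)
# beta | s ~ N(0, s / phi)
beta = rng.normal(0.0, np.sqrt(s / phi))

rate = lam * np.sqrt(phi)              # DE(lam * sqrt(phi)) rate
print(np.mean(np.abs(beta)), 1.0 / rate)   # the two should agree closely
print(np.var(beta), 2.0 / rate**2)
```

The same construction is what makes Gibbs sampling possible: conditioning on the latent scales returns the model to conjugate normal-gamma form.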

22 Gibbs Sampling
Integrate out α: α | Y, φ ~ N(ȳ, 1/(nφ))

β | τ, φ, λ, Y ~ N(·, ·)
φ | τ, β, λ, Y ~ G(·, ·)
1/τ_j² | β, φ, λ, Y ~ InvGaussian(·, ·)

where X ~ InvGaussian(μ, λ) has density

f(x) = \sqrt{\frac{\lambda}{2\pi}}\, x^{-3/2}\, e^{-\frac{1}{2}\frac{\lambda (x-\mu)^2}{\mu^2 x}}, \quad x > 0

Homework: derive the full conditionals for β, φ, and 1/τ².
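For reference, the whole sampler fits in a few lines once the full conditionals are known. The sketch below assumes the Park & Casella (2008) forms, rewritten for the precision parametrization φ = 1/σ² used on these slides, with α handled by centering; it is illustrative, not a substitute for deriving the conditionals in the homework:

```python
import numpy as np

def bayesian_lasso_gibbs(X, y, lam=1.0, n_iter=2000, seed=0):
    """Gibbs sampler for the Bayesian lasso (sketch). Full conditionals
    follow Park & Casella (2008), adapted to phi = 1/sigma^2; the
    intercept alpha is integrated out by centering X and y."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    Xc = X - X.mean(axis=0)                  # centered predictors X_c
    yc = y - y.mean()                        # centered response
    beta = np.linalg.lstsq(Xc, yc, rcond=None)[0]
    phi, tau2 = 1.0, np.ones(p)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        # beta | tau, phi, Y ~ N(A^{-1} Xc'yc, A^{-1}/phi), A = Xc'Xc + diag(1/tau2)
        A_inv = np.linalg.inv(Xc.T @ Xc + np.diag(1.0 / tau2))
        beta = rng.multivariate_normal(A_inv @ Xc.T @ yc, A_inv / phi)
        # phi | beta, tau, Y ~ Gamma((n - 1 + p)/2, rate)
        resid = yc - Xc @ beta
        rate = 0.5 * (resid @ resid + beta @ (beta / tau2))
        phi = rng.gamma(0.5 * (n - 1 + p), 1.0 / rate)
        # 1/tau2_j | beta, phi ~ InvGaussian(lam / (sqrt(phi) |beta_j|), lam^2)
        mu = lam / (np.sqrt(phi) * np.abs(beta))
        tau2 = 1.0 / rng.wald(mu, lam**2)
        draws[t] = beta
    return draws
```

`numpy.random.Generator.wald` draws the inverse-Gaussian variates directly, so no custom sampler is needed for the 1/τ_j² step.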

28 Other Options
Range of other scale mixtures used:
Horseshoe (Carvalho, Polson & Scott)
Generalized Double Pareto (Armagan, Dunson & Lee)
Normal-Exponential-Gamma (Griffin & Brown)
Bridge: Power Exponential priors
Properties of the prior?

35 Horseshoe
Carvalho, Polson & Scott propose the prior distribution

β | φ ~ N(0_p, diag(τ²)/φ)
τ_j | λ ~ iid C⁺(0, λ)
λ | φ ~ C⁺(0, 1/φ)
p(α, φ) ∝ 1/φ

In the case λ = φ = 1 and with X^T X = I, let y* = X^T Y. Then

E[\beta_i \mid Y] = \int_0^1 (1 - \kappa_i)\, y_i^*\; p(\kappa_i \mid Y)\, d\kappa_i = (1 - E[\kappa_i \mid y_i^*])\, y_i^*

where κ_i = 1/(1 + τ_i²) is the shrinkage factor. The half-Cauchy prior induces a Beta(1/2, 1/2) distribution on κ_i a priori.
42 Horseshoe
[figure: Beta(1/2, 1/2) density of the shrinkage factor κ, U-shaped with infinite spikes at 0 and 1]
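The Beta(1/2, 1/2) claim is easy to verify by simulation: with τ_i ~ C⁺(0, 1), the shrinkage factor κ_i = 1/(1 + τ_i²) follows the arcsine law, piling mass near 0 (signals survive almost unshrunk) and near 1 (noise is shrunk essentially to zero). A sketch (numpy only; illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 500_000
tau = np.abs(rng.standard_cauchy(m))     # half-Cauchy C+(0, 1)
kappa = 1.0 / (1.0 + tau**2)             # shrinkage factor

# Beta(1/2, 1/2) has mean 1/2 and variance 1/8
print(kappa.mean(), kappa.var())
# Its CDF is P(kappa <= x) = (2/pi) arcsin(sqrt(x))
x = 0.3
print(np.mean(kappa <= x), 2.0 / np.pi * np.arcsin(np.sqrt(x)))
```

The U-shape is exactly the "horseshoe" of the method's name, and it is what distinguishes this prior from the lasso's double exponential, whose induced shrinkage profile has no spike at κ = 0.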

43 Simulation Study with Diabetes Data
[table: RMSE for OLS, LASSO, and HORSESHOE; numeric values lost in transcription]

44 Other Options
Range of other scale mixtures used:
Generalized Double Pareto (Armagan, Dunson & Lee): if λ ~ Gamma(α, η), then β_j ~ GDP(ξ = η/α, α), with density

f(\beta_j) = \frac{1}{2\xi} \left(1 + \frac{|\beta_j|}{\xi \alpha}\right)^{-(1+\alpha)}

Normal-Exponential-Gamma (Griffin & Brown 2005): λ² ~ Gamma(α, η)
Bridge: Power Exponential priors (stable mixing density)
See the monomvn package on CRAN.
Choice of prior? Properties? Fan & Li (JASA 2001) discuss variable selection via nonconcave penalties and oracle properties.
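As a sanity check on the GDP density above, it is a proper density (it integrates to 1) with polynomial, Student-t-like tails, which is what gives the prior its robustness to large signals. A numerical sketch (numpy only; the truncation range is an assumption that works because α > 1 tails decay fast enough):

```python
import numpy as np

def gdp_pdf(beta, xi, alpha):
    """GDP(xi, alpha) density: f(beta) = 1/(2 xi) (1 + |beta|/(xi alpha))^-(1+alpha)."""
    return (1.0 / (2.0 * xi)) * (1.0 + np.abs(beta) / (xi * alpha)) ** (-(1.0 + alpha))

# Riemann-sum check that the density integrates to ~1; with alpha = 3 the
# mass beyond |beta| = 3000 is on the order of 1e-10
beta = np.linspace(-3000.0, 3000.0, 2_000_001)
dx = beta[1] - beta[0]
total = gdp_pdf(beta, xi=1.0, alpha=3.0).sum() * dx
print(total)   # close to 1
```

Smaller α gives heavier tails (slower polynomial decay), so α trades off shrinkage of noise against bias on large coefficients, the tension Fan & Li's oracle discussion formalizes.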

52 Choice of Estimator & Selection?
Posterior mode (may set some coefficients to zero)
Posterior mean (no selection)
The Bayesian posterior does not assign any probability to β_j = 0, so selection based on the posterior mode is an ad hoc rule (e.g., select if κ_i < 0.5).
See the article by Datta & Ghosh: http://ba.stat.cmu.edu/journal/forthcoming/datta.pdf
Selection can be solved as a post-analysis decision problem.
Alternatively, make selection part of model uncertainty: add prior probability that β_j = 0 and combine with the decision problem.


More information

Lecture 5: Soft-Thresholding and Lasso

Lecture 5: Soft-Thresholding and Lasso High Dimensional Data and Statistical Learning Lecture 5: Soft-Thresholding and Lasso Weixing Song Department of Statistics Kansas State University Weixing Song STAT 905 October 23, 2014 1/54 Outline Penalized

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Linear Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574 1

More information

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010 Model-Averaged l 1 Regularization using Markov Chain Monte Carlo Model Composition Technical Report No. 541 Department of Statistics, University of Washington Chris Fraley and Daniel Percival August 22,

More information

Hierarchical Bayesian Modeling with Approximate Bayesian Computation

Hierarchical Bayesian Modeling with Approximate Bayesian Computation Hierarchical Bayesian Modeling with Approximate Bayesian Computation Jessi Cisewski Carnegie Mellon University June 2014 Goal of lecture Bring the ideas of Hierarchical Bayesian Modeling (HBM), Approximate

More information

MSG500/MVE190 Linear Models - Lecture 15

MSG500/MVE190 Linear Models - Lecture 15 MSG500/MVE190 Linear Models - Lecture 15 Rebecka Jörnsten Mathematical Statistics University of Gothenburg/Chalmers University of Technology December 13, 2012 1 Regularized regression In ordinary least

More information

GENOMIC SELECTION WORKSHOP: Hands on Practical Sessions (BL)

GENOMIC SELECTION WORKSHOP: Hands on Practical Sessions (BL) GENOMIC SELECTION WORKSHOP: Hands on Practical Sessions (BL) Paulino Pérez 1 José Crossa 2 1 ColPos-México 2 CIMMyT-México September, 2014. SLU,Sweden GENOMIC SELECTION WORKSHOP:Hands on Practical Sessions

More information

Robust Bayesian Regression

Robust Bayesian Regression Readings: Hoff Chapter 9, West JRSSB 1984, Fúquene, Pérez & Pericchi 2015 Duke University November 17, 2016 Body Fat Data: Intervals w/ All Data Response % Body Fat and Predictor Waist Circumference 95%

More information

High-dimensional regression with unknown variance

High-dimensional regression with unknown variance High-dimensional regression with unknown variance Christophe Giraud Ecole Polytechnique march 2012 Setting Gaussian regression with unknown variance: Y i = f i + ε i with ε i i.i.d. N (0, σ 2 ) f = (f

More information

A Hybrid Bayesian Approach for Genome-Wide Association Studies on Related Individuals

A Hybrid Bayesian Approach for Genome-Wide Association Studies on Related Individuals Bioinformatics Advance Access published August 30 015 A Hybrid Bayesian Approach for Genome-Wide Association Studies on Related Individuals A. Yazdani 1 D. B. Dunson 1 Human Genetic Center University of

More information

An Introduction to Bayesian Linear Regression

An Introduction to Bayesian Linear Regression An Introduction to Bayesian Linear Regression APPM 5720: Bayesian Computation Fall 2018 A SIMPLE LINEAR MODEL Suppose that we observe explanatory variables x 1, x 2,..., x n and dependent variables y 1,

More information

A sparse factor analysis model for high dimensional latent spaces

A sparse factor analysis model for high dimensional latent spaces A sparse factor analysis model for high dimensional latent spaces Chuan Gao Institute of Genome Sciences & Policy Duke University cg48@duke.edu Barbara E Engelhardt Department of Biostatistics & Bioinformatics

More information

Reliability Monitoring Using Log Gaussian Process Regression

Reliability Monitoring Using Log Gaussian Process Regression COPYRIGHT 013, M. Modarres Reliability Monitoring Using Log Gaussian Process Regression Martin Wayne Mohammad Modarres PSA 013 Center for Risk and Reliability University of Maryland Department of Mechanical

More information

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models

A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models A Bootstrap Lasso + Partial Ridge Method to Construct Confidence Intervals for Parameters in High-dimensional Sparse Linear Models Jingyi Jessica Li Department of Statistics University of California, Los

More information

Computationally Efficient MCMC for Hierarchical Bayesian Inverse Problems

Computationally Efficient MCMC for Hierarchical Bayesian Inverse Problems Computationally Efficient MCMC for Hierarchical Bayesian Inverse Problems Andrew Brown 1, Arvind Saibaba 2, Sarah Vallélian 2,3 SAMSI January 28, 2017 Supported by SAMSI Visiting Research Fellowship NSF

More information

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013)

A Survey of L 1. Regression. Céline Cunen, 20/10/2014. Vidaurre, Bielza and Larranaga (2013) A Survey of L 1 Regression Vidaurre, Bielza and Larranaga (2013) Céline Cunen, 20/10/2014 Outline of article 1.Introduction 2.The Lasso for Linear Regression a) Notation and Main Concepts b) Statistical

More information

Sparse additive Gaussian process with soft interactions. Title

Sparse additive Gaussian process with soft interactions. Title Sparse additive Gaussian process with soft interactions arxiv:1607.02670v1 [stat.ml] 9 Jul 2016 Garret Vo Department of Industrial and Manufacturing Engineering, Florida State University, Tallahassee,

More information

Log Covariance Matrix Estimation

Log Covariance Matrix Estimation Log Covariance Matrix Estimation Xinwei Deng Department of Statistics University of Wisconsin-Madison Joint work with Kam-Wah Tsui (Univ. of Wisconsin-Madsion) 1 Outline Background and Motivation The Proposed

More information

On Mixture Regression Shrinkage and Selection via the MR-LASSO

On Mixture Regression Shrinkage and Selection via the MR-LASSO On Mixture Regression Shrinage and Selection via the MR-LASSO Ronghua Luo, Hansheng Wang, and Chih-Ling Tsai Guanghua School of Management, Peing University & Graduate School of Management, University

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

IEOR165 Discussion Week 5

IEOR165 Discussion Week 5 IEOR165 Discussion Week 5 Sheng Liu University of California, Berkeley Feb 19, 2016 Outline 1 1st Homework 2 Revisit Maximum A Posterior 3 Regularization IEOR165 Discussion Sheng Liu 2 About 1st Homework

More information

Scalable MCMC for the horseshoe prior

Scalable MCMC for the horseshoe prior Scalable MCMC for the horseshoe prior Anirban Bhattacharya Department of Statistics, Texas A&M University Joint work with James Johndrow and Paolo Orenstein September 7, 2018 Cornell Day of Statistics

More information

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables

Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables Smoothly Clipped Absolute Deviation (SCAD) for Correlated Variables LIB-MA, FSSM Cadi Ayyad University (Morocco) COMPSTAT 2010 Paris, August 22-27, 2010 Motivations Fan and Li (2001), Zou and Li (2008)

More information

The RXshrink Package

The RXshrink Package The RXshrink Package October 20, 2005 Title Maximum Likelihood Shrinkage via Ridge or Least Angle Regression Version 1.0-2 Date 2005-10-19 Author Maintainer Depends R (>= 1.8.0), lars Identify and display

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring

More information

Weakly informative priors

Weakly informative priors Department of Statistics and Department of Political Science Columbia University 23 Apr 2014 Collaborators (in order of appearance): Gary King, Frederic Bois, Aleks Jakulin, Vince Dorie, Sophia Rabe-Hesketh,

More information

Homework 1: Solutions

Homework 1: Solutions Homework 1: Solutions Statistics 413 Fall 2017 Data Analysis: Note: All data analysis results are provided by Michael Rodgers 1. Baseball Data: (a) What are the most important features for predicting players

More information

Least Squares Regression

Least Squares Regression E0 70 Machine Learning Lecture 4 Jan 7, 03) Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in the lecture. They are not a substitute

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 10 COS53: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 0 MELISSA CARROLL, LINJIE LUO. BIAS-VARIANCE TRADE-OFF (CONTINUED FROM LAST LECTURE) If V = (X n, Y n )} are observed data, the linear regression problem

More information

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty

Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Journal of Data Science 9(2011), 549-564 Selection of Smoothing Parameter for One-Step Sparse Estimates with L q Penalty Masaru Kanba and Kanta Naito Shimane University Abstract: This paper discusses the

More information

Chapter 4 HOMEWORK ASSIGNMENTS. 4.1 Homework #1

Chapter 4 HOMEWORK ASSIGNMENTS. 4.1 Homework #1 Chapter 4 HOMEWORK ASSIGNMENTS These homeworks may be modified as the semester progresses. It is your responsibility to keep up to date with the correctly assigned homeworks. There may be some errors in

More information

STAT 462-Computational Data Analysis

STAT 462-Computational Data Analysis STAT 462-Computational Data Analysis Chapter 5- Part 2 Nasser Sadeghkhani a.sadeghkhani@queensu.ca October 2017 1 / 27 Outline Shrinkage Methods 1. Ridge Regression 2. Lasso Dimension Reduction Methods

More information

Scalable Bayesian shrinkage and uncertainty quantification for high-dimensional regression

Scalable Bayesian shrinkage and uncertainty quantification for high-dimensional regression Scalable Bayesian shrinkage and uncertainty quantification for high-dimensional regression arxiv:1509.03697v [stat.me] 19 Apr 017 Bala Rajaratnam Department of Statistics, University of California, Davis

More information

ABSTRACT. POST, JUSTIN BLAISE. Methods to Improve Prediction Accuracy under Structural Constraints. (Under the direction of Howard Bondell.

ABSTRACT. POST, JUSTIN BLAISE. Methods to Improve Prediction Accuracy under Structural Constraints. (Under the direction of Howard Bondell. ABSTRACT POST, JUSTIN BLAISE. Methods to Improve Prediction Accuracy under Structural Constraints. (Under the direction of Howard Bondell.) Statisticians are often faced with the difficult task of model

More information

Frequentist Accuracy of Bayesian Estimates

Frequentist Accuracy of Bayesian Estimates Frequentist Accuracy of Bayesian Estimates Bradley Efron Stanford University RSS Journal Webinar Objective Bayesian Inference Probability family F = {f µ (x), µ Ω} Parameter of interest: θ = t(µ) Prior

More information