Lasso & Bayesian Lasso

Readings: Christensen, Chapter 15

Merlise Clyde

October 6, 2015

Lasso

Tibshirani (JRSS B 1996) proposed estimating coefficients through $L_1$-constrained least squares: the Least Absolute Shrinkage and Selection Operator controls how large the coefficients may grow,

$$\min_{\beta_c} (Y_c - X_c \beta_c)^T (Y_c - X_c \beta_c) \quad \text{subject to} \quad \sum_j |\beta_{cj}| \le t$$

Equivalent quadratic programming problem for the penalized likelihood:

$$\min_{\beta_c} \|Y_c - X_c \beta_c\|^2 + \lambda \|\beta_c\|_1$$

Posterior mode:

$$\max_{\beta_c} \; -\frac{\phi}{2} \left\{ \|Y_c - X_c \beta_c\|^2 + \lambda \|\beta_c\|_1 \right\}$$
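To make the shrinkage-plus-selection behavior concrete, here is a minimal sketch (not from the original slides) of the orthonormal-design special case, where the lasso minimizer is soft thresholding of the least-squares coefficients; the threshold $\lambda/2$ follows from the penalized criterion above.

# Sketch, assuming an orthonormal design (X^T X = I): the lasso solution is
# beta_j = sign(beta_ols_j) * max(|beta_ols_j| - lambda/2, 0).
soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(abs(beta_ols) - lambda / 2, 0)
}
beta_ols <- c(-3, -0.4, 0.1, 2.5)
soft_threshold(beta_ols, lambda = 1)   # small coefficients are set exactly to 0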

R Code

> library(lars)
> longley.lars = lars(as.matrix(longley[, -7]), longley[, 7], type = "lasso")
> plot(longley.lars)

[Plot: LASSO coefficient paths; y-axis "Standardized Coefficients", x-axis $|\beta|/\max|\beta|$, with vertical lines at the steps where variables enter or leave.]

Solutions

> round(coef(longley.lars), 5)
      GNP.deflator      GNP Unemployed Armed.Forces Population    Year
 [1,]      0.00000  0.00000    0.00000      0.00000    0.00000 0.00000
 [2,]      0.00000  0.03273    0.00000      0.00000    0.00000 0.00000
 [3,]      0.00000  0.03623   -0.00372      0.00000    0.00000 0.00000
 [4,]      0.00000  0.03717   -0.00459     -0.00099    0.00000 0.00000
 [5,]      0.00000  0.00000   -0.01242     -0.00539    0.00000 0.90681
 [6,]      0.00000  0.00000   -0.01412     -0.00713    0.00000 0.94375
 [7,]      0.00000  0.00000   -0.01471     -0.00861   -0.15337 1.18430
 [8,]     -0.00770  0.00000   -0.01481     -0.00873   -0.17076 1.22888
 [9,]      0.00000 -0.01212   -0.01663     -0.00927   -0.13029 1.43192
[10,]      0.00000 -0.02534   -0.01869     -0.00989   -0.09514 1.68655
[11,]      0.01506 -0.03582   -0.02020     -0.01033   -0.05110 1.82915

Cp Solution

Choose the step that minimizes $C_p = SSE_p/\hat{\sigma}^2_F - n + 2p$:

> summary(longley.lars)
LARS/LASSO
Call: lars(x = as.matrix(longley[, -7]), y = longley[, 7], type = "lasso")
   Df     Rss        Cp
0   1 185.009 1976.7120
1   2   6.642   59.4712
2   3   3.883   31.7832
3   4   3.468   29.3165
4   5   1.563   10.8183
5   4   1.339    6.4068
6   5   1.024    5.0186
7   6   0.998    6.7388
8   7   0.907    7.7615
9   6   0.847    5.1128
10  7   0.836    7.0000

The $C_p$-minimizing solution (step 6):

     GNP.deflator     GNP Unemployed Armed.Forces Population    Year
[7,]      0.00000 0.00000   -0.01471     -0.00861   -0.15337 1.18430
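As a usage note (not on the original slide), the $C_p$-minimizing step can be pulled out programmatically, assuming the summary of the lars fit exposes the Cp column shown above:

# Hypothetical follow-up to the lars fit above: pick the step with smallest Cp.
lars.sum <- summary(longley.lars)        # anova table with Df, Rss, Cp columns
best <- which.min(lars.sum$Cp)           # row index; row 1 corresponds to step 0
round(coef(longley.lars)[best, ], 5)     # coefficients at the Cp-optimal step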

Features

Combines shrinkage (like ridge regression) with selection (like stepwise selection).

Uncertainty? Interval estimates?

Bayesian Lasso

Park & Casella (JASA 2008) and Hans (Biometrika 2010) propose Bayesian versions of the lasso:

$$Y \mid \alpha, \beta, \phi \sim N(1_n \alpha + X_c \beta, \; I_n/\phi)$$
$$\beta \mid \alpha, \phi, \tau \sim N(0, \; \mathrm{diag}(\tau^2)/\phi)$$
$$\tau_1^2, \ldots, \tau_p^2 \mid \alpha, \phi \overset{iid}{\sim} \mathrm{Exp}(\lambda^2/2)$$
$$p(\alpha, \phi) \propto 1/\phi$$

Can show that $\beta_j \mid \phi, \lambda \overset{iid}{\sim} DE(\lambda \sqrt{\phi})$:

$$\int_0^\infty \sqrt{\frac{\phi}{2\pi s}} \, e^{-\frac{\phi \beta^2}{2s}} \; \frac{\lambda^2}{2} e^{-\frac{\lambda^2 s}{2}} \, ds = \frac{\lambda \phi^{1/2}}{2} e^{-\lambda \phi^{1/2} |\beta|}$$

a scale mixture of normals (Andrews and Mallows 1974).
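A quick numerical check of this scale-mixture identity (my own sketch, not from the slides): draw $\tau^2 \sim \mathrm{Exp}(\lambda^2/2)$, then $\beta \mid \tau^2 \sim N(0, \tau^2/\phi)$, and compare the simulated draws with the double-exponential density above.

# Sketch: verify beta | phi, lambda ~ DE(lambda * sqrt(phi)) by simulation.
set.seed(42)
lambda <- 2; phi <- 1.5; n <- 1e5
tau2 <- rexp(n, rate = lambda^2 / 2)              # tau_j^2 ~ Exp(lambda^2 / 2)
beta <- rnorm(n, mean = 0, sd = sqrt(tau2 / phi)) # beta | tau^2 ~ N(0, tau^2 / phi)
# Double-exponential (Laplace) density with rate lambda * sqrt(phi):
de_dens <- function(b) lambda * sqrt(phi) / 2 * exp(-lambda * sqrt(phi) * abs(b))
hist(beta, breaks = 100, freq = FALSE, main = "Scale mixture vs DE density")
curve(de_dens(x), add = TRUE, col = "red", lwd = 2)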

Gibbs Sampling

Integrate out $\alpha$: $\alpha \mid Y, \phi \sim N(\bar{y}, \; 1/(n\phi))$

$$\beta \mid \tau, \phi, \lambda, Y \sim N(\cdot, \cdot)$$
$$\phi \mid \tau, \beta, \lambda, Y \sim G(\cdot, \cdot)$$
$$1/\tau_j^2 \mid \beta, \phi, \lambda, Y \sim \mathrm{InvGaussian}(\cdot, \cdot)$$

where $X \sim \mathrm{InvGaussian}(\mu, \lambda)$ has density

$$f(x) = \sqrt{\frac{\lambda^2}{2\pi}} \, x^{-3/2} \, e^{-\frac{1}{2} \frac{\lambda^2 (x - \mu)^2}{\mu^2 x}}, \quad x > 0$$

Homework: derive the full conditionals for $\beta$, $\phi$, $1/\tau^2$; see http://www.stat.ufl.edu/~casella/papers/lasso.pdf
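For readers who want to run such a sampler rather than code one from scratch, the blasso function in the monomvn package (mentioned again below) implements a Gibbs sampler of this flavor. A minimal sketch on the longley data, with argument and burn-in choices that are my own assumptions:

# Sketch: Bayesian lasso via Gibbs sampling with monomvn::blasso.
library(monomvn)
X <- as.matrix(longley[, -7]); y <- longley[, 7]
fit <- blasso(X, y, T = 2000, RJ = FALSE, verb = 0)  # T MCMC draws, no model jumps
burn <- 1:500                                        # discard burn-in (assumption)
round(colMeans(fit$beta[-burn, ]), 5)                # posterior means of beta
apply(fit$beta[-burn, ], 2, quantile, c(.025, .975)) # 95% credible intervals

The credible intervals answer the "Uncertainty? Interval estimates?" question raised earlier, which the frequentist lasso leaves open.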

Other Options

Range of other scale mixtures used:

- Horseshoe (Carvalho, Polson & Scott)
- Generalized Double Pareto (Armagan, Dunson & Lee)
- Normal-Exponential-Gamma (Griffin & Brown)
- Bridge: power exponential priors

Properties of the prior?

Horseshoe

Carvalho, Polson & Scott propose the prior distribution

$$\beta \mid \phi \sim N\!\left(0_p, \; \frac{\mathrm{diag}(\tau^2)}{\phi}\right)$$
$$\tau_j \mid \lambda \overset{iid}{\sim} C^+(0, \lambda)$$
$$\lambda \sim C^+(0, 1/\phi)$$
$$p(\alpha, \phi) \propto 1/\phi$$

In the case $\lambda = \phi = 1$ and with $X^T X = I$, $Y^* = X^T Y$,

$$E[\beta_i \mid Y] = \int_0^1 (1 - \kappa_i) \, y_i^* \, p(\kappa_i \mid Y) \, d\kappa_i = (1 - E[\kappa_i \mid y_i^*]) \, y_i^*$$

where $\kappa_i = 1/(1 + \tau_i^2)$ is the shrinkage factor. The half-Cauchy prior induces a Beta(1/2, 1/2) distribution on $\kappa_i$ a priori.
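The Beta(1/2, 1/2) claim is easy to verify by simulation (my own sketch): draw $\tau \sim C^+(0, 1)$, form $\kappa = 1/(1 + \tau^2)$, and compare against the Beta(1/2, 1/2) density.

# Sketch: a half-Cauchy scale implies kappa = 1/(1 + tau^2) ~ Beta(1/2, 1/2).
set.seed(1)
tau <- abs(rcauchy(1e5))                 # tau ~ C+(0, 1)
kappa <- 1 / (1 + tau^2)                 # shrinkage factor
hist(kappa, breaks = 50, freq = FALSE, main = "Prior on the shrinkage factor")
curve(dbeta(x, 0.5, 0.5), add = TRUE, col = "red", lwd = 2)  # horseshoe shape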

Horseshoe

[Figure: Beta(1/2, 1/2) density of the shrinkage factor $\kappa$ on (0, 1); the density is unbounded at 0 and 1, giving the horseshoe shape.]

Simulation Study with Diabetes Data

[Figure: RMSE (roughly 50 to 80) compared across OLS, LASSO, and HORSESHOE.]

Other Options

Range of other scale mixtures used:

- Generalized Double Pareto (Armagan, Dunson & Lee): if $\lambda \sim \mathrm{Gamma}(\alpha, \eta)$ then $\beta_j \sim GDP(\xi = \eta/\alpha, \alpha)$ with density
  $$f(\beta_j) = \frac{1}{2\xi} \left(1 + \frac{|\beta_j|}{\xi \alpha}\right)^{-(1+\alpha)}$$
  see http://arxiv.org/pdf/1104.0861.pdf (its tails are compared in the sketch below)
- Normal-Exponential-Gamma (Griffin & Brown 2005): $\lambda^2 \sim \mathrm{Gamma}(\alpha, \eta)$
- Bridge: power exponential priors (stable mixing density)

See the monomvn package on CRAN.

Choice of prior? Properties? Fan & Li (JASA 2001) discuss variable selection via nonconcave penalties and oracle properties.
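To see how these alternatives differ in their tails, one can plot the GDP density above against the double-exponential prior of the Bayesian lasso; a small sketch, with parameter values chosen arbitrarily for illustration:

# Sketch: compare the GDP density (formula above) with a double-exponential prior.
dgdp <- function(b, xi, alpha) 1 / (2 * xi) * (1 + abs(b) / (xi * alpha))^(-(1 + alpha))
dde  <- function(b, rate) rate / 2 * exp(-rate * abs(b))
curve(dgdp(x, xi = 1, alpha = 1), -6, 6, ylab = "density", lwd = 2)
curve(dde(x, rate = 1), add = TRUE, col = "red", lwd = 2, lty = 2)
legend("topright", c("GDP(1, 1)", "DE(1)"), col = c("black", "red"), lty = 1:2)
# The GDP has polynomial (Cauchy-like) tails, so large coefficients are shrunk less.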

Choice of Estimator & Selection?

- Posterior mode (may set some coefficients to zero)
- Posterior mean (no selection)
- The Bayesian posterior does not assign any probability to $\beta_j = 0$, so selection based on the posterior mode is an ad hoc rule, e.g. select variable $i$ if $\kappa_i < 0.5$ (a small sketch of this rule follows below). See the article by Datta & Ghosh, http://ba.stat.cmu.edu/journal/forthcoming/datta.pdf
- Selection solved as a post-analysis decision problem
- Selection as part of model uncertainty: add a prior probability that $\beta_j = 0$ and combine with the decision problem
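A minimal sketch of the ad hoc $\kappa_i < 0.5$ rule mentioned above, assuming one has posterior draws of the local scales $\tau_i^2$; the matrix tau2_draws here is hypothetical (in practice it would be harvested from a horseshoe-type sampler):

# Hypothetical input: tau2_draws is an (iterations x p) matrix of posterior
# draws of tau_i^2 from a horseshoe-type sampler.
tau2_draws <- matrix(rexp(1000 * 6), 1000, 6)  # placeholder draws for illustration
kappa <- 1 / (1 + tau2_draws)                  # shrinkage factor for each draw
kappa_hat <- colMeans(kappa)                   # posterior mean shrinkage per variable
selected <- which(kappa_hat < 0.5)             # ad hoc rule: keep if kappa_i < 0.5
selected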