A New Bayesian Variable Selection Method: The Bayesian Lasso with Pseudo Variables


1 A New Bayesian Variable Selection Method: The Bayesian Lasso with Pseudo Variables
Qi Tang (joint work with Kam-Wah Tsui and Sijian Wang)
Department of Statistics, University of Wisconsin-Madison
Feb. 8

2 Outline
Introduction of Bayesian Lasso
Bayesian Lasso with Pseudo Variables
Bayesian Group Lasso with Pseudo Variables
Conclusions and Future Work

3 Variable Selection
Why?
- Interpretation: principle of parsimony.
- Prediction: bias and variance tradeoff.
What if the number of variables is greater than the number of observations ($p > n$)? Shrinkage.
- Frequentist: loss + penalty. Examples: Ridge regression (Hoerl and Kennard, 1970), Lasso (Tibshirani, 1996).
- Bayesian: likelihood x shrinkage prior. Examples: Griffin and Brown (2005), Park and Casella (2008).

4 Notation
Consider a data set with one response variable, $p$ predictors and $n$ observations. Focus on linear models:
$y = X\beta + \epsilon$, $\epsilon \sim N(0, \sigma^2 I_n)$.
$y$ is the centered response; the $X_i$'s, the columns of $X$, are standardized to have mean 0 and unit $L_2$ norm.
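
This centering and standardization is easy to get subtly wrong (unit $L_2$ norm, not unit standard deviation), so a minimal NumPy sketch of exactly this preprocessing may help; the data here are simulated placeholders, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 8
X_raw = rng.standard_normal((n, p))
y_raw = X_raw @ rng.standard_normal(p) + rng.standard_normal(n)

# Center the response.
y = y_raw - y_raw.mean()

# Standardize each column of X to mean 0 and unit L2 norm.
X = X_raw - X_raw.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)

assert np.allclose(X.mean(axis=0), 0)
assert np.allclose(np.linalg.norm(X, axis=0), 1)
```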

5 Bayesian Interpretation of Lasso
Lasso (Tibshirani, 1996):
$\min_\beta \{ \|y - X\beta\|^2 + \lambda \sum_{i=1}^p |\beta_i| \}$  (1)
Bayesian interpretation: consider the Bayesian model $y \sim N(X\beta, I_n)$ with the Laplacian prior $\beta_i \sim \frac{\lambda}{2} e^{-\lambda |\beta_i|}$. The solution of (1) can be interpreted as the posterior mode of $\beta$.
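
To make the correspondence concrete, here is a small numerical check (my own sketch, not from the slides): the negative log-posterior under $y \sim N(X\beta, I_n)$ with i.i.d. Laplace$(0, 1/\lambda)$ priors differs from the penalized criterion $\frac{1}{2}\|y - X\beta\|^2 + \lambda \sum_i |\beta_i|$ only by an additive constant, so the two share the same minimizer, the posterior mode. (The $\frac{1}{2}$ matches the Gaussian kernel and rescales $\lambda$ relative to (1).)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p, lam = 60, 5, 3.0
X = rng.standard_normal((n, p))
y = X @ np.array([3.0, 1.5, 0.0, 0.0, 2.0]) + rng.standard_normal(n)

def neg_log_posterior(b):
    # -log [ N(y | Xb, I_n) * prod_i Laplace(b_i | 0, 1/lam) ]
    loglik = stats.norm.logpdf(y, loc=X @ b, scale=1.0).sum()
    logprior = stats.laplace.logpdf(b, loc=0.0, scale=1.0 / lam).sum()
    return -(loglik + logprior)

def lasso_objective(b):
    # 0.5 * ||y - Xb||^2 + lam * sum |b_i|
    return 0.5 * np.sum((y - X @ b) ** 2) + lam * np.sum(np.abs(b))

# The two differ only by an additive constant, hence have the same minimizer.
bs = [rng.standard_normal(p) for _ in range(3)]
diffs = [neg_log_posterior(b) - lasso_objective(b) for b in bs]
print(np.allclose(diffs, diffs[0]))  # True
```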

6 Laplacian Priors
The Laplacian prior is more sparsity-promoting than the normal prior.

7 The Bayesian Lasso (Park and Casella, 2008)
Model: $y \mid (X, \beta, \sigma^2) \sim N(X\beta, \sigma^2 I_n)$.
Propose the conditional Laplacian prior $\beta_i \mid (\sigma^2, \lambda^2) \sim \frac{\lambda}{2\sigma} e^{-\lambda |\beta_i|/\sigma}$.
Rewrite the Laplacian prior as a scale mixture: $\beta_i \mid (\sigma^2, \gamma_i^2) \sim N(0, \sigma^2 \gamma_i^2)$ and $\gamma_i^2 \mid \sigma^2 \sim \text{Exp}(\lambda^2/2)$.  (2)
Empirical Bayes treatment of $\lambda$:
- Estimate $\lambda$ by the marginal maximum likelihood estimate $\hat\lambda$.
- Or assign a hyperprior that places high density at $\hat\lambda$.
Estimate $\beta_i$ by its posterior median.
Limitation: heavy computational load, and sparsity is NOT achieved.
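
The mixture representation (2) is what makes Gibbs sampling tractable: conditional on the $\gamma_i^2$'s the model is conjugate Gaussian. Below is a minimal sketch of the resulting sampler with $\lambda$ held fixed; the function name and burn-in handling are mine, and the full conditionals follow from standard conjugacy for this hierarchy rather than from the slide itself:

```python
import numpy as np

def bayesian_lasso_gibbs(X, y, lam=1.0, n_iter=5000, burn=1000, seed=0):
    """Gibbs sampler for the Bayesian Lasso via the scale mixture (2); lambda fixed."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    gamma2 = np.ones(p)          # mixing variances gamma_i^2
    sigma2 = 1.0
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        # beta | rest ~ N(A^{-1} X'y, sigma2 * A^{-1}), with A = X'X + diag(1/gamma2)
        A = XtX + np.diag(1.0 / gamma2)
        L = np.linalg.cholesky(A)
        mean = np.linalg.solve(A, Xty)
        beta = mean + np.sqrt(sigma2) * np.linalg.solve(L.T, rng.standard_normal(p))
        # 1/gamma_i^2 | rest ~ Inverse-Gaussian(sqrt(lam^2 sigma2 / beta_i^2), lam^2)
        mu = np.sqrt(lam**2 * sigma2 / np.maximum(beta**2, 1e-12))
        gamma2 = 1.0 / rng.wald(mu, lam**2)
        # sigma2 | rest ~ Inv-Gamma((n-1)/2 + p/2, ||y - Xb||^2/2 + b' D^{-1} b / 2)
        resid = y - X @ beta
        rate = 0.5 * (resid @ resid + np.sum(beta**2 / gamma2))
        sigma2 = 1.0 / rng.gamma((n - 1) / 2 + p / 2, 1.0 / rate)
        draws[t] = beta
    return draws[burn:]
```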

8 Outline
Introduction of Bayesian Lasso
Bayesian Lasso with Pseudo Variables
Bayesian Group Lasso with Pseudo Variables
Conclusions and Future Work

9 Benefits of Our Method
- Avoid the computational burden of finding the marginal maximum likelihood estimate.
- Assign a prior to $\lambda^2$ that does not depend on the data.
- Achieve sparsity.

10 Intuition for Achieving Sparsity
Find an unimportant pseudo variable $Z$ to serve as a benchmark.
Augment the model: $y = \beta_z Z + X\beta + \epsilon$.
Criteria for $Z$:
- Orthogonal to $y$ (the true value of $\beta_z$ is 0).
- Orthogonal to the $X_i$'s (keeps the data structure intact).

11 Benchmark: Intercept!
$Z_{int} = (1/\sqrt{n}, \ldots, 1/\sqrt{n})^T$ ($n$ entries).
- Orthogonal to $y$ and to all the $X_i$'s.
- Does NOT depend on the specific observations.

12 Variable Selection
Regression model: $y = \beta_{int} Z_{int} + X\beta + \epsilon$.
Assign hierarchical priors and obtain the posterior distributions of $\beta_{int}$ and the $\beta_i$'s.
Measure the importance of $X_i$ by $d_i = P(|\beta_i| > |\beta_{int}| \mid y, X)$.
If $X_i$ is orthogonal to $y$ and to the other variables, then $d_i = 0.5$.
$X_i$ is selected as an important variable if $d_i > c$, where $c > 0.5$.
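
Given posterior draws of $(\beta_{int}, \beta_1, \ldots, \beta_p)$, the $d_i$'s reduce to Monte Carlo proportions. A minimal sketch (the function name and the convention that column 0 of `draws` holds the benchmark coefficient are mine; the draws could come from running the Gibbs sampler above on the augmented design `[Z_int, X]`):

```python
import numpy as np

def importance_measures(draws, c=0.9):
    """draws: (n_samples, 1 + p) array; column 0 holds beta_int, columns 1..p the beta_i."""
    beta_int = draws[:, 0]
    beta = draws[:, 1:]
    # d_i = posterior probability that |beta_i| exceeds the benchmark |beta_int|
    d = np.mean(np.abs(beta) > np.abs(beta_int)[:, None], axis=0)
    return d, d > c

# Example wiring with the pieces defined earlier:
# Z_int = np.full((n, 1), 1 / np.sqrt(n))
# draws = bayesian_lasso_gibbs(np.hstack([Z_int, X]), y, lam=1.0)
# d, selected = importance_measures(draws, c=0.9)
```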

13 Some Thoughts on Tuning c
1. Choose c such that the false discovery rate is controlled.
2. Find $\lim_{n \to \infty} d_i$ for an $X_i$ that is unimportant but weakly correlated with the important variables, and use it as a guideline for choosing c.

14 Illustration
Consider the following simulation setting (Tibshirani, 1996):
$y = X\beta + \epsilon$, $\beta = (3, 1.5, 0, 0, 2, 0, 0, 0)^T$.
$X = (X_1, \ldots, X_p)$, $X_i \sim N(0, 1)$, $\text{cor}(X_i, X_j) = 0.5^{|i-j|}$.
$\sigma^2 = 9$. $n = 20$.
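
A sketch of this data-generating process in NumPy, followed by the preprocessing from the Notation slide (seed and variable names are arbitrary; $n = 20$ is Tibshirani's value and is assumed here):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 20, 8, 3.0
beta_true = np.array([3.0, 1.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0])

# AR(1)-type correlation: cor(X_i, X_j) = 0.5 ** |i - j|
idx = np.arange(p)
corr = 0.5 ** np.abs(idx[:, None] - idx[None, :])
X = rng.multivariate_normal(np.zeros(p), corr, size=n)
y = X @ beta_true + sigma * rng.standard_normal(n)

# Preprocess as on slide 4: center y, give X columns mean 0 and unit L2 norm.
y = y - y.mean()
X = X - X.mean(axis=0)
X = X / np.linalg.norm(X, axis=0)
```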

15 Posterior Distribution of $\beta_{int}$
$Z_{int}$ is a good benchmark for unimportant variables.

16 Changes in the Posterior Distributions of the $\beta_i$'s
For each $\beta_i$, the estimated posterior density is almost unaffected by adding $Z_{int}$.

17 Estimated d_i
$\hat d_i$: the proportion of posterior draws $(\beta_i, \beta_{int})$ satisfying $|\beta_i| > |\beta_{int}|$.
$\hat\beta_{i,PC}$: the posterior median of $\beta_i$ from Park and Casella's Bayesian Lasso method.
All unimportant variables have $\hat d_i$ close to 0.55. Posterior medians do NOT yield sparsity.
[Table: $\hat d_i$ and $\hat\beta_{i,PC}$ for each $\beta_i$; numeric entries not recovered.]

18 Variable Selection Results
Empirically, c = 0.9 yields good sparsity. When c = 0.6, the result is almost the same as the Lasso's.
Table: frequencies with which each variable is selected, for several values of c (including 0.9 and 0.6) and for the Lasso. [Numeric entries not recovered.]

19 Posteriors of the $\beta_i$'s after Adding $Z_{int}$
Lemma. Consider the regression model and priors
$y = X\beta + \epsilon$ with $\epsilon \sim N(0, \sigma^2 I_n)$,
$\beta_i \mid (\sigma^2, \lambda^2) \sim \frac{\lambda}{2\sigma} e^{-\lambda|\beta_i|/\sigma}$ for $i = 1, \ldots, p$.
Let $\pi_1$ and $\pi_2$ be the joint posteriors of the $\beta_i$'s conditional on $\sigma^2$ and $\lambda^2$ before and after adding $Z_{int}$, respectively. Then we have $\pi_1 = \pi_2$.

20 Outline
Introduction of Bayesian Lasso
Bayesian Lasso with Pseudo Variables
Bayesian Group Lasso with Pseudo Variables
Conclusions and Future Work

21 Motivation for the Group Selection Method
- Assayed genes or proteins are naturally grouped by biological role or biological pathway.
- It is desirable to first select important pathways (group selection), and then select important genes (within-group selection).
- Correlated important variables in the same group should all be selected; the Lasso tends to pick only a few of them.

22 Extra Notation
$g$: number of groups.
$k$: index of groups; $j$: index of variables inside groups. For example, $X_{k,j}$ is the $j$th variable in group $k$.
$p_k$: number of variables in group $k$.
Assume there is no overlap, that is, $p = \sum_{k=1}^g p_k$.

23 Current Lasso-Type Methods for Group Selection
Frequentist approaches:
- Designed for group selection only: Yuan and Lin (2006).
- Designed for both group selection and within-group selection: Ma and Huang (2007); Huang et al. (2009); Wang et al. (2009).
Bayesian approach: Raman et al. (2009).

24 Hierarchical Priors with Group Structure
Model: $y = \beta_{int} Z_{int} + \sum_{k=1}^g \sum_{j=1}^{p_k} \beta_{k,j} X_{k,j} + \epsilon$ with $\epsilon \sim N(0, \sigma^2 I_n)$.
$\beta_{k,j} \sim N(0, \gamma_k^2 \sigma^2 / p_k)$: variables in the same group are shrunk simultaneously; $\gamma_k^2$ measures the total variation of the $p_k$ coefficients $\beta_{k,1}, \ldots, \beta_{k,p_k}$.
$\gamma_k^2 \sim \text{Exp}(\lambda^2 / (2 p_k))$: treats the $\beta_{k,j}$ equally across groups, so that $E(\beta_{k,j})$ and $V(\beta_{k,j})$ do not depend on $k$ or $j$.
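
To see why this makes the prior moments group-size-free: with the rate parameterization, $E(\gamma_k^2) = 2p_k/\lambda^2$, so $V(\beta_{k,j}) = \sigma^2 E(\gamma_k^2)/p_k = 2\sigma^2/\lambda^2$ for every group, matching the $\text{Exp}(\lambda^2/2)$ case of (2) when $p_k = 1$. A quick Monte Carlo check of this identity (my own sketch, assuming the rate parameterization of the exponential):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma2, lam = 1.0, 2.0

for p_k in (1, 3, 10):
    rate = lam**2 / (2 * p_k)
    gamma2 = rng.exponential(scale=1.0 / rate, size=200_000)  # gamma_k^2 ~ Exp(rate)
    beta = rng.normal(0.0, np.sqrt(gamma2 * sigma2 / p_k))    # beta_{k,j} | gamma_k^2
    print(p_k, beta.var())  # ~ 2 * sigma2 / lam**2 = 0.5 for every p_k
```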

25 Group Selection
Definition of an important group: a group with at least one important variable.
Selection of important groups: the $k$th group is selected if $\max_j \{P(|\beta_{k,j}| > |\beta_{int}|)\} > c$.

26 Within-Group Selection: More Benchmarks
Limitation of selecting variables against $\beta_{int}$ alone: unimportant variables in the important groups are less likely to be removed.
Solution: find a benchmark $Z_{k,ben}$ for each group with $p_k > 1$. The regression model becomes
$y = \beta_{int} Z_{int} + \sum_{k=1}^m \left( \beta_{k,ben} Z_{k,ben} + \sum_{j=1}^{p_k} \beta_{k,j} X_{k,j} \right) + \sum_{k=m+1}^g \beta_{k,1} X_{k,1} + \epsilon$,
where $m$ is the number of groups with size greater than 1 (groups are ordered so that the singleton groups come last).
How to make $Z_{k,ben}$ orthogonal to the other variables and benchmarks? Data augmentation!

27 An Example: Construction of Two More Benchmarks
Data: 7 observations and two groups, $\{X_{1,1}, X_{1,2}\}$ and $\{X_{2,1}, X_{2,2}\}$. $Z_{int}$ is orthogonal to $y$ and to the $X_{k,j}$'s.

Obs.      y, X_{1,1}, X_{1,2}, X_{2,1}, X_{2,2}    Z_int
1 to 7    (observed data)                          $1/\sqrt{7}$

28 Data Augmentation

Obs.      y, X_{k,j} columns    Z_{1,ben}      Z_{2,ben}      Z_int
1 to 7    (observed data)       $a_1$          $a_2$          $1/\sqrt{9}$
8         0                     $-a_1 b_1$     $a_2$          $1/\sqrt{9}$
9         0                     $0$            $-a_2 b_2$     $1/\sqrt{9}$

$a_1 = 1/\sqrt{56}$, $b_1 = 7$; $a_2 = 1/\sqrt{72}$, $b_2 = 8$.
$Z_{1,ben}$, $Z_{2,ben}$ and $Z_{int}$ are pairwise orthogonal and also orthogonal to the response and the predictors.

29 Geometric Interpretation of Data Augmentation
Adding one zero observation lifts the original data into an (n+1)-dimensional space.

30 Steps for Constructing Benchmarks
1. Append $m$ zero observations to the original data, where $m$ is the number of groups with $p_k > 1$.
2. Let $Z_{k,ben} = (\underbrace{a_k, \ldots, a_k}_{n+k-1}, -a_k b_k, \underbrace{0, \ldots, 0}_{m-k})^T$, where $a_k = [(n+k-1)(n+k)]^{-1/2}$ and $b_k = n+k-1$.
3. Let $Z_{int} = ((m+n)^{-1/2}, \ldots, (m+n)^{-1/2})^T$ ($m+n$ entries).
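
A direct transcription of these steps into NumPy, checked on the 7-observation, two-group example of slide 28 (my own sketch; the helper name is mine):

```python
import numpy as np

def build_benchmarks(X, y, m):
    """Append m zero observations and build Z_{k,ben} (k = 1..m) and Z_int.

    m is the number of groups with more than one variable; singleton groups
    get no extra benchmark. Returns the augmented (y, X) and the benchmarks.
    """
    n, p = X.shape
    X_aug = np.vstack([X, np.zeros((m, p))])        # m zero observations
    y_aug = np.concatenate([y, np.zeros(m)])
    Z = np.zeros((n + m, m))
    for k in range(1, m + 1):
        a_k = ((n + k - 1) * (n + k)) ** -0.5
        b_k = n + k - 1
        Z[: n + k - 1, k - 1] = a_k                 # first n+k-1 entries equal a_k
        Z[n + k - 1, k - 1] = -a_k * b_k            # one balancing entry, then zeros
    Z_int = np.full(n + m, (n + m) ** -0.5)
    return y_aug, X_aug, Z, Z_int

# Check on the slide-28 example: n = 7 observations, m = 2 multi-variable groups.
rng = np.random.default_rng(4)
X = rng.standard_normal((7, 4))
X = X - X.mean(axis=0); X = X / np.linalg.norm(X, axis=0)
y = rng.standard_normal(7); y = y - y.mean()
y_aug, X_aug, Z, Z_int = build_benchmarks(X, y, m=2)

cols = np.column_stack([Z, Z_int])
print(np.allclose(cols.T @ cols, np.eye(3)))                         # orthonormal benchmarks
print(np.allclose(cols.T @ X_aug, 0), np.allclose(cols.T @ y_aug, 0))  # orthogonal to data
```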

31 Group Selection and Within-Group Selection
Regression model:
$y = \beta_{int} Z_{int} + \sum_{k=1}^m \left( \beta_{k,ben} Z_{k,ben} + \sum_{j=1}^{p_k} \beta_{k,j} X_{k,j} \right) + \sum_{k=m+1}^g \beta_{k,1} X_{k,1} + \epsilon$.
Assign hierarchical priors with group structure and obtain the posterior distributions of the coefficients.
Group selection: the $k$th group is selected if $\max_j \{P(|\beta_{k,j}| > |\beta_{int}|)\} > c$.
Within-group selection: suppose group $k$ is selected; then $X_{k,j}$ is selected if $P(|\beta_{k,j}| > |\beta_{k,ben}|) > c$.
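
Both rules are again Monte Carlo proportions over posterior draws. A minimal sketch (the function name and data layout are mine; the draws are assumed to come from a grouped variant of the Gibbs sampler above, and the fallback to $\beta_{int}$ for singleton groups is my assumption, since the slides only define the within-group rule for multi-variable groups):

```python
import numpy as np

def group_and_within_selection(beta_int, beta_ben, beta_groups, c=0.7):
    """beta_int: (S,) draws; beta_ben: dict k -> (S,) draws for groups with p_k > 1;
    beta_groups: dict k -> (S, p_k) draws. Returns selected groups and variables."""
    selected_groups, selected_vars = [], {}
    for k, draws in beta_groups.items():
        # Group rule: max_j P(|beta_{k,j}| > |beta_int|) > c
        d = np.mean(np.abs(draws) > np.abs(beta_int)[:, None], axis=0)
        if d.max() > c:
            selected_groups.append(k)
            # Within-group rule: P(|beta_{k,j}| > |beta_{k,ben}|) > c
            # (assumption: fall back to beta_int when the group has no benchmark)
            ben = beta_ben[k] if k in beta_ben else beta_int
            w = np.mean(np.abs(draws) > np.abs(ben)[:, None], axis=0)
            selected_vars[k] = np.flatnonzero(w > c)
    return selected_groups, selected_vars
```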

32 Illustration
Consider $p = 20$, $g = 6$, and
$\beta = (1.5, 0.8, 0, 0, 0, 1.2, 0, 0, 0.8, 0, 1.2, 0, 0, 0, 0, 0, 0, 0, 0, 0.8)^T$,
partitioned into six groups G1 to G6. [The underbraces marking the group boundaries were not recovered; per slide 35, groups 1, 2, 3 and 6 are the important ones.]
$y = X\beta + \epsilon$; $X_{k,j} \sim N(0, 1)$; $\text{cov}(X_{k,i}, X_{k,j}) = 0.5^{|i-j|}$ for $k = 1, 2$; $\text{cov}(X_{k,j}, X_{k,l}) = 0$ for $k = 3, 4$ and $j \ne l$.
Signal-to-noise ratio is 3. $n = 100$. Let $c = 0.9$.

33 Posteriors of Variables in an Important Group
Important variables deviate further from the benchmark.

34 Posteriors of Variables in an Unimportant Group
All the unimportant variables are very close to the benchmark.

35 Group Selection Results
Table: the frequency with which each group is selected in 100 simulations.

Group        1   2   3   4   5   6
Important    Y   Y   Y   N   N   Y
Size, Selected: [numeric entries not recovered]

36 Within-Group Selection Results
Average false discovery rate: (0.011); average false negative rate: (0.003). Average number of selected variables: 6.13 (0.08); the true number is 6. (Numbers in parentheses are standard errors; the two average rates themselves were not recovered.)
Table: number of times each variable in group 1 is selected in 100 simulations.

Variable     X_{1,1}  X_{1,2}  X_{1,3}  X_{1,4}  X_{1,5}  X_{1,6}
True Coef., Selected: [numeric entries not recovered]

37 A Big p, Small n Example
Let $p = 200$ and $n = 100$. There are 40 groups, each consisting of 5 variables.
$\beta_{1,j}$'s in group 1: $(1.2, 0.8, 0, 0, 1.6)$.
$\beta_{2,j}$'s in group 2: $(1, 0.9, 1.1, 1.3, 0.8)$.
$\beta_{3,j}$'s in group 3: $(0.8, 0, 0, 0, 0)$.
$\beta_{k,j}$'s in groups 4 to 8 are all zero.
These 8 groups form a block, and the block is replicated 5 times to yield the coefficients of all 200 variables. There are 45 important variables and 155 unimportant variables.

38 Covariance Structure
Within each block, the $X_{k,j}$'s are generated from a multivariate normal with mean 0 and covariance structure
$\text{cov}(X_{k,i}, X_{m,j}) = \frac{1}{3}(0.5)^{|k-m|}$.
Variables in different blocks are uncorrelated. Signal-to-noise ratio is 3.
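
A sketch of this design (assuming, since the slide does not say, that each $X_{k,j}$ has unit variance and that the $\frac{1}{3}(0.5)^{|k-m|}$ formula gives the off-diagonal covariances within a block):

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(5)
groups_per_block, vars_per_group, n_blocks, n = 8, 5, 5, 100

# Group index (0..7) of each of the 40 variables in one block.
g_idx = np.repeat(np.arange(groups_per_block), vars_per_group)

# cov(X_{k,i}, X_{m,j}) = (1/3) * 0.5**|k-m| within a block.
block_cov = (1.0 / 3.0) * 0.5 ** np.abs(g_idx[:, None] - g_idx[None, :])
np.fill_diagonal(block_cov, 1.0)   # assumption: each X_{k,j} has variance 1

# Blocks are independent, so the full covariance is block diagonal.
cov = block_diag(*([block_cov] * n_blocks))   # 200 x 200
X = rng.multivariate_normal(np.zeros(cov.shape[0]), cov, size=n)
print(X.shape)   # (100, 200)
```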

39 Group Selection Results
When c = 0.7, unimportant groups are effectively removed: the group false discovery rate is 5.1% (0.8%) and the group false negative rate is 24.0% (0.4%). When c = 0.6, the false discovery rate is 23.9% (0.9%) and the false negative rate is 16.1% (0.5%).
Table: frequencies with which the first 8 groups are selected, based on 100 simulations.

Group               1   2   3   4   5   6   7   8
Important           Y   Y   Y   N   N   N   N   N
Selected (c = 0.7)  [numeric entries not recovered]
Selected (c = 0.6)  [numeric entries not recovered]

40 Within-Group Selection Results
Unimportant variables in group 1 (an important group) are effectively removed when c = 0.7. When c = 0.7, the average false discovery rate over all groups is 11% (0.8%) and the average false negative rate is 17% (0.2%); when c = 0.6, they are 33% (0.8%) and 12% (0.2%), respectively.
Table: frequencies with which the 5 variables in the first group are selected, based on 100 simulations.

$(k, j)$             (1,1)  (1,2)  (1,3)  (1,4)  (1,5)
True $\beta_{k,j}$    1.2    0.8    0      0      1.6
Selected (c = 0.7)   [numeric entries not recovered]
Selected (c = 0.6)   [numeric entries not recovered]

41 Outline
Introduction of Bayesian Lasso
Bayesian Lasso with Pseudo Variables
Bayesian Group Lasso with Pseudo Variables
Conclusions and Future Work

42 Conclusions
- The intercept is a good benchmark for unimportant variables.
- The Bayesian Lasso with pseudo variables achieves sparsity.
- The Bayesian group Lasso with pseudo variables achieves good group selection and within-group selection results.

43 Future Work
- Optimize the threshold.
- More numerical comparisons with other variable selection methods.
- Real data analysis.

44 Other Work
- Shao, J. & Tang, Q. Random Group Variance Estimators for Survey Data with Random Hot Deck Imputation. (Submitted)
- Tang, Q. & Qian, P.Z.G. Enhancing the Sample Average Approximation Method with U Designs. (In revision)

45 Acknowledgements (alphabetical)
Jun Shao, University of Wisconsin-Madison
Kam-Wah Tsui, University of Wisconsin-Madison
Peter Qian, University of Wisconsin-Madison
Sijian Wang, University of Wisconsin-Madison

46 Thank you!
