Regularized Regression: A Bayesian Point of View


1 Regularized Regression: A Bayesian Point of View
Vincent MICHEL
Director: Gilles Celeux. Supervisor: Bertrand Thirion.
Parietal Team, INRIA Saclay Île-de-France; LRI, Université Paris Sud; CEA, DSV, I2BM, Neurospin.
MM, 22 February

2 Outline
1. Introduction to Regularization
2. Bayesian Regression

3 Example of Application in fMRI
Fig.: Example of application of regression (here SVR) in an fMRI paradigm. For clouds of different numbers of dots, we predict the quantities seen by the subject. In collaboration with E. Eger, INSERM U562.

4 Introduction
Regression: given a set of m images of n variables, $X = (x_1, \dots, x_n) \in \mathbb{R}^{m \times n}$, and a target $y \in \mathbb{R}^m$, we have the following regression model:
$y = Xw + \epsilon$
with $w$ the parameters of the model (the weights) and $\epsilon$ the error.
The estimation $\hat{w}$ of $w$ is commonly done by Ordinary Least Squares:
$\hat{w} = \arg\min_w \|y - Xw\|^2 \implies \hat{w} = (X^T X)^{-1} X^T y$
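To make the OLS solution concrete, here is a minimal NumPy sketch (the data, dimensions and seed are illustrative assumptions, not from the talk):

```python
import numpy as np

# Illustrative data: m = 50 samples, n = 5 variables (assumed)
rng = np.random.default_rng(0)
m, n = 50, 5
X = rng.normal(size=(m, n))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=m)

# OLS: w_hat = (X^T X)^{-1} X^T y; lstsq is the numerically stable route
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(w_hat, 2))  # close to w_true
```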

5 Regularization
Motivations:
- OLS estimates often have low bias but large variance.
- All features get a weight, while a smaller subset with strong effects is more interpretable.
We therefore need some shrinkage (or regularization) to constrain $\hat{w}$: constrain the values of $\hat{w}$ so that a few parameters explain the data well.
Bridge Regression (I.E. Frank and J.H. Friedman):
$\hat{w} = \arg\min_w \underbrace{\|y - Xw\|^2}_{\text{data fit}} + \underbrace{\lambda \sum_i |w_i|^\gamma}_{\text{penalization}}, \quad \gamma \geq 0$

6 Ridge Regression
L2 norm (A.E. Hoerl and R.W. Kennard). Ridge regression introduces a regularization with the L2 norm:
$\hat{w} = \arg\min_w \|y - Xw\|^2 + \lambda_2 \|w\|_2^2$
Properties:
- Sacrifices a little bias to reduce the variance of the predicted values: more stable.
- Grouping effect: voxels with close signals will have close coefficients, which exhibits groups of correlated features.
- Keeps all the regressors in the model, so the model is not easily interpretable.
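As a hedged illustration of the ridge estimator in closed form (the data and the λ2 values are assumptions):

```python
import numpy as np

def ridge(X, y, lambda2):
    """Closed-form ridge estimate: (X^T X + lambda2 * I)^{-1} X^T y."""
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + lambda2 * np.eye(n_feat), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=50)

for lam in (0.0, 1.0, 100.0):
    print(lam, np.round(ridge(X, y, lam), 2))  # coefficients shrink as lam grows
```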

7 Ridge Regression: Simulation
Fig.: Coefficients of ridge regression for different values of $\lambda_2$.

8 Lasso: Least Absolute Shrinkage and Selection Operator
L1 norm (R. Tibshirani). Lasso regression introduces a regularization with the L1 norm:
$\hat{w} = \arg\min_w \|y - Xw\|^2 + \lambda_1 \|w\|_1$
Properties:
- Parsimony: only a small subset of features with $\hat{w}_i \neq 0$ is selected, which increases interpretability (especially combined with parcels).
- If $n > m$, i.e. more features than samples, the Lasso selects at most m variables.
- More difficult to implement than ridge regression.
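A hedged usage sketch with scikit-learn's Lasso (alpha is an arbitrary choice; note that scikit-learn scales the data-fit term by 1/(2m), so its alpha corresponds to λ1 only up to that factor):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
m, n = 30, 100  # n > m: the lasso can select at most m variables
X = rng.normal(size=(m, n))
y = X[:, :3] @ np.array([2.0, -1.5, 1.0]) + 0.1 * rng.normal(size=m)

lasso = Lasso(alpha=0.1).fit(X, y)
print("selected features:", int(np.sum(lasso.coef_ != 0)))
```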

9 Lasso: Simulation
Fig.: Coefficients of lasso regression for different values of the shrinkage factor $\|\hat{w}\|_1 / \|\hat{w}^{OLS}\|_1$.

10 Outline
1. Introduction to Regularization
2. Bayesian Regression

11 Bayesian Regression
Questions:
- What is a graphical model?
- Why use Bayesian probabilities?
- What is a prior?
- What is a hierarchical Bayesian model?
- Why use hyperparameters?

12 Bayesian Regression and Graphical Model
The regression model $y = Xw + \epsilon$ can be seen as the following probabilistic model:
$p(y \mid X, w, \lambda) = \mathcal{N}(Xw, \lambda)$, with $p(\epsilon) = \mathcal{N}(0, \lambda)$, i.e. $\epsilon \sim \mathcal{N}(0, \lambda I_n)$.
[Graphical model: nodes $\lambda$, $X$, $w$ and $\epsilon$ feeding into $y = Xw + \epsilon$.]
Can we add some (prior) knowledge on the parameters?
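A minimal sketch of sampling from this generative model (the dimensions and noise variance lam are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 5
lam = 0.5                                     # noise variance (assumed)
X = rng.normal(size=(m, n))
w = rng.normal(size=n)                        # the weights; a prior on w comes next
eps = rng.normal(scale=np.sqrt(lam), size=m)  # eps ~ N(0, lam * I)
y = X @ w + eps                               # y | X, w, lam ~ N(Xw, lam * I)
```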

13 Bayesian Probabilities
Main idea: introducing prior information into the modeling. Bayes' theorem:
$p(w \mid y) = \frac{p(y \mid w)\, p(w)}{p(y)}$
where $p(w)$ is called a prior. This prior, introduced before any access to the data, reflects our prior knowledge of $w$. Moreover, the posterior probability $p(w \mid y)$ gives us the uncertainty in $w$ after we have observed $y$.
Example: an unbiased coin tossed three times yields 3 heads.
- A frequentist approach: $p(\text{head}) = 1$, i.e. always heads for all future tosses.
- A Bayesian approach: we can add a (Beta) prior, in order to represent our prior knowledge that an unbiased coin gives less extreme results.
See Bishop, Pattern Recognition and Machine Learning, for more details.
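A sketch of the coin example using the conjugate Beta prior (the Beta(2, 2) prior is an assumed choice; the posterior after h heads and t tails is Beta(a + h, b + t)):

```python
# Beta-Bernoulli conjugacy: prior Beta(a, b), data = 3 heads, 0 tails
a, b = 2.0, 2.0   # assumed prior, centered on a fair coin
heads, tails = 3, 0

# Frequentist (maximum likelihood) estimate
p_ml = heads / (heads + tails)                    # = 1.0

# Bayesian posterior mean under Beta(a + heads, b + tails)
p_bayes = (a + heads) / (a + b + heads + tails)   # = 5/7, less extreme

print(p_ml, p_bayes)
```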

14 Priors for Bayesian Regression: Ridge Regression
Prior for the weights $w$: penalization by a weighted L2 norm is equivalent to setting a Gaussian prior on the weights,
$w \sim \mathcal{N}(0, \alpha I)$
This gives Bayesian Ridge Regression (C. Bishop, 2000).
Prior for the noise $\epsilon$: the classical prior (M.E. Tipping, 2000),
$\epsilon \sim \mathcal{N}(0, \lambda I_n)$
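A hedged usage sketch with scikit-learn's BayesianRidge, which implements this Gaussian-prior model (with Gamma hyperpriors on the precisions); note that its naming is reversed with respect to these slides: alpha_ is the estimated noise precision and lambda_ the weight precision:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=50)

model = BayesianRidge().fit(X, y)
print(np.round(model.coef_, 2))  # posterior mean of w
print(model.alpha_)              # estimated noise precision
print(model.lambda_)             # estimated weight precision
```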

15 Graphical Model: Ridge Regression
$\epsilon \sim \mathcal{N}(0, \lambda I_n)$, $w \sim \mathcal{N}(0, \alpha I)$, $y = Xw + \epsilon$.
[Graphical model: $\lambda \to \epsilon$ and $\alpha \to w$, with $X$, $w$ and $\epsilon$ feeding into $y$.]

16 Example on Simulated Data: Ridge Regression
[Fig.: log-density of the distribution of $w$ for Bayesian Ridge Regression.]
Can we find some more adaptive priors?

17 Priors for Bayesian Regression: ARD
Prior for the weights $w$: we give each weight its own Gaussian prior,
$w \sim \mathcal{N}(0, A^{-1}), \quad A = \mathrm{diag}(\alpha_1, \dots, \alpha_m)$
with $\alpha_i \neq \alpha_j$ for $i \neq j$. This is Automatic Relevance Determination, ARD (MacKay, 1994).
Prior for the noise $\epsilon$: the classical prior (M.E. Tipping, 2000),
$\epsilon \sim \mathcal{N}(0, \lambda I_n)$
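A hedged sketch with scikit-learn's ARDRegression, which implements this per-weight prior; irrelevant weights get large precisions $\alpha_i$ and are pruned (the sparse data are an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]     # only 3 relevant features
y = X @ w_true + 0.1 * rng.normal(size=50)

ard = ARDRegression().fit(X, y)
print(np.round(ard.coef_, 2))     # irrelevant weights shrink to ~0
```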

18 Graphical Model: ARD
$\epsilon \sim \mathcal{N}(0, \lambda I_n)$, $w \sim \mathcal{N}(0, A^{-1})$ with $A = \mathrm{diag}(\alpha_1, \dots, \alpha_m)$, $y = Xw + \epsilon$.
[Graphical model: $\lambda \to \epsilon$ and $\alpha \to w$, with $X$, $w$ and $\epsilon$ feeding into $y$.]

19 Example on Simulated Data: ARD
[Fig.: log-density of the distribution of $w$ for ARD.]

20 Other (Less Classical) Priors for Bayesian Regression
LASSO, Laplacian prior for the weights $w$: penalization by a weighted L1 norm is equivalent to setting a Laplacian prior on the weights,
$w \sim C \exp(-\lambda \|w\|_1)$
Elastic Net, a more complex prior for the weights $w$: penalization by weighted L1 and L2 norms is equivalent to setting a more complex prior,
$w \sim C(\lambda, \alpha) \exp(-\lambda \|w\|_1 - \alpha \|w\|_2^2)$
But... it is difficult to estimate the parameters of such models.
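For reference, the penalized (non-Bayesian) counterpart of the elastic-net prior is available in scikit-learn; alpha and l1_ratio are assumed values that jointly reparametrize the λ and α above:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
y = 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=30)

# alpha scales the total penalty; l1_ratio balances L1 vs L2
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.round(enet.coef_, 2))
```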

21 Notions of Hyperpriors
What is a hyperprior? The parameters of a prior are called hyperparameters; e.g. for ARD, each $\alpha_i$ in
$w \sim \mathcal{N}(0, A^{-1}), \quad A = \mathrm{diag}(\alpha_1, \dots, \alpha_m)$
is a hyperparameter. A hyperprior is a prior on the hyperparameters. It allows the parameters of the model to adapt to the data, and it defines a hierarchical Bayesian model.
Classical hyperpriors: see C. Bishop & M.E. Tipping, 2000.

22 Bayesian Regression: Hierarchical Bayesian Model
$p(\lambda) = \Gamma(\lambda_1, \lambda_2)$, $\epsilon \sim \mathcal{N}(0, \lambda I_n)$
$p(\alpha_i) = \Gamma(\gamma_1^i, \gamma_2^i)$, $w \sim \mathcal{N}(0, A^{-1})$ with $A = \mathrm{diag}(\alpha_1, \dots, \alpha_m)$
$y = Xw + \epsilon$
[Graphical model: Gamma hyperpriors on $\lambda$ and on each $\alpha_i$.]
But how do we learn the parameters?

23 THE Difficult Question: How Do We Estimate the Model?
We have a model and we have some data; how do we learn the parameters?
Likelihood: the log-likelihood is the logarithm of $p(y \mid X, w, \lambda)$ viewed as a function of the parameters $w$. It expresses how probable the observed data is for different settings of $w$. The main idea is to find the parameters which optimize the log-likelihood $\mathcal{L}(y \mid w)$, i.e. to choose the parameters for which the probability of the explained data $y$ is maximized.
Simple methods, when $\mathcal{L}(y \mid w)$ is not too complicated:
- Maximum A Posteriori (MAP).
- Expectation Maximization, EM (Dempster, 1977), which can be used in the presence of latent variables (e.g. mixtures of Gaussians).
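As a concrete anchor for the discussion, a minimal sketch of the Gaussian log-likelihood of this model (lam denotes the noise variance; a sketch, not the talk's code):

```python
import numpy as np

def log_likelihood(w, X, y, lam):
    """log N(y | Xw, lam * I): how probable the data y is for a given w."""
    m = len(y)
    resid = y - X @ w
    return -0.5 * (m * np.log(2 * np.pi * lam) + resid @ resid / lam)
```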

24 When $\mathcal{L}(y \mid w)$ IS Too Complicated...
Sampling methods, used when we cannot directly evaluate the distributions: Metropolis-Hastings, Gibbs sampling, simulated annealing. Computationally expensive.
[Fig.: 2D example of Gibbs sampling.]
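In the spirit of the 2D figure, a minimal Gibbs-sampling sketch for a standard bivariate Gaussian with correlation rho (an illustrative assumption), alternating draws from the two conditionals:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                      # assumed correlation
n_iter = 1000
samples = np.zeros((n_iter, 2))
x1 = x2 = 0.0

for t in range(n_iter):
    # Each conditional of a standard bivariate Gaussian is N(rho*other, 1-rho^2)
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples[t] = (x1, x2)

print(np.corrcoef(samples.T)[0, 1])  # close to rho
```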

25 When $\mathcal{L}(y \mid w)$ IS Too Complicated...
Variational Bayes framework (M.J. Beal, 2003). Based on some (strong) assumptions and approximations:
- Laplace approximation: Gaussian approximation of a probability density.
- Factorization of some distributions.
It is equivalent to the maximization of the free energy, which is a lower bound on the log-likelihood.

26 Limits of the Variational Bayes Framework
Sometimes the approximations of the VB framework are too strong for the model, and the free energy is no longer useful...

27 Results on Real Data: Visual Task
Fig.: $w$ for elastic net (left) and Bayesian regression (right).

28 Conclusion
Bayesian regression allows us to introduce prior information. The posterior distribution expresses the uncertainty of the parameters after observing the data. Model estimation can be computationally expensive. Poor priors will lead to poor results with high confidence, so always validate the results by cross-validation!
Thank you for your attention!


30 ARD: A Practical Point of View
EM updates:
$\alpha_i^{new} = \frac{1 + 2\gamma_1^i}{\mu_i^2 + \Sigma_{ii} + 2\gamma_2^i}, \qquad \lambda^{new} = \frac{n + 2\lambda_1}{\|y - X\mu\|^2 + \mathrm{Trace}(\Sigma X^T X) + 2\lambda_2}$
with $w = \mu = \lambda \Sigma X^T y$ and $\Sigma = (A + \lambda X^T X)^{-1}$.
Gibbs sampling, with conditionals:
$p(w \mid \theta \setminus \{w\}) \sim \mathcal{N}(w \mid \mu, \Sigma), \quad p(\lambda \mid \theta \setminus \{\lambda\}) \sim \Gamma(l_1, l_2), \quad p(\alpha \mid \theta \setminus \{\alpha\}) \sim \prod_{i=1}^{m} \Gamma(\alpha_i \mid g_1^i, g_2^i)$
with $g_1^i = \gamma_1^i + \frac{1}{2}$, $g_2^i = \gamma_2^i + \frac{1}{2} w_i^2$, $l_1 = \lambda_1 + n/2$ and $l_2 = \lambda_2 + \frac{1}{2} \|y - Xw\|^2$.
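A sketch of these EM updates in NumPy, transcribing the formulas above (the hyperprior values and the data are illustrative assumptions; convergence checks are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
n_samp, n_feat = 50, 5
X = rng.normal(size=(n_samp, n_feat))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=n_samp)

gamma1 = gamma2 = lam1 = lam2 = 1e-6  # assumed weak hyperpriors
alpha = np.ones(n_feat)               # weight precisions
lam = 1.0                             # noise precision

for _ in range(100):
    # E-step: posterior of w given the current precisions
    Sigma = np.linalg.inv(np.diag(alpha) + lam * X.T @ X)
    mu = lam * Sigma @ X.T @ y
    # M-step: the updates from the slide
    alpha = (1 + 2 * gamma1) / (mu**2 + np.diag(Sigma) + 2 * gamma2)
    resid = y - X @ mu
    lam = (n_samp + 2 * lam1) / (resid @ resid + np.trace(Sigma @ X.T @ X) + 2 * lam2)

print(np.round(mu, 2))  # posterior mean of the weights
```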

31 Estimation by Maximum Likelihood (ML)
Likelihood: the log-likelihood is the logarithm of $p(y \mid X, w, \lambda)$ viewed as a function of the parameter $w$. It expresses how probable the observed data is for different settings of $w$.
Estimation by ML: the main idea is to find the parameter $\hat{w}$ which optimizes the log-likelihood $\mathcal{L}(y \mid w)$:
$\hat{w} = \arg\max_w \mathcal{L}(y \mid w)$
i.e. choosing the $\hat{w}$ for which the probability of the explained data $y$ is maximized. The solution is (reassuringly!) the same as the one found by OLS:
$\hat{w} = (X^T X)^{-1} X^T y$

32 How to Construct the Prediction $\hat{Y}$?
Iterative algorithm:
$\hat{Y} = \sum_{i=1}^{n} x_i \hat{\beta}_i = X \hat{\beta}$
Start with $\hat{Y} = 0$, and at each step:
- Compute the vector of current correlations: $\hat{c} = c(\hat{Y}) = X^T (Y - \hat{Y})$.
- Then take the direction of the greatest current correlation for a small step $\epsilon$: $\hat{Y} \leftarrow \hat{Y} + \epsilon \, \mathrm{sign}(\hat{c}_j)\, x_j$, with $j = \arg\max_i |\hat{c}_i|$.
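A minimal sketch of this stagewise loop (the step size eps and the iteration count are assumptions):

```python
import numpy as np

def forward_stagewise(X, Y, eps=0.01, n_steps=1000):
    """Repeatedly nudge the coefficient of the feature most
    correlated with the current residual Y - X @ beta."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        c = X.T @ (Y - X @ beta)        # current correlations
        j = np.argmax(np.abs(c))        # most correlated feature
        beta[j] += eps * np.sign(c[j])  # small step in its direction
    return beta                         # prediction: Y_hat = X @ beta
```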

33 Choice of $\epsilon$
Forward Selection technique: we take $\epsilon = |\hat{c}_j|$. If a feature seems interesting, we add it with $\beta_j = 1$. Could we weight the influence of $x_j$?
Forward Stagewise Linear Regression: we take $\epsilon \ll 1$. If a feature seems interesting, we add it with $\beta_j \ll 1$; another feature can be selected at the next step if it becomes more correlated with the current residual $Y - \hat{Y}$. We adapt the weights to the features, but the computation can be time-consuming.

34 LARS: Least Angle Regression
Aim: accelerate the computation of the stagewise procedure.
Least Angle Regression: we compute the optimal $\epsilon$, i.e. the optimal step until another feature becomes more interesting, and we take the equiangular direction between the selected features. There are links with the Lasso and the elastic net.
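A hedged usage sketch of LARS via scikit-learn's lars_path, which returns the full piecewise-linear coefficient path (method="lasso" selects the Lasso variant mentioned above):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=50)

alphas, active, coefs = lars_path(X, y, method="lasso")
print(active)       # indices of features, in the order they enter the model
print(coefs.shape)  # (n_features, n_steps along the path)
```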
