Regularized Regression: A Bayesian Point of View
Vincent MICHEL
Director: Gilles Celeux. Supervisor: Bertrand Thirion.
Parietal Team, INRIA Saclay Ile-de-France; LRI, Université Paris Sud; CEA, DSV, I2BM, Neurospin
MM, 22 February 2010
Outline
1. Introduction to Regularization
2. Bayesian Regression
Example of Application in fMRI
Fig.: Example of application of regression (here SVR) in an fMRI paradigm. From clouds containing different numbers of dots, we predict the quantity seen by the subject. In collaboration with E. Eger, INSERM U562.
Introduction
Regression
Given a set of m images of n variables, $X = [x_1, \dots, x_n] \in \mathbb{R}^{m \times n}$, and a target $y \in \mathbb{R}^m$, we have the following regression model:
$$y = Xw + \epsilon$$
with $w$ the parameters of the model (the weights) and $\epsilon$ the error. The estimation $\hat{w}$ of $w$ is commonly done by Ordinary Least Squares:
$$\hat{w} = \arg\min_w \|y - Xw\|_2^2 \quad \Rightarrow \quad \hat{w} = (X^T X)^{-1} X^T y$$
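A minimal sketch of the OLS estimate above in NumPy, on simulated data (the sizes and variable names are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5                      # m samples (images), n variables
X = rng.normal(size=(m, n))
w_true = np.array([1.0, -2.0, 0.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=m)

# Ordinary Least Squares: w_hat = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)
```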
Regularization
Motivations
- OLS estimates often have low bias but large variance.
- All features get a weight, whereas a smaller subset with strong effects is more interpretable.
- We need some shrinkage (or regularization) to constrain $\hat{w}$: constrain the values of $\hat{w}$ so that a few parameters explain the data well.
Bridge Regression (I.E. Frank and J.H. Friedman, 1993)
$$\hat{w} = \arg\min_w \underbrace{\|y - Xw\|_2^2}_{\text{Data fit}} + \underbrace{\lambda \sum_i |w_i|^\gamma}_{\text{Penalization}}, \quad \gamma \geq 0$$
Ridge Regression
L2 norm (A.E. Hoerl and R.W. Kennard, 1970)
Ridge regression introduces a regularization with the L2 norm:
$$\hat{w} = \arg\min_w \|y - Xw\|_2^2 + \lambda_2 \|w\|_2^2$$
Properties
- Sacrifices a little bias to reduce the variance of the predicted values; more stable.
- Grouping effect: voxels with close signals get close coefficients, which exhibits groups of correlated features.
- Keeps all the regressors in the model, so the model is not easily interpretable.
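A hedged sketch of the closed-form Ridge estimate, plus the equivalent scikit-learn call (the regularization value is illustrative):

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form Ridge estimate: (X^T X + lam * I)^{-1} X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Equivalent fit with scikit-learn (no intercept, to match the bare model above):
# from sklearn.linear_model import Ridge
# w_hat = Ridge(alpha=1.0, fit_intercept=False).fit(X, y).coef_
```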
Ridge Regression - Simulation
Fig.: Coefficients of Ridge Regression for different values of $\lambda_2$.
Lasso - Least Absolute Shrinkage and Selection Operator
L1 norm (R. Tibshirani, 1996)
Lasso regression introduces a regularization with the L1 norm:
$$\hat{w} = \arg\min_w \|y - Xw\|_2^2 + \lambda_1 \|w\|_1$$
Properties
- Parsimony: only a small subset of features with $\hat{w}_i \neq 0$ is selected, which increases the interpretability (especially when combined with parcels).
- If n > m, i.e. more features than samples, Lasso selects at most m variables.
- More difficult to implement than Ridge Regression.
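A hedged sketch of a Lasso fit with scikit-learn on simulated sparse data (the alpha value and data sizes are illustrative, not from the talk):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))             # n > m: more features than samples
w_true = np.zeros(200)
w_true[:5] = [3.0, -2.0, 1.5, 1.0, -0.5]   # only 5 informative features
y = X @ w_true + 0.1 * rng.normal(size=50)

lasso = Lasso(alpha=0.1).fit(X, y)
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))
```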
Lasso - Simulation
Fig.: Coefficients of Lasso Regression for different values of $\|\hat{w}\|_1 / \|\hat{w}^{OLS}\|_2$.
Outline
1. Introduction to Regularization
2. Bayesian Regression
Bayesian Regression
Questions
- What is a graphical model?
- Why use Bayesian probabilities?
- What is a prior?
- What is a Hierarchical Bayesian Model?
- Why use hyperparameters?
Bayesian Regression and Graphical Model
Model
The regression model $y = Xw + \epsilon$ can be seen as the following model:
$$p(y | X, w, \lambda) = \mathcal{N}(Xw, \lambda) \quad \text{with} \quad p(\epsilon) = \mathcal{N}(0, \lambda)$$
[Graphical model: nodes $\lambda$, $X$, $w$ and $y$, with $\epsilon \sim \mathcal{N}(0, \lambda I_n)$ and $y = Xw + \epsilon$.]
Can we add some (prior) knowledge on the parameters?
Bayesian Probabilities
Main idea: introducing prior information in the modeling
Bayes' theorem:
$$p(w | y) = \frac{p(y | w)\, p(w)}{p(y)}$$
where $p(w)$ is called a prior. This prior, introduced before any access to the data, reflects our prior knowledge of $w$. Moreover, the posterior probability $p(w | y)$ gives us the uncertainty in $w$ after we have observed $y$.
Example
An unbiased coin tossed three times gives 3 heads.
- A frequentist approach: $p(\text{head}) = 1$, so heads for all future tosses.
- A Bayesian approach: we can add a prior (Beta) to represent our prior knowledge that an unbiased coin gives less extreme results.
See Bishop, Pattern Recognition and Machine Learning, for more details.
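A tiny worked version of the coin example, assuming a Beta(2, 2) prior (the prior parameters are an illustrative choice, not from the talk):

```python
from scipy.stats import beta

# Prior Beta(a=2, b=2); observed data: 3 heads, 0 tails.
a, b = 2, 2
heads, tails = 3, 0
posterior = beta(a + heads, b + tails)   # Beta(5, 2) by conjugacy

# Posterior mean ~0.71: much less extreme than the frequentist estimate p(head) = 1.
print(posterior.mean())
```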
Priors for Bayesian Regression - Ridge Regression
Prior for the weights w
Penalization by a weighted L2 norm is equivalent to setting a Gaussian prior on the weights w:
$$w \sim \mathcal{N}(0, \alpha I)$$
Bayesian Ridge Regression (C. Bishop, 2000).
Prior for the noise $\epsilon$
Classical prior (M.E. Tipping, 2000):
$$\epsilon \sim \mathcal{N}(0, \lambda I_n)$$
Graphical Model - Ridge Regression
[Graphical model: nodes $\lambda$, $\alpha$, $X$, $w$ and $y$, with $\epsilon \sim \mathcal{N}(0, \lambda I_n)$, $w \sim \mathcal{N}(0, \alpha I)$ and $y = Xw + \epsilon$.]
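As a hedged illustration, scikit-learn exposes a closely related model as BayesianRidge (Gaussian prior on the weights, Gamma hyperpriors on the precisions; the simulated data are illustrative):

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.1 * rng.normal(size=100)

model = BayesianRidge().fit(X, y)
print(model.coef_)                  # posterior mean of the weights w
print(model.alpha_, model.lambda_)  # estimated noise and weight precisions
```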
Example on Simulated Data - Ridge Regression
Fig.: Log-density of the distribution of $w$ for the Bayesian Ridge Regression.
Can we find some more adaptive priors?
Priors for Bayesian Regression - ARD
Prior for the weights w
If we have different Gaussian priors for w:
$$w \sim \mathcal{N}(0, A^{-1}), \quad A = \mathrm{diag}(\alpha_1, \dots, \alpha_m) \quad \text{with } \alpha_i \neq \alpha_j \text{ if } i \neq j.$$
Automatic Relevance Determination - ARD (MacKay, 1994).
Prior for the noise $\epsilon$
Classical prior (M.E. Tipping, 2000):
$$\epsilon \sim \mathcal{N}(0, \lambda I_n)$$
Graphical Model - ARD
[Graphical model: nodes $\lambda$, $\alpha$, $X$, $w$ and $y$, with $\epsilon \sim \mathcal{N}(0, \lambda I_n)$, $w \sim \mathcal{N}(0, A^{-1})$, $A = \mathrm{diag}(\alpha_1, \dots, \alpha_m)$ and $y = Xw + \epsilon$.]
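A hedged sketch with scikit-learn's ARDRegression, which implements this one-precision-per-weight prior (the simulated data and the choice of relevant features are illustrative):

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[[2, 7, 11]] = [2.0, -1.5, 1.0]     # only 3 relevant features
y = X @ w_true + 0.1 * rng.normal(size=100)

ard = ARDRegression().fit(X, y)
print(ard.coef_)      # irrelevant weights are driven to (near) zero
print(ard.lambda_)    # one estimated precision alpha_i per weight
```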
Example on Simulated Data - ARD
Fig.: Log-density of the distribution of $w$ for ARD.
Other (Less Classical) Priors for Bayesian Regression
LASSO - Laplacian prior for the weights w
Penalization by a weighted L1 norm is equivalent to setting Laplacian priors on the weights w:
$$w \sim C \exp(-\lambda \|w\|_1)$$
Elastic Net - complex prior for the weights w
Penalization by weighted L1 and L2 norms is equivalent to setting a more complex prior on the weights w:
$$w \sim C(\lambda, \alpha) \exp\left(-\lambda \|w\|_1 - \alpha \|w\|_2^2\right)$$
But... it is difficult to estimate the parameters of these models...
Notions of Hyperpriors
What is a hyperprior?
- The parameters of a prior are called hyperparameters. E.g. for ARD, $\alpha_i$ is a hyperparameter in $w \sim \mathcal{N}(0, A^{-1})$, $A = \mathrm{diag}(\alpha_1, \dots, \alpha_m)$.
- A hyperprior is a prior on the hyperparameters.
- It allows the parameters of the model to adapt to the data.
- It defines a Hierarchical Bayesian Model.
Classical hyperpriors
See C. Bishop & M.E. Tipping, 2000.
Bayesian Regression - Hierarchical Bayesian Model
[Graphical model: Gamma hyperpriors $p(\lambda) = \Gamma(\lambda_1, \lambda_2)$ on the noise precision and $p(\alpha_i) = \Gamma(\gamma_1^i, \gamma_2^i)$ on each weight precision, with $\epsilon \sim \mathcal{N}(0, \lambda I_n)$, $w \sim \mathcal{N}(0, A^{-1})$, $A = \mathrm{diag}(\alpha_1, \dots, \alpha_m)$ and $y = Xw + \epsilon$.]
But how to learn the parameters?
THE Difficult Question: How to Estimate Our Model?
We have a model, we have some data: how to learn the parameters?
Likelihood
The log-likelihood is the quantity $p(y | X, w, \lambda)$ viewed as a function of the parameter $w$. It expresses how probable the observed data is for different settings of $w$. The main idea is to find the parameters that optimize the log-likelihood $\mathcal{L}(y | w)$, i.e. to choose the parameters for which the probability of the explained data $y$ is maximized.
Simple methods - when $\mathcal{L}(y | w)$ is not too complicated
- Maximum A Posteriori (MAP).
- Expectation Maximization - EM (Dempster, 1977): can be used in the presence of latent variables (e.g. mixtures of Gaussians).
When $\mathcal{L}(y | w)$ IS Too Complicated...
Sampling methods
- Used when we cannot directly evaluate the distributions.
- Metropolis-Hastings, Gibbs sampling, simulated annealing.
- Computationally expensive.
Fig.: 2D example of Gibbs sampling.
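A minimal, self-contained sketch of Gibbs sampling on a toy 2D Gaussian (the target distribution and parameters are illustrative, not the one shown in the slide's figure):

```python
import numpy as np

# Gibbs sampling for a 2D Gaussian with correlation rho:
# alternately draw each coordinate from its conditional distribution.
rng = np.random.default_rng(0)
rho = 0.8
n_samples = 5000
samples = np.zeros((n_samples, 2))
x1, x2 = 0.0, 0.0
for t in range(n_samples):
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))  # p(x1 | x2)
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))  # p(x2 | x1)
    samples[t] = x1, x2

print(np.corrcoef(samples.T)[0, 1])   # empirical correlation, close to rho
```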
When $\mathcal{L}(y | w)$ IS Too Complicated...
Variational Bayes framework - Beal, M.J. (2003)
Based on some (strong) assumptions and approximations:
- Laplace approximation: Gaussian approximation of a probability density.
- Factorization of some distributions.
- Equivalent to the maximization of the free energy, which is a lower bound on the log-likelihood.
Limits of the Variational Bayes Framework
Sometimes, the approximations of the VB framework are too strong for the model, and the free energy is no longer useful...
Results on Real Data - Visual Task
Fig.: $w$ for Elastic Net (left) and Bayesian Regression (right).
Conclusion
Bayesian Regression
- Allows us to introduce prior information.
- A posteriori distribution: uncertainty of the parameters after observing the data.
- Model estimation can be computationally expensive.
- Poor priors will lead to poor results with high confidence, so always validate the results by cross-validation!
Thank you for your attention!
ARD - Practical Point of View
EM
$$\alpha_i^{new} = \frac{1 + 2\gamma_1^i}{\mu_i^2 + \Sigma_{ii} + 2\gamma_2^i}, \qquad \lambda^{new} = \frac{n + 2\lambda_1}{\|y - X\mu\|^2 + \mathrm{Trace}(\Sigma X^T X) + 2\lambda_2}$$
with $w = \mu = \lambda \Sigma X^T y$ and $\Sigma = (A + \lambda X^T X)^{-1}$.
Gibbs Sampling
$$p(w | \theta \setminus \{w\}) \sim \mathcal{N}(w | \mu, \Sigma), \qquad p(\lambda | \theta \setminus \{\lambda\}) \sim \Gamma(l_1, l_2), \qquad p(\alpha | \theta \setminus \{\alpha\}) \sim \prod_{i=1}^{m} \Gamma(\alpha_i | g_1^i, g_2^i)$$
with $g_1^i = \gamma_1^i + \frac{1}{2}$, $g_2^i = \gamma_2^i + \frac{1}{2} w_i^2$, $l_1 = \lambda_1 + n/2$ and $l_2 = \lambda_2 + \frac{1}{2}\|y - Xw\|^2$.
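A compact NumPy sketch of the EM updates above (the Gamma hyperparameter defaults and the fixed iteration count are illustrative choices, not values from the talk):

```python
import numpy as np

def ard_em(X, y, gamma1=1e-6, gamma2=1e-6, lambda1=1e-6, lambda2=1e-6,
           n_iter=100):
    """EM updates for ARD regression, following the slide's formulas."""
    n, m = X.shape
    alpha = np.ones(m)        # weight precisions alpha_i
    lam = 1.0                 # noise precision lambda
    for _ in range(n_iter):
        # Posterior of w: Sigma = (A + lambda X^T X)^{-1}, mu = lambda Sigma X^T y
        Sigma = np.linalg.inv(np.diag(alpha) + lam * X.T @ X)
        mu = lam * Sigma @ (X.T @ y)
        # Hyperparameter updates
        alpha = (1 + 2 * gamma1) / (mu**2 + np.diag(Sigma) + 2 * gamma2)
        lam = (n + 2 * lambda1) / (np.sum((y - X @ mu)**2)
                                   + np.trace(Sigma @ X.T @ X) + 2 * lambda2)
    return mu, alpha, lam
```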
Estimation by Maximum Likelihood (ML)
Likelihood
The log-likelihood is the quantity $p(y | X, w, \lambda)$ viewed as a function of the parameter $w$. It expresses how probable the observed data is for different settings of $w$.
Estimation by ML
The main idea is to find the parameter $\hat{w}$ that optimizes the log-likelihood $\mathcal{L}(y | w)$:
$$\hat{w} = \arg\max_w \mathcal{L}(y | w)$$
i.e. to choose $\hat{w}$ for which the probability of the explained data $y$ is maximized. The solution is (hopefully!) the same as the one found by OLS:
$$\hat{w} = (X^T X)^{-1} X^T y$$
How to Construct the Prediction $\hat{Y}$?
Iterative Algorithm
$$\hat{Y} = \sum_{i=1}^{n} x_i \hat{\beta}_i = X \hat{\beta}$$
Start with $\hat{Y} = 0$, and at each step:
- Compute the vector of current correlations $c(\hat{Y})$: $\hat{c} = c(\hat{Y}) = X^T (Y - \hat{Y})$.
- Then take the direction of the greatest current correlation for a small step $\epsilon$: $\hat{Y} \leftarrow \hat{Y} + \epsilon\, \mathrm{sign}(\hat{c}_j)\, x_j$ with $j = \arg\max_i |\hat{c}_i|$.
Choice of $\epsilon$
Forward Selection technique
- We have $\epsilon = \hat{c}_j$. If a feature seems interesting, we add it with $\beta_j = 1$.
- Could we weight the influence of $x_j$?
Forward Stagewise Linear Regression
- We have $\epsilon \ll 1$. If a feature seems interesting, we add it with $\beta_j \ll 1$.
- Another feature can be selected at the next step if it is more correlated with $\hat{Y}$.
- We adapt the weights to the features, but the computation can be time-consuming (see the sketch below).
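A hedged sketch of Forward Stagewise Linear Regression as described above (the step size eps and the number of steps are illustrative choices):

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=1000):
    """At each step, nudge the coefficient of the feature most correlated
    with the current residual by a small amount eps."""
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)
    y_hat = np.zeros(n_samples)
    for _ in range(n_steps):
        c = X.T @ (y - y_hat)            # current correlations
        j = np.argmax(np.abs(c))         # most correlated feature
        beta[j] += eps * np.sign(c[j])   # tiny step in its direction
        y_hat = X @ beta
    return beta
```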
LARS - Least Angle Regression
Aim: accelerate the computation of the Stagewise procedure.
Least Angle Regression
- We compute the optimal $\epsilon$, i.e. the optimal step until another feature becomes more interesting.
- Take the equiangular direction between the selected features.
- Link with Lasso / Elastic Net.
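As a hedged pointer, scikit-learn's lars_path computes this path directly; with method="lasso" it returns the Lasso regularization path mentioned in the last point (the simulated data are illustrative):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))
w_true = np.zeros(30)
w_true[:4] = [2.0, -1.0, 1.5, 0.5]
y = X @ w_true + 0.1 * rng.normal(size=60)

# Each column of coefs is the coefficient vector at one breakpoint of the path.
alphas, active, coefs = lars_path(X, y, method="lasso")
print(coefs.shape)   # (n_features, n_breakpoints)
```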