Regularized Regression: A Bayesian point of view


Regularized Regression: A Bayesian point of view
Vincent Michel
Director: Gilles Celeux. Supervisor: Bertrand Thirion.
Parietal Team, INRIA Saclay Ile-de-France; LRI, Université Paris Sud; CEA, DSV, I2BM, Neurospin
MM, 22 February 2010

Outline
1. Introduction to Regularization
2. Bayesian Regression

Example of Application in fMRI
Fig.: Example of application of regression (here SVR) in an fMRI paradigm. For clouds with different numbers of dots, we predict the quantity seen by the subject. In collaboration with E. Eger, INSERM U562.

Introduction: Regression
Given a set of m images of n variables, X = (x_1, ..., x_n) ∈ R^{m×n}, and a target y ∈ R^m, we have the following regression model:
y = Xw + ε
with w the parameters of the model (the "weights") and ε the error.
The estimation ŵ of w is commonly done by Ordinary Least Squares (OLS):
ŵ = argmin_w ‖y − Xw‖²
ŵ = (X^T X)^{-1} X^T y
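As a concrete illustration, here is a minimal NumPy sketch of the OLS estimate above; the synthetic data and variable names are illustrative, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5                                  # m images (samples), n variables
X = rng.normal(size=(m, n))                    # design matrix
w_true = rng.normal(size=n)                    # ground-truth weights
y = X @ w_true + 0.1 * rng.normal(size=m)      # y = Xw + noise

# OLS estimate w_hat = (X^T X)^{-1} X^T y, computed with a linear solve
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat.round(2))
```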

Regularization: Motivations
OLS estimates often have low bias but large variance.
All features get a nonzero weight; a smaller subset with strong effects is more interpretable.
We therefore need some shrinkage (or regularization) to constrain ŵ: constrain the values of ŵ so that a few parameters explain the data well.
Bridge Regression (I.E. Frank and J.H. Friedman, 1993):
ŵ = argmin_w ‖y − Xw‖² + λ Σ_i |w_i|^γ,  with γ ≥ 0
where the first term is the data fit and the second term the penalization.

Ridge Regression: L2 norm (A.E. Hoerl and R.W. Kennard, 1970)
Ridge regression introduces a regularization with the L2 norm:
ŵ = argmin_w ‖y − Xw‖² + λ₂ ‖w‖₂²
Properties:
- Sacrifices a little bias to reduce the variance of the predicted values: more stable.
- Grouping effect: voxels with close signals will have close coefficients, which exhibits groups of correlated features.
- Keeps all the regressors in the model: not an easily interpretable model.
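The ridge estimate has the closed form ŵ = (X^T X + λ₂ I)^{-1} X^T y; a minimal sketch (the data and the value of λ₂ are arbitrary, for illustration only):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate: (X^T X + lam * I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
print(ridge(X, y, lam=1.0))    # a larger lam shrinks the coefficients more strongly
```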

Ridge Regression - Simulation
Fig.: Coefficients of Ridge regression for different values of λ₂.

Lasso - Least Absolute Shrinkage and Selection Operator: L1 norm (R. Tibshirani, 1996)
Lasso regression introduces a regularization with the L1 norm:
ŵ = argmin_w ‖y − Xw‖² + λ₁ ‖w‖₁
Properties:
- Parsimony: only a small subset of features with ŵ_i ≠ 0 is selected, which increases interpretability (especially combined with parcels).
- If n > m, i.e. more features than samples, the Lasso selects at most m variables.
- More difficult to implement than Ridge regression.
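A short usage sketch with scikit-learn's Lasso (not the implementation used in the talk); note that scikit-learn scales the data-fit term by 1/(2m), so its alpha plays the role of λ₁ only up to that factor. The synthetic sparse problem is illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]                  # only 3 truly relevant features
y = X @ w_true + 0.1 * rng.normal(size=50)

lasso = Lasso(alpha=0.1).fit(X, y)
print(np.flatnonzero(lasso.coef_))             # indices of the selected (nonzero) weights
```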

Lasso - Simulation
Fig.: Coefficients of Lasso regression for different values of ‖ŵ‖₁ / ‖ŵ_OLS‖₂.

Outline
1. Introduction to Regularization
2. Bayesian Regression

Bayesian Regression: Questions
What is a graphical model?
Why use Bayesian probabilities?
What is a prior?
What is a Hierarchical Bayesian Model?
Why use hyperparameters?

Bayesian Regression and Graphical Model
The regression model y = Xw + ε can be seen as the following probabilistic model:
p(y | X, w, λ) = N(Xw, λ),  with p(ε) = N(0, λ I_n)
(Graphical model: nodes λ, X and w feed into y = Xw + ε, with ε ~ N(0, λ I_n).)
Can we add some (prior) knowledge on the parameters?

Bayesian Probabilities
Main idea: introduce prior information into the modeling.
Bayes' theorem: p(w | y) = p(y | w) p(w) / p(y)
where p(w) is called a prior. This prior, specified before any access to the data, reflects our prior knowledge of w. Moreover, the posterior probability p(w | y) gives us the uncertainty in w after we have observed y.
Example: an unbiased coin tossed three times gives 3 heads.
A frequentist approach: p(head) = 1, i.e. always heads for all future tosses.
A Bayesian approach: we can add a prior (a Beta distribution) to represent our prior knowledge that the coin is unbiased, which gives less extreme results.
See Bishop, Pattern Recognition and Machine Learning, for more details.
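A minimal sketch of the coin example, assuming a Beta(2, 2) prior (an illustrative choice): the maximum-likelihood estimate jumps to 1 after three heads, while the conjugate Beta posterior mean stays less extreme.

```python
# Unbiased coin tossed 3 times: 3 heads observed
heads, tails = 3, 0

# Frequentist / maximum-likelihood estimate
p_ml = heads / (heads + tails)          # = 1.0 -> predicts heads forever

# Bayesian estimate with a Beta(2, 2) prior (illustrative pseudo-counts)
a0, b0 = 2, 2
a, b = a0 + heads, b0 + tails           # conjugate Beta posterior: Beta(5, 2)
p_bayes = a / (a + b)                   # posterior mean ~ 0.71, less extreme

print(p_ml, p_bayes)
```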

Priors for Bayesian Regression - Ridge Regression
Prior for the weights w: penalization by a weighted L2 norm is equivalent to setting a Gaussian prior on the weights w:
w ~ N(0, αI)
This gives Bayesian Ridge Regression (C. Bishop, 2000).
Prior for the noise ε: the classical prior (M.E. Tipping, 2000) is
ε ~ N(0, λ I_n)
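With a Gaussian prior on w and Gaussian noise, the posterior over w is itself Gaussian and available in closed form. A sketch, treating α and λ as precisions so that it matches the formulas in the backup slides (Σ = (A + λ X^T X)^{-1}, µ = λ Σ X^T y, here with A = αI); the numerical values are illustrative.

```python
import numpy as np

def bayesian_ridge_posterior(X, y, alpha, lam):
    """Posterior N(mu, Sigma) over w, assuming w ~ N(0, alpha^{-1} I)
    and noise precision lam, i.e. eps ~ N(0, lam^{-1} I)."""
    Sigma = np.linalg.inv(alpha * np.eye(X.shape[1]) + lam * X.T @ X)
    mu = lam * Sigma @ X.T @ y
    return mu, Sigma

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=100)
mu, Sigma = bayesian_ridge_posterior(X, y, alpha=1.0, lam=100.0)
# mu coincides with the ridge estimate for lambda_2 = alpha / lam
```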

Graphical Model - Ridge Regression
ε ~ N(0, λ I_n),  w ~ N(0, αI),  y = Xw + ε
(hyperparameters λ and α sit above ε and w in the graph)

Example on Simulated Data - Ridge Regression
Log-density of the distribution of w for Bayesian Ridge Regression.
Can we find some more adaptive priors?

Priors for Bayesian Regression - ARD
Prior for the weights w: we can give each weight its own Gaussian prior:
w ~ N(0, A^{-1}),  A = diag(α_1, ..., α_m),  with α_i ≠ α_j for i ≠ j
This is Automatic Relevance Determination, ARD (MacKay, 1994).
Prior for the noise ε: the classical prior (M.E. Tipping, 2000) is
ε ~ N(0, λ I_n)
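For reference, scikit-learn ships an ARD regressor; a short usage sketch (caution: scikit-learn's naming is roughly the reverse of these slides, since its lambda_ holds the per-weight precisions and alpha_ the noise precision).

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]          # sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=50)

ard = ARDRegression().fit(X, y)
print(ard.coef_.round(2))              # irrelevant weights are driven toward zero
```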

Graphical Model - ARD
ε ~ N(0, λ I_n),  w ~ N(0, A^{-1}) with A = diag(α_1, ..., α_m),  y = Xw + ε

Example on Simulated Data - ARD
Log-density of the distribution of w for ARD.

Other (Less Classical) Priors for Bayesian Regression
Lasso - Laplacian prior for the weights w: penalization by a weighted L1 norm is equivalent to setting a Laplacian prior on the weights w:
w ~ C exp(−λ ‖w‖₁)
Elastic Net - a more complex prior for the weights w: penalization by weighted L1 and L2 norms is equivalent to setting a more complex prior on the weights w:
w ~ C(λ, α) exp(−λ ‖w‖₁ − α ‖w‖₂²)
But the parameters of these models are difficult to estimate.
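To make the Lasso correspondence explicit, here is the one-line MAP derivation it rests on (Gaussian noise with variance σ², Laplacian prior; constants independent of w are dropped):

```latex
\hat{w}_{\mathrm{MAP}}
  = \arg\max_{w}\,\bigl[\log p(y \mid X, w) + \log p(w)\bigr]
  = \arg\min_{w}\,\frac{1}{2\sigma^2}\,\lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_1 .
```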

Notions of Hyperpriors
What is a hyperprior? The parameters of a prior are called hyperparameters; e.g. for ARD, each α_i in w ~ N(0, A^{-1}), A = diag(α_1, ..., α_m), is a hyperparameter.
A hyperprior is a prior on the hyperparameters. It allows the parameters of the model to adapt to the data, and it defines a Hierarchical Bayesian Model.
Classical hyperpriors: see C. Bishop and M.E. Tipping, 2000.

Bayesian Regression - Hierarchical Bayesian Model
p(λ) = Γ(λ₁, λ₂),  ε ~ N(0, λ I_n)
p(α_i) = Γ(γ_1^i, γ_2^i),  w ~ N(0, A^{-1}),  A = diag(α_1, ..., α_m)
y = Xw + ε
But how do we learn the parameters?

THE difficult question: how to estimate our model?
We have a model and we have some data; how do we learn the parameters?
Likelihood: the log-likelihood is the quantity p(y | X, w, λ) viewed as a function of the parameters w. It expresses how probable the observed data are for different settings of w. The main idea is to find the parameters which optimize the log-likelihood L(y | w), i.e. to choose the parameters for which the probability of the observed data y is maximized.
Simple methods, when L(y | w) is not too complicated:
- Maximum A Posteriori (MAP).
- Expectation Maximization, EM (Dempster, 1977), which can be used in the presence of latent variables (e.g. mixtures of Gaussians).

When L(y | w) IS too complicated: sampling methods
Used when we cannot directly evaluate the distributions: Metropolis-Hastings, Gibbs sampling, simulated annealing. Computationally expensive.
(Figure: 2-D example of Gibbs sampling.)
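A minimal Gibbs-sampling sketch on a toy 2-D Gaussian (correlation 0.8, chosen arbitrarily), alternating draws from the two full conditionals, in the spirit of the figure:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                         # target: 2-D Gaussian with correlation rho
x1, x2 = 0.0, 0.0
samples = []
for _ in range(5000):
    # each coordinate is drawn from its full conditional given the other
    x1 = rng.normal(rho * x2, np.sqrt(1 - rho**2))
    x2 = rng.normal(rho * x1, np.sqrt(1 - rho**2))
    samples.append((x1, x2))

samples = np.array(samples)
print(np.corrcoef(samples.T)[0, 1])   # close to rho
```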

When L(y | w) IS too complicated: the Variational Bayes framework (Beal, M.J., 2003)
Based on some (strong) assumptions and approximations:
- Laplace approximation: Gaussian approximation of a probability density.
- Factorization of some distributions.
It is equivalent to maximizing the Free Energy, which is a lower bound on the log-likelihood.

Limits of the Variational Bayes framework
Sometimes the approximations of the VB framework are too strong for the model, and the free energy is no longer useful.

Results on Real Data - Visual Task
Fig.: w for Elastic Net (left) and Bayesian Regression (right).

Conclusion
Bayesian Regression:
- Allows prior information to be introduced.
- The posterior distribution quantifies the uncertainty in the parameters after observing the data.
- Model estimation can be computationally expensive.
- Poor priors will lead to poor results with high confidence: always validate the results by cross-validation!
Thank you for your attention!


ARD - Practical Point of View
EM updates:
α_i^new = (1 + 2γ_1^i) / (µ_i² + Σ_ii + 2γ_2^i)
λ^new = (n + 2λ₁) / (‖y − Xµ‖² + Trace(Σ X^T X) + 2λ₂)
with w = µ = λ Σ X^T y and Σ = (A + λ X^T X)^{-1}.
Gibbs sampling:
p(w | θ∖{w}) = N(w | µ, Σ)
p(λ | θ∖{λ}) = Γ(l₁, l₂)
p(α | θ∖{α}) = Π_{i=1}^{m} Γ(α_i | g_1^i, g_2^i)
with g_1^i = γ_1^i + 1/2, g_2^i = γ_2^i + w_i²/2, l₁ = λ₁ + n/2 and l₂ = λ₂ + ‖y − Xw‖²/2.
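A NumPy sketch of the EM updates above, with the hyperparameters γ₁, γ₂, λ₁, λ₂ set to small illustrative values (broad Gamma hyperpriors); this is one reading of the slide's formulas, not the original code.

```python
import numpy as np

def ard_em(X, y, g1=1e-6, g2=1e-6, l1=1e-6, l2=1e-6, n_iter=100):
    """EM for the hierarchical ARD model: alpha holds the per-weight
    precisions, lam the noise precision (sketch of the slide's updates)."""
    n_samples, n_features = X.shape
    alpha = np.ones(n_features)
    lam = 1.0
    for _ in range(n_iter):
        # Posterior over w given the current (alpha, lam)
        Sigma = np.linalg.inv(np.diag(alpha) + lam * X.T @ X)
        mu = lam * Sigma @ X.T @ y
        # Hyperparameter updates from the slide
        alpha = (1 + 2 * g1) / (mu**2 + np.diag(Sigma) + 2 * g2)
        resid = y - X @ mu
        lam = (n_samples + 2 * l1) / (resid @ resid + np.trace(Sigma @ X.T @ X) + 2 * l2)
    return mu, Sigma, alpha, lam
```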

Estimation by Maximum Likelihood (ML)
Likelihood: the log-likelihood is the quantity p(y | X, w, λ) viewed as a function of the parameters w. It expresses how probable the observed data are for different settings of w.
Estimation by ML: the main idea is to find the parameter ŵ which optimizes the log-likelihood L(y | w):
ŵ = argmax_w L(y | w)
i.e. choose ŵ for which the probability of the observed data y is maximized. The solution is (hopefully!) the same as the one found by OLS:
ŵ = (X^T X)^{-1} X^T y

How to construct the prediction Ŷ?
Iterative algorithm:
Ŷ = Σ_{i=1}^{n} x_i β̂_i = X β̂
Start with Ŷ = 0, and at each step:
- Compute the vector of current correlations: ĉ = c(Ŷ) = X^T (Y − Ŷ).
- Then take the direction of greatest current correlation for a small step ε: Ŷ ← Ŷ + ε sign(ĉ_j) x_j, with j = argmax_i |ĉ_i|.
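A direct sketch of this iterative procedure (the step size and number of steps are arbitrary):

```python
import numpy as np

def forward_stagewise(X, y, eps=0.01, n_steps=1000):
    """Repeatedly take a small step on the coefficient of the feature
    most correlated with the current residual."""
    beta = np.zeros(X.shape[1])
    y_hat = np.zeros_like(y)
    for _ in range(n_steps):
        c = X.T @ (y - y_hat)            # current correlations c(Y_hat)
        j = np.argmax(np.abs(c))         # feature with the greatest correlation
        beta[j] += eps * np.sign(c[j])   # small step in that direction
        y_hat = X @ beta
    return beta
```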

Choice of ε
Forward Selection: we take ε = ĉ_j. If a feature seems interesting, we add it with β_j = 1. Could we weight the influence of x_j?
Forward Stagewise Linear Regression: we take ε ≪ 1. If a feature seems interesting, we add it with β_j ≪ 1. Another feature can be selected at the next step if it is more correlated with the current residual. This adapts the weights to the features, but the computation can be time-consuming.

LARS - Least Angle Regression
Aim: accelerate the computation of the Stagewise procedure.
Least Angle Regression: we compute the optimal ε, i.e. the optimal step until another feature becomes more interesting, and take the equiangular direction between the selected features.
Links with the Lasso and Elastic Net.
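For experimentation, scikit-learn exposes the LARS/Lasso path through lars_path; a short usage sketch on illustrative synthetic data:

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.normal(size=50)

# method='lasso' follows the Lasso path using the LARS machinery
alphas, active, coefs = lars_path(X, y, method='lasso')
print(active)          # features in the active set along the path
```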