Lecture 1b: Linear Models for Regression


Lecture 1b: Linear Models for Regression
Cédric Archambeau
Centre for Computational Statistics and Machine Learning, Department of Computer Science, University College London
c.archambeau@cs.ucl.ac.uk
Advanced Topics in Machine Learning (MSc in Intelligent Systems), January 2008

Today's plan
- Least squares and ridge regression: a probabilistic view
- Bayesian linear models for regression
- Sparse extensions
- Maximum likelihood, maximum a posteriori, type II ML, variational inference

Regression problem
Given a finite number of noisy observations $\{t_n\}_{n=1}^N$ associated with some input data $\{x_n\}_{n=1}^N$, we would like to predict the outcome of an unseen input $x_{\text{new}}$. This process is called generalisation, and the input-target pairs $\{x_n, t_n\}_{n=1}^N$ are the training data.
Figure: (a) Target, with $f(x_{\text{new}}) = ?$ to be predicted. (b) Generalisation.

Linear models for regression
The model $y(x, w)$ is linear in the parameters $w$ and is expressed as a weighted sum of nonlinear basis functions $\{\phi_m(\cdot)\}_{m=1}^M$ centred on $M$ learning prototypes:
$$y(x, w) = \sum_{m=1}^M w_m \phi_m(x) + w_0 = w^\top \phi(x).$$
Examples include least squares regression, partial least squares, regularisation networks, support vector machines, radial basis function networks, splines, the LASSO, ...
The goal is to infer the parameters from a set of real-valued input-target pairs $\{x_n, t_n\}_{n=1}^N$ which lead to the best prediction on unseen data.

Least squares regression
We consider the sum-of-squares error:
$$E(w) = \frac{1}{2} \sum_{n=1}^N (t_n - y_n)^2 = \frac{1}{2} \| t - y \|^2,$$
where $y_n \equiv y(x_n, w)$, $t = (t_1, \ldots, t_N)^\top$ and $y = (y_1, \ldots, y_N)^\top$.
Since this expression is quadratic in $w$, its minimisation leads to a unique minimum:
$$w_{\text{LS}} = (\Phi^\top \Phi)^{-1} \Phi^\top t = \Phi^\dagger t,$$
where $\Phi^\dagger$ is the Moore-Penrose pseudo-inverse of $\Phi \in \mathbb{R}^{N \times (M+1)}$.
In practice, $\Phi$ is often ill-conditioned and solving the linear system leads to overfitting (low bias, but high variance; too much flexibility!).

The design matrix $\Phi$ is defined as follows:
$$\Phi = \begin{pmatrix} 1 & \phi_1(x_1) & \ldots & \phi_M(x_1) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \phi_1(x_N) & \ldots & \phi_M(x_N) \end{pmatrix}.$$
Using this definition we can write $y = \Phi w$. Standard linear regression is recovered for $\Phi = X$.
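As a concrete illustration (not part of the original slides), the following minimal NumPy sketch builds the design matrix with squared-exponential basis functions centred on the training inputs and computes the least squares solution via the pseudo-inverse. The function and variable names are my own, and the toy data mimics the sinc example used later in the lecture.

```python
import numpy as np

def design_matrix(x, centres, lam):
    """Design matrix with a bias column and squared-exponential basis functions."""
    # phi_m(x) = exp(-lam/2 * (x - x_m)^2), one column per centre, plus a column of ones
    basis = np.exp(-0.5 * lam * (x[:, None] - centres[None, :]) ** 2)
    return np.hstack([np.ones((x.shape[0], 1)), basis])

# Toy data from the sinc example: t = sin(x)/x + Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(-10, 10, 25)
t = np.sinc(x / np.pi) + 0.1 * rng.standard_normal(x.shape)

Phi = design_matrix(x, centres=x, lam=1.0 / 36.0)
w_ls = np.linalg.pinv(Phi) @ t     # w_LS = Phi^dagger t
y_fit = Phi @ w_ls                 # fitted values on the training inputs
```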

Example of overfitting
The target function is given by $f(x) = \frac{\sin x}{x}$, $x \in [-10, 10]$. We choose the squared exponential basis function:
$$\phi_m(x) = \exp\left\{ -\frac{\lambda_m}{2} (x - x_m)^2 \right\}, \qquad \lambda_m > 0.$$
Figure: The target sinc function (dashed line) and the least squares regression solution (solid line) for $\lambda_m = 1/36$ for all $m$. The noisy observations are denoted by crosses.

Ridge regression (or weight decay)
The idea is to introduce regularisation, i.e. favour smooth regressors by penalising large $w$. The error function for ridge regression is defined as follows:
$$\tilde{E}(w) = E(w) + \frac{\alpha}{2} \| w \|^2,$$
where $\alpha \geq 0$ is the complexity parameter. The global optimum for $w$ is given by
$$w_{\text{R}} = (\Phi^\top \Phi + \alpha I_{M+1})^{-1} \Phi^\top t.$$
Hence, $\alpha$ converts an ill-conditioned problem into a well-conditioned one. The effective complexity of the model is reduced, avoiding overfitting.
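A minimal sketch of the closed-form ridge solution, reusing `Phi` and `t` from the previous snippet (illustrative code of my own, not from the lecture; the value of `alpha` is arbitrary).

```python
import numpy as np

def ridge_fit(Phi, t, alpha):
    """Closed-form ridge regression: w_R = (Phi^T Phi + alpha I)^{-1} Phi^T t."""
    d = Phi.shape[1]
    # Solve the regularised normal equations rather than inverting explicitly
    return np.linalg.solve(Phi.T @ Phi + alpha * np.eye(d), Phi.T @ t)

w_ridge = ridge_fit(Phi, t, alpha=0.03)   # alpha trades data fit against smoothness
```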

Example revisited
The target function is still given by $f(x) = \frac{\sin x}{x}$, $x \in [-10, 10]$. The amount of penalisation depends on the value of $\alpha$.
Figure: (a) The target sinc function (dashed line) and the least squares regression solution (solid line) for $\lambda_m = 1/36$ for all $m$. The noisy observations are denoted by crosses. (b) Penalised error as a function of $\alpha$.

Parameter shrinkage and the LASSO
More general penalised error functions are of the following form:
$$\arg\min_w E(w) \quad \text{subject to} \quad \sum_{m=0}^M |w_m|^q \leq \eta,$$
where $q$ defines the type of regulariser:
- Ridge regression corresponds to the specific choice $q = 2$.
- The LASSO corresponds to $q = 1$ and leads to sparse solutions for a sufficiently small $\eta$, as most weights are driven to zero.
Figure: Constraint regions in the $(w_1, w_2)$ plane with the unconstrained optimum $w^*$.
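For intuition, a small sketch (my own, not from the slides) comparing ridge and LASSO shrinkage with scikit-learn on the basis-function features built earlier. Note that scikit-learn parametrises the penalty strength as `alpha` in the unconstrained (Lagrangian) form rather than via the constraint level $\eta$; the values used here are arbitrary.

```python
from sklearn.linear_model import Ridge, Lasso

# Phi[:, 1:] drops the bias column; fit_intercept handles w_0 instead
ridge = Ridge(alpha=0.03).fit(Phi[:, 1:], t)
lasso = Lasso(alpha=0.01, max_iter=10000).fit(Phi[:, 1:], t)

print("nonzero ridge weights:", (abs(ridge.coef_) > 1e-6).sum())
print("nonzero lasso weights:", (abs(lasso.coef_) > 1e-6).sum())  # typically far fewer
```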

Why use a probabilistic formalism in regression problems?
- Well-known least squares and ridge regression are special cases.
- The approach provides additional insight into the solution by dealing with the uncertainty in a principled way.
- Various sources of uncertainty can be modelled.
- Confidence measures are associated with the predictions that are made.
- Intensive resampling techniques such as cross-validation and the bootstrap are avoided.
What do we lose?
- In practice, probabilistic and Bayesian inference require the computation of intractable integrals (marginalisation and normalisation/partition function).
- These integrals are either estimated by Markov chain Monte Carlo (MCMC) or by an approximate inference procedure (e.g. Laplace approximation, variational Bayes, expectation propagation).

A probabilistic view of least squares regression
We assume that the observations are noisy iid samples drawn from a (univariate) Gaussian:
$$t_n = y_n + \epsilon_n, \qquad \epsilon_n \sim \mathcal{N}(0, \sigma^2).$$
The likelihood of the observations is then given by a multivariate Gaussian:
$$(t_1, \ldots, t_N) \mid w, \sigma \sim \prod_{n=1}^N \mathcal{N}(y_n, \sigma^2) = \mathcal{N}(y, \sigma^2 I_N).$$
The maximum likelihood (ML) solution leads to a set of equations:
$$\frac{d \ln p(t \mid w, \sigma)}{d w} = 0 \;\Rightarrow\; w_{\text{ML}} = \Phi^\dagger t, \qquad \frac{d \ln p(t \mid w, \sigma)}{d \sigma^2} = 0 \;\Rightarrow\; \sigma^2_{\text{ML}} = \frac{1}{N} \| t - y \|^2.$$
The ML solution for $w$ is equal to the least squares solution. The ML estimate for the noise variance is equal to the residual error or unexplained variance.

The log-likelihood is given by
$$\ln p(t \mid w, \sigma) = -\frac{N}{2} \ln 2\pi - \frac{N}{2} \ln \sigma^2 - \frac{1}{2\sigma^2} \underbrace{(t - y)^\top (t - y)}_{= \| t - y \|^2}.$$
Hence, this leads to
$$\frac{d \ln p(t \mid w, \sigma)}{d w} = 0 \;\Rightarrow\; \frac{1}{\sigma^2} \Phi^\top (t - \Phi w) = 0,$$
$$\frac{d \ln p(t \mid w, \sigma)}{d \sigma^2} = 0 \;\Rightarrow\; -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} (t - y)^\top (t - y) = 0.$$
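Continuing the illustrative NumPy sketch from earlier (names are my own), the ML estimates follow directly from these equations.

```python
# ML weights coincide with the least squares solution
w_ml = np.linalg.pinv(Phi) @ t
# ML noise variance is the mean squared residual
sigma2_ml = np.mean((t - Phi @ w_ml) ** 2)
```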

A probabilistic view of ridge regression
How to avoid overfitting? We model the uncertainty on the value of the parameters by imposing some prior distribution on them:
$$w \sim \mathcal{N}(0, A^{-1}), \qquad A \equiv \mathrm{diag}\{\alpha_0, \ldots, \alpha_M\}.$$
Figure: Zero-mean Gaussian prior with diagonal covariance matrix.

A probabilistic view of ridge regression (continued)
Applying Bayes' rule leads to the posterior distribution of the parameters:
$$p(w \mid t) \propto p(t \mid w) \, p(w).$$
The maximum a posteriori (MAP) solution is given by
$$\frac{d \ln p(w \mid t)}{d w} = 0 \;\Rightarrow\; w_{\text{MAP}} = \sigma^{-2} (\sigma^{-2} \Phi^\top \Phi + A)^{-1} \Phi^\top t,$$
where the noise variance $\sigma^2$ is assumed to be known.
In the particular case where $\alpha_m = \alpha_0$ for all $m$ (and $\sigma$ is fixed), the MAP solution is equivalent to ridge regression, since we have $\alpha = \alpha_0 \sigma^2$.
In order to infer the amount of noise, one has to use the Expectation-Maximisation algorithm, as $w$ and $\sigma^2$ are coupled (see later).

The log-posterior is given by
$$\ln p(w \mid t) = -\frac{N}{2} \ln 2\pi - \frac{N}{2} \ln \sigma^2 - \frac{1}{2\sigma^2} (t - y)^\top (t - y) - \frac{M+1}{2} \ln 2\pi + \frac{1}{2} \ln |A| - \frac{1}{2} w^\top A w - \ln Z.$$
Hence, this leads to
$$\frac{d \ln p(w \mid t)}{d w} = 0 \;\Rightarrow\; \frac{1}{\sigma^2} \Phi^\top (t - \Phi w) - A w = 0 \;\Rightarrow\; \frac{1}{\sigma^2} \Phi^\top t = \frac{1}{\sigma^2} \Phi^\top \Phi w + A w.$$
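A short illustrative sketch of the MAP estimate, continuing the same toy example; the isotropic prior precision `alpha0` and the use of a known `sigma2` are assumptions made for the demo.

```python
alpha0 = 1.0          # isotropic prior precision, alpha_m = alpha0 for all m
sigma2 = 0.1 ** 2     # noise variance, assumed known here
A = alpha0 * np.eye(Phi.shape[1])

# w_MAP = sigma^{-2} (sigma^{-2} Phi^T Phi + A)^{-1} Phi^T t
w_map = np.linalg.solve(Phi.T @ Phi / sigma2 + A, Phi.T @ t / sigma2)
# Equivalent to ridge regression with alpha = alpha0 * sigma2
```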

Is the MAP solution a good solution?
- Overfitting and the model selection problem (i.e. the choice of the number of prototypes) are solved by limiting the effective model complexity.
- It might be difficult to deal with (very) large data sets.
- The better (smoother) solution comes at the cost of additional hyperparameters, which can only be set by resampling.
- Like ML, MAP makes predictions based on point estimates; the uncertainty on the parameters is not taken into account when making predictions:
$$p(t_{\text{new}} \mid t) \approx p(t_{\text{new}} \mid w_{\text{MAP}}) = \mathcal{N}(t_{\text{new}} \mid y(x_{\text{new}}, w_{\text{MAP}}), \sigma^2).$$
- The MAP solution depends on the parametrisation of the prior.

Type II ML (or evidence maximisation)
We view some parameters as latent, i.e. unobserved, random variables. In order to deal properly with the uncertainty, we want to integrate them out before estimating the remaining parameters by ML or making predictions.
Assume the random variable $x$ is observed and the variable $z$ is unobserved. Let us denote the (remaining) parameters by $\theta$.
1. The full predictive distribution is approximated by
$$p(x_{\text{new}} \mid x) \approx \int p(x_{\text{new}} \mid z, \theta) \, p(z \mid x, \theta) \, dz,$$
where $p(z \mid x, \theta) = \frac{p(x \mid z, \theta) \, p(z \mid \theta)}{p(x \mid \theta)}$ is the posterior.
2. The parameters are estimated by maximising (e.g. by gradient descent) the marginal likelihood (or evidence):
$$\theta_{\text{ML2}} = \arg\max_\theta p(x \mid \theta), \qquad \text{where } p(x \mid \theta) = \int p(x, z \mid \theta) \, dz.$$
Note: Type II ML might be more sensitive to model mis-specification than resampling techniques based on the pseudo-likelihood (see Section 4.8 in Wahba, 1990).

The Expectation-Maximisation (EM) algorithm in a nutshell
The EM algorithm maximises a lower bound on the log-marginal likelihood in the presence of latent variables. Using Jensen's inequality, we get, for a distribution $q(z)$ within a tractable family:
$$\ln p(x \mid \theta) = \ln \int p(x, z \mid \theta) \, dz \geq \int q(z) \ln \frac{p(x, z \mid \theta)}{q(z)} \, dz \equiv \mathcal{F}(q, \theta).$$
The variational free energy $\mathcal{F}(q, \theta)$ can be decomposed in two different ways:
$$\mathcal{F}(q, \theta) = \ln p(x \mid \theta) - \mathrm{KL}[q(z) \,\|\, p(z \mid x, \theta)], \qquad \text{(E step)}$$
$$\mathcal{F}(q, \theta) = \langle \ln p(x, z \mid \theta) \rangle_{q(z)} + H[q(z)]. \qquad \text{(M step)}$$
The EM algorithm maximises the lower bound by alternately minimising the KL for fixed parameters (E step) and maximising the expected complete log-likelihood for a fixed $q(z)$ (M step). By construction, the EM algorithm ensures a monotonic increase of the bound.

EM for linear regressors
We view $w$ as a latent (unobserved) variable on which an isotropic Gaussian prior is imposed:
$$w \mid \alpha \sim \mathcal{N}(0, \alpha^{-1} I_{M+1}).$$
The goal is to learn the noise variance $\sigma^2$ and the scale parameter $\alpha$.
The E step is exact, as the posterior on $w$ is tractable (conjugate prior):
$$w \mid t, \sigma, \alpha \sim \mathcal{N}(\mu_w, \Sigma_w), \qquad \mu_w \equiv \sigma^{-2} \Sigma_w \Phi^\top t, \qquad \Sigma_w \equiv (\sigma^{-2} \Phi^\top \Phi + \alpha I_{M+1})^{-1}.$$
For a Gaussian distribution the mode is equal to the mean. Hence, the estimate for $w$ is equal to the one obtained before, i.e. $\bar{w} = \mu_w = w_{\text{MAP}}$.
The M step is given by
$$\sigma^2_{\text{ML2}} \leftarrow \arg\max_{\sigma^2} \langle \ln p(t, w \mid \sigma, \alpha) \rangle = \frac{\| t - \Phi \mu_w \|^2 + \mathrm{tr}\{\Phi \Sigma_w \Phi^\top\}}{N},$$
$$\alpha_{\text{ML2}} \leftarrow \arg\max_{\alpha} \langle \ln p(t, w \mid \sigma, \alpha) \rangle = \frac{M + 1}{\mu_w^\top \mu_w + \mathrm{tr}\{\Sigma_w\}}.$$

The posterior is given by (completing the square)
$$p(w \mid t, \sigma, \alpha) \propto e^{-\frac{1}{2\sigma^2} (t - \Phi w)^\top (t - \Phi w)} \, e^{-\frac{\alpha}{2} w^\top w} \propto e^{-\frac{1}{2} \left( w^\top (\sigma^{-2} \Phi^\top \Phi + \alpha I_{M+1}) w - 2 \sigma^{-2} t^\top \Phi w \right)} \propto e^{-\frac{1}{2} \left( w^\top \Sigma_w^{-1} w - 2 \mu_w^\top \Sigma_w^{-1} w \right)} \propto e^{-\frac{1}{2} (w - \mu_w)^\top \Sigma_w^{-1} (w - \mu_w)}.$$
The expected complete log-likelihood is given by
$$\langle \ln p(t, w \mid \sigma, \alpha) \rangle = -\frac{N}{2} \ln 2\pi - \frac{N}{2} \ln \sigma^2 - \frac{1}{2\sigma^2} \langle (t - y)^\top (t - y) \rangle - \frac{M+1}{2} \ln 2\pi + \frac{M+1}{2} \ln \alpha - \frac{\alpha}{2} \langle w^\top w \rangle$$
$$= -\frac{N}{2} \ln 2\pi - \frac{N}{2} \ln \sigma^2 - \frac{1}{2\sigma^2} (t - \Phi \langle w \rangle)^\top (t - \Phi \langle w \rangle) - \frac{1}{2\sigma^2} \mathrm{tr}\{\Phi \Sigma_w \Phi^\top\} - \frac{M+1}{2} \ln 2\pi + \frac{M+1}{2} \ln \alpha - \frac{\alpha}{2} \langle w^\top w \rangle.$$
Taking the derivative with respect to $\sigma^2$ and $\alpha$, and equating to zero, leads to the desired updates.
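Putting the E and M steps together, here is an illustrative EM loop for the same toy problem (a sketch under my own naming conventions, not code from the lecture).

```python
import numpy as np

def em_linear_regression(Phi, t, n_iter=100):
    """Type II ML via EM: alternately update the posterior over w (E step)
    and the point estimates of sigma^2 and alpha (M step)."""
    N, D = Phi.shape
    sigma2, alpha = 1.0, 1.0                      # crude initialisation
    for _ in range(n_iter):
        # E step: Gaussian posterior over w
        Sigma_w = np.linalg.inv(Phi.T @ Phi / sigma2 + alpha * np.eye(D))
        mu_w = Sigma_w @ Phi.T @ t / sigma2
        # M step: maximise the expected complete log-likelihood
        sigma2 = (np.sum((t - Phi @ mu_w) ** 2)
                  + np.trace(Phi @ Sigma_w @ Phi.T)) / N
        alpha = D / (mu_w @ mu_w + np.trace(Sigma_w))
    return mu_w, Sigma_w, sigma2, alpha

mu_w, Sigma_w, sigma2_ml2, alpha_ml2 = em_linear_regression(Phi, t)
```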

Predictive distributions
We are not only interested in the optimal predictions, but also in the best approximation of the full predictive distribution. The predictive distributions for the ML and the type II ML solutions are given by
$$p(t_{\text{new}} \mid t) \approx p(t_{\text{new}} \mid w_{\text{ML}}, \sigma_{\text{ML}}) = \mathcal{N}(w_{\text{ML}}^\top \phi(x_{\text{new}}), \sigma^2_{\text{ML}}),$$
$$p(t_{\text{new}} \mid t) \approx p(t_{\text{new}} \mid t, \sigma_{\text{ML2}}, \alpha_{\text{ML2}}) = \mathcal{N}(\mu_w^\top \phi(x_{\text{new}}), \sigma^2_{\text{ML2}} + \phi^\top(x_{\text{new}}) \Sigma_w \phi(x_{\text{new}})).$$
In the case of type II ML, the predictive variance has two components: the noise on the data and the uncertainty associated with the parameters.
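An illustrative computation of the type II ML predictive mean and variance on a grid of test inputs, reusing the helpers and quantities defined in the earlier sketches (`design_matrix`, `x`, `mu_w`, `Sigma_w`, `sigma2_ml2`).

```python
x_test = np.linspace(-10, 10, 200)
Phi_test = design_matrix(x_test, centres=x, lam=1.0 / 36.0)

pred_mean = Phi_test @ mu_w
# Predictive variance = noise variance + parameter uncertainty, per test point
pred_var = sigma2_ml2 + np.einsum('nd,de,ne->n', Phi_test, Sigma_w, Phi_test)
pred_std = np.sqrt(pred_var)   # error bars, e.g. mean +/- 3 * pred_std
```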

Example revisited
We compare the solutions on the sinc example with $N = 25$, $\sigma = 0.1$ and $\lambda_m = 1/9$ for all $m$. We show the mean and the error bars ($\pm 3$ std).
Figure: (a) ML solution: $\sigma_{\text{ML}} = 0.05$. (b) Type II ML solution: $\sigma_{\text{ML2}} = 0.08$ and $\alpha_{\text{ML2}} = 1.15$. (Target function: dashed; observations: crosses.)

" Is the type II ML solution a good solution? The solution avoids overfitting by taking some uncertainty into account. Predictions provide error bars by integrating out w. The lower bound is moderately suitable for selecting the kernel width: 0.1 RMSE!F/c 0.095 0.09 0.1 0.085 0.08 0.075 0.07 0.05 0.065 0.06 0.055 0 10!1 10 0 10 1! 0.05 10!1 10 0 10 1! (a) (b) Figure: (a) Root mean square error (RMSE) and normalised lower bound ( F/c) versus the kernel width λ. (b) Noise standard deviation versus λ. Will we gain something if we take the uncertainty of the hyperparameters into acount and be agnostic at a higher level?

Variational Bayes (VEM) in a nutshell
Variational Bayes generalises EM by viewing all parameters as random variables. Using Jensen's inequality, we obtain a lower bound on the log-marginal likelihood:
$$\ln p(x) = \ln \iint p(x, z, \theta) \, dz \, d\theta \geq \iint q(z) q(\theta) \ln \frac{p(x, z, \theta)}{q(z) q(\theta)} \, dz \, d\theta = \ln p(x) - \mathrm{KL}[q(z) q(\theta) \,\|\, p(z, \theta \mid x)] \equiv \mathcal{F}(q(z), q(\theta)).$$
For tractability it is assumed that the variational posterior factorises (at least) between the latent variables and the parameters given the observations. When a fully factorised posterior is assumed, one talks about mean field.
Assuming a factorised form corresponds to neglecting the correlations between dependent variables and thus prevents transmitting uncertainty. Considering $\mathrm{KL}[q \,\|\, p]$ leads to a more compact posterior, as it is zero forcing.

Variational Bayes in a nutshell (continued)
The variational bound can be maximised by alternating between the following updates:
$$q(z) \propto e^{\langle \ln p(x, z \mid \theta) \rangle_{q(\theta)}}, \qquad \text{(VE step)}$$
$$q(\theta) \propto e^{\langle \ln p(x, z \mid \theta) \rangle_{q(z)}} \, p(\theta). \qquad \text{(VM step)}$$
In practice, the priors are chosen to be conjugate to the likelihood, such that updating the posterior simply consists in updating the hyperparameters. By construction, VEM ensures a monotonic increase of the bound.
The variational lower bound can be evaluated using
$$\mathcal{F}(q(z), q(\theta)) = \langle \ln p(x, z, \theta) \rangle_{q(z) q(\theta)} + H[q(z)] + H[q(\theta)].$$
The predictive distribution is approximated as follows:
$$p(x_{\text{new}} \mid x) \approx \iint p(x_{\text{new}} \mid z, \theta) \, q(z) \, q(\theta) \, dz \, d\theta.$$

First, we show that the VE step maximises the lower bound (i.e. minimises the free energy) for fixed $q(\theta)$:
$$-\mathcal{F} = -\int q(z) \langle \ln p(x, z \mid \theta) \rangle_{q(\theta)} \, dz - H[q(z)] + \text{const.} = -\int q(z) \ln \frac{e^{\langle \ln p(x, z \mid \theta) \rangle_{q(\theta)}}}{q(z)} \, dz + \text{const.} = \mathrm{KL}\Big[ q(z) \,\Big\|\, \tfrac{1}{Z} e^{\langle \ln p(x, z \mid \theta) \rangle_{q(\theta)}} \Big] + \text{const.}$$
Second, we show that the VM step maximises the lower bound for fixed $q(z)$:
$$-\mathcal{F} = -\int q(\theta) \langle \ln p(x, z, \theta) \rangle_{q(z)} \, d\theta - H[q(\theta)] + \text{const.} = -\int q(\theta) \ln \frac{e^{\langle \ln p(x, z, \theta) \rangle_{q(z)}}}{q(\theta)} \, d\theta + \text{const.} = \mathrm{KL}\Big[ q(\theta) \,\Big\|\, \tfrac{1}{Z} e^{\langle \ln p(x, z \mid \theta) \rangle_{q(z)}} p(\theta) \Big] + \text{const.}$$

Bayesian linear models for regression
For simplicity, the prior on the weights is assumed to be isotropic Gaussian:
$$w \mid \alpha \sim \mathcal{N}(0, \alpha^{-1} I_{M+1}).$$
To force non-negative values, we impose Gamma priors on $\alpha$ and $\tau \equiv \sigma^{-2}$:
$$\alpha \sim \mathcal{G}(a_0, b_0), \qquad \tau \sim \mathcal{G}(c_0, d_0).$$
The Gamma is conjugate to the Gaussian.
Figure: Gamma distribution for various values of the scale and the shape parameter.

Variational inference for Bayesian linear regressors
The complete likelihood is given by
$$p(t, w, \tau, \alpha) = p(t \mid w, \tau) \, p(w \mid \alpha) \, p(\alpha) \, p(\tau).$$
It is assumed that the variational posterior fully factorises, such that $q(w, \alpha, \tau) = q(w) q(\alpha) q(\tau)$. Applying the variational framework leads to
$$q(w) = \mathcal{N}(\tilde{\mu}_w, \tilde{\Sigma}_w), \qquad q(\alpha) = \mathcal{G}(a, b), \qquad q(\tau) = \mathcal{G}(c, d),$$
where the variational quantities are defined as
$$\tilde{\mu}_w \equiv \langle \tau \rangle \tilde{\Sigma}_w \Phi^\top t, \qquad \tilde{\Sigma}_w \equiv (\langle \tau \rangle \Phi^\top \Phi + \langle \alpha \rangle I_{M+1})^{-1},$$
$$a \equiv \frac{M+1}{2} + a_0, \qquad b \equiv \frac{\tilde{\mu}_w^\top \tilde{\mu}_w + \mathrm{tr}\{\tilde{\Sigma}_w\}}{2} + b_0, \qquad c \equiv \frac{N}{2} + c_0, \qquad d \equiv \frac{\| t - \Phi \tilde{\mu}_w \|^2 + \mathrm{tr}\{\Phi \tilde{\Sigma}_w \Phi^\top\}}{2} + d_0,$$
with $\langle \tau \rangle = c/d$ and $\langle \alpha \rangle = a/b$.

The variational posterior for $w$ is given by
$$q(w) \propto e^{\langle \ln p(t, w \mid \alpha, \tau) \rangle_{q(\alpha) q(\tau)}} \propto e^{-\frac{1}{2} (w - \tilde{\mu}_w)^\top \tilde{\Sigma}_w^{-1} (w - \tilde{\mu}_w)}.$$
The variational posterior for $\alpha$ is given by
$$q(\alpha) \propto e^{\langle \ln p(t, w \mid \alpha, \tau) \rangle_{q(w) q(\tau)}} \, p(\alpha) \propto e^{(a - 1) \ln \alpha - b \alpha},$$
where $a \equiv \frac{M+1}{2} + a_0$ and $b \equiv \frac{\langle w^\top w \rangle}{2} + b_0$.
The variational posterior for $\tau$ is given by
$$q(\tau) \propto e^{\langle \ln p(t, w \mid \alpha, \tau) \rangle_{q(w) q(\alpha)}} \, p(\tau) \propto e^{(c - 1) \ln \tau - d \tau},$$
where $c \equiv \frac{N}{2} + c_0$ and $d \equiv \frac{\langle (t - \Phi w)^\top (t - \Phi w) \rangle}{2} + d_0$.

Variational vs type II ML
In practice, the variational approach consists in updating the parameters of the posteriors in turn. The updates for the posterior mean and the posterior covariance of $w$ have a similar form as before. The estimates of $\sigma^2$ and $\alpha$ are replaced by their expectations:
$$\langle \tau \rangle^{-1} = \frac{\| t - \Phi \tilde{\mu}_w \|^2 + \mathrm{tr}\{\Phi \tilde{\Sigma}_w \Phi^\top\} + 2 d_0}{N + 2 c_0}, \qquad \langle \alpha \rangle = \frac{M + 1 + 2 a_0}{\tilde{\mu}_w^\top \tilde{\mu}_w + \mathrm{tr}\{\tilde{\Sigma}_w\} + 2 b_0}.$$
For flat priors, these estimates will give very similar results as type II ML. The predictive distribution is also very similar in form:
$$p(t_{\text{new}} \mid t) \approx \int p(t_{\text{new}} \mid w, \langle \tau \rangle^{-1}) \, q(w) \, dw = \mathcal{N}(\tilde{\mu}_w^\top \phi(x_{\text{new}}), \langle \tau \rangle^{-1} + \phi^\top(x_{\text{new}}) \tilde{\Sigma}_w \phi(x_{\text{new}})).$$
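The full variational loop for this model is only a few lines; below is an illustrative sketch (my own naming; the near-flat priors chosen via small hyperparameter values are an assumption for the demo).

```python
import numpy as np

def vb_linear_regression(Phi, t, a0=1e-6, b0=1e-6, c0=1e-6, d0=1e-6, n_iter=100):
    """Mean-field variational Bayes for the linear model with Gamma priors
    on the weight precision alpha and the noise precision tau."""
    N, D = Phi.shape
    E_alpha, E_tau = 1.0, 1.0                     # initial expectations
    for _ in range(n_iter):
        # q(w): Gaussian
        Sigma_w = np.linalg.inv(E_tau * Phi.T @ Phi + E_alpha * np.eye(D))
        mu_w = E_tau * Sigma_w @ Phi.T @ t
        # q(alpha): Gamma(a, b)
        a = 0.5 * D + a0
        b = 0.5 * (mu_w @ mu_w + np.trace(Sigma_w)) + b0
        # q(tau): Gamma(c, d)
        c = 0.5 * N + c0
        d = 0.5 * (np.sum((t - Phi @ mu_w) ** 2)
                   + np.trace(Phi @ Sigma_w @ Phi.T)) + d0
        E_alpha, E_tau = a / b, c / d
    return mu_w, Sigma_w, E_alpha, E_tau
```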

" Variational vs type II ML (continued) The variational approach does not provide a better selection criterion for the kernel width, but leads to a better estimate of the observation noise! 2 1.8 1.6 1.4 RMSE ML2!F ML2 RMSE VAR!F VAR 0.105 0.1 0.095 ML2 VAR 1.2 0.09 1 0.085 0.8 0.08 0.6 0.075 0.4 0.07 0.2 0.065 0 10!1 10 0 10 1! 0.06 10!1 10 0 10 1! (a) (b) Figure: Comparison of the quality of the type II ML solution and the variational solution. (a) Normalised root mean square error (RMSE) and normalised lower bound ( F) versus the kernel width λ. (b) Noise standard deviation σ versus λ. Note that the true value of σ is 0.1.

Selecting the number of basis functions
Why seek sparse solutions? Placing a basis function on each training point ($M = N$) rapidly becomes infeasible in practice:
- Training becomes slow (cf. the matrix $\Sigma_w \in \mathbb{R}^{(M+1) \times (M+1)}$ to invert).
- Even if training is done off-line, making predictions may be too expensive.
- Memory usage is excessive, as the design matrix grows quadratically with $M$.
Sparse solutions lead to better generalisation and are fast (e.g. SVM).
Simple heuristics based on resampling:
- Subset selection (randomly or based on an information-theoretic criterion).
- Vector quantisation or clustering.
- ...
These approaches are unsupervised and thus suboptimal.

Sparse Bayesian linear models for regression
A systematic approach is to build a hierarchical model such that the effective prior on $w$ is sparsity inducing, i.e. most weights are driven to zero:
- Sparse Bayesian learning and the relevance vector machine (Tipping, 2001)
- Adaptive sparseness for supervised learning (Figueiredo, 2003)
- Comparing the effects of different weight distributions on finding sparse representations (Wipf and Rao, 2006)
- Bayesian adaptive lassos with non-convex penalization (Griffin and Brown, 2007)
- ...
These methods lead to very sparse solutions (more sparse than standard SVMs).

Relevance vector machines
Consider a Gaussian prior with a different scale parameter $\alpha_m$ for each $w_m$, along with a different hyperprior for each:
$$w \mid \alpha_0, \ldots, \alpha_M \sim \mathcal{N}(0, A^{-1}) = \prod_{m=0}^M \mathcal{N}(0, \alpha_m^{-1}), \qquad (\alpha_0, \ldots, \alpha_M) \sim \prod_{m=0}^M \mathcal{G}(a_m, b_m).$$
Integrating out $\alpha_m$ leads to an effective prior on $w_m$ which corresponds to a Student-t:
$$p(w_m) = \int \mathcal{N}(0, \alpha_m^{-1}) \, \mathcal{G}(a_m, b_m) \, d\alpha_m = \mathcal{S}(0, a_m/b_m, 2 a_m), \qquad \text{for all } m.$$
The Student-t distribution approximates the Laplace distribution (or double exponential), which corresponds to $L_1$-regularisation (LASSO). The Student-t prior is peaked around zero and has fat tails.
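A quick numerical sanity check of this marginalisation (an illustration I added, using SciPy; the Student-t here has precision $a_m/b_m$, i.e. scale $\sqrt{b_m/a_m}$, and $2a_m$ degrees of freedom, and the hyperparameter values are arbitrary).

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

a_m, b_m, w = 2.0, 1.0, 0.7   # example hyperparameters and evaluation point

# Marginal p(w) = int N(w | 0, 1/alpha) Gamma(alpha | shape=a_m, rate=b_m) d alpha
integrand = lambda alpha: (stats.norm.pdf(w, loc=0.0, scale=alpha ** -0.5)
                           * stats.gamma.pdf(alpha, a_m, scale=1.0 / b_m))
numeric, _ = quad(integrand, 0, np.inf)

# Closed form: Student-t with 2*a_m degrees of freedom and scale sqrt(b_m / a_m)
closed = stats.t.pdf(w, df=2 * a_m, loc=0.0, scale=np.sqrt(b_m / a_m))
print(numeric, closed)   # the two values should agree
```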

Sparsity inducing priors
Figure: Two sparsity inducing priors (Laplace and Student-t) compared to the Gaussian.

References
Christopher M. Bishop: Pattern Recognition and Machine Learning. Springer, 2006.
Trevor Hastie, Robert Tibshirani and Jerome Friedman: The Elements of Statistical Learning. Springer, 2001.
John Shawe-Taylor and Nello Cristianini: Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.