Bayesian Linear Regression
Sargur Srihari
srihari@cedar.buffalo.edu
Topics in Bayesian Regression
- Recall Maximum Likelihood Linear Regression
- Parameter Distribution
- Predictive Distribution
- Equivalent Kernel
Linear Regression: Model Complexity M

Polynomial regression:
  $y(x, w) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$

[Figure: best fits (red) for M = 0, 1, 3, 9 with N = 10 data points. M = 0 and
M = 1 are poor representations of sin(2πx), M = 3 is the best fit to sin(2πx),
and M = 9 over-fits.]
Maximum Likelihood Regression

Input vector x, basis functions {φ_1(x),.., φ_M(x)}:
  $y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)$

Radial basis functions:
  $\phi_j(x) = \exp\left\{ -\tfrac{1}{2} (x - \mu_j)^T \Sigma^{-1} (x - \mu_j) \right\}$

Maximum likelihood objective with examples {x_1,.., x_N} (equivalent to the
mean squared error objective):
  $E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2$

The closed-form ML solution is
  $w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{t}$
where Φ is the design matrix
  $\Phi = \begin{pmatrix} \phi_0(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \ddots & \vdots \\ \phi_0(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}$
and $(\Phi^T \Phi)^{-1} \Phi^T$ is the Moore-Penrose pseudoinverse of Φ.

Regularized MSE with examples (λ is the regularization coefficient):
  $E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 + \frac{\lambda}{2} w^T w$
with regularized solution
  $w = (\lambda I + \Phi^T \Phi)^{-1} \Phi^T \mathbf{t}$

Gradient descent (one example at a time): $w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E$ with
  $\nabla E = -\{ t_n - w^{(\tau)T} \phi(x_n) \} \phi(x_n)$
and, in the regularized version,
  $\nabla E = -\{ t_n - w^{(\tau)T} \phi(x_n) \} \phi(x_n) + \lambda w^{(\tau)}$
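A minimal sketch of the two closed-form solutions above, using Gaussian radial basis functions. The data, basis centers, width, and regularization coefficient are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 9
x = rng.uniform(-1, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)   # noisy targets

mu = np.linspace(-1, 1, M)                          # assumed basis centers
s = 0.2                                             # assumed basis width

def design_matrix(x):
    # Phi[n, j] = phi_j(x_n) = exp(-(x_n - mu_j)^2 / (2 s^2))
    return np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * s ** 2))

Phi = design_matrix(x)

# Closed-form ML solution: w_ML = (Phi^T Phi)^{-1} Phi^T t (pseudoinverse)
w_ml = np.linalg.pinv(Phi) @ t

# Regularized solution: w = (lambda I + Phi^T Phi)^{-1} Phi^T t
lam = 1e-3
w_reg = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```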
Shortcomings of MLE

- The M.L.E. of the parameters w does not address the model complexity M
  (how many basis functions to use).
- Model complexity is effectively controlled by data size: more data allows a
  better fit without overfitting.
- Regularization also controls overfitting (λ controls the effect):
    $E(w) = E_D(w) + \lambda E_W(w)$
  where
    $E_D(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2$ and $E_W(w) = \frac{1}{2} w^T w$
- But M and the choice of the φ_j are still important. M can be determined
  using a holdout set, but that is wasteful of data.
- Model complexity and over-fitting are better handled using the Bayesian
  approach.
Bayesian Linear Regression

Using Bayes rule, the posterior is proportional to likelihood × prior:
  $p(w | \mathbf{t}) = \frac{p(\mathbf{t} | w) \, p(w)}{p(\mathbf{t})}$
where
- p(t|w) is the likelihood of the observed data
- p(w) is the prior distribution over the parameters

We will look at:
- a normal distribution for the prior p(w)
- a likelihood p(t|w) that is a product of Gaussians based on the noise model
and conclude that the posterior is also Gaussian.
Gaussian Prior over Parameters

Assume a multivariate Gaussian prior for w (which has components w_0,.., w_M):
  $p(w) = \mathcal{N}(w | m_0, S_0)$
with mean m_0 and covariance matrix S_0.

If we choose $S_0 = \alpha^{-1} I$, the variances of the weights are all equal
to $\alpha^{-1}$ and the covariances are zero.

[Figure: contours of p(w) in (w_0, w_1) space with zero mean (m_0 = 0) and
isotropic covariance (same variance for each weight).]
Likelihood of Data is Gaussian

Assume a noise precision parameter β:
  $t = y(x, w) + \epsilon$
where ε is defined probabilistically as Gaussian noise, so that
  $p(t | x, w, \beta) = \mathcal{N}(t | y(x, w), \beta^{-1})$
Note that the output t is a scalar.

The likelihood of t = {t_1,.., t_N} is then
  $p(\mathbf{t} | X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n | w^T \phi(x_n), \beta^{-1})$
This is the probability of the target data t given the parameters w and the
input X = {x_1,.., x_N}.

Due to the Gaussian noise, the likelihood p(t|w) is also a Gaussian.
Posterior Distribution is also Gaussian

- Prior: $p(w) = \mathcal{N}(w | m_0, S_0)$, i.e., it is Gaussian.
- The likelihood comes from Gaussian noise:
    $p(\mathbf{t} | X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n | w^T \phi(x_n), \beta^{-1})$
- It follows that the posterior p(w|t) is also Gaussian.

Proof: use a standard result for Gaussians. If the marginal p(w) and the
conditional p(t|w) have Gaussian forms, then the marginal p(t) and the
conditional p(w|t) are also Gaussian:
  Let $p(w) = \mathcal{N}(w | \mu, \Lambda^{-1})$ and $p(t | w) = \mathcal{N}(t | Aw + b, L^{-1})$.
  Then the marginal is $p(t) = \mathcal{N}(t | A\mu + b, L^{-1} + A \Lambda^{-1} A^T)$
  and the conditional is $p(w | t) = \mathcal{N}(w | \Sigma\{ A^T L (t - b) + \Lambda \mu \}, \Sigma)$
  where $\Sigma = (\Lambda + A^T L A)^{-1}$.
Exact Form of the Posterior Distribution

We have $p(w) = \mathcal{N}(w | m_0, S_0)$ and
  $p(\mathbf{t} | X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n | w^T \phi(x_n), \beta^{-1})$

The posterior is also Gaussian, written directly as
  $p(w | \mathbf{t}) = \mathcal{N}(w | m_N, S_N)$
where the mean and covariance of the posterior are given by
  $m_N = S_N (S_0^{-1} m_0 + \beta \Phi^T \mathbf{t})$
  $S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi$
and Φ is the design matrix with entries $\Phi_{nj} = \phi_j(x_n)$.

[Figure: prior $p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1} I)$ and posterior in weight space
(w_0, w_1) for scalar input x and y(x, w) = w_0 + w_1 x.]
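A sketch of the posterior update for the straight-line model, specializing the formulas above to the zero-mean isotropic prior (m_0 = 0, S_0 = α⁻¹I); the data and the values of α and β follow the running example, while the seed and N are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 2.0, 25.0                  # prior precision, noise precision
N = 20
x = rng.uniform(-1, 1, N)
t = -0.3 + 0.5 * x + rng.normal(0, 0.2, N)

Phi = np.column_stack([np.ones(N), x])   # phi(x) = (1, x): straight-line model

# S_N^{-1} = alpha*I + beta*Phi^T Phi,  m_N = beta * S_N Phi^T t
S_N_inv = alpha * np.eye(2) + beta * Phi.T @ Phi
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ t
print("posterior mean:", m_N)            # approaches (-0.3, 0.5) as N grows
```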
Properties of the Posterior

1. Since the posterior $p(w | \mathbf{t}) = \mathcal{N}(w | m_N, S_N)$ is Gaussian, its mode
   coincides with its mean. Thus the maximum posterior weight vector is
   $w_{MAP} = m_N$.
2. For an infinitely broad prior $S_0 = \alpha^{-1} I$, i.e., precision α → 0,
   the mean m_N reduces to the maximum likelihood value, i.e., the mean is the
   solution vector $w_{ML} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{t}$.
3. If N = 0, the posterior reverts to the prior.
4. If data points arrive sequentially, the posterior at any stage acts as the
   prior distribution for the subsequent data points.
Choose a Simple Gaussian Prior p(w)

For y(x, w) = w_0 + w_1 x, choose a zero-mean (m_0 = 0) isotropic
(equal-variance) Gaussian prior with a single precision parameter α:
  $p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1} I)$

The corresponding posterior distribution is $p(w | \mathbf{t}) = \mathcal{N}(w | m_N, S_N)$ where
  $m_N = \beta S_N \Phi^T \mathbf{t}$ and $S_N^{-1} = \alpha I + \beta \Phi^T \Phi$

Note: β is the noise precision and α is the precision of the parameters w in
the prior.

[Figure: prior contours in (w_0, w_1) space; with infinite samples the
posterior collapses to a point estimate.]
Equivalence to MLE with Regularization

Since we have
  $p(\mathbf{t} | X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n | w^T \phi(x_n), \beta^{-1})$ and $p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1} I)$
the log of the posterior is
  $\ln p(w | \mathbf{t}) = -\frac{\beta}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 - \frac{\alpha}{2} w^T w + \text{const}$

Thus maximization of the posterior is equivalent to minimization of the
sum-of-squares error
  $E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 + \frac{\lambda}{2} w^T w$
with the addition of the quadratic regularization term $w^T w$, with λ = α/β.
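A quick numerical check of this equivalence: the posterior mean m_N equals the regularized least-squares solution with λ = α/β. The polynomial basis, data, and constants are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 25.0
N, M = 30, 4
x = rng.uniform(-1, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)
Phi = np.vander(x, M, increasing=True)     # polynomial basis 1, x, x^2, x^3

# Bayesian posterior mean: m_N = beta * S_N Phi^T t
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

# Regularized (ridge) solution with lambda = alpha / beta
lam = alpha / beta
w_ridge = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

assert np.allclose(m_N, w_ridge)           # identical up to round-off
```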
Bayesian Linear Regression Example (Straight Line Fit)

- Single input variable x, single target variable t.
- Goal is to fit the linear model y(x, w) = w_0 + w_1 x.
- The goal of linear regression is to recover w = [w_0, w_1] given the samples.

[Figure: scatter plot of samples (x, t).]
Data Generation

Synthetic data are generated from f(x, w) = w_0 + w_1 x with parameter values
w_0 = -0.3 and w_1 = 0.5:
- First choose x_n from U(-1, 1), then evaluate f(x_n, w).
- Add Gaussian noise with standard deviation 0.2 to get the target t_n.
- Noise precision parameter: β = (1/0.2)² = 25.

For the prior over w we choose α = 2:
  $p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1} I)$
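A sketch of this data-generation procedure with the stated parameters (w_0 = -0.3, w_1 = 0.5, noise std 0.2, α = 2); the seed and sample count are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
w_true = np.array([-0.3, 0.5])
noise_std = 0.2
beta = 1.0 / noise_std ** 2          # = 25, noise precision
alpha = 2.0                          # prior precision

N = 20
x = rng.uniform(-1, 1, N)            # x_n ~ U(-1, 1)
t = w_true[0] + w_true[1] * x + rng.normal(0, noise_std, N)
```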
Sampling p(w) and p(w|t)

Each sample of w represents a straight line y(x, w) = w_0 + w_1 x in data
space; the distribution is modified as examples arrive.

[Figure: contours of the distribution and six sampled lines.
Left: p(w), with no examples. Right: p(w|t), with two examples.]

Goal of Bayesian linear regression: determine p(w|t).
Sequential Bayesian Learning

Since there are only two parameters, we can plot the prior and posterior
distributions in parameter space and look at the sequential update of the
posterior, one data point at a time. A sketch of the update loop follows the
figure description.

[Figure: rows show the state before any data point is observed, after the
first data point (x_1, t_1), after the second data point, and after twenty
data points. Columns show:
- the likelihood p(t|x, w) of the latest point alone, as a function of w; the
  band represents values of (w_0, w_1) corresponding to straight lines passing
  near that data point, and the true parameter value is marked with a white
  cross;
- the prior/posterior: multiplying the current prior p(w) by the likelihood
  gives the new posterior p(w|t);
- six samples (regression functions) y(x, w) with w drawn from the current
  posterior.]

With infinite points the posterior becomes a delta function centered at the
true parameter value (the white cross).
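A sketch of the sequential update for the straight-line model, where the posterior after each point becomes the prior for the next; constants follow the running example (α = 2, β = 25), and the seed is an assumption:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, beta = 2.0, 25.0
m = np.zeros(2)                          # prior mean
S = np.eye(2) / alpha                    # prior covariance

for n in range(20):
    x_n = rng.uniform(-1, 1)
    t_n = -0.3 + 0.5 * x_n + rng.normal(0, 0.2)
    phi = np.array([1.0, x_n])           # phi(x) = (1, x)

    # One-point update: S_new^{-1} = S^{-1} + beta phi phi^T,
    #                   m_new = S_new (S^{-1} m + beta phi t_n)
    S_inv = np.linalg.inv(S)
    S_new = np.linalg.inv(S_inv + beta * np.outer(phi, phi))
    m = S_new @ (S_inv @ m + beta * phi * t_n)
    S = S_new

print("posterior mean after 20 points:", m)   # near the true (-0.3, 0.5)

# Samples w ~ N(m, S) drawn now give plausible regression lines y = w0 + w1*x
samples = rng.multivariate_normal(m, S, size=6)
```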
Generalization of the Gaussian Prior

The Gaussian prior over parameters is $p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1} I)$.
Maximization of the posterior ln p(w|t) is then equivalent to minimization of
the sum-of-squares error
  $E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 + \frac{\lambda}{2} w^T w$

Other priors yield the lasso and variations:
  $p(w | \alpha) = \left[ \frac{q}{2} \left( \frac{\alpha}{2} \right)^{1/q} \frac{1}{\Gamma(1/q)} \right]^M \exp\left( -\frac{\alpha}{2} \sum_{j=1}^{M} |w_j|^q \right)$
where q = 2 corresponds to the Gaussian. This corresponds to minimization of
the regularized error function
  $\frac{1}{2} \sum_{n=1}^{N} \{ t_n - w^T \phi(x_n) \}^2 + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|^q$
Predictive Distribution

- Usually we are not interested in the value of w itself but in predicting t
  for a new value of x: $p(t | \mathbf{t}, x, X)$, written $p(t | \mathbf{t})$ leaving out the
  conditioning variables X and x for convenience.
- Marginalizing over the parameter variable w is the standard Bayesian
  approach. By the sum rule of probability we can write
    $p(t) = \int p(t, w) \, dw = \int p(t | w) \, p(w) \, dw$
    $p(t | \mathbf{t}) = \int p(t | w) \, p(w | \mathbf{t}) \, dw$
Predictive Distribution with α, β, X, t

We can predict t for a new value of x using
  $p(t | \mathbf{t}) = \int p(t | w) \, p(w | \mathbf{t}) \, dw$
With explicit dependence on the prior parameter α, the noise parameter β, and
the targets t of the training set:
  $p(t | \mathbf{t}, \alpha, \beta) = \int p(t | w, \beta) \, p(w | \mathbf{t}, \alpha, \beta) \, dw$
(We have left out the conditioning variables X and x for convenience, and
applied the sum rule of probability.)

Here the conditional of target t given the weights w is
  $p(t | x, w, \beta) = \mathcal{N}(t | y(x, w), \beta^{-1})$
and the posterior of the weights w is $p(w | \mathbf{t}) = \mathcal{N}(w | m_N, S_N)$ where
  $m_N = \beta S_N \Phi^T \mathbf{t}$, $S_N^{-1} = \alpha I + \beta \Phi^T \Phi$

The RHS is a convolution of two Gaussian distributions, whose result is the
Gaussian
  $p(t | x, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t | m_N^T \phi(x), \sigma_N^2(x))$
where
  $\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)$
Variance of the Predictive Distribution

The predictive distribution is a Gaussian:
  $p(t | x, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t | m_N^T \phi(x), \sigma_N^2(x))$, $\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)$

Since the noise process and the distribution of w are independent Gaussians,
their variances are additive:
- The first term, 1/β, represents the noise in the data.
- The second term represents the uncertainty associated with the parameters w,
  where S_N is the covariance of the posterior: $S_N^{-1} = \alpha I + \beta \Phi^T \Phi$.

Since $\sigma_{N+1}^2(x) \le \sigma_N^2(x)$, the predictive distribution becomes narrower as the
number of samples increases. As N → ∞, the second term of the variance goes
to zero and the variance of the predictive distribution arises solely from the
additive noise governed by the parameter β.
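A sketch computing the predictive mean and variance σ_N²(x) on a grid of new inputs; the Gaussian basis functions, their centers and width, and the data are assumptions in the spirit of the sin(2πx) example:

```python
import numpy as np

rng = np.random.default_rng(5)
alpha, beta = 2.0, 25.0
N, M = 25, 9
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)

mu = np.linspace(0, 1, M)      # assumed basis centers
s = 0.1                        # assumed basis width

def phi(x):
    # Rows are feature vectors phi(x_n) for Gaussian basis functions
    return np.exp(-(np.atleast_1d(x)[:, None] - mu) ** 2 / (2 * s ** 2))

Phi = phi(x)
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

x_new = np.linspace(0, 1, 100)
P = phi(x_new)
pred_mean = P @ m_N                                   # m_N^T phi(x)
pred_var = 1.0 / beta + np.sum((P @ S_N) * P, axis=1) # sigma_N^2(x)
```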
Example of the Predictive Distribution

- Data generated from sin(2πx).
- Model: nine Gaussian basis functions
    $y(x, w) = \sum_{j=0}^{8} w_j \phi_j(x) = w^T \phi(x)$, $\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)$
- Predictive distribution:
    $p(t | x, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t | m_N^T \phi(x), \sigma_N^2(x))$, $\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)$
  where $m_N = \beta S_N \Phi^T \mathbf{t}$, $S_N^{-1} = \alpha I + \beta \Phi^T \Phi$, and α and β come from the
  assumptions $p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1} I)$ and $p(t | x, w, \beta) = \mathcal{N}(t | y(x, w), \beta^{-1})$.

[Figure: plot of p(t|x) for one data point, showing the mean of the predictive
distribution (red) and one standard deviation around it (pink).]
Predictive Distribution Variance

Bayesian prediction:
  $p(t | x, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t | m_N^T \phi(x), \sigma_N^2(x))$, $\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)$
where we have assumed a Gaussian prior over the parameters
$p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1} I)$, a Gaussian noise model
$p(t | x, w, \beta) = \mathcal{N}(t | y(x, w), \beta^{-1})$, and used the design matrix Φ in
$S_N^{-1} = \alpha I + \beta \Phi^T \Phi$.

[Figure: using data from sin(2πx), four panels for N = 1, 2, 4, 25 show the
mean of the Gaussian predictive distribution and one standard deviation from
the mean. σ²(x), the standard deviation of t, is smallest in the neighborhood
of the data points, and the uncertainty decreases as more data points are
observed.]

The plot only shows the point-wise predictive variance. To show the covariance
between predictions at different values of x, draw samples from the posterior
distribution over w, p(w|t), and plot the corresponding functions y(x, w).
Plots of the Function y(x, w)

Draw samples w from the posterior distribution $p(w | \mathbf{t}) = \mathcal{N}(w | m_N, S_N)$ and
plot the corresponding functions $y(x, w) = w^T \phi(x)$. This shows the
covariance between predictions at different values of x: for a given function
and a pair of points x, x', the values y, y' are correlated as determined by
the equivalent kernel k(x, x'), which in turn is determined by the samples.

[Figure: sampled functions for N = 1, 2, 4, 25.]
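A sketch of drawing function samples from the posterior; it reuses the Gaussian-basis setup assumed in the predictive-distribution sketch above, here with N = 4 observations:

```python
import numpy as np

rng = np.random.default_rng(6)
alpha, beta, M = 2.0, 25.0, 9
mu, s = np.linspace(0, 1, M), 0.1
phi = lambda x: np.exp(-(np.atleast_1d(x)[:, None] - mu) ** 2 / (2 * s ** 2))

x = rng.uniform(0, 1, 4)                               # N = 4 observations
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 4)
Phi = phi(x)
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

x_grid = np.linspace(0, 1, 200)
w_samples = rng.multivariate_normal(m_N, S_N, size=5)  # w ~ N(m_N, S_N)
y_samples = phi(x_grid) @ w_samples.T                  # one curve per sample
```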
Disadvantage of a Local Basis

The predictive distribution, assuming the Gaussian prior
$p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1} I)$ and Gaussian noise t = y(x, w) + ε with
$p(t | x, w, \beta) = \mathcal{N}(t | y(x, w), \beta^{-1})$, is
  $p(t | x, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t | m_N^T \phi(x), \sigma_N^2(x))$, $\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)$, $S_N^{-1} = \alpha I + \beta \Phi^T \Phi$

With localized basis functions, e.g., Gaussians, in regions away from the
basis function centers the contribution of the second term in the variance
σ_N²(x) goes to zero, leaving only the noise contribution 1/β.

The model thus becomes very confident outside the region occupied by the basis
functions. This problem is avoided by the alternative Bayesian approach of
Gaussian Processes.
Dealing with an Unknown β

If both w and β are treated as unknown, we can introduce a conjugate prior
distribution p(w, β), which is given by a Gaussian-gamma distribution (shown
here in its generic form):
  $p(\mu, \lambda) = \mathcal{N}(\mu | \mu_0, (\beta \lambda)^{-1}) \, \text{Gam}(\lambda | a, b)$

In this case the predictive distribution is a Student's t-distribution:
  $\text{St}(x | \mu, \lambda, \nu) = \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)} \left( \frac{\lambda}{\pi \nu} \right)^{1/2} \left[ 1 + \frac{\lambda (x - \mu)^2}{\nu} \right]^{-\nu/2 - 1/2}$
Mean of p(w|t) has a Kernel Interpretation

The regression function is
  $y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)$

If we take a Bayesian approach with Gaussian prior $p(w) = \mathcal{N}(w | m_0, S_0)$, then
we have the posterior $p(w | \mathbf{t}) = \mathcal{N}(w | m_N, S_N)$ where
  $m_N = S_N (S_0^{-1} m_0 + \beta \Phi^T \mathbf{t})$, $S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi$

With a zero-mean isotropic prior $p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1} I)$:
  $m_N = \beta S_N \Phi^T \mathbf{t}$, $S_N^{-1} = \alpha I + \beta \Phi^T \Phi$

The posterior mean $\beta S_N \Phi^T \mathbf{t}$ has a kernel interpretation. This sets the
stage for kernel methods and Gaussian processes.
Equivalent Kernel

The posterior mean of w is $m_N = \beta S_N \Phi^T \mathbf{t}$, where $S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi$,
S_0 is the covariance matrix of the prior p(w), β is the noise parameter, and
Φ is the design matrix that depends on the samples.

Substitute the mean value into the regression function
$y(x, w) = \sum_{j=0}^{M-1} w_j \phi_j(x) = w^T \phi(x)$. The mean of the predictive
distribution at point x is then
  $y(x, m_N) = m_N^T \phi(x) = \beta \phi(x)^T S_N \Phi^T \mathbf{t} = \sum_{n=1}^{N} \beta \phi(x)^T S_N \phi(x_n) \, t_n = \sum_{n=1}^{N} k(x, x_n) \, t_n$
where
  $k(x, x') = \beta \phi(x)^T S_N \phi(x')$
is the equivalent kernel.

Thus the mean of the predictive distribution is a linear combination of the
training-set target variables t_n. Note that the equivalent kernel depends on
the input values x_n from the dataset because they appear in S_N.
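A sketch of evaluating the equivalent kernel k(x, x') = β φ(x)ᵀ S_N φ(x'); the Gaussian basis functions, their centers and width, and the grid of training inputs are assumptions:

```python
import numpy as np

alpha, beta, M = 2.0, 25.0, 9
mu, s = np.linspace(-1, 1, M), 0.2
phi = lambda x: np.exp(-(np.atleast_1d(x)[:, None] - mu) ** 2 / (2 * s ** 2))

x_train = np.linspace(-1, 1, 200)        # 200 equally spaced inputs
Phi = phi(x_train)
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)

def kernel(x, x_prime):
    # k(x, x') = beta * phi(x)^T S_N phi(x')
    return beta * phi(x) @ S_N @ phi(x_prime).T

K = kernel(np.array([0.0]), x_train)     # slice k(0, x'): peaks near x' = 0
```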
Kernel Function

Regression functions such as
  $y(x, m_N) = \sum_{n=1}^{N} k(x, x_n) \, t_n$, with $k(x, x') = \beta \phi(x)^T S_N \phi(x')$ and $S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi$
that take a linear combination of the training-set target values are known as
linear smoothers. They depend on the input values x_n from the data set since
these appear in the definition of S_N (through the design matrix Φ).
Example of the Kernel for a Gaussian Basis

Equivalent kernel $k(x, x') = \beta \phi(x)^T S_N \phi(x')$, $S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi$, with
Gaussian basis functions
  $\phi_j(x) = \exp\left( -\frac{(x - \mu_j)^2}{2 s^2} \right)$

The kernel is used directly in regression: the mean of the predictive
distribution is
  $y(x, m_N) = \sum_{n=1}^{N} k(x, x_n) \, t_n$
obtained by forming a weighted combination of target values. Data points close
to x are given higher weight than points further removed from x; the kernels
are localized around x, i.e., k(x, x') peaks when x' = x.

[Figure: plot of k(x, x') as a function of x and x', peaking when x = x';
slices of k(x, x') shown for three values of x. The data set used to generate
the kernel was 200 values of x equally spaced in (-1, 1).]
Equivalent Kernel for a Polynomial Basis

For the basis function $\phi_j(x) = x^j$:
  $k(x, x') = \beta \phi(x)^T S_N \phi(x')$, $S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi$

Data points close to x are given higher weight than points further removed
from x.

[Figure: k(x, x') plotted as a function of x' for x = 0. It is a localized
function of x' even though the corresponding basis function is nonlocal.]
Equivalent Kernel for a Sigmoidal Basis

For the basis function
  $\phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$
the equivalent kernel is $k(x, x') = \beta \phi(x)^T S_N \phi(x')$.

[Figure: k(x, x') is a localized function of x' even though the corresponding
basis function is nonlocal.]
Covariance between y(x) and y(x')

An important insight: the value of the kernel function between two points is
directly related to the covariance between the corresponding function values:
  $\text{cov}[y(x), y(x')] = \text{cov}[\phi(x)^T w, \, w^T \phi(x')] = \phi(x)^T S_N \phi(x') = \beta^{-1} k(x, x')$
where we have used $p(w | \mathbf{t}) = \mathcal{N}(w | m_N, S_N)$ and $k(x, x') = \beta \phi(x)^T S_N \phi(x')$.

From the form of the equivalent kernel k(x, x'), the predictive means at
nearby points y(x), y(x') will be highly correlated, while for more distant
pairs the correlation is smaller. The kernel captures the covariance.
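A Monte Carlo check of the identity cov[y(x), y(x')] = φ(x)ᵀ S_N φ(x'), obtained by sampling w from the posterior; the setup reuses the assumed Gaussian-basis configuration of the equivalent-kernel sketch:

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, beta, M = 2.0, 25.0, 9
mu, s = np.linspace(-1, 1, M), 0.2
phi = lambda x: np.exp(-(np.atleast_1d(x)[:, None] - mu) ** 2 / (2 * s ** 2))

x_train = np.linspace(-1, 1, 50)
t = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 50)
Phi = phi(x_train)
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

xa, xb = np.array([0.1]), np.array([0.15])
w = rng.multivariate_normal(m_N, S_N, size=200_000)   # w ~ p(w|t)
ya, yb = w @ phi(xa).ravel(), w @ phi(xb).ravel()
mc_cov = np.cov(ya, yb)[0, 1]                         # Monte Carlo estimate
exact = (phi(xa) @ S_N @ phi(xb).T).item()            # phi(x)^T S_N phi(x')
print(mc_cov, exact)                                  # agree to sampling error
```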
Predictive Plot vs. Posterior Plots

- The predictive distribution allows us to visualize the pointwise uncertainty
  in the predictions, governed by
    $p(t | x, \mathbf{t}, \alpha, \beta) = \mathcal{N}(t | m_N^T \phi(x), \sigma_N^2(x))$, $\sigma_N^2(x) = \frac{1}{\beta} + \phi(x)^T S_N \phi(x)$
- By drawing samples from the posterior p(w|t) and plotting the corresponding
  functions y(x, w), we visualize the joint uncertainty in the posterior
  distribution between the y values at two or more x values, as governed by
  the equivalent kernel.
Directly Specifying the Kernel Function

The formulation of linear regression in terms of a kernel function suggests an
alternative approach to regression: instead of introducing a set of basis
functions, which implicitly determines an equivalent kernel, directly define a
kernel function and use it to make predictions for a new input x, given the
observation set. This leads to a practical framework for regression (and
classification) called Gaussian Processes.
Summing Kernel Values over the Samples

The effective kernel defines the weights by which the training-set target
values are combined to make a prediction at x. It can be shown that the
weights sum to one, i.e., for all values of x,
  $\sum_{n=1}^{N} k(x, x_n) = 1$
where $k(x, x') = \beta \phi(x)^T S_N \phi(x')$ and $S_N^{-1} = S_0^{-1} + \beta \Phi^T \Phi$.

This result can be proven intuitively: since
  $y(x, m_N) = \sum_{n=1}^{N} k(x, x_n) \, t_n$
the summation is equivalent to considering the predictive mean ŷ(x) for a
data set in which t_n = 1 for all n. Provided the basis functions are linearly
independent, that N > M, and that one of the basis functions is constant
(corresponding to the bias parameter), we can fit the training data exactly,
and hence ŷ(x) = 1. A numerical check follows below.
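A sketch verifying numerically that the equivalent-kernel weights sum to (approximately) one when the basis includes a constant bias function; the basis configuration and constants are assumptions, and the sum is exact only in the limit of a broad prior (α → 0):

```python
import numpy as np

alpha, beta, M = 2.0, 25.0, 9
mu, s = np.linspace(-1, 1, M - 1), 0.2

def phi(x):
    x = np.atleast_1d(x)
    g = np.exp(-(x[:, None] - mu) ** 2 / (2 * s ** 2))
    return np.column_stack([np.ones(len(x)), g])     # bias + Gaussian basis

x_train = np.linspace(-1, 1, 200)                    # N = 200 > M = 9
Phi = phi(x_train)
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)

weights = beta * phi(0.3) @ S_N @ Phi.T              # k(0.3, x_n) for all n
print(weights.sum())                                 # close to 1
```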
Kernel Function Properties

- The equivalent kernel can be positive or negative: although it satisfies a
  summation constraint, the corresponding predictions are not necessarily a
  convex combination of the training-set target variables.
- The equivalent kernel $k(x, x') = \beta \phi(x)^T S_N \phi(x')$ satisfies an important
  property shared by kernel functions in general: it can be expressed in the
  form of an inner product with respect to a vector ψ(x) of nonlinear
  functions:
    $k(x, z) = \psi(x)^T \psi(z)$, where $\psi(x) = \beta^{1/2} S_N^{1/2} \phi(x)$