When we want more than point estimates
Bayesian Regression: Linear and Logistic Regression
Nicole Beckage

Ordinary least squares regression and lasso regression return only point estimates. But what if we want a full posterior? Over the parameters β (or w in the book), and over the estimated variance σ².

Bayesian linear regression

Likelihood function: Linear Regression
First, assume σ² is known, so we are estimating p(w | D, σ²), or in the more general case p(w, σ² | D).
Recall our linear regression model is defined as
  p(y | x, w, σ²) = N(y | μ + xᵀw, σ²)
μ is an offset term, so if the inputs are centered (or standardized) for each j, then our prior belief about the mean of the output is equally likely to be positive or negative.
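As a concrete reference point, here is a minimal sketch (Python/NumPy, with illustrative names, not code from the lecture) of the Gaussian log-likelihood this model defines, treating the offset μ as a fixed scalar:

import numpy as np

def log_likelihood(w, X, y, sigma2, mu=0.0):
    """Log of N(y | mu + X w, sigma2 * I): the linear regression likelihood."""
    resid = y - (mu + X @ w)
    n = len(y)
    return -0.5 * n * np.log(2 * np.pi * sigma2) - 0.5 * resid @ resid / sigma2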
Likelihood function: Linear Regression
We want an (uninformative) prior that will not affect the mean. We choose an improper prior on μ of the form p(μ) ∝ 1. This is for mathematical ease: we include it and then integrate it out again.
Here ȳ is the empirical mean of the output, and y_c = y − ȳ·1_N indicates that the output vector y has been centered. Now our likelihood function is free of μ:
  p(D | w, σ²) ∝ N(y_c | X_c w, σ² I_N)

Prior: Linear Regression
We now have a well-defined likelihood function. What do we choose for our prior? Conjugate priors are nice, and the conjugate prior of a normal is a normal. Let's define our prior (on the parameters) as
  p(w) = N(w | w₀, V₀)

Posterior
Use Bayes' rule to compute the posterior, p(w | D, σ²) = N(w | wₙ, Vₙ).

Relationship to Ridge
If we define our prior to be w₀ = 0 and V₀ = τ² I, the posterior mean reduces to the ridge estimate with λ = σ²/τ². But in this case we'd have the full posterior rather than just a point estimate.
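To make the ridge connection concrete, here is a minimal sketch (illustrative names, not from the slides) of the conjugate posterior update; with w₀ = 0 and V₀ = τ²I the posterior mean is exactly the ridge solution with λ = σ²/τ²:

import numpy as np

def gaussian_posterior(X, y, sigma2, tau2):
    """Posterior N(w | wn, Vn) for prior w ~ N(0, tau2 I) and known noise variance sigma2."""
    d = X.shape[1]
    V0_inv = np.eye(d) / tau2                        # prior precision
    Vn = np.linalg.inv(V0_inv + X.T @ X / sigma2)    # posterior covariance
    wn = Vn @ (X.T @ y) / sigma2                     # posterior mean
    # wn equals the ridge estimate inv(X.T X + (sigma2/tau2) I) X.T y
    return wn, Vn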
Bayesian approach
Recall Bayes' rule: p(w | D) ∝ p(D | w) p(w).

A walkthrough
Assume data is sampled from a model where w₀ = −0.3 and w₁ = 0.5.
Initially, we have no data and thus no likelihood. Our prior is a multivariate normal centered at (0, 0). The possible linear fits include many different lines, with the single most probable weight value going through (0, 0).
We see one data point (the blue circle): the data space changes and our hypothesis space is more restricted. The dark red values in the band are the weights that would generate the data point in data space. Our multivariate normal is now skewed to reflect the likelihood.
With two data points, the model is more sure about the valid hypotheses. The dark red values in the band would generate both data points in data space. Our multivariate normal is now skewed and smaller, reflecting the likelihood.
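The walkthrough can be reproduced with a short sequential update. The sketch below (illustrative, with an assumed noise level) feeds in one point at a time and watches the posterior shrink toward (−0.3, 0.5):

import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([-0.3, 0.5])                  # w0 and w1 from the walkthrough
sigma2, tau2 = 0.04, 1.0                        # assumed noise and prior variances

w, V = np.zeros(2), tau2 * np.eye(2)            # prior: MVN centered at (0, 0)
for n in range(20):
    x = np.array([1.0, rng.uniform(-1, 1)])     # [bias, input]
    y = x @ true_w + rng.normal(0.0, np.sqrt(sigma2))
    # conjugate one-point update: precision accumulates, mean moves toward the data
    V_new = np.linalg.inv(np.linalg.inv(V) + np.outer(x, x) / sigma2)
    w = V_new @ (np.linalg.inv(V) @ w + x * y / sigma2)
    V = V_new

print(w)  # close to (-0.3, 0.5) after 20 points, as in the final panel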
A walkthrough (continued)
After 20 data points, we're more confident about the weights / best-fitting line. The likelihood, reflecting the data, has high confidence that the true weights are −0.3 and 0.5. Our multivariate normal is now very close to the true value (the white cross).

Posterior predictive distribution
We can show that
  p(y | x, D, σ²) = N(y | xᵀwₙ, σ² + xᵀVₙx)
So the variance depends on two terms: the variance of the observed noise, σ², and the variance in the parameters, Vₙ. As we move further from the observed data, we increase our uncertainty.

Posterior predictive
[Figure: standard MLE estimation vs. Bayesian MAP estimation of the predictive distribution; here the MLE estimate is the MAP estimate.]
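A small sketch of that predictive computation, reusing a posterior (wₙ, Vₙ) such as the one computed above (names are illustrative, inputs are NumPy arrays):

def posterior_predictive(x_star, wn, Vn, sigma2):
    """Mean and variance of p(y | x*, D, sigma2) = N(x*.T wn, sigma2 + x*.T Vn x*)."""
    mean = x_star @ wn                       # predictive mean
    var = sigma2 + x_star @ Vn @ x_star      # observation noise + parameter uncertainty
    return mean, var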
Function generation with posterior predictives
[Figures: standard MLE function generation vs. Bayesian MAP function generation]

What happens with unknown variance?
We instead define the prior as
  p(w, σ²) = NIG(w, σ² | w₀, V₀, a₀, b₀)
where NIG is a normal-inverse-gamma distribution whose pdf is
  NIG(w, σ² | w₀, V₀, a₀, b₀) = N(w | w₀, σ² V₀) · IG(σ² | a₀, b₀)
The normal distribution arises as a special case by setting β = 0, δ = σ²α, and letting α → ∞.
wₙ and Vₙ are similar to the case where σ² is known. aₙ is just an update of counts. bₙ is a contribution of the prior sum of squares b₀ and the empirical sum of squares yᵀy, plus a term due to the error in the prior on w.
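A minimal sketch of the conjugate NIG update, following the slide's w₀, V₀, a₀, b₀ notation (this is a sketch under those assumptions, not the lecture's code):

import numpy as np

def nig_posterior(X, y, w0, V0, a0, b0):
    """Posterior NIG(w, sigma2 | wn, Vn, an, bn) for the normal-inverse-gamma prior."""
    n = len(y)
    V0_inv = np.linalg.inv(V0)
    Vn = np.linalg.inv(V0_inv + X.T @ X)
    wn = Vn @ (V0_inv @ w0 + X.T @ y)
    an = a0 + n / 2.0                                  # an is just an update of counts
    bn = b0 + 0.5 * (w0 @ V0_inv @ w0 + y @ y          # prior and empirical sums of squares
                     - wn @ np.linalg.inv(Vn) @ wn)    # minus a term involving the posterior mean
    return wn, Vn, an, bn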
Unknown variance
The posterior predictive distribution with unknown variance is a Student t-distribution. It looks like a normal distribution but with thicker tails.

What happens if the prior is unknown?
There is an empirical Bayes procedure for picking the hyperparameters: choose η = (α, λ) to maximize the marginal likelihood, where λ = 1/σ² is the precision of the observation noise and α is the precision of the prior p(w) = N(w | 0, α⁻¹ I). This is known as the evidence procedure.

Evidence procedure
An alternative to using cross-validation.
[Figure: 5-fold cross-validation vs. the evidence procedure]
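Here is a rough sketch of the evidence procedure: evaluate the log marginal likelihood over a grid of (α, λ) values and keep the best pair, rather than cross-validating (the function and grid here are illustrative):

import numpy as np

def log_evidence(X, y, alpha, lam):
    """Log marginal likelihood of the linear model with prior precision alpha
    and noise precision lam = 1/sigma2."""
    n, d = X.shape
    A = alpha * np.eye(d) + lam * X.T @ X               # posterior precision
    m = lam * np.linalg.solve(A, X.T @ y)               # posterior mean
    fit = 0.5 * lam * np.sum((y - X @ m) ** 2) + 0.5 * alpha * m @ m
    return (0.5 * d * np.log(alpha) + 0.5 * n * np.log(lam) - fit
            - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * n * np.log(2 * np.pi))

# grid search: keep the (alpha, lam) pair with the highest log_evidence(X, y, alpha, lam)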
Bayesian Logistic Regression

Bayesian approach
Recall logistic regression: p(y | x, w) = Ber(y | sigm(wᵀx)).

Approximate inference
We may want to compute the full posterior over the parameters, p(w | D). This would allow us to associate confidence intervals with our predictions; applications include bandit problems. But there is no convenient conjugate prior that lets us compute the posterior exactly.
A simple approximation follows, known as the Laplace approximation (also known as the saddle-point approximation). Other solutions include Markov chain Monte Carlo (MCMC), variational inference, and expectation propagation.
Laplace approximation
A Gaussian approximation to the posterior.
Let θ ∈ ℝ^D and
  p(θ | D) = (1/Z) e^(−E(θ))
where E(θ) is an energy function. For this approximation, we assume E(θ) = −log p(θ, D), so that Z = p(D).

Taylor series
[Figure: approximation of a sine function by Taylor expansions of greater and greater degree. As the degree of the Taylor polynomial rises, it approaches the correct function. The image shows sin x and its Taylor approximations, polynomials of degree 1, 3, 5, 7, 9, 11 and 13.]

Applying a Taylor series: Laplace approximation
Expand around the mode θ*. Recall that the mode is the maximum (most likely point) of the posterior distribution; it's also the lowest-energy state under our definition E(θ) = −log p(θ, D).
  E(θ) ≈ E(θ*) + (θ − θ*)ᵀ g + ½ (θ − θ*)ᵀ H (θ − θ*)
where g is the gradient and H is the Hessian of the energy function evaluated at the mode:
  g ≜ ∇E(θ)|_θ*,  H ≜ ∇²E(θ)|_θ*
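A toy sketch of that expansion in one dimension: find the mode of an energy function numerically, estimate g and H by finite differences (g is ≈ 0 at the mode), and form the quadratic approximation. The energy function here is purely illustrative:

import numpy as np
from scipy.optimize import minimize_scalar

E = lambda t: 0.5 * t**2 + np.log(1.0 + np.exp(-t))     # an illustrative energy function
theta_star = minimize_scalar(E).x                        # the mode: lowest-energy point

eps = 1e-4
g = (E(theta_star + eps) - E(theta_star - eps)) / (2 * eps)                    # gradient, ~0 at the mode
H = (E(theta_star + eps) - 2 * E(theta_star) + E(theta_star - eps)) / eps**2   # second derivative

E_quad = lambda t: E(theta_star) + g * (t - theta_star) + 0.5 * H * (t - theta_star) ** 2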
Laplace approximation (continued)
Since θ* is the mode, the gradient term is zero. Why? It's the maximum of the distribution. Hence
  p(θ | D) ≈ (1/Z) e^(−E(θ*)) exp(−½ (θ − θ*)ᵀ H (θ − θ*))
  Z = p(D) ≈ e^(−E(θ*)) (2π)^(D/2) |H|^(−1/2)
This is known as the Laplace approximation to the marginal likelihood. So the posterior is approximated by a Gaussian centered at the mode. As the sample size increases, the posterior starts to look more and more like a Gaussian, and thus this is commonly referred to as a Gaussian approximation.

Gaussian approximation
Our posterior becomes
  p(w | D) ≈ N(w | ŵ, H⁻¹)
  ŵ = argmin_w E(w),  E(w) = −(log p(D | w) + log p(w)),  H = ∇²E(w)|_ŵ
If we have linearly separable data, the MLE estimate is not well defined, since w can get arbitrarily large.
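A minimal sketch of this Gaussian approximation for logistic regression with a N(0, τ²I) prior (illustrative names; SciPy's generic optimizer stands in for whatever solver the lecture assumes):

import numpy as np
from scipy.optimize import minimize

def laplace_logreg(X, y, tau2=1.0):
    """Laplace approximation p(w | D) ~= N(w_hat, H^-1) for logistic regression, y in {0, 1}."""
    d = X.shape[1]

    def energy(w):                                    # E(w) = -[log p(D | w) + log p(w)]
        z = X @ w
        log_lik = np.sum(y * z - np.logaddexp(0.0, z))
        log_prior = -0.5 * w @ w / tau2
        return -(log_lik + log_prior)

    w_hat = minimize(energy, np.zeros(d)).x           # the mode (MAP estimate)
    mu = 1.0 / (1.0 + np.exp(-(X @ w_hat)))           # predicted probabilities at the mode
    H = X.T @ (X * (mu * (1.0 - mu))[:, None]) + np.eye(d) / tau2   # Hessian of E at the mode
    return w_hat, np.linalg.inv(H)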
[Figure: unnormalized posterior vs. its Laplace approximation]