

Statistical Techniques in Robotics (STR, S15)
Lecture #10 (Wednesday, February 11)
Lecturer: Byron Boots
Gaussian Properties, Bayesian Linear Regression

1 Bayesian Linear Regression (BLR)

In linear regression, the goal is to predict a continuous outcome variable. In particular, let:

- $\theta \in \mathbb{R}^n$ be the parameter vector of the learned model,
- $x_i \in \mathbb{R}^n$ be the $i$-th set of features in the dataset,
- $y_i \in \mathbb{R}$ be the true outcome given $x_i$.

The problem we are solving is to find a $\theta$ that makes the best prediction of the output $y$ given an input $x$. Our model is

$$y_i = \theta^T x_i + \epsilon_i,$$

where $\epsilon_i$ is noise that is independent of everything else. Taking the noise to be zero-mean Gaussian with variance $\sigma^2$, this has the form

$$y_i \sim \mathcal{N}(\theta^T x_i, \sigma^2).$$

Thus the likelihood, if $\theta$ is known, is

$$P(y_i \mid x_i, \theta) = \frac{1}{Z} \exp\left\{ -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \right\}.$$

In BLR, we maintain a distribution over the weight vector $\theta$ to represent our beliefs about what $\theta$ is likely to be. The math is easiest if we restrict this distribution to be a Gaussian. The problem can be represented by the following graphical model:

Figure 1: Bayesian linear regression model. The $x_i$'s are known.
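The model and likelihood above can be sketched numerically. The following is a minimal illustration, not part of the lecture: the function names, the "true" parameter vector, and the noise level are all assumptions chosen for the example. It draws data from $y_i = \theta^T x_i + \epsilon_i$ and evaluates the Gaussian log-likelihood of a candidate $\theta$.

```python
import numpy as np

def simulate_linear_data(theta, X, sigma, rng):
    """Draw y_i = theta^T x_i + eps_i with eps_i ~ N(0, sigma^2)."""
    noise = rng.normal(0.0, sigma, size=X.shape[0])
    return X @ theta + noise

def log_likelihood(theta, X, y, sigma):
    """Sum over i of log N(y_i; theta^T x_i, sigma^2)."""
    resid = y - X @ theta
    t = y.shape[0]
    return (-0.5 * t * np.log(2.0 * np.pi * sigma**2)
            - 0.5 * np.sum(resid**2) / sigma**2)

rng = np.random.default_rng(0)
theta_true = np.array([1.5, -0.5])   # hypothetical parameters for the demo
X = rng.normal(size=(200, 2))        # rows are the feature vectors x_i
sigma = 0.3                          # assumed observation noise
y = simulate_linear_data(theta_true, X, sigma, rng)

# The generating parameters should explain the data far better
# than a parameter vector shifted away from them.
ll_true = log_likelihood(theta_true, X, y, sigma)
ll_off = log_likelihood(theta_true + 1.0, X, y, sigma)
```

Comparing `ll_true` with `ll_off` shows how the likelihood ranks candidate values of $\theta$; BLR goes one step further and keeps a full distribution over $\theta$ rather than a single best value.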

1.1 Prior Distribution of θ

We assume that the prior distribution of $\theta$ is a normal distribution $\mathcal{N}(\mu, \Sigma)$ with mean $\mu$ and covariance matrix $\Sigma$, so that the probability of $\theta$ is given by

$$P(\theta) = \frac{1}{Z} \exp\left\{ -\frac{1}{2} (\theta - \mu)^T \Sigma^{-1} (\theta - \mu) \right\}, \tag{1}$$

where $\mu = E_{P(\theta)}[\theta]$ and $\Sigma = E_{P(\theta)}[(\theta - \mu)(\theta - \mu)^T]$.

2 Parameterizations for Gaussians

There are two common parameterizations for Gaussians: the moment parameterization and the natural parameterization. It is often most practical to switch back and forth between the two representations, depending on which calculations are needed.

Equation (1) is called the moment parameterization of $\theta$ since it consists of the first moment ($\mu$) and the second central moment ($\Sigma$) of the variable $\theta$. $Z$ is a normalization factor with the value $\sqrt{(2\pi)^n \det(\Sigma)}$, where $n$ is the dimension of $\theta$. To prove this, one can translate the distribution to center it at the origin and apply a change of variables so that the distribution has the form $P(\theta') = \frac{1}{Z'} \exp\{ -\frac{1}{2} \theta'^T \theta' \}$. Then express $\theta'$ in polar coordinates and integrate over the space to compute the normalizer.

2.1 Moment Parameterization

$$\mathcal{N}(\mu, \Sigma) = p(\theta) = \frac{1}{Z} \exp\left( -\frac{1}{2} (\theta - \mu)^T \Sigma^{-1} (\theta - \mu) \right) \tag{2}$$

Given:

$$\mathcal{N}\left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix} \right) \tag{3}$$

Marginal:

$$\mu_2^{\text{marg}} = \mu_2 \tag{4}$$
$$\Sigma_2^{\text{marg}} = \Sigma_{22} \tag{5}$$

Conditional:

$$\mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (\theta_2 - \mu_2) \tag{6}$$
$$\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \tag{7}$$

2.2 Natural Parameterization

The normal distribution can also be expressed as

$$\tilde{\mathcal{N}}(J, P) = p(\theta) = \frac{1}{Z} \exp\left( J^T \theta - \frac{1}{2} \theta^T P \theta \right) \tag{8}$$
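As a numerical aside on the moment-form formulas (4)-(7) of Section 2.1, the sketch below marginalizes and conditions a small joint Gaussian. The helper names `marginal_moment` and `condition_moment`, the index-list interface, and the example numbers are assumptions made for illustration only.

```python
import numpy as np

def marginal_moment(mu, Sigma, idx):
    """Marginal of the block idx: pick out the sub-vector and sub-matrix, eqs. (4)-(5)."""
    return mu[idx], Sigma[np.ix_(idx, idx)]

def condition_moment(mu, Sigma, idx1, idx2, theta2):
    """Mean and covariance of theta_1 given theta_2 = theta2, eqs. (6)-(7)."""
    mu1, mu2 = mu[idx1], mu[idx2]
    S11 = Sigma[np.ix_(idx1, idx1)]
    S12 = Sigma[np.ix_(idx1, idx2)]
    S22 = Sigma[np.ix_(idx2, idx2)]
    # gain = S12 S22^{-1}, computed with a linear solve instead of an explicit inverse
    gain = np.linalg.solve(S22, S12.T).T
    mu_cond = mu1 + gain @ (theta2 - mu2)
    Sigma_cond = S11 - gain @ S12.T   # S21 = S12^T because Sigma is symmetric
    return mu_cond, Sigma_cond

# Small hand-checkable example (values are arbitrary).
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

mu_m, Sigma_m = marginal_moment(mu, Sigma, [1])
mu_c, Sigma_c = condition_moment(mu, Sigma, [0], [1], np.array([2.0]))
# By hand: mu_c = 0 + 0.8 * (2 - 1) = 0.8 and Sigma_c = 2 - 0.8 * 0.8 = 1.36
```

Marginalization in moment form is just indexing, while conditioning needs a solve against $\Sigma_{22}$; this asymmetry is one reason to switch parameterizations depending on the operation.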

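The natural parameterization simplifies the multiplication of normal distributions, since the product becomes an addition of the $J$ vectors and $P$ matrices of the individual distributions. Transforming the moment parameterization into the natural parameterization can be done by first expanding the exponent:

$$-\frac{1}{2} (\theta - \mu)^T \Sigma^{-1} (\theta - \mu) = -\frac{1}{2} \theta^T \Sigma^{-1} \theta + \mu^T \Sigma^{-1} \theta - \frac{1}{2} \mu^T \Sigma^{-1} \mu \tag{9}$$

The last term in equation (9) does not depend on $\theta$ and can therefore be absorbed into the normalizer. Comparing (8) and (9),

$$J = \Sigma^{-1} \mu, \qquad P = \Sigma^{-1} \tag{10}$$

Given:

$$\tilde{\mathcal{N}}\left( \begin{bmatrix} J_1 \\ J_2 \end{bmatrix}, \begin{bmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{bmatrix} \right) \tag{11}$$

Marginal:

$$J_1^{\text{marg}} = J_1 - P_{12} P_{22}^{-1} J_2 \tag{12}$$
$$P_1^{\text{marg}} = P_{11} - P_{12} P_{22}^{-1} P_{21} \tag{13}$$

Conditional:

$$J_{1|2} = J_1 - P_{12} \theta_2 \tag{14}$$
$$P_{1|2} = P_{11} \tag{15}$$

The derivation of these conditionals is a straightforward expansion of the exponent of the joint distribution in natural form:

$$\begin{bmatrix} J_1 \\ J_2 \end{bmatrix}^T \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} - \frac{1}{2} \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}^T \begin{bmatrix} P_{11} & P_{12} \\ P_{21} & P_{22} \end{bmatrix} \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix} = (J_1 - P_{12} \theta_2)^T \theta_1 - \frac{1}{2} \theta_1^T P_{11} \theta_1 + \left( J_2^T \theta_2 - \frac{1}{2} \theta_2^T P_{22} \theta_2 \right)$$

The term in the last parentheses does not depend on $\theta_1$, so it is absorbed into the normalizer of the conditional.

2.3 Gaussian MRFs

The matrix $P$ of the natural parameterization has a graphical model interpretation. There is a link between $z_i$ and $z_j$ in the MRF that corresponds to $\tilde{\mathcal{N}}(J, P)$ if and only if the entry of $P$ for $(z_i, z_j)$ is non-zero. For example:

$$P = \begin{bmatrix} X & X & 0 \\ X & X & X \\ 0 & X & X \end{bmatrix}$$

(Figure: the MRF over $z_1$, $z_2$, $z_3$ corresponding to this $P$.)

Following the graphical model interpretation, $P$ is in many cases highly structured. Consider, for example, the graphical model of a Markov chain: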

$$x_1 - x_2 - x_3 - x_4 - x_5$$

The corresponding matrix $P$ will be non-zero only on the diagonal and the first off-diagonals:

$$P = \begin{bmatrix} X & X & 0 & 0 & 0 \\ X & X & X & 0 & 0 \\ 0 & X & X & X & 0 \\ 0 & 0 & X & X & X \\ 0 & 0 & 0 & X & X \end{bmatrix} \tag{16}$$

Note: $P$ describes which variables directly affect each other.

Note: $P^{-1}$ is, in general, not sparse! This makes intuitive sense, since $P^{-1} = \Sigma$ is the covariance matrix, and any two states along the Markov chain are correlated rather than independent.

3 Return to Bayesian Linear Regression

3.1 Prediction

With the Bayesian linear regression model, we would like to know the probability of an output $y_{t+1}$ given a new input $x_{t+1}$ and the set of data $D = \{(x_i, y_i)\}_{i=1,\dots,t}$. To compute the probability $P(y_{t+1} \mid x_{t+1}, D)$, we introduce $\theta$ into this expression and marginalize over it:

$$P(y_{t+1} \mid x_{t+1}, D) = \int_{\theta \in \Theta} P(y_{t+1} \mid x_{t+1}, \theta, D) \, P(\theta \mid x_{t+1}, D) \, d\theta \tag{17}$$

Once $\theta$ is known, $D$ tells us nothing more about $y_{t+1}$, so $P(y_{t+1} \mid x_{t+1}, \theta, D)$ is $P(y_{t+1} \mid x_{t+1}, \theta)$. Likewise, from the Markov assumption and the graphical model, $x_{t+1}$ alone carries no information about $\theta$, so $P(\theta \mid x_{t+1}, D) = P(\theta \mid D)$. Equation (17) becomes

$$P(y_{t+1} \mid x_{t+1}, D) = \int_{\theta \in \Theta} P(y_{t+1} \mid x_{t+1}, \theta) \, P(\theta \mid D) \, d\theta \tag{18}$$

Computing (18) is hard with the moment parameterization of normal distributions but not with the natural parameterization.

3.2 Posterior Distribution P(θ | D)

Using Bayes' rule, the posterior probability $P(\theta \mid D)$ can be expressed as

$$P(\theta \mid D) \propto P(y_{1:t} \mid x_{1:t}, \theta) \, P(\theta) = \left( \prod_{i=1}^{t} P(y_i \mid x_i, \theta) \right) P(\theta) \tag{19}$$

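If $\theta$ is known, the data points are independent; that is, $P(y_{1:t} \mid x_{1:t}, \theta) = \prod_{i=1}^{t} P(y_i \mid x_i, \theta)$. We will see that this product can be computed by a simple update rule. First, let's look at the product $P(y_i \mid x_i, \theta) P(\theta)$, with the prior written in natural form:

$$\begin{aligned}
P(y_i \mid x_i, \theta) P(\theta) &\propto \exp\left\{ -\frac{1}{2\sigma^2} (y_i - \theta^T x_i)^2 \right\} \exp\left\{ J^T \theta - \frac{1}{2} \theta^T P \theta \right\} \\
&\propto \exp\left\{ -\frac{1}{2\sigma^2} \left( -2 y_i \theta^T x_i + \theta^T x_i x_i^T \theta \right) \right\} \exp\left\{ J^T \theta - \frac{1}{2} \theta^T P \theta \right\} \\
&= \exp\left\{ \frac{1}{\sigma^2} y_i x_i^T \theta - \frac{1}{2\sigma^2} \theta^T x_i x_i^T \theta \right\} \exp\left\{ J^T \theta - \frac{1}{2} \theta^T P \theta \right\} \\
&= \exp\left\{ \left( J + \frac{1}{\sigma^2} y_i x_i \right)^T \theta - \frac{1}{2} \theta^T \left( P + \frac{1}{\sigma^2} x_i x_i^T \right) \theta \right\} \\
&= \exp\left\{ J'^T \theta - \frac{1}{2} \theta^T P' \theta \right\},
\end{aligned}$$

where $J' = J + \frac{1}{\sigma^2} y_i x_i$ and $P' = P + \frac{1}{\sigma^2} x_i x_i^T$. Line 1 to line 2 holds because any term that does not contain $\theta$ (here $y_i^2$) can be absorbed into the normalizer.

Now we can apply this result repeatedly to (19) and derive

$$P(\theta \mid D) \propto \exp\left\{ \left( J + \frac{\sum_i y_i x_i}{\sigma^2} \right)^T \theta - \frac{1}{2} \theta^T \left( P + \frac{\sum_i x_i x_i^T}{\sigma^2} \right) \theta \right\} \tag{20}$$

So $P(\theta \mid D)$ is also a normal distribution, with

$$J_{\text{final}} = J + \frac{\sum_i y_i x_i}{\sigma^2}, \qquad P_{\text{final}} = P + \frac{1}{\sigma^2} \sum_i x_i x_i^T.$$

The mean and the covariance of this distribution can be derived with the relation (10) provided earlier. For a zero-mean prior ($\mu = 0$, so $J = 0$ and $P = \Sigma^{-1}$):

$$\mu_{\text{final}} = \left( \Sigma^{-1} + \frac{\sum_i x_i x_i^T}{\sigma^2} \right)^{-1} \frac{\sum_i y_i x_i}{\sigma^2}$$

$$\Sigma_{\text{final}} = \left( \Sigma^{-1} + \frac{\sum_i x_i x_i^T}{\sigma^2} \right)^{-1}$$

$P_{\text{final}}$ is the precision matrix of the normal distribution, and as the number of $x_i$ increases, the terms in this matrix become larger. Since $P_{\text{final}}$ is the inverse of the covariance, the variance gets lower as the number of samples grows. It is a characteristic of this Gaussian model that a new data point never increases the variance, but this reduction in variance does not always make sense. If you believe that there are outliers in your dataset, this model will not work for you.

A few more facts about the BLR update:

1. If $\mu_0 \neq 0$, the update for $\mu_\theta$ will be more complicated.
2. $\sum_{t=1}^{N} \frac{y_t x_t}{\sigma^2}$ is the gradient of Bayes online linear regression.
3. This looks just like Newton's method.
4. Computation time: $O(n^2)$ for the update, $O(n^3)$ for the mean (which can be reduced to $O(n^2)$ with tricks).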
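The update rule derived above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not code from the lecture: the function names, the synthetic data, the prior $\mathcal{N}(0, I)$, and the value of $\sigma$ are all chosen for the example. It applies the per-observation update $J \leftarrow J + y_i x_i / \sigma^2$, $P \leftarrow P + x_i x_i^T / \sigma^2$ and compares the result with the batch sums in (20).

```python
import numpy as np

def prior_to_natural(mu, Sigma):
    """Moment form (mu, Sigma) to natural form: J = Sigma^{-1} mu, P = Sigma^{-1}."""
    P = np.linalg.inv(Sigma)
    return P @ mu, P

def blr_update(J, P, x, y, sigma):
    """Fold in one observation (x, y): add y x / sigma^2 to J and x x^T / sigma^2 to P."""
    return J + y * x / sigma**2, P + np.outer(x, x) / sigma**2

def natural_to_moment(J, P):
    """Natural form (J, P) back to moment form: Sigma = P^{-1}, mu = Sigma J."""
    Sigma = np.linalg.inv(P)
    return Sigma @ J, Sigma

rng = np.random.default_rng(1)
n, t, sigma = 3, 500, 0.5
theta_true = np.array([0.7, -1.2, 2.0])   # hypothetical ground truth for the demo
X = rng.normal(size=(t, n))
y = X @ theta_true + rng.normal(0.0, sigma, size=t)

# Sequential updates starting from the assumed prior N(0, I).
J, P = prior_to_natural(np.zeros(n), np.eye(n))
for x_i, y_i in zip(X, y):
    J, P = blr_update(J, P, x_i, y_i, sigma)
mu_final, Sigma_final = natural_to_moment(J, P)

# Batch form of equation (20) for the same prior (J = 0, P = I).
J_batch = X.T @ y / sigma**2
P_batch = np.eye(n) + X.T @ X / sigma**2
```

Each update is a rank-one addition costing $O(n^2)$, while recovering the mean here uses a full inverse at $O(n^3)$, in line with fact 4. The explicit `np.linalg.inv` is kept for readability; a linear solve would be the more usual choice numerically.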

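3.3 Probability Distribution of the Prediction

The next step is to evaluate (18), that is, to combine $P(y_{t+1} \mid x_{t+1}, \theta)$ with the posterior over $\theta$ to obtain the predictive distribution $P(y_{t+1} \mid x_{t+1}, D)$. Let $\mu_\theta$ and $\Sigma_\theta$ denote the posterior mean and covariance of $\theta$. Since $y_{t+1} = \theta^T x_{t+1} + \epsilon$ is a linear combination of normally distributed variables, it is also normally distributed, so the predictive distribution has the form

$$\frac{1}{Z} \exp\left\{ -\frac{1}{2} (y_{t+1} - \mu_{y_{t+1}})^T \Sigma_{y_{t+1}}^{-1} (y_{t+1} - \mu_{y_{t+1}}) \right\},$$

where

$$\mu_{y_{t+1}} = E[y_{t+1}] = E[\theta^T x_{t+1} + \epsilon] = E[\theta^T x_{t+1}] + E[\epsilon] = E[\theta]^T x_{t+1} = \mu_\theta^T x_{t+1},$$

and

$$\Sigma_{y_{t+1}} = x_{t+1}^T \Sigma_\theta x_{t+1} + \sigma^2.$$

The term $x^T \Sigma_\theta x$ measures how large the variance is along the direction of $x$. If no inputs along the direction of $x$ have been observed before, the variance along that direction is large. Also, the variance is not a function of $y_{t+1}$: the precision is affected only by the inputs, not by the outputs. This is a consequence of having the same $\sigma$ (observation error) everywhere in the space.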
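The predictive mean and variance can be illustrated with a tiny hand-checkable case. The sketch below is an assumption-laden example rather than the lecture's own: the function name `predict`, the prior $\mathcal{N}(0, I)$, $\sigma = 1$, and the three data points (which vary only along the first input axis) are all invented for illustration.

```python
import numpy as np

def predict(mu_theta, Sigma_theta, x_new, sigma):
    """Predictive mean mu_theta^T x and variance x^T Sigma_theta x + sigma^2."""
    mean = mu_theta @ x_new
    var = x_new @ Sigma_theta @ x_new + sigma**2
    return mean, var

sigma = 1.0
# Inputs only ever move along the first axis; the second axis is never excited.
X = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [-1.0, 0.0]])
y = np.array([1.0, 2.0, -1.0])

# Posterior under the assumed prior N(0, I): P = I + X^T X / sigma^2, J = X^T y / sigma^2.
P_final = np.eye(2) + X.T @ X / sigma**2       # diag(7, 1)
J_final = X.T @ y / sigma**2                   # [6, 0]
Sigma_theta = np.linalg.inv(P_final)           # diag(1/7, 1)
mu_theta = Sigma_theta @ J_final               # [6/7, 0]

mean_seen, var_seen = predict(mu_theta, Sigma_theta, np.array([1.0, 0.0]), sigma)
mean_unseen, var_unseen = predict(mu_theta, Sigma_theta, np.array([0.0, 1.0]), sigma)
# By hand: var_seen = 1/7 + 1 = 8/7, var_unseen = 1 + 1 = 2
```

The query along the observed direction has predictive variance $8/7$, while the query along the never-observed direction stays at the prior level plus noise, $2$. Changing the values in `y` alters the means but leaves both variances unchanged, matching the remark that the precision depends only on the inputs.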
