The Multivariate Gaussian Distribution [DRAFT]

David S. Rosenberg

Abstract

This is a collection of a few key and standard results about multivariate Gaussian distributions. I have not included many proofs, because I don't know how to do so without either being very tedious and technical, or using mathematics that is beyond the prerequisites. To my knowledge, there are two primary approaches to developing the theory of multivariate Gaussian distributions. The first, and by far the most common in machine learning textbooks, is to define the multivariate Gaussian distribution in terms of its density function, and to derive results by manipulating these density functions. With this approach, a lot of the work turns out to be elaborate matrix algebra calculations happening inside the exponent of the Gaussian density. One issue with this approach is that the multivariate Gaussian density is only defined when the covariance matrix is invertible. To keep the derivations rigorous, some care must be taken to justify that the new covariance matrices we come up with are invertible. For my taste, the rigor in our textbooks is a bit light on these points. We've included the proof of the theorem in Section 4 to give a flavor of the details one should add. The second major approach to multivariate Gaussian distributions does not use density functions at all and does not require invertible covariance matrices. This approach is much cleaner and more elegant, but it relies on the theory of characteristic functions and the Cramér-Wold device to get started, and these are beyond the prerequisites for this course. You can often find this development in more advanced probability and statistics books, such as Chapter 8 of Rao's excellent Linear Statistical Inference and Its Applications.

1 Multivariate Gaussian Density

A random vector $x \in \mathbb{R}^d$ has a $d$-dimensional multivariate Gaussian distribution with mean $\mu \in \mathbb{R}^d$ and covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$ if its density is given by
\[
\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right),
\]
where $|\Sigma|$ denotes the determinant of $\Sigma$. Note that this expression requires that the covariance matrix $\Sigma$ be invertible. (We can have a $d$-dimensional Gaussian distribution with a non-invertible $\Sigma$, but such a distribution will not have a density on $\mathbb{R}^d$, and we will not address that case here.) Sometimes we will rewrite the factor in front of the exp as $|2\pi\Sigma|^{-1/2}$, which follows from basic facts about determinants.

Exercise 1. There are at least two claims implicit in this definition. First, that the expression given is, in fact, a density, i.e. it is non-negative and integrates to 1. Second, that the density corresponds to a distribution with mean $\mu$ and covariance $\Sigma$, as claimed.

2 Recognizing a Gaussian Density

If we come across a density function of the form $p(x) \propto e^{-q(x)/2}$, where $q(x)$ is a positive definite quadratic function, then $p(x)$ is the density of a Gaussian distribution. More precisely, we have the following theorem:

Theorem 2. Consider the quadratic function $q(x) = x^T \Lambda x - 2 b^T x + c$, for any symmetric positive definite $\Lambda \in \mathbb{R}^{d \times d}$, any $b \in \mathbb{R}^d$, and any $c \in \mathbb{R}$. If $p(x)$ is a density function with $p(x) \propto e^{-q(x)/2}$, then $p(x)$ is a multivariate Gaussian density with mean $\Lambda^{-1} b$ and covariance $\Lambda^{-1}$. That is,
\[
p(x) = \frac{|\Lambda|^{1/2}}{(2\pi)^{d/2}} \exp\left( -\tfrac{1}{2} (x - \Lambda^{-1} b)^T \Lambda (x - \Lambda^{-1} b) \right).
\]
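As a quick sanity check on the density formula, here is a minimal numerical sketch (not part of the original notes; it assumes NumPy and SciPy are available, and all variable names and parameter values are illustrative choices): it evaluates $\mathcal{N}(x \mid \mu, \Sigma)$ directly from the expression above and compares against scipy.stats.multivariate_normal.

```python
# Sketch: evaluate N(x | mu, Sigma) from the formula in Section 1 and
# compare with scipy.stats.multivariate_normal. Parameters are arbitrary.
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_density(x, mu, Sigma):
    """N(x | mu, Sigma) = |2*pi*Sigma|^{-1/2} exp(-(x-mu)^T Sigma^{-1} (x-mu) / 2)."""
    diff = x - mu
    norm_const = 1.0 / np.sqrt(np.linalg.det(2 * np.pi * Sigma))  # |2*pi*Sigma|^{-1/2}
    quad = diff @ np.linalg.solve(Sigma, diff)                    # (x-mu)^T Sigma^{-1} (x-mu)
    return norm_const * np.exp(-0.5 * quad)

rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
B = rng.normal(size=(d, d))
Sigma = B @ B.T + np.eye(d)        # symmetric positive definite, hence invertible
x = rng.normal(size=d)

print(gaussian_density(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))  # should agree
```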

Note: The inverse of the covariance matrix is called the precision matrix. Precision matrices of multivariate Gaussians have some interesting properties. Parameterizing the Gaussian by $\Lambda$ and $b$ (rather than $\mu$ and $\Sigma$) as in Theorem 2 gives the Gaussian density in information form, or canonical form (cf. Murphy, p. 117).

Proof. Completing the square, we have
\[
q(x) = x^T \Lambda x - 2 b^T x + c = (x - \Lambda^{-1} b)^T \Lambda (x - \Lambda^{-1} b) - b^T \Lambda^{-1} b + c.
\]
Since the last two terms do not depend on $x$, when we exponentiate $-q(x)/2$ they can be absorbed into the constant of proportionality. That is,
\[
e^{-q(x)/2} = \exp\left( -\tfrac{1}{2} (x - \Lambda^{-1} b)^T \Lambda (x - \Lambda^{-1} b) \right) \exp\left( -\tfrac{1}{2} \big( c - b^T \Lambda^{-1} b \big) \right)
\propto \exp\left( -\tfrac{1}{2} (x - \Lambda^{-1} b)^T \Lambda (x - \Lambda^{-1} b) \right).
\]
Now recall that the density function of the multivariate Gaussian $\mathcal{N}(\mu, \Sigma)$ is
\[
\phi(x) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right).
\]
Thus we see that $p(x)$ must be a Gaussian density with covariance $\Sigma = \Lambda^{-1}$ and mean $\Lambda^{-1} b$.
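The following small sketch (again illustrative and not from the notes; it assumes NumPy, and the variable names Lambda and b mirror Theorem 2 but the values are arbitrary) checks this correspondence numerically: converting from moment form $(\mu, \Sigma)$ to information form $(\Lambda, b)$ and back recovers the original parameters.

```python
# Sketch of Theorem 2 / the information form: from (Lambda, b), the Gaussian
# has covariance Lambda^{-1} and mean Lambda^{-1} b. Check the round trip.
import numpy as np

rng = np.random.default_rng(1)
d = 4
mu = rng.normal(size=d)
B = rng.normal(size=(d, d))
Sigma = B @ B.T + np.eye(d)          # symmetric positive definite

# Moment form -> information (canonical) form:
Lambda = np.linalg.inv(Sigma)        # precision matrix
b = Lambda @ mu                      # so that mu = Lambda^{-1} b

# Information form -> moment form, as Theorem 2 asserts:
Sigma_back = np.linalg.inv(Lambda)
mu_back = np.linalg.solve(Lambda, b)

print(np.allclose(Sigma_back, Sigma))  # True
print(np.allclose(mu_back, mu))        # True
```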

3 Conditional Distributions (Bishop Section 2.3.1)

Let $x \in \mathbb{R}^d$ have a Gaussian distribution: $x \sim \mathcal{N}(\mu, \Sigma)$. Let's partition the random variables in $x$ into two pieces, $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$, where $x_1 \in \mathbb{R}^{d_1}$, $x_2 \in \mathbb{R}^{d_2}$, and $d = d_1 + d_2$. Similarly, we partition the mean vector, the covariance matrix, and the precision matrix as
\[
\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}, \qquad
\Lambda = \Sigma^{-1} = \begin{pmatrix} \Lambda_{11} & \Lambda_{12} \\ \Lambda_{21} & \Lambda_{22} \end{pmatrix},
\]
where $\mu_1 \in \mathbb{R}^{d_1}$, $\Sigma_{12} \in \mathbb{R}^{d_1 \times d_2}$, $\Lambda_{12} \in \mathbb{R}^{d_1 \times d_2}$, etc. Note that by the symmetry of the covariance matrix $\Sigma$, we have $\Sigma_{12} = \Sigma_{21}^T$. Note also that $\Sigma_{11}$ and $\Sigma_{22}$ are symmetric positive definite (spd) whenever $\Sigma$ is; this is immediate from the definition of spd.

When $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ has a Gaussian distribution, we say that $x_1$ and $x_2$ are jointly Gaussian. Can we conclude anything about the marginal distributions of $x_1$ and $x_2$? Indeed, the following theorem states that they are individually Gaussian:

Theorem 3. Let $x$, $\mu$, and $\Sigma$ be as defined above. Then the marginal distributions of $x_1$ and $x_2$ are each Gaussian, with
\[
x_1 \sim \mathcal{N}(\mu_1, \Sigma_{11}), \qquad x_2 \sim \mathcal{N}(\mu_2, \Sigma_{22}).
\]

Proof. See Bishop Section 2.3.2, p. 88. This can be done by showing that the marginal density $p(x_1) = \int p(x_1, x_2)\, dx_2$ has the form claimed, and similarly for $x_2$.

So when $x_1$ and $x_2$ are jointly Gaussian, we know that $x_1$ and $x_2$ are also marginally Gaussian. It turns out that the conditional distributions $x_1 \mid x_2$ and $x_2 \mid x_1$ are also Gaussian:

Theorem 4. Let $x$, $\mu$, and $\Sigma$ be as defined above. Assume that $\Sigma_{22}$ is positive definite; in fact, this is implied by our assumption that $\Sigma$ is positive definite, and it guarantees that $\Sigma_{22}$ is invertible. Then the distribution of $x_1$ given $x_2$ is multivariate normal. More specifically, $x_1 \mid x_2 \sim \mathcal{N}(\mu_{1|2}, \Sigma_{1|2})$, where
\[
\mu_{1|2} = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2), \qquad
\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}.
\]

Proof. See Bishop Section 2.3.1, p. 85.
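Here is an illustrative numerical check of Theorem 4 (not from the notes; it assumes NumPy and SciPy, and the dimensions and parameter values are arbitrary): we compute $\mu_{1|2}$ and $\Sigma_{1|2}$ from a partitioned $(\mu, \Sigma)$ and confirm that the resulting Gaussian density matches the ratio $p(x_1, x_2) / p(x_2)$, using Theorem 3 for the marginal $p(x_2)$.

```python
# Sketch: the conditional density N(x1; mu_{1|2}, Sigma_{1|2}) from Theorem 4
# should equal p(x1, x2) / p(x2), with p(x2) given by Theorem 3.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
d1, d2 = 2, 2
d = d1 + d2
mu = rng.normal(size=d)
B = rng.normal(size=(d, d))
Sigma = B @ B.T + np.eye(d)                                # positive definite

mu1, mu2 = mu[:d1], mu[d1:]
S11, S12 = Sigma[:d1, :d1], Sigma[:d1, d1:]
S21, S22 = Sigma[d1:, :d1], Sigma[d1:, d1:]

x1_star = rng.normal(size=d1)                              # test point for x1
x2_star = rng.normal(size=d2)                              # value we condition on

mu_1g2 = mu1 + S12 @ np.linalg.solve(S22, x2_star - mu2)   # mu_{1|2}
Sigma_1g2 = S11 - S12 @ np.linalg.solve(S22, S21)          # Sigma_{1|2}

lhs = multivariate_normal(mean=mu_1g2, cov=Sigma_1g2).pdf(x1_star)
joint = multivariate_normal(mean=mu, cov=Sigma).pdf(np.concatenate([x1_star, x2_star]))
marg2 = multivariate_normal(mean=mu2, cov=S22).pdf(x2_star)
print(lhs, joint / marg2)                                  # should agree
```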

Example. Consider a standard regression framework in which we are building a predictive model for $x_1 \in \mathbb{R}$ given $x_2 \in \mathbb{R}^d$. Recall that if we are using square loss, then the Bayes optimal prediction function is $f(x_2) = \mathbb{E}[x_1 \mid x_2]$. If we assume that $x_1$ and $x_2$ are jointly Gaussian with a positive definite covariance matrix, then Theorem 4 tells us that
\[
\mathbb{E}[x_1 \mid x_2] = \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2).
\]
Of course, in practice we don't know $\mu$ and $\Sigma$. Nevertheless, what's interesting is that the Bayes optimal prediction function is an affine function of $x_2$, i.e. a linear function plus a constant. Thus if we think that our input vector $x_2$ and our response variable $x_1$ are jointly Gaussian, there's no reason to go beyond a hypothesis space of affine functions of $x_2$. In other words, linear regression is all we need.

4 Joint Distribution from Marginal + Conditional

In Section 3, we found that if $x_1$ and $x_2$ are jointly Gaussian, then $x_2$ is marginally Gaussian and the conditional distribution $x_1 \mid x_2$ is also Gaussian, with a mean that is an affine function of $x_2$. The following theorem shows that we can go in the reverse direction as well.

Theorem 5. Suppose $x_1 \sim \mathcal{N}(\mu_1, \Sigma_1)$ and $x_2 \mid x_1 \sim \mathcal{N}(A x_1 + b, \Sigma_{2|1})$, for some $\mu_1 \in \mathbb{R}^{d_1}$, $\Sigma_1 \in \mathbb{R}^{d_1 \times d_1}$, $A \in \mathbb{R}^{d_2 \times d_1}$, $b \in \mathbb{R}^{d_2}$, and $\Sigma_{2|1} \in \mathbb{R}^{d_2 \times d_2}$. Then $x_1$ and $x_2$ are jointly Gaussian, with
\[
x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \sim \mathcal{N}\left( \begin{pmatrix} \mu_1 \\ A \mu_1 + b \end{pmatrix},\; \begin{pmatrix} \Sigma_1 & \Sigma_1 A^T \\ A \Sigma_1 & \Sigma_{2|1} + A \Sigma_1 A^T \end{pmatrix} \right).
\]

We'll prove this in two steps. First, we'll show that the mean and covariance of $x$ take the form claimed above. Then, we'll write down the joint density $p(x_1, x_2) = p(x_1)\, p(x_2 \mid x_1)$ and show that it is proportional to $e^{-q(x)/2}$ for an appropriate quadratic $q(x)$. The result then follows from Theorem 2.
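Before the proof, here is a quick numerical illustration of the theorem (a sketch with arbitrary parameter values, assuming NumPy; not part of the original notes): we assemble the joint $(\mu, \Sigma)$ from $(\mu_1, \Sigma_1, A, b, \Sigma_{2|1})$ as claimed, then apply the conditioning formulas of Theorem 4 (with the roles of $x_1$ and $x_2$ swapped) and confirm that we recover $x_2 \mid x_1 \sim \mathcal{N}(A x_1 + b, \Sigma_{2|1})$.

```python
# Sketch: round-trip the claimed joint parameters through Theorem 4's
# conditioning formulas and recover the conditional we started from.
import numpy as np

rng = np.random.default_rng(3)
d1, d2 = 2, 3
mu1 = rng.normal(size=d1)
B1 = rng.normal(size=(d1, d1)); Sigma1 = B1 @ B1.T + np.eye(d1)
A = rng.normal(size=(d2, d1))
b = rng.normal(size=d2)
B2 = rng.normal(size=(d2, d2)); Sigma_2g1 = B2 @ B2.T + np.eye(d2)

# Joint parameters claimed by the theorem in Section 4.
mu = np.concatenate([mu1, A @ mu1 + b])
Sigma = np.block([[Sigma1,      Sigma1 @ A.T],
                  [A @ Sigma1,  Sigma_2g1 + A @ Sigma1 @ A.T]])

# Condition the joint on x1 (Theorem 4 with x1 and x2 swapped).
S11, S12 = Sigma[:d1, :d1], Sigma[:d1, d1:]
S21, S22 = Sigma[d1:, :d1], Sigma[d1:, d1:]
x1_star = rng.normal(size=d1)
mean_2g1 = mu[d1:] + S21 @ np.linalg.solve(S11, x1_star - mu1)
cov_2g1 = S22 - S21 @ np.linalg.solve(S11, S12)

print(np.allclose(mean_2g1, A @ x1_star + b))   # True
print(np.allclose(cov_2g1, Sigma_2g1))          # True
```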

Proof. We're given that $\mathbb{E}[x_1] = \mu_1$. For the other part of the mean vector, note that
\[
\mathbb{E}[x_2] = \mathbb{E}\big[\mathbb{E}[x_2 \mid x_1]\big] = \mathbb{E}[A x_1 + b] = A \mu_1 + b,
\]
which explains the lower entry in the mean vector. We are given that the marginal covariance of $x_1$ is $\Sigma_1$; that is,
\[
\mathbb{E}\big[(x_1 - \mu_1)(x_1 - \mu_1)^T\big] = \Sigma_1.
\]
We're also given the conditional covariance of $x_2$:
\[
\mathbb{E}\big[(x_2 - A x_1 - b)(x_2 - A x_1 - b)^T \mid x_1\big] = \Sigma_{2|1}.
\]
We'll now express $\operatorname{Cov}(x_2)$ in terms of these quantities. For convenience, we introduce the random variable $m_{2|1} = A x_1 + b$; it is random because it depends on $x_1$. Note that $\mathbb{E}[m_{2|1}] = \mathbb{E}[x_2] = A \mu_1 + b$. So
\[
\begin{aligned}
\operatorname{Cov}(x_2)
&= \mathbb{E}\big[(x_2 - \mathbb{E} x_2)(x_2 - \mathbb{E} x_2)^T\big] && \text{by definition}\\
&= \mathbb{E}\,\mathbb{E}\big[(x_2 - \mathbb{E} x_2)(x_2 - \mathbb{E} x_2)^T \mid x_1\big] && \text{law of iterated expectations}\\
&= \mathbb{E}\,\mathbb{E}\big[\big((x_2 - m_{2|1}) + (m_{2|1} - \mathbb{E} x_2)\big)\big((x_2 - m_{2|1}) + (m_{2|1} - \mathbb{E} x_2)\big)^T \mid x_1\big]\\
&= U + V + V^T + W,
\end{aligned}
\]
where we've multiplied out the parenthesized terms. The terms are as follows:
\[
U = \mathbb{E}\,\mathbb{E}\big[(x_2 - m_{2|1})(x_2 - m_{2|1})^T \mid x_1\big] = \Sigma_{2|1}.
\]
The cross-term turns out to be zero (and hence so does its transpose):
\[
\begin{aligned}
V &= \mathbb{E}\,\mathbb{E}\big[(x_2 - m_{2|1})(m_{2|1} - \mathbb{E} x_2)^T \mid x_1\big]\\
&= \mathbb{E}\,\mathbb{E}\big[(x_2 - A x_1 - b)(A x_1 + b - A \mu_1 - b)^T \mid x_1\big]\\
&= \mathbb{E}\Big[\underbrace{\mathbb{E}\big[x_2 - (A x_1 + b) \mid x_1\big]}_{=0}\,(A x_1 + b - A \mu_1 - b)^T\Big]\\
&= 0,
\end{aligned}
\]
where in the second-to-last step we used the fact that $\mathbb{E}[f(x)\, g(x,y) \mid x] = f(x)\, \mathbb{E}[g(x,y) \mid x]$. This same identity is used a couple more times below.

Finally, the last term is
\[
\begin{aligned}
W &= \mathbb{E}\,\mathbb{E}\big[(m_{2|1} - \mathbb{E} m_{2|1})(m_{2|1} - \mathbb{E} m_{2|1})^T \mid x_1\big]\\
&= \mathbb{E}\,\mathbb{E}\big[(A x_1 - A \mu_1)(A x_1 - A \mu_1)^T \mid x_1\big]\\
&= \mathbb{E}\big[(A x_1 - A \mu_1)(A x_1 - A \mu_1)^T\big]\\
&= A\, \mathbb{E}\big[(x_1 - \mu_1)(x_1 - \mu_1)^T\big]\, A^T\\
&= A \Sigma_1 A^T.
\end{aligned}
\]
So $\operatorname{Cov}(x_2) = \Sigma_{2|1} + A \Sigma_1 A^T$.

The top-right cross-covariance submatrix can be computed as follows:
\[
\begin{aligned}
\mathbb{E}\big[(x_1 - \mu_1)(x_2 - A \mu_1 - b)^T\big]
&= \mathbb{E}\,\mathbb{E}\big[(x_1 - \mu_1)(x_2 - A \mu_1 - b)^T \mid x_1\big]\\
&= \mathbb{E}\big[(x_1 - \mu_1)\, \mathbb{E}\big[(x_2 - A \mu_1 - b)^T \mid x_1\big]\big]\\
&= \mathbb{E}\big[(x_1 - \mu_1)(A x_1 + b - A \mu_1 - b)^T\big]\\
&= \mathbb{E}\big[(x_1 - \mu_1)(x_1 - \mu_1)^T A^T\big]\\
&= \Sigma_1 A^T.
\end{aligned}
\]
Finally, the bottom-left cross-covariance submatrix is just the transpose of the top-right one. So far we have shown that $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ has the mean and covariance specified in the theorem statement.
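The moment calculations above can also be checked by simulation. The following rough Monte Carlo sketch (illustrative only, assuming NumPy; the parameter values are arbitrary) samples from the generative process $x_1 \sim \mathcal{N}(\mu_1, \Sigma_1)$, $x_2 \mid x_1 \sim \mathcal{N}(A x_1 + b, \Sigma_{2|1})$ and compares the empirical moments with $A \mu_1 + b$, $\Sigma_{2|1} + A \Sigma_1 A^T$, and $\Sigma_1 A^T$.

```python
# Sketch: empirical mean, covariance, and cross-covariance of x2 should
# approach the values derived in the proof (up to Monte Carlo error).
import numpy as np

rng = np.random.default_rng(4)
d1, d2, n = 2, 2, 500_000
mu1 = np.array([1.0, -0.5])
Sigma1 = np.array([[1.0, 0.3], [0.3, 0.8]])
A = np.array([[0.5, -1.0], [2.0, 0.1]])
b = np.array([0.2, 1.5])
Sigma_2g1 = np.array([[0.7, 0.1], [0.1, 0.4]])

x1 = rng.multivariate_normal(mu1, Sigma1, size=n)
noise = rng.multivariate_normal(np.zeros(d2), Sigma_2g1, size=n)
x2 = x1 @ A.T + b + noise                      # x2 | x1 ~ N(A x1 + b, Sigma_2g1)

print(x2.mean(axis=0), A @ mu1 + b)            # empirical mean vs. A mu1 + b
C = np.cov(np.hstack([x1, x2]).T)              # full (d1+d2) x (d1+d2) covariance
print(np.round(C[d1:, d1:], 2))                # approx. Sigma_2g1 + A Sigma1 A^T
print(np.round(Sigma_2g1 + A @ Sigma1 @ A.T, 2))
print(np.round(C[:d1, d1:], 2))                # approx. Sigma1 A^T
print(np.round(Sigma1 @ A.T, 2))
```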

We now show that the joint density is indeed Gaussian:
\[
\begin{aligned}
p(x_1, x_2) &= p(x_1)\, p(x_2 \mid x_1)\\
&= \mathcal{N}(x_1 \mid \mu_1, \Sigma_1)\, \mathcal{N}(x_2 \mid A x_1 + b, \Sigma_{2|1})\\
&\propto \exp\left( -\tfrac{1}{2} (x_1 - \mu_1)^T \Sigma_1^{-1} (x_1 - \mu_1) \right) \exp\left( -\tfrac{1}{2} (x_2 - A x_1 - b)^T \Sigma_{2|1}^{-1} (x_2 - A x_1 - b) \right)\\
&= e^{-q(x)/2},
\end{aligned}
\]
where
\[
q(x) = (x_1 - \mu_1)^T \Sigma_1^{-1} (x_1 - \mu_1) + (x_2 - A x_1 - b)^T \Sigma_{2|1}^{-1} (x_2 - A x_1 - b).
\]
To apply Theorem 2, we need to make sure we can write the quadratic terms of $q(x)$ as $x^T M x$, where $M$ is symmetric positive definite. We'll separate out the quadratic terms of $q(x)$ and write "l.o.t." for the lower order terms, which include linear terms of the form $b^T x$ and constants:
\[
\begin{aligned}
q(x) &= x_1^T \Sigma_1^{-1} x_1 + (x_2 - A x_1)^T \Sigma_{2|1}^{-1} (x_2 - A x_1) + \text{l.o.t.}\\
&= x_1^T \left( \Sigma_1^{-1} + A^T \Sigma_{2|1}^{-1} A \right) x_1 - 2 x_1^T A^T \Sigma_{2|1}^{-1} x_2 + x_2^T \Sigma_{2|1}^{-1} x_2 + \text{l.o.t.}\\
&= \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}^T
\begin{pmatrix} \Sigma_1^{-1} + A^T \Sigma_{2|1}^{-1} A & -A^T \Sigma_{2|1}^{-1} \\ -\Sigma_{2|1}^{-1} A & \Sigma_{2|1}^{-1} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} + \text{l.o.t.}
\end{aligned}
\]
Let $M$ be the matrix in the middle. We only need to show that $M$ is positive definite. By definition, the symmetric matrix $M$ is positive definite iff $x^T M x > 0$ for all $x \neq 0$. Referring back to our first expression for $q(x)$ in the equation block above, note that
\[
x^T M x = \underbrace{x_1^T \Sigma_1^{-1} x_1}_{\alpha_1} + \underbrace{(x_2 - A x_1)^T \Sigma_{2|1}^{-1} (x_2 - A x_1)}_{\alpha_2},
\]
where we refer to the first term on the right-hand side as $\alpha_1$ and the second term as $\alpha_2$.

Suppose $x_1 \neq 0$. Then since $\Sigma_1^{-1}$ is positive definite, $\alpha_1 > 0$, and since $\Sigma_{2|1}^{-1}$ is positive definite, $\alpha_2 \geq 0$ (it could still be $0$, if $x_2 = A x_1$). Thus $x^T M x > 0$. Now suppose $x_1 = 0$ and $x_2 \neq 0$. Then $\alpha_1 = 0$ and $\alpha_2 = x_2^T \Sigma_{2|1}^{-1} x_2 > 0$, by positive definiteness. So again $x^T M x > 0$. This demonstrates that $M$ is positive definite. Thus $p(x) \propto e^{-q(x)/2}$, where $q(x)$ has the form required by Theorem 2, and we conclude that $x = \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$ is jointly Gaussian. We have also shown that the marginal means and covariances, as well as the cross-covariances, all have the forms claimed.
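As a final illustration (a sketch with arbitrary parameter values, assuming NumPy; not part of the original notes), we can verify this last step numerically: assemble the block matrix $M$ from the proof, confirm that it is positive definite, and check that, as Theorem 2 predicts, $M^{-1}$ equals the joint covariance matrix claimed in the theorem.

```python
# Sketch: M should be positive definite, and M^{-1} should equal
# [[Sigma1, Sigma1 A^T], [A Sigma1, Sigma_2g1 + A Sigma1 A^T]].
import numpy as np

rng = np.random.default_rng(5)
d1, d2 = 2, 3
B1 = rng.normal(size=(d1, d1)); Sigma1 = B1 @ B1.T + np.eye(d1)
B2 = rng.normal(size=(d2, d2)); Sigma_2g1 = B2 @ B2.T + np.eye(d2)
A = rng.normal(size=(d2, d1))

L1 = np.linalg.inv(Sigma1)          # Sigma1^{-1}
L2 = np.linalg.inv(Sigma_2g1)       # Sigma_{2|1}^{-1}
M = np.block([[L1 + A.T @ L2 @ A, -A.T @ L2],
              [-L2 @ A,            L2]])

# Positive definite: all eigenvalues of the symmetric matrix M are > 0.
print(np.all(np.linalg.eigvalsh(M) > 0))                      # True

Sigma_joint = np.block([[Sigma1,      Sigma1 @ A.T],
                        [A @ Sigma1,  Sigma_2g1 + A @ Sigma1 @ A.T]])
print(np.allclose(np.linalg.inv(M), Sigma_joint))             # True
```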