CS281A/Stat241A Lecture 17


CS281A/Stat241A Lecture 17 p. 1/4 CS281A/Stat241A Lecture 17 Factor Analysis and State Space Models Peter Bartlett

CS281A/Stat241A Lecture 17 p. 2/4 Key ideas of this lecture Factor Analysis. Recall: Gaussian factors plus observations. Parameter estimation with EM. The vec operator. Motivation: Natural parameters of the Gaussian. Linear functions of matrices as inner products. Kronecker product. State Space Models. Linear dynamical systems, Gaussian disturbances. All distributions are Gaussian: parameters suffice.

CS281A/Stat241A Lecture 17 p. 3/4 Factor Analysis: Definition Factors X ∈ R^p, observations Y ∈ R^d. Local conditionals: p(x) = N(x | 0, I), p(y | x) = N(y | µ + Λx, Ψ).

CS281A/Stat241A Lecture 17 p. 4/4 Factor Analysis [Figure: the columns λ_1, λ_2 of Λ span a subspace of R^d; Λx lies in this subspace, and the observation mean µ + Λx is this point offset by µ.]

CS281A/Stat241A Lecture 17 p. 5/4 Factor Analysis: Definition Local conditionals: p(x) = N(x | 0, I), p(y | x) = N(y | µ + Λx, Ψ). The mean of y is µ ∈ R^d. The matrix of factors is Λ ∈ R^{d×p}. The noise covariance Ψ ∈ R^{d×d} is diagonal. Thus, there are d + dp + d parameters, of order dp, far fewer than the order-d² parameters of a full covariance.

CS281A/Stat241A Lecture 17 p. 6/4 Factor Analysis: Marginals, Conditionals Theorem:
1. Y ∼ N(µ, ΛΛᵀ + Ψ).
2. (X, Y) ∼ N( (0, µ), Σ ), with Σ = [ I, Λᵀ ; Λ, ΛΛᵀ + Ψ ].
3. p(x | y) is Gaussian, with
mean = Λᵀ (ΛΛᵀ + Ψ)⁻¹ (y − µ),
covariance = I − Λᵀ (ΛΛᵀ + Ψ)⁻¹ Λ.
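As a quick illustration of item 3, here is a minimal NumPy sketch computing the posterior mean and covariance of the factors given an observation. The function name fa_posterior and the variable names are ours, not the lecture's.

```python
import numpy as np

def fa_posterior(y, mu, Lam, Psi):
    """Posterior moments of the factors x given an observation y.

    Implements the formulas above:
      E[x | y]   = Lam^T (Lam Lam^T + Psi)^{-1} (y - mu)
      Var[x | y] = I - Lam^T (Lam Lam^T + Psi)^{-1} Lam
    Shapes: y, mu in R^d, Lam is d x p, Psi is d x d (diagonal).
    """
    Sigma_y = Lam @ Lam.T + Psi            # marginal covariance of y
    K = Lam.T @ np.linalg.inv(Sigma_y)     # p x d matrix Lam^T Sigma_y^{-1}
    mean = K @ (y - mu)
    cov = np.eye(Lam.shape[1]) - K @ Lam
    return mean, cov
```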

CS281A/Stat241A Lecture 17 p. 7/4 Factor Analysis: Parameter Estimation iid data y = (y_1, …, y_n). The log likelihood is
l(θ; y) = log p(y | θ) = const − (n/2) log|ΛΛᵀ + Ψ| − (1/2) Σ_{i=1}^n (y_i − µ)ᵀ (ΛΛᵀ + Ψ)⁻¹ (y_i − µ).

CS281A/Stat241A Lecture 17 p. 8/4 Factor Analysis: Parameter Estimation Let's first consider estimation of µ: µ̂_ML = (1/n) Σ_{i=1}^n y_i, as for the full covariance case. From now on, let's assume µ = 0, so we can ignore it:
l(θ; y) = const − (n/2) log|ΛΛᵀ + Ψ| − (1/2) Σ_{i=1}^n y_iᵀ (ΛΛᵀ + Ψ)⁻¹ y_i.
But how do we find a factorized covariance matrix, Σ = ΛΛᵀ + Ψ?

CS281A/Stat241A Lecture 17 p. 9/4 Factor Analysis: EM We follow the usual EM recipe: 1. Write out the complete log likelihood, l_c. 2. E step: Calculate E[l_c | y]. Typically, find E[suff. stats | y]. 3. M step: Maximize E[l_c | y].

CS281A/Stat241A Lecture 17 p. 10/4 Factor Analysis: EM 1. Write out the complete log likelihood, l_c:
l_c(θ) = log( p(x | θ) p(y | x, θ) )
= const − (n/2) log|Ψ| − (1/2) Σ_{i=1}^n x_iᵀ x_i − (1/2) Σ_{i=1}^n (y_i − Λx_i)ᵀ Ψ⁻¹ (y_i − Λx_i).

CS281A/Stat241A Lecture 17 p. 11/4 Factor Analysis: EM 2. E step: Calculate E[l_c | y]. Typically, find E[suff. stats | y]. Claim: the sufficient statistics are x_i and x_i x_iᵀ. Indeed, E[l_c | y] is a constant, minus (n/2) log|Ψ|, plus
−(1/2) E[ Σ_{i=1}^n ( x_iᵀ x_i + (y_i − Λx_i)ᵀ Ψ⁻¹ (y_i − Λx_i) ) | y ]
= −(1/2) Σ_{i=1}^n ( E(x_iᵀ x_i | y_i) + E[ tr( (y_i − Λx_i)ᵀ Ψ⁻¹ (y_i − Λx_i) ) | y_i ] )
= −(1/2) Σ_{i=1}^n E(x_iᵀ x_i | y_i) − (n/2) E( tr(SΨ⁻¹) | y ),

CS281A/Stat241A Lecture 17 p. 12/4 Factor Analysis: EM. E-step
E[l_c | y] = const − (n/2) log|Ψ| − (1/2) Σ_{i=1}^n E(x_iᵀ x_i | y_i) − (n/2) E( tr(SΨ⁻¹) | y ),
with
S = (1/n) Σ_{i=1}^n (y_i − Λx_i)(y_i − Λx_i)ᵀ.
We used the fact that the trace (the sum of the diagonal elements) of a matrix satisfies tr(ABC) = tr(CAB), as long as the products ABC and CAB are square.

CS281A/Stat241A Lecture 17 p. 13/4 Factor Analysis: EM. E-step Now, we can calculate
E(S | y) = (1/n) Σ_{i=1}^n E[ y_i y_iᵀ − 2Λ x_i y_iᵀ + Λ x_i x_iᵀ Λᵀ | y_i ]
= (1/n) Σ_{i=1}^n ( y_i y_iᵀ − 2Λ E[x_i | y_i] y_iᵀ + Λ E[x_i x_iᵀ | y_i] Λᵀ ),
and from this it is clear that the expected sufficient statistics are E(x_i | y_i) and E(x_i x_iᵀ | y_i) (and its trace, E(x_iᵀ x_i | y_i)).

CS281A/Stat241A Lecture 17 p. 14/4 Factor Analysis: EM. E-step We calculated these conditional expectations earlier:
E[x_i | y_i] = Λᵀ (ΛΛᵀ + Ψ)⁻¹ (y_i − µ)
Var[x_i | y_i] = I − Λᵀ (ΛΛᵀ + Ψ)⁻¹ Λ
E[x_i x_iᵀ | y_i] = Var[x_i | y_i] + E[x_i | y_i] E[x_i | y_i]ᵀ.
So we can plug them in to calculate the expected complete log likelihood.

CS281A/Stat241A Lecture 17 p. 15/4 Factor Analysis: EM. M-step 3. M step: Maximize E[l_c | y]. For Λ, this is equivalent to minimizing
n tr( E(S | y) Ψ⁻¹ ) = E[ tr( (Yᵀ − ΛXᵀ)(Yᵀ − ΛXᵀ)ᵀ Ψ⁻¹ ) | y ],
where Y ∈ R^{n×d} has rows y_iᵀ and X ∈ R^{n×p} has rows x_iᵀ. This is a matrix version of linear regression, with each of the d components of the squared error weighted by one of the diagonal entries of Ψ⁻¹. Since these weights do not affect the minimizer, the Ψ matrix plays no role, and the solution satisfies the normal equations.

CS281A/Stat241A Lecture 17 p. 16/4 Factor Analysis: EM. M-step Normal Equations:
Λ̂ᵀ = ( Σ_{i=1}^n E[x_i x_iᵀ | y_i] )⁻¹ Σ_{i=1}^n E[x_i | y_i] y_iᵀ.
(These are the same sufficient statistics as in linear regression; here we need to compute the expectation of the sufficient statistics given the observations.)

CS281A/Stat241A Lecture 17 p. 17/4 Factor Analysis: EM. M-step For Ψ, we need to find a diagonal Ψ to minimize
log|Ψ| + tr( E(S | y) Ψ⁻¹ ) = Σ_{j=1}^d ( log ψ_j + s_j / ψ_j ).
It's easy to check that this is minimized at ψ_j = s_j, the diagonal entries of E[S | y] (set the derivative 1/ψ_j − s_j/ψ_j² to zero).

CS281A/Stat241A Lecture 17 p. 18/4 Factor Analysis: EM. Summary. E step: Calculate the expected sufficient statistics:
E[x_i | y_i] = Λᵀ (ΛΛᵀ + Ψ)⁻¹ (y_i − µ)
Var[x_i | y_i] = I − Λᵀ (ΛΛᵀ + Ψ)⁻¹ Λ
E[x_i x_iᵀ | y_i] = Var[x_i | y_i] + E[x_i | y_i] E[x_i | y_i]ᵀ,
and use these to compute the (diagonal entries of the) matrix
E(S | y) = (1/n) Σ_{i=1}^n ( y_i y_iᵀ − 2Λ E[x_i | y_i] y_iᵀ + Λ E[x_i x_iᵀ | y_i] Λᵀ ).

CS281A/Stat241A Lecture 17 p. 19/4 Factor Analysis: EM. Summary. M step: Maximize E[l_c | y]:
Λ̂ᵀ = ( Σ_{i=1}^n E[x_i x_iᵀ | y_i] )⁻¹ Σ_{i=1}^n E[x_i | y_i] y_iᵀ,
Ψ̂ = diag( E(S | y) ),
and recall: µ̂ = (1/n) Σ_{i=1}^n y_i.
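To make the E and M steps concrete, here is a short NumPy sketch of one EM iteration for factor analysis under the µ = 0 assumption (center the data first). The function and variable names (fa_em_step, Y, Lam, Psi) are illustrative, not the lecture's; the Ψ update uses the E(S | y) formula summarized above, of which only the diagonal is needed.

```python
import numpy as np

def fa_em_step(Y, Lam, Psi):
    """One EM iteration for factor analysis, assuming mu = 0 (Y pre-centered).

    Y is n x d with rows y_i; Lam is d x p; Psi is d x d diagonal.
    A sketch of the E/M updates summarized on these slides.
    """
    n, d = Y.shape
    p = Lam.shape[1]

    # E step: expected sufficient statistics.
    K = Lam.T @ np.linalg.inv(Lam @ Lam.T + Psi)   # Lam^T (Lam Lam^T + Psi)^{-1}
    EX = Y @ K.T                                   # row i is E[x_i | y_i]^T
    V = np.eye(p) - K @ Lam                        # Var[x_i | y_i], same for all i
    ExxT = n * V + EX.T @ EX                       # sum_i E[x_i x_i^T | y_i]

    # M step: normal equations for Lam, then diagonal of E[S | y] for Psi.
    Lam_new = np.linalg.solve(ExxT, EX.T @ Y).T    # d x p
    ES = (Y.T @ Y - 2 * Lam_new @ (EX.T @ Y)
          + Lam_new @ ExxT @ Lam_new.T) / n        # E[S | y]; only its diagonal is used
    Psi_new = np.diag(np.diag(ES))
    return Lam_new, Psi_new
```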

CS281A/Stat241A Lecture 17 p. 20/4 Key ideas of this lecture Factor Analysis. Recall: Gaussian factors plus observations. Parameter estimation with EM. The vec operator. Motivation: Natural parameters of the Gaussian. Linear functions of matrices as inner products. Kronecker product. State Space Models. Linear dynamical systems, Gaussian disturbances. All distributions are Gaussian: parameters suffice. Inference: Kalman filter and smoother. Parameter estimation with EM.

CS281A/Stat241A Lecture 17 p. 21/4 The vec operator: Motivation Consider the multivariate Gaussian:
p(x) = (2π)^{−d/2} |Σ|^{−1/2} exp( −(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ) ).
What is the natural parameterization? p(x) = h(x) exp( θᵀ T(x) − A(θ) ).

CS281A/Stat241A Lecture 17 p. 22/4 The vec operator: Motivation If we define
Λ = Σ⁻¹, η = Σ⁻¹ µ,
then we can write
(x − µ)ᵀ Σ⁻¹ (x − µ) = µᵀ Σ⁻¹ µ − 2µᵀ Σ⁻¹ x + xᵀ Σ⁻¹ x
= ηᵀ Λ⁻¹ η − 2ηᵀ x + xᵀ Λ x.
Why is this of the form θᵀ T(x) − A(θ)?

CS281A/Stat241A Lecture 17 p. 23/4 The vec operator: Example
(x_1, x_2) [ λ_11, λ_12 ; λ_21, λ_22 ] (x_1 ; x_2)
= λ_11 x_1² + λ_21 x_1 x_2 + λ_12 x_1 x_2 + λ_22 x_2²
= (λ_11, λ_21, λ_12, λ_22) (x_1² ; x_1 x_2 ; x_2 x_1 ; x_2²)
= vec(Λ)ᵀ vec(x xᵀ).

CS281A/Stat241A Lecture 17 p. 24/4 The vec operator Definition [vec]: For a matrix A,
vec(A) = (a_1 ; a_2 ; ⋯ ; a_n),
the columns stacked into a single vector, where the a_i are the column vectors of A: A = (a_1 a_2 ⋯ a_n).

CS281A/Stat241A Lecture 17 p. 25/4 The vec operator Theorem: tr(Aᵀ B) = vec(A)ᵀ vec(B). (The trace of the product is the sum of the corresponding row-column inner products.) In the example above,
xᵀ Λ x = tr( xᵀ Λ x ) = tr( Λ x xᵀ ) = vec(Λᵀ)ᵀ vec(x xᵀ),
where vec(Λᵀ) is the natural parameter and vec(x xᵀ) is the sufficient statistic.
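A quick NumPy sanity check of this identity (a sketch; note that vec stacks columns, so the flatten must be column-major):

```python
import numpy as np

# Check tr(A^T B) = vec(A)^T vec(B); vec stacks columns, hence order="F".
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))
vec = lambda M: M.flatten(order="F")
assert np.isclose(np.trace(A.T @ B), vec(A) @ vec(B))
```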

CS281A/Stat241A Lecture 17 p. 26/4 The Kronecker product We also use the vec operator for matrix equations like XΘY = Z, or, for instance, for least squares minimization of ‖XΘY − Z‖. Then we can write
vec(Z) = vec(XΘY) = (Yᵀ ⊗ X) vec(Θ),
where Yᵀ ⊗ X is the Kronecker product of Yᵀ and X:

CS281A/Stat241A Lecture 17 p. 27/4 The Kronecker product Definition [Kronecker product]: The Kronecker product of A and B is
A ⊗ B = [ a_11 B, …, a_1n B ; ⋮ ; a_m1 B, …, a_mn B ].

CS281A/Stat241A Lecture 17 p. 28/4 The Kronecker product Theorem [Kronecker product and vec operator]:
vec(ABC) = (Cᵀ ⊗ A) vec(B).
The Kronecker product and vec operator are used throughout matrix algebra. They are convenient for differentiating a function of a matrix, and they arise in statistical models involving products of features, in systems theory, and in stability theory.
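And a matching NumPy check of the vec/Kronecker identity (again a sketch, using column-major flattening for vec):

```python
import numpy as np

# Check vec(ABC) = (C^T kron A) vec(B) for random rectangular matrices.
rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))
vec = lambda M: M.flatten(order="F")
assert np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B))
```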

CS281A/Stat241A Lecture 17 p. 29/4 Key ideas of this lecture Factor Analysis. Recall: Gaussian factors plus observations. Parameter estimation with EM. The vec operator. Motivation: Natural parameters of the Gaussian. Linear functions of matrices as inner products. Kronecker product. State Space Models. Linear dynamical systems, Gaussian disturbances. All distributions are Gaussian: parameters suffice. Inference: Kalman filter and smoother. Parameter estimation with EM.

CS281A/Stat241A Lecture 17 p. 30/4 State Space Models [Table relating the models: the mixture model (discrete latent variable) and the factor model (Gaussian latent variable) are the i.i.d. cases; the HMM (discrete latent state) and the State Space Model (Gaussian latent state) are their dynamic counterparts.]

CS281A/Stat241A Lecture 17 p. 31/4 State Space Models In linear dynamical systems, the evolution of the state x_t, and the relationship between the state x_t and the observation y_t, are linear with Gaussian noise.

CS281A/Stat241A Lecture 17 p. 32/4 State Space Models State space models revolutionized control theory in the late 50s and early 60s. Prior to these models, classical control theory could cope only with decoupled, low-order systems. State space models allowed the effective control of complex systems like spacecraft and fast aircraft. The directed graph is identical to that of an HMM, so the conditional independencies are identical: given the current state (not observation), the past and the future are independent.

CS281A/Stat241A Lecture 17 p. 33/4 Linear System: Definition [Graphical model: a Markov chain of states x_{t−1} → x_t → x_{t+1} ∈ R^p, each emitting an observation y_{t−1}, y_t, y_{t+1} ∈ R^d.]

CS281A/Stat241A Lecture 17 p. 34/4 Linear System: Definition
State: x_t ∈ R^p. Observation: y_t ∈ R^d.
Initial state: x_0 ∼ N(0, P_0).
Dynamics: x_{t+1} = A x_t + G w_t, w_t ∼ N(0, Q).
Observation: y_t = C x_t + v_t, v_t ∼ N(0, R).

CS281A/Stat241A Lecture 17 p. 35/4 Linear System: Observations 1. All the distributions are Gaussian (joints, marginals, conditionals), so they can be described by their means and covariances. 2. The conditional distribution of the next state, x_{t+1} | x_t, is N(A x_t, G Q Gᵀ). To see this:
E(x_{t+1} | x_t) = A x_t + G E(w_t | x_t) = A x_t,
Var(x_{t+1} | x_t) = E( (G w_t)(G w_t)ᵀ ) = G Q Gᵀ.

CS281A/Stat241A Lecture 17 p. 36/4 Linear System: Observations 3. The marginal distribution of x_t is N(0, P_t), where P_0 is given and, for t ≥ 0,
P_{t+1} = A P_t Aᵀ + G Q Gᵀ.
To see this:
E x_{t+1} = E[ E(x_{t+1} | x_t) ] = A E(x_t) = 0,
P_{t+1} = E( x_{t+1} x_{t+1}ᵀ ) = E( (A x_t + G w_t)(A x_t + G w_t)ᵀ ) = A P_t Aᵀ + G Q Gᵀ.
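The model and this covariance recursion are easy to exercise numerically. Below is a minimal NumPy sketch that simulates the system and propagates the marginal covariances P_t; the function name simulate_lds and its arguments are ours, chosen to mirror the notation above.

```python
import numpy as np

def simulate_lds(A, G, Q, C, R, P0, T, rng=None):
    """Simulate the linear-Gaussian state space model defined above.

    x_0 ~ N(0, P0); x_{t+1} = A x_t + G w_t, w_t ~ N(0, Q);
    y_t = C x_t + v_t, v_t ~ N(0, R).
    Returns states, observations, and the marginal covariances
    P_{t+1} = A P_t A^T + G Q G^T.
    """
    rng = np.random.default_rng() if rng is None else rng
    p, d = A.shape[0], C.shape[0]
    xs, ys, Ps = [], [], [P0]
    x = rng.multivariate_normal(np.zeros(p), P0)
    for t in range(T):
        xs.append(x)
        ys.append(C @ x + rng.multivariate_normal(np.zeros(d), R))
        x = A @ x + G @ rng.multivariate_normal(np.zeros(Q.shape[0]), Q)
        Ps.append(A @ Ps[-1] @ A.T + G @ Q @ G.T)   # marginal covariance recursion
    return np.array(xs), np.array(ys), Ps
```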

CS281A/Stat241A Lecture 17 p. 37/4 Inference in SSMs Filtering: p(x_t | y_0, …, y_t). Smoothing: p(x_t | y_0, …, y_T). For inference, it suffices to calculate the appropriate conditional means and covariances.

CS281A/Stat241A Lecture 17 p. 38/4 Inference in SSMs: Notation
x̂_{t|s} = E( x_t | y_0, …, y_s ),
P_{t|s} = E( (x_t − x̂_{t|s})(x_t − x̂_{t|s})ᵀ | y_0, …, y_s ).
Filtering: x_t | y_0, …, y_t ∼ N( x̂_{t|t}, P_{t|t} ). Smoothing: x_t | y_0, …, y_T ∼ N( x̂_{t|T}, P_{t|T} ).
The Kalman Filter is an inference algorithm for x̂_{t|t}, P_{t|t}. The Kalman Smoother is an inference algorithm for x̂_{t|T}, P_{t|T}.

CS281A/Stat241A Lecture 17 p. 39/4 The Kalman Filter [Figure: three copies of the state space chain illustrating the recursion.]
Start with the filtering distribution p(x_t | y_0, …, y_t), i.e. x̂_{t|t}, P_{t|t}.
Time update: p(x_{t+1} | y_0, …, y_t), i.e. x̂_{t+1|t}, P_{t+1|t}.
Measurement update: p(x_{t+1} | y_0, …, y_{t+1}), i.e. x̂_{t+1|t+1}, P_{t+1|t+1}.
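The slide only shows the structure of the recursion; the explicit formulas are not derived here. For reference, here is a sketch of one standard predict/update step for the model above (the function name kalman_step and the use of np.linalg.inv are our choices for clarity, not the lecture's code):

```python
import numpy as np

def kalman_step(x_filt, P_filt, y_next, A, G, Q, C, R):
    """One Kalman filter recursion for the linear system above.

    Input:  x_filt = x_hat_{t|t}, P_filt = P_{t|t}.
    Output: x_hat_{t+1|t+1}, P_{t+1|t+1}.
    Standard textbook form; a sketch, not the lecture's derivation.
    """
    # Time update: push the filtered estimate through the dynamics.
    x_pred = A @ x_filt                      # x_hat_{t+1|t}
    P_pred = A @ P_filt @ A.T + G @ Q @ G.T  # P_{t+1|t}

    # Measurement update: condition on the new observation y_{t+1}.
    S = C @ P_pred @ C.T + R                 # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (y_next - C @ x_pred)
    P_new = P_pred - K @ C @ P_pred
    return x_new, P_new
```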

CS281A/Stat241A Lecture 17 p. 40/4 Key ideas of this lecture Factor Analysis. Recall: Gaussian factors plus observations. Parameter estimation with EM. The vec operator. Motivation: Natural parameters of the Gaussian. Linear functions of matrices as inner products. Kronecker product. State Space Models. Linear dynamical systems, Gaussian disturbances. All distributions are Gaussian: parameters suffice. Inference: Kalman filter and smoother. Parameter estimation with EM.