CS281A/Stat241A: Statistical Learning Theory
Factor Analysis and Kalman Filtering (11/2/04)
Lecturer: Michael I. Jordan
Scribes: Byung-Gon Chun and Sunghoon Kim

1 Factor Analysis

Factor analysis is used for dimensionality reduction: it finds a low-dimensional subspace near which the data lie. Figure 1 shows the geometry of the factor analysis model. In the figure, $\mu$ is the mean (the centroid of the manifold), $\Lambda$ defines the coordinate system of the subspace, and $\Psi$ is the noise variance (represented by the sphere in the figure). The latent Gaussian variable is $X_n \sim \mathcal{N}(0, I)$ and the observed variable is $Y_n \sim \mathcal{N}(\mu + \Lambda x_n, \Psi)$, where $\Psi$ is a diagonal covariance matrix. When $\lambda_1$ and $\lambda_2$ are the basis vectors, $\Lambda = (\lambda_1\ \lambda_2)$. The expectation of $Y_n$ is $\hat{\mu} + \hat{\Lambda}\hat{x}_n$.

[Figure 1: The geometry of the factor analysis model. A two-dimensional subspace spanned by $\lambda_1$ and $\lambda_2$, offset by $\mu$, sits in the observation space $(y_1, y_2, y_3)$; the sphere represents the noise variance $\Psi$.]

Why do we need dimensionality reduction? One application of factor analysis is compression. In the k-means algorithm we can represent data by centroid coordinates and residuals; likewise, with factor analysis we can compress data using coordinates in the subspace together with residuals. Factor analysis can also be used for prediction (regression): in high dimensions there can be a large amount of noise, and with dimensionality reduction we can perform prediction with a small number of states, in effect reducing the impact of the noise.
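As a concrete illustration of this generative model (not part of the original notes), here is a minimal NumPy sketch of sampling $X_n$ and $Y_n$; the dimensions and the particular $\Lambda$ and $\Psi$ are arbitrary choices for the example:

    import numpy as np

    rng = np.random.default_rng(0)

    p, k, N = 3, 2, 5000                   # observed dim, latent dim, sample size (illustrative)
    mu = np.zeros(p)                       # mean of the observations
    Lam = rng.normal(size=(p, k))          # loading matrix Lambda; its columns span the subspace
    Psi = np.diag([0.1, 0.2, 0.3])         # diagonal noise covariance

    X = rng.normal(size=(N, k))                            # X_n ~ N(0, I)
    noise = rng.multivariate_normal(np.zeros(p), Psi, N)   # independent diagonal Gaussian noise
    Y = mu + X @ Lam.T + noise                             # Y_n ~ N(mu + Lambda x_n, Psi)

    # Sanity check: the sample covariance of Y approaches Lambda Lambda^T + Psi as N grows
    # (this is the unconditional covariance of Y derived in Section 3).
    print(np.cov(Y, rowvar=False))
    print(Lam @ Lam.T + Psi)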

2 Principal Component Analysis

Principal component analysis (PCA) finds the direction of maximal variance; it is a simpler model than factor analysis. We have data $D = \{y_1, \ldots, y_N\}$. First, we center the data, that is, we subtract the sample mean. Let the centered data be

    $\tilde{y}_n = y_n - \frac{1}{N} \sum_{n=1}^{N} y_n .$

Then we find $\theta$ to maximize $J = \sum_{n=1}^{N} (\theta^T \tilde{y}_n)^2$ subject to $\|\theta\| = 1$. This is a simple optimization problem. Expanding $J$:

    $J = \sum_{n} (\theta^T \tilde{y}_n)^2$                                    (1)
    $\;\;= \sum_{n} \theta^T \tilde{y}_n \tilde{y}_n^T \theta$                 (2)
    $\;\;= \theta^T \big( \sum_{n} \tilde{y}_n \tilde{y}_n^T \big) \theta$     (3)
    $\;\;= \theta^T Y \theta$                                                  (4)

The $\hat{\theta}$ that maximizes $J$ under the constraint satisfies $Y\hat{\theta} = \lambda\hat{\theta}$. Thus, when $\lambda$ is the largest eigenvalue of $Y$, $\hat{\theta}$ is the corresponding (first) eigenvector.

We now compare factor analysis (FA) with principal component analysis (PCA). FA is insensitive to scale factors: since FA captures the underlying coordinate system, the direction of the projections does not change when the components $Y_i$ have different variances. In FA the uncertainty is carried by the covariance matrix; when $Y_n$ is far from the subspace, the covariance increases and the sphere in Figure 1 grows. On the contrary, PCA is sensitive to scale factors, since the eigenvectors are affected by rescaling the data. Whether to use FA or PCA depends on the goal: if the goal is compression, the scale factors do not matter much, and PCA can be effective.
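A minimal NumPy sketch of the computation in (1)-(4): center the data and take the leading eigenvector of the scatter matrix $Y = \sum_n \tilde{y}_n \tilde{y}_n^T$. The synthetic data at the bottom are an arbitrary example, not from the lecture:

    import numpy as np

    def first_principal_direction(Y):
        """Return the unit vector theta maximizing sum_n (theta^T y_tilde_n)^2."""
        Y_tilde = Y - Y.mean(axis=0)                # center the data
        scatter = Y_tilde.T @ Y_tilde               # sum_n y_tilde_n y_tilde_n^T
        eigvals, eigvecs = np.linalg.eigh(scatter)  # symmetric eigendecomposition, ascending
        return eigvecs[:, -1]                       # eigenvector of the largest eigenvalue

    # Example with synthetic data (illustrative):
    rng = np.random.default_rng(1)
    Y = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])
    theta_hat = first_principal_direction(Y)
    print(theta_hat)   # close to (+/-1, 0): the direction of maximal variance

Rescaling one coordinate of the data changes the returned direction, which is exactly the scale sensitivity of PCA noted above.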

3 Maximum Likelihood Estimation of the Factor Analysis Model

In this section we derive maximum likelihood estimation for the factor analysis model. $\mu_{ML}$ is the sample mean, and we subtract $\mu_{ML}$ from the data. The graphical model is shown in Figure 2. In the model, the unconditional mean and unconditional covariance of the joint vector $(X_n, Y_n)$ are:

    $\text{unconditional mean} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$                                                (5)

    $\text{unconditional covariance} = \begin{pmatrix} I & \Lambda^T \\ \Lambda & \Lambda\Lambda^T + \Psi \end{pmatrix}$     (6)

[Figure 2: The factor analysis model as a graphical model: a node $X_n \sim \mathcal{N}(0, I)$ with an arrow to $Y_n \sim \mathcal{N}(\Lambda X_n, \Psi)$.]

Given complete data, the complete likelihood is a product of Gaussian distributions. The complete log likelihood is:

    $l_c(\Lambda, \Psi) = -\frac{N}{2} \log|\Psi| - \frac{N}{2}\, \mathrm{tr}(S\Psi^{-1})$     (7)

where

    $S = \frac{1}{N} \sum_{n=1}^{N} (Y_n - \Lambda X_n)(Y_n - \Lambda X_n)^T$     (8)

Now we take the conditional expectation of the complete log likelihood, conditioning on the observed data and the current parameter vector; $\langle \cdot \rangle$ denotes this conditional expectation. The expected complete log likelihood is:

    $\langle l_c(\Lambda, \Psi) \rangle = -\frac{N}{2} \log|\Psi| - \frac{N}{2}\, \mathrm{tr}(\langle S \rangle \Psi^{-1})$     (9)

    $\langle S \rangle = \frac{1}{N} \sum_{n=1}^{N} \big\{ Y_n Y_n^T - Y_n \langle X_n^T \rangle \Lambda^T - \Lambda \langle X_n \rangle Y_n^T + \Lambda \langle X_n X_n^T \rangle \Lambda^T \big\}$     (10)

In the E-step of the algorithm, the sufficient statistics we need are $\langle X_n \rangle$ and $\langle X_n X_n^T \rangle$:

    $\langle X_n \rangle = E[X_n \mid Y_n] = \Lambda^T (\Lambda\Lambda^T + \Psi)^{-1} Y_n$     (11)

    $\langle X_n X_n^T \rangle = \mathrm{Var}(X_n \mid Y_n) + E[X_n \mid Y_n]E[X_n \mid Y_n]^T = I - \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}\Lambda + E[X_n \mid Y_n]E[X_n \mid Y_n]^T$     (12)

In the M-step of the algorithm, we update $\Lambda$ and $\Psi$ using the updated sufficient statistics. (Recall the least-squares estimate $\hat{\theta} = (X^T X)^{-1} X^T Y$; the update for $\Lambda$ has the same form.)

    $\Lambda^{(t+1)} = \Big( \sum_{n=1}^{N} Y_n \langle X_n^T \rangle \Big) \Big( \sum_{n=1}^{N} \langle X_n X_n^T \rangle \Big)^{-1}$     (13)

    $\Psi^{(t+1)} = \frac{1}{N}\, \mathrm{diag}\Big\{ \sum_{n=1}^{N} (Y_n - \Lambda^{(t+1)} \langle X_n \rangle) Y_n^T \Big\}$     (14)

Here $\Lambda^{(t+1)} \langle X_n \rangle$ is the best guess of $Y_n$ and $Y_n - \Lambda^{(t+1)} \langle X_n \rangle$ is the residual. This completes the derivation of the E-step and M-step.
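The E-step (11)-(12) and the M-step (13)-(14) translate almost line-for-line into code. The following NumPy sketch is one possible implementation, assuming centered data and a diagonal $\Psi$; the initialization and iteration count are arbitrary choices:

    import numpy as np

    def fa_em(Y, k, n_iters=100, seed=0):
        """EM for factor analysis on centered data Y (N x p), latent dimension k."""
        rng = np.random.default_rng(seed)
        N, p = Y.shape
        Lam = rng.normal(size=(p, k))      # initial loading matrix
        Psi = np.eye(p)                    # initial diagonal noise covariance

        for _ in range(n_iters):
            # E-step: posterior moments <X_n> and <X_n X_n^T>  (eqs. 11-12)
            G = np.linalg.inv(Lam @ Lam.T + Psi)   # (Lambda Lambda^T + Psi)^{-1}
            B = Lam.T @ G                          # k x p
            EX = Y @ B.T                           # rows are <X_n>
            cov = np.eye(k) - B @ Lam              # Var(X_n | Y_n), same for all n
            SumXX = N * cov + EX.T @ EX            # sum_n <X_n X_n^T>

            # M-step: update Lambda and Psi  (eqs. 13-14)
            SumYX = Y.T @ EX                       # sum_n Y_n <X_n^T>
            Lam = SumYX @ np.linalg.inv(SumXX)
            Psi = np.diag(np.diag(Y.T @ Y - Lam @ SumYX.T) / N)

        return Lam, Psi

    # Usage, e.g. with the Y sampled in the Section 1 sketch:
    #   Lam_hat, Psi_hat = fa_em(Y - Y.mean(axis=0), k=2)
    # Note that Lambda is identifiable only up to a rotation of the latent space.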

4 State Space Models

[Figure 3: The SSM as a graphical model: a chain of state nodes $x_0, x_1, \ldots$, each with an observation node $y_t$ attached.]

In the factor analysis model, the state nodes are continuous, vector-valued nodes endowed with a Gaussian probability distribution. To develop a dynamical generalization of the factor analysis model we must represent the transition between these nodes at successive moments in time. Perhaps the simplest choice that we can make is to allow the mean of the state at time $t+1$ to be a linear function of the state at time $t$:

    $x_{t+1} = A x_t + G w_t$

where $w_t$ is a noise term: a Gaussian random variable that is independent of $w_s$ for $s < t$, and thus independent of $x_t$. We assume that $w_t$ has zero mean and covariance matrix $Q$, and we endow the initial state $x_0$ with a Gaussian distribution having mean $0$ and covariance $P_0$. The assumption of zero means is without loss of generality (a non-zero mean gives rise to a deterministic component that can be added to the probabilistic solution):

    $w_t \sim \mathcal{N}(0, Q)$
    $x_0 \sim \mathcal{N}(0, P_0)$

Given that a sum of Gaussian variables is Gaussian, $x_{t+1}$ is indeed Gaussian:

    $x_{t+1} \mid x_t \sim \mathcal{N}(A x_t, G Q G^T)$

In the factor analysis model, the output is endowed with a Gaussian distribution having a mean that is a linear function of the state. We continue to use this model for the output of the SSM:

    $y_t = C x_t + v_t$

where $v_t$ is a Gaussian random variable with zero mean and covariance matrix $R$:

    $v_t \sim \mathcal{N}(0, R)$

Conditional on $x_t$, $y_t$ is a Gaussian random variable with mean $C x_t$ and covariance $R$:

    $y_t \mid x_t \sim \mathcal{N}(C x_t, R)$

The inference problem for the SSM is the same as it was for the HMM: calculating the posterior probability of the states given an output sequence. Based on our experience with the HMM, we hope to be able to calculate such posterior probabilities recursively. Our goals are:

    compute $p(x_t \mid y_1, \ldots, y_t)$   (sum-product; a.k.a. the Kalman filter)
    compute $p(x_t \mid y_1, \ldots, y_T)$   (sum-product; a.k.a. Rauch-Tung-Striebel smoothing)
    estimate $P_0, A, C, G, Q, R$   (EM)     (15)
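For concreteness, here is a short NumPy sketch that simulates a trajectory from this linear-Gaussian model; the particular $A$, $G$, $C$, $Q$, $R$, and $P_0$ are illustrative values, not from the lecture:

    import numpy as np

    rng = np.random.default_rng(2)

    # Illustrative 2-D state, 1-D observation model (values are assumptions).
    A = np.array([[1.0, 1.0], [0.0, 1.0]])   # state transition matrix
    G = np.eye(2)                            # noise input matrix
    C = np.array([[1.0, 0.0]])               # observation matrix
    Q = 0.01 * np.eye(2)                     # state noise covariance
    R = np.array([[0.25]])                   # observation noise covariance
    P0 = np.eye(2)                           # initial state covariance

    T = 50
    x = rng.multivariate_normal(np.zeros(2), P0)               # x_0 ~ N(0, P_0)
    xs, ys = [], []
    for t in range(T):
        y = C @ x + rng.multivariate_normal(np.zeros(1), R)    # y_t = C x_t + v_t
        xs.append(x)
        ys.append(y)
        w = rng.multivariate_normal(np.zeros(2), Q)            # w_t ~ N(0, Q)
        x = A @ x + G @ w                                      # x_{t+1} = A x_t + G w_t
    xs, ys = np.array(xs), np.array(ys)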

5 Filtering

[Figure 4: A fragment of an SSM: the states $x_t$, $x_{t+1}$ and their observations $y_t$, $y_{t+1}$.]

The filtering problem is to calculate an estimate of the state $x_t$ based on the partial output sequence $y_0, \ldots, y_t$; that is, to compute $p(x_{t+1} \mid y_0, \ldots, y_t, y_{t+1})$ recursively from $p(x_t \mid y_0, \ldots, y_t)$.

Sums of Gaussian variables are Gaussian, and thus, considering all of the variables in the SSM jointly, we have a (large) multivariate Gaussian distribution. Conditionals of Gaussians are Gaussian, and thus the probability distribution $p(x_t \mid y_0, \ldots, y_t)$ must be Gaussian. We write $\hat{x}_{t|t}$ to denote the mean of $x_t$ conditioned on the partial sequence $y_0, \ldots, y_t$, and $P_{t|t}$ to denote the covariance matrix of $x_t$ conditioned on $y_0, \ldots, y_t$; thus:

    $\hat{x}_{t|t} \triangleq E[x_t \mid y_0, \ldots, y_t]$
    $P_{t|t} \triangleq E[(x_t - \hat{x}_{t|t})(x_t - \hat{x}_{t|t})^T \mid y_0, \ldots, y_t]$

We assume that we have already calculated $p(x_t \mid y_0, \ldots, y_t)$, that is, we have calculated $\hat{x}_{t|t}$ and $P_{t|t}$. We wish to carry this distribution forward into the fragment on the right, where we condition on $y_0, \ldots, y_{t+1}$. We decompose the transformation into two steps:

    time update:           $p(x_t \mid y_0, \ldots, y_t) \rightarrow p(x_{t+1} \mid y_0, \ldots, y_t)$
    measurement update:    $p(x_{t+1} \mid y_0, \ldots, y_t) \rightarrow p(x_{t+1} \mid y_0, \ldots, y_{t+1})$

Thus, in the time update step, we simply propagate the distribution forward one step in time, calculating the new mean and covariance from the old mean and covariance, but conditioning on no new measurements (i.e., no new output). In the measurement update step, we incorporate the new measurement $y_{t+1}$ and update the probability distribution for $x_{t+1}$. The overall result is a transformation from $\hat{x}_{t|t}$ and $P_{t|t}$ to $\hat{x}_{t+1|t+1}$ and $P_{t+1|t+1}$.

Given $\hat{x}_{t|t}$, we compute $\hat{x}_{t+1|t}$:

    $\hat{x}_{t+1|t} = E[x_{t+1} \mid y_0, \ldots, y_t] = E[A x_t + G w_t \mid y_0, \ldots, y_t] = A\, E[x_t \mid y_0, \ldots, y_t] = A \hat{x}_{t|t}$

(The middle step uses the fact that $w_t$ has zero mean and is independent of $y_0, \ldots, y_t$.)
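The notes stop at the time-update mean. As an illustration of the full recursion, here is a NumPy sketch: the time update follows the derivation above, while the measurement-update formulas (innovation covariance and Kalman gain) are the standard Kalman filter equations, supplied here rather than derived in these notes:

    import numpy as np

    def kalman_filter(ys, A, G, C, Q, R, P0, x0=None):
        """Kalman filter: measurement update then time update at each step.

        ys: array of shape (T, q) holding y_0, ..., y_{T-1}.
        Returns the filtered means x_{t|t} and covariances P_{t|t}.
        """
        n = A.shape[0]
        x = np.zeros(n) if x0 is None else x0     # prior mean of x_0
        P = P0                                    # prior covariance of x_0
        means, covs = [], []
        for y in ys:
            # Measurement update: incorporate y_t into the current prediction.
            S = C @ P @ C.T + R                   # innovation covariance
            K = P @ C.T @ np.linalg.inv(S)        # Kalman gain
            x = x + K @ (y - C @ x)               # x_{t|t}
            P = P - K @ C @ P                     # P_{t|t}
            means.append(x)
            covs.append(P)
            # Time update: x_{t+1|t} = A x_{t|t},  P_{t+1|t} = A P_{t|t} A^T + G Q G^T
            x = A @ x
            P = A @ P @ A.T + G @ Q @ G.T
        return np.array(means), np.array(covs)

    # Usage with the simulated trajectory from the Section 4 sketch:
    #   x_filt, P_filt = kalman_filter(ys, A, G, C, Q, R, P0)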