STATS 306B: Unsupervised Learning                                    Spring 2014
Lecture 10: April 30


Lecturer: Lester Mackey    Scribes: Joey Arthur, Rakesh Achanta

10.1 Factor Analysis

10.1.1 Recap

Recall the factor analysis (FA) model for linear dimensionality reduction of continuous data. In this model, our observations $x_i \in \mathbb{R}^p$ are related to latent factors $z_i \in \mathbb{R}^q$ in the following manner:
\[
z_i \overset{iid}{\sim} \mathcal{N}(0, I_{q \times q}), \qquad x_i \mid z_i \sim \mathcal{N}(\mu + \Lambda z_i, \Psi),
\]
where we assume $\Psi \in \mathbb{R}^{p \times p}$ is diagonal. Given the observations, we would like to infer the latent factors, which provide a lower-dimensional (approximate) representation of our data.

Last time we computed the conditional distribution of $z_i$ given $x_i$, but this distribution of course depends on the unknown parameters $\theta = (\mu, \Lambda, \Psi)$. We derived the maximum likelihood estimator for $\mu$, which is the sample mean. However, $\Lambda$ and $\Psi$ are coupled together in the likelihood by a determinant and a matrix inverse, and there is no closed-form MLE for these parameters. We will instead estimate $\Lambda$ and $\Psi$ using an EM algorithm.
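To fix intuition before deriving the algorithm, here is a minimal NumPy sketch of sampling from the FA model. The dimensions and parameter values below are illustrative assumptions, not quantities from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 500, 10, 3                    # illustrative sizes (assumed)

# Illustrative "true" parameters: loadings Lambda, diagonal noise Psi, mean mu.
Lambda = rng.normal(size=(p, q))
psi_diag = rng.uniform(0.1, 1.0, size=p)   # diagonal entries of Psi
mu = np.zeros(p)

# Generative process: z_i ~ N(0, I_q), then x_i | z_i ~ N(mu + Lambda z_i, Psi).
Z = rng.normal(size=(n, q))
X = mu + Z @ Lambda.T + rng.normal(size=(n, p)) * np.sqrt(psi_diag)
```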

10.1.2 EM Parameter Estimation

Since the MLE for $\mu$ is known, we will assume w.l.o.g. that the data have been mean-centered as $x_i \leftarrow x_i - \hat{\mu}_{MLE}$ and remove the parameter $\mu$ from the model. In order to derive an EM algorithm, we begin as usual with the complete log-likelihood of our data together with the latent variables:
\[
\log p(z_{1:n}, x_{1:n}; \theta) = -\frac{1}{2} \sum_i z_i^T z_i - \frac{n}{2} \log |\Psi| - \frac{1}{2} \sum_i (x_i - \Lambda z_i)^T \Psi^{-1} (x_i - \Lambda z_i) + C_1, \tag{10.1}
\]
where $C_1$ is a parameter-free term including normalizing constants. Observe that in this complete log-likelihood, $\Lambda$ and $\Psi$ are no longer coupled together as they were in the marginal likelihood of the observed data. Noting that the $z_i^T z_i$ term above does not involve the parameters and making several other simplifications, we have
\begin{align*}
\log p(z_{1:n}, x_{1:n}; \theta) &= -\frac{n}{2} \log |\Psi| - \frac{1}{2} \sum_i \mathrm{tr}\big( (x_i - \Lambda z_i)^T \Psi^{-1} (x_i - \Lambda z_i) \big) + C_2 \\
&= -\frac{n}{2} \log |\Psi| - \frac{1}{2} \sum_i \mathrm{tr}\big( (x_i - \Lambda z_i)(x_i - \Lambda z_i)^T \Psi^{-1} \big) + C_2 \\
&= -\frac{n}{2} \log |\Psi| - \frac{n}{2} \mathrm{tr}(S \Psi^{-1}) + C_2,
\end{align*}
where $S$ is defined as the empirical conditional covariance
\[
S = \frac{1}{n} \sum_i (x_i - \Lambda z_i)(x_i - \Lambda z_i)^T = \frac{1}{n} \sum_i \big[ x_i x_i^T + \Lambda z_i z_i^T \Lambda^T - \Lambda z_i x_i^T - x_i z_i^T \Lambda^T \big].
\]
In the second line above we used the fact that a scalar is equal to its trace. In the third line we used the cyclic property of the trace, $\mathrm{tr}(ABC) = \mathrm{tr}(CAB)$, which can be applied whenever the matrix/vector multiplications are all well-defined.

We now derive the E-step by computing the expected complete log-likelihood (ECLL) under $q_t(z_{1:n}) = p(z_{1:n} \mid x_{1:n}; \theta^{(t)})$, where $\theta^{(t)}$ is our estimate from the previous EM iteration. Recall that this conditional distribution is Gaussian and that we derived its mean and covariance last time. The ECLL is
\[
\mathbb{E}_{q_t} \log p(z_{1:n}, x_{1:n}; \theta) = -\frac{n}{2} \log |\Psi| - \frac{n}{2} \mathrm{tr}\big( \mathbb{E}_{q_t}[S] \Psi^{-1} \big) + C_2
\]
after interchanging the trace and expectation. We must therefore compute
\[
\mathbb{E}_{q_t}[S] = \frac{1}{n} \sum_i \big[ x_i x_i^T + \Lambda \mathbb{E}_{q_t}[z_i z_i^T] \Lambda^T - \Lambda \mathbb{E}_{q_t}[z_i] x_i^T - x_i \mathbb{E}_{q_t}[z_i^T] \Lambda^T \big],
\]
where $\mathbb{E}_{q_t}[z_i] = \mathbb{E}[z_i \mid x_i]$ was computed last time and $\mathbb{E}_{q_t}[z_i z_i^T] = \mathrm{Cov}[z_i \mid x_i] + \mathbb{E}[z_i \mid x_i]\, \mathbb{E}[z_i \mid x_i]^T$ is similarly easy to compute.

For the M-step, one can show that the ECLL is maximized by taking
\[
\Lambda^{(t+1)} = \Big( \sum_i x_i \mathbb{E}_{q_t}[z_i^T] \Big) \Big( \sum_i \mathbb{E}_{q_t}[z_i z_i^T] \Big)^{-1}
\]
and
\[
\Psi^{(t+1)} = \mathrm{diag}\big( \mathbb{E}_{q_t}[S] \big) = \frac{1}{n} \mathrm{diag}\Big( \sum_i x_i x_i^T - \Lambda^{(t+1)} \sum_i \mathbb{E}_{q_t}[z_i] x_i^T \Big).
\]
Note the similarity of the $\Lambda$ update to the normal equations solved during linear regression. Also, notice that the $\Psi$ update involves the updated $\Lambda^{(t+1)}$ and not $\Lambda^{(t)}$.
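As a concrete companion to the updates above, here is a minimal NumPy sketch of one EM iteration for mean-centered data. The function and variable names are ours, and the posterior moments plugged into the E-step, $\mathbb{E}[z_i \mid x_i] = \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1} x_i$ and $\mathrm{Cov}[z_i \mid x_i] = I - \Lambda^T(\Lambda\Lambda^T + \Psi)^{-1}\Lambda$, are the standard Gaussian conditionals derived last lecture.

```python
import numpy as np

def fa_em_step(X, Lambda, psi_diag):
    """One EM iteration for the (mean-centered) factor analysis model.

    X: (n, p) data; Lambda: (p, q) loadings; psi_diag: (p,) diagonal of Psi.
    A sketch of the updates derived above; returns updated (Lambda, psi_diag).
    """
    n, p = X.shape
    q = Lambda.shape[1]

    # E-step: posterior moments of z_i given x_i under the current parameters.
    G = Lambda.T @ np.linalg.inv(Lambda @ Lambda.T + np.diag(psi_diag))  # (q, p)
    Ez = X @ G.T                              # rows are E[z_i | x_i]
    cov_z = np.eye(q) - G @ Lambda            # shared Cov[z_i | x_i]
    sum_Ezz = n * cov_z + Ez.T @ Ez           # sum_i E[z_i z_i^T]
    sum_xEz = X.T @ Ez                        # sum_i x_i E[z_i]^T

    # M-step: Lambda^{(t+1)} and Psi^{(t+1)} as in the notes.
    Lambda_new = sum_xEz @ np.linalg.inv(sum_Ezz)
    psi_new = np.diag(X.T @ X - Lambda_new @ sum_xEz.T) / n
    return Lambda_new, psi_new
```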

10.1.3 Observations

There are several connections between FA and previous models/algorithms we have considered. We might consider FA as similar to Gaussian mixture modeling but with the latent variables $z_i$ continuous rather than discrete. We can also draw similarities between FA and PCA. Both methods describe data using a lower-dimensional linear representation. However, factor analysis allows for more general covariance structure than PCA does, and so the loadings and factors derived from factor analysis do not in general correspond to the results of PCA.

In the case that $\Psi$ is restricted to be isotropic (i.e., $\Psi = \sigma^2 I$ for unknown $\sigma^2$), we recover the probabilistic PCA (PPCA) model (Tipping & Bishop, 1999). In this restricted case there are closed-form MLEs. If $U$ is the matrix whose columns are the top $q$ eigenvectors of the empirical covariance $\frac{1}{n} X^T X$, and $\lambda_1, \ldots, \lambda_p$ are its eigenvalues, then we have
\[
\hat{\sigma}^2_{MLE} = \frac{1}{p - q} \sum_{j = q+1}^{p} \lambda_j, \qquad \hat{\Lambda}_{MLE} = U \big( \mathrm{diag}(\lambda_1, \ldots, \lambda_q) - \hat{\sigma}^2_{MLE} I \big)^{1/2}.
\]
In this restricted setup, the factor analysis loadings (columns of $\hat{\Lambda}$) span the same subspace as the PCA loadings $U$. Moreover, if we consider $\sigma^2$ as known, then as $\sigma^2 \to 0$, PPCA actually recovers the PCA algorithm. This is another example of small variance asymptotics, like we have seen before.
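A minimal sketch of these closed-form PPCA estimates, assuming mean-centered data stored in an (n, p) array; the function name is illustrative.

```python
import numpy as np

def ppca_mle(X, q):
    """Closed-form PPCA MLEs (isotropic Psi = sigma^2 I), per the formulas above.

    X: (n, p) mean-centered data; q: number of factors.
    Returns (Lambda_hat, sigma2_hat).
    """
    n, p = X.shape
    S = X.T @ X / n                                     # empirical covariance
    eigvals, eigvecs = np.linalg.eigh(S)                # ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending
    sigma2_hat = eigvals[q:].mean()                     # average of discarded eigenvalues
    U = eigvecs[:, :q]                                  # top-q eigenvectors
    Lambda_hat = U @ np.diag(np.sqrt(eigvals[:q] - sigma2_hat))
    return Lambda_hat, sigma2_hat
```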

We should also mention a few caveats to using factor analysis. First, the FA parameters are in general not identifiable. For example, given an orthogonal matrix $O$ (such that $O O^T = O^T O = I$), the parameters $\Lambda$ and $\Lambda O$ will give rise to the same distribution of $x_i$. Hence, interpretation of the learned values of $\Lambda$ and $z_i$ must be done with care.

Even apart from these interpretability issues, factor analysis treats datapoints as independent draws. What if our data have a known, e.g., sequential, dependence structure? Such structure arises in a variety of settings:

- Tracking 3D object movement given radar or video
- Autopilot, in which we would like to estimate the state of a vehicle over time from internal and external sensors
- The inference of evolving market factors from financial time series
- Character recognition based on touch-screen contact over time
- GPS navigation
- Recommender systems, in which we aim to estimate users' preferences over time

We will next investigate a probabilistic model designed for data with such a known sequential dependence structure.

10.2 Linear Gaussian State Space Model

The linear Gaussian state space model (LGSSM) is a generalization of factor analysis to the setting of sequential continuous data. Under this model, we view our data sequence $x_0, x_1, \ldots, x_T \in \mathbb{R}^p$ as a random draw from the following generative process (a simulation sketch follows the graphical-model discussion below):

(0) $z_0 \sim \mathcal{N}(0, \Sigma_0)$: sample the initial state in $\mathbb{R}^q$.

(1) $z_t = A z_{t-1} + w_{t-1}$, where $w_{t-1} \overset{ind}{\sim} \mathcal{N}(0, Q)$, or alternately $z_t \mid z_{t-1} \sim \mathcal{N}(A z_{t-1}, Q)$. I.e., $z_t$ is sampled from linear Gaussian dynamics given the prior state $z_{t-1}$ via the unknown transition matrix $A \in \mathbb{R}^{q \times q}$ and unknown covariance matrix $Q \in \mathbb{R}^{q \times q}$.

(2) $x_t = C z_t + v_t$ for $v_t \overset{ind}{\sim} \mathcal{N}(0, R)$, or $x_t \mid z_t \sim \mathcal{N}(C z_t, R)$. I.e., the observations $x_t$ are sampled given the state $z_t$, normally distributed with mean $C z_t$, where $C \in \mathbb{R}^{p \times q}$ is the unknown loadings matrix and $R \in \mathbb{R}^{p \times p}$ the unknown covariance.

Notice that this is similar to the emission model from factor analysis but with a more general covariance matrix $R$ and with dependent states.

10.2.1 Graphical Model

[Figure 10.1: Graphical model for the LGSSM.]

The LGSSM graphical model is the same as the Hidden Markov Model graphical model, since the two models have identical conditional independence structures. However, in the present setting, we have Gaussian as opposed to discrete hidden variables $z_t$.
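To make the generative process concrete, here is a minimal simulation sketch of an LGSSM. The dimensions and parameter values are illustrative assumptions, not quantities from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
T, p, q = 100, 4, 2                          # illustrative lengths/dimensions (assumed)

# Illustrative parameters: transition A, emission C, covariances Sigma0, Q, R.
A = 0.95 * np.eye(q)
C = rng.normal(size=(p, q))
Sigma0, Q, R = np.eye(q), 0.1 * np.eye(q), 0.5 * np.eye(p)

# (0) initial state; then alternate (2) emission and (1) linear Gaussian dynamics.
z = rng.multivariate_normal(np.zeros(q), Sigma0)
states, obs = [], []
for t in range(T + 1):
    x = rng.multivariate_normal(C @ z, R)    # x_t | z_t ~ N(C z_t, R)
    states.append(z)
    obs.append(x)
    z = rng.multivariate_normal(A @ z, Q)    # z_{t+1} = A z_t + w_t
```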

10.2.2 Unsupervised Learning Goal

Our unsupervised learning goal is to draw inferences about the hidden states $z_0, z_1, \ldots, z_T$. Here are three of the most common inferential tasks:

(1) Filtering. Infer the current state given the history of observations, $P(z_t \mid x_0, \ldots, x_t)$. E.g., what is the current state of the missile given its position over some past time?

(2) Smoothing. Infer a past state given observations, $P(z_s \mid x_0, \ldots, x_t)$ where $s < t$. E.g., where did the missile originate, given we observed it over some time?

(3) Prediction. Predict a future state given observations, $P(z_u \mid x_0, \ldots, x_t)$ where $u > t$. E.g., where would we expect the missile to be some time from now?

In this lecture and the next, we will detail recursive algorithms for carrying out filtering and smoothing, assuming all model parameters are known.

10.2.3 Kalman Filter

The Kalman filter is an algorithm for filtering in an LGSSM where the parameters are known. We wish to find the probability distribution of the current state given the history of observations. Since the states and the observations are jointly Gaussian, it suffices to find the mean and covariance of the conditional distribution, which is also going to be Gaussian. We introduce the following shorthand notation for filtered means and covariances:
\[
\hat{z}_{t|t} = \mathbb{E}[z_t \mid x_{0:t}], \qquad P_{t|t} = \mathbb{E}\big[ (z_t - \hat{z}_{t|t})(z_t - \hat{z}_{t|t})^T \mid x_{0:t} \big].
\]
We will also be interested in computing the one-step prediction means and covariances $\hat{z}_{t|t-1}, P_{t|t-1}$ of $P(z_t \mid x_{0:t-1})$.

Filtering Strategy

We will derive our filtering algorithm via a two-step recursion:

(1) Time update: Compute the prediction distribution $P(z_{t+1} \mid x_{0:t})$ given the last filtered distribution $P(z_t \mid x_{0:t})$.

(2) Measurement update: Compute the new filtered distribution $P(z_{t+1} \mid x_{0:t+1})$ given the prediction distribution $P(z_{t+1} \mid x_{0:t})$.

Time Update

We will use the fact that $z_{t+1} = A z_t + w_t$ to compute the mean and covariance of the prediction distribution from the filtered distribution:
\begin{align*}
\hat{z}_{t+1|t} &= \mathbb{E}[A z_t + w_t \mid x_{0:t}] = A \mathbb{E}[z_t \mid x_{0:t}] + 0 = A \hat{z}_{t|t}, \\
P_{t+1|t} &= \mathbb{E}\big[ (z_{t+1} - \hat{z}_{t+1|t})(z_{t+1} - \hat{z}_{t+1|t})^T \mid x_{0:t} \big] \\
&= \mathbb{E}\big[ (A z_t + w_t - A \hat{z}_{t|t})(A z_t + w_t - A \hat{z}_{t|t})^T \mid x_{0:t} \big] \\
&= A \, \mathbb{E}\big[ (z_t - \hat{z}_{t|t})(z_t - \hat{z}_{t|t})^T \mid x_{0:t} \big] A^T + \mathbb{E}[w_t w_t^T \mid x_{0:t}] + 0 \\
&= A P_{t|t} A^T + Q.
\end{align*}
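A minimal sketch of the time update just derived, assuming the parameters $A$ and $Q$ are known; the measurement update, covered next, would complete the recursion. Function and variable names are illustrative.

```python
import numpy as np

def kalman_time_update(z_filt, P_filt, A, Q):
    """Time update of the Kalman filter recursion derived above.

    z_filt: E[z_t | x_{0:t}], shape (q,); P_filt: Cov[z_t | x_{0:t}], shape (q, q).
    Returns the one-step prediction mean and covariance for z_{t+1} given x_{0:t}.
    """
    z_pred = A @ z_filt              # z_hat_{t+1|t} = A z_hat_{t|t}
    P_pred = A @ P_filt @ A.T + Q    # P_{t+1|t} = A P_{t|t} A^T + Q
    return z_pred, P_pred
```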