STATS 306B: Unsupervised Learning                                   Spring 2014
Lecture 10: April 30
Lecturer: Lester Mackey                     Scribe: Joey Arthur, Rakesh Achanta

10.1 Factor Analysis

10.1.1 Recap

Recall the factor analysis (FA) model for linear dimensionality reduction of continuous data. In this model, our observations $x_i \in \mathbb{R}^p$ are related to latent factors $z_i \in \mathbb{R}^q$ in the following manner:

$$z_i \overset{iid}{\sim} \mathcal{N}(0, I_{q \times q}), \qquad x_i \mid z_i \sim \mathcal{N}(\mu + \Lambda z_i, \Psi),$$

where we assume $\Psi \in \mathbb{R}^{p \times p}$ is diagonal. Given the observations, we would like to infer the latent factors, which provide a lower-dimensional (approximate) representation of our data. Last time we computed the conditional distribution of $z_i$ given $x_i$, but this distribution of course depends on the unknown parameters $\theta = (\mu, \Lambda, \Psi)$. We derived the maximum likelihood estimator for $\mu$, which is the sample mean. However, $\Lambda$ and $\Psi$ are coupled together in the likelihood by a determinant and a matrix inverse, and there is no closed-form MLE for these parameters. We will instead estimate $\Lambda$ and $\Psi$ using an EM algorithm.

10.1.2 EM Parameter Estimation

Since the MLE for $\mu$ is known, we will assume w.l.o.g. that the data have been mean-centered as $x_i \leftarrow x_i - \hat{\mu}_{\text{MLE}}$ and remove the parameter $\mu$ from the model. In order to derive an EM algorithm, we begin as usual with the complete log-likelihood of our data together with the latent variables:

$$\log p(z_{1:n}, x_{1:n}; \theta) = -\frac{1}{2} \sum_i z_i^T z_i - \frac{n}{2} \log |\Psi| - \frac{1}{2} \sum_i (x_i - \Lambda z_i)^T \Psi^{-1} (x_i - \Lambda z_i) + C_1, \tag{10.1}$$

where $C_1$ is a parameter-free term including normalizing constants. Observe that in this complete log-likelihood, $\Lambda$ and $\Psi$ are no longer coupled together as they were in the marginal likelihood of the observed data. Noting that the $\sum_i z_i^T z_i$ term above does not involve the
parameters and making several other simplifications, we have

$$\begin{aligned}
\log p(z_{1:n}, x_{1:n}; \theta) &= -\frac{n}{2} \log |\Psi| - \frac{1}{2} \sum_i \operatorname{tr}\left( (x_i - \Lambda z_i)^T \Psi^{-1} (x_i - \Lambda z_i) \right) + C_2 \\
&= -\frac{n}{2} \log |\Psi| - \frac{1}{2} \sum_i \operatorname{tr}\left( (x_i - \Lambda z_i)(x_i - \Lambda z_i)^T \Psi^{-1} \right) + C_2 \\
&= -\frac{n}{2} \log |\Psi| - \frac{n}{2} \operatorname{tr}(S \Psi^{-1}) + C_2,
\end{aligned}$$

where $S$ is defined as the empirical conditional covariance

$$S = \frac{1}{n} \sum_i (x_i - \Lambda z_i)(x_i - \Lambda z_i)^T = \frac{1}{n} \sum_i \left[ x_i x_i^T + \Lambda z_i z_i^T \Lambda^T - \Lambda z_i x_i^T - x_i z_i^T \Lambda^T \right].$$

In the second line above we used the fact that a scalar is equal to its trace. In the third line we used the cyclic property of the trace, $\operatorname{tr}(ABC) = \operatorname{tr}(CAB)$, which can be applied whenever the matrix/vector multiplications are all well-defined.

We now derive the E-step by computing the expected complete log-likelihood (ECLL) under $q_t(z_{1:n}) = p(z_{1:n} \mid x_{1:n}; \theta^{(t)})$, where $\theta^{(t)}$ is our estimate from the previous EM iteration. Recall that this conditional distribution is Gaussian and that we derived its mean and covariance last time. The ECLL is

$$\mathbb{E}_{q_t} \log p(z_{1:n}, x_{1:n}; \theta) = -\frac{n}{2} \log |\Psi| - \frac{n}{2} \operatorname{tr}\left( \mathbb{E}_{q_t}[S] \Psi^{-1} \right) + C_2$$

after interchanging the trace and expectation. We must therefore compute

$$\mathbb{E}_{q_t}[S] = \frac{1}{n} \sum_i \left[ x_i x_i^T + \Lambda \mathbb{E}_{q_t}[z_i z_i^T] \Lambda^T - \Lambda \mathbb{E}_{q_t}[z_i] x_i^T - x_i \mathbb{E}_{q_t}[z_i^T] \Lambda^T \right],$$

where $\mathbb{E}_{q_t}[z_i] = \mathbb{E}[z_i \mid x_i]$ was computed last time and $\mathbb{E}_{q_t}[z_i z_i^T] = \operatorname{Cov}[z_i \mid x_i] + \mathbb{E}[z_i \mid x_i]\,\mathbb{E}[z_i \mid x_i]^T$ is similarly easy to compute.

For the M-step, one can show that the ECLL is maximized by taking

$$\Lambda^{(t+1)} = \left( \sum_i x_i \mathbb{E}_{q_t}[z_i^T] \right) \left( \sum_i \mathbb{E}_{q_t}[z_i z_i^T] \right)^{-1}$$

and

$$\Psi^{(t+1)} = \operatorname{diag}(\mathbb{E}_{q_t}[S]) = \frac{1}{n} \operatorname{diag}\left( \sum_i x_i x_i^T - \Lambda^{(t+1)} \sum_i \mathbb{E}_{q_t}[z_i] x_i^T \right).$$

Note the similarity of the $\Lambda$ update to the normal equations solved during linear regression. Also, notice that the $\Psi$ update involves the updated $\Lambda^{(t+1)}$ and not $\Lambda^{(t)}$.
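As a concrete reference, the following is a minimal numpy sketch of one EM iteration implementing the updates above. It assumes mean-centered data stacked in an $n \times p$ array and represents the diagonal $\Psi$ as a length-$p$ vector; the function and variable names are ours and are not part of the lecture.

```python
import numpy as np

def fa_em_step(X, Lam, Psi):
    """One EM iteration for factor analysis on mean-centered data.

    X   : (n, p) array of mean-centered observations.
    Lam : (p, q) current loading matrix Lambda^(t).
    Psi : (p,)   current diagonal of Psi^(t).
    Returns the updated (Lam, Psi).
    """
    n, p = X.shape
    q = Lam.shape[1]

    # E-step: p(z_i | x_i) is Gaussian with shared covariance
    # Cov[z|x] = (I + Lam^T Psi^{-1} Lam)^{-1} and mean Cov[z|x] Lam^T Psi^{-1} x_i.
    Psi_inv_Lam = Lam / Psi[:, None]                 # Psi^{-1} Lam
    cov_z = np.linalg.inv(np.eye(q) + Lam.T @ Psi_inv_Lam)
    Ez = X @ Psi_inv_Lam @ cov_z                     # rows are E[z_i | x_i]

    # Sufficient statistics: sum_i x_i E[z_i]^T and sum_i E[z_i z_i^T].
    sum_xz = X.T @ Ez                                # (p, q)
    sum_zz = n * cov_z + Ez.T @ Ez                   # (q, q)

    # M-step: Lambda^(t+1) = (sum_i x_i E[z_i]^T)(sum_i E[z_i z_i^T])^{-1}
    Lam_new = sum_xz @ np.linalg.inv(sum_zz)
    # Psi^(t+1) = (1/n) diag(sum_i x_i x_i^T - Lambda^(t+1) sum_i E[z_i] x_i^T)
    Psi_new = (np.einsum('ij,ij->j', X, X) - np.einsum('jk,jk->j', Lam_new, sum_xz)) / n
    return Lam_new, Psi_new
```

Iterating this step to convergence (monitoring the observed-data log-likelihood) yields the EM estimates of $\Lambda$ and $\Psi$.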
10.1.3 Observations

There are several connections between FA and previous models/algorithms we have considered. We might consider FA as similar to Gaussian mixture modeling but with the latent variables $z_i$ continuous rather than discrete. We can also draw similarities between FA and PCA. Both methods describe data using a lower-dimensional linear representation. However, factor analysis allows for a more general covariance structure than PCA does, and so the loadings and factors derived from factor analysis do not in general correspond to the results of PCA.

In the case that $\Psi$ is restricted to be isotropic (i.e., $\Psi = \sigma^2 I$ for unknown $\sigma^2$) we recover the probabilistic PCA (PPCA) model (Tipping & Bishop, 1999). In this restricted case there are closed-form MLEs. If $U$ is the matrix whose columns are the top $q$ eigenvectors of the empirical covariance $\frac{1}{n} X^T X$, and $\lambda_1, \ldots, \lambda_p$ are its eigenvalues, then we have

$$\hat{\sigma}^2_{\text{MLE}} = \frac{1}{p - q} \sum_{j = q+1}^{p} \lambda_j, \qquad \hat{\Lambda}_{\text{MLE}} = U \left( \operatorname{diag}(\lambda_1, \ldots, \lambda_q) - \hat{\sigma}^2_{\text{MLE}} I \right)^{1/2}$$

(a short numerical sketch of these estimates appears at the end of this subsection). In this restricted setup, the factor analysis loadings (columns of $\hat{\Lambda}$) span the same subspace as the PCA loadings $U$. Moreover, if we consider $\sigma^2$ as known, then as $\sigma^2 \to 0$, PPCA actually recovers the PCA algorithm. This is another example of small-variance asymptotics, like we have seen before.

We should also mention a few caveats to using factor analysis. First, the FA parameters are in general not identifiable. For example, given an orthogonal matrix $O$ (such that $OO^T = O^T O = I$), the parameters $\Lambda$ and $\Lambda O$ will give rise to the same distribution of $x_i$. Hence, interpretation of the learned values of $\Lambda$ and $z_i$ must be done with care.

Even apart from these interpretability issues, factor analysis treats datapoints as independent draws. What if our data has a known, e.g., sequential, dependence structure? Such structure arises in a variety of settings:

- Tracking 3D object movement given radar or video
- Autopilot, in which we would like to estimate the state of a vehicle over time from internal and external sensors
- The inference of evolving market factors from financial time series
- Character recognition based on touch screen contact over time
- GPS navigation
- Recommender systems, in which we aim to estimate users' preferences over time

We will next investigate a probabilistic model designed for data with such a known sequential dependence structure.
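Before turning to the sequential setting, here is a minimal numpy sketch of the closed-form PPCA estimates above, assuming mean-centered data stacked in an $n \times p$ array $X$ and a chosen number of factors $q$; the function name is ours.

```python
import numpy as np

def ppca_mle(X, q):
    """Closed-form PPCA MLEs for mean-centered data X (n x p) with q factors."""
    n, p = X.shape
    # Eigendecomposition of the empirical covariance (1/n) X^T X.
    evals, evecs = np.linalg.eigh(X.T @ X / n)
    order = np.argsort(evals)[::-1]            # sort eigenvalues in decreasing order
    evals, evecs = evals[order], evecs[:, order]

    # sigma^2: average of the discarded eigenvalues lambda_{q+1}, ..., lambda_p.
    sigma2 = evals[q:].mean()
    # Lambda = U (diag(lambda_1, ..., lambda_q) - sigma^2 I)^{1/2}.
    Lam = evecs[:, :q] * np.sqrt(evals[:q] - sigma2)
    return Lam, sigma2
```

As noted above, the columns of the returned loading matrix span the same subspace as the top $q$ PCA directions; they differ from the PCA loadings only by the column-wise rescaling.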
[Figure 10.1: Graphical model for the LGSSM]

10.2 Linear Gaussian State Space Model

The linear Gaussian state space model (LGSSM) is a generalization of factor analysis to the setting of sequential continuous data. Under this model, we view our data sequence $x_0, x_1, \ldots, x_T \in \mathbb{R}^p$ as a random draw from the following generative process:

(0) $z_0 \sim \mathcal{N}(0, \Sigma_0)$: sample the initial state in $\mathbb{R}^q$.

(1) $z_t = A z_{t-1} + w_{t-1}$, where $w_{t-1} \overset{iid}{\sim} \mathcal{N}(0, Q)$, or equivalently $z_t \mid z_{t-1} \sim \mathcal{N}(A z_{t-1}, Q)$. That is, $z_t$ is sampled from linear Gaussian dynamics given the prior state $z_{t-1}$ via the unknown transition matrix $A \in \mathbb{R}^{q \times q}$ and unknown covariance matrix $Q \in \mathbb{R}^{q \times q}$.

(2) $x_t = C z_t + v_t$ for $v_t \overset{iid}{\sim} \mathcal{N}(0, R)$, or $x_t \mid z_t \sim \mathcal{N}(C z_t, R)$. That is, $x_t$ is the observation emitted given the state $z_t$, normally distributed with mean $C z_t$, where $C \in \mathbb{R}^{p \times q}$ is the unknown loadings matrix and $R \in \mathbb{R}^{p \times p}$ is an unknown covariance matrix.

Notice that this is similar to the emission model from factor analysis but with a more general covariance matrix $R$ and with dependent states.

10.2.1 Graphical Model

The LGSSM graphical model is the same as the Hidden Markov Model graphical model, since the two models have identical conditional independence structures. However, in the present setting, we have Gaussian as opposed to discrete hidden variables $z_t$.

10.2.2 Unsupervised Learning Goal

Our unsupervised learning goal is to draw inferences about the hidden states $z_0, z_1, \ldots, z_T$. Here are three of the most common inferential tasks:

(1) Filtering. Infer the current state given the history of observations, $P(z_t \mid x_0, \ldots, x_t)$. E.g., what is the current state of the missile given its position over some past time?

(2) Smoothing. Infer a past state given observations, $P(z_s \mid x_0, \ldots, x_t)$ where $s < t$. E.g., where did the missile originate given we observed it over some time?
(3) Prediction. Predict a future state given observations, $P(z_u \mid x_0, \ldots, x_t)$ where $u > t$. E.g., where would we expect the missile to be some time from now?

In this lecture and the next, we will detail recursive algorithms for carrying out filtering and smoothing, assuming all model parameters are known.

10.2.3 Kalman Filter

The Kalman filter is an algorithm for filtering in an LGSSM where the parameters are known. We wish to find the probability distribution of the current state given the history of observations. Since the states and the observations are jointly Gaussian, it suffices to find the mean and covariance of the conditional distribution, which is also Gaussian. We introduce the following shorthand notation for the filtered means and covariances:

$$\hat{z}_{t \mid t} = \mathbb{E}[z_t \mid x_{0:t}], \qquad P_{t \mid t} = \mathbb{E}[(z_t - \hat{z}_{t \mid t})(z_t - \hat{z}_{t \mid t})^T \mid x_{0:t}].$$

We will also be interested in computing the one-step prediction means and covariances $\hat{z}_{t \mid t-1}, P_{t \mid t-1}$ of $P(z_t \mid x_{0:t-1})$.

Filtering Strategy

We will derive our filtering algorithm via a two-step recursion:

(1) Time update: Compute the prediction distribution $P(z_{t+1} \mid x_{0:t})$ given the last filtered distribution $P(z_t \mid x_{0:t})$.

(2) Measurement update: Compute the new filtered distribution $P(z_{t+1} \mid x_{0:t+1})$ given the prediction distribution $P(z_{t+1} \mid x_{0:t})$.

Time Update

We will use the fact that $z_{t+1} = A z_t + w_t$ to compute the mean and covariance of the prediction distribution from the filtered distribution:

$$\hat{z}_{t+1 \mid t} = \mathbb{E}[A z_t + w_t \mid x_{0:t}] = A \mathbb{E}[z_t \mid x_{0:t}] + 0 = A \hat{z}_{t \mid t}$$

$$\begin{aligned}
P_{t+1 \mid t} &= \mathbb{E}[(z_{t+1} - \hat{z}_{t+1 \mid t})(z_{t+1} - \hat{z}_{t+1 \mid t})^T \mid x_{0:t}] \\
&= \mathbb{E}[(A z_t + w_t - A \hat{z}_{t \mid t})(A z_t + w_t - A \hat{z}_{t \mid t})^T \mid x_{0:t}] \\
&= A \, \mathbb{E}[(z_t - \hat{z}_{t \mid t})(z_t - \hat{z}_{t \mid t})^T \mid x_{0:t}] \, A^T + \mathbb{E}[w_t w_t^T \mid x_{0:t}] + 0 \\
&= A P_{t \mid t} A^T + Q.
\end{aligned}$$
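To make the recursion concrete, here is a minimal numpy sketch of the time update just derived; the function and variable names are ours, and the measurement update is deferred to the next lecture.

```python
import numpy as np

def time_update(z_filt, P_filt, A, Q):
    """Kalman filter time update: filtered (mean, cov) at t -> predicted (mean, cov) at t+1.

    z_filt : (q,)   filtered mean  E[z_t | x_{0:t}]
    P_filt : (q, q) filtered cov   Cov[z_t | x_{0:t}]
    A, Q   : transition matrix and state-noise covariance.
    """
    z_pred = A @ z_filt              # zhat_{t+1|t} = A zhat_{t|t}
    P_pred = A @ P_filt @ A.T + Q    # P_{t+1|t} = A P_{t|t} A^T + Q
    return z_pred, P_pred
```

In a full filter, this step alternates with a measurement update that incorporates the new observation $x_{t+1}$ at each time index.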