CS281A/Stat241A Lecture 17
Factor Analysis and State Space Models
Peter Bartlett
Key ideas of this lecture
- Factor Analysis. Recall: Gaussian factors plus observations. Parameter estimation with EM.
- The vec operator. Motivation: natural parameters of the Gaussian. Linear functions of matrices as inner products. The Kronecker product.
- State Space Models. Linear dynamical systems, Gaussian disturbances. All distributions are Gaussian: parameters suffice.
Factor Analysis: Definition
Factors $X \in \mathbb{R}^p$, observations $Y \in \mathbb{R}^d$.
Local conditionals:
$$p(x) = \mathcal{N}(x \mid 0, I), \qquad p(y \mid x) = \mathcal{N}(y \mid \mu + \Lambda x, \Psi).$$
Factor Analysis
[Figure: geometric picture. The loading vectors $\lambda_1, \lambda_2$ (columns of $\Lambda$) span a subspace containing $\Lambda x$; the observation mean is this point offset by $\mu$.]
Factor Analysis: Definition
Local conditionals:
$$p(x) = \mathcal{N}(x \mid 0, I), \qquad p(y \mid x) = \mathcal{N}(y \mid \mu + \Lambda x, \Psi).$$
The mean of $y$ is $\mu \in \mathbb{R}^d$. The matrix of factor loadings is $\Lambda \in \mathbb{R}^{d \times p}$. The noise covariance $\Psi \in \mathbb{R}^{d \times d}$ is diagonal.
Thus, there are $d + dp + d$ parameters; for $p \ll d$ this is far fewer than the order $d^2$ needed for a full covariance.
Factor Analysis: Marginals, Conditionals
Theorem:
1. $Y \sim \mathcal{N}(\mu, \Lambda\Lambda^\top + \Psi)$.
2. $(X, Y) \sim \mathcal{N}\left(\begin{pmatrix} 0 \\ \mu \end{pmatrix}, \Sigma\right)$, with $\Sigma = \begin{pmatrix} I & \Lambda^\top \\ \Lambda & \Lambda\Lambda^\top + \Psi \end{pmatrix}$.
3. $p(x \mid y)$ is Gaussian, with mean $\Lambda^\top(\Lambda\Lambda^\top + \Psi)^{-1}(y - \mu)$ and covariance $I - \Lambda^\top(\Lambda\Lambda^\top + \Psi)^{-1}\Lambda$.
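As a quick numerical companion to part 3 of the theorem, here is a minimal numpy sketch of the posterior computation (the function name and shapes are illustrative assumptions, not from the lecture):

```python
import numpy as np

def fa_posterior(y, mu, Lam, Psi):
    """Posterior p(x | y) for the factor analysis model.

    Returns the mean Lam^T (Lam Lam^T + Psi)^{-1} (y - mu) and
    covariance I - Lam^T (Lam Lam^T + Psi)^{-1} Lam from the theorem.
    """
    d, p = Lam.shape
    Sigma_y = Lam @ Lam.T + Psi          # marginal covariance of y (part 1)
    K = Lam.T @ np.linalg.inv(Sigma_y)   # p x d matrix Lam^T Sigma_y^{-1}
    mean = K @ (y - mu)
    cov = np.eye(p) - K @ Lam
    return mean, cov
```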
Factor Analysis: Parameter Estimation
i.i.d. data $y = (y_1, \ldots, y_n)$. The log likelihood is
$$\ell(\theta; y) = \log p(y \mid \theta) = \text{const} - \frac{n}{2}\log\big|\Lambda\Lambda^\top + \Psi\big| - \frac{1}{2}\sum_i (y_i - \mu)^\top (\Lambda\Lambda^\top + \Psi)^{-1}(y_i - \mu).$$
Factor Analysis: Parameter Estimation
Let's first consider estimation of $\mu$: $\hat\mu_{ML} = \frac{1}{n}\sum_{i=1}^n y_i$, as in the full covariance case.
From now on, let's assume $\mu = 0$, so we can ignore it:
$$\ell(\theta; y) = \text{const} - \frac{n}{2}\log\big|\Lambda\Lambda^\top + \Psi\big| - \frac{1}{2}\sum_i y_i^\top (\Lambda\Lambda^\top + \Psi)^{-1} y_i.$$
But how do we find a factorized covariance matrix, $\Sigma = \Lambda\Lambda^\top + \Psi$?
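Since EM (below) only guarantees that this objective increases, a small sketch of the centered log likelihood is handy for monitoring progress; the helper name fa_loglik is my own:

```python
import numpy as np

def fa_loglik(Y, Lam, Psi):
    """Log likelihood (up to an additive constant) of centered data.

    Y is n x d with rows y_i; computes
    -n/2 log|Lam Lam^T + Psi| - 1/2 sum_i y_i^T (Lam Lam^T + Psi)^{-1} y_i.
    """
    n, d = Y.shape
    Sigma = Lam @ Lam.T + Psi
    sign, logdet = np.linalg.slogdet(Sigma)
    quad = np.sum(Y * np.linalg.solve(Sigma, Y.T).T)  # sum_i y_i^T Sigma^{-1} y_i
    return -0.5 * n * logdet - 0.5 * quad
```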
Factor Analysis: EM
We follow the usual EM recipe:
1. Write out the complete log likelihood, $\ell_c$.
2. E step: calculate $E[\ell_c \mid y]$. Typically, find $E[\text{suff. stats} \mid y]$.
3. M step: maximize $E[\ell_c \mid y]$.
Factor Analysis: EM
1. Write out the complete log likelihood, $\ell_c$:
$$\ell_c(\theta) = \sum_{i=1}^n \log\big(p(x_i \mid \theta)\, p(y_i \mid x_i, \theta)\big) = \text{const} - \frac{n}{2}\log|\Psi| - \frac{1}{2}\sum_{i=1}^n x_i^\top x_i - \frac{1}{2}\sum_{i=1}^n (y_i - \Lambda x_i)^\top \Psi^{-1}(y_i - \Lambda x_i).$$
Factor Analysis: EM
2. E step: calculate $E[\ell_c \mid y]$. Typically, find $E[\text{suff. stats} \mid y]$.
Claim: the sufficient statistics are $x_i$ and $x_i x_i^\top$.
Indeed, $E[\ell_c \mid y]$ is a constant, plus $-\frac{n}{2}\log|\Psi|$, plus
$$-\frac{1}{2}\, E\left[\sum_{i=1}^n x_i^\top x_i + (y_i - \Lambda x_i)^\top \Psi^{-1}(y_i - \Lambda x_i) \,\middle|\, y\right]
= -\frac{1}{2}\sum_{i=1}^n \Big(E(x_i^\top x_i \mid y_i) + E\big(\mathrm{tr}\big((y_i - \Lambda x_i)^\top \Psi^{-1}(y_i - \Lambda x_i)\big) \mid y_i\big)\Big)
= -\frac{1}{2}\sum_{i=1}^n E(x_i^\top x_i \mid y_i) - \frac{n}{2}\, E\big(\mathrm{tr}(S\Psi^{-1}) \mid y\big),$$
Factor Analysis: EM. E-step
$$E[\ell_c \mid y] = \text{const} - \frac{n}{2}\log|\Psi| - \frac{1}{2}\sum_{i=1}^n E(x_i^\top x_i \mid y_i) - \frac{n}{2}\, E\big(\mathrm{tr}(S\Psi^{-1}) \mid y\big), \qquad \text{with } S = \frac{1}{n}\sum_{i=1}^n (y_i - \Lambda x_i)(y_i - \Lambda x_i)^\top.$$
We used the fact that the trace (the sum of the diagonal elements) of a matrix satisfies $\mathrm{tr}(ABC) = \mathrm{tr}(CAB)$, as long as the products $ABC$ and $CAB$ are square.
Factor Analysis: EM. E-step
Now we can calculate
$$E(S \mid y) = \frac{1}{n}\sum_{i=1}^n E\big[y_i y_i^\top - 2\Lambda x_i y_i^\top + \Lambda x_i x_i^\top \Lambda^\top \,\big|\, y_i\big]
= \frac{1}{n}\sum_{i=1}^n y_i y_i^\top - 2\Lambda E[x_i \mid y_i]\, y_i^\top + \Lambda E[x_i x_i^\top \mid y_i]\,\Lambda^\top,$$
and from this it is clear that the expected sufficient statistics are $E(x_i \mid y_i)$ and $E(x_i x_i^\top \mid y_i)$ (and its trace, $E(x_i^\top x_i \mid y_i)$).
Factor Analysis: EM. E-step
We calculated these conditional expectations earlier:
$$E[x_i \mid y_i] = \Lambda^\top \big(\Lambda\Lambda^\top + \Psi\big)^{-1}(y_i - \mu),$$
$$\mathrm{Var}[x_i \mid y_i] = I - \Lambda^\top \big(\Lambda\Lambda^\top + \Psi\big)^{-1}\Lambda,$$
$$E[x_i x_i^\top \mid y_i] = \mathrm{Var}[x_i \mid y_i] + E[x_i \mid y_i]\, E[x_i \mid y_i]^\top.$$
So we can plug them in to calculate the expected complete log likelihood.
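As a concrete companion, a minimal numpy sketch of this E step (assuming the data have been centered so $\mu = 0$; function and variable names are my own):

```python
import numpy as np

def fa_e_step(Y, Lam, Psi):
    """E step: expected sufficient statistics given centered observations.

    Y is n x d. Returns Ex (n x p, rows E[x_i | y_i]) and
    Exx (p x p, the sum over i of E[x_i x_i^T | y_i]).
    """
    n, d = Y.shape
    p = Lam.shape[1]
    K = Lam.T @ np.linalg.inv(Lam @ Lam.T + Psi)  # p x d
    Ex = Y @ K.T                                  # row i is K y_i
    cov = np.eye(p) - K @ Lam                     # Var[x_i | y_i], same for all i
    Exx = n * cov + Ex.T @ Ex                     # sum_i (cov + E[x_i]E[x_i]^T)
    return Ex, Exx
```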
Factor Analysis: EM. M-step
3. M step: maximize $E[\ell_c \mid y]$.
For $\Lambda$, this is equivalent to minimizing
$$n\,\mathrm{tr}\big(E(S \mid y)\,\Psi^{-1}\big) = \mathrm{tr}\big((Y - X\Lambda^\top)^\top (Y - X\Lambda^\top)\,\Psi^{-1}\big),$$
where $Y \in \mathbb{R}^{n \times d}$ has rows $y_i^\top$ and $X \in \mathbb{R}^{n \times p}$ has rows $x_i^\top$.
This is a matrix version of linear regression, with each of the $d$ components of the squared error weighted by one of the diagonal entries of $\Psi^{-1}$. Since the components can be minimized separately, the weights do not affect the minimizer: the $\Psi$ matrix plays no role, and the solution satisfies the normal equations.
Factor Analysis: EM. M-step
Normal equations:
$$\hat\Lambda^\top = \left(\sum_{i=1}^n E[x_i x_i^\top \mid y_i]\right)^{-1} \sum_{i=1}^n E[x_i \mid y_i]\, y_i^\top.$$
(These are the same sufficient statistics as in linear regression; here we need to compute the expectations of the sufficient statistics given the observations.)
Factor Analysis: EM. M-step
For $\Psi$, we need to find a diagonal $\Psi$ to minimize
$$\log|\Psi| + \mathrm{tr}\big(E(S \mid y)\,\Psi^{-1}\big) = \sum_{j=1}^d \big(\log\psi_j + s_j/\psi_j\big),$$
where $s_j$ are the diagonal entries of $E[S \mid y]$. Setting the derivative with respect to $\psi_j$ to zero, $1/\psi_j - s_j/\psi_j^2 = 0$, shows this is minimized at $\psi_j = s_j$.
Factor Analysis: EM. Summary.
E step: calculate the expected sufficient statistics:
$$E[x_i \mid y_i] = \Lambda^\top \big(\Lambda\Lambda^\top + \Psi\big)^{-1}(y_i - \mu),$$
$$\mathrm{Var}[x_i \mid y_i] = I - \Lambda^\top \big(\Lambda\Lambda^\top + \Psi\big)^{-1}\Lambda,$$
$$E[x_i x_i^\top \mid y_i] = \mathrm{Var}[x_i \mid y_i] + E[x_i \mid y_i]\, E[x_i \mid y_i]^\top,$$
and use these to compute the (diagonal entries of the) matrix
$$E(S \mid y) = \frac{1}{n}\sum_{i=1}^n y_i y_i^\top - 2\Lambda E[x_i \mid y_i]\, y_i^\top + \Lambda E[x_i x_i^\top \mid y_i]\,\Lambda^\top.$$
Factor Analysis: EM. Summary.
M step: maximize $E[\ell_c \mid y]$:
$$\hat\Lambda^\top = \left(\sum_{i=1}^n E[x_i x_i^\top \mid y_i]\right)^{-1} \sum_{i=1}^n E[x_i \mid y_i]\, y_i^\top, \qquad \hat\Psi = \mathrm{diag}\big(E(S \mid y)\big),$$
and recall: $\hat\mu = \frac{1}{n}\sum_{i=1}^n y_i$.
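Combining the two steps, a minimal sketch of one full EM iteration, reusing the fa_e_step helper sketched above (an illustrative implementation on centered data, not the lecture's code):

```python
import numpy as np

def fa_em_step(Y, Lam, Psi):
    """One EM iteration for factor analysis on centered data Y (n x d)."""
    n, d = Y.shape
    Ex, Exx = fa_e_step(Y, Lam, Psi)              # E step (defined above)
    # Normal equations: Lam^T = Exx^{-1} sum_i E[x_i | y_i] y_i^T
    Lam_new = np.linalg.solve(Exx, Ex.T @ Y).T
    # E(S | y), using the updated Lam; only its diagonal is needed for Psi.
    ES = (Y.T @ Y - 2 * Lam_new @ (Ex.T @ Y) + Lam_new @ Exx @ Lam_new.T) / n
    Psi_new = np.diag(np.diag(ES))                # keep only the diagonal
    return Lam_new, Psi_new
```

In practice one iterates fa_em_step and checks that fa_loglik (sketched earlier) increases at every step.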
Key ideas of this lecture
- Factor Analysis. Recall: Gaussian factors plus observations. Parameter estimation with EM.
- The vec operator. Motivation: natural parameters of the Gaussian. Linear functions of matrices as inner products. The Kronecker product.
- State Space Models. Linear dynamical systems, Gaussian disturbances. All distributions are Gaussian: parameters suffice. Inference: Kalman filter and smoother. Parameter estimation with EM.
The vec operator: Motivation
Consider the multivariate Gaussian:
$$p(x) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\left(-\frac{1}{2}(x - \mu)^\top \Sigma^{-1}(x - \mu)\right).$$
What is the natural parameterization?
$$p(x) = h(x)\exp\big(\theta^\top T(x) - A(\theta)\big).$$
The vec operator: Motivation
If we define
$$\Lambda = \Sigma^{-1}, \qquad \eta = \Sigma^{-1}\mu,$$
then we can write
$$(x - \mu)^\top \Sigma^{-1}(x - \mu) = \mu^\top \Sigma^{-1}\mu - 2\mu^\top \Sigma^{-1}x + x^\top \Sigma^{-1}x = \eta^\top \Lambda^{-1}\eta - 2\eta^\top x + x^\top \Lambda x.$$
Why is this of the form $\theta^\top T(x) - A(\theta)$?
The vec operator: Example
$$\begin{pmatrix} x_1 & x_2 \end{pmatrix}\begin{pmatrix} \lambda_{11} & \lambda_{12} \\ \lambda_{21} & \lambda_{22} \end{pmatrix}\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
= \lambda_{11}x_1^2 + \lambda_{21}x_1x_2 + \lambda_{12}x_1x_2 + \lambda_{22}x_2^2
= \begin{pmatrix} \lambda_{11} & \lambda_{21} & \lambda_{12} & \lambda_{22} \end{pmatrix}\begin{pmatrix} x_1^2 \\ x_1x_2 \\ x_2x_1 \\ x_2^2 \end{pmatrix}
= \mathrm{vec}(\Lambda)^\top \mathrm{vec}(xx^\top).$$
The vec operator
Definition [vec]: for a matrix $A$ with column vectors $a_1, a_2, \ldots, a_n$, that is, $A = \begin{pmatrix} a_1 & a_2 & \cdots & a_n \end{pmatrix}$,
$$\mathrm{vec}(A) = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}.$$
The vec operator
Theorem: $\mathrm{tr}(A^\top B) = \mathrm{vec}(A)^\top \mathrm{vec}(B)$.
(The trace of the product is the sum of the corresponding row-column inner products.)
In the example above,
$$x^\top \Lambda x = \mathrm{tr}\big(x^\top \Lambda x\big) = \mathrm{tr}\big(\Lambda xx^\top\big) = \underbrace{\mathrm{vec}(\Lambda^\top)^\top}_{\text{nat. param.}} \underbrace{\mathrm{vec}(xx^\top)}_{\text{suff. stat.}}.$$
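A quick numpy check of these two identities on random inputs (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))
x = rng.standard_normal(2)
Lam = rng.standard_normal((2, 2))

vec = lambda M: M.flatten(order="F")  # stack columns, as in the definition

# tr(A^T B) = vec(A)^T vec(B)
assert np.isclose(np.trace(A.T @ B), vec(A) @ vec(B))

# x^T Lam x = vec(Lam^T)^T vec(x x^T)
assert np.isclose(x @ Lam @ x, vec(Lam.T) @ vec(np.outer(x, x)))
```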
The Kronecker product
We also use the vec operator for matrix equations like $X\Theta Y = Z$, or, for instance, for least squares minimization of $\|X\Theta Y - Z\|$. Then we can write
$$\mathrm{vec}(Z) = \mathrm{vec}(X\Theta Y) = (Y^\top \otimes X)\,\mathrm{vec}(\Theta),$$
where $Y^\top \otimes X$ is the Kronecker product of $Y^\top$ and $X$:
The Kronecker product
Definition [Kronecker product]: the Kronecker product of $A \in \mathbb{R}^{m \times n}$ and $B$ is
$$A \otimes B = \begin{pmatrix} a_{11}B & \cdots & a_{1n}B \\ \vdots & \ddots & \vdots \\ a_{m1}B & \cdots & a_{mn}B \end{pmatrix}.$$
The Kronecker product
Theorem [Kronecker product and vec operator]:
$$\mathrm{vec}(ABC) = (C^\top \otimes A)\,\mathrm{vec}(B).$$
The Kronecker product and vec operator are used throughout matrix algebra. They are convenient for differentiating a function of a matrix, and they arise:
- in statistical models involving products of features;
- in systems theory; and
- in stability theory.
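The theorem is easy to verify numerically; a quick numpy check with illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
B = rng.standard_normal((3, 4))
C = rng.standard_normal((4, 5))

vec = lambda M: M.flatten(order="F")  # column-stacking vec

# vec(ABC) = (C^T kron A) vec(B)
assert np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B))
```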
Key ideas of this lecture
- Factor Analysis. Recall: Gaussian factors plus observations. Parameter estimation with EM.
- The vec operator. Motivation: natural parameters of the Gaussian. Linear functions of matrices as inner products. The Kronecker product.
- State Space Models. Linear dynamical systems, Gaussian disturbances. All distributions are Gaussian: parameters suffice. Inference: Kalman filter and smoother. Parameter estimation with EM.
State Space Models

               discrete          Gaussian
  static       mixture model     factor model
  dynamic      HMM               state space model

The state space model is the dynamic analogue of the factor model, just as the HMM is the dynamic analogue of the mixture model.
State Space Models
In linear dynamical systems,
- the evolution of the state $x_t$, and
- the relationship between the state $x_t$ and the observation $y_t$
are linear, with Gaussian noise.
State Space Models
State space models revolutionized control theory in the late 1950s and early 1960s. Before they were introduced, classical control theory could cope only with decoupled, low-order systems; state space models allowed the effective control of complex systems like spacecraft and fast aircraft.
The directed graph is identical to that of an HMM, so the conditional independencies are identical: given the current state (not the observation), the past and the future are independent.
Linear System: Definition
[Figure: graphical model. A Markov chain of states $x_{t-1} \to x_t \to x_{t+1}$, with $x_t \in \mathbb{R}^p$, where each state emits an observation $y_t \in \mathbb{R}^d$.]
Linear System: Definition
State: $x_t \in \mathbb{R}^p$. Observation: $y_t \in \mathbb{R}^d$.
Initial state: $x_0 \sim \mathcal{N}(0, P_0)$.
Dynamics: $x_{t+1} = Ax_t + Gw_t$, with $w_t \sim \mathcal{N}(0, Q)$.
Observation: $y_t = Cx_t + v_t$, with $v_t \sim \mathcal{N}(0, R)$.
Linear System: Observations
1. All the distributions are Gaussian (joints, marginals, conditionals), so they can be described by their means and covariances.
2. The conditional distribution of the next state, $x_{t+1} \mid x_t$, is $\mathcal{N}(Ax_t, GQG^\top)$. To see this:
$$E(x_{t+1} \mid x_t) = Ax_t + G\,E(w_t \mid x_t) = Ax_t, \qquad \mathrm{Var}(x_{t+1} \mid x_t) = E\big(Gw_t(Gw_t)^\top\big) = GQG^\top.$$
Linear System: Observations
3. The marginal distribution of $x_t$ is $\mathcal{N}(0, P_t)$, where $P_0$ is given and, for $t \geq 0$,
$$P_{t+1} = AP_tA^\top + GQG^\top.$$
To see this:
$$E\,x_{t+1} = E\,E(x_{t+1} \mid x_t) = A\,E(x_t) = 0,$$
$$P_{t+1} = E\big(x_{t+1}x_{t+1}^\top\big) = E\big((Ax_t + Gw_t)(Ax_t + Gw_t)^\top\big) = AP_tA^\top + GQG^\top,$$
where the cross terms vanish because $w_t$ is zero-mean and independent of $x_t$.
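A minimal numpy sketch of this covariance recursion (the helper name is my own):

```python
import numpy as np

def marginal_covariances(A, G, Q, P0, T):
    """Propagate P_{t+1} = A P_t A^T + G Q G^T for t = 0, ..., T-1."""
    P = [P0]
    for _ in range(T):
        P.append(A @ P[-1] @ A.T + G @ Q @ G.T)
    return P  # P[t] is the covariance of the marginal of x_t
```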
Inference in SSMs
Filtering: $p(x_t \mid y_0, \ldots, y_t)$.
Smoothing: $p(x_t \mid y_0, \ldots, y_T)$.
For inference, it suffices to calculate the appropriate conditional means and covariances.
Inference in SSMs: Notation
$$\hat{x}_{t|s} = E(x_t \mid y_0, \ldots, y_s), \qquad P_{t|s} = E\big((x_t - \hat{x}_{t|s})(x_t - \hat{x}_{t|s})^\top \,\big|\, y_0, \ldots, y_s\big).$$
Filtering: $x_t \mid y_0, \ldots, y_t \sim \mathcal{N}(\hat{x}_{t|t}, P_{t|t})$. Smoothing: $x_t \mid y_0, \ldots, y_T \sim \mathcal{N}(\hat{x}_{t|T}, P_{t|T})$.
The Kalman filter is an inference algorithm for $\hat{x}_{t|t}, P_{t|t}$; the Kalman smoother is an inference algorithm for $\hat{x}_{t|T}, P_{t|T}$.
The Kalman Filter
[Figure: the recursion on the graphical model, alternating two updates.]
Start with $p(x_t \mid y_0, \ldots, y_t)$: $\hat{x}_{t|t}, P_{t|t}$.
Time update: $p(x_{t+1} \mid y_0, \ldots, y_t)$: $\hat{x}_{t+1|t}, P_{t+1|t}$.
Measurement update: $p(x_{t+1} \mid y_0, \ldots, y_{t+1})$: $\hat{x}_{t+1|t+1}, P_{t+1|t+1}$.
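The slide gives only the shape of the recursion; the update formulas are not derived here. As a companion, here is a minimal numpy sketch of one filter step using the standard Kalman time and measurement updates (the formulas are carried in from the standard treatment, not from these slides):

```python
import numpy as np

def kalman_step(x_hat, P, y_next, A, G, Q, C, R):
    """One Kalman filter recursion: (x_{t|t}, P_{t|t}) -> (x_{t+1|t+1}, P_{t+1|t+1}).

    Standard updates for the model x_{t+1} = A x_t + G w_t, y_t = C x_t + v_t.
    """
    # Time update: push the filtered estimate through the dynamics.
    x_pred = A @ x_hat                  # x_{t+1|t}
    P_pred = A @ P @ A.T + G @ Q @ G.T  # P_{t+1|t}

    # Measurement update: correct with the new observation y_{t+1}.
    S = C @ P_pred @ C.T + R                 # innovation covariance
    K = P_pred @ C.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (y_next - C @ x_pred)
    P_new = P_pred - K @ C @ P_pred
    return x_new, P_new
```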
Key ideas of this lecture
- Factor Analysis. Recall: Gaussian factors plus observations. Parameter estimation with EM.
- The vec operator. Motivation: natural parameters of the Gaussian. Linear functions of matrices as inner products. The Kronecker product.
- State Space Models. Linear dynamical systems, Gaussian disturbances. All distributions are Gaussian: parameters suffice. Inference: Kalman filter and smoother. Parameter estimation with EM.