Lecture 6: Expectation Maximization (EM) and clustering

Thomas Schön, Division of Automatic Control, Linköping University, Linköping, Sweden.
Email: schon@isy.liu.se, Phone: 13-1373, Office: House B, Entrance 7.

Outline of lecture 6

1. Summary of lecture 5
2. Expectation Maximization (EM)
   - General derivation
   - Example: identification of a linear state-space model
   - Example: identification of a Wiener system
3. Gaussian mixtures
   - Standard construction
   - Equivalent construction using latent variables
   - ML estimation using EM
4. Connections to the K-means algorithm for clustering (Chapter 9)

The Gaussian mixture model and the K-means algorithm will be finished during lecture 7 on Wednesday.

Summary of lecture 5

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. By assuming that the considered system is a Gaussian process, predictions can be made by computing the conditional distribution p(y* | y), where y denotes all the observations and y* the output for which we seek a prediction. This regression approach is referred to as Gaussian process regression.

The support vector machine (SVM) is a discriminative classifier that gives the maximum-margin decision boundary.

Latent variables: example

A latent variable is a variable that is not directly observed. Other common names are hidden variables, unobserved variables or missing data. An example of a latent variable is the state x_t in a state-space model. Consider the following linear Gaussian state-space (LGSS) model:

x_{t+1} = θ x_t + v_t,
y_t = x_t + e_t,

where (v_t, e_t)^T ~ N(0, diag(0.1, 0.1)).
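To make the example concrete, this minimal Python sketch simulates the LGSS model above (the function name simulate_lgss and its defaults are illustrative assumptions, not part of the lecture material); in an identification setting only y would be available, while the states x play the role of the latent variables:

import numpy as np

def simulate_lgss(theta=0.9, T=1000, q=0.1, r=0.1, seed=0):
    # Simulate x_{t+1} = theta*x_t + v_t, y_t = x_t + e_t,
    # with v_t ~ N(0, q), e_t ~ N(0, r) and a fully known initial state x_1 = 0.
    rng = np.random.default_rng(seed)
    x = np.zeros(T + 1)   # latent states x_1, ..., x_{T+1}
    y = np.zeros(T)       # observations y_1, ..., y_T
    for t in range(T):
        y[t] = x[t] + np.sqrt(r) * rng.standard_normal()
        x[t + 1] = theta * x[t] + np.sqrt(q) * rng.standard_normal()
    return x, y

x, y = simulate_lgss()    # only y is observed in practice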
Expectation Maximization (EM): strategy and idea

The Expectation Maximization (EM) algorithm computes ML estimates of unknown parameters in probabilistic models involving latent variables.

Strategy: use the structure inherent in the probabilistic model to separate the original ML problem into two closely linked subproblems, each of which is hopefully in some sense more tractable than the original problem.

EM focuses on the joint log-likelihood function of the observed variables X and the latent variables Z = {z_1, ..., z_N},

l_θ(X, Z) = ln p_θ(X, Z).

Expectation Maximization algorithm

Algorithm 1 (Expectation Maximization, EM)
1. Initialise: set i = 1 and choose an initial θ_1.
2. While not converged do:
   (a) Expectation (E) step: compute
       Q(θ, θ_i) = E_{θ_i}[ln p_θ(Z, X) | X] = ∫ ln p_θ(Z, X) p_{θ_i}(Z | X) dZ.
   (b) Maximization (M) step: compute
       θ_{i+1} = arg max_θ Q(θ, θ_i).
   (c) i ← i + 1.

EM example 1: linear system identification

Consider the following scalar LGSS model:

x_{t+1} = θ x_t + v_t,
y_t = x_t + e_t,

where (v_t, e_t)^T ~ N(0, diag(0.1, 0.1)). The initial state is fully known, x_1 = 0, and the true parameter is θ = 0.9. The identification problem is to determine the parameter θ on the basis of the observations Y = {y_1, ..., y_N} using the EM algorithm. The latent variables Z are given by the states, Z = X = {x_2, ..., x_{N+1}}. Note the difference in notation compared to Bishop: here the observations are denoted Y and the latent variables are denoted X.

EM example 1: linear system identification

The expectation (E) step:

Q(θ, θ_i) = E_{θ_i}[ln p_θ(X, Y) | Y] = ∫ ln p_θ(X, Y) p_{θ_i}(X | Y) dX.

Let us start investigating ln p_θ(X, Y). Using conditional probabilities we have

p_θ(X, Y) = p_θ(x_{N+1}, X_N, y_N, Y_{N-1}) = p_θ(x_{N+1}, y_N | X_N, Y_{N-1}) p_θ(X_N, Y_{N-1}).

According to the Markov property we have

p_θ(x_{N+1}, y_N | X_N, Y_{N-1}) = p_θ(x_{N+1}, y_N | x_N),

resulting in

p_θ(X, Y) = p_θ(x_{N+1}, y_N | x_N) p_θ(X_N, Y_{N-1}).
EM example 1: linear system identification

Repeated use of the above ideas straightforwardly yields (recall that x_1 is known)

p_θ(X, Y) = ∏_{t=1}^{N} p_θ(x_{t+1}, y_t | x_t).

According to the model we have

p_θ(x_{t+1}, y_t | x_t) = N( (x_{t+1}, y_t)^T ; (θ, 1)^T x_t, diag(0.1, 0.1) ).

EM example 1: linear system identification

The resulting Q-function is

Q(θ, θ_i) = E_{θ_i}[ln p_θ(X, Y) | Y] = const − (1/(2·0.1)) ( φ θ² − 2 ψ θ ),

where we have defined

φ = Σ_{t=1}^{N} E_{θ_i}[x_t² | Y],   ψ = Σ_{t=1}^{N} E_{θ_i}[x_t x_{t+1} | Y].

There exist explicit expressions for these expected values; for an LGSS model they follow from the smoothed state estimates.

EM example 1: linear system identification

The maximization (M) step,

θ_{i+1} = arg max_θ Q(θ, θ_i),

simply amounts to solving the following quadratic problem:

θ_{i+1} = arg max_θ ( −φ θ² + 2 ψ θ ).

The solution is given by

θ_{i+1} = ψ / φ.

EM example 1: linear system identification

Algorithm 2 (EM example 1)
1. Initialise: set i = 1 and choose an initial θ_1.
2. While not converged do:
   (a) Expectation (E) step: compute
       φ = Σ_{t=1}^{N} E_{θ_i}[x_t² | Y],   ψ = Σ_{t=1}^{N} E_{θ_i}[x_t x_{t+1} | Y].
   (b) Maximization (M) step: find the next iterate according to
       θ_{i+1} = ψ / φ.
   (c) If the log-likelihood is still increasing, i.e. L_{θ_i}(Y) − L_{θ_{i−1}}(Y) exceeds a small tolerance, update i := i + 1 and return to step 2a; otherwise terminate.
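The explicit expressions for φ and ψ can be obtained with a Kalman filter followed by an RTS smoother. The following Python sketch of Algorithm 2 is an illustration under those assumptions (it is not the MATLAB code from the technical report cited below, and it reuses the data from the simulate_lgss sketch above); the lag-one cross-covariances Cov(x_t, x_{t+1} | Y) come from the smoother gains:

import numpy as np

def em_lgss(y, theta0=0.1, q=0.1, r=0.1, n_iter=50):
    # EM for x_{t+1} = theta*x_t + v_t, y_t = x_t + e_t, with known x_1 = 0.
    # States are indexed 0..T (i.e. x_1, ..., x_{T+1}); y[t] measures state t.
    T = len(y)
    theta = theta0
    for _ in range(n_iter):
        # --- Kalman filter (x_1 is known exactly, so Pf[0] = 0 and y[0] adds nothing) ---
        xf, Pf = np.zeros(T + 1), np.zeros(T + 1)   # filtered mean / variance
        xp, Pp = np.zeros(T + 1), np.zeros(T + 1)   # predicted mean / variance
        for t in range(T):
            xp[t + 1] = theta * xf[t]
            Pp[t + 1] = theta**2 * Pf[t] + q
            if t + 1 < T:                           # measurement update with y[t+1]
                K = Pp[t + 1] / (Pp[t + 1] + r)
                xf[t + 1] = xp[t + 1] + K * (y[t + 1] - xp[t + 1])
                Pf[t + 1] = (1 - K) * Pp[t + 1]
            else:                                   # x_{T+1} has no measurement
                xf[t + 1], Pf[t + 1] = xp[t + 1], Pp[t + 1]
        # --- RTS smoother, including lag-one cross-covariances M[t] ---
        xs, Ps, M = xf.copy(), Pf.copy(), np.zeros(T)
        for t in range(T - 1, -1, -1):
            J = theta * Pf[t] / Pp[t + 1]
            xs[t] = xf[t] + J * (xs[t + 1] - xp[t + 1])
            Ps[t] = Pf[t] + J**2 * (Ps[t + 1] - Pp[t + 1])
            M[t] = J * Ps[t + 1]                    # Cov(x_t, x_{t+1} | Y)
        # --- E step quantities, then the M step theta = psi / phi ---
        phi = np.sum(xs[:T]**2 + Ps[:T])
        psi = np.sum(xs[:T] * xs[1:] + M)
        theta = psi / phi
    return theta

theta_hat = em_lgss(y)   # approaches the true value 0.9 as T grows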
EM example 1: linear system identification

Different numbers of samples N were used in Monte Carlo studies, each based on many realisations of data and the initial guess θ_1 = 0.1. The mean estimate approaches the true value θ = 0.9 as N grows. No surprise, since the ML estimator is asymptotically efficient.

Figure: the log-likelihood and the Q-function plotted as functions of the parameter over the EM iterations (from iteration 1 through iteration 11).

Nonlinear system identification using EM I/VI

A general state-space model (SSM) consists of a Markov process {x_t}_{t≥1} and a measurement process {y_t}_{t≥1}, related according to

x_{t+1} | x_t ~ f_θ(x_{t+1} | x_t, u_t),
y_t | x_t ~ h_θ(y_t | x_t, u_t),
x_1 ~ µ_θ(x_1).

Identification problem: find θ based on {u_{1:T}, y_{1:T}}.

According to the above, the first step is to compute the Q-function

Q(θ, θ_i) = E_{θ_i}[ln p_θ(X, Y) | Y].

All details, including MATLAB code, are provided in: Thomas B. Schön. An Explanation of the Expectation Maximization Algorithm. Division of Automatic Control, Linköping University, Sweden, Technical Report LiTH-ISY-R-2915, August 2009. users.isy.liu.se/rt/schon/publications/schone9.pdf
Nonlinear system identification using EM II/VI

Applying E_{θ_i}[· | Y] to

ln p_θ(X, Y) = ln p_θ(Y | X) + ln p_θ(X)
             = ln µ_θ(x_1) + Σ_{t=1}^{T−1} ln p_θ(x_{t+1} | x_t) + Σ_{t=1}^{T} ln p_θ(y_t | x_t)

results in Q(θ, θ_i) = I_1 + I_2 + I_3, where

I_1 = ∫ ln µ_θ(x_1) p_{θ_i}(x_1 | Y) dx_1,
I_2 = Σ_{t=1}^{T−1} ∫∫ ln p_θ(x_{t+1} | x_t) p_{θ_i}(x_{t+1}, x_t | Y) dx_t dx_{t+1},
I_3 = Σ_{t=1}^{T} ∫ ln p_θ(y_t | x_t) p_{θ_i}(x_t | Y) dx_t.

Nonlinear system identification using EM III/VI

This leads us to a nonlinear state smoothing problem, which we can solve using a particle smoother (PS). The PS provides us with the following approximation of the joint smoothing density,

p(X | Y) ≈ (1/M) Σ_{i=1}^{M} δ_{X^i}(X),

which allows for the following approximations of the marginal smoothing densities that we need:

p_θ(x_t | Y) ≈ p̂_θ(x_t | Y) = (1/M) Σ_{i=1}^{M} δ_{x_t^i}(x_t),
p_θ(x_{t:t+1} | Y) ≈ p̂_θ(x_{t:t+1} | Y) = (1/M) Σ_{i=1}^{M} δ_{x_{t:t+1}^i}(x_{t:t+1}).

Nonlinear system identification using EM IV/VI

Inserting the above approximations into the integrals straightforwardly yields the approximation we are looking for:

Î_1 = ∫ ln µ_θ(x_1) (1/M) Σ_{i=1}^{M} δ_{x_1^i}(x_1) dx_1 = (1/M) Σ_{i=1}^{M} ln µ_θ(x_1^i),
Î_2 = Σ_{t=1}^{T−1} ∫∫ ln p_θ(x_{t+1} | x_t) (1/M) Σ_{i=1}^{M} δ_{x_{t:t+1}^i}(x_{t:t+1}) dx_{t:t+1} = (1/M) Σ_{t=1}^{T−1} Σ_{i=1}^{M} ln p_θ(x_{t+1}^i | x_t^i),
Î_3 = Σ_{t=1}^{T} ∫ ln p_θ(y_t | x_t) (1/M) Σ_{i=1}^{M} δ_{x_t^i}(x_t) dx_t = (1/M) Σ_{t=1}^{T} Σ_{i=1}^{M} ln p_θ(y_t | x_t^i).

Nonlinear system identification using EM V/VI

It is straightforward to make use of the approximation of the Q-function just derived in order to compute gradients of the Q-function,

∂Q(θ, θ_i)/∂θ = ∂Î_1/∂θ + ∂Î_2/∂θ + ∂Î_3/∂θ.

For example (the other two terms are treated analogously),

Î_3 = (1/M) Σ_{t=1}^{T} Σ_{i=1}^{M} ln p_θ(y_t | x_t^i),
∂Î_3/∂θ = (1/M) Σ_{t=1}^{T} Σ_{i=1}^{M} ∂ ln p_θ(y_t | x_t^i)/∂θ.

With these gradients in place there are many algorithms that can be used in order to solve the maximization problem; we employ BFGS.
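As a concrete illustration, the following Python sketch evaluates Î_3 and its gradient for an assumed linear-Gaussian measurement model y_t = c x_t + e_t, e_t ~ N(0, r), given smoothed particles x_t^i; the particle smoother (e.g. an FFBS) is assumed to be available and is not implemented here, so a synthetic stand-in is used. The M step then calls an off-the-shelf BFGS routine, as on the slide:

import numpy as np
from scipy.optimize import minimize

def I3_and_grad(c, y, xs, r=0.1):
    # Particle approximation of I_3 and dI_3/dc for the assumed measurement
    # model y_t = c*x_t + e_t, e_t ~ N(0, r). xs has shape (T, M) and holds
    # smoothed particles x_t^i from a particle smoother (e.g. FFBS).
    resid = y[:, None] - c * xs                    # y_t - c*x_t^i, shape (T, M)
    logp = -0.5 * np.log(2 * np.pi * r) - resid**2 / (2 * r)
    I3 = logp.mean(axis=1).sum()                   # sum_t (1/M) sum_i ln p
    dI3 = (resid * xs / r).mean(axis=1).sum()      # same average of d ln p / dc
    return I3, dI3

# Synthetic stand-in for smoother output, only to make the sketch runnable
T, M = 100, 500
rng = np.random.default_rng(0)
xs = rng.standard_normal((T, M))                   # would come from the PS
y = 2.0 * xs.mean(axis=1) + 0.1 * rng.standard_normal(T)

# M step: maximize I_3 (standing in for the full Q-hat) with BFGS
res = minimize(lambda th: -I3_and_grad(th[0], y, xs)[0],
               x0=np.array([1.0]),
               jac=lambda th: np.array([-I3_and_grad(th[0], y, xs)[1]]),
               method="BFGS")
c_hat = res.x[0]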
Nonlinear system identification using EM VI/VI

Algorithm 3 (Nonlinear system identification using EM)
1. Initialise: set i = 1 and choose an initial θ_1.
2. While not converged do:
   (a) Expectation (E) step: run an FFBS particle smoother and compute
       Q̂(θ, θ_i) = Î_1(θ, θ_i) + Î_2(θ, θ_i) + Î_3(θ, θ_i).
   (b) Maximization (M) step: compute θ_{i+1} = arg max_θ Q̂(θ, θ_i) using an off-the-shelf numerical optimization algorithm.
   (c) i ← i + 1.

Thomas B. Schön, Adrian Wills and Brett Ninness. System identification of nonlinear state-space models. Automatica, 47(1):39-49, January 2011.

EM example 2: blind Wiener identification I/III

(Block diagram: an input u_t drives a linear dynamical system L whose output z_t passes through a static nonlinearity h(z_t, β); noisy measurements y_{1,t} and y_{2,t} are collected, with measurement noises e_{1,t} and e_{2,t}.)

The model is

x_{t+1} = A x_t + B u_t + v_t,  v_t ~ N(0, Q),
z_t = C x_t,
y_t = h(z_t, β) + e_t,  e_t ~ N(0, R).

Identification problem: find A, B, C, β, Q and R based on {y_{1,1:T}, y_{2,1:T}} using EM.

EM example 2: blind Wiener identification II/III

Second-order LGSS model with complex poles. The EM-PS is employed and terminated after a fixed number of iterations; the plots are based on a number of realisations of data. The nonlinearities (a dead-zone and a saturation) are shown on the next slide.

Figure: Bode plot of the estimated mean (black), the true system (red) and the results for all realisations (gray).

EM example 2: blind Wiener identification III/III

Figure: estimated mean (black), true static nonlinearity (red) and the results for all realisations (gray), for the dead-zone and saturation nonlinearities.

Adrian Wills, Thomas B. Schön, Lennart Ljung and Brett Ninness. Identification of Hammerstein-Wiener models. Automatica, 49(1):70-81, January 2013.
Gaussian mixture (GM): standard construction

A linear superposition of K Gaussians,

p(x) = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k),

is called a Gaussian mixture (GM). The mixture coefficients π_k satisfy

Σ_{k=1}^{K} π_k = 1,   0 ≤ π_k ≤ 1.

Interpretation: the density p(x | k) = N(x | µ_k, Σ_k) is the probability of x given that component k was chosen. The probability of choosing component k is given by the prior probability p(k) = π_k.

GM example

Consider the following GM with K = 3 components in two dimensions,

p(x) = 0.3 N(x | µ_1, Σ_1) + 0.5 N(x | µ_2, Σ_2) + 0.2 N(x | µ_3, Σ_3),

for particular numerical values of the means µ_k and covariances Σ_k.

Figure: probability density function. Figure: contour plot.

GM: problem with the standard construction

Given N independent observations {x_n}_{n=1}^{N}, the log-likelihood function is given by

ln p(X; π_{1:K}, µ_{1:K}, Σ_{1:K}) = Σ_{n=1}^{N} ln ( Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) ).

There is no closed-form solution available, due to the sum inside the logarithm. Let us now see how this problem can be separated into two simple problems using the EM algorithm. First, we introduce an equivalent construction of the Gaussian mixture by introducing a latent variable.

EM for Gaussian mixtures: intuitive preview

Let z_n be a 1-of-K binary indicator vector, with z_{nk} = 1 if component k generated x_n and z_{nk} = 0 otherwise. Based on

p(z_n) = ∏_{k=1}^{K} π_k^{z_{nk}},   p(x_n | z_n) = ∏_{k=1}^{K} N(x_n | µ_k, Σ_k)^{z_{nk}},

we have, for N independent observations {x_n}_{n=1}^{N},

p(X, Z) = ∏_{n=1}^{N} ∏_{k=1}^{K} ( π_k N(x_n | µ_k, Σ_k) )^{z_{nk}},

resulting in the following log-likelihood:

ln p(X, Z) = Σ_{n=1}^{N} Σ_{k=1}^{K} z_{nk} ( ln π_k + ln N(x_n | µ_k, Σ_k) ).   (1)

Let us now use wishful thinking and assume that Z is known. Then maximization of (1) is straightforward.
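The latent-variable construction also gives a recipe for sampling from a GM (ancestral sampling): first draw the component indicator from the categorical distribution defined by π, then draw x from the selected Gaussian. A minimal Python sketch; the mixture parameters below are illustrative stand-ins, not the numerical values from the example slide:

import numpy as np

def sample_gm(n, pis, mus, Sigmas, seed=0):
    # First draw the latent 1-of-K indicator z_n ~ Cat(pi),
    # then x_n | z_n ~ N(mu_k, Sigma_k) for the chosen component k.
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pis), size=n, p=pis)          # latent component labels
    X = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])
    return X, z

# Illustrative parameter values (K = 3 components in two dimensions)
pis = np.array([0.3, 0.5, 0.2])
mus = [np.array([0., 0.]), np.array([3., 3.]), np.array([-3., 2.])]
Sigmas = [np.eye(2), np.eye(2), np.eye(2)]
X, z = sample_gm(1000, pis, mus, Sigmas)             # z is discarded by EM below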
EM for Gaussian mixtures: explicit algorithm

Algorithm 4 (EM for Gaussian mixtures)
1. Initialise: initialize µ_k^{(1)}, Σ_k^{(1)}, π_k^{(1)} and set i = 1.
2. While not converged do:
   (a) Expectation (E) step: compute the responsibilities
       γ(z_{nk}) = π_k^{(i)} N(x_n | µ_k^{(i)}, Σ_k^{(i)}) / Σ_{j=1}^{K} π_j^{(i)} N(x_n | µ_j^{(i)}, Σ_j^{(i)}),
       for n = 1, ..., N and k = 1, ..., K.
   (b) Maximization (M) step: with N_k = Σ_{n=1}^{N} γ(z_{nk}), compute
       µ_k^{(i+1)} = (1/N_k) Σ_{n=1}^{N} γ(z_{nk}) x_n,
       Σ_k^{(i+1)} = (1/N_k) Σ_{n=1}^{N} γ(z_{nk}) (x_n − µ_k^{(i+1)}) (x_n − µ_k^{(i+1)})^T,
       π_k^{(i+1)} = N_k / N.
   (c) i ← i + 1.

Example: EM for Gaussian mixtures I/III

Consider the same Gaussian mixture as before,

p(x) = 0.3 N(x | µ_1, Σ_1) + 0.5 N(x | µ_2, Σ_2) + 0.2 N(x | µ_3, Σ_3).

Figure: probability density function. Figure: samples drawn from the Gaussian mixture p(x).

Example: EM for Gaussian mixtures II/III and III/III

Apply the EM algorithm to estimate a Gaussian mixture with K = 3 Gaussians, i.e., use the samples to compute estimates of π_1, π_2, π_3, µ_1, µ_2, µ_3, Σ_1, Σ_2, Σ_3.

Figure: initial guess. Figure: true PDF. Figure: estimate after a number of iterations of the EM algorithm.
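A minimal Python sketch of Algorithm 4, assuming data X of shape (N, d) such as the samples drawn above (the function names em_gm and gaussian_pdf are illustrative):

import numpy as np

def gaussian_pdf(X, mu, Sigma):
    # N(x | mu, Sigma) evaluated at each row of X
    d = X.shape[1]
    diff = X - mu
    Sinv = np.linalg.inv(Sigma)
    expo = -0.5 * np.einsum('ni,ij,nj->n', diff, Sinv, diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

def em_gm(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)]         # initial means: random points
    Sigmas = np.array([np.cov(X.T) for _ in range(K)])
    for _ in range(n_iter):
        # E step: responsibilities gamma(z_nk), shape (N, K)
        dens = np.stack([pis[k] * gaussian_pdf(X, mus[k], Sigmas[k])
                         for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: weighted means, covariances and mixing coefficients
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pis = Nk / N
    return pis, mus, Sigmas

pis_hat, mus_hat, Sigmas_hat = em_gm(X, K=3)

Running this on the samples above recovers estimates of the nine quantities listed (up to a permutation of the component labels, since the mixture likelihood is invariant to relabelling).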
The K-means algorithm I/II

Introduce the distortion measure J = Σ_{n=1}^{N} Σ_{k=1}^{K} r_{nk} ||x_n − µ_k||², where r_{nk} ∈ {0, 1} indicates whether data point x_n is assigned to cluster k.

Algorithm 5 (K-means algorithm, a.k.a. Lloyd's algorithm)
1. Initialize µ_k^{(1)} and set i = 1.
2. Minimize J w.r.t. r_{nk}, keeping µ_k = µ_k^{(i)} fixed:
   r_{nk}^{(i+1)} = 1 if k = arg min_j ||x_n − µ_j^{(i)}||², and 0 otherwise.
3. Minimize J w.r.t. µ_k, keeping r_{nk} = r_{nk}^{(i+1)} fixed:
   µ_k^{(i+1)} = Σ_{n=1}^{N} r_{nk}^{(i+1)} x_n / Σ_{n=1}^{N} r_{nk}^{(i+1)}.
4. If not converged, update i := i + 1 and return to step 2.

The K-means algorithm II/II

The name K-means stems from the fact that in step 3 of the algorithm, µ_k is given by the mean of all the data points assigned to cluster k.

Note the similarities between the K-means algorithm and the EM algorithm for Gaussian mixtures! K-means is deterministic, with a hard assignment of data points to clusters (no uncertainty), whereas EM is a probabilistic method that provides a soft assignment.

If the Gaussian mixture is modeled using covariance matrices Σ_k = εI, k = 1, ..., K, it can be shown that the EM algorithm for a mixture of Gaussians is equivalent to the K-means algorithm as ε → 0.

A few concepts to summarize lecture 6

Latent variable: a variable that is not directly observed. Sometimes also referred to as a hidden variable or missing data.

Expectation Maximization (EM): the EM algorithm computes maximum likelihood estimates of unknown parameters in probabilistic models involving latent variables.

Jensen's inequality: states that if f is a convex function, then E[f(x)] ≥ f(E[x]).

Clustering: unsupervised learning, where a set of observations is divided into clusters. The observations belonging to a certain cluster are similar in some sense.

K-means algorithm (a.k.a. Lloyd's algorithm): a clustering algorithm that assigns observations to K clusters such that each observation belongs to the cluster with the nearest (in the Euclidean sense) mean.
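To connect the summary back to Algorithm 5, here is a minimal Python sketch of the K-means algorithm (the function name and the random-point initialization are illustrative assumptions), alternating the two minimizations of J described above:

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), K, replace=False)]    # step 1: initialize mu_k
    for _ in range(n_iter):
        # Step 2: hard assignment, r_nk = 1 for the nearest mean
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (N, K)
        labels = d2.argmin(axis=1)
        # Step 3: each mean becomes the average of its assigned points
        new_mus = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                            else mus[k] for k in range(K)])
        if np.allclose(new_mus, mus):                # step 4: convergence check
            break
        mus = new_mus
    return labels, mus

labels, mus_hat = kmeans(X, K=3)   # hard assignments, vs. EM's soft gamma(z_nk)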