Outline Lecture 3


Lecture 3: Expectation Maximization (EM) and Clustering
Thomas Schön, Division of Automatic Control, Linköping University, Linköping, Sweden.
Email: schon@isy.liu.se

Outline of Lecture 3
1. Summary of Lecture 2
2. Expectation Maximization (EM)
   - General derivation
   - Example: identification of a linear state-space model
   - Example: identification of a Wiener system
3. Gaussian mixtures
   - Standard construction
   - Equivalent construction using latent variables
   - ML estimation using EM
4. Connections to the K-means algorithm for clustering
(Chapter 9)

Summary of Lecture 2 (I/IV)
Using the posterior mean m_N = β S_N Φ^T t, the linear regression model is given by
  y(x, m_N) = m_N^T φ(x) = β φ(x)^T S_N Φ^T t = Σ_{n=1}^{N} β φ(x)^T S_N φ(x_n) t_n.
Hence, the mean of the predictive distribution at a point x is a linear combination of the training target variables t_n,
  y(x, m_N) = Σ_{n=1}^{N} k(x, x_n) t_n,
where k(x, x') = β φ(x)^T S_N φ(x') is called the equivalent kernel (a short code sketch follows below). An alternative approach to regression is to directly make use of a localized kernel, rather than starting from basis functions. This leads to the so-called kernel methods.

Summary of Lecture 2 (II/IV)
Investigated one linear discriminant (a function that takes an input and assigns it to one of K classes) method in detail: least squares. Modelled each class as y_k(x) = w_k^T x + w_{k0} and solved the LS problem, resulting in Ŵ = (X^T X)^{-1} X^T T. Showed how probabilistic generative models could be built for classification using the strategy:
1. Model p(x | C_k), a.k.a. the class-conditional density.
2. Model p(C_k).
3. Use ML to find the parameters in p(x | C_k) and p(C_k).
4. Use Bayes' rule to find p(C_k | x).
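To make the equivalent-kernel view above concrete, here is a minimal sketch (not part of the original slides) that computes the predictive mean both directly from m_N and as a kernel-weighted sum of the targets; the Gaussian basis functions, the data and the values of α and β are illustrative assumptions.

```python
# Minimal sketch: the predictive mean of Bayesian linear regression written as a
# kernel-weighted sum of the targets. Data, alpha, beta and the Gaussian basis
# functions are illustrative assumptions, not values from the slides.
import numpy as np

rng = np.random.default_rng(0)
N, alpha, beta = 25, 2.0, 25.0
x_train = rng.uniform(0, 1, N)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.2, size=N)

centres = np.linspace(0, 1, 9)
def phi(x):                                         # Gaussian basis functions + bias term
    x = np.atleast_1d(x)
    return np.column_stack([np.ones_like(x)] +
                           [np.exp(-0.5 * ((x - c) / 0.1) ** 2) for c in centres])

Phi = phi(x_train)                                  # N x M design matrix
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train                  # posterior mean

x_star = 0.37
direct = phi(x_star) @ m_N                          # y(x*, m_N) = m_N^T phi(x*)
k = beta * phi(x_star) @ S_N @ Phi.T                # equivalent kernel k(x*, x_n)
print(direct, k @ t_train)                          # the two expressions agree
```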

Summary of Lecture 2 (III/IV)
The direct method called logistic regression was introduced. Start by stating the model
  p(C_1 | φ) = σ(w^T φ) = 1 / (1 + e^{−w^T φ}),
which results in a log-likelihood function according to
  L(w) = ln p(t | w) = Σ_{n=1}^{N} ( t_n ln y_n + (1 − t_n) ln(1 − y_n) ),
where y_n = p(C_1 | φ_n) = σ(w^T φ_n). Note that this is a nonlinear, but concave, function of w. Hence, we can easily find the global maximum using Newton's method, resulting in an algorithm known as IRLS (iteratively reweighted least squares).

Summary of Lecture 2 (IV/IV)
The likelihood function for logistic regression is
  p(t | w) = Π_{n=1}^{N} σ(w^T φ_n)^{t_n} (1 − σ(w^T φ_n))^{1 − t_n}.
Hence, computing the posterior density p(w | t) = p(t | w) p(w) / p(t) is intractable, and we considered the Laplace approximation for solving this. The Laplace approximation is a simple local approximation that is obtained by fitting a Gaussian centered around the mode (the MAP estimate) of the distribution.

Very Brief on Evaluating Classifiers
Let us define the following concepts:
- True positives (tp): items correctly classified as belonging to the class.
- True negatives (tn): items correctly classified as not belonging to the class.
- False positives (fp): items incorrectly classified as belonging to the class; in other words, a false alarm.
- False negatives (fn): items that are not classified as belonging to the class even though they should have been; in other words, a missed detection.
  Precision = tp / (tp + fp),   Recall = tp / (tp + fn)
(a short code sketch of these quantities follows below).

Latent Variables - Example
A latent variable is a variable that is not directly observed. Other common names are hidden variables, unobserved variables, or missing data. An example of a latent variable is the state x_t in a state-space model. Consider the following linear scalar state-space model
  x_{t+1} = θ x_t + v_t,
  y_t = x_t + e_t,
where v_t and e_t denote process and measurement noise, respectively.
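As a small illustration of the precision and recall definitions above, here is a hedged sketch (not from the slides) that counts true positives, false alarms and missed detections for binary label vectors; the example vectors are made up.

```python
# Minimal sketch of precision and recall for binary labels (1 = belongs to the
# class); the example vectors below are made up for illustration.
import numpy as np

def precision_recall(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))   # correctly flagged
    fp = np.sum((y_pred == 1) & (y_true == 0))   # false alarms
    fn = np.sum((y_pred == 0) & (y_true == 1))   # missed detections
    return tp / (tp + fp), tp / (tp + fn)

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])
print(precision_recall(y_true, y_pred))          # (0.75, 0.75)
```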

Expectation Maximization (EM) - Strategy and Idea
The strategy underlying the EM algorithm is to separate the original ML problem into two linked problems, each of which is hopefully easier to solve than the original problem. This separation is accomplished by exploiting the structure inherent in the probabilistic model. The key idea is to consider the joint log-likelihood function of both the observed variables X and the latent variables Z,
  L_θ(Z, X) = ln p_θ(Z, X).

Algorithm (Expectation Maximization, EM)
1. Set i = 0 and choose an initial guess θ_0.
2. Expectation (E) step: compute
   Q(θ, θ_i) = E_{θ_i}{ln p_θ(Z, X) | X} = ∫ ln p_θ(Z, X) p_{θ_i}(Z | X) dZ.
3. Maximization (M) step: compute
   θ_{i+1} = arg max_θ Q(θ, θ_i).
4. If not converged, update i := i + 1 and return to step 2.

EM Example - Linear System Identification
Consider the following linear scalar state-space model
  x_{t+1} = θ x_t + v_t,
  y_t = x_t + e_t,
where v_t and e_t denote white process and measurement noise. The initial state x_1 is fully known. Finally, the true θ-parameter is given by θ = 0.9. The identification problem is now to determine the parameter θ on the basis of the observations Y = {y_1, ..., y_N} using the EM algorithm. The latent variables Z are given by the states, Z = X = {x_1, ..., x_{N+1}}. Note the difference in notation compared to Bishop: here the observations are denoted Y and the latent variables are given by X.

The resulting Q-function is, up to terms and positive factors that do not depend on θ,
  Q(θ, θ_i) = −φ θ² + 2 ψ θ,
where we have defined
  φ = Σ_{t=1}^{N} E_{θ_i}{x_t² | Y},   ψ = Σ_{t=1}^{N} E_{θ_i}{x_t x_{t+1} | Y}.
There exist explicit expressions for these expected values.
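The EM algorithm boxed above translates almost line by line into code. Below is a minimal, generic sketch of the iteration (an illustration of how one might structure it, not code from the course); all problem-specific work lives in the callbacks.

```python
# Generic EM skeleton mirroring the algorithm above; e_step, m_step and log_lik
# are problem-specific callbacks supplied by the user (illustrative structure only).
def em(theta0, e_step, m_step, log_lik, max_iter=100, tol=1e-6):
    theta = theta0
    ll_old = log_lik(theta)
    for i in range(max_iter):
        stats = e_step(theta)          # E step: sufficient statistics of Q(., theta_i)
        theta = m_step(stats)          # M step: theta_{i+1} = argmax_theta Q(theta, theta_i)
        ll_new = log_lik(theta)
        if ll_new - ll_old < tol:      # stop when the likelihood stops increasing
            break
        ll_old = ll_new
    return theta
```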

EM Example - Linear System Identification (cont.)
The maximization step is θ_{i+1} = arg max_θ Q(θ, θ_i). Hence, the step simply amounts to solving the following quadratic problem,
  θ_{i+1} = arg max_θ ( −φ θ² + 2 ψ θ ),
which results in
  θ_{i+1} = ψ / φ.

Algorithm (EM for the scalar linear state-space model)
1. Set i = 0 and choose an initial guess θ_0.
2. Expectation (E) step: compute
   φ = Σ_{t=1}^{N} E_{θ_i}{x_t² | Y},   ψ = Σ_{t=1}^{N} E_{θ_i}{x_t x_{t+1} | Y}.
3. Maximization (M) step: find the next iterate according to θ_{i+1} = ψ / φ.
4. If the log-likelihood increase L_{θ_{i+1}}(Y) − L_{θ_i}(Y) exceeds a small tolerance (10^{−6}), update i := i + 1 and return to step 2; otherwise terminate.

Different numbers of samples N were used, in Monte Carlo studies each based on several realisations of data, all started from the same initial guess θ_0. The results come as no surprise, since the ML estimator is asymptotically efficient.
Figures (panels a and b): the log-likelihood and the Q-function plotted versus the parameter a and versus the iteration number.
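To make the boxed algorithm concrete, here is a hedged Python sketch (the lecture material points to MATLAB code in the technical report cited on the next page; this is an independent reimplementation). The noise variances q = r = 0.1, data length N = 1000, x_1 = 0 and the initial guess θ_0 = 0.1 are assumptions, and the smoothed moments for φ and ψ are summed over t = 1, ..., N−1, a slight simplification compared to the slides, where the latent variables also include x_{N+1}.

```python
# EM for x_{t+1} = theta*x_t + v_t, y_t = x_t + e_t; an independent sketch, not
# the course MATLAB code. Noise variances q, r, data length N, x_1 = 0 and the
# initial guess are illustrative assumptions.
import numpy as np

def simulate(theta, q, r, N, rng):
    x, y = 0.0, np.zeros(N)
    for t in range(N):
        y[t] = x + rng.normal(scale=np.sqrt(r))
        x = theta * x + rng.normal(scale=np.sqrt(q))
    return y

def e_step(theta, y, q, r):
    """Kalman filter + RTS smoother; returns phi = sum_t E[x_t^2 | Y] and
    psi = sum_t E[x_t x_{t+1} | Y], both sums taken over t = 1..N-1."""
    N = len(y)
    xp, Pp = np.zeros(N), np.zeros(N)          # predicted  x_{t|t-1}, P_{t|t-1}
    xf, Pf = np.zeros(N), np.zeros(N)          # filtered   x_{t|t},   P_{t|t}
    xp[0], Pp[0] = 0.0, 0.0                    # x_1 = 0 is fully known (assumption)
    for t in range(N):
        K = Pp[t] / (Pp[t] + r)                # Kalman gain (C = 1)
        xf[t] = xp[t] + K * (y[t] - xp[t])
        Pf[t] = (1.0 - K) * Pp[t]
        if t < N - 1:
            xp[t + 1] = theta * xf[t]
            Pp[t + 1] = theta**2 * Pf[t] + q
    xs, Ps = xf.copy(), Pf.copy()              # smoothed x_{t|N}, P_{t|N}
    phi = psi = 0.0
    for t in range(N - 2, -1, -1):             # RTS backward recursion
        G = Pf[t] * theta / Pp[t + 1]
        xs[t] = xf[t] + G * (xs[t + 1] - xp[t + 1])
        Ps[t] = Pf[t] + G**2 * (Ps[t + 1] - Pp[t + 1])
        phi += xs[t]**2 + Ps[t]                          # E[x_t^2 | Y]
        psi += xs[t] * xs[t + 1] + G * Ps[t + 1]         # E[x_t x_{t+1} | Y]
    return phi, psi

rng = np.random.default_rng(1)
q = r = 0.1
y = simulate(theta=0.9, q=q, r=r, N=1000, rng=rng)
theta = 0.1                                    # initial guess theta_0 (assumption)
for i in range(100):
    phi, psi = e_step(theta, y, q, r)
    theta_new = psi / phi                      # M step
    if abs(theta_new - theta) < 1e-8:
        break
    theta = theta_new
print(theta)                                   # should end up close to 0.9
```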

EM Example - Linear System Identification (results)
Figures (panels c and d): the log-likelihood and the Q-function plotted versus the parameter a and versus the iteration number.

Nonlinear System Identification Using EM (I/VI)
Now consider the problem of estimating the parameters θ in
  x_{t+1} = f_t(x_t, θ) + v_t(θ),
  y_t = h_t(x_t, θ) + e_t(θ),
commonly also written as
  x_{t+1} ~ p_θ(x_{t+1} | x_t),
  y_t ~ p_θ(y_t | x_t).
All details, including MATLAB code, are provided in: Thomas B. Schön. An Explanation of the Expectation Maximization Algorithm. Division of Automatic Control, Linköping University, Sweden, Technical Report LiTH-ISY-R-2915, August 2009. According to the above, the first step is to compute the Q-function
  Q(θ, θ_i) = E_{θ_i}{log p_θ(Z, Y) | Y}.

Nonlinear System Identification Using EM (II/VI)
Applying E_{θ_i}{· | Y} to
  log p_θ(X, Y) = log p_θ(Y | X) + log p_θ(X) = log p_θ(x_1) + Σ_{t=1}^{N−1} log p_θ(x_{t+1} | x_t) + Σ_{t=1}^{N} log p_θ(y_t | x_t)
results in Q(θ, θ_i) = I_1 + I_2 + I_3, where
  I_1 = ∫ log p_θ(x_1) p_{θ_i}(x_1 | Y) dx_1,
  I_2 = Σ_{t=1}^{N−1} ∫∫ log p_θ(x_{t+1} | x_t) p_{θ_i}(x_{t+1}, x_t | Y) dx_t dx_{t+1},
  I_3 = Σ_{t=1}^{N} ∫ log p_θ(y_t | x_t) p_{θ_i}(x_t | Y) dx_t.

Nonlinear System Identification Using EM (III/VI)
This leads us to a nonlinear state smoothing problem, which we can solve using particle smoothers. Current particle smoothers provide us with the following approximation of the joint smoothing density,
  p(X | Y) ≈ (1/M) Σ_{i=1}^{M} δ(X − X^i),
which allows for the following approximations of the marginal smoothing densities that we need,
  p_θ(x_t | Y) ≈ p̂_θ(x_t | Y) = (1/M) Σ_{i=1}^{M} δ(x_t − x_t^i),
  p_θ(x_{t:t+1} | Y) ≈ p̂_θ(x_{t:t+1} | Y) = (1/M) Σ_{i=1}^{M} δ(x_{t:t+1} − x_{t:t+1}^i).
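As a hedged illustration of where such particle approximations come from, here is a minimal bootstrap particle filter that stores full ancestral trajectories; the resampled paths give a crude (filter-based, and degenerate for early time steps) approximation of the joint smoothing density, whereas the slides use dedicated particle smoothers. The callbacks f, loglik and x0_sampler are placeholders, not the systems from the lecture.

```python
# Minimal bootstrap particle filter storing ancestral paths (scalar-state case for
# simplicity); a crude stand-in for the particle smoother used on the slides.
# f, loglik and x0_sampler are user-supplied placeholders (assumptions).
import numpy as np

def bootstrap_pf(y, f, loglik, x0_sampler, M, rng):
    """y: (T,) measurements; f(x, rng): propagate particles one step;
    loglik(y_t, x): log p(y_t | x) elementwise; returns (M, T) trajectories."""
    T = len(y)
    paths = np.zeros((M, T))
    x = x0_sampler(M, rng)
    for t in range(T):
        if t > 0:
            x = f(x, rng)                              # propagate through the dynamics
        logw = loglik(y[t], x)                         # weight by the likelihood
        w = np.exp(logw - np.max(logw)); w /= w.sum()
        idx = rng.choice(M, size=M, p=w)               # multinomial resampling
        paths, x = paths[idx], x[idx]                  # keep the ancestral lineages
        paths[:, t] = x
    return paths                                       # rows approximate draws from p(X | Y)
```

With trajectories X^i in hand, the empirical approximations p̂_θ(x_t | Y) and p̂_θ(x_{t:t+1} | Y) above are built directly from the particle values x_t^i and the pairs (x_t^i, x_{t+1}^i).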

Nonlinear System Identification Using EM (IV/VI)
Inserting the above approximations into the integrals straightforwardly yields the approximation we are looking for,
  Î_1 = ∫ log p_θ(x_1) (1/M) Σ_{i=1}^{M} δ(x_1 − x_1^i) dx_1 = (1/M) Σ_{i=1}^{M} log p_θ(x_1^i),
  Î_2 = (1/M) Σ_{t=1}^{N−1} Σ_{i=1}^{M} log p_θ(x_{t+1}^i | x_t^i),
  Î_3 = (1/M) Σ_{t=1}^{N} Σ_{i=1}^{M} log p_θ(y_t | x_t^i).

Nonlinear System Identification Using EM (V/VI)
It is straightforward to make use of the approximation of the Q-function just derived in order to compute gradients of the Q-function,
  ∂Q(θ, θ_i)/∂θ = ∂Î_1/∂θ + ∂Î_2/∂θ + ∂Î_3/∂θ.
For example (the other two terms follow analogously),
  Î_3 = (1/M) Σ_{t=1}^{N} Σ_{i=1}^{M} log p_θ(y_t | x_t^i)   implies   ∂Î_3/∂θ = (1/M) Σ_{t=1}^{N} Σ_{i=1}^{M} ∂ log p_θ(y_t | x_t^i)/∂θ.
With these gradients in place, there are many algorithms that can be used in order to solve the maximization problem; we employ BFGS.

Nonlinear System Identification Using EM (VI/VI)
Algorithm (Nonlinear System Identification Using EM)
1. Set i = 0 and choose an initial guess θ_0.
2. Expectation (E) step:
   (a) Run the particle filter and the particle smoother.
   (b) Calculate an approximation of the Q-function, Q̂(θ, θ_i) = Î_1 + Î_2 + Î_3.
3. Maximization (M) step: compute θ_{i+1} = arg max_θ Q̂(θ, θ_i).
4. If not converged, update i := i + 1 and return to step 2.
Thomas B. Schön, Adrian Wills and Brett Ninness. System Identification of Nonlinear State-Space Models. Automatica, 47(1):39-49, January 2011.

EM Example - Blind Wiener Identification (I/IV)
Figure: The blind (only the outputs y_t are available) Wiener (linear dynamic system followed by a static nonlinearity) problem.
  ξ_{t+1} = A ξ_t + w_t,   w_t ~ N(0, Q),
  x_t = C ξ_t,
  y_t = f(x_t, η) + e_t,   e_t ~ N(0, R),
  θ = [ϑ^T  η^T  vec{R}^T]^T,   ϑ = [vec{A}^T  vec{C}^T  vec{Q}^T]^T.
Most existing work deals with special cases, with assumptions like:
1. invertible nonlinearities,
2. no measurement noise,
3. no process noise.
We are not forced to make any of these assumptions.
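As a hedged sketch of how the approximated Q-function feeds into the M step, the following builds Q̂(θ) from smoothed trajectories and hands it to BFGS. The log-density callbacks log_px1, log_pxx and log_pyx are placeholders (assumptions), and BFGS is used here with numerical gradients, whereas the slides form the gradients analytically.

```python
# Sketch of the approximated Q-function and its BFGS maximization. The callbacks
# log_px1, log_pxx, log_pyx are user-supplied log-densities (assumptions); the
# slides use analytic gradients, here BFGS falls back to numerical ones.
import numpy as np
from scipy.optimize import minimize

def Q_hat(theta, y, paths, log_px1, log_pxx, log_pyx):
    """paths: (M, T) smoothed trajectories x_t^i; returns I1_hat + I2_hat + I3_hat."""
    M, T = paths.shape
    I1 = np.mean(log_px1(theta, paths[:, 0]))
    I2 = sum(np.mean(log_pxx(theta, paths[:, t + 1], paths[:, t])) for t in range(T - 1))
    I3 = sum(np.mean(log_pyx(theta, y[t], paths[:, t])) for t in range(T))
    return I1 + I2 + I3

def m_step(theta_i, y, paths, log_px1, log_pxx, log_pyx):
    obj = lambda th: -Q_hat(th, y, paths, log_px1, log_pxx, log_pyx)
    return minimize(obj, theta_i, method="BFGS").x     # theta_{i+1} = argmax_theta Q_hat
```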

EM Example - Blind Wiener Identification (II/IV)
Recall that Z = Ξ, so
  Q(θ, θ_i) = ∫ ln p_θ(Ξ, Y) p_{θ_i}(Ξ | Y) dΞ.
According to the model we have
  ln p_θ(Ξ, Y) = ln p_θ(Y | Ξ) + ln p_θ(Ξ) = Σ_t ln p_θ(y_t | ξ_t) + Σ_t ln p_θ(ξ_{t+1} | ξ_t),
resulting in Q(θ, θ_i) = I_1 + I_2, where
  I_1 = Σ_t ∫ ln p_θ(y_t | ξ_t) p_{θ_i}(ξ_t | Y) dξ_t,
  I_2 = Σ_t ∫∫ ln p_θ(ξ_{t+1} | ξ_t) p_{θ_i}(ξ_{t+1}, ξ_t | Y) dξ_t dξ_{t+1},
and where p_{θ_i}(ξ_{t+1}, ξ_t | Y) and p_{θ_i}(ξ_t | Y) are provided by the particle smoother.

EM Example - Blind Wiener Identification (III/IV)
In the numerical example, the true Wiener system consists of a given fourth-order linear transfer function H(q) followed by a saturation nonlinearity f(x), equal to x for small |x| and saturating at ±0.3. N data samples and M particles are used, over a number of Monte Carlo runs, and the linear system is initialized using a subspace method.

EM Example - Blind Wiener Identification (IV/IV)
Figure: Block diagram of the blind Wiener model with two outputs (process noise w_t driving the linear system H(q, ϑ), whose outputs pass through static nonlinearities f(·, η) and are measured in noise e_t as y_t).
Figure: Bode plot of the estimated (grey) and true (blue) systems (gain in dB versus normalised frequency).
Figure: Estimated (grey) and true (blue) memoryless nonlinearity (saturation).
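For intuition about the kind of data this example works with, here is a hedged sketch that simulates a (single-output) Wiener system: a linear state-space model followed by a static saturation, observed in noise. The particular A, C, Q, R, data length and saturation level are illustrative assumptions, not the fourth-order system used on the slides.

```python
# Simulate a simple single-output Wiener system: linear dynamics followed by a
# static saturation, observed in noise. All numerical values are illustrative
# assumptions, not the system from the slides.
import numpy as np

def simulate_wiener(A, C, Q, R, sat, T, rng):
    n = A.shape[0]
    xi = np.zeros(n)                                     # linear state xi_t
    y = np.zeros(T)
    Lq = np.linalg.cholesky(Q)
    for t in range(T):
        x_t = C @ xi                                     # intermediate signal x_t = C xi_t
        y[t] = np.clip(x_t, -sat, sat) + rng.normal(scale=np.sqrt(R))
        xi = A @ xi + Lq @ rng.normal(size=n)            # xi_{t+1} = A xi_t + w_t
    return y

rng = np.random.default_rng(0)
A = np.array([[0.8, 0.2], [0.0, 0.9]])
C = np.array([1.0, 0.5])
Q = 0.05 * np.eye(2)
y = simulate_wiener(A, C, Q, R=0.01, sat=0.3, T=1000, rng=rng)
```

In the blind setting considered on the slides, only y_t would be available to the identification algorithm; A, C, Q, R and the nonlinearity parameters are all estimated.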

EM Example - Blind Wiener Identification (results)
Figure: Estimated (grey) and true (blue) nonlinearity (saturation), input versus output of the nonlinearity.
Figure: Estimated (grey) and true (blue) nonlinearity (dead band), input versus output of the nonlinearity.
Adrian Wills, Thomas B. Schön, Lennart Ljung and Brett Ninness. Blind Identification of Wiener Models. Proceedings of the 18th World Congress of the International Federation of Automatic Control (IFAC), Milan, Italy, September 2011 (accepted for publication).

Gaussian Mixture - Standard Construction
A linear superposition of Gaussians,
  p(x) = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k),
is called a Gaussian mixture. The mixture coefficients π_k satisfy
  Σ_{k=1}^{K} π_k = 1,   0 ≤ π_k ≤ 1.
Interpretation: the density N(x | µ_k, Σ_k) is the probability of x given that component k was chosen. The probability of choosing component k is given by the prior probability p(k) = π_k.

Gaussian Mixture - Example
Consider the following Gaussian mixture,
  p(x) = 0.3 N(x | µ_1, Σ_1) + 0.5 N(x | µ_2, Σ_2) + 0.2 N(x | µ_3, Σ_3).
Figure: Probability density function. Figure: Contour plot.

Gaussian Mixture - Problem with the Standard Construction
Given independent observations {x_n}_{n=1}^{N}, the log-likelihood function is given by
  ln p(X; π_{1:K}, µ_{1:K}, Σ_{1:K}) = Σ_{n=1}^{N} ln ( Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) ).
There is no closed-form solution available, due to the sum inside the logarithm. We will now see that this problem can be separated into two simple problems using the EM algorithm. First, we introduce an equivalent construction of the Gaussian mixture by introducing a latent variable.
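A short sketch of the two equivalent views of a Gaussian mixture follows: evaluating the superposition density directly, and sampling by first drawing a component k with probability π_k and then drawing x from that component. The two-dimensional means and covariances are made up, since the exact values used on the slides are not given here.

```python
# Gaussian mixture: evaluate p(x) = sum_k pi_k N(x | mu_k, Sigma_k) and draw
# samples by first choosing a component. The numerical values are illustrative
# assumptions, not the mixture parameters from the slides.
import numpy as np
from scipy.stats import multivariate_normal

pis = np.array([0.3, 0.5, 0.2])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([-3.0, 2.0])]
Sigmas = [np.eye(2), np.array([[1.0, 0.5], [0.5, 1.0]]), 0.5 * np.eye(2)]

def mixture_pdf(x):
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=S)
               for pi, mu, S in zip(pis, mus, Sigmas))

def sample_mixture(N, rng):
    ks = rng.choice(len(pis), size=N, p=pis)             # choose component k with prob pi_k
    return np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in ks])

rng = np.random.default_rng(0)
X = sample_mixture(1000, rng)                            # samples from p(x)
print(mixture_pdf(np.array([0.0, 0.0])))                 # density at the origin
```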

EM for Gaussian Mixtures - Intuitive Preview
Based on
  p(z_n) = Π_{k=1}^{K} π_k^{z_{nk}}   and   p(x_n | z_n) = Π_{k=1}^{K} N(x_n | µ_k, Σ_k)^{z_{nk}},
we have, for independent observations {x_n}_{n=1}^{N},
  p(X, Z) = Π_{n=1}^{N} Π_{k=1}^{K} π_k^{z_{nk}} N(x_n | µ_k, Σ_k)^{z_{nk}},
resulting in the following log-likelihood,
  ln p(X, Z) = Σ_{n=1}^{N} Σ_{k=1}^{K} z_{nk} ( ln π_k + ln N(x_n | µ_k, Σ_k) ).
Let us now use wishful thinking and assume that Z is known. Then maximization of the expression above is straightforward.

EM for Gaussian Mixtures - Explicit Algorithm
Algorithm (EM for Gaussian Mixtures)
1. Initialize µ_k^0, Σ_k^0, π_k^0 and set i = 0.
2. Expectation (E) step: compute the responsibilities
   γ(z_{nk}) = π_k^i N(x_n | µ_k^i, Σ_k^i) / Σ_{j=1}^{K} π_j^i N(x_n | µ_j^i, Σ_j^i),   n = 1, ..., N,   k = 1, ..., K.
3. Maximization (M) step: with N_k = Σ_{n=1}^{N} γ(z_{nk}), compute
   µ_k^{i+1} = (1/N_k) Σ_{n=1}^{N} γ(z_{nk}) x_n,
   Σ_k^{i+1} = (1/N_k) Σ_{n=1}^{N} γ(z_{nk}) (x_n − µ_k^{i+1})(x_n − µ_k^{i+1})^T,
   π_k^{i+1} = N_k / N.
4. If not converged, update i := i + 1 and return to step 2.

Example - EM for Gaussian Mixtures (I/III)
Consider the same Gaussian mixture as before,
  p(x) = 0.3 N(x | µ_1, Σ_1) + 0.5 N(x | µ_2, Σ_2) + 0.2 N(x | µ_3, Σ_3).
Figure: Probability density function. Figure: Samples from the Gaussian mixture p(x).

Example - EM for Gaussian Mixtures (II/III)
Apply the EM algorithm to estimate a Gaussian mixture with K = 3 Gaussians, i.e., use the samples to compute estimates of π_1, π_2, π_3, µ_1, µ_2, µ_3, Σ_1, Σ_2, Σ_3.
Figure: Initial guess.
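The boxed algorithm maps directly to code. Here is a hedged, independent sketch of the E and M steps (not the course's implementation); the data X can be, for example, the samples drawn in the previous sketch, and K, the initialization, the small regularization term and the iteration count are assumptions.

```python
# EM for a Gaussian mixture, following the algorithm above (independent sketch).
# X is an (N, d) array of observations, e.g. the samples drawn in the previous
# sketch; K, the initialization and the iteration count are assumptions.
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter, rng):
    N, d = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)]          # initialize at random data points
    Sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
    for _ in range(n_iter):
        # E step: responsibilities gamma(z_nk)
        dens = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                                for k in range(K)])         # (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: update pi_k, mu_k, Sigma_k
        Nk = gamma.sum(axis=0)                              # effective number of points per component
        pis = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pis, mus, Sigmas

pis_hat, mus_hat, Sigmas_hat = em_gmm(X, K=3, n_iter=100, rng=np.random.default_rng(1))
print(pis_hat)                  # roughly (0.3, 0.5, 0.2), up to component ordering
```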

Example - EM for Gaussian Mixtures (III/III)
Figure: True PDF. Figure: Estimate after a number of iterations of the EM algorithm.

The K-means Algorithm (I/II)
Algorithm (K-means algorithm, a.k.a. Lloyd's algorithm)
1. Initialize µ_k^0 and set i = 0.
2. Minimize J w.r.t. r_{nk}, keeping µ_k = µ_k^i fixed:
   r_{nk}^{i+1} = 1 if k = arg min_j ||x_n − µ_j^i||², and 0 otherwise.
3. Minimize J w.r.t. µ_k, keeping r_{nk} = r_{nk}^{i+1} fixed:
   µ_k^{i+1} = Σ_n r_{nk}^{i+1} x_n / Σ_n r_{nk}^{i+1}.
4. If not converged, update i := i + 1 and return to step 2.
Here J = Σ_{n=1}^{N} Σ_{k=1}^{K} r_{nk} ||x_n − µ_k||² denotes the distortion measure being minimized.

The K-means Algorithm (II/II)
The name K-means stems from the fact that in step 3 of the algorithm, µ_k is given by the mean of all the data points assigned to cluster k. Note the similarities between the K-means algorithm and the EM algorithm for Gaussian mixtures! K-means is deterministic, with hard assignment of data points to clusters (no uncertainty), whereas EM is a probabilistic method that provides a soft assignment. If the Gaussian mixture is modeled using covariance matrices Σ_k = εI, k = 1, ..., K, it is straightforward to show that the EM algorithm for a mixture of K Gaussians is equivalent to the K-means algorithm when ε → 0.

A Few Concepts to Summarize Lecture 3
Latent variable: a variable that is not directly observed. Sometimes also referred to as a hidden variable or missing data.
Expectation Maximization (EM): the EM algorithm computes maximum likelihood estimates of unknown parameters in probabilistic models involving latent variables.
Jensen's inequality: states that if f is a convex function, then E[f(x)] ≥ f(E[x]).
Clustering: unsupervised learning where a set of observations is divided into clusters. The observations belonging to a certain cluster are similar in some sense.
K-means algorithm (a.k.a. Lloyd's algorithm): a clustering algorithm that assigns observations to K clusters such that each observation belongs to the cluster with the nearest (in the Euclidean sense) mean.
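Finally, a minimal K-means (Lloyd's algorithm) sketch matching the two alternating steps above; the data, K, the initialization and the iteration count are illustrative assumptions.

```python
# K-means (Lloyd's algorithm): alternate hard assignments and mean updates.
# The data X, K, the initialization and the iteration count are assumptions.
import numpy as np

def kmeans(X, K, n_iter, rng):
    mus = X[rng.choice(len(X), size=K, replace=False)]    # initialize mu_k at data points
    for _ in range(n_iter):
        # step 2: assign each x_n to the nearest mean (minimize J over r_nk)
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        labels = d2.argmin(axis=1)
        # step 3: recompute each mean as the average of its assigned points
        for k in range(K):
            if np.any(labels == k):
                mus[k] = X[labels == k].mean(axis=0)
    return mus, labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(100, 2))
               for m in ([0.0, 0.0], [3.0, 3.0], [-3.0, 2.0])])
mus, labels = kmeans(X, K=3, n_iter=20, rng=rng)
print(mus)                                                # one mean per cluster
```

Replacing the hard argmin assignment with the soft responsibilities γ(z_{nk}), and the plain means with weighted means, recovers the EM updates for a Gaussian mixture with Σ_k = εI as ε → 0, as noted above.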
