Outline lecture 6

Lecture 6 - Expectation Maximization (EM) and clustering

Thomas Schön, Division of Automatic Control, Linköping University, Linköping, Sweden.
Email: schon@isy.liu.se, Phone: 13-1373, Office: House B, Entrance 7.

1. Summary of lecture 5
2. Expectation Maximization (EM)
   - General derivation
   - Example: identification of a linear state-space model
   - Example: identification of a Wiener system
3. Gaussian mixtures (Chapter 9)
   - Standard construction
   - Equivalent construction using latent variables
   - ML estimation using EM
   - Connections to the K-means algorithm for clustering

The Gaussian mixture model and the K-means algorithm will be finished during lecture 7 on Wednesday.

Summary of lecture 5

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. By assuming that the considered system is a Gaussian process, predictions can be made by computing the conditional distribution p(y* | x*, all the observations), y* being the output for which we seek a prediction. This regression approach is referred to as Gaussian process regression.

The support vector machine (SVM) is a discriminative classifier that gives the maximum margin decision boundary.

Latent variables - example

A latent variable is a variable that is not directly observed. Other common names are hidden variables, unobserved variables, or missing data. An example of a latent variable is the state x_t in a state-space model. Consider the following linear Gaussian state-space (LGSS) model,

    x_{t+1} = θ x_t + v_t,
    y_t     = x_t + e_t,      (v_t, e_t)^T ~ N(0, diag(0.1, 0.1)).
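To make the latent-variable example concrete, here is a minimal Python simulation sketch of the LGSS model above. It is not part of the lecture material: the parameter value θ = 0.9 and the noise variances follow the example used below, while the data length N = 100, the seed and the function name simulate_lgss are arbitrary choices made here.

    import numpy as np

    def simulate_lgss(theta=0.9, q=0.1, r=0.1, N=100, seed=0):
        """Simulate x_{t+1} = theta*x_t + v_t, y_t = x_t + e_t, with x_1 = 0 fully known."""
        rng = np.random.default_rng(seed)
        x = np.zeros(N + 1)   # states x_1, ..., x_{N+1}; x_1 = 0
        y = np.zeros(N)       # observations y_1, ..., y_N
        for t in range(N):
            y[t] = x[t] + rng.normal(0.0, np.sqrt(r))              # e_t ~ N(0, 0.1)
            x[t + 1] = theta * x[t] + rng.normal(0.0, np.sqrt(q))  # v_t ~ N(0, 0.1)
        return x, y

    x, y = simulate_lgss()    # y plays the role of the data Y = {y_1, ..., y_N} in example 1 below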

Expectation Maximization (EM) - strategy and idea

The Expectation Maximization (EM) algorithm computes ML estimates of unknown parameters in probabilistic models involving latent variables.

Strategy: Use the structure inherent in the probabilistic model to separate the original ML problem into two closely linked subproblems, each of which is hopefully in some sense more tractable than the original problem.

EM focuses on the joint log-likelihood function of the observed variables X and the latent variables Z = {z_1, ..., z_N},

    l_θ(X, Z) = ln p_θ(X, Z).

Expectation Maximization algorithm

Algorithm 1: Expectation Maximization (EM)
1. Initialise: Set i = 1 and choose an initial θ_1.
2. While not converged do:
   (a) Expectation (E) step: Compute
       Q(θ, θ_i) = E_{θ_i}[ln p_θ(Z, X) | X] = ∫ ln p_θ(Z, X) p_{θ_i}(Z | X) dZ.
   (b) Maximization (M) step: Compute
       θ_{i+1} = arg max_θ Q(θ, θ_i).
   (c) i ← i + 1.

EM example 1 - linear system identification

Consider the following scalar LGSS model,

    x_{t+1} = θ x_t + v_t,
    y_t     = x_t + e_t,      (v_t, e_t)^T ~ N(0, diag(0.1, 0.1)).

The initial state is fully known, x_1 = 0, and the true θ-parameter is given by θ = 0.9. The identification problem is now to determine the parameter θ on the basis of the observations Y = {y_1, ..., y_N} using the EM algorithm. The latent variables Z are given by the states, Z = X = {x_1, ..., x_{N+1}}.

Note the difference in notation compared to Bishop! Here the observations are denoted Y and the latent variables are denoted by X.

The expectation (E) step:

    Q(θ, θ_i) = E_{θ_i}[ln p_θ(X, Y) | Y] = ∫ ln p_θ(X, Y) p_{θ_i}(X | Y) dX.

Let us start investigating ln p_θ(X, Y). Using conditional probabilities we have

    p_θ(X, Y) = p_θ(x_{N+1}, X_N, y_N, Y_{N-1})
              = p_θ(x_{N+1}, y_N | X_N, Y_{N-1}) p_θ(X_N, Y_{N-1}).

According to the Markov property we have p_θ(x_{N+1}, y_N | X_N, Y_{N-1}) = p_θ(x_{N+1}, y_N | x_N), resulting in

    p_θ(X, Y) = p_θ(x_{N+1}, y_N | x_N) p_θ(X_N, Y_{N-1}).
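Before continuing with the example, a minimal generic sketch of the EM loop in Algorithm 1 may be helpful. This is a plain Python skeleton, not code from the lecture: the model-specific E and M steps are passed in as callables, and the convergence test on a scalar parameter is a simplification made here.

    def em(e_step, m_step, theta0, tol=1e-6, max_iter=200):
        """Generic EM loop: e_step(theta_i) returns the function Q(., theta_i),
        and m_step(Q) returns argmax_theta Q(theta, theta_i)."""
        theta = theta0
        for _ in range(max_iter):
            Q = e_step(theta)                   # E step: build Q(theta, theta_i)
            theta_next = m_step(Q)              # M step: maximize it
            if abs(theta_next - theta) < tol:   # simple scalar convergence test
                break
            theta = theta_next
        return theta

The linear example instantiates exactly these two steps: the E step reduces to computing two smoothed moments, and the M step to a closed-form quadratic maximization, as derived next.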

EM example 1 - linear system identification (continued)

Repeated use of the above ideas straightforwardly yields

    p_θ(X, Y) = p_θ(x_1) Π_{t=1}^{N} p_θ(x_{t+1}, y_t | x_t).

According to the model we have

    p_θ(x_{t+1}, y_t | x_t) = N( (x_{t+1}, y_t)^T ; (θ, 1)^T x_t, diag(0.1, 0.1) ).

The resulting Q-function is, up to terms and positive factors that do not affect the maximizing θ, the concave quadratic

    Q(θ, θ_i) = -φ θ² + 2 ψ θ,

where we have defined

    φ = Σ_{t=1}^{N} E_{θ_i}[x_t² | Y],      ψ = Σ_{t=1}^{N} E_{θ_i}[x_t x_{t+1} | Y].

There exist explicit expressions for these expected values.

The maximization (M) step,

    θ_{i+1} = arg max_θ Q(θ, θ_i),

simply amounts to solving the following quadratic problem,

    θ_{i+1} = arg max_θ ( -φ θ² + 2 ψ θ ).

The solution is given by

    θ_{i+1} = ψ / φ.

Algorithm 2: EM for example 1
1. Initialise: Set i = 1 and choose an initial θ_1.
2. While not converged do:
   (a) Expectation (E) step: Compute
       φ = Σ_{t=1}^{N} E_{θ_i}[x_t² | Y],      ψ = Σ_{t=1}^{N} E_{θ_i}[x_t x_{t+1} | Y].
   (b) Maximization (M) step: Find the next iterate according to θ_{i+1} = ψ / φ.
   (c) If L_{θ_i}(Y) - L_{θ_{i-1}}(Y) is larger than a chosen tolerance, update i := i + 1 and return to step 2; otherwise terminate.
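A compact Python sketch of this EM iteration for the scalar example follows (it is not the MATLAB code referenced later in the lecture). The smoothed moments E_{θ_i}[x_t² | Y] and E_{θ_i}[x_t x_{t+1} | Y] are obtained here with a Kalman filter and an RTS smoother, which is one standard way of evaluating the explicit expressions mentioned above; the initial guess θ_1 = 0.1 follows the slides, while the fixed number of iterations and the function names are choices made here.

    import numpy as np

    Q_VAR, R_VAR = 0.1, 0.1   # process and measurement noise variances from the example

    def smooth_moments(theta, y):
        """Kalman filter + RTS smoother for x_{t+1} = theta*x_t + v_t, y_t = x_t + e_t,
        with x_1 = 0 fully known. Returns smoothed means, variances and the
        lag-one covariances Cov(x_t, x_{t+1} | Y)."""
        N = len(y)
        m_p = np.zeros(N + 1); P_p = np.zeros(N + 1)   # predicted moments for x_1, ..., x_{N+1}
        m_f = np.zeros(N + 1); P_f = np.zeros(N + 1)   # filtered moments
        for t in range(N):                             # forward pass
            K = P_p[t] / (P_p[t] + R_VAR)              # measurement update with y_t
            m_f[t] = m_p[t] + K * (y[t] - m_p[t])
            P_f[t] = (1.0 - K) * P_p[t]
            m_p[t + 1] = theta * m_f[t]                # time update
            P_p[t + 1] = theta**2 * P_f[t] + Q_VAR
        m_s = m_f.copy(); P_s = P_f.copy()
        m_s[N], P_s[N] = m_p[N], P_p[N]                # x_{N+1} carries no measurement
        M = np.zeros(N)                                # lag-one smoothed covariances
        for t in range(N - 1, -1, -1):                 # backward (RTS) pass
            G = P_f[t] * theta / P_p[t + 1]
            m_s[t] = m_f[t] + G * (m_s[t + 1] - m_p[t + 1])
            P_s[t] = P_f[t] + G**2 * (P_s[t + 1] - P_p[t + 1])
            M[t] = G * P_s[t + 1]
        return m_s, P_s, M

    def em_lgss(y, theta0=0.1, n_iter=20):
        """EM for the scalar LGSS model: E step via the smoother, M step theta = psi/phi."""
        theta = theta0
        for _ in range(n_iter):
            m_s, P_s, M = smooth_moments(theta, y)
            phi = np.sum(m_s[:-1]**2 + P_s[:-1])       # sum_t E[x_t^2 | Y]
            psi = np.sum(m_s[:-1] * m_s[1:] + M)       # sum_t E[x_t x_{t+1} | Y]
            theta = psi / phi
        return theta

With data y from the simulation sketch shown earlier, the estimate is obtained as theta_hat = em_lgss(y).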

EM example 1 - linear system identification (results)

Different numbers of samples N are used, in Monte Carlo studies over repeated realisations of data, with initial guess θ_1 = 0.1.

Table: ML estimates of θ for increasing N; the estimates approach the true value θ = 0.9 as N grows. No surprise, since ML is asymptotically efficient.

Figure: Log-likelihood and Q function as functions of the parameter a, shown at four successive EM iterations (the first panel shows iteration 1, the last iteration 11).

Nonlinear system identification using EM (I/VI)

A general state-space model (SSM) consists of a Markov process {x_t}_{t≥1} and a measurement process {y_t}_{t≥1}, related according to

    x_{t+1} | x_t ~ f_θ(x_{t+1} | x_t, u_t),
    y_t | x_t ~ h_θ(y_t | x_t, u_t),
    x_1 ~ µ_θ(x_1).

Identification problem: Find θ based on {u_{1:T}, y_{1:T}}.

According to the above, the first step is to compute the Q-function

    Q(θ, θ_k) = E_{θ_k}[ln p_θ(Z, Y) | Y],

where the latent variables Z are the states X = x_{1:T}.

All details, including MATLAB code, are provided in:
Thomas B. Schön. An Explanation of the Expectation Maximization Algorithm. Division of Automatic Control, Linköping University, Sweden. Technical Report LiTH-ISY-R-2915, August 2009. users.isy.liu.se/rt/schon/publications/schone9.pdf

Nonlinear system identification using EM (II/VI)

Applying E_{θ_k}[ · | Y] to

    ln p_θ(X, Y) = ln p_θ(Y | X) + ln p_θ(X)
                 = ln p_θ(x_1) + Σ_{t=1}^{T-1} ln p_θ(x_{t+1} | x_t) + Σ_{t=1}^{T} ln p_θ(y_t | x_t)

results in Q(θ, θ_k) = I_1 + I_2 + I_3, where

    I_1 = ∫ ln p_θ(x_1) p_{θ_k}(x_1 | Y) dx_1,
    I_2 = Σ_{t=1}^{T-1} ∫∫ ln p_θ(x_{t+1} | x_t) p_{θ_k}(x_{t:t+1} | Y) dx_t dx_{t+1},
    I_3 = Σ_{t=1}^{T} ∫ ln p_θ(y_t | x_t) p_{θ_k}(x_t | Y) dx_t.

Nonlinear system identification using EM (III/VI)

This leads us to a nonlinear state smoothing problem, which we can solve using a particle smoother (PS). The PS provides us with the following approximation of the joint smoothing density,

    p(X | Y) ≈ (1/M) Σ_{i=1}^{M} δ_{X^i}(X),

which allows for the following approximations of the marginal smoothing densities that we need,

    p_θ(x_t | Y) ≈ p̂_θ(x_t | Y) = (1/M) Σ_{i=1}^{M} δ_{x_t^i}(x_t),
    p_θ(x_{t:t+1} | Y) ≈ p̂_θ(x_{t:t+1} | Y) = (1/M) Σ_{i=1}^{M} δ_{x_{t:t+1}^i}(x_{t:t+1}).

Nonlinear system identification using EM (IV/VI)

Inserting the above approximations into the integrals straightforwardly yields the approximations we are looking for,

    Î_1 = ∫ ln p_θ(x_1) (1/M) Σ_{i=1}^{M} δ_{x_1^i}(x_1) dx_1 = (1/M) Σ_{i=1}^{M} ln p_θ(x_1^i),
    Î_2 = (1/M) Σ_{t=1}^{T-1} Σ_{i=1}^{M} ln p_θ(x_{t+1}^i | x_t^i),
    Î_3 = (1/M) Σ_{t=1}^{T} Σ_{i=1}^{M} ln p_θ(y_t | x_t^i).

Nonlinear system identification using EM (V/VI)

It is straightforward to make use of the approximation of the Q-function just derived in order to compute gradients of the Q-function,

    ∂Q(θ, θ_k)/∂θ ≈ ∂Î_1/∂θ + ∂Î_2/∂θ + ∂Î_3/∂θ.

For example,

    Î_3 = (1/M) Σ_{t=1}^{T} Σ_{i=1}^{M} ln p_θ(y_t | x_t^i)   gives   ∂Î_3/∂θ = (1/M) Σ_{t=1}^{T} Σ_{i=1}^{M} ∂ ln p_θ(y_t | x_t^i)/∂θ;

the other two terms are treated analogously. With these gradients in place there are many algorithms that can be used in order to solve the maximization problem; we employ BFGS.
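A small Python sketch of the particle-based approximation Q̂ and its maximization is given below, written for the scalar LGSS model from example 1 so that the densities are concrete. This is an illustration under assumptions, not the lecture's implementation: the array X is assumed to hold M smoothed state trajectories produced by an FFBS particle smoother (which is not implemented here), the term Î_1 is omitted because x_1 is known in that model, and the gradient is left to the optimizer's finite differences rather than the analytical expressions above.

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import minimize

    Q_VAR, R_VAR = 0.1, 0.1   # noise variances of the scalar model from example 1

    def q_hat(theta, X, y):
        """Monte Carlo approximation Q_hat = I2_hat + I3_hat for the scalar LGSS model.
        X: array of shape (M, T) with M smoothed state trajectories x_1^i, ..., x_T^i,
           assumed to come from an FFBS particle smoother.
        y: array of the T observations. I1_hat is omitted since x_1 is known in this
           model, so that term does not depend on theta."""
        M = X.shape[0]
        i2 = np.sum(norm.logpdf(X[:, 1:], loc=theta * X[:, :-1], scale=np.sqrt(Q_VAR))) / M
        i3 = np.sum(norm.logpdf(y[None, :], loc=X, scale=np.sqrt(R_VAR))) / M
        return i2 + i3

    def m_step(theta_k, X, y):
        """M step: theta_{k+1} = argmax_theta Q_hat(theta), maximized with BFGS."""
        res = minimize(lambda th: -q_hat(th[0], X, y), x0=[theta_k], method="BFGS")
        return res.x[0]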

Nonlinear system identification using EM (VI/VI)

Algorithm 3: Nonlinear system identification using EM
1. Initialise: Set k = 1 and choose an initial θ_1.
2. While not converged do:
   (a) Expectation (E) step: Run an FFBS particle smoother and compute
       Q̂(θ, θ_k) = Î_1(θ, θ_k) + Î_2(θ, θ_k) + Î_3(θ, θ_k).
   (b) Maximization (M) step: Compute θ_{k+1} = arg max_θ Q̂(θ, θ_k) using an off-the-shelf numerical optimization algorithm.
   (c) k ← k + 1.

Thomas B. Schön, Adrian Wills and Brett Ninness. System Identification of Nonlinear State-Space Models. Automatica, 47(1):39-49, January 2011.

EM example 2 - blind Wiener identification (I/III)

Figure: Block diagram of the blind Wiener system. An unmeasured input u_t drives a linear system L; its output z_t passes through static nonlinearities and is measured in noise as y_{1,t} and y_{2,t}.

    x_{t+1} = A x_t + B u_t,      u_t ~ N(0, Q),
    z_t = C x_t,
    y_t = h(z_t, β) + e_t,        e_t ~ N(0, R).

Identification problem: Find A, B, C, β, Q and R based on {y_{1,1:T}, y_{2,1:T}} using EM.

EM example 2 - blind Wiener identification (II/III)

Second-order LGSS model with complex poles. Employ the EM-PS; the EM-PS was terminated after a fixed number of iterations. Results obtained using T samples; the plots are based on repeated realisations of data. The nonlinearities (dead-zone and saturation) are shown on the next slide.

EM example 2 - blind Wiener identification (III/III)

Figure: Bode plot of the estimated mean (black), the true system (red), and the result for all realisations (gray); axes show gain (dB) against normalised frequency.
Figure: Estimated mean (black), true static nonlinearity (red), and the result for all realisations (gray); axes show input to and output of the two static nonlinearities.

Adrian Wills, Thomas B. Schön, Lennart Ljung and Brett Ninness. Identification of Hammerstein-Wiener Models. Automatica, 49(1):70-81, January 2013.
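To fix ideas on the Wiener structure, here is a small simulation sketch. All numerical values, the choice of a saturation nonlinearity and the function name are illustrative placeholders chosen here, not the system used in the lecture.

    import numpy as np

    def simulate_wiener(T=1000, seed=0):
        """Simulate a blind Wiener system: an unobserved input u_t ~ N(0, q) drives a
        second-order linear state-space system, whose output z_t = C x_t is passed
        through a static saturation and observed in noise.
        All parameter values below are illustrative placeholders."""
        rng = np.random.default_rng(seed)
        A = np.array([[1.5, -0.7],          # second-order dynamics with complex poles
                      [1.0,  0.0]])
        B = np.array([1.0, 0.0])
        C = np.array([0.5, 0.3])
        q, r = 1.0, 0.01                    # input and measurement noise variances
        sat = lambda z, a=1.0: np.clip(z, -a, a)       # saturation nonlinearity h(z, beta)
        x = np.zeros(2)
        y = np.zeros(T)
        for t in range(T):
            z = C @ x                                      # z_t = C x_t
            y[t] = sat(z) + rng.normal(0.0, np.sqrt(r))    # y_t = h(z_t, beta) + e_t
            u = rng.normal(0.0, np.sqrt(q))                # unobserved (blind) input u_t
            x = A @ x + B * u                              # x_{t+1} = A x_t + B u_t
        return y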

Gaussian mixture (GM) - standard construction

A linear superposition of Gaussians,

    p(x) = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k),

is called a Gaussian mixture (GM). The mixture coefficients π_k satisfy

    Σ_{k=1}^{K} π_k = 1,      0 ≤ π_k ≤ 1.

Interpretation: The density p(x | k) = N(x | µ_k, Σ_k) is the probability of x given that component k was chosen. The probability of choosing component k is given by the prior probability p(k) = π_k.

GM example

Consider the following GM with three components,

    p(x) = 0.3 N(x | µ_1, Σ_1) + 0.5 N(x | µ_2, Σ_2) + 0.2 N(x | µ_3, Σ_3),

with mixture coefficients π_1 = 0.3, π_2 = 0.5, π_3 = 0.2 and given means µ_k and covariances Σ_k.

Figure: Probability density function. Figure: Contour plot.

GM - problem with the standard construction

Given N independent observations {x_n}_{n=1}^{N}, the log-likelihood function is given by

    ln p(X; π_{1:K}, µ_{1:K}, Σ_{1:K}) = Σ_{n=1}^{N} ln Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k).

There is no closed-form solution available, due to the sum inside the logarithm. Let us now see how this problem can be separated into two simple problems using the EM algorithm. First we introduce an equivalent construction of the Gaussian mixture by introducing a latent variable.

EM for Gaussian mixtures - intuitive preview

Based on

    p(z_n) = Π_{k=1}^{K} π_k^{z_{nk}}     and     p(x_n | z_n) = Π_{k=1}^{K} N(x_n | µ_k, Σ_k)^{z_{nk}},

we have, for independent observations {x_n}_{n=1}^{N},

    p(X, Z) = Π_{n=1}^{N} Π_{k=1}^{K} π_k^{z_{nk}} N(x_n | µ_k, Σ_k)^{z_{nk}},

resulting in the following log-likelihood,

    ln p(X, Z) = Σ_{n=1}^{N} Σ_{k=1}^{K} z_{nk} ( ln π_k + ln N(x_n | µ_k, Σ_k) ).      (1)

Let us now use wishful thinking and assume that Z is known. Then maximization of (1) is straightforward.
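The equivalent latent-variable construction can be illustrated by sampling: draw the component label z_n first, then x_n from the chosen component, so that marginally x_n follows the mixture density p(x). A minimal Python sketch follows; the mixture weights are those of the lecture's example, whereas the means and covariances are placeholder values chosen here.

    import numpy as np

    def sample_gm(n_samples, pis, mus, sigmas, seed=0):
        """Sample from a Gaussian mixture via its latent-variable construction:
        draw z_n ~ Categorical(pi) first, then x_n ~ N(mu_{z_n}, Sigma_{z_n});
        marginally x_n then follows the mixture density p(x)."""
        rng = np.random.default_rng(seed)
        z = rng.choice(len(pis), size=n_samples, p=pis)    # latent component labels
        x = np.array([rng.multivariate_normal(mus[k], sigmas[k]) for k in z])
        return x, z

    # mixture weights from the lecture's example; means and covariances are placeholders
    pis = [0.3, 0.5, 0.2]
    mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0]), np.array([-3.0, 3.0])]
    sigmas = [np.eye(2), np.eye(2), 0.5 * np.eye(2)]
    X, Z = sample_gm(1000, pis, mus, sigmas)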

EM for Gaussian mixtures - explicit algorithm

Algorithm 4: EM for Gaussian mixtures
1. Initialise: Initialize µ_k^1, Σ_k^1, π_k^1 and set i = 1.
2. While not converged do:
   (a) Expectation (E) step: Compute the responsibilities

       γ(z_{nk}) = π_k^i N(x_n | µ_k^i, Σ_k^i) / Σ_{j=1}^{K} π_j^i N(x_n | µ_j^i, Σ_j^i),      n = 1, ..., N,  k = 1, ..., K.

   (b) Maximization (M) step: With N_k = Σ_{n=1}^{N} γ(z_{nk}), compute

       µ_k^{i+1} = (1/N_k) Σ_{n=1}^{N} γ(z_{nk}) x_n,
       Σ_k^{i+1} = (1/N_k) Σ_{n=1}^{N} γ(z_{nk}) (x_n - µ_k^{i+1})(x_n - µ_k^{i+1})^T,
       π_k^{i+1} = N_k / N.

   (c) i ← i + 1.

Example - EM for Gaussian mixtures (I/III)

Consider the same Gaussian mixture as before,

    p(x) = 0.3 N(x | µ_1, Σ_1) + 0.5 N(x | µ_2, Σ_2) + 0.2 N(x | µ_3, Σ_3).

Figure: Probability density function. Figure: Samples drawn from the Gaussian mixture p(x).

Example - EM for Gaussian mixtures (II/III and III/III)

Apply the EM algorithm to estimate a Gaussian mixture with K = 3 Gaussians, i.e. use the samples to compute estimates of π_1, π_2, π_3, µ_1, µ_2, µ_3, Σ_1, Σ_2, Σ_3.

Figure: Initial guess. Figure: True PDF. Figure: Estimate after a number of iterations of the EM algorithm.
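A minimal NumPy sketch of Algorithm 4 is given below; the initialisation, the small covariance jitter and the fixed iteration count are simple choices made here, not those used in the lecture.

    import numpy as np
    from scipy.stats import multivariate_normal

    def em_gmm(X, K, n_iter=50, seed=0):
        """EM for a Gaussian mixture: the E step computes the responsibilities
        gamma(z_nk), the M step re-estimates pi_k, mu_k, Sigma_k as in Algorithm 4."""
        rng = np.random.default_rng(seed)
        N, d = X.shape
        pi = np.full(K, 1.0 / K)
        mu = X[rng.choice(N, K, replace=False)]     # initialise means at random data points
        Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])
        for _ in range(n_iter):                     # a likelihood-based stop rule could be used instead
            # E step: responsibilities gamma[n, k]
            dens = np.column_stack([multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)])
            gamma = pi * dens
            gamma /= gamma.sum(axis=1, keepdims=True)
            # M step: update the parameters of each component
            Nk = gamma.sum(axis=0)
            pi = Nk / N
            mu = (gamma.T @ X) / Nk[:, None]
            for k in range(K):
                diff = X - mu[k]
                Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
        return pi, mu, Sigma

Applied to samples X from the mixture in the previous sketch, pi_hat, mu_hat, Sigma_hat = em_gmm(X, K=3) returns estimates of the nine quantities listed above.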

The K-means algorithm (I/II)

The algorithm minimizes the distortion measure J = Σ_{n=1}^{N} Σ_{k=1}^{K} r_{nk} ||x_n - µ_k||², where r_{nk} ∈ {0, 1} indicates whether data point x_n is assigned to cluster k. (A small code sketch of this algorithm is given after the lecture summary below.)

Algorithm 5: K-means algorithm (a.k.a. Lloyd's algorithm)
1. Initialize µ_k^1 and set i = 1.
2. Minimize J w.r.t. r_{nk}, keeping µ_k = µ_k^i fixed:

       r_{nk}^{i+1} = 1 if k = arg min_j ||x_n - µ_j^i||², and 0 otherwise.

3. Minimize J w.r.t. µ_k, keeping r_{nk} = r_{nk}^{i+1} fixed:

       µ_k^{i+1} = ( Σ_{n=1}^{N} r_{nk}^{i+1} x_n ) / ( Σ_{n=1}^{N} r_{nk}^{i+1} ).

4. If not converged, update i := i + 1 and return to step 2.

The K-means algorithm (II/II)

The name K-means stems from the fact that in step 3 of the algorithm, µ_k is given by the mean of all the data points assigned to cluster k.

Note the similarities between the K-means algorithm and the EM algorithm for Gaussian mixtures! K-means is deterministic, with a hard assignment of data points to clusters (no uncertainty), whereas EM is a probabilistic method that provides a soft assignment.

If the Gaussian mixture is modeled using covariance matrices Σ_k = εI, k = 1, ..., K, it can be shown that the EM algorithm for a mixture of Gaussians is equivalent to the K-means algorithm when ε → 0.

A few concepts to summarize lecture 6

Latent variable: A variable that is not directly observed. Sometimes also referred to as a hidden variable or missing data.

Expectation Maximization (EM): The EM algorithm computes maximum likelihood estimates of unknown parameters in probabilistic models involving latent variables.

Jensen's inequality: States that if f is a convex function, then E[f(x)] ≥ f(E[x]).

Clustering: Unsupervised learning where a set of observations is divided into clusters. The observations belonging to a certain cluster are similar in some sense.

K-means algorithm (a.k.a. Lloyd's algorithm): A clustering algorithm that assigns observations into clusters such that each observation belongs to the cluster with the nearest (in the Euclidean sense) mean.
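As referenced above, a minimal NumPy sketch of Algorithm 5 follows; the initialisation of the means at random data points and the iteration cap are choices made here.

    import numpy as np

    def kmeans(X, K, n_iter=100, seed=0):
        """Lloyd's algorithm: alternate hard assignments r_nk and mean updates mu_k,
        minimizing J = sum_n sum_k r_nk ||x_n - mu_k||^2."""
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(len(X), K, replace=False)]   # initialise means at random data points
        for _ in range(n_iter):
            # step 2: hard-assign each point to the nearest mean
            d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # squared distances, (N, K)
            labels = d2.argmin(axis=1)
            # step 3: recompute each mean as the average of its assigned points
            new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                               for k in range(K)])
            if np.allclose(new_mu, mu):                # step 4: converged, assignments fixed
                break
            mu = new_mu
        return labels, mu

Run on the same samples as the GM example (labels, mu_hat = kmeans(X, K=3)), this gives the hard cluster assignments that the soft EM responsibilities reduce to in the ε → 0 limit discussed above.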