Expectation-Maximization Algorithm.


Petr Pošík
Czech Technical University in Prague, Faculty of Electrical Engineering, Dept. of Cybernetics

Contents:
- Maximum likelihood estimation: Likelihood, Incomplete data, General EM
- K-means: Algorithm, Illustration, EM view
- EM for Mixtures: General mixture, EM for Mixtures, GMM, EM for GMM
- EM for HMM: HMM, HMM learning, Sufficient statistics, Baum-Welch
- Summary: Competencies

Maximum likelihood estimation

Likelihood maximization

Let's have a random variable X with probability distribution p_X(x | θ). This emphasizes that the distribution is parameterized by θ ∈ Θ, i.e. the distribution comes from a certain parametric family. Θ is the space of possible parameter values.

Learning task: assume the parameters θ are unknown, but we have an i.i.d. training dataset T = {x_1, ..., x_n} which can be used to estimate the unknown parameters.

The probability of observing dataset T given some parameter values θ is

  p(T | θ) = ∏_{j=1}^{n} p_X(x_j | θ) =: L(θ; T).

This probability can be interpreted as the degree to which the model parameters θ conform to the data T. It is thus called the likelihood of the parameters θ w.r.t. the data T. The optimal θ* is obtained by maximizing the likelihood:

  θ* = arg max_{θ ∈ Θ} L(θ; T) = arg max_{θ ∈ Θ} ∏_{j=1}^{n} p_X(x_j | θ).

Since arg max_x f(x) = arg max_x log f(x), we often maximize the log-likelihood l(θ; T) = log L(θ; T):

  θ* = arg max_{θ ∈ Θ} l(θ; T) = arg max_{θ ∈ Θ} log ∏_{j=1}^{n} p_X(x_j | θ) = arg max_{θ ∈ Θ} ∑_{j=1}^{n} log p_X(x_j | θ),

which is often easier than maximization of L.

Incomplete data

Assume we cannot observe the objects completely:
- r.v. X describes the observable part,
- r.v. K describes the unobservable, hidden part.
We assume there is an underlying distribution p_XK(x, k | θ) of objects (x, k).

Learning task: we want to estimate the model parameters θ, but the training set contains i.i.d. samples for the observable part only, i.e. T_X = {x_1, ..., x_n}. (Still, there also exists a hidden, unobservable dataset T_K = {k_1, ..., k_n}.)

If we had complete data (T_X, T_K), we could directly optimize l(θ; T_X, T_K) = log p(T_X, T_K | θ). But we do not have access to T_K.

If we would like to maximize l(θ; T_X) = log p(T_X | θ) = log ∑_{T_K} p(T_X, T_K | θ), the summation inside log() results in complicated expressions, or we would have to use numerical methods.

Our state of knowledge about T_K is given by p(T_K | T_X, θ). The complete-data likelihood L(θ; T_X, T_K) = p(T_X, T_K | θ) is a random variable, since T_K is unknown and random, but governed by the underlying distribution. Instead of optimizing it directly, we consider its expected value under the posterior distribution over the latent variables (E-step), and then maximize this expectation (M-step).
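The complete-data case above can be checked numerically in a few lines. The following sketch is not part of the original slides; the univariate Gaussian family, the synthetic dataset, and the parameter grid are all illustrative choices. It evaluates l(θ; T) and confirms that the closed-form ML estimates (the sample mean and the 1/n standard deviation) are not beaten by any nearby grid point:

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.normal(loc=2.0, scale=1.5, size=200)   # i.i.d. training data x_1..x_n (illustrative)

def log_likelihood(mu, sigma, data):
    """l(theta; T) = sum_j log p_X(x_j | theta) for a univariate Gaussian."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu)**2 / (2 * sigma**2))

# Closed-form maximum-likelihood estimates for the Gaussian family.
mu_hat = T.mean()
sigma_hat = T.std()          # np.std uses the 1/n form by default, which is the MLE

best = (mu_hat, sigma_hat)
# Sanity check: no point on a coarse grid around the MLE has a higher log-likelihood.
for mu in np.linspace(mu_hat - 1, mu_hat + 1, 21):
    for sigma in np.linspace(max(sigma_hat - 1, 0.1), sigma_hat + 1, 21):
        if log_likelihood(mu, sigma, T) > log_likelihood(*best, T):
            best = (mu, sigma)

print("closed-form MLE:", (mu_hat, sigma_hat), "grid maximum:", best)
```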

Expectation-Maximization algorithm

EM algorithm: a general method for finding the MLE of probability distribution parameters from a given dataset when the data are incomplete (hidden variables, or missing values).
Hidden variables: mixture models, Hidden Markov models, ...
It is a family of algorithms, or a recipe to derive an ML estimation algorithm for various kinds of probabilistic models.

1. Pretend that you know θ. (Use some initial guess θ^(0).) Set the iteration counter i = 1.
2. E-step: use the current parameter values θ^(i-1) to find the posterior distribution of the latent variables, p(T_K | T_X, θ^(i-1)). Use this posterior distribution to find the expectation of the complete-data log-likelihood evaluated for some general parameter values θ:
   Q(θ, θ^(i-1)) = ∑_{T_K} p(T_K | T_X, θ^(i-1)) log p(T_X, T_K | θ).
3. M-step: maximize the expectation, i.e. compute an updated estimate of θ as
   θ^(i) = arg max_{θ ∈ Θ} Q(θ, θ^(i-1)).
4. Check for convergence: finish, or advance the iteration counter, i = i + 1, and repeat from 2.

EM algorithm features

Pros:
- Among the possible optimization methods, EM exploits the structure of the model.
- For p_{X|K} from the exponential family: the M-step can be done analytically and there is a unique optimizer.
- The expected value in the E-step can be expressed as a function of θ without solving it explicitly for each θ.
- p_X(T_X | θ^(i+1)) ≥ p_X(T_X | θ^(i)), i.e. the process finds a local optimum.
- Works well in practice.

Cons:
- Not guaranteed to find the globally optimal estimate.
- MLE can overfit; use MAP instead (EM can be used as well).
- Convergence may be slow.
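To make the E-step/M-step recipe concrete, here is a toy sketch on a problem that is not covered in these slides: two biased coins with unknown head probabilities; in every trial one of them is picked with equal probability (the hidden variable), flipped m times, and only the number of heads is recorded. All numbers below are made up for illustration.

```python
import numpy as np
from math import comb

# Observed data: number of heads in each trial of m flips; which coin was used is hidden.
m = 10
heads = np.array([5, 9, 8, 4, 7])          # illustrative observations
theta = np.array([0.6, 0.5])               # initial guess theta^(0) = (theta_A, theta_B)

def binom_pmf(h, m, p):
    return comb(m, h) * p**h * (1 - p)**(m - h)

for i in range(50):
    # E-step: posterior p(coin | x_j, theta^(i-1)) for each trial (equal priors assumed).
    lik = np.array([[binom_pmf(h, m, p) for p in theta] for h in heads])
    resp = lik / lik.sum(axis=1, keepdims=True)      # shape (n_trials, 2)

    # M-step: maximize the expected complete-data log-likelihood.
    # For binomial components this is a responsibility-weighted fraction of heads per coin.
    theta = (resp * heads[:, None]).sum(axis=0) / (resp.sum(axis=0) * m)

print("estimated head probabilities:", theta)
```

Each pass through the loop performs one E-step (posterior over which coin produced each trial) and one M-step (weighted ML update of the two head probabilities), so the observed-data likelihood never decreases.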

K-means

K-means algorithm

Clustering is one of the tasks of unsupervised learning. The K-means algorithm for clustering [Mac67]: K is the a priori given number of clusters.

Algorithm:
1. Choose K centroids μ_k (in almost any way, but every cluster should have at least one example).
2. For all x, assign x to its closest μ_k.
3. Compute the new position of each centroid μ_k based on all examples x_i, i ∈ I_k, in cluster k.
4. If the positions of the centroids changed, repeat from 2.

Algorithm features:
- The algorithm minimizes the function (intra-cluster variance)

    J = ∑_{j=1}^{k} ∑_{i=1}^{n_j} ||x_{i,j} - c_j||^2,   (1)

  where x_{i,j} is the i-th example in cluster j and c_j is the centroid of cluster j.
- The algorithm is fast, but each time it can converge to a different local optimum of J.

[Mac67] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of 5-th Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281-297, Berkeley, 1967. University of California Press.

Illustration

[Figure: K-means clustering, iteration 1]

[Figures: K-means clustering, iterations 2-5]

[Figure: K-means clustering, iteration 6]

K-means: EM view

Assume:
- An object can be in one of the K states with equal probabilities.
- All p_{X|K}(x | k) are isotropic Gaussians: p_{X|K}(x | k) = N(x | μ_k, σI).

Recognition (part of the E-step): the task is to decide the state k for each x, assuming all μ_k are known. The Bayesian strategy (which minimizes the probability of error) chooses the cluster whose center is closest to the observation x:

  q*(x) = arg min_{k ∈ K} (x - μ_k)^2.

If the μ_k, k ∈ K, are not known, it is a parameterized strategy q_Θ(x), where Θ = (μ_k)_{k=1}^{K}. Deciding the state k for each x assuming known μ_k is actually the computation of a degenerate probability distribution p(T_K | T_X, θ^(i-1)), i.e. the first part of the E-step.

Learning (the rest of the E-step and the M-step): find the maximum-likelihood estimates of μ_k based on the known (x_1, k_1), ..., (x_l, k_l):

  μ_k = (1 / |I_k|) ∑_{i ∈ I_k} x_i,

where I_k is the set of indices of the training examples (currently) belonging to state k. This completes the E-step and implements the M-step.
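A minimal numpy sketch of the K-means procedure described above; the synthetic data, the choice K = 3, and the initialization from random training points are illustrative assumptions, not part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative 2-D data drawn around three centres (not the dataset from the figures).
X = np.vstack([rng.normal(c, 0.8, size=(100, 2)) for c in ([0, 0], [5, 5], [0, 6])])
K = 3

# 1. Choose K centroids (here: K distinct training points picked at random).
mu = X[rng.choice(len(X), size=K, replace=False)]

for _ in range(100):
    # 2. Assign every x to its closest centroid.
    dist = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)      # shape (n, K)
    labels = dist.argmin(axis=1)

    # 3. Recompute each centroid as the mean of its cluster
    #    (keeping the old centroid if a cluster happens to become empty).
    new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                       for k in range(K)])

    # 4. Stop when the centroid positions no longer change.
    if np.allclose(new_mu, mu):
        break
    mu = new_mu

# Intra-cluster variance J from Eq. (1).
J = sum(np.sum((X[labels == k] - mu[k]) ** 2) for k in range(K))
print("centroids:\n", mu, "\nJ =", J)
```

Step 2 is the degenerate E-step (hard assignment to the nearest centroid) and step 3 is the M-step (the ML estimate of each μ_k), in line with the EM view above.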

EM for Mixture Models

General mixture distributions

Assume the data are samples from a distribution factorized as

  p_XK(x, k) = p_K(k) p_{X|K}(x | k),  i.e.  p_X(x) = ∑_{k ∈ K} p_K(k) p_{X|K}(x | k),

and that the form of the distribution is known (except for the distribution parameters).

Recognition (part of the E-step): let's define the result of recognition not as a single decision for some state k (as done in K-means), but rather as a set of posterior probabilities (sometimes called responsibilities) for all k given x_i,

  γ_k(x_i) = p_{K|X}(k | x_i, θ^(t)) = p_{X|K}(x_i | k) p_K(k) / ∑_{k' ∈ K} p_{X|K}(x_i | k') p_K(k'),

that an object was in state k when observation x_i was made. The γ_k(x) functions can be viewed as discriminant functions.

General mixture distributions (cont.)

Learning (the rest of the E-step and the M-step): given the training multiset T = (x_i, k_i)_{i=1}^{n} (or the respective γ_k(x_i) instead of k_i), assume γ_k(x) is known, p_K(k) are not known, and p_{X|K}(x | k) are known except for the parameter values Θ_k, i.e. we shall write p_{X|K}(x | k, Θ_k). Let the object model m be the set of all unknown parameters, m = (p_K(k), Θ_k)_{k ∈ K}.

The log-likelihood of model m if we assume k_i is known:

  log L(m) = log ∏_{i=1}^{n} p_XK(x_i, k_i) = ∑_{i=1}^{n} log p_K(k_i) + ∑_{i=1}^{n} log p_{X|K}(x_i | k_i, Θ_{k_i}).

The log-likelihood of model m if we assume a distribution (γ) over k is known:

  log L(m) = ∑_{i=1}^{n} ∑_{k ∈ K} γ_k(x_i) log p_K(k) + ∑_{i=1}^{n} ∑_{k ∈ K} γ_k(x_i) log p_{X|K}(x_i | k, Θ_k).

We search for the optimal model using maximum likelihood:

  m* = (p*_K(k), Θ*_k) = arg max_m log L(m),

i.e. we compute

  p*_K(k) = (1/n) ∑_{i=1}^{n} γ_k(x_i)

and solve K independent tasks

  Θ*_k = arg max_{Θ_k} ∑_{i=1}^{n} γ_k(x_i) log p_{X|K}(x_i | k, Θ_k).
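One way to see where the update for p*_K(k) comes from (this intermediate step is not spelled out on the slide) is to maximize the first term of log L(m) subject to ∑_{k ∈ K} p_K(k) = 1 with a Lagrange multiplier λ:

  ∂/∂p_K(k) [ ∑_{i=1}^{n} ∑_{k' ∈ K} γ_{k'}(x_i) log p_K(k') + λ ( ∑_{k' ∈ K} p_K(k') - 1 ) ] = ∑_{i=1}^{n} γ_k(x_i) / p_K(k) + λ = 0,

so p_K(k) = -(1/λ) ∑_{i=1}^{n} γ_k(x_i); summing over k and using ∑_{k ∈ K} γ_k(x_i) = 1 gives λ = -n, and hence p*_K(k) = (1/n) ∑_{i=1}^{n} γ_k(x_i). The K independent tasks for Θ*_k follow directly from the second term of log L(m), which separates over k.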

EM for mixture distributions

Unsupervised learning algorithm [?] for general mixture distributions:
1. Initialize the model parameters m = (p_K(k), Θ_k)_{k ∈ K}.
2. Perform the recognition task, i.e. assuming m is known, compute
   γ_k(x_i) = p̂_{K|X}(k | x_i) = p_K(k) p_{X|K}(x_i | k, Θ_k) / ∑_{j ∈ K} p_K(j) p_{X|K}(x_i | j, Θ_j).
3. Perform the learning task, i.e. assuming the γ_k(x_i) are known, update the ML estimates of the model parameters p_K(k) and Θ_k for all k:
   p_K(k) = (1/n) ∑_{i=1}^{n} γ_k(x_i),
   Θ_k = arg max_{Θ_k} ∑_{i=1}^{n} γ_k(x_i) log p_{X|K}(x_i | k, Θ_k).
4. Iterate 2 and 3 until the model stabilizes.

Features:
- The algorithm does not specify how to update Θ_k in step 3; it depends on the chosen form of p_{X|K}.
- The model created in iteration t is always at least as good as the model from iteration t-1, i.e. L(m) = p(T | m) increases.

Special Case: Gaussian Mixture Model

Each k-th component is a Gaussian distribution:

  N(x | μ_k, Σ_k) = 1 / ((2π)^{D/2} |Σ_k|^{1/2}) · exp{ -(1/2) (x - μ_k)^T Σ_k^{-1} (x - μ_k) }.

Gaussian Mixture Model (GMM):

  p(x) = ∑_{k=1}^{K} p_K(k) p_{X|K}(x | k, Θ_k) = ∑_{k=1}^{K} α_k N(x | μ_k, Σ_k),

assuming ∑_{k=1}^{K} α_k = 1 and 0 ≤ α_k ≤ 1.

[Figure: example of a Gaussian mixture density]

EM for GMM

1. Initialize the model parameters m = (p_K(k), μ_k, Σ_k)_{k ∈ K}.
2. Perform the recognition task as in the general case, i.e. assuming m is known, compute
   γ_k(x_i) = p̂_{K|X}(k | x_i) = p_K(k) p_{X|K}(x_i | k, Θ_k) / ∑_{j ∈ K} p_K(j) p_{X|K}(x_i | j, Θ_j) = α_k N(x_i | μ_k, Σ_k) / ∑_{j ∈ K} α_j N(x_i | μ_j, Σ_j).
3. Perform the learning task, i.e. assuming the γ_k(x_i) are known, update the ML estimates of the model parameters α_k, μ_k and Σ_k for all k:
   α_k = p_K(k) = (1/n) ∑_{i=1}^{n} γ_k(x_i),
   μ_k = ∑_{i=1}^{n} γ_k(x_i) x_i / ∑_{i=1}^{n} γ_k(x_i),
   Σ_k = ∑_{i=1}^{n} γ_k(x_i) (x_i - μ_k)(x_i - μ_k)^T / ∑_{i=1}^{n} γ_k(x_i).
4. Iterate 2 and 3 until the model stabilizes.

Remarks:
- Each data point belongs to all components to a certain degree γ_k(x_i).
- The equation for μ_k is just a weighted average of the x_i's.
- The equation for Σ_k is just a weighted covariance matrix.

Example: Source data

[Figure: source data generated from 3 Gaussians]

Example: Input to EM algorithm

The data were given to the EM algorithm as an unlabeled dataset.

[Figure: the unlabeled input dataset]

Example: EM Iterations

[Figures: snapshots of successive EM iterations on the example dataset]


Example: Ground Truth and EM Estimate

[Figure: the ground-truth components (left) and the EM estimate (right)]

The ground truth (left) and the EM estimate (right) are very close because we have enough data, we know the right number of components, and we were lucky that EM converged to the right local optimum of the likelihood function.
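The GMM updates above translate almost line by line into numpy. The sketch below is illustrative and is not the code behind the figures: the synthetic dataset, the choice K = 3, the scipy dependency for the Gaussian density, and the small ridge added to Σ_k for numerical stability are all assumptions of this example.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
# Illustrative data from three 2-D Gaussians (stands in for the example dataset).
X = np.vstack([rng.multivariate_normal(m, np.eye(2), 150)
               for m in ([0, 0], [4, 4], [0, 5])])
n, D = X.shape
K = 3

# 1. Initialize alpha_k, mu_k, Sigma_k.
alpha = np.full(K, 1.0 / K)
mu = X[rng.choice(n, K, replace=False)]
Sigma = np.array([np.cov(X.T) for _ in range(K)])

prev_ll = -np.inf
for it in range(200):
    # 2. E-step: responsibilities gamma_k(x_i).
    dens = np.column_stack([multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)])
    weighted = alpha * dens                              # (n, K)
    gamma = weighted / weighted.sum(axis=1, keepdims=True)

    # 3. M-step: ML updates of alpha_k, mu_k, Sigma_k.
    Nk = gamma.sum(axis=0)                               # effective counts per component
    alpha = Nk / n
    mu = (gamma.T @ X) / Nk[:, None]
    for k in range(K):
        d = X - mu[k]
        Sigma[k] = (gamma[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(D)

    # 4. Iterate until the observed-data log-likelihood stops improving.
    ll = np.log(weighted.sum(axis=1)).sum()
    if ll - prev_ll < 1e-8:
        break
    prev_ll = ll

print("mixing weights:", alpha, "\nmeans:\n", mu)
```

The loop stops once the observed-data log-likelihood stops improving, which, per the EM guarantee above, it can only do monotonically.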

Baum-Welch Algorithm: EM for HMM

Hidden Markov Model

A 1st-order HMM is a generative probabilistic model formed by
- a sequence of hidden variables X_0, ..., X_t; the domain of each of them is the set of states {s_1, ..., s_N},
- a sequence of observed variables E_1, ..., E_t; the domain of each of them is the set of observations {v_1, ..., v_M},
- an initial distribution over hidden states P(X_0),
- a transition model P(X_t | X_{t-1}), and
- an emission model P(E_t | X_t).

Simulating an HMM:
1. Generate an initial state x_0 according to P(X_0). Set t = 1.
2. Generate a new current state x_t according to P(X_t | x_{t-1}).
3. Generate an observation e_t according to P(E_t | x_t).
4. Advance time, t = t + 1.
5. Finish, or repeat from step 2.

With an HMM, efficient algorithms exist for solving inference tasks; but we have no idea (so far) how to learn the HMM parameters from the observation sequence, because we do not have access to the hidden states.

Learning an HMM from data

Is it possible to learn an HMM from data?
- There is no known way to analytically solve for the model which maximizes the probability of the observations.
- There is no optimal way of estimating the model parameters from the observation sequences.
- We can, however, find model parameters such that the probability of the observations is maximized: the Baum-Welch algorithm (a special case of EM).

Let's use a slightly different notation to emphasize the model parameters:
- π = [π_i] = [P(X_1 = s_i)] ... the vector of initial probabilities of the states,
- A = [a_{i,j}] = [P(X_t = s_j | X_{t-1} = s_i)] ... the matrix of transition probabilities to the next state given the current state,
- B = [b_{i,k}] = [P(E_t = v_k | X_t = s_i)] ... the matrix of observation probabilities given the current state.
The whole set of HMM parameters is then θ = (π, A, B).

The algorithm (presented on the next slides) will compute the expected numbers of being in a state or taking a transition, given the observations and the current model parameters θ = (π, A, B), and then compute a new estimate of the model parameters θ' = (π', A', B') such that P(e_{1:T} | θ') ≥ P(e_{1:T} | θ).
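The five simulation steps above are easy to run directly. In the sketch below, the state set, the observation symbols, and the particular values of π, A and B are all made-up illustrations:

```python
import numpy as np

rng = np.random.default_rng(3)

states = ["Rainy", "Sunny"]          # s_1..s_N   (illustrative)
symbols = ["walk", "shop", "clean"]  # v_1..v_M   (illustrative)

pi = np.array([0.6, 0.4])                 # P(X_0)
A = np.array([[0.7, 0.3],                 # P(X_t = s_j | X_{t-1} = s_i), one row per current state
              [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5],            # P(E_t = v_k | X_t = s_i)
              [0.6, 0.3, 0.1]])

def simulate(T):
    """Generate T (state, observation) steps from the HMM, following steps 1-5 above."""
    x = rng.choice(len(states), p=pi)                  # 1. x_0 ~ P(X_0)
    xs, es = [x], []
    for t in range(1, T + 1):
        x = rng.choice(len(states), p=A[x])            # 2. x_t ~ P(X_t | x_{t-1})
        es.append(rng.choice(len(symbols), p=B[x]))    # 3. e_t ~ P(E_t | x_t)
        xs.append(x)                                   # 4./5. advance and repeat
    return xs, es

hidden, observed = simulate(10)
print("hidden:  ", [states[i] for i in hidden])
print("observed:", [symbols[i] for i in observed])
```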

Sufficient statistics

Let's define the probability of a transition from state s_i at time t to state s_j at time t+1, given the model and the observation sequence e_{1:T}:

  ξ_t(i, j) = P(X_t = s_i, X_{t+1} = s_j | e_{1:T}, θ)
            = α_t(s_i) a_{ij} b_{j,e_{t+1}} β_{t+1}(s_j) / P(e_{1:T} | θ)
            = α_t(s_i) a_{ij} b_{j,e_{t+1}} β_{t+1}(s_j) / ∑_{i=1}^{N} ∑_{j=1}^{N} α_t(s_i) a_{ij} b_{j,e_{t+1}} β_{t+1}(s_j),

where α_t and β_t are the forward and backward messages computed by the forward-backward algorithm, and the probability of being in state s_i at time t, given the model and the observation sequence:

  γ_t(i) = ∑_{j=1}^{N} ξ_t(i, j).

Then we can interpret

  ∑_{t=1}^{T-1} γ_t(i) as the expected number of transitions from state s_i, and
  ∑_{t=1}^{T-1} ξ_t(i, j) as the expected number of transitions from s_i to s_j.

Baum-Welch algorithm

The re-estimation formulas are

  π'_i = expected frequency of being in state s_i at time t = 1 = γ_1(i),

  a'_{ij} = (expected number of transitions from s_i to s_j) / (expected number of transitions from s_i)
          = ∑_{t=1}^{T-1} ξ_t(i, j) / ∑_{t=1}^{T-1} γ_t(i),

  b'_{jk} = (expected number of times being in state s_j and observing v_k) / (expected number of times being in state s_j)
          = ∑_{t=1}^{T} I(e_t = v_k) γ_t(j) / ∑_{t=1}^{T} γ_t(j).

As with other EM variants, with the old model parameters θ = (π, A, B) and the new, re-estimated parameters θ' = (π', A', B'), the new model is at least as likely as the old one:

  P(e_{1:T} | θ') ≥ P(e_{1:T} | θ).

The above equations are used iteratively, with θ' taking the place of θ.
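A compact, unscaled numpy sketch of one Baum-Welch pass using the quantities ξ_t(i, j) and γ_t(i) defined above. The two-state, three-symbol model and the short observation sequence are made up for illustration; for long sequences the forward and backward messages would have to be rescaled (or kept in log space) to avoid underflow.

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One EM (Baum-Welch) update of (pi, A, B) from a single observation sequence."""
    N, T = A.shape[0], len(obs)

    # Forward messages alpha_t(i) and backward messages beta_t(i).
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    evidence = alpha[T - 1].sum()                      # P(e_{1:T} | theta)

    # gamma_t(i) = P(X_t = s_i | e_{1:T}); xi_t(i,j) = P(X_t = s_i, X_{t+1} = s_j | e_{1:T}).
    gamma = alpha * beta / evidence
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / evidence    # shape (T-1, N, N)

    # Re-estimation formulas for pi', A', B'.
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B, evidence

# Illustrative 2-state, 3-symbol model and a short observation sequence.
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])
obs = [0, 1, 2, 2, 1, 0, 0, 2]

for _ in range(10):
    pi, A, B, p = baum_welch_step(pi, A, B, obs)
print("P(obs | theta) after re-estimation:", p)
```

Repeating the step re-estimates (π, A, B) and, as stated above, P(e_{1:T} | θ) cannot decrease from one pass to the next.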

Summary

Competencies

After this lecture, a student shall be able to:
- define and explain the task of maximum likelihood estimation;
- explain why we can maximize the log-likelihood instead of the likelihood, and describe the advantages;
- describe the issues we face when trying to maximize the likelihood in the case of incomplete data;
- explain the general high-level principle of the Expectation-Maximization algorithm;
- describe the pros and cons of the EM algorithm, especially what happens with the likelihood in one EM iteration;
- describe the EM algorithm for mixture distributions, including the notion of responsibilities;
- explain the Baum-Welch algorithm, i.e. the application of EM to HMMs: what parameters are learned and how (conceptually).