Chapter 08: Direct Maximum Likelihood/MAP Estimation and Incomplete Data Problems


Slide 1: LEARNING AND INFERENCE IN GRAPHICAL MODELS
Chapter 08: Direct Maximum Likelihood/MAP Estimation and Incomplete Data Problems
Dr. Martin Lauer
University of Freiburg, Machine Learning Lab
Karlsruhe Institute of Technology, Institute of Measurement and Control Systems

Slide 2: References for this chapter
- Christopher M. Bishop, Pattern Recognition and Machine Learning, ch. 9, Springer, 2006
- Joseph L. Schafer, Analysis of Incomplete Multivariate Data, Chapman & Hall, 1997
- Zoubin Ghahramani, Michael I. Jordan, Learning from Incomplete Data, Technical Report #1509, MIT Artificial Intelligence Laboratory, /AIM-1509.pdf?sequence=2
- Arthur P. Dempster, Nan M. Laird, Donald B. Rubin, Maximum Likelihood from Incomplete Data via the EM Algorithm, in: Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1-38, 1977
- Xiao-Li Meng, Donald B. Rubin, Maximum Likelihood Estimation via the ECM Algorithm: A General Framework, in: Biometrika, vol. 80, no. 2, 1993

Slide 3: Motivation
Up to now:
1. calculate/approximate p(parameters | data)
2. find a meaningful reference value for p(parameters | data), e.g. argmax_parameters p(parameters | data)
-> requires more calculation than is actually necessary
This chapter: find argmax_parameters p(parameters | data) directly (MAP), or argmax_parameters p(data | parameters) directly (ML).
Remark: ML and MAP require basically the same approaches. The only difference is whether we consider priors (which are just additional factors in graphical models). Therefore, we consider both approaches together.

Slide 4: Direct MAP calculation
Posterior distribution in a graphical model:
p(u_1, ..., u_n | o_1, ..., o_m) = p(u_1, ..., u_n, o_1, ..., o_m) / p(o_1, ..., o_m)
p(u_1, ..., u_n, o_1, ..., o_m) = \prod_i f_i(Neighbors(i))
MAP means: solve
argmax_{u_1, ..., u_n} \prod_i f_i(Neighbors(i)) = argmax_{u_1, ..., u_n} \sum_i \log f_i(Neighbors(i))
(since \prod_i f_i(Neighbors(i)) = e^{\sum_i \log f_i(Neighbors(i))})

Slide 5: Direct MAP calculation
Ways to find the MAP:
- The system of equations \partial/\partial u_j \sum_i \log f_i(Neighbors(i)) = 0 can be resolved analytically -> analytical solution for the MAP
- Each single equation \partial/\partial u_j \sum_i \log f_i(Neighbors(i)) = 0 can be solved analytically -> use an iterative approach (next slide)

Slide 6: Direct MAP calculation
Iterative approach:
1. repeat
2.   set u_1 <- argmax_{u_1} \sum_i \log f_i(Neighbors(i))
3.   set u_2 <- argmax_{u_2} \sum_i \log f_i(Neighbors(i))
4.   ...
5.   set u_n <- argmax_{u_n} \sum_i \log f_i(Neighbors(i))
6. until convergence
7. return (u_1, ..., u_n)
If the derivatives \partial/\partial u_j \sum_i \log f_i(Neighbors(i)) can be calculated easily -> numerical solution: use a generic gradient descent algorithm.
The second approach (coordinate-wise maximization) often converges faster than generic gradient descent.
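
The coordinate-wise scheme above can be sketched generically. The following Python snippet is a minimal illustration and not from the lecture: it assumes a user-supplied log_posterior(u) that returns \sum_i \log f_i(Neighbors(i)) for a full variable vector u, and maximizes one coordinate at a time numerically.

    import numpy as np
    from scipy.optimize import minimize_scalar

    def coordinate_map(log_posterior, u_init, n_iter=100, tol=1e-8):
        # coordinate-wise maximization of sum_i log f_i (slide 6)
        u = np.array(u_init, dtype=float)
        for _ in range(n_iter):
            u_old = u.copy()
            for j in range(len(u)):
                def neg(x, j=j):
                    v = u.copy()
                    v[j] = x
                    return -log_posterior(v)
                u[j] = minimize_scalar(neg).x    # set u_j <- argmax over u_j
            if np.max(np.abs(u - u_old)) < tol:
                break
        return u

    # toy check: for a Gaussian log-density the MAP is the mean
    mu = np.array([1.0, -2.0])
    P = np.array([[2.0, 0.3], [0.3, 1.0]])       # precision matrix
    print(coordinate_map(lambda u: -0.5 * (u - mu) @ P @ (u - mu), [0.0, 0.0]))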

Slide 7: Example: bearing-only tracking revisited
- observing a moving object from a fixed position
- the object moves with constant velocity
- for every point in time, the observer senses the angle of observation, but only sometimes the distance to the object
Distributions:
x_0 ~ N(a, R)
v ~ N(b, S)
y_i | x_0, v ~ N(x_0 + t_i v, \sigma^2 I)
r_i = \|y_i\| (distance), w_i = y_i / \|y_i\| (angle of observation)
[Figure: geometry of the task with observer, unknown object movement x_0 + t_i v, observed angle w_i and unknown distance r_i; graphical model with plate over i = 1, ..., n]

Slide 8: Example: bearing-only tracking revisited
Conditional distributions:
x_0 | v, (y_i), (t_i) ~ N( (\frac{n}{\sigma^2} I + R^{-1})^{-1} (\frac{1}{\sigma^2} \sum_i (y_i - t_i v) + R^{-1} a), (\frac{n}{\sigma^2} I + R^{-1})^{-1} )
v | x_0, (y_i), (t_i) ~ N( (\frac{1}{\sigma^2} \sum_i t_i^2 I + S^{-1})^{-1} (\frac{1}{\sigma^2} \sum_i t_i (y_i - x_0) + S^{-1} b), (\frac{1}{\sigma^2} \sum_i t_i^2 I + S^{-1})^{-1} )
r_i | x_0, v, t_i, w_i ~ N( w_i^T (x_0 + t_i v), \sigma^2 )
Updates derived from the conditionals:
x_0 <- (\frac{n}{\sigma^2} I + R^{-1})^{-1} (\frac{1}{\sigma^2} \sum_i (y_i - t_i v) + R^{-1} a)
v <- (\frac{1}{\sigma^2} \sum_i t_i^2 I + S^{-1})^{-1} (\frac{1}{\sigma^2} \sum_i t_i (y_i - x_0) + S^{-1} b)
r_i <- w_i^T (x_0 + t_i v)
Matlab demo (using non-informative priors)
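
As a concrete illustration of these updates, here is a small numpy sketch for the 2-D case with isotropic noise. The function name and the convention of marking missing distances r_i with NaN are my own assumptions; the lecture's Matlab demo is not reproduced here.

    import numpy as np

    def bearing_only_map(w, t, r_obs, a, R, b, S, sigma2, n_iter=200):
        """Iterative MAP updates from slide 8 (2-D case).
        w: (n, 2) unit bearing vectors w_i, t: (n,) observation times,
        r_obs: (n,) observed distances with np.nan where unobserved,
        a, R / b, S: prior mean and covariance of x_0 / v."""
        n = len(t)
        x0, v = a.copy(), b.copy()
        r = np.where(np.isnan(r_obs), 1.0, r_obs)         # arbitrary init for missing r_i
        Rinv, Sinv = np.linalg.inv(R), np.linalg.inv(S)
        for _ in range(n_iter):
            y = r[:, None] * w                            # y_i = r_i * w_i
            A = (n / sigma2) * np.eye(2) + Rinv
            x0 = np.linalg.solve(A, (y - t[:, None] * v).sum(0) / sigma2 + Rinv @ a)
            B = (np.sum(t**2) / sigma2) * np.eye(2) + Sinv
            v = np.linalg.solve(B, (t[:, None] * (y - x0)).sum(0) / sigma2 + Sinv @ b)
            r_pred = w @ x0 + t * (w @ v)                 # w_i^T (x_0 + t_i v)
            r = np.where(np.isnan(r_obs), r_pred, r_obs)  # only missing r_i are updated
        return x0, v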

Slide 9: Example: Gaussian mixtures revisited
\mu_j ~ N(m_0, r_0)
s_j ~ \Gamma^{-1}(a_0, b_0)
w ~ D(\beta)
Z_i | w ~ C(w)
X_i | Z_i, \mu_{Z_i}, s_{Z_i} ~ N(\mu_{Z_i}, s_{Z_i})
[Figure: graphical model with hyperparameters m_0, r_0, a_0, b_0, \beta, parameters \mu_j, s_j (plate over j = 1, ..., k) and w, and variables Z_i, X_i (plate over i = 1, ..., n)]

Slide 10: Example: Gaussian mixtures revisited
Conditional distributions: see slide 07/36
Derived MAP updates:
w <- ( \frac{\beta_1 + n_1 - 1}{n - k + \sum_{j=1}^k \beta_j}, ..., \frac{\beta_k + n_k - 1}{n - k + \sum_{j=1}^k \beta_j} )   with n_j = |{i | z_i = j}|
\mu_j <- \frac{s_j m_0 + r_0 \sum_{i: z_i = j} x_i}{s_j + n_j r_0}
s_j <- \frac{b_0 + \frac{1}{2} \sum_{i: z_i = j} (x_i - \mu_j)^2}{1 + a_0 + \frac{n_j}{2}}
z_i <- argmax_j \frac{w_j}{\sqrt{2 \pi s_j}} e^{-\frac{(x_i - \mu_j)^2}{2 s_j}}
Matlab demo (using priors close to non-informativity)
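
A compact sketch of these coordinate-wise MAP updates for a one-dimensional mixture with hard assignments z_i. The random initialization and the assumption \beta_j >= 1 (so that the weight update stays non-negative) are illustrative choices, not from the slide; beta is assumed to be a numpy array of length k.

    import numpy as np

    def gmm_map_hard(x, k, m0, r0, a0, b0, beta, n_iter=100, seed=0):
        """Coordinate-wise MAP updates of slide 10 for a 1-D Gaussian mixture."""
        rng = np.random.default_rng(seed)
        n = len(x)
        z = rng.integers(k, size=n)                     # random initial assignments
        mu = rng.choice(x, size=k, replace=False)
        s = np.full(k, np.var(x))
        for _ in range(n_iter):
            nj = np.array([(z == j).sum() for j in range(k)])
            w = (beta + nj - 1.0) / (n - k + beta.sum())          # assumes beta_j >= 1
            for j in range(k):
                xj = x[z == j]
                mu[j] = (s[j] * m0 + r0 * xj.sum()) / (s[j] + nj[j] * r0)
                s[j] = (b0 + 0.5 * ((xj - mu[j]) ** 2).sum()) / (1.0 + a0 + nj[j] / 2.0)
            # reassign each point to the component with the highest weighted density
            dens = w / np.sqrt(2 * np.pi * s) * np.exp(-0.5 * (x[:, None] - mu) ** 2 / s)
            z = np.argmax(dens, axis=1)
        return w, mu, s, z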

Slide 11: Example: Gaussian mixtures revisited
Observations:
- convergence is very fast
- the result depends very much on the initialization
- we treat the z_i like parameters of the model, although the mixture model is completely specified by w, \mu_1, ..., \mu_k, s_1, ..., s_k; the z_i are not parameters of the mixture but latent variables, which are only used to simplify our calculations -> why should we maximize the posterior w.r.t. the z_i?
[Figure: graphical model of the mixture repeated from slide 9]

Slide 12: Latent variables
Latent variables are
- not part of the stochastic model
- not interesting for the final estimate
- useful to simplify calculations
- often interpreted as missing observations
Examples:
- the class assignment variables z_i in the mixture modeling can be interpreted as missing class labels for a multi-class distribution
- the missing distances r_i in the bearing-only tracking task can be interpreted as missing parts of the data
- occluded parts of an object in an image can be seen as missing pixels
- data from a statistical evaluation which have been lost

Slide 13: Incomplete data problems
Let us assume that all data x are split into an observed part y and a missing part z, i.e. x = (y, z). We can distinguish three cases:
- completely missing at random (CMAR): whether an entry of x belongs to y or z is stochastically independent of both y and z:
  P(x_i belongs to z) = P(x_i belongs to z | y) = P(x_i belongs to z | y, z)
- missing at random (MAR): whether an entry of x belongs to y or z is stochastically independent of z but might depend on y:
  P(x_i belongs to z) \neq P(x_i belongs to z | y) = P(x_i belongs to z | y, z)
- censored data: whether an entry of x belongs to y or z is stochastically dependent on z:
  P(x_i belongs to z | y) \neq P(x_i belongs to z | y, z)
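
A tiny simulation can make the three mechanisms concrete. The generating model below (z depends on the observed y) and all variable names are purely illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    y = rng.normal(size=100000)            # always-observed part of each record
    z = y + rng.normal(size=100000)        # part that may go missing

    miss_cmar = rng.random(100000) < 0.3                    # ignores y and z
    miss_mar = rng.random(100000) < 1 / (1 + np.exp(-y))    # depends only on observed y
    miss_cens = z > 0.5                                     # depends on z itself (censored)

    for name, miss in [("CMAR", miss_cmar), ("MAR", miss_mar), ("censored", miss_cens)]:
        print(name, "mean of the z values that remain observed:", round(z[~miss].mean(), 2))
    # CMAR leaves the observed z representative; under MAR the naive mean is biased but
    # the bias can be corrected by modelling p(missing | y); under censoring it cannot be
    # removed without extra assumptions (cf. slide 15).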

Slide 14: Incomplete data problems
Discuss the following examples of incomplete data:
- the z_i in mixture models
- a sensor that measures values only down to a certain minimal value
- an interrupted connection between a sensor and a host computer, so that some measurements are not transmitted
- a stereo camera system that measures light intensity and distance but is unable to calculate the distance for overexposed areas
- a sensor that fails often if temperatures are low,
  - if the sensor measures the activities of the sun
  - if the sensor measures the persons on a beach
- non-responses in public opinion polls

Slide 15: Incomplete data problems
Consequences for the stochastic analysis:
- CMAR: no problem at all, incomplete data do not disturb our results
- MAR: can be treated if we model the stochastic dependency between the observed data and the missing data
- censored data: no general treatment possible; results will be disturbed, and no reconstruction of the missing data is possible
We focus on the CMAR and MAR cases here.

Slide 16: Inference for incomplete data problems
- variational Bayes, Monte Carlo: model the full posterior over the parameters of the model and the latent (missing) data. Afterwards, ignore the latent variables and return the result for the parameters of your model.
- direct MAP/ML: do not maximize the posterior/likelihood over the parameters and the latent variables. Instead, consider all possible values that can be taken by the latent variables and maximize the posterior/likelihood only w.r.t. the parameters of your stochastic model:
  - expectation-maximization algorithm (EM)
  - expectation-conditional-maximization algorithm (ECM)

Slide 17: EM algorithm
Let us denote
- the parameters of the stochastic model (the posterior distribution): \theta = (\theta_1, ..., \theta_k)
- the latent variables: \lambda = (\lambda_1, ..., \lambda_m)
- the observed data: o = (o_1, ..., o_n)
- the log-posterior: L(\theta, \lambda, o) = \sum_i \log f_i(Neighbors(i))

Slide 18: EM algorithm
We aim at maximizing the expected log-posterior over all values of the latent variables:
argmax_\theta \int_{R^m} L(\theta, \lambda, o) p(\lambda | \theta, o) d\lambda
An iterative approach to solve it:
1. start with some parameter vector \theta
2. repeat
3.   Q(\theta') <- \int_{R^m} L(\theta', \lambda, o) p(\lambda | \theta, o) d\lambda
4.   \theta <- argmax_{\theta'} Q(\theta')
5. until convergence
This algorithm is known as the expectation-maximization algorithm (Dempster, Laird, Rubin, 1977).
- step 3: expectation step (E-step)
- step 4: maximization step (M-step)
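
The E/M loop can be written down generically. In the sketch below, make_Q is a placeholder supplied by the user: given the current \theta it performs the E-step and returns a callable Q, while the M-step is done numerically. The toy usage at the end (a two-component 1-D mixture with unit variances and equal, fixed weights) is only meant to show the interface; all names are assumptions.

    import numpy as np
    from scipy.optimize import minimize

    def em(theta0, make_Q, n_iter=100, tol=1e-8):
        """Generic EM loop: E-step builds Q from the current theta, M-step maximizes it."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(n_iter):
            Q = make_Q(theta)                          # E-step
            res = minimize(lambda t: -Q(t), theta)     # M-step (numerical here)
            if np.max(np.abs(res.x - theta)) < tol:
                theta = res.x
                break
            theta = res.x
        return theta

    # toy usage: 1-D mixture of two unit-variance Gaussians, unknown means
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])

    def make_Q(theta):
        # E-step: responsibilities under the current means (equal, fixed weights)
        d = np.exp(-0.5 * (x[:, None] - theta) ** 2)
        resp = d / d.sum(1, keepdims=True)
        # Q(theta') = sum_i sum_j resp_ij * log N(x_i | theta'_j, 1) + const
        return lambda t: np.sum(resp * (-0.5 * (x[:, None] - t) ** 2))

    print(em([0.0, 1.0], make_Q))   # -> approximately [-2, 3] (up to label switching)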

Slide 19: EM algorithm
Remarks:
- during the E-step, intermediate variables are calculated which allow us to represent Q without relying on the previous values of \theta
- closed-form expressions for Q and explicit maximization often require lengthy algebraic calculations
- for some applications, calculating the E-step means calculating the expectation values of the latent variables. But this does not apply in general.
Famous application areas:
- mixture distributions
- learning hidden Markov models from example sequences (Baum-Welch algorithm)

Slide 20: Example: bearing-only tracking revisited
Conditional distribution:
r_i | x_0, v, w_i ~ N( w_i^T (x_0 + t_i v), \sigma^2 )
The posterior distribution:
\frac{1}{2\pi \sqrt{|R|}} e^{-\frac{1}{2} (x_0 - a)^T R^{-1} (x_0 - a)}   (prior of x_0)
\cdot \frac{1}{2\pi \sqrt{|S|}} e^{-\frac{1}{2} (v - b)^T S^{-1} (v - b)}   (prior of v)
\cdot \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{1}{2} \frac{\|x_0 + t_i v - r_i w_i\|^2}{\sigma^2}}   (data term)
[Figure: geometry of the bearing-only tracking task repeated from slide 7]

Slide 21: Example: bearing-only tracking revisited
... (after lengthy, error-prone calculations) ...
Q(x_0, v) = const - \frac{1}{2} (x_0 - a)^T R^{-1} (x_0 - a) - \frac{1}{2} (v - b)^T S^{-1} (v - b) - \frac{1}{2} \sum_{i=1}^n \frac{\|x_0 + t_i v - \rho_i w_i\|^2}{\sigma^2}
with \rho_i = r_i if r_i is observed, and \rho_i = (x_0 + t_i v)^T w_i if r_i is unobserved.
Determining the maxima w.r.t. x_0, v:
\begin{pmatrix} R^{-1} + \frac{n}{\sigma^2} I & \frac{\sum_i t_i}{\sigma^2} I \\ \frac{\sum_i t_i}{\sigma^2} I & S^{-1} + \frac{\sum_i t_i^2}{\sigma^2} I \end{pmatrix} \begin{pmatrix} x_0 \\ v \end{pmatrix} = \begin{pmatrix} R^{-1} a + \frac{1}{\sigma^2} \sum_i \rho_i w_i \\ S^{-1} b + \frac{1}{\sigma^2} \sum_i t_i \rho_i w_i \end{pmatrix}
Matlab demo (using non-informative priors)
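
A numpy sketch of the resulting EM iteration for the 2-D case: the E-step fills in \rho_i with the current prediction, the M-step solves the block linear system above. Marking missing r_i with NaN and the function name are my own conventions; the lecture only refers to a Matlab demo.

    import numpy as np

    def bearing_only_em(w, t, r_obs, a, R, b, S, sigma2, n_iter=100):
        """EM for the bearing-only model of slides 20/21 (2-D, NaN = unobserved r_i)."""
        n = len(t)
        Rinv, Sinv = np.linalg.inv(R), np.linalg.inv(S)
        I2 = np.eye(2)
        x0, v = a.copy(), b.copy()
        for _ in range(n_iter):
            # E-step: expected distance for missing r_i given the current x0, v
            rho = np.where(np.isnan(r_obs), w @ x0 + t * (w @ v), r_obs)
            # M-step: solve the 4x4 block linear system for (x0, v)
            A = np.block([
                [Rinv + (n / sigma2) * I2,        (t.sum() / sigma2) * I2],
                [(t.sum() / sigma2) * I2,         Sinv + ((t**2).sum() / sigma2) * I2],
            ])
            rhs = np.concatenate([
                Rinv @ a + (rho[:, None] * w).sum(0) / sigma2,
                Sinv @ b + (t[:, None] * rho[:, None] * w).sum(0) / sigma2,
            ])
            sol = np.linalg.solve(A, rhs)
            x0, v = sol[:2], sol[2:]
        return x0, v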

Slide 22: ECM algorithm
We still aim at maximizing the expected log-posterior over all values of the latent variables:
argmax_\theta \int_{R^m} L(\theta, \lambda, o) p(\lambda | \theta, o) d\lambda
Sometimes the M-step of the EM algorithm cannot be calculated, i.e. argmax_{\theta_1, ..., \theta_k} Q(\theta') cannot be resolved analytically. But it might happen that argmax_{\theta_i} Q(\theta') can be resolved for each \theta_i or for groups of parameters.
-> expectation-conditional-maximization algorithm (Meng & Rubin, 1993)

Slide 23: ECM algorithm
Define a set of constraints g_i(\theta', \theta) on the parameter set, e.g. g_i: \theta'_j = \theta_j for all j \neq i.
Replace the single M-step of the EM algorithm by a sequence of CM-steps, one for each constraint:
1. start with some parameter vector \theta
2. repeat
3.   Q(\theta') <- \int_{R^m} L(\theta', \lambda, o) p(\lambda | \theta, o) d\lambda   (E-step)
4.   \theta <- argmax_{\theta'} Q(\theta') subject to g_1(\theta', \theta)   (CM-step)
5.   ...
6.   \theta <- argmax_{\theta'} Q(\theta') subject to g_\nu(\theta', \theta)   (CM-step)
7. until convergence

Slide 24: Example: Gaussian mixtures revisited
\mu_j ~ N(m_0, r_0)
s_j ~ \Gamma^{-1}(a_0, b_0)
w ~ D(\beta)
Z_i | w ~ C(w)
X_i | Z_i, \mu_{Z_i}, s_{Z_i} ~ N(\mu_{Z_i}, s_{Z_i})
[Figure: graphical model of the mixture repeated from slide 9]
Conditional distribution (cf. slide 07/35):
z_i | w, x_i, \mu_1, ..., \mu_k, s_1, ..., s_k ~ C(h_{i,1}, ..., h_{i,k})   with h_{i,j} \propto \frac{w_j}{\sqrt{2 \pi s_j}} e^{-\frac{(x_i - \mu_j)^2}{2 s_j}}

Slide 25: Example: Gaussian mixtures revisited
Q(w, \mu_1, ..., \mu_k, s_1, ..., s_k) =
\sum_{z_1=1}^k ... \sum_{z_n=1}^k h_{1,z_1} \cdots h_{n,z_n} \cdot (
    \sum_{j=1}^k \log( \frac{1}{\sqrt{2\pi r_0}} e^{-\frac{(\mu_j - m_0)^2}{2 r_0}} )   (prior of \mu_j)
  + \sum_{j=1}^k \log( \frac{b_0^{a_0}}{\Gamma(a_0)} (s_j)^{-a_0 - 1} e^{-\frac{b_0}{s_j}} )   (prior of s_j)
  + \log( \frac{\Gamma(\beta_1 + ... + \beta_k)}{\Gamma(\beta_1) \cdots \Gamma(\beta_k)} \prod_{j=1}^k (w_j)^{\beta_j - 1} )   (prior of w)
  + \sum_i ( \log( \frac{1}{\sqrt{2\pi s_{z_i}}} e^{-\frac{(x_i - \mu_{z_i})^2}{2 s_{z_i}}} )   (data term of x_i)
           + \log(w_{z_i}) ) )   (data term of z_i)
We can maximize Q easily (blackboard/homework).
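
Maximizing Q with soft responsibilities yields closed-form updates analogous to the hard-assignment case on slide 10. The lecture leaves that derivation as blackboard/homework, so the update formulas in this sketch are my own reconstruction and should be checked against it; everything is 1-D and the initialization is an illustrative choice.

    import numpy as np

    def gmm_map_em(x, k, m0, r0, a0, b0, beta, n_iter=200, seed=0):
        """EM with soft responsibilities for the 1-D mixture with the priors above."""
        rng = np.random.default_rng(seed)
        n = len(x)
        mu = rng.choice(x, size=k, replace=False)
        s = np.full(k, np.var(x))
        w = np.full(k, 1.0 / k)
        for _ in range(n_iter):
            # E-step: responsibilities h_ij proportional to w_j * N(x_i | mu_j, s_j)
            h = w / np.sqrt(2 * np.pi * s) * np.exp(-0.5 * (x[:, None] - mu) ** 2 / s)
            h /= h.sum(1, keepdims=True)
            nj = h.sum(0)
            # M-step: maximize Q, taking the priors into account (reconstructed updates)
            w = (nj + beta - 1.0) / (n + beta.sum() - k)
            mu = (s * m0 + r0 * (h * x[:, None]).sum(0)) / (s + nj * r0)
            s = (b0 + 0.5 * (h * (x[:, None] - mu) ** 2).sum(0)) / (a0 + 1.0 + nj / 2.0)
        return w, mu, s

    # toy usage with priors close to non-informativity
    x = np.concatenate([np.random.default_rng(1).normal(-3, 1, 200),
                        np.random.default_rng(2).normal(2, 0.5, 300)])
    print(gmm_map_em(x, k=2, m0=0.0, r0=100.0, a0=0.001, b0=0.001,
                     beta=np.array([1.0, 1.0])))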

Slide 26: Example: Gaussian mixtures revisited
Matlab demo (using non-informative priors)
Some observations on EM/ECM for Gaussian mixtures:
- very popular
- very sensitive to the initialization of parameters
- overfits the data if the mixture is too large (for ML/MAP with non-informative priors)

Slide 27: Laplace approximation
MAP calculates a best estimate. Can we derive an approximation for the posterior distribution?
Idea: determine a Gaussian that is locally most similar to the posterior.
Taylor approximation of the log-posterior around the MAP estimate \theta_{MAP}:
\log p(\theta) \approx \log p(\theta_{MAP}) + grad^T (\theta - \theta_{MAP}) + \frac{1}{2} (\theta - \theta_{MAP})^T H (\theta - \theta_{MAP})
             = \log p(\theta_{MAP}) + \frac{1}{2} (\theta - \theta_{MAP})^T H (\theta - \theta_{MAP})
with H the Hessian of \log p at \theta_{MAP} (the gradient vanishes at the maximum).
Log of a Gaussian around \theta_{MAP}:
-\log( (2\pi)^{d/2} \sqrt{|\Sigma|} ) - \frac{1}{2} (\theta - \theta_{MAP})^T \Sigma^{-1} (\theta - \theta_{MAP})
We obtain the same shape of the Gaussian if we choose \Sigma^{-1} = -H. This is known as the Laplace approximation.
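
A small numerical sketch of this construction: it estimates the Hessian H of \log p at \theta_{MAP} by central finite differences and returns the Gaussian with covariance (-H)^{-1}, i.e. \Sigma^{-1} = -H. The helper name and the step size are illustrative choices; in practice one would use an analytic Hessian where available.

    import numpy as np

    def laplace_approximation(log_p, theta_map, eps=1e-4):
        """Gaussian approximation with mean theta_map and covariance (-H)^{-1}."""
        d = len(theta_map)
        H = np.zeros((d, d))
        for i in range(d):
            for j in range(d):
                e_i, e_j = np.eye(d)[i] * eps, np.eye(d)[j] * eps
                # central finite-difference estimate of the mixed second derivative
                H[i, j] = (log_p(theta_map + e_i + e_j) - log_p(theta_map + e_i - e_j)
                           - log_p(theta_map - e_i + e_j) + log_p(theta_map - e_i - e_j)) / (4 * eps**2)
        return theta_map, np.linalg.inv(-H)

    # sanity check: for a Gaussian log-density the approximation is exact
    Sigma = np.array([[1.0, 0.4], [0.4, 2.0]])
    log_p = lambda th: -0.5 * th @ np.linalg.inv(Sigma) @ th
    mean, cov = laplace_approximation(log_p, np.zeros(2))
    print(cov)   # -> approximately Sigma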

Slide 28: Summary
- direct maximization of the likelihood/posterior
- latent variables
- incomplete data problems
- EM/ECM algorithm
- Laplace approximation
