Maximum Likelihood (ML), Expectation Maximization (EM) Pieter Abbeel UC Berkeley EECS



Many slides adapted from Thrun, Burgard and Fox, Probabilistic Robotics.

Outline
- Maximum likelihood (ML)
- Priors, and maximum a posteriori (MAP)
- Cross-validation
- Expectation Maximization (EM)

Thumbtack
Let µ = P(up), 1-µ = P(down). How to determine µ?
Empirical estimate: 8 up, 2 down → µ = 8/(8+2) = 0.8

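Maximum Likelihood
µ = P(up), 1-µ = P(down).
Observe: 8 up, 2 down.
The likelihood of the observation sequence depends on µ: $p(\text{data}; \mu) = \mu^8 (1-\mu)^2$.
Maximum likelihood finds $\mu_{ML} = \arg\max_\mu \mu^8 (1-\mu)^2$.
Setting the derivative $8\mu^7(1-\mu)^2 - 2\mu^8(1-\mu) = \mu^7(1-\mu)(8 - 10\mu)$ to zero → extrema at µ = 0, µ = 1, µ = 0.8 → inspection of each extremum yields $\mu_{ML} = 0.8$.

Maximum Likelihood
More generally, consider a binary-valued random variable with µ = P(1), 1-µ = P(0), and assume we observe $n_1$ ones and $n_0$ zeros.
Likelihood: $\mu^{n_1} (1-\mu)^{n_0}$
Derivative: $\mu^{n_1-1} (1-\mu)^{n_0-1} \, (n_1 (1-\mu) - n_0 \mu)$
Hence the extrema are µ = 0, µ = 1 and $\mu = n_1/(n_0+n_1)$. The maximum is $n_1/(n_0+n_1)$, i.e., the empirical frequency.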
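The closed-form estimate can be checked numerically. The following is a minimal sketch, assuming the thumbtack data of 8 up and 2 down; the function names and the grid resolution are illustrative choices, not part of the slides.

```python
def bernoulli_mle(n1, n0):
    """Closed-form ML estimate of mu = P(1) from n1 ones and n0 zeros."""
    return n1 / (n0 + n1)


def likelihood(mu, n1, n0):
    """Likelihood mu^n1 * (1 - mu)^n0 of the observed counts."""
    return mu ** n1 * (1.0 - mu) ** n0


# Thumbtack example: 8 up, 2 down.
mu_ml = bernoulli_mle(8, 2)

# Brute-force check: evaluate the likelihood on a grid over [0, 1]
# (the grid step of 0.001 is an arbitrary choice for illustration).
grid = [i / 1000 for i in range(1001)]
mu_grid = max(grid, key=lambda mu: likelihood(mu, 8, 2))

print(mu_ml, mu_grid)
```

The grid search is only a sanity check; in practice the closed form is used directly.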

Log-likelihood
The function $\log x$ is a monotonically increasing function of x.
Hence for any (positive-valued) function f: $\arg\max_\mu f(\mu) = \arg\max_\mu \log f(\mu)$
In practice it is often more convenient to optimize the log-likelihood rather than the likelihood itself.

Log-likelihood ↔ Likelihood
Reconsider thumbtacks: 8 up, 2 down.
Likelihood: $\mu^8 (1-\mu)^2$ (not concave)
Log-likelihood: $8 \log \mu + 2 \log(1-\mu)$ (concave)
Definition: A function f is concave if and only if $f(\lambda x_2 + (1-\lambda) x_1) \ge \lambda f(x_2) + (1-\lambda) f(x_1)$ for all $x_1, x_2$ and all $\lambda \in [0, 1]$.
Concave functions are generally easier to maximize than non-concave functions.
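As a worked example of why the log-likelihood is convenient, here is a short derivation for the binary case with $n_1$ ones and $n_0$ zeros (symbols as in the Maximum Likelihood slide; the name $\ell$ for the log-likelihood is introduced here for illustration).

```latex
\begin{align*}
\ell(\mu) &= \log\!\left(\mu^{n_1}(1-\mu)^{n_0}\right) = n_1 \log \mu + n_0 \log(1-\mu) \\
\frac{d\ell}{d\mu} &= \frac{n_1}{\mu} - \frac{n_0}{1-\mu} = 0
  \;\Longrightarrow\; n_1 (1-\mu) = n_0 \mu
  \;\Longrightarrow\; \mu_{ML} = \frac{n_1}{n_0+n_1} \\
\frac{d^2\ell}{d\mu^2} &= -\frac{n_1}{\mu^2} - \frac{n_0}{(1-\mu)^2} < 0
  \quad \text{for } \mu \in (0,1)
\end{align*}
```

The second derivative is negative everywhere on (0, 1), so the log-likelihood is concave there and the single stationary point is the maximum.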

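Concavity and Convexity
f is concave if and only if $f(\lambda x_2 + (1-\lambda) x_1) \ge \lambda f(x_2) + (1-\lambda) f(x_1)$ for all $x_1, x_2$ and all $\lambda \in [0, 1]$. Easy to maximize.
f is convex if and only if $f(\lambda x_2 + (1-\lambda) x_1) \le \lambda f(x_2) + (1-\lambda) f(x_1)$ for all $x_1, x_2$ and all $\lambda \in [0, 1]$. Easy to minimize.
[Figures: a concave and a convex function, each with the points $x_1$, $x_2$ and $\lambda x_2 + (1-\lambda) x_1$ marked]

ML for Multinomial
Consider having received samples $x^{(1)}, \ldots, x^{(m)}$, each taking one of K values. The ML estimate of the probability of outcome k is its empirical frequency $n_k / m$, where $n_k$ is the number of samples equal to k.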

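ML for Fully Observed HMM
Given samples $\{x_0, z_0, x_1, z_1, \ldots, x_T, z_T\}$
Dynamics model: $P(x_{t+1} \mid x_t)$
Observation model: $P(z_t \mid x_t)$
→ Independent ML problems for each $P(x_{t+1} \mid x_t = i)$ and each $P(z_t \mid x_t = i)$

ML for Exponential Distribution
[Figure: exponential density. Figure source: Wikipedia]
Density: $p(x; \lambda) = \lambda e^{-\lambda x}$ for $x \ge 0$
Consider having received samples 3.1, 8.2, 1.7.
Log-likelihood: $ll(\lambda) = 3 \log \lambda - (3.1 + 8.2 + 1.7)\lambda = 3 \log \lambda - 13\lambda$, maximized at $\lambda_{ML} = 3/13$.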
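For the fully observed HMM, each independent ML problem reduces to counting and normalizing, in the same way as the thumbtack estimate. The sketch below assumes discrete states and observations; the function names and the toy sequences are hypothetical and only illustrate the counting.

```python
from collections import Counter, defaultdict


def normalize(counts):
    """Turn a table of counts into a table of relative frequencies."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def fit_fully_observed_hmm(states, observations):
    """ML estimates of the dynamics and observation models by counting."""
    transition_counts = defaultdict(Counter)
    emission_counts = defaultdict(Counter)
    # One multinomial ML problem per current state for the dynamics model.
    for x, x_next in zip(states, states[1:]):
        transition_counts[x][x_next] += 1
    # One multinomial ML problem per state for the observation model.
    for x, z in zip(states, observations):
        emission_counts[x][z] += 1
    dynamics = {x: normalize(c) for x, c in transition_counts.items()}
    observation = {x: normalize(c) for x, c in emission_counts.items()}
    return dynamics, observation


# Toy data (made up for illustration).
states = ["a", "a", "b", "a", "b", "b"]
observations = [0, 0, 1, 0, 1, 0]
dynamics, observation = fit_fully_observed_hmm(states, observations)
print(dynamics)
print(observation)
```

States that never occur as a current state get no row in the tables; a prior (fake counts) would be one way to handle that.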

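ML for Exponential Distribution
[Figure: exponential density. Figure source: Wikipedia]
Consider having received samples $x^{(1)}, \ldots, x^{(m)}$.
Log-likelihood: $m \log \lambda - \lambda \sum_{i=1}^m x^{(i)}$, maximized at $\lambda_{ML} = m / \sum_{i=1}^m x^{(i)}$.

Uniform
Consider having received samples $x^{(1)}, \ldots, x^{(m)}$ from a uniform distribution on [a, b]. The likelihood is $(b-a)^{-m}$ when all samples lie in [a, b] and zero otherwise, so $a_{ML} = \min_i x^{(i)}$ and $b_{ML} = \max_i x^{(i)}$.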
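A minimal sketch of both estimates, assuming the exponential density is parameterized by a rate (so the estimate is the number of samples divided by their sum) and using the three samples 3.1, 8.2, 1.7 from the earlier slide. The function names are illustrative.

```python
import math


def exponential_mle(samples):
    """ML estimate of the rate: number of samples divided by their sum."""
    return len(samples) / sum(samples)


def exponential_log_likelihood(rate, samples):
    """Log-likelihood m*log(rate) - rate*sum(samples)."""
    return len(samples) * math.log(rate) - rate * sum(samples)


def uniform_mle(samples):
    """ML estimate of the interval [a, b]: the tightest interval containing the data."""
    return min(samples), max(samples)


samples = [3.1, 8.2, 1.7]
rate_ml = exponential_mle(samples)
print(rate_ml, exponential_log_likelihood(rate_ml, samples))
print(uniform_mle(samples))
```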

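ML for Gaussian
Consider having received samples $x^{(1)}, \ldots, x^{(m)}$.
$\mu_{ML} = \frac{1}{m} \sum_{i=1}^m x^{(i)}$, $\sigma^2_{ML} = \frac{1}{m} \sum_{i=1}^m (x^{(i)} - \mu_{ML})^2$

ML for Conditional Gaussian
$y = a_0 + a_1 x + \epsilon$, with $\epsilon \sim N(0, \sigma^2)$
Equivalently: $y \mid x \sim N(a_0 + a_1 x, \sigma^2)$
More generally: $y = a^\top x + \epsilon$, with $\epsilon \sim N(0, \sigma^2)$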
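For the plain Gaussian, the ML estimates are the sample mean and the sample variance with a 1/m normalizer (not 1/(m-1)). A minimal sketch with made-up samples:

```python
def gaussian_mle(samples):
    """ML estimates (mean, variance) of a univariate Gaussian."""
    m = len(samples)
    mean = sum(samples) / m
    # The ML variance divides by m, which makes it a biased estimator.
    variance = sum((x - mean) ** 2 for x in samples) / m
    return mean, variance


mean_ml, variance_ml = gaussian_mle([1.0, 2.0, 3.0, 4.0])
print(mean_ml, variance_ml)
```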

ML for Conditional Gaussian

ML for Conditional Multivariate Gaussian
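As an illustration of the univariate conditional case, the sketch below assumes the model y = a0 + a1*x + noise with Gaussian noise of unknown variance. Under that assumption the ML estimates of a0 and a1 coincide with ordinary least squares, and the ML noise variance is the mean squared residual. The data and names are made up for the example.

```python
def conditional_gaussian_mle(xs, ys):
    """ML estimates (a0, a1, noise variance) for y = a0 + a1*x + Gaussian noise."""
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    a1 = sxy / sxx                      # slope: least-squares solution
    a0 = y_mean - a1 * x_mean           # intercept
    variance = sum((y - a0 - a1 * x) ** 2 for x, y in zip(xs, ys)) / m
    return a0, a1, variance


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.1, 2.9, 5.2, 6.8]
a0, a1, noise_variance = conditional_gaussian_mle(xs, ys)
print(a0, a1, noise_variance)
```

The multivariate case follows the same pattern with matrix-valued normal equations in place of the scalar ratios.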

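Aside: Key Identities for Derivation on Previous Slide

ML Estimation in Fully Observed Linear Gaussian Bayes Filter Setting
Consider the Linear Gaussian setting:
$x_{t+1} = A x_t + B u_t + w_t$, with $w_t \sim N(0, Q)$
$z_t = C x_t + d + v_t$, with $v_t \sim N(0, R)$
Fully observed, i.e., given the states $x_t$, controls $u_t$ and measurements $z_t$ for all t
→ Two separate ML estimation problems for conditional multivariate Gaussian:
1: the dynamics model, predicting $x_{t+1}$ from $(x_t, u_t)$
2: the observation model, predicting $z_t$ from $x_t$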

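Priors --- Thumbtack
Let µ = P(up), 1-µ = P(down). How to determine µ?
ML estimate: 5 up, 0 down → $\mu_{ML} = 5/(5+0) = 1$
Laplace estimate: add a fake count of 1 for each outcome → $\mu_{Laplace} = (5+1)/(5+1+0+1) = 6/7$

Priors --- Thumbtack
Alternatively, consider µ to be a random variable.
Prior: $P(\mu) \propto \mu(1-\mu)$
Measurements: $P(x \mid \mu)$
Posterior: $P(\mu \mid x) \propto P(x \mid \mu) P(\mu)$; with $n_1$ up and $n_0$ down this is proportional to $\mu^{n_1+1} (1-\mu)^{n_0+1}$
Maximum A Posteriori (MAP) estimation = find the µ that maximizes the posterior → $\mu_{MAP} = (n_1+1)/(n_0+n_1+2)$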
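The effect of the prior can be seen on the 5 up, 0 down data. This is a minimal sketch assuming the prior proportional to µ(1-µ), which acts as one fake count per outcome; the `fake_up` and `fake_down` parameters are illustrative names.

```python
def ml_estimate(n_up, n_down):
    """ML estimate: empirical frequency of 'up'."""
    return n_up / (n_up + n_down)


def map_estimate(n_up, n_down, fake_up=1, fake_down=1):
    """MAP estimate when the prior contributes fake counts to each outcome."""
    return (n_up + fake_up) / (n_up + n_down + fake_up + fake_down)


mu_ml = ml_estimate(5, 0)      # claims 'down' is impossible
mu_map = map_estimate(5, 0)    # keeps some probability mass on 'down'
print(mu_ml, mu_map)
```

With one fake count per outcome the MAP estimate coincides with the Laplace estimate.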

Priors --- Beta Distribution
[Figure: Beta distribution densities. Figure source: Wikipedia]

Priors --- Dirichlet Distribution
Generalizes the Beta distribution.
MAP estimate corresponds to adding fake counts $n_1, \ldots, n_K$.

MAP for Mean of Univariate Gaussian
Assume variance known. (Can be extended to also find MAP for variance.)
Prior:

MAP for Univariate Conditional Linear Gaussian
Assume variance known. (Can be extended to also find MAP for variance.)
Prior:
[Interpret!]

MAP for Univariate Conditional Linear Gaussian: Example
[Figure legend: TRUE, Samples, ML, MAP]

Cross Validation
The choice of prior will heavily influence the quality of the result. Fine-tune the choice of prior through cross-validation:
1. Split the data into a training set and a validation set.
2. For a range of priors:
   - Train: compute $\mu_{MAP}$ on the training set.
   - Cross-validate: evaluate performance on the validation set by evaluating the likelihood of the validation data under the $\mu_{MAP}$ just found.
3. Choose the prior with the highest validation score. For this prior, compute $\mu_{MAP}$ on the (training + validation) set.

Typical training / validation splits:
- 1-fold: 70/30, random split
- 10-fold: partition into 10 sets; average the performance over the 10 runs in which each set in turn is the validation set and the other 9 form the training set
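The procedure above can be sketched for the thumbtack model with a single 70/30 split. Here a "prior" is represented by a symmetric fake count added to each outcome, which is an assumption made for this example; the candidate values, the seed and the function names are likewise illustrative.

```python
import math
import random


def map_estimate(ones, zeros, fake):
    """MAP estimate of P(1) with `fake` pseudo-counts added to each outcome."""
    return (ones + fake) / (ones + zeros + 2 * fake)


def log_likelihood(mu, data):
    """Log-likelihood of binary data under P(1) = mu (-inf if impossible)."""
    total = 0.0
    for x in data:
        p = mu if x == 1 else 1.0 - mu
        if p <= 0.0:
            return float("-inf")
        total += math.log(p)
    return total


def choose_prior(data, fake_counts, train_fraction=0.7, seed=0):
    """Pick the fake count with the best validation log-likelihood."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    split = int(train_fraction * len(shuffled))
    train, validation = shuffled[:split], shuffled[split:]
    scores = {}
    for fake in fake_counts:
        mu = map_estimate(train.count(1), train.count(0), fake)
        scores[fake] = log_likelihood(mu, validation)
    best = max(scores, key=scores.get)
    # Refit on training + validation data with the chosen prior.
    mu_final = map_estimate(data.count(1), data.count(0), best)
    return best, mu_final, scores


data = [1] * 8 + [0] * 2
best, mu_final, scores = choose_prior(data, fake_counts=[0.0, 1.0, 5.0])
print(best, mu_final, scores)
```

A 10-fold variant would repeat the scoring loop over ten train/validation partitions and average the scores per prior before choosing.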

Mixture of Gaussians
Generally: $p(z) = \sum_{k} \theta_k \, N(z; \mu_k, \Sigma_k)$, with mixture weights $\theta_k$, means $\mu_k$ and covariances $\Sigma_k$
ML objective: given data $z^{(1)}, \ldots, z^{(m)}$, maximize $\sum_{i=1}^m \log \sum_k \theta_k \, N(z^{(i)}; \mu_k, \Sigma_k)$
Setting the derivatives with respect to the mixture weights, means and covariances equal to zero does not allow us to solve for their ML estimates in closed form.
We can evaluate the function → we can in principle perform local optimization, see future lectures.
In this lecture: the EM algorithm, which is typically used to efficiently optimize the objective (locally).

Expectation Maximization (EM)
Example model: a mixture with two components, with means $\mu_1$ and $\mu_2$; $x^{(i)}$ indicates which component generated $z^{(i)}$.
Goal: given data $z^{(1)}, \ldots, z^{(m)}$ (but no $x^{(i)}$ observed), find maximum likelihood estimates of $\mu_1$, $\mu_2$.
EM basic idea: if the $x^{(i)}$ were known → two easy-to-solve separate ML problems.
EM iterates over:
- E-step: for i = 1, ..., m, fill in the missing data $x^{(i)}$ according to what is most likely given the current model µ
- M-step: run ML for the completed data, which gives a new model µ

EM Derivation
EM solves a Maximum Likelihood problem of the form: $\max_\mu \log \sum_x p(x, z; \mu)$
- µ: parameters of the probabilistic model we try to find
- x: unobserved variables
- z: observed variables
Tool used in the derivation: Jensen's Inequality

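Jensen's inequality
For a concave function f: $f(E[X]) \ge E[f(X)]$
Illustration: $P(X = x_1) = 1-\lambda$, $P(X = x_2) = \lambda$, so $E[X] = \lambda x_2 + (1-\lambda) x_1$
[Figure: concave function with $x_1$, $x_2$ and $E[X]$ marked]

EM Derivation (ctd)
For any distribution q(x): $\log \sum_x p(x, z; \mu) = \log \sum_x q(x) \frac{p(x, z; \mu)}{q(x)} \ge \sum_x q(x) \log \frac{p(x, z; \mu)}{q(x)}$
Jensen's Inequality: equality holds when the function is affine over the values taken, in particular when $\frac{p(x, z; \mu)}{q(x)}$ is constant in x. This is achieved for $q(x) = p(x \mid z; \mu) \propto p(x, z; \mu)$.
EM Algorithm: Iterate
1. E-step: Compute $q(x) = p(x \mid z; \mu)$
2. M-step: Compute $\mu = \arg\max_\mu \sum_x q(x) \log p(x, z; \mu)$
The M-step optimization can be done efficiently in most cases. The E-step is usually the more expensive step. It does not fill in the missing data x with hard values, but finds a distribution q(x).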
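A quick numeric check of Jensen's inequality for the two-point distribution in the illustration, using the concave function log. This is only a sanity check of the inequality, not part of the EM algorithm itself; the function name and test values are made up.

```python
import math


def jensen_gap(x1, x2, lam, f=math.log):
    """Return f(E[X]) - E[f(X)] for P(X = x1) = 1 - lam, P(X = x2) = lam."""
    expectation = lam * x2 + (1.0 - lam) * x1
    return f(expectation) - (lam * f(x2) + (1.0 - lam) * f(x1))


gap_spread = jensen_gap(1.0, 4.0, 0.3)   # X takes two different values
gap_point = jensen_gap(2.0, 2.0, 0.3)    # X is constant
print(gap_spread, gap_point)
```

For a concave f the gap is non-negative, and it vanishes when X is constant, which is the situation the E-step creates by choosing q(x) so that the ratio inside the logarithm does not depend on x.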

EM Derivation (ctd)
- The M-step objective is upper-bounded by the true objective.
- The M-step objective is equal to the true objective at the current parameter estimate.
→ The improvement in the true objective is at least as large as the improvement in the M-step objective.

EM 1-D Example
[Figure: EM iterations]
Estimate a 1-d mixture of two Gaussians with unit variance: one parameter µ; $\mu_1 = \mu - 7.5$, $\mu_2 = \mu + 7.5$
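The 1-D example can be sketched as follows. The slides specify unit variances and the offsets of 7.5; the equal mixing weights, the synthetic data, the true value 2.0 and the starting point are assumptions made here for illustration. With tied means, the M-step has the closed form µ = (1/m) Σ_i Σ_k q_i(k) (z_i - offset_k).

```python
import math
import random

OFFSETS = (-7.5, 7.5)  # mu_1 = mu - 7.5, mu_2 = mu + 7.5


def e_step(z, mu):
    """Responsibilities q_i(k), assuming equal mixing weights and unit variance."""
    q = []
    for zi in z:
        logs = [-0.5 * (zi - mu - d) ** 2 for d in OFFSETS]
        top = max(logs)  # subtract the max for numerical stability
        weights = [math.exp(v - top) for v in logs]
        total = sum(weights)
        q.append([w / total for w in weights])
    return q


def m_step(z, q):
    """Closed-form maximizer of the expected complete-data log-likelihood."""
    return sum(qk * (zi - d) for zi, qi in zip(z, q) for qk, d in zip(qi, OFFSETS)) / len(z)


def log_likelihood(z, mu):
    """True objective: log-likelihood of the data under the mixture."""
    total = 0.0
    for zi in z:
        logs = [math.log(0.5) - 0.5 * math.log(2 * math.pi) - 0.5 * (zi - mu - d) ** 2
                for d in OFFSETS]
        top = max(logs)
        total += top + math.log(sum(math.exp(v - top) for v in logs))
    return total


def run_em(z, mu0, iterations=20):
    mu, history = mu0, [log_likelihood(z, mu0)]
    for _ in range(iterations):
        mu = m_step(z, e_step(z, mu))
        history.append(log_likelihood(z, mu))
    return mu, history


rng = random.Random(0)
z = [2.0 + rng.choice(OFFSETS) + rng.gauss(0.0, 1.0) for _ in range(200)]
mu_hat, history = run_em(z, mu0=0.0)
print(mu_hat, history[0], history[-1])
```

Because EM is a local method, a starting point far from the truth could settle in a different local optimum; the sketch starts close enough that this does not arise.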

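EM for Mixture of Gaussians
$X \sim$ Multinomial distribution, $P(X = k; \theta) = \theta_k$
$Z \mid X = k \sim N(\mu_k, \Sigma_k)$
Observed: $z^{(1)}, z^{(2)}, \ldots, z^{(m)}$

EM for Mixture of Gaussians
E-step: $q^{(i)}(k) = P(x^{(i)} = k \mid z^{(i)}) \propto \theta_k \, N(z^{(i)}; \mu_k, \Sigma_k)$
M-step:
$\theta_k = \frac{1}{m} \sum_{i=1}^m q^{(i)}(k)$
$\mu_k = \frac{\sum_i q^{(i)}(k) \, z^{(i)}}{\sum_i q^{(i)}(k)}$
$\Sigma_k = \frac{\sum_i q^{(i)}(k) \, (z^{(i)} - \mu_k)(z^{(i)} - \mu_k)^\top}{\sum_i q^{(i)}(k)}$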
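A minimal 1-D sketch of EM for a mixture of Gaussians, written with the standard updates (responsibilities proportional to weight times density, then weighted counts, means and variances). The restriction to scalar data, the initialization, the iteration count and the synthetic data are assumptions for illustration; there is no safeguard against a component's variance collapsing.

```python
import math
import random


def normal_pdf(z, mean, variance):
    return math.exp(-0.5 * (z - mean) ** 2 / variance) / math.sqrt(2 * math.pi * variance)


def e_step(z, weights, means, variances):
    """Soft assignments: resp[i][k] is proportional to weight_k * N(z_i; mean_k, var_k)."""
    resp = []
    for zi in z:
        joint = [w * normal_pdf(zi, m, v) for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        resp.append([j / total for j in joint])
    return resp


def m_step(z, resp):
    """Weighted ML estimates using the soft assignments as fractional counts."""
    weights, means, variances = [], [], []
    for k in range(len(resp[0])):
        soft_count = sum(r[k] for r in resp)
        mean = sum(r[k] * zi for r, zi in zip(resp, z)) / soft_count
        variance = sum(r[k] * (zi - mean) ** 2 for r, zi in zip(resp, z)) / soft_count
        weights.append(soft_count / len(z))
        means.append(mean)
        variances.append(variance)
    return weights, means, variances


def fit_mixture(z, initial_means, iterations=50):
    k = len(initial_means)
    weights, means, variances = [1.0 / k] * k, list(initial_means), [1.0] * k
    for _ in range(iterations):
        resp = e_step(z, weights, means, variances)
        weights, means, variances = m_step(z, resp)
    return weights, means, variances


rng = random.Random(1)
z = [rng.gauss(-4.0, 1.0) for _ in range(150)] + [rng.gauss(4.0, 1.0) for _ in range(150)]
weights, means, variances = fit_mixture(z, initial_means=[-1.0, 1.0])
print(weights, means, variances)
```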

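ML Objective HMM
Given samples $\{z_0, z_1, \ldots, z_T\}$
Dynamics model: $P(x_{t+1} \mid x_t)$
Observation model: $P(z_t \mid x_t)$
ML objective: $\log P(z_0, \ldots, z_T) = \log \sum_{x_0, \ldots, x_T} P(x_0) \prod_{t=0}^{T-1} P(x_{t+1} \mid x_t) \prod_{t=0}^{T} P(z_t \mid x_t)$
→ No simple decomposition into independent ML problems for each $P(x_{t+1} \mid x_t = i)$ and each $P(z_t \mid x_t = i)$
→ No closed-form solution found by setting derivatives equal to zero

EM for HMM --- M-step
→ The dynamics and observation model parameters are computed from soft counts.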

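EM for HMM --- E-step
No need to find the conditional full joint.
Run the smoother to find: $P(x_t, x_{t+1} \mid z_0, \ldots, z_T)$ and $P(x_t \mid z_0, \ldots, z_T)$

ML Objective for Linear Gaussians
Linear Gaussian setting, given the measurements (states unobserved).
ML objective: the log-likelihood of the measurements, with the unobserved states marginalized out.
EM derivation: same as HMM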

EM for Linear Gaussians --- E-Step
Forward:
Backward:

EM for Linear Gaussians --- M-step
[Updates for A, B, C, d. TODO: Fill in once found/derived.]

EM for Linear Gaussians --- The Log-likelihood
When running EM, it can be good to keep track of the log-likelihood score; it is supposed to increase every iteration.

EM for Extended Kalman Filter Setting
As the linearization is only an approximation, when performing the updates we might end up with parameters that result in a lower (rather than higher) log-likelihood score.
→ Solution: instead of updating the parameters to the newly estimated ones, interpolate between the previous parameters and the newly estimated ones. Perform a line search to find the setting that achieves the highest log-likelihood score.
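The interpolation idea can be sketched generically. The code below assumes the parameters are a flat list of numbers and that some function returning the log-likelihood score is available; the candidate step sizes, the function names and the toy objective are all illustrative, and a real implementation would plug in the filter's own log-likelihood.

```python
def interpolate(old_params, new_params, alpha):
    """Move a fraction alpha of the way from the old to the new parameters."""
    return [(1.0 - alpha) * o + alpha * n for o, n in zip(old_params, new_params)]


def damped_update(old_params, new_params, log_likelihood,
                  step_sizes=(1.0, 0.5, 0.25, 0.125, 0.0)):
    """Coarse line search: keep the interpolated setting with the best score.

    Including alpha = 0.0 guarantees the score never drops below the old one.
    """
    candidates = [interpolate(old_params, new_params, a) for a in step_sizes]
    return max(candidates, key=log_likelihood)


# Toy stand-in for a log-likelihood score, peaked at parameter value 1.0.
def toy_score(params):
    return -(params[0] - 1.0) ** 2


old = [0.0]
proposed = [3.0]   # a full step overshoots the peak and lowers the score
chosen = damped_update(old, proposed, toy_score)
print(chosen, toy_score(old), toy_score(proposed), toy_score(chosen))
```

A finer line search (for example, bisection over alpha) follows the same pattern with more candidate step sizes.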
