CS Lecture 13. More Maximum Likelihood

Size: px

Start display at page:

Download "CS Lecture 13. More Maximum Likelihood"

Bryan Evans
5 years ago
Views:

1 CS 6347 Lecture 13 More Maxiu Likelihood

2 Recap Last tie: Introduction to axiu likelihood estiation MLE for Bayesian networks Optial CPTs correspond to epirical counts Today: MLE for CRFs 2

3 Maxiu Likelihood Estiation Given saples x 1,, x M fro soe unknown distribution with paraeters θ The log-likelihood of the evidence is defined to be log l θ = log p(x θ) Goal: axiize the log-likelihood 3

4 MLE for MRFs Let s copute the MLE for MRFs that factor over the graph G as p x θ = 1 Z(θ) ς C ψ C x C θ The paraeters θ control the allowable potential functions Again, suppose we have saples x 1,, x M fro soe unknown MRF of this for log l θ = C log ψ C x C θ M log Z (θ) 4

5 MLE for MRFs Let s copute the MLE for MRFs that factor over the graph G as p x θ = 1 Z(θ) ς C ψ C x C θ The paraeters θ control the allowable potential functions Again, suppose we have saples x 1,, x M fro soe unknown MRF of this for log l θ = C log ψ C x C θ M log Z (θ) Z(θ) couples all of the potential functions together! Even coputing Z(θ) by itself was a challenging task 5

6 Conditional Rando Fields Learning MRFs is quite restrictive Most real probles are really conditional odels Exaple: iage segentation Represent a segentation proble as a MRF over a two diensional grid Each x i is an binary variable indicating whether or not the pixel is in the foreground or the background How do we incorporate pixel inforation? The potentials over the edge (i, j) of the MRF should depend on x i, x j as well as the pixel inforation at nodes i and j 6

7 Iage Segentation

8 Feature Vectors The pixel inforation is called a feature of the odel Features will consist of ore than just a scalar value (i.e., pixels, at the very least, are vectors of RGBA values) Vector of features y (e.g., one vector of features y i for each i V) We think of the joint probability distribution as a conditional distribution p(x y, θ) This akes MLE even harder Saples are pairs (x 1, y 1 ),, (x M, y M ) The feature vectors can be different for each saple: need to copute Z(θ, y ) in the log-likelihood! 8

9 Log-Linear Models MLE sees daunting for MRFs and CRFs Need a nice way to paraeterize the odel and to deal with features We often assue that the odels are log-linear in the paraeters Many of the odels that we have seen so far can easily be expressed as log-linear odels of the paraeters Feature vectors should also be incorporated in a log-linear way 9

10 Log-Linear Models The potential on the clique C should be a log-linear function of the paraeters where ψ C x C y, θ = exp θ, f C x C, y θ, f C x C, y = θ k f C x C, y k k Here, f is a feature ap that takes the input variables and returns a vector the sae size as θ 10

11 Log-Linear MRFs Over coplete representation: one paraeter for each clique C and choice of x C p x θ = 1 Z C exp(θ C (x C )) f C x C is a 0-1 vector that is indexed by C and x C whose only non-zero coponent corresponds to θ C (x C ) One paraeter per clique p x θ = 1 Z C exp( θ, f C (x C ) ) f C x C is a vector that is indexed ONLY by C whose only nonzero coponent corresponds to θ C 11

12 MLE for Log-Linear Models p x y, θ = 1 Z θ, y C exp θ, f C x C, y log l θ = C θ, f C x C, y log Z(θ, y ) = θ, C f C x C, y log Z(θ, y ) 12

13 MLE for Log-Linear Models p x y, θ = 1 Z θ, y C exp θ, f C x C, y log l θ = C θ, f C x C, y log Z(θ, y ) = θ, C f C x C, y log Z(θ, y ) Linear in θ Depends non-linearly on θ 13

14 Concavity of MLE We will show that log Z(θ, y) is a convex function of θ Fix a distribution q(x y) q x y D(q p = q x y log p x y, θ x = q x y log q(x y) q x y log p x y, θ x x = H(q) q x y log p x y, θ x = H(q) + log Z(θ, y) q x y θ, f C x C, y x C = H(q) + log Z(θ, y) C x C q C x C y θ, f C x C, y 14

15 Concavity of MLE log Z(θ, y) = ax q H(q) + C x C q C x C y θ, f C x C, y If a function g(x, y) is convex in x for each y, then ax g(x, y) is convex in x y Linear in θ As a result, log Z(θ, y) is a convex function of θ for a fixed value of y 15

16 MLE for Log-Linear Models p x y, θ = 1 Z θ, y C exp θ, f C x C, y log l θ = C θ, f C x C, y log Z(θ, y ) = θ, C f C x C, y log Z(θ, y ) Linear in θ Convex in θ 16

17 MLE for Log-Linear Models p x y, θ = 1 Z θ, y C exp θ, f C x C, y log l θ = C θ, f C x C, y log Z(θ, y ) = θ, C f C x C, y log Z(θ, y ) Concave in θ Could optiize it using gradient ascent! (need to copute θ log Z(θ, y)) 17

18 MLE via Gradient Ascent What is the gradient of the log-likelihood with respect to θ? θ log Z(θ, y ) =? (worked out on board) 18

19 MLE via Gradient Ascent What is the gradient of the log-likelihood with respect to θ? θ log Z(θ, y ) = C p C x C y, θ f C x C, y x C This is the expected value of the feature aps under the joint distribution 19

20 MLE via Gradient Ascent What is the gradient of the log-likelihood with respect to θ? θ log l(θ) = C f C x C, y x C p C x C y, θ f C x C, y To copute/approxiate this quantity, we only need to copute/approxiate the arginal distributions p C (x C y, θ) This requires perforing arginal inference on a different odel at each step of gradient ascent! 20

21 Moent Matching Let f x, y = σ C f C x C, y Setting the gradient with respect to θ equal to zero and solving gives f(x, y ) = p x y, θ f x, y x This condition is called oent atching and when the odel is an MRF instead of a CRF this reduces to 1 M f(x ) = p x θ f x x 21

22 Moent Matching As an exaple, consider a log-linear MRF p x = 1 Z C exp(θ C (x C )) That is, f C x C is a vector that is indexed by C and x C whose only non-zero coponent corresponds to θ C (x C ) The oent atching condition becoes 1 1 xc =x C = p C x C θ, for all C, x C M 22

23 Regularization in MLE Recall that we can also incorporate prior inforation about the paraeters into the MLE proble This involved solving an augented MLE p x θ p(θ) What types of priors should we choose for the paraeters? 23

24 Regularization in MLE Recall that we can also incorporate prior inforation about the paraeters into the MLE proble This involved solving an augented MLE p x θ p(θ) What types of priors should we choose for the paraeters? Gaussian prior: p θ exp( 1 2 (θ μ)t Σ 1 (θ μ) T ) Unifor over [0,1] 24

25 Regularization in MLE Recall that we can also incorporate prior inforation about the paraeters into the MLE proble This involved solving an augented MLE p x θ exp( 1 2 θt Dθ T ) What types of priors should we choose for the paraeters? Gaussian prior: p θ exp( 1 2 (θ μ)t Σ 1 (θ μ) T ) Unifor over [0,1] Gaussian prior with a diagonal covariance atrix all of whose entries are equal to λ 25

26 Regularization in MLE Using the previous Gaussian prior yields the following logoptiization proble log p x θ exp( 1 2 θt Dθ T ) = log p(x θ) λ 2 k θ k 2 = log p(x θ) λ 2 θ

27 Regularization in MLE Using the previous Gaussian prior yields the following logoptiization proble log p x θ exp( 1 2 θt Dθ T ) = log p(x θ) λ 2 k θ k 2 = log p(x θ) λ 2 θ 2 2 Known as l 2 regularization 27

28 Regularization l 1 l 2 28

29 Duality and MLE log Z(θ, y) = ax q H(q) + C x C q C x C y θ, f C x C, y log l θ = θ, f C x C, y C log Z(θ, y ) Plugging the first into the second gives: log l θ = θ, f C x C, y C ax q H(q ) + C q C x C y x C θ, f C x C, y 29

30 Duality and MLE ax θ log l θ = ax θ in q 1,,q M θ, C f C x C, y q C x C y f C x C, y x C H(q ) This is called a iniax or saddle-point proble Recall that we ended up with siilar looking optiization probles when we constructed the Lagrange dual function When can we switch the order of the ax and in? The function is linear in theta, so there is an advantage to swapping the order 30

31 Saddle Point Source: Wikipedia 31

32 Sion s Miniax Theore Let X be a copact convex subset of R n and Y be a convex subset of R Let f be a real-valued function on X Y such that then f(x, ) a continuous concave function over Y for each x X f(, y) a continuous convex function over X for each y Y sup y in x f(x, y) = in x sup y f x, y 32

33 Duality and MLE ax θ in q 1,,q M θ, C f C x C, y q C x C y f C x C, y x C H(q ) is equal to in ax q 1,,qM θ θ, C f C x C, y q C x C y f C x C, y x C H(q ) Solve for θ? 33

34 Maxiu Entropy ax q 1,,q M H(q ) such that the oent atching condition is satisfied f(x, y ) = q x y f x, y and q 1,, q are discrete probability distributions Instead of axiizing the log-likelihood, we could axiize the entropy over all approxiating distributions that satisfy the oent atching condition x 34

35 MLE in Practice We can copute the partition function in linear tie over trees using belief propagation We can use this to learn the paraeters of tree-structured odels What if the graph isn t a tree? Use variable eliination to copute the partition function (exact but slow) Use iportance sapling to approxiate the partition function (can also be quite slow; aybe only use a few saples?) Use loopy belief propagation to approxiate the partition function (can be bad if loopy BP doesn t converge quickly) 35

36 MLE in Practice Practical wisdo: If you are trying to perfor soe prediction task (i.e., MAP inference to do prediction), then it is better to learn the wrong odel Learning and prediction should use the sae approxiations What people actually do: Use a few iterations of loopy BP or sapling to approxiate the arginals Approxiate arginals give approxiate gradients (recall that the gradient only depended on the arginals) Perfor approxiate gradient descent and hope it works 36

37 MLE in Practice Other options Replace the true entropy with the Bethe entropy and solve the approxiate dual proble Use fancier optiization techniques to solve the proble faster e.g., the ethod of conditional gradients 37

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians

Using EM To Estimate A Probablity Density With A Mixture Of Gaussians Using EM To Estiate A Probablity Density With A Mixture Of Gaussians Aaron A. D Souza adsouza@usc.edu Introduction The proble we are trying to address in this note is siple. Given a set of data points