Learning Markov Networks. Presented by: Mark Berlin, Barak Gross



1 Learning Markov Networks Presented by: Mark Berlin, Barak Gross

2 Introduction. "We shall begin, perhaps." Eugene Onegin, Chapter VI. "Off did he take, I followed at his heels." Inferno, Canto II

3 Reminder. Until now we considered Bayesian Networks: intuitive, easily decomposable, local. Markov Networks are a wholly different story.

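4 Example. Consider the simple Markov Network A - B - C. Factors: $\phi_1(A,B)$, $\phi_2(B,C)$. Partition function: $Z = \sum_{a,b,c} \phi_1(a,b)\,\phi_2(b,c)$. Log-likelihood: $\ell(\theta : \mathcal{D}) = \sum_m \big( \ln \phi_1(a[m],b[m]) + \ln \phi_2(b[m],c[m]) - \ln Z \big) = \sum_{a,b} M[a,b] \ln \phi_1(a,b) + \sum_{b,c} M[b,c] \ln \phi_2(b,c) - M \ln Z$. $\ln Z$ couples the factors together and precludes decomposition. Transition to a Bayesian representation is clearly not a panacea, as the MN and BN models are not equivalent.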
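The chain example can be checked numerically. The sketch below assumes binary variables and made-up factor tables (`phi1`, `phi2` and the data set are hypothetical, chosen only for illustration); it enumerates all assignments to obtain $Z$ and evaluates the log-likelihood term by term.

```python
import itertools
import math

# Hypothetical factor tables for the chain A - B - C (binary variables).
phi1 = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}  # phi1(A, B)
phi2 = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}  # phi2(B, C)


def partition_function(phi1, phi2):
    """Z = sum over all (a, b, c) of phi1(a, b) * phi2(b, c)."""
    return sum(phi1[a, b] * phi2[b, c]
               for a, b, c in itertools.product((0, 1), repeat=3))


def log_likelihood(data, phi1, phi2):
    """Sum over samples of ln phi1 + ln phi2 - ln Z."""
    log_z = math.log(partition_function(phi1, phi2))
    return sum(math.log(phi1[a, b]) + math.log(phi2[b, c]) - log_z
               for a, b, c in data)


# Hypothetical fully observed samples (a, b, c).
data = [(0, 0, 0), (1, 1, 1), (0, 0, 1), (1, 1, 1)]
Z = partition_function(phi1, phi2)
ll = log_likelihood(data, phi1, phi2)
```

Changing a single entry of `phi1` changes `Z`, and therefore the contribution of every sample, which is the coupling the slide refers to.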

5 Programme. Maximum Likelihood (spoiler: bad news ahead). Dealing with Missing Data. Alternative Objectives: Pseudo-Likelihood, Contrastive Divergence.

6 Max likelihood. "Who controls the past controls the future." George Orwell

7 Log-linear models. Let us now focus on the particular case where the features $f$ are indicators, i.e. of the form $f_{a^0,b^0}(a,b) = \mathbf{1}\{a = a^0\}\,\mathbf{1}\{b = b^0\}$. The weights become the (log) factor values we look for, of the form $\theta_{a^0,b^0} = \ln \phi(a^0,b^0)$.

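8 Log likelihood. The log-likelihood thus becomes: $\ell(\theta : \mathcal{D}) = \sum_m \ln P(\xi[m] \mid \theta) = \sum_i \theta_i \Big( \sum_m f_i(\xi[m]) \Big) - M \ln Z(\theta)$. $M$ is the number of samples, and each feature $f_i$ is evaluated on the values that $\xi[m]$ assigns to its scope, a set of variables which is a clique in the Markov Network.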

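9 Log likelihood, cont. $\ell(\theta : \mathcal{D}) = \sum_i \theta_i \Big( \sum_m f_i(\xi[m]) \Big) - M \ln Z(\theta)$. The first term is linear in $\theta$. What about $\ln Z(\theta)$?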

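10 Deriving $\ln Z(\theta)$. With $Z(\theta) = \sum_\xi \exp\{\sum_j \theta_j f_j(\xi)\}$: $\frac{\partial}{\partial \theta_i} \ln Z(\theta) = \frac{1}{Z(\theta)} \sum_\xi \frac{\partial}{\partial \theta_i} \exp\{\sum_j \theta_j f_j(\xi)\} = \frac{1}{Z(\theta)} \sum_\xi f_i(\xi) \exp\{\sum_j \theta_j f_j(\xi)\} = E_\theta[f_i]$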

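11 Deriving $\ln Z(\theta)$, cont. $\frac{\partial^2}{\partial \theta_i \partial \theta_j} \ln Z(\theta) = \frac{\partial}{\partial \theta_j} \Big[ \frac{1}{Z(\theta)} \sum_\xi f_i(\xi) \exp\{\sum_k \theta_k f_k(\xi)\} \Big] = E_\theta[f_i f_j] - E_\theta[f_i]\,E_\theta[f_j] = \mathrm{Cov}_\theta[f_i; f_j]$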

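12 Log likelihood, cont. We have derived that the Hessian of $\ln Z(\theta)$ has elements of the form $\frac{\partial^2}{\partial \theta_i \partial \theta_j} \ln Z(\theta) = \mathrm{Cov}_\theta[f_i; f_j]$. A covariance matrix is always positive semidefinite. Therefore, $\ln Z(\theta)$ is convex.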
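The two derivative identities for $\ln Z(\theta)$ can be sanity-checked with finite differences. The sketch below assumes a toy log-linear model over two binary variables with three arbitrary features (the feature set and the parameter vector are hypothetical); it compares numerical derivatives against the expectation and covariance computed by enumeration.

```python
import itertools
import math

STATES = list(itertools.product((0, 1), repeat=2))  # assignments (a, b)
FEATURES = [
    lambda x: float(x[0] == 1),     # f_0: A = 1
    lambda x: float(x[1] == 1),     # f_1: B = 1
    lambda x: float(x[0] == x[1]),  # f_2: A agrees with B
]


def log_z(theta):
    """ln Z(theta) by brute-force enumeration."""
    return math.log(sum(
        math.exp(sum(t * f(x) for t, f in zip(theta, FEATURES)))
        for x in STATES))


def expectation(theta, g):
    """E_theta[g] under the log-linear model."""
    lz = log_z(theta)
    return sum(
        math.exp(sum(t * f(x) for t, f in zip(theta, FEATURES)) - lz) * g(x)
        for x in STATES)


def covariance(theta, i, j):
    """Cov_theta[f_i; f_j] = E[f_i f_j] - E[f_i] E[f_j]."""
    fi, fj = FEATURES[i], FEATURES[j]
    return (expectation(theta, lambda x: fi(x) * fj(x))
            - expectation(theta, fi) * expectation(theta, fj))


def shifted(theta, i, eps):
    out = list(theta)
    out[i] += eps
    return out


def numeric_grad(theta, i, eps=1e-5):
    """Central difference of ln Z with respect to theta_i."""
    return (log_z(shifted(theta, i, eps))
            - log_z(shifted(theta, i, -eps))) / (2 * eps)


def numeric_hessian(theta, i, j, eps=1e-4):
    """Central difference of E_theta[f_i] with respect to theta_j."""
    return (expectation(shifted(theta, j, eps), FEATURES[i])
            - expectation(shifted(theta, j, -eps), FEATURES[i])) / (2 * eps)


theta = [0.3, -0.7, 1.1]  # arbitrary test point
```

Brute-force enumeration is used only because the model is tiny; it stands in for whatever inference procedure a real network would need.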

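13 Log likelihood, cont. $\frac{1}{M} \ell(\theta : \mathcal{D}) = \sum_i \theta_i E_{\mathcal{D}}[f_i] - \ln Z(\theta)$, where $E_{\mathcal{D}}[f_i] = \frac{1}{M} \sum_m f_i(\xi[m])$ is the empirical expectation. The log-likelihood function is a sum of a linear element and a concave element ($-\ln Z(\theta)$), so it is concave. Therefore, it has no local optima other than global ones: every maximum is global (albeit the maximizing $\theta$ may be non-unique).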

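14 Max likelihood. $\frac{1}{M} \ell(\theta : \mathcal{D}) = \sum_i \theta_i E_{\mathcal{D}}[f_i] - \ln Z(\theta)$. Apparently, what remains is to compute the gradient and find its zeroes: $\frac{\partial}{\partial \theta_i} \frac{1}{M} \ell(\theta : \mathcal{D}) = E_{\mathcal{D}}[f_i] - E_\theta[f_i]$. Meaning the maximum is attained when $E_{\mathcal{D}}[f_i] = E_\theta[f_i]$ for all $i$.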

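15 Max likelihood, cont. Theorem 20.1: $\hat\theta$ is a maximum-likelihood parameter assignment if and only if $E_{\mathcal{D}}[f_i] = E_{\hat\theta}[f_i]$ for all $i$.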

16 Max likelihood, cont. Good news: the maximum value is unique (there are no other local maxima). It is attained when the empirical expectation of every feature matches its expectation under the model. Bad news: in general it is impossible to compute analytically. Gradient ascent to the rescue.

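17 Max likelihood, cont. Gradient ascent to the rescue. Guaranteed to converge to the maximum. $\frac{\partial}{\partial \theta_i} \frac{1}{M} \ell(\theta : \mathcal{D}) = E_{\mathcal{D}}[f_i] - E_\theta[f_i]$. How to compute?

18 Max likelihood: Gradient ascent. $\frac{\partial}{\partial \theta_i} \frac{1}{M} \ell(\theta : \mathcal{D}) = E_{\mathcal{D}}[f_i] - E_\theta[f_i]$. The first element, for indicator features, is the empirical frequency of the relevant event.

19 Max likelihood: Gradient ascent. $\frac{\partial}{\partial \theta_i} \frac{1}{M} \ell(\theta : \mathcal{D}) = E_{\mathcal{D}}[f_i] - E_\theta[f_i]$. The second element, for indicator features, is of the form $P_\theta(a^0, b^0)$, computed using inference on the graph. A single inference pass yields all the probabilities.

20 Max likelihood: Gradient ascent. $\frac{\partial}{\partial \theta_i} \frac{1}{M} \ell(\theta : \mathcal{D}) = E_{\mathcal{D}}[f_i] - E_\theta[f_i]$. The second element, for indicator features, is of the form $P_\theta(a^0, b^0)$, computed using inference on the graph. But such computation is to be performed afresh for each step of the ascent!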
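The gradient-ascent loop can be sketched for the chain A - B - C with one indicator feature per factor entry. Everything here is an illustrative assumption: the data set, the fixed step size and iteration count, and the use of brute-force enumeration in place of real inference on the graph.

```python
import itertools
import math

STATES = list(itertools.product((0, 1), repeat=3))  # assignments (a, b, c)

# One indicator feature per entry of phi1(A, B) and phi2(B, C).
FEATURES = ([("AB", a, b) for a in (0, 1) for b in (0, 1)]
            + [("BC", b, c) for b in (0, 1) for c in (0, 1)])


def feature_value(feat, x):
    kind, u, v = feat
    a, b, c = x
    return float((a, b) == (u, v)) if kind == "AB" else float((b, c) == (u, v))


def model_expectations(theta):
    """E_theta[f_i] by enumeration (a stand-in for inference on the graph)."""
    scores = [math.exp(sum(t * feature_value(f, x)
                           for t, f in zip(theta, FEATURES)))
              for x in STATES]
    z = sum(scores)
    return [sum(s / z * feature_value(f, x) for s, x in zip(scores, STATES))
            for f in FEATURES]


def empirical_expectations(data):
    """E_D[f_i]: for indicators, the empirical frequency of each event."""
    return [sum(feature_value(f, x) for x in data) / len(data)
            for f in FEATURES]


def fit(data, step=0.5, iters=2000):
    theta = [0.0] * len(FEATURES)
    e_data = empirical_expectations(data)        # computed once
    for _ in range(iters):
        e_model = model_expectations(theta)      # inference, redone every step
        theta = [t + step * (d - m)
                 for t, d, m in zip(theta, e_data, e_model)]
    return theta


# Hypothetical fully observed samples (a, b, c).
DATA = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 0, 0),
        (1, 1, 0), (1, 1, 1), (1, 1, 1), (0, 0, 0)]
theta_hat = fit(DATA)
```

At the end of the run, `model_expectations(theta_hat)` should agree with `empirical_expectations(DATA)` up to numerical tolerance, which is the moment-matching condition for the maximum. The empirical side is computed once, while the model side is recomputed in every iteration.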

21 Max likelihood = Max Entropy. Idea: we want to reconstruct the distribution corresponding to the given empirical expectations of the features. We demand maximal entropy as a sign that we add no other constraints (no more information).

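22 Max likelihood = Max Entropy. Formally: find $Q(\mathcal{X})$ maximizing $H_Q(\mathcal{X})$ subject to $E_Q[f_i] = E_{\mathcal{D}}[f_i]$, $i = 1, \dots, k$. Turns out that $Q^* = P_{\hat\theta}$, the distribution given by the MLE (Theorem 20.2).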

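23 Max Likelihood = Max Entropy. Formally: find $Q(\mathcal{X})$ maximizing $H_Q(\mathcal{X})$ subject to $E_Q[f_i] = E_{\mathcal{D}}[f_i]$, $i = 1, \dots, k$. The distribution given by the MLE clearly satisfies the constraints. Now we need to prove it maximizes the entropy. Point of notation: we denote the MLE distribution by $P_{\hat\theta}$.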

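24 Max Likelihood = Max Entropy. Let $Q$ be another distribution adhering to the same constraints. Then: $E_{P_{\hat\theta}}[\ln P_{\hat\theta}] = E_{P_{\hat\theta}}\big[\sum_i \hat\theta_i f_i - \ln Z(\hat\theta)\big] = \sum_i \hat\theta_i E_{P_{\hat\theta}}[f_i] - \ln Z(\hat\theta) = \sum_i \hat\theta_i E_Q[f_i] - \ln Z(\hat\theta) = E_Q\big[\sum_i \hat\theta_i f_i - \ln Z(\hat\theta)\big] = E_Q[\ln P_{\hat\theta}]$

25 Max Likelihood = Max Entropy. Let $Q$ be another distribution adhering to the same constraints. Then: $H_{P_{\hat\theta}}(\mathcal{X}) - H_Q(\mathcal{X}) = -E_{P_{\hat\theta}}[\ln P_{\hat\theta}] + E_Q[\ln Q] = -E_Q[\ln P_{\hat\theta}] + E_Q[\ln Q] = D(Q \,\|\, P_{\hat\theta}) > 0$ for $Q \neq P_{\hat\theta}$. Meaning, $H_{P_{\hat\theta}} > H_Q$, Q.E.D.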

26 MLE Prior. Reminder: MLE by itself is prone to overfitting. Therefore, a prior distribution is taken in order to bias the solution toward a prior model.

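27 MLE Prior. Two priors. Gaussian ($L_2$): $P(\theta \mid \sigma^2) = \prod_i \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big\{-\frac{\theta_i^2}{2\sigma^2}\Big\}$. Laplacian ($L_1$): $P(\theta \mid \beta) = \prod_i \frac{1}{2\beta} \exp\Big\{-\frac{|\theta_i|}{\beta}\Big\}$.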

28 MLE Prior. Idea: both priors penalize too large $\theta_i$, since we do not want to assume too much dependence on a single feature. Gaussian prior: penalizes large values more, but there is no incentive to get to 0. Laplacian prior: exactly the opposite, resulting in sparser constructions.

29 MLE Prior. Note: in log form, both priors are concave. Therefore, they can be added to $\ell(\theta : \mathcal{D})$ with no need to change the algorithm. The parameters ($\sigma^2$, $\beta$) regulating the width of the prior(s) reflect how important it is for us to drive the $\theta_i$ to 0. Method for selection: cross-validation (select values, run on part of the data, check vs. the remaining data).
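As a small illustration of why the algorithm does not change, the log-priors simply contribute an extra term to the gradient. The sketch below assumes the standard Gaussian and Laplacian densities with hypothetical hyperparameters `sigma2` and `beta`; at $\theta_i = 0$ the Laplacian term is not differentiable, and choosing the subgradient 0 there is an assumption of this sketch.

```python
import math


def log_gaussian_prior(theta, sigma2):
    """ln of prod_i N(theta_i; 0, sigma2)."""
    return sum(-t * t / (2 * sigma2) - 0.5 * math.log(2 * math.pi * sigma2)
               for t in theta)


def log_laplacian_prior(theta, beta):
    """ln of prod_i (1 / (2 beta)) exp(-|theta_i| / beta)."""
    return sum(-abs(t) / beta - math.log(2 * beta) for t in theta)


def grad_log_gaussian_prior(theta, sigma2):
    # The pull toward 0 grows with |theta_i| and vanishes near 0.
    return [-t / sigma2 for t in theta]


def grad_log_laplacian_prior(theta, beta):
    # Constant pull toward 0 regardless of magnitude (subgradient 0 at 0).
    return [-math.copysign(1.0, t) / beta if t != 0 else 0.0 for t in theta]


def map_gradient(likelihood_grad, prior_grad):
    """Gradient of (log-likelihood + log-prior): just the sum of the two."""
    return [g + p for g, p in zip(likelihood_grad, prior_grad)]
```

For example, `map_gradient(grad, grad_log_gaussian_prior(theta, 4.0))` can replace the plain likelihood gradient `grad` in an ascent loop, provided `grad` is the gradient of the unnormalized log-likelihood so that both terms are on the same scale.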

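30 MLE Conjugate Prior. The likelihood has the form $P(\mathcal{D} \mid \theta) = \exp\Big\{ M \Big( \sum_i \theta_i E_{\mathcal{D}}[f_i] - \ln Z(\theta) \Big) \Big\}$. In order for the posterior to be of the same form, the prior has to be $P(\theta) \propto \exp\Big\{ M_0 \Big( \sum_i \theta_i \alpha_i - \ln Z(\theta) \Big) \Big\}$, which might be construed as prior data of size $M_0$ yielding $\alpha_i$ as the expected value per feature.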

31 Missing Data. "All the business of war is to endeavour to find out what you don't know by what you do." The Duke of Wellington

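32 MLE with missing data. As previously, let us denote: $o[m]$: vector of observed values in the $m$-th sample; $h[m]$: vector of hidden values in the $m$-th sample. The log-likelihood then becomes: $\frac{1}{M} \ell(\theta : \mathcal{D}) = \frac{1}{M} \sum_m \ln \sum_{h[m]} P(o[m], h[m] \mid \theta) = \frac{1}{M} \sum_m \ln \sum_{h[m]} \tilde P(o[m], h[m] \mid \theta) - \ln Z(\theta)$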

33 MLE with missing data, cont. Let us have a closer look at the term $\sum_{h[m]} \tilde P(o[m], h[m] \mid \theta)$. It has the form of a partition function: it is the partition function on the reduction of the original network by the observation $o[m]$, and thus adheres to the derivation of $\ln Z(\theta)$ presented previously.

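34 MLE with missing data, cont. The derivand is a partition function, and thus: $\frac{\partial}{\partial \theta_i} \ln \sum_{h[m]} \tilde P(o[m], h[m] \mid \theta) = E_{h[m] \sim P(H[m] \mid o[m], \theta)}[f_i]$. Leading to the conclusion that the gradient of the log-likelihood is $\frac{\partial}{\partial \theta_i} \frac{1}{M} \ell(\theta : \mathcal{D}) = \frac{1}{M} \sum_m E_{h[m] \sim P(H[m] \mid o[m], \theta)}[f_i] - E_\theta[f_i]$

35 MLE with missing data, cont. $\frac{\partial}{\partial \theta_i} \frac{1}{M} \ell(\theta : \mathcal{D}) = \frac{1}{M} \sum_m E_{h[m] \sim P(H[m] \mid o[m], \theta)}[f_i] - E_\theta[f_i]$. The second term, as before, requires an inference computation. The first term requires a computation of inference on the reduced network per instance $m$, for a single iteration of the ascent.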
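The missing-data gradient can be sketched on the chain A - B - C, with `None` marking a hidden value. The data set, the feature encoding and the brute-force "inference" are all assumptions made for illustration; the point is only to show the M clamped computations plus one unclamped computation per gradient evaluation.

```python
import itertools
import math

STATES = list(itertools.product((0, 1), repeat=3))  # assignments (a, b, c)
# Indicator features (i, j, u, v): fires when x[i] == u and x[j] == v.
FEATURES = [(i, j, u, v) for i, j in ((0, 1), (1, 2))
            for u in (0, 1) for v in (0, 1)]


def f(feat, x):
    i, j, u, v = feat
    return float(x[i] == u and x[j] == v)


def score(theta, x):
    """Unnormalized measure P~(x | theta)."""
    return math.exp(sum(t * f(ft, x) for t, ft in zip(theta, FEATURES)))


def completions(obs):
    """All full assignments that agree with the observed entries of obs."""
    return [x for x in STATES
            if all(o is None or o == xi for xi, o in zip(x, obs))]


def expectations(theta, obs=(None, None, None)):
    """E[f_i] under P(H | o, theta); with no evidence this is E_theta[f_i]."""
    states = completions(obs)
    weights = [score(theta, x) for x in states]
    z = sum(weights)
    return [sum(w / z * f(ft, x) for w, x in zip(weights, states))
            for ft in FEATURES]


def avg_log_likelihood(theta, data):
    log_z = math.log(sum(score(theta, x) for x in STATES))
    return sum(math.log(sum(score(theta, x) for x in completions(o)))
               for o in data) / len(data) - log_z


def gradient(theta, data):
    clamped = [expectations(theta, o) for o in data]  # M inference calls
    free = expectations(theta)                        # plus 1 inference call
    return [sum(e[i] for e in clamped) / len(data) - free[i]
            for i in range(len(FEATURES))]


# Hypothetical partially observed samples; None marks a hidden value.
DATA = [(0, None, 0), (1, 1, None), (None, 0, 1), (1, 1, 1), (0, 0, 0)]
```

Calling `gradient(theta, DATA)` for any parameter vector `theta` of length 8 returns the per-feature gradient; a finite-difference comparison against `avg_log_likelihood` is a convenient way to check it.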

36 Expectation Maximization. EM: an alternative approach. Efficient for BNs (previous lecture). Main idea: bootstrapping. Start with some initial $\theta$. Compute the corresponding distribution for the missing data $H$. Based on the full data $\langle \mathcal{D}, H \rangle$, compute a new $\theta$. Repeat until convergence.

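37 Expectation Maximization, cont. For BNs: assess the probabilities for each family assignment $x, \mathbf{u}$. Based on these probabilities, compute: $\theta^{t+1}_{x \mid \mathbf{u}} = \bar M_{\theta^t}[x, \mathbf{u}] \,/\, \bar M_{\theta^t}[\mathbf{u}]$, where $\bar M_{\theta^t}$ denotes expected counts.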

38 Expectation Maximization, cont. For MNs: assess the expectation of each feature (E-step): $\bar E[f_i] = \frac{1}{M} \sum_m E_{h[m] \sim P(H[m] \mid o[m], \theta^t)}[f_i]$. Done using inference per each $m$. Compute the next $\theta$ based on that. How?

39 Expectation Maximization, cont. Computing an optimal $\theta$ from a set of full data in a MN is done using gradient ascent, involving running inference multiple times for a single iteration of the EM.

40 Missing Data: GA vs. EM. In both methods, we need inference. In GA: $M+1$ times per step. In EM: ($M$ times + 1 time per step of the inner GA) per step of the algorithm.

41 Alternative Objectives. "A clever man goes not over a mountain, but rather around it." A Russian proverb

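42 Log likelihood revisited. Reminder: $\frac{1}{M} \ell(\theta : \mathcal{D}) = \frac{1}{M} \sum_m \ln \tilde P(\xi[m] \mid \theta) - \ln \sum_\xi \tilde P(\xi \mid \theta)$. Interpretation: we strive to increase the difference between the log-measures of the data and the aggregate of all instances. Problem: the aggregate (second term, $\ln Z(\theta)$) sums over exponentially many instances. Idea: define a simpler objective.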

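43 Pseudolikelihood. Consider the probability of a single instance $\xi = (x_1, \dots, x_n)$. By the chain rule, $P(\xi) = \prod_{j=1}^{n} P(x_j \mid x_1, \dots, x_{j-1})$. Approximate each factor by conditioning on all the other variables: $P(\xi) \approx \prod_j P(x_j \mid x_1, \dots, x_{j-1}, x_{j+1}, \dots, x_n) = \prod_j P(x_j \mid x_{-j})$.

44 Pseudolikelihood. $P(\xi) \approx \prod_j P(x_j \mid x_{-j})$. From here we can derive the pseudo-likelihood: $\ell_{PL}(\theta : \mathcal{D}) = \frac{1}{M} \sum_m \sum_j \ln P(x_j[m] \mid x_{-j}[m], \theta)$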

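45 Pseudolikelihood. $\ell_{PL}(\theta : \mathcal{D}) = \frac{1}{M} \sum_m \sum_j \ln P(x_j[m] \mid x_{-j}[m], \theta)$. The main advantage: it is easier to compute, as there is less coupling and summation, since $P(x_j \mid x_{-j}) = \frac{P(x_j, x_{-j})}{P(x_{-j})} = \frac{\tilde P(x_j, x_{-j})}{\sum_{x_j'} \tilde P(x_j', x_{-j})}$. The number of elements to sum is $|Val(X_j)|$, which is clearly sub-exponential.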

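46 Pseudolikelihood. Let us now further analyze the summands: $\ln P(x_j \mid x_{-j}) = \ln \tilde P(x_j, x_{-j}) - \ln \sum_{x_j'} \tilde P(x_j', x_{-j}) = \sum_{i : X_j \in Scope[f_i]} \theta_i f_i(x_j, x_{-j}) - \ln \sum_{x_j'} \exp\Big\{ \sum_{i : X_j \in Scope[f_i]} \theta_i f_i(x_j', x_{-j}) \Big\}$

47 Pseudolikelihood. $\ln P(x_j \mid x_{-j}) = \sum_{i : X_j \in Scope[f_i]} \theta_i f_i(x_j, x_{-j}) - \ln \sum_{x_j'} \exp\Big\{ \sum_{i : X_j \in Scope[f_i]} \theta_i f_i(x_j', x_{-j}) \Big\}$. The expression above is the log-likelihood of a MN over $X_j$ conditioned on the rest. Meaning it is concave. And the pseudo-likelihood, being the sum of such terms, is concave as well! Gradient Ascent again.

48 Pseudolikelihood. $\ln P(x_j \mid x_{-j}) = \sum_{i : X_j \in Scope[f_i]} \theta_i f_i(x_j, x_{-j}) - \ln \sum_{x_j'} \exp\Big\{ \sum_{i : X_j \in Scope[f_i]} \theta_i f_i(x_j', x_{-j}) \Big\}$, hence $\frac{\partial}{\partial \theta_i} \ln P(x_j \mid x_{-j}) = f_i(x_j, x_{-j}) - \frac{\sum_{x_j'} f_i(x_j', x_{-j}) \exp\{ \sum_k \theta_k f_k(x_j', x_{-j}) \}}{\sum_{x_j'} \exp\{ \sum_k \theta_k f_k(x_j', x_{-j}) \}} = f_i(x_j, x_{-j}) - E_{x_j' \sim P_\theta(X_j \mid x_{-j})}[f_i(x_j', x_{-j})]$. Note: when $X_j \notin Scope[f_i]$, it does not affect the value of the feature, and so the expression becomes 0.

49 Pseudolikelihood. $\frac{\partial}{\partial \theta_i} \ln P(x_j \mid x_{-j}) = f_i(x_j, x_{-j}) - E_{x_j' \sim P_\theta(X_j \mid x_{-j})}[f_i(x_j', x_{-j})]$, so $\frac{\partial}{\partial \theta_i} \ell_{PL}(\theta : \mathcal{D}) = \sum_{j : X_j \in Scope[f_i]} \Big( \frac{1}{M} \sum_m f_i(\xi[m]) - \frac{1}{M} \sum_m E_{x_j' \sim P_\theta(X_j \mid x_{-j}[m])}[f_i(x_j', x_{-j}[m])] \Big)$. The computation is much simpler: all expectations (summations) are local. Finding the maximum of the PL is tractable. But does it do us any good?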
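The locality of the pseudo-likelihood can be seen in a short sketch on the chain A - B - C with indicator features. The data set and feature encoding are illustrative assumptions. For brevity `score` sums over all features rather than only those whose scope contains $X_j$; the remaining ones cancel in the ratio, so the result is the same.

```python
import math

# Indicator features (i, j, u, v): fires when x[i] == u and x[j] == v.
FEATURES = [(i, j, u, v) for i, j in ((0, 1), (1, 2))
            for u in (0, 1) for v in (0, 1)]


def f(feat, x):
    i, j, u, v = feat
    return float(x[i] == u and x[j] == v)


def score(theta, x):
    """Unnormalized measure P~(x | theta)."""
    return math.exp(sum(t * f(ft, x) for t, ft in zip(theta, FEATURES)))


def flip(x, j, v):
    """Copy of x with variable j set to v."""
    return x[:j] + (v,) + x[j + 1:]


def conditional(theta, x, j, v):
    """P(X_j = v | x_{-j}, theta): only |Val(X_j)| = 2 terms are summed."""
    weights = [score(theta, flip(x, j, u)) for u in (0, 1)]
    return weights[v] / sum(weights)


def pseudo_log_likelihood(theta, data):
    return sum(math.log(conditional(theta, x, j, x[j]))
               for x in data for j in range(3)) / len(data)


def pl_gradient(theta, data):
    grad = [0.0] * len(FEATURES)
    for x in data:
        for j in range(3):
            for i, ft in enumerate(FEATURES):
                expected = sum(conditional(theta, x, j, v) * f(ft, flip(x, j, v))
                               for v in (0, 1))
                # Zero automatically when X_j is outside the scope of f_i.
                grad[i] += (f(ft, x) - expected) / len(data)
    return grad


# Hypothetical fully observed samples (a, b, c).
DATA = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
```

No global partition function appears anywhere: each conditional needs only a sum over the values of one variable.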

50 Pseudolikelihood. Theorem (20.3): Assume the data is generated by a log-linear model $P_{\theta^*}$ of the form described previously. Then, as $M$ goes to infinity, $\theta^*$ is the argument for the PL global optimum, with probability approaching 1. Idea of proof: show the gradient is 0 at $\theta^*$ (why is that sufficient? Because the PL is concave, a stationary point is a global maximum).

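51 Pseudolikelihood. $\frac{\partial}{\partial \theta_i} \ell_{PL}(\theta : \mathcal{D}) = \sum_{j : X_j \in Scope[f_i]} \Big( \frac{1}{M} \sum_m f_i(\xi[m]) - \frac{1}{M} \sum_m E_{x_j' \sim P_\theta(X_j \mid x_{-j}[m])}[f_i(x_j', x_{-j}[m])] \Big)$. The first term is the empirical expectancy, which as $M \to \infty$, goes to $E_{\theta^*}[f_i]$.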

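52 Pseudolikelihood. $\frac{\partial}{\partial \theta_i} \ell_{PL}(\theta : \mathcal{D}) = \sum_{j : X_j \in Scope[f_i]} \Big( \frac{1}{M} \sum_m f_i(\xi[m]) - \frac{1}{M} \sum_m E_{x_j' \sim P_\theta(X_j \mid x_{-j}[m])}[f_i(x_j', x_{-j}[m])] \Big)$. The second term is: $\frac{1}{M} \sum_m E_{x_j' \sim P_\theta(X_j \mid x_{-j}[m])}[f_i(x_j', x_{-j}[m])] = \sum_{x_{-j}} P_{\mathcal{D}}(x_{-j}) \sum_{x_j'} P_\theta(x_j' \mid x_{-j})\, f_i(x_j', x_{-j})$

53 Pseudolikelihood. Continuing with the second term, as $M \to \infty$ (so that $P_{\mathcal{D}} \to P_{\theta^*}$) and at $\theta = \theta^*$: $\sum_{x_{-j}} P_{\theta^*}(x_{-j}) \sum_{x_j'} P_{\theta^*}(x_j' \mid x_{-j})\, f_i(x_j', x_{-j}) = \sum_{x_j', x_{-j}} P_{\theta^*}(x_j', x_{-j})\, f_i(x_j', x_{-j}) = E_{\theta^*}[f_i]$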

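54 Pseudolikelihood. $\frac{\partial}{\partial \theta_i} \ell_{PL}(\theta : \mathcal{D}) = \sum_{j : X_j \in Scope[f_i]} \Big( \frac{1}{M} \sum_m f_i(\xi[m]) - \frac{1}{M} \sum_m E_{x_j' \sim P_\theta(X_j \mid x_{-j}[m])}[f_i(x_j', x_{-j}[m])] \Big)$. The first term, as $M \to \infty$, goes to $E_{\theta^*}[f_i]$. The second term, as $M \to \infty$ and $\theta = \theta^*$, goes to $E_{\theta^*}[f_i]$. Ergo: as $M \to \infty$, the gradient at $\theta = \theta^*$ is 0, QED.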

55 Pseudolikelihood, concluded. An alternative objective to the MLE. Tractable. The same as MLE as the data size increases. But! A large data sample is required for the PL to reflect the real MLE.

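56 Contrastive Divergence. Main idea: increase the difference between the log-probability of the observed data and some other value, representing the world. Global Partition Function (MLE). Single-Variable Partition Function (PL). Log-Probability of Perturbed Data (CD).

57 Contrastive Divergence. CD is about the difference between the original data set $\mathcal{D}$ and a perturbed data set $\mathcal{D}^-$. Formally: $\ell_{CD}(\theta : \mathcal{D} \,\|\, \mathcal{D}^-) = E_{\xi \sim \hat P_{\mathcal{D}}}[\ln \tilde P_\theta(\xi)] - E_{\xi \sim \hat P_{\mathcal{D}^-}}[\ln \tilde P_\theta(\xi)]$, the difference between the two empirical expectations of the (unnormalized) log-probability.

58 Contrastive Divergence. CD is about the difference between the original data set and a perturbed data set. How to choose this data set? The contrasting data set needs to represent a data sample characteristic of the current $\theta$, so that we strive to increase the probability of the original sampled data relative to the current result, which serves as the contrast in this iterative process.

59 Contrastive Divergence. The contrasting data set needs to represent a data sample characteristic of the current $\theta$. How? Gibbs Sampling starting from the data instances in $\mathcal{D}$. Long sampling until convergence: too expensive. A short chain is good enough and yields better convergence.

60 Contrastive Divergence. How do we compute the optimum? Gradient Ascent yet again. Since $\ln \tilde P_\theta(\xi) = \sum_i \theta_i f_i(\xi)$: $\frac{\partial}{\partial \theta_i} \ell_{CD}(\theta : \mathcal{D} \,\|\, \mathcal{D}^-) = \frac{\partial}{\partial \theta_i} \Big( E_{\xi \sim \hat P_{\mathcal{D}}}[\ln \tilde P_\theta(\xi)] - E_{\xi \sim \hat P_{\mathcal{D}^-}}[\ln \tilde P_\theta(\xi)] \Big) = E_{\xi \sim \hat P_{\mathcal{D}}}[f_i(\xi)] - E_{\xi \sim \hat P_{\mathcal{D}^-}}[f_i(\xi)]$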

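61 Contrastive Divergence. $\frac{\partial}{\partial \theta_i} \ell_{CD}(\theta : \mathcal{D} \,\|\, \mathcal{D}^-) = E_{\xi \sim \hat P_{\mathcal{D}}}[f_i(\xi)] - E_{\xi \sim \hat P_{\mathcal{D}^-}}[f_i(\xi)]$. Note: as the sampling chain grows longer, the elements of the gradient converge to the gradient of the max log-likelihood. At the limit of the Markov chain, the CD converges to the actual MLE.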
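A short-chain version of this procedure can be sketched as follows. It is a minimal illustration under stated assumptions: the chain A - B - C with indicator features, a made-up data set, one Gibbs sweep per data instance (`k=1`), and an arbitrary step size, iteration count and random seed. None of these choices come from the slides.

```python
import math
import random

# Indicator features (i, j, u, v): fires when x[i] == u and x[j] == v.
FEATURES = [(i, j, u, v) for i, j in ((0, 1), (1, 2))
            for u in (0, 1) for v in (0, 1)]


def f(feat, x):
    i, j, u, v = feat
    return float(x[i] == u and x[j] == v)


def score(theta, x):
    """Unnormalized measure P~(x | theta)."""
    return math.exp(sum(t * f(ft, x) for t, ft in zip(theta, FEATURES)))


def gibbs_sweep(theta, x, rng):
    """Resample each variable once from P(X_j | x_{-j}, theta)."""
    x = tuple(x)
    for j in range(3):
        w0 = score(theta, x[:j] + (0,) + x[j + 1:])
        w1 = score(theta, x[:j] + (1,) + x[j + 1:])
        v = 1 if rng.random() < w1 / (w0 + w1) else 0
        x = x[:j] + (v,) + x[j + 1:]
    return x


def mean_features(data):
    return [sum(f(ft, x) for x in data) / len(data) for ft in FEATURES]


def cd_gradient(data, perturbed):
    """E over D of f_i minus E over the perturbed set D^- of f_i."""
    return [d - p for d, p in zip(mean_features(data), mean_features(perturbed))]


def train_cd(data, k=1, step=0.1, iters=200, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(FEATURES)
    for _ in range(iters):
        perturbed = []
        for x in data:               # each chain starts from a data instance
            for _sweep in range(k):  # short chain: k Gibbs sweeps
                x = gibbs_sweep(theta, x, rng)
            perturbed.append(x)
        grad = cd_gradient(data, perturbed)
        theta = [t + step * g for t, g in zip(theta, grad)]
    return theta


# Hypothetical fully observed samples (a, b, c).
DATA = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (1, 0, 0), (1, 1, 0), (1, 1, 1)]
theta_cd = train_cd(DATA)
```

No partition function and no exact inference are needed: each update uses only feature averages over the data and over the perturbed sample. The resulting `theta_cd` depends on the seed and is a noisy estimate, not the exact MLE.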

62 Checkpoint. "Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning." Winston Churchill

63 Checkpoint. Maximum likelihood for MN: not easily decomposable due to the Partition Function; solvable using Gradient Ascent, but requires running inference on the MN for each step. Priors: Gaussian vs. Laplacian; both aim to reduce too strong a dependency of the overall probability on a single feature. Conjugate Prior.

64 Checkpoint. MLE with missing data. GA: requires running inference per observation per step. EM: reduces to Gradient Ascent, requiring slightly fewer runs of inference. Alternative goals. Pseudo-likelihood: tractable, requires a sufficiently large data sample. Contrastive Divergence: tractable, does not require much sampling.
