13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401
Table of Contents
1 Learning parameters in MRFs
Inference and Learning

Given the parameters (of the potentials) and the graph, one can ask for:
  MAP inference:      x* = argmax_x P(x)
  Marginal inference: P(x_c) = ∑_{x_{V\c}} P(x)

How to get the parameters and the graph? Learning.
Learning

  Parameter learning, when the graph is given (Lecture 9):
    Bayes nets (directed graphical models)
    Markov random fields (undirected or factor graphical models)
  Structure estimation, to learn or estimate the graph structure (Lecture 10)
Parameters for Bayesian networks

For Bayesian networks,
  P(x_1, ..., x_n) = ∏_{i=1}^n P(x_i | Pa(x_i)).
Parameters: the conditional probability tables P(x_i | Pa(x_i)).

[Figure: the student network, with nodes Difficulty (D), Intelligence (I), Grade (G), SAT (S), Letter (L), Happy (H), Job (J), and CPTs P(D), P(I), P(G | D, I), P(S | I), P(L | G), P(J | L, S), P(H | G, J).]
Learning parameters in Bayes nets

Y = Yes, N = No.

  Case  D  I  G  S  L  H  J
  1     Y  Y  Y  Y  Y  N  Y
  2     N  N  Y  N  N  Y  N
  3     Y  N  Y  N  N  Y  N
  ...

Estimate the CPTs by counting:
  P(D = d) = N_{D=d} / N_total
  P(G = g | D = d, I = i) = N_{G=g, D=d, I=i} / N_{D=d, I=i}
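The counting estimates above are simple to implement. A minimal sketch, using the three toy cases from the table (the dataset and helper names are illustrative, not part of the original slides):

```python
# Toy dataset: each case assigns Y/N to (D, I, G, S, L, H, J),
# matching the three cases in the table above.
cases = [
    {"D": "Y", "I": "Y", "G": "Y", "S": "Y", "L": "Y", "H": "N", "J": "Y"},
    {"D": "N", "I": "N", "G": "Y", "S": "N", "L": "N", "H": "Y", "J": "N"},
    {"D": "Y", "I": "N", "G": "Y", "S": "N", "L": "N", "H": "Y", "J": "N"},
]

def p_d(d):
    # P(D = d) = N_{D=d} / N_total
    return sum(c["D"] == d for c in cases) / len(cases)

def p_g_given_di(g, d, i):
    # P(G = g | D = d, I = i) = N_{G=g,D=d,I=i} / N_{D=d,I=i}
    denom = sum(c["D"] == d and c["I"] == i for c in cases)
    num = sum(c["G"] == g and c["D"] == d and c["I"] == i for c in cases)
    return num / denom if denom else 0.0

p_hat = p_d("Y")  # empirical P(D = Y) from the three cases
```

With only three cases many conditional counts have empty denominators, which already hints at the estimation problems discussed next.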
Learning parameters in Bayes nets

Problems?
  Counting does not minimise the classification error.
  There is not much flexibility in the features or the parameters.
Parameters for MRFs

For MRFs, let V be the set of nodes and C the set of clusters c. Then
  P(x; θ) = exp(∑_{c∈C} θ_c(x_c)) / Z(θ),   (1)
where the normaliser Z(θ) = ∑_x exp(∑_{c∈C} θ_c(x_c)).

Parameters: {θ_c}_{c∈C}.

Inference:
  MAP inference:      x* = argmax_x ∑_{c∈C} θ_c(x_c), since log P(x) ∝ ∑_{c∈C} θ_c(x_c)
  Marginal inference: P(x_c) = ∑_{x_{V\c}} P(x)

Learning (parameter estimation): learn θ and the graph structure. Often one assumes θ_c(x_c) = ⟨w, Φ_c(x_c)⟩ and learns w by empirical risk minimisation (ERM).
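Equation (1) can be made concrete by brute-force enumeration on a tiny model. A sketch, assuming a 3-node binary chain with two pairwise clusters and made-up θ values:

```python
import itertools
import math

# Pairwise potentials theta_c(x_c) for clusters {0,1} and {1,2};
# the numbers are arbitrary, chosen only for illustration.
theta = {
    (0, 1): {(0, 0): 0.5, (0, 1): -0.2, (1, 0): -0.2, (1, 1): 0.8},
    (1, 2): {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.6},
}

def score(x):
    # sum_{c in C} theta_c(x_c)
    return sum(t[(x[i], x[j])] for (i, j), t in theta.items())

def Z():
    # normaliser: sum over all 2^3 joint configurations
    return sum(math.exp(score(x)) for x in itertools.product([0, 1], repeat=3))

def prob(x):
    return math.exp(score(x)) / Z()

# Sanity check: the probabilities of all configurations sum to one.
total = sum(prob(x) for x in itertools.product([0, 1], repeat=3))
```

Enumerating all configurations is only feasible for toy models; the exponential cost of Z(θ) is exactly why the inference algorithms mentioned later matter.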
Parameters for MRFs

In learning, we look for an F that predicts labels well via y* = argmax_{y∈Y} F(x_i, y; w). Given a graph G = (V, E), one often assumes
  F(x, y; w) = ⟨w, Φ(x, y)⟩
             = ∑_{i∈V} ⟨w_1, Φ_i(y^(i), x)⟩ + ∑_{(i,j)∈E} ⟨w_2, Φ_{i,j}(y^(i), y^(j), x)⟩
             = ∑_{i∈V} θ_i(y^(i), x) + ∑_{(i,j)∈E} θ_{i,j}(y^(i), y^(j), x)   (MAP inference)
Here w = [w_1; w_2] and Φ(x, y) = [∑_{i∈V} Φ_i(y^(i), x); ∑_{(i,j)∈E} Φ_{i,j}(y^(i), y^(j), x)].
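The identity ⟨w, Φ(x, y)⟩ = ∑ node scores + ∑ edge scores can be checked numerically. A minimal sketch, with random local features standing in for Φ_i and Φ_{i,j} at one fixed labelling (all values here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
V, E = [0, 1, 2], [(0, 1), (1, 2)]   # a 3-node chain
d1, d2 = 4, 3                        # node / edge feature dimensions
w1, w2 = rng.normal(size=d1), rng.normal(size=d2)

# Local features at one fixed labelling y (random stand-ins for
# Phi_i(y_i, x) and Phi_ij(y_i, y_j, x)).
phi_node = {i: rng.normal(size=d1) for i in V}
phi_edge = {e: rng.normal(size=d2) for e in E}

# Global feature: stacked sums of local features, as on the slide.
Phi = np.concatenate([sum(phi_node[i] for i in V),
                      sum(phi_edge[e] for e in E)])
w = np.concatenate([w1, w2])

F_dot = w @ Phi                                         # <w, Phi(x, y)>
F_local = (sum(w1 @ phi_node[i] for i in V)
           + sum(w2 @ phi_edge[e] for e in E))          # sum of local scores
```

By bilinearity the two quantities agree, which is what makes MAP inference a sum of local terms.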
We want a gap between F(x_i, y_i; w) and the best F(x_i, y; w) for y ≠ y_i, that is,
  F(x_i, y_i; w) − max_{y∈Y, y≠y_i} F(x_i, y; w)
Structured SVM - 1

Primal:
  min_{w,ξ} (1/2)‖w‖² + C ∑_{i=1}^m ξ_i   s.t.   (2a)
  ∀i, ∀y ≠ y_i:  ⟨w, Φ(x_i, y_i) − Φ(x_i, y)⟩ ≥ Δ(y_i, y) − ξ_i.   (2b)

Writing δΦ_i(y) := Φ(x_i, y_i) − Φ(x_i, y), the dual is a quadratic programming (QP) problem:
  max_α ∑_{i, y≠y_i} Δ(y_i, y) α_{iy} − (1/2) ∑_{i, y≠y_i} ∑_{j, y'≠y_j} α_{iy} α_{jy'} ⟨δΦ_i(y), δΦ_j(y')⟩   (3)
  s.t. ∀i, ∀y ≠ y_i: α_{iy} ≥ 0;   ∀i: ∑_{y≠y_i} α_{iy} ≤ C.
Structured SVM - 2

The cutting plane method needs to find the label of the most violated constraint in (2b):
  ŷ_i = argmax_{y∈Y} Δ(y_i, y) + ⟨w, Φ(x_i, y)⟩.   (4)
With the ŷ_i, one can solve the following relaxed problem (with far fewer constraints):
  min_{w,ξ} (1/2)‖w‖² + C ∑_{i=1}^m ξ_i   s.t.   (5a)
  ∀i:  ⟨w, Φ(x_i, y_i) − Φ(x_i, ŷ_i)⟩ ≥ Δ(y_i, ŷ_i) − ξ_i.   (5b)
Structured SVM - 3

Simplified overall procedure.
Input: data x_i, labels y_i, sample size m, number of iterations T
Initialise S_0 = ∅, w_0 = 0 (or a random vector), and t = 0.
for t = 0 to T do
  for i = 1 to m do
    ŷ_i = argmax_{y∈Y, y≠y_i} ⟨w_t, Φ(x_i, y)⟩ + Δ(y_i, y)
    ξ_i = [Δ(y_i, ŷ_i) + ⟨w_t, Φ(x_i, ŷ_i) − Φ(x_i, y_i)⟩]_+
    if ξ_i > 0 then
      Increase the constraint set: S_t ← S_t ∪ {ŷ_i}
    end if
  end for
  α ← optimise the dual QP with constraint set S_t
  w_t = ∑_i ∑_{y∈S_t} α_{iy} δΦ_i(y)
end for
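The loop above can be sketched on a toy problem where plain multiclass labels stand in for structured outputs. This is only an illustration: the dual QP step is replaced by a simple subgradient update on each violated constraint, and all data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
K, d, m, T, eta = 3, 5, 30, 50, 0.1   # classes, dims, samples, epochs, step
X = rng.normal(size=(m, d))
W_true = rng.normal(size=(K, d))
Y = np.argmax(X @ W_true.T, axis=1)   # linearly separable toy labels
W = np.zeros((K, d))                  # one weight row per label

def delta(y_true, y):
    # 0/1 task loss standing in for the structured loss Delta(y_i, y)
    return float(y_true != y)

for t in range(T):
    for i in range(m):
        x, y_i = X[i], Y[i]
        # loss-augmented inference: label of the most violated constraint
        scores = [delta(y_i, y) + W[y] @ x for y in range(K)]
        y_hat = int(np.argmax(scores))
        xi = scores[y_hat] - W[y_i] @ x   # slack of that constraint
        if xi > 0 and y_hat != y_i:
            # subgradient step in place of re-solving the dual QP
            W[y_i] += eta * x
            W[y_hat] -= eta * x

train_acc = float(np.mean(np.argmax(X @ W.T, axis=1) == Y))
```

In a real structured SVM the loss-augmented argmax runs over an exponentially large Y via MAP inference on the graph, and the weights are recovered from the dual QP rather than by subgradient steps.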
Other

Other approaches use the max-margin principle, such as the Max-Margin Markov Network (M3N), ...
Main types:
  Maximum Entropy (MaxEnt)
  Maximum a Posteriori (MAP)
  Maximum Likelihood (ML)
Maximum Entropy

Maximum Entropy (ME) estimates w by maximising the entropy. That is,
  w* = argmax_w − ∑_{x∈X, y∈Y} P_w(x, y) ln P_w(x, y).
There is a duality between maximum likelihood and maximum entropy subject to moment-matching constraints on the expectations of the features.
MAP

Let the likelihood function L(w) be the modelled probability or density of the occurrence of the sample configuration (x_1, y_1), ..., (x_m, y_m) given the probability density P_w parameterised by w. That is,
  L(w) = P_w((x_1, y_1), ..., (x_m, y_m)).
Maximum a Posteriori (MAP) estimates w by maximising L(w) times a prior P(w). That is,
  w* = argmax_w L(w) P(w).   (6)
Assuming {(x_i, y_i)}_{1≤i≤m} are i.i.d. samples from P_w(x, y), (6) becomes
  w* = argmax_w ∏_{1≤i≤m} P_w(x_i, y_i) P(w)
     = argmin_w − ∑_{1≤i≤m} ln P_w(x_i, y_i) − ln P(w).
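The negative-log form of MAP estimation can be illustrated on a one-dimensional example. A sketch, using a Bernoulli likelihood with a Beta(a, b) prior (all numbers made up), where the MAP estimate has the known closed form (h + a − 1) / (n + a + b − 2):

```python
import math

def neg_log_posterior(p, h, n, a, b):
    # -sum_i ln P_w(x_i) - ln P(w), constants dropped:
    # Bernoulli likelihood (h heads in n flips) with a Beta(a, b) prior on p
    return -(h * math.log(p) + (n - h) * math.log(1.0 - p)
             + (a - 1.0) * math.log(p) + (b - 1.0) * math.log(1.0 - p))

h, n, a, b = 7, 10, 2.0, 2.0                    # made-up data and prior
grid = [i / 1000.0 for i in range(1, 1000)]     # crude 1-D grid search
p_map = min(grid, key=lambda p: neg_log_posterior(p, h, n, a, b))
p_closed = (h + a - 1.0) / (n + a + b - 2.0)    # closed-form MAP = 8/12
```

Minimising the negative log-posterior and maximising L(w)P(w) pick out the same point; for MRFs the same objective is minimised by gradient methods rather than grid search.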
Maximum Likelihood

Maximum Likelihood (ML) is the special case of MAP in which P(w) is uniform, which means
  w* = argmax_w ∏_{1≤i≤m} P_w(x_i, y_i)
     = argmin_w − ∑_{1≤i≤m} ln P_w(x_i, y_i).
Alternatively, one can replace the joint distribution P_w(x, y) by the conditional distribution P_w(y | x), which gives a discriminative model called Conditional Random Fields (CRFs).
Conditional Random Fields (CRFs) - 1

Assume the conditional distribution over Y given X has the form of an exponential family, i.e.,
  P(y | x; w) = exp(⟨w, Φ(x, y)⟩) / Z(w | x),   (7)
where
  Z(w | x) = ∑_{y'∈Y} exp(⟨w, Φ(x, y')⟩),   (8)
and
  Φ(x, y) = [∑_{i∈V} Φ_i(y^(i), x); ∑_{(i,j)∈E} Φ_{i,j}(y^(i), y^(j), x)],   w = [w_1; w_2].
More generally, the global feature can be decomposed into local features on cliques (fully connected subgraphs).
CRFs - 2

Denote (x_1, ..., x_m) by X and (y_1, ..., y_m) by Y. The classical approach is to maximise the conditional likelihood of Y given X, incorporating a prior on the parameters. This is a Maximum a Posteriori (MAP) estimator, which consists of maximising
  P(w | X, Y) ∝ P(w) P(Y | X; w).
From the i.i.d. assumption we have
  P(Y | X; w) = ∏_{i=1}^m P(y_i | x_i; w),
and we impose a Gaussian prior on w:
  P(w) ∝ exp(−‖w‖² / (2σ²)).
CRFs - 3

Maximising the posterior distribution can also be seen as minimising the negative log-posterior, which becomes our risk function R(w | X, Y):
  R(w | X, Y) = −ln(P(w) P(Y | X; w)) + c
              = ‖w‖²/(2σ²) + ∑_{i=1}^m ( −⟨Φ(x_i, y_i), w⟩ + ln Z(w | x_i) ) + c,
where c is a constant and the summand ℓ_L(x_i, y_i, w) := −⟨Φ(x_i, y_i), w⟩ + ln Z(w | x_i) is the log loss, i.e. the negative log-likelihood. Now learning is equivalent to
  w* = argmin_w R(w | X, Y).
CRFs - 4

The above is a convex optimisation problem in w, since ln Z(w | x) is a convex function of w. The solution can be obtained by gradient descent, since ln Z(w | x) is also differentiable. We have
  ∇_w R(w | X, Y) = w/σ² + ∑_{i=1}^m ( −Φ(x_i, y_i) + ∇_w ln Z(w | x_i) ).
It follows by direct computation that
  ∇_w ln Z(w | x) = E_{y ∼ P(y|x;w)}[Φ(x, y)].
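The identity ∇_w ln Z(w | x) = E_{y ∼ P(y|x;w)}[Φ(x, y)] can be verified numerically on a toy problem. A sketch, enumerating a small label space with random features (x is fixed, so Φ depends only on y; all values are synthetic):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
d = 4                                            # feature dimension
w = rng.normal(size=d)
labels = list(itertools.product([0, 1], repeat=3))  # 3 binary output variables
Phi = {y: rng.normal(size=d) for y in labels}       # Phi(x, y) at fixed x

def log_Z(w):
    # ln Z(w | x) by brute-force enumeration of Y
    return np.log(sum(np.exp(w @ Phi[y]) for y in labels))

# Expectation of Phi under P(y | x; w)
probs = {y: np.exp(w @ Phi[y] - log_Z(w)) for y in labels}
expect = sum(p * Phi[y] for y, p in probs.items())

# Central finite-difference gradient of ln Z for comparison
eps = 1e-6
grad = np.array([(log_Z(w + eps * np.eye(d)[k]) - log_Z(w - eps * np.eye(d)[k]))
                 / (2 * eps) for k in range(d)])
```

The two vectors agree to numerical precision, confirming that CRF gradient computation reduces to computing feature expectations.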
CRFs - 5

Since Φ(x, y) decomposes over nodes and edges, it is straightforward to show that the expectation also decomposes into expectations on the nodes V and edges E:
  E_{y ∼ P(y|x;w)}[Φ(x, y)] = ∑_{i∈V} E_{y^(i) ∼ P(y^(i)|x;w)}[Φ_i(y^(i), x)]
                            + ∑_{(i,j)∈E} E_{y^(i),y^(j) ∼ P(y^(i),y^(j)|x;w)}[Φ_{i,j}(y^(i), y^(j), x)],
where the node and edge expectations can be computed given the marginals P(y^(i) | x; w) and P(y^(i), y^(j) | x; w). These can be computed exactly by variable elimination or the junction tree algorithm, approximately using e.g. (loopy) belief propagation, or circumvented through sampling.
That's all. Thanks!