Probabilistic & Unsupervised Learning



1 Probabilistic & Unsupervised Learning: Convex Algorithms in Approximate Inference
Yee Whye Teh, ywteh@gatsby.ucl.ac.uk
Gatsby Computational Neuroscience Unit, University College London
Term 1, Autumn 2008

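2 Convexity

A convex function $f : X \to \mathbb{R}$ is one where, for any $x, y \in X$ and $0 \le \alpha \le 1$,

$$f(\alpha x + (1-\alpha) y) \le \alpha f(x) + (1-\alpha) f(y)$$

[Figure: a convex function between $x$ and $y$; the chord value $\alpha f(x) + (1-\alpha) f(y)$ lies above $f(\alpha x + (1-\alpha) y)$.]

Every local minimum of a convex function is a global minimum (a minimum exists unless the function is not bounded below or its infimum is not attained), and there are efficient algorithms to optimize convex functions subject to convex constraints. Examples: linear programs (LP), quadratic programs (QP), second-order cone programs (SOCP), semi-definite programs (SDP), geometric programs.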
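The defining inequality can be probed numerically. The sketch below is an illustration added to the notes, not part of the slides: the helper name `is_convex_on_samples` and the two test functions are assumptions. It samples random triples $(x, y, \alpha)$ and reports whether any of them violates the inequality. Passing the check does not prove convexity; a single violation disproves it.

```python
import math
import random

def is_convex_on_samples(f, lo, hi, trials=1000, seed=0, tol=1e-9):
    """Return False as soon as one sampled (x, y, alpha) violates
    f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y); otherwise return True."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y = rng.uniform(lo, hi), rng.uniform(lo, hi)
        a = rng.random()
        lhs = f(a * x + (1 - a) * y)
        rhs = a * f(x) + (1 - a) * f(y)
        if lhs > rhs + tol:
            return False
    return True

# x^2 is convex everywhere; sin is concave on [0, pi], so it fails the test.
convex_result = is_convex_on_samples(lambda x: x * x, -3.0, 3.0)
nonconvex_result = is_convex_on_samples(math.sin, 0.0, math.pi)
```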

3 Convexity and Approximate Inference

There has been much recent effort using convex programming techniques to solve inference problems both exactly and approximately.

- Linear programming relaxation as an approximate method to find the MAP assignment in Markov random fields.
- Attractive Markov random fields: the binary case is exact and related to a maximum flow-minimum cut problem in graph theory (a linear program). Approximate otherwise.
- Tree-structured convex upper bounds on the log partition function (convexified belief propagation).
- Unified view of approximate inference as optimization on the marginal polytope.
- Learning graphical models using maximum margin principles and convex approximate inference.
- ...

4 LP Relaxation for Markov Random Fields

Discrete Markov random fields (MRFs) with pairwise interactions:

$$p(X) = \frac{1}{Z} \prod_{(ij)} f_{ij}(X_i, X_j) \prod_i f_i(X_i) = \frac{1}{Z} \exp\Big( \sum_{(ij)} E_{ij}(X_i, X_j) + \sum_i E_i(X_i) \Big)$$

The problem is to find the MAP assignment $X^{MAP}$:

$$X^{MAP} = \arg\max_X \sum_{(ij)} E_{ij}(X_i, X_j) + \sum_i E_i(X_i)$$

Reformulate in terms of slightly different variables:

$$b_i(x_i) = \delta(X_i = x_i) \qquad b_{ij}(x_i, x_j) = \delta(X_i = x_i)\,\delta(X_j = x_j)$$

where $\delta(\cdot) = 1$ if the argument is true, 0 otherwise. Each $b_i(x_i)$ is an indicator for whether variable $X_i$ takes on value $x_i$. The indicator variables need to satisfy certain constraints:

- $b_i(x_i),\ b_{ij}(x_i, x_j) \in \{0, 1\}$: indicator variables are binary variables.
- $\sum_{x_i} b_i(x_i) = 1$: $X_i$ takes on exactly one value.
- $\sum_{x_j} b_{ij}(x_i, x_j) = b_i(x_i)$: pairwise indicators are consistent with single-site indicators.

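5 LP Relaxation for Markov Random Fields

The MAP assignment problem is equivalent to:

$$\arg\max_{\{b_i, b_{ij}\}} \sum_{(ij)} \sum_{x_i, x_j} b_{ij}(x_i, x_j) E_{ij}(x_i, x_j) + \sum_i \sum_{x_i} b_i(x_i) E_i(x_i)$$

with constraints, for all $i, j, x_i, x_j$:

$$b_i(x_i),\ b_{ij}(x_i, x_j) \in \{0, 1\} \qquad \sum_{x_i} b_i(x_i) = 1 \qquad \sum_{x_j} b_{ij}(x_i, x_j) = b_i(x_i)$$

The linear programming relaxation for MRFs is:

$$\arg\max_{\{b_i, b_{ij}\}} \sum_{(ij)} \sum_{x_i, x_j} b_{ij}(x_i, x_j) E_{ij}(x_i, x_j) + \sum_i \sum_{x_i} b_i(x_i) E_i(x_i)$$

with constraints, for all $i, j, x_i, x_j$:

$$b_i(x_i),\ b_{ij}(x_i, x_j) \in [0, 1] \qquad \sum_{x_i} b_i(x_i) = 1 \qquad \sum_{x_j} b_{ij}(x_i, x_j) = b_i(x_i)$$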

6 LP Relaxation for Markov Random Fields

- The LP relaxation is a linear program which can be solved efficiently.
- If the solution is integral, i.e. each $b_i(x_i),\ b_{ij}(x_i, x_j) \in \{0, 1\}$, then the solution corresponds to the MAP solution $X^{MAP}$.
- LP relaxation is a zero-temperature version of the Bethe free energy formulation of loopy BP, where the Bethe entropy term can be ignored.
- If the MRF is binary and attractive, then a slightly different reformulation of LP relaxation will always give the MAP solution.
- Next: we show how to find the MAP solution directly for binary attractive MRFs using network flow.
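The relaxed solution is not always integral. The sketch below is an illustration added to the notes, under stated assumptions: the graph is a hypothetical 3-cycle of binary variables with a repulsive toy energy $E_{ij}(x_i, x_j) = \delta(x_i \ne x_j)$ and zero node energies, and the marginalization constraint is checked in both directions of each edge. No LP solver is used. It shows only that one fractional point satisfies the relaxed constraints and scores higher than every integral assignment, so the LP optimum for this model cannot be integral.

```python
from itertools import product

STATES = (0, 1)
EDGES = [(0, 1), (1, 2), (0, 2)]  # hypothetical example graph: a 3-cycle
N = 3

def pair_energy(xi, xj):
    # Repulsive toy energy: reward neighbours that disagree.
    return 1.0 if xi != xj else 0.0

def objective(b_edge):
    # LP objective; node energies are zero in this example.
    return sum(b_edge[e][(xi, xj)] * pair_energy(xi, xj)
               for e in EDGES for xi, xj in product(STATES, repeat=2))

def is_feasible(b_node, b_edge, tol=1e-12):
    # Box constraints, normalization, and pairwise/single-site consistency.
    values = [v for i in range(N) for v in b_node[i].values()]
    values += [v for e in EDGES for v in b_edge[e].values()]
    if any(v < -tol or v > 1 + tol for v in values):
        return False
    if any(abs(sum(b_node[i].values()) - 1.0) > tol for i in range(N)):
        return False
    for (i, j) in EDGES:
        for s in STATES:
            row = sum(b_edge[(i, j)][(s, t)] for t in STATES)
            col = sum(b_edge[(i, j)][(t, s)] for t in STATES)
            if abs(row - b_node[i][s]) > tol or abs(col - b_node[j][s]) > tol:
                return False
    return True

def indicators(x):
    # Integral point of the LP corresponding to the assignment x.
    b_node = {i: {s: float(x[i] == s) for s in STATES} for i in range(N)}
    b_edge = {(i, j): {(s, t): float(x[i] == s and x[j] == t)
                       for s, t in product(STATES, repeat=2)}
              for (i, j) in EDGES}
    return b_node, b_edge

# Best value over all integral (MAP) assignments, by brute force.
map_value = max(objective(indicators(x)[1]) for x in product(STATES, repeat=N))

# A fractional point: uniform single-site beliefs, all edge mass on disagreement.
frac_node = {i: {0: 0.5, 1: 0.5} for i in range(N)}
frac_edge = {e: {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.0} for e in EDGES}
frac_value = objective(frac_edge)
```

On an odd cycle at most two of the three edges can disagree, so the integral optimum is 2, while the fractional point reaches 3. This model is repulsive, which is consistent with the slide's claim that exactness is guaranteed for binary attractive MRFs.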

7 Attractive Binary MRFs and Max Flow-Min Cut

Binary MRFs:

$$p(X) = \frac{1}{Z} \exp\Big( \sum_{(ij)} W_{ij}\,\delta(X_i = X_j) + \sum_i c_i X_i \Big)$$

- The binary MRF is attractive if $W_{ij} \ge 0$ for all $i, j$. Neighbouring variables prefer to be in the same state in such MRFs.
- This parametrization loses no generality: such MRFs can be equivalently expressed as Boltzmann machines with positive interactions.
- Many practical MRFs are attractive, e.g. image segmentation, webpage classification.
- The MAP assignment $X$ can be found efficiently by converting the problem into a maximum flow-minimum cut program.

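8 Attractive Binary MRFs and Max Flow-Min Cut

The MAP problem:

$$\arg\max_x \sum_{(ij)} W_{ij}\,\delta(x_i = x_j) + \sum_i c_i x_i$$

Construct a network as follows:

1. Edges $(ij)$ are undirected with weight $\lambda_{ij} = W_{ij}$;
2. Add a source node $s$ and a sink node $t$;
3. $c_i > 0$: connect the source node to variable $i$ with weight $\lambda_{si} = c_i$;
4. $c_j < 0$: connect variable $j$ to the sink node with weight $\lambda_{jt} = -c_j$.

A cut is a partition of the nodes into $S$ and $T$ with $s \in S$ and $t \in T$. The weight of the cut is

$$\Lambda(S, T) = \sum_{i \in S,\ j \in T} \lambda_{ij}$$

The minimum cut problem is to find the cut with minimum weight.

[Figure: example network with a source edge of weight $+c_i$ into variable $i$, an edge of weight $W_{ij}$ between variables $i$ and $j$, and a sink edge of weight $-c_j$ out of variable $j$.]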

9 Attractive Binary MRFs and Max Flow-Min Cut

Identify an assignment $X = x$ with a cut:

$$S = \{s\} \cup \{i : x_i = 1\} \qquad T = \{t\} \cup \{j : x_j = 0\}$$

The weight of the cut is:

$$\Lambda(S, T) = \sum_{(ij)} W_{ij}\,\delta(x_i \ne x_j) + \sum_i (1 - x_i) \max(0, c_i) + \sum_j x_j \max(0, -c_j)$$
$$= -\sum_{(ij)} W_{ij}\,\delta(x_i = x_j) - \sum_i x_i c_i + \text{constant}$$

[Figure: the network from the previous slide, with edge weights $W_{ij}$, $+c_i$ and $-c_j$.]

So finding the minimum cut corresponds to finding the MAP assignment.

How do we find the minimum cut? The minimum cut problem is dual to the maximum flow problem, i.e. find the maximum flow allowable from the source to the sink through the network. This can be solved extremely efficiently (see the Wikipedia entry).

The framework can be generalized to general attractive MRFs, but will not be exact anymore.
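The construction and the cut-to-assignment correspondence can be sketched in a few lines. This is an illustration added to the notes, under stated assumptions: the slides do not name a max-flow algorithm, so a plain Edmonds-Karp (shortest augmenting path) routine is swapped in; the function names, the weights `W` and the biases `c` are hypothetical. The result is checked against brute-force enumeration.

```python
from collections import deque
from itertools import product

def min_cut_source_side(capacity, s, t):
    """Edmonds-Karp max flow. Returns (flow value, nodes reachable from s
    in the final residual graph), i.e. the source side S of a minimum cut."""
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0.0)  # reverse arcs
    residual.setdefault(s, {})
    residual.setdefault(t, {})
    flow = 0.0
    while True:
        parent = {s: None}            # breadth-first search from the source
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 1e-12 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if t not in parent:           # no augmenting path left
            return flow, set(parent)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        flow += bottleneck

def map_by_min_cut(W, c):
    """Build the network of steps 1-4 and read x_i = 1 for nodes on the source side."""
    cap = {}
    def add(u, v, w):
        cap.setdefault(u, {})
        cap[u][v] = cap[u].get(v, 0.0) + w
    for (i, j), w in W.items():       # undirected edge = one arc each way
        add(i, j, w)
        add(j, i, w)
    for i, ci in enumerate(c):
        if ci > 0:
            add('s', i, ci)
        elif ci < 0:
            add(i, 't', -ci)
    _, source_side = min_cut_source_side(cap, 's', 't')
    return tuple(1 if i in source_side else 0 for i in range(len(c)))

def score(x, W, c):
    # The MAP objective: sum_ij W_ij [x_i == x_j] + sum_i c_i x_i.
    return (sum(w * (x[i] == x[j]) for (i, j), w in W.items())
            + sum(ci * xi for ci, xi in zip(c, x)))

W = {(0, 1): 1.0, (1, 2): 2.0, (2, 3): 0.5, (0, 3): 1.5}   # all W_ij >= 0
c = [2.0, -1.0, -3.0, 0.5]
x_cut = map_by_min_cut(W, c)
x_brute = max(product((0, 1), repeat=4), key=lambda x: score(x, W, c))
```

For real workloads a library max-flow routine would be the natural choice; the hand-written version is used here only to keep the sketch self-contained.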

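10 Convexity and Exponential Families

An exponential family distribution is parametrized by a natural parameter vector $\theta$, and equivalently by its mean parameter vector $\mu$:

$$p(x \mid \theta) = \exp\big( \theta^\top s(x) - \Phi(\theta) \big)$$

where $\Phi(\theta)$ is the log partition function

$$\Phi(\theta) = \log \sum_x \exp\big( \theta^\top s(x) \big)$$

$\Phi(\theta)$ plays an important role in the characterization of the exponential family. For example, it is a cumulant generating function for the distribution: its derivatives give the mean and variance of the sufficient statistics,

$$\nabla_\theta \Phi(\theta) = E_\theta[s(X)] = \mu(\theta) = \mu \qquad \nabla^2_\theta \Phi(\theta) = V_\theta[s(X)]$$

The second derivative is positive semi-definite, so $\Phi(\theta)$ is convex in $\theta$.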
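The gradient identity can be checked numerically on a tiny model. The sketch below is an illustration added to the notes: the two-variable model, its sufficient statistics $s(x) = (x_1, x_2, x_1 x_2)$ and the value of `theta` are assumptions. It compares a central finite difference of $\Phi$ with the mean parameters computed by enumeration.

```python
import math
from itertools import product

STATES = list(product((0, 1), repeat=2))   # all joint states of two binary variables

def suff_stats(x):
    # Hypothetical sufficient statistics: both variables and their product.
    return (x[0], x[1], x[0] * x[1])

def dot(theta, s):
    return sum(t * v for t, v in zip(theta, s))

def log_partition(theta):
    return math.log(sum(math.exp(dot(theta, suff_stats(x))) for x in STATES))

def mean_params(theta):
    # mu = E_theta[s(X)], computed by enumerating all states.
    phi = log_partition(theta)
    probs = [math.exp(dot(theta, suff_stats(x)) - phi) for x in STATES]
    return [sum(p * suff_stats(x)[k] for p, x in zip(probs, STATES))
            for k in range(len(theta))]

def numerical_gradient(f, theta, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad

theta = [0.3, -1.2, 0.8]
mu = mean_params(theta)
grad = numerical_gradient(log_partition, theta)
max_gap = max(abs(g - m) for g, m in zip(grad, mu))
```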

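11 Convexity and Exponential Families

The log partition function and the negative entropy are intimately related. We express the negative entropy as a function of the mean parameter:

$$\Psi(\mu) = E_\theta[\log p(X \mid \theta)] = \theta^\top \mu - \Phi(\theta) \qquad \text{i.e.} \qquad \theta^\top \mu = \Phi(\theta) + \Psi(\mu)$$

The KL divergence between two exponential family distributions $p(X \mid \theta')$ and $p(X \mid \theta)$ is:

$$KL(p(X \mid \theta') \,\|\, p(X \mid \theta)) = KL(\theta' \,\|\, \theta) = E_{\theta'}[\log p(X \mid \theta') - \log p(X \mid \theta)] = -\theta^\top \mu' + \Phi(\theta) + \Psi(\mu') \ge 0$$

$$\Psi(\mu') \ge \theta^\top \mu' - \Phi(\theta)$$

for any pair of mean and natural parameter vectors. Because the minimum of the KL divergence is zero, and is attained at $\theta = \theta'$, we have:

$$\Psi(\mu) = \sup_\theta\ \theta^\top \mu - \Phi(\theta)$$

The construction on the RHS is called the convex dual of $\Phi(\theta)$. For continuous convex functions, the dual of the dual is the original function, thus:

$$\Phi(\theta) = \sup_\mu\ \theta^\top \mu - \Psi(\mu)$$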
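Both duality statements can be checked on the simplest exponential family. The sketch below is an illustration added to the notes: it assumes a Bernoulli variable with $s(x) = x$, for which $\Phi(\theta) = \log(1 + e^\theta)$ and $\Psi(\mu) = \mu \log \mu + (1-\mu)\log(1-\mu)$, and it replaces each supremum with a maximum over a fine grid (the grid ranges and the values 0.3 and 0.7 are arbitrary choices).

```python
import math

def phi(theta):
    # Log partition function of a Bernoulli variable with s(x) = x.
    return math.log1p(math.exp(theta))

def psi(mu):
    # Negative entropy of a Bernoulli variable with mean mu.
    return mu * math.log(mu) + (1 - mu) * math.log(1 - mu)

def dual_of_phi(mu, theta_grid):
    # Grid approximation of sup_theta theta*mu - Phi(theta).
    return max(theta * mu - phi(theta) for theta in theta_grid)

def dual_of_psi(theta, mu_grid):
    # Grid approximation of sup_mu theta*mu - Psi(mu).
    return max(theta * m - psi(m) for m in mu_grid)

theta_grid = [-10 + 20 * k / 20000 for k in range(20001)]   # step 0.001
mu_grid = [k / 10000 for k in range(1, 10000)]              # open interval (0, 1)

mu, theta = 0.3, 0.7
psi_from_dual = dual_of_phi(mu, theta_grid)
phi_from_dual = dual_of_psi(theta, mu_grid)
gap_psi = abs(psi_from_dual - psi(mu))
gap_phi = abs(phi_from_dual - phi(theta))
```

Because a grid maximum can only underestimate a supremum, both approximations sit just below the closed-form values.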

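12 Convexity and Undirected Trees

Pairwise MRFs can be parametrized as follows:

$$p(X) = \frac{1}{Z} \prod_i f_i(X_i) \prod_{(ij)} f_{ij}(X_i, X_j)$$
$$= \exp\Big( \sum_i \sum_{x_i} \theta_i(x_i)\,\delta(X_i = x_i) + \sum_{(ij)} \sum_{x_i, x_j} \theta_{ij}(x_i, x_j)\,\delta(X_i = x_i)\,\delta(X_j = x_j) - \Phi(\theta) \Big)$$

So MRFs form an exponential family, with natural and mean parameters:

$$\theta = [\, \theta_i(x_i),\ \theta_{ij}(x_i, x_j) \quad \forall i, j, x_i, x_j \,]$$
$$\mu = [\, p(X_i = x_i),\ p(X_i = x_i, X_j = x_j) \quad \forall i, j, x_i, x_j \,]$$

If the MRF has tree structure $T$, the negative entropy is composed of single-site entropies and mutual informations on edges:

$$\Psi(\mu_T) = E_{\theta_T}\Big[ \log \prod_i p(X_i) \prod_{(ij) \in T} \frac{p(X_i, X_j)}{p(X_i)\,p(X_j)} \Big] = -\sum_i H(X_i) + \sum_{(ij) \in T} I(X_i, X_j)$$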
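The tree decomposition of the negative entropy can be verified by enumeration. The sketch below is an illustration added to the notes: it assumes a three-variable chain $X_0 - X_1 - X_2$ whose probability tables are made up for the example, and it compares the negative entropy of the joint with $-\sum_i H(X_i) + \sum_{(ij) \in T} I(X_i, X_j)$.

```python
import math
from itertools import product

# Hypothetical chain X0 - X1 - X2, specified by p(x0), p(x1 | x0), p(x2 | x1).
p0 = [0.6, 0.4]
p1_given_0 = [[0.7, 0.3], [0.2, 0.8]]
p2_given_1 = [[0.9, 0.1], [0.5, 0.5]]
joint = {(a, b, c): p0[a] * p1_given_0[a][b] * p2_given_1[b][c]
         for a, b, c in product((0, 1), repeat=3)}
tree_edges = [(0, 1), (1, 2)]

def marginal(dist, idxs):
    # Sum the joint over every variable not listed in idxs.
    out = {}
    for x, p in dist.items():
        key = tuple(x[i] for i in idxs)
        out[key] = out.get(key, 0.0) + p
    return out

def entropy(dist):
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def mutual_information(dist, i, j):
    # I(X_i, X_j) = H(X_i) + H(X_j) - H(X_i, X_j)
    return (entropy(marginal(dist, [i])) + entropy(marginal(dist, [j]))
            - entropy(marginal(dist, [i, j])))

neg_entropy_direct = -entropy(joint)
neg_entropy_tree = (-sum(entropy(marginal(joint, [i])) for i in range(3))
                    + sum(mutual_information(joint, i, j) for i, j in tree_edges))
```

The identity relies on the joint factorizing along the tree; for a distribution with loops the two quantities generally differ.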

13 Convex Upper Bounds on the Log Partition Function

Let us try to upper bound $\Phi(\theta)$.

Imagine a set of spanning trees $T$ for the MRF, each with its own parameters $\theta_T$, $\mu_T$. By padding entries of off-tree edges with zero, we can assume that $\theta_T$ has the same dimensionality as $\theta$.

Suppose also that we have a distribution $\beta$ over the spanning trees so that $E_\beta[\theta_T] = \theta$. Then by the convexity of $\Phi(\theta)$,

$$\Phi(\theta) = \Phi(E_\beta[\theta_T]) \le E_\beta[\Phi(\theta_T)]$$

Optimizing over all $\theta_T$, we get:

$$\Phi(\theta) \le \inf_{\theta_T :\ E_\beta[\theta_T] = \theta} E_\beta[\Phi(\theta_T)]$$
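The Jensen step can be illustrated on the smallest loopy graph. The sketch below is an illustration added to the notes, under stated assumptions: a 3-cycle of binary variables with sufficient statistics $x_i$ and $\delta(x_i = x_j)$, made-up parameter values, a uniform $\beta$ over the three spanning trees, and one particular feasible choice of $\theta_T$ (each on-tree edge parameter divided by the probability that the edge appears in a tree). This choice satisfies $E_\beta[\theta_T] = \theta$ but is not the optimized infimum, so it yields a valid, generally loose, upper bound.

```python
import math
from itertools import product

NODES = range(3)
EDGES = [(0, 1), (1, 2), (0, 2)]          # hypothetical 3-cycle

def log_partition(theta_node, theta_edge):
    # Brute-force Phi for statistics x_i and [x_i == x_j].
    total = 0.0
    for x in product((0, 1), repeat=3):
        s = sum(theta_node[i] * x[i] for i in NODES)
        s += sum(w * (x[i] == x[j]) for (i, j), w in theta_edge.items())
        total += math.exp(s)
    return math.log(total)

theta_node = [0.5, -0.3, 0.2]
theta_edge = {(0, 1): 1.0, (1, 2): -0.8, (0, 2): 0.6}

# Each spanning tree of a 3-cycle drops exactly one edge.
trees = [[e for e in EDGES if e != dropped] for dropped in EDGES]
beta = [1.0 / 3.0] * 3
appearance = {e: sum(b for b, T in zip(beta, trees) if e in T) for e in EDGES}

def tree_params(T):
    # Off-tree edges are padded with zero; on-tree edges are rescaled
    # so that the expectation over trees recovers theta_edge.
    return {e: (theta_edge[e] / appearance[e] if e in T else 0.0) for e in EDGES}

expected_edge = {e: sum(b * tree_params(T)[e] for b, T in zip(beta, trees))
                 for e in EDGES}

exact = log_partition(theta_node, theta_edge)
bound = sum(b * log_partition(theta_node, tree_params(T))
            for b, T in zip(beta, trees))
```

The node parameters are shared by all trees, so their expectation is trivially preserved; only the edge parameters need rescaling.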

14 Convex Upper Bounds on the Log Partition Function

$$\Phi(\theta) \le \inf_{\theta_T :\ E_\beta[\theta_T] = \theta} E_\beta[\Phi(\theta_T)]$$

We solve this constrained optimization problem using Lagrange multipliers:

$$L = E_\beta[\Phi(\theta_T)] - \mu^\top (E_\beta[\theta_T] - \theta)$$

Setting the derivatives with respect to $\theta_T$ to zero, we get:

$$\beta(T)\,\mu_T - \beta(T)\,\mu(T) = 0 \quad \Longrightarrow \quad \mu_T = \mu(T)$$

where $\mu(T)$ are the Lagrange multipliers corresponding to vertices and edges on the tree $T$.

Although there can be many $\theta_T$ parameters, at the optimum they are all constrained: their corresponding mean parameters are all consistent with each other and with $\mu$.

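15 Convex Upper Bounds on the Log Partition Function

$$\Phi(\theta) \le \sup_\mu \inf_{\theta_T} E_\beta[\Phi(\theta_T)] - \mu^\top (E_\beta[\theta_T] - \theta)$$
$$= \sup_\mu\ \mu^\top \theta + E_\beta[\Phi(\theta_T) - \theta_T^\top \mu(T)] \qquad \text{(with each } \theta_T \text{ at its optimum)}$$
$$= \sup_\mu\ \mu^\top \theta + E_\beta[-\Psi(\mu(T))]$$
$$= \sup_\mu\ \mu^\top \theta + E_\beta\Big[ \sum_i H_\mu(X_i) - \sum_{(ij) \in T} I_\mu(X_i, X_j) \Big]$$
$$= \sup_\mu\ \mu^\top \theta + \sum_i H_\mu(X_i) - \sum_{(ij)} \beta_{ij}\, I_\mu(X_i, X_j)$$

where $\beta_{ij}$ is the probability under $\beta$ that edge $(ij)$ appears in the spanning tree $T$.

This is a convexified Bethe free energy.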
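As a small worked instance of the last step, consider a single cycle on three nodes with $\beta$ uniform over its three spanning trees. This example is an illustration added to the notes, not part of the original slides. Each edge is missing from exactly one of the three trees, so its appearance probability and the resulting bound are:

```latex
\beta_{ij} = \sum_{T} \beta(T)\,\delta\big((ij) \in T\big)
           = \tfrac{1}{3} + \tfrac{1}{3} = \tfrac{2}{3}
\quad \text{for each of the three edges},
\qquad
\Phi(\theta) \le \sup_{\mu}\ \mu^\top \theta
  + \sum_{i=1}^{3} H_\mu(X_i)
  - \tfrac{2}{3} \sum_{(ij)} I_\mu(X_i, X_j).
```

Setting every $\beta_{ij} = 1$ instead gives the entropy term of the ordinary Bethe free energy, $\sum_i H_\mu(X_i) - \sum_{(ij)} I_\mu(X_i, X_j)$, which is the sense in which the expression above is a convexified version of it.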

