Bayesian Networks: Maximum Likelihood Estimation and Tree Structure Learning


Huizhen Yu, janey.yu@cs.helsinki.fi
Dept. Computer Science, Univ. of Helsinki
Probabilistic Models, Spring 2010

Notices: I corrected a number of errors/typos in the slides of Lec. 1; this affected in particular slides 5, 6, 32, 34, 36. There may be other corrections after today's lecture. Please check the online version of the slides; I will put an update sign beside the link. Please do not hesitate to contact me if you have any questions before the exam.

Our Model and Data

Let $X = \{X_v, v \in V\}$ be a collection of discrete random variables, and let $G$ be a DAG on $V$. Our model for $X$ is the set of all distributions $P(X)$ that factorize recursively according to $G$. The true, unknown distribution of $X$ is $Q^*$, not necessarily in our model.

Maximum likelihood (ML) estimation:

Data: $\{x^1, x^2, \ldots, x^n\}$, $n$ observations independently generated according to $Q^*$ (i.e., a random sample of size $n$).

The empirical distribution $\hat Q(\cdot)$: $\hat Q(X = x)$ is the observed frequency of the configuration $x$ in the data.

$P^{ML}$: the distribution in our model that maximizes the likelihood function based on the data,
$$L(P) = \prod_{i=1}^n P(X = x^i) = \prod_{i=1}^n \prod_{v \in V} p\big(x^i_v \mid x^i_{pa(v)}\big).$$
(For simplicity, we do not use the $\theta$ notation for parameters here.)
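To make these objects concrete, here is a minimal Python sketch (my own illustration, not part of the original slides) that computes the empirical distribution $\hat Q$ from a data matrix and evaluates $\ln L(P)$ for a model given by a parent map and conditional probability tables; the function names, data layout, and toy numbers are all assumptions.

```python
from collections import Counter
import math

def empirical_distribution(data):
    """Q-hat: the observed frequency of each full configuration x."""
    n = len(data)
    return {x: c / n for x, c in Counter(map(tuple, data)).items()}

def log_likelihood(data, parents, cpts):
    """ln L(P) = sum_i sum_v ln p(x_v^i | x_pa(v)^i), with the DAG given
    as parents[v] = tuple of parent indices and cpts[v][pa_cfg][x_v]."""
    ll = 0.0
    for x in data:
        for v, pa in parents.items():
            pa_cfg = tuple(x[u] for u in pa)
            ll += math.log(cpts[v][pa_cfg][x[v]])
    return ll

# Toy example: V = {0, 1}, edge 0 -> 1, binary variables.
data = [(0, 0), (0, 1), (1, 1), (1, 1)]
parents = {0: (), 1: (0,)}
cpts = {0: {(): {0: 0.5, 1: 0.5}},
        1: {(0,): {0: 0.5, 1: 0.5}, (1,): {0: 0.2, 1: 0.8}}}
print(empirical_distribution(data))  # {(0,0): 0.25, (0,1): 0.25, (1,1): 0.5}
print(log_likelihood(data, parents, cpts))
```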

Relation between the ML Estimate, the Empirical and the True Distributions

The relation between $P^{ML}$, $\hat Q$ and $Q^*$:

[Figure: the model $\{P(X) : P \text{ factorizes recursively according to } G\}$, with $P^{ML}$ inside it as the projection of the empirical distribution $\hat Q$; the true distribution $Q^*$ (unknown) may lie outside the model.]

Among all $P$ in our model, $P^{ML}$ is the closest distribution to $\hat Q$ in terms of the KL-divergence $KL(\hat q, p)$. ($\hat q$ is the PMF of $\hat Q$.) (See discussions in Lec. 3 and Problem 3 of Exercise 2.)

Expression of the ML Estimate

The ML estimate $P^{ML}$ is the distribution given by
$$p^{ML}(x) = \prod_{v \in V} p^{ML}\big(x_v \mid x_{pa(v)}\big),$$
where the component conditional distributions are defined by
$$p^{ML}\big(x_v \mid x_{pa(v)}\big) = \hat Q\big(X_v = x_v \mid X_{pa(v)} = x_{pa(v)}\big) = \frac{n(x_v, x_{pa(v)})}{n(x_{pa(v)})}, \qquad (1)$$
and in the last expression, $n(x_{pa(v)})$ is the count of the configuration $x_{pa(v)}$ in the data, and $n(x_v, x_{pa(v)})$ is the count of the configuration $(x_v, x_{pa(v)})$ in the data.

The maximized log likelihood can be expressed as
$$\ell(P^{ML}) = n\, E_{\hat Q}\big[\ln p^{ML}(X)\big] = n \sum_{v \in V} E_{\hat Q}\big[\ln \hat q\big(X_v \mid X_{pa(v)}\big)\big], \qquad (2)$$
where $E_{\hat Q}$ denotes expectation with respect to the empirical distribution $\hat Q$. (Eqs. (1)-(2) can be derived using the information inequality; see the derivation at the end of these slides for details.)
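Eq. (1) estimates each conditional distribution by a ratio of counts. A minimal sketch of that computation, reusing the `data`/`parents` layout from the previous sketch (again my own illustration, not the author's code):

```python
from collections import defaultdict

def mle_cpts(data, parents):
    """ML conditional probability tables via Eq. (1):
    p_ML(x_v | x_pa(v)) = n(x_v, x_pa(v)) / n(x_pa(v))."""
    pair_counts = {v: defaultdict(int) for v in parents}  # n(x_v, x_pa(v))
    pa_counts = {v: defaultdict(int) for v in parents}    # n(x_pa(v))
    for x in data:
        for v, pa in parents.items():
            pa_cfg = tuple(x[u] for u in pa)
            pair_counts[v][(x[v], pa_cfg)] += 1
            pa_counts[v][pa_cfg] += 1
    cpts = {v: {} for v in parents}
    for v in parents:
        for (xv, pa_cfg), c in pair_counts[v].items():
            cpts[v].setdefault(pa_cfg, {})[xv] = c / pa_counts[v][pa_cfg]
    return cpts

data = [(0, 0), (0, 1), (1, 1), (1, 1)]
parents = {0: (), 1: (0,)}
print(mle_cpts(data, parents))
# e.g. cpts[1][(1,)] == {1: 1.0}: both samples with x_0 = 1 have x_1 = 1
```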

Learning a Rooted Tree

Problem: Given the data as described earlier, find a rooted tree $G$ which maximizes the profile log likelihood $\ell_p(G)$:
$$\ell_p(G) \stackrel{\text{def}}{=} \ell\big(G, P_G^{ML}\big) = \max_{P \in \mathcal P(G)} \ell(G, P).$$
Here $\mathcal P(G)$ is the set of all distributions that factorize recursively according to $G$.

Such a tree is also called a Chow-Liu tree, and can be found by the Chow-Liu tree algorithm (Chow and Liu, 1968). The algorithm can be generalized to solve similar types of problems (we will show one).

Recall Mutual Information and Conditional Mutual Information

Let $X, Y, Z$ be discrete random variables with joint distribution $P$. The mutual information between $X$ and $Y$ is defined as
$$I(X; Y) = E\left[\ln \frac{p(X, Y)}{p(X)\,p(Y)}\right],$$
and equivalently,
$$I(X; Y) = \sum_{x, y} p(x, y) \ln\left(\frac{p(x, y)}{p(x)\,p(y)}\right).$$
The conditional mutual information between $X$ and $Y$ given $Z$ is defined as
$$I(X; Y \mid Z) = E\left[\ln \frac{p(X, Y \mid Z)}{p(X \mid Z)\,p(Y \mid Z)}\right],$$
and equivalently,
$$I(X; Y \mid Z) = \sum_z p(z) \sum_{x, y} p(x, y \mid z) \ln\left(\frac{p(x, y \mid z)}{p(x \mid z)\,p(y \mid z)}\right).$$
By the information inequality, $I(X; Y) \ge 0$, and $I(X; Y) = 0$ iff $X \perp Y$; likewise $I(X; Y \mid Z) \ge 0$, and $I(X; Y \mid Z) = 0$ iff $X \perp Y \mid Z$.

Deriving the Algorithm

We start with the profile log likelihood: by Eq. (2),
$$\ell_p(G) = n \sum_{v \in V} E_{\hat Q}\big[\ln \hat q\big(X_v \mid X_{pa_G(v)}\big)\big].$$
Here $pa_G(v)$ is the parent of $v$ in the rooted tree $G$. Rewrite $\ell_p(G)$ in terms of the mutual information $I_{\hat Q}\big(X_v; X_{pa_G(v)}\big)$, $v \in V$ (w.r.t. the distribution $\hat Q$):
$$E_{\hat Q}\big[\ln \hat q\big(X_v \mid X_{pa_G(v)}\big)\big] = E_{\hat Q}\left[\ln\left(\frac{\hat q\big(X_v \mid X_{pa_G(v)}\big)\, \hat q\big(X_{pa_G(v)}\big)\, \hat q(X_v)}{\hat q(X_v)\, \hat q\big(X_{pa_G(v)}\big)}\right)\right] = E_{\hat Q}\left[\ln \frac{\hat q\big(X_v, X_{pa_G(v)}\big)}{\hat q(X_v)\, \hat q\big(X_{pa_G(v)}\big)}\right] + E_{\hat Q}\big[\ln \hat q(X_v)\big] = I_{\hat Q}\big(X_v; X_{pa_G(v)}\big) + E_{\hat Q}\big[\ln \hat q(X_v)\big];$$
hence
$$\tfrac{1}{n}\,\ell_p(G) = \sum_{v \in V} I_{\hat Q}\big(X_v; X_{pa_G(v)}\big) + \sum_{v \in V} E_{\hat Q}\big[\ln \hat q(X_v)\big]. \qquad (3)$$

In the last equation, the second term does not depend on $G$ and therefore can be left out when maximizing $\ell_p(G)$ over $G$; moreover, the mutual information is symmetric: $I_{\hat Q}\big(X_v; X_{pa_G(v)}\big) = I_{\hat Q}\big(X_{pa_G(v)}; X_v\big)$. Therefore,
$$\max_{G \in \{\text{rooted trees}\}} \ell_p(G) \;\Longleftrightarrow\; \max_{G \in \{\text{undirected trees}\}} \sum_{(v, u) \in G} I_{\hat Q}(X_v; X_u), \qquad (4)$$
where the summation is over all edges of $G$.

The Chow-Liu Tree Algorithm

(1) Compute all pairwise mutual information
$$I_{\hat Q}(X_v; X_u) = E_{\hat Q}\left[\ln \frac{\hat q(X_v, X_u)}{\hat q(X_v)\,\hat q(X_u)}\right], \quad v, u \in V.$$
(2) Find a maximum spanning tree of the undirected, fully connected graph on $V$ with edge weight $I_{\hat Q}(X_v; X_u)$ between nodes $v$ and $u$. This can be done by Kruskal's algorithm: repeatedly select an edge with maximum weight that does not create a cycle.
(3) Make any node of the spanning tree the root and direct the edges away from it.

The result is a rooted tree $G$ that maximizes $\ell_p(G)$.
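Steps (1)-(3) translate almost line by line into code. Below is a compact sketch of the Chow-Liu procedure, assuming integer-coded data rows: empirical pairwise mutual information, Kruskal's maximum spanning tree with a union-find, and rooting by breadth-first search. The function names and toy data are my own, not from the slides.

```python
import math
from collections import Counter, deque
from itertools import combinations

def mutual_information(data, v, u):
    """I_Qhat(X_v; X_u) under the empirical distribution."""
    n = len(data)
    pv = Counter(x[v] for x in data)
    pu = Counter(x[u] for x in data)
    pvu = Counter((x[v], x[u]) for x in data)
    return sum((c / n) * math.log((c / n) / ((pv[a] / n) * (pu[b] / n)))
               for (a, b), c in pvu.items())

def chow_liu(data, num_vars):
    """Steps (1)-(3): MI edge weights, maximum spanning tree (Kruskal
    with union-find), then direct edges away from an arbitrary root."""
    edges = sorted(((mutual_information(data, v, u), v, u)
                    for v, u in combinations(range(num_vars), 2)),
                   reverse=True)                       # step (1)
    parent = list(range(num_vars))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]              # path halving
            a = parent[a]
        return a
    tree = []
    for w, v, u in edges:                              # step (2)
        rv, ru = find(v), find(u)
        if rv != ru:
            parent[rv] = ru
            tree.append((v, u))
    adj = {v: [] for v in range(num_vars)}
    for v, u in tree:
        adj[v].append(u)
        adj[u].append(v)
    directed, seen, queue = [], {0}, deque([0])        # step (3): root at 0
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in seen:
                seen.add(u)
                directed.append((v, u))
                queue.append(u)
    return directed  # (parent, child) edges of the rooted Chow-Liu tree

data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(chow_liu(data, 3))  # X_0 and X_1 are perfectly correlated here
```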

Generalization to Learning Tree Augmented Naive Bayes

A naive Bayes classifier with class variable $C$ and feature variables $F_1, F_2, \ldots, F_m$:

[Figure: the DAG with $C$ as the single parent of each of $F_1, F_2, \ldots, F_m$.]

Tree augmented naive Bayes classifiers (TAN):

Naive Bayes neglects the dependence between feature variables. This can be troublesome for rare classes that have characteristic combinations of features.

In a TAN, each feature variable has at most one other feature variable as its parent besides the class variable. In other words, the subgraph induced by the feature variables is a rooted tree or forest.

Consider the problem of learning a TAN $G$ with maximum likelihood.

Learning TAN

Notation:
$X_v, v \in V$: the feature variables.
$\widehat G$: the subgraph of $G$ induced by the feature variables $X_v, v \in V$.
$pa_{\widehat G}(v)$: the parent of $v$ in $\widehat G$, i.e., the parent of $v$ in $G$ besides $C$.
Note that a TAN $G$ is uniquely determined by its associated $\widehat G$.

Apply the Chow-Liu tree algorithm to learning TAN: replace all pairwise mutual information by the conditional mutual information between all pairs of feature variables given the class variable:
$$I_{\hat Q}(X_v; X_u \mid C) = E_{\hat Q}\left[\ln \frac{\hat q(X_v, X_u \mid C)}{\hat q(X_v \mid C)\,\hat q(X_u \mid C)}\right], \quad v, u \in V.$$
The output of the algorithm is the subgraph $\widehat G$ whose associated TAN $G$ maximizes the profile log likelihood among all TANs.

Deriving the Algorithm for TAN

Similarly to learning a rooted tree, we start with the profile log likelihood: by Eq. (2),
$$\ell_p(\widehat G) = \ell_p(G) = n \sum_{v \in V} E_{\hat Q}\big[\ln \hat q\big(X_v \mid X_{pa_{\widehat G}(v)}, C\big)\big] + n\, E_{\hat Q}\big[\ln \hat q(C)\big]. \qquad (5)$$
We rewrite $\ell_p(\widehat G)$ in terms of the conditional mutual information $I_{\hat Q}\big(X_v; X_{pa_{\widehat G}(v)} \mid C\big)$ between $X_v$ and $X_{pa_{\widehat G}(v)}$ given $C$ for $v \in V$:
$$E_{\hat Q}\big[\ln \hat q\big(X_v \mid X_{pa_{\widehat G}(v)}, C\big)\big] = E_{\hat Q}\left[\ln\left(\frac{\hat q\big(X_v \mid X_{pa_{\widehat G}(v)}, C\big)\, \hat q\big(X_{pa_{\widehat G}(v)} \mid C\big)\, \hat q(X_v \mid C)}{\hat q(X_v \mid C)\, \hat q\big(X_{pa_{\widehat G}(v)} \mid C\big)}\right)\right] = E_{\hat Q}\left[\ln \frac{\hat q\big(X_v, X_{pa_{\widehat G}(v)} \mid C\big)}{\hat q(X_v \mid C)\, \hat q\big(X_{pa_{\widehat G}(v)} \mid C\big)}\right] + E_{\hat Q}\big[\ln \hat q(X_v \mid C)\big] = I_{\hat Q}\big(X_v; X_{pa_{\widehat G}(v)} \mid C\big) + E_{\hat Q}\big[\ln \hat q(X_v \mid C)\big];$$
hence
$$\tfrac{1}{n}\,\ell_p(\widehat G) = \sum_{v \in V} I_{\hat Q}\big(X_v; X_{pa_{\widehat G}(v)} \mid C\big) + \sum_{v \in V} E_{\hat Q}\big[\ln \hat q(X_v \mid C)\big] + E_{\hat Q}\big[\ln \hat q(C)\big]. \qquad (6)$$

In the last equation, the second and third terms do not depend on $\widehat G$ and therefore can be left out when maximizing $\ell_p(\widehat G)$ over $\widehat G$; the conditional mutual information is symmetric: $I_{\hat Q}\big(X_v; X_{pa_{\widehat G}(v)} \mid C\big) = I_{\hat Q}\big(X_{pa_{\widehat G}(v)}; X_v \mid C\big)$; and if $\widehat G$ is a forest, adding edges to make it a tree will not decrease $\ell_p(\widehat G)$. Therefore,
$$\max_{\widehat G \in \{\text{rooted trees}\}} \ell_p(\widehat G) \;\Longleftrightarrow\; \max_{\widehat G \in \{\text{undirected trees}\}} \sum_{(v, u) \in \widehat G} I_{\hat Q}(X_v; X_u \mid C), \qquad (7)$$
where the summation is over all edges of $\widehat G$.

This verifies the earlier claim that we can apply the Chow-Liu tree algorithm with $I_{\hat Q}(X_v; X_u \mid C)$ replacing $I_{\hat Q}(X_v; X_u)$ for all $v, u \in V$ to obtain the desired $\widehat G$.
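Relative to the Chow-Liu sketch above, the only change is the edge weight: condition everything on the class. A sketch of the conditional mutual information $I_{\hat Q}(X_v; X_u \mid C)$, with hypothetical names (`labels` holds the class column, one entry per data row):

```python
import math
from collections import Counter

def conditional_mutual_information(data, labels, v, u):
    """I_Qhat(X_v; X_u | C): per-class MI averaged with the
    empirical class probabilities q(c) = n_c / n."""
    n = len(labels)
    cmi = 0.0
    for c, nc in Counter(labels).items():
        rows = [x for x, y in zip(data, labels) if y == c]
        pv = Counter(x[v] for x in rows)
        pu = Counter(x[u] for x in rows)
        pvu = Counter((x[v], x[u]) for x in rows)
        mi_c = sum((k / nc) * math.log((k / nc) / ((pv[a] / nc) * (pu[b] / nc)))
                   for (a, b), k in pvu.items())
        cmi += (nc / n) * mi_c
    return cmi
```

Running the same Kruskal and rooting steps on these weights over the feature variables, and then adding $C$ as a parent of every feature, yields the TAN corresponding to $\widehat G$.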

Discussion

Rooted trees and TANs are perfect DAGs: $G^m = G$, i.e., moralization adds no edges. So the models are equivalent to those associated with the corresponding undirected graphs, and it is not surprising that the structure learning algorithms we derived can disregard edge directions.

For learning a singly connected network (under certain assumptions) with the Chow-Liu tree algorithm, see Pearl's 1988 book.

Further Readings

For TAN:
1. Finn V. Jensen and Thomas D. Nielsen. Bayesian Networks and Decision Graphs. Springer, 2007. Chap. 8.

An old review article discussing the ideas and steps involved in developing a probabilistic expert system, using the example CHILD network:
2. David J. Spiegelhalter et al. Bayesian analysis in expert systems. Statistical Science, Vol. 8, No. 3, pp. 219-283, 1993.
(It includes Bayesian inference, which we did not talk about.) You may also find the related materials in the book by Cowell et al. 2007.

A recent book by Koller and Friedman, Probabilistic Graphical Models, 2009, has many materials on both approximate and exact inference algorithms.

Derivation of Eqs. (1)-(2)

The likelihood and log likelihood functions are
$$L(P) = \prod_{i=1}^n \prod_{v \in V} p\big(x^i_v \mid x^i_{pa(v)}\big), \qquad \ell(P) = \sum_{i=1}^n \sum_{v \in V} \ln p\big(x^i_v \mid x^i_{pa(v)}\big).$$
The variables in the maximization of $\ell(P)$ are the conditional distributions $p(x_v \mid x_{pa(v)})$ of $X_v$ for each configuration $x_{pa(v)}$ of $v$'s parents, for all $v \in V$. We next express $\ell(P)$ in terms of these variables.

By exchanging the order of summations in the expression of $\ell(P)$,
$$\ell(P) = \sum_{i=1}^n \sum_{v \in V} \ln p\big(x^i_v \mid x^i_{pa(v)}\big) = \sum_{v \in V} \sum_{x_{pa(v)}} \sum_{x_v} n(x_v, x_{pa(v)}) \ln p\big(x_v \mid x_{pa(v)}\big),$$
where $n(x_v, x_{pa(v)})$ is the count of the configuration $(x_v, x_{pa(v)})$ in the data.

Under our model, there are no constraints between the component conditional distributions we can choose. So the maximization problem $\max_P \ell(P)$ decomposes into separate maximization problems, one for each $v$ and each parent configuration $x_{pa(v)}$:
$$\max_{p(\cdot \mid x_{pa(v)})} \sum_{x_v} n(x_v, x_{pa(v)}) \ln p\big(x_v \mid x_{pa(v)}\big). \qquad (8)$$
($x_{pa(v)}$ is fixed in the above subproblem.)

The subproblem (8) is equivalent to
$$\max_{p(\cdot \mid x_{pa(v)})} \sum_{x_v} \frac{n(x_v, x_{pa(v)})}{n(x_{pa(v)})} \ln p\big(x_v \mid x_{pa(v)}\big), \qquad (9)$$
where $n(x_{pa(v)}) = \sum_{x_v} n(x_v, x_{pa(v)})$, and it is the count of the parent configuration $x_{pa(v)}$ in the data.

By the information inequality (see Lec. 3), the maximum of (9) is attained at
$$p\big(x_v \mid x_{pa(v)}\big) = \frac{n(x_v, x_{pa(v)})}{n(x_{pa(v)})}, \quad \forall x_v,$$
which is the ML estimate $p^{ML}(\cdot \mid x_{pa(v)})$ given in Eq. (1).

The maximized log likelihood thus equals
$$\ell(P^{ML}) = \sum_{v \in V} \sum_{x_{pa(v)}} \sum_{x_v} n(x_v, x_{pa(v)}) \ln \frac{n(x_v, x_{pa(v)})}{n(x_{pa(v)})} = n \sum_{v \in V} \sum_{x_{pa(v)}} \sum_{x_v} \frac{n(x_v, x_{pa(v)})}{n} \ln \frac{n(x_v, x_{pa(v)})}{n(x_{pa(v)})} = n \sum_{v \in V} \sum_{x_{pa(v)}} \sum_{x_v} \hat q(x_v, x_{pa(v)}) \ln \hat q\big(x_v \mid x_{pa(v)}\big) = n \sum_{v \in V} E_{\hat Q}\big[\ln \hat q\big(X_v \mid X_{pa(v)}\big)\big].$$
($\hat q$ is the PMF of $\hat Q$.) This verifies Eq. (2).
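The information-inequality step, that (9) is maximized at the empirical proportions, is easy to spot-check numerically. A small sketch (my own illustration, not from the slides) compares the objective of subproblem (8) at $p = n(x_v, x_{pa(v)})/n(x_{pa(v)})$ against random points on the probability simplex:

```python
import math
import random

def objective(counts, p):
    """sum_x n(x) ln p(x), the inner objective of subproblem (8)."""
    return sum(n * math.log(q) for n, q in zip(counts, p))

random.seed(0)
counts = [5, 3, 2]                  # n(x_v, x_pa(v)) for one fixed x_pa(v)
total = sum(counts)
p_ml = [n / total for n in counts]  # the claimed maximizer, Eq. (1)
best = objective(counts, p_ml)
for _ in range(10000):              # random distributions on the simplex
    w = [random.expovariate(1.0) for _ in counts]
    s = sum(w)
    assert objective(counts, [x / s for x in w]) <= best + 1e-12
print("empirical proportions maximize the objective:", best)
```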