Y. Xiang, Learning Bayesian Networks


Learning Bayesian Networks

Objectives
Acquisition of BNs
Technical context of BN learning
Criterion of sound structure learning
BN structure learning in 2 steps
BN CPT estimation
Reference: R.E. Neapolitan, Learning Bayesian Networks (2004)

Acquisition of BNs
Elicitation-based acquisition
Determine the set V of env variables and their domains.
Determine the graphical dependence structure.
Determine CPTs, one for each variable.
Time consuming for domain experts & agent developer.
Learning-based acquisition
Input: a training set R of examples in an application env
Output: a BN for inference about the env
Unsupervised learning
The focus of this unit.

Task Decomposition
Denote a BN by S = (V, G, Pb), where V is a set of environment variables, G = (V, E) is a DAG, and Pb is a set of CPTs, Pb = {P(v | π(v)) | v ∈ V}.
Task: Learning a BN from a training set R
1) Identification of V
2) Definition of variable domains
3) Construction of the dependency structure G (referred to as structure learning)
4) Estimation of CPTs (referred to as parameter learning)

Review of BN Semantics
1. A variable v in a BN is conditionally independent of its non-descendants given its parents π(v).
2. Variables x and y are dependent given their common descendant(s).
Ex Burglar-quake: burglary (b) → alarm (a) ← quake (q), alarm (a) → callByJohn (j), alarm (a) → callByMary (m)
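To make the S = (V, G, Pb) decomposition and the burglar-quake example concrete, here is a minimal Python sketch of one way such a network could be stored. The variable names follow the slides; the boolean domains and all CPT numbers are illustrative assumptions, not values from the slides.

    # V: environment variables and their domains (boolean domains assumed)
    V = {"burglary": (True, False), "quake": (True, False),
         "alarm": (True, False), "callByJohn": (True, False),
         "callByMary": (True, False)}

    # G: the DAG, stored as the parent set pi(v) of each variable v
    G = {"burglary": (), "quake": (),
         "alarm": ("burglary", "quake"),
         "callByJohn": ("alarm",), "callByMary": ("alarm",)}

    # Pb: one CPT P(v | pi(v)) per variable, keyed by (value, parent assignment);
    # the probabilities below are placeholders for illustration only
    Pb = {
        "burglary": {(True, ()): 0.01, (False, ()): 0.99},
        "alarm": {(True, (True, True)): 0.95, (False, (True, True)): 0.05,
                  (True, (True, False)): 0.94, (False, (True, False)): 0.06,
                  (True, (False, True)): 0.29, (False, (False, True)): 0.71,
                  (True, (False, False)): 0.001, (False, (False, False)): 0.999},
        # ... CPTs for quake, callByJohn and callByMary would follow the same shape
    }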

Technical Context of BN Learning
The environment is characterized by an unknown full joint P*(V).
An unknown BN S* = (V, G*, Pb*) encodes the same conditional independencies as P*(V) by G*. S* perfectly encodes P*(V).
A set R of training examples is obtained from the environment (i.e., from P*(V)) by independent trials.
Task: From R, learn a BN S = (V, G, Pb) that models P*(V) as accurately and concisely as possible.

Structure Markov Equivalence
Suppose a full joint P*(V) over env V can be perfectly encoded by a BN S = (V, G, Pb). Is the DAG structure G unique?
Ex P*(V) can be perfectly encoded by a BN with G.
G: child_age → foot_size → shoe_size
Two DAGs are Markov equivalent if they entail the same conditional independencies.
Ex G': child_age ← foot_size ← shoe_size
Are G and G' Markov equivalent?

Criterion of Sound Structure Learning
Let S = (V, G, Pb) and S' = (V, G', Pb') be BNs s.t.
a) G and G' are Markov equivalent, and
b) Pb and Pb' are derived from the same environment.
Then S and S' model the same full joint over V.
Ex Env V = {a, b} with true full joint P*(V):
  a b   P*(a,b)
  t t   0.05
  t f   0.35
  f t   0.50
  f f   0.10
G1: a → b; G2: a ← b; G3 is disconnected.
If the full joint P*(V) can be encoded by BN S = (V, G, Pb) but S' = (V, G', Pb') is learned, then structure learning is sound as long as G and G' are Markov equivalent.

Learning DAG Skeleton
Let G = (V, E) be a DAG. The undirected graph G' = (V, E'), where E' is obtained by removing the direction of each link in E, is the skeleton of G.
Structure learning can be performed in two steps.
1) Learn a skeleton G' = (V, E').
2) Direct links in G' to obtain the DAG G = (V, E).
In skeleton learning, how do we know whether a pair of variables should be adjacent?
[Theorem] Variables x, y ∈ V are adjacent in G iff there exists no Z ⊆ V\{x,y} s.t. I(x, Z, y) holds.
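The slides define Markov equivalence through the entailed conditional independencies; a common operational test, not stated on the slides, is that two DAGs are Markov equivalent iff they have the same skeleton and the same uncoupled head-to-head meetings. A small Python sketch of that test, applied to the two child_age / foot_size / shoe_size chains:

    from itertools import combinations

    def skeleton(dag):
        """Undirected edge set of a DAG given as {child: tuple of parents}."""
        return {frozenset((p, c)) for c, ps in dag.items() for p in ps}

    def head_to_head_meetings(dag):
        """Uncoupled meetings x -> z <- y where x and y are not adjacent."""
        skel = skeleton(dag)
        return {(frozenset((x, y)), z)
                for z, ps in dag.items()
                for x, y in combinations(ps, 2)
                if frozenset((x, y)) not in skel}

    def markov_equivalent(g1, g2):
        """Same skeleton and same uncoupled head-to-head meetings."""
        return (skeleton(g1) == skeleton(g2)
                and head_to_head_meetings(g1) == head_to_head_meetings(g2))

    G      = {"child_age": (), "foot_size": ("child_age",), "shoe_size": ("foot_size",)}
    Gprime = {"shoe_size": (), "foot_size": ("shoe_size",), "child_age": ("foot_size",)}
    print(markov_equivalent(G, Gprime))   # True

Both chains have the same skeleton and no head-to-head meetings, so the test returns True: reversing a chain does not change the independencies it entails.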

Entropy
In decision tree learning, the amount of information contained in the value of a variable is measured by entropy.
Let X be a set of variables with JPD P(X). The entropy of X is
  H(X) = - Σ_x P(x) log2 P(x).
Interpretation
1) H(X) is the measure of uncertainty associated with X.
2) H(X) is the amount of info in an assignment x of X.

How to Determine I(x,Z,y)?
[Theorem] For variables x, y ∈ V and Z ⊆ V\{x,y}, we have I(x,Z,y) iff H(x,Z,y) = H(x,Z) + H(Z,y) - H(Z).
Algorithm: Test I(x,Z,y) using training set R
  estimate P(x,Z,y) from R;
  marginalize P(x,Z,y) to obtain P(x,Z), P(Z,y) and P(Z);
  compute H(x,Z,y), H(x,Z), H(Z,y), and H(Z);
  compute diff = |H(x,Z,y) - (H(x,Z) + H(Z,y) - H(Z))|;
  if diff < threshold, return I(x,Z,y); else return ¬I(x,Z,y);

How to Choose Z to Test I(x,Z,y)?
Given Z ⊆ V\{x,y} with ¬I(x, Z, y), it is possible that I(x, Z⁻, y) for some Z⁻ ⊂ Z, or I(x, Z⁺, y) for some Z⁺ ⊃ Z.
It appears that, to determine I(x,Z,y), all subsets of V\{x,y} must be tested.
[Theorem] In a BN, let x, y ∈ V and Z ⊆ V\{x,y} with I(x,Z,y). Then either I(x, π(x), y) or I(x, π(y), y) holds.
Idea a) Start with the complete graph and delete <x,y> if I(x,Z,y) for some Z.
Idea b) To find Z s.t. I(x,Z,y), limit the search to Z ⊆ Adj(x) and Z ⊆ Adj(y).
Idea c) Test smaller subsets Z first.

Structure Learning Algorithm
learnBnDag(V, R) {
  G' = complete undirected graph over V;
  for each link <x,y> in G', associate <x,y> with set Sxy = null;
  G' = getSkeleton(G', R);
  G = directLink(G');
  return G;
}
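A minimal Python sketch of the entropy-based independence test above, assuming the training set R is represented as a list of dict-valued examples; the threshold value is an arbitrary placeholder that would have to be tuned in practice.

    import math
    from collections import Counter

    def entropy(records, variables):
        """Entropy of the joint over the given variables, estimated from records."""
        n = len(records)
        counts = Counter(tuple(r[v] for v in variables) for r in records)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def independent(records, x, Z, y, threshold=0.01):
        """Accept I(x, Z, y) when H(x,Z,y) is close to H(x,Z) + H(Z,y) - H(Z)."""
        Z = list(Z)
        diff = abs(entropy(records, [x] + Z + [y])
                   - (entropy(records, [x] + Z)
                      + entropy(records, Z + [y])
                      - entropy(records, Z)))
        return diff < threshold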

geskeleon(g, R) { k = 0; done = false; while done = false, do { done = rue; for each node x in G, ge Adj(x); if Adj(x) k, coninue; done = false; for each node y in Adj(x), for each subse Z of Adj(x)\{y wih Z =k, if I(x,Z,y), hen {Sxy = Z; rm <x,y> in G ; break; k++; 13 Types of Chains in Srucure Undireced chain A chain x-z-y where x and y are no adjacen is called an uncoupled meeing. Direced chain 1. A chain x z y is a head-o-ail meeing a z. 2. A chain x z y is a ail-o-ail meeing a z. 3. A chain x z y is a head-o-head meeing a z. When isolaed, are hese direced meeings Markov equivalen? Y. Xiang, Learning Bayesian Neworks 14 Direc Links in Skeleon Idea: Direc head-o-head meeings firs, and use DAG consrain o direc remaining links. [Theorem] Le S be a BN, x-z-y be a uncoupled meeing in is skeleon, I(x,W,y) for W V\{x,y, and z W. Then x-z-y is a head-o-head meeing in S. Operaional implicaion 1) If x-z-y is an uncoupled meeing, hen Sxy null. 2) If z Sxy, hen x-z-y mus be x z y. Y. Xiang, Learning Bayesian Neworks 15 direclink(g) { for each uncoupled meeing x-z-y, if (z Sxy) direc x-z-y as x z y; // rule 1 done = false; while done = false, do done = rue; for each uncoupled meeing x z-y, direc z-y as z y; done = false; // rule 2 for each x-y s.. here is a direced pah from x o y, direc x-y as x y; done = false; // rule 3 for each uncoupled meeing x-z-y s.. x w, z-w & y w, direc z-w as z w; done = false; // rule 4 direc remaining links randomly s.. no direced cycle or head-o-head meeing is creaed; 16 4

Direct Link by DAG Constraint
[Rule 2] For each uncoupled meeting x → z - y, direct it as x → z → y. Why?
[Rule 3] For each x-y s.t. there is a directed path from x to y, direct x-y as x → y. Why?
[Rule 4] For each uncoupled meeting x-z-y s.t. x → w, z-w and y → w, direct z-w as z → w. Why?

CPT Estimation
To get P(x | π(x)), for each x = u and each assignment w of π(x), we need to estimate P(u | w).
Maximum likelihood estimation
1. Gather the set S of examples in R that satisfy π(x) = w.
2. N = |S|.
3. M = number of examples in S that satisfy x = u.
4. Estimate P(u | w) based on N and M.
To determine P(X), estimate P(x) for each assignment x of X.
A. Gather the set T of examples in R that satisfy X = x.
B. Estimate P(x) as |T| / |R|.

Maximum Likelihood Estimation
1. Denote the unknown P(u | w) as parameter θ ∈ [0, 1].
2. Denote the examples in S as e1, e2, ..., eN.
3. Derive the likelihood P(S | θ) of observing S.
4. Determine the parameter θ that maximizes P(S | θ).
   a) Maximizing P(S | θ) is equivalent to maximizing the log likelihood ln P(S | θ).
   b) Differentiate ln P(S | θ).
   c) Set the derivative to 0.
   d) Solve for θ.

Remarks
BN learning overcomes the bottleneck of knowledge acquisition by elicitation, and allows BN inference to be more widely applied.
More advanced topics:
1) Alternative BN structure learning methods
2) Alternative BN CPT estimation methods
3) Integrating BN learning with elicitation
4) Learning BNs with continuous variables
5) Learning BNs in dynamic envs
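Putting the CPT-estimation and maximum-likelihood slides together for a boolean x: the likelihood of S is P(S | θ) = θ^M (1-θ)^(N-M), so setting the derivative of ln P(S | θ) = M ln θ + (N-M) ln(1-θ) to zero gives θ = M/N, the relative frequency. A toy Python sketch under that boolean assumption, with made-up data and variable names reused from the burglar-quake example:

    from fractions import Fraction

    def estimate_cpt_entry(records, x, u, parents, w):
        """MLE of P(x=u | pi(x)=w) by relative frequency.
        For boolean x, maximizing theta^M (1-theta)^(N-M) gives theta = M / N."""
        S = [r for r in records if all(r[p] == val for p, val in zip(parents, w))]
        N = len(S)
        M = sum(1 for r in S if r[x] == u)
        return Fraction(M, N) if N else None   # undefined when pi(x)=w never occurs in R

    # Example: estimate P(alarm=True | burglary=True, quake=False) from toy data
    R = [{"burglary": True,  "quake": False, "alarm": True},
         {"burglary": True,  "quake": False, "alarm": True},
         {"burglary": True,  "quake": False, "alarm": False},
         {"burglary": False, "quake": False, "alarm": False}]
    print(estimate_cpt_entry(R, "alarm", True, ("burglary", "quake"), (True, False)))  # 2/3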