12 Learning Bayesian Networks is NP-Complete¹

David Maxwell Chickering
Computer Science Department
University of California at Los Angeles
dmax@cs.ucla.edu

ABSTRACT
Algorithms for learning Bayesian networks from data have two components: a scoring metric and a search procedure. The scoring metric computes a score reflecting the goodness-of-fit of the structure to the data. The search procedure tries to identify network structures with high scores. Heckerman et al. (1995) introduce a Bayesian metric, called the BDe metric, that computes the relative posterior probability of a network structure given data. In this paper, we show that the search problem of identifying a Bayesian network, among those where each node has at most K parents, that has a relative posterior probability greater than a given constant is NP-complete when the BDe metric is used.

12.1 Introduction

Recently, many researchers have begun to investigate methods for learning Bayesian networks. Many of these approaches have the same basic components: a scoring metric and a search procedure. The scoring metric takes a database of observed cases D and a network structure $B_S$, and returns a score reflecting the goodness-of-fit of the data to the structure. A search procedure generates networks for evaluation by the scoring metric. These approaches use the two components to identify a network structure or set of structures that can be used to predict future events or infer causal relationships.

Cooper and Herskovits (1992), herein referred to as CH, derive a Bayesian metric, which we call the BD metric, from a set of reasonable assumptions about learning Bayesian networks containing only discrete variables. Heckerman et al. (1995), herein referred to as HGC, expand upon the work of CH to derive a new metric, which we call the BDe metric, which has the desirable property of likelihood equivalence. Likelihood equivalence says that the data cannot help to discriminate equivalent structures.

We now present the BD metric derived by CH. We use $B_S^h$ to denote the hypothesis that $B_S$ is an I-map of the distribution that generated the database.² Given a belief-network structure $B_S$, we use $\Pi_i$ to denote the parents of $x_i$. We use $r_i$ to denote the number of states of variable $x_i$, and $q_i = \prod_{x_l \in \Pi_i} r_l$ to denote the number of instances of $\Pi_i$. We use the integer j to index these instances. That is, we write $\Pi_i = j$ to denote the observation of the jth instance of the parents of $x_i$.

__________
¹ Learning from Data: AI and Statistics V. Edited by D. Fisher and H.-J. Lenz. © 1996 Springer-Verlag.
² There is an alternative causal interpretation of network structures not discussed here. See HGC for details.

Using reasonable assumptions, CH derive the following Bayesian scoring metric:

$$p(D, B_S^h \mid \xi) = p(B_S^h \mid \xi) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N'_{ij})}{\Gamma(N'_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(N'_{ijk} + N_{ijk})}{\Gamma(N'_{ijk})} \quad (12.1)$$

where $\xi$ is used to summarize all background information, $N_{ijk}$ is the number of cases in D where $x_i = k$ and $\Pi_i = j$, $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$, $N'_{ij} = \sum_{k=1}^{r_i} N'_{ijk}$, and $\Gamma(\cdot)$ is the Gamma function. The parameters $N'_{ijk}$ characterize our prior knowledge of the domain. We call this expression, or any expression proportional to it, the BD (Bayesian Dirichlet) metric.

HGC derive a special case of the BD metric that follows from likelihood equivalence. The resulting metric is the BD metric with the prior parameters constrained by the relation

$$N'_{ijk} = N' \cdot p(x_i = k, \Pi_i = j \mid B_{S_C}^h) \quad (12.2)$$

where $N'$ is the user's equivalent sample size for the domain, and $B_{S_C}^h$ is the hypothesis corresponding to the complete network structure. HGC note that the probabilities in Equation 12.2 may be computed from a prior network: a Bayesian network encoding the probability of the first case to be seen.

HGC discuss situations when a restricted version of the BDe metric should be used. They argue that in these cases, the metric should have the property of prior equivalence, which states that $p(B_{S1}^h \mid \xi) = p(B_{S2}^h \mid \xi)$ whenever $B_{S1}$ and $B_{S2}$ are equivalent.

HGC show that the search problem of finding the l network structures with the highest score among those structures where each node has at most one parent is polynomial whenever a decomposable metric is used. In this paper, we examine the general case of search, as described in the following decision problem:

K-LEARN
INSTANCE: Set of variables U, database $D = \{C_1, \ldots, C_m\}$, where each $C_i$ is an instance of all variables in U, scoring metric $M(D, B_S)$ and real value p.
QUESTION: Does there exist a network structure $B_S$ defined over the variables in U, where each node in $B_S$ has at most K parents, such that $M(D, B_S) \geq p$?

Höffgen (1993) shows that a similar problem for PAC learning is NP-complete. His results can be translated easily to show that K-LEARN is NP-complete for K > 1 when the BD metric is used. In this paper, we show that K-LEARN is NP-complete, even when we use the BDe metric and the constraint of prior equivalence.

12.2 K-LEARN is NP-Complete

In this section, we show that K-LEARN is NP-complete, even when we use the likelihood-equivalent BDe metric and the constraint of prior equivalence. The inputs to K-LEARN are (1) a set of variables U, (2) a database D, (3) the relative prior probabilities of all network structures where each node has no more than K parents, (4) parameters $N'_{ijk}$ and $N'_{ij}$ for some node-parent pairs and some values of i, j, and k, and (5) a value p. The input need only include enough parameters $N'_{ijk}$ and $N'_{ij}$ so that the metric score can be computed for all network structures where each node has no more than K parents.
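As a concrete illustration of how the score in Equation 12.1 would be evaluated from these parameters, here is a minimal Python sketch (not from the paper: the function name, the array layout, and the use of SciPy's gammaln to work in log space are our own assumptions):

```python
import numpy as np
from scipy.special import gammaln

def bd_log_score(counts, prior_counts, log_structure_prior=0.0):
    """Log of the BD metric (Equation 12.1) for one fixed network structure.

    counts[i] and prior_counts[i] are q_i-by-r_i arrays holding N_ijk and N'_ijk
    for node x_i: rows index parent instances j, columns index states k.
    """
    log_score = log_structure_prior                  # log p(B_S^h | xi)
    for N, N_prime in zip(counts, prior_counts):
        N_ij = N.sum(axis=1)                         # N_ij  = sum_k N_ijk
        N_prime_ij = N_prime.sum(axis=1)             # N'_ij = sum_k N'_ijk
        log_score += np.sum(gammaln(N_prime_ij) - gammaln(N_prime_ij + N_ij))
        log_score += np.sum(gammaln(N_prime + N) - gammaln(N_prime))
    return log_score
```

Under the BDe constraint of Equation 12.2, prior_counts would be filled with N' times the joint probabilities read off a prior network.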

Consequently, we do not need the $N'_{ijk}$ and $N'_{ij}$ parameters for nodes having more than K parents, nodes with parent configurations that always have zero prior probabilities, and values of i, j, and k for which there is no corresponding data in the database. Also, we emphasize that the parameters $N'_{ijk}$ must be derivable from some joint probability distribution using Equation 12.2.

Given these inputs, we see from Equation 12.1 that the BDe metric for any given network structure and database can be computed in polynomial time. Consequently, K-LEARN is in NP. In the following sections, we show that K-LEARN is NP-hard. In Section 12.2.1, we give a polynomial-time reduction from a known NP-complete problem to 2-LEARN. In Section 12.2.2, we show that 2-LEARN is NP-hard using the reduction from Section 12.2.1, and then show that K-LEARN for K > 2 is NP-hard by reducing 2-LEARN to K-LEARN. In this discussion, we omit conditioning on background information to simplify the notation.

12.2.1 Reduction from DBFAS to 2-LEARN

In this section we provide a polynomial-time reduction from a restricted version of the feedback arc set problem to 2-LEARN. The general feedback arc set problem is stated in Garey and Johnson (1979) as follows:

FEEDBACK ARC SET
INSTANCE: Directed graph G = (V, A), positive integer $K \leq |A|$.
QUESTION: Is there a subset $A' \subseteq A$ with $|A'| \leq K$ such that $A'$ contains at least one arc from every directed cycle in G?

It is shown in Gavril (1977) that FEEDBACK ARC SET remains NP-complete for directed graphs in which no vertex has a total in-degree and out-degree of more than three. We refer to this restricted version as DEGREE BOUNDED FEEDBACK ARC SET, or DBFAS for short.

Given an instance of DBFAS consisting of G = (V, A) and K, our task is to specify, in polynomial time, the five components of an instance of 2-LEARN. To simplify discussion, we assume that in the instance of DBFAS, no vertex has in-degree or out-degree of zero. If any such vertex exists, none of the incident edges can participate in a cycle, and we can remove the vertex from the graph without changing the answer to the decision problem. To help distinguish between the instance of DBFAS and the instance of 2-LEARN, we adopt the following convention. We use the term arc to refer to a directed edge in the instance of DBFAS, and the term edge to refer to a directed edge in the instance of 2-LEARN.

We construct the variable set U as follows. For each node $v_i$ in V, we include a corresponding binary variable $v_i$ in U. We use $\mathcal{V}$ to denote the subset of U that corresponds to V. For each arc $a_i \in A$, we include five additional binary variables $a_{i1}, \ldots, a_{i5}$ in U. We use $\mathcal{A}_i$ to denote the subset of U containing these five variables, and define $\mathcal{A}$ to be $\mathcal{A}_1 \cup \cdots \cup \mathcal{A}_{|A|}$. We include no other variables in U.

The database D consists of a single case $C_1 = \{1, \ldots, 1\}$. The relative prior probability of every network structure is one. This assignment satisfies our constraint of prior equivalence.
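The construction of U and D just described is mechanical; the following Python sketch illustrates it (the string naming scheme is our own convention, not the paper's):

```python
def build_2learn_instance(vertices, arcs):
    """The variable set U and database D of Section 12.2.1: one binary variable per
    DBFAS vertex, five per arc (a_i1, ..., a_i5), and a single all-ones case C_1."""
    U = [f"v_{v}" for v in vertices]                          # the subset of U corresponding to V
    arc_variables = {}
    for i, _arc in enumerate(arcs, start=1):
        arc_variables[i] = [f"a_{i}{s}" for s in range(1, 6)]  # the set A_i
        U.extend(arc_variables[i])
    C1 = {x: 1 for x in U}                                    # D = {C_1}
    return U, arc_variables, [C1]

# |U| = |V| + 5|A|: two vertices and one arc give 2 + 5 = 7 binary variables.
assert len(build_2learn_instance(["u", "w"], [("u", "w")])[0]) == 7
```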

From Equation 12.1, with database $D = C_1$ and relative prior probabilities equal to one, the BDe metric, denoted $M_{BDe}(D, B_S)$, becomes

$$M_{BDe}(C_1, B_S) = \prod_i \frac{N'_{ijk}}{N'_{ij}} \quad (12.3)$$

where k is the state of $x_i$ equal to one, and j is the instance of $\Pi_i$ such that the state of each variable in $\Pi_i$ is equal to one. The reduction to this point is polynomial. To specify the necessary $N'_{ijk}$ and $N'_{ij}$ parameters, we specify a prior network and then compute the parameters using Equation 12.2, assuming an arbitrary equivalent sample size of one.³ From Equation 12.3, we have

$$M_{BDe}(C_1, B_S) = \prod_i p(x_i = 1 \mid \Pi_i = 1, \ldots, 1, B_{S_C}^h) \quad (12.4)$$

To demonstrate that the reduction is polynomial, we show that the prior network can be constructed in polynomial time. In Section 12.2.3 (Theorem 12), we show that each probability in Equation 12.4 can be inferred from the prior network in constant time due to the special structure of the network.

We denote the prior Bayesian network $B = (B_S, B_P)$. The prior network B contains both hidden nodes, which do not appear in U, and visible nodes, which do appear in U. Every variable $x_i$ in U has a corresponding visible node in B which is also denoted by $x_i$. There are no other visible nodes in B. For every arc $a_k$ from $v_i$ to $v_j$ in the given instance of DBFAS, B contains ten hidden binary nodes and the directed edges shown in Figure 1 at the end of this subsection. In the given instance of DBFAS, we know that each node $v_i$ in V is adjacent to either two or three nodes. For every node $v_i$ in V which is adjacent to exactly two other nodes in G, there is a hidden node $h_i$ in B and an edge from $h_i$ to $x_i$. There are no other edges or hidden nodes in B. We use $h_{ij}$ to denote the hidden-node parent common to visible nodes $x_i$ and $x_j$.

We create the parameters $B_P$ as follows. For every hidden node $h_{ij}$ we set $p(h_{ij} = 0) = p(h_{ij} = 1) = 1/2$. Each visible node in B is one of two types. The type of a node is defined by its conditional probability distribution. Every node $a_{i5}$ in B (corresponding to the fifth variable created in U for the ith arc in the instance of DBFAS) is a type II node, and all other nodes are type I nodes. A type I node has the conditional probability distribution shown in Table 12.1. We say that two variables in U are prior siblings if the corresponding nodes in the prior network B share a common hidden parent. We use $S_{x_i}$ to denote the set of all variables in U which are prior siblings of $x_i$. For each type II node $a_{i5}$, we define the distinguished siblings as the set $D_{a_{i5}} = \{a_{i3}, a_{i4}\} \subseteq S_{a_{i5}}$. Table 12.2 shows the conditional probability distribution of a type II node $x_i$ with distinguished siblings $\{x_j, x_k\}$.

__________
³ Because there is only one case in the database, only the ratios $N'_{ijk}/N'_{ij}$ are needed (see Equation 12.3), and from Equation 12.2 the equivalent sample size is irrelevant. In general, the equivalent sample size will need to be specified to uniquely determine the parameter values.
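Because the database contains the single all-ones case, Equation 12.4 turns the BDe score into a product of conditional probabilities read off the prior network. Here is a minimal Python sketch of that computation, with a hypothetical callback standing in for inference in the prior network B (the names are our own, not the paper's):

```python
def single_case_bde(structure_parents, p_one_given_all_one_parents):
    """M_BDe(C_1, B_S) per Equation 12.4 for the single all-ones case C_1.

    structure_parents maps each variable in U to its parent set in the candidate
    structure B_S; p_one_given_all_one_parents(x, parents) returns
    p(x = 1 | all parents equal 1), inferred from the prior network B
    (a constant-time query by Theorem 12).
    """
    score = 1.0
    for x, parents in structure_parents.items():
        score *= p_one_given_all_one_parents(x, frozenset(parents))
    return score
```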

TABLE 12.1. Conditional probability distribution for a type I node.
    $h_{ij}$   $h_{ik}$   $h_{il}$   $p(x_i = 1 \mid h_{ij}, h_{ik}, h_{il})$

TABLE 12.2. Conditional probability distribution for a type II node $x_i$ with $D_{x_i} = \{x_j, x_k\}$.
    $h_{ij}$   $h_{ik}$   $h_{il}$   $p(x_i = 1 \mid h_{ij}, h_{ik}, h_{il})$

There are $|V| + 5|A|$ visible nodes in B, each visible node has at most three hidden-node parents, and each probability table has constant size. Thus, the construction of B takes time polynomial in the size of the instance of DBFAS.

We now derive the value for p. From Equation 12.4, we obtain

$$M_{BDe}(C_1, B_S) = \prod_i p(x_i = 1 \mid \Pi_i = 1, \ldots, 1, B_{S_C}^h) = \prod_i \alpha^{3 - |\Pi_i \cap S_{x_i}|} \, \frac{p(x_i = 1 \mid \Pi_i = 1, \ldots, 1, B_{S_C}^h)}{\alpha^{3 - |\Pi_i \cap S_{x_i}|}} = \alpha^{(3n - \sum_i |\Pi_i \cap S_{x_i}|)} \prod_i s'(x_i \mid \Pi_i, S_{x_i}) \quad (12.5)$$

where $\alpha < 1$ is a positive constant that we shall fix to be 5/6 for the remainder of the paper. Let $\sigma$ be the total number of prior sibling pairs as defined by B, and let $\delta$ be the number of prior sibling pairs which are not adjacent in $B_S$. The sum $\sum_i |\Pi_i \cap S_{x_i}|$ is the number of edges in $B_S$ which connect prior sibling pairs and is therefore equal to $\sigma - \delta$. Rewriting Equation 12.5, we get

$$M_{BDe}(C_1, B_S) = \alpha^{(3n - (\sigma - \delta))} \prod_i s'(x_i \mid \Pi_i, S_{x_i}) = c' \, \alpha^{\delta} \prod_i s'(x_i \mid \Pi_i, S_{x_i}) \quad (12.6)$$

We now state three lemmas, postponing their proofs to Section 12.2.3. A network structure $B_S$ is a prior sibling graph if all pairs of adjacent nodes are prior siblings. (Not all pairs of prior siblings in a prior sibling graph, however, need be adjacent.)

Lemma 1. Let $B_S$ be a network structure, and let $B_{S'}$ be the prior sibling graph created by removing every edge in $B_S$ which does not connect a pair of prior siblings. Then it follows that $M_{BDe}(C_1, B_{S'}) \geq M_{BDe}(C_1, B_S)$.

Throughout the remainder of the paper, the symbol $\gamma$ stands for the constant 24/25.

Lemma 2. If $B_S$ is a prior sibling graph, then for every type I node $x_i$ in $B_S$, if $\Pi_i$ contains at least one element, then $s'(x_i \mid \Pi_i, S_{x_i})$ is maximized and is equal to $m_1 = 64/135$. If $\Pi_i = \emptyset$, then $s'(x_i \mid \Pi_i, S_{x_i}) = \gamma\, m_1$.

Lemma 3. If $B_S$ is a prior sibling graph, then for every type II node $x_i$ in $B_S$, if $\Pi_i = D_{x_i}$, where $D_{x_i}$ is the set of two distinguished siblings of $x_i$, then $s'(x_i \mid \Pi_i, S_{x_i})$ is maximized and is equal to $m_2 = 40/81$. If $\Pi_i \neq D_{x_i}$, then $s'(x_i \mid \Pi_i, S_{x_i}) \leq \gamma\, m_2$.

Finally, we define p in the instance of 2-LEARN as

$$p = c' \, m_1^{|V|} \left( m_1^4\, m_2 \right)^{|A|} \gamma^{K} \quad (12.7)$$

where $m_1$ and $m_2$ are defined by Lemmas 2 and 3 respectively, and $c'$ is the constant from Equation 12.6. The value for p can be derived in polynomial time. Consequently, the entire reduction is polynomial.

FIGURE 1. Subgraph of the prior network B corresponding to the kth arc in A from $v_i$ to $v_j$.

FIGURE 2. Optimal configuration of the edges incident to the nodes in $\mathcal{A}_k$ corresponding to the arc from $v_i$ to $v_j$.

12.2.2 Proof of NP-Hardness

In this section, we first prove that 2-LEARN is NP-hard using the reduction from the previous section. Then, we prove that K-LEARN is NP-hard for all K > 1, using a reduction from 2-LEARN.

The following lemma explains the selection of p made in Equation 12.7, which in turn facilitates the proof that 2-LEARN is NP-hard. Let $\delta_k$ be the number of prior sibling pairs $\{x_i, x_j\}$ which are not adjacent in $B_S$, where at least one of $\{x_i, x_j\}$ is in $\mathcal{A}_k$. It follows that $\sum_k \delta_k = \delta$, and we can express Equation 12.6 as

$$M_{BDe}(C_1, B_S) = c' \left[ \prod_{x_i \in \mathcal{V}} s'(x_i \mid \Pi_i, S_{x_i}) \right] \left[ \prod_j t(\mathcal{A}_j, \delta_j) \right] \quad (12.8)$$

where $t(\mathcal{A}_j, \delta_j) = \alpha^{\delta_j} \prod_{x_i \in \mathcal{A}_j} s'(x_i \mid \Pi_i, S_{x_i})$.

Lemma 4. Let $B_S$ be a prior sibling graph. If each node in $\mathcal{A}_k$ is adjacent to all of its prior siblings, and the orientation of the connecting edges is as shown in Figure 2, then $t(\mathcal{A}_k, \delta_k)$ is maximized and is equal to $m_1^4\, m_2$. Otherwise, $t(\mathcal{A}_k, \delta_k) \leq \gamma\, m_1^4\, m_2$.

Proof: In Figure 2, every type I node in $\mathcal{A}_k$ has at least one prior sibling as a parent, and the single type II node has its distinguished siblings as parents. Thus, by Lemmas 2 and 3, the score $s'(x_i \mid \Pi_i, S_{x_i})$ for each node $x_i \in \mathcal{A}_k$ is maximized. Furthermore, every pair of prior siblings is adjacent. Thus, we have

$$t(\mathcal{A}_k, \delta_k) = \alpha^{\delta_k} \prod_{x_i \in \mathcal{A}_k} s'(x_i \mid \Pi_i, S_{x_i}) = \alpha^{0}\, m_1\, m_1\, m_1\, m_1\, m_2 = m_1^4\, m_2$$

Suppose there exists another orientation of the edges incident to the nodes in $\mathcal{A}_k$ such that $t(\mathcal{A}_k, \delta_k) > \gamma\, m_1^4\, m_2$. Because $\alpha < \gamma$ (5/6 < 24/25), every pair of prior siblings must be adjacent in this hypothetical configuration. Furthermore, every node in $\mathcal{A}_k$ must achieve its maximum score, else the total score will be bounded above by $\gamma\, m_1^4\, m_2$. From Lemmas 2 and 3, it follows that the resulting configuration must be identical to Figure 2. □
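To make the quantities in Equations 12.5, 12.7, and 12.8 concrete, here is a small Python sketch (the function names are our own; the constants are those fixed above, with $m_1$ and $m_2$ taken from Lemmas 2 and 3):

```python
from math import isclose, prod

ALPHA, GAMMA = 5 / 6, 24 / 25    # constants fixed in this section
M1, M2 = 64 / 135, 40 / 81       # maximal type I / type II scores (Lemmas 2 and 3)

def s_prime(p_one_given_one_parents, sibling_overlap, alpha=ALPHA):
    # Equation 12.5: s'(x_i | Pi_i, S_xi) = p(x_i=1 | Pi_i=1,...,1) / alpha**(3 - |Pi_i ∩ S_xi|)
    return p_one_given_one_parents / alpha ** (3 - sibling_overlap)

def t_factor(node_scores, delta_k, alpha=ALPHA):
    # Equation 12.8: t(A_k, delta_k) = alpha**delta_k times the product of the
    # s' scores of the five variables in A_k.
    return alpha ** delta_k * prod(node_scores)

def threshold_p(c_prime, num_vertices, num_arcs, K, m1=M1, m2=M2, gamma=GAMMA):
    # Equation 12.7: p = c' * m1**|V| * (m1**4 * m2)**|A| * gamma**K
    return c_prime * m1 ** num_vertices * (m1 ** 4 * m2) ** num_arcs * gamma ** K

# An arc gadget in the Figure 2 configuration: delta_k = 0 and all five scores maximal.
assert isclose(t_factor([M1, M1, M1, M1, M2], 0), M1 ** 4 * M2)
```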

The next two theorems prove that 2-LEARN is NP-hard.

Theorem 5. There exists a solution to the 2-LEARN instance constructed in Section 12.2.1 with $M_{BDe}(C_1, B_S) \geq p$ if there exists a solution to the given DBFAS problem with $|A'| \leq K$.

Proof: Given a solution to DBFAS, create the solution to 2-LEARN as follows. For every arc $a_k = (v_i, v_j) \in A$ such that $a_k \notin A'$, insert the edges in $B_S$ between the corresponding nodes in $\mathcal{A}_k \cup v_i \cup v_j$ as shown in Figure 2. For every arc $a_k = (v_i, v_j) \in A'$, insert the edges in $B_S$ between the corresponding nodes in $\mathcal{A}_k \cup v_i \cup v_j$ as shown in Figure 2, except for the edge between $a_{k1}$ and $a_{k2}$, which is reversed and therefore oriented from $a_{k2}$ to $a_{k1}$.

To complete the proof, we must first show that $B_S$ is a solution to the 2-LEARN instance, and then show that $M_{BDe}(C_1, B_S)$ is greater than or equal to p. Because each node in $B_S$ has at most two parents, we know $B_S$ is a solution as long as it is acyclic. By construction, $B_S$ cannot contain a cycle unless there is a cycle in G for which none of the edges are contained in $A'$. Because $A'$ is a solution to DBFAS, this implies $B_S$ is acyclic.

We now derive $M_{BDe}(C_1, B_S)$. Let $\mathcal{A}_{opt}$ be the subset of the $\mathcal{A}_k$ sets which correspond to the arcs in $A \setminus A'$. Rewriting Equation 12.8 we get

$$M_{BDe}(C_1, B_S) = c' \left[ \prod_{x_i \in \mathcal{V}} s'(x_i \mid \Pi_i, S_{x_i}) \right] \left[ \prod_{\mathcal{A}_j \in \mathcal{A}_{opt}} t(\mathcal{A}_j, \delta_j) \right] \left[ \prod_{\mathcal{A}_k \in \mathcal{A} \setminus \mathcal{A}_{opt}} t(\mathcal{A}_k, \delta_k) \right]$$

Every node $x_i \in \mathcal{V}$ has at least one prior sibling node as a parent, because each node in the instance of DBFAS has an in-degree of at least one. Furthermore, Lemma 4 guarantees that for every $\mathcal{A}_k$ in $\mathcal{A}_{opt}$, $t(\mathcal{A}_k, \delta_k)$ equals $m_1^4\, m_2$. Now consider any $\mathcal{A}_k$ in $\mathcal{A} \setminus \mathcal{A}_{opt}$. All prior sibling pairs for which at least one node is in this set are adjacent in $B_S$, so $\delta_k$ is zero. Furthermore, every node in this set attains a maximum score, except for the type I node $a_{k2}$, which by Lemma 2 attains a score of $\gamma\, m_1$. Plugging these values into the previous equation, we have

$$M_{BDe}(C_1, B_S) = c'\, m_1^{|V|} \left( m_1^4\, m_2 \right)^{|\mathcal{A}_{opt}|} \left( \gamma\, m_1^4\, m_2 \right)^{|\mathcal{A} \setminus \mathcal{A}_{opt}|} = c'\, m_1^{|V|} \left( m_1^4\, m_2 \right)^{|A|} \gamma^{|A'|}$$

Because $\gamma < 1$ and $|A'| \leq K$, we conclude that $M_{BDe}(C_1, B_S) \geq p$. □
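The forward direction of the reduction (Theorem 5) can be sketched as follows; this is illustrative only, since the helper figure_2_edges stands in for the edge configuration of Figure 2, which is not reproduced here, and the naming scheme is our own:

```python
def structure_from_dbfas_solution(arcs, feedback_arcs, figure_2_edges):
    """The 2-LEARN structure B_S built in the proof of Theorem 5.

    arcs is a list of DBFAS arcs (v_i, v_j); feedback_arcs is the solution set A';
    figure_2_edges(k, v_i, v_j) stands in for Figure 2 and returns the directed
    edges among {v_i, a_k1, ..., a_k5, v_j} for the k-th arc gadget.
    """
    edges = set()
    for k, (v_i, v_j) in enumerate(arcs, start=1):
        a_k1, a_k2 = f"a_{k}1", f"a_{k}2"
        for (u, w) in figure_2_edges(k, v_i, v_j):
            if (v_i, v_j) in feedback_arcs and (u, w) == (a_k1, a_k2):
                edges.add((a_k2, a_k1))      # reverse a_k1 -> a_k2 for arcs in A'
            else:
                edges.add((u, w))
    return edges
```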

Theorem 6. There exists a solution to the given DBFAS problem with $|A'| \leq K$ if there exists a solution to the 2-LEARN instance constructed in Section 12.2.1 with $M_{BDe}(C_1, B_S) \geq p$.

Proof: Given the solution $B_S$ to the instance of 2-LEARN, remove any edges in $B_S$ which do not connect prior siblings. Lemma 1 guarantees that the BDe score does not decrease due to this transformation. Now create the solution to DBFAS as follows. Recall that each set of nodes $\mathcal{A}_k$ corresponds to an arc $a_k = (v_i, v_j)$ in the instance of DBFAS. Define the solution arc set $A'$ to be the set of arcs corresponding to those sets $\mathcal{A}_k$ for which the edges incident to the nodes in $\mathcal{A}_k$ are not configured as shown in Figure 2.

To complete the proof, we first show that $A'$ is a solution to DBFAS, and then show that $|A'| \leq K$. Suppose that $A'$ is not a solution to DBFAS. This means that there exists a cycle in G that does not pass through an arc in $A'$. For every arc $(v_i, v_j)$ in this cycle, there is a corresponding directed path from $v_i$ to $v_j$ in $B_S$ (see Figure 2). But this implies there is a cycle in $B_S$, which contradicts the fact that we have a solution to 2-LEARN. From Lemma 4 we know that each set $\mathcal{A}_k$ that corresponds to an arc in $A'$ has $t(\mathcal{A}_k, \delta_k)$ bounded above by $\gamma\, m_1^4\, m_2$. Because $M_{BDe}(C_1, B_S) \geq p$, we conclude from Equation 12.8 that there can be at most K such arcs. □

Theorem 7. K-LEARN with $M_{BDe}(D, B_S)$ satisfying prior equivalence is NP-hard for every integer K > 1.

Proof: Because 2-LEARN is NP-hard, we establish the theorem by showing that any 2-LEARN problem can be solved using an instance of K-LEARN. Given an instance of 2-LEARN, an equivalent instance of K-LEARN is identical to the instance of 2-LEARN, except that the relative prior probability is zero for any structure that contains a node with more than two parents.⁴ It remains to be shown that this assignment satisfies prior equivalence. We can establish this fact by showing that no structure containing a node with more than two parents is equivalent to a structure in which no node contains more than two parents. Chickering (1995) shows that for any two equivalent structures $B_{S1}$ and $B_{S2}$, there exists a finite sequence of arc reversals in $B_{S1}$ such that (1) after each reversal $B_{S1}$ remains equivalent to $B_{S2}$, (2) after all reversals $B_{S1} = B_{S2}$, and (3) if the edge $v_i \rightarrow v_j$ is the next edge to be reversed, then $v_i$ and $v_j$ have the same parents, with the exception that $v_i$ is also a parent of $v_j$. It follows that after each reversal, $v_i$ has the same number of parents as $v_j$ did before the reversal, and $v_j$ has the same number of parents as $v_i$ did before the reversal. Thus, if there exists a node with l parents in some structure $B_{S1}$, then there exists a node with l parents in any structure that is equivalent to $B_{S1}$. □

12.2.3 Proof of Lemmas

To prove Lemmas 1 through 3, we derive $s'(x_i \mid \Pi_i, S_{x_i})$ for every pair $\{x_i, \Pi_i\}$. Let $x_i$ be any node. The set $\Pi_i$ must satisfy one of the following mutually exclusive and collectively exhaustive assertions:

__________
⁴ Note that no new parameters need be specified.

Assertion 1. For every node $x_j$ which is both a parent of $x_i$ and a prior sibling of $x_i$ (i.e., $x_j \in \Pi_i \cap S_{x_i}$), there is no prior sibling of $x_j$ which is also a parent of $x_i$.

Assertion 2. There exists a node $x_j$ which is both a parent of $x_i$ and a prior sibling of $x_i$, such that one of the prior siblings of $x_j$ is also a parent of $x_i$.

The following theorem shows that to derive $s'(x_i \mid \Pi_i, S_{x_i})$ for any pair $\{x_i, \Pi_i\}$ for which $\Pi_i$ satisfies Assertion 1, we need only compute the cases for which $\Pi_i \subseteq S_{x_i}$.

Theorem 8. Let $x_i$ be any node in $B_S$. If $\Pi_i$ satisfies Assertion 1, then $s'(x_i \mid \Pi_i, S_{x_i}) = s'(x_i \mid \Pi_i \cap S_{x_i}, S_{x_i})$.

Proof: From Equation 12.5, we have

$$s'(x_i \mid \Pi_i, S_{x_i}) = \frac{p(x_i = 1 \mid \Pi_i = 1, \ldots, 1, B_{S_C}^h)}{\alpha^{3 - |\Pi_i \cap S_{x_i}|}} \quad (12.9)$$

Because $\Pi_i$ satisfies Assertion 1, it follows by construction of B that $x_i$ is d-separated from all parents that are not prior siblings once the values of $\Pi_i \cap S_{x_i}$ are known. □

For the next two theorems, we use the following equalities.⁵

$$p(h_{ij}, h_{ik}, h_{il}) = p(h_{ij})\, p(h_{ik})\, p(h_{il}) \quad (12.10)$$
$$p(h_{ij}, h_{ik}, h_{il} \mid x_j) = p(h_{ij} \mid x_j)\, p(h_{ik})\, p(h_{il}) \quad (12.11)$$
$$p(h_{ij}, h_{ik}, h_{il} \mid x_j, x_k) = p(h_{ij} \mid x_j)\, p(h_{ik} \mid x_k)\, p(h_{il}) \quad (12.12)$$
$$p(h_{ij} = 0 \mid x_i = 1) = \tfrac{2}{3} \quad (12.13)$$

Equation 12.10 follows because each hidden node is a root in B. Equation 12.11 follows because any path from $x_j$ to either $h_{ik}$ or $h_{il}$ must pass through some node $x \neq x_j$ which is a sink. Equation 12.12 follows from a similar argument, noting from the topology of B that $x \notin \{x_j, x_k\}$. Equation 12.13 follows from Tables 12.1 and 12.2, using the fact that $p(h_{ij} = 0)$ equals 1/2.

Theorem 9. Let $x_i$ be any type I node in $B_S$ for which $\Pi_i$ satisfies Assertion 1. If $|\Pi_i \cap S_{x_i}| = 0$ then $s'(x_i \mid \Pi_i, S_{x_i}) = \gamma\, m_1$. If $|\Pi_i \cap S_{x_i}| = 1$ then $s'(x_i \mid \Pi_i, S_{x_i}) = m_1$. If $|\Pi_i \cap S_{x_i}| = 2$ then $s'(x_i \mid \Pi_i, S_{x_i}) = m_1$.

Proof: Follows by solving Equation 12.9, using Equations 12.10 through 12.13 and the probabilities given in Table 12.1. □

Theorem 10. Let $x_i$ be any type II node in $B_S$ for which $\Pi_i$ satisfies Assertion 1. If $|\Pi_i \cap S_{x_i}| = 0$ then $s'(x_i \mid \Pi_i, S_{x_i}) = \gamma^2\, m_2$. If $|\Pi_i \cap S_{x_i}| = 1$ then $s'(x_i \mid \Pi_i, S_{x_i}) = \gamma\, m_2$. If $|\Pi_i \cap S_{x_i}| = 2$ and $\Pi_i \neq D_{x_i}$ then $s'(x_i \mid \Pi_i, S_{x_i}) = \gamma\, m_2$. If $\Pi_i = D_{x_i}$ then $s'(x_i \mid \Pi_i, S_{x_i}) = m_2$.

__________
⁵ We drop the conditioning event $B_{S_C}^h$ to simplify notation.

Proof: Follows by solving Equation 12.9, using Equations 12.10 through 12.13 and the probabilities given in Table 12.2. □

Now we show that if Assertion 2 holds for the parents of some node, then we can remove the edge from the parent which is not a sibling without decreasing the score. Once this theorem is established, the lemmas follow.

Theorem 11. Let $x_i$ be any node. If $\Pi_i = \{x_j, x_k\}$, where $x_j \in S_{x_i}$ and $x_k \in S_{x_j}$, then $s'(x_i \mid x_j) \geq s'(x_i \mid x_j, x_k)$.

Proof: For any node we have

$$p(x_i = 1 \mid x_j = 1, x_k = 1) = \frac{p(x_i = 1)\, p(x_k = 1 \mid x_i = 1)\, p(x_j = 1 \mid x_i = 1, x_k = 1)}{p(x_k = 1)\, p(x_j = 1 \mid x_k = 1)}$$

Because $x_i$ and $x_k$ are not prior siblings, it follows that $p(x_k \mid x_i) = p(x_k)$. Expressing the resulting equality in terms of $s'(x_i \mid \Pi_i, S_{x_i})$, noting that $x_i$ has only one prior sibling as a parent, and canceling terms of $\alpha$, we obtain

$$s'(x_i \mid \{x_j, x_k\}, S_{x_i}) = s'(x_i \mid \emptyset, S_{x_i})\, \frac{s'(x_j \mid \{x_i, x_k\}, S_{x_j})}{s'(x_j \mid \{x_k\}, S_{x_j})} \quad (12.14)$$

If $x_j$ is a type I node, or if $x_j$ is a type II node and $x_i$ and $x_k$ are not its distinguished siblings, then $s'(x_j \mid \{x_i, x_k\}, S_{x_j})$ equals $s'(x_j \mid \{x_k\}, S_{x_j})$, which implies that we can improve the local score of $x_i$ by removing the edge from $x_k$. If $x_j$ is a type II node and $D_{x_j} = \{x_i, x_k\}$, then $s'(x_j \mid \{x_i, x_k\}, S_{x_j})$ equals $(1/\gamma)\, s'(x_j \mid \{x_k\}, S_{x_j})$, which implies we can remove the edge from $x_k$ without affecting the score of $x_i$. □

The preceding arguments also demonstrate the following theorem.

Theorem 12. For any pair $\{x_i, \Pi_i\}$, where $|\Pi_i| \leq 2$, the value $p(x_i = 1 \mid \Pi_i)$ can be computed from B in constant time when the state of each of the variables in $\Pi_i$ is equal to one.

12.3 References

[Chickering, 1995] Chickering, D. M. (1995). A transformational characterization of Bayesian network structures. In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, Montreal, QU. Morgan Kaufmann.

[Cooper and Herskovits, 1992] Cooper, G. and Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347.

[Garey and Johnson, 1979] Garey, M. and Johnson, D. (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman.

[Gavril, 1977] Gavril, F. (1977). Some NP-complete problems on graphs. In Proc. 11th Conf. on Information Sciences and Systems, Johns Hopkins University, pages 91-95. Baltimore, MD.

[Heckerman et al., 1995] Heckerman, D., Geiger, D., and Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20:197-243.

[Höffgen, 1993] Höffgen, K.-U. (revised 1993). Learning and robust learning of product distributions. Technical Report 464, Fachbereich Informatik, Universität Dortmund.