Algorithms and Complexity Results for Exact Bayesian Structure Learning

Sebastian Ordyniak and Stefan Szeider
Institute of Information Systems, Vienna University of Technology, Austria

Research supported by the European Research Council, grant reference 239962.

Abstract

Bayesian structure learning is the NP-hard problem of discovering a Bayesian network that optimally represents a given set of training data. In this paper we study the computational worst-case complexity of exact Bayesian structure learning under graph-theoretic restrictions on the super-structure. The super-structure (a concept introduced by Perrier, Imoto, and Miyano, JMLR 2008) is an undirected graph that contains as subgraphs the skeletons of solution networks. Our results apply to several variants of score-based Bayesian structure learning where the score of a network decomposes into local scores of its nodes.

Results: We show that exact Bayesian structure learning can be carried out in non-uniform polynomial time if the super-structure has bounded treewidth, and in linear time if in addition the super-structure has bounded maximum degree. We complement this with a number of hardness results. We show that both restrictions (treewidth and degree) are essential and cannot be dropped without losing uniform polynomial-time tractability (subject to a complexity-theoretic assumption). Furthermore, we show that the restrictions remain essential if we do not search for a globally optimal network but aim to improve a given network by means of at most k arc additions, arc deletions, or arc reversals (k-neighborhood local search).

Keywords: Bayesian structure learning, super-structure, treewidth, fixed-parameter tractability, parameterized complexity

1 Introduction

Bayesian structure learning is the important task of discovering a Bayesian network that represents a given set of training data. Unfortunately, the problem is NP-hard (Chickering 1996). This predicament has motivated a wide range of approaches, including heuristics-based algorithms that compute near-optimal solutions (see, e.g., Heckerman, Geiger, & Chickering 1995; Chickering 2003).

In recent years several exponential-time algorithms for exact Bayesian structure learning have been proposed (see, e.g., Parviainen & Koivisto 2009; Perrier, Imoto, & Miyano 2008; Silander & Myllymäki 2006). Recent progress has been made to limit the space requirement by advanced dynamic programming techniques (Parviainen & Koivisto 2009) and to limit the exponential time requirement by restricting the search to networks whose skeletons are subgraphs of a given undirected graph that specifies a super-structure (Perrier, Imoto, & Miyano 2008). Recent research indicates that the super-structure can be practically computed and effectively used to guide the search for near-optimal Bayesian networks (Mukund & Jeff 2004; Anton & Carlos 2007; Pieter, Daphne, & Andrew 2006).

In this paper we study the worst-case time complexity of exact Bayesian structure learning under graph-theoretic restrictions on the super-structure. In particular, we consider bounds on the treewidth and on the maximum degree of super-structures. Our results are as follows:

(1) Exact Bayesian structure learning is feasible in non-uniform polynomial time if the treewidth of the super-structure is bounded by an arbitrary constant.

(2) Exact Bayesian structure learning is feasible in linear time if both treewidth and maximum degree of the super-structure are bounded by arbitrary constants.

By non-uniform we mean that the order of the polynomial depends on the treewidth. We obtain results (1) and (2) by means of a dynamic programming algorithm along a decomposition tree of the super-structure. We show that in a certain sense both results are optimal:

(3) Exact Bayesian structure learning for instances with super-structures of maximum degree 4 (but unbounded treewidth) is not feasible in polynomial time unless P = NP.
Thus, in (1) and (2) we cannot drop the bound on the treewidth.

(4) Exact Bayesian structure learning for instances with super-structures of bounded treewidth (but unbounded maximum degree) is not feasible in uniform polynomial time unless FPT = W[1].
Thus, in (2) we cannot drop the bound on the degree.

FPT $\neq$ W[1] is a widely accepted complexity-theoretic assumption (Downey & Fellows 1999). For example, FPT = W[1] implies the (unlikely) existence of a $2^{o(n)}$ algorithm for n-variable 3SAT (Impagliazzo, Paturi, & Zane 2001; Flum & Grohe 2006). Result (3) follows easily from Chickering's reduction (Chickering 1996). We establish result (4) by means of a parameterized reduction from a variant of the Maximum Clique problem. We will provide the necessary background on parameterized complexity and parameterized reductions in Section 2.2.

We further extend the hardness results (3) and (4) from the search for an optimal network to the presumably easier problem of improving a given network by changing at most k of its arcs (with the operations of arc addition, arc deletion, and arc reversal). We refer to this restricted problem as k-neighborhood local search, or k-local search for short. For trivial reasons, k-local search is feasible in non-uniform polynomial time $n^{O(k)}$. We show, however, that uniform polynomial-time tractability is again unlikely:

(5) k-local search for instances with super-structures of bounded maximum degree is not possible in uniform polynomial time unless FPT = W[1].

(6) k-local search for instances with super-structures of bounded treewidth is not possible in uniform polynomial time unless FPT = W[1].

We obtain result (5) by a reduction from the Red/Blue Non-Blocker problem (Downey & Fellows 1999). If both the maximum degree and the treewidth are bounded, then k-local search is feasible in linear time; however, this result is subsumed by (2). Both hardness results (5) and (6) even hold for several cases where not all of the three operations (addition, deletion, reversal) are available, for example if arc reversal is the only operation.
2 Preliminaries

In this section we introduce the basic concepts and notions that we use throughout the paper.

2.1 Basic Graph Theory

We assume that the reader is familiar with basic graph theory. We consider undirected graphs and directed graphs (digraphs). A DAG is a directed acyclic graph. We write $V(G) = V$ and $E(G) = E$ for the sets of vertices and edges of a (directed or undirected) graph $G = (V, E)$. We denote an undirected edge between vertices $u$ and $v$ as $\{u, v\}$ and a directed edge (or arc) directed from $u$ to $v$ as $(u, v)$. For a subset $V' \subseteq V$ we write $G[V']$ to denote the induced subgraph $G' = (V', E')$ where $E' = \{ e \subseteq V' : e \in E \}$ if $G$ is undirected and $E' = \{ e \in V' \times V' : e \in E \}$ if $G$ is directed. If $G$ is a digraph we define $P_G(v) = \{ u \in V(G) : (u, v) \in E(G) \}$ as the set of parents of $v$ in $G$. An undirected graph $G' = (V, E')$ is the skeleton of $G$ if $E' = \{ \{u, v\} : (u, v) \in E(G) \}$.

2.2 Parameterized Complexity

Parameterized complexity provides a theoretical framework to distinguish between uniform and non-uniform polynomial-time tractability with respect to a parameter. An instance of a parameterized problem is a pair $(I, k)$ where $I$ is the main part and $k$ is the parameter; the latter is usually a non-negative integer. A parameterized problem is fixed-parameter tractable if there exist a computable function $f$ and a constant $c$ such that instances $(I, k)$ of size $n$ can be solved in time $O(f(k) n^c)$. FPT is the class of all fixed-parameter tractable decision problems. Fixed-parameter tractable problems are also called uniform polynomial-time tractable because if $k$ is considered constant, then instances with parameter $k$ can be solved in polynomial time where the order of the polynomial is independent of $k$ (in contrast to non-uniform polynomial-time running times such as $n^k$).

Parameterized complexity offers a completeness theory similar to the theory of NP-completeness. One uses parameterized reductions, which are many-one reductions where the parameter for one problem maps into the parameter for the other.
More specifically, problem $L$ reduces to problem $L'$ if there is a mapping $R$ from instances of $L$ to instances of $L'$ such that (i) $(I, k)$ is a yes-instance of $L$ if and only if $(I', k') = R(I, k)$ is a yes-instance of $L'$, (ii) $k' = g(k)$ for a computable function $g$, and (iii) $R$ can be computed in time $O(f(k) n^c)$ where $f$ is a computable function, $c$ is a constant, and $n$ denotes the size of $(I, k)$. The parameterized complexity class W[1] is considered the parameterized analog of NP. For example, the parameterized Maximum Clique problem (given a graph $G$ and a parameter $k \geq 0$, does $G$ contain a complete subgraph on $k$ vertices?) is W[1]-complete under parameterized reductions. Note that there exists a trivial non-uniform polynomial-time $n^k$ algorithm for the Maximum Clique problem that checks all sets of $k$ vertices.

2.3 Tree Decompositions

Treewidth is an important graph parameter that indicates in a certain sense the tree-likeness of a graph. The treewidth of a graph $G = (V, E)$ is defined via the following notion of decomposition: a tree decomposition of $G$ is a pair $(T, \chi)$ where $T$ is a tree and $\chi$ is a labeling function with $\chi(t) \subseteq V$ for every tree node $t$, such that the following conditions hold:

1. Every vertex of $G$ occurs in $\chi(t)$ for some tree node $t$.
2. For every edge $\{u, v\}$ of $G$ there is a tree node $t$ such that $u, v \in \chi(t)$.
3. For every vertex $v$ of $G$, the tree nodes $t$ with $v \in \chi(t)$ induce a connected subtree of $T$.

The width of a tree decomposition $(T, \chi)$ is the size of a largest set $\chi(t)$ minus 1 among all nodes $t$ of $T$. A tree decomposition of smallest width is optimal. The treewidth of a graph $G$, denoted $\mathrm{tw}(G)$, is the width of an optimal tree decomposition of $G$. Given $G$ with $n$ vertices and a constant $w$, it is possible to decide whether $G$ has treewidth at most $w$, and if so, to compute an optimal tree decomposition of $G$ in time $O(n)$ (Bodlaender 1996). Furthermore, there exist powerful heuristics to compute tree decompositions of small width in a practically feasible way (Gogate & Dechter 2004).

3 Bayesian Structure Learning

In this section we define the theoretical framework for Bayesian structure learning that we shall use for our considerations. We closely follow the abstract framework used by Parviainen and Koivisto (2009), which encompasses a wide range of score-based approaches to structure learning.

We assume that the input data specifies a set $V$ of nodes (or variables) and a local score function $f$ that assigns to each $v \in V$ and each subset $A \subseteq V \setminus \{v\}$ a non-negative real number $f(v, A)$. Given the local score function $f$, the problem is to find a DAG $D = (V, E)$ such that the score of $D$ under $f$,

$f(D) := \sum_{v \in V} f(v, P_D(v))$,

is as large as possible (the DAG $D$ together with certain local probability distributions forms a Bayesian network). This setting accommodates several popular scores like BDe, BIC and AIC (Parviainen & Koivisto 2009; Chickering 1995). We consider the following decision problem:

EXACT BAYESIAN STRUCTURE LEARNING
Instance: A local score function $f$ defined on a set $V$ of nodes, a real number $s > 0$.
Question: Is there a DAG $D$ such that $f(D) \geq s$?
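The decomposable score defined above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the local score function is stored as a dictionary from (node, parent set) pairs to positive scores, with missing entries read as 0; the names `network_score`, `parents`, and `is_acyclic` are hypothetical and not taken from the paper.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Arc = Tuple[str, str]
LocalScore = Dict[Tuple[str, FrozenSet[str]], float]


def parents(arcs: Iterable[Arc], v: str) -> FrozenSet[str]:
    """Return P_D(v), the set of parents of v in the digraph with the given arcs."""
    return frozenset(u for (u, w) in arcs if w == v)


def is_acyclic(nodes: Iterable[str], arcs: Iterable[Arc]) -> bool:
    """Check acyclicity by repeatedly removing vertices without incoming arcs."""
    remaining: Set[str] = set(nodes)
    arc_set = set(arcs)
    while remaining:
        sources = {
            v for v in remaining
            if not any(w == v and u in remaining for (u, w) in arc_set)
        }
        if not sources:
            return False  # every remaining vertex has a parent: a cycle exists
        remaining -= sources
    return True


def network_score(nodes: Iterable[str], arcs: Iterable[Arc], f: LocalScore) -> float:
    """Compute f(D) as the sum of f(v, P_D(v)); unlisted pairs score 0 (assumption)."""
    arc_list = list(arcs)
    return sum(f.get((v, parents(arc_list, v)), 0.0) for v in nodes)


# Toy instance (hypothetical data).
nodes = ["a", "b", "c"]
f = {
    ("b", frozenset({"a"})): 2.0,
    ("c", frozenset({"a", "b"})): 3.0,
    ("a", frozenset({"c"})): 5.0,
}
dag = {("a", "b"), ("a", "c"), ("b", "c")}
```

On this toy instance the DAG with arcs a→b, a→c, b→c collects the local scores of b and c, so the decision problem with threshold s = 5 has a yes-answer; adding the arc c→a would create a cycle and is therefore not allowed.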
For our complexity-theoretic considerations we will assume that the local score function $f$ is given as the list of all tuples $(v, A, f(v, A))$ for $v \in V$ and $A \subseteq V \setminus \{v\}$ where $f(v, A) > 0$. We define $P_f(v) := \{ P \subseteq V : f(v, P) > 0 \} \cup \{\emptyset\}$ to be the set of all potential parent sets of $v$. We also define $\delta_f := \max_{v \in V} |P_f(v)|$, which will be an important measurement for our worst-case analysis of running times.

Let $f$ be a local score function defined on a set $V$ of nodes. The super-structure of $f$ is the undirected graph $S_f = (V, E_f)$ where $E_f$ contains an edge $\{u, v\}$ if and only if $u$ is a potential parent of $v$, i.e., if $u \in P$ for some $P \in P_f(v)$. We say that a DAG $D$ is admissible for $f$ if the skeleton of $D$ is a spanning subgraph of the super-structure $S_f$. Furthermore, we say that a DAG $D$ is strictly admissible for $f$ if for every vertex $v \in V(D)$ we have $P_D(v) \in P_f(v)$. Note that every strictly admissible DAG is also admissible. Furthermore, there always exists a (strictly) admissible DAG $D$ with the highest score: if $D$ is not (strictly) admissible, i.e., if there exists $v \in V(D)$ such that $f(v, P_D(v)) = 0$, we can delete all arcs $(w, v)$ such that $w \in P_D(v)$. This does not decrease the score since $f(v, \emptyset) \geq f(v, P_D(v)) = 0$ for every such $v$.

4 An Algorithm for Exact Bayesian Structure Learning

In this section we present the dynamic programming algorithm and establish our tractability results. For the remainder of this section $w$ denotes an arbitrary but fixed constant.

Theorem 1. Let $f$ be a local score function whose super-structure $S_f = (V, E_f)$ has treewidth bounded by a constant $w$. Then we can find in time $O(\delta_f^{w+1} |V|)$ a DAG $D$ with maximal score $f(D)$.

Corollary 1. EXACT BAYESIAN STRUCTURE LEARNING can be decided in polynomial time for instances where the super-structure has bounded treewidth. The problem can be decided in linear time if additionally the super-structure has bounded maximum degree.

Proof. The first statement follows immediately from the theorem since $\delta_f$ is bounded by the total input size of the instance.
The second statement follows since $\delta_f$ is bounded whenever the maximum degree $d$ of the super-structure is bounded, as clearly $\delta_f \leq 2^d$.

We are going to establish Theorem 1 by means of a dynamic programming algorithm along a tree decomposition for $S_f$, computing local information at the nodes of the tree decomposition that can then be put together to form an optimal DAG. For this approach, it is convenient to consider tree decompositions in the following normal form (Kloks 1994): A triple $(T, \chi, r)$ is a nice tree decomposition of a graph $G$ if $(T, \chi)$ is a tree decomposition of $G$, the tree $T$ is rooted at node $r$, and each node of $T$ is of one of the following four types:

1. a leaf node: a node having no children;
2. a join node: a node $t$ having exactly two children $t_1, t_2$, and $\chi(t) = \chi(t_1) = \chi(t_2)$;
3. an introduce node: a node $t$ having exactly one child $t'$, and $\chi(t) = \chi(t') \cup \{v\}$ for a vertex $v$ of $G$;
4. a forget node: a node $t$ having exactly one child $t'$, and $\chi(t) = \chi(t') \setminus \{v\}$ for a vertex $v$ of $G$.

For convenience we will also assume that $\chi(r) = \emptyset$ for the root $r$ of $T$. For a nice tree decomposition $(T, \chi, r)$ we define $\chi^*(t)$ to be the union of all the sets $\chi(t')$ where $t'$ is contained in the subtree of $T$ rooted at $t$. Given a tree decomposition of a graph $G$ of width $w$, one can effectively obtain in time $O(|V(G)|)$ a nice tree decomposition of $G$ with $O(|V(G)|)$ nodes and of width at most $w$ (Kloks 1994).

In the following we will assume that we are given an instance $I = (V, f)$ of EXACT BAYESIAN STRUCTURE LEARNING together with a nice tree decomposition $(T, \chi, r)$ for $S_f$ of width at most $w$.

A partial solution for a tree node $t \in V(T)$ is a digraph that can be obtained as the induced subdigraph $D[\chi^*(t)]$ of a strictly admissible DAG $D$ for $f$. For a tree node $t$ let $\mathcal{D}(t)$ denote the set of all partial solutions for $t$. For a partial solution $D \in \mathcal{D}(t)$ we set

$f_t(D) = \sum_{v \in V(D) \setminus \chi(t)} f(v, P_D(v))$,

i.e., $f_t(D)$ is the sum of the scores of all nodes of $D$ except for the nodes in $\chi(t)$.

A record of a tree node $t \in V(T)$ is a triple $R = (a, p, s)$ such that:

1. $a$ is a mapping that assigns to each $v \in \chi(t)$ a potential parent set $a(v) \in P_f(v)$;
2. $p$ is a transitive binary relation on $\chi(t)$;
3. $s$ is a non-negative real number.

We say that a record represents a partial solution $D \in \mathcal{D}(t)$ if it satisfies the following conditions:

1. $a(v) \cap V(D) = P_D(v)$ for every $v \in \chi(t)$.
2. For every pair of vertices $v_1, v_2 \in \chi(t)$ it holds that $(v_1, v_2) \in p$ if and only if $D$ contains a directed path from $v_1$ to $v_2$.

We say that a record $R = (a, p, s)$ of a tree node $t \in V(T)$ is valid if it represents some DAG $D \in \mathcal{D}(t)$ and $s$ is the maximum score $f_t(D)$ over all DAGs in $\mathcal{D}(t)$ represented by $R$. With each tree node $t \in V(T)$ we associate the set $\mathcal{R}(t)$ of all valid records representing partial solutions in $\mathcal{D}(t)$.
In a certain sense, $\mathcal{R}(t)$ is a succinct representation of the optimal elements of $\mathcal{D}(t)$, using space that only depends on $w$ and $\delta_f$, but not on $|V|$. The next three lemmas will allow us to compute the valid records of a tree node from the valid records of its children.

Lemma 1 (join nodes). Let $t_1, t_2$ be the children of $t$ in $T$. Then $\mathcal{R}(t)$ can be computed from $\mathcal{R}(t_1)$ and $\mathcal{R}(t_2)$ in time $O(\delta_f^{w+1})$.

Proof. It follows from the above definitions that a record $R = (a, p, s)$ of $t$ is valid if and only if there are valid records $R_1 = (a_1, p_1, s_1) \in \mathcal{R}(t_1)$ and $R_2 = (a_2, p_2, s_2) \in \mathcal{R}(t_2)$ such that:

1. $a = a_1 = a_2$.
2. $p$ is the transitive closure of $p_1 \cup p_2$.
3. $p$ is irreflexive, i.e., there is no $v \in \chi(t)$ such that $(v, v) \in p$.
4. $s = s_1 + s_2$.

It follows that $\mathcal{R}(t)$ can be computed by considering all pairs of records $R_1 \in \mathcal{R}(t_1)$ and $R_2 \in \mathcal{R}(t_2)$ and checking conditions 1 to 4. Since there are at most $O(\delta_f^{w+1})$ valid records for every $t \in V(T)$, and for every such pair the time required to check the conditions only depends on $w$, the result follows.

Lemma 2 (introduce node). Let $t$ be an introduce node with child $t'$ such that $\chi(t) = \chi(t') \cup \{v_0\}$. Then $\mathcal{R}(t)$ can be computed from $\mathcal{R}(t')$ in time $O(\delta_f^{w+1})$.

Proof. A record $R = (a, p, s)$ of $t$ is valid if and only if there is a set $P \in P_f(v_0)$ and a valid record $R' = (a', p', s') \in \mathcal{R}(t')$ such that:

1. $a(v_0) = P$.
2. For every $v \in \chi(t')$ it holds that $a(v) = a'(v)$.
3. $p$ is the transitive closure of the relation $p' \cup \{ (u, v_0) : u \in P \} \cup \{ (v_0, u) : v_0 \in a'(u), u \in \chi(t') \}$.
4. $p$ is irreflexive.
5. $s = s'$.

It follows that $\mathcal{R}(t)$ can be computed by checking for every pair $(P, R')$ as defined above whether it satisfies conditions 1 to 5. Since there are at most $\delta_f$ possible sets $P$ and at most $O(\delta_f^{w})$ possible valid records for $t'$ (observe that $|\chi(t')| \leq w$), the result follows from the fact that for every pair $(P, R')$ the conditions can be checked in time that only depends on $w$.

Lemma 3 (forget node). Let $t$ be a forget node with child $t'$ such that $\chi(t) = \chi(t') \setminus \{v_0\}$. Then $\mathcal{R}(t)$ can be computed from $\mathcal{R}(t')$ in time $O(\delta_f^{w+1})$.

Proof. A record $R = (a, p, s)$ of $t$ is valid if and only if there is a valid record $R' = (a', p', s') \in \mathcal{R}(t')$ such that:

1. $a$ and $p$ are the restrictions of $a'$ and $p'$ to $\chi(t)$, respectively. That is, $a(u) = a'(u)$ for all $u \in \chi(t)$, and $p = \{ (u, v) \in p' : u, v \in \chi(t) \}$.
2. $s = s' + f(v_0, a'(v_0))$.

Evidently $\mathcal{R}(t)$ can be computed from $\mathcal{R}(t')$ in time $O(\delta_f^{w+1})$.
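The join step of Lemma 1 can be sketched in a few lines of Python. This is only an illustration under assumptions of our own: a set of records is stored as a dictionary that maps a hashable pair (a, p) to the score s, where a is a frozenset of (vertex, parent set) pairs and p is a frozenset of ordered vertex pairs; the names `transitive_closure` and `join_records` are hypothetical. When several pairs of child records produce the same (a, p), the sketch keeps the largest sum, which is our reading of the requirement that a valid record carries the maximum score.

```python
from itertools import product
from typing import Dict, FrozenSet, Tuple

Pair = Tuple[str, str]
Assignment = FrozenSet[Tuple[str, FrozenSet[str]]]
Records = Dict[Tuple[Assignment, FrozenSet[Pair]], float]


def transitive_closure(pairs: FrozenSet[Pair]) -> FrozenSet[Pair]:
    """Return the transitive closure of a binary relation given as a set of pairs."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so the set can grow inside the loop.
        for (u, v), (x, w) in product(list(closure), repeat=2):
            if v == x and (u, w) not in closure:
                closure.add((u, w))
                changed = True
    return frozenset(closure)


def join_records(records1: Records, records2: Records) -> Records:
    """Combine the records of the two children of a join node (conditions 1 to 4)."""
    joined: Records = {}
    for (a1, p1), s1 in records1.items():
        for (a2, p2), s2 in records2.items():
            if a1 != a2:                       # condition 1: a = a1 = a2
                continue
            p = transitive_closure(p1 | p2)    # condition 2
            if any(u == v for (u, v) in p):    # condition 3: reject directed cycles
                continue
            s = s1 + s2                        # condition 4
            key = (a1, p)
            if s > joined.get(key, float("-inf")):
                joined[key] = s
    return joined


# Toy bag {x, y}: x has potential parent set {u}, y has potential parent set {z}.
a = frozenset({("x", frozenset({"u"})), ("y", frozenset({"z"}))})
child1 = {(a, frozenset({("x", "y")})): 2.0}   # path x -> z -> y below child 1
child2 = {
    (a, frozenset({("y", "x")})): 4.0,         # path y -> u -> x below child 2
    (a, frozenset()): 1.5,                     # no path between x and y
}
result = join_records(child1, child2)
```

In the toy data the first record of the second child would close a directed cycle through x and y and is rejected by the irreflexivity check, so only the combination with score 2.0 + 1.5 survives.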

We are now ready to prove Theorem 1.

Proof. Let $I = (V, f)$ be an instance of EXACT BAYESIAN STRUCTURE LEARNING where the super-structure $S_f$ has treewidth $w$ (a constant) and $|V| = n$. We compute a nice tree decomposition $(T, \chi, r)$ of $S_f$ of width $w$ and with $O(n)$ nodes. This can be accomplished in time $O(n)$ (see the discussion in Section 2.3). Next we compute the sets $\mathcal{R}(t)$ via a bottom-up traversal of $T$. For a leaf node $t$ we can compute $\mathcal{R}(t)$ just by considering all valid records for every possible strictly admissible DAG on the at most $w + 1$ vertices in $\chi(t)$. We can now use Lemmas 1, 2 and 3 to compute the sets $\mathcal{R}(t)$ for all other $O(n)$ tree nodes in time $O(\delta_f^{w+1} n)$. Since $\chi(r) = \emptyset$, the partial solutions for the root $r$ of $T$ are exactly the strictly admissible DAGs for $f$, and we have $f_r(D) = f(D)$ for each such DAG $D$. After the computation of the sets $\mathcal{R}(t)$ for all tree nodes $t$, the set $\mathcal{R}(r)$ contains exactly one record $R = (\emptyset, \emptyset, s)$. By the above considerations, it follows that $s$ is the largest score of all strictly admissible DAGs for $f$, and, as noted in Section 3, this is also the largest score of any DAG whose vertices belong to $V$. It is now easy to compute a DAG $D$ with score $f(D) = s$ via a top-down traversal of $T$ starting from $r$ and using the information previously stored at each node in $T$. This can also be accomplished in time $O(\delta_f^{w+1} n)$.

5 Hardness Results

Theorem 2 (Chickering 1996). EXACT BAYESIAN STRUCTURE LEARNING is NP-hard for instances with super-structures of maximum degree 4.

Proof. This theorem follows from Chickering's proof; we only sketch the argument. The reduction is from FEEDBACK ARC SET (FAS). The problem asks whether a digraph $D = (V, E)$ can be made acyclic by deleting at most $k$ arcs (the deleted arcs form a feedback arc set of $D$). The problem is NP-hard for digraphs with skeletons of maximum degree 4 (Karp 1972).
Given an instance $(D, k)$ of FAS, where the skeleton of $D$ has maximum degree 4, we construct a set $V' = V(D) \cup E(D)$ of nodes and a local score function $f$ on $V'$ by setting $f((u, v), \{u\}) = 1$ for all $(u, v) \in E(D)$, $f(v, \{ (u, v) : u \in P_D(v) \}) = |P_D(v)|$ for all $v \in V(D)$, and $f(v, P) = 0$ in all other cases. Clearly, the super-structure $S_f$ is the undirected graph obtained from the skeleton of $D$ after subdividing every edge once, hence the maximum degree of $S_f$ is at most 4. It is easy to see that $D$ has a feedback arc set of size at most $k$ if and only if there exists a DAG $D'$ whose skeleton is a spanning subgraph of $S_f$ with $f(D') \geq 2|E| - k$.

Theorem 3. EXACT BAYESIAN STRUCTURE LEARNING parameterized by the treewidth of the super-structure is W[1]-hard.

Proof. We devise a parameterized reduction from the following problem, which is well known to be W[1]-complete (Pietrzak 2003).

PARTITIONED CLIQUE
Instance: A k-partite graph $G = (V, E)$ with partition $V_1, \dots, V_k$ such that $|V_i| = |V_j| = n$ for $1 \leq i < j \leq k$.
Parameter: The integer $k$.
Question: Are there vertices $v_1, \dots, v_k$ such that $v_i \in V_i$ for $1 \leq i \leq k$ and $\{v_i, v_j\} \in E$ for $1 \leq i < j \leq k$? (The graph $K = (\{v_1, \dots, v_k\}, \{ \{v_i, v_j\} : 1 \leq i < j \leq k \})$ is a k-clique of $G$.)

Let $G = (V, E)$ be an instance of this problem with partition $V_1, \dots, V_k$, $|V_1| = \dots = |V_k| = n$. Let α = k 2 1 and ε = 2k. We construct a set $N$ of nodes and a local score function $f$ on $N$ such that (i) $\mathrm{tw}(S_f) \leq k(k-1)/2$ and (ii) $G$ has a k-clique if and only if there exists a DAG $D$ such that $f(D) \geq k(n-1)\alpha + (k(k-1)/2)\varepsilon$. See Figure 1 for an illustration.

[Figure 1: Illustration for the reduction in the proof of Theorem 3, k = 3.]

We set $A = \{ a_{ij} : 1 \leq i < j \leq k \}$, $N = V(G) \cup A$, and $A_i = \{ a_{lm} \in A : l = i \text{ or } m = i \}$ for every $1 \leq i \leq k$. We are now ready to define $f$. We set $f(v, A_i) = \alpha$ for every $v \in V_i$, and $f(a_{ij}, \{u, w\}) = \varepsilon$ for every $1 \leq i < j \leq k$, $u \in V_i$, $w \in V_j$, and $\{u, w\} \in E(G)$. Furthermore, we set $f(v, P) = 0$ for all the remaining combinations of $v$ and $P$.
It is easy to see claim (i), as deleting the $k(k-1)/2$ vertices $a_{ij}$ from $S_f$ yields a collection of isolated vertices, i.e., a graph of treewidth 0. Hence, it remains to show claim (ii). So suppose that $G$ has a k-clique $K = (\{v_1, \dots, v_k\}, E_K)$ such that $v_i \in V_i$ for every $1 \leq i \leq k$. It follows that for the DAG $D$ with arc set

$E(D) = \{ (v_i, a) : 1 \leq i \leq k, a \in A_i \} \cup \{ (a, v) : 1 \leq i \leq k, a \in A_i, v \in V_i \setminus \{v_i\} \}$

the following holds:

1. $f(v, P_D(v)) = 0$ for every $v \in V(K)$;
2. $f(v, P_D(v)) = \alpha$ for every $v \in V(G) \setminus V(K)$;
3. $f(a, P_D(a)) = \varepsilon$ for every $a \in A$.

Hence, $f(D) = k(n-1)\alpha + (k(k-1)/2)\varepsilon$ and the only-if direction of claim (ii) follows. To show the if direction of claim (ii), suppose that there exists a DAG $D$ such that $f(D) \geq k(n-1)\alpha + (k(k-1)/2)\varepsilon$. It can be shown that such a score can only be obtained if every vertex in $A$ attains its maximum score and exactly one vertex $v_i$ from every $V_i$ does not. It is then easy to see that the vertices $\{v_1, \dots, v_k\}$ form a k-clique in $G$ and the claim follows.

Note that in contrast to Theorem 2, it is essential for Theorem 3 that the super-structure has unbounded degree: if both degree and treewidth are bounded, then the problem is fixed-parameter tractable by Corollary 1 and so unlikely to be W[1]-hard.

6 k-Neighborhood Local Search

Important and widely used algorithms for Bayesian structure learning are based on local search methods (Heckerman, Geiger, & Chickering 1995). Usually the local search algorithm tries to improve the score of a given DAG by transforming it into a new DAG by adding, deleting, or reversing an arc (in symbols ADD, DEL, and REV, respectively). The main obstacle for local search methods is the danger of getting stuck at a poor local optimum. One possibility for decreasing this danger is to perform $k > 1$ elementary changes in one step, known as k-neighborhood local search, or k-local search for short. For Bayesian structure learning, when we try to improve the score of a DAG on $n$ nodes, the k-local search space is of order $n^{O(k)}$. Therefore, if carried out by brute force, k-local search is too costly even for small values of $k$. It is therefore not surprising that most practical local search algorithms for Bayesian structure learning consider 1-neighborhoods only.

In this section we investigate whether, under restrictions on the super-structure where EXACT BAYESIAN STRUCTURE LEARNING remains hard (as considered in Theorems 2 and 3), at least k-local search becomes easier compared to the general unrestricted case. Our results are mostly negative. In fact, somewhat surprisingly, k-local search remains hard even if arc reversal is the only allowed operation.

Before we give the hardness proofs we define k-local search more formally. Let $k \geq 0$ and $O \subseteq \{\text{ADD}, \text{DEL}, \text{REV}\}$. Consider a DAG $D = (V, E)$. A directed graph $D' = (V', E')$ is a k-O-neighbor of $D$ if

1. $D'$ is a DAG,
2. $V' = V$,
3. $E'$ can be obtained from $E$ by performing at most $k$ operations from the set $O$.

For $O \subseteq \{\text{ADD}, \text{DEL}, \text{REV}\}$ we consider the following parameterized decision problem.

k-O-LOCAL SEARCH BAYESIAN STRUCTURE LEARNING
Instance: A local score function $f$, a DAG $D$ that is admissible for $f$, and an integer $k$.
Question: Is there a k-O-neighbor $D'$ of $D$ with a higher score than $D$?

Note that the problem does not change if we require $D'$ to be admissible, as we can always avoid the addition of an inadmissible arc.

Theorem 4. If $O = \{\text{ADD}\}$ or $O = \{\text{DEL}\}$, then k-O-LOCAL SEARCH BAYESIAN STRUCTURE LEARNING is solvable in polynomial time.

Proof. We only consider $O = \{\text{ADD}\}$, as the proof for $O = \{\text{DEL}\}$ is analogous. Let $I = (D, f, k)$ be the given instance of k-{ADD}-LOCAL SEARCH BAYESIAN STRUCTURE LEARNING. Note that there exists a k-{ADD}-neighbor $D'$ of $D$ with $f(D') > f(D)$ if and only if there exists a vertex $v \in V(D)$ such that the addition of at most $k$ incoming arcs increases the score of $v$ and the resulting digraph remains acyclic. Now, for every entry $(v, P)$ such that $P \subseteq V(D) \setminus \{v\}$ one can easily check whether $f(v, P) > f(v, P_D(v))$ and whether $P$ can be obtained from $P_D(v)$ via the addition of at most $k$ incoming arcs such that the resulting digraph is acyclic.

In view of Theorem 4 let us define a set $O \subseteq \{\text{ADD}, \text{DEL}, \text{REV}\}$ to be non-trivial if $O \notin \{\emptyset, \{\text{ADD}\}, \{\text{DEL}\}\}$.

Theorem 5. Let $O \subseteq \{\text{ADD}, \text{DEL}, \text{REV}\}$ be non-trivial. Then k-O-LOCAL SEARCH BAYESIAN STRUCTURE LEARNING is W[1]-hard for parameter $\mathrm{tw}(S_f) + k$.

Proof. We slightly modify the reduction given in the proof of Theorem 3. Let $D$ be the directed acyclic graph with vertex set $N$ and arc set $\{ (a, v) : 1 \leq i \leq k, a \in A_i, v \in V_i \}$. We set $k' = k(k-1)/2$. Then, for every $O$ that contains the operation REV, it is easy to see, using the same arguments as in the proof of Theorem 3, that $G$ has a k-clique if and only if $D$ has a $k'$-O-neighbor $D'$ with $f(D') > f(D)$. Similarly, for the remaining case $O = \{\text{ADD}, \text{DEL}\}$, one can show that $G$ has a k-clique if and only if $D$ has a $2k'$-O-neighbor $D'$ with $f(D') > f(D)$.

Theorem 6. Let $O \subseteq \{\text{ADD}, \text{DEL}, \text{REV}\}$ be non-trivial. Then k-O-LOCAL SEARCH BAYESIAN STRUCTURE LEARNING is W[1]-hard for parameter $k$; hardness even holds if the super-structure $S_f$ has bounded maximum degree.

Proof. We devise a parameterized reduction from the following problem, which is known to be W[1]-complete for every constant $d \geq 3$ (Downey & Fellows 1999; Flum & Grohe 2006).

BOUNDED DEGREE RED/BLUE NON-BLOCKER
Instance: An undirected graph $G = (V, E)$ with maximum degree $d$, where $V$ is the disjoint union of sets Red and Blue, and an integer $k$.
Parameter: The integer $k$.

Question: Is there a set $S \subseteq \text{Red}$ of size $k$ such that every $v \in \text{Blue}$ has a neighbor outside of $S$? ($S$ is a k-red/blue non-blocker of $G$.)

Let $(G, \text{Red}, \text{Blue}, k)$ be an instance of this problem with $\text{Red} = \{v_r^1, \dots, v_r^n\}$ and $\text{Blue} = \{v_b^1, \dots, v_b^m\}$. We may assume that $G$ is bipartite with partition $\{\text{Red}, \text{Blue}\}$. To see this, observe that without affecting the answer we can remove every edge in $G$ between two vertices in Red, and similarly we can remove every vertex in Blue that has a neighbor in Blue. Let $k' = (d + 1)k + 1$. We construct a DAG $D$ and a local score function $f$ such that $G$ has a k-red/blue non-blocker if and only if $D$ has a $k'$-O-neighbor $D'$ with a higher score than $D$. The construction given below applies to all cases with $\text{REV} \in O$. For the only remaining non-trivial set $O = \{\text{DEL}, \text{ADD}\}$ it is easy to adapt the construction by setting $k'$ to $2((d + 1)k + 1)$.

To make the following arguments easier, it is convenient that all vertices in Red are of degree exactly $d$. Hence we introduce an intermediate graph $G'$ that is obtained from $G$ by adding $d - d(v)$ vertices for every $v \in \text{Red}$ and connecting each of these vertices by an edge to the corresponding $v$. The DAG $D$ is obtained from $G'$ by applying the following steps (see Figure 2 for an illustration):

[Figure 2: Illustration for the reduction in the proof of Theorem 6, k = 3. To improve readability, vertices in $V(G') \setminus V(G)$ and most of the arcs between the leaves of $T_1$ and $T_2$ and the vertices in Red are omitted.]

1. We replace every edge $\{v, w\}$ of $G'$ with $v \in \text{Red}$ by an arc $(v, w)$.
2. We add the complete binary tree $T_1$ of lowest height with at least $n$ leaves, with edges directed away from the root. Let $r_1$ denote the root and $l_1^1, \dots, l_1^n$ leaves of $T_1$.
3. We add the complete binary tree $T_2$ of lowest height with at least $n$ leaves, with edges directed towards the root. Let $r_2$ denote the root and $l_2^1, \dots, l_2^n$ leaves of $T_2$.
4. For every $1 \leq i \leq n$ we add the arcs $(v_r^i, l_1^i)$ and $(v_r^i, l_2^i)$, running between $G'$ and the trees $T_1$ and $T_2$.
5. We add the arc $(r_2, r_1)$.

This completes the construction of $D$. Next we define the local score function $f$ for $V = V(D)$. Let $\alpha = k - 1$, $\beta = n$ and $\varepsilon = 1$. Furthermore, for $v \in V(G')$ we write $N_{G'}(v) = \{ u \in V(G') : \{u, v\} \in E(G') \}$.

(i) For every $v_r^i \in \text{Red}$ we set $f(v_r^i, N_{G'}(v_r^i) \cup \{l_1^i\}) = \varepsilon$.
(ii) For every $v_b^i \in \text{Blue}$ and $\emptyset \neq P \subseteq N_{G'}(v_b^i)$ we set $f(v_b^i, P) = \beta$.
(iii) For every $v \in V(D) \setminus (V(G') \cup \{r_2, l_1^1, \dots, l_1^n\})$ we set $f(v, P_D(v)) = \alpha$.
(iv) For the root of $T_2$ we set $f(r_2, P_D(r_2)) = f(r_2, P_D(r_2) \cup \{r_1\}) = \alpha$.
(v) For every $l_1^i$ we set $f(l_1^i, P_D(l_1^i)) = f(l_1^i, P_D(l_1^i) \setminus \{v_r^i\}) = \alpha$.
(vi) For all the remaining combinations of $v \in V(D)$ and $P \subseteq V(D)$ we set $f(v, P) = 0$.

Evidently $D$ is acyclic and both $D$ and $f$ can be constructed from $G$ in polynomial time. Observe that the super-structure $S_f$ is exactly the skeleton of $D$. Hence, by construction, the degree of every vertex of $S_f$ is bounded by $d + 2$. It remains to show that $G$ has a k-red/blue non-blocker if and only if $D$ has a $k'$-O-neighbor $D'$ with a higher score than $D$.

To see this, we first assume that $G$ contains a k-red/blue non-blocker $S \subseteq \text{Red}$ with $|S| = k$. We obtain $D'$ from $D$ by reversing the $k'$ arcs in $\{ (v_r^i, w) : v_r^i \in S, w \in N_{G'}(v_r^i) \cup \{l_1^i\} \} \cup \{(r_2, r_1)\}$. Note that the reversal of the arc $(r_2, r_1)$ ensures that $D'$ is acyclic, and since $S$ is a k-red/blue non-blocker in $G$ it follows that the score of every vertex in Blue does not change. Hence $f(D') = f(D) - \alpha + k\varepsilon = f(D) + 1 > f(D)$.

To see the reverse direction, note that the vertices in Red are the only vertices of $D$ whose score is not yet maximum. Increasing the score of any of these vertices $v \in \text{Red}$ introduces a cycle that uses only vertices in $V(T_1) \cup V(T_2) \cup \{v\}$.
It is easy to see that in order to break this cycle the score of at least one vertex in $V(T_1) \cup V(T_2)$ has to be decreased by $\alpha$, and that all cycles produced in this way can be destroyed by reversing the arc $(r_2, r_1)$. Since $\alpha = (k - 1)\varepsilon$, it follows that in order to increase the score of $D$ the score of at least $k$ vertices in Red must be increased to $\varepsilon$. Let $S$ be the set of these vertices in Red whose score has been increased in this manner. Since for every vertex in $S$ exactly $d + 1$ arcs need to be reversed and $k' < (d + 1)(k + 1)$, it follows that $|S| \leq k$ and hence $|S| = k$. Because $\beta = n\varepsilon$, it follows that all vertices in Blue must have kept their score and hence $S$ is a k-red/blue non-blocker for $G$.

Theorem 6 provides a surprising contrast to a similar study of k-local search for MAX-SAT, where the problem is fixed-parameter tractable for instances of bounded degree (Szeider 2009). A possible explanation for the surprising hardness of k-O-LOCAL SEARCH BAYESIAN STRUCTURE LEARNING could be that, in contrast to MAX-SAT, a global property of the entire instance (acyclicity) must be checked.
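To make the $n^{O(k)}$ brute-force baseline mentioned at the start of this section concrete, the following Python sketch enumerates all arc sets reachable from a DAG by at most k operations and keeps the best acyclic one. It is a naive illustration, not an algorithm from this paper: the dictionary encoding of the local score function (missing entries read as 0), the decision to let intermediate digraphs contain cycles, and the names `k_local_search` and `one_step` are all assumptions made here.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Set, Tuple

Arc = Tuple[str, str]
ArcSet = FrozenSet[Arc]
LocalScore = Dict[Tuple[str, FrozenSet[str]], float]


def is_acyclic(nodes: Iterable[str], arcs: Iterable[Arc]) -> bool:
    """Check acyclicity by repeatedly removing vertices without incoming arcs."""
    remaining, arc_set = set(nodes), set(arcs)
    while remaining:
        sources = {v for v in remaining
                   if not any(w == v and u in remaining for (u, w) in arc_set)}
        if not sources:
            return False
        remaining -= sources
    return True


def score(nodes: Iterable[str], arcs: ArcSet, f: LocalScore) -> float:
    """Sum of local scores f(v, P_D(v)); unlisted pairs count as 0 (assumption)."""
    return sum(f.get((v, frozenset(u for (u, w) in arcs if w == v)), 0.0)
               for v in nodes)


def one_step(nodes: List[str], arcs: ArcSet, ops: Set[str]) -> Set[ArcSet]:
    """All arc sets obtained from `arcs` by a single allowed operation."""
    out: Set[ArcSet] = set()
    if "ADD" in ops:
        out |= {arcs | {(u, v)} for u in nodes for v in nodes
                if u != v and (u, v) not in arcs}
    if "DEL" in ops:
        out |= {arcs - {e} for e in arcs}
    if "REV" in ops:
        out |= {(arcs - {(u, v)}) | {(v, u)} for (u, v) in arcs}
    return out


def k_local_search(nodes: List[str], arcs: Iterable[Arc], f: LocalScore,
                   k: int, ops: Set[str]) -> Tuple[Optional[ArcSet], float]:
    """Return (best improving neighbor or None, its score) over at most k operations."""
    start = frozenset(arcs)
    best, best_score = None, score(nodes, start, f)
    frontier, seen = {start}, {start}
    for _ in range(k):
        nxt = {nb for cur in frontier for nb in one_step(nodes, cur, ops)} - seen
        seen |= nxt
        for cand in nxt:
            # Only the final digraph has to be a DAG.
            if is_acyclic(nodes, cand) and score(nodes, cand, f) > best_score:
                best, best_score = cand, score(nodes, cand, f)
        frontier = nxt
    return best, best_score


# Toy instance (hypothetical data): the chain a -> b -> c with score 2.
nodes = ["a", "b", "c"]
f = {("a", frozenset({"b"})): 2.0, ("b", frozenset({"c"})): 2.0,
     ("b", frozenset({"a"})): 1.0, ("c", frozenset({"b"})): 1.0}
chain = {("a", "b"), ("b", "c")}
```

On the toy chain with reversal as the only operation, one reversal reaches score 3 and two reversals reach score 4. The number of arc sets explored grows with the number of nodes raised to a power proportional to k, which is the non-uniform behaviour that Theorems 5 and 6 suggest cannot be avoided in general.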

network        n      m     w     d
link         724   1125    16    17
alarm         37     46     3     6
carpo         61     74     4    12
barley        48     84     6     8
hailfinder    56     66     3    17
diabetes     413    602     4    24
insurance     27     52     6     9
win95pts      76    112     5    10
mildew        35     46     3     5
munin1       189    282    11    15
munin2      1003   1244     6    30
munin3      1044   1315     8    69
munin4      1041   1397     8    69
pigs         441    592     9    41
water         32     66     7     8

Table 1: Bayesian networks from http://compbio.cs.huji.ac.il/Repository/. n = number of nodes, m = number of edges, w = upper bound on the treewidth, d = maximum degree. All parameters refer to the skeleton of the network.

7 Conclusion

We have studied the computational complexity of exact Bayesian structure learning under graph-theoretic restrictions on the super-structure. Our results show that exact learning is linear-time tractable if the super-structure has bounded treewidth and bounded maximum degree, but neither of the two restrictions can be dropped without losing linear-time tractability (or uniform polynomial-time tractability). Our algorithm is based on dynamic programming along a tree decomposition of the super-structure.

We have focused on theoretical worst-case complexity results, leaving an empirical evaluation of the algorithm on real-world data for future research. As a first step in that direction we have computed treewidth and maximum degree of the skeletons of some well-known benchmark networks and found relatively small numbers, see Table 1. We take this as an encouraging indication that from a practical point of view it makes sense to consider super-structures of small treewidth and small maximum degree. In fact, it is desirable to learn networks of small treewidth and small maximum degree, as such networks allow efficient reasoning.

On the theoretical side we offer as an objective for future research the identification of other graph-theoretic parameters that allow efficient exact structure learning.
In particular, it would be interesting to identify parameters that, in contrast to treewidth and maximum degree, separate the (parameterized) complexities of finding globally optimal networks from improving networks locally by k-neighborhood local search.

References

Anton, C., and Carlos, G. 2007. Efficient principled learning of thin junction trees. In Advances in Neural Information Processing Systems (NIPS 2007).
Bodlaender, H. L. 1996. A linear-time algorithm for finding tree-decompositions of small treewidth. SIAM J. Comput. 25(6):1305-1317.
Chickering, D. M. 1995. A transformational characterization of equivalent Bayesian network structures. In UAI 1995.
Chickering, D. M. 1996. Learning Bayesian networks is NP-complete. In Learning from Data: Artificial Intelligence and Statistics, Springer Verlag. 121-130.
Chickering, D. M. 2003. Optimal structure identification with greedy search. J. Mach. Learn. Res. 3(3):507-554.
Downey, R. G., and Fellows, M. R. 1999. Parameterized Complexity. Springer Verlag.
Flum, J., and Grohe, M. 2006. Parameterized Complexity Theory. Springer Verlag.
Gogate, V., and Dechter, R. 2004. A complete anytime algorithm for treewidth. In UAI 2004.
Heckerman, D.; Geiger, D.; and Chickering, D. M. 1995. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning 20(3):197-243.
Impagliazzo, R.; Paturi, R.; and Zane, F. 2001. Which problems have strongly exponential complexity? J. Comput. System Sci. 63(4):512-530.
Karp, R. M. 1972. Reducibility among combinatorial problems. In Complexity of Computer Computations. 85-103.
Kloks, T. 1994. Treewidth: Computations and Approximations. Springer Verlag.
Mukund, N., and Jeff, B. 2004. PAC-learning bounded tree-width graphical models. In UAI 2004.
Parviainen, P., and Koivisto, M. 2009. Exact structure discovery in Bayesian networks with less space. In UAI 2009.
Perrier, E.; Imoto, S.; and Miyano, S. 2008. Finding optimal Bayesian network given a super-structure. J. Mach. Learn. Res. 9:2251-2286.
Pieter, A.; Daphne, K.; and Andrew, Y. N. 2006. Learning factor graphs in polynomial time and sample complexity. J. Mach. Learn. Res. 7:1743-1788.
Pietrzak, K. 2003. On the parameterized complexity of the fixed alphabet shortest common supersequence and longest common subsequence problems. J. Comput. System Sci. 67(4):757-771.
Silander, T., and Myllymäki, P. 2006. A simple approach for finding the globally optimal Bayesian network structure. In UAI 2006.
Szeider, S. 2009. The Parameterized Complexity of k-flip Local Search for SAT and MAX SAT. In SAT 2009. Lecture Notes in Computer Science 5584, Springer Verlag, 276-283.