Supplementary to "Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data"

Luping Zhou, Lei Wang, Lingqiao Liu, Philip Ogunbona, and Dinggang Shen

Proposition 1. Given a sparse Gaussian Bayesian network parameterized by Θ and its associated directed graph G with m nodes, the graph G is a DAG if and only if there exist some o_i (i = 1, ..., m) and Υ ∈ R^{m×m} such that, for an arbitrary ε > 0, the following constraints are satisfied:

    o_j − o_i ≥ ε − Υ_ij,   ∀ i, j ∈ {1, ..., m}, i ≠ j,    (1a)
    Υ_ij ≥ 0,    (1b)
    Υ_ij Θ_ij = 0,    (1c)
    0 ≤ o_i ≤ mε.    (1d)

Proof. As is known, a Bayesian network is equivalent to a topological ordering (Chapter 8, Section 8.1 on Page 362 in [1]). Therefore, we prove Proposition 1 by showing that (i) Eqns. (1a)–(1d) lead to a topological ordering (the necessary condition), and (ii) a topological ordering from a DAG can meet the requirements in Eqns. (1a)–(1d) (the sufficient condition).

First, we prove the necessary condition by contradiction (Fig. 1). We consider three cases for two nodes j and i.

Case 1) The nodes j and i are directly connected. If there is an edge from node i to node j, the parameter Θ_ij is non-zero, and thus Υ_ij must be zero by Eqn. (1c). According to Eqn. (1a), we then have o_j > o_i. If, at the same time, there is an edge from node j to node i, we similarly have o_i > o_j, which contradicts o_j > o_i and is therefore impossible.

Case 2) The nodes j and i are not directly linked but are connected
by a path. Suppose there is a directed path P_1 from node i to node j, where P_1 is composed of nodes k_1, k_2, ..., k_{m_1} in order. Following the above argument, we have o_j > o_{k_{m_1}} > ... > o_{k_1} > o_i. If, at the same time, another directed path P_2 links node j to node i, where P_2 is composed of nodes l_1, l_2, ..., l_{m_2} in order, we similarly have o_i > o_{l_{m_2}} > ... > o_{l_1} > o_j, making a contradiction.

Case 3) If there is no edge between node i and node j, by definition Θ_ij = 0. It is straightforward to see that Eqn. (1b) and Eqn. (1c) hold for any arbitrary non-negative Υ_ij. Moreover, for any o_i and o_j satisfying Eqn. (1d), we can show that as long as Υ_ij ≥ (m+1)ε (which is positive), Eqn. (1a) will always hold. This is further explained as follows. By Eqn. (1d), we have o_j − o_i ≥ −mε. For Eqn. (1a) to always hold, we need some Υ_ij such that o_j − o_i ≥ ε − Υ_ij, which requires Υ_ij ≥ ε − (o_j − o_i); in the worst case, this is Υ_ij ≥ (m+1)ε. Therefore, there exist a set of o_i and Υ valid for Eqns. (1a)–(1d) when no edge links node i and node j.

In sum, Eqns. (1a)–(1d) induce a topological ordering; that is, if node j comes after node i (i.e., o_j > o_i) in the ordering, there cannot be a link from node j to node i, which guarantees acyclicity.

Figure 1: Explanation of our ordering-based DAG constraint.

Now let us consider the sufficient condition. If G is a DAG, we can obtain some topological ordering (1, 2, ..., m) from it. Let õ_i be the index of node i in this ordering. Setting o_i = (õ_i − 1)ε (∀ i ∈ {1, ..., m}), we have min(o_i) = (1 − 1)ε = 0 and max(o_i) = (m − 1)ε. If node j comes after node i, we have o_j − o_i ≥ ε ≥ ε − Υ_ij. If node j comes before node i, there is no edge from node i to node j (so Θ_ij = 0), and we can always set Υ_ij sufficiently large to satisfy Eqns. (1a)–(1d). Therefore, from a DAG, we can always construct a set of ordering variables that satisfy Eqns. (1a)–(1d).

Combining the proofs above, Eqns. (1a)–(1d) are the sufficient and necessary condition for a directed graph G to be a DAG.

Proposition 2. The optimization problem in Eqn. (2) (i.e., Eqn. (4.2) in the paper) is iteratively solved by alternate optimizations of (i) o and Υ with Θ fixed, and (ii) Θ with o and Υ fixed. This optimization converges, and the output Θ is a DAG when

    λ_dag > (2(m−2)(n−1)² + λ_1(2n − 2 − λ_1)) / (λ_1 (m+1) ε),

where m is the number of nodes and n is the number of samples.

    min_{Θ, o, Υ}  Σ_{i=1}^{m} ( ‖x_{:,i} − PA_i θ_i‖₂² + λ_1 ‖θ_i‖₁ + λ_dag ɛ_i^T |θ_i| )    (2)
    s.t.  o_j − o_i ≥ ε − Υ_ij,  ∀ i, j ∈ {1, ..., m}, i ≠ j,
          0 ≤ o_i ≤ mε,  Υ_ij ≥ 0.

Here o and Υ are the variables defined in the DAG constraint in Section 4.2, and Θ is the model parameters of the SGBN. The vector ɛ_i denotes the i-th column of the matrix Υ, and |θ_i| the component-wise absolute value of the i-th column of Θ. Other parameters are defined in Table 1 in the paper.

Proof. In the following, we prove that:

1. The alternate optimization in Eqn. (2) converges.
2. The solution Θ of Eqn. (2) is a DAG when λ_dag is sufficiently large.

Let us denote f(Θ, o, Υ) = Σ_{i=1}^{m} ( ‖x_{:,i} − PA_i θ_i‖₂² + λ_1 ‖θ_i‖₁ + λ_dag ɛ_i^T |θ_i| ).

First, we prove that Eqn. (2) converges by showing that (i) f(Θ, o, Υ) is lower bounded; and (ii) f(Θ^{(t+1)}, o^{(t+1)}, Υ^{(t+1)}) ≤ f(Θ^{(t)}, o^{(t)}, Υ^{(t)}), meaning that the function value monotonically decreases with the iteration number t. It is easy to see that f(Θ, o, Υ) is lower bounded by 0, since each term in f(Θ, o, Υ) is non-negative. The second point can be proven as follows.
At the t-th iteration, we update Θ by

    Θ^{(t+1)} = arg min_Θ Σ_{i=1}^{m} ( ‖x_{:,i} − PA_i θ_i‖₂² + λ_1 ‖θ_i‖₁ + λ_dag (ɛ_i^{(t)})^T |θ_i| )    (3)
              = arg min_Θ f(Θ, o^{(t)}, Υ^{(t)}).

It holds that f(Θ^{(t+1)}, o^{(t)}, Υ^{(t)}) ≤ f(Θ^{(t)}, o^{(t)}, Υ^{(t)}). It is also noted that Θ^{(t+1)} is an achievable global minimum over Θ, since f(Θ, o^{(t)}, Υ^{(t)}) is a convex function with respect to Θ. Similarly, we then update o and Υ by

    {o^{(t+1)}, Υ^{(t+1)}} = arg min_{o, Υ} f(Θ^{(t+1)}, o, Υ)    (4)
    s.t.  o_j − o_i ≥ ε − Υ_ij,  ∀ i, j ∈ {1, ..., m}, i ≠ j,
          0 ≤ o_i ≤ mε,  Υ_ij ≥ 0.

It holds that f(Θ^{(t+1)}, o^{(t+1)}, Υ^{(t+1)}) ≤ f(Θ^{(t+1)}, o^{(t)}, Υ^{(t)}). Also, f(Θ^{(t+1)}, o, Υ) is a linear function with respect to o and Υ. Consequently, we have

    f(Θ^{(t+1)}, o^{(t+1)}, Υ^{(t+1)}) ≤ f(Θ^{(t+1)}, o^{(t)}, Υ^{(t)}) ≤ f(Θ^{(t)}, o^{(t)}, Υ^{(t)}).

Therefore, the optimization problem in Eqn. (2) is guaranteed to converge under the alternate optimization strategy, because the objective function is lower bounded and monotonically decreases with the iteration number.

Second, we prove that when λ_dag > (2(m−2)(n−1)² + λ_1(2n − 2 − λ_1)) / (λ_1 (m+1) ε), the output Θ is guaranteed to be a DAG. This can be proven by contradiction. Suppose that such a λ_dag does not lead to a DAG, that is, Υ_ji Θ_ji ≠ 0 for at least one pair of nodes i and j, with Θ_ji ≠ 0 and Υ_ji > 0. Without loss of generality, we assume Υ_ji ≥ (m+1)ε (where ε is an arbitrary positive number), so the ordering constraints in Eqn. (2) always hold regardless of the variables o_i and o_j. This is because o_i and o_j are constrained by 0 ≤ o_i ≤ mε and 0 ≤ o_j ≤ mε, so that o_j − o_i ≥ −mε = ε − (m+1)ε ≥ ε − Υ_ji. Based on the first-order optimality condition, Θ_ji ≠ 0 if and only if

    | 2 (x_{:,i} − PA_{i(\j,:)} θ_{i\j})^T x_{:,j} | − (λ_1 + λ_dag Υ_ji) > 0.

Here, PA_{i(\j,:)} denotes the elements in the matrix PA_i with the j-th row removed (i.e., the parents of node i without node j), and θ_{i\j} denotes
the elements in θ_i without Θ_ji. However, it can be shown that

    | (x_{:,i} − PA_{i(\j,:)} θ_{i\j})^T x_{:,j} |
        ≤ | x_{:,i}^T x_{:,j} | + | θ_{i\j}^T PA_{i(\j,:)}^T x_{:,j} |    (5)
        = | x_{:,i}^T x_{:,j} | + | Σ_{k=1, k≠i,j}^{m} Θ_ki x_{:,k}^T x_{:,j} |
        ≤ (n−1) + (m−2)(n−1) max_k |Θ_ki|
        ≤ (n−1) + (m−2)(n−1)² / λ_1.

The second-last inequality holds due to the normalization of the features x_{:,i} (to zero mean and unit standard deviation), which gives |x_{:,k}^T x_{:,j}| ≤ n−1 for any k and j. The last inequality holds because

    max_k |Θ_ki| ≤ ‖θ_i‖₁ ≤ (1/λ_1)( ‖x_{:,i} − PA_i θ_i‖₂² + λ_1 ‖θ_i‖₁ + λ_dag ɛ_i^T |θ_i| ) ≤ (1/λ_1) x_{:,i}^T x_{:,i} = (n−1)/λ_1,

where the last step follows since the minimizing θ_i makes the i-th term of f(Θ, o, Υ) no larger than its value at θ_i = 0, which is x_{:,i}^T x_{:,i}. With the given λ_dag, Eqn. (5) results in

    | 2 (x_{:,i} − PA_{i(\j,:)} θ_{i\j})^T x_{:,j} | − (λ_1 + λ_dag Υ_ji) < 0,

which contradicts the first-order optimality condition above for Θ_ji ≠ 0. Therefore, when λ_dag is sufficiently large, the output Θ is guaranteed to be a DAG.

Summing up the proofs above, the alternate optimization of Eqn. (2) converges, and the output Θ is guaranteed to be a DAG when λ_dag is sufficiently large.

References

[1] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2007.
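The ordering-based construction in Proposition 1 is easy to check numerically. Below is a minimal sketch (our own illustration, not code from the paper), assuming ε = 1: it builds o_i = (õ_i − 1)ε from a topological order, sets Υ_ij = (m+1)ε on non-edges (Υ_ij is forced to 0 on edges by Eqn. (1c)), and verifies constraints (1a) and (1d).

```python
import numpy as np

def ordering_variables(order, eps=1.0):
    """Given a topological order (a permutation of node indices),
    set o_i = (position_of_i - 1) * eps as in the sufficiency proof."""
    o = np.empty(len(order))
    for pos, node in enumerate(order):  # pos is 0-based, i.e. (õ_i - 1)
        o[node] = pos * eps
    return o

def satisfies_dag_constraints(Theta, o, eps=1.0):
    """Check Eqns. (1a) and (1d), choosing Upsilon as the proof suggests:
    Upsilon_ij = 0 whenever Theta_ij != 0 (forced by Eqn. (1c)),
    Upsilon_ij = (m+1)*eps otherwise (always sufficient on non-edges)."""
    m = len(o)
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            ups = 0.0 if Theta[i, j] != 0 else (m + 1) * eps
            if o[j] - o[i] < eps - ups:  # Eqn. (1a) violated
                return False
    return all(0 <= oi <= m * eps for oi in o)  # Eqn. (1d)

# Edges of a small DAG: Theta[i, j] != 0 means an edge i -> j.
Theta = np.zeros((3, 3))
Theta[0, 1] = 0.7   # 0 -> 1
Theta[1, 2] = -0.3  # 1 -> 2
o = ordering_variables([0, 1, 2])
print(satisfies_dag_constraints(Theta, o))  # True: the ordering certifies acyclicity

# Adding the edge 2 -> 0 creates a cycle 0 -> 1 -> 2 -> 0.
Theta[2, 0] = 0.5
print(satisfies_dag_constraints(Theta, o))  # False for this o
```

The second check fails because once the edge 2 → 0 closes a cycle, no assignment of o can place every child after its parents, which is exactly the contradiction used in the necessity part of the proof.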