Supplementary to Learning Discriminative Bayesian Networks from High-dimensional Continuous Neuroimaging Data

Luping Zhou, Lei Wang, Lingqiao Liu, Philip Ogunbona, and Dinggang Shen

Proposition 1. Given a sparse Gaussian Bayesian Network parameterized by Θ and its associated directed graph G with m nodes, the graph G is a DAG if and only if there exist some o_i (i = 1, ..., m) and Υ ∈ R^{m×m} such that, for an arbitrary ε > 0, the following constraints are satisfied:

$$
\begin{aligned}
& o_j - o_i \ge \epsilon - \Upsilon_{ij}, \quad \forall\, i, j \in \{1, \dots, m\},\ i \neq j, \quad &\text{(1a)}\\
& \Upsilon_{ij} \ge 0, &\text{(1b)}\\
& \Upsilon_{ij}\,\Theta_{ij} = 0, &\text{(1c)}\\
& m\epsilon \ge o_i \ge 0. &\text{(1d)}
\end{aligned}
$$

Proof. As is known, the structure of a Bayesian network (a DAG) is equivalent to the existence of a topological ordering of its nodes (Chapter 8, Section 8.1, Page 362 in [1]). Therefore, we prove Proposition 1 by showing that i) Eqn. (1a)-(1d) lead to a topological ordering (the necessary condition), and ii) a topological ordering obtained from a DAG can meet the requirements in Eqn. (1a)-(1d) (the sufficient condition).

First, we prove the necessary condition by contradiction (Fig. 1). We consider three cases for two nodes j and i.

Case 1) The nodes j and i are directly connected. If there is an edge from node i to node j, the parameter Θ_ij is non-zero, and thus Υ_ij must be zero. According to Eqn. (1a), we then have o_j > o_i. If, at the same time, there is an edge from node j to node i, we similarly have o_i > o_j, which contradicts o_j > o_i and is therefore impossible.

Case 2) The nodes j and i are not directly linked but are connected by a path. Suppose there is a directed path P_1 from node i to node j, where P_1 is composed of nodes k_1, k_2, ..., k_{m_1} in order. Following the argument of Case 1, we have o_j > o_{k_{m_1}} > ... > o_{k_1} > o_i. If, at the same time, another directed path P_2 links node j to node i, where P_2 is composed of nodes l_1, l_2, ..., l_{m_2} in order, we similarly have o_i > o_{l_{m_2}} > ... > o_{l_1} > o_j, which again yields a contradiction.

Case 3) If there is no edge between node i and node j, by definition Θ_ij = 0. It is straightforward to see that Eqn. (1b) and Eqn. (1c) hold for any arbitrary non-negative Υ_ij. Moreover, for any o_i and o_j satisfying Eqn. (1d), as long as Υ_ij ≥ (m+1)ε (which is positive), Eqn. (1a) will always hold. This is further explained as follows. By Eqn. (1d), we have o_j − o_i ≥ −mε. For Eqn. (1a) to always hold, we need some Υ_ij such that o_j − o_i ≥ ε − Υ_ij, which requires Υ_ij ≥ (m+1)ε. Therefore, there exists a set of o_i and Υ valid for Eqn. (1a)-(1d) when no edge links node i and node j.

In sum, Eqn. (1a)-(1d) encode a topological ordering: if node j comes after node i in the ordering (that is, o_j > o_i), there cannot be a link from node j to node i, which guarantees acyclicity.

Figure 1: Explanation of our ordering-based DAG constraint.

Now let us consider the sufficient condition. If G is a DAG, we can obtain some topological ordering (1, 2, ..., m) from it. Let õ_i be the index of node i in this ordering. Setting o_i = (õ_i − 1)ε (∀ i ∈ {1, ..., m}), we have min_i(o_i) = (1 − 1)ε = 0 and max_i(o_i) = (m − 1)ε, so Eqn. (1d) is satisfied. If node j comes after node i, we have o_j − o_i ≥ ε ≥ ε − Υ_ij. If node j comes before node i, we can always set Υ_ij sufficiently large to satisfy Eqn. (1a)-(1d). Therefore, from a DAG we can always construct a set of ordering variables that satisfy Eqn. (1a)-(1d).

Combining the proofs above, Eqn. (1a)-(1d) constitute a sufficient and necessary condition for a directed graph G to be a DAG.
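To make the construction used in the proof concrete, the following NumPy sketch (illustrative only, not part of the original paper; the function names, the toy 3-node chain, and the choice ε = 0.01 are ours) builds the ordering variables o and the slack matrix Υ from a topological order of a DAG encoded by Θ, with Θ_ij ≠ 0 meaning an edge from node i to node j, and then checks Eqn. (1a)-(1d).

```python
# Illustrative sketch (not from the paper): construct the certificate (o, Upsilon)
# of Proposition 1 from a topological order of a DAG, then verify Eqn. (1a)-(1d).
import numpy as np

def certificate_from_topological_order(Theta, order, eps=1e-2):
    """Return (o, Upsilon) satisfying Eqn. (1a)-(1d) for a DAG Theta.

    `order[r]` is the node placed at rank r in a topological order of the graph
    encoded by Theta (Theta[i, j] != 0 means an edge i -> j).
    """
    m = Theta.shape[0]
    rank = np.empty(m, dtype=int)
    rank[np.asarray(order)] = np.arange(m)       # rank of node i is (o-tilde_i - 1)
    o = rank * eps                               # o_i = (o-tilde_i - 1) * eps, so 0 <= o_i <= (m-1)*eps
    Upsilon = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i != j and Theta[i, j] == 0:
                Upsilon[i, j] = (m + 1) * eps    # large slack wherever there is no edge i -> j
    return o, Upsilon

def satisfies_constraints(Theta, o, Upsilon, eps=1e-2, tol=1e-12):
    """Check Eqn. (1a)-(1d) for every ordered pair (i, j) with i != j."""
    m = Theta.shape[0]
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            ok = (o[j] - o[i] >= eps - Upsilon[i, j] - tol      # (1a)
                  and Upsilon[i, j] >= -tol                     # (1b)
                  and abs(Upsilon[i, j] * Theta[i, j]) <= tol)  # (1c)
            if not ok:
                return False
    return bool(np.all((o >= -tol) & (o <= m * eps + tol)))     # (1d)

# Toy usage: the 3-node chain 0 -> 1 -> 2 with topological order (0, 1, 2).
Theta = np.array([[0., 1., 0.],
                  [0., 0., 1.],
                  [0., 0., 0.]])
o, Ups = certificate_from_topological_order(Theta, order=[0, 1, 2])
print(satisfies_constraints(Theta, o, Ups))   # True
```

Conversely, if Θ contains a directed cycle, no assignment of o can satisfy Eqn. (1a) along that cycle, which is exactly the necessity argument above.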

Proposition 2. The optimization problem in Eqn. (2) (i.e., Eqn. (4.2) in the paper) is iteratively solved by alternate optimizations of (i) o and Υ with Θ fixed, and (ii) Θ with o and Υ fixed. This optimization converges, and the output Θ is a DAG when

$$
\lambda_{\mathrm{dag}} > \frac{2(m-2)(n-1)^2 + \lambda_1(2n - 2 - \lambda_1)}{\lambda_1 (m+1)\epsilon},
$$

where m is the number of nodes and n is the number of samples.

$$
\begin{aligned}
\min_{\Theta, o, \Upsilon} \quad & \sum_{i=1}^{m} \Big( \|x_{:,i} - \mathrm{PA}_i \theta_i\|_2^2 + \lambda_1 \|\theta_i\|_1 + \lambda_{\mathrm{dag}}\, \varepsilon_i^{\top} |\theta_i| \Big) \quad &\text{(2)}\\
\text{s.t.} \quad & o_j - o_i \ge \epsilon - \Upsilon_{ij}, \quad \forall\, i, j \in \{1, \dots, m\},\ i \neq j, &\\
& 0 \le o_i \le m\epsilon, \qquad \Upsilon_{ij} \ge 0. &
\end{aligned}
$$

Here o and Υ are the variables defined in the DAG constraint in Section 4.2, and Θ collects the model parameters of the SGBN. The vector ɛ_i denotes the i-th column of the matrix Υ, and |θ_i| the component-wise absolute value of the i-th column θ_i of Θ. Other parameters are defined in Table 1 in the paper.
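For readers who prefer code to notation, here is a small NumPy sketch (illustrative only, not the authors' implementation) that evaluates the objective of Eqn. (2) and the λ_dag threshold of Proposition 2. It assumes, purely for illustration, that the candidate parent set PA_i of node i consists of all the other nodes and that the columns of the data matrix X have been standardized.

```python
# Illustrative sketch (not the authors' code) of the objective in Eqn. (2) and of
# the lambda_dag threshold in Proposition 2.  Assumption (ours): the candidate
# parents PA_i of node i are all the other nodes, so theta_i collects Theta[k, i]
# for k != i.  The ordering variables o enter Eqn. (2) only through the
# constraints, so they do not appear in the objective value.
import numpy as np

def sgbn_objective(X, Theta, Upsilon, lam1, lam_dag):
    """Objective of Eqn. (2): X is n-by-m (standardized columns); Theta, Upsilon are m-by-m."""
    n, m = X.shape
    total = 0.0
    for i in range(m):
        mask = np.arange(m) != i
        PA_i = X[:, mask]                      # data of the candidate parents of node i
        theta_i = Theta[mask, i]               # i-th column of Theta (regression weights)
        ups_i = Upsilon[mask, i]               # i-th column of Upsilon (the vector eps_i)
        r = X[:, i] - PA_i @ theta_i
        total += r @ r                                 # squared reconstruction error
        total += lam1 * np.abs(theta_i).sum()          # ell_1 sparsity penalty
        total += lam_dag * ups_i @ np.abs(theta_i)     # weighted DAG penalty eps_i^T |theta_i|
    return total

def lambda_dag_threshold(m, n, lam1, eps):
    """Value that lambda_dag must exceed in Proposition 2 (m nodes, n samples)."""
    return (2 * (m - 2) * (n - 1) ** 2 + lam1 * (2 * n - 2 - lam1)) / (lam1 * (m + 1) * eps)

# Example with arbitrary values: m = 90 nodes, n = 100 samples, lam1 = 1, eps = 0.01
print(lambda_dag_threshold(90, 100, 1.0, 0.01))   # about 1.9e6
```

Since every summand accumulated by sgbn_objective is non-negative, the objective is bounded below by zero, which is the first ingredient of the convergence argument in the proof below.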

Proof. In the following, we prove that:

1. The alternate optimization in Eqn. (2) converges.
2. The solution Θ of Eqn. (2) is a DAG when λ_dag is sufficiently large.

Let us denote the objective function of Eqn. (2) by

$$
f(\Theta, o, \Upsilon) = \sum_{i=1}^{m} \Big( \|x_{:,i} - \mathrm{PA}_i \theta_i\|_2^2 + \lambda_1 \|\theta_i\|_1 + \lambda_{\mathrm{dag}}\, \varepsilon_i^{\top} |\theta_i| \Big).
$$

First, we prove that Eqn. (2) converges by showing that (i) f(Θ, o, Υ) is lower bounded; and (ii) f(Θ^{(t+1)}, o^{(t+1)}, Υ^{(t+1)}) ≤ f(Θ^{(t)}, o^{(t)}, Υ^{(t)}), meaning that the function value monotonically decreases with the iteration number t. It is easy to see that f(Θ, o, Υ) is lower bounded by 0, since each term in f(Θ, o, Υ) is non-negative. The second point can be proven as follows.

At the t-th iteration, we update Θ by

$$
\Theta^{(t+1)} = \arg\min_{\Theta} \sum_{i=1}^{m} \Big( \|x_{:,i} - \mathrm{PA}_i \theta_i\|_2^2 + \lambda_1 \|\theta_i\|_1 + \lambda_{\mathrm{dag}}\, (\varepsilon_i^{(t)})^{\top} |\theta_i| \Big) = \arg\min_{\Theta} f(\Theta, o^{(t)}, \Upsilon^{(t)}). \quad \text{(3)}
$$

It holds that f(Θ^{(t+1)}, o^{(t)}, Υ^{(t)}) ≤ f(Θ^{(t)}, o^{(t)}, Υ^{(t)}). It is also noted that Θ^{(t+1)} is an achievable global minimum over Θ, since f(Θ, o^{(t)}, Υ^{(t)}) is a convex function with respect to Θ. Similarly, we then update o and Υ by

$$
\{o^{(t+1)}, \Upsilon^{(t+1)}\} = \arg\min_{o, \Upsilon} f(\Theta^{(t+1)}, o, \Upsilon) \quad \text{s.t. } o_j - o_i \ge \epsilon - \Upsilon_{ij},\ \forall\, i, j \in \{1, \dots, m\},\ i \neq j;\quad 0 \le o_i \le m\epsilon;\quad \Upsilon_{ij} \ge 0. \quad \text{(4)}
$$

It holds that f(Θ^{(t+1)}, o^{(t+1)}, Υ^{(t+1)}) ≤ f(Θ^{(t+1)}, o^{(t)}, Υ^{(t)}). Also, f(Θ^{(t+1)}, o, Υ) is a linear function with respect to o and Υ. Consequently, we have f(Θ^{(t+1)}, o^{(t+1)}, Υ^{(t+1)}) ≤ f(Θ^{(t+1)}, o^{(t)}, Υ^{(t)}) ≤ f(Θ^{(t)}, o^{(t)}, Υ^{(t)}). Therefore, the optimization problem in Eqn. (2) is guaranteed to converge under the alternate optimization strategy, because the objective function is lower-bounded and monotonically decreases with the iteration number.
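Because f(Θ^{(t+1)}, o, Υ) is linear in (o, Υ) and the constraints of Eqn. (4) are linear, the (o, Υ)-step can be written as a small linear program. The sketch below (our reformulation for illustration, not the authors' implementation; the flattened variable layout, the value of ε, and the use of scipy.optimize.linprog are arbitrary choices) sets up and solves that LP for a given Θ.

```python
# Illustrative LP formulation of the (o, Upsilon)-step in Eqn. (4).  With Theta
# fixed, only the DAG-penalty term lam_dag * sum_{i != j} |Theta[i, j]| * Upsilon[i, j]
# depends on (o, Upsilon), so it becomes the linear objective.  Dense formulation,
# intended only for small m.
import numpy as np
from scipy.optimize import linprog

def update_o_upsilon(Theta, lam_dag, eps=1e-2):
    m = Theta.shape[0]
    n_vars = m + m * m                      # z = [o_1, ..., o_m, vec(Upsilon) row-major]
    u = lambda i, j: m + i * m + j          # position of Upsilon[i, j] inside z

    # Linear objective: minimize lam_dag * sum_{i != j} |Theta[i, j]| * Upsilon[i, j].
    c = np.zeros(n_vars)
    for i in range(m):
        for j in range(m):
            if i != j:
                c[u(i, j)] = lam_dag * abs(Theta[i, j])

    # Ordering constraints o_j - o_i >= eps - Upsilon[i, j], rewritten for linprog
    # as o_i - o_j - Upsilon[i, j] <= -eps.
    rows, b_ub = [], []
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            row = np.zeros(n_vars)
            row[i], row[j], row[u(i, j)] = 1.0, -1.0, -1.0
            rows.append(row)
            b_ub.append(-eps)

    # Bounds: 0 <= o_i <= m * eps, Upsilon[i, j] >= 0 (diagonal pinned to zero).
    bounds = [(0.0, m * eps)] * m
    for i in range(m):
        for j in range(m):
            bounds.append((0.0, 0.0) if i == j else (0.0, None))

    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(b_ub),
                  bounds=bounds, method="highs")
    assert res.success, res.message
    o = res.x[:m]
    Upsilon = res.x[m:].reshape(m, m)
    return o, Upsilon
```

Entries Υ_ij with Θ_ij = 0 carry a zero objective coefficient, so the solver is free to give them the large slack that the proof of Proposition 1 relies on, while entries with Θ_ij ≠ 0 are made as small as the ordering constraints allow, softly mirroring the complementarity condition (1c).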

Second, we prove that when λ_dag > [2(m−2)(n−1)² + λ_1(2n − 2 − λ_1)] / [λ_1(m+1)ε], the output Θ is guaranteed to be a DAG. This can be proven by contradiction. Suppose that such a λ_dag does not lead to a DAG, that is, Υ_ji Θ_ji ≠ 0 for at least one pair of nodes i and j, with Θ_ji ≠ 0 and Υ_ji > 0. Without loss of generality, we assume Υ_ji ≥ (m+1)ε (where ε is an arbitrary positive number), so that the ordering constraints in Eqn. (2) always hold regardless of the variables o_i and o_j. This is because o_i and o_j are constrained by 0 ≤ o_i ≤ mε and 0 ≤ o_j ≤ mε, so that o_i − o_j ≥ −mε = ε − (m+1)ε ≥ ε − Υ_ji. Based on the first-order optimality condition, Θ_ji ≠ 0 if and only if

$$
2\,\Big| \big( x_{:,i} - \mathrm{PA}_{i(\setminus j,:)}\,\theta_{i\setminus j} \big)^{\top} x_{:,j} \Big| - \big( \lambda_1 + \lambda_{\mathrm{dag}}\,\Upsilon_{ji} \big) > 0.
$$

Here, PA_{i(\j,:)} denotes the matrix PA_i with its j-th row removed (i.e., the parents of node i without node j), and θ_{i\j} denotes the elements of θ_i without Θ_ji. However, it can be shown that

$$
\begin{aligned}
\Big| \big( x_{:,i} - \mathrm{PA}_{i(\setminus j,:)}\,\theta_{i\setminus j} \big)^{\top} x_{:,j} \Big|
&\le \big| x_{:,i}^{\top} x_{:,j} \big| + |\theta_{i\setminus j}|^{\top} \big| \mathrm{PA}_{i(\setminus j,:)}^{\top} x_{:,j} \big| \\
&= \big| x_{:,i}^{\top} x_{:,j} \big| + \sum_{k=1,\,k \neq i,j}^{m} |\Theta_{ki}|\, \big| x_{:,k}^{\top} x_{:,j} \big| \\
&\le (n-1) + (m-2)(n-1)\max_{k}|\Theta_{ki}| \\
&\le (n-1) + \frac{(m-2)(n-1)^2}{\lambda_1}. \quad \text{(5)}
\end{aligned}
$$

The second-to-last inequality holds due to the normalization of the features x_{:,i} (to zero mean and unit standard deviation). The last inequality holds because max_k|Θ_ki| ≤ ‖θ_i‖_1 ≤ (1/λ_1)( ‖x_{:,i} − PA_i θ_i‖²_2 + λ_1‖θ_i‖_1 + λ_dag ɛ_i^⊤|θ_i| ) = (1/λ_1) f_i(θ_i, o, Υ) ≤ (1/λ_1) f_i(0, o, Υ) = (1/λ_1) x_{:,i}^⊤ x_{:,i} = (n−1)/λ_1, where f_i denotes the i-th summand of f(Θ, o, Υ) and the last inequality in this chain uses the fact that θ_i minimizes f_i with o and Υ fixed.

With the given λ_dag, Eqn. (5) implies that 2|(x_{:,i} − PA_{i(\j,:)} θ_{i\j})^⊤ x_{:,j}| − (λ_1 + λ_dag Υ_ji) < 0, which contradicts the above first-order optimality condition with Θ_ji ≠ 0. Therefore, when λ_dag is sufficiently large, the output Θ is guaranteed to be a DAG.

Summing up the proofs above, the alternate optimization of Eqn. (2) converges, and the output Θ is guaranteed to be a DAG when λ_dag is sufficiently large.

References

[1] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2007.