Adapting Bayes Nets to Non-stationary Probability Distributions
Søren Holbech Nielsen & Thomas D. Nielsen

June 4,

1 Introduction

Ever since Pearl (1988) published his seminal book on Bayesian networks (BNs), the formalism has become a widespread tool for representing, eliciting, and discovering probabilistic relationships for problem domains defined over discrete variables. One area of research that has seen much activity is learning the structure of BNs, where probabilistic relationships among variables are discovered, or inferred, from a database of observations of these variables (main papers in learning include (Cooper and Herskovits 1992), (Heckerman, Geiger, and Chickering 1994), and (Chickering 2002)). One part of this research area focuses on incremental learning of BNs, where observations are received sequentially and a BN structure is gradually constructed along the way without keeping all the observations in memory. A special case of incremental learning is structural adaptation, where the incremental algorithm maintains one or more candidate structures and applies changes to these as observations are received. This particular area of research has received very little attention, the only results that we are aware of being (Buntine 1991), (Lam and Bacchus 1994) (and later (Lam 1998)), (Friedman and Goldszmidt 1997), and (Alcobé 2004).

These papers all assume that the database of observations has been produced by a stationary stochastic process; that is, the ordering of the observations in the database is inconsequential. However, many real-life observable processes cannot really be said to be invariant with respect to time: mechanical mechanisms tend to become worn out as time progresses, for instance, and influences not observed in the data may change suddenly. When human decision makers are somehow involved in the data generating process, these are almost surely not fully describable by the observables
and may change their behaviour instantaneously if some advantage can be gained. A very simple example of a situation where it is unrealistic to expect a stationary generating process is a database storing business transactions and internal communications of some company. If at some point a high-ranking employee is replaced, the subsequent data would be distributed differently from that representing the time before the change.

In this work we relax the assumption of stationary data, opting instead for learning from data which is only approximately stationary. More concretely, we assume that the data generating process is piecewise stationary, as in the example given above. Furthermore, we focus on domains where the shifts in distribution from one stationary period to the next are of a local nature (i.e. only a subset of the probabilistic relationships among variables change as the shifts take place). As the assumptions underlying the method are generalizations of those underlying other structural adaptation methods, we may even hope that our method is competitive for incremental learning from data generated by stationary distributions.

2 Preliminaries

Here we present the definitions and terminology necessary to understand the remainder of the text. As a general notational rule we use bold font to denote sets and vectors (V, c, etc.) and calligraphic font to denote mathematical structures and compositions (B, G, etc.). Moreover, we use upper case letters to denote random variables or sets of random variables (X, Y, V, etc.), and lower case letters to denote specific states of these variables (x, y, c, etc.). The symbol ≜ is used to denote "defined as".

A BN B ≜ (G, Φ) over a set of discrete variables V consists of an acyclic directed graph (traditionally abbreviated DAG) G, whose nodes are the variables in V, and a set of conditional probability distributions Φ (which we abbreviate CPTs, for "conditional probability tables").
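To make the (G, Φ) pairing concrete, the following is a minimal sketch of a hypothetical three-variable BN with binary variables, where the joint distribution is obtained as the product of one CPT entry per variable. The variables, numbers, and dictionary encoding are illustrative only, not from the paper:

```python
# Hypothetical BN over binary variables with arcs A -> B and A -> C.
# Each CPT maps (parent-values, value) to a probability.
cpts = {
    "A": {((), 0): 0.6, ((), 1): 0.4},
    "B": {((0,), 0): 0.9, ((0,), 1): 0.1,
          ((1,), 0): 0.3, ((1,), 1): 0.7},
    "C": {((0,), 0): 0.8, ((0,), 1): 0.2,
          ((1,), 0): 0.5, ((1,), 1): 0.5},
}
parents = {"A": (), "B": ("A",), "C": ("A",)}

def joint(assignment):
    """P_B(v): the product of one CPT entry per variable."""
    p = 1.0
    for x, pa in parents.items():
        pa_vals = tuple(assignment[q] for q in pa)
        p *= cpts[x][(pa_vals, assignment[x])]
    return p

print(joint({"A": 1, "B": 0, "C": 1}))  # ≈ 0.06 (= 0.4 * 0.3 * 0.5)
```

Summing `joint` over all eight assignments recovers 1, as required of a joint distribution.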
A correspondence between G and Φ is enforced by requiring that Φ consists of one CPT P(X | PA_G(X)) for each variable X, specifying a conditional probability distribution for X given each possible instantiation of the parents PA_G(X) of X in G. A unique joint distribution P_B over V is obtained by taking the product of all the CPTs in Φ.

In a graph G we shall often need to distinguish between different types of connections on paths between nodes: for three nodes X, Y, and Z next to each other on some path a, we say that the connection at Y is serial if the arcs X → Y and Y → Z are part of a; if the arcs Y → X and Y → Z are part of a,
we call the connection diverging; and if the arcs X → Y and Z → Y are part of a, we call it converging. If, for such a converging connection, we furthermore have that X and Z are not adjacent in G, then we call the triple (X, Y, Z) a v-structure of G.

Due to the construction of P_B we are guaranteed that all dependencies inherent in P_B can be read directly from G by use of the d-separation criterion (Pearl 1988): for any path a between two nodes X and Y in G, we say that a is active given a set of nodes Z if, for each node W on the path, either the connection at W is serial or diverging and W is not in Z, or the connection is converging and either W or one of its descendants is in Z. If there is no active path between two nodes X and Y given Z, then we say that X and Y are d-separated given Z. The d-separation criterion states that if X and Y are d-separated by Z, then X is conditionally independent of Y given Z in P_B; or, equivalently, if X is conditionally dependent on Y given Z in P_B, then X and Y are not d-separated by Z in G. In the remainder of the text, we use X ⊥_G Y | Z to denote that X is d-separated from Y by Z in the DAG G, and X ⊥_P Y | Z to denote that X is conditionally independent of Y given Z in the distribution P (with ⊥̸ denoting the negation of either statement). The d-separation criterion is thus

X ⊥_G Y | Z ⇒ X ⊥_{P_B} Y | Z, (1)

for any BN B ≜ (G, Φ). The set of all conditional independence statements that may be read from a graph in this manner we refer to as that graph's d-separation properties. We refer to any two graphs over the same variables as being equivalent if they have the same d-separation properties. Equivalence is obviously an equivalence relation.
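D-separation can be tested mechanically. The sketch below uses a criterion equivalent to the active-path definition above (Lauritzen's moralization test: restrict to the ancestors of X ∪ Y ∪ Z, marry co-parents, drop directions, delete Z, and check connectivity); the graph encoding as a parent map is an assumption of this sketch:

```python
def ancestors(dag, nodes):
    """All nodes with a directed path into `nodes`, plus `nodes` itself.
    `dag` maps each node to the set of its parents."""
    result, stack = set(nodes), list(nodes)
    while stack:
        for p in dag[stack.pop()]:
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def d_separated(dag, xs, ys, zs):
    """True iff xs and ys are d-separated given zs, via the
    ancestral moral graph criterion."""
    anc = ancestors(dag, set(xs) | set(ys) | set(zs))
    nbrs = {v: set() for v in anc}
    for v in anc:
        ps = dag[v] & anc
        for p in ps:                      # keep induced arcs, undirected
            nbrs[v].add(p); nbrs[p].add(v)
        for a in ps:                      # marry co-parents
            for b in ps:
                if a != b:
                    nbrs[a].add(b)
    blocked = set(zs)                     # delete zs, then BFS from xs
    seen, stack = set(xs) - blocked, list(set(xs) - blocked)
    while stack:
        for w in nbrs[stack.pop()] - blocked:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return not (seen & set(ys))

# v-structure A -> C <- B: A, B d-separated given {} but not given {C}.
dag = {"A": set(), "B": set(), "C": {"A", "B"}}
print(d_separated(dag, {"A"}, {"B"}, set()))  # True
print(d_separated(dag, {"A"}, {"B"}, {"C"}))  # False
```

The converging-connection case of the definition is visible here: conditioning on the common child C activates the path between A and B.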
Verma and Pearl (1991) proved that equivalent graphs necessarily have the same skeleton and the same v-structures, and the equivalence class of graphs equivalent to a specific graph G can then be uniquely represented by the partially directed graph G* obtained from the skeleton of G by directing those links that participate in a v-structure in G in the direction dictated by G. G* is called the pattern of G. Any graph G′ which is obtained from G* by directing the remaining undirected links, without creating a directed cycle or a new v-structure, is equivalent to G. We say that G′ is a consistent extension of G*. The partially directed graph G** obtained from G*, by directing undirected links as they appear in G whenever all consistent extensions of G* agree on this direction, is called the completed pattern of G. G** is obviously a unique representation of G's equivalence class as well. Any arc in G** is called compelled in G.
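The pattern just described (the skeleton, with only v-structure arcs kept directed) can be sketched as follows. Computing the completed pattern would additionally require propagating forced directions across consistent extensions, which is omitted here; the parent-map encoding is an assumption of the sketch:

```python
def pattern(dag):
    """Return (links, arcs) of the pattern of `dag`: the skeleton keeps
    an arc directed only if it takes part in a v-structure X -> Y <- Z
    with X and Z non-adjacent. `dag` maps each node to its parent set."""
    def adjacent(a, b):
        return a in dag[b] or b in dag[a]
    arcs = set()
    for y, ps in dag.items():
        for x in ps:
            for z in ps:
                if x != z and not adjacent(x, z):
                    arcs.add((x, y))
                    arcs.add((z, y))
    links = {frozenset((p, c)) for c, ps in dag.items() for p in ps
             if (p, c) not in arcs}
    return links, arcs

# A -> C <- B plus C -> D: only the v-structure arcs stay directed.
dag = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}}
links, arcs = pattern(dag)
print(sorted(arcs))  # [('A', 'C'), ('B', 'C')]
print(links)         # the C-D edge stays undirected
```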
Given any joint distribution P over V it is possible to construct a BN B such that P = P_B (Pearl 1988). A distribution P, for which there is a BN B ≜ (G, Φ) such that P_B = P and also

X ⊥_P Y | Z ⇒ X ⊥_G Y | Z (2)

holds, is called DAG faithful, and B (and sometimes G alone) is called a perfect map. DAG faithful distributions are important since, if a data generating process is known to be DAG faithful, then a perfect map can, in principle, be inferred from the data, under the assumption that the data is representative of the distribution. Pearl (1988) proves that the following important properties hold for any DAG faithful probability distribution P:

A ⊥_P B ∪ D | C ⇒ A ⊥_P B | C and A ⊥_P D | C (3)

A ⊥_P B ∪ D | C ⇒ A ⊥_P B | C ∪ D. (4)

For any probability distribution P over variables V and variable X ∈ V, we define a Markov boundary of X to be a set S ⊆ V \ {X} such that X ⊥_P V \ (S ∪ {X}) | S and this holds for no proper subset of S. It is easy to see that if P is DAG faithful, the Markov boundary of X is uniquely defined, and consists of X's parents, children, and children's parents in a perfect map of P. In the case of P being DAG faithful, we denote the Markov boundary by MB_P(X). For any DAG G, we denote the set of X's parents, children, and children's parents in G by MB_G(X).

3 An Adaptive Approach

We now present our method for structural adaptation to a piecewise stationary data sequence. We first describe the problem formally, and then the proposed method is presented.

3.1 The Adaptation Problem

We say that a sequence is samples from a piecewise DAG faithful distribution if the sequence can be partitioned into sets such that each set is a database sampled from a single DAG faithful distribution. The rank of the sequence is the size of the smallest set in such a partition. Formally, let d = (d_1, ..., d_l) be a sequence of observations over variables V.
We say that the sequence is sampled from a piecewise DAG faithful distribution (or simply that it is a piecewise DAG faithful sequence) if there are indices 1 = i_1 < · · · < i_m = l such that each of d^(j) = (d_{i_j}, ..., d_{i_{j+1}}), for 1 ≤ j ≤ m − 1, is a sequence
of samples from a DAG faithful distribution. The rank of the sequence is defined to be min_j (i_{j+1} − i_j), and we say that m − 1 is its size and l its length. Obviously, any sequence of observations is indistinguishable from a piecewise DAG faithful sequence if the partitions are chosen small enough, so we restrict our attention to sequences that are piecewise DAG faithful of at least rank r. However, we assume that neither the actual rank nor the size of the sequence is known, and specifically we do not assume that the indices i_1, ..., i_m are known. We do assume that each sample is complete, though, so that no observations in the sequence have missing values.

The learning task that we address in this paper consists of incrementally learning a BN while receiving a piecewise DAG faithful sequence of samples, making sure that after reception of each sample point the BN structure is as close as possible to the distribution that generated this point. Formally, let d be a complete piecewise DAG faithful sample sequence of length l, and let P_t be the distribution generating sample point t. Furthermore, let B_1, ..., B_l be the BNs output (or maintained) by a structural adaptation method M when receiving d and starting with BN B. Given a distance measure d(·, ·) on BNs, we define the precision of M on d wrt. d(·, ·) as

prec(M, B, d) ≜ (1/l) Σ_{i=1}^{l} d(B_i, B_i*), (5)

where B_i* is a BN representation of P_i. For a method M to adapt to a DAG faithful sample sequence d wrt. d(·, ·) then means that M has a higher precision than the trivial learner (which never learns) on d wrt. d(·, ·).

3.2 Our Approach

The method that we propose here builds on the idea of automating the human practice of revising models only when needed, and then revising only the parts of the models that are in need of a revision. The method consists of two main mechanisms: one, monitoring the current BN while receiving evidence and detecting when the model should be changed, and two, adapting the parts of the model that conflict with the evidence.
These two mechanisms roughly correspond to Algorithms 1 and 2. In more detail, our method maintains histories of conflict measures for the nodes in the current BN, along with recent observations, and then initiates a local search over a set of nodes if the conflict measure histories of these nodes indicate that the underlying distribution has changed. A local
search means that we assume that only some aspects of the perfect maps of the two underlying generating distributions differ. The general form of the method is given in Algorithm 1. We have designed it to use a number of helper functions, whose exact implementation is open for experimentation as long as they fulfill a number of requirements. The responsibilities of ConflictMeasure(·) and ChangeInStream(·) will be described shortly. The actual technical implementation choices we have made for these methods are presented in relation to our experimental results in Section 5. The specification of UpdateNet(·) is given in Algorithm 2 and is elaborated upon later. The role of the rest of the methods should be self-explanatory, and their implementation is trivial.

Algorithm 1 Algorithm for BN adaptation. Takes as input an initial network B, defined over variables V, a series of cases h, and an allowed delay k for determining shifts in h.

 1: procedure Adapt(B, V, h, k)
 2:   h′ ← []
 3:   c_X ← [] for all X ∈ V
 4:   loop
 5:     x ← NextCase(h)
 6:     Append(h′, x)
 7:     S ← ∅
 8:     for X ∈ V do
 9:       c ← ConflictMeasure(B, X, x)
10:       Append(c_X, c)
11:       if ChangeInStream(c_X, k) then
12:         S ← S ∪ {X}
13:       end if
14:     end for
15:     h′ ← LastKEntries(h′, k)
16:     if S ≠ ∅ then
17:       UpdateNet(B, S, h′)
18:     end if
19:   end loop
20: end procedure

The responsibilities of the two helper methods are as follows: ConflictMeasure(B, X_i, x) returns a real value reflecting how inconsistent the observation x is with the way X_i connects to the rest of the model B. ChangeInStream(c_X, k) is a boolean valued function that, given a sequence of conflict measure values for X, tells whether there appears to be a jump between the last k values and the values before that;
such a jump would be an indication of an abrupt change in the underlying generating distribution.

Algorithm 2 Update algorithm for a BN. Takes as input the network to be updated B, a set of variables whose structural bindings may be wrong S, and data to learn from h.

 1: procedure UpdateNet(B, S, h)
 2:   for X ∈ S do
 3:     M_X ← UncoverMarkovBoundary(X, h)
 4:     G_X ← UncoverAdjacencies(X, M_X, h)
 5:     PartiallyDirect(X, G_X, M_X, h)
 6:   end for
 7:   G′ ← MergeFragments({G_X}_{X ∈ S})
 8:   G ← GraphOf(B)
 9:   (G′′, C) ← MergeAndDirect(G′, G, S)
10:   Φ′ ← ∅
11:   for X ∈ V do
12:     if X ∈ C then
13:       Φ′ ← Φ′ ∪ {P_h(X | PA_{G′′}(X))}
14:     else
15:       Φ′ ← Φ′ ∪ {P_B(X | PA_{G′′}(X))}
16:     end if
17:   end for
18:   B ← (G′′, Φ′)
19: end procedure

UpdateNet(B, S, h) in Algorithm 2 adapts the BN B around the nodes in S to fit the empirical distribution defined by the data in h, which will be the last k received cases at the time of invocation. Throughout the text, this empirical distribution will also be denoted P_h; which meaning h takes on should be clear from the context. The update of B to fit P_h is based on the assumption that only the nodes in S need updating of their probabilistic bindings (i.e. their families in a perfect map B_h of P_h). The method is listed in Algorithm 2. As Adapt(·) in Algorithm 1, it employs a number of helper methods that we only specify semantically. The concrete implementations of these methods that we have used in our experiments are presented in Section 5.

UpdateNet(·) first runs through the nodes in S and uncovers for each node X the subgraph¹ of B_h consisting of X and the links and arcs connecting X to its adjacent nodes. (¹ Subgraph is used in a loose sense here, as the graph fragments produced do not contain links and arcs between X's adjacent variables.) The construction of this subgraph is divided into three steps: UncoverMarkovBoundary(X, h), which should compute MB_{P_h}(X), or
at least a superset thereof; UncoverAdjacencies(X, M_X, h), which first computes the nodes in M_X that, according to P_h, are inseparable from X given subsets of M_X, and then constructs a simple graph structure where X is connected to each of these nodes by a link; and, finally, PartiallyDirect(X, G_X, M_X, h), which directs some of the links in G_X. This method is presented in Algorithm 3 and elaborated upon later in this section.

After constructing network fragments for all nodes in S, Algorithm 2 merges these into a single (not necessarily connected) graph G′ in MergeFragments(·). This happens as a simple graph union.² (² Due to the inner workings of PartiallyDirect(·), no conflicts of orientation among the fragments can occur.) The result may still contain undirected links. G′ is merged with a partially directed graph, based on the subgraph of B's graph G that corresponds to nodes not in S, into a new graph G′′ by MergeAndDirect(·), which is described at the end of this section. MergeAndDirect(·) returns both the joined graph G′′ and a set C containing all nodes that have received a new parent set (ideally a subset of S). Algorithm 2 ends by constructing new CPTs for the newly constructed BN structure G′′. Nodes not in C are assigned the same CPT as in B, and nodes in C have new CPTs calculated from h.

Algorithm 3 Uncovers the direction of some arcs adjacent to a variable X in B_h. NE_{G_X}(X) denotes the nodes connected to X by a link in G_X.

 1: procedure PartiallyDirect(X, G_X, M_X, h)
 2:   DirectAsInOtherFragments(G_X)
 3:   for Y ∈ M_X \ (NE_{G_X}(X) ∪ PA_{G_X}(X)) do
 4:     for T ⊆ M_X \ {X, Y} do
 5:       if I_h(X, Y | T) then
 6:         for Z ∈ NE_{G_X}(X) \ T do
 7:           if ¬I_h(X, Y | T ∪ {Z}) then
 8:             LinkToArc(G_X, X, Z)
 9:           end if
10:         end for
11:       end if
12:     end for
13:   end for
14: end procedure

The method PartiallyDirect(X, G_X, M_X, h) directs a number of links in the graph fragment G_X in accordance with their direction in B_h. With DirectAsInOtherFragments(G_X), the procedure first exchanges links for arcs where previously directed graph fragments dictate this. The procedure then runs through each variable Y in M_X not adjacent to X, finds a set of nodes in M_X that separates X from Y, and then tries to re-establish the connection to Y by repeatedly expanding the set of separating nodes by a single node adjacent to X. If such a node can be found, it has to be a child of X (see Corollary 16). No arc in B_h originating from X is left as a link by the procedure (see Lemma 13). During these steps the procedure uses an independence oracle I_h derived from h. The independence oracle should return true for a call I_h(X, Y | T) iff X ⊥_{P_h} Y | T.

Algorithm 4 Merges graphs G′ and G under the assumption that connections to nodes in S have just been relearned.

 1: procedure MergeAndDirect(G′, G, S)
 2:   E ← EdgesOf(G′)
 3:   for X ∉ S do
 4:     for Y ∈ AD_G(X) \ S do
 5:       E ← E ∪ {(X, Y), (Y, X)}
 6:     end for
 7:   end for
 8:   for X ∉ S do
 9:     for Y ∈ AD_G(X) \ S do
10:       if ∃Z ∈ MB_G(X) \ AD_G(X) : Y is a descendant of Z then
11:         E ← E \ {(Y, X)}
12:       end if
13:     end for
14:   end for
15:   return Direct((V, E))
16: end procedure

MergeAndDirect(G′, G, S) in Algorithm 4 merges the graphs G′ and G by adding links between nodes in V \ S in G′ if these are adjacent in G. Then it directs some of these links: a link X − Y is directed as X → Y if X → Y is an arc in G and Y is a descendant of some node Z in MB_G(X) \ AD_G(X) through a path involving only children of X. That this particular rule is sensible is proved in Section 4. Links and arcs between nodes in S and those not in S should in theory be identical, but owing to flawed statistical tests they may differ in practice; they are then set either as defined in G or as defined in G′, depending on how aggressive the adaptation should be. Initial experiments seemed to indicate that the latter option yields a small performance increase, and this is the setting that we have used in the final experiments, and the setting that Algorithm 4 has been written to conform to.
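The monitoring helpers ConflictMeasure(·) and ChangeInStream(·) used by Algorithm 1 are only specified semantically above, with the concrete choices deferred to Section 5. One hypothetical instantiation, sketched here purely for illustration, uses the surprise −log P(x | pa(x)) of the observed case as the conflict measure and a simple mean-shift test over the last k values as the change detector; the function names, CPT encoding, and threshold are assumptions, not the paper's choices:

```python
import math

def conflict_measure(cpt, parents, x_val, case):
    """Surprise -log P(x | pa(x)) of the case under X's current family.
    High values suggest the case conflicts with how X connects to B."""
    pa_vals = tuple(case[p] for p in parents)
    return -math.log(max(cpt[(pa_vals, x_val)], 1e-12))

def change_in_stream(conflicts, k, threshold=1.0):
    """Report a shift when the mean conflict over the last k cases
    exceeds the mean over the earlier cases by more than `threshold`."""
    if len(conflicts) < 2 * k:
        return False
    recent, older = conflicts[-k:], conflicts[:-k]
    return (sum(recent) / k) - (sum(older) / len(older)) > threshold

stream = [0.2, 0.3, 0.25, 0.2, 2.1, 2.4, 2.2]  # jump after the 4th case
print(change_in_stream(stream, k=3))  # True
```

Any detector with the semantics described for ChangeInStream(·) could be substituted; the delay parameter k plays the same role as in Algorithm 1.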
After the merge, any remaining links in the joined graph are directed by Direct(·) by application of the following rules:

1. if X − Y is a link, Z → X is an arc, and Z and Y are non-adjacent, then direct the link X − Y as X → Y;

2. if X − Y is a link and there is a directed path from X to Y, then direct the link X − Y as X → Y; and

3. if Rules 1 and 2 cannot be applied, choose a link at random and direct it randomly.

Due to potentially flawed statistical tests, the resultant graph may contain cycles involving at least some nodes in S. These are eliminated by reversing arcs between nodes in S only. The reversal process resembles the one used in (Margaritis and Thrun 2000): we remove all arcs between nodes in S that appear in at least one cycle, order the removed arcs according to how many cycles they appear in, and then insert them back into the graph, starting with the arcs that appear in the fewest cycles. When, at some point, the insertion of an arc gives rise to a cycle, we insert the arc as its reverse.

4 Formal Correctness

We have obtained a proof ensuring that, under a set of assumptions, our adaptation method is correct, in the sense that when the underlying distribution generating the sequence of data changes, the algorithm reacts and changes the current BN to accurately reflect the perfect map of the new underlying distribution. Moreover, the result makes it clear what it means for these changes to be local, probabilistically speaking. First, we need a few definitions, on which the assumptions are based:

Definition 1. Let P be a DAG faithful probability distribution over variables V, X in V, and Y in MB_P(X). We say that a set T is a maximal separating set of Y from X wrt. P if T ⊆ MB_P(X) \ {Y}, Y ⊥_P X | T, and this holds for no T′ with T ⊂ T′ ⊆ MB_P(X) \ {Y}.
The set of all variables in V for which a maximal separating set from X wrt. P exists, we denote by S_P(X) (the intuition being "separable from"). For each Y in S_P(X), we denote by T_P(X, Y) the intersection of all maximal separating sets of Y from X wrt. P, and by E_P(X, Y) the set MB_P(X) \ (T_P(X, Y) ∪ {Y}) ("dependency enabling"). We expect, but have been unable to prove, that there is only one maximal separating set of Y from X wrt. P for any DAG faithful probability distribution P.

Definition 2. We say that a DAG faithful probability distribution P_1 over variables V is similar to a DAG faithful probability distribution P_2 over variables V on the variables S ⊆ V if we have that

1. MB_{P_1}(X) = MB_{P_2}(X), for all X ∈ S,

2. S_{P_1}(X) = S_{P_2}(X), for all X ∈ S, and

3. E_{P_1}(X, Y) \ S_{P_1}(X) = E_{P_2}(X, Y) \ S_{P_2}(X), for all X ∈ S and Y ∈ S_{P_1}(X).

Here Bullet 3 states that the inseparable variables enabling dependencies between X and Y must be the same in both P_1 and P_2. A few points that should be noted are that similarity on a set of variables S is an equivalence relation, and that two distributions similar on a set of variables S are also similar on any subset of S. Our main formal result is:

Theorem 3. Let the BN B_1 ≜ (G_1, Φ_1) and the DAG faithful probability distributions P_1 and P_2 each be defined over variables V, and let h be a piecewise DAG faithful sample sequence of size 2 and rank r, where P_1 is the distribution generating the first partition and P_2 the distribution generating the last. Furthermore, let P_1 be similar to P_2 on S, and let B_2 ≜ (G_2, Φ_2) be the result of running Algorithm 1 on B_1 with cases h and some choice of k. If

1. G_1 is equivalent to a perfect map of P_1,

2. r > k,³

3.
ChangeInStream(c_X, k) is true for a variable X iff X ∉ S and the algorithm is currently processing the k-th sample drawn from P_2,

³ Note that Bullet 6 incorporates the traditional assumption that independencies are identifiable from sufficient data, and that Bullet 2 is only concerned with ensuring that ChangeInStream(·) always has at least k cases from a new partition of h available to detect the change.
4. UncoverMarkovBoundary(X, h) returns MB_{P_h}(X),

5. the G_X returned from UncoverAdjacencies(X, M_X, h) connects X to Y iff Y is in MB_{P_h}(X) \ S_{P_h}(X), and

6. the oracle used in PartiallyDirect(·) is correct,

then G_2 is equivalent to a perfect map of P_2.

The theorem is important as it guarantees that if our method is started with a BN that is equivalent to the perfect map of some DAG faithful distribution P, then no matter how often P changes, as long as it remains DAG faithful and at least 2k observations are made between each change, our method continues to adapt B to fit the distribution, with a delay of k observations. In what follows, we gradually build up a theory to support Theorem 3, and then give the proof for it. The main difficulty is that our method directs arcs between a node and its adjacent nodes without knowing how these adjacent nodes interconnect. Previous works on constraint based DAG discovery have been based on two-phase algorithms that first uncover the skeleton of the DAG and then direct the links into arcs. That has not been possible in our case, where local parts of the graph have to be learned in isolation from the remaining graph. First, we need some small results that we have been unable to find proved elsewhere:

Lemma 4. Let P be a DAG faithful probability distribution over V, and let X ⊥̸_P Y | S for X, Y ∈ V and all S ⊆ MB_P(X) \ {Y}. Then Y ⊥̸_P X | T for all T ⊆ V \ {X, Y}.

Proof. First note that if X ⊥̸_P Y | S for all S ⊆ MB_P(X) \ {Y}, then it follows that Y must be in MB_P(X). As P is DAG faithful, there is some DAG G which is a perfect map of P. If there is an arc between X and Y in G, then there is no set T ⊆ V \ {X, Y} such that X ⊥_G Y | T, and hence, since G is a perfect map of P, no T such that X ⊥_P Y | T. Therefore, we just need to prove that X and Y must be adjacent if X ⊥̸_P Y | S for all S ⊆ MB_P(X) \ {Y} = MB_G(X) \ {Y}: as Y is a member of MB_G(X), it must either be adjacent to X or share a child with X. In the first case the result follows, so we assume that X and Y share the set of children Z.
It follows that Y cannot be a descendant of any variable in Z, and hence, by the definition of a BN, that X ⊥_G Y | S for S ≜ MB_G(X) \ (Z ∪ {Y}). Thus there is a set S ⊆ MB_G(X) \ {Y} where
X ⊥_G Y | S, and, since G is a perfect map of P, also X ⊥_P Y | S. This violates the assumptions, so we conclude that X and Y must be adjacent.

Corollary 5. Let P be a DAG faithful probability distribution over V and X, Y ∈ V. If there is a set T ⊆ V \ {X, Y} such that Y ⊥_P X | T, then there is a set S ⊆ MB_P(X) \ {Y} such that Y ⊥_P X | S.

Since perfect maps of a distribution are equivalent, and equivalent DAGs have the same skeleton, it follows from Corollary 5 that:

Corollary 6. Let P be a DAG faithful probability distribution over V and X, Y ∈ V. Then Y ∉ S_P(X) iff Y is adjacent to X in all perfect maps of P, and Y ∈ S_P(X) iff Y is not adjacent to X in any perfect map of P.

Given the above result, it is thus sufficient for a learning algorithm to be able to identify S_P(X) when learning the skeleton of G. Clearly, this can be seen as a localized operation. Now we deduce a set of lemmas that are needed for local reasoning about the direction of arcs.

Lemma 7. Let P be a DAG faithful probability distribution over variables V, X, Y, Z ∈ V, T ⊆ V \ {X, Y, Z}, and let G be a perfect map of P. If X ⊥_P Y | T and X ⊥̸_P Y | T ∪ {Z}, then in G there is a node C, with non-adjacent parents A and B, such that Z is C or a descendant of C on a path involving no nodes in T, A is connected to X by a path that is active given T, and B is connected to Y by a path that is active given T. Trivially, we also have X ⊥̸_G Z | T and Y ⊥̸_G Z | T.

Proof. Since G is a perfect map of P, it is clear that X ⊥_G Y | T and X ⊥̸_G Y | T ∪ {Z}. From the definition of d-separation, it follows that there is a node C such that either Z is C or Z is a descendant of C along a path c involving no nodes in T, and C has two parents A and B, where A (B) is either X (Y) or connected to X (Y) by a path a (b) that is active given T ∪ {Z}. We need to show that for at least one such C,

1. both a and b are active given T alone, and

2. A and B cannot be adjacent.

We first show that there is a C fulfilling 1, and then that at least one of these Cs fulfills 2.
Take C such that a is active given T ∪ {Z} but not given T. This means that there is a converging connection at some node V on a such that Z is V or
a descendant of V. WLOG let V be such a node closest to X on a, let a′ be the path from X to V, and let v be the path from V to Z. Moreover, let U be the node furthest away from Z on v such that U is also on c, and let u be the part of c (or v) from U to Z, v′ the part of v from V to U, and c′ the part of c from C to U. Then U satisfies the definition of C, since the paths a′ ∪ v′ and u connect X to U and U to Z, the path c′ extended with the arc B → C connects U to B, and the connection at U between a′ ∪ v′ and c′ is converging. However, unlike a, the path a′ ∪ v′ is active given T only. If b is also only active given T ∪ {Z}, we can perform the same transformation on that path, and thus we get that 1 is satisfiable.

Next, we show that among the Cs that satisfy 1, we can find one that also satisfies 2. Particularly, let C satisfy 1, and furthermore let C be chosen such that c has maximal length among all valid choices of C. Then A and B must be non-adjacent. Suppose this is not the case, and WLOG let the arc between A and B be oriented A → B. This means that B cannot be Y, as that would create a path from X to Y active given T only, contradicting X ⊥_G Y | T. So consider the arc between B and the next node V′ on b. This arc cannot be directed into B, as that would mean that B is a valid choice for C, yet with the length of the path from B to Z through C being longer than that of c, which contradicts the choice of C. But if the arc between B and V′ is directed away from B, then a, the arc A → B, and b together form an active path given T, which again contradicts X ⊥_G Y | T. Thus A and B are non-adjacent.

Lemma 8. Let P be a DAG faithful probability distribution over V, X, Z ∈ V, Y ∈ S_P(X), T ⊆ V \ {X, Y, Z}, and Z ∈ MB_P(X) \ (T ∪ S_P(X) ∪ {Y}). If X ⊥_P Y | T and X ⊥̸_P Y | T ∪ {Z}, then for any DAG G which is a perfect map of P, Z is in CH_G(X).

Proof. As Z is not in S_P(X), Corollary 6 yields that Z and X must be adjacent in every DAG that is a perfect map of P. All that is left to be shown is that Z cannot be a parent of X. Assume this is false, and let G be a perfect map of P in which Z is a parent of X.
As X ⊥_P Y | T and G is a perfect map of P, it follows that

X ⊥_G Y | T. (6)

Specifically, (6) in conjunction with the assumption that X is a child of Z implies that there can be no path between Z and Y which is active given T, i.e.

Z ⊥_G Y | T. (7)
But from X ⊥_P Y | T and X ⊥̸_P Y | T ∪ {Z}, Lemma 7 allows us to conclude that Z ⊥̸_G Y | T, which contradicts (7).

Lemma 9. Let P be a DAG faithful probability distribution over V, X ∈ V, Y ∈ S_P(X), T ⊆ MB_P(X) \ {Y}, and Z ∈ MB_P(X) \ (T ∪ S_P(X) ∪ {Y}). If X ⊥_P Y | T and X ⊥̸_P Y | T ∪ {Z}, then there is W ∈ S_P(X) such that Z ∈ E_P(X, W).

Proof. Let G be a perfect map of P. We show that there is a W ∈ S_P(X) such that

X ⊥_G W | T′ (8)

and

X ⊥̸_G W | T′ ∪ T′′ ∪ {Z}, (9)

for some set T′ ⊆ MB_G(X) \ {W, Z} and any T′′ ⊆ MB_G(X) \ (T′ ∪ {W, Z}). The result then follows by G being a perfect map of P. From the assumptions, it is clear that there is at least one W ∈ S_P(X) (namely Y) such that (8) holds and

X ⊥̸_G W | T′ ∪ {Z}, (10)

if we choose T′ as T. Since G is a perfect map of P, we have that (8) implies

X ⊥_P W | T′, (11)

and (10) implies X ⊥̸_P W | T′ ∪ {Z}. From Lemma 8 it follows that Z is a child of X in G. Moreover, from Lemma 7 we get that there must be at least one active path between W and Z. WLOG let a be the shortest such path connecting Z to such a W. We prove that the existence of a implies that (9) is fulfilled. First note that the last arc in a (connecting to Z) must be directed into Z, as otherwise there would be an active path from X to W given T′, contradicting (8). Also note that if the length of a is 1, then we have the situation where Z is a child of both W and X, and hence a trivially ensures that (9) is fulfilled. Assume therefore that a has length 2 or more. We prove that the existence of a implies (9) by contradiction. That is, we assume that there is a set T′′ ⊆ MB_G(X) \ (T′ ∪ {W, Z}) such that X ⊥_G W | T′ ∪ T′′ ∪ {Z}, and we prove that this means that a cannot be the shortest path matching its definition. As a is active given T′ and Z is a child of X, it follows that the nodes in T′′ must render a inactive, which can only happen if some of the nodes in T′′ lie on a. Let U be the node in T′′ closest to W on a such that the connection at U is either serial or diverging (meaning that U renders a inactive). There are three possible scenarios:
1. a: (W · · · → U → · · · Z)

2. a: (W · · · ← U → · · · Z)

3. a: (W · · · ← U ← · · · Z)

Let b be the part of a running between W and U, and c the part between U and Z. We show that U can serve the same role as W, and hence that a cannot be the shortest active path connecting Z to a suitable W.

Assume first that either Scenario 1 or 2 is the case: if X ⊥̸_G U | T′, then there is an active path d between X and U given T′, and hence also an active path from W to X given T′, formed by joining b and d. But that means that X ⊥̸_G W | T′, which contradicts (8). Thus we must have that X ⊥_G U | T′. Furthermore, since U is on a, it follows that X ⊥̸_G U | T′ ∪ {Z}. But then U satisfies (8) and (10) by virtue of the active path c, and hence a is not the shortest such path. Thus neither Scenario 1 nor 2 can be the case.

Assume then that Scenario 3 is the case: first we note that it cannot be the case that X ⊥_G U | T′, since b is active given T′ and this would imply that (8) and (10) hold for U, thus violating the definition of a as the shortest path. Hence X ⊥̸_G U | T′, meaning that there is an active path d from X to U given T′. Since b is also active given T′ and (8) is assumed, it follows that the arc attached to U in d must be an arc going into U. But since that results in a converging connection at U joining b and d, and X ⊥_G W | T′, we have that no descendant of U can be in T′. Specifically, we have that c cannot contain any nodes of T′, and hence that it must be a directed path from U to Z.

First assume that c contains some nodes that are adjacent to X. None of these can be parents of X, as the arc from the parent node to X, along with the active path a, would mean that X ⊥̸_G W | T′, contradicting (8). Thus nodes on c in MB_G(X) are either children of X or non-adjacent to X.
U itself must be non-adjacent to X, as U cannot be a child of X: that would imply that X ⊥̸_G U | R for any set R, and, since b is active given T′ ∪ T′′ ∪ {Z} and there is a converging connection at U on the path formed by b and the arc from X to U, that X ⊥̸_G W | T′ ∪ T′′ ∪ {Z}, violating the assumption that X ⊥_G W | T′ ∪ T′′ ∪ {Z}. Since U is not adjacent to X, we thus have that there is some node on c that is non-adjacent to X and closer to Z on c than any other such node. Let V be this node. Since Z is a child of X and a descendant of V, it is easy to see that V satisfies both (8) and (10), and thus that a cannot be the shortest path connecting Z to a satisfying node. The lemma is thus proved.
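The d-separation arguments in these proofs are purely mechanical and can be checked by computer. Below is a minimal sketch (our own illustration, not the authors' implementation; the parent-set representation and function names are assumptions) of the standard moralization-based test: X and Y are d-separated by Z iff they are disconnected in the moral graph of the ancestral subgraph of {X, Y} ∪ Z after removing Z.

```python
from itertools import combinations

def ancestors(dag, nodes):
    """All ancestors of `nodes`, including the nodes themselves.
    `dag` maps each node to the set of its parents."""
    result, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in result:
            result.add(n)
            stack.extend(dag[n])
    return result

def d_separated(dag, x, y, z):
    """True iff x and y are d-separated given the set z in `dag`."""
    keep = ancestors(dag, {x, y} | set(z))
    # Moralize the ancestral subgraph: connect each node to its parents,
    # and marry co-parents of a common child.
    adj = {n: set() for n in keep}
    for n in keep:
        for p in dag[n] & keep:
            adj[n].add(p)
            adj[p].add(n)
        for p, q in combinations(dag[n] & keep, 2):
            adj[p].add(q)
            adj[q].add(p)
    # Remove the conditioning set and test reachability from x to y.
    blocked = set(z)
    stack, seen = [x], {x}
    while stack:
        n = stack.pop()
        if n == y:
            return False
        for m in adj[n]:
            if m not in seen and m not in blocked:
                seen.add(m)
                stack.append(m)
    return True
```

For the v-structure A → C ← B, for instance, A and B are d-separated by the empty set but not by {C}, matching the activation behaviour of converging connections used in the proofs.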
These basic but important lemmas aside, we now establish a connection between the purely probabilistic notion of similarity and the DAGs that ensure it. Since two equivalent DAGs G1 and G2 encode the same conditional independence statements, and these are reflected in any probability distribution having one of G1 and G2 as a perfect map, the following corollary is immediate.

Corollary 10. Let G1 and G2 be perfect maps of probability distributions P1 and P2, both defined over variables V. If G1 and G2 are equivalent, then P1 and P2 are similar on V.

This connection spurs the introduction of a new concept:

Definition 11 (Locally Compelled Arc). Let G1 be a perfect map of a probability distribution P1 over V, and let X and Y be in V such that X is a parent of Y in G1. We call the arc from X to Y locally compelled in G1 if, for any probability distribution P2 similar to P1 on {X} and any perfect map G2 of P2, we have that X is a parent of Y in G2. We denote by CH_G(X) the variables that are attached to X in G by locally compelled arcs originating from X.

It turns out that locally compelled arcs are important for learning network structures locally: the locally compelled arcs of a graph encompass all arcs of the pattern of that graph, and they enjoy the property of being compelled too, as the following results show.

Lemma 12. Let P be a DAG faithful probability distribution over V, X ∈ V, Y ∈ S_X^P, and T a maximal separating set of Y from X wrt. P. Then for any DAG G which is a perfect map of P, each member Z of MB_P(X) \ (T ∪ {Y} ∪ S_X^P) must be a compelled child of X. That is, Z is in CH_G(X).

Proof. We need to show that Z is a child of X in any perfect map of any probability distribution P2 similar to P on {X}. Since T is a maximal separating set, it follows that Z ∈ E_X^Y. Since P and P2 are similar on X, Y ∈ S_X^P, and Z ∈ E_X^Y \ S_X^P, we have that Y ∈ S_X^P2 and Z ∈ E_X^Y \ S_X^P2 wrt. P2.
Hence, there is a (maximal separating) set T2 ⊆ MB_P2(X) not including Z, such that X ⊥_P2 Y | T2 and X ⊥̸_P2 Y | T2 ∪ {Z}. The result then follows from Lemma 8.

Lemma 13. Let P be a DAG faithful probability distribution over V, and X, Y, Z ∈ V. Then for any DAG G which is a perfect map of P, if (X, Z, Y)
is a v-structure in G, then Y ∈ S_X^P, Z ∈ E_X^Y \ S_X^P, and X → Z and Y → Z are both locally compelled.

Proof. We only show that X → Z is locally compelled, as the case of Y → Z is similar. Since Y is not adjacent to X, Corollary 6 implies that Y ∈ S_X^P, and hence that E_X^Y is well-defined. Clearly, Z cannot be in any maximal separating set of Y from X wrt. P, as G is a perfect map of P and X ⊥̸_G Y | T ∪ {Z} for any set T. Furthermore, as Z is adjacent to X, Corollary 6 ensures that Z ∉ S_X^P. The result now follows from Lemma 12.

From Definition 11, Corollary 10, and Lemma 13, we thus have:

Corollary 14. For any DAG G and any arc participating in a v-structure, we have that this arc is locally compelled in G. Furthermore, each locally compelled arc in G is compelled in G.

At this point, we have that it is sufficient to require of an algorithm uncovering G, for some DAG faithful distribution, that it identifies the skeleton of G, identifies all locally compelled arcs, directs them, and then eliminates arcs not participating in v-structures. Moreover, any member of G's equivalence class can be constructed by identifying all locally compelled arcs and then directing the remaining arcs without introducing cycles or new v-structures. Formally, we can restate the corollary in terms that emphasize the connection between similarity and equivalence:

Corollary 15. Let G1 = (V, E1) and G2 = (V, E2) be DAGs. Then G1 is similar to G2 on V iff G1 is equivalent to G2.

Now we proceed by asserting that the arcs directed by our method are indeed locally compelled, and that no locally compelled arcs are left undirected. From Lemmas 9 and 12 we immediately get:

Corollary 16. Let P be a DAG faithful probability distribution over V, X ∈ V, Y ∈ S_X^P, T ⊆ MB_P(X) \ {Y}, and Z ∈ MB_P(X) \ (T ∪ S_X^P ∪ {Y}). If X ⊥ Y | T and X ⊥̸ Y | T ∪ {Z}, then for any DAG G which is a perfect map of P, Z is in CH_G(X).

We thus have that all arcs directed by PartiallyDirect( ) are locally compelled.
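Corollary 15 ties similarity to equivalence, and equivalence itself has a simple graphical test (Verma and Pearl): two DAGs are equivalent iff they have the same skeleton and the same v-structures. A small sketch of that test (our own illustrative code, with DAGs given as maps from each node to its parent set):

```python
def skeleton(dag):
    """Undirected edges as frozensets; `dag` maps node -> set of parents."""
    return {frozenset((child, p)) for child, ps in dag.items() for p in ps}

def v_structures(dag):
    """Triples (a, c, b) with a -> c <- b and a, b non-adjacent."""
    skel = skeleton(dag)
    out = set()
    for c, ps in dag.items():
        for a in ps:
            for b in ps:
                if a < b and frozenset((a, b)) not in skel:
                    out.add((a, c, b))
    return out

def equivalent(g1, g2):
    """Verma-Pearl criterion: same skeleton and same v-structures."""
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)
```

For example, the chains A → B → C and A ← B ← C have the same skeleton and no v-structures, and are therefore equivalent, while A → B ← C is not equivalent to either.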
That an arc X → Y participating in a v-structure (X, Y, Z) will be found by PartiallyDirect( ) follows from Lemma 17.
Lemma 17. Let P be a DAG faithful probability distribution over V, X, Y ∈ V, and G be a perfect map of P. There is a W ∈ S_X^P such that Y ∈ E_X^W \ S_X^P iff there is Z ∈ MB_G(X) ∩ S_X^P such that Y is a descendant of Z through a path encompassing only children of X.

Proof. The if part follows trivially from the d-separation properties of graphs, and we therefore only prove the only if part: Since Y ∈ E_X^W \ S_X^P, we have from Lemma 12 that Y is a child of X in G. Moreover, we know that there is a set T ⊆ MB_P(X) = MB_G(X) such that X ⊥_G W | T but X ⊥̸_G W | T ∪ {Y}, and thus by Lemma 7 that in G there is a node C with non-adjacent parents A and B, such that Y is a descendant of C on a path c, and there are active paths a and b given T between X and A and between B and W, respectively.

Let V be the first node on c following Y. As V is also a descendant of C, it follows that it is also in E_X^W. If V is furthermore in S_X^P, then we have found the node Z that we are looking for, so assume that this is not the case. Then Lemma 12 can be applied again with V taking Y's place. Iterating in this manner, we either find the Z or we get that each node on c must be a child of X. Specifically, this holds for C. But then either B is the node Z we are looking for, or there is an active path from X to W given T, consisting of the connection between X and B along with b. The latter possibility is precluded by the assumption X ⊥_G W | T.

A direct consequence of this graphical result and the d-separation properties of graphs is:

Corollary 18. Let P be a DAG faithful probability distribution over V, X, Y ∈ V, and G be a perfect map of P. There is a W ∈ S_X^P such that Y ∈ E_X^W \ S_X^P iff there is Z ∈ S_X^P ∩ MB_P(X) such that Y ∈ E_X^Z \ S_X^P.

Lemma 19. Let P be a DAG faithful probability distribution over V, X, Y ∈ V, and G be a perfect map of P. We then have that Y ∈ CH_G(X) iff there is a W ∈ S_X^P such that Y ∈ E_X^W \ S_X^P.

Proof. The if part follows immediately from Lemma 12.
We therefore only treat the only if part: Since X → Y is an arc in G, it follows from Corollary 6 that Y ∉ S_X^P, so what we need to prove is that Y ∈ E_X^W for some W ∈ V. We do so by contradiction, assuming that Y is not in any such E_X^W, and showing that any distribution P2, having the graph G2 that results from reversing X → Y as its perfect map, is similar to P on X, and thus that X → Y cannot be locally compelled.
We assume that P2 is not similar to P on X, and hence that at least one of Points 1 to 3 in Definition 2 is violated. First we note that it cannot be the case that there is a v-structure (X, Y, Z), as Y would then be in E_X^Z by Lemma 13. Hence MB_G(X) = MB_G2(X) and therefore MB_P(X) = MB_P2(X), so Point 1 is satisfied. Additionally, Corollary 6 means that the fact that no adjacencies are changed implies that Point 2 is satisfied. So it must be Point 3 that is violated, meaning that for some node W we have that E_X^W \ S_X^P differs from E_X^W \ S_X^P2, the latter being taken wrt. P2. Let Z be a node having either left or entered such an enabling set due to the arc reversal. First note that the assumption that Y is in no such set means that if Z = Y, then Y must have entered such a set; but then Lemma 12 states that X → Y is a compelled arc in G2, which contradicts the definition of G2. Thus Z ≠ Y. We prove that this is impossible too, reaching the required contradiction.

First assume that Z ∈ E_X^W but Z ∉ E_X^W wrt. P2. This means that there is a set T ⊆ MB_P(X) such that X ⊥_G W | T but X ⊥̸_G W | T ∪ {Z}, and by Lemma 7 that in G there is a node C with non-adjacent parents A and B, such that Z is a descendant of C on a path c, and there are active paths a and b given T between X and A and between B and W, respectively, but the reversal of X → Y somehow destroys this setup. Clearly, X → Y cannot be on c, since that would mean that X ⊥̸_G W | T, and X → Y cannot be on b, as that would either mean X ⊥̸_G W | T or require Y to participate in a v-structure at C, which we have established cannot be the case. Thus X → Y is on a, and WLOG we take it to be the first arc of a. If the next arc on a is an arc out of Y, then exchanging X → Y for X ← Y changes the connection at Y from serial to diverging, and a is then still active. Therefore the connection at Y must be converging. Since Y does not participate in a v-structure, X must be adjacent to the next node V on a.
The connection between X and V must be directed towards V, as otherwise there is guaranteed to be an active path from X to C without the help of X → Y. We consider the two cases for the arc between V and the node U following V on a: If this is an arc out of V, then X → V provides an active path between X and C, so we know this cannot be the case. So the arc is oriented towards V. Now V cannot itself participate in any v-structures, since by Lemma 17 that would imply that Y ∈ E_X^R for some R, which we have assumed is not the case. So U is also adjacent to X, and by repeating the previous arguments we continue to establish that X must be a parent of all nodes on a, including A. But then there is an active path between X and W, formed by X → A → C ← B together with b, without the help of X → Y, and we conclude that Z cannot have left an enabling set due to the arc reversal.
Finally, assume that Z ∉ E_X^W but Z ∈ E_X^W wrt. P2. As before, Lemma 7 guarantees that we have the structure of paths a, b, and c along with the node Z and the set T, but this time active in G2 and not in G. We know that X → Y cannot be part of b, as that would either mean X ⊥̸_G2 W | T or require X to participate in a v-structure and be part of T, defying the definition of T. Moreover, X → Y cannot be on c, as that would either imply that X ⊥̸_G2 W | T or that X ⊥̸_G2 W | T ∪ {Y}, the latter being precluded by Corollary 16 and the definition of G2 as a DAG where Y is a parent of X. So X → Y must be on a, and we assume WLOG that it is the first arc on a. Since reversing X → Y breaks the setup, we must have that the arc following it on a cannot be directed out of Y. Let V be the node connected to Y in this way. Since Y does not participate in a v-structure in G, and only the direction of the link between X and Y differs between the two graphs, X must also be adjacent to V in G. As above, we get that V, and all other nodes on a, must be children of X, ending up with the active path between X and W formed by X → A → C ← B together with b, contradicting the assumption that Z ∉ E_X^W.

Summing up, we have:

Corollary 20. Let P be a DAG faithful probability distribution over V, G a perfect map of P, and X → Y an arc in G. Then the following are equivalent statements:

- X → Y is locally compelled,
- there is a W ∈ S_X^P such that Y ∈ E_X^W \ S_X^P,
- there is a W ∈ S_X^P ∩ MB_P(X) such that Y ∈ E_X^W \ S_X^P,
- there is a W ∈ V and T ⊆ MB_P(X) such that X ⊥ W | T and X ⊥̸ W | T ∪ {Y}, and
- there is a W ∈ MB_G(X) ∩ S_X^P such that there is a directed path from W to Y in G running through only nodes in CH_G(X).

Thus there are strong correspondences between the abstract probabilistic definition of locally compelled arcs, the conditional independence oriented definition of enabling sets, and strictly graphical notions. This allows us to finally state the proof of Theorem 3:

Theorem 21.
Let BN B1 = (G1, Φ1) and DAG faithful probability distributions P1 and P2 each be defined over variables V, and let h be a DAG faithful
sample sequence of size 2r and rank r, where P1 is the distribution of the first sample and P2 is the distribution of the last sample. Furthermore, let P1 be similar to P2 on S, and let B2 = (G2, Φ2) be the result of running Algorithm 1 on B1 with cases h and some choice of k. Denote by G1* and G2* perfect maps of P1 and P2, respectively. If

1. G1 is equivalent to G1*,
2. r > k,
3. ChangeInStream(c_X, k) is true for a variable X iff X ∉ S and the algorithm is currently processing the k-th sample drawn from P2,
4. UncoverMarkovBoundary(X, h) returns MB_h(X),
5. the graph G_X returned from UncoverAdjacencies(X, M_X, h) connects X to Y iff Y ∉ S_X^h, and
6. the oracle used in PartiallyDirect( ) is correct,

then G2 is equivalent to G2*.

Proof. Given Assumptions 2 and 3, it is obvious that Adapt( ) does nothing except exactly once, when it calls UpdateNet( ) on B1, V \ S, and the first k cases h_k of the partition of h sampled from P2. According to Corollary 14, if we can prove that i) two nodes are adjacent in G2 iff they are adjacent in G2*, and ii) an arc is put in G2 by PartiallyDirect( ) or MergeAndDirect( ) iff it is a locally compelled arc in G2*, then we have that G2 is equivalent to G2*. The result then follows from the correctness of Rules 1 to 3 on page 10, which has been proved in (Spirtes, Glymour, and Scheines 2001).

i) Let X, Y ∈ V be adjacent in G2. It is either the case that both X and Y are in S or not. In the latter case, we assume WLOG that X ∉ S. Then X and Y have been set adjacent by UncoverAdjacencies( ), and by Assumption 5 this means that Y ∉ S_X^{h_k}. Since h is piecewise DAG faithful, and h_k is sampled solely from the partition generated by P2, it follows that Y ∉ S_X^P2, and by Corollary 6 that X and Y are adjacent in G2*. In the first case, X and Y are adjacent in G2 iff they are so in G1, so we have that X and Y are adjacent in G1. By Corollary 6 and Assumption 1, this means that Y ∉ S_X^P1. Since P1 and P2 are similar on S, and thus on X, it follows that Y ∉ S_X^P2, and by Corollary 6 that X and Y are adjacent in G2*.
Proving that if X and Y are non-adjacent in G2, then they are so also in G2* (a perfect map of P2), follows the same reasoning.
ii) We consider arcs emanating from a node X in G2. First, let X ∈ S. This means that the locally compelled arcs emanating from X in G2 are exactly those in G1. By Corollary 15 and Assumption 1, these are again the same as those in G1* (a perfect map of P1), and by Corollary 20 these are exactly those created by MergeAndDirect( ). Thus CH_G2(X) = CH_G2*(X), where G2* is a perfect map of P2.

Assume then that X ∉ S. This means that the arcs out of X are created by PartiallyDirect( ), and only by this method. Due to Assumptions 4 and 5, this method sets Y as a child of X iff it can find a variable Z and a set T ⊆ MB_{h_k}(X) such that I_{h_k}(X, Z | T) holds and I_{h_k}(X, Z | T ∪ {Y}) does not. By Assumption 6, this means that Y is set as a child of X iff X ⊥_{h_k} Z | T and X ⊥̸_{h_k} Z | T ∪ {Y}. Since h_k is a representative sample from P2, it follows that Y is set as a child of X iff X ⊥_P2 Z | T and X ⊥̸_P2 Z | T ∪ {Y}, and hence by Corollary 20 that Y is set as a child of X in G2 iff X → Y is a locally compelled arc in G2*. This proves the result.

5 Experiments and Results

To investigate how our method behaves in practice, we ran a series of experiments with a fully implemented version of the method. In our implementation we have used the conflict measure of Jensen, Chamberlain, Nordahl, and Jensen (1991) in the method ConflictMeasure(B, X_i, (x_1,...,x_m)):

log [ P_E(X_i = x_i) / P_B(X_i = x_i | X_j = x_j, j ≠ i) ],

where the empty BN structure E is adapted to the observations using fractional updating. For ChangeInStream(c_X, k) we calculated the second discrete cosine transform component (see e.g. (Press, Teukolsky, Vetterling, and Flannery 2002)) of the last 2k measures in c_X, and had ChangeInStream( ) return true if this statistic exceeded a fixed threshold.4 UncoverMarkovBoundary( ) was implemented as the decision tree learning method of (Frey, Fisher, Tsamardinos, Aliferis, and Statnikov 2003), and UncoverAdjacencies(X, M_X, h) corresponds to the AlgorithmCD( ) method of (Peña, Björkegren, and Tegnér 2005) restricted to the variables in M_X.
This method uses a greedy approach, iteratively growing and shrinking the estimated set of adjacent variables until no further change takes place. Both of these methods need an independence oracle, and for that we have used a χ2 test. This test was also used as the oracle in PartiallyDirect( ).

4 We are unaware of any previous work using this kind of change point detection, but in our setting it outperformed the traditional methods of log-odds ratios and t-tests.
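The change-detection step described above can be sketched as follows. This is our own illustration rather than the authors' implementation: the window handling and the threshold value are assumptions (the threshold actually used is not reproduced here), but the statistic is the second DCT-II component of the last 2k conflict measures.

```python
import math

def dct2_component(xs):
    """Second (k = 1) DCT-II coefficient of the sequence xs."""
    n = len(xs)
    return sum(x * math.cos(math.pi * (i + 0.5) / n) for i, x in enumerate(xs))

def change_in_stream(conflicts, k, threshold=1.0):
    """Flag a change when the magnitude of the second DCT component of the
    last 2k conflict measures exceeds the (illustrative) threshold."""
    window = conflicts[-2 * k:]
    return abs(dct2_component(window)) > threshold
```

A constant stream of conflict scores yields a second component near zero, while a level shift inside the window yields a large magnitude, which is what makes this coefficient usable as a change indicator.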
We constructed 100 experiments, each consisting of five randomly generated BNs over ten variables, B1,...,B5. We made sure that Bi was structurally similar to Bi-1, except for the connection between two randomly chosen nodes. All CPTs in Bi were kept the same as in Bi-1, except for the nodes with a new parent set. For these we employed four different methods for generating new distributions: A estimated the probabilities from the previous network, with some added noise to ensure that no two distributions were the same. B to D generated entirely new CPTs, with B being totally random, C being restricted to CPTs that implied strong dependencies between the parent variables and the child, and D being restricted to only those CPTs where these dependencies were extremely strong. For generation of the initial BNs we used the method of (Ide, Cozman, and Ramos 2004).

For each series of five BNs, we sampled r cases from each network and concatenated them into a piecewise DAG faithful sample sequence of rank r and length 5r, for r being 500, 1000, and … We fed our method (NN) the generated sequences, using different values of k (100, 500, and 1000), and measured the precision wrt. the KL distance on each. As mentioned, we are unaware of other work geared towards non-stationary distributions, but for baseline comparison purposes we implemented the structural adaptation methods of Friedman and Goldszmidt (1997) (FG) and Lam and Bacchus (1994) (LB), which serve as representatives of the state of the art. For the method of Friedman and Goldszmidt (1997) we tried both simulated annealing (FG-SA) and a more time-consuming hill-climbing (FG-HC) for the unspecified search step of the algorithm. As these methods have not been developed to deal with non-stationary distributions, they have to be told the delay between learning runs.
For this we used the same value k that we use as the allowed delay for our own method, as this ensures that all methods store only a maximum of k full cases at any one time. Each method was given the correct initial network to start its exploration.

Space does not permit us to present the results in full, but the precision of NN, FG-SA, and FG-HC is presented in Figures 1 and 2. NN outperformed FG-SA in 81 of the experiments, and FG-HC in 65 of the experiments. The precision of the LB method was much worse than that of any of these three. The one experiment where both FG methods outperformed the NN method substantially had r equal to … and k equal to 5000, and was thus the experiment closest to the stationarity assumption of the FG and LB learners. What is interesting is that another experiment having similar r and k values, but generated with method B rather than D, gave the exact opposite result (NN being much
Published in: Journal of Global Optimization, 5, pp. 69-9, 199. Inference of A Minimum Size Boolean Function by Using A New Efficient Branch-and-Bound Approach From Examples Evangelos Triantaphyllou Assistant
More information4.1 Notation and probability review
Directed and undirected graphical models Fall 2015 Lecture 4 October 21st Lecturer: Simon Lacoste-Julien Scribe: Jaime Roquero, JieYing Wu 4.1 Notation and probability review 4.1.1 Notations Let us recall
More informationPart I Qualitative Probabilistic Networks
Part I Qualitative Probabilistic Networks In which we study enhancements of the framework of qualitative probabilistic networks. Qualitative probabilistic networks allow for studying the reasoning behaviour
More informationConditional Independence and Factorization
Conditional Independence and Factorization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationLearning the Structure of Linear Latent Variable Models
Journal (2005) - Submitted /; Published / Learning the Structure of Linear Latent Variable Models Ricardo Silva Center for Automated Learning and Discovery School of Computer Science rbas@cs.cmu.edu Richard
More informationStatistical Approaches to Learning and Discovery
Statistical Approaches to Learning and Discovery Graphical Models Zoubin Ghahramani & Teddy Seidenfeld zoubin@cs.cmu.edu & teddy@stat.cmu.edu CALD / CS / Statistics / Philosophy Carnegie Mellon University
More information1. what conditional independencies are implied by the graph. 2. whether these independecies correspond to the probability distribution
NETWORK ANALYSIS Lourens Waldorp PROBABILITY AND GRAPHS The objective is to obtain a correspondence between the intuitive pictures (graphs) of variables of interest and the probability distributions of
More informationMonotone DAG Faithfulness: A Bad Assumption
Monotone DAG Faithfulness: A Bad Assumption David Maxwell Chickering Christopher Meek Technical Report MSR-TR-2003-16 Microsoft Research Microsoft Corporation One Microsoft Way Redmond, WA 98052 Abstract
More informationOut-colourings of Digraphs
Out-colourings of Digraphs N. Alon J. Bang-Jensen S. Bessy July 13, 2017 Abstract We study vertex colourings of digraphs so that no out-neighbourhood is monochromatic and call such a colouring an out-colouring.
More informationDecision-Theoretic Specification of Credal Networks: A Unified Language for Uncertain Modeling with Sets of Bayesian Networks
Decision-Theoretic Specification of Credal Networks: A Unified Language for Uncertain Modeling with Sets of Bayesian Networks Alessandro Antonucci Marco Zaffalon IDSIA, Istituto Dalle Molle di Studi sull
More informationLearning Semi-Markovian Causal Models using Experiments
Learning Semi-Markovian Causal Models using Experiments Stijn Meganck 1, Sam Maes 2, Philippe Leray 2 and Bernard Manderick 1 1 CoMo Vrije Universiteit Brussel Brussels, Belgium 2 LITIS INSA Rouen St.
More information4-coloring P 6 -free graphs with no induced 5-cycles
4-coloring P 6 -free graphs with no induced 5-cycles Maria Chudnovsky Department of Mathematics, Princeton University 68 Washington Rd, Princeton NJ 08544, USA mchudnov@math.princeton.edu Peter Maceli,
More informationAnalysis of Algorithms. Outline 1 Introduction Basic Definitions Ordered Trees. Fibonacci Heaps. Andres Mendez-Vazquez. October 29, Notes.
Analysis of Algorithms Fibonacci Heaps Andres Mendez-Vazquez October 29, 2015 1 / 119 Outline 1 Introduction Basic Definitions Ordered Trees 2 Binomial Trees Example 3 Fibonacci Heap Operations Fibonacci
More informationTowards an extension of the PC algorithm to local context-specific independencies detection
Towards an extension of the PC algorithm to local context-specific independencies detection Feb-09-2016 Outline Background: Bayesian Networks The PC algorithm Context-specific independence: from DAGs to
More informationGraph coloring, perfect graphs
Lecture 5 (05.04.2013) Graph coloring, perfect graphs Scribe: Tomasz Kociumaka Lecturer: Marcin Pilipczuk 1 Introduction to graph coloring Definition 1. Let G be a simple undirected graph and k a positive
More informationBefore we show how languages can be proven not regular, first, how would we show a language is regular?
CS35 Proving Languages not to be Regular Before we show how languages can be proven not regular, first, how would we show a language is regular? Although regular languages and automata are quite powerful
More informationPart I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS
Part I C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Probabilistic Graphical Models Graphical representation of a probabilistic model Each variable corresponds to a
More informationModel Complexity of Pseudo-independent Models
Model Complexity of Pseudo-independent Models Jae-Hyuck Lee and Yang Xiang Department of Computing and Information Science University of Guelph, Guelph, Canada {jaehyuck, yxiang}@cis.uoguelph,ca Abstract
More informationUndirected Graphical Models: Markov Random Fields
Undirected Graphical Models: Markov Random Fields 40-956 Advanced Topics in AI: Probabilistic Graphical Models Sharif University of Technology Soleymani Spring 2015 Markov Random Field Structure: undirected
More informationProving languages to be nonregular
Proving languages to be nonregular We already know that there exist languages A Σ that are nonregular, for any choice of an alphabet Σ. This is because there are uncountably many languages in total and
More informationDirected Graphical Models
CS 2750: Machine Learning Directed Graphical Models Prof. Adriana Kovashka University of Pittsburgh March 28, 2017 Graphical Models If no assumption of independence is made, must estimate an exponential
More information12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria
12. LOCAL SEARCH gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria Lecture slides by Kevin Wayne Copyright 2005 Pearson-Addison Wesley h ttp://www.cs.princeton.edu/~wayne/kleinberg-tardos
More informationStrongly chordal and chordal bipartite graphs are sandwich monotone
Strongly chordal and chordal bipartite graphs are sandwich monotone Pinar Heggernes Federico Mancini Charis Papadopoulos R. Sritharan Abstract A graph class is sandwich monotone if, for every pair of its
More informationEnhancing Active Automata Learning by a User Log Based Metric
Master Thesis Computing Science Radboud University Enhancing Active Automata Learning by a User Log Based Metric Author Petra van den Bos First Supervisor prof. dr. Frits W. Vaandrager Second Supervisor
More informationExact and Approximate Equilibria for Optimal Group Network Formation
Exact and Approximate Equilibria for Optimal Group Network Formation Elliot Anshelevich and Bugra Caskurlu Computer Science Department, RPI, 110 8th Street, Troy, NY 12180 {eanshel,caskub}@cs.rpi.edu Abstract.
More informationP is the class of problems for which there are algorithms that solve the problem in time O(n k ) for some constant k.
Complexity Theory Problems are divided into complexity classes. Informally: So far in this course, almost all algorithms had polynomial running time, i.e., on inputs of size n, worst-case running time
More informationBayesian Networks. Vibhav Gogate The University of Texas at Dallas
Bayesian Networks Vibhav Gogate The University of Texas at Dallas Intro to AI (CS 6364) Many slides over the course adapted from either Dan Klein, Luke Zettlemoyer, Stuart Russell or Andrew Moore 1 Outline
More informationRespecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features
Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10) Respecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features Eun Yong Kang Department
More informationPacking and Covering Dense Graphs
Packing and Covering Dense Graphs Noga Alon Yair Caro Raphael Yuster Abstract Let d be a positive integer. A graph G is called d-divisible if d divides the degree of each vertex of G. G is called nowhere
More informationTDT70: Uncertainty in Artificial Intelligence. Chapter 1 and 2
TDT70: Uncertainty in Artificial Intelligence Chapter 1 and 2 Fundamentals of probability theory The sample space is the set of possible outcomes of an experiment. A subset of a sample space is called
More informationK 4 -free graphs with no odd holes
K 4 -free graphs with no odd holes Maria Chudnovsky 1 Columbia University, New York NY 10027 Neil Robertson 2 Ohio State University, Columbus, Ohio 43210 Paul Seymour 3 Princeton University, Princeton
More informationarxiv: v4 [math.st] 19 Jun 2018
Complete Graphical Characterization and Construction of Adjustment Sets in Markov Equivalence Classes of Ancestral Graphs Emilija Perković, Johannes Textor, Markus Kalisch and Marloes H. Maathuis arxiv:1606.06903v4
More informationGenerating p-extremal graphs
Generating p-extremal graphs Derrick Stolee Department of Mathematics Department of Computer Science University of Nebraska Lincoln s-dstolee1@math.unl.edu August 2, 2011 Abstract Let f(n, p be the maximum
More informationPAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht
PAC Learning prof. dr Arno Siebes Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht Recall: PAC Learning (Version 1) A hypothesis class H is PAC learnable
More informationThe Impossibility of Certain Types of Carmichael Numbers
The Impossibility of Certain Types of Carmichael Numbers Thomas Wright Abstract This paper proves that if a Carmichael number is composed of primes p i, then the LCM of the p i 1 s can never be of the
More informationThe Complexity of Maximum. Matroid-Greedoid Intersection and. Weighted Greedoid Maximization
Department of Computer Science Series of Publications C Report C-2004-2 The Complexity of Maximum Matroid-Greedoid Intersection and Weighted Greedoid Maximization Taneli Mielikäinen Esko Ukkonen University
More informationA MODEL-THEORETIC PROOF OF HILBERT S NULLSTELLENSATZ
A MODEL-THEORETIC PROOF OF HILBERT S NULLSTELLENSATZ NICOLAS FORD Abstract. The goal of this paper is to present a proof of the Nullstellensatz using tools from a branch of logic called model theory. In
More informationBayesian Networks. Vibhav Gogate The University of Texas at Dallas
Bayesian Networks Vibhav Gogate The University of Texas at Dallas Intro to AI (CS 4365) Many slides over the course adapted from either Dan Klein, Luke Zettlemoyer, Stuart Russell or Andrew Moore 1 Outline
More informationSergey Norin Department of Mathematics and Statistics McGill University Montreal, Quebec H3A 2K6, Canada. and
NON-PLANAR EXTENSIONS OF SUBDIVISIONS OF PLANAR GRAPHS Sergey Norin Department of Mathematics and Statistics McGill University Montreal, Quebec H3A 2K6, Canada and Robin Thomas 1 School of Mathematics
More informationAlgebraic Methods in Combinatorics
Algebraic Methods in Combinatorics Po-Shen Loh 27 June 2008 1 Warm-up 1. (A result of Bourbaki on finite geometries, from Răzvan) Let X be a finite set, and let F be a family of distinct proper subsets
More informationBichain graphs: geometric model and universal graphs
Bichain graphs: geometric model and universal graphs Robert Brignall a,1, Vadim V. Lozin b,, Juraj Stacho b, a Department of Mathematics and Statistics, The Open University, Milton Keynes MK7 6AA, United
More informationTHE VINE COPULA METHOD FOR REPRESENTING HIGH DIMENSIONAL DEPENDENT DISTRIBUTIONS: APPLICATION TO CONTINUOUS BELIEF NETS
Proceedings of the 00 Winter Simulation Conference E. Yücesan, C.-H. Chen, J. L. Snowdon, and J. M. Charnes, eds. THE VINE COPULA METHOD FOR REPRESENTING HIGH DIMENSIONAL DEPENDENT DISTRIBUTIONS: APPLICATION
More informationCS 350 Algorithms and Complexity
1 CS 350 Algorithms and Complexity Fall 2015 Lecture 15: Limitations of Algorithmic Power Introduction to complexity theory Andrew P. Black Department of Computer Science Portland State University Lower
More information2 : Directed GMs: Bayesian Networks
10-708: Probabilistic Graphical Models 10-708, Spring 2017 2 : Directed GMs: Bayesian Networks Lecturer: Eric P. Xing Scribes: Jayanth Koushik, Hiroaki Hayashi, Christian Perez Topic: Directed GMs 1 Types
More informationMarkov properties for mixed graphs
Bernoulli 20(2), 2014, 676 696 DOI: 10.3150/12-BEJ502 Markov properties for mixed graphs KAYVAN SADEGHI 1 and STEFFEN LAURITZEN 2 1 Department of Statistics, Baker Hall, Carnegie Mellon University, Pittsburgh,
More informationOutline. CSE 573: Artificial Intelligence Autumn Bayes Nets: Big Picture. Bayes Net Semantics. Hidden Markov Models. Example Bayes Net: Car
CSE 573: Artificial Intelligence Autumn 2012 Bayesian Networks Dan Weld Many slides adapted from Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer Outline Probabilistic models (and inference)
More informationNoisy-OR Models with Latent Confounding
Noisy-OR Models with Latent Confounding Antti Hyttinen HIIT & Dept. of Computer Science University of Helsinki Finland Frederick Eberhardt Dept. of Philosophy Washington University in St. Louis Missouri,
More information