Adapting Bayes Nets to Non-stationary Probability Distributions
Søren Holbech Nielsen & Thomas D. Nielsen

June 4,

1 Introduction

Ever since Pearl (1988) published his seminal book on Bayesian networks (BNs), the formalism has become a widespread tool for representing, eliciting, and discovering probabilistic relationships for problem domains defined over discrete variables. One area of research that has seen much activity is learning the structure of BNs, where probabilistic relationships among variables are discovered, or inferred, from a database of observations of these variables (main papers in learning include (Cooper and Herskovits 1992), (Heckerman, Geiger, and Chickering 1994), and (Chickering 2002)). One part of this research area focuses on incremental learning of BNs, where observations are received sequentially and a BN structure is gradually constructed along the way without keeping all the observations in memory. A special case of incremental learning is structural adaptation, where the incremental algorithm maintains one or more candidate structures and applies changes to these as observations are received. This particular area of research has received very little attention, the only results that we are aware of being (Buntine 1991), (Lam and Bacchus 1994) (and later (Lam 1998)), (Friedman and Goldszmidt 1997), and (Alcobé 2004).

These papers all assume that the database of observations has been produced by a stationary stochastic process; that is, the ordering of the observations in the database is inconsequential. However, many real-life observable processes cannot really be said to be invariant with respect to time: mechanical mechanisms tend to become worn out as time progresses, for instance, and influences not observed in the data may change suddenly. When human decision makers are somehow involved in the data generating process, these are almost surely not fully describable by the observables
and may change their behaviour instantaneously if some advantage can be gained. A very simple example of a situation where it is unrealistic to expect a stationary generating process is a database storing business transactions and internal communications of some company. If at some point a high-ranking employee is replaced, the subsequent data would be distributed differently from that representing the time before the change.

In this work we relax the assumption of stationary data, opting instead for learning from data which is only approximately stationary. More concretely, we assume that the data generating process is piecewise stationary, as in the example given above. Furthermore, we focus on domains where the shifts in distribution from one stationary period to the next are of a local nature (i.e. only a subset of the probabilistic relationships among variables change as the shifts take place). As the assumptions underlying the method are generalizations of those underlying other structural adaptation methods, we may even hope that our method is competitive for incremental learning from data generated by stationary distributions.

2 Preliminaries

Here we present the definitions and terminology necessary to understand the remainder of the text. As a general notational rule we use bold font to denote sets and vectors (V, c, etc.) and calligraphic font to denote mathematical structures and compositions (B, G, etc.). Moreover, we use upper case letters to denote random variables or sets of random variables (X, Y, V, etc.), and lower case letters to denote specific states of these variables (x, y, c, etc.). The symbol ≜ is used to denote "defined as".

A BN B ≜ (G, Φ) over a set of discrete variables V consists of an acyclic directed graph (traditionally abbreviated DAG) G, whose nodes are the variables in V, and a set of conditional probability distributions Φ (which we abbreviate CPTs, for "conditional probability tables").
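To make the (G, Φ) pairing concrete, the following is a minimal sketch of a hypothetical three-variable BN with binary variables, where the joint distribution is obtained as the product of one CPT entry per variable. The variables, numbers, and dictionary encoding are illustrative only, not from the paper:

```python
# Hypothetical BN over binary variables with arcs A -> B and A -> C.
# Each CPT maps (parent-values, value) to a probability.
cpts = {
    "A": {((), 0): 0.6, ((), 1): 0.4},
    "B": {((0,), 0): 0.9, ((0,), 1): 0.1,
          ((1,), 0): 0.3, ((1,), 1): 0.7},
    "C": {((0,), 0): 0.8, ((0,), 1): 0.2,
          ((1,), 0): 0.5, ((1,), 1): 0.5},
}
parents = {"A": (), "B": ("A",), "C": ("A",)}

def joint(assignment):
    """P_B(v): the product of one CPT entry per variable."""
    p = 1.0
    for x, pa in parents.items():
        pa_vals = tuple(assignment[q] for q in pa)
        p *= cpts[x][(pa_vals, assignment[x])]
    return p

print(joint({"A": 1, "B": 0, "C": 1}))  # ≈ 0.06 (= 0.4 * 0.3 * 0.5)
```

Summing `joint` over all eight assignments recovers 1, as required of a joint distribution.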
A correspondence between G and Φ is enforced by requiring that Φ consists of one CPT P(X | PA_G(X)) for each variable X, specifying a conditional probability distribution for X given each possible instantiation of the parents PA_G(X) of X in G. A unique joint distribution P_B over V is obtained by taking the product of all the CPTs in Φ.

In a graph G we shall often need to distinguish between different types of connections on paths between nodes: for three nodes X, Y, and Z next to each other on some path a, we say that the connection at Y is serial if the arcs X → Y and Y → Z are part of a; if the arcs Y → X and Y → Z are part of a,
we call the connection diverging; and if the arcs X → Y and Z → Y are part of a, we call it converging. If, for such a converging connection, we furthermore have that X and Z are not adjacent in G, then we call the triple (X, Y, Z) a v-structure of G.

Due to the construction of P_B we are guaranteed that all dependencies inherent in P_B can be read directly from G by use of the d-separation criterion (Pearl 1988): for any path a between two nodes X and Y in G, we say that a is active given a set of nodes Z if, for each node W on the path, either the connection at W is serial or diverging and W is not in Z, or the connection is converging and either W or one of its descendants is in Z. If there is no active path between two nodes X and Y given Z, then we say that X and Y are d-separated given Z. The d-separation criterion states that if X and Y are d-separated by Z, then X is conditionally independent of Y given Z in P_B; or, equivalently, if X is conditionally dependent on Y given Z in P_B, then X and Y are not d-separated by Z in G. In the remainder of the text, we use X ⊥_G Y | Z to denote that X is d-separated from Y by Z in the DAG G, and X ⊥_P Y | Z to denote that X is conditionally independent of Y given Z in the distribution P (with ⊥̸ denoting the negation of either statement). The d-separation criterion is thus

X ⊥_G Y | Z ⇒ X ⊥_{P_B} Y | Z, (1)

for any BN B ≜ (G, Φ). The set of all conditional independence statements that may be read from a graph in this manner we refer to as that graph's d-separation properties. We refer to any two graphs over the same variables as being equivalent if they have the same d-separation properties. Equivalence is obviously an equivalence relation.
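D-separation can be tested mechanically. The sketch below uses a criterion equivalent to the active-path definition above (Lauritzen's moralization test: restrict to the ancestors of X ∪ Y ∪ Z, marry co-parents, drop directions, delete Z, and check connectivity); the graph encoding as a parent map is an assumption of this sketch:

```python
def ancestors(dag, nodes):
    """All nodes with a directed path into `nodes`, plus `nodes` itself.
    `dag` maps each node to the set of its parents."""
    result, stack = set(nodes), list(nodes)
    while stack:
        for p in dag[stack.pop()]:
            if p not in result:
                result.add(p)
                stack.append(p)
    return result

def d_separated(dag, xs, ys, zs):
    """True iff xs and ys are d-separated given zs, via the
    ancestral moral graph criterion."""
    anc = ancestors(dag, set(xs) | set(ys) | set(zs))
    nbrs = {v: set() for v in anc}
    for v in anc:
        ps = dag[v] & anc
        for p in ps:                      # keep induced arcs, undirected
            nbrs[v].add(p); nbrs[p].add(v)
        for a in ps:                      # marry co-parents
            for b in ps:
                if a != b:
                    nbrs[a].add(b)
    blocked = set(zs)                     # delete zs, then BFS from xs
    seen, stack = set(xs) - blocked, list(set(xs) - blocked)
    while stack:
        for w in nbrs[stack.pop()] - blocked:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return not (seen & set(ys))

# v-structure A -> C <- B: A, B d-separated given {} but not given {C}.
dag = {"A": set(), "B": set(), "C": {"A", "B"}}
print(d_separated(dag, {"A"}, {"B"}, set()))  # True
print(d_separated(dag, {"A"}, {"B"}, {"C"}))  # False
```

The converging-connection case of the definition is visible here: conditioning on the common child C activates the path between A and B.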
Verma and Pearl (1991) proved that equivalent graphs necessarily have the same skeleton and the same v-structures, and the equivalence class of graphs equivalent to a specific graph G can then be uniquely represented by the partially directed graph G* obtained from the skeleton of G by directing those links that participate in a v-structure in G in the direction dictated by G. G* is called the pattern of G. Any graph G′ which is obtained from G* by directing the remaining undirected links, without creating a directed cycle or a new v-structure, is equivalent to G. We say that G′ is a consistent extension of G*. The partially directed graph G** obtained from G*, by directing undirected links as they appear in G whenever all consistent extensions of G* agree on this direction, is called the completed pattern of G. G** is obviously a unique representation of G's equivalence class as well. Any arc in G** is called compelled in G.
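The pattern just described (the skeleton, with only v-structure arcs kept directed) can be sketched as follows. Computing the completed pattern would additionally require propagating forced directions across consistent extensions, which is omitted here; the parent-map encoding is an assumption of the sketch:

```python
def pattern(dag):
    """Return (links, arcs) of the pattern of `dag`: the skeleton keeps
    an arc directed only if it takes part in a v-structure X -> Y <- Z
    with X and Z non-adjacent. `dag` maps each node to its parent set."""
    def adjacent(a, b):
        return a in dag[b] or b in dag[a]
    arcs = set()
    for y, ps in dag.items():
        for x in ps:
            for z in ps:
                if x != z and not adjacent(x, z):
                    arcs.add((x, y))
                    arcs.add((z, y))
    links = {frozenset((p, c)) for c, ps in dag.items() for p in ps
             if (p, c) not in arcs}
    return links, arcs

# A -> C <- B plus C -> D: only the v-structure arcs stay directed.
dag = {"A": set(), "B": set(), "C": {"A", "B"}, "D": {"C"}}
links, arcs = pattern(dag)
print(sorted(arcs))  # [('A', 'C'), ('B', 'C')]
print(links)         # the C-D edge stays undirected
```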
Given any joint distribution P over V it is possible to construct a BN B such that P = P_B (Pearl 1988). A distribution P, for which there is a BN B ≜ (G, Φ) such that P_B = P and also

X ⊥_P Y | Z ⇒ X ⊥_G Y | Z (2)

holds, is called DAG faithful, and B (and sometimes G alone) is called a perfect map. DAG faithful distributions are important since, if a data generating process is known to be DAG faithful, then a perfect map can, in principle, be inferred from the data, under the assumption that the data is representative of the distribution. Pearl (1988) proves that the following important properties hold for any DAG faithful probability distribution P:

A ⊥_P B ∪ D | C ⇒ A ⊥_P B | C and A ⊥_P D | C (3)

A ⊥_P B ∪ D | C ⇒ A ⊥_P B | C ∪ D. (4)

For any probability distribution P over variables V and variable X ∈ V, we define a Markov boundary of X to be a set S ⊆ V \ {X} such that X ⊥_P V \ (S ∪ {X}) | S and this holds for no proper subset of S. It is easy to see that if P is DAG faithful, the Markov boundary of X is uniquely defined, and consists of X's parents, children, and children's parents in a perfect map of P. In the case of P being DAG faithful, we denote the Markov boundary by MB_P(X). For any DAG G, we denote the set of X's parents, children, and children's parents in G by MB_G(X).

3 An Adaptive Approach

We now present our method for structural adaptation to a piecewise stationary data sequence. We first describe the problem formally, and then the proposed method is presented.

3.1 The Adaptation Problem

We say that a sequence is samples from a piecewise DAG faithful distribution if the sequence can be partitioned into sets such that each set is a database sampled from a single DAG faithful distribution. The rank of the sequence is the size of the smallest set in such a partition. Formally, let d = (d_1, ..., d_l) be a sequence of observations over variables V.
We say that the sequence is sampled from a piecewise DAG faithful distribution (or simply that it is a piecewise DAG faithful sequence) if there are indices 1 = i_1 < · · · < i_m = l such that each of d^(j) = (d_{i_j}, ..., d_{i_{j+1}}), for 1 ≤ j ≤ m − 1, is a sequence
of samples from a DAG faithful distribution. The rank of the sequence is defined to be min_j (i_{j+1} − i_j), and we say that m − 1 is its size and l its length. Obviously, any sequence of observations is indistinguishable from a piecewise DAG faithful sequence if the partitions are chosen small enough, so we restrict our attention to sequences that are piecewise DAG faithful of at least rank r. However, we assume that neither the actual rank nor the size of the sequence is known, and specifically we do not assume that the indices i_1, ..., i_m are known. We do assume that each sample is complete, though, so that no observations in the sequence have missing values.

The learning task that we address in this paper consists of incrementally learning a BN while receiving a piecewise DAG faithful sequence of samples, making sure that after reception of each sample point the BN structure is as close as possible to the distribution that generated this point. Formally, let d be a complete piecewise DAG faithful sample sequence of length l, and let P_t be the distribution generating sample point t. Furthermore, let B_1, ..., B_l be the BNs output (or maintained) by a structural adaptation method M when receiving d and starting with BN B. Given a distance measure d(·, ·) on BNs, we define the precision of M on d wrt. d(·, ·) as

prec(M, B, d) ≜ (1/l) Σ_{i=1}^{l} d(B_i, B_i*), (5)

where B_i* is a BN representation of P_i. For a method M to adapt to a DAG faithful sample sequence d wrt. d(·, ·) then means that M has a higher precision than the trivial learner (which never learns) on d wrt. d(·, ·).

3.2 Our Approach

The method that we propose here builds on the idea of automating the human practice of revising models only when needed, and then revising only the parts of the models that are in need of a revision. The method consists of two main mechanisms: one, monitoring the current BN while receiving evidence and detecting when the model should be changed, and two, adapting the parts of the model that conflict with the evidence.
These two mechanisms roughly correspond to Algorithms 1 and 2. In more detail, our method maintains histories of conflict measures for the nodes in the current BN, along with recent observations, and then initiates a local search over a set of nodes if the conflict measure histories of these nodes indicate that the underlying distribution has changed. A local
search means that we assume that only some aspects of the perfect maps of the two underlying generating distributions differ. The general form of the method is given in Algorithm 1. We have designed it to use a number of helper functions, whose exact implementation is open for experimentation as long as they fulfill a number of requirements. The responsibilities of ConflictMeasure(·) and ChangeInStream(·) will be described shortly. The actual technical implementation choices we have made for these methods are presented in relation to our experimental results in Section 5. The specification of UpdateNet(·) is given in Algorithm 2 and is elaborated upon later. The role of the rest of the methods should be self-explanatory, and their implementation is trivial.

Algorithm 1 Algorithm for BN adaptation. Takes as input an initial network B, defined over variables V, a series of cases h, and an allowed delay k for determining shifts in h.

 1: procedure Adapt(B, V, h, k)
 2:   h′ ← []
 3:   c_X ← [] for all X ∈ V
 4:   loop
 5:     x ← NextCase(h)
 6:     Append(h′, x)
 7:     S ← ∅
 8:     for X ∈ V do
 9:       c ← ConflictMeasure(B, X, x)
10:       Append(c_X, c)
11:       if ChangeInStream(c_X, k) then
12:         S ← S ∪ {X}
13:       end if
14:     end for
15:     h′ ← LastKEntries(h′, k)
16:     if S ≠ ∅ then
17:       UpdateNet(B, S, h′)
18:     end if
19:   end loop
20: end procedure

The responsibilities of the two helper methods are as follows: ConflictMeasure(B, X_i, x) returns a real value reflecting how inconsistent the observation x is with the way X_i connects to the rest of the model B. ChangeInStream(c_X, k) is a boolean valued function that, given a sequence of conflict measure values for X, tells whether there appears to be a jump between the last k values and the values before that;
such a jump would be an indication of an abrupt change in the underlying generating distribution.

Algorithm 2 Update algorithm for a BN. Takes as input the network to be updated B, a set of variables whose structural bindings may be wrong S, and data to learn from h.

 1: procedure UpdateNet(B, S, h)
 2:   for X ∈ S do
 3:     M_X ← UncoverMarkovBoundary(X, h)
 4:     G_X ← UncoverAdjacencies(X, M_X, h)
 5:     PartiallyDirect(X, G_X, M_X, h)
 6:   end for
 7:   G′ ← MergeFragments({G_X}_{X ∈ S})
 8:   G ← GraphOf(B)
 9:   (G′′, C) ← MergeAndDirect(G′, G, S)
10:   Φ′ ← ∅
11:   for X ∈ V do
12:     if X ∈ C then
13:       Φ′ ← Φ′ ∪ {P_h(X | PA_{G′′}(X))}
14:     else
15:       Φ′ ← Φ′ ∪ {P_B(X | PA_{G′′}(X))}
16:     end if
17:   end for
18:   B ← (G′′, Φ′)
19: end procedure

UpdateNet(B, S, h) in Algorithm 2 adapts the BN B around the nodes in S to fit the empirical distribution defined by the data in h, which will be the last k received cases at the time of invocation. Throughout the text, this empirical distribution will also be denoted P_h; which meaning h takes on should be clear from the context. The update of B to fit P_h is based on the assumption that only the nodes in S need updating of their probabilistic bindings (i.e. their families in a perfect map B_h of P_h). The method is listed in Algorithm 2. As Adapt(·) in Algorithm 1, it employs a number of helper methods that we only specify semantically. The concrete implementations of these methods that we have used in our experiments are presented in Section 5.

UpdateNet(·) first runs through the nodes in S and uncovers for each node X the subgraph¹ of B_h consisting of X and the links and arcs connecting X to its adjacent nodes. (¹ Subgraph is used in a loose sense here, as the graph fragments produced do not contain links and arcs between X's adjacent variables.) The construction of this subgraph is divided into three steps: UncoverMarkovBoundary(X, h), which should compute MB_{P_h}(X), or
at least a superset thereof; UncoverAdjacencies(X, M_X, h), which first computes the nodes in M_X that, according to P_h, are inseparable from X given subsets of M_X, and then constructs a simple graph structure where X is connected to each of these nodes by a link; and, finally, PartiallyDirect(X, G_X, M_X, h), which directs some of the links in G_X. This method is presented in Algorithm 3 and elaborated upon later in this section.

After constructing network fragments for all nodes in S, Algorithm 2 merges these into a single (not necessarily connected) graph G′ in MergeFragments(·). This happens as a simple graph union.² (² Due to the inner workings of PartiallyDirect(·), no conflicts of orientation among the fragments can occur.) The result may still contain undirected links. G′ is merged with a partially directed graph, based on the subgraph of B's graph G that corresponds to nodes not in S, into a new graph G′′ by MergeAndDirect(·), which is described at the end of this section. MergeAndDirect(·) returns both the joined graph G′′ and a set C containing all nodes that have received a new parent set (ideally a subset of S). Algorithm 2 ends by constructing new CPTs for the newly constructed BN structure G′′. Nodes not in C are assigned the same CPT as in B, and nodes in C have new CPTs calculated from h.

Algorithm 3 Uncovers the direction of some arcs adjacent to a variable X in B_h. NE_{G_X}(X) denotes the nodes connected to X by a link in G_X.

 1: procedure PartiallyDirect(X, G_X, M_X, h)
 2:   DirectAsInOtherFragments(G_X)
 3:   for Y ∈ M_X \ (NE_{G_X}(X) ∪ PA_{G_X}(X)) do
 4:     for T ⊆ M_X \ {X, Y} do
 5:       if I_h(X, Y | T) then
 6:         for Z ∈ NE_{G_X}(X) \ T do
 7:           if ¬I_h(X, Y | T ∪ {Z}) then
 8:             LinkToArc(G_X, X, Z)
 9:           end if
10:         end for
11:       end if
12:     end for
13:   end for
14: end procedure

The method PartiallyDirect(X, G_X, M_X, h) directs a number of links in the graph fragment G_X in accordance with their direction in B_h. With DirectAsInOtherFragments(G_X), the procedure first exchanges links for arcs where previously directed graph fragments dictate this. The procedure then runs through each variable Y in M_X not adjacent to X, finds a set of nodes in M_X that separates X from Y, and then tries to re-establish the connection to Y by repeatedly expanding the set of separating nodes by a single node adjacent to X. If such a node can be found, it has to be a child of X (see Corollary 16). No arc in B_h originating from X is left as a link by the procedure (see Lemma 13). During these steps the procedure uses an independence oracle I_h derived from h. The independence oracle should return true for a call I_h(X, Y | T) iff X ⊥_{P_h} Y | T.

Algorithm 4 Merges graphs G′ and G under the assumption that connections to nodes in S have just been relearned.

 1: procedure MergeAndDirect(G′, G, S)
 2:   E ← EdgesOf(G′)
 3:   for X ∉ S do
 4:     for Y ∈ AD_G(X) \ S do
 5:       E ← E ∪ {(X, Y), (Y, X)}
 6:     end for
 7:   end for
 8:   for X ∉ S do
 9:     for Y ∈ AD_G(X) \ S do
10:       if ∃Z ∈ MB_G(X) \ AD_G(X) : Y is a descendant of Z then
11:         E ← E \ {(Y, X)}
12:       end if
13:     end for
14:   end for
15:   return Direct((V, E))
16: end procedure

MergeAndDirect(G′, G, S) in Algorithm 4 merges the graphs G′ and G by adding links between nodes in V \ S in G′ if these are adjacent in G. Then it directs some of these links: a link X − Y is directed as X → Y if X → Y is an arc in G and Y is a descendant of some node Z in MB_G(X) \ AD_G(X) through a path involving only children of X. That this particular rule is sensible is proved in Section 4. Links and arcs between nodes in S and those not in S should in theory be identical, but owing to flawed statistical tests they may differ in practice; they are then set either as defined in G or as defined in G′, depending on how aggressive the adaptation should be. Initial experiments seemed to indicate that the latter option yields a small performance increase, and this is the setting that we have used in the final experiments, and the setting that Algorithm 4 has been written to conform to.
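The monitoring helpers ConflictMeasure(·) and ChangeInStream(·) used by Algorithm 1 are only specified semantically above, with the concrete choices deferred to Section 5. One hypothetical instantiation, sketched here purely for illustration, uses the surprise −log P(x | pa(x)) of the observed case as the conflict measure and a simple mean-shift test over the last k values as the change detector; the function names, CPT encoding, and threshold are assumptions, not the paper's choices:

```python
import math

def conflict_measure(cpt, parents, x_val, case):
    """Surprise -log P(x | pa(x)) of the case under X's current family.
    High values suggest the case conflicts with how X connects to B."""
    pa_vals = tuple(case[p] for p in parents)
    return -math.log(max(cpt[(pa_vals, x_val)], 1e-12))

def change_in_stream(conflicts, k, threshold=1.0):
    """Report a shift when the mean conflict over the last k cases
    exceeds the mean over the earlier cases by more than `threshold`."""
    if len(conflicts) < 2 * k:
        return False
    recent, older = conflicts[-k:], conflicts[:-k]
    return (sum(recent) / k) - (sum(older) / len(older)) > threshold

stream = [0.2, 0.3, 0.25, 0.2, 2.1, 2.4, 2.2]  # jump after the 4th case
print(change_in_stream(stream, k=3))  # True
```

Any detector with the semantics described for ChangeInStream(·) could be substituted; the delay parameter k plays the same role as in Algorithm 1.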
After the merge, any remaining links in the joined graph are directed by Direct(·) by application of the following rules:

1. if X − Y is a link, Z → X is an arc, and Z and Y are non-adjacent, then direct the link X − Y as X → Y;

2. if X − Y is a link and there is a directed path from X to Y, then direct the link X − Y as X → Y; and

3. if Rules 1 and 2 cannot be applied, choose a link at random and direct it randomly.

Due to potentially flawed statistical tests, the resultant graph may contain cycles involving at least some nodes in S. These are eliminated by reversing arcs between nodes in S only. The reversal process resembles the one used in (Margaritis and Thrun 2000): we remove all arcs between nodes in S that appear in at least one cycle, order the removed arcs according to how many cycles they appear in, and then insert them back into the graph, starting with the arcs that appear in the fewest cycles. When, at some point, the insertion of an arc gives rise to a cycle, we insert the arc as its reverse.

4 Formal Correctness

We have obtained a proof ensuring that, under a set of assumptions, our adaptation method is correct, in the sense that when the underlying distribution generating the sequence of data changes, the algorithm reacts and changes the current BN to accurately reflect the perfect map of the new underlying distribution. Moreover, the result makes it clear what it means for these changes to be local, probabilistically speaking. First, we need a few definitions, on which the assumptions are based:

Definition 1. Let P be a DAG faithful probability distribution over variables V, X in V, and Y in MB_P(X). We say that a set T is a maximal separating set of Y from X wrt. P if T ⊆ MB_P(X) \ {Y}, Y ⊥_P X | T, and this holds for no T′ with T ⊂ T′ ⊆ MB_P(X) \ {Y}.
The set of all variables in V for which a maximal separating set from X wrt. P exists, we denote by S_P(X) (the intuition being "separable from"). For each Y in S_P(X), we denote by T_P(X, Y) the intersection of all maximal separating sets of Y from X wrt. P, and by E_P(X, Y) the set MB_P(X) \ (T_P(X, Y) ∪ {Y}) ("dependency enabling"). We expect, but have been unable to prove, that there is only one maximal separating set of Y from X wrt. P for any DAG faithful probability distribution P.

Definition 2. We say that a DAG faithful probability distribution P_1 over variables V is similar to a DAG faithful probability distribution P_2 over variables V on the variables S ⊆ V if we have that

1. MB_{P_1}(X) = MB_{P_2}(X), for all X ∈ S,

2. S_{P_1}(X) = S_{P_2}(X), for all X ∈ S, and

3. E_{P_1}(X, Y) \ S_{P_1}(X) = E_{P_2}(X, Y) \ S_{P_2}(X), for all X ∈ S and Y ∈ S_{P_1}(X).

Here Bullet 3 states that the inseparable variables enabling dependencies between X and Y must be the same in both P_1 and P_2. A few points that should be noted are that similarity on a set of variables S is an equivalence relation, and that two distributions similar on a set of variables S are also similar on any subset of S. Our main formal result is:

Theorem 3. Let the BN B_1 ≜ (G_1, Φ_1) and the DAG faithful probability distributions P_1 and P_2 each be defined over variables V, and let h be a piecewise DAG faithful sample sequence of size 2 and rank r, where P_1 is the distribution generating the first partition and P_2 the distribution generating the last. Furthermore, let P_1 be similar to P_2 on S, and let B_2 ≜ (G_2, Φ_2) be the result of running Algorithm 1 on B_1 with cases h and some choice of k. If

1. G_1 is equivalent to a perfect map of P_1,

2. r > k,³

3.
ChangeInStream(c_X, k) is true for a variable X iff X ∉ S and the algorithm is currently processing the k-th sample drawn from P_2,

³ Note that Bullet 6 incorporates the traditional assumption that independencies are identifiable from sufficient data, and that Bullet 2 is only concerned with ensuring that ChangeInStream(·) always has at least k cases from a new partition of h available to detect the change.
4. UncoverMarkovBoundary(X, h) returns MB_{P_h}(X),

5. the G_X returned from UncoverAdjacencies(X, M_X, h) connects X to Y iff Y is in MB_{P_h}(X) \ S_{P_h}(X), and

6. the oracle used in PartiallyDirect(·) is correct,

then G_2 is equivalent to a perfect map of P_2.

The theorem is important as it guarantees that if our method is started with a BN that is equivalent to the perfect map of some DAG faithful distribution P, then no matter how often P changes, as long as it remains DAG faithful and at least 2k observations are made between each change, our method continues to adapt B to fit the distribution, with a delay of k observations. In what follows, we gradually build up a theory to support Theorem 3, and then give the proof for it. The main difficulty is that our method directs arcs between a node and its adjacent nodes without knowing how these adjacent nodes interconnect. Previous works on constraint based DAG discovery have been based on two-phase algorithms that first uncover the skeleton of the DAG and then direct the links into arcs. That has not been possible in our case, where local parts of the graph have to be learned in isolation from the remaining graph. First, we need some small results that we have been unable to find proved elsewhere:

Lemma 4. Let P be a DAG faithful probability distribution over V, and let X ⊥̸_P Y | S for X, Y ∈ V and all S ⊆ MB_P(X) \ {Y}. Then Y ⊥̸_P X | T for all T ⊆ V \ {X, Y}.

Proof. First note that if X ⊥̸_P Y | S for all S ⊆ MB_P(X) \ {Y}, then it follows that Y must be in MB_P(X). As P is DAG faithful, there is some DAG G which is a perfect map of P. If there is an arc between X and Y in G, then there is no set T ⊆ V \ {X, Y} such that X ⊥_G Y | T, and hence, since G is a perfect map of P, no T such that X ⊥_P Y | T. Therefore, we just need to prove that X and Y must be adjacent if X ⊥̸_P Y | S for all S ⊆ MB_P(X) \ {Y} = MB_G(X) \ {Y}: as Y is a member of MB_G(X), it must either be adjacent to X or share a child with X. In the first case the result follows, so we assume that X and Y share the set of children Z.
It follows that Y cannot be a descendant of any variable in Z, and hence, by the definition of a BN, that X ⊥_G Y | S for S ≜ MB_G(X) \ (Z ∪ {Y}). Thus there is a set S ⊆ MB_G(X) \ {Y} where
X ⊥_G Y | S, and, since G is a perfect map of P, also X ⊥_P Y | S. This violates the assumptions, so we conclude that X and Y must be adjacent.

Corollary 5. Let P be a DAG faithful probability distribution over V and X, Y ∈ V. If there is a set T ⊆ V \ {X, Y} such that Y ⊥_P X | T, then there is a set S ⊆ MB_P(X) \ {Y} such that Y ⊥_P X | S.

Since perfect maps of a distribution are equivalent, and equivalent DAGs have the same skeleton, it follows from Corollary 5 that:

Corollary 6. Let P be a DAG faithful probability distribution over V and X, Y ∈ V. Then Y ∉ S_P(X) iff Y is adjacent to X in all perfect maps of P, and Y ∈ S_P(X) iff Y is not adjacent to X in any perfect map of P.

Given the above result, it is thus sufficient for a learning algorithm to be able to identify S_P(X) when learning the skeleton of G. Clearly, this can be seen as a localized operation. Now we deduce a set of lemmas that are needed for local reasoning about the direction of arcs.

Lemma 7. Let P be a DAG faithful probability distribution over variables V, X, Y, Z ∈ V, T ⊆ V \ {X, Y, Z}, and let G be a perfect map of P. If X ⊥_P Y | T and X ⊥̸_P Y | T ∪ {Z}, then in G there is a node C, with non-adjacent parents A and B, such that Z is C or a descendant of C on a path involving no nodes in T, A is connected to X by a path that is active given T, and B is connected to Y by a path that is active given T. Trivially, we also have X ⊥̸_G Z | T and Y ⊥̸_G Z | T.

Proof. Since G is a perfect map of P, it is clear that X ⊥_G Y | T and X ⊥̸_G Y | T ∪ {Z}. From the definition of d-separation, it follows that there is a node C such that either Z is C or Z is a descendant of C along a path c involving no nodes in T, and C has two parents A and B, where A (B) is either X (Y) or connected to X (Y) by a path a (b) that is active given T ∪ {Z}. We need to show that for at least one such C,

1. both a and b are active given T alone, and

2. A and B cannot be adjacent.

We first show that there is a C fulfilling 1, and then that at least one of these Cs fulfills 2.
Take C such that a is active given T ∪ {Z} but not given T. This means that there is a converging connection at some node V on a such that Z is V or
a descendant of V. WLOG let V be such a node closest to X on a, let a′ be the path from X to V, and let v be the path from V to Z. Moreover, let U be the node furthest away from Z on v such that U is also on c, and let u be the part of c (or v) from U to Z, v′ the part of v from V to U, and c′ the part of c from C to U. Then U satisfies the definition of C, since the paths a′ ∪ v′ and u connect X to U and U to Z, the path c′ extended with the arc B → C connects U to B, and the connection at U between a′ ∪ v′ and c′ is converging. However, unlike a, the path a′ ∪ v′ is active given T only. If b is also only active given T ∪ {Z}, we can perform the same transformation on that path, and thus we get that 1 is satisfiable.

Next, we show that among the Cs that satisfy 1, we can find one that also satisfies 2. Particularly, let C satisfy 1, and furthermore let C be chosen such that c has maximal length among all valid choices of C. Then A and B must be non-adjacent. Suppose this is not the case, and WLOG let the arc between A and B be oriented A → B. This means that B cannot be Y, as that would create a path from X to Y active given T only, contradicting X ⊥_G Y | T. So consider the arc between B and the next node V′ on b. This arc cannot be directed into B, as that would mean that B is a valid choice for C, yet with the length of the path from B to Z through C being longer than that of c, which contradicts the choice of C. But if the arc between B and V′ is directed away from B, then a, the arc A → B, and b together form an active path given T, which again contradicts X ⊥_G Y | T. Thus A and B are non-adjacent.

Lemma 8. Let P be a DAG faithful probability distribution over V, X, Z ∈ V, Y ∈ S_P(X), T ⊆ V \ {X, Y, Z}, and Z ∈ MB_P(X) \ (T ∪ S_P(X) ∪ {Y}). If X ⊥_P Y | T and X ⊥̸_P Y | T ∪ {Z}, then for any DAG G which is a perfect map of P, Z is in CH_G(X).

Proof. As Z is not in S_P(X), Corollary 6 yields that Z and X must be adjacent in every DAG that is a perfect map of P. All that is left to be shown is that Z cannot be a parent of X. Assume this is false, and let G be a perfect map of P in which Z is a parent of X.
As X ⊥_P Y | T and G is a perfect map of P, it follows that

X ⊥_G Y | T. (6)

Specifically, (6) in conjunction with the assumption that X is a child of Z implies that there can be no path between Z and Y which is active given T, i.e.

Z ⊥_G Y | T. (7)
But from X ⊥_P Y | T and X ⊥̸_P Y | T ∪ {Z}, Lemma 7 allows us to conclude that Z ⊥̸_G Y | T, which contradicts (7).

Lemma 9. Let P be a DAG faithful probability distribution over V, X ∈ V, Y ∈ S_P(X), T ⊆ MB_P(X) \ {Y}, and Z ∈ MB_P(X) \ (T ∪ S_P(X) ∪ {Y}). If X ⊥_P Y | T and X ⊥̸_P Y | T ∪ {Z}, then there is W ∈ S_P(X) such that Z ∈ E_P(X, W).

Proof. Let G be a perfect map of P. We show that there is a W ∈ S_P(X) such that

X ⊥_G W | T′ (8)

and

X ⊥̸_G W | T′ ∪ T′′ ∪ {Z}, (9)

for some set T′ ⊆ MB_G(X) \ {W, Z} and any T′′ ⊆ MB_G(X) \ (T′ ∪ {W, Z}). The result then follows by G being a perfect map of P. From the assumptions, it is clear that there is at least one W ∈ S_P(X) (namely Y) such that (8) holds and

X ⊥̸_G W | T′ ∪ {Z}, (10)

if we choose T′ as T. Since G is a perfect map of P, we have that (8) implies

X ⊥_P W | T′, (11)

and (10) implies X ⊥̸_P W | T′ ∪ {Z}. From Lemma 8 it follows that Z is a child of X in G. Moreover, from Lemma 7 we get that there must be at least one active path between W and Z. WLOG let a be the shortest such path connecting Z to such a W. We prove that the existence of a implies that (9) is fulfilled. First note that the last arc in a (connecting to Z) must be directed into Z, as otherwise there would be an active path from X to W given T′, contradicting (8). Also note that if the length of a is 1, then we have the situation where Z is a child of both W and X, and hence a trivially ensures that (9) is fulfilled. Assume therefore that a has length 2 or more. We prove that the existence of a implies (9) by contradiction. That is, we assume that there is a set T′′ ⊆ MB_G(X) \ (T′ ∪ {W, Z}) such that X ⊥_G W | T′ ∪ T′′ ∪ {Z}, and we prove that this means that a cannot be the shortest path matching its definition. As a is active given T′ and Z is a child of X, it follows that the nodes in T′′ must render a inactive, which can only happen if some of the nodes in T′′ lie on a. Let U be the node in T′′ closest to W on a such that the connection at U is either serial or diverging (meaning that U renders a inactive). There are three possible scenarios:
1. a: (W · · · → U → · · · Z)

2. a: (W · · · ← U → · · · Z)

3. a: (W · · · ← U ← · · · Z)

Let b be the part of a running between W and U, and c the part between U and Z. We show that U can serve the same role as W, and hence that a cannot be the shortest active path connecting Z to a suitable W.

Assume first that either Scenario 1 or 2 is the case: if X ⊥̸_G U | T′, then there is an active path d between X and U given T′, and hence also an active path from W to X given T′, formed by joining b and d. But that means that X ⊥̸_G W | T′, which contradicts (8). Thus we must have that X ⊥_G U | T′. Furthermore, since U is on a, it follows that X ⊥̸_G U | T′ ∪ {Z}. But then U satisfies (8) and (10) by virtue of the active path c, and hence a is not the shortest such path. Thus neither Scenario 1 nor 2 can be the case.

Assume then that Scenario 3 is the case: first we note that it cannot be the case that X ⊥_G U | T′, since b is active given T′ and this would imply that (8) and (10) hold for U, thus violating the definition of a as the shortest path. Hence X ⊥̸_G U | T′, meaning that there is an active path d from X to U given T′. Since b is also active given T′ and (8) is assumed, it follows that the arc attached to U in d must be an arc going into U. But since that results in a converging connection at U joining b and d, and X ⊥_G W | T′, we have that no descendant of U can be in T′. Specifically, we have that c cannot contain any nodes of T′, and hence that it must be a directed path from U to Z.

First assume that c contains some nodes that are adjacent to X. None of these can be parents of X, as the arc from the parent node to X, along with the active path a, would mean that X ⊥̸_G W | T′, contradicting (8). Thus nodes on c in MB_G(X) are either children of X or non-adjacent to X.
U itself must be non-adjacent to X, as U cannot be a child of X: that would imply that X ⊥̸_G U | R for any set R, and, since b is active given T′ ∪ T′′ ∪ {Z} and there is a converging connection at U on the path formed by b and the arc from X to U, that X ⊥̸_G W | T′ ∪ T′′ ∪ {Z}, violating the assumption that X ⊥_G W | T′ ∪ T′′ ∪ {Z}. Since U is not adjacent to X, we thus have that there is some node on c that is non-adjacent to X and closer to Z on c than any other such node. Let V be this node. Since Z is a child of X and a descendant of V, it is easy to see that V satisfies both (8) and (10), and thus that a cannot be the shortest path connecting Z to a satisfying node. The lemma is thus proved.
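The d-separation arguments in these proofs are purely mechanical and can be checked by computer. Below is a minimal sketch (our own illustration, not the authors' implementation; the parent-set representation and function names are assumptions) of the standard moralization-based test: X and Y are d-separated by Z iff they are disconnected in the moral graph of the ancestral subgraph of {X, Y} ∪ Z after removing Z.

```python
from itertools import combinations

def ancestors(dag, nodes):
    """All ancestors of `nodes`, including the nodes themselves.
    `dag` maps each node to the set of its parents."""
    result, stack = set(), list(nodes)
    while stack:
        n = stack.pop()
        if n not in result:
            result.add(n)
            stack.extend(dag[n])
    return result

def d_separated(dag, x, y, z):
    """True iff x and y are d-separated given the set z in `dag`."""
    keep = ancestors(dag, {x, y} | set(z))
    # Moralize the ancestral subgraph: connect each node to its parents,
    # and marry co-parents of a common child.
    adj = {n: set() for n in keep}
    for n in keep:
        for p in dag[n] & keep:
            adj[n].add(p)
            adj[p].add(n)
        for p, q in combinations(dag[n] & keep, 2):
            adj[p].add(q)
            adj[q].add(p)
    # Remove the conditioning set and test reachability from x to y.
    blocked = set(z)
    stack, seen = [x], {x}
    while stack:
        n = stack.pop()
        if n == y:
            return False
        for m in adj[n]:
            if m not in seen and m not in blocked:
                seen.add(m)
                stack.append(m)
    return True
```

For the v-structure A → C ← B, for instance, A and B are d-separated by the empty set but not by {C}, matching the activation behaviour of converging connections used in the proofs.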
These basic but important lemmas aside, we now establish a connection between the purely probabilistic notion of similarity and the DAGs that ensure it. Since two equivalent DAGs G1 and G2 encode the same conditional independence statements, and these are reflected in any probability distribution having one of G1 and G2 as a perfect map, the following corollary is immediate.

Corollary 10. Let G1 and G2 be perfect maps of probability distributions P1 and P2, both defined over variables V. If G1 and G2 are equivalent, then P1 and P2 are similar on V.

This connection spurs the introduction of a new concept:

Definition 11 (Locally Compelled Arc). Let G1 be a perfect map of a probability distribution P1 over V, and let X and Y be in V such that X is a parent of Y in G1. We call the arc from X to Y locally compelled in G1 if, for any probability distribution P2 similar to P1 on {X} and any perfect map G2 of P2, we have that X is a parent of Y in G2. We denote by CH_G(X) the variables that are attached to X in G by locally compelled arcs originating from X.

It turns out that locally compelled arcs are important for learning network structures locally: the locally compelled arcs of a graph encompass all arcs of the pattern of that graph, and they enjoy the property of being compelled too, as the following results show.

Lemma 12. Let P be a DAG faithful probability distribution over V, X ∈ V, Y ∈ S_X^P, and T a maximal separating set of Y from X wrt. P. Then for any DAG G which is a perfect map of P, each member Z of MB_P(X) \ (T ∪ {Y} ∪ S_X^P) must be a compelled child of X. That is, Z is in CH_G(X).

Proof. We need to show that Z is a child of X in any perfect map of any probability distribution P2 similar to P on {X}. Since T is a maximal separating set, it follows that Z ∈ E_X^Y. Since P and P2 are similar on X, Y ∈ S_X^P, and Z ∈ E_X^Y \ S_X^P, we have that Y ∈ S_X^P2 and Z ∈ E_X^Y \ S_X^P2 wrt. P2.
Hence, there is a (maximal separating) set T2 ⊆ MB_P2(X) not including Z, such that X ⊥_P2 Y | T2 and X ⊥̸_P2 Y | T2 ∪ {Z}. The result then follows from Lemma 8.

Lemma 13. Let P be a DAG faithful probability distribution over V, and X, Y, Z ∈ V. Then for any DAG G which is a perfect map of P, if (X, Z, Y)
is a v-structure in G, then Y ∈ S_X^P, Z ∈ E_X^Y \ S_X^P, and X → Z and Y → Z are both locally compelled.

Proof. We only show that X → Z is locally compelled, as the case of Y → Z is similar. Since Y is not adjacent to X, Corollary 6 implies that Y ∈ S_X^P, and hence that E_X^Y is well-defined. Clearly, Z cannot be in any maximal separating set of Y from X wrt. P, as G is a perfect map of P and X ⊥̸_G Y | T ∪ {Z} for any set T. Furthermore, as Z is adjacent to X, Corollary 6 ensures that Z ∉ S_X^P. The result now follows from Lemma 12.

From Definition 11, Corollary 10, and Lemma 13, we thus have:

Corollary 14. For any DAG G and any arc participating in a v-structure, we have that this arc is locally compelled in G. Furthermore, each locally compelled arc in G is compelled in G.

At this point, we have that it is sufficient to require of an algorithm uncovering G, for some DAG faithful distribution, that it identifies the skeleton of G, identifies all locally compelled arcs, directs them, and then eliminates arcs not participating in v-structures. Moreover, any member of G's equivalence class can be constructed by identifying all locally compelled arcs and then directing the remaining arcs without introducing cycles or new v-structures. Formally, we can restate the corollary in terms that emphasize the connection between similarity and equivalence:

Corollary 15. Let G1 = (V, E1) and G2 = (V, E2) be DAGs. Then G1 is similar to G2 on V iff G1 is equivalent to G2.

Now we proceed by asserting that the arcs directed by our method are indeed locally compelled, and that no locally compelled arcs are left undirected. From Lemmas 9 and 12 we immediately get:

Corollary 16. Let P be a DAG faithful probability distribution over V, X ∈ V, Y ∈ S_X^P, T ⊆ MB_P(X) \ {Y}, and Z ∈ MB_P(X) \ (T ∪ S_X^P ∪ {Y}). If X ⊥ Y | T and X ⊥̸ Y | T ∪ {Z}, then for any DAG G which is a perfect map of P, Z is in CH_G(X).

We thus have that all arcs directed by PartiallyDirect( ) are locally compelled.
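Corollary 15 ties similarity to equivalence, and equivalence itself has a simple graphical test (Verma and Pearl): two DAGs are equivalent iff they have the same skeleton and the same v-structures. A small sketch of that test (our own illustrative code, with DAGs given as maps from each node to its parent set):

```python
def skeleton(dag):
    """Undirected edges as frozensets; `dag` maps node -> set of parents."""
    return {frozenset((child, p)) for child, ps in dag.items() for p in ps}

def v_structures(dag):
    """Triples (a, c, b) with a -> c <- b and a, b non-adjacent."""
    skel = skeleton(dag)
    out = set()
    for c, ps in dag.items():
        for a in ps:
            for b in ps:
                if a < b and frozenset((a, b)) not in skel:
                    out.add((a, c, b))
    return out

def equivalent(g1, g2):
    """Verma-Pearl criterion: same skeleton and same v-structures."""
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)
```

For example, the chains A → B → C and A ← B ← C have the same skeleton and no v-structures, and are therefore equivalent, while A → B ← C is not equivalent to either.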
That an arc X → Y participating in a v-structure (X, Y, Z) will be found by PartiallyDirect( ) follows from Lemma 17.
Lemma 17. Let P be a DAG faithful probability distribution over V, X, Y ∈ V, and G be a perfect map of P. There is a W ∈ S_X^P such that Y ∈ E_X^W \ S_X^P iff there is Z ∈ MB_G(X) ∩ S_X^P such that Y is a descendant of Z through a path encompassing only children of X.

Proof. The if part follows trivially from the d-separation properties of graphs, and we therefore only prove the only if part: Since Y ∈ E_X^W \ S_X^P, we have from Lemma 12 that Y is a child of X in G. Moreover, we know that there is a set T ⊆ MB_P(X) = MB_G(X) such that X ⊥_G W | T but X ⊥̸_G W | T ∪ {Y}, and thus by Lemma 7 that in G there is a node C with non-adjacent parents A and B, such that Y is a descendant of C on a path c, and there are active paths a and b given T between X and A and between B and W, respectively.

Let V be the first node on c following Y. As V is also a descendant of C, it follows that it is also in E_X^W. If V is furthermore in S_X^P, then we have found the node Z that we are looking for, so assume that this is not the case. Then Lemma 12 can be applied again with V taking Y's place. Iterating in this manner, we either find the Z or we get that each node on c must be a child of X. Specifically, this holds for C. But then either B is the node Z we are looking for, or there is an active path from X to W given T, consisting of the connection between X and B along with b. The latter possibility is precluded by the assumption X ⊥_G W | T.

A direct consequence of this graphical result and the d-separation properties of graphs is:

Corollary 18. Let P be a DAG faithful probability distribution over V, X, Y ∈ V, and G be a perfect map of P. There is a W ∈ S_X^P such that Y ∈ E_X^W \ S_X^P iff there is Z ∈ S_X^P ∩ MB_P(X) such that Y ∈ E_X^Z \ S_X^P.

Lemma 19. Let P be a DAG faithful probability distribution over V, X, Y ∈ V, and G be a perfect map of P. We then have that Y ∈ CH_G(X) iff there is a W ∈ S_X^P such that Y ∈ E_X^W \ S_X^P.

Proof. The if part follows immediately from Lemma 12.
We therefore only treat the only if part: Since X → Y is an arc in G, it follows from Corollary 6 that Y ∉ S_X^P, so what we need to prove is that Y ∈ E_X^W for some W ∈ V. We do so by contradiction, assuming that Y is not in any such E_X^W, and showing that any distribution P2, having the graph G2 that results from reversing X → Y as its perfect map, is similar to P on X, and thus that X → Y cannot be locally compelled.
We assume that P2 is not similar to P on X, and hence that at least one of Points 1 to 3 in Definition 2 is violated. First we note that it cannot be the case that there is a v-structure (X, Y, Z), as Y would then be in E_X^Z by Lemma 13. Hence MB_G(X) = MB_G2(X) and therefore MB_P(X) = MB_P2(X), so Point 1 is satisfied. Additionally, Corollary 6 means that the fact that no adjacencies are changed implies that Point 2 is satisfied. So it must be Point 3 that is violated, meaning that for some node W we have that E_X^W \ S_X^P differs from E_X^W \ S_X^P2, the latter being taken wrt. P2. Let Z be a node having either left or entered such an enabling set due to the arc reversal. First note that the assumption that Y is in no such set means that if Z = Y, then Y must have entered such a set; but then Lemma 12 states that X → Y is a compelled arc in G2, which contradicts the definition of G2. Thus Z ≠ Y. We prove that this is impossible too, reaching the required contradiction.

First assume that Z ∈ E_X^W but Z ∉ E_X^W wrt. P2. This means that there is a set T ⊆ MB_P(X) such that X ⊥_G W | T but X ⊥̸_G W | T ∪ {Z}, and by Lemma 7 that in G there is a node C with non-adjacent parents A and B, such that Z is a descendant of C on a path c, and there are active paths a and b given T between X and A and between B and W, respectively, but the reversal of X → Y somehow destroys this setup. Clearly, X → Y cannot be on c, since that would mean that X ⊥̸_G W | T, and X → Y cannot be on b, as that would either mean X ⊥̸_G W | T or require Y to participate in a v-structure at C, which we have established cannot be the case. Thus X → Y is on a, and WLOG we take it to be the first arc of a. If the next arc on a is an arc out of Y, then exchanging X → Y for X ← Y changes the connection at Y from serial to diverging, and a is then still active. Therefore the connection at Y must be converging. Since Y does not participate in a v-structure, X must be adjacent to the next node V on a.
The connection between X and V must be directed towards V, as otherwise there is guaranteed to be an active path from X to C without the help of X → Y. We consider the two cases for the arc between V and the node U following V on a: If this is an arc out of V, then X → V provides an active path between X and C, so we know this cannot be the case. So the arc is oriented towards V. Now V cannot itself participate in any v-structures, since by Lemma 17 that would imply that Y ∈ E_X^R for some R, which we have assumed is not the case. So U is also adjacent to X, and by repeating the previous arguments we continue to establish that X must be a parent of all nodes on a, including A. But then there is an active path between X and W, formed by X → A → C ← B together with b, without the help of X → Y, and we conclude that Z cannot have left an enabling set due to the arc reversal.
Finally, assume that Z ∉ E_X^W but Z ∈ E_X^W wrt. P2. As before, Lemma 7 guarantees that we have the structure of paths a, b, and c along with the node Z and the set T, but this time active in G2 and not in G. We know that X → Y cannot be part of b, as that would either mean X ⊥̸_G2 W | T or require X to participate in a v-structure and be part of T, defying the definition of T. Moreover, X → Y cannot be on c, as that would either imply that X ⊥̸_G2 W | T or that X ⊥̸_G2 W | T ∪ {Y}, the latter being precluded by Corollary 16 and the definition of G2 as a DAG where Y is a parent of X. So X → Y must be on a, and we assume WLOG that it is the first arc on a. Since reversing X → Y breaks the setup, we must have that the arc following it on a cannot be directed out of Y. Let V be the node connected to Y in this way. Since Y does not participate in a v-structure in G, and only the direction of the link between X and Y differs between the two graphs, X must also be adjacent to V in G. As above, we get that V, and all other nodes on a, must be children of X, ending up with the active path between X and W formed by X → A → C ← B together with b, contradicting the assumption that Z ∉ E_X^W.

Summing up, we have:

Corollary 20. Let P be a DAG faithful probability distribution over V, G a perfect map of P, and X → Y an arc in G. Then the following are equivalent statements:

- X → Y is locally compelled,
- there is a W ∈ S_X^P such that Y ∈ E_X^W \ S_X^P,
- there is a W ∈ S_X^P ∩ MB_P(X) such that Y ∈ E_X^W \ S_X^P,
- there is a W ∈ V and T ⊆ MB_P(X) such that X ⊥ W | T and X ⊥̸ W | T ∪ {Y}, and
- there is a W ∈ MB_G(X) ∩ S_X^P such that there is a directed path from W to Y in G running through only nodes in CH_G(X).

Thus there are strong correspondences between the abstract probabilistic definition of locally compelled arcs, the conditional independence oriented definition of enabling sets, and strictly graphical notions. This allows us to finally state the proof of Theorem 3:

Theorem 21.
Let BN B1 = (G1, Φ1) and DAG faithful probability distributions P1 and P2 each be defined over variables V, and let h be a DAG faithful
sample sequence of size 2r and rank r, where P1 is the distribution of the first sample and P2 is the distribution of the last sample. Furthermore, let P1 be similar to P2 on S, and let B2 = (G2, Φ2) be the result of running Algorithm 1 on B1 with cases h and some choice of k. Denote by G1* and G2* perfect maps of P1 and P2, respectively. If

1. G1 is equivalent to G1*,
2. r > k,
3. ChangeInStream(c_X, k) is true for a variable X iff X ∉ S and the algorithm is currently processing the k-th sample drawn from P2,
4. UncoverMarkovBoundary(X, h) returns MB_h(X),
5. the graph G_X returned from UncoverAdjacencies(X, M_X, h) connects X to Y iff Y ∉ S_X^h, and
6. the oracle used in PartiallyDirect( ) is correct,

then G2 is equivalent to G2*.

Proof. Given Assumptions 2 and 3, it is obvious that Adapt( ) does nothing except exactly once, when it calls UpdateNet( ) on B1, V \ S, and the first k cases h_k of the partition of h sampled from P2. According to Corollary 14, if we can prove that i) two nodes are adjacent in G2 iff they are adjacent in G2*, and ii) an arc is put in G2 by PartiallyDirect( ) or MergeAndDirect( ) iff it is a locally compelled arc in G2*, then we have that G2 is equivalent to G2*. The result then follows from the correctness of Rules 1 to 3 on page 10, which has been proved in (Spirtes, Glymour, and Scheines 2001).

i) Let X, Y ∈ V be adjacent in G2. It is either the case that both X and Y are in S or not. In the latter case, we assume WLOG that X ∉ S. Then X and Y have been set adjacent by UncoverAdjacencies( ), and by Assumption 5 this means that Y ∉ S_X^{h_k}. Since h is piecewise DAG faithful, and h_k is sampled solely from the partition generated by P2, it follows that Y ∉ S_X^P2, and by Corollary 6 that X and Y are adjacent in G2*. In the first case, X and Y are adjacent in G2 iff they are so in G1, so we have that X and Y are adjacent in G1. By Corollary 6 and Assumption 1, this means that Y ∉ S_X^P1. Since P1 and P2 are similar on S, and thus on X, it follows that Y ∉ S_X^P2, and by Corollary 6 that X and Y are adjacent in G2*.
Proving that if X and Y are non-adjacent in G2, then they are so also in G2* (a perfect map of P2), follows the same reasoning.
ii) We consider arcs emanating from a node X in G2. First, let X ∈ S. This means that the locally compelled arcs emanating from X in G2 are exactly those in G1. By Corollary 15 and Assumption 1, these are again the same as those in G1* (a perfect map of P1), and by Corollary 20 these are exactly those created by MergeAndDirect( ). Thus CH_G2(X) = CH_G2*(X), where G2* is a perfect map of P2.

Assume then that X ∉ S. This means that the arcs out of X are created by PartiallyDirect( ), and only by this method. Due to Assumptions 4 and 5, this method sets Y as a child of X iff it can find a variable Z and a set T ⊆ MB_{h_k}(X) such that I_{h_k}(X, Z | T) holds and I_{h_k}(X, Z | T ∪ {Y}) does not. By Assumption 6, this means that Y is set as a child of X iff X ⊥_{h_k} Z | T and X ⊥̸_{h_k} Z | T ∪ {Y}. Since h_k is a representative sample from P2, it follows that Y is set as a child of X iff X ⊥_P2 Z | T and X ⊥̸_P2 Z | T ∪ {Y}, and hence by Corollary 20 that Y is set as a child of X in G2 iff X → Y is a locally compelled arc in G2*. This proves the result.

5 Experiments and Results

To investigate how our method behaves in practice, we ran a series of experiments with a fully implemented version of the method. In our implementation we have used the conflict measure of Jensen, Chamberlain, Nordahl, and Jensen (1991) in the method ConflictMeasure(B, X_i, (x_1,...,x_m)):

log [ P_E(X_i = x_i) / P_B(X_i = x_i | X_j = x_j, j ≠ i) ],

where the empty BN structure E is adapted to the observations using fractional updating. For ChangeInStream(c_X, k) we calculated the second discrete cosine transform component (see e.g. (Press, Teukolsky, Vetterling, and Flannery 2002)) of the last 2k measures in c_X, and had ChangeInStream( ) return true if this statistic exceeded a fixed threshold.4 UncoverMarkovBoundary( ) was implemented as the decision tree learning method of (Frey, Fisher, Tsamardinos, Aliferis, and Statnikov 2003), and UncoverAdjacencies(X, M_X, h) corresponds to the AlgorithmCD( ) method of (Peña, Björkegren, and Tegnér 2005) restricted to the variables in M_X.
This method uses a greedy approach, iteratively growing and shrinking the estimated set of adjacent variables until no further change takes place. Both of these methods need an independence oracle, and for that we have used a χ2 test. This test was also used as the oracle in PartiallyDirect( ).

4 We are unaware of any previous work using this kind of change point detection, but in our setting it outperformed the traditional methods of log-odds ratios and t-tests.
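The change-detection step described above can be sketched as follows. This is our own illustration rather than the authors' implementation: the window handling and the threshold value are assumptions (the threshold actually used is not reproduced here), but the statistic is the second DCT-II component of the last 2k conflict measures.

```python
import math

def dct2_component(xs):
    """Second (k = 1) DCT-II coefficient of the sequence xs."""
    n = len(xs)
    return sum(x * math.cos(math.pi * (i + 0.5) / n) for i, x in enumerate(xs))

def change_in_stream(conflicts, k, threshold=1.0):
    """Flag a change when the magnitude of the second DCT component of the
    last 2k conflict measures exceeds the (illustrative) threshold."""
    window = conflicts[-2 * k:]
    return abs(dct2_component(window)) > threshold
```

A constant stream of conflict scores yields a second component near zero, while a level shift inside the window yields a large magnitude, which is what makes this coefficient usable as a change indicator.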
We constructed 100 experiments, each consisting of five randomly generated BNs over ten variables, B1,...,B5. We made sure that Bi was structurally similar to Bi-1, except for the connection between two randomly chosen nodes. All CPTs in Bi were kept the same as in Bi-1, except for the nodes with a new parent set. For these we employed four different methods for generating new distributions: A estimated the probabilities from the previous network, with some added noise to ensure that no two distributions were the same. B to D generated entirely new CPTs, with B being totally random, C being restricted to CPTs that implied strong dependencies between the parent variables and the child, and D being restricted to only those CPTs where these dependencies were extremely strong. For generation of the initial BNs we used the method of (Ide, Cozman, and Ramos 2004).

For each series of five BNs, we sampled r cases from each network and concatenated them into a piecewise DAG faithful sample sequence of rank r and length 5r, for r being 500, 1000, and … We fed our method (NN) the generated sequences, using different values of k (100, 500, and 1000), and measured the precision wrt. the KL distance on each. As mentioned, we are unaware of other work geared towards non-stationary distributions, but for baseline comparison purposes we implemented the structural adaptation methods of Friedman and Goldszmidt (1997) (FG) and Lam and Bacchus (1994) (LB), which serve as representatives of the state of the art. For the method of Friedman and Goldszmidt (1997) we tried both simulated annealing (FG-SA) and a more time-consuming hill-climbing (FG-HC) for the unspecified search step of the algorithm. As these methods have not been developed to deal with non-stationary distributions, they have to be told the delay between learning runs.
For this we used the same value k that we use as the allowed delay for our own method, as this ensures that all methods store only a maximum of k full cases at any one time. Each method was given the correct initial network to start its exploration.

Space does not permit us to present the results in full, but the precision of NN, FG-SA, and FG-HC is presented in Figures 1 and 2. NN outperformed FG-SA in 81 of the experiments, and FG-HC in 65 of the experiments. The precision of the LB method was much worse than that of any of these three. The one experiment where both FG methods outperformed the NN method substantially had r equal to … and k equal to 5000, and was thus the experiment closest to the stationarity assumption of the FG and LB learners. What is interesting is that another experiment having similar r and k values, but generated with method B rather than D, gave the exact opposite result (NN being much
Published in: Journal of Global Optimization, 5, pp. 69-9, 199. Inference of A Minimum Size Boolean Function by Using A New Efficient Branch-and-Bound Approach From Examples Evangelos Triantaphyllou Assistant
More information4.1 Notation and probability review
Directed and undirected graphical models Fall 2015 Lecture 4 October 21st Lecturer: Simon Lacoste-Julien Scribe: Jaime Roquero, JieYing Wu 4.1 Notation and probability review 4.1.1 Notations Let us recall
More informationPart I Qualitative Probabilistic Networks
Part I Qualitative Probabilistic Networks In which we study enhancements of the framework of qualitative probabilistic networks. Qualitative probabilistic networks allow for studying the reasoning behaviour
More informationConditional Independence and Factorization
Conditional Independence and Factorization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationLearning the Structure of Linear Latent Variable Models
Journal (2005) - Submitted /; Published / Learning the Structure of Linear Latent Variable Models Ricardo Silva Center for Automated Learning and Discovery School of Computer Science rbas@cs.cmu.edu Richard
More informationStatistical Approaches to Learning and Discovery
Statistical Approaches to Learning and Discovery Graphical Models Zoubin Ghahramani & Teddy Seidenfeld zoubin@cs.cmu.edu & teddy@stat.cmu.edu CALD / CS / Statistics / Philosophy Carnegie Mellon University
More information1. what conditional independencies are implied by the graph. 2. whether these independecies correspond to the probability distribution
NETWORK ANALYSIS Lourens Waldorp PROBABILITY AND GRAPHS The objective is to obtain a correspondence between the intuitive pictures (graphs) of variables of interest and the probability distributions of
More informationMonotone DAG Faithfulness: A Bad Assumption
Monotone DAG Faithfulness: A Bad Assumption David Maxwell Chickering Christopher Meek Technical Report MSR-TR-2003-16 Microsoft Research Microsoft Corporation One Microsoft Way Redmond, WA 98052 Abstract
More informationOut-colourings of Digraphs
Out-colourings of Digraphs N. Alon J. Bang-Jensen S. Bessy July 13, 2017 Abstract We study vertex colourings of digraphs so that no out-neighbourhood is monochromatic and call such a colouring an out-colouring.
More informationDecision-Theoretic Specification of Credal Networks: A Unified Language for Uncertain Modeling with Sets of Bayesian Networks
Decision-Theoretic Specification of Credal Networks: A Unified Language for Uncertain Modeling with Sets of Bayesian Networks Alessandro Antonucci Marco Zaffalon IDSIA, Istituto Dalle Molle di Studi sull
More informationLearning Semi-Markovian Causal Models using Experiments
Learning Semi-Markovian Causal Models using Experiments Stijn Meganck 1, Sam Maes 2, Philippe Leray 2 and Bernard Manderick 1 1 CoMo Vrije Universiteit Brussel Brussels, Belgium 2 LITIS INSA Rouen St.
More information4-coloring P 6 -free graphs with no induced 5-cycles
4-coloring P 6 -free graphs with no induced 5-cycles Maria Chudnovsky Department of Mathematics, Princeton University 68 Washington Rd, Princeton NJ 08544, USA mchudnov@math.princeton.edu Peter Maceli,
More informationAnalysis of Algorithms. Outline 1 Introduction Basic Definitions Ordered Trees. Fibonacci Heaps. Andres Mendez-Vazquez. October 29, Notes.
Analysis of Algorithms Fibonacci Heaps Andres Mendez-Vazquez October 29, 2015 1 / 119 Outline 1 Introduction Basic Definitions Ordered Trees 2 Binomial Trees Example 3 Fibonacci Heap Operations Fibonacci
More informationTowards an extension of the PC algorithm to local context-specific independencies detection
Towards an extension of the PC algorithm to local context-specific independencies detection Feb-09-2016 Outline Background: Bayesian Networks The PC algorithm Context-specific independence: from DAGs to
More informationGraph coloring, perfect graphs
Lecture 5 (05.04.2013) Graph coloring, perfect graphs Scribe: Tomasz Kociumaka Lecturer: Marcin Pilipczuk 1 Introduction to graph coloring Definition 1. Let G be a simple undirected graph and k a positive
More informationBefore we show how languages can be proven not regular, first, how would we show a language is regular?
CS35 Proving Languages not to be Regular Before we show how languages can be proven not regular, first, how would we show a language is regular? Although regular languages and automata are quite powerful
More informationPart I. C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS
Part I C. M. Bishop PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 8: GRAPHICAL MODELS Probabilistic Graphical Models Graphical representation of a probabilistic model Each variable corresponds to a
More informationModel Complexity of Pseudo-independent Models
Model Complexity of Pseudo-independent Models Jae-Hyuck Lee and Yang Xiang Department of Computing and Information Science University of Guelph, Guelph, Canada {jaehyuck, yxiang}@cis.uoguelph,ca Abstract
More informationUndirected Graphical Models: Markov Random Fields
Undirected Graphical Models: Markov Random Fields 40-956 Advanced Topics in AI: Probabilistic Graphical Models Sharif University of Technology Soleymani Spring 2015 Markov Random Field Structure: undirected
More informationProving languages to be nonregular
Proving languages to be nonregular We already know that there exist languages A Σ that are nonregular, for any choice of an alphabet Σ. This is because there are uncountably many languages in total and
More informationDirected Graphical Models
CS 2750: Machine Learning Directed Graphical Models Prof. Adriana Kovashka University of Pittsburgh March 28, 2017 Graphical Models If no assumption of independence is made, must estimate an exponential
More information12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria
12. LOCAL SEARCH gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria Lecture slides by Kevin Wayne Copyright 2005 Pearson-Addison Wesley h ttp://www.cs.princeton.edu/~wayne/kleinberg-tardos
More informationStrongly chordal and chordal bipartite graphs are sandwich monotone
Strongly chordal and chordal bipartite graphs are sandwich monotone Pinar Heggernes Federico Mancini Charis Papadopoulos R. Sritharan Abstract A graph class is sandwich monotone if, for every pair of its
More informationEnhancing Active Automata Learning by a User Log Based Metric
Master Thesis Computing Science Radboud University Enhancing Active Automata Learning by a User Log Based Metric Author Petra van den Bos First Supervisor prof. dr. Frits W. Vaandrager Second Supervisor
More informationExact and Approximate Equilibria for Optimal Group Network Formation
Exact and Approximate Equilibria for Optimal Group Network Formation Elliot Anshelevich and Bugra Caskurlu Computer Science Department, RPI, 110 8th Street, Troy, NY 12180 {eanshel,caskub}@cs.rpi.edu Abstract.
More informationP is the class of problems for which there are algorithms that solve the problem in time O(n k ) for some constant k.
Complexity Theory Problems are divided into complexity classes. Informally: So far in this course, almost all algorithms had polynomial running time, i.e., on inputs of size n, worst-case running time
More informationBayesian Networks. Vibhav Gogate The University of Texas at Dallas
Bayesian Networks Vibhav Gogate The University of Texas at Dallas Intro to AI (CS 6364) Many slides over the course adapted from either Dan Klein, Luke Zettlemoyer, Stuart Russell or Andrew Moore 1 Outline
More informationRespecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features
Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10) Respecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features Eun Yong Kang Department
More informationPacking and Covering Dense Graphs
Packing and Covering Dense Graphs Noga Alon Yair Caro Raphael Yuster Abstract Let d be a positive integer. A graph G is called d-divisible if d divides the degree of each vertex of G. G is called nowhere
More informationTDT70: Uncertainty in Artificial Intelligence. Chapter 1 and 2
TDT70: Uncertainty in Artificial Intelligence Chapter 1 and 2 Fundamentals of probability theory The sample space is the set of possible outcomes of an experiment. A subset of a sample space is called
More informationK 4 -free graphs with no odd holes
K 4 -free graphs with no odd holes Maria Chudnovsky 1 Columbia University, New York NY 10027 Neil Robertson 2 Ohio State University, Columbus, Ohio 43210 Paul Seymour 3 Princeton University, Princeton
More informationarxiv: v4 [math.st] 19 Jun 2018
Complete Graphical Characterization and Construction of Adjustment Sets in Markov Equivalence Classes of Ancestral Graphs Emilija Perković, Johannes Textor, Markus Kalisch and Marloes H. Maathuis arxiv:1606.06903v4
More informationGenerating p-extremal graphs
Generating p-extremal graphs Derrick Stolee Department of Mathematics Department of Computer Science University of Nebraska Lincoln s-dstolee1@math.unl.edu August 2, 2011 Abstract Let f(n, p be the maximum
More informationPAC Learning. prof. dr Arno Siebes. Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht
PAC Learning prof. dr Arno Siebes Algorithmic Data Analysis Group Department of Information and Computing Sciences Universiteit Utrecht Recall: PAC Learning (Version 1) A hypothesis class H is PAC learnable
More informationThe Impossibility of Certain Types of Carmichael Numbers
The Impossibility of Certain Types of Carmichael Numbers Thomas Wright Abstract This paper proves that if a Carmichael number is composed of primes p i, then the LCM of the p i 1 s can never be of the
More informationThe Complexity of Maximum. Matroid-Greedoid Intersection and. Weighted Greedoid Maximization
Department of Computer Science Series of Publications C Report C-2004-2 The Complexity of Maximum Matroid-Greedoid Intersection and Weighted Greedoid Maximization Taneli Mielikäinen Esko Ukkonen University
More informationA MODEL-THEORETIC PROOF OF HILBERT S NULLSTELLENSATZ
A MODEL-THEORETIC PROOF OF HILBERT S NULLSTELLENSATZ NICOLAS FORD Abstract. The goal of this paper is to present a proof of the Nullstellensatz using tools from a branch of logic called model theory. In
More informationBayesian Networks. Vibhav Gogate The University of Texas at Dallas
Bayesian Networks Vibhav Gogate The University of Texas at Dallas Intro to AI (CS 4365) Many slides over the course adapted from either Dan Klein, Luke Zettlemoyer, Stuart Russell or Andrew Moore 1 Outline
More informationSergey Norin Department of Mathematics and Statistics McGill University Montreal, Quebec H3A 2K6, Canada. and
NON-PLANAR EXTENSIONS OF SUBDIVISIONS OF PLANAR GRAPHS Sergey Norin Department of Mathematics and Statistics McGill University Montreal, Quebec H3A 2K6, Canada and Robin Thomas 1 School of Mathematics
More informationAlgebraic Methods in Combinatorics
Algebraic Methods in Combinatorics Po-Shen Loh 27 June 2008 1 Warm-up 1. (A result of Bourbaki on finite geometries, from Răzvan) Let X be a finite set, and let F be a family of distinct proper subsets
More informationBichain graphs: geometric model and universal graphs
Bichain graphs: geometric model and universal graphs Robert Brignall a,1, Vadim V. Lozin b,, Juraj Stacho b, a Department of Mathematics and Statistics, The Open University, Milton Keynes MK7 6AA, United
More informationTHE VINE COPULA METHOD FOR REPRESENTING HIGH DIMENSIONAL DEPENDENT DISTRIBUTIONS: APPLICATION TO CONTINUOUS BELIEF NETS
Proceedings of the 00 Winter Simulation Conference E. Yücesan, C.-H. Chen, J. L. Snowdon, and J. M. Charnes, eds. THE VINE COPULA METHOD FOR REPRESENTING HIGH DIMENSIONAL DEPENDENT DISTRIBUTIONS: APPLICATION
More informationCS 350 Algorithms and Complexity
1 CS 350 Algorithms and Complexity Fall 2015 Lecture 15: Limitations of Algorithmic Power Introduction to complexity theory Andrew P. Black Department of Computer Science Portland State University Lower
More information2 : Directed GMs: Bayesian Networks
10-708: Probabilistic Graphical Models 10-708, Spring 2017 2 : Directed GMs: Bayesian Networks Lecturer: Eric P. Xing Scribes: Jayanth Koushik, Hiroaki Hayashi, Christian Perez Topic: Directed GMs 1 Types
More informationMarkov properties for mixed graphs
Bernoulli 20(2), 2014, 676 696 DOI: 10.3150/12-BEJ502 Markov properties for mixed graphs KAYVAN SADEGHI 1 and STEFFEN LAURITZEN 2 1 Department of Statistics, Baker Hall, Carnegie Mellon University, Pittsburgh,
More informationOutline. CSE 573: Artificial Intelligence Autumn Bayes Nets: Big Picture. Bayes Net Semantics. Hidden Markov Models. Example Bayes Net: Car
CSE 573: Artificial Intelligence Autumn 2012 Bayesian Networks Dan Weld Many slides adapted from Dan Klein, Stuart Russell, Andrew Moore & Luke Zettlemoyer Outline Probabilistic models (and inference)
More informationNoisy-OR Models with Latent Confounding
Noisy-OR Models with Latent Confounding Antti Hyttinen HIIT & Dept. of Computer Science University of Helsinki Finland Frederick Eberhardt Dept. of Philosophy Washington University in St. Louis Missouri,
More information