Local Structure Discovery in Bayesian Networks


1 / 45 Local Structure Discovery in Bayesian Networks. Teppo Niinimäki, Pekka Parviainen. August 18, 2012. University of Helsinki, Department of Computer Science.

2 / 45 Outline Introduction Local learning From local to global Summary

3 / 45 Outline Introduction Local learning From local to global Summary

4 / 45 Bayesian network structure: a directed acyclic graph (DAG) A, where A_v denotes the parent set of node v. [Figure: an example DAG on nodes 1-8.] Conditional probabilities: p(x) = ∏_{v ∈ N} p(x_v | x_{A_v}).
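To make the factorization concrete, here is a minimal sketch (the network, variable names and probability tables are made up for illustration, not taken from the slides) of evaluating p(x) as a product of local conditional probabilities:

```python
# A minimal sketch of the DAG factorization p(x) = prod_v p(x_v | x_{A_v}).
# The parent sets and CPTs below are hypothetical, chosen only to illustrate.

parents = {          # A_v for each node v
    "a": [],
    "b": ["a"],
    "c": ["a", "b"],
}

cpt = {              # p(x_v | x_{A_v}) as {(parent values...): {value: prob}}
    "a": {(): {0: 0.6, 1: 0.4}},
    "b": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.3, 1: 0.7}},
    "c": {(0, 0): {0: 0.8, 1: 0.2}, (0, 1): {0: 0.5, 1: 0.5},
          (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.1, 1: 0.9}},
}

def joint_probability(x):
    """Evaluate p(x) by multiplying the local conditional probabilities."""
    prob = 1.0
    for v, pa in parents.items():
        pa_values = tuple(x[u] for u in pa)
        prob *= cpt[v][pa_values][x[v]]
    return prob

print(joint_probability({"a": 1, "b": 0, "c": 1}))  # 0.4 * 0.3 * 0.6 = 0.072
```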

5 / 45 Global structure learning. Data D: n variables, m samples from distribution p. Task: Given data D, learn the structure A. [Figure: an example data table of 5000 samples over 8 variables, and the DAG to be learned from it.]

Global structure learning. Data D: n variables, m samples from distribution p. Task: Given data D, learn the structure A. Approaches: constraint-based: (pairwise) independence tests; score-based: a (global) score for each structure. 5 / 45

Score-based structure learning. Structure learning is NP-hard. Dynamic programming algorithm: O(n^2 2^n) time, O(n 2^n) space. Exact learning is infeasible for large networks. 6 / 45

Score-based structure learning. Structure learning is NP-hard. Dynamic programming algorithm: O(n^2 2^n) time, O(n 2^n) space. Exact learning is infeasible for large networks. Note: Linear programming based approaches can scale to very large networks. However, they offer no good worst-case guarantees in the general case. 6 / 45
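A rough sketch of the subset dynamic program behind such exact algorithms, assuming a decomposable scoring criterion. The `local_score` function is a placeholder (not from the slides); a real implementation would use, e.g., BDeu local scores and precompute best parent sets to reach the stated bounds, whereas the naive enumeration below is only meant to show the recursion:

```python
from functools import lru_cache
from itertools import combinations

# Placeholder: in practice a decomposable score such as BDeu computed from data.
def local_score(v, parent_set, data):
    raise NotImplementedError("plug in a decomposable score, e.g. BDeu")

def best_network_score(variables, data):
    """Best total score over all DAGs on `variables` (sketch, tiny n only)."""

    @lru_cache(maxsize=None)
    def best_parents(v, candidates):
        # Best local score for v with parents chosen from `candidates`.
        best = float("-inf")
        for k in range(len(candidates) + 1):
            for pa in combinations(candidates, k):
                best = max(best, local_score(v, frozenset(pa), data))
        return best

    @lru_cache(maxsize=None)
    def best_score(subset):
        # Optimal score of a DAG over `subset`: pick a sink v, recurse on the rest.
        if not subset:
            return 0.0
        return max(best_parents(v, tuple(sorted(set(subset) - {v})))
                   + best_score(frozenset(set(subset) - {v}))
                   for v in subset)

    return best_score(frozenset(variables))
```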

Local structure learning interested in target node(s) learn local structure near target(s) constraint-based HITON, GLL framework (Aliferis, Statnikov, Tsamardinos, Mani and Koutsoukos, 2010) 7 / 45

Local structure learning interested in target node(s) learn local structure near target(s) constraint-based HITON, GLL framework (Aliferis, Statnikov, Tsamardinos, Mani and Koutsoukos, 2010) our contribution: score-based variant: SLL 7 / 45

Local to global learning learn local structure for each node combine results to global structure MMHC, LGL framework (Tsamardinos et al., 2006; Aliferis et al., 2010) 8 / 45

Local to global learning learn local structure for each node combine results to global structure MMHC, LGL framework (Tsamardinos et al., 2006; Aliferis et al., 2010) our contribution: two SLL-based algorithms: SLL+C, SLL+G 8 / 45

9 / 45 Outline Introduction Local learning From local to global Summary

10 / 45 Local learning problem. Markov blanket = neighbors + spouses. Given its Markov blanket, a node is independent of all other nodes in the network. For example, when predicting the value of the target node t, the rest of the nodes can be ignored.

10 / 45 Local learning problem. Markov blanket = neighbors + spouses. Task: Given data D and a target node t, learn the Markov blanket of t.
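For intuition, if the DAG were known, the Markov blanket could be read off directly as parents, children, and co-parents of children; a small sketch (the dictionary encoding of the graph is my own, not from the slides):

```python
def markov_blanket(t, parents):
    """Markov blanket of t in a DAG given as a dict: node -> set of parents."""
    children = {v for v, pa in parents.items() if t in pa}
    spouses = {u for c in children for u in parents[c] if u != t}
    return (set(parents[t]) | children | spouses) - {t}

# Example: 1 -> t <- 2 and t -> 3 <- 4 give the blanket {1, 2, 3, 4}.
example = {"t": {1, 2}, 1: set(), 2: set(), 3: {"t", 4}, 4: set()}
print(markov_blanket("t", example))  # {1, 2, 3, 4}
```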

GLL/HITON algorithm outline 1. Learn neighbors: Find potential neighbors (constraint-based) Symmetry correction 2. Learn spouses: Find potential spouses (constraint-based) Symmetry correction 11 / 45

SLL algorithm outline 1. Learn neighbors: Find potential neighbors (score-based) Symmetry correction 2. Learn spouses: Find potential spouses (score-based) Symmetry correction 12 / 45

SLL algorithm outline 1. Learn neighbors: Find potential neighbors (score-based) Symmetry correction 2. Learn spouses: Find potential spouses (score-based) Symmetry correction Assume: Procedure OPTIMALNETWORK(Z) returns an optimal DAG on a subset Z of the nodes. If Z is small, exact computation is possible, for example with the dynamic programming algorithm. 12 / 45

13 / 45 Finding potential neighbors. Algorithm: FINDPOTENTIALNEIGHBORS. [Animation over several frames: candidate nodes are added to the working set Z around the target t; OPTIMALNETWORK(Z) is called, and nodes not connected to t in the learned structure are dropped from Z.] The remaining nodes are potential neighbors of target node t.
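Read as pseudocode, the loop sketched by the animation might look roughly like this. The function names, the `adjacent` helper and the exact pruning rule are my assumptions based on the slides, not the authors' implementation:

```python
def find_potential_neighbors(t, all_nodes, optimal_network, adjacent):
    """Score-based potential-neighbor search (sketch).

    optimal_network(Z) is assumed to return an optimal DAG on the node set Z;
    adjacent(dag, u, v) is assumed to test whether u and v share an arc in dag.
    """
    Z = {t}
    for v in all_nodes:
        if v == t:
            continue
        Z.add(v)
        dag = optimal_network(frozenset(Z))
        # Keep only t and the nodes adjacent to t in the learned DAG.
        Z = {t} | {u for u in Z if u != t and adjacent(dag, t, u)}
    return Z - {t}
```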

14 / 45 Problem. [Animation over several frames: the true structure around t (involving nodes 2, 3 and 4) is shown next to the structure the algorithm builds on the working set Z; the two can disagree, which motivates the symmetry correction on the next slide.]

Symmetry correction for neighbors. Algorithm: FINDNEIGHBORS. 1. Find the potential neighbors H_t of t. 2. Remove the nodes v ∈ H_t that do not have t as a potential neighbor. 3. The nodes remaining in the end are the neighbors of t. Note: Equivalent to the symmetry correction in GLL. 15 / 45
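The symmetry correction then reduces to a mutual-membership check; a sketch building on the function above (same assumed helpers):

```python
def find_neighbors(t, all_nodes, optimal_network, adjacent):
    """Keep v as a neighbor of t only if t is also a potential neighbor of v."""
    H_t = find_potential_neighbors(t, all_nodes, optimal_network, adjacent)
    return {v for v in H_t
            if t in find_potential_neighbors(v, all_nodes, optimal_network, adjacent)}
```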

16 / 45 Performance Experiments: returns good results in practice. Theoretical guarantees?

Asymptotic correctness Does FINDNEIGHBORS always return the correct set of neighbors under certain assumptions when the sample size tends to infinity? Compare: the generalized local learning (GLL) framework (Aliferis et al., 2010) greedy equivalence search (GES) (Chickering, 2002) 17 / 45

Asymptotic correctness Does FINDNEIGHBORS always return the correct set of neighbors under certain assumptions when the sample size tends to infinity? Compare: the generalized local learning (GLL) framework (Aliferis et al., 2010) greedy equivalence search (GES) (Chickering, 2002) We don't know for sure, but we can say something. 17 / 45

18 / 45 Assumptions The data distribution is faithful to a Bayesian network. (i.e. there exists a perfect map of the data distribution) OPTIMALNETWORK uses a locally consistent and score equivalent scoring criterion (e.g. BDeu)

19 / 45 Network structure properties Definition A structure A contains a distribution p if p can be described using a Bayesian network with structure A (local Markov condition holds).

19 / 45 Network structure properties Definition A structure A contains a distribution p if p can be described using a Bayesian network with structure A (local Markov condition holds). Definition A distribution p is faithful to a structure A if all conditional independencies in p are implied by A.

Network structure properties Definition A structure A is a perfect map of p if A contains p and p is faithful to A 20 / 45

Network structure properties Definition: A structure A is a perfect map of p if A contains p and p is faithful to A, i.e. there are no missing or extra independencies. Note: The property of having a perfect map is not closed under marginalization. 20 / 45

21 / 45 Equivalent structures Definition Two structures are Markov equivalent if they contain the same set of distributions. A scoring criterion is score equivalent if equivalent DAGs always have the same score. (e.g. BDeu)

21 / 45 Equivalent structures Definition Two structures are Markov equivalent if they contain the same set of distributions. A scoring criterion is score equivalent if equivalent DAGs always have the same score. (e.g. BDeu) learning possible up to an equivalence class

Consistent scoring criterion. Definition (Chickering, 2002): Let A' be the DAG that results from adding the arc u → v to A. A scoring criterion f is locally consistent if, in the limit as the data size grows, the following hold for all A, u and v: If u and v are not independent in p given A_v, then f(A', D) > f(A, D). If u and v are independent in p given A_v, then f(A', D) < f(A, D). Relevant arcs increase the score; irrelevant arcs decrease it. 22 / 45
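In symbols (my transcription of the definition above, writing A' for the DAG obtained from A by adding the arc u → v):

```latex
\begin{align*}
  u \not\perp_{p} v \mid A_v &\;\Longrightarrow\; f(A', D) > f(A, D) \\
  u \perp_{p} v \mid A_v     &\;\Longrightarrow\; f(A', D) < f(A, D)
\end{align*}
```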

23 / 45 Finding neighbors: theory In the limit of large sample size, we get the following result: Lemma FINDPOTENTIALNEIGHBORS returns a set that includes all true neighbors of target node t. Conjecture FINDNEIGHBORS returns exactly the true neighbors of target node t.

Finding neighbors: Proof of the lemma. In the limit of large sample size: Lemma: FINDPOTENTIALNEIGHBORS returns a set that includes all true neighbors of target node t. Proof sketch: assume v is a neighbor of t in A. By faithfulness, t and v are not independent given any subset of the other nodes. By local consistency, OPTIMALNETWORK on any Z that includes t and v always returns a structure with an arc between t and v. Thus v is kept in Z once it has been added. 24 / 45

SLL algorithm outline 1. Learn neighbors: Find potential neighbors (score-based) Symmetry correction 2. Learn spouses: Find potential spouses (score-based) Symmetry correction 25 / 45

26 / 45 Finding potential spouses. Algorithm: FINDPOTENTIALSPOUSES. [Animation over several frames: candidate nodes are added to the working set Z around t and its neighbors; OPTIMALNETWORK(Z) is called and the structure around t is inspected.] The remaining new nodes are potential spouses of target node t.
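Once a DAG has been learned on a candidate set, the spouse candidates of t can be read off as co-parents of t's children. The sketch below shows only that extraction step (the choice of candidate set Z that SLL learns the DAG over is not spelled out here, so this is just part of the procedure, with an assumed graph encoding):

```python
def potential_spouses_from_dag(t, dag_parents, known_neighbors):
    """Extract spouse candidates of t from a DAG learned on a candidate set Z.

    dag_parents: dict node -> set of parents in the learned DAG on Z.
    Spouses are co-parents of t's children that are not already neighbors of t.
    """
    children = {c for c, pa in dag_parents.items() if t in pa}
    co_parents = {u for c in children for u in dag_parents[c] if u != t}
    return co_parents - set(known_neighbors) - {t}
```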

Symmetry correction for spouses. Algorithm: FINDSPOUSES. 1. Find the potential spouses S_t of t. 2. Add the nodes that have t as a potential spouse. 3. The nodes gathered in the end are the spouses of t. Note: Equivalent to the symmetry correction in GLL. 27 / 45

Finding spouses: theory Again, in the limit of large sample size: Lemma FINDPOTENTIALSPOUSES returns a set that contains only true spouses of target node t. Conjecture FINDSPOUSES returns exactly the true spouses of target node t. 28 / 45

29 / 45 Finding Markov blanket: theory Summarising previous lemmas and conjectures: Conjecture If the data is faithful to a Bayesian network and one uses a locally consistent and score equivalent scoring criterion then in the limit of large sample size SLL always returns the correct Markov blanket.

Time and space requirements. OPTIMALNETWORK(Z) dominates: O(|Z|^2 2^|Z|) time, O(|Z| 2^|Z|) space. Worst case: |Z| = O(n), so the total time is at most O(n^4 2^n). In practice: networks are relatively sparse, which gives significantly lower running times. 30 / 45

SLL implementation: a C++ implementation. OPTIMALNETWORK(Z) procedure: dynamic programming algorithm, BDeu score, fallback to GES for |Z| > 20. The implementation is available at http://www.cs.helsinki.fi/u/tzniinim/uai2012/ 31 / 45

32 / 45 Experiments Generated data from different Bayesian networks. (ALARM5 = 185 nodes, CHILD10 = 200 nodes) Compared score-based SLL to constraint-based HITON by Aliferis et al. (2003, 2010). Measured SLHD: Sum of Local Hamming Distances between the returned neighbors (Markov blankets) and the true neighbors (Markov blankets).

33 / 45 Experimental results: neighbors. [Plot: average SLHD for neighbors on Alarm5 and Child10 at sample sizes 500, 1000 and 5000, comparing HITON and SLL.]

34 / 45 Experimental results: Markov blankets. [Plot: average SLHD for Markov blankets on Alarm5 and Child10 at sample sizes 500, 1000 and 5000, comparing HITON and SLL.]

35 / 45 Experimental results: time consumption. [Plot: average running times in seconds on Alarm5 and Child10 at sample sizes 500, 1000 and 5000, comparing SLL and HITON.]

36 / 45 Outline Introduction Local learning From local to global Summary

Local to global learning Idea: Construct the global network structure based on local results. Implemented two algorithms: Constraint-based: SLL+C Score-based: SLL+G Compared to: max-min hill-climbing (MMHC) by Tsamardinos et al. (2006) greedy search (GS) greedy equivalence search (GES) by Chickering (2002) 37 / 45

SLL+C algorithm Algorithm: SLL+C (constraint-based local-to-global) 1. Run SLL for all nodes. 2. Based on the neighbor sets construct the skeleton of the DAG. 3. Direct the edges consistently (if possible) with the v-structures learned during the spouse-phase of SLL. (see Pearl, 2000) 38 / 45
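A partial sketch of how steps 1-3 might be combined: build the skeleton from the (assumed symmetric) neighbor sets and orient the detected v-structures; propagating the remaining orientations consistently (Pearl, 2000) is omitted here. Function and variable names are my own, not from the slides:

```python
def sllc_combine(neighbors, v_structures):
    """Combine local results into a partially directed graph (sketch).

    neighbors: dict node -> set of learned neighbors (assumed symmetric).
    v_structures: iterable of triples (u, c, v) meaning u -> c <- v was found
    during the spouse phase. Returns (undirected edges, directed arcs); the
    remaining edges would then be oriented consistently, if possible.
    """
    skeleton = {frozenset((u, v)) for u, nbrs in neighbors.items() for v in nbrs}
    directed = set()
    for u, c, v in v_structures:
        directed.add((u, c))
        directed.add((v, c))
    undirected = skeleton - {frozenset(arc) for arc in directed}
    return undirected, directed
```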

SLL+G algorithm Algorithm: SLL+G (greedy search based local-to-global) 1. Run the first phase of the local neighbor search of SLL for all nodes. 2. Construct the DAG by greedy search, but allow adding an arc between two nodes only if both are each other's potential neighbors. Similar to the MMHC algorithm. 39 / 45
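A sketch of the restricted greedy search in step 2: hill-climbing over arc additions, where an arc between u and v is a candidate only if each is a potential neighbor of the other. The score-gain and cycle-check helpers are assumed placeholders, and a full greedy search would also consider arc deletions and reversals:

```python
def sllg_greedy_search(nodes, potential_neighbors, score_gain, creates_cycle):
    """Greedy arc additions restricted to mutual potential neighbors (sketch).

    potential_neighbors: dict node -> set of potential neighbors from SLL phase 1.
    score_gain(parents, u, v): assumed score improvement of adding u -> v.
    creates_cycle(parents, u, v): assumed acyclicity test for that addition.
    """
    parents = {v: set() for v in nodes}
    improved = True
    while improved:
        improved = False
        best_gain, best_arc = 0.0, None
        for u in nodes:
            for v in nodes:
                if u == v or u in parents[v]:
                    continue
                # Restriction inherited from the local phase: mutual candidates only.
                if v not in potential_neighbors[u] or u not in potential_neighbors[v]:
                    continue
                if creates_cycle(parents, u, v):
                    continue
                gain = score_gain(parents, u, v)
                if gain > best_gain:
                    best_gain, best_arc = gain, (u, v)
        if best_arc is not None:
            u, v = best_arc
            parents[v].add(u)
            improved = True
    return parents
```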

40 / 45 Experimental results: SHD. [Plot: average SHD on Alarm5 and Child10 at sample sizes 500, 1000 and 5000, comparing GS, GES, MMHC, SLL+G and SLL+C.]

41 / 45 Experimental results: scores. [Plot: average normalized score on Alarm5 and Child10 at sample sizes 500, 1000 and 5000, comparing GS, GES, MMHC, SLL+G and SLL+C; 1 = true network, lower = better.]

42 / 45 Experimental results: time consumption. [Plot: average running times in seconds on Alarm5 and Child10 at sample sizes 500, 1000 and 5000, comparing GS, GES, MMHC, SLL+G and SLL+C.]

43 / 45 Outline Introduction Local learning From local to global Summary

Summary. Score-based local learning algorithm: SLL. Conjecture: same theoretical guarantees as GES and constraint-based local learning. Experiments: a competitive alternative to constraint-based local learning. Score-based local-to-global algorithms. Experiments: mixed results. Future directions: prove the guarantees (or lack thereof); speed improvements. 44 / 45

45 / 45 References

Constantin F. Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, and Xenofon D. Koutsoukos. Local causal and Markov blanket induction for causal discovery and feature selection for classification part I: Algorithms and empirical evaluation. Journal of Machine Learning Research, 11:171-234, 2010.

Constantin F. Aliferis, Alexander Statnikov, Ioannis Tsamardinos, Subramani Mani, and Xenofon D. Koutsoukos. Local causal and Markov blanket induction for causal discovery and feature selection for classification part II: Analysis and extensions. Journal of Machine Learning Research, 11:235-284, 2010.

David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507-554, 2002.

Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.

Ioannis Tsamardinos, Laura E. Brown, and Constantin F. Aliferis. The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65:31-78, 2006.