Who Learns Better Bayesian Network Structures


Who Learns Better Bayesian Network Structures: Constraint-Based, Score-Based or Hybrid Algorithms? Marco Scutari 1, Catharina Elisabeth Graafland 2, José Manuel Gutiérrez 2. 1 Department of Statistics, University of Oxford, UK (scutari@stats.ox.ac.uk). 2 Institute of Physics of Cantabria (CSIC-UC), Santander, Spain. September 11, 2018.

Outline

Bayesian network structure learning is defined by the combination of a statistical criterion and an algorithm that determines how the criterion is applied to the data. After removing the confounding effect of different choices for the statistical criterion, we ask the following questions:

Q1 Which of constraint-based and score-based algorithms provides the most accurate structural reconstruction?
Q2 Are hybrid algorithms more accurate than constraint-based or score-based algorithms?
Q3 Are score-based algorithms slower than constraint-based and hybrid algorithms?

Classes of Structure Learning Algorithms

Structure learning consists in finding the DAG G that encodes the dependence structure of a data set D with n observations. Algorithms for this task fall into one of three classes:

- Constraint-based algorithms identify conditional independence constraints with statistical tests, and link nodes that are not found to be independent.
- Score-based algorithms are applications of general optimisation techniques; each candidate DAG is assigned a network score to maximise as the objective function.
- Hybrid algorithms have a restrict phase, implementing a constraint-based strategy to reduce the space of candidate DAGs, and a maximise phase, implementing a score-based strategy to find the optimal DAG in the restricted space.
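As a minimal sketch of the score-based class, a greedy hill-climber over DAGs can be written as follows. This is an illustration only: the node names and the toy `local_score` function below are hypothetical, and real implementations (such as tabu search) add tabu lists, random restarts and score caching.

```python
def creates_cycle(parents, x, y):
    """Adding the arc x -> y creates a cycle iff y is already an ancestor of x."""
    stack, seen = [x], set()
    while stack:
        node = stack.pop()
        if node == y:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False

def hill_climb(nodes, local_score):
    """Greedy search over DAGs, represented as {node: frozenset of parents}.
    local_score(node, parent_set) must be a decomposable network score."""
    parents = {v: frozenset() for v in nodes}
    improved = True
    while improved:
        improved = False
        best_delta, best_move = 0.0, None
        for y in nodes:
            for x in nodes:
                if x == y:
                    continue
                if x in parents[y]:
                    new = parents[y] - {x}          # arc deletion
                elif not creates_cycle(parents, x, y):
                    new = parents[y] | {x}          # arc addition
                else:
                    continue                        # would create a cycle
                delta = local_score(y, new) - local_score(y, parents[y])
                if delta > best_delta:
                    best_delta, best_move = delta, (y, frozenset(new))
        if best_move:
            y, new = best_move
            parents[y] = new
            improved = True
    return parents
```

With a toy score that rewards the single arc a -> b, `hill_climb(["a", "b"], score)` recovers exactly that DAG and then stops, because every remaining move has a non-positive score delta.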

Conditional Independence Tests and Network Scores

For discrete BNs, the most common test is the log-likelihood ratio test

G^2(X, Y \mid Z) = 2 \log \frac{P(X \mid Y, Z)}{P(X \mid Z)} = 2 \sum_{i=1}^{R} \sum_{j=1}^{C} \sum_{k=1}^{L} n_{ijk} \log \frac{n_{ijk} n_{++k}}{n_{i+k} n_{+jk}},

which has an asymptotic \chi^2_{(R-1)(C-1)L} distribution. For GBNs,

G^2(X, Y \mid Z) = -n \log(1 - \rho^2_{XY \mid Z}) \sim \chi^2_1.

As for network scores, the Bayesian Information Criterion

BIC(G; D) = \sum_{i=1}^{N} \left[ \log P(X_i \mid \Pi_{X_i}) - \frac{|\Theta_{X_i}|}{2} \log n \right]

is a common choice for both discrete BNs and GBNs, as it provides a simple approximation to \log P(G \mid D); \log P(G \mid D) itself is available in closed form as BDeu and BGe [5, 4].

Score- and Constraint-Based Algorithms Can Be Equivalent

Cowell [3] famously showed that constraint-based and score-based algorithms can select identical discrete BNs:
1. He noticed that the G^2 test has the same expression as a score-based network comparison based on the difference of log-likelihoods \log P(X \mid Y, Z) - \log P(X \mid Z) if we take Z = \Pi_X.
2. He then showed that these two classes of algorithms are equivalent if we assume a fixed, known topological ordering and use the log-likelihood and G^2 as matching statistical criteria.

We take the same view: the algorithms and the statistical criteria they use are separate and complementary in determining the overall behaviour of structure learning. We therefore want to remove the confounding effect of the choice of statistical criterion from our evaluation of the algorithms.

Constructing Matching Tests and Scores

Consider two DAGs G^+ and G^- that differ by a single arc X_j \to X_i. In a score-based approach, we can compare them using BIC:

BIC(G^+; D) > BIC(G^-; D) \iff 2 \log \frac{P(X_i \mid \Pi_{X_i} \cup \{X_j\})}{P(X_i \mid \Pi_{X_i})} > \left( |\Theta^{G^+}_{X_i}| - |\Theta^{G^-}_{X_i}| \right) \log n,

which is equivalent to testing the conditional independence of X_i and X_j given \Pi_{X_i} with the G^2 test, just with a different significance threshold. We will call this test G^2_{BIC} and use it as the matching statistical criterion for BIC when comparing different learning algorithms. For discrete BNs, starting from \log P(G \mid D) we get

\log P(G^+ \mid D) > \log P(G^- \mid D) \iff \log BF = \log \frac{P(G^+ \mid D)}{P(G^- \mid D)} > 0,

which uses Bayes factors as the matching tests for BDeu.
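The G^2_BIC decision rule can be sketched concretely for a discrete child variable: twice the gain in conditional log-likelihood from adding the candidate parent is compared against the increase in free parameters times log n. The helper names below are made up for illustration; real discrete-BN code would work on a full data frame.

```python
import math
from collections import Counter

def cond_loglik(child, parents_cols):
    """Maximised log-likelihood of a discrete child given its parent columns."""
    joint = Counter(zip(child, *parents_cols))
    marg = (Counter(zip(*parents_cols)) if parents_cols
            else Counter({(): len(child)}))
    return sum(n * math.log(n / marg[tuple(p)]) for (c, *p), n in joint.items())

def bic_prefers_arc(child, existing, candidate):
    """True iff BIC prefers adding `candidate` to the parents of `child`.

    Equivalent to the G^2_BIC test: G^2 statistic > (delta free parameters) * log n."""
    n = len(child)
    g2 = 2.0 * (cond_loglik(child, existing + [candidate])
                - cond_loglik(child, existing))
    r = len(set(child))                 # levels of the child
    q_existing = 1
    for col in existing:                # joint configurations of existing parents
        q_existing *= len(set(col))
    delta_k = (r - 1) * q_existing * (len(set(candidate)) - 1)
    return g2 > delta_k * math.log(n)
```

A perfectly dependent candidate parent clears the log n penalty easily, while an independent one contributes no likelihood gain and is rejected.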

A Simulation Study

We assess three constraint-based algorithms (PC [2], GS [6], Inter-IAMB [13]), score-based algorithms (tabu search and simulated annealing [7] for BIC, GES [1] for log BDeu) and two hybrid algorithms (MMHC [10], RSMAX2 [9]) on 14 reference networks [8]. For each BN:
1. We generate 20 samples of size n for each ratio n/|\Theta| = 0.1, 0.2, 0.5 (small samples) and 1.0, 2.0, 5.0 (large samples).
2. We learn G using (BIC, G^2_{BIC}), and (log BDeu, log BF) as well for discrete BNs.
3. We measure the accuracy of the learned DAGs with the SHD scaled by the number of arcs, SHD/|A| [10], from the reference BN; and we measure the speed of the learning algorithms with the number of calls to the statistical criterion.
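The SHD metric in step 3 can be sketched as follows; this is a simplified version that operates on DAG arc sets and counts a reversed arc as a single error (a full implementation compares the CPDAGs of the two networks instead):

```python
def shd(true_arcs, learned_arcs):
    """Structural Hamming distance between two DAGs given as sets of (tail, head) arcs.
    A missing or extra arc counts 1; a reversed arc counts 1 (not 2)."""
    diff = set(true_arcs) ^ set(learned_arcs)   # arcs present in only one graph
    errors, seen = 0, set()
    for arc in diff:
        if arc in seen:
            continue
        reverse = (arc[1], arc[0])
        if reverse in diff:
            seen.add(reverse)   # the pair forms one reversal: count it once
        errors += 1
    return errors

def scaled_shd(true_arcs, learned_arcs):
    """SHD scaled by the number of arcs |A| in the reference network."""
    return shd(true_arcs, learned_arcs) / len(true_arcs)
```

For a reference network with arcs A -> B -> C, reversing A -> B gives SHD 1, dropping B -> C gives SHD 1, and recovering the network exactly gives SHD 0.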

Discrete Bayesian Networks (Large Samples)

[Figure: scaled SHD versus log10(calls to the statistical criterion) for the ALARM, ANDES, CHILD, HAILFINDER, HEPAR2, MUNIN1, PATHFINDER, PIGS, WATER and WIN95PTS networks.]

Discrete Bayesian Networks (Small Samples)

[Figure: scaled SHD versus log10(calls to the statistical criterion) for the ALARM, ANDES, CHILD, HAILFINDER, HEPAR2, MUNIN1, PATHFINDER, PIGS, WATER and WIN95PTS networks.]

Gaussian Bayesian Networks

[Figure: scaled SHD versus log10(calls to the statistical criterion) for the ARTH150, ECOLI70, MAGIC IRRI and MAGIC NIAB networks, for both small and large samples.]

Overall Conclusions

Discrete networks:
- score-based algorithms often have higher SHDs for small samples; hybrid and constraint-based algorithms have comparable SHDs;
- constraint-based algorithms have better SHD than score-based algorithms for small sample sizes in 7/10 BNs, but their SHD decreases more slowly as n increases for all BNs;
- simulated annealing is consistently slower; tabu search is always fast and accurate in large samples, and for 6/10 BNs in small samples.

Gaussian networks:
- tabu search and simulated annealing have larger SHDs than constraint-based or hybrid algorithms for most samples;
- hybrid and constraint-based algorithms have roughly the same SHD for all sample sizes.

Real-World Climate Data...

Climate networks aim to analyse the complex spatial structure of climate data: spatial dependence among nearby locations, but also long-range, large-scale oscillation patterns over distant regions of the world, known as teleconnections [11], such as the El Niño Southern Oscillation (ENSO) [12]. We confirm the results above using NCEP/NCAR monthly surface temperature data on a global 10° resolution grid between 1981 and 2010. This gives sample size n = 30 × 12 = 360 and N = 18 × 36 = 648 variables, which we model with a Gaussian Bayesian network. This sample would count as a small sample in the simulation study.
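The dimensions above follow from reshaping the gridded monthly fields into an n × N data matrix, with one variable per grid cell and one observation per (year, month). A sketch with synthetic data standing in for the NCEP/NCAR reanalysis:

```python
import numpy as np

# Synthetic stand-in for 30 years x 12 months of surface temperature
# on an 18 x 36 grid (10-degree resolution).
years, months, n_lat, n_lon = 30, 12, 18, 36
data = np.random.default_rng(0).normal(size=(years, months, n_lat, n_lon))

# Each grid cell becomes one variable of the Gaussian BN;
# each (year, month) pair becomes one observation.
X = data.reshape(years * months, n_lat * n_lon)
assert X.shape == (360, 648)   # n = 360 observations, N = 648 variables
```

With n much smaller than the number of parameters a dense GBN would need, this data set sits firmly in the small-sample regime of the simulation study.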

... Gives Networks that Look Like This...

[Figure: two learned climate networks, one with |A| = 1594 arcs and one with |A| = 898 arcs, each shown both as links on the world map and as conditional probabilities P(V ≥ 1 | V81 = 2) − P(V ≥ 1).]

We want to find teleconnections, so we prefer networks like the one on the left over the one on the right, because the latter only encodes short-range perturbations.

... and Agree with the Simulation Study

[Figure: (a) speed as log10(calls to the statistical criterion), (b) score (log-likelihood) and (c) network size |A| as functions of the regularisation coefficient γ.]

- Constraint-based algorithms produce BNs with the highest log-likelihood; hybrid algorithms have the worst log-likelihood values and include only a few teleconnections.
- Score-based algorithms produce high-likelihood networks with a large number of teleconnections that allow propagating evidence with realistic results.
- Score-based algorithms are faster than both hybrid and constraint-based algorithms.

Conclusions

We assessed the three classes of BN structure learning algorithms, removing the confounding effect of different choices of statistical criteria. Interestingly, we found that:
Q1 constraint-based algorithms are more accurate than score-based algorithms for small sample sizes;
Q2 they are as accurate as hybrid algorithms;
Q3 tabu search, as a score-based algorithm, is faster than constraint-based algorithms more often than not.

This is in contrast with the general view in the literature that score-based algorithms are less sensitive to individual errors and more accurate than constraint-based algorithms, and that hybrid algorithms are faster and more accurate than both, especially at small sample sizes. Score-based algorithms are also supposed to scale less well to high-dimensional data.

Thanks!

References

[1] D. M. Chickering. Optimal Structure Identification With Greedy Search. Journal of Machine Learning Research, 3:507–554, 2002.
[2] D. Colombo and M. H. Maathuis. Order-Independent Constraint-Based Causal Structure Learning. Journal of Machine Learning Research, 15:3921–3962, 2014.
[3] R. Cowell. Conditions Under Which Conditional Independence and Scoring Methods Lead to Identical Selection of Bayesian Network Models. In Proceedings of the 17th Conference on Uncertainty in Artificial Intelligence, pages 91–97, 2001.
[4] D. Geiger and D. Heckerman. Learning Gaussian Networks. In Proceedings of the 10th Conference on Uncertainty in Artificial Intelligence, pages 235–243, 1994.
[5] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian Networks: The Combination of Knowledge and Statistical Data. Machine Learning, 20(3):197–243, 1995.
[6] D. Margaritis. Learning Bayesian Network Model Structure from Data. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, May 2003.
[7] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 3rd edition, 2009.
[8] M. Scutari. Bayesian Network Repository. http://www.bnlearn.com/bnrepository, 2012.
[9] M. Scutari, P. Howell, D. J. Balding, and I. Mackay. Multiple Quantitative Trait Analysis Using Bayesian Networks. Genetics, 198(1):129–137, 2014.
[10] I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Machine Learning, 65(1):31–78, 2006.
[11] A. A. Tsonis, K. L. Swanson, and G. Wang. On the Role of Atmospheric Teleconnections in Climate. Journal of Climate, 21(12):2990–3001, 2008.
[12] K. Yamasaki, A. Gozolchiani, and S. Havlin. Climate Networks around the Globe are Significantly Affected by El Niño. Physical Review Letters, 100:228501, 2008.
[13] S. Yaramakala and D. Margaritis. Speculative Markov Blanket Discovery for Optimal Feature Selection. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM '05), pages 809–812. IEEE Computer Society, 2005.