Learning Bayes Net Structures

Learning Bayes Net Structures. K&F, Chapter 15 – 15.5 (R&N, Chapter 20). Some material taken from C. Guestrin (CMU) and K. Murphy (UBC).

Learning Bayes Nets

                   Known structure    Unknown structure
   Complete data   Easy               Hard (NP-hard)
   Missing data    EM                 Very hard!!

Missing-data example: x(7) = [x1=t, x2=?, x3=f, ...]
Data: x(1), ..., x(m). Structure: the DAG. Parameters: the CPTs P(Xi | Pa_Xi).

Learning the structure of a BN
Given data <x1(1), ..., xn(1)>, ..., <x1(m), ..., xn(m)>, learn the structure and the parameters.
[Example network: Flu and Allergy are parents of Sinus; Sinus is a parent of Headache and Nose.]
Constraint-based approach:
- a BN encodes conditional independencies
- test conditional independencies in the data
- find an I-map (a P-map?)
Score-based approach:
- finding structure + parameters is density estimation
- evaluate the model as we evaluated parameters: maximum likelihood, Bayesian, etc.

Remember: Obtaining a P-map
Given I(P), the independence assertions that are true for P:
1. Obtain the skeleton
2. Obtain the immoralities
3. Using the skeleton and immoralities, obtain every (and any) BN structure from the equivalence class
Constraint-based approach: use the Learn_PDAG algorithm.
Key question: the independence test.

Independence tests
A statistically difficult task! Intuitive approach: mutual information.
Mutual information and independence: X and Y are independent if and only if I(X; Y) = 0.
X ⊥ Y  iff  P(x, y) = P(x) P(y) for all x, y  iff  I(X; Y) = Σ_{x,y} P(x, y) log [ P(x, y) / (P(x) P(y)) ] = 0
Conditional mutual information: X ⊥ Y | Z  iff  P(X, Y | Z) = P(X | Z) P(Y | Z)  iff  I(X; Y | Z) = 0
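As a concrete illustration (not from the original slides), here is a minimal NumPy sketch of the empirical mutual-information test; the function and variable names are my own, and the threshold for declaring independence is left to the next slide.

    import numpy as np

    def empirical_mi(x, y):
        """Estimate I(X; Y) in bits from two aligned arrays of discrete samples."""
        mi = 0.0
        for xv in np.unique(x):
            for yv in np.unique(y):
                p_xy = np.mean((x == xv) & (y == yv))
                p_x, p_y = np.mean(x == xv), np.mean(y == yv)
                if p_xy > 0:
                    mi += p_xy * np.log2(p_xy / (p_x * p_y))
        return mi

    # Toy check: an independent pair vs. a strongly dependent pair.
    rng = np.random.default_rng(0)
    x = rng.integers(0, 2, size=5000)
    y_indep = rng.integers(0, 2, size=5000)
    y_dep = x ^ rng.choice([0, 1], size=5000, p=[0.9, 0.1])   # y_dep mostly copies x
    print(empirical_mi(x, y_indep))   # close to 0  -> looks independent
    print(empirical_mi(x, y_dep))     # clearly > 0 -> looks dependent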

Independence tests and the constraint-based approach
Using the data D:
Empirical distribution: P̂(x, y) = Count(x, y) / m
Mutual information: Î(X; Y) = Σ_{x,y} P̂(x, y) log [ P̂(x, y) / (P̂(x) P̂(y)) ], and similarly for conditional MI.
Use the Learn_PDAG algorithm; when the algorithm asks "(X ⊥ Y | U)?":
- use Î(X; Y | U) = 0? No: that doesn't happen with finite data.
- use Î(X; Y | U) < t for some t > 0? Yes, with t based on some statistical test, e.g. chosen so that p < 0.05.
Many other types of independence tests exist.

Independence Tests II
For discrete data, the χ2 statistic measures how far the counts are from what we would expect under independence:
d_χ2(D) = Σ_{x,y} ( N[x, y] − N · P̂(x) P̂(y) )² / ( N · P̂(x) P̂(y) )
where N is the number of samples and N[x, y] the count of the pair (x, y).
The exact p-value requires summing over all datasets of size N: p(t) = P({D : d(D) > t} | H0, N).
That is expensive; the usual approximation considers the expected distribution of d(D) under the null hypothesis as N → ∞ to define thresholds for a given significance level.
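For the χ2 flavour of the test, SciPy's contingency-table routine returns both the statistic and the asymptotic p-value; a small sketch with made-up counts (the numbers are purely illustrative):

    import numpy as np
    from scipy.stats import chi2_contingency

    # Contingency table of counts N[x, y] for two binary variables (illustrative numbers).
    counts = np.array([[30, 10],
                       [20, 40]])

    chi2, p_value, dof, expected = chi2_contingency(counts)
    print(chi2, p_value)   # large statistic / small p-value -> reject independence
    print(expected)        # the counts we would expect under H0: X independent of Y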

Example of classical hypothesis testing
Spin a Belgian one-euro coin N = 250 times: heads Y = 140, tails 110.
Distinguish two models: H0 = the coin is unbiased (p = 0.5), H1 = the coin is biased (p ≠ 0.5).
The p-value is less than 7%: p = P(Y ≥ 140) + P(Y ≤ 110) = 0.066
(in MATLAB: n = 250; p0 = 0.5; y = 140; pval = (1 - binocdf(y-1, n, p0)) + binocdf(n-y, n, p0))
If Y = 141: p = 0.0497, so we reject the null hypothesis at significance level 0.05.
But is the coin really biased?
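The MATLAB one-liner above translates directly to Python; a small sketch using scipy.stats.binom:

    from scipy.stats import binom

    n, p0, y = 250, 0.5, 140
    # Two-sided p-value under H0: p = 0.5, i.e. P(Y >= 140) + P(Y <= 110).
    pval = binom.sf(y - 1, n, p0) + binom.cdf(n - y, n, p0)
    print(round(pval, 4))   # about 0.066; with y = 141 it drops to about 0.0497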

The build-pdag algorithm (also called the IC or PC algorithm) can recover the true structure up to I-equivalence in O(n³ · 2^d) time, provided that:
- the maximum number of parents over all nodes is d
- the independence-test oracle can handle up to 2d + 2 variables
- G is an I-map of the underlying distribution P
- P is faithful to G: there are no spurious independencies not sanctioned by G

Evaluation of the IC / PC algorithm
Bad:
- The faithfulness assumption rules out certain CPDs (e.g., XOR).
- Independence tests are typically unreliable (especially on small data sets) and make many errors.
- One misleading independence-test result can cause multiple errors in the resulting PDAG, so the approach is not robust to noise.
Good:
- The PC algorithm is less dumb than local search.

Score-based Approach
There are gazillions of possible DAG structures over the variables (Flu, Allergy, Sinus, Headache, Nose).
Given data <x1(1), ..., xn(1)>, ..., <x1(m), ..., xn(m)>: for each candidate structure, learn its parameters, then evaluate it, assigning the structure a score (e.g., 15,000; 10,000; 20,000; 10,500 in the figure).

Information-theoretic interpretation of maximum likelihood
Given the structure G, the log likelihood of the data is
log P(D | θ, G) = Σ_i Σ_{j,k} N_ijk log θ_ijk
where N_ijk is the number of instances in which Pa_Xi (in G) takes its j-th configuration and Xi takes its k-th value, and θ_ijk = P(Xi = k | Pa_Xi = j).

Entropy
Entropy of V with distribution [p(V=1), p(V=0)]:
H(V) = − Σ_i P(V = v_i) log2 P(V = v_i)
= the number of bits needed to obtain full information = the average "surprise" of the result of one trial of V.
Entropy is a measure of uncertainty.

Examples of Entropy
Fair coin: H(½, ½) = −½ log2(½) − ½ log2(½) = 1 bit (i.e., we need 1 bit to convey the outcome of the flip).
Biased coin: H(1/100, 99/100) = −1/100 log2(1/100) − 99/100 log2(99/100) = 0.08 bit.
As P(heads) → 1, the information in the actual outcome → 0: H(0, 1) = H(1, 0) = 0 bits, i.e., no uncertainty is left in the source (using 0 log2(0) = 0).

Entropy & Conditional Entropy
Entropy of a distribution: H(X) = − Σ_i P(x_i) log P(x_i). How "surprising" the variable is; entropy = 0 when we know everything, e.g. P(+x) = 1.0.
Conditional entropy: H(X | U) = − Σ_u P(u) Σ_i P(x_i | u) log P(x_i | u). How much uncertainty is left in X after observing U.
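A minimal sketch of these two quantities estimated from samples (NumPy only; the helper names are my own):

    import numpy as np

    def entropy(x):
        """H(X) in bits, estimated from a sample of a discrete variable."""
        _, counts = np.unique(x, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def conditional_entropy(x, u):
        """H(X | U) = sum_u P(u) H(X | U=u), estimated from aligned samples."""
        h = 0.0
        for uv in np.unique(u):
            mask = (u == uv)
            h += mask.mean() * entropy(x[mask])
        return h

    x = np.array([0, 1, 0, 1, 1, 0, 1, 0])
    print(entropy(x))                  # 1.0: a fair coin is maximally surprising
    print(conditional_entropy(x, x))   # 0.0: observing U = X leaves no uncertainty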

Information-theoretic interpretation of maximum likelihood 2
Given the structure G, the log likelihood of the data at the MLE parameters is
log P(D | θ̂, G) = − m Σ_i Ĥ(Xi | Pa_Xi^G)
So log P(D | θ, G) is LARGEST when each Ĥ(Xi | Pa_Xi^G) is SMALL, i.e., when the parents of Xi are very INFORMATIVE about Xi!

Score for a Bayesian Network
I(X; U) = H(X) − H(X | U), so H(X | Pa_X^G) = H(X) − I(X; Pa_X^G).
The log data likelihood is therefore
log P(D | θ̂, G) = m Σ_i Î(Xi; Pa_Xi^G) − m Σ_i Ĥ(Xi)
The second term doesn't involve the structure G; the first term is large when each Xi and its parents are far from independent.
So use the score: Σ_i Î(Xi; Pa_Xi^G).

Decomposable Score
Log data likelihood, or perhaps just the score Σ_i Î(Xi; Pa_Xi^G).
Decomposable score: it decomposes over the families of the BN (a node and its parents); this will lead to significant computational efficiency!!!
Score(G : D) = Σ_i FamScore(Xi | Pa_Xi : D)
For MLE: FamScore(Xi | Pa_Xi : D) = m [ Î(Xi; Pa_Xi) − Ĥ(Xi) ]
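A small sketch of the MLE family score, using the identity m[Î(Xi; Pa_Xi) − Ĥ(Xi)] = −m·Ĥ(Xi | Pa_Xi); the helper names are illustrative, and the total score of a graph is just the sum of this quantity over its families.

    import numpy as np

    def entropy(x):
        _, counts = np.unique(x, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def conditional_entropy(x, parents):
        """H(X | Pa): `parents` is an (m, k) array of parent columns; k may be 0."""
        if parents.shape[1] == 0:
            return entropy(x)
        _, cfg = np.unique(parents, axis=0, return_inverse=True)   # joint parent configs
        cfg = cfg.ravel()                                          # one label per row
        return sum((cfg == c).mean() * entropy(x[cfg == c]) for c in np.unique(cfg))

    def fam_score_mle(x, parents):
        """FamScore(X | Pa : D) = m [ I(X; Pa) - H(X) ] = -m H(X | Pa), in bits."""
        return -len(x) * conditional_entropy(x, parents)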

Using Decomposability
Compare Σ_i Î(Xi; Pa_Xi^G) + c for two structures over X, Y, Z:
G1: X → Y → Z
G2: Y → Z → X
G1: Σ_i Î(Xi; Pa_Xi^G1) = Î(X; {}) + Î(Y; X) + Î(Z; Y) = 0 + Î(Y; X) + Î(Z; Y)
G2: Σ_i Î(Xi; Pa_Xi^G2) = Î(Y; {}) + Î(Z; Y) + Î(X; Z) = 0 + Î(Z; Y) + Î(X; Z)
So the difference is Î(Y; X) − Î(X; Z).

How many trees are there?
A tree has exactly one path between any two nodes (in the skeleton): every node has at most 1 parent (plus a root with 0 parents).
How many trees? Exponentially many: pick a root, pick its children, and for each child there is another subtree to choose, and so on.
Nonetheless, there is an efficient algorithm to find the OPTIMAL tree.

Best Tree Structure
Identify a tree with the set F = { Pa(Xi) }, where each Pa(Xi) is either {} or a single other variable.
The optimal tree, given the data, is
argmax_F [ m Σ_i Î(Xi; Pa(Xi)) − m Σ_i Ĥ(Xi) ] = argmax_F Σ_i Î(Xi; Pa(Xi))
since Σ_i Ĥ(Xi) does not depend on the structure.
So we want the parent assignment F whose tree structure maximizes Σ_i Î(Xi; Pa(Xi)).

Chow-Liu tree learning algorithm 1
For each pair of variables Xi, Xj:
- compute the empirical distribution P̂(xi, xj) = Count(xi, xj) / m
- compute the mutual information Î(Xi; Xj) = Σ_{xi,xj} P̂(xi, xj) log [ P̂(xi, xj) / (P̂(xi) P̂(xj)) ]
Define a graph with nodes X1, ..., Xn in which edge (i, j) gets weight Î(Xi; Xj).
Find a maximal spanning tree; pick a node as root and let the rest of the tree dangle from it (orient edges away from the root).

Chow-Liu tree learning algorithm 2
Optimal tree BN: compute the maximum-weight spanning tree.
Directions in the BN: pick any node as root; breadth-first search defines the edge directions. It doesn't matter which root you pick!
Score equivalence: if G and G′ are I-equivalent, their scores are the same.
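A compact sketch of the whole procedure, using networkx for the maximum-weight spanning tree and a BFS from an arbitrary root to orient the edges; the pairwise_mi helper and the toy data are illustrative, not the lecture's code.

    import numpy as np
    import networkx as nx

    def pairwise_mi(xi, xj):
        """Empirical mutual information I(Xi; Xj) in bits from aligned sample columns."""
        mi = 0.0
        for a in np.unique(xi):
            for b in np.unique(xj):
                p_ab = np.mean((xi == a) & (xj == b))
                if p_ab > 0:
                    mi += p_ab * np.log2(p_ab / (np.mean(xi == a) * np.mean(xj == b)))
        return mi

    def chow_liu(data):
        """data: (m, n) array of discrete values; returns directed edges (parent, child)."""
        m, n = data.shape
        g = nx.Graph()
        g.add_nodes_from(range(n))
        for i in range(n):
            for j in range(i + 1, n):
                g.add_edge(i, j, weight=pairwise_mi(data[:, i], data[:, j]))
        tree = nx.maximum_spanning_tree(g)        # undirected optimal tree skeleton
        return list(nx.bfs_edges(tree, 0))        # any root works; BFS orients the edges

    # Toy example: data generated from a chain X0 -> X1 -> X2.
    rng = np.random.default_rng(0)
    x0 = rng.integers(0, 2, size=2000)
    x1 = x0 ^ (rng.random(2000) < 0.1)
    x2 = x1 ^ (rng.random(2000) < 0.1)
    print(chow_liu(np.column_stack([x0, x1, x2])))   # e.g. [(0, 1), (1, 2)]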

Chow-Liu (CL) Results
If the distribution P is tree-structured, CL finds the CORRECT one.
If the distribution P is NOT tree-structured, CL finds the tree-structured Q with minimal KL-divergence: argmin_Q KL(P; Q).
Even though there are 2^Θ(n log n) trees, CL finds the BEST one in polynomial time, O(n² [m + log n]).

Can we extend Chow-Liu? 1
The naïve Bayes model overcounts evidence because it ignores correlation between features.
Tree-augmented naïve Bayes (TAN) [Friedman et al. 97]: same as Chow-Liu, but score edges with the class-conditional mutual information
Î(Xi; Xj | C) = Σ_{xi,xj,c} P̂(xi, xj, c) log [ P̂(xi, xj | c) / (P̂(xi | c) P̂(xj | c)) ]

Can we extend Chow-Liu? 2
(Approximately) learning models with tree-width up to k [Narasimhan & Bilmes 04]. But this is O(n^(k+1)), and there are more subtleties.

Comments on learning BN structures so far
- Decomposable scores; maximum likelihood; information-theoretic interpretation
- Best tree (Chow-Liu)
- Best TAN
- Nearly best k-treewidth (in O(N^(k+1)))

Overfitting
So far: find the parameters/structure that fit the training data.
With too many parameters, the model will match the TRAINING data well but NOT new instances: overfitting!
[Figure: error as a function of model complexity.]
Remedies: regularizing, the Bayesian approach, ...

How to Evaluate a Model?
[Figure: a table of training data (SNP1, SNP2, SNP3, ..., SNP53, Bleeding?) split into a TRAIN portion and a TEST portion; fit the parameters on TRAIN, then compute the probability of the data, here -38.]
Training-set error: too optimistic.

How to Evaluate a Model?
[Figure: same data, but the probability is now computed on the held-out TEST portion, here -53.]
Simple hold-out-set error: slightly pessimistic.

How to Evaluate a Model?
K-fold cross validation (e.g., K = 3): not as pessimistic, since every point is a test example exactly once.
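A generic sketch of the K-fold procedure for model scores; fit_fn and loglik_fn are placeholders for whatever parameter learner and held-out log-likelihood the reader plugs in.

    import numpy as np

    def cross_validated_score(data, k, fit_fn, loglik_fn, seed=0):
        """Average held-out log-likelihood over k folds.
        fit_fn(train_rows) -> model; loglik_fn(model, test_rows) -> float."""
        folds = np.array_split(np.random.default_rng(seed).permutation(len(data)), k)
        scores = []
        for i, test_idx in enumerate(folds):
            train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
            model = fit_fn(data[train_idx])
            scores.append(loglik_fn(model, data[test_idx]))
        return float(np.mean(scores))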

Maximum likelihood score overfits!
Information never hurts: H(X | U ∪ {Z}) ≤ H(X | U), so adding a parent never decreases the score!!!

Bayesian Score
Prior distributions: over structures, and over the parameters of each structure. Goal: prefer simpler structures, i.e., regularization.
Posterior over structures given the data (posterior ∝ likelihood × prior over graphs):
P(G | D) ∝ P(D | G) P(G)
where the marginal likelihood integrates out the parameters under the prior over parameters:
P(D | G) = ∫ P(D | G, Θ_G) P(Θ_G | G) dΘ_G

Bayesian score and model complexity
True model: X → Y with P(Y=t | X=t) = 0.5 + α and P(Y=t | X=f) = 0.5 − α.
Structure 1: X and Y independent. Its score doesn't depend on α!

Bayesian score and model complexity
True model: P(Y=t | X=t) = 0.5 + α, P(Y=t | X=f) = 0.5 − α.
Structure 1: X and Y independent; its score doesn't depend on α.
Structure 2: X → Y; the data points are split between estimating P(Y=t | X=t) and P(Y=t | X=f).
For fixed m, the extra edge is only worth it for large α, because the posterior over each parameter will be more diffuse with less data.

Bayesian, a decomposable score
Assume local and global parameter independence (e.g., θ_X and θ_Y|+x are independent a priori).
The prior satisfies parameter modularity: if Xi has the same parents in G and G′, then its parameters have the same prior.
[Figure: two structures over A, B, C, X in which X has parents A and B; Θ(X | A, B) is the same in both.]
The structure prior P(G) satisfies structure modularity: it is a product of terms over families, e.g., P(G) ∝ c^|G| with |G| = # of edges and c < 1.
Then the Bayesian score decomposes along families:
log P(G | D) = Σ_X ScoreFam(X | Pa_X : D) (up to a constant)

Marginal Posterior
Given θ ~ Beta(1,1), what is the probability of ⟨H, T, T, H, H⟩?
P(f1=H, f2=T, f3=T, f4=H, f5=H | θ ~ Beta(1,1))
= P(f1=H | Beta(1,1)) · P(f2=T, f3=T, f4=H, f5=H | f1=H, Beta(1,1))
= 1/2 · P(f2=T, f3=T, f4=H, f5=H | Beta(2,1))
= 1/2 · P(f2=T | Beta(2,1)) · P(f3=T, f4=H, f5=H | f2=T, Beta(2,1))
= 1/2 · 1/3 · P(f3=T, f4=H, f5=H | Beta(2,2))
= 1/2 · 1/3 · 2/4 · P(f4=H, f5=H | Beta(2,3))
= 1/2 · 1/3 · 2/4 · 2/5 · P(f5=H | Beta(3,3))
= 1/2 · 1/3 · 2/4 · 2/5 · 3/6
= (1·2·3)·(1·2) / (2·3·4·5·6) = 1/60

Marginal Posterior, con't
Given θ ~ Beta(a,b), what is P(⟨H, T, T, H, H⟩)?
P(f1=H, f2=T, f3=T, f4=H, f5=H | θ ~ Beta(a,b))
= P(f1=H | Beta(a,b)) · P(f2=T, f3=T, f4=H, f5=H | f1=H, Beta(a,b))
= a/(a+b) · P(f2=T, f3=T, f4=H, f5=H | Beta(a+1, b))
= ...
= [ a(a+1)(a+2) · b(b+1) ] / [ (a+b)(a+b+1)(a+b+2)(a+b+3)(a+b+4) ]
In general, with mH heads and mT tails out of m = mH + mT flips:
= Γ(a+mH) Γ(b+mT) Γ(a+b) / [ Γ(a) Γ(b) Γ(a+b+m) ]
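The closed form can be checked against the sequential product; a small sketch using log-Gamma from the standard library (with a = b = 1, mH = 3, mT = 2 this reproduces the 1/60 from the previous slide):

    from math import lgamma, exp

    def beta_marginal_likelihood(a, b, m_heads, m_tails):
        """P(a particular sequence with m_heads heads and m_tails tails | theta ~ Beta(a, b))."""
        m = m_heads + m_tails
        return exp(lgamma(a + m_heads) + lgamma(b + m_tails) + lgamma(a + b)
                   - lgamma(a) - lgamma(b) - lgamma(a + b + m))

    print(beta_marginal_likelihood(1, 1, 3, 2))       # 1/60 = 0.01666...
    print((1/2) * (1/3) * (2/4) * (2/5) * (3/6))      # same number, via the chain rule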

Marginal vs. Maximal Likelihood
Data D = ⟨H, T, T, H, H⟩.
θ* = argmax_θ P(D | θ) = 3/5; here P(D | θ*) = (3/5)³ (2/5)² ≈ 0.035.
Or Bayesian, starting from Beta(1,1): θ*_{B(1,1)} = 4/7.
The marginal likelihood Π_i P(x_i | x_1, ..., x_{i-1}) is kinda like cross-validation: each instance is evaluated with respect to the previous instances.

Marginal Probability of a Graph
Given complete data and independent parameters, the marginal likelihood is a product of per-family terms like the one on the previous slide:
P(D | G) = Π_i Π_{u ∈ Val(Pa_Xi^G)} [ Γ(α_{Xi|u}) / Γ(α_{Xi|u} + M[u]) ] · Π_{xi} [ Γ(α_{xi|u} + M[xi, u]) / Γ(α_{xi|u}) ]
where the α's are the Dirichlet hyperparameters, M[·] are counts in D, and α_{Xi|u} = Σ_{xi} α_{xi|u}.

BIC approximation of the Bayesian Score
In general, the Bayesian score involves difficult integrals. For a Dirichlet prior over the parameters, we can use the simple Bayesian information criterion (BIC) approximation; in the limit, we can forget the prior!
Theorem: for a Dirichlet prior and a BN with Dim(G) independent parameters, as m → ∞:
log P(D | G) ≈ log P(D | θ̂_G, G) − (log m / 2) · Dim(G)
where θ̂_G is the maximum likelihood estimate of the parameters. The first term is the likelihood score, which prefers fully-connected graphs; the second term is a regularizer that penalizes edges.
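The approximation itself is trivial to compute once the MLE log-likelihood and the number of free parameters are known; a hedged sketch (the numbers are made up, and the log-likelihoods are assumed to be in nats to match the natural log in the penalty):

    import numpy as np

    def bic_score(loglik_mle, dim_g, m):
        """BIC approximation: log P(D | theta_hat, G) - (log m / 2) * Dim(G)."""
        return loglik_mle - 0.5 * np.log(m) * dim_g

    # Two hypothetical structures: the denser one fits slightly better but pays a larger penalty.
    m = 1000
    print(bic_score(loglik_mle=-5230.0, dim_g=12, m=m))   # sparser graph
    print(bic_score(loglik_mle=-5221.0, dim_g=40, m=m))   # denser graph scores lower here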

BIC approximation, a decomposable score
BIC(G : D) = log P(D | θ̂_G, G) − (log m / 2) · Dim(G)
Using the information-theoretic formulation:
BIC(G : D) = m Σ_i Î(Xi; Pa_Xi^G) − m Σ_i Ĥ(Xi) − (log m / 2) · Dim(G)
which again decomposes over families.

Consistency of the BIC and Bayesian scores
Consistency is a limiting behavior; it says nothing about finite sample sizes!!!
A scoring function is consistent if, for the true model G*, as m → ∞, with probability 1:
- G* maximizes the score
- all structures not I-equivalent to G* have strictly lower score
Theorem: the BIC score is consistent. Corollary: the Bayesian score (with a Dirichlet prior) is consistent.
What about the likelihood score? NO! True, the likelihood of the optimal structure is maximal, but the fully-connected graph (which is NOT I-equivalent) also maximizes the score!

Priors for General Graphs
For finite datasets, the prior is important!
Use a prior over structures satisfying structure modularity. What about the prior over parameters; how do we represent it?
K2 prior: fix an α and set P(θ_Xi|PaXi) = Dirichlet(α, ..., α).
K2 is "inconsistent": it can assign different scores to I-equivalent structures (see the next slides).

Priors for Parameters
G1: X → Y, with θ_X ~ Beta(1, 1), θ_Y|+x ~ Beta(1, 1), θ_Y|-x ~ Beta(1, 1).
Does this make sense? EffectiveSampleSize(θ_Y|+x) = 2, but the prior on θ_X amounts to only 1 fictitious example with +x??
G2: Y → X, an I-equivalent structure, with θ_Y ~ Beta(1, 1), θ_X|+y ~ Beta(1, 1), θ_X|-y ~ Beta(1, 1).
What happens after observing [+x, -y]? It should be the same for both!!

Priors for Parameters (continued)
G1 before: θ_X ~ Beta(1, 1), θ_Y|+x ~ Beta(1, 1), θ_Y|-x ~ Beta(1, 1).
G1 after [+x, -y]: θ_X ~ Beta(2, 1), θ_Y|+x ~ Beta(1, 2), θ_Y|-x ~ Beta(1, 1), so P1(+x) = 2/3.
G2 before: θ_Y ~ Beta(1, 1), θ_X|+y ~ Beta(1, 1), θ_X|-y ~ Beta(1, 1).
G2 after [+x, -y]: θ_Y ~ Beta(1, 2), θ_X|+y ~ Beta(1, 1), θ_X|-y ~ Beta(2, 1), so
P2(+x) = P2(+x, +y) + P2(+x, -y) = 1/3 · ½ + 2/3 · 2/3 = 11/18 ≠ 2/3!!!

BDe Priors
G1: X → Y, with θ_X ~ Beta(2, 2), θ_Y|+x ~ Beta(1, 1), θ_Y|-x ~ Beta(1, 1).
This makes more sense: EffectiveSampleSize(θ_Y|+x) = 2, and now 2 fictitious examples have +x.
G2: Y → X, an I-equivalent structure, with θ_Y ~ Beta(2, 2), θ_X|+y ~ Beta(1, 1), θ_X|-y ~ Beta(1, 1).
Now what happens after observing [+x, -y]?

BDe Priors (continued)
G1 before: θ_X ~ Beta(2, 2), θ_Y|+x ~ Beta(1, 1), θ_Y|-x ~ Beta(1, 1).
G1 after [+x, -y]: θ_X ~ Beta(3, 2), θ_Y|+x ~ Beta(1, 2), θ_Y|-x ~ Beta(1, 1), so P1(+x) = 3/5.
G2 before: θ_Y ~ Beta(2, 2), θ_X|+y ~ Beta(1, 1), θ_X|-y ~ Beta(1, 1).
G2 after [+x, -y]: θ_Y ~ Beta(2, 3), θ_X|+y ~ Beta(1, 1), θ_X|-y ~ Beta(2, 1), so
P2(+x) = P2(+x, +y) + P2(+x, -y) = 2/5 · ½ + 3/5 · 2/3 = 3/5, the same!!!
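The predictive probabilities on this slide and on the earlier "Priors for Parameters" slide are easy to verify numerically; a small sketch (the Beta-predictive probability of the first outcome is just a/(a+b), and the equivalent sample size of 4 for the BDe prior is my reading of the Beta(2,2)/Beta(1,1) choice above):

    def beta_mean(a, b):
        """Posterior-predictive probability of the first outcome under Beta(a, b)."""
        return a / (a + b)

    # Naive Beta(1,1)-everywhere prior, after the single example [+x, -y]:
    p1_naive = beta_mean(2, 1)                                  # G1 (X -> Y): P1(+x)
    p2_naive = beta_mean(1, 2) * beta_mean(1, 1) + (1 - beta_mean(1, 2)) * beta_mean(2, 1)
    print(p1_naive, p2_naive)   # 2/3 vs 11/18: the two I-equivalent graphs disagree

    # BDe prior (equivalent sample size 4, uniform prior BN), same single example:
    p1_bde = beta_mean(3, 2)                                    # G1 (X -> Y): P1(+x)
    p2_bde = beta_mean(2, 3) * beta_mean(1, 1) + (1 - beta_mean(2, 3)) * beta_mean(2, 1)
    print(p1_bde, p2_bde)       # 3/5 vs 3/5: the BDe prior keeps them consistent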

BDe Prior
View the Dirichlet parameters as fictitious samples: an equivalent sample size.
Pick a fictitious sample size m′. For each possible family, define a prior distribution P′(Xi, Pa_Xi); represent it with a BN, usually one with independent variables (a product of marginals).

Score Equivalence
If G and G′ are I-equivalent, then they have the same score.
Theorem 1: the maximum likelihood score and the BIC score satisfy score equivalence.
Theorem 2: if P(G) assigns the same prior to I-equivalent structures (e.g., by edge counting) and each parameter prior is Dirichlet, then the Bayesian score satisfies score equivalence if and only if the prior over parameters is representable as a BDe prior!!!

Chow-Liu for the Bayesian score
The edge weight w_{Xj→Xi} is the advantage of adding Xj as a parent of Xi.
We now have a directed graph, so we need a directed spanning forest. Note that adding an edge can hurt the Bayesian score, so we choose a forest, not a tree.
But if the score satisfies score equivalence, then w_{Xj→Xi} = w_{Xi→Xj}, and a simple maximum spanning forest algorithm works.

Structure learning for general graphs
In a tree, a node has at most one parent.
Theorem: the problem of learning a BN structure with at most d parents is NP-hard for any (fixed) d ≥ 2.
Most structure learning approaches therefore use heuristics that exploit score decomposition. We (quickly) describe two heuristics that exploit decomposition in different ways.

Understanding Score Decomposition
[Figure: the student network, with variables Coherence, Difficulty, Intelligence, Grade, SAT, Letter, Happy, Job.]

Fixed variable order 1
Pick a variable order, e.g., X1, ..., Xn. Xi can only pick parents in {X1, ..., Xi-1} (any subset), so acyclicity is guaranteed!
Total score = sum of the scores of each node.

Fixed variable order 2
Fix the maximum number of parents to k. For each i in the order, pick Pa_Xi ⊆ {X1, ..., Xi-1} by exhaustively searching through all possible subsets: Pa_Xi is the U ⊆ {X1, ..., Xi-1} that maximizes FamScore(Xi | U : D).
This gives the optimal BN for each order!!! (A code sketch follows.)
Then do a greedy search through the space of orders, e.g., by switching pairs of variables in the order. If neighboring variables in the order are switched, only the scores for this pair need to be recomputed: an O(n) speed-up per iteration. Example: F A S H N → F S A H N.
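A sketch of the exhaustive parent-selection step for a fixed order with at most k parents; fam_score stands in for any decomposable family score (e.g., the MLE or BIC family score above).

    from itertools import combinations

    def best_parents_for_order(order, k, fam_score):
        """For a fixed variable order, pick each node's best parent set among its predecessors.
        fam_score(child, parent_tuple) -> float is any decomposable family score."""
        structure = {}
        for pos, child in enumerate(order):
            predecessors = order[:pos]
            candidates = [subset
                          for size in range(min(k, len(predecessors)) + 1)
                          for subset in combinations(predecessors, size)]
            # Acyclicity is free: parents always precede the child in the order.
            structure[child] = max(candidates, key=lambda pa: fam_score(child, pa))
        return structure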

Learn BN structure using local search
Start from the Chow-Liu tree. Local search with possible moves (only if the result stays acyclic!!!):
- add an edge
- delete an edge
- invert (reverse) an edge
Select moves using your favorite score, computed locally (efficiently) thanks to score decomposition into FamScores.

Exploit score decomposition in local search
[Figure: the student network again.]
Add edge: re-score only one family! Delete edge: re-score only one family! Reverse edge: re-score only two families. (A hill-climbing sketch follows.)
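A minimal hill-climbing sketch over these three move types; score(graph) is a placeholder for any decomposable score, and a real implementation would re-score only the affected families instead of the whole graph, as the slide points out.

    import networkx as nx
    from itertools import permutations

    def hill_climb(nodes, score, max_iters=100):
        """Greedy local search over DAGs with add / delete / reverse edge moves."""
        g = nx.DiGraph()
        g.add_nodes_from(nodes)               # could also start from the Chow-Liu tree
        for _ in range(max_iters):
            best_gain, best_graph = 0.0, None
            for u, v in permutations(nodes, 2):
                for candidate in edge_moves(g, u, v):
                    if nx.is_directed_acyclic_graph(candidate):
                        gain = score(candidate) - score(g)
                        if gain > best_gain:
                            best_gain, best_graph = gain, candidate
            if best_graph is None:            # no improving move: local optimum
                return g
            g = best_graph
        return g

    def edge_moves(g, u, v):
        """Graphs reachable by adding, deleting, or reversing the edge u -> v."""
        if g.has_edge(u, v):
            deleted = g.copy(); deleted.remove_edge(u, v)
            reversed_ = g.copy(); reversed_.remove_edge(u, v); reversed_.add_edge(v, u)
            return [deleted, reversed_]
        added = g.copy(); added.add_edge(u, v)
        return [added]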

Some experiments: the Alarm network.

Order search versus graph search
Order search advantages:
- for a fixed order, we get the optimal BN: a more global optimization
- the space of orders (n!) is much smaller than the space of graphs, Ω(2^(n²))
Graph search advantages:
- not restricted to k parents, especially if exploiting CPD structure such as CSI
- cheaper per iteration
- finer moves within a graph

Bayesian Model Averaging
So far, we have selected a single structure. But if you are really Bayesian, you must average over structures (similar to averaging over parameters), weighting each graph by P(G | D).
Inference for structure averaging is very hard!!! There are clever tricks in the K&F text.

Summary: Learning BN Structure
- Decomposable scores: data likelihood, information-theoretic interpretation, Bayesian, BIC approximation
- Priors: structure and parameter assumptions; BDe if and only if score equivalence
- Best tree (Chow-Liu), best TAN, nearly best k-treewidth (in O(N^(k+1)))
- Search techniques: search through orders, search through structures
- Bayesian model averaging