Learning Bayesian Networks

Learning Bayesian Networks (V-1)
Probabilistic Models, Spring 2011, Petri Myllymäki, University of Helsinki

Aspects in learning (V-2)
Learning the parameters of a Bayesian network: marginalizing over all parameters is equivalent to choosing the expected parameters.
Learning the structure of a Bayesian network: marginalizing over the structures is not computationally feasible, so we resort to model selection.

A Bayesian network (V-3)
Structure: Cloudy -> Sprinkler, Cloudy -> Rain, Sprinkler -> WetGrass, Rain -> WetGrass.
P(Cloudy): Cloudy=no 0.5, Cloudy=yes 0.5
P(Sprinkler | Cloudy):
  Cloudy=no:  Sprinkler=on 0.5, Sprinkler=off 0.5
  Cloudy=yes: Sprinkler=on 0.9, Sprinkler=off 0.1
P(Rain | Cloudy):
  Cloudy=no:  Rain=yes 0.2, Rain=no 0.8
  Cloudy=yes: Rain=yes 0.8, Rain=no 0.2
P(WetGrass | Sprinkler, Rain):
  Sprinkler=on,  Rain=no:  WetGrass=yes 0.90, WetGrass=no 0.10
  Sprinkler=on,  Rain=yes: WetGrass=yes 0.99, WetGrass=no 0.01
  Sprinkler=off, Rain=no:  WetGrass=yes 0.01, WetGrass=no 0.99
  Sprinkler=off, Rain=yes: WetGrass=yes 0.90, WetGrass=no 0.10
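To make this concrete, here is a minimal Python sketch of the same CPTs and the joint probability they define; the dictionary encoding is just one possible choice, not something prescribed by the slides.

```python
# Minimal sketch: the sprinkler network's CPTs as plain Python dictionaries,
# and the joint probability of one full assignment computed as the product
# of the local conditional probabilities.

p_cloudy = {"no": 0.5, "yes": 0.5}
p_sprinkler = {  # keyed by Cloudy
    "no":  {"on": 0.5, "off": 0.5},
    "yes": {"on": 0.9, "off": 0.1},
}
p_rain = {  # keyed by Cloudy
    "no":  {"yes": 0.2, "no": 0.8},
    "yes": {"yes": 0.8, "no": 0.2},
}
p_wetgrass = {  # keyed by (Sprinkler, Rain)
    ("on", "no"):   {"yes": 0.90, "no": 0.10},
    ("on", "yes"):  {"yes": 0.99, "no": 0.01},
    ("off", "no"):  {"yes": 0.01, "no": 0.99},
    ("off", "yes"): {"yes": 0.90, "no": 0.10},
}

def joint(cloudy, sprinkler, rain, wetgrass):
    """P(C, S, R, W) = P(C) P(S|C) P(R|C) P(W|S,R)."""
    return (p_cloudy[cloudy]
            * p_sprinkler[cloudy][sprinkler]
            * p_rain[cloudy][rain]
            * p_wetgrass[(sprinkler, rain)][wetgrass])

print(joint("yes", "off", "yes", "yes"))  # 0.5 * 0.1 * 0.8 * 0.90 = 0.036
```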

Learning the parameters (V-4)
Given the data D, how should I fill in the conditional probability tables? The Bayesian answer: you should not. If you do not know the parameters, you have a prior and a posterior distribution over them. There are many parameters, but again independence comes to the rescue. Once you have a distribution over the parameters, you can predict by model averaging, very much as in the Bernoulli case.

A Bayesian network revisited (V-5)
The same network with the conditional probability tables written as parameter vectors:
Θ_C:             Cloudy=no 0.5, Cloudy=yes 0.5
Θ_S|C=no:        Sprinkler=on 0.5, Sprinkler=off 0.5
Θ_S|C=yes:       Sprinkler=on 0.9, Sprinkler=off 0.1
Θ_R|C=no:        Rain=yes 0.2, Rain=no 0.8
Θ_R|C=yes:       Rain=yes 0.8, Rain=no 0.2
Θ_W|S=on,R=no:   WetGrass=yes 0.90, WetGrass=no 0.10
Θ_W|S=on,R=yes:  WetGrass=yes 0.99, WetGrass=no 0.01
Θ_W|S=off,R=no:  WetGrass=yes 0.01, WetGrass=no 0.99
Θ_W|S=off,R=yes: WetGrass=yes 0.90, WetGrass=no 0.10

A Bayesian network as a generative model (V-6)
[Figure: the network augmented with parameter nodes Θ_C, Θ_S|C, Θ_R|C and Θ_W|S,R attached to Cloudy, Sprinkler, Rain and WetGrass.]

A Bayesian network as a generative model (V-7)
[Figure: the same model with one parameter node per parent configuration: Θ_S|C=yes, Θ_S|C=no, Θ_R|C=yes, Θ_R|C=no, and Θ_W|S=on,R=no, Θ_W|S=on,R=yes, Θ_W|S=off,R=no, Θ_W|S=off,R=yes.]

A Bayesian network as a generative model (V-8)
Parameters are independent a priori:
$P(\Theta) = \prod_{i=1}^{n} P(\Theta_i) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} P(\Theta_{ij})$, where $P(\Theta_{ij}) = \mathrm{Dir}(\alpha_{ij1}, \ldots, \alpha_{ij r_i})$.
[Figure: the generative model with a separate parameter node for each parent configuration, as on the previous slide.]
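A minimal sketch of what the independent Dirichlet priors mean in practice: one Dirichlet draw per (variable, parent configuration) row of the CPTs, with an assumed symmetric hyperparameter value (the slides do not fix one here).

```python
# Minimal sketch (the hyperparameter value is an assumption): drawing one set
# of network parameters from independent Dirichlet priors, one Dirichlet per
# (variable, parent configuration) pair, as in P(Theta) = prod_ij Dir(alpha_ij).
import numpy as np

rng = np.random.default_rng(0)

# r[i] = number of values of variable i, q[i] = number of parent configurations.
r = {"Cloudy": 2, "Sprinkler": 2, "Rain": 2, "WetGrass": 2}
q = {"Cloudy": 1, "Sprinkler": 2, "Rain": 2, "WetGrass": 4}

alpha = 1.0  # symmetric Dirichlet hyperparameter (illustrative choice)

theta = {}
for var in r:
    # One CPT row per parent configuration j, drawn independently.
    theta[var] = np.stack([rng.dirichlet(alpha * np.ones(r[var]))
                           for _ in range(q[var])])

print(theta["WetGrass"])  # a 4x2 table: one row per (S, R) configuration, rows sum to 1
```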

Generating a data set (V-9)
[Figure: hyperparameters α_C, α_R|C=yes, α_R|C=no generate parameters Θ_C, Θ_R|C=yes, Θ_R|C=no, which in turn generate the data vectors (Cloudy_j, Rain_j) for j = 1, ..., N; the same model is also drawn in plate notation with a plate over the N data vectors.]
Example data D:
  d_1: Cloudy=yes, Rain=no
  d_2: Cloudy=no,  Rain=yes
  ...
  d_N: Cloudy=no,  Rain=yes
The data vectors are i.i.d. given the parameters, aren't they.

Likelihood P(D | Θ, G) (V-10)
For one data vector it was
$P(x_1, x_2, \ldots, x_n \mid G, \Theta) = \prod_{i=1}^{n} P(x_i \mid pa_G(x_i))$, or
$P(d_1 \mid G, \Theta) = \prod_{i=1}^{n} \theta_{d_{1i} \mid pa_{1i}}$,
where $d_{1i}$ and $pa_{1i}$ are the value and the parent configuration of variable $i$ in data vector $d_1$.
For the whole data set,
$P(d_1, d_2, \ldots, d_N \mid G, \Theta) = \prod_{j=1}^{N} \prod_{i=1}^{n} \theta_{d_{ji} \mid pa_{ji}} = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}$,
where $N_{ijk}$ is the number of data vectors with parent configuration $j$ in which variable $i$ has value $k$, and $r_i$ and $q_i$ are the numbers of values and parent configurations of variable $i$.

Bayesian network learning (V-11)
Collect one count table N_i per variable, with one row per parent configuration and one column per value; initially all counts are zero.
N_C (q_C = 1, r_C = 2):      Cloudy=no 0, Cloudy=yes 0
N_S|C (q_S = 2, r_S = 2):    C=no: Sprinkler=on 0, off 0;  C=yes: on 0, off 0
N_R|C (q_R = 2, r_R = 2):    C=no: Rain=yes 0, no 0;  C=yes: yes 0, no 0
N_W|S,R (q_W = 4, r_W = 2):  S=on,R=no: WetGrass=yes 0, no 0;  S=on,R=yes: 0, 0;  S=off,R=no: 0, 0;  S=off,R=yes: 0, 0
$P(D \mid G, \Theta) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}$
Here i picks the variable (table), j picks the row (parent configuration) and k picks the column (value).

Bayesian network learning (V-12)
After observing (C, S, R, W) = [(no, on, yes, yes), (no, on, no, no)] the count tables are:
N_C:      Cloudy=no 2, Cloudy=yes 0
N_S|C:    C=no: Sprinkler=on 2, off 0;  C=yes: on 0, off 0
N_R|C:    C=no: Rain=yes 1, no 1;  C=yes: yes 0, no 0
N_W|S,R:  S=on,R=no: WetGrass=yes 0, no 1;  S=on,R=yes: yes 1, no 0;  S=off,R=no: 0, 0;  S=off,R=yes: 0, 0
$P(D \mid G, \Theta) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}$
Again i picks the table, j the row (q_i rows in table i) and k the column (r_i columns in table i).
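A small sketch of how the counts N_ijk can be collected from data; the variable names and the two data vectors are the ones used on this slide, while the dictionary layout is just one convenient encoding.

```python
# Minimal sketch: computing the sufficient statistics N_ijk for the sprinkler
# network by counting, for each variable, how often each value occurs under
# each parent configuration.
from collections import defaultdict

# Parent sets of the assumed structure G.
parents = {"Cloudy": (), "Sprinkler": ("Cloudy",),
           "Rain": ("Cloudy",), "WetGrass": ("Sprinkler", "Rain")}

# The two data vectors of slide V-12.
data = [
    {"Cloudy": "no", "Sprinkler": "on", "Rain": "yes", "WetGrass": "yes"},
    {"Cloudy": "no", "Sprinkler": "on", "Rain": "no",  "WetGrass": "no"},
]

# counts[variable][(parent configuration j, value k)] = N_ijk
counts = {var: defaultdict(int) for var in parents}
for d in data:
    for var, pa in parents.items():
        j = tuple(d[p] for p in pa)   # parent configuration
        k = d[var]                    # value of the variable
        counts[var][(j, k)] += 1

print(dict(counts["WetGrass"]))
# {(('on', 'yes'), 'yes'): 1, (('on', 'no'), 'no'): 1}
```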

Bayesian network learning after a while (20 data vectors) (V-13)
N_C:      Cloudy=no 16, Cloudy=yes 4   (total 20)
N_S|C:    C=no: Sprinkler=on 10, off 6 (= 16);  C=yes: on 1, off 3 (= 4);  column totals: on 11, off 9
N_R|C:    C=no: Rain=yes 3, no 13 (= 16);  C=yes: yes 4, no 0 (= 4);  column totals: yes 7, no 13
N_W|S,R:  S=on,R=no: WetGrass=yes 2, no 3 (= 5);  S=on,R=yes: yes 1, no 5 (= 6);  S=off,R=no: yes 6, no 2 (= 8);  S=off,R=yes: yes 0, no 1 (= 1)
$P(D \mid G, \Theta) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}$
As before, i picks the table, j the row (q_i rows in table i) and k the column (r_i columns in table i).

Maximum likelihood (V-14)
Since the parameters occur in separate factors of the likelihood
$P(D \mid G, \Theta) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}$,
we can maximize the factors independently:
$\hat{\theta}_{ijk} = \frac{N_{ijk}}{\sum_{k'=1}^{r_i} N_{ijk'}}$.
So you simply normalize the rows of the sufficient-statistics tables to get the ML parameters. But these parameters may contain zero probabilities, which is bad for prediction; hear the Bayesian call...
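A minimal sketch of the row normalization, applied to the WetGrass counts of slide V-13; note the zero probability that appears.

```python
# Minimal sketch: maximum-likelihood parameters obtained by normalizing each
# row of a sufficient-statistics table (here the 20-vector counts of slide V-13).
import numpy as np

# N_W|S,R from slide V-13: one row per parent configuration (S, R),
# columns ordered (WetGrass=yes, WetGrass=no).
N_W = np.array([[2, 3],   # S=on,  R=no
                [1, 5],   # S=on,  R=yes
                [6, 2],   # S=off, R=no
                [0, 1]])  # S=off, R=yes

theta_ml = N_W / N_W.sum(axis=1, keepdims=True)
print(theta_ml)
# Note the zero in the last row: P(WetGrass=yes | S=off, R=yes) = 0,
# which is exactly the problem the Bayesian treatment avoids.
```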

Learning the parameters, again (V-15)
Given the data D, how should I fill in the conditional probability tables? The Bayesian answer: you should not. If you do not know the parameters, you have a prior and a posterior distribution over them. There are many parameters, but again independence comes to the rescue. Once you have a distribution over the parameters, you can predict by model averaging, very much as in the Bernoulli case.

Prior × Likelihood (V-16)
A priori the parameters are independently Dirichlet:
$P(\Theta \mid \alpha) = \prod_{i=1}^{n} P(\Theta_i \mid \alpha) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} P(\Theta_{ij} \mid \alpha) \propto \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{\alpha_{ijk} - 1}$.
The likelihood is compatible with this conjugate prior:
$P(D \mid G, \Theta) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}$.
This yields a simple posterior:
$P(\Theta \mid D, \alpha) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} P(\Theta_{ij} \mid N_{ij}, \alpha_{ij})$,
where $P(\Theta_{ij} \mid N_{ij}, \alpha_{ij}) = \mathrm{Dir}(N_{ij1} + \alpha_{ij1}, \ldots, N_{ij r_i} + \alpha_{ij r_i})$.

Predictive distribution P(d | D, α, G) (V-17)
Posterior:
$P(\Theta \mid D, \alpha) \propto \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk} + \alpha_{ijk} - 1}$.
Predictive distribution for a new vector d:
$P(d \mid D, \alpha, G) = \int P(d \mid \Theta, D, \alpha)\, P(\Theta \mid D, \alpha)\, d\Theta = \prod_{i=1}^{n} \int \theta_{i\,pa_i(d)\,d_i}\, P(\theta_{i\,pa_i(d)} \mid N_{i\,pa_i(d)}, \alpha_{i\,pa_i(d)})\, d\theta_{i\,pa_i(d)} = \prod_{i=1}^{n} \frac{N_{i\,pa_i(d)\,d_i} + \alpha_{i\,pa_i(d)\,d_i}}{\sum_{k=1}^{r_i} \left( N_{i\,pa_i(d)\,k} + \alpha_{i\,pa_i(d)\,k} \right)}$.

Predictive distribution (V-18)
This means that the predictive distribution
$P(d \mid D, \alpha, G) = \prod_{i=1}^{n} \frac{N_{i\,pa_i(d)\,d_i} + \alpha_{i\,pa_i(d)\,d_i}}{\sum_{k=1}^{r_i} \left( N_{i\,pa_i(d)\,k} + \alpha_{i\,pa_i(d)\,k} \right)}$
can be achieved by just setting
$\tilde{\theta}_{ijk} = \frac{N_{ijk} + \alpha_{ijk}}{N_{ij} + \alpha_{ij}}$, where $N_{ij} = \sum_k N_{ijk}$ and $\alpha_{ij} = \sum_k \alpha_{ijk}$.
So just gather the counts N_ijk, add α_ijk to them, and normalize.
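A minimal sketch of the "add α and normalize" recipe, with an assumed α_ijk = 1 (the slides do not fix a value at this point).

```python
# Minimal sketch (the alpha values are an assumption): the predictive parameters
# obtained by adding the pseudo-counts alpha_ijk to the counts N_ijk and
# normalizing each row, as on slide V-18.
import numpy as np

N_W = np.array([[2, 3],   # S=on,  R=no
                [1, 5],   # S=on,  R=yes
                [6, 2],   # S=off, R=no
                [0, 1]])  # S=off, R=yes
alpha_W = np.ones_like(N_W, dtype=float)  # e.g. alpha_ijk = 1 (illustrative choice)

theta_pred = (N_W + alpha_W) / (N_W + alpha_W).sum(axis=1, keepdims=True)
print(theta_pred)
# The zero count in the last row now gives P(WetGrass=yes | S=off, R=yes) = 1/3
# instead of 0, so the predictive distribution never rules any value out.
```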

Being uncertain about the Bayesian network structure (V-19)
The Bayesian says again: if you do not know it, you should have a prior and a posterior distribution over it:
$P(G \mid D) = \frac{P(D \mid G)\, P(G)}{P(D)}$.
The likelihood P(D | G) is called the marginal likelihood, and under certain assumptions it can be computed in closed form. The normalizer P(D) we can simply ignore.

Prediction over model structures (V-20)
$P(X \mid D) = \sum_{M} P(X, M \mid D) = \sum_{M} P(X \mid M, D)\, P(M \mid D)$,
where $P(X \mid M, D) = \int P(X \mid M, \theta)\, P(\theta \mid M, D)\, d\theta$ and $P(M \mid D) \propto P(D \mid M)\, P(M) = P(M) \int P(D \mid M, \theta)\, P(\theta \mid M)\, d\theta$.
This summation is not feasible, as it runs over a super-exponential number of model structures. Unlike with the parameters, it does NOT reduce to using a single expected model structure. Typically one uses only one (or a few) models with high posterior probability P(M | D).

Averaging over an equivalence class (V-21)
This boils down to using a single model (assuming a uniform prior over the models within the equivalence class E):
$P(X \mid E) = \sum_{M \in E} P(X \mid M, E)\, P(M \mid E) = |E|\, P(X \mid M)\, \frac{1}{|E|} = P(X \mid M)$,
where we use the fact that every structure M in the equivalence class gives the same predictive distribution P(X | M).

Model Selection (V-22)
Problem: the number of possible structures for a given domain is more than exponential in the number of variables.
Solution: use only one or a handful of good models.
Necessary components:
- a scoring method (what is good?)
- a search method (how do we find good models?)

Learning the structure: scoring (V-23)

Good models? (V-24)
In marginalization (summation, model averaging) over all the model structures, the predictions are weighted by P(M | D), the posterior of each model given the data. If we have to select one model (or a few), it sounds reasonable to pick the model(s) with the largest weights P(M | D) = P(D | M) P(M) / P(D).
The relevant components are the structure prior P(M) and the marginal likelihood (the "evidence") P(D | M).

How to set the structure prior P(M)? (V-25)
- The standard solution: use the uniform prior (i.e., ignore the structure prior).
- Sometimes suggested: let P(M) depend on the number of arcs so that simpler models are more probable. Justification???
- Uniform over the equivalence classes? Proportional to the size of the equivalence class?
- What about nestedness (full networks contain all the other networks)?
- ...still very much an open issue.

Marginal likelihood P(D | G, α) (V-26)
$P(D \mid G, \alpha) = P(d_1 \mid G, \alpha)\, P(d_2 \mid d_1, G, \alpha) \cdots P(d_N \mid d_1, \ldots, d_{N-1}, G, \alpha) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(N_{ij} + \alpha_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(N_{ijk} + \alpha_{ijk})}{\Gamma(\alpha_{ijk})}$.
For example, after the two data vectors of slide V-12 the counts are:
N_C: Cloudy=no 2, yes 0;  N_S|C=no: Sprinkler=on 2, off 0;  N_S|C=yes: 0, 0;  N_R|C=no: Rain=yes 1, no 1;  N_R|C=yes: 0, 0;  N_W|S=on,R=no: WetGrass=yes 0, no 1;  N_W|S=on,R=yes: yes 1, no 0; the remaining WetGrass rows are 0, 0.
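A minimal sketch of the gamma-function formula in code, using log-gamma values for numerical stability; the α_ijk = 1 pseudo-counts are an illustrative assumption, not something prescribed by the slide.

```python
# Minimal sketch: the log marginal likelihood log P(D | G, alpha) computed from
# sufficient statistics with the Dirichlet-multinomial (gamma) formula of slide V-26.
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(count_tables, alpha_tables):
    """count_tables / alpha_tables: lists of (q_i x r_i) arrays, one per variable."""
    total = 0.0
    for N, a in zip(count_tables, alpha_tables):
        N_ij, a_ij = N.sum(axis=1), a.sum(axis=1)        # row sums N_ij and alpha_ij
        total += np.sum(gammaln(a_ij) - gammaln(N_ij + a_ij))
        total += np.sum(gammaln(N + a) - gammaln(a))
    return total

# Counts after the two data vectors of slide V-12 (rows = parent configurations).
counts = [np.array([[2, 0]]),                          # N_C
          np.array([[2, 0], [0, 0]]),                  # N_S|C
          np.array([[1, 1], [0, 0]]),                  # N_R|C
          np.array([[0, 1], [1, 0], [0, 0], [0, 0]])]  # N_W|S,R
alphas = [np.ones_like(N, dtype=float) for N in counts]  # alpha_ijk = 1 (assumption)

print(log_marginal_likelihood(counts, alphas))
```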

Computing the marginal likelihood (V-27)
Two choices:
1. Calculate the sufficient statistics N_ijk and compute P(D | M) directly using the (gamma) formula on the previous slide.
2. Use the chain rule and compute P(d_1, ..., d_N | M) = P(d_1 | M) P(d_2 | d_1, M) ... P(d_N | d_1, ..., d_{N-1}, M) by applying the predictive distribution (slide V-18) iteratively.
Note that the latter can be done in any order, and the result will be the same (remember Exercise 2?).

How to set the hyperparameters α? (V-28)
Assuming a multinomial sample, independent parameters, modular parameters, complete data, and likelihood equivalence implies a certain parameter prior: BDe ("Bayesian Dirichlet with likelihood equivalence").

BDeu (V-29)
Likelihood equivalence: two Markov-equivalent model structures produce the same predictive distribution. It also means that P(D | M) = P(D | M') if M and M' are equivalent.
Let $\alpha_i = \sum_j \alpha_{ij}$, where $\alpha_{ij} = \sum_k \alpha_{ijk}$. BDe means that $\alpha_i = \alpha$ for all i, where α is the equivalent sample size.
An important special case is BDeu ("u" for uniform): $\alpha_{ijk} = \frac{\alpha}{q_i r_i}$, $\alpha_{ij} = \frac{\alpha}{q_i}$.
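A small sketch of how the BDeu pseudo-count tables can be constructed for a chosen equivalent sample size and then plugged into the marginal-likelihood sketch given after slide V-26.

```python
# Minimal sketch: BDeu hyperparameter tables alpha_ijk = alpha / (q_i * r_i)
# for a chosen equivalent sample size alpha.
import numpy as np

def bdeu_alphas(count_tables, equivalent_sample_size=1.0):
    """One (q_i x r_i) table of identical pseudo-counts alpha/(q_i*r_i) per variable."""
    tables = []
    for N in count_tables:
        q_i, r_i = N.shape
        tables.append(np.full((q_i, r_i), equivalent_sample_size / (q_i * r_i)))
    return tables

counts = [np.array([[2, 0]]),
          np.array([[2, 0], [0, 0]]),
          np.array([[1, 1], [0, 0]]),
          np.array([[0, 1], [1, 0], [0, 0], [0, 0]])]
print(bdeu_alphas(counts, equivalent_sample_size=1.0)[3])  # 4x2 table of 0.125
```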

Model selection in the Bernoulli case (V-30)
Toss a coin 250 times and observe D: 140 heads and 110 tails.
Hypothesis H_0: the coin is fair (P(θ = 0.5) = 1).
Hypothesis H_1: the coin is biased.
Classical statistics: the P-value is 7%; suspicious, but not enough to reject the null hypothesis (Dr. Barry Blight, The Guardian, January 4, 2002).
Bayes: assume a prior, e.g. θ ~ Beta(a, a) under H_1, and compute the Bayes factor
$\frac{P(D \mid H_1)}{P(D \mid H_0)} = \frac{\int P(D \mid \theta, H_1, a)\, P(\theta \mid H_1, a)\, d\theta}{(1/2)^{250}}$.
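A minimal sketch of this Bayes factor. The closed form B(140+a, 110+a)/B(a, a) for the numerator follows from Beta-Bernoulli conjugacy; the particular values of a below are illustrative choices, not taken from the slides.

```python
# Minimal sketch: the Bayes factor P(D|H1)/P(D|H0) for 140 heads and 110 tails,
# with a Beta(a, a) prior on the bias under H1. The integral over theta has the
# closed form B(140+a, 110+a) / B(a, a); P(D|H0) = (1/2)^250 for the sequence.
import numpy as np
from scipy.special import betaln

def bayes_factor(heads, tails, a):
    log_m1 = betaln(heads + a, tails + a) - betaln(a, a)  # log P(D | H1, a)
    log_m0 = (heads + tails) * np.log(0.5)                # log P(D | H0)
    return np.exp(log_m1 - log_m0)

for a in [1.0, 10.0, 100.0]:  # different prior strengths, i.e. equivalent sample sizes 2a
    print(a, bayes_factor(140, 110, a))
```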

Equivalent sample size and the Bayes factor (V-31)
[Figure: the Bayes factor in favor of H_1 (vertical axis, roughly 0 to 2) as a function of the equivalent sample size (horizontal axis, logarithmic scale from about 0.37 to 1096).]

A slightly modified example (V-32)
Toss a coin 250 times and observe D: 141 heads and 109 tails.
Hypothesis H_0: the coin is fair (P(θ = 0.5) = 1).
Hypothesis H_1: the coin is biased.
Classical statistics: the P-value is 4.97%, so the null hypothesis is rejected at the 5% significance level.
Bayes: assume a prior, e.g. θ ~ Beta(a, a) under H_1, and compute the Bayes factor
$\frac{P(D \mid H_1)}{P(D \mid H_0)} = \frac{\int P(D \mid \theta, H_1, a)\, P(\theta \mid H_1, a)\, d\theta}{(1/2)^{250}}$.

Equivalent sample size and the Bayes factor (modified example) (V-33)
[Figure: as on slide V-31, the Bayes factor in favor of H_1 (vertical axis, roughly 0 to 2.5) as a function of the equivalent sample size (horizontal axis, logarithmic scale from about 0.37 to 1096).]

Lessons learned (V-34)
- Classical statistics and the Bayesian approach may give contradictory results.
- Using a fixed P-value threshold is problematic, as any null hypothesis can be rejected given a sufficient amount of data.
- The Bayesian approach compares models and does not aim at an absolute estimate of the goodness of the models.
- Bayesian model selection depends heavily on the priors selected. However, the process is completely transparent, and suspicious results can be criticized based on the selected priors. Moreover, the impact of the prior can easily be controlled with respect to the amount of available data.
- The issue of determining non-informative priors is controversial: reference priors, normalized maximum likelihood & MDL (see www.mdl-research.org).

On Bayes factor and Occam's razor (V-35)
The marginal likelihood (the "evidence") P(D | H) is a probability distribution (or density) over all the possible data sets D. Complex models can predict many different data sets well, so they have to spread their probability mass over a wide region of the data space; a simpler model that concentrates its mass near the observed data set therefore gets the higher marginal likelihood.

Hyperparameters in more complex cases (V-36)
Bad news: the BDeu score seems to be quite sensitive to the equivalent sample size (Silander & Myllymäki, UAI 2007).

So which prior to use? (V-37)
An open issue. One solution: use the prior-free normalized maximum likelihood approach. A more Bayesian solution: use the Jeffreys prior; it can be formulated in the Bayesian network framework (Kontkanen et al., 2000), but nobody has produced software for computing it in practice (a good topic for your thesis!).
B-Course uses $\alpha = \frac{1}{2n} \sum_{i=1}^{n} r_i$.

Learning the structure: search (V-38)

Learning the structure when each node has at most one parent (V-39)
The BD score is decomposable:
$\max_M P(D \mid M) = \max_M \prod_i P(X_i[D] \mid Pa_i^M[D]) \;\Leftrightarrow\; \min_M \sum_i f_D(X_i, Pa_i^M)$,
where $f_D(X_i, Pa_i^M) = -\log P(X_i[D] \mid Pa_i^M[D])$.
For trees (or forests) one can therefore use the minimum spanning tree algorithm (see Chow & Liu, 1968).
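A sketch of a Chow-Liu-style tree learner. It weights each edge by the empirical mutual information of the two variables and builds a maximum-weight spanning tree with Kruskal's algorithm, rather than using the exact f_D scores above; the synthetic data at the end is purely illustrative.

```python
# Minimal sketch (not the exact scoring of the slides): a Chow-Liu-style tree
# learner. Edges are weighted by empirical mutual information and a
# maximum-weight spanning tree is built with Kruskal's algorithm.
import numpy as np
from itertools import combinations

def mutual_information(x, y):
    """Empirical mutual information of two discrete columns (numpy arrays)."""
    mi = 0.0
    for vx in np.unique(x):
        for vy in np.unique(y):
            pxy = np.mean((x == vx) & (y == vy))
            px, py = np.mean(x == vx), np.mean(y == vy)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

def chow_liu_tree(data):
    """data: (N x n) integer array. Returns the edges of a maximum-weight spanning tree."""
    n_vars = data.shape[1]
    edges = sorted(((mutual_information(data[:, i], data[:, j]), i, j)
                    for i, j in combinations(range(n_vars), 2)), reverse=True)
    parent = list(range(n_vars))          # union-find forest

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    tree = []
    for w, i, j in edges:                 # Kruskal: add heaviest edges that keep a forest
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            tree.append((i, j))
    return tree

# Illustrative data: variable 1 depends on variable 0, variable 2 is independent noise.
rng = np.random.default_rng(1)
cloudy = rng.integers(0, 2, 500)
rain = (rng.random(500) < np.where(cloudy == 1, 0.8, 0.2)).astype(int)
print(chow_liu_tree(np.column_stack([cloudy, rain, rng.integers(0, 2, 500)])))
```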

The General Case (V-40)
Finding the best structure is NP-hard if the maximum number of parents is greater than 1 (Chickering). New dynamic programming solutions work up to roughly 30 variables (Silander & Myllymäki, UAI 2006). Heuristics:
- greedy bottom-up / top-down search
- stochastic greedy search (with restarts)
- simulated annealing and other Monte Carlo approaches

Local Search (V-41)
1. Initialize the structure (prior structure, empty graph, maximum spanning tree, or random graph).
2. Score all possible single changes.
3. Perform the best change.
4. If some change improved the score, go back to step 2; otherwise return the saved structure.
Extension: multiple restarts. A sketch of this greedy procedure follows below.
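A sketch of the greedy procedure; the score function is left abstract, and the toy score at the end is purely illustrative, standing in for a decomposable score such as BDeu.

```python
# Minimal sketch: greedy local search over DAGs. Neighbours are produced by
# adding, removing, or reversing a single arc; the best-scoring acyclic
# neighbour is accepted until no change improves the score.
from itertools import permutations

def is_acyclic(nodes, arcs):
    """DFS check that the arc set (a set of (parent, child) pairs) has no directed cycle."""
    children = {v: [w for (u, w) in arcs if u == v] for v in nodes}
    state = {v: 0 for v in nodes}          # 0 = unvisited, 1 = on the stack, 2 = done
    def visit(v):
        if state[v] == 1:
            return False                   # back edge found: a cycle
        if state[v] == 2:
            return True
        state[v] = 1
        ok = all(visit(w) for w in children[v])
        state[v] = 2
        return ok
    return all(visit(v) for v in nodes)

def neighbours(nodes, arcs):
    """All structures reachable by adding, removing, or reversing one arc."""
    for u, v in permutations(nodes, 2):
        if (u, v) in arcs:
            yield arcs - {(u, v)}                      # remove the arc u -> v
            yield (arcs - {(u, v)}) | {(v, u)}         # reverse it
        elif (v, u) not in arcs:
            yield arcs | {(u, v)}                      # add the arc u -> v

def local_search(nodes, score, start=frozenset()):
    current, current_score = frozenset(start), score(frozenset(start))
    while True:
        best, best_score = None, current_score
        for cand in neighbours(nodes, current):
            cand = frozenset(cand)
            if is_acyclic(nodes, cand) and score(cand) > best_score:
                best, best_score = cand, score(cand)
        if best is None:
            return current, current_score              # no single change improves the score
        current, current_score = best, best_score

# A toy score, purely illustrative: reward the arc Cloudy -> Rain, penalize arcs.
def toy_score(arcs):
    return 2.0 * (("Cloudy", "Rain") in arcs) - 0.5 * len(arcs)

print(local_search(["Cloudy", "Rain", "Sprinkler"], toy_score))
```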

Simulated Annealing (V-42)
1. Initialize the structure (prior structure, empty graph, maximum spanning tree, or random graph) and the temperature T.
2. Pick a random change and compute p = exp(Δscore / T); decrease T.
3. Perform the change with probability p (a change that improves the score has p ≥ 1 and is always performed).
4. If not done, go back to step 2; otherwise return the saved structure. A sketch of the loop follows below.
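A sketch of the annealing loop, reusing the neighbours and is_acyclic helpers and the toy score from the local-search sketch above; the cooling schedule, step count, and acceptance rule min(1, exp(Δscore/T)) are standard but illustrative choices.

```python
# Minimal sketch: simulated annealing over DAG structures, built on the
# neighbours/is_acyclic helpers and toy_score defined in the previous sketch.
import math, random

def simulated_annealing(nodes, score, steps=1000, t0=1.0, cooling=0.995, seed=0):
    rnd = random.Random(seed)
    current, current_score = frozenset(), score(frozenset())
    best, best_score = current, current_score
    t = t0
    for _ in range(steps):
        cands = [frozenset(c) for c in neighbours(nodes, current)
                 if is_acyclic(nodes, frozenset(c))]
        if not cands:
            break
        cand = rnd.choice(cands)                        # pick a random single change
        delta = score(cand) - current_score
        if delta >= 0 or rnd.random() < math.exp(delta / t):
            current, current_score = cand, score(cand)  # perform the change with prob p
            if current_score > best_score:
                best, best_score = current, current_score
        t *= cooling                                    # decrease the temperature
    return best, best_score

print(simulated_annealing(["Cloudy", "Rain", "Sprinkler"], toy_score))
```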

Evaluation Methodology (V-43)
Draw a random sample of data from a gold standard network, optionally add noise, and give the data (together with a prior network) to the score + search procedure, which returns the learned network(s).
Measures of the utility of a learned network:
- cross entropy between the gold standard network and the learned network
- structural difference (e.g. the number of missing arcs, extra arcs, reversed arcs, ...); a small sketch of such a comparison follows below.
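A small sketch of the structural-difference measure. Counting an arc as "reversed" rather than as both missing and extra is one reasonable convention, and the arc sets below are illustrative.

```python
# Minimal sketch: counting missing, extra and reversed arcs between a
# gold-standard structure and a learned structure.
def structural_difference(gold_arcs, learned_arcs):
    gold, learned = set(gold_arcs), set(learned_arcs)
    reversed_arcs = {(u, v) for (u, v) in gold
                     if (v, u) in learned and (u, v) not in learned}
    missing = {a for a in gold - learned if (a[1], a[0]) not in learned}
    extra = {a for a in learned - gold if (a[1], a[0]) not in gold}
    return {"missing": len(missing), "extra": len(extra), "reversed": len(reversed_arcs)}

gold = {("Cloudy", "Sprinkler"), ("Cloudy", "Rain"),
        ("Sprinkler", "WetGrass"), ("Rain", "WetGrass")}
learned = {("Sprinkler", "Cloudy"), ("Cloudy", "Rain"), ("Rain", "WetGrass")}
print(structural_difference(gold, learned))
# {'missing': 1, 'extra': 0, 'reversed': 1}
```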

(V-44)
[Figure: an example run of the evaluation methodology, showing the gold standard network, the prior network, the data D, and the learned network over nodes labelled A, D and R.]

(V-45)
[Figure: another example of the same comparison: the gold standard network, the prior, the data D, and the learned network.]

Problems with the Gold standard methodology (V-46)
- Structural criteria may not properly reflect the quality of the result (e.g., the relevance of an extra arc depends on the parameters).
- Cross entropy (Kullback-Leibler distance) is hard to compute.
- With small data samples, what is the correct answer? Why should the learned network be like the generating network?
- Are there better evaluation strategies? How about predictive performance?