Overview of Course
So far, we have studied:
- The concept of Bayesian networks
- Independence and separation in Bayesian networks
- Inference in Bayesian networks
The rest of the course: data analysis using Bayesian networks.
- Parameter learning: learn parameters for a given structure.
- Structure learning: learn both structures and parameters.
- Learning latent structures: discover latent variables behind observed variables and determine their relationships.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 1 / 58

COMP538: Introduction to Bayesian Networks Lecture 6: Parameter Learning in Bayesian Networks Nevin L. Zhang lzhang@cse.ust.hk Department of Computer Science and Engineering Hong Kong University of Science and Technology Fall 2008 Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 2 / 58

Objective
Objective: principles for parameter learning in Bayesian networks; algorithms for the case of complete data.
Reading: Zhang and Guo (2007), Chapter 7.
Reference: Heckerman (1996) (first half); Cowell et al. (1999), Chapter 9.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 3 / 58

Outline (Problem Statement)
1 Problem Statement
2 Principles of Parameter Learning
  Maximum likelihood estimation
  Bayesian estimation
  Variable with Multiple Values
3 Parameter Estimation in General Bayesian Networks
  The Parameters
  Maximum likelihood estimation
  Properties of MLE
  Bayesian estimation
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 4 / 58

Problem Statement: Parameter Learning
Given: a Bayesian network structure [Figure: network structure over X_1, ..., X_5] and a data set:
  X_1 X_2 X_3 X_4 X_5
   0   0   1   1   0
   1   0   0   1   0
   0   1   0   0   1
   0   0   1   1   1
   :   :   :   :   :
Estimate the conditional probabilities: P(X_1), P(X_2), P(X_3|X_1, X_2), P(X_4|X_1), P(X_5|X_1, X_3, X_4).
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 5 / 58

Outline (Principles of Parameter Learning)
1 Problem Statement
2 Principles of Parameter Learning
  Maximum likelihood estimation
  Bayesian estimation
  Variable with Multiple Values
3 Parameter Estimation in General Bayesian Networks
  The Parameters
  Maximum likelihood estimation
  Properties of MLE
  Bayesian estimation
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 6 / 58

Single-Node Bayesian Network (Principles of Parameter Learning: Maximum likelihood estimation)
Consider a Bayesian network with one node X, where X is the result of tossing a thumbtack and Ω_X = {H, T}.
Data cases: D_1 = H, D_2 = T, D_3 = H, ..., D_m = H
Data set: D = {D_1, D_2, D_3, ..., D_m}
Estimate parameter: θ = P(X=H).
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 8 / 58

Likelihood (Principles of Parameter Learning: Maximum likelihood estimation)
Data: D = {H, T, H, T, T, H, T}
As possible values of θ, which of the following is the most likely? Why?
- θ = 0
- θ = 0.01
- θ = 0.5
θ = 0 contradicts the data because P(D|θ=0) = 0. It cannot explain the data at all.
θ = 0.01 almost contradicts the data. It does not explain the data well. However, it is more consistent with the data than θ = 0 because P(D|θ=0.01) > P(D|θ=0).
θ = 0.5 is more consistent with the data than θ = 0.01 because P(D|θ=0.5) > P(D|θ=0.01). It explains the data the best among the three and is hence the most likely.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 9 / 58

Maximum Likelihood Estimation (Principles of Parameter Learning)
In general, the larger P(D|θ=v) is, the more likely θ=v is.
Likelihood of parameter θ given the data set: L(θ|D) = P(D|θ)
The maximum likelihood estimate (MLE) θ* of θ is a value of θ such that L(θ*|D) = sup_θ L(θ|D).
[Figure: likelihood curve L(θ|D) over θ in [0, 1], with its maximum at θ*.]
The MLE best explains, or best fits, the data.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 10 / 58

i.i.d. and Likelihood (Principles of Parameter Learning: Maximum likelihood estimation)
Assume the data cases D_1, ..., D_m are independent given θ:
  P(D_1, ..., D_m | θ) = ∏_{i=1}^m P(D_i | θ)
Assume the data cases are identically distributed:
  P(D_i = H) = θ, P(D_i = T) = 1 - θ for all i
(Note: i.i.d. means independent and identically distributed.)
Then
  L(θ|D) = P(D|θ) = P(D_1, ..., D_m | θ) = ∏_{i=1}^m P(D_i | θ) = θ^{m_h} (1-θ)^{m_t}   (1)
where m_h is the number of heads and m_t is the number of tails. This is the binomial likelihood.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 11 / 58

Example of Likelihood Function (Principles of Parameter Learning: Maximum likelihood estimation)
Example: D = {D_1 = H, D_2 = T, D_3 = H, D_4 = H, D_5 = T}
  L(θ|D) = P(D|θ)
         = P(D_1 = H|θ) P(D_2 = T|θ) P(D_3 = H|θ) P(D_4 = H|θ) P(D_5 = T|θ)
         = θ (1-θ) θ θ (1-θ)
         = θ^3 (1-θ)^2.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 12 / 58
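For concreteness, the likelihood above can be evaluated numerically for a few candidate values of θ. The short Python sketch below is an added illustration, not part of the lecture; it uses the five-case data set D = {H, T, H, H, T} from this slide, so (m_h, m_t) = (3, 2).

```python
def likelihood(theta, m_h, m_t):
    """Binomial likelihood L(theta|D) = theta^m_h * (1 - theta)^m_t."""
    return theta ** m_h * (1 - theta) ** m_t

data = ['H', 'T', 'H', 'H', 'T']
m_h = data.count('H')   # 3 heads
m_t = data.count('T')   # 2 tails

for theta in [0.0, 0.01, 0.5, 0.6, 1.0]:
    print(f"theta = {theta:4.2f}   L(theta|D) = {likelihood(theta, m_h, m_t):.6f}")
# theta = 0.6 = m_h / m gives the largest value among these candidates.
```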

Sufficient Statistic (Principles of Parameter Learning: Maximum likelihood estimation)
A sufficient statistic is a function s(D) of the data that summarizes the information relevant for computing the likelihood. That is,
  s(D) = s(D')  implies  L(θ|D) = L(θ|D').
Sufficient statistics tell us all there is to know about the data.
Since L(θ|D) = θ^{m_h} (1-θ)^{m_t}, the pair (m_h, m_t) is a sufficient statistic.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 13 / 58

Loglikelihood (Principles of Parameter Learning: Maximum likelihood estimation)
Loglikelihood:
  l(θ|D) = log L(θ|D) = log θ^{m_h} (1-θ)^{m_t} = m_h log θ + m_t log(1-θ)
Maximizing the likelihood is the same as maximizing the loglikelihood. The latter is easier.
By Corollary 1.1 of Lecture 1, the following value maximizes l(θ|D):
  θ* = m_h / (m_h + m_t) = m_h / m
MLE is intuitive. It also has nice properties, e.g. consistency: θ* approaches the true value of θ with probability 1 as m goes to infinity.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 14 / 58
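The closed-form maximizer θ* = m_h/m can be cross-checked with a brute-force search over the loglikelihood. The sketch below is an added check using the counts (m_h, m_t) = (3, 2) from the earlier example; it is not part of the slides.

```python
import math

def loglik(theta, m_h, m_t):
    """Loglikelihood l(theta|D) = m_h*log(theta) + m_t*log(1 - theta)."""
    return m_h * math.log(theta) + m_t * math.log(1 - theta)

m_h, m_t = 3, 2
closed_form = m_h / (m_h + m_t)          # theta* = m_h / m = 0.6

# Grid search; endpoints are excluded to avoid log(0).
grid = [i / 1000 for i in range(1, 1000)]
numerical = max(grid, key=lambda t: loglik(t, m_h, m_t))

print(closed_form, numerical)            # 0.6 0.6
```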

Drawback of MLE (Principles of Parameter Learning: Bayesian estimation)
Thumbtack tossing: (m_h, m_t) = (3, 7). MLE: θ* = 0.3. Reasonable. The data suggest that the thumbtack is biased toward tail.
Coin tossing:
- Case 1: (m_h, m_t) = (3, 7). MLE: θ* = 0.3. Not reasonable. Our experience (prior) suggests strongly that coins are fair, hence θ = 1/2. The data set is too small to convince us that this particular coin is biased. The fact that we get (3, 7) instead of (5, 5) is probably due to randomness.
- Case 2: (m_h, m_t) = (30,000, 70,000). MLE: θ* = 0.3. Reasonable. The data suggest that the coin is after all biased, overshadowing our prior.
MLE does not differentiate between these two cases. It does not take prior information into account.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 16 / 58

Two Views on Parameter Estimation (Principles of Parameter Learning: Bayesian estimation)
MLE:
- Assumes that θ is an unknown but fixed parameter.
- Estimates it using θ*, the value that maximizes the likelihood function.
- Makes predictions based on the estimate: P(D_{m+1} = H | D) = θ*
Bayesian estimation:
- Treats θ as a random variable.
- Assumes a prior probability of θ: p(θ)
- Uses data to get the posterior probability of θ: p(θ|D)
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 17 / 58

Two Views on Parameter Estimation (Principles of Parameter Learning: Bayesian estimation)
Bayesian estimation: predicting D_{m+1}
  P(D_{m+1} = H | D) = ∫ P(D_{m+1} = H, θ | D) dθ
                     = ∫ P(D_{m+1} = H | θ, D) p(θ|D) dθ
                     = ∫ P(D_{m+1} = H | θ) p(θ|D) dθ
                     = ∫ θ p(θ|D) dθ.
Full Bayesian: take the expectation over θ.
Bayesian MAP: P(D_{m+1} = H | D) = θ̂, where θ̂ = arg max_θ p(θ|D).
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 18 / 58

Calculating Bayesian Estimation (Principles of Parameter Learning: Bayesian estimation)
Posterior distribution:
  p(θ|D) ∝ p(θ) L(θ|D) = θ^{m_h} (1-θ)^{m_t} p(θ)
where the equality follows from (1).
To facilitate analysis, assume the prior has a Beta distribution B(α_h, α_t):
  p(θ) ∝ θ^{α_h - 1} (1-θ)^{α_t - 1}
Then
  p(θ|D) ∝ θ^{m_h + α_h - 1} (1-θ)^{m_t + α_t - 1}   (2)
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 19 / 58
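The update in (2) amounts to adding the observed counts to the Beta hyperparameters. Below is a minimal sketch of this conjugate update, assuming a uniform Beta(1, 1) prior and the thumbtack counts (m_h, m_t) = (3, 7) from an earlier slide; the prior choice is mine, for illustration only.

```python
def beta_posterior(alpha_h, alpha_t, m_h, m_t):
    """Conjugate update (2): a Beta(alpha_h, alpha_t) prior and counts (m_h, m_t)
    give a Beta(m_h + alpha_h, m_t + alpha_t) posterior."""
    return m_h + alpha_h, m_t + alpha_t

a_post, b_post = beta_posterior(1, 1, 3, 7)
posterior_mean = a_post / (a_post + b_post)   # (m_h + alpha_h) / (m + alpha)
print(a_post, b_post, posterior_mean)         # 4 8 0.333...
```

The posterior mean computed here is exactly the predictive probability (m_h + α_h)/(m + α) derived on the Calculating Prediction slide below.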

Beta Distribution (Principles of Parameter Learning: Bayesian estimation)
Density function of the prior Beta distribution B(α_h, α_t):
  p(θ) = [Γ(α_t + α_h) / (Γ(α_t) Γ(α_h))] θ^{α_h - 1} (1-θ)^{α_t - 1}
The normalization constant for the Beta distribution B(α_h, α_t) is Γ(α_t + α_h) / (Γ(α_t) Γ(α_h)), where Γ(·) is the Gamma function. For any integer α, Γ(α) = (α-1)!. It is also defined for non-integers.
The hyperparameters α_h and α_t can be thought of as imaginary counts from our prior experiences. Their sum α = α_h + α_t is called the equivalent sample size. The larger the equivalent sample size, the more confident we are in our prior.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 20 / 58

Conjugate Families (Principles of Parameter Learning: Bayesian estimation)
Binomial likelihood: θ^{m_h} (1-θ)^{m_t}
Beta prior: θ^{α_h - 1} (1-θ)^{α_t - 1}
Beta posterior: θ^{m_h + α_h - 1} (1-θ)^{m_t + α_t - 1}
Beta distributions are hence called a conjugate family for the binomial likelihood.
Conjugate families allow a closed form for the posterior distribution of the parameters and a closed-form solution for prediction.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 21 / 58

Calculating Prediction (Principles of Parameter Learning: Bayesian estimation)
We have
  P(D_{m+1} = H | D) = ∫ θ p(θ|D) dθ
                     = c ∫ θ · θ^{m_h + α_h - 1} (1-θ)^{m_t + α_t - 1} dθ
                     = (m_h + α_h) / (m + α)
where c is the normalization constant, m = m_h + m_t, and α = α_h + α_t.
Consequently, P(D_{m+1} = T | D) = (m_t + α_t) / (m + α).
After taking the data D into consideration, our updated belief in X=T is (m_t + α_t) / (m + α).
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 22 / 58

MLE and Bayesian Estimation (Principles of Parameter Learning: Bayesian estimation)
As m goes to infinity, P(D_{m+1} = H | D) approaches the MLE m_h / (m_h + m_t), which approaches the true value of θ with probability 1.
Coin tossing example revisited: suppose α_h = α_t = 100. Equivalent sample size: 200.
In case 1:
  P(D_{m+1} = H | D) = (3 + 100) / (10 + 100 + 100) ≈ 0.5
Our prior prevails.
In case 2:
  P(D_{m+1} = H | D) = (30,000 + 100) / (100,000 + 100 + 100) ≈ 0.3
The data prevail.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 23 / 58
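The two cases above follow directly from the predictive formula (m_h + α_h)/(m + α). The snippet below is an added illustration using the slide's numbers, with α_h = α_t = 100.

```python
def predictive_head(m_h, m_t, alpha_h, alpha_t):
    """P(D_{m+1} = H | D) = (m_h + alpha_h) / (m + alpha) under a Beta prior."""
    return (m_h + alpha_h) / (m_h + m_t + alpha_h + alpha_t)

# Case 1: small data set, the prior prevails.
print(predictive_head(3, 7, 100, 100))            # 103/210 = 0.49...
# Case 2: large data set, the data prevail.
print(predictive_head(30_000, 70_000, 100, 100))  # 30100/100200 = 0.30...
```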

Variable with Multiple Values (Principles of Parameter Learning)
Bayesian networks with a single multi-valued variable: Ω_X = {x_1, x_2, ..., x_r}.
Let θ_i = P(X = x_i) and θ = (θ_1, θ_2, ..., θ_r). Note that θ_i ≥ 0 and Σ_i θ_i = 1.
Suppose in a data set D there are m_i data cases where X takes value x_i. Then
  L(θ|D) = P(D|θ) = ∏_{j=1}^N P(D_j|θ) = ∏_{i=1}^r θ_i^{m_i}
This is the multinomial likelihood.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 25 / 58

Dirichlet Distributions (Principles of Parameter Learning: Variable with Multiple Values)
Conjugate family for the multinomial likelihood: Dirichlet distributions.
A Dirichlet distribution is parameterized by r parameters α_1, α_2, ..., α_r. Density function:
  p(θ) = [Γ(α) / ∏_{i=1}^r Γ(α_i)] ∏_{i=1}^r θ_i^{α_i - 1}
where α = α_1 + α_2 + ... + α_r. Same as the Beta distribution when r = 2.
Fact: for any i,
  ∫ θ_i [Γ(α) / ∏_{i=1}^r Γ(α_i)] ∏_{k=1}^r θ_k^{α_k - 1} dθ_1 dθ_2 ... dθ_r = α_i / α
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 26 / 58

Calculating Parameter Estimations (Principles of Parameter Learning: Variable with Multiple Values)
If the prior probability is a Dirichlet distribution Dir(α_1, α_2, ..., α_r), then the posterior probability p(θ|D) is given by
  p(θ|D) ∝ ∏_{i=1}^r θ_i^{m_i + α_i - 1}
So it is the Dirichlet distribution Dir(α_1 + m_1, α_2 + m_2, ..., α_r + m_r).
Bayesian estimation has the following closed form:
  P(D_{m+1} = x_i | D) = ∫ θ_i p(θ|D) dθ = (α_i + m_i) / (α + m)
MLE: θ_i* = m_i / m. (Exercise: prove this.)
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 27 / 58
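The multi-valued case works the same way as the binary one, with the counts added to the Dirichlet hyperparameters. The sketch below is an added illustration; the three-valued variable, the counts (2, 5, 3), and the uniform Dir(1, 1, 1) prior are my choices, not from the slides.

```python
def dirichlet_estimates(alphas, counts):
    """Bayesian estimates (alpha_i + m_i)/(alpha + m) and MLEs m_i/m."""
    alpha, m = sum(alphas), sum(counts)
    bayes = [(a + c) / (alpha + m) for a, c in zip(alphas, counts)]
    mle = [c / m for c in counts]
    return bayes, mle

bayes, mle = dirichlet_estimates([1, 1, 1], [2, 5, 3])
print(bayes)   # [3/13, 6/13, 4/13] = [0.23..., 0.46..., 0.30...]
print(mle)     # [0.2, 0.5, 0.3]
```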

Outline (Parameter Estimation in General Bayesian Networks)
1 Problem Statement
2 Principles of Parameter Learning
  Maximum likelihood estimation
  Bayesian estimation
  Variable with Multiple Values
3 Parameter Estimation in General Bayesian Networks
  The Parameters
  Maximum likelihood estimation
  Properties of MLE
  Bayesian estimation
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 28 / 58

The Parameters (Parameter Estimation in General Bayesian Networks)
n variables: X_1, X_2, ..., X_n.
Number of states of X_i: 1, 2, ..., r_i = |Ω_{X_i}|.
Number of configurations of the parents of X_i: 1, 2, ..., q_i = |Ω_{pa(X_i)}|.
Parameters to be estimated:
  θ_ijk = P(X_i = j | pa(X_i) = k),  i = 1,...,n; j = 1,...,r_i; k = 1,...,q_i
Parameter vector: θ = {θ_ijk | i = 1,...,n; j = 1,...,r_i; k = 1,...,q_i}. Note that Σ_j θ_ijk = 1 for all i, k.
θ_{i..}: vector of parameters for P(X_i | pa(X_i)):  θ_{i..} = {θ_ijk | j = 1,...,r_i; k = 1,...,q_i}
θ_{i.k}: vector of parameters for P(X_i | pa(X_i)=k):  θ_{i.k} = {θ_ijk | j = 1,...,r_i}
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 30 / 58

The Parameters: Example (Parameter Estimation in General Bayesian Networks)
Example: consider the Bayesian network with structure X_1 → X_3 ← X_2 shown on the slide. Assume all variables are binary, taking values 1 and 2.
  θ_111 = P(X_1=1), θ_121 = P(X_1=2)
  θ_211 = P(X_2=1), θ_221 = P(X_2=2)
  pa(X_3) = 1: θ_311 = P(X_3=1 | X_1=1, X_2=1), θ_321 = P(X_3=2 | X_1=1, X_2=1)
  pa(X_3) = 2: θ_312 = P(X_3=1 | X_1=1, X_2=2), θ_322 = P(X_3=2 | X_1=1, X_2=2)
  pa(X_3) = 3: θ_313 = P(X_3=1 | X_1=2, X_2=1), θ_323 = P(X_3=2 | X_1=2, X_2=1)
  pa(X_3) = 4: θ_314 = P(X_3=1 | X_1=2, X_2=2), θ_324 = P(X_3=2 | X_1=2, X_2=2)
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 31 / 58

Data (Parameter Estimation in General Bayesian Networks: The Parameters)
A complete case D_l: a vector of values, one for each variable. Example: D_l = (X_1=1, X_2=2, X_3=2).
Given: a set of complete cases D = {D_1, D_2, ..., D_m}. Example (16 cases, shown in two columns):
  X_1 X_2 X_3     X_1 X_2 X_3
   1   1   1       2   1   1
   1   1   2       2   1   2
   1   1   2       2   2   1
   1   2   2       2   2   1
   1   2   2       2   2   2
   1   2   2       2   2   2
   2   1   1       2   2   2
   2   1   1       2   2   2
Find: the ML estimates of the parameters θ.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 32 / 58

The Loglikelihood Function (Parameter Estimation in General Bayesian Networks: Maximum likelihood estimation)
Loglikelihood:
  l(θ|D) = log L(θ|D) = log P(D|θ) = log ∏_l P(D_l|θ) = Σ_l log P(D_l|θ).
The term log P(D_l|θ): for D_4 = (1, 2, 2),
  log P(D_4|θ) = log P(X_1=1, X_2=2, X_3=2 | θ)
               = log P(X_1=1|θ) P(X_2=2|θ) P(X_3=2 | X_1=1, X_2=2, θ)
               = log θ_111 + log θ_221 + log θ_322.
Recall: θ = {θ_111, θ_121; θ_211, θ_221; θ_311, θ_312, θ_313, θ_314, θ_321, θ_322, θ_323, θ_324}
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 34 / 58

The Loglikelihood Function (Parameter Estimation in General Bayesian Networks: Maximum likelihood estimation)
Define the characteristic function of case D_l:
  χ(i, j, k : D_l) = 1 if X_i = j and pa(X_i) = k in D_l, and 0 otherwise.
When l = 4, D_4 = (1, 2, 2):
  χ(1,1,1 : D_4) = χ(2,2,1 : D_4) = χ(3,2,2 : D_4) = 1, and χ(i,j,k : D_4) = 0 for all other i, j, k.
So log P(D_4|θ) = Σ_{ijk} χ(i,j,k : D_4) log θ_ijk.
In general, log P(D_l|θ) = Σ_{ijk} χ(i,j,k : D_l) log θ_ijk.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 35 / 58

The Loglikelihood Function: Sufficient statistics m_ijk (Parameter Estimation in General Bayesian Networks: Maximum likelihood estimation)
Define m_ijk = Σ_l χ(i, j, k : D_l). It is the number of data cases where X_i = j and pa(X_i) = k.
Then
  l(θ|D) = Σ_l log P(D_l|θ)
         = Σ_l Σ_{i,j,k} χ(i,j,k : D_l) log θ_ijk
         = Σ_{i,j,k} [Σ_l χ(i,j,k : D_l)] log θ_ijk
         = Σ_{ijk} m_ijk log θ_ijk
         = Σ_{i,k} Σ_j m_ijk log θ_ijk.   (4)
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 36 / 58

MLE (Parameter Estimation in General Bayesian Networks: Maximum likelihood estimation)
Want:
  arg max_θ l(θ|D) = arg max_θ Σ_{i,k} Σ_j m_ijk log θ_ijk
Note that θ_ijk = P(X_i=j | pa(X_i)=k) and θ_{i'j'k'} = P(X_{i'}=j' | pa(X_{i'})=k') are not related if i ≠ i' or k ≠ k'.
Consequently, we can separately maximize each term in the summation Σ_{i,k} [...]:
  arg max_{θ_{i.k}} Σ_j m_ijk log θ_ijk
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 37 / 58

MLE (Parameter Estimation in General Bayesian Networks: Maximum likelihood estimation)
By Corollary 1.1, we get
  θ*_ijk = m_ijk / Σ_j m_ijk
In words, the MLE estimate for θ_ijk = P(X_i=j | pa(X_i)=k) is
  θ*_ijk = (number of cases where X_i=j and pa(X_i)=k) / (number of cases where pa(X_i)=k)
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 38 / 58

Example (Parameter Estimation in General Bayesian Networks: Maximum likelihood estimation)
For the network X_1 → X_3 ← X_2 and the 16-case data set shown on the Data slide:
  MLE for P(X_1=1) is 6/16
  MLE for P(X_2=1) is 7/16
  MLE for P(X_3=1 | X_1=2, X_2=2) is 2/6
  ...
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 39 / 58
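The three estimates above can be verified by counting directly in the data set. The script below is an added check; the 16 cases are transcribed from the table on the Data slide (left column first, then right column).

```python
data = [
    (1, 1, 1), (1, 1, 2), (1, 1, 2), (1, 2, 2), (1, 2, 2), (1, 2, 2), (2, 1, 1), (2, 1, 1),
    (2, 1, 1), (2, 1, 2), (2, 2, 1), (2, 2, 1), (2, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2),
]
m = len(data)

# MLE for P(X1 = 1) and P(X2 = 1): marginal counts over the total.
print(sum(1 for x1, _, _ in data if x1 == 1) / m)   # 6/16 = 0.375
print(sum(1 for _, x2, _ in data if x2 == 1) / m)   # 7/16 = 0.4375

# MLE for P(X3 = 1 | X1 = 2, X2 = 2): m_ijk / sum_j m_ijk, restricted to the
# parent configuration (X1, X2) = (2, 2).
x3_given_22 = [x3 for x1, x2, x3 in data if (x1, x2) == (2, 2)]
print(x3_given_22.count(1) / len(x3_given_22))      # 2/6 = 0.333...
```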

A Question (Parameter Estimation in General Bayesian Networks: Properties of MLE)
Start from a joint distribution P(X) (the generative distribution).
D: a collection of data sampled from P(X).
Let S be a BN structure (DAG) over the variables X.
Learn parameters θ* for the BN structure S from D. Let P*(X) be the joint probability of the BN (S, θ*). Note: θ*_ijk = P*(X_i=j | pa_S(X_i)=k).
How is P* related to P?
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 41 / 58

MLE in General Bayesian Networks with Complete Data (Properties of MLE)
[Figure: P, P*, and the set of distributions that factorize according to S.]
We will show that, with probability 1, P* converges to the distribution that factorizes according to S and is closest to P under KL divergence among all distributions that factorize according to S.
P* factorizes according to S. P does not necessarily factorize according to S.
If P factorizes according to S, P* converges to P with probability 1. (MLE is consistent.)
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 42 / 58

The Target Distribution (Parameter Estimation in General Bayesian Networks: Properties of MLE)
Define θ^S_ijk = P(X_i=j | pa_S(X_i)=k). Let P^S(X) be the joint distribution of the BN (S, θ^S).
P^S factorizes according to S and, for any X ∈ X, P^S(X | pa(X)) = P(X | pa(X)).
If P factorizes according to S, then P and P^S are identical.
If P does not factorize according to S, then P and P^S are different.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 43 / 58

First Theorem (Parameter Estimation in General Bayesian Networks: Properties of MLE)
Theorem (6.1): Among all distributions Q that factorize according to S, the KL divergence KL(P, Q) is minimized by Q = P^S.
P^S is the closest to P among all distributions that factorize according to S.
Proof: Since
  KL(P, Q) = Σ_X P(X) log [P(X) / Q(X)],
it suffices to show the proposition: Q = P^S maximizes Σ_X P(X) log Q(X).
We show the claim by induction on the number of nodes. When there is only one node, the proposition follows from the property of KL divergence (Corollary 1.1).
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 44 / 58

First Theorem (Parameter Estimation in General Bayesian Networks: Properties of MLE)
Suppose the proposition is true for the case of n nodes. Consider the case of n+1 nodes.
Let X be a leaf node, let X' = X \ {X} be the remaining variables, and let S' be the structure obtained from S by removing X. Then
  Σ_X P(X) log Q(X) = Σ_{X'} P(X') log Q(X') + Σ_{pa(X)} P(pa(X)) Σ_X P(X | pa(X)) log Q(X | pa(X)).
By the induction hypothesis, the first term is maximized by P^{S'}. By Corollary 1.1, the second term is maximized if Q(X | pa(X)) = P(X | pa(X)). Hence the sum is maximized by P^S.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 45 / 58

Second Theorem (Parameter Estimation in General Bayesian Networks: Properties of MLE)
Theorem (6.2): lim_{N→∞} P*(X=x) = P^S(X=x) with probability 1, where N is the sample size, i.e. the number of cases in D.
Proof: Let P̂(X) be the empirical distribution: P̂(X=x) = fraction of cases in D where X=x.
It is clear that P*(X_i=j | pa_S(X_i)=k) = θ*_ijk = P̂(X_i=j | pa_S(X_i)=k).
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 46 / 58

Second Theorem (Parameter Estimation in General Bayesian Networks: Properties of MLE)
On the other hand, by the law of large numbers, we have
  lim_{N→∞} P̂(X=x) = P(X=x) with probability 1.
Hence
  lim_{N→∞} P*(X_i=j | pa_S(X_i)=k) = lim_{N→∞} P̂(X_i=j | pa_S(X_i)=k)
                                    = P(X_i=j | pa_S(X_i)=k)
                                    = P^S(X_i=j | pa_S(X_i)=k) with probability 1.
Because both P* and P^S factorize according to S, the theorem follows. Q.E.D.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 47 / 58

A Corollary (Parameter Estimation in General Bayesian Networks: Properties of MLE)
Corollary: If P factorizes according to S, then lim_{N→∞} P*(X=x) = P(X=x) with probability 1.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 48 / 58

Bayesian Estimation (Parameter Estimation in General Bayesian Networks)
View θ as a vector of random variables with prior distribution p(θ).
Posterior:
  p(θ|D) ∝ p(θ) L(θ|D) = p(θ) ∏_{i,k} ∏_j θ_ijk^{m_ijk}
where the equality follows from (4).
Assumptions need to be made about the prior distribution.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 50 / 58

Assumptions (Parameter Estimation in General Bayesian Networks: Bayesian estimation)
Global independence in the prior distribution:
  p(θ) = ∏_i p(θ_{i..})
Local independence in the prior distribution: for each i,
  p(θ_{i..}) = ∏_k p(θ_{i.k})
Parameter independence = global independence + local independence:
  p(θ) = ∏_{i,k} p(θ_{i.k})
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 51 / 58

Assumptions (Parameter Estimation in General Bayesian Networks: Bayesian estimation)
Further assume that p(θ_{i.k}) is a Dirichlet distribution Dir(α_{i1k}, α_{i2k}, ..., α_{i r_i k}):
  p(θ_{i.k}) ∝ ∏_j θ_ijk^{α_ijk - 1}
Then
  p(θ) ∝ ∏_{i,k} ∏_j θ_ijk^{α_ijk - 1},
a product Dirichlet distribution.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 52 / 58

Bayesian Estimation (Parameter Estimation in General Bayesian Networks: Bayesian estimation)
Posterior:
  p(θ|D) ∝ p(θ) ∏_{i,k} ∏_j θ_ijk^{m_ijk}
         = [∏_{i,k} ∏_j θ_ijk^{α_ijk - 1}] ∏_{i,k} ∏_j θ_ijk^{m_ijk}
         = ∏_{i,k} ∏_j θ_ijk^{m_ijk + α_ijk - 1}
It is also a product Dirichlet distribution. (Think: what does this mean?)
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 53 / 58

Prediction (Parameter Estimation in General Bayesian Networks: Bayesian estimation)
Predicting D_{m+1} = {X_1^{m+1}, X_2^{m+1}, ..., X_n^{m+1}}: these are random variables. For notational simplicity, simply write D_{m+1} = {X_1, X_2, ..., X_n}.
First, we have:
  P(D_{m+1}|D) = P(X_1, X_2, ..., X_n | D) = ∏_i P(X_i | pa(X_i), D)
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 54 / 58

Proof (Parameter Estimation in General Bayesian Networks: Bayesian estimation)
  P(D_{m+1}|D) = ∫ P(D_{m+1}|θ) p(θ|D) dθ
  P(D_{m+1}|θ) = P(X_1, X_2, ..., X_n | θ) = ∏_i P(X_i | pa(X_i), θ) = ∏_i P(X_i | pa(X_i), θ_{i..})
  p(θ|D) = ∏_i p(θ_{i..} | D)
Hence
  P(D_{m+1}|D) = ∏_i ∫ P(X_i | pa(X_i), θ_{i..}) p(θ_{i..}|D) dθ_{i..} = ∏_i P(X_i | pa(X_i), D)
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 55 / 58

Prediction (Parameter Estimation in General Bayesian Networks: Bayesian estimation)
Further, we have
  P(X_i=j | pa(X_i)=k, D) = ∫ P(X_i=j | pa(X_i)=k, θ_ijk) p(θ_ijk|D) dθ_ijk
                          = ∫ θ_ijk p(θ_ijk|D) dθ_ijk
Because
  p(θ_{i.k}|D) ∝ ∏_j θ_ijk^{m_ijk + α_ijk - 1},
we have
  ∫ θ_ijk p(θ_ijk|D) dθ_ijk = (m_ijk + α_ijk) / Σ_j (m_ijk + α_ijk)
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 56 / 58

Prediction (Parameter Estimation in General Bayesian Networks: Bayesian estimation)
Conclusion:
  P(X_1, X_2, ..., X_n | D) = ∏_i P(X_i | pa(X_i), D)
where
  P(X_i=j | pa(X_i)=k, D) = (m_ijk + α_ijk) / (m_{i.k} + α_{i.k}),  with m_{i.k} = Σ_j m_ijk and α_{i.k} = Σ_j α_ijk.
Notes:
- Conditional independence (structure) is preserved after absorbing D.
- This is an important property for sequential learning, where we process one case at a time. The final result is independent of the order in which the cases are processed.
- Comparison with the MLE estimate: θ*_ijk = m_ijk / Σ_j m_ijk.
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 57 / 58
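As a closing illustration (added, not in the slides), the closed-form Bayesian estimate (m_ijk + α_ijk)/(m_{i.k} + α_{i.k}) can be computed on the same 16-case data set used in the MLE example; the symmetric pseudo-count α_ijk = 1 is my choice.

```python
# The same 16 cases (X1, X2, X3) as in the MLE example.
data = [
    (1, 1, 1), (1, 1, 2), (1, 1, 2), (1, 2, 2), (1, 2, 2), (1, 2, 2), (2, 1, 1), (2, 1, 1),
    (2, 1, 1), (2, 1, 2), (2, 2, 1), (2, 2, 1), (2, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2),
]

def bayes_estimate_x3(j, parent_config, alpha=1.0, r=2):
    """P(X3 = j | (X1, X2) = parent_config, D) with a symmetric Dirichlet prior:
    (m_ijk + alpha) / (m_i.k + r*alpha), where alpha_ijk = alpha for every j."""
    rows = [x3 for x1, x2, x3 in data if (x1, x2) == parent_config]
    m_jk = sum(1 for x3 in rows if x3 == j)   # m_ijk
    m_k = len(rows)                           # m_i.k = sum_j m_ijk
    return (m_jk + alpha) / (m_k + r * alpha)

# Compare with the MLE 2/6 from the earlier example:
print(bayes_estimate_x3(1, (2, 2)))   # (2 + 1) / (6 + 2) = 0.375
```

With only six cases for this parent configuration, the pseudo-counts pull the estimate from 2/6 toward 1/2, which mirrors the coin-tossing discussion earlier.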

Summary (Parameter Estimation in General Bayesian Networks: Bayesian estimation)
θ: random variable.
Prior p(θ): product Dirichlet distribution
  p(θ) = ∏_{i,k} p(θ_{i.k}) ∝ ∏_{i,k} ∏_j θ_ijk^{α_ijk - 1}
Posterior p(θ|D): also a product Dirichlet distribution
  p(θ|D) ∝ ∏_{i,k} ∏_j θ_ijk^{m_ijk + α_ijk - 1}
Prediction:
  P(D_{m+1}|D) = P(X_1, X_2, ..., X_n | D) = ∏_i P(X_i | pa(X_i), D)
where P(X_i=j | pa(X_i)=k, D) = (m_ijk + α_ijk) / (m_{i.k} + α_{i.k}).
Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 58 / 58