Overview of Course

So far, we have studied:
- The concept of Bayesian networks
- Independence and separation in Bayesian networks
- Inference in Bayesian networks

The rest of the course: data analysis using Bayesian networks.
- Parameter learning: learn parameters for a given structure.
- Structure learning: learn both structure and parameters.
- Learning latent structures: discover latent variables behind observed variables and determine their relationships.

Nevin L. Zhang (HKUST) Bayesian Networks Fall 2008 1 / 58
COMP538: Introduction to Bayesian Networks
Lecture 6: Parameter Learning in Bayesian Networks

Nevin L. Zhang
lzhang@cse.ust.hk
Department of Computer Science and Engineering
Hong Kong University of Science and Technology

Fall 2008
Objective

Objective: principles for parameter learning in Bayesian networks, and algorithms for the case of complete data.

Reading: Zhang and Guo (2007), Chapter 7
Reference: Heckerman (1996) (first half); Cowell et al. (1999), Chapter 9
Outline

1. Problem Statement
2. Principles of Parameter Learning
   - Maximum likelihood estimation
   - Bayesian estimation
   - Variables with multiple values
3. Parameter Estimation in General Bayesian Networks
   - The parameters
   - Maximum likelihood estimation
   - Properties of MLE
   - Bayesian estimation
Problem Statement: Parameter Learning

Given:
- A Bayesian network structure over X1, ..., X5. [Figure: network with X1 → X3 ← X2, X1 → X4, and X1, X3, X4 → X5.]
- A data set of complete cases:

  X1 X2 X3 X4 X5
   0  0  1  1  0
   1  0  0  1  0
   0  1  0  0  1
   0  0  1  1  1
   :  :  :  :  :

Estimate the conditional probabilities:
P(X1), P(X2), P(X3 | X1, X2), P(X4 | X1), P(X5 | X1, X3, X4)
Principles of Parameter Learning: Single-Node Bayesian Network

Consider a Bayesian network with one node X, where X is the result of tossing a thumbtack and Ω_X = {H, T}.

Data cases: D1 = H, D2 = T, D3 = H, ..., Dm = H
Data set: D = {D1, D2, D3, ..., Dm}

Estimate parameter: θ = P(X = H).
Likelihood

Data: D = {H, T, H, T, T, H, T}

As possible values of θ, which of the following is the most likely? Why?
- θ = 0
- θ = 0.01
- θ = 0.5

θ = 0 contradicts the data because P(D | θ=0) = 0. It cannot explain the data at all.
θ = 0.01 almost contradicts the data; it does not explain the data well. However, it is more consistent with the data than θ = 0 because P(D | θ=0.01) > P(D | θ=0).
Similarly, θ = 0.5 is more consistent with the data than θ = 0.01 because P(D | θ=0.5) > P(D | θ=0.01). It explains the data best among the three and is hence the most likely.
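A minimal numerical check of the comparison above, using the data set D = {H, T, H, T, T, H, T}, i.e. (m_h, m_t) = (3, 4):

```python
def likelihood(theta, m_h, m_t):
    """Binomial likelihood P(D | theta) = theta^m_h * (1 - theta)^m_t."""
    return theta ** m_h * (1 - theta) ** m_t

m_h, m_t = 3, 4
for theta in (0.0, 0.01, 0.5):
    print(theta, likelihood(theta, m_h, m_t))
# The likelihood strictly increases across the three candidates:
# L(0) = 0 < L(0.01) < L(0.5)
```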
Maximum Likelihood Estimation

In general, the larger P(D | θ=v) is, the more likely θ = v is.

Likelihood of parameter θ given the data set: L(θ | D) = P(D | θ)

The maximum likelihood estimate (MLE) θ* of θ is a possible value of θ such that
L(θ* | D) = sup_θ L(θ | D).

[Figure: likelihood curve L(θ | D) over θ ∈ [0, 1], peaking at θ*.]

The MLE best explains the data, i.e., best fits the data.
i.i.d. and Likelihood

Assume the data cases D1, ..., Dm are independent given θ:
P(D1, ..., Dm | θ) = ∏_{i=1}^m P(Di | θ)

Assume the data cases are identically distributed:
P(Di = H) = θ, P(Di = T) = 1 − θ for all i
(Note: i.i.d. means independent and identically distributed.)

Then
L(θ | D) = P(D | θ) = P(D1, ..., Dm | θ) = ∏_{i=1}^m P(Di | θ) = θ^{m_h} (1 − θ)^{m_t}   (1)
where m_h is the number of heads and m_t is the number of tails. This is the binomial likelihood.
Example of Likelihood Function

Example: D = {D1=H, D2=T, D3=H, D4=H, D5=T}

L(θ | D) = P(D | θ)
         = P(D1=H | θ) P(D2=T | θ) P(D3=H | θ) P(D4=H | θ) P(D5=T | θ)
         = θ(1−θ)θθ(1−θ)
         = θ^3 (1−θ)^2.
Sufficient Statistic

A sufficient statistic is a function s(D) of the data that summarizes the relevant information for computing the likelihood. That is,
s(D) = s(D′)  ⇒  L(θ | D) = L(θ | D′)

Sufficient statistics tell us all there is to know about the data.

Since L(θ | D) = θ^{m_h}(1−θ)^{m_t}, the pair (m_h, m_t) is a sufficient statistic.
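As a sketch of this point: two data sets with the same counts (m_h, m_t) = (3, 4) but different orderings yield the same likelihood at every θ, so the counts are all the likelihood ever sees.

```python
def likelihood_seq(seq, theta):
    """P(D | theta) computed case by case over a sequence of 'H'/'T' outcomes."""
    p = 1.0
    for x in seq:
        p *= theta if x == "H" else 1.0 - theta
    return p

D1 = "HTHTTHT"
D2 = "TTTTHHH"  # same sufficient statistic (m_h, m_t) = (3, 4), different order
print(likelihood_seq(D1, 0.3), likelihood_seq(D2, 0.3))  # equal up to rounding
```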
Loglikelihood

Loglikelihood:
l(θ | D) = log L(θ | D) = log θ^{m_h}(1−θ)^{m_t} = m_h log θ + m_t log(1−θ)

Maximizing the likelihood is the same as maximizing the loglikelihood. The latter is easier.

By Corollary 1.1 of Lecture 1, the following value maximizes l(θ | D):
θ* = m_h / (m_h + m_t) = m_h / m

MLE is intuitive. It also has nice properties, e.g. consistency: θ* approaches the true value of θ with probability 1 as m goes to infinity.
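A quick sketch that the closed form θ* = m_h/m really is the maximizer: compare it against a fine grid of candidate values of θ (here with the (3, 7) counts used on the next slide).

```python
import math

def loglik(theta, m_h, m_t):
    """Loglikelihood l(theta | D) = m_h*log(theta) + m_t*log(1 - theta)."""
    if theta <= 0.0 or theta >= 1.0:
        return float("-inf")  # boundary values cannot explain mixed data
    return m_h * math.log(theta) + m_t * math.log(1.0 - theta)

m_h, m_t = 3, 7
theta_mle = m_h / (m_h + m_t)  # closed form: 0.3

# Grid check: no theta on a fine grid beats the closed-form MLE.
grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=lambda t: loglik(t, m_h, m_t))
print(theta_mle, best)  # both 0.3
```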
Drawback of MLE

Thumbtack tossing: (m_h, m_t) = (3, 7). MLE: θ* = 0.3.
Reasonable. The data suggest that the thumbtack is biased toward tail.

Coin tossing:
Case 1: (m_h, m_t) = (3, 7). MLE: θ* = 0.3.
Not reasonable. Our experience (prior) suggests strongly that coins are fair, hence θ = 1/2. The size of the data set is too small to convince us that this particular coin is biased. The fact that we get (3, 7) instead of (5, 5) is probably due to randomness.

Case 2: (m_h, m_t) = (30,000, 70,000). MLE: θ* = 0.3.
Reasonable. The data suggest that the coin is after all biased, overshadowing our prior.

MLE does not differentiate between these two cases. It does not take prior information into account.
Two Views on Parameter Estimation

MLE:
- Assumes that θ is an unknown but fixed parameter.
- Estimates it using θ*, the value that maximizes the likelihood function.
- Makes predictions based on the estimate: P(D_{m+1}=H | D) = θ*

Bayesian estimation:
- Treats θ as a random variable.
- Assumes a prior probability of θ: p(θ).
- Uses data to get the posterior probability of θ: p(θ | D).
Two Views on Parameter Estimation

Bayesian estimation: predicting D_{m+1}

P(D_{m+1}=H | D) = ∫ P(D_{m+1}=H, θ | D) dθ
                 = ∫ P(D_{m+1}=H | θ, D) p(θ | D) dθ
                 = ∫ P(D_{m+1}=H | θ) p(θ | D) dθ
                 = ∫ θ p(θ | D) dθ.

Full Bayesian: take the expectation over θ.
Bayesian MAP: P(D_{m+1}=H | D) = θ̂ = arg max_θ p(θ | D).
Calculating Bayesian Estimation

Posterior distribution:
p(θ | D) ∝ p(θ) L(θ | D) = θ^{m_h}(1−θ)^{m_t} p(θ)
where the second step follows from (1).

To facilitate analysis, assume the prior is a Beta distribution B(α_h, α_t):
p(θ) ∝ θ^{α_h − 1}(1−θ)^{α_t − 1}

Then
p(θ | D) ∝ θ^{m_h + α_h − 1}(1−θ)^{m_t + α_t − 1}   (2)
Beta Distribution

Density function of the prior Beta distribution B(α_h, α_t):
p(θ) = [Γ(α_h + α_t) / (Γ(α_h)Γ(α_t))] θ^{α_h − 1}(1−θ)^{α_t − 1}

The normalization constant for B(α_h, α_t) is Γ(α_h + α_t) / (Γ(α_h)Γ(α_t)), where Γ(·) is the Gamma function. For any integer α, Γ(α) = (α−1)!. It is also defined for non-integers.

The hyperparameters α_h and α_t can be thought of as imaginary counts from our prior experience. Their sum α = α_h + α_t is called the equivalent sample size. The larger the equivalent sample size, the more confident we are in our prior.
Conjugate Families

Binomial likelihood: θ^{m_h}(1−θ)^{m_t}
Beta prior: θ^{α_h − 1}(1−θ)^{α_t − 1}
Beta posterior: θ^{m_h + α_h − 1}(1−θ)^{m_t + α_t − 1}

Beta distributions are hence called a conjugate family for the binomial likelihood. Conjugate families allow a closed form for the posterior distribution of the parameters and a closed-form solution for prediction.
Calculating Prediction

We have
P(D_{m+1}=H | D) = ∫ θ p(θ | D) dθ
                 = c ∫ θ θ^{m_h + α_h − 1}(1−θ)^{m_t + α_t − 1} dθ
                 = (m_h + α_h) / (m + α)
where c is the normalization constant, m = m_h + m_t, and α = α_h + α_t.

Consequently,
P(D_{m+1}=T | D) = (m_t + α_t) / (m + α)

After taking the data D into consideration, our updated belief in X = T is (m_t + α_t)/(m + α).
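The closed form (m_h + α_h)/(m + α) is easy to compute directly. This sketch uses the coin-tossing numbers from the next slide (strong fair-coin prior α_h = α_t = 100):

```python
def predict_heads(m_h, m_t, alpha_h, alpha_t):
    """Posterior predictive P(D_{m+1}=H | D) = (m_h + alpha_h) / (m + alpha)."""
    return (m_h + alpha_h) / (m_h + m_t + alpha_h + alpha_t)

# Strong fair-coin prior: alpha_h = alpha_t = 100, equivalent sample size 200.
print(predict_heads(3, 7, 100, 100))          # 103/210 ~ 0.490: prior prevails
print(predict_heads(30000, 70000, 100, 100))  # 30100/100200 ~ 0.300: data prevail
```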
MLE and Bayesian Estimation

As m goes to infinity, P(D_{m+1}=H | D) approaches the MLE m_h/(m_h + m_t), which approaches the true value of θ with probability 1.

Coin tossing example revisited: suppose α_h = α_t = 100. Equivalent sample size: 200.

In case 1:
P(D_{m+1}=H | D) = (3 + 100)/(10 + 100 + 100) ≈ 0.5
Our prior prevails.

In case 2:
P(D_{m+1}=H | D) = (30,000 + 100)/(100,000 + 100 + 100) ≈ 0.3
The data prevail.
Variables with Multiple Values

Bayesian networks with a single multi-valued variable: Ω_X = {x_1, x_2, ..., x_r}.

Let θ_i = P(X = x_i) and θ = (θ_1, θ_2, ..., θ_r). Note that θ_i ≥ 0 and Σ_i θ_i = 1.

Suppose in a data set D there are m_i data cases where X takes value x_i. Then
L(θ | D) = P(D | θ) = ∏_{j=1}^N P(D_j | θ) = ∏_{i=1}^r θ_i^{m_i}

This is the multinomial likelihood.
Dirichlet Distributions

The conjugate family for the multinomial likelihood: Dirichlet distributions.

A Dirichlet distribution is parameterized by r parameters α_1, α_2, ..., α_r, with density function
[Γ(α) / ∏_{i=1}^r Γ(α_i)] ∏_{i=1}^r θ_i^{α_i − 1}
where α = α_1 + α_2 + ... + α_r. It is the same as the Beta distribution when r = 2.

Fact: for any i,
∫ θ_i [Γ(α) / ∏_{i=1}^r Γ(α_i)] ∏_{k=1}^r θ_k^{α_k − 1} dθ_1 dθ_2 ... dθ_r = α_i / α
Calculating Parameter Estimates

If the prior is a Dirichlet distribution Dir(α_1, α_2, ..., α_r), then the posterior p(θ | D) is given by
p(θ | D) ∝ ∏_{i=1}^r θ_i^{m_i + α_i − 1}
So it is the Dirichlet distribution Dir(α_1 + m_1, α_2 + m_2, ..., α_r + m_r).

Bayesian estimation has the following closed form:
P(D_{m+1}=x_i | D) = ∫ θ_i p(θ | D) dθ = (α_i + m_i)/(α + m)

MLE: θ_i* = m_i/m. (Exercise: prove this.)
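Both closed forms are one-liners. A minimal sketch, using hypothetical counts (2, 3, 5) for a three-valued variable and a uniform Dirichlet prior (all α_i = 1):

```python
def dirichlet_predict(counts, alphas):
    """Bayesian estimate P(D_{m+1}=x_i | D) = (alpha_i + m_i) / (alpha + m)."""
    total = sum(counts) + sum(alphas)
    return [(m + a) / total for m, a in zip(counts, alphas)]

def mle(counts):
    """MLE theta_i = m_i / m."""
    m = sum(counts)
    return [c / m for c in counts]

counts = [2, 3, 5]   # hypothetical m_i for a three-valued variable
alphas = [1, 1, 1]   # uniform Dirichlet prior
print(dirichlet_predict(counts, alphas))  # [3/13, 4/13, 6/13]
print(mle(counts))                        # [0.2, 0.3, 0.5]
```

Note that the Bayesian estimate pulls every value slightly toward the uniform distribution, while the MLE matches the raw frequencies exactly.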
Parameter Estimation in General Bayesian Networks: The Parameters

n variables: X_1, X_2, ..., X_n.
States of X_i: 1, 2, ..., r_i = |Ω_{X_i}|.
Configurations of the parents of X_i: 1, 2, ..., q_i = |Ω_{pa(X_i)}|.

Parameters to be estimated:
θ_ijk = P(X_i = j | pa(X_i) = k),  i = 1,...,n; j = 1,...,r_i; k = 1,...,q_i

Parameter vector: θ = {θ_ijk | i = 1,...,n; j = 1,...,r_i; k = 1,...,q_i}. Note that Σ_j θ_ijk = 1 for all i, k.

θ_i.. : vector of parameters for P(X_i | pa(X_i)): θ_i.. = {θ_ijk | j = 1,...,r_i; k = 1,...,q_i}
θ_i.k : vector of parameters for P(X_i | pa(X_i)=k): θ_i.k = {θ_ijk | j = 1,...,r_i}
The Parameters: Example

Consider the Bayesian network with structure X1 → X3 ← X2. Assume all variables are binary, taking values 1 and 2.

θ_111 = P(X1=1), θ_121 = P(X1=2)
θ_211 = P(X2=1), θ_221 = P(X2=2)
pa(X3)=1: θ_311 = P(X3=1 | X1=1, X2=1), θ_321 = P(X3=2 | X1=1, X2=1)
pa(X3)=2: θ_312 = P(X3=1 | X1=1, X2=2), θ_322 = P(X3=2 | X1=1, X2=2)
pa(X3)=3: θ_313 = P(X3=1 | X1=2, X2=1), θ_323 = P(X3=2 | X1=2, X2=1)
pa(X3)=4: θ_314 = P(X3=1 | X1=2, X2=2), θ_324 = P(X3=2 | X1=2, X2=2)
Data

A complete case D_l: a vector of values, one for each variable.
Example: D_l = (X1=1, X2=2, X3=2)

Given: a set of complete cases D = {D_1, D_2, ..., D_m}. Example:

X1 X2 X3    X1 X2 X3
 1  1  1     2  1  1
 1  1  2     2  1  2
 1  1  2     2  2  1
 1  2  2     2  2  1
 1  2  2     2  2  2
 1  2  2     2  2  2
 2  1  1     2  2  2
 2  1  1     2  2  2

Find: the ML estimates of the parameters θ.
The Loglikelihood Function

Loglikelihood:
l(θ | D) = log L(θ | D) = log P(D | θ) = log ∏_l P(D_l | θ) = Σ_l log P(D_l | θ).

The term log P(D_l | θ), e.g. for D_4 = (1, 2, 2):
log P(D_4 | θ) = log P(X1=1, X2=2, X3=2 | θ)
             = log P(X1=1 | θ) P(X2=2 | θ) P(X3=2 | X1=1, X2=2, θ)
             = log θ_111 + log θ_221 + log θ_322.

Recall: θ = {θ_111, θ_121; θ_211, θ_221; θ_311, θ_312, θ_313, θ_314, θ_321, θ_322, θ_323, θ_324}
The Loglikelihood Function

Define the characteristic function of a case D_l:
χ(i, j, k : D_l) = 1 if X_i = j and pa(X_i) = k in D_l; 0 otherwise.

When l = 4, D_4 = (1, 2, 2):
χ(1,1,1 : D_4) = χ(2,2,1 : D_4) = χ(3,2,2 : D_4) = 1
χ(i,j,k : D_4) = 0 for all other i, j, k

So log P(D_4 | θ) = Σ_{ijk} χ(i,j,k : D_4) log θ_ijk.
In general, log P(D_l | θ) = Σ_{ijk} χ(i,j,k : D_l) log θ_ijk.
The Loglikelihood Function

Define m_ijk = Σ_l χ(i,j,k : D_l). It is the number of data cases where X_i = j and pa(X_i) = k.

Then
l(θ | D) = Σ_l log P(D_l | θ)
         = Σ_l Σ_{i,j,k} χ(i,j,k : D_l) log θ_ijk
         = Σ_{i,j,k} Σ_l χ(i,j,k : D_l) log θ_ijk
         = Σ_{ijk} m_ijk log θ_ijk
         = Σ_{i,k} Σ_j m_ijk log θ_ijk.   (4)

Sufficient statistics: the m_ijk.
MLE

Want: arg max_θ l(θ | D) = arg max_θ Σ_{i,k} Σ_j m_ijk log θ_ijk

Note that θ_ijk = P(X_i=j | pa(X_i)=k) and θ_i′j′k′ = P(X_i′=j′ | pa(X_i′)=k′) are not related if either i ≠ i′ or k ≠ k′. Consequently, we can separately maximize each term in the summation Σ_{i,k}[...]:
arg max_{θ_i.k} Σ_j m_ijk log θ_ijk
MLE

By Corollary 1.1, we get
θ_ijk* = m_ijk / Σ_j m_ijk

In words, the MLE for θ_ijk = P(X_i=j | pa(X_i)=k) is:
θ_ijk* = (number of cases where X_i=j and pa(X_i)=k) / (number of cases where pa(X_i)=k)
Example

Example (network X1 → X3 ← X2, with the 16-case data set shown earlier):
MLE for P(X1=1): 6/16
MLE for P(X2=1): 7/16
MLE for P(X3=1 | X1=2, X2=2): 2/6
...
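The counting recipe can be verified mechanically on the 16-case data set from the data slide. A minimal sketch (the helper `mle` is an illustrative name, not from the lecture):

```python
from fractions import Fraction

# The 16 complete cases (X1, X2, X3) from the data slide.
data = [
    (1, 1, 1), (1, 1, 2), (1, 1, 2), (1, 2, 2), (1, 2, 2), (1, 2, 2),
    (2, 1, 1), (2, 1, 1), (2, 1, 1), (2, 1, 2), (2, 2, 1), (2, 2, 1),
    (2, 2, 2), (2, 2, 2), (2, 2, 2), (2, 2, 2),
]

def mle(data, var, val, parents=(), pvals=()):
    """MLE of P(X_var = val | parents = pvals): ratio of matching-case counts.
    Variables are 1-indexed as in the slides."""
    match = [c for c in data
             if all(c[p - 1] == v for p, v in zip(parents, pvals))]
    hits = sum(1 for c in match if c[var - 1] == val)
    return Fraction(hits, len(match))

print(mle(data, 1, 1))                  # 6/16
print(mle(data, 2, 1))                  # 7/16
print(mle(data, 3, 1, (1, 2), (2, 2)))  # 2/6
```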
A Question

Start from a joint distribution P(X) (the generative distribution).
D: a collection of data sampled from P(X).
Let S be a BN structure (DAG) over the variables X.
Learn parameters θ* for the BN structure S from D.
Let P*(X) be the joint probability of the BN (S, θ*). Note: θ_ijk* = P*(X_i=j | pa_S(X_i)=k).

How is P* related to P?
MLE in General Bayesian Networks with Complete Data

[Figure: the set of distributions that factorize according to S, with P* converging to the member of that set closest to P.]

We will show that, with probability 1, P* converges to the distribution that
- factorizes according to S, and
- is closest to P under KL divergence among all distributions that factorize according to S.

P* factorizes according to S. P does not necessarily factorize according to S.
If P factorizes according to S, P* converges to P with probability 1. (MLE is consistent.)
The Target Distribution

Define θ^S_ijk = P(X_i=j | pa_S(X_i)=k).
Let P^S(X) be the joint distribution of the BN (S, θ^S).

P^S factorizes according to S, and for any X ∈ X, P^S(X | pa(X)) = P(X | pa(X)).
If P factorizes according to S, then P and P^S are identical.
If P does not factorize according to S, then P and P^S are different.
First Theorem

Theorem (6.1): Among all distributions Q that factorize according to S, the KL divergence KL(P, Q) is minimized by Q = P^S.

P^S is the closest to P among all distributions that factorize according to S.

Proof: Since
KL(P, Q) = Σ_X P(X) log [P(X)/Q(X)],
it suffices to show the following.
Proposition: Q = P^S maximizes Σ_X P(X) log Q(X).

We show the proposition by induction on the number of nodes. When there is only one node, it follows from a property of KL divergence (Corollary 1.1).
First Theorem

Suppose the proposition is true for the case of n nodes. Consider the case of n+1 nodes.
Let X be a leaf node, X′ = X \ {X}, and let S′ be the structure obtained from S by removing X. Then
Σ_X P(X) log Q(X) = Σ_{X′} P(X′) log Q(X′) + Σ_{pa(X)} P(pa(X)) Σ_X P(X | pa(X)) log Q(X | pa(X))

By the induction hypothesis, the first term is maximized by P^{S′}. By Corollary 1.1, the second term is maximized if Q(X | pa(X)) = P(X | pa(X)). Hence the sum is maximized by P^S.
Second Theorem

Theorem (6.2): lim_{N→∞} P*(X=x) = P^S(X=x) with probability 1, where N is the sample size, i.e., the number of cases in D.

Proof: Let P̂(X) be the empirical distribution:
P̂(X=x) = fraction of cases in D where X=x

It is clear that
P*(X_i=j | pa_S(X_i)=k) = θ_ijk* = P̂(X_i=j | pa_S(X_i)=k)
Second Theorem

On the other hand, by the law of large numbers, we have
lim_{N→∞} P̂(X=x) = P(X=x) with probability 1.

Hence
lim_{N→∞} P*(X_i=j | pa_S(X_i)=k) = lim_{N→∞} P̂(X_i=j | pa_S(X_i)=k)
  = P(X_i=j | pa_S(X_i)=k) with probability 1
  = P^S(X_i=j | pa_S(X_i)=k)

Because both P* and P^S factorize according to S, the theorem follows. Q.E.D.
A Corollary

Corollary: If P factorizes according to S, then
lim_{N→∞} P*(X=x) = P(X=x) with probability 1.
Bayesian Estimation

View θ as a vector of random variables with prior distribution p(θ).

Posterior:
p(θ | D) ∝ p(θ) L(θ | D) = p(θ) ∏_{i,k} ∏_j θ_ijk^{m_ijk}
where the equation follows from (4).

Assumptions need to be made about the prior distribution.
Assumptions

Global independence in the prior distribution:
p(θ) = ∏_i p(θ_i..)

Local independence in the prior distribution: for each i,
p(θ_i..) = ∏_k p(θ_i.k)

Parameter independence = global independence + local independence:
p(θ) = ∏_{i,k} p(θ_i.k)
Assumptions

Further assume that p(θ_i.k) is a Dirichlet distribution Dir(α_i1k, α_i2k, ..., α_{i r_i k}):
p(θ_i.k) ∝ ∏_j θ_ijk^{α_ijk − 1}

Then
p(θ) ∝ ∏_{i,k} ∏_j θ_ijk^{α_ijk − 1}
This is a product Dirichlet distribution.
Bayesian Estimation

Posterior:
p(θ | D) ∝ p(θ) ∏_{i,k} ∏_j θ_ijk^{m_ijk}
        = [∏_{i,k} ∏_j θ_ijk^{α_ijk − 1}] ∏_{i,k} ∏_j θ_ijk^{m_ijk}
        = ∏_{i,k} ∏_j θ_ijk^{m_ijk + α_ijk − 1}

It is also a product Dirichlet distribution. (Think: what does this mean?)
Prediction

Predicting D_{m+1} = {X_1^{m+1}, X_2^{m+1}, ..., X_n^{m+1}} (random variables). For notational simplicity, write D_{m+1} = {X_1, X_2, ..., X_n}.

First, we have:
P(D_{m+1} | D) = P(X_1, X_2, ..., X_n | D) = ∏_i P(X_i | pa(X_i), D)
Proof

P(D_{m+1} | D) = ∫ P(D_{m+1} | θ) p(θ | D) dθ

P(D_{m+1} | θ) = P(X_1, X_2, ..., X_n | θ) = ∏_i P(X_i | pa(X_i), θ) = ∏_i P(X_i | pa(X_i), θ_i..)

p(θ | D) = ∏_i p(θ_i.. | D)

Hence
P(D_{m+1} | D) = ∏_i ∫ P(X_i | pa(X_i), θ_i..) p(θ_i.. | D) dθ_i.. = ∏_i P(X_i | pa(X_i), D)
Prediction

Further, we have
P(X_i=j | pa(X_i)=k, D) = ∫ P(X_i=j | pa(X_i)=k, θ_ijk) p(θ_ijk | D) dθ_ijk
                        = ∫ θ_ijk p(θ_ijk | D) dθ_ijk

Because
p(θ_i.k | D) ∝ ∏_j θ_ijk^{m_ijk + α_ijk − 1}

we have
∫ θ_ijk p(θ_ijk | D) dθ_ijk = (m_ijk + α_ijk) / Σ_j (m_ijk + α_ijk)
Prediction

Conclusion:
P(X_1, X_2, ..., X_n | D) = ∏_i P(X_i | pa(X_i), D)
where
P(X_i=j | pa(X_i)=k, D) = (m_ijk + α_ijk) / (m_i*k + α_i*k)
with m_i*k = Σ_j m_ijk and α_i*k = Σ_j α_ijk.

Notes:
- Conditional independence (structure) is preserved after absorbing D.
- This is an important property for sequential learning, where we process one case at a time. The final result is independent of the order in which the cases are processed.
- Comparison with MLE: θ_ijk* = m_ijk / Σ_j m_ijk.
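The sequential-learning note above can be sketched for a single family (fixed i and parent configuration k): absorbing a case just increments one hyperparameter, and the final prediction does not depend on the processing order. The binary variable and uniform prior here are illustrative assumptions, as are the helper names.

```python
def absorb(alphas, j):
    """Absorb one case with X_i = j (parents in configuration k): alpha_ijk += 1."""
    updated = list(alphas)
    updated[j] += 1
    return updated

def predict(alphas):
    """P(X_i=j | pa=k, D) = (m_ijk + alpha_ijk) / (m_i*k + alpha_i*k);
    the counts are already folded into the updated hyperparameters."""
    total = sum(alphas)
    return [a / total for a in alphas]

prior = [1, 1]        # uniform prior for a binary X_i given pa(X_i) = k
cases = [0, 1, 0, 0]  # observed values of X_i in this parent configuration

a1 = prior
for j in cases:
    a1 = absorb(a1, j)

a2 = prior
for j in reversed(cases):  # process the same cases in a different order
    a2 = absorb(a2, j)

print(predict(a1), predict(a2))  # identical: order does not matter
```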
Summary

θ: random variable.

Prior p(θ): product Dirichlet distribution
p(θ) ∝ ∏_{i,k} ∏_j θ_ijk^{α_ijk − 1}

Posterior p(θ | D): also a product Dirichlet distribution
p(θ | D) ∝ ∏_{i,k} ∏_j θ_ijk^{m_ijk + α_ijk − 1}

Prediction:
P(D_{m+1} | D) = P(X_1, X_2, ..., X_n | D) = ∏_i P(X_i | pa(X_i), D)
where P(X_i=j | pa(X_i)=k, D) = (m_ijk + α_ijk) / (m_i*k + α_i*k).