
1 Overview of Course

So far, we have studied:
- The concept of Bayesian networks
- Independence and separation in Bayesian networks
- Inference in Bayesian networks

The rest of the course: data analysis using Bayesian networks
- Parameter learning: learn parameters for a given structure.
- Structure learning: learn both structure and parameters.
- Learning latent structures: discover latent variables behind observed variables and determine their relationships.

2 COMP538: Introduction to Bayesian Networks
Lecture 6: Parameter Learning in Bayesian Networks

Nevin L. Zhang
Department of Computer Science and Engineering
Hong Kong University of Science and Technology
Fall 2008

3 Objective

Objective:
- Principles for parameter learning in Bayesian networks.
- Algorithms for the case of complete data.

Reading: Zhang and Guo (2007), Chapter 7
Reference: Heckerman (1996) (first half); Cowell et al. (1999), Chapter 9

4 Outline

1 Problem Statement

2 Principles of Parameter Learning
  - Maximum likelihood estimation
  - Bayesian estimation
  - Variable with Multiple Values

3 Parameter Estimation in General Bayesian Networks
  - The Parameters
  - Maximum likelihood estimation
  - Properties of MLE
  - Bayesian estimation

5 Problem Statement: Parameter Learning

Given:
- A Bayesian network structure over variables X_1, X_2, X_3, X_4, X_5 (the DAG shown on the slide).
- A data set of complete cases over those variables (the table shown on the slide).

Estimate the conditional probabilities:
    P(X_1), P(X_2), P(X_3 | X_1, X_2), P(X_4 | X_1), P(X_5 | X_1, X_3, X_4)

6 Outline

1 Problem Statement

2 Principles of Parameter Learning
  - Maximum likelihood estimation
  - Bayesian estimation
  - Variable with Multiple Values

3 Parameter Estimation in General Bayesian Networks
  - The Parameters
  - Maximum likelihood estimation
  - Properties of MLE
  - Bayesian estimation

7 Single-Node Bayesian Network

Consider a Bayesian network with one node X, where X is the result of tossing a thumbtack and Ω_X = {H, T}.

Data cases: D_1 = H, D_2 = T, D_3 = H, ..., D_m = H
Data set: D = {D_1, D_2, D_3, ..., D_m}

Estimate the parameter: θ = P(X = H).

8 Likelihood

Data: D = {H, T, H, T, T, H, T}

As possible values of θ, which of the following is the most likely? Why?
- θ = 0
- θ = 0.01
- θ = 0.5

θ = 0 contradicts the data because P(D | θ=0) = 0. It cannot explain the data at all.

θ = 0.01 almost contradicts the data. It does not explain the data well. However, it is more consistent with the data than θ = 0 because P(D | θ=0.01) > P(D | θ=0).

θ = 0.5 is more consistent with the data than θ = 0.01 because P(D | θ=0.5) > P(D | θ=0.01). It explains the data best among the three and is hence the most likely.
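A minimal sketch of this comparison (our own helper, not lecture code): it evaluates P(D | θ) for the three candidate values, treating the tosses as independent as formalized on the following slides.

```python
# Sketch (not from the lecture): compare how well candidate values of theta explain
# the data D = {H, T, H, T, T, H, T} by evaluating P(D | theta) directly.
data = ['H', 'T', 'H', 'T', 'T', 'H', 'T']

def likelihood(theta, data):
    """P(D | theta), treating the tosses as independent with P(H) = theta."""
    p = 1.0
    for d in data:
        p *= theta if d == 'H' else (1.0 - theta)
    return p

for theta in (0.0, 0.01, 0.5):
    print(f"theta = {theta:<4}: P(D | theta) = {likelihood(theta, data):.3g}")
```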

9 Maximum Likelihood Estimation

In general, the larger P(D | θ=v) is, the more likely θ = v is.

Likelihood of parameter θ given data set D:
    L(θ | D) = P(D | θ)

The maximum likelihood estimate (MLE) θ* of θ is a value of θ such that
    L(θ* | D) = sup_θ L(θ | D).

The MLE best explains, or best fits, the data.

10 i.i.d. and Likelihood

Assume the data cases D_1, ..., D_m are independent given θ:
    P(D_1, ..., D_m | θ) = ∏_{i=1}^{m} P(D_i | θ)

Assume the data cases are identically distributed:
    P(D_i = H) = θ,  P(D_i = T) = 1 − θ   for all i
(Note: i.i.d. means independent and identically distributed.)

Then
    L(θ | D) = P(D | θ) = P(D_1, ..., D_m | θ)
             = ∏_{i=1}^{m} P(D_i | θ)
             = θ^{m_h} (1 − θ)^{m_t}            (1)
where m_h is the number of heads and m_t is the number of tails. This is the binomial likelihood.

11 Example of Likelihood Function

Example: D = {D_1 = H, D_2 = T, D_3 = H, D_4 = H, D_5 = T}

L(θ | D) = P(D | θ)
         = P(D_1 = H | θ) P(D_2 = T | θ) P(D_3 = H | θ) P(D_4 = H | θ) P(D_5 = T | θ)
         = θ (1 − θ) θ θ (1 − θ)
         = θ^3 (1 − θ)^2.

12 Sufficient Statistic

A sufficient statistic is a function s(D) of the data that summarizes the relevant information for computing the likelihood. That is,
    s(D) = s(D′)  implies  L(θ | D) = L(θ | D′)

Sufficient statistics tell us all there is to know about the data.

Since L(θ | D) = θ^{m_h} (1 − θ)^{m_t}, the pair (m_h, m_t) is a sufficient statistic.

13 Loglikelihood

Loglikelihood:
    l(θ | D) = log L(θ | D) = log θ^{m_h} (1 − θ)^{m_t} = m_h log θ + m_t log(1 − θ)

Maximizing the likelihood is the same as maximizing the loglikelihood. The latter is easier.

By Corollary 1.1 of Lecture 1, the following value maximizes l(θ | D):
    θ* = m_h / (m_h + m_t) = m_h / m

MLE is intuitive. It also has nice properties, e.g. consistency: θ* approaches the true value of θ with probability 1 as m goes to infinity.
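A small sketch under the same setup (function names and the example counts are ours): the closed-form MLE from the counts, checked against a grid search over the loglikelihood.

```python
# Sketch (not lecture code): closed-form MLE theta* = m_h / m for the thumbtack model,
# with a coarse grid search over the loglikelihood as a sanity check.
import math

def mle(m_h, m_t):
    return m_h / (m_h + m_t)

def loglik(theta, m_h, m_t):
    """l(theta | D) = m_h log(theta) + m_t log(1 - theta), for 0 < theta < 1."""
    return m_h * math.log(theta) + m_t * math.log(1.0 - theta)

m_h, m_t = 3, 7                                  # example counts
grid = [i / 1000 for i in range(1, 1000)]        # avoid the endpoints 0 and 1
print("closed-form MLE:", mle(m_h, m_t))                                 # 0.3
print("grid-search MLE:", max(grid, key=lambda t: loglik(t, m_h, m_t)))  # 0.3
```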

14 Drawback of MLE

Thumbtack tossing: (m_h, m_t) = (3, 7). MLE: θ* = 0.3.
Reasonable. The data suggest that the thumbtack is biased toward tails.

Coin tossing:
Case 1: (m_h, m_t) = (3, 7). MLE: θ* = 0.3.
Not reasonable. Our experience (prior) suggests strongly that coins are fair, hence θ = 1/2. The data set is too small to convince us that this particular coin is biased. The fact that we get (3, 7) instead of (5, 5) is probably due to randomness.

Case 2: (m_h, m_t) = (30,000, 70,000). MLE: θ* = 0.3.
Reasonable. The data suggest that the coin is after all biased, overshadowing our prior.

MLE does not differentiate between these two cases. It does not take prior information into account.

15 Two Views on Parameter Estimation

MLE:
- Assumes that θ is an unknown but fixed parameter.
- Estimates it using θ*, the value that maximizes the likelihood function.
- Makes predictions based on the estimate: P(D_{m+1} = H | D) = θ*

Bayesian estimation:
- Treats θ as a random variable.
- Assumes a prior probability of θ: p(θ)
- Uses data to get the posterior probability of θ: p(θ | D)

16 Two Views on Parameter Estimation

Bayesian estimation: predicting D_{m+1}

P(D_{m+1} = H | D) = ∫ P(D_{m+1} = H, θ | D) dθ
                   = ∫ P(D_{m+1} = H | θ, D) p(θ | D) dθ
                   = ∫ P(D_{m+1} = H | θ) p(θ | D) dθ
                   = ∫ θ p(θ | D) dθ.

Full Bayesian: take the expectation over θ.
Bayesian MAP: P(D_{m+1} = H | D) = θ̂, where θ̂ = arg max_θ p(θ | D)

17 Calculating Bayesian Estimation

Posterior distribution:
    p(θ | D) ∝ p(θ) L(θ | D) = θ^{m_h} (1 − θ)^{m_t} p(θ)
where the second step follows from (1).

To facilitate analysis, assume the prior is a Beta distribution B(α_h, α_t). Then
    p(θ) ∝ θ^{α_h − 1} (1 − θ)^{α_t − 1}
    p(θ | D) ∝ θ^{m_h + α_h − 1} (1 − θ)^{m_t + α_t − 1}        (2)

18 Beta Distribution

Density function of the prior Beta distribution B(α_h, α_t):
    p(θ) = [Γ(α_t + α_h) / (Γ(α_t) Γ(α_h))] θ^{α_h − 1} (1 − θ)^{α_t − 1}

The normalization constant for B(α_h, α_t) is Γ(α_t + α_h) / (Γ(α_t) Γ(α_h)), where Γ(·) is the Gamma function. For any integer α, Γ(α) = (α − 1)!. It is also defined for non-integers.

The hyperparameters α_h and α_t can be thought of as imaginary counts from our prior experiences. Their sum α = α_h + α_t is called the equivalent sample size. The larger the equivalent sample size, the more confident we are in our prior.
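For concreteness, a sketch of the Beta density written out from the formula above (our own helper, not lecture code; computed in log space so large imaginary counts do not overflow):

```python
# Sketch: the Beta density B(alpha_h, alpha_t) from the formula above,
# evaluated via lgamma in log space to stay numerically safe.
import math

def beta_pdf(theta, alpha_h, alpha_t):
    log_norm = (math.lgamma(alpha_h + alpha_t)
                - math.lgamma(alpha_h) - math.lgamma(alpha_t))
    return math.exp(log_norm
                    + (alpha_h - 1) * math.log(theta)
                    + (alpha_t - 1) * math.log(1.0 - theta))

# A larger equivalent sample size alpha_h + alpha_t concentrates the prior around 0.5:
for a in (1, 10, 100):
    print(f"B({a},{a}): p(0.5) = {beta_pdf(0.5, a, a):6.2f}   p(0.3) = {beta_pdf(0.3, a, a):6.2f}")
```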

19 Conjugate Families

Binomial likelihood:  θ^{m_h} (1 − θ)^{m_t}
Beta prior:           θ^{α_h − 1} (1 − θ)^{α_t − 1}
Beta posterior:       θ^{m_h + α_h − 1} (1 − θ)^{m_t + α_t − 1}

Beta distributions are hence called a conjugate family for the binomial likelihood. Conjugate families allow a closed form for the posterior distribution of the parameters and a closed-form solution for prediction.

20 Calculating Prediction

We have
    P(D_{m+1} = H | D) = ∫ θ p(θ | D) dθ
                       = c ∫ θ · θ^{m_h + α_h − 1} (1 − θ)^{m_t + α_t − 1} dθ
                       = (m_h + α_h) / (m + α)
where c is the normalization constant, m = m_h + m_t, and α = α_h + α_t.

Consequently,
    P(D_{m+1} = T | D) = (m_t + α_t) / (m + α)

After taking the data D into consideration, our updated belief in X = T is (m_t + α_t)/(m + α).

21 MLE and Bayesian Estimation

As m goes to infinity, P(D_{m+1} = H | D) approaches the MLE m_h/(m_h + m_t), which approaches the true value of θ with probability 1.

Coin tossing example revisited: Suppose α_h = α_t = 100. Equivalent sample size: 200.

In case 1,
    P(D_{m+1} = H | D) = (3 + 100) / (10 + 200) ≈ 0.49
Our prior prevails.

In case 2,
    P(D_{m+1} = H | D) = (30,000 + 100) / (100,000 + 200) ≈ 0.3
Data prevail.
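A tiny sketch of the predictive rule above applied to the two cases (the function name is ours):

```python
# Sketch: posterior-predictive probability of heads, (m_h + alpha_h) / (m + alpha),
# for the coin example with imaginary counts alpha_h = alpha_t = 100.
def predict_heads(m_h, m_t, alpha_h, alpha_t):
    return (m_h + alpha_h) / (m_h + m_t + alpha_h + alpha_t)

print(predict_heads(3, 7, 100, 100))            # case 1: ~0.49, the prior prevails
print(predict_heads(30_000, 70_000, 100, 100))  # case 2: ~0.30, the data prevail
print(predict_heads(3, 7, 1, 1))                # a weak prior stays much closer to the MLE 0.3
```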

22 Variable with Multiple Values

Bayesian networks with a single multi-valued variable: Ω_X = {x_1, x_2, ..., x_r}.

Let θ_i = P(X = x_i) and θ = (θ_1, θ_2, ..., θ_r). Note that θ_i ≥ 0 and Σ_i θ_i = 1.

Suppose that in a data set D there are m_i data cases where X takes value x_i. Then
    L(θ | D) = P(D | θ) = ∏_{j=1}^{N} P(D_j | θ) = ∏_{i=1}^{r} θ_i^{m_i}
This is the multinomial likelihood.

23 Dirichlet Distributions

Conjugate family for the multinomial likelihood: Dirichlet distributions.

A Dirichlet distribution is parameterized by r parameters α_1, α_2, ..., α_r. Its density function is given by
    [Γ(α) / ∏_{i=1}^{r} Γ(α_i)] ∏_{i=1}^{r} θ_i^{α_i − 1}
where α = α_1 + α_2 + ... + α_r. It is the same as the Beta distribution when r = 2.

Fact: For any i,
    ∫ θ_i [Γ(α) / ∏_{k=1}^{r} Γ(α_k)] ∏_{k=1}^{r} θ_k^{α_k − 1} dθ_1 dθ_2 ... dθ_r = α_i / α

24 Calculating Parameter Estimations

If the prior is a Dirichlet distribution Dir(α_1, α_2, ..., α_r), then the posterior p(θ | D) is given by
    p(θ | D) ∝ ∏_{i=1}^{r} θ_i^{m_i + α_i − 1}
So it is the Dirichlet distribution Dir(α_1 + m_1, α_2 + m_2, ..., α_r + m_r).

Bayesian estimation has the following closed form:
    P(D_{m+1} = x_i | D) = ∫ θ_i p(θ | D) dθ = (α_i + m_i) / (α + m)

MLE: θ_i* = m_i / m. (Exercise: prove this.)
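A short sketch for the multi-valued case (helper names and the toy data are ours): it returns both the MLE m_i/m and the Bayesian estimate (α_i + m_i)/(α + m).

```python
# Sketch: MLE vs. Bayesian (Dirichlet-posterior) estimates for a multi-valued variable,
# following the closed forms above. Counts and hyperparameters below are made up.
from collections import Counter

def estimates(data, values, alpha):
    """Return (MLE, Bayesian) estimates of P(X = x) for each x in the domain.

    data   : list of observed values
    values : the domain Omega_X = {x_1, ..., x_r}
    alpha  : dict of Dirichlet hyperparameters alpha_i, one per value
    """
    m = Counter(data)
    n = len(data)
    a = sum(alpha.values())
    mle = {x: m[x] / n for x in values}
    bayes = {x: (m[x] + alpha[x]) / (n + a) for x in values}
    return mle, bayes

data = ['a', 'b', 'a', 'c', 'a', 'b']
mle, bayes = estimates(data, ['a', 'b', 'c'], alpha={'a': 1, 'b': 1, 'c': 1})
print(mle)    # {'a': 0.5, 'b': 0.333..., 'c': 0.166...}
print(bayes)  # {'a': 0.444..., 'b': 0.333..., 'c': 0.222...}
```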

25 Outline

1 Problem Statement

2 Principles of Parameter Learning
  - Maximum likelihood estimation
  - Bayesian estimation
  - Variable with Multiple Values

3 Parameter Estimation in General Bayesian Networks
  - The Parameters
  - Maximum likelihood estimation
  - Properties of MLE
  - Bayesian estimation

26 The Parameters

n variables: X_1, X_2, ..., X_n.
Number of states of X_i: 1, 2, ..., r_i = |Ω_{X_i}|.
Number of configurations of the parents of X_i: 1, 2, ..., q_i = |Ω_{pa(X_i)}|.

Parameters to be estimated:
    θ_ijk = P(X_i = j | pa(X_i) = k),   i = 1,...,n; j = 1,...,r_i; k = 1,...,q_i

Parameter vector: θ = {θ_ijk | i = 1,...,n; j = 1,...,r_i; k = 1,...,q_i}.
Note that Σ_j θ_ijk = 1 for all i, k.

θ_i.. : vector of parameters for P(X_i | pa(X_i))
    θ_i.. = {θ_ijk | j = 1,...,r_i; k = 1,...,q_i}
θ_i.k : vector of parameters for P(X_i | pa(X_i) = k)
    θ_i.k = {θ_ijk | j = 1,...,r_i}

27 The Parameters

Example: Consider the Bayesian network X_1 → X_3 ← X_2. Assume all variables are binary, taking values 1 and 2.

θ_111 = P(X_1=1), θ_121 = P(X_1=2)
θ_211 = P(X_2=1), θ_221 = P(X_2=2)
pa(X_3) = 1 : θ_311 = P(X_3=1 | X_1=1, X_2=1), θ_321 = P(X_3=2 | X_1=1, X_2=1)
pa(X_3) = 2 : θ_312 = P(X_3=1 | X_1=1, X_2=2), θ_322 = P(X_3=2 | X_1=1, X_2=2)
pa(X_3) = 3 : θ_313 = P(X_3=1 | X_1=2, X_2=1), θ_323 = P(X_3=2 | X_1=2, X_2=1)
pa(X_3) = 4 : θ_314 = P(X_3=1 | X_1=2, X_2=2), θ_324 = P(X_3=2 | X_1=2, X_2=2)
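As an illustration of the indexing, one possible in-memory layout of θ for this example, keyed by (i, j, k); the numeric values below are placeholders, not estimates from any data:

```python
# Sketch: parameters theta_ijk for the example network X1 -> X3 <- X2 (all binary),
# stored as a dict keyed by (i, j, k). Values are placeholders for illustration only.
from collections import defaultdict

theta = {
    # X1 and X2 have no parents, so their only parent configuration is k = 1
    (1, 1, 1): 0.6, (1, 2, 1): 0.4,   # P(X1 = 1), P(X1 = 2)
    (2, 1, 1): 0.7, (2, 2, 1): 0.3,   # P(X2 = 1), P(X2 = 2)
    # X3 has 4 parent configurations k = 1..4, one per (X1, X2) combination
    (3, 1, 1): 0.9, (3, 2, 1): 0.1,   # P(X3 = . | X1=1, X2=1)
    (3, 1, 2): 0.5, (3, 2, 2): 0.5,   # P(X3 = . | X1=1, X2=2)
    (3, 1, 3): 0.2, (3, 2, 3): 0.8,   # P(X3 = . | X1=2, X2=1)
    (3, 1, 4): 0.3, (3, 2, 4): 0.7,   # P(X3 = . | X1=2, X2=2)
}

# Sanity check: for each (i, k), the theta_ijk sum to 1 over j
sums = defaultdict(float)
for (i, j, k), p in theta.items():
    sums[(i, k)] += p
assert all(abs(s - 1.0) < 1e-9 for s in sums.values())
```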

28 Data

A complete case D_l: a vector of values, one for each variable.
Example: D_l = (X_1 = 1, X_2 = 2, X_3 = 2)

Given: a set of complete cases D = {D_1, D_2, ..., D_m}, e.g. a table with one row per case and one column per variable X_1, X_2, X_3.

Find: the ML estimates of the parameters θ.

29 The Loglikelihood Function

Loglikelihood:
    l(θ | D) = log L(θ | D) = log P(D | θ) = log ∏_l P(D_l | θ) = Σ_l log P(D_l | θ).

The term log P(D_l | θ): for D_4 = (1, 2, 2),
    log P(D_4 | θ) = log P(X_1=1, X_2=2, X_3=2 | θ)
                   = log P(X_1=1 | θ) P(X_2=2 | θ) P(X_3=2 | X_1=1, X_2=2, θ)
                   = log θ_111 + log θ_221 + log θ_322.

Recall: θ = {θ_111, θ_121; θ_211, θ_221; θ_311, θ_312, θ_313, θ_314, θ_321, θ_322, θ_323, θ_324}

30 The Loglikelihood Function

Define the characteristic function of case D_l:
    χ(i, j, k : D_l) = 1 if X_i = j and pa(X_i) = k in D_l, and 0 otherwise.

When l = 4, D_4 = (1, 2, 2):
    χ(1, 1, 1 : D_4) = χ(2, 2, 1 : D_4) = χ(3, 2, 2 : D_4) = 1
    χ(i, j, k : D_4) = 0 for all other i, j, k

So, log P(D_4 | θ) = Σ_{i,j,k} χ(i, j, k : D_4) log θ_ijk

In general, log P(D_l | θ) = Σ_{i,j,k} χ(i, j, k : D_l) log θ_ijk

31 The Loglikelihood Function

Sufficient statistics: m_ijk. Define m_ijk = Σ_l χ(i, j, k : D_l). It is the number of data cases where X_i = j and pa(X_i) = k. Then
    l(θ | D) = Σ_l log P(D_l | θ)
             = Σ_l Σ_{i,j,k} χ(i, j, k : D_l) log θ_ijk
             = Σ_{i,j,k} [Σ_l χ(i, j, k : D_l)] log θ_ijk
             = Σ_{i,j,k} m_ijk log θ_ijk
             = Σ_{i,k} Σ_j m_ijk log θ_ijk.        (4)

32 MLE

Want:
    arg max_θ l(θ | D) = arg max_θ Σ_{i,k} Σ_j m_ijk log θ_ijk

Note that θ_ijk = P(X_i=j | pa(X_i)=k) and θ_i′j′k′ = P(X_i′=j′ | pa(X_i′)=k′) are not related if either i ≠ i′ or k ≠ k′. Consequently, we can separately maximize each term in the summation Σ_{i,k}[...]:
    arg max_{θ_i.k} Σ_j m_ijk log θ_ijk

33 MLE

By Corollary 1.1, we get
    θ_ijk* = m_ijk / Σ_{j′} m_ij′k

In words, the MLE estimate for θ_ijk = P(X_i=j | pa(X_i)=k) is:
    θ_ijk* = (number of cases where X_i=j and pa(X_i)=k) / (number of cases where pa(X_i)=k)
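A compact sketch putting slides 30–33 together for the example network (the toy data set and helper names are made up): count the sufficient statistics m_ijk and normalize them within each parent configuration.

```python
# Sketch: sufficient statistics m_ijk and the MLE theta_ijk* for the example network
# X1 -> X3 <- X2 (all binary). Data and helpers are illustrative, not from the lecture.
from collections import Counter

# Parent-configuration index k for X3: (X1, X2) = (1,1)->1, (1,2)->2, (2,1)->3, (2,2)->4
def parent_config(x1, x2):
    return {(1, 1): 1, (1, 2): 2, (2, 1): 3, (2, 2): 4}[(x1, x2)]

data = [(1, 2, 2), (1, 1, 1), (2, 2, 2), (1, 2, 1), (2, 1, 2)]  # cases (X1, X2, X3)

m = Counter()                       # m[(i, j, k)] = number of cases with X_i=j, pa(X_i)=k
for x1, x2, x3 in data:
    m[(1, x1, 1)] += 1              # X1 and X2 have no parents: single parent config k=1
    m[(2, x2, 1)] += 1
    m[(3, x3, parent_config(x1, x2))] += 1

def mle(i, j, k, r_i=2):
    """theta_ijk* = m_ijk / sum_j' m_ij'k (undefined if the parent config never occurs)."""
    denom = sum(m[(i, jp, k)] for jp in range(1, r_i + 1))
    return m[(i, j, k)] / denom if denom else None

print(mle(1, 1, 1))   # P(X1=1): 3/5
print(mle(3, 2, 2))   # P(X3=2 | X1=1, X2=2): from the cases with (X1, X2) = (1, 2)
```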

34 Example

Example (same network X_1 → X_3 ← X_2, with a data set of 16 complete cases shown on the slide):

MLE for P(X_1=1) is 6/16.
MLE for P(X_2=1) is 7/16.
MLE for P(X_3=1 | X_1=2, X_2=2) is 2/6.
...

35 A Question

Start from a joint distribution P(X) (the generative distribution).
D: a collection of data sampled from P(X).
Let S be a BN structure (DAG) over the variables X.
Learn parameters θ* for BN structure S from D.
Let P*(X) be the joint probability of the BN (S, θ*). Note: θ*_ijk = P*(X_i=j | pa_S(X_i)=k)

How is P* related to P?

36 MLE in General Bayesian Networks with Complete Data

(Figure: P, P*, and the set of distributions that factorize according to S.)

We will show that, with probability 1, P* converges to the distribution that
- factorizes according to S, and
- is closest to P under KL divergence among all distributions that factorize according to S.

P* factorizes according to S. P does not necessarily factorize according to S.

If P factorizes according to S, P* converges to P with probability 1. (MLE is consistent.)

37 The Target Distribution

Define θ^S_ijk = P(X_i=j | pa_S(X_i)=k).
Let P_S(X) be the joint distribution of the BN (S, θ^S).

P_S factorizes according to S and, for any X ∈ X,
    P_S(X | pa(X)) = P(X | pa(X))

If P factorizes according to S, then P and P_S are identical.
If P does not factorize according to S, then P and P_S are different.

38 First Theorem

Theorem (6.1): Among all distributions Q that factorize according to S, the KL divergence KL(P, Q) is minimized by Q = P_S.

P_S is the closest to P among all those that factorize according to S.

Proof: Since
    KL(P, Q) = Σ_X P(X) log [P(X) / Q(X)],
it suffices to show the following.

Proposition: Q = P_S maximizes Σ_X P(X) log Q(X).

We show the claim by induction on the number of nodes. When there is only one node, the proposition follows from a property of KL divergence (Corollary 1.1).

39 First Theorem

Suppose the proposition is true for the case of n nodes. Consider the case of n+1 nodes.

Let X be a leaf node, let X′ denote the remaining variables, and let S′ be the structure obtained from S by removing X. Then
    Σ P(X′, X) log Q(X′, X) = Σ_{X′} P(X′) log Q(X′) + Σ_{pa(X)} P(pa(X)) Σ_X P(X | pa(X)) log Q(X | pa(X))

By the induction hypothesis, the first term is maximized by P_{S′}. By Corollary 1.1, the second term is maximized if Q(X | pa(X)) = P(X | pa(X)). Hence the sum is maximized by P_S.

40 Second Theorem

Theorem (6.2): lim_{N→∞} P*(X=x) = P_S(X=x) with probability 1, where N is the sample size, i.e. the number of cases in D.

Proof: Let P̂(X) be the empirical distribution:
    P̂(X=x) = fraction of cases in D where X=x

It is clear that
    P*(X_i=j | pa_S(X_i)=k) = θ*_ijk = P̂(X_i=j | pa_S(X_i)=k)

41 Second Theorem

On the other hand, by the law of large numbers, we have
    lim_{N→∞} P̂(X=x) = P(X=x) with probability 1

Hence
    lim_{N→∞} P*(X_i=j | pa_S(X_i)=k) = lim_{N→∞} P̂(X_i=j | pa_S(X_i)=k)
                                      = P(X_i=j | pa_S(X_i)=k) with probability 1
                                      = P_S(X_i=j | pa_S(X_i)=k)

Because both P* and P_S factorize according to S, the theorem follows. Q.E.D.

42 A Corollary

Corollary: If P factorizes according to S, then
    lim_{N→∞} P*(X=x) = P(X=x) with probability 1

43 Bayesian Estimation

View θ as a vector of random variables with prior distribution p(θ).

Posterior:
    p(θ | D) ∝ p(θ) L(θ | D) = p(θ) ∏_{i,k} ∏_j θ_ijk^{m_ijk}
where the equation follows from (4).

Assumptions need to be made about the prior distribution.

44 Assumptions

Global independence in the prior distribution:
    p(θ) = ∏_i p(θ_i..)

Local independence in the prior distribution: for each i,
    p(θ_i..) = ∏_k p(θ_i.k)

Parameter independence = global independence + local independence:
    p(θ) = ∏_{i,k} p(θ_i.k)

45 Assumptions

Further assume that p(θ_i.k) is a Dirichlet distribution Dir(α_i1k, α_i2k, ..., α_{i r_i k}):
    p(θ_i.k) ∝ ∏_j θ_ijk^{α_ijk − 1}

Then p(θ) is a product Dirichlet distribution:
    p(θ) ∝ ∏_{i,k} ∏_j θ_ijk^{α_ijk − 1}

46 Bayesian Estimation

Posterior:
    p(θ | D) ∝ p(θ) ∏_{i,k} ∏_j θ_ijk^{m_ijk}
             = [∏_{i,k} ∏_j θ_ijk^{α_ijk − 1}] ∏_{i,k} ∏_j θ_ijk^{m_ijk}
             = ∏_{i,k} ∏_j θ_ijk^{m_ijk + α_ijk − 1}

It is also a product Dirichlet distribution. (Think: what does this mean?)

47 Prediction

Predicting D_{m+1} = {X_1^{m+1}, X_2^{m+1}, ..., X_n^{m+1}}. These are random variables. For notational simplicity, simply write D_{m+1} = {X_1, X_2, ..., X_n}.

First, we have:
    P(D_{m+1} | D) = P(X_1, X_2, ..., X_n | D) = ∏_i P(X_i | pa(X_i), D)

48 Proof

P(D_{m+1} | D) = ∫ P(D_{m+1} | θ) p(θ | D) dθ

P(D_{m+1} | θ) = P(X_1, X_2, ..., X_n | θ) = ∏_i P(X_i | pa(X_i), θ) = ∏_i P(X_i | pa(X_i), θ_i..)

p(θ | D) = ∏_i p(θ_i.. | D)

Hence
    P(D_{m+1} | D) = ∏_i ∫ P(X_i | pa(X_i), θ_i..) p(θ_i.. | D) dθ_i.. = ∏_i P(X_i | pa(X_i), D)

49 Prediction

Further, we have
    P(X_i=j | pa(X_i)=k, D) = ∫ P(X_i=j | pa(X_i)=k, θ_ijk) p(θ_ijk | D) dθ_ijk
                            = ∫ θ_ijk p(θ_ijk | D) dθ_ijk

Because
    p(θ_i.k | D) ∝ ∏_j θ_ijk^{m_ijk + α_ijk − 1}
we have
    ∫ θ_ijk p(θ_ijk | D) dθ_ijk = (m_ijk + α_ijk) / Σ_{j′} (m_ij′k + α_ij′k)

50 Prediction

Conclusion:
    P(X_1, X_2, ..., X_n | D) = ∏_i P(X_i | pa(X_i), D)
where
    P(X_i=j | pa(X_i)=k, D) = (m_ijk + α_ijk) / (m_i·k + α_i·k)
with m_i·k = Σ_j m_ijk and α_i·k = Σ_j α_ijk.

Notes:
- Conditional independence, i.e. the structure, is preserved after absorbing D.
- This is an important property for sequential learning, where we process one case at a time. The final result is independent of the order in which the cases are processed.
- Comparison with the MLE estimate: θ_ijk* = m_ijk / Σ_{j′} m_ij′k
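A sketch of this Bayesian counterpart on the same toy data as the MLE sketch above, with the uniform choice α_ijk = 1 (our assumption for illustration, not from the lecture):

```python
# Sketch: Bayesian estimates P(X_i=j | pa=k, D) = (m_ijk + alpha_ijk) / (m_i.k + alpha_i.k)
# for the example network X1 -> X3 <- X2, using made-up data and alpha_ijk = 1.
from collections import Counter

def parent_config(x1, x2):
    return {(1, 1): 1, (1, 2): 2, (2, 1): 3, (2, 2): 4}[(x1, x2)]

data = [(1, 2, 2), (1, 1, 1), (2, 2, 2), (1, 2, 1), (2, 1, 2)]
m = Counter()
for x1, x2, x3 in data:
    m[(1, x1, 1)] += 1
    m[(2, x2, 1)] += 1
    m[(3, x3, parent_config(x1, x2))] += 1

def bayes_estimate(i, j, k, r_i=2, alpha=1.0):
    """Posterior-predictive estimate with Dirichlet hyperparameters alpha_ijk = alpha."""
    num = m[(i, j, k)] + alpha
    den = sum(m[(i, jp, k)] + alpha for jp in range(1, r_i + 1))
    return num / den

# Unlike the MLE, this stays well defined (and away from 0 and 1) for sparse counts:
print(bayes_estimate(1, 1, 1))   # (3+1)/(5+2) = 4/7
print(bayes_estimate(3, 1, 4))   # parent config (X1,X2)=(2,2) seen once, with X3=2: (0+1)/(1+2)
```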

51 Summary

θ: random variable.

Prior p(θ): product Dirichlet distribution
    p(θ) = ∏_{i,k} p(θ_i.k) ∝ ∏_{i,k} ∏_j θ_ijk^{α_ijk − 1}

Posterior p(θ | D): also a product Dirichlet distribution
    p(θ | D) ∝ ∏_{i,k} ∏_j θ_ijk^{m_ijk + α_ijk − 1}

Prediction:
    P(D_{m+1} | D) = P(X_1, X_2, ..., X_n | D) = ∏_i P(X_i | pa(X_i), D)
where
    P(X_i=j | pa(X_i)=k, D) = (m_ijk + α_ijk) / (m_i·k + α_i·k)
