
1 Overview of Course

So far, we have studied:
- The concept of Bayesian networks
- Independence and separation in Bayesian networks
- Inference in Bayesian networks

The rest of the course: data analysis using Bayesian networks
- Parameter learning: learn parameters for a given structure.
- Structure learning: learn both structure and parameters.
- Learning latent structures: discover latent variables behind observed variables and determine their relationships.

2 COMP538: Introduction to Bayesian Networks
Lecture 6: Parameter Learning in Bayesian Networks

Nevin L. Zhang
Department of Computer Science and Engineering
Hong Kong University of Science and Technology
Fall 2008

3 Objective

Objective:
- Principles for parameter learning in Bayesian networks.
- Algorithms for the case of complete data.

Reading: Zhang and Guo (2007), Chapter 7
Reference: Heckerman (1996) (first half); Cowell et al. (1999), Chapter 9

4 Outline

1 Problem Statement

2 Principles of Parameter Learning
  - Maximum likelihood estimation
  - Bayesian estimation
  - Variable with Multiple Values

3 Parameter Estimation in General Bayesian Networks
  - The Parameters
  - Maximum likelihood estimation
  - Properties of MLE
  - Bayesian estimation

5 Problem Statement: Parameter Learning

Given:
- A Bayesian network structure over variables X_1, X_2, X_3, X_4, X_5 (the DAG shown on the slide).
- A data set of complete cases over those variables (the table shown on the slide).

Estimate the conditional probabilities:
    P(X_1), P(X_2), P(X_3 | X_1, X_2), P(X_4 | X_1), P(X_5 | X_1, X_3, X_4)

6 Outline

1 Problem Statement

2 Principles of Parameter Learning
  - Maximum likelihood estimation
  - Bayesian estimation
  - Variable with Multiple Values

3 Parameter Estimation in General Bayesian Networks
  - The Parameters
  - Maximum likelihood estimation
  - Properties of MLE
  - Bayesian estimation

7 Single-Node Bayesian Network

Consider a Bayesian network with one node X, where X is the result of tossing a thumbtack and Ω_X = {H, T}.

Data cases: D_1 = H, D_2 = T, D_3 = H, ..., D_m = H
Data set: D = {D_1, D_2, D_3, ..., D_m}

Estimate the parameter: θ = P(X = H).

8 Likelihood

Data: D = {H, T, H, T, T, H, T}

As possible values of θ, which of the following is the most likely? Why?
- θ = 0
- θ = 0.01
- θ = 0.5

θ = 0 contradicts the data because P(D | θ=0) = 0. It cannot explain the data at all.

θ = 0.01 almost contradicts the data. It does not explain the data well. However, it is more consistent with the data than θ = 0 because P(D | θ=0.01) > P(D | θ=0).

θ = 0.5 is more consistent with the data than θ = 0.01 because P(D | θ=0.5) > P(D | θ=0.01). It explains the data best among the three and is hence the most likely.
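A minimal sketch of this comparison (our own helper, not lecture code): it evaluates P(D | θ) for the three candidate values, treating the tosses as independent as formalized on the following slides.

```python
# Sketch (not from the lecture): compare how well candidate values of theta explain
# the data D = {H, T, H, T, T, H, T} by evaluating P(D | theta) directly.
data = ['H', 'T', 'H', 'T', 'T', 'H', 'T']

def likelihood(theta, data):
    """P(D | theta), treating the tosses as independent with P(H) = theta."""
    p = 1.0
    for d in data:
        p *= theta if d == 'H' else (1.0 - theta)
    return p

for theta in (0.0, 0.01, 0.5):
    print(f"theta = {theta:<4}: P(D | theta) = {likelihood(theta, data):.3g}")
```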

9 Maximum Likelihood Estimation

In general, the larger P(D | θ=v) is, the more likely θ = v is.

Likelihood of parameter θ given data set D:
    L(θ | D) = P(D | θ)

The maximum likelihood estimate (MLE) θ* of θ is a value of θ such that
    L(θ* | D) = sup_θ L(θ | D).

The MLE best explains, or best fits, the data.

10 i.i.d. and Likelihood

Assume the data cases D_1, ..., D_m are independent given θ:
    P(D_1, ..., D_m | θ) = ∏_{i=1}^{m} P(D_i | θ)

Assume the data cases are identically distributed:
    P(D_i = H) = θ,  P(D_i = T) = 1 − θ   for all i
(Note: i.i.d. means independent and identically distributed.)

Then
    L(θ | D) = P(D | θ) = P(D_1, ..., D_m | θ)
             = ∏_{i=1}^{m} P(D_i | θ)
             = θ^{m_h} (1 − θ)^{m_t}            (1)
where m_h is the number of heads and m_t is the number of tails. This is the binomial likelihood.

11 Example of Likelihood Function

Example: D = {D_1 = H, D_2 = T, D_3 = H, D_4 = H, D_5 = T}

L(θ | D) = P(D | θ)
         = P(D_1 = H | θ) P(D_2 = T | θ) P(D_3 = H | θ) P(D_4 = H | θ) P(D_5 = T | θ)
         = θ (1 − θ) θ θ (1 − θ)
         = θ^3 (1 − θ)^2.

12 Sufficient Statistic

A sufficient statistic is a function s(D) of the data that summarizes the relevant information for computing the likelihood. That is,
    s(D) = s(D′)  implies  L(θ | D) = L(θ | D′)

Sufficient statistics tell us all there is to know about the data.

Since L(θ | D) = θ^{m_h} (1 − θ)^{m_t}, the pair (m_h, m_t) is a sufficient statistic.

13 Loglikelihood

Loglikelihood:
    l(θ | D) = log L(θ | D) = log θ^{m_h} (1 − θ)^{m_t} = m_h log θ + m_t log(1 − θ)

Maximizing the likelihood is the same as maximizing the loglikelihood. The latter is easier.

By Corollary 1.1 of Lecture 1, the following value maximizes l(θ | D):
    θ* = m_h / (m_h + m_t) = m_h / m

MLE is intuitive. It also has nice properties, e.g. consistency: θ* approaches the true value of θ with probability 1 as m goes to infinity.
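A small sketch under the same setup (function names and the example counts are ours): the closed-form MLE from the counts, checked against a grid search over the loglikelihood.

```python
# Sketch (not lecture code): closed-form MLE theta* = m_h / m for the thumbtack model,
# with a coarse grid search over the loglikelihood as a sanity check.
import math

def mle(m_h, m_t):
    return m_h / (m_h + m_t)

def loglik(theta, m_h, m_t):
    """l(theta | D) = m_h log(theta) + m_t log(1 - theta), for 0 < theta < 1."""
    return m_h * math.log(theta) + m_t * math.log(1.0 - theta)

m_h, m_t = 3, 7                                  # example counts
grid = [i / 1000 for i in range(1, 1000)]        # avoid the endpoints 0 and 1
print("closed-form MLE:", mle(m_h, m_t))                                 # 0.3
print("grid-search MLE:", max(grid, key=lambda t: loglik(t, m_h, m_t)))  # 0.3
```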

14 Drawback of MLE

Thumbtack tossing: (m_h, m_t) = (3, 7). MLE: θ* = 0.3.
Reasonable. The data suggest that the thumbtack is biased toward tails.

Coin tossing:
Case 1: (m_h, m_t) = (3, 7). MLE: θ* = 0.3.
Not reasonable. Our experience (prior) suggests strongly that coins are fair, hence θ = 1/2. The data set is too small to convince us that this particular coin is biased. The fact that we get (3, 7) instead of (5, 5) is probably due to randomness.

Case 2: (m_h, m_t) = (30,000, 70,000). MLE: θ* = 0.3.
Reasonable. The data suggest that the coin is after all biased, overshadowing our prior.

MLE does not differentiate between these two cases. It does not take prior information into account.

15 Two Views on Parameter Estimation

MLE:
- Assumes that θ is an unknown but fixed parameter.
- Estimates it using θ*, the value that maximizes the likelihood function.
- Makes predictions based on the estimate: P(D_{m+1} = H | D) = θ*

Bayesian estimation:
- Treats θ as a random variable.
- Assumes a prior probability of θ: p(θ)
- Uses data to get the posterior probability of θ: p(θ | D)

16 Two Views on Parameter Estimation

Bayesian estimation: predicting D_{m+1}

P(D_{m+1} = H | D) = ∫ P(D_{m+1} = H, θ | D) dθ
                   = ∫ P(D_{m+1} = H | θ, D) p(θ | D) dθ
                   = ∫ P(D_{m+1} = H | θ) p(θ | D) dθ
                   = ∫ θ p(θ | D) dθ.

Full Bayesian: take the expectation over θ.
Bayesian MAP: P(D_{m+1} = H | D) = θ̂, where θ̂ = arg max_θ p(θ | D)

17 Calculating Bayesian Estimation

Posterior distribution:
    p(θ | D) ∝ p(θ) L(θ | D) = θ^{m_h} (1 − θ)^{m_t} p(θ)
where the second step follows from (1).

To facilitate analysis, assume the prior is a Beta distribution B(α_h, α_t). Then
    p(θ) ∝ θ^{α_h − 1} (1 − θ)^{α_t − 1}
    p(θ | D) ∝ θ^{m_h + α_h − 1} (1 − θ)^{m_t + α_t − 1}        (2)

18 Beta Distribution

Density function of the prior Beta distribution B(α_h, α_t):
    p(θ) = [Γ(α_t + α_h) / (Γ(α_t) Γ(α_h))] θ^{α_h − 1} (1 − θ)^{α_t − 1}

The normalization constant for B(α_h, α_t) is Γ(α_t + α_h) / (Γ(α_t) Γ(α_h)), where Γ(·) is the Gamma function. For any integer α, Γ(α) = (α − 1)!. It is also defined for non-integers.

The hyperparameters α_h and α_t can be thought of as imaginary counts from our prior experiences. Their sum α = α_h + α_t is called the equivalent sample size. The larger the equivalent sample size, the more confident we are in our prior.
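For concreteness, a sketch of the Beta density written out from the formula above (our own helper, not lecture code; computed in log space so large imaginary counts do not overflow):

```python
# Sketch: the Beta density B(alpha_h, alpha_t) from the formula above,
# evaluated via lgamma in log space to stay numerically safe.
import math

def beta_pdf(theta, alpha_h, alpha_t):
    log_norm = (math.lgamma(alpha_h + alpha_t)
                - math.lgamma(alpha_h) - math.lgamma(alpha_t))
    return math.exp(log_norm
                    + (alpha_h - 1) * math.log(theta)
                    + (alpha_t - 1) * math.log(1.0 - theta))

# A larger equivalent sample size alpha_h + alpha_t concentrates the prior around 0.5:
for a in (1, 10, 100):
    print(f"B({a},{a}): p(0.5) = {beta_pdf(0.5, a, a):6.2f}   p(0.3) = {beta_pdf(0.3, a, a):6.2f}")
```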

19 Conjugate Families

Binomial likelihood:  θ^{m_h} (1 − θ)^{m_t}
Beta prior:           θ^{α_h − 1} (1 − θ)^{α_t − 1}
Beta posterior:       θ^{m_h + α_h − 1} (1 − θ)^{m_t + α_t − 1}

Beta distributions are hence called a conjugate family for the binomial likelihood. Conjugate families allow a closed form for the posterior distribution of the parameters and a closed-form solution for prediction.

20 Calculating Prediction

We have
    P(D_{m+1} = H | D) = ∫ θ p(θ | D) dθ
                       = c ∫ θ · θ^{m_h + α_h − 1} (1 − θ)^{m_t + α_t − 1} dθ
                       = (m_h + α_h) / (m + α)
where c is the normalization constant, m = m_h + m_t, and α = α_h + α_t.

Consequently,
    P(D_{m+1} = T | D) = (m_t + α_t) / (m + α)

After taking the data D into consideration, our updated belief in X = T is (m_t + α_t)/(m + α).

21 MLE and Bayesian Estimation

As m goes to infinity, P(D_{m+1} = H | D) approaches the MLE m_h/(m_h + m_t), which approaches the true value of θ with probability 1.

Coin tossing example revisited: Suppose α_h = α_t = 100. Equivalent sample size: 200.

In case 1,
    P(D_{m+1} = H | D) = (3 + 100) / (10 + 200) ≈ 0.49
Our prior prevails.

In case 2,
    P(D_{m+1} = H | D) = (30,000 + 100) / (100,000 + 200) ≈ 0.3
Data prevail.
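A tiny sketch of the predictive rule above applied to the two cases (the function name is ours):

```python
# Sketch: posterior-predictive probability of heads, (m_h + alpha_h) / (m + alpha),
# for the coin example with imaginary counts alpha_h = alpha_t = 100.
def predict_heads(m_h, m_t, alpha_h, alpha_t):
    return (m_h + alpha_h) / (m_h + m_t + alpha_h + alpha_t)

print(predict_heads(3, 7, 100, 100))            # case 1: ~0.49, the prior prevails
print(predict_heads(30_000, 70_000, 100, 100))  # case 2: ~0.30, the data prevail
print(predict_heads(3, 7, 1, 1))                # a weak prior stays much closer to the MLE 0.3
```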

22 Variable with Multiple Values

Bayesian networks with a single multi-valued variable: Ω_X = {x_1, x_2, ..., x_r}.

Let θ_i = P(X = x_i) and θ = (θ_1, θ_2, ..., θ_r). Note that θ_i ≥ 0 and Σ_i θ_i = 1.

Suppose that in a data set D there are m_i data cases where X takes value x_i. Then
    L(θ | D) = P(D | θ) = ∏_{j=1}^{N} P(D_j | θ) = ∏_{i=1}^{r} θ_i^{m_i}
This is the multinomial likelihood.

23 Dirichlet Distributions

Conjugate family for the multinomial likelihood: Dirichlet distributions.

A Dirichlet distribution is parameterized by r parameters α_1, α_2, ..., α_r. Its density function is given by
    [Γ(α) / ∏_{i=1}^{r} Γ(α_i)] ∏_{i=1}^{r} θ_i^{α_i − 1}
where α = α_1 + α_2 + ... + α_r. It is the same as the Beta distribution when r = 2.

Fact: For any i,
    ∫ θ_i [Γ(α) / ∏_{k=1}^{r} Γ(α_k)] ∏_{k=1}^{r} θ_k^{α_k − 1} dθ_1 dθ_2 ... dθ_r = α_i / α

24 Calculating Parameter Estimations

If the prior is a Dirichlet distribution Dir(α_1, α_2, ..., α_r), then the posterior p(θ | D) is given by
    p(θ | D) ∝ ∏_{i=1}^{r} θ_i^{m_i + α_i − 1}
So it is the Dirichlet distribution Dir(α_1 + m_1, α_2 + m_2, ..., α_r + m_r).

Bayesian estimation has the following closed form:
    P(D_{m+1} = x_i | D) = ∫ θ_i p(θ | D) dθ = (α_i + m_i) / (α + m)

MLE: θ_i* = m_i / m. (Exercise: prove this.)
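A short sketch for the multi-valued case (helper names and the toy data are ours): it returns both the MLE m_i/m and the Bayesian estimate (α_i + m_i)/(α + m).

```python
# Sketch: MLE vs. Bayesian (Dirichlet-posterior) estimates for a multi-valued variable,
# following the closed forms above. Counts and hyperparameters below are made up.
from collections import Counter

def estimates(data, values, alpha):
    """Return (MLE, Bayesian) estimates of P(X = x) for each x in the domain.

    data   : list of observed values
    values : the domain Omega_X = {x_1, ..., x_r}
    alpha  : dict of Dirichlet hyperparameters alpha_i, one per value
    """
    m = Counter(data)
    n = len(data)
    a = sum(alpha.values())
    mle = {x: m[x] / n for x in values}
    bayes = {x: (m[x] + alpha[x]) / (n + a) for x in values}
    return mle, bayes

data = ['a', 'b', 'a', 'c', 'a', 'b']
mle, bayes = estimates(data, ['a', 'b', 'c'], alpha={'a': 1, 'b': 1, 'c': 1})
print(mle)    # {'a': 0.5, 'b': 0.333..., 'c': 0.166...}
print(bayes)  # {'a': 0.444..., 'b': 0.333..., 'c': 0.222...}
```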

25 Outline

1 Problem Statement

2 Principles of Parameter Learning
  - Maximum likelihood estimation
  - Bayesian estimation
  - Variable with Multiple Values

3 Parameter Estimation in General Bayesian Networks
  - The Parameters
  - Maximum likelihood estimation
  - Properties of MLE
  - Bayesian estimation

26 The Parameters

n variables: X_1, X_2, ..., X_n.
Number of states of X_i: 1, 2, ..., r_i = |Ω_{X_i}|.
Number of configurations of the parents of X_i: 1, 2, ..., q_i = |Ω_{pa(X_i)}|.

Parameters to be estimated:
    θ_ijk = P(X_i = j | pa(X_i) = k),   i = 1,...,n; j = 1,...,r_i; k = 1,...,q_i

Parameter vector: θ = {θ_ijk | i = 1,...,n; j = 1,...,r_i; k = 1,...,q_i}.
Note that Σ_j θ_ijk = 1 for all i, k.

θ_i.. : vector of parameters for P(X_i | pa(X_i))
    θ_i.. = {θ_ijk | j = 1,...,r_i; k = 1,...,q_i}
θ_i.k : vector of parameters for P(X_i | pa(X_i) = k)
    θ_i.k = {θ_ijk | j = 1,...,r_i}

27 The Parameters

Example: Consider the Bayesian network X_1 → X_3 ← X_2. Assume all variables are binary, taking values 1 and 2.

θ_111 = P(X_1=1), θ_121 = P(X_1=2)
θ_211 = P(X_2=1), θ_221 = P(X_2=2)
pa(X_3) = 1 : θ_311 = P(X_3=1 | X_1=1, X_2=1), θ_321 = P(X_3=2 | X_1=1, X_2=1)
pa(X_3) = 2 : θ_312 = P(X_3=1 | X_1=1, X_2=2), θ_322 = P(X_3=2 | X_1=1, X_2=2)
pa(X_3) = 3 : θ_313 = P(X_3=1 | X_1=2, X_2=1), θ_323 = P(X_3=2 | X_1=2, X_2=1)
pa(X_3) = 4 : θ_314 = P(X_3=1 | X_1=2, X_2=2), θ_324 = P(X_3=2 | X_1=2, X_2=2)
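As an illustration of the indexing, one possible in-memory layout of θ for this example, keyed by (i, j, k); the numeric values below are placeholders, not estimates from any data:

```python
# Sketch: parameters theta_ijk for the example network X1 -> X3 <- X2 (all binary),
# stored as a dict keyed by (i, j, k). Values are placeholders for illustration only.
from collections import defaultdict

theta = {
    # X1 and X2 have no parents, so their only parent configuration is k = 1
    (1, 1, 1): 0.6, (1, 2, 1): 0.4,   # P(X1 = 1), P(X1 = 2)
    (2, 1, 1): 0.7, (2, 2, 1): 0.3,   # P(X2 = 1), P(X2 = 2)
    # X3 has 4 parent configurations k = 1..4, one per (X1, X2) combination
    (3, 1, 1): 0.9, (3, 2, 1): 0.1,   # P(X3 = . | X1=1, X2=1)
    (3, 1, 2): 0.5, (3, 2, 2): 0.5,   # P(X3 = . | X1=1, X2=2)
    (3, 1, 3): 0.2, (3, 2, 3): 0.8,   # P(X3 = . | X1=2, X2=1)
    (3, 1, 4): 0.3, (3, 2, 4): 0.7,   # P(X3 = . | X1=2, X2=2)
}

# Sanity check: for each (i, k), the theta_ijk sum to 1 over j
sums = defaultdict(float)
for (i, j, k), p in theta.items():
    sums[(i, k)] += p
assert all(abs(s - 1.0) < 1e-9 for s in sums.values())
```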

28 Data

A complete case D_l: a vector of values, one for each variable.
Example: D_l = (X_1 = 1, X_2 = 2, X_3 = 2)

Given: a set of complete cases D = {D_1, D_2, ..., D_m}, e.g. a table with one row per case and one column per variable X_1, X_2, X_3.

Find: the ML estimates of the parameters θ.

29 The Loglikelihood Function

Loglikelihood:
    l(θ | D) = log L(θ | D) = log P(D | θ) = log ∏_l P(D_l | θ) = Σ_l log P(D_l | θ).

The term log P(D_l | θ): for D_4 = (1, 2, 2),
    log P(D_4 | θ) = log P(X_1=1, X_2=2, X_3=2 | θ)
                   = log P(X_1=1 | θ) P(X_2=2 | θ) P(X_3=2 | X_1=1, X_2=2, θ)
                   = log θ_111 + log θ_221 + log θ_322.

Recall: θ = {θ_111, θ_121; θ_211, θ_221; θ_311, θ_312, θ_313, θ_314, θ_321, θ_322, θ_323, θ_324}

30 The Loglikelihood Function

Define the characteristic function of case D_l:
    χ(i, j, k : D_l) = 1 if X_i = j and pa(X_i) = k in D_l, and 0 otherwise.

When l = 4, D_4 = (1, 2, 2):
    χ(1, 1, 1 : D_4) = χ(2, 2, 1 : D_4) = χ(3, 2, 2 : D_4) = 1
    χ(i, j, k : D_4) = 0 for all other i, j, k

So, log P(D_4 | θ) = Σ_{i,j,k} χ(i, j, k : D_4) log θ_ijk

In general, log P(D_l | θ) = Σ_{i,j,k} χ(i, j, k : D_l) log θ_ijk

31 The Loglikelihood Function

Sufficient statistics: m_ijk. Define m_ijk = Σ_l χ(i, j, k : D_l). It is the number of data cases where X_i = j and pa(X_i) = k. Then
    l(θ | D) = Σ_l log P(D_l | θ)
             = Σ_l Σ_{i,j,k} χ(i, j, k : D_l) log θ_ijk
             = Σ_{i,j,k} [Σ_l χ(i, j, k : D_l)] log θ_ijk
             = Σ_{i,j,k} m_ijk log θ_ijk
             = Σ_{i,k} Σ_j m_ijk log θ_ijk.        (4)

32 MLE

Want:
    arg max_θ l(θ | D) = arg max_θ Σ_{i,k} Σ_j m_ijk log θ_ijk

Note that θ_ijk = P(X_i=j | pa(X_i)=k) and θ_i′j′k′ = P(X_i′=j′ | pa(X_i′)=k′) are not related if either i ≠ i′ or k ≠ k′. Consequently, we can separately maximize each term in the summation Σ_{i,k}[...]:
    arg max_{θ_i.k} Σ_j m_ijk log θ_ijk

33 MLE

By Corollary 1.1, we get
    θ_ijk* = m_ijk / Σ_{j′} m_ij′k

In words, the MLE estimate for θ_ijk = P(X_i=j | pa(X_i)=k) is:
    θ_ijk* = (number of cases where X_i=j and pa(X_i)=k) / (number of cases where pa(X_i)=k)
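A compact sketch putting slides 30–33 together for the example network (the toy data set and helper names are made up): count the sufficient statistics m_ijk and normalize them within each parent configuration.

```python
# Sketch: sufficient statistics m_ijk and the MLE theta_ijk* for the example network
# X1 -> X3 <- X2 (all binary). Data and helpers are illustrative, not from the lecture.
from collections import Counter

# Parent-configuration index k for X3: (X1, X2) = (1,1)->1, (1,2)->2, (2,1)->3, (2,2)->4
def parent_config(x1, x2):
    return {(1, 1): 1, (1, 2): 2, (2, 1): 3, (2, 2): 4}[(x1, x2)]

data = [(1, 2, 2), (1, 1, 1), (2, 2, 2), (1, 2, 1), (2, 1, 2)]  # cases (X1, X2, X3)

m = Counter()                       # m[(i, j, k)] = number of cases with X_i=j, pa(X_i)=k
for x1, x2, x3 in data:
    m[(1, x1, 1)] += 1              # X1 and X2 have no parents: single parent config k=1
    m[(2, x2, 1)] += 1
    m[(3, x3, parent_config(x1, x2))] += 1

def mle(i, j, k, r_i=2):
    """theta_ijk* = m_ijk / sum_j' m_ij'k (undefined if the parent config never occurs)."""
    denom = sum(m[(i, jp, k)] for jp in range(1, r_i + 1))
    return m[(i, j, k)] / denom if denom else None

print(mle(1, 1, 1))   # P(X1=1): 3/5
print(mle(3, 2, 2))   # P(X3=2 | X1=1, X2=2): from the cases with (X1, X2) = (1, 2)
```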

34 Example

Example (same network X_1 → X_3 ← X_2, with a data set of 16 complete cases shown on the slide):

MLE for P(X_1=1) is 6/16.
MLE for P(X_2=1) is 7/16.
MLE for P(X_3=1 | X_1=2, X_2=2) is 2/6.
...

35 A Question

Start from a joint distribution P(X) (the generative distribution).
D: a collection of data sampled from P(X).
Let S be a BN structure (DAG) over the variables X.
Learn parameters θ* for BN structure S from D.
Let P*(X) be the joint probability of the BN (S, θ*). Note: θ*_ijk = P*(X_i=j | pa_S(X_i)=k)

How is P* related to P?

36 MLE in General Bayesian Networks with Complete Data

(Figure: P, P*, and the set of distributions that factorize according to S.)

We will show that, with probability 1, P* converges to the distribution that
- factorizes according to S, and
- is closest to P under KL divergence among all distributions that factorize according to S.

P* factorizes according to S. P does not necessarily factorize according to S.

If P factorizes according to S, P* converges to P with probability 1. (MLE is consistent.)

37 The Target Distribution

Define θ^S_ijk = P(X_i=j | pa_S(X_i)=k).
Let P_S(X) be the joint distribution of the BN (S, θ^S).

P_S factorizes according to S and, for any X ∈ X,
    P_S(X | pa(X)) = P(X | pa(X))

If P factorizes according to S, then P and P_S are identical.
If P does not factorize according to S, then P and P_S are different.

38 First Theorem

Theorem (6.1): Among all distributions Q that factorize according to S, the KL divergence KL(P, Q) is minimized by Q = P_S.

P_S is the closest to P among all those that factorize according to S.

Proof: Since
    KL(P, Q) = Σ_X P(X) log [P(X) / Q(X)],
it suffices to show the following.

Proposition: Q = P_S maximizes Σ_X P(X) log Q(X).

We show the claim by induction on the number of nodes. When there is only one node, the proposition follows from a property of KL divergence (Corollary 1.1).

39 First Theorem

Suppose the proposition is true for the case of n nodes. Consider the case of n+1 nodes.

Let X be a leaf node, let X′ denote the remaining variables, and let S′ be the structure obtained from S by removing X. Then
    Σ P(X′, X) log Q(X′, X) = Σ_{X′} P(X′) log Q(X′) + Σ_{pa(X)} P(pa(X)) Σ_X P(X | pa(X)) log Q(X | pa(X))

By the induction hypothesis, the first term is maximized by P_{S′}. By Corollary 1.1, the second term is maximized if Q(X | pa(X)) = P(X | pa(X)). Hence the sum is maximized by P_S.

40 Second Theorem

Theorem (6.2): lim_{N→∞} P*(X=x) = P_S(X=x) with probability 1, where N is the sample size, i.e. the number of cases in D.

Proof: Let P̂(X) be the empirical distribution:
    P̂(X=x) = fraction of cases in D where X=x

It is clear that
    P*(X_i=j | pa_S(X_i)=k) = θ*_ijk = P̂(X_i=j | pa_S(X_i)=k)

41 Second Theorem

On the other hand, by the law of large numbers, we have
    lim_{N→∞} P̂(X=x) = P(X=x) with probability 1

Hence
    lim_{N→∞} P*(X_i=j | pa_S(X_i)=k) = lim_{N→∞} P̂(X_i=j | pa_S(X_i)=k)
                                      = P(X_i=j | pa_S(X_i)=k) with probability 1
                                      = P_S(X_i=j | pa_S(X_i)=k)

Because both P* and P_S factorize according to S, the theorem follows. Q.E.D.

42 A Corollary

Corollary: If P factorizes according to S, then
    lim_{N→∞} P*(X=x) = P(X=x) with probability 1

43 Bayesian Estimation

View θ as a vector of random variables with prior distribution p(θ).

Posterior:
    p(θ | D) ∝ p(θ) L(θ | D) = p(θ) ∏_{i,k} ∏_j θ_ijk^{m_ijk}
where the equation follows from (4).

Assumptions need to be made about the prior distribution.

44 Assumptions

Global independence in the prior distribution:
    p(θ) = ∏_i p(θ_i..)

Local independence in the prior distribution: for each i,
    p(θ_i..) = ∏_k p(θ_i.k)

Parameter independence = global independence + local independence:
    p(θ) = ∏_{i,k} p(θ_i.k)

45 Assumptions

Further assume that p(θ_i.k) is a Dirichlet distribution Dir(α_i1k, α_i2k, ..., α_{i r_i k}):
    p(θ_i.k) ∝ ∏_j θ_ijk^{α_ijk − 1}

Then p(θ) is a product Dirichlet distribution:
    p(θ) ∝ ∏_{i,k} ∏_j θ_ijk^{α_ijk − 1}

46 Bayesian Estimation

Posterior:
    p(θ | D) ∝ p(θ) ∏_{i,k} ∏_j θ_ijk^{m_ijk}
             = [∏_{i,k} ∏_j θ_ijk^{α_ijk − 1}] ∏_{i,k} ∏_j θ_ijk^{m_ijk}
             = ∏_{i,k} ∏_j θ_ijk^{m_ijk + α_ijk − 1}

It is also a product Dirichlet distribution. (Think: what does this mean?)

47 Prediction

Predicting D_{m+1} = {X_1^{m+1}, X_2^{m+1}, ..., X_n^{m+1}}. These are random variables. For notational simplicity, simply write D_{m+1} = {X_1, X_2, ..., X_n}.

First, we have:
    P(D_{m+1} | D) = P(X_1, X_2, ..., X_n | D) = ∏_i P(X_i | pa(X_i), D)

48 Proof

P(D_{m+1} | D) = ∫ P(D_{m+1} | θ) p(θ | D) dθ

P(D_{m+1} | θ) = P(X_1, X_2, ..., X_n | θ) = ∏_i P(X_i | pa(X_i), θ) = ∏_i P(X_i | pa(X_i), θ_i..)

p(θ | D) = ∏_i p(θ_i.. | D)

Hence
    P(D_{m+1} | D) = ∏_i ∫ P(X_i | pa(X_i), θ_i..) p(θ_i.. | D) dθ_i.. = ∏_i P(X_i | pa(X_i), D)

49 Prediction

Further, we have
    P(X_i=j | pa(X_i)=k, D) = ∫ P(X_i=j | pa(X_i)=k, θ_ijk) p(θ_ijk | D) dθ_ijk
                            = ∫ θ_ijk p(θ_ijk | D) dθ_ijk

Because
    p(θ_i.k | D) ∝ ∏_j θ_ijk^{m_ijk + α_ijk − 1}
we have
    ∫ θ_ijk p(θ_ijk | D) dθ_ijk = (m_ijk + α_ijk) / Σ_{j′} (m_ij′k + α_ij′k)

50 Prediction

Conclusion:
    P(X_1, X_2, ..., X_n | D) = ∏_i P(X_i | pa(X_i), D)
where
    P(X_i=j | pa(X_i)=k, D) = (m_ijk + α_ijk) / (m_i·k + α_i·k)
with m_i·k = Σ_j m_ijk and α_i·k = Σ_j α_ijk.

Notes:
- Conditional independence, i.e. the structure, is preserved after absorbing D.
- This is an important property for sequential learning, where we process one case at a time. The final result is independent of the order in which the cases are processed.
- Comparison with the MLE estimate: θ_ijk* = m_ijk / Σ_{j′} m_ij′k
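A sketch of this Bayesian counterpart on the same toy data as the MLE sketch above, with the uniform choice α_ijk = 1 (our assumption for illustration, not from the lecture):

```python
# Sketch: Bayesian estimates P(X_i=j | pa=k, D) = (m_ijk + alpha_ijk) / (m_i.k + alpha_i.k)
# for the example network X1 -> X3 <- X2, using made-up data and alpha_ijk = 1.
from collections import Counter

def parent_config(x1, x2):
    return {(1, 1): 1, (1, 2): 2, (2, 1): 3, (2, 2): 4}[(x1, x2)]

data = [(1, 2, 2), (1, 1, 1), (2, 2, 2), (1, 2, 1), (2, 1, 2)]
m = Counter()
for x1, x2, x3 in data:
    m[(1, x1, 1)] += 1
    m[(2, x2, 1)] += 1
    m[(3, x3, parent_config(x1, x2))] += 1

def bayes_estimate(i, j, k, r_i=2, alpha=1.0):
    """Posterior-predictive estimate with Dirichlet hyperparameters alpha_ijk = alpha."""
    num = m[(i, j, k)] + alpha
    den = sum(m[(i, jp, k)] + alpha for jp in range(1, r_i + 1))
    return num / den

# Unlike the MLE, this stays well defined (and away from 0 and 1) for sparse counts:
print(bayes_estimate(1, 1, 1))   # (3+1)/(5+2) = 4/7
print(bayes_estimate(3, 1, 4))   # parent config (X1,X2)=(2,2) seen once, with X3=2: (0+1)/(1+2)
```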

51 Summary

θ: random variable.

Prior p(θ): product Dirichlet distribution
    p(θ) = ∏_{i,k} p(θ_i.k) ∝ ∏_{i,k} ∏_j θ_ijk^{α_ijk − 1}

Posterior p(θ | D): also a product Dirichlet distribution
    p(θ | D) ∝ ∏_{i,k} ∏_j θ_ijk^{m_ijk + α_ijk − 1}

Prediction:
    P(D_{m+1} | D) = P(X_1, X_2, ..., X_n | D) = ∏_i P(X_i | pa(X_i), D)
where
    P(X_i=j | pa(X_i)=k, D) = (m_ijk + α_ijk) / (m_i·k + α_i·k)
