Parameter Learning With Binary Variables

1 Parameter Learning With Binary Variables
University of Nebraska-Lincoln, CSCE 970: Pattern Recognition

2 Outline
1. Learning a Single Parameter
2. More on the Beta Density Function
3. Computing a Probability Interval

3 Outline
4. Learning Parameters in a Bayesian Network
5. Learning with Missing Data Items
6. Variances in Computed Relative Frequencies

4 Outline
1. Learning a Single Parameter
   - Probability Distributions of Relative Frequencies
   - Learning a Relative Frequency
2. More on the Beta Density Function
3. Computing a Probability Interval

5 Equally Probable Relative Frequencies
The urn example can be modeled by the following Bayesian network: a root node F whose possible values f (the relative frequencies, up to 1.00) are all equally probable, and a child node Side with
P(Side = heads | f) = f

6 Review of the Gamma Function
Γ(x) = ∫_0^∞ t^(x-1) e^(-t) dt
The integral converges if and only if x > 0. If x is an integer ≥ 1, it can be shown that Γ(x) = (x - 1)!

7 Introducing the Beta Density Function
The beta density function with parameters a, b, and N = a + b, where a and b are real numbers > 0, is
ρ(f) = [Γ(N) / (Γ(a)Γ(b))] f^(a-1) (1 - f)^(b-1),   0 ≤ f ≤ 1
A beta density function is denoted beta(f; a, b).
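
The density above can be evaluated directly with the standard library's log-gamma function. The sketch below is my own (the helper name beta_pdf is not from the slides); it works in log space so that large a and b do not overflow.

```python
from math import lgamma, log, exp

def beta_pdf(f, a, b):
    """Evaluate beta(f; a, b) = Gamma(a+b) / (Gamma(a)Gamma(b)) * f^(a-1) * (1-f)^(b-1)."""
    if not 0.0 < f < 1.0:
        raise ValueError("f must lie strictly between 0 and 1")
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(f) + (b - 1) * log(1 - f))

print(beta_pdf(0.5, 1, 1))   # beta(f; 1, 1) is the uniform density: 1.0
print(beta_pdf(0.9, 18, 2))  # mass concentrated near a/(a+b) = 0.9
```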

8 Example Beta Function Plots
Plots of beta(f; 50, 50), beta(f; 3, 3), and beta(f; 18, 2).
Note that the larger the values of a and b, the more mass is concentrated around a/(a + b).

9 Expected Value
If F has a beta distribution with parameters a, b, N = a + b, then
E(F) = a / N
We assess our beliefs such that P(X = 1 | f) = f. In this case we have
P(X = 1) = E(F) = a / N

10 Urn Revisited
The probability that the first coin chosen lands heads is 0.5.
What if we toss it 20 times and it lands heads 18 of those times?
We now feel the coin's relative frequency of heads is closer to 0.90 than to 0.10.
How do we quantify such a change in belief?

11 Independence of Trials
The model is a node F with density ρ(f) and a child X with P(X = 1 | F = f) = f.
If we know the value f, then each trial X^(h) of X is independent of every other trial X^(i), i ≠ h.

12 Probability of a Data Set
Suppose we have a data set d = {x^(1), x^(2), ..., x^(M)}. Let s be the number of variables in d equal to 1, and t be the number of variables in d equal to 2. Then
P(d) = E(F^s (1 - F)^t)
If F has a beta distribution, it can be shown that
E(F^s (1 - F)^t) = [Γ(N) / Γ(N + M)] · [Γ(a + s) Γ(b + t) / (Γ(a) Γ(b))]
Therefore
P(d) = [Γ(N) / Γ(N + M)] · [Γ(a + s) Γ(b + t) / (Γ(a) Γ(b))]

13 Urn Example
Recall the urn example, with ρ(f) = beta(f; 1, 1). Consider the binomial sample d = {1, 2}. We have a = b = 1, N = 2, s = 1, t = 1, M = 2. Thus
P(d) = [Γ(2) / Γ(2 + 2)] · [Γ(1 + 1) Γ(1 + 1) / (Γ(1) Γ(1))] = 1/6
Now consider the sample d' = {1, 1}. We have P(d') = 1/3.
Why is the probability of two heads twice the probability of one heads and one tails?
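
The two values above are easy to check numerically. The sketch below is my own (the helper name prob_data is not from the slides); it evaluates the data-set probability formula in log space and reproduces P(d) = 1/6 and P(d') = 1/3.

```python
from math import lgamma, exp

def prob_data(a, b, s, t):
    """P(d) = Gamma(N)/Gamma(N+M) * Gamma(a+s)Gamma(b+t) / (Gamma(a)Gamma(b))."""
    N, M = a + b, s + t
    log_p = (lgamma(N) - lgamma(N + M)
             + lgamma(a + s) + lgamma(b + t)
             - lgamma(a) - lgamma(b))
    return exp(log_p)

print(prob_data(1, 1, 1, 1))  # d  = {1, 2}: 1/6 ~ 0.1667
print(prob_data(1, 1, 2, 0))  # d' = {1, 1}: 1/3 ~ 0.3333
```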

14 Updating the Parameter Density Function
Given a data set d and an original density function ρ(f) = beta(f; a, b), the updated density function is given by
ρ(f | d) = beta(f; a + s, b + t)
The probability that the next trial is equal to 1, denoted P(X^(M+1) = 1 | d), is equal to E(F | d). Assuming F has a beta distribution with parameters a, b, N = a + b, this becomes
P(X^(M+1) = 1 | d) = (a + s) / (N + M)

15 Example of Updating a Density Function
Consider the thumbtack example with density function ρ(f) = beta(f; 3, 3) and the sample data set
d = {1, 1, 2, 1, 1, 1, 1, 1, 2, 1}
Our updated density function becomes
ρ(f | d) = beta(f; 3 + 8, 3 + 2) = beta(f; 11, 5)
Our probability that the next trial will produce a 1 becomes
P(X^(11) = 1 | d) = 11/16 = 0.6875
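
A minimal sketch of the update rule and predictive probability (my own helper name, not from the slides), checked on the thumbtack example above:

```python
def update(a, b, data):
    """Return (a + s, b + t), where s counts the 1's and t counts the 2's in data."""
    s = sum(1 for x in data if x == 1)
    t = sum(1 for x in data if x == 2)
    return a + s, b + t

d = [1, 1, 2, 1, 1, 1, 1, 1, 2, 1]
a_post, b_post = update(3, 3, d)      # beta(f; 11, 5)
print(a_post, b_post)
print(a_post / (a_post + b_post))     # P(X^(11) = 1 | d) = 11/16 = 0.6875
```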

16 Outline
1. Learning a Single Parameter
2. More on the Beta Density Function
3. Computing a Probability Interval

17 Review
Our parameters F_ij are random variables with a beta distribution, denoted beta(f; a, b).
P(X = 1 | f) = f
Expected value: E(F) = a / N, where N = a + b
Data set probability: P(d) = [Γ(N) / Γ(N + M)] · [Γ(a + s) Γ(b + t) / (Γ(a) Γ(b))]
Updated density function: ρ(f | d) = beta(f; a + s, b + t)
Updated probability: P(X^(M+1) = 1 | d) = (a + s) / (N + M)

18 Scrutinizing the Updated Probability
Recall that we update the probability from a data set d as follows:
P(X^(M+1) = 1 | d) = (a + s) / (N + M)
We use the uniform distribution given by beta(f; 1, 1) to model a lack of belief about the true probabilities of X.
Why do we use the updated function beta(f; 1 + s, 1 + t) rather than just beta(f; s, t) to model our belief once we have concrete trials?

19 Avoiding Overly Confident Beliefs
Consider the urn example. Suppose we sample a coin at random and flip it. If it lands heads we have the data set d = {1}.
If we updated our belief based solely on the concrete trial, we would have ρ(f | d) = beta(f; 1, 0), which would give the updated probability P(X^(2) = 1 | d) = 1/1 = 100%.
Updating our belief as usual, we instead have ρ(f | d) = beta(f; 2, 1), leading to the updated probability P(X^(2) = 1 | d) = 2/3 ≈ 67%.

20 Beta Function With 0 < a < 1, 0 < b < 1
- The relative frequency of one of the two values is very low
- We are not sure which value
- The belief is quickly overwhelmed by data
Example plot: beta(f; 0.2, 0.2)

21 Assessing the Values of a and b
The following are guidelines for choosing values of a and b.
- a = b = 1: belief that we have no knowledge at all of the value of the relative frequency.
- a, b > 1: belief that it is probable that the relative frequency with which X = 1 is around a/(a + b). The larger the values of a and b, the stronger the belief.
- a, b < 1: belief that the relative frequency with which X = 1 is either very high or very low, but we are not sure which.

22 Outline
1. Learning a Single Parameter
2. More on the Beta Density Function
3. Computing a Probability Interval

23 Motivation for a Probability Interval
We saw earlier that P(X = 1 | f) = f. How confident are we that the true probability is near f?
We measure this confidence by finding a value c such that
P(f ∈ (E(F) - c, E(F) + c)) = perc
where (E(F) - c, E(F) + c) is an interval, known as a probability interval, such that 100(perc)% of the area under the beta curve lies within it.

24 Computing a Probability Interval
A perc% probability interval for E(F) is found by solving the following equation for c:
∫_{E(F)-c}^{E(F)+c} ρ(f) df = perc

25 Example
Recall the updated density function we computed for the thumbtack example:
ρ(f) = beta(f; 11, 5), with E(F) = 11/16 = 0.6875
To find a 95% probability interval we solve the following equation for c:
∫_{0.6875-c}^{0.6875+c} [Γ(16) / (Γ(11)Γ(5))] f^10 (1 - f)^4 df = 0.95
We obtain the solution c ≈ 0.214, which gives the probability interval
(0.6875 - 0.214, 0.6875 + 0.214) ≈ (0.474, 0.902)
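
There is no closed form for c, but it can be found numerically. The sketch below is my own (it assumes SciPy is available); it uses the regularized incomplete beta function as the beta CDF and a root finder to solve the equation above for the thumbtack example.

```python
from scipy.special import betainc   # regularized incomplete beta = CDF of beta(a, b)
from scipy.optimize import brentq

a, b, perc = 11, 5, 0.95
mean = a / (a + b)                  # E(F) = 0.6875

def mass(c):
    """Probability mass of beta(a, b) in (mean - c, mean + c)."""
    return betainc(a, b, mean + c) - betainc(a, b, mean - c)

c = brentq(lambda c: mass(c) - perc, 0.0, min(mean, 1 - mean))
print(c, (mean - c, mean + c))      # c ~ 0.214, interval ~ (0.474, 0.902)
```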

26 Example
Suppose we have a = 31 and b = 1, leading to E(F) = 31/32 = 0.96875. Solving the equation
∫_{E(F)-c}^{E(F)+c} [Γ(32) / (Γ(31)Γ(1))] f^30 (1 - f)^0 df = 0.95
we find c ≈ 0.033. This leads to the probability interval (0.936, 1.002), which extends beyond 1.
We therefore re-compute c from the following equation:
∫_{E(F)-c}^{1} [Γ(32) / (Γ(31)Γ(1))] f^30 (1 - f)^0 df = 0.95
We now obtain c ≈ 0.061 and the probability interval (0.96875 - 0.061, 1) ≈ (0.908, 1).

27 General Probability Intervals
If c is computed and (E(F) - c, E(F) + c) ⊄ (0, 1) with E(F) > 0.5, solve the following equation for c:
∫_{E(F)-c}^{1} ρ(f) df = perc
The probability interval is then (E(F) - c, 1).
If (E(F) - c, E(F) + c) ⊄ (0, 1) with E(F) < 0.5, solve the following equation for c:
∫_{0}^{E(F)+c} ρ(f) df = perc
The probability interval is then (0, E(F) + c).
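
The same numerical approach handles the one-sided cases. The sketch below is my own (again assuming SciPy); it applies the first rule above to beta(f; 31, 1) from the previous slide.

```python
from scipy.special import betainc
from scipy.optimize import brentq

a, b, perc = 31, 1, 0.95
mean = a / (a + b)                  # E(F) ~ 0.969 > 0.5, symmetric interval exceeds 1

# Solve  integral from E(F)-c to 1 of rho(f) df = perc  for c.
c = brentq(lambda c: (1.0 - betainc(a, b, mean - c)) - perc, 0.0, mean)
print(c, (mean - c, 1.0))           # c ~ 0.061, interval ~ (0.908, 1)
```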

28 Outline
4. Learning Parameters in a Bayesian Network
   - Urn Examples
   - Learning Using an Augmented Bayesian Network
   - A Problem with Updating; Using an Equivalent Sample Size
5. Learning with Missing Data Items
6. Variances in Computed Relative Frequencies

29 Urn Revisited
Consider two identical urns. We sample and toss one coin from each urn.
Augmented network: F11 with density beta(f11; 1, 1) pointing to X1, where P(X1 = 1 | f11) = f11; and F21 with density beta(f21; 1, 1) pointing to X2, where P(X2 = 1 | f21) = f21.
Embedded network: X1 and X2 with P(X1 = 1) = 1/2 and P(X2 = 1) = 1/2.

30 Example: Joint Probabilities
Since the two nodes X1 and X2 are independent, we have the following joint probabilities:
P(X1 = 1, X2 = 1) = P(X1 = 1) P(X2 = 1) = (1/2)(1/2) = 1/4
P(X1 = 1, X2 = 2) = P(X1 = 1) P(X2 = 2) = (1/2)(1/2) = 1/4
P(X1 = 2, X2 = 1) = P(X1 = 2) P(X2 = 1) = (1/2)(1/2) = 1/4
P(X1 = 2, X2 = 2) = P(X1 = 2) P(X2 = 2) = (1/2)(1/2) = 1/4
These are NOT relative frequencies. They are our beliefs concerning the first outcome.

31 Example: Updated Values
Suppose the first 7 trials yield a data set d in which X1 = 1 in 4 cases and X1 = 2 in 3 cases, and X2 = 1 in 5 cases and X2 = 2 in 2 cases. We then have the updated density functions
ρ(f11 | d) = beta(f11; 1 + 4, 1 + 3) = beta(f11; 5, 4)
ρ(f21 | d) = beta(f21; 1 + 5, 1 + 2) = beta(f21; 6, 3)
Thus we now have the joint distribution
P(X1 = 1, X2 = 1) = (5/9)(6/9) = 10/27
P(X1 = 2, X2 = 1) = (4/9)(6/9) = 8/27
P(X1 = 1, X2 = 2) = (5/9)(3/9) = 5/27
P(X1 = 2, X2 = 2) = (4/9)(3/9) = 4/27

32 Example: Three Urns
Suppose we have three urns u1, u2, u3. We sample a coin from u1 and flip it. If it turns up heads we sample a coin from u2 and flip it; otherwise we sample a coin from u3 and flip it.
This situation can be modeled by an augmented Bayesian network with parameter nodes F11, F21, F22, each with density beta(·; 1, 1), where
P(X1 = 1 | f11) = f11
P(X2 = 1 | X1 = 1, f21) = f21
P(X2 = 1 | X1 = 2, f22) = f22

33 Three Urns: Joint Probabilities
The previous augmented Bayesian network contains the following embedded network: X1 → X2 with
P(X1 = 1) = 1/2, P(X2 = 1 | X1 = 1) = 1/2, P(X2 = 1 | X1 = 2) = 1/2
This network has the following joint probabilities:
P(X1 = 1, X2 = 1) = P(X2 = 1 | X1 = 1) P(X1 = 1) = (1/2)(1/2) = 1/4
P(X1 = 1, X2 = 2) = 1/4, P(X1 = 2, X2 = 1) = 1/4, P(X1 = 2, X2 = 2) = 1/4

34 Three Urns: Updated Values
Suppose the first 7 trials yield a data set d in which X1 = 1 in 4 cases (with X2 = 1 in 3 of them and X2 = 2 in 1) and X1 = 2 in 3 cases (with X2 = 1 in 2 of them and X2 = 2 in 1). We then have the updated density functions
ρ(f11 | d) = beta(f11; 1 + 4, 1 + 3) = beta(f11; 5, 4)
ρ(f21 | d) = beta(f21; 1 + 3, 1 + 1) = beta(f21; 4, 2)
ρ(f22 | d) = beta(f22; 1 + 2, 1 + 1) = beta(f22; 3, 2)
Thus we now have the joint distribution
P(X1 = 1, X2 = 1) = (5/9)(4/6) = 10/27
P(X1 = 1, X2 = 2) = (5/9)(2/6) = 5/27
P(X1 = 2, X2 = 1) = (4/9)(3/5) = 4/15
P(X1 = 2, X2 = 2) = (4/9)(2/5) = 8/45
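
The updated joint distribution can be reproduced by counting per parent configuration. The sketch below is my own: the cases list is one ordering consistent with the counts given above, and the variable names are assumptions.

```python
cases = [(1, 1), (1, 1), (1, 2), (1, 1), (2, 1), (2, 1), (2, 2)]  # one ordering consistent with the slide's counts

s11 = sum(1 for x1, _ in cases if x1 == 1)               # 4
t11 = len(cases) - s11                                   # 3
s21 = sum(1 for x1, x2 in cases if x1 == 1 and x2 == 1)  # 3
t21 = sum(1 for x1, x2 in cases if x1 == 1 and x2 == 2)  # 1
s22 = sum(1 for x1, x2 in cases if x1 == 2 and x2 == 1)  # 2
t22 = sum(1 for x1, x2 in cases if x1 == 2 and x2 == 2)  # 1

f11 = (1 + s11) / (2 + s11 + t11)   # E(F11 | d) = 5/9
f21 = (1 + s21) / (2 + s21 + t21)   # E(F21 | d) = 4/6
f22 = (1 + s22) / (2 + s22 + t22)   # E(F22 | d) = 3/5
print(f11 * f21, f11 * (1 - f21))              # P(1,1) = 10/27, P(1,2) = 5/27
print((1 - f11) * f22, (1 - f11) * (1 - f22))  # P(2,1) = 4/15,  P(2,2) = 8/45
```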

35 Augmented Bayesian Networks
Consider the three-urn example, with parameter nodes F11, F21, F22 (each with density beta(·; 1, 1)) and
P(X1 = 1 | f11) = f11, P(X2 = 1 | X1 = 1, f21) = f21, P(X2 = 1 | X1 = 2, f22) = f22
Global and local parameter independence imply
ρ(f11, f12, ..., f_{n q_n}) = ρ(f11) ρ(f12) ··· ρ(f_{n q_n})

36 Binomial Bayesian Network Sample
Suppose we have M random vectors
X^(1) = (X1^(1), ..., Xn^(1)), X^(2) = (X1^(2), ..., Xn^(2)), ..., X^(M) = (X1^(M), ..., Xn^(M))
and the random vector set
D = {X^(1), X^(2), ..., X^(M)}
such that, for every i, each Xi^(h) has the space {1, 2}.

37 Binomial Bayesian Network Sample (Continued)
Suppose also that there is a binomial Bayesian network (G, F, ρ), where G = (V, E), such that for 1 ≤ h ≤ M the set {X1^(h), ..., Xn^(h)} constitutes an instance of V in G, resulting in a distinct augmented Bayesian network.
Then the random vector set D is called a binomial Bayesian network sample.
(In the two-variable example, the parameter nodes F11, F21, F22 are shared by the instances X1^(1), X2^(1) and X1^(2), X2^(2).)

38 General Data Set Probability
Suppose we have a random vector set D and a set of data values of the X^(h)'s:
x^(1) = (x1^(1), ..., xn^(1)), x^(2) = (x1^(2), ..., xn^(2)), ..., x^(M) = (x1^(M), ..., xn^(M))
d = {x^(1), x^(2), ..., x^(M)}
Suppose also that s_ij is the number of cases in which xi^(h) equals 1 (with the parents of Xi in their j-th configuration), and t_ij is the number of cases in which xi^(h) equals 2.
Then we have the data set probability
P(d) = ∏_{i=1}^{n} ∏_{j=1}^{q_i} [Γ(N_ij) / Γ(N_ij + M_ij)] · [Γ(a_ij + s_ij) Γ(b_ij + t_ij) / (Γ(a_ij) Γ(b_ij))]
where M_ij = s_ij + t_ij.
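
A sketch of the product formula (my own function and variable names, assuming the priors and counts are supplied as nested lists indexed by node i and parent configuration j):

```python
from math import lgamma, exp

def prob_data(a, b, s, t):
    """P(d) as a product over nodes i and parent configurations j."""
    log_p = 0.0
    for i in range(len(a)):
        for j in range(len(a[i])):
            N, M = a[i][j] + b[i][j], s[i][j] + t[i][j]
            log_p += (lgamma(N) - lgamma(N + M)
                      + lgamma(a[i][j] + s[i][j]) + lgamma(b[i][j] + t[i][j])
                      - lgamma(a[i][j]) - lgamma(b[i][j]))
    return exp(log_p)

# Three-urn example: uniform priors and the counts from the 7-trial data set.
a = [[1], [1, 1]]; b = [[1], [1, 1]]
s = [[4], [3, 2]]; t = [[3], [1, 1]]
print(prob_data(a, b, s, t))
```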

39 General Updated Density Function
Suppose again that we have a random vector set D, data d = {x^(1), x^(2), ..., x^(M)}, and counts s_ij and t_ij defined as on the previous slide.
Assuming each F_ij has an original beta distribution, we have the updated density function
ρ(f_ij | d) = beta(f_ij; a_ij + s_ij, b_ij + t_ij)

40 Problem Overview
Consider the augmented network with beta(f11; 1, 1), beta(f21; 1, 1), beta(f22; 1, 1) and embedded probabilities P(X1 = 1) = 1/2, P(X2 = 1 | X1 = 1) = 1/2, P(X2 = 1 | X1 = 2) = 1/2.
beta(f11; 1, 1) represents prior experience of seeing X1 take the value 1 once in two trials.
beta(f21; 1, 1) represents prior experience of seeing X2 take the value 1 once out of the two times that X1 took the value 1.
These two prior sample sizes are inconsistent with each other.

41 Prior Equivalent Sample Size
To solve this problem we specify the same prior sample size at each node:
beta(f11; 2, 2), beta(f21; 1, 1), beta(f22; 1, 1)
with embedded probabilities P(X1 = 1) = 1/2, P(X2 = 1 | X1 = 1) = 1/2, P(X2 = 1 | X1 = 2) = 1/2.

42 General Equivalent Sample Size
Given an augmented binomial Bayesian network with beta density functions, if there is a number N_equiv such that, for all i and j,
N_ij = a_ij + b_ij = P(pa_ij) × N_equiv
then the network has an equivalent sample size N_equiv.

43 Equivalent Sample Size Example
A network with X1 → X3 ← X2 and parameter densities
beta(f11; 10, 5), beta(f21; 9, 6), beta(f31; 2, 4), beta(f32; 3, 1), beta(f33; 2, 1), beta(f34; 1, 1)
has N_equiv = 15.
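
As a quick check (my own sketch, assuming F31 through F34 correspond to the parent configurations (X1=1, X2=1), (X1=1, X2=2), (X1=2, X2=1), (X1=2, X2=2) in that order), each N_3j equals P(pa_3j) times N_equiv:

```python
n_equiv = 15
p_x1_1, p_x2_1 = 10 / 15, 9 / 15                 # from beta(f11; 10, 5) and beta(f21; 9, 6)
pa_probs = [p_x1_1 * p_x2_1, p_x1_1 * (1 - p_x2_1),
            (1 - p_x1_1) * p_x2_1, (1 - p_x1_1) * (1 - p_x2_1)]
n_3j = [2 + 4, 3 + 1, 2 + 1, 1 + 1]              # a_3j + b_3j for F31..F34
print([round(p * n_equiv, 6) for p in pa_probs]) # [6.0, 4.0, 3.0, 2.0]
print(n_3j)                                      # [6, 4, 3, 2]
```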

44 Constructing Equivalent Sample Size Bayesian Networks
Given an augmented Bayesian network, we can assign, for all i and j,
a_ij = b_ij = N / (2 q_i)
This gives an equal probability to each value at each node.
Given a Bayesian network, the parameters F_ij can be assigned the values
a_ij = P(X_i = 1 | pa_ij) · P(pa_ij) · N_equiv
b_ij = P(X_i = 2 | pa_ij) · P(pa_ij) · N_equiv
This yields an augmented Bayesian network that embeds the given Bayesian network.
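
A sketch of the second assignment (my own helper name beta_params; the worked example assumes the parent-configuration ordering used above for the N_equiv = 15 network):

```python
def beta_params(p_xi1_given_paij, p_paij, n_equiv):
    """a_ij = P(X_i=1 | pa_ij) P(pa_ij) N_equiv, and b_ij likewise for X_i = 2."""
    a_ij = p_xi1_given_paij * p_paij * n_equiv
    b_ij = (1 - p_xi1_given_paij) * p_paij * n_equiv
    return a_ij, b_ij

# F31 with parent configuration X1 = 1, X2 = 1: P(pa) = (10/15)(9/15) = 2/5
# and P(X3 = 1 | pa) = 2/6, giving beta(f31; 2, 4).
a31, b31 = beta_params(2 / 6, (10 / 15) * (9 / 15), 15)
print(round(a31, 6), round(b31, 6))   # 2.0 4.0
```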

45 Expressing Prior Indifference
Recall the augmented Bayesian network with beta(f11; 2, 2), beta(f21; 1, 1), beta(f22; 1, 1) and embedded probabilities P(X1 = 1) = 1/2, P(X2 = 1 | X1 = 1) = 1/2, P(X2 = 1 | X1 = 2) = 1/2.
Though this network has an equivalent sample size, it no longer models our indifference to the true parameter values.

46 Expressing Prior Indifference
We can solve this problem by using an equivalent sample size N_equiv = 2 to describe indifference in a Bayesian network. Using the previous example, this gives the new network
beta(f11; 1, 1), beta(f21; 0.5, 0.5), beta(f22; 0.5, 0.5)
with embedded probabilities P(X1 = 1) = 1/2, P(X2 = 1 | X1 = 1) = 1/2, P(X2 = 1 | X1 = 2) = 1/2.

47 Outline
4. Learning Parameters in a Bayesian Network
5. Learning with Missing Data Items
   - Data Items Missing at Random
6. Variances in Computed Relative Frequencies

48 Review of Updating
Suppose we have the Bayesian network with beta(f11; 2, 2), beta(f21; 1, 1), beta(f22; 1, 1) and embedded probabilities P(X1 = 1) = 1/2, P(X2 = 1 | X1 = 1) = 1/2, P(X2 = 1 | X1 = 2) = 1/2, together with a data set d of 5 cases in which X1 = 1 in 4 cases (X2 = 1 in 3 of them, X2 = 2 in 1) and X1 = 2 in 1 case (with X2 = 2).
We have the updated values:
ρ(f11 | d) = beta(f11; 2 + 4, 2 + 1) = beta(f11; 6, 3)
ρ(f21 | d) = beta(f21; 1 + 3, 1 + 1) = beta(f21; 4, 2)
ρ(f22 | d) = beta(f22; 1 + 0, 1 + 1) = beta(f22; 1, 2)

49 Randomly Missing Data Items
Consider the same network, but now with two of the values of X2 missing:

Case  X1  X2
1     1   1
2     1   1
3     1   2
4     1   ?
5     2   ?

We can estimate the missing values of X2 in these cases using P(X2 | X1).

50 Prior Sample Probability Substitution
Substituting the prior probabilities for the missing data yields the following fractional counts:

X1  X2  Occurrences
1   1   2 + 1/2
1   2   1 + 1/2
2   1   1/2
2   2   1/2

This gives us the updated network
beta(f11; 6, 3), beta(f21; 7/2, 5/2), beta(f22; 3/2, 3/2)
with P(X1 = 1) = 2/3, P(X2 = 1 | X1 = 1) = 7/12, P(X2 = 1 | X1 = 2) = 1/2.

51 Incorporating Data Set Values
Note that we used our prior sample probabilities to fill in the missing data. However, the data set favors the event X1 = 1, X2 = 1 over the event X1 = 1, X2 = 2, because the former occurs twice while the latter occurs only once.
This suggests that we may get more accurate results by using our updated probabilities in place of our prior sample probabilities.

52 Incorporating Data Set Values
Using the updated probabilities gives the fractional counts

X1  X2  Occurrences
1   1   2 + 7/12
1   2   1 + 5/12
2   1   1/2
2   2   1/2

This gives us a new updated network
beta(f11; 6, 3), beta(f21; 43/12, 29/12), beta(f22; 3/2, 3/2)
with P(X1 = 1) = 2/3, P(X2 = 1 | X1 = 1) = 43/72, P(X2 = 1 | X1 = 2) = 1/2.
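
Both substitution passes can be reproduced with fractional counts. The sketch below is my own (assumed helper names, using the 5-case data set described above); the first call uses the prior probabilities and the second uses the probabilities obtained from that first update.

```python
complete = [(1, 1), (1, 1), (1, 2)]   # cases with X2 observed
missing_x1 = [1, 2]                   # cases 4 and 5: X1 observed, X2 = ?

def fractional_update(p21, p22):
    """Updated beta parameters for F21, F22 given P(X2=1|X1=1)=p21 and P(X2=1|X1=2)=p22."""
    s21 = sum(1 for x1, x2 in complete if x1 == 1 and x2 == 1)
    t21 = sum(1 for x1, x2 in complete if x1 == 1 and x2 == 2)
    s22 = sum(1 for x1, x2 in complete if x1 == 2 and x2 == 1)
    t22 = sum(1 for x1, x2 in complete if x1 == 2 and x2 == 2)
    for x1 in missing_x1:             # split each missing case fractionally
        if x1 == 1:
            s21 += p21; t21 += 1 - p21
        else:
            s22 += p22; t22 += 1 - p22
    return (1 + s21, 1 + t21), (1 + s22, 1 + t22)

(a21, b21), (a22, b22) = fractional_update(1/2, 1/2)   # prior probabilities
print(a21, b21, a22, b22)                              # 3.5 2.5 1.5 1.5
(a21, b21), (a22, b22) = fractional_update(a21 / (a21 + b21), a22 / (a22 + b22))
print(a21, b21, a22, b22)                              # 43/12 29/12 1.5 1.5
```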

53 Data Items Missing Not at Random
The estimation we used for missing data items is not appropriate if whether an item is missing is not independent of its value.

54 Outline
4. Learning Parameters in a Bayesian Network
5. Learning with Missing Data Items
6. Variances in Computed Relative Frequencies
