Advanced Probabilistic Modeling in R Day 1


Roger Levy, University of California, San Diego
July 20

Today's content
- Quick review of probability: axioms, joint & conditional probabilities, Bayes' Rule, conditional independence
- Bayes nets (a.k.a. directed acyclic graphical models, DAGs)
- The Gaussian distribution
- Example: human phoneme categorization
- Maximum likelihood estimation
- Bayesian parameter estimation
- Frequentist hypothesis testing
- Bayesian hypothesis testing

Probability spaces

Traditionally, probability spaces are defined in terms of sets. An event E is a subset of a sample space Ω: E ⊆ Ω.

A probability space P on a sample space Ω is a function from events E in Ω to real numbers such that the following three axioms hold:
1. P(E) ≥ 0 for all E ⊆ Ω (non-negativity).
2. If E1 and E2 are disjoint, then P(E1 ∪ E2) = P(E1) + P(E2) (disjoint union).
3. P(Ω) = 1 (properness).

Joint, conditional, and marginal probabilities

Given the joint distribution P(X, Y) over two random variables X and Y, the conditional distribution P(Y | X) is defined as

  P(Y | X) = P(X, Y) / P(X)

The marginal probability distribution P(X) is

  P(X = x) = Σ_y P(X = x, Y = y)

These concepts can be extended to arbitrary numbers of random variables.

The chain rule

A joint probability can be rewritten as the product of marginal and conditional probabilities:

  P(E1, E2) = P(E2 | E1) P(E1)

And this generalizes to more than two variables:

  P(E1, E2) = P(E2 | E1) P(E1)
  P(E1, E2, E3) = P(E3 | E1, E2) P(E2 | E1) P(E1)
  ...
  P(E1, E2, ..., En) = P(En | E1, E2, ..., E(n-1)) ... P(E2 | E1) P(E1)

Breaking a joint probability down into the product of a marginal probability and several conditional probabilities this way is called chain rule decomposition.
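Since the course works in R, here is a minimal sketch (mine, not from the slides) that checks the two-variable chain rule numerically on a made-up 2×2 joint distribution; the table values are arbitrary.

```r
# Verify P(E1, E2) = P(E2 | E1) P(E1) for a small discrete joint distribution.
joint <- matrix(c(0.10, 0.20,
                  0.30, 0.40),
                nrow = 2, byrow = TRUE,
                dimnames = list(E1 = c("a1", "a2"), E2 = c("b1", "b2")))

p_e1          <- rowSums(joint)       # marginal P(E1)
p_e2_given_e1 <- joint / p_e1         # conditional P(E2 | E1): divide each row by P(E1)
reconstructed <- p_e2_given_e1 * p_e1 # P(E2 | E1) * P(E1)

all.equal(reconstructed, joint)       # TRUE
```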

Bayes' Rule (Bayes' Theorem)

  P(A | B) = P(B | A) P(A) / P(B)

With extra background random variables I:

  P(A | B, I) = P(B | A, I) P(A | I) / P(B | I)

This theorem follows directly from the definition of conditional probability:

  P(A, B) = P(B | A) P(A)
  P(A, B) = P(A | B) P(B)

So

  P(A | B) P(B) = P(B | A) P(A)

and dividing both sides by P(B):

  P(A | B) = P(B | A) P(A) / P(B)

Other ways of writing Bayes' Rule

  P(A | B) = P(B | A) P(A) / P(B)

where P(B | A) is the likelihood, P(A) is the prior, and P(B) is the normalizing constant.

The hardest part of using Bayes' Rule is usually calculating the normalizing constant (a.k.a. the partition function). Hence there are two other ways we often write Bayes' Rule:

1. Emphasizing explicit marginalization:

  P(A | B) = P(B | A) P(A) / Σ_a P(A = a, B)

2. Ignoring the partition function:

  P(A | B) ∝ P(B | A) P(A)
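A quick R illustration (mine, not from the slides) of the second form: compute an unnormalized posterior as likelihood times prior over a discrete hypothesis space, then renormalize; all numbers are made up.

```r
# Posterior over three discrete hypotheses A given an observation B.
prior      <- c(h1 = 0.5, h2 = 0.3, h3 = 0.2)   # P(A)     (illustrative values)
likelihood <- c(h1 = 0.1, h2 = 0.4, h3 = 0.7)   # P(B | A) (illustrative values)

unnormalized <- likelihood * prior               # proportional to P(A | B)
posterior    <- unnormalized / sum(unnormalized) # divide by P(B) = sum_a P(A = a, B)

posterior
sum(posterior)                                   # 1
```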

(Conditional) Independence

Events A and B are said to be conditionally independent given information C if

  P(A, B | C) = P(A | C) P(B | C)

Conditional independence of A and B given C is often expressed as A ⊥ B | C.

Directed graphical models

A lot of the interesting joint probability distributions in the study of language involve conditional independencies among the variables.

So next we'll introduce you to a general framework for specifying conditional independencies among collections of random variables.

It won't allow us to express all possible independencies that may hold, but it goes a long way.

And I hope that you'll agree that the framework is intuitive too!

A non-linguistic example

Imagine a factory that produces three types of coins in equal volumes:
- Fair coins;
- 2-headed coins;
- 2-tailed coins.

Generative process:
- The factory produces a coin of type X and sends it to you;
- You receive the coin and flip it twice, with H(eads)/T(ails) outcomes Y1 and Y2.

Receiving a coin from the factory and flipping it twice is sampling (or taking a sample) from the joint distribution P(X, Y1, Y2).

This generative process as a Bayes net

The directed acyclic graphical model (DAG), or Bayes net:

  Y1 ← X → Y2

Semantics of a Bayes net: the joint distribution can be expressed as the product of the conditional distributions of each variable given only its parents.

In this DAG, P(X, Y1, Y2) = P(X) P(Y1 | X) P(Y2 | X).

  X      P(X)
  Fair   1/3
  2-H    1/3
  2-T    1/3

  X      P(Y1 = H | X)   P(Y1 = T | X)
  Fair   1/2             1/2
  2-H    1               0
  2-T    0               1

  X      P(Y2 = H | X)   P(Y2 = T | X)
  Fair   1/2             1/2
  2-H    1               0
  2-T    0               1
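A small R sketch (mine, not from the slides) that encodes these three tables and builds the full joint P(X, Y1, Y2) by multiplying the factors in the Bayes net:

```r
# Prior over coin type and the (identical) flip CPTs.
p_x <- c(Fair = 1/3, `2-H` = 1/3, `2-T` = 1/3)

p_flip_given_x <- rbind(Fair  = c(H = 1/2, T = 1/2),
                        `2-H` = c(H = 1,   T = 0),
                        `2-T` = c(H = 0,   T = 1))

# Joint P(X, Y1, Y2) as a 3 x 2 x 2 array, filled via P(X) P(Y1 | X) P(Y2 | X).
joint <- array(0, dim = c(3, 2, 2),
               dimnames = list(X = names(p_x), Y1 = c("H", "T"), Y2 = c("H", "T")))
for (x in names(p_x))
  for (y1 in c("H", "T"))
    for (y2 in c("H", "T"))
      joint[x, y1, y2] <- p_x[x] * p_flip_given_x[x, y1] * p_flip_given_x[x, y2]

sum(joint)              # 1: a proper joint distribution
apply(joint, 2, sum)    # marginal over dimension 2 (Y1): H and T each get 1/2
```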

Conditional independence in Bayes nets

(Same tables P(X), P(Y1 | X), and P(Y2 | X) as above.)

Question: Conditioned on not having any further information, are the two coin flips Y1 and Y2 in this generative process independent? That is, if C = {}, is it the case that Y1 ⊥ Y2 | C?

No!
- P(Y2 = H) = 1/2 (you can see this by symmetry)
- But P(Y2 = H | Y1 = H) = 1/3 · 1/2 + 2/3 · 1 = 5/6, where the first term covers the case that the coin was fair and the second the case that it was 2-headed.
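A short R simulation (my own check, not from the slides) of this generative process confirms that P(Y2 = H) ≈ 1/2 while P(Y2 = H | Y1 = H) ≈ 5/6:

```r
set.seed(1)
n <- 200000

# Sample a coin type, then two flips whose heads-probability depends on that type.
coin    <- sample(c("Fair", "2-H", "2-T"), n, replace = TRUE)
p_heads <- c(Fair = 0.5, `2-H` = 1, `2-T` = 0)[coin]
y1 <- rbinom(n, 1, p_heads)   # 1 = Heads
y2 <- rbinom(n, 1, p_heads)

mean(y2)              # about 0.5:   P(Y2 = H)
mean(y2[y1 == 1])     # about 0.833: P(Y2 = H | Y1 = H) = 5/6
```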

Formally assessing conditional independence in Bayes nets

The comprehensive criterion for assessing conditional independence is known as d-separation.
- A path between two disjoint node sets A and B is a sequence of edges connecting some node in A with some node in B.
- Any node on a given path has converging arrows if two edges on the path connect to it and both point to it.
- A node on the path has non-converging arrows if two edges on the path connect to it, but at least one does not point to it.

A third disjoint node set C d-separates A and B if for every path between A and B, either:
1. there is some node on the path with converging arrows which is not in C (and none of whose descendants is in C); or
2. there is some node on the path whose arrows do not converge and which is in C.

Major types of d-separation

[Figure: four small DAGs illustrating the major cases]
- Common-cause d-separation (from knowing Z): X ← Z → Y
- Intervening d-separation (from knowing Y): X → Y → Z
- Explaining away (knowing Z prevents d-separation): X → Z ← Y
- D-separation in the absence of knowledge of Z: X → Z ← Y
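To make the collider case concrete, here is a small R simulation (my own illustration, not from the slides): two marginally independent binary causes X and Y and an effect Z = X OR Y. X and Y are uncorrelated on their own but become dependent once we condition on Z.

```r
set.seed(2)
n <- 100000
x <- rbinom(n, 1, 0.3)     # two marginally independent causes
y <- rbinom(n, 1, 0.3)
z <- as.integer(x | y)     # collider: X -> Z <- Y

cor(x, y)                  # about 0: marginally independent
cor(x[z == 1], y[z == 1])  # clearly negative: dependent once we condition on Z
```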

Back to our example

  Y1 ← X → Y2

Without looking at the coin before flipping it, the outcome Y1 of the first flip gives me information about the type of coin, and affects my beliefs about the outcome of Y2.

But if I look at the coin before flipping it, Y1 and Y2 are rendered independent.

An example of explaining away

"I saw an exhibition about the, uh..."

There are several causes of disfluency, including:
- An upcoming word is difficult to produce (e.g., a low-frequency word such as astrolabe)
- The speaker's attention was distracted by something in the non-linguistic environment

A reasonable graphical model is the collider W → D ← A, with
  W: hard word?
  A: attention distracted?
  D: disfluency?

An example of explaining away

(DAG: W → D ← A, with W: hard word?, A: attention distracted?, D: disfluency?)

Without knowledge of D, there's no reason to expect that W and A are correlated.

But hearing a disfluency demands a cause.

Knowing that there was a distraction explains away the disfluency, reducing the probability that the speaker was planning to utter a hard word.

An example of the disfluency model

Let's suppose that both hard words and distractions are unusual, the latter more so:
  P(W = hard) = 0.25
  P(A = distracted) = 0.15

Hard words and distractions both induce disfluencies; having both makes a disfluency really likely.

[Table: P(D = disfluency | W, A) for the four combinations of W ∈ {easy, hard} and A ∈ {undistracted, distracted}]

Suppose that we observe the speaker uttering a disfluency. What is P(W = hard | D = disfluent)?

Now suppose we also learn that her attention is distracted. What does that do to our beliefs about W? That is, what is P(W = hard | D = disfluent, A = distracted)?

An example of the disfluency model

Fortunately, there is automated machinery to "turn the Bayesian crank":

  P(W = hard) = 0.25
  P(W = hard | D = disfluent) = 0.57
  P(W = hard | D = disfluent, A = distracted) = 0.40

Knowing that the speaker was distracted (A) decreased the probability that the speaker was about to utter a hard word (W): A explained D away.

A caveat: the type of relationship among A, W, and D will depend on the values one finds in the probability tables P(W), P(A), and P(D | W, A)!

Summary thus far

Key points:
- Bayes' Rule is a compelling framework for modeling inference under uncertainty.
- DAGs/Bayes nets are a broad class of models for specifying joint probability distributions with conditional independencies.

Classic Bayes net references: ?; ?; ?, Chapter 14; ?, Chapter 8.


An example of the disfluency model: P(W = hard | D = disfluent, A = distracted)

Abbreviating W = hard as "hard", W = easy as "easy", D = disfluent as "disfl", A = distracted as "distr", and A = undistracted as "undistr":

  P(hard | disfl, distr) = P(disfl | hard, distr) P(hard | distr) / P(disfl | distr)    (Bayes' Rule)
                         = P(disfl | hard, distr) P(hard) / P(disfl | distr)            (independence of W and A from the DAG)

  P(disfl | distr) = Σ_w P(disfl | W = w, distr) P(W = w)                               (marginalization)
                   = P(disfl | hard, distr) P(hard) + P(disfl | easy, distr) P(easy)

Substituting the values from the probability tables gives P(hard | disfl, distr) = 0.40.

An example of the disfluency model: P(W = hard | D = disfluent)

  P(hard | disfl) = P(disfl | hard) P(hard) / P(disfl)                                  (Bayes' Rule)

  P(disfl | hard) = Σ_a P(disfl | A = a, hard) P(A = a | hard)
                  = P(disfl | distr, hard) P(distr | hard) + P(disfl | undistr, hard) P(undistr | hard)

  P(disfl | easy) = Σ_a P(disfl | A = a, easy) P(A = a | easy)
                  = P(disfl | distr, easy) P(distr | easy) + P(disfl | undistr, easy) P(undistr | easy)

  P(disfl) = Σ_w P(disfl | W = w) P(W = w)
           = P(disfl | hard) P(hard) + P(disfl | easy) P(easy)

Substituting the values from the probability tables gives P(hard | disfl) = 0.57.
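For completeness, here is an R sketch of the same computation by brute-force enumeration over the joint P(W, A, D). The P(D | W, A) entries below are values I made up for illustration (the slide's own table is not reproduced here), so the resulting numbers differ from the 0.57 and 0.40 above; with these placeholders the posterior drops from about 0.71 to about 0.47, showing the same explaining-away pattern.

```r
# Priors from the slides, plus an illustrative conditional table P(D = disfluency | W, A).
p_w <- c(easy = 0.75, hard = 0.25)
p_a <- c(undistracted = 0.85, distracted = 0.15)
p_d_given_wa <- matrix(c(0.02, 0.30,    # easy:  undistracted, distracted
                         0.40, 0.80),   # hard:  undistracted, distracted
                       nrow = 2, byrow = TRUE,
                       dimnames = list(W = names(p_w), A = names(p_a)))

# Full joint over (W, A, D), assuming W and A are independent a priori (as in the DAG).
joint <- array(0, dim = c(2, 2, 2),
               dimnames = list(W = names(p_w), A = names(p_a), D = c("fluent", "disfluent")))
for (w in names(p_w)) for (a in names(p_a)) {
  p_dis <- p_d_given_wa[w, a]
  joint[w, a, "disfluent"] <- p_w[w] * p_a[a] * p_dis
  joint[w, a, "fluent"]    <- p_w[w] * p_a[a] * (1 - p_dis)
}

# Posterior queries: sum the relevant slices of the joint and renormalize.
sum(joint["hard", , "disfluent"]) / sum(joint[, , "disfluent"])   # P(hard | disfluent)
joint["hard", "distracted", "disfluent"] /
  sum(joint[, "distracted", "disfluent"])                         # P(hard | disfluent, distracted)
```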

Bayesian parameter estimation

The scenario: you are a native English speaker in whose experience passivizable constructions are passivized with frequency q.
1. The ball hit the window. (Active)
2. The window was hit by the ball. (Passive)

You encounter a new dialect of English and hear data y consisting of n passivizable utterances, m of which were passivized; each passivizable utterance is modeled as a Bernoulli trial, X ~ Bern(π).

Goal:
- Estimate the success parameter π associated with passivization in the new English dialect;
- Or place a probability distribution on the number of passives in the next N passivizable utterances.

Anatomy of Bayesian inference

Simplest possible scenario (as a DAG): I → θ → Y

The corresponding Bayesian inference:

  P(θ | y, I) = P(y | θ, I) P(θ | I) / P(y | I)
              = P(y | θ) P(θ | I) / P(y | I)        (because Y ⊥ I | θ)

where P(y | θ) is the likelihood for θ, P(θ | I) is the prior over θ, and the denominator P(y | I) is the likelihood marginalized over θ.

At the bottom of the graph, our model is the binomial distribution:

  y | θ ~ Binom(n, θ)

But to get things going we have to set the prior P(θ | I).
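Before introducing the conjugate prior, here is a quick R sketch (not from the slides) that approximates this posterior numerically on a grid of θ values, using n = 7 and m = 2 as in the worked example below and, for simplicity, a flat prior:

```r
# Grid approximation of P(theta | y) with a binomial likelihood and a flat prior.
n <- 7; m <- 2
theta <- seq(0.001, 0.999, length.out = 999)

prior      <- rep(1, length(theta))         # flat prior (illustrative choice)
likelihood <- dbinom(m, size = n, prob = theta)
posterior  <- likelihood * prior
posterior  <- posterior / sum(posterior)    # normalize over the grid

theta[which.max(posterior)]                 # posterior mode: about m/n = 0.286 under a flat prior
sum(theta * posterior)                      # posterior mean
```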

Priors for the binomial distribution

For a model with parameters θ, a prior distribution is just some joint probability distribution P(θ).
- Because the prior is often supposed to account for knowledge we bring to the table, we often write P(θ | I) to be explicit.
- Model parameters are nearly always real-valued, so P(θ) is generally a multivariate continuous distribution.
- In general, the sky is the limit as to what you choose for P(θ).
- But in many cases there are useful priors that will make your life easier.

The beta distribution

The beta distribution has two parameters α1, α2 > 0 and is defined as:

  P(π | α1, α2) = π^(α1 - 1) (1 - π)^(α2 - 1) / B(α1, α2)        (0 ≤ π ≤ 1; α1, α2 > 0)

where the beta function B(α1, α2) serves as a normalizing constant:

  B(α1, α2) = ∫₀¹ π^(α1 - 1) (1 - π)^(α2 - 1) dπ

Some beta distributions

[Figure: densities p(π) for Beta(1,1), Beta(0.5,0.5), Beta(3,3), and Beta(3,0.5)]

If X ~ Beta(α1, α2):
- E[X] = α1 / (α1 + α2)
- If α1, α2 > 1, then X has a mode at (α1 - 1) / (α1 + α2 - 2)
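The densities in that figure can be reproduced in R with dbeta(); a minimal sketch:

```r
# Plot the four beta densities named above, plus the mean and mode of Beta(3, 3).
pi_grid <- seq(0.001, 0.999, length.out = 500)

plot(pi_grid, dbeta(pi_grid, 1, 1), type = "l", lty = 1, ylim = c(0, 4),
     xlab = expression(pi), ylab = expression(p(pi)))
lines(pi_grid, dbeta(pi_grid, 0.5, 0.5), lty = 2)
lines(pi_grid, dbeta(pi_grid, 3, 3),     lty = 3)
lines(pi_grid, dbeta(pi_grid, 3, 0.5),   lty = 4)
legend("top", legend = c("Beta(1,1)", "Beta(0.5,0.5)", "Beta(3,3)", "Beta(3,0.5)"), lty = 1:4)

c(mean = 3 / (3 + 3), mode = (3 - 1) / (3 + 3 - 2))   # both 0.5 for Beta(3, 3)
```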

Using the beta distribution as a prior

1. The ball hit the window. (Active)
2. The window was hit by the ball. (Passive)

Let us use a beta distribution as a prior for our problem; hence I = (α1, α2).

  P(π | y, α1, α2) = P(y | π) P(π | α1, α2) / P(y | α1, α2)        (1)

Since the denominator is not a function of π, it is a normalizing constant. Ignore it and work in terms of proportionality:

  P(π | y, α1, α2) ∝ P(y | π) P(π | α1, α2)

The likelihood for the binomial distribution is

  P(y | π) = (n choose m) π^m (1 - π)^(n - m)

and the beta prior is

  P(π | α1, α2) = π^(α1 - 1) (1 - π)^(α2 - 1) / B(α1, α2)

Ignoring (n choose m) and B(α1, α2) (both constant in π):

  P(π | y, α1, α2) ∝ π^m (1 - π)^(n - m) · π^(α1 - 1) (1 - π)^(α2 - 1)        [likelihood · prior]
                   ∝ π^(m + α1 - 1) (1 - π)^(n - m + α2 - 1)

Crucial trick: this is itself a beta distribution! Recall that if π ~ Beta(α1, α2) then

  P(π) = π^(α1 - 1) (1 - π)^(α2 - 1) / B(α1, α2)

Hence P(π | y, α1, α2) is distributed as Beta(α1 + m, α2 + n - m).

With a beta prior and a binomial likelihood, the posterior is still beta-distributed. This is called conjugacy.

Using our beta-binomial model

Goal:
- Estimate the success parameter π associated with passivization in the new English dialect;
- Or place a probability distribution on the number of passives in the next N passivizable utterances.

To estimate π it is common to use maximum a-posteriori (MAP) estimation: choose the value of π with highest posterior probability.
- P(passive | passivizable clause) ≈ 0.08 (Roland et al., 2007)
- The mode of a beta distribution is (α1 - 1) / (α1 + α2 - 2)
- Hence we might use α1 = 3, α2 = 24 (note that 2/25 = 0.08)
- Suppose that n = 7, m = 2: our posterior will be Beta(5, 29), hence π̂ = 4/32 = 0.125
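The same update in R (a small sketch of the numbers in the slide):

```r
# Prior chosen so the prior mode is 2/25 = 0.08, then updated with n = 7, m = 2.
alpha1 <- 3; alpha2 <- 24
n <- 7; m <- 2

post1 <- alpha1 + m        # 5
post2 <- alpha2 + n - m    # 29

(post1 - 1) / (post1 + post2 - 2)   # MAP estimate: 4/32 = 0.125
post1 / (post1 + post2)             # posterior mean: 5/34, about 0.147

# Prior vs. posterior densities (compare the beta-binomial posterior figure below).
pi_grid <- seq(0.001, 0.999, length.out = 500)
plot(pi_grid, dbeta(pi_grid, alpha1, alpha2), type = "l", lty = 2,
     xlab = expression(pi), ylab = "density")
lines(pi_grid, dbeta(pi_grid, post1, post2), lty = 1)
legend("topright", legend = c("prior Beta(3, 24)", "posterior Beta(5, 29)"), lty = c(2, 1))
```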

Beta-binomial posterior distributions

[Figure: the prior density over p, together with the likelihood and posterior for n = 7 and for n = 21]

Fully Bayesian density estimation

Goal:
- Estimate the success parameter π associated with passivization in the new English dialect;
- Or place a probability distribution on the number of passives in the next N passivizable utterances.

In the fully Bayesian view, we don't summarize our posterior beliefs into a point estimate; rather, we marginalize over them in predicting the future:

  P(y_new | y, I) = ∫ P(y_new | θ) P(θ | y, I) dθ

This leads to the beta-binomial predictive model for r successes in the next k trials:

  P(r | k, I, y) = (k choose r) B(α1 + m + r, α2 + n - m + k - r) / B(α1 + m, α2 + n - m)
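A direct R transcription of that predictive formula (my sketch), computed on the log scale with lchoose() and lbeta() for numerical stability, and compared with a plain binomial that plugs in the MAP point estimate:

```r
# Posterior-predictive (beta-binomial) probability of r passives in the next k trials.
dbetabinom <- function(r, k, a1, a2) {
  exp(lchoose(k, r) + lbeta(a1 + r, a2 + k - r) - lbeta(a1, a2))
}

a1 <- 3 + 2; a2 <- 24 + 7 - 2        # posterior Beta(5, 29) from the previous slide
k  <- 50
r  <- 0:k

pred_bb  <- dbetabinom(r, k, a1, a2)     # marginalizes over pi
pred_bin <- dbinom(r, k, prob = 0.125)   # plugs in the MAP estimate instead

sum(pred_bb)                                   # 1: a proper distribution
round(cbind(r, pred_bin, pred_bb)[1:11, ], 3)  # the beta-binomial is more spread out
```

This reproduces the qualitative pattern in the figure on the next slide: the beta-binomial predictive places its mass over a wider range of outcomes than the plug-in binomial.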

Fully Bayesian density estimation

[Figure: P(k passives out of 50 trials | y, I) under the plug-in binomial model and under the beta-binomial predictive model]

Fully Bayesian density estimation

In this case (as in many others), marginalizing over the model parameters allows for greater dispersion in the model's predictions.

This is because the new observations are only conditionally independent given θ; with uncertainty about θ, they are linked!

  (DAG: I → θ, and θ → y_new^(1), y_new^(2), ..., y_new^(N))

References

Roland, D., Dick, F., and Elman, J. L. (2007). Frequency of basic English grammatical structures: A corpus analysis. Journal of Memory and Language, 57.


More information

Quantifying uncertainty & Bayesian networks

Quantifying uncertainty & Bayesian networks Quantifying uncertainty & Bayesian networks CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2016 Soleymani Artificial Intelligence: A Modern Approach, 3 rd Edition,

More information

Probability and Inference

Probability and Inference Deniz Yuret ECOE 554 Lecture 3 Outline 1 Probabilities and ensembles 2 3 Ensemble An ensemble X is a triple (x, A X, P X ), where the outcome x is the value of a random variable, which takes on one of

More information

Hierarchical Models & Bayesian Model Selection

Hierarchical Models & Bayesian Model Selection Hierarchical Models & Bayesian Model Selection Geoffrey Roeder Departments of Computer Science and Statistics University of British Columbia Jan. 20, 2016 Contact information Please report any typos or

More information

An Introduction to Bayesian Machine Learning

An Introduction to Bayesian Machine Learning 1 An Introduction to Bayesian Machine Learning José Miguel Hernández-Lobato Department of Engineering, Cambridge University April 8, 2013 2 What is Machine Learning? The design of computational systems

More information

Some Probability and Statistics

Some Probability and Statistics Some Probability and Statistics David M. Blei COS424 Princeton University February 13, 2012 Card problem There are three cards Red/Red Red/Black Black/Black I go through the following process. Close my

More information

Probabilistic Reasoning. (Mostly using Bayesian Networks)

Probabilistic Reasoning. (Mostly using Bayesian Networks) Probabilistic Reasoning (Mostly using Bayesian Networks) Introduction: Why probabilistic reasoning? The world is not deterministic. (Usually because information is limited.) Ways of coping with uncertainty

More information

Origins of Probability Theory

Origins of Probability Theory 1 16.584: INTRODUCTION Theory and Tools of Probability required to analyze and design systems subject to uncertain outcomes/unpredictability/randomness. Such systems more generally referred to as Experiments.

More information

Overview of Probability. Mark Schmidt September 12, 2017

Overview of Probability. Mark Schmidt September 12, 2017 Overview of Probability Mark Schmidt September 12, 2017 Dungeons & Dragons scenario: You roll dice 1: Practical Application Roll or you sneak past monster. Otherwise, you are eaten. If you survive, you

More information

Decision theory. 1 We may also consider randomized decision rules, where δ maps observed data D to a probability distribution over

Decision theory. 1 We may also consider randomized decision rules, where δ maps observed data D to a probability distribution over Point estimation Suppose we are interested in the value of a parameter θ, for example the unknown bias of a coin. We have already seen how one may use the Bayesian method to reason about θ; namely, we

More information

CS 2750: Machine Learning. Bayesian Networks. Prof. Adriana Kovashka University of Pittsburgh March 14, 2016

CS 2750: Machine Learning. Bayesian Networks. Prof. Adriana Kovashka University of Pittsburgh March 14, 2016 CS 2750: Machine Learning Bayesian Networks Prof. Adriana Kovashka University of Pittsburgh March 14, 2016 Plan for today and next week Today and next time: Bayesian networks (Bishop Sec. 8.1) Conditional

More information

Review of Basic Probability

Review of Basic Probability Review of Basic Probability Erik G. Learned-Miller Department of Computer Science University of Massachusetts, Amherst Amherst, MA 01003 September 16, 2009 Abstract This document reviews basic discrete

More information

Axioms of Probability? Notation. Bayesian Networks. Bayesian Networks. Today we ll introduce Bayesian Networks.

Axioms of Probability? Notation. Bayesian Networks. Bayesian Networks. Today we ll introduce Bayesian Networks. Bayesian Networks Today we ll introduce Bayesian Networks. This material is covered in chapters 13 and 14. Chapter 13 gives basic background on probability and Chapter 14 talks about Bayesian Networks.

More information

Lecture Notes 1 Basic Probability. Elements of Probability. Conditional probability. Sequential Calculation of Probability

Lecture Notes 1 Basic Probability. Elements of Probability. Conditional probability. Sequential Calculation of Probability Lecture Notes 1 Basic Probability Set Theory Elements of Probability Conditional probability Sequential Calculation of Probability Total Probability and Bayes Rule Independence Counting EE 178/278A: Basic

More information

Some slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos 2

Some slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos 2 Logistics CSE 446: Point Estimation Winter 2012 PS2 out shortly Dan Weld Some slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos 2 Last Time Random variables, distributions Marginal, joint & conditional

More information

Introduction to Machine Learning

Introduction to Machine Learning Introduction to Machine Learning Introduction to Probabilistic Methods Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB

More information

Probability theory basics

Probability theory basics Probability theory basics Michael Franke Basics of probability theory: axiomatic definition, interpretation, joint distributions, marginalization, conditional probability & Bayes rule. Random variables:

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 143 Part IV

More information

CS 188: Artificial Intelligence Fall 2009

CS 188: Artificial Intelligence Fall 2009 CS 188: Artificial Intelligence Fall 2009 Lecture 14: Bayes Nets 10/13/2009 Dan Klein UC Berkeley Announcements Assignments P3 due yesterday W2 due Thursday W1 returned in front (after lecture) Midterm

More information

CS4705. Probability Review and Naïve Bayes. Slides from Dragomir Radev

CS4705. Probability Review and Naïve Bayes. Slides from Dragomir Radev CS4705 Probability Review and Naïve Bayes Slides from Dragomir Radev Classification using a Generative Approach Previously on NLP discriminative models P C D here is a line with all the social media posts

More information

Probability & statistics for linguists Class 2: more probability. D. Lassiter (h/t: R. Levy)

Probability & statistics for linguists Class 2: more probability. D. Lassiter (h/t: R. Levy) Probability & statistics for linguists Class 2: more probability D. Lassiter (h/t: R. Levy) conditional probability P (A B) = when in doubt about meaning: draw pictures. P (A \ B) P (B) keep B- consistent

More information

Introduction to Probability and Statistics (Continued)

Introduction to Probability and Statistics (Continued) Introduction to Probability and Statistics (Continued) Prof. icholas Zabaras Center for Informatics and Computational Science https://cics.nd.edu/ University of otre Dame otre Dame, Indiana, USA Email:

More information

The Monte Carlo Method: Bayesian Networks

The Monte Carlo Method: Bayesian Networks The Method: Bayesian Networks Dieter W. Heermann Methods 2009 Dieter W. Heermann ( Methods)The Method: Bayesian Networks 2009 1 / 18 Outline 1 Bayesian Networks 2 Gene Expression Data 3 Bayesian Networks

More information

Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak

Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak Introduction to Systems Analysis and Decision Making Prepared by: Jakub Tomczak 1 Introduction. Random variables During the course we are interested in reasoning about considered phenomenon. In other words,

More information

Probability Review Lecturer: Ji Liu Thank Jerry Zhu for sharing his slides

Probability Review Lecturer: Ji Liu Thank Jerry Zhu for sharing his slides Probability Review Lecturer: Ji Liu Thank Jerry Zhu for sharing his slides slide 1 Inference with Bayes rule: Example In a bag there are two envelopes one has a red ball (worth $100) and a black ball one

More information

Announcements. CS 188: Artificial Intelligence Spring Probability recap. Outline. Bayes Nets: Big Picture. Graphical Model Notation

Announcements. CS 188: Artificial Intelligence Spring Probability recap. Outline. Bayes Nets: Big Picture. Graphical Model Notation CS 188: Artificial Intelligence Spring 2010 Lecture 15: Bayes Nets II Independence 3/9/2010 Pieter Abbeel UC Berkeley Many slides over the course adapted from Dan Klein, Stuart Russell, Andrew Moore Current

More information

Probability. Machine Learning and Pattern Recognition. Chris Williams. School of Informatics, University of Edinburgh. August 2014

Probability. Machine Learning and Pattern Recognition. Chris Williams. School of Informatics, University of Edinburgh. August 2014 Probability Machine Learning and Pattern Recognition Chris Williams School of Informatics, University of Edinburgh August 2014 (All of the slides in this course have been adapted from previous versions

More information

Learning in Bayesian Networks

Learning in Bayesian Networks Learning in Bayesian Networks Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Berlin: 20.06.2002 1 Overview 1. Bayesian Networks Stochastic Networks

More information

Review: Bayesian learning and inference

Review: Bayesian learning and inference Review: Bayesian learning and inference Suppose the agent has to make decisions about the value of an unobserved query variable X based on the values of an observed evidence variable E Inference problem:

More information