Advanced Probabilistic Modeling in R Day 1
Roger Levy
University of California, San Diego
July 20
Today's content
- Quick review of probability: axioms, joint and conditional probabilities, Bayes' Rule, conditional independence
- Bayes nets (a.k.a. directed acyclic graphical models, DAGs)
- The Gaussian distribution
- Example: human phoneme categorization
- Maximum likelihood estimation
- Bayesian parameter estimation
- Frequentist hypothesis testing
- Bayesian hypothesis testing
Probability spaces
Traditionally, probability spaces are defined in terms of sets. An event E is a subset of a sample space Ω: E ⊆ Ω.
A probability space P on a sample space Ω is a function from events E in Ω to real numbers such that the following three axioms hold:
1. P(E) ≥ 0 for all E ⊆ Ω (non-negativity).
2. If E1 and E2 are disjoint, then P(E1 ∪ E2) = P(E1) + P(E2) (disjoint union).
3. P(Ω) = 1 (properness).
Joint, conditional, and marginal probabilities
Given the joint distribution P(X, Y) over two random variables X and Y, the conditional distribution P(Y | X) is defined as
    P(Y | X) = P(X, Y) / P(X)
The marginal probability distribution P(X) is
    P(X = x) = Σ_y P(X = x, Y = y)
These concepts can be extended to arbitrary numbers of random variables.
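These two definitions can be checked mechanically in R. Below is a minimal sketch with a made-up 2×2 joint distribution (the numbers are illustrative, not from the slides):

```r
# A made-up joint distribution P(X, Y) over two binary variables.
joint <- matrix(c(0.30, 0.10,
                  0.20, 0.40),
                nrow = 2, byrow = TRUE,
                dimnames = list(X = c("x1", "x2"), Y = c("y1", "y2")))
# Marginal P(X): sum the joint over the values of Y.
p_x <- rowSums(joint)
# Conditional P(Y | X = x1): the joint row divided by the marginal.
p_y_given_x1 <- joint["x1", ] / p_x[["x1"]]
p_x            # 0.40, 0.60
p_y_given_x1   # 0.75, 0.25
```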
The chain rule
A joint probability can be rewritten as the product of marginal and conditional probabilities:
    P(E1, E2) = P(E2 | E1) P(E1)
And this generalizes to more than two variables:
    P(E1, E2, E3) = P(E3 | E1, E2) P(E2 | E1) P(E1)
    ...
    P(E1, E2, ..., En) = P(En | E1, E2, ..., E(n-1)) ... P(E2 | E1) P(E1)
Breaking a joint probability down into the product of a marginal probability and several conditional probabilities this way is called chain rule decomposition.
Bayes' Rule (Bayes' Theorem)
    P(A | B) = P(B | A) P(A) / P(B)
With extra background random variables I:
    P(A | B, I) = P(B | A, I) P(A | I) / P(B | I)
This theorem follows directly from the definition of conditional probability:
    P(A, B) = P(B | A) P(A)
    P(A, B) = P(A | B) P(B)
So
    P(A | B) P(B) = P(B | A) P(A)
and dividing both sides by P(B) gives Bayes' Rule.
Other ways of writing Bayes' Rule
    P(A | B) = P(B | A) P(A) / P(B)
where P(B | A) is the likelihood, P(A) is the prior, and P(B) is the normalizing constant.
The hardest part of using Bayes' Rule is calculating the normalizing constant (a.k.a. the partition function).
Hence there are two other ways we often write Bayes' Rule:
1. Emphasizing explicit marginalization:
    P(A | B) = P(B | A) P(A) / Σ_a P(A = a, B)
2. Ignoring the partition function:
    P(A | B) ∝ P(B | A) P(A)
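The explicit-marginalization form is easy to see in R: multiply prior by likelihood, then divide by the sum. A minimal sketch with hypothetical prior and likelihood values over three hypotheses (not tied to any example in the slides):

```r
# Bayes' Rule by explicit marginalization over a discrete hypothesis space.
prior      <- c(h1 = 1/3, h2 = 1/3, h3 = 1/3)   # P(A), hypothetical values
likelihood <- c(h1 = 0.5, h2 = 1.0, h3 = 0.0)   # P(B | A), hypothetical values
unnorm    <- likelihood * prior                  # numerator: P(B | A) P(A)
posterior <- unnorm / sum(unnorm)                # denominator: sum over a of P(A = a, B)
posterior   # h1 = 1/3, h2 = 2/3, h3 = 0
```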
(Conditional) Independence
Events A and B are said to be conditionally independent given information C if
    P(A, B | C) = P(A | C) P(B | C)
Conditional independence of A and B given C is often written A ⊥ B | C.
Directed graphical models
- A lot of the interesting joint probability distributions in the study of language involve conditional independencies among the variables.
- So next we'll introduce you to a general framework for specifying conditional independencies among collections of random variables.
- It won't allow us to express all possible independencies that may hold, but it goes a long way.
- And I hope that you'll agree that the framework is intuitive too!
A non-linguistic example
Imagine a factory that produces three types of coins in equal volumes:
- Fair coins;
- 2-headed coins;
- 2-tailed coins.
Generative process:
- The factory produces a coin of type X and sends it to you;
- You receive the coin and flip it twice, with H(eads)/T(ails) outcomes Y1 and Y2.
Receiving a coin from the factory and flipping it twice is sampling (or taking a sample) from the joint distribution P(X, Y1, Y2).
This generative process as a Bayes net
The directed acyclic graphical model (DAG), or Bayes net:
    X → Y1,  X → Y2
Semantics of a Bayes net: the joint distribution can be expressed as the product of the conditional distributions of each variable given only its parents.
In this DAG, P(X, Y1, Y2) = P(X) P(Y1 | X) P(Y2 | X):

    X      P(X)
    Fair   1/3
    2-H    1/3
    2-T    1/3

    X      P(Y1 = H | X)   P(Y1 = T | X)
    Fair   1/2             1/2
    2-H    1               0
    2-T    0               1

    X      P(Y2 = H | X)   P(Y2 = T | X)
    Fair   1/2             1/2
    2-H    1               0
    2-T    0               1
Conditional independence in Bayes nets
Question: conditioned on not having any further information, are the two coin flips Y1 and Y2 in this generative process independent? That is, if C = {}, is it the case that Y1 ⊥ Y2 | C?
No!
- P(Y2 = H) = 1/2 (you can see this by symmetry).
- But P(Y2 = H | Y1 = H) = P(coin was fair | Y1 = H) · 1/2 + P(coin was 2-H | Y1 = H) · 1 = (1/3)(1/2) + (2/3)(1) = 5/6 ≠ 1/2.
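The dependence between the two flips is easy to confirm by simulating the generative process in R, following the conditional probability tables above:

```r
# Simulate the coin-factory generative process: sample a coin type X,
# then flip the same coin twice to get Y1 and Y2.
set.seed(1)
n <- 200000
coin <- sample(c("fair", "2H", "2T"), n, replace = TRUE)   # X uniform over types
p_heads <- c(fair = 0.5, "2H" = 1, "2T" = 0)[coin]
y1 <- runif(n) < p_heads                                    # first flip
y2 <- runif(n) < p_heads                                    # second flip, same coin
mean(y2)        # unconditional: about 1/2
mean(y2[y1])    # conditioned on Y1 = H: about 5/6
```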
Formally assessing conditional independence in Bayes nets
The comprehensive criterion for assessing conditional independence is known as d-separation.
- A path between two disjoint node sets A and B is a sequence of edges connecting some node in A with some node in B.
- A node on a given path has converging arrows if two edges on the path connect to it and point to it.
- A node on the path has non-converging arrows if two edges on the path connect to it, but at least one does not point to it.
A third disjoint node set C d-separates A and B if for every path between A and B, either:
1. there is some node on the path with converging arrows which is not in C (and none of whose descendants is in C); or
2. there is some node on the path whose arrows do not converge and which is in C.
Major types of d-separation
Four canonical cases (slide diagrams omitted):
- Common-cause d-separation (from knowing Z): X ← Z → Y
- Intervening d-separation (from knowing Y): X → Y → Z
- Explaining away: knowing Z prevents d-separation: X → Z ← Y
- D-separation in the absence of knowledge of Z: X → Z ← Y, with Z unobserved
Back to our example
    X → Y1,  X → Y2
- Without looking at the coin before flipping it, the outcome Y1 of the first flip gives me information about the type of coin, and affects my beliefs about the outcome of Y2.
- But if I look at the coin before flipping it, Y1 and Y2 are rendered independent.
An example of explaining away
"I saw an exhibition about the, uh..."
There are several causes of disfluency, including:
- An upcoming word is difficult to produce (e.g., low frequency: astrolabe);
- The speaker's attention was distracted by something in the non-linguistic environment.
A reasonable graphical model:
    W → D ← A   (W: hard word?  A: attention distracted?  D: disfluency?)
An example of explaining away
    W → D ← A   (W: hard word?  A: attention distracted?  D: disfluency?)
- Without knowledge of D, there's no reason to expect that W and A are correlated.
- But hearing a disfluency demands a cause.
- Knowing that there was a distraction explains away the disfluency, reducing the probability that the speaker was planning to utter a hard word.
An example of the disfluency model
Let's suppose that both hard words and distractions are unusual, the latter more so:
    P(W = hard) = 0.25
    P(A = distracted) = 0.15
Hard words and distractions both induce disfluencies; having both makes a disfluency really likely:

    W      A             P(D = no disfluency | W, A)   P(D = disfluency | W, A)
    easy   undistracted  …                             …
    easy   distracted    …                             …
    hard   undistracted  …                             …
    hard   distracted    …                             …
An example of the disfluency model
    P(W = hard) = 0.25,  P(A = distracted) = 0.15
- Suppose that we observe the speaker uttering a disfluency. What is P(W = hard | D = disfluent)?
- Now suppose we also learn that her attention is distracted. What does that do to our beliefs about W?
- That is, what is P(W = hard | D = disfluent, A = distracted)?
An example of the disfluency model
Fortunately, there is automated machinery to turn the Bayesian crank:
    P(W = hard) = 0.25
    P(W = hard | D = disfluent) = 0.57
    P(W = hard | D = disfluent, A = distracted) = 0.40
Knowing that the speaker was distracted (A) decreased the probability that the speaker was about to utter a hard word (W): A explained D away.
A caveat: the type of relationship among A, W, and D will depend on the values one finds in the probability tables P(W), P(A), and P(D | W, A)!
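The crank can be turned by brute-force enumeration in R. In the sketch below the priors on W and A are the slides' values, but the entries of P(D | W, A) are hypothetical stand-ins (the slide's exact table entries are not reproduced here), so the intermediate numbers differ from those above; the explaining-away pattern is the same.

```r
# Explaining away by enumerating the joint P(W, A, D = disfluent).
p_w <- c(easy = 0.75, hard = 0.25)       # prior on W (from the slides)
p_a <- c(undistr = 0.85, distr = 0.15)   # prior on A (from the slides)
# Hypothetical stand-in values for P(D = disfluent | W, A):
p_disfl <- matrix(c(0.05, 0.40,
                    0.40, 0.80),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("easy", "hard"), c("undistr", "distr")))
joint <- matrix(0, 2, 2, dimnames = dimnames(p_disfl))
for (w in rownames(joint)) for (a in colnames(joint))
  joint[w, a] <- p_w[[w]] * p_a[[a]] * p_disfl[w, a]
p_hard_d  <- sum(joint["hard", ]) / sum(joint)               # P(hard | disfluent)
p_hard_da <- joint["hard", "distr"] / sum(joint[, "distr"])  # P(hard | disfluent, distracted)
c(p_hard_d, p_hard_da)   # the second is lower: A explains D away
```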
Summary thus far
Key points:
- Bayes' Rule is a compelling framework for modeling inference under uncertainty.
- DAGs/Bayes nets are a broad class of models for specifying joint probability distributions with conditional independencies.
An example of the disfluency model: P(W = hard | D = disfluent, A = distracted)
Abbreviating W = hard as "hard", W = easy as "easy", D = disfluent as "disfl", A = distracted as "distr", and A = undistracted as "undistr":
    P(hard | disfl, distr) = P(disfl | hard, distr) P(hard | distr) / P(disfl | distr)    (Bayes' Rule)
                           = P(disfl | hard, distr) P(hard) / P(disfl | distr)            (independence from the DAG)
    P(disfl | distr) = Σ_w' P(disfl | W = w', distr) P(W = w')                            (marginalization)
                     = P(disfl | hard, distr) P(hard) + P(disfl | easy, distr) P(easy)
Substituting the table values gives P(hard | disfl, distr) = 0.40.

An example of the disfluency model: P(W = hard | D = disfluent)
    P(hard | disfl) = P(disfl | hard) P(hard) / P(disfl)    (Bayes' Rule)
    P(disfl | hard) = Σ_a P(disfl | A = a, hard) P(A = a | hard)
                    = P(disfl | distr, hard) P(distr | hard) + P(disfl | undistr, hard) P(undistr | hard)
    P(disfl | easy) = Σ_a P(disfl | A = a, easy) P(A = a | easy)
                    = P(disfl | distr, easy) P(distr | easy) + P(disfl | undistr, easy) P(undistr | easy)
    P(disfl) = Σ_w' P(disfl | W = w') P(W = w')
             = P(disfl | hard) P(hard) + P(disfl | easy) P(easy)
Substituting the table values gives P(hard | disfl) = 0.57.
Bayesian parameter estimation
The scenario: you are a native English speaker in whose experience passivizable constructions are passivized with frequency q.
1. The ball hit the window. (Active)
2. The window was hit by the ball. (Passive)
You encounter a new dialect of English and hear data y consisting of n passivizable utterances, m of which were passivized; each utterance is modeled as X ~ Bern(π).
Goal:
- Estimate the success parameter π associated with passivization in the new English dialect;
- Or place a probability distribution on the number of passives in the next N passivizable utterances.
Anatomy of Bayesian inference
Simplest possible scenario: I → θ → Y.
The corresponding Bayesian inference:
    P(θ | y, I) = P(y | θ, I) P(θ | I) / P(y | I)
                = P(y | θ) P(θ | I) / P(y | I)    (because y ⊥ I | θ)
where P(y | θ) is the likelihood for θ, P(θ | I) is the prior over θ, and P(y | I) is the likelihood marginalized over θ.
At the bottom of the graph, our model is the binomial distribution: y | θ ~ Binom(n, θ). But to get things going we have to set the prior P(θ | I).
Priors for the binomial distribution
- For a model with parameters θ, a prior distribution is just some joint probability distribution P(θ).
- Because the prior is often supposed to account for knowledge we bring to the table, we often write P(θ | I) to be explicit.
- Model parameters are nearly always real-valued, so P(θ) is generally a multivariate continuous distribution.
- In general, the sky is the limit as to what you choose for P(θ).
- But in many cases there are useful priors that will make your life easier.
The beta distribution
The beta distribution has two parameters α1, α2 > 0 and is defined as:
    P(π | α1, α2) = (1 / B(α1, α2)) π^(α1 - 1) (1 - π)^(α2 - 1)    (0 ≤ π ≤ 1)
where the beta function B(α1, α2) serves as a normalizing constant:
    B(α1, α2) = ∫₀¹ π^(α1 - 1) (1 - π)^(α2 - 1) dπ
Some beta distributions
If X ~ Beta(α1, α2):
- E[X] = α1 / (α1 + α2)
- If α1, α2 > 1, then X has a mode at (α1 - 1) / (α1 + α2 - 2).
(Figure: densities p(π) of Beta(1,1), Beta(0.5,0.5), Beta(3,3), and Beta(3,0.5).)
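Both facts can be checked numerically with base R. A quick sketch using Beta(3, 3), one of the densities plotted on the slide:

```r
# Monte Carlo check of the beta mean, and a grid check of the beta mode.
set.seed(42)
a1 <- 3; a2 <- 3
draws <- rbeta(1e5, a1, a2)
mean(draws)                            # close to a1 / (a1 + a2) = 0.5
grid <- seq(0.001, 0.999, by = 0.001)
grid[which.max(dbeta(grid, a1, a2))]   # near (a1 - 1) / (a1 + a2 - 2) = 0.5
```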
Using the beta distribution as a prior
Let us use a beta distribution as a prior for our problem; hence I = (α1, α2).
    P(π | y, α1, α2) = P(y | π) P(π | α1, α2) / P(y | α1, α2)    (1)
Since the denominator is not a function of π, it is a normalizing constant. Ignore it and work in terms of proportionality:
    P(π | y, α1, α2) ∝ P(y | π) P(π | α1, α2)
The likelihood for the binomial distribution is
    P(y | π) = (n choose m) π^m (1 - π)^(n - m)
The beta prior is
    P(π | α1, α2) = (1 / B(α1, α2)) π^(α1 - 1) (1 - π)^(α2 - 1)
Using the beta distribution as a prior
Ignore (n choose m) and B(α1, α2) (both constant in π):
    P(π | y, α1, α2) ∝ π^m (1 - π)^(n - m) · π^(α1 - 1) (1 - π)^(α2 - 1)    (likelihood × prior)
                     ∝ π^(m + α1 - 1) (1 - π)^(n - m + α2 - 1)
Crucial trick: this is itself a beta distribution! Recall that if π ~ Beta(α1, α2) then
    P(π) = (1 / B(α1, α2)) π^(α1 - 1) (1 - π)^(α2 - 1)
Hence P(π | y, α1, α2) is distributed as Beta(α1 + m, α2 + n - m).
With a beta prior and a binomial likelihood, the posterior is still beta-distributed. This is called conjugacy.
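Conjugacy can be verified numerically in R: the normalized product of the beta prior and the binomial likelihood on a fine grid should match dbeta with the updated parameters. A sketch using α1 = 3, α2 = 24, n = 7, m = 2 as example values:

```r
# Grid check that prior x likelihood, normalized, equals Beta(a1 + m, a2 + n - m).
a1 <- 3; a2 <- 24; n <- 7; m <- 2
grid <- seq(0.001, 0.999, length.out = 999)
unnorm <- dbeta(grid, a1, a2) * dbinom(m, n, grid)        # prior x likelihood
post_grid <- unnorm / sum(unnorm) / (grid[2] - grid[1])   # normalize to a density
post_exact <- dbeta(grid, a1 + m, a2 + n - m)             # conjugate posterior
max(abs(post_grid - post_exact))                          # tiny discretization error
```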
107 Using our beta-binomial model
Goal: Estimate the success parameter π associated with passivization in the new English dialect, or place a probability distribution on the number of passives in the next N passivizable utterances.
To estimate π it is common to use maximum a posteriori (MAP) estimation: choose the value of π with the highest posterior probability.
P(passive | passivizable clause) ≈ 0.08 (Roland et al., 2007)
The mode of a beta distribution is (α₁ − 1)/(α₁ + α₂ − 2).
Hence we might use α₁ = 3, α₂ = 24 (note that 2/25 = 0.08).
Suppose that n = 7, m = 2: our posterior will be Beta(5, 29), hence π̂ = 4/32 = 0.125.
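The MAP calculation above is easy to check numerically. This is a sketch in Python rather than R (the course's language); the variable names are mine:

```python
# Beta(3, 24) prior: mode = (alpha1 - 1) / (alpha1 + alpha2 - 2) = 2/25 = 0.08
a1, a2 = 3, 24
prior_mode = (a1 - 1) / (a1 + a2 - 2)

# Observe m = 2 passives in n = 7 passivizable clauses;
# conjugacy gives the posterior Beta(a1 + m, a2 + n - m) = Beta(5, 29)
n, m = 7, 2
post_a1, post_a2 = a1 + m, a2 + (n - m)

# The MAP estimate is the mode of the posterior
pi_map = (post_a1 - 1) / (post_a1 + post_a2 - 2)

print(prior_mode)  # 0.08
print(pi_map)      # 0.125
```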
108 Beta-binomial posterior distributions
[Figure: prior density over π, with likelihoods and posteriors for n = 7 and n = 21 observations]
112 Fully Bayesian density estimation
Goal: Estimate the success parameter π associated with passivization in the new English dialect, or place a probability distribution on the number of passives in the next N passivizable utterances.
In the fully Bayesian view, we don't summarize our posterior beliefs into a point estimate; rather, we marginalize over them in predicting the future:
P(y_new | y, I) = ∫_θ P(y_new | θ) P(θ | y, I) dθ
This leads to the beta-binomial predictive model:
P(r | k, I, y) = (k choose r) B(α₁ + m + r, α₂ + n − m + k − r) / B(α₁ + m, α₂ + n − m)
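The predictive formula can be transcribed directly. Here is a sketch in Python (the course itself uses R; function names are mine), computing the Beta function via log-gamma for numerical stability:

```python
from math import comb, lgamma, exp

def log_beta(a, b):
    # log B(a, b) = log Gamma(a) + log Gamma(b) - log Gamma(a + b)
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def pred(r, k, a1, a2, n, m):
    # P(r | k, I, y): probability of r successes in the next k trials,
    # given m successes in n past trials and a Beta(a1, a2) prior
    return comb(k, r) * exp(log_beta(a1 + m + r, a2 + (n - m) + (k - r))
                            - log_beta(a1 + m, a2 + (n - m)))

# Prior and data from the passivization example: Beta(3, 24), n = 7, m = 2
probs = [pred(r, 10, 3, 24, 7, 2) for r in range(11)]
print(sum(probs))                               # ~1.0: a proper distribution
print(sum(r * p for r, p in enumerate(probs)))  # mean = k * posterior mean = 10 * 5/34
```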
113 Fully Bayesian density estimation
[Figure: predictive distribution P(k passives out of 50 trials | y, I), binomial vs. beta-binomial]
116 Fully Bayesian density estimation
In this case (as in many others), marginalizing over the model parameters allows for greater dispersion in the model's predictions.
This is because the new observations are only conditionally independent given θ: with uncertainty about θ, they are linked!
[Graphical model: I → θ → y_new⁽¹⁾, y_new⁽²⁾, …, y_new⁽ᴺ⁾]
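The extra dispersion is easy to verify numerically. In this sketch (Python for illustration; variable names are mine), the plug-in binomial uses the posterior-mean π while the beta-binomial marginalizes over the posterior Beta(5, 29):

```python
from math import comb, lgamma, exp

a, b, k = 5, 29, 50  # posterior Beta(5, 29); predict 50 future trials
p = a / (a + b)      # posterior mean of pi

def log_beta(x, y):
    return lgamma(x) + lgamma(y) - lgamma(x + y)

# Plug-in binomial vs. marginalized beta-binomial pmf over r = 0..k
binom = [comb(k, r) * p**r * (1 - p)**(k - r) for r in range(k + 1)]
betabin = [comb(k, r) * exp(log_beta(a + r, b + k - r) - log_beta(a, b))
           for r in range(k + 1)]

def moments(pmf):
    mean = sum(r * q for r, q in enumerate(pmf))
    var = sum((r - mean) ** 2 * q for r, q in enumerate(pmf))
    return mean, var

print(moments(binom))    # same mean as below...
print(moments(betabin))  # ...but noticeably larger variance
```

Both distributions share the mean k·α/(α + β), but the beta-binomial variance is inflated by the factor (α + β + k)/(α + β + 1), reflecting the residual uncertainty about θ.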
117 References I Roland, D., Dick, F., and Elman, J. L. (2007). Frequency of basic English grammatical structures: A corpus analysis. Journal of Memory and Language, 57:
Day 1: Probability and speech perception
Day 2: Human sentence parsing
Day 3: Noisy-channel sentence processing
Day 4: Language production & acquisition
[Figure: child-directed speech sample ("whatsthat", "thedoggie", "yeah", "wheresthedoggie") and the learner's grammar/lexicon]