Lecture 18: Bayesian Inference
1 Lecture 18: Bayesian Inference
Hyang-Won Lee, Dept. of Internet & Multimedia Eng., Konkuk University
2 Bayesian Statistical Inference
Statistical inference: the process of estimating information about an unknown variable or model. For example, a biased coin whose head comes up w.p. $p$ (unknown).
Bayesian vs. classical inference:
- Bayesian: unknowns are random variables with known distributions; prior distribution $p_\Theta(\theta)$, posterior $p_{\Theta|X}(\theta|x)$ ($x$: observed data).
- Classical: unknowns are deterministic quantities that happen to be unknown; $\theta$ is a constant, to be estimated with some performance guarantee.
3 Bayesian Statistical Inference (contd.)
Inference (model/variable) problems:
i) Model inference: construct a model of a process and predict the future (e.g., weather forecasting).
ii) Variable inference: estimate an unknown parameter (e.g., GPS readings and the current position).
Example: noisy channel
i) A sequence of binary messages $S_i \in \{0,1\}$ is transmitted over a wireless channel.
ii) The receiver observes $X_i = a S_i + W_i$, $i = 1, \dots, n$, where $W_i \sim N(0, \sigma^2)$ and $a$ is a scalar.
iii) Model inference problem: $a$ unknown ($S_i$'s known).
iv) Variable inference problem: infer the $S_i$'s ($a$ known) based on the $X_i$'s.
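As a quick illustration, here is a minimal simulation sketch of the variable-inference side of this channel. The values of $a$, $\sigma$, and $n$ are assumptions chosen for the example, and the simple midpoint threshold is one natural decision rule when $a$ is known, not the rule prescribed by the slides.

```python
import numpy as np

# Minimal sketch of the noisy channel X_i = a*S_i + W_i (a, sigma, n are assumed).
rng = np.random.default_rng(0)
a, sigma, n = 2.0, 0.5, 10
S = rng.integers(0, 2, size=n)        # binary messages S_i in {0, 1}
W = rng.normal(0.0, sigma, size=n)    # noise W_i ~ N(0, sigma^2)
X = a * S + W                         # receiver observations

# Variable inference with a known: decide S_i = 1 iff X_i is closer to a than to 0.
S_hat = (X > a / 2).astype(int)
print("true S:", S)
print("est. S:", S_hat)
```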
4 Bayesian Statistical Inference
Types of statistical inference problems:
- Estimation (of an unknown constant or RV)
- Hypothesis testing (binary or m-ary)
Bayesian inference methods:
i) Maximum a posteriori probability (MAP) rule
ii) Least mean squares (LMS) estimation
iii) Linear least mean squares estimation
5 Bayesian Inference and Posterior Distribution
Pictorial introduction.
Bayes' rule:
(i) $\Theta$ discrete, $X$ discrete: $p_{\Theta|X}(\theta|x) = \dfrac{p_\Theta(\theta)\, p_{X|\Theta}(x|\theta)}{\sum_{\theta'} p_\Theta(\theta')\, p_{X|\Theta}(x|\theta')}$
(ii) $\Theta$ discrete, $X$ continuous: $p_{\Theta|X}(\theta|x) = \dfrac{p_\Theta(\theta)\, f_{X|\Theta}(x|\theta)}{\sum_{\theta'} p_\Theta(\theta')\, f_{X|\Theta}(x|\theta')}$
(iii) $\Theta$ continuous, $X$ discrete: $f_{\Theta|X}(\theta|x) = \dfrac{f_\Theta(\theta)\, p_{X|\Theta}(x|\theta)}{\int f_\Theta(\theta')\, p_{X|\Theta}(x|\theta')\, d\theta'}$
(iv) $\Theta$ continuous, $X$ continuous: $f_{\Theta|X}(\theta|x) = \dfrac{f_\Theta(\theta)\, f_{X|\Theta}(x|\theta)}{\int f_\Theta(\theta')\, f_{X|\Theta}(x|\theta')\, d\theta'}$
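Case (i) is easy to carry out numerically. The following sketch computes a discrete posterior from an assumed prior and likelihood; all the numbers are invented for illustration.

```python
import numpy as np

# Case (i): Theta and X both discrete. Prior and likelihood values are assumed.
p_theta = np.array([0.5, 0.3, 0.2])      # prior p_Theta over three candidate thetas
p_x_given = np.array([0.9, 0.5, 0.1])    # p_{X|Theta}(x | theta) for the observed x

numer = p_theta * p_x_given              # numerator of Bayes' rule, per theta
posterior = numer / numer.sum()          # denominator: sum over theta'
print(posterior, posterior.sum())        # posterior PMF; sums to 1
```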
6 Conditional Probability Revisited
Four versions of conditional probability:
(i) $\Theta$ discrete, $X$ discrete: $p_{\Theta|X}(\theta|x) = \dfrac{p_{\Theta,X}(\theta,x)}{p_X(x)}$
(ii) $\Theta$ discrete, $X$ continuous:
$p_{\Theta|X}(\theta|x) = P(\Theta = \theta \mid X = x) = \lim_{\delta \to 0} P(\Theta = \theta \mid x \le X \le x + \delta) = \lim_{\delta \to 0} \dfrac{p_\Theta(\theta)\, P(x \le X \le x + \delta \mid \Theta = \theta)}{P(x \le X \le x + \delta)} = \dfrac{p_\Theta(\theta)\, f_{X|\Theta}(x|\theta)}{\sum_{\theta'} p_\Theta(\theta')\, f_{X|\Theta}(x|\theta')} = \dfrac{p_\Theta(\theta)\, f_{X|\Theta}(x|\theta)}{f_X(x)}$
(iii) $\Theta$ continuous, $X$ discrete:
$f_{\Theta|X}(\theta|x) = \lim_{\delta \to 0} \dfrac{P(\theta \le \Theta \le \theta + \delta \mid X = x)}{\delta} = \lim_{\delta \to 0} \dfrac{P(\theta \le \Theta \le \theta + \delta)\, P(X = x \mid \theta \le \Theta \le \theta + \delta)}{\delta\, P(X = x)} = \dfrac{f_\Theta(\theta)\, p_{X|\Theta}(x|\theta)}{p_X(x)}$
(iv) $\Theta$ continuous, $X$ continuous: $f_{\Theta|X}(\theta|x) = \dfrac{f_\Theta(\theta)\, f_{X|\Theta}(x|\theta)}{f_X(x)}$
7 Example: Romeo & Juliet I
- Juliet will be late on any date by a random amount $X \sim U(0, \theta)$.
- $\theta$ unknown, modeled as an RV $\Theta \sim U(0, 1)$.
- Assume that Juliet was late by an amount $x$ on the 1st date.
- How do we update the distribution of $\Theta$?
What we know:
1) The prior PDF: $f_\Theta(\theta) = 1$ for $0 \le \theta \le 1$, and $0$ otherwise.
2) The conditional PDF of the observation: $f_{X|\Theta}(x|\theta) = 1/\theta$ for $0 \le x \le \theta$, and $0$ otherwise.
Posterior PDF:
$f_{\Theta|X}(\theta|x) = \dfrac{f_\Theta(\theta)\, f_{X|\Theta}(x|\theta)}{\int_0^1 f_\Theta(\theta')\, f_{X|\Theta}(x|\theta')\, d\theta'}$
so $f_{\Theta|X}(\theta|x) = 0$ if $\theta < x$ or $\theta > 1$, and for $x \le \theta \le 1$:
$f_{\Theta|X}(\theta|x) = \dfrac{1/\theta}{\int_x^1 (1/\theta')\, d\theta'} = \dfrac{1}{\theta \ln(1/x)}$
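A quick numerical sanity check of this posterior, which should integrate to one over its support $[x, 1]$; the observed value $x = 0.3$ is an assumption for the example.

```python
import numpy as np
from scipy.integrate import quad

# Check that the posterior 1/(theta * ln(1/x)) integrates to 1 over [x, 1].
x = 0.3                                          # assumed observation
posterior = lambda theta: 1.0 / (theta * np.log(1.0 / x))
total, _ = quad(posterior, x, 1.0)
print(total)                                     # ~1.0
```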
8 Example ii): Romeo & Juliet II
- Observe the first $n$ dates.
- Juliet is late by $X_1, X_2, \dots, X_n \sim U(0, \theta)$ (independent) given $\Theta = \theta$.
- Let $X = [X_1, X_2, \dots, X_n]$, $x = [x_1, x_2, \dots, x_n]$, and write $\bar{x} = \max\{x_1, \dots, x_n\}$.
- Conditional PDF:
$f_{X|\Theta}(x|\theta) = f_{X_1|\Theta}(x_1|\theta) \cdots f_{X_n|\Theta}(x_n|\theta) = 1/\theta^n$ if $\bar{x} \le \theta$, and $0$ otherwise.
- Posterior PDF:
$f_{\Theta|X}(\theta|x) = \dfrac{f_\Theta(\theta)\, f_{X|\Theta}(x|\theta)}{\int_0^1 f_\Theta(\theta')\, f_{X|\Theta}(x|\theta')\, d\theta'} = \dfrac{1/\theta^n}{\int_{\bar{x}}^1 \big(1/(\theta')^n\big)\, d\theta'}$ for $\bar{x} \le \theta \le 1$, and $0$ otherwise.
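The normalizing integral, a step the slide leaves unevaluated, has a closed form for $n \ge 2$:

```latex
\int_{\bar{x}}^{1} \frac{d\theta'}{(\theta')^{n}} = \frac{\bar{x}^{\,1-n} - 1}{n-1},
\qquad\text{so}\qquad
f_{\Theta \mid X}(\theta \mid x) = \frac{(n-1)\,\theta^{-n}}{\bar{x}^{\,1-n} - 1},
\quad \bar{x} \le \theta \le 1 .
```

For $n = 1$ the integral is instead $\ln(1/\bar{x})$, recovering the posterior of the previous slide.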
9 Example iii): Beta priors on the bias of a coin
- Biased coin with probability of heads $\theta$.
- $\theta$ unknown, modeled as an RV $\Theta$ with known prior PDF.
- Consider $n$ independent tosses; $X$ = # of heads observed.
- Posterior PDF, with $1/c = \int_0^1 f_\Theta(\theta')\, p_{X|\Theta}(k|\theta')\, d\theta'$:
$f_{\Theta|X}(\theta|k) = \dfrac{f_\Theta(\theta)\, p_{X|\Theta}(k|\theta)}{\int_0^1 f_\Theta(\theta')\, p_{X|\Theta}(k|\theta')\, d\theta'} = c\, f_\Theta(\theta)\, p_{X|\Theta}(k|\theta) = c \binom{n}{k} f_\Theta(\theta)\, \theta^k (1-\theta)^{n-k}$
- Suppose the prior is Beta$(\alpha, \beta)$ with $\alpha > 0$, $\beta > 0$:
$f_\Theta(\theta) = \dfrac{1}{B(\alpha, \beta)}\, \theta^{\alpha-1} (1-\theta)^{\beta-1}$ for $0 < \theta < 1$, and $0$ otherwise,
where $B(\alpha, \beta) = \int_0^1 \theta^{\alpha-1} (1-\theta)^{\beta-1}\, d\theta = \dfrac{(\alpha-1)!\, (\beta-1)!}{(\alpha+\beta-1)!}$ (the factorial form holds for integer $\alpha, \beta$).
Then, for a normalizing constant $d$,
$f_{\Theta|X}(\theta|k) = \dfrac{d}{B(\alpha, \beta)}\, \theta^{k+\alpha-1} (1-\theta)^{n-k+\beta-1}$, $0 \le \theta \le 1$,
i.e., the posterior is again a Beta distribution, Beta$(k+\alpha,\ n-k+\beta)$.
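This conjugacy makes the posterior update a one-liner. A sketch with assumed numbers: a Beta(2, 2) prior and 7 heads in 10 tosses.

```python
from scipy.stats import beta

# Conjugate Beta update: prior Beta(a, b) + k heads in n tosses -> Beta(a+k, b+n-k).
a, b = 2.0, 2.0            # assumed prior hyperparameters
n, k = 10, 7               # assumed data: 7 heads out of 10 tosses
posterior = beta(a + k, b + (n - k))

print(posterior.mean())            # posterior mean (a+k)/(a+b+n) = 9/14
print(posterior.interval(0.95))    # central 95% posterior interval for theta
```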
10 Example iv): Spam filtering
- An e-mail message is spam or legitimate.
- $\Theta = 1$ if spam, $\Theta = 2$ if legitimate, with priors $p_\Theta(1)$, $p_\Theta(2)$.
- $\{w_1, \dots, w_n\}$: a collection of special words whose appearance suggests spam.
- For each $i$, $X_i$ is a Bernoulli RV modeling the appearance of $w_i$: $X_i = 1$ if $w_i$ appears, $0$ otherwise.
- The conditional probabilities $p_{X_i|\Theta}(x_i|1)$ and $p_{X_i|\Theta}(x_i|2)$ are known.
- $X_1, \dots, X_n$ are independent given $\Theta$.
- Posterior probability:
$P(\Theta = m \mid X_i = x_i,\ i = 1, \dots, n) = \dfrac{p_\Theta(m) \prod_{i=1}^n p_{X_i|\Theta}(x_i|m)}{\sum_{j=1}^2 p_\Theta(j) \prod_{i=1}^n p_{X_i|\Theta}(x_i|j)}$
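A small numerical sketch of this posterior; the priors and per-word probabilities below are invented for illustration.

```python
import numpy as np

# Spam posterior under conditional independence (all probabilities assumed).
prior = np.array([0.4, 0.6])            # [p_Theta(1)=spam, p_Theta(2)=legit]
p_word = np.array([[0.8, 0.1],          # P(X_i = 1 | Theta); rows = words w_1..w_3,
                   [0.6, 0.2],          # columns = [spam, legit]
                   [0.3, 0.3]])
x = np.array([1, 0, 1])                 # observed appearances of the words

# Per-word likelihood p^x * (1-p)^(1-x), multiplied across words per hypothesis.
like = np.where(x[:, None] == 1, p_word, 1.0 - p_word).prod(axis=0)
posterior = prior * like / (prior * like).sum()
print(posterior)                        # [P(spam | x), P(legit | x)]
```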
21 Lecture 15: MAP and LMS Estimation
Hyang-Won Lee, Dept. of Internet & Multimedia Eng., Konkuk University
22 Maximum a Posteriori Probability (MAP) Rule
MAP rule:
i) Observation $x$ given.
ii) $p_\Theta(\theta)$, $p_{X|\Theta}(x|\theta)$ given.
iii) Want to estimate $\Theta$.
MAP rule:
$\hat\theta = \arg\max_\theta p_{\Theta|X}(\theta|x)$ ($\Theta$ discrete)
$\hat\theta = \arg\max_\theta f_{\Theta|X}(\theta|x)$ ($\Theta$ continuous)
Visualization on the board.
* For discrete $\Theta$, the MAP rule minimizes the probability of an incorrect decision.
Notes on the computation of $\hat\theta$:
i) From Bayes' rule, the denominator is the same for all values of $\theta$; e.g., in
$p_{\Theta|X}(\theta|x) = \dfrac{p_\Theta(\theta)\, p_{X|\Theta}(x|\theta)}{\sum_{\theta'} p_\Theta(\theta')\, p_{X|\Theta}(x|\theta')}$
the numerator is a function of $\theta$ while the denominator is constant w.r.t. $\theta$.
23 MAP Rule (contd.)
ii) Only need to maximize the numerator:
$\hat\theta = \arg\max_\theta \begin{cases} p_\Theta(\theta)\, p_{X|\Theta}(x|\theta), & \Theta, X \text{ both discrete} \\ p_\Theta(\theta)\, f_{X|\Theta}(x|\theta), & \Theta \text{ discrete, } X \text{ continuous} \\ f_\Theta(\theta)\, p_{X|\Theta}(x|\theta), & \Theta \text{ continuous, } X \text{ discrete} \\ f_\Theta(\theta)\, f_{X|\Theta}(x|\theta), & \Theta, X \text{ both continuous} \end{cases}$
Example (spam filtering):
i) $\Theta = 1$ (spam), $\Theta = 2$ (legit), with priors $p_\Theta(1)$, $p_\Theta(2)$.
ii) $X_i$: Bernoulli; $X_i = 1$ if $w_i$ appears in the message, $0$ otherwise.
iii) Posterior probability:
$P(\Theta = \theta \mid X_1 = x_1, \dots, X_n = x_n) = \dfrac{p_\Theta(\theta) \prod_{i=1}^n p_{X_i|\Theta}(x_i|\theta)}{\sum_{\theta'=1}^2 p_\Theta(\theta') \prod_{i=1}^n p_{X_i|\Theta}(x_i|\theta')}$, $\theta = 1, 2$
24 Spam Filtering Example (contd.)
iv) MAP:
$\hat\theta = \arg\max_\theta P(\Theta = \theta \mid X_i = x_i,\ i = 1, \dots, n) = \arg\max_\theta p_\Theta(\theta) \prod_{i=1}^n p_{X_i|\Theta}(x_i|\theta)$
Decide $\hat\theta = 1$ (spam) if $p_\Theta(1) \prod_{i=1}^n p_{X_i|\Theta}(x_i|1) > p_\Theta(2) \prod_{i=1}^n p_{X_i|\Theta}(x_i|2)$.
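Continuing the numeric spam sketch from earlier (same assumed numbers), the MAP decision needs only the two unnormalized scores, since the denominator is common to both hypotheses:

```python
import numpy as np

# MAP decision for the spam sketch: compare unnormalized scores, skip the denominator.
prior = np.array([0.4, 0.6])                 # assumed priors [spam, legit]
p_word = np.array([[0.8, 0.1],
                   [0.6, 0.2],
                   [0.3, 0.3]])
x = np.array([1, 0, 1])

score = prior * np.where(x[:, None] == 1, p_word, 1.0 - p_word).prod(axis=0)
theta_hat = 1 + int(np.argmax(score))        # 1 = spam, 2 = legit
print(theta_hat)
```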
25 Example: Romeo & Juliet I
i) Juliet is late on the first date by a random amount $X \sim U(0, \Theta)$.
ii) $\Theta \sim U(0, 1)$ unknown.
iii) Posterior PDF (for $x \in [0, 1]$):
$f_{\Theta|X}(\theta|x) = \dfrac{1}{\theta \ln(1/x)}$ if $x \le \theta \le 1$, and $0$ otherwise.
MAP: $\hat\theta = x$, since the posterior is decreasing in $\theta$ over $[x, 1]$.
Pictorial description on the board.
26 Probability of (In)Correct Decision
Hypothesis testing:
i) The unknown parameter takes one of a finite # of values, each corresponding to a competing hypothesis.
ii) In the language of Bayesian inference: $\Theta \in \{\theta_1, \dots, \theta_m\}$ ($m = 2$: binary hypothesis testing); $\theta_i$: hypothesis $H_i$.
Computing the probability of a correct decision:
i) Given observation $X = x$.
ii) MAP rule: $g_{\text{MAP}}(x)$ = hypothesis selected by MAP given $X = x$.
iii) Probability of a correct decision: $P(\Theta = g_{\text{MAP}}(x) \mid X = x)$, and overall
$P(\Theta = g_{\text{MAP}}(X)) = \sum_i P(\Theta = \theta_i,\ X \in S_i)$, where $S_i = \{x : g_{\text{MAP}}(x) = \theta_i\}$.
iv) Probability of error: $\sum_i P(\Theta \ne \theta_i,\ X \in S_i)$.
27 Example: Two Biased Coins
i) Coin 1: prob. of heads $= p_1$; coin 2: prob. of heads $= p_2$.
ii) Choose a coin at random with equal probability.
iii) Want to infer its identity based on the outcome of a single toss.
iv) $\Theta = 1$: coin 1, $\Theta = 2$: coin 2; $X = 1$: heads, $X = 0$: tails.
v) MAP rule: decide coin 1 if $p_\Theta(1)\, p_{X|\Theta}(x|1) > p_\Theta(2)\, p_{X|\Theta}(x|2)$. With equal priors, for $x =$ tails this reduces to deciding coin 1 iff $1 - p_1 > 1 - p_2$, i.e., iff $p_1 < p_2$.
What is the probability of an incorrect decision?
28 Example: Two Biased Coins (contd.)
vi) $n$ coin tosses, $X$ = # of heads:
$p_\Theta(1)\, p_{X|\Theta}(k|1) = \frac{1}{2} \binom{n}{k} p_1^k (1-p_1)^{n-k}$
$p_\Theta(2)\, p_{X|\Theta}(k|2) = \frac{1}{2} \binom{n}{k} p_2^k (1-p_2)^{n-k}$
Decide coin 1 if $p_1^k (1-p_1)^{n-k} > p_2^k (1-p_2)^{n-k}$; otherwise coin 2.
Pictorial description on the board.
vii) Probability of error (with $k^*$ the decision threshold and $p_1 < p_2$, so coin 1 is chosen when $X \le k^*$):
$P(\text{error}) = P(\Theta = 1,\ X > k^*) + P(\Theta = 2,\ X \le k^*) = p_\Theta(1) \sum_{k=k^*+1}^{n} \binom{n}{k} p_1^k (1-p_1)^{n-k} + p_\Theta(2) \sum_{k=0}^{k^*} \binom{n}{k} p_2^k (1-p_2)^{n-k}$
Pictorial description on the board.
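A sketch that evaluates this error probability without deriving $k^*$ explicitly, by checking directly which hypothesis each count $k$ favors; $p_1$, $p_2$, and $n$ are assumed values.

```python
import numpy as np
from scipy.stats import binom

# Error probability of the MAP rule for n tosses (p1, p2, n assumed; equal priors).
p1, p2, n = 0.4, 0.6, 25
k = np.arange(n + 1)
pmf1, pmf2 = binom.pmf(k, n, p1), binom.pmf(k, n, p2)

decide_1 = pmf1 > pmf2                      # counts k for which MAP picks coin 1
p_error = 0.5 * pmf1[~decide_1].sum() + 0.5 * pmf2[decide_1].sum()
print(p_error)
```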
29 Lecture 20: Classical Statistical Inference
Hyang-Won Lee, Dept. of Internet & Multimedia Eng., Konkuk University
30 Classical Statistical Inference
Setup:
- $X$: random observation with PMF $p_X(x;\theta)$ or PDF $f_X(x;\theta)$.
- $\theta$: unknown constant; the dependence on $\theta$ means there is one probability model for each value of $\theta$.
Notation: $E_\theta[h(X)]$, $P_\theta(A)$ (the subscript marks the dependence on $\theta$).
Inference methods:
i) Maximum likelihood (ML) estimation
ii) Linear regression
iii) Likelihood ratio test
iv) Significance testing
31 Maximum Likelihood Estimation
- $X = (X_1, \dots, X_n)$: vector of observations, with joint PMF $p_X(x;\theta)$ (or joint PDF $f_X(x;\theta)$).
- $\hat\theta$: estimate of $\theta$.
ML:
$\hat\theta = \arg\max_\theta p_X(x_1, \dots, x_n; \theta)$ ($X$ discrete)
$\hat\theta = \arg\max_\theta f_X(x_1, \dots, x_n; \theta)$ ($X$ continuous)
- Likelihood function: $p_X(x;\theta)$ or $f_X(x;\theta)$.
- Log-likelihood function: if the $X_i$'s are independent, $p_X(x_1, \dots, x_n; \theta) = \prod_{i=1}^n p_{X_i}(x_i; \theta)$, so $\log p_X(x;\theta) = \sum_{i=1}^n \log p_{X_i}(x_i; \theta)$.
Interpretation of $p_X(x;\theta)$:
- Incorrect: the probability that the unknown parameter equals $\theta$.
- Correct: the probability of $X = x$ when the unknown parameter is $\theta$.
- ML: for what value of $\theta$ are the observations $X = x$ most likely to arise?
32 Examples
ML vs. MAP:
- MAP: $\arg\max_\theta p_\Theta(\theta)\, p_{X|\Theta}(x|\theta)$. With a flat prior $p_\Theta$ and $p_{X|\Theta}(x|\theta) = p_X(x;\theta)$, this reduces to ML: $\arg\max_\theta p_X(x;\theta)$.
Example 1:
i) Juliet is always late by $X \sim U(0, \theta)$.
ii) $\theta$: unknown constant.
iii) ML: $\hat\theta$? $f_X(x;\theta) = 1/\theta$ for $0 \le x \le \theta$, and $0$ otherwise, so $\hat\theta = x$ (compare with MAP).
Example 2:
i) Biased coin (prob. of heads $\theta$ unknown).
ii) $X_1, \dots, X_n$: $n$ independent coin tosses ($X_i = 1$ if heads, $0$ if tails).
iii) $p_X(x;\theta) = \prod_{i=1}^n \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{\sum_i x_i} (1-\theta)^{n - \sum_i x_i}$.
When $\sum_i x_i = k$ (i.e., $k$ heads out of $n$ tosses): $\hat\theta = \arg\max_\theta\, \theta^k (1-\theta)^{n-k} = k/n$.
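A sketch confirming Example 2 numerically: maximizing the log-likelihood $k \log\theta + (n-k)\log(1-\theta)$ recovers $\hat\theta = k/n$. The values of $n$ and $k$ are assumed.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Coin MLE: maximize the log-likelihood numerically and compare with k/n.
n, k = 20, 13                                    # assumed tosses and head count
neg_loglik = lambda t: -(k * np.log(t) + (n - k) * np.log(1.0 - t))

res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, k / n)                              # both ~0.65
```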
33 Estimation of Mean and Variance
Estimation of the mean and variance of an RV:
i) Observations $X_1, \dots, X_n$ are i.i.d. with an unknown common mean $\theta$ and known variance $v$.
ii) Sample mean: $M_n = \dfrac{\sum_i X_i}{n}$. $E_\theta[M_n] = \theta$, so $M_n$ is unbiased; $M_n \xrightarrow{\text{i.p.}} \theta$ (weak law of large numbers), so $M_n$ is consistent.
iii) Sample variance: $\bar{S}_n^2 = \frac{1}{n} \sum_{i=1}^n (X_i - M_n)^2$, with $E_{(\theta,v)}[\bar{S}_n^2] = \frac{n-1}{n}\, v$ (asymptotically unbiased); $\hat{S}_n^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - M_n)^2$ is unbiased, $E_{(\theta,v)}[\hat{S}_n^2] = v$.
Confidence interval: $P_\theta(\hat\Theta_n^- \le \theta \le \hat\Theta_n^+) \ge 1 - \alpha$; $[\hat\Theta_n^-, \hat\Theta_n^+]$ is a $1-\alpha$ confidence interval, with $\hat\Theta_n^-$ the lower estimator and $\hat\Theta_n^+$ the upper estimator (*compare with a point estimator).
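A simulation sketch of the bias claims in iii); the distribution, $n$, and true variance $v$ are assumed for the demo.

```python
import numpy as np

# Compare the 1/n and 1/(n-1) variance estimators over many repetitions.
rng = np.random.default_rng(1)
n, v = 5, 4.0
samples = rng.normal(0.0, np.sqrt(v), size=(200_000, n))

m = samples.mean(axis=1, keepdims=True)
ss = ((samples - m) ** 2).sum(axis=1)
print((ss / n).mean(), (n - 1) / n * v)   # biased estimator: mean ~ (n-1)/n * v = 3.2
print((ss / (n - 1)).mean(), v)           # unbiased estimator: mean ~ v = 4.0
```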
34 Example
i) Observations $X_1, \dots, X_n$ are i.i.d. normal, with unknown mean $\theta$ and known variance $v$.
ii) Sample mean estimator: $\hat\Theta_n = \dfrac{X_1 + \dots + X_n}{n}$, which is normal with mean $\theta$ and variance $v/n$.
iii) $\alpha = 0.05$: $\dfrac{\hat\Theta_n - \theta}{\sqrt{v/n}}$ is standard normal, and $P_\theta\!\left(\dfrac{|\hat\Theta_n - \theta|}{\sqrt{v/n}} \le 1.96\right) = 0.95$, so
$P_\theta\!\left(\hat\Theta_n - 1.96 \sqrt{v/n} \le \theta \le \hat\Theta_n + 1.96 \sqrt{v/n}\right) = 0.95$
and $\left[\hat\Theta_n - 1.96 \sqrt{v/n},\ \hat\Theta_n + 1.96 \sqrt{v/n}\right]$ is a 0.95 C.I.
Interpretation of a confidence interval:
- Incorrect: $\theta$ is in the CI w.p. at least $1 - \alpha$.
- Correct: if we construct such a confidence interval many times, about $1 - \alpha$ of them are expected to contain $\theta$.
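A sketch building this 0.95 interval from simulated data; the true mean, variance, and $n$ are assumed for the demo.

```python
import numpy as np

# 95% CI for a normal mean with known variance v (true values assumed for the demo).
rng = np.random.default_rng(2)
theta_true, v, n = 10.0, 4.0, 50
x = rng.normal(theta_true, np.sqrt(v), size=n)

m = x.mean()                        # sample mean estimator
half = 1.96 * np.sqrt(v / n)        # half-width from the standard normal quantile
print(m - half, m + half)           # the 0.95 confidence interval
```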