Introduction to Bayesian Statistics


Bayesian Parameter Estimation
Harvey Thornburg
Center for Computer Research in Music and Acoustics (CCRMA)
Department of Music, Stanford University
Stanford, California 94305
February 19, 2006

Statistical approaches to parameter estimation and hypothesis testing that use prior distributions over parameters are known as Bayesian methods. The following notes briefly summarize some important facts.

Outline
- Bayesian parameter estimation
- Bayesian hypothesis testing
- Bayesian sequential hypothesis testing

Bayesian Parameter Estimation

Let y be distributed according to a parametric family: y ~ f_θ(y). The goal is, given iid observations {y_i}, to estimate θ. For instance, let {y_i} be a series of coin flips, where y_i = 1 denotes heads and y_i = 0 denotes tails. The coin is weighted, so P(y_i = 1) may differ from 1/2. Define θ = P(y_i = 1); our goal is to estimate θ. This simple distribution is called the Bernoulli distribution.

Without prior information, we use the maximum likelihood approach. Let the observations be y_1, ..., y_{H+T}, where H is the number of heads observed and T the number of tails:

    θ̂ = argmax_θ f_θ(y_{1:H+T}) = argmax_θ θ^H (1 − θ)^T = H / (H + T)

Not surprisingly, the probability of heads is estimated as the empirical frequency of heads in the data sample.

Suppose we remember that yesterday, using the same coin, we recorded 10 heads and 20 tails. This is one way to encode prior information about θ: we simply include the past trials in our estimate:

    θ̂ = (10 + H) / (10 + H + 20 + T)

As H + T goes to infinity, the effect of the past trials washes out.

Now suppose that, due to a computer crash, we lost the details of yesterday's experiment, and (for lack of sleep) our memory has also failed: we have forgotten even the numbers of heads and tails, which are the sufficient statistics for the Bernoulli distribution. We still believe the probability of heads is about 1/3, but this belief is itself somewhat uncertain, since we only performed 30 trials. In short, we claim to have a prior distribution over the probability θ, which represents our prior belief. Suppose this distribution is P(θ) = Beta(10, 20):

    g(θ) = θ^9 (1 − θ)^19 / ∫ θ^9 (1 − θ)^19 dθ

[Figure: prior density on θ, Beta(10, 20)]
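As a small numerical illustration of the two estimates above, the sketch below simulates a weighted coin and computes both the maximum likelihood estimate and the pseudo-count-adjusted estimate; the true bias of 0.3 and the sample size are invented for the example.

```python
import random

random.seed(0)

# Simulate flips of a weighted coin; bias 0.3 and n = 1000 are invented
# values for this illustration, not from the notes.
flips = [random.random() < 0.3 for _ in range(1000)]
H = sum(flips)
T = len(flips) - H

# Maximum likelihood estimate: the empirical frequency of heads.
theta_ml = H / (H + T)

# Fold in yesterday's 10 heads and 20 tails as extra ("pseudo") counts.
theta_prior = (10 + H) / (10 + H + 20 + T)

print(theta_ml, theta_prior)
```

With a large sample, the two estimates nearly coincide: the 30 pseudo-counts wash out, exactly as the notes describe.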

Now we observe a new sequence of tosses, y_{1:H+T}. We may calculate the posterior distribution P(θ | y_{1:H+T}) according to Bayes' rule:

    P(θ | y) = P(y | θ) P(θ) / P(y) = P(y | θ) P(θ) / ∫ P(y | θ) P(θ) dθ

The term P(y | θ) is, as before, the likelihood function of θ. The marginal P(y) comes from integrating out θ:

    P(y) = ∫ P(y | θ) P(θ) dθ

To continue our example, suppose that in the new data y_{1:H+T} we observe a sequence of 50 heads and 50 tails. The likelihood becomes:

    P(y | θ) = θ^50 (1 − θ)^50

Plugging this likelihood and the prior into the Bayes' rule expression, and doing the math, yields the posterior distribution Beta(10 + 50, 20 + 50) = Beta(60, 70):

    P(θ | y) = θ^59 (1 − θ)^69 / ∫ θ^59 (1 − θ)^69 dθ

[Figure: posterior density on θ, Beta(60, 70)]

Note that the posterior and prior distributions have the same form. We call such a prior a conjugate prior; the Beta distribution is conjugate to the binomial distribution, which gives the likelihood of iid Bernoulli trials. As we will see, a conjugate prior perfectly captures the results of past experiments, or it allows us to express prior belief in terms of invented data. More importantly, conjugacy allows efficient sequential updating of the posterior distribution, where the posterior at one stage is used as the prior for the next.
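Conjugacy reduces the posterior update to adding counts. A minimal sketch, using the numbers from the running example:

```python
def beta_update(a, b, heads, tails):
    """Posterior Beta parameters: a Beta(a, b) prior combined with `heads`
    successes and `tails` failures yields Beta(a + heads, b + tails)."""
    return a + heads, b + tails

# Running example: Beta(10, 20) prior, then 50 heads and 50 tails.
a1, b1 = beta_update(10, 20, heads=50, tails=50)
print(a1, b1)   # 60 70

# Sequential updating: feeding the data in two batches gives the same
# answer, since the posterior from one stage is the prior for the next.
a2, b2 = beta_update(*beta_update(10, 20, 30, 25), 20, 25)
print(a2, b2)   # 60 70
```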

Key Point: The output of the Bayesian analysis is not a single estimate of θ, but rather the entire posterior distribution, which summarizes all our information about θ. As we get more data, if the samples are truly iid, the posterior distribution becomes more sharply peaked about a single value.

Of course, we can use this distribution to make inferences about θ. Suppose an oracle could tell us the true value of θ used to generate the samples, and we want the guess that minimizes the mean squared error between our guess and that true value (mean squared error being the criterion by which estimators, including the maximum likelihood estimate, are commonly judged). We would choose the mean of the posterior distribution, because the conditional mean minimizes mean squared error. With prior Beta(H_0, T_0):

    θ̂ = E(θ | y_{1:N}) = (H_0 + H) / (H_0 + H + T_0 + T)

In the same way, we can do prediction. What is P(y_{N+1} = 1 | y_{1:N})?

    P(y_{N+1} = 1 | y_{1:N}) = ∫ P(y_{N+1} = 1 | θ, y_{1:N}) P(θ | y_{1:N}) dθ
                             = ∫ P(y_{N+1} = 1 | θ) P(θ | y_{1:N}) dθ
                             = ∫ θ P(θ | y_{1:N}) dθ
                             = E(θ | y_{1:N})
                             = (H_0 + H) / (H_0 + H + T_0 + T)
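The posterior mean and the predictive probability above are one formula. A sketch with the running example's numbers:

```python
def posterior_mean(H0, T0, H, T):
    """Posterior mean of theta under a Beta(H0, T0) prior; this is the
    minimum mean squared error estimate."""
    return (H0 + H) / (H0 + H + T0 + T)

def predict_heads(H0, T0, H, T):
    """P(y_{N+1} = 1 | y_{1:N}): the predictive probability of heads
    equals the posterior mean of theta."""
    return posterior_mean(H0, T0, H, T)

# Running example: Beta(10, 20) prior, then 50 heads and 50 tails.
print(posterior_mean(10, 20, 50, 50))   # 60/130 ≈ 0.4615
print(predict_heads(10, 20, 50, 50))
```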

Bayesian Hypothesis Testing

Suppose we have a fixed iid data sample y_{1:N} ~ f_θ(y), and two choices: θ = θ_0 or θ = θ_1. That is, the data y_{1:N} are generated by either θ_0 or θ_1. Call θ_0 the null hypothesis and θ_1 the alternative. The alternative hypothesis indicates that a disturbance is present; if we decide θ = θ_1, we signal an alarm for the disturbance. We process the data with a decision function:

    D(y_{1:N}) = 0  →  decide θ = θ_0
    D(y_{1:N}) = 1  →  decide θ = θ_1

There are two possible errors:

    False alarm: θ = θ_0, but D(y_{1:N}) = 1
    Miss: θ = θ_1, but D(y_{1:N}) = 0

In the non-Bayesian setting, we wish to choose a family of decision functions D(·) that navigates the optimal tradeoff between the probability of miss, P_M = P(D(y_{1:N}) = 0 | θ = θ_1), and the probability of false alarm, P_FA = P(D(y_{1:N}) = 1 | θ = θ_0).

We optimize the tradeoff by comparing the likelihood ratio to a nonnegative threshold, say exp(T) > 0:

    D*(y_{1:N}) = 1  iff  f_θ1(y) / f_θ0(y) > exp(T)

Equivalently, compare the log likelihood ratio to an arbitrary real threshold T:

    D*(y_{1:N}) = 1  iff  log [ f_θ1(y) / f_θ0(y) ] > T

Increasing T makes the test less sensitive to the disturbance: we accept a higher probability of miss in return for a lower probability of false alarm. Because of the tradeoff, there is a limit to how well we can do, and this limit improves exponentially as we collect more data. The limit relation is given by Stein's lemma. Fix P_M ≤ ε. Then, as ε → 0 and for large N:

    −(1/N) log P_FA → K(f_θ0, f_θ1)

The quantity K(f_θ0, f_θ1) is the Kullback-Leibler distance, the expected value of the log likelihood ratio under the null. For densities f and g we define:

    K(f, g) = E_f [ log (f / g) ]

The following facts about the Kullback-Leibler distance hold:

- K(f, g) ≥ 0, with equality iff f = g except on a set of measure zero under (f + g)/2. That is, for a continuous sample space the densities may differ on sets of Lebesgue measure zero; for a discrete space no difference is allowed.
- K(f, g) ≠ K(g, f) in general, so the K-L distance is not a metric. The triangle inequality also fails.

When f and g belong to the same parametric family, we adopt the shorthand K(θ_0, θ_1) rather than K(f_θ0, f_θ1). Then we have an additional fact: when the hypotheses are close, the K-L distance behaves approximately like the square of a Euclidean metric in parameter (θ) space. Specifically:

    2 K(θ_0, θ_1) ≈ (θ_1 − θ_0)ᵀ J(θ_0) (θ_1 − θ_0)

where J(θ_0) is the Fisher information. The right-hand side is sometimes called the square of the Mahalanobis distance. Furthermore, if the hypotheses are close enough that J(θ_0) ≈ J(θ_1), the K-L information also appears symmetric.

Practically, there remains the problem of choosing T, or of choosing desirable probabilities of miss and false alarm that obey Stein's lemma, which also determines the data size. We can solve for T given the error probabilities. However, it is often unnatural to specify these probabilities directly; instead, we are concerned about other, observable effects on the system. Hence the usual scenario costs a lot of sleep: we continually vary T, run simulations, and then observe some distant outcome. Fortunately, the Bayesian approach comes to the rescue.
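As an aside, the K-L facts above are easy to check numerically for the Bernoulli family, where the Fisher information is J(θ) = 1/(θ(1 − θ)); the two hypothesis values below are invented for the illustration.

```python
from math import log

def kl_bernoulli(p, q):
    """K(p, q) = E_p[log f_p / f_q] for two Bernoulli distributions."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

t0, t1 = 0.30, 0.32   # two nearby hypotheses (invented for the example)

k01 = kl_bernoulli(t0, t1)
k10 = kl_bernoulli(t1, t0)
print(k01, k10)       # both nonnegative, and not equal: K is not symmetric

# Local quadratic approximation 2 K(t0, t1) ≈ (t1 - t0)^2 J(t0), with
# Bernoulli Fisher information J(t0) = 1 / (t0 (1 - t0)).
approx = 0.5 * (t1 - t0) ** 2 / (t0 * (1 - t0))
print(approx)         # close to k01 when the hypotheses are near each other
```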

Instead of optimizing a probability tradeoff, we assign costs: C_M > 0 to a miss event and C_FA > 0 to a false alarm event. Additionally, we have a prior distribution on θ:

    P(θ = θ_1) = π_1

Let D(y_{1:N}) be the decision function as before. The Bayes risk, or expected cost, is:

    R(D) = π_1 C_M P(D(y_{1:N}) = 0 | θ = θ_1) + (1 − π_1) C_FA P(D(y_{1:N}) = 1 | θ = θ_0)

It follows that the decision minimizing the Bayes risk also compares the likelihood ratio to a threshold:

    D(y_{1:N}) = 1  iff  P(y | θ_1) / P(y | θ_0) > C_FA P(θ_0) / [C_M P(θ_1)]
                 = 1  iff  f_θ1(y) / f_θ0(y) > C_FA (1 − π_1) / (C_M π_1)

We see the threshold is available in closed form, as a function of the costs and the prior.
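The closed-form threshold is a one-liner. A sketch with invented cost and prior values:

```python
def bayes_threshold(c_fa, c_m, pi1):
    """Likelihood ratio threshold of the minimum-Bayes-risk test:
    decide theta_1 when f1(y) / f0(y) > C_FA (1 - pi1) / (C_M pi1)."""
    return c_fa * (1 - pi1) / (c_m * pi1)

# Equal costs and an even prior give the familiar threshold of 1.
print(bayes_threshold(c_fa=1.0, c_m=1.0, pi1=0.5))   # 1.0

# A rare disturbance (pi1 = 0.1) raises the threshold: stronger evidence
# is required before declaring theta_1.
print(bayes_threshold(c_fa=1.0, c_m=1.0, pi1=0.1))   # about 9
```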

Bayesian Sequential Analysis

In sequential analysis we do not have a fixed number of observations. Instead, observations arrive in sequence, and we would like to decide in favor of θ_0 or θ_1 as soon as possible. For each n we perform a test D(y_{1:n}) with three possible outcomes:

    Decide θ = θ_1
    Decide θ = θ_0
    Keep testing

Non-Bayesian case. Let T be the stopping time of this test. We wish to find an optimal tradeoff among:

    P_FA, the probability that D(y_{1:T}) = 1 but θ = θ_0
    P_M, the probability that D(y_{1:T}) = 0 but θ = θ_1
    E_θ(T), for θ = θ_0 or θ_1

It turns out the optimal test again involves monitoring the likelihood ratio. This test is called the SPRT, for Sequential Probability Ratio Test. It is more insightful to examine the test in the log domain: it compares the log likelihood ratio

    S_n = log [ f_θ1(y_{1:n}) / f_θ0(y_{1:n}) ]

to thresholds a < 0 and b > 0. The first time S_n < a, we stop the test and decide θ_0; the first time S_n > b, we stop and declare θ_1; otherwise we keep on testing. There is one catch: in the analysis, we ignore overshoots over the threshold boundaries, so that S_T = a or b.

Properties of the SPRT. The change (first difference) of S_n is

    s_n = S_n − S_{n−1} = log [ f_θ1(y_n | y_{1:n−1}) / f_θ0(y_n | y_{1:n−1}) ]

For an iid process, we drop the conditioning:

    s_n = log [ f_θ1(y_n) / f_θ0(y_n) ]

The drift of S_n is defined as E(s_n | θ). From the definitions, it follows that the drift under each hypothesis is given by a K-L information.
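The SPRT itself is only a few lines. A sketch for Bernoulli hypotheses; the hypothesis values, thresholds, and random seed are invented for the illustration (thresholds of ±4.6 correspond roughly to error probabilities near e^−4.6 ≈ 0.01 when overshoot is ignored):

```python
import random
from math import log

def sprt(samples, t0, t1, a, b):
    """Sequential probability ratio test for Bernoulli hypotheses.

    Accumulates the log likelihood ratio S_n and stops the first time it
    leaves (a, b): decide theta_0 at or below a, theta_1 at or above b.
    Returns (decision, stopping time); decision is None if the data run out.
    """
    s = 0.0
    for n, y in enumerate(samples, start=1):
        # log likelihood ratio increment s_n for one Bernoulli observation
        s += log(t1 / t0) if y else log((1 - t1) / (1 - t0))
        if s <= a:
            return 0, n
        if s >= b:
            return 1, n
    return None, len(samples)

random.seed(1)
t0, t1 = 0.3, 0.7
data = [random.random() < t1 for _ in range(1000)]   # truth is theta_1
decision, stop_time = sprt(data, t0, t1, a=-4.6, b=4.6)
print(decision, stop_time)
```

With well-separated hypotheses like these, the test typically stops after a few dozen observations rather than using the whole sample.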

Under the two hypotheses, the drifts are given by the K-L informations:

    E(s_n | θ_0) = −K(θ_0, θ_1)
    E(s_n | θ_1) = K(θ_1, θ_0)

We can visualize the behavior of S_n when θ in fact undergoes a step transition from θ_0 to θ_1:

[Figure: sample path of S_n drifting down, then up across the threshold b after the change (Hinkley test)]

Again, there are practical issues in choosing the thresholds a and b. By invoking Wald's equation, or some results from martingale theory, these are easily related to the probabilities of error at the stopping time of the test. However, the problem remains of how to choose both probabilities of error, since we have a three-way tradeoff with the average run lengths E_θ0(T) and E_θ1(T)! Fortunately, the Bayesian formulation again comes to our rescue. We assign costs C_FA and C_M to false alarm and miss, and also include a cost proportional to the number of observations taken before stopping; let this cost simply equal that number, T. The goal is to minimize the expected cost, or sequential Bayes risk. What is our prior information? Again, we must know P(θ = θ_1) = π_1.

It turns out that the optimal Bayesian strategy is again an SPRT. This follows from the theory of optimal stopping. Suppose that at time n we have yet to make a decision concerning θ. We must decide among the following alternatives:

    Stop, and declare θ_0 or θ_1.
    Take one more observation.
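The drift identities above can be verified numerically for Bernoulli hypotheses; the parameter values are invented for the illustration.

```python
from math import log

def kl_bernoulli(p, q):
    """K(p, q) = E_p[log f_p / f_q] for two Bernoulli distributions."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

def drift(theta, t0, t1):
    """E(s_n | theta): expected log likelihood ratio increment when the
    observations are Bernoulli(theta)."""
    return theta * log(t1 / t0) + (1 - theta) * log((1 - t1) / (1 - t0))

t0, t1 = 0.3, 0.7   # invented hypothesis values

# Drift is -K(theta_0, theta_1) under the null, +K(theta_1, theta_0)
# under the alternative, so S_n falls before the change and rises after.
print(drift(t0, t0, t1), -kl_bernoulli(t0, t1))
print(drift(t1, t0, t1), kl_bernoulli(t1, t0))
```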

We choose to stop only when the minimum additional cost of stopping is less than the minimum expected additional cost of taking one more observation. We compute these costs using the posterior distribution of θ:

    π_1(n) = P(θ = θ_1 | y_{1:n})

which comes from recursively applying Bayes' rule, with π_1(0) = π_1:

    π_1(n + 1) = π_1(n) P(y_{n+1} | θ_1) / [ (1 − π_1(n)) P(y_{n+1} | θ_0) + π_1(n) P(y_{n+1} | θ_1) ]

If we stopped after observing y_n and declared θ = θ_0, the expected cost due to a miss would be π_1(n) C_M; declaring θ = θ_1 would likewise incur an expected false alarm cost of (1 − π_1(n)) C_FA. Therefore, if we make the decision to stop, the minimum additional cost is

    ρ_0(π_1(n)) = min { π_1(n) C_M, (1 − π_1(n)) C_FA }

and the overall minimum cost satisfies the recursion

    ρ(π_1(n)) = min { ρ_0(π_1(n)), 1 + E_{π_1(n)} [ ρ(π_1(n + 1)) ] }

In the two-hypothesis case, this implied recursion for the minimum cost can be solved, and the result is an SPRT(!). Unfortunately, one cannot obtain a closed-form expression for the thresholds in terms of the costs, but the Bayes formulation at least allows prior information about the hypotheses to be brought in. We will see a much richer extension in the problem of Bayesian change detection.
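Both recursions can be implemented directly: the posterior update drives the dynamics, and value iteration over a grid of posterior values solves the cost recursion. The sketch below uses Bernoulli likelihoods; the costs, parameters, and grid size are invented for the illustration. The region of posteriors where continuing beats stopping comes out as an interval, which is the SPRT structure claimed above.

```python
def solve_rho(t0, t1, c_m, c_fa, n_grid=201, iters=400):
    """Value-iterate the sequential Bayes risk rho(pi) on a grid of posterior
    values pi = P(theta = theta_1 | data), for Bernoulli observations."""
    grid = [i / (n_grid - 1) for i in range(n_grid)]
    rho = [min(p * c_m, (1 - p) * c_fa) for p in grid]   # initialize at rho_0

    def update(p, y):   # posterior recursion: one more observation y
        l0, l1 = (t0, t1) if y else (1 - t0, 1 - t1)
        return p * l1 / ((1 - p) * l0 + p * l1)

    def interp(v, x):   # linear interpolation of v over the grid
        i = min(int(x * (n_grid - 1)), n_grid - 2)
        f = x * (n_grid - 1) - i
        return v[i] * (1 - f) + v[i + 1] * f

    for _ in range(iters):
        new = []
        for p in grid:
            stop = min(p * c_m, (1 - p) * c_fa)           # rho_0(p)
            py1 = (1 - p) * t0 + p * t1                    # predictive P(y = 1)
            cont = 1 + py1 * interp(rho, update(p, 1)) + \
                   (1 - py1) * interp(rho, update(p, 0))   # 1 + E[rho(next)]
            new.append(min(stop, cont))
        rho = new
    return grid, rho

grid, rho = solve_rho(t0=0.3, t1=0.7, c_m=50.0, c_fa=50.0)
print(rho[100])   # cost at pi = 0.5: continuing beats stopping (cost 25)
```

At the endpoints π = 0 and π = 1 the risk is zero (stopping is free and error-proof), while in the middle the solution pays for observations until the posterior escapes the continuation interval.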