Lecture 8 (October 15): Bayes Estimators and Average Risk Optimality


STATS 300A: Theory of Statistics (Fall 2015)
Lecture 8: October 15
Lecturer: Lester Mackey        Scribes: Hongseok Namkoong, Phan Minh Nguyen

Warning: These notes may contain factual and/or typographic errors.

8.1 Bayes Estimators and Average Risk Optimality

8.1.1 Setting

We discuss the average risk optimality of estimators within the framework of Bayesian decision problems. As in the general decision problem setting, the Bayesian setup considers a model P = {P_θ : θ ∈ Ω} for our data X, a loss function L(θ, d), and a risk function R(θ, δ).

In the frequentist approach, the parameter θ was considered to be an unknown deterministic quantity. In the Bayesian paradigm, we instead place a measure Λ over the parameter space, which we call a prior. Assuming this measure defines a probability distribution, we interpret the parameter θ as an outcome of the random variable Θ ~ Λ; in this setup both X and θ are random. Conditioning on Θ = θ, we assume the data are generated by the distribution P_θ.

The optimality goal for our decision problem of estimating g(θ) is now the minimization of the average risk

    r(Λ, δ) = E[L(Θ, δ(X))] = E[ E[L(Θ, δ(X)) | X] ].

An estimator δ which minimizes this average risk is a Bayes estimator and is sometimes referred to as being Bayes. Note that the average risk is an expectation over both of the random variables Θ and X. Using the tower property, we showed last time that it suffices to find an estimator δ which minimizes the posterior risk E[L(Θ, δ(X)) | X = x] for almost every x. This estimator will generally depend on the choice of loss function, as we shall now see.

8.1.2 Examples of Bayes estimators

Example 1. Suppose the loss function is the absolute error loss L(θ, d) = |g(θ) − d|. Our task is to find a δ that minimizes the posterior risk, which in this case is E[|g(Θ) − δ(X)| | X]. The minimizer δ_Λ(X), the Bayes estimator, is therefore any median of g(Θ) under the posterior distribution of Θ given X, i.e., any quantity c such that the following two properties hold:

    P(g(Θ) ≥ c | X) ≥ 1/2    and    P(g(Θ) ≤ c | X) ≥ 1/2.
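As a minimal numerical sketch of Example 1, assuming an illustrative Beta-Binomial model (the hyperparameters a, b and the data below are assumptions made for the example), the posterior median is available in closed form and can be checked against a grid minimization of the posterior risk:

    import numpy as np
    from scipy import stats

    # Assumed Beta(a, b) prior on theta and Binomial(n, theta) likelihood.
    a, b = 2.0, 2.0      # illustrative prior hyperparameters
    n, x = 20, 6         # illustrative data: 6 successes in 20 trials

    # Conjugacy: the posterior is Beta(a + x, b + n - x).
    posterior = stats.beta(a + x, b + n - x)

    # Bayes estimate under absolute error loss = posterior median.
    print("posterior median:", posterior.median())

    # Monte Carlo check that the median minimizes E[|g(Theta) - d| | X = x].
    draws = posterior.rvs(size=100_000, random_state=0)
    grid = np.linspace(0.01, 0.99, 99)
    risks = [np.mean(np.abs(draws - d)) for d in grid]
    print("grid minimizer of posterior risk:", grid[np.argmin(risks)])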

We have seen that for the squared error loss the Bayes estimator is given by the mean of the posterior distribution. In the next example we consider a generalization of the squared error loss.

Example 2. Suppose the loss function is of the form L(θ, d) = w(θ)(d − g(θ))², where w(θ) ≥ 0 can be thought of as a weight function. The Bayes estimator minimizes the posterior risk E[w(Θ)(g(Θ) − d)² | X = x] with respect to d; we may treat d as a constant inside the conditional expectation because our estimator is a function only of X, so d is fixed once X = x is fixed. Expanding the square, the posterior risk equals

    d² E[w(Θ) | X = x] − 2d E[w(Θ)g(Θ) | X = x] + E[w(Θ)g²(Θ) | X = x].

This is nothing but a convex quadratic function of d. Taking the derivative with respect to d, we see that the minimum is attained when

    2d E[w(Θ) | X = x] − 2 E[w(Θ)g(Θ) | X = x] = 0.

Hence the Bayes estimator is given by

    δ_Λ(x) = E[w(Θ)g(Θ) | X = x] / E[w(Θ) | X = x],

the ratio of the posterior mean of w(Θ)g(Θ) to that of w(Θ). In particular, if w ≡ 1, our loss function is the usual squared error loss, and the above expression recovers the Bayes estimator as the posterior mean of g(Θ), as we had seen before.
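When the posterior can be sampled, this ratio of posterior means is easy to approximate by Monte Carlo. Below is a minimal sketch reusing the illustrative Beta-Binomial posterior from the previous sketch, with an assumed weight function w(θ) = 1/(θ(1 − θ)) and estimand g(θ) = θ:

    import numpy as np
    from scipy import stats

    # Illustrative posterior from the earlier Beta-Binomial sketch.
    posterior = stats.beta(2.0 + 6, 2.0 + 14)
    theta = posterior.rvs(size=200_000, random_state=0)

    g = theta                          # estimand g(theta) = theta
    w = 1.0 / (theta * (1.0 - theta))  # assumed nonnegative weight function

    # Bayes estimator under L(theta, d) = w(theta) * (d - g(theta))^2:
    # the ratio E[w(Theta) g(Theta) | x] / E[w(Theta) | x].
    delta_weighted = np.mean(w * g) / np.mean(w)

    # With w = 1 the formula reduces to the posterior mean (squared error loss).
    delta_squared_error = np.mean(g)
    print(delta_weighted, delta_squared_error)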

Example 3 (Binary classification). Our next example is inspired by a quintessential binary classification task: email spam filtering. Our goal is, given an incoming email, to classify it as either spam or non-spam (in the latter case we will call the higher-quality email "ham"). To model this, the parameter space is taken to be Ω = {0, 1}, where 0 corresponds to ham and 1 represents spam, and we suppose that the email X is drawn from either the ham distribution f_0 or the spam distribution f_1. Naturally, the decision space is D = {0, 1} as well: we predict either ham or spam for the incoming email. A natural loss function to consider in such a binary classification problem is the 0-1 loss function:¹

    L(θ, d) = 0 if d = θ,  and  L(θ, d) = 1 if d ≠ θ.

We tackle this problem in Bayesian fashion by defining a prior distribution with π(1) = p and π(0) = 1 − p for some fixed p ∈ [0, 1]. The hyperparameter p is the probability assigned to an email being spam before observing any data. One way to select p based on prior knowledge is to set it equal to the proportion of previously received emails that were spam.

¹ This is not the only reasonable choice of loss. We may, for instance, want to impose different penalties for misclassifying true spam emails and misclassifying true ham emails.

Given this decision problem, the Bayes estimator minimizes the average risk

    r(π, δ) = E[L(Θ, δ(X))] = P(δ(X) ≠ Θ) = 1 − P(δ(X) = Θ).

We see that minimizing the average risk is equivalent to maximizing the probability of correct classification, P(δ(X) = Θ).

As a thought experiment, consider attempting to predict the label of an incoming email before it has been viewed. Since we have no data to condition on, our only (non-randomized) options are the constant estimators δ_1 ≡ 1 and δ_0 ≡ 0. Their average risks are

    r(π, δ_1) = π(1) R(1, δ_1) + π(0) R(0, δ_1) = 1 − p

and

    r(π, δ_0) = π(1) R(1, δ_0) + π(0) R(0, δ_0) = p.

So δ_1 has the smaller average risk when p > 1/2, and δ_0 has the smaller average risk when p < 1/2.

After observing the data X = x, we estimate θ = 1 if it has the higher posterior probability P(Θ = δ(x) | X = x), which is proportional to the likelihood times the prior. This gives the following two relations:

    P(Θ = 1 | X = x) ∝ f_1(x) π(1) = f_1(x) p,
    P(Θ = 0 | X = x) ∝ f_0(x) π(0) = f_0(x) (1 − p).

Both posterior probabilities share the common normalizing constant 1/[f_1(x) p + f_0(x)(1 − p)], so it is enough to compare the numerators. The Bayes estimator therefore predicts 1 iff

    f_1(x) p / [f_0(x)(1 − p)] > 1,  i.e., iff  f_1(x)/f_0(x) > (1 − p)/p.

The left-hand side of this inequality is known as a likelihood ratio, as it is the ratio of the likelihoods of X under the two values of θ in the parameter space. The right-hand side is known as the prior odds.

Example 4. Consider the binary classification setting again with likelihoods f_j = Exp(λ_j) for j ∈ {0, 1}, where the λ_j are known rate parameters, and assume w.l.o.g. that λ_0 > λ_1. Suppose the data X_1, ..., X_n (a batch of n emails) are drawn i.i.d. from f_θ. From the calculations above it follows that we estimate 1 iff

    λ_1^n exp(−λ_1 Σ_{i=1}^n x_i) / [λ_0^n exp(−λ_0 Σ_{i=1}^n x_i)] > (1 − p)/p.

This condition is equivalent to

    (λ_0 − λ_1) Σ_{i=1}^n x_i > log((1 − p)/p) + n log(λ_0/λ_1).

In other words, we estimate 1 iff Σ_{i=1}^n x_i > h, where the threshold h depends on λ_1, λ_0, p, and n.
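The threshold rule of Example 4 translates directly into code. Below is a minimal sketch with assumed values of λ_0, λ_1, p, and n (none of these are fixed in the example):

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed parameters: ham rate lam0 > spam rate lam1, prior spam probability p.
    lam0, lam1, p, n = 2.0, 0.5, 0.3, 5

    def bayes_classify(x, lam0, lam1, p):
        """Predict 1 (spam) iff the likelihood ratio exceeds the prior odds (1 - p) / p."""
        m = len(x)
        log_lr = m * (np.log(lam1) - np.log(lam0)) + (lam0 - lam1) * np.sum(x)
        return int(log_lr > np.log((1 - p) / p))

    # Equivalent threshold form: predict 1 iff sum(x) > h.
    h = (np.log((1 - p) / p) + n * np.log(lam0 / lam1)) / (lam0 - lam1)

    x = rng.exponential(scale=1 / lam1, size=n)  # a batch drawn from the spam model
    print(bayes_classify(x, lam0, lam1, p), int(np.sum(x) > h))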

Note that in the example above, the Bayes estimator depends on the data only through the sufficient statistic Σ_i X_i. This is not surprising and is in general true of Bayes estimators. To see this, suppose a sufficient statistic T(X) exists. Then the posterior density, which is proportional to the product of the likelihood and the prior, can be written as

    π(θ | x) ∝ f(x | θ) π(θ) = h(x) g_θ(T(x)) π(θ) ∝ g_θ(T(x)) π(θ),

where the middle equality uses the Neyman-Fisher factorization criterion. The factor h(x) does not involve θ, so it cancels with the same term in the normalizing constant. Since the posterior depends on the data only through the sufficient statistic T, the same is true of the estimator minimizing the posterior risk. Keeping this fact in mind, we consider another Bayesian example.

Example 5 (Normal mean estimation). Let X_1, ..., X_n be i.i.d. N(Θ, σ²) given Θ, with σ² known. Further, let Θ ~ N(µ, b²), where µ and b² are fixed prior hyperparameters. The posterior π(θ | x) then satisfies

    π(θ | x) ∝ [∏_{i=1}^n (2πσ²)^{−1/2} exp(−(x_i − θ)²/(2σ²))] · (2πb²)^{−1/2} exp(−(θ − µ)²/(2b²))
             ∝ exp(−(1/(2σ²)) Σ_{i=1}^n (x_i − θ)² − (θ − µ)²/(2b²))
             ∝ exp((θ/σ²) Σ_{i=1}^n x_i − nθ²/(2σ²) − θ²/(2b²) + µθ/b²)
             = exp(−(1/2)(n/σ² + 1/b²) θ² + (n x̄/σ² + µ/b²) θ).

Thus the posterior density π(θ | x) is that of an exponential family with sufficient statistics θ and θ², which implies that the posterior distribution is normal. Matching to the normal density, the coefficient of θ in the last line above must be

    n x̄/σ² + µ/b² = mean/variance,

and the coefficient of θ² must be

    −(1/2)(n/σ² + 1/b²) = −1/(2 · variance).

So the posterior distribution is N(µ̂, σ̂²) with

    µ̂ = (n x̄/σ² + µ/b²) / (n/σ² + 1/b²),    σ̂² = 1/(n/σ² + 1/b²).

Under squared error loss, the Bayes estimator is the posterior mean, which can be written as

    µ̂ = [(n/σ²)/(n/σ² + 1/b²)] x̄ + [(1/b²)/(n/σ² + 1/b²)] µ,

a convex combination of the sample mean x̄ and the prior mean µ.
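A minimal numerical sketch of this conjugate update follows; the data, σ², µ, and b² below are illustrative choices rather than values from the example:

    import numpy as np

    rng = np.random.default_rng(0)

    # Assumed known variance, prior hyperparameters, and simulated data.
    sigma2, mu, b2 = 1.0, 0.0, 4.0
    theta_true, n = 1.5, 10
    x = rng.normal(theta_true, np.sqrt(sigma2), size=n)
    xbar = x.mean()

    # Posterior N(mu_hat, sigma2_hat) for the normal-normal model.
    precision = n / sigma2 + 1 / b2
    mu_hat = (n * xbar / sigma2 + mu / b2) / precision
    sigma2_hat = 1 / precision

    # The posterior mean is a convex combination of xbar and the prior mean mu.
    weight = (n / sigma2) / precision
    assert np.isclose(mu_hat, weight * xbar + (1 - weight) * mu)
    print(mu_hat, sigma2_hat, weight)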

For small values of n, the Bayes estimator gives significant weight to the prior mean, but as n → ∞ we have µ̂ − X̄ → 0 a.s. irrespective of the hyperparameters µ and b²; that is, the data overwhelm the prior. For finite n, however, the coefficient of X̄ is strictly less than 1 while E[X̄ | Θ = θ] = θ, so µ̂ is a biased estimator of θ unless µ = θ.

The fact that the unbiased estimator X̄ from the example is not the Bayes estimator is a special case of a more general result.

Theorem (TPE 4.2.3). If δ is unbiased for g(θ) with r(Λ, δ) < ∞ and E[g(Θ)²] < ∞, then δ is not Bayes under squared error loss² unless its average risk is zero, i.e., E[(δ(X) − g(Θ))²] = 0, where the expectation is taken over both X and Θ.

Proof. Let δ be an unbiased Bayes estimator under squared error loss satisfying the assumptions of the theorem. Then δ is the mean of the posterior distribution, i.e.,

    δ(X) = E[g(Θ) | X] a.s.                                              (8.1)

Thus we have

    E[δ(X) g(Θ)] = E[ E[δ(X) g(Θ) | X] ] = E[ δ(X) E[g(Θ) | X] ] (a)= E[δ²(X)],    (8.2)

where (a) follows by substituting for E[g(Θ) | X] using (8.1). We also have

    E[δ(X) g(Θ)] = E[ E[δ(X) g(Θ) | Θ] ] = E[ g(Θ) E[δ(X) | Θ] ] (b)= E[g²(Θ)],    (8.3)

where (b) follows since E[δ(X) | Θ] = g(Θ), because δ is an unbiased estimator of g(θ). The average risk under squared error then satisfies

    E[(δ(X) − g(Θ))²] = E[δ²(X)] − 2 E[δ(X) g(Θ)] + E[g²(Θ)]
                      = E[δ²(X)] − E[δ(X) g(Θ)] + E[g²(Θ)] − E[δ(X) g(Θ)]
                  (c) = E[δ²(X)] − E[δ²(X)] + E[g²(Θ)] − E[g²(Θ)]
                      = 0,

where (c) follows by substituting for E[δ(X) g(Θ)] once from (8.2) and once from (8.3). Thus E[(δ(X) − g(Θ))²] = 0, i.e., the average risk is zero, proving the claim.

The take-away message from this theorem is that (finite-risk) squared error Bayes estimators are unbiased only in the degenerate case where perfect estimation is possible, that is, when the Bayes risk is 0. In fact, we can use this result to check whether a given unbiased estimator is a squared-error Bayes estimator.

² Note that this theorem only applies to Bayes estimators under the squared error loss. Bayes estimators under other losses may be unbiased without achieving zero average risk.
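The identities (8.2) and (8.3) in the proof can be checked numerically for a concrete Bayes estimator. In the sketch below (normal-normal model of Example 5 with assumed hyperparameters, δ the posterior mean, and g(Θ) = Θ), E[δ(X)Θ] matches E[δ²(X)] but not E[Θ²], since this δ is Bayes but biased, and its Bayes risk is strictly positive:

    import numpy as np

    rng = np.random.default_rng(0)
    sigma2, mu, b2, n = 1.0, 0.0, 4.0, 10   # assumed hyperparameters
    reps = 200_000

    # Joint draws: Theta ~ N(mu, b2), then Xbar | Theta ~ N(Theta, sigma2 / n).
    theta = rng.normal(mu, np.sqrt(b2), size=reps)
    xbar = rng.normal(theta, np.sqrt(sigma2 / n))

    # Bayes estimator under squared error = posterior mean (Example 5).
    weight = (n / sigma2) / (n / sigma2 + 1 / b2)
    delta = weight * xbar + (1 - weight) * mu

    # (8.2): E[delta * Theta] = E[delta^2]; (8.3) would also force E[Theta^2]
    # only if delta were unbiased, which it is not here.
    print(np.mean(delta * theta), np.mean(delta ** 2), np.mean(theta ** 2))
    print("Bayes risk:", np.mean((delta - theta) ** 2))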

Example 6. Let X_1, ..., X_n be i.i.d. N(θ, σ²) with σ² > 0 known. Is X̄ Bayes under squared error loss for some choice of prior distribution? We know that E[X̄ | θ] = θ, i.e., X̄ is an unbiased estimator of θ. Moreover, its average risk under squared error is

    E[(X̄ − Θ)²] = σ²/n ≠ 0,

so by the theorem above, X̄ is not the Bayes estimator under any prior distribution!
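A quick simulation, using an arbitrary illustrative prior, confirms that the average risk of X̄ equals σ²/n whatever the prior, so it is bounded away from zero and X̄ cannot be Bayes:

    import numpy as np

    rng = np.random.default_rng(0)
    sigma2, n, reps = 1.0, 10, 200_000

    # Any prior works: the conditional risk of Xbar is sigma2 / n for every theta.
    theta = rng.gamma(shape=3.0, scale=2.0, size=reps)   # arbitrary illustrative prior
    xbar = rng.normal(theta, np.sqrt(sigma2 / n))

    print(np.mean((xbar - theta) ** 2), "vs sigma^2 / n =", sigma2 / n)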