
LECTURE 5 NOTES

1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter $\theta \in \Theta$ is considered a fixed quantity. In the Bayesian approach, it is considered stochastic. Its distribution over $\Theta$ before observing any data, called the prior, reflects the investigator's prior beliefs about $\theta$. After obtaining a sample $x$, the investigator updates his or her beliefs by Bayes' rule, hence the name.

Let $\pi(\theta)$ be the prior density on $\Theta$ and $\{f_\theta(x)\}$ be a set of densities on $\mathcal{X}$. The distribution of $\theta \mid x$ is called the posterior distribution and is given by Bayes' rule:
\[
\pi(\theta \mid x) = \frac{f(x \mid \theta)\pi(\theta)}{f_\pi(x)},
\]
where $f_\pi(x) := \int_\Theta f(x \mid \theta)\pi(\theta)\,d\theta$ is the marginal density of $x$. The posterior reflects the investigator's updated beliefs about $\theta$ after observing $x$. Its expected value is often used as a point estimate of $\theta$.

Example 1.1. Let $x_i \overset{\text{i.i.d.}}{\sim} \mathrm{Ber}(p)$. A Bayesian investigator assumes a $\mathrm{beta}(a, b)$ prior on $p$. It is possible to show that the joint density of $t = \sum_{i \le n} x_i$ and $p$ is
\[
f(t, p) = \binom{n}{t} \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)}\, p^{t + a - 1}(1 - p)^{n - t + b - 1}.
\]
The marginal density of $t$ is
\[
f_{\mathrm{beta}(a,b)}(t) = \binom{n}{t} \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \cdot \frac{\Gamma(t + a)\Gamma(n - t + b)}{\Gamma(n + a + b)},
\]
and the posterior density of $p$ is
\[
\pi(p \mid t) = \frac{\Gamma(n + a + b)}{\Gamma(t + a)\Gamma(n - t + b)}\, p^{t + a - 1}(1 - p)^{n - t + b - 1},
\]
which is the density of a $\mathrm{beta}(t + a, n - t + b)$ random variable. The Bayes estimator is
\[
\hat{p}_B := \mathbf{E}_\pi[p \mid t] = \frac{t + a}{a + b + n}.
\]
The Bayes estimator $\hat{p}_B$ has the form
\[
(1.1)\qquad \hat{p}_B = \frac{n}{a + b + n} \cdot \frac{t}{n} + \frac{a + b}{a + b + n} \cdot \frac{a}{a + b}.
\]
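The conjugate update in Example 1.1 is easy to check numerically. A minimal sketch; the values of n, t, a, b below are made up for illustration:

```python
import math

# Beta(a, b) prior on p; t successes observed in n Bernoulli trials.
# The posterior is beta(t + a, n - t + b), so the Bayes estimator
# (the posterior mean) is (t + a) / (a + b + n).
def bayes_estimate(t, n, a, b):
    return (t + a) / (a + b + n)

n, t, a, b = 10, 7, 2.0, 2.0
p_hat = bayes_estimate(t, n, a, b)

# Equation (1.1): a convex combination of the sample mean t/n
# and the prior mean a/(a + b), with weight n/(a + b + n) on the data.
w = n / (a + b + n)
assert math.isclose(p_hat, w * (t / n) + (1 - w) * (a / (a + b)))
```

As n grows with a, b fixed, the weight w tends to 1, matching the remark that the Bayes estimator converges to the sample mean.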

STAT 210B

Thus it is a convex combination of the sample mean $t/n$ and the prior mean $a/(a+b)$. As the sample size grows, $\hat{p}_B$ converges to $t/n$.

Example 1.2. Let $x \sim N(\mu, 1)$. A Bayesian investigator assumes a $N(a, b^2)$ prior on $\mu$. The posterior density of $\mu$ after observing $x$ is
\[
\pi(\mu \mid x) \propto \underbrace{\exp\Bigl(-\tfrac{1}{2}(x - \mu)^2\Bigr)}_{f(x \mid \mu)}\, \underbrace{\exp\Bigl(-\tfrac{(\mu - a)^2}{2b^2}\Bigr)}_{\pi(\mu)}.
\]
It is possible to show that the posterior is Gaussian:
\[
\pi(\mu \mid x) \propto \exp\Bigl(-\frac{(\mu - \tilde{a})^2}{2\tilde{b}^2}\Bigr),
\]
where
\[
\tilde{b} = \Bigl(1 + \frac{1}{b^2}\Bigr)^{-1/2}, \qquad \tilde{a} = \frac{x + \frac{a}{b^2}}{1 + \frac{1}{b^2}} = \frac{b^2 x + a}{1 + b^2}.
\]
The Bayes estimator is the posterior mean, which is also a convex combination of the sample $x$ and the prior mean $a$. In practice, it is usually not possible to derive the expected value of the posterior in closed form, and we must evaluate Bayes estimators by Monte Carlo methods.

In the preceding two examples, we chose the expected value of the posterior as the point estimator. Another option is the mode of the posterior, also known as the maximum a posteriori (MAP) estimator:
\[
(1.2)\qquad \hat{\theta}_{\mathrm{MAP}} \in \arg\max_{\theta \in \Theta} \pi(\theta \mid x) = \arg\max_{\theta \in \Theta} f(x \mid \theta)\pi(\theta).
\]
The main advantage of MAP estimators over Bayes estimators is that they are usually easier to evaluate than Bayes estimators. For this reason, MAP estimators are sometimes known as the poor man's Bayes estimator.

Example 1.3. Consider the Bayesian linear model
\[
\beta \sim N(0, \lambda^{-1} I_p), \qquad y_i \mid \{x_i, \beta\} \overset{\text{i.i.d.}}{\sim} N(x_i^T \beta, 1).
\]
We remark that the task is to learn the distribution of $y_i \mid x_i$: the distribution of $x_i$ is irrelevant. The posterior is
\[
\pi(\beta \mid X, y) \propto \exp\Bigl(-\tfrac{1}{2} \sum_{i \le n} (y_i - x_i^T \beta)^2\Bigr) \exp\Bigl(-\tfrac{\lambda}{2}\|\beta\|_2^2\Bigr).
\]
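The closed-form posterior in Example 1.2 can be sanity-checked against brute-force quadrature of the unnormalized posterior. A minimal sketch; the values of x, a, b and the grid are made up:

```python
import math

# Single observation x ~ N(mu, 1); prior mu ~ N(a, b^2).
# Closed form from the notes: posterior mean (b^2 * x + a) / (1 + b^2).
x, a, b = 1.5, 0.0, 2.0
a_tilde = (b**2 * x + a) / (1 + b**2)

# Brute-force check: integrate the unnormalized posterior on a grid.
step = 0.001
grid = [-10 + step * i for i in range(20001)]
post = [math.exp(-0.5 * (x - m) ** 2) * math.exp(-0.5 * (m - a) ** 2 / b**2)
        for m in grid]
Z = sum(post) * step                      # normalizing constant f_pi(x), up to scale
mean = sum(m * p for m, p in zip(grid, post)) * step / Z

assert abs(mean - a_tilde) < 1e-3         # quadrature agrees with the formula
```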

Maximizing the posterior is equivalent to maximizing
\[
\log \pi(\beta \mid X, y) = -\frac{1}{2} \sum_{i \le n} (y_i - x_i^T \beta)^2 - \frac{\lambda}{2}\|\beta\|_2^2 + \mathrm{const},
\]
which is akin to ridge regression.

We remark that MAP estimators are generally interpretable as regularized MLEs: the regularizer is the log-density of the prior. Another prominent example is the lasso estimator:
\[
\hat{\beta}_{\mathrm{lasso}} \in \arg\min_{\beta \in \mathbf{R}^p} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1,
\]
which is the MAP estimator under a double-exponential (Laplace) prior.

A subtle drawback of the MAP estimator is that it is not equivariant under reparametrization. Let $\pi(\theta \mid x)$ be the posterior density of $\theta$ and $\hat{\theta}_{\mathrm{MAP}}$ be its mode. The posterior density of $\eta = g(\theta)$, where $g: \Theta \to H$ is invertible, is
\[
\tilde{\pi}(\eta \mid x) = \pi(g^{-1}(\eta) \mid x)\,\Bigl|\frac{d}{d\eta} g^{-1}(\eta)\Bigr|.
\]
Due to the presence of the Jacobian term, the maximizer of $\tilde{\pi}(\eta \mid x)$ is generally not $g(\hat{\theta}_{\mathrm{MAP}})$.

Example 1.4. Consider a Bayesian investigation of the expected value of a $\mathrm{Ber}(p)$ random variable. The investigator assumes a uniform prior on $[0, 1]$. Before collecting any data, the posterior is just the prior, and a MAP estimator of $p$ is any point in $[0, 1]$. However, the prior of $q = \sqrt{p}$ is
\[
\tilde{\pi}(q) = \pi(q^2) \cdot 2q = \mathbf{1}_{[0,1]}(q^2) \cdot 2q,
\]
which is maximized at 1. The prior of $r = 1 - \sqrt{1 - p}$ is
\[
\tilde{\pi}(r) = \pi\bigl(1 - (1 - r)^2\bigr) \cdot 2(1 - r),
\]
which is maximized at 0.

A key design choice in a Bayesian investigation is the choice of prior. If prior knowledge is available, it is often desirable to incorporate this knowledge into the choice of prior. We summarize some popular approaches to choosing a prior.

1. empirical Bayes: estimate (the parameters of) the prior from data. The marginal distribution of $x$ usually depends on the parameters of the prior, which can be used to estimate the parameters. We give an example of an empirical Bayes estimator later in the course.
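For a scalar version of Example 1.3 the MAP estimate has a closed form, which makes the ridge-regression connection concrete. A minimal sketch; the data and the value of lam are made up:

```python
# Scalar Bayesian linear model: beta ~ N(0, 1/lam), y_i | x_i ~ N(x_i * beta, 1).
# The MAP estimate maximizes -0.5 * sum((y_i - x_i*beta)^2) - (lam/2) * beta^2,
# i.e. it solves a ridge regression; in 1-D the closed form is
# sum(x * y) / (sum(x^2) + lam).
xs = [0.5, 1.0, 1.5, 2.0]
ys = [0.6, 1.1, 1.4, 2.2]
lam = 1.0

beta_map = sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def log_post(beta):
    # Log posterior up to an additive constant.
    return (-0.5 * sum((y - x * beta) ** 2 for x, y in zip(xs, ys))
            - 0.5 * lam * beta ** 2)

# The closed-form MAP beats nearby perturbations of the log posterior.
assert log_post(beta_map) >= log_post(beta_map + 0.01)
assert log_post(beta_map) >= log_post(beta_map - 0.01)
```

Setting lam = 0 recovers the (unregularized) least-squares MLE, in line with the remark that MAP estimators are regularized MLEs.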

2. hierarchical Bayes: impose a hyperprior on the parameters of the prior. The choice of hyperprior usually has a smaller impact on the Bayes estimator than the choice of prior.

3. robust Bayes: look for estimators that perform well under all priors in some family of priors.

Bayes estimators are less popular in practice than maximum likelihood or MoM estimators because the choice of prior is usually subjective. Two investigators given the same dataset may arrive at different estimates of a parameter due to differences in their priors. Thus, unlike the MLE or MoM estimators, Bayes estimators are inherently subjective.

2. Evaluating point estimators.

Definition 2.1. The mean squared error (MSE) of an estimator $\delta(x)$ of a parameter $\theta$ is $\mathbf{E}_\theta\bigl[\|\delta(x) - \theta\|_2^2\bigr]$. (The subscript on $\mathbf{E}$ means the expectation is taken with respect to $F_\theta \in \mathcal{F}$.)

The MSE is the average discrepancy between the estimator $\delta(x)$ and the unknown parameter $\theta$ in the $\ell_2$ norm. For any estimator, the MSE decomposes into bias and variance terms:
\[
(2.1)\qquad
\begin{aligned}
\mathbf{E}_\theta\bigl[\|\delta(x) - \theta\|_2^2\bigr]
&= \mathbf{E}_\theta\bigl[\|\delta(x) - \mathbf{E}_\theta[\delta(x)] + \mathbf{E}_\theta[\delta(x)] - \theta\|_2^2\bigr] \\
&= \mathbf{E}_\theta\bigl[\|\delta(x) - \mathbf{E}_\theta[\delta(x)]\|_2^2\bigr] + 2\,\mathbf{E}_\theta\bigl[\bigl(\delta(x) - \mathbf{E}_\theta[\delta(x)]\bigr)^T \bigl(\mathbf{E}_\theta[\delta(x)] - \theta\bigr)\bigr] + \|\mathbf{E}_\theta[\delta(x)] - \theta\|_2^2 \\
&= \underbrace{\mathbf{E}_\theta\bigl[\|\delta(x) - \mathbf{E}_\theta[\delta(x)]\|_2^2\bigr]}_{\text{variance}} + \underbrace{\|\mathbf{E}_\theta[\delta(x)] - \theta\|_2^2}_{\text{bias}^2},
\end{aligned}
\]
where the cross term vanishes because $\mathbf{E}_\theta\bigl[\delta(x) - \mathbf{E}_\theta[\delta(x)]\bigr] = 0$. The bias term is the difference between the expected value of the estimator and the target and is a measure of the estimator's accuracy. An estimator whose bias vanishes for any $\theta \in \Theta$ is unbiased.

MSE is by no means the only error metric that practitioners consider. It is a special case of a risk function. The study of the performance of estimators evaluated by risk functions is a branch of decision theory. Decision theory formalizes a statistical investigation as a decision problem. After observing $x \sim F \in \mathcal{F}$, the investigator makes a decision regarding the unknown parameter $\theta \in \Theta$. The set of allowed decisions is called the action space $A$. In point estimation, the decision is typically the point

estimate. Thus $A$ is often just $\Theta$. After a decision is made, the investigator incurs a loss given by a loss function $\ell: \Theta \times A \to \mathbf{R}_+$. By convention, bad decisions incur higher losses, and the investigator seeks to minimize his or her losses. Some examples of loss functions are

1. square loss: $\ell(\theta, a) = \|\theta - a\|_2^2$,
2. absolute value loss: $\ell(\theta, a) = \|\theta - a\|_1$,
3. zero-one loss: $\ell(\theta, a) = 1 - \mathbf{1}_{\{\theta\}}(a)$,
4. log loss: $\ell(\theta, a) = \log\frac{f_\theta(x)}{f_a(x)}$.

The performance of a decision rule $\delta: \mathcal{X} \to A$ is summarized by its risk function:
\[
\mathrm{Risk}_\delta(\theta) = \mathbf{E}_\theta\bigl[\ell(\theta, \delta(x))\bigr].
\]
The MSE of a point estimator $\delta(x)$ is the risk function of the decision rule $\delta$ under the square loss function $\ell(\theta, a) = \|\theta - a\|_2^2$.

2.1. Rao-Blackwellization. As we shall see, there is an intimate connection between data reduction and good point estimators (in the decision-theoretic sense). When the loss function is convex (in $\delta$), it is always possible to reduce the risk of an estimator by Rao-Blackwellization.

Theorem 2.2 (Rao-Blackwell). Let $t$ be a sufficient statistic for the model $\mathcal{F} := \{F_\theta : \theta \in \Theta\}$. For any loss function $\ell: \Theta \times A \to \mathbf{R}_+$ that is convex in its second argument, we have
\[
\mathbf{E}_\theta\bigl[\ell\bigl(\theta, \mathbf{E}[\delta(x) \mid t]\bigr)\bigr] \le \mathbf{E}_\theta\bigl[\ell(\theta, \delta(x))\bigr]
\]
for any $\theta \in \Theta$. Further, if $\ell$ is strictly convex in its second argument, the inequality is strict unless $\delta(x) \overset{\text{a.s.}}{=} \mathbf{E}[\delta(x) \mid t]$.

Proof. By the tower property of conditional expectations,
\[
\mathbf{E}_\theta\bigl[\ell(\theta, \delta(x))\bigr] = \mathbf{E}_\theta\bigl[\mathbf{E}\bigl[\ell(\theta, \delta(x)) \mid t\bigr]\bigr],
\]
which, by Jensen's inequality, is at least $\mathbf{E}_\theta\bigl[\ell\bigl(\theta, \mathbf{E}[\delta(x) \mid t]\bigr)\bigr]$. If $\ell$ is strictly convex in its second argument, Jensen's inequality is strict unless $\delta(x) \overset{\text{a.s.}}{=} \mathbf{E}[\delta(x) \mid t]$.
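A simulation illustrating the Rao-Blackwell theorem under square loss: start from the crude unbiased estimator $\delta(x) = x_1$ of a Bernoulli $p$ and condition on the sufficient statistic $t = \sum_i x_i$, for which $\mathbf{E}[x_1 \mid t] = t/n$. The values of p, n, and the replication count are made up:

```python
import random

random.seed(1)

# Monte Carlo estimate of the risk (MSE) of the crude estimator x_1
# and of its Rao-Blackwellization t/n (the sample mean).
p, n, reps = 0.4, 10, 100_000

risk_crude, risk_rb = 0.0, 0.0
for _ in range(reps):
    x = [random.random() < p for _ in range(n)]
    risk_crude += (x[0] - p) ** 2          # delta(x) = x_1, risk p(1-p)
    risk_rb += (sum(x) / n - p) ** 2       # E[x_1 | t] = t/n, risk p(1-p)/n
risk_crude /= reps
risk_rb /= reps

assert risk_rb < risk_crude                # conditioning reduced the risk
```

The exact risks are $p(1-p)$ and $p(1-p)/n$, so the Rao-Blackwellized estimator wins by a factor of $n$ here.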

The Rao-Blackwell theorem shows that it is possible to reduce the risk of any point estimator by conditioning on a sufficient statistic. The estimator $\delta_{RB}(t) := \mathbf{E}[\delta(x) \mid t]$ is sometimes called a Rao-Blackwellized estimator. By modifying the proof of the Rao-Blackwell theorem, it is possible to show that if $t$ is a minimal sufficient statistic and $t'$ is another sufficient statistic, then
\[
\mathbf{E}_\theta\bigl[\ell\bigl(\theta, \mathbf{E}[\delta(x) \mid t]\bigr)\bigr] \le \mathbf{E}_\theta\bigl[\ell\bigl(\theta, \mathbf{E}[\delta(x) \mid t']\bigr)\bigr]
\]
for any $\theta \in \Theta$.

The astute reader may observe that the proof of the Rao-Blackwell theorem is valid whether $t$ is a sufficient statistic or not. Thus conditioning on any statistic reduces the risk. Although this is true, if $t$ is not sufficient, the law of $\delta(x) \mid t$ generally depends on the unknown parameter $\theta$. Thus the "Rao-Blackwellized estimator" is, in fact, not an estimator.

2.2. Admissibility. Recall the risk of an estimator $\delta$:
\[
R_\delta(\theta) = \mathbf{E}_\theta\bigl[\ell(\theta, \delta(x))\bigr],
\]
which leads us to study notions of optimality for estimators in terms of their risk. Unfortunately, there is usually no uniformly optimal estimator.

Example 2.3. Let $x_i \overset{\text{i.i.d.}}{\sim} \mathrm{Ber}(p)$. The MLE of $p$ is $\hat{p}_{ML} = \bar{x}$, and the Bayes estimator is
\[
(2.2)\qquad \hat{p}_B = \mathbf{E}_\pi[p \mid t] = \frac{n\bar{x} + a}{a + b + n}.
\]
Since the MLE is unbiased, its MSE is its variance, which is $\frac{p(1-p)}{n}$. The MSE of the Bayes estimator is
\[
\begin{aligned}
\mathrm{MSE}_{\hat{p}_B}(p) &= \mathrm{var}_p\bigl[\hat{p}_B\bigr] + \bigl(\mathbf{E}_p[\hat{p}_B] - p\bigr)^2 \\
&= \mathrm{var}_p\Bigl[\frac{n\bar{x} + a}{a + b + n}\Bigr] + \Bigl(\frac{np + a}{a + b + n} - p\Bigr)^2 \\
&= \frac{np(1-p)}{(a + b + n)^2} + \Bigl(\frac{np + a}{a + b + n} - p\Bigr)^2,
\end{aligned}
\]
which is a quadratic function of $p$. It is possible to show that by choosing $a = b = \frac{\sqrt{n}}{2}$, the MSE is constant in $p$:
\[
\mathrm{MSE}_{\hat{p}_B}(p) = \frac{n}{4(n + \sqrt{n})^2}.
\]
We observe that the MSE of the MLE is smaller than that of $\hat{p}_B$ when $p$ is near 0 or 1, but larger when $p$ is near $\frac{1}{2}$. Thus neither estimator dominates the other uniformly on $p \in [0, 1]$.
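The closed-form MSE curves in Example 2.3 can be checked directly. A short sketch; n is made up:

```python
import math

# MSE of the MLE vs. the Bayes estimator with a = b = sqrt(n)/2.
n = 25
a = b = math.sqrt(n) / 2

def mse_mle(p):
    return p * (1 - p) / n

def mse_bayes(p):
    # Variance + squared bias, as derived in Example 2.3.
    return (n * p * (1 - p) / (a + b + n) ** 2
            + ((n * p + a) / (a + b + n) - p) ** 2)

# With this choice of a, b the Bayes MSE is constant in p ...
const = n / (4 * (n + math.sqrt(n)) ** 2)
assert all(math.isclose(mse_bayes(k / 10), const) for k in range(11))

# ... and neither estimator dominates: the MLE wins near 0 and 1,
# the Bayes estimator wins near 1/2.
assert mse_mle(0.05) < mse_bayes(0.05)
assert mse_mle(0.5) > mse_bayes(0.5)
```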

Although there is usually no uniformly optimal estimator, there are uniformly suboptimal estimators.

Definition 2.4. A decision rule $\delta$ is inadmissible if there is another decision rule $\delta'$ such that $R_{\delta'}(\theta) \le R_\delta(\theta)$ for any $\theta \in \Theta$ and $R_{\delta'}(\theta) < R_\delta(\theta)$ at some $\theta \in \Theta$. Otherwise, $\delta$ is admissible.

Intuitively, an inadmissible estimator is uniformly bested by another estimator, so, from a decision-theoretic perspective, there is no good reason to use inadmissible estimators. We remark that although inadmissible estimators are bad, admissible estimators are not necessarily good. An estimator is admissible if it has (strictly) smaller risk than any other estimator at a single $\theta \in \Theta$. Its risk at any other $\theta$ may be arbitrarily bad.

3. Bayes estimators.

Definition 3.1. Let $\pi$ be a prior on the parameter space $\Theta$. The Bayes risk of a decision rule $\delta$ is
\[
\mathbf{E}_\pi\bigl[R_\delta(\theta)\bigr] = \int_\Theta R_\delta(\theta)\pi(\theta)\,d\theta.
\]
It is possible to minimize the Bayes risk to derive a Bayes estimator. We begin by noticing the Bayes risk has the form
\[
\begin{aligned}
\mathbf{E}_\pi\bigl[R_\delta(\theta)\bigr]
&= \int_\Theta \int_{\mathcal{X}} \ell(\theta, \delta(x)) f(x \mid \theta)\pi(\theta)\,dx\,d\theta \\
&= \int_\Theta \int_{\mathcal{X}} \ell(\theta, \delta(x)) \pi(\theta \mid x) f_\pi(x)\,dx\,d\theta \\
&= \int_{\mathcal{X}} \mathrm{PostRisk}_\delta(x) f_\pi(x)\,dx,
\end{aligned}
\]
where
\[
\mathrm{PostRisk}_\delta(x) = \mathbf{E}_{\theta \sim \pi(\cdot \mid x)}\bigl[\ell(\theta, \delta(x))\bigr] = \int_\Theta \ell(\theta, \delta(x)) \pi(\theta \mid x)\,d\theta
\]
is the posterior risk of the estimator $\delta$. We observe that the posterior risk does not depend on $\theta$; it depends only on $x$. By choosing $\delta(x)$ to minimize the posterior risk, we minimize the Bayes risk. In practice, it is only necessary to minimize the posterior risk at the observed $x$. That is,
\[
\delta_B(x) := \arg\min_{a \in A}\, \mathbf{E}_{\theta \sim \pi(\cdot \mid x)}\bigl[\ell(\theta, a)\bigr].
\]
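Minimizing the posterior risk can be done by brute force when the posterior is discretized. The sketch below does this for the absolute value loss, where the minimizer should be (close to) the posterior median; the beta(8, 4) posterior and the grid resolution are made up for illustration:

```python
from math import gamma

# Discretize a beta(8, 4) posterior on a grid of actions/parameter values.
a_par, b_par = 8, 4
grid = [(i + 0.5) / 1000 for i in range(1000)]
dens = [gamma(a_par + b_par) / (gamma(a_par) * gamma(b_par))
        * p ** (a_par - 1) * (1 - p) ** (b_par - 1) for p in grid]
Z = sum(dens)
w = [d / Z for d in dens]                  # normalized posterior weights

def post_risk(action):
    # Posterior risk under absolute value loss l(theta, a) = |theta - a|.
    return sum(wi * abs(p - action) for wi, p in zip(w, grid))

delta_B = min(grid, key=post_risk)         # brute-force Bayes action

# The posterior median, read off the discretized CDF.
cdf, median = 0.0, grid[-1]
for wi, p in zip(w, grid):
    cdf += wi
    if cdf >= 0.5:
        median = p
        break

assert abs(delta_B - median) < 5e-3        # minimizer agrees with the median
```

This illustrates the general recipe: each loss function pairs the Bayes estimator with a different posterior summary (mean for square loss, median for absolute loss, mode for zero-one loss, as shown next).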

Example 3.2. The posterior MSE of an estimator $\delta$ is
\[
\mathrm{PostMSE}_\delta(x) = \mathbf{E}_{\theta \sim \pi(\cdot \mid x)}\bigl[\|\theta - \delta(x)\|_2^2\bigr] = \mathbf{E}_{\theta \sim \pi(\cdot \mid x)}\bigl[\|\theta\|_2^2\bigr] - 2\,\mathbf{E}_{\theta \sim \pi(\cdot \mid x)}[\theta]^T \delta(x) + \|\delta(x)\|_2^2,
\]
which is a quadratic function of $\delta(x)$. It is minimized at $\delta_B(x) = \mathbf{E}_{\theta \sim \pi(\cdot \mid x)}[\theta]$. Thus the Bayes estimator is the expected value of the posterior. Example 2.3 is a concrete example of a Bayes estimator.

Example 3.3. Instead of the square loss function, consider the zero-one loss function
\[
\ell(\theta, \delta) = 1 - \mathbf{1}_{\{\theta\}}(\delta) = \begin{cases} 0 & \theta = \delta \\ 1 & \text{otherwise.} \end{cases}
\]
If $\Theta$ is finite, the posterior risk is
\[
\mathrm{PostRisk}_\delta(x) = \sum_{\theta \in \Theta} \ell(\theta, \delta(x))\pi(\theta \mid x) = 1 - \pi(\delta(x) \mid x).
\]
To minimize the posterior risk, we should choose $\delta(x)$ so that $\pi(\delta(x) \mid x)$ is as large as possible. Thus the Bayes estimator under the zero-one loss function is the MAP estimator.

When minimization of the posterior risk cannot be done analytically, it is often done numerically:
\[
\delta_B(x) \in \arg\min_{\delta \in \Theta} \mathrm{PostRisk}_\delta(x) = \arg\min_{\delta \in \Theta} \log\bigl(\mathrm{PostRisk}_\delta(x)\bigr).
\]
We remark that the problem is similar to evaluating the MLE. Thus most numerical methods for evaluating the MLE are directly applicable to minimizing the posterior risk.

Before moving on to minimax estimators, we comment on the admissibility of Bayes estimators.

Theorem 3.4. A unique Bayes estimator is admissible.

Proof. Suppose $\delta_B$ is inadmissible: there is another estimator $\delta$ such that $R_\delta(\theta) \le R_{\delta_B}(\theta)$ for any $\theta \in \Theta$ and $R_\delta(\theta) < R_{\delta_B}(\theta)$ at some $\theta$. Then
\[
(3.1)\qquad \mathbf{E}_\pi\bigl[R_\delta(\theta)\bigr] \le \mathbf{E}_\pi\bigl[R_{\delta_B}(\theta)\bigr].
\]
Thus $\delta(x)$ is Bayes. Since $\delta_B$ is unique, $\delta_B(x) = \delta(x)$ for all $x \in \mathcal{X}$, a contradiction.
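Example 3.3 can be made concrete on a small finite parameter space. A minimal sketch; the three candidate values of the parameter, the uniform prior, and the data (t = 7 successes in n = 10 trials) are made up:

```python
# Finite parameter space with a uniform prior.
thetas = [0.2, 0.5, 0.8]
prior = [1 / 3, 1 / 3, 1 / 3]

# Binomial likelihood of t successes in n trials; the binomial
# coefficient is omitted because it cancels in the posterior.
n, t = 10, 7
lik = [th ** t * (1 - th) ** (n - t) for th in thetas]

Z = sum(l * pr for l, pr in zip(lik, prior))     # marginal f_pi(x)
post = [l * pr / Z for l, pr in zip(lik, prior)]  # posterior pi(theta | x)

# Posterior risk of guessing theta under zero-one loss is 1 - pi(theta | x);
# minimizing it is the same as maximizing the posterior probability (MAP).
risks = [1 - p for p in post]
best = thetas[risks.index(min(risks))]
assert best == thetas[post.index(max(post))]
```

With 7 successes in 10 trials the data favor the largest candidate, so the Bayes rule under zero-one loss picks 0.8.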

Yuekai Sun
Berkeley, California
November 3, 2015