Bayes Procedures
MIT 18.655 Mathematical Statistics
Dr. Kempthorne
Spring 2016


Decision Problem: Basic Components
P = {P_θ : θ ∈ Θ}: parametric model.
Θ = {θ}: parameter space.
A = {a}: action space.
L(θ, a): loss function.
R(θ, δ) = E_{X|θ}[L(θ, δ(X))]: risk function of the procedure δ.

Decision Problem: Bayes Components
π: prior distribution on Θ.
r(π, δ) = E_θ[R(θ, δ)] = E_θ[E_{X|θ}[L(θ, δ(X))]] = E_{X,θ}[L(θ, δ(X))]: Bayes risk of the procedure δ(X) with respect to the prior π.
r(π) = inf_{δ ∈ D} r(π, δ): minimum Bayes risk.
δ_π with r(π, δ_π) = r(π): Bayes rule with respect to the prior π and loss L(·, ·).
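As a hedged toy example (not from the lecture), the sketch below instantiates these components for a two-point parameter space and evaluates the risk function R(θ, δ) and the Bayes risk r(π, δ) by direct enumeration; the distributions, loss, prior, and rule δ are all illustrative assumptions.

```python
import numpy as np

# Toy decision problem: theta in {0, 1}, X takes two values, actions a in {0, 1},
# and 0-1 loss L(theta, a) = 1{theta != a}. (Illustrative only.)
p_x_given_theta = np.array([[0.7, 0.3],   # row theta = 0: P(X = 0), P(X = 1)
                            [0.2, 0.8]])  # row theta = 1: P(X = 0), P(X = 1)

def loss(theta, a):
    return float(theta != a)

def delta(x):
    return x                              # decision rule: report the observed X

# Risk function R(theta, delta) = E_{X|theta}[L(theta, delta(X))]
risk = np.array([sum(p_x_given_theta[t, x] * loss(t, delta(x)) for x in (0, 1))
                 for t in (0, 1)])

# Bayes risk r(pi, delta) = E_theta[R(theta, delta)] under the prior pi on Theta
pi = np.array([0.5, 0.5])
print("R(theta, delta):", risk, "  r(pi, delta):", float(pi @ risk))
```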

Interpretations of Bayes Risk
Bayes risk:
r(π, δ) = ∫_Θ R(θ, δ) π(dθ).
π(·) weights the θ ∈ Θ where R(θ, δ) matters.
π(θ) = constant: weights θ uniformly. Note: uniform weighting depends on the parametrization.
Interdependence of specifying the loss function and the prior density:
r(π, δ) = ∫_Θ ∫_X [L(θ, δ(x)) π(θ)] p(x|θ) dx dθ
        = ∫_Θ ∫_X [L*(θ, δ(x)) π*(θ)] p(x|θ) dx dθ
for any L*(·, ·), π*(·) such that L*(θ, δ(x)) π*(θ) = L(θ, δ(x)) π(θ).

Bayes Procedure: Quadratic Loss
Quadratic loss: estimating q(θ) with a ∈ A = {q(θ) : θ ∈ Θ},
L(θ, a) = [q(θ) − a]^2.
Bayes risk: r(π, δ) = E([q(θ) − δ(X)]^2).
Bayes risk as expected posterior risk:
r(π, δ) = E_X(E_{θ|X}[L(θ, δ(X))]) = E_X(E_{θ|X}([q(θ) − δ(X)]^2)).
The Bayes decision rule is specified by minimizing
E_{θ|X}([q(θ) − δ(x)]^2 | X = x)
for each outcome X = x, which is solved by the posterior mean
δ_π(x) = E_{θ|X=x}[q(θ) | X = x] = ∫_Θ q(θ) p(x|θ) π(θ) dθ / ∫_Θ p(x|θ) π(θ) dθ.
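The posterior-mean formula can be evaluated by numerical integration when no closed form is available; the following is a minimal sketch (not from the slides), and the normal likelihood, prior, and grid are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

# Posterior mean E[q(theta) | x] under squared-error loss via a grid approximation
# of the ratio of integrals (illustrative sketch; the model choices are assumptions).
def bayes_estimate(x, q=lambda th: th,
                   prior_pdf=stats.norm(0.0, 2.0).pdf,
                   lik=lambda x, th: stats.norm(th, 1.0).pdf(x),
                   grid=np.linspace(-10, 10, 4001)):
    w = lik(x, grid) * prior_pdf(grid)              # unnormalized posterior on the grid
    return float(np.sum(q(grid) * w) / np.sum(w))   # ratio of (equally spaced) Riemann sums

print(bayes_estimate(1.5))   # posterior mean of theta given x = 1.5
```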

Bayes Procedure: Quadratic Loss
Example 3.2.1. X_1, ..., X_n iid N(θ, σ^2), σ^2 > 0 known.
Prior distribution: π : θ ~ N(η, τ^2).
Posterior distribution: θ | X = x ~ N(η̂, τ̂^2), where
η̂ = [x̄/(σ^2/n) + η/τ^2] / [1/(σ^2/n) + 1/τ^2]
τ̂^2 = [1/(σ^2/n) + 1/τ^2]^{−1}.
Bayes procedure: δ_π(X) = E[θ | x] = η̂.
Observations:
Posterior risk: E[L(θ, δ_π) | X = x] = τ̂^2 (constant!) ⇒ Bayes risk: r(π, δ_π) = τ̂^2.
The MLE δ_MLE(x) = x̄ has constant risk R(θ, x̄) = σ^2/n ⇒ Bayes risk r(π, x̄) = σ^2/n (> τ̂^2).
lim_{τ→∞} δ_π(x) = x̄ and lim_{τ→∞} τ̂^2 = σ^2/n.
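A short numerical sketch of Example 3.2.1 (the specific numbers are made up, not from the lecture): it computes the precision-weighted posterior mean η̂ and posterior variance τ̂^2 and compares the latter with the MLE's constant risk σ^2/n.

```python
import numpy as np

# Example 3.2.1 check: N(theta, sigma^2) data with N(eta, tau^2) prior (illustrative values).
sigma2, n, eta, tau2 = 4.0, 25, 0.0, 1.0
rng = np.random.default_rng(0)
x = rng.normal(0.5, np.sqrt(sigma2), size=n)

prec_data, prec_prior = n / sigma2, 1.0 / tau2
eta_hat = (prec_data * x.mean() + prec_prior * eta) / (prec_data + prec_prior)  # posterior mean
tau2_hat = 1.0 / (prec_data + prec_prior)                                       # posterior variance

print("Bayes rule delta_pi(x):", eta_hat)
print("posterior (= Bayes) risk tau2_hat:", tau2_hat)
print("MLE risk sigma^2/n:", sigma2 / n)   # exceeds tau2_hat, as on the slide
```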

Bayes Procedure: General Case
Bayes risk and posterior risk:
r(π, δ) = E_θ[R(θ, δ(X))] = E_θ[E_{X|θ}[L(θ, δ(X))]]
        = E_X[E_{θ|X}[L(θ, δ(X))]] = E_X[r(δ(X) | X)],
where r(a | x) = E[L(θ, a) | X = x] (posterior risk).
Proposition 3.2.1. Suppose δ* : X → A is such that
r(δ*(x) | x) = inf_{a ∈ A} r(a | x) for each x.
Then δ* is a Bayes rule.
Proof. For any procedure δ ∈ D,
r(π, δ) = E_X[r(δ(X) | X)] ≥ E_X[r(δ*(X) | X)] = r(π, δ*).

Bayes Procedures for Problems With Finite Θ
Finite Θ problem:
Θ = {θ_0, θ_1, ..., θ_K}.
A = {a_0, a_1, ..., a_q} (q may or may not equal K).
L(θ_i, a_j) = w_{ij}, for i = 0, 1, ..., K and j = 0, 1, ..., q.
Prior distribution: π(θ_i) = π_i ≥ 0, i = 0, 1, ..., K, with ∑_{i=0}^K π_i = 1.
Data/random variable: X ~ P_θ with density/pmf p(x | θ).
Solution:
Posterior probabilities: π(θ_i | X = x) = π_i p(x | θ_i) / ∑_{j=0}^K π_j p(x | θ_j).
Posterior risks: r(a_j | X = x) = ∑_{i=0}^K w_{ij} π_i p(x | θ_i) / ∑_{i=0}^K π_i p(x | θ_i).
Bayes decision rule: δ*(x) satisfies r(δ*(x) | x) = min_{0 ≤ j ≤ q} r(a_j | x).
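As a hedged numerical sketch (not from the lecture), the snippet below computes the posterior probabilities π(θ_i | x), the posterior risks r(a_j | x), and the minimizing action for an illustrative prior, loss matrix, and likelihood vector; all numbers and names are assumptions.

```python
import numpy as np

# Finite-Theta Bayes rule (illustrative): K + 1 parameter values, q + 1 actions,
# loss matrix W[i, j] = L(theta_i, a_j) = w_ij.
pi = np.array([0.3, 0.5, 0.2])        # prior pi_i on theta_0, theta_1, theta_2
W = np.array([[0., 1., 4.],
              [1., 0., 1.],
              [4., 1., 0.]])          # here q = K = 2

def bayes_action(lik_x):
    """lik_x[i] = p(x | theta_i) for the observed x."""
    post = pi * lik_x
    post /= post.sum()                # posterior probabilities pi(theta_i | x)
    post_risk = post @ W              # r(a_j | x) = sum_i w_ij * pi(theta_i | x)
    return int(np.argmin(post_risk)), post_risk

j_star, risks = bayes_action(np.array([0.1, 0.6, 0.3]))
print("Bayes action index:", j_star, "  posterior risks:", risks)
```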

Finite Θ Problem: Classification
Classification decision problem: q = K, so A can be identified with Θ.
Loss function (0-1 loss):
L(θ_i, a_j) = w_{ij} = 1 if i ≠ j, and 0 if i = j.
The Bayes procedure minimizes the posterior risk
r(θ_i | x) = P[θ ≠ θ_i | x] = 1 − P[θ = θ_i | x]
⇒ δ*(x) = the θ_i ∈ A that maximizes P[θ = θ_i | x].
Special case: testing a null hypothesis versus an alternative.
q = K = 1; π_0 = π, π_1 = 1 − π.
Testing Θ_0 = {θ_0} versus Θ_1 = {θ_1}:
the Bayes rule chooses θ = θ_1 if P[θ = θ_1 | x] > P[θ = θ_0 | x].

Finite Θ Problem: Testing
Equivalent specifications of the Bayes procedure δ*(x):
It minimizes
r(π, δ) = π R(θ_0, δ) + (1 − π) R(θ_1, δ)
        = π P(δ(X) = θ_1 | θ_0) + (1 − π) P(δ(X) = θ_0 | θ_1).
It chooses θ = θ_1 if P[θ = θ_1 | x] > P[θ = θ_0 | x], i.e., if
(1 − π) p(x | θ_1) > π p(x | θ_0)
⟺ p(x | θ_1)/p(x | θ_0) > π/(1 − π)  (likelihood ratio)
⟺ [(1 − π)/π] · [p(x | θ_1)/p(x | θ_0)] > 1  (Bayes factor).
The procedure δ* solves:
Minimize P(δ(X) = θ_0 | θ_1)  [P(Type II error)]
subject to P(δ(X) = θ_1 | θ_0) ≤ α  [P(Type I error)],
i.e., it minimizes the Lagrangian
P(δ(X) = θ_1 | θ_0) + λ P(δ(X) = θ_0 | θ_1)
with Lagrange multiplier λ = (1 − π)/π.
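A minimal sketch (not from the lecture) of the Bayes test as a likelihood-ratio threshold rule; the Gaussian likelihoods and the prior value π are illustrative assumptions.

```python
from scipy import stats

# Bayes test of theta_0 versus theta_1: choose theta_1 when the likelihood ratio
# p(x | theta_1) / p(x | theta_0) exceeds pi / (1 - pi). (Illustrative model.)
pi0 = 0.5                         # prior probability of theta_0
f0 = stats.norm(0.0, 1.0)         # p(x | theta_0)
f1 = stats.norm(1.0, 1.0)         # p(x | theta_1)

def bayes_test(x):
    lr = f1.pdf(x) / f0.pdf(x)
    return "theta_1" if lr > pi0 / (1 - pi0) else "theta_0"

print(bayes_test(0.2), bayes_test(1.4))
```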

Estimating a Success Probability With Non-Quadratic Loss
Decision problem:
X_1, ..., X_n iid Bernoulli(θ).
Θ = {θ : 0 < θ < 1} = (0, 1).
A = {a} = Θ.
Loss equal to relative squared error:
L(θ, a) = (θ − a)^2 / [θ(1 − θ)], 0 < θ < 1 and a real.
Solving the decision problem:
By sufficiency, consider decision rules based on the sufficient statistic
S = ∑_{i=1}^n X_i ~ Binomial(n, θ).
For a prior distribution π on Θ with density π(θ), denote the density of the posterior distribution by
π(θ | s) = π(θ) θ^s (1 − θ)^{n−s} / ∫_Θ π(t) t^s (1 − t)^{n−s} dt.

Solving the Decision Problem (continued)
The posterior risk r(a | S = k) is
r(a | S = k) = E[L(θ, a) | S = k] = E[(θ − a)^2 / (θ(1 − θ)) | S = k]
 = E[θ/(1 − θ) − 2a/(1 − θ) + a^2/(θ(1 − θ)) | S = k]
 = E[θ/(1 − θ) | k] − 2a E[1/(1 − θ) | k] + a^2 E[1/(θ(1 − θ)) | k],
which is a parabola in a, minimized at
a = E[1/(1 − θ) | k] / E[1/(θ(1 − θ)) | k].
This defines the Bayes rule δ*(S) for S = k (provided the expectations exist).
The Bayes rule has a closed form when the prior distribution is θ ~ Beta(r, s):
δ*(k) = B(r + k, n − k + s − 1) / B(r + k − 1, n − k + s − 1) = (r + k − 1) / (n + r + s − 2).
For r = s = 1, δ*(k) = k/n = x̄ (for k = 0, take a = 0 directly).
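The following sketch (not part of the lecture) verifies the closed form numerically: it evaluates the two posterior expectations by quadrature under a Beta(r, s) prior and compares the minimizer with (r + k − 1)/(n + r + s − 2); the values of r, s, n, k are arbitrary.

```python
from scipy import integrate, stats

# Bayes rule for relative squared error loss under a Beta(r, s) prior (numerical check).
r, s, n, k = 2.0, 3.0, 10, 4
post = stats.beta(r + k, s + n - k)        # posterior of theta given S = k

def post_mean(g):                          # E[g(theta) | S = k] by quadrature
    return integrate.quad(lambda t: g(t) * post.pdf(t), 0, 1)[0]

a_numeric = post_mean(lambda t: 1 / (1 - t)) / post_mean(lambda t: 1 / (t * (1 - t)))
a_closed = (r + k - 1) / (n + r + s - 2)
print(a_numeric, a_closed)                 # the two values agree
```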

Bayes Procedures With a Hierarchical Prior
Example 3.2.4. Random effects model:
X_{ij} = μ + Δ_i + E_{ij}, i = 1, ..., I and j = 1, ..., J,
E_{ij} iid N(0, σ_e^2),
Δ_i iid N(0, σ_Δ^2), independent of the E_{ij},
μ ~ N(μ_0, σ_μ^2).
Bayes model, Specification I:
Prior distribution on θ = (μ, σ_e^2, σ_Δ^2):
μ ~ N(μ_0, σ_μ^2), independent of σ_e^2 and σ_Δ^2, so that
π(θ) = π_1(μ) π_2(σ_e^2) π_3(σ_Δ^2).
Data distribution: the X_{ij} | θ are jointly normal random variables with
E[X_{ij} | θ] = μ,
Var[X_{ij} | θ] = σ_Δ^2 + σ_e^2,
Cov[X_{ij}, X_{kl} | θ] = σ_Δ^2 + σ_e^2 if i = k and j = l; σ_Δ^2 if i = k and j ≠ l; 0 if i ≠ k.

Bayes Procedures With a Hierarchical Prior
Example 3.2.4. Random effects model (as above):
X_{ij} = μ + Δ_i + E_{ij}, i = 1, ..., I and j = 1, ..., J,
E_{ij} iid N(0, σ_e^2),
Δ_i iid N(0, σ_Δ^2), independent of the E_{ij},
μ ~ N(μ_0, σ_μ^2).
Bayes model, Specification II:
Prior distribution on (μ, σ_e^2, σ_Δ^2, Δ_1, ..., Δ_I):
π(θ) = π_1(μ) π_2(σ_e^2) π_3(σ_Δ^2) ∏_{i=1}^I π_4(Δ_i | σ_Δ^2)
      = π_1(μ) π_2(σ_e^2) π_3(σ_Δ^2) ∏_{i=1}^I φ_{σ_Δ}(Δ_i).
Data distribution: the X_{ij} | θ are independent normal random variables with
E[X_{ij} | θ] = μ + Δ_i, i = 1, ..., I and j = 1, ..., J,
Var[X_{ij} | θ] = σ_e^2.

Bayes Procedures With a Hierarchical Prior
Issues:
Decision problems often focus on a single Δ_i.
Posterior analyses then require marginal posterior distributions; e.g.,
π(Δ_1 = d_1 | x) = ∫_{{θ : Δ_1 = d_1}} π(θ | x) ∏_{i ≠ 1} dθ_i.
Approaches to computing marginal posterior distributions:
Direct computation (conjugate priors).
Markov chain Monte Carlo (MCMC): simulation from the posterior distribution (see the sketch below).
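As a hedged illustration of the MCMC route (not the lecture's code), the sketch below runs a Gibbs sampler for the random effects model of Example 3.2.4 under the simplifying assumption that σ_e^2, σ_Δ^2, μ_0, and σ_μ^2 are known, so that every full conditional is normal; the retained draws approximate the marginal posterior of Δ_1. All numerical values are assumptions.

```python
import numpy as np

# Gibbs sampler for X_ij = mu + Delta_i + E_ij with known variances (illustrative sketch).
rng = np.random.default_rng(1)
I, J = 5, 8
sigma_e2, sigma_D2, mu0, sigma_mu2 = 1.0, 2.0, 0.0, 10.0

# Simulate data from the model.
Delta_true = rng.normal(0, np.sqrt(sigma_D2), I)
x = 1.0 + Delta_true[:, None] + rng.normal(0, np.sqrt(sigma_e2), (I, J))

mu, Delta, draws = 0.0, np.zeros(I), []
for t in range(5000):
    # mu | Delta, x : normal full conditional (conjugate update)
    prec_mu = I * J / sigma_e2 + 1 / sigma_mu2
    mean_mu = ((x - Delta[:, None]).sum() / sigma_e2 + mu0 / sigma_mu2) / prec_mu
    mu = rng.normal(mean_mu, np.sqrt(1 / prec_mu))
    # Delta_i | mu, x : independent normal full conditionals
    prec_D = J / sigma_e2 + 1 / sigma_D2
    mean_D = (x - mu).sum(axis=1) / sigma_e2 / prec_D
    Delta = rng.normal(mean_D, np.sqrt(1 / prec_D))
    if t >= 1000:                       # discard burn-in, retain draws of Delta_1
        draws.append(Delta[0])

print("approximate posterior mean of Delta_1:", np.mean(draws))
```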

Equivariance
Definition:
θ̂_M: estimator of θ obtained by applying methodology M.
h(θ): a one-to-one function of θ (a reparametrization).
θ̂_M is equivariant if the estimate of h(θ) by methodology M equals h(θ̂_M).
Equivariance of MLEs and Bayes procedures:
MLEs are equivariant.
Bayes procedures are not necessarily equivariant.
For squared-error loss, the Bayes procedure is the mean of the posterior distribution; with a nonlinear reparametrization h(·), E[h(θ) | x] ≠ h(E[θ | x]) in general.
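A toy numerical check of this non-equivariance (a sketch under assumed values, not from the slides): with a Beta(3, 5) posterior for θ and h(θ) = θ^2, the posterior mean of h(θ) differs from h applied to the posterior mean.

```python
from scipy import stats

# Non-equivariance of the posterior mean under h(theta) = theta^2 (illustrative).
post = stats.beta(3, 5)                 # an assumed posterior for theta
e_theta = post.mean()                   # E[theta | x]
e_h_theta = post.var() + e_theta**2     # E[theta^2 | x] = Var(theta | x) + (E[theta | x])^2
print(e_h_theta, e_theta**2)            # differ by Var(theta | x) > 0
```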

Bayes Procedures and Reparametrization
Reparametrization of Bayes decision problems:
Reparametrization in Bayes decision analysis is not just a transformation-of-variables exercise applied to the joint/posterior distributions; the loss function must be transformed as well.
If φ = h(θ) and Φ = {φ : φ = h(θ), θ ∈ Θ}, then the transformed loss is L*[φ, a] = L[h^{−1}(φ), a].
The decision analysis should then be independent of the parametrization.

Equivariant Bayesian Decision Problems
Equivariant loss functions:
Consider a loss function for which L*(h(θ), a) = L(θ, a) for all one-to-one functions h(·). Such a loss function is equivariant.
A general class of equivariant loss functions: L(θ, a) = Q(P_θ, P_a), i.e., the loss depends on θ and a only through the corresponding distributions.
Example: Kullback-Leibler divergence loss,
L(θ, a) = E[log(p(X | θ)/p(X | a)) | θ],
the probability-weighted log-likelihood ratio.
For a canonical exponential family with natural parameter η, sufficient statistic T(X), and log-normalizer A(·),
L(η, a) = ∑_{j=1}^k (η_j − a_j) E[T_j(X) | η] − A(η) + A(a).
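As a quick worked check (not on the slide), for the normal location family p(x | θ) = N(θ, σ^2) with σ^2 known, the Kullback-Leibler loss reduces to a scaled squared-error loss:

\[
L(\theta, a)
= E\!\left[\log \frac{p(X \mid \theta)}{p(X \mid a)} \,\Big|\, \theta\right]
= \frac{1}{2\sigma^2}\, E\!\left[(X - a)^2 - (X - \theta)^2 \,\Big|\, \theta\right]
= \frac{(\theta - a)^2}{2\sigma^2},
\]

which is unchanged by any one-to-one reparametrization of θ, as required of an equivariant loss.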

MIT OpenCourseWare http://ocw.mit.edu 18.655 Mathematical Statistics Spring 2016 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.