David Giles Bayesian Econometrics

David Giles Bayesian Econometrics

1. General Background
2. Constructing Prior Distributions
3. Properties of Bayes Estimators and Tests
4. Bayesian Analysis of the Multiple Regression Model
5. Bayesian Model Selection / Averaging
6. Bayesian Computation - Markov Chain Monte Carlo (MCMC)

1. General Background

Major Themes:
(i) Flexible & explicit use of Prior Information, together with data, when drawing inferences.
(ii) Use of an explicit Decision-Theoretic basis for inferences.
(iii) Use of Subjective Probability to draw inferences about once-and-for-all events.
(iv) Unified set of principles applied to a wide range of inferential problems.
(v) No reliance on Repeated Sampling.
(vi) No reliance on large-n asymptotics - exact Finite-Sample results.
(vii) Construction of Estimators, Tests, and Predictions that are "Optimal".

Probability Definitions

1. Classical, "a priori" definition.
2. Long-run relative frequency definition.
3. Subjective, "personalistic" definition:
   (i) Probability is a personal "degree of belief".
   (ii) Formulate using subjective "betting odds".
   (iii) Probability is dependent on the available Information Set.
   (iv) Probabilities are revised (updated) as new information arises.
   (v) Once-and-for-all events can be handled formally.
   (vi) This is what most Bayesians use.

Rev. Thomas Bayes (1702?-1761)

Bayes' Theorem

(i) Conditional Probability:
p(A|B) = p(A∩B)/p(B) ;  p(B|A) = p(A∩B)/p(A)        (1)
p(A∩B) = p(A|B)p(B) = p(B|A)p(A)                    (2)

(ii) Bayes' Theorem: From (1) and (2):
p(A|B) = p(B|A)p(A)/p(B)                            (3)

(iii) Law of Total Probability:

Partition the sample space into k mutually exclusive & exhaustive events, {B1, B2, ..., Bk}. Then,
A = (A∩B1) ∪ (A∩B2) ∪ ... ∪ (A∩Bk)                  (4)

From (2), p(A∩Bi) = p(A|Bi)p(Bi). So, from (4):
p(A) = p(A|B1)p(B1) + p(A|B2)p(B2) + ... + p(A|Bk)p(Bk)

(iv) Bayes' Theorem Re-stated:
p(Bj|A) = p(A|Bj)p(Bj) / [Σ(i=1 to k) p(A|Bi)p(Bi)]
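To make the re-stated theorem concrete, here is a minimal numeric sketch in Python. The events and all probability values are made up purely for illustration; it just applies the Law of Total Probability and then Bayes' Theorem.

```python
# Hypothetical example: three mutually exclusive & exhaustive events B1, B2, B3,
# with assumed prior probabilities p(Bj) and assumed conditionals p(A|Bj).
p_B = [0.5, 0.3, 0.2]          # p(B1), p(B2), p(B3)   (illustrative values)
p_A_given_B = [0.1, 0.4, 0.8]  # p(A|B1), p(A|B2), p(A|B3)

# Law of Total Probability: p(A) = sum_j p(A|Bj) p(Bj)
p_A = sum(pa * pb for pa, pb in zip(p_A_given_B, p_B))

# Bayes' Theorem: p(Bj|A) = p(A|Bj) p(Bj) / p(A)
p_B_given_A = [pa * pb / p_A for pa, pb in zip(p_A_given_B, p_B)]

print(p_A)           # 0.33
print(p_B_given_A)   # posterior probabilities for B1, B2, B3; they sum to 1
```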

(v) Translate this to the inferential problem:

p(θ|y) = p(y|θ)p(θ)/p(y)
or,
p(θ|y) = p(y|θ)p(θ) / ∫ p(y|θ)p(θ) dθ      (Note: in general a multi-dimensional, multiple integral.)

We can write this as:
p(θ|y) ∝ p(θ)p(y|θ)
or,
p(θ|y) ∝ p(θ)L(θ|y)

That is: Posterior Density for θ ∝ Prior Density for θ × Likelihood Function.

Bayes' Theorem provides us with a way of updating our prior information about the parameters when (new) data information is obtained:

Prior Information + Sample Information → (Bayes' Theorem) → Posterior Information (= New Prior Info.)
New Prior Info. + More Data → (Bayes' Theorem) → ...

As we'll see later, the Bayesian updating procedure is strictly "additive".

Because ∫ p(θ|y)dθ = 1, we can always "recover" the proportionality constant later:
∫ p(θ|y)dθ = c ∫ p(y|θ)p(θ)dθ = 1
So, the proportionality constant is
c = 1 / ∫ p(y|θ)p(θ)dθ.
This is just a number, but obtaining it requires multi-dimensional integration. So, even "normalizing" the posterior density may be computationally burdensome.

Example 1

We have data generated by a Bernoulli distribution, and we want to estimate the probability of a "success":
p(y) = θ^y (1 − θ)^(1−y) ;  y = 0, 1 ;  0 ≤ θ ≤ 1

Uniform prior for θ on [0, 1]:
p(θ) = 1 ;  0 ≤ θ ≤ 1   (mean & variance?)
     = 0 ;  otherwise

Random sample of n observations, so the Likelihood Function is
L(θ|y) = p(y|θ) = Π(i=1 to n) [θ^(y_i) (1 − θ)^(1−y_i)] = θ^(Σy_i) (1 − θ)^(n − Σy_i)

Bayes' Theorem:
p(θ|y) ∝ p(θ)p(y|θ) ∝ θ^(Σy_i) (1 − θ)^(n − Σy_i)

How would we normalize this posterior density function?
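One answer: integrate the kernel numerically. The sketch below is a minimal illustration with a made-up data set; it normalizes the kernel θ^Σy (1 − θ)^(n−Σy) on a grid and compares the result with the exact Beta(Σy + 1, n − Σy + 1) density derived on the following slides.

```python
import numpy as np
from scipy.stats import beta

y = np.array([1, 0, 1, 1, 0, 1, 0, 1])   # made-up Bernoulli sample
n, s = len(y), y.sum()

theta = np.linspace(0.0, 1.0, 2001)
d = theta[1] - theta[0]
kernel = theta**s * (1.0 - theta)**(n - s)     # prior (= 1) times likelihood

# Normalizing constant c = 1 / integral of the kernel (simple Riemann sum)
c = 1.0 / (kernel.sum() * d)
posterior_grid = c * kernel

# Exact closed-form result: Beta(s + 1, n - s + 1)
posterior_exact = beta.pdf(theta, s + 1, n - s + 1)

print(np.max(np.abs(posterior_grid - posterior_exact)))   # small numerical difference
```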

Statistical Decision Theory

Loss Function - personal in nature.
L(θ, θ̂) ≥ 0, for all θ ;  L(θ, θ̂) = 0 iff θ̂ = θ.  Note that θ̂ = θ̂(y).

Examples (scalar case; each loss is symmetric):
(i)   L1(θ, θ̂) = c1(θ − θ̂)^2 ;  c1 > 0
(ii)  L2(θ, θ̂) = c2|θ − θ̂| ;  c2 > 0
(iii) L3(θ, θ̂) = c3 ;  if |θ − θ̂| > ε    (c3 > 0 ; ε > 0)
              = 0  ;  if |θ − θ̂| ≤ ε    (Typically, c3 = 1)

Risk Function:  Risk = Expected Loss
R(θ, θ̂) = ∫ L(θ, θ̂(y)) p(y|θ) dy
The average is taken over the sample space.

Minimum Expected Loss (MEL) Rule:
"Act so as to Minimize (posterior) Expected Loss." (e.g., choose an estimator, select a model/hypothesis)

Bayes' Rule:
"Act so as to Minimize Average Risk." (Often called the "Bayes' Risk".)
r(θ̂) = ∫ R(θ, θ̂) p(θ) dθ      (Averaging over the parameter space.)
Typically, this will be a multi-dimensional integral.

So, the Bayes' Rule implies
θ̂ = argmin [r(θ̂)] = argmin ∫_Ω R(θ, θ̂) p(θ) dθ

Sometimes, the Bayes' Rule is not defined (if the double integral diverges). The MEL Rule is almost always defined. Often, the Bayes' Rule and the MEL Rule will be equivalent.

To see this last result:
θ̂ = argmin ∫_Ω [∫_Y L(θ, θ̂) p(y|θ) dy] p(θ) dθ
   = argmin ∫_Ω [∫_Y L(θ, θ̂) p(θ|y) p(y) dy] dθ
   = argmin ∫_Y [∫_Ω L(θ, θ̂) p(θ|y) dθ] p(y) dy

We have to be able to interchange the order of integration - Fubini's Theorem.

Note that p(y) ≥ 0. So,
θ̂ = argmin ∫_Ω L(θ, θ̂) p(θ|y) dθ
That is, θ̂ satisfies the MEL Rule. We may as well use the MEL Rule in practice.

What is the MEL (Bayes') estimator under particular Loss Functions?

Quadratic Loss (L1):      θ̂ = Posterior Mean = E(θ|y)
Absolute Error Loss (L2): θ̂ = Posterior Median
Zero-One Loss (L3):       θ̂ = Posterior Mode

Consider L1 as an example:
min_θ̂ ∫ c1(θ − θ̂)^2 p(θ|y) dθ
First-order condition: −2c1 ∫ (θ − θ̂) p(θ|y) dθ = 0
θ̂ ∫ p(θ|y) dθ = ∫ θ p(θ|y) dθ      (Check the 2nd-order condition.)
θ̂ = ∫ θ p(θ|y) dθ = E(θ|y)

See the CourseSpaces handouts for the cases of the L2 and L3 Loss Functions.
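As a numerical check on these three results, the following sketch (illustrative only; it assumes a Beta(3, 5) posterior for θ) evaluates the posterior expected loss on a grid of candidate estimates and confirms that the minimizers are approximately the posterior mean, median, and mode.

```python
import numpy as np
from scipy.stats import beta

a, b = 3.0, 5.0                             # assumed Beta(a, b) posterior p(theta|y)
theta = np.linspace(0.001, 0.999, 2001)     # grid over the parameter space
w = beta.pdf(theta, a, b)
w /= w.sum()                                # posterior "weights" on the grid

cand = theta                                # candidate point estimates

# Posterior expected loss for each candidate, under L1, L2, L3 (eps = 0.01, c = 1)
el_quad = [(w * (theta - t)**2).sum() for t in cand]
el_abs  = [(w * np.abs(theta - t)).sum() for t in cand]
el_01   = [(w * (np.abs(theta - t) > 0.01)).sum() for t in cand]

print(cand[np.argmin(el_quad)], beta.mean(a, b))         # ~ posterior mean = 0.375
print(cand[np.argmin(el_abs)],  beta.median(a, b))       # ~ posterior median
print(cand[np.argmin(el_01)],   (a - 1) / (a + b - 2))   # ~ posterior mode = 1/3
```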

Example 2

We have data generated by a Bernoulli distribution, and we want to estimate the probability of a "success":
p(y) = θ^y (1 − θ)^(1−y) ;  y = 0, 1 ;  0 ≤ θ ≤ 1

(a) Uniform prior for θ on [0, 1]:
p(θ) = 1 ;  0 ≤ θ ≤ 1   (mean = 1/2; variance = 1/12)
     = 0 ;  otherwise

Random sample of n observations, so the Likelihood Function is
L(θ|y) = p(y|θ) = Π(i=1 to n) [θ^(y_i) (1 − θ)^(1−y_i)] = θ^(Σy_i) (1 − θ)^(n − Σy_i)

Bayes' Theorem:
p(θ|y) ∝ p(θ)p(y|θ) ∝ θ^(Σy_i) (1 − θ)^(n − Σy_i)      (*)

Apart from the normalizing constant, this is a Beta p.d.f.:

A continuous random variable, X, has a Beta distribution if its density function is
p(x|α, β) = [1/B(α, β)] x^(α−1) (1 − x)^(β−1) ;  0 < x < 1 ;  α, β > 0

Here, B(α, β) = ∫₀¹ t^(α−1) (1 − t)^(β−1) dt is the "Beta Function". We can write it as
B(α, β) = Γ(α)Γ(β)/Γ(α + β), where Γ(t) = ∫₀^∞ x^(t−1) e^(−x) dx is the "Gamma Function".
See the handout, "Gamma and Beta Functions".

E[X] = α/(α + β) ;  V[X] = αβ/[(α + β)^2 (α + β + 1)]
Mode[X] = (α − 1)/(α + β − 2) ;  if α, β > 1
Median[X] ≈ (α − 1/3)/(α + β − 2/3) ;  if α, β > 1

The Beta family can be extended to have support [a, b], and generalized by adding other parameters. It's a very flexible and useful distribution, with finite support.
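These moment and shape formulas are easy to check against a numerical implementation; a brief sketch using scipy, with an arbitrary illustrative choice of α and β, follows.

```python
from scipy.stats import beta

alpha_, beta_ = 4.0, 9.0   # arbitrary illustrative shape parameters

print(beta.mean(alpha_, beta_), alpha_ / (alpha_ + beta_))
print(beta.var(alpha_, beta_),
      alpha_ * beta_ / ((alpha_ + beta_)**2 * (alpha_ + beta_ + 1)))
print(beta.median(alpha_, beta_),
      (alpha_ - 1/3) / (alpha_ + beta_ - 2/3))     # approximation, valid for alpha, beta > 1
print((alpha_ - 1) / (alpha_ + beta_ - 2))         # mode: the density peaks at 3/11
```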


Going back to the posterior kernel in (*):
p(θ|y) ∝ p(θ)p(y|θ) ∝ θ^(Σy_i) (1 − θ)^(n − Σy_i) ;  0 < θ < 1      (*)

Or,
p(θ|y) = [1/B(α, β)] θ^(α−1) (1 − θ)^(β−1) ;  0 < θ < 1
where α = nȳ + 1 ;  β = n(1 − ȳ) + 1

So, under various loss functions, we have the following Bayes' estimators:
(i)  L1: θ̂1 = E[θ|y] = α/(α + β) = (nȳ + 1)/(n + 2)
(ii) L3: θ̂3 = Mode[θ|y] = (α − 1)/(α + β − 2) = ȳ

Note that the MLE is θ̃ = ȳ. (The Bayes' estimator under zero-one loss.)
The MLE (= θ̂3) is unbiased (because E[ȳ] = θ), consistent, and BAN.

We can write: θ̂1 = (ȳ + 1/n)/(1 + 2/n)
So, lim(n→∞) θ̂1 = ȳ ; and θ̂1 is also consistent and BAN.

(b) Beta prior for θ on [0, 1]:
p(θ) ∝ θ^(a−1) (1 − θ)^(b−1) ;  a, b > 0
Recall that the Uniform [0, 1] prior is a special case of this (a = b = 1). We can assign values to a and b to reflect prior beliefs about θ, e.g., via
E[θ] = a/(a + b) ;  V[θ] = ab/[(a + b)^2 (a + b + 1)]
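The next slide sets values for E[θ] and V[θ] and backs out the implied a and b. Here is a minimal sketch of that calculation (the target mean and variance are arbitrary illustrative values), obtained by inverting the two moment formulas above.

```python
def elicit_beta(m, v):
    """Given prior beliefs E[theta] = m and V[theta] = v, return the implied Beta(a, b).

    From m = a/(a+b) and v = m(1-m)/(a+b+1), we get a + b = m(1-m)/v - 1.
    """
    s = m * (1.0 - m) / v - 1.0      # s = a + b
    if s <= 0:
        raise ValueError("prior variance too large for the stated prior mean")
    return m * s, (1.0 - m) * s

a, b = elicit_beta(0.3, 0.02)        # illustrative prior beliefs
print(a, b)                          # implied prior parameters: a = 2.85, b = 6.65
print(a / (a + b))                   # recovers the stated prior mean, 0.3
```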

Set values for E[θ] and V[θ], and then solve for the implied values of a and b.

Recall that the Likelihood Function is:
L(θ|y) = p(y|θ) = θ^(nȳ) (1 − θ)^(n(1−ȳ))

So, applying Bayes' Theorem:
p(θ|y) ∝ p(θ)p(y|θ) ∝ θ^(nȳ+a−1) (1 − θ)^(n(1−ȳ)+b−1)      (the kernel of the density)

This is a Beta density, with α = nȳ + a ;  β = n(1 − ȳ) + b.
An example of "Natural Conjugacy".

So, under various loss functions, we have the following Bayes' estimators:
(i) L1: θ̂1 = E[θ|y] = α/(α + β) = (nȳ + a)/(n + a + b)

(ii) L3: θ̂3 = Mode[θ|y] = (α − 1)/(α + β − 2) = (nȳ + a − 1)/(n + a + b − 2)

Recall that the MLE is θ̃ = ȳ.

We can write: θ̂1 = (ȳ + a/n)/(1 + a/n + b/n)
So, lim(n→∞) θ̂1 = ȳ ; and θ̂1 is consistent and BAN.

We can write: θ̂3 = (ȳ + a/n − 1/n)/(1 + a/n + b/n − 2/n)
So, lim(n→∞) θ̂3 = ȳ ; and θ̂3 is consistent and BAN.

In the various examples so far, the Bayes' estimators have had the same asymptotic properties as the MLE. We'll see later that this result holds in general for Bayes' estimators.
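A quick numerical illustration of these limits (simulated Bernoulli data; the prior values a and b are arbitrary): the conjugate update from the previous slide is applied for increasing n, and both Bayes estimators approach ȳ, swamping the prior.

```python
import numpy as np

rng = np.random.default_rng(123)
theta_true, a, b = 0.6, 2.85, 6.65        # "true" theta and an illustrative Beta(a, b) prior

for n in [10, 100, 1000, 10000]:
    y = rng.binomial(1, theta_true, size=n)
    ybar = y.mean()
    theta_L1 = (n * ybar + a) / (n + a + b)           # posterior mean (quadratic loss)
    theta_L3 = (n * ybar + a - 1) / (n + a + b - 2)   # posterior mode (zero-one loss)
    print(n, round(ybar, 4), round(theta_L1, 4), round(theta_L3, 4))
```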


Example 3

y_i ~ N[μ, σ0^2] ;  σ0^2 is known.

Before we see the sample of data, we have prior beliefs about the value of μ. That is,
p(μ) = p(μ|σ0^2) ~ N[μ̄, v̄]
p(μ) = (1/√(2πv̄)) exp{ −(1/(2v̄))(μ − μ̄)^2 } ∝ exp{ −(1/(2v̄))(μ − μ̄)^2 }      (the kernel of the density)

Note: we are not saying that μ is random - just that we are uncertain about its value.

Now we take a random sample of data: y = (y_1, y_2, ..., y_n)

The joint data density (i.e., the Likelihood Function) is:
p(y|μ, σ0^2) = L(μ|y, σ0^2) = (1/(σ0√(2π)))^n exp{ −(1/(2σ0^2)) Σ(i=1 to n)(y_i − μ)^2 }

p(y|μ, σ0^2) ∝ exp{ −(1/(2σ0^2)) Σ(i=1 to n)(y_i − μ)^2 }      (the kernel of the density)

Now we'll apply Bayes' Theorem:
p(μ|y, σ0^2) ∝ p(μ|σ0^2) × p(y|μ, σ0^2)      (prior × likelihood function)

So,
p(μ|y, σ0^2) ∝ exp{ −(1/2)[ (1/v̄)(μ − μ̄)^2 + (1/σ0^2) Σ(i=1 to n)(y_i − μ)^2 ] }

We can write:
Σ(i=1 to n)(y_i − μ)^2 = Σ(i=1 to n)(y_i − ȳ)^2 + n(ȳ − μ)^2

So,
p(μ|y, σ0^2) ∝ exp{ −(1/2)[ (1/v̄)(μ − μ̄)^2 + (1/σ0^2)( Σ(y_i − ȳ)^2 + n(ȳ − μ)^2 ) ] }
            ∝ exp{ −(1/2)[ μ^2 (1/v̄ + n/σ0^2) − 2μ (μ̄/v̄ + nȳ/σ0^2) + constant ] }

p(μ|y, σ0^2) ∝ exp{ −(1/2)[ μ^2 (1/v̄ + n/σ0^2) − 2μ (μ̄/v̄ + nȳ/σ0^2) ] }

"Complete the square":
ax^2 + bx + c = a(x + b/(2a))^2 + (c − b^2/(4a))

In our case:
x = μ
a = (1/v̄ + n/σ0^2)
b = −2(μ̄/v̄ + nȳ/σ0^2)
c = 0

So, we have:
p(μ|y, σ0^2) ∝ exp{ −(1/2)(1/v̄ + n/σ0^2)[ μ − (μ̄/v̄ + nȳ/σ0^2)/(1/v̄ + n/σ0^2) ]^2 }

The Posterior distribution for μ is N[μ*, v*], where
μ* = [ (1/v̄)μ̄ + (n/σ0^2)ȳ ] / (1/v̄ + n/σ0^2)      (weighted average interpretation?)
1/v* = (1/v̄ + n/σ0^2)

μ* is the MEL (Bayes') estimator of μ under loss functions L1, L2, and L3.
Another example of "Natural Conjugacy".

Show that lim(n→∞) μ* = ȳ  ( = the MLE: consistent, BAN).
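A compact sketch of this Normal-Normal update (the prior settings and the simulated data below are made up for illustration): it computes μ* and v* from the formulas above and shows the posterior mean moving toward ȳ as n grows, even when the prior is centred in the wrong place.

```python
import numpy as np

def normal_posterior(y, sigma0_sq, mu_prior, v_prior):
    """Posterior mean/variance for mu when y_i ~ N(mu, sigma0_sq) and prior mu ~ N(mu_prior, v_prior)."""
    n, ybar = len(y), np.mean(y)
    prec_post = 1.0 / v_prior + n / sigma0_sq                       # 1/v* = 1/v-bar + n/sigma0^2
    mu_post = (mu_prior / v_prior + n * ybar / sigma0_sq) / prec_post
    return mu_post, 1.0 / prec_post

rng = np.random.default_rng(0)
mu_true, sigma0_sq = 5.0, 4.0
mu_prior, v_prior = 0.0, 1.0          # deliberately "wrong" prior, for illustration

for n in [5, 50, 500, 5000]:
    y = rng.normal(mu_true, np.sqrt(sigma0_sq), size=n)
    mu_post, v_post = normal_posterior(y, sigma0_sq, mu_prior, v_prior)
    print(n, round(np.mean(y), 3), round(mu_post, 3), round(v_post, 5))
```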


Dealing With "Nuisance Parameters" Typically, θ is a vector: θ = (θ 1, θ 2,, θ p ) e.g., θ = (β 1, β 2,, β k ; σ) Often interested primarily in only some (one?) of the parameters. The rest are "nuisance parameters". However, they can't be ignored when we draw inferences about parameters of interest. Bayesians deal with this in a very flexible way - they construct the Joint Posterior density for the full θ vector; and then they marginalize this density by integrating out the nuisance parameters. For instance: p(θ y) = p(β, σ y) p(β, σ)l(β, σ y) Then, 29

p(β|y) = ∫₀^∞ p(β, σ|y) dσ                                      (marginal posterior for β)

p(β_i|y) = ∫...∫ p(β|y) dβ_1 ... dβ_(i−1) dβ_(i+1) ... dβ_k     (marginal posterior for β_i)

p(σ|y) = ∫ p(β, σ|y) dβ = ∫...∫ p(β, σ|y) dβ_1 ... dβ_k         (marginal posterior for σ)

Computational issue - can the integrals be evaluated analytically? If not, we'll need to use numerical procedures. Standard numerical "quadrature" is infeasible if p > 3, 4. We need to use alternative methods for obtaining the Marginal Posterior densities from the Joint Posterior density.
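When the integrals can't be done analytically, one brute-force option (feasible only in very low dimensions, as noted above) is to marginalize over a grid. The sketch below is purely illustrative: it assumes a simple model y_i ~ N(β, σ²) with the prior p(β, σ) taken proportional to 1/σ (an arbitrary illustrative choice), builds the joint posterior on a (β, σ) grid, and sums over σ to obtain the marginal posterior for β.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(2.0, 1.5, size=30)            # made-up data: y_i ~ N(beta, sigma^2)
n = len(y)

# Grids for the parameter of interest (beta) and the nuisance parameter (sigma)
beta_grid = np.linspace(0.5, 3.5, 300)
sigma_grid = np.linspace(0.3, 4.0, 300)
B, S = np.meshgrid(beta_grid, sigma_grid, indexing="ij")

# Joint posterior kernel on the grid, with prior p(beta, sigma) proportional to 1/sigma
log_kernel = (-(n + 1) * np.log(S)
              - ((y[None, None, :] - B[..., None]) ** 2).sum(axis=-1) / (2 * S ** 2))
joint = np.exp(log_kernel - log_kernel.max())        # unnormalized joint posterior

# Marginalize out sigma: sum the joint posterior over the sigma grid
d_beta = beta_grid[1] - beta_grid[0]
marg_beta = joint.sum(axis=1)
marg_beta /= marg_beta.sum() * d_beta                # normalize to a proper density

print(beta_grid[np.argmax(marg_beta)])               # marginal posterior mode for beta
print((beta_grid * marg_beta * d_beta).sum())        # marginal posterior mean for beta
```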

Once we have a Marginal Posterior density, such as p(β_i|y), we can obtain a Bayes point estimator for β_i. For example:

β̂_i = E[β_i|y] = ∫ β_i p(β_i|y) dβ_i                  (posterior mean)
β̂_i = argmax {p(β_i|y)}                               (posterior mode)
β̂_i = M, such that ∫_(−∞)^M p(β_i|y) dβ_i = 0.5       (posterior median)

The "posterior uncertainty" about β_i can be measured, for instance, by computing
var(β_i|y) = ∫ (β_i − E[β_i|y])^2 p(β_i|y) dβ_i
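In practice these summaries are often computed from draws from the marginal posterior rather than from a closed-form density. A minimal sketch (using draws from an assumed Beta distribution as a stand-in for the marginal posterior of β_i):

```python
import numpy as np
from scipy.stats import beta, gaussian_kde

draws = beta.rvs(3, 7, size=20_000, random_state=42)   # stand-in posterior draws for beta_i

post_mean = draws.mean()                  # Bayes estimator under quadratic loss
post_median = np.median(draws)            # Bayes estimator under absolute-error loss

kde = gaussian_kde(draws)                 # smoothed density estimate, used to locate the mode
grid = np.linspace(0.0, 1.0, 1001)
post_mode = grid[np.argmax(kde(grid))]    # approximate Bayes estimator under zero-one loss

post_var = draws.var()                    # a measure of posterior uncertainty

print(post_mean, post_median, post_mode, post_var)
```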

Bayesian Updating

Suppose that we apply our Bayesian analysis using a first sample of data, y1:
p(θ|y1) ∝ p(θ)p(y1|θ)

Then we obtain a new, independent sample of data, y2. The previous Posterior becomes our new Prior for θ, and we apply Bayes' Theorem again:
p(θ|y1, y2) ∝ p(θ|y1)p(y2|θ, y1) ∝ p(θ)p(y1|θ)p(y2|θ, y1) = p(θ)p(y1, y2|θ)

The Posterior is the same as if we got both samples of data at once. The Bayes updating procedure is "additive" with respect to the available information set.
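A quick numeric check of this "additivity", using the conjugate Beta-Bernoulli setup from Example 2 (the prior values and both samples are made up): updating on y1 and then on y2 gives exactly the same posterior parameters as updating once on the pooled sample.

```python
import numpy as np

a, b = 2.0, 2.0                            # illustrative Beta(a, b) prior
y1 = np.array([1, 0, 1, 1, 0])             # first sample
y2 = np.array([0, 1, 1, 1, 0, 1])          # second, independent sample

# Sequential updating: the posterior after y1 becomes the prior for y2
a1, b1 = a + y1.sum(), b + len(y1) - y1.sum()
a12, b12 = a1 + y2.sum(), b1 + len(y2) - y2.sum()

# Batch updating: both samples at once
y = np.concatenate([y1, y2])
a_batch, b_batch = a + y.sum(), b + len(y) - y.sum()

print((a12, b12) == (a_batch, b_batch))    # True: same posterior either way
```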

Advantages & Disadvantages of the Bayesian Approach

1. Unity of approach regardless of the inference problem: Prior, Likelihood function, Bayes' Theorem, Loss function, MEL rule.
2. Prior information (if any) is incorporated explicitly and flexibly.
3. Explicit use of a Loss function and Decision Theory.
4. Nuisance parameters can be handled easily - not ignored.
5. Decision rules (estimators, tests) are Admissible.
6. Small samples are alright (even n = 1).
7. Good asymptotic properties - the same as the MLE.

However:
1. Difficulty / cost of specifying the Prior for the parameters.
2. Results may be sensitive to the choice of Prior - robustness checks are needed.
3. Possible computational issues.