Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet.

Stat 535 C - Statistical Computing & Monte Carlo Methods. Arnaud Doucet. Email: arnaud@cs.ubc.ca

Suggested projects: www.cs.ubc.ca/~arnaud/projects.html. First assignment on the web: capture/recapture. Additional articles have been posted.

2.1 Outline. Prior distributions: conjugate, maximum entropy, Jeffreys. Bayesian variable selection.

3.1 How to Select the Prior Distribution? Once the prior distribution is specified, Bayesian inference can be performed almost mechanically. Computational issues aside, the most critical and most criticized point is the choice of the prior. The available information is seldom precise enough to lead to an exact determination of the prior distribution.

3.1 How to Select the Prior Distribution? The prior includes subjectivity. Subjectivity does not mean being nonscientific: a vast amount of scientific information, coming from theoretical and physical models, guides the specification of priors. Over the last decades, a lot of research has focused on noninformative and robust priors.

3.2 Conjugate Priors. Conjugate priors are the most commonly used priors. A family of probability distributions $\mathcal{F}$ on $\Theta$ is said to be conjugate for a likelihood function $f(x \mid \theta)$ if, for every $\pi \in \mathcal{F}$, the posterior distribution $\pi(\theta \mid x)$ also belongs to $\mathcal{F}$. In simpler terms, the posterior admits the same functional form as the prior and only its parameters are changed.

3.3 Example: Gaussian with unknown mean. Assume you have observations $X_i \mid \mu \sim \mathcal{N}(\mu, \sigma^2)$ and $\mu \sim \mathcal{N}(m_0, \sigma_0^2)$. Then
$$\mu \mid x_1,\dots,x_n \sim \mathcal{N}(m_n, \sigma_n^2)$$
where
$$\frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}, \qquad \sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2}, \qquad m_n = \sigma_n^2\left(\frac{\sum_{i=1}^n x_i}{\sigma^2} + \frac{m_0}{\sigma_0^2}\right).$$
One can think of the prior as $n_0$ virtual observations, with $n_0 = \sigma^2/\sigma_0^2$ and
$$m_n = \frac{\sum_{i=1}^n x_i + n_0 m_0}{n + n_0}.$$
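To make the update concrete, here is a minimal sketch in Python (not part of the course material; the function name, data, and hyperparameter values are ours) that computes $(m_n, \sigma_n^2)$ and checks the equivalent virtual-observation form.

```python
import numpy as np

def posterior_normal_known_variance(x, sigma2, m0, sigma02):
    """Posterior N(m_n, sigma_n^2) for the mean of Gaussian data with known variance sigma2,
    under the conjugate prior mu ~ N(m0, sigma02)."""
    n = len(x)
    sigma_n2 = 1.0 / (1.0 / sigma02 + n / sigma2)         # 1/sigma_n^2 = 1/sigma_0^2 + n/sigma^2
    m_n = sigma_n2 * (np.sum(x) / sigma2 + m0 / sigma02)  # m_n = sigma_n^2 (sum(x)/sigma^2 + m_0/sigma_0^2)
    return m_n, sigma_n2

# Hypothetical example: 20 observations with known variance 4 and prior N(0, 10)
rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=2.0, size=20)
m_n, sigma_n2 = posterior_normal_known_variance(x, sigma2=4.0, m0=0.0, sigma02=10.0)

# Equivalent virtual-observation form with n_0 = sigma^2 / sigma_0^2
n0 = 4.0 / 10.0
assert np.isclose(m_n, (np.sum(x) + n0 * 0.0) / (len(x) + n0))
print(m_n, sigma_n2)
```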

3.4 Example: Gaussian with unknown mean and variance. Assume you have observations $X_i \mid (\mu, \sigma^2) \sim \mathcal{N}(\mu, \sigma^2)$ and the prior
$$\pi(\mu, \sigma^2) = \pi(\sigma^2)\,\pi(\mu \mid \sigma^2) = \mathcal{IG}\!\left(\sigma^2; \tfrac{\nu_0}{2}, \tfrac{\gamma_0}{2}\right)\,\mathcal{N}(\mu; m_0, \delta^2\sigma^2).$$
We have
$$\mu, \sigma^2 \mid x_1,\dots,x_n \sim \mathcal{IG}\!\left(\sigma^2; \frac{\nu_0 + n}{2}, \frac{\gamma_0 + \sum_{i=1}^n x_i^2 + m_0^2/\delta^2 - m_n^2/\sigma_n^2}{2}\right)\,\mathcal{N}\!\left(\mu; m_n, \sigma_n^2\sigma^2\right)$$
where
$$m_n = \frac{1}{\delta^{-2} + n}\left(\frac{m_0}{\delta^2} + \sum_{i=1}^n x_i\right), \qquad \sigma_n^2 = \frac{1}{\delta^{-2} + n}.$$
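A corresponding sketch for the normal-inverse-gamma update, using the parametrization above ($\mathcal{IG}(a, b)$ with density proportional to $(\sigma^2)^{-a-1}e^{-b/\sigma^2}$); the function name, hyperparameter values, and data below are ours, not from the slides.

```python
import numpy as np

def posterior_normal_inverse_gamma(x, m0, delta2, nu0, gamma0):
    """Conjugate update for N(mu, sigma^2) data with prior
    sigma^2 ~ IG(nu0/2, gamma0/2) and mu | sigma^2 ~ N(m0, delta2 * sigma^2).
    Returns (m_n, sigma_n2, nu_n, gamma_n) so that
    sigma^2 | x ~ IG(nu_n/2, gamma_n/2) and mu | sigma^2, x ~ N(m_n, sigma_n2 * sigma^2)."""
    x = np.asarray(x)
    n = len(x)
    sigma_n2 = 1.0 / (1.0 / delta2 + n)
    m_n = sigma_n2 * (m0 / delta2 + x.sum())
    nu_n = nu0 + n
    gamma_n = gamma0 + np.sum(x**2) + m0**2 / delta2 - m_n**2 / sigma_n2
    return m_n, sigma_n2, nu_n, gamma_n

# Hypothetical hyperparameters and data
rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.5, size=50)
print(posterior_normal_inverse_gamma(x, m0=0.0, delta2=100.0, nu0=1.0, gamma0=1.0))
```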

3.4 Example: Poisson observations with a Gamma prior. Assume you have counting observations $X_i \overset{\text{i.i.d.}}{\sim} \mathcal{P}(\theta)$; i.e.
$$f(x_i \mid \theta) = e^{-\theta}\frac{\theta^{x_i}}{x_i!}.$$
Assume we adopt a Gamma prior for $\theta$; i.e. $\theta \sim \mathcal{G}a(\alpha, \beta)$ with
$$\pi(\theta) = \mathcal{G}a(\theta; \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\theta^{\alpha-1}e^{-\beta\theta}.$$
We have
$$\pi(\theta \mid x_1,\dots,x_n) = \mathcal{G}a\!\left(\theta; \alpha + \sum_{i=1}^n x_i, \beta + n\right).$$
You can think of the prior as having $\beta$ virtual observations that sum to $\alpha$.
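The Poisson-Gamma update is one line; the sketch below (our own function name and simulated data) returns the posterior shape and rate.

```python
import numpy as np

def posterior_poisson_gamma(x, alpha, beta):
    """Conjugate update for i.i.d. Poisson(theta) counts with theta ~ Gamma(alpha, beta)
    (shape alpha, rate beta). Returns the posterior shape and rate."""
    x = np.asarray(x)
    return alpha + x.sum(), beta + len(x)

# Hypothetical example: 30 counts, prior Gamma(2, 1)
rng = np.random.default_rng(2)
x = rng.poisson(lam=3.0, size=30)
alpha_n, beta_n = posterior_poisson_gamma(x, alpha=2.0, beta=1.0)
print(alpha_n, beta_n, "posterior mean:", alpha_n / beta_n)
```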

3.5 Limitations. Many likelihoods do not admit conjugate distributions, BUT a conjugate family is available when the likelihood is in the exponential family
$$f(x \mid \theta) = h(x)\exp\!\left(\theta^{T}x - \Psi(\theta)\right),$$
and in this case the conjugate distribution is (for hyperparameters $\mu, \lambda$)
$$\pi(\theta) = K(\mu, \lambda)\exp\!\left(\theta^{T}\mu - \lambda\Psi(\theta)\right).$$
It follows that
$$\pi(\theta \mid x) = K(\mu + x, \lambda + 1)\exp\!\left(\theta^{T}(\mu + x) - (\lambda + 1)\Psi(\theta)\right).$$
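As an illustration (our own, not from the slides): the Bernoulli model written in natural-parameter form, with $\theta = \operatorname{logit}(p)$ and $\Psi(\theta) = \log(1 + e^{\theta})$, has conjugate prior $\propto \exp(\theta\mu - \lambda\Psi(\theta))$, which after a change of variables is a Beta$(\mu, \lambda - \mu)$ distribution on $p$; the generic update $(\mu, \lambda) \to (\mu + \sum x_i, \lambda + n)$ then reproduces the familiar Beta update.

```python
import numpy as np

# Bernoulli in natural-parameter form: f(x|theta) = exp(theta*x - Psi(theta)), Psi(theta) = log(1 + e^theta),
# with theta = logit(p). The conjugate prior exp(theta*mu - lambda*Psi(theta)) corresponds to
# Beta(a, b) on p with a = mu and b = lambda - mu (our notation, not the slides').

def natural_update(mu, lam, x):
    """Exponential-family conjugate update: (mu, lambda) -> (mu + sum(x), lambda + n)."""
    x = np.asarray(x)
    return mu + x.sum(), lam + len(x)

rng = np.random.default_rng(3)
x = rng.binomial(1, 0.7, size=40)

mu0, lam0 = 1.0, 2.0                  # corresponds to Beta(1, 1), i.e. a uniform prior on p
mu_n, lam_n = natural_update(mu0, lam0, x)

# Same update expressed directly on the Beta parameters
a_n, b_n = 1.0 + x.sum(), 1.0 + (len(x) - x.sum())
assert np.isclose(mu_n, a_n) and np.isclose(lam_n - mu_n, b_n)
print("posterior Beta parameters:", a_n, b_n)
```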

3.5 Limitations. The conjugate prior can have a strange shape or be difficult to handle. Consider
$$\Pr(y = 1 \mid \theta, x) = \frac{\exp(\theta^{T}x)}{1 + \exp(\theta^{T}x)};$$
then the likelihood for $n$ observations is exponential conditional upon the $x_i$'s, as
$$f(y_1,\dots,y_n \mid x_1,\dots,x_n, \theta) = \exp\!\left(\theta^{T}\sum_{i=1}^{n}y_i x_i\right)\prod_{i=1}^{n}\left(1 + \exp(\theta^{T}x_i)\right)^{-1}$$
and
$$\pi(\theta) \propto \exp\!\left(\theta^{T}\mu\right)\prod_{i=1}^{n}\left(1 + \exp(\theta^{T}x_i)\right)^{-\lambda}.$$

3.6 Mixture of Conjugate Distributions. If you have a prior distribution $\pi(\theta)$ which is a mixture of conjugate distributions, then the posterior is in closed form and is also a mixture of conjugate distributions; i.e. with
$$\pi(\theta) = \sum_{i=1}^{K} w_i \pi_i(\theta)$$
then
$$\pi(\theta \mid x) = \frac{\sum_{i=1}^{K} w_i \pi_i(\theta) f(x \mid \theta)}{\sum_{i=1}^{K} w_i \int \pi_i(\theta) f(x \mid \theta)\, d\theta} = \sum_{i=1}^{K} \overline{w}_i\, \pi_i(\theta \mid x)$$
where
$$\overline{w}_i \propto w_i \int \pi_i(\theta) f(x \mid \theta)\, d\theta, \qquad \sum_{i=1}^{K} \overline{w}_i = 1.$$
Theorem (Brown, 1986): it is possible to approximate arbitrarily closely any prior distribution by a mixture of conjugate distributions.
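A small sketch of this mixture update for a Binomial likelihood with a mixture-of-Beta prior, where each component's marginal likelihood is available in closed form; the function name, hyperparameters, and data are ours.

```python
import numpy as np
from scipy.special import betaln, comb

def beta_mixture_posterior(weights, priors, s, n):
    """Posterior of a mixture-of-Beta prior for a Binomial(n, theta) observation with s successes.
    `priors` is a list of (a, b) Beta parameters. Returns updated weights and updated (a, b) pairs."""
    weights = np.asarray(weights, dtype=float)
    # Marginal likelihood of the data under component i:
    # m_i = C(n, s) * B(a_i + s, b_i + n - s) / B(a_i, b_i)
    logm = np.array([np.log(comb(n, s)) + betaln(a + s, b + n - s) - betaln(a, b)
                     for (a, b) in priors])
    new_w = weights * np.exp(logm - logm.max())
    new_w /= new_w.sum()
    new_priors = [(a + s, b + n - s) for (a, b) in priors]
    return new_w, new_priors

# Hypothetical prior: 50/50 mixture of Beta(10, 2) ("theta is large") and Beta(2, 10) ("theta is small")
w, comps = beta_mixture_posterior([0.5, 0.5], [(10, 2), (2, 10)], s=3, n=20)
print(w, comps)
```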

3.7 Pros and Cons of Conjugate Priors. Pros: very simple to handle, easy to interpret (through imaginary observations). Some statisticians argue that they are the least informative ones. Cons: not applicable to all likelihood functions. Not flexible at all; what if you have a constraint like $\mu > 0$? Approximation by mixtures is feasible but very tedious and almost never used in practice.

3.8 Invariant Priors. If the likelihood is of the form $X \mid \theta \sim f(x - \theta)$, then $f(\cdot)$ is translation invariant and $\theta$ is a location parameter. An invariance requirement is that the prior distribution should be translation invariant; i.e. for every $\theta_0$, $\pi(\theta) = \pi(\theta - \theta_0)$, which implies $\pi(\theta) = c$. This flat prior is improper, but the resulting posterior is proper as long as $\int f(x \mid \theta)\, d\theta < \infty$.

3.8 Invariant Priors. If the likelihood is of the form $X \mid \theta \sim \frac{1}{\theta}f\!\left(\frac{x}{\theta}\right)$, then $f(\cdot)$ is scale invariant and $\theta$ is a scale parameter. An invariance requirement is that the prior distribution should be scale invariant; i.e. for any $c > 0$,
$$\pi(\theta) = \frac{1}{c}\pi\!\left(\frac{\theta}{c}\right).$$
This implies that the resulting prior is improper:
$$\pi(\theta) \propto \frac{1}{\theta}.$$

3.9 The Jeffreys Prior. Consider the Fisher information matrix
$$I(\theta) = \mathbb{E}_{X\mid\theta}\!\left[\frac{\partial \log f(X\mid\theta)}{\partial \theta}\,\frac{\partial \log f(X\mid\theta)}{\partial \theta}^{T}\right] = -\mathbb{E}_{X\mid\theta}\!\left[\frac{\partial^2 \log f(X\mid\theta)}{\partial \theta^2}\right].$$
The Jeffreys prior is defined as
$$\pi(\theta) \propto |I(\theta)|^{1/2}.$$
This prior follows from an invariance principle. Let $\phi = h(\theta)$, with $h$ an invertible function with inverse $\theta = g(\phi)$; then
$$\pi(\phi) = \pi(g(\phi))\left|\frac{dg(\phi)}{d\phi}\right| = \pi(\theta)\left|\frac{d\theta}{d\phi}\right| \propto |I(\phi)|^{1/2}$$
as
$$I(\phi) = -\mathbb{E}_{X\mid\phi}\!\left[\frac{\partial^2 \log f(X\mid\phi)}{\partial \phi^2}\right] = I(\theta)\left(\frac{d\theta}{d\phi}\right)^{2}.$$

3.9 The Jeffreys Prior. Consider $X \mid \theta \sim \mathcal{B}(n, \theta)$; i.e.
$$f(x \mid \theta) = \binom{n}{x}\theta^{x}(1-\theta)^{n-x}, \qquad \frac{\partial^2 \log f(x \mid \theta)}{\partial \theta^2} = -\frac{x}{\theta^2} - \frac{n - x}{(1-\theta)^2}, \qquad I(\theta) = \frac{n}{\theta(1-\theta)}.$$
The Jeffreys prior is
$$\pi(\theta) \propto [\theta(1-\theta)]^{-1/2} = \mathcal{B}e\!\left(\theta; \tfrac{1}{2}, \tfrac{1}{2}\right).$$
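A quick symbolic check of this computation (our own sketch, using sympy) recovers $I(\theta) = n/(\theta(1-\theta))$ and hence the Beta(1/2, 1/2) Jeffreys prior.

```python
import sympy as sp

theta, n, x = sp.symbols('theta n x', positive=True)

# Log-likelihood up to the binomial coefficient (which does not depend on theta)
log_f = x * sp.log(theta) + (n - x) * sp.log(1 - theta)
second_deriv = sp.diff(log_f, theta, 2)

# Replace x by its expectation E[X | theta] = n * theta and take the negative
fisher = sp.simplify(-second_deriv.subs(x, n * theta))
print(fisher)                                              # equivalent to n/(theta*(1 - theta))
print(sp.simplify(fisher - n / (theta * (1 - theta))))     # 0
```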

3.9 The Jeffreys Prior. Consider $X_i \mid \theta \sim \mathcal{N}(\theta, \sigma^2)$; i.e. $f(x_{1:n} \mid \theta) \propto \exp\!\left(-\sum_{i=1}^{n}(x_i - \theta)^2/(2\sigma^2)\right)$. Since
$$\frac{\partial^2 \log f(x_{1:n} \mid \theta)}{\partial \theta^2} = -\frac{n}{\sigma^2}, \qquad \pi(\theta) \propto 1.$$
Consider $X_i \mid \theta \sim \mathcal{N}(\mu, \theta)$; i.e.
$$f(x_{1:n} \mid \theta) \propto \theta^{-n/2}\exp\!\left(-s/(2\theta)\right), \qquad s = \sum_{i=1}^{n}(x_i - \mu)^2.$$
Then
$$\frac{\partial^2 \log f(x_{1:n} \mid \theta)}{\partial \theta^2} = \frac{n}{2\theta^2} - \frac{s}{\theta^3} \quad\Rightarrow\quad \pi(\theta) \propto \frac{1}{\theta}.$$

3.10 Pros and)cons of Jeffreys Prior It can lead to incoherences; i.e. the Jeffreys prior for Gaussian data and θ =(μ, σ) unknown is π (θ) σ 2. However if these parameters are assumed a priori independent then π (θ) σ 1. Automated procedure but cannot incorporate any physical information. It does NOT satisfy the likelihood principle. Prior Distributions Page 19

3.11 The MaxEnt Priors. If some characteristics of the prior distribution (moments, etc.) are known and can be written as $K$ prior expectations $\mathbb{E}_\pi[g_k(\theta)] = w_k$, a way to select a prior $\pi$ satisfying these constraints is the maximum entropy method. In a finite setting, the entropy is defined by
$$\operatorname{Ent}(\pi) = -\sum_{i}\pi(\theta_i)\log(\pi(\theta_i)).$$
The distribution maximizing the entropy under the constraints is of the form
$$\pi(\theta_i) = \frac{\exp\!\left(\sum_{k=1}^{K}\lambda_k g_k(\theta_i)\right)}{\sum_{j}\exp\!\left(\sum_{k=1}^{K}\lambda_k g_k(\theta_j)\right)}$$
where the $\{\lambda_k\}$ are Lagrange multipliers. However, the constraints might be incompatible; e.g. one needs $\mathbb{E}(\theta^2) \ge \mathbb{E}^2(\theta)$.
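For a single moment constraint on a finite support, the Lagrange multiplier can be found by one-dimensional root finding; the sketch below (our own support, constraint, and function names) fits the maxent distribution on $\{0,\dots,10\}$ subject to $\mathbb{E}_\pi[\theta] = 3$.

```python
import numpy as np
from scipy.optimize import brentq

# Maximum entropy distribution on {0, ..., 10} subject to E[g(theta)] = w with g(theta) = theta, w = 3.
theta = np.arange(11)
g = theta.astype(float)
w = 3.0

def maxent_pmf(lam):
    """pi(theta_i) proportional to exp(lambda * g(theta_i))."""
    p = np.exp(lam * g - np.max(lam * g))   # subtract the max for numerical stability
    return p / p.sum()

def moment_gap(lam):
    return maxent_pmf(lam) @ g - w

# Solve for the Lagrange multiplier that satisfies the constraint
lam_star = brentq(moment_gap, -10.0, 10.0)
pi = maxent_pmf(lam_star)
print(lam_star, pi @ g)   # the fitted mean equals 3 up to numerical error
```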

3.12 Example. Assume $\Theta = \{0, 1, 2, \dots\}$ and suppose that $\mathbb{E}_\pi[\theta] = 5$. Then
$$\pi(\theta) = \frac{e^{\lambda_1\theta}}{\sum_{\theta=0}^{\infty}e^{\lambda_1\theta}} = \left(1 - e^{\lambda_1}\right)e^{\lambda_1\theta}.$$
Imposing the mean constraint gives $e^{\lambda_1} = 5/6$, i.e. $1 - e^{\lambda_1} = 1/6$, thus
$$\pi(\theta) = \mathcal{G}eo(1/6).$$
What about the continuous case?
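A quick numerical check of this example (our own sketch): with $e^{\lambda_1} = 5/6$, the maxent form equals the Geo(1/6) pmf and the mean constraint is satisfied.

```python
import numpy as np

lam1 = np.log(5.0 / 6.0)
theta = np.arange(2000)                                    # truncate the support; the tail is negligible
pi = (1 - np.exp(lam1)) * np.exp(lam1 * theta)             # maxent form (1 - e^lambda) e^(lambda*theta)

p = 1.0 / 6.0
geo = p * (1 - p) ** theta                                 # Geo(1/6) pmf on {0, 1, 2, ...}

print(np.allclose(pi, geo), pi.sum(), (pi * theta).sum())  # True, ~1.0, ~5.0
```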

3.13 The MaxEnt Prior for Continuous Random Variables. Jaynes argues that the entropy should be defined through the Kullback-Leibler divergence between $\pi$ and some invariant noninformative prior for the problem, $\pi_0$; i.e.
$$\operatorname{Ent}(\pi) = -\int \pi(\theta)\log\!\left(\frac{\pi(\theta)}{\pi_0(\theta)}\right)d\theta.$$
The maxent prior is of the form
$$\pi(\theta) = \frac{\exp\!\left(\sum_{k=1}^{K}\lambda_k g_k(\theta)\right)\pi_0(\theta)}{\int \exp\!\left(\sum_{k=1}^{K}\lambda_k g_k(\theta)\right)\pi_0(\theta)\,d\theta}.$$
Selecting $\pi_0(\theta)$ is not easy!

3.14 Example. Consider a real parameter $\theta$ and set $\mathbb{E}_\pi[\theta] = \mu$. We can select $\pi_0(d\theta) = d\theta$; i.e. the Lebesgue measure. In this case $\pi(\theta) \propto e^{\lambda\theta}$, which is a (bad) improper distribution. If additionally $\operatorname{Var}_\pi[\theta] = \sigma^2$, then you can establish that $\pi(\theta) = \mathcal{N}(\theta; \mu, \sigma^2)$.
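To see why the two-constraint case gives a Gaussian (a short check, not from the slides): with constraints on $\theta$ and $\theta^2$, the maxent density has the form
$$\pi(\theta) \propto \exp\!\left(\lambda_1\theta + \lambda_2\theta^2\right), \qquad \lambda_2 < 0,$$
and completing the square,
$$\lambda_1\theta + \lambda_2\theta^2 = \lambda_2\left(\theta + \frac{\lambda_1}{2\lambda_2}\right)^2 - \frac{\lambda_1^2}{4\lambda_2},$$
so $\pi$ is a Gaussian with mean $-\lambda_1/(2\lambda_2)$ and variance $-1/(2\lambda_2)$; matching the constraints forces these to equal $\mu$ and $\sigma^2$.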

3.15 Summary. In most applications, there is no true prior. Although conjugate priors are limited, they remain the most widely used class of priors because of their convenience and simple interpretability. There is a whole literature on the subject: reference & objective priors. Empirical Bayes: the prior is constructed from the data. In all cases, you should do a sensitivity analysis!

3.16 Bayesian Variable Selection Example. Consider the standard linear regression problem
$$Y = \sum_{i=1}^{p}\beta_i X_i + \sigma V, \qquad V \sim \mathcal{N}(0, 1).$$
Often you might have too many predictors, so this model will be inefficient. A standard Bayesian treatment of this problem consists of selecting only a subset of the explanatory variables. This is nothing but a model selection problem with $2^p$ possible models.

3.17 Bayesian Variable Selection Example. A standard way to write the model is
$$Y = \sum_{i=1}^{p}\gamma_i\beta_i X_i + \sigma V, \qquad V \sim \mathcal{N}(0, 1),$$
where $\gamma_i = 1$ if $X_i$ is included and $\gamma_i = 0$ otherwise. However, this suggests that $\beta_i$ is defined even when $\gamma_i = 0$. A neater way to write such models is
$$Y = \sum_{\{i:\gamma_i = 1\}}\beta_i X_i + \sigma V = \beta_\gamma^{T}X_\gamma + \sigma V$$
where, for a vector $\gamma = (\gamma_1,\dots,\gamma_p)$, $\beta_\gamma = \{\beta_i : \gamma_i = 1\}$, $X_\gamma = \{X_i : \gamma_i = 1\}$ and $n_\gamma = \sum_{i=1}^{p}\gamma_i$. Prior distributions:
$$\pi_\gamma(\beta_\gamma, \sigma^2) = \mathcal{N}\!\left(\beta_\gamma; 0, \delta^2\sigma^2 I_{n_\gamma}\right)\,\mathcal{IG}\!\left(\sigma^2; \tfrac{\nu_0}{2}, \tfrac{\gamma_0}{2}\right), \qquad \pi(\gamma) = \prod_{i=1}^{p}\pi(\gamma_i) = 2^{-p}.$$

3.18 Bayesian Variable Selection Example. For a fixed model $\gamma$ and $n$ observations $D = \{x_i, y_i\}_{i=1}^{n}$, the marginal likelihood and the posterior are available analytically; we can determine
$$\pi(\gamma \mid D) \propto \Gamma\!\left(\frac{\nu_0 + n}{2}\right)\delta^{-n_\gamma}\,|\Sigma_\gamma|^{1/2}\left(\gamma_0 + \sum_{i=1}^{n}y_i^2 - \mu_\gamma^{T}\Sigma_\gamma^{-1}\mu_\gamma\right)^{-\frac{\nu_0 + n}{2}}$$
and
$$\pi_\gamma(\beta_\gamma, \sigma^2 \mid D) = \mathcal{N}\!\left(\beta_\gamma; \mu_\gamma, \sigma^2\Sigma_\gamma\right)\,\mathcal{IG}\!\left(\sigma^2; \frac{\nu_0 + n}{2}, \frac{\gamma_0 + \sum_{i=1}^{n}y_i^2 - \mu_\gamma^{T}\Sigma_\gamma^{-1}\mu_\gamma}{2}\right)$$
where
$$\mu_\gamma = \Sigma_\gamma\sum_{i=1}^{n}y_i x_{\gamma,i}, \qquad \Sigma_\gamma^{-1} = \delta^{-2}I_{n_\gamma} + \sum_{i=1}^{n}x_{\gamma,i}x_{\gamma,i}^{T}.$$
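For small $p$, all $2^p$ models can be enumerated and scored with this marginal likelihood; the sketch below (our own function names, hyperparameter defaults, and simulated data) ranks models by posterior probability under the uniform prior $\pi(\gamma) = 2^{-p}$.

```python
import itertools
import numpy as np
from scipy.special import gammaln

def log_model_posterior(X, y, gamma, delta2=10.0, nu0=1.0, gamma0=1.0):
    """Unnormalized log posterior of a model gamma (tuple of 0/1 indicators) under the
    normal-inverse-gamma prior above; hyperparameter defaults are ours."""
    idx = [i for i, g in enumerate(gamma) if g]
    n = len(y)
    if not idx:  # empty model: no regression coefficients
        return gammaln((nu0 + n) / 2) - (nu0 + n) / 2 * np.log(gamma0 + y @ y)
    Xg = X[:, idx]
    Sigma_inv = np.eye(len(idx)) / delta2 + Xg.T @ Xg
    mu = np.linalg.solve(Sigma_inv, Xg.T @ y)
    s = gamma0 + y @ y - mu @ Sigma_inv @ mu
    _, logdet_inv = np.linalg.slogdet(Sigma_inv)
    return (gammaln((nu0 + n) / 2) - len(idx) * np.log(np.sqrt(delta2))
            - 0.5 * logdet_inv - (nu0 + n) / 2 * np.log(s))

# Hypothetical data: p = 4 predictors, only the first two are relevant
rng = np.random.default_rng(4)
n, p = 50, 4
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * rng.normal(size=n)

models = list(itertools.product([0, 1], repeat=p))            # all 2^p models
logp = np.array([log_model_posterior(X, y, g) for g in models])
post = np.exp(logp - logp.max())
post /= post.sum()
for g, pr in sorted(zip(models, post), key=lambda t: -t[1])[:3]:
    print(g, round(pr, 3))
```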

3.18 Bayesian Variable Selection Example. Popular alternative Bayesian models include: $\gamma_i \sim \mathcal{B}(\lambda)$ where $\lambda \sim \mathcal{U}[0, 1]$; $\gamma_i \sim \mathcal{B}(\lambda_i)$ where $\lambda_i \sim \mathcal{B}e(\alpha, \beta)$; the g-prior (Zellner)
$$\beta_\gamma \mid \sigma^2 \sim \mathcal{N}\!\left(\beta_\gamma; 0, \delta^2\sigma^2\left(X_\gamma^{T}X_\gamma\right)^{-1}\right);$$
and robust models where additionally one has $\delta^2 \sim \mathcal{IG}\!\left(\tfrac{a_0}{2}, \tfrac{b_0}{2}\right)$. Such variations are very important and can dramatically modify the performance of the Bayesian model.

3.19 Bayesian Variable Selection Example. Caterpillar dataset: a 1973 study to assess the influence of some forest settlement characteristics on the development of caterpillar colonies. The response variable is the log of the average number of caterpillar nests per tree on an area of 500 square meters. We have n = 33 data points and 10 explanatory variables.

3.20 Bayesian Variable Selection Example.
$x_1$ is the altitude (in meters),
$x_2$ is the slope (in degrees),
$x_3$ is the number of pines in the square,
$x_4$ is the height (in meters) of the tree sampled at the center of the square,
$x_5$ is the diameter of the tree sampled at the center of the square,
$x_6$ is the index of the settlement density,
$x_7$ is the orientation of the square (from 1 if southbound to 2 otherwise),
$x_8$ is the height (in meters) of the dominant tree,
$x_9$ is the number of vegetation strata,
$x_{10}$ is the mix settlement index (from 1 if not mixed to 2 if mixed).

3.20 Bayesian Variable Selection Example. [Figure: panels for the explanatory variables $x_1$ through $x_9$.]

3.20 Bayesian Variable Selection Example. Top five most likely models (variables included / posterior probability $\pi(\gamma \mid x)$):

Ridge, $\delta^2 = 10$        g-prior, $\delta^2 = 10$      g-prior, $\delta^2$ estimated
0,1,2,4,5 / 0.1946            0,1,2,4,5 / 0.2316            0,1,2,4,5 / 0.0929
0,1,2,4,5,9 / 0.0321          0,1,2,4,5,9 / 0.0374          0,1,2,4,5,9 / 0.0325
0,1,2,4,5,10 / 0.0327         0,1,9 / 0.0344                0,1,2,4,5,10 / 0.0295
0,1,2,4,5,7 / 0.0306          0,1,2,4,5,10 / 0.0328         0,1,2,4,5,7 / 0.0231
0,1,2,4,5,8 / 0.0251          0,1,4,5 / 0.0306              0,1,2,4,5,8 / 0.0228