Stat260: Bayesian Modeling and Inference
Lecture Date: February 22, 2010
Reference Priors


Lecturer: Michael I. Jordan    Scribes: Steven Troxler and Wayne Lee

In this lecture, we assume that $\theta \in \mathbb{R}$; in higher dimensions, reference priors are defined in a sequential manner based on the parameter of primary interest, and we will address this construction in later lectures.

1 Motivating Reference Priors

Consider an inference scenario in which we have data $X$ coming from a distribution $p(x \mid \theta)$ depending on a parameter $\theta$, and suppose that $T(X)$ is a sufficient statistic for $\theta$. This implies that $p(x \mid \theta)$ is in one-to-one correspondence with $p(t \mid \theta) := p(T(X) \mid \theta)$. Our goal is to develop a non-informative prior for $\theta$.

One possible way to choose a non-informative prior is via information: we select the prior $\pi(\theta)$ to maximize the mutual information between $T$ and $\theta$, taking

    $\pi^*(\theta) = \arg\max_{p(\theta)} I_{p(\theta)}(\theta, T),$    (1)

where

    $I_{p(\theta)}(\theta, T) = \int p(t) \underbrace{\int p(\theta \mid t) \log \frac{p(\theta \mid t)}{p(\theta)} \, d\theta}_{\mathrm{KL}(p(\theta \mid t),\, p(\theta))} \, dt.$

The inner term in the double integral, $\mathrm{KL}(p(\theta \mid t), p(\theta)) := \int p(\theta \mid t) \log \frac{p(\theta \mid t)}{p(\theta)} \, d\theta$, is the Kullback-Leibler divergence between the posterior and the prior when we observe a particular value $T = t$. The mutual information, then, is an average of Kullback-Leibler divergences with respect to the marginal distribution $p(t)$ of $T$.

This idea is clever, but does not quite work as posed. Unfortunately, the problem of maximizing the mutual information between $T$ and $\theta$ is often not analytically tractable. We might hope to solve the problem numerically, but this can be difficult. An alternative is to use asymptotics, which often results in more analytically tractable expressions. To this end, we consider the following hypothetical situation: instead of observing $T(X)$ for just a single experiment, we repeat the experiment $k$ times independently (conditional on $\theta$, which remains the same throughout), obtaining a vector $T^k$ consisting of $k$ independent copies of $T$. Instead of maximizing the mutual information just between $T$ and $\theta$, we maximize the information between the vector $T^k$ and $\theta$, obtaining

    $\pi_k(\theta) = \arg\max_{p(\theta)} I_{p(\theta)}(\theta, T^k),$    (2)

where

    $I_{p(\theta)}(\theta, T^k) = \int p(t^k) \left[ \int p(\theta \mid t^k) \log \frac{p(\theta \mid t^k)}{p(\theta)} \, d\theta \right] dt^k.$

We can then obtain an analytically tractable uninformative prior by taking $\pi(\theta) = \lim_{k \to \infty} \pi_k(\theta)$, where the limit is in a loose sense that allows for improper priors.

Bernardo (2005) argues that taking $k$ to infinity is not only a convenient way to compute uninformative priors, but also, in a philosophical sense, the right thing to do. His argument is, loosely, that when choosing a prior we want to consider not only the information we obtain from a particular experiment, but also the information we might obtain from many future experiments.
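To make the tractability issue concrete, the finite-sample problem in (1) can be attacked numerically. Below is a minimal Python sketch (an added illustration, not from the lecture; it assumes numpy and scipy are available) that discretizes $\theta$ for a Binomial$(n, \theta)$ model and runs the classical Blahut-Arimoto fixed-point iteration for channel capacity. Its update, $p_{\mathrm{new}}(\theta) \propto \exp\{\sum_t p(t \mid \theta) \log p(\theta \mid t)\}$, is exactly the quantity $f_k(\theta)$ derived in Section 2.1 below.

```python
import numpy as np
from scipy.stats import binom

n = 10                                    # Bernoulli trials per experiment
thetas = np.linspace(0.01, 0.99, 99)      # discretized parameter grid
ts = np.arange(n + 1)                     # possible values of the sufficient statistic
W = binom.pmf(ts[None, :], n, thetas[:, None])   # channel p(t | theta), shape (99, 11)

p = np.full(len(thetas), 1.0 / len(thetas))      # start from a uniform prior
for _ in range(2000):
    pt = p @ W                                   # marginal p(t)
    post = p[:, None] * W / pt[None, :]          # posterior p(theta | t)
    log_f = np.sum(W * np.log(np.maximum(post, 1e-300)), axis=1)   # log f(theta)
    p_new = np.exp(log_f - log_f.max())
    p_new /= p_new.sum()
    if np.max(np.abs(p_new - p)) < 1e-12:
        break
    p = p_new

# Compare with the Jeffreys prior for this model, Beta(1/2, 1/2):
jeffreys = 1.0 / np.sqrt(thetas * (1 - thetas))
print(p.round(4))
print((jeffreys / jeffreys.sum()).round(4))
```

For a finite number of replicates the maximizing prior tends to concentrate on a small number of support points; Section 2 shows that the $k \to \infty$ limit smooths out to the Jeffreys prior, which for this model is Beta(1/2, 1/2).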

2 Computing Reference Priors and the Bernstein-Von Mises Theorem

2.1 Solving the Mutual Information Problem

To find a more convenient form of $\pi_k$ so that we may apply asymptotic theory, we rewrite $I_{p(\theta)}(\theta, T^k)$ as

    $I_{p(\theta)}(\theta, T^k) = \int p(t^k) \left[ \int p(\theta \mid t^k) \log \frac{p(\theta \mid t^k)}{p(\theta)} \, d\theta \right] dt^k = \int p(\theta) \log \frac{f_k(\theta)}{p(\theta)} \, d\theta,$

where

    $f_k(\theta) = \exp\left\{ \int p(t^k \mid \theta) \log p(\theta \mid t^k) \, dt^k \right\}.$

Using a functional form of a Lagrangian to include the constraint that $\int p(\theta) \, d\theta = 1$, the problem becomes

    $\pi_k(\theta) = \sup_{p(\theta)} \left[ \int p(\theta) \log \frac{f_k(\theta)}{p(\theta)} \, d\theta + \lambda \left( \int p(\theta) \, d\theta - 1 \right) \right].$

This can be solved via methods of calculus of variations, and the solution is $\pi_k(\theta) \propto f_k(\theta)$. Although we will not go through the calculus-of-variations argument here, we can motivate this solution using the discrete case: if $T$ and $\theta$ are both discrete, then the problem is of the form

    $\pi = \arg\max_{p} \left[ \sum_i p_i \log \frac{q_i}{p_i} + \lambda \left( \sum_i p_i - 1 \right) \right].$

Taking partial derivatives with respect to $p_j$, we obtain

    $\frac{\partial}{\partial p_j} \left[ \sum_i p_i \log \frac{q_i}{p_i} + \lambda \left( \sum_i p_i - 1 \right) \right] = \log \frac{q_j}{p_j} + p_j \, \frac{\partial_{p_j} (q_j / p_j)}{q_j / p_j} + \lambda = -\log p_j + \log q_j + \lambda - 1,$

and setting this partial derivative to zero we obtain

    $\log p_j = \log q_j + \lambda - 1 \implies p_j = q_j e^{\lambda - 1} \implies \pi \propto q.$
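As a quick numerical sanity check on the discrete-case derivation (an added sketch, not from the notes; $q$ is an arbitrary vector of positive weights), we can maximize $\sum_i p_i \log(q_i / p_i)$ over the probability simplex with a generic solver and compare against the claimed solution $p \propto q$:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
q = rng.uniform(0.5, 2.0, size=6)          # arbitrary positive weights q_i

def neg_objective(p):
    # negative of sum_i p_i log(q_i / p_i), to be minimized
    return -np.sum(p * np.log(q / p))

p0 = np.full(6, 1.0 / 6)                   # start from the uniform distribution
res = minimize(
    neg_objective, p0,
    bounds=[(1e-9, 1.0)] * 6,
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],
)

print(res.x)        # numerical maximizer
print(q / q.sum())  # the claimed solution pi proportional to q; the two should agree
```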

2.2 The Bernstein-Von Mises Theorem and an Asymptotic Solution

We have now reduced the problem to computing

    $f_k(\theta) = \exp\left\{ \int p(t^k \mid \theta) \log p(\theta \mid t^k) \, dt^k \right\}.$

It is possible to obtain an analytical solution for the limit as $k \to \infty$ using the fact that $p(\theta \mid t^k)$ is asymptotically Gaussian, concentrated at the true value $\theta_0$, i.e. the value of $\theta$ such that $T^k_j \sim_{\mathrm{iid}} p(t \mid \theta_0)$. This fact, which ensures similar behavior of Bayesian posteriors and frequentist sampling distributions as the sample size tends to infinity, is a consequence of the Bernstein-Von Mises Theorem, sometimes called the Bayesian Central Limit Theorem:

Theorem 1. Assume regularity conditions on the model which ensure asymptotic normality of an asymptotically efficient (in the frequentist sense) estimator $\hat{\theta}_k$, and also assume that the prior satisfies regularity assumptions, in particular that it is positive and continuous near $\theta_0$. If $T^k$ denotes a vector of iid components $T^k_j$ drawn from the distribution of $T \mid \theta_0$, then

    $\left\| p(\theta \mid t^k) - N\!\left(\hat{\theta}_k, I_k^{-1}(\theta_0)\right) \right\| \to 0,$    (3)

where $I_k(\theta_0)$ denotes the Fisher information at $\theta_0$ and the convergence is convergence in probability. Here, $\|\cdot\|$ denotes the total variation distance.

In general, any asymptotically efficient estimator $\hat{\theta}_k$ is also asymptotically sufficient, so we may replace $t^k$ in (3) with $\hat{\theta}_k$, obtaining

    $\left\| p(\theta \mid \hat{\theta}_k) - N\!\left(\hat{\theta}_k, I_k^{-1}(\theta_0)\right) \right\| \to 0.$    (4)

Also, using the density of a normal distribution, we know that if $y \sim N(\hat{\theta}_k, I_k^{-1}(\theta_0))$, then

    $p(y) = \sqrt{\frac{I_k(\theta_0)}{2\pi}} \exp\left( -\frac{1}{2} I_k(\theta_0) (y - \hat{\theta}_k)^2 \right).$

By independence, we know that $I_k(\theta_0) = k I_1(\theta_0)$, with $I_1(\theta_0) := I(\theta_0)$, so the preceding expression, combined with the limit in (4) and the fact that $\hat{\theta}_k$ is consistent, leads to the following approximate representation for large $k$:

    $p(\theta \mid \hat{\theta}_k) \approx \sqrt{\frac{k I(\hat{\theta}_k)}{2\pi}} \exp\left( -\frac{k}{2} I(\hat{\theta}_k) (\theta - \hat{\theta}_k)^2 \right).$

The remainder of the argument will be somewhat loose; for a rigorous treatment of our final result, namely that a one-dimensional reference prior is a Jeffreys prior, see Bernardo's review paper (Bernardo 2005).

Suppose now that $\hat{\theta}_k$ results from $k$ independent draws of $X$, where $X$ has the distribution of $X \mid \theta_0$ for a particular $\theta_0$. Then, because $\hat{\theta}_k$ is consistent, $\hat{\theta}_k \to_p \theta_0$, and under regularity conditions also $I(\hat{\theta}_k) \to I(\theta_0)$. Hence,

    $p(\theta_0 \mid \hat{\theta}_k) \approx \sqrt{\frac{k I(\hat{\theta}_k)}{2\pi}} \exp\left( -\frac{k}{2} I(\hat{\theta}_k) (\theta_0 - \hat{\theta}_k)^2 \right) \approx \sqrt{\frac{k I(\theta_0)}{2\pi}} \exp\left( -\frac{k}{2} I(\theta_0) (\theta_0 - \theta_0)^2 \right) = \sqrt{\frac{k I(\theta_0)}{2\pi}}.$

Returning now to equation (2): since the inner integral is an expectation with respect to $p(t^k \mid \theta)$, the preceding theory applies so long as there exists some asymptotically normal, efficient estimator $\hat{\theta}_k = \hat{\theta}(t^k)$, and we obtain

    $f_k(\theta) \approx \exp\left\{ \int p(t^k \mid \theta) \log \sqrt{\frac{k I(\theta)}{2\pi}} \, dt^k \right\}.$

Since the term inside the integral does not depend on $t^k$, and it is being integrated against a density, as $k \to \infty$ we have $f_k(\theta) \approx \sqrt{k I(\theta)/2\pi} \propto \sqrt{I(\theta)}$. In other words, when there is an asymptotically normal, asymptotically efficient estimator $\hat{\theta}_k$, the Jeffreys prior is a reference prior in one dimension!
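The asymptotic normality of the posterior that drives this argument is easy to see in simulation. The following sketch (an added illustration assuming numpy and scipy; the Bernoulli model and flat Beta(1,1) prior are convenience choices made here) compares the exact Beta posterior with the normal approximation centered at the MLE with variance $1/(k I(\hat{\theta}_k))$, where $I(\theta) = 1/(\theta(1-\theta))$, and estimates the total variation distance appearing in (3):

```python
import numpy as np
from scipy.stats import beta, norm

rng = np.random.default_rng(1)
theta0 = 0.3
for k in [20, 200, 20_000]:
    s = rng.binomial(k, theta0)                 # sufficient statistic: number of successes
    theta_hat = min(max(s / k, 1.0 / k), 1 - 1.0 / k)   # MLE, guarded away from the boundary
    exact = beta(1 + s, 1 + k - s)              # exact posterior under a flat Beta(1,1) prior
    approx = norm(theta_hat, np.sqrt(theta_hat * (1 - theta_hat) / k))
    grid = np.linspace(1e-4, 1 - 1e-4, 200_000)
    dx = grid[1] - grid[0]
    tv = 0.5 * np.sum(np.abs(exact.pdf(grid) - approx.pdf(grid))) * dx
    print(f"k = {k:6d}   TV(exact posterior, normal approx) ~ {tv:.3f}")
```

The printed total variation distances shrink as $k$ grows, as Theorem 1 predicts.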

3 Example: Reference Prior for the Exponential Distribution

Let $X_i \sim \mathrm{Exp}(\theta)$ be iid. A sufficient statistic for $\theta$ is $\bar{x} := \frac{1}{n} \sum_{i=1}^n X_i$, and the maximum likelihood estimator is $\hat{\theta}_{\mathrm{MLE}} = 1/\bar{x}$. We could use the Jeffreys prior directly, but let us instead work through some of the machinery above for this specific case.

Set $X = (X_1, \ldots, X_n)$. Then $p(x \mid \theta) = \theta^n \exp(-n \bar{x} \theta)$. Since the Bernstein-Von Mises theorem ensures that the posterior is asymptotically the same regardless of the prior we use, we may just take the prior to be flat for convenience. Hence, asymptotically, we have

    $p(\theta \mid \hat{\theta}_{\mathrm{ML}}) \propto \theta^n \exp(-n \theta / \hat{\theta}_{\mathrm{ML}}),$

and since $\hat{\theta}_{\mathrm{ML}}$ is consistent, i.e. $\hat{\theta}_{\mathrm{ML}} \to \theta_0$ when $X_i \sim \mathrm{Exp}(\theta_0)$, we obtain

    $\pi_n(\theta) = \frac{(n/\hat{\theta}_{\mathrm{ML}})^{n+1}}{\Gamma(n+1)} \, \theta^n \exp\left( -\frac{n\theta}{\hat{\theta}_{\mathrm{ML}}} \right) \bigg|_{\hat{\theta}_{\mathrm{ML}} = \theta} \propto \frac{1}{\theta},$

where we evaluated at $\hat{\theta}_{\mathrm{ML}} = \theta$ because, in the definition of $f_n$, we are integrating with respect to $p(x \mid \theta)$, so that consistency applies. This matches the Jeffreys prior, since $I(\theta) = 1/\theta^2$ for the exponential model.
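As a quick numeric check of this computation (an added sketch assuming scipy): with a flat prior, the posterior given $\hat{\theta}_{\mathrm{ML}}$ is a Gamma distribution with shape $n+1$ and rate $n/\hat{\theta}_{\mathrm{ML}}$, and evaluating its density at $\hat{\theta}_{\mathrm{ML}} = \theta$ should be proportional to $1/\theta$:

```python
import numpy as np
from scipy.stats import gamma

n = 50
thetas = np.array([0.5, 1.0, 2.0, 4.0])
# Gamma(shape = n+1, rate = n/theta) density evaluated at theta itself;
# scipy's `scale` parameter is 1/rate.
dens = gamma.pdf(thetas, a=n + 1, scale=thetas / n)
print(dens * thetas)   # constant across theta values, so pi_n(theta) is prop. to 1/theta
```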
4 Invariance to Transformations

We have already shown that reference priors coincide with Jeffreys priors under regularity conditions, so they are invariant to transformations in that setting; but it also follows directly from the definition that they are invariant to transformations even in the absence of such regularity conditions. Specifically, the reference prior was defined in terms of mutual information, and mutual information is transformation-invariant. That is, for a reparametrization $\phi = \phi(\theta)$,

    $I(\theta, T^k) = \iint p(t^k) \, p(\theta \mid t^k) \log \frac{p(\theta \mid t^k)}{p(\theta)} \, d\theta \, dt^k = \iint p(t^k) \, p(\phi \mid t^k) \log \frac{p(\phi \mid t^k)}{p(\phi)} \, d\phi \, dt^k.$

The equality holds because, when we make the changes of variables $p(\phi) = p(\theta(\phi)) \left| \frac{d\theta}{d\phi} \right|$ and $p(\phi \mid t^k) = p(\theta(\phi) \mid t^k) \left| \frac{d\theta}{d\phi} \right|$, the Jacobian terms inside the logarithm cancel, so that the logarithms in the two integrals are equal. The term $\left| \frac{d\theta}{d\phi} \right|$ in $p(\phi) = p(\theta(\phi)) \left| \frac{d\theta}{d\phi} \right|$, on the other hand, is exactly what we obtain if we do a change of variables from $\theta$ to $\phi$ in the inner integral on the left-hand side, obtaining

    $\int p(\theta \mid t^k) \log \frac{p(\theta \mid t^k)}{p(\theta)} \, d\theta = \int p(\theta(\phi) \mid t^k) \log \frac{p(\theta(\phi) \mid t^k)}{p(\theta(\phi))} \left| \frac{d\theta}{d\phi} \right| d\phi.$

Hence, mutual information is transformation-invariant, and so are reference priors.

5 Example: Location and Scale Families

5.1 Location Families

For a given density $f$ (which we identify with the induced distribution), define a class of measures

    $m = \{ f(x - \mu) : x \in \mathbb{R}, \mu \in \mathbb{R} \}.$

For a particular $\mu$ and random variable $X \sim f(x - \mu)$, let $Y = X + \alpha$ and $\theta = \mu + \alpha$. If $\pi$ denotes the reference prior under the parametrization in our definition of $m$ above, then, since the density of $Y$ is $f^*(y) = f(y - \alpha - \mu) = f(y - \theta)$, the similarly constructed family $m^* = \{ f(y - \theta) : y \in \mathbb{R}, \theta \in \mathbb{R} \}$ is actually equal to $m$, reparametrized by the transformation $\theta = \mu + \alpha$. Since we know reference priors are transformation-invariant and the Jacobian of the transformation is equal to 1, $\pi^*(\theta) = \pi(\mu)$. But since the two families are identical, we also have $\pi^*(\theta) = \pi(\mu + \alpha)$, and hence $\pi(\mu + \alpha) = \pi(\mu)$. In other words, reference priors for one-dimensional location families are flat.

5.2 Scale Families

Define the family

    $m = \left\{ \frac{1}{\sigma} f\!\left( \frac{x}{\sigma} \right) : x > 0, \sigma > 0 \right\}.$

Taking $y = \log x$ and $\phi = \log \sigma$, define an equivalent reparametrized family

    $m^* = \{ f(\exp(y - \phi)) \exp(y - \phi) : y \in \mathbb{R}, \phi \in \mathbb{R} \},$

where the factor $\exp(y - \phi)$ is the Jacobian of the change of variables. In $m^*$, $\phi$ is a location parameter, so $\pi^*(\phi)$ is flat by the work in Section 5.1. But since $\pi^*(\phi) = \sigma \pi(\sigma)$ by a change of variables and transformation-invariance, we therefore obtain $\pi(\sigma) \propto \frac{1}{\sigma}$.
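Since Section 2 identified one-dimensional reference priors with Jeffreys priors, the scale-family result can be cross-checked by computing Fisher information numerically. Here is a short numpy-only sketch (an added illustration; the exponential choice of $f$, the quadrature grid, and the finite-difference step are all arbitrary choices made here) verifying that $\sqrt{I(\sigma)} \propto 1/\sigma$, the Jeffreys form of $\pi(\sigma) \propto 1/\sigma$:

```python
import numpy as np

def fisher_info(logpdf, theta, xs, dx, h=1e-5):
    """E[(d/dtheta log p(X|theta))^2] via central differences and a Riemann sum."""
    score = (logpdf(xs, theta + h) - logpdf(xs, theta - h)) / (2 * h)
    return np.sum(score**2 * np.exp(logpdf(xs, theta))) * dx

# Scale family (1/sigma) f(x/sigma) with f the standard exponential density.
def log_exp_scale(x, sigma):
    return -np.log(sigma) - x / sigma

xs = np.linspace(1e-6, 200.0, 400_000)   # quadrature grid covering the support
dx = xs[1] - xs[0]
for sigma in [0.5, 1.0, 2.0, 4.0]:
    I = fisher_info(log_exp_scale, sigma, xs, dx)
    print(f"sigma = {sigma}:  sigma * sqrt(I(sigma)) ~ {np.sqrt(I) * sigma:.4f}")  # expect ~1.0
```

The printed values should all be close to 1, consistent with a flat prior in $\phi = \log \sigma$ and $\pi(\sigma) \propto 1/\sigma$.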
References

Bernardo, J. M. (2005). Reference analysis. Handbook of Statistics, 25:17-90.