Stat 5421 Lecture Notes
Proper Conjugate Priors for Exponential Families
Charles J. Geyer
March 28, 2016

1 Theory

This section explains the theory of conjugate priors for exponential families of distributions, which is due to Diaconis and Ylvisaker (1979).

1.1 Conjugate Priors

A family of distributions P is conjugate to another family of distributions M if, when applying Bayes rule with the likelihood for M, whenever the prior is in P, the posterior is also in P.

The way one finds conjugate families is to make their PDF look like the likelihood. For an exponential family with canonical statistic y, canonical parameter θ, and cumulant function c, the likelihood is

$L(\theta) = e^{\langle y, \theta \rangle - c(\theta)}$.   (1)

More generally, the likelihood for sample size n from this family (see Section 3 of Geyer, 2016a) is

$L(\theta) = e^{\langle y, \theta \rangle - n c(\theta)}$,   (2)

where now y is the canonical statistic for sample size n, what Section 3 of Geyer (2016a) writes as $\sum_{i=1}^n y_i$, and θ and c are the same as before, but now n appears.

Saying we want the prior to "look like" (2) means we want it to be a function that looks like it, considered as a function of θ. But, of course, a prior cannot depend on data. So we have to replace y by something known, and, while we are at it, we also replace n too. Thus we say that

$h_{\eta, \nu}(\theta) = e^{\langle \eta, \theta \rangle - \nu c(\theta)}, \qquad \theta \in \Theta$,   (3)

defines a function $h_{\eta, \nu}$ that is an unnormalized conjugate prior probability density function (PDF) for θ, where η is a vector of the same dimension as the canonical statistic and canonical parameter, ν is a scalar, and Θ is the full canonical parameter space for the family given by equation (7) in Geyer (2016a).

The quantities η and ν are called hyperparameters of the prior, to distinguish them from θ, which is the parameter we are making Bayesian inference about. They are treated as known constants, different choices of which determine which prior we are using. We don't treat η and ν as parameters the way Bayesians treat parameters (put priors on them, treat them as random). Instead we treat η and ν the way frequentists treat parameters (as non-random constants, different values of which determine different probability distributions).

1.2 Corresponding Posteriors

The "make the prior look like the likelihood" trick is not guaranteed to work. It depends on what the likelihood looks like. So we check that it does indeed work for exponential families. Bayes rule can be expressed as

unnormalized posterior = likelihood × unnormalized prior.   (4)

If we apply this with (2) as the likelihood and (3) as the unnormalized prior, we obtain

$e^{\langle y, \theta \rangle - n c(\theta)} \, e^{\langle \eta, \theta \rangle - \nu c(\theta)} = e^{\langle y + \eta, \theta \rangle - (n + \nu) c(\theta)}$,

which is a distribution in the conjugate family with vector hyperparameter y + η and scalar hyperparameter n + ν. So it does work: this is indeed the conjugate family. In general, there is no simple expression for the normalized PDF of the conjugate family, so we still have to use Markov chain Monte Carlo (MCMC) to do calculations about the posterior.
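To make the updating rule concrete, here is a minimal numerical sketch in R (the function and variable names are ours, not from these notes or from Geyer, 2016b), assuming the simplest possible case: i.i.d. Bernoulli data, so the canonical parameter is the scalar θ = logit(p) and the cumulant function is c(θ) = log(1 + e^θ).

    ## unnormalized conjugate prior / posterior density (3) on the canonical scale
    h <- function(theta, eta, nu) exp(eta * theta - nu * log1p(exp(theta)))

    ## prior hyperparameters (proper because nu > 0 and eta / nu lies in (0, 1))
    eta <- 1 / 2
    nu <- 1

    ## data: y successes in n Bernoulli trials
    y <- 7
    n <- 10

    ## Bayes rule in the conjugate family: hyperparameters update additively
    eta.post <- y + eta
    nu.post <- n + nu

    ## both prior and posterior have finite normalizing constants
    integrate(h, -Inf, Inf, eta = eta, nu = nu)$value
    integrate(h, -Inf, Inf, eta = eta.post, nu = nu.post)$value

The posterior is just another member of the conjugate family, with the data added to the hyperparameters; no new functional form appears.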

1.3 The Philosophy of Conjugate Priors

What is the point of conjugate priors? Suppose we started with a flat prior, which is the special case of the conjugate family with η = 0 (the zero vector) and ν = 0 (the zero scalar). Then no matter how much data we ever collect, our posterior will be some distribution in the conjugate family. (If we start in the conjugate family, we stay in the conjugate family.) The Bayesian learning paradigm says that the posterior distribution incorporating past data serves as the prior distribution for future data. So this suggests that any prior based on data should be a conjugate prior.

But, of course, the whole argument depends on starting with a flat prior. If we don't start with a prior in the conjugate family, then we don't (in general) get a posterior distribution in the conjugate family. But this argument does suggest that conjugate priors are one kind of prior that is reasonable for the model. For example, they have the kind of tail behavior that could have come from observation of data from the model.

This is something that just using, for example, normal prior distributions does not do. The normal distribution has tails that decrease more rapidly than any other widely used distribution, and a lot faster than conjugate priors for some discrete exponential families. Consider a family with bounded support of the canonical statistic, for example, logistic regression. The log of (3) has derivative

$\nabla \log h_{\eta, \nu}(\theta) = \eta - \nu \nabla c(\theta) = \eta - \nu E_\theta(y)$,

using equation (5) in Geyer (2016a). The point is that since y is bounded, so is $E_\theta(y)$ (bounded considered as a function of θ, that is). And that means the logarithmic derivative of the conjugate prior is bounded. And that means the conjugate prior PDF has tails that decrease no more than exponentially fast. But the normal distribution has tails that decrease superexponentially fast (like $\exp(-\|\beta\|^2)$, where $\|\cdot\|$ denotes the Euclidean norm). So normal priors are more informative than conjugate priors (they have lighter tails, much lighter when far out in the tails). They express more certainty about the parameter than could ever be learned from any amount of data. Something is wrong there (philosophically).

In summary, when you use conjugate priors, they are guaranteed to be something that could reflect uncertainty that comes from actual data. Other priors pulled out of nowhere do not have this property.
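A small R sketch makes the tail comparison visible (our own illustration, assuming a scalar Bernoulli-type canonical parameter with hyperparameters η = 1/2 and ν = 1, and a standard normal prior for contrast; none of these choices come from the notes): the conjugate prior log density falls off linearly in θ, the normal log density quadratically.

    ## log unnormalized conjugate prior for a scalar Bernoulli canonical parameter
    log.conj <- function(theta, eta = 1 / 2, nu = 1) eta * theta - nu * log1p(exp(theta))

    theta <- c(5, 10, 20, 40)
    cbind(theta,
        conjugate = log.conj(theta),        # roughly - theta / 2: exponential tails
        normal = dnorm(theta, log = TRUE))  # roughly - theta^2 / 2: much lighter tails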

1.4 Proper Priors

The fundamental theorem about conjugate priors for exponential families is Theorem 1 in Diaconis and Ylvisaker (1979), which we repeat here.

Theorem 1 (Diaconis-Ylvisaker). The conjugate prior (3) determined by η and ν is proper if and only if ν > 0 and η/ν lies in the interior of the convex support of the exponential family, which is the smallest convex set that contains the canonical statistic of the family with probability one.

So what does that last bit mean? A set S in a vector space is convex if it contains all convex combinations of its points, which in turn are defined to be points of the form $\sum_{i=1}^k p_i x_i$, where the $x_i$ are points of S and the $p_i$ are scalars that are nonnegative and sum to one. The "nonnegative and sum to one" should ring a bell: it is the condition for probability mass functions. So another way to explain convexity is to say that S is convex if the mean of every probability distribution concentrated on a finite set of points of S is contained in S.

Corollary 2. If a conjugate prior for the canonical parameter of an exponential family is proper, then the parameterization is identifiable.

Proof. By Theorem 1 in Geyer (2009), which also is Theorem 1 in Geyer (2016a), the canonical parameterization is identifiable if and only if the convex support has the same dimension as the canonical statistic and parameter vectors. So if the parameterization is not identifiable, then the convex support lies in a lower-dimensional subspace and has empty interior, and by Theorem 1 above that implies the prior is not proper.

Put the other way around, if we want a proper prior, then we had better have an identifiable parameterization.

1.4.1 Saturated Models

To get a more concrete picture, let us calculate the convex supports for a few simple probability models.

Bernoulli Regression

First consider the saturated model for Bernoulli regression. The components $y_i$ of the response vector y are Bernoulli. Each $y_i$ has support {0, 1}. Hence y has support $\{0, 1\}^n$, where n is the dimension of y. The convex support must fill in all the points in between, so it is $[0, 1]^n$. The interior of the convex support is the points inside, not on the surface, so that is $(0, 1)^n$. Here we are using the convention that square brackets for an interval indicate that the end points are included, and round brackets indicate that the end points are excluded. To be absolutely clear, the interior of the convex support is

$(0, 1)^n = \{\, y \in \mathbb{R}^n : 0 < y_i < 1 \text{ for all } i \,\}$.   (5)

Poisson Regression

Poisson regression is similar. The components $y_i$ of the response vector y are Poisson. Each $y_i$ has support $\mathbb{N}$ (the set of nonnegative integers). Hence y has support $\mathbb{N}^n$, where n is the dimension of y. The convex support is thus $[0, \infty)^n$, and the interior of the convex support is

$(0, \infty)^n = \{\, y \in \mathbb{R}^n : 0 < y_i \text{ for all } i \,\}$.

Multinomial, Try III

The multinomial distribution is a bit different. If y is a multinomial random vector of dimension k for sample size n, then the components of y are nonnegative integers that sum to n. Unlike the case for Bernoulli or Poisson regression, the components of y are not independent and the support is not a Cartesian product. But we just said what it was (components are nonnegative integers that sum to n). In math notation that is

$\{\, y \in \mathbb{N}^k : \sum_{i=1}^k y_i = n \,\}$,

and the convex support must fill in points in between,

$\{\, y \in [0, \infty)^k : \sum_{i=1}^k y_i = n \,\}$,

and the interior of the convex support is empty because the convex support is contained in a lower-dimensional subspace and hence is not a neighborhood of any point. Hence there can be no proper conjugate prior if we take y to be the canonical statistic of the family, what Geyer (2016a, Section 2.4.3) calls the Try III parameterization of the multinomial.

Multinomial, Try II

If instead we use the Try II parameterization (Geyer, 2016a, Section 2.4.2), in which the canonical statistic has components $y_1, y_2, \ldots, y_{k-1}$ (omitting the count for the last category), then the support of the canonical statistic vector is

$\{\, y \in \mathbb{N}^{k-1} : \sum_{i=1}^{k-1} y_i \le n \,\}$

(less than or equal to n because we have dropped $y_k$ from the sum), and the interior of the convex support is

$\{\, y \in (0, \infty)^{k-1} : \sum_{i=1}^{k-1} y_i < n \,\}$.
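These conditions are easy to codify. Here is a small R sketch (our own helper functions, not from these notes or any package) that checks, via Theorem 1, whether hyperparameters η and ν give a proper conjugate prior for each of the saturated models just discussed.

    ## Bernoulli regression: interior of the convex support is (0, 1)^n
    proper.bernoulli <- function(eta, nu) nu > 0 && all(eta / nu > 0 & eta / nu < 1)

    ## Poisson regression: interior of the convex support is (0, Inf)^n
    proper.poisson <- function(eta, nu) nu > 0 && all(eta / nu > 0)

    ## multinomial, Try II parameterization (k - 1 components, sample size n)
    proper.multinomial.try2 <- function(eta, nu, n)
        nu > 0 && all(eta / nu > 0) && sum(eta / nu) < n

    proper.bernoulli(eta = rep(1 / 2, 4), nu = 1)   # TRUE
    proper.poisson(eta = rep(1 / 2, 4), nu = 0)     # FALSE: nu must be positive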

1.4.2 Canonical Affine Models

Theorem 1 also applies to canonical affine submodels because they too are full exponential families (Geyer, 2016a, Section 4.6). The only difference is that the canonical statistic vector has the form $M^T y$, where M is the model matrix and y is the canonical statistic vector of the saturated model.

It is usually hard to calculate the convex support of $M^T y$ even if the convex support of y is known. Thus it is usually hard to know the set of all hyperparameters that yield proper conjugate priors. Hence the following theorem, which, at least, allows us to identify some proper conjugate priors.

Theorem 3. If ν > 0, η/ν is a vector in the relative interior of the convex support of a full exponential family, and M is the model matrix for a canonical affine submodel that has identifiable canonical parameterization, then $M^T \eta$ and ν are hyperparameters for a proper conjugate prior for the submodel canonical parameter.

To understand this we need to know what relative interior is. It is the interior relative to the smallest affine subspace containing the set (Rockafellar, 1970, Chapter 6). Again, we can understand this concept by examples. If the convex support of the saturated model has the same dimension as the canonical statistic, as we saw was the case for Bernoulli regression, Poisson regression, and the multinomial distribution with the Try II parameterization, then the relative interior is the same as the interior. Only with the multinomial distribution with the Try III parameterization did we have a lower-dimensional convex support, which yields an empty interior. That is where the notion of relative interior is useful. It is what we might think is the interior if we weren't careful. For the multinomial it is what we get if we replace $[0, \infty)$ by $(0, \infty)$ in the formula for the convex support, obtaining

$\{\, y \in (0, \infty)^k : \sum_{i=1}^k y_i = n \,\}$.

Proof of Theorem 3. By Theorem 6.6 in Rockafellar (1970), if η/ν is in the relative interior of the convex support of the saturated model, then $M^T \eta / \nu$ is in the relative interior of the convex support of the submodel. By Theorem 1 in Geyer (2009), the relative interior of the convex support of the submodel is equal to the interior if and only if the submodel canonical parameterization is identifiable. Hence, under the assumed identifiability, $M^T \eta / \nu$ lies in the interior of the convex support of the submodel, and since ν > 0, Theorem 1 says the prior with hyperparameters $M^T \eta$ and ν is proper.
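To see what Theorem 3 gives us in practice, here is a hedged R sketch for a hypothetical logistic regression submodel (the predictor x and the choice η/ν = 1/2 componentwise are ours, for illustration only; Section 2.1 below explains where that choice comes from): the saturated-model hyperparameters map to submodel hyperparameters $M^T \eta$ and ν.

    ## hypothetical model matrix for a logistic regression with one predictor
    x <- seq(0, 1, length.out = 10)
    M <- cbind(1, x)                  # 10 x 2 model matrix (intercept and slope)

    ## saturated-model hyperparameters with eta / nu = 1/2 componentwise, which
    ## lies in the interior (0, 1)^10 of the convex support, and nu > 0
    eps <- 1 / 2
    eta <- rep(eps, nrow(M))
    nu <- 2 * eps

    ## hyperparameters of the induced proper conjugate prior for the submodel
    ## canonical parameter (Theorem 3)
    eta.sub <- drop(t(M) %*% eta)
    nu.sub <- nu
    eta.sub                           # components are sum(eta) and sum(x * eta)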

2 Examples

2.1 Logistic Regression

2.1.1 Saturated Model

This example follows an example that is just R code (Geyer, 2016b), where it is said that adding 1/2 to the count in each cell of a product binomial (logistic regression) model has the same effect as using a proper prior. The log likelihood for the saturated model in terms of ordinary parameters is

$l(p) = \sum_{i=1}^n \bigl[\, y_i \log(p_i) + (n_i - y_i) \log(1 - p_i) \,\bigr]$   (6)

(we have a reason for starting with ordinary parameters rather than canonical parameters; we will get to them). By adding 1/2 to each cell of the table, what was meant was to add 1/2 to each of the $y_i$ and to each of the $n_i - y_i$, which is clearly what the cited computer code does. Let us generalize, writing ε rather than 1/2 and allowing it to be any positive real number. Then the unnormalized log posterior is

$\sum_{i=1}^n \bigl[\, (y_i + \varepsilon) \log(p_i) + (n_i - y_i + \varepsilon) \log(1 - p_i) \,\bigr]$.   (7)

From (4) we get

log unnormalized posterior = log likelihood + log unnormalized prior,

from which we see that the log unnormalized prior is (7) minus (6), which is

$\sum_{i=1}^n \bigl[\, \varepsilon \log(p_i) + \varepsilon \log(1 - p_i) \,\bigr]$.

This is the log unnormalized prior we are using. Now we introduce canonical parameters

$p_i = \dfrac{e^{\theta_i}}{1 + e^{\theta_i}}$,

so the log unnormalized prior becomes

$h(\theta) = \sum_{i=1}^n \Bigl[\, \varepsilon \log\Bigl(\dfrac{e^{\theta_i}}{1 + e^{\theta_i}}\Bigr) + \varepsilon \log\Bigl(\dfrac{1}{1 + e^{\theta_i}}\Bigr) \Bigr] = \sum_{i=1}^n \bigl[\, \varepsilon \theta_i - 2 \varepsilon \log\bigl(1 + e^{\theta_i}\bigr) \bigr]$   (8)

and we have now put the prior in exponential family form. The vector hyperparameter η is the vector having all components equal to ε, and ν = 2ε (compare with the form of the log likelihood for logistic regression in Section 5.1 of Geyer, 2016a).

So does this prior satisfy the conditions to be proper given in Theorem 1? Or, to be more precise, would it be proper if we were using the saturated model? Since we said ε > 0, we clearly have ν = 2ε > 0. That's one part. Then η/ν is the vector having all components equal to 1/2. And this is indeed a point in (5), so this does give a proper prior.
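As a numerical sanity check (a sketch we add here; it is not part of the cited R script), note that the prior (8) factors over the coordinates, so it suffices to integrate one factor. With ε = 1/2 the one-dimensional factor exp(εθ − 2ε log(1 + e^θ)) has a finite integral, in agreement with Theorem 1.

    ## one coordinate of the unnormalized prior (8); the prior factors over i
    logistic.prior1 <- function(theta, eps) exp(eps * theta - 2 * eps * log1p(exp(theta)))

    ## epsilon = 1/2 corresponds to adding 1/2 to each cell of the table
    integrate(logistic.prior1, -Inf, Inf, eps = 1 / 2)$value   # finite (equals pi)

On the mean-value parameter scale this prior is the Beta(1/2, 1/2) distribution for each $p_i$, which is another way to see why adding 1/2 to each cell behaves like a proper prior.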

2.1.2 Canonical Affine Submodel

When we go to canonical affine submodels, Theorem 3 assures us that (so long as the submodel canonical parameterization is identifiable) adding 1/2 to each cell of the table gives a proper conjugate prior for the submodel canonical parameter.

2.2 Poisson Regression

Now we turn to our other example, which is homework problem 3-2. In the statement of the problem, it is claimed that adding 1/2 to the count for each cell of the contingency table again results in a proper conjugate prior. As we shall see, this claim is wrong.

Now the saturated model log likelihood is

$l(\theta) = \langle y, \theta \rangle - c(\theta)$,

where

$c(\theta) = \sum_{i=1}^n e^{\theta_i}$

(Geyer, 2016a, equation (40)). It is now clear from Theorem 1 that a proper conjugate prior has the form (3) with η having components $\eta_i > 0$ and also with ν > 0.

And the latter requirement is what adding 1/2 to each component of y leaves out. That gives an unnormalized log posterior of

$\langle y + \tfrac{1}{2}, \theta \rangle - c(\theta) = \langle y + \eta, \theta \rangle - (1 + \nu) c(\theta)$

with $\eta_i = 1/2$ for all i and ν = 0. We see that our unnormalized prior for the saturated model is just

$g(\theta) = e^{\langle \eta, \theta \rangle}$,

and applying this to the canonical affine submodel gives

$g(\beta) = e^{\langle M^T \eta, \beta \rangle}$,

and this cannot possibly integrate to something finite because it goes to infinity as β goes to infinity in some direction (which depends on what M is). If we wanted a proper prior, its log unnormalized density would have to have the form

$h(\beta) = \langle M^T \eta, \beta \rangle - \nu c(a + M \beta)$

with all components of η strictly positive and ν > 0.
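A one-coordinate numerical check in R makes the failure visible (a sketch with our own names, with ε = 1/2 playing the role of the added count): with ν = 0 the unnormalized prior grows without bound as θ increases and cannot integrate, while any ν > 0 (together with η > 0) gives a finite integral.

    ## one coordinate of the Poisson-model conjugate prior exp(eta * theta - nu * exp(theta))
    poisson.prior1 <- function(theta, eta, nu) exp(eta * theta - nu * exp(theta))

    ## adding 1/2 to the counts gives eta = 1/2 but nu = 0: not integrable,
    ## so integrate() should fail or warn here (hence the try wrapper)
    try(integrate(poisson.prior1, -Inf, Inf, eta = 1 / 2, nu = 0))

    ## any nu > 0 (with eta > 0) does give a proper prior
    integrate(poisson.prior1, -Inf, Inf, eta = 1 / 2, nu = 1)$value   # finite: gamma(1/2)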

References

Diaconis, P., and Ylvisaker, D. (1979). Conjugate priors for exponential families. Annals of Statistics, 7, 269-281.

Geyer, C. J. (2009). Likelihood inference in exponential families and directions of recession. Electronic Journal of Statistics, 3, 259-289.

Geyer, C. J. (2016a). Exponential families, part I. Lecture notes for Stat 5421 (categorical data analysis). http://www.stat.umn.edu/geyer/5421/notes/expfam.pdf.

Geyer, C. J. (2016b). Untitled R script for Stat 5421 (categorical data analysis). http://www.stat.umn.edu/geyer/5421/examp/mcmc-too.rout.

Rockafellar, R. T. (1970). Convex Analysis. Princeton University Press, Princeton, NJ.