Conjugate Predictive Distributions and Generalized Entropies


Conjugate Predictive Distributions and Generalized Entropies Eduardo Gutiérrez-Peña Department of Probability and Statistics IIMAS-UNAM, Mexico Padova, Italy. 21-23 March, 2013

Menu
1 Antipasto/Appetizer: Posterior unbiasedness
2 Piatto Principale/Main Course: Conjugate predictive distributions as generalized exponential families
3 Dolce/Dessert: Concluding remarks

Posterior unbiasedness Unbiasedness Why the interest in unbiasedness? There is an interesting duality between the classical notion of unbiasedness and minimization of posterior expected loss [Noorbaloochi & Meeden, 1983]. Some Bayes estimators ˆλ = ˆλ(x^(n)) turn out to be unbiased for certain transformations λ = λ(θ), in the sense that E(ˆλ | λ) = λ. This definition of unbiasedness is related to the squared error loss L²_λ(ˆλ, λ) = (ˆλ − λ)², as is the Bayes estimator ˆλ = E(λ | x^(n)), which in turn satisfies E(λ | ˆλ) = ˆλ.

Posterior unbiasedness L-unbiasedness For a general loss function L(ˆλ, λ), we shall say that the estimator ˆλ(x^(n)) is L-unbiased for the parameter λ if, for all λ, λ′ ∈ Λ [Lehmann, 1951], E(L(ˆλ(X^(n)), λ) | λ) ≤ E(L(ˆλ(X^(n)), λ′) | λ). Also, for a loss function L and a prior p, we shall say that the estimator ˆλ is (L, p)-Bayes if it minimizes the corresponding posterior expected loss ∫ L(ˆλ, λ) p(λ | x^(n)) dλ.

Posterior unbiasedness L-unbiasedness Using this general definition of unbiasedness, Hartigan (1965) showed that the prior which minimizes the asymptotic bias of the Bayes estimator relative to the loss function L(ˆλ, λ) is given by π(λ) ∝ i_λ(λ) [∂²L(λ, ω)/∂ω² |_{ω=λ}]^{−1/2}, where i_λ(λ) denotes the Fisher information. For the squared error loss L²_λ(ˆλ, λ), this becomes π(λ) ∝ i_λ(λ).

Posterior unbiasedness L-unbiasedness For exponential family models and squared error loss with respect to the mean parameter [i.e. L²_µ(ˆµ, µ) = (ˆµ − µ)²], the unbiasedness of π_µ(µ) ∝ i_µ(µ) is exact [Hartigan, 1965]. This prior can be written as π_µ(µ) ∝ V(µ)^{−1}, while in terms of the canonical parameter it becomes π_θ(θ) ∝ 1. Hence, the (L²_µ, π_µ)-Bayes estimator ˆµ = E(µ | x^(n)) is L²_µ-unbiased, and the corresponding unbiased prior is π_µ. Under certain conditions, this also holds (in the usual L²_λ-sense) for other parametrizations λ = λ(θ) [GP & Mendoza, 1999].
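To make the exactness concrete, here is a minimal sketch (my own illustration, not part of the slides) for the Poisson model: the canonical parameter is θ = log λ, so the unbiased prior π_θ(θ) ∝ 1 corresponds to π_λ(λ) ∝ λ^{−1}, the posterior is Ga(Σx_i, n), and the Bayes estimator of the mean is exactly x̄. The Monte Carlo check below confirms E(ˆµ | µ) = µ.

    import numpy as np

    rng = np.random.default_rng(0)
    mu_true = 3.7            # true Poisson mean
    n = 25                   # sample size
    reps = 200_000           # Monte Carlo replications

    # For the Poisson NEF, the unbiased prior pi_theta(theta) ∝ 1 maps to
    # pi_mu(mu) ∝ 1/mu, i.e. an improper Gamma(0, 0) prior on the mean.
    # The posterior is then Gamma(sum(x), n), so E(mu | x) = sum(x)/n = xbar.
    x = rng.poisson(mu_true, size=(reps, n))
    post_mean = x.sum(axis=1) / n        # Bayes estimator under the unbiased prior

    print("E(mu_hat | mu) ≈", post_mean.mean())   # ≈ 3.7 = mu_true: exactly unbiased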

Posterior unbiasedness Relative entropy loss For an arbitrary parametrization λ, let L_K(ˆλ, λ) = D_KL(p(x | λ) ‖ p(x | ˆλ)) denote the relative entropy (Kullback-Leibler) loss, and let L̄_K(ˆλ, λ) = D_KL(p(x | ˆλ) ‖ p(x | λ)) denote the corresponding dual loss. Unlike the squared error loss L²_λ, both of these loss functions are invariant under reparametrizations of the model, and so we may drop the subscript λ from the notation. With this notation, ˆµ = E(µ | x^(n)) is not only (L²_µ, π_µ)-Bayes but also (L_K, π_µ)-Bayes.
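As a quick numerical illustration of the last claim (a sketch of my own, not from the slides): for a Poisson model with an arbitrary conjugate Gamma prior, the posterior expected L_K loss, as a function of the candidate mean ˆµ, is minimized exactly at E(µ | x^(n)).

    import numpy as np
    from scipy.optimize import minimize_scalar
    from scipy.stats import gamma

    rng = np.random.default_rng(1)
    x = rng.poisson(2.5, size=20)

    # Conjugate Gamma(a0, rate b0) prior on the Poisson mean mu (arbitrary choice).
    a0, b0 = 2.0, 1.0
    a1, b1 = a0 + x.sum(), b0 + len(x)       # posterior is Gamma(a1, rate b1)

    # Kullback-Leibler loss between Poisson sampling models:
    #   L_K(mu_hat, mu) = D_KL(Pois(mu) || Pois(mu_hat)) = mu*log(mu/mu_hat) - mu + mu_hat
    def post_expected_kl(mu_hat, n_draws=20_000):
        mu = gamma.rvs(a1, scale=1.0 / b1, size=n_draws, random_state=2)
        return np.mean(mu * np.log(mu / mu_hat) - mu + mu_hat)

    opt = minimize_scalar(post_expected_kl, bounds=(0.1, 10.0), method="bounded")
    print("argmin of posterior expected L_K:", opt.x)
    print("posterior mean E(mu | x):        ", a1 / b1)   # the two agree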

Posterior unbiasedness Relative entropy loss Therefore, for the mean parameter of an exponential family, the (L_K, π_µ)-Bayes estimator is L²_µ-unbiased. A partial dual result holds for the canonical parameter θ. In this case, the (L̄_K, π_θ)-Bayes estimator is ˆθ = E(θ | x^(n)), which is also the (L²_θ, π_θ)-Bayes estimator of θ. Suppose ˆθ is unbiased in the usual sense. Then, for the canonical parameter of an exponential family, the (L̄_K, π_θ)-Bayes estimator is L²_θ-unbiased. The (L_K, π_θ)-Bayes estimator of θ is also its posterior mode which, since π_θ(θ) ∝ 1, coincides with the maximum likelihood estimator. In other words, π_θ(θ) ∝ 1 is a maximum likelihood prior as defined by Hartigan (1998).

Posterior unbiasedness Relative entropy loss Unlike (L², p)-Bayes estimators, both (L_K, p)- and (L̄_K, p)-Bayes estimators are invariant under reparametrizations of the model. Wu & Vos (2012) have shown that, for exponential families, the maximum likelihood estimator is L̄_K-unbiased (regardless of the parametrization). We can rephrase this result to say that the (L_K, π)-Bayes estimator is L̄_K-unbiased, so it is unbiased in a dual sense. Here, π is the prior induced by the uniform prior on the canonical parameter θ.

Posterior unbiasedness Relative entropy loss It follows that the L̄_K-unbiased estimator for an arbitrary transformation λ = λ(µ) of the mean parameter always exists and can be obtained by correspondingly transforming the usual L²_µ-unbiased estimator ˆµ = E(µ | x^(n)), i.e. ˆλ = λ(ˆµ), which in this case coincides with the maximum likelihood estimator [GP & Mendoza, 2013]. Such a ˆλ is not necessarily unbiased in the usual L²_λ-sense.

Posterior unbiasedness Relative entropy loss Again, a partial dual result can be obtained for the canonical parameter. Using results from Wu & Vos (2012), it can be shown that an estimator is L_K-unbiased for θ if it is L²_θ-unbiased. Recall that the (L̄_K, π)-Bayes estimator is ˆθ = E(θ | x^(n)). If ˆθ happens to be unbiased in the usual sense, then the (L̄_K, π)-Bayes estimator is L_K-unbiased. In this case, the L_K-unbiased estimator for an arbitrary transformation λ = λ(θ) of the canonical parameter can be obtained by correspondingly transforming the usual L²_θ-unbiased estimator ˆθ, i.e. ˆλ = λ(ˆθ). As before, such a ˆλ is not necessarily unbiased in the usual L²_λ-sense.

Conjugate predictive distributions and generalized entropies Conjugate predictive distributions Natural exponential family p_θ(x | θ) = b(x) exp{θx − M(θ)} (θ ∈ Ξ), with M(θ) = log ∫ b(x) exp{θx} η(dx). Canonical parameter space Ξ = {θ ∈ IR : M(θ) < ∞}. Mean-value parameter µ = µ(θ) = E[X | θ] = dM(θ)/dθ. Mean-value parameter space Ω = µ(Ξ). Variance function V(µ) = Var[X | θ(µ)] = d²M(θ)/dθ² evaluated at θ = θ(µ).
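For concreteness, a minimal sketch of these ingredients for the Poisson family (my own illustration): b(x) = 1/x!, M(θ) = e^θ, Ξ = IR, µ(θ) = e^θ, Ω = (0, ∞) and V(µ) = µ.

    import numpy as np
    from math import factorial

    # Poisson as a natural exponential family: p(x | theta) = b(x) exp{theta*x - M(theta)}
    def b(x):
        return 1.0 / factorial(x)

    def M(theta):            # cumulant transform: log sum_x b(x) e^{theta x} = e^theta
        return np.exp(theta)

    def mu(theta):           # mean-value parameter: dM/dtheta
        return np.exp(theta)

    def V(m):                # variance function in the mean parametrization
        return m

    theta = 0.9
    probs = [b(x) * np.exp(theta * x - M(theta)) for x in range(60)]
    print(sum(probs))                    # ≈ 1: the density is properly normalized
    print(mu(theta), np.exp(theta))      # mean equals exp(theta)
    print(V(mu(theta)))                  # variance equals the mean for the Poisson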

Conjugate predictive distributions and generalized entropies Conjugate predictive distributions Let X_1, ..., X_n ~ p_θ(x | θ). Then p_θ(x̄ | θ, n) = b̄(x̄, n) exp{n[θx̄ − M(θ)]}. Likelihood L_θ(θ | x̄, n) ∝ exp{n[θx̄ − M(θ)]}. Fisher information i_θ(θ) = d²M(θ)/dθ². In terms of the mean-value parameter, i_µ(µ) = V(µ)^{−1}.

Conjugate predictive distributions and generalized entropies Conjugate predictive distributions Natural conjugate family π_θ(θ | x_0, n_0) ∝ L_θ(θ | x_0, n_0) (x_0 ∈ IR, n_0 ∈ IR). Normalized form π_θ(θ | x_0, n_0) = h(x_0, n_0) exp{n_0[θx_0 − M(θ)]}, with h(x_0, n_0)^{−1} = ∫ exp{n_0[θx_0 − M(θ)]} dθ (DY-conjugate prior).

Conjugate predictive distributions and generalized entropies Conjugate predictive distributions Prior predictive p(x | x_0, n_0) = b(x) h(x_0, n_0) / h((n_0 x_0 + x)/(n_0 + 1), n_0 + 1). Posterior predictive p(x | x_1, n_1) = b(x) h(x_1, n_1) / h((n_1 x_1 + x)/(n_1 + 1), n_1 + 1), where x_1 = (n_0 x_0 + n x̄)/(n_0 + n) and n_1 = n_0 + n. In general, this is not an exponential family. As n → ∞, x_1 → µ and p(x | x_1, n_1) → p_µ(x | µ), where p_µ(x | µ) = b(x) exp{θ(µ)x − M(θ(µ))}.
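As a check on these formulas, here is a sketch for the Poisson case (the helper names and hyperparameter values are mine): M(θ) = e^θ gives the closed form h(x_0, n_0) = n_0^{n_0 x_0} / Γ(n_0 x_0), so the predictive ratio above can be evaluated directly and compared with the familiar Poisson-Gamma (negative binomial) predictive.

    import numpy as np
    from scipy.special import gammaln
    from scipy.stats import nbinom

    # Poisson NEF: M(theta) = exp(theta).  In the mean-value DY parametrization,
    # h(x0, n0)^{-1} = ∫ exp{n0[theta*x0 - exp(theta)]} dtheta = Gamma(n0*x0) / n0^(n0*x0).
    def log_h(x0, n0):
        return n0 * x0 * np.log(n0) - gammaln(n0 * x0)

    def log_predictive(x, x0, n0):
        # log p(x | x0, n0) = log b(x) + log h(x0, n0) - log h((n0*x0 + x)/(n0+1), n0+1)
        log_b = -gammaln(x + 1.0)                     # b(x) = 1/x!
        x_new = (n0 * x0 + x) / (n0 + 1.0)
        return log_b + log_h(x0, n0) - log_h(x_new, n0 + 1.0)

    x0, n0 = 2.4, 3.0                                 # illustrative hyperparameters
    xs = np.arange(0, 10)
    pred = np.exp([log_predictive(x, x0, n0) for x in xs])

    # The Poisson-Gamma predictive is negative binomial with r = n0*x0, p = n0/(n0+1).
    ref = nbinom.pmf(xs, n0 * x0, n0 / (n0 + 1.0))
    print(np.max(np.abs(pred - ref)))                 # ≈ 0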

Conjugate predictive distributions and generalized entropies Generalized entropies The maximum entropy principle (MEP) is called upon when one wishes to select a single distribution P (for a random variable X) as a representative of a class Γ of probability distributions. Typically Γ = Γ_d ≡ {P : E_P(S) = ς}, where ς ∈ IR^d and S = s(X) is a statistic taking values in IR^d. The Shannon entropy of the distribution P is defined by H(P) = −∫ p(x) log{p(x)/p_0(x)} η(dx) = −D_KL(p ‖ p_0) and describes the uncertainty inherent in p(·) relative to p_0(·).

Conjugate predictive distributions and generalized entropies Generalized entropies If there exists a distribution P* maximizing H(P) over the class Γ_d, then its density p* satisfies p*(x) = p_0(x) exp{λᵀ s(x)} / ∫ p_0(x) exp{λᵀ s(x)} η(dx), where λ = (λ_1, ..., λ_d) is a vector of Lagrange multipliers determined by the moment constraints E_P(S) = ς which define the class Γ_d. Despite its intuitive appeal, the MEP is controversial. However, Topsøe (1979) provides an interesting decision-theoretic justification for the MEP.
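A small numerical sketch of this construction (my own illustration): take p_0 uniform on {0, 1, ..., 10}, impose the single moment constraint E_P(X) = 3, and solve for the Lagrange multiplier; the exponentially tilted density p*(x) ∝ p_0(x) e^{λx} then meets the constraint.

    import numpy as np
    from scipy.optimize import brentq

    xs = np.arange(11)
    p0 = np.full(len(xs), 1.0 / len(xs))      # reference distribution (uniform)
    target = 3.0                              # moment constraint E_P(X) = 3

    def tilted(lam):
        w = p0 * np.exp(lam * xs)             # p*(x) ∝ p0(x) exp{lambda * x}
        return w / w.sum()

    def moment_gap(lam):
        return tilted(lam) @ xs - target

    lam = brentq(moment_gap, -10, 10)         # solve for the Lagrange multiplier
    p_star = tilted(lam)
    print(lam, p_star @ xs)                   # the constrained mean equals 3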

Conjugate predictive distributions and generalized entropies Generalized entropies Basically, there exists a duality between maximization of entropy and minimization of worst-case expected loss in a suitably defined statistical decision problem. Grünwald and Dawid (2004) extend the work of Topsøe by considering a generalized concept of entropy related to the choice of the loss function. They also discuss generalized divergences and the corresponding generalized exponential families of distributions.

Conjugate predictive distributions and generalized entropies Generalized entropies Consider the following parametrized family of divergences, for α > 1: D_α(p ‖ p_0) = [α²/(α − 1)] {1 − ∫ p(x) [p_0(x)/p(x)]^{1/α} η(dx)}, which includes the Kullback-Leibler divergence as the limiting case α → ∞. This is a particular case of the general class of f-divergences defined by D^(f)(p ‖ p_0) = ∫ f(p(x)/p_0(x)) p_0(x) η(dx), where f : [0, ∞) → (−∞, ∞] is a convex function, continuous at 0 and such that f(1) = 0.
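A quick numerical check of the limiting behaviour (a sketch of my own): for two normal densities, D_α computed by quadrature approaches the closed-form Kullback-Leibler divergence as α grows.

    import numpy as np
    from scipy.stats import norm
    from scipy.integrate import quad

    p = norm(0.0, 1.0)           # p
    p0 = norm(1.0, 1.5)          # reference p0

    def d_alpha(alpha):
        integrand = lambda x: p.pdf(x) * (p0.pdf(x) / p.pdf(x)) ** (1.0 / alpha)
        integral, _ = quad(integrand, -12, 12)
        return alpha**2 / (alpha - 1.0) * (1.0 - integral)

    # Closed-form KL divergence between N(m1, s1^2) and N(m2, s2^2).
    def kl_normal(m1, s1, m2, s2):
        return np.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

    for alpha in (2.0, 10.0, 100.0, 1000.0):
        print(alpha, d_alpha(alpha))
    print("KL:", kl_normal(0.0, 1.0, 1.0, 1.5))   # D_alpha approaches this value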

Conjugate predictive distributions and generalized entropies Generalized entropies Define the corresponding f-entropy by H^(f)(P) = −D^(f)(p ‖ p_0). Then the generalized maximum entropy distribution P* over the class Γ, if it exists, has a density of the form p*(x) = p_0(x) g(λ_0 + λᵀ s(x)), where g(·) = ḟ^{−1}(·) and ḟ denotes the first derivative of f.

Conjugate predictive distributions and generalized entropies Generalized entropies The α-divergence D_α(p ‖ p_0) is obtained when f(x) = f_α(x) ≡ [α/(α − 1)] {(1 − x) − αx(x^{−1/α} − 1)}. This is derived from the well-known limit lim_{α→∞} α(x^{1/α} − 1) = log(x). Here we shall focus on the corresponding generalized entropy H_α(P) = −D_α(p ‖ p_0), for which g(y) = 1/(1 − y/α)^α.
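These pieces fit together as follows (a verification sketch of my own): differentiating f_α gives ḟ_α(x) = α(1 − x^{−1/α}), whose inverse is exactly g(y) = (1 − y/α)^{−α}, and g(y) → e^y as α → ∞, recovering the exponential tilting of the Shannon case. The code checks both facts numerically.

    import numpy as np

    def f_dot(x, alpha):                # derivative of f_alpha: alpha * (1 - x^(-1/alpha))
        return alpha * (1.0 - x ** (-1.0 / alpha))

    def g(y, alpha):                    # claimed inverse of f_dot
        return (1.0 - y / alpha) ** (-alpha)

    alpha = 4.0
    x = np.linspace(0.2, 5.0, 9)
    print(np.max(np.abs(g(f_dot(x, alpha), alpha) - x)))    # ≈ 0: g inverts f_dot

    y = np.linspace(-2.0, 2.0, 5)
    for a in (10.0, 100.0, 10_000.0):
        print(a, np.max(np.abs(g(y, a) - np.exp(y))))       # → 0: g tends to exp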

Conjugate predictive distributions and generalized entropies Generalized exponential families The Student's t distribution, with density function St(x | µ, σ², γ) = [Γ((γ + 1)/2) / (Γ(γ/2) σ √(γπ))] [1 + (x − µ)²/(γσ²)]^{−(γ+1)/2} (x ∈ IR), maximizes H_α(P) over the class Γ_2 provided that α > 3/2 [here γ = 2α − 1]. On the other hand, the Student's t distribution can be expressed as St(x | µ, σ², γ) = ∫_0^∞ N(x | µ, σ²/y) Ga(y | γ/2, γ/2) dy. As γ → ∞, H_α(P) → H(P) and St(x | µ, σ², γ) → N(x | µ, σ²). The Student's t distribution can be regarded as a generalized exponential family in this sense.
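A numerical check of the scale-mixture representation (a sketch; here Ga(y | a, b) is read as a Gamma density with shape a and rate b): integrating N(x | µ, σ²/y) against Ga(y | γ/2, γ/2) reproduces the Student's t density.

    import numpy as np
    from scipy.stats import norm, gamma, t
    from scipy.integrate import quad

    mu, sigma, nu = 0.5, 1.3, 5.0        # location, scale and degrees of freedom (gamma)

    def t_by_mixture(x):
        # ∫ N(x | mu, sigma^2 / y) Ga(y | nu/2, rate nu/2) dy
        integrand = lambda y: (norm.pdf(x, loc=mu, scale=sigma / np.sqrt(y))
                               * gamma.pdf(y, a=nu / 2, scale=2.0 / nu))
        val, _ = quad(integrand, 0, np.inf)
        return val

    for x in (-2.0, 0.0, 1.5, 4.0):
        print(t_by_mixture(x), t.pdf(x, df=nu, loc=mu, scale=sigma))   # the two agree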

Conjugate predictive distributions and generalized entropies Generalized exponential families The Exponential-Gamma distribution, with density function EG(x | γ, γµ) = (1/µ) [1 + x/(γµ)]^{−(γ+1)} (x ∈ IR_+), maximizes H_α(P) over the class Γ_1 provided that α > 2 [here γ = α − 1]. On the other hand, the Exponential-Gamma distribution can be expressed as EG(x | γ, γµ) = ∫_0^∞ NE(x | µ/y) Ga(y | γ, γ) dy. Here NE(x | µ) denotes the density of a Negative Exponential distribution with mean µ. As γ → ∞, H_α(P) → H(P) and EG(x | γ, γµ) → NE(x | µ). The Exponential-Gamma distribution can also be regarded as a generalized exponential family in this sense.
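The same kind of check works here (a sketch; NE(x | µ) is the exponential density with mean µ and Ga(y | γ, γ) has shape γ and rate γ): mixing exponentials over the Gamma scale gives the closed form above, a Lomax-type density.

    import numpy as np
    from scipy.stats import expon, gamma
    from scipy.integrate import quad

    mu, gam = 2.0, 3.0                     # mean parameter and gamma (shape) parameter

    def eg_by_mixture(x):
        # ∫ NE(x | mu/y) Ga(y | gam, rate gam) dy, with NE(x | m) = (1/m) exp(-x/m)
        integrand = lambda y: expon.pdf(x, scale=mu / y) * gamma.pdf(y, a=gam, scale=1.0 / gam)
        val, _ = quad(integrand, 0, np.inf)
        return val

    def eg_closed_form(x):
        return (1.0 / mu) * (1.0 + x / (gam * mu)) ** (-(gam + 1.0))

    for x in (0.1, 1.0, 3.0, 8.0):
        print(eg_by_mixture(x), eg_closed_form(x))   # the two agree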

Conjugate predictive distributions and generalized entropies Generalized exponential families Results analogous to those discussed above are likely to hold for predictive distributions derived from other exponential-conjugate pairs, such as the Poisson-Gamma and Binomial-Beta distributions. However, the analysis in these discrete cases is not as simple, since one has to deal with (power) series instead of integrals. On the other hand, it is well known that the Poisson distribution is not the maximum entropy distribution over the class of distributions on the non-negative integers with a given mean (the geometric distribution claims that property).

Conjugate predictive distributions and generalized entropies Generalized exponential families Nonetheless, the Poisson distribution is the maximum entropy distribution over the class of ultra log-concave distributions on the non-negative integers with a given mean [Johnson, 2007]. So we can expect the Poisson-Gamma distribution to maximize the generalized entropy H_α(P) over the same class. A similar result could well hold in the Binomial case [Harremoës, 2001].

Conjugate predictive distributions and generalized entropies Generalized exponential families The predictive distributions discussed here are related to the q-exponential families introduced by Naudts (2009, 2011). These families are based on the so-called deformed exponential function and its inverse, the deformed logarithmic function, which can be used in the definition of the generalized entropy H_α(P). Such generalized exponential families share some of the nice properties of standard exponential families, but are not as tractable. These deformed logarithmic and exponential functions can be further generalized, and one may be able to define the f-entropy H^(f)(P) in terms of them [Naudts, 2004].

Discussion Concluding Remarks 1 There exists an interesting duality between the classical notion of unbiasedness and minimization of posterior expected quadratic loss. A similar result holds for the relative entropy loss if one uses a more general definition of unbiasedness. Using the relative entropy as loss function, one can achieve invariant L-unbiased Bayes estimators. In exponential family settings, such estimators can easily be computed for any (informative) conjugate prior, not only for unbiased priors.

Discussion Concluding Remarks 2 There is another interesting duality between maximization of entropy and minimization of worst-case expected loss. Conjugate predictive distributions derived from exponential family sampling models can be regarded as generalized exponential families in the sense that they maximize a generalized entropy. They can be seen as flexible/robust versions of the corresponding sampling model and used for data analysis in their own right. They inherit some of the tractability of the original sampling model. When used as sampling models for data analysis, their mixture representation leads to a simple analysis via Gibbs sampling.
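To illustrate that last point, here is a minimal Gibbs-sampling sketch of my own (not from the talk): with Student's t observations written as a scale mixture of normals, the sampler alternates between the latent scales y_i and the location µ, with σ and γ held fixed and a flat prior on µ.

    import numpy as np

    rng = np.random.default_rng(3)

    # Simulated data from a Student's t sampling model (location 1.0, scale 1.0, df 4).
    nu, sigma = 4.0, 1.0
    data = 1.0 + sigma * rng.standard_t(nu, size=200)

    # Gibbs sampler using t = scale mixture of normals, flat prior on mu:
    #   y_i | mu ~ Ga((nu + 1)/2, rate = (nu + ((x_i - mu)/sigma)^2) / 2)
    #   mu  | y  ~ N( sum(y_i x_i) / sum(y_i), sigma^2 / sum(y_i) )
    mu = float(np.mean(data))
    draws = []
    for it in range(3000):
        rate = 0.5 * (nu + ((data - mu) / sigma) ** 2)
        y = rng.gamma(shape=0.5 * (nu + 1.0), scale=1.0 / rate)
        w = y.sum()
        mu = rng.normal((y * data).sum() / w, sigma / np.sqrt(w))
        if it >= 500:
            draws.append(mu)

    print("posterior mean of mu ≈", np.mean(draws))   # close to the true location 1.0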

References
Grünwald & Dawid (2004). Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Annals of Statistics 32, 1367-1433.
GP & Mendoza (1999). A note on Bayes estimates for exponential families. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales (España) 93, 351-356.
GP & Mendoza (2013). Proper and non-informative conjugate priors for exponential family models. In Bayesian Theory and Applications (P. Damien, P. Dellaportas, N.G. Polson & D.A. Stephens, eds.). Oxford: Oxford University Press, chap. 19.
Harremoës, P. (2001). Binomial and Poisson distributions as maximum entropy distributions. IEEE Trans. Information Theory 47, 2039-2041.
Hartigan (1965). The asymptotically unbiased prior distribution. Annals of Mathematical Statistics 36, 1137-1152.
Hartigan (1998). The maximum likelihood prior. Annals of Statistics 26, 2083-2103.
Johnson, O. (2007). Log-concavity and the maximum entropy property of the Poisson distribution. Stochastic Process. Appl. 117, 791-802.
Lehmann (1951). A general concept of unbiasedness. Annals of Mathematical Statistics 22, 587-592.
Naudts, J. (2004). Estimators, escort probabilities, and φ-exponential families in statistical physics. Journal of Inequalities in Pure and Applied Mathematics 5, 102.
Naudts, J. (2009). The q-exponential family in statistical physics. Central European Journal of Physics 7, 405-413.
Naudts, J. (2011). Generalised Thermostatistics. Berlin: Springer-Verlag.
Noorbaloochi & Meeden (1983). Unbiasedness as the dual of being Bayes. Journal of the American Statistical Association 78, 619-623.
Topsøe, F. (1979). Information theoretical optimization techniques. Kybernetika 15, 8-27.
Wu & Vos (2012). Decomposition of Kullback-Leibler risk and unbiasedness for parameter-free estimators. Journal of Statistical Planning and Inference 142, 1525-1536.