Conjugate Predictive Distributions and Generalized Entropies
Slide 1: Conjugate Predictive Distributions and Generalized Entropies
Eduardo Gutiérrez-Peña, Department of Probability and Statistics, IIMAS-UNAM, Mexico. Padova, Italy, March 2013.
Slide 2: Menu
1. Antipasto/Appetizer: posterior unbiasedness
2. Piatto Principale/Main Course: conjugate predictive distributions as generalized exponential families
3. Dolce/Dessert: concluding remarks
Slide 5: Posterior unbiasedness — Unbiasedness
Why the interest in unbiasedness? There is an interesting duality between the classical notion of unbiasedness and minimization of posterior expected loss [Noorbaloochi & Meeden, 1983]. Some Bayes estimators $\hat\lambda = \hat\lambda(x^{(n)})$ turn out to be unbiased for certain transformations $\lambda = \lambda(\theta)$, in the sense that $E(\hat\lambda \mid \lambda) = \lambda$. This definition of unbiasedness is related to the squared error loss $L^2_\lambda(\hat\lambda, \lambda) = (\hat\lambda - \lambda)^2$, as is the Bayes estimator $\hat\lambda = E(\lambda \mid x^{(n)})$, whose dual (posterior) unbiasedness condition is $E(\lambda \mid \hat\lambda) = \hat\lambda$.
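The claim that the posterior mean is the Bayes estimator under squared error loss can be checked numerically. The sketch below is my own illustration, not from the talk: it uses a hypothetical discrete posterior on a grid and confirms that the action minimizing posterior expected squared error is the posterior mean.

```python
# Hypothetical discrete posterior for a parameter lambda on a grid
# (an illustration of mine, not from the talk): Beta(3, 5)-shaped weights.
grid = [i / 200 for i in range(1, 200)]
w = [lam**2 * (1 - lam)**4 for lam in grid]
Z = sum(w)
post = [wi / Z for wi in w]

post_mean = sum(l * p for l, p in zip(grid, post))

def expected_sq_loss(est):
    # posterior expected squared error loss of the action `est`
    return sum(p * (est - l)**2 for l, p in zip(grid, post))

# The action minimizing posterior expected L2 loss is the posterior mean
best = min(grid, key=expected_sq_loss)
assert abs(best - post_mean) < 1 / 200  # agreement up to grid resolution
```

The minimizer can only be located to within the grid spacing, hence the tolerance.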
Slide 6: Posterior unbiasedness — L-unbiasedness
For a general loss function $L(\hat\lambda, \lambda)$, we shall say that the estimator $\hat\lambda(x^{(n)})$ is L-unbiased for the parameter $\lambda$ if, for all $\lambda, \lambda' \in \Lambda$ [Lehmann, 1951],
$$E(L(\hat\lambda(X^{(n)}), \lambda) \mid \lambda) \le E(L(\hat\lambda(X^{(n)}), \lambda') \mid \lambda).$$
Also, for a loss function $L$ and a prior $p$, we shall say that the estimator $\hat\lambda$ is $(L, p)$-Bayes if it minimizes the corresponding posterior expected loss $\int L(\hat\lambda, \lambda)\, p(\lambda \mid x^{(n)})\, d\lambda$.
Slide 7: Posterior unbiasedness — L-unbiasedness
Using this general definition of unbiasedness, Hartigan (1965) showed that the prior which minimizes the asymptotic bias of the Bayes estimator relative to the loss function $L(\hat\lambda, \lambda)$ is given by
$$\pi(\lambda) \propto i_\lambda(\lambda) \left[ \left. \frac{\partial^2 L(\lambda, \omega)}{\partial \omega^2} \right|_{\omega = \lambda} \right]^{-1/2},$$
where $i_\lambda(\lambda)$ denotes the Fisher information. For the squared error loss $L^2_\lambda(\hat\lambda, \lambda)$, this becomes $\pi(\lambda) \propto i_\lambda(\lambda)$.
Slide 8: Posterior unbiasedness — L-unbiasedness
For exponential family models and squared error loss with respect to the mean parameter (i.e. $L^2_\mu(\hat\mu, \mu) = (\hat\mu - \mu)^2$), the unbiasedness of $\pi_\mu(\mu) \propto i_\mu(\mu)$ is exact [Hartigan, 1965]. This prior can be written as $\pi_\mu(\mu) \propto V(\mu)^{-1}$, while in terms of the canonical parameter it becomes $\pi_\theta(\theta) \propto 1$. Hence the $(L^2_\mu, \pi_\mu)$-Bayes estimator $\hat\mu = E(\mu \mid x^{(n)})$ is $L^2_\mu$-unbiased, and the corresponding unbiased prior is $\pi_\mu$. Under certain conditions, this also holds (in the usual $L^2_\lambda$ sense) for other parametrizations $\lambda = \lambda(\theta)$ [GP & Mendoza, 1999].
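As a concrete instance of the exact result above (a sketch of mine, not from the talk): for Poisson sampling, $V(\mu) = \mu$, so the unbiased prior is $\pi_\mu(\mu) \propto 1/\mu$ (flat on $\theta = \log\mu$), the posterior is $\mathrm{Ga}(\sum x_i, n)$, and the Bayes estimator $E(\mu \mid x^{(n)}) = \bar{x}$ is exactly unbiased. A seeded Monte Carlo check:

```python
import math
import random

random.seed(1)

# Poisson NEF with the "unbiased prior" pi_mu ∝ 1/mu (flat on theta).
# The posterior is Gamma(sum x_i, n), whose mean is the Bayes estimator:
def bayes_estimate(xs):
    return sum(xs) / len(xs)   # E(mu | x) = x-bar exactly

def rpois(mu):
    # Knuth-style inversion sampler for a Poisson draw (stdlib only)
    L, k, p = math.exp(-mu), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

mu_true, n, reps = 3.0, 10, 2000
avg = sum(bayes_estimate([rpois(mu_true) for _ in range(n)])
          for _ in range(reps)) / reps
assert abs(avg - mu_true) < 0.1   # E(mu-hat) = mu: exact L2-unbiasedness
```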
Slide 9: Posterior unbiasedness — Relative entropy loss
For an arbitrary parametrization $\lambda$, let $L_K(\hat\lambda, \lambda) = D_{KL}(p(x \mid \lambda) \,\|\, p(x \mid \hat\lambda))$ denote the relative entropy (Kullback-Leibler) loss, and let $L_K^*(\hat\lambda, \lambda) = D_{KL}(p(x \mid \hat\lambda) \,\|\, p(x \mid \lambda))$ denote the corresponding dual loss. Unlike the squared error loss $L^2_\lambda$, both of these loss functions are invariant under reparametrizations of the model, and so we may drop the subscript $\lambda$ from the notation. With this notation, $\hat\mu = E(\mu \mid x^{(n)})$ is not only $(L^2_\mu, \pi_\mu)$-Bayes but also $(L_K, \pi_\mu)$-Bayes.
Slide 10: Posterior unbiasedness — Relative entropy loss
Therefore, for the mean parameter of an exponential family, the $(L_K, \pi_\mu)$-Bayes estimator is $L^2_\mu$-unbiased. A partial dual result holds for the canonical parameter $\theta$. In this case, the $(L_K^*, \pi_\theta)$-Bayes estimator is $\hat\theta = E(\theta \mid x^{(n)})$, which is also the $(L^2_\theta, \pi_\theta)$-Bayes estimator of $\theta$. Suppose $\hat\theta$ is unbiased in the usual sense. Then, for the canonical parameter of an exponential family, the $(L_K^*, \pi_\theta)$-Bayes estimator is $L^2_\theta$-unbiased. The $(L_K, \pi_\theta)$-Bayes estimator of $\theta$ is also its posterior mode which, since $\pi_\theta(\theta) \propto 1$, coincides with the maximum likelihood estimator. In other words, $\pi_\theta(\theta) \propto 1$ is a maximum likelihood prior as defined by Hartigan (1998).
Slide 11: Posterior unbiasedness — Relative entropy loss
Unlike $(L^2, p)$-Bayes estimators, both $(L_K, p)$- and $(L_K^*, p)$-Bayes estimators are invariant under reparametrizations of the model. Wu & Vos (2012) have shown that, for exponential families, the maximum likelihood estimator is $L_K^*$-unbiased (regardless of the parametrization). We can rephrase this result to say that the $(L_K, \pi)$-Bayes estimator is $L_K^*$-unbiased, so it is unbiased in a dual sense. Here $\pi$ is the prior induced by the uniform prior on the canonical parameter $\theta$.
Slide 12: Posterior unbiasedness — Relative entropy loss
It follows that the $L_K^*$-unbiased estimator of an arbitrary transformation $\lambda = \lambda(\mu)$ of the mean parameter always exists, and can be obtained by correspondingly transforming the usual $L^2_\mu$-unbiased estimator $\hat\mu = E(\mu \mid x^{(n)})$, i.e. $\hat\lambda = \lambda(\hat\mu)$, which in this case coincides with the maximum likelihood estimator [GP & Mendoza, 2013]. Such a $\hat\lambda$ is not necessarily unbiased in the usual $L^2_\lambda$ sense.
Slide 13: Posterior unbiasedness — Relative entropy loss
Again, a partial dual result can be obtained for the canonical parameter. Using results from Wu & Vos (2012), it can be shown that an estimator is $L_K$-unbiased for $\theta$ if it is $L^2_\theta$-unbiased. Recall that the $(L_K^*, \pi)$-Bayes estimator is $\hat\theta = E(\theta \mid x^{(n)})$. If $\hat\theta$ happens to be unbiased in the usual sense, then the $(L_K^*, \pi)$-Bayes estimator is $L_K$-unbiased. In this case, the $L_K$-unbiased estimator of an arbitrary transformation $\lambda = \lambda(\theta)$ of the canonical parameter can be obtained by correspondingly transforming the usual $L^2_\theta$-unbiased estimator $\hat\theta$, i.e. $\hat\lambda = \lambda(\hat\theta)$. As before, such a $\hat\lambda$ is not necessarily unbiased in the usual $L^2_\lambda$ sense.
Slide 14: Conjugate predictive distributions and generalized entropies — Conjugate predictive distributions
Natural exponential family: $p_\theta(x \mid \theta) = b(x) \exp\{\theta x - M(\theta)\}$ $(\theta \in \Xi)$, with $M(\theta) = \log \int b(x) \exp\{\theta x\}\, \eta(dx)$.
Canonical parameter space: $\Xi = \{\theta \in \mathbb{R} : M(\theta) < \infty\}$.
Mean-value parameter: $\mu = \mu(\theta) = E[X \mid \theta] = dM(\theta)/d\theta$.
Mean-value parameter space: $\Omega = \mu(\Xi)$.
Variance function: $V(\mu) = \mathrm{Var}[X \mid \theta(\mu)]$, i.e. $d^2M(\theta)/d\theta^2$ evaluated at $\theta = \theta(\mu)$.
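The definitions above can be sanity-checked for a concrete family. A small sketch of mine for the Poisson case, where $b(x) = 1/x!$ and $M(\theta) = e^\theta$, so $\mu = M'(\theta) = e^\theta$ and $V(\mu) = M''(\theta) = \mu$; the derivative identities are verified by central finite differences:

```python
import math

# Poisson NEF: M(theta) = exp(theta), so mu(theta) = M'(theta) = exp(theta)
# and V(mu) = M''(theta) = mu. Check the identities by finite differences.
M = math.exp          # cumulant function of the Poisson family
theta, h = 0.7, 1e-4

mu = (M(theta + h) - M(theta - h)) / (2 * h)               # ~ M'(theta)
var = (M(theta + h) - 2 * M(theta) + M(theta - h)) / h**2  # ~ M''(theta)

assert abs(mu - math.exp(theta)) < 1e-6    # mean parameter mu = exp(theta)
assert abs(var - mu) < 1e-4                # variance function V(mu) = mu
```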
Slide 15: Conjugate predictive distributions and generalized entropies — Conjugate predictive distributions
Let $X_1, \ldots, X_n \sim p_\theta(x \mid \theta)$. Then the sufficient statistic $\bar{x}$ has density $p_\theta(\bar{x} \mid \theta, n) = \bar{b}(\bar{x}, n) \exp\{n[\theta \bar{x} - M(\theta)]\}$.
Likelihood: $L_\theta(\theta \mid \bar{x}, n) \propto \exp\{n[\theta \bar{x} - M(\theta)]\}$.
Fisher information: $i_\theta(\theta) = d^2M(\theta)/d\theta^2$; in terms of the mean-value parameter, $i_\mu(\mu) = V(\mu)^{-1}$.
Slide 16: Conjugate predictive distributions and generalized entropies — Conjugate predictive distributions
Natural conjugate family: $\pi_\theta(\theta \mid x_0, n_0) \propto L_\theta(\theta \mid x_0, n_0)$ $(x_0 \in \mathbb{R},\ n_0 \in \mathbb{R})$.
Normalized form: $\pi_\theta(\theta \mid x_0, n_0) = h(x_0, n_0) \exp\{n_0[x_0\theta - M(\theta)]\}$, with $h(x_0, n_0)^{-1} = \int \exp\{n_0[x_0\theta - M(\theta)]\}\, d\theta$ (DY-conjugate prior).
Slide 17: Conjugate predictive distributions and generalized entropies — Conjugate predictive distributions
Prior predictive: $p(x \mid x_0, n_0) = b(x)\, h(x_0, n_0) / h(\tilde{x}_0, n_0 + 1)$, with $\tilde{x}_0 = (n_0 x_0 + x)/(n_0 + 1)$.
Posterior predictive: $p(x \mid x_1, n_1) = b(x)\, h(x_1, n_1) / h(\tilde{x}_1, n_1 + 1)$, with $\tilde{x}_1 = (n_1 x_1 + x)/(n_1 + 1)$, where $x_1 = (n_0 x_0 + n\bar{x})/(n_0 + n)$ and $n_1 = n_0 + n$.
In general, this is not an exponential family.
Slide 18: Conjugate predictive distributions and generalized entropies — Conjugate predictive distributions
As $n \to \infty$, $x_1 \to \mu$ and $p(x \mid x_1, n_1) \to p_\mu(x \mid \mu)$, where $p_\mu(x \mid \mu) = b(x) \exp\{\theta(\mu)x - M(\theta(\mu))\}$.
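For a concrete conjugate pair (a sketch of mine, assuming Poisson sampling, so that the DY-conjugate prior on $\mu$ is $\mathrm{Ga}(n_0 x_0, n_0)$), the predictive works out to a negative binomial distribution; the code checks that it is a proper distribution with mean $x_0$:

```python
import math

# Prior predictive for Poisson sampling with its conjugate Gamma prior
# (my own sketch in the slides' notation: a = n0*x0, b = n0, so that
# mu ~ Gamma(a, b) and the predictive is negative binomial).
def log_predictive(x, x0, n0):
    a, b = n0 * x0, n0
    return (math.lgamma(a + x) - math.lgamma(a) - math.lgamma(x + 1)
            + a * math.log(b / (b + 1)) - x * math.log(b + 1))

x0, n0 = 2.5, 4.0
pmf = [math.exp(log_predictive(x, x0, n0)) for x in range(200)]

assert abs(sum(pmf) - 1.0) < 1e-10            # a proper distribution
mean = sum(x * p for x, p in enumerate(pmf))
assert abs(mean - x0) < 1e-8                  # predictive mean = x0
```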
Slide 19: Conjugate predictive distributions and generalized entropies — Generalized entropies
The maximum entropy principle (MEP) is called upon when one wishes to select a single distribution $P$ (for a random variable $X$) as a representative of a class $\Gamma$ of probability distributions. Typically $\Gamma = \Gamma_d \equiv \{P : E_P(S) = \varsigma\}$, where $\varsigma \in \mathbb{R}^d$ and $S = s(X)$ is a statistic taking values in $\mathbb{R}^d$. The Shannon entropy of the distribution $P$ is defined by
$$H(P) = -\int p(x) \log\left\{\frac{p(x)}{p_0(x)}\right\} \eta(dx) = -D_{KL}(p \,\|\, p_0)$$
and describes the uncertainty inherent in $p(\cdot)$ relative to $p_0(\cdot)$.
Slide 20: Conjugate predictive distributions and generalized entropies — Generalized entropies
If there exists a distribution $P^*$ maximizing $H(P)$ over the class $\Gamma_d$, then its density $p^*$ satisfies
$$p^*(x) = \frac{p_0(x) \exp\{\lambda^T s(x)\}}{\int p_0(x) \exp\{\lambda^T s(x)\}\, \eta(dx)},$$
where $\lambda = (\lambda_1, \ldots, \lambda_d)$ is a vector of Lagrange multipliers determined by the moment constraints $E_{P^*}(S) = \varsigma$ which define the class $\Gamma_d$. Despite its intuitive appeal, the MEP is controversial. However, Topsøe (1979) provides an interesting decision-theoretic justification for the MEP.
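The constrained maximization above can be illustrated with a small solver. This is my own example (not from the talk): base measure uniform on $\{0,\dots,20\}$ and a single moment constraint $E(X) = 6$; the maximizer is an exponential tilt $p^*(x) \propto e^{\lambda x}$, and the Lagrange multiplier is found by bisection on the tilted mean:

```python
import math

# Maximum-entropy solution under one moment constraint E(X) = 6,
# base measure p0 uniform on {0,...,20} (a made-up discrete example).
xs = list(range(21))

def tilted_mean(lam):
    # mean of the exponentially tilted distribution p*(x) ∝ exp(lam*x)
    w = [math.exp(lam * x) for x in xs]
    return sum(x * wi for x, wi in zip(xs, w)) / sum(w)

lo, hi = -5.0, 5.0        # bisection: tilted_mean is increasing in lam
for _ in range(100):
    mid = (lo + hi) / 2
    if tilted_mean(mid) < 6.0:
        lo = mid
    else:
        hi = mid
lam = (lo + hi) / 2

assert abs(tilted_mean(lam) - 6.0) < 1e-9   # constraint is satisfied
```

The tilted mean is strictly increasing in $\lambda$ (its derivative is the tilted variance), which is what makes bisection valid here.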
Slide 21: Conjugate predictive distributions and generalized entropies — Generalized entropies
Basically, there exists a duality between maximization of entropy and minimization of worst-case expected loss in a suitably defined statistical decision problem. Grünwald and Dawid (2004) extend the work of Topsøe by considering a generalized concept of entropy related to the choice of the loss function. They also discuss generalized divergences and the corresponding generalized exponential families of distributions.
Slide 22: Conjugate predictive distributions and generalized entropies — Generalized entropies
Consider the following parametrized family of divergences, for $\alpha > 1$:
$$D_\alpha(p \,\|\, p_0) = \frac{\alpha^2}{\alpha - 1} \int \left\{ 1 - \left[\frac{p_0(x)}{p(x)}\right]^{1/\alpha} \right\} p(x)\, \eta(dx),$$
which includes the Kullback-Leibler divergence as the limiting case $\alpha \to \infty$. This is a particular case of the general class of $f$-divergences, defined by
$$D^{(f)}(p \,\|\, p_0) = \int f\!\left(\frac{p(x)}{p_0(x)}\right) p_0(x)\, \eta(dx),$$
where $f : [0, \infty) \to (-\infty, \infty]$ is a convex function, continuous at $0$ and such that $f(1) = 0$.
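The limiting case $\alpha \to \infty$ can be verified numerically. A sketch of mine with two made-up distributions on a four-point space:

```python
import math

# Check numerically that the alpha-divergence approaches Kullback-Leibler
# as alpha -> infinity, for two made-up distributions on {0,1,2,3}.
p  = [0.1, 0.2, 0.3, 0.4]
p0 = [0.25, 0.25, 0.25, 0.25]

def D_alpha(alpha):
    return (alpha**2 / (alpha - 1)) * sum(
        pi * (1.0 - (qi / pi) ** (1.0 / alpha)) for pi, qi in zip(p, p0))

kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, p0))

err_100, err_10000 = abs(D_alpha(100) - kl), abs(D_alpha(10000) - kl)
assert err_10000 < err_100      # the gap shrinks as alpha grows
assert err_10000 < 1e-3         # and D_alpha ~ KL for large alpha
```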
Slide 23: Conjugate predictive distributions and generalized entropies — Generalized entropies
Define the corresponding $f$-entropy by $H^{(f)}(P) = -D^{(f)}(p \,\|\, p_0)$. Then the generalized maximum entropy distribution $P^*$ over the class $\Gamma$, if it exists, has a density of the form
$$p^*(x) = p_0(x)\, g(\lambda_0 + \lambda^T s(x)),$$
where $g(\cdot) = \dot{f}^{-1}(\cdot)$ and $\dot{f}$ denotes the first derivative of $f$.
Slide 24: Conjugate predictive distributions and generalized entropies — Generalized entropies
The $\alpha$-divergence $D_\alpha(p \,\|\, p_0)$ is obtained when
$$f(x) = f_\alpha(x) \equiv \frac{\alpha}{\alpha - 1}\left\{(1 - x) - \alpha x\,(x^{-1/\alpha} - 1)\right\}.$$
The Kullback-Leibler limiting case follows from the well-known identity $\lim_{\alpha \to \infty} \alpha(x^{1/\alpha} - 1) = \log(x)$. Here we shall focus on the corresponding generalized entropy $H_\alpha(P) = -D_\alpha(p \,\|\, p_0)$, for which $g(y) = (1 - y/\alpha)^{-\alpha}$.
Slide 25: Conjugate predictive distributions and generalized entropies — Generalized exponential families
The Student-$t$ distribution, with density function
$$\mathrm{St}(x \mid \mu, \sigma^2, \gamma) = \frac{\Gamma((\gamma+1)/2)}{\sqrt{\gamma\pi}\;\Gamma(\gamma/2)\,\sigma} \left[1 + \frac{(x - \mu)^2}{\gamma\sigma^2}\right]^{-(\gamma+1)/2} \quad (x \in \mathbb{R}),$$
maximizes $H_\alpha(P)$ over the class $\Gamma_2$ provided that $\alpha > 3/2$ [here $\gamma = 2\alpha - 1$]. On the other hand, the Student-$t$ distribution can be expressed as
$$\mathrm{St}(x \mid \mu, \sigma^2, \gamma) = \int_0^\infty \mathrm{N}(x \mid \mu, \sigma^2/y)\, \mathrm{Ga}(y \mid \gamma/2, \gamma/2)\, dy.$$
As $\gamma \to \infty$, $H_\alpha(P) \to H(P)$ and $\mathrm{St}(x \mid \mu, \sigma^2, \gamma) \to \mathrm{N}(x \mid \mu, \sigma^2)$. The Student-$t$ distribution can be regarded as a generalized exponential family in this sense.
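The scale-mixture representation can be checked by direct quadrature. This is my own sketch (simple midpoint rule over the mixing variable, not from the talk):

```python
import math

# Numerical check of the scale-mixture representation:
# St(x | mu, s2, g) = integral of N(x | mu, s2/y) Ga(y | g/2, g/2) dy.
def t_pdf(x, mu, s2, g):
    c = (math.lgamma((g + 1) / 2) - math.lgamma(g / 2)
         - 0.5 * math.log(g * math.pi * s2))
    return math.exp(c - 0.5 * (g + 1) * math.log1p((x - mu)**2 / (g * s2)))

def gamma_pdf(y, a, b):   # shape a, rate b
    return math.exp(a * math.log(b) - math.lgamma(a)
                    + (a - 1) * math.log(y) - b * y)

def norm_pdf(x, mu, v):
    return math.exp(-0.5 * (x - mu)**2 / v) / math.sqrt(2 * math.pi * v)

def mixture_pdf(x, mu, s2, g, n=50000, ymax=40.0):
    # midpoint quadrature over the mixing variable y on (0, ymax]
    h = ymax / n
    return sum(norm_pdf(x, mu, s2 / y) * gamma_pdf(y, g / 2, g / 2) * h
               for y in (h * (i + 0.5) for i in range(n)))

mu, s2, g = 0.0, 1.0, 5.0
for x in (-2.0, 0.0, 1.5):
    assert abs(t_pdf(x, mu, s2, g) - mixture_pdf(x, mu, s2, g)) < 1e-5
```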
Slide 26: Conjugate predictive distributions and generalized entropies — Generalized exponential families
The Exponential-Gamma distribution, with density function
$$\mathrm{EG}(x \mid \gamma, \gamma\mu) = \frac{1}{\mu}\left[1 + \frac{x}{\gamma\mu}\right]^{-(\gamma+1)} \quad (x \in \mathbb{R}_+),$$
maximizes $H_\alpha(P)$ over the class $\Gamma_1$ provided that $\alpha > 2$ [here $\gamma = \alpha - 1$]. On the other hand, the Exponential-Gamma distribution can be expressed as
$$\mathrm{EG}(x \mid \gamma, \gamma\mu) = \int_0^\infty \mathrm{NE}(x \mid \mu/y)\, \mathrm{Ga}(y \mid \gamma, \gamma)\, dy,$$
where $\mathrm{NE}(x \mid \mu)$ denotes the density of a Negative Exponential distribution with mean $\mu$. As $\gamma \to \infty$, $H_\alpha(P) \to H(P)$ and $\mathrm{EG}(x \mid \gamma, \gamma\mu) \to \mathrm{NE}(x \mid \mu)$. The Exponential-Gamma distribution can also be regarded as a generalized exponential family in this sense.
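A quick numerical check of the stated limit. The Lomax-type closed form below is my own evaluation of the mixture integral, consistent with the $\mathrm{EG}(x \mid \gamma, \gamma\mu)$ notation:

```python
import math

def eg_pdf(x, g, mu):
    # Closed form obtained by integrating the NE/Gamma mixture
    # (my own evaluation): (1/mu) * (1 + x/(g*mu))**(-(g+1))
    return (1.0 / mu) * (1.0 + x / (g * mu)) ** (-(g + 1))

def ne_pdf(x, mu):
    # Negative Exponential density with mean mu
    return math.exp(-x / mu) / mu

mu = 2.0
gaps = [max(abs(eg_pdf(x, g, mu) - ne_pdf(x, mu)) for x in (0.5, 1.3, 4.0))
        for g in (10.0, 100.0, 1000.0)]
assert gaps[0] > gaps[1] > gaps[2]   # EG approaches NE as gamma grows
assert gaps[2] < 1e-3
```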
Slide 27: Conjugate predictive distributions and generalized entropies — Generalized exponential families
Results analogous to those discussed above are likely to hold for predictive distributions derived from other exponential-conjugate pairs, such as the Poisson-Gamma and Binomial-Beta distributions. However, the analysis in these discrete cases is not as simple, since one has to deal with (power) series instead of integrals. On the other hand, it is well known that the Poisson distribution is not the maximum entropy distribution over the class of distributions on the non-negative integers with a given mean (the geometric distribution claims this property).
Slide 28: Conjugate predictive distributions and generalized entropies — Generalized exponential families
Nonetheless, the Poisson distribution is the maximum entropy distribution over the class of ultra log-concave distributions on the non-negative integers [Johnson, 2007]. So we can expect the Poisson-Gamma distribution to maximize the generalized entropy $H_\alpha(P)$ over the same class. A similar result could well hold in the Binomial case [Harremoës, 2001].
Slide 29: Conjugate predictive distributions and generalized entropies — Generalized exponential families
The predictive distributions discussed here are related to the $q$-exponential families introduced by Naudts (2009, 2011). These families are based on the so-called deformed exponential function and its inverse, the deformed logarithmic function, which can be used in the definition of the generalized entropy $H_\alpha(P)$. Such generalized exponential families share some of the nice properties of standard exponential families, but are not as tractable. These deformed logarithmic and exponential functions can be further generalized, and one may be able to define the $f$-entropy $H^{(f)}(P)$ in terms of them [Naudts, 2004].
Slide 31: Discussion — Concluding Remarks 1
There exists an interesting duality between the classical notion of unbiasedness and minimization of posterior expected quadratic loss. A similar result holds for the relative entropy loss if one uses a more general definition of unbiasedness. Using the relative entropy as loss function, one can achieve invariant L-unbiased Bayes estimators. In exponential family settings, such estimators can easily be computed for any (informative) conjugate prior, not only for unbiased priors.
Slide 34: Discussion — Concluding Remarks 2
There is another interesting duality between maximization of entropy and minimization of worst-case expected loss. Conjugate predictive distributions derived from exponential family sampling models can be regarded as generalized exponential families, in the sense that they maximize a generalized entropy. They can be seen as flexible/robust versions of the corresponding sampling model and used for data analysis in their own right, and they inherit some of the tractability of the original sampling model. When used as sampling models for data analysis, their mixture representation leads to a simple analysis via Gibbs sampling.
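A minimal sketch of such a Gibbs analysis for the Student-$t$ sampling model via its normal/Gamma mixture representation, under assumptions of mine (known $\sigma^2$ and $\gamma$, flat prior on $\mu$); the full conditionals follow directly from the mixture, $x_i \mid y_i \sim \mathrm{N}(\mu, \sigma^2/y_i)$ with $y_i \sim \mathrm{Ga}(\gamma/2, \gamma/2)$:

```python
import math
import random

# Gibbs sampler sketch for Student-t data (my own illustration; scale s2
# and degrees of freedom g known, flat prior on mu). Full conditionals:
#   y_i | mu ~ Ga((g+1)/2, rate = (g + (x_i - mu)^2 / s2) / 2)
#   mu  | y  ~ N( sum(y_i x_i)/sum(y_i), s2/sum(y_i) )
random.seed(2)
s2, g, mu_true = 1.0, 5.0, 4.0

# synthetic t-distributed data generated via the same mixture
data = []
for _ in range(200):
    y = random.gammavariate(g / 2, 2 / g)     # shape, scale = 1/rate
    data.append(random.gauss(mu_true, math.sqrt(s2 / y)))

mu, draws = 0.0, []
for it in range(2000):
    ys = [random.gammavariate((g + 1) / 2, 2 / (g + (x - mu)**2 / s2))
          for x in data]
    sy = sum(ys)
    mu = random.gauss(sum(y * x for y, x in zip(ys, data)) / sy,
                      math.sqrt(s2 / sy))
    if it >= 500:                             # discard burn-in
        draws.append(mu)

post_mean = sum(draws) / len(draws)
assert abs(post_mean - mu_true) < 0.3   # posterior concentrates near mu_true
```

Note that `random.gammavariate` takes a shape and a scale, so rates are passed as reciprocals.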
Slide 35: References
- Grünwald, P. & Dawid, A.P. (2004). Game theory, maximum entropy, minimum discrepancy and robust Bayesian decision theory. Annals of Statistics 32.
- GP & Mendoza (1999). A note on Bayes estimates for exponential families. Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales (España) 93.
- GP & Mendoza, M. (2013). Proper and non-informative conjugate priors for exponential family models. In Bayesian Theory and Applications (P. Damien, P. Dellaportas, N.G. Polson & D.A. Stephens, eds.), Oxford University Press, chap. 19.
- Harremoës, P. (2001). Binomial and Poisson distributions as maximum entropy distributions. IEEE Trans. Information Theory 47.
- Hartigan (1965). The asymptotically unbiased prior distribution. Annals of Mathematical Statistics 36.
- Hartigan (1998). The maximum likelihood prior. Annals of Statistics 26.
- Johnson, O. (2007). Log-concavity and the maximum entropy property of the Poisson distribution. Stochastic Process. Appl. 117.
- Lehmann (1951). A general concept of unbiasedness. Annals of Mathematical Statistics 22.
- Naudts, J. (2004). Estimators, escort probabilities, and φ-exponential families in statistical physics. Journal of Inequalities in Pure and Applied Mathematics 5, 102.
- Naudts, J. (2009). The q-exponential family in statistical physics. Central European Journal of Physics 7.
- Naudts, J. (2011). Generalised Thermostatistics. Berlin: Springer-Verlag.
- Noorbaloochi & Meeden (1983). Unbiasedness as the dual of being Bayes. Journal of the American Statistical Association 78.
- Topsøe, F. (1979). Information theoretical optimization techniques. Kybernetika 15.
- Wu & Vos (2012). Decomposition of Kullback-Leibler risk and unbiasedness for parameter-free estimators. Journal of Statistical Planning and Inference 142.
More informationParametric Techniques
Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure
More informationEntropy measures of physics via complexity
Entropy measures of physics via complexity Giorgio Kaniadakis and Flemming Topsøe Politecnico of Torino, Department of Physics and University of Copenhagen, Department of Mathematics 1 Introduction, Background
More informationArtificial Intelligence
Artificial Intelligence Probabilities Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: AI systems need to reason about what they know, or not know. Uncertainty may have so many sources:
More information1. Fisher Information
1. Fisher Information Let f(x θ) be a density function with the property that log f(x θ) is differentiable in θ throughout the open p-dimensional parameter set Θ R p ; then the score statistic (or score
More informationBayes spaces: use of improper priors and distances between densities
Bayes spaces: use of improper priors and distances between densities J. J. Egozcue 1, V. Pawlowsky-Glahn 2, R. Tolosana-Delgado 1, M. I. Ortego 1 and G. van den Boogaart 3 1 Universidad Politécnica de
More informationMathematical statistics
October 1 st, 2018 Lecture 11: Sufficient statistic Where are we? Week 1 Week 2 Week 4 Week 7 Week 10 Week 14 Probability reviews Chapter 6: Statistics and Sampling Distributions Chapter 7: Point Estimation
More informationProbabilistic Graphical Models. Theory of Variational Inference: Inner and Outer Approximation. Lecture 15, March 4, 2013
School of Computer Science Probabilistic Graphical Models Theory of Variational Inference: Inner and Outer Approximation Junming Yin Lecture 15, March 4, 2013 Reading: W & J Book Chapters 1 Roadmap Two
More informationStatistical Theory MT 2007 Problems 4: Solution sketches
Statistical Theory MT 007 Problems 4: Solution sketches 1. Consider a 1-parameter exponential family model with density f(x θ) = f(x)g(θ)exp{cφ(θ)h(x)}, x X. Suppose that the prior distribution has the
More informationA BAYESIAN MATHEMATICAL STATISTICS PRIMER. José M. Bernardo Universitat de València, Spain
A BAYESIAN MATHEMATICAL STATISTICS PRIMER José M. Bernardo Universitat de València, Spain jose.m.bernardo@uv.es Bayesian Statistics is typically taught, if at all, after a prior exposure to frequentist
More informationStat 260/CS Learning in Sequential Decision Problems.
Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Multi-armed bandit algorithms. Exponential families. Cumulant generating function. KL-divergence. KL-UCB for an exponential
More informationCS 361: Probability & Statistics
March 14, 2018 CS 361: Probability & Statistics Inference The prior From Bayes rule, we know that we can express our function of interest as Likelihood Prior Posterior The right hand side contains the
More informationLet us first identify some classes of hypotheses. simple versus simple. H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided
Let us first identify some classes of hypotheses. simple versus simple H 0 : θ = θ 0 versus H 1 : θ = θ 1. (1) one-sided H 0 : θ θ 0 versus H 1 : θ > θ 0. (2) two-sided; null on extremes H 0 : θ θ 1 or
More informationCS 591, Lecture 2 Data Analytics: Theory and Applications Boston University
CS 591, Lecture 2 Data Analytics: Theory and Applications Boston University Charalampos E. Tsourakakis January 25rd, 2017 Probability Theory The theory of probability is a system for making better guesses.
More informationThe Expectation-Maximization Algorithm
1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable
More informationBrief Review on Estimation Theory
Brief Review on Estimation Theory K. Abed-Meraim ENST PARIS, Signal and Image Processing Dept. abed@tsi.enst.fr This presentation is essentially based on the course BASTA by E. Moulines Brief review on
More informationEstimation and Maintenance of Measurement Rates for Multiple Extended Target Tracking
FUSION 2012, Singapore 118) Estimation and Maintenance of Measurement Rates for Multiple Extended Target Tracking Karl Granström*, Umut Orguner** *Division of Automatic Control Department of Electrical
More informationBeta statistics. Keywords. Bayes theorem. Bayes rule
Keywords Beta statistics Tommy Norberg tommy@chalmers.se Mathematical Sciences Chalmers University of Technology Gothenburg, SWEDEN Bayes s formula Prior density Likelihood Posterior density Conjugate
More informationIntroduction to Bayesian Statistics
Bayesian Parameter Estimation Introduction to Bayesian Statistics Harvey Thornburg Center for Computer Research in Music and Acoustics (CCRMA) Department of Music, Stanford University Stanford, California
More informationMidterm Examination. STA 215: Statistical Inference. Due Wednesday, 2006 Mar 8, 1:15 pm
Midterm Examination STA 215: Statistical Inference Due Wednesday, 2006 Mar 8, 1:15 pm This is an open-book take-home examination. You may work on it during any consecutive 24-hour period you like; please
More informationProbability and Estimation. Alan Moses
Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.
More informationIEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm
IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.
More information14 : Theory of Variational Inference: Inner and Outer Approximation
10-708: Probabilistic Graphical Models 10-708, Spring 2017 14 : Theory of Variational Inference: Inner and Outer Approximation Lecturer: Eric P. Xing Scribes: Maria Ryskina, Yen-Chia Hsu 1 Introduction
More informationIterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk
Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk John Lafferty School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 lafferty@cs.cmu.edu Abstract
More informationBayesian Inference. Chapter 4: Regression and Hierarchical Models
Bayesian Inference Chapter 4: Regression and Hierarchical Models Conchi Ausín and Mike Wiper Department of Statistics Universidad Carlos III de Madrid Master in Business Administration and Quantitative
More informationExperimental Design to Maximize Information
Experimental Design to Maximize Information P. Sebastiani and H.P. Wynn Department of Mathematics and Statistics University of Massachusetts at Amherst, 01003 MA Department of Statistics, University of
More informationarxiv: v3 [stat.me] 11 Feb 2018
arxiv:1708.02742v3 [stat.me] 11 Feb 2018 Minimum message length inference of the Poisson and geometric models using heavy-tailed prior distributions Chi Kuen Wong, Enes Makalic, Daniel F. Schmidt February
More informationarxiv: v1 [cs.lg] 1 May 2010
A Geometric View of Conjugate Priors Arvind Agarwal and Hal Daumé III School of Computing, University of Utah, Salt Lake City, Utah, 84112 USA {arvind,hal}@cs.utah.edu arxiv:1005.0047v1 [cs.lg] 1 May 2010
More informationTesting Statistical Hypotheses
E.L. Lehmann Joseph P. Romano, 02LEu1 ttd ~Lt~S Testing Statistical Hypotheses Third Edition With 6 Illustrations ~Springer 2 The Probability Background 28 2.1 Probability and Measure 28 2.2 Integration.........
More informationGraduate Econometrics I: Maximum Likelihood I
Graduate Econometrics I: Maximum Likelihood I Yves Dominicy Université libre de Bruxelles Solvay Brussels School of Economics and Management ECARES Yves Dominicy Graduate Econometrics I: Maximum Likelihood
More informationReview and continuation from last week Properties of MLEs
Review and continuation from last week Properties of MLEs As we have mentioned, MLEs have a nice intuitive property, and as we have seen, they have a certain equivariance property. We will see later that
More informationECE531 Lecture 10b: Maximum Likelihood Estimation
ECE531 Lecture 10b: Maximum Likelihood Estimation D. Richard Brown III Worcester Polytechnic Institute 05-Apr-2011 Worcester Polytechnic Institute D. Richard Brown III 05-Apr-2011 1 / 23 Introduction So
More informationLecture 23 Maximum Likelihood Estimation and Bayesian Inference
Lecture 23 Maximum Likelihood Estimation and Bayesian Inference Thais Paiva STA 111 - Summer 2013 Term II August 7, 2013 1 / 31 Thais Paiva STA 111 - Summer 2013 Term II Lecture 23, 08/07/2013 Lecture
More informationORIGINS OF STOCHASTIC PROGRAMMING
ORIGINS OF STOCHASTIC PROGRAMMING Early 1950 s: in applications of Linear Programming unknown values of coefficients: demands, technological coefficients, yields, etc. QUOTATION Dantzig, Interfaces 20,1990
More informationThe Information Bottleneck Revisited or How to Choose a Good Distortion Measure
The Information Bottleneck Revisited or How to Choose a Good Distortion Measure Peter Harremoës Centrum voor Wiskunde en Informatica PO 94079, 1090 GB Amsterdam The Nederlands PHarremoes@cwinl Naftali
More informationThe Particle Filter. PD Dr. Rudolph Triebel Computer Vision Group. Machine Learning for Computer Vision
The Particle Filter Non-parametric implementation of Bayes filter Represents the belief (posterior) random state samples. by a set of This representation is approximate. Can represent distributions that
More informationFundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner
Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization
More informationExpectation Maximization
Expectation Maximization Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr 1 /
More informationA View on Extension of Utility-Based on Links with Information Measures
Communications of the Korean Statistical Society 2009, Vol. 16, No. 5, 813 820 A View on Extension of Utility-Based on Links with Information Measures A.R. Hoseinzadeh a, G.R. Mohtashami Borzadaran 1,b,
More informationECE 275A Homework 7 Solutions
ECE 275A Homework 7 Solutions Solutions 1. For the same specification as in Homework Problem 6.11 we want to determine an estimator for θ using the Method of Moments (MOM). In general, the MOM estimator
More information