Bayesian Estimation of the Entropy of the Multivariate Gaussian

Size: px

Start display at page:

Download "Bayesian Estimation of the Entropy of the Multivariate Gaussian"

Millicent Carroll
5 years ago
Views:

1 Bayesian Estimation of the Entropy of the Multivariate Gaussian Santosh Srivastava Fre Hutchinson Cancer Research Center Seattle, WA 989, USA Maya R. Gupta Department of Electrical Engineering University of Washington Seattle, WA 989, USA Abstract Estimating the entropy of a Gaussian istribution from samples rawn from the istribution is a ifficult problem when the number of samples is smaller than the number of imensions. A new Bayesian entropy estimator is propose using an inverte Wishart istribution an a ata-epenent prior that hanles the small-sample case. Experiments for six ifferent cases show that the propose estimator provies goo performance for the small-sample case compare to the stanar nearest-neighbor entropy estimator. Aitionally, it is shown that the Bayesian estimate forme by taking the expecte entropy minimizes expecte Bregman ivergence. I. INTRODUCTION Entropy is a useful escription of the preictability of a istribution, an has use in many applications of coing, machine learning, signal processing, communications, an chemistry [] []. In practice, many continuous generating istributions are moele as Gaussian istributions. One reason for this is the central limit theorem, another reason is because the Gaussian is the maximum entropy istribution given an empirical mean an covariance, an as such is a least assumptive moel. For the multivariate Gaussian istribution, the entropy goes as the log eterminant of the covariance; specifically, the ifferential entropy of a -imensional ranom vector X rawn from the Gaussian N µ, Σ is hx = N x ln N xx = lnπ ln Σ. In this paper we consier the practical problem of estimating the entropy of a generating normal istribution N given samples x, x,..., x n that are assume to be realizations of ranom vectors X, X,..., X n rawn ii from N µ, Σ. In particular, we focus on the ifficult limite-ata case that n, which occurs often in practice with high-imensional ata. One approach to estimating hx is to first estimate the Gaussian istribution, for example with maximum likelihoo ML estimates of µ an Σ, an then plug-in the covariance estimate to []. Such estimates are usually infeasible when n. The ML estimate is also negatively biase so that it unerestimates the entropy on average []. We believe it is more effective an will create fewer numerical problems if one irectly estimates a single value for the entropy, rather than first estimating the entire covariance matrix in orer to prouce the scalar entropy estimate. To this en, we propose a ataepenent Bayesian estimate using an inverte Wishart prior. The choice of prior in Bayesian estimation can ramatically change the results, but there is little practical guiance for choosing priors. The main contributions of this work are showing that the inverte Wishart prior enables estimates for n, an that using a rough ata-epenent estimate as a parameter to the inverte Wishart prior yiels a more robust estimator than ignoring the ata when creating the prior. II. RELATED WORK First we review relate work in parametric entropy estimation, then in nonparametric estimation. Ahme an Gokhale investigate uniformly minimum variance unbiase UMVU entropy estimators for parametric istributions []. They showe that the UMVU entropy estimate for the Gaussian is, ĥ UMVU = ln π ln S n i, where S = n x i xx i x T, an the igamma function is efine z = z ln Γz, where Γ is the stanar gamma function. Bayesian entropy estimation for the multivariate normal was first propose in by Misra et al. []. They form an entropy estimate by substituting ln Σ for ln Σ in where ln Σ minimizes the square-error risk of the entropy estimate. That is, ln Σ solves, [ arg min E µ,σ δ ln Σ ], δ R where here Σ enotes a ranom covariance matrix an µ a ranom vector, an the expectation is taken with respect to the posterior. We will enote by µ an Σ respectively realizations of the ranom µ an Σ. Misra et al. consier ifferent priors with support over the set of symmetric positive efinite Σ. They show that given the prior p µ, Σ =, the solution to yiels the Bayesian entropy estimate ĥ Bayesian = ln π ln S Σ n i.

2 For iscrete ranom variables, a Bayesian approach that uses binning for estimating functionals such as entropy was propose the same year as Misra et al. s work []. It is interesting to note that the estimates ĥumvu an ĥ Bayesian were erive from ifferent perspectives, but iffer only slightly in the igamma argument. Like ĥumvu, Misra et al. show that ĥbayesian is an unbiase entropy estimate. They also show that ĥbayesian is ominate by a Stein-type estimator, ĥ Stein = ln S n x x T c, where c is a function of an n. Further, they also show that the estimate ĥbayesian is ominate by a Brewster-Ziek-type estimator ĥbz, ĥ BZ = ln S n x x T c. where c is a function of S an x x T that requires calculating the ratio of two efinite integrals, state in full in. of []. Misra et al. foun that on simulate numerical experiments their Stein-type an Brewster-Ziek-type estimators achieve roughly only % improvement over the simpler Bayesian estimate ĥbayesian, an thus they recommen using the computationally much simpler ĥbayesian in applications. There are two practical problems with the previously propose parametric estimators. First, the estimates given by,, an the other propose Misra et al. estimators require calculating the eterminant of S or S x x T, which can be numerically infeasible if there are few samples. Secon, the Bayesian estimate ĥbayesian uses the igamma function of n which requires n > samples so that the igamma has a non-negative argument, an similarly ĥumvu uses the igamma of n, which requires n samples. Thus, although the knowlege that one is estimating the entropy of a Gaussian shoul be of use, for the n case one must turn to nonparametric entropy estimators. A thorough review of work in nonparametric entropy estimation up to 99 was written by Beirlant et al. [], incluing ensity estimation approaches, sample-spacing approaches, an nearest-neighbor estimators. Recently, Nilsson an Kleijn show that high-rate quantization approximations of Zaor an Gray can be use to estimate Renyi entropy, an that the limiting case of Shannon entropy prouces a nearest-neighbor estimate that epens on the number of quantization cells [8]. The special case of their nearest-neighbor estimate that best valiates the high-rate quantization assumptions is when the number of quantization cells is as large as possible. They show that this special case prouces the nearest-neighbor ifferential entropy estimator originally propose by Kozachenko an Leonenko in 98 [9]: ĥ NN = n ln x i x i, lnn γ ln V n where x i, is x i s nearest neighbor in the sample set, γ is the Euler-Mascheroni constant, an V is the volume of the -imensional hypersphere with raius : V = π/ Γ/. A problem with this approach is that in practice ata samples may not be in general position; for example, image pixel ata are usually quantize to 8 bits or bits. Thus, it can happen in practice that two samples have the exact same measure value an thus x n x n, is zero an thus the entropy estimate coul be ill-efine. Though there are various fixes, such as pre-ithering the quantize ata, it is not clear what effect these fixes coul have on the estimate entropy. A ifferent approach is taken by Costa an Hero [], [], []; they use the Bearwoo Halton Hammersley result that the function of the length of a minimum spanning graph converges to the Renyi entropy [] to form an estimator base on the empirical length of a minimum spanning tree of ata. Unfortunately, how to use this approach to estimate Shannon entropy remains an open question. III. BAYESIAN ESTIMATE WITH INVERTED WISHART PRIOR We propose to estimate the entropy as, E N [hn], where N is a ranom Gaussian, an the prior pn is an inverte Wishart istribution with scale parameter q an parameter matrix B. We use a Fisher information metric to efine a measure over the Riemannian manifol forme by the set of Gaussian istributions. These choices for prior an measure are very similar to the choices that we foun worke very well for Bayesian quaratic iscriminant analysis [], an further etails on this framework can be foun in that work. The resulting propose inverte Wishart Bayesian entropy estimate is ĥ iw Bayesian = ln π ln S B n q i. Proof: To show that is E N [hn], we will nee to integrate ln Σ exp[ tr Σ V] Σ Σ q Σ> q V E Σ [ln Σ ] Γ q where Σ is a ranom covariance matrix rawn from an inverte Wishart istribution with scale parameter q an matrix parameter V. Recall that for any matrix V, Σ / V χ q i [, Corollary.], where χ enotes the chisquare ranom variable. Take the natural log of both sies an use the fact that A = A to show that ln Σ ln V ln χ q i. Then after taking E Σ[ln Σ ], 8 becomes Γ q ln V E [ ln χ ] V q q i = Γ q q i ln V, 9 V q where the secon line uses the property of the χ istribution that E[ln χ q] = ln q. 8

3 Now we will use the above integral ientity to fin ĥ iwbayesian = E N [hn]. Solving for E N [hn] only requires computing Σ E N [ln Σ ] = Σ ln Σ p Σ x, x,..., x n Σ where the term / Σ / results from the Fisher information metric which converts the integral E N [hn] from an integral over the statistical manifol of Gaussians to an integral over covariance matrices, an the posterior p Σ x, x,..., x n is given in [] such that E N [ln Σ ] S B nq = nq Γ nq ln Σ exp[ tr Σ S B] Σ = Σ> nq S B nq Γ nq ln S B Σ nq Γ nq SB nq n q i n q i = ln S B ln, where equation follows by using the fact that is given by 9 for V = S B/. Then replacing ln Σ in with the compute E N [ln Σ ] prouces the estimator. A. Choice of Prior Parameters q an B The inverte Wishart istribution is an unimoal prior that gives maximum a priori probability to Gaussians with Σ = B/q. Previous work using the inverte Wishart for Bayesian estimation of Gaussians has use the ientity matrix for B [], or a scale ientity where the scale factor was learne by cross-valiation given labele training ata for classification []. Using B = I sets the maximum of the prior at I/q, regarless of whether the ata are measure in nanometers or trillions of ollars. To a rough approximation, the prior regularizes the likelihoo towars the prior maximum at B/q, an thus the bias ae by using the prior with B/q = I can be ill-suite to the problem. Instea, setting B/q to be a rough estimate of the covariance can a bias that is more appropriate for the problem. For example, we have shown that using B = qiags can work well when estimating Gaussians for classification by Bayesian quaratic iscriminant analysis []. For entropy estimation, B = qiags/n works excellently if the true covariance is iagonal, but can perform poorly if the true covariance is a full covariance because the eterminant of B = iags/n can be significantly higher than the eterminant of S, biasing the entropy estimate to be too high. An optimal choice of B when there is no prior information about S remains an open question; we propose using B = q ln iags, n, for n, an for n > we let B be the matrix of zeros. The scalar prior parameter q changes the peake-ness of the inverte Wishart prior, with higher q corresponing to a more peake prior, an thus higher bias. For the entropy problem there is only one number to estimate, an thus we believe that as little bias as possible shoul be use. To achieve this, we set q = min n,. Letting q = when n > an using the zero matrix for B in this range makes ĥiwbayesian equivalent to ĥbayesian when n >. B. Bregman Divergences an Bayesian Estimation Misra et al. showe that E Σ [ ln Σ ] minimizes the square error loss as state in. Here we show that E N [hn] minimizes any Bregman ivergence loss. The Bregman ivergences are a class of loss functions that inclue square error an relative entropy [], []. Lemma: The mean entropy with respect to uncertainty in the generating istribution E N [hn] solves [ ] arg min E N φ hn, ĥ, ĥ R where φ a, b = φa φb φb a b for a, b, R is any Bregman ivergence with strictly convex φ. Proof: Set the first erivative to zero: = [ ĥe N φhn φĥ ] ĥ [ ĥφĥhn ] = E N φĥ hn ĥ. ĥ Because φ is strictly convex, φĥ >, an thus it must ĥ be that E N [hn ĥ] =, an thus by linearity that the minimizer is ĥ = E N [hn]. IV. EXPERIMENTS We compare the propose inverte Wishart Bayesian estimator ĥiwbayesian with the ata-epenent given in to ĥ iwbayesian with B = I, to the nearest-neighbor estimator given in, to the maximum likelihoo estimator forme by replacing µ an Σ in by maximum likelihoo estimates of µ an Σ, to ĥumvu given in, an to ĥbayesian given in. All results were compute with Matlab.. For the igamma function an Euler-Mascheroni constant we use the corresponing built-in Matlab commans. Simulations were run for a fixe imension of = with varying number of ii samples n rawn from ranom Gaussians. For n we it was not possible to calculate the igamma functions for ĥbayesian an ĥumvu or Σ, thus ĥ Bayesian, ĥumvu, an the maximum likelihoo estimates are only reporte for n >. The nearest-neighbor estimate an the inverte Wishart Bayesian estimates are compare own to n = samples. In the first simulation, each generating Gaussian ha a iagonal covariance matrix with elements rawn ii from the

4 Log of Mean Square Estimation Error ML Bayesian UMVU NN Inv. Wishart Bayesian Data epenent Inv. Wishart Bayesian # Ranomly Drawn Samples Log of Mean Square Estimation Error 9 8 # Ranomly Drawn Samples Diagonal Covariance, Elements U, ] Full Covariance R T R, Elements of R N, Log of Mean Square Estimation Error Log of Mean Square Estimation Error # Ranomly Drawn Samples # Ranomly Drawn Samples Diagonal Covariance, Elements U, ] Full Covariance R T R, Elements of R N, 9 Log of Mean Square Estimation Error 8 Log of Mean Square Estimation Error # Ranomly Drawn Samples # Ranomly Drawn Samples Diagonal Covariance, Elements U,.] Full Covariance R T R, Elements of R N,. Fig.. Comparison of entropy estimators average over, ii runs of each simulation.

5 uniform istribution on, α]. The results are shown in Fig. left for α = top, for α = mile, an for α =. bottom. Fig. shows the average eigenvalues. In the secon simulation, each generating Gaussian ha a full covariance matrix R T R, where each of the elements of R was rawn ii from a normal istribution N, α. The results are shown in Fig. right for α = top, for α = mile, an for α =. bottom. Fig. shows the average eigenvalues. For each n an each of the six cases, the simulation was run, times an the results average to prouce the plots. Average Log Eigenvalue Average Log Eigenvalue 8 Eigenvalue Rank Diagonal covariance from first simulation Eigenvalue Rank Full covariance from secon simulation Fig.. Average log ranke eigenvalues for the first simulation top an secon simulation bottom. For n > the three Bayesian estimates are equivalent an perform consistently better than ĥumv U, the maximum likelihoo estimate, or the nearest-neighbor estimate. The maximum likelihoo estimator is always the worst parametric estimator. Throughout the simulations, the nearest-neighbor estimate makes the least use of aitional samples, improving its estimate only slowly. This is reasonable because the nearestneighbor estimator oes not explicitly use the information that the true istribution is Gaussian. Given very few samples the nearest-neighbor estimator is the best performer for two of the full covariance cases. This suggests that ifferent prior parameter settings coul be more effective when there are few samples, perhaps a prior that as more bias. As expecte, in general the ata-epenent ĥ iwbayesian achieves lower error than ĥiwbayesian with B = I, sometimes significantly better, as in the case of true iagonal covariance with elements rawn from U,.], shown in Fig. bottom. V. CONCLUSIONS A ata-epenent approach to Bayesian entropy estimation was given that minimizes expecte Bregman ivergence an performs consistently well compare to other possible estimators for the high-imensional/limite ata case that n. REFERENCES [] T. Cover an J. Thomas, Elements of Information Theory. Unite States of America: John Wiley an Sons, 99. [] A. Hero an O. Michel, Asymptotic theory of greey approximations to minimal k-point ranom graphs, IEEE Trans. Information Theory, vol., pp. 9 99, 999. [] J. Costa an A. Hero, Geoesic entropic graphs for imension an entropy estimation in manifol learning, IEEE Trans. Signal Processing, vol., no. 8, pp.,. [] J. Beirlant, E. Duewicz, L. Györfi, an E. van er Meulen, Nonparametric entropy estimation: An overview, International Journal Mathematical an Statistical Sciences, vol., pp. 9, 98. [] N. Misra, H. Singh, an E. Demchuk, Estimation of the entropy of a multivariate normal istribution, Journal of Multivariate Analysis, vol. 9, pp.,. [] N. A. Ahme an D. V. Gokhale, Entropy expressions an their estimators for multivariate istributions, IEEE Trans. Information Theory, pp. 88 9, 989. [] D. Enres an P. Föliàk, Bayesian bin istribution inference an mutual information, IEEE Trans. Information Theory, pp. 9,. [8] M. Nilsson an W. B. Kleijn, On the estimation of ifferential entropy from ata locate on embee manifols, IEEE Trans. on Information Theory, vol., no., pp.,. [9] L. F. Kozachenko an N. N. Leonenko, Sample estimate of entropy of a ranom vector, Problems in Information Transmission, vol., no., pp. 9, 98. [] A. Hero, B. Ma, O. Michel, an J. Gorman, Applications of entropic spanning graphs, IEEE Signal Processing Magazine, pp. 8 9,. [] J. Bearwoo, J. H. Halton, an J. M. Hammersley, The shortest path through many points, Proc. Cambrige Philo. Soc., vol., pp. 99, 99. [] S. Srivastava, M. R. Gupta, an B. A. Frigyik, Bayesian quaratic iscriminant analysis, Journal of Machine Learning Research, vol. 8, pp. 8,. [] M. Biloeau an D. Brenner, Theory of Multivariate Statistics. New York: Springer Texts in Statistics, 999. [] D. G. Keehn, A note on learning for Gaussian properties, IEEE Trans. on Information Theory, vol., pp., 9. [] P. J. Brown, T. Fearn, an M. S. Haque, Discrimination with many variables, Journal of the American Statistical Association, vol. 9, no. 8, pp. 9, 999. [] A. Banerjee, X. Guo, an H. Wang, On the optimality of conitional expectation as a Bregman preictor, IEEE Trans. on Information Theory, vol., no., pp. 9,. [] B. A. Frigyik, S. Srivastava, an M. R. Gupta, Functional Bregman ivergence an Bayesian estimation of istributions, arxiv preprint cs.it/,.

Influence of weight initialization on multilayer perceptron performance

Influence of weight initialization on multilayer perceptron performance M. Karouia (1,2) T. Denœux (1) R. Lengellé (1) (1) Université e Compiègne U.R.A. CNRS 817 Heuiasyc BP 649 - F-66 Compiègne ceex -