Exact Calculation of Normalized Maximum Likelihood Code Length Using Fourier Analysis
Atsushi Suzuki and Kenji Yamanishi
The University of Tokyo, Graduate School of Information Science and Technology, Bunkyo, Tokyo, Japan
arXiv: v1 [math.ST] 11 Jan 2018

Abstract: The normalized maximum likelihood code length has been widely used in model selection, and its favorable properties, such as its consistency and the upper bound of its statistical risk, have been demonstrated. This paper proposes a novel methodology for calculating the normalized maximum likelihood code length on the basis of Fourier analysis. Our methodology provides an efficient non-asymptotic calculation formula for exponential family models and an asymptotic calculation formula for general parametric models under a weaker assumption than that in previous work.

I. INTRODUCTION

A. Background and Our Contribution

The normalized maximum likelihood (NML) code length is an extension of self-entropy in which a set of distributions is given instead of the true distribution. When the true distribution is known, the lower bound of the mean code length for a random variable is given by the Shannon entropy of its probability distribution, and this bound is attained by the self-entropy [18]. This optimal code, or the self-entropy, can also be interpreted as the solution of a trivial optimization problem: minimize the worst-case log redundancy over code length functions l subject to the Kraft-McMillan inequality [8], [9]:

\min_l \max_{x^N} \left[ l(x^N) - \left( -\log f_0(x^N) \right) \right] \quad \text{s.t.} \quad \int dx^N \exp(-l(x^N)) \le 1,  (1)

where x^N \overset{\mathrm{def}}{=} x_1, x_2, \ldots, x_N is a data sequence and f_0 denotes the probability density function of the data-generating distribution. Apparently, the optimal code length is given by l(x^N) = -\log f_0(x^N) (the self-entropy). Note that we discuss cases of continuous random variables in this paper. Further, the base of the logarithm is e, and the natural unit of information (nat) is used throughout this paper.
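The optimality of the self-entropy in (1) can be sanity-checked numerically in a discrete analogue (the paper treats continuous variables; the three-symbol source below is a toy assumption of this illustration, not an example from the paper). By Gibbs' inequality, coding with any distribution q other than the true f_0 only increases the expected code length:

```python
import math

def expected_code_length(p, q):
    """Expected code length E_p[-log q(x)] in nats when coding source p with code -log q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Toy three-symbol source (hypothetical values).
p = [0.5, 0.3, 0.2]
entropy = expected_code_length(p, p)  # self-entropy attains the Shannon bound

# Any other distribution q satisfying the Kraft-McMillan constraint codes worse.
for q in ([0.4, 0.4, 0.2], [1/3, 1/3, 1/3], [0.6, 0.3, 0.1]):
    assert expected_code_length(p, q) >= entropy
```

Here -log q plays the role of the code length l in (1), and the Kraft-McMillan constraint holds with equality since q sums to one.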
The optimization problem (1), or the original Shannon entropy, deals with the case in which the true distribution is known. When a set of distributions F is given as candidates for the true distribution instead of the true distribution itself, we can extend the optimization problem above to the problem introduced by Shtarkov [19]:

\min_l \max_{x^N} \left[ l(x^N) - \min_{f \in F} \left( -\log f(x^N) \right) \right] \quad \text{s.t.} \quad \int dx^N \exp(-l(x^N)) \le 1.  (2)

This problem is no longer trivial, and Shtarkov showed that the NML code length defined below attains its minimum [19]:

l_{\mathrm{NML}}(x^N) \overset{\mathrm{def}}{=} -\log f_{\mathrm{NML}}(x^N) = -\log \max_{f \in F} f(x^N) + \log \int_{X^N} \max_{f \in F} f(x^N) \, dx^N.  (3)

The problem (2) reduces to (1) if F = \{f_0\}, and in this sense, the NML code length is an extension of the self-entropy. The NML code is one of the universally optimal codings when the true distribution in the given set is unknown [11]. The NML code length is widely used in model selection on the basis of the minimum description length (MDL) principle [16], [12], [21], [14]. Here, the model that minimizes the NML code length for given data is selected. Recently, it has been shown that the NML code length bounds the generalized loss [3].

The calculation of the NML code length has been an important problem. Rissanen derived an asymptotic formula for the NML code length [13], which clarified the behavior of the NML code length up to o(1) terms as follows:

\log \int dx^N f(x^N; \hat{\theta}(x^N)) = \frac{K}{2} \log \frac{N}{2\pi} + \log \int_{\Theta} d\theta \, \sqrt{\det I(\theta)} + o(1),  (4)

where \hat{\theta}(x^N) denotes the maximum likelihood estimator, I(\theta) the Fisher information matrix, and K the dimension of the parameter. This formula holds under certain regularity conditions and does not depend on the details of the model. According to this formula, we can apply Nishii's analysis of consistency of the selected model [10] and Barron and Cover's result on statistical risk [1] to model selection using the NML code length. In contrast to the generality of Rissanen's asymptotic formula, non-asymptotic calculation formulae have been derived through model-by-model discussion [15], [5], [6], [7], [17]. Recently, Hirai and Yamanishi non-asymptotically calculated the NML code length for several models in the exponential family [4]. They reduced the calculation of the NML code length to an integral over the parameter domain of a function denoted by g in their paper. However, the method to obtain Hirai and Yamanishi's g-function explicitly depends on the model. Thus, the exact calculation of the NML code length has been limited to particular models.

This paper proposes a novel methodology for calculating the NML code length on the basis of the Fourier transform. Our methodology enables the systematic analysis of the NML code length in terms of both asymptotic expansion and exact calculation. As corollaries, our methodology provides an asymptotic formula with weaker assumptions than Rissanen's and a useful exact calculation formula for the exponential family.

B. Significance of This Paper

This paper proposes an alternative form of the NML code length based on the Fourier transform. Our form enables the systematic calculation of the NML code length. Specifically, it results in the two formulae presented below.

1) Asymptotic Formula with Weaker Assumption: Taking the limit of our form leads to Rissanen's asymptotic formula [13]. It should be noted that Lebesgue's dominated convergence theorem can be applied to our Fourier-transform-based form, which results in an asymptotic formula with a weaker assumption than that in the original paper [13].

2) Exact Calculation Formula for Exponential Family: Our Fourier-transform-based form gives a simple formula for the exact calculation of the NML code length of the exponential family. The formula yields the NML code length from the partition function and the relationship between the canonical parameters and the expectation of the sufficient statistics.

C. Related Work

1) Asymptotic Formula with Weaker Assumption: The conclusion of the asymptotic formula in this paper is the same as that of Rissanen's theorem [13].
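To see how tight the asymptotic formula is at finite sample sizes, one can compare it against an exactly computable parametric complexity. The Bernoulli model (K = 1) is a standard test case, used here as a toy check assumed for illustration and not a model treated in this paper (it is also discrete, whereas this paper treats continuous variables): its exact complexity is the multinomial sum studied in [5], its Fisher information is I(\theta) = 1/(\theta(1-\theta)), and \int_0^1 \sqrt{I(\theta)}\, d\theta = \pi.

```python
import math

def exact_log_pc_bernoulli(N):
    # C = sum_{k=0}^{N} binom(N, k) (k/N)^k ((N-k)/N)^(N-k), with 0^0 := 1
    total = 0.0
    for k in range(N + 1):
        log_term = math.lgamma(N + 1) - math.lgamma(k + 1) - math.lgamma(N - k + 1)
        if 0 < k < N:
            log_term += k * math.log(k / N) + (N - k) * math.log((N - k) / N)
        total += math.exp(log_term)
    return math.log(total)

def rissanen_log_pc_bernoulli(N):
    # (K/2) log(N / 2 pi) + log of the integral of sqrt(det I(theta)),
    # with K = 1 and the integral equal to pi for the Bernoulli model.
    return 0.5 * math.log(N / (2 * math.pi)) + math.log(math.pi)

# At N = 1000 the two values differ by only a few hundredths of a nat.
print(exact_log_pc_bernoulli(1000), rissanen_log_pc_bernoulli(1000))
```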
However, Rissanen's theorem assumes both the uniform asymptotic normality of the maximum likelihood estimator and the existence of a non-zero lower bound and a finite upper bound of the Fisher information; in contrast, our theorem does not involve these assumptions and allows the Fisher information to converge to zero or diverge toward the boundary.

2) Exact Calculation Formula for Exponential Family: Hirai and Yamanishi presented the exact calculation for several models in the exponential family through the integral of the g-function. However, in general, it is still difficult to obtain the explicit form of the g-function. In this paper, a general exact calculation formula for the exponential family is obtained, which includes Hirai and Yamanishi's results.

II. NORMALIZED MAXIMUM LIKELIHOOD CODE LENGTH

We consider a sequence x^N \overset{\mathrm{def}}{=} x_1, x_2, \ldots, x_N of continuous random variables and assume that they have a probability density function.

Definition 1. Let F \subset \{ f : X \to [0, \infty) \mid \int_X f(x)\, dx = 1 \} denote a set of density functions. Here, X \subset \mathbb{R}^D denotes the domain of a datum. Assume that \max_{f \in F} f(x^N) is a measurable function of x^N. The NML code length is defined as the negative log-likelihood of the NML distribution as follows:

l_{\mathrm{NML}}(x^N) \overset{\mathrm{def}}{=} -\log \max_{f \in F} f(x^N) + \log C(F),  (5)

where C(F) \overset{\mathrm{def}}{=} \int_{X^N} dx^N \max_{f \in F} f(x^N). C(F), or its logarithm, is called the parametric complexity (PC) of F.

In this paper, we focus on the case in which it is easy to evaluate the first term but difficult to evaluate the parametric complexity. This is because, when even the first term is intractable, it is hardly possible to evaluate the second term exactly.

In this paper, we consider the independent identically distributed parametric model

F_\Theta = \left\{ f : X^N \to [0, \infty) \,\middle|\, f(x^N) = \prod_{n=1}^{N} f(x_n; \theta),\ \theta \in \Theta \subset \mathbb{R}^K \right\}  (6)

as the set of density functions. Here, \theta is the parameter and \Theta is the domain of the parameter. We mainly analyze a proper parameter domain \Theta_\Pi \subset \Theta, defined as follows.

Definition 2. A subset \Theta_\Pi of \Theta is proper if the following conditions are satisfied:
1) The map \Theta_\Pi \ni \theta \mapsto f(\cdot\,; \theta) \in F_\Pi \subset F is bijective (one to one).
2) For all x^N \in X^N, a unique solution \hat{\theta}(x^N) of \max_{\theta \in \Theta_\Pi} f(x^N; \theta) exists; that is, a unique maximum likelihood estimator (MLE) \hat{\theta}(x^N) exists.
3) \max_{\theta \in \Theta_\Pi} f(x^N; \theta) is a measurable function of x^N.
4) If \theta \in \Theta_\Pi and x^N \sim \prod_{n=1}^{N} f(x_n; \theta), the asymptotic normality of the MLE \hat{\theta} holds; that is, \sqrt{N}(\hat{\theta}(x^N) - \theta) \to \mathcal{N}(0, I(\theta)^{-1}), where I(\theta) denotes the Fisher information matrix.

We also define the proper data sequence domain X^N_\Pi as follows:

X^N_\Pi \overset{\mathrm{def}}{=} \left\{ x^N \in X^N \,\middle|\, \max_{\theta \in \Theta_\Pi} f(x^N; \theta) = \max_{\theta \in \Theta} f(x^N; \theta) \right\}.  (7)

Remark 1. Sufficient conditions for 4) have been discussed (for example, see [20]). At least the positive definiteness of I(\theta) on \Theta_\Pi is required for 4).

Remark 2. Since \Theta_\Pi \subset \Theta, the following holds in general: \max_{\theta \in \Theta_\Pi} f(x^N; \theta) \le \max_{\theta \in \Theta} f(x^N; \theta).

Remark 3. In this paper, \hat{\theta} always denotes the unique MLE on \Theta_\Pi. If \Theta_\Pi \subsetneq \Theta, the MLE in \Theta can be non-unique.

Roughly speaking, the proper parameter domain is a tractable subset of the model, and the proper data sequence domain is the set of sequences whose MLE lies in the proper parameter domain. The PC can be decomposed as follows:

\int_{X^N} dx^N \max_{\theta \in \Theta} f(x^N; \theta) = \int_{X^N} dx^N f(x^N; \hat{\theta}(x^N)) + \int_{X^N \setminus X^N_\Pi} dx^N \left[ \max_{\theta \in \Theta} f(x^N; \theta) - f(x^N; \hat{\theta}(x^N)) \right].  (8)

Remark 4. If we can take \Theta_\Pi = \Theta, as is often the case with well-behaved models such as exponential family models, the second term vanishes, and the logarithm of the first term is equivalent to the parametric complexity. We assume that the second term is ignorable and focus on the first term, C(\Theta_\Pi) \overset{\mathrm{def}}{=} \int_{X^N} dx^N f(x^N; \hat{\theta}(x^N)), in this paper.

Note that C(\Theta_\Pi) integrates over excessive data sequences and often diverges to infinity. To avoid this problem, we introduce luckiness [2] to generalize C(\Theta_\Pi) as follows:

Definition 3. Let w : \Theta_\Pi \to [0, \infty) denote a weight function on \Theta_\Pi called luckiness. We define the luckiness parametric complexity (LPC) of \Theta_\Pi as follows:

C_w(\Theta_\Pi) \overset{\mathrm{def}}{=} \int dx^N f(x^N; \hat{\theta}(x^N))\, w(\hat{\theta}(x^N)),  (9)

where \hat{\theta}(x^N) \overset{\mathrm{def}}{=} \operatorname{argmax}_{\theta \in \Theta_\Pi} f(x^N; \theta).

Remark 5. If w(\theta) \equiv 1, the LPC is equivalent to the PC. Let A be a subset of \Theta_\Pi. We can regard the LPC C_{1\{A\}}(\Theta_\Pi) as a restriction of the PC C(\Theta_\Pi) to A, where 1\{\cdot\} denotes the indicator function. This restriction is often necessary and is used in continuous-variable cases [2], [4].

III. FOURIER FORM OF NML CODE LENGTH

First, we make assumptions that allow us to exchange integrals.

Assumption 1.
1) For all \Phi_0 \subset \Theta_\Pi that have measure zero, \{ x^N \mid \hat{\theta}(x^N) \in \Phi_0 \} has measure zero.
2) For all x^N, f(x^N; \theta) w(\theta) is integrable and square-integrable as a function of \theta.
3) For all x^N, the Fourier transform \hat{f}_w(x^N; \xi) of f(x^N; \theta) w(\theta) is integrable as a function of \xi, where

\hat{f}_w(x^N; \xi) \overset{\mathrm{def}}{=} \left( \frac{1}{2\pi} \right)^{K/2} \int_{\Theta_\Pi} d\theta \, \exp(-i \xi^T \theta) f(x^N; \theta) w(\theta).  (10)

We obtain the Fourier-transform-based form of the NML code length as follows:

Theorem 1. Under Assumption 1, the LPC is calculated as follows:

\int dx^N f(x^N; \hat{\theta}(x^N))\, w(\hat{\theta}(x^N)) = \int d\theta \, w(\theta)\, g_N(\theta),  (12)

where

g_N(\theta) \overset{\mathrm{def}}{=} \left( \frac{1}{2\pi} \right)^{K} \int d\xi \int dx^N f(x^N; \theta) \exp\left( i \xi^T (\hat{\theta}(x^N) - \theta) \right).  (13)

Proof: Since f(x^N; \cdot)\, w(\cdot) \in L^1(\Theta_\Pi) \cap L^2(\Theta_\Pi), we have

f(x^N; \theta)\, w(\theta) = \left( \frac{1}{2\pi} \right)^{K/2} \int d\xi \, \exp(i \xi^T \theta)\, \hat{f}_w(x^N; \xi) \quad \text{a.s.}  (14)

Thus, the following holds with Assumption 1-1):

\int dx^N f(x^N; \hat{\theta})\, w(\hat{\theta}) = \left( \frac{1}{2\pi} \right)^{K/2} \int dx^N \int d\xi \, \exp(i \xi^T \hat{\theta})\, \hat{f}_w(x^N; \xi) = \left( \frac{1}{2\pi} \right)^{K/2} \int d\xi \int dx^N \exp(i \xi^T \hat{\theta})\, \hat{f}_w(x^N; \xi),  (15)

where the last equation follows from the absolute integrability of \hat{f}_w(x^N; \xi) and Fubini's theorem.
We exchange integrals again likewise as follows:

\int d\xi \int dx^N \exp(i \xi^T \hat{\theta})\, \hat{f}_w(x^N; \xi) = \left( \frac{1}{2\pi} \right)^{K/2} \int d\xi \int dx^N \exp(i \xi^T \hat{\theta}) \int_{\Theta_\Pi} d\theta \, \exp(-i \xi^T \theta) f(x^N; \theta) w(\theta) = \left( \frac{1}{2\pi} \right)^{K/2} \int d\theta \, w(\theta) \int d\xi \int dx^N \exp\left( i \xi^T (\hat{\theta} - \theta) \right) f(x^N; \theta),  (16)

where the last equation follows from the absolute integrability of \hat{f}_w(x^N; \xi) and Fubini's theorem. By the third assumption, we can exchange the integrals with respect to \theta and \xi, which completes the proof.

We also assume the following condition, which is used in Section IV:

4) The characteristic function \varphi^N_\theta(\xi) of the maximum likelihood estimator is integrable as a function of \theta and \xi, where

\varphi^N_\theta(\xi) \overset{\mathrm{def}}{=} \int dx^N f(x^N; \theta) \exp\left( i \xi^T \sqrt{N} (\hat{\theta}(x^N) - \theta) \right).  (11)

IV. ASYMPTOTIC FORMULA

By taking the limit of Theorem 1, we can prove the asymptotic formula of the LPC, which relaxes some conditions given by Rissanen's asymptotic formula [13]. First, we make assumptions that allow us to exchange the limit and the integral.
Assumption 2.
1) There exists an integrable function \bar{\varphi}_\theta(\xi) of \xi such that |\varphi^N_\theta(\xi)| < \bar{\varphi}_\theta(\xi) for all N and \theta.
2) There exists an integrable function \bar{g}(\theta) of \theta such that (2\pi / N)^{K/2}\, g_N(\theta) < \bar{g}(\theta) for all N.

Theorem 2 (Asymptotic formula of the NML code length). Under Assumption 1 and Assumption 2, the following holds:

\log \int dx^N f(x^N; \hat{\theta}(x^N))\, w(\hat{\theta}(x^N)) = \frac{K}{2} \log \frac{N}{2\pi} + \log \int_{\Theta_\Pi} d\theta \, w(\theta) \sqrt{\det I(\theta)} + o(1).  (17)

Proof: Note that, by the change of variables \xi \mapsto \sqrt{N}\xi in (13), (2\pi / N)^{K/2}\, g_N(\theta) = (2\pi)^{-K/2} \int d\xi \, \varphi^N_\theta(\xi). The assumptions allow us to apply Lebesgue's dominated convergence theorem to Theorem 1 as follows:

\lim_{N \to \infty} \left( \frac{2\pi}{N} \right)^{K/2} \int dx^N f(x^N; \hat{\theta})\, w(\hat{\theta}) = \int d\theta \, w(\theta) \lim_{N \to \infty} \left( \frac{2\pi}{N} \right)^{K/2} g_N(\theta)  (18)

= \int d\theta \, w(\theta) \left( \frac{1}{2\pi} \right)^{K/2} \int d\xi \, \lim_{N \to \infty} \varphi^N_\theta(\xi).  (19)

If \theta \in \operatorname{int} \Theta_\Pi, then \sqrt{N}(\hat{\theta} - \theta) \to \mathcal{N}(0, I(\theta)^{-1}) by the asymptotic normality of the MLE. Hence, by Levy's continuity theorem,

\lim_{N \to \infty} \varphi^N_\theta(\xi) = \exp\left( -\frac{1}{2} \xi^T I(\theta)^{-1} \xi \right),  (20)

and (2\pi)^{-K/2} \int d\xi \exp\left( -\frac{1}{2} \xi^T I(\theta)^{-1} \xi \right) = \sqrt{\det I(\theta)}, which completes the proof.

Remark 6. The conclusion of the theorem is the same as that of Rissanen's formula [13]. In contrast to Rissanen's formula, we make no assumptions on the boundedness of the determinant of the Fisher information matrix or the uniform asymptotic normality of the MLE. Thus, we can expect that our formula is easy to apply even when \Theta_\Pi is not compact and the boundedness of the Fisher information and the uniform asymptotic normality of the MLE are difficult to guarantee.

V. NON-ASYMPTOTIC FORMULA FOR EXPONENTIAL FAMILY

First, we present the notation of the exponential family. Then, we present the non-asymptotic formula of the PC for the exponential family.

A. Exponential Family and Its Canonical Parameters and Expectation Parameters

We say that a model is in the exponential family when we can express its density function with canonical parameters \eta \in H \subset \mathbb{R}^K, sufficient statistics u : X \to \mathbb{R}^K, and a base measure h : X \to [0, +\infty) as follows:

f(x; \eta) = \frac{h(x)}{Z(\eta)} \exp\left( \eta^T u(x) \right).  (21)

Here, the partition function Z : H \to [0, \infty) is defined by

Z(\eta) = \int dx \, h(x) \exp\left( \eta^T u(x) \right).  (22)

We define the transform \lambda from the canonical parameters to the expectation parameters as

\lambda_k(\eta) \overset{\mathrm{def}}{=} \int dx \, u_k(x) \frac{h(x)}{Z(\eta)} \exp\left( \sum_{k'=1}^{K} \eta_{k'} u_{k'}(x) \right), \qquad \lambda(\eta) \overset{\mathrm{def}}{=} \left[ \lambda_1(\eta)\ \lambda_2(\eta)\ \cdots\ \lambda_K(\eta) \right]^T,

and let \eta(\lambda) denote its inverse transform, assuming \lambda is bijective. We define the MLE with respect to the expectation parameters as follows:

\hat{\lambda} \overset{\mathrm{def}}{=} \operatorname{argmax}_{\lambda} \prod_{n=1}^{N} f(x_n; \eta(\lambda)).  (23)

We can calculate the PC of a model in the exponential family as follows.

Theorem 3. Let f(x; \eta) be the density function of a model in the exponential family, where \eta denotes its canonical parameter and \lambda denotes its expectation parameter. The LPC of f(x; \eta) is expressed as follows:

\int dx^N f(x^N; \eta(\hat{\lambda}))\, w(\hat{\lambda}) = \left( \frac{1}{2\pi} \right)^{K} \int d\lambda \, w(\lambda) \int d\xi \, \exp(-i \xi^T \lambda) \left( \frac{Z(\eta(\lambda) + i\xi/N)}{Z(\eta(\lambda))} \right)^{N}.  (24)

Corollary 1. Let X_{N,1}, X_{N,2}, \ldots, X_{N,N} be a sequence of i.i.d. K-dimensional random variables, the characteristic function of each of which is given by Z(\eta(\lambda) + i\xi/N)/Z(\eta(\lambda)), and let g_{N,\lambda} be the density function of \sum_{n=1}^{N} X_{N,n}. Then, the PC is expressed as follows:

\int dx^N f(x^N; \eta(\hat{\lambda}))\, w(\hat{\lambda}) = \int d\lambda \, g_{N,\lambda}(\lambda)\, w(\lambda).  (25)

Remark 7. Theorem 3 reduces the original integral over data sequences to a 2K-fold integral over \lambda and \xi. Corollary 1 implies that, if we know the density function of \sum_{n=1}^{N} X_{N,n}, the characteristic function of each term of which is given using the partition function of the original model, the calculation of the original PC can be reduced to one integral calculation, and it is often analytically obtained.

Proof: Note that the following holds with respect to the maximum likelihood estimator \hat{\eta} of the canonical parameter:

\nabla_\eta \log Z(\hat{\eta}) = \frac{1}{N} \sum_{n=1}^{N} u(x_n).  (26)
Also note that the maximum likelihood estimator \hat{\lambda} with respect to the expectation parameters can be written as \hat{\lambda} = \lambda(\hat{\eta}). Here it holds that

\hat{\lambda}_k = \frac{1}{N} \sum_{n=1}^{N} u_k(x_n).  (27)

We can calculate the g-function as follows:

g_N(\lambda) = \left( \frac{1}{2\pi} \right)^K \int d\xi \int dx^N \left[ \prod_{n=1}^{N} \frac{h(x_n)}{Z(\eta)} \exp\left( \eta^T u(x_n) \right) \right] \exp\left( i \xi^T \left( \frac{1}{N} \sum_{n=1}^{N} u(x_n) - \lambda \right) \right)
= \left( \frac{1}{2\pi} \right)^K \int d\xi \, \exp(-i \xi^T \lambda) \int dx^N \prod_{n=1}^{N} \frac{h(x_n)}{Z(\eta)} \exp\left( (\eta + i\xi/N)^T u(x_n) \right)
= \left( \frac{1}{2\pi} \right)^K \int d\xi \, \exp(-i \xi^T \lambda) \left( \frac{Z(\eta + i\xi/N)}{Z(\eta)} \right)^N.  (28)

Substituting this into Theorem 1 completes the proof.

B. Examples

In this subsection, we give examples of PC (LPC) calculation using Theorem 3. These examples include the results derived in [4]. Table I lists the results.

TABLE I
EXPONENTIAL FAMILY AND PC
- Normal dist. with known variance v: sufficient statistic u(x) = x; canonical parameter \eta = \mu/v \in (-\infty, +\infty); expectation parameter \mu; parametric complexity \sqrt{N/(2\pi v)} \int d\mu\, w(\mu).
- Normal dist. with known mean \mu: sufficient statistic u(x) = (x - \mu)^2; canonical parameter \eta = -1/(2v) \in (-\infty, 0); expectation parameter v; parametric complexity \frac{(N/2)^{N/2} e^{-N/2}}{\Gamma(N/2)} \int_0^{+\infty} dv\, \frac{w(v)}{v}.
- Laplace dist. with known mean \mu: sufficient statistic u(x) = |x - \mu|; canonical parameter \eta = -1/b \in (-\infty, 0); expectation parameter b; parametric complexity \frac{N^N e^{-N}}{\Gamma(N)} \int_0^{+\infty} db\, \frac{w(b)}{b}.
- Gamma dist. with known shape k (including the exponential distribution with k = 1): sufficient statistic u(x) = x; canonical parameter \eta = -k/\mu \in (-\infty, 0); expectation parameter \mu; parametric complexity \frac{(Nk)^{Nk} e^{-Nk}}{\Gamma(Nk)} \int_0^{+\infty} d\mu\, \frac{w(\mu)}{\mu}.
- Weibull dist. with known shape k: sufficient statistic u(x) = x^k; canonical parameter \eta = -1/L^k \in (-\infty, 0); expectation parameter L^k; parametric complexity \frac{N^N e^{-N}}{\Gamma(N)} \int_0^{+\infty} dL\, \frac{w(L)}{L}.
- Gamma dist. with known scale \theta (including the chi-squared distribution with \theta = 2): sufficient statistic u(x) = \log x; canonical parameter \eta \in (-1, +\infty) (shape \eta + 1); expectation parameter \lambda = \psi(\eta + 1) + \log \theta; partition function Z(\eta) = \Gamma(\eta + 1)\, \theta^{\eta + 1}; parametric complexity: see (31).

1) Fixed-Variance Distribution: If the relationship between the canonical parameter and the expectation parameter is given by \lambda = v\eta + D with a constant v, and the partition function is given by Z(\eta) = C \exp\left( \frac{v}{2}\eta^2 + D\eta \right) with constants C and D, we can calculate the LPC as follows:

\int d\lambda\, w(\lambda)\, \frac{1}{2\pi} \int d\xi\, \exp(-i\xi\lambda) \left( \frac{Z(\eta(\lambda) + i\xi/N)}{Z(\eta(\lambda))} \right)^N = \int d\lambda\, w(\lambda)\, \frac{1}{2\pi} \int d\xi\, \exp\left( -\frac{v}{2N}\xi^2 \right) = \sqrt{\frac{N}{2\pi v}} \int d\lambda\, w(\lambda).  (29)

2) Exponential Distribution Type: If the relationship between the canonical parameter and the expectation parameter is given by \eta = -C/\lambda with a constant C, and the partition function is given by Z(\eta) = D(-\eta)^{-m} with constants D and m, we can calculate the LPC as follows:

\int_0^{+\infty} d\lambda\, w(\lambda)\, \frac{1}{2\pi} \int d\xi \left( 1 - \frac{i\xi\lambda}{NC} \right)^{-Nm} \exp(-i\xi\lambda) = \frac{(NC)^{Nm} e^{-NC}}{\Gamma(Nm)} \int_0^{+\infty} d\lambda\, \frac{w(\lambda)}{\lambda}.  (30)

The last equation holds because (1 - i\xi\lambda/(NC))^{-Nm} is the characteristic function of the gamma distribution with shape parameter Nm and rate parameter NC/\lambda. If we set w = 1\{[\lambda_{\min}, \lambda_{\max}]\}, the PC is equal to

\frac{(NC)^{Nm} e^{-NC}}{\Gamma(Nm)} \log \frac{\lambda_{\max}}{\lambda_{\min}}.

This formula can be applied to distributions including the exponential distribution, the chi-squared distribution, the Laplace distribution with a known mean, the Weibull distribution with a known shape, and the gamma distribution with a known shape.

3) Chi-squared Distribution Type: We discuss the gamma distribution with a known scale \theta. The result here includes the chi-squared distribution (set \theta = 2). We can calculate the LPC as follows:

\int_{-\infty}^{+\infty} ds\, w(\log\theta + s)\, \frac{1}{2\pi} \int d\xi \left( \frac{\Gamma(\psi^{-1}(s) + i\xi/N)}{\Gamma(\psi^{-1}(s))} \right)^N \exp(-i\xi s) = \int_{-\infty}^{+\infty} ds\, w(\log\theta + s)\, G_N\left( s; \psi^{-1}(s), 1 \right).  (31)

Here, G_N(\cdot\,; p, q) is the probability density function of the mean of N i.i.d. samples, the density of each of which is given by

G(y; p, q) = \frac{1}{q^p\, \Gamma(p)} \exp\left( p y - \frac{e^{y}}{q} \right).  (32)

VI. CONCLUSION

In this paper, we derived a non-asymptotic form of the NML code length and clarified its relationship to the asymptotic expansion. Moreover, we presented a non-asymptotic calculation formula of the NML code length for exponential family models. This formula can be applied whenever we know the partition function. In addition, if we know the closed form of a distribution whose characteristic function can be expressed using the partition function, the calculation is reduced to one integral calculation, which is often analytically obtained.

REFERENCES

[1] A. R. Barron and T. M. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37(4), 1991.
[2] P. D. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.
[3] P. D. Grünwald and N. A. Mehta. A tight excess risk bound via a unified PAC-Bayesian-Rademacher-Shtarkov-MDL complexity. CoRR, 2017.
[4] S. Hirai and K. Yamanishi. Efficient computation of normalized maximum likelihood codes for Gaussian mixture models with its applications to clustering. IEEE Transactions on Information Theory, 59(11), 2013.
[5] P. Kontkanen and P. Myllymäki. A linear-time algorithm for computing the multinomial stochastic complexity. Information Processing Letters, 103(6):227–233, 2007.
[6] P. Kontkanen and P. Myllymäki. MDL histogram density estimation. In International Conference on Artificial Intelligence and Statistics, pages 219–226, 2007.
[7] P. Kontkanen and P. Myllymäki. An empirical comparison of NML clustering algorithms.
[8] L. G. Kraft. A Device for Quantizing, Grouping, and Coding Amplitude-Modulated Pulses. PhD thesis, Massachusetts Institute of Technology, 1949.
[9] B. McMillan. Two inequalities implied by unique decipherability. IRE Transactions on Information Theory, 2(4), 1956.
[10] R. Nishii. Maximum likelihood principle and model selection when the true model is unspecified. Journal of Multivariate Analysis, 27:392–403, 1988.
[11] J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, 1989.
[12] J. Rissanen. Stochastic complexity in learning. In Computational Learning Theory. Springer, 1995.
[13] J. Rissanen. Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42(1):40–47, 1996.
[14] J. Rissanen. Stochastic complexity in learning. Journal of Computer and System Sciences, 55(1):89–95, 1997.
[15] J. Rissanen. MDL denoising. IEEE Transactions on Information Theory, 46(7), 2000.
[16] J. Rissanen, T. P. Speed, and B. Yu. Density estimation by stochastic complexity. IEEE Transactions on Information Theory, 38(2):315–323, 1992.
[17] T. Roos, T. Silander, P. Kontkanen, and P. Myllymäki. Bayesian network structure learning using factorized NML universal models. In Information Theory and Applications Workshop. IEEE, 2008.
[18] C. E. Shannon. A mathematical theory of communication, parts I and II. Bell System Technical Journal, 27:379–423, 623–656, 1948.
[19] Y. M. Shtar'kov. Universal sequential coding of single messages. Problemy Peredachi Informatsii, 23(3):3–17, 1987.
[20] A. W. van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 1998.
[21] K. Yamanishi. A learning criterion for stochastic rules. Machine Learning, 9(2):165–203, 1992.
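The closed-form result for the exponential-distribution type (Section V-B) can be checked by brute force. The sketch below is a Monte-Carlo illustration under assumed toy values (N = 3, mean restricted to [0.5, 2] via the luckiness w = 1{[mu_min, mu_max]}; the importance-sampling proposal is this example's own choice, not part of the paper): it estimates the restricted PC directly as an integral over data sequences and compares it with the closed form N^N e^{-N} / Gamma(N) times log(mu_max / mu_min).

```python
import math
import random

def exact_lpc_exponential(N, mu_min, mu_max):
    # Closed form for the exponential distribution (C = m = 1 in Section V-B):
    # (N^N e^{-N} / Gamma(N)) * log(mu_max / mu_min)
    return math.exp(N * math.log(N) - N - math.lgamma(N)) * math.log(mu_max / mu_min)

def mc_lpc_exponential(N, mu_min, mu_max, samples=200_000, seed=0):
    # Direct Monte-Carlo estimate of the restricted PC
    #   integral of f(x^N; mu_hat) * 1{mu_hat in [mu_min, mu_max]} dx^N,
    # via importance sampling with proposal q(x^N) = product of Exp(mean 1).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x = [rng.expovariate(1.0) for _ in range(N)]
        mu_hat = sum(x) / N  # MLE of the mean
        if mu_min <= mu_hat <= mu_max:
            # Weight f(x; mu_hat) / q(x) = mu_hat^{-N} e^{-N} * e^{+sum(x)}
            total += math.exp(-N * math.log(mu_hat) - N + sum(x))
    return total / samples

exact = exact_lpc_exponential(3, 0.5, 2.0)
approx = mc_lpc_exponential(3, 0.5, 2.0)
```

With these toy values the two quantities agree to Monte-Carlo accuracy (the importance weights are bounded on the restricted event, so the estimator has small variance).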
More informationApproximate inference, Sampling & Variational inference Fall Cours 9 November 25
Approimate inference, Sampling & Variational inference Fall 2015 Cours 9 November 25 Enseignant: Guillaume Obozinski Scribe: Basile Clément, Nathan de Lara 9.1 Approimate inference with MCMC 9.1.1 Gibbs
More information2 Statistical Estimation: Basic Concepts
Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof. N. Shimkin 2 Statistical Estimation:
More informationLecture 3: More on regularization. Bayesian vs maximum likelihood learning
Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting
More informationPower EP. Thomas Minka Microsoft Research Ltd., Cambridge, UK MSR-TR , October 4, Abstract
Power EP Thomas Minka Microsoft Research Ltd., Cambridge, UK MSR-TR-2004-149, October 4, 2004 Abstract This note describes power EP, an etension of Epectation Propagation (EP) that makes the computations
More informationUseful Mathematics. 1. Multivariable Calculus. 1.1 Taylor s Theorem. Monday, 13 May 2013
Useful Mathematics Monday, 13 May 013 Physics 111 In recent years I have observed a reticence among a subpopulation of students to dive into mathematics when the occasion arises in theoretical mechanics
More informationNotes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed
18.466 Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 1. MLEs in exponential families Let f(x,θ) for x X and θ Θ be a likelihood function, that is, for present purposes,
More informationOptimal scaling of the random walk Metropolis on elliptically symmetric unimodal targets
Bernoulli 15(3), 2009, 774 798 DOI: 10.3150/08-BEJ176 Optimal scaling of the random walk Metropolis on elliptically symmetric unimodal targets CHRIS SHERLOCK 1 and GARETH ROBERTS 2 1 Department of Mathematics
More informationSemiparametric posterior limits
Statistics Department, Seoul National University, Korea, 2012 Semiparametric posterior limits for regular and some irregular problems Bas Kleijn, KdV Institute, University of Amsterdam Based on collaborations
More informationStochastic Complexity of Variational Bayesian Hidden Markov Models
Stochastic Complexity of Variational Bayesian Hidden Markov Models Tikara Hosino Department of Computational Intelligence and System Science, Tokyo Institute of Technology Mailbox R-5, 459 Nagatsuta, Midori-ku,
More informationThe Behaviour of the Akaike Information Criterion when Applied to Non-nested Sequences of Models
The Behaviour of the Akaike Information Criterion when Applied to Non-nested Sequences of Models Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population Health
More informationExact Minimax Predictive Density Estimation and MDL
Exact Minimax Predictive Density Estimation and MDL Feng Liang and Andrew Barron December 5, 2003 Abstract The problems of predictive density estimation with Kullback-Leibler loss, optimal universal data
More informationThe Expectation-Maximization Algorithm
1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable
More informationf-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models
IEEE Transactions on Information Theory, vol.58, no.2, pp.708 720, 2012. 1 f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models Takafumi Kanamori Nagoya University,
More informationMobile Robot Localization
Mobile Robot Localization 1 The Problem of Robot Localization Given a map of the environment, how can a robot determine its pose (planar coordinates + orientation)? Two sources of uncertainty: - observations
More informationTail Properties and Asymptotic Expansions for the Maximum of Logarithmic Skew-Normal Distribution
Tail Properties and Asymptotic Epansions for the Maimum of Logarithmic Skew-Normal Distribution Xin Liao, Zuoiang Peng & Saralees Nadarajah First version: 8 December Research Report No. 4,, Probability
More informationPatterns of Scalable Bayesian Inference Background (Session 1)
Patterns of Scalable Bayesian Inference Background (Session 1) Jerónimo Arenas-García Universidad Carlos III de Madrid jeronimo.arenas@gmail.com June 14, 2017 1 / 15 Motivation. Bayesian Learning principles
More information= 1 2 x (x 1) + 1 {x} (1 {x}). [t] dt = 1 x (x 1) + O (1), [t] dt = 1 2 x2 + O (x), (where the error is not now zero when x is an integer.
Problem Sheet,. i) Draw the graphs for [] and {}. ii) Show that for α R, α+ α [t] dt = α and α+ α {t} dt =. Hint Split these integrals at the integer which must lie in any interval of length, such as [α,
More informationMinimax Optimal Bayes Mixtures for Memoryless Sources over Large Alphabets
Proceedings of Machine Learning Research 8: 8, 208 Algorithmic Learning Theory 208 Minimax Optimal Bayes Mixtures for Memoryless Sources over Large Alphabets Elias Jääsaari Helsinki Institute for Information
More informationFoundations of Nonparametric Bayesian Methods
1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models
More informationMinimum message length estimation of mixtures of multivariate Gaussian and von Mises-Fisher distributions
Minimum message length estimation of mixtures of multivariate Gaussian and von Mises-Fisher distributions Parthan Kasarapu & Lloyd Allison Monash University, Australia September 8, 25 Parthan Kasarapu
More informationTaylor Series and Asymptotic Expansions
Taylor Series and Asymptotic Epansions The importance of power series as a convenient representation, as an approimation tool, as a tool for solving differential equations and so on, is pretty obvious.
More informationType II variational methods in Bayesian estimation
Type II variational methods in Bayesian estimation J. A. Palmer, D. P. Wipf, and K. Kreutz-Delgado Department of Electrical and Computer Engineering University of California San Diego, La Jolla, CA 9093
More informationBayesian Inference of Noise Levels in Regression
Bayesian Inference of Noise Levels in Regression Christopher M. Bishop Microsoft Research, 7 J. J. Thomson Avenue, Cambridge, CB FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop
More informationModel Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model
Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population
More informationTHE inverse tangent function is an elementary mathematical
A Sharp Double Inequality for the Inverse Tangent Function Gholamreza Alirezaei arxiv:307.983v [cs.it] 8 Jul 03 Abstract The inverse tangent function can be bounded by different inequalities, for eample
More informationVariational Bayes. A key quantity in Bayesian inference is the marginal likelihood of a set of data D given a model M
A key quantity in Bayesian inference is the marginal likelihood of a set of data D given a model M PD M = PD θ, MPθ Mdθ Lecture 14 : Variational Bayes where θ are the parameters of the model and Pθ M is
More informationWhere now? Machine Learning and Bayesian Inference
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone etension 67 Email: sbh@clcamacuk wwwclcamacuk/ sbh/ Where now? There are some simple take-home messages from
More informationStatistical Learning Theory of Variational Bayes
Statistical Learning Theory of Variational Bayes Department of Computational Intelligence and Systems Science Interdisciplinary Graduate School of Science and Engineering Tokyo Institute of Technology
More informationTail Approximation of the Skew-Normal by the Skew-Normal Laplace: Application to Owen s T Function and the Bivariate Normal Distribution
Journal of Statistical and Econometric ethods vol. no. 3 - ISS: 5-557 print version 5-565online Scienpress Ltd 3 Tail Approimation of the Skew-ormal by the Skew-ormal Laplace: Application to Owen s T Function
More informationIntroduction to Probability Theory for Graduate Economics Fall 2008
Introduction to Probability Theory for Graduate Economics Fall 008 Yiğit Sağlam October 10, 008 CHAPTER - RANDOM VARIABLES AND EXPECTATION 1 1 Random Variables A random variable (RV) is a real-valued function
More informationMISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30
MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD Copyright c 2012 (Iowa State University) Statistics 511 1 / 30 INFORMATION CRITERIA Akaike s Information criterion is given by AIC = 2l(ˆθ) + 2k, where l(ˆθ)
More informationTight Bounds for Symmetric Divergence Measures and a New Inequality Relating f-divergences
Tight Bounds for Symmetric Divergence Measures and a New Inequality Relating f-divergences Igal Sason Department of Electrical Engineering Technion, Haifa 3000, Israel E-mail: sason@ee.technion.ac.il Abstract
More informationA General Overview of Parametric Estimation and Inference Techniques.
A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying
More informationSequential prediction with coded side information under logarithmic loss
under logarithmic loss Yanina Shkel Department of Electrical Engineering Princeton University Princeton, NJ 08544, USA Maxim Raginsky Department of Electrical and Computer Engineering Coordinated Science
More informationBregman Divergence and Mirror Descent
Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,
More informationMaximum Likelihood Estimation
Chapter 8 Maximum Likelihood Estimation 8. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function
More informationA GENERAL CLASS OF LOWER BOUNDS ON THE PROBABILITY OF ERROR IN MULTIPLE HYPOTHESIS TESTING. Tirza Routtenberg and Joseph Tabrikian
A GENERAL CLASS OF LOWER BOUNDS ON THE PROBABILITY OF ERROR IN MULTIPLE HYPOTHESIS TESTING Tirza Routtenberg and Joseph Tabrikian Department of Electrical and Computer Engineering Ben-Gurion University
More informationPredictive Hypothesis Identification
Marcus Hutter - 1 - Predictive Hypothesis Identification Predictive Hypothesis Identification Marcus Hutter Canberra, ACT, 0200, Australia http://www.hutter1.net/ ANU RSISE NICTA Marcus Hutter - 2 - Predictive
More informationInformation Theory Based Estimator of the Number of Sources in a Sparse Linear Mixing Model
Information heory Based Estimator of the Number of Sources in a Sparse Linear Mixing Model Radu Balan University of Maryland Department of Mathematics, Center for Scientific Computation And Mathematical
More informationMachine Learning Basics III
Machine Learning Basics III Benjamin Roth CIS LMU München Benjamin Roth (CIS LMU München) Machine Learning Basics III 1 / 62 Outline 1 Classification Logistic Regression 2 Gradient Based Optimization Gradient
More informationGraduate Econometrics I: Maximum Likelihood I
Graduate Econometrics I: Maximum Likelihood I Yves Dominicy Université libre de Bruxelles Solvay Brussels School of Economics and Management ECARES Yves Dominicy Graduate Econometrics I: Maximum Likelihood
More informationThe Minimum Message Length Principle for Inductive Inference
The Principle for Inductive Inference Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population Health University of Melbourne University of Helsinki, August 25,
More informationebay/google short course: Problem set 2
18 Jan 013 ebay/google short course: Problem set 1. (the Echange Parado) You are playing the following game against an opponent, with a referee also taking part. The referee has two envelopes (numbered
More informationBayesian estimation of the discrepancy with misspecified parametric models
Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012
More informationRejection sampling - Acceptance probability. Review: How to sample from a multivariate normal in R. Review: Rejection sampling. Weighted resampling
Rejection sampling - Acceptance probability Review: How to sample from a multivariate normal in R Goal: Simulate from N d (µ,σ)? Note: For c to be small, g() must be similar to f(). The art of rejection
More informationUnsupervised Learning
Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College
More informationApproximate formulas for the Point-to-Ellipse and for the Point-to-Ellipsoid Distance Problem
Approimate formulas for the Point-to-Ellipse and for the Point-to-Ellipsoid Distance Problem ALEXEI UTESHEV St.Petersburg State University Department of Applied Mathematics Universitetskij pr. 35, 198504
More informationInformation geometry of Bayesian statistics
Information geometry of Bayesian statistics Hiroshi Matsuzoe Department of Computer Science and Engineering, Graduate School of Engineering, Nagoya Institute of Technology, Nagoya 466-8555, Japan Abstract.
More informationOn the Behavior of MDL Denoising
On the Behavior of MDL Denoising Teemu Roos Petri Myllymäki Helsinki Institute for Information Technology Univ. of Helsinki & Helsinki Univ. of Technology P.O. Box 9800 FIN-0015 TKK, Finland Henry Tirri
More information8 The Contribution of Parameters to Stochastic Complexity
8 The Contribution of Parameters to Stochastic Complexity Dean P. Foster and Robert A. Stine Department of Statistics The Wharton School of the University of Pennsylvania Philadelphia, PA 19104-630 foster@wharton.upenn.edu
More informationEstimation theory and information geometry based on denoising
Estimation theory and information geometry based on denoising Aapo Hyvärinen Dept of Computer Science & HIIT Dept of Mathematics and Statistics University of Helsinki Finland 1 Abstract What is the best
More informationMachine Learning Basics: Maximum Likelihood Estimation
Machine Learning Basics: Maximum Likelihood Estimation Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics 1. Learning
More information