Exact Calculation of Normalized Maximum Likelihood Code Length Using Fourier Analysis
Atsushi Suzuki and Kenji Yamanishi
The University of Tokyo, Graduate School of Information Science and Technology, Bunkyo, Tokyo, Japan
arXiv: v1 [math.ST] 11 Jan 2018

Abstract: The normalized maximum likelihood code length has been widely used in model selection, and its favorable properties, such as its consistency and the upper bound of its statistical risk, have been demonstrated. This paper proposes a novel methodology for calculating the normalized maximum likelihood code length on the basis of Fourier analysis. Our methodology provides an efficient non-asymptotic calculation formula for exponential family models and an asymptotic calculation formula for general parametric models under a weaker assumption than that in previous work.

I. INTRODUCTION

A. Background and Our Contribution

The normalized maximum likelihood (NML) code length is an extension of self-entropy in which a set of distributions is given instead of the true distribution. When the true distribution is known, the lower bound of the mean code length for a random variable is given by the Shannon entropy of its probability distribution, and this bound is attained by the self-entropy [18]. This optimal code, or the self-entropy, can also be interpreted as the solution of a trivial optimization problem: minimize the worst-case log redundancy over code length functions l subject to the Kraft-McMillan inequality [8], [9]:

\min_l \max_{x^N} \left[ l(x^N) - \left( -\log f_0(x^N) \right) \right] \quad \text{s.t.} \quad \int dx^N \exp(-l(x^N)) \le 1,  (1)

where x^N \overset{\mathrm{def}}{=} x_1, x_2, \ldots, x_N is a data sequence and f_0 denotes the probability density function of the data-generating distribution. Apparently, the optimal code length is given by l(x^N) = -\log f_0(x^N) (the self-entropy). Note that we discuss cases of continuous random variables in this paper. Further, the base of the logarithm is e, and the natural unit of information (nat) is used throughout this paper.
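The optimality of the self-entropy in (1) can be sanity-checked numerically in a discrete analogue (the paper treats continuous variables; the three-symbol source below is a toy assumption of this illustration, not an example from the paper). By Gibbs' inequality, coding with any distribution q other than the true f_0 only increases the expected code length:

```python
import math

def expected_code_length(p, q):
    """Expected code length E_p[-log q(x)] in nats when coding source p with code -log q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# Toy three-symbol source (hypothetical values).
p = [0.5, 0.3, 0.2]
entropy = expected_code_length(p, p)  # self-entropy attains the Shannon bound

# Any other distribution q satisfying the Kraft-McMillan constraint codes worse.
for q in ([0.4, 0.4, 0.2], [1/3, 1/3, 1/3], [0.6, 0.3, 0.1]):
    assert expected_code_length(p, q) >= entropy
```

Here -log q plays the role of the code length l in (1), and the Kraft-McMillan constraint holds with equality since q sums to one.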
The optimization problem (1), or the original Shannon entropy, deals with the case in which the true distribution is known. When a set of distributions F is given as candidates for the true distribution instead of the true distribution itself, we can extend the optimization problem above to the problem introduced by Shtarkov [19]:

\min_l \max_{x^N} \left[ l(x^N) - \min_{f \in F} \left( -\log f(x^N) \right) \right] \quad \text{s.t.} \quad \int dx^N \exp(-l(x^N)) \le 1.  (2)

This problem is no longer trivial, and Shtarkov showed that the NML code length defined below attains its minimum [19]:

l_{\mathrm{NML}}(x^N) \overset{\mathrm{def}}{=} -\log f_{\mathrm{NML}}(x^N) = -\log \max_{f \in F} f(x^N) + \log \int_{X^N} \max_{f \in F} f(x^N) \, dx^N.  (3)

The problem (2) reduces to (1) if F = \{f_0\}, and in this sense, the NML code length is an extension of the self-entropy. The NML code is one of the universally optimal codings when the true distribution in the given set is unknown [11]. The NML code length is widely used in model selection on the basis of the minimum description length (MDL) principle [16], [12], [21], [14]. Here, the model that minimizes the NML code length for given data is selected. Recently, it has been shown that the NML code length bounds the generalized loss [3].

The calculation of the NML code length has been an important problem. Rissanen derived an asymptotic formula for the NML code length [13], which clarified the behavior of the NML code length up to o(1) terms as follows:

\log \int dx^N f(x^N; \hat{\theta}(x^N)) = \frac{K}{2} \log \frac{N}{2\pi} + \log \int_{\Theta} d\theta \, \sqrt{\det I(\theta)} + o(1),  (4)

where \hat{\theta}(x^N) denotes the maximum likelihood estimator, I(\theta) the Fisher information matrix, and K the dimension of the parameter. This formula holds under certain regularity conditions and does not depend on the details of the model. According to this formula, we can apply Nishii's analysis of consistency of the selected model [10] and Barron and Cover's result on statistical risk [1] to model selection using the NML code length. In contrast to the generality of Rissanen's asymptotic formula, non-asymptotic calculation formulae have been derived through model-by-model discussion [15], [5], [6], [7], [17]. Recently, Hirai and Yamanishi non-asymptotically calculated the NML code length for several models in the exponential family [4]. They reduced the calculation of the NML code length to an integral over the parameter domain of a function denoted by g in their paper. However, the method to obtain Hirai and Yamanishi's g-function explicitly depends on the model. Thus, the exact calculation of the NML code length has been limited to particular models.

This paper proposes a novel methodology for calculating the NML code length on the basis of the Fourier transform. Our methodology enables the systematic analysis of the NML code length in terms of both asymptotic expansion and exact calculation. As corollaries, our methodology provides an asymptotic formula with weaker assumptions than Rissanen's and a useful exact calculation formula for the exponential family.

B. Significance of This Paper

This paper proposes an alternative form of the NML code length based on the Fourier transform. Our form enables the systematic calculation of the NML code length. Specifically, it results in the two formulae presented below.

1) Asymptotic Formula with Weaker Assumption: Taking the limit of our form leads to Rissanen's asymptotic formula [13]. It should be noted that Lebesgue's dominated convergence theorem can be applied to our Fourier-transform-based form, which results in an asymptotic formula with a weaker assumption than that in the original paper [13].

2) Exact Calculation Formula for Exponential Family: Our Fourier-transform-based form gives a simple formula for the exact calculation of the NML code length of the exponential family. The formula yields the NML code length from the partition function and the relationship between the canonical parameters and the expectation of the sufficient statistics.

C. Related Work

1) Asymptotic Formula with Weaker Assumption: The conclusion of the asymptotic formula in this paper is the same as that of Rissanen's theorem [13].
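To see how tight the asymptotic formula is at finite sample sizes, one can compare it against an exactly computable parametric complexity. The Bernoulli model (K = 1) is a standard test case, used here as a toy check assumed for illustration and not a model treated in this paper (it is also discrete, whereas this paper treats continuous variables): its exact complexity is the multinomial sum studied in [5], its Fisher information is I(\theta) = 1/(\theta(1-\theta)), and \int_0^1 \sqrt{I(\theta)}\, d\theta = \pi.

```python
import math

def exact_log_pc_bernoulli(N):
    # C = sum_{k=0}^{N} binom(N, k) (k/N)^k ((N-k)/N)^(N-k), with 0^0 := 1
    total = 0.0
    for k in range(N + 1):
        log_term = math.lgamma(N + 1) - math.lgamma(k + 1) - math.lgamma(N - k + 1)
        if 0 < k < N:
            log_term += k * math.log(k / N) + (N - k) * math.log((N - k) / N)
        total += math.exp(log_term)
    return math.log(total)

def rissanen_log_pc_bernoulli(N):
    # (K/2) log(N / 2 pi) + log of the integral of sqrt(det I(theta)),
    # with K = 1 and the integral equal to pi for the Bernoulli model.
    return 0.5 * math.log(N / (2 * math.pi)) + math.log(math.pi)

# At N = 1000 the two values differ by only a few hundredths of a nat.
print(exact_log_pc_bernoulli(1000), rissanen_log_pc_bernoulli(1000))
```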
However, Rissanen's theorem assumes both the uniform asymptotic normality of the maximum likelihood estimator and the existence of a non-zero lower bound and a finite upper bound of the Fisher information; in contrast, our theorem does not involve these assumptions and allows the Fisher information to converge to zero or diverge toward the boundary.

2) Exact Calculation Formula for Exponential Family: Hirai and Yamanishi presented the exact calculation for several models in the exponential family through the integral of the g-function. However, in general, it is still difficult to obtain the explicit form of the g-function. In this paper, a general exact calculation formula for the exponential family is obtained, which includes Hirai and Yamanishi's results.

II. NORMALIZED MAXIMUM LIKELIHOOD CODE LENGTH

We consider a sequence x^N \overset{\mathrm{def}}{=} x_1, x_2, \ldots, x_N of continuous random variables and assume that they have a probability density function.

Definition 1. Let F \subset \{ f : X \to [0, \infty) \mid \int_X f(x)\, dx = 1 \} denote a set of density functions. Here, X \subset \mathbb{R}^D denotes the domain of a datum. Assume that \max_{f \in F} f(x^N) is a measurable function of x^N. The NML code length is defined as the negative log-likelihood of the NML distribution as follows:

l_{\mathrm{NML}}(x^N) \overset{\mathrm{def}}{=} -\log \max_{f \in F} f(x^N) + \log C(F),  (5)

where C(F) \overset{\mathrm{def}}{=} \int_{X^N} dx^N \max_{f \in F} f(x^N). C(F), or its logarithm, is called the parametric complexity (PC) of F.

In this paper, we focus on the case in which it is easy to evaluate the first term but difficult to evaluate the parametric complexity. This is because, when even the first term is intractable, it is hardly possible to evaluate the second term exactly.

In this paper, we consider the independent identically distributed parametric model

F_\Theta = \left\{ f : X^N \to [0, \infty) \,\middle|\, f(x^N) = \prod_{n=1}^{N} f(x_n; \theta),\ \theta \in \Theta \subset \mathbb{R}^K \right\}  (6)

as the set of density functions. Here, \theta is the parameter and \Theta is the domain of the parameter. We mainly analyze a proper parameter domain \Theta_\Pi \subset \Theta, defined as follows.

Definition 2. A subset \Theta_\Pi of \Theta is proper if the following conditions are satisfied:
1) The map \Theta_\Pi \ni \theta \mapsto f(\cdot\,; \theta) \in F_\Pi \subset F is bijective (one to one).
2) For all x^N \in X^N, a unique solution \hat{\theta}(x^N) of \max_{\theta \in \Theta_\Pi} f(x^N; \theta) exists; that is, a unique maximum likelihood estimator (MLE) \hat{\theta}(x^N) exists.
3) \max_{\theta \in \Theta_\Pi} f(x^N; \theta) is a measurable function of x^N.
4) If \theta \in \Theta_\Pi and x^N \sim \prod_{n=1}^{N} f(x_n; \theta), the asymptotic normality of the MLE \hat{\theta} holds; that is, \sqrt{N}(\hat{\theta}(x^N) - \theta) \to \mathcal{N}(0, I(\theta)^{-1}), where I(\theta) denotes the Fisher information matrix.

We also define the proper data sequence domain X^N_\Pi as follows:

X^N_\Pi \overset{\mathrm{def}}{=} \left\{ x^N \in X^N \,\middle|\, \max_{\theta \in \Theta_\Pi} f(x^N; \theta) = \max_{\theta \in \Theta} f(x^N; \theta) \right\}.  (7)

Remark 1. Sufficient conditions for 4) have been discussed (for example, see [20]). At least the positive definiteness of I(\theta) on \Theta_\Pi is required for 4).

Remark 2. Since \Theta_\Pi \subset \Theta, the following holds in general: \max_{\theta \in \Theta_\Pi} f(x^N; \theta) \le \max_{\theta \in \Theta} f(x^N; \theta).

Remark 3. In this paper, \hat{\theta} always denotes the unique MLE on \Theta_\Pi. If \Theta_\Pi \subsetneq \Theta, the MLE in \Theta can be non-unique.

Roughly speaking, the proper parameter domain is a tractable subset of the model, and the proper data sequence domain is the set of sequences whose MLE lies in the proper parameter domain. The PC can be decomposed as follows:

\int_{X^N} dx^N \max_{\theta \in \Theta} f(x^N; \theta) = \int_{X^N} dx^N f(x^N; \hat{\theta}(x^N)) + \int_{X^N \setminus X^N_\Pi} dx^N \left[ \max_{\theta \in \Theta} f(x^N; \theta) - f(x^N; \hat{\theta}(x^N)) \right].  (8)

Remark 4. If we can take \Theta_\Pi = \Theta, as is often the case with well-behaved models such as exponential family models, the second term vanishes, and the logarithm of the first term is equivalent to the parametric complexity. We assume that the second term is ignorable and focus on the first term, C(\Theta_\Pi) \overset{\mathrm{def}}{=} \int_{X^N} dx^N f(x^N; \hat{\theta}(x^N)), in this paper.

Note that C(\Theta_\Pi) integrates over excessive data sequences and often diverges to infinity. To avoid this problem, we introduce luckiness [2] to generalize C(\Theta_\Pi) as follows:

Definition 3. Let w : \Theta_\Pi \to [0, \infty) denote a weight function on \Theta_\Pi called luckiness. We define the luckiness parametric complexity (LPC) of \Theta_\Pi as follows:

C_w(\Theta_\Pi) \overset{\mathrm{def}}{=} \int dx^N f(x^N; \hat{\theta}(x^N))\, w(\hat{\theta}(x^N)),  (9)

where \hat{\theta}(x^N) \overset{\mathrm{def}}{=} \operatorname{argmax}_{\theta \in \Theta_\Pi} f(x^N; \theta).

Remark 5. If w(\theta) \equiv 1, the LPC is equivalent to the PC. Let A be a subset of \Theta_\Pi. We can regard the LPC C_{1\{A\}}(\Theta_\Pi) as a restriction of the PC C(\Theta_\Pi) to A, where 1\{\cdot\} denotes the indicator function. This restriction is often necessary and is used in continuous-variable cases [2], [4].

III. FOURIER FORM OF NML CODE LENGTH

First, we make assumptions that allow us to exchange integrals.

Assumption 1.
1) For all \Phi_0 \subset \Theta_\Pi that have measure zero, \{ x^N \mid \hat{\theta}(x^N) \in \Phi_0 \} has measure zero.
2) For all x^N, f(x^N; \theta) w(\theta) is integrable and square-integrable as a function of \theta.
3) For all x^N, the Fourier transform \hat{f}_w(x^N; \xi) of f(x^N; \theta) w(\theta) is integrable as a function of \xi, where

\hat{f}_w(x^N; \xi) \overset{\mathrm{def}}{=} \left( \frac{1}{2\pi} \right)^{K/2} \int_{\Theta_\Pi} d\theta \, \exp(-i \xi^T \theta) f(x^N; \theta) w(\theta).  (10)

We obtain the Fourier-transform-based form of the NML code length as follows:

Theorem 1. Under Assumption 1, the LPC is calculated as follows:

\int dx^N f(x^N; \hat{\theta}(x^N))\, w(\hat{\theta}(x^N)) = \int d\theta \, w(\theta)\, g_N(\theta),  (12)

where

g_N(\theta) \overset{\mathrm{def}}{=} \left( \frac{1}{2\pi} \right)^{K} \int d\xi \int dx^N f(x^N; \theta) \exp\left( i \xi^T (\hat{\theta}(x^N) - \theta) \right).  (13)

Proof: Since f(x^N; \cdot)\, w(\cdot) \in L^1(\Theta_\Pi) \cap L^2(\Theta_\Pi), we have

f(x^N; \theta)\, w(\theta) = \left( \frac{1}{2\pi} \right)^{K/2} \int d\xi \, \exp(i \xi^T \theta)\, \hat{f}_w(x^N; \xi) \quad \text{a.s.}  (14)

Thus, the following holds with Assumption 1-1):

\int dx^N f(x^N; \hat{\theta})\, w(\hat{\theta}) = \left( \frac{1}{2\pi} \right)^{K/2} \int dx^N \int d\xi \, \exp(i \xi^T \hat{\theta})\, \hat{f}_w(x^N; \xi) = \left( \frac{1}{2\pi} \right)^{K/2} \int d\xi \int dx^N \exp(i \xi^T \hat{\theta})\, \hat{f}_w(x^N; \xi),  (15)

where the last equation follows from the absolute integrability of \hat{f}_w(x^N; \xi) and Fubini's theorem.
We exchange integrals again likewise as follows:

\int d\xi \int dx^N \exp(i \xi^T \hat{\theta})\, \hat{f}_w(x^N; \xi) = \left( \frac{1}{2\pi} \right)^{K/2} \int d\xi \int dx^N \exp(i \xi^T \hat{\theta}) \int_{\Theta_\Pi} d\theta \, \exp(-i \xi^T \theta) f(x^N; \theta) w(\theta) = \left( \frac{1}{2\pi} \right)^{K/2} \int d\theta \, w(\theta) \int d\xi \int dx^N \exp\left( i \xi^T (\hat{\theta} - \theta) \right) f(x^N; \theta),  (16)

where the last equation follows from the absolute integrability of \hat{f}_w(x^N; \xi) and Fubini's theorem. By the third assumption, we can exchange the integrals with respect to \theta and \xi, which completes the proof.

We also assume the following condition, which is used in Section IV:

4) The characteristic function \varphi^N_\theta(\xi) of the maximum likelihood estimator is integrable as a function of \theta and \xi, where

\varphi^N_\theta(\xi) \overset{\mathrm{def}}{=} \int dx^N f(x^N; \theta) \exp\left( i \xi^T \sqrt{N} (\hat{\theta}(x^N) - \theta) \right).  (11)

IV. ASYMPTOTIC FORMULA

By taking the limit of Theorem 1, we can prove the asymptotic formula of the LPC, which relaxes some conditions given by Rissanen's asymptotic formula [13]. First, we make assumptions that allow us to exchange the limit and the integral.
Assumption 2.
1) There exists an integrable function \bar{\varphi}_\theta(\xi) of \xi such that |\varphi^N_\theta(\xi)| < \bar{\varphi}_\theta(\xi) for all N and \theta.
2) There exists an integrable function \bar{g}(\theta) of \theta such that (2\pi / N)^{K/2}\, g_N(\theta) < \bar{g}(\theta) for all N.

Theorem 2 (Asymptotic formula of the NML code length). Under Assumption 1 and Assumption 2, the following holds:

\log \int dx^N f(x^N; \hat{\theta}(x^N))\, w(\hat{\theta}(x^N)) = \frac{K}{2} \log \frac{N}{2\pi} + \log \int_{\Theta_\Pi} d\theta \, w(\theta) \sqrt{\det I(\theta)} + o(1).  (17)

Proof: Note that, by the change of variables \xi \mapsto \sqrt{N}\xi in (13), (2\pi / N)^{K/2}\, g_N(\theta) = (2\pi)^{-K/2} \int d\xi \, \varphi^N_\theta(\xi). The assumptions allow us to apply Lebesgue's dominated convergence theorem to Theorem 1 as follows:

\lim_{N \to \infty} \left( \frac{2\pi}{N} \right)^{K/2} \int dx^N f(x^N; \hat{\theta})\, w(\hat{\theta}) = \int d\theta \, w(\theta) \lim_{N \to \infty} \left( \frac{2\pi}{N} \right)^{K/2} g_N(\theta)  (18)

= \int d\theta \, w(\theta) \left( \frac{1}{2\pi} \right)^{K/2} \int d\xi \, \lim_{N \to \infty} \varphi^N_\theta(\xi).  (19)

If \theta \in \operatorname{int} \Theta_\Pi, then \sqrt{N}(\hat{\theta} - \theta) \to \mathcal{N}(0, I(\theta)^{-1}) by the asymptotic normality of the MLE. Hence, by Levy's continuity theorem,

\lim_{N \to \infty} \varphi^N_\theta(\xi) = \exp\left( -\frac{1}{2} \xi^T I(\theta)^{-1} \xi \right),  (20)

and (2\pi)^{-K/2} \int d\xi \exp\left( -\frac{1}{2} \xi^T I(\theta)^{-1} \xi \right) = \sqrt{\det I(\theta)}, which completes the proof.

Remark 6. The conclusion of the theorem is the same as that of Rissanen's formula [13]. In contrast to Rissanen's formula, we make no assumptions on the boundedness of the determinant of the Fisher information matrix or the uniform asymptotic normality of the MLE. Thus, we can expect that our formula is easy to apply even when \Theta_\Pi is not compact and the boundedness of the Fisher information and the uniform asymptotic normality of the MLE are difficult to guarantee.

V. NON-ASYMPTOTIC FORMULA FOR EXPONENTIAL FAMILY

First, we present the notation of the exponential family. Then, we present the non-asymptotic formula of the PC for the exponential family.

A. Exponential Family and Its Canonical Parameters and Expectation Parameters

We say that a model is in the exponential family when we can express its density function with canonical parameters \eta \in H \subset \mathbb{R}^K, sufficient statistics u : X \to \mathbb{R}^K, and a base measure h : X \to [0, +\infty) as follows:

f(x; \eta) = \frac{h(x)}{Z(\eta)} \exp\left( \eta^T u(x) \right).  (21)

Here, the partition function Z : H \to [0, \infty) is defined by

Z(\eta) = \int dx \, h(x) \exp\left( \eta^T u(x) \right).  (22)

We define the transform \lambda from the canonical parameters to the expectation parameters as

\lambda_k(\eta) \overset{\mathrm{def}}{=} \int dx \, u_k(x) \frac{h(x)}{Z(\eta)} \exp\left( \sum_{k'=1}^{K} \eta_{k'} u_{k'}(x) \right), \qquad \lambda(\eta) \overset{\mathrm{def}}{=} \left[ \lambda_1(\eta)\ \lambda_2(\eta)\ \cdots\ \lambda_K(\eta) \right]^T,

and let \eta(\lambda) denote its inverse transform, assuming \lambda is bijective. We define the MLE with respect to the expectation parameters as follows:

\hat{\lambda} \overset{\mathrm{def}}{=} \operatorname{argmax}_{\lambda} \prod_{n=1}^{N} f(x_n; \eta(\lambda)).  (23)

We can calculate the PC of a model in the exponential family as follows.

Theorem 3. Let f(x; \eta) be the density function of a model in the exponential family, where \eta denotes its canonical parameter and \lambda denotes its expectation parameter. The LPC of f(x; \eta) is expressed as follows:

\int dx^N f(x^N; \eta(\hat{\lambda}))\, w(\hat{\lambda}) = \left( \frac{1}{2\pi} \right)^{K} \int d\lambda \, w(\lambda) \int d\xi \, \exp(-i \xi^T \lambda) \left( \frac{Z(\eta(\lambda) + i\xi/N)}{Z(\eta(\lambda))} \right)^{N}.  (24)

Corollary 1. Let X_{N,1}, X_{N,2}, \ldots, X_{N,N} be a sequence of i.i.d. K-dimensional random variables, the characteristic function of each of which is given by Z(\eta(\lambda) + i\xi/N)/Z(\eta(\lambda)), and let g_{N,\lambda} be the density function of \sum_{n=1}^{N} X_{N,n}. Then, the PC is expressed as follows:

\int dx^N f(x^N; \eta(\hat{\lambda}))\, w(\hat{\lambda}) = \int d\lambda \, g_{N,\lambda}(\lambda)\, w(\lambda).  (25)

Remark 7. Theorem 3 reduces the original integral over data sequences to a 2K-fold integral over \lambda and \xi. Corollary 1 implies that, if we know the density function of \sum_{n=1}^{N} X_{N,n}, the characteristic function of each term of which is given using the partition function of the original model, the calculation of the original PC can be reduced to one integral calculation, and it is often analytically obtained.

Proof: Note that the following holds with respect to the maximum likelihood estimator \hat{\eta} of the canonical parameter:

\nabla_\eta \log Z(\hat{\eta}) = \frac{1}{N} \sum_{n=1}^{N} u(x_n).  (26)
Also note that the maximum likelihood estimator \hat{\lambda} with respect to the expectation parameters can be written as \hat{\lambda} = \lambda(\hat{\eta}). Here it holds that

\hat{\lambda}_k = \frac{1}{N} \sum_{n=1}^{N} u_k(x_n).  (27)

We can calculate the g-function as follows:

g_N(\lambda) = \left( \frac{1}{2\pi} \right)^K \int d\xi \int dx^N \left[ \prod_{n=1}^{N} \frac{h(x_n)}{Z(\eta)} \exp\left( \eta^T u(x_n) \right) \right] \exp\left( i \xi^T \left( \frac{1}{N} \sum_{n=1}^{N} u(x_n) - \lambda \right) \right)
= \left( \frac{1}{2\pi} \right)^K \int d\xi \, \exp(-i \xi^T \lambda) \int dx^N \prod_{n=1}^{N} \frac{h(x_n)}{Z(\eta)} \exp\left( (\eta + i\xi/N)^T u(x_n) \right)
= \left( \frac{1}{2\pi} \right)^K \int d\xi \, \exp(-i \xi^T \lambda) \left( \frac{Z(\eta + i\xi/N)}{Z(\eta)} \right)^N.  (28)

Substituting this into Theorem 1 completes the proof.

B. Examples

In this subsection, we give examples of PC (LPC) calculation using Theorem 3. These examples include the results derived in [4]. Table I lists the results.

TABLE I
EXPONENTIAL FAMILY AND PC
- Normal dist. with known variance v: sufficient statistic u(x) = x; canonical parameter \eta = \mu/v \in (-\infty, +\infty); expectation parameter \mu; parametric complexity \sqrt{N/(2\pi v)} \int d\mu\, w(\mu).
- Normal dist. with known mean \mu: sufficient statistic u(x) = (x - \mu)^2; canonical parameter \eta = -1/(2v) \in (-\infty, 0); expectation parameter v; parametric complexity \frac{(N/2)^{N/2} e^{-N/2}}{\Gamma(N/2)} \int_0^{+\infty} dv\, \frac{w(v)}{v}.
- Laplace dist. with known mean \mu: sufficient statistic u(x) = |x - \mu|; canonical parameter \eta = -1/b \in (-\infty, 0); expectation parameter b; parametric complexity \frac{N^N e^{-N}}{\Gamma(N)} \int_0^{+\infty} db\, \frac{w(b)}{b}.
- Gamma dist. with known shape k (including the exponential distribution with k = 1): sufficient statistic u(x) = x; canonical parameter \eta = -k/\mu \in (-\infty, 0); expectation parameter \mu; parametric complexity \frac{(Nk)^{Nk} e^{-Nk}}{\Gamma(Nk)} \int_0^{+\infty} d\mu\, \frac{w(\mu)}{\mu}.
- Weibull dist. with known shape k: sufficient statistic u(x) = x^k; canonical parameter \eta = -1/L^k \in (-\infty, 0); expectation parameter L^k; parametric complexity \frac{N^N e^{-N}}{\Gamma(N)} \int_0^{+\infty} dL\, \frac{w(L)}{L}.
- Gamma dist. with known scale \theta (including the chi-squared distribution with \theta = 2): sufficient statistic u(x) = \log x; canonical parameter \eta \in (-1, +\infty) (shape \eta + 1); expectation parameter \lambda = \psi(\eta + 1) + \log \theta; partition function Z(\eta) = \Gamma(\eta + 1)\, \theta^{\eta + 1}; parametric complexity: see (31).

1) Fixed-Variance Distribution: If the relationship between the canonical parameter and the expectation parameter is given by \lambda = v\eta + D with a constant v, and the partition function is given by Z(\eta) = C \exp\left( \frac{v}{2}\eta^2 + D\eta \right) with constants C and D, we can calculate the LPC as follows:

\int d\lambda\, w(\lambda)\, \frac{1}{2\pi} \int d\xi\, \exp(-i\xi\lambda) \left( \frac{Z(\eta(\lambda) + i\xi/N)}{Z(\eta(\lambda))} \right)^N = \int d\lambda\, w(\lambda)\, \frac{1}{2\pi} \int d\xi\, \exp\left( -\frac{v}{2N}\xi^2 \right) = \sqrt{\frac{N}{2\pi v}} \int d\lambda\, w(\lambda).  (29)

2) Exponential Distribution Type: If the relationship between the canonical parameter and the expectation parameter is given by \eta = -C/\lambda with a constant C, and the partition function is given by Z(\eta) = D(-\eta)^{-m} with constants D and m, we can calculate the LPC as follows:

\int_0^{+\infty} d\lambda\, w(\lambda)\, \frac{1}{2\pi} \int d\xi \left( 1 - \frac{i\xi\lambda}{NC} \right)^{-Nm} \exp(-i\xi\lambda) = \frac{(NC)^{Nm} e^{-NC}}{\Gamma(Nm)} \int_0^{+\infty} d\lambda\, \frac{w(\lambda)}{\lambda}.  (30)

The last equation holds because (1 - i\xi\lambda/(NC))^{-Nm} is the characteristic function of the gamma distribution with shape parameter Nm and rate parameter NC/\lambda. If we set w = 1\{[\lambda_{\min}, \lambda_{\max}]\}, the PC is equal to

\frac{(NC)^{Nm} e^{-NC}}{\Gamma(Nm)} \log \frac{\lambda_{\max}}{\lambda_{\min}}.

This formula can be applied to distributions including the exponential distribution, the chi-squared distribution, the Laplace distribution with a known mean, the Weibull distribution with a known shape, and the gamma distribution with a known shape.

3) Chi-squared Distribution Type: We discuss the gamma distribution with a known scale \theta. The result here includes the chi-squared distribution (set \theta = 2). We can calculate the LPC as follows:

\int_{-\infty}^{+\infty} ds\, w(\log\theta + s)\, \frac{1}{2\pi} \int d\xi \left( \frac{\Gamma(\psi^{-1}(s) + i\xi/N)}{\Gamma(\psi^{-1}(s))} \right)^N \exp(-i\xi s) = \int_{-\infty}^{+\infty} ds\, w(\log\theta + s)\, G_N\left( s; \psi^{-1}(s), 1 \right).  (31)

Here, G_N(\cdot\,; p, q) is the probability density function of the mean of N i.i.d. samples, the density of each of which is given by

G(y; p, q) = \frac{1}{q^p\, \Gamma(p)} \exp\left( p y - \frac{e^{y}}{q} \right).  (32)

VI. CONCLUSION

In this paper, we derived a non-asymptotic form of the NML code length and clarified its relationship to the asymptotic expansion. Moreover, we presented a non-asymptotic calculation formula of the NML code length for exponential family models. This formula can be applied whenever we know the partition function. In addition, if we know the closed form of a distribution whose characteristic function can be expressed using the partition function, the calculation is reduced to one integral calculation, which is often analytically obtained.

REFERENCES

[1] A. R. Barron and T. M. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37(4), 1991.
[2] P. D. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.
[3] P. D. Grünwald and N. A. Mehta. A tight excess risk bound via a unified PAC-Bayesian-Rademacher-Shtarkov-MDL complexity. CoRR, 2017.
[4] S. Hirai and K. Yamanishi. Efficient computation of normalized maximum likelihood codes for Gaussian mixture models with its applications to clustering. IEEE Transactions on Information Theory, 59(11), 2013.
[5] P. Kontkanen and P. Myllymäki. A linear-time algorithm for computing the multinomial stochastic complexity. Information Processing Letters, 103(6):227–233, 2007.
[6] P. Kontkanen and P. Myllymäki. MDL histogram density estimation. In International Conference on Artificial Intelligence and Statistics, pages 219–226, 2007.
[7] P. Kontkanen and P. Myllymäki. An empirical comparison of NML clustering algorithms.
[8] L. G. Kraft. A Device for Quantizing, Grouping, and Coding Amplitude-Modulated Pulses. PhD thesis, Massachusetts Institute of Technology, 1949.
[9] B. McMillan. Two inequalities implied by unique decipherability. IRE Transactions on Information Theory, 2(4), 1956.
[10] R. Nishii. Maximum likelihood principle and model selection when the true model is unspecified. Journal of Multivariate Analysis, 27:392–403, 1988.
[11] J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, 1989.
[12] J. Rissanen. Stochastic complexity in learning. In Computational Learning Theory. Springer, 1995.
[13] J. Rissanen. Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42(1):40–47, 1996.
[14] J. Rissanen. Stochastic complexity in learning. Journal of Computer and System Sciences, 55(1):89–95, 1997.
[15] J. Rissanen. MDL denoising. IEEE Transactions on Information Theory, 46(7), 2000.
[16] J. Rissanen, T. P. Speed, and B. Yu. Density estimation by stochastic complexity. IEEE Transactions on Information Theory, 38(2):315–323, 1992.
[17] T. Roos, T. Silander, P. Kontkanen, and P. Myllymäki. Bayesian network structure learning using factorized NML universal models. In Information Theory and Applications Workshop. IEEE, 2008.
[18] C. E. Shannon. A mathematical theory of communication, parts I and II. Bell System Technical Journal, 27:379–423, 623–656, 1948.
[19] Y. M. Shtar'kov. Universal sequential coding of single messages. Problemy Peredachi Informatsii, 23(3):3–17, 1987.
[20] A. W. van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 1998.
[21] K. Yamanishi. A learning criterion for stochastic rules. Machine Learning, 9(2):165–203, 1992.
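The closed-form result for the exponential-distribution type (Section V-B) can be checked by brute force. The sketch below is a Monte-Carlo illustration under assumed toy values (N = 3, mean restricted to [0.5, 2] via the luckiness w = 1{[mu_min, mu_max]}; the importance-sampling proposal is this example's own choice, not part of the paper): it estimates the restricted PC directly as an integral over data sequences and compares it with the closed form N^N e^{-N} / Gamma(N) times log(mu_max / mu_min).

```python
import math
import random

def exact_lpc_exponential(N, mu_min, mu_max):
    # Closed form for the exponential distribution (C = m = 1 in Section V-B):
    # (N^N e^{-N} / Gamma(N)) * log(mu_max / mu_min)
    return math.exp(N * math.log(N) - N - math.lgamma(N)) * math.log(mu_max / mu_min)

def mc_lpc_exponential(N, mu_min, mu_max, samples=200_000, seed=0):
    # Direct Monte-Carlo estimate of the restricted PC
    #   integral of f(x^N; mu_hat) * 1{mu_hat in [mu_min, mu_max]} dx^N,
    # via importance sampling with proposal q(x^N) = product of Exp(mean 1).
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        x = [rng.expovariate(1.0) for _ in range(N)]
        mu_hat = sum(x) / N  # MLE of the mean
        if mu_min <= mu_hat <= mu_max:
            # Weight f(x; mu_hat) / q(x) = mu_hat^{-N} e^{-N} * e^{+sum(x)}
            total += math.exp(-N * math.log(mu_hat) - N + sum(x))
    return total / samples

exact = exact_lpc_exponential(3, 0.5, 2.0)
approx = mc_lpc_exponential(3, 0.5, 2.0)
```

With these toy values the two quantities agree to Monte-Carlo accuracy (the importance weights are bounded on the restricted event, so the estimator has small variance).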
More informationApproximate inference, Sampling & Variational inference Fall Cours 9 November 25
Approimate inference, Sampling & Variational inference Fall 2015 Cours 9 November 25 Enseignant: Guillaume Obozinski Scribe: Basile Clément, Nathan de Lara 9.1 Approimate inference with MCMC 9.1.1 Gibbs
More information2 Statistical Estimation: Basic Concepts
Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof. N. Shimkin 2 Statistical Estimation:
More informationLecture 3: More on regularization. Bayesian vs maximum likelihood learning
Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting
More informationPower EP. Thomas Minka Microsoft Research Ltd., Cambridge, UK MSR-TR , October 4, Abstract
Power EP Thomas Minka Microsoft Research Ltd., Cambridge, UK MSR-TR-2004-149, October 4, 2004 Abstract This note describes power EP, an etension of Epectation Propagation (EP) that makes the computations
More informationUseful Mathematics. 1. Multivariable Calculus. 1.1 Taylor s Theorem. Monday, 13 May 2013
Useful Mathematics Monday, 13 May 013 Physics 111 In recent years I have observed a reticence among a subpopulation of students to dive into mathematics when the occasion arises in theoretical mechanics
More informationNotes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed
18.466 Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 1. MLEs in exponential families Let f(x,θ) for x X and θ Θ be a likelihood function, that is, for present purposes,
More informationOptimal scaling of the random walk Metropolis on elliptically symmetric unimodal targets
Bernoulli 15(3), 2009, 774 798 DOI: 10.3150/08-BEJ176 Optimal scaling of the random walk Metropolis on elliptically symmetric unimodal targets CHRIS SHERLOCK 1 and GARETH ROBERTS 2 1 Department of Mathematics
More informationSemiparametric posterior limits
Statistics Department, Seoul National University, Korea, 2012 Semiparametric posterior limits for regular and some irregular problems Bas Kleijn, KdV Institute, University of Amsterdam Based on collaborations
More informationStochastic Complexity of Variational Bayesian Hidden Markov Models
Stochastic Complexity of Variational Bayesian Hidden Markov Models Tikara Hosino Department of Computational Intelligence and System Science, Tokyo Institute of Technology Mailbox R-5, 459 Nagatsuta, Midori-ku,
More informationThe Behaviour of the Akaike Information Criterion when Applied to Non-nested Sequences of Models
The Behaviour of the Akaike Information Criterion when Applied to Non-nested Sequences of Models Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population Health
More informationExact Minimax Predictive Density Estimation and MDL
Exact Minimax Predictive Density Estimation and MDL Feng Liang and Andrew Barron December 5, 2003 Abstract The problems of predictive density estimation with Kullback-Leibler loss, optimal universal data
More informationThe Expectation-Maximization Algorithm
1/29 EM & Latent Variable Models Gaussian Mixture Models EM Theory The Expectation-Maximization Algorithm Mihaela van der Schaar Department of Engineering Science University of Oxford MLE for Latent Variable
More informationf-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models
IEEE Transactions on Information Theory, vol.58, no.2, pp.708 720, 2012. 1 f-divergence Estimation and Two-Sample Homogeneity Test under Semiparametric Density-Ratio Models Takafumi Kanamori Nagoya University,
More informationMobile Robot Localization
Mobile Robot Localization 1 The Problem of Robot Localization Given a map of the environment, how can a robot determine its pose (planar coordinates + orientation)? Two sources of uncertainty: - observations
More informationTail Properties and Asymptotic Expansions for the Maximum of Logarithmic Skew-Normal Distribution
Tail Properties and Asymptotic Epansions for the Maimum of Logarithmic Skew-Normal Distribution Xin Liao, Zuoiang Peng & Saralees Nadarajah First version: 8 December Research Report No. 4,, Probability
More informationPatterns of Scalable Bayesian Inference Background (Session 1)
Patterns of Scalable Bayesian Inference Background (Session 1) Jerónimo Arenas-García Universidad Carlos III de Madrid jeronimo.arenas@gmail.com June 14, 2017 1 / 15 Motivation. Bayesian Learning principles
More information= 1 2 x (x 1) + 1 {x} (1 {x}). [t] dt = 1 x (x 1) + O (1), [t] dt = 1 2 x2 + O (x), (where the error is not now zero when x is an integer.
Problem Sheet,. i) Draw the graphs for [] and {}. ii) Show that for α R, α+ α [t] dt = α and α+ α {t} dt =. Hint Split these integrals at the integer which must lie in any interval of length, such as [α,
More informationMinimax Optimal Bayes Mixtures for Memoryless Sources over Large Alphabets
Proceedings of Machine Learning Research 8: 8, 208 Algorithmic Learning Theory 208 Minimax Optimal Bayes Mixtures for Memoryless Sources over Large Alphabets Elias Jääsaari Helsinki Institute for Information
More informationFoundations of Nonparametric Bayesian Methods
1 / 27 Foundations of Nonparametric Bayesian Methods Part II: Models on the Simplex Peter Orbanz http://mlg.eng.cam.ac.uk/porbanz/npb-tutorial.html 2 / 27 Tutorial Overview Part I: Basics Part II: Models
More informationMinimum message length estimation of mixtures of multivariate Gaussian and von Mises-Fisher distributions
Minimum message length estimation of mixtures of multivariate Gaussian and von Mises-Fisher distributions Parthan Kasarapu & Lloyd Allison Monash University, Australia September 8, 25 Parthan Kasarapu
More informationTaylor Series and Asymptotic Expansions
Taylor Series and Asymptotic Epansions The importance of power series as a convenient representation, as an approimation tool, as a tool for solving differential equations and so on, is pretty obvious.
More informationType II variational methods in Bayesian estimation
Type II variational methods in Bayesian estimation J. A. Palmer, D. P. Wipf, and K. Kreutz-Delgado Department of Electrical and Computer Engineering University of California San Diego, La Jolla, CA 9093
More informationBayesian Inference of Noise Levels in Regression
Bayesian Inference of Noise Levels in Regression Christopher M. Bishop Microsoft Research, 7 J. J. Thomson Avenue, Cambridge, CB FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop
More informationModel Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model
Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population
More informationTHE inverse tangent function is an elementary mathematical
A Sharp Double Inequality for the Inverse Tangent Function Gholamreza Alirezaei arxiv:307.983v [cs.it] 8 Jul 03 Abstract The inverse tangent function can be bounded by different inequalities, for eample
More informationVariational Bayes. A key quantity in Bayesian inference is the marginal likelihood of a set of data D given a model M
A key quantity in Bayesian inference is the marginal likelihood of a set of data D given a model M PD M = PD θ, MPθ Mdθ Lecture 14 : Variational Bayes where θ are the parameters of the model and Pθ M is
More informationWhere now? Machine Learning and Bayesian Inference
Machine Learning and Bayesian Inference Dr Sean Holden Computer Laboratory, Room FC6 Telephone etension 67 Email: sbh@clcamacuk wwwclcamacuk/ sbh/ Where now? There are some simple take-home messages from
More informationStatistical Learning Theory of Variational Bayes
Statistical Learning Theory of Variational Bayes Department of Computational Intelligence and Systems Science Interdisciplinary Graduate School of Science and Engineering Tokyo Institute of Technology
More informationTail Approximation of the Skew-Normal by the Skew-Normal Laplace: Application to Owen s T Function and the Bivariate Normal Distribution
Journal of Statistical and Econometric ethods vol. no. 3 - ISS: 5-557 print version 5-565online Scienpress Ltd 3 Tail Approimation of the Skew-ormal by the Skew-ormal Laplace: Application to Owen s T Function
More informationIntroduction to Probability Theory for Graduate Economics Fall 2008
Introduction to Probability Theory for Graduate Economics Fall 008 Yiğit Sağlam October 10, 008 CHAPTER - RANDOM VARIABLES AND EXPECTATION 1 1 Random Variables A random variable (RV) is a real-valued function
More informationMISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30
MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD Copyright c 2012 (Iowa State University) Statistics 511 1 / 30 INFORMATION CRITERIA Akaike s Information criterion is given by AIC = 2l(ˆθ) + 2k, where l(ˆθ)
More informationTight Bounds for Symmetric Divergence Measures and a New Inequality Relating f-divergences
Tight Bounds for Symmetric Divergence Measures and a New Inequality Relating f-divergences Igal Sason Department of Electrical Engineering Technion, Haifa 3000, Israel E-mail: sason@ee.technion.ac.il Abstract
More informationA General Overview of Parametric Estimation and Inference Techniques.
A General Overview of Parametric Estimation and Inference Techniques. Moulinath Banerjee University of Michigan September 11, 2012 The object of statistical inference is to glean information about an underlying
More informationSequential prediction with coded side information under logarithmic loss
under logarithmic loss Yanina Shkel Department of Electrical Engineering Princeton University Princeton, NJ 08544, USA Maxim Raginsky Department of Electrical and Computer Engineering Coordinated Science
More informationBregman Divergence and Mirror Descent
Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,
More informationMaximum Likelihood Estimation
Chapter 8 Maximum Likelihood Estimation 8. Consistency If X is a random variable (or vector) with density or mass function f θ (x) that depends on a parameter θ, then the function f θ (X) viewed as a function
More informationA GENERAL CLASS OF LOWER BOUNDS ON THE PROBABILITY OF ERROR IN MULTIPLE HYPOTHESIS TESTING. Tirza Routtenberg and Joseph Tabrikian
A GENERAL CLASS OF LOWER BOUNDS ON THE PROBABILITY OF ERROR IN MULTIPLE HYPOTHESIS TESTING Tirza Routtenberg and Joseph Tabrikian Department of Electrical and Computer Engineering Ben-Gurion University
More informationPredictive Hypothesis Identification
Marcus Hutter - 1 - Predictive Hypothesis Identification Predictive Hypothesis Identification Marcus Hutter Canberra, ACT, 0200, Australia http://www.hutter1.net/ ANU RSISE NICTA Marcus Hutter - 2 - Predictive
More informationInformation Theory Based Estimator of the Number of Sources in a Sparse Linear Mixing Model
Information heory Based Estimator of the Number of Sources in a Sparse Linear Mixing Model Radu Balan University of Maryland Department of Mathematics, Center for Scientific Computation And Mathematical
More informationMachine Learning Basics III
Machine Learning Basics III Benjamin Roth CIS LMU München Benjamin Roth (CIS LMU München) Machine Learning Basics III 1 / 62 Outline 1 Classification Logistic Regression 2 Gradient Based Optimization Gradient
More informationGraduate Econometrics I: Maximum Likelihood I
Graduate Econometrics I: Maximum Likelihood I Yves Dominicy Université libre de Bruxelles Solvay Brussels School of Economics and Management ECARES Yves Dominicy Graduate Econometrics I: Maximum Likelihood
More informationThe Minimum Message Length Principle for Inductive Inference
The Principle for Inductive Inference Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population Health University of Melbourne University of Helsinki, August 25,
More informationebay/google short course: Problem set 2
18 Jan 013 ebay/google short course: Problem set 1. (the Echange Parado) You are playing the following game against an opponent, with a referee also taking part. The referee has two envelopes (numbered
More informationBayesian estimation of the discrepancy with misspecified parametric models
Bayesian estimation of the discrepancy with misspecified parametric models Pierpaolo De Blasi University of Torino & Collegio Carlo Alberto Bayesian Nonparametrics workshop ICERM, 17-21 September 2012
More informationRejection sampling - Acceptance probability. Review: How to sample from a multivariate normal in R. Review: Rejection sampling. Weighted resampling
Rejection sampling - Acceptance probability Review: How to sample from a multivariate normal in R Goal: Simulate from N d (µ,σ)? Note: For c to be small, g() must be similar to f(). The art of rejection
More informationUnsupervised Learning
Unsupervised Learning Bayesian Model Comparison Zoubin Ghahramani zoubin@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit, and MSc in Intelligent Systems, Dept Computer Science University College
More informationApproximate formulas for the Point-to-Ellipse and for the Point-to-Ellipsoid Distance Problem
Approimate formulas for the Point-to-Ellipse and for the Point-to-Ellipsoid Distance Problem ALEXEI UTESHEV St.Petersburg State University Department of Applied Mathematics Universitetskij pr. 35, 198504
More informationInformation geometry of Bayesian statistics
Information geometry of Bayesian statistics Hiroshi Matsuzoe Department of Computer Science and Engineering, Graduate School of Engineering, Nagoya Institute of Technology, Nagoya 466-8555, Japan Abstract.
More informationOn the Behavior of MDL Denoising
On the Behavior of MDL Denoising Teemu Roos Petri Myllymäki Helsinki Institute for Information Technology Univ. of Helsinki & Helsinki Univ. of Technology P.O. Box 9800 FIN-0015 TKK, Finland Henry Tirri
More information8 The Contribution of Parameters to Stochastic Complexity
8 The Contribution of Parameters to Stochastic Complexity Dean P. Foster and Robert A. Stine Department of Statistics The Wharton School of the University of Pennsylvania Philadelphia, PA 19104-630 foster@wharton.upenn.edu
More informationEstimation theory and information geometry based on denoising
Estimation theory and information geometry based on denoising Aapo Hyvärinen Dept of Computer Science & HIIT Dept of Mathematics and Statistics University of Helsinki Finland 1 Abstract What is the best
More informationMachine Learning Basics: Maximum Likelihood Estimation
Machine Learning Basics: Maximum Likelihood Estimation Sargur N. srihari@cedar.buffalo.edu This is part of lecture slides on Deep Learning: http://www.cedar.buffalo.edu/~srihari/cse676 1 Topics 1. Learning
More information