Exact Calculation of Normalized Maximum Likelihood Code Length Using Fourier Analysis

Atsushi Suzuki and Kenji Yamanishi
The University of Tokyo, Graduate School of Information Science and Technology, Bunkyo, Tokyo, Japan
arXiv: v1 [math.ST], 11 Jan 2018

Abstract—The normalized maximum likelihood (NML) code length has been widely used in model selection, and its favorable properties, such as its consistency and the upper bound of its statistical risk, have been demonstrated. This paper proposes a novel methodology for calculating the NML code length on the basis of Fourier analysis. Our methodology provides an efficient non-asymptotic calculation formula for exponential family models and an asymptotic calculation formula for general parametric models under a weaker assumption than that in previous work.

I. INTRODUCTION

A. Background and Our Contribution

The normalized maximum likelihood (NML) code length is an extension of self-entropy to the case in which a set of distributions is given instead of the true distribution. When the true distribution is known, the lower bound of the mean code length for a random variable is given by the Shannon entropy of its probability distribution, and the lower bound is attained by the self-entropy [18]. This optimal code, or the self-entropy, can also be interpreted as the solution of a trivial optimization problem for the log redundancy with respect to the code length function $l$, with the Kraft-McMillan inequality [8], [9] as a constraint:

  \min_l \max_{x^N} \left[ l(x^N) - \left(-\log f_0(x^N)\right) \right] \quad \text{s.t.} \quad \int dx^N \exp\left(-l(x^N)\right) \le 1,   (1)

where $x^N := (x_1, x_2, \ldots, x_N)$ is a data sequence and $f_0$ denotes the probability density function of the data-generating distribution. Evidently, the optimal code length is given by $l(x^N) = -\log f_0(x^N)$ (the self-entropy). Note that we discuss the case of continuous random variables in this paper. Further, the base of the logarithm is $e$, and the natural unit of information is used throughout.

The optimization problem (1), or the original Shannon entropy, deals with the case in which the true distribution is known. When a set of distributions $\mathcal{F}$ is given as candidates for the true distribution instead of the true distribution itself, we can extend the previous optimization problem to the problem introduced by Shtarkov [19]:

  \min_l \max_{x^N} \left[ l(x^N) - \min_{f \in \mathcal{F}} \left(-\log f(x^N)\right) \right] \quad \text{s.t.} \quad \int dx^N \exp\left(-l(x^N)\right) \le 1.   (2)

This problem is no longer trivial, and Shtarkov showed that the NML code length defined below attains its minimum [19]:

  l_{\mathrm{NML}}(x^N) := -\log f_{\mathrm{NML}}(x^N) = -\log \max_{f \in \mathcal{F}} f(x^N) + \log \int_{\mathcal{X}^N} \max_{f \in \mathcal{F}} f(x^N) \, dx^N.   (3)

The problem (2) reduces to (1) if $\mathcal{F} = \{f_0\}$, and in this sense the NML code length is an extension of the self-entropy. The NML code is one of the universally optimal codes when the true distribution within the given set is unknown [11]. The NML code length is widely used in model selection on the basis of the minimum description length (MDL) principle [16], [12], [21], [14]: the model that minimizes the NML code length for the given data is selected. Recently, it has been shown that the NML code length bounds the generalization loss [3].

The calculation of the NML code length has therefore been an important problem. Rissanen derived an asymptotic formula for the NML code length [13], which clarified its behavior up to $o(1)$ terms as follows:

  \log \int dx^N f\!\left(x^N; \hat{\theta}(x^N)\right) = \frac{K}{2} \log \frac{N}{2\pi} + \log \int_{\Theta} d\theta \sqrt{\det I(\theta)} + o(1),   (4)

where $K$ denotes the dimension of the parameter. This formula holds under certain regularity conditions and does not depend on the details of the model.
According to this formula, Nishii's analysis of the consistency of the selected model [10] and Barron and Cover's result on statistical risk [1] can be applied to model selection using the NML code length. In contrast to the generality of Rissanen's asymptotic formula, non-asymptotic calculation formulas have been derived only through model-by-model discussion [15], [5], [6], [7], [17]. Recently, Hirai and Yamanishi non-asymptotically calculated the NML code length for several models in the exponential family [4]. They reduced the calculation of the NML code length to an integral over the parameter domain of a function denoted by $g$ in their paper. However, the method to obtain Hirai-Yamanishi's $g$-function explicitly depends on the model. Thus, the exact calculation of the NML code length has been limited to particular models.

This paper proposes a novel methodology for calculating the NML code length on the basis of the Fourier transform. Our methodology enables the systematic analysis of the NML code length in terms of both asymptotic expansion and exact calculation. As corollaries, it provides an asymptotic formula with weaker assumptions than Rissanen's and a useful exact calculation formula for the exponential family.

B. Significance of This Paper

This paper proposes an alternative form of the NML code length based on the Fourier transform. Our form enables the systematic calculation of the NML code length. Specifically, it results in the two formulae presented below.

1) Asymptotic Formula with a Weaker Assumption: Taking the limit of our form leads to Rissanen's asymptotic formula [13]. It should be noted that Lebesgue's dominated convergence theorem can be applied to our Fourier-transform-based form, which results in an asymptotic formula with a weaker assumption than that in the original paper [13].

2) Exact Calculation Formula for the Exponential Family: Our Fourier-transform-based form gives a simple formula for the exact calculation of the NML code length of the exponential family. The formula yields the NML code length from the partition function and the relationship between the canonical parameters and the expectations of the sufficient statistics.

C. Related Work

1) Asymptotic Formula with a Weaker Assumption: The conclusion of the asymptotic formula in this paper is the same as that of Rissanen's theorem [13]. However, Rissanen's theorem assumes both the uniform asymptotic normality of the maximum likelihood estimator and the existence of a non-zero lower bound and a finite upper bound on the Fisher information; in contrast, our theorem does not involve these assumptions and allows the Fisher information to converge to zero or diverge toward the boundary.

2) Exact Calculation Formula for the Exponential Family: Hirai and Yamanishi presented the exact calculation for several models in the exponential family through the integral of the $g$-function. However, in general, it is still difficult to obtain the explicit form of the $g$-function. In this paper, a general exact calculation formula for the exponential family is obtained, which includes Hirai and Yamanishi's results.

II. NORMALIZED MAXIMUM LIKELIHOOD CODE LENGTH

We consider a sequence $x^N := (x_1, x_2, \ldots, x_N)$ of continuous random variables and assume that it has a probability density function.

Definition 1. Let $\mathcal{F} \subset \{ f : \mathcal{X}^N \to [0, \infty) \mid \int_{\mathcal{X}^N} f(x^N)\,dx^N = 1 \}$ denote a set of density functions, where $\mathcal{X} \subset \mathbb{R}^D$ denotes the domain of a datum. Assume that $\max_{f \in \mathcal{F}} f(x^N)$ is a measurable function of $x^N$. The NML code length is defined as the negative log-likelihood

  l_{\mathrm{NML}}(x^N) := -\log \max_{f \in \mathcal{F}} f(x^N) + \log C(\mathcal{F}),   (5)

where $C(\mathcal{F}) := \int_{\mathcal{X}^N} dx^N \max_{f \in \mathcal{F}} f(x^N)$. $C(\mathcal{F})$, or its logarithm, is called the parametric complexity (PC) of $\mathcal{F}$.

In this paper, we focus on the case in which it is easy to evaluate the first term (the maximized likelihood) but difficult to evaluate the second term (the parametric complexity). This is because, when even the first term is intractable, it is hardly possible to evaluate the second term exactly.
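Although this paper treats continuous data, the definition of the parametric complexity is easiest to see in a discrete toy case, where the integral defining $C(\mathcal{F})$ becomes a sum over all sequences. The following sketch is our illustration, not part of the paper; it enumerates $C(\mathcal{F})$ for the Bernoulli model (the discrete analogue studied, e.g., in [5]) by grouping sequences according to their number of ones.

```python
import math

def bernoulli_parametric_complexity(N):
    # C(F) = sum over sequences of max_theta p(sequence; theta)
    #      = sum_k  binom(N, k) * (k/N)^k * (1 - k/N)^(N - k),
    # since the maximizing theta for a sequence with k ones is theta = k/N.
    total = 0.0
    for k in range(N + 1):
        p_hat = k / N
        total += math.comb(N, k) * p_hat ** k * (1.0 - p_hat) ** (N - k)
    return total

for N in (10, 100, 1000):
    # log-parametric complexity; it grows like (1/2) log N plus a constant,
    # in line with the asymptotic formula (4).
    print(N, math.log(bernoulli_parametric_complexity(N)))
```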
In this paper, we consider the i.i.d. parametric model

  \mathcal{F}_\Theta := \left\{ f : \mathcal{X}^N \to [0, \infty) \ \middle|\ f(x^N) = \prod_{n=1}^{N} f(x_n; \theta), \ \theta \in \Theta \subset \mathbb{R}^K \right\}   (6)

as the set of density functions. Here, $\theta$ is the parameter and $\Theta$ is the parameter domain. We mainly analyze a proper parameter domain $\Theta_\Pi \subset \Theta$ defined as follows.

Definition 2. A subset $\Theta_\Pi$ of $\Theta$ is proper if the following conditions are satisfied:
1) The map $\Theta_\Pi \ni \theta \mapsto f(\,\cdot\,;\theta) \in \mathcal{F}_\Pi \subset \mathcal{F}$ is bijective (one-to-one).
2) For all $x^N \in \mathcal{X}^N$, a unique solution $\hat{\theta}$ of $\max_{\theta \in \Theta_\Pi} f(x^N;\theta)$ exists; that is, a unique maximum likelihood estimator (MLE) $\hat{\theta}$ exists.
3) $\max_{\theta \in \Theta_\Pi} f(x^N;\theta)$ is a measurable function of $x^N$.
4) If $\theta \in \Theta_\Pi$ and $x_n \sim f(\,\cdot\,;\theta)$ independently for $n = 1, \ldots, N$, the asymptotic normality of the MLE $\hat{\theta}$ holds; that is, $\sqrt{N}(\hat{\theta} - \theta) \to \mathcal{N}(0, I(\theta)^{-1})$, where $I(\theta)$ denotes the Fisher information matrix.

We also define the proper data sequence domain $\mathcal{X}^N_\Pi$ as follows:

  \mathcal{X}^N_\Pi := \left\{ x^N \in \mathcal{X}^N \ \middle|\ \max_{\theta \in \Theta_\Pi} f(x^N;\theta) = \max_{\theta \in \Theta} f(x^N;\theta) \right\}.   (7)

Remark 1. Sufficient conditions for 4) have been discussed elsewhere (for example, see [20]). At least the positive definiteness of $I(\theta)$ on $\Theta_\Pi$ is required for 4).

Remark 2. Since $\Theta_\Pi \subset \Theta$, in general $\max_{\theta \in \Theta_\Pi} f(x^N;\theta) \le \max_{\theta \in \Theta} f(x^N;\theta)$.

Remark 3. In this paper, $\hat{\theta}$ always denotes the unique MLE on $\Theta_\Pi$. If $\Theta_\Pi \subsetneq \Theta$, the MLE in $\Theta$ can be non-unique.

Roughly speaking, the proper parameter domain is a tractable subset of the model, and the proper data sequence domain is the set of sequences whose MLE lies in the proper parameter domain. The PC can be decomposed as follows:

  \int_{\mathcal{X}^N} dx^N \max_{\theta \in \Theta} f(x^N;\theta) = \int_{\mathcal{X}^N} dx^N f(x^N;\hat{\theta}) + \int_{\mathcal{X}^N \setminus \mathcal{X}^N_\Pi} dx^N \left( \max_{\theta \in \Theta} f(x^N;\theta) - f(x^N;\hat{\theta}) \right).   (8)

Remark 4. If we can take $\Theta_\Pi = \Theta$, as is often the case with well-behaved models such as exponential family models, the second term vanishes, and the logarithm of the first term equals the parametric complexity. We assume that the second term is negligible and focus on the first term $C(\Theta_\Pi) := \int_{\mathcal{X}^N} dx^N f(x^N;\hat{\theta})$ in this paper. Note that $C(\Theta_\Pi)$ integrates over all data sequences and often diverges to infinity. To avoid this problem, we introduce luckiness [2] to generalize $C(\Theta_\Pi)$ as follows.

Definition 3. Let $w : \Theta_\Pi \to [0, \infty)$ denote a weight function on $\Theta_\Pi$ called the luckiness. We define the luckiness parametric complexity (LPC) of $\Theta_\Pi$ as

  C_w(\Theta_\Pi) := \int dx^N f(x^N;\hat{\theta}) \, w(\hat{\theta}),   (9)

where $\hat{\theta} := \operatorname{argmax}_{\theta \in \Theta_\Pi} f(x^N;\theta)$.

Remark 5. If $w(\theta) \equiv 1$, the LPC is equal to the PC. Let $A$ be a subset of $\Theta_\Pi$. We can regard the LPC $C_{1\{A\}}(\Theta_\Pi)$ as a restriction of the PC $C(\Theta_\Pi)$ to $A$, where $1\{\cdot\}$ denotes the indicator function. This restriction is often necessary and is used in continuous-variable cases [2], [4].

III. FOURIER FORM OF THE NML CODE LENGTH

First, we make assumptions that allow us to exchange integrals.

Assumption 1.
1) For every $\Phi_0 \subset \Theta_\Pi$ of measure zero, the set $\{ x^N \mid \hat{\theta}(x^N) \in \Phi_0 \}$ has measure zero.
2) For all $x^N$, $f(x^N;\theta)\,w(\theta)$ is integrable and square-integrable as a function of $\theta$.
3) For all $x^N$, the Fourier transform $\hat{f}_w(x^N;\xi)$ of $f(x^N;\theta)\,w(\theta)$ is integrable as a function of $\xi$, where

  \hat{f}_w(x^N;\xi) := \left( \frac{1}{2\pi} \right)^{K/2} \int_{\Theta_\Pi} d\theta \, \exp\left(-i\xi^T\theta\right) f(x^N;\theta)\,w(\theta).   (10)

4) The characteristic function $\varphi^{(N)}_\theta(\xi)$ of the maximum likelihood estimator is integrable as a function of $\theta$ and $\xi$, where

  \varphi^{(N)}_\theta(\xi) := \int dx^N f(x^N;\theta) \exp\left( i\xi^T \sqrt{N}\,(\hat{\theta} - \theta) \right).   (11)

We obtain the Fourier-transform-based form of the NML code length as follows.

Theorem 1. Under Assumption 1, the LPC is calculated as

  \int dx^N f(x^N;\hat{\theta})\,w(\hat{\theta}) = \int d\theta \, w(\theta) \, g_N(\theta),   (12)

where

  g_N(\theta) := \left( \frac{1}{2\pi} \right)^K \int d\xi \int dx^N f(x^N;\theta) \exp\left( i\xi^T(\hat{\theta} - \theta) \right).   (13)

Proof: Since $f(x^N;\,\cdot\,)\,w(\,\cdot\,) \in L^1(\Theta_\Pi \to \mathbb{C}) \cap L^2(\Theta_\Pi \to \mathbb{C})$, we have

  f(x^N;\theta)\,w(\theta)\big|_{\theta=\hat{\theta}} = \left( \frac{1}{2\pi} \right)^{K/2} \int d\xi \, \exp\left( i\xi^T\hat{\theta} \right) \hat{f}_w(x^N;\xi) \quad \text{a.s.}   (14)

Thus, the following holds with Assumption 1.1):

  \int dx^N f(x^N;\hat{\theta})\,w(\hat{\theta}) = \left( \frac{1}{2\pi} \right)^{K/2} \int d\xi \int dx^N \exp\left( i\xi^T\hat{\theta} \right) \hat{f}_w(x^N;\xi),   (15)

where the exchange of integrals follows from the absolute integrability of $\hat{f}_w(x^N;\xi)$ and Fubini's theorem. We exchange integrals again in the same way:

  \int dx^N \exp\left( i\xi^T\hat{\theta} \right) \hat{f}_w(x^N;\xi) = \left( \frac{1}{2\pi} \right)^{K/2} \int d\theta \, w(\theta) \int dx^N \exp\left( i\xi^T(\hat{\theta} - \theta) \right) f(x^N;\theta),   (16)

where the exchange again follows from the absolute integrability of $\hat{f}_w(x^N;\xi)$ and Fubini's theorem. By the third assumption, we can exchange the integrals with respect to $\theta$ and $\xi$, which completes the proof.
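The identity in Theorem 1 can be probed numerically. The sketch below is our illustration; the model, the luckiness window, and all numerical choices are ours, not the paper's. We take the exponential model parameterized by its mean $\mu$ with $w = 1\{[\mu_{\min}, \mu_{\max}]\}$. Under Assumption 1, the double integral defining $g_N(\mu)$ is, by Fourier inversion, the density of the MLE $\hat{\mu} = \bar{x}$ evaluated at the data-generating parameter $\mu$ itself (this reading of $g_N$ is ours); for this model $\bar{x}$ has a gamma distribution, so the right-hand side of (12) is a one-dimensional integral, while the left-hand side is estimated by importance sampling in data space.

```python
import numpy as np
from scipy import integrate, stats

N = 10                      # sample size
mu_min, mu_max = 0.5, 2.0   # luckiness w = indicator of [mu_min, mu_max]
rng = np.random.default_rng(0)

# Right-hand side of (12): integral over the parameter of w(mu) * g_N(mu).
# For the exponential model with mean mu, the MLE (the sample mean) is
# Gamma(shape N, scale mu/N) distributed, and g_N(mu) is its density at mu.
def g_N(mu):
    return stats.gamma.pdf(mu, a=N, scale=mu / N)

rhs, _ = integrate.quad(g_N, mu_min, mu_max)

# Left-hand side of (12): data-space integral of f(x^N; mle) * w(mle),
# estimated by importance sampling from f(.; mu0); the weight is bounded
# because w restricts the MLE to a compact interval away from zero.
mu0 = 1.0
x = rng.exponential(mu0, size=(200_000, N))
mle = x.mean(axis=1)
log_weight = N * (np.log(mu0 / mle) + mle / mu0 - 1.0)
lhs = np.mean(np.exp(log_weight) * ((mle >= mu_min) & (mle <= mu_max)))

print(f"parameter-space integral (Thm 1): {rhs:.4f}")
print(f"data-space Monte Carlo estimate : {lhs:.4f}")
```

Both quantities evaluate to about 1.73 for $N = 10$ and $[\mu_{\min}, \mu_{\max}] = [0.5, 2]$, up to Monte Carlo error.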

IV. ASYMPTOTIC FORMULA

By taking the limit of Theorem 1, we can prove the asymptotic formula of the LPC, which relaxes some of the conditions of Rissanen's asymptotic formula [13]. First, we make assumptions that allow us to exchange the limit and the integral.

Assumption 2.
1) There exists an integrable function $\bar{\varphi}_\theta(\xi)$ of $\xi$ such that $|\varphi^{(N)}_\theta(\xi)| < \bar{\varphi}_\theta(\xi)$ for all $N$ and $\theta$.
2) There exists an integrable function $\bar{g}(\theta)$ of $\theta$ such that $(2\pi/N)^{K/2}\,g_N(\theta) < \bar{g}(\theta)$ for all $N$.

Theorem 2 (Asymptotic formula of the NML code length). Under Assumption 1 and Assumption 2, the following holds:

  \log \int dx^N f(x^N;\hat{\theta})\,w(\hat{\theta}) = \frac{K}{2} \log \frac{N}{2\pi} + \log \int_{\Theta_\Pi} d\theta \, w(\theta) \sqrt{\det I(\theta)} + o(1).   (17)

Proof: Changing variables $\xi \to \sqrt{N}\xi$ in (13) gives $(2\pi/N)^{K/2}\,g_N(\theta) = (2\pi)^{-K/2} \int d\xi \, \varphi^{(N)}_\theta(\xi)$. The assumptions allow us to apply Lebesgue's dominated convergence theorem to Theorem 1 as follows:

  \lim_{N \to \infty} \left( \frac{2\pi}{N} \right)^{K/2} \int dx^N f(x^N;\hat{\theta})\,w(\hat{\theta}) = \int d\theta \, w(\theta) \lim_{N \to \infty} \left( \frac{2\pi}{N} \right)^{K/2} g_N(\theta) = \int d\theta \, w(\theta) \left( \frac{1}{2\pi} \right)^{K/2} \int d\xi \lim_{N \to \infty} \varphi^{(N)}_\theta(\xi).   (18)

If $\theta \in \operatorname{int}\Theta_\Pi$, then

  \sqrt{N}(\hat{\theta} - \theta) \to \mathcal{N}\left(0, I(\theta)^{-1}\right)   (19)

by the asymptotic normality of the MLE. Hence, by Levy's continuity theorem,

  \lim_{N \to \infty} \varphi^{(N)}_\theta(\xi) = \exp\left( -\frac{1}{2}\,\xi^T I(\theta)^{-1} \xi \right),   (20)

and carrying out the Gaussian integral over $\xi$ completes the proof.

Remark 6. The conclusion of the theorem is the same as that of Rissanen's formula [13]. In contrast to Rissanen's formula, we make no assumptions on the boundedness of the determinant of the Fisher information matrix or on the uniform asymptotic normality of the MLE. Thus, we expect our formula to be applicable even when $\Theta_\Pi$ is not compact and the boundedness of the Fisher information and the uniform asymptotic normality of the MLE are difficult to guarantee.
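As a concrete illustration of (17) (a sketch with our own choice of model and luckiness window, not an example from the paper), the following compares the exact log-LPC, computed from the parameter-space form of Theorem 1 as in the previous sketch, with the right-hand side of (17) for the exponential model with mean $\mu$ and $w = 1\{[\mu_{\min}, \mu_{\max}]\}$; for this model $I(\mu) = 1/\mu^2$, so $\int w(\mu)\sqrt{\det I(\mu)}\,d\mu = \log(\mu_{\max}/\mu_{\min})$.

```python
import numpy as np
from scipy import integrate, stats

mu_min, mu_max = 0.5, 2.0
K = 1  # one-dimensional parameter

def log_lpc_exact(N):
    # Parameter-space form of Theorem 1: g_N(mu) is the density of the MLE
    # (the sample mean, Gamma(N, scale=mu/N)) evaluated at mu.
    val, _ = integrate.quad(lambda mu: stats.gamma.pdf(mu, a=N, scale=mu / N),
                            mu_min, mu_max)
    return np.log(val)

def log_lpc_asymptotic(N):
    # (K/2) log(N / 2 pi) + log integral of w(mu) sqrt(det I(mu)) d mu,
    # which equals log(mu_max / mu_min) for the exponential model.
    return 0.5 * K * np.log(N / (2 * np.pi)) + np.log(np.log(mu_max / mu_min))

for N in (10, 100, 1000):
    print(N, log_lpc_exact(N), log_lpc_asymptotic(N),
          log_lpc_exact(N) - log_lpc_asymptotic(N))
```

The gap is about 0.008 at $N = 10$ and shrinks as $N$ grows, consistent with the $o(1)$ term in (17).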

V. NON-ASYMPTOTIC FORMULA FOR THE EXPONENTIAL FAMILY

First, we fix the notation for the exponential family. Then, we present the non-asymptotic formula of the PC for the exponential family.

A. Exponential Family, Canonical Parameters, and Expectation Parameters

We say that a model is in the exponential family when we can express its density function with canonical parameters $\eta \in H \subset \mathbb{R}^K$, sufficient statistics $u : \mathcal{X} \to \mathbb{R}^K$, and a base measure $h : \mathcal{X} \to [0, +\infty)$ as

  f(x;\eta) = \frac{h(x)}{Z(\eta)} \exp\left( \eta^T u(x) \right).   (21)

Here, the partition function $Z : H \to [0, \infty)$ is defined by $Z(\eta) = \int dx \, h(x) \exp(\eta^T u(x))$. We define the transform from the canonical parameters to the expectation parameters by

  \mu_k(\eta) := \int dx \, u_k(x) \frac{h(x)}{Z(\eta)} \exp\left( \sum_{k'=1}^{K} \eta_{k'} u_{k'}(x) \right), \qquad \mu(\eta) := \left[ \mu_1(\eta), \ldots, \mu_K(\eta) \right]^T,   (22)

and let $\eta(\mu)$ denote its inverse transform, assuming $\mu(\cdot)$ is bijective. We define the MLE with respect to the expectation parameters as

  \hat{\mu} := \operatorname{argmax}_{\mu} \prod_{n=1}^{N} f(x_n; \eta(\mu)).   (23)

We can calculate the PC of a model in the exponential family as follows.

Theorem 3. Let $f(x;\eta)$ be the density function of a model in the exponential family, where $\eta$ denotes its canonical parameter and $\mu$ denotes its expectation parameter. The LPC of $f(x;\eta)$ is expressed as

  \int dx^N f\!\left(x^N; \eta(\hat{\mu})\right) w(\hat{\mu}) = \left( \frac{1}{2\pi} \right)^K \int d\mu \, w(\mu) \int d\xi \, \exp\left( -i\xi^T\mu \right) \frac{Z\!\left( \eta(\mu) + i\xi/N \right)^N}{Z(\eta(\mu))^N}.   (24)

Corollary 1. Let $X_{N,1}, X_{N,2}, \ldots, X_{N,N}$ be a sequence of i.i.d. $K$-dimensional random variables, the characteristic function of each of which is given by $Z(\eta(\mu) + i\xi/N)/Z(\eta(\mu))$, and let $g_{N,\mu}$ be the density function of $\sum_{n=1}^{N} X_{N,n}$. Then the PC is expressed as

  \int dx^N f\!\left(x^N; \eta(\hat{\mu})\right) w(\hat{\mu}) = \int d\mu \, w(\mu) \, g_{N,\mu}(\mu).   (25)

Remark 7. Theorem 3 reduces the original $N$-fold data-space integral to a $K$-fold parameter-space integral. Corollary 1 implies that, if we know the density function of $\sum_{n=1}^{N} X_{N,n}$, whose characteristic function is given in terms of the partition function of the original model, the calculation of the original PC reduces to a single integral, which is often analytically tractable.

Proof: Note that the MLE satisfies

  \nabla_\eta \log Z(\hat{\eta}) = \frac{1}{N} \sum_{n=1}^{N} u(x_n).   (26)

Also note that the maximum likelihood estimator $\hat{\mu}$ with respect to the expectation parameters can be written as $\hat{\mu} = \mu(\hat{\eta})$, and hence

  \hat{\mu}_k = \frac{1}{N} \sum_{n=1}^{N} u_k(x_n).   (27)

We can calculate the $g$-function as follows:

  g_N(\mu) = \left( \frac{1}{2\pi} \right)^K \int d\xi \int dx^N \left[ \prod_{n=1}^{N} \frac{h(x_n)}{Z(\eta)} \exp\left( \eta^T u(x_n) \right) \right] \exp\left( i\xi^T \left( \frac{1}{N} \sum_{n=1}^{N} u(x_n) - \mu \right) \right) = \left( \frac{1}{2\pi} \right)^K \int d\xi \, \exp\left( -i\xi^T\mu \right) \left[ \frac{Z(\eta + i\xi/N)}{Z(\eta)} \right]^N.   (28)

Substituting this into Theorem 1 completes the proof.

B. Examples

In this subsection, we give examples of PC (LPC) calculations using Theorem 3. These examples include the results in [4]. Table I lists the results.

TABLE I. EXPONENTIAL FAMILY MODELS AND THEIR PARAMETRIC COMPLEXITY: density, sufficient statistic, canonical parameter, expectation parameter, partition function, and parametric complexity for the normal distribution with known variance, the normal distribution with known mean, the Laplace distribution with known mean, the gamma distribution with known shape $k$ (including the exponential distribution, $k = 1$), the Weibull distribution with known shape, and the gamma distribution with known scale (including the chi-squared distribution); several of these entries were first derived in [4].

1) Fixed-Variance Type: If the relationship between the canonical parameter and the expectation parameter is given by $\eta = \mu/v$ with a constant $v$, and the partition function is given by $Z(\eta) = D \exp(v\eta^2/2)$ with a constant $D$, then $Z(\eta + i\xi/N)^N / Z(\eta)^N = \exp(i\mu\xi - v\xi^2/(2N))$ is the characteristic function of a normal distribution with mean $\mu$ and variance $v/N$, and we can calculate the LPC as

  \int d\mu \, w(\mu) \, \frac{1}{2\pi} \int d\xi \, \exp\left( -\frac{v\xi^2}{2N} \right) = \sqrt{\frac{N}{2\pi v}} \int d\mu \, w(\mu).   (29)

2) Exponential-Distribution Type: If the relationship between the canonical parameter and the expectation parameter is given by $\eta = -m/\mu$ with a constant $m > 0$, and the partition function is given by $Z(\eta) = D(-\eta)^{-m}$ with a constant $D$, we can calculate the LPC as

  \frac{1}{2\pi} \int_0^{+\infty} d\mu \, w(\mu) \int d\xi \, \exp(-i\xi\mu) \left( \frac{Nm/\mu}{Nm/\mu - i\xi} \right)^{Nm} = \frac{(Nm)^{Nm} \exp(-Nm)}{\Gamma(Nm)} \int_0^{+\infty} d\mu \, \frac{w(\mu)}{\mu}.   (30)

The last equality holds because $\left( \frac{Nm/\mu}{Nm/\mu - i\xi} \right)^{Nm}$ is the characteristic function of the gamma distribution with shape parameter $Nm$ and rate parameter $Nm/\mu$. If we set $w = 1\{[\mu_{\min}, \mu_{\max}]\}$, the PC is equal to

  \left( \frac{Nm}{e} \right)^{Nm} \frac{1}{\Gamma(Nm)} \log \frac{\mu_{\max}}{\mu_{\min}}.

This formula can be applied to distributions including the exponential distribution, the chi-squared distribution, the Laplace distribution with a known mean, the Weibull distribution with a known shape, and the gamma distribution with a known shape.
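As a quick numerical sanity check of the exponential-distribution-type formula (our own sketch, not part of the paper), the snippet below evaluates the closed form above for the exponential distribution ($m = 1$, so $\hat{\mu}$ is the sample mean) with $w = 1\{[\mu_{\min}, \mu_{\max}]\}$, using log-domain arithmetic.

```python
import numpy as np
from scipy.special import gammaln

mu_min, mu_max = 0.5, 2.0

def log_pc_closed_form(N, m=1.0):
    # log of (N m / e)^(N m) / Gamma(N m) * log(mu_max / mu_min)
    a = N * m
    return a * np.log(a) - a - gammaln(a) + np.log(np.log(mu_max / mu_min))

for N in (10, 100, 1000):
    print(N, log_pc_closed_form(N))
```

For $N = 10$ this gives about 0.55, matching the parameter-space integral of Theorem 1 computed in the earlier sketches for the same window.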

3) Chi-Squared Distribution Type: We discuss the gamma distribution with a known scale $\theta$; the result includes the chi-squared distribution (set $\theta = 2$). Here the sufficient statistic is $u(x) = \log x$, the canonical parameter is $\eta = k - 1 \in (-1, +\infty)$ for shape $k$, the partition function is $Z(\eta) = \Gamma(\eta+1)\theta^{\eta+1}$, and the expectation parameter is $\mu = \psi(\eta+1) + \log\theta$, where $\psi$ denotes the digamma function. Writing $s = \mu - \log\theta$ and letting $\psi^{-1}$ denote the inverse of the digamma function, we can calculate the LPC as

  \int ds \, w(\log\theta + s) \, \frac{1}{2\pi} \int d\xi \left[ \frac{\Gamma\!\left(\psi^{-1}(s) + i\xi/N\right)}{\Gamma\!\left(\psi^{-1}(s)\right)} \right]^{N} \exp(-i\xi s) = \int ds \, w(\log\theta + s) \, G_N\!\left(s; \psi^{-1}(s), \tfrac{1}{N}\right).   (31)

Here, $G_N(\,\cdot\,; p, q)$ is the probability density function of the sum of $N$ i.i.d. random variables, the density of each of which is given by

  G(t; p, q) = \frac{1}{q\,\Gamma(p)} \exp\left( \frac{p t}{q} - e^{t/q} \right).   (32)

VI. CONCLUSION

In this paper, we derived a non-asymptotic form of the NML code length and clarified its relationship to the asymptotic expansion. Moreover, we presented a non-asymptotic calculation formula of the NML code length for exponential family models. This formula can be applied whenever the partition function is known. In addition, if we know the closed form of a distribution whose characteristic function can be expressed using the partition function, the calculation reduces to a single integral, which is often analytically tractable.

REFERENCES

[1] A. R. Barron and T. M. Cover. Minimum complexity density estimation. IEEE Transactions on Information Theory, 37(4):1034-1054, 1991.
[2] P. D. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.
[3] P. D. Grünwald and N. A. Mehta. A tight excess risk bound via a unified PAC-Bayesian-Rademacher-Shtarkov-MDL complexity. CoRR, 2017.
[4] S. Hirai and K. Yamanishi. Efficient computation of normalized maximum likelihood codes for Gaussian mixture models with its applications to clustering. IEEE Transactions on Information Theory, 59(11), 2013.
[5] P. Kontkanen and P. Myllymäki. A linear-time algorithm for computing the multinomial stochastic complexity. Information Processing Letters, 103(6):227-233, 2007.
[6] P. Kontkanen and P. Myllymäki. MDL histogram density estimation. In International Conference on Artificial Intelligence and Statistics, 2007.
[7] P. Kontkanen and P. Myllymäki. An empirical comparison of NML clustering algorithms. 2008.
[8] L. G. Kraft. A device for quantizing, grouping, and coding amplitude-modulated pulses. Thesis, Massachusetts Institute of Technology, 1949.
[9] B. McMillan. Two inequalities implied by unique decipherability. IRE Transactions on Information Theory, 2(4):115-116, 1956.
[10] R. Nishii. Maximum likelihood principle and model selection when the true model is unspecified. Journal of Multivariate Analysis, 27:392-403, 1988.
[11] J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, 1989.
[12] J. Rissanen. Stochastic complexity in learning. In Computational Learning Theory, Springer, 1995.
[13] J. Rissanen. Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42(1):40-47, 1996.
[14] J. Rissanen. Stochastic complexity in learning. Journal of Computer and System Sciences, 55(1):89-95, 1997.
[15] J. Rissanen. MDL denoising. IEEE Transactions on Information Theory, 46(7):2537-2543, 2000.
[16] J. Rissanen, T. P. Speed, and B. Yu. Density estimation by stochastic complexity. IEEE Transactions on Information Theory, 38(2):315-323, 1992.
[17] T. Roos, T. Silander, P. Kontkanen, and P. Myllymäki. Bayesian network structure learning using factorized NML universal models. In Information Theory and Applications Workshop, IEEE, 2008.
[18] C. E. Shannon. A mathematical theory of communication, part I, part II. Bell System Technical Journal, 27:379-423, 623-656, 1948.
[19] Y. M. Shtarkov. Universal sequential coding of single messages. Problemy Peredachi Informatsii, 23(3):3-17, 1987.
[20] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
[21] K. Yamanishi. A learning criterion for stochastic rules. Machine Learning, 9:165-203, 1992.
