Asymptotically Minimax Regret for Models with Hidden Variables


Jun'ichi Takeuchi, Department of Informatics, Kyushu University, Fukuoka, Japan
Andrew R. Barron, Department of Statistics, Yale University, New Haven, Connecticut, USA

Abstract—We study the problems of data compression, gambling and prediction of a string $x^n = x_1 x_2 \ldots x_n$ from an alphabet $\mathcal{X}$, in terms of regret with respect to models with hidden variables, including general mixture families. When the target class is a non-exponential family, a modification of Jeffreys prior which has measure outside the given family of densities was introduced to achieve the minimax regret [11], under certain regularity conditions. In this paper, we show that models with hidden variables satisfy those regularity conditions when the hidden variables model is an exponential family. In particular, for the general mixture family we do not have to restrict the class of data strings so that the MLE is in the interior of the parameter space.

I. INTRODUCTION

We study the problems of data compression, gambling and prediction of a string $x^n = x_1 x_2 \ldots x_n$ from a certain alphabet $\mathcal{X}$ (not restricted to be discrete), in terms of regret with respect to models with hidden variables, including general mixture families. In particular, we evaluate the regret of Bayes mixture densities and show that it asymptotically achieves the minimax value when variants of Jeffreys prior are used. This extends the results for the mixture family addressed in Takeuchi & Barron [11], [13].

This paper's concern is the regret of a coding or prediction strategy, defined as the difference between the loss incurred and the loss of an ideal coding or prediction strategy for each string. A coding scheme for strings of length $n$ is equivalent to a probability mass function $q(x^n)$ on $\mathcal{X}^n$. We can also use $q$ for prediction and gambling; that is, its conditionals $q(x_{i+1}|x^i)$ provide a distribution for the coding or prediction of the next symbol given the past. The minimax regret with target class (of probability densities) $S = \{p(\cdot|\theta) : \theta \in \Theta\}$ and a set of sequences $W_n \subseteq \mathcal{X}^n$, denoted by $r(W_n)$, is defined as
$$r(W_n) = \inf_q \sup_{x^n \in W_n} \Bigl( \log\frac{1}{q(x^n)} - \log\frac{1}{p(x^n|\hat\theta(x^n))} \Bigr),$$
where $\hat\theta = \hat\theta(x^n)$ is the maximum likelihood estimate of $\theta$ given $x^n$. The regret $\log(1/q(x^n)) - \log(1/p(x^n|\hat\theta))$ in the data compression context is also called the (pointwise) redundancy: the difference between the code length based on $q$ and the minimum code length $\log(1/p(x^n|\theta))$ achieved by distributions in the family.

We give a brief review of the known results for this problem. When $S$ is the class of discrete memoryless sources, Xie and Barron [18] proved that the minimax regret asymptotically equals $(d/2)\log(n/2\pi) + \log C_J + o(1)$, where $d$ equals the size of the alphabet minus one and $C_J$ is the integral of the square root of the determinant of the Fisher information matrix over the whole parameter space. For the stationary Markov model with finite alphabet, the analogous result is known [16].
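To make these quantities concrete, the following Python sketch (an illustration, not part of the paper) evaluates the pointwise regret of the plain Jeffreys mixture for the Bernoulli family, where $d = 1$ and $C_J = \int_0^1 (\theta(1-\theta))^{-1/2}\,d\theta = \pi$, and compares it with the asymptotic value $(d/2)\log(n/2\pi) + \log C_J$. The gap at the boundary strings ($k = 0$ or $k = n$) is one motivation for the modified priors discussed later.

```python
import numpy as np
from scipy.special import betaln

def jeffreys_mixture_log(k, n):
    # log m_J(x^n) for a Bernoulli string with k ones:
    # m_J(x^n) = Beta(k + 1/2, n - k + 1/2) / Beta(1/2, 1/2)
    return betaln(k + 0.5, n - k + 0.5) - betaln(0.5, 0.5)

def ml_log(k, n):
    # log p(x^n | theta_hat) with theta_hat = k / n  (convention 0 log 0 = 0)
    out = 0.0
    if k > 0:
        out += k * np.log(k / n)
    if k < n:
        out += (n - k) * np.log((n - k) / n)
    return out

n = 10_000
regret = [ml_log(k, n) - jeffreys_mixture_log(k, n) for k in range(n + 1)]
d, C_J = 1, np.pi
asymptotic = 0.5 * d * np.log(n / (2 * np.pi)) + np.log(C_J)
print("max regret          :", max(regret))          # attained at k = 0 or k = n
print("regret at k = n/2   :", regret[n // 2])
print("asymptotic minimax  :", asymptotic)
```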
Next, consider the set of strings restricted as $K_n = \mathcal{X}^n(K) = \{x^n : \hat\theta \in K\}$, where $K$ is a certain nice subset of $\Theta$ (satisfying $\bar K = \overline{K^\circ}$). For multi-dimensional exponential families, variants of Jeffreys mixture are minimax. For a curved family, however, the ordinary Jeffreys mixture is not minimax, even if $K$ is a compact set included in the interior of $\Theta$. Nevertheless, the minimax result can be obtained by using a sequence of prior measures whose supports are the exponential family in which the curved family is embedded, rather than the curved family itself. Remarkably, this idea is applicable to general smooth families through an enlargement of the original family by exponential tilting. The idea of this enlargement in addressing minimax regret originates in preliminary form in [10], [3], as informally discussed in [2]. The literature [15] gives a discussion in the context of Amari's information geometry [1]. Recently, a more formal statement of the method was given in [11], where an example of the mixture family case was argued. In this paper, we extend it to the case of models with hidden variables, assuming that the hidden variables model is an exponential family. Further, we show that the asymptotic minimax procedure for the full set of strings can be obtained by modifying the procedure in [11].

II. PRELIMINARIES

Let $(\mathcal{X}, \mathcal{F}, \nu)$ be a measurable space. Let $S = \{p(\cdot|\theta) : \theta \in \Theta\}$ denote a parametric family of probability densities over $\mathcal{X}$ with respect to $\nu$. We let $p(x^n|\theta)$ denote $\prod_{i=1}^n p(x_i|\theta)$ and $\nu(dx^n)$ denote $\prod_{i=1}^n \nu(dx_i)$; that is, we are treating models for independently and identically distributed (i.i.d.) random variables. We let $P_\theta$ denote the distribution with density $p(\cdot|\theta)$ and $E_\theta$ denote expectation with respect to $P_\theta$. Assume that $\Theta \subseteq \mathbb{R}^d$ and $\bar\Theta = \overline{\Theta^\circ}$ hold; that is, the closure of $\Theta$ matches the closure of its interior. Here $\bar A$ and $A^\circ$ respectively denote the closure and the interior of $A \subseteq \mathbb{R}^k$.

We introduce the empirical Fisher information given $x^n$ and the Fisher information:
$$\hat J_{ij}(\theta) = \hat J_{ij}(\theta, x^n) = -\frac{1}{n}\,\frac{\partial^2 \log p(x^n|\theta)}{\partial\theta_i\,\partial\theta_j}, \qquad J_{ij}(\theta) = E_\theta\,\hat J_{ij}(\theta, x).$$

The exponential family is defined as follows [4], [1].

Definition 1 (Exponential Family): Given an $\mathcal F$-measurable function $T : \mathcal X \to \mathbb R^d$, define
$$\Theta \doteq \Bigl\{\theta \in \mathbb R^d : \int_{\mathcal X} \exp(\theta\cdot T(x))\,\nu(dx) < \infty\Bigr\},$$
where $\theta\cdot T(x)$ denotes the inner product of $\theta$ and $T(x)$. Define a function $\psi$ and a probability density $p$ on $\mathcal X$ with respect to $\nu$ by $\psi(\theta) \doteq \log\int_{\mathcal X}\exp(\theta\cdot T(x))\,\nu(dx)$ and $p(x|\theta) \doteq \exp(\theta\cdot T(x) - \psi(\theta))$. We refer to the set $\{p(x|\theta) : \theta \in K \subseteq \Theta\}$ as an exponential family of densities. When $\Theta$ is an open set, $S(\Theta)$ is said to be a regular exponential family. Many popular exponential families are regular.

For exponential families, the entries of the Fisher information are given by $\partial^2\psi(\theta)/\partial\theta_i\partial\theta_j$. For regular exponential families, define the expectation parameter $\eta$ as $\eta(\theta) = E_\theta(T(x))$. It is known that the map $\theta \mapsto \eta$ is one-to-one and analytic on $\Theta^\circ$. Let $H$ denote $\eta(\Theta^\circ)$ and $W$ denote the closure of the convex hull of $T(\mathcal X)$; then $H = W^\circ$ holds when $S$ is a steep exponential family ($E_\theta|T(x)| = \infty$ for all $\theta \in \Theta\setminus\Theta^\circ$). Also, $\eta_i = \partial\psi(\theta)/\partial\theta_i$ holds. Note that $p(x^n|\theta) = \exp(n(\theta\cdot\bar T - \psi(\theta)))$, where $\bar T = \sum_{t=1}^n T(x_t)/n$. (Here $x_t$ denotes the $t$-th element of the sequence $x^n$.) It is known that the maximum likelihood estimate of $\eta$ given $x^n$ equals $\bar T$.

The multinomial model is an important example of a regular exponential family.

Example 1 (multinomial model): Let $\mathcal X = \{0, 1, \ldots, d\}$, $\nu(\{x\}) = 1$ for $x \in \mathcal X$, and $T(x) = (\delta_{1x}, \delta_{2x}, \ldots, \delta_{dx})$, where $\delta_{ij}$ is the Kronecker delta. Define $p(x|\theta) = \exp(\theta\cdot T(x) - \psi(\theta)) = e^{\theta_x}/(1 + \sum_{x'=1}^d e^{\theta_{x'}})$, where $\theta = (\theta_1, \theta_2, \ldots, \theta_d) \in \Theta = \mathbb R^d$ (with the convention $\theta_0 = 0$). Then we have $\psi(\theta) = \log(1 + \sum_{x'=1}^d e^{\theta_{x'}})$, which is finite for all $\theta \in \Theta$. Note that $\eta_x = p(x|\theta)$ and $\theta_x = \log(\eta_x/(1 - \sum_{x'=1}^d \eta_{x'}))$.

For a subset $K$ of $\Theta$, we let $C_J(K) = \int_K |J(\theta)|^{1/2}\,d\theta$. The Jeffreys prior [5] over $K$, denoted by $w_K(\theta)$, is defined as $w_K(\theta) = |J(\theta)|^{1/2}/C_J(K)$. We define the Jeffreys mixture for $K$, denoted by $m_K$, as $m_K(x^n) = \int_K p(x^n|\theta)\,w_K(\theta)\,d\theta$.

Definition 2 (Model with Hidden Variables): Let $q(y|\theta)$ be a density of a $d$-dimensional exponential family over $\mathcal Y$. Define a class $S$ of probability density functions over $\mathcal X$ by
$$S = \Bigl\{\, p(x|\theta) = \int \kappa(x|y)\,q(y|\theta)\,\nu_y(dy) \;:\; \theta \in \Theta \Bigr\}, \qquad (1)$$
where $\kappa(x|y)$ is a fixed conditional probability density function of $x$ given $y$, and $\nu_y$ is the reference measure for $q(y|\theta)$. If $q(y|\theta)$ is the multinomial model over $\mathcal Y = \{0, 1, \ldots, d\}$, then $p(x|\theta)$ forms a mixture family:
$$p(x|\theta) = \sum_{y=1}^d \eta_y\,\kappa(x|y) + \Bigl(1 - \sum_{y=1}^d \eta_y\Bigr)\kappa(x|0).$$
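To make Definition 2 concrete, here is a minimal Python sketch (not from the paper) of a one-parameter mixture family: the hidden variable is Bernoulli and the kernel $\kappa$ is taken, purely for illustration, to be Gaussian. It samples from the model and computes the empirical Fisher information $\hat J(\theta, x^n)$ by numerical differentiation of the log-likelihood.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
MU = (-1.0, 2.0)  # assumed kernel means: kappa(x|y) = N(x; MU[y], 1), y in {0, 1}

def mixture_density(x, theta):
    # p(x|theta) = theta * kappa(x|1) + (1 - theta) * kappa(x|0)   (d = 1)
    return theta * norm.pdf(x, MU[1], 1.0) + (1 - theta) * norm.pdf(x, MU[0], 1.0)

def sample(theta, n):
    # draw y from q(y|theta), then x from kappa(.|y)
    y = rng.random(n) < theta
    return rng.normal(np.where(y, MU[1], MU[0]), 1.0)

def empirical_fisher(theta, x, h=1e-4):
    # J_hat(theta, x^n) = -(1/n) d^2/dtheta^2 log p(x^n|theta), by central differences
    ll = lambda t: np.sum(np.log(mixture_density(x, t)))
    return -(ll(theta + h) - 2 * ll(theta) + ll(theta - h)) / (h ** 2 * len(x))

theta0 = 0.3
x = sample(theta0, n=5_000)
print("J_hat(theta0, x^n) ~", empirical_fisher(theta0, x))
```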
Lower Bound on Minimax Regret

A lower bound on the minimax regret with the target class being a general smooth family is shown as follows. We employ the assumptions described below.

Assumption 1: The density $p(x|\theta)$ is twice continuously differentiable in $\theta$ for all $x$, and there is a function $\delta(\theta) > 0$ such that for each $i, j$, $E_\theta \sup_{\theta' : |\theta'-\theta| \le \delta(\theta)} |\hat J_{ij}(\theta', x)|^2$ is finite and continuous as a function of $\theta$.

Assumption 2: The Fisher information $J(\theta)$ is continuous and coincides with the matrix whose $(i,j)$-entry is
$$E_\theta\Bigl[\frac{\partial\log p(x|\theta)}{\partial\theta_i}\,\frac{\partial\log p(x|\theta)}{\partial\theta_j}\Bigr].$$

Assumption 3: For all $\theta, \theta' \in \Theta$, the Kullback–Leibler divergence $D(\theta\|\theta')$ is finite, and for every $\varepsilon > 0$, $\inf_{(\theta,\theta') : |\theta-\theta'| > \varepsilon} D(\theta\|\theta') > 0$ holds.

Assumption 4: For every compact set $K \subset \Theta$, the MLE is uniformly consistent in $K$:
$$\sup_{\theta\in K} P_\theta\bigl(|\hat\theta(x^n) - \theta| > \varepsilon\bigr) = o(1/\log n).$$

Remark: These four assumptions hold in appropriately defined models with hidden variables. We have the following.

Theorem 1: Let $S = \{p(\cdot|\theta) : \theta \in \Theta\}$ be a $d$-dimensional family of probability densities. Suppose that Assumptions 1–4 hold for $S$, and let $K$ be an arbitrary subset of $\Theta$ satisfying $C_J(K) < \infty$ and $\bar K = \overline{K^\circ}$. Then the following holds:
$$\liminf_{n\to\infty}\Bigl(r(K_n) - \frac{d}{2}\log\frac{n}{2\pi}\Bigr) \ge \log C_J(K).$$

Finally in this section, we state the following useful inequalities for the model with hidden variables.

Lemma 1: Given a data string $x^n$, let $\hat\theta$ denote the MLE for a model with hidden variables $p(x^n|\theta)$ defined by (1). Then the following hold for all $x^n \in \mathcal X^n$:
$$\forall\theta\in\Theta,\quad \frac{1}{n}\log\frac{p(x^n|\hat\theta)}{p(x^n|\theta)} \le D(q(\cdot|\hat\theta)\,\|\,q(\cdot|\theta)), \qquad (2)$$
$$\forall\theta\in\Theta,\quad \hat J(\theta, x^n) \preceq G(\theta), \qquad (3)$$
where $G(\theta)$ is the Fisher information of $\theta$ for $q(y|\theta)$, and $D(q(\cdot|\hat\theta)\|q(\cdot|\theta))$ denotes the Kullback–Leibler divergence from $q(\cdot|\hat\theta)$ to $q(\cdot|\theta)$. In particular, when $q(y|\theta)$ is the multinomial model, the following holds:
$$\frac{p(x^n|\hat\theta)}{p(x^n|\theta)} \le \exp\bigl(nD(q(\cdot|\hat\theta)\|q(\cdot|\theta))\bigr) = \prod_{y\in\mathcal Y}\frac{\hat\eta_y^{\,n\hat\eta_y}}{\eta_y^{\,n\hat\eta_y}}, \qquad (4)$$
where $\eta_y = q(y|\theta)$ and $\hat\eta_y = q(y|\hat\theta)$.

This lemma can be shown by a convexity argument. We first proved it for the mixture family case as in [12]. Then Hiroshi Nagaoka pointed out that (2) holds for this case and gave a proof from the viewpoint of information geometry. We gave an extended statement and a different proof; see [14] for the details.
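As a quick numerical sanity check of inequality (2) (again an illustration under assumed ingredients, not part of the paper), the sketch below uses a Bernoulli hidden variable with Gaussian kernels, finds the MLE of the mixture weight by a one-dimensional search, and compares the normalized log-likelihood ratio with the hidden-variable divergence $D(q(\cdot|\hat\theta)\|q(\cdot|\theta))$ on a grid of $\theta$.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)
MU = (-1.0, 2.0)  # assumed kernels: kappa(x|y) = N(x; MU[y], 1)

def log_lik(theta, x):
    return np.sum(np.log(theta * norm.pdf(x, MU[1], 1) + (1 - theta) * norm.pdf(x, MU[0], 1)))

def bernoulli_kl(a, b):
    # D(q(.|a) || q(.|b)) for the hidden Bernoulli variable
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

# data drawn from the model with true weight 0.3
y = rng.random(2_000) < 0.3
x = rng.normal(np.where(y, MU[1], MU[0]), 1.0)

theta_hat = minimize_scalar(lambda t: -log_lik(t, x),
                            bounds=(1e-6, 1 - 1e-6), method="bounded").x
n = len(x)
for theta in (0.05, 0.2, 0.5, 0.9):
    lhs = (log_lik(theta_hat, x) - log_lik(theta, x)) / n   # (1/n) log p(x^n|theta_hat)/p(x^n|theta)
    rhs = bernoulli_kl(theta_hat, theta)                    # upper bound from Lemma 1, eq. (2)
    print(f"theta={theta:.2f}  lhs={lhs:.4f}  <=  D={rhs:.4f}  ok={lhs <= rhs + 1e-12}")
```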

III. MINIMAX BAYES FOR RESTRICTED SETS

We discuss the minimax strategy under the condition that the data strings are restricted so that the maximum likelihood estimate lies in a compact set interior to the parameter space.

A. Exponential Families

For exponential families it is known that a sequence of Jeffreys mixtures achieves the minimax regret asymptotically [18], [9], [10], [16]. For the multinomial and Finite State Machine models this holds for the full parameter set $K = \Theta$, whereas for general exponential families these facts are proven provided that $K$ is a compact subset included in $\Theta^\circ$. We briefly review the outline of the proof for that case.

Let $\{K_n\}$ be a sequence of subsets of $\Theta$ such that $K_n^\circ \supset K$, and suppose that $K_n$ shrinks to $K$ as $n\to\infty$. Let $m_{J,n}$ denote the Jeffreys mixture for $K_n$. If the rate of that shrinkage is sufficiently slow, then we have
$$\log\frac{p(x^n|\hat\theta)}{m_{J,n}(x^n)} = \frac{d}{2}\log\frac{n}{2\pi} + \log C_J(K) + o(1), \qquad (5)$$
where the remainder $o(1)$ tends to zero uniformly over all sequences with MLE in $K$. This implies that the sequence $\{m_{J,n}\}$ is asymptotically minimax. It is verified by the following asymptotic formula, which holds uniformly over $K_n$ by Laplace integration:
$$\frac{m_{J,n}(x^n)}{p(x^n|\hat\theta)} \sim \frac{|J(\hat\theta)|^{1/2}}{C_J(K)}\cdot\frac{(2\pi)^{d/2}}{|\hat J(\hat\theta, x^n)|^{1/2}\,n^{d/2}}.$$
When $S$ is an exponential family, $\hat J(\hat\theta, x^n) = J(\hat\theta)$ holds. Hence the above expression asymptotically matches the minimax value of the regret mentioned in the previous section.
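The Laplace approximation above is easy to check numerically. The following sketch (illustrative only) compares the exact ratio $m_J(x^n)/p(x^n|\hat\theta)$ with $(2\pi/n)^{d/2}/C_J$ for the Bernoulli family with $K = \Theta = (0,1)$, where $\hat J = J$ and $C_J = \pi$; the agreement is close for interior $\hat\theta$ but not near the boundary.

```python
import numpy as np
from scipy.special import betaln

def exact_ratio(k, n):
    # m_J(x^n) / p(x^n | theta_hat) for a Bernoulli string with k ones (0 < k < n)
    log_mix = betaln(k + 0.5, n - k + 0.5) - betaln(0.5, 0.5)
    theta = k / n
    log_ml = k * np.log(theta) + (n - k) * np.log(1 - theta)
    return np.exp(log_mix - log_ml)

n = 10_000
laplace = np.sqrt(2 * np.pi / n) / np.pi   # (2*pi/n)^{d/2} / C_J, since |J|^{1/2}/|J_hat|^{1/2} = 1
for k in (5, 100, 2_500, 5_000):
    print(f"theta_hat={k/n:.4f}   exact={exact_ratio(k, n):.3e}   Laplace={laplace:.3e}")
```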
B. General Smooth Families

When the target class is not an exponential family, we form an enlargement of $S$ by exponential tilting using linear combinations of the entries of $\hat J(\theta) - J(\theta)$. Actually, we introduce its normalized version
$$V(x^n|\theta) = J(\theta)^{-1/2}\,\hat J(\theta)\,J(\theta)^{-1/2} - I. \qquad (6)$$
Let $B = (-b/2, b/2)^{d\times d}$ for some $b > 0$. The enlargement is formed as
$$p(x^n|u) = p(x^n|\theta)\,e^{\,n(\beta\cdot V(x^n|\theta) - \psi(\theta,\beta))}, \qquad (7)$$
where $u$ denotes the pair $(\theta, \beta)$, $\beta$ is a matrix in $B$, $V(x^n|\theta)\cdot\beta$ denotes $\mathrm{Tr}(V(x^n|\theta)\beta^T) = \sum_{ij} V_{ij}(x^n|\theta)\beta_{ij}$, and $\psi(\theta,\beta)$ is the logarithm of the required normalization factor, defined as
$$\psi(\theta,\beta) \stackrel{\rm def}{=} \log\int p(x|\theta)\exp(V(x|\theta)\cdot\beta)\,\nu(dx).$$
Then we define the enlarged family $\bar S = \{p(\cdot|u) : u \in \Theta\times B\}$. In [11], we employed $\hat J(\theta) - J(\theta)$ itself, which was sufficient for the cases with a restricted set of strings. To prove the minimax bound without the restriction on the set of strings, we introduce (6). We have $\partial\psi(\theta,\beta)/\partial\beta_{ij} = E_{\theta,\beta}V_{ij}(x|\theta)$ and
$$\frac{\partial^2\psi(\theta,\beta)}{\partial\beta_{ij}\,\partial\beta_{kl}} = E_u\bigl[V_{ij}(x|\theta)V_{kl}(x|\theta)\bigr] - E_u\bigl[V_{ij}(x|\theta)\bigr]E_u\bigl[V_{kl}(x|\theta)\bigr].$$
The latter is the covariance between $V_{ij}(x|\theta)$ and $V_{kl}(x|\theta)$. Let $\mathrm{Cov}(\theta,\beta)$ denote the matrix whose $(ij,kl)$-entry is $\partial^2\psi(\theta,\beta)/\partial\beta_{ij}\partial\beta_{kl}$; this is a covariance matrix of $V(x|\theta)$. Let $\lambda_K = \lambda_1(K)$ denote the maximum of the largest eigenvalue of $\mathrm{Cov}(\theta,\beta)$ over $\theta \in K$ and $\beta \in B$.

Traditionally, enlargements of families as in (7) arise in the local asymptotic expansion of likelihood ratios used in demonstrations of local asymptotic normality [6]. In Amari's information geometry [1] it is a local exponential family bundle.

When the target class is not an exponential family and $K \subset \Theta^\circ$, we employ the mixture
$$m_n(x^n) = (1 - n^{-r})\,m_{J,n}(x^n) + n^{-r}\int p(x^n|u)\,w(u)\,du. \qquad (8)$$
Specifically, the prior $w(u)$ for $\bar S$ is defined as the direct product of the Jeffreys prior on $S$ and the uniform prior on $B$. In the analysis, consideration of $\beta$ in a neighborhood of a small multiple of $V(x^n|\theta)$ is sufficient to accomplish our objectives under the assumptions addressed below. Here, we define two neighborhoods of a point $\theta'$:
$$B_\varepsilon(\theta') = \{\theta : (\theta-\theta')^T J(\theta')(\theta-\theta') \le \varepsilon^2\}, \qquad (9)$$
$$\hat B_\varepsilon(\theta') = \{\theta : (\theta-\theta')^T \hat J(\theta')(\theta-\theta') \le \varepsilon^2\}. \qquad (10)$$

Assumption 5: For the prior density function we use, we assume the following semi-continuity: there exists a positive number $\kappa_w = \kappa_w(K)$ such that $w(\theta') \ge (1 - \kappa_w\varepsilon)\,w(\theta)$ holds for all small $\varepsilon > 0$, for all $\theta \in K$, and for all $\theta'$ in $B_\varepsilon(\theta)$.

Define a set of good sequences $G_{n,\delta}$ and a set of not-good sequences $G^c_{n,\delta}$, so that $G^c_{n,\delta} = K_n\setminus G_{n,\delta}$:
$$G_{n,\delta} = \{x^n : \|V(x^n|\hat\theta)\|_s \le \delta \text{ and } \hat\theta \in K\},$$
$$G^c_{n,\delta} = \{x^n : \|V(x^n|\hat\theta)\|_s > \delta \text{ and } \hat\theta \in K\},$$
where $\|A\|_s$ for a symmetric matrix $A \in \mathbb R^{d\times d}$ denotes the spectral norm $\|A\|_s = \max_{z:|z|=1}|z^TAz|$. Compared with the Frobenius norm $\|A\| = (\mathrm{Tr}(AA^T))^{1/2}$, we have $\|A\|/\sqrt d \le \|A\|_s \le \|A\|$.

Assumption 6: We assume a kind of equi-semicontinuity for $\hat J(\theta)$: there exists a positive number $\kappa_J = \kappa_J(K)$ such that for all small $\varepsilon > 0$, for a certain $\delta_0 > 0$ that is independent of $K$, for all $x^n$ in $G_{n,\delta_0}$, for all $\theta \in \hat B_\varepsilon(\hat\theta)$, and for all $\tilde\theta$ on the segment joining $\theta$ and $\hat\theta$,
$$(\theta-\hat\theta)^T\hat J(\tilde\theta)(\theta-\hat\theta) \le (1+\kappa_J\varepsilon)\,(\theta-\hat\theta)^T\hat J(\hat\theta)(\theta-\hat\theta).$$
This is used to control the Laplace integration for $m_n$ of our strategy on the good sequences. In fact, we can prove the following lemma, where $\Phi$ is the probability measure of the standard normal distribution over $\mathbb R^d$.

Lemma 2: Fix a compact set $K'$ such that $K'\subset\Theta^\circ$ and $K\subset K'^\circ$. Suppose that Assumptions 1–3, 5, and 6 hold and that $\hat B_\varepsilon(\hat\theta)\subset K'$. Then, for all $\delta < \delta_0$, the following holds:
$$\inf_{x^n\in G_{n,\delta}}\frac{m_{K'}(x^n)}{p(x^n|\hat\theta)} \ge \frac{(1-\kappa_w\varepsilon)^{d/2}\,\Phi(\sqrt n\,B_\varepsilon(0))\,(2\pi)^{d/2}}{(1+\kappa_J\varepsilon)^{d/2}\,(1+\delta)^{d/2}\,C_J(K')\,n^{d/2}}.$$
The proof is omitted.
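The tilted densities in (7) are ordinary probability densities once $\psi(\theta,\beta)$ is included. The sketch below (illustrative, with an assumed Gaussian kernel and $d = 1$, so that $V$ is scalar) builds the per-sample tilted density $p(x|\theta)\exp(\beta V(x|\theta) - \psi(\theta,\beta))$ on a grid and checks that it integrates to one; the per-sample empirical Fisher information used here is $\hat J(\theta,x) = (\kappa(x|1)-\kappa(x|0))^2/p(x|\theta)^2$, the mixture-family expression that also appears in Section IV.

```python
import numpy as np
from scipy.stats import norm

MU = (-1.0, 2.0)           # assumed kernels kappa(x|y) = N(x; MU[y], 1)
theta, beta = 0.3, 0.2     # mixture weight and tilting parameter (d = 1, V scalar)

grid = np.linspace(-12, 15, 200_001)
dx = grid[1] - grid[0]
k0, k1 = norm.pdf(grid, MU[0], 1), norm.pdf(grid, MU[1], 1)
p = theta * k1 + (1 - theta) * k0                      # p(x|theta)
J_hat = (k1 - k0) ** 2 / p ** 2                        # per-sample empirical Fisher information
J = np.sum(J_hat * p) * dx                             # Fisher information J(theta) by quadrature
V = J_hat / J - 1.0                                    # normalized version, eq. (6), scalar case
psi = np.log(np.sum(p * np.exp(beta * V)) * dx)        # normalizer psi(theta, beta) from eq. (7)
tilted = p * np.exp(beta * V - psi)                    # enlarged-family density p(x|u), u = (theta, beta)

print("integral of p      :", np.sum(p) * dx)
print("integral of tilted :", np.sum(tilted) * dx)
print("J(theta)           :", J)
```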

Assumption 7: For any $\delta > 0$, there exists an $\varepsilon > 0$ such that for all $x^n$ in $G^c_{n,\delta}$ and for all $\theta$ in $B_\varepsilon(\hat\theta)$, $\|V(x^n|\theta)\|_s \ge \zeta\delta$ holds, where $\zeta$ is a certain constant.

Assumption 8: There exists an $\varepsilon > 0$ such that for all $x^n$ in $K_n$ and for all $\theta \in B_\varepsilon(\hat\theta)$, $2(\hat J(\hat\theta) + J(\hat\theta)) - \hat J(\theta)$ is positive semidefinite.

These two assumptions are used to control the second term of our strategy on the not-good sequences. They require that $\hat J(\theta)$ not change too rapidly in the region of integration. We can prove the following lemma.

Lemma 3: Under Assumptions 1–3, 5, 7, and 8, the following holds:
$$\inf_{x^n\in G^c_{n,\delta}}\frac{\int p(x^n|u)\,w(u)\,du}{p(x^n|\hat\theta)} \ge (\xi_K)^{d^2}\,\delta^{d^2}\,n^{-d}\exp(A\delta^2\xi_K n),$$
where $A$ is a positive constant independent of $K$ and $\xi_K = \min\{n/\lambda_K, 1\}$.

Remark: Here, $\lambda_K/n$ is the maximum over all $u \in K\times B$ of the largest eigenvalue of the covariance matrix of $V(x^n|\theta)$ with respect to $p(x^n|u)$. If $\lambda_K/n$ is large, we cannot expect an appropriate likelihood gain.

From Lemmas 2 and 3, we obtain the following theorem, which provides an upper bound matching that of [11].

Theorem 2: When a target class satisfies Assumptions 1–3 and 5–8, the strategy $m_n$ defined by (8), with $r$ appropriately chosen, asymptotically achieves the minimax regret, i.e.,
$$\max_{x^n\in K_n}\log\frac{p(x^n|\hat\theta)}{m_n(x^n)} \le \frac d2\log\frac n{2\pi} + \log C_J(K) + o(1).$$

C. Models with Hidden Variables

For models with hidden variables, we can show the following equalities:
$$\hat J_{ij}(\theta, x^n) = G_{ij}(\theta) - \tilde E_\theta(\bar T_i - \bar t_i)(\bar T_j - \bar t_j),$$
$$\frac{\partial\hat J_{ij}(\theta, x^n)}{\partial\theta_k} = \frac{\partial G_{ij}(\theta)}{\partial\theta_k} - \tilde E_\theta(\bar T_i - \bar t_i)(\bar T_j - \bar t_j)(\bar T_k - \bar t_k),$$
where $\tilde E_\theta$ denotes the expectation with respect to the posterior distribution of $T^n = T(y_1)\ldots T(y_n)$ given $x^n$, $\bar T = \sum_{t=1}^n T(y_t)/n$, and $\bar t = \tilde E_\theta\bar T$. Here we assume that the range of $T(y)$ is bounded. Then, from the above equations, when $\hat\theta$ is restricted to a set $K$ interior to $\Theta$, $\partial\hat J(\theta, x^n)/\partial\theta$ is uniformly bounded for all $x^n$. This means that $\hat J(\theta, x^n)$ is equicontinuous in $\theta$, uniformly in $x^n$, and Assumptions 5–8 hold.

IV. MINIMAX STRATEGY FOR THE MIXTURE FAMILY WITHOUT RESTRICTION ON DATA STRINGS

Here we give the minimax strategy for the mixture family. Let $\theta$ denote the expectation parameter of the multinomial model for the hidden variable $y$; that is, we let $p(y|\theta) = \theta_y$, $\Theta = \{\theta\in\mathbb R^d : \forall y,\ \theta_y \ge 0,\ \sum_{y=1}^d\theta_y \le 1\}$, and $\theta_0 = 1 - \sum_{y=1}^d\theta_y$. Define the interior set $\Theta_\epsilon$ ($\epsilon > 0$) as $\Theta_\epsilon = \{\theta\in\Theta : \theta_y \ge \epsilon,\ y = 0, 1, \ldots, d\}$.

Later, we consider the situation in which $\epsilon$ converges to zero as $n$ goes to infinity. In that situation we utilize Lemmas 2 and 3, so we have to control the behavior of $\kappa_J(K)$ and $\lambda_K$, since $K = \Theta_\epsilon$ changes as $n$ increases. To treat the strings whose maximum likelihood estimate is near the boundary, we employ the technique introduced by Xie & Barron [18], which utilizes the Dirichlet($\alpha$) prior $w^{(\alpha)}(\theta) \propto \prod_{i=0}^d\theta_i^{-(1-\alpha)}$ with $\alpha < 1/2$. Note that this prior with $\alpha = 1/2$ has the same form as the Jeffreys prior for the multinomial model, and that it has higher density than the latter when $\theta$ approaches the boundary of $\Theta$.

Our asymptotic minimax strategy is the mixture
$$m_n(x^n) = (1 - 2n^{-r})\,m_J(x^n) + n^{-r}\int p(x^n|u)\,w(u)\,du + n^{-r}\int p(x^n|\theta)\,w^{(\alpha)}(\theta)\,d\theta. \qquad (11)$$
Here, the first and second terms are for the strings with MLE away from the boundary, while the third term works when the MLE approaches the boundary. We can show the following theorem.

Theorem 3: The strategy $m_n$ defined by (11) for a mixture family as the target class, with $r$ appropriately chosen, asymptotically achieves the minimax regret, i.e.,
$$\max_{x^n\in\mathcal X^n}\log\frac{p(x^n|\hat\theta)}{m_n(x^n)} \le \frac d2\log\frac n{2\pi} + \log C_J(\Theta) + o(1).$$

Since we cannot give the complete proof in this manuscript, we give some discussion below.
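Before turning to that discussion, the following sketch (illustrative only; the Bernoulli case with hand-picked $\alpha$ and $r$, and with the middle term of (11) omitted since $\hat J = J$ there) shows the mechanism of the third term: adding a small Dirichlet($\alpha$) component with $\alpha < 1/2$ raises the mixture probability of near-boundary strings, and for the parameter choices below it lowers the worst-case regret a little relative to the plain Jeffreys mixture, at the price of a tiny loss for interior strings.

```python
import numpy as np
from scipy.special import betaln

def log_beta_mixture(k, n, a):
    # log of the integral of theta^k (1-theta)^(n-k) under a Beta(a, a) prior
    return betaln(k + a, n - k + a) - betaln(a, a)

def ml_log(k, n):
    out = 0.0
    if k > 0:
        out += k * np.log(k / n)
    if k < n:
        out += (n - k) * np.log((n - k) / n)
    return out

n, alpha, r = 10_000, 0.25, 0.5    # assumed tuning; the theory only asks for alpha < 1/2 and suitable r
w = n ** (-r)
plain, modified = [], []
for k in range(n + 1):
    jeff = log_beta_mixture(k, n, 0.5)               # Jeffreys mixture m_J
    diri = log_beta_mixture(k, n, alpha)             # Dirichlet(alpha) mixture, third term of (11)
    mix = np.logaddexp(np.log1p(-w) + jeff, np.log(w) + diri)
    plain.append(ml_log(k, n) - jeff)
    modified.append(ml_log(k, n) - mix)
asym = 0.5 * np.log(n / (2 * np.pi)) + np.log(np.pi)
print("max regret, plain Jeffreys   :", max(plain))
print("max regret, with boundary mix:", max(modified))
print("asymptotic minimax value     :", asym)
```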
We assume that $\hat\theta\in\Theta_{2\epsilon}$ and that $\varepsilon < \epsilon$. Then we have $\theta\in\Theta_\epsilon$ for all $\theta\in B_\varepsilon(\hat\theta)$. We first examine Assumption 6. Note that, for $z\in\mathbb R^d$,
$$z^T\hat J(\theta, x)z = \frac{\bigl(\sum_i z_i(\kappa(x|i)-\kappa(x|0))\bigr)^2}{p(x|\theta)^2} \ge 0.$$
Further, we have
$$\frac{\partial\,(z^T\hat J(\theta, x)z)}{\partial\theta_k} = -\frac{2(\kappa(x|k)-\kappa(x|0))}{p(x|\theta)}\,z^T\hat J(\theta, x)z, \qquad (12)$$
yielding
$$\frac{\partial\,(z^T\hat J(\theta, x^n)z)}{\partial\theta_k} = -\frac1n\sum_t\frac{2(\kappa(x_t|k)-\kappa(x_t|0))}{p(x_t|\theta)}\,z^T\hat J(\theta, x_t)z. \qquad (13)$$
Hence we have, for $\theta\in\Theta_\epsilon$,
$$\Bigl|\frac{\partial\,(z^T\hat J(\theta, x^n)z)}{\partial\theta_k}\Bigr| \le \frac{2}{\epsilon}\,z^T\hat J(\theta, x^n)z.$$
Then we have
$$e^{-2\sqrt d\,|\tilde\theta-\hat\theta|/\epsilon} \le \frac{z^T\hat J(\tilde\theta, x^n)z}{z^T\hat J(\hat\theta, x^n)z} \le e^{2\sqrt d\,|\tilde\theta-\hat\theta|/\epsilon}. \qquad (14)$$
Therefore, if $|\tilde\theta-\hat\theta| \le \epsilon/(2\sqrt d)$,
$$z^T\hat J(\tilde\theta, x^n)z \le \Bigl(1 + \frac{2e\sqrt d\,|\tilde\theta-\hat\theta|}{\epsilon}\Bigr)\,z^T\hat J(\hat\theta, x^n)z$$
holds for all $z\in\mathbb R^d\setminus\{0\}$.

It follows that when $\varepsilon \le \sqrt{\lambda_{\min}}\,\epsilon/(2\sqrt d)$ is satisfied,
$$\frac{z^T\hat J(\tilde\theta, x^n)z}{z^T\hat J(\hat\theta, x^n)z} \le 1 + \frac{2e\sqrt d\,\varepsilon}{\sqrt{\lambda_{\min}}\,\epsilon}$$
holds for all $\tilde\theta\in B_\varepsilon(\hat\theta)$, where $\lambda_{\min}$ is the minimum of the smallest eigenvalue of $J(\theta)$. This implies that Assumption 6 holds with $\kappa_J(\Theta_\epsilon) = 2e\sqrt d/(\sqrt{\lambda_{\min}}\,\epsilon)$ for $\varepsilon \le \sqrt{\lambda_{\min}}\,\epsilon/(2\sqrt d)$.

Next we examine Assumption 7. Note that there exists a unit vector $z\in\mathbb R^d$ such that $|z^TV(x^n|\hat\theta)z| = \|V(x^n|\hat\theta)\|_s$. Here we have two cases: i) $z^TV(x^n|\hat\theta)z = \|V(x^n|\hat\theta)\|_s$ and ii) $z^TV(x^n|\hat\theta)z = -\|V(x^n|\hat\theta)\|_s$. First consider case i), for which we have
$$\|V(x^n|\hat\theta)\|_s = z^TJ(\hat\theta)^{-1/2}\hat J(\hat\theta)J(\hat\theta)^{-1/2}z - 1 = \frac{z'^T\hat J(\hat\theta)z'}{z'^TJ(\hat\theta)z'} - 1$$
with a certain $z'\in\mathbb R^d$. That is,
$$\frac{z'^T\hat J(\hat\theta)z'}{z'^TJ(\hat\theta)z'} = 1 + \|V(x^n|\hat\theta)\|_s.$$
For the numerator on the left side, from (14) we have
$$z'^T\hat J(\tilde\theta)z' \ge e^{-2\sqrt d\,|\tilde\theta-\hat\theta|/\epsilon}\,z'^T\hat J(\hat\theta)z'.$$
As for the denominator, similarly to (14) we can show by (12) that
$$e^{-2\sqrt d\,|\tilde\theta-\hat\theta|/\epsilon} \le \frac{z'^TJ(\tilde\theta)z'}{z'^TJ(\hat\theta)z'} \le e^{2\sqrt d\,|\tilde\theta-\hat\theta|/\epsilon} \qquad (15)$$
for $\tilde\theta\in\Theta_\epsilon$. Hence, noting $\|V(x^n|\hat\theta)\|_s \ge \delta$, we have
$$\frac{z'^T\hat J(\tilde\theta)z'}{z'^TJ(\tilde\theta)z'} \ge (1+\delta)\,e^{-4\sqrt d\,|\tilde\theta-\hat\theta|/\epsilon} \ge (1+\delta)\Bigl(1 - \frac{4\sqrt d\,|\tilde\theta-\hat\theta|}{\epsilon}\Bigr) = 1 + \delta - \frac{4\sqrt d(1+\delta)\,|\tilde\theta-\hat\theta|}{\epsilon}.$$
Hence, when $|\tilde\theta-\hat\theta| \le \delta\epsilon/(8\sqrt d(1+\delta))$ we have $z'^T\hat J(\tilde\theta)z'/z'^TJ(\tilde\theta)z' \ge 1 + \delta/2$, which implies $\|V(x^n|\tilde\theta)\|_s \ge \delta/2$. For case ii) we can show a similar conclusion. Hence, if $\varepsilon < \delta\epsilon/(8\sqrt d(1+\delta))$, then $\|V(x^n|\theta)\|_s \ge \delta/2$ holds. That Assumption 8 holds is easy to see because of (14).

Finally, we evaluate $\xi_{\Theta_\epsilon} = \min\{n/\lambda_{\Theta_\epsilon}, 1\}$ in Lemma 3. Since $\hat J_{ij}(\theta)$ is bounded by $4/\epsilon^2$ for $\theta\in\Theta_\epsilon$ and since the smallest eigenvalue of $J(\theta)$ is lower bounded by a certain positive constant, $\lambda_{\Theta_\epsilon} = O(\epsilon^{-2})$; hence $\xi_{\Theta_\epsilon}$ is lower bounded by $n\epsilon^2$ times a certain constant. We set $\epsilon = n^{-(1-p)}$ ($0 < p < 1$) and $\delta = n^{-1/2+\gamma}$ ($0 < \gamma < 1/2$), where $\varepsilon$ must satisfy $n^{-1/2+\iota} \le \varepsilon \le \delta\epsilon$ ($0 < \iota < 1/2$) in the sense of order, which is equivalent to $p > 1 - \gamma + \iota$. To make the second term of (11) give a sufficient likelihood gain, it suffices to have
$$\liminf_{n\to\infty}\frac{\log(n\epsilon^2\delta^2)}{\log n} > 0,$$
which is equivalent to $p > 1 - \gamma$ in our setting. This is automatically satisfied under $p > 1 - \gamma + \iota$. When $\gamma > \iota$, there exists a $p$ in $(0, 1)$ which satisfies $p > 1 - \gamma + \iota$.

In this setting, when $\hat\theta_x < 2n^{-(1-p)}$ for some $x$, we need the help of the third term in (11). From (4), we have
$$\frac{\int p(x^n|\theta)\,w^{(\alpha)}(\theta)\,d\theta}{p(x^n|\hat\theta)} \ge \int\prod_{i=0}^d\frac{\theta_i^{\,n\hat\theta_i}}{\hat\theta_i^{\,n\hat\theta_i}}\,w^{(\alpha)}(\theta)\,d\theta.$$
By Lemma 4 of [18], the right side is not less than $R\,n^{-d/2+(1/2-\alpha)(1-p)}$, where $R$ is a constant determined by $d$, $\alpha$, and $p$. Let $r$ be smaller than $(1/2-\alpha)(1-p)$; then for large $n$ the third term in (11) is larger than $(2\pi)^{d/2}\,p(x^n|\hat\theta)/(C_J\,n^{d/2})$.

ACKNOWLEDGMENT

The authors express their sincere gratitude to Hiroshi Nagaoka and the anonymous reviewers for helpful comments. This research was supported in part by the Aihara Project, the FIRST program from JSPS, initiated by CSTP, and in part by JSPS KAKENHI Grant Number 2450008.

REFERENCES

[1] S. Amari and H. Nagaoka, Methods of Information Geometry, AMS & Oxford University Press, 2000.
[2] A. R. Barron, J. Rissanen, and B. Yu, "The minimum description length principle in coding and modeling," IEEE Trans. Inform. Theory, vol. 44, no. 6, pp. 2743-2760, 1998.
[3] A. R. Barron and J. Takeuchi, "Mixture models achieving optimal coding regret," Proc. of 1998 IEEE Information Theory Workshop, 1998.
[4] L. Brown, Fundamentals of Statistical Exponential Families, Institute of Mathematical Statistics, 1986.
[5] H. Jeffreys, Theory of Probability, 3rd ed., Univ. of California Press, Berkeley, CA, 1961.
[6] D. Pollard, online notes, http://www.stat.yale.edu/~pollard/books/asymptopia/, 2010.
[7] J. Rissanen, "Fisher information and stochastic complexity," IEEE Trans. Inform. Theory, vol. 42, no. 1, pp. 40-47, 1996.
[8] Yu. M. Shtarkov, "Universal sequential coding of single messages," Problems of Information Transmission, vol. 23, pp. 3-17, July 1988.
[9] J. Takeuchi and A. R. Barron, "Asymptotically minimax regret for exponential families," Proc. of the 20th Symposium on Information Theory and Its Applications (SITA '97), pp. 665-668, 1997.
[10] J. Takeuchi and A. R. Barron, "Asymptotically minimax regret by Bayes mixtures," Proc. of 1998 IEEE ISIT, p. 318, 1998.
[11] J. Takeuchi and A. R. Barron, "Asymptotically minimax regret by Bayes mixtures for non-exponential families," Proc. of 2013 IEEE ITW, pp. 204-208, 2013.
[12] J. Takeuchi and A. R. Barron, "Some inequality for mixture families," http://www-kairo.csce.kyushu-u.ac.jp/~tak/papers/memo r2.pdf, October 2013.
[13] J. Takeuchi and A. R. Barron, "Asymptotically minimax prediction for mixture families," Proc. of the 36th Symposium on Information Theory and Its Applications (SITA '13), pp. 653-657, 2013 (without peer review, Japan domestic).
[14] J. Takeuchi and A. R. Barron, "Some inequality for models with hidden variables," http://www-kairo.csce.kyushu-u.ac.jp/~tak/papers/memo r3.pdf, January 2014.
[15] J. Takeuchi, A. R. Barron, and T. Kawabata, "Statistical curvature and stochastic complexity," Proc. of the 2nd Symposium on Information Geometry and Its Applications, pp. 29-36, 2006.
[16] J. Takeuchi, T. Kawabata, and A. R. Barron, "Properties of Jeffreys mixture for Markov sources," IEEE Trans. Inform. Theory, vol. 59, no. 1, pp. 438-457, 2013.
[17] Q. Xie and A. R. Barron, "Minimax redundancy for the class of memoryless sources," IEEE Trans. Inform. Theory, vol. 43, no. 2, pp. 646-657, 1997.
[18] Q. Xie and A. R. Barron, "Asymptotic minimax regret for data compression, gambling and prediction," IEEE Trans. Inform. Theory, vol. 46, no. 2, pp. 431-445, 2000.