INFORMATION-THEORETIC DETERMINATION OF MINIMAX RATES OF CONVERGENCE
By Yuhong Yang and Andrew Barron
Iowa State University and Yale University


The Annals of Statistics 1999, Vol. 27, No. 5.

We present some general results determining minimax bounds on statistical risk for density estimation based on certain information-theoretic considerations. These bounds depend only on metric entropy conditions and are used to identify the minimax rates of convergence.

1. Introduction. The metric entropy structure of a density class determines the minimax rate of convergence of density estimators. Here we prove such results using new direct metric entropy bounds on the mutual information that arises by application of Fano's information inequality in the development of lower bounds characterizing the optimal rate. No special construction is required for each density class. We here study global measures of loss such as integrated squared error, squared Hellinger distance or Kullback-Leibler (K-L) divergence in nonparametric curve estimation problems. The minimax rates of convergence are often determined in two steps. A good lower bound is obtained for the target family of densities, and a specific estimator is constructed so that the maximum risk is within a constant factor of the derived lower bound. For global minimax risk as we are considering here, inequalities for hypothesis tests are often used to derive minimax lower bounds, including versions of Fano's inequality. These methods are used in Ibragimov and Hasminskii (1977, 1978), Hasminskii (1978), Bretagnolle and Huber (1979), Efroimovich and Pinsker (1982), Stone (1982), Birgé (1983, 1986), Nemirovskii (1985), Devroye (1987), Le Cam (1986), Yatracos (1988), Hasminskii and Ibragimov (1990), Yu (1996) and others. Upper bounds on minimax risk under metric entropy conditions are in Birgé (1983, 1986), Yatracos (1985), Barron and Cover (1991), Van de Geer (1990), Wong and Shen (1995) and Birgé and Massart (1993, 1994). The focus of the present paper is on lower bounds determining the minimax rate, though some novel upper bound results are given as well. Parallel to the development of risk bounds for global measures of loss are results for point estimation of a density or of functionals of the density; see, for example, Farrell (1972), Donoho and Liu (1991), Birgé and Massart (1995).

Received November 1995; revised July. Supported in part by NSF Grants ECS and DMS. AMS 1991 subject classifications. Primary 62G07; secondary 62B10, 62C20, 94A29. Key words and phrases. Minimax risk, density estimation, metric entropy, Kullback-Leibler distance.

In its original form, Fano's inequality relates the average probability of error in a multiple hypothesis test to the Shannon mutual information for a joint distribution of parameter and random sample [Fano (1961); see also Cover and Thomas (1991), pages 39 and 205]. Beginning with Fano, this inequality has been used in information theory to determine the capacity of communication channels. Ibragimov and Hasminskii (1977, 1978) initiated the use of Fano's inequality in statistics for the determination of minimax rates of estimation for certain classes of functions. To apply this technique, suitable control of the Shannon information is required. Following Birgé (1983), statisticians have stated and used versions of Fano's inequality where the mutual information is replaced by a bound involving the diameter of classes of densities under the Kullback-Leibler divergence. For success of this tactic to obtain lower bounds on minimax risk, one must make a restriction to small subsets of the function space and then establish the existence of a packing set of suitably large cardinality. As Birgé (1986) points out, in that manner, the use of Fano's inequality is similar to the use of Assouad's lemma. As it is not immediately apparent that such local packing sets exist, these techniques have been applied on a case-by-case basis to establish minimax rates for various function classes in the above cited literature.

In this paper we provide two means by which to reveal minimax rates from global metric entropies. The first is the use of Fano's inequality in its original form together with suitable information inequalities. This is the approach we take in Sections 2 through 6 to get minimax bounds for various measures of loss. The second approach, which we take in Section 7, shows that global metric entropy behavior implies the existence of a local packing set satisfying the conditions for the theory of Birgé (1983) to be applicable. Thus we demonstrate for nonparametric function classes (and certain measures of loss) that the minimax convergence rate is determined by the global metric entropy over the whole function class (or over large subsets of it). The advantage is that the metric entropies are available in approximation theory for many function classes [see, e.g., Lorentz, Golitschek and Makovoz (1996)]. It is no longer necessary to uncover additional local packing properties on a case-by-case basis.

The following proposition is representative of the results obtained here. Let F be a class of functions and let d(f, g) be a distance between functions [we require d(f, g) to be nonnegative and equal to zero for f = g, but we do not require it to be a metric]. Let N(ε, F) be the size of the largest packing set of functions separated by at least ε in F, and let ε_n satisfy ε_n² = M(ε_n)/n, where M(ε) = log N(ε, F) is the Kolmogorov ε-entropy and n is the sample size. Assume the target class F is rich enough to satisfy lim inf_{ε→0} M(ε/2)/M(ε) > 1 [which is true, e.g., if M(ε) = (1/ε)^r κ(ε) with r > 0 and κ(ε/2)/κ(ε) → 1 as ε → 0]. This condition is satisfied in typical nonparametric classes. For convenience, we will use the symbols ⪰ and ≍, where a_n ⪰ b_n means b_n = O(a_n), and a_n ≍ b_n means both a_n ⪰ b_n and b_n ⪰ a_n. Probability densities are taken with respect to a finite measure µ in the following proposition.

Proposition 1. In the following cases, the minimax convergence rate is characterized by the metric entropy in terms of the critical separation ε_n as follows:

min_{f̂} max_{f∈F} E_f d²(f, f̂) ≍ ε_n².

(i) F is any class of density functions bounded above and below, 0 < C_0 ≤ f ≤ C_1 for f ∈ F. Here d²(f, g) is either integrated squared error ∫ (f(x) − g(x))² dµ, squared Hellinger distance or Kullback-Leibler divergence.

(ii) F is a convex class of densities with f ≤ C_1 for f ∈ F, there exists at least one density in F bounded away from zero, and d is the L_2 distance.

(iii) F is any class of functions f with |f| ≤ C_1 for f ∈ F, for the regression model Y = f(X) + ε, where X and ε are independent, X ∼ P_X, ε ∼ Normal(0, σ²) with σ > 0, and d is the L_2(P_X) norm.

From the above proposition, the minimax L_2 risk rate is determined by the metric entropy alone, whether or not the densities can be zero. For Hellinger and Kullback-Leibler (K-L) risk, we show that by modifying a nonparametric class of densities with uniformly bounded logarithms to allow the densities to approach zero or even vanish on some unknown subsets, the minimax rate may remain unchanged compared to that of the original class.

Now let us outline roughly the method of lower bounding the minimax risk using Fano's inequality. The first step is to restrict attention to a subset S_0 of the parameter space where minimax estimation is nearly as difficult as for the whole space and, moreover, where the loss function of interest is related locally to the K-L divergence that arises in Fano's inequality. (For example, the subset can in some cases be the set of densities with a bound on their logarithms.) As we shall reveal, the lower bound on the minimax rate is determined by the metric entropy of the subset. The proof technique involving Fano's inequality first lower bounds the minimax risk by restricting to as large as possible a finite set of parameter values θ_1, ..., θ_m in S_0 separated from each other by an amount ε_n in the distance of interest. The critical separation ε_n is the largest separation such that the hypotheses θ_1, ..., θ_m are nearly indistinguishable on the average by tests, as we shall see. Indeed, Fano's inequality reveals this indistinguishability in terms of the K-L divergence between the densities p_{θ_j}(x_1, ..., x_n) = ∏_{i=1}^n p_{θ_j}(x_i) and the centroid of such densities q(x_1, ..., x_n) = (1/m) ∑_{j=1}^m p_{θ_j}(x_1, ..., x_n) [which yields the Shannon mutual information I(Θ; X_1, ..., X_n) between θ and X_1, ..., X_n with a uniform distribution on θ in {θ_1, ..., θ_m}]. Here the key question is to determine the separation such that the average of this K-L divergence is small compared to the distance log m that would correspond to maximally distinguishable densities (for which θ would be determined by X^n). It is critical here that the K-L divergence does not satisfy a triangle inequality between the joint densities. Indeed, covering entropy properties of the K-L divergence (under the conditions of the proposition) show that the K-L divergence from every p_{θ_j}(x_1, ..., x_n) to the centroid is bounded by the right order 2nε_n², even though the distance between two such p_{θ_j}(x_1, ..., x_n) can be as large as nβ, where β is the K-L diameter of the whole set {p_{θ_1}, ..., p_{θ_m}}.

The proper convergence rate is thus identified provided the cardinality m of the subset is chosen such that nε_n²/log m is bounded by a suitable constant less than 1. Thus ε_n is determined by solving for nε_n²/M(ε_n) equal to such a constant, where M(ε) is the metric entropy (the logarithm of the largest cardinality of an ε-packing set). In this way, the metric entropy provides a lower bound on the minimax convergence rate.

Applications of Fano's inequality to density estimation have used the K-L diameter nβ of the set {p^n_{θ_1}, ..., p^n_{θ_m}} [see, e.g., Birgé (1983)] or similar rough bounds [such as nI(Θ; X_1) as in Hasminskii (1978)] in place of the average distance of p^n_{θ_1}, ..., p^n_{θ_m} from their centroid. In that theory, to obtain a suitable bound, a statistician needs to find, if possible, a sufficiently large subset θ_1, ..., θ_m for which the diameter of this subset (in the K-L sense) is of the same order as the separation between closest points in this subset (in the chosen distance). Apparently, such a bound is possible only for subsets of small diameter. Thus by that technique, knowledge is needed not only of the metric entropy but also of special localized subsets. Typical tools for smoothness classes involve perturbations of densities parametrized by vertices of a hypercube. While interesting, such involved calculations are not needed to obtain the correct order bounds. It suffices to know or bound the metric entropy of the chosen set S_0. Our results are especially useful for function classes whose global metric entropies have been determined but for which local packing sets have not been identified for applying Birgé's theory (e.g., general linear and sparse approximation sets of functions in Sections 4 and 5, and neural network classes in Section 6).

It is not our purpose to criticize the use of hypercube-type arguments in general. In fact, besides the success of such methods mentioned above, they are also useful in other applications such as determining the minimax rates of estimating functionals of densities [see, e.g., Bickel and Ritov (1988), Birgé and Massart (1995) and Pollard (1993)] and minimax rates in nonparametric classification [Yang (1999b)].

The density estimation problem we consider is closely related to a data compression problem in information theory (see Section 3). The relationship allows us to obtain both upper and lower bounds on the minimax risk from upper-bounding the minimax redundancy of data compression, which is related to the global metric entropy.

Le Cam (1973) pioneered the use of local entropy conditions, in which convergence rates are characterized in terms of the covering or packing of balls of radius ε by balls of radius ε/2, with subsequent developments by Birgé and others, as mentioned above. Such local entropy conditions provide optimal convergence rates in finite-dimensional as well as infinite-dimensional settings. In Section 7, we show that knowledge of global metric entropy provides the existence of a set with suitable local entropy properties in infinite-dimensional settings. In such cases, there is no need to explicitly require or construct such a set.
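As a purely illustrative companion to this outline (not taken from the paper), the short Python sketch below solves the critical-radius equation ε_n² = M(ε_n)/n numerically; the entropy form M(ε) = ε^{-1/α}, with the constant in front taken to be 1, is an assumed stand-in for a smoothness class of the kind treated in Section 6.

```python
import numpy as np
from scipy.optimize import brentq

def critical_radius(M, n, lo=1e-8, hi=1.0):
    """Solve eps^2 = M(eps)/n for eps, assuming M is decreasing in eps."""
    return brentq(lambda e: e ** 2 - M(e) / n, lo, hi)

# Assumed entropy M(eps) = eps^(-1/alpha), a stand-in for a smoothness class.
alpha = 2.0
M = lambda e: e ** (-1.0 / alpha)

for n in (10**3, 10**4, 10**5, 10**6):
    eps_n = critical_radius(M, n)
    # The solution satisfies eps_n^2 = n^(-2*alpha/(2*alpha + 1)) for this M.
    print(n, eps_n ** 2, n ** (-2 * alpha / (2 * alpha + 1)))
```

For this entropy the two printed columns coincide (up to solver tolerance), recovering the familiar nonparametric rate n^{-2α/(2α+1)}.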

The paper is divided into seven sections. In Section 2, the main results are presented. Applications in data compression and regression are given in Section 3. In Sections 4 and 5, results connecting linear approximation and minimax rates, and sparse approximation and minimax rates, respectively, are given. In Section 6, we illustrate the determination of minimax rates of convergence for several function classes. In Section 7, we discuss the relationship between global entropy and local entropy. The proofs of some lemmas are given in the Appendix.

2. Main results. Suppose we have a collection of densities {p_θ : θ ∈ Θ} defined on a measurable space X with respect to a σ-finite measure µ. The parameter space Θ could be a finite-dimensional space or a nonparametric space (e.g., the class of all densities). Let X_1, X_2, ..., X_n be an i.i.d. sample from p_θ, θ ∈ Θ. We want to estimate the true density p_θ or θ based on the sample. The K-L loss, the squared Hellinger loss, the integrated squared error and some other losses will be considered in this paper. We determine minimax bounds for subclasses {p_θ : θ ∈ S}, S ⊂ Θ. When the parameter is the density itself, we may use f and F in place of θ and S, respectively. Our technique is most appropriate for nonparametric classes (e.g., monotone, Lipschitz, or neural net; see Section 6).

Let S̄ be an action space for the parameter estimates with S ⊂ S̄ ⊂ Θ. An estimator θ̂ is then a measurable mapping from the sample space of X_1, X_2, ..., X_n to S̄. Let S̄_n be the collection of all such estimators. For nonparametric density estimation, S̄ = Θ is often chosen to be the set of all densities or some transform of the densities (e.g., square root of density). We consider general loss functions d, which are mappings from Θ × Θ to R_+ with d(θ, θ) = 0 and d(θ, θ') > 0 for θ ≠ θ'. We call such a loss function a distance whether or not it satisfies the properties of a metric. The minimax risk of estimating θ ∈ S with action space S̄ is defined as

r_n = min_{θ̂∈S̄_n} max_{θ∈S} E_θ d²(θ, θ̂).

Here min and max are understood to be inf and sup, respectively, if the minimizer or maximizer does not exist.

Definition 1. A finite set N_ε ⊂ S is said to be an ε-packing set in S with separation ε > 0 if for any θ, θ' ∈ N_ε, θ ≠ θ', we have d(θ, θ') > ε. The logarithm of the maximum cardinality of ε-packing sets is called the packing ε-entropy or Kolmogorov capacity of S with distance function d and is denoted M_d(ε).

Definition 2. A set G_ε ⊂ S̄ is said to be an ε-net for S if for any θ ∈ S, there exists a θ_0 ∈ G_ε such that d(θ, θ_0) ≤ ε. The logarithm of the minimum cardinality of ε-nets is called the covering ε-entropy of S and is denoted V_d(ε).
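Definitions 1 and 2 can be made concrete with a greedy construction; the sketch below is only an illustration (the candidate family and the discretized L_2 distance are hypothetical choices, not objects from the paper). By maximality, the ε-separated set it returns is also an ε-net, which is the standard source of the inequality V_d(ε) ≤ M_d(ε).

```python
import numpy as np

def greedy_packing(candidates, eps, dist):
    """Greedily collect points whose pairwise distances all exceed eps.
    A maximal such set is an eps-packing (Definition 1) and, by maximality,
    also an eps-net (Definition 2) for the candidate set."""
    pack = []
    for p in candidates:
        if all(dist(p, q) > eps for q in pack):
            pack.append(p)
    return pack

# Toy candidate "functions" on a grid, compared in L2 w.r.t. the uniform measure.
rng = np.random.default_rng(0)
grid = np.linspace(0.0, 1.0, 50)
candidates = [1 + 0.5 * np.sin(2 * np.pi * k * grid + rng.uniform(0, 1))
              for k in range(1, 40)]
l2 = lambda f, g: np.sqrt(np.mean((f - g) ** 2))

pack = greedy_packing(candidates, eps=0.3, dist=l2)
print("packing size:", len(pack), "log-cardinality:", np.log(len(pack)))
```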

From the definitions, it is straightforward to see that M_d(ε) and V_d(ε) are nonincreasing in ε and that M_d(ε) is right continuous. These definitions are slight generalizations of the metric entropy notions introduced by Kolmogorov and Tihomirov (1959). In accordance with common terminology, we informally call these ε-entropies metric entropies even when the distance is not a metric. One choice is the square root of the Kullback-Leibler (K-L) divergence, which we denote by d_K, where d_K²(θ, θ') = D(p_θ ∥ p_{θ'}) = ∫ p_θ log(p_θ/p_{θ'}) dµ. Clearly d_K(θ, θ') is asymmetric in its two arguments, so it is not a metric. Other distances we consider include the Hellinger metric d_H(θ, θ') = (∫ (√p_θ − √p_{θ'})² dµ)^{1/2} and the L_q(µ) metric d_q(θ, θ') = (∫ |p_θ − p_{θ'}|^q dµ)^{1/q} for q ≥ 1. The Hellinger distance is upper bounded by the square root of the K-L divergence, that is, d_H(θ, θ') ≤ d_K(θ, θ'). We assume the distance d satisfies the following condition.

Condition 0 (Local triangle inequality). There exist positive constants A ≤ 1 and ε_0 such that for any θ, θ' ∈ S and θ'' ∈ S̄, if max(d(θ, θ''), d(θ', θ'')) ≤ ε_0, then

d(θ, θ'') + d(θ', θ'') ≥ A d(θ, θ').

Remarks. (i) If d is a metric on Θ, then the condition is always satisfied with A = 1 for any S̄. (ii) For d_K, the condition is not automatically satisfied. It holds when S is a class of densities with bounded logarithms and S̄ is any action space, including the class of all densities. It is also satisfied by the regression families considered in Section 3. See Section 2.3 for more discussion.

When Condition 0 is satisfied, the packing entropy and covering entropy have the simple relationship, for ε ≤ ε_0,

M_d(2ε/A) ≤ V_d(ε) ≤ M_d(Aε).

We will obtain minimax results for such general d, and then special results will be given for several choices of d: the square root K-L divergence, the Hellinger distance and the L_q distance. We assume M_d(ε) < ∞ for all ε > 0. The square root K-L, Hellinger and L_q packing entropies are denoted M_K(ε), M_H(ε) and M_q(ε), respectively.

2.1. Minimax risk under a global entropy condition. Suppose a good upper bound on the covering ε-entropy under the square root K-L divergence is available. That is, assume V_K(ε) ≤ V(ε). Ideally, V_K(ε) and V(ε) are of the same order. Similarly, let M(ε) ≤ M_d(ε) be an available lower bound on the ε-packing entropy with distance d; ideally, M(ε) is of the same order as M_d(ε). Suppose V(ε) and M(ε) are nonincreasing and right-continuous functions. To avoid a trivial case (in which S is a small finite set), we assume M(ε) > 2 log 2 for ε small enough. Let ε_n (called the critical covering radius) be determined by

ε_n² = V(ε_n)/n.

The trade-off here between V(ε)/n and ε² is analogous to that between the squared bias and the variance of an estimator. As will be shown later, 2ε_n² is an upper bound on the minimax K-L risk.

Let ε_{n,d} be a separation ε such that

M(ε_{n,d}) = 4nε_n² + 2 log 2.

We call ε_{n,d} the packing separation commensurate with the critical covering radius ε_n. This ε_{n,d}² determines a lower bound on the minimax risk. In general, the upper and lower rates ε_n² and ε_{n,d}² need not match. Conditions under which they are of the same order are explored at the end of this subsection and are used in Section 2.2 to identify the minimax rate for the L_2 distance and in Sections 2.3 and 2.4 to relate the minimax K-L risk to the minimax Hellinger risk and the Hellinger metric entropy.

Theorem 1 (Minimax lower bound). Suppose Condition 0 is satisfied for the distance d. Then when the sample size n is large enough that ε_{n,d} ≤ 2ε_0, the minimax risk for estimating θ ∈ S satisfies

min_{θ̂} max_{θ∈S} P_θ(d(θ, θ̂) ≥ (A/2)ε_{n,d}) ≥ 1/2

and consequently,

min_{θ̂} max_{θ∈S} E_θ d²(θ, θ̂) ≥ (A²/8) ε_{n,d}²,

where the minimum is over all estimators mapping from X^n to S̄.

Proof. Let N_{ε_{n,d}} be an ε_{n,d}-packing set with the maximum cardinality in S under the given distance d and let G_{ε_n} be an ε_n-net for S under d_K. For any estimator θ̂ taking values in S̄, define θ̃ = arg min_{θ∈N_{ε_{n,d}}} d(θ, θ̂) (if there is more than one minimizer, choose any one), so that θ̃ takes values in the finite packing set N_{ε_{n,d}}. Let θ be any point in N_{ε_{n,d}}. If d(θ, θ̂) < Aε_{n,d}/2, then max(d(θ̃, θ̂), d(θ, θ̂)) < Aε_{n,d}/2 ≤ ε_0 (since θ̃ minimizes the distance to θ̂ over the packing set), and hence by Condition 0, d(θ̃, θ̂) + d(θ, θ̂) ≥ A d(θ̃, θ), which is at least Aε_{n,d} if θ̃ ≠ θ. Thus if θ̃ ≠ θ, we must have d(θ, θ̂) ≥ Aε_{n,d}/2, and

min_{θ̂} max_{θ∈S} P_θ(d(θ, θ̂) ≥ (A/2)ε_{n,d})
≥ min_{θ̂} max_{θ∈N_{ε_{n,d}}} P_θ(d(θ, θ̂) ≥ (A/2)ε_{n,d})
≥ min_{θ̂} max_{θ∈N_{ε_{n,d}}} P_θ(θ̃ ≠ θ)
≥ min_{θ̂} ∑_{θ∈N_{ε_{n,d}}} w(θ) P_θ(θ̃ ≠ θ)
= min_{θ̂} P_w(θ̃ ≠ θ),

where in the last line θ is randomly drawn according to a discrete prior probability w restricted to N_{ε_{n,d}}, and P_w denotes the Bayes average probability with respect to the prior w.

Moreover, since d^l(θ, θ̂) is not less than ((A/2)ε_{n,d})^l 1{θ̃ ≠ θ}, taking the expected value it follows that for all l > 0,

min_{θ̂} max_{θ∈S} E_θ d^l(θ, θ̂) ≥ ((A/2)ε_{n,d})^l min_{θ̂} P_w(θ̃ ≠ θ).

By Fano's inequality [see, e.g., Fano (1961) or Cover and Thomas (1991), pages 39 and 205], with w being the discrete uniform prior on θ in the packing set N_{ε_{n,d}}, we have

(1) P_w(θ̃ ≠ θ) ≥ 1 − (I(Θ; X^n) + log 2)/log |N_{ε_{n,d}}|,

where I(Θ; X^n) is Shannon's mutual information between the random parameter and the random sample, when θ is distributed according to w. This mutual information is equal to the average (with respect to the prior) of the K-L divergence between p(x^n | θ) and p_w(x^n) = ∑_θ w(θ) p(x^n | θ), where p(x^n | θ) = p_θ(x^n) = ∏_{i=1}^n p_θ(x_i) and x^n = (x_1, ..., x_n). It is upper bounded by the maximum K-L divergence between the product densities p(x^n | θ) and any fixed joint density q(x^n) on the sample space X^n. Indeed,

I(Θ; X^n) = ∑_θ w(θ) ∫ p(x^n | θ) log (p(x^n | θ)/p_w(x^n)) µ(dx^n)
≤ ∑_θ w(θ) ∫ p(x^n | θ) log (p(x^n | θ)/q(x^n)) µ(dx^n)
≤ max_{θ∈N_{ε_{n,d}}} D(P_{X^n|θ} ∥ Q_{X^n}).

The first inequality above follows from the fact that the Bayes mixture density p_w(x^n) minimizes the average K-L divergence over choices of densities q(x^n) [any other choice yields a larger value by the amount ∫ p_w(x^n) log(p_w(x^n)/q(x^n)) µ(dx^n) > 0]. We have w uniform on N_{ε_{n,d}}. Now choose w_1 to be the uniform prior on G_{ε_n}, and let q(x^n) = p_{w_1}(x^n) = ∑_θ w_1(θ) p(x^n | θ) and Q_{X^n} be the corresponding Bayes mixture density and distribution, respectively. Because G_{ε_n} is an ε_n-net for S under d_K, for each θ ∈ S there exists θ' ∈ G_{ε_n} such that D(p_θ ∥ p_{θ'}) = d_K²(θ, θ') ≤ ε_n². Also, by definition, log |G_{ε_n}| ≤ V_K(ε_n). It follows that

(2) D(P_{X^n|θ} ∥ Q_{X^n}) = E_θ log [p(X^n | θ) / ((1/|G_{ε_n}|) ∑_{θ''∈G_{ε_n}} p(X^n | θ''))]
≤ E_θ log [p(X^n | θ) / ((1/|G_{ε_n}|) p(X^n | θ'))]
= log |G_{ε_n}| + D(P_{X^n|θ} ∥ P_{X^n|θ'})
≤ V(ε_n) + nε_n².

Thus, by our choice of ε_{n,d}, (I(Θ; X^n) + log 2)/log |N_{ε_{n,d}}| ≤ 1/2. The conclusion follows. This completes the proof of Theorem 1.
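For reference, the three ingredients of the proof combine in one line: Fano's inequality (1), the mixture bound (2) with V(ε_n) = nε_n², and the choice M(ε_{n,d}) = 4nε_n² + 2 log 2 give

\[
P_w(\tilde\theta \neq \theta)
\;\ge\; 1 - \frac{I(\Theta; X^n) + \log 2}{\log |N_{\varepsilon_{n,d}}|}
\;\ge\; 1 - \frac{V(\varepsilon_n) + n\varepsilon_n^2 + \log 2}{M(\varepsilon_{n,d})}
\;=\; 1 - \frac{2n\varepsilon_n^2 + \log 2}{4n\varepsilon_n^2 + 2\log 2}
\;=\; \tfrac12,
\]

using log |N_{ε_{n,d}}| = M_d(ε_{n,d}) ≥ M(ε_{n,d}); this is exactly the 1/2 appearing in the statement of Theorem 1.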

Remarks. (i) Up to inequality (1), the development here is standard. Previous uses of Fano's inequality for minimax lower bounds take one of the following weak bounds on the mutual information: I(Θ; X^n) ≤ nI(Θ; X_1) or I(Θ; X^n) ≤ n max_{θ,θ'} D(P_{X_1|θ} ∥ P_{X_1|θ'}) [see, e.g., Hasminskii (1978) and Birgé (1983), respectively]. An exception is work of Ibragimov and Hasminskii (1978), where a more direct evaluation of the mutual information for Gaussian stochastic process models is used.

(ii) Our use of the improved bound is borrowed from ideas in universal data compression, for which I(Θ; X^n) represents the Bayes average redundancy and max_{θ∈S} D(P_{X^n|θ} ∥ P_{X^n}) represents an upper bound on the minimax redundancy C_n = min_{Q_{X^n}} max_{θ∈S} D(P_{X^n|θ} ∥ Q_{X^n}) = max_w I_w(Θ; X^n), where the maximum is over priors supported on S. The universal data compression interpretations of these quantities can be found in Davisson (1973) and Davisson and Leon-Garcia (1980) [see Clarke and Barron (1994), Yu (1996), Haussler (1997) and Haussler and Opper (1997) for some of the recent work in that area]. The bound D(P_{X^n|θ} ∥ P_{X^n}) ≤ V(ε_n) + nε_n² has roots in Barron [(1987), page 89], where it is given in a more general form for arbitrary priors, that is, D(P_{X^n|θ} ∥ P_{X^n}) ≤ log(1/w({θ' : D(p_θ ∥ p_{θ'}) ≤ ε²})) + nε², where P_{X^n} has density p_w(x^n) = ∫ p(x^n | θ) w(dθ). The redundancy bound V(ε_n) + nε_n² can also be obtained from the use of a two-stage code of length min_{θ∈G_{ε_n}} [log |G_{ε_n}| + log(1/p(x^n | θ))] [see Barron and Cover (1991), Section V].

(iii) From inequality (1), the minimax risk is bounded below by a constant times ε_{n,d}²(1 − (C_n + log 2)/K_n), where C_n = max_w I_w(Θ; X^n) is the Shannon capacity of the channel {p(x^n | θ) : θ ∈ S} and K_n = log |N_{ε_{n,d}}| is the Kolmogorov ε_{n,d}-capacity of S. Thus ε_{n,d}² lower bounds the minimax rate provided the Shannon capacity is less than the Kolmogorov capacity by a factor less than 1. This Shannon-Kolmogorov characterization is emphasized by Ibragimov and Hasminskii (1977, 1978).

(iv) For applications, the lower bounds may be applied to a subclass of densities {p_θ : θ ∈ S_0}, S_0 ⊂ S, which may be rich enough to characterize the difficulty of estimation of the densities in the whole class yet is easy enough to check the conditions. For instance, if {p_θ : θ ∈ S_0} is a subclass of densities that have support on a compact space and |log p_θ| ≤ T for all θ ∈ S_0, then the square root K-L divergence, Hellinger distance and L_2 distance are all equivalent in the sense that each of them is both upper bounded and lower bounded by multiples of the others.

We now turn our attention to obtaining a minimax upper bound. We use a uniform prior w_1 on the ε_n-net G_{ε_n} for S under d_K. For n = 1, 2, ..., let p(x^n) = ∑_{θ∈G_{ε_n}} w_1(θ) p(x^n | θ) = (1/|G_{ε_n}|) ∑_{θ∈G_{ε_n}} p(x^n | θ) be the corresponding mixture density. Let

(3) p̂(x) = n^{-1} ∑_{i=0}^{n-1} p̂_i(x)

be the density estimator constructed as a Cesàro average of the Bayes predictive density estimators p̂_i(x) = p(X_{i+1} | X^i) evaluated at X_{i+1} = x, which equal p(X^i, x)/p(X^i) for i > 0, and p̂_0(x) = p(x) = (1/|G_{ε_n}|) ∑_{θ∈G_{ε_n}} p(x | θ) for i = 0. Note that p̂_i(x) is the average of p(x | θ) with respect to the posterior distribution p(θ | X^i) on θ ∈ G_{ε_n}.
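To make the construction in (3) concrete, here is a small, purely illustrative Python sketch. A finite family of Beta densities plays the role of the net G_{ε_n} (a hypothetical stand-in, not an ε-net built as in the paper), and the estimator is the Cesàro average over i = 0, ..., n − 1 of the Bayes predictive densities under a uniform prior on the net, which is exactly the form of (3).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
net = [stats.beta(a, b) for a in (1, 2, 3, 5) for b in (1, 2, 3, 5)]  # stands in for G_{eps_n}
data = stats.beta(2, 3).rvs(size=200, random_state=rng)              # i.i.d. sample from p_theta

def cesaro_predictive_density(x_eval, data, net):
    log_post = np.zeros(len(net))      # log posterior weights; uniform prior at i = 0
    avg = np.zeros_like(x_eval)
    for xi in data:
        w = np.exp(log_post - log_post.max())
        w /= w.sum()
        # Bayes predictive density p_hat_i(x) = sum_theta w(theta | X^i) p(x | theta)
        avg += sum(wk * d.pdf(x_eval) for wk, d in zip(w, net))
        log_post += np.log([d.pdf(xi) + 1e-300 for d in net])
    return avg / len(data)             # p_hat = (1/n) * sum_{i=0}^{n-1} p_hat_i

grid = np.linspace(0.01, 0.99, 99)
p_hat = cesaro_predictive_density(grid, data, net)
print("rough integrated squared error:",
      np.mean((p_hat - stats.beta(2, 3).pdf(grid)) ** 2))
```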

Theorem 2 (Upper bound). Let V(ε) be an upper bound on the covering entropy of S under d_K and let ε_n satisfy V(ε_n) = nε_n². Then

min_{p̂} max_{θ∈S} E_θ D(p_θ ∥ p̂) ≤ 2ε_n²,

where the minimization is over all density estimators. Moreover, if Condition 0 is satisfied for a distance d with A_0 d²(θ, θ') ≤ d_K²(θ, θ') for all θ, θ' ∈ S̄, and if the set S̄_n of allowed estimators (mappings from X^n to S̄) contains the estimator p̂ constructed in (3) above, then when ε_{n,d} ≤ 2ε_0,

(A_0 A²/8) ε_{n,d}² ≤ A_0 min_{θ̂∈S̄_n} max_{θ∈S} E_θ d²(θ, θ̂) ≤ min_{θ̂∈S̄_n} max_{θ∈S} E_θ d_K²(θ, θ̂) ≤ 2ε_n².

The condition that S̄_n contains p̂ in Theorem 2 is satisfied if {p_θ : θ ∈ S̄} is convex (because p̂ is a convex combination of densities p_θ). In particular, this holds if the action space S̄ is the set of all densities on X, in which case, when d is a metric, the only remaining condition needed for the second set of inequalities is A_0 d²(θ, θ') ≤ d_K²(θ, θ') for all θ, θ'. This is satisfied by the Hellinger distance and the L_1 distance (with A_0 = 1 and A_0 = 1/2, respectively). When d is the square root K-L divergence, Condition 0 restricts the family {p_θ : θ ∈ S} (e.g., to consist of densities with uniformly bounded logarithms), though it remains acceptable to let the action space S̄ consist of all densities.

Proof. By convexity and the chain rule [as in Barron (1987)],

(4) E_θ D(p_θ ∥ p̂) ≤ E_θ [n^{-1} ∑_{i=0}^{n-1} D(P_{X_{i+1}|θ} ∥ P̂_{X_{i+1}|X^i})]
= n^{-1} ∑_{i=0}^{n-1} E_θ log [p(X_{i+1} | θ)/p̂_i(X_{i+1})]
= n^{-1} E_θ log [p(X^n | θ)/p(X^n)]
= n^{-1} D(P_{X^n|θ} ∥ P_{X^n})
≤ n^{-1} (V(ε_n) + nε_n²) = 2ε_n²,

where the last inequality is derived as in (2). Combining this upper bound with the lower bound from Theorem 1 completes the proof of Theorem 2.

If ε_{n,d}² and 2ε_n² converge to 0 at the same rate, then the minimax rate of convergence is identified by Theorem 2.

For ε_{n,d}² and ε_n² to be of the same order, it is sufficient that the following two conditions hold (for a proof of this simple fact, see Lemma 4 in the Appendix).

Condition 1 (Metric entropy equivalence). There exist positive constants a, b and c such that when ε is small enough,

M(ε) ≤ V(bε) ≤ cM(aε).

Condition 2 (Richness of the function class). For some 0 < α < 1,

lim inf_{ε→0} M(αε)/M(ε) > 1.

Condition 1 is the equivalence of the entropy structure under the square root K-L divergence and under the d distance when ε is small (as is satisfied, for instance, when all the densities in the target class are uniformly bounded above and away from 0 and d is taken to be the Hellinger distance or the L_2 distance). Condition 2 requires the density class to be large enough, namely, that M(ε) approaches ∞ at least polynomially fast in 1/ε as ε → 0; that is, there exists a constant δ > 0 such that M(ε) ⪰ ε^{-δ}. This condition is typical of nonparametric function classes. It is satisfied in particular if M(ε) can be expressed as M(ε) = ε^{-r} κ(ε), where r > 0 and κ(αε)/κ(ε) → 1 as ε → 0. In most situations, the metric entropies are known only up to order, which makes it generally not tractable to check lim inf_{ε→0} M_d(αε)/M_d(ε) > 1 for the exact packing entropy function. That is why Condition 2 is stated in terms of a presumed known bound M(ε) ≤ M_d(ε). Both Conditions 1 and 2 are satisfied if, for instance, M(ε) ≍ V(ε) ≍ ε^{-r} κ(ε) with r and κ(ε) as mentioned above.

Corollary 1. Assume Condition 0 is satisfied for a distance d satisfying A_0 d²(θ, θ') ≤ d_K²(θ, θ') in S̄ and assume {p_θ : θ ∈ S̄} is convex. Under Conditions 1 and 2, we have

min_{θ̂∈S̄_n} max_{θ∈S} E_θ d²(θ, θ̂) ≍ ε_n²,

where ε_n is determined by the equation M_d(ε_n) = nε_n². In particular, if µ is finite and sup_{θ∈S} ‖log p_θ‖_∞ < ∞ and Condition 2 is satisfied, then

min_{θ̂∈S̄_n} max_{θ∈S} E_θ d_K²(θ, θ̂) ≍ min_{θ̂∈S̄_n} max_{θ∈S} E_θ d_H²(θ, θ̂) ≍ min_{θ̂∈S̄_n} max_{θ∈S} E_θ ‖p_θ − p_{θ̂}‖_2² ≍ ε_n²,

where ε_n satisfies M_2(ε_n) = nε_n² or M_H(ε_n) = nε_n².

Corollary 1 is applicable for many smooth nonparametric classes, as we shall see. However, for classes of densities that are not very rich (e.g., finite-dimensional families or analytic densities), the lower bound and the upper bound derived in the above way do not converge at the same rate. For instance, for a finite-dimensional class, both M_K(ε) and M_H(ε) may be of order m log(1/ε) for some constant m ≥ 1, and then ε_n and ε_{n,H} are not of the same order, with ε_n ≍ (log n / n)^{1/2} and ε_{n,H} = o(1/√n). For smooth finite-dimensional models, the minimax risk can be solved using some traditional statistical methods (such as Bayes procedures, the Cramér-Rao inequality, Van Trees' inequality, etc.), but these techniques require more than the entropy condition.

If local entropy conditions are used instead of conditions on the global entropy, results can be obtained that are suitable for both parametric and nonparametric families of densities (see Section 7).

2.2. Minimax rates under L_2 loss. In this subsection, we derive minimax bounds for L_2 risk without requiring K-L covering entropy. Let F be a class of density functions f with respect to a probability measure µ on a measurable set X such as [0,1]. (Typically X will be taken to be a compact set, though it need not be; we assume only that the dominating measure µ is finite, and it is then normalized to be a probability measure.) Let the packing entropy of F be M_q(ε) under the L_q(µ) metric.

To derive minimax upper bounds, we derive a lemma that relates the L_2 risk for densities that may be zero to the corresponding risk for densities bounded away from zero. In addition to the observed i.i.d. sample X_1, X_2, ..., X_n from f, let Y_1, Y_2, ..., Y_n be a sample generated i.i.d. from the uniform distribution on X with respect to µ (generated independently of X_1, ..., X_n). Let Z_i be X_i or Y_i with probability (1/2, 1/2) according to the outcome of Bernoulli(1/2) random variables V_i generated independently for i = 1, ..., n. Then Z_i has density g(x) = (f(x) + 1)/2. Clearly the new density g is bounded below (away from 0), whereas the original densities need not be. Let F̃ = {g : g = (f + 1)/2, f ∈ F} be the new density class.

Lemma 1. The minimax L_2 risks of the two classes F and F̃ have the following relationship:

min_{f̂} max_{f∈F} E_{X^n} ‖f̂ − f‖_2² ≤ 4 min_{ĝ} max_{g∈F̃} E_{Z^n} ‖g − ĝ‖_2²,

where the minimization on the left-hand side is over all estimators based on X_1, ..., X_n and the minimization on the right-hand side is over all estimators based on n independent observations from g. Generally, for q ≥ 1 and q ≠ 2, we have

min_{f̂} max_{f∈F} E_{X^n} ‖f̂ − f‖_q^q ≤ 4^q min_{ĝ} max_{g∈F̃} E_{Z^n} ‖g − ĝ‖_q^q.

Proof. We change the estimation of f to another estimation problem and show that the minimax risk of the original problem is upper bounded by the minimax risk of the new class. From any estimator in the new class (e.g., a minimax estimator), an estimator in the original problem is determined for which the risk is not greater than a multiple of the risk in the new class.

Let ḡ be any density estimator of g based on Z_i, i = 1, ..., n. Fix q ≥ 1. Let ĝ be the density that minimizes ‖h − ḡ‖_q over functions in the set H = {h : h(x) ≥ 1/2, ∫ h(x) dµ = 1}. (If the minimizer does not exist, then one can choose ĝ_ε within ε of the infimum, proceed similarly as follows, and finally obtain the same general upper bound in Lemma 1 by letting ε → 0.)

Then by the triangle inequality and because g ∈ H, ‖g − ĝ‖_q^q ≤ 2^{q-1}‖g − ḡ‖_q^q + 2^{q-1}‖ĝ − ḡ‖_q^q ≤ 2^q‖g − ḡ‖_q^q. For q = 2, this can be improved to ‖g − ĝ‖_2 ≤ ‖g − ḡ‖_2 by Hilbert space convex analysis [see, e.g., Lemma 9 in Yang and Barron (1997)].

We now focus on the proof of the assertion for the L_2 case. The proof for general L_q is similar. We construct a density estimator for f. Note that f(x) = 2g(x) − 1, so let f̂_rand(x) = 2ĝ(x) − 1. Then f̂_rand(x) is a nonnegative and normalized probability density estimator and depends on X_1, ..., X_n, Y_1, ..., Y_n and the outcomes of the coin flips V_1, ..., V_n. So it is a randomized estimator. The squared L_2 loss of f̂_rand is bounded as follows:

∫ (f(x) − f̂_rand(x))² dµ = ∫ (2g(x) − 2ĝ(x))² dµ ≤ 4‖g − ḡ‖_2².

To avoid randomization, we may replace f̂_rand(x) by its expected value over Y_1, ..., Y_n and the coin flips V_1, ..., V_n to get f̂(x) with

E_{X^n}‖f̂ − f‖_2² = E_{X^n}‖E_{Y^n,V^n} f̂_rand − f‖_2² ≤ E_{X^n} E_{Y^n,V^n}‖f̂_rand − f‖_2² = E_{Z^n}‖f̂_rand − f‖_2² ≤ 4E_{Z^n}‖g − ḡ‖_2²,

where the first inequality is by convexity and the second identity holds because f̂_rand depends on (X^n, Y^n, V^n) only through Z^n. Thus max_{f∈F} E_{X^n}‖f̂ − f‖_2² ≤ 4 max_{g∈F̃} E_{Z^n}‖g − ḡ‖_2². Taking the minimum over estimators ḡ completes the proof of Lemma 1.

Thus the minimax risk of the original problem is upper bounded by the minimax risk on F̃. Moreover, the ε-entropies are related. Indeed, since ‖f/2 − f'/2‖_2 = (1/2)‖f − f'‖_2, for the new class F̃ the ε-packing entropy under L_2 is M̃_2(ε) = M_2(2ε).

Now we give upper and lower bounds on the minimax L_2 risk. Let us first get an upper bound. For the new class, the square root K-L divergence is upper bounded by a multiple of the L_2 distance. Indeed, for densities g_1, g_2 ∈ F̃,

D(g_1 ∥ g_2) ≤ ∫ (g_1 − g_2)²/g_2 dµ ≤ 2 ∫ (g_1 − g_2)² dµ,

where the first inequality is the familiar relationship between the K-L divergence and the chi-square distance, and the second inequality follows because g_2 is lower bounded by 1/2. Let Ṽ_K(ε) denote the d_K covering entropy of F̃. Then Ṽ_K(ε) ≤ M̃_2(ε/√2) = M_2(√2 ε). Let ε_n be chosen such that M_2(√2 ε_n) = nε_n². From Theorem 2, there exists a density estimator ĝ_0 such that max_{g∈F̃} E_{Z^n} D(g ∥ ĝ_0) ≤ 2ε_n². It follows that max_{g∈F̃} E_{Z^n} d_H²(g, ĝ_0) ≤ 2ε_n² and, by Pinsker's inequality, max_{g∈F̃} E_{Z^n} ‖g − ĝ_0‖_1² ≤ 4ε_n². Consequently, by Lemma 1, min_{f̂} max_{f∈F} E_{X^n}‖f − f̂‖_1 ≤ 4(4ε_n²)^{1/2} = 8ε_n. To get a good estimator in terms of L_2 risk, we assume sup_{f∈F} ‖f‖_∞ ≤ L < ∞.

Let ĝ be the density in F̃ that is closest to ĝ_0 in Hellinger distance. Then by the triangle inequality,

max_{g∈F̃} E_{Z^n} d_H²(g, ĝ) ≤ 2 max_{g∈F̃} E_{Z^n} d_H²(g, ĝ_0) + 2 max_{g∈F̃} E_{Z^n} d_H²(ĝ, ĝ_0) ≤ 4 max_{g∈F̃} E_{Z^n} d_H²(g, ĝ_0) ≤ 8ε_n².

Now because both g and ĝ are bounded by (L + 1)/2,

∫ (g − ĝ)² dµ = ∫ (√g − √ĝ)²(√g + √ĝ)² dµ ≤ 2(L + 1) d_H²(g, ĝ).

Thus max_{g∈F̃} E_{Z^n}‖g − ĝ‖_2² ≤ 16(L + 1)ε_n². Using Lemma 1 again, we have an upper bound on the minimax squared L_2 risk. The action space S̄ consists of all probability densities.

Theorem 3. Let M_2(ε) be the L_2 packing entropy of a density class F with respect to a probability measure. Let ε_n satisfy M_2(√2 ε_n) = nε_n². Then

min_{f̂} max_{f∈F} E_{X^n}‖f − f̂‖_1 ≤ 8ε_n.

If, in addition, sup_{f∈F} ‖f‖_∞ ≤ L < ∞, then

min_{f̂} max_{f∈F} E‖f − f̂‖_2² ≤ 64(L + 1)ε_n².

The above result upper bounds the minimax L_1 risk and L_2 risk (under sup_{f∈F} ‖f‖_∞ < ∞ for L_2) using only the L_2 metric entropy. Using the relationship between L_q norms, namely ‖f − f̂‖_q ≤ ‖f − f̂‖_2 for 1 ≤ q < 2, under sup_{f∈F} ‖f‖_∞ < ∞ we have

min_{f̂} max_{f∈F} E_{X^n}‖f − f̂‖_q² ⪯ ε_n²  for 1 ≤ q ≤ 2.

To get a minimax lower bound, we use the following assumption, which is satisfied by many classical classes such as Besov, Lipschitz, the class of monotone densities and more.

Condition 3. There exist at least one density f̄ ∈ F with min_x f̄(x) = C > 0 and a positive constant α ∈ (0, 1) such that F_0 = {(1 − α)f̄ + αg : g ∈ F} ⊂ F.

For a convex class of densities, Condition 3 is satisfied if there is at least one density bounded away from zero. Under Condition 3, the subclass F_0 has L_2 packing entropy M_2^0(ε) = M_2(ε/α), and for two densities f_1 and f_2 in F_0,

D(f_1 ∥ f_2) ≤ ∫ (f_1 − f_2)²/f_2 dµ ≤ ((1 − α)C)^{-1} ∫ (f_1 − f_2)² dµ.

Thus applying Theorem 1 to F_0, and then applying Theorem 2, we have the following conclusion.

Theorem 4. Suppose Condition 3 is satisfied. Let M_2(ε) be the L_2 packing entropy of the density class F with respect to a probability measure, let ε̄_n satisfy M_2(((1 − α)C)^{1/2} ε̄_n/α) = n ε̄_n², and let ε̃_n be chosen such that M_2(ε̃_n/α) = 4n ε̄_n² + 2 log 2. Then

min_{f̂} max_{f∈F} E_{X^n}‖f − f̂‖_2² ≥ ε̃_n²/8.

Moreover, if the class F is rich under the L_2 distance (Condition 2), then with ε_n determined by M_2(ε_n) = nε_n²:

(i) If M_2(ε) ≍ M_1(ε) holds, then min_{f̂} max_{f∈F} E_{X^n}‖f̂ − f‖_1 ≍ ε_n.
(ii) If sup_{f∈F} ‖f‖_∞ < ∞, then min_{f̂} max_{f∈F} E_{X^n}‖f − f̂‖_2² ≍ ε_n².

Using the relationship between the L_2 and L_q (1 ≤ q < 2) distances and applying Theorem 1, we have the following corollary.

Corollary 2. Suppose F is rich under both the L_2 distance and the L_q distance for some q ∈ [1, 2). Assume Condition 3 is satisfied. Let ε_{n,q} satisfy M_q(ε_{n,q}) = nε_{n,q}². Then

ε_{n,q}² ⪯ min_{f̂} max_{f∈F} E_{X^n}‖f − f̂‖_q² ⪯ ε_n².

If the packing entropies under L_2 and L_q are equivalent (which is the case for many familiar nonparametric classes; see Section 6 for examples), then the above upper and lower bounds converge at the same rate. Generally, for a uniformly upper bounded density class F on a compact set, because ∫ |f − g|² dµ ≤ ∫ (f + g)|f − g| dµ, we know M_1(ε) ≤ M_2(ε) ≤ M_1(ε²/(2 sup_{f∈F} ‖f‖_∞)). The corresponding lower bound for the L_1 risk may then vary from ε_n to ε_n², depending on how different the two entropies are [see also Birgé (1986)].

2.3. Minimax rates under K-L and Hellinger loss. For the square root K-L divergence, Condition 0 is not necessarily satisfied for general classes of densities. When it is not satisfied (for instance, if the densities in the class have different supports), the following lemma is helpful to lower bound as well as upper bound the K-L risk using only the Hellinger metric entropy, by relating it to the covering entropy under d_K. We consider estimating a density defined on X with respect to a measure µ with µ(X) = 1 for the rest of this section and Section 2.4.

Lemma 2. For each probability density g, positive T and 0 < ε ≤ √2, there exists a probability density g̃ such that for every density f with ‖f‖_∞ ≤ T and d_H(f, g) ≤ ε, the K-L divergence D(f ∥ g̃) satisfies

D(f ∥ g̃) ≤ (log(9T/(4ε²)) + T^{1/2} + 2) ε².

Bounds analogous to Lemma 2 are in Barron, Birgé and Massart [(1999), Proposition 1] and Wong and Shen [(1995), Theorem 5]. A proof of this lemma is given in the Appendix.

We now show how Lemma 2 can be used to give rates of convergence under Hellinger and K-L loss without forcing the density to be bounded away from zero. Let F be a density class with ‖f‖_∞ ≤ T for each f ∈ F. Assume, for simplicity, that the metric entropy M_H(ε) under d_H is of order ε^{-1/α} for some α > 0. Then, by replacing each g in a Hellinger covering with the associated g̃, Lemma 2 implies that we obtain a d_K covering with

V_K(ε) ⪯ ((log(1/ε))^{1/2}/ε)^{1/α},

and we obtain the associated upper bound on the K-L risk from Theorem 2. For the lower bound, use the fact that d_K is no smaller than the Hellinger distance, for which we have the packing entropy. (Note that the d_K packing entropy, which may be infinite, may not be used here since Condition 0 is not satisfied.) Consequently, from Theorems 1 and 2, we have

(n log n)^{-2α/(2α+1)} ⪯ min_{f̂} max_{f∈F} E d_H²(f, f̂) ≤ min_{f̂} max_{f∈F} E D(f ∥ f̂) ⪯ n^{-2α/(2α+1)} (log n)^{1/(2α+1)}.

Here, the minimization is over all estimators f̂; that is, S̄ is the class of all densities. Thus even if the densities in F may be 0 or near 0, the minimax K-L risk is within a logarithmic factor of the squared Hellinger risk. See Barron, Birgé and Massart (1999) for related conclusions.

2.4. Some more results for risks when densities may be 0. In this section, we show that by modifying a nonparametric class of densities with uniformly bounded logarithms to allow the densities to approach zero at some points or even vanish on some subsets, the minimax rates of convergence under K-L and Hellinger (and L_2) loss may remain unchanged compared to those of the original class. Densities in the new class are obtained from the original nonparametric class by multiplying by members of a smaller class (often a parametric class) of functions that may be near zero. The result is applicable to the following example.

Example (Densities supported on an unknown union of k intervals). Let

F = {h(x) ∑_{i=0}^{k-1} b_{i+1} 1{a_i ≤ x < a_{i+1}}/c : h ∈ H, 0 = a_0 < a_1 < a_2 < ... < a_k = 1, ∑_{i=0}^{k-1} b_{i+1}(a_{i+1} − a_i) ≥ γ_1 and 0 ≤ b_i ≤ γ_2, 1 ≤ i ≤ k}

(c is the normalizing constant). Here H is a class of functions with uniformly bounded logarithms, k is a positive integer, and γ_1 and γ_2 are positive constants. The constants γ_1 and γ_2 force the densities in F to be uniformly upper bounded. These densities may be 0 on some intervals and arbitrarily close to 0 on others. Note that if the densities in H are continuous, then the densities in F have at most k − 1 points of discontinuity. For instance, if H is a Lipschitz class, then the functions in F are piecewise Lipschitz.

Somewhat more generally, let G be a class of nonnegative functions satisfying g ≤ C and ∫ g dµ ≥ c for all g ∈ G. Though the functions in G are nonnegative, they may equal zero on some subsets. Suppose 0 < c ≤ h ≤ C < ∞ for all h ∈ H. Consider a class of densities with respect to µ:

F = { f(x) = h(x)g(x)/∫ h(x)g(x) dµ : h ∈ H, g ∈ G }.

Let Ḡ = {ḡ = g/∫ g dµ : g ∈ G} be the density class corresponding to G and similarly define H̄. Let M_2(ε, H̄) be the packing entropy of H̄ under the L_2 distance and let V_K(ε, F) and V_K(ε, Ḡ) be the covering entropies, as defined in Section 2, of F and Ḡ, respectively, under d_K. The action space S̄ consists of all probability densities.

Theorem 5. Suppose V_K(ε, Ḡ) ≤ A_1 M_2(A_2 ε, H̄) for some positive constants A_1, A_2 and suppose the class H̄ is rich in the sense that lim inf_{ε→0} M_2(αε, H̄)/M_2(ε, H̄) > 1 for some 0 < α < 1. If G contains a constant function, then

min_{f̂} max_{f∈F} E_f D(f ∥ f̂) ≍ min_{f̂} max_{f∈F} E_f d_H²(f, f̂) ≍ min_{f̂} max_{f∈F} E_f ‖f − f̂‖_2² ≍ ε_n²,

where ε_n is determined by M_2(ε_n, H̄) = nε_n².

The result basically says that if G is smaller than the class H in an ε-entropy sense, then for the new class, being 0 or close to 0 due to G does not hurt the K-L risk rate. To apply the result, we still need to bound the covering entropy of Ḡ under d_K. One approach is as follows. Suppose the Hellinger metric entropy of Ḡ satisfies M_H(ε/(log(1/ε))^{1/2}, Ḡ) ⪯ M_2(Aε, H̄) for some constant A > 0. From Lemma 2, V_K(ε, Ḡ) ⪯ M_H(A'ε/(log(1/ε))^{1/2}, Ḡ) for some constant A' > 0 when ε is sufficiently small (note that the elements g̃ of the cover from Lemma 2 are not necessarily in F, though they are in the set S̄ of all probability densities). Then we have V_K(ε, Ḡ) ⪯ M_2(AA'ε, H̄). This covers the example above (in which case G is a parametric family) and it even allows G to be almost as large as H. An example is H = {h : ‖log h‖_{B^α_{σ,q}} ≤ C} and G = {g : ‖g‖_{B^{α'}_{σ,q}} ≤ C, g ≥ 0 and ∫ g dµ = 1} with α' bigger than, but arbitrarily close to, α (for a definition of Besov classes, see Section 6). Note that here the density can be 0 on infinitely many subintervals. Other examples are given in Yang and Barron (1997).

Proof. By Theorem 2, we have min_{f̂} max_{f∈F} E_f D(f ∥ f̂) ⪯ ε̄_n², where ε̄_n satisfies V_K(ε̄_n, F) = nε̄_n². Under the entropy condition that G is smaller than H, it can be shown that V_K(ε, F) ≤ Ã_1 M_2(Ã_2 ε, H̄) for some constants Ã_1 and Ã_2 [see Yang and Barron (1997) for details]. Then, under the assumption of the richness of H̄, we have ε̄_n = O(ε_n). Thus ε_n² upper bounds the minimax risk rates under both d_K² and d_H². Because G contains a constant function, H̄ is a subset of F.

Since the log-densities in H̄ are uniformly bounded, the L_2 metric entropies of H and H̄ are of the same order. Thus, taking S_0 = H̄, the lower bound rate under the K-L, squared Hellinger or squared L_2 distance is of order ε_n² by Corollary 1. Because the densities in F are uniformly upper bounded, the L_2 distance between two densities in F is upper bounded by a multiple of the Hellinger distance and of d_K. Thus, under the assumptions in the theorem, the L_2 metric entropy of F satisfies M_2(ε, F) ⪯ V_K(Aε, F) ⪯ M_2(A'ε, H̄) for some positive constants A and A'. Consequently, the minimax L_2 risk is upper bounded by order ε_n² by Theorem 3. This completes the proof of Theorem 5.

3. Applications in data compression and regression. Two cases in which the K-L divergence plays a direct role are data compression, where the total K-L divergence provides the redundancy of universal codes, and regression with Gaussian errors, where the individual K-L divergence between two Gaussian models is the customary squared L_2 distance. A small amount of additional work is required in the regression case to convert a predictive density estimator into a regression estimator via a minimum Hellinger distance argument.

3.1. Data compression. Let X_1, ..., X_n be an i.i.d. sample of discrete random variables from p_θ(x), θ ∈ S. Let q_n(x_1, ..., x_n) be a density (probability mass) function. The redundancy of the Shannon code using density q_n is the difference between its expected codelength and the expected codelength of the Shannon code using the true density p_θ(x_1, ..., x_n), that is, D(p_θ^n ∥ q_n). Formally, we examine the minimax properties of the game with loss D(p_θ^n ∥ q_n) for continuous random variables also. In that case, D(p_θ^n ∥ q_n) corresponds to the redundancy in the limit of fine quantization of the random variable [see, e.g., Clarke and Barron (1990), pages 459 and 460]. The asymptotics of redundancy lower bounds have been considered by Rissanen (1984), Clarke and Barron (1990, 1994), Rissanen, Speed and Yu (1992) and others. These results were derived for smooth parametric families or a specific smooth nonparametric class. We here give general redundancy lower bounds for nonparametric classes.

The key property revealed by the chain rule is the relationship between the minimax value of the game with loss D(p_θ^n ∥ q_n) and the minimax cumulative K-L risk,

min_{q_n} max_{θ∈S} D(p_θ^n ∥ q_n) = min_{(p̂_i)} max_{θ∈S} ∑_{i=0}^{n-1} E_θ D(p_θ ∥ p̂_i),

where the minimization on the left is over all joint densities q_n(x_1, ..., x_n) and the minimization on the right is over all sequences of estimators p̂_i based on samples of size i = 0, 1, ..., n − 1 (for i = 0, it is any fixed density). Indeed, one has D(p_θ^n ∥ q_n) = ∑_{i=0}^{n-1} E_θ D(p_θ ∥ p̂_i) when q_n(x_1, ..., x_n) = ∏_{i=0}^{n-1} p̂_i(x_{i+1}) [cf. Barron and Hengartner (1998)]. Let R_n = min_{q_n} max_{θ∈S} D(p_θ^n ∥ q_n) be the minimax total K-L divergence and let r_n = min_{p̂_n} max_{θ∈S} E_θ D(p_θ ∥ p̂_n) be the individual minimax K-L risk.

Then

(5) n r_{n-1} ≤ R_n ≤ ∑_{i=0}^{n-1} r_i.

Here the first inequality, r_{n-1} ≤ n^{-1} R_n, follows from noting that, as in (4), for any joint density q_n on the sample space of X_1, ..., X_n there exists an estimator p̂ such that E_θ D(p_θ ∥ p̂) ≤ n^{-1} D(p_θ^n ∥ q_n) for all θ ∈ S. The second inequality, R_n ≤ ∑_{i=0}^{n-1} r_i, comes from the fact that the maximum (over θ in S) of a sum of risks is not greater than the sum of the maxima. In particular, we see that r_n ≍ R_n/n when r_n ≍ n^{-1} ∑_{i=0}^{n-1} r_i. Effectively, this means that the minimax individual K-L risk and 1/n times the minimax total K-L divergence match asymptotically when r_n converges at a rate sufficiently slower than 1/n (e.g., n^{-ρ} with 0 < ρ < 1).

Now let d(θ, θ') be a metric on S and assume that {p_θ : θ ∈ S̄} contains all probability densities on X. Let M_d(ε) be the packing entropy of S under d and let V(ε) be an upper bound on the covering entropy V_K(ε) of S under d_K. Choose ε_n such that ε_n² = V(ε_n)/n and choose ε_{n,d} such that M_d(ε_{n,d}) = 4nε_n² + 2 log 2. Based on Theorem 1, (2) and inequality (5), we have the following result.

Corollary 3. Assume that D(p_θ ∥ p_{θ'}) ≥ A_0 d²(θ, θ') for all θ, θ' ∈ S. Then we have

(A_0/8) nε_{n,d}² ≤ min_{q_n} max_{θ∈S} D(p_θ^n ∥ q_n) ≤ 2nε_n²,

where the minimization is over all densities on X^n.

Two choices of d that satisfy the requirements are the Hellinger distance and the L_1 distance. When interest is focused on the cumulative K-L risk (or on the individual risk r_n in the case that r_n ≍ n^{-1} ∑_{i=0}^{n-1} r_i), direct proofs of suitable bounds are possible without the use of Fano's inequality. See Haussler and Opper (1997) for new results in that direction. Another proof using an idea of Rissanen (1984) is in Yang and Barron (1997).

3.2. Application in nonparametric regression. Consider the regression model

y_i = u(x_i) + ɛ_i, i = 1, ..., n.

Suppose the errors ɛ_i, 1 ≤ i ≤ n, are i.i.d. with the Normal(0, σ²) distribution. The explanatory variables x_i, 1 ≤ i ≤ n, are i.i.d. with a fixed distribution P. The regression function u is assumed to be in a function class U. For this case, the square root K-L divergence between the joint densities of (X, Y) in the family is a metric. Let ‖u − v‖_2 = (∫ (u(x) − v(x))² dP)^{1/2} be the L_2(P) distance with respect to the measure induced by X. Let M_2(ε), and similarly for q ≥ 1 let M_q(ε), be the ε-packing entropy under L_2(P) and L_q(P), respectively.

Assume M_2(ε) and M_q(ε) are both rich (in accordance with Condition 2). Choose ε_n such that M_2(ε_n) = nε_n². Similarly, let ε_{n,q} be determined by M_q(ε_{n,q}) = nε_{n,q}².

Theorem 6. Assume sup_{u∈U} ‖u‖_∞ ≤ L. Then

min_û max_{u∈U} E‖u − û‖_2² ≍ ε_n².

For the minimax L_q(P) risk, we have

min_û max_{u∈U} E‖u − û‖_q ⪰ ε_{n,q}.

If, further, M_2(ε) ≍ M_q(ε) for 1 ≤ q < 2, then for the L_q(P) risk we have

min_û max_{u∈U} E‖u − û‖_q ≍ ε_n.

Proof. With no loss of generality, suppose that P_X has a density h(x) with respect to a measure λ. Let p_u(x, y) = (2πσ²)^{-1/2} exp(−(y − u(x))²/(2σ²)) h(x) denote the joint density of (X, Y) with regression function u. Then D(p_u ∥ p_v) = (1/(2σ²)) E[(Y − v(X))² − (Y − u(X))²] reduces, as is well known, to (1/(2σ²)) ∫ (u(x) − v(x))² h(x) dλ, so that d_K is equivalent to the L_2(P) distance. The lower rates then follow from Theorem 1 together with the richness assumption on the entropies.

We next determine upper bounds for regression by specializing the bound from Theorem 2. We assume ‖u‖_∞ ≤ L uniformly for u ∈ U. Theorem 2 provides a density estimator p̂_n such that max_{u∈U} E D(p_u ∥ p̂_n) ≤ 2ε_n². It follows that max_{u∈U} E d_H²(p_u, p̂_n) ≤ 2ε_n². [A similar conclusion is available in Birgé (1986), Theorem 3.1.] Here we take advantage of the fact that when the density h(x) is fixed, p̂_n(x, y) takes the form h(x) ĝ_n(y | x), where ĝ_n(y | x) is an estimate of the conditional density of y given x (it happens to be a mixture of Gaussians using a posterior based on a uniform prior on ε-nets). For given x and (X_i, Y_i)_{i=1}^n, let ũ_n(x) be the minimizer, over choices of z with |z| ≤ L, of the Hellinger distance d_H(ĝ_n(· | x), φ_z) between ĝ_n(· | x) and the normal density φ_z(y) with mean z and the given variance σ². Then ũ_n(x) is an estimator of u(x) based on (X_i, Y_i)_{i=1}^n. By the triangle inequality, given x and (X_i, Y_i)_{i=1}^n,

d_H(φ_{u(x)}, φ_{ũ_n(x)}) ≤ d_H(φ_{u(x)}, ĝ_n(· | x)) + d_H(φ_{ũ_n(x)}, ĝ_n(· | x)) ≤ 2 d_H(φ_{u(x)}, ĝ_n(· | x)).

It follows that max_{u∈U} E d_H²(p_u, p_{ũ_n}) ≤ 4 max_{u∈U} E d_H²(p_u, p̂_n) ≤ 8ε_n². Now

E d_H²(p_u, p_{ũ_n}) = 2E ∫ h(x) (1 − exp(−(u(x) − ũ_n(x))²/(8σ²))) dλ.

The concave function 1 − e^{-v} is above the chord (v/B)(1 − e^{-B}) for 0 ≤ v ≤ B. Thus, using v = (u(x) − ũ_n(x))²/(8σ²) and B = L²/(2σ²), we obtain

max_{u∈U} E ∫ (u(x) − ũ_n(x))² h(x) dλ ≤ 2L²(1 − exp(−L²/(2σ²)))^{-1} max_{u∈U} E d_H²(p_u, p_{ũ_n}) ⪯ ε_n².

This completes the proof of Theorem 6.

4. Linear approximation and minimax rates. In this section, we apply the main results and known metric entropy bounds to give a general conclusion on the relationship between linear approximation and minimax rates of convergence. A more general and detailed treatment is in Yang and Barron (1997).

Let Φ = {φ_1 = 1, φ_2, ..., φ_k, ...} be a fundamental sequence in L_2 (that is, linear combinations of the φ_i are dense in L_2). Let Γ = {γ_0, γ_1, ..., γ_k, ...} be a decreasing sequence of positive numbers for which there exist 0 < c' < c < 1 such that

(6) c'γ_k ≤ γ_{2k} ≤ cγ_k,

as is true for γ_k ≍ k^{-α} and also for γ_k ≍ k^{-α}(log k)^β, α > 0, β ∈ R. Let η_0(f) = ‖f‖_2 and, for k ≥ 1, let η_k(f) = min_{a_1,...,a_k} ‖f − ∑_{i=1}^k a_i φ_i‖_2 be the kth degree of approximation of f ∈ L_2 by the system Φ. Let F(Γ) be the set of all functions in L_2 with approximation errors bounded by Γ; that is,

F(Γ) = {f ∈ L_2 : η_k(f) ≤ γ_k, k = 0, 1, ...}.

These are called full approximation sets. Lorentz (1966) gives metric entropy bounds for these classes (actually in more generality than stated here) and the bounds are used to derive metric entropy orders for a variety of function classes including Sobolev classes. Suppose the functions in F(Γ) are uniformly bounded, that is, sup_{g∈F(Γ)} ‖g‖_∞ ≤ ρ for some positive constant ρ. Let F̄(Γ) be the set of all probability density functions in F(Γ). When γ_0 is large enough, the L_2 metric entropies of F(Γ) and F̄(Γ) are of the same order. Let k_n be chosen such that γ_{k_n}² ≍ k_n/n.

Theorem 7. The minimax rate of convergence for a full approximation set of functions is determined simply as follows:

min_{f̂} max_{f∈F̄(Γ)} E‖f̂ − f‖_2² ≍ k_n/n.

A similar result holds for regression. Note that there is no special requirement on the basis functions (they need not even be continuous). For such a case, it seems hard to find a local packing set for the purpose of directly applying the lower bounding results of Birgé (1983). The conclusion of the theorem follows from Theorem 4. Basically, from Lorentz, the metric entropy of F(Γ) is of order k(ε) = min{k : γ_k ≤ ε}, and F̄(Γ) is rich in the L_2 distance. As a consequence, ε_n determined by M(ε_n) ≍ nε_n² also balances γ_k² and k/n. As an illustration, consider the functions that can be approximated by the linear system Φ with polynomially decreasing approximation error γ_k ≍ k^{-α}, α > 0. Then min_{f̂} max_{f∈F̄(Γ)} E‖f − f̂‖_2² ≍ n^{-2α/(1+2α)}. Similarly, we have the rate n^{-2α/(1+2α)}(log n)^{2β/(1+2α)} if γ_k ≍ k^{-α}(log k)^β.
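As a numerical illustration of the balance γ_{k_n}² ≍ k_n/n behind Theorem 7 (a sketch under the assumed decay γ_k = k^{-α}; not code from the paper):

```python
import numpy as np

def k_n(n, alpha):
    """Pick k balancing the squared approximation error k^(-2*alpha) against k/n."""
    ks = np.arange(1, n + 1)
    return int(ks[np.argmin(np.abs(ks ** (-2.0 * alpha) - ks / n))])

alpha = 1.5
for n in (10**3, 10**4, 10**5):
    k = k_n(n, alpha)
    # Theorem 7 then gives the minimax rate k_n/n, here of order n^(-2*alpha/(1+2*alpha)).
    print(n, k, k / n, n ** (-2 * alpha / (1 + 2 * alpha)))
```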


More information

Statistical inference on Lévy processes

Statistical inference on Lévy processes Alberto Coca Cabrero University of Cambridge - CCA Supervisors: Dr. Richard Nickl and Professor L.C.G.Rogers Funded by Fundación Mutua Madrileña and EPSRC MASDOC/CCA student workshop 2013 26th March Outline

More information

OPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1

OPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1 The Annals of Statistics 1997, Vol. 25, No. 6, 2512 2546 OPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1 By O. V. Lepski and V. G. Spokoiny Humboldt University and Weierstrass Institute

More information

D I S C U S S I O N P A P E R

D I S C U S S I O N P A P E R I N S T I T U T D E S T A T I S T I Q U E B I O S T A T I S T I Q U E E T S C I E N C E S A C T U A R I E L L E S ( I S B A ) UNIVERSITÉ CATHOLIQUE DE LOUVAIN D I S C U S S I O N P A P E R 2014/06 Adaptive

More information

Information and Statistics

Information and Statistics Information and Statistics Andrew Barron Department of Statistics Yale University IMA Workshop On Information Theory and Concentration Phenomena Minneapolis, April 13, 2015 Barron Information and Statistics

More information

Model Selection and Geometry

Model Selection and Geometry Model Selection and Geometry Pascal Massart Université Paris-Sud, Orsay Leipzig, February Purpose of the talk! Concentration of measure plays a fundamental role in the theory of model selection! Model

More information

MMSE Dimension. snr. 1 We use the following asymptotic notation: f(x) = O (g(x)) if and only

MMSE Dimension. snr. 1 We use the following asymptotic notation: f(x) = O (g(x)) if and only MMSE Dimension Yihong Wu Department of Electrical Engineering Princeton University Princeton, NJ 08544, USA Email: yihongwu@princeton.edu Sergio Verdú Department of Electrical Engineering Princeton University

More information

Minimax lower bounds

Minimax lower bounds Minimax lower bounds Maxim Raginsky December 4, 013 Now that we have a good handle on the performance of ERM and its variants, it is time to ask whether we can do better. For example, consider binary classification:

More information

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications.

is a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications. Stat 811 Lecture Notes The Wald Consistency Theorem Charles J. Geyer April 9, 01 1 Analyticity Assumptions Let { f θ : θ Θ } be a family of subprobability densities 1 with respect to a measure µ on a measurable

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model.

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model. Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model By Michael Levine Purdue University Technical Report #14-03 Department of

More information

BERNSTEIN POLYNOMIAL DENSITY ESTIMATION 1265 bounded away from zero and infinity. Gibbs sampling techniques to compute the posterior mean and other po

BERNSTEIN POLYNOMIAL DENSITY ESTIMATION 1265 bounded away from zero and infinity. Gibbs sampling techniques to compute the posterior mean and other po The Annals of Statistics 2001, Vol. 29, No. 5, 1264 1280 CONVERGENCE RATES FOR DENSITY ESTIMATION WITH BERNSTEIN POLYNOMIALS By Subhashis Ghosal University of Minnesota Mixture models for density estimation

More information

Optimal global rates of convergence for interpolation problems with random design

Optimal global rates of convergence for interpolation problems with random design Optimal global rates of convergence for interpolation problems with random design Michael Kohler 1 and Adam Krzyżak 2, 1 Fachbereich Mathematik, Technische Universität Darmstadt, Schlossgartenstr. 7, 64289

More information

A talk on Oracle inequalities and regularization. by Sara van de Geer

A talk on Oracle inequalities and regularization. by Sara van de Geer A talk on Oracle inequalities and regularization by Sara van de Geer Workshop Regularization in Statistics Banff International Regularization Station September 6-11, 2003 Aim: to compare l 1 and other

More information

EECS 750. Hypothesis Testing with Communication Constraints

EECS 750. Hypothesis Testing with Communication Constraints EECS 750 Hypothesis Testing with Communication Constraints Name: Dinesh Krithivasan Abstract In this report, we study a modification of the classical statistical problem of bivariate hypothesis testing.

More information

INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS

INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS STEVEN P. LALLEY AND ANDREW NOBEL Abstract. It is shown that there are no consistent decision rules for the hypothesis testing problem

More information

Lecture 5 Channel Coding over Continuous Channels

Lecture 5 Channel Coding over Continuous Channels Lecture 5 Channel Coding over Continuous Channels I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw November 14, 2014 1 / 34 I-Hsiang Wang NIT Lecture 5 From

More information

CONSIDER an estimation problem in which we want to

CONSIDER an estimation problem in which we want to 2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 4, APRIL 2011 Lower Bounds for the Minimax Risk Using f-divergences, and Applications Adityanand Guntuboyina, Student Member, IEEE Abstract Lower

More information

Class 2 & 3 Overfitting & Regularization

Class 2 & 3 Overfitting & Regularization Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating

More information

Coding on Countably Infinite Alphabets

Coding on Countably Infinite Alphabets Coding on Countably Infinite Alphabets Non-parametric Information Theory Licence de droits d usage Outline Lossless Coding on infinite alphabets Source Coding Universal Coding Infinite Alphabets Enveloppe

More information

arxiv: v4 [cs.it] 17 Oct 2015

arxiv: v4 [cs.it] 17 Oct 2015 Upper Bounds on the Relative Entropy and Rényi Divergence as a Function of Total Variation Distance for Finite Alphabets Igal Sason Department of Electrical Engineering Technion Israel Institute of Technology

More information

Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 2012

Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 2012 Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 202 BOUNDS AND ASYMPTOTICS FOR FISHER INFORMATION IN THE CENTRAL LIMIT THEOREM

More information

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University

More information

Bayesian Nonparametric Point Estimation Under a Conjugate Prior

Bayesian Nonparametric Point Estimation Under a Conjugate Prior University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 5-15-2002 Bayesian Nonparametric Point Estimation Under a Conjugate Prior Xuefeng Li University of Pennsylvania Linda

More information

Near-Potential Games: Geometry and Dynamics

Near-Potential Games: Geometry and Dynamics Near-Potential Games: Geometry and Dynamics Ozan Candogan, Asuman Ozdaglar and Pablo A. Parrilo January 29, 2012 Abstract Potential games are a special class of games for which many adaptive user dynamics

More information

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk John Lafferty School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 lafferty@cs.cmu.edu Abstract

More information

STATISTICAL CURVATURE AND STOCHASTIC COMPLEXITY

STATISTICAL CURVATURE AND STOCHASTIC COMPLEXITY 2nd International Symposium on Information Geometry and its Applications December 2-6, 2005, Tokyo Pages 000 000 STATISTICAL CURVATURE AND STOCHASTIC COMPLEXITY JUN-ICHI TAKEUCHI, ANDREW R. BARRON, AND

More information

Context tree models for source coding

Context tree models for source coding Context tree models for source coding Toward Non-parametric Information Theory Licence de droits d usage Outline Lossless Source Coding = density estimation with log-loss Source Coding and Universal Coding

More information

Some Background Material

Some Background Material Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important

More information

Lecture 21: Minimax Theory

Lecture 21: Minimax Theory Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways

More information

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava

MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS. Maya Gupta, Luca Cazzanti, and Santosh Srivastava MINIMUM EXPECTED RISK PROBABILITY ESTIMATES FOR NONPARAMETRIC NEIGHBORHOOD CLASSIFIERS Maya Gupta, Luca Cazzanti, and Santosh Srivastava University of Washington Dept. of Electrical Engineering Seattle,

More information

ECE 4400:693 - Information Theory

ECE 4400:693 - Information Theory ECE 4400:693 - Information Theory Dr. Nghi Tran Lecture 8: Differential Entropy Dr. Nghi Tran (ECE-University of Akron) ECE 4400:693 Lecture 1 / 43 Outline 1 Review: Entropy of discrete RVs 2 Differential

More information

Lecture 17: Density Estimation Lecturer: Yihong Wu Scribe: Jiaqi Mu, Mar 31, 2016 [Ed. Apr 1]

Lecture 17: Density Estimation Lecturer: Yihong Wu Scribe: Jiaqi Mu, Mar 31, 2016 [Ed. Apr 1] ECE598: Information-theoretic methods in high-dimensional statistics Spring 06 Lecture 7: Density Estimation Lecturer: Yihong Wu Scribe: Jiaqi Mu, Mar 3, 06 [Ed. Apr ] In last lecture, we studied the minimax

More information

21.1 Lower bounds on minimax risk for functional estimation

21.1 Lower bounds on minimax risk for functional estimation ECE598: Information-theoretic methods in high-dimensional statistics Spring 016 Lecture 1: Functional estimation & testing Lecturer: Yihong Wu Scribe: Ashok Vardhan, Apr 14, 016 In this chapter, we will

More information

Empirical Processes: General Weak Convergence Theory

Empirical Processes: General Weak Convergence Theory Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated

More information

Gaussian Estimation under Attack Uncertainty

Gaussian Estimation under Attack Uncertainty Gaussian Estimation under Attack Uncertainty Tara Javidi Yonatan Kaspi Himanshu Tyagi Abstract We consider the estimation of a standard Gaussian random variable under an observation attack where an adversary

More information

Notions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy

Notions such as convergent sequence and Cauchy sequence make sense for any metric space. Convergent Sequences are Cauchy Banach Spaces These notes provide an introduction to Banach spaces, which are complete normed vector spaces. For the purposes of these notes, all vector spaces are assumed to be over the real numbers.

More information

Week 2: Sequences and Series

Week 2: Sequences and Series QF0: Quantitative Finance August 29, 207 Week 2: Sequences and Series Facilitator: Christopher Ting AY 207/208 Mathematicians have tried in vain to this day to discover some order in the sequence of prime

More information

Maximum likelihood: counterexamples, examples, and open problems. Jon A. Wellner. University of Washington. Maximum likelihood: p.

Maximum likelihood: counterexamples, examples, and open problems. Jon A. Wellner. University of Washington. Maximum likelihood: p. Maximum likelihood: counterexamples, examples, and open problems Jon A. Wellner University of Washington Maximum likelihood: p. 1/7 Talk at University of Idaho, Department of Mathematics, September 15,

More information

Lecture 7 Introduction to Statistical Decision Theory

Lecture 7 Introduction to Statistical Decision Theory Lecture 7 Introduction to Statistical Decision Theory I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw December 20, 2016 1 / 55 I-Hsiang Wang IT Lecture 7

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

Semiparametric posterior limits

Semiparametric posterior limits Statistics Department, Seoul National University, Korea, 2012 Semiparametric posterior limits for regular and some irregular problems Bas Kleijn, KdV Institute, University of Amsterdam Based on collaborations

More information

Lecture 4 Noisy Channel Coding

Lecture 4 Noisy Channel Coding Lecture 4 Noisy Channel Coding I-Hsiang Wang Department of Electrical Engineering National Taiwan University ihwang@ntu.edu.tw October 9, 2015 1 / 56 I-Hsiang Wang IT Lecture 4 The Channel Coding Problem

More information

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability... Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

Local Asymptotics and the Minimum Description Length

Local Asymptotics and the Minimum Description Length Local Asymptotics and the Minimum Description Length Dean P. Foster and Robert A. Stine Department of Statistics The Wharton School of the University of Pennsylvania Philadelphia, PA 19104-6302 March 27,

More information

Econ Lecture 3. Outline. 1. Metric Spaces and Normed Spaces 2. Convergence of Sequences in Metric Spaces 3. Sequences in R and R n

Econ Lecture 3. Outline. 1. Metric Spaces and Normed Spaces 2. Convergence of Sequences in Metric Spaces 3. Sequences in R and R n Econ 204 2011 Lecture 3 Outline 1. Metric Spaces and Normed Spaces 2. Convergence of Sequences in Metric Spaces 3. Sequences in R and R n 1 Metric Spaces and Metrics Generalize distance and length notions

More information

Information Theory in Intelligent Decision Making

Information Theory in Intelligent Decision Making Information Theory in Intelligent Decision Making Adaptive Systems and Algorithms Research Groups School of Computer Science University of Hertfordshire, United Kingdom June 7, 2015 Information Theory

More information

Quiz 2 Date: Monday, November 21, 2016

Quiz 2 Date: Monday, November 21, 2016 10-704 Information Processing and Learning Fall 2016 Quiz 2 Date: Monday, November 21, 2016 Name: Andrew ID: Department: Guidelines: 1. PLEASE DO NOT TURN THIS PAGE UNTIL INSTRUCTED. 2. Write your name,

More information

1590 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 6, JUNE Source Coding, Large Deviations, and Approximate Pattern Matching

1590 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 6, JUNE Source Coding, Large Deviations, and Approximate Pattern Matching 1590 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 48, NO. 6, JUNE 2002 Source Coding, Large Deviations, and Approximate Pattern Matching Amir Dembo and Ioannis Kontoyiannis, Member, IEEE Invited Paper

More information

1.1 Basis of Statistical Decision Theory

1.1 Basis of Statistical Decision Theory ECE598: Information-theoretic methods in high-dimensional statistics Spring 2016 Lecture 1: Introduction Lecturer: Yihong Wu Scribe: AmirEmad Ghassami, Jan 21, 2016 [Ed. Jan 31] Outline: Introduction of

More information

Functional Analysis Exercise Class

Functional Analysis Exercise Class Functional Analysis Exercise Class Week 2 November 6 November Deadline to hand in the homeworks: your exercise class on week 9 November 13 November Exercises (1) Let X be the following space of piecewise

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

Consistency of the maximum likelihood estimator for general hidden Markov models

Consistency of the maximum likelihood estimator for general hidden Markov models Consistency of the maximum likelihood estimator for general hidden Markov models Jimmy Olsson Centre for Mathematical Sciences Lund University Nordstat 2012 Umeå, Sweden Collaborators Hidden Markov models

More information

An introduction to Mathematical Theory of Control

An introduction to Mathematical Theory of Control An introduction to Mathematical Theory of Control Vasile Staicu University of Aveiro UNICA, May 2018 Vasile Staicu (University of Aveiro) An introduction to Mathematical Theory of Control UNICA, May 2018

More information

1234 S. GHOSAL AND A. W. VAN DER VAART algorithm to adaptively select a finitely supported mixing distribution. These authors showed consistency of th

1234 S. GHOSAL AND A. W. VAN DER VAART algorithm to adaptively select a finitely supported mixing distribution. These authors showed consistency of th The Annals of Statistics 2001, Vol. 29, No. 5, 1233 1263 ENTROPIES AND RATES OF CONVERGENCE FOR MAXIMUM LIKELIHOOD AND BAYES ESTIMATION FOR MIXTURES OF NORMAL DENSITIES By Subhashis Ghosal and Aad W. van

More information

LECTURE 15: COMPLETENESS AND CONVEXITY

LECTURE 15: COMPLETENESS AND CONVEXITY LECTURE 15: COMPLETENESS AND CONVEXITY 1. The Hopf-Rinow Theorem Recall that a Riemannian manifold (M, g) is called geodesically complete if the maximal defining interval of any geodesic is R. On the other

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

Approximation Theoretical Questions for SVMs

Approximation Theoretical Questions for SVMs Ingo Steinwart LA-UR 07-7056 October 20, 2007 Statistical Learning Theory: an Overview Support Vector Machines Informal Description of the Learning Goal X space of input samples Y space of labels, usually

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations

Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations Introduction to Empirical Processes and Semiparametric Inference Lecture 13: Entropy Calculations Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations Research

More information

Asymptotically Efficient Nonparametric Estimation of Nonlinear Spectral Functionals

Asymptotically Efficient Nonparametric Estimation of Nonlinear Spectral Functionals Acta Applicandae Mathematicae 78: 145 154, 2003. 2003 Kluwer Academic Publishers. Printed in the Netherlands. 145 Asymptotically Efficient Nonparametric Estimation of Nonlinear Spectral Functionals M.

More information

A proof of the existence of good nested lattices

A proof of the existence of good nested lattices A proof of the existence of good nested lattices Dinesh Krithivasan and S. Sandeep Pradhan July 24, 2007 1 Introduction We show the existence of a sequence of nested lattices (Λ (n) 1, Λ(n) ) with Λ (n)

More information

Optimal compression of approximate Euclidean distances

Optimal compression of approximate Euclidean distances Optimal compression of approximate Euclidean distances Noga Alon 1 Bo az Klartag 2 Abstract Let X be a set of n points of norm at most 1 in the Euclidean space R k, and suppose ε > 0. An ε-distance sketch

More information

Information Theory and Hypothesis Testing

Information Theory and Hypothesis Testing Summer School on Game Theory and Telecommunications Campione, 7-12 September, 2014 Information Theory and Hypothesis Testing Mauro Barni University of Siena September 8 Review of some basic results linking

More information

ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS

ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS Bendikov, A. and Saloff-Coste, L. Osaka J. Math. 4 (5), 677 7 ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS ALEXANDER BENDIKOV and LAURENT SALOFF-COSTE (Received March 4, 4)

More information

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation:

The Dirichlet s P rinciple. In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: Oct. 1 The Dirichlet s P rinciple In this lecture we discuss an alternative formulation of the Dirichlet problem for the Laplace equation: 1. Dirichlet s Principle. u = in, u = g on. ( 1 ) If we multiply

More information

Near-Potential Games: Geometry and Dynamics

Near-Potential Games: Geometry and Dynamics Near-Potential Games: Geometry and Dynamics Ozan Candogan, Asuman Ozdaglar and Pablo A. Parrilo September 6, 2011 Abstract Potential games are a special class of games for which many adaptive user dynamics

More information

Information measures in simple coding problems

Information measures in simple coding problems Part I Information measures in simple coding problems in this web service in this web service Source coding and hypothesis testing; information measures A(discrete)source is a sequence {X i } i= of random

More information

Vectors. January 13, 2013

Vectors. January 13, 2013 Vectors January 13, 2013 The simplest tensors are scalars, which are the measurable quantities of a theory, left invariant by symmetry transformations. By far the most common non-scalars are the vectors,

More information

Can we do statistical inference in a non-asymptotic way? 1

Can we do statistical inference in a non-asymptotic way? 1 Can we do statistical inference in a non-asymptotic way? 1 Guang Cheng 2 Statistics@Purdue www.science.purdue.edu/bigdata/ ONR Review Meeting@Duke Oct 11, 2017 1 Acknowledge NSF, ONR and Simons Foundation.

More information

Closest Moment Estimation under General Conditions

Closest Moment Estimation under General Conditions Closest Moment Estimation under General Conditions Chirok Han and Robert de Jong January 28, 2002 Abstract This paper considers Closest Moment (CM) estimation with a general distance function, and avoids

More information

On reduction of differential inclusions and Lyapunov stability

On reduction of differential inclusions and Lyapunov stability 1 On reduction of differential inclusions and Lyapunov stability Rushikesh Kamalapurkar, Warren E. Dixon, and Andrew R. Teel arxiv:1703.07071v5 [cs.sy] 25 Oct 2018 Abstract In this paper, locally Lipschitz

More information

LTI Systems, Additive Noise, and Order Estimation

LTI Systems, Additive Noise, and Order Estimation LTI Systems, Additive oise, and Order Estimation Soosan Beheshti, Munther A. Dahleh Laboratory for Information and Decision Systems Department of Electrical Engineering and Computer Science Massachusetts

More information

Part III. 10 Topological Space Basics. Topological Spaces

Part III. 10 Topological Space Basics. Topological Spaces Part III 10 Topological Space Basics Topological Spaces Using the metric space results above as motivation we will axiomatize the notion of being an open set to more general settings. Definition 10.1.

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Statistical Properties of Numerical Derivatives

Statistical Properties of Numerical Derivatives Statistical Properties of Numerical Derivatives Han Hong, Aprajit Mahajan, and Denis Nekipelov Stanford University and UC Berkeley November 2010 1 / 63 Motivation Introduction Many models have objective

More information