Risk bounds for model selection via penalization


Probab. Theory Relat. Fields 113, 301-413 (1999)

Andrew Barron (1), Lucien Birgé (2), Pascal Massart (3)

1 Department of Statistics, Yale University, P.O. Box 208290, New Haven, CT, USA. e-mail: barron@stat.yale.edu
2 URA CNRS 1321 "Statistique et modèles aléatoires", Laboratoire de Probabilités, boîte 188, Université Paris VI, 4 Place Jussieu, F-75252 Paris Cedex 05, France. e-mail: lb@ccr.jussieu.fr
3 URA CNRS 743 "Modélisation stochastique et Statistique", Bât. 425, Université Paris Sud, Campus d'Orsay, F-91405 Orsay Cedex, France. e-mail: massart@stats.matups.fr

Received: 7 July 1995 / Revised version: 1 November 1997

Abstract. Performance bounds for criteria for model selection are developed using recent theory for sieves. The model selection criteria are based on an empirical loss or contrast function with an added penalty term motivated by empirical process theory and roughly proportional to the number of parameters needed to describe the model divided by the number of observations. Most of our examples involve density or regression estimation settings and we focus on the problem of estimating the unknown density or regression function. We show that the quadratic risk of the minimum penalized empirical contrast estimator is bounded by an index of the accuracy of the sieve. This accuracy index quantifies the trade-off among the candidate models between the approximation error and parameter dimension relative to sample size. If we choose a list of models which exhibit good approximation properties with respect to different classes of smoothness, the estimator can be simultaneously minimax rate optimal in each of those classes. This is what is usually called adaptation. The type of classes of smoothness in which one gets adaptation depends heavily on the list of models. If too many models are involved in order to get accurate approximation of many wide classes of functions simultaneously, it may happen that the estimator is only approximately adaptive (typically up to a slowly varying function of the sample size). We shall provide various illustrations of our method such as penalized maximum likelihood, projection or least squares estimation. The models will involve commonly used finite dimensional expansions such as piecewise polynomials with fixed or variable knots, trigonometric polynomials, wavelets, neural nets and related nonlinear expansions defined by superposition of ridge functions.

Work supported in part by the NSF grant ECS, and by URA CNRS 1321 "Statistique et modèles aléatoires" and URA CNRS 743 "Modélisation stochastique et Statistique".

Key words and phrases: Penalization - Model selection - Adaptive estimation - Empirical processes - Sieves - Minimum contrast estimators

Mathematics subject classifications (1991): Primary 62G05, 62G07; secondary 41A25

Contents

1 Introduction
  1.1 What is this paper about?
  1.2 Model selection
  1.3 Sieve methods and approximation theory
  1.4 From model selection to adaptation
2 A glimpse of the essentials
  2.1 Model selection in a toy framework
  2.2 Variable selection
3 Main results with some illustrations
  3.1 The minimum penalized empirical contrast estimation method
      Maximum likelihood density estimation - Projection estimators for density estimation - Classical least squares regression - Minimum-L1 regression
  3.2 Examples of models
      Linear models - Nonlinear models
  3.3 The theorems and their applications
      Maximum likelihood estimators - Projection estimators - Least squares estimators for smooth regression
4 Further examples
  4.1 Nested families of models and analogues
      Ellipsoids with unknown coefficients - Densities with an unknown modulus of continuity - Hölderian densities with unknown anisotropic smoothness - Projection estimators on polynomials with variable degree - Least squares estimators for binary images - Estimation of the support of a density
  4.2 Rich families of models
      Histograms with variable binwidths and spatial adaptation - Neural nets and related nonlinear models - Model selection with a bounded basis
5 Adaptation and model selection
  5.1 Adaptation in the minimax sense
  5.2 Adaptation with respect to the target function and model selection
  5.3 Comparison with other adaptive methods
      Adaptation to the target function - Adaptation in the minimax sense - What's new here?
6 A general theorem in an abstract framework
  6.1 Exponential bounds for the fluctuations of empirical processes
  6.2 A general theorem
  6.3 Penalized projection estimators on linear models
  6.4 Proof of Theorems 8 and 9
7 Proofs of the main results
  7.1 Maximum likelihood estimation
  7.2 Other penalized minimum contrast estimation procedures
      Penalized projection estimation - Penalized least squares and minimum-L1 regression - Estimating the support of a density
  7.3 Analysis of nonlinear models
8 Appendix
  8.1 Combinatorial and covering lemmas
  8.2 Some results in approximation theory
  8.3 Further technical results

1. Introduction

1.1. What is this paper about?

The purpose of this paper is to provide a general method for estimating an unknown function s on the basis of n observations and a finite or countable family of models S_m, m ∈ M_n, using an empirical model selection criterion. Here, by model we have in mind any possible set of finite dimension D_m (in a sense that will be made precise later on and which includes the classical case where S_m is linear). We do not mean that s belongs to any of the models, although this might be the case. Therefore we shall always think of a model S_m as an approximate model for the true s with controlled complexity, and this is the reason why we shall alternatively use the term sieve introduced by Grenander (1981) in connection with approximation theory. For each model S_m we build an estimator ŝ_{m,n} which minimizes some empirical contrast function γ_n over the set S_m. The precise nature of the sampling model will be discussed later. It suffices for now to think of regression and density estimation problems in which, for each candidate function t, the empirical contrast γ_n(t) is, respectively, the empirical average squared error or (1/n) times minus the logarithm of the likelihood. Denoting by R_{m,n}(s) = E_s[d²(s, ŝ_{m,n})] the risk at s of the estimator ŝ_{m,n} (where d denotes some convenient distance), an ideal model should minimize R_{m,n}(s) when m varies. Nevertheless, even if s belongs to some S_{m0}, this true model can be far from being ideal (in the preceding sense). Think of a polynomial fit of a regression curve with 100 observations when the true s is a polynomial of degree 50. Since s is unknown, one cannot determine such an ideal model exactly. Therefore one would like to find a model selection procedure m̂, based on the data, such that the risk of the resulting estimator ŝ_{m̂,n} is equal to the minimal risk inf_{m∈M_n} R_{m,n}(s). This program is too ambitious and we shall content ourselves to consider, instead of the minimal risk, some accuracy index of the form

    a_n(s) = inf_{m∈M_n} { d²(s, S_m) + pen_{m,n} } = inf_{m∈M_n} { inf_{t∈S_m} d²(s, t) + pen_{m,n} },

which majorizes the minimal risk, and to provide a model selection procedure m̂ such that the risk of ŝ_{m̂,n} achieves the accuracy index up to some constant independent of n, which means that

    E_s[ d²(s, ŝ_{m̂,n}) ] ≤ C(s) a_n(s)   for all n.   (1.1)

The procedure m̂ is defined by the minimization over M_n of the penalized empirical contrast {γ_n(ŝ_{m,n}) + pen_{m,n}}. More precisely, it follows from the analysis of Birgé and Massart (1998) that the risk R_{m,n}(s) is typically of

order d²(s, S_m) + D_m/n. The penalty term pen_{m,n} then generally takes the form κ L_m D_m/n, where κ is an absolute constant and L_m ≥ 1 is a weight that satisfies a condition of the type

    Σ_{m∈M_n} exp[−L_m D_m] ≤ 1.

The penalty term takes into account both the difficulty to estimate within the model S_m (role of D_m) and the additional noise due to the size of the list of models (role of L_m), and derives from exponential probability bounds for the empirical contrast. It follows from (1.1) and our choice of the penalty that, for any s,

    E_s[ d²(s, ŝ_{m̂,n}) ] ≤ C(s) inf_{m∈M_n} { d²(s, S_m) + κ L_m D_m / n }.   (1.2)

Although we emphasized the fact that s need not belong to any S_m, the bound (1.2) also makes sense in the parametric case. More precisely, if one starts from a finite collection of models {S_m}_{m∈M} which does not depend on n and fixes L_m = 1 for all m, one finds, whenever s belongs to some S_{m0}, that the risk of ŝ_{m̂,n} is of order n⁻¹ as expected for this parametric framework. More generally, the bound (1.2) permits the reduction of the problem of investigating the performance of the estimator (to within certain constant multipliers) to an investigation of the approximation capabilities of the sieves. Here we have in mind a variety of possible function classes and the accuracy index will be evaluated for each. Since it is not known to which subsets of functions the target s belongs, it is a merit of the accuracy index, and indeed a merit of the minimum penalized empirical contrast estimator ŝ_{m̂,n} in many cases, that the maximum of the accuracy index a_n(s) on certain subclasses of functions is within a constant factor of the minimax optimal value for the risk on these subclasses. For typical choices of models, the target function s is a cluster point, that is, d(s, S_m) tends to zero for some subsequence of models, and the accuracy index quantifies the rate of convergence in a way that is naturally tied to the dimension of the models and the sample size through the penalty term. As a consequence of the accuracy index, there exist many situations where model selection provides estimators ŝ_{m̂,n} which are (at least approximately) simultaneously minimax over a family of classes of functions, usually balls with respect to the seminorms of the classical spaces of smooth functions. Such estimators are then called (approximately) adaptive. We shall now go further into detail to describe our work and relate our results to the existing literature on model selection and adaptive estimation.
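To fix ideas, the selection rule just described can be written in a few lines of code. The sketch below is ours, not the paper's: the contrast values, dimensions, weights and the constant kappa are illustrative placeholders. It simply returns the index m̂ minimizing the penalized contrast γ_n(ŝ_m) + κ L_m D_m / n over a finite list of fitted models:

    import numpy as np

    def select_model(contrasts, dims, weights, n, kappa=2.0):
        """Generic penalized model selection.

        contrasts[m] : value of the empirical contrast gamma_n at the minimum
                       contrast estimator s_hat_m for model m
        dims[m]      : dimension D_m of model m
        weights[m]   : weight L_m >= 1 attached to model m
        n            : sample size
        kappa        : constant in pen(m) = kappa * L_m * D_m / n (its admissible
                       value depends on the framework; 2.0 is only a placeholder)
        Returns the index m_hat minimizing the penalized contrast.
        """
        def penalized(m):
            return contrasts[m] + kappa * weights[m] * dims[m] / n
        return min(contrasts, key=penalized)

    # Toy usage with three candidate models and pre-computed contrast values.
    n = 100
    contrasts = {"small": 0.40, "medium": 0.25, "large": 0.24}
    dims = {"small": 2, "medium": 5, "large": 40}
    weights = {m: 1.0 for m in dims}            # L_m = 1 for a small fixed list
    print(select_model(contrasts, dims, weights, n))   # prints "medium"

In the applications below, the only problem-specific ingredients are the empirical contrast γ_n and the calibration of κ and of the weights L_m.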

1.2. Model selection

Historically, one can consider that model selection begins with the works of Mallows (1973) and Akaike (1973), although classical t or F tests and Bayes tests were long used for model selection. Actually, Daniel and Wood (1971, p. 86) already mention the C_p criterion for variable selection in regression as described by Mallows in a conference dating back to 1964. Our model selection criteria can be viewed as extensions of Mallows' and Akaike's. In order to describe the heuristics underlying Mallows' approach, and more generally model selection based on penalization, let us consider here a typical and historically meaningful example, namely model selection for linear regression with fixed design. Let us consider observations Y_1,...,Y_n such that Y_i = s(x_i) + W_i, where the W_i's are centered independent identically distributed variables with variance one and the x_i's are deterministic values in some space X. We want to estimate the function s defined on X from the Y_i's and measure the error of estimation in terms of the distance derived from the Euclidean norm ‖t‖ = [n⁻¹ Σ_{i=1}^n t²(x_i)]^{1/2}. We consider a family of linear models {S_m}_{m∈M_n} (finite dimensional spaces of functions on X), each model S_m being of dimension D_m. Let s_m be the orthogonal projection of s onto S_m and ŝ_{m,n} be the least squares estimator of s relative to S_m. The risk of ŝ_{m,n} is equal to

    E_s[ ‖ŝ_{m,n} − s‖² ] = ‖s − s_m‖² + D_m/n.

Since ‖s − s_m‖² = ‖s‖² − ‖s_m‖², the ideal model is given by the minimization of −‖s_m‖² + D_m/n + n⁻¹ Σ_{i=1}^n Y_i². Let us consider the normalized residual sum of squares n⁻¹ Σ_{i=1}^n [Y_i − ŝ_{m,n}(x_i)]². Since ‖ŝ_{m,n}‖² − D_m/n is an unbiased estimator of ‖s_m‖², an unbiased estimator of the ideal criterion to minimize is

    n⁻¹ Σ_{i=1}^n [Y_i − ŝ_{m,n}(x_i)]² + 2D_m/n,

which is precisely Mallows' C_p. If we set

    γ_n(t) = (1/n) Σ_{i=1}^n [Y_i − t(x_i)]²,

we notice that ŝ_{m,n} is the minimizer of γ_n over S_m and that γ_n(ŝ_{m,n}) = n⁻¹ Σ_{i=1}^n [Y_i − ŝ_{m,n}(x_i)]². Therefore Mallows' C_p is a minimum penalized empirical contrast criterion in our sense with pen_{m,n} = 2D_m/n. This procedure is expected to work when the variables ‖ŝ_{m,n}‖² concentrate around their expectations uniformly with respect to m. This is not clear at all when the cardinality of M_n is large as compared to n. Since the practical use of Mallows' C_p criterion is for a fixed sample size, it is a natural question to wonder whether the criterion will work for a given value of the cardinality of M_n as a function of n.
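As a concrete illustration of the preceding heuristics (ours, not taken from the paper), the following sketch computes Mallows' C_p = n⁻¹ Σ_i [Y_i − ŝ_{m,n}(x_i)]² + 2D_m/n for a nested family of polynomial models on a fixed design with unit error variance and selects the degree minimizing it; the design, the true s and the degree range are arbitrary choices for the example:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    x = np.linspace(-1.0, 1.0, n)                 # fixed design
    s = 1.0 + x - 2.0 * x**3                      # true regression function
    y = s + rng.standard_normal(n)                # unit-variance Gaussian errors

    def cp(deg):
        """Mallows' C_p = RSS/n + 2*D_m/n for the polynomial model of degree deg
        (dimension D_m = deg + 1), with known error variance equal to one."""
        X = np.vander(x, deg + 1)                 # design matrix of the model S_m
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ coef) ** 2)
        return rss / n + 2.0 * (deg + 1) / n

    degrees = range(0, 16)
    best = min(degrees, key=cp)
    print("selected degree:", best)               # typically 3 for this signal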

This particular problem has been studied by Shibata (1981) for Gaussian errors and by Li (1987) under suitable moment assumptions on the errors (see also Polyak and Tsybakov 1990 for sharper moment conditions in the Fourier case). One can in particular deduce from these works that if the family of models {S_m}_{m∈M_n} is nested and each model has a dimension bounded by n, the heuristics of Mallows' C_p is validated in the sense that the selected index m̂ provides an estimator ŝ_{m̂,n} such that asymptotically the risk E_s[‖s − ŝ_{m̂,n}‖²] is equivalent to inf_{m∈M_n} E_s[‖s − ŝ_{m,n}‖²]. It is worth noticing that this asymptotic equivalence holds provided that s does not belong to any of the S_m's.

Apart from Mallows' C_p, classical empirical penalized criteria for model selection include the AIC, BIC, and MDL criteria proposed by Akaike (1973), Schwarz (1978), and Rissanen (1978 and 1983), respectively. They differ in the structure of the penalties involved, which are based on asymptotic, Bayesian or information-theoretic considerations, and concern various empirical criteria such as maximum likelihood and least squares. For our approach to model selection, the penalty term is motivated solely on the basis of what sorts of statistical risk bounds we can obtain. This conceptual point of view has been previously developed by Barron and Cover (1991) in their attempt to provide a global approach to model selection. Using a class of discretized models, Barron and Cover (1991) or Barron (1991) prove risk bounds for complexity regularization criteria which in some cases include AIC, BIC, and MDL. The work by Barron and Cover is for criteria that possess a minimum description length interpretation and the discretization reduces the choice to a countable set of candidate functions t with penalty L(t)/n satisfying Σ_t 2^{−L(t)} ≤ 1, as required for lengths of uniquely decodable codes. There these authors developed an approximation index called the index of resolvability that is a precursor to our accuracy index a_n(s), and they establish comparable risk bounds for the Hellinger distance in density estimation. The main innovation here, as compared to Barron and Cover (1991), is that we do not require that the models be discrete. This supposes a lot of additional work. The technical approach in this paper is in the spirit of Vapnik (1982). His method of empirical minimization of the risk also heavily relies on an analysis of the behavior of an empirical contrast based on empirical process theory, and his method of structural minimization of the risk is related to a model selection criterion which parallels ours. We use here the tools developed in Birgé and Massart (1998). This makes a difference between Vapnik's approach and ours, both in the formulation of the empirical process conditions and in the techniques. In particular, the introduction of recent isoperimetric inequalities by Talagrand (1994 and 1996) in the case of projection estimators on linear spaces, which has proved its efficiency in Birgé

and Massart (1997) and more recently in Baraud (1997), allows one to obtain, in some cases, precise numerical evaluations of the penalty terms and to justify, even from a non-asymptotic point of view, Mallows' C_p, relaxing some restrictions imposed by Shibata (1981) and Li (1987). However, in general, penalty terms that satisfy our conditions may be different from those which are used in the familiar criteria. For instance we might have to consider heavier penalty terms if necessary in order to take into account the complexity of the family M_n.

As to the implementation of minimum penalized contrast procedures, to be honest, we feel that this paper is merely a starting point which does not directly provide practical devices. However it is already possible to make a few remarks about implementation. The numerical value of the penalty function can be fixed in some cases as mentioned above. Also, as shown in Birgé and Massart (1997), the minimization procedure, even if the number of models is large, can be rather simple in some particular cases of interest since it is partly explicitly solvable, leading for instance to threshold or related estimators.

1.3. Sieve methods and approximation theory

Let us recall that, for a given sieve S of dimension D, d²(s, S) + D/n typically represents the order of magnitude of the risk R_n(s) of a minimum contrast estimator ŝ_n, measured by the mean integrated squared error between s and ŝ_n. The terms d²(s, S) and D/n correspond to the squared bias and variance components, respectively. Given some prior information on s (for instance an upper bound on some smoothness norm) one can, from approximation theory, choose a family {S_m}_{m∈M_n} of finite dimensional sieves such that s is a cluster point of their union. If we select a sieve S_{m_n} in the family according to the presumed properties of the target function, rather than adaptively on the basis of the data, what we study would fall under the general heading of analysis of sieves for function estimation. The choice of S_{m_n} is determined by a particular trade-off between the variance and an upper bound for the squared bias. This method can lead to minimax risk computations. For instance, let us assume that s belongs to some Sobolev ball 𝒮_θ where θ is some known parameter which characterizes this ball. Approximation theory provides privileged families of sieves, like spaces of piecewise polynomials with fixed or variable knots, trigonometric polynomials or wavelet expansions, with optimal approximation properties with respect to those balls. Such a suitable choice of the list of sieves S_m, m ∈ M_n can typically guarantee that for given n and θ the minimax risk R_n(θ) satisfies

    R_n(θ) = inf_{s̃_n} sup_{s∈𝒮_θ} E_s[ d²(s, s̃_n) ] ≥ C_1(θ) inf_{m∈M_n} sup_{s∈𝒮_θ} [ d²(s, S_m) + D_m/n ],   (1.3)

where s̃_n is an arbitrary estimator. Such inequalities can in general be obtained by combining results in approximation theory with classical lower bounds on the minimax risk available in various contexts (density estimation, regression, white noise). Some references, among many others, are Bretagnolle and Huber (1979), Ibragimov and Khas'minskii (1980 and 1981), Nemirovskii (1985), Birgé (1983 and 1986), Donoho and Johnstone (1998). Therefore, if m(n,θ) is a value of m which minimizes sup_{s∈𝒮_θ} d²(s, S_m) + D_m/n, the resulting minimum contrast estimator on the sieve S_{m(n,θ)} is typically minimax (up to some constant independent of n) on 𝒮_θ. The rates of convergence for sieve methods, as introduced by Grenander (1981), have been studied by several authors: Cencov (1982), Grenander and Chow (1985), Cox (1988), Stone (1990 and 1994), Barron and Sheu (1991), Haussler (1992), McCaffrey and Gallant (1994), Shen and Wong (1994), and Van de Geer (1995). The main drawback of the preceding approach is connected with the prior assumption on the unknown s, which is not attractive for practical use although those estimators are relevant for minimax risk computations. As a matter of fact, Stone pointed out that his own works on sieve methods (mainly devoted to splines) were first steps towards data driven methods of nonparametric estimation. More precisely, he had in view to provide some theoretical justifications for MARS (see Friedman 1991).

The mathematical analysis of sequences of finite-dimensional models is at the heart of the techniques that we put to use in our study of adaptive methods of model selection. The point here is that a mere control of the quadratic risk on each sieve is far from being sufficient for achieving our program, as described in Section 1.1. Much more will be needed here and we shall have to make use of the exponential inequalities for the fluctuations of an empirical contrast on a sieve established in Birgé and Massart (1998). We wish to allow a general framework of sieves characterized by their metric dimension and approximation properties. The examples we study typically involve linear combinations of a family of basis functions {ϕ_λ}_{λ∈Λ}, which are parameterized by an index λ that is either discrete or continuous valued. In the discrete index case we have in mind examples of models based on Fourier series, wavelets, polynomials and piecewise polynomials with a discrete set of knot locations. Here the issue is the adaptive selection of the number of terms, including all terms up to some total, or the issue may be which subset of terms provides approximately the best estimate. In the first case there is only one sieve of each dimension and in the second there may be exponentially many candidate models as a function of dimension. The choice of whether subsets are taken has an impact on what types of tradeoffs are possible between bias and variance and on what types of penalty terms are permitted. In both cases the penalty term will be proportional

to the number of terms in the models, but in the latter case there is an additional logarithmic penalty factor that is typically necessary to realize approximately the best subset among exponentially many choices without substantial overfit. In contrast, the use of fixed sets of terms typically allows for a penalty term with no logarithmic factors, but, as we shall quantify, in the absence of subset selection there can be less ability to realize a small statistical risk. In the continuous index case we have in mind flexible nonlinear models including neural nets, trigonometric models with estimated frequencies, piecewise linear hinged hyperplane models and other piecewise polynomials with continuously parameterized knot locations. In these cases we write φ_w instead of ϕ_λ for the terms that are linearly combined, where w is a continuous vector-valued parameter. Not surprisingly, if the terms φ_w depend smoothly on w, the behavior of these nonlinear models is comparable to what is achieved in the discretized index set case with subset selection. We find that these nonlinear models have metric dimension properties that we can bound, but they lack the homogeneity of metric dimension satisfied by linear models with a fixed set of terms. The effect is that once again logarithmic factors arise in the penalty term and in the risk bounds. The advantage due to parsimony of the nonlinear models or the subset selection models is made especially apparent in the case of inference of functions with a high input dimension. In high dimensions, the exponential number of terms in linear models without subset selection precludes their practical use.

1.4. From model selection to adaptation

Let us now consider the possible connections between our approach and adaptive estimation from the minimax point of view. As a matter of fact, the adaptive properties of nonparametric estimators obtained from discrete model selection were already pointed out and studied by Barron and Cover (1991) for a number of classes of functions including Sobolev classes of log-densities, without prior knowledge of which orders of smoothness and which norm bounds are satisfied by the target function. To recover the Barron and Cover result as a special case of our general density estimation results, set each model here to be a single function in their countable list. Barron (1991) extended the discretized model approach to deal also with complexity regularization for least squares regression and other bounded loss functions and applied it to artificial neural network models (see Barron 1994). Let us also mention that the present paper is a companion to the paper by two of us (Birgé and Massart 1997) which explores the role of adaptive estimation for projection estimators of densities using linear models. Applications are given there to wavelet estimation and connections are established with

thresholding of wavelet coefficients and cross-validation criteria. More recently, Yang and Barron (1998) have obtained some results similar to ours for the particular case of log-density models.

Let us now provide a mathematical content to what we mean here by adaptation. Given a family {𝒮_θ}_{θ∈Θ} of sets of functions, we recall that the minimax risk over 𝒮_θ is given by

    R_n(θ) = inf_{s̃_n} sup_{s∈𝒮_θ} E_s[ d²(s, s̃_n) ],

where s̃_n is an arbitrary estimator. We shall call a sequence of estimators (s̃_n)_{n≥1} adaptive in the minimax sense if for every θ ∈ Θ there exists a constant C(θ) such that

    sup_{s∈𝒮_θ} E_s[ d²(s, s̃_n) ] ≤ C(θ) R_n(θ).

If, for instance, one wants to give a precise meaning to the problem of estimating a function s of unknown smoothness, one can assume that s belongs to one of a large collection of balls such as Sobolev balls of variable index of smoothness and radius. Our purpose is to point out the connection between model selection via penalization as described previously and adaptation in the minimax sense. Starting from (1.2) and assuming that L_m = L for all m and n and that C(s) is bounded by C_2(θ) uniformly for s ∈ 𝒮_θ, one derives that

    sup_{s∈𝒮_θ} E_s[ d²(s, ŝ_{m̂,n}) ] ≤ C_3(θ) inf_{m∈M_n} sup_{s∈𝒮_θ} [ d²(s, S_m) + D_m/n ].

If the family {S_m}_{m∈M_n} has convenient approximation properties with respect to the family {𝒮_θ}_{θ∈Θ} such that (1.3) holds, it will follow that ŝ_{m̂,n} is adaptive with respect to the family {𝒮_θ}_{θ∈Θ} in the minimax sense. We shall actually devote a large part of the paper to the illustration of this principle on various examples. For most of the illustrations that we shall consider one can take either L_m as a constant L or as log n. In the latter case we shall get adaptation up to a slowly varying function of n. Moreover, in the first case, we shall also discuss the precise dependency of the ratio C_3(θ)/C_1(θ) with respect to θ and sometimes show that it is bounded independently of θ.

There is a huge amount of recent literature devoted to adaptive estimation and we postpone to Section 5 a discussion about the connections between model selection and adaptive estimation, including a comparison between our approach to adaptation and the already existing methods and results. The structure of the paper is described in the Table of Contents. Let us only mention that Sections 4, 7 and 8 are clearly more technical and

can be skipped at first reading. A first and particularly simple illustration of what we want to do and of the ideas underlying our approach is given in Section 2, which provides a self-contained introduction to our method, while Section 3 provides an overview of its application to various situations. Section 5 does not contain any new result but is devoted to some detailed discussion, based on the examples of Sections 2 and 3, about the connections between adaptation and model selection.

2. A glimpse of the essentials

In order to give an idea of the way our approach to minimum penalized empirical contrast estimation works, let us describe it in the simplest framework we know, namely Gaussian regression on a fixed design. Its simplicity allows us to give a short and self-contained proof of an upper bound, involving the accuracy index, for the risk of penalized least squares estimators. The main issue here is to highlight the connection between the concentration of measure phenomenon and the choice of the penalty function for model selection.

2.1. Model selection in a toy framework

In the Gaussian regression framework we observe n random variables Y_i = s(x_i) + W_i where the x_i's are known and the W_i's are independent identically distributed standard normal. Identifying any function t defined on the set X = {x_1,...,x_n} with a vector t = (t_1,...,t_n)ᵀ ∈ ℝⁿ by setting t_i = t(x_i), we define a scalar product and a norm on ℝⁿ by

    ⟨t, u⟩ = (1/n) Σ_{i=1}^n t(x_i) u(x_i)   and   ‖t‖² = (1/n) Σ_{i=1}^n t²(x_i).   (2.1)

We introduce a countable family {S_m}_{m∈M_n} of linear models, S_m being of dimension D_m, and for each m we consider the least squares estimator ŝ_m on S_m, which is a minimizer with respect to t ∈ S_m of

    γ_n(t) = ‖t‖² − 2⟨Y, t⟩,   where Y = (Y_1,...,Y_n)ᵀ.

Then we choose a prior family of weights {L_m}_{m∈M_n} with L_m ≥ 1 for each m, such that

    Σ_{m∈M_n} exp[−L_m D_m] = Σ < +∞.   (2.2)

Our aim is to prove the following

Theorem 1. Let pen(m) be defined on M_n by pen(m) = κ L_m D_m/n for a suitable constant κ, the weights L_m satisfying (2.2). Let ŝ_m be the minimizer of γ_n(t) for t ∈ S_m and ŝ_m̂ be the minimizer among the family {ŝ_m}_{m∈M_n} of the penalized criterion γ_n(ŝ_m) + pen(m). Then ŝ_m̂ satisfies

    E_s[ ‖s − ŝ_m̂‖² ] ≤ κ′ inf_{m∈M_n} { d²(s, S_m) + pen(m) } + κ″ Σ n⁻¹,   (2.3)

where d²(s, S_m) = inf_{t∈S_m} ‖s − t‖² and κ′, κ″ are numerical constants.

Remark: The following proof uses κ = 24, leading to κ′ = 3 and κ″ = 32, which is obviously far from optimal as follows from Li (1987) or Baraud (1997). The result actually holds, for instance, with κ = 2 as in Mallows' C_p, but a proof leading to better values of the constants would be longer, involve additional technicalities and also use more specific properties of the framework. Since we want here to give a short and intuitive proof, in the spirit of the subsequent results given in the paper for different frameworks, we prefer to sacrifice optimality to simplicity and readability and put the emphasis on the main ideas to be used in the sequel, without the specific tricks which are required for optimizing the constants.

Proof: We start with the identity

    ‖t − s‖² = γ_n(t) + 2⟨W, t⟩ + ‖s‖²,

where W = (W_1,...,W_n)ᵀ, and notice that, by definition, for any given m ∈ M_n,

    γ_n(ŝ_m̂) + pen(m̂) ≤ γ_n(s_m) + pen(m),

where s_m denotes the orthogonal projection of s onto S_m. Combining these two formulas we get

    ‖s − ŝ_m̂‖² ≤ ‖s − s_m‖² + pen(m) − pen(m̂) + 2⟨W, ŝ_m̂ − s_m⟩.   (2.4)

Let m be fixed. Given some m′ ∈ M_n, we introduce the Gaussian process {Z(t)}_{t∈S_{m′}} defined by

    Z(t) = ⟨W, t − s_m⟩ / w(m′, t)   where   w(m′, t) = ‖t − s‖² + ‖s − s_m‖² + x_{m′}/n,

x_{m′} being some positive number to be chosen later. As a consequence of Cirel'son, Ibragimov and Sudakov's inequality (see Cirel'son, Ibragimov and Sudakov 1976 and, for more details about Gaussian concentration inequalities, Ledoux 1996),

    P_s[ sup_{t∈S_{m′}} Z(t) ≥ E + λ ] ≤ exp[−λ²/(2σ²)]   for any λ > 0,   (2.5)

provided that E ≥ E_s[sup_{t∈S_{m′}} Z(t)] and sup_{t∈S_{m′}} Var(Z(t)) ≤ σ². Let us first notice that

    w(m′, t) ≥ (1/2)[ ‖t − s_m‖² + 2x_{m′}/n ] ≥ (2x_{m′}/n)^{1/2} ‖t − s_m‖   (2.6)

and that, for any function u, Var(⟨W, u⟩) = n⁻¹‖u‖². Then Var(Z(t)) = n⁻¹‖t − s_m‖² w⁻²(m′, t), which immediately yields that we can take σ² = (2x_{m′})⁻¹ in (2.5). On the other hand, expanding t − s_m on an orthonormal basis (ψ_1,...,ψ_N) of S_m + S_{m′} with N ≤ D_m + D_{m′}, one gets by the Cauchy-Schwarz inequality that

    Z²(t) ≤ ‖t − s_m‖² w⁻²(m′, t) Σ_{j=1}^N ⟨W, ψ_j⟩²,

and it follows from (2.6) and Jensen's inequality that we can take E = [(D_m + D_{m′})/(2x_{m′})]^{1/2} in (2.5). If λ is given by λ = [(x + L_{m′} D_{m′})/x_{m′}]^{1/2}, where x is any positive number, we derive that

    λ + E ≤ [ (D_m + D_{m′} + 2x + 2L_{m′}D_{m′}) / x_{m′} ]^{1/2} ≤ 1/4   if   x_{m′} = 16(D_m + 2x + 3L_{m′}D_{m′}).

It then follows that

    P_s[ Z(ŝ_{m′}) ≥ 1/4 ] ≤ P_s[ sup_{t∈S_{m′}} Z(t) ≥ 1/4 ] ≤ exp(−L_{m′}D_{m′}) exp(−x),

and therefore, summing up those inequalities with respect to m′, that

    P_s[ sup_{m′∈M_n} ⟨W, ŝ_{m′} − s_m⟩ / w(m′, ŝ_{m′}) ≥ 1/4 ] ≤ Σ exp(−x).   (2.7)

This implies, from the definitions of w and x_{m′}, that, except on a set of probability bounded by Σ e^{−x},

    4⟨W, ŝ_m̂ − s_m⟩ ≤ w(m̂, ŝ_m̂) = ‖s − ŝ_m̂‖² + ‖s − s_m‖² + 16 n⁻¹ (D_m + 2x + 3L_m̂ D_m̂).

Coming back to (2.4), this implies that

    ‖s − ŝ_m̂‖² ≤ 3‖s − s_m‖² + 2 pen(m) − 2 pen(m̂) + 16 n⁻¹ (D_m + 2x + 3L_m̂ D_m̂).

The choice κ = 24 entails the cancellation of pen(m̂), showing that, since L_m ≥ 1,

    ‖s − ŝ_m̂‖² ≤ 3‖s − s_m‖² + (8/3) pen(m) + 32 n⁻¹ x

apart from a set of probability bounded by Σ e^{−x}. Setting

    V = ( ‖s − ŝ_m̂‖² − 3‖s − s_m‖² − (8/3) pen(m) ) ∨ 0,

we get

    E_s[ ‖s − ŝ_m̂‖² ] ≤ 3‖s − s_m‖² + (8/3) pen(m) + E_s[V]

and P_s[V ≥ 32x/n] ≤ Σ exp(−x). Integrating with respect to x implies that E_s[V] ≤ 32 Σ/n, which yields (2.3) since m is arbitrary.
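The engine of the proof is the Gaussian concentration inequality (2.5). As a small numerical sanity check (ours, using a much simpler process than the one above), one can simulate the supremum of ⟨W, t⟩ over the unit ball, for the norm (2.1), of a D-dimensional coordinate subspace, for which the variance factor is σ² = 1/n, and compare its upper tail with the bound exp[−λ²/(2σ²)]:

    import numpy as np

    rng = np.random.default_rng(1)
    n, D, reps, lam = 200, 10, 20000, 0.15

    # The sup of <W, t> over the unit ball (for the norm (2.1)) of a D-dimensional
    # coordinate subspace equals the empirical norm of the projection of W on it.
    W = rng.standard_normal((reps, n))
    sups = np.sqrt(np.sum(W[:, :D] ** 2, axis=1) / n)

    E = sups.mean()              # Monte-Carlo proxy for E[sup Z]
    sigma2 = 1.0 / n             # sup of Var(<W, t>) over that unit ball
    empirical = np.mean(sups >= E + lam)
    bound = np.exp(-lam ** 2 / (2.0 * sigma2))
    print(f"P(sup >= E + {lam}) ~ {empirical:.4f}, Gaussian bound: {bound:.4f}")

The empirical tail probability stays well below the bound, which is exactly the margin that the penalty term has to absorb uniformly over the list of models.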

2.2. Variable selection

We want to provide here a typical application of Theorem 1. Let us assume that we are given some (large) orthonormal system {ϕ_1,...,ϕ_N} in ℝⁿ with respect to the norm (2.1). We want to get an estimate of s of the form s̃ = Σ_{λ∈m} β̂_λ ϕ_λ, where m is some suitable subset of {1, 2,...,N}. Let us first recall that if m is given, the projection estimator ŝ_m over S_m = Span{ϕ_λ | λ ∈ m}, which is the minimizer with respect to t ∈ S_m of the criterion γ_n(t), is given by

    ŝ_m = Σ_{λ∈m} β̂_λ ϕ_λ   with   β̂_λ = ⟨Y, ϕ_λ⟩,

and that γ_n(ŝ_m) = −Σ_{λ∈m} β̂_λ². Elementary computations show that

    E_s[ ‖s − ŝ_m‖² ] = d²(s, S_m) + |m|/n.

Unfortunately, since s is unknown, we do not know how to choose m in an optimal way in order to minimize d²(s, S_m) + |m|/n. In order to select m from the data, let us describe two simple strategies (among many others), as implemented in the sketch following this paragraph.

i) Ordered variable selection. In this case we select the variables ϕ_λ in their natural order, which means that we restrict ourselves to m_k = {ϕ_λ | 1 ≤ λ ≤ k}, letting k vary from 1 to N. In such a case one can take L_m = 1, Σ ≤ 0.6, pen(m_k) = κk/n, and get a penalized least squares estimator ŝ_k̂ where k̂ is the minimizer of κk/n − Σ_{λ∈m_k} β̂_λ². By Theorem 1, the risk of this estimator is bounded by

    E_s[ ‖s − ŝ_k̂‖² ] ≤ κ′ inf_{1≤k≤N} { d²(s, S_{m_k}) + k/n }

for a suitable numerical constant κ′. One should notice here that N does not enter the bound and can therefore be infinite, and that we get the optimal risk among our family apart from the constant factor κ′. Note that this optimality is with respect to the best that can be achieved among the class of ordered variable selection models.

ii) Complete variable selection. Here we take m to be any nonvoid subset of {1, 2,...,N}. Since the number of such subsets with a given cardinality D is (N choose D) < (eN/D)^D by Lemma 6, one can choose L_m = 1 + log N for all m and Σ ≤ 1.3. The resulting value m̂ is then obtained by minimizing

    κ(1 + log N)|m|/n − Σ_{λ∈m} β̂_λ².

It is easily seen that this amounts to selecting the values of λ such that β̂_λ² > κ(1 + log N)/n, which means that

    m̂ = { λ | |β̂_λ| > [κ(1 + log N)/n]^{1/2} }.

Therefore ŝ_m̂ is a threshold estimator as studied by Donoho and Johnstone (1994a). Moreover, by Theorem 1, there exists a constant κ′ such that

    E_s[ ‖s − ŝ_m̂‖² ] ≤ κ′ inf_m { d²(s, S_m) + |m| (log N)/n }.

If N is independent of n, we only lose a constant as compared to the ideal estimator; if N grows as a power of n, we only lose a log n factor as compared to the optimal risk for the class of all subset models, as in Donoho and Johnstone (1994a). This is the price to pay for complete variable selection among a large family, but what is gained can be vastly superior in the approximation versus dimension tradeoff in the risk.
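Both strategies are easy to implement. The sketch below is ours: the cosine system, the sparse target and the value of κ are illustrative choices only, and κ would have to be calibrated as discussed above. It computes the empirical coefficients β̂_λ = ⟨Y, ϕ_λ⟩ and then performs (i) ordered selection by minimizing κk/n − Σ_{λ≤k} β̂_λ² and (ii) complete selection by hard thresholding at [κ(1 + log N)/n]^{1/2}:

    import numpy as np

    rng = np.random.default_rng(2)
    n, N = 256, 128
    kappa = 4.0                               # penalty constant (illustrative)

    # Orthonormal system for the empirical norm (2.1): scaled discrete cosines.
    i = np.arange(n)
    Phi = np.vstack([np.sqrt(2.0) * np.cos(np.pi * lam * (i + 0.5) / n)
                     for lam in range(1, N + 1)])       # Phi[lam-1] = phi_lam
    s = 2.0 * Phi[0] + 1.0 * Phi[2] + 0.5 * Phi[9]      # sparse true signal
    Y = s + rng.standard_normal(n)

    beta_hat = Phi @ Y / n                              # <Y, phi_lam> for (2.1)

    # i) Ordered variable selection: minimize kappa*k/n - sum_{lam<=k} beta_hat^2.
    crit = kappa * np.arange(1, N + 1) / n - np.cumsum(beta_hat ** 2)
    k_hat = int(np.argmin(crit)) + 1

    # ii) Complete variable selection: keep beta_hat^2 > kappa*(1+log N)/n,
    #     i.e. hard thresholding at [kappa*(1+log N)/n]^(1/2).
    thresh = np.sqrt(kappa * (1.0 + np.log(N)) / n)
    m_hat = np.flatnonzero(np.abs(beta_hat) > thresh) + 1   # selected indices

    print("ordered selection keeps the first", k_hat, "coefficients")
    print("thresholding keeps lambda in", m_hat.tolist())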

Conclusion: The simplicity of treatment of the preceding example is mainly due to the fact that the centered empirical contrast −2⟨W, t⟩ is a Gaussian linear process acting on a finite dimensional linear space. The same treatment could be applied as well to penalized projection estimation for the white noise setting. Unfortunately, the treatment of other empirical contrast functions or of nonlinear models requires that several technical difficulties be overcome. If we set here l_n(s, t) = E_s[γ_n(t) − γ_n(s)], then l_n(s, t) = ‖s − t‖². In a non-Gaussian framework, one has to deal with a general empirical contrast function γ_n and the analogue of (2.4) becomes

    l_n(s, ŝ_m̂) ≤ l_n(s, s_m) + [γ_n⁰(s_m) − γ_n⁰(ŝ_m̂)] + pen(m) − pen(m̂),

where γ_n⁰(t) = γ_n(t) − E_s[γ_n(t)]. Pure L2-assumptions are not enough to control the fluctuations of the centered empirical contrast (the bracketed term) involved in this inequality. This motivates the introduction of L∞-type assumptions on our models in the next section. Moreover, the structure of the exponential bounds that we use is connected to Bernstein's inequality rather than to a subgaussian type inequality. We also would like to point out the status of the distance d, which has to be closely connected to the empirical contrast and chosen not too small, in order to provide an appropriate control of the fluctuations of γ_n⁰, and not too large, in order that d²(s,t) be controlled by l_n(s,t). In the most favorable case of the projection density estimator on linear models, one can mimic the preceding proof, replacing the concentration inequality (2.5) of Cirel'son, Ibragimov and Sudakov by an inequality of Talagrand (1996). The point here is that the linearity of the model and of γ_n⁰(t) as a function of t allows one to use the Cauchy-Schwarz inequality, as we did before, to control the expectation of the supremum of the process involved. This point of view is developed in Birgé and Massart (1997) for projection density estimation and in Baraud (1997) for non-Gaussian regression. More generally, in the nonlinear context, one has to deal with suitable modifications of the entropy methods introduced by Dudley (1978) to build the required exponential inequalities. Such results are collected in Proposition 7 below, which is mainly based on Theorem 5 and Proposition 3 of Birgé and Massart (1998). Moreover, in the case of maximum likelihood estimation, we have to modify the initial empirical process in order to keep its fluctuations under control, at the price of additional difficulties to get an analogue of inequality (2.4).

3. Main results with some illustrations

3.1. The minimum penalized empirical contrast estimation method

We wish to analyze various functional estimation problems (density estimation, regression estimation, ...) that we describe precisely below. A common statistical framework covering all these examples is as follows. We observe n random variables Z_1,...,Z_n which, in the context of this paper, are assumed to be independent. These variables are defined on some measurable space (Ω, A) and take their values in some measurable space (Z, U). The space (Ω, A) is equipped with a family of probabilities {P_s}_{s∈S} where S is a subset of some L2-space, L2(µ). Note that both µ and S can

depend on n, the same being true for each probability P_s, but we do not make this dependence appear in the notation for the sake of simplicity since those quantities will be fixed (independent of n) in most applications. We denote by E_s the expectation with respect to the probability P_s, by P_n the empirical distribution of the Z_i's and by ν_n = P_n − E_s[P_n] the centered empirical measure. The space L2(µ) is equipped with the distance d induced by the norm ‖·‖ = ‖·‖_2. More generally, for 1 ≤ p ≤ ∞, the norm in L_p(µ) is denoted by ‖·‖_p. Let us now introduce the key elements and notions that we need in the sequel.

Definition 1. Given some subset T of L2(µ) containing S, an empirical contrast function γ_n on T is defined for all t ∈ T as the empirical mean γ_n(t) = n⁻¹ Σ_{i=1}^n γ(Z_i, t), where γ is a function defined on Z × T which satisfies

    E_s[γ_n(t)] ≥ E_s[γ_n(s)]   for all s ∈ S and t ∈ T.

We then introduce a countable collection of subsets S_m of T (models) indexed by m ∈ M_n. These models play the role of approximating spaces (sieves) for the true unknown value s of the parameter, which might or might not be included in one of them. Typically, S_m is a subset of a finite-dimensional linear space. In order to make the notations simple we shall assume that everything which depends on m ∈ M_n might depend on n, but we omit this second index. We then consider a penalty function pen(m) which is a positive function on M_n. We shall see later how to define this penalty function in order to get a sensible estimator. Let ε_n ≥ 0 be given; a minimum penalized empirical contrast estimator is defined as follows:

Definition 2. Given some nonnegative number ε_n, an empirical contrast function γ_n, a collection of models {S_m}_{m∈M_n} and a penalty function pen(·) on M_n, an ε_n-minimum penalized contrast estimator is any estimator ŝ in ∪_{m∈M_n} S_m with ŝ ∈ S_m̂ such that

    γ_n(ŝ) + pen(m̂) ≤ inf_{m∈M_n} { inf_{t∈S_m} γ_n(t) + pen(m) } + ε_n.   (3.1)

If ε_n = 0 we speak of a minimum penalized contrast estimator.

As usual, by estimator we mean a measurable mapping from (Z, U)ⁿ to the metric space (T, d) endowed with its Borel σ-algebra. If we omit the measurability problems, such an estimator is always defined provided that ε_n > 0, but might not be unique. Nevertheless, the following results

do apply to any solution of (3.1). In order to simplify the presentation we shall assume throughout the paper that ŝ is well-defined for ε_n = 0. It turns out from our proofs that the choice ε_n = n⁻¹ would lead to the same risk bounds as those provided in the theorems below for the case ε_n = 0. Some classical examples of minimum contrast estimation methods follow.

Maximum likelihood density estimation. We observe n independent identically distributed variables Z_1,...,Z_n of density s² with respect to µ. We define T to be the set of nonnegative elements of norm 1 in L2(µ) (which means that their squares are probability densities) and take S ⊂ T. The choice of the function γ(z,t) = −log t(z) leads to maximum penalized likelihood estimators.

Projection estimators for density estimation. We assume that µ is a probability measure and that the unknown density of the i.i.d. observations Z_1,...,Z_n belongs to L2(µ). It can therefore be written 1 + s where s is orthogonal to the constant function 1. We take for T the subspace of L2(µ) which is orthogonal to 1 and derive the empirical contrast from γ(z,t) = ‖t‖² − 2t(z), S being chosen as any subset of those t ∈ T such that 1 + t ≥ 0. If S_m is a linear subspace of T with an orthonormal basis {ϕ_λ}_{λ∈Λ_m}, minimizing γ_n(t) over S_m leads to the classical projection estimator ŝ_m on S_m given by

    ŝ_m = Σ_{λ∈Λ_m} β̂_λ ϕ_λ   with   β̂_λ = (1/n) Σ_{i=1}^n ϕ_λ(Z_i).

Classical least squares regression. Observations are pairs (X_i, Y_i) = Z_i with Y_i = s(X_i) + W_i, and the variables X_i and W_i are all independent with respective distributions R_i and Q_i (independent of s), but not necessarily identically distributed since we want to include fixed design regression in our framework. In this case S ⊂ T = L2(µ) where µ denotes the average distribution of the X_i's: µ = n⁻¹ Σ_{i=1}^n R_i. This distribution actually depends on n in the case of a fixed design but not in the case of a random design. We assume that the errors W_i are centered and choose γ(z,t) = [y − t(x)]². The resulting estimator is a penalized least squares estimator.
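For instance, the projection estimator for density estimation defined above takes one line per coefficient. The sketch below is ours, not the paper's: the cosine system on [0,1] and the Beta-distributed sample are illustrative choices. It computes β̂_λ = n⁻¹ Σ_i ϕ_λ(Z_i) and evaluates the resulting density estimate 1 + ŝ_m:

    import numpy as np

    rng = np.random.default_rng(3)
    n, D = 500, 8                       # sample size, model dimension D_m

    # i.i.d. sample from a density on [0,1] (here a Beta(2,4) density).
    Z = rng.beta(2.0, 4.0, size=n)

    # Orthonormal system of L2([0,1], dx) orthogonal to the constant function:
    # phi_lambda(x) = sqrt(2) cos(pi * lambda * x), lambda = 1, 2, ...
    def phi(lam, x):
        return np.sqrt(2.0) * np.cos(np.pi * lam * x)

    # Projection estimator on S_m = span{phi_1, ..., phi_D}:
    # beta_hat_lambda = (1/n) sum_i phi_lambda(Z_i); density estimate = 1 + s_hat_m.
    beta_hat = np.array([phi(lam, Z).mean() for lam in range(1, D + 1)])

    def density_estimate(x):
        return 1.0 + sum(b * phi(lam, x) for lam, b in enumerate(beta_hat, start=1))

    x = np.linspace(0.0, 1.0, 5)
    print(np.round(density_estimate(x), 2))   # estimated density at a few points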

Minimum-L1 regression. We use the same regression framework as before, now assuming that the W_i's are centered at their median, and define γ(z,t) = |y − t(x)|. These frameworks and related empirical contrast functions have been described in greater detail in Birgé and Massart (1993) and Birgé and Massart (1998). We therefore refer the reader to these papers for more information.

3.2. Examples of models

In all our results, the value pen(m) of the penalty function is, in particular, connected with the number D_m of parameters which are necessary to describe the elements of the model S_m. A general definition of D_m will appear in Section 6 and we shall here content ourselves with the presentation of two cases which are known to be of practical interest.

Linear models. By a linear model we mean a subset S_m of some finite-dimensional linear subspace S̄_m of L2(µ) with dimension D_m. In opposition to what happens in Gaussian situations like Gaussian regression on a fixed design and the white noise setting, the L2-structure of the models is not sufficient to guarantee a good behavior of the empirical contrast function γ_n, which is essential for our purpose as we shall see later. More is needed, specifically some connections between the L2- and L∞-structures of the models. It is the aim of the two following indices (indeed relative to S̄_m) to quantify such connections. Firstly we set

    Φ_m = (1/√D_m) sup_{t ∈ S̄_m \ {0}} ‖t‖_∞ / ‖t‖   (3.2)

and denote by F_m the set of all orthonormal bases of S̄_m. For any finite set Λ and any β ∈ ℝ^Λ, we define |β|_∞ = sup_{λ∈Λ} |β_λ| and |β|_2² = Σ_{λ∈Λ} β_λ². We then notice that, for any orthonormal basis ϕ = {ϕ_λ}_{λ∈Λ_m} ∈ F_m,

    Φ_m = (1/√D_m) sup_{β≠0} ‖Σ_{λ∈Λ_m} β_λ ϕ_λ‖_∞ / |β|_2 = (1/√D_m) ‖Σ_{λ∈Λ_m} ϕ_λ²‖_∞^{1/2}.   (3.3)

The second equality in (3.3) comes from Lemma 1 of Birgé and Massart (1998).
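The identity (3.3) is easy to check numerically. The following sketch (ours) does so for the cosine system ϕ_λ(x) = √2 cos(πλx) on [0,1], comparing a Monte-Carlo evaluation of the supremum over unit vectors β with the closed-form right-hand side; since every ‖ϕ_λ‖_∞ is at most √2, both quantities stay below √2:

    import numpy as np

    # Numerical check of (3.3) for phi_lambda(x) = sqrt(2) cos(pi*lambda*x) on [0,1].
    D = 6
    x = np.linspace(0.0, 1.0, 20001)
    phis = np.array([np.sqrt(2.0) * np.cos(np.pi * lam * x) for lam in range(1, D + 1)])

    # Right-hand side of (3.3): (1/sqrt(D)) * || sum_lambda phi_lambda^2 ||_inf^(1/2)
    rhs = np.sqrt(np.max(np.sum(phis ** 2, axis=0))) / np.sqrt(D)

    # Left-hand side: sup over (sampled) unit vectors beta of
    # (1/sqrt(D)) * || sum_lambda beta_lambda phi_lambda ||_inf
    rng = np.random.default_rng(4)
    lhs = 0.0
    for _ in range(2000):
        beta = rng.standard_normal(D)
        beta /= np.linalg.norm(beta)
        lhs = max(lhs, np.max(np.abs(beta @ phis)) / np.sqrt(D))

    print(f"sampled sup: {lhs:.3f} <= closed form: {rhs:.3f} <= sqrt(2) = {np.sqrt(2):.3f}")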

Secondly we define

    r_m = (1/√D_m) inf_{ϕ∈F_m} sup_{β≠0} ‖Σ_{λ∈Λ_m} β_λ ϕ_λ‖_∞ / |β|_∞.   (3.4)

It follows from (3.3) and this definition that

    Φ_m ≤ r_m ≤ √D_m Φ_m.   (3.5)

Let us now detail a few examples of linear models and bound their indices.

Uniformly bounded basis: If one can find an orthonormal system {ϕ_λ}_{λ∈Λ} such that ‖ϕ_λ‖_∞ ≤ Φ for all λ, if the elements m of M_n are subsets of Λ and S̄_m is the linear span of {ϕ_λ}_{λ∈m}, then Φ_m ≤ Φ by (3.3). Choosing M_n as a countable family of subsets of the trigonometric basis in L2((0, 2π], dx) provides a typical example of this type.

Wavelet expansions: Let us consider an orthonormal wavelet basis {ϕ_{j,k} | j ≥ 0, k ∈ ℤ^q} of L2(ℝ^q, dx) (see Meyer 1990 for details) with the following conventions: the ϕ_{0,k}'s are translates of the father wavelet and, for j ≥ 1, the ϕ_{j,k}'s are affine transforms of the mother wavelet. One will also assume that these wavelets are compactly supported and have continuous derivatives up to some order r. Let t ∈ L2(ℝ^q, dx) be some function with compact support in (0,A)^q. Changing the indexation of the basis if necessary, we can write the expansion of t on the wavelet basis as

    t = Σ_{j≥0} Σ_{k=1}^{2^{jq} M} β_{j,k} ϕ_{j,k},

where M ≥ 1 is a finite integer depending on A and the size of the wavelets' supports. For any j, we denote by Λ(j) the set of indices {(j,k) | 1 ≤ k ≤ 2^{jq} M}. The relevant m's will be subsets of the larger sets ∪_{j=0}^{J} Λ(j) for finite values of J, and we shall denote by J_m the smallest J such that this inclusion is valid. It comes from Bernstein's inequality (see Meyer 1990, Chapter 2, Lemma 8) that r_m ≤ C(2^{qJ_m}/D_m)^{1/2} for some constant C. In particular, for all m's of the form ∪_{j=0}^{J_m} Λ(j), r_m is uniformly bounded and so is Φ_m. The most relevant applications of such expansions have been studied extensively in Birgé and Massart (1997).

We also want to deal with wavelet expansions on the interval [0,1]. Since the general case involves technicalities which are quite irrelevant to the subject of this paper, we shall content ourselves to deal with the simplest case of the Haar basis. Then the following expansion holds for any t ∈ L2([0,1], dx):

    t = β_{−1,1} ϕ_{−1,1} + Σ_{j≥0} Σ_{k=1}^{2^j} β_{j,k} ϕ_{j,k},   (3.6)

where ϕ_{−1,1} = 1_{[0,1]}, ψ = 1_{[0,1/2]} − 1_{(1/2,1]} and ϕ_{j,k}(x) = 2^{j/2} ψ(2^j x − k + 1). We set Λ(−1) = {(−1,1)} and, for j ≥ 0, Λ(j) = {(j,k) | 1 ≤ k ≤ 2^j}. If m = ∪_{j=0}^{J_m} Λ(j), we see from (3.3) that Φ_m = 1. To bound r_m we first notice that, for j ≥ 0,

    ‖Σ_{k=1}^{2^j} β_{j,k} ϕ_{j,k}‖_∞ ≤ 2^{j/2} sup_k |β_{j,k}|.   (3.7)

Therefore

    r_m ≤ ( Σ_{j=0}^{J_m} 2^{j/2} ) ( Σ_{j=0}^{J_m} 2^j )^{−1/2} < 1 + √2.

It may also be useful to choose m = ∪_{j=−1}^{J_m} Λ(j) and then r_m is again bounded by a numerical constant.

Piecewise polynomials: We restrict our attention to piecewise polynomial spaces on a bounded rectangle in ℝ^q, which, without loss of generality, we take to be [0,1]^q. Hereafter we denote by P_i a partition of [0,1] into D(i) intervals. A linear space S̄_m of piecewise polynomials is characterized by m = (r, P_1,...,P_q), where r is the maximal degree with respect to each variable of the polynomials involved. The elements t of S̄_m are the functions on [0,1]^q which coincide with a polynomial of degree not greater than r on each element of the product partition P = Π_{i=1}^q P_i. This results in D_m = (r+1)^q Π_{i=1}^q D(i). Let {Q_j}_{j≥0} be the orthogonal basis of the Legendre polynomials in L2([−1,1], dx); then the following properties hold for all j (see Whittaker and Watson 1927 for details):

    |Q_j(x)| ≤ 1 for all x ∈ [−1,1],   Q_j(1) = 1,   ∫_{−1}^{1} Q_j²(t) dt = 2/(2j+1).

Let us consider the hyperrectangle R = Π_{i=1}^q [a_i, b_i]. For j ∈ J = {0,...,r}^q we define

    ϕ_{R,j}(x_1,...,x_q) = Π_{i=1}^q ( (2j_i+1)/(b_i−a_i) )^{1/2} Q_{j_i}( (2x_i − a_i − b_i)/(b_i − a_i) ) 1_R(x_1,...,x_q).

The family {ϕ_{R,j}}_{j∈J} provides an orthonormal basis for the space of polynomials on R with degree bounded by r.
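As a quick sanity check of this construction (ours; one-dimensional case q = 1, with Gauss-Legendre quadrature standing in for exact integration), one can verify numerically that the rescaled Legendre polynomials ϕ_{R,j} are indeed orthonormal on R = [a, b]:

    import numpy as np
    from numpy.polynomial import legendre

    a, b, r = 0.25, 0.75, 3

    def phi(j, x):
        """phi_{R,j}(x) = sqrt((2j+1)/(b-a)) * Q_j((2x - a - b)/(b - a)) on [a, b]."""
        u = (2.0 * x - a - b) / (b - a)
        coeffs = np.zeros(j + 1)
        coeffs[j] = 1.0                       # Legendre series with a single Q_j term
        return np.sqrt((2 * j + 1) / (b - a)) * legendre.legval(u, coeffs)

    # Gauss-Legendre quadrature on [a, b], exact for the polynomial integrands below.
    nodes, weights = legendre.leggauss(2 * r + 2)
    x = 0.5 * (b - a) * nodes + 0.5 * (a + b)
    w = 0.5 * (b - a) * weights

    gram = np.array([[np.sum(w * phi(i, x) * phi(j, x)) for j in range(r + 1)]
                     for i in range(r + 1)])
    print(np.round(gram, 10))                 # should be the (r+1) x (r+1) identity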

If H is a polynomial such that H = Σ_{j∈J} β_j ϕ_{R,j}, then

    ‖H‖_∞ ≤ [ (r+1)(2r+1)^{1/2} ]^q [Vol(R)]^{−1/2} |β|_∞.

Then, taking Λ_m as the set of those (R,j)'s such that R ∈ P and j ∈ J, we get from (3.4)

    r_m² ≤ (r+1)^{2q} (2r+1)^q / ( D_m inf_{R∈P} Vol(R) ) = [ (r+1)(2r+1) ]^q [ inf_{R∈P} Vol(R) Π_{i=1}^q D(i) ]^{−1}.   (3.8)

In particular, if P is a regular partition (all elements R of P have the same volume),

    r_m ≤ [ (r+1)(2r+1) ]^{q/2}.   (3.9)

Polynomials on a sphere and other eigenspaces of the Laplacian: Let S_q be the unit Euclidean sphere of ℝ^{q+1}, µ be the uniform distribution on the sphere and 0 < θ_0 < ... < θ_j < ... be the eigenvalues of the Laplace-Beltrami operator on S_q. Let, for each j ≥ 0, {ϕ_λ, λ ∈ Λ(j)} be an orthonormal system of eigenfunctions associated with the eigenvalue θ_j. Then ∪_{j≥0} {ϕ_λ, λ ∈ Λ(j)} is an orthonormal basis of L2(µ). Defining, for any integer m ≥ 0, Λ_m = ∪_{j=0}^{m} Λ(j) and S̄_m as the linear span of {ϕ_λ}_{λ∈Λ_m}, we get D_m = |Λ_m| for m ≥ 0. Actually these eigenvalues are given by explicit formulas (see for instance Berger, Gauduchon and Mazet 1971), the corresponding eigenfunctions are known to be harmonic zonal polynomials and one has (see Stein and Weiss 1971, p. 144)

    Σ_{λ∈Λ(j)} ϕ_λ²(x) = |Λ(j)|   for all x ∈ S_q and all j ≥ 0.

In such a case it follows from (3.3) that Φ_m = 1 for any integer m. More generally, we can consider, instead of S_q, a compact connected Riemannian manifold M of dimension q with its uniform distribution µ. The eigenfunctions of the Laplace-Beltrami operator provide an orthonormal basis of L2(µ) which is a multidimensional generalization of the Fourier basis. Of course no exact formula is available in this full generality, but some asymptotic evaluation holds which is known as Weyl's formula (see Chavel 1984, p. 9). Keeping the same notations for the eigenvalues and eigenfunctions as above, defining Λ(j), Λ_m and S̄_m as in the case of the sphere and setting D_{−1} = 1, Weyl's formula ensures that there exist two positive constants C_1(M) and C_2(M) such that, for any integer m,


More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Optimal global rates of convergence for interpolation problems with random design

Optimal global rates of convergence for interpolation problems with random design Optimal global rates of convergence for interpolation problems with random design Michael Kohler 1 and Adam Krzyżak 2, 1 Fachbereich Mathematik, Technische Universität Darmstadt, Schlossgartenstr. 7, 64289

More information

Kernel change-point detection

Kernel change-point detection 1,2 (joint work with Alain Celisse 3 & Zaïd Harchaoui 4 ) 1 Cnrs 2 École Normale Supérieure (Paris), DIENS, Équipe Sierra 3 Université Lille 1 4 INRIA Grenoble Workshop Kernel methods for big data, Lille,

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector On Minimax Filtering over Ellipsoids Eduard N. Belitser and Boris Y. Levit Mathematical Institute, University of Utrecht Budapestlaan 6, 3584 CD Utrecht, The Netherlands The problem of estimating the mean

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Exact Minimax Strategies for Predictive Density Estimation, Data Compression, and Model Selection

Exact Minimax Strategies for Predictive Density Estimation, Data Compression, and Model Selection 2708 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 50, NO. 11, NOVEMBER 2004 Exact Minimax Strategies for Predictive Density Estimation, Data Compression, Model Selection Feng Liang Andrew Barron, Senior

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

Risk Bounds for CART Classifiers under a Margin Condition

Risk Bounds for CART Classifiers under a Margin Condition arxiv:0902.3130v5 stat.ml 1 Mar 2012 Risk Bounds for CART Classifiers under a Margin Condition Servane Gey March 2, 2012 Abstract Non asymptotic risk bounds for Classification And Regression Trees (CART)

More information

ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS

ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS Bendikov, A. and Saloff-Coste, L. Osaka J. Math. 4 (5), 677 7 ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS ALEXANDER BENDIKOV and LAURENT SALOFF-COSTE (Received March 4, 4)

More information

Chapter 2 Metric Spaces

Chapter 2 Metric Spaces Chapter 2 Metric Spaces The purpose of this chapter is to present a summary of some basic properties of metric and topological spaces that play an important role in the main body of the book. 2.1 Metrics

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

arxiv:math/ v3 [math.st] 1 Apr 2009

arxiv:math/ v3 [math.st] 1 Apr 2009 The Annals of Statistics 009, Vol. 37, No., 630 67 DOI: 10.114/07-AOS573 c Institute of Mathematical Statistics, 009 arxiv:math/070150v3 [math.st] 1 Apr 009 GAUSSIAN MODEL SELECTION WITH AN UNKNOWN VARIANCE

More information

Analysis in weighted spaces : preliminary version

Analysis in weighted spaces : preliminary version Analysis in weighted spaces : preliminary version Frank Pacard To cite this version: Frank Pacard. Analysis in weighted spaces : preliminary version. 3rd cycle. Téhéran (Iran, 2006, pp.75.

More information

3 Integration and Expectation

3 Integration and Expectation 3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

Eigenvalues and Eigenfunctions of the Laplacian

Eigenvalues and Eigenfunctions of the Laplacian The Waterloo Mathematics Review 23 Eigenvalues and Eigenfunctions of the Laplacian Mihai Nica University of Waterloo mcnica@uwaterloo.ca Abstract: The problem of determining the eigenvalues and eigenvectors

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

Chapter 2 Linear Transformations

Chapter 2 Linear Transformations Chapter 2 Linear Transformations Linear Transformations Loosely speaking, a linear transformation is a function from one vector space to another that preserves the vector space operations. Let us be more

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model.

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model. Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model By Michael Levine Purdue University Technical Report #14-03 Department of

More information

Analysis-3 lecture schemes

Analysis-3 lecture schemes Analysis-3 lecture schemes (with Homeworks) 1 Csörgő István November, 2015 1 A jegyzet az ELTE Informatikai Kar 2015. évi Jegyzetpályázatának támogatásával készült Contents 1. Lesson 1 4 1.1. The Space

More information

Discussion of Hypothesis testing by convex optimization

Discussion of Hypothesis testing by convex optimization Electronic Journal of Statistics Vol. 9 (2015) 1 6 ISSN: 1935-7524 DOI: 10.1214/15-EJS990 Discussion of Hypothesis testing by convex optimization Fabienne Comte, Céline Duval and Valentine Genon-Catalot

More information

THE LASSO, CORRELATED DESIGN, AND IMPROVED ORACLE INEQUALITIES. By Sara van de Geer and Johannes Lederer. ETH Zürich

THE LASSO, CORRELATED DESIGN, AND IMPROVED ORACLE INEQUALITIES. By Sara van de Geer and Johannes Lederer. ETH Zürich Submitted to the Annals of Applied Statistics arxiv: math.pr/0000000 THE LASSO, CORRELATED DESIGN, AND IMPROVED ORACLE INEQUALITIES By Sara van de Geer and Johannes Lederer ETH Zürich We study high-dimensional

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

Concentration behavior of the penalized least squares estimator

Concentration behavior of the penalized least squares estimator Concentration behavior of the penalized least squares estimator Penalized least squares behavior arxiv:1511.08698v2 [math.st] 19 Oct 2016 Alan Muro and Sara van de Geer {muro,geer}@stat.math.ethz.ch Seminar

More information

PACKING-DIMENSION PROFILES AND FRACTIONAL BROWNIAN MOTION

PACKING-DIMENSION PROFILES AND FRACTIONAL BROWNIAN MOTION PACKING-DIMENSION PROFILES AND FRACTIONAL BROWNIAN MOTION DAVAR KHOSHNEVISAN AND YIMIN XIAO Abstract. In order to compute the packing dimension of orthogonal projections Falconer and Howroyd 997) introduced

More information

Recall that any inner product space V has an associated norm defined by

Recall that any inner product space V has an associated norm defined by Hilbert Spaces Recall that any inner product space V has an associated norm defined by v = v v. Thus an inner product space can be viewed as a special kind of normed vector space. In particular every inner

More information

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection

More information

Module 3. Function of a Random Variable and its distribution

Module 3. Function of a Random Variable and its distribution Module 3 Function of a Random Variable and its distribution 1. Function of a Random Variable Let Ω, F, be a probability space and let be random variable defined on Ω, F,. Further let h: R R be a given

More information

An introduction to Mathematical Theory of Control

An introduction to Mathematical Theory of Control An introduction to Mathematical Theory of Control Vasile Staicu University of Aveiro UNICA, May 2018 Vasile Staicu (University of Aveiro) An introduction to Mathematical Theory of Control UNICA, May 2018

More information

Model comparison and selection

Model comparison and selection BS2 Statistical Inference, Lectures 9 and 10, Hilary Term 2008 March 2, 2008 Hypothesis testing Consider two alternative models M 1 = {f (x; θ), θ Θ 1 } and M 2 = {f (x; θ), θ Θ 2 } for a sample (X = x)

More information

Cambridge University Press The Mathematics of Signal Processing Steven B. Damelin and Willard Miller Excerpt More information

Cambridge University Press The Mathematics of Signal Processing Steven B. Damelin and Willard Miller Excerpt More information Introduction Consider a linear system y = Φx where Φ can be taken as an m n matrix acting on Euclidean space or more generally, a linear operator on a Hilbert space. We call the vector x a signal or input,

More information

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam. aad. Bayesian Adaptation p. 1/4

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam.  aad. Bayesian Adaptation p. 1/4 Bayesian Adaptation Aad van der Vaart http://www.math.vu.nl/ aad Vrije Universiteit Amsterdam Bayesian Adaptation p. 1/4 Joint work with Jyri Lember Bayesian Adaptation p. 2/4 Adaptation Given a collection

More information

DISCUSSION: COVERAGE OF BAYESIAN CREDIBLE SETS. By Subhashis Ghosal North Carolina State University

DISCUSSION: COVERAGE OF BAYESIAN CREDIBLE SETS. By Subhashis Ghosal North Carolina State University Submitted to the Annals of Statistics DISCUSSION: COVERAGE OF BAYESIAN CREDIBLE SETS By Subhashis Ghosal North Carolina State University First I like to congratulate the authors Botond Szabó, Aad van der

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability... Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................

More information

Spectral Theory, with an Introduction to Operator Means. William L. Green

Spectral Theory, with an Introduction to Operator Means. William L. Green Spectral Theory, with an Introduction to Operator Means William L. Green January 30, 2008 Contents Introduction............................... 1 Hilbert Space.............................. 4 Linear Maps

More information

Topological vectorspaces

Topological vectorspaces (July 25, 2011) Topological vectorspaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ Natural non-fréchet spaces Topological vector spaces Quotients and linear maps More topological

More information

Random Bernstein-Markov factors

Random Bernstein-Markov factors Random Bernstein-Markov factors Igor Pritsker and Koushik Ramachandran October 20, 208 Abstract For a polynomial P n of degree n, Bernstein s inequality states that P n n P n for all L p norms on the unit

More information

Hilbert spaces. 1. Cauchy-Schwarz-Bunyakowsky inequality

Hilbert spaces. 1. Cauchy-Schwarz-Bunyakowsky inequality (October 29, 2016) Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/fun/notes 2016-17/03 hsp.pdf] Hilbert spaces are

More information

Packing-Dimension Profiles and Fractional Brownian Motion

Packing-Dimension Profiles and Fractional Brownian Motion Under consideration for publication in Math. Proc. Camb. Phil. Soc. 1 Packing-Dimension Profiles and Fractional Brownian Motion By DAVAR KHOSHNEVISAN Department of Mathematics, 155 S. 1400 E., JWB 233,

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

CHAPTER VIII HILBERT SPACES

CHAPTER VIII HILBERT SPACES CHAPTER VIII HILBERT SPACES DEFINITION Let X and Y be two complex vector spaces. A map T : X Y is called a conjugate-linear transformation if it is a reallinear transformation from X into Y, and if T (λx)

More information

We have to prove now that (3.38) defines an orthonormal wavelet. It belongs to W 0 by Lemma and (3.55) with j = 1. We can write any f W 1 as

We have to prove now that (3.38) defines an orthonormal wavelet. It belongs to W 0 by Lemma and (3.55) with j = 1. We can write any f W 1 as 88 CHAPTER 3. WAVELETS AND APPLICATIONS We have to prove now that (3.38) defines an orthonormal wavelet. It belongs to W 0 by Lemma 3..7 and (3.55) with j =. We can write any f W as (3.58) f(ξ) = p(2ξ)ν(2ξ)

More information

Wavelet Shrinkage for Nonequispaced Samples

Wavelet Shrinkage for Nonequispaced Samples University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Wavelet Shrinkage for Nonequispaced Samples T. Tony Cai University of Pennsylvania Lawrence D. Brown University

More information

MAJORIZING MEASURES WITHOUT MEASURES. By Michel Talagrand URA 754 AU CNRS

MAJORIZING MEASURES WITHOUT MEASURES. By Michel Talagrand URA 754 AU CNRS The Annals of Probability 2001, Vol. 29, No. 1, 411 417 MAJORIZING MEASURES WITHOUT MEASURES By Michel Talagrand URA 754 AU CNRS We give a reformulation of majorizing measures that does not involve measures,

More information

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

A BLEND OF INFORMATION THEORY AND STATISTICS. Andrew R. Barron. Collaborators: Cong Huang, Jonathan Li, Gerald Cheang, Xi Luo

A BLEND OF INFORMATION THEORY AND STATISTICS. Andrew R. Barron. Collaborators: Cong Huang, Jonathan Li, Gerald Cheang, Xi Luo A BLEND OF INFORMATION THEORY AND STATISTICS Andrew R. YALE UNIVERSITY Collaborators: Cong Huang, Jonathan Li, Gerald Cheang, Xi Luo Frejus, France, September 1-5, 2008 A BLEND OF INFORMATION THEORY AND

More information

Asymptotically Efficient Nonparametric Estimation of Nonlinear Spectral Functionals

Asymptotically Efficient Nonparametric Estimation of Nonlinear Spectral Functionals Acta Applicandae Mathematicae 78: 145 154, 2003. 2003 Kluwer Academic Publishers. Printed in the Netherlands. 145 Asymptotically Efficient Nonparametric Estimation of Nonlinear Spectral Functionals M.

More information

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population

More information

HILBERT SPACES AND THE RADON-NIKODYM THEOREM. where the bar in the first equation denotes complex conjugation. In either case, for any x V define

HILBERT SPACES AND THE RADON-NIKODYM THEOREM. where the bar in the first equation denotes complex conjugation. In either case, for any x V define HILBERT SPACES AND THE RADON-NIKODYM THEOREM STEVEN P. LALLEY 1. DEFINITIONS Definition 1. A real inner product space is a real vector space V together with a symmetric, bilinear, positive-definite mapping,

More information

Least singular value of random matrices. Lewis Memorial Lecture / DIMACS minicourse March 18, Terence Tao (UCLA)

Least singular value of random matrices. Lewis Memorial Lecture / DIMACS minicourse March 18, Terence Tao (UCLA) Least singular value of random matrices Lewis Memorial Lecture / DIMACS minicourse March 18, 2008 Terence Tao (UCLA) 1 Extreme singular values Let M = (a ij ) 1 i n;1 j m be a square or rectangular matrix

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

An introduction to some aspects of functional analysis

An introduction to some aspects of functional analysis An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms

More information

Effective Dimension and Generalization of Kernel Learning

Effective Dimension and Generalization of Kernel Learning Effective Dimension and Generalization of Kernel Learning Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, Y 10598 tzhang@watson.ibm.com Abstract We investigate the generalization performance

More information

INFORMATION-THEORETIC DETERMINATION OF MINIMAX RATES OF CONVERGENCE 1. By Yuhong Yang and Andrew Barron Iowa State University and Yale University

INFORMATION-THEORETIC DETERMINATION OF MINIMAX RATES OF CONVERGENCE 1. By Yuhong Yang and Andrew Barron Iowa State University and Yale University The Annals of Statistics 1999, Vol. 27, No. 5, 1564 1599 INFORMATION-THEORETIC DETERMINATION OF MINIMAX RATES OF CONVERGENCE 1 By Yuhong Yang and Andrew Barron Iowa State University and Yale University

More information

Theorem 2. Let n 0 3 be a given integer. is rigid in the sense of Guillemin, so are all the spaces ḠR n,n, with n n 0.

Theorem 2. Let n 0 3 be a given integer. is rigid in the sense of Guillemin, so are all the spaces ḠR n,n, with n n 0. This monograph is motivated by a fundamental rigidity problem in Riemannian geometry: determine whether the metric of a given Riemannian symmetric space of compact type can be characterized by means of

More information

Chapter One. The Real Number System

Chapter One. The Real Number System Chapter One. The Real Number System We shall give a quick introduction to the real number system. It is imperative that we know how the set of real numbers behaves in the way that its completeness and

More information

Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms

Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms university-logo Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms Andrew Barron Cong Huang Xi Luo Department of Statistics Yale University 2008 Workshop on Sparsity in High Dimensional

More information

Overview of normed linear spaces

Overview of normed linear spaces 20 Chapter 2 Overview of normed linear spaces Starting from this chapter, we begin examining linear spaces with at least one extra structure (topology or geometry). We assume linearity; this is a natural

More information