Risk bounds for model selection via penalization


Probab. Theory Relat. Fields 113, 301-413 (1999)

Andrew Barron (1), Lucien Birgé (2), Pascal Massart (3)

1 Department of Statistics, Yale University, P.O. Box 208290, New Haven, CT, USA. e-mail: barron@stat.yale.edu
2 URA CNRS 1321 "Statistique et modèles aléatoires", Laboratoire de Probabilités, boîte 188, Université Paris VI, 4 Place Jussieu, F-75252 Paris Cedex 05, France. e-mail: lb@ccr.jussieu.fr
3 URA CNRS 743 "Modélisation stochastique et Statistique", Bât. 425, Université Paris Sud, Campus d'Orsay, F-91405 Orsay Cedex, France. e-mail: massart@stats.matups.fr

Received: 7 July 1995 / Revised version: 1 November 1997

Abstract. Performance bounds for criteria for model selection are developed using recent theory for sieves. The model selection criteria are based on an empirical loss or contrast function with an added penalty term motivated by empirical process theory and roughly proportional to the number of parameters needed to describe the model divided by the number of observations. Most of our examples involve density or regression estimation settings and we focus on the problem of estimating the unknown density or regression function. We show that the quadratic risk of the minimum penalized empirical contrast estimator is bounded by an index of the accuracy of the sieve. This accuracy index quantifies the trade-off among the candidate models between the approximation error and parameter dimension relative to sample size. If we choose a list of models which exhibit good approximation properties with respect to different classes of smoothness, the estimator can be simultaneously minimax rate optimal in each of those classes. This is what is usually called adaptation. The type of classes of smoothness in which one gets adaptation depends heavily on the list of models. If too many models are involved in order to get accurate approximation of many wide classes of functions simultaneously, it may happen that the estimator is only approximately adaptive (typically up to a slowly varying function of the sample size). We shall provide various illustrations of our method such as penalized maximum likelihood, projection or least squares estimation. The models will involve commonly used finite dimensional expansions such as piecewise polynomials with fixed or variable knots, trigonometric polynomials, wavelets, neural nets and related nonlinear expansions defined by superposition of ridge functions.

Work supported in part by the NSF grant ECS, and by URA CNRS 1321 "Statistique et modèles aléatoires" and URA CNRS 743 "Modélisation stochastique et Statistique".

Key words and phrases: Penalization - Model selection - Adaptive estimation - Empirical processes - Sieves - Minimum contrast estimators

Mathematics subject classifications (1991): Primary 62G05, 62G07; secondary 41A25

Contents

1 Introduction
  1.1 What is this paper about?
  1.2 Model selection
  1.3 Sieve methods and approximation theory
  1.4 From model selection to adaptation
2 A glimpse of the essentials
  2.1 Model selection in a toy framework
  2.2 Variable selection
3 Main results with some illustrations
  3.1 The minimum penalized empirical contrast estimation method
      Maximum likelihood density estimation - Projection estimators for density estimation - Classical least squares regression - Minimum-L1 regression
  3.2 Examples of models
      Linear models - Nonlinear models
  3.3 The theorems and their applications
      Maximum likelihood estimators - Projection estimators - Least squares estimators for smooth regression
4 Further examples
  4.1 Nested families of models and analogues
      Ellipsoids with unknown coefficients - Densities with an unknown modulus of continuity - Hölderian densities with unknown anisotropic smoothness - Projection estimators on polynomials with variable degree - Least squares estimators for binary images - Estimation of the support of a density
  4.2 Rich families of models
      Histograms with variable binwidths and spatial adaptation - Neural nets and related nonlinear models - Model selection with a bounded basis
5 Adaptation and model selection
  5.1 Adaptation in the minimax sense
  5.2 Adaptation with respect to the target function and model selection
  5.3 Comparison with other adaptive methods
      Adaptation to the target function - Adaptation in the minimax sense - What's new here?
6 A general theorem in an abstract framework
  6.1 Exponential bounds for the fluctuations of empirical processes
  6.2 A general theorem
  6.3 Penalized projection estimators on linear models
  6.4 Proof of Theorems 8 and 9
7 Proofs of the main results
  7.1 Maximum likelihood estimation
  7.2 Other penalized minimum contrast estimation procedures
      Penalized projection estimation - Penalized least squares and minimum-L1 regression - Estimating the support of a density
  7.3 Analysis of nonlinear models
8 Appendix
  8.1 Combinatorial and covering lemmas
  8.2 Some results in approximation theory
  8.3 Further technical results

1. Introduction

1.1. What is this paper about?

The purpose of this paper is to provide a general method for estimating an unknown function s on the basis of n observations and a finite or countable family of models S_m, m ∈ M_n, using an empirical model selection criterion. Here, by model we have in mind any possible set of finite dimension D_m (in a sense that will be made precise later on and which includes the classical case where S_m is linear). We do not mean that s belongs to any of the models, although this might be the case. Therefore we shall always think of a model S_m as an approximate model for the true s with controlled complexity, and this is the reason why we shall alternatively use the term sieve introduced by Grenander (1981) in connection with approximation theory. For each model S_m we build an estimator ŝ_{m,n} which minimizes some empirical contrast function γ_n over the set S_m. The precise nature of the sampling model will be discussed later. It suffices for now to think of regression and density estimation problems in which, for each candidate function t, the empirical contrast γ_n(t) is, respectively, the empirical average squared error or (1/n) times minus the logarithm of the likelihood. Denoting by R_{m,n}(s) = E_s[d²(s, ŝ_{m,n})] the risk at s of the estimator ŝ_{m,n} (where d denotes some convenient distance), an ideal model should minimize R_{m,n}(s) when m varies. Nevertheless, even if s belongs to some S_{m0}, this true model can be far from being ideal (in the preceding sense). Think of a polynomial fit of a regression curve with 100 observations when the true s is a polynomial of degree 50. Since s is unknown, one cannot determine such an ideal model exactly. Therefore one would like to find a model selection procedure m̂, based on the data, such that the risk of the resulting estimator ŝ_{m̂,n} is equal to the minimal risk inf_{m∈M_n} R_{m,n}(s). This program is too ambitious and we shall content ourselves to consider, instead of the minimal risk, some accuracy index of the form

    a_n(s) = inf_{m∈M_n} { d²(s, S_m) + pen_{m,n} } = inf_{m∈M_n} { inf_{t∈S_m} d²(s, t) + pen_{m,n} },

which majorizes the minimal risk, and to provide a model selection procedure m̂ such that the risk of ŝ_{m̂,n} achieves the accuracy index up to some constant independent of n, which means that

    E_s[ d²(s, ŝ_{m̂,n}) ] ≤ C(s) a_n(s)   for all n.   (1.1)

The procedure m̂ is defined by the minimization over M_n of the penalized empirical contrast {γ_n(ŝ_{m,n}) + pen_{m,n}}. More precisely, it follows from the analysis of Birgé and Massart (1998) that the risk R_{m,n}(s) is typically of

order d²(s, S_m) + D_m/n. The penalty term pen_{m,n} then generally takes the form κ L_m D_m/n, where κ is an absolute constant and L_m ≥ 1 is a weight that satisfies a condition of the type

    Σ_{m∈M_n} exp[−L_m D_m] ≤ 1.

The penalty term takes into account both the difficulty to estimate within the model S_m (role of D_m) and the additional noise due to the size of the list of models (role of L_m), and derives from exponential probability bounds for the empirical contrast. It follows from (1.1) and our choice of the penalty that, for any s,

    E_s[ d²(s, ŝ_{m̂,n}) ] ≤ C(s) inf_{m∈M_n} { d²(s, S_m) + κ L_m D_m / n }.   (1.2)

Although we emphasized the fact that s need not belong to any S_m, the bound (1.2) also makes sense in the parametric case. More precisely, if one starts from a finite collection of models {S_m}_{m∈M} which does not depend on n and fixes L_m = 1 for all m, one finds, whenever s belongs to some S_{m0}, that the risk of ŝ_{m̂,n} is of order n⁻¹ as expected for this parametric framework. More generally, the bound (1.2) permits the reduction of the problem of investigating the performance of the estimator (to within certain constant multipliers) to an investigation of the approximation capabilities of the sieves. Here we have in mind a variety of possible function classes and the accuracy index will be evaluated for each. Since it is not known to which subsets of functions the target s belongs, it is a merit of the accuracy index, and indeed a merit of the minimum penalized empirical contrast estimator ŝ_{m̂,n} in many cases, that the maximum of the accuracy index a_n(s) on certain subclasses of functions is within a constant factor of the minimax optimal value for the risk on these subclasses. For typical choices of models, the target function s is a cluster point, that is, d(s, S_m) tends to zero for some subsequence of models, and the accuracy index quantifies the rate of convergence in a way that is naturally tied to the dimension of the models and the sample size through the penalty term. As a consequence of the accuracy index, there exist many situations where model selection provides estimators ŝ_{m̂,n} which are (at least approximately) simultaneously minimax over a family of classes of functions, usually balls with respect to the seminorms of the classical spaces of smooth functions. Such estimators are then called (approximately) adaptive. We shall now go further into detail to describe our work and relate our results to the existing literature on model selection and adaptive estimation.
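To fix ideas, the selection rule just described can be written in a few lines of code. The sketch below is ours, not the paper's: the contrast values, dimensions, weights and the constant kappa are illustrative placeholders. It simply returns the index m̂ minimizing the penalized contrast γ_n(ŝ_m) + κ L_m D_m / n over a finite list of fitted models:

    import numpy as np

    def select_model(contrasts, dims, weights, n, kappa=2.0):
        """Generic penalized model selection.

        contrasts[m] : value of the empirical contrast gamma_n at the minimum
                       contrast estimator s_hat_m for model m
        dims[m]      : dimension D_m of model m
        weights[m]   : weight L_m >= 1 attached to model m
        n            : sample size
        kappa        : constant in pen(m) = kappa * L_m * D_m / n (its admissible
                       value depends on the framework; 2.0 is only a placeholder)
        Returns the index m_hat minimizing the penalized contrast.
        """
        def penalized(m):
            return contrasts[m] + kappa * weights[m] * dims[m] / n
        return min(contrasts, key=penalized)

    # Toy usage with three candidate models and pre-computed contrast values.
    n = 100
    contrasts = {"small": 0.40, "medium": 0.25, "large": 0.24}
    dims = {"small": 2, "medium": 5, "large": 40}
    weights = {m: 1.0 for m in dims}            # L_m = 1 for a small fixed list
    print(select_model(contrasts, dims, weights, n))   # prints "medium"

In the applications below, the only problem-specific ingredients are the empirical contrast γ_n and the calibration of κ and of the weights L_m.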

1.2. Model selection

Historically, one can consider that model selection begins with the works of Mallows (1973) and Akaike (1973), although classical t or F tests and Bayes tests were long used for model selection. Actually, Daniel and Wood (1971, p. 86) already mention the C_p criterion for variable selection in regression as described by Mallows in a conference dating back to 1964. Our model selection criteria can be viewed as extensions of Mallows' and Akaike's. In order to describe the heuristics underlying Mallows' approach, and more generally model selection based on penalization, let us consider here a typical and historically meaningful example, namely model selection for linear regression with fixed design. Let us consider observations Y_1,...,Y_n such that Y_i = s(x_i) + W_i, where the W_i's are centered independent identically distributed variables with variance one and the x_i's are deterministic values in some space X. We want to estimate the function s defined on X from the Y_i's and measure the error of estimation in terms of the distance derived from the Euclidean norm ‖t‖ = [n⁻¹ Σ_{i=1}^n t²(x_i)]^{1/2}. We consider a family of linear models {S_m}_{m∈M_n} (finite dimensional spaces of functions on X), each model S_m being of dimension D_m. Let s_m be the orthogonal projection of s onto S_m and ŝ_{m,n} be the least squares estimator of s relative to S_m. The risk of ŝ_{m,n} is equal to

    E_s[ ‖ŝ_{m,n} − s‖² ] = ‖s − s_m‖² + D_m/n.

Since ‖s − s_m‖² = ‖s‖² − ‖s_m‖², the ideal model is given by the minimization of −‖s_m‖² + D_m/n + n⁻¹ Σ_{i=1}^n Y_i². Let us consider the normalized residual sum of squares n⁻¹ Σ_{i=1}^n [Y_i − ŝ_{m,n}(x_i)]². Since ‖ŝ_{m,n}‖² − D_m/n is an unbiased estimator of ‖s_m‖², an unbiased estimator of the ideal criterion to minimize is

    n⁻¹ Σ_{i=1}^n [Y_i − ŝ_{m,n}(x_i)]² + 2D_m/n,

which is precisely Mallows' C_p. If we set

    γ_n(t) = (1/n) Σ_{i=1}^n [Y_i − t(x_i)]²,

we notice that ŝ_{m,n} is the minimizer of γ_n over S_m and that γ_n(ŝ_{m,n}) = n⁻¹ Σ_{i=1}^n [Y_i − ŝ_{m,n}(x_i)]². Therefore Mallows' C_p is a minimum penalized empirical contrast criterion in our sense with pen_{m,n} = 2D_m/n. This procedure is expected to work when the variables ‖ŝ_{m,n}‖² concentrate around their expectations uniformly with respect to m. This is not clear at all when the cardinality of M_n is large as compared to n. Since the practical use of Mallows' C_p criterion is for a fixed sample size, it is a natural question to wonder whether the criterion will work for a given value of the cardinality of M_n as a function of n.
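As a concrete illustration of the preceding heuristics (ours, not taken from the paper), the following sketch computes Mallows' C_p = n⁻¹ Σ_i [Y_i − ŝ_{m,n}(x_i)]² + 2D_m/n for a nested family of polynomial models on a fixed design with unit error variance and selects the degree minimizing it; the design, the true s and the degree range are arbitrary choices for the example:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    x = np.linspace(-1.0, 1.0, n)                 # fixed design
    s = 1.0 + x - 2.0 * x**3                      # true regression function
    y = s + rng.standard_normal(n)                # unit-variance Gaussian errors

    def cp(deg):
        """Mallows' C_p = RSS/n + 2*D_m/n for the polynomial model of degree deg
        (dimension D_m = deg + 1), with known error variance equal to one."""
        X = np.vander(x, deg + 1)                 # design matrix of the model S_m
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = np.sum((y - X @ coef) ** 2)
        return rss / n + 2.0 * (deg + 1) / n

    degrees = range(0, 16)
    best = min(degrees, key=cp)
    print("selected degree:", best)               # typically 3 for this signal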

This particular problem has been studied by Shibata (1981) for Gaussian errors and by Li (1987) under suitable moment assumptions on the errors (see also Polyak and Tsybakov 1990 for sharper moment conditions in the Fourier case). One can in particular deduce from these works that if the family of models {S_m}_{m∈M_n} is nested and each model has a dimension bounded by n, the heuristics of Mallows' C_p is validated in the sense that the selected index m̂ provides an estimator ŝ_{m̂,n} such that asymptotically the risk E_s[‖s − ŝ_{m̂,n}‖²] is equivalent to inf_{m∈M_n} E_s[‖s − ŝ_{m,n}‖²]. It is worth noticing that this asymptotic equivalence holds provided that s does not belong to any of the S_m's.

Apart from Mallows' C_p, classical empirical penalized criteria for model selection include the AIC, BIC, and MDL criteria proposed by Akaike (1973), Schwarz (1978), and Rissanen (1978 and 1983), respectively. They differ in the structure of the penalties involved, which are based on asymptotic, Bayesian or information-theoretic considerations, and concern various empirical criteria such as maximum likelihood and least squares. For our approach to model selection, the penalty term is motivated solely on the basis of what sorts of statistical risk bounds we can obtain. This conceptual point of view has been previously developed by Barron and Cover (1991) in their attempt to provide a global approach to model selection. Using a class of discretized models, Barron and Cover (1991) or Barron (1991) prove risk bounds for complexity regularization criteria which in some cases include AIC, BIC, and MDL. The work by Barron and Cover is for criteria that possess a minimum description length interpretation and the discretization reduces the choice to a countable set of candidate functions t with penalty L(t)/n satisfying Σ_t 2^{−L(t)} ≤ 1, as required for lengths of uniquely decodable codes. There these authors developed an approximation index called the index of resolvability that is a precursor to our accuracy index a_n(s), and they establish comparable risk bounds for the Hellinger distance in density estimation. The main innovation here, as compared to Barron and Cover (1991), is that we do not require that the models be discrete. This supposes a lot of additional work. The technical approach in this paper is in the spirit of Vapnik (1982). His method of empirical minimization of the risk also heavily relies on an analysis of the behavior of an empirical contrast based on empirical process theory, and his method of structural minimization of the risk is related to a model selection criterion which parallels ours. We use here the tools developed in Birgé and Massart (1998). This makes a difference between Vapnik's approach and ours, both in the formulation of the empirical process conditions and in the techniques. In particular, the introduction of recent isoperimetric inequalities by Talagrand (1994 and 1996) in the case of projection estimators on linear spaces, which has proved its efficiency in Birgé

and Massart (1997) and more recently in Baraud (1997), allows one to obtain, in some cases, precise numerical evaluations of the penalty terms and to justify, even from a non-asymptotic point of view, Mallows' C_p, relaxing some restrictions imposed by Shibata (1981) and Li (1987). However, in general, penalty terms that satisfy our conditions may be different from those which are used in the familiar criteria. For instance we might have to consider heavier penalty terms if necessary in order to take into account the complexity of the family M_n.

As to the implementation of minimum penalized contrast procedures, to be honest, we feel that this paper is merely a starting point which does not directly provide practical devices. However it is already possible to make a few remarks about implementation. The numerical value of the penalty function can be fixed in some cases as mentioned above. Also, as shown in Birgé and Massart (1997), the minimization procedure, even if the number of models is large, can be rather simple in some particular cases of interest since it is partly explicitly solvable, leading for instance to threshold or related estimators.

1.3. Sieve methods and approximation theory

Let us recall that, for a given sieve S of dimension D, d²(s, S) + D/n typically represents the order of magnitude of the risk R_n(s) of a minimum contrast estimator ŝ_n, measured by the mean integrated squared error between s and ŝ_n. The terms d²(s, S) and D/n correspond to the squared bias and variance components, respectively. Given some prior information on s (for instance an upper bound on some smoothness norm) one can, from approximation theory, choose a family {S_m}_{m∈M_n} of finite dimensional sieves such that s is a cluster point of their union. If we select a sieve S_{m_n} in the family according to the presumed properties of the target function, rather than adaptively on the basis of the data, what we study would fall under the general heading of analysis of sieves for function estimation. The choice of S_{m_n} is determined by a particular trade-off between the variance and an upper bound for the squared bias. This method can lead to minimax risk computations. For instance, let us assume that s belongs to some Sobolev ball 𝒮_θ where θ is some known parameter which characterizes this ball. Approximation theory provides privileged families of sieves, like spaces of piecewise polynomials with fixed or variable knots, trigonometric polynomials or wavelet expansions, with optimal approximation properties with respect to those balls. Such a suitable choice of the list of sieves S_m, m ∈ M_n can typically guarantee that for given n and θ the minimax risk R_n(θ) satisfies

    R_n(θ) = inf_{s̃_n} sup_{s∈𝒮_θ} E_s[ d²(s, s̃_n) ] ≥ C_1(θ) inf_{m∈M_n} sup_{s∈𝒮_θ} [ d²(s, S_m) + D_m/n ],   (1.3)

where s̃_n is an arbitrary estimator. Such inequalities can in general be obtained by combining results in approximation theory with classical lower bounds on the minimax risk available in various contexts (density estimation, regression, white noise). Some references, among many others, are Bretagnolle and Huber (1979), Ibragimov and Khas'minskii (1980 and 1981), Nemirovskii (1985), Birgé (1983 and 1986), Donoho and Johnstone (1998). Therefore, if m(n,θ) is a value of m which minimizes sup_{s∈𝒮_θ} d²(s, S_m) + D_m/n, the resulting minimum contrast estimator on the sieve S_{m(n,θ)} is typically minimax (up to some constant independent of n) on 𝒮_θ. The rates of convergence for sieve methods, as introduced by Grenander (1981), have been studied by several authors: Cencov (1982), Grenander and Chow (1985), Cox (1988), Stone (1990 and 1994), Barron and Sheu (1991), Haussler (1992), McCaffrey and Gallant (1994), Shen and Wong (1994), and Van de Geer (1995). The main drawback of the preceding approach is connected with the prior assumption on the unknown s, which is not attractive for practical use although those estimators are relevant for minimax risk computations. As a matter of fact, Stone pointed out that his own works on sieve methods (mainly devoted to splines) were first steps towards data driven methods of nonparametric estimation. More precisely, he had in view to provide some theoretical justifications for MARS (see Friedman 1991).

The mathematical analysis of sequences of finite-dimensional models is at the heart of the techniques that we put to use in our study of adaptive methods of model selection. The point here is that a mere control of the quadratic risk on each sieve is far from being sufficient for achieving our program, as described in Section 1.1. Much more will be needed here and we shall have to make use of the exponential inequalities for the fluctuations of an empirical contrast on a sieve established in Birgé and Massart (1998). We wish to allow a general framework of sieves characterized by their metric dimension and approximation properties. The examples we study typically involve linear combinations of a family of basis functions {ϕ_λ}_{λ∈Λ}, which are parameterized by an index λ that is either discrete or continuous valued. In the discrete index case we have in mind examples of models based on Fourier series, wavelets, polynomials and piecewise polynomials with a discrete set of knot locations. Here the issue is the adaptive selection of the number of terms, including all terms up to some total, or the issue may be which subset of terms provides approximately the best estimate. In the first case there is only one sieve of each dimension and in the second there may be exponentially many candidate models as a function of dimension. The choice of whether subsets are taken has an impact on what types of tradeoffs are possible between bias and variance and on what types of penalty terms are permitted. In both cases the penalty term will be proportional

to the number of terms in the models, but in the latter case there is an additional logarithmic penalty factor that is typically necessary to realize approximately the best subset among exponentially many choices without substantial overfit. In contrast, the use of fixed sets of terms typically allows for a penalty term with no logarithmic factors, but, as we shall quantify, in the absence of subset selection there can be less ability to realize a small statistical risk. In the continuous index case we have in mind flexible nonlinear models including neural nets, trigonometric models with estimated frequencies, piecewise linear hinged hyperplane models and other piecewise polynomials with continuously parameterized knot locations. In these cases we write φ_w instead of ϕ_λ for the terms that are linearly combined, where w is a continuous vector-valued parameter. Not surprisingly, if the terms φ_w depend smoothly on w, the behavior of these nonlinear models is comparable to what is achieved in the discretized index set case with subset selection. We find that these nonlinear models have metric dimension properties that we can bound, but they lack the homogeneity of metric dimension satisfied by linear models with a fixed set of terms. The effect is that once again logarithmic factors arise in the penalty term and in the risk bounds. The advantage due to parsimony of the nonlinear models or the subset selection models is made especially apparent in the case of inference of functions with a high input dimension. In high dimensions, the exponential number of terms in linear models without subset selection precludes their practical use.

1.4. From model selection to adaptation

Let us now consider the possible connections between our approach and adaptive estimation from the minimax point of view. As a matter of fact, the adaptive properties of nonparametric estimators obtained from discrete model selection were already pointed out and studied by Barron and Cover (1991) for a number of classes of functions including Sobolev classes of log-densities, without prior knowledge of which orders of smoothness and which norm bounds are satisfied by the target function. To recover the Barron and Cover result as a special case of our general density estimation results, set each model here to be a single function in their countable list. Barron (1991) extended the discretized model approach to deal also with complexity regularization for least squares regression and other bounded loss functions and applied it to artificial neural network models (see Barron 1994). Let us also mention that the present paper is a companion to the paper by two of us (Birgé and Massart 1997) which explores the role of adaptive estimation for projection estimators of densities using linear models. Applications are given there to wavelet estimation and connections are established with

thresholding of wavelet coefficients and cross-validation criteria. More recently, Yang and Barron (1998) have obtained some results similar to ours for the particular case of log-density models.

Let us now provide a mathematical content to what we mean here by adaptation. Given a family {𝒮_θ}_{θ∈Θ} of sets of functions, we recall that the minimax risk over 𝒮_θ is given by

    R_n(θ) = inf_{s̃_n} sup_{s∈𝒮_θ} E_s[ d²(s, s̃_n) ],

where s̃_n is an arbitrary estimator. We shall call a sequence of estimators (s̃_n)_{n≥1} adaptive in the minimax sense if for every θ ∈ Θ there exists a constant C(θ) such that

    sup_{s∈𝒮_θ} E_s[ d²(s, s̃_n) ] ≤ C(θ) R_n(θ).

If, for instance, one wants to give a precise meaning to the problem of estimating a function s of unknown smoothness, one can assume that s belongs to one of a large collection of balls such as Sobolev balls of variable index of smoothness and radius. Our purpose is to point out the connection between model selection via penalization as described previously and adaptation in the minimax sense. Starting from (1.2) and assuming that L_m = L for all m and n and that C(s) is bounded by C_2(θ) uniformly for s ∈ 𝒮_θ, one derives that

    sup_{s∈𝒮_θ} E_s[ d²(s, ŝ_{m̂,n}) ] ≤ C_3(θ) inf_{m∈M_n} sup_{s∈𝒮_θ} [ d²(s, S_m) + D_m/n ].

If the family {S_m}_{m∈M_n} has convenient approximation properties with respect to the family {𝒮_θ}_{θ∈Θ} such that (1.3) holds, it will follow that ŝ_{m̂,n} is adaptive with respect to the family {𝒮_θ}_{θ∈Θ} in the minimax sense. We shall actually devote a large part of the paper to the illustration of this principle on various examples. For most of the illustrations that we shall consider one can take either L_m as a constant L or as log n. In the latter case we shall get adaptation up to a slowly varying function of n. Moreover, in the first case, we shall also discuss the precise dependency of the ratio C_3(θ)/C_1(θ) with respect to θ and sometimes show that it is bounded independently of θ.

There is a huge amount of recent literature devoted to adaptive estimation and we postpone to Section 5 a discussion about the connections between model selection and adaptive estimation, including a comparison between our approach to adaptation and the already existing methods and results. The structure of the paper is described in the Table of Contents. Let us only mention that Sections 4, 7 and 8 are clearly more technical and

can be skipped at first reading. A first and particularly simple illustration of what we want to do and of the ideas underlying our approach is given in Section 2, which provides a self-contained introduction to our method, while Section 3 provides an overview of its application to various situations. Section 5 does not contain any new result but is devoted to some detailed discussion, based on the examples of Sections 2 and 3, about the connections between adaptation and model selection.

2. A glimpse of the essentials

In order to give an idea of the way our approach to minimum penalized empirical contrast estimation works, let us describe it in the simplest framework we know, namely Gaussian regression on a fixed design. Its simplicity allows us to give a short and self-contained proof of an upper bound, involving the accuracy index, for the risk of penalized least squares estimators. The main issue here is to highlight the connection between the concentration of measure phenomenon and the choice of the penalty function for model selection.

2.1. Model selection in a toy framework

In the Gaussian regression framework we observe n random variables Y_i = s(x_i) + W_i where the x_i's are known and the W_i's are independent identically distributed standard normal. Identifying any function t defined on the set X = {x_1,...,x_n} with a vector t = (t_1,...,t_n)ᵀ ∈ ℝⁿ by setting t_i = t(x_i), we define a scalar product and a norm on ℝⁿ by

    ⟨t, u⟩ = (1/n) Σ_{i=1}^n t(x_i) u(x_i)   and   ‖t‖² = (1/n) Σ_{i=1}^n t²(x_i).   (2.1)

We introduce a countable family {S_m}_{m∈M_n} of linear models, S_m being of dimension D_m, and for each m we consider the least squares estimator ŝ_m on S_m, which is a minimizer with respect to t ∈ S_m of

    γ_n(t) = ‖t‖² − 2⟨Y, t⟩,   where Y = (Y_1,...,Y_n)ᵀ.

Then we choose a prior family of weights {L_m}_{m∈M_n} with L_m ≥ 1 for each m, such that

    Σ_{m∈M_n} exp[−L_m D_m] = Σ < +∞.   (2.2)

Our aim is to prove the following

Theorem 1. Let pen(m) be defined on M_n by pen(m) = κ L_m D_m/n for a suitable constant κ, the weights L_m satisfying (2.2). Let ŝ_m be the minimizer of γ_n(t) for t ∈ S_m and ŝ_m̂ be the minimizer among the family {ŝ_m}_{m∈M_n} of the penalized criterion γ_n(ŝ_m) + pen(m). Then ŝ_m̂ satisfies

    E_s[ ‖s − ŝ_m̂‖² ] ≤ κ′ inf_{m∈M_n} { d²(s, S_m) + pen(m) } + κ″ Σ n⁻¹,   (2.3)

where d²(s, S_m) = inf_{t∈S_m} ‖s − t‖² and κ′, κ″ are numerical constants.

Remark: The following proof uses κ = 24, leading to κ′ = 3 and κ″ = 32, which is obviously far from optimal as follows from Li (1987) or Baraud (1997). The result actually holds, for instance, with κ = 2 as in Mallows' C_p, but a proof leading to better values of the constants would be longer, involve additional technicalities and also use more specific properties of the framework. Since we want here to give a short and intuitive proof, in the spirit of the subsequent results given in the paper for different frameworks, we prefer to sacrifice optimality to simplicity and readability and put the emphasis on the main ideas to be used in the sequel, without the specific tricks which are required for optimizing the constants.

Proof: We start with the identity

    ‖t − s‖² = γ_n(t) + 2⟨W, t⟩ + ‖s‖²,

where W = (W_1,...,W_n)ᵀ, and notice that, by definition, for any given m ∈ M_n,

    γ_n(ŝ_m̂) + pen(m̂) ≤ γ_n(s_m) + pen(m),

where s_m denotes the orthogonal projection of s onto S_m. Combining these two formulas we get

    ‖s − ŝ_m̂‖² ≤ ‖s − s_m‖² + pen(m) − pen(m̂) + 2⟨W, ŝ_m̂ − s_m⟩.   (2.4)

Let m be fixed. Given some m′ ∈ M_n, we introduce the Gaussian process {Z(t)}_{t∈S_{m′}} defined by

    Z(t) = ⟨W, t − s_m⟩ / w(m′, t)   where   w(m′, t) = ‖t − s‖² + ‖s − s_m‖² + x_{m′}/n,

x_{m′} being some positive number to be chosen later. As a consequence of Cirel'son, Ibragimov and Sudakov's inequality (see Cirel'son, Ibragimov and Sudakov 1976 and, for more details about Gaussian concentration inequalities, Ledoux 1996),

    P_s[ sup_{t∈S_{m′}} Z(t) ≥ E + λ ] ≤ exp[−λ²/(2σ²)]   for any λ > 0,   (2.5)

provided that E ≥ E_s[sup_{t∈S_{m′}} Z(t)] and sup_{t∈S_{m′}} Var(Z(t)) ≤ σ². Let us first notice that

    w(m′, t) ≥ (1/2)[ ‖t − s_m‖² + 2x_{m′}/n ] ≥ (2x_{m′}/n)^{1/2} ‖t − s_m‖   (2.6)

and that, for any function u, Var(⟨W, u⟩) = n⁻¹‖u‖². Then Var(Z(t)) = n⁻¹‖t − s_m‖² w⁻²(m′, t), which immediately yields that we can take σ² = (2x_{m′})⁻¹ in (2.5). On the other hand, expanding t − s_m on an orthonormal basis (ψ_1,...,ψ_N) of S_m + S_{m′} with N ≤ D_m + D_{m′}, one gets by the Cauchy-Schwarz inequality that

    Z²(t) ≤ ‖t − s_m‖² w⁻²(m′, t) Σ_{j=1}^N ⟨W, ψ_j⟩²,

and it follows from (2.6) and Jensen's inequality that we can take E = [(D_m + D_{m′})/(2x_{m′})]^{1/2} in (2.5). If λ is given by λ = [(x + L_{m′} D_{m′})/x_{m′}]^{1/2}, where x is any positive number, we derive that

    λ + E ≤ [ (D_m + D_{m′} + 2x + 2L_{m′}D_{m′}) / x_{m′} ]^{1/2} ≤ 1/4   if   x_{m′} = 16(D_m + 2x + 3L_{m′}D_{m′}).

It then follows that

    P_s[ Z(ŝ_{m′}) ≥ 1/4 ] ≤ P_s[ sup_{t∈S_{m′}} Z(t) ≥ 1/4 ] ≤ exp(−L_{m′}D_{m′}) exp(−x),

and therefore, summing up those inequalities with respect to m′, that

    P_s[ sup_{m′∈M_n} ⟨W, ŝ_{m′} − s_m⟩ / w(m′, ŝ_{m′}) ≥ 1/4 ] ≤ Σ exp(−x).   (2.7)

This implies, from the definitions of w and x_{m′}, that, except on a set of probability bounded by Σ e^{−x},

    4⟨W, ŝ_m̂ − s_m⟩ ≤ w(m̂, ŝ_m̂) = ‖s − ŝ_m̂‖² + ‖s − s_m‖² + 16 n⁻¹ (D_m + 2x + 3L_m̂ D_m̂).

Coming back to (2.4), this implies that

    ‖s − ŝ_m̂‖² ≤ 3‖s − s_m‖² + 2 pen(m) − 2 pen(m̂) + 16 n⁻¹ (D_m + 2x + 3L_m̂ D_m̂).

The choice κ = 24 entails the cancellation of pen(m̂), showing that, since L_m ≥ 1,

    ‖s − ŝ_m̂‖² ≤ 3‖s − s_m‖² + (8/3) pen(m) + 32 n⁻¹ x

apart from a set of probability bounded by Σ e^{−x}. Setting

    V = ( ‖s − ŝ_m̂‖² − 3‖s − s_m‖² − (8/3) pen(m) ) ∨ 0,

we get

    E_s[ ‖s − ŝ_m̂‖² ] ≤ 3‖s − s_m‖² + (8/3) pen(m) + E_s[V]

and P_s[V ≥ 32x/n] ≤ Σ exp(−x). Integrating with respect to x implies that E_s[V] ≤ 32 Σ/n, which yields (2.3) since m is arbitrary.
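The engine of the proof is the Gaussian concentration inequality (2.5). As a small numerical sanity check (ours, using a much simpler process than the one above), one can simulate the supremum of ⟨W, t⟩ over the unit ball, for the norm (2.1), of a D-dimensional coordinate subspace, for which the variance factor is σ² = 1/n, and compare its upper tail with the bound exp[−λ²/(2σ²)]:

    import numpy as np

    rng = np.random.default_rng(1)
    n, D, reps, lam = 200, 10, 20000, 0.15

    # The sup of <W, t> over the unit ball (for the norm (2.1)) of a D-dimensional
    # coordinate subspace equals the empirical norm of the projection of W on it.
    W = rng.standard_normal((reps, n))
    sups = np.sqrt(np.sum(W[:, :D] ** 2, axis=1) / n)

    E = sups.mean()              # Monte-Carlo proxy for E[sup Z]
    sigma2 = 1.0 / n             # sup of Var(<W, t>) over that unit ball
    empirical = np.mean(sups >= E + lam)
    bound = np.exp(-lam ** 2 / (2.0 * sigma2))
    print(f"P(sup >= E + {lam}) ~ {empirical:.4f}, Gaussian bound: {bound:.4f}")

The empirical tail probability stays well below the bound, which is exactly the margin that the penalty term has to absorb uniformly over the list of models.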

2.2. Variable selection

We want to provide here a typical application of Theorem 1. Let us assume that we are given some (large) orthonormal system {ϕ_1,...,ϕ_N} in ℝⁿ with respect to the norm (2.1). We want to get an estimate of s of the form s̃ = Σ_{λ∈m} β̂_λ ϕ_λ, where m is some suitable subset of {1, 2,...,N}. Let us first recall that if m is given, the projection estimator ŝ_m over S_m = Span{ϕ_λ | λ ∈ m}, which is the minimizer with respect to t ∈ S_m of the criterion γ_n(t), is given by

    ŝ_m = Σ_{λ∈m} β̂_λ ϕ_λ   with   β̂_λ = ⟨Y, ϕ_λ⟩,

and that γ_n(ŝ_m) = −Σ_{λ∈m} β̂_λ². Elementary computations show that

    E_s[ ‖s − ŝ_m‖² ] = d²(s, S_m) + |m|/n.

Unfortunately, since s is unknown, we do not know how to choose m in an optimal way in order to minimize d²(s, S_m) + |m|/n. In order to select m from the data, let us describe two simple strategies (among many others), as implemented in the sketch following this paragraph.

i) Ordered variable selection. In this case we select the variables ϕ_λ in their natural order, which means that we restrict ourselves to m_k = {ϕ_λ | 1 ≤ λ ≤ k}, letting k vary from 1 to N. In such a case one can take L_m = 1, Σ ≤ 0.6, pen(m_k) = κk/n, and get a penalized least squares estimator ŝ_k̂ where k̂ is the minimizer of κk/n − Σ_{λ∈m_k} β̂_λ². By Theorem 1, the risk of this estimator is bounded by

    E_s[ ‖s − ŝ_k̂‖² ] ≤ κ′ inf_{1≤k≤N} { d²(s, S_{m_k}) + k/n }

for a suitable numerical constant κ′. One should notice here that N does not enter the bound and can therefore be infinite, and that we get the optimal risk among our family apart from the constant factor κ′. Note that this optimality is with respect to the best that can be achieved among the class of ordered variable selection models.

ii) Complete variable selection. Here we take m to be any nonvoid subset of {1, 2,...,N}. Since the number of such subsets with a given cardinality D is (N choose D) < (eN/D)^D by Lemma 6, one can choose L_m = 1 + log N for all m and Σ ≤ 1.3. The resulting value m̂ is then obtained by minimizing

    κ(1 + log N)|m|/n − Σ_{λ∈m} β̂_λ².

It is easily seen that this amounts to selecting the values of λ such that β̂_λ² > κ(1 + log N)/n, which means that

    m̂ = { λ | |β̂_λ| > [κ(1 + log N)/n]^{1/2} }.

Therefore ŝ_m̂ is a threshold estimator as studied by Donoho and Johnstone (1994a). Moreover, by Theorem 1, there exists a constant κ′ such that

    E_s[ ‖s − ŝ_m̂‖² ] ≤ κ′ inf_m { d²(s, S_m) + |m| (log N)/n }.

If N is independent of n, we only lose a constant as compared to the ideal estimator; if N grows as a power of n, we only lose a log n factor as compared to the optimal risk for the class of all subset models, as in Donoho and Johnstone (1994a). This is the price to pay for complete variable selection among a large family, but what is gained can be vastly superior in the approximation versus dimension tradeoff in the risk.
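Both strategies are easy to implement. The sketch below is ours: the cosine system, the sparse target and the value of κ are illustrative choices only, and κ would have to be calibrated as discussed above. It computes the empirical coefficients β̂_λ = ⟨Y, ϕ_λ⟩ and then performs (i) ordered selection by minimizing κk/n − Σ_{λ≤k} β̂_λ² and (ii) complete selection by hard thresholding at [κ(1 + log N)/n]^{1/2}:

    import numpy as np

    rng = np.random.default_rng(2)
    n, N = 256, 128
    kappa = 4.0                               # penalty constant (illustrative)

    # Orthonormal system for the empirical norm (2.1): scaled discrete cosines.
    i = np.arange(n)
    Phi = np.vstack([np.sqrt(2.0) * np.cos(np.pi * lam * (i + 0.5) / n)
                     for lam in range(1, N + 1)])       # Phi[lam-1] = phi_lam
    s = 2.0 * Phi[0] + 1.0 * Phi[2] + 0.5 * Phi[9]      # sparse true signal
    Y = s + rng.standard_normal(n)

    beta_hat = Phi @ Y / n                              # <Y, phi_lam> for (2.1)

    # i) Ordered variable selection: minimize kappa*k/n - sum_{lam<=k} beta_hat^2.
    crit = kappa * np.arange(1, N + 1) / n - np.cumsum(beta_hat ** 2)
    k_hat = int(np.argmin(crit)) + 1

    # ii) Complete variable selection: keep beta_hat^2 > kappa*(1+log N)/n,
    #     i.e. hard thresholding at [kappa*(1+log N)/n]^(1/2).
    thresh = np.sqrt(kappa * (1.0 + np.log(N)) / n)
    m_hat = np.flatnonzero(np.abs(beta_hat) > thresh) + 1   # selected indices

    print("ordered selection keeps the first", k_hat, "coefficients")
    print("thresholding keeps lambda in", m_hat.tolist())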

Conclusion: The simplicity of treatment of the preceding example is mainly due to the fact that the centered empirical contrast −2⟨W, t⟩ is a Gaussian linear process acting on a finite dimensional linear space. The same treatment could be applied as well to penalized projection estimation for the white noise setting. Unfortunately, the treatment of other empirical contrast functions or of nonlinear models requires that several technical difficulties be overcome. If we set here l_n(s, t) = E_s[γ_n(t) − γ_n(s)], then l_n(s, t) = ‖s − t‖². In a non-Gaussian framework, one has to deal with a general empirical contrast function γ_n and the analogue of (2.4) becomes

    l_n(s, ŝ_m̂) ≤ l_n(s, s_m) + [γ_n⁰(s_m) − γ_n⁰(ŝ_m̂)] + pen(m) − pen(m̂),

where γ_n⁰(t) = γ_n(t) − E_s[γ_n(t)]. Pure L2-assumptions are not enough to control the fluctuations of the centered empirical contrast (the bracketed term) involved in this inequality. This motivates the introduction of L∞-type assumptions on our models in the next section. Moreover, the structure of the exponential bounds that we use is connected to Bernstein's inequality rather than to a subgaussian type inequality. We also would like to point out the status of the distance d, which has to be closely connected to the empirical contrast and chosen not too small, in order to provide an appropriate control of the fluctuations of γ_n⁰, and not too large, in order that d²(s,t) be controlled by l_n(s,t). In the most favorable case of the projection density estimator on linear models, one can mimic the preceding proof, replacing the concentration inequality (2.5) of Cirel'son, Ibragimov and Sudakov by an inequality of Talagrand (1996). The point here is that the linearity of the model and of γ_n⁰(t) as a function of t allows one to use the Cauchy-Schwarz inequality, as we did before, to control the expectation of the supremum of the process involved. This point of view is developed in Birgé and Massart (1997) for projection density estimation and in Baraud (1997) for non-Gaussian regression. More generally, in the nonlinear context, one has to deal with suitable modifications of the entropy methods introduced by Dudley (1978) to build the required exponential inequalities. Such results are collected in Proposition 7 below, which is mainly based on Theorem 5 and Proposition 3 of Birgé and Massart (1998). Moreover, in the case of maximum likelihood estimation, we have to modify the initial empirical process in order to keep its fluctuations under control, at the price of additional difficulties to get an analogue of inequality (2.4).

3. Main results with some illustrations

3.1. The minimum penalized empirical contrast estimation method

We wish to analyze various functional estimation problems (density estimation, regression estimation, ...) that we describe precisely below. A common statistical framework covering all these examples is as follows. We observe n random variables Z_1,...,Z_n which, in the context of this paper, are assumed to be independent. These variables are defined on some measurable space (Ω, A) and take their values in some measurable space (Z, U). The space (Ω, A) is equipped with a family of probabilities {P_s}_{s∈S} where S is a subset of some L2-space, L2(µ). Note that both µ and S can

depend on n, the same being true for each probability P_s, but we do not make this dependence appear in the notation for the sake of simplicity since those quantities will be fixed (independent of n) in most applications. We denote by E_s the expectation with respect to the probability P_s, by P_n the empirical distribution of the Z_i's and by ν_n = P_n − E_s[P_n] the centered empirical measure. The space L2(µ) is equipped with the distance d induced by the norm ‖·‖ = ‖·‖_2. More generally, for 1 ≤ p ≤ ∞, the norm in L_p(µ) is denoted by ‖·‖_p. Let us now introduce the key elements and notions that we need in the sequel.

Definition 1. Given some subset T of L2(µ) containing S, an empirical contrast function γ_n on T is defined for all t ∈ T as the empirical mean γ_n(t) = n⁻¹ Σ_{i=1}^n γ(Z_i, t), where γ is a function defined on Z × T which satisfies

    E_s[γ_n(t)] ≥ E_s[γ_n(s)]   for all s ∈ S and t ∈ T.

We then introduce a countable collection of subsets S_m of T (models) indexed by m ∈ M_n. These models play the role of approximating spaces (sieves) for the true unknown value s of the parameter, which might or might not be included in one of them. Typically, S_m is a subset of a finite-dimensional linear space. In order to make the notations simple we shall assume that everything which depends on m ∈ M_n might depend on n, but we omit this second index. We then consider a penalty function pen(m) which is a positive function on M_n. We shall see later how to define this penalty function in order to get a sensible estimator. Let ε_n ≥ 0 be given; a minimum penalized empirical contrast estimator is defined as follows:

Definition 2. Given some nonnegative number ε_n, an empirical contrast function γ_n, a collection of models {S_m}_{m∈M_n} and a penalty function pen(·) on M_n, an ε_n-minimum penalized contrast estimator is any estimator ŝ in ∪_{m∈M_n} S_m with ŝ ∈ S_m̂ such that

    γ_n(ŝ) + pen(m̂) ≤ inf_{m∈M_n} { inf_{t∈S_m} γ_n(t) + pen(m) } + ε_n.   (3.1)

If ε_n = 0 we speak of a minimum penalized contrast estimator.

As usual, by estimator we mean a measurable mapping from (Z, U)ⁿ to the metric space (T, d) endowed with its Borel σ-algebra. If we omit the measurability problems, such an estimator is always defined provided that ε_n > 0, but might not be unique. Nevertheless, the following results

do apply to any solution of (3.1). In order to simplify the presentation we shall assume throughout the paper that ŝ is well-defined for ε_n = 0. It turns out from our proofs that the choice ε_n = n⁻¹ would lead to the same risk bounds as those provided in the theorems below for the case ε_n = 0. Some classical examples of minimum contrast estimation methods follow.

Maximum likelihood density estimation. We observe n independent identically distributed variables Z_1,...,Z_n of density s² with respect to µ. We define T to be the set of nonnegative elements of norm 1 in L2(µ) (which means that their squares are probability densities) and take S ⊂ T. The choice of the function γ(z,t) = −log t(z) leads to maximum penalized likelihood estimators.

Projection estimators for density estimation. We assume that µ is a probability measure and that the unknown density of the i.i.d. observations Z_1,...,Z_n belongs to L2(µ). It can therefore be written 1 + s where s is orthogonal to the constant function 1. We take for T the subspace of L2(µ) which is orthogonal to 1 and derive the empirical contrast from γ(z,t) = ‖t‖² − 2t(z), S being chosen as any subset of those t ∈ T such that 1 + t ≥ 0. If S_m is a linear subspace of T with an orthonormal basis {ϕ_λ}_{λ∈Λ_m}, minimizing γ_n(t) over S_m leads to the classical projection estimator ŝ_m on S_m given by

    ŝ_m = Σ_{λ∈Λ_m} β̂_λ ϕ_λ   with   β̂_λ = (1/n) Σ_{i=1}^n ϕ_λ(Z_i).

Classical least squares regression. Observations are pairs (X_i, Y_i) = Z_i with Y_i = s(X_i) + W_i, and the variables X_i and W_i are all independent with respective distributions R_i and Q_i (independent of s), but not necessarily identically distributed since we want to include fixed design regression in our framework. In this case S ⊂ T = L2(µ) where µ denotes the average distribution of the X_i's: µ = n⁻¹ Σ_{i=1}^n R_i. This distribution actually depends on n in the case of a fixed design but not in the case of a random design. We assume that the errors W_i are centered and choose γ(z,t) = [y − t(x)]². The resulting estimator is a penalized least squares estimator.
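For instance, the projection estimator for density estimation defined above takes one line per coefficient. The sketch below is ours, not the paper's: the cosine system on [0,1] and the Beta-distributed sample are illustrative choices. It computes β̂_λ = n⁻¹ Σ_i ϕ_λ(Z_i) and evaluates the resulting density estimate 1 + ŝ_m:

    import numpy as np

    rng = np.random.default_rng(3)
    n, D = 500, 8                       # sample size, model dimension D_m

    # i.i.d. sample from a density on [0,1] (here a Beta(2,4) density).
    Z = rng.beta(2.0, 4.0, size=n)

    # Orthonormal system of L2([0,1], dx) orthogonal to the constant function:
    # phi_lambda(x) = sqrt(2) cos(pi * lambda * x), lambda = 1, 2, ...
    def phi(lam, x):
        return np.sqrt(2.0) * np.cos(np.pi * lam * x)

    # Projection estimator on S_m = span{phi_1, ..., phi_D}:
    # beta_hat_lambda = (1/n) sum_i phi_lambda(Z_i); density estimate = 1 + s_hat_m.
    beta_hat = np.array([phi(lam, Z).mean() for lam in range(1, D + 1)])

    def density_estimate(x):
        return 1.0 + sum(b * phi(lam, x) for lam, b in enumerate(beta_hat, start=1))

    x = np.linspace(0.0, 1.0, 5)
    print(np.round(density_estimate(x), 2))   # estimated density at a few points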

Minimum-L1 regression. We use the same regression framework as before, now assuming that the W_i's are centered at their median, and define γ(z,t) = |y − t(x)|. These frameworks and related empirical contrast functions have been described in greater detail in Birgé and Massart (1993) and Birgé and Massart (1998). We therefore refer the reader to these papers for more information.

3.2. Examples of models

In all our results, the value pen(m) of the penalty function is, in particular, connected with the number D_m of parameters which are necessary to describe the elements of the model S_m. A general definition of D_m will appear in Section 6 and we shall here content ourselves with the presentation of two cases which are known to be of practical interest.

Linear models. By a linear model we mean a subset S_m of some finite-dimensional linear subspace S̄_m of L2(µ) with dimension D_m. In opposition to what happens in Gaussian situations like Gaussian regression on a fixed design and the white noise setting, the L2-structure of the models is not sufficient to guarantee a good behavior of the empirical contrast function γ_n, which is essential for our purpose as we shall see later. More is needed, specifically some connections between the L2- and L∞-structures of the models. It is the aim of the two following indices (indeed relative to S̄_m) to quantify such connections. Firstly we set

    Φ_m = (1/√D_m) sup_{t ∈ S̄_m \ {0}} ‖t‖_∞ / ‖t‖   (3.2)

and denote by F_m the set of all orthonormal bases of S̄_m. For any finite set Λ and any β ∈ ℝ^Λ, we define |β|_∞ = sup_{λ∈Λ} |β_λ| and |β|_2² = Σ_{λ∈Λ} β_λ². We then notice that, for any orthonormal basis ϕ = {ϕ_λ}_{λ∈Λ_m} ∈ F_m,

    Φ_m = (1/√D_m) sup_{β≠0} ‖Σ_{λ∈Λ_m} β_λ ϕ_λ‖_∞ / |β|_2 = (1/√D_m) ‖Σ_{λ∈Λ_m} ϕ_λ²‖_∞^{1/2}.   (3.3)

The second equality in (3.3) comes from Lemma 1 of Birgé and Massart (1998).
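The identity (3.3) is easy to check numerically. The following sketch (ours) does so for the cosine system ϕ_λ(x) = √2 cos(πλx) on [0,1], comparing a Monte-Carlo evaluation of the supremum over unit vectors β with the closed-form right-hand side; since every ‖ϕ_λ‖_∞ is at most √2, both quantities stay below √2:

    import numpy as np

    # Numerical check of (3.3) for phi_lambda(x) = sqrt(2) cos(pi*lambda*x) on [0,1].
    D = 6
    x = np.linspace(0.0, 1.0, 20001)
    phis = np.array([np.sqrt(2.0) * np.cos(np.pi * lam * x) for lam in range(1, D + 1)])

    # Right-hand side of (3.3): (1/sqrt(D)) * || sum_lambda phi_lambda^2 ||_inf^(1/2)
    rhs = np.sqrt(np.max(np.sum(phis ** 2, axis=0))) / np.sqrt(D)

    # Left-hand side: sup over (sampled) unit vectors beta of
    # (1/sqrt(D)) * || sum_lambda beta_lambda phi_lambda ||_inf
    rng = np.random.default_rng(4)
    lhs = 0.0
    for _ in range(2000):
        beta = rng.standard_normal(D)
        beta /= np.linalg.norm(beta)
        lhs = max(lhs, np.max(np.abs(beta @ phis)) / np.sqrt(D))

    print(f"sampled sup: {lhs:.3f} <= closed form: {rhs:.3f} <= sqrt(2) = {np.sqrt(2):.3f}")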

Secondly we define

    r_m = (1/√D_m) inf_{ϕ∈F_m} sup_{β≠0} ‖Σ_{λ∈Λ_m} β_λ ϕ_λ‖_∞ / |β|_∞.   (3.4)

It follows from (3.3) and this definition that

    Φ_m ≤ r_m ≤ √D_m Φ_m.   (3.5)

Let us now detail a few examples of linear models and bound their indices.

Uniformly bounded basis: If one can find an orthonormal system {ϕ_λ}_{λ∈Λ} such that ‖ϕ_λ‖_∞ ≤ Φ for all λ, if the elements m of M_n are subsets of Λ and S̄_m is the linear span of {ϕ_λ}_{λ∈m}, then Φ_m ≤ Φ by (3.3). Choosing M_n as a countable family of subsets of the trigonometric basis in L2((0, 2π], dx) provides a typical example of this type.

Wavelet expansions: Let us consider an orthonormal wavelet basis {ϕ_{j,k} | j ≥ 0, k ∈ ℤ^q} of L2(ℝ^q, dx) (see Meyer 1990 for details) with the following conventions: the ϕ_{0,k}'s are translates of the father wavelet and, for j ≥ 1, the ϕ_{j,k}'s are affine transforms of the mother wavelet. One will also assume that these wavelets are compactly supported and have continuous derivatives up to some order r. Let t ∈ L2(ℝ^q, dx) be some function with compact support in (0,A)^q. Changing the indexation of the basis if necessary, we can write the expansion of t on the wavelet basis as

    t = Σ_{j≥0} Σ_{k=1}^{2^{jq} M} β_{j,k} ϕ_{j,k},

where M ≥ 1 is a finite integer depending on A and the size of the wavelets' supports. For any j, we denote by Λ(j) the set of indices {(j,k) | 1 ≤ k ≤ 2^{jq} M}. The relevant m's will be subsets of the larger sets ∪_{j=0}^{J} Λ(j) for finite values of J, and we shall denote by J_m the smallest J such that this inclusion is valid. It comes from Bernstein's inequality (see Meyer 1990, Chapter 2, Lemma 8) that r_m ≤ C(2^{qJ_m}/D_m)^{1/2} for some constant C. In particular, for all m's of the form ∪_{j=0}^{J_m} Λ(j), r_m is uniformly bounded and so is Φ_m. The most relevant applications of such expansions have been studied extensively in Birgé and Massart (1997).

We also want to deal with wavelet expansions on the interval [0,1]. Since the general case involves technicalities which are quite irrelevant to the subject of this paper, we shall content ourselves to deal with the simplest case of the Haar basis. Then the following expansion holds for any t ∈ L2([0,1], dx):

    t = β_{−1,1} ϕ_{−1,1} + Σ_{j≥0} Σ_{k=1}^{2^j} β_{j,k} ϕ_{j,k},   (3.6)

where ϕ_{−1,1} = 1_{[0,1]}, ψ = 1_{[0,1/2]} − 1_{(1/2,1]} and ϕ_{j,k}(x) = 2^{j/2} ψ(2^j x − k + 1). We set Λ(−1) = {(−1,1)} and, for j ≥ 0, Λ(j) = {(j,k) | 1 ≤ k ≤ 2^j}. If m = ∪_{j=0}^{J_m} Λ(j), we see from (3.3) that Φ_m = 1. To bound r_m we first notice that, for j ≥ 0,

    ‖Σ_{k=1}^{2^j} β_{j,k} ϕ_{j,k}‖_∞ ≤ 2^{j/2} sup_k |β_{j,k}|.   (3.7)

Therefore

    r_m ≤ ( Σ_{j=0}^{J_m} 2^{j/2} ) ( Σ_{j=0}^{J_m} 2^j )^{−1/2} < 1 + √2.

It may also be useful to choose m = ∪_{j=−1}^{J_m} Λ(j) and then r_m is again bounded by a numerical constant.

Piecewise polynomials: We restrict our attention to piecewise polynomial spaces on a bounded rectangle in ℝ^q, which, without loss of generality, we take to be [0,1]^q. Hereafter we denote by P_i a partition of [0,1] into D(i) intervals. A linear space S̄_m of piecewise polynomials is characterized by m = (r, P_1,...,P_q), where r is the maximal degree with respect to each variable of the polynomials involved. The elements t of S̄_m are the functions on [0,1]^q which coincide with a polynomial of degree not greater than r on each element of the product partition P = Π_{i=1}^q P_i. This results in D_m = (r+1)^q Π_{i=1}^q D(i). Let {Q_j}_{j≥0} be the orthogonal basis of the Legendre polynomials in L2([−1,1], dx); then the following properties hold for all j (see Whittaker and Watson 1927 for details):

    |Q_j(x)| ≤ 1 for all x ∈ [−1,1],   Q_j(1) = 1,   ∫_{−1}^{1} Q_j²(t) dt = 2/(2j+1).

Let us consider the hyperrectangle R = Π_{i=1}^q [a_i, b_i]. For j ∈ J = {0,...,r}^q we define

    ϕ_{R,j}(x_1,...,x_q) = Π_{i=1}^q ( (2j_i+1)/(b_i−a_i) )^{1/2} Q_{j_i}( (2x_i − a_i − b_i)/(b_i − a_i) ) 1_R(x_1,...,x_q).

The family {ϕ_{R,j}}_{j∈J} provides an orthonormal basis for the space of polynomials on R with degree bounded by r.
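As a quick sanity check of this construction (ours; one-dimensional case q = 1, with Gauss-Legendre quadrature standing in for exact integration), one can verify numerically that the rescaled Legendre polynomials ϕ_{R,j} are indeed orthonormal on R = [a, b]:

    import numpy as np
    from numpy.polynomial import legendre

    a, b, r = 0.25, 0.75, 3

    def phi(j, x):
        """phi_{R,j}(x) = sqrt((2j+1)/(b-a)) * Q_j((2x - a - b)/(b - a)) on [a, b]."""
        u = (2.0 * x - a - b) / (b - a)
        coeffs = np.zeros(j + 1)
        coeffs[j] = 1.0                       # Legendre series with a single Q_j term
        return np.sqrt((2 * j + 1) / (b - a)) * legendre.legval(u, coeffs)

    # Gauss-Legendre quadrature on [a, b], exact for the polynomial integrands below.
    nodes, weights = legendre.leggauss(2 * r + 2)
    x = 0.5 * (b - a) * nodes + 0.5 * (a + b)
    w = 0.5 * (b - a) * weights

    gram = np.array([[np.sum(w * phi(i, x) * phi(j, x)) for j in range(r + 1)]
                     for i in range(r + 1)])
    print(np.round(gram, 10))                 # should be the (r+1) x (r+1) identity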

If H is a polynomial such that H = Σ_{j∈J} β_j ϕ_{R,j}, then

    ‖H‖_∞ ≤ [ (r+1)(2r+1)^{1/2} ]^q [Vol(R)]^{−1/2} |β|_∞.

Then, taking Λ_m as the set of those (R,j)'s such that R ∈ P and j ∈ J, we get from (3.4)

    r_m² ≤ (r+1)^{2q} (2r+1)^q / ( D_m inf_{R∈P} Vol(R) ) = [ (r+1)(2r+1) ]^q [ inf_{R∈P} Vol(R) Π_{i=1}^q D(i) ]^{−1}.   (3.8)

In particular, if P is a regular partition (all elements R of P have the same volume),

    r_m ≤ [ (r+1)(2r+1) ]^{q/2}.   (3.9)

Polynomials on a sphere and other eigenspaces of the Laplacian: Let S_q be the unit Euclidean sphere of ℝ^{q+1}, µ be the uniform distribution on the sphere and 0 < θ_0 < ... < θ_j < ... be the eigenvalues of the Laplace-Beltrami operator on S_q. Let, for each j ≥ 0, {ϕ_λ, λ ∈ Λ(j)} be an orthonormal system of eigenfunctions associated with the eigenvalue θ_j. Then ∪_{j≥0} {ϕ_λ, λ ∈ Λ(j)} is an orthonormal basis of L2(µ). Defining, for any integer m ≥ 0, Λ_m = ∪_{j=0}^{m} Λ(j) and S̄_m as the linear span of {ϕ_λ}_{λ∈Λ_m}, we get D_m = |Λ_m| for m ≥ 0. Actually these eigenvalues are given by explicit formulas (see for instance Berger, Gauduchon and Mazet 1971), the corresponding eigenfunctions are known to be harmonic zonal polynomials and one has (see Stein and Weiss 1971, p. 144)

    Σ_{λ∈Λ(j)} ϕ_λ²(x) = |Λ(j)|   for all x ∈ S_q and all j ≥ 0.

In such a case it follows from (3.3) that Φ_m = 1 for any integer m. More generally, we can consider, instead of S_q, a compact connected Riemannian manifold M of dimension q with its uniform distribution µ. The eigenfunctions of the Laplace-Beltrami operator provide an orthonormal basis of L2(µ) which is a multidimensional generalization of the Fourier basis. Of course no exact formula is available in this full generality, but some asymptotic evaluation holds which is known as Weyl's formula (see Chavel 1984, p. 9). Keeping the same notations for the eigenvalues and eigenfunctions as above, defining Λ(j), Λ_m and S̄_m as in the case of the sphere and setting D_{−1} = 1, Weyl's formula ensures that there exist two positive constants C_1(M) and C_2(M) such that, for any integer m,


More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Optimal global rates of convergence for interpolation problems with random design

Optimal global rates of convergence for interpolation problems with random design Optimal global rates of convergence for interpolation problems with random design Michael Kohler 1 and Adam Krzyżak 2, 1 Fachbereich Mathematik, Technische Universität Darmstadt, Schlossgartenstr. 7, 64289

More information

Kernel change-point detection

Kernel change-point detection 1,2 (joint work with Alain Celisse 3 & Zaïd Harchaoui 4 ) 1 Cnrs 2 École Normale Supérieure (Paris), DIENS, Équipe Sierra 3 Université Lille 1 4 INRIA Grenoble Workshop Kernel methods for big data, Lille,

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector

Mathematical Institute, University of Utrecht. The problem of estimating the mean of an observed Gaussian innite-dimensional vector On Minimax Filtering over Ellipsoids Eduard N. Belitser and Boris Y. Levit Mathematical Institute, University of Utrecht Budapestlaan 6, 3584 CD Utrecht, The Netherlands The problem of estimating the mean

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Exact Minimax Strategies for Predictive Density Estimation, Data Compression, and Model Selection

Exact Minimax Strategies for Predictive Density Estimation, Data Compression, and Model Selection 2708 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 50, NO. 11, NOVEMBER 2004 Exact Minimax Strategies for Predictive Density Estimation, Data Compression, Model Selection Feng Liang Andrew Barron, Senior

More information

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory

Part V. 17 Introduction: What are measures and why measurable sets. Lebesgue Integration Theory Part V 7 Introduction: What are measures and why measurable sets Lebesgue Integration Theory Definition 7. (Preliminary). A measure on a set is a function :2 [ ] such that. () = 2. If { } = is a finite

More information

ECE521 week 3: 23/26 January 2017

ECE521 week 3: 23/26 January 2017 ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

Risk Bounds for CART Classifiers under a Margin Condition

Risk Bounds for CART Classifiers under a Margin Condition arxiv:0902.3130v5 stat.ml 1 Mar 2012 Risk Bounds for CART Classifiers under a Margin Condition Servane Gey March 2, 2012 Abstract Non asymptotic risk bounds for Classification And Regression Trees (CART)

More information

ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS

ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS Bendikov, A. and Saloff-Coste, L. Osaka J. Math. 4 (5), 677 7 ON THE REGULARITY OF SAMPLE PATHS OF SUB-ELLIPTIC DIFFUSIONS ON MANIFOLDS ALEXANDER BENDIKOV and LAURENT SALOFF-COSTE (Received March 4, 4)

More information

Chapter 2 Metric Spaces

Chapter 2 Metric Spaces Chapter 2 Metric Spaces The purpose of this chapter is to present a summary of some basic properties of metric and topological spaces that play an important role in the main body of the book. 2.1 Metrics

More information

The deterministic Lasso

The deterministic Lasso The deterministic Lasso Sara van de Geer Seminar für Statistik, ETH Zürich Abstract We study high-dimensional generalized linear models and empirical risk minimization using the Lasso An oracle inequality

More information

arxiv:math/ v3 [math.st] 1 Apr 2009

arxiv:math/ v3 [math.st] 1 Apr 2009 The Annals of Statistics 009, Vol. 37, No., 630 67 DOI: 10.114/07-AOS573 c Institute of Mathematical Statistics, 009 arxiv:math/070150v3 [math.st] 1 Apr 009 GAUSSIAN MODEL SELECTION WITH AN UNKNOWN VARIANCE

More information

Analysis in weighted spaces : preliminary version

Analysis in weighted spaces : preliminary version Analysis in weighted spaces : preliminary version Frank Pacard To cite this version: Frank Pacard. Analysis in weighted spaces : preliminary version. 3rd cycle. Téhéran (Iran, 2006, pp.75.

More information

3 Integration and Expectation

3 Integration and Expectation 3 Integration and Expectation 3.1 Construction of the Lebesgue Integral Let (, F, µ) be a measure space (not necessarily a probability space). Our objective will be to define the Lebesgue integral R fdµ

More information

Introduction to Real Analysis Alternative Chapter 1

Introduction to Real Analysis Alternative Chapter 1 Christopher Heil Introduction to Real Analysis Alternative Chapter 1 A Primer on Norms and Banach Spaces Last Updated: March 10, 2018 c 2018 by Christopher Heil Chapter 1 A Primer on Norms and Banach Spaces

More information

Eigenvalues and Eigenfunctions of the Laplacian

Eigenvalues and Eigenfunctions of the Laplacian The Waterloo Mathematics Review 23 Eigenvalues and Eigenfunctions of the Laplacian Mihai Nica University of Waterloo mcnica@uwaterloo.ca Abstract: The problem of determining the eigenvalues and eigenvectors

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

Chapter 2 Linear Transformations

Chapter 2 Linear Transformations Chapter 2 Linear Transformations Linear Transformations Loosely speaking, a linear transformation is a function from one vector space to another that preserves the vector space operations. Let us be more

More information

Introduction and Preliminaries

Introduction and Preliminaries Chapter 1 Introduction and Preliminaries This chapter serves two purposes. The first purpose is to prepare the readers for the more systematic development in later chapters of methods of real analysis

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model.

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model. Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model By Michael Levine Purdue University Technical Report #14-03 Department of

More information

Analysis-3 lecture schemes

Analysis-3 lecture schemes Analysis-3 lecture schemes (with Homeworks) 1 Csörgő István November, 2015 1 A jegyzet az ELTE Informatikai Kar 2015. évi Jegyzetpályázatának támogatásával készült Contents 1. Lesson 1 4 1.1. The Space

More information

Discussion of Hypothesis testing by convex optimization

Discussion of Hypothesis testing by convex optimization Electronic Journal of Statistics Vol. 9 (2015) 1 6 ISSN: 1935-7524 DOI: 10.1214/15-EJS990 Discussion of Hypothesis testing by convex optimization Fabienne Comte, Céline Duval and Valentine Genon-Catalot

More information

THE LASSO, CORRELATED DESIGN, AND IMPROVED ORACLE INEQUALITIES. By Sara van de Geer and Johannes Lederer. ETH Zürich

THE LASSO, CORRELATED DESIGN, AND IMPROVED ORACLE INEQUALITIES. By Sara van de Geer and Johannes Lederer. ETH Zürich Submitted to the Annals of Applied Statistics arxiv: math.pr/0000000 THE LASSO, CORRELATED DESIGN, AND IMPROVED ORACLE INEQUALITIES By Sara van de Geer and Johannes Lederer ETH Zürich We study high-dimensional

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

Concentration behavior of the penalized least squares estimator

Concentration behavior of the penalized least squares estimator Concentration behavior of the penalized least squares estimator Penalized least squares behavior arxiv:1511.08698v2 [math.st] 19 Oct 2016 Alan Muro and Sara van de Geer {muro,geer}@stat.math.ethz.ch Seminar

More information

PACKING-DIMENSION PROFILES AND FRACTIONAL BROWNIAN MOTION

PACKING-DIMENSION PROFILES AND FRACTIONAL BROWNIAN MOTION PACKING-DIMENSION PROFILES AND FRACTIONAL BROWNIAN MOTION DAVAR KHOSHNEVISAN AND YIMIN XIAO Abstract. In order to compute the packing dimension of orthogonal projections Falconer and Howroyd 997) introduced

More information

Recall that any inner product space V has an associated norm defined by

Recall that any inner product space V has an associated norm defined by Hilbert Spaces Recall that any inner product space V has an associated norm defined by v = v v. Thus an inner product space can be viewed as a special kind of normed vector space. In particular every inner

More information

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina

Indirect Rule Learning: Support Vector Machines. Donglin Zeng, Department of Biostatistics, University of North Carolina Indirect Rule Learning: Support Vector Machines Indirect learning: loss optimization It doesn t estimate the prediction rule f (x) directly, since most loss functions do not have explicit optimizers. Indirection

More information

Module 3. Function of a Random Variable and its distribution

Module 3. Function of a Random Variable and its distribution Module 3 Function of a Random Variable and its distribution 1. Function of a Random Variable Let Ω, F, be a probability space and let be random variable defined on Ω, F,. Further let h: R R be a given

More information

An introduction to Mathematical Theory of Control

An introduction to Mathematical Theory of Control An introduction to Mathematical Theory of Control Vasile Staicu University of Aveiro UNICA, May 2018 Vasile Staicu (University of Aveiro) An introduction to Mathematical Theory of Control UNICA, May 2018

More information

Model comparison and selection

Model comparison and selection BS2 Statistical Inference, Lectures 9 and 10, Hilary Term 2008 March 2, 2008 Hypothesis testing Consider two alternative models M 1 = {f (x; θ), θ Θ 1 } and M 2 = {f (x; θ), θ Θ 2 } for a sample (X = x)

More information

Cambridge University Press The Mathematics of Signal Processing Steven B. Damelin and Willard Miller Excerpt More information

Cambridge University Press The Mathematics of Signal Processing Steven B. Damelin and Willard Miller Excerpt More information Introduction Consider a linear system y = Φx where Φ can be taken as an m n matrix acting on Euclidean space or more generally, a linear operator on a Hilbert space. We call the vector x a signal or input,

More information

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam. aad. Bayesian Adaptation p. 1/4

Bayesian Adaptation. Aad van der Vaart. Vrije Universiteit Amsterdam.  aad. Bayesian Adaptation p. 1/4 Bayesian Adaptation Aad van der Vaart http://www.math.vu.nl/ aad Vrije Universiteit Amsterdam Bayesian Adaptation p. 1/4 Joint work with Jyri Lember Bayesian Adaptation p. 2/4 Adaptation Given a collection

More information

DISCUSSION: COVERAGE OF BAYESIAN CREDIBLE SETS. By Subhashis Ghosal North Carolina State University

DISCUSSION: COVERAGE OF BAYESIAN CREDIBLE SETS. By Subhashis Ghosal North Carolina State University Submitted to the Annals of Statistics DISCUSSION: COVERAGE OF BAYESIAN CREDIBLE SETS By Subhashis Ghosal North Carolina State University First I like to congratulate the authors Botond Szabó, Aad van der

More information

A Modern Look at Classical Multivariate Techniques

A Modern Look at Classical Multivariate Techniques A Modern Look at Classical Multivariate Techniques Yoonkyung Lee Department of Statistics The Ohio State University March 16-20, 2015 The 13th School of Probability and Statistics CIMAT, Guanajuato, Mexico

More information

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability... Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................

More information

Spectral Theory, with an Introduction to Operator Means. William L. Green

Spectral Theory, with an Introduction to Operator Means. William L. Green Spectral Theory, with an Introduction to Operator Means William L. Green January 30, 2008 Contents Introduction............................... 1 Hilbert Space.............................. 4 Linear Maps

More information

Topological vectorspaces

Topological vectorspaces (July 25, 2011) Topological vectorspaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ Natural non-fréchet spaces Topological vector spaces Quotients and linear maps More topological

More information

Random Bernstein-Markov factors

Random Bernstein-Markov factors Random Bernstein-Markov factors Igor Pritsker and Koushik Ramachandran October 20, 208 Abstract For a polynomial P n of degree n, Bernstein s inequality states that P n n P n for all L p norms on the unit

More information

Hilbert spaces. 1. Cauchy-Schwarz-Bunyakowsky inequality

Hilbert spaces. 1. Cauchy-Schwarz-Bunyakowsky inequality (October 29, 2016) Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/fun/notes 2016-17/03 hsp.pdf] Hilbert spaces are

More information

Packing-Dimension Profiles and Fractional Brownian Motion

Packing-Dimension Profiles and Fractional Brownian Motion Under consideration for publication in Math. Proc. Camb. Phil. Soc. 1 Packing-Dimension Profiles and Fractional Brownian Motion By DAVAR KHOSHNEVISAN Department of Mathematics, 155 S. 1400 E., JWB 233,

More information

Reconstruction from Anisotropic Random Measurements

Reconstruction from Anisotropic Random Measurements Reconstruction from Anisotropic Random Measurements Mark Rudelson and Shuheng Zhou The University of Michigan, Ann Arbor Coding, Complexity, and Sparsity Workshop, 013 Ann Arbor, Michigan August 7, 013

More information

CHAPTER VIII HILBERT SPACES

CHAPTER VIII HILBERT SPACES CHAPTER VIII HILBERT SPACES DEFINITION Let X and Y be two complex vector spaces. A map T : X Y is called a conjugate-linear transformation if it is a reallinear transformation from X into Y, and if T (λx)

More information

We have to prove now that (3.38) defines an orthonormal wavelet. It belongs to W 0 by Lemma and (3.55) with j = 1. We can write any f W 1 as

We have to prove now that (3.38) defines an orthonormal wavelet. It belongs to W 0 by Lemma and (3.55) with j = 1. We can write any f W 1 as 88 CHAPTER 3. WAVELETS AND APPLICATIONS We have to prove now that (3.38) defines an orthonormal wavelet. It belongs to W 0 by Lemma 3..7 and (3.55) with j =. We can write any f W as (3.58) f(ξ) = p(2ξ)ν(2ξ)

More information

Wavelet Shrinkage for Nonequispaced Samples

Wavelet Shrinkage for Nonequispaced Samples University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Wavelet Shrinkage for Nonequispaced Samples T. Tony Cai University of Pennsylvania Lawrence D. Brown University

More information

MAJORIZING MEASURES WITHOUT MEASURES. By Michel Talagrand URA 754 AU CNRS

MAJORIZING MEASURES WITHOUT MEASURES. By Michel Talagrand URA 754 AU CNRS The Annals of Probability 2001, Vol. 29, No. 1, 411 417 MAJORIZING MEASURES WITHOUT MEASURES By Michel Talagrand URA 754 AU CNRS We give a reformulation of majorizing measures that does not involve measures,

More information

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces.

Math 350 Fall 2011 Notes about inner product spaces. In this notes we state and prove some important properties of inner product spaces. Math 350 Fall 2011 Notes about inner product spaces In this notes we state and prove some important properties of inner product spaces. First, recall the dot product on R n : if x, y R n, say x = (x 1,...,

More information

A BLEND OF INFORMATION THEORY AND STATISTICS. Andrew R. Barron. Collaborators: Cong Huang, Jonathan Li, Gerald Cheang, Xi Luo

A BLEND OF INFORMATION THEORY AND STATISTICS. Andrew R. Barron. Collaborators: Cong Huang, Jonathan Li, Gerald Cheang, Xi Luo A BLEND OF INFORMATION THEORY AND STATISTICS Andrew R. YALE UNIVERSITY Collaborators: Cong Huang, Jonathan Li, Gerald Cheang, Xi Luo Frejus, France, September 1-5, 2008 A BLEND OF INFORMATION THEORY AND

More information

Asymptotically Efficient Nonparametric Estimation of Nonlinear Spectral Functionals

Asymptotically Efficient Nonparametric Estimation of Nonlinear Spectral Functionals Acta Applicandae Mathematicae 78: 145 154, 2003. 2003 Kluwer Academic Publishers. Printed in the Netherlands. 145 Asymptotically Efficient Nonparametric Estimation of Nonlinear Spectral Functionals M.

More information

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model

Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Model Selection Tutorial 2: Problems With Using AIC to Select a Subset of Exposures in a Regression Model Centre for Molecular, Environmental, Genetic & Analytic (MEGA) Epidemiology School of Population

More information

HILBERT SPACES AND THE RADON-NIKODYM THEOREM. where the bar in the first equation denotes complex conjugation. In either case, for any x V define

HILBERT SPACES AND THE RADON-NIKODYM THEOREM. where the bar in the first equation denotes complex conjugation. In either case, for any x V define HILBERT SPACES AND THE RADON-NIKODYM THEOREM STEVEN P. LALLEY 1. DEFINITIONS Definition 1. A real inner product space is a real vector space V together with a symmetric, bilinear, positive-definite mapping,

More information

Least singular value of random matrices. Lewis Memorial Lecture / DIMACS minicourse March 18, Terence Tao (UCLA)

Least singular value of random matrices. Lewis Memorial Lecture / DIMACS minicourse March 18, Terence Tao (UCLA) Least singular value of random matrices Lewis Memorial Lecture / DIMACS minicourse March 18, 2008 Terence Tao (UCLA) 1 Extreme singular values Let M = (a ij ) 1 i n;1 j m be a square or rectangular matrix

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

An introduction to some aspects of functional analysis

An introduction to some aspects of functional analysis An introduction to some aspects of functional analysis Stephen Semmes Rice University Abstract These informal notes deal with some very basic objects in functional analysis, including norms and seminorms

More information

Effective Dimension and Generalization of Kernel Learning

Effective Dimension and Generalization of Kernel Learning Effective Dimension and Generalization of Kernel Learning Tong Zhang IBM T.J. Watson Research Center Yorktown Heights, Y 10598 tzhang@watson.ibm.com Abstract We investigate the generalization performance

More information

INFORMATION-THEORETIC DETERMINATION OF MINIMAX RATES OF CONVERGENCE 1. By Yuhong Yang and Andrew Barron Iowa State University and Yale University

INFORMATION-THEORETIC DETERMINATION OF MINIMAX RATES OF CONVERGENCE 1. By Yuhong Yang and Andrew Barron Iowa State University and Yale University The Annals of Statistics 1999, Vol. 27, No. 5, 1564 1599 INFORMATION-THEORETIC DETERMINATION OF MINIMAX RATES OF CONVERGENCE 1 By Yuhong Yang and Andrew Barron Iowa State University and Yale University

More information

Theorem 2. Let n 0 3 be a given integer. is rigid in the sense of Guillemin, so are all the spaces ḠR n,n, with n n 0.

Theorem 2. Let n 0 3 be a given integer. is rigid in the sense of Guillemin, so are all the spaces ḠR n,n, with n n 0. This monograph is motivated by a fundamental rigidity problem in Riemannian geometry: determine whether the metric of a given Riemannian symmetric space of compact type can be characterized by means of

More information

Chapter One. The Real Number System

Chapter One. The Real Number System Chapter One. The Real Number System We shall give a quick introduction to the real number system. It is imperative that we know how the set of real numbers behaves in the way that its completeness and

More information

Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms

Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms university-logo Penalized Squared Error and Likelihood: Risk Bounds and Fast Algorithms Andrew Barron Cong Huang Xi Luo Department of Statistics Yale University 2008 Workshop on Sparsity in High Dimensional

More information

Overview of normed linear spaces

Overview of normed linear spaces 20 Chapter 2 Overview of normed linear spaces Starting from this chapter, we begin examining linear spaces with at least one extra structure (topology or geometry). We assume linearity; this is a natural

More information