Simultaneous estimation of the mean and the variance in heteroscedastic Gaussian regression


Electronic Journal of Statistics
Vol. 2 (2008) 1345–1372
DOI: 10.1214/08-EJS267

Simultaneous estimation of the mean and the variance in heteroscedastic Gaussian regression

Xavier Gendre

Laboratoire J.-A. Dieudonné, Université de Nice - Sophia Antipolis,
Parc Valrose, Nice Cedex, France
e-mail: gendre@unice.fr

Abstract: Let Y be a Gaussian vector of R^n with mean s and diagonal covariance matrix Γ. Our aim is to estimate both s and the entries σ_i^2 = Γ_{i,i}, for i = 1, ..., n, on the basis of the observation of two independent copies of Y. Our approach is free of any prior assumption on s but requires that we know some upper bound γ on the ratio max_i σ_i^2 / min_i σ_i^2. For example, the choice γ = 1 corresponds to the homoscedastic case where the components of Y are assumed to have a common unknown variance. Conversely, the choice γ > 1 corresponds to the heteroscedastic case where the variances of the components of Y are allowed to vary within some range. Our estimation strategy is based on model selection. We consider a family {S_m × Σ_m, m ∈ M} of parameter sets where S_m and Σ_m are linear spaces. To each m ∈ M, we associate a pair of estimators (ŝ_m, σ̂_m^2) of (s, σ^2) with values in S_m × Σ_m. We then design a model selection procedure that selects some m̂ among M in such a way that the Kullback risk of (ŝ_m̂, σ̂_m̂^2) is as close as possible to the minimum of the Kullback risks over the family of estimators {(ŝ_m, σ̂_m^2), m ∈ M}. We then derive uniform rates of convergence for the estimator (ŝ_m̂, σ̂_m̂^2) over Hölderian balls. Finally, we carry out a simulation study in order to illustrate the performance of our estimators in practice.

AMS 2000 subject classifications: 62G08.
Keywords and phrases: Gaussian regression, heteroscedasticity, model selection, Kullback risk, convergence rate.

Received July 2008.

1. Introduction

Let us consider the statistical framework given by the distribution of a Gaussian vector Y with mean s = (s_1, ..., s_n) ∈ R^n and diagonal covariance matrix

    Γ_σ = diag(σ_1^2, ..., σ_n^2)

where σ^2 = (σ_1^2, ..., σ_n^2) ∈ (0, ∞)^n. The vectors s and σ^2 are both assumed to be unknown. Hereafter, for any t = (t_1, ..., t_n) ∈ R^n and τ^2 = (τ_1^2, ..., τ_n^2) ∈ (0, ∞)^n, we denote by P_{t,τ} the distribution of a Gaussian vector with mean t and covariance matrix Γ_τ = diag(τ_1^2, ..., τ_n^2), and by K(P_{s,σ}, P_{t,τ}) the Kullback-Leibler divergence between P_{s,σ} and P_{t,τ},

    K(P_{s,σ}, P_{t,τ}) = (1/2) Σ_{i=1}^n [ (s_i − t_i)^2 / τ_i^2 + φ(τ_i^2 / σ_i^2) ],

where φ(u) = log u + 1/u − 1, for u > 0. Note that, if the σ_i^2's are known and constant, the Kullback-Leibler divergence reduces, up to a multiplicative constant, to the squared L^2-norm and, in expectation, corresponds to the quadratic risk.

Let us suppose that we observe two independent copies of Y, namely Y^[1] = (Y_1^[1], ..., Y_n^[1]) and Y^[2] = (Y_1^[2], ..., Y_n^[2]). Their coordinates can be expanded as

    Y_i^[j] = s_i + σ_i ε_i^[j],   i = 1, ..., n and j = 1, 2,   (1.1)

where ε^[1] = (ε_1^[1], ..., ε_n^[1]) and ε^[2] = (ε_1^[2], ..., ε_n^[2]) are two independent standard Gaussian vectors. We are interested here in the estimation of the two vectors s and σ^2. Indeed, their behavior contains substantial knowledge about the phenomenon represented by the distribution of Y. We particularly have in mind the case of a variance that stays approximately constant over periods and can take several different values in the course of the observations. Of course, we want to estimate the mean s but, in this particular case, we are also interested in recovering the periods of constancy and the values taken by the variance σ^2. The Kullback-Leibler divergence measures the discrepancy between the two distributions P_{s,σ} and P_{t,τ}; thus, it allows us to deal with the two estimation problems at the same time.

More generally, the aim of this paper is to estimate the pair (s, σ^2) by model selection on the basis of the observation of Y^[1] and Y^[2]. For this, we introduce a collection F = {S_m × Σ_m, m ∈ M} of products of linear subspaces of R^n indexed by a finite or countable set M. In the sequel, these products will be called models and, for any m ∈ M, we will denote by D_m the dimension of S_m × Σ_m. To each m ∈ M, we associate a pair of estimators (ŝ_m, σ̂_m^2) that is similar to the maximum likelihood estimator (MLE). It is well known that, if the σ_i^2's are equal, the estimators of the mean and of the variance factor obtained by maximizing the likelihood are independent. This fact no longer holds if the σ_i^2's are not constant. To recover the independence between the estimators of the mean and the variance, we construct them separately from the two independent copies Y^[1] and Y^[2]. For the estimator ŝ_m of s, we take the MLE based on Y^[1] and, for the estimator σ̂_m^2 of σ^2, we take the MLE based on Y^[2]. Thus, for each m ∈ M, we have a pair of independent estimators (ŝ_m, σ̂_m^2) = (ŝ_m(Y^[1]), σ̂_m^2(Y^[2])) with values in S_m × Σ_m. The Kullback risk of (ŝ_m, σ̂_m^2) is E[K(P_{s,σ}, P_{ŝ_m,σ̂_m})] and is of the order of the sum of two terms,

    inf_{(t,τ) ∈ S_m × Σ_m} K(P_{s,σ}, P_{t,τ}) + D_m.   (1.2)

The first one, called the bias term, represents the capacity of S_m × Σ_m to approximate the true value of (s, σ^2). The second, called the variance term, is proportional to the dimension of the model and corresponds to the amount of noise that we have to control. To guarantee a small risk, these two terms have to be small simultaneously. Indeed, taking the Kullback risk as a quality criterion, a good model is one that minimizes (1.2) among F. Clearly, the choice of such a model depends on the pair of unknown parameters (s, σ^2), which makes good models unavailable to us. So, we have to construct a procedure that selects an index m̂ = m̂(Y^[1], Y^[2]) ∈ M, depending on the data only, such that E[K(P_{s,σ}, P_{ŝ_m̂,σ̂_m̂})] is close to the smallest risk

    R(s, σ^2, F) = inf_{m ∈ M} E[K(P_{s,σ}, P_{ŝ_m,σ̂_m})].

The art of model selection is precisely to provide a procedure, solely based on the observations, that achieves this. The classical way consists in minimizing an empirical penalized criterion that is stochastically close to the risk. Considering the (negative) log-likelihood function with respect to Y^[1], up to an additive constant,

    ∀ t ∈ R^n, ∀ τ^2 ∈ (0, ∞)^n,   L(t, τ^2) = (1/2) Σ_{i=1}^n [ (Y_i^[1] − t_i)^2 / τ_i^2 + log τ_i^2 ],

we choose m̂ as a minimizer over M of the penalized likelihood criterion

    Crit(m) = L(ŝ_m, σ̂_m^2) + pen(m),   (1.3)

where pen is a penalty function mapping M into R_+ = [0, ∞). In this work, we exhibit a form of penalty such that the pair of estimators (ŝ_m̂, σ̂_m̂^2) has a Kullback risk close to R(s, σ^2, F). Our approach is free of any prior assumption on s but requires that we know some upper bound γ ≥ 1 on the ratio σ_max^2 / σ_min^2 ≤ γ, where σ_max^2 (resp. σ_min^2) is the maximum (resp. minimum) of the σ_i^2's. The knowledge of γ allows us to deal equivalently with two different cases. First, γ = 1 corresponds to the homoscedastic case where the components of Y^[1] and Y^[2] are independent with a common variance (i.e. σ_i^2 ≡ σ^2), which may be unknown. On the other side, γ > 1 means that the σ_i^2's can be distinct and are allowed to vary within some range. Such non-constancy of the variances of the observations is known as the heteroscedastic case. Heteroscedasticity arises in many practical situations in which the assumption that the variances of the data are equal is debatable.

The field of model selection has undergone important developments over the last decades and it is beyond the scope of this paper to give an exhaustive historical review of the domain. The interested reader can find a good introduction to model selection in the first chapters of [17]. The first heuristics in the domain are due to Mallows [16] for the estimation of the mean

in homoscedastic Gaussian regression with known variance. In a more general Gaussian framework with common known variance, Barron et al. [7] and Birgé and Massart [9, 10] have designed adaptive model selection procedures to estimate the mean under quadratic risk. They provide non-asymptotic upper bounds for the risk of the selected estimator. When the bound is of the order of the smallest risk among the collection of models, this kind of result is called an oracle inequality. Baraud [5] has generalized their results to homoscedastic statistical models with non-Gaussian noise admitting moments of order larger than 2 and a known variance. All these results remain true for a common unknown variance if some upper bound on it is assumed to be known. Of course, the larger this bound, the worse the results. Assuming that γ is known does not imply the knowledge of such an upper bound.

In the homoscedastic Gaussian framework with unknown variance, Akaike proposed penalties for estimating the mean under quadratic risk (see [1, 2] and [3]). Replacing the variance by a particular estimator in his penalty term, Baraud [5] obtained oracle inequalities for more general noise than Gaussian and for polynomial collections of models. Recently, Baraud, Giraud and Huet [6] constructed penalties able to take into account the complexity of the collection of models for estimating the mean under quadratic risk in a Gaussian homoscedastic model with unknown variance. They also proved results for the estimation of the mean and the variance factor under Kullback risk. This problem is close to ours and corresponds to the case γ = 1. A motivation for the present work was to extend their results to the heteroscedastic case γ > 1, in order to get oracle inequalities by minimization of a penalized criterion as in (1.3). Assuming that the collection of models is not too large, we obtain inequalities of the same flavor, up to a logarithmic factor,

    E[K(P_{s,σ}, P_{ŝ_m̂,σ̂_m̂})] ≤ C inf_{m ∈ M} { inf_{(t,τ) ∈ S_m × Σ_m} K(P_{s,σ}, P_{t,τ}) + D_m log^{1+ε}(D_m) } + R,   (1.4)

where C and R are positive constants depending in particular on γ, and ε is a positive parameter.

A non-asymptotic model selection approach to estimation problems in heteroscedastic Gaussian models has been studied in only a few papers. In Chapter 6 of [4], Arlot estimates the mean in a heteroscedastic regression framework, but for bounded data. For polynomial collections of models, he uses resampling penalties to get oracle inequalities for the quadratic risk. Recently, Galtchouk and Pergamenshchikov [14] have provided an adaptive nonparametric estimation procedure for the mean in a heteroscedastic Gaussian regression model. They obtain an oracle inequality for the quadratic risk under some regularity assumptions. Closer to our problem, Comte and Rozenholc [12] have estimated the pair (s, σ^2). Their estimation procedure is different from ours, which makes the theoretical results difficult to compare. For instance, they proceed in two steps (one for the mean and one for the variance) and they give risk bounds separately for each parameter in L^2-norm, while we estimate the pair (s, σ^2) directly under the Kullback risk.
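As a purely illustrative aid, the Kullback-Leibler divergence and the two-sample observation scheme (1.1) recalled above are straightforward to implement; the Python sketch below (the function names and the random seed are ours, not from the paper) computes K(P_{s,σ}, P_{t,τ}) coordinate by coordinate and draws the two independent copies Y^[1] and Y^[2].

```python
import numpy as np

def kullback(s, sigma2, t, tau2):
    """Kullback-Leibler divergence K(P_{s,sigma}, P_{t,tau}) between two
    Gaussian vectors with diagonal covariances diag(sigma2) and diag(tau2),
    using phi(u) = log(u) + 1/u - 1 evaluated at u = tau2 / sigma2."""
    u = tau2 / sigma2
    phi = np.log(u) + 1.0 / u - 1.0
    return 0.5 * np.sum((s - t) ** 2 / tau2 + phi)

def draw_two_copies(s, sigma2, seed=0):
    """Two independent copies Y^[1], Y^[2] of Y ~ N(s, diag(sigma2)), as in (1.1)."""
    rng = np.random.default_rng(seed)
    sd = np.sqrt(sigma2)
    y1 = s + sd * rng.standard_normal(s.size)
    y2 = s + sd * rng.standard_normal(s.size)
    return y1, y2
```

When sigma2 and tau2 are equal and constant, kullback reduces to a scaled squared L^2 distance between s and t, as noted in the introduction.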

As described in [8], one of the main advantages of inequalities such as (1.4) is that they allow us to derive uniform convergence rates for the risk of the selected estimator over many classes of smoothness. Considering a collection of histogram models, we provide convergence rates over Hölderian balls. Indeed, for α_1, α_2 ∈ (0, 1], if s is α_1-Hölderian and σ^2 is α_2-Hölderian, we prove that the risk of (ŝ_m̂, σ̂_m̂^2) converges at a rate of order (log^{1+ε}(n)/n)^{2α/(2α+1)}, where α = min{α_1, α_2} is the worst regularity. To put this rate into perspective, we can think of the homoscedastic case with only one observation of Y. In this case, the optimal rate of convergence in the minimax sense is n^{-2α/(2α+1)} and, up to a logarithmic loss, our rate is comparable to it. To our knowledge, our results on non-asymptotic estimation of the mean and the variance in a heteroscedastic Gaussian model are new.

The paper is organized as follows. The main results are presented in Section 2. In Section 3, we carry out a simulation study in order to illustrate the performance of our estimators in practice, under the Kullback risk and the quadratic risk. The last sections are devoted to the proofs and to some technical results.

2. Main results

We first introduce the collection of models, the estimators and the procedure. We then present the main results, whose proofs can be found in Section 4. In the sequel, we consider the framework (1.1) and, for the sake of simplicity, we suppose that there exists an integer k_n ≥ 0 such that n = 2^{k_n}.

2.1. Model collection and estimators

In order to estimate the mean and the variance, we consider linear subspaces of R^n constructed as follows. Let M be a countable or finite set. To each m ∈ M, we associate a regular partition p_m of {1, ..., n} given by the |p_m| = 2^{k_m} consecutive blocks

    { (i − 1) 2^{k_n − k_m} + 1, ..., i 2^{k_n − k_m} },   i = 1, ..., |p_m|.

For any I ∈ p_m and any x ∈ R^n, let us denote by x_I the vector of R^{n/|p_m|} with coordinates (x_i)_{i ∈ I}. Then, to each m ∈ M, we also associate a linear subspace E_m of R^{n/|p_m|} with dimension 1 ≤ d_m ≤ 2^{k_n − k_m}. This set of pairs (p_m, E_m) allows us to construct a collection of models. Hereafter, we identify each m ∈ M with its corresponding pair (p_m, E_m). For any m = (p_m, E_m) ∈ M, we introduce the subspace S_m ⊂ R^n of the E_m-piecewise vectors,

    S_m = { x ∈ R^n such that ∀ I ∈ p_m, x_I ∈ E_m },

and the subspace Σ_m ⊂ R^n of the piecewise constant vectors,

    Σ_m = { Σ_{I ∈ p_m} g_I 1_I,  g_I ∈ R }.

The dimension of S_m × Σ_m is denoted by D_m = |p_m|(d_m + 1). To estimate the pair (s, σ^2), we only deal with models S_m × Σ_m constructed in such a way. More precisely, we consider a collection of products of linear subspaces

    F = { S_m × Σ_m, m ∈ M },   (2.1)

where M is a set of pairs (p_m, E_m) as above. In the paper, we will often make the following hypothesis on the collection of models:

(H_θ) There exists θ > 1 such that, for any m ∈ M, θ(θ − 1)^{-1} γ (2 + D_m) ≤ n.

This hypothesis avoids handling models whose dimension is too large with respect to the number of observations. Let m ∈ M; we denote by π_m the orthogonal projection onto S_m. We estimate (s, σ^2) by the pair of independent estimators (ŝ_m, σ̂_m^2) ∈ S_m × Σ_m given by

    ŝ_m = π_m Y^[1]  and  σ̂_m^2 = Σ_{I ∈ p_m} σ̂_{m,I}^2 1_I,  where, for I ∈ p_m,  σ̂_{m,I}^2 = (1/|I|) Σ_{i ∈ I} ( Y_i^[2] − (π_m Y^[2])_i )^2.   (2.2)

Thus, we get a collection of estimators {(ŝ_m, σ̂_m^2), m ∈ M}.

2.2. Risk upper bound

We first study the risk on a single model in order to understand its order. Take an arbitrary m ∈ M. We define (s_m, σ_m^2) ∈ S_m × Σ_m by

    s_m = π_m s  and  σ_m^2 = Σ_{I ∈ p_m} σ_{m,I}^2 1_I,  where, for I ∈ p_m,  σ_{m,I}^2 = (1/|I|) Σ_{i ∈ I} [ (s_i − s_{m,i})^2 + σ_i^2 ].

Easy computations show that the pair (s_m, σ_m^2) achieves the minimum of the Kullback-Leibler divergence over S_m × Σ_m,

    inf_{(t,τ) ∈ S_m × Σ_m} K(P_{s,σ}, P_{t,τ}) = K(P_{s,σ}, P_{s_m,σ_m}) = (1/2) Σ_{I ∈ p_m} Σ_{i ∈ I} log( σ_{m,I}^2 / σ_i^2 ).
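As a small numerical illustration of these definitions (ours, not taken from the paper), the sketch below builds a regular partition of the design into 2^{k_m} blocks, projects the first copy onto a low-degree polynomial space playing the role of E_m (one possible choice among many), and computes the blockwise variance estimator from the second copy; the helper names are hypothetical.

```python
import numpy as np

def regular_partition(n, k_m):
    """Split indices {0, ..., n-1} into 2**k_m consecutive blocks of equal size."""
    size = n // 2 ** k_m
    return [np.arange(i * size, (i + 1) * size) for i in range(2 ** k_m)]

def fit_model(y1, y2, k_m, d_m):
    """Estimators (s_hat, sigma2_hat) for the model indexed by (k_m, d_m):
    on each block I, s_hat is the orthogonal projection of y1[I] onto a
    d_m-dimensional polynomial space (our illustrative choice of E_m), and
    sigma2_hat is the mean squared residual of y2[I]."""
    n = y1.size
    s_hat, sigma2_hat = np.empty(n), np.empty(n)
    for I in regular_partition(n, k_m):
        x = np.linspace(0.0, 1.0, I.size)            # local design on the block
        basis = np.vander(x, d_m, increasing=True)   # basis of E_m, dimension d_m
        proj = basis @ np.linalg.pinv(basis)         # orthogonal projection matrix
        s_hat[I] = proj @ y1[I]
        resid = y2[I] - proj @ y2[I]
        sigma2_hat[I] = np.mean(resid ** 2)          # constant over the block
    return s_hat, sigma2_hat
```

With d_m = 1 the projection reduces to the blockwise mean, so both s and σ^2 are estimated by histograms; this is the type of model used in the collection F_PC of Section 2.3 and in the simulation study.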

The next proposition compares the minimal divergence K(P_{s,σ}, P_{s_m,σ_m}) with the Kullback risk of (ŝ_m, σ̂_m^2).

Proposition 1. Let m ∈ M. If the hypothesis (H_θ) is fulfilled, then

    K(P_{s,σ}, P_{s_m,σ_m}) + D_m/(4γ) ≤ E[K(P_{s,σ}, P_{ŝ_m,σ̂_m})] ≤ K(P_{s,σ}, P_{s_m,σ_m}) + κ γ^2 θ D_m,

where κ > 1 is a numerical constant whose explicit value is given in the proof.

As announced in (1.2), this result shows that the Kullback risk of the pair (ŝ_m, σ̂_m^2) is of the order of the sum of a bias term K(P_{s,σ}, P_{s_m,σ_m}) and a variance term which is proportional to D_m. Thus, minimizing the Kullback risk E[K(P_{s,σ}, P_{ŝ_m,σ̂_m})] over m ∈ M amounts to finding a model that realizes a trade-off between these two terms.

Let pen be a non-negative function on M and let m̂ ∈ M be any minimizer of the penalized criterion,

    m̂ ∈ argmin_{m ∈ M} { L(ŝ_m, σ̂_m^2) + pen(m) }.   (2.3)

In the sequel, we denote by (s̃, σ̃^2) = (ŝ_m̂, σ̂_m̂^2) the selected pair of estimators. It satisfies the following result.

Theorem 2. Under the hypothesis (H_θ), suppose that there exist A, B > 0 such that, for any k, d ∈ N,

    M_{k,d} = Card{ m ∈ M such that |p_m| = 2^k and d_m = d } ≤ A (1 + d)^B,   (2.4)

where M is the set defined at the beginning of Section 2.1. Moreover, assume that there exist δ, ε > 0 such that

    ∀ m ∈ M,   D_m ≤ 5δγ n / log^{1+ε}(n).   (2.5)

If we take

    ∀ m ∈ M,   pen(m) = γθ ( 2 + log^{1+ε}(D_m) ) D_m,   (2.6)

then

    E[K(P_{s,σ}, P_{s̃,σ̃})] ≤ C inf_{m ∈ M} { K(P_{s,σ}, P_{s_m,σ_m}) + D_m log^{1+ε}(D_m) } + R,   (2.7)

where R = R(γ, θ, A, B, ε, δ) is a positive constant and C is a positive constant depending only on κ, γ, θ and ε, whose explicit value is given in the proof.

The inequality (2.7) is close to an oracle inequality, up to a logarithmic factor. Thus, with the penalty (2.6), whose order is slightly larger than the dimension of the model, the risk of the estimator provided by the criterion (1.3) is comparable to the minimum over the collection of models F.
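To make the selection rule concrete, here is a self-contained Python sketch (ours, not the author's code) of the criterion (1.3) with the penalty shape (2.6), restricted to histogram models (d_m = 1), i.e. the collection F_PC used later; gamma, theta and eps must be supplied by the user, and k_max should be small enough that every block contains several observations.

```python
import numpy as np

def crit_and_fit(y1, y2, k_m, gamma, theta, eps):
    """Penalized criterion Crit(m) = L(s_hat, sigma2_hat) + pen(m) for the
    histogram model with 2**k_m blocks (d_m = 1), following (1.3) and (2.6)."""
    n = y1.size
    s_hat, sigma2_hat = np.empty(n), np.empty(n)
    for I in np.split(np.arange(n), 2 ** k_m):
        s_hat[I] = y1[I].mean()                        # projection onto constants
        sigma2_hat[I] = np.mean((y2[I] - y2[I].mean()) ** 2)
    # empirical contrast computed on the first copy, cf. (1.3)
    L = 0.5 * np.sum((y1 - s_hat) ** 2 / sigma2_hat + np.log(sigma2_hat))
    D_m = 2 * 2 ** k_m                                 # D_m = |p_m| (d_m + 1)
    pen = gamma * theta * (2 + np.log(D_m) ** (1 + eps)) * D_m
    return L + pen, s_hat, sigma2_hat

def select(y1, y2, k_max, gamma, theta=2.0, eps=1e-2):
    """Estimators minimizing the penalized criterion over k_m = 0, ..., k_max."""
    fits = [crit_and_fit(y1, y2, k, gamma, theta, eps) for k in range(k_max + 1)]
    return min(fits, key=lambda f: f[0])[1:]
```

The search here is one-dimensional over the number of blocks; with a richer E_m one would also loop over d_m and keep the same criterion.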

2.3. Convergence rate

One of the main advantages of an inequality such as (2.7) is that it gives uniform convergence rates with respect to many well-known classes of smoothness. To illustrate this, we consider the particular case of regression on a fixed design. Namely, in the framework (1.1), we suppose that

    ∀ 1 ≤ i ≤ n,   s_i = s_r(i/n)  and  σ_i^2 = σ_r^2(i/n),

where s_r and σ_r^2 are two unknown functions that map [0, 1] to R. In this section, we handle the normalized Kullback-Leibler divergence

    K_n(P_{s,σ}, P_{t,τ}) = (1/n) K(P_{s,σ}, P_{t,τ}),

and, for any α ∈ (0, 1] and any L > 0, we denote by H_α(L) the space of the α-Hölderian functions with constant L on [0, 1],

    H_α(L) = { f : [0, 1] → R : ∀ x, y ∈ [0, 1], |f(x) − f(y)| ≤ L |x − y|^α }.

Moreover, we consider a collection of models F_PC as described in Section 2.1 such that, for any m ∈ M, E_m is the space of dyadic piecewise constant functions on d_m blocks. More precisely, let m = (p_m, E_m) ∈ M and consider the regular dyadic partition p'_m, with |p_m| d_m blocks, that refines p_m. We define S_m as the space of the piecewise constant functions on p'_m,

    S_m = { f = Σ_{I ∈ p'_m} f_I 1_I such that ∀ I ∈ p'_m, f_I ∈ R },

and Σ_m as the space of the piecewise constant functions on p_m,

    Σ_m = { g = Σ_{I ∈ p_m} g_I 1_I such that ∀ I ∈ p_m, g_I ∈ R }.

Then, the collection of models that we consider is F_PC = {S_m × Σ_m, m ∈ M}. Note that this collection satisfies (2.4) with A = 1 and B = 0. The following result gives a uniform convergence rate for (s̃, σ̃^2) over Hölderian balls.

Proposition 3. Let α_1, α_2 ∈ (0, 1], L_1, L_2 > 0 and assume that (H_θ) is fulfilled. Consider the collection of models F_PC and δ, ε > 0 such that, for any m ∈ M, D_m ≤ 5δγ n / log^{1+ε}(n). Denote by (s̃, σ̃^2) the estimator selected with the penalty (2.6). If n satisfies an explicit lower bound, depending only on L_1, L_2, σ^2 and ε,

then

    sup_{(s_r, σ_r^2) ∈ H_{α_1}(L_1) × H_{α_2}(L_2)} E[K_n(P_{s,σ}, P_{s̃,σ̃})] ≤ C ( log^{1+ε}(n) / n )^{2α/(2α+1)},   (2.8)

where α = min{α_1, α_2} and C is a constant which depends on α_1, α_2, L_1, L_2, θ, γ, σ^2, δ and ε.

For the estimation of the mean s under quadratic risk with one observation of Y, Galtchouk and Pergamenshchikov [14] have computed the heteroscedastic minimax risk. Under some assumptions on the regularity of σ_r^2 and assuming that s_r ∈ H_{α_1}(L_1), they show that the optimal rate of convergence in the minimax sense is of order C_{α_1,σ} n^{-2α_1/(2α_1+1)}. Concerning the estimation of the variance vector σ^2 under quadratic risk with one observation of Y and unknown mean, Wang et al. [19] have proved that the minimax rate of convergence for the estimation of σ^2 is of order C_{α_1,α_2} max{ n^{-4α_1}, n^{-2α_2/(2α_2+1)} } when s_r ∈ H_{α_1}(L_1) and σ_r^2 ∈ H_{α_2}(L_2). For α_1, α_2 ∈ (0, 1], the maximum of these two rates is of order n^{-2α/(2α+1)}, where α = min{α_1, α_2} is the worst of the regularities of s_r and σ_r^2. Up to a logarithmic term, the rate of convergence over Hölderian balls achieved by our procedure recovers this rate for the Kullback risk.

3. Simulation study

To illustrate our results, we consider the following pairs of functions (s_r, σ_r^2) defined on [0, 1] and, for each one, we give the true value of γ:

(M1) γ = 4:
    s_r(x) = 2 if 0 ≤ x < 1/4,  0 if 1/4 ≤ x < 1/2,  2 if 1/2 ≤ x < 3/4,  1 if 3/4 ≤ x ≤ 1,
    and σ_r^2(x) = 2 if 0 ≤ x < 1/2,  1/2 if 1/2 ≤ x ≤ 1.

(M2) γ = 1:
    s_r(x) = 1 + sin(2πx + π/3)  and  σ_r^2(x) = 1.

(M3) γ = 7/3:
    s_r(x) = 3x/2  and  σ_r^2(x) = 1/2 + 2 sin^2(4π(x − 1/2))/3.

(M4) γ = 2:
    s_r(x) = 1 + 2 sin(4π(x − 1/2))  and  σ_r^2(x) = (3 + sin(2πx))/2.

Throughout this section, we consider the collection of models F_PC and we take n = 1024 (i.e. k_n = 10). Let us first present how our procedure performs on these examples with, for each simulation, the true value of γ, ε = 10^{-2} and δ = 3

Fig 1. Estimation of the mean (left) and of the variance (right) in the case M1.

Fig 2. Estimation of the mean (left) and of the variance (right) in the case M2.

in assumption (2.5) and the penalty (2.6), with θ = 2. The estimators are drawn as solid lines and the true functions as dotted lines. In the case of M1, we can note that the procedure chooses the good model, in the sense that, if the pair (s_r, σ_r^2) belongs to a model of F_PC, that model is generally the one chosen by our procedure. Repeating the simulation many times within the framework of M1 shows that, with confidence higher than 99.9%, the probability of making this good choice is high. Even if the mean does not belong to any of the S_m's, the procedure recovers the homoscedastic nature of the observations in the case M2: performing the same simulations within the framework induced by M2, the probability of choosing a homoscedastic model is likewise estimated to be high, with a confidence of 99.9%. For the more general frameworks M3 and M4, the estimators perform visually well and detect the changes in the behavior of the mean and of the variance functions.

The parameter γ is supposed to be known and appears in the definition of the penalty (2.6). So, we may naturally ask how important it is to the procedure. In particular, what happens if we do not use the correct value?
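For completeness, and before turning to the sensitivity to γ, we note that the four test configurations are easy to simulate; the sketch below simply transcribes the definitions of M1-M4 given above into Python (the helper names and the random seed are ours).

```python
import numpy as np

def m1(x):
    s = 2.0 * (x < 0.25) + 2.0 * ((0.5 <= x) & (x < 0.75)) + 1.0 * (x >= 0.75)
    v = np.where(x < 0.5, 2.0, 0.5)                      # variance ratio gamma = 4
    return s, v

def m2(x):
    return 1.0 + np.sin(2 * np.pi * x + np.pi / 3), np.ones_like(x)   # gamma = 1

def m3(x):
    s = 1.5 * x
    v = 0.5 + (2.0 / 3.0) * np.sin(4 * np.pi * (x - 0.5)) ** 2        # gamma = 7/3
    return s, v

def m4(x):
    s = 1.0 + 2.0 * np.sin(4 * np.pi * (x - 0.5))
    v = (3.0 + np.sin(2 * np.pi * x)) / 2.0                           # gamma = 2
    return s, v

def simulate(case, n=1024, seed=0):
    """Draw the two copies Y^[1], Y^[2] on the fixed design i/n, i = 1, ..., n."""
    rng = np.random.default_rng(seed)
    x = np.arange(1, n + 1) / n
    s, v = case(x)
    y1 = s + np.sqrt(v) * rng.standard_normal(n)
    y2 = s + np.sqrt(v) * rng.standard_normal(n)
    return y1, y2
```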

Fig 3. Estimation of the mean (left) and of the variance (right) in the case M3.

Fig 4. Estimation of the mean (left) and of the variance (right) in the case M4.

The following table presents some estimates of the ratio

    E[K(P_{s,σ}, P_{s̃,σ̃})] / inf_{m ∈ M} E[K(P_{s,σ}, P_{ŝ_m,σ̂_m})]

for several values of γ. These estimated values have been obtained with 500 repetitions each. The main part of the computation time is devoted to the estimation of the oracle risk. In the cases M1, M3 and M4, the ratio does not suffer too much from small errors in the knowledge of γ. The most affected case is the homoscedastic one, but we see that the best estimation is obtained for the true value of γ, as we could expect. More generally, it is interesting to observe that, even if there is a small error on the value of γ, the ratio remains reasonably small.

In the regression framework with heteroscedastic noise, we may also be interested in separate estimations of the mean and the variance functions. Because our procedure provides a simultaneous estimation of these two functions, we may ask how our estimators s̃ and σ̃^2 perform individually. Considering the quadratic risks E[‖s − s̃‖^2] and E[‖σ^2 − σ̃^2‖^2] of s̃ and σ̃^2 respectively, it is interesting to compare them to the minimal quadratic risks among the collections of estimators {ŝ_m, m ∈ M} and {σ̂_m^2, m ∈ M}.

Table 1. Ratio between the Kullback risk of (s̃, σ̃^2) and that of the oracle, for several values of γ, in the cases M1-M4.

Table 2. Ratio between the L^2-risk of s̃ and the minimal one among the ŝ_m's, for several values of γ, in the cases M1-M4.

Table 3. Ratio between the L^2-risk of σ̃^2 and the minimal one among the σ̂_m^2's, for several values of γ, in the cases M1-M4.

To illustrate this, we give below two sets of estimates of the ratios

    E[‖s − s̃‖^2] / inf_{m ∈ M} E[‖s − ŝ_m‖^2]  and  E[‖σ^2 − σ̃^2‖^2] / inf_{m ∈ M} E[‖σ^2 − σ̂_m^2‖^2]

in the frameworks presented at the beginning of this section. We can observe from these estimates that the quadratic risks of our estimators are quite close to the minimal ones among the collection of models.

4. Proofs

For any I ⊂ {1, ..., n} and any x, y ∈ R^n, we introduce the notations

    ⟨x, y⟩_I = Σ_{i ∈ I} x_i y_i  and  ‖x‖_I^2 = Σ_{i ∈ I} x_i^2.

Let m ∈ M. We will use several times in the proofs the fact that, for any I ∈ p_m,

    |I| σ̂_{m,I}^2 stochastically dominates σ_min^2 χ^2(|I| − d_m),   (4.1)

where χ^2(|I| − d_m) is a χ^2 random variable with |I| − d_m degrees of freedom.

13 X Gendre/Simultaneous estimation of the mean and the variance Proof of the proposition 1 Recalling and using the independence between ŝ m and ˆσ m, we expand the Kullback risk of ŝ m, ˆσ m, E [KP s,σ, Pŝm,ˆσ m ] = 1 [ si ŝ m,i ] ˆσm,I E + φ 4 ˆσ m,i σ i i I = 1 [ ] 1 E E [ ] s ŝ m I ˆσ m,i + 1 [ E log ˆσ m,i + σ ] i 1 + log σ m,i σ m,i ˆσ m,i σ i i I = KP s,σ, P sm,σ m + 1 [ ] ˆσm,I I E φ σ m,i + 1 [ σi + s i s m,i ] σ m,i E ˆσ m,i i I + 1 [ ] 1 [ ] E π m Γ 1/ σ ε[1] I where E 1 = 1 [ I E φ E ˆσ m,i = KP s,σ, P sm,σ m + E 1 + E 43 ˆσm,I σ m,i ] and E = 1 E [ 1 ˆσ m,i To upper bound the first expectation, note that I p m, E[ˆσ m,i ] = σ m,i 1 π m,i,i σ i = σ m,i 1 ρ I I where i I 1 ρ I = π m,i,i σ i 0, 1 I σ m,i i I ] π m,i,i σ i We apply the lemmas 10 and 11 to each block I p m and, by concavity of the logarithm, we get [ ] [ ] [ ] ˆσm,I ˆσm,I σm,i E φ log E + E 1 σ m,i σ m,i ˆσ m,i log1 ρ I + 1 κγ ρ I I d m ρ I + 1 κγ ρ I I d m 1 ρ κγ I + 1 ρ I I d m i I

14 X Gendre/Simultaneous estimation of the mean and the variance 1358 Using H θ and the fact that ρ I γd m / I, we obtain E 1 1 I ρ κγ I + 1 ρ I I d m 1 γ θ p m d m γdm κγ I + I γd m I γd m I d m + κγ θ p m 44 The second expectation in 43 is easier to upper bound by using 41 and the fact that d m 1, E = 1 [ ] 1 π m,i,i σ i 1 E ˆσ m,i i I γ I d m I d m 3 γθ p m d m We now sum 44 and 45 to obtain 45 E 1 + E γ θ p m d m + κγ θ p m κγ θ D m For the lower bound, the positivity of φ in 4 and the independence between ŝ m and ˆσ m give us E [KP s,σ, Pŝm,ˆσ m ] 1 [ s ŝm ] I E ˆσ m,i 1 [ E s ŝm I] E [ˆσ m,i ] 1 s s m I + σ d m s s m I + I d mσ I It is obvious that the hypothesis H θ ensures d m I / Thus, we get σ d m I d m σ and E [KP s,σ, Pŝm,ˆσ m ] 1 I σ d m I d m σ p m d m D m γ 4γ To conclude, we know that ŝ m, ˆσ m S m Σ m and, by definition of s m, σ m, it implies E [KP s,σ, Pŝm,ˆσ m ] KP s,σ, P sm,σ m

15 X Gendre/Simultaneous estimation of the mean and the variance Proof of theorem We prove the following more general result: Theorem 4 Let α 0, 1 and consider a collection of positive weights {x m } m M If the hypothesis H θ is fulfilled and if then 1 αe [K P s,σ, P s, σ ] m M, penm γθd m + x m, 46 inf m M {E [K P s,σ, Pŝm,ˆσ m ] + penm} + R 1 M + R M where R 1 M and R M are defined by R 1 M = Cθ γ m M pm d m Cθ γ p m d m log1 + d m x m log1+dm and R M = α + γθ + 1 α m M p m exp n θ p m log 1 + α p m x m γnα + In these expressions, is the integral part and C is a positive constant that could be taken equal to 1 e/ e 1 Before proving this result, let us see how it implies the theorem The choice 6 for the penalty function corresponds to x m = D m log 1+ǫ D m in 46 Applying the previous theorem with α = 1/ leads us to with and E [K P s,σ, P s, σ ] inf m M {E [K P s,σ, Pŝm,ˆσ m ] + penm} + Cθ γr 1 + 8γθ + 1R R 1 = m M pm d m Cθ γ p m d m log1 + d m m M x m log1+dm R = p m exp n θ p m log 1 + p m x m 5γn Using the upper bound on the risk of the proposition 1, we easily obtain the coefficient of the infimum in 7 Thus, it remains to prove that the two quantities R 1 and R can be upper bounded independently of n For this, we denote

16 X Gendre/Simultaneous estimation of the mean and the variance 1360 by B = B + logcθ γ + 1 and we compute R 1 = m M k 0 d 1 pm d m M k,d k/ d Cθ γ p m d m log1 + d m p m 1 + d m log 1+ǫ p m 1 + d m Cθ γ k/ log1 + d k log + log1 + d 1+ǫ A 1 + d B k/ k/ log1 + d k log + log1 + d 1+ǫ k 0 d 1 AR 1 + R 1 We have split the sum in two terms, the first one is for d = 1, R 1 = B log k log + log k 0 B 1+ǫ = log ǫ The other part R 1 is for d and is equal to k 0 d k 0 log1+dm log1+d 1 k ǫ < log1+d log1+d 1 + d B k log1+d 1/ log1 + d k log + log1 + d 1+ǫ Noting that 1 < log1 + d log1 + d, we have R 1 k/ + d k 0 d 1 B exp ǫ log1 + d log log1 + d 1 + d B ǫ loglog1+d < 1 d We now handle R Our choice of x m = D m log 1+ǫ D m and the hypothesis 5 imply p m x m 5γn δ p m = 1 δ p m δ p m We recall that, for any a 0, 1, if 0 t 1 a/a, then log1+t at Take a = δ p m to obtain log 1 + p m x m p m x m 5γn 5δ p m + 1γn x m 5δ + 1γn

17 X Gendre/Simultaneous estimation of the mean and the variance 1361 For any positive t, 1 + t 1+ǫ 1 + t 1+ǫ, then we finally obtain R = p m exp n θ p m log 1 + p m x m 5γn m M x m p m exp 10θγδ + 1 p m m M M k,d k exp 1 + dlog1+ǫ k 1 + d 10θγδ + 1 where we have set k 0 d 1 AR R R = k 0exp k log k log 1+ǫ < 5θγδ + 1 and R = exp B log1 + d 1 + dlog1+ǫ 1 + d < 10θγδ + 1 d 1 We now have to prove theorem 4 For an arbitrary m M, we begin the proof by expanding the Kullback-Leibler divergence of s, σ, KP s,σ, P s, σ = 1 n s i s i σi + φ σ i σ i = KP s,σ, Pŝm,ˆσ m + [Lŝ m, ˆσ m KP s,σ, Pŝm,ˆσ m ] By the definition 3 of ˆm, the inequality + [L s, σ Lŝ m, ˆσ m ] + [KP s,σ, P s, σ L s, σ] L s, σ Lŝ m, ˆσ m penm pen ˆm 47 is true for any m M The difference between the divergence and the likelihood can be expressed as KP s,σ, Pŝm,ˆσ m Lŝ m, ˆσ m = 1 σi 1 1 ε [1] i 48 ˆσ m,i i I s i ŝ m,i σ i ε [1] i 1 ˆσ m,i n ε [1] i + log σi

18 X Gendre/Simultaneous estimation of the mean and the variance 136 Using 47 and 48, for any α 0, 1, we can write 1 αkp s,σ, P s, σ 49 where, for any m M, KP s,σ, Pŝm,ˆσ m + penm + Gm + W 1 ˆm + W ˆm + Z ˆm pen ˆm W 1 m = 1 ˆσ m,i πm Γ 1/ σ ε[1] I, W m = 1 s m s, Γ 1/ σ ˆσ ε[1] α m,i I s m s I 1 ε [1] Zm = 1 i I σi ˆσ m,i 1 i I i αφ ˆσm,I σ i and Gm = 1 s ŝ m, Γ 1/ σ ε [1] ˆσ 1 σi 1 1 ε [1] i m,i I ˆσ m,i We split the proof of theorem 4 in several lemmas Lemma 5 For any m M, we have E[Gm] 0 Proof Let us compute this expectation to obtain the inequality By independence between ε [1] and ε [], we get [ E Gm ε []] = 1 [ E π m Γ 1/ σ ε [1], Γ 1/ σ ε [1] ] ˆσ m,i I = 1 [ ] πm E Γ 1/ σ ε [1] ˆσ m,i I It leads to E[Gm] = E [ E [ Gm ε [] ]] 0 In order to control Zm, we split it in two terms that we study separately, where and Z + m = 1 Z m = 1 i I i I Zm = Z + m + Z m σi 1 ˆσ m,i + σi 1 ˆσ m,i 1 ε [1] i ε [1] i, ˆσm,I αφ ½ˆσm,I σi 1 αφ σ i ˆσm,I σ i ½ˆσm,I>σi

19 X Gendre/Simultaneous estimation of the mean and the variance 1363 Lemma 6 Let m M and x be a positive number Under the hypothesis H θ, we get E [ ] γθ p m Z + m x + exp n d m + 3 p m log 1 + α p m x α p m γn Proof We begin by setting, for all 1 i n, T i m = n σ i /ˆσ m,i 1 + j=1 σ j/ˆσ m,j 1 + 1/ and we denote by Sm = n T i m 1 ε [1] i We lower bound the function φ by the remark a 0, 1, u [a, 1], 1 u 1 a φu Thus, we obtain n σi 1 max ˆσ m,i i n + and we use this inequality to get σ i ˆσ m,i n φ j=1 ˆσm,j Z + m = 1 n 1/ σi 1 Sm α n φ ˆσ m,i + Mm Sm + α n ˆσm,i φ σ i 1 σ i max Sm + 4α i n ˆσ m,i σ j ½ˆσm,j σ j = Mm ½ˆσm,i σ i ˆσm,i σ i ½ˆσm,i σ i To control Sm, we use the inequality 4 in [15], conditionally to ε [] Let u > 0, [ σ i P max Sm σ ] i ε [] + u = E P Sm u/ max i n ˆσ m,i i n ˆσ m,i [ E exp u ] 4 min ˆσ m,i i n σ i By the remark 41, we can upper bound it by σ i P max Sm + u E i n ˆσ m,i [ exp u ] 4γ min X I

20 X Gendre/Simultaneous estimation of the mean and the variance 1364 where the X I s are iid random variables with a χ I d m 1/ I distribution For any λ > 0, we know that the Laplace transform of X I is given by E [ ] e λxi = 1 + λ I dm 1/ I Let t > 0, the following expectation is dominated by [ E Z + m γ ] α + t = P Z + m γt 0 α + u du [ αu E exp 0 γ + t ] min X I du [ αu E max exp γ + t ] X I du Using H θ and 410, we roughly upper bound the maximum by the sum of the Laplace transforms and we get [ E Z + m γ ] α + t γ I 1 + t I dm 3/ α I d m 3 I γθ p m exp n d m + 3 p m log 1 + t p m α p m n Take t = αx/γ to conclude Lemma 7 Let m M and x be a positive number, then E [ Z m α + 1x + ] α + 1 α e αx Proof Note that for all u > 1, we have φu 1 u 1 Let t > 0, we handle Z m conditionally to ε [] and, using the previous lower bound on φ, we obtain P Z m α + 1 ε α t [] P 1 n σi 1 ˆσ m,i ε [1] i 1 α + 1 α t + α 4 n σi 1 ˆσ m,i ε []

21 P 1 X Gendre/Simultaneous estimation of the mean and the variance 1365 n Let us note that σi 1 ˆσ m,i ε [1] i 1 t + t σi max 1 1, i n ˆσ m,i thus, we can apply the inequality 41 from [15] to get P Z m α + 1 α t exp t/ This inequality leads us to [ E Z m α + 1 ] α t + Take t = αx to get the announced result + n α+1t/α α + 1 α e t σi 1 ˆσ m,i ε [] PZ m udu It remains to control W 1 m and W m For the first one, we now prove a Rosenthal-type inequality Lemma 8 Consider any m M Under the hypothesis H θ, for any x > 0, we have E[W 1 m γθd m x + ] Cθ γ p m d m Cθ γ log1+dm p m d m log1 + d m x where is the integral part and C is a positive constant that could be taken equal to C = 1 e e 1 Proof Using the lemma 10 and the remark 41, we dominate W 1 m, W 1 m W 1m = γ I d m I d m 1 F I = γnd m n p m 1 + d m F I where the F I s are iid Fisher random variables of parameters d m, n/ p m d m 1 We denote by F m the distribution of the F I s and we have γ D m γ p m d m E[W 1m] γθ p m d m γθd m Take x > 0 and an integer q > 1, then E [ W 1 m E[W 1 m] x ] E [ W 1 m E[W ] 1 + m]q q 1x q 1

22 X Gendre/Simultaneous estimation of the mean and the variance 1366 We set V = W 1m E[W 1m] It is the sum of the independent centered random variables X I = γnd m n p m 1 + d m F I E[F I ], I p m To dominate E [ V q +], we use the theorem 9 in [11] Let us compute E[X I] = γ n d m n 3 p m p m n p m d m + 3 n p m d m + 5 γ θ 3 p m d m and so, E [ V+] q 1/q 1κ γ θ 3 p m d m q + qκ [ E max X I q] 1/q e e 1 where κ = We consider q = 1 + log1 + d m where is the integral part For this choice, q 1 + d m and it implies p m q < n p m 1 + d m The hypothesis H θ allows us to make a such choice We roughly upper bound the maximum by the sum and we use H θ to get [ E max X I q] [ γθd m q E max F I E[F I ] q] γθd m q q 1 E[F m ] q + p m E[Fm q ] γθ q d m + p m γθdm 1 + q 1/d m 1 p m q/n p m 1 + d m 6γθ d m q pm Thus, it gives E [ V+] q 1/q γθ 1κ p m d m q + 6κ p m 1/q d m q 6κ γθ pm d m q + p m 1/q d m q Injecting this inequality in 411 leads to E [ W 1m E[W 1m] x + ] Cγθ p m d m 1κ γθ p m d m 1 + log1 + d m Cγθ log1+dm p m d m 1 + log1 + d m x q

23 X Gendre/Simultaneous estimation of the mean and the variance 1367 Lemma 9 Consider any m M and let x be a positive number Under the hypothesis H θ, we have E [ ] γθ p m W m x + exp n d m + 3 p m log 1 + α p m x α p m γn Proof Let us define Am = s s m I ˆσ m,i The distribution of W m conditionally to ε [] is Gaussian with mean equal to αam/ and variance factor 1/ Γ σ s s m I ˆσ m,i If ζ is a standard Gaussian random variable, it is well known that, for any λ > 0, Pζ λ e λ 41 We apply the Gaussian inequality 41 to W m conditionally to ε [], t t > 0, P W m + α Am Γ 1/ σ s s m I e t It leads to ˆσ m,i ε [] P W m + α Am σ i ε [] tam max e t i n ˆσ m,i and thus, by the remark 41, P W m γt α max X 1 I p I ε [] P m W m t α max i n σ i ε [] e t ˆσ m,i where the X I s are iid random variables with a χ I d m 1/ I distribution Finally, we integrate following ε [] and we get [ PW m t E max exp αt ] γ X I We finish as we did for Z + m, [ E W m γ ] α + t + [ αu E max exp 0 γ + t γθ 1 + t I dm 3/ α I γθ p m exp n d m + 3 p m log α p m X I ] du 1 + t p m n

24 X Gendre/Simultaneous estimation of the mean and the variance 1368 In order to end the proof of theorem 4, we need to put together the results of the previous lemmas Because γ 1, for any x > 0, we can write e αx exp n p m log 1 + α p m x γn We now come back to 49 and we apply the preceding results to each model Let m M, we take x = x m + α and, recalling 46, we get the following inequalities 1 αe[kp s,σ, P s, σ ] [ E[KP s,σ, Pŝm,ˆσ m ] + penm + E W 1 ˆm γθd ˆm x ] ˆm + α + [ + E W ˆm x ] [ ˆm + E Z + ˆm x ] ˆm + α + + α + [ ] x ˆm + E Z ˆm 1 + α + α E[Km] + penm + R 1 M + R M 413 where R 1 M and R M are the sums defined in the theorem 4 As the choice of m is arbitrary, we can take the infimum among m M in the right part of Proof of the proposition 3 For the collection F PC, we have A = 1 and B = 0 in 4 Let m M, we denote by σ m Σ m the quantity σ m = σ m,i ½ I with I p m, σ m,i = 1 σ i I The theorem gives us E [K n P s,σ, P s, σ ] C n inf { KPs,σ, P sm,σ m + D m log 1+ǫ } R D m + m M n C n inf { KPs,σ, P sm, σ m + D m log 1+ǫ } R D m + m M n { s sm C inf + σ σ m } + D m M nσ nσ m log 1+ǫ D m + R n because, for any x > 0, φx x 1/x i I

25 X Gendre/Simultaneous estimation of the mean and the variance 1369 Assuming s r, σ r H α1 L 1 H α L, we know see [13] that s s m nl 1 p m d m α1 and Thus, we obtain σ σ m nl p m α E [K n P s,σ, P s, σ ] C inf m M If α 1 < α, we can take and { L 1 σ p m d m α1 + L σ p m d m = p m = L 1 n σ log 1+ǫ n L n σ log 1+ǫ n } p m α + log1+ǫ n D m + R n n 1/1+α1 1/1+α For α 1 α, this choice is not allowed because it would imply d m = 0 So, in this case, we take L d m = 1 and p m = 1 σ + L 1/1+α n σ log 1+ǫ n In the two situation, we obtain the announced result 5 Technical results This section is devoted to some useful technical results Some notations previously introduced can have a different meaning here Lemma 10 Let Σ be a positive symmetric n n-matrix and σ 1,, σ n > 0 be its eigenvalues Let P be an orthogonal projection of rank D 1 If we denote M = P ΣP, then M is a non-negative symmetric matrix of rank D and, if τ 1,, τ D are its positive eigenvalues, we have min σ i min τ i and max τ i max σ i 1 i n 1 i D 1 i D 1 i n Proof We denote by Σ 1/ the symmetric square root of Σ By a classical result, M has the same rank, equal to D, than PΣ 1/ On a first side, we have

26 X Gendre/Simultaneous estimation of the mean and the variance 1370 max τ PΣPx, x i = sup 1 i D x R n x x 0 PΣx, x = sup x 1,x kerp imp x 1 + x x 1,x 0,0 On the other side, we can write min τ i = min 1 i D V R n dimv =n D+1 Σx, x sup max x imp x σ i 1 i n x 0 = min V R n dimv =n D+1 min V R n dimv =n D+1 min V R n dimv 1 max x V x 0 max x V x 0 max x V x 0 Mx, x x Σ 1/ Px max x V imp x 0 Σ 1/ x x x Σ 1/ x x = min 1 i n σ i Lemma 11 Let ε be a standard Gaussian vector in R n, a = a 1,, a n R n and b 1,, b n > 0 We denote by b resp b the maximum resp minimum of the b i s If n > and Z = n a i + b i ε i, then [ ] 1 E κb /b Z E[Z] n where κ > 1 is a constant that can be taken equal to 1 + e Proof We recall that E[Z] = n i=0 a i + b i and, for any λ > 0, the Laplace transform of a i + b i ε i is [ E exp λa i + b i ε i ] = exp Thus, the Laplace transform of Z is equal to λa i λb i log1 + λb i ψλ = E [ e λz] n λa i = exp 1 n log1 + λb i 1 + λb i n = e λe[z] λ a i exp b i 1 n rλb i 1 + λb i i=0

27 X Gendre/Simultaneous estimation of the mean and the variance 1371 where rx = log1 + x x for all x > 0 To compute the expectation of the inverse of Z, we integrate ψ by parts, [ ] 1 E = ψλdλ Z 0 n = e λe[z] λ a i exp b i 1 n rλb i dλ 1 + λb i where = 0 1 E[Z] + 1 E[Z] f a,b λ = n i=0 0 i=0 f a,b λψλdλ λb i + 4λa i b i1 + λb i 1 + λb i 1 + λb i We now upper bound the integral, [ ] E[Z] E Z 1 = f a,b λ exp n λa i /1 + λb i n dλ 1 + λbi where we have set nλb 1 + λb 0 g a,b λ = 1+n/ dλ 4b 1 + λb 1 + λb 1+n/ g a,bλe ga,bλ dλ n λa i 1 + λb i For any t > 0, te t e 1 Because g a,b is a positive function and n >, we obtain [ ] E[Z] E Z 1 nλb λb 1+n/dλ + 4b 1 + λb 0 e1 + λb 1+n/dλ b /b + 4b /b n + b /b n enn b /b n e 1 b /b n n 1 en References [1] Akaike, H 1969 Statistical predictor identification Annals Inst Statist Math MR08633 [] Akaike, H 1973 Information theory and an extension of the maximum likelihood principle In Proceedings nd International Symposium on Information Theory, P Petrov and F Csaki, Eds Akademia Kiado, Budapest, MR048315

[3] Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. on Automatic Control.
[4] Arlot, S. (2007). Rééchantillonnage et sélection de modèles. PhD thesis, Université Paris 11.
[5] Baraud, Y. (2000). Model selection for regression on a fixed design. Probab. Theory Related Fields 117.
[6] Baraud, Y., Giraud, C. and Huet, S. (2006). Gaussian model selection with an unknown variance. To appear in the Annals of Statistics.
[7] Barron, A., Birgé, L. and Massart, P. (1999). Risk bounds for model selection via penalization. Probab. Theory Related Fields 113.
[8] Birgé, L. and Massart, P. (1997). From model selection to adaptive estimation. In Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics.
[9] Birgé, L. and Massart, P. (2001a). Gaussian model selection. Journal of the European Mathematical Society 3(3).
[10] Birgé, L. and Massart, P. (2001b). A generalized Cp criterion for Gaussian model selection. Prépublication 647, Universités de Paris 6 and Paris 7.
[11] Boucheron, S., Bousquet, O., Lugosi, G. and Massart, P. (2005). Moment inequalities for functions of independent random variables. Annals of Probability 33(2).
[12] Comte, F. and Rozenholc, Y. (2002). Adaptive estimation of mean and volatility functions in auto-regressive models. Stochastic Processes and their Applications 97(1).
[13] DeVore, R. and Lorentz, G. (1993). Constructive Approximation. Vol. 303. Springer-Verlag.
[14] Galtchouk, L. and Pergamenshchikov, S. (2005). Efficient adaptive nonparametric estimation in heteroscedastic regression models. Preprint, Université Louis Pasteur, IRMA.
[15] Laurent, B. and Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Annals of Statistics 28(5).
[16] Mallows, C. (1973). Some comments on Cp. Technometrics 15.
[17] McQuarrie, A. and Tsai, C. (1998). Regression and Time Series Model Selection. World Scientific Publishing Co., Inc.
[18] R Development Core Team (2007). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
[19] Wang, L. et al. (2008). Effect of mean on variance function estimation in nonparametric regression. The Annals of Statistics 36(2).


More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1

Variable Selection in Restricted Linear Regression Models. Y. Tuaç 1 and O. Arslan 1 Variable Selection in Restricted Linear Regression Models Y. Tuaç 1 and O. Arslan 1 Ankara University, Faculty of Science, Department of Statistics, 06100 Ankara/Turkey ytuac@ankara.edu.tr, oarslan@ankara.edu.tr

More information

Model comparison and selection

Model comparison and selection BS2 Statistical Inference, Lectures 9 and 10, Hilary Term 2008 March 2, 2008 Hypothesis testing Consider two alternative models M 1 = {f (x; θ), θ Θ 1 } and M 2 = {f (x; θ), θ Θ 2 } for a sample (X = x)

More information

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach

Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Statistical Methods for Handling Incomplete Data Chapter 2: Likelihood-based approach Jae-Kwang Kim Department of Statistics, Iowa State University Outline 1 Introduction 2 Observed likelihood 3 Mean Score

More information

Bayesian Regularization

Bayesian Regularization Bayesian Regularization Aad van der Vaart Vrije Universiteit Amsterdam International Congress of Mathematicians Hyderabad, August 2010 Contents Introduction Abstract result Gaussian process priors Co-authors

More information

NON ASYMPTOTIC MINIMAX RATES OF TESTING IN SIGNAL DETECTION

NON ASYMPTOTIC MINIMAX RATES OF TESTING IN SIGNAL DETECTION NON ASYMPTOTIC MINIMAX RATES OF TESTING IN SIGNAL DETECTION YANNICK BARAUD Abstract. Let Y = (Y i ) i I be a finite or countable sequence of independent Gaussian random variables of mean f = (f i ) i I

More information

Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts arxiv: v2 [math.st] 23 Jul 2012

Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts arxiv: v2 [math.st] 23 Jul 2012 Vapnik-Chervonenkis Dimension of Axis-Parallel Cuts arxiv:203.093v2 [math.st] 23 Jul 202 Servane Gey July 24, 202 Abstract The Vapnik-Chervonenkis (VC) dimension of the set of half-spaces of R d with frontiers

More information

PATTERN RECOGNITION AND MACHINE LEARNING

PATTERN RECOGNITION AND MACHINE LEARNING PATTERN RECOGNITION AND MACHINE LEARNING Chapter 1. Introduction Shuai Huang April 21, 2014 Outline 1 What is Machine Learning? 2 Curve Fitting 3 Probability Theory 4 Model Selection 5 The curse of dimensionality

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Least Squares Model Averaging. Bruce E. Hansen University of Wisconsin. January 2006 Revised: August 2006

Least Squares Model Averaging. Bruce E. Hansen University of Wisconsin. January 2006 Revised: August 2006 Least Squares Model Averaging Bruce E. Hansen University of Wisconsin January 2006 Revised: August 2006 Introduction This paper developes a model averaging estimator for linear regression. Model averaging

More information

Mathematical Methods for Neurosciences. ENS - Master MVA Paris 6 - Master Maths-Bio ( )

Mathematical Methods for Neurosciences. ENS - Master MVA Paris 6 - Master Maths-Bio ( ) Mathematical Methods for Neurosciences. ENS - Master MVA Paris 6 - Master Maths-Bio (2014-2015) Etienne Tanré - Olivier Faugeras INRIA - Team Tosca October 22nd, 2014 E. Tanré (INRIA - Team Tosca) Mathematical

More information

21.2 Example 1 : Non-parametric regression in Mean Integrated Square Error Density Estimation (L 2 2 risk)

21.2 Example 1 : Non-parametric regression in Mean Integrated Square Error Density Estimation (L 2 2 risk) 10-704: Information Processing and Learning Spring 2015 Lecture 21: Examples of Lower Bounds and Assouad s Method Lecturer: Akshay Krishnamurthy Scribes: Soumya Batra Note: LaTeX template courtesy of UC

More information

Nonparametric Inference via Bootstrapping the Debiased Estimator

Nonparametric Inference via Bootstrapping the Debiased Estimator Nonparametric Inference via Bootstrapping the Debiased Estimator Yen-Chi Chen Department of Statistics, University of Washington ICSA-Canada Chapter Symposium 2017 1 / 21 Problem Setup Let X 1,, X n be

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

MAJORIZING MEASURES WITHOUT MEASURES. By Michel Talagrand URA 754 AU CNRS

MAJORIZING MEASURES WITHOUT MEASURES. By Michel Talagrand URA 754 AU CNRS The Annals of Probability 2001, Vol. 29, No. 1, 411 417 MAJORIZING MEASURES WITHOUT MEASURES By Michel Talagrand URA 754 AU CNRS We give a reformulation of majorizing measures that does not involve measures,

More information

Inverse problems in statistics

Inverse problems in statistics Inverse problems in statistics Laurent Cavalier (Université Aix-Marseille 1, France) Yale, May 2 2011 p. 1/35 Introduction There exist many fields where inverse problems appear Astronomy (Hubble satellite).

More information

Stein s Method for Matrix Concentration

Stein s Method for Matrix Concentration Stein s Method for Matrix Concentration Lester Mackey Collaborators: Michael I. Jordan, Richard Y. Chen, Brendan Farrell, and Joel A. Tropp University of California, Berkeley California Institute of Technology

More information

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model.

Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model. Minimax Rate of Convergence for an Estimator of the Functional Component in a Semiparametric Multivariate Partially Linear Model By Michael Levine Purdue University Technical Report #14-03 Department of

More information

ECE 275A Homework 7 Solutions

ECE 275A Homework 7 Solutions ECE 275A Homework 7 Solutions Solutions 1. For the same specification as in Homework Problem 6.11 we want to determine an estimator for θ using the Method of Moments (MOM). In general, the MOM estimator

More information

University of Luxembourg. Master in Mathematics. Student project. Compressed sensing. Supervisor: Prof. I. Nourdin. Author: Lucien May

University of Luxembourg. Master in Mathematics. Student project. Compressed sensing. Supervisor: Prof. I. Nourdin. Author: Lucien May University of Luxembourg Master in Mathematics Student project Compressed sensing Author: Lucien May Supervisor: Prof. I. Nourdin Winter semester 2014 1 Introduction Let us consider an s-sparse vector

More information

Goodness-of-Fit Tests for Time Series Models: A Score-Marked Empirical Process Approach

Goodness-of-Fit Tests for Time Series Models: A Score-Marked Empirical Process Approach Goodness-of-Fit Tests for Time Series Models: A Score-Marked Empirical Process Approach By Shiqing Ling Department of Mathematics Hong Kong University of Science and Technology Let {y t : t = 0, ±1, ±2,

More information

OPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1

OPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1 The Annals of Statistics 1997, Vol. 25, No. 6, 2512 2546 OPTIMAL POINTWISE ADAPTIVE METHODS IN NONPARAMETRIC ESTIMATION 1 By O. V. Lepski and V. G. Spokoiny Humboldt University and Weierstrass Institute

More information

VISCOSITY SOLUTIONS. We follow Han and Lin, Elliptic Partial Differential Equations, 5.

VISCOSITY SOLUTIONS. We follow Han and Lin, Elliptic Partial Differential Equations, 5. VISCOSITY SOLUTIONS PETER HINTZ We follow Han and Lin, Elliptic Partial Differential Equations, 5. 1. Motivation Throughout, we will assume that Ω R n is a bounded and connected domain and that a ij C(Ω)

More information

On the Goodness-of-Fit Tests for Some Continuous Time Processes

On the Goodness-of-Fit Tests for Some Continuous Time Processes On the Goodness-of-Fit Tests for Some Continuous Time Processes Sergueï Dachian and Yury A. Kutoyants Laboratoire de Mathématiques, Université Blaise Pascal Laboratoire de Statistique et Processus, Université

More information

[y i α βx i ] 2 (2) Q = i=1

[y i α βx i ] 2 (2) Q = i=1 Least squares fits This section has no probability in it. There are no random variables. We are given n points (x i, y i ) and want to find the equation of the line that best fits them. We take the equation

More information

Concentration inequalities and the entropy method

Concentration inequalities and the entropy method Concentration inequalities and the entropy method Gábor Lugosi ICREA and Pompeu Fabra University Barcelona what is concentration? We are interested in bounding random fluctuations of functions of many

More information

Bayesian Nonparametric Point Estimation Under a Conjugate Prior

Bayesian Nonparametric Point Estimation Under a Conjugate Prior University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 5-15-2002 Bayesian Nonparametric Point Estimation Under a Conjugate Prior Xuefeng Li University of Pennsylvania Linda

More information

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½

Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 1998 Asymptotic Nonequivalence of Nonparametric Experiments When the Smoothness Index is ½ Lawrence D. Brown University

More information

Asymptotics for posterior hazards

Asymptotics for posterior hazards Asymptotics for posterior hazards Pierpaolo De Blasi University of Turin 10th August 2007, BNR Workshop, Isaac Newton Intitute, Cambridge, UK Joint work with Giovanni Peccati (Université Paris VI) and

More information

On the asymptotic normality of frequency polygons for strongly mixing spatial processes. Mohamed EL MACHKOURI

On the asymptotic normality of frequency polygons for strongly mixing spatial processes. Mohamed EL MACHKOURI On the asymptotic normality of frequency polygons for strongly mixing spatial processes Mohamed EL MACHKOURI Laboratoire de Mathématiques Raphaël Salem UMR CNRS 6085, Université de Rouen France mohamed.elmachkouri@univ-rouen.fr

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V

More information

On prediction and density estimation Peter McCullagh University of Chicago December 2004

On prediction and density estimation Peter McCullagh University of Chicago December 2004 On prediction and density estimation Peter McCullagh University of Chicago December 2004 Summary Having observed the initial segment of a random sequence, subsequent values may be predicted by calculating

More information

1 Data Arrays and Decompositions

1 Data Arrays and Decompositions 1 Data Arrays and Decompositions 1.1 Variance Matrices and Eigenstructure Consider a p p positive definite and symmetric matrix V - a model parameter or a sample variance matrix. The eigenstructure is

More information

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016

Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 Lecture 1: Entropy, convexity, and matrix scaling CSE 599S: Entropy optimality, Winter 2016 Instructor: James R. Lee Last updated: January 24, 2016 1 Entropy Since this course is about entropy maximization,

More information

Statistics 910, #5 1. Regression Methods

Statistics 910, #5 1. Regression Methods Statistics 910, #5 1 Overview Regression Methods 1. Idea: effects of dependence 2. Examples of estimation (in R) 3. Review of regression 4. Comparisons and relative efficiencies Idea Decomposition Well-known

More information

Inverse problems in statistics

Inverse problems in statistics Inverse problems in statistics Laurent Cavalier (Université Aix-Marseille 1, France) YES, Eurandom, 10 October 2011 p. 1/32 Part II 2) Adaptation and oracle inequalities YES, Eurandom, 10 October 2011

More information

Some functional (Hölderian) limit theorems and their applications (II)

Some functional (Hölderian) limit theorems and their applications (II) Some functional (Hölderian) limit theorems and their applications (II) Alfredas Račkauskas Vilnius University Outils Statistiques et Probabilistes pour la Finance Université de Rouen June 1 5, Rouen (Rouen

More information

Nonlinear Programming Models

Nonlinear Programming Models Nonlinear Programming Models Fabio Schoen 2008 http://gol.dsi.unifi.it/users/schoen Nonlinear Programming Models p. Introduction Nonlinear Programming Models p. NLP problems minf(x) x S R n Standard form:

More information

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines

Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Econ 2148, fall 2017 Gaussian process priors, reproducing kernel Hilbert spaces, and Splines Maximilian Kasy Department of Economics, Harvard University 1 / 37 Agenda 6 equivalent representations of the

More information

Discussion of High-dimensional autocovariance matrices and optimal linear prediction,

Discussion of High-dimensional autocovariance matrices and optimal linear prediction, Electronic Journal of Statistics Vol. 9 (2015) 1 10 ISSN: 1935-7524 DOI: 10.1214/15-EJS1007 Discussion of High-dimensional autocovariance matrices and optimal linear prediction, Xiaohui Chen University

More information

Modelling Non-linear and Non-stationary Time Series

Modelling Non-linear and Non-stationary Time Series Modelling Non-linear and Non-stationary Time Series Chapter 2: Non-parametric methods Henrik Madsen Advanced Time Series Analysis September 206 Henrik Madsen (02427 Adv. TS Analysis) Lecture Notes September

More information

High-dimensional covariance estimation based on Gaussian graphical models

High-dimensional covariance estimation based on Gaussian graphical models High-dimensional covariance estimation based on Gaussian graphical models Shuheng Zhou Department of Statistics, The University of Michigan, Ann Arbor IMA workshop on High Dimensional Phenomena Sept. 26,

More information

Discussion of Regularization of Wavelets Approximations by A. Antoniadis and J. Fan

Discussion of Regularization of Wavelets Approximations by A. Antoniadis and J. Fan Discussion of Regularization of Wavelets Approximations by A. Antoniadis and J. Fan T. Tony Cai Department of Statistics The Wharton School University of Pennsylvania Professors Antoniadis and Fan are

More information

On a Class of Multidimensional Optimal Transportation Problems

On a Class of Multidimensional Optimal Transportation Problems Journal of Convex Analysis Volume 10 (2003), No. 2, 517 529 On a Class of Multidimensional Optimal Transportation Problems G. Carlier Université Bordeaux 1, MAB, UMR CNRS 5466, France and Université Bordeaux

More information

Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 2012

Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 2012 Phenomena in high dimensions in geometric analysis, random matrices, and computational geometry Roscoff, France, June 25-29, 202 BOUNDS AND ASYMPTOTICS FOR FISHER INFORMATION IN THE CENTRAL LIMIT THEOREM

More information

Key Words: first-order stationary autoregressive process, autocorrelation, entropy loss function,

Key Words: first-order stationary autoregressive process, autocorrelation, entropy loss function, ESTIMATING AR(1) WITH ENTROPY LOSS FUNCTION Małgorzata Schroeder 1 Institute of Mathematics University of Białystok Akademicka, 15-67 Białystok, Poland mszuli@mathuwbedupl Ryszard Zieliński Department

More information

Segmentation of the mean of heteroscedastic data via cross-validation

Segmentation of the mean of heteroscedastic data via cross-validation Segmentation of the mean of heteroscedastic data via cross-validation 1 UMR 8524 CNRS - Université Lille 1 2 SSB Group, Paris joint work with Sylvain Arlot GDR Statistique et Santé Paris, October, 21 2009

More information