FUNCTIONAL REGRESSION FOR GENERAL EXPONENTIAL FAMILIES

Size: px
Start display at page:

Download "FUNCTIONAL REGRESSION FOR GENERAL EXPONENTIAL FAMILIES"

Transcription

1 Submitted to the Annals of Statistics FUNCTIONAL REGRESSION FOR GENERAL EXPONENTIAL FAMILIES BY WEI DOU, DAVID POLLARD AND HARRISON H. ZHOU Yale University The paper derives a minimax lower bound for rates of convergence for an infinite-dimensional parameter in an exponential family model. An estimator that achieves the optimal rate is constructed by maximum likelihood on finite-dimensional approximations with parameter dimension that grows with sample size. 1. Introduction. Our main purpose in this paper is to extend the theory developed by Hall and Horowitz 2007 for regression with mean a linear function of an unknown square integrable function B defined on a compact interval of the real line to observations y i from an exponential famly whose canonical parameter is of the form 1 0 BtX it dt for observed Gaussian processes X i. Our methods introduce several new technical devices. We establish a sharp approximation for maximum likelihood estimators for exponential families parametrized by linear functions of m-dimensional parameters, for an m that grows with sample size. We develop a change of measure argument inspired by ideas from Le Cam s theory of asymptotic equivalence of models to eliminate the effect of bias terms from the asymptotics of maximization estimators. And we obtain improved bounds for projections onto subspaces defined by eigenfunctions of perturbations of compact operators, bounds that simplify arguments involving estimates of unknown covariance kernels. More precisely, we consider problems where the observed data consist of independent, identically distributed pairs y i, X i where each X i is a Gaussian process indexed by a compact subinterval of the real line, which with no loss of generality we take to be [0, 1]. We write m for Lebesgue measure on the Borel sigma-field of [0, 1]. We denote the corresponding norm and inner product in the space L 2 m by and,. We assume the conditional distribution of y i given the process X i comes from Supported in part by NSF FRG grant DMS Supported in part by NSF grant MSPA-MCS Supported in part by NSF Career Award DMS and NSF FRG grant DMS AMS 2000 subject classifications: Primary 62J05, 60K35; secondary 62G20 Keywords and phrases: Functional estimation, exponential families, minimax rates of convergence, approximation of compact operators. 1

2 2 an exponential family {Q λ : λ R} with parameter 1 1 λ i = a + X i tbt dt 0 for an unknown constant a and an unknown B L 2 m. We focus on estimation of B using integrated squared error loss: LB, B n = B B n 2 = 1 0 Bt B n t 2 dt. In a companion paper we will show that our methods can be adapted to treat the problem of prediction of a linear functional 1 0 xtbt dt for a known x, extending theory developed by Cai and Hall In that paper we also consider some of the practical realities in applying the results to the economic problem of predicting occurence of recessions from the U.S. Treasury yield curve. Our models are indexed by a set F of parameters f = a, B, K, µ, where µ is the mean and K is the covariance kernel of the Gaussian process. Under assumptions on F see Section 3 analogous to the assumptions made by Hall and Horowitz 2007 for a problem of functional linear regression, we find a sequence {ρ n } that decreases to zero for which 2 lim inf sup n f F P n,f B B n 2 /ρ n > 0 for every estimating sequence { B n } and construct one particular estimating sequence of B n s for which: for each ɛ > 0 there exists a finite constant C ɛ such that 3 sup P n,f { B B n 2 > C ɛ ρ n } < ɛ for large enough n. f F For the collection of models F = FR, α, β defined in Section 3, the rate ρ n equals n 1 2β/α+2β. In Section 9 we establish a minimax lower bound by means of a variation on Assouad s Lemma. We begin our analysis of the rate-optimal estimator in Section 4, with an approximation theorem for maximum likelihood estimators in exponential family models for parameters whose dimensions change with sample size. The main result is stated in a form slightly more general than we need for the present paper because we expect the result to find other sieve-like applications. The approximations from this section lie at the heart of our construction of an estimator that achieves the minimax rate from Section 9. As an aid to the reader, we present our construction of the estimating sequence for 3 in two stages. First Section 5 we assume that both the mean µ and the

3 FUNCTIONAL REGRESSION 3 covariance kernel K are known. This allows us to emphasize the key ideas in our proofs without the many technical details that need to be handled when µ and K are estimated in the natural way. Many of those details involve the spectral theory of compact operators. We have found some of the results that we need quite difficult to dig out of the spectral theory literature. In Section 6 we summarize the theory that we use to control errors when approximating K: some of it is a rearrangement of ideas from Hall and Horowitz 2007 and Hall and Hosseini-Nasab 2006; some is adapted from the notes by Bosq 2000 and the monograph by Birman and Solomjak 1987; and some, such as the material in subsection 6.3 on approximation of projections, we believe to be new. Armed with the spectral theory, we proceed in Section 7 to the case where µ and K are estimated. We emphasize the parallels with the argument for known µ and K, postponing the proofs of the extra approximation arguments mostly collected together as Lemma 28 to the following section. The final two sections of the paper establish a bound on the Hellinger distance between members of an exponential family, the key to our change of measure argument, and a maximal inequality for Gaussian processes. 2. Notation. For each matrix A, the spectral norm is defined as A 2 := 1/2. sup u 1 Au and the Frobenius norm by A F := i,j A2 i,j If A is symmetric, with eigenvalues λ 1,..., λ k, then A 2 = max i λ i = sup u 1 u Au A F. If A is also positive definite then the absolute values are superfluous for the first two equalities. When we want to indicate that a bound involving constants c, C, C 1,... holds uniformly over all models indexed by a set of parameters F, we write cf, CF, C 1 F,.... By the usual convention for eliminating subscripts, the values of the constants might change from one paragraph to the next: a constant C 1 F in one place needn t be the same as a constant C 1 F in another place. For sequences of constants c n that might depend on F, we write c n = O F 1 and o F 1 and so on to show that the asymptotic bounds hold uniformly over F. We write hp, Q for the Hellinger distance between two probability measures P and Q. If both P and Q are dominated by some measure ν, with densities p and q, then h 2 P, Q = ν p q 2. We use Hellinger distance to bound total variation distance, P Q TV := sup A P A QA = 1 2ν p q hp, Q.

4 4 For product measures we use the bound h 2 i n P i, i n Q i i n h2 P i, Q i. To avoid confusion with transposes, we use the dot notation or superscript notation to denote derivatives. For example,... ψ or ψ 3 both denote the third derivative of a function ψ, 3. The model. Let {Q λ : λ R} be an exponential family of probability measures with densities dq λ /dq 0 = f λ y = exp λy ψλ. Remember that e ψλ = Q 0 e λy and that the distribution Q λ has mean ψ 1 λ and variance ψ 2 λ. We assume: ψ3 There exists an increasing real function G on R + such that ψ 3 λ + h ψ 2 λg h for all λ and h Without loss of generality we assume G0 1. ψ2 For each ɛ > 0 there exists a finite constant C ɛ for which ψ 2 λ C ɛ expɛλ 2 for all λ R. Equivalently, ψ 2 λ exp oλ 2 as λ. As shown in Section 10, these assumptions on the ψ function imply that 4 h 2 Q λ, Q λ+δ δ 2 ψ 2 λ 1 + δ G δ for all λ, δ R. Remark. We may assume that ψ 2 λ > 0 for every real λ. Otherwise we would have 0 = ψ 2 λ 0 = var λ0 y = νf λ0 yy ψ 1 λ 0 2 for some λ 0, which would make y = ψ 1 λ 0 for ν almost all y and Q λ Q λ0 for every λ. We assume the observed data are iid pairs y i, X i for i = 1,..., n, where: a Each {X i t : 0 t 1} is distributed like {Xt : 0 t 1}, a Gaussian process with mean µt and covariance kernel Ks, t. b y i X i Q λi with λ i = a + X i, B for an unknown {Bt : 0 t 1} in L 2 m and a R. DEFINITION 5. For real constants α > 1 and β > α + 3/2 and R > 0, define F = FR, α, β as the set of all f = a, B, µ, K that satisfy the following conditions. K The covariance kernel is square integrable with respect to m m and has an eigenfunction expansion as a compact operator on L 2 m Ks, t = k N θ kφ k sφ k t where the eigenvalues θ k are decreasing with Rk α θ k θ k+1 +α/rk α 1.

5 FUNCTIONAL REGRESSION 5 a a R µ µ R B B has an expansion Bt = k N b kφ k t with b k Rk β, for the eigenfunctions defined by the kernel K. Remarks. all k < j, The awkward lower bound for θ k in Assumption K implies, for j 6 θ k θ j R 1 αx α 1 dx = R 1 k α j α. k If K and µ were known, we would only need the lower bound θ k R 1 k α and not the lower bound for θ k θ k+1. As explained by Hall and Horowitz 2007, page 76, the stronger assumption is needed when one estimates the individual eigenfunctions of K. Note that the subset B K of L 2 m in which B lies depends on K. We regard the need for the stronger assumption on the eigenvalues and the irksome Assumption B as artifacts of the method of proof, but we have not yet succeeded in removing either assumption. More formally, we write P µ,k for the distribution a probability measure on L 2 m of each Gaussian process X i. The joint distribution of X 1,..., X n is then P n,µ,k = Pµ,K n. We identify the y i s with the coordinate maps on R n equipped with the product measure Q n,a,b,x1,...,x n := i n Q λi, which can also be thought of as the conditional joint distribution of y 1,..., y n given X 1,..., X n. Thus the P n,f in equations 2 and 3 can be rewritten as an iterated expectation, P n,f = P n,µ,k Q n,a,b,x1,...,x n, the second expectation on the right-hand side averaging out over y 1,..., y n for given X 1,..., X n, the first averaging out over X 1,..., X n. To simplify notation, we will often abbreviate Q n,a,b,x1,...,x n to Q n,a,b. 4. Maximum likelihood estimation. The theory in this section combine ideas from Portnoy 1988 and from Hjort and Pollard We write our results in a notation that makes the applications in Section 5 and 7 more straightforward. The notational cost is that the parameters are indexed by {0, 1,..., N}. To avoid an excess of parentheses we write N + for N + 1. In the applications N changes with the sample size n and Q is replaced by Q n,a,b,n or Q n,a,b,n. Suppose ξ 1,..., ξ n are nonrandom vectors in R N +. Suppose Q = i n Q λi with λ i = ξ i γ for a fixed γ = γ 0, γ 1,..., γ N in R N +. Under Q, the coordinate maps y 1,..., y n are independent random variables with y i Q λi. The log-likelihood for fitting the model is L n g = i n ξ igy i ψξ ig for g R N +, which is maximized over R N + at the MLE ĝ = ĝ n.

6 6 Remark. As a small amount of extra bookkeeping in the following argument would show, we do not need ĝ to exactly maximize L n. It would suffice to have L n ĝ suitably close to sup g L n g. In particular, we need not be concerned with questions regarding existence or uniqueness of the argmax. Define i J n = i n ξ iξ i ψ2 λ i, an N + N + matrix ii w i := Jn 1/2 ξ i, an element of R N + iii W n = i n w i y i ψ 1 λ i, an element of R N + Notice that QW n = 0 and var Q W n = i n w iw i ψ2 λ i = I N+ and Q W n 2 = trace var Q W n = N +. LEMMA 7. Suppose 0 < ɛ 1 1/2 and 0 < ɛ 2 < 1 and max i n w i ɛ 1ɛ 2 2G1N + with G as in Assumption ψ3. Then ĝ = γ + J 1/2 n W n + r n with r n ɛ 1 on the set { W n N + /ɛ 2 }, which has Q-probability greater than 1 ɛ 2. PROOF. The equality Q W n 2 = N + and Tchebychev give Q{ W n > N + /ɛ 2 } ɛ 2. Reparametrize by defining t = Jn 1/2 g γ. The concave function L n t := L n γ + Jn 1/2 t L n γ = y iw i n it + ψλ i ψλ i + w it is maximized at t n = J 1/2 n ĝ γ. It has derivative L n t = i n w i y i ψ 1 λ i + w it. For a fixed unit vector u R N + and a fixed t R N +, consider the real-valued function of the real variable s, Hs := u Ln st = i n u w i y i ψ 1 λ i + sw it, which has derivatives Ḣs = i n u w i w itψ 2 λ i + sw it Ḧs = i n u w i w it 2 ψ 3 λ i + sw it.

7 FUNCTIONAL REGRESSION 7 Notice that H0 = u W n and Ḣ0 = u i n w iw i ψ2 λ i t = u t. Write M n for max i n w i. By virtue of Assumption ψ3, Ḧs i n u w i w it 2 ψ 2 λ i G sw it M n G M n st t i n w iw iψ 2 λ i t = M n G M n st t 2. By Taylor expansion, for some 0 < s < 1, H1 H0 Ḣ0 1 2 Ḧs 1 2 M ng M n t t 2. That is, 8 u Ln t W n + t 1 2 M ng M n t t 2. Approximation 8 will control the behavior of Ls := L n W n +su, a concave function of the real argument s, for each unit vector u. By concavity, the derivative is a decreasing function of s with Ls = u Ln W n + su = s + Rs Rs 1 2 M ng M n W n + su W n + su 2 On the set { W n N + /ɛ 2 } we have W n ± ɛ 1 u N + /ɛ 2 + ɛ 1. Thus implying M n W n ± ɛ 1 u ɛ 1ɛ 2 N + /ɛ 2 + ɛ 1 < 1, 2G1N + R±ɛ M ng1 W n ± ɛ 1 u 2 ɛ 1ɛ 2 N + /ɛ 2 + ɛ 2 1 G1N + ɛ ɛ 2 1ɛ 2 /N + < 5 8 ɛ 1. Deduce that Lɛ 1 = ɛ 1 + Rɛ ɛ 1 L ɛ 1 = ɛ 1 + R ɛ ɛ 1 The concave function s L n W n + su must achieve its maximum for some s in the interval [ ɛ 1, ɛ 1 ], for each unit vector u. It follows that t n W n ɛ 1.

8 8 COROLLARY 9. J n = nda n D Suppose ξ i = Dη i for some nonsingular matrix D, so that wherea n := 1 n If B n is another nonsingular matrix for which 10 A n B n 2 2 Bn and if i n η iη iψ 2 λ i. 11 max i n η i ɛ n/n + for some 0 < ɛ < 1 G1 32 Bn 1 2 then for each set of vectors κ 0,..., κ N in R N + there is a set Y κ,ɛ with QY c κ,ɛ < 2ɛ on which 0 j N κ jĝ γ 2 6 B 1 n 2 nɛ 0 j N D 1 κ j 2. Remark. For our applications of the Corollary in Sections 5 and 7, we need D = diagd 0, D 1,..., D N and κ j = e j, the unit vector with a 1 in its jth position, for j m and κ j = 0 for j > m. In our companion paper we will need the more general κ j s. and B 1 2 A n B n 2 1/2, which justifies PROOF. First we establish a bound on the spectral distance between A 1 n Define H = Bn 1 the expansion A 1 n Bn 1 2 = A n I. Then H 2 B 1 n I + H 1 I B 1 n 2 j 1 H k 2 B 1 n n. 2 B 1 n 2. As a consequence, A 1 n 2 2 Bn 1 2. Choose ɛ 1 = 1/2 and ɛ 2 = ɛ in Lemma 7. The bound on max i n η i gives the bound on max i n w i needed by the Lemma: n w i 2 = η idj n /n 1 Dη i = η ia 1 n η i A 1 n 2 η i 2. Define K j := Jn 1/2 κ j, so that κ j ĝ γ 2 2K j W n 2 + 2K j r n 2. By Cauchy-Schwarz, j K jr n 2 K j 2 r n 2 = U κ r n 2 j where U κ := j κ jj 1 n κ j = j n 1 D 1 κ j A 1 D 1 κ j 2n 1 B 1 n 2 j D 1 κ j 2. n

9 FUNCTIONAL REGRESSION 9 For the contribution V κ := j K j W n 2 the Cauchy-Schwarz bound is too crude. Instead, notice that QV κ = U κ, which ensures that the complement of the set Y κ,ɛ := { W n N + /ɛ} {V κ U κ /ɛ} has Q probability less that 2ɛ. On the set Y κ,ɛ, 0 j N κ jĝ γ 2 2V κ + 2U κ r n 2 3U κ /ɛ. The asserted bound follows. 5. Known Gaussian distribution. Initially we suppose that µ and K are known. We can then calculate all the eigenvalues θ k, the eigenfunctions φ k for K, and the coefficients z i,k := Z i, φ k for the expansion X i µ = Z i = k N z i,kφ k. The random variables z i,k are independent with z i,k N0, θ k. The random variables η i,k := z i,k / θ k are independent standard normals. Under Q n = Q n,a,b, the y i s are independent, with y i Q λi and λ i = a + X i, B = b 0 + k N z i,kb k where b 0 = a + µ, B. Our task is to estimate the b k s with sufficient accuracy to be able to estimate Bt = k N b kφ k t within an error of order ρ n = n 1 2β/α+2β. In fact it will suffice to estimate the component H m B of B in the subspace spanned by {φ 1,..., φ m } with m n 1/α+2β because 12 H mb 2 = k>m b2 k = O F m 1 2β = O F ρ n. We might try to estimate the coefficients b 0,..., b m by choosing ĝ = ĝ 0,..., ĝ m to maximize a conditional log likelihood over all g in R m+1, i n y iλ i,m ψλ i,m with λ i,m = g k m z i,kg k. To this end we might try to appeal to Corollary 9 in Section 4, with κ j equal to the unit vector with a 1 in its jth position for j m and κ j = 0 otherwise. That would give a bound for j m ĝ j γ j 2. Unfortunately, we cannot directly invoke the Corollary with N = m to estimate γ = b 0, b 1,..., b N when 13 Q = Q n,a,b and D = diag1, θ 1,..., θ N 1/2 ξ i = 1, z i,1,..., z i,n and η i = 1, η i,1,..., η i,n because λ i ξ i γ.

10 10 Remark. We could modify Corollary 9 to allow l i = ξ i γ + bias i, for a suitably small bias term, but at the cost of extra regularity conditions and a more delicate argument. The same difficulty arises whenever one investigates the asymptotics of maximum likelihood with the true distribution outside the model family. Instead, we use a two-stage estimation procedure that eliminates the bias term by a change of measure. Condition on the X i s. Consider an N much larger than m for which N n ζ with 2 + 2α 1 > ζ > α + 2β 1 1, Such a ζ exists because the assumptions α > 1 and β > α + 3/2 imply α + 2β 1 > 2 + 2α. Define ξ i, D, and η i as in equation 13. For Q use the probability measure Q n,a,b,n := i n Q λi,n with λ i,n := ξ i γ and γ = b 0, b 1,..., b N. Choose B n := P n,µ,k A n. Define X n = X Z,n X η,n X A,n, where X Z,n := {max i n Z i 2 C 0 log n} X η,n := {max i n η i 2 C 0 N log n} X A,n := { A n B n 2 2 B 1 n 2 1 } If we choose a large enough constant C 0 = C 0 F, Lemma 41 and its Corollary in Section 11 ensure that P n,µ,k X c Z,n 2/n and P n,µ,kx c η,n 2/n; and in subsection 5.1 we show that B 1 n 2 = O F 1 and P n,µ,k A n B n 2 2 = o F 1. Thus P n,µ,k X c n = o F 1. Moreover, on the set X n, inequality 10 holds by construction and inequality 11 holds for large enough n because max i n η i 2 O F N log n = o F n/n. Estimate γ by the ĝ = ĝ 0,..., ĝ N defined in Section 4. Then discard most of the estimates by defining B n := 1 k m ĝkφ k. For each realization of the X i s in X n, the Lemma gives a set Y m,ɛ with Q n,a,b,n Y c m,ɛ < 2ɛ on which 1 k m ĝ k γ k 2 = O F which implies 1 k m θ 1 k = O F m 1+α /n = O F ρ n, B n B 2 = 1 k m ĝ k γ k 2 + k>m b2 k = O F ρ n.

11 FUNCTIONAL REGRESSION 11 In replacing Q n,a,b by Q n,a,b,n we eliminate the bias problem but now we have to relate the probability bounds for Q n,a,b,n to bounds involving Q n,a,b. As we show in subsection 5.2, there exists a sequence of nonnegative constants c n of order o F log n, such that 17 Q n,a,b Q n,a,b,n 2 TV e 2cn i n λ i λ i,n 2 on X n. From this inequality it follows, for a large enough constant C ɛ, that P n,µ,k Q n,a,b { B n B 2 > C ɛ ρ n } P n,µ,k X c n + P n,µ,k X n Q n,a,b Q n,a,b,n TV + Q n,a,b,n Y c m,ɛ o F 1 + 2ɛ + e cn i n P n,µ,k λ i λ i,n 2 1/2. By construction, λ i λ i,n = k>n z i,kb k with the z i,k s independent and z i,k N0, θ k. Thus i n P n,µ,k λ i λ i,n 2 n k>n θ kb 2 k = O F nn 1 α 2β = o F e 2cn because ζ > α + 2β 1 1. That is, we have an estimator that achieves the O F ρ n minimax rate Approximation of A n. Throughout this subsection abbreviate P n,µ,k to P. Remember that where A n = n 1 i n η iη iψ 2 λ i,n with λ i,n = γ Dη i, γ = ā, b 1,..., b N D = diag1, θ 1,..., θ N η i = 1, η i,1,..., η i,n With B n = PA n, we need to show Bn 1 2 = O F 1 and P A n B n 2 2 = o F1. The matrix A n is an average of n independent random matrices each of which is distributed like NN ψ 2 γ DN, where N = N 0, N 1,..., N N with N 0 1 and the other N j s are independent N0, 1 s. Moreover, by rotational invariance of the spherical normal, we may assume with no loss of generality that γ DN = ā+κn 1, where κ 2 = N θ k=1 kb 2 k = O F 1.

12 12 Thus where B n = PNN ψ 2 ā + κn 1 = diagf, r 0 I N 1 [ ] r j := PN j r0 r 1 ψ2 ā + κn 1 and F = 1. r 1 r 2 The block diagonal form of B n simplifies calculation of spectral norms. B 1 n 2 = diagf 1, r 1 max 0 I N 1 2 F 1 2, r0 1 I r0 + r 2 N 1 2 max r 0 r 2 r 2 1, r0 1. Assumption ψ2 ensures that both r 0 and r 2 are O F 1. Continuity and strict positivity of ψ 2, together with max ā, κ = O F 1, ensure that c 0 := infā,κ inf x 1 ψ 2 ā + κx > 0. Thus +1 2πr0 c 0 e x2 /2 dx > 0 Similarly 2πr0 r 2 r 2 1 = 2πr 0 Pψ 2 ā + κn 1 N 1 r 1 /r c 0 r 0 x r 1 /r 0 2 e x2 /2 dx c 0 r 0 x 2 e x2 /2 dx. 1 It follows that B 1 n 2 = O F 1. The random matrix A n B n is an average of n independent random matrices each distributed like NN ψ 2 ā + κn 1 minus its expected value. Thus P A n B n 2 2 P A n B n 2 F = n 1 N var j N 0 j,k N k ψ 2 γ DN. Assumption ψ2 ensures that each summand is O F 1, which leaves us with a O F N 2 /n = o F 1 upper bound Total variation argument. To establish inequality 17 we use the bound Q n,a,b Q n,a,b 2 TV h 2 Q n,a,b, Q n,a,b i n h2 Q λi, Q λi,n By Lemma 4 h 2 Q λi, Q λi,n δ 2 i ψ 2 λ i 1 + δ i g δ i 1

13 FUNCTIONAL REGRESSION 13 where δ i = λ i λ i,n = Z i, B H N Z i, B = Z i, H NB Z i HNB O F N 1 2β log n = o F 1 Thus all the 1 + δ i g δ i factors can be bounded by a single O F 1 term. For a, B, µ, K FR, α, β and with the Z i s controlled by X n, λ i a + µ + Z i B C 2 log n for some constant C 2 = C 2 F. Assumption ψ2 then ensures that all the ψ 2 λ i are bounded by a single exp o F log n term. 6. Approximation of compact operators. Suppose T is a positive, self-adjoint compact operator on a Hilbert space H with eigenvectors {e k } and eigenvalues {θ k }. That is, T e i = θ i e i with θ 1 θ 2 0. For each x in H, T = k N θ ke k e k, a series that converges in operator norm. Let T be another positive, self-adjoint compact operator on H with corresponding representation T = θ k N k ẽ k ẽ k. Define := T T and δ =. The operator T also has a representation 18 T = j,k N T j,k e j e k. Note that T j,k = T j,k because T is self-adjoint. This representation gives = T j,k N j,k θ j {j = k} e j e k and 2 = sup x =1 x, x 2 j,k N T j,k θ j {j = k} 2. The last inequality will lend itself to the calculation of the expected value of 2 when T is random, leading to probabilistic bounds for δ.

14 14 In this section we collect some general consequences of δ being small. In the next section we draw probabilistic conclusions when T is random, for the special case where T = K and T = K, the usual estimate of the covariance kernel, both acting on H = L 2 m. The eigenvectors will become eigenfunctions φ 1, φ 2,... and φ 1, φ 2,.... We feel this approach makes it easier to follow the overall argument. Both {e j : j N} and {ẽ k : k N} are orthonormal bases for H. Define σ j,k := e j, ẽ k. Then e j = k N σ j,kẽ k and ẽ k = j N σ j,ke j and {j = j } = e j, e j = k N σ j,kσ j,k Approximation of eigenvalues. The eigenvalues have a variational characterization Bosq, 2000, Section 4.2: 19 θ j = inf sup{ x, T x : x L and x = 1}. diml<j The first infimum runs over all subspaces L with dimension at most j 1. When j equals 1 the only such subspace is. Both the infimum and the supremum are achieved: by L j 1 = span{e i : 1 i < j} and x = e j. Similar assertions hold for T and its eigenvalues. By the analog of 19 for T, θ j sup{ x, T x : x L j 1 and x = 1} sup{ x, T x δ : x L j 1 and x = 1} = θ j δ. Argue similarly with the roles of T and T reversed to conclude that 20 θ j θ j δ for all j N Approximation of eigenvectors. We cannot hope to find a useful bound on ẽ k e k, because there is no way to decide which of ±ẽ k should be approximating e k. However, we can bound f k, where f k = σ k ẽ k e k with σ k := sign σ k,k := which will be enough for our purposes. { +1 if σk,k 0 1 otherwise,

15 FUNCTIONAL REGRESSION 15 We also need to assume that the eigenvalue θ k is well separated from the other θ j s, to avoid the problem that the eigenspace of T for the eigenvalue θ k might have dimension greater than one. More precisely, we consider a k for which which implies ɛ k := min{ θ j θ k : j k} > 5δ, θ k θ j θ k θ j δ 4 5 θ k θ j 4 5 ɛ k. The starting point for our approximations is the equality 21 ẽ k, e j = T ẽ k, e j ẽ k, T e j = θ k θ j σ j,k. For j k we then have which implies θ k θ j 2 σ 2 j,k σ k ẽ k, e j 2 2 f k, e j e k, e j 2, σ 2 j,k 25 8 f k, e j 2 /ɛ 2 k + 2 T 2 j,k/θ k θ j 2 because T e k, e j = 0 for j k. To simplify notation, write j for j N {j k}. The introduction of the σ k also ensures that f k 2 = e k 2 + ẽ k 2 2σ k e k, ẽ k = 2 2 σ k,k 2 2σ 2 k,k because σ k,k 1 = 2 j σ2 j,k j 25 4 f k, e j 2 /ɛ 2 k j The first sum on the right-hand side is less than T 2 j,k/θ k θ j f k 2 /ɛ 2 k 2 f k 2 /4δ 2 f k 2 /4. The second sum can be written as 25 Λ k 2 /4 for Λ k := { T j,k /θ k θ j if j k Λ k,j e j with Λ k,j := j N 0 if j = k. Our bound for f k 2 with an untidy 25/3 increased to 9 then takes the convenient form 22 f k 2 9 Λ k 2 if ɛ k > 5.

16 16 For our applications, P Λ k 2 will be of order Ok 2 /n. When δ is much smaller than ɛ k we can get an even better approximation for f k itself. Start once more from equality 21, still assuming that ɛ k > 5δ. For j k, σ k σ j,k = σ k ẽ k, e j / θ k θ j = e k + f k, e j /θ k + γ k θ j where γ k = θ k θ k = Λ k,j 1 γ 1 k + f k, e j θ j θ k θ k θ j because T e k, e j = 0 = Λ k,j + r k,j where r k,j := θ k θ k θ j θ k Λ k,j + f k, e j θ k θ j. The r k,j s are small: r k,j 5 δ Λk,j + f k, e j 4 θ k θ j 23 5δ Λ k θ k θ j by inequality 22. for j k, if ɛ k > 5δ Define r k,k = σ k,k 1 = 1 2 f k 2 and r k = j N r k,je j. We then have a representation cf. Hall and Hosseini-Nasab, 2006, equation 2.8 and Cai and Hall, 2006, f k = σ k ẽ k e k = σ k ẽ k, e k 1 e k + j σ kσ j,k e j = Λ k + r k Approximation of projections. The operator H J = k J e k e k projects elements of H orthogonally onto span{e k : k J}; the operator H J = k J ẽk ẽ k projects elements of H orthogonally onto span{e k : k J}. We will be interested in the case J = {1, 2,..., p} with p equal to either the m or the N from Section 5. In that case, we also write H p and H p for the projection operators. In this subsection we establish a bound for H J B H J B for a B = j b je j in H. The difference H J H J equals k J σ kẽ k σ k ẽ k e k e k = k J σ kẽ k r k + k J e k + f k Λ k + k J e k + Λ k + r k e k e k e k = R J + k J e k Λ k + Λ k e k where R J := k J σ kẽ k r k + f k Λ k + r k e k.

17 FUNCTIONAL REGRESSION 17 Self-adjointness of T implies T j,k = T k,j and hence Λ j,k = Λ j,k. The antisymmetry eliminates some terms from the main contribution to H J H J : e k J k Λ k + Λ k e k = Λ k J j J c k,j e k e j + e j e k. With this simplification we get the following bound for H J H J B 2 : 3 e k J k Λ j J k,jb c j e j J c j Λ k J k,jb k R J B 2 The first two sums contribute 3 k J j J c Λ k,jb j j J c k J Λ k,jb k 2 In the next section the expected value of both sums will simplify because PΛ k,j Λ k,j will be zero if j j. For the three contributions to the bound for R J B 2 we make repeated use of the inequality, based on equations 22 and 23, r k, x 81 2 Λ k 2 x k + 5δ Λ k x j j θ k θ j, which is valid whenever ɛ k > 5δ. To avoid an unnecessary calculation of precise constants, we adopt the convention of the variable constant: we write C for a universal constant whose value might change from one line to the next. The first two contributions are: k J σ kẽ k r k, B 2 = k J r k, B 2 and C k J b2 k Λ k 4 + Cδ 2 k J Λ k 2 j f k J k Λ k, B 2 f k J k Λ k, B C k J Λ k 2 2 k J j Λ k,jb j 2. 2 b j θ k θ j For the third contribution, let x = j x je j be an arbitrary unit vector in H. Then 2 2 r k J k e k B, x = b k J k r k, x C b k J kx k Λ k Cδ 2 Λ 2 x j k J k b k j θ k θ j C k J b k Λ k Cδ 2 k J Λ k 2 k J b2 k j 1 θ k θ j 2

18 18 take the supremum over x, which doesn t even appear in the last line, to get the same bound for k J b kr k 2. In summary: if min k J ɛ k > 5δ then H J H J B 2 is bounded by a universal constant times Λ k J k,jb k + k J b2 k Λ k 4 25 k J j J c Λ k,jb j j J c + k J b k Λ k k J Λ k 2 k J + δ 2 2 Λ k J k 2 b j j + δ 2 θ k θ j k J Λ k 2 k J b2 k j 2 1. θ k θ j j Λ k,jb j 7. Unknown Gaussian distribution. When µ and K are unknown, we estimate them in the usual way: µ n t = X n t = n 1 i n X it and Ks, t = n 1 1 X i s X n s X i t X n t i n = n 1 1 Z i s Zs Z i t Zt, i n which has spectral representation Ks, t = θ φ k N k k s φ k t. In fact we must have θ k = 0 for k n because all the eigenfunctions φ k corresponding to nonzero θ k s must lie in the n 1-dimensional space spanned by {Z i Z : i = 1, 2,..., n}. The construction and analysis of the new estimator B will parallel the method developed in Section 5 for the case of known K and µ. The quantities m and N are the same as before. We write H p for p = N or p = m for the operator that projects orthogonally onto span{ φ 1,..., φ p }. Essentially we have only to estimate all the quantities that appeared in the previous proof then show that none of the errors of estimation is large enough to upset analogs of the calculations from Section 5. There is a slight complication caused by the fact that we do not know which of ± φ j should be used to approximate φ j. At strategic moments we will be forced to multiply by the matrix S := diagσ 0,..., σ N with σ 0 = 1 and σ k = sign φ k, φ k for k 1. The results from Section 6 will control the difference f k := σ φ k k φ k. The other key quantities are: i := K K 2

19 ii D = diag1, θ 1,..., θ N 1/2 iii z i = z i,1,..., z i,n where z i,k = Z i, φ k FUNCTIONAL REGRESSION 19 iv z = z 1,..., z N where z k = Z, φ k = n 1 i n z i,k v ξ i = 1, z i z and η i = D 1 ξ i. [We could define η i = D 1 ξ i but then we would need to show that D 1 ξ i D 1 ξ i. Our definition merely rearranges the approximation steps.] vi γ := γ 0, b 1,..., b N where B = b φ k N k k and γ 0 := a + B, X. [Note that λ i = γ 0 + B, Z i Z.] vii λ i,n = γ 0 + H N B, Z i Z = ξ i γ. viii ĝ = argmax g R N+1 i n y i ξ i g ψ ξ i g and B = 1 k m ĝk φ k. [Note that these two quantities differ from the ĝ and B in Section 5.] ix Ãn = n 1 i n η i η i ψ2 λ i,n The use of estimated quantities has one simplifying consequence: Z i t Zt = z k N i,k z k φ k t so that θ k {j = k} = Ks, t φ j s φ k t ds dt which implies n 1 1 i n z i z i = D 2 and = n 1 1 i n z i,j z j z i,k z k, 26 n 1 1 i n η i η i = D 1 D 2 D 1 := diag1, θ 1 /θ 1,..., θ N /θ N. We will analyze K by rewriting it using the eigenfunctions for K. Remember that z i,j = Z i, φ j and the standardized variables η i,j = z i,j / θ j are independent N0, 1 s. Define z j = Z, φ j and η j = n 1 i n η i,j and C j,k := n 1 1 i n η i,j η j η i,k η k, a sample covariance between two independent N0, I N random vectors. Then Z i t Zt = j N z i,j z j φ j t = j N θ j η i,j η j φ j t

20 20 and 27 Ks, t = j,k N K j,k φ j sφ k t with K j,k = θ j θ k C j,k Moreover, as shown in Section 6, the main contribution to f k = σ k φ k φ k is Λ k := j N Λ k,jφ j with Λ k,j := { θj θ k C j,k /θ k θ j if j k 0 if j = k. In fact, most of the inequalities that we need to study the new B come from simple moment bounds Lemma 31 for the sample covariances C j,k and the derived bounds Lemma 32 for the Λ k s. As before, most of the analysis will be conditional on the X i s lying in a set with high probability on which the various estimators and other random quantities are well behaved. LEMMA 28. For each ɛ > 0 there exists a set X ɛ,n, depending on µ and K, with sup F P X c n,µ,k ɛ,n < ɛ for all large enough n and on which, for some constant C ɛ that does not depend on µ or K, i C ɛ n 1/2 ii max i n Z i C ɛ log n and Z Cɛ n 1/2 iii H m H m B 2 = o F ρ n iv H N H N B 2 = O F n 1 ν for some ν > 0 that depends only on α and β v max i n η i 2 = o F n/n vi SÃn S A n 2 = o F 1 This Lemma whose proof appears in Section 8 contains everything we need to show that B B 2 has the uniform O F ρ n rate of convergence in P n,f probability, as asserted by equation 3. In what follows, all assertions refer to the numbered parts of Lemma 28. As before, the component of B orthogonal to span{ φ 1,..., φ m } causes no trouble because B B 2 = ĝ γ H mb 2 and, by iii, H mb 2 2 H mb H m H m B 2 = O F ρ n on X ɛ,n.

21 FUNCTIONAL REGRESSION 21 To handle ĝ γ 2, invoke Corollary 9 for X i s in X ɛ,n, with η i replaced by η i and A n replaced by Ãn and B n replaced by B n = SB n S, the same B n and D as before, and Q equal to Q n,a,b,n = i n Q λi,n. to get a set Ỹm,ɛ with Q n,a,b,nỹc m,ɛ < 2ɛ on which ĝ γ 2 2 = O Fρ n. The conditions of the Corollary are satisfied on X ɛ,n, because of v and Ãn B n 2 Ãn SA n S 2 + SA n S SB n S 2 = o F 1. To complete the proof it suffices to show that Q n,a,b,n Q n,a,b,n TV tends to zero. First note that λ i,n λ i,n = a + B, X + H N B, Z i Z a B, µ H N B, Z i which implies that, on X ɛ,n, = H NB, Z H NB, Z + H NB, Z + H N B H N B, Z i λ i,n λ i,n 2 2 HNB, Z H 2 N B H N B 2 Z i + Z O F N 1 2β Cɛ 2 n 1 + O F n 1 ν Cɛ 2 n 1/2 + log n 2 29 = O F n 1 ν for some 0 < ν < ν. Now argue as in subsection 5.2: on X ɛ,n, Q n,a,b,n Q n,a,b,n 2 TV i n h2 Q λi,n, Q λi,n exp o F log n i n λ i,n λ i,n 2 = o F 1. Finish the argument as before, by splitting into contributions from X c n and X n Ỹc m,ɛ and X n Ỹm,ɛ. 8. Proofs of unproven assertions from Section 7. Many of the inequalities in this section involve sums of functions of the θ j s. The following result will save us a lot of repetition. To simplify the notation, we drop the subscripts from P n,µ,k. LEMMA 30. i For each r 1 there is a constant C r = C r F for which κ k r, γ := j γ {j k} j N θ j θ k r C r 1 + k r1+α γ if r > 1 C k 1+α γ log k if r = 1

22 22 ii For each p, k p j>p k α 2β j α θ k θ j 2 = O F p 1 α PROOF. For i, argue in the same way as Hall and Horowitz 2007, page 85, using the lower bounds c α j α if j < k/2 θ j θ k c α j k k α 1 if k/2 j 2k c α k α if j > 2k where c α is a positive constant. For ii, split the range of summation into two subsets: {k, j : j > maxp, 2k} and {k, j : p/2 < k p < j 2k}. The first subset contributes at most k p k α 2β j>maxp,2k j α c α k α 2 = O F p 1 α because α 2β < 3. The second subset contributes at most p/2<k p k α 2β c 2 α k 2α+2 j>p j α j k 2 = O F p.p 2+α 2β p α O1, which is of order o F p α. The distribution of C j,k does not depend on the parameters of our model. Indeed, by the usual rotation of axes we can rewrite n 1C j,k as U j U k, where U 1, U 2,... are independent N0, I n 1 random vectors. This representation gives some useful equalities and bounds. LEMMA 31. Uniformly over distinct j, k, l, i PC j,j = 1 and P C j,j 1 2 = 2n 1 1 ii PC j,k = PC j,k C j,l = 0 iii PC 2 j,k = On 1 iv PC 2 j,k C2 l,k = ton 2 v PC 4 j,k = On 2 PROOF. Assertion i is classical because U j 2 χ 2 n 1. For assertion ii use PU 1 U 2 U 2 = 0 and PU 1U 2 U 2U 3 U 2 = trace U 2 U 2PU 3 U 1 = 0. For iii use PU 1 U 1 = I n 1 and PU 1U 2 U 2U 1 U 2 = trace U 2 U 2PU 1 U 1 = traceu 2 U 2 = U 2 2.

23 For iv use P U 2 4 = n 2 1 and FUNCTIONAL REGRESSION 23 PU 1U 2 2 U 3U 2 2 U 2 = U 2 4 For v, check that the coefficient of t 4 in the Taylor expansion of P exptu 1U 2 = P exp 1 2 t2 U 1 2 = 1 t 2 n 1/2 is of order n 2. LEMMA 32. Uniformly over distinct j, k, l, i PΛ k,j = PΛ k,j Λ k,l = 0 ii PΛ 2 k,j = O F n 1 k α j α θ k θ j 2 iii PΛ 4 k,j = O F n 2 k 2α j 2α θ k θ j 4 iv P Λ k 2 = O F n 1 k 2 v P Λ k 4 = O F n 2 k 4 PROOF. Assertions i, ii, and iii follow from Assertions ii and iii of Lemma 31. For iv, note that For v note that P Λ k 2 = j PΛ2 j,k = O F n 1 k α κ k 2, α 2 P Λ k 4 = P θ jθ j k Sj,kθ 2 k θ j 2 = θ jθ j l l θkθ 2 k θ j 2 θ k θ l 2 PSj,kS 2 l,k 2 = O F n 2 = O F n 2 k 4. j θ jθ k θ k θ j 2 To prove Lemma 28 we define X ɛ,n as an intersection of sets chosen to make the six assertions of the Lemma hold, 2 X ɛ,n := X,n X Z,n X Λ,n X η,n X A,n, where the complement of each of the five sets appearing on the right-hand side has probability less than ɛ/5. More specifically, for a large enough constant C ɛ, we

24 24 define X,n = { C ɛ n 1/2 } X Z,n = {max i n Z i 2 C ɛ log n and Z C ɛ n 1/2 } X η,n = {max i n η i 2 C ɛ N log n} as in Section 5 X A,n = { η i η i n i 2 C ɛ n} The definition of X Λ,n, in subsection 8.3, is slightly more complicated. It is defined by requiring various functions of the Λ k s to be smaller than C ɛ times their expected values. The set X A,n is almost redundant. From Definition 5 we know that min θ 1 j<j j θ j α/rn 1 α and min θ j R 1 N α. N 1 j N The choice N n ζ with ζ < 2+2α 1 ensures that n 1/2 N 1 α. On X,n the spacing assumption used in Section 6 holds for all n large enough; all the bounds from that Section are avaiable to us on X ɛ,n. In particular, max j N θ j /θ j 1 O F N α = o F 1. Equality 26 shows that X A,n X,n eventually if we make sure C ɛ > Proof of Lemma 28 part i. Observe that P 2 = 2 K P j,k j,k θ j {j = k} = j,k θ jθ k P S j,k {j = k} 2 j θ jo F n 1 + j,k θ jθ k O F n 2 = O F n Proof of Lemma 28 part ii. As before, Corollary 42 controls max i n Z i 2. To control the Z contribution, note that n Z 2 has the same distribution as Z 1 2, which has expected value j N θ j <.

25 FUNCTIONAL REGRESSION Proof of Lemma 28 parts iii and iv. Calculate expected values for all the terms that appear in the bound 25 from Section and P n,µ,k k p = k p Λ j>p k,jb j P n,µ,kλ 2 j>p k,j 2 + P n,µ,k b 2 j + b 2 k j>p = O F n 1 k p j>p k α 2β j α θ k θ j 2 = O F n 1 p 1 α by Lemma 30 P n,µ,k k p b2 k Λ k 4 = O F n 2 k p k4 2β = O F n 2 and P n,µ,k k p b k Λ k 2 = O F n 1 k J k2 β = O F n 1 k p Λ k,jb k 2 by Lemma 32i 1 + p 5 2β + log p 1 + p 3 β + log p and P n,µ,k k p Λ k 2 = O F n 1 p 3 and P n,µ,k 2 = O F n 1 Λ k p j k,jb j k p 34 = O F n 1 by Lemma 30 and j k α j a 2β θ k θ j 2 35 δ 2 2 P n,µ,k Λ k p k 2 b j j θ k θ j = O F n 1 δ 2 p 3 + p 5+2α 2β log 2 p and 36 k p b2 k j 2 1 = O F 1 + p 3+2α 2β log 2 p by Lemma 30. θ k θ j For some constant C ɛ = C ɛ F, on a set X Λ,n with P n,µ,k X c Λ,n < ɛ, each of the random quantities in the previous set of inequalities for both p = m and p = N is bounded by C ɛ times its P n,µ,k expected value. By virtue of Lemma 32iv, we may also assume that Λ k 2 C ɛ k 2 /n on X Λ,n.

26 26 From inequality 25, it follows that on the set X,n X Λ,n, for both p = m and p = N, H p H p B 2 O F n 1 p 1 α + O F n p 5 2β + log p + p 6 β + log 2 p + O F n 1 p 3 O F n 1 + O F n 2 p 3 + p 5+2α 2β log 2 p + O F n 2 p 3 O F 1 + p 3+2α 2β log 2 p = O F n 1 p 1 α if p N. This inequality leads to the asserted conclusions when p = m or p = N Proof of Lemma 28 part v. By construction, η i1 = 1 for every i and, for j 2, θ j η i,j = z i,j z j = Z i Z, φ j Thus, for j 2, σ j η i,j = θ 1/2 j Z i Z, φ j + f j = η i,j + δ i,j with δ i,j 2 θj 1 2 j Z i + Z fj 2 2+α log n O F n on X ɛ,n. In vector form, 37 S η i = η i + δ i with δ i 2 = O F N 3+α log n n It follows that o F n/n 2 on X ɛ,n. max i n η i = max i n S η i max i n η i +o F n/n = O F n/n on X ɛ,n Proof of Lemma 28 part vi. From inequality 29 we know that ɛ N := max i n λ i,n λ i,n = O F n 1+ν /2 on X ɛ,n and from subsection 5.2 we have max i n λ i,n = O F log n. Assumption ψ3 in Section 3 and the Mean-Value theorem then give max i n ψ 2 λ i,n ψ 2 λ i,n ɛ N ψ 2 λ i,n Gɛ N = o F 1.

27 FUNCTIONAL REGRESSION 27 If we replace ψ 2 λ i,n in the definition of Ãn by L i := ψ 2 λ i,n we make a change Γ with Γ 2 o F 1 n 1 1 i n η i η i 2, which, by equality 26, is of order o F 1 on X ɛ,n. From Assumption ψ2 we have c n := log max i n L i = o F log n. Uniformly over all unit vectors u in R N+1 we therefore have u SÃn Su = o F 1 + n 1 1 L iu η i + δ i η i + δ i u i n = o F On 1 u A n u + O F n 1 i n L i u δ i 2 + 2u η i u δ i Rearrange then take a supremum over u to conclude that SÃn S A n 2 o F 1 + O F e cn max i n δ i δ i η i Representation 37 and the defining property of X η,n then ensure that the upper bound is of order o F 1 on X ɛ,n. 9. The minimax lower bound. We will apply a slight variation on Assouad s Lemma combining ideas from Yu 1997 and from van der Vaart 1998, Section 24.3 to establish inequality 2. We consider behavior only for µ = 0 and a = 0, for a fixed K with spectral decomposition j N θ jφ j φ j. For simplicity we abbreviate P n,0,k to P. Let J = {m + 1, m + 2,..., 2m} and Γ = {0, 1} J. Let β j = Rj β. For each γ in Γ define B γ = ɛ j J γ jβ j φ j, for a small ɛ > 0 to be specified, and write Q γ for the product measure i n Q λi γ with λ i γ = B γ, Z i = ɛ j J γ jβ j z i,j. For each j let Γ j = {γ Γ : γ j = 1} and let ψ j be the bijection on Γ that flips the jth coordinate but leaves all other coordinates unchanged. Let π be the uniform distribution on Γ, that is, π γ = 2 m for each γ. For each estimator B = b j N j φ j we have B γ B 2 j J γ j β j b 2 j and so sup P n,f B B 2 π γ F γ Γ 38 = 2 m j J 2 m j J ɛγ j β j b j 2 PQ γ j J P Q γ ɛβ j b j 2 + Q γ Γ ψj j γ0 b j 2 γ Γ j 1 4 ɛβ j 2 P Q γ Q ψj γ,

28 28 the last lower bound coming from the fact that ɛβ j b j b j ɛβ j 2 for all b j. We assert that, if ɛ is chosen appropriately, 39 min j,γ P Q γ Q ψj γ stays bounded away from zero as n, which will ensure that the lower bound in 38 is eventually larger than a constant multiple of j J β2 j cρ n for some constant c > 0. Inequality 2 will then follow. To prove 39, consider a γ in Γ and the corresponding γ = ψ j γ. By virtue of the inequality Q γ Q γ = 1 Q γ Q γ TV 1 2 1/2 i n h2 Q λi γ, Q λi γ it is enough to show that 40 lim sup n max j,γ P 2 i n h2 Q λi γ, Q λi γ < 1. Define X n = {max i n Z i 2 C 0 log n}, with the constant C 0 large enough that PX c n = o 1. On X n we have and, by inequality 4, λ i γ 2 j J β2 j Z i 2 = Oρ n log n = o1 h 2 Q λi γ, Q λi γ O F 1 λ i γ λ i γ 2 ɛ 2 O F 1β 2 j z 2 i,j. We deduce that P 2 i n h2 Q λi γ, Q λi γ 2PX c n + i n ɛ2 O F 1βj 2 PX n zi,j 2 o1 + ɛ 2 O1nβ 2 j θ j. The choice of J makes β 2 j θ j R 2 m α 2β R 2 /n. Assertion 40 follows. 10. Hellinger distances in an exponential family. We need to show that h 2 Q λ, Q λ+δ δ 2 ψ 2 λ 1 + δ G δ for all real λ and δ. Temporarily write λ for λ + δ and λ for λ + λ /2 = λ + δ/ h2 Q λ, Q λ = f λ yf λ y = = exp exp λy 1 2 ψλ 1 2 ψλ ψλ 1 2 ψλ 1 2 ψλ 1 + ψλ 1 2 ψλ 1 2 ψλ

29 That is, FUNCTIONAL REGRESSION 29 h 2 Q λ, Q λ ψλ + ψλ + δ 2ψλ + δ/2. By Taylor expansion in δ around 0, the right-hand side is less than 1 4 δ2 ψ 2 λ δ3 ψ 3 λ + δ 1 8 ψ3 λ δ /2 where 0 < δ < δ. Invoke inequality 3 twice to bound the coefficient of δ 3 /6 in absolute value by ψ 2 λ G δ G δ /2 9 8 ψ2 λg δ. The stated bound simplifies some unimportant constants. 11. Bounds for Gaussian processes. As a consequence of defining property K, the centered process Z := X µ has an expansion Zt = k N θk η k φ k t where the η k s are independent N0, 1 s, implying Z 2 = θk θ k,k N k η k η k φ k tφ k s dt ds = θ k N kηk. 2 LEMMA 41. Suppose W i = k N τ i,kηi,k 2 for i = 1,..., n, where the η i,k s are independent standard normals and the τ i,k s are nonnegative constants with > T := max i n k N τ i,k. Then P{max i n W i > 4T log n + x} < 2e x for each x 0. PROOF. Without loss of generality suppose T = 1. For s = 1/4, note that P expsw i = 1 2sτ k N i,k 1/2 exp e 1/4 k N sτ i,k by virtue of the inequality log1 t 2t for t 1/2. With the same s, it then follows that P{max i n W i > 4log n + x} exp 4slog n + x P exp max i n sw i e x 1 n P expsw i. i n The 2 is just a clean upper bound for e 1/4. COROLLARY 42. where C = 4C k N k α <. P n {max i n Z i 2 > C log n + x} 2e x

30 30 References. BIRMAN, M. S. and SOLOMJAK, M. Z Spectral Theory of Self-Adjoint Operators in Hilbert Space. Reidel Translation from the 1980 Russian edition. BOSQ, D Linear Processes in Function Spaces: Theory and Applications. Lecture Notes in Statistics 149. Springer-Verlag. CAI, T. T. and HALL, P Prediction in functional linear regression. Annals of Statistics HALL, P. and HOROWITZ, J. L Methodology and convergence rates for functional linear regression. Annals of Statistics HALL, P. and HOSSEINI-NASAB, M On properties of functional principal components analysis. J. R. Statist. Soc. B HJORT, N. L. and POLLARD, D Asymptotics for minimisers of convex processes Technical Report, Yale University. PORTNOY, S Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. Annals of Statistics VAN DER VAART, A. W Asymptotic Statistics. Cambridge University Press. YU, B Assouad, Fano, and Le Cam. In A Festschrift for Lucien Le Cam D. Pollard, E. Torgersen and G. L. Yang, eds Springer-Verlag, New York. STATISTICS DEPARTMENT YALE UNIVERSITY Wei.Dou@yale.edu David.Pollard@yale.edu Huibin.Zhou@yale.edu URL:

5.1 Approximation of convex functions

5.1 Approximation of convex functions Chapter 5 Convexity Very rough draft 5.1 Approximation of convex functions Convexity::approximation bound Expand Convexity simplifies arguments from Chapter 3. Reasons: local minima are global minima and

More information

A linear algebra proof of the fundamental theorem of algebra

A linear algebra proof of the fundamental theorem of algebra A linear algebra proof of the fundamental theorem of algebra Andrés E. Caicedo May 18, 2010 Abstract We present a recent proof due to Harm Derksen, that any linear operator in a complex finite dimensional

More information

A linear algebra proof of the fundamental theorem of algebra

A linear algebra proof of the fundamental theorem of algebra A linear algebra proof of the fundamental theorem of algebra Andrés E. Caicedo May 18, 2010 Abstract We present a recent proof due to Harm Derksen, that any linear operator in a complex finite dimensional

More information

SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS

SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS SPECTRAL THEOREM FOR COMPACT SELF-ADJOINT OPERATORS G. RAMESH Contents Introduction 1 1. Bounded Operators 1 1.3. Examples 3 2. Compact Operators 5 2.1. Properties 6 3. The Spectral Theorem 9 3.3. Self-adjoint

More information

CONDITIONED POISSON DISTRIBUTIONS AND THE CONCENTRATION OF CHROMATIC NUMBERS

CONDITIONED POISSON DISTRIBUTIONS AND THE CONCENTRATION OF CHROMATIC NUMBERS CONDITIONED POISSON DISTRIBUTIONS AND THE CONCENTRATION OF CHROMATIC NUMBERS JOHN HARTIGAN, DAVID POLLARD AND SEKHAR TATIKONDA YALE UNIVERSITY ABSTRACT. The paper provides a simpler method for proving

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

Random Bernstein-Markov factors

Random Bernstein-Markov factors Random Bernstein-Markov factors Igor Pritsker and Koushik Ramachandran October 20, 208 Abstract For a polynomial P n of degree n, Bernstein s inequality states that P n n P n for all L p norms on the unit

More information

Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data

Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data Supplementary Material for Nonparametric Operator-Regularized Covariance Function Estimation for Functional Data Raymond K. W. Wong Department of Statistics, Texas A&M University Xiaoke Zhang Department

More information

Optimization Theory. A Concise Introduction. Jiongmin Yong

Optimization Theory. A Concise Introduction. Jiongmin Yong October 11, 017 16:5 ws-book9x6 Book Title Optimization Theory 017-08-Lecture Notes page 1 1 Optimization Theory A Concise Introduction Jiongmin Yong Optimization Theory 017-08-Lecture Notes page Optimization

More information

5 Compact linear operators

5 Compact linear operators 5 Compact linear operators One of the most important results of Linear Algebra is that for every selfadjoint linear map A on a finite-dimensional space, there exists a basis consisting of eigenvectors.

More information

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation

Statistics 612: L p spaces, metrics on spaces of probabilites, and connections to estimation Statistics 62: L p spaces, metrics on spaces of probabilites, and connections to estimation Moulinath Banerjee December 6, 2006 L p spaces and Hilbert spaces We first formally define L p spaces. Consider

More information

Submitted to the Brazilian Journal of Probability and Statistics

Submitted to the Brazilian Journal of Probability and Statistics Submitted to the Brazilian Journal of Probability and Statistics Multivariate normal approximation of the maximum likelihood estimator via the delta method Andreas Anastasiou a and Robert E. Gaunt b a

More information

In particular, if A is a square matrix and λ is one of its eigenvalues, then we can find a non-zero column vector X with

In particular, if A is a square matrix and λ is one of its eigenvalues, then we can find a non-zero column vector X with Appendix: Matrix Estimates and the Perron-Frobenius Theorem. This Appendix will first present some well known estimates. For any m n matrix A = [a ij ] over the real or complex numbers, it will be convenient

More information

08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms

08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms (February 24, 2017) 08a. Operators on Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/real/notes 2016-17/08a-ops

More information

arxiv: v5 [math.na] 16 Nov 2017

arxiv: v5 [math.na] 16 Nov 2017 RANDOM PERTURBATION OF LOW RANK MATRICES: IMPROVING CLASSICAL BOUNDS arxiv:3.657v5 [math.na] 6 Nov 07 SEAN O ROURKE, VAN VU, AND KE WANG Abstract. Matrix perturbation inequalities, such as Weyl s theorem

More information

here, this space is in fact infinite-dimensional, so t σ ess. Exercise Let T B(H) be a self-adjoint operator on an infinitedimensional

here, this space is in fact infinite-dimensional, so t σ ess. Exercise Let T B(H) be a self-adjoint operator on an infinitedimensional 15. Perturbations by compact operators In this chapter, we study the stability (or lack thereof) of various spectral properties under small perturbations. Here s the type of situation we have in mind:

More information

Minimax lower bounds I

Minimax lower bounds I Minimax lower bounds I Kyoung Hee Kim Sungshin University 1 Preliminaries 2 General strategy 3 Le Cam, 1973 4 Assouad, 1983 5 Appendix Setting Family of probability measures {P θ : θ Θ} on a sigma field

More information

Linear Algebra I. Ronald van Luijk, 2015

Linear Algebra I. Ronald van Luijk, 2015 Linear Algebra I Ronald van Luijk, 2015 With many parts from Linear Algebra I by Michael Stoll, 2007 Contents Dependencies among sections 3 Chapter 1. Euclidean space: lines and hyperplanes 5 1.1. Definition

More information

Global Maxwellians over All Space and Their Relation to Conserved Quantites of Classical Kinetic Equations

Global Maxwellians over All Space and Their Relation to Conserved Quantites of Classical Kinetic Equations Global Maxwellians over All Space and Their Relation to Conserved Quantites of Classical Kinetic Equations C. David Levermore Department of Mathematics and Institute for Physical Science and Technology

More information

Empirical Processes: General Weak Convergence Theory

Empirical Processes: General Weak Convergence Theory Empirical Processes: General Weak Convergence Theory Moulinath Banerjee May 18, 2010 1 Extended Weak Convergence The lack of measurability of the empirical process with respect to the sigma-field generated

More information

Mathematics Department Stanford University Math 61CM/DM Inner products

Mathematics Department Stanford University Math 61CM/DM Inner products Mathematics Department Stanford University Math 61CM/DM Inner products Recall the definition of an inner product space; see Appendix A.8 of the textbook. Definition 1 An inner product space V is a vector

More information

Hilbert Spaces. Hilbert space is a vector space with some extra structure. We start with formal (axiomatic) definition of a vector space.

Hilbert Spaces. Hilbert space is a vector space with some extra structure. We start with formal (axiomatic) definition of a vector space. Hilbert Spaces Hilbert space is a vector space with some extra structure. We start with formal (axiomatic) definition of a vector space. Vector Space. Vector space, ν, over the field of complex numbers,

More information

AN ELEMENTARY PROOF OF THE SPECTRAL RADIUS FORMULA FOR MATRICES

AN ELEMENTARY PROOF OF THE SPECTRAL RADIUS FORMULA FOR MATRICES AN ELEMENTARY PROOF OF THE SPECTRAL RADIUS FORMULA FOR MATRICES JOEL A. TROPP Abstract. We present an elementary proof that the spectral radius of a matrix A may be obtained using the formula ρ(a) lim

More information

The Codimension of the Zeros of a Stable Process in Random Scenery

The Codimension of the Zeros of a Stable Process in Random Scenery The Codimension of the Zeros of a Stable Process in Random Scenery Davar Khoshnevisan The University of Utah, Department of Mathematics Salt Lake City, UT 84105 0090, U.S.A. davar@math.utah.edu http://www.math.utah.edu/~davar

More information

On prediction and density estimation Peter McCullagh University of Chicago December 2004
