Functional Analysis of Variance


Applied Mathematical Sciences, Vol. 2, 2008, no. 23, 1115-1129

Functional Analysis of Variance

Abdelhak Zoglat
Faculté des Sciences, Département de Mathématiques et Informatique
B.P. 1014 Rabat, Morocco
a zoglat@yahoo.fr

Abstract

In this paper we extend some statistical techniques of MANOVA, where the data are vectors in $\mathbb{R}^d$, to the case where the observations are real functions. We consider the model
$$Y(t) = x'\beta(t) + \sigma\epsilon(t), \qquad t \in [0,1],$$
where $x = (x_1,\dots,x_p)' \in \mathbb{R}^p$ is known, $\sigma > 0$ is an unknown real parameter, and $\beta = (\beta_1,\dots,\beta_p)' \in L_2^p$ is a vector of unknown square integrable functions. The vector $\beta$ is estimated and its estimator $\hat\beta$ is shown to have some optimal properties. We also develop a test statistic for testing $H_0: K\beta = C$ versus $H_1: K\beta \neq C$. Our results are natural extensions of the classical ones.

Mathematics Subject Classification: 62H12, 62H15, 62J05

Keywords: Linear models, Analysis of Variance, Gaussian process, Least Mean Square Estimator, Maximum Likelihood Estimator, Testing Linear Hypothesis

1 Introduction

Multivariate analysis of variance (MANOVA) is a well established statistical technique for analyzing data where the measurements depend on several factors. It mainly deals with data where each observation can be represented by a finite dimensional vector of real numbers. With modern technology for data acquisition, functional data arise naturally in many applied contexts. They can also result from finite sets of observations to which smoothing and interpolation procedures have been applied. This could be the case in situations where a functional approach is quite natural. Examples of such situations (see [7]) are averages of temperatures or precipitation levels at given regions, where data are available continuously in time but are collected on a daily or monthly basis.

Statistical procedures such as regression analysis for functional data appeared in the early sixties (see e.g. [6]) in the context of time series analysis. The interest of many researchers in functional data ([7] and the references therein) has since become considerable. This is explained by the wide range of applications involving observations which are best represented by functions. However, despite their importance, statistical inference issues for functional data have not been fully explored. The prominent place that linear models occupy in both theory and applications suggests that an extension to the functional case merits some attention.

For functional data where the measurements depend on several factors, a user who wishes to estimate the factors' effects, or to test for their presence using the current techniques, would have to discretize each data point and summarize it into a few values. This might result in losing some potentially useful information. Our aim in this paper is to develop procedures which permit the researcher to use the entire information contained in the data. Our results are a natural generalization of the well known results from the theory of classical linear models. Theoretical computations are done in $L_2$, the Hilbert space of square integrable functions; the practical calculations, however, are based on simple numerical integrations. In our approach it is assumed that the user has some idea about the error covariance structure. This could be provided by other methods for the analysis of continuous processes, such as time series techniques.

In the remainder of this section we introduce the general framework of the paper and the model. Section 2 contains our main results on estimation, and Section 3 contains our results on testing linear hypotheses.

1.1 General settings

Let $(S, \mathcal{S}, \mu)$ be a measure space, with $\mu$ a $\sigma$-finite measure. The separable Hilbert space $L_2(S,\mu) = \{f: S \to \mathbb{R};\ \int_S f^2(s)\,d\mu(s) < \infty\}$ is endowed with its usual inner product, $\langle f,g\rangle = \int_S f(s)g(s)\,d\mu(s)$ for $f,g \in L_2(S,\mu)$, and its usual norm, $\|\cdot\|_2^2 = \langle\cdot,\cdot\rangle$. Let $\kappa: S\times S \to \mathbb{R}$ be a function that is symmetric, i.e., $\kappa(s,t) = \kappa(t,s)$ for every $(s,t) \in S\times S$, and $\mu\otimes\mu$-square integrable. The integral operator $T_\kappa$ with kernel $\kappa$ is defined on $L_2(S,\mu)$ by $f \mapsto T_\kappa f$, where $T_\kappa f$ is the real-valued function defined on $S$ by
$$T_\kappa f(t) = \int_S \kappa(s,t)\, f(s)\,d\mu(s), \qquad t \in S.$$
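In practice the operator $T_\kappa$ has to be handled numerically. The following minimal sketch (an illustration under stated assumptions, not part of the paper) discretizes the kernel on an equally spaced grid of $[0,1]$ and recovers the leading eigenpairs by a Nyström-type quadrature; the Brownian-motion kernel $\kappa(s,t) = \min(s,t)$ is chosen only because its eigenvalues are known in closed form.

```python
import numpy as np

# Minimal sketch: approximate the integral operator T_kappa on an equally
# spaced grid of [0, 1] and recover its leading eigenpairs (Nystrom-type
# quadrature).  The kernel min(s, t) (Brownian-motion covariance) is an
# illustrative choice, not one prescribed by the paper.

def operator_eigenpairs(kernel, n_grid=200):
    t = np.linspace(0.0, 1.0, n_grid)
    w = 1.0 / n_grid                          # quadrature weight of each node
    K = kernel(t[:, None], t[None, :])        # K[i, j] = kappa(t_i, t_j)
    # Eigen-decomposition of the weighted kernel matrix approximates
    # T_kappa f(t) = int kappa(s, t) f(s) ds.
    evals, evecs = np.linalg.eigh(K * w)
    order = np.argsort(evals)[::-1]
    lambdas = evals[order]
    # Rescale eigenvectors so that int phi_k(s)^2 ds = 1 on the grid.
    phis = evecs[:, order] / np.sqrt(w)
    return t, lambdas, phis

t, lambdas, phis = operator_eigenpairs(lambda s, u: np.minimum(s, u))
print(lambdas[:3])   # should be close to 4 / ((2k - 1)^2 pi^2), k = 1, 2, 3
```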

It can easily be seen that $T_\kappa$ is linear and self-adjoint, i.e., $\langle T_\kappa f, g\rangle = \langle f, T_\kappa g\rangle$ for all $f,g \in L_2(S,\mu)$. Moreover, see e.g. [1, Proposition 4.7], $T_\kappa$ is continuous and compact. Therefore, see [1, Theorem 5.1 and Corollary 5.3], there exist a sequence $(\lambda_k)$ of real numbers and an orthonormal basis $(\varphi_k)$ of $(\ker T_\kappa)^{\perp}$ such that for all $f \in L_2(S,\mu)$
$$T_\kappa f = \sum_{k\ge 1} \lambda_k \langle f,\varphi_k\rangle\, \varphi_k. \qquad (1)$$
In particular, for each $k\ge 1$, $\varphi_k$ is an eigenfunction of $T_\kappa$ associated with the eigenvalue $\lambda_k$.

Let $\mathcal{B}_{L_2}$ denote the Borel $\sigma$-algebra of $L_2(S,\mu)$, and let $X$ be an $(L_2(S,\mu), \mathcal{B}_{L_2})$-valued mean zero random variable, i.e., $E(\langle X, f\rangle) = 0$ for every $f \in L_2(S,\mu)$. The covariance function of $X$, $\kappa: S\times S \to \mathbb{R}$, is defined by $\kappa(s,t) = E[X(s)X(t)]$. In this setting we will always consider $L_2(S,\mu)$-valued mean zero random variables with $\mu\otimes\mu$-square integrable covariance function. Thus $T_\kappa$, the covariance operator of $X$, satisfies (1).

The operator $T_\kappa$ induces on $L_2(S,\mu)$ an inner product $\langle\cdot,\cdot\rangle_{T_\kappa}$, and thus a seminorm $\|\cdot\|_{T_\kappa}$, defined by
$$\langle f,g\rangle_{T_\kappa} = \langle f, T_\kappa g\rangle \quad\text{and}\quad \|f\|_{T_\kappa}^2 = \langle f,f\rangle_{T_\kappa}, \qquad f,g \in L_2(S,\mu).$$
Note that, by (1) and the definition of $\|\cdot\|_{T_\kappa}$, we have
$$\|f\|_{T_\kappa} = \|h\|_{T_\kappa} \quad\text{for all } f \in L_2(S,\mu), \qquad (2)$$
where $h$ is the projection of $f$ onto $(\ker T_\kappa)^{\perp}$. We also have, see [6] for example, the following result:
$$\langle T_\kappa f, g\rangle = E\big[\langle X,f\rangle\, \langle X,g\rangle\big], \qquad f,g \in L_2(S,\mu). \qquad (3)$$
Using the definition of $T_\kappa$ and Schwarz's inequality it can easily be seen that this seminorm is continuous with respect to the usual one:
$$\|f\|_{T_\kappa}^2 \le \|f\|_2^2 \left( \int_S\!\!\int_S \kappa^2(s,t)\,d\mu(s)\,d\mu(t) \right)^{1/2}, \qquad f \in L_2(S,\mu). \qquad (4)$$
This implies in particular that any sequence that converges in $(L_2(S,\mu), \|\cdot\|_2)$ also converges in $(L_2(S,\mu), \|\cdot\|_{T_\kappa})$. Computations with $\|\cdot\|_{T_\kappa}$ do not present any extra difficulty. Furthermore, using this seminorm we will show that the least squares and the maximum likelihood methods lead to the same estimator; this cannot be established for the usual norm.

It is often convenient to use vectorial notation. We will consider vectors in the Hilbert space $L_2^r(S,\mu)$, endowed with its canonical inner products
$$\langle f, g\rangle = \sum_{i=1}^{r} \langle f_i, g_i\rangle \quad\text{and}\quad \langle f, g\rangle_{T_\kappa} = \sum_{i=1}^{r} \langle f_i, g_i\rangle_{T_\kappa},$$
where $f = (f_1,\dots,f_r)'$ and $g = (g_1,\dots,g_r)'$ belong to $L_2^r(S,\mu)$.

Let $\xi = (\xi_1,\dots,\xi_r)'$ be an element of $L_2^r(S,\mu)$. For each $i = 1,\dots,r$ and $k \ge 1$, $\xi_{i(k)}$ will denote the coordinate of $\xi_i$ along $\varphi_k$, i.e., $\xi_{i(k)} = \langle \xi_i, \varphi_k\rangle$. The element of $\mathbb{R}^r$ whose components are $\xi_{1(k)},\dots,\xi_{r(k)}$ will be denoted by $\xi_{(k)}$, i.e., $\xi_{(k)} = (\xi_{1(k)},\dots,\xi_{r(k)})'$.

For simplicity of notation, and without any loss of generality, we will suppose that $(S, \mathcal{S}, \mu) = ([0,1], \mathcal{B}_{[0,1]}, \lambda)$, the unit interval endowed with its Borel $\sigma$-algebra and the Lebesgue measure. The Hilbert space $L_2([0,1],\lambda)$ is denoted by $L_2$.

1.2 The Model

The model we adopt in this work has its origins in classical analysis of variance theory [8]. The observations are functions of time, and each observation $\{Y(t), t \in [0,1]\}$ is the sum of a deterministic function $m$ and a scaled random function $\epsilon$, i.e., $Y = m + \sigma\epsilon$. We call $m$ the mean-value function, $\epsilon$ the fluctuation or error, and $\sigma$ is a positive unknown scalar. Assume that the mean-value function $m$ is a linear combination of some unknown functions. The equation of an observation can then be written as
$$Y(t) = x'\beta(t) + \sigma\epsilon(t), \qquad t \in [0,1], \qquad (5)$$
where $x = (x_1,\dots,x_p)' \in \mathbb{R}^p$ is known, and $\beta = (\beta_1,\dots,\beta_p)' \in L_2^p$ is to be estimated. Let $\kappa$ denote the covariance function of $\epsilon$ and $\{(\lambda_k,\varphi_k),\ k \ge 1\}$ the sequence of eigenvalue-eigenfunction pairs of the covariance operator $T_\kappa$. The sequence $(\varphi_k)_{k\ge 1}$ enables us to reduce the model (5) to a system of classical models. In particular, if $\epsilon$ is Gaussian, this reduction preserves independence across projections of realizations. Moreover it will allow us to identify, as will be seen, the law of some statistics. The key tool is the following result.

Theorem 1. [4, Theorem 2, p. 64] Let $X$ be a Gaussian mean zero random variable with values in $L_2(S,\mu)$. There exists a sequence $(X_k)$ of independent and identically distributed standard normal random variables such that
$$X_k = \frac{1}{\sqrt{\lambda_k}} \langle X, \varphi_k\rangle, \quad\text{and}\quad X = \sum_{k\ge 1} \sqrt{\lambda_k}\, X_k\, \varphi_k \quad \text{a.s.}$$

For notational simplicity, from now on $T$ will denote $T_\kappa$ and $H_T$ will denote the closure in $L_2$ of the subspace spanned by $(\varphi_k)_{k\ge 1}$, i.e.,
$$H_T = \Big\{ f \in L_2;\ f = \sum_{k\ge 1} \langle f, \varphi_k\rangle\, \varphi_k \Big\}.$$
In other words, $H_T$ is the reproducing kernel Hilbert space associated with $\kappa$; see [6, Definition 4E].
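Theorem 1 is what makes the reduction to coordinates computable. The sketch below is an illustration under assumed eigenvalues and cosine eigenfunctions (not quantities taken from the paper): it simulates a mean zero Gaussian element of $L_2$ by a truncated Karhunen-Loève expansion and recovers the coordinates $\langle X, \varphi_k\rangle$ by quadrature.

```python
import numpy as np

# Minimal sketch of Theorem 1: simulate a mean-zero Gaussian element of L_2 by a
# truncated Karhunen-Loeve expansion and recover its coordinates <X, phi_k> by
# quadrature.  The eigenvalue decay and the cosine eigenfunctions are
# illustrative assumptions only.

rng = np.random.default_rng(0)
n_grid, K = 1000, 10
t = np.linspace(0.0, 1.0, n_grid)
w = np.full(n_grid, 1.0 / (n_grid - 1))     # trapezoid quadrature weights
w[0] *= 0.5
w[-1] *= 0.5

lambdas = 1.0 / np.arange(1, K + 1) ** 2                                  # assumed eigenvalues
phis = np.sqrt(2.0) * np.cos(np.pi * np.outer(t, np.arange(1, K + 1)))   # orthonormal in L_2[0,1]

X_k = rng.standard_normal(K)
X = phis @ (np.sqrt(lambdas) * X_k)          # X = sum_k sqrt(lambda_k) X_k phi_k  (Theorem 1)

scores = phis.T @ (w * X)                    # coordinates X_(k) = <X, phi_k>
print(np.max(np.abs(scores / np.sqrt(lambdas) - X_k)))   # small quadrature error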

In the case of a Gaussian error we will assume, without any loss of generality, that the mean-value function $m$ belongs to $H_T$. Indeed, each observation $Y$ is the sum of its projection onto $H_T$, $Y_{H_T}$, and $Y_{(H_T)^\perp} = Y - Y_{H_T}$. The error is almost surely in $H_T$. Therefore $m_{(H_T)^\perp} = Y_{(H_T)^\perp}$ almost surely, and thus we only need to make inferences about $m_{H_T}$.

2 Estimation

Commonly used methods in the theory of estimation are the least squares and the maximum likelihood methods. In the theory of linear models, under the normality assumption, they lead to the same estimators, which have some optimality properties. Their justification as satisfactory estimation procedures is given in many standard statistical texts. Roughly speaking, the least squares method is based on minimizing the squared norm of the error. In Euclidean space this can be achieved by computing derivatives and equating them to zero. In the infinite dimensional case there is no direct equivalent of this procedure. However, in a Hilbert space (or, more generally, in a Banach space having a countable basis) it is still possible to apply the least squares procedure. We will show how it applies in $L_2$, and prove that the resulting estimators are optimal in a certain sense.

The method of maximum likelihood is more restrictive than the least squares method. More precisely, it requires a distribution density with respect to some measure. Lebesgue measure is the one most frequently used in classical linear models, and it is efficient under the normality assumption. In the functional case there is no equivalent of Lebesgue measure, and thus even under the normality assumption a distribution density might not exist. We present some cases where the maximum likelihood method is applicable.

Suppose that we observe $n$ independent realizations of the random function $Y$ of (5). In vectorial notation, the model can be written as
$$Y(t) = X\beta(t) + \sigma\epsilon(t), \qquad t \in [0,1], \qquad (6)$$
where $\sigma$ and $\beta$ are as before, $\epsilon = (\epsilon_1,\dots,\epsilon_n)'$ is an $L_2^n$-valued random variable, and $X = (x_{ij})$ is an $n\times p$ matrix whose entries are known real numbers. The observations are assumed to be independent and to have the same square-integrable covariance function $\kappa$. In other words, the components of $\epsilon$ are $L_2$-valued, mean zero, independent random variables which define the same compact covariance operator $T_\kappa$. For every fixed $t \in [0,1]$, the equations (6) are those of a classical linear model, and so estimators of $\beta(t)$ are available. Our aim is to find an estimator of the vector $\beta \in L_2^p$ that coincides for fixed $t$ with the classical ones, and to derive some test statistics for linear hypotheses.
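As a concrete reference point for the estimators developed next, the following sketch simulates data from model (6) on a time grid. The design matrix, the coefficient functions, the eigenvalue decay and the cosine eigenfunctions are all illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

# Minimal sketch of the data-generating model (6): Y(t) = X beta(t) + sigma eps(t)
# on a time grid.  All numerical choices below are illustrative assumptions.

rng = np.random.default_rng(1)
n, p, n_grid, K = 30, 2, 200, 15
t = np.linspace(0.0, 1.0, n_grid)

X = np.column_stack([np.ones(n), rng.standard_normal(n)])      # n x p design
beta = np.vstack([np.sin(2 * np.pi * t), t ** 2])               # p x n_grid coefficient functions
sigma = 0.5

# Gaussian error curves via a truncated Karhunen-Loeve expansion (Theorem 1).
lambdas = 1.0 / np.arange(1, K + 1) ** 2
phis = np.sqrt(2.0) * np.cos(np.pi * np.outer(t, np.arange(1, K + 1)))   # n_grid x K
eps = (rng.standard_normal((n, K)) * np.sqrt(lambdas)) @ phis.T          # n x n_grid

Y = X @ beta + sigma * eps        # n x n_grid matrix of observed curves
```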

2.1 The Least Squares Method

The method of least squares consists of finding the vector $\beta$ that minimizes the $T$-norm of the error $e(\beta) = (Y_1 - (X\beta)_1,\dots,Y_n - (X\beta)_n)'$, viewed as a function of $\beta$. Since $\sigma$ does not depend on $\beta$, dividing $e(\beta)$ by $\sigma$ if necessary, we may suppose that $\sigma = 1$.

From the definition of the $T$-norm in $L_2^n$ we have $\|e(\beta)\|_T^2 = \sum_{i=1}^n \|e_i(\beta)\|_T^2$, with $e_i(\beta) = Y_i - (X\beta)_i$ for $i = 1,\dots,n$. The definition of the inner product $\langle\cdot,\cdot\rangle_T$ and (1) show that
$$\|e_i(\beta)\|_T^2 = \langle e_i(\beta), T e_i(\beta)\rangle = \sum_{k\ge 1} \lambda_k \langle e_i(\beta), \varphi_k\rangle^2 = \sum_{k\ge 1} \lambda_k \big[ Y_{i(k)} - (X\beta)_{i(k)} \big]^2.$$
Therefore
$$\sum_{i=1}^n \|e_i(\beta)\|_T^2 = \sum_{k\ge 1} \lambda_k \big( Y_{(k)} - X\beta_{(k)} \big)' \big( Y_{(k)} - X\beta_{(k)} \big). \qquad (7)$$
This shows that $\|e(\beta)\|_T^2$ is minimized if, and only if, for each $k\ge 1$ the Euclidean norm of $e_{(k)}(\beta)$ is minimized. For each fixed $k\ge 1$, the norm of $e_{(k)}(\beta)$, viewed as a function of $\beta_{(k)}$, is minimized at $\hat\beta_{(k)} = (X'X)^{-1}X'Y_{(k)}$.

Theorem 2. The estimator of $\beta$ that minimizes $\|Y - Xb\|_T$, viewed as a function of $b$, is given by $\hat\beta = (X'X)^{-1}X'Y$.
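Since the minimizer is $(X'X)^{-1}X'Y_{(k)}$ at every coordinate, and hence $(X'X)^{-1}X'Y(t)$ at every time point, $\hat\beta$ can be computed by one ordinary least squares solve applied to all time points of the sampled curves at once. The sketch below is an illustration with assumed names and synthetic data; $X$ is taken to have full column rank.

```python
import numpy as np

# Minimal sketch of Theorem 2: beta_hat = (X'X)^{-1} X' Y, computed by one least
# squares solve applied simultaneously to every time point of the curves.

def functional_ls(X, Y):
    """X: (n, p) design matrix; Y: (n, n_grid) observed curves sampled on a grid.
    Returns beta_hat as a (p, n_grid) array, i.e. the p estimated functions."""
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return beta_hat

# Tiny synthetic example (noise-free), just to show the call signature.
n, p, n_grid = 20, 2, 100
t = np.linspace(0.0, 1.0, n_grid)
X = np.column_stack([np.ones(n), np.linspace(-1.0, 1.0, n)])
beta = np.vstack([np.sin(2 * np.pi * t), t])
Y = X @ beta
print(np.allclose(functional_ls(X, Y), beta))   # exact recovery without noise
```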

2.2 The BLUE Property

From the classical theory of linear models it is known that, under the Gauss-Markov conditions, for each $k\ge 1$, $\hat\beta_{(k)}$ is the Best Linear Unbiased Estimator (BLUE) of $\beta_{(k)}$, i.e.,
$$\forall a \in \mathbb{R}^p,\ \forall b \in \mathbb{R}^n, \qquad E\big[ b'Y_{(k)} \big] = a'\beta_{(k)} \ \Longrightarrow\ \mathrm{Var}\big[ a'\hat\beta_{(k)} \big] \le \mathrm{Var}\big[ b'Y_{(k)} \big].$$
This can also be formulated as follows: if $f \in \mathbb{R}^n$ and $g \in \mathbb{R}^p$ are two continuous linear forms on their respective spaces, then
$$E[\langle f, Y_{(k)}\rangle] = \langle g, \beta_{(k)}\rangle \ \Longrightarrow\ \mathrm{Var}[\langle g, \hat\beta_{(k)}\rangle] \le \mathrm{Var}[\langle f, Y_{(k)}\rangle].$$
But $Y$ and $\hat\beta$ are also elements of $L_2^n$ and $L_2^p$ respectively, and continuous linear forms on $\mathbb{R}^r$, $r = n$ or $p$, can be viewed as particular elements of $L_2^r$. A natural generalization of the BLUE property is therefore: if $f \in L_2^n$ and $g \in L_2^p$ are two continuous linear forms on their respective spaces, then
$$E[\langle f, Y\rangle] = \langle g, \beta\rangle \ \Longrightarrow\ \mathrm{Var}[\langle g, \hat\beta\rangle] \le \mathrm{Var}[\langle f, Y\rangle].$$
Given $g \in L_2^p$, for which $f \in L_2^n$ is the minimal variance achieved? The answer is given by the following theorem.

Theorem 3. Let $f \in L_2^n$ and $g \in L_2^p$ be such that $E[\langle f, Y\rangle] = \langle g, \beta\rangle$. Then
$$\mathrm{Var}[\langle g, \hat\beta\rangle] \le \mathrm{Var}[\langle f, Y\rangle].$$

Proof. To prove this theorem we need the following lemma.

Lemma 1. For every $f \in L_2^n$ we have $\mathrm{Var}[\langle f, Y\rangle] = \sum_{k\ge 1} \mathrm{Var}[\langle f_{(k)}, Y_{(k)}\rangle]$.

[Proof of Lemma 1] First note that, since $Y_{1(k)},\dots,Y_{n(k)}$ are independent, $\mathrm{Cov}[Y_{(k)}] = \sigma^2\lambda_k I$. Indeed, for all $k\ge 1$ and $i = 1,\dots,n$ we have $\mathrm{Var}(Y_{i(k)}) = \sigma^2\lambda_k$. To see this, note that $\mathrm{Var}(Y_{i(k)}) = \sigma^2 \mathrm{Var}(\epsilon_{i(k)})$, and by equation (3), $\mathrm{Var}(\epsilon_{i(k)}) = E[\langle \epsilon_i, \varphi_k\rangle^2] = \langle T\varphi_k, \varphi_k\rangle$. Since $\varphi_k$ is the orthonormal eigenfunction of $T$ associated with $\lambda_k$, the result follows.

Using the definition of the inner product in $L_2^n$ and the independence of the $Y_i$, we have $\mathrm{Var}[\langle f, Y\rangle] = \sum_{i=1}^n \mathrm{Var}[\langle f_i, Y_i\rangle]$. But $\mathrm{Var}[\langle f_i, Y_i\rangle] = \sigma^2 E[\langle f_i, \epsilon_i\rangle^2]$, which by equation (3) equals $\sigma^2 \langle Tf_i, f_i\rangle$. Now, using the expansion of $f_i$ in the orthonormal basis $(\varphi_k)$ and the fact that the $\varphi_k$ are eigenfunctions of $T$, we have
$$\mathrm{Var}[\langle f_i, Y_i\rangle] = \sigma^2 \sum_{k\ge 1} \lambda_k f_{i(k)}^2,$$
where $f_{i(k)} = \langle f_i, \varphi_k\rangle$.

This shows that $\mathrm{Var}[\langle f, Y\rangle] = \sigma^2 \sum_{k\ge 1} \lambda_k\, f_{(k)}'f_{(k)}$, and thus $\mathrm{Var}[\langle f, Y\rangle] = \sum_{k\ge 1} \mathrm{Var}[\langle f_{(k)}, Y_{(k)}\rangle]$, which proves the lemma.

Suppose there is an $f \in L_2^n$ such that $E[\langle f, Y\rangle] = \langle g, \beta\rangle$ and $\mathrm{Var}[\langle f, Y\rangle] \le \mathrm{Var}[\langle h, Y\rangle]$ for every $h \in L_2^n$ such that $E[\langle h, Y\rangle] = \langle g, \beta\rangle$. Since the equality $E[\langle f, Y\rangle] = \langle g, \beta\rangle$ should hold for every $\beta$, $f$ has to satisfy $X'f = g$. In other words, it should satisfy
$$X'f_{(k)} = g_{(k)}, \qquad \forall\, k\ge 1. \qquad (8)$$
By Lemma 1, $\mathrm{Var}[\langle f, Y\rangle] = \sum_{k\ge 1} \mathrm{Var}[\langle f_{(k)}, Y_{(k)}\rangle]$. Thus, to minimize this variance, it suffices to minimize each term of the series. For $k\ge 1$ fixed, we use Lagrange multipliers to minimize the $k$-th term of the series, subject to the conditions (8), and obtain $f_{(k)} = X(X'X)^{-1}g_{(k)}$. Since this holds for all $k\ge 1$, we have $f = X(X'X)^{-1}g$, so that $\langle f, Y\rangle = \langle g, (X'X)^{-1}X'Y\rangle = \langle g, \hat\beta\rangle$, which is what we sought.

The result of the theorem is given in terms of the usual inner product in $L_2$ to emphasize that it generalizes the known result in the finite dimensional case. The same arguments used for the proof apply if $\langle\cdot,\cdot\rangle$ is replaced by $\langle\cdot,\cdot\rangle_T$. The following section is devoted to another method of estimation.

2.3 The Maximum Likelihood Method

In this section we assume that $\epsilon$ of the model (6) is Gaussian, and denote the $n\times 1$ vector $X\beta$ by $m = (m_1,\dots,m_n)'$. Let $\mathcal{L}(Y_1) = \nu$ and $\mathcal{L}(\sigma\epsilon_1) = \mu$ denote the probability distributions of $Y_1$ and $\sigma\epsilon_1$ respectively. These are two Gaussian probability measures on $(L_2, \mathcal{B}_{L_2})$. As before, $T$ denotes the covariance operator of $\epsilon_1$, and $\sqrt{T}$ denotes the operator defined by $\sqrt{T}\,\varphi_k = \sqrt{\lambda_k}\,\varphi_k$ for all $k\ge 1$.

It is well known, see e.g. [4] or [5], that the distributions of two Gaussian processes which have the same covariance operator are either equivalent or orthogonal.

A necessary and sufficient condition for two Gaussian probabilities to be equivalent can be found, for instance, in [4, Theorem 1, p. 271] or [5, Theorem 3.1, p. 118]. In both of these references one can also find an expression for the Radon-Nikodym derivative of one probability with respect to the other when they are equivalent. A necessary and sufficient condition for the Radon-Nikodym derivative of $\nu$ with respect to $\mu$ to exist is that $m$ belongs to $\mathrm{range}(\sqrt{T})$. In [3] it is shown that this condition is actually equivalent to $m \in H_T$. We have seen that for estimating $m$ there is no loss of generality in supposing that $m \in H_T$. Under this condition, the next lemma gives the Radon-Nikodym derivative $d\nu/d\mu$. A proof can be found in [3].

Lemma 2. Assume that the probability measure $\nu$ is absolutely continuous with respect to the probability measure $\mu$. Then
$$\frac{d\nu}{d\mu}(x) = \exp\Big[ -\frac{1}{2}\sum_{k\ge 1} \tilde m_k^2 + \sum_{k\ge 1} \tilde m_k \tilde x_k \Big], \qquad x \in H_T,$$
where $\tilde m_j = \langle m, \tilde\varphi_j\rangle$, $\tilde x_j = \langle x, \tilde\varphi_j\rangle$, and $\tilde\varphi_j = \varphi_j/(\sigma\sqrt{\lambda_j})$.

The likelihood function $l$ of the sample is the product of the densities of the $Y_i$: for $y \in L_2^n$,
$$l(m, y) = \exp\Big[ -\frac{1}{2}\sum_{j\ge 1}\sum_{i=1}^n \tilde m_{i(j)}^2 + \sum_{j\ge 1}\sum_{i=1}^n \tilde m_{i(j)} \tilde y_{i(j)} \Big] = \prod_{j\ge 1} \exp\Big\{ -\frac{1}{2\sigma^2\lambda_j} \big[ m_{(j)}'m_{(j)} - 2\, m_{(j)}'y_{(j)} \big] \Big\},$$
where $m_{(j)} = (m_{1(j)},\dots,m_{n(j)})'$. Replacing $m(t)$ by $X\beta(t)$ we obtain, for $y \in L_2^n$,
$$l(\beta, y) = \prod_{j\ge 1} \exp\Big\{ -\frac{1}{2\sigma^2\lambda_j} \big[ \beta_{(j)}'X'X\beta_{(j)} - 2\,\beta_{(j)}'X'y_{(j)} \big] \Big\}.$$
Note that, for every $j\ge 1$,
$$\beta_{(j)}'X'X\beta_{(j)} - 2\,\beta_{(j)}'X'y_{(j)} = \big(y_{(j)} - X\beta_{(j)}\big)'\big(y_{(j)} - X\beta_{(j)}\big) - y_{(j)}'y_{(j)},$$
so that maximizing $l(\beta, y)$ with respect to $\beta$, independently of $\sigma$, is equivalent to minimizing $\big(y_{(j)} - X\beta_{(j)}\big)'\big(y_{(j)} - X\beta_{(j)}\big)$ for each $j\ge 1$. Hence we have the following result.

Theorem 4. Viewed as a function of $\beta$, $l(\beta, Y)$ is maximized at $\hat\beta = (X'X)^{-1}X'Y$.

We end this section with some remarks on the methods we used to provide an estimator for $\beta$.

Remark 1. As in the finite dimensional case, the two methods of estimation, LS and ML, lead to the same estimator of $\beta$. However, while the former is always applicable, the latter is not. More precisely:

a- As in the finite dimensional case, the least squares method does not require normality. Moreover, it does not even require any knowledge about the covariance structure.

b- Unlike the finite dimensional case, in infinite dimensional spaces there is no direct equivalent of the Lebesgue measure on $\mathbb{R}^n$. However, if the error is Gaussian this difficulty can be circumvented by letting the Radon-Nikodym derivative of the observation distribution with respect to the error distribution play the role of the likelihood function. If the error is not Gaussian, it is not easy to remove this difficulty.

The inner product $\langle\cdot,\cdot\rangle_T$ is not an artificial one. It appears naturally in the reproducing kernel Hilbert space approach. Furthermore, at its maximum the likelihood criterion reduces, in terms of the inner product $\langle\cdot,\cdot\rangle_T$, to the quantity
$$\sum_{j\ge 1} \lambda_j\, Y_{(j)}'Y_{(j)} - \sum_{j\ge 1} \lambda_j \big(Y_{(j)} - X\hat\beta_{(j)}\big)'\big(Y_{(j)} - X\hat\beta_{(j)}\big),$$
which is analogous to the statistic SSR usually found in regression or analysis of variance. The statistic SSR is defined in the next section, where we develop tests for linear hypotheses.

3 Testing linear hypotheses

Our aim in this section is to make inferences about $\beta$ by establishing tests for the hypothesis $H_0: K\beta = C$. Following the notation of classical linear models, we will define the residual error sum of squares SSE and the sum of squares due to regression SSR, and establish some of their properties assuming that the error $\epsilon$ is Gaussian. Using the definition of $\hat\beta$, it is easy to see that
$$Y - X\hat\beta = \big(I - X(X'X)^{-1}X'\big)Y,$$
where $I$ is the identity matrix.

Let $M = (m_{ij})_{i,j}$ denote the matrix $I - X(X'X)^{-1}X'$. It is idempotent, symmetric, and such that $MX = 0$. For each $k\ge 1$, $Y_{(k)}$ is a Gaussian $\mathcal{N}(\mu_{(k)}, \sigma^2\lambda_k I)$ vector with mean $\mu_{(k)} = X\beta_{(k)}$. Therefore $(\sigma^2\lambda_k)^{-1}\, Y_{(k)}'MY_{(k)}$ is a $\chi^2(r(M))$ random variable with $r(M) = \mathrm{rank}(M)$ degrees of freedom.

3.1 The SSE and SSR

By analogy with the classical theory of linear models [9] we define the error sum of squares (SSE) by
$$\mathrm{SSE} = \big\langle Y - \hat Y,\, Y - \hat Y \big\rangle_T, \qquad \text{where } \hat Y = X\hat\beta.$$
It can easily be seen that $\mathrm{SSE} = \langle Y, MY\rangle_T$. From the definition of the inner product $\langle\cdot,\cdot\rangle_T$ in $L_2^n$, we have
$$\langle Y, MY\rangle_T = \sum_{i=1}^n \sum_{j=1}^n \big\langle Y_i,\, m_{ij}Y_j \big\rangle_T.$$
Using the expansion of the $Y_i$ and the orthonormality of the $\varphi_k$, the general term of the double sum becomes
$$\big\langle Y_i,\, m_{ij}Y_j \big\rangle_T = \sum_{k\ge 1} \lambda_k\, \langle Y_i, \varphi_k\rangle\, m_{ij}\, \langle Y_j, \varphi_k\rangle.$$
Substituting this in the previous formula, we get
$$\langle Y, MY\rangle_T = \sum_{k\ge 1} \lambda_k \sum_{i=1}^n \sum_{j=1}^n \langle Y_i, \varphi_k\rangle\, m_{ij}\, \langle Y_j, \varphi_k\rangle = \sum_{k\ge 1} \lambda_k\, Y_{(k)}'MY_{(k)}.$$
This shows that $\mathrm{SSE} = \sum_{k\ge 1} \lambda_k\, Y_{(k)}'MY_{(k)}$. We also define the total sum of squares by $\mathrm{SST} = \langle Y, Y\rangle_T$. The sum of squares due to regression, SSR, is then the difference $\mathrm{SST} - \mathrm{SSE}$, and we have $\mathrm{SSR} = \langle Y, BY\rangle_T$, where $B = X(X'X)^{-1}X'$. It is easy to see that the matrix $B$ is symmetric, idempotent, and orthogonal to $M$. We can now state and prove the following results.

Theorem 5. Assuming that the error $\epsilon$ is Gaussian, we have:

1- There exists a sequence $(\eta_k)$ of independent random variables with the same $\chi^2_{r(M)}$ distribution, such that the series $\sigma^2 \sum_{k\ge 1} \lambda_k^2\, \eta_k$ converges to SSE almost surely.

2- There exists a sequence $(\xi_k)$ of independent real random variables such that each $\xi_k$ has a non-central $\chi^2\big(r(B),\, (2\sigma^2\lambda_k)^{-1}\mu_{(k)}'B\mu_{(k)}\big)$ distribution, and the series $\sigma^2 \sum_{k\ge 1} \lambda_k^2\, \xi_k$ converges to SSR almost surely.

3- SSE and SSR are independent.

Proof. 1- We have seen that
$$\mathrm{SSE} = \sum_{k\ge 1} \lambda_k\, Y_{(k)}'MY_{(k)} = \sigma^2 \sum_{k\ge 1} \lambda_k^2 \Big( \tfrac{1}{\sigma\sqrt{\lambda_k}}\, Y_{(k)} \Big)' M \Big( \tfrac{1}{\sigma\sqrt{\lambda_k}}\, Y_{(k)} \Big).$$
Since $\mu_{(k)}'M = 0$, $\eta_k = \big( \tfrac{1}{\sigma\sqrt{\lambda_k}} Y_{(k)} \big)' M \big( \tfrac{1}{\sigma\sqrt{\lambda_k}} Y_{(k)} \big)$ has a $\chi^2$ distribution with $r(M)$ degrees of freedom. On the other hand, $E[\mathrm{SSE}] = E\langle Y, MY\rangle_T$ is finite; therefore SSE is almost surely finite.

2- Using the same arguments as for SSE we have
$$\mathrm{SSR} = \sum_{k\ge 1} \lambda_k\, Y_{(k)}'BY_{(k)} = \sigma^2 \sum_{k\ge 1} \lambda_k^2 \Big( \tfrac{1}{\sigma\sqrt{\lambda_k}}\, Y_{(k)} \Big)' B \Big( \tfrac{1}{\sigma\sqrt{\lambda_k}}\, Y_{(k)} \Big).$$
Thus $\xi_k = \big( \tfrac{1}{\sigma\sqrt{\lambda_k}} Y_{(k)} \big)' B \big( \tfrac{1}{\sigma\sqrt{\lambda_k}} Y_{(k)} \big)$ is a non-central $\chi^2$ with $r(B)$ degrees of freedom and non-centrality parameter $(2\sigma^2\lambda_k)^{-1}\mu_{(k)}'B\mu_{(k)}$. Since $E[\mathrm{SSR}] = E\langle Y, BY\rangle_T < \infty$, SSR is almost surely finite.

3- It suffices to show that the sequences $(\eta_k)_{k\ge 1}$ and $(\xi_k)_{k\ge 1}$ are mutually independent. Since $(Y_{(k)})_{k\ge 1}$ is a sequence of independent random vectors and since $\eta_k$ and $\xi_k$ are functions of $Y_{(k)}$ alone, it suffices to show that, for each $k\ge 1$, $\eta_k$ and $\xi_k$ are independent. We have $\eta_k = (\sigma^2\lambda_k)^{-1} Y_{(k)}'MY_{(k)}$ and $\xi_k = (\sigma^2\lambda_k)^{-1} Y_{(k)}'BY_{(k)}$, where $Y_{(k)}$ is a Gaussian $\mathcal{N}(\mu_{(k)}, \sigma^2\lambda_k I)$ random vector. But $BM = MB = 0$, thus $\eta_k$ and $\xi_k$ are independent.

Remark 2. The statistic $\mathrm{SSE} / \big( r(M) \sum_{k} \lambda_k^2 \big)$ is an unbiased and consistent estimator of $\sigma^2$.
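With a finite number of retained eigen-coordinates, SSE, SSR and the variance estimator of Remark 2 are straightforward to compute. The sketch below is an illustration under assumed names and a finite truncation; in practice the coordinates $Y_{(k)}$ would come from projecting the observed curves on the (estimated) eigenfunctions.

```python
import numpy as np

# Minimal sketch of SSE = sum_k lambda_k Y_(k)' M Y_(k),
# SSR = sum_k lambda_k Y_(k)' B Y_(k), and the estimator of Remark 2,
# with M = I - X(X'X)^{-1}X' and B = X(X'X)^{-1}X'.  Truncation level and
# inputs are illustrative assumptions.

def sums_of_squares(X, scores, lambdas):
    """X: (n, p) design; scores[k] = Y_(k) in R^n for k = 0..K-1; lambdas: (K,)."""
    n = X.shape[0]
    B = X @ np.linalg.solve(X.T @ X, X.T)            # hat matrix
    M = np.eye(n) - B
    sse = sum(lam * yk @ M @ yk for lam, yk in zip(lambdas, scores))
    ssr = sum(lam * yk @ B @ yk for lam, yk in zip(lambdas, scores))
    r_M = n - np.linalg.matrix_rank(X)               # rank of M
    sigma2_hat = sse / (r_M * np.sum(lambdas ** 2))  # Remark 2
    return sse, ssr, sigma2_hat

# Example call with random inputs, only to show shapes:
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(8), rng.standard_normal(8)])
lambdas = 1.0 / np.arange(1, 6) ** 2
scores = rng.standard_normal((5, 8))                 # K = 5 coordinate vectors Y_(k)
print(sums_of_squares(X, scores, lambdas))
```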

3.2 Testing linear hypotheses

Consider the general hypothesis $H_0: K\beta = C$, where $K$ is an $m\times p$ matrix and $C$ is an $m\times 1$ vector of given functions. We assume that $K$ is of full row rank, i.e., $\mathrm{rank}(K) = m$. For testing this kind of hypothesis we need to compute an estimator of $\beta$ under $H_0$. We will see that the constrained least squares method provides such an estimator, $\tilde\beta$. The constrained least squares estimates are obtained by minimizing $\|e(\beta)\|_T^2$ under the condition $K\beta - C = 0$, which is equivalent to
$$K\beta_{(k)} - C_{(k)} = 0, \qquad \forall\, k\ge 1. \qquad (9)$$
To minimize $\|e(\beta)\|_T^2$ subject to (9), it suffices to minimize, for each $k\ge 1$, $\big(Y_{(k)} - X\beta_{(k)}\big)'\big(Y_{(k)} - X\beta_{(k)}\big)$ subject to the condition $K\beta_{(k)} - C_{(k)} = 0$. Using Lagrange multiplier vectors, it can easily be seen that
$$\tilde\beta = \hat\beta - (X'X)^{-1}K'\big[ K(X'X)^{-1}K' \big]^{-1}\big( K\hat\beta - C \big). \qquad (10)$$
Note that since $\mathrm{rank}(K) = m \le p = \mathrm{rank}\big((X'X)^{-1}\big)$, the rank of $K(X'X)^{-1}K'$ equals $m$. Hence $K(X'X)^{-1}K'$ is of full rank, and thus invertible.

Remark 3. Recall that the maximum likelihood estimator has been obtained by minimizing, for each $k\ge 1$, $\big(Y_{(k)} - X\beta_{(k)}\big)'\big(Y_{(k)} - X\beta_{(k)}\big)$. Therefore the constrained maximum likelihood estimator, obtained by maximizing the likelihood function subject to (9), is equal to $\tilde\beta$.

Let $\mathrm{SSE}_{H_0}$ denote the error sum of squares under the hypothesis $H_0: K\beta = C$, i.e., $\mathrm{SSE}_{H_0} = \|Y - X\tilde\beta\|_T^2$. Writing $Y - X\tilde\beta = Y - X\hat\beta + X(\hat\beta - \tilde\beta)$, and since $X'(Y - X\hat\beta) = 0$, it follows that
$$\mathrm{SSE}_{H_0} = \|Y - X\hat\beta\|_T^2 + \|X(\hat\beta - \tilde\beta)\|_T^2. \qquad (11)$$
From (10) and the same calculations as in the classical analysis of variance, we obtain $X(\hat\beta - \tilde\beta) = D(Y - C^*)$, where
$$D = X(X'X)^{-1}K'\big[ K(X'X)^{-1}K' \big]^{-1}K(X'X)^{-1}X', \qquad\text{and}\qquad C^* = XK'(KK')^{-1}C.$$
Note that $D$ is idempotent, symmetric, and $DM = MD = 0$. In this form we can now state some properties of the statistic $Q = \|X(\hat\beta - \tilde\beta)\|_T^2$. They can be proved following the proof of Theorem 5.
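Before turning to the distributional properties, note that equation (10) is a finite dimensional correction of $\hat\beta$ applied coordinate by coordinate, so on a grid it can be carried out with plain matrix algebra. A minimal sketch follows, assuming $K$ has full row rank and $X$ full column rank; the function name and data layout are illustrative.

```python
import numpy as np

# Minimal sketch of the constrained least squares estimator (10):
#   beta_tilde = beta_hat - (X'X)^{-1} K' [K (X'X)^{-1} K']^{-1} (K beta_hat - C),
# applied column by column to curves sampled on a grid.

def constrained_ls(X, Y, K, C):
    """X: (n, p); Y: (n, n_grid); K: (m, p); C: (m, n_grid).
    Returns the (p, n_grid) constrained estimator beta_tilde."""
    beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)          # Theorem 2
    XtX_inv = np.linalg.inv(X.T @ X)
    A = XtX_inv @ K.T @ np.linalg.inv(K @ XtX_inv @ K.T)      # p x m
    return beta_hat - A @ (K @ beta_hat - C)                  # equation (10)

# The constraint is then satisfied exactly at every grid point:
# np.allclose(K @ constrained_ls(X, Y, K, C), C)  evaluates to True.
```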

Theorem 6. Assuming that the error $\epsilon$ is Gaussian, we have:

a- There exists a sequence $(\zeta_k)$ of independent real random variables such that each $\zeta_k$ has a non-central $\chi^2(r(D), \delta_k)$ distribution with $r(D)$ degrees of freedom and non-centrality parameter
$$\delta_k = (2\sigma^2\lambda_k)^{-1} \big( E Y_{(k)} - C^*_{(k)} \big)' D \big( E Y_{(k)} - C^*_{(k)} \big),$$
and such that the series $\sigma^2 \sum_{k} \lambda_k^2\, \zeta_k$ converges to $Q = \|X(\hat\beta - \tilde\beta)\|_T^2$ almost surely.

b- The real random variables $Q$ and SSE are stochastically independent.

The sum of squares due to regression under the hypothesis $H_0: K\beta = C$ is defined by $\mathrm{SSR}_{H_0} = \mathrm{SST} - \mathrm{SSE}_{H_0}$. Equation (11) then shows that $\mathrm{SSR} - \mathrm{SSR}_{H_0} = \mathrm{SSE}_{H_0} - \mathrm{SSE} = Q$. Thus we have the following result.

Theorem 7. For testing $H_0: K\beta = C$ versus $H_1: K\beta \neq C$ at level $\alpha$, there exists a test $\psi$ given by
$$\psi = \begin{cases} 1 & \text{if } S_{H_0}(Y) > C(H_0,\alpha), \\ 0 & \text{otherwise,} \end{cases}
\qquad\text{where}\qquad
S_{H_0}(Y) = \begin{cases} \mathrm{SSR} - \mathrm{SSR}_{H_0} & \text{if } \sigma \text{ is known}, \\ (\mathrm{SSR} - \mathrm{SSR}_{H_0})/\mathrm{SSE} & \text{otherwise.} \end{cases}$$
The constant $C(H_0,\alpha)$ is such that $P\{S_{H_0}(Y) > C(H_0,\alpha)\} = \alpha$ under $H_0: K\beta = C$.
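On a finite truncation, the statistic of Theorem 7 (in its unknown-$\sigma$ form) reduces to ratios of the quadratic forms already computed above. A minimal sketch under assumed names follows; the critical value $C(H_0,\alpha)$ still has to be calibrated from the null distribution described by Theorems 5 and 6, for example by simulation, a step the paper leaves implicit.

```python
import numpy as np

# Minimal sketch of S_H0(Y) = (SSR - SSR_H0)/SSE = (SSE_H0 - SSE)/SSE,
# computed from a finite number of eigen-coordinates.  Truncation and names
# are illustrative assumptions; the rejection threshold is not computed here.

def test_statistic(X, scores, lambdas, K, C_scores):
    """scores[k] = Y_(k) in R^n and C_scores[k] = C_(k) in R^m, for k = 0..K-1."""
    XtX_inv = np.linalg.inv(X.T @ X)
    A = XtX_inv @ K.T @ np.linalg.inv(K @ XtX_inv @ K.T)
    sse, sse_h0 = 0.0, 0.0
    for lam, yk, ck in zip(lambdas, scores, C_scores):
        bk = XtX_inv @ X.T @ yk                       # unconstrained beta_(k)
        bk_h0 = bk - A @ (K @ bk - ck)                # constrained, equation (10)
        sse += lam * np.sum((yk - X @ bk) ** 2)
        sse_h0 += lam * np.sum((yk - X @ bk_h0) ** 2)
    return (sse_h0 - sse) / sse
```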

References

[1] J.B. Conway, A Course in Functional Analysis, Second Edition, Springer-Verlag, 1989.

[2] R.M. Dudley, Distances of probability measures and random variables, Ann. Math. Stat., 39 (1968), 1563-1572.

[3] X. Fernique, Gaussian random vectors and their reproducing kernel Hilbert spaces, Technical Report Series of the Laboratory for Research in Statistics and Probability, No. 34, University of Ottawa and Carleton University, 1985.

[4] U. Grenander, Abstract Inference, John Wiley & Sons, Inc., 1981.

[5] H. Kuo, Gaussian Measures in Banach Spaces, Lecture Notes in Mathematics 463, Springer-Verlag, 1970.

[6] E. Parzen, An approach to time series analysis, Ann. Math. Stat., 32 (1961), 951-989.

[7] J.O. Ramsay and C.J. Dalzell, Some tools for functional data analysis, J. Roy. Statist. Soc. Ser. B, 53 (1991), 539-572.

[8] H. Scheffé, The Analysis of Variance, John Wiley & Sons, Inc., 1959.

[9] S.R. Searle, Linear Models, John Wiley & Sons, Inc., 1971.

Received: October 15, 2007