
To appear in Neural Networks.

CONSTRUCTION OF CONFIDENCE INTERVALS FOR NEURAL NETWORKS BASED ON LEAST SQUARES ESTIMATION

Isabelle Rivals and Léon Personnaz
Laboratoire d'Électronique, École Supérieure de Physique et de Chimie Industrielles, 10 rue Vauquelin, 75231 Paris Cedex 05, France. E-mail: Isabelle.Rivals@espci.fr

Abstract

We present the theoretical results about the construction of confidence intervals for a nonlinear regression based on least squares estimation and using the linear Taylor expansion of the nonlinear model output. We stress the assumptions on which these results are based, in order to derive an appropriate methodology for neural black-box modeling; the latter is then analyzed and illustrated on simulated and real processes. We show that the linear Taylor expansion of a nonlinear model output also gives a tool to detect the possible ill-conditioning of neural network candidates, and to estimate their performance. Finally, we show that the least squares and linear Taylor expansion based approach compares favourably with other analytic approaches, and that it is an efficient and economic alternative to the non analytic and computationally intensive bootstrap methods.

Keywords

Nonlinear regression, neural networks, least squares estimation, linear Taylor expansion, confidence intervals, ill-conditioning detection, model selection, approximate leave-one-out score.

Notations

We distinguish between random variables and their values (or realizations) by using upper- and lowercase letters; all vectors are column vectors, and are denoted by boldface letters; non random matrices are denoted by light lowercase letters.

x : non random n-input vector
Y_p = Y_p(x) : random scalar output depending on x
E(Y_p(x)) : mathematical expectation, or regression function, of Y_p given x
W : random variable with zero expectation denoting additive noise
σ^2 : variance of W
{x^k, y_p^k} k=1 to N : data set of N input-output pairs, where the x^k are non random n-vectors, and the y_p^k are the corresponding realizations of the random outputs Y_p^k = Y_p(x^k)
{x^T θ, θ ∈ R^n} : family of linear functions of x parameterized by θ
θ_p : unknown true q-parameter vector (q = n in the linear case)
x = [x^1 x^2 ... x^N]^T : non random (N, n) input matrix
Y_p = [Y_p^1 ... Y_p^N]^T : random N-vector of the outputs of the data set
W = [W^1 W^2 ... W^N]^T : random N-vector with E(W) = 0
I_N : (N, N) identity matrix
J(θ) : value of the least squares cost function
Θ_LS : least squares estimator of θ_p
θ_LS : least squares estimate of θ_p
R = Y_p - x Θ_LS : least squares residual random N-vector in the linear case
r : value of R
m_x : range of x (linear manifold)
p_x : orthogonal projection matrix on m_x
S^2 : estimator of σ^2
s^2 : value of S^2
{f(x, θ), θ ∈ R^q} : family of nonlinear functions of x parameterized by θ
f(x, θ) : N-vector [f(x^1, θ) ... f(x^k, θ) ... f(x^N, θ)]^T
R = Y_p - f(x, Θ_LS) : least squares residual random N-vector
ξ = [ξ^1 ξ^2 ... ξ^N]^T : unknown non random (N, q) matrix with ξ^k = [∂f(x^k, θ)/∂θ] at θ = θ_p
m_ξ : range of ξ
p_ξ : orthogonal projection matrix on m_ξ
z = [z^1 z^2 ... z^N]^T : matrix approximating ξ, with z^k = [∂f(x^k, θ)/∂θ] at θ = θ_LS
m_z : range of z
p_z : orthogonal projection matrix on m_z
θ_LS^(k) : leave-one-out (the k-th example) least squares estimate
e^k, k=1 to N : leave-one-out errors
n_h : number of hidden neurons of a neural network
H : random Hessian matrix of the cost function
h : value of the Hessian matrix of the cost function
s_ref^2(x) : reference variance estimate
var_LTE f(x, Θ_LS) : LTE estimate of the variance of a nonlinear model output
var_Hessian f(x, Θ_LS) : Hessian estimate of the variance of a nonlinear model output
var_sandwich f(x, Θ_LS) : sandwich estimate of the variance of a nonlinear model output

Abbreviations

CI : confidence interval
LOO : leave-one-out
LS : least squares
LTE : linear Taylor expansion
SISO : single input - single output
MISO : multi input - single output
MSTE : mean square training error
MSPE : mean square performance error

I. INTRODUCTION

For any modeling problem, it is very important to be able to estimate the reliability of a given model. This problem has been investigated to a great extent in the framework of linear regression theory, leading to well-established results and commonly used methods to build confidence intervals (CIs) for the regression, that is the process output expectation [Seber 1977]; more recently, these results have been extended to nonlinear models [Bates & Watts 1988] [Seber & Wild 1989]. In neural network modeling studies however, these results are seldom exploited, and generally only an average estimate of the neural model reliability is given through the mean square model error on a test set; but in an application, one often wishes to know a CI at any input value of interest. Nevertheless, thanks to the increase of computer power, the use of bootstrap methods has increased in the past years [Efron & Tibshirani 1993]. These non analytic methods have been proposed to build CIs for neural networks [Paass 1993] [Tibshirani 1993] [Heskes 1997], but with the shortcoming of requiring a large number of trainings.

This paper presents an economic alternative for the construction of CIs using neural networks. This approach being built on the linear least squares (LS) theory applied to the linear Taylor expansion (LTE) of the output of nonlinear models, we first recall how to establish CIs for linear models in section II, and then approximate CIs for nonlinear models in section III. In section IV, we exploit these known theoretical results for practical modeling problems involving neural models. We show that the LTE of a nonlinear model output not only provides a CI at any input value of interest, but also gives a tool to detect the possible ill-conditioning of the model, and, as in [Monari 1999] [Monari & Dreyfus], to estimate its performance through an approximate leave-one-out (LOO) score. A real-world illustration is given through an industrial application, the modeling of the elasticity of a complex material from some of its structural descriptors. Section V compares the LS LTE approach to other analytic approaches, and discusses its advantages with respect to bootstrap approaches.

We consider single-output models, since each output of a multi-output model can be handled separately. We deal with static modeling problems for the case of a non random (noise free) n-input vector x = [x_1 x_2 ... x_n]^T, and a noisy measured output y_p which is considered as the actual value of a random variable Y_p = Y_p(x) depending on x. We assume that there exists an unknown regression function E(Y_p(x)) such that, for any fixed value of x:

    Y_p(x) = E(Y_p(x)) + W(x)                                                            (1)

where W(x) is thus a random variable with zero expectation. A family of parameterized functions {f(x, θ), x ∈ R^n, θ ∈ R^q} is called an assumed model. This assumed model is said to be true if there exists a value θ_p of θ such that, for all x in the input domain of interest, f(x, θ_p) = E(Y_p(x)). In the following, a data set of N input-output pairs {x^k, y_p^k} k=1 to N is available, where the x^k = [x_1^k x_2^k ... x_n^k]^T are non random n-vectors, and the y_p^k are the corresponding realizations of the random variables Y_p^k = Y_p(x^k) (footnote 1). The goal of the modeling procedure is not only to estimate the regression E(Y_p(x)) in the input domain of interest with the output of a model, but also to compute the value of a CI for the regression, that is the value of a random interval with a chosen probability to contain the regression.

For the presentation of the results of linear and nonlinear regression estimation, we deal with the true model (a model which is linear in the parameters in section II, a nonlinear one in section III), i.e. we consider that a family of functions containing the regression is known. In section IV, we consider the general realistic case of neural black-box modeling, where a preliminary selection among candidate neural models is first performed because a true model is not known a priori.

Footnote 1: We recall that we distinguish between random variables and their values (or realizations) by using upper- and lowercase letters, e.g. Y_p^k and y_p^k; all vectors are column vectors, and are denoted by boldface letters, e.g. the n-vectors x and x^k; non random matrices are denoted by light lowercase letters (except the unambiguous identity matrix).

II. CONFIDENCE INTERVALS FOR LINEAR MODELS

We consider a true linear assumed model, that is the associated family of linear functions {x^T θ, x ∈ R^n, θ ∈ R^n} contains the regression; (1) can thus be uniquely rewritten as:

    Y_p(x) = x^T θ_p + W(x)                                                              (2)

where θ_p is an unknown n-parameter vector. Model (2) associated to the data set leads to:

    Y_p = x θ_p + W                                                                      (3)

where x = [x^1 x^2 ... x^N]^T is the non random (N, n) input matrix, and Y_p = [Y_p^1 ... Y_p^N]^T and W = [W^1 W^2 ... W^N]^T are random N-vectors, with E(W) = 0. Geometrically, this means that E(Y_p(x)) = x θ_p belongs to the solution surface, the linear manifold m_x of the observation space R^N spanned by the columns of the input matrix (the range of x) (footnote 2). We assume that m_x is of dimension n, that is rank(x) = n. In other words, the model is identifiable, i.e. the data set is appropriately chosen, possibly using experimental design.

II.1. The linear least squares solution

The LS estimate θ_LS of θ_p minimizes the empirical quadratic cost function:

    J(θ) = 1/2 Σ_{k=1}^{N} (y_p^k - x^{k T} θ)^2 = 1/2 (y_p - x θ)^T (y_p - x θ)          (4)

The estimate θ_LS is a realization of the LS estimator Θ_LS, whose expression is:

    Θ_LS = (x^T x)^{-1} x^T Y_p = θ_p + (x^T x)^{-1} x^T W                               (5)

Since the assumed model is true, this estimator is unbiased. The orthogonal projection matrix on m_x is p_x = x (x^T x)^{-1} x^T. It follows from (5) that the unbiased LS estimator of E(Y_p(x)) is:

    x Θ_LS = x θ_p + p_x W                                                               (6)

that is, the sum of E(Y_p(x)) and of the projection of W on m_x, as shown in Figure 1.

Let R denote the residual random N-vector R = Y_p - x Θ_LS, that is the vector of the errors on the data set; then:

    R = (I_N - p_x) W                                                                    (7)

Under the assumption that the W^k are identically distributed and uncorrelated (homoscedastic), i.e. the noise covariance matrix is K_W = σ^2 I_N, it follows from (5) that the variance of the LS estimator of the regression for any input x of interest is (footnote 3):

    var(x^T Θ_LS) = σ^2 x^T (x^T x)^{-1} x                                               (8)

Using (7), we obtain the unbiased estimator S^2 = R^T R / (N - n) of σ^2; the corresponding (unbiased) estimate of the variance of x^T Θ_LS is thus:

    var(x^T Θ_LS) = s^2 x^T (x^T x)^{-1} x                                               (9)

where s^2 is the value of S^2.

II.2. Confidence intervals for a linear regression

If the W^k are homoscedastic gaussian variables, that is W ~ N_N(0, σ^2 I_N):

    ThL1. Θ_LS ~ N_n(θ_p, σ^2 (x^T x)^{-1})                                              (10)
    ThL2. R^T R / σ^2 ~ χ^2_{N-n}                                                        (11)
    ThL3. Θ_LS is statistically independent from R^T R.

Footnote 2: m_x is sometimes called the expectation surface [Seber & Wild 1989]; as a matter of fact, the solution surface coincides with the expectation surface only when the assumed model is true.
Footnote 3: We recall that x (boldface) is the (n, 1) input vector of interest, and that x is the experimental (N, n) input matrix.
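As a concrete illustration of equations (4)-(9) (this sketch is ours, not part of the original paper), the following Python code computes the LS estimate, the unbiased noise variance estimate s^2 and the variance estimate (9) at an arbitrary input, under the homoscedasticity assumption; the simulated data reproduce the affine SISO setting of process #1 introduced in section II.3.

```python
import numpy as np

def linear_ls_fit(x_mat, y):
    """LS estimate (5), noise variance estimate s^2 and prediction variance (9).

    x_mat : (N, n) experimental input matrix, assumed of full rank n
    y     : (N,) vector of measured outputs
    """
    N, n = x_mat.shape
    # theta_LS = (x^T x)^{-1} x^T y, computed with a numerically safer solver
    theta_ls, *_ = np.linalg.lstsq(x_mat, y, rcond=None)
    residuals = y - x_mat @ theta_ls                       # realization r of R, see (7)
    s2 = residuals @ residuals / (N - n)                   # unbiased estimate of sigma^2
    xtx_inv = np.linalg.inv(x_mat.T @ x_mat)

    def pred_var(x_new):
        # estimate (9) of the variance of x^T Theta_LS at the input x_new of interest
        return s2 * x_new @ xtx_inv @ x_new

    return theta_ls, s2, pred_var

# Affine SISO model y = theta_1 + theta_2 x on simulated data (process #1 setting)
rng = np.random.default_rng(0)
x_in = rng.uniform(-3.0, 3.0, size=30)
x_mat = np.column_stack([np.ones_like(x_in), x_in])        # (N, 2) input matrix
y = 1.0 + 1.0 * x_in + rng.normal(0.0, np.sqrt(0.5), 30)   # assumed sigma^2 = 0.5
theta_ls, s2, pred_var = linear_ls_fit(x_mat, y)
print(theta_ls, s2, pred_var(np.array([1.0, 2.0])))
```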

The proof of the above theorems follows from Figure 1 and from the Fisher-Cochran theorem [Goodwin & Payne 1977]; see for instance [Seber 1977].

The goal is to build a CI for the regression value E(Y_p(x)) = x^T θ_p, for any input vector x of interest. The variance σ^2 of the measurements being unknown, let us build a normalized centered gaussian variable where both E(Y_p(x)) and σ^2 appear:

    (x^T Θ_LS - E(Y_p(x))) / sqrt(σ^2 x^T (x^T x)^{-1} x) ~ N(0, 1)                      (12)

Thus, using the Pearson variable (11), which is independent from (12) according to ThL3, we obtain the following Student variable:

    (x^T Θ_LS - E(Y_p(x))) / sqrt(S^2 x^T (x^T x)^{-1} x) ~ Student(N - n)               (13)

A 100 (1 - α) % CI for E(Y_p(x)) is thus:

    x^T θ_LS ± t_{N-n}(α) sqrt(s^2 x^T (x^T x)^{-1} x)                                   (14)

where t_{N-n} is the inverse of the Student(N-n) cumulative distribution. Note that (14) allows to compute a CI corresponding to any input vector, and that it is much more informative than average values such as the mean square error on the data set, or the mean of the variance estimate over the data set (footnote 4); as a matter of fact, the latter invariably equals s^2 n / N.

Footnote 4: The mean of the variance estimate over the training data set is:
    1/N Σ_{k=1}^{N} s^2 x^{k T} (x^T x)^{-1} x^k = s^2/N Σ_{k=1}^{N} p_x,kk = s^2/N trace(p_x).
Since p_x is the orthogonal projection matrix on an n-dimensional subspace, trace(p_x) = n.

II.3. Example of a simulated linear SISO process (process #1)

We consider a simulated single input - single output (SISO) linear process:

    y_p^k = θ_p1 + θ_p2 x^k + w^k        k = 1 to N                                      (15)

We take θ_p1 = 1, θ_p2 = 1, σ^2 = 0.5, N = 30. The inputs x^k of the data set are uniformly distributed in [-3; 3], as shown in Figure 2a. The family of functions {θ_1 + θ_2 x, θ ∈ R^2} is considered, that is the assumed model is true, and we choose a confidence level of 99% (t_28(1%) = 2.76). The LS estimation leads to a value of s^2 slightly smaller than σ^2, i.e. it underestimates the noise variance. Figure 2b displays the estimate (9) of the variance of x^T Θ_LS, and the true variance (8). The estimator S^2 of the noise variance being unbiased, the difference between the estimated variance (9) and the (usually unknown) true variance (8) is only due to the particular values of the measurement noise. Figure 2a shows the regression E(Y_p(x)), the data set, the model output and the 99% CI for the regression computed with (14).
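The CI (14) itself only additionally requires a Student quantile. A minimal continuation of the previous sketch (again ours, not part of the original paper), where t_{N-n}(α) is taken as the two-sided 100(1-α)% quantile, consistently with the value t_28(1%) = 2.76 quoted above:

```python
import numpy as np
from scipy import stats

def linear_ci_halfwidth(x_new, x_mat, s2, alpha=0.01):
    """Half-width of the 100(1-alpha)% CI (14) for the regression at input x_new."""
    N, n = x_mat.shape
    t_quantile = stats.t.ppf(1.0 - alpha / 2.0, df=N - n)  # two-sided t_{N-n}(alpha)
    var_estimate = s2 * x_new @ np.linalg.inv(x_mat.T @ x_mat) @ x_new   # estimate (9)
    return t_quantile * np.sqrt(var_estimate)

# 99% CI at x = 2 for the affine model of process #1 (x_mat, s2, theta_ls from the previous sketch)
x_new = np.array([1.0, 2.0])
half_width = linear_ci_halfwidth(x_new, x_mat, s2, alpha=0.01)
print(x_new @ theta_ls - half_width, x_new @ theta_ls + half_width)
```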

III. APPROXIMATE CONFIDENCE INTERVALS FOR NONLINEAR MODELS

We consider a family of nonlinear functions {f(x, θ), x ∈ R^n, θ ∈ R^q} which contains the regression, that is the assumed model is true; (1) can thus be rewritten as:

    Y_p(x) = f(x, θ_p) + W(x)                                                            (16)

where θ_p is an unknown q-parameter vector. We denote by f(x, θ_p) the unknown vector [f(x^1, θ_p) ... f(x^k, θ_p) ... f(x^N, θ_p)]^T defined on the data set, thus:

    Y_p = f(x, θ_p) + W                                                                  (17)

As in section II, x denotes the (N, n) input matrix (footnote 5), and Y_p and W are random N-vectors with E(W) = 0. Geometrically, this means that E(Y_p(x)) belongs to the solution surface, the manifold m_f = {f(x, θ), θ ∈ R^q} of R^N.

III.1. The linear Taylor expansion of the nonlinear least squares solution

A LS estimate θ_LS of θ_p minimizes the empirical cost function (footnote 6):

    J(θ) = 1/2 Σ_{k=1}^{N} (y_p^k - f(x^k, θ))^2 = 1/2 (y_p - f(x, θ))^T (y_p - f(x, θ))     (18)

The estimate θ_LS is a realization of the LS estimator Θ_LS. Efficient algorithms are at our disposal for the minimization of the cost function (18): they can lead to an absolute minimum, but they do not give an analytic expression of the estimator that could be used to build CIs. In order to take advantage of the results concerning linear models, it is worthwhile considering a linear approximation of Θ_LS, which is obtained by writing the LTE of f(x, θ) around f(x, θ_p):

    f(x, θ) ≈ f(x, θ_p) + Σ_{r=1}^{q} [∂f(x, θ)/∂θ_r]_{θ=θ_p} (θ_r - θ_pr) = f(x, θ_p) + ξ_x^T (θ - θ_p)     (19)

where ξ_x = [∂f(x, θ)/∂θ]_{θ=θ_p} is the q-vector of the partial derivatives of the model output with respect to the parameters, at the input x of interest. Thus, with the matrix notation:

    f(x, θ) ≈ f(x, θ_p) + ξ (θ - θ_p)                                                    (20)

where ξ = [ξ^1 ξ^2 ... ξ^N]^T with ξ^k = [∂f(x^k, θ)/∂θ]_{θ=θ_p}. The (N, q) matrix ξ is the non random and unknown (since θ_p is unknown) Jacobian matrix of f. Using (20), one obtains, similarly to the linear case, the following approximation of Θ_LS (see Appendix 1.1 for a detailed derivation of (21) and (23)):

    Θ_LS ≈ θ_p + (ξ^T ξ)^{-1} ξ^T W                                                      (21)

The range m_ξ of ξ is tangent to the manifold m_f at θ = θ_p; this manifold is assumed to be of dimension q, hence rank(ξ) = q. The matrix p_ξ = ξ (ξ^T ξ)^{-1} ξ^T is the orthogonal projection matrix on m_ξ. From (20) and (21), the LS estimator of E(Y_p(x)) can be approximately expressed by:

    f(x, Θ_LS) ≈ f(x, θ_p) + p_ξ W                                                       (22)

i.e. it is approximately the sum of E(Y_p(x)) and of the projection of W on m_ξ, as illustrated in Figure 3. If K_W = σ^2 I_N (homoscedasticity), the variance of the model output, that is the LS estimator of the regression, for an input x is approximately:

    var f(x, Θ_LS) ≈ σ^2 ξ_x^T (ξ^T ξ)^{-1} ξ_x                                          (23)

In the following, approximation (23) will be termed the LTE approximation of the model output variance. Let R denote the LS residual vector R = Y_p - f(x, Θ_LS), thus:

    R ≈ (I_N - p_ξ) W                                                                    (24)

Under the assumption of appropriate regularity conditions on f, and for large N, the curvature of the solution surface m_f is small (footnote 7); thus, using (24), one obtains the asymptotically (i.e. when N tends to infinity) unbiased estimator S^2 = R^T R / (N - q) of σ^2.

In (23), the matrix ξ takes the place of the matrix x in the linear case. But, as opposed to x, ξ is unknown since it is a function of the unknown parameter θ_p. The (N, q) matrix ξ may be approximated by z = [z^1 z^2 ... z^N]^T, where:

    z^k = [∂f(x^k, θ)/∂θ]_{θ=θ_LS}                                                       (25)

In the following, we assume that rank(z) = q. Like the matrix ξ, the vector ξ_x is not available, and its value may be approximated by:

    z_x = [∂f(x, θ)/∂θ]_{θ=θ_LS}                                                         (26)

From (23), (25) and (26), the LTE estimate of the variance of the LS estimator of the regression for an input x is thus:

    var_LTE f(x, Θ_LS) = s^2 z_x^T (z^T z)^{-1} z_x                                      (27)

III.2. Approximate confidence intervals for a nonlinear regression

If W ~ N_N(0, σ^2 I_N), and for large N, it follows from the above relations and from linear LS theory [Seber & Wild 1989] that:

    ThNL1. Θ_LS approximately follows N_q(θ_p, σ^2 (ξ^T ξ)^{-1})                         (28)
    ThNL2. R^T R / σ^2 approximately follows χ^2_{N-q}                                   (29)
    ThNL3. Θ_LS is approximately statistically independent from R^T R.

Footnote 5: In the case of a nonlinear model, x is merely a two-dimensional array.
Footnote 6: In the case of a multilayer neural network, the minimum value of the cost function can be obtained for several values of the parameter vector; but, since the only function-preserving transformations are neuron exchanges, as well as sign flips for odd activation functions like the hyperbolic tangent [Sussmann 1992], we will legitimately consider the neighborhood of one of these values only.
Footnote 7: The curvature is usually decomposed in two components: i) the intrinsic curvature, which measures the degree of bending and twisting of the solution surface m_f, and ii) the parameter-effects curvature, which describes the degree of curvature induced by the choice of the parameters θ.
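To make (25)-(27) concrete, the following sketch (ours, not the authors' code) computes the matrix z and the LTE variance estimate for a SISO network with one layer of n_h tanh hidden units; the parameter layout and the function names are assumptions of this sketch, and the derivatives are written analytically rather than obtained by backpropagation.

```python
import numpy as np

def nn_output_and_jacobian(x, theta, n_h):
    """Output and parameter-Jacobian of a SISO network with n_h tanh hidden units.

    Assumed theta layout: [bias_out, w_out (n_h), bias_hidden (n_h), w_in (n_h)].
    Returns f(x, theta) and the (len(x), q) Jacobian of f w.r.t. theta, i.e. the
    matrix z of (25) when theta = theta_LS.
    """
    x = np.atleast_1d(x).astype(float)
    b0 = theta[0]
    w_out = theta[1:1 + n_h]
    b_h = theta[1 + n_h:1 + 2 * n_h]
    w_in = theta[1 + 2 * n_h:1 + 3 * n_h]
    a = b_h + np.outer(x, w_in)               # (N, n_h) hidden pre-activations
    t = np.tanh(a)                            # (N, n_h) hidden outputs
    f = b0 + t @ w_out                        # (N,) network outputs
    dt = (1.0 - t ** 2) * w_out               # (N, n_h) derivative w.r.t. pre-activations
    jac = np.column_stack([np.ones_like(x),   # d f / d bias_out
                           t,                 # d f / d w_out_j
                           dt,                # d f / d bias_hidden_j
                           dt * x[:, None]])  # d f / d w_in_j
    return f, jac

def lte_variance(x_new, x_data, residuals, theta_ls, n_h):
    """LTE estimate (27) of the variance of f(x_new, Theta_LS)."""
    N, q = len(x_data), len(theta_ls)
    s2 = residuals @ residuals / (N - q)                  # asymptotically unbiased estimate of sigma^2
    _, z = nn_output_and_jacobian(x_data, theta_ls, n_h)  # matrix z of (25)
    _, z_x = nn_output_and_jacobian(x_new, theta_ls, n_h)
    z_x = z_x[0]                                          # q-vector z_x of (26)
    return s2 * z_x @ np.linalg.solve(z.T @ z, z_x)
```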

Using (23) and (28), let us again build a quasi normalized and centered gaussian variable where both E(Y_p(x)) and σ^2 appear:

    (f(x, Θ_LS) - E(Y_p(x))) / sqrt(σ^2 ξ_x^T (ξ^T ξ)^{-1} ξ_x) ~ N(0, 1) approximately              (30)

Thus, the variable (29) being approximately independent from (30) according to ThNL3, we have:

    (f(x, Θ_LS) - E(Y_p(x))) / sqrt(S^2 ξ_x^T (ξ^T ξ)^{-1} ξ_x) ~ Student(N - q) approximately       (31)

A 100 (1 - α) % approximate CI for E(Y_p(x)) is thus:

    f(x, θ_LS) ± t_{N-q}(α) sqrt(s^2 z_x^T (z^T z)^{-1} z_x)                                         (32)

Note that, when N is large, the Student distribution is close to the normal distribution, and thus t_{N-q}(α) ≈ n(α), where n is the inverse of the normal cumulative distribution. Like in the linear case, (32) allows to compute a CI at any input x of interest, which gives much more information than the value of the mean variance estimate over the data set: as a matter of fact, the latter always approximately equals s^2 q / N.

From a practical point of view, the construction of a CI for a neural model output at any input x of interest involves once and for all the computation of the matrix z, that is the N x q partial derivatives of the model output with respect to the parameters evaluated at θ_LS for the data inputs {x^k} k=1 to N, and, for each x value, that of z_x, i.e. the derivatives at x. In the case of a neural model, these derivatives are easily obtained with the backpropagation algorithm.

All the previous results and the above considerations are valid provided an absolute minimum of the cost function (18) is reached. In real life, several estimations of the parameters must thus be made starting from different initial values, the estimate corresponding to the lowest minimum being kept, in order to have a high probability of obtaining an absolute minimum. In the examples of this work, the algorithm used for the minimization of the cost function is the Levenberg algorithm, as described for example in [Bates & Watts 1988], and several trainings are performed. The probability of getting trapped in a relative minimum increasing with the number of parameters of the network and decreasing with the size of the data set, the number of trainings is chosen accordingly.

III.3. Quality of the approximate confidence intervals

III.3.1. Theoretical analysis

The quality of the approximate CI depends essentially on the curvature of the solution surface m_f, which depends on the regularity of f and on the value of N. In practice, f is often regular enough for a first-order approximation to be satisfactory, provided that N is large enough. Thus, if N is large:
(i) as in the linear case, the estimator S^2 of the noise variance is unbiased, and the difference between s^2 and σ^2 is only due to the particular values of the noise;
(ii) the variance of f(x, Θ_LS) is small, and θ_LS is likely to be close to θ_p: z and z_x are thus likely to be good approximations of ξ and ξ_x respectively.
A reliable estimate of a CI may thus be obtained from the LTE variance estimate (27). On the other hand, if N is too small:
(i) as opposed to the linear case, even if the assumed model is true, the estimator S^2 of the noise variance is biased;
(ii) the variance of f(x, Θ_LS) is large, and θ_LS is likely to differ from θ_p: z and z_x risk being poor approximations of ξ and ξ_x.
Thus, if N is diagnosed as too small, one cannot have confidence in the confidence intervals (32), and additional data should be gathered.

III.3.2. Quantitative analysis

As detailed for example in [Bates & Watts 1988] [Seber & Wild 1989] [Antoniadis et al. 1992], different measures of the curvature can be computed, and can be used in each particular case to evaluate the accuracy of the LTE. In section IV.3, dealing with neural network modeling, we give indications on how to judge whether N is large enough for the approximate CI to be accurate.
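Combining the previous sketch with a Student quantile gives the approximate CI (32); again, this is an illustrative sketch of ours rather than the authors' implementation, and it reuses nn_output_and_jacobian and lte_variance from the sketch of section III.1.

```python
import numpy as np
from scipy import stats

def approximate_ci(x_new, x_data, residuals, theta_ls, n_h, alpha=0.01):
    """Approximate 100(1-alpha)% CI (32) for the regression at input x_new."""
    N, q = len(x_data), len(theta_ls)
    f_new, _ = nn_output_and_jacobian(x_new, theta_ls, n_h)
    var_lte = lte_variance(x_new, x_data, residuals, theta_ls, n_h)  # estimate (27)
    t_quantile = stats.t.ppf(1.0 - alpha / 2.0, df=N - q)            # two-sided t_{N-q}(alpha)
    half_width = t_quantile * np.sqrt(var_lte)
    return f_new[0] - half_width, f_new[0] + half_width
```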

In order to evaluate the accuracy of the LTE variance estimate (27) when dealing with simulated processes, we introduce an estimate of the unknown true variance of f(x, Θ_LS) that is not biased by curvature effects, the reference variance estimate s_ref^2(x). This estimate is computed on a large number M of other sets of N outputs corresponding to the inputs of the training set, and whose values are obtained with different realizations (simulated values) of the noise W. The i-th LS estimate f(x, θ_LS^(i)) of E(Y_p(x)) is computed with data set i (i = 1 to M), and the reference estimate of the variance at input x is computed as:

    s_ref^2(x) = 1/M Σ_{i=1}^{M} (f(x, θ_LS^(i)) - f_mean(x))^2,  where  f_mean(x) = 1/M Σ_{i=1}^{M} f(x, θ_LS^(i))     (33)

In the nonlinear case, we thus use three notions related to the (true) variance of f(x, Θ_LS):
1) the LTE variance approximation (23), which is a good approximation of the true variance if the curvature is small, as we show in section III.4.3, and which can be computed only when the process is simulated;
2) the LTE variance estimate (27), which is the estimate that can be computed in real life;
3) the reference variance estimate (33), which tends to the true variance when M tends to infinity because it is not biased by curvature effects, and which can be computed only when the process is simulated.

III.4. Illustrative examples

Since we are concerned with neural models, and since, in this section, the assumed model is true, the following examples bring into play neural processes, that is to say processes whose regression function is the output of a neural network; the more realistic case of arbitrary processes for which a family of nonlinear functions (a neural network with a given architecture) containing the regression is unknown is tackled in the next section.

III.4.1. Example of a simulated neural SISO process (process #2)

We consider a SISO process simulated by a neural network with one hidden layer of two hidden neurons with hyperbolic tangent activation function and a linear output neuron:

    y_p^k = θ_p1 + θ_p2 tanh(θ_p3 + θ_p4 x^k) + θ_p5 tanh(θ_p6 + θ_p7 x^k) + w^k        k = 1 to N     (34)

We take θ_p = [1; 2; 1; 2; -1; -1; 3]^T, σ^2 = 10^-2, N = 50. The inputs x^k of the data set are uniformly distributed in [-3; 3], as shown in Figure 4a. The family of functions {θ_1 + θ_2 tanh(θ_3 + θ_4 x) + θ_5 tanh(θ_6 + θ_7 x), θ ∈ R^7} is considered, that is the assumed model is true, and we choose a confidence level of 99% (t_43(1%) = 2.58). The cost function is minimized with the Levenberg algorithm. Figure 4b displays the LTE estimate (27) of the variance of f(x, Θ_LS), and the reference estimate (33) computed over a large number M of sets. Figure 4a shows the regression, the data set used for the LS estimation, and the corresponding model output and 99% CI (32). The model being true and the size N = 50 of the data set being relatively large with respect to the number of parameters and to the noise variance, we observe that:
(i) s^2 ≈ σ^2;
(ii) the model output is close to the regression, leading to good approximations of ξ and of ξ_x by z and z_x.
Thus, (i) and (ii) lead to an accurate LTE estimate of the variance, and hence of the CI.

III.4.2. Example of a simulated neural MISO process (process #3)

We consider a MISO process simulated by a neural network with two inputs, one hidden layer of two tanh hidden neurons and a linear output neuron:

    y_p^k = θ_p1 + θ_p2 tanh(θ_p3 + θ_p4 x_1^k + θ_p5 x_2^k) + θ_p6 tanh(θ_p7 + θ_p8 x_1^k + θ_p9 x_2^k) + w^k        k = 1 to N     (35)

We take θ_p = [1; 1; 2; 1; -1; -2; 2; 1; 1]^T, σ^2 = 10^-1, N = 100. The inputs x_1^k and x_2^k of the data set are uniformly distributed in [-3; 3]. As for process #2, the assumed model is true, i.e. the neural network associated to (35) is used; the cost function is minimized with the Levenberg algorithm. Figure 5a shows the inputs of the data set and the regression; Figure 5b displays the LTE estimate (27) of the variance of f(x, Θ_LS); Figure 5c displays the difference between the reference variance estimate (33), computed over a large number M of sets, and the LTE variance estimate (27).
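For a simulated process, the reference estimate (33) can be computed along the following lines (an illustrative sketch of ours; train_network, predict and the default M = 100 are placeholders and assumptions, the training routine being typically the Levenberg algorithm with multiple restarts as described above):

```python
import numpy as np

def reference_variance(x_new, x_data, regression, sigma2, train_network, predict, M=100, seed=0):
    """Monte Carlo reference estimate (33) of the variance of f(x_new, Theta_LS).

    regression(x)         : true regression function of the simulated process
    train_network(x, y)   : placeholder routine returning theta_LS from one data set
    predict(x_new, theta) : model output f(x_new, theta)
    M                     : number of simulated data sets (an assumption of this sketch)
    """
    rng = np.random.default_rng(seed)
    outputs = np.empty(M)
    for i in range(M):
        # simulate a new data set with the same inputs and a fresh noise realization
        y_i = regression(x_data) + rng.normal(0.0, np.sqrt(sigma2), size=len(x_data))
        theta_i = train_network(x_data, y_i)          # i-th LS estimate
        outputs[i] = predict(x_new, theta_i)          # f(x_new, theta_LS^(i))
    return np.mean((outputs - outputs.mean()) ** 2)   # estimate (33)
```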

Since s^2 is slightly smaller than σ^2, the variance is globally slightly underestimated. But except in the domain around the corner (3, 3), where the density of the inputs is lower and where the slope of the output surface is steep, the LTE variance estimate is satisfactory. A reliable estimate of the CI may thus be obtained.

III.4.3. Accuracy of the LTE variance approximation (processes #2 and #3)

Let us now show on the example of processes #2 and #3 that the curvature of the solution surface is small enough for the LTE approximation of the variance (23) to be satisfactory. For both processes, we have computed approximation (23), using the values of ξ and of ξ_x (at θ_p) and the value of σ^2 used for the noise simulation. As shown in Figures 6a and 6b for process #2, the LTE approximation of the variance (23) is very close to the reference variance estimate. As a matter of fact, the difference between them (Figure 6b) is only due to the curvature, which is small since N = 50 is large with respect to the complexity of the regression. Expression (23) also leads to satisfactory results in the case of process #3, as shown in Figures 7a and 7b (N = 100, two inputs). This tends to show that a first-order approximation of the variance is often sufficient, and that it is not worth bothering with a higher-order approximation. In [Seber & Wild 1989], a quadratic approximation of the LS estimator using the curvature of the solution surface is introduced. This approximation uses arrays of projected second derivatives, the intrinsic and parameter-effects curvature arrays; but their presentation is beyond the scope of this paper.

IV. CONFIDENCE INTERVALS FOR NEURAL NETWORKS

In the previous sections, the model used for the construction of CIs is true. For real-world black-box modeling problems however, a family of functions which contains the regression is not known a priori. The first task is thus to select the least complex family of functions which contains a function approximating the regression to a certain degree of accuracy in the input domain defined by the data set. In practice, several families of increasing complexity (for example neural networks with one layer of an increasing number n_h of hidden units, and a linear output neuron) are considered, and the data set is used both to estimate their parameters, and to perform the selection between the candidates. In order to retain the least complex family containing a good approximation of the regression, it is of interest to perform the selection only between neural candidates which are not unnecessarily large, and whose matrix z is sufficiently well-conditioned to allow the computation of the approximate CI (32). We propose to discard too large models by a systematic detection of ill-conditioning, and to perform the selection among the approved, i.e. well-conditioned, models using an approximate value of their LOO score whose computation does not require further trainings. Both the ill-conditioning detection and the estimation of the LOO score of a neural candidate are based on the LTE of its output.

IV.1. Ill-conditioning detection for model approval

A too large neural model, trained up to convergence with a simple LS cost function, will generally overfit. Overfitting is often avoided by using early stopping of the training algorithm, or by adding regularization terms to the cost function, e.g. weight decay [Bishop 1995]. Unfortunately, since only the estimator whose value corresponds to an absolute minimum of the quadratic cost function (18) without weight decay terms is unbiased, both early stopping and weight decay introduce a bias in the estimation of the regression: the corresponding estimates thus lead to questionable CIs for the regression.

To detect and discard too large networks, we propose, after the training of each candidate up to a (hopefully) absolute minimum of the cost function (18), to check the conditioning of its matrix z (see [Rivals & Personnaz 1998]). The fact that z is ill-conditioned is the symptom that some parameters are useless, since the elements of z represent the sensitivity of the model output with respect to the parameters. A typical situation is the saturation of a tanh hidden neuron, a situation which generates in the matrix z a column of +1 or -1, corresponding to the parameter between the output of the saturated hidden neuron and the linear output neuron, and columns of zeros, corresponding to the parameters between the network inputs and the saturated hidden neuron (footnote 8). In practice, we propose to perform a singular value factorization of z, and to compute its condition number, that is the ratio of its largest to its smallest singular value, see for example [Golub & Van Loan 1983]. The matrix z can be considered as very ill-conditioned when its condition number reaches the inverse of the computer precision, which is of the order of 10^16. Furthermore, in order to be able to compute the approximate CI (32), which involves (z^T z)^{-1}, the cross-product Jacobian matrix z^T z must also be well-conditioned. Since the condition number of z^T z is the square of the condition number of z, the networks whose matrix z has a condition number much larger than 10^8 cannot be approved.

There are other studies of the ill-conditioning of neural networks, but they deal with their training rather than with their approval, like in [Zhou and Si 1998], where an algorithm avoiding the Jacobian rank deficiency is presented, or [Saarinen et al. 1993], where the Hessian rank deficiency is studied during training. In our view, rank deficiency is not very relevant during the training since, with a Levenberg algorithm, the matrix to be inverted is made well-conditioned by the addition of a scalar matrix λ I_q, λ > 0, to the cross-product Jacobian.
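The approval test can be written in a few lines (an illustrative sketch of ours; the 10^8 bound is the criterion stated above, and z is the Jacobian matrix computed at θ_LS, e.g. as in the sketch of section III.1):

```python
import numpy as np

def approve_candidate(z, max_cond=1e8):
    """Approval test of section IV.1: accept a candidate only if its Jacobian
    matrix z (computed at theta_LS) is well-conditioned enough for (z^T z)^{-1},
    and hence the approximate CI (32), to be computable."""
    singular_values = np.linalg.svd(z, compute_uv=False)    # sorted in decreasing order
    cond_z = singular_values[0] / singular_values[-1]       # ratio largest / smallest
    return cond_z < max_cond, cond_z
```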
IV.2. Approximate leave-one-out scores for model selection

The selection among the networks which have been approved can be performed with statistical tests, for example [Urbani et al. 1994] [Rivals & Personnaz 1998]. Another approach, cross validation, consists in partitioning the data set into training and test sets, and in selecting the smallest network leading to the smallest mean square error on the test sets (footnote 9). One of the drawbacks of cross validation is to require a successful training of the candidate models on many test sets, that is N successful trainings in the case of LOO cross validation. Let us denote by e^k the error obtained on the left out example k with the model trained on the N-1 remaining examples (k-th test set). In this section, we derive an approximate expression of e^k, an expression which allows an economic estimation of the LOO score without performing these N time-consuming trainings of each candidate network, as proposed in [Monari 1999] [Monari & Dreyfus]. In the case of a linear model, it is well known [Efron & Tibshirani 1993] that the k-th LOO error e^k can be directly derived from the corresponding residual r^k:

    e^k = r^k / (1 - p_x,kk)   if p_x,kk < 1
    e^k = r^k = 0              if p_x,kk = 1        k = 1 to N                           (36)

where, we recall, p_x denotes the orthogonal projection matrix on the range of x. Expression (36) holds irrespective of whether or not the assumed model is true.

Footnote 8: Such a situation might also correspond to a relative minimum; checking the conditioning of z thus also helps to discard neural networks trapped in relative minima, and leads to retrain the neural candidate starting from different initial weights.
Footnote 9: Note that statistical tests may advantageously be used complementarily to cross validation in order to take a decision [Rivals & Personnaz 1999]; these tests can also be established by applying LS theory to the LTE of nonlinear models [Bates & Watts 1988], but this exceeds the scope of this paper.

In the case of a nonlinear model, we show (see Appendix 2) that a useful approximation of the k-th LOO error can be obtained using the LTE of the model output at θ_LS:

    e^k ≈ r^k / (1 - p_z,kk)        k = 1 to N                                           (37)

where p_z denotes the orthogonal projection matrix on the range of z. The approximation (37) is thus similar to (36) (footnote 10). In the case where (numerically) p_z,kk = 1, we choose to take e^k = r^k (the residual is not necessarily zero). Like in the linear case, expression (37) holds independently of whether the assumed model is true or not. Hence the LOO score:

    LOO score = 1/N Σ_{k=1}^{N} (e^k)^2                                                  (38)

This LOO score can be used as an estimate of the mean square performance error, and we thus denote it as MSPE, as opposed to 2 J(θ_LS) / N, the mean square training error (MSTE). The interested reader will find in [Monari 1999] [Monari & Dreyfus] a systematic model selection procedure based on both the approximate LOO score and the distribution of the values of the p_z,kk. Nevertheless, another performance measure could be chosen as well (a 10-fold cross validation score, a mean square error on an independent set, etc.).

IV.3. Accuracy of the approximate confidence intervals

The quality of the selected model f(x, θ_LS), and thus of the associated approximate CI, depends essentially on the size N of the available data set with respect to the complexity of the unknown regression function and to the noise variance σ^2:
1. N is large: it is likely that the selected family {f(x, θ), θ ∈ R^q} contains the regression E(Y_p(x)), i.e. that the LS estimator is asymptotically unbiased, that the model f(x, θ_LS) is a good approximation of E(Y_p(x)) in the domain of interest, and that the curvature is small. In this case, reliable CIs can be computed with (32).
2. N is small: it is likely that the selected family {f(x, θ), θ ∈ R^q} is too small (footnote 11) to contain E(Y_p(x)), i.e. that the LS estimator is biased, and that the model f(x, θ_LS) thus underfits. The approximate CIs are thus questionable, and additional data should be gathered.
A good indicator of whether the data set size N is large enough or not is the ratio MSPE/MSTE of the selected candidate: if its value is close to 1, then N is probably large enough, whereas a large value is the symptom of a too small data set size, as shown in Figure 8 (and as illustrated numerically in the following examples).

Footnote 10: An expression similar to (37) is proposed in [Hansen & Larsen 1996], but unfortunately, it is not valid even in the linear case.
Footnote 11: It will generally not be too large, since the approval procedure proposed in section IV.1 prevents from selecting a neural network with useless parameters.
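The approximate LOO errors (37), the LOO score (38) and the MSPE/MSTE ratio used as an indicator in section IV.3 can all be obtained from a single training, for example as follows (an illustrative sketch of ours; the leverages p_z,kk are computed from a QR factorization of z):

```python
import numpy as np

def approximate_loo_score(z, residuals):
    """Approximate LOO errors (37) and LOO score / MSPE (38) from a single training.

    z         : (N, q) Jacobian of the model output w.r.t. the parameters at theta_LS
    residuals : (N,) residuals r^k of the trained candidate on the data set
    """
    # leverages p_z,kk: diagonal of the orthogonal projection matrix on the range of z
    q_mat, _ = np.linalg.qr(z)
    leverages = np.sum(q_mat ** 2, axis=1)
    loo_errors = np.where(leverages < 1.0 - 1e-12,
                          residuals / (1.0 - leverages),   # e^k = r^k / (1 - p_z,kk)
                          residuals)                       # convention e^k = r^k if p_z,kk = 1
    mspe = np.mean(loo_errors ** 2)                        # LOO score (38)
    mste = np.mean(residuals ** 2)                         # MSTE = 2 J(theta_LS) / N
    return mspe, mspe / mste
```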

IV.4. Example of a simulated nonlinear SISO process (process #4)

This first example is based on a simulated process. Like in the previous sections, a reference estimate of the variance of the output of a neural network is made, using a large number M of other sets; in order to ensure that an absolute minimum is reached on each of the M sets, five to thirty trainings (depending on the network size) with the Levenberg algorithm for different initializations of the weights are performed, and the weights giving the smallest value of the cost function (18) are kept. We consider the SISO process simulated with:

    y_p^k = sinc(x^k) + w^k        k = 1 to N                                            (39)

where sinc denotes the cardinal sine function; we take σ^2 = 10^-2.

First, a data set of N = 200 input-output pairs is computed, with input values uniformly distributed in [-5; 5]. Since a family of nonlinear functions (a neural network with a given architecture) containing the regression is not known a priori, neural networks with a linear output neuron and a layer of n_h tanh hidden neurons are trained. The numerical results are summarized in Table 1. We list the number of parameters q, the MSTE (i.e. the smallest MSTE obtained with the network for its different random weight initializations), the condition number of z, and, if the latter is not too large, the MSPE (the corresponding approximate LOO score computed with (37) and (38)) and the ratio MSPE/MSTE. The candidates with more than 6 hidden neurons cannot be approved, because cond(z) >> 10^8. The optimal number of neurons n_h^opt = 4 is selected on the basis of the MSPE. The fact that the corresponding ratio MSPE/MSTE is close to 1 is the symptom that N is large enough, so that the selected family of networks contains a good approximation of the regression, and that the curvature is small (case 1 of section IV.3). The results obtained for the selected neural network are shown in Figure 9. The model output is close to the regression, the LTE variance estimate (27) is close to the reference variance estimate (33), and the CI is thus accurate.

Second, a data set of N = 30 input-output pairs is computed, the numerical results being summarized in Table 2. The data set being much smaller, the candidates cannot be approved as soon as n_h > 4: for n_h = 5, cond(z) >> 10^8. The optimal number of neurons n_h^opt = 2 is selected on the basis of the MSPE. The ratio MSPE/MSTE of the selected network equals 2.1, a symptom that N is relatively small, and that the selected family of networks is likely not to contain the regression (case 2 of section IV.3). The results obtained for the selected neural network are shown in Figure 10. The family of functions implemented by a network with two hidden units is obviously too small to contain a good approximation of the regression, and though the estimate of the output variance is good (it is close to the reference variance estimate), since the output of the neural network differs from the regression, the CI is less accurate than in the case where N = 200. Note that in the input domain [2; 5], where the model underfits, though there is a higher concentration of training examples around 1 and 3, the variance remains constant and low. This is due to the fact that, in this domain, the model output is insensitive to most parameters of the network (this is usually the case when, like here, the output of a network does not vary (footnote 12)): the elements of the z_x's in this domain are thus constant and small, hence a small and constant variance at the corresponding x's.

Footnote 12: The output of a neural network with one layer of tanh hidden units remains constant in a given domain of its inputs when the tanh activation functions of all hidden units are saturated in this domain: the output of the network is thus insensitive to all the parameters of the hidden units.
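The whole approval-and-selection loop illustrated in these examples can be summarized as follows (a sketch of ours, not the authors' code; train_candidate and jacobian are placeholder routines, and approve_candidate and approximate_loo_score are the sketches given after sections IV.1 and IV.2):

```python
import numpy as np

def select_architecture(train_candidate, jacobian, max_hidden=10):
    """Candidate approval and selection loop of section IV (illustrative sketch):
    train networks of increasing size, discard ill-conditioned candidates
    (section IV.1), and select the approved candidate with the smallest
    approximate LOO score (sections IV.2 and IV.3).

    train_candidate(n_h) -> (theta_ls, residuals)   placeholder training routine
    jacobian(n_h, theta_ls) -> z                    placeholder returning the (N, q) matrix z
    """
    best = None
    for n_h in range(1, max_hidden + 1):
        theta_ls, residuals = train_candidate(n_h)
        z = jacobian(n_h, theta_ls)
        approved, _ = approve_candidate(z)                 # ill-conditioning detection
        if not approved:
            continue                                       # skip unapproved candidates
        mspe, ratio = approximate_loo_score(z, residuals)  # equations (37)-(38)
        if best is None or mspe < best[1]:
            best = (n_h, mspe, ratio)
    return best   # (selected n_h, its MSPE, its MSPE/MSTE ratio)
```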

IV.5. Industrial modeling problem

We apply here the presented methodology (LS parameter estimation, model approval, model selection, CI construction) to an industrial example first tackled in [Rivals & Personnaz 1998], that is the modeling of a mechanical property of a complex material from three structural descriptors. We have been provided with a data set of N = 69 examples; the inputs and outputs are normalized for the LS estimations. Thanks to repetitions in the data, and assuming homoscedasticity, the pure error mean square [Draper & Smith 1998] gives a good estimate σ^2 of the noise variance. Using this reliable estimate, statistical tests establish the significance of two of the inputs. An affine model with these two inputs gives an estimate s^2 of the variance which is much larger than the pure error estimate, hence the necessity of nonlinear modeling.

Neural networks with a linear output neuron and a layer of n_h tanh hidden neurons are trained. The numerical results are summarized in Table 3. It shows that the candidates with more than 3 hidden neurons cannot be approved: for n_h = 4, cond(z) >> 10^8. The optimal number of neurons n_h^opt = 2 is selected on the basis of the MSPE. The fact that the corresponding ratio MSPE/MSTE equals 1.3 indicates that N is large enough, so that the selected family of networks probably contains a good approximation of the regression, and that the curvature is small (case 1 of section IV.3). The function implemented by the selected network is shown in Figure 11.

The N = 69 outputs of the training set are presented in increasing order in Figure 12a, and the corresponding residuals and approximate LOO errors in Figure 12b: both appear quite uncorrelated and homoscedastic. A CI with a level of significance of 95% is then computed with (32); the half width of the 95% CI on the N = 69 examples of the data set is shown in Figure 12c. In order to check the confidence which can be attached to the model, the variance of its output must be examined in the whole input domain of interest. Figure 13 shows the isocontours of the LTE estimate s sqrt(z_x^T (z^T z)^{-1} z_x) of the standard deviation of the model output in the input domain defined by the training set. The computation of the LTE variance estimate thus allows not only to construct a CI at any input of interest, but also to diagnose that, at the top right corner of the input domain, the model standard deviation is larger than that of the noise itself (the highest isocontour value equals that of the estimate s of the noise standard deviation). Little confidence can thus be attached to the model output in this input domain, where more data should be gathered. On the opposite, there is a large region on the left of the diagram where there are very few training examples, but where the LTE estimate of the standard deviation is surprisingly rather small: like for the modeling of process #4 in section IV.4, this is due to the fact that the model output is little sensitive to most parameters of the network in this region (the model output varies very little, see Figure 11).

V. COMPARISONS

In this section, we discuss the advantages of the LS LTE approach to the construction of confidence intervals for neural networks with respect to other analytic approaches and to the bootstrap methods, and compare them on simulated examples.

V.1. Comparison to other analytic approaches

Maximum likelihood approach
In the case of gaussian homoscedastic data, likelihood theory leads to the same approximate variance (23), but two different estimators of it are commonly encountered (see Appendix 1.2):

    var_LTE f(x, Θ_LS) = s^2 z_x^T (z^T z)^{-1} z_x

i.e. the same estimate as (27), and also:

    var_Hessian f(x, Θ_LS) = s^2 z_x^T h(θ_LS)^{-1} z_x                                  (40)

which necessitates the computation of the Hessian. Efficient methods for computing the Hessian are presented in [Buntine & Weigend 1994].

Bayesian approach
The Bayesian approach is an alternative to sampling theory (or frequentist approach) for modeling problems, and also leads to the design of CIs. These two approaches are conceptually very different: the Bayesian approach treats the unknown parameters as random variables, whereas they are considered as certain in the frequentist approach. Nevertheless, as presented for example in [MacKay 1992a] [MacKay 1992b] [Bishop 1995], the Bayesian approach leads to a posterior distribution of the parameters with a covariance matrix whose expression is very similar to that of the covariance matrix of the least squares estimator of the parameters, and thus to CIs which are similar to those presented in this paper. We thus make a brief comparison between the CIs these two approaches lead to.

The most important difference is that the estimator which is considered here is the one whose estimate minimizes the cost function (18), whereas in the Bayesian approach, a cost function with an additional weight-decay regularization term is minimized; the presence of this weight-decay term stems from the assumption of a gaussian prior for the parameters. Nevertheless, the least squares cost function (18) can be seen as the limit where the regularization term is zero, which corresponds to an uninformative prior for the parameters. In this case (that is, when (18) is minimized as in this paper), there is another small difference in the Bayesian approach as presented in [MacKay 1992a] [MacKay 1992b] [Bishop 1995]. Under hypotheses which we cannot detail here, the Bayesian approach leads to a posterior parameter distribution with approximate covariance matrix s^2 h(θ_LS)^{-1}, h(θ_LS) being the Hessian of the cost function evaluated at the most probable value of the parameters, that is here θ_LS. A LTE of the estimator output then leads to the following estimate of its variance at input x:

    var_Hessian f(x, Θ_LS) = s^2 z_x^T h(θ_LS)^{-1} z_x

i.e. it also leads to estimate (40).

Sandwich estimator
The sandwich estimate of the variance of a nonlinear model output can be derived in various frameworks (a possible derivation in the frequentist approach is given in Appendix 1.3):

    var_sandwich f(x, θ_LS) = s^2 z_x^T h(θ_LS)^{-1} (z^T z) h(θ_LS)^{-1} z_x            (41)

The sandwich estimator is known to be robust to model incorrectness, i.e. when the assumptions about the noise are incorrect or the considered family of functions is too small, see for example [Efron & Tibshirani 1993] [Ripley 1995].

Numerical comparison (processes #5 and #6)
Here, we perform a numerical comparison of the three variance estimates considered above on a very simple example. We consider a SISO process simulated by a single tanh neuron:

    y_p^k = tanh(θ_p1 + θ_p2 x^k) + w^k        k = 1 to N                                (42)

with σ^2 = 0.1, N = 30. For this comparison, the noise variance σ^2 is estimated with s^2 in the three (LTE, Hessian, sandwich) output variance estimates.

We first simulate a process with θ_p1 = 0, θ_p2 = 1 (process #5). The corresponding results are shown in Figure 14. The variance reference estimate is computed on a large number M of data sets. The LTE approximation (23) of the variance is almost perfect. The LTE (27), Hessian (40), and sandwich (41) estimates are comparable: the parameter estimates being accurate (θ_LS1 ≈ 0, θ_LS2 = 0.996), the fact that they are overestimated is almost only due to the noise variance estimate s^2. Nevertheless, the shape of the LTE estimate is closer to the reference estimate than that of the two others.

We then simulate a process with θ_p1 = 0, θ_p2 = 5 (process #6). The corresponding results are shown in Figure 15. The function being steeper, the curvature is larger, and the LTE approximation (23) of the variance is a little less accurate. The three estimates are still very similar but, here, their overestimation is due not only to the noise variance estimate s^2, but also to the bias of the parameter estimates (θ_LS1 ≈ 0, θ_LS2 = 6.58). The computational cost of the LTE estimate being lower (it does not necessitate the computation of the Hessian matrix), there is no reason to prefer one of the two other estimates. As a matter of fact, since the Hessian depends on the data set, it is the realization of a random matrix.
Thus, in the maximum likelihood as well as in the Bayesian approach, it is often recommended to take the expectation of the Hessian, and to evaluate it at the available θ_LS, i.e. to replace it by the cross-product Jacobian z^T z [Seber & Wild 1989]: estimates (40) and (41) then reduce to estimate (27). As mentioned above, the sandwich variance estimator is known to be robust to model incorrectness, a property which is not tested with this simple setting, but this is beyond the scope of this paper.
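For completeness, here is how the Hessian estimate (40) and the sandwich estimate (41) can be computed once the Hessian h(θ_LS) of the cost function (18) is available (an illustrative sketch of ours; the Hessian itself must be supplied, e.g. computed with the methods discussed in [Buntine & Weigend 1994] or by finite differences):

```python
import numpy as np

def hessian_and_sandwich_variance(z, z_x, residuals, hessian):
    """Hessian estimate (40) and sandwich estimate (41) of the output variance.

    z        : (N, q) Jacobian of the model output at theta_LS (matrix z)
    z_x      : (q,) derivative vector at the input x of interest
    residuals: (N,) residuals at theta_LS
    hessian  : (q, q) Hessian h(theta_LS) of the cost function (18)
    """
    N, q = z.shape
    s2 = residuals @ residuals / (N - q)
    h_inv_zx = np.linalg.solve(hessian, z_x)
    var_hessian = s2 * z_x @ h_inv_zx                      # estimate (40)
    var_sandwich = s2 * h_inv_zx @ (z.T @ z) @ h_inv_zx    # estimate (41)
    return var_hessian, var_sandwich
```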

V.2. Comparison to bootstrap approaches

The bootstrap works by creating many pseudo replicates of the data set, the bootstrap sets, and reestimating the LS solution (retraining the neural network) on each bootstrap set; the variance of the neural model output, and the associated CI, are then computed over the trained networks, typically a hundred [Efron & Tibshirani 1993]. In the bootstrap pairs approach for example, a bootstrap set is created by sampling with replacement from the data set [Efron & Tibshirani 1993]. The first advantage of the LS LTE approach is to require only one successful training of the network on the data set to compute the LTE estimate of the variance of its output, whereas the bootstrap methods require a hundred successful trainings of the network on the different bootstrap sets.

Studies on bootstrap where only one training with a random initialization of the weights was performed for each bootstrap set show a pathological overestimation of the variance. This can be seen in [Tibshirani 1996], examples 2 and 3; but the overestimation of the bootstrap is not detected in this work because the reference estimate is also overestimated for the same reasons (one single training per set). As pointed out in [Refenes et al. 1997], a way to reduce this overestimation is to start each training on a bootstrap set with the weights giving the smallest value of the cost function (18) on the original data set; but even so, the bootstrap method becomes untractable for large networks, and/or for multi input processes.

The claim that bootstrap methods are especially efficient for problems with small data sets (see [Heskes 1997] for example) may be subject to criticism. As an illustration, the variance was estimated for process #2 using the bootstrap pairs approach, the network weights being initialized twice for each training, once with the true ones, and once with those obtained by training the network on the whole data set. As shown in Figure 16, though the size of the data set is not very small (N = 50), the bootstrap variance estimate is far away from the reference estimate. Increasing the number of bootstrap sets did not improve the variance estimate.

In fact, the bootstrap is especially suited to the estimation of the variance of estimators defined by a formula, like for example an estimator of a correlation coefficient [Efron & Tibshirani 1993]: for each bootstrap set, an estimate is computed using the formula, and the estimate of the variance is easily obtained. But the bootstrap is definitely not the best method if each estimation associated to a bootstrap set involves an iterative algorithm like the training of a neural network, which is the case for the construction of a CI with a neural model. However, if the data set is large enough, and if the training time is considered unimportant, the bootstrap pairs approach is a solution in the case of heteroscedasticity (that is, K_W is not scalar anymore), whereas the LS LTE approach, as well as the bootstrap residuals approach [Efron & Tibshirani 1993], are no longer valid.
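For reference, a bootstrap pairs estimate of the output variance can be sketched as follows (ours, for comparison purposes only; train_network and predict are placeholders, and each call to train_network should itself involve several restarts to reach an absolute minimum, which is precisely what makes the approach computationally costly):

```python
import numpy as np

def bootstrap_pairs_variance(x_new, x_data, y_data, train_network, predict, B=100, seed=0):
    """Bootstrap pairs estimate of the output variance at x_new, for comparison
    with the LTE estimate (27).

    x_data, y_data : numpy arrays holding the original data set
    train_network  : placeholder routine returning theta_LS from one bootstrap set
    predict        : placeholder routine returning f(x_new, theta)
    """
    rng = np.random.default_rng(seed)
    N = len(y_data)
    outputs = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, N, size=N)           # sample N pairs with replacement
        theta_b = train_network(x_data[idx], y_data[idx])
        outputs[b] = predict(x_new, theta_b)
    return outputs.var(ddof=1)                     # variance over the B retrained networks
```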
VI. CONCLUSION

We have given a thorough analysis of the LS LTE approach to the construction of CIs for a nonlinear regression using neural network models, and put emphasis on its enlightening geometric interpretation. We have stressed the underlying assumptions, in particular the fact that the approval and selection procedures must have led to a parsimonious, well-conditioned model containing a good approximation of the regression. Our whole methodology (LS parameter estimation, model approval, model selection, CI construction) has been illustrated on representative examples, bringing into play simulated processes and an industrial one.


More information

10-701/ Machine Learning, Fall

10-701/ Machine Learning, Fall 0-70/5-78 Machine Learning, Fall 2003 Homework 2 Solution If you have questions, please contact Jiayong Zhang .. (Error Function) The sum-of-squares error is the most common training

More information

Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap

Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap Dale J. Poirier University of California, Irvine September 1, 2008 Abstract This paper

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

HANDBOOK OF APPLICABLE MATHEMATICS

HANDBOOK OF APPLICABLE MATHEMATICS HANDBOOK OF APPLICABLE MATHEMATICS Chief Editor: Walter Ledermann Volume VI: Statistics PART A Edited by Emlyn Lloyd University of Lancaster A Wiley-Interscience Publication JOHN WILEY & SONS Chichester

More information

1 Confidence Intervals and Prediction Intervals for Feed-Forward Neural Networks

1 Confidence Intervals and Prediction Intervals for Feed-Forward Neural Networks Confidence Intervals and Prediction Intervals for Feed-Forward Neural Networks Richard Dybowski King s College London Stephen J. Roberts Imperial College, London Artificial neural networks have been used

More information

1 What a Neural Network Computes

1 What a Neural Network Computes Neural Networks 1 What a Neural Network Computes To begin with, we will discuss fully connected feed-forward neural networks, also known as multilayer perceptrons. A feedforward neural network consists

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Robustness of Principal Components

Robustness of Principal Components PCA for Clustering An objective of principal components analysis is to identify linear combinations of the original variables that are useful in accounting for the variation in those original variables.

More information

Discriminant analysis and supervised classification

Discriminant analysis and supervised classification Discriminant analysis and supervised classification Angela Montanari 1 Linear discriminant analysis Linear discriminant analysis (LDA) also known as Fisher s linear discriminant analysis or as Canonical

More information

SGN Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection

SGN Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection SG 21006 Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection Ioan Tabus Department of Signal Processing Tampere University of Technology Finland 1 / 28

More information

Sigmaplot di Systat Software

Sigmaplot di Systat Software Sigmaplot di Systat Software SigmaPlot Has Extensive Statistical Analysis Features SigmaPlot is now bundled with SigmaStat as an easy-to-use package for complete graphing and data analysis. The statistical

More information

Input Selection for Long-Term Prediction of Time Series

Input Selection for Long-Term Prediction of Time Series Input Selection for Long-Term Prediction of Time Series Jarkko Tikka, Jaakko Hollmén, and Amaury Lendasse Helsinki University of Technology, Laboratory of Computer and Information Science, P.O. Box 54,

More information

Bayesian Feature Selection with Strongly Regularizing Priors Maps to the Ising Model

Bayesian Feature Selection with Strongly Regularizing Priors Maps to the Ising Model LETTER Communicated by Ilya M. Nemenman Bayesian Feature Selection with Strongly Regularizing Priors Maps to the Ising Model Charles K. Fisher charleskennethfisher@gmail.com Pankaj Mehta pankajm@bu.edu

More information

Statistics 910, #5 1. Regression Methods

Statistics 910, #5 1. Regression Methods Statistics 910, #5 1 Overview Regression Methods 1. Idea: effects of dependence 2. Examples of estimation (in R) 3. Review of regression 4. Comparisons and relative efficiencies Idea Decomposition Well-known

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

Functional Preprocessing for Multilayer Perceptrons

Functional Preprocessing for Multilayer Perceptrons Functional Preprocessing for Multilayer Perceptrons Fabrice Rossi and Brieuc Conan-Guez Projet AxIS, INRIA, Domaine de Voluceau, Rocquencourt, B.P. 105 78153 Le Chesnay Cedex, France CEREMADE, UMR CNRS

More information

Neural Networks: Optimization & Regularization

Neural Networks: Optimization & Regularization Neural Networks: Optimization & Regularization Shan-Hung Wu shwu@cs.nthu.edu.tw Department of Computer Science, National Tsing Hua University, Taiwan Machine Learning Shan-Hung Wu (CS, NTHU) NN Opt & Reg

More information

Chapter 4 Neural Networks in System Identification

Chapter 4 Neural Networks in System Identification Chapter 4 Neural Networks in System Identification Gábor HORVÁTH Department of Measurement and Information Systems Budapest University of Technology and Economics Magyar tudósok körútja 2, 52 Budapest,

More information

Machine Learning and Data Mining. Linear regression. Kalev Kask

Machine Learning and Data Mining. Linear regression. Kalev Kask Machine Learning and Data Mining Linear regression Kalev Kask Supervised learning Notation Features x Targets y Predictions ŷ Parameters q Learning algorithm Program ( Learner ) Change q Improve performance

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Statistically-Based Regularization Parameter Estimation for Large Scale Problems

Statistically-Based Regularization Parameter Estimation for Large Scale Problems Statistically-Based Regularization Parameter Estimation for Large Scale Problems Rosemary Renaut Joint work with Jodi Mead and Iveta Hnetynkova March 1, 2010 National Science Foundation: Division of Computational

More information

Frequentist-Bayesian Model Comparisons: A Simple Example

Frequentist-Bayesian Model Comparisons: A Simple Example Frequentist-Bayesian Model Comparisons: A Simple Example Consider data that consist of a signal y with additive noise: Data vector (N elements): D = y + n The additive noise n has zero mean and diagonal

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Business Economics BUSINESS ECONOMICS. PAPER No. : 8, FUNDAMENTALS OF ECONOMETRICS MODULE No. : 3, GAUSS MARKOV THEOREM

Business Economics BUSINESS ECONOMICS. PAPER No. : 8, FUNDAMENTALS OF ECONOMETRICS MODULE No. : 3, GAUSS MARKOV THEOREM Subject Business Economics Paper No and Title Module No and Title Module Tag 8, Fundamentals of Econometrics 3, The gauss Markov theorem BSE_P8_M3 1 TABLE OF CONTENTS 1. INTRODUCTION 2. ASSUMPTIONS OF

More information

Curvature measures for generalized linear models

Curvature measures for generalized linear models University of Wollongong Research Online University of Wollongong Thesis Collection 1954-2016 University of Wollongong Thesis Collections 1999 Curvature measures for generalized linear models Bernard A.

More information

"ZERO-POINT" IN THE EVALUATION OF MARTENS HARDNESS UNCERTAINTY

ZERO-POINT IN THE EVALUATION OF MARTENS HARDNESS UNCERTAINTY "ZERO-POINT" IN THE EVALUATION OF MARTENS HARDNESS UNCERTAINTY Professor Giulio Barbato, PhD Student Gabriele Brondino, Researcher Maurizio Galetto, Professor Grazia Vicario Politecnico di Torino Abstract

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

Multivariate Distributions

Multivariate Distributions IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands

Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Elizabeth C. Mannshardt-Shamseldin Advisor: Richard L. Smith Duke University Department

More information

Long-Run Covariability

Long-Run Covariability Long-Run Covariability Ulrich K. Müller and Mark W. Watson Princeton University October 2016 Motivation Study the long-run covariability/relationship between economic variables great ratios, long-run Phillips

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Design of Screening Experiments with Partial Replication

Design of Screening Experiments with Partial Replication Design of Screening Experiments with Partial Replication David J. Edwards Department of Statistical Sciences & Operations Research Virginia Commonwealth University Robert D. Leonard Department of Information

More information

Logistic Regression: Regression with a Binary Dependent Variable

Logistic Regression: Regression with a Binary Dependent Variable Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression

More information

Recall the Basics of Hypothesis Testing

Recall the Basics of Hypothesis Testing Recall the Basics of Hypothesis Testing The level of significance α, (size of test) is defined as the probability of X falling in w (rejecting H 0 ) when H 0 is true: P(X w H 0 ) = α. H 0 TRUE H 1 TRUE

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Outline Lecture 2 2(32)

Outline Lecture 2 2(32) Outline Lecture (3), Lecture Linear Regression and Classification it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones Thomas Schön Division of Automatic

More information

DS-GA 1002 Lecture notes 10 November 23, Linear models

DS-GA 1002 Lecture notes 10 November 23, Linear models DS-GA 2 Lecture notes November 23, 2 Linear functions Linear models A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions.

More information

Multivariate Analysis Techniques in HEP

Multivariate Analysis Techniques in HEP Multivariate Analysis Techniques in HEP Jan Therhaag IKTP Institutsseminar, Dresden, January 31 st 2013 Multivariate analysis in a nutshell Neural networks: Defeating the black box Boosted Decision Trees:

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018 Econometrics I KS Module 2: Multivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: April 16, 2018 Alexander Ahammer (JKU) Module 2: Multivariate

More information

Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

More information

Neural Networks and Ensemble Methods for Classification

Neural Networks and Ensemble Methods for Classification Neural Networks and Ensemble Methods for Classification NEURAL NETWORKS 2 Neural Networks A neural network is a set of connected input/output units (neurons) where each connection has a weight associated

More information

INTRODUCTION TO PATTERN RECOGNITION

INTRODUCTION TO PATTERN RECOGNITION INTRODUCTION TO PATTERN RECOGNITION INSTRUCTOR: WEI DING 1 Pattern Recognition Automatic discovery of regularities in data through the use of computer algorithms With the use of these regularities to take

More information

2 Statistical Estimation: Basic Concepts

2 Statistical Estimation: Basic Concepts Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof. N. Shimkin 2 Statistical Estimation:

More information

Tutorial on Machine Learning for Advanced Electronics

Tutorial on Machine Learning for Advanced Electronics Tutorial on Machine Learning for Advanced Electronics Maxim Raginsky March 2017 Part I (Some) Theory and Principles Machine Learning: estimation of dependencies from empirical data (V. Vapnik) enabling

More information

DOA Estimation using MUSIC and Root MUSIC Methods

DOA Estimation using MUSIC and Root MUSIC Methods DOA Estimation using MUSIC and Root MUSIC Methods EE602 Statistical signal Processing 4/13/2009 Presented By: Chhavipreet Singh(Y515) Siddharth Sahoo(Y5827447) 2 Table of Contents 1 Introduction... 3 2

More information

Outline Introduction OLS Design of experiments Regression. Metamodeling. ME598/494 Lecture. Max Yi Ren

Outline Introduction OLS Design of experiments Regression. Metamodeling. ME598/494 Lecture. Max Yi Ren 1 / 34 Metamodeling ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University March 1, 2015 2 / 34 1. preliminaries 1.1 motivation 1.2 ordinary least square 1.3 information

More information

* Tuesday 17 January :30-16:30 (2 hours) Recored on ESSE3 General introduction to the course.

* Tuesday 17 January :30-16:30 (2 hours) Recored on ESSE3 General introduction to the course. Name of the course Statistical methods and data analysis Audience The course is intended for students of the first or second year of the Graduate School in Materials Engineering. The aim of the course

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V

More information

Bayesian Interpretations of Heteroskedastic Consistent Covariance. Estimators Using the Informed Bayesian Bootstrap

Bayesian Interpretations of Heteroskedastic Consistent Covariance. Estimators Using the Informed Bayesian Bootstrap Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap Dale J. Poirier University of California, Irvine May 22, 2009 Abstract This paper provides

More information

LECTURE 5 HYPOTHESIS TESTING

LECTURE 5 HYPOTHESIS TESTING October 25, 2016 LECTURE 5 HYPOTHESIS TESTING Basic concepts In this lecture we continue to discuss the normal classical linear regression defined by Assumptions A1-A5. Let θ Θ R d be a parameter of interest.

More information

Learning features by contrasting natural images with noise

Learning features by contrasting natural images with noise Learning features by contrasting natural images with noise Michael Gutmann 1 and Aapo Hyvärinen 12 1 Dept. of Computer Science and HIIT, University of Helsinki, P.O. Box 68, FIN-00014 University of Helsinki,

More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

II. DIFFERENTIABLE MANIFOLDS. Washington Mio CENTER FOR APPLIED VISION AND IMAGING SCIENCES

II. DIFFERENTIABLE MANIFOLDS. Washington Mio CENTER FOR APPLIED VISION AND IMAGING SCIENCES II. DIFFERENTIABLE MANIFOLDS Washington Mio Anuj Srivastava and Xiuwen Liu (Illustrations by D. Badlyans) CENTER FOR APPLIED VISION AND IMAGING SCIENCES Florida State University WHY MANIFOLDS? Non-linearity

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

An Introduction to Statistical Machine Learning - Theoretical Aspects -

An Introduction to Statistical Machine Learning - Theoretical Aspects - An Introduction to Statistical Machine Learning - Theoretical Aspects - Samy Bengio bengio@idiap.ch Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) CP 592, rue du Simplon 4 1920 Martigny,

More information

8.6 Bayesian neural networks (BNN) [Book, Sect. 6.7]

8.6 Bayesian neural networks (BNN) [Book, Sect. 6.7] 8.6 Bayesian neural networks (BNN) [Book, Sect. 6.7] While cross-validation allows one to find the weight penalty parameters which would give the model good generalization capability, the separation of

More information

A Note on Bootstraps and Robustness. Tony Lancaster, Brown University, December 2003.

A Note on Bootstraps and Robustness. Tony Lancaster, Brown University, December 2003. A Note on Bootstraps and Robustness Tony Lancaster, Brown University, December 2003. In this note we consider several versions of the bootstrap and argue that it is helpful in explaining and thinking about

More information

ECE662: Pattern Recognition and Decision Making Processes: HW TWO

ECE662: Pattern Recognition and Decision Making Processes: HW TWO ECE662: Pattern Recognition and Decision Making Processes: HW TWO Purdue University Department of Electrical and Computer Engineering West Lafayette, INDIANA, USA Abstract. In this report experiments are

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

Fixed Effects, Invariance, and Spatial Variation in Intergenerational Mobility

Fixed Effects, Invariance, and Spatial Variation in Intergenerational Mobility American Economic Review: Papers & Proceedings 2016, 106(5): 400 404 http://dx.doi.org/10.1257/aer.p20161082 Fixed Effects, Invariance, and Spatial Variation in Intergenerational Mobility By Gary Chamberlain*

More information

CSC 578 Neural Networks and Deep Learning

CSC 578 Neural Networks and Deep Learning CSC 578 Neural Networks and Deep Learning Fall 2018/19 3. Improving Neural Networks (Some figures adapted from NNDL book) 1 Various Approaches to Improve Neural Networks 1. Cost functions Quadratic Cross

More information

Bayesian Linear Regression [DRAFT - In Progress]

Bayesian Linear Regression [DRAFT - In Progress] Bayesian Linear Regression [DRAFT - In Progress] David S. Rosenberg Abstract Here we develop some basics of Bayesian linear regression. Most of the calculations for this document come from the basic theory

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks Stephan Dreiseitl University of Applied Sciences Upper Austria at Hagenberg Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support Knowledge

More information

Joint Estimation of Risk Preferences and Technology: Further Discussion

Joint Estimation of Risk Preferences and Technology: Further Discussion Joint Estimation of Risk Preferences and Technology: Further Discussion Feng Wu Research Associate Gulf Coast Research and Education Center University of Florida Zhengfei Guan Assistant Professor Gulf

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Lecture 3 September 1

Lecture 3 September 1 STAT 383C: Statistical Modeling I Fall 2016 Lecture 3 September 1 Lecturer: Purnamrita Sarkar Scribe: Giorgio Paulon, Carlos Zanini Disclaimer: These scribe notes have been slightly proofread and may have

More information

Graduate Econometrics I: Unbiased Estimation

Graduate Econometrics I: Unbiased Estimation Graduate Econometrics I: Unbiased Estimation Yves Dominicy Université libre de Bruxelles Solvay Brussels School of Economics and Management ECARES Yves Dominicy Graduate Econometrics I: Unbiased Estimation

More information

TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST

TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST Econometrics Working Paper EWP0402 ISSN 1485-6441 Department of Economics TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST Lauren Bin Dong & David E. A. Giles Department

More information

g-priors for Linear Regression

g-priors for Linear Regression Stat60: Bayesian Modeling and Inference Lecture Date: March 15, 010 g-priors for Linear Regression Lecturer: Michael I. Jordan Scribe: Andrew H. Chan 1 Linear regression and g-priors In the last lecture,

More information

Paper CD Erika Larsen and Timothy E. O Brien Loyola University Chicago

Paper CD Erika Larsen and Timothy E. O Brien Loyola University Chicago Abstract: Paper CD-02 2015 SAS Software as an Essential Tool in Statistical Consulting and Research Erika Larsen and Timothy E. O Brien Loyola University Chicago Modelling in bioassay often uses linear,

More information

Supplementary material (Additional file 1)

Supplementary material (Additional file 1) Supplementary material (Additional file 1) Contents I. Bias-Variance Decomposition II. Supplementary Figures: Simulation model 2 and real data sets III. Supplementary Figures: Simulation model 1 IV. Molecule

More information