
To appear in Neural Networks.

CONSTRUCTION OF CONFIDENCE INTERVALS FOR NEURAL NETWORKS BASED ON LEAST SQUARES ESTIMATION

Isabelle Rivals and Léon Personnaz
Laboratoire d'Électronique, École Supérieure de Physique et de Chimie Industrielles, 10 rue Vauquelin, 75231 Paris Cedex 05, France. E-mail: Isabelle.Rivals@espci.fr

Abstract

We present the theoretical results about the construction of confidence intervals for a nonlinear regression based on least squares estimation and using the linear Taylor expansion of the nonlinear model output. We stress the assumptions on which these results are based, in order to derive an appropriate methodology for neural black-box modeling; the latter is then analyzed and illustrated on simulated and real processes. We show that the linear Taylor expansion of a nonlinear model output also gives a tool to detect the possible ill-conditioning of neural network candidates, and to estimate their performance. Finally, we show that the least squares and linear Taylor expansion based approach compares favourably with other analytic approaches, and that it is an efficient and economic alternative to the non analytic and computationally intensive bootstrap methods.

Keywords

Nonlinear regression, neural networks, least squares estimation, linear Taylor expansion, confidence intervals, ill-conditioning detection, model selection, approximate leave-one-out score.

Notations

We distinguish between random variables and their values (or realizations) by using upper- and lowercase letters; all vectors are column vectors, and are denoted by boldface letters; non random matrices are denoted by light lowercase letters.

x : non random n-input vector
Y_p = Y_p(x) : random scalar output depending on x
E(Y_p(x)) : mathematical expectation, or regression function, of Y_p given x
W : random variable with zero expectation denoting additive noise
σ^2 : variance of W
{x^k, y_p^k} k=1 to N : data set of N input-output pairs, where the x^k are non random n-vectors, and the y_p^k are the corresponding realizations of the random outputs Y_p^k = Y_p(x^k)
{x^T θ, θ ∈ R^n} : family of linear functions of x parameterized by θ
θ_p : unknown true q-parameter vector (q = n in the linear case)
x = [x^1 x^2 ... x^N]^T : non random (N, n) input matrix
Y_p = [Y_p^1 ... Y_p^N]^T : random N-vector of the outputs of the data set
W = [W^1 W^2 ... W^N]^T : random N-vector with E(W) = 0
I_N : (N, N) identity matrix
J(θ) : value of the least squares cost function
Θ_LS : least squares estimator of θ_p
θ_LS : least squares estimate of θ_p
R = Y_p - x Θ_LS : least squares residual random N-vector in the linear case
r : value of R
m_x : range of x (linear manifold)
p_x : orthogonal projection matrix on m_x
S^2 : estimator of σ^2
s^2 : value of S^2
{f(x, θ), θ ∈ R^q} : family of nonlinear functions of x parameterized by θ
f(x, θ) : N-vector [f(x^1, θ) ... f(x^k, θ) ... f(x^N, θ)]^T
R = Y_p - f(x, Θ_LS) : least squares residual random N-vector
ξ = [ξ^1 ξ^2 ... ξ^N]^T : unknown non random (N, q) matrix with ξ^k = [∂f(x^k, θ)/∂θ] at θ = θ_p
m_ξ : range of ξ
p_ξ : orthogonal projection matrix on m_ξ
z = [z^1 z^2 ... z^N]^T : matrix approximating ξ, with z^k = [∂f(x^k, θ)/∂θ] at θ = θ_LS
m_z : range of z
p_z : orthogonal projection matrix on m_z
θ_LS^(k) : leave-one-out (the k-th example) least squares estimate
e^k, k=1 to N : leave-one-out errors
n_h : number of hidden neurons of a neural network
H : random Hessian matrix of the cost function
h : value of the Hessian matrix of the cost function
s_ref^2(x) : reference variance estimate
var_LTE f(x, Θ_LS) : LTE estimate of the variance of a nonlinear model output
var_Hessian f(x, Θ_LS) : Hessian estimate of the variance of a nonlinear model output
var_sandwich f(x, Θ_LS) : sandwich estimate of the variance of a nonlinear model output

Abbreviations

CI : confidence interval
LOO : leave-one-out
LS : least squares
LTE : linear Taylor expansion
SISO : single input - single output
MISO : multi input - single output
MSTE : mean square training error
MSPE : mean square performance error

I. INTRODUCTION

For any modeling problem, it is very important to be able to estimate the reliability of a given model. This problem has been investigated to a great extent in the framework of linear regression theory, leading to well-established results and commonly used methods to build confidence intervals (CIs) for the regression, that is the process output expectation [Seber 1977]; more recently, these results have been extended to nonlinear models [Bates & Watts 1988] [Seber & Wild 1989]. In neural network modeling studies however, these results are seldom exploited, and generally only an average estimate of the neural model reliability is given through the mean square model error on a test set; but in an application, one often wishes to know a CI at any input value of interest. Nevertheless, thanks to the increase of computer power, the use of bootstrap methods has increased in the past years [Efron & Tibshirani 1993]. These non analytic methods have been proposed to build CIs for neural networks [Paass 1993] [Tibshirani 1993] [Heskes 1997], but with the shortcoming of requiring a large number of trainings.

This paper presents an economic alternative for the construction of CIs using neural networks. This approach being built on the linear least squares (LS) theory applied to the linear Taylor expansion (LTE) of the output of nonlinear models, we first recall how to establish CIs for linear models in section II, and then approximate CIs for nonlinear models in section III. In section IV, we exploit these known theoretical results for practical modeling problems involving neural models. We show that the LTE of a nonlinear model output not only provides a CI at any input value of interest, but also gives a tool to detect the possible ill-conditioning of the model, and, as in [Monari 1999] [Monari & Dreyfus], to estimate its performance through an approximate leave-one-out (LOO) score. A real-world illustration is given through an industrial application, the modeling of the elasticity of a complex material from some of its structural descriptors. Section V compares the LS LTE approach to other analytic approaches, and discusses its advantages with respect to bootstrap approaches.

We consider single-output models, since each output of a multi-output model can be handled separately. We deal with static modeling problems for the case of a non random (noise free) n-input vector x = [x_1 x_2 ... x_n]^T, and a noisy measured output y_p which is considered as the actual value of a random variable Y_p = Y_p(x) depending on x. We assume that there exists an unknown regression function E(Y_p(x)) such that, for any fixed value of x:

    Y_p(x) = E(Y_p(x)) + W(x)                                                            (1)

where W(x) is thus a random variable with zero expectation. A family of parameterized functions {f(x, θ), x ∈ R^n, θ ∈ R^q} is called an assumed model. This assumed model is said to be true if there exists a value θ_p of θ such that, for all x in the input domain of interest, f(x, θ_p) = E(Y_p(x)). In the following, a data set of N input-output pairs {x^k, y_p^k} k=1 to N is available, where the x^k = [x_1^k x_2^k ... x_n^k]^T are non random n-vectors, and the y_p^k are the corresponding realizations of the random variables Y_p^k = Y_p(x^k) (footnote 1). The goal of the modeling procedure is not only to estimate the regression E(Y_p(x)) in the input domain of interest with the output of a model, but also to compute the value of a CI for the regression, that is the value of a random interval with a chosen probability to contain the regression.

For the presentation of the results of linear and nonlinear regression estimation, we deal with the true model (a model which is linear in the parameters in section II, a nonlinear one in section III), i.e. we consider that a family of functions containing the regression is known. In section IV, we consider the general realistic case of neural black-box modeling, where a preliminary selection among candidate neural models is first performed because a true model is not known a priori.

Footnote 1: We recall that we distinguish between random variables and their values (or realizations) by using upper- and lowercase letters, e.g. Y_p^k and y_p^k; all vectors are column vectors, and are denoted by boldface letters, e.g. the n-vectors x and x^k; non random matrices are denoted by light lowercase letters (except the unambiguous identity matrix).

II. CONFIDENCE INTERVALS FOR LINEAR MODELS

We consider a true linear assumed model, that is the associated family of linear functions {x^T θ, x ∈ R^n, θ ∈ R^n} contains the regression; (1) can thus be uniquely rewritten as:

    Y_p(x) = x^T θ_p + W(x)                                                              (2)

where θ_p is an unknown n-parameter vector. Model (2) associated to the data set leads to:

    Y_p = x θ_p + W                                                                      (3)

where x = [x^1 x^2 ... x^N]^T is the non random (N, n) input matrix, and Y_p = [Y_p^1 ... Y_p^N]^T and W = [W^1 W^2 ... W^N]^T are random N-vectors, with E(W) = 0. Geometrically, this means that E(Y_p(x)) = x θ_p belongs to the solution surface, the linear manifold m_x of the observation space R^N spanned by the columns of the input matrix (the range of x) (footnote 2). We assume that m_x is of dimension n, that is rank(x) = n. In other words, the model is identifiable, i.e. the data set is appropriately chosen, possibly using experimental design.

II.1. The linear least squares solution

The LS estimate θ_LS of θ_p minimizes the empirical quadratic cost function:

    J(θ) = 1/2 Σ_{k=1}^{N} (y_p^k - x^{k T} θ)^2 = 1/2 (y_p - x θ)^T (y_p - x θ)          (4)

The estimate θ_LS is a realization of the LS estimator Θ_LS, whose expression is:

    Θ_LS = (x^T x)^{-1} x^T Y_p = θ_p + (x^T x)^{-1} x^T W                               (5)

Since the assumed model is true, this estimator is unbiased. The orthogonal projection matrix on m_x is p_x = x (x^T x)^{-1} x^T. It follows from (5) that the unbiased LS estimator of E(Y_p(x)) is:

    x Θ_LS = x θ_p + p_x W                                                               (6)

that is, the sum of E(Y_p(x)) and of the projection of W on m_x, as shown in Figure 1.

Let R denote the residual random N-vector R = Y_p - x Θ_LS, that is the vector of the errors on the data set; then:

    R = (I_N - p_x) W                                                                    (7)

Under the assumption that the W^k are identically distributed and uncorrelated (homoscedastic), i.e. the noise covariance matrix is K_W = σ^2 I_N, it follows from (5) that the variance of the LS estimator of the regression for any input x of interest is (footnote 3):

    var(x^T Θ_LS) = σ^2 x^T (x^T x)^{-1} x                                               (8)

Using (7), we obtain the unbiased estimator S^2 = R^T R / (N - n) of σ^2; the corresponding (unbiased) estimate of the variance of x^T Θ_LS is thus:

    var(x^T Θ_LS) = s^2 x^T (x^T x)^{-1} x                                               (9)

where s^2 is the value of S^2.

II.2. Confidence intervals for a linear regression

If the W^k are homoscedastic gaussian variables, that is W ~ N_N(0, σ^2 I_N):

    ThL1. Θ_LS ~ N_n(θ_p, σ^2 (x^T x)^{-1})                                              (10)
    ThL2. R^T R / σ^2 ~ χ^2_{N-n}                                                        (11)
    ThL3. Θ_LS is statistically independent from R^T R.

Footnote 2: m_x is sometimes called the expectation surface [Seber & Wild 1989]; as a matter of fact, the solution surface coincides with the expectation surface only when the assumed model is true.
Footnote 3: We recall that x (boldface) is the (n, 1) input vector of interest, and that x is the experimental (N, n) input matrix.
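As a concrete illustration of equations (4)-(9) (this sketch is ours, not part of the original paper), the following Python code computes the LS estimate, the unbiased noise variance estimate s^2 and the variance estimate (9) at an arbitrary input, under the homoscedasticity assumption; the simulated data reproduce the affine SISO setting of process #1 introduced in section II.3.

```python
import numpy as np

def linear_ls_fit(x_mat, y):
    """LS estimate (5), noise variance estimate s^2 and prediction variance (9).

    x_mat : (N, n) experimental input matrix, assumed of full rank n
    y     : (N,) vector of measured outputs
    """
    N, n = x_mat.shape
    # theta_LS = (x^T x)^{-1} x^T y, computed with a numerically safer solver
    theta_ls, *_ = np.linalg.lstsq(x_mat, y, rcond=None)
    residuals = y - x_mat @ theta_ls                       # realization r of R, see (7)
    s2 = residuals @ residuals / (N - n)                   # unbiased estimate of sigma^2
    xtx_inv = np.linalg.inv(x_mat.T @ x_mat)

    def pred_var(x_new):
        # estimate (9) of the variance of x^T Theta_LS at the input x_new of interest
        return s2 * x_new @ xtx_inv @ x_new

    return theta_ls, s2, pred_var

# Affine SISO model y = theta_1 + theta_2 x on simulated data (process #1 setting)
rng = np.random.default_rng(0)
x_in = rng.uniform(-3.0, 3.0, size=30)
x_mat = np.column_stack([np.ones_like(x_in), x_in])        # (N, 2) input matrix
y = 1.0 + 1.0 * x_in + rng.normal(0.0, np.sqrt(0.5), 30)   # assumed sigma^2 = 0.5
theta_ls, s2, pred_var = linear_ls_fit(x_mat, y)
print(theta_ls, s2, pred_var(np.array([1.0, 2.0])))
```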

The proof of the above theorems follows from Figure 1 and from the Fisher-Cochran theorem [Goodwin & Payne 1977]; see for instance [Seber 1977].

The goal is to build a CI for the regression value E(Y_p(x)) = x^T θ_p, for any input vector x of interest. The variance σ^2 of the measurements being unknown, let us build a normalized centered gaussian variable where both E(Y_p(x)) and σ^2 appear:

    (x^T Θ_LS - E(Y_p(x))) / sqrt(σ^2 x^T (x^T x)^{-1} x) ~ N(0, 1)                      (12)

Thus, using the Pearson variable (11), which is independent from (12) according to ThL3, we obtain the following Student variable:

    (x^T Θ_LS - E(Y_p(x))) / sqrt(S^2 x^T (x^T x)^{-1} x) ~ Student(N - n)               (13)

A 100 (1 - α) % CI for E(Y_p(x)) is thus:

    x^T θ_LS ± t_{N-n}(α) sqrt(s^2 x^T (x^T x)^{-1} x)                                   (14)

where t_{N-n} is the inverse of the Student(N-n) cumulative distribution. Note that (14) allows to compute a CI corresponding to any input vector, and that it is much more informative than average values such as the mean square error on the data set, or the mean of the variance estimate over the data set (footnote 4); as a matter of fact, the latter invariably equals s^2 n / N.

Footnote 4: The mean of the variance estimate over the training data set is:
    1/N Σ_{k=1}^{N} s^2 x^{k T} (x^T x)^{-1} x^k = s^2/N Σ_{k=1}^{N} p_x,kk = s^2/N trace(p_x).
Since p_x is the orthogonal projection matrix on an n-dimensional subspace, trace(p_x) = n.

II.3. Example of a simulated linear SISO process (process #1)

We consider a simulated single input - single output (SISO) linear process:

    y_p^k = θ_p1 + θ_p2 x^k + w^k        k = 1 to N                                      (15)

We take θ_p1 = 1, θ_p2 = 1, σ^2 = 0.5, N = 30. The inputs x^k of the data set are uniformly distributed in [-3; 3], as shown in Figure 2a. The family of functions {θ_1 + θ_2 x, θ ∈ R^2} is considered, that is the assumed model is true, and we choose a confidence level of 99% (t_28(1%) = 2.76). The LS estimation leads to a value of s^2 slightly smaller than σ^2, i.e. it underestimates the noise variance. Figure 2b displays the estimate (9) of the variance of x^T Θ_LS, and the true variance (8). The estimator S^2 of the noise variance being unbiased, the difference between the estimated variance (9) and the (usually unknown) true variance (8) is only due to the particular values of the measurement noise. Figure 2a shows the regression E(Y_p(x)), the data set, the model output and the 99% CI for the regression computed with (14).
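The CI (14) itself only additionally requires a Student quantile. A minimal continuation of the previous sketch (again ours, not part of the original paper), where t_{N-n}(α) is taken as the two-sided 100(1-α)% quantile, consistently with the value t_28(1%) = 2.76 quoted above:

```python
import numpy as np
from scipy import stats

def linear_ci_halfwidth(x_new, x_mat, s2, alpha=0.01):
    """Half-width of the 100(1-alpha)% CI (14) for the regression at input x_new."""
    N, n = x_mat.shape
    t_quantile = stats.t.ppf(1.0 - alpha / 2.0, df=N - n)  # two-sided t_{N-n}(alpha)
    var_estimate = s2 * x_new @ np.linalg.inv(x_mat.T @ x_mat) @ x_new   # estimate (9)
    return t_quantile * np.sqrt(var_estimate)

# 99% CI at x = 2 for the affine model of process #1 (x_mat, s2, theta_ls from the previous sketch)
x_new = np.array([1.0, 2.0])
half_width = linear_ci_halfwidth(x_new, x_mat, s2, alpha=0.01)
print(x_new @ theta_ls - half_width, x_new @ theta_ls + half_width)
```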

III. APPROXIMATE CONFIDENCE INTERVALS FOR NONLINEAR MODELS

We consider a family of nonlinear functions {f(x, θ), x ∈ R^n, θ ∈ R^q} which contains the regression, that is the assumed model is true; (1) can thus be rewritten as:

    Y_p(x) = f(x, θ_p) + W(x)                                                            (16)

where θ_p is an unknown q-parameter vector. We denote by f(x, θ_p) the unknown vector [f(x^1, θ_p) ... f(x^k, θ_p) ... f(x^N, θ_p)]^T defined on the data set, thus:

    Y_p = f(x, θ_p) + W                                                                  (17)

As in section II, x denotes the (N, n) input matrix (footnote 5), and Y_p and W are random N-vectors with E(W) = 0. Geometrically, this means that E(Y_p(x)) belongs to the solution surface, the manifold m_f = {f(x, θ), θ ∈ R^q} of R^N.

III.1. The linear Taylor expansion of the nonlinear least squares solution

A LS estimate θ_LS of θ_p minimizes the empirical cost function (footnote 6):

    J(θ) = 1/2 Σ_{k=1}^{N} (y_p^k - f(x^k, θ))^2 = 1/2 (y_p - f(x, θ))^T (y_p - f(x, θ))     (18)

The estimate θ_LS is a realization of the LS estimator Θ_LS. Efficient algorithms are at our disposal for the minimization of the cost function (18): they can lead to an absolute minimum, but they do not give an analytic expression of the estimator that could be used to build CIs. In order to take advantage of the results concerning linear models, it is worthwhile considering a linear approximation of Θ_LS, which is obtained by writing the LTE of f(x, θ) around f(x, θ_p):

    f(x, θ) ≈ f(x, θ_p) + Σ_{r=1}^{q} [∂f(x, θ)/∂θ_r]_{θ=θ_p} (θ_r - θ_pr) = f(x, θ_p) + ξ_x^T (θ - θ_p)     (19)

where ξ_x = [∂f(x, θ)/∂θ]_{θ=θ_p} is the q-vector of the partial derivatives of the model output with respect to the parameters, at the input x of interest. Thus, with the matrix notation:

    f(x, θ) ≈ f(x, θ_p) + ξ (θ - θ_p)                                                    (20)

where ξ = [ξ^1 ξ^2 ... ξ^N]^T with ξ^k = [∂f(x^k, θ)/∂θ]_{θ=θ_p}. The (N, q) matrix ξ is the non random and unknown (since θ_p is unknown) Jacobian matrix of f. Using (20), one obtains, similarly to the linear case, the following approximation of Θ_LS (see Appendix 1.1 for a detailed derivation of (21) and (23)):

    Θ_LS ≈ θ_p + (ξ^T ξ)^{-1} ξ^T W                                                      (21)

The range m_ξ of ξ is tangent to the manifold m_f at θ = θ_p; this manifold is assumed to be of dimension q, hence rank(ξ) = q. The matrix p_ξ = ξ (ξ^T ξ)^{-1} ξ^T is the orthogonal projection matrix on m_ξ. From (20) and (21), the LS estimator of E(Y_p(x)) can be approximately expressed by:

    f(x, Θ_LS) ≈ f(x, θ_p) + p_ξ W                                                       (22)

i.e. it is approximately the sum of E(Y_p(x)) and of the projection of W on m_ξ, as illustrated in Figure 3. If K_W = σ^2 I_N (homoscedasticity), the variance of the model output, that is the LS estimator of the regression, for an input x is approximately:

    var f(x, Θ_LS) ≈ σ^2 ξ_x^T (ξ^T ξ)^{-1} ξ_x                                          (23)

In the following, approximation (23) will be termed the LTE approximation of the model output variance. Let R denote the LS residual vector R = Y_p - f(x, Θ_LS), thus:

    R ≈ (I_N - p_ξ) W                                                                    (24)

Under the assumption of appropriate regularity conditions on f, and for large N, the curvature of the solution surface m_f is small (footnote 7); thus, using (24), one obtains the asymptotically (i.e. when N tends to infinity) unbiased estimator S^2 = R^T R / (N - q) of σ^2.

In (23), the matrix ξ takes the place of the matrix x in the linear case. But, as opposed to x, ξ is unknown since it is a function of the unknown parameter θ_p. The (N, q) matrix ξ may be approximated by z = [z^1 z^2 ... z^N]^T, where:

    z^k = [∂f(x^k, θ)/∂θ]_{θ=θ_LS}                                                       (25)

In the following, we assume that rank(z) = q. Like the matrix ξ, the vector ξ_x is not available, and its value may be approximated by:

    z_x = [∂f(x, θ)/∂θ]_{θ=θ_LS}                                                         (26)

From (23), (25) and (26), the LTE estimate of the variance of the LS estimator of the regression for an input x is thus:

    var_LTE f(x, Θ_LS) = s^2 z_x^T (z^T z)^{-1} z_x                                      (27)

III.2. Approximate confidence intervals for a nonlinear regression

If W ~ N_N(0, σ^2 I_N), and for large N, it follows from the above relations and from linear LS theory [Seber & Wild 1989] that:

    ThNL1. Θ_LS approximately follows N_q(θ_p, σ^2 (ξ^T ξ)^{-1})                         (28)
    ThNL2. R^T R / σ^2 approximately follows χ^2_{N-q}                                   (29)
    ThNL3. Θ_LS is approximately statistically independent from R^T R.

Footnote 5: In the case of a nonlinear model, x is merely a two-dimensional array.
Footnote 6: In the case of a multilayer neural network, the minimum value of the cost function can be obtained for several values of the parameter vector; but, since the only function-preserving transformations are neuron exchanges, as well as sign flips for odd activation functions like the hyperbolic tangent [Sussmann 1992], we will legitimately consider the neighborhood of one of these values only.
Footnote 7: The curvature is usually decomposed in two components: i) the intrinsic curvature, which measures the degree of bending and twisting of the solution surface m_f, and ii) the parameter-effects curvature, which describes the degree of curvature induced by the choice of the parameters θ.
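To make (25)-(27) concrete, the following sketch (ours, not the authors' code) computes the matrix z and the LTE variance estimate for a SISO network with one layer of n_h tanh hidden units; the parameter layout and the function names are assumptions of this sketch, and the derivatives are written analytically rather than obtained by backpropagation.

```python
import numpy as np

def nn_output_and_jacobian(x, theta, n_h):
    """Output and parameter-Jacobian of a SISO network with n_h tanh hidden units.

    Assumed theta layout: [bias_out, w_out (n_h), bias_hidden (n_h), w_in (n_h)].
    Returns f(x, theta) and the (len(x), q) Jacobian of f w.r.t. theta, i.e. the
    matrix z of (25) when theta = theta_LS.
    """
    x = np.atleast_1d(x).astype(float)
    b0 = theta[0]
    w_out = theta[1:1 + n_h]
    b_h = theta[1 + n_h:1 + 2 * n_h]
    w_in = theta[1 + 2 * n_h:1 + 3 * n_h]
    a = b_h + np.outer(x, w_in)               # (N, n_h) hidden pre-activations
    t = np.tanh(a)                            # (N, n_h) hidden outputs
    f = b0 + t @ w_out                        # (N,) network outputs
    dt = (1.0 - t ** 2) * w_out               # (N, n_h) derivative w.r.t. pre-activations
    jac = np.column_stack([np.ones_like(x),   # d f / d bias_out
                           t,                 # d f / d w_out_j
                           dt,                # d f / d bias_hidden_j
                           dt * x[:, None]])  # d f / d w_in_j
    return f, jac

def lte_variance(x_new, x_data, residuals, theta_ls, n_h):
    """LTE estimate (27) of the variance of f(x_new, Theta_LS)."""
    N, q = len(x_data), len(theta_ls)
    s2 = residuals @ residuals / (N - q)                  # asymptotically unbiased estimate of sigma^2
    _, z = nn_output_and_jacobian(x_data, theta_ls, n_h)  # matrix z of (25)
    _, z_x = nn_output_and_jacobian(x_new, theta_ls, n_h)
    z_x = z_x[0]                                          # q-vector z_x of (26)
    return s2 * z_x @ np.linalg.solve(z.T @ z, z_x)
```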

Using (23) and (28), let us again build a quasi normalized and centered gaussian variable where both E(Y_p(x)) and σ^2 appear:

    (f(x, Θ_LS) - E(Y_p(x))) / sqrt(σ^2 ξ_x^T (ξ^T ξ)^{-1} ξ_x) ~ N(0, 1) approximately              (30)

Thus, the variable (29) being approximately independent from (30) according to ThNL3, we have:

    (f(x, Θ_LS) - E(Y_p(x))) / sqrt(S^2 ξ_x^T (ξ^T ξ)^{-1} ξ_x) ~ Student(N - q) approximately       (31)

A 100 (1 - α) % approximate CI for E(Y_p(x)) is thus:

    f(x, θ_LS) ± t_{N-q}(α) sqrt(s^2 z_x^T (z^T z)^{-1} z_x)                                         (32)

Note that, when N is large, the Student distribution is close to the normal distribution, and thus t_{N-q}(α) ≈ n(α), where n is the inverse of the normal cumulative distribution. Like in the linear case, (32) allows to compute a CI at any input x of interest, which gives much more information than the value of the mean variance estimate over the data set: as a matter of fact, the latter always approximately equals s^2 q / N.

From a practical point of view, the construction of a CI for a neural model output at any input x of interest involves once and for all the computation of the matrix z, that is the N x q partial derivatives of the model output with respect to the parameters evaluated at θ_LS for the data inputs {x^k} k=1 to N, and, for each x value, that of z_x, i.e. the derivatives at x. In the case of a neural model, these derivatives are easily obtained with the backpropagation algorithm.

All the previous results and the above considerations are valid provided an absolute minimum of the cost function (18) is reached. In real life, several estimations of the parameters must thus be made starting from different initial values, the estimate corresponding to the lowest minimum being kept, in order to have a high probability of obtaining an absolute minimum. In the examples of this work, the algorithm used for the minimization of the cost function is the Levenberg algorithm, as described for example in [Bates & Watts 1988], and several trainings are performed. The probability of getting trapped in a relative minimum increasing with the number of parameters of the network and decreasing with the size of the data set, the number of trainings is chosen accordingly.

III.3. Quality of the approximate confidence intervals

III.3.1. Theoretical analysis

The quality of the approximate CI depends essentially on the curvature of the solution surface m_f, which depends on the regularity of f and on the value of N. In practice, f is often regular enough for a first-order approximation to be satisfactory, provided that N is large enough. Thus, if N is large:
(i) as in the linear case, the estimator S^2 of the noise variance is unbiased, and the difference between s^2 and σ^2 is only due to the particular values of the noise;
(ii) the variance of f(x, Θ_LS) is small, and θ_LS is likely to be close to θ_p: z and z_x are thus likely to be good approximations of ξ and ξ_x respectively.
A reliable estimate of a CI may thus be obtained from the LTE variance estimate (27). On the other hand, if N is too small:
(i) as opposed to the linear case, even if the assumed model is true, the estimator S^2 of the noise variance is biased;
(ii) the variance of f(x, Θ_LS) is large, and θ_LS is likely to differ from θ_p: z and z_x risk being poor approximations of ξ and ξ_x.
Thus, if N is diagnosed as too small, one cannot have confidence in the confidence intervals (32), and additional data should be gathered.

III.3.2. Quantitative analysis

As detailed for example in [Bates & Watts 1988] [Seber & Wild 1989] [Antoniadis et al. 1992], different measures of the curvature can be computed, and can be used in each particular case to evaluate the accuracy of the LTE. In section IV.3, dealing with neural network modeling, we give indications on how to judge whether N is large enough for the approximate CI to be accurate.
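Combining the previous sketch with a Student quantile gives the approximate CI (32); again, this is an illustrative sketch of ours rather than the authors' implementation, and it reuses nn_output_and_jacobian and lte_variance from the sketch of section III.1.

```python
import numpy as np
from scipy import stats

def approximate_ci(x_new, x_data, residuals, theta_ls, n_h, alpha=0.01):
    """Approximate 100(1-alpha)% CI (32) for the regression at input x_new."""
    N, q = len(x_data), len(theta_ls)
    f_new, _ = nn_output_and_jacobian(x_new, theta_ls, n_h)
    var_lte = lte_variance(x_new, x_data, residuals, theta_ls, n_h)  # estimate (27)
    t_quantile = stats.t.ppf(1.0 - alpha / 2.0, df=N - q)            # two-sided t_{N-q}(alpha)
    half_width = t_quantile * np.sqrt(var_lte)
    return f_new[0] - half_width, f_new[0] + half_width
```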

In order to evaluate the accuracy of the LTE variance estimate (27) when dealing with simulated processes, we introduce an estimate of the unknown true variance of f(x, Θ_LS) that is not biased by curvature effects, the reference variance estimate s_ref^2(x). This estimate is computed on a large number M of other sets of N outputs corresponding to the inputs of the training set, and whose values are obtained with different realizations (simulated values) of the noise W. The i-th LS estimate f(x, θ_LS^(i)) of E(Y_p(x)) is computed with data set i (i = 1 to M), and the reference estimate of the variance at input x is computed as:

    s_ref^2(x) = 1/M Σ_{i=1}^{M} (f(x, θ_LS^(i)) - f_mean(x))^2,  where  f_mean(x) = 1/M Σ_{i=1}^{M} f(x, θ_LS^(i))     (33)

In the nonlinear case, we thus use three notions related to the (true) variance of f(x, Θ_LS):
1) the LTE variance approximation (23), which is a good approximation of the true variance if the curvature is small, as we show in section III.4.3, and which can be computed only when the process is simulated;
2) the LTE variance estimate (27), which is the estimate that can be computed in real life;
3) the reference variance estimate (33), which tends to the true variance when M tends to infinity because it is not biased by curvature effects, and which can be computed only when the process is simulated.

III.4. Illustrative examples

Since we are concerned with neural models, and since, in this section, the assumed model is true, the following examples bring into play neural processes, that is to say processes whose regression function is the output of a neural network; the more realistic case of arbitrary processes for which a family of nonlinear functions (a neural network with a given architecture) containing the regression is unknown is tackled in the next section.

III.4.1. Example of a simulated neural SISO process (process #2)

We consider a SISO process simulated by a neural network with one hidden layer of two hidden neurons with hyperbolic tangent activation function and a linear output neuron:

    y_p^k = θ_p1 + θ_p2 tanh(θ_p3 + θ_p4 x^k) + θ_p5 tanh(θ_p6 + θ_p7 x^k) + w^k        k = 1 to N     (34)

We take θ_p = [1; 2; 1; 2; -1; -1; 3]^T, σ^2 = 10^-2, N = 50. The inputs x^k of the data set are uniformly distributed in [-3; 3], as shown in Figure 4a. The family of functions {θ_1 + θ_2 tanh(θ_3 + θ_4 x) + θ_5 tanh(θ_6 + θ_7 x), θ ∈ R^7} is considered, that is the assumed model is true, and we choose a confidence level of 99% (t_43(1%) = 2.58). The cost function is minimized with the Levenberg algorithm. Figure 4b displays the LTE estimate (27) of the variance of f(x, Θ_LS), and the reference estimate (33) computed over a large number M of sets. Figure 4a shows the regression, the data set used for the LS estimation, and the corresponding model output and 99% CI (32). The model being true and the size N = 50 of the data set being relatively large with respect to the number of parameters and to the noise variance, we observe that:
(i) s^2 ≈ σ^2;
(ii) the model output is close to the regression, leading to good approximations of ξ and of ξ_x by z and z_x.
Thus, (i) and (ii) lead to an accurate LTE estimate of the variance, and hence of the CI.

III.4.2. Example of a simulated neural MISO process (process #3)

We consider a MISO process simulated by a neural network with two inputs, one hidden layer of two tanh hidden neurons and a linear output neuron:

    y_p^k = θ_p1 + θ_p2 tanh(θ_p3 + θ_p4 x_1^k + θ_p5 x_2^k) + θ_p6 tanh(θ_p7 + θ_p8 x_1^k + θ_p9 x_2^k) + w^k        k = 1 to N     (35)

We take θ_p = [1; 1; 2; 1; -1; -2; 2; 1; 1]^T, σ^2 = 10^-1, N = 100. The inputs x_1^k and x_2^k of the data set are uniformly distributed in [-3; 3]. As for process #2, the assumed model is true, i.e. the neural network associated to (35) is used; the cost function is minimized with the Levenberg algorithm. Figure 5a shows the inputs of the data set and the regression; Figure 5b displays the LTE estimate (27) of the variance of f(x, Θ_LS); Figure 5c displays the difference between the reference variance estimate (33), computed over a large number M of sets, and the LTE variance estimate (27).
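For a simulated process, the reference estimate (33) can be computed along the following lines (an illustrative sketch of ours; train_network, predict and the default M = 100 are placeholders and assumptions, the training routine being typically the Levenberg algorithm with multiple restarts as described above):

```python
import numpy as np

def reference_variance(x_new, x_data, regression, sigma2, train_network, predict, M=100, seed=0):
    """Monte Carlo reference estimate (33) of the variance of f(x_new, Theta_LS).

    regression(x)         : true regression function of the simulated process
    train_network(x, y)   : placeholder routine returning theta_LS from one data set
    predict(x_new, theta) : model output f(x_new, theta)
    M                     : number of simulated data sets (an assumption of this sketch)
    """
    rng = np.random.default_rng(seed)
    outputs = np.empty(M)
    for i in range(M):
        # simulate a new data set with the same inputs and a fresh noise realization
        y_i = regression(x_data) + rng.normal(0.0, np.sqrt(sigma2), size=len(x_data))
        theta_i = train_network(x_data, y_i)          # i-th LS estimate
        outputs[i] = predict(x_new, theta_i)          # f(x_new, theta_LS^(i))
    return np.mean((outputs - outputs.mean()) ** 2)   # estimate (33)
```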

Since s^2 is slightly smaller than σ^2, the variance is globally slightly underestimated. But except in the domain around the corner (3, 3), where the density of the inputs is lower and where the slope of the output surface is steep, the LTE variance estimate is satisfactory. A reliable estimate of the CI may thus be obtained.

III.4.3. Accuracy of the LTE variance approximation (processes #2 and #3)

Let us now show on the example of processes #2 and #3 that the curvature of the solution surface is small enough for the LTE approximation of the variance (23) to be satisfactory. For both processes, we have computed approximation (23), using the values of ξ and of ξ_x (at θ_p) and the value of σ^2 used for the noise simulation. As shown in Figures 6a and 6b for process #2, the LTE approximation of the variance (23) is very close to the reference variance estimate. As a matter of fact, the difference between them (Figure 6b) is only due to the curvature, which is small since N = 50 is large with respect to the complexity of the regression. Expression (23) also leads to satisfactory results in the case of process #3, as shown in Figures 7a and 7b (N = 100, two inputs). This tends to show that a first-order approximation of the variance is often sufficient, and that it is not worth bothering with a higher-order approximation. In [Seber & Wild 1989], a quadratic approximation of the LS estimator using the curvature of the solution surface is introduced. This approximation uses arrays of projected second derivatives, the intrinsic and parameter-effects curvature arrays; but their presentation is beyond the scope of this paper.

IV. CONFIDENCE INTERVALS FOR NEURAL NETWORKS

In the previous sections, the model used for the construction of CIs is true. For real-world black-box modeling problems however, a family of functions which contains the regression is not known a priori. The first task is thus to select the least complex family of functions which contains a function approximating the regression to a certain degree of accuracy in the input domain defined by the data set. In practice, several families of increasing complexity (for example neural networks with one layer of an increasing number n_h of hidden units, and a linear output neuron) are considered, and the data set is used both to estimate their parameters, and to perform the selection between the candidates. In order to retain the least complex family containing a good approximation of the regression, it is of interest to perform the selection only between neural candidates which are not unnecessarily large, and whose matrix z is sufficiently well-conditioned to allow the computation of the approximate CI (32). We propose to discard too large models by a systematic detection of ill-conditioning, and to perform the selection among the approved, i.e. well-conditioned, models using an approximate value of their LOO score whose computation does not require further trainings. Both the ill-conditioning detection and the estimation of the LOO score of a neural candidate are based on the LTE of its output.

IV.1. Ill-conditioning detection for model approval

A too large neural model, trained up to convergence with a simple LS cost function, will generally overfit. Overfitting is often avoided by using early stopping of the training algorithm, or by adding regularization terms to the cost function, e.g. weight decay [Bishop 1995]. Unfortunately, since only the estimator whose value corresponds to an absolute minimum of the quadratic cost function (18) without weight decay terms is unbiased, both early stopping and weight decay introduce a bias in the estimation of the regression: the corresponding estimates thus lead to questionable CIs for the regression.

To detect and discard too large networks, we propose, after the training of each candidate up to a (hopefully) absolute minimum of the cost function (18), to check the conditioning of its matrix z (see [Rivals & Personnaz 1998]). The fact that z is ill-conditioned is the symptom that some parameters are useless, since the elements of z represent the sensitivity of the model output with respect to the parameters. A typical situation is the saturation of a tanh hidden neuron, a situation which generates in the matrix z a column of +1 or -1, corresponding to the parameter between the output of the saturated hidden neuron and the linear output neuron, and columns of zeros, corresponding to the parameters between the network inputs and the saturated hidden neuron (footnote 8). In practice, we propose to perform a singular value factorization of z, and to compute its condition number, that is the ratio of its largest to its smallest singular value, see for example [Golub & Van Loan 1983]. The matrix z can be considered as very ill-conditioned when its condition number reaches the inverse of the computer precision, which is of the order of 10^16. Furthermore, in order to be able to compute the approximate CI (32), which involves (z^T z)^{-1}, the cross-product Jacobian matrix z^T z must also be well-conditioned. Since the condition number of z^T z is the square of the condition number of z, the networks whose matrix z has a condition number much larger than 10^8 cannot be approved.

There are other studies of the ill-conditioning of neural networks, but they deal with their training rather than with their approval, like in [Zhou and Si 1998], where an algorithm avoiding the Jacobian rank deficiency is presented, or [Saarinen et al. 1993], where the Hessian rank deficiency is studied during training. In our view, rank deficiency is not very relevant during the training since, with a Levenberg algorithm, the matrix to be inverted is made well-conditioned by the addition of a scalar matrix λ I_q, λ > 0, to the cross-product Jacobian.
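The approval test can be written in a few lines (an illustrative sketch of ours; the 10^8 bound is the criterion stated above, and z is the Jacobian matrix computed at θ_LS, e.g. as in the sketch of section III.1):

```python
import numpy as np

def approve_candidate(z, max_cond=1e8):
    """Approval test of section IV.1: accept a candidate only if its Jacobian
    matrix z (computed at theta_LS) is well-conditioned enough for (z^T z)^{-1},
    and hence the approximate CI (32), to be computable."""
    singular_values = np.linalg.svd(z, compute_uv=False)    # sorted in decreasing order
    cond_z = singular_values[0] / singular_values[-1]       # ratio largest / smallest
    return cond_z < max_cond, cond_z
```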
IV.2. Approximate leave-one-out scores for model selection

The selection among the networks which have been approved can be performed with statistical tests, for example [Urbani et al. 1994] [Rivals & Personnaz 1998]. Another approach, cross validation, consists in partitioning the data set into training and test sets, and in selecting the smallest network leading to the smallest mean square error on the test sets (footnote 9). One of the drawbacks of cross validation is to require a successful training of the candidate models on many test sets, that is N successful trainings in the case of LOO cross validation. Let us denote by e^k the error obtained on the left out example k with the model trained on the N-1 remaining examples (k-th test set). In this section, we derive an approximate expression of e^k, an expression which allows an economic estimation of the LOO score without performing these N time-consuming trainings of each candidate network, as proposed in [Monari 1999] [Monari & Dreyfus]. In the case of a linear model, it is well known [Efron & Tibshirani 1993] that the k-th LOO error e^k can be directly derived from the corresponding residual r^k:

    e^k = r^k / (1 - p_x,kk)   if p_x,kk < 1
    e^k = r^k = 0              if p_x,kk = 1        k = 1 to N                           (36)

where, we recall, p_x denotes the orthogonal projection matrix on the range of x. Expression (36) holds irrespective of whether or not the assumed model is true.

Footnote 8: Such a situation might also correspond to a relative minimum; checking the conditioning of z thus also helps to discard neural networks trapped in relative minima, and leads to retrain the neural candidate starting from different initial weights.
Footnote 9: Note that statistical tests may advantageously be used complementarily to cross validation in order to take a decision [Rivals & Personnaz 1999]; these tests can also be established by applying LS theory to the LTE of nonlinear models [Bates & Watts 1988], but this exceeds the scope of this paper.

In the case of a nonlinear model, we show (see Appendix 2) that a useful approximation of the k-th LOO error can be obtained using the LTE of the model output at θ_LS:

    e^k ≈ r^k / (1 - p_z,kk)        k = 1 to N                                           (37)

where p_z denotes the orthogonal projection matrix on the range of z. The approximation (37) is thus similar to (36) (footnote 10). In the case where (numerically) p_z,kk = 1, we choose to take e^k = r^k (the residual is not necessarily zero). Like in the linear case, expression (37) holds independently of whether the assumed model is true or not. Hence the LOO score:

    LOO score = 1/N Σ_{k=1}^{N} (e^k)^2                                                  (38)

This LOO score can be used as an estimate of the mean square performance error, and we thus denote it as MSPE, as opposed to 2 J(θ_LS) / N, the mean square training error (MSTE). The interested reader will find in [Monari 1999] [Monari & Dreyfus] a systematic model selection procedure based on both the approximate LOO score and the distribution of the values of the p_z,kk. Nevertheless, another performance measure could be chosen as well (a 10-fold cross validation score, a mean square error on an independent set, etc.).

IV.3. Accuracy of the approximate confidence intervals

The quality of the selected model f(x, θ_LS), and thus of the associated approximate CI, depends essentially on the size N of the available data set with respect to the complexity of the unknown regression function and to the noise variance σ^2:
1. N is large: it is likely that the selected family {f(x, θ), θ ∈ R^q} contains the regression E(Y_p(x)), i.e. that the LS estimator is asymptotically unbiased, that the model f(x, θ_LS) is a good approximation of E(Y_p(x)) in the domain of interest, and that the curvature is small. In this case, reliable CIs can be computed with (32).
2. N is small: it is likely that the selected family {f(x, θ), θ ∈ R^q} is too small (footnote 11) to contain E(Y_p(x)), i.e. that the LS estimator is biased, and that the model f(x, θ_LS) thus underfits. The approximate CIs are thus questionable, and additional data should be gathered.
A good indicator of whether the data set size N is large enough or not is the ratio MSPE/MSTE of the selected candidate: if its value is close to 1, then N is probably large enough, whereas a large value is the symptom of a too small data set size, as shown in Figure 8 (and as illustrated numerically in the following examples).

Footnote 10: An expression similar to (37) is proposed in [Hansen & Larsen 1996], but unfortunately, it is not valid even in the linear case.
Footnote 11: It will generally not be too large, since the approval procedure proposed in section IV.1 prevents from selecting a neural network with useless parameters.
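The approximate LOO errors (37), the LOO score (38) and the MSPE/MSTE ratio used as an indicator in section IV.3 can all be obtained from a single training, for example as follows (an illustrative sketch of ours; the leverages p_z,kk are computed from a QR factorization of z):

```python
import numpy as np

def approximate_loo_score(z, residuals):
    """Approximate LOO errors (37) and LOO score / MSPE (38) from a single training.

    z         : (N, q) Jacobian of the model output w.r.t. the parameters at theta_LS
    residuals : (N,) residuals r^k of the trained candidate on the data set
    """
    # leverages p_z,kk: diagonal of the orthogonal projection matrix on the range of z
    q_mat, _ = np.linalg.qr(z)
    leverages = np.sum(q_mat ** 2, axis=1)
    loo_errors = np.where(leverages < 1.0 - 1e-12,
                          residuals / (1.0 - leverages),   # e^k = r^k / (1 - p_z,kk)
                          residuals)                       # convention e^k = r^k if p_z,kk = 1
    mspe = np.mean(loo_errors ** 2)                        # LOO score (38)
    mste = np.mean(residuals ** 2)                         # MSTE = 2 J(theta_LS) / N
    return mspe, mspe / mste
```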

IV.4. Example of a simulated nonlinear SISO process (process #4)

This first example is based on a simulated process. Like in the previous sections, a reference estimate of the variance of the output of a neural network is made, using a large number M of other sets; in order to ensure that an absolute minimum is reached on each of the M sets, five to thirty trainings (depending on the network size) with the Levenberg algorithm for different initializations of the weights are performed, and the weights giving the smallest value of the cost function (18) are kept. We consider the SISO process simulated with:

    y_p^k = sinc(x^k) + w^k        k = 1 to N                                            (39)

where sinc denotes the cardinal sine function; we take σ^2 = 10^-2.

First, a data set of N = 200 input-output pairs is computed, with input values uniformly distributed in [-5; 5]. Since a family of nonlinear functions (a neural network with a given architecture) containing the regression is not known a priori, neural networks with a linear output neuron and a layer of n_h tanh hidden neurons are trained. The numerical results are summarized in Table 1. We list the number of parameters q, the MSTE (i.e. the smallest MSTE obtained with the network for its different random weight initializations), the condition number of z, and, if the latter is not too large, the MSPE (the corresponding approximate LOO score computed with (37) and (38)) and the ratio MSPE/MSTE. The candidates with more than 6 hidden neurons cannot be approved, because cond(z) >> 10^8. The optimal number of neurons n_h^opt = 4 is selected on the basis of the MSPE. The fact that the corresponding ratio MSPE/MSTE is close to 1 is the symptom that N is large enough, so that the selected family of networks contains a good approximation of the regression, and that the curvature is small (case 1 of section IV.3). The results obtained for the selected neural network are shown in Figure 9. The model output is close to the regression, the LTE variance estimate (27) is close to the reference variance estimate (33), and the CI is thus accurate.

Second, a data set of N = 30 input-output pairs is computed, the numerical results being summarized in Table 2. The data set being much smaller, the candidates cannot be approved as soon as n_h > 4: for n_h = 5, cond(z) >> 10^8. The optimal number of neurons n_h^opt = 2 is selected on the basis of the MSPE. The ratio MSPE/MSTE of the selected network equals 2.1, a symptom that N is relatively small, and that the selected family of networks is likely not to contain the regression (case 2 of section IV.3). The results obtained for the selected neural network are shown in Figure 10. The family of functions implemented by a network with two hidden units is obviously too small to contain a good approximation of the regression, and though the estimate of the output variance is good (it is close to the reference variance estimate), since the output of the neural network differs from the regression, the CI is less accurate than in the case where N = 200. Note that in the input domain [2; 5], where the model underfits, though there is a higher concentration of training examples around 1 and 3, the variance remains constant and low. This is due to the fact that, in this domain, the model output is insensitive to most parameters of the network (this is usually the case when, like here, the output of a network does not vary (footnote 12)): the elements of the z_x's in this domain are thus constant and small, hence a small and constant variance at the corresponding x's.

Footnote 12: The output of a neural network with one layer of tanh hidden units remains constant in a given domain of its inputs when the tanh activation functions of all hidden units are saturated in this domain: the output of the network is thus insensitive to all the parameters of the hidden units.
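The whole approval-and-selection loop illustrated in these examples can be summarized as follows (a sketch of ours, not the authors' code; train_candidate and jacobian are placeholder routines, and approve_candidate and approximate_loo_score are the sketches given after sections IV.1 and IV.2):

```python
import numpy as np

def select_architecture(train_candidate, jacobian, max_hidden=10):
    """Candidate approval and selection loop of section IV (illustrative sketch):
    train networks of increasing size, discard ill-conditioned candidates
    (section IV.1), and select the approved candidate with the smallest
    approximate LOO score (sections IV.2 and IV.3).

    train_candidate(n_h) -> (theta_ls, residuals)   placeholder training routine
    jacobian(n_h, theta_ls) -> z                    placeholder returning the (N, q) matrix z
    """
    best = None
    for n_h in range(1, max_hidden + 1):
        theta_ls, residuals = train_candidate(n_h)
        z = jacobian(n_h, theta_ls)
        approved, _ = approve_candidate(z)                 # ill-conditioning detection
        if not approved:
            continue                                       # skip unapproved candidates
        mspe, ratio = approximate_loo_score(z, residuals)  # equations (37)-(38)
        if best is None or mspe < best[1]:
            best = (n_h, mspe, ratio)
    return best   # (selected n_h, its MSPE, its MSPE/MSTE ratio)
```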

IV.5. Industrial modeling problem

We apply here the presented methodology (LS parameter estimation, model approval, model selection, CI construction) to an industrial example first tackled in [Rivals & Personnaz 1998], that is the modeling of a mechanical property of a complex material from three structural descriptors. We have been provided with a data set of N = 69 examples; the inputs and outputs are normalized for the LS estimations. Thanks to repetitions in the data, and assuming homoscedasticity, the pure error mean square [Draper & Smith 1998] gives a good estimate σ^2 of the noise variance. Using this reliable estimate, statistical tests establish the significance of two of the inputs. An affine model with these two inputs gives an estimate s^2 of the variance which is much larger than the pure error estimate, hence the necessity of nonlinear modeling.

Neural networks with a linear output neuron and a layer of n_h tanh hidden neurons are trained. The numerical results are summarized in Table 3. It shows that the candidates with more than 3 hidden neurons cannot be approved: for n_h = 4, cond(z) >> 10^8. The optimal number of neurons n_h^opt = 2 is selected on the basis of the MSPE. The fact that the corresponding ratio MSPE/MSTE equals 1.3 indicates that N is large enough, so that the selected family of networks probably contains a good approximation of the regression, and that the curvature is small (case 1 of section IV.3). The function implemented by the selected network is shown in Figure 11.

The N = 69 outputs of the training set are presented in increasing order in Figure 12a, and the corresponding residuals and approximate LOO errors in Figure 12b: both appear quite uncorrelated and homoscedastic. A CI with a level of significance of 95% is then computed with (32); the half width of the 95% CI on the N = 69 examples of the data set is shown in Figure 12c. In order to check the confidence which can be attached to the model, the variance of its output must be examined in the whole input domain of interest. Figure 13 shows the isocontours of the LTE estimate s sqrt(z_x^T (z^T z)^{-1} z_x) of the standard deviation of the model output in the input domain defined by the training set. The computation of the LTE variance estimate thus allows not only to construct a CI at any input of interest, but also to diagnose that, at the top right corner of the input domain, the model standard deviation is larger than that of the noise itself (the highest isocontour value equals that of the estimate s of the noise standard deviation). Little confidence can thus be attached to the model output in this input domain, where more data should be gathered. On the opposite, there is a large region on the left of the diagram where there are very few training examples, but where the LTE estimate of the standard deviation is surprisingly rather small: like for the modeling of process #4 in section IV.4, this is due to the fact that the model output is little sensitive to most parameters of the network in this region (the model output varies very little, see Figure 11).

V. COMPARISONS

In this section, we discuss the advantages of the LS LTE approach to the construction of confidence intervals for neural networks with respect to other analytic approaches and to the bootstrap methods, and compare them on simulated examples.

V.1. Comparison to other analytic approaches

Maximum likelihood approach
In the case of gaussian homoscedastic data, likelihood theory leads to the same approximate variance (23), but two different estimators of it are commonly encountered (see Appendix 1.2):

    var_LTE f(x, Θ_LS) = s^2 z_x^T (z^T z)^{-1} z_x

i.e. the same estimate as (27), and also:

    var_Hessian f(x, Θ_LS) = s^2 z_x^T h(θ_LS)^{-1} z_x                                  (40)

which necessitates the computation of the Hessian. Efficient methods for computing the Hessian are presented in [Buntine & Weigend 1994].

Bayesian approach
The Bayesian approach is an alternative to sampling theory (or frequentist approach) for modeling problems, and also leads to the design of CIs. These two approaches are conceptually very different: the Bayesian approach treats the unknown parameters as random variables, whereas they are considered as certain in the frequentist approach. Nevertheless, as presented for example in [MacKay 1992a] [MacKay 1992b] [Bishop 1995], the Bayesian approach leads to a posterior distribution of the parameters with a covariance matrix whose expression is very similar to that of the covariance matrix of the least squares estimator of the parameters, and thus to CIs which are similar to those presented in this paper. We thus make a brief comparison between the CIs these two approaches lead to.

The most important difference is that the estimator which is considered here is the one whose estimate minimizes the cost function (18), whereas in the Bayesian approach, a cost function with an additional weight-decay regularization term is minimized; the presence of this weight-decay term stems from the assumption of a gaussian prior for the parameters. Nevertheless, the least squares cost function (18) can be seen as the limit where the regularization term is zero, which corresponds to an uninformative prior for the parameters. In this case (that is, when (18) is minimized as in this paper), there is another small difference in the Bayesian approach as presented in [MacKay 1992a] [MacKay 1992b] [Bishop 1995]. Under hypotheses which we cannot detail here, the Bayesian approach leads to a posterior parameter distribution with approximate covariance matrix s^2 h(θ_LS)^{-1}, h(θ_LS) being the Hessian of the cost function evaluated at the most probable value of the parameters, that is here θ_LS. A LTE of the estimator output then leads to the following estimate of its variance at input x:

    var_Hessian f(x, Θ_LS) = s^2 z_x^T h(θ_LS)^{-1} z_x

i.e. it also leads to estimate (40).

Sandwich estimator
The sandwich estimate of the variance of a nonlinear model output can be derived in various frameworks (a possible derivation in the frequentist approach is given in Appendix 1.3):

    var_sandwich f(x, θ_LS) = s^2 z_x^T h(θ_LS)^{-1} (z^T z) h(θ_LS)^{-1} z_x            (41)

The sandwich estimator is known to be robust to model incorrectness, i.e. when the assumptions about the noise are incorrect or the considered family of functions is too small, see for example [Efron & Tibshirani 1993] [Ripley 1995].

Numerical comparison (processes #5 and #6)
Here, we perform a numerical comparison of the three variance estimates considered above on a very simple example. We consider a SISO process simulated by a single tanh neuron:

    y_p^k = tanh(θ_p1 + θ_p2 x^k) + w^k        k = 1 to N                                (42)

with σ^2 = 0.1, N = 30. For this comparison, the noise variance σ^2 is estimated with s^2 in the three (LTE, Hessian, sandwich) output variance estimates.

We first simulate a process with θ_p1 = 0, θ_p2 = 1 (process #5). The corresponding results are shown in Figure 14. The variance reference estimate is computed on a large number M of data sets. The LTE approximation (23) of the variance is almost perfect. The LTE (27), Hessian (40), and sandwich (41) estimates are comparable: the parameter estimates being accurate (θ_LS1 ≈ 0, θ_LS2 = 0.996), the fact that they are overestimated is almost only due to the noise variance estimate s^2. Nevertheless, the shape of the LTE estimate is closer to the reference estimate than that of the two others.

We then simulate a process with θ_p1 = 0, θ_p2 = 5 (process #6). The corresponding results are shown in Figure 15. The function being steeper, the curvature is larger, and the LTE approximation (23) of the variance is a little less accurate. The three estimates are still very similar but, here, their overestimation is due not only to the noise variance estimate s^2, but also to the bias of the parameter estimates (θ_LS1 ≈ 0, θ_LS2 = 6.58). The computational cost of the LTE estimate being lower (it does not necessitate the computation of the Hessian matrix), there is no reason to prefer one of the two other estimates. As a matter of fact, since the Hessian depends on the data set, it is the realization of a random matrix.
Thus, in the maximum likelihood as well as in the Bayesian approach, it is often recommended to take the expectation of the Hessian, and to evaluate it at the available θ_LS, i.e. to replace it by the cross-product Jacobian z^T z [Seber & Wild 1989]: estimates (40) and (41) then reduce to estimate (27). As mentioned above, the sandwich variance estimator is known to be robust to model incorrectness, a property which is not tested with this simple setting, but this is beyond the scope of this paper.
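For completeness, here is how the Hessian estimate (40) and the sandwich estimate (41) can be computed once the Hessian h(θ_LS) of the cost function (18) is available (an illustrative sketch of ours; the Hessian itself must be supplied, e.g. computed with the methods discussed in [Buntine & Weigend 1994] or by finite differences):

```python
import numpy as np

def hessian_and_sandwich_variance(z, z_x, residuals, hessian):
    """Hessian estimate (40) and sandwich estimate (41) of the output variance.

    z        : (N, q) Jacobian of the model output at theta_LS (matrix z)
    z_x      : (q,) derivative vector at the input x of interest
    residuals: (N,) residuals at theta_LS
    hessian  : (q, q) Hessian h(theta_LS) of the cost function (18)
    """
    N, q = z.shape
    s2 = residuals @ residuals / (N - q)
    h_inv_zx = np.linalg.solve(hessian, z_x)
    var_hessian = s2 * z_x @ h_inv_zx                      # estimate (40)
    var_sandwich = s2 * h_inv_zx @ (z.T @ z) @ h_inv_zx    # estimate (41)
    return var_hessian, var_sandwich
```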

V.2. Comparison to bootstrap approaches

The bootstrap works by creating many pseudo replicates of the data set, the bootstrap sets, and reestimating the LS solution (retraining the neural network) on each bootstrap set; the variance of the neural model output, and the associated CI, are then computed over the trained networks, typically a hundred [Efron & Tibshirani 1993]. In the bootstrap pairs approach for example, a bootstrap set is created by sampling with replacement from the data set [Efron & Tibshirani 1993]. The first advantage of the LS LTE approach is to require only one successful training of the network on the data set to compute the LTE estimate of the variance of its output, whereas the bootstrap methods require a hundred successful trainings of the network on the different bootstrap sets.

Studies on bootstrap where only one training with a random initialization of the weights was performed for each bootstrap set show a pathological overestimation of the variance. This can be seen in [Tibshirani 1996], examples 2 and 3; but the overestimation of the bootstrap is not detected in this work because the reference estimate is also overestimated for the same reasons (one single training per set). As pointed out in [Refenes et al. 1997], a way to reduce this overestimation is to start each training on a bootstrap set with the weights giving the smallest value of the cost function (18) on the original data set; but even so, the bootstrap method becomes untractable for large networks, and/or for multi input processes.

The claim that bootstrap methods are especially efficient for problems with small data sets (see [Heskes 1997] for example) may be subject to criticism. As an illustration, the variance was estimated for process #2 using the bootstrap pairs approach, the network weights being initialized twice for each training, once with the true ones, and once with those obtained by training the network on the whole data set. As shown in Figure 16, though the size of the data set is not very small (N = 50), the bootstrap variance estimate is far away from the reference estimate. Increasing the number of bootstrap sets did not improve the variance estimate.

In fact, the bootstrap is especially suited to the estimation of the variance of estimators defined by a formula, like for example an estimator of a correlation coefficient [Efron & Tibshirani 1993]: for each bootstrap set, an estimate is computed using the formula, and the estimate of the variance is easily obtained. But the bootstrap is definitely not the best method if each estimation associated to a bootstrap set involves an iterative algorithm like the training of a neural network, which is the case for the construction of a CI with a neural model. However, if the data set is large enough, and if the training time is considered unimportant, the bootstrap pairs approach is a solution in the case of heteroscedasticity (that is, K_W is not scalar anymore), whereas the LS LTE approach, as well as the bootstrap residuals approach [Efron & Tibshirani 1993], are no longer valid.
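For reference, a bootstrap pairs estimate of the output variance can be sketched as follows (ours, for comparison purposes only; train_network and predict are placeholders, and each call to train_network should itself involve several restarts to reach an absolute minimum, which is precisely what makes the approach computationally costly):

```python
import numpy as np

def bootstrap_pairs_variance(x_new, x_data, y_data, train_network, predict, B=100, seed=0):
    """Bootstrap pairs estimate of the output variance at x_new, for comparison
    with the LTE estimate (27).

    x_data, y_data : numpy arrays holding the original data set
    train_network  : placeholder routine returning theta_LS from one bootstrap set
    predict        : placeholder routine returning f(x_new, theta)
    """
    rng = np.random.default_rng(seed)
    N = len(y_data)
    outputs = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, N, size=N)           # sample N pairs with replacement
        theta_b = train_network(x_data[idx], y_data[idx])
        outputs[b] = predict(x_new, theta_b)
    return outputs.var(ddof=1)                     # variance over the B retrained networks
```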
VI. CONCLUSION

We have given a thorough analysis of the LS LTE approach to the construction of CIs for a nonlinear regression using neural network models, and put emphasis on its enlightening geometric interpretation. We have stressed the underlying assumptions, in particular the fact that the approval and selection procedures must have led to a parsimonious, well-conditioned model containing a good approximation of the regression. Our whole methodology (LS parameter estimation, model approval, model selection, CI construction) has been illustrated on representative examples, bringing into play simulated processes and an industrial one.


More information

10-701/ Machine Learning, Fall

10-701/ Machine Learning, Fall 0-70/5-78 Machine Learning, Fall 2003 Homework 2 Solution If you have questions, please contact Jiayong Zhang .. (Error Function) The sum-of-squares error is the most common training

More information

Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap

Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap Dale J. Poirier University of California, Irvine September 1, 2008 Abstract This paper

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 Outlines Overview Introduction Linear Algebra Probability Linear Regression

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

HANDBOOK OF APPLICABLE MATHEMATICS

HANDBOOK OF APPLICABLE MATHEMATICS HANDBOOK OF APPLICABLE MATHEMATICS Chief Editor: Walter Ledermann Volume VI: Statistics PART A Edited by Emlyn Lloyd University of Lancaster A Wiley-Interscience Publication JOHN WILEY & SONS Chichester

More information

1 Confidence Intervals and Prediction Intervals for Feed-Forward Neural Networks

1 Confidence Intervals and Prediction Intervals for Feed-Forward Neural Networks Confidence Intervals and Prediction Intervals for Feed-Forward Neural Networks Richard Dybowski King s College London Stephen J. Roberts Imperial College, London Artificial neural networks have been used

More information

1 What a Neural Network Computes

1 What a Neural Network Computes Neural Networks 1 What a Neural Network Computes To begin with, we will discuss fully connected feed-forward neural networks, also known as multilayer perceptrons. A feedforward neural network consists

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Robustness of Principal Components

Robustness of Principal Components PCA for Clustering An objective of principal components analysis is to identify linear combinations of the original variables that are useful in accounting for the variation in those original variables.

More information

Discriminant analysis and supervised classification

Discriminant analysis and supervised classification Discriminant analysis and supervised classification Angela Montanari 1 Linear discriminant analysis Linear discriminant analysis (LDA) also known as Fisher s linear discriminant analysis or as Canonical

More information

SGN Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection

SGN Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection SG 21006 Advanced Signal Processing: Lecture 8 Parameter estimation for AR and MA models. Model order selection Ioan Tabus Department of Signal Processing Tampere University of Technology Finland 1 / 28

More information

Sigmaplot di Systat Software

Sigmaplot di Systat Software Sigmaplot di Systat Software SigmaPlot Has Extensive Statistical Analysis Features SigmaPlot is now bundled with SigmaStat as an easy-to-use package for complete graphing and data analysis. The statistical

More information

Input Selection for Long-Term Prediction of Time Series

Input Selection for Long-Term Prediction of Time Series Input Selection for Long-Term Prediction of Time Series Jarkko Tikka, Jaakko Hollmén, and Amaury Lendasse Helsinki University of Technology, Laboratory of Computer and Information Science, P.O. Box 54,

More information

Bayesian Feature Selection with Strongly Regularizing Priors Maps to the Ising Model

Bayesian Feature Selection with Strongly Regularizing Priors Maps to the Ising Model LETTER Communicated by Ilya M. Nemenman Bayesian Feature Selection with Strongly Regularizing Priors Maps to the Ising Model Charles K. Fisher charleskennethfisher@gmail.com Pankaj Mehta pankajm@bu.edu

More information

Statistics 910, #5 1. Regression Methods

Statistics 910, #5 1. Regression Methods Statistics 910, #5 1 Overview Regression Methods 1. Idea: effects of dependence 2. Examples of estimation (in R) 3. Review of regression 4. Comparisons and relative efficiencies Idea Decomposition Well-known

More information

CS 195-5: Machine Learning Problem Set 1

CS 195-5: Machine Learning Problem Set 1 CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of

More information

Functional Preprocessing for Multilayer Perceptrons

Functional Preprocessing for Multilayer Perceptrons Functional Preprocessing for Multilayer Perceptrons Fabrice Rossi and Brieuc Conan-Guez Projet AxIS, INRIA, Domaine de Voluceau, Rocquencourt, B.P. 105 78153 Le Chesnay Cedex, France CEREMADE, UMR CNRS

More information

Neural Networks: Optimization & Regularization

Neural Networks: Optimization & Regularization Neural Networks: Optimization & Regularization Shan-Hung Wu shwu@cs.nthu.edu.tw Department of Computer Science, National Tsing Hua University, Taiwan Machine Learning Shan-Hung Wu (CS, NTHU) NN Opt & Reg

More information

Chapter 4 Neural Networks in System Identification

Chapter 4 Neural Networks in System Identification Chapter 4 Neural Networks in System Identification Gábor HORVÁTH Department of Measurement and Information Systems Budapest University of Technology and Economics Magyar tudósok körútja 2, 52 Budapest,

More information

Machine Learning and Data Mining. Linear regression. Kalev Kask

Machine Learning and Data Mining. Linear regression. Kalev Kask Machine Learning and Data Mining Linear regression Kalev Kask Supervised learning Notation Features x Targets y Predictions ŷ Parameters q Learning algorithm Program ( Learner ) Change q Improve performance

More information

Lecture Notes 1: Vector spaces

Lecture Notes 1: Vector spaces Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector

More information

Statistically-Based Regularization Parameter Estimation for Large Scale Problems

Statistically-Based Regularization Parameter Estimation for Large Scale Problems Statistically-Based Regularization Parameter Estimation for Large Scale Problems Rosemary Renaut Joint work with Jodi Mead and Iveta Hnetynkova March 1, 2010 National Science Foundation: Division of Computational

More information

Frequentist-Bayesian Model Comparisons: A Simple Example

Frequentist-Bayesian Model Comparisons: A Simple Example Frequentist-Bayesian Model Comparisons: A Simple Example Consider data that consist of a signal y with additive noise: Data vector (N elements): D = y + n The additive noise n has zero mean and diagonal

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Business Economics BUSINESS ECONOMICS. PAPER No. : 8, FUNDAMENTALS OF ECONOMETRICS MODULE No. : 3, GAUSS MARKOV THEOREM

Business Economics BUSINESS ECONOMICS. PAPER No. : 8, FUNDAMENTALS OF ECONOMETRICS MODULE No. : 3, GAUSS MARKOV THEOREM Subject Business Economics Paper No and Title Module No and Title Module Tag 8, Fundamentals of Econometrics 3, The gauss Markov theorem BSE_P8_M3 1 TABLE OF CONTENTS 1. INTRODUCTION 2. ASSUMPTIONS OF

More information

Curvature measures for generalized linear models

Curvature measures for generalized linear models University of Wollongong Research Online University of Wollongong Thesis Collection 1954-2016 University of Wollongong Thesis Collections 1999 Curvature measures for generalized linear models Bernard A.

More information

"ZERO-POINT" IN THE EVALUATION OF MARTENS HARDNESS UNCERTAINTY

ZERO-POINT IN THE EVALUATION OF MARTENS HARDNESS UNCERTAINTY "ZERO-POINT" IN THE EVALUATION OF MARTENS HARDNESS UNCERTAINTY Professor Giulio Barbato, PhD Student Gabriele Brondino, Researcher Maurizio Galetto, Professor Grazia Vicario Politecnico di Torino Abstract

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

Multivariate Distributions

Multivariate Distributions IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate

More information

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.

Mark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation. CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.

More information

Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands

Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Asymptotic Multivariate Kriging Using Estimated Parameters with Bayesian Prediction Methods for Non-linear Predictands Elizabeth C. Mannshardt-Shamseldin Advisor: Richard L. Smith Duke University Department

More information

Long-Run Covariability

Long-Run Covariability Long-Run Covariability Ulrich K. Müller and Mark W. Watson Princeton University October 2016 Motivation Study the long-run covariability/relationship between economic variables great ratios, long-run Phillips

More information

Neural Network Training

Neural Network Training Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification

More information

Design of Screening Experiments with Partial Replication

Design of Screening Experiments with Partial Replication Design of Screening Experiments with Partial Replication David J. Edwards Department of Statistical Sciences & Operations Research Virginia Commonwealth University Robert D. Leonard Department of Information

More information

Logistic Regression: Regression with a Binary Dependent Variable

Logistic Regression: Regression with a Binary Dependent Variable Logistic Regression: Regression with a Binary Dependent Variable LEARNING OBJECTIVES Upon completing this chapter, you should be able to do the following: State the circumstances under which logistic regression

More information

Recall the Basics of Hypothesis Testing

Recall the Basics of Hypothesis Testing Recall the Basics of Hypothesis Testing The level of significance α, (size of test) is defined as the probability of X falling in w (rejecting H 0 ) when H 0 is true: P(X w H 0 ) = α. H 0 TRUE H 1 TRUE

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

Outline Lecture 2 2(32)

Outline Lecture 2 2(32) Outline Lecture (3), Lecture Linear Regression and Classification it is our firm belief that an understanding of linear models is essential for understanding nonlinear ones Thomas Schön Division of Automatic

More information

DS-GA 1002 Lecture notes 10 November 23, Linear models

DS-GA 1002 Lecture notes 10 November 23, Linear models DS-GA 2 Lecture notes November 23, 2 Linear functions Linear models A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions.

More information

Multivariate Analysis Techniques in HEP

Multivariate Analysis Techniques in HEP Multivariate Analysis Techniques in HEP Jan Therhaag IKTP Institutsseminar, Dresden, January 31 st 2013 Multivariate analysis in a nutshell Neural networks: Defeating the black box Boosted Decision Trees:

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018

Econometrics I KS. Module 2: Multivariate Linear Regression. Alexander Ahammer. This version: April 16, 2018 Econometrics I KS Module 2: Multivariate Linear Regression Alexander Ahammer Department of Economics Johannes Kepler University of Linz This version: April 16, 2018 Alexander Ahammer (JKU) Module 2: Multivariate

More information

Reading Group on Deep Learning Session 1

Reading Group on Deep Learning Session 1 Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular

More information

Neural Networks and Ensemble Methods for Classification

Neural Networks and Ensemble Methods for Classification Neural Networks and Ensemble Methods for Classification NEURAL NETWORKS 2 Neural Networks A neural network is a set of connected input/output units (neurons) where each connection has a weight associated

More information

INTRODUCTION TO PATTERN RECOGNITION

INTRODUCTION TO PATTERN RECOGNITION INTRODUCTION TO PATTERN RECOGNITION INSTRUCTOR: WEI DING 1 Pattern Recognition Automatic discovery of regularities in data through the use of computer algorithms With the use of these regularities to take

More information

2 Statistical Estimation: Basic Concepts

2 Statistical Estimation: Basic Concepts Technion Israel Institute of Technology, Department of Electrical Engineering Estimation and Identification in Dynamical Systems (048825) Lecture Notes, Fall 2009, Prof. N. Shimkin 2 Statistical Estimation:

More information

Tutorial on Machine Learning for Advanced Electronics

Tutorial on Machine Learning for Advanced Electronics Tutorial on Machine Learning for Advanced Electronics Maxim Raginsky March 2017 Part I (Some) Theory and Principles Machine Learning: estimation of dependencies from empirical data (V. Vapnik) enabling

More information

DOA Estimation using MUSIC and Root MUSIC Methods

DOA Estimation using MUSIC and Root MUSIC Methods DOA Estimation using MUSIC and Root MUSIC Methods EE602 Statistical signal Processing 4/13/2009 Presented By: Chhavipreet Singh(Y515) Siddharth Sahoo(Y5827447) 2 Table of Contents 1 Introduction... 3 2

More information

Outline Introduction OLS Design of experiments Regression. Metamodeling. ME598/494 Lecture. Max Yi Ren

Outline Introduction OLS Design of experiments Regression. Metamodeling. ME598/494 Lecture. Max Yi Ren 1 / 34 Metamodeling ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University March 1, 2015 2 / 34 1. preliminaries 1.1 motivation 1.2 ordinary least square 1.3 information

More information

* Tuesday 17 January :30-16:30 (2 hours) Recored on ESSE3 General introduction to the course.

* Tuesday 17 January :30-16:30 (2 hours) Recored on ESSE3 General introduction to the course. Name of the course Statistical methods and data analysis Audience The course is intended for students of the first or second year of the Graduate School in Materials Engineering. The aim of the course

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V

More information

Bayesian Interpretations of Heteroskedastic Consistent Covariance. Estimators Using the Informed Bayesian Bootstrap

Bayesian Interpretations of Heteroskedastic Consistent Covariance. Estimators Using the Informed Bayesian Bootstrap Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap Dale J. Poirier University of California, Irvine May 22, 2009 Abstract This paper provides

More information

LECTURE 5 HYPOTHESIS TESTING

LECTURE 5 HYPOTHESIS TESTING October 25, 2016 LECTURE 5 HYPOTHESIS TESTING Basic concepts In this lecture we continue to discuss the normal classical linear regression defined by Assumptions A1-A5. Let θ Θ R d be a parameter of interest.

More information

Learning features by contrasting natural images with noise

Learning features by contrasting natural images with noise Learning features by contrasting natural images with noise Michael Gutmann 1 and Aapo Hyvärinen 12 1 Dept. of Computer Science and HIIT, University of Helsinki, P.O. Box 68, FIN-00014 University of Helsinki,

More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

II. DIFFERENTIABLE MANIFOLDS. Washington Mio CENTER FOR APPLIED VISION AND IMAGING SCIENCES

II. DIFFERENTIABLE MANIFOLDS. Washington Mio CENTER FOR APPLIED VISION AND IMAGING SCIENCES II. DIFFERENTIABLE MANIFOLDS Washington Mio Anuj Srivastava and Xiuwen Liu (Illustrations by D. Badlyans) CENTER FOR APPLIED VISION AND IMAGING SCIENCES Florida State University WHY MANIFOLDS? Non-linearity

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

An Introduction to Statistical Machine Learning - Theoretical Aspects -

An Introduction to Statistical Machine Learning - Theoretical Aspects - An Introduction to Statistical Machine Learning - Theoretical Aspects - Samy Bengio bengio@idiap.ch Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) CP 592, rue du Simplon 4 1920 Martigny,

More information

8.6 Bayesian neural networks (BNN) [Book, Sect. 6.7]

8.6 Bayesian neural networks (BNN) [Book, Sect. 6.7] 8.6 Bayesian neural networks (BNN) [Book, Sect. 6.7] While cross-validation allows one to find the weight penalty parameters which would give the model good generalization capability, the separation of

More information

A Note on Bootstraps and Robustness. Tony Lancaster, Brown University, December 2003.

A Note on Bootstraps and Robustness. Tony Lancaster, Brown University, December 2003. A Note on Bootstraps and Robustness Tony Lancaster, Brown University, December 2003. In this note we consider several versions of the bootstrap and argue that it is helpful in explaining and thinking about

More information

ECE662: Pattern Recognition and Decision Making Processes: HW TWO

ECE662: Pattern Recognition and Decision Making Processes: HW TWO ECE662: Pattern Recognition and Decision Making Processes: HW TWO Purdue University Department of Electrical and Computer Engineering West Lafayette, INDIANA, USA Abstract. In this report experiments are

More information

Understanding Generalization Error: Bounds and Decompositions

Understanding Generalization Error: Bounds and Decompositions CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the

More information

Fixed Effects, Invariance, and Spatial Variation in Intergenerational Mobility

Fixed Effects, Invariance, and Spatial Variation in Intergenerational Mobility American Economic Review: Papers & Proceedings 2016, 106(5): 400 404 http://dx.doi.org/10.1257/aer.p20161082 Fixed Effects, Invariance, and Spatial Variation in Intergenerational Mobility By Gary Chamberlain*

More information

CSC 578 Neural Networks and Deep Learning

CSC 578 Neural Networks and Deep Learning CSC 578 Neural Networks and Deep Learning Fall 2018/19 3. Improving Neural Networks (Some figures adapted from NNDL book) 1 Various Approaches to Improve Neural Networks 1. Cost functions Quadratic Cross

More information

Bayesian Linear Regression [DRAFT - In Progress]

Bayesian Linear Regression [DRAFT - In Progress] Bayesian Linear Regression [DRAFT - In Progress] David S. Rosenberg Abstract Here we develop some basics of Bayesian linear regression. Most of the calculations for this document come from the basic theory

More information

Covariance function estimation in Gaussian process regression

Covariance function estimation in Gaussian process regression Covariance function estimation in Gaussian process regression François Bachoc Department of Statistics and Operations Research, University of Vienna WU Research Seminar - May 2015 François Bachoc Gaussian

More information

Artificial Neural Networks

Artificial Neural Networks Artificial Neural Networks Stephan Dreiseitl University of Applied Sciences Upper Austria at Hagenberg Harvard-MIT Division of Health Sciences and Technology HST.951J: Medical Decision Support Knowledge

More information

Joint Estimation of Risk Preferences and Technology: Further Discussion

Joint Estimation of Risk Preferences and Technology: Further Discussion Joint Estimation of Risk Preferences and Technology: Further Discussion Feng Wu Research Associate Gulf Coast Research and Education Center University of Florida Zhengfei Guan Assistant Professor Gulf

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

Lecture 3 September 1

Lecture 3 September 1 STAT 383C: Statistical Modeling I Fall 2016 Lecture 3 September 1 Lecturer: Purnamrita Sarkar Scribe: Giorgio Paulon, Carlos Zanini Disclaimer: These scribe notes have been slightly proofread and may have

More information

Graduate Econometrics I: Unbiased Estimation

Graduate Econometrics I: Unbiased Estimation Graduate Econometrics I: Unbiased Estimation Yves Dominicy Université libre de Bruxelles Solvay Brussels School of Economics and Management ECARES Yves Dominicy Graduate Econometrics I: Unbiased Estimation

More information

TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST

TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST Econometrics Working Paper EWP0402 ISSN 1485-6441 Department of Economics TESTING FOR NORMALITY IN THE LINEAR REGRESSION MODEL: AN EMPIRICAL LIKELIHOOD RATIO TEST Lauren Bin Dong & David E. A. Giles Department

More information

g-priors for Linear Regression

g-priors for Linear Regression Stat60: Bayesian Modeling and Inference Lecture Date: March 15, 010 g-priors for Linear Regression Lecturer: Michael I. Jordan Scribe: Andrew H. Chan 1 Linear regression and g-priors In the last lecture,

More information

Paper CD Erika Larsen and Timothy E. O Brien Loyola University Chicago

Paper CD Erika Larsen and Timothy E. O Brien Loyola University Chicago Abstract: Paper CD-02 2015 SAS Software as an Essential Tool in Statistical Consulting and Research Erika Larsen and Timothy E. O Brien Loyola University Chicago Modelling in bioassay often uses linear,

More information

Supplementary material (Additional file 1)

Supplementary material (Additional file 1) Supplementary material (Additional file 1) Contents I. Bias-Variance Decomposition II. Supplementary Figures: Simulation model 2 and real data sets III. Supplementary Figures: Simulation model 1 IV. Molecule

More information