Bootstrap for model selection: linear approximation of the optimism


G. Simon 1, A. Lendasse 2, M. Verleysen 1
1 Université catholique de Louvain, DICE - Place du Levant 3, B-1348 Louvain-la-Neuve, Belgium
Phone: +32-10-47-25-40, Fax: +32-10-47-21-80, {gsimon, verleysen}@dice.ucl.ac.be
2 Université catholique de Louvain, CESAME - Avenue G. Lemaître 4, B-1348 Louvain-la-Neuve, Belgium
lendasse@auto.ucl.ac.be

J. Mira and J.R. Álvarez (Eds.): IWANN 2003, LNCS 2686, pp. 182-189, 2003. © Springer-Verlag Berlin Heidelberg 2003

Acknowledgements. G. Simon is funded by the Belgian F.R.I.A. M. Verleysen is Senior Research Associate of the Belgian F.N.R.S. The work of A. Lendasse is supported by the Interuniversity Attraction Poles (IAP), initiated by the Belgian Federal State, Ministry of Sciences, Technologies and Culture. The scientific responsibility rests with the authors.

Abstract. The bootstrap resampling method can be used efficiently to estimate the generalization error of nonlinear regression models such as artificial neural networks. Nevertheless, the use of the bootstrap implies a high computational load. In this paper we present a simple procedure that provides a fast approximation of this generalization error at a reduced computation time. The proposal is based on empirical evidence and is embedded in a suggested simulation procedure.

1 Introduction

A large variety of models may be used to describe processes: linear models, nonlinear models, artificial neural networks, and many others. It is thus necessary to compare the various models (for example with regard to their performance and complexity) and to choose the best one. The models are ranked according to some criterion such as the generalization error, usually defined as the average error that a model would make on an infinite-size, unknown test set independent of the learning set. In practice the generalization error can only be estimated, but several methods provide such an estimate: the AIC and BIC criteria and the like [1], [2], [3], as well as other well-known statistical techniques: cross-validation and k-fold cross-validation [3], [6], leave-one-out [3], [6], the bootstrap [4], [6] and its bias-corrected extension, the .632 bootstrap [4], [6]. The ideas presented in this paper can be applied both to the bootstrap and to the .632 bootstrap.

Although these methods are roughly asymptotically equivalent (see for example [5] and [6]), and although the choice of the bootstrap is not beyond debate, using the bootstrap appears advantageous in many real-world modeling cases (i.e. when the number of samples is limited, the dimension of the space is high, etc.) [6]. The main limitation of the bootstrap in practice, however, is the computation time required to reach an approximation of sufficient reliability (or accuracy). A second limitation, in our context of model selection, is that the selected best model is picked from a set of a priori chosen models, leading to a restricted choice.

In a previous work [7], we proposed a fast approximation of the generalization error using the bootstrap, based on linear and exponential approximations of the optimism and of the apparent error (as defined by Efron [4]), respectively. In this paper, we show experimentally the validity of the linear approximation of the optimism, and we show how to use this approximation to perform efficient bootstrap simulations with a reasonable computational complexity.

2 Model Selection Using the Bootstrap Technique

The foundation of the bootstrap is the plug-in principle [4]. This general principle makes it possible to obtain an estimator of a statistic according to an empirical distribution. In our context of model selection, the statistic of interest is the generalization error. We thus use the bootstrap to estimate the generalization error (or prediction error, in Efron's vocabulary) in order to rank the models and choose the best one.

The bootstrap estimator of the generalization error is computed according to the bootstrap resampling approach. Given an original sample (or data set) x, we generate B new samples, denoted x^b, 1 ≤ b ≤ B. The new samples x^b are obtained from the original sample x by drawing with replacement. For each bootstrap sample x^b, we compute a bootstrap estimator of the statistic of interest. The final value is obtained by taking the mean of the estimators over the B bootstrap replications.

In the following, we use the notation e_{A,B} for the error of a model built (learned) on a sample A and tested on a sample B. In the model selection context, Efron defines in [4] the bootstrap estimator of the generalization (prediction) error as

\hat{e}_{gen} = e_{app} + optimism,    (1)

where \hat{e}_{gen} is the bootstrap estimate of the generalization error e_{gen}, e_{app} is the apparent error (computed on the learning set A), and optimism is an estimator of the correction term for the difference between a learning error and a generalization error; it aims to approximate the difference between the errors obtained on the finite sample x and on an (infinite) unknown ideal sample. The optimism is computed according to

optimism = E_B[optimism^b],    (2)

where E_B[.] is the statistical expectation computed over the B bootstrap replications and where

optimism^b = e_{x^b, x} - e_{x^b, x^b}.    (3)

With our notation, (1) becomes

\hat{e}_{gen} = e_{x, x} + E_B[ e_{x^b, x} - e_{x^b, x^b} ].    (4)

In order to approach the theoretical value of the final bootstrap estimate of the generalization error, we can increase the number B of bootstrap replications, but this considerably increases the computation time. We proposed in [7] a way to reduce this computational load with a limited loss of accuracy.

Note that the .632 bootstrap [4] aims to reduce the slight bias introduced by the optimism correction. This bias is due to e_{x^b, x^b}, where the error is computed on the same set as the one used for the learning stage. The linear approximation of the optimism term, presented in the following, is applicable to both the bootstrap and the .632 bootstrap.

3 Framework

Assuming a linear relation between the optimism and the number p of parameters in the model may seem an unexpected hypothesis. Nevertheless, this hypothesis is supported by the fact that the general formulation of a structure selection criterion can also be written as

\hat{e}_{prediction} = e_{app} + correction,    (5)

where the correction term is 2 p \hat{\sigma}^2 / n for AIC and \ln(n) p \hat{\sigma}^2 / n for BIC, with \hat{\sigma}^2 the estimated quadratic error on the learning set containing n elements. In the AIC and BIC criteria and the like, the correction term is thus directly proportional to the number of parameters p. Although the apparent error e_{app} is also a function of p, we focus here on the second term, the correction.

Although the correction term considered here is computed by bootstrap, and is therefore called the optimism, its value depends, as the apparent error does, on the initialization conditions of the learning process. In practice, a "good" local minimum of the learning error (either on x or on x^b) is obtained by repeating the learning with different initial conditions. Nevertheless, when this repetition is included in a bootstrap procedure, the number of learnings is again multiplied by B, resulting in an excessive computation time.

By analogy with the AIC and BIC criteria, we assume that the correction term increases linearly. Our first goal is therefore to show experimentally that the optimism term is a linear function of p, of the form a_1 p + a_2. Under this hypothesis, if we compute the value of the optimism for a limited number of models, we can determine (in the mean-square sense) the constants a_1 and a_2. Our second goal is thus, under the linearity hypothesis, to propose a method that reduces the number of tested models and the number of bootstrap replications.
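For reference, the plain bootstrap estimator of Eqs. (1)-(4) can be sketched as follows. This is a minimal illustration, not the authors' code: the training routine fit_model(X, y, p) and its .predict() interface are assumed names standing in for any regression model of complexity p (e.g. an RBFN with p Gaussian kernels).

```python
import numpy as np

def bootstrap_generalization_error(X, y, fit_model, p, B=50, seed=None):
    """Bootstrap estimate of the generalization error, following Eqs. (1)-(4).

    fit_model(X, y, p) is assumed to return a fitted model exposing .predict(X);
    p is the model complexity (e.g. the number of Gaussian kernels).
    """
    rng = np.random.default_rng(seed)
    n = len(y)

    def mse(model, X_, y_):
        return np.mean((model.predict(X_) - y_) ** 2)

    # Apparent error e_{x,x}: model learned and tested on the original sample.
    model_full = fit_model(X, y, p)
    e_app = mse(model_full, X, y)

    # optimism^b = e_{x^b,x} - e_{x^b,x^b}, averaged over the B replications.
    optimisms = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)           # draw with replacement
        model_b = fit_model(X[idx], y[idx], p)     # learn on bootstrap sample x^b
        e_bx = mse(model_b, X, y)                  # test on the original sample x
        e_bb = mse(model_b, X[idx], y[idx])        # test on x^b itself
        optimisms.append(e_bx - e_bb)

    return e_app + float(np.mean(optimisms))       # Eq. (4)
```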

Since the values of a_1 and a_2 result from an experimental procedure, an obvious advantage of our proposal is that these parameters are set specifically for each application, avoiding the use of asymptotic results.

4 Methodology

In the experimental results shown below, we use Radial Basis Function Networks (RBFNs) as approximation models. We emphasize that this choice is made a priori and that the goal is not to compare the results with those that could be obtained with other approximators. The learning procedure used to fit the parameters of the model is described in [8], [9]. For the RBFN models, we take p in the expression a_1 p + a_2 to be the number of Gaussian units, or Gaussian kernels (the total number of parameters in an RBFN is in fact proportional to the number of Gaussian kernels).

To assess the linearity, we use the R² statistic, also called the squared correlation coefficient. Here, R² is computed between the optimism estimated for each model (i.e. for different values of p) and the linear approximation a_1 p + a_2 of these values. The closer this R² is to 1, the more valid our linear approximation is.

Remember that each optimism^b, in the context of nonlinear models, is usually the result of several learnings (Q learnings) with different initial conditions. To estimate one value of the optimism (i.e. the optimism for a specific model complexity p), we should therefore learn QB models, which could be excessive in our context. Notice, however, that in practice we are not interested in a specific value of the optimism but only in the linear approximation a_1 p + a_2. A lower accuracy on each value of the optimism can thus be compensated by the number of different complexities p, i.e. the number of points (larger than 2) used for the linear approximation.

5 Experimental Results

5.1 Artificial Example

We first illustrate the validity of the linear approximation of the optimism described in the previous sections on a toy example. We generate a set of 1000 data points (x, y), with x drawn uniformly at random in [0, 1] and y defined by

y = \sin(5x) + \sin(15x) + \sin(25x) + noise,    (6)

where noise is a uniform random variable in [-0.5, 0.5]. We then use the bootstrap resampling method in a model selection procedure, observing the generalization error corresponding to a specific model characterized by its number p of Gaussian kernels.

Figure 1 presents the evolution of our R² criterion versus the number B of bootstrap replications. We clearly see that R² gets closer and closer to one as B increases. Fisher's test (with a p-value of 2.2584·10^-11) leads to accepting the linear hypothesis from B = 15 onwards.
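The toy data of Eq. (6), and the least-squares fit of the estimated optimism values against p together with its R² (Section 4), can be reproduced with a short sketch. The code below is illustrative only; the RBFN training itself ([8], [9]) is not included, and the helper names are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data of Eq. (6): 1000 points, x uniform in [0, 1], uniform noise in [-0.5, 0.5].
x = rng.uniform(0.0, 1.0, size=1000)
y = np.sin(5 * x) + np.sin(15 * x) + np.sin(25 * x) + rng.uniform(-0.5, 0.5, size=1000)

def linear_fit_of_optimism(p_values, optimism_values):
    """Least-squares fit optimism ~ a1 * p + a2 and the R^2 of that fit (Sect. 4)."""
    p_values = np.asarray(p_values, dtype=float)
    optimism_values = np.asarray(optimism_values, dtype=float)
    a1, a2 = np.polyfit(p_values, optimism_values, deg=1)   # slope, intercept
    fitted = a1 * p_values + a2
    ss_res = np.sum((optimism_values - fitted) ** 2)
    ss_tot = np.sum((optimism_values - optimism_values.mean()) ** 2)
    return a1, a2, 1.0 - ss_res / ss_tot
```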

Since we accept the linear approximation hypothesis, we can go one step further and address the reduction of the computation time. We then look at the evolution of a_1 and a_2 as functions of B in Figures 2.1 and 2.2, respectively. Here again, when B is greater than or equal to 15, the values are roughly constant. Figure 3 shows the optimism as a function of the number p of Gaussian kernels in the model. Figure 4 shows the learning and generalization errors versus the number p of Gaussian kernels; we can see that the best model for our toy example has 20 Gaussian kernels. Finally, Figure 5 shows the 1000 learning data points and the predictions obtained with the selected model.

Fig. 1. Toy example: evolution of R² versus B.

Fig. 2.1. Toy example: evolution of coefficient a_1 versus B.

Fig. 2.2. Toy example: evolution of coefficient a_2 versus B.

Fig. 3. Toy example: approximation of the optimism term (thin line) versus the number of Gaussian kernels, with B = 20 (thick line: values obtained for the tested models).
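The selection step used for the toy example (rank candidate complexities by apparent error plus the linearly approximated optimism) can be summarized as below. This is a hypothetical sketch: apparent_errors and the coefficients a1, a2 are assumed to come from the previous steps, and the function name is ours.

```python
import numpy as np

def select_best_complexity(p_candidates, apparent_errors, a1, a2):
    """Pick the complexity p minimizing e_app(p) + a1 * p + a2 (Eq. (1) with the
    linear approximation of the optimism), instead of running a full bootstrap
    for every candidate model."""
    p_candidates = np.asarray(p_candidates)
    apparent_errors = np.asarray(apparent_errors, dtype=float)
    e_gen_hat = apparent_errors + a1 * p_candidates + a2
    best = int(np.argmin(e_gen_hat))
    return p_candidates[best], e_gen_hat[best]
```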

Fig. 4. Toy example: learning (thick) and generalization (thin) errors versus p.

Fig. 5. Toy example: learning data (dots) and predictions (solid line) with the selected model.

5.2 Real Data Set (Abalone)

We use the Abalone data set [10] as a second example to validate the linearity hypothesis, with a more realistic (and difficult) approximation problem. Here again, we use 1000 data points for learning. Figure 6 shows the evolution of R² with respect to B. In this case, the p-value of Fisher's test is rounded to 0 by the computer. Figure 7.1 shows the evolution of a_1 and Figure 7.2 the evolution of a_2 as functions of B. According to these graphs, we suggest using B = 40. Figure 8 shows the estimated optimism with respect to p. Figure 9 shows the learning and generalization errors versus the number p of Gaussian kernels; the minimum, corresponding to the best model for the Abalone data set, is obtained with 62 Gaussian kernels.

Fig. 6. Abalone: evolution of R² versus B.

Fig. 7.1. Abalone: evolution of coefficient a_1 versus B.

Fig. 7.2. Abalone: evolution of coefficient a_2 versus B.

Fig. 8. Abalone: approximation of the optimism term (thin) versus the number of Gaussian kernels, with B = 40 (thick: values obtained for the tested models).

Fig. 9. Abalone: learning (thick) and generalization (thin) errors versus the number p of Gaussian kernels.

6 Conclusion

In this paper we have shown that the optimism term of the bootstrap estimator of the prediction error is a linear function of the number of parameters p. Furthermore, we have illustrated the time-saving procedure proposed in [7], enhanced here with an early stopping criterion based on the R² of the linear approximation. According to the two results shown here, and to others not illustrated in this paper, we recommend a conservative value of 50 for the number B of bootstrap replications before stopping the approximation computation.

We would like to emphasize that the limited loss of accuracy is balanced by a considerable saving in computational load, this load being the main disadvantage of the bootstrap resampling procedure in practical situations. The saving is due to the reduced number of tested models and to the limited number of bootstrap replications. Although this procedure has only been tested in a neural network model selection context, this simple and time-saving method could easily be extended to other contexts of nonlinear regression, classification, etc., where computation time and complexity play a role. It can also be applied to other resampling procedures, such as the .632 bootstrap.

References

[1] H. Akaike, Information theory and an extension of the maximum likelihood principle, 2nd Int. Symp. on Information Theory, 267-281, Budapest, 1973.
[2] G. Schwarz, Estimating the dimension of a model, Ann. Stat. 6, 461-464, 1978.
[3] L. Ljung, System Identification - Theory for the User, 2nd ed., Prentice Hall, 1999.
[4] B. Efron, R. J. Tibshirani, An Introduction to the Bootstrap, Chapman & Hall, 1993.
[5] M. Stone, An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion, J. Royal Statist. Soc. B39, 44-47, 1977.
[6] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, Proc. of the 14th Int. Joint Conf. on Artificial Intelligence, Vol. 2, Canada, 1995.
[7] G. Simon, A. Lendasse, V. Wertz, M. Verleysen, Fast approximation of the bootstrap for model selection, accepted for publication in Proc. of ESANN 2003, d-side, Brussels, 2003.
[8] N. Benoudjit, C. Archambeau, A. Lendasse, J. Lee, M. Verleysen, Width optimization of the Gaussian kernels in Radial Basis Function Networks, Proc. of ESANN 2002, d-side, Brussels, 2002.
[9] M. J. Orr, Optimising the widths of radial basis functions, Proc. of the Vth Brazilian Symposium on Neural Networks, Belo Horizonte, Brazil, December 1998.
[10] W. J. Nash, T. L. Sellers, S. R. Talbot, A. J. Cawthorn, W. B. Ford, The Population Biology of Abalone (Haliotis species) in Tasmania. I. Blacklip Abalone (H. rubra) from the North Coast and Islands of Bass Strait, Sea Fisheries Division, Technical Report No. 48, 1994.