Input Selection with Partial Retraining

Pierre van de Laar, Stan Gielen, and Tom Heskes
RWCP* Novel Functions SNN** Laboratory, Dept. of Medical Physics and Biophysics, University of Nijmegen, The Netherlands
* Real World Computing Partnership
** Foundation for Neural Networks

Abstract. In this article, we describe how input selection can be performed with partial retraining. By detecting and removing irrelevant input variables, resources are saved, generalization tends to improve, and the resulting architecture is easier to interpret. In our simulations the relevant input variables were correctly separated from the irrelevant variables for a regression and a classification problem.

1 Introduction

Especially with a lack of domain knowledge, the usual approach in neural network modeling is to include all input variables that may have an effect on the output. Furthermore, the number of parameters of the neural network is chosen large, since with too few parameters a network is unable to learn a complex relationship between input and output variables. This approach is suboptimal for three reasons. First, the inclusion of irrelevant variables and the (possibly) abundant parameters tends to degrade generalization. Second, resources are wasted by measuring irrelevant variables. And finally, a model with irrelevant variables and parameters is more difficult to understand. Architecture selection algorithms (see e.g. [1,5,6,9]) try to remove irrelevant parameters and/or input variables and, consequently, save resources, improve generalization, and yield architectures which are easier to interpret.

In this article, we describe a new algorithm to perform input selection, a subproblem of architecture selection, which exploits partial retraining [11]. We first describe partial retraining in section 2. In section 3, we describe how the performances of networks with different sets of input variables are compared. In section 4, we briefly describe the problem of input selection and introduce our algorithm, which exploits partial retraining. In section 5, we perform simulations in which our algorithm is applied for input selection in two artificial problems. Our conclusions and some discussion can be found in section 6.

2 Partial retraining

Suppose we have a neural network which has been trained on N input variables and we would like to determine the performance which can be achieved with a subset of these N input variables.

A naive method is to train a new neural network on this subset to determine this performance. Instead of this computationally expensive process of training a new network, the new network can also be estimated from the original network. Partial retraining [11] assumes that the neural network trained on all N input variables has constructed in its hidden layers a good representation of the data, needed to map the input to the output. Therefore, the new network is estimated by partial retraining such that its hidden-layer activities are as close as possible to the hidden-layer activities of the original neural network, although its input is restricted. The performance of this constructed network is an estimate of the performance we want to determine.

Partial retraining uses the following three steps to construct the network which receives only a subset of all variables as its input. In the first step, partial retraining determines the new weights between the subset of input variables and the first hidden layer by

\tilde{w}^1 = \arg\min_B \sum_\mu \| h^1_\mu - B \tilde{X}_\mu \|^2 ,

where h^1_\mu \equiv w^1 X_\mu denotes the original incoming activity of the first hidden layer, \tilde{X} stands for the input limited to the subset, and \mu labels the patterns. The difference between fitting the incoming activity and fitting the outgoing activity of the hidden layer is almost negligible (see e.g. [8]). We prefer fitting the incoming activity, since this can be done by simple and fast quadratic optimization. Furthermore, it can easily be shown [11] that the new weights between the input and the first hidden layer are chosen such that the neural network estimates the missing value(s) based on linear dependencies, and processes the "completed" input data.

The compensation of the errors introduced by removing one or more input variables is probably not perfect, due to noise and nonlinear dependencies in the data. Therefore, to further minimize the effects caused by the removal of one or more input variables, in the second step the new weights between hidden layers \nu and \nu - 1 are calculated from

\tilde{w}^\nu = \arg\min_B \sum_\mu \| h^\nu_\mu - B \tilde{H}^{\nu-1}_\mu \|^2 ,

with h^\nu_\mu = w^\nu H^{\nu-1}_\mu the original incoming activity of hidden layer \nu and \tilde{H}^\nu the new \nu-th hidden-layer activity based on the subset of input variables and the newly estimated weights.

Finally, the weights between the last hidden layer and the output are re-estimated. Here we have two options: either we treat the output layer as a hidden layer and try to fit the incoming activity of the original network, or we use the output targets to calculate the desired incoming activity. The first approach, which we will also take in our simulations, yields

\tilde{w}^{K+1} = \arg\min_B \sum_\mu \| o_\mu - B \tilde{H}^K_\mu \|^2 ,

where o_\mu = w^{K+1} H^K_\mu is the original incoming activity of the output layer, and \tilde{H}^K is the activity of the last (K-th) hidden layer given the new weights and the limited input.

As already mentioned, partial retraining is far less computationally intensive than the naive approach, i.e., training a new network with the restricted input. This reduction in computational cost is accomplished by using the implicit knowledge about the problem which is contained in the neural network trained on all input variables. Furthermore, partial retraining has a unique solution, whereas training a new network might converge to another (local or global) minimum.
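For concreteness, the following is a minimal numpy sketch of this three-step construction, under assumptions of our own choosing (hyperbolic tangent hidden units, a linear output as in the simulations of section 5, and no bias terms); all function and variable names are illustrative, not taken from [11]:

```python
import numpy as np

def partial_retrain(weights, X, keep, act=np.tanh):
    """Estimate a network restricted to the input variables in `keep` (sketch).

    weights : list [w^1, ..., w^{K+1}]; layer l has matrix w^l of shape
              (units_l, units_{l-1}).  Biases are omitted for simplicity.
    X       : (P, N) array of training inputs, one pattern per row.
    keep    : indices of the retained input variables.
    """
    # Forward pass through the original network, recording the incoming
    # activity h^l of every layer; the output layer is linear.
    acts_in, H = [], X
    for l, w in enumerate(weights):
        h = H @ w.T
        acts_in.append(h)
        H = act(h) if l < len(weights) - 1 else h

    # Layer-wise refit: argmin_B sum_mu ||h^l_mu - B Htilde^{l-1}_mu||^2 is a
    # quadratic problem, solved directly by linear least squares.
    new_weights, Ht = [], X[:, keep]
    for l, h in enumerate(acts_in):
        B, *_ = np.linalg.lstsq(Ht, h, rcond=None)
        new_weights.append(B.T)
        ht = Ht @ B
        Ht = act(ht) if l < len(weights) - 1 else ht
    return new_weights
```

Because every step is an ordinary least-squares problem, the solution is unique, which is exactly the property noted above.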

3 Network Comparison

How are the performances of two networks with different sets of input variables compared? We describe how this is done for two different tasks: a regression task and a classification task. For simplicity, we assume that both tasks have only one output.

We assume that the performance on a regression task is measured by the mean squared error,

E_i = \frac{1}{P} \sum_{\mu=1}^{P} (T_\mu - O^i_\mu)^2 ,

with T_\mu and O^i_\mu the target and the output of network i for pattern \mu, respectively. The difference in performance between two networks is thus given by

E_1 - E_2 = \frac{1}{P} \sum_{\mu=1}^{P} \left[ 2T_\mu - (O^1_\mu + O^2_\mu) \right] \left[ O^2_\mu - O^1_\mu \right] . \qquad (1)

If the mean squared error is used, it is reasonable to assume that T_\mu - O^i_\mu is drawn from a Gaussian distribution, and thus that the two bracketed terms in equation (1) are Gaussian distributed as well. Under the assumption that these two terms are independent (not correlated), the average value of their product is zero, and its standard deviation \sigma is equal to the standard deviation of the first bracketed term in equation (1) times the standard deviation of the second. If P is large, the central limit theorem yields that the hypothesis that the performances of the two networks are identical has to be rejected at significance level \alpha if

P\left( Z > \frac{\sqrt{P}\, |E_1 - E_2|}{\sigma} \right) < \alpha ,

where P(Z > z) \equiv \frac{1}{2} \left[ 1 - \operatorname{erf}\left( \frac{z}{\sqrt{2}} \right) \right] is the probability that a value greater than z is drawn from a standard Gaussian distribution, i.e., N(0, 1).

The percentage of misclassifications in a classification task is given by

C = \frac{1}{P} \sum_{\mu=1}^{P} \left[ (1 - T_\mu) O_\mu + T_\mu (1 - O_\mu) \right] ,

where T_\mu and O_\mu are the target and output label (\in \{0, 1\}) of pattern \mu, respectively. The difference in misclassifications between two networks is given by

C_1 - C_2 = \frac{1}{P} \sum_{\mu=1}^{P} (2T_\mu - 1)(O^2_\mu - O^1_\mu) .

Let us consider the hypothesis that both networks perform equally well, i.e., that both have an equal chance of misclassification p, which can be estimated from the data by p = (C_1 + C_2)/2. Similarities between the solutions constructed by the networks correlate the outputs of the networks. The ratio of simultaneous misclassifications over all misclassifications can be estimated by q = 2 C_{1 \cap 2} / (C_1 + C_2), where C_{1 \cap 2} denotes the percentage of simultaneous misclassifications. Under this hypothesis, we find that the term (2T_\mu - 1)(O^2_\mu - O^1_\mu) in the difference in misclassifications is on average equal to zero and has standard deviation \sigma = \sqrt{2p(1 - q)}. Again, if P is large, the central limit theorem yields that our hypothesis has to be rejected at significance level \alpha if

P\left( Z > \frac{\sqrt{P}\, |C_1 - C_2|}{\sigma} \right) < \alpha .
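Both tests translate directly into code. The sketch below implements them under the assumptions stated above (independent Gaussian bracketed terms for regression; the estimates p and q for classification, with at least one misclassification so the estimates are defined); the helper names are illustrative, and targets and outputs are assumed to be numpy arrays:

```python
import math
import numpy as np

def p_greater(z):
    """P(Z > z) for a standard Gaussian, via the error function."""
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2.0)))

def compare_regression(T, O1, O2, alpha=0.05):
    """Reject 'both networks perform equally well' on a regression task?"""
    a = 2 * T - (O1 + O2)           # first bracketed term of Eq. (1)
    b = O2 - O1                     # second bracketed term of Eq. (1)
    E_diff = np.mean(a * b)         # E_1 - E_2
    sigma = np.std(a) * np.std(b)   # std of the product, assuming independence
    z = math.sqrt(len(T)) * abs(E_diff) / sigma
    return p_greater(z) < alpha

def compare_classification(T, O1, O2, alpha=0.05):
    """Same test for 0/1 targets and 0/1 output labels."""
    m1 = (O1 != T).astype(float)    # misclassification indicators
    m2 = (O2 != T).astype(float)
    C1, C2, C12 = m1.mean(), m2.mean(), (m1 * m2).mean()
    p = (C1 + C2) / 2.0             # common misclassification chance
    q = 2.0 * C12 / (C1 + C2)       # fraction of shared misclassifications
    sigma = math.sqrt(2.0 * p * (1.0 - q))
    z = math.sqrt(len(T)) * abs(C1 - C2) / sigma
    return p_greater(z) < alpha
```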

4 Input Selection

Since every input variable is either selected or not, there are 2^N possibilities for N input variables. Using partial retraining to determine the performance of the network for every possible combination of input variables is therefore only feasible when the number of input variables is rather small. Alternative, but approximate, strategies are backward elimination, forward selection, and stepwise selection (see e.g. [3,7]). Backward elimination starts with all input variables and removes the least relevant variables one at a time; it stops if removal of any of the remaining input variables would drop the performance of the network below a given threshold. Forward selection starts without any input variables, sequentially includes the most relevant variables, and stops as soon as the performance of the network exceeds a given threshold. Stepwise selection is a modified version of forward selection that permits re-examination, at every step, of the variables included in previous steps, since a variable that was included at an early stage may become irrelevant at a later stage when other (related) variables are also included.

In our simulations we remove the least relevant variables one at a time but, unlike backward elimination, we do not stop when the performance of the network drops below a given threshold: we continue until all variables are removed. From the resulting architectures, i.e., with none, one, two, ..., and N inputs, we choose the smallest architecture whose error is not significantly different from that of the architecture with the minimal error, as sketched below. Note that the parameter \alpha controls Occam's razor: the smaller \alpha, the higher the probability that a smaller architecture is considered statistically identical to the architecture with the minimal error. In other words, the smaller \alpha, the higher the chance that we arrive at networks with only a few input variables.
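A minimal sketch of this selection loop follows, with the two expensive ingredients left as callbacks: eval_error would be computed with partial retraining (section 2), and not_significantly_different with the tests of section 3 at level \alpha. Both callbacks, and all names, are our illustration rather than a definitive implementation:

```python
def select_inputs(n_inputs, eval_error, not_significantly_different):
    """Remove the least relevant input one at a time, down to zero inputs,
    then return the smallest subset whose error is statistically identical
    to the minimal error found (the Occam's razor step controlled by alpha).

    eval_error(subset) -> error estimate, e.g. via partial retraining.
    not_significantly_different(subset_a, subset_b) -> bool, e.g. the
        significance test of section 3.
    """
    keep = list(range(n_inputs))
    history = [(list(keep), eval_error(keep))]
    while keep:
        # The least relevant variable is the one whose removal
        # increases the estimated error the least.
        drop = min(keep, key=lambda i: eval_error([j for j in keep if j != i]))
        keep.remove(drop)
        history.append((list(keep), eval_error(keep)))
    best_subset, _ = min(history, key=lambda se: se[1])
    # Smallest architecture not significantly different from the best one;
    # the best subset itself is in `history`, so this always returns.
    for subset, _ in sorted(history, key=lambda se: len(se[0])):
        if not_significantly_different(subset, best_subset):
            return subset
```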

5 Simulations

5.1 Regression

The response of our artificial regression task, given ten input variables X_1, ..., X_{10} which are uniformly distributed over [0, 1], is given by the following signal-plus-noise model [4]:

T = 10 \sin(\pi X_1 X_2) + 20 \left( X_3 - \frac{1}{2} \right)^2 + 10 X_4 + 5 X_5 + \epsilon ,

where \epsilon is N(0, 1), i.e., standard normally distributed noise. The response does not depend on the irrelevant or noisy input variables X_6, X_7, X_8, X_9, and X_{10}. Furthermore, we assume that the ten input variables are not independent: two irrelevant inputs are identical, X_9 \equiv X_{10}, as are two relevant inputs, X_4 \equiv X_5.

Based on this signal-plus-noise model, we generated a data set of 400 input-output patterns. We trained one hundred two-layered multilayer perceptrons with ten input, seven hidden, and one output unit, with the hyperbolic tangent and the identity as transfer functions of the hidden and output layer, respectively. The training procedure was as follows: starting from small random initial values, the weights were updated using backpropagation on the mean squared error of a training set of 200 patterns randomly selected from the data set. Training was stopped at the minimum of the mean squared error on a validation set consisting of the remaining 200 patterns.

With \alpha = 0.01, 94 out of 100 networks chose exactly the relevant input variables, i.e., inputs X_1, X_2, X_3, and X_5 (since X_4 \equiv X_5, one of the two identical relevant inputs suffices); the remaining 6 networks chose not only the relevant input variables but also one irrelevant input. These 6 networks might have been confused by random correlations between this irrelevant input and the output in both the training and the validation set.

5.2 Classification

The response of our artificial classification task, given ten binary input variables X_1, ..., X_{10}, is given by the following rule:

T = X_1 \cup (X_2 \cap X_3) \cup (X_4 \cap X_5 \cap X_6) ,

where 0 and 1 code false and true, respectively. The response thus depends on one very important variable (X_1), two equally important variables (X_2 and X_3), and three less important variables (X_4, X_5, and X_6). Furthermore, the response does not depend on the four irrelevant or noisy variables X_7, X_8, X_9, and X_{10}.

Based on this rule, we generated a data set containing 400 examples. We applied the same training procedure as in the regression task, except that the neural networks had only five instead of seven hidden units. After training, the patterns were classified based on their output: if the output for a pattern was larger than 0.5, the pattern was labeled true, otherwise the class label was false.

With \alpha = 0.05, 98 out of 100 networks chose the correct architecture with inputs X_1, X_2, X_3, X_4, X_5, and X_6. The remaining two networks chose a smaller architecture, which did not include all relevant variables: to be more precise, one removed variable X_4, the other removed the variables X_4, X_5, and X_6. This might be caused by the lack of examples in the training set, since only 200 patterns out of 2^{10} = 1024 possibilities were available.
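For reproducibility, a short numpy sketch of how the regression data set can be generated (the seed and the function name are our own; the factor \pi inside the sine follows Friedman's original benchmark [4]):

```python
import numpy as np

def friedman_data(P=400, seed=0):
    """Generate the section 5.1 regression data: ten inputs uniform on
    [0, 1] with X5 == X4 and X10 == X9, target from the Friedman model."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(P, 10))
    X[:, 4] = X[:, 3]            # X5 := X4 (0-based columns 3 and 4)
    X[:, 9] = X[:, 8]            # X10 := X9
    noise = rng.standard_normal(P)
    T = (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20.0 * (X[:, 2] - 0.5) ** 2
         + 10.0 * X[:, 3] + 5.0 * X[:, 4] + noise)
    return X, T
```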

6 Discussion

In this article, we have described how input selection can be performed using partial retraining. We focused on how the performance of neural networks with different subsets of input variables can be compared, in both regression and classification tasks. Computer simulations have shown that, on two artificial problems, this algorithm indeed selects the relevant input variables.

The algorithm described in this article can easily be generalized. For example, by viewing hidden units as input units of a smaller network [2,10], it can also be used for hidden unit selection. Similarly, it can be used for weight pruning by estimating the relevance of a single weight.

References

1. T. Cibas, F. Fogelman Soulié, P. Gallinari, and S. Raudys. Variable selection with optimal cell damage. In M. Marinaro and P. G. Morasso, editors, Proceedings of the International Conference on Artificial Neural Networks, volume 1, pages 727-730. Springer-Verlag, 1994.
2. T. Czernichow. Architecture selection through statistical sensitivity analysis. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks - ICANN 96, volume 1112 of Lecture Notes in Computer Science, pages 179-184. Springer, 1996.
3. N. R. Draper and H. Smith. Applied Regression Analysis. Wiley Series in Probability and Mathematical Statistics. Wiley, New York, second edition, 1981.
4. J. H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1-141, 1991.
5. G. D. Garson. Interpreting neural-network connection weights. AI Expert, 6(4):47-51, 1991.
6. B. Hassibi, D. G. Stork, G. Wolff, and T. Watanabe. Optimal Brain Surgeon: Extensions and performance comparisons. In J. D. Cowan, G. Tesauro, and J. Alspector, editors, Advances in Neural Information Processing Systems, volume 6, pages 263-270, San Francisco, 1994. Morgan Kaufmann.
7. D. G. Kleinbaum, L. L. Kupper, and K. E. Muller. Applied Regression Analysis and Other Multivariable Methods. The Duxbury Series in Statistics and Decision Sciences. PWS-KENT Publishing Company, Boston, second edition, 1988.
8. J. O. Moody and P. J. Antsaklis. The dependence identification neural network construction algorithm. IEEE Transactions on Neural Networks, 7(1):3-15, 1996.
9. M. C. Mozer and P. Smolensky. Using relevance to reduce network size automatically. Connection Science, 1(1):3-16, 1989.
10. K. L. Priddy, S. K. Rogers, D. W. Ruck, G. L. Tarr, and M. Kabrisky. Bayesian selection of important features for feedforward neural networks. Neurocomputing, 5(2/3):91-103, 1993.
11. P. van de Laar, T. Heskes, and S. Gielen. Partial retraining: A new approach to input relevance determination. Submitted, 1997.