
Submitted to Neural Networks.

Dynamics of Batch Learning in Multilayer Networks -- Overrealizability and Overtraining --

Kenji Fukumizu
The Institute of Physical and Chemical Research (RIKEN)
E-mail: fuku@brain.riken.go.jp
December 28, 1998

Abstract

This paper investigates the dynamics of batch learning of multilayer neural networks in the asymptotic case where the number of training data is much larger than the number of parameters. We consider regression problems assuming noisy output data. First, we present experimental results on the behavior of steepest descent learning in multilayer perceptrons and three-layer linear neural networks. These results show that strong overtraining, an increase of the generalization error during training, occurs if the model has surplus hidden units to realize the target function. Next, to analyze overtraining from the theoretical viewpoint, we consider the solution of the steepest descent learning equation of a three-layer linear neural network, and prove that a network with surplus hidden units shows overtraining. From this theoretical analysis, we can see that overtraining is not a feature of the final stage of learning, but of an intermediate time interval.

Keywords: Overtraining, Overrealizability, Multilayer networks, Linear neural networks

1 Introduction

This paper discusses the dynamics of batch learning in multilayer neural networks. Multilayer networks such as multilayer perceptrons have been used extensively in many applications, in the hope that their nonlinear structure, inspired by biological neural networks, offers a variety of advantages. However, the nonlinearity also causes some weaknesses. To train a neural network,

[Figure 1: Overtraining in batch learning. The empirical error decreases during iterative training, while the generalization error increases after some time point. Axes: error versus time (number of iterations).]

for example, a numerical optimization method such as error back-propagation (Rumelhart et al., 1986) is needed to find the optimal parameters. It is very important, from both practical and theoretical viewpoints, to elucidate the dynamical behavior of a network during training. In particular, the empirical error, which is the objective function of learning, and the generalization error, the difference between the target function and its estimate, are of much interest in many studies. For example, the existence of plateaus, long flat intervals of a learning curve, has been reported, and a method of avoiding plateaus using the natural gradient has been proposed (Amari, 1998). In this paper, we focus on overtraining (Amari et al., 1996) as a typical behavior of batch learning in multilayer networks. Overtraining is the increase of the generalization error after some time point in learning (Fig. 1). Although the empirical error is in principle a decreasing function, this does not ensure a decrease of the generalization error. Overtraining is not only of theoretical interest but also of practical importance, because an early stopping technique can be used if overtraining really exists. There has been a controversy about the existence of overtraining. Some practitioners assert that overtraining often happens, and early stopping methods have been proposed (Sarle, 1995). Amari et al. (1996) theoretically analyze overtraining in asymptotic cases and conclude that the effect of overtraining is much smaller than practitioners believe. Their analysis holds if the parameter approaches the maximum likelihood estimator as described by statistical asymptotic theory (Cramér, 1946).
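The early stopping technique mentioned above can be sketched as follows. This is a generic illustration with our own function names, not code from the paper: the learning rule is abstracted as a `step` function, and training stops once a held-out validation error has not improved for a while.

```python
def train_with_early_stopping(step, params, val_error, max_iter=10000, patience=50):
    """Run the learning rule `step`, keep the parameters that minimize the
    validation error, and stop once the validation error has not improved
    for `patience` consecutive iterations."""
    best, best_err, since_best = params, val_error(params), 0
    for _ in range(max_iter):
        params = step(params)
        err = val_error(params)
        if err < best_err:
            best, best_err, since_best = params, err, 0
        else:
            since_best += 1
            if since_best >= patience:
                break
    return best, best_err

# Toy usage: a "training" rule that overshoots; validation error is distance to 1.0
step = lambda p: p + 0.05
val = lambda p: (p - 1.0) ** 2
p_best, e_best = train_with_early_stopping(step, 0.0, val)
print(round(p_best, 2))  # 1.0
```

The point of the sketch is that early stopping returns the parameters at the minimum of the validation curve, not at the end of training, which is exactly where it pays off if overtraining exists.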
However, the usual asymptotic theory cannot be applied in overrealizable cases, where the model has surplus hidden units to realize the target function (Fukumizu, 1996; Fukumizu, 1997). In practical applications, since one tends to use a large network to ensure realizability of the target, it is quite possible that the model has almost surplus hidden units. The

existence of overtraining has still been an open problem in such cases. There have been other studies on overtraining. Baldi and Chauvin (1991) theoretically analyze the generalization error of two-layer linear networks estimating the identity function from noisy data. Although they show overtraining phenomena under some conditions, their analysis is very limited in that overtraining can be observed only when the variance of the noise in the training sample is different from that of the validation sample. In studies from the viewpoint of statistical mechanics, overtraining or overfitting has also been reported in the learning of neural networks (Krogh & Hertz, 1992). However, the effect of overfitting in their analysis is very small if the number of data is sufficiently larger than the number of parameters. There can thus be other reasons for the strong overtraining observed in practical applications.

The main purpose of this paper is to discuss overtraining of multilayer networks in connection with overrealizability. We investigate two models: multilayer perceptrons with sigmoidal activation functions, and linear neural networks, which we introduce for simplicity of analysis. Through experimental results on these two models, we conjecture that overtraining occurs in overrealizable scenarios. We mathematically verify this conjecture in the case of linear neural networks under some conditions.

This paper is organized as follows. Section II gives basic definitions and terminology. Section III shows experimental results concerning overtraining of multilayer perceptrons and linear neural networks. In Section IV, we theoretically show the existence of overtraining in overrealizable cases of linear neural networks. Section V includes concluding remarks.

2 Preliminaries

2.1 Framework of statistical learning

In this section, we present the basic definitions and terminology used in this paper.
A feed-forward neural network model can be described as a parametric family of functions {f(x; θ)} from R^L to R^M, where x is an input vector and θ is a parameter vector. A three-layer network with H hidden units is defined by

    f_i(x; θ) = Σ_{j=1}^{H} w_{ij} φ( Σ_{k=1}^{L} u_{jk} x_k + ζ_j ) + η_i    (1 ≤ i ≤ M),    (1)

where θ = (w_{11}, ..., w_{MH}, η_1, ..., η_M, u_{11}, ..., u_{HL}, ζ_1, ..., ζ_H). The function φ(t) is called an activation function. In the case of a multilayer perceptron (MLP), the sigmoidal function is used as the activation function:

    φ(t) = 1 / (1 + e^{-t}).    (2)
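As a concrete illustration, the three-layer network of eqs. (1)-(2) can be sketched in a few lines of NumPy. The function and variable names below are ours, not from the paper:

```python
import numpy as np

def sigmoid(t):
    # Activation function of eq.(2): phi(t) = 1 / (1 + e^{-t})
    return 1.0 / (1.0 + np.exp(-t))

def mlp_forward(x, W, U, zeta, eta):
    """Three-layer network of eq.(1).

    x    : input vector, shape (L,)
    U    : input-to-hidden weights u_jk, shape (H, L)
    zeta : hidden biases zeta_j, shape (H,)
    W    : hidden-to-output weights w_ij, shape (M, H)
    eta  : output biases eta_i, shape (M,)
    Returns the output vector f(x; theta), shape (M,).
    """
    hidden = sigmoid(U @ x + zeta)   # phi(sum_k u_jk x_k + zeta_j)
    return W @ hidden + eta          # sum_j w_ij hidden_j + eta_i

# Tiny example with the sizes used in Section 3: L=1 input, H=2 hidden, M=1 output
rng = np.random.default_rng(0)
L, H, M = 1, 2, 1
U, zeta = rng.normal(size=(H, L)), rng.normal(size=H)
W, eta = rng.normal(size=(M, H)), rng.normal(size=M)
y = mlp_forward(np.array([0.5]), W, U, zeta, eta)
print(y.shape)  # (1,)
```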

We use such models for regression problems, assuming that an output of the target system is observed with measurement noise. A sample (x, y) from the target system satisfies

    y = f(x) + v,    (3)

where f(x) is the target function, which is unknown to the learner, and v is a random vector subject to N(0, σ² I_M), where N(μ, Σ) denotes a normal distribution with mean μ and variance-covariance matrix Σ. We write I_M for the M × M unit matrix. An input vector x is generated randomly with probability density function q(x). The training data {(x^{(ν)}, y^{(ν)}) | ν = 1, ..., N} are independent samples from the joint distribution given by q(x) and eq.(3). The parameter θ is estimated so that the neural network gives a good estimate of the target function f(x). We discuss the least square error (LSE) estimator: the purpose of training is to find the parameter that minimizes the empirical error

    E_emp(θ) = Σ_{ν=1}^{N} || y^{(ν)} - f(x^{(ν)}; θ) ||².    (4)

Generally, the LSE estimator cannot be calculated analytically if the model is nonlinear, and some numerical optimization method is needed to obtain an approximation. A widely used one is the steepest descent method, which iteratively updates the parameter vector in the steepest descent direction of the error surface defined by E_emp(θ). The learning rule is given by

    θ(t+1) = θ(t) - ε ∂E_emp/∂θ,    (5)

where ε is a learning constant. In this paper, we discuss only batch learning, in which the gradient is calculated using all the training data. There are also many studies of on-line learning, in which the parameter is updated for each newly generated datum (Heskes & Kappen, 1991). The performance of a network is often evaluated by the generalization error, which represents the average error between the target function and its estimate. More precisely, it is defined by

    E_gen(θ) = ∫ || f(x; θ) - f(x) ||² q(x) dx.    (6)

It is easy to see that

    ∫∫ || y - f(x; θ) ||² p(y|x) q(x) dy dx = M σ² + E_gen(θ).    (7)

The empirical error divided by N approaches the left-hand side of eq.(7) as N goes to infinity.
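Equations (4) and (5) can be sketched directly. The sketch below is ours, under our own naming; the gradient in eq.(5) is approximated by finite differences purely for illustration, whereas back-propagation computes it exactly:

```python
import numpy as np

def empirical_error(theta, X, Y, f):
    # Eq.(4): sum of squared errors over the N training samples.
    return sum(np.sum((y - f(x, theta)) ** 2) for x, y in zip(X, Y))

def steepest_descent_step(theta, X, Y, f, eps=0.01, h=1e-6):
    # Eq.(5): theta(t+1) = theta(t) - eps * dE_emp/dtheta,
    # with the gradient approximated by central finite differences.
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        d = np.zeros_like(theta)
        d[i] = h
        grad[i] = (empirical_error(theta + d, X, Y, f)
                   - empirical_error(theta - d, X, Y, f)) / (2 * h)
    return theta - eps * grad

# Toy usage: fit the 1-parameter linear model f(x; theta) = theta[0] * x
f = lambda x, th: th[0] * x
rng = np.random.default_rng(1)
X = rng.normal(size=20)
Y = 2.0 * X + 0.1 * rng.normal(size=20)   # noisy target, true slope 2.0
theta = np.array([0.0])
for _ in range(200):
    theta = steepest_descent_step(theta, X, Y, f)
print(theta)  # close to [2.0]
```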
The minimization of the empirical error roughly approximates the minimization of the generalization error. However, they are not exactly the same, and the decrease of the empirical error during learning does not ensure a decrease of the generalization error. It is therefore extremely important to elucidate the dynamical behavior of the generalization error. A curve showing the empirical or generalization error as a function of time is called a learning curve.

[Figure 2: Illustration of an overrealizable case. The target function lies inside the model.]

2.2 Overrealizable learning scenario

In our theoretical discussion, we consider the case where f(x) is perfectly realized by the prepared model; that is, there is a true parameter θ_0 such that f(x; θ_0) = f(x). If the target function f(x) is realized by a network with a smaller number of hidden units, we call the case overrealizable (see Fig. 2), following the terminology of Saad and Solla (1995). Otherwise, we call it regular. We focus on the difference in behavior between these two cases. One special property of overrealizable cases is that the true parameter is not unique: the set of parameters that realize the target function is not a point but a union of high-dimensional manifolds. The usual asymptotic theory cannot be applied to such cases. In the presence of measurement noise in the output data, the LSE estimator is located a little apart from this high-dimensional set.

The overrealizability assumption is generally too strong in real problems. However, one usually uses a model with a sufficient number of hidden units to avoid a large bias of the model. Therefore, it is highly probable that the model has almost surplus hidden units to realize the target. The question is which assumption gives the better description in the presence of almost surplus hidden units. We consider this problem in Section 3 through experiments. We note again that this paper discusses asymptotic situations: while we consider overrealizable cases, the number of data is still sufficiently larger than the number of parameters. This redundancy is different from the case where the number of data is comparable to the number of parameters, which is often discussed in studies based on statistical mechanics (e.g. Krogh & Hertz, 1992).

2.3 Linear neural networks

We introduce three-layer linear neural networks as a model for which theoretical analysis is possible. We must be very careful in discussing experimental results on the learning of multilayer perceptrons, especially in overrealizable cases, in which there exists an almost flat subvariety around the global minima (Fukumizu, 1997). The convergence of learning is much slower than in regular cases because the gradient is very small around this subvariety. In addition, learning with a gradient method suffers from local minima, a problem common to all nonlinear models. We cannot exclude these effects, and this often makes the derived conclusions obscure. On the other hand, theoretical analysis of linear neural networks is possible (Baldi & Hornik, 1995; Fukumizu, 1997), and the dynamics of their learning will be analyzed in Section 4.

A linear neural network has the identity function as φ(t). In this paper, we do not use bias terms in linear neural networks, for simplicity. Thus, a linear neural network is defined by

    f(x; A, B) = BAx,    (8)

where A is an H × L matrix and B is an M × H matrix. We assume H ≤ L and H ≤ M throughout this paper. Although f(x; A, B) is a linear function, by the assumption on H the model is not the set of all linear maps from R^L to R^M, but the set of linear maps of rank not greater than H. The model is therefore not equivalent to the usual linear model f(x; C) = Cx (C an M × L matrix). In this sense, the three-layer linear neural network is the simplest multilayer model. The parameterization in eq.(8) has a trivial redundancy: if we multiply B from the right by a nonsingular H × H matrix and A from the left by its inverse, the resulting map is the same. Given a linear map of rank H, the set of parameters that realize the map forms an H²-dimensional manifold. However, we can eliminate this redundancy by restricting the parameterization so that the first H rows of A form the unit matrix.
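The model of eq.(8), its rank constraint, and the trivial redundancy can all be checked numerically. This is a small sketch of ours, not code from the paper:

```python
import numpy as np

# Linear neural network of eq.(8): f(x; A, B) = B A x,
# with A of size H x L and B of size M x H, H <= L and H <= M.
L, H, M = 5, 3, 5
rng = np.random.default_rng(0)
A = rng.normal(size=(H, L))
B = rng.normal(size=(M, H))
x = rng.normal(size=L)

y = B @ (A @ x)

# Trivial redundancy: (B G^{-1}) and (G A) realize the same map
# for any nonsingular H x H matrix G.
G = rng.normal(size=(H, H))          # generically nonsingular
y2 = (B @ np.linalg.inv(G)) @ ((G @ A) @ x)
print(np.allclose(y, y2))  # True

# The composite map B A has rank at most H < min(L, M), so the model is
# strictly smaller than the set of all M x L linear maps C.
print(np.linalg.matrix_rank(B @ A))  # 3
```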
A more essential redundancy in the parameterization arises when the rank of the map is less than H. Even with the above restriction, the set of parameters that realize such a map of smaller rank is still high-dimensional. Accordingly, in the case of linear neural networks, we say the case is overrealizable if the rank of the target function is less than H.

3 Learning curves -- experimental study --

In this section, we experimentally investigate the generalization error of multilayer perceptrons and linear neural networks to see their actual behavior, with emphasis on overtraining. The steepest descent method in multilayer perceptrons leads to the well-known error back-propagation rule (Rumelhart et al., 1986). To avoid the problems discussed in Section 2.3 as much as possible, we adopt the following configuration in our

[Figure 3: Average learning curves of MLP, (a) regular case and (b) overrealizable case, plotting the average of square errors (empirical error in training) against the number of iterations. The dashed line shows the theoretical expectation of the generalization error of the LSE estimator in the case that the target function is regular and the usual asymptotic theory is applicable.]

experiments. The model in use has 1 input unit, 2 hidden units, and 1 output unit. A set of training data has 100 examples. For a fixed set of training data, 30 different initial parameter vectors are tried, and we select the trial that gives the least empirical error after 500000 iterations. The output data include observation noise subject to N(0, 10^{-4}). The constant zero function is used as an overrealizable target, and f(x) = s(x + 1) + s(x - 1) is used as a regular target. Figure 3 shows the average of the generalization errors over 30 different simulations, changing the set of training data. It shows clear overtraining in the overrealizable case, while the learning curve in the regular case shows no meaningful overtraining.

It is known that the LSE estimator of the linear neural network model can be obtained analytically (Baldi & Hornik, 1995). Although it would be practically absurd to use the steepest descent method, our interest is not the LSE estimator itself but the dynamics of learning. Therefore, we investigate the steepest descent learning of linear neural networks here. For the training data of a linear neural network, we use the notations

    X = ( x^{(1)T}; ...; x^{(N)T} ),    Y = ( y^{(1)T}; ...; y^{(N)T} ),    (9)

that is, X and Y are the N × L and N × M matrices whose ν-th rows are x^{(ν)T} and y^{(ν)T}, where T denotes transposition. Then, the empirical error can be written as

    E_emp = Tr[ (Y - X A^T B^T)^T (Y - X A^T B^T) ].    (10)

The batch learning rule of a linear neural network is given by

    A(t+1) = A(t) + ε { B^T Y^T X - B^T B A X^T X },
    B(t+1) = B(t) + ε { Y^T X A^T - B A X^T X A^T }.    (11)
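Equations (10) and (11) translate directly into matrix code. The following is a minimal sketch of ours (the learning constant and initialization are illustrative choices), using the rank-0 target and the dimensions of the experiment in this section:

```python
import numpy as np

def batch_step(A, B, X, Y, eps):
    # Eq.(11): one batch steepest descent step for the linear network f = B A x.
    XtX = X.T @ X
    A_new = A + eps * (B.T @ Y.T @ X - B.T @ B @ A @ XtX)
    B_new = B + eps * (Y.T @ X @ A.T - B @ A @ XtX @ A.T)
    return A_new, B_new

def emp_error(A, B, X, Y):
    # Eq.(10): E_emp = Tr[(Y - X A^T B^T)^T (Y - X A^T B^T)]
    R = Y - X @ A.T @ B.T
    return np.trace(R.T @ R)

# Setup mirroring Section 3: L=5 inputs, H=3 hidden, M=5 outputs, N=100 samples,
# a rank-0 (zero) target, and observation noise with sigma^2 = 10^{-2}.
rng = np.random.default_rng(0)
L, H, M, N = 5, 3, 5, 100
X = rng.normal(size=(N, L))
Y = 0.1 * rng.normal(size=(N, M))          # zero target plus noise
A = 0.1 * rng.normal(size=(H, L))
B = 0.1 * rng.normal(size=(M, H))

e0 = emp_error(A, B, X, Y)
for _ in range(1000):
    A, B = batch_step(A, B, X, Y, eps=1e-4)
e1 = emp_error(A, B, X, Y)
print(e1 < e0)  # the empirical error decreases during training
```

Note that monitoring only e1 says nothing about the generalization error, which is exactly the distinction the experiments in this section are designed to expose.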

[Figure 4: Average learning curves of linear neural networks, plotting the average of generalization errors against the number of iterations for targets of rank (a) 0, (b) 1, (c) 2, and (d) 3 (regular). The dotted line shows the theoretical expectation of the generalization error of the LSE estimator for the regular target.]

It is also known that all the critical points of E_emp except the global minimum are saddle points (Baldi & Hornik, 1995). Therefore, if we theoretically consider a continuous-time learning equation, it converges to the global minimum for almost all initial conditions. Figure 4 shows the average of the generalization errors for each target of rank 0, 1, 2, and 3. Each curve is the average over 100 simulations with different sets of training data drawn from the same probability distribution. The numbers of input units, hidden units, and output units are 5, 3, and 5, respectively. The output samples include observation noise subject to N(0, 10^{-2} I_5). The number of training data is 100 in each simulation. All the overrealizable cases show clear overtraining, while the regular case does not show meaningful overtraining. We can also see that overtraining is stronger when the rank of the target is smaller. In the next experiment with the MLP model, we use the error function,

[Figure: Average learning curves (generalization error and empirical error) against the number of iterations, (a) regular case and (b) almost overrealizable case.]
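For the linear network with a linear target f(x) = Tx and Gaussian inputs, the generalization error of eq.(6) has a simple closed form. Under our assumption q(x) = N(0, I_L) (a standard identity, not a formula from the paper), E_gen = || BA - T ||_F², which makes the curves in Figure 4 cheap to compute; the sketch below verifies the identity by Monte Carlo:

```python
import numpy as np

# E_x || (BA - T) x ||^2 = Tr[(BA - T)(BA - T)^T] when E[x x^T] = I_L.
rng = np.random.default_rng(0)
L, H, M = 5, 3, 5
T = np.zeros((M, L))                 # rank-0 target, as in Fig. 4(a)
A = 0.1 * rng.normal(size=(H, L))
B = 0.1 * rng.normal(size=(M, H))

closed_form = np.sum((B @ A - T) ** 2)   # squared Frobenius norm

# Monte Carlo estimate of eq.(6) with q(x) = N(0, I_L)
xs = rng.normal(size=(200000, L))
mc = np.mean(np.sum((xs @ (B @ A - T).T) ** 2, axis=1))

print(abs(closed_form - mc) / closed_form)  # small relative error
```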