Determining The Degree of Generalization Using An Incremental Learning Algorithm

Pablo Zegers
Facultad de Ingeniería, Universidad de los Andes
San Carlos de Apoquindo 2200, Las Condes, Santiago, Chile
pzegers@uandes.cl

Malur K. Sundareshan
Electrical and Computer Engineering Department, The University of Arizona
1230 East Speedway, Tucson, AZ 85721-0104, USA
sundareshan@ece.arizona.edu

Abstract. Any Learning Machine (LM) trained with examples poses the same problem: how to determine whether the LM has achieved an acceptable level of generalization or not. This work presents a training method that uses the data set in an incremental manner, such that it is possible to determine when the behavior displayed by the LM during the learning stage truthfully represents its future behavior when confronted with unseen data samples. The method uses the set of samples in an efficient way, which allows discarding all those samples not really needed for the training process. The new training procedure, which will be called the Incremental Training Algorithm, is based on a theoretical result that is proven using recent developments in statistical learning theory. A key aspect of this analysis involves the identification of three distinct stages through which the learning process normally proceeds, which in turn can be translated into a systematic procedure for determining the generalization level achieved during training. It must be emphasized that the presented algorithm is general and independent of the architecture of the LM and of the specific training algorithm used. Hence it is applicable to a broad class of supervised learning problems and not restricted to the example presented in this work.

1 Introduction

This paper focuses on an incremental learning algorithm devised to determine whether a trained Learning Machine (LM) has reached an acceptable level of generalization or not, using the set of samples in an efficient manner, just from an examination of the learning behavior of the LM. The issue of generalization has been a main concern since learning problems started to be studied. Thus, it is not a surprise that the problem was addressed when the learning automaton concept [12, 17, 6, 11] was born in the early sixties. Because the question about generalization bears direct impact on the adaptation process, it constitutes a core question of the learning automata problem. Nevertheless, due to the stochastic nature of these machines it has proven difficult to understand the generalization process in the context of the learning automaton, and, therefore, difficult to optimally tailor the training process of these machines in order to improve their generalization capability. Thus, in terms of the analysis of its generalization capacity, the learning automaton remains largely a heuristic method.

During the last 20 years, the research community has studied different aspects of the generalization problem in a broader class of LMs [1, 18, 19, 2, 4, 3]. While all these works have attempted to tackle different aspects of the problem, the development of systematic procedures for determining when a generic LM has attained an acceptable level of generalization has proven very difficult. It is important to keep in mind that the goal in any training procedure is to obtain a reasonable level of generalization using a minimum of samples and computational resources. An important exception to this is the recent understanding gained on the support vector machine [14, 15, 16, 8]. The usefulness of the support vector machine method has been proven by a cadre of practical algorithms and theoretical results that allow one to ensure when a desired level of generalization has been reached and to determine which samples are essential for the training process, i.e. to find the support vectors. Even though this architecture works as a universal function approximator, and therefore is the basis for a very important and useful class of LMs, these results only apply to this specific architecture and have not been generalized to arbitrary architectures. In summary, there is still no theory or organized method that provides practical answers to how to determine the degree of generalization, with an efficient usage of the set of samples, in the case of a generic LM. This work presents an incremental learning method that, under certain sufficient conditions, permits answering the question of when a good degree of generalization has been attained while keeping the usage of samples to a minimum. The main advantage of this method is that it only requires studying some key aspects underlying the learning behavior of the LM. The paper starts with an introduction to statistical learning theory [14, 15, 16], continues by presenting the new learning scheme, analytically proves why the presented algorithm should work, tests everything with an experiment, and ends with a discussion of the results.

2 Statistical Learning Theory

The LM problem, as defined in this work, consists in finding θ* = arg min_θ E_I(θ), where

E_I(\theta) = \int f_X(x)\,(y - \hat{y})^2\, dx    (1)

is called the intrinsic error, f_X(x) is a density function, y = g(x) is a reference function, and ŷ = ĝ(x; θ) is an estimation of the reference function defined by the parameter vector θ. However, what makes the problem complicated is that f_X(x) is unknown and, therefore, it is not possible to find θ*. Instead, a set of l independent and identically distributed data samples y_i = g(x_i) is available, such that it is only possible to find θ_l = arg min_θ E_E(θ; l), where

E_E(\theta; l) = \frac{1}{l} \sum_{i=1}^{l} (y_i - \hat{y}_i)^2    (2)

is the so-called empirical error. The reader is referred to the recent texts [14, 15, 16], which provide an excellent treatise on the basic concepts of statistical learning theory. Given the previous formalism, it is clear that in order to measure the degree of generalization, i.e. to measure the intrinsic error, it is necessary to ensure that the learning process will produce a sequence of θ_l, as l is increased, that converges to θ*, i.e. it is necessary to ensure the consistency of the procedure. If this is not achieved, there is no guarantee that the learning procedure will eventually produce θ*. Once the consistency stage is reached, the learning procedure associated with the LM will necessarily produce a corresponding empirical error that tends to the intrinsic error, and thus it becomes possible to measure the intrinsic error.
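
To make the two error measures concrete, here is a minimal numerical sketch (the sinusoidal reference function, the polynomial estimator, and the sample sizes are illustrative assumptions, not taken from the paper): it computes the empirical error of Eq. 2 on the l available samples and a Monte Carlo estimate of the intrinsic error of Eq. 1 for the same fitted parameter vector θ_l.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(x):
    """Reference function y = g(x); an illustrative choice, not the paper's."""
    return np.sin(2 * np.pi * x)

def empirical_error(theta, x, y):
    """E_E(theta; l) = (1/l) * sum_i (y_i - yhat_i)^2  (Eq. 2)."""
    return np.mean((y - np.polyval(theta, x)) ** 2)

def intrinsic_error(theta, n_mc=100_000):
    """Monte Carlo estimate of E_I(theta) (Eq. 1), with f_X uniform on [0, 1]."""
    x = rng.uniform(0.0, 1.0, n_mc)
    return np.mean((g(x) - np.polyval(theta, x)) ** 2)

for l in (6, 20, 200):
    x = rng.uniform(0.0, 1.0, l)       # l i.i.d. samples drawn from f_X
    y = g(x)                           # noise-free targets y_i = g(x_i)
    theta_l = np.polyfit(x, y, 5)      # theta_l = arg min_theta E_E(theta; l)
    print(l, empirical_error(theta_l, x, y), intrinsic_error(theta_l))
```

For a small l the polynomial interpolates the samples, so the empirical error is essentially zero while the intrinsic error is not; as l grows the two values approach each other, which is the convergence pictured in Fig. 1.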

Figure 1: Intrinsic and empirical error vs. number of samples l. The upper curve represents E_I(θ_l) and the lower curve E_E(θ_l; l), both with θ_l = arg min_θ E_E(θ; l).

Given a LM, Fig. 1 depicts a typical evolution of these two error measures, i.e. the intrinsic and empirical errors, evaluated at θ = θ_l = arg min_θ E_E(θ; l) as the set of samples grows [14, 15, 16]. It can be seen that E_E(θ_l; l) is always less than E_I(θ_l). The reason is that while θ_l minimizes E_E(θ; l), which takes into account a set of l data samples (thus providing constraints for establishing a match), it does not minimize E_I(θ), which considers the same data samples as well as all other possible ones. Only as l → ∞ do both error measures converge to the same value when evaluated at θ = θ_l. Fig. 1 shows that the behavior of E_E(θ_l; l) can be divided into three different stages: an initial one that corresponds to a low number of data points in the set of samples, during which the LM memorizes each of the events such that E_E(θ_l; l) can be made arbitrarily close to zero (the LM shatters the problem space); a second stage, associated with an increased number of samples, where the LM can no longer memorize the samples and E_E(θ_l; l) starts to grow and move towards E_I(θ_l) (the data samples start to drive the LM towards the desired solution); and finally a third stage, where E_E(θ_l; l) is close to E_I(θ_l) and the learning process is finally consistent (minimization of the intrinsic error is now possible). One of the main results of statistical learning theory [13, 14, 15, 16] states that when A ≤ E_I(θ), E_E(θ; l) ≤ B, then for any positive ε

P\left\{ \sup_\theta |E_I(\theta) - E_E(\theta; l)| > \varepsilon \right\} \le \min\left\{ 1,\; 4 \exp\!\left( h\left(\ln\frac{2l}{h} + 1\right) - \frac{(\varepsilon - 1/l)^2}{(B - A)^2}\, l \right) \right\}    (3)

which bounds the probability of the worst, i.e. supremum over θ, divergence between E_I(θ) and E_E(θ; l) [14, 15, 16]. The constant h is called the VC dimension of the LM. The exponent of the exponential function on the right side of Eq. 3 has two terms. The first one grows logarithmically in l, and the second one decreases linearly in l. What is important to notice is that if h < ∞, the second term takes over as l grows and drives the exponential to zero. The condition expressed in Eq. 3 confirms that the convergence process is divided into the three different stages mentioned before: a first one where the probability of divergence keeps close to one (memorization stage), a second one that signals the onset of the exponential effect (transition stage), and a final stage where the probability of divergence between the two errors is close to 0 and the LM is in a position where it can generalize (consistency stage). It is important to stress that this result states that a LM with a finite VC dimension will achieve its optimal configuration, in the sense that it will converge towards producing the minimum intrinsic error E_I(θ*) as the number of samples grows, but it does not guarantee that the intrinsic error will be zero. In other words, this result establishes the conditions under which a LM will do its best, but it does not ensure that the LM will do the best, i.e. achieve an intrinsic error equal to zero.
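
A quick numerical reading of this bound helps to see the three stages. The following sketch (illustrative only: the values of h, ε, and B − A are assumptions, not quantities from the paper) evaluates the right-hand side of Eq. 3 for a growing number of samples l.

```python
import numpy as np

def vc_bound(l, h, eps, b_minus_a=1.0):
    """Right-hand side of Eq. 3:
    min{1, 4*exp( h*(ln(2l/h) + 1) - ((eps - 1/l)^2 / (B - A)^2) * l )}."""
    exponent = (h * (np.log(2.0 * l / h) + 1.0)
                - ((eps - 1.0 / l) ** 2 / b_minus_a ** 2) * l)
    return min(1.0, 4.0 * np.exp(exponent))

# Illustrative parameters (assumed): a VC dimension of 50 and a tolerance of 0.1.
h, eps = 50, 0.1
for l in (100, 1_000, 10_000, 50_000, 100_000):
    print(f"l = {l:>7d}   bound = {vc_bound(l, h, eps):.3e}")
```

With these numbers the bound is trivial (equal to 1) up to several tens of thousands of samples, covering the memorization and transition stages, and then collapses exponentially towards zero, which corresponds to the consistency stage.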

In general terms, finiteness of the VC dimension provides the necessary and sufficient condition for distribution-independent consistency, i.e. convergence of the empirical error to the intrinsic error, which is a necessary step in order to allow the LM to find a combination of parameters that minimizes the intrinsic error. This result is the cornerstone of statistical learning theory, and its strength rests on being a distribution-independent result that guarantees exponential convergence even in a worst-case scenario.

3 The Incremental Learning Algorithm

Consider a student who has to learn a certain topic. If this is an efficient student, he certainly does not go to the library, check out all the books that contain material related to the topic, and study all of them. Instead, the student first gets a primer on the subject, reads it, and solves the problems contained in it. Then he checks out another book, ignores all the material he has already mastered, and focuses on learning whatever new material there is in the new book. He continues doing so until he cannot find new information or problems he does not know how to handle. It is then, and only then, that the student can be sure that he has mastered the topic. Two things can be observed about the learning process used by the student:

1. The student focuses on learning only the information that brings new aspects of the problem to his knowledge, and ignores the rest.

2. A telltale sign that the student is getting closer to mastering the topic is the increasing difficulty in finding new material.

A learning method inspired by the example cited above is described in the flow diagram shown in Fig. 2. This algorithm will be termed the incremental learning algorithm in future discussion. In the incremental learning algorithm, the term training event refers to the execution of all the steps in the algorithm from the moment the empirical error satisfies E_E(θ; l) > δ and the algorithm branches towards the Train LM step, until the empirical error complies with E_E(θ; l) < δ and the algorithm branches towards the Set l = l + 1 step. In this incremental learning algorithm, the threshold δ represents the bound on the intrinsic error that is deemed acceptable by the user. One may note some similarities between the incremental learning algorithm and the classic perceptron learning rule [5]. However, there is an important difference between them: in the perceptron learning rule each training event only uses the sample that made the system fail. In the perceptron learning rule there is no notion of a training set at all, and training is done only with the last sample that was processed by the system. On the other hand, the incremental learning rule presented above uses a training set composed of all the samples previously processed by the system. The analysis that follows proves that this seemingly minor difference is very important. The following theorem proves that under certain conditions the incremental learning algorithm behaves like the efficient student in the motivating example discussed above.

Theorem 1. In every LM trained with the incremental learning algorithm, right after the training event that started when the l-th sample was examined ends,

P\left\{ E_I(\theta_l) > \delta \right\} \le \min\left\{ 1,\; 4 \exp\!\left( h\left(\ln\frac{2l}{h} + 1\right) - \frac{(\delta - E_E(\theta_l; l) - 1/l)^2}{(B - A)^2}\, l \right) \right\}    (4)

where θ_l is the vector of parameters obtained at the end of the training event.

Figure 2: Flow diagram of the incremental learning algorithm. (Start → read the training data with m samples and the threshold δ → create the initial training set by extracting l < m samples → train the LM until E_E(θ, l) < δ → set l = l + 1 and increment the training set by adding a new sample → if E_E(θ, l) > δ, go back to the Train LM step; otherwise, while l < m, keep adding samples → return the LM.)
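
The flow of Fig. 2 can be written compactly as the loop below. This is a minimal sketch, not the authors' implementation: the LM is abstracted into two caller-supplied functions, train_fn (assumed to train until the empirical error on the current set drops below δ, i.e. the Train LM box together with its inner loop) and error_fn (the empirical error E_E(θ, l)), and the one-dimensional polynomial toy at the bottom exists only to make the sketch runnable.

```python
import numpy as np

def incremental_train(samples, delta, l0, train_fn, error_fn):
    """Sketch of the incremental learning algorithm of Fig. 2.

    samples  : sequence of (x, y) pairs, m in total
    delta    : acceptable empirical-error threshold
    l0       : initial number of samples extracted into the training set (l0 < m)
    train_fn : (model, train_set) -> model, the 'Train LM' box
    error_fn : (model, train_set) -> float, the empirical error E_E(theta, l)
    """
    train_set = list(samples[:l0])              # create set of samples with l < m
    model = train_fn(None, train_set)           # first training event
    events = 1
    for x, y in samples[l0:]:                   # set l = l + 1 ...
        train_set.append((x, y))                # ... by adding one new sample
        if error_fn(model, train_set) > delta:
            model = train_fn(model, train_set)  # training event: retrain on the whole set
            events += 1
        # otherwise the new sample demands no retraining and is effectively skipped
    return model, events

# Toy usage (illustrative assumptions): a degree-9 polynomial LM learning sin(2*pi*x).
rng = np.random.default_rng(1)
xs = rng.uniform(-0.5, 0.5, 400)
data = list(zip(xs, np.sin(2 * np.pi * xs)))

def poly_train(model, ts):
    return np.polyfit([x for x, _ in ts], [y for _, y in ts], 9)

def poly_error(model, ts):
    return float(np.mean([(y - np.polyval(model, x)) ** 2 for x, y in ts]))

lm, n_events = incremental_train(data, delta=1e-4, l0=10,
                                 train_fn=poly_train, error_fn=poly_error)
print(f"training events: {n_events} out of {len(data)} samples examined")
```

With a model that can generalize, the training events follow one another at the beginning and then stop being triggered, which is the behavior that Theorem 1 quantifies.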

Proof: As explained before (see Eq. 3), when A ≤ E_I(θ), E_E(θ; l) ≤ B, then for any positive ε

P\left\{ \sup_\theta |E_I(\theta) - E_E(\theta; l)| > \varepsilon \right\} \le \min\left\{ 1,\; 4 \exp\!\left( h\left(\ln\frac{2l}{h} + 1\right) - \frac{(\varepsilon - 1/l)^2}{(B - A)^2}\, l \right) \right\}    (5)

If the previous expression is evaluated for an arbitrary θ_l, then

P\left\{ |E_I(\theta_l) - E_E(\theta_l; l)| > \varepsilon \right\} \le \min\left\{ 1,\; 4 \exp\!\left( h\left(\ln\frac{2l}{h} + 1\right) - \frac{(\varepsilon - 1/l)^2}{(B - A)^2}\, l \right) \right\}    (6)

Then, it is also true that

P\left\{ E_I(\theta_l) - E_E(\theta_l; l) > \varepsilon \right\} \le \min\left\{ 1,\; 4 \exp\!\left( h\left(\ln\frac{2l}{h} + 1\right) - \frac{(\varepsilon - 1/l)^2}{(B - A)^2}\, l \right) \right\}    (7)

Rearranging the terms on the left side,

P\left\{ E_I(\theta_l) > \varepsilon + E_E(\theta_l; l) \right\} \le \min\left\{ 1,\; 4 \exp\!\left( h\left(\ln\frac{2l}{h} + 1\right) - \frac{(\varepsilon - 1/l)^2}{(B - A)^2}\, l \right) \right\}    (8)

In the incremental learning algorithm, right after a training event finishes, δ > E_E(θ_l; l). Therefore, it is possible to define ε = δ − E_E(θ_l; l). Replacing into the previous expression,

P\left\{ E_I(\theta_l) > \delta \right\} \le \min\left\{ 1,\; 4 \exp\!\left( h\left(\ln\frac{2l}{h} + 1\right) - \frac{(\delta - E_E(\theta_l; l) - 1/l)^2}{(B - A)^2}\, l \right) \right\}    (9)

This theorem proves that if a LM with h < ∞ is trained with the incremental learning algorithm, then right after a training event finishes, the probability that the intrinsic error is above the threshold δ converges to zero exponentially once a certain number of samples has been processed. Because the empirical error is bounded by a shrinking confidence interval centered on the intrinsic error (see Eq. 3), the exponential behavior ends up influencing the empirical error too. Therefore, the probability of a training event getting triggered will exhibit the three different stages mentioned before:

1. A first one, corresponding to a low number of samples, where the probability of divergence between the empirical error E_E(θ_l; l) and the intrinsic error E_I(θ_l) is one, and the probability that E_E(θ_l; l) > δ is also close to 1. Therefore, the probability that a training event gets triggered is close to 1.

2. A second one, associated with an increased number of samples, where the probability that E_E(θ_l; l) > δ starts to decrease, lowering the probability that a training event gets triggered.

3. A last stage where the learning process has become consistent and the empirical error behaves like the intrinsic error. It is during this stage that the probability that E_E(θ_l; l) < δ converges to 1 and no more training events are triggered.

The importance of the previous result rests on the fact that if no more training events are triggered, i.e. the empirical error keeps below δ, it is possible to say that the learning process has become consistent and that it has effectively found a parameter combination that produces an intrinsic error less than δ. Once this can be assessed, it is possible to stop the training process while having high certainty that the intrinsic error of the LM will keep below δ for the future samples that the LM may encounter within the data set, i.e. the LM has achieved a desired level of generalization. Once this point is reached, because the LM has found the parameters that produce an intrinsic error below δ, there is no need to continue testing and looking for examples that make the system fail, and the rest of the set of samples can be discarded.

4 Experiments

In this section the results of an experiment that employed a neural network as the architecture for the LM will be described. It must be emphasized that the architecture selection was only for illustrative purposes, and the present algorithm is applicable to any other architecture that may be selected for configuring the LM. A multilayer perceptron with 2 input neurons, one hidden layer with 50 neurons, another hidden layer with 25 neurons, and 1 output neuron was selected as the LM, and used to learn a 2D sinc,

z = \frac{\sin\!\left(6\pi \sqrt{x^2 + y^2}\right)}{12\pi \sqrt{x^2 + y^2}}.

Initialization of this neural network was performed using a scheme recommended in [7], and the network was trained with the RPROP algorithm [9, 10], which is one of the algorithms currently implemented in the Matlab Neural Networks Package. Since our interest in this work is to demonstrate the ability of the present incremental learning algorithm in testing the generalization level, the specific training procedure selected for the neural network architecture is only illustrative. The set of samples used for training was obtained by randomly sampling within the interval [−0.5, 0.5] according to a uniform density function in order to obtain x and y, and generating the corresponding outputs z using the 2D sinc function. In order to obtain an estimate of the average behavior of the LM, 20 different sets of initial conditions and 20 different sets of samples were employed. The LM was trained using 20 different combinations of sets of initial conditions and sets of samples, making sure that no set of initial conditions and no set of samples was used more than once. Training was done using the incremental learning algorithm presented above, by setting an intrinsic error threshold δ = 10⁻⁴ (set arbitrarily) and an initial training-set size l = 10 (determined experimentally). The reference 2D sinc used in the training process is shown in Fig. 3. A 2D sinc generated after one of the training runs is shown in Fig. 4. The number of samples examined by the incremental learning algorithm between individual training events is shown in Fig. 5. The estimated probability of a training event getting triggered is shown in Fig. 6. This probability is calculated by dividing the number of times that the k-th training event happened by the number of different sets of initial conditions (which in this experiment is 20). Both plots clearly show that fewer than 30 training events sufficed to make all 20 training runs reach the desired consistency stage.
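
A rough outline of this experiment in code is given below. It is a hedged sketch, not the original Matlab setup: scikit-learn's MLPRegressor with the Adam solver stands in for the RPROP-trained network of [9, 10], the initialization scheme of [7] is replaced by the library default, and the total sample budget m is an assumption.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)

def sinc2d(x, y):
    """Reference 2D sinc: z = sin(6*pi*r) / (12*pi*r), with r = sqrt(x^2 + y^2)."""
    r = np.sqrt(x ** 2 + y ** 2)
    r = np.where(r == 0.0, 1e-12, r)        # guard the (measure-zero) origin
    return np.sin(6 * np.pi * r) / (12 * np.pi * r)

m = 2000                                     # assumed sample budget, not from the paper
X = rng.uniform(-0.5, 0.5, size=(m, 2))      # uniform sampling of (x, y)
z = sinc2d(X[:, 0], X[:, 1])

delta, l = 1e-4, 10                          # threshold and initial training-set size
net = MLPRegressor(hidden_layer_sizes=(50, 25),   # two hidden layers, as in the paper
                   solver="adam",                 # substitute for RPROP
                   max_iter=5000, tol=1e-7,
                   warm_start=True, random_state=0)

net.fit(X[:l], z[:l])                        # first training event on the initial set
events = 1
for _ in range(l, m):
    l += 1                                   # add one new sample to the training set
    if np.mean((net.predict(X[:l]) - z[:l]) ** 2) > delta:
        net.fit(X[:l], z[:l])                # training event: retrain on the whole set
        events += 1
print(f"training events: {events} over {l} samples")
```

Repeating the run with several independent seeds and averaging the indicator of whether the k-th training event occurred gives the kind of estimate plotted in Fig. 6; whether a particular run reaches the consistency stage at all depends on the data budget and the optimizer, as discussed in Section 5.
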
5 Discussion

The experiment of learning the 2D sinc function outlined above demonstrated that the probability of a training event getting triggered exhibits three stages, as expected: a first one where training events follow continuously, one after the other; a transition zone where the probability of triggering a training event starts to decrease; and the last one, where the probability that the empirical error exceeds the defined threshold converges to zero.

Figure 3: Reference 2D sinc used in the training process.

Figure 4: Output generated by the LM after training (2D sinc).

Figure 5: Number of samples skipped as a function of the number of training events (2D sinc).

Figure 6: Estimated probability of triggering a training event as a function of the training event number (2D sinc).

While the above experiment illustrates a successful application of the present algorithm, several other experiments that we have conducted indicate that, under some conditions, the incremental learning method previously described may not show signs of reaching the consistency stage, i.e. the third stage in the evolution of the learning process. There are two reasons that explain this behavior:

1. The LM keeps memorizing the data, the number of elements in the set of samples increases constantly, and the probability that a training event happens stays high. This happens when there are not enough data samples or the VC dimension of the LM is not finite.

2. The minimum value of the intrinsic error cannot comply with E_I(θ*) < δ. In this case the LM starts a training event but never finishes it. This can happen if δ is too low, or if the initial conditions, learning procedure, or stopping conditions impede the LM from reaching empirical error values below those it should reach. This happens when the LM falls into a local minimum and is not able to escape from it.

While the first cause is relatively simple to remedy (it suffices to use more samples or a different architecture), the second is more complex because it is associated with problems not well resolved for arbitrary LMs. Even though higher values of δ could be used, there is no sure way of avoiding the effects of the starting conditions, the learning procedure, or the stopping rule when it comes to getting stuck at a local minimum.

6 Conclusions

The method presented in this paper provides a practical way of determining whether a training process has attained the consistency stage or not. As a consequence, it is possible to determine whether the training process, which is focused on minimizing the empirical error, is really minimizing the intrinsic error or not. Therefore it is possible to establish whether an LM has attained an acceptable degree of generalization or not. Once this is determined to be true, it is possible to stop the training process and safely ignore all the other data points in the set of samples. Also, thanks to the theorem proven in this work, it is possible to say that the behavior seen in the experiments should also be observed in any LM. Therefore, the present results hold for any architecture and make the incremental learning algorithm a valid approach for determining the degree of generalization in a general setting, using the set of samples in an efficient manner. A particularly desirable aspect of this method is that whether the LM has reached a good level of generalization or not can be deduced from its learning behavior, and there is no need for analytical methods or probe sets in order to test this. There is no need to find the VC dimension of the LM in order to check whether it has generalized, or to determine the number of samples needed to do so: if the probability that a training event gets triggered goes to zero, then the system has already reached the consistency stage and the empirical error is a good measure of its level of generalization.

7 Acknowledgements

Special thanks to Dr. Mark Neifeld for his comments on how to improve this work.

References

[1] Baum, E.B., Neural Net Algorithms That Learn in Polynomial Time from Examples and Queries, IEEE Transactions on Neural Networks 2(1) (1991) 5-19.
[2] Cachin, C., Pedagogical Pattern Selection Strategies, Neural Networks 7(1) (1994) 175-181.
[3] Cataltepe, Z., Abu-Mostafa, Y.S. and Magdon-Ismail, M., No Free Lunch for Early Stopping, Neural Computation 11 (1999) 995-1009.
[4] Franco, L. and Cannas, S.A., Generalization and Selection of Samples in Feedforward Neural Networks, Neural Computation 12 (2000) 2405-2426.
[5] Haykin, S.S., Neural Networks: A Comprehensive Foundation, Prentice Hall (1998).
[6] Narendra, K. and Thathachar, M.A.L., Learning Automata: An Introduction, Prentice Hall (1989).
[7] Nguyen, D. and Widrow, B., Improving the Learning Speed of 2-Layer Neural Networks by Choosing Initial Values of the Adaptive Weights, Proceedings of the IJCNN (1990).
[8] Ralaivola, L. and d'Alché-Buc, F., Incremental Support Vector Machine Learning: A Local Approach, Proceedings of the ICANN, Vienna, Austria (2001).
[9] Riedmiller, M. and Braun, H., A Direct Adaptive Method for Faster Backpropagation Learning: The RPROP Algorithm, Proceedings of the IEEE International Conference on Neural Networks, San Francisco, USA (1993).
[10] Riedmiller, M., RPROP: Description and Implementation Details, University of Karlsruhe (1994).
[11] Sundareshan, M.K. and Condarcure, T.A., Recurrent Neural-Network Training by a Learning Automaton Approach for Trajectory Learning and Control System Design, IEEE Transactions on Neural Networks 9(3) (1998) 354-368.
[12] Tsetlin, M., On the Behavior of Finite Automata in Random Media, Automation and Remote Control 22 (1961) 1210-1219.
[13] Vapnik, V., Levin, E. and Le Cun, Y., Measuring the VC-Dimension of a Learning Machine, Neural Computation 6 (1994) 851-876.
[14] Vapnik, V.N., The Nature of Statistical Learning Theory, Springer (1995).
[15] Vapnik, V.N., Statistical Learning Theory, John Wiley and Sons (1998).
[16] Vapnik, V.N., An Overview of Statistical Learning Theory, IEEE Transactions on Neural Networks 10(5) (1999) 988-999.
[17] Varshavskii, V.I. and Vorontsova, I.P., On the Behavior of Stochastic Automata with Variable Structure, Automation and Remote Control 24 (1963) 327-333.
[18] Zegers, P., Reconocimiento de Voz Utilizando Redes Neuronales [Speech Recognition Using Neural Networks], Engineer Thesis, Pontificia Universidad Católica de Chile, Chile (1992).
[19] Zegers, P. and Sundareshan, M.K., Optimal Tailoring of Trajectories, Growing Training Sets, and Recurrent Networks for Spoken Word Recognition, Proceedings of the ICNN, Anchorage, USA (1998).