A Logarithmic Neural Network Architecture for Unbounded Non-Linear Function Approximation


J. Wesley Hines
Nuclear Engineering Department
The University of Tennessee
Knoxville, Tennessee 37996
hines@utkux.utk.edu

ABSTRACT

Multi-layer feedforward neural networks with sigmoidal activation functions have been termed "universal function approximators". Although these networks can approximate any continuous function to a desired degree of accuracy, the approximation may require an inordinate number of hidden nodes and is only accurate over a finite interval. These shortcomings arise because the standard multi-layer perceptron (MLP) architecture is not well suited to unbounded non-linear function approximation. A new architecture incorporating a logarithmic hidden layer proves to be superior to the standard MLP for unbounded non-linear function approximation. This architecture uses a percentage-error objective function and a gradient descent training algorithm. Non-linear function approximation examples are used to show the increased accuracy of this new architecture over both the standard MLP and the logarithmically transformed MLP.

1 Introduction

Neural networks are commonly used to map an input vector to an output vector. This mapping can be used for classification, autoassociation, time-series prediction, function approximation, or several other processes. The process considered in this paper is unbounded non-linear function approximation. Function approximation is simply the mapping of a domain (x) to a range (y). In this case the domain is represented by a real-valued vector and the range is a single real-valued output.

It has been proven that the standard feedforward multilayer perceptron (MLP) with a single hidden layer can approximate any continuous function to any desired degree of accuracy [1, 2, 4, and others]; thus the MLP has been termed a universal approximator. Haykin [3] gives a very concise overview of the research leading to this conclusion. Although the MLP can approximate any continuous function, the size of the network depends on the complexity of the function and the range of interest. For example, a simple non-linear function such as

f(x) = x_1 x_2    (1)

requires many nodes if the ranges of x_1 and x_2 are very large. Pao [6] expresses the function approximation of MLPs as approximating a function over an interval. A network of a given complexity may approximate the function well for input values in a certain interval, such as X = [0, 1], but may perform poorly for input values far outside that interval. An MLP network that approximates this simple function over all possible inputs would be infinitely large. This example shows the difficulty that MLP networks have performing even simple non-linear function approximation over large ranges. Specifically, they are not well suited to functions that involve multiplication, division, and powers, to name a few.

Pao [5] described a network, called a functional link network, that uses "higher order terms" as the network's inputs. These terms can be any functional combination of the original inputs. This network architecture greatly reduces the size of the network and also reduces the training time; in fact, in many cases no hidden layer is necessary.

The disadvantage is that the functions must be known a priori, or a large set of orthogonal basis functions must be used. If the functions are not known, this network may provide little improvement. The architecture described in this paper is able to determine the functional combinations during training.

2 Logarithmic Network Architecture

Standard MLP networks inherently perform addition, subtraction, and multiplication by constants well. What is needed for non-linear function approximation is a network that can also perform multiplications, divisions, and powers accurately over large ranges of input values. The transformation to a logarithmic space changes multiplication to addition, division to subtraction, and powers to multiplication by a constant. This results in a network that can accurately perform these operations over the entire range of possible inputs.

A logarithmic neuron operates in the logarithmic space. This space can be the natural logarithmic space or any other suitable base. First, the logarithm of each input is taken; then the transformed inputs are multiplied by a weight vector, summed, and operated on by a linear activation function. The inverse logarithm (exponential) is then taken of the output of this function. Equation 2 and Figure 1 show the basic logarithmic neuron, where x is the input vector, w is the weight vector, and f is the standard linear activation function:

y = \exp\left( f\left( \sum_{i=1}^{n} w_i \ln(x_i) \right) \right)    (2)

Fig. 1: Logarithmic Neuron

In a single-layer logarithmic network, the inputs are transformed by the natural logarithm (ln), and the inverse natural logarithm (exp) is applied at the logarithmic neuron's output. This method works well for the cases discussed above, but it works poorly when additions and subtractions are involved. A two-layer network whose first layer is composed of logarithmic neurons and whose output layer is a standard linear neuron remedies this problem. Figure 2 shows this two-layer network.

Fig. 2: Two-Layer Logarithmic Network

The first layer of the logarithmic network transforms the inputs into functional terms of the form

term = x_1^{w_1} x_2^{w_2} \cdots x_n^{w_n}    (3)

These terms are then combined by the second layer of the network to form the function approximation:

f(x) = w_1 \cdot term_1 + w_2 \cdot term_2 + bias    (4)
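To make Equations 2-4 concrete, here is a minimal NumPy sketch of the forward pass of the two-layer logarithmic-linear network. The function name, weight values, and example inputs are illustrative assumptions, not taken from the paper; the sketch assumes natural logarithms and strictly positive inputs.

```python
import numpy as np

def log_linear_forward(x, W_hidden, w_out, bias):
    """Forward pass of the two-layer logarithmic-linear network.

    x        : (n,) vector of strictly positive inputs
    W_hidden : (h, n) hidden-layer weights; row j holds the exponents of term j
    w_out    : (h,) linear output-layer weights
    bias     : scalar output bias
    """
    # Logarithmic hidden layer: term_j = prod_i x_i ** W_hidden[j, i]  (Eqs. 2-3)
    terms = np.exp(W_hidden @ np.log(x))
    # Linear output layer combines the terms (Eq. 4)
    return w_out @ terms + bias

# Illustrative network representing 3*x1*x2**2 + 4/x1 exactly
W_hidden = np.array([[1.0, 2.0],     # term1 = x1 * x2**2
                     [-1.0, 0.0]])   # term2 = 1 / x1
w_out = np.array([3.0, 4.0])
x = np.array([5.0, 7.0])
print(log_linear_forward(x, W_hidden, w_out, 0.0))   # 3*5*49 + 4/5 = 735.8
```

Because the exponents are ordinary weights, one hidden logarithmic neuron per term represents products, quotients, and powers exactly for all positive inputs, which is the property exploited in the remainder of the paper.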

Standard MLP networks could approximate a function of this form, but doing so would require many nodes and the approximation would only be valid over a limited range of inputs. The logarithmic-linear architecture requires only one hidden logarithmic neuron for each term and is valid over all input values.

3 Network Training

Training this network requires two choices: an objective function and a training algorithm. This section explains the choice of objective function and derives the gradient descent based training algorithm.

3.1 Objective Function

The standard MLP is trained to reduce the sum of the squared errors (SSE) between the actual network outputs and the target outputs over all of the training vectors. The SSE objective function is appropriate for classification problems but performs extremely poorly for function approximation problems that cover large ranges. A more suitable objective function for function approximation is the sum of the percentage errors squared (SPES), the sum of the squares of the percentage errors between the target outputs and the actual outputs. It is defined in Equation 5, with m equal to the number of training patterns:

SPES = \frac{1}{2} \sum_{i=1}^{m} \left( \frac{t_i - y_i}{t_i} \right)^2    (5)

A difficulty with the SPES objective function is that the percentage error is undefined when the target value is zero. Therefore, either the training vectors must be chosen so that no target is equal to zero, or a quick fix must be implemented. One such fix is to set those percentage error terms equal to the actual output, which causes the SPES to increase when the actual output does not equal the target value of zero.

3.2 Training Algorithm

The training algorithm attempts to minimize the objective function by changing the network weights. The standard backpropagation algorithm using gradient descent, which became popular when published by Rumelhart and McClelland [7], can easily be adapted to yield the correct weight update for the SPES objective function. The usual derivation uses the chain rule to find the updates with respect to the objective function. Using the chain rule and Haykin's [3] notation, the weight update for the output weights w_kj is found to be

\Delta w_{kj} = -\eta_o \frac{\partial SPES}{\partial w_{kj}} = -\eta_o \frac{\partial SPES}{\partial e_k} \frac{\partial e_k}{\partial y_k} \frac{\partial y_k}{\partial v_k} \frac{\partial v_k}{\partial w_{kj}} = \eta_o \frac{t_k - y_k}{t_k^2} y_j = \eta_o \delta_k y_j    (6)

where:
w_kj = weight from hidden neuron j to output neuron k
e_k = error signal at the output of output neuron k
v_k = internal activity level of output neuron k
y_k = function signal appearing at the output of output neuron k
η_o = the output-layer learning rate
δ_k = the local gradient
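As a check on Equations 5 and 6, the sketch below evaluates the SPES objective and the corresponding output-layer weight update for a linear output neuron. It is a minimal NumPy illustration under the reconstruction above (including the 1/2 factor), not the paper's implementation; all variable names and numbers are invented for the example.

```python
import numpy as np

def spes(targets, outputs):
    """Sum of percentage errors squared (Eq. 5); assumes no zero targets."""
    pct_err = (targets - outputs) / targets
    return 0.5 * np.sum(pct_err ** 2)

def output_weight_update(t_k, y_k, hidden_outputs, eta_o):
    """Gradient-descent step for the output weights w_kj (Eq. 6).

    With a linear output neuron, delta_k = (t_k - y_k) / t_k**2 and
    Delta w_kj = eta_o * delta_k * y_j, where y_j are the hidden-layer outputs.
    """
    delta_k = (t_k - y_k) / t_k ** 2
    return eta_o * delta_k * hidden_outputs

# Illustrative single training pattern
hidden_outputs = np.array([245.0, 0.2])   # term outputs y_j from the hidden layer
w_out = np.array([2.5, 3.0])              # current output weights
t_k = 735.8                               # target output
y_k = w_out @ hidden_outputs              # current network output
print(spes(np.array([t_k]), np.array([y_k])))
print(output_weight_update(t_k, y_k, hidden_outputs, eta_o=0.01))
```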

To calculate the partial derivative of the objective function with respect to the hidden-layer weights w_ji, we again use the chain rule; note that the input layer is indexed by i, the hidden layer by j, and the output layer by k:

\frac{\partial SPES}{\partial w_{ji}} = \frac{\partial SPES}{\partial y_j} \frac{\partial y_j}{\partial v_j} \frac{\partial v_j}{\partial w_{ji}}    (7)

This results in the local gradient term for hidden neuron j,

\delta_j = \exp(v_j) \sum_k \delta_k w_{kj} = y_j \sum_k \delta_k w_{kj}    (8)

and the hidden-layer weight update is

\Delta w_{ji} = \eta_h \delta_j y_i = \eta_h y_j \left( \sum_k \delta_k w_{kj} \right) y_i    (9)

The change in the hidden-layer weights is simply a backpropagation of the gradient terms to the hidden layer through the output layer's weights, taking the logarithmic transfer function into consideration. The hidden-layer errors must be computed this way because there is no desired response for the hidden layer, yet the hidden layer contributes to the output error.

The weight updates are completed in two stages. The first stage updates the output-layer weights with their own variable learning rate (η_o), and the second stage updates the hidden-layer weights with their variable learning rate (η_h). The learning rates increase when a weight update successfully reduces the error and decrease when it does not. Momentum is also included in the training algorithm through the standard formula.

4 Examples

As an example, three hidden neurons were used to approximate the following function:

f(x) = 3 x 1 5 x x 5 + 4 x 1 1    (10)

The resulting weight matrices were

w1 = [  974     59
         54     -4
          5  19948 ]

w2 = [ -1585   4   4113 ]

This gives an approximate equation of

f(x) = 4 x 1 5 x 97 x 5 1 1 + 4 11x    (11)

5 Comparison of Network Architecture Performances

In this section the logarithmic-linear architecture described in this paper is compared with the standard MLP and with the logarithmic transformation network. The logarithmic transformation network is a standard MLP that uses the logarithmic transforms of the input and target vectors as its inputs and targets, which allows it to operate in the logarithmic space. This logarithmic transformation network has several limitations. One is that neither its inputs nor its outputs can be negative, whereas the network described in this paper, which we will now call the logarithmic-linear network, can produce negative outputs. The second and most critical limitation is that it has difficulty approximating functions with more than one term, such as the function of Equation 10.
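For reference, this logarithmic transformation baseline amounts to fitting an ordinary MLP to log-transformed data and inverting the transform at prediction time. The sketch below illustrates the idea using scikit-learn's MLPRegressor as a stand-in for the paper's MLP; the target function, data ranges, network size, and hyperparameters are illustrative assumptions, not the paper's.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Illustrative positive-valued data for a two-input, power-law style target
def target(X):
    return 3.0 * X[:, 0] * X[:, 1] ** 2 + 4.0 / X[:, 0]

X_train = rng.uniform(0.1, 10.0, size=(300, 2))
X_test = rng.uniform(0.1, 10.0, size=(3000, 2))

# Logarithmic transformation network: train on ln(inputs) -> ln(targets).
# Requires all inputs and targets to be strictly positive.
mlp = MLPRegressor(hidden_layer_sizes=(13,), activation="tanh",
                   solver="lbfgs", max_iter=5000)
mlp.fit(np.log(X_train), np.log(target(X_train)))

y_pred = np.exp(mlp.predict(np.log(X_test)))          # undo the log transform
pct_err = 100.0 * (target(X_test) - y_pred) / target(X_test)
print(f"mean percentage error: {np.mean(np.abs(pct_err)):.2f}%")
```

Training in log space turns multiplicative structure into additive structure, but sums of terms in the original space, as in the example that follows, are not simplified by the transform; this is the limitation noted above.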

Consider a function whose outputs are always positive:

f(x) = 3 x + 1 5 x x 5 + 4 x 1 1    (12)

For the training and test data, x_1 and x_2 ranged over a wide interval, producing outputs that range from 1×10^-5 to 4×10^6. The training set consists of 35 randomly chosen vectors that exercise most of the input space; the test set consists of similarly chosen vectors drawn from the same intervals and is ten times larger than the training set.

The standard MLP with a hyperbolic tangent hidden layer and a linear output layer was not able to reduce the mean percentage error below 1% for hidden layers ranging from 5 nodes upward. This inability to train shows that the MLP architecture is not suited to non-linear function approximation over large ranges.

A logarithmic transformation network with 13 hidden nodes was trained to an average error goal of 1%, and its training-set errors are plotted in Figure 3. A test set of random data from the same training intervals, ten times larger than the training set, was then generated; its errors are plotted in Figure 4.

Fig. 3: Logarithmic Transformation Network Training Set Errors
Fig. 4: Logarithmic Transformation Network Test Set Errors

The same training set was used to train a logarithmic-linear network to the 1% error level, resulting in the following weight matrices; its training-set errors are plotted in Figure 5. Again, a test set ten times larger than the training set was generated on the same intervals, and its errors are plotted in Figure 6.

w1 = [   163   4433
       19946    177
              19961 ]

w2 = [ 1819   3747   4854 ]

Fig. 5: Logarithmic-Linear Training Set Errors
Fig. 6: Logarithmic-Linear Test Set Errors
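The accuracy criterion used throughout this comparison is the mean percentage error over a data set, and the error plots in Figures 3-6 show the per-pattern percentage errors. A small helper along these lines (a sketch, not the paper's code) is enough to reproduce both quantities from a trained model's predictions.

```python
import numpy as np

def percentage_errors(targets, outputs):
    """Per-pattern percentage errors (assumes no zero targets)."""
    return 100.0 * (targets - outputs) / targets

def mean_percentage_error(targets, outputs):
    """Mean absolute percentage error over a data set."""
    return np.mean(np.abs(percentage_errors(targets, outputs)))

# Illustrative usage with made-up predictions
targets = np.array([7.4e-3, 1.2e1, 8.8e2, 4.0e6])
outputs = np.array([7.5e-3, 1.2e1, 8.6e2, 4.1e6])
print(percentage_errors(targets, outputs))       # per-pattern errors, as in the figures
print(mean_percentage_error(targets, outputs))   # summary accuracy figure
```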

It is evident that both the logarithmic transformation network and the logarithmic-linear network can learn the training set to a high degree of accuracy, but the logarithmic-linear network generalizes much better. This is because its architecture is better suited to the structure of this type of non-linear function. It should also be noted that the logarithmic-linear network used only three hidden nodes, while the logarithmic transformation network required 13 hidden nodes to meet the 1% training accuracy requirement.

6 Summary

This paper described a new neural network architecture that is better suited to non-linear function approximation than the standard MLP architecture and, in some cases, the logarithmically transformed MLP. The network uses a logarithmic hidden layer and a linear output layer. It can be trained to perform function approximation over a larger interval than both the standard MLP network and the logarithmic transformation network, and it can do so with significantly fewer neurons. The architecture can be thought of as an extension of the functional link network: it does not require the functions to be defined a priori, but learns them during training.

This architecture does have limitations. The current architecture does not support negative inputs; this may not be a serious limitation in real-world problems, where measured values are usually positive. Secondly, the number of terms must be known a priori, or several networks with different numbers of hidden nodes must be constructed, trained, and their results compared. This limitation is common to most neural network architectures.

Acknowledgments

The funding for this research, provided by the Tennessee Valley Authority under contract number TV-9373V, is gratefully acknowledged.

References

[1] Cybenko, G., 1989, "Approximation by Superpositions of a Sigmoidal Function," Mathematics of Control, Signals, and Systems, 2, pp. 303-314.
[2] Funahashi, K., 1989, "On the Approximate Realization of Continuous Mappings by Neural Networks," Neural Networks, 2, pp. 183-192.
[3] Haykin, S., 1994, Neural Networks: A Comprehensive Foundation, Macmillan, New York.
[4] Hornik, K., Stinchcombe, M., and White, H., 1989, "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, 2, pp. 359-366.
[5] Pao, Y.-H., 1989, Adaptive Pattern Recognition and Neural Networks, Addison-Wesley, Reading, MA.
[6] Pao, Y.-H., Neural Networks and the Control of Dynamic Systems, IEEE Educational Video.
[7] Rumelhart, D. E., and McClelland, J. L., eds., 1986, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, MIT Press, Cambridge, MA.