Why feed-forward networks are in a bad shape


Patrick van der Smagt, Gerd Hirzinger
Institute of Robotics and System Dynamics
German Aerospace Center (DLR Oberpfaffenhofen)
82230 Wessling, GERMANY
email smagt@dlr.de

[In: L. Niklasson, M. Bodén, and T. Ziemke, editors, Proceedings of the 8th International Conference on Artificial Neural Networks, pages 159-164. Springer Verlag, 1998.]

Abstract

It has often been noted that the learning problem in feed-forward neural networks is very badly conditioned. Although, generally, the special form of the transfer function is taken to be the cause of this condition, we show that it is caused by the manner in which neurons are connected. By analyzing the expected values of the Hessian in a feed-forward network it is shown that, even in a network where all the learning samples are well chosen and the transfer function is not in its saturated state, the system has a non-optimal condition. We subsequently propose a change in the feed-forward network structure which alleviates this problem. We finally demonstrate the positive influence of this approach.

1 Introduction

It has long been known [1, 3, 4, 6] that learning in feed-forward networks is a difficult problem, and that this is intrinsic to the structure of such networks. The cause of the learning difficulties is reflected in the Hessian matrix of the learning problem, which consists of the second derivatives of the error function. When the Hessian is very badly conditioned, the error function has a very strongly elongated shape; indeed, conditions of $10^{20}$ are no exception in feed-forward network learning, and in fact mean that the problem exceeds the representational accuracy of the computer. We will show that this problem is caused by the structure of the feed-forward network. An adaptation to the learning rule is shown to improve this condition.

2 The learning problem

We define a feed-forward neural network with a single layer of hidden units (where $i$ indicates the $i$th input, $h$ the $h$th hidden unit, $o$ the $o$th output, and $\vec{x}$ is an input vector $(x_1, x_2, \ldots, x_N)$):

$$\mathcal{N}(\vec{x}; W)_o = \sum_h w_{ho}\, s\Big(\sum_i w_{ih} x_i + \theta_h\Big). \qquad (1)$$
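For concreteness, here is a minimal NumPy sketch of Eq. (1); the array names (`W_ih`, `theta`, `w_ho`) and the tanh transfer function are assumptions for illustration, not part of the paper.

```python
import numpy as np

def forward_N(x, W_ih, theta, w_ho, s=np.tanh):
    """Eq. (1): N(x; W)_o = sum_h w_ho * s(sum_i w_ih * x_i + theta_h).

    x     : (N,)   input vector
    W_ih  : (H, N) weights from input to hidden units
    theta : (H,)   hidden biases
    w_ho  : (O, H) weights from hidden to output units
    s     : hidden transfer function (tanh assumed here)
    """
    a_h = s(W_ih @ x + theta)   # hidden unit activations a_h
    return w_ho @ a_h           # linear output units
```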

The total number of weights $w_{ij}$ is $n$. W.l.o.g. we will assume $N_o = 1$ in the sequel. The learning task consists of minimizing an approximation error

$$E_{\mathcal{N}}(W) = \frac{1}{2} \sum_{p=1}^{P} \left\| \mathcal{N}(\vec{x}^{(p)}; W) - \vec{y}^{(p)} \right\|^2 \qquad (2)$$

where $P$ is the number of learning samples and $\vec{y}$ is an output vector $(y_1, y_2, \ldots, y_{N_o})$. $\|\cdot\|$ is usually taken to be the $L_2$ norm. The error $E$ can be written as a Taylor expansion around $W_0$:

$$E_{\mathcal{N}}(W + W_0) = E(W_0) - J^T W + \tfrac{1}{2} W^T H W + r(W_0). \qquad (3)$$

Here, $H$ is the Hessian and $J$ is the Jacobian of $E$ around $W_0$. In the case that $E$ is quadratic, the rest term is 0 and a Newton-based second-order search technique can then be used to find the extremum in the ellipsoid $E$. A problem arises, however, when the axes of this ellipsoid are very different. When the ratio of the lengths of the largest and smallest axes is very large (close to the computational precision of the computer used in optimization), the computation of the exact local gradient will be imprecise, such that the system is difficult to minimize. Now, since $H$ is a real symmetric matrix, its eigenvectors span an orthogonal basis, and the directions of the axes of $E$ are equal to the eigenvectors of $H$. Furthermore, the corresponding eigenvalues are the square roots of the lengths of the axes. Therefore, the condition number of the Hessian matrix, which is defined as the ratio of the largest and smallest singular values (and therefore, for a positive definite matrix, of the largest and smallest eigenvalues), determines how well the error surface $E$ can be minimized.

2.1 The linear network

In the case that $s(\cdot)$ is the identity, $\mathcal{N}(W)$ is equivalent to a linear feed-forward network $\mathcal{N}(W')$ without hidden units, and we can write

$$E_{\mathcal{N}}(W' + W'_0) = E(W'_0) - J^T W' + \tfrac{1}{2} W'^T H W'. \qquad (4)$$

In the linear case the Hessian reduces to (leaving the index $(p)$ out):

$$H_{jk} = \frac{1}{P} \sum_p x_j x_k, \qquad 1 \le j, k \le n = N + 1, \qquad (5)$$

where, for notational simplicity, we set $x_{N+1} \equiv 1$. In this case $H$ is the covariance matrix of the input patterns. It is instantly clear that $H$ is a positive definite symmetric matrix. Le Cun et al. [1] show that, when the input patterns are uncorrelated, $H$ has a continuous spectrum of eigenvalues $\lambda$, with $\lambda_- < \lambda < \lambda_+$. Furthermore, there is one eigenvalue of multiplicity one and of order $n$, present only in the case that $\langle x_k \rangle \neq 0$. Therefore, the Hessian for a linear feed-forward network is optimally conditioned when $\langle x_k \rangle = 0$.
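The following sketch (an illustration under assumed names and synthetic data, not from the paper) builds the linear-network Hessian of Eq. (5), including the constant input $x_{N+1} = 1$, and compares its condition number for raw and mean-centered inputs, matching the behaviour just described.

```python
import numpy as np

def linear_hessian(X):
    """Eq. (5): H_jk = (1/P) sum_p x_j x_k, with x_{N+1} = 1 appended."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # constant input x_{N+1} = 1
    return Xb.T @ Xb / len(X)

def condition(H):
    """Ratio of the largest to the smallest singular value of H."""
    sv = np.linalg.svd(H, compute_uv=False)
    return sv.max() / sv.min()

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.0, size=(10_000, 20))   # inputs with <x_k> != 0
print(condition(linear_hessian(X)))                      # badly conditioned
print(condition(linear_hessian(X - X.mean(axis=0))))     # close to 1 after centering
```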

The reason for this behaviour is very understandable. As $P \to \infty$, the summation of uncorrelated elements $x_i x_j$ will cancel out when $\langle x \rangle = 0$, except where $i = j$, i.e., on the diagonal of the covariance matrix. In the limit these diagonal elements go towards the variance of the input data, $\sigma^2(x_i) = \frac{1}{P}\sum_p \big(x_i^{(p)}\big)^2$. From Gerschgorin's theorem we know that the eigenvalues of a diagonal matrix equal the elements on the diagonal.

2.2 Multi-layer feed-forward networks

In the case that a nonlinear feed-forward network is used, i.e., $s(\cdot)$ is a nonlinear transfer function, the rest term in Eq. (3) cannot be neglected in general. However, it is a well-known fact [2] that the rest term $r(\cdot)$ is negligible close enough to a minimum. From the definition of $E_{\mathcal{N}}$ we can compute that

$$H_{j,k} = \frac{1}{P} \sum_p \left( [\mathcal{N}(\vec{x}) - y]\, \frac{\partial^2 \mathcal{N}(\vec{x})}{\partial w_j \partial w_k} + \frac{\partial \mathcal{N}(\vec{x})}{\partial w_j}\, \frac{\partial \mathcal{N}(\vec{x})}{\partial w_k} \right). \qquad (6)$$

We investigate the properties of $H_{j,k}$ of Eq. (6). The first term of the Hessian has a factor $[\mathcal{N}(\vec{x}) - y]$. Close to a minimum, this factor is close to zero such that it can be neglected. Also, when summed over many learning samples, this factor equals the random measurement error and cancels out in the summation. Therefore we can write

$$H_{j,k} \approx \frac{1}{P} \sum_p \frac{\partial \mathcal{N}(\vec{x})}{\partial w_j}\, \frac{\partial \mathcal{N}(\vec{x})}{\partial w_k}. \qquad (7)$$

2.3 Properties of the Hessian

Every Hessian of a feed-forward network with one layer of hidden units can be partitioned into four parts, depending on whether the derivative is taken with respect to a weight from input to hidden or from hidden to output unit. We take the simplification of Eq. (7) as a starting point. The partial derivatives of $\mathcal{N}$ can be computed to be

$$\frac{\partial \mathcal{N}(\vec{x})}{\partial w_{ho}} = s\Big(\sum_i w_{ih} x_i + \theta_h\Big) \equiv a_h, \qquad \frac{\partial \mathcal{N}(\vec{x})}{\partial w_{ih}} = x_i\, w_{ho}\, s'(a_h) \equiv x_i\, w_{ho}\, a'_h,$$

where $s'(\cdot)$ is the derivative of $s(\cdot)$. We can write the Hessian as a block matrix

$$H = \begin{pmatrix} {}^{00}H & {}^{01}H \\ {}^{10}H & {}^{11}H \end{pmatrix}$$

with

$${}^{00}H = \frac{1}{P}\sum_p (x_{i_1} w_{h_1 o}\, a'_{h_1})(x_{i_2} w_{h_2 o}\, a'_{h_2}), \qquad {}^{10}H = \frac{1}{P}\sum_p (x_i w_{h_1 o}\, a'_{h_1})\, a_{h_2},$$
$${}^{01}H = \frac{1}{P}\sum_p (x_i w_{h_2 o}\, a'_{h_2})\, a_{h_1}, \qquad {}^{11}H = \frac{1}{P}\sum_p a_{h_1} a_{h_2}$$

(note that ${}^{10}H^T = {}^{01}H$).
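To make the block partition of Eq. (7) concrete, the following sketch (ours, assuming tanh hidden units and a single output unit) stacks the per-sample gradients $\partial\mathcal{N}/\partial w$ and slices the resulting outer-product Hessian into the four blocks ${}^{00}H$, ${}^{01}H$, ${}^{10}H$, ${}^{11}H$.

```python
import numpy as np

def hessian_blocks(X, W_ih, theta, w_o):
    """Eq. (7): H ~ (1/P) sum_p (dN/dw)(dN/dw)^T, split into the four blocks
    of section 2.3 (single output unit, s = tanh assumed).

    X: (P, N) inputs, W_ih: (H, N), theta: (H,), w_o: (H,)
    """
    P = len(X)
    A = np.tanh(X @ W_ih.T + theta)      # (P, H) hidden activations a_h
    dA = 1.0 - A**2                      # a'_h = s'(.) for s = tanh
    g_ih = ((dA * w_o)[:, :, None] * X[:, None, :]).reshape(P, -1)  # dN/dw_ih
    g_ho = A                             # dN/dw_ho = a_h
    g = np.hstack([g_ih, g_ho])          # (P, H*N + H) per-sample gradients
    H = g.T @ g / P
    k = g_ih.shape[1]
    return {"00": H[:k, :k], "01": H[:k, k:],
            "10": H[k:, :k], "11": H[k:, k:]}
```

For random weights and normal (0, 1) inputs, comparing the mean absolute entries of the "11" and "00" blocks typically reproduces the size imbalance discussed below and reported in Figure 1 (left).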

Assuming that the input samples have a normal (0, 1) distribution, we can analytically compute the expectations and variances of the elements in $H$ by determining the distribution functions of these elements. Figure 1 (left) depicts the expectations and standard deviations of the elements of $H$ for $P = 100$. From the figure it can be seen that, even though the network is in an optimal state, the elements of ${}^{11}H$ are much larger than those of ${}^{00}H$. Naturally, this effect is much stronger when weights from input to hidden units are large, such that the hidden units are close to their saturated state. In that case, $a'_h \approx 0$ and the elements of ${}^{00}H$ (as well as those of ${}^{01}H$) tend to 0.

The centering method as proposed in [1], when also applied to the activation values of the hidden units, ensures an optimal condition for ${}^{11}H$. Schraudolph and Sejnowski [4] have shown that centering in backpropagation further improves the learning problem. Using argumentation similar to Le Cun et al. [1], they furthermore suggest that centering the $\delta_o = y_o - a_o$ as well as the $\delta_h = \sum_o \delta_o w_{ho} s'(a_h)$ improves the condition of $H$. Although this will improve the condition of ${}^{00}H$ and ${}^{01}H$, the problem that the elements of ${}^{00}H$ and ${}^{11}H$ are very different in size remains. We suggest that this approach alone is not sufficient to improve the learning problem.

3 An adapted learning rule

To understand why the elements of $H$ are so different, we have to consider the back-propagation learning rule. First, Saarinen et al. [3] list a few cases in which the Hessian may become (nearly) singular. The listed reasons are associated with the ill character of the transfer functions which are customarily used: the sigmoidal function $s(x)$, which saturates (i.e., the derivative becomes zero) for large input. However, another problem exists: when a network has a small weight leaving a hidden unit, the influence of the weights that feed into this hidden unit is significantly reduced. This touches a characteristic problem in feed-forward network learning: the gradients in the lower-layer weights are influenced by the higher-layer weights.

Why this is so can be seen from the back-propagation learning method, which works as follows. For each learning sample:

1. compute $\delta_o = y_o - a_o$, where $a_o$ is the activation value for output unit $o$;
2. compute $\Delta w_{ho} = \delta_o a_h$, where $a_h$ is the activation for hidden unit $h$;
3. compute $\delta_h = \sum_o \delta_o w_{ho} s'(a_h)$;
4. compute $\Delta w_{ih} = \delta_h x_i = \sum_o \delta_o w_{ho} s'(a_h)\, x_i$.

The gradient is then computed as the summation of the $\Delta w$'s. The gradient for a weight from an input to a hidden unit becomes negligible when $\delta_o$ is small (i.e., the network correctly represents this sample), $x_i$ is small (i.e., the network input is close to 0), $w_{ho}$ is small, or $s'(a_h)$ is small (because $w_{ih}$ is large). The latter two of these cases are undesirable, and lead to paralysis of the weights from input to hidden units.
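For reference, a minimal sketch (ours, assuming a single linear output unit and tanh hidden units) of the four back-propagation steps above; the comment marks where $w_{ho}$ and $s'(\cdot)$ scale the input-to-hidden gradient, which is the source of the paralysis just described.

```python
import numpy as np

def backprop_deltas(x, y, W_ih, theta, w_o):
    """Steps 1-4 for one learning sample (single linear output, s = tanh)."""
    a_h = np.tanh(W_ih @ x + theta)
    a_o = w_o @ a_h                          # output activation
    delta_o = y - a_o                        # step 1
    dw_ho = delta_o * a_h                    # step 2
    # step 3: delta_h is damped both by a small w_ho and by s'(a_h) -> 0
    delta_h = delta_o * w_o * (1.0 - a_h**2)
    dw_ih = np.outer(delta_h, x)             # step 4
    return dw_ho, dw_ih
```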

[Figure 1: (left) The distribution of the elements of $H$ for $P = 100$, plotted against the number of weights $n$; separate curves for ${}^{11}H_{i=j}$, ${}^{00}H_{i=j}$, ${}^{00}H_{i\neq j}$, ${}^{01}H$, and ${}^{11}H_{i\neq j}$. (right) An exemplar linearly augmented feed-forward neural network with one input unit, two hidden units $h_1$, $h_2$, and one output unit.]

In order to alleviate these problems, we propose a change to the learning rule as follows:

$$\Delta w_{ih} = \sum_o \delta_o \big( w_{ho}\, s'(a_h) + a_h \big) x_i = \delta_h x_i + a_h x_i \sum_o \delta_o. \qquad (8)$$

By adding $a_h$ to the middle term, we can solve both paralysis problems. In effect, an extra connection from each input unit to each output unit is created, with a weight value coupled to the weight from the input to hidden unit. The $o$th output of the neural network is now computed as

$$\mathcal{M}(\vec{x}; W)_o = \sum_h w_{ho}\, s\Big(\sum_i w_{ih} x_i + \theta_h\Big) + \sum_h S\Big(\sum_i w_{ih} x_i\Big), \qquad (9)$$

where $dS(x)/dx \equiv s(x)$. In the case that $s(x)$ is the tanh function we find that $S(x) = \log\cosh x$; note that this function asymptotically goes to $|x| - \log 2$ for large $x$. In effect, we add the absolute values of the hidden unit activations to each output unit. We call the new network the linearly augmented feed-forward network. The structure of this network is depicted in figure 1 (right) for one input and output unit and two hidden units.

Analysis of the Hessian's condition. We can compute the approximation error $E'$ for $\mathcal{M}$ similar to (2) and construct the Hessian $H'$ for $\mathcal{M}$. We can relate the quadrants of $H'$ to those of $H$ as follows. First, ${}^{11}H' = {}^{11}H$, and

$${}^{00}H'_{i_1+h_1 N,\; i_2+h_2 N} = {}^{00}H_{\ldots} + \frac{1}{P}\sum_p x_{i_1} a_{h_1} x_{i_2} a_{h_2} \big(1 + w_{h_1 o}\, a'_{h_1} + w_{h_2 o}\, a'_{h_2}\big),$$
$${}^{01}H'_{i+h_1 N,\; h_2} = {}^{01}H_{\ldots} + \frac{1}{P}\sum_p x_i\, a_{h_1} a_{h_2}.$$

Using the same line of argument as in section 2.1 we note that it is important that the $a$'s be centered, i.e., (a) the input values should be centered, and (b) it is advantageous to use a centered hidden unit activation function (e.g., $\tanh(x)$ rather than $1/(1 + e^{-x})$).
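A sketch of the linearly augmented network and the adapted update, under the reading of Eqs. (8)-(9) given above (tanh hidden units, so $S(x) = \log\cosh x$; all function and variable names are ours).

```python
import numpy as np

def S(x):
    """Primitive of tanh: S(x) = log cosh x, so dS/dx = tanh(x)."""
    return np.logaddexp(x, -x) - np.log(2.0)   # numerically stable log cosh

def forward_M(x, W_ih, theta, w_ho):
    """Eq. (9): ordinary network output plus sum_h S(sum_i w_ih x_i),
    added to every output unit (the linearly augmented network)."""
    a_h = np.tanh(W_ih @ x + theta)
    return w_ho @ a_h + np.sum(S(W_ih @ x))

def adapted_dw_ih(x, delta_o, w_ho, a_h):
    """Eq. (8): dw_ih = sum_o delta_o (w_ho s'(a_h) + a_h) x_i
                      = delta_h x_i + a_h x_i sum_o delta_o   (s = tanh)."""
    delta_h = (delta_o[:, None] * w_ho * (1.0 - a_h**2)).sum(axis=0)
    return np.outer(delta_h + a_h * delta_o.sum(), x)
```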

$\mathcal{M}$ and the Universal Approximation Theorems. It has been shown in various publications that the ordinary feed-forward neural network $\mathcal{N}$ can represent any Borel-measurable function with a single layer of hidden units which have sigmoidal or Gaussian activation functions. It can easily be shown [6] that $\mathcal{N}$ and $\mathcal{M}$ are equivalent, such that all representation theorems that hold for $\mathcal{N}$ also hold for $\mathcal{M}$.

4 Examples

The new method has been tested on a few problems. First, XOR classification with two hidden units. Secondly, the approximation of $\sin(x)$ with 3 hidden units and 11 learning samples randomly chosen between 0 and $2\pi$. Third, the approximation of the inverse kinematics plus inverse perspective transform for a 3 DoF robot arm with a camera fixed in the end-effector. The network consisted of 5 inputs, 8 hidden units, and 3 outputs; the 1107 learning samples were gathered using a Manutec R3 robot. All networks were trained with Polak-Ribière conjugate gradient with Powell restarts [5]. All experiments were run 1000 times with different initial weights. The results are illustrated below. Note that, for the chosen problems, $\mathcal{M}$ effectively smoothes away local minima and saddle points (% stuck goes to 0.0). The reported $E$ was measured only over those runs which did not get stuck.

                                            N            M
  XOR     % stuck                           22.4         0.0
          # steps to reach E = 0.0          189.1        65.3
  sin     % stuck                           42.3         0.0
          E after 1000 iterations           29·10^-5     7.3·10^-5
  robot   E after 1000 iterations           5.3·10^-3    1.8·10^-3

References

[1] Y. Le Cun, I. Kanter, and S. A. Solla. Eigenvalues of covariance matrices: Application to neural network learning. Physical Review Letters, 66(18):2396-2399, 1991.
[2] E. Polak. Computational Methods in Optimization. Academic Press, New York, 1971.
[3] S. Saarinen, R. Bramley, and G. Cybenko. Ill-conditioning in neural network training problems. SIAM Journal on Scientific Computing, 14(3):693-714, May 1993.
[4] N. N. Schraudolph and T. J. Sejnowski. Tempering backpropagation networks: Not all weights are created equal. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems, pages 563-569, 1996.
[5] P. van der Smagt. Minimisation methods for training feed-forward networks. Neural Networks, 7(1):1-11, 1994.
[6] P. van der Smagt and G. Hirzinger. Solving the ill-conditioning in neural network learning. In J. Orr, K. Müller, and R. Caruana, editors, Tricks of the Trade: How to Make Neural Networks Really Work. Lecture Notes in Computer Science, Springer Verlag, 1998. In print.